Understanding Synchronization Tracing

Language:

This section explores the synchronization tracing data and explains how to relate it to clock-profiling data.

Go to the Functions view and sort according to inclusive Total CPU Time by clicking the column header Inclusive Total CPU.
Select the do_work() function at the top of the list.
Look at the Called-by/Calls panel at the bottom of the Functions view and note that do_work() is called from two places, and it calls ten functions.

Most often, do_work() is called when a thread to process the data is created, and is shown as called from _lwp_start(). In one case, do_work() calls one single-threaded task called nothreads() after being called from locktest().

The ten functions that do_work() calls represent ten different tasks, and each task uses a different synchronization method that the program executed. In some experiments created from mttest you might see an eleventh function which uses relatively little time to fetch the work blocks for the other tasks. This function fetch_work() is displayed in the Calls panel in the preceding screen shot.

Note that except for the first two of the callees in the Calls panel, all callees show approximately the same amount of time (~10.6 seconds) of Attributed Total CPU.
Switch to the Callers-Callees view.

Callers-Callees view shows the same callers and callees as the Called-by/Calls panel, but it also shows the other metrics that were selected in the Overview page, including Attributed Sync Wait Time.

Look for the two functions lock_global() and lock_local(), and note that they show about the same amount of Attributed Total CPU time, but very different amounts of Attributed Sync Wait Time.
Select the lock_global() function and switch to Source view.

Note that all the Sync Wait time is on the line with the call to pthread_mutex_lock(&global_lock) which has 0 Total CPU Time. As you might guess from the function names, the four threads executing this task all do their work when they acquire a global lock, which they acquire one by one.
Go back to the Functions view and select lock_global(), then click the Filter icon and select Add Filter: Include only stacks containing the selected functions.
Select the Timeline view and you should see the four threads.

The first thread to get the lock (T:17 in the screen shot) works for 2.65 seconds, then gives up the lock. You can see that the thread state bar is green for that thread which means all its time was spent in User CPU Time, with none in User Lock Time. Notice also that the second bar for Synchronization Tracing Call Stacks marked with the show no call stacks for this thread.

The second thread (T:16 in the screen shot) has waited 2.6 seconds in User Lock Time, and then it computes for 2.65 seconds and gives up the lock. The Synchronization Tracing Call Stacks coincide with the User Lock Time.

The third thread (T:15) has waited 5.3 seconds in User Lock Time and it then computes for 2.65 seconds, and gives up the lock.

The last thread (T:14) has waited 7.9 seconds in User Lock Time, and it computes for 2.65 seconds. The total computation was 2.65 x 4 or ~10.6 seconds.

The total synchronization wait was 2.6 + 5.3 + 7.9 or ~15.9 seconds, which you can confirm in the Functions view.
Remove the filter by clicking the X in the Active Filters panel.
Go back to the Functions view, select the function lock_local(), and switch to the Source view.

Note that the Sync Wait Time is 0 on the line with the call to pthread_mutex_lock(&array->lock), line 1043 in the screen shot. This is because the lock is local to the workblock, so there is no contention and all four threads compute simultaneously.

The experiment you looked at was recorded with a calibrated threshold. In the next section, you compare to a second experiment which was recorded with zero threshold when you ran the make command.