Analyzing Program Performance with Sun WorkShop
Tutorial: Using the Sampling Collector and Analyzer
This chapter demonstrates how to use the Sampling Collector and the Sampling Analyzer to understand the performance of three sample programs:
- Example 1: synprog. A simple program that demonstrates various programming constructs, as well as an example of the gprof fallacy
- Example 2: omptest. A program that uses OpenMP parallelization in Fortran
- Example 3: mttest. A program that uses explicit multithreading
In these examples, each of the programs is subjected to the same sequence of performance questions:
- What can I change in my program to improve its performance?
- Program analysis at this level is concerned with higher-level algorithm issues. The compiled code is already as optimized as the compiler can make it; you are now looking for ways to refine the program algorithm itself to make it execute more efficiently.
- What resources is my program using?
- For example, how much CPU time is my program using?
- Where in my program are most of these resources being used?
- For example, is there a particular function where my program is spending most of its time? Which line or instruction in the function accounts for most of the time being spent?
- How did the program arrive at that line or instruction in the execution process?
Once you determine where your program is using the most resources, the Analyzer offers various ways of examining your code so you can determine why.
Note In all these examples, the times shown represent data collected in particular experiments. When you record data, you will see different numbers, because your platform is probably different from the one used to collect the data reported here.
The source code for each of these demonstration programs is included on the distribution. Before you start the tutorial:
1. Make sure the Sun WorkShop directory has been added to your path:
/opt/SUNWspro/bin
2. Copy the files in one or more of the demo directories to a work area of your own, and do a make to construct the demo program.
Example 1: synprog
The synprog program is a simple program that demonstrates a number of programming constructs, each of which exhibits some interesting performance characteristic: simple metric analysis, recursion, and the loading and unloading of dynamically linked shared objects. You can use synprog to exercise the Collector and Analyzer.
Among other things, synprog demonstrates the so-called "gprof fallacy". gprof is a standard UNIX performance tool that in this case properly identifies the function where the program is spending most of its CPU time (exclusive time, that is, time spent in the function itself), but wrongly reports the caller responsible for most of that time.
Copying synprog
The Sun WorkShop installation program installs the synprog source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/synprog
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the synprog source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/synprog ~/synprog
Building synprog
Before you build the synprog program:
- Open the makefile in a text editor.
The makefile contains alternative settings for the environment variables ARCH and OFLAGS.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms, and x86 platforms.
Note The default architecture (the -xarch compiler flag) is v7 for SPARC, to accommodate older machines, and ia32 for Intel. Most newer SPARC machines support v8. If your machine supports v8, comment out the default and uncomment the line ARCH = -xarch=v8 or, on a v9 machine, ARCH = -xarch=v9. Using the default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you a couple of settings for OFLAGS that affect program optimization. You can build synprog with the default, which uses the command-line options -g -xF -v -V, run the Collector on it, and look at the results in the Analyzer. Then you can repeat the process with the other setting to get an idea of how the different settings affect how the compiler optimizes code. For an explanation of the various compiler options in the OFLAGS settings, see the C User's Guide.
Note The sample program described in this section is compiled with no optimizations.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting Data About synprog
To collect performance data on synprog:
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load synprog into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the path for synprog in the Name text box, or use the list box to navigate to synprog.
- Click OK.
4. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
- For this experiment you are going to collect only clock-based profiling data and use the default sampling interval, so use the default settings.
5. Click the Start button.
synprog runs in the Debugging window, and the Collector collects clock-based profiling data and stores it in the default experiment-record file, test.1.er.
Analyzing synprog Performance Metrics
To open the Analyzer and load test.1.er into it:
1. Click on the Analyzer button in either the Sun WorkShop main window tool bar or the Sampling Collector window.
2. In the Load Experiment dialog box, type test.1.er, then click on OK.
To start the Analyzer and load test.1.er from the command line, type:
% analyzer test.1.er
The Analyzer window displays the function list for synprog. The function list displays the default clock-based profiling metrics:
- Exclusive user CPU time (the amount of time spent in the function itself), in seconds
- Inclusive user CPU time (the amount of time spent in the function itself and any functions it calls), in seconds
The function list is sorted on exclusive CPU time.
Simple Metric Analysis
First, look at execution times for two very simple functions, cputime() and icputime(). Both contain a for loop that increments a variable x by one, but in cputime(), x is floating-point, while in icputime(), x is an integer.
1. Locate cputime() and icputime() in the Function List display.
- You can see immediately by looking at the exclusive user CPU time for the two functions that cputime() takes much longer to execute than icputime(). You can use other features of the Analyzer to find out why this is so.
2. In the Function List display, click on cputime() to select it.
3. Click Source in the lower tool bar.
4. Select icputime() in the Function List display and click Source.
- The source code for icputime() replaces the source code for cputime() in the text editor. Look at the loop line and the line in which x is incremented:
for (j = 0; j < 1000000; j++) {
    x = x + 1;
}
- You can see that this line takes much less time to execute than the corresponding line in cputime(). (The time on the loop line is about the same as it was for function cputime().)
- Now look at the annotated disassembly for these two functions, to see why this is so.
5. Select cputime() in the Function List display, and click Disassembly in the lower tool bar.
- The annotated disassembly for cputime() appears in the text editor. Scroll down until you see some non-zero metric values. Look at the ones for the line of source code in which x is incremented.
- You can see that while a small amount of time is spent in loading and adding the values for x and 1, the greatest amount of time is spent executing the fstod instruction, which converts the value of x from a single floating-point value to a double floating-point value, so it can be incremented by 1, a double. Afterwards, another small amount of CPU time is spent by the fdtos instruction, which converts x back to a single. Altogether, these two operations account for about three-quarters of the CPU time expended.
6. Now select icputime() in the Function List display and click Disassembly.
- The annotated disassembly for icputime() appears in the text editor. Scroll down until you see the performance metrics for the line of source code where x is incremented.
- You can see that all that is involved is a simple load, add, and store operation that takes approximately a third of the time of the floating-point add, because no conversions are necessary. The value 1 does not even need to be loaded into a register; it can be added directly to x.
The Effects of Recursion
The synprog program contains two examples of recursive calling sequences:
- Function recurse() demonstrates direct recursion. It calls function real_recurse(), which then calls itself until a test condition is met, at which point it performs some work that requires user CPU time, after which the flow of control returns through successive calls to real_recurse(), until it reaches recurse().
- Function bounce() demonstrates indirect recursion. It calls function bounce_a(), which checks to see if a test condition is met, and if it is not, calls function bounce_b(). bounce_b() in turn calls bounce_a(). This sequence continues until the test condition in bounce_a() is met, at which time bounce_a() performs some work that requires user CPU time. After that, the flow of control returns through successive calls to bounce_b() and bounce_a(), until it reaches bounce().
In either case, exclusive metrics belong only to the function in which the actual work is done, in these cases real_recurse() and bounce_a(). These metrics are passed up as inclusive metrics to every function that calls the final function.
First, look at the metrics for recurse() and real_recurse():
1. In the Function List display, find function recurse() and click on it to select it.
- Notice that function recurse() shows inclusive user CPU time, but its exclusive user CPU time is 0 (zero). This is because all recurse() does is execute a call to real_recurse().
Note For some runs, recurse() may show a small exclusive CPU time value. Because profile experiments are statistical in nature, the experiment you ran on synprog may record one or two ticks in the recurse() function. However, the exclusive time is much less than the inclusive time.
2. Click Callers-Callees to open the Callers-Callees window, which shows that recurse() calls one function, real_recurse().
3. Click on real_recurse() to select it.
- The Callers-Callees window now displays the following information:
- Both recurse() and real_recurse() appear in the callers pane as callers of real_recurse(). You would expect this, because after recurse() calls real_recurse(), real_recurse() calls itself recursively.
- In order to simplify the display, real_recurse() does not appear in the callee pane as its own callee.
- Exclusive metrics as well as inclusive metrics are recorded for real_recurse(), where the actual user CPU time is spent. These are passed back up to recurse() as inclusive metrics.
- Exclusive metrics are also displayed for real_recurse() in the caller pane. If a function generates exclusive metrics, the Analyzer displays them for that function anywhere it appears in the Callers-Callees window.
Now look at what happens in the indirect recursive sequence:
1. Find function bounce() in the Function List display and click on it to select it.
- Notice that function bounce() shows inclusive user CPU time, but its exclusive user CPU time is 0 (zero). This is because all bounce() does is execute a call to bounce_a().
2. Click Callers-Callees to open the Callers-Callees window, which shows that bounce() calls only one function, bounce_a().
3. Click on bounce_a() to select it.
- The Callers-Callees window now displays the following information:
- Both bounce() and bounce_b() appear in the callers pane as callers of bounce_a().
- In addition, bounce_b() appears in the callee pane. If a function calls an intermediate function instead of calling itself recursively, that intermediate function does appear in the callee pane.
- Exclusive as well as inclusive metrics are displayed for bounce_a(), where the actual user CPU time is spent. These are passed up to the functions that call bounce_a() as inclusive metrics.
4. Click on bounce_b() to select it.
bounce_b() is shown as both calling and being called by bounce_a(). Exclusive as well as inclusive metrics for bounce_a() appear in both the caller and the callee panes, because if a function generates exclusive metrics, the Analyzer displays the metrics for that function anywhere the function appears in the Callers-Callees window.
The gprof Fallacy
Now look at how the Analyzer resolves the gprof fallacy in synprog.
Select the function gpf_work(). This is one of the functions in which synprog is spending most of its time.
In this example, however, it is less helpful to see where the program is spending time than to figure out why. To do so, look at the functions that call gpf_work() and how these functions call it:
- Click the Callers-Callees button in the lower Analyzer tool bar to open the Callers-Callees window.
- The Callers-Callees window is divided into three horizontal panes:
- The middle pane shows data pertaining to the selected function, in this case gpf_work().
- The top pane shows data pertaining to all functions that call the selected function, in this case gpf_b() and gpf_a().
- The bottom pane shows data pertaining to all functions called by the selected function. In this case, the bottom pane is empty, because gpf_work() does not call any other functions.
By looking at the Callers pane, you can see that gpf_work() is called by two functions, gpf_b() and gpf_a(). The Analyzer shows that most of the time in gpf_work() results from calls from gpf_b(), whereas much less time results from calls from gpf_a(). You must look at the callers to see why gpf_b() calls account for over ten times as much time in gpf_work() as calls from gpf_a():
1. Click on gpf_a() in the Callers pane to select it.
gpf_a() now becomes the selected function and moves to the middle pane; its callers appear in the top Callers pane, and gpf_work(), its callee, now appears in the bottom Callees pane.
2. Go back to the main Analyzer window (where gpf_a() is now the selected function), and click the Source button to open a text editor displaying the annotated source code for gpf_a().
3. In the text editor, scroll down so that you can see the code for both gpf_a() and gpf_b().
- You can see from the code that gpf_a() calls gpf_work() ten times with an argument of 1, whereas gpf_b() calls gpf_work() only once, but with an argument of 10. The arguments from gpf_a() and gpf_b() are passed to the formal argument amt in gpf_work().
Now look at the code for gpf_work(), to see why the way the callers call gpf_work() makes a difference:
- Scroll down one screen in the text editor, to display the code for gpf_work().
- From the line imax = 4 * amt * amt, which sets the upper limit for the following for loop, you can see that the time spent in gpf_work() depends on the square of its argument. So ten times as much time, more or less, will be spent on one call from a function with an argument of 10 (400 iterations) as will be spent on ten calls from a function with an argument of 1 (10 instances of 4 iterations).
And what has all this to do with gprof? The "gprof fallacy" is the result of gprof estimating the amount of time spent in a function from the number of times the function is called, regardless of how the function might be using the arguments passed to it. So for an analysis of synprog, gprof would attribute ten times as much time to calls from gpf_a() as it would to calls from gpf_b(). That is the gprof fallacy.
The distortion increases the higher the power of the argument on which the CPU time in the function depends. For example, matrix multiply, whose work goes as the cube of the order, would be even more sensitive.
Loading Dynamically Linked Shared Objects
The synprog directory contains two dynamically linked shared objects, so_syn.so and so_syx.so. In the course of execution, synprog first loads so_syn.so and makes a call to one of its functions, so_burncpu(). Then it unloads so_syn.so, loads so_syx.so at what happens to be the same address, and makes a call to one of the so_syx.so functions, sx_burncpu(). Then, without unloading so_syx.so, it loads so_syn.so again at a different address, because the address at which it was first loaded is being used by another shared object, and makes another call to so_burncpu().
The functions so_burncpu() and sx_burncpu() perform identical operations, as you can see if you look at their source code. Therefore they should take the same amount of user CPU time to execute.
The addresses at which the shared objects are loaded are determined at run time, and the run-time loader chooses where to load the objects.
This rather artificial exercise demonstrates that the same function can be called at different addresses at different points in the program execution, that different functions can be called at the same address, and that the Analyzer deals correctly with this behavior, aggregating the data for a function regardless of the address at which it appears:
1. In the Function List display, click on sx_burncpu() to select it.
2. Choose View Show Summary Metrics. The Summary Metrics window for sx_burncpu() appears.
- Note the user CPU time for sx_burncpu().
3. Click on so_burncpu() to select it. The summary data for so_burncpu() appears in the Summary Metrics window.
- You can see that although so_burncpu() performs operations identical to those of sx_burncpu(), the user CPU time for so_burncpu() is almost exactly twice the user CPU time for sx_burncpu(). This is because so_burncpu() was executed twice; the Analyzer recognized that the same function was executing and aggregated the data for it, even though it appeared at two different addresses in the course of program execution.
Example 2: omptest
Note omptest is a Fortran program that contains OpenMP directives and compiles only with a Fortran 95 compiler on SPARC platforms.
The omptest program uses OpenMP parallelization and is designed to test the efficiency of parallel distribution. Analysis of omptest deals with the following questions:
- What are the obstacles to parallelization?
- How is load balanced across threads?
- What are the costs of memory and bus contention?
Note This example assumes that you are running omptest on a machine with at least four CPUs.
Copying omptest
The Sun WorkShop installation program installs the omptest source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/omptest
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the omptest source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/omptest ~/omptest
Building omptest
Before you build the omptest program:
- Open the makefile in a text editor.
- The makefile contains alternative settings for the environment variables ARCH and OFLAGS.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms.
Note The default architecture for the -xarch compiler flag is v7 for SPARC, to accommodate older machines. The default setting in the omptest makefile is ARCH = -xarch=v8, since most newer SPARC machines support v8; furthermore, using the v7 default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you several settings for OFLAGS that affect optimization and parallelization. You can build omptest with the default, which uses the command-line options -g -O3 -mp=openmp -explicitpar -depend -stackvar -loopinfo -v -V, run the Collector on it, and look at the results in the Analyzer, then repeat the process with the other settings to get an idea of how the different settings affect how the compiler optimizes and parallelizes code. For an explanation of the various compiler options in the OFLAGS settings, see the Fortran User's Guide.
Note If you do not specify parallelization (-mp=openmp -explicitpar), the compiler does not interpret OpenMP directives, and compiles the program to execute serially.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting Data About omptest
For this part of the demonstration, you must have omptest running on a four-CPU system or larger. Sun WorkShop must be installed on the machine, so you can run the Collector to collect performance data.
Note You might want to refer to the chapters on parallelization and OpenMP in the Fortran Programming Guide for background on parallelization strategies and OpenMP directives.
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load omptest into the Debugging window:
- Choose Debug New Program from the main Debugging menu bar.
- In the Debug New Program dialog box, enter the path for omptest in the Name text box, or use the list box to navigate to omptest.
- Click OK.
4. Set the value of the PARALLEL environment variable to 4:
- Choose Debug Edit Run Parameters to open the Run Parameters dialog box.
- Click Environment Variables to open the Environment Variables dialog box.
- Type PARALLEL in the Name text field and 4 in the Value text field.
- Click OK to exit the Environment Variables dialog box and the Run Parameters dialog box.
5. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to omptest.1.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
- For this experiment you are going to collect only clock-based profiling data and use the default sampling interval, so use the default settings.
6. Click the Start button.
omptest runs in the Debugging window, and the Collector collects clock-based profiling data and stores it in the experiment-record file, omptest.1.er.
7. Repeat Step 4 through Step 6 with PARALLEL set to 2 and save the performance information in the experiment-record file omptest.2.er.
Analyzing omptest Performance Metrics
For purposes of analyzing parallelization strategies, we will be looking at four functions and comparing how they behave on the four-CPU run versus the two-CPU run:
- psec_() is a PARALLEL SECTION routine
- pdo_() is a PARALLEL DO routine
- critsum_() is a CRITICAL SECTION routine
- redsum_() is a REDUCTION routine
All these routines were generated at compile time as a result of OpenMP directives inserted into the omptest source code. Each pair (psec_() and pdo_(), critsum_() and redsum_()) represents contrasting strategies for dealing with effective parallelization.
Note Behavior of parallel code can be unexpected. Read "Parallel Execution and Compiler-Generated Body Functions" on page 96 and "Compiler-Generated Body Functions" on page 103 for more information on how this code behaves.
Look first at the performance of psec_() and pdo_() on the four-CPU and the two-CPU runs:
- Open two Analyzer windows, with omptest.1.er loaded into one, and omptest.2.er loaded into the other. Position the two windows so you can compare the metrics displayed in them.
In each of the Analyzer windows:
1. In the Function List display, find the data for psec_() and click on the line to select it.
2. Choose View Show Summary Metrics.
- The Summary Metrics window appears, displaying the metrics for psec_().
3. Look at the inclusive metrics for user CPU time, wall-clock time, and total LWP time.
- For the two-CPU run, psec_() uses about 8.9 seconds of user CPU time, about 4.8 seconds of wall-clock time, and about 9.6 seconds of total LWP time. The ratio of wall-clock time to either user CPU time or total LWP time is about 1 to 2, which indicates relatively efficient parallelization.
- For the four-CPU run, however, psec_() takes about the same user CPU time (8.4 seconds), but both the wall-clock time and the total LWP time are higher (6.12 seconds and 11.55 seconds respectively). psec_() takes longer to run on the four-CPU machine. There are only two sections within the psec_() PARALLEL SECTION construct, so only two threads are required to execute them, using only two of the four available CPUs at any given time. The slightly poorer performance on the four-CPU machine is due to the overhead involved in scheduling the threads among the four CPUs.
4. Scroll to the function pdo_() and click on the line to select it.
5. In the main Analyzer menu, choose View Show Summary Metrics.
- The Summary Metrics window appears, displaying the metrics for pdo_().
6. Look at the inclusive metrics for user CPU time, wall-clock time, and total LWP time.
- The user CPU time for pdo_() is about the same as for psec_() (about 8.4 seconds for the two-CPU run, about 8.6 seconds for the four-CPU run). But now the ratio of wall-clock time to user CPU time is about 1 to 2 on the two-CPU run and about 1 to 4 on the four-CPU run, indicating that the pdo_() parallelizing strategy scales much more efficiently on multiple CPUs, taking into account how many CPUs are available and scheduling the loop appropriately.
Now look at the relative efficiency of the routines critsum_() and redsum_(). In this case, the annotated source code shows how efficiently each parallelization strategy deals with an identical assignment statement embedded in a pair of do loops; its purpose is to sum the contents of three two-dimensional arrays:
t = (a(j,i)+b(j,i)+c(j,i))/k
sum = sum+t
1. In the Function List display, find the data for critsum_() and click on the line to select it.
2. Click Source in the lower Analyzer tool bar.
- The annotated source code from which critsum_() was generated appears in a text editor.
3. Look at the inclusive user CPU time.
- It is enormous: almost 13 seconds. The inclusive user CPU time is so high because critsum_() uses a critical-section parallelization strategy; although the summing operation is spread over all four CPUs, only one CPU at a time is allowed to add its value of t to sum. This is not a very efficient strategy for making use of parallelization.
4. Close the text editor and scroll to the data for redsum_(). Click on the line to select it.
5. Click Source in the lower Analyzer tool bar.
- The annotated source code from which redsum_() was generated appears in a text editor.
6. Look at the inclusive user CPU time.
- It is much smaller, only about 1.7 seconds, because redsum_() uses a reduction strategy, by which the partial sums of (a(j,i)+b(j,i)+c(j,i))/k are distributed over multiple processors, after which these intermediate values are added to sum. This strategy makes much more efficient use of the available CPUs.
Example 3: mttest
The mttest program emulates the server in a client-server environment, where clients queue up requests and the server uses multiple threads to service them, using explicit threading. Performance data collected on mttest demonstrates the sorts of contention that arise from various locking strategies, and the effect of caching on execution time.
Copying mttest
The Sun WorkShop installation program installs the mttest source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/mttest
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the mttest source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/mttest ~/mttest
Building mttest
Before you build the mttest program:
- Open the makefile in a text editor.
The makefile contains alternative settings for the environment variables ARCH, OFLAGS, THREADS, and FLAG.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms, and x86 platforms.
Note The default architecture (the -xarch compiler flag) is v7 for SPARC, to accommodate older machines, and ia32 for Intel. Most newer SPARC machines support v8. If your machine supports v8, comment out the default and uncomment the line ARCH = -xarch=v8 or, on a v9 machine, ARCH = -xarch=v9. Using the default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you several settings for OFLAGS that affect optimization. You can build mttest with the default, which uses the command-line options -g -xF -v -V, run the Collector on it, and look at the results in the Analyzer, then repeat the process with the other settings to get an idea of how the different settings affect how the compiler optimizes code. For an explanation of the various compiler options in the OFLAGS settings, see the C User's Guide.
- For THREADS and FLAG, uncomment whichever setting is closest to your target application, and comment out the other one. The settings for THREADS are SOLARIS and POSIX, which determine which threads standard your program is compiled for. The settings for FLAG are BOUND and UNBOUND; they determine whether your program uses bound or unbound threads.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting and Analyzing Data About mttest
The executable mttest you have just built is compiled for explicit multithreading, and it will run as a multithreaded program on a machine with multiple CPUs or with one CPU. There are some interesting differences, and similarities, in its performance metrics if you run it on a single-CPU system.
The first two sections, "Collecting Data About mttest (Four-CPU System)" and "Analyzing mttest Performance Metrics (Four-CPU System)", look at mttest's performance on a four-CPU machine; the next two sections, "Collecting Data About mttest (One-CPU System)" and "Analyzing mttest Performance Metrics (One-CPU System)", look at the same set of performance metrics on a one-CPU machine.
Collecting Data About mttest (Four-CPU System)
For this part of the demonstration, you must have mttest running on a four-CPU system. Sun WorkShop must be installed on the machine, so you can run the Collector to collect performance data.
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load mttest into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the mttest path in the Name text box, or use the list box to navigate to mttest.
- Click OK.
4. From the main Debugging window menu, choose Windows Sampling Collector to open the WorkShop Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to mttest.1.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
5. For this experiment, in addition to clock-based profiling, you will collect synchronization wait tracing information, so select the Synchronization Wait Tracing check box.
6. Click the Start button.
mttest runs in the Debugging window, and the Collector collects clock-based profiling and synchronization wait tracing data and stores it in the experiment-record file, mttest.1.er.
Analyzing mttest Performance Metrics (Four-CPU System)
To open the Analyzer and load mttest.1.er into it:
1. Click on the Analyzer button in either the Sun WorkShop main window tool bar or the Sampling Collector window.
2. In the Load Experiment dialog box, type mttest.1.er, then click OK.

To open the Analyzer and load mttest.1.er from the command line:

The Analyzer window displays the Function List for mttest.

1. In the Function List display, scroll down to the data for lock_local() and lock_global().
- Note that both functions have the same inclusive user CPU time (about 3.2 seconds). This indicates that both functions are doing the same amount of work.
- However, lock_global() spends a lot of time in synchronization waiting (almost five seconds), whereas lock_local() spends no time in synchronization waiting. If you bring up the annotated source code for the two functions, you can see why this is so.

2. Click the line of data for lock_global() to select it.

3. Click the Source button in the lower Analyzer tool bar.
- The annotated source code for function lock_global() appears in the text editor. You can see that lock_global() applies a global lock to the data:

mutex_lock(&global_lock);

- Because of the global lock, all running threads must contend for access to the data, and only one thread has access to it at a time. The rest of the threads must wait until the working thread releases the lock before they can access the data.
4. Click the line of data for lock_local() to select it.

5. Click the Source button in the lower tool bar.
- The annotated source code for function lock_local() appears in the text editor. lock_local() applies a lock only to data in a particular thread's work block:

mutex_lock(&(array->lock));

- No thread can access another thread's work block, so the threads' work can proceed without contention or time wasted in synchronization waits, and the wait time for lock_local() is 0 (zero) seconds.

6. Return to the Function List display in the main Analyzer window and scroll to the data for the functions ComputeA() and ComputeB().

7. Click the line of data for ComputeA() to select it.

8. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for ComputeA().

9. Scroll down in the text editor so that the source for both ComputeA() and ComputeB() is displayed.
- Observe that although the code for these functions is virtually identical (a loop adding one to a variable), ComputeB() uses almost ten seconds of inclusive user CPU time, whereas ComputeA() uses only about 3.3 seconds. To find out the reason for this discrepancy, you must examine the code that calls ComputeA() and ComputeB().

10. Return to the Function List display in the main Analyzer window, and click the line of data for ComputeA() to select it.

11. Click the Callers-Callees button in the lower tool bar.
- The Callers-Callees window opens, showing the selected function in the middle display pane and its caller in the upper pane.

12. Click the caller lock_none() to select it.

13. Return to the Function List display in the main Analyzer window.
- Note that lock_none() is now the selected function.

14. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for lock_none().

15. Scroll down to display the code for the function that calls ComputeB(), which is cache_trash().

16. Compare the calls to ComputeA() and ComputeB():
- ComputeA() is called with a double in the thread's work block (&array->list[0]) as an argument, which can be read and written directly without danger of contention with other threads.
- ComputeB(), however, is called with a series of doubles that occupy successive words in memory (&element[array->index]). Whenever a thread writes to one of these addresses, any other thread that has that address in its cache must invalidate the data, which is now out of date. If one of the threads needs the data again later in the program, the data must be copied back into the data cache from memory, which is a time-consuming operation. The resulting cache misses--attempts to access data not available in the data cache--waste a lot of CPU time, which explains why ComputeB() uses three times as much user CPU time as ComputeA().

17. Return to the Function List display in the main Analyzer window. Click the line for ComputeB() to select it.

18. Click the Disassembly button in the lower tool bar.
- The text editor opens, displaying the annotated disassembly code for ComputeB(). You can see that most of the user CPU time (over 7 seconds) is being used by the faddd instruction, which is waiting for a register load that misses in the cache.

Collecting Data About mttest (One-CPU System)

For this part of the demonstration, you must run mttest on a single-CPU system. As before, Sun WorkShop must be installed on the machine so that you can run the Collector to collect performance data.

1. Start Sun WorkShop by typing:
% workshop

2. Click the Debugging button to open the Debugging window.

3. Load mttest into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the mttest path in the Name text box, or use the list box to navigate to mttest.
- Click OK.

4. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window. Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to mttest.2.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
5. For this experiment, in addition to clock-based profiling, you are going to collect synchronization wait tracing information, so select the Synchronization Wait Tracing check box.

6. Click the Start button. mttest runs in the Debugging window, and the Collector collects clock-based profiling and synchronization wait tracing data and stores it in the experiment-record file, mttest.2.er.

Analyzing mttest Performance Metrics (One-CPU System)

Load mttest.2.er into the Analyzer:

The Analyzer window opens with the Function List for mttest displayed.

1. In the Function List display, scroll down to the data for lock_local() and lock_global().
- As on the four-CPU system, both functions have the same inclusive user CPU time (in this case about 10 seconds). This indicates that both functions are doing the same amount of work.
- However, total LWP time for lock_global() is actually less than for lock_local() (25 seconds versus 37 seconds). This difference could result from the way each locking scheme schedules the threads onto the CPU. The global lock set by lock_global() allows each thread to execute in sequence until it has run to completion: the first thread executes for 2.5 seconds and then exits, while the others remain in a wait state until it is through; the second thread, having waited 2.5 seconds for the first, then executes, for a total LWP time of 5 seconds; the third thread waits 5 seconds for the first two threads and then executes, for a total LWP time of 7.5 seconds; and the fourth thread, having waited 7.5 seconds, executes for a total LWP time of 10 seconds. The local lock used by lock_local(), on the other hand, schedules each thread onto the CPU for a fraction of its run and then repeats the process until all the threads have run to completion, so that all four threads remain in a wait state for almost three-quarters of the 10-second execution time.
- lock_global() still spends a lot of time in synchronization waiting (almost 15 seconds: 2.5 plus 5 plus 7.5), whereas lock_local() still spends no time in synchronization waiting. lock_local() applies a lock only to data in a particular thread's work block, so its work proceeds as on the four-CPU system, without contention or time wasted in synchronization waits. The synchronization wait time for lock_local() again is 0 (zero) seconds.

2. Return to the Function List display in the main Analyzer window, and scroll to the data for the functions ComputeA() and ComputeB().

3. Click the line of data for ComputeA() to select it.

4. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for ComputeA().

5. Scroll down in the text editor so that the source for both ComputeA() and ComputeB() is displayed.
- Observe that on the single-CPU system, inclusive user CPU time for ComputeA() and ComputeB() is virtually identical.

6. Return to the Function List display in the main Analyzer window, and click the line of data for ComputeA() to select it.

7. Click the Callers-Callees button in the lower tool bar.
- The Callers-Callees window opens, showing the selected function ComputeA() in the middle display pane and its caller in the pane above.

8. Click the caller cache_trash() to select it.

9. Return to the Function List display in the main Analyzer window.
- Note that cache_trash() is now the selected function.

10. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for cache_trash().

11. Scroll up to display the code for the function that calls ComputeA(), which is lock_none().

12. Compare the calls to ComputeA() and ComputeB():
- ComputeA() is called with a double in the thread's work block as an argument (&array->list[0]), which can be read and written directly without danger of contention with other threads.
- ComputeB() is called with a series of doubles that occupy successive words in memory (&element[array->index]), accessed through the data cache. Recall that when multiple threads are running on multiple CPUs, a thread's write to one of these addresses forces any other thread that has that address in its cache to invalidate the data, which is now obsolete. If one of them needs the data again later in the program, it must be copied back into the data cache from memory, a time-consuming operation. The resulting cache misses waste a lot of CPU time.
- However, when only one thread is running, no other threads write to memory, so the running thread's cache data never becomes invalid. No cache misses or resulting copies from memory occur, so the performance of ComputeB() on a one-CPU machine is just as efficient as the performance of ComputeA().