Analyzing Program Performance with Sun WorkShop
Tutorial: Using the Sampling Collector and Analyzer
This chapter demonstrates how to use the Sampling Collector and the Sampling Analyzer to understand the performance of three sample programs:
- Example 1: synprog. A simple program that demonstrates various programming constructs, as well as an example of the gprof fallacy
- Example 2: omptest. A program that uses OpenMP parallelization in Fortran
- Example 3: mttest. A program that uses explicit multithreading
In these examples, each of the programs is subjected to the same sequence of performance questions:
- What can I change in my program to improve its performance?
- Program analysis at this level is concerned with higher-level algorithm issues. The compiled code is already as optimized as the compiler can make it; you are now looking for ways to refine the program algorithm itself to make it execute more efficiently.
- What resources is my program using?
- For example, how much CPU time is my program using?
- Where in my program are most of these resources being used?
- For example, is there a particular function where my program is spending most of its time? Which line or instruction in the function accounts for most of the time being spent?
- How did the program arrive at that line or instruction in the execution process?
Once you determine where your program is using the most resources, the Analyzer offers various ways of examining your code so you can determine why.
Note In all these examples, the times shown represent data collected in particular experiments. When you record data, you will see different numbers, because your platform is probably different from the one used to collect the data reported here.
The source code for each of these demonstration programs is included on the distribution. Before you start the tutorial:
1. Make sure the Sun WorkShop directory has been added to your path:
/opt/SUNWspro/bin
2. Copy the files in one or more of the demo directories to a work area of your own, and do a make to construct the demo program.
Example 1: synprog
The synprog program is a simple program that demonstrates a number of programming constructs, each of which exhibits some interesting performance characteristic: simple metric analysis, recursion, and the loading and unloading of dynamically linked shared objects. You can use synprog to exercise the Collector and Analyzer.
Among other things, synprog demonstrates the so-called "gprof fallacy". gprof is a standard UNIX performance tool that in this case properly identifies the function where the program is spending most of its CPU time (exclusive time, that is, time spent in the function itself), but wrongly reports the caller responsible for most of that time.
Copying synprog
The Sun WorkShop installation program installs the synprog source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/synprog
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the synprog source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/synprog ~/synprog
Building synprog
Before you build the synprog program:
- Open the makefile in a text editor.
The makefile contains alternative settings for the environment variables ARCH and OFLAGS.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms, and x86 platforms.
Note The default architecture (the -xarch compiler flag) is v7 for SPARC, to accommodate older machines, and ia32 for Intel. Most newer SPARC machines support v8. If your machine supports v8, comment out the default and uncomment the line ARCH = -xarch=v8 or, on a v9 machine, ARCH = -xarch=v9. Using the default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you a couple of settings for OFLAGS that affect program optimization. You can build synprog with the default, which uses the command-line options -g -xF -v -V, run the Collector on it, and look at the results in the Analyzer. Then you can repeat the process with the other setting to get an idea of how the different settings affect how the compiler optimizes code. For an explanation of the various compiler options in the OFLAGS settings, see the C User's Guide.
Note The sample program described in this section is compiled with no optimizations.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting Data About synprog
To collect performance data on synprog:
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load synprog into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the path for synprog in the Name text box, or use the list box to navigate to synprog.
- Click OK.
4. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
- For this experiment you are going to collect only clock-based profiling data and use the default sampling interval, so use the default settings.
5. Click the Start button.
synprog runs in the Debugging window, and the Collector collects clock-based profiling data and stores it in the default experiment-record file, test.1.er.
Analyzing synprog Performance Metrics
To open the Analyzer and load test.1.er into it:
1. Click on the Analyzer button in either the Sun WorkShop main window tool bar or the Sampling Collector window.
2. In the Load Experiment dialog box, type test.1.er, then click on OK.
To start the Analyzer and load test.1.er from the command line, type:
% analyzer test.1.er
The Analyzer window displays the function list for synprog. The function list displays the default clock-based profiling metrics:
- Exclusive user CPU time (the amount of time spent in the function itself), in seconds
- Inclusive user CPU time (the amount of time spent in the function itself and any functions it calls), in seconds
The function list is sorted on exclusive CPU time.
Simple Metric Analysis
First, look at execution times for two very simple functions, cputime() and icputime(). Both contain a for loop that increments a variable x by one, but in cputime(), x is floating-point, while in icputime(), x is an integer.
1. Locate cputime() and icputime() in the Function List display.
- You can see immediately by looking at the exclusive user CPU time for the two functions that cputime() takes much longer to execute than icputime(). You can use other features of the Analyzer to find out why this is so.
2. In the Function List display, click on cputime() to select it.
3. Click Source in the lower tool bar.
4. Select icputime() in the Function List display and click Source.
- The source code for icputime() replaces the source code for cputime() in the text editor. Look at the loop line and the line in which x is incremented:
for (j = 0; j < 1000000; j++) {
    x = x + 1;
}
- You can see that this line takes much less time to execute than the corresponding line in cputime(). (The time on the loop line is about the same as it was for function cputime().)
- Now look at the annotated disassembly for these two functions, to see why this is so.
5. Select cputime() in the Function List display, and click Disassembly in the lower tool bar.
- The annotated disassembly for cputime() appears in the text editor. Scroll down until you see some non-zero metric values. Look at the ones for the line of source code in which x is incremented.
- You can see that while a small amount of time is spent in loading and adding the values for x and 1, the greatest amount of time is spent executing the fstod instruction, which converts the value of x from a single floating-point value to a double floating-point value, so it can be incremented by 1, a double. Afterwards, another small amount of CPU time is spent by the fdtos instruction, which converts x back to a single. Altogether, these two operations account for about three-quarters of the CPU time expended.
6. Now select icputime() in the Function List display and click Disassembly.
- The annotated disassembly for icputime() appears in the text editor. Scroll down until you see the performance metrics for the line of source code where x is incremented.
- You can see that all that is involved is a simple load, add, and store operation that takes approximately a third of the time of the floating-point add, because no conversions are necessary. The value 1 does not even need to be loaded into a register; it can be added directly to x.
The Effects of Recursion
The synprog program contains two examples of recursive calling sequences:
- Function recurse() demonstrates direct recursion. It calls function real_recurse(), which then calls itself until a test condition is met, at which point it performs some work that requires user CPU time, after which the flow of control returns through successive calls to real_recurse(), until it reaches recurse().
- Function bounce() demonstrates indirect recursion. It calls function bounce_a(), which checks to see if a test condition is met, and if it is not, calls function bounce_b(). bounce_b() in turn calls bounce_a(). This sequence continues until the test condition in bounce_a() is met, at which time bounce_a() performs some work that requires user CPU time. After that, the flow of control returns through successive calls to bounce_b() and bounce_a(), until it reaches bounce().
In either case, exclusive metrics belong only to the function in which the actual work is done, in these cases real_recurse() and bounce_a(). These metrics are passed up as inclusive metrics to every function that calls the final function.
First, look at the metrics for recurse() and real_recurse():
1. In the Function List display, find function recurse() and click on it to select it.
- Notice that function recurse() shows inclusive user CPU time, but its exclusive user CPU time is 0 (zero). This is because all recurse() does is execute a call to real_recurse().
Note For some runs, recurse() may show a small exclusive CPU time value. Because profile experiments are statistical in nature, the experiment you ran on synprog may record one or two ticks in the recurse() function. However, the exclusive time is much less than the inclusive time.
2. Click Callers-Callees to open the Callers-Callees window, which shows that recurse() calls one function, real_recurse().
3. Click on real_recurse() to select it.
- The Callers-Callees window now displays the following information:
- Both recurse() and real_recurse() appear in the callers pane as callers of real_recurse(). You would expect this, because after recurse() calls real_recurse(), real_recurse() calls itself recursively.
- In order to simplify the display, real_recurse() does not appear in the callee pane as its own callee.
- Exclusive metrics as well as inclusive metrics are recorded for real_recurse(), where the actual user CPU time is spent. These are passed back up to recurse() as inclusive metrics.
- Exclusive metrics are also displayed for real_recurse() in the caller pane. If a function generates exclusive metrics, the Analyzer displays them for that function anywhere it appears in the Callers-Callees window.
Now look at what happens in the indirect recursive sequence:
1. Find function bounce() in the Function List display and click on it to select it.
- Notice that function bounce() shows inclusive user CPU time, but its exclusive user CPU time is 0 (zero). This is because all bounce() does is execute a call to bounce_a().
2. Click Callers-Callees to open the Callers-Callees window, which shows that bounce() calls only one function, bounce_a().
3. Click on bounce_a() to select it.
- The Callers-Callees window now displays the following information:
- Both bounce() and bounce_b() appear in the callers pane as callers of bounce_a().
- In addition, bounce_b() appears in the callee pane. If a function calls an intermediate function instead of calling itself recursively, that intermediate function does appear in the callee pane.
- Exclusive as well as inclusive metrics are displayed for bounce_a(), where the actual user CPU time is spent. These are passed up to the functions that call bounce_a() as inclusive metrics.
4. Click on bounce_b() to select it.
bounce_b() is shown as both calling and being called by bounce_a(). Exclusive as well as inclusive metrics for bounce_a() appear in both the caller and the callee panes, because if a function generates exclusive metrics, the Analyzer displays the metrics for that function anywhere the function appears in the Callers-Callees window.
The gprof Fallacy
Now look at how the Analyzer resolves the gprof fallacy in synprog.
Select the function gpf_work(). This is one of the functions in which synprog is spending most of its time.
In this example, however, it is less helpful to see where the program is spending time than to figure out why. To do so, look at the functions that call gpf_work() and how these functions call it:
- Click the Callers-Callees button in the lower Analyzer tool bar to open the Callers-Callees window.
- The Callers-Callees window is divided into three horizontal panes:
- The middle pane shows data pertaining to the selected function, in this case gpf_work().
- The top pane shows data pertaining to all functions that call the selected function, in this case gpf_b() and gpf_a().
- The bottom pane shows data pertaining to all functions called by the selected function. In this case, the bottom pane is empty, because gpf_work() does not call any other functions.
By looking at the Callers pane, you can see that gpf_work() is called by two functions, gpf_b() and gpf_a(). The Analyzer shows that most of the time in gpf_work() results from calls from gpf_b(), whereas much less time results from calls from gpf_a(). You must look at the callers to see why gpf_b() calls account for over ten times as much time in gpf_work() as calls from gpf_a():
1. Click on gpf_a() in the Callers pane to select it.
gpf_a() now becomes the selected function and moves to the middle pane; its callers appear in the top Callers pane, and gpf_work(), its callee, now appears in the bottom Callees pane.
2. Go back to the main Analyzer window (where gpf_a() is now the selected function), and click the Source button to open a text editor displaying the annotated source code for gpf_a().
3. In the text editor, scroll down so that you can see the code for both gpf_a() and gpf_b().
- You can see from the code that gpf_a() calls gpf_work() ten times with an argument of 1, whereas gpf_b() calls gpf_work() only once, but with an argument of 10. The arguments from gpf_a() and gpf_b() are passed to the formal argument amt in gpf_work().
Now look at the code for gpf_work(), to see why the way the callers call gpf_work() makes a difference:
- Scroll down one screen in the text editor, to display the code for gpf_work().
- From the line imax = 4 * amt * amt, which sets the upper limit for the following for loop, you can see that the time spent in gpf_work() depends on the square of its argument. So ten times as much time, more or less, will be spent on one call from a function with an argument of 10 (400 iterations) as will be spent on ten calls from a function with an argument of 1 (10 instances of 4 iterations).
And what has all this to do with gprof? The "gprof fallacy" is the result of gprof estimating the amount of time spent in a function from the number of times the function is called, regardless of how the function might be using the arguments passed to it. So for an analysis of synprog, gprof would attribute ten times as much time to calls from gpf_a() as it would to calls from gpf_b(). That is the gprof fallacy.
The distortion increases the higher the power of the argument on which the CPU time in the function depends. For example, matrix multiply, whose work goes as the cube of the order, would be even more sensitive.
Loading Dynamically Linked Shared Objects
The synprog directory contains two dynamically linked shared objects, so_syn.so and so_syx.so. In the course of execution, synprog first loads so_syn.so and makes a call to one of its functions, so_burncpu(). Then it unloads so_syn.so, loads so_syx.so at what happens to be the same address, and makes a call to one of the so_syx.so functions, sx_burncpu(). Then, without unloading so_syx.so, it loads so_syn.so again at a different address, because the address at which it was first loaded is being used by another shared object, and makes another call to so_burncpu().
The functions so_burncpu() and sx_burncpu() perform identical operations, as you can see if you look at their source code. Therefore they should take the same amount of user CPU time to execute.
The addresses at which the shared objects are loaded are determined at run time, and the run-time loader chooses where to load the objects.
This rather artificial exercise demonstrates that the same function can be called at different addresses at different points in the program execution, that different functions can be called at the same address, and that the Analyzer deals correctly with this behavior, aggregating the data for a function regardless of the address at which it appears:
1. In the Function List display, click on sx_burncpu() to select it.
2. Choose View Show Summary Metrics. The Summary Metrics window for sx_burncpu() appears.
- Note the user CPU time for sx_burncpu().
3. Click on so_burncpu() to select it. The summary data for so_burncpu() appears in the Summary Metrics window.
- You can see that although so_burncpu() performs operations identical to those of sx_burncpu(), the user CPU time for so_burncpu() is almost exactly twice the user CPU time for sx_burncpu(). This is because so_burncpu() was executed twice; the Analyzer recognized that the same function was executing and aggregated the data for it, even though it appeared at two different addresses in the course of program execution.
Example 2: omptest
Note omptest is a Fortran program that contains OpenMP directives and compiles only with a Fortran 95 compiler on SPARC platforms.
The omptest program uses OpenMP parallelization and is designed to test the efficiency of parallel distribution. Analysis of omptest deals with the following questions:
- What are the obstacles to parallelization?
- How is load balanced across threads?
- What are the costs of memory and bus contention?
Note This example assumes that you are running omptest on a machine with at least four CPUs.
Copying omptest
The Sun WorkShop installation program installs the omptest source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/omptest
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the omptest source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/omptest ~/omptest
Building omptest
Before you build the omptest program:
- Open the makefile in a text editor.
- The makefile contains alternative settings for the environment variables ARCH and OFLAGS.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms.
Note The default architecture for the -xarch compiler flag is v7 for SPARC, to accommodate older machines. The default setting in the omptest makefile is ARCH = -xarch=v8, since most newer SPARC machines support v8; furthermore, using the v7 default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you several settings for OFLAGS that affect optimization and parallelization. You can build omptest with the default, which uses the command-line options -g -O3 -mp=openmp -explicitpar -depend -stackvar -loopinfo -v -V, run the Collector on it, and look at the results in the Analyzer, then repeat the process with the other settings to get an idea of how the different settings affect how the compiler optimizes and parallelizes code. For an explanation of the various compiler options in the OFLAGS settings, see the Fortran User's Guide.
Note If you do not specify parallelization (-mp=openmp -explicitpar), the compiler does not interpret OpenMP directives, and compiles the program to execute serially.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting Data About omptest
For this part of the demonstration, you must have omptest running on a four-CPU system or larger. Sun WorkShop must be installed on the machine, so you can run the Collector to collect performance data.
Note You might want to refer to the chapters on parallelization and OpenMP in the Fortran Programming Guide for background on parallelization strategies and OpenMP directives.
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load omptest into the Debugging window:
- Choose Debug New Program from the main Debugging menu bar.
- In the Debug New Program dialog box, enter the path for omptest in the Name text box, or use the list box to navigate to omptest.
- Click OK.
4. Set the value of the PARALLEL environment variable to 4:
- Choose Debug Edit Run Parameters to open the Run Parameters dialog box.
- Click Environment Variables to open the Environment Variables dialog box.
- Type PARALLEL in the Name text field and 4 in the Value text field.
- Click OK to exit the Environment Variables dialog box and the Run Parameters dialog box.
5. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to omptest.1.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
- For this experiment you are going to collect only clock-based profiling data and use the default sampling interval, so use the default settings.
6. Click the Start button.
omptest runs in the Debugging window, and the Collector collects clock-based profiling data and stores it in the experiment-record file, omptest.1.er.
7. Repeat Step 4 through Step 6 with PARALLEL set to 2 and save the performance information in the experiment-record file omptest.2.er.
Analyzing omptest Performance Metrics
For purposes of analyzing parallelization strategies, we will be looking at four functions and comparing how they behave on the four-CPU run versus the two-CPU run:
- psec_() is a PARALLEL SECTION routine
- pdo_() is a PARALLEL DO routine
- critsum_() is a CRITICAL SECTION routine
- redsum_() is a REDUCTION routine
All these routines were generated at compile time as a result of OpenMP directives inserted into the omptest source code. Each pair (psec_() and pdo_(), critsum_() and redsum_()) represents contrasting strategies for dealing with effective parallelization.
Note Behavior of parallel code can be unexpected. Read "Parallel Execution and Compiler-Generated Body Functions" on page 96 and "Compiler-Generated Body Functions" on page 103 for more information on how this code behaves.
Look first at the performance of psec_() and pdo_() on the four-CPU and the two-CPU runs:
- Open two Analyzer windows, with omptest.1.er loaded into one, and omptest.2.er loaded into the other. Position the two windows so you can compare the metrics displayed in them.
In each of the Analyzer windows:
1. In the Function List display, find the data for psec_() and click on the line to select it.
2. Choose View Show Summary Metrics.
- The Summary Metrics window appears, displaying the metrics for psec_().
3. Look at the inclusive metrics for user CPU time, wall-clock time, and total LWP time.
- For the two-CPU run, psec_() uses about 8.9 seconds of user CPU time, about 4.8 seconds of wall-clock time, and about 9.6 seconds of total LWP time. The ratio of wall-clock time to either user CPU time or total LWP time is about 1 to 2, which indicates relatively efficient parallelization.
- For the four-CPU run, however, psec_() takes about the same user CPU time (8.4 seconds), but both the wall-clock time and the total LWP time are higher (6.12 seconds and 11.55 seconds respectively). psec_() takes longer to run on the four-CPU machine. There are only two sections within the psec_() PARALLEL SECTION construct, so only two threads are required to execute them, using only two of the four available CPUs at any given time. The slightly poorer performance on the four-CPU machine is due to the overhead involved in scheduling the threads among the four CPUs.
4. Scroll to the function pdo_() and click on the line to select it.
5. In the main Analyzer menu, choose View Show Summary Metrics.
- The Summary Metrics window appears, displaying the metrics for pdo_().
6. Look at the inclusive metrics for user CPU time, wall-clock time, and total LWP time.
- The user CPU time for pdo_() is about the same as for psec_() (about 8.4 seconds for the two-CPU run, about 8.6 seconds for the four-CPU run). But now the ratio of wall-clock time to user CPU time is about 1 to 2 on the two-CPU run and about 1 to 4 on the four-CPU run, indicating that the pdo_() parallelizing strategy scales much more efficiently on multiple CPUs, taking into account how many CPUs are available and scheduling the loop appropriately.
Now look at the relative efficiency of the routines critsum_() and redsum_(). In this case, the annotated source code shows how efficiently each parallelization strategy deals with an identical assignment statement embedded in a pair of do loops; its purpose is to sum the contents of three two-dimensional arrays:
t = (a(j,i)+b(j,i)+c(j,i))/k
sum = sum+t
1. In the Function List display, find the data for critsum_() and click on the line to select it.
2. Click Source in the lower Analyzer tool bar.
- The annotated source code from which critsum_() was generated appears in a text editor.
3. Look at the inclusive user CPU time.
- It is enormous: almost 13 seconds. The inclusive user CPU time is so high because critsum_() uses a critical-section parallelization strategy; although the summing operation is spread over all four CPUs, only one CPU at a time is allowed to add its value of t to sum. This is not a very efficient strategy for making use of parallelization.
4. Close the text editor and scroll to the data for redsum_(). Click on the line to select it.
5. Click Source in the lower Analyzer tool bar.
- The annotated source code from which redsum_() was generated appears in a text editor.
6. Look at the inclusive user CPU time.
- It is much smaller, only about 1.7 seconds, because redsum_() uses a reduction strategy, by which the partial sums of (a(j,i)+b(j,i)+c(j,i))/k are distributed over multiple processors, after which these intermediate values are added to sum. This strategy makes much more efficient use of the available CPUs.
Example 3: mttest
The mttest program emulates the server in a client-server environment, where clients queue up requests and the server uses multiple threads to service them, using explicit threading. Performance data collected on mttest demonstrates the sorts of contention that arise from various locking strategies, and the effect of caching on execution time.
Copying mttest
The Sun WorkShop installation program installs the mttest source files in the following directory.
/installation_directory/SUNWspro/examples/analyzer/mttest
In a default installation, installation_directory is /opt.
Before you start on this part of the tutorial, create a working directory and copy the mttest source files and the makefile into it:
% cp -r installation_directory/SUNWspro/examples/analyzer/mttest ~/mttest
Building mttest
Before you build the mttest program:
- Open the makefile in a text editor.
The makefile contains alternative settings for the environment variables ARCH, OFLAGS, THREADS, and FLAG.
- You can accept the default setting for ARCH, which works on SPARC 7, 8, or 9 platforms, and x86 platforms.
Note The default architecture (the -xarch compiler flag) is v7 for SPARC, to accommodate older machines, and ia32 for Intel. Most newer SPARC machines support v8. If your machine supports v8, comment out the default and uncomment the line ARCH = -xarch=v8 or, on a v9 machine, ARCH = -xarch=v9. Using the default generates code that calls the libc.so .mul and .div routines rather than using integer multiply and divide instructions, and the time spent in these arithmetic operations shows up in the <Unknown> function; see "The <Unknown> Function" on page 104 for more information.
- The makefile offers you several settings for OFLAGS that affect optimization. You can build mttest with the default, which uses the command-line options -g -xF -v -V, run the Collector on it, and look at the results in the Analyzer, then repeat the process with the other settings to get an idea of how the different settings affect how the compiler optimizes code. For an explanation of the various compiler options in the OFLAGS settings, see the C User's Guide.
- For THREADS and FLAG, uncomment whichever setting is closest to your target application, and comment out the other one. The settings for THREADS are SOLARIS and POSIX, which determine which threads standard your program is compiled for. The settings for FLAG are BOUND and UNBOUND; they determine whether your program uses bound or unbound threads.
1. Save the makefile and close the editor.
2. Type make at the command prompt.
Collecting and Analyzing Data About mttest
The executable mttest you have just built is compiled for explicit multithreading, and it will run as a multithreaded program on a machine with multiple CPUs or with one CPU. There are some interesting differences, and similarities, in its performance metrics if you run it on a single-CPU system.
The first two sections, "Collecting Data About mttest (Four-CPU System)" and "Analyzing mttest Performance Metrics (Four-CPU System)", look at mttest's performance on a four-CPU machine; the next two sections, "Collecting Data About mttest (One-CPU System)" and "Analyzing mttest Performance Metrics (One-CPU System)", look at the same set of performance metrics on a one-CPU machine.
Collecting Data About mttest (Four-CPU System)
For this part of the demonstration, you must have mttest running on a four-CPU system. Sun WorkShop must be installed on the machine, so you can run the Collector to collect performance data.
1. Start Sun WorkShop by typing:
%workshop
2. Click the Debugging button to open the Debugging window.
3. Load mttest into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the mttest path in the Name text box, or use the list box to navigate to mttest.
- Click OK.
4. From the main Debugging window menu, choose Windows Sampling Collector to open the WorkShop Sampling Collector window.
- Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to mttest.1.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
5. For this experiment, in addition to clock-based profiling, you will collect synchronization wait tracing information, so select the Synchronization Wait Tracing check box.
6. Click the Start button.
mttest runs in the Debugging window, and the Collector collects clock-based profiling and synchronization wait tracing data and stores it in the experiment-record file, mttest.1.er.
Analyzing mttest Performance Metrics (Four-CPU System)
To open the Analyzer and load mttest.1.er into it:
1. Click on the Analyzer button in either the Sun WorkShop main window tool bar or the Sampling Collector window.
2. In the Load Experiment dialog box, type mttest.1.er, then click OK.

To open the Analyzer and load mttest.1.er from the command line:

The Analyzer window displays the Function List for mttest.

1. In the Function List display, scroll down to the data for lock_local() and lock_global().
- Note that both functions have the same inclusive user CPU time (about 3.2 seconds). This indicates that both functions are doing the same amount of work.
- However, lock_global() spends a lot of time in synchronization waiting (almost five seconds), whereas lock_local() spends no time in synchronization waiting. If you bring up the annotated source code for the two functions, you can see why this is so.

2. Click the line of data for lock_global() to select it.

3. Click the Source button in the lower Analyzer tool bar.
- The annotated source code for function lock_global() appears in the text editor. You can see that lock_global() applies a global lock to the data:

mutex_lock(&global_lock);

- Because of the global lock, all running threads must contend for access to the data, and only one thread has access to it at a time. The rest of the threads must wait until the working thread releases the lock before they can access the data.
4. Click the line of data for lock_local() to select it.

5. Click the Source button in the lower tool bar.
- The annotated source code for function lock_local() appears in the text editor. lock_local() applies a lock only to data in a particular thread's work block:

mutex_lock(&(array->lock));

- No thread can access another thread's work block, so the threads' work can proceed without contention or time wasted in synchronization waits, and the wait time for lock_local() is 0 (zero) seconds.

6. Return to the Function List display in the main Analyzer window and scroll to the data for the functions ComputeA() and ComputeB().

7. Click the line of data for ComputeA() to select it.

8. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for ComputeA().

9. Scroll down in the text editor so that the source for both ComputeA() and ComputeB() is displayed.
- Observe that although the code for these functions is virtually identical (a loop adding one to a variable), ComputeB() uses almost ten seconds of inclusive user CPU time, whereas ComputeA() uses only about 3.3 seconds. To find out the reason for this discrepancy, you must examine the code that calls ComputeA() and ComputeB().

10. Return to the Function List display in the main Analyzer window, and click the line of data for ComputeA() to select it.

11. Click the Callers-Callees button in the lower tool bar.
- The Callers-Callees window opens, showing the selected function in the middle display pane and its caller in the upper pane.

12. Click the caller lock_none() to select it.

13. Return to the Function List display in the main Analyzer window.
- Note that lock_none() is now the selected function.

14. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for lock_none().

15. Scroll down to display the code for the function that calls ComputeB(), which is cache_trash().

16. Compare the calls to ComputeA() and ComputeB():
- ComputeA() is called with a double in the thread's work block (&array->list[0]) as an argument, which can be read and written directly without danger of contention with other threads.
- ComputeB(), however, is called with a series of doubles that occupy successive words in memory (&element[array->index]). Whenever a thread writes to one of these addresses, any other thread that has that address in its cache must invalidate the data, which is now out of date. If one of the threads needs the data again later in the program, the data must be copied back into the data cache from memory, which is a time-consuming operation. The resulting cache misses--attempts to access data not available in the data cache--waste a lot of CPU time, which explains why ComputeB() uses three times as much user CPU time as ComputeA().

17. Return to the Function List display in the main Analyzer window. Click the line for ComputeB() to select it.

18. Click the Disassembly button in the lower tool bar.
- The text editor opens, displaying the annotated disassembly code for ComputeB(). You can see that most of the user CPU time (over 7 seconds) is being used by the faddd instruction, which is waiting for a register load that misses in the cache.

Collecting Data About mttest (One-CPU System)

For this part of the demonstration, you must run mttest on a single-CPU system. As before, Sun WorkShop must be installed on the machine so that you can run the Collector to collect performance data.

1. Start Sun WorkShop by typing:
% workshop

2. Click the Debugging button to open the Debugging window.

3. Load mttest into the Debugging window:
- Choose Debug New Program from the main Debugging menu.
- In the Debug New Program dialog box, enter the mttest path in the Name text box, or use the list box to navigate to mttest.
- Click OK.

4. From the Debugging window menu bar, choose Windows Sampling Collector to open the Sampling Collector window. Observe the following points:
- The Collector is enabled for one run only.
- The default path and file name for the experiment-record file appear in the Experiment File text box. Change the name of the experiment-record file to mttest.2.er.
- Under the Data to Collect check boxes, the only type of data selected for collection is the default, Clock-Based Profiling.
5. For this experiment, in addition to clock-based profiling, you are going to collect synchronization wait tracing information, so select the Synchronization Wait Tracing check box.

6. Click the Start button. mttest runs in the Debugging window, and the Collector collects clock-based profiling and synchronization wait tracing data and stores it in the experiment-record file, mttest.2.er.

Analyzing mttest Performance Metrics (One-CPU System)

Load mttest.2.er into the Analyzer:

The Analyzer window opens with the Function List for mttest displayed.

1. In the Function List display, scroll down to the data for lock_local() and lock_global().
- As on the four-CPU system, both functions have the same inclusive user CPU time (in this case about 10 seconds). This indicates that both functions are doing the same amount of work.
- However, total LWP time for lock_global() is actually less than for lock_local() (25 seconds versus 37 seconds). This difference could result from the way each locking scheme schedules the threads onto the CPU. The global lock set by lock_global() allows each thread to execute in sequence until it has run to completion: the first thread executes for 2.5 seconds and then exits, while the others remain in a wait state until it is through; the second thread, having waited 2.5 seconds for the first, then executes, for a total LWP time of 5 seconds; the third thread waits 5 seconds for the first two threads and then executes, for a total LWP time of 7.5 seconds; and the fourth thread, having waited 7.5 seconds, executes for a total LWP time of 10 seconds. The local lock used by lock_local(), on the other hand, schedules each thread onto the CPU for a fraction of its run and then repeats the process until all the threads have run to completion, so that all four threads remain in a wait state for almost three-quarters of the 10-second execution time.
- lock_global() still spends a lot of time in synchronization waiting (almost 15 seconds: 2.5 plus 5 plus 7.5), whereas lock_local() still spends no time in synchronization waiting. lock_local() applies a lock only to data in a particular thread's work block, so its work proceeds as on the four-CPU system, without contention or time wasted in synchronization waits. The synchronization wait time for lock_local() again is 0 (zero) seconds.

2. Return to the Function List display in the main Analyzer window, and scroll to the data for the functions ComputeA() and ComputeB().

3. Click the line of data for ComputeA() to select it.

4. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for ComputeA().

5. Scroll down in the text editor so that the source for both ComputeA() and ComputeB() is displayed.
- Observe that on the single-CPU system, inclusive user CPU time for ComputeA() and ComputeB() is virtually identical.

6. Return to the Function List display in the main Analyzer window, and click the line of data for ComputeA() to select it.

7. Click the Callers-Callees button in the lower tool bar.
- The Callers-Callees window opens, showing the selected function ComputeA() in the middle display pane and its caller in the pane above.

8. Click the caller cache_trash() to select it.

9. Return to the Function List display in the main Analyzer window.
- Note that cache_trash() is now the selected function.

10. Click the Source button in the lower Analyzer tool bar.
- The text editor opens, displaying the annotated source code for cache_trash().

11. Scroll up to display the code for the function that calls ComputeA(), which is lock_none().

12. Compare the calls to ComputeA() and ComputeB():
- ComputeA() is called with a double in the thread's work block as an argument (&array->list[0]), which can be read and written directly without danger of contention with other threads.
- ComputeB() is called with a series of doubles that occupy successive words in memory (&element[array->index]), accessed through the data cache. Recall that when multiple threads are running on multiple CPUs, a thread's write to one of these addresses forces any other thread that has that address in its cache to invalidate the data, which is now obsolete. If one of them needs the data again later in the program, it must be copied back into the data cache from memory, a time-consuming operation. The resulting cache misses waste a lot of CPU time.
- However, when only one thread is running, no other threads write to memory, so the running thread's cache data never becomes invalid. No cache misses or resulting copies from memory occur, so the performance of ComputeB() on a one-CPU machine is just as efficient as the performance of ComputeA().