CHAPTER 8

Profiling

An important component of performance tuning is profiling, through which you develop a picture of how well your code is running and what sorts of bottlenecks it might have. Profiling can be a difficult task even in the simplest of cases, and the complexities multiply with MPI programs because of their parallelism. Without profiling information, however, code optimization can be wasted effort.

This chapter describes:

This chapter includes a few case studies that examine some of the NAS Parallel Benchmarks 2.3. These are available from the NASA Ames Research Center at

http://www.nas.nasa.gov/Software/NPB/index.html


Note - The runs shown in this chapter were not optimized for the platforms on which they executed.




General Profiling Methodology

It is likely that only a few parts of a program account for most of its runtime. Profiling enables you to identify these "hot spots" and characterize their behavior. You can then focus your optimization efforts on the spots where they will have the most effect.

Profiling can be an experimental, exploratory procedure. So you might find yourself rerunning an experiment frequently. It is a challenge to design such runs so that they complete quickly, while still capturing the performance characteristics you are trying to study. There are several ways you can strip down your runs to achieve this balance, including reducing the data set and performing fewer loop iterations. However, regardless of which streamlining method you employ, keep the following caveats in mind:

Basic Approaches

There are various approaches to profiling Sun HPC ClusterTools programs:

Use the DTrace dynamic tracing utility. DTrace, a component of the Solaris 10 OS, is a comprehensive dynamic tracing facility that you can use to monitor the behavior of application programs as well as the operating system itself. For more information about DTrace, refer to the Solaris Dynamic Tracing Guide (Part Number 817-6223). This guide is part of the Solaris 10 OS Software Developer Collection.

Solaris 10 OS documentation can be found on the web at the following location:

http://www.sun.com/documentation
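
As a quick illustration, the DTrace pid provider can be attached to a running MPI process to count calls into the MPI library. The following one-liner is only a sketch: the process ID (12345) is a placeholder, and the libmpi* module pattern is an assumption about how the Sun MPI library is named on your system.

% dtrace -p 12345 -n 'pid$target:libmpi*:*MPI_*:entry { @[probefunc] = count(); }'

Press Control-C to detach; DTrace then prints a count of calls per function. As noted later in this chapter, calls may appear under their PMPI_ names.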

TABLE 8-1 outlines the advantages and disadvantages associated with each of these methods of profiling.


TABLE 8-1 Profiling Alternatives

MPProf

Advantages:
  • Very simple to use.
  • Generates a self-explanatory performance report.
  • Gives feedback in simple terms (such as which environment variables to set for better performance).

Disadvantages:
  • Simplicity.
  • Has no knowledge of user code.
  • Performance suggestions are fallible.

DTrace

Advantages:
  • Included with the Solaris 10 OS.
  • Can be used on live production systems to monitor behavior and track errors.
  • No need to recompile application code.
  • Can be attached to and detached from an MPI program multiple times without disturbing the run.
  • Use of a scripting language allows you to generate complex output.
  • The D scripting language has many built-in functions, so scripts are less complicated to write.

Disadvantages:
  • Need to learn another scripting language.
  • Very simple output processing (no GUI).
  • Very low-level tool.

Performance Analyzer Profiling

Advantages:
  • No recompilation or relinking is required.
  • Profiles whole programs: user computation and MPI message passing.
  • Identifies time-consuming routines.
  • With -g recompilation and relinking, gives attribution on a per-source-line basis with negligible loss in optimization level.
  • Shows caller-callee relationships.
  • Uses a style familiar to gprof users.
  • On UltraSPARC microprocessors, profiles based on hardware counters (floating-point operations, cache misses, and so forth).
  • Timeline functionality.

Disadvantages:
  • Has very limited knowledge of MPI or message passing.
  • Improper timeline support for multinode runs.

Timers

Advantages:
  • Very versatile.

Disadvantages:
  • Requires manual instrumentation.
  • Requires that you understand the code.

gprof

Advantages:
  • Familiar tool.
  • Provides an overview of user code.

Disadvantages:
  • Ignores time spent in MPI.

PMPI

Advantages:
  • You can instrument or modify MPI without modifying source.
  • Allows use of other profiling tools.

Disadvantages:
  • Profiles MPI usage only.
  • Requires integration effort.
The following are sample scenarios:

Using MPProf, you can easily see if your program is spending a lot of time in MPI. It will suggest environment variable values for you.

Using the Performance Analyzer, you can see which routines consume the most time.

Adding timers and other instrumentation around innermost loops might help you if you already have some idea about your code's performance.

If you already use other profiling tools, the MPI profiling interface might allow you to continue using them with Sun HPC ClusterTools programs, depending on how those tools were constructed.


MPProf Profiling Tool

Sun HPC ClusterTools software includes MPProf, a profiling tool for programs that use the Sun MPI library.

To collect profiling data on a Sun MPI program, set the MPI_PROFILE environment variable before running the program. To generate a report using this data, invoke the mpprof command-line utility. For example:


% setenv MPI_PROFILE 1
% mprun -np 16 a.out
% mpprof mpprof.index.rm.jid

If profiling has been enabled, Sun MPI creates a binary-format profiling data file for each MPI process. The MPProf utility digests the data from these numerous intermediate data files and generates statistical summaries in ASCII text. Wherever possible, MPProf adds direct tuning advice to the statistical summaries of MPI process behavior.

MPProf does not provide detailed information about the ordering of message-passing events or interprocess synchronization. That is, MPProf is not a trace-history viewer.



Note - To get information on user code as well as MPI message-passing activity, use Performance Analyzer.



Sample MPProf Output

Output from the MPProf report generator is self-explanatory and should require no further documentation. Nevertheless, sample output is included here for readers who do not have immediate access to this tool.

Overview

The first section (sample output shown in box below) is an overview that includes

Load Balance

The Load Balance section (sample output shown in box below) shows how the time spent in MPI was distributed over the processes. When a process spends a large fraction of its time in MPI, it is often because that process is waiting for another process. Thus, performance bottlenecks may lie in the processes that spend little time in MPI. Those processes, as well as the nodes they ran on, are identified in this section. Other tools, such as the Performance Analyzer, are needed to pinpoint bottlenecks further if load imbalances are apparent.


LOAD BALANCE
============
Data is being reported on 16 MPI processes. The following histogram
shows how these processes were distributed as a function of the
fraction of the time the processes spent  in MPI routines:
 
Number of MPI Processes
 
10-|
 9-|
 8-|
 7-|                                                 #
 6-|                                                 #
 5-|                                                ##
 4-|                                                ##
 3-|                                               ###
 2-|                                               ###
 1-|  #                                           ####
 0-+----+----+----+----+----+----+----+----+----+----+
 15.0 20.0 25.0 30.0 35.0 40.0 45.0 50.0 55.0 60.0 65.0
 
  Percentage time in MPI
 
Rank   Hostname   MPI Time
0       node0     17.39%
15      node0     61.04%
4       node0     62.03%
13      node0     62.08%
11      node0     62.12%
 
[...]

Sun MPI Environment Variables

As discussed in Chapter 7, Sun MPI environment variables are set by default to values that are appropriate in many cases, but runtime performance can in some cases be enhanced by further tuning.

In particular, Sun MPI uses internal shared buffers to implement two-sided message passing between processes that cannot directly access one another's address spaces. If these buffers become congested, performance can suffer. MPProf monitors these buffers, watches for congestion, and suggests environment variable values that trade higher memory usage for enhanced performance.

The report discusses its recommendations and then provides a summary (sample output shown in box below) that can be copied and pasted into a session window or run script. The code can then be run again to determine whether the recommendations yield better performance.


MPI ENVIRONMENT VARIABLES
=========================
[...]
 
SUGGESTION SUMMARY
==================
[...]
 
In the C shell, these environment variables may be set by the
following commands:
 
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
setenv MPI_SHM_SBPOOLSIZE 368640
setenv MPI_SPIN 1
 
In the Bourne or Korn shell, these environment variables may be
set by the following commands:
 
export MPI_POLLALL=0
export MPI_PROCBIND=1
export MPI_SHM_SBPOOLSIZE=368640
export MPI_SPIN=1

Breakdown by MPI Routine

The Breakdown by MPI Routine section of the report (sample output shown in box below) breaks down MPI usage by routine. Attributing bytes sent and received to MPI routines can be tricky because of nonblocking and collective operations. Note that:

Assumptions:

Time Dependence

Performance analysis tools are sometimes classified as profiling tools, which give an overall performance characterization of a program, or tracing tools, which give detailed time histories of events that occur during program execution.

While MPProf is a profiling tool, it includes some time-dependent information (a sample of the Time Dependence section of the MPProf report is shown below). This information is coarse but can prove useful in different ways:

In the following example, there is considerable broadcast and barrier activity before the program settles down to steady-state behavior, the performance of which is critical for long-running programs.


TIME DEPENDENCE
===============
Time periods may be missing if no MPI calls were made during the
period. Times for MPI calls that persist over multiple reporting
intervals will only be reported in a single interval; these
reported times may be greater than 100%.
 
period  MPI_Barrier  MPI_Bcast  MPI_Send  MPI_Waitall
 
 1         1.2%       6.9%      0.0%         0.0%
 5         0.0%       0.0%      0.0%       123.0%
 6         0.0%       0.0%      0.0%         4.2%
 7         0.0%     803.3%      0.1%         3.1%
16         0.0%     101.0%      3.9%         0.0%
17         0.0%       7.6%      0.0%         0.0%
18        32.5%       0.0%      0.5%        10.3%
19        18.9%       5.0%     52.4%         0.0%
20       191.9%       0.0%     22.6%        34.5%
21         0.0%       0.0%      0.0%         0.0%
22        10.9%       0.0%      0.0%         0.0%
23        16.7%       2.1%      0.2%         4.3%
24        21.7%       2.1%      0.3%         4.6%
25        21.0%       2.1%      0.3%         5.4%
26        21.1%       3.6%      0.3%         5.2%
27        20.1%       1.9%      0.4%         5.9%
28        20.6%       2.1%      0.3%         5.3%
29        20.4%       2.1%      0.3%         5.5%
30        17.9%       3.6%      0.2%         5.0%
31        22.6%       2.1%      0.3%         5.4%
 

Connections

A connection is a sender/receiver pair. MPProf shows how much point-to-point message traffic there is per connection, with one matrix showing numbers of messages and another showing numbers of bytes. Values are scaled.

A real application is likely to have various communication patterns in it, but the overall matrix may resemble some easy-to-recognize case. The following examples illustrate some simple patterns.


CONNECTIONS
===========
 
               sender
               0  1  2  3  4  5  6  7  8  9
receiver
 
            0  _ 99 99 99 99 99 99 99 99 99
            1  _  _  _  _  _  _  _  _  _  _
            2  _  _  _  _  _  _  _  _  _  _
            3  _  _  _  _  _  _  _  _  _  _
            4  _  _  _  _  _  _  _  _  _  _
            5  _  _  _  _  _  _  _  _  _  _
            6  _  _  _  _  _  _  _  _  _  _
            7  _  _  _  _  _  _  _  _  _  _
            8  _  _  _  _  _  _  _  _  _  _
            9  _  _  _  _  _  _  _  _  _  _


CONNECTIONS
===========
 
               _  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _

 



CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _  _
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
               _  _  _  _  _  _  _  _ 99  _

 



CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _ 99
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
              99  _  _  _  _  _  _  _ 99  _
 


CONNECTIONS
===========
 
              _9____99____________________9______
              9_9_____9____________________9_____
              _9_9_____9____________________9____
              __9_9_____9____________________9___
              ___9_9_____9____________________9__
              ____9_9_____9____________________9_
              9____9_______9____________________9
              9_______9____99____________________
              _9_____9_9_____9___________________
              __9_____9_9_____9__________________
              ___9_____9_9_____9_________________
              ____9_____9_9_____9________________
              _____9_____9_9_____9_______________
              ______99____9_______9______________
              _______9_______9____99_____________
              ________9_____9_9_____9____________
              _________9_____9_9_____9___________
              __________9_____9_9_____9__________
              ___________9_____9_9_____9_________
              ____________9_____9_9_____9________
              _____________99____9_______9_______
              ______________9_______9____99______
              _______________9_____9_9_____9_____
              ________________9_____9_9_____9____
              _________________9_____9_9_____9___
              __________________9_____9_9_____9__
              ___________________9_____9_9_____9_
              ____________________99____9_______9
              9____________________9_______9____9
              _9____________________9_____9_9____
              __9____________________9_____9_9___
              ___9____________________9_____9_9__
              ____9____________________9_____9_9_
              _____9____________________9_____9_9
              ______9____________________99____9_
 

Once the connectivity matrices have been displayed, MPProf prints the average message length:


The average length of point-to-point messages was 13916 bytes per message.

This message length can be compared to the product of some characteristic latency in your system and some characteristic bandwidth. For example, if "ping-pong" tests indicate an MPI latency of about 2 microseconds and bandwidth of about 500 Mbyte/sec, then the product is 1000 bytes. An average length of 13916 bytes suggests bandwidth is more important than latency to your application's performance.

Multithreaded Programs

MPProf profiling can be used with multithreaded MPI programs. If many threads per process are engaged in heavy MPI activity, however, MPProf data collection can slow the program measurably. MPProf aggregates data over all threads of a process.

The mpdump Utility

It is sometimes possible to extract more information from MPProf data files than the report generator prints out. You can use the mpdump utility to convert MPProf binary data files to an ASCII format and then process the data directly, perhaps with customized scripts. The format of the ASCII files is undocumented, but it is easy to interpret.
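
For example, a workflow of the following form might be used. The data file name and the exact mpdump invocation shown here are illustrative assumptions; consult the mpdump man page for the options supported by your release.

% mpdump mpprof.2.jid.node0 > rank2.txt
% grep -i send rank2.txt

The resulting ASCII file can then be fed to awk, Perl, or site-specific scripts for customized analysis.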

Managing Disk Files

MPProf writes relatively little data to disk during profiling runs. This means that it is relatively easy to use compared to tracing tools, the data volumes of which must be managed carefully.

Nevertheless, one file per process is generated and it is worth managing these files. Specifically, MPProf profiling generates one data file per process per run, as well as one index file per run that points to the data files. The data files, by default, are stored in /usr/tmp of the respective nodes where the processes ran.

Use

% mpprof -r -g directory mpprof.index.rm.jid

to collect data files from their respective nodes to an archival location if you want to save the files.

Use

% mpprof -r mpprof.index.rm.jid

to remove data files to clean up after a run.

Incorporating Environment Variable Suggestions

MPProf generates suggestions for tuning Sun MPI environment variables.

These suggestions are often helpful, but they do not guarantee performance improvement. Some experimentation may be necessary. If you incorporate MPProf's suggestions, the next profiling run may indicate further suggestions. Multiple reruns may be required before MPProf no longer has further recommendations, but most of the speedup, if any, is likely to result from the first round or two of experimentation.

Rerunning a program with MPProf's tuning suggestions incorporated can be automated. Here is an example.


CODE EXAMPLE 8-1 Sample MPProf Session, Rerunning a Program To Tune Performance
% cat my_mpprof_script.csh
#!/bin/csh
 
set LOGFILE = mpprofrun.logfile
 
# always run with this on 
setenv MPI_PROFILE 1
echo 
echo trial-0
echo setenv MPI_PROFILE 1
 
# iterate at most 10 times
@ iteration = 0
while ( $iteration < 10 )
 
  # run the job
  $* >& $LOGFILE
 
  # archive this run
 
mkdir trial-$iteration
mpprof -r -g trial-$iteration mpprof.index.* < /dev/null
mv $LOGFILE trial-$iteration/log
 
# go to the archive
cd trial-$iteration
 
# report the basics
mpprof mpprof.index.* < /dev/null > mpprof.output
grep "Total duration is" mpprof.output
grep "Fraction spent in" mpprof.output
 
# increment iteration count
@ iteration = $iteration + 1
 
# look at the environment variable suggestions
grep "setenv" mpprof.output | grep -v Suggestion > mpprof.envvars
set wclist = ( `wc -l mpprof.envvars` )
echo
 
# start reporting on next round
 
if ( $wclist[1] != 0 ) then
  echo trial-$iteration
  cat mpprof.envvars
  source mpprof.envvars
endif
 
# clean up
rm -f mpprof.envvars
cd ..
 
# convergence?
if ( $wclist[1] == 0 ) then
  break
endif
 
end
 
% my_mpprof_script.csh mprun -np 16 a.out
 
trial-0
setenv MPI_PROFILE 1
  Total duration is    23.702518 secs
  Fraction spent in MPI 21.9%
 
trial-1
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
unsetenv MPI_SHM_CPOOLSIZE
setenv MPI_SHM_SBPOOLSIZE 533504
setenv MPI_SPIN 1
  Total duration is    21.697837 secs
  Fraction spent in MPI 14.5%
 
trial-2
setenv MPI_SHM_NUMPOSTBOX 66
setenv MPI_SHM_SBPOOLSIZE 1064960
  Total duration is    19.901631 secs
  Fraction spent in MPI  5.7%
 
trial-3
setenv MPI_SHM_SBPOOLSIZE 1598464
  Total duration is    19.993672 secs
  Fraction spent in MPI  5.9%
 
trial-4
setenv MPI_SHM_SBPOOLSIZE 2129920
  Total duration is    19.946838 secs
  Fraction spent in MPI  5.8%
 
trial-5
setenv MPI_SHM_SBPOOLSIZE 2661376
  Total duration is    19.906594 secs
  Fraction spent in MPI  5.7%
 
 
% ls -d trial-*
 
trial-0  trial-1  trial-2  trial-3  trial-4  trial-5


Performance Analyzer Profiling of Sun MPI Programs

The Performance Analyzer offers a good first step to understanding the performance characteristics of a program on Sun systems. It combines ease of use with powerful functionality.

As with most profiling tools, there are two basic steps to using Sun's performance analysis tools. The first step is to use the Collector to collect performance data. The second step is to use the Performance Analyzer to examine results. For example, this procedure can be as simple as replacing this:

% mprun -np 16 a.out 3 5 341

with this:

% mprun -np 16 collect a.out 3 5 341
% analyzer test.*.er

Sun's compilers and tools are usually located in the directory /opt/SUNWspro/bin. Check with your system administrator for details for your site.
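
If these tools are not already on your search path, you can add the directory in the C shell. The path below is the usual default and may differ at your site.

% set path = ( /opt/SUNWspro/bin $path )
% which collect analyzer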

The following sections show the use of the Performance Analyzer with Sun MPI programs, often revisiting variations of the preceding example.

Data Collection

There are several ways of using the Collector with MPI codes. For example, the simplest usage would be to add collect to the mprun command line:

% mprun -np 16 collect a.out 3 5 341

In this section we discuss some of the issues that can arise when using the Collector, and how to handle them.

Data Volume

The volume of collected data can grow large, especially for long-running or parallel programs. Though the Collector mitigates this problem, scaling to large data volumes remains an issue.

There are a number of useful strategies for managing these data volumes:

For example, to reduce the number of profiled events roughly by a factor of two, use:

% mprun -np 16 collect -p 20 a.out 3 5 341

This strategy should be a routine part of data collection.

A large parallel job might run out of a central NFS-mounted file system. While this might be adequate for jobs that are not I/O intensive, it can become a critical bottleneck for profiling sessions. If multiple MPI processes write large volumes of profiling data simultaneously over NFS to a single file system, contention for that file system, along with network congestion, can lead to tremendous slowdowns and perturbations of the program's performance. It is preferable to collect profiling data to local file systems and, perhaps, gather the data to a central directory after program execution.

To identify local file systems, use:

% /usr/bin/df -lk

on each node of the cluster you will use, or ask your system administrator about large-volume, high-performance disk space.

One possible choice of a local file system is /tmp. Note that /tmp on each node of a cluster refers to a different, node-local file system. Also, the /tmp file system might not be very large; if it fills up, general system operability can be seriously affected.
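
For example, you might collect into a node-local directory and gather the experiments to a central location afterward. The directory names below are illustrative; the collect -d option and the er_mv utility are described elsewhere in this chapter.

% mkdir /tmp/myrun
% mprun -np 16 collect -d /tmp/myrun a.out 3 5 341
% er_mv /tmp/myrun/*.er /net/fileserver/export/profiles/myrun

The /tmp/myrun directory must exist on every node where processes will run, and the er_mv step must be repeated on each node that holds experiment data.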

Data Organization

The Collector generates one "experiment" per MPI process. If there are multiple runs of a multiprocess job, therefore, the number of experiments can grow quickly.

To organize these experiments, it often makes sense to gather experiments from a run into a distinctive subdirectory. Use the commands er_mv, er_rm, and er_cp (again, typically under the directory /opt/SUNWspro/bin) to move, remove, or copy experiments, respectively. For more information on these utilities, refer to the corresponding man pages or the Sun Studio Compiler Collection documentation.

If you collect an experiment directly into a directory, make sure that the directory has already been created and, ideally, that no other experiments already exist in it.

Example


% mkdir run1
% er_rm -f run1/*.er
% mprun -np 16 collect -d run1 a.out 3 5 341
% mkdir run2
% er_rm -f run2/*.er
% mprun -np 16 collect -d run2 a.out 3 5 341
% mkdir run3
% er_rm -f run3/*.er
% mprun -np 16 collect -d run3 a.out 3 5 341
%

The er_rm steps are not required because (in this instance) we are using freshly created directories. Nevertheless, these steps serve as reminders to avoid the confusion that can result when too many experiments are gathered in the same directory.

Other Data Collection Issues

The Collector collects sampling information by default. Thus, Sun MPI calls will be profiled if they consume a sufficiently high fraction of the runtime. Sometimes it is interesting to capture MPI calls more methodically. In particular, for timeline analysis, it is desirable to mark the beginnings and endings of MPI calls with higher resolution than sampling provides. MPI call tracing can be activated with the -m on switch to collect.
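
For example, MPI call tracing can be added to the earlier data-collection command line:

% mprun -np 16 collect -m on a.out 3 5 341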

Sometimes it is desirable to collect data not only on a process, but on its descendants as well. The collect command will follow descendants when you specify the -F on switch.

For example, the Solaris 10 Operating System supports multiple page sizes with the /usr/bin/ppgsz command-line utility. To collect performance data on a code run with 4-Mbyte memory pages, our sample data collection becomes:

% mprun -np 16 collect -F on ppgsz -o heap=4M,stack=512K a.out 3 5 341

In the particular case of the ppgsz utility, a cleaner alternative may be to preload the mpss.so.1 shared object, so that the ppgsz utility need not appear on the command line. For more information on multiple page size support, see the ppgsz and mpss.so.1 man pages in the Solaris 10 Operating System.
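
For example, a preload-based run might look like the following in the C shell. This is only a sketch: the MPSSHEAP and MPSSSTACK environment variables are those described in the mpss.so.1 man page, and you should verify the names and values against your system's documentation.

% setenv LD_PRELOAD mpss.so.1
% setenv MPSSHEAP 4M
% setenv MPSSSTACK 512K
% mprun -np 16 collect a.out 3 5 341
% unsetenv LD_PRELOAD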

Analyzing Profiling Data

Once data has been gathered with the Collector, it can be viewed with the Performance Analyzer. For example:

% analyzer myrun/test.*.er

Use of the Performance Analyzer is illustrated in the case study.

There are also command-line utilities to aid in data analysis. For more information, see the er_print and er_src man pages. Here are examples of their use:

% er_print -functions test.*.er

% er_print -functions test.*.er | grep PMPI_

% er_print -callers-callees test.*.er

% er_print -source myfunction_ 1 test.*.er

The 1 distinguishes among multiple instances of the named function.

% er_src my_file.o

Case Study

In this case study, we examine the NPB 2.3 BT benchmark. We run the program using the following environment variable settings:


% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1 
% setenv MPI_POLLALL 0 

These settings are not required for Performance Analyzer profiling. We simply use them to profile our code as it would run in production. See Appendix C for more information about using Sun MPI environment variables for high performance. Also refer to the Sun MPI Programming and Reference Guide.

The job is run on a single, shared-memory node using 25 processes.
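
For reference, the profiling data for such a run could be collected with a command line of the following form. The executable name is an illustrative assumption, following the usual NPB naming convention of benchmark.class.nprocs.

% mprun -np 25 collect bt.A.25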

FIGURE 8-1 shows the Performance Analyzer's first view of the resulting profiling data. This default view shows how time is spent in different functions. Both exclusive and inclusive user CPU times are shown for each function, excluding and including, respectively, time spent in any functions it calls. The top line shows that a total of 3362.710 seconds, over all 25 processes, are profiled.

We see that the functions LHSX(), LHSY(), and LHSZ() account for 523.410 + 467.960 + 444.350 = 1435.72 seconds of that time.


FIGURE 8-1 Performance Analyzer--Main View

Graphic image of the Performance Analyzer -- main view


Fortran programmers will note that the term function is used in the C sense, to include all subprograms whether or not they return values. Further, function names are those generated by the Fortran compiler: by default, they are converted to lowercase and have an underscore appended. For example, the subroutine LHSX appears in the profile as lhsx_.

In FIGURE 8-2, we see how this overview appears when MPI tracing is turned on with the Collector's -m on switch. Note that accounting for bytes sent and received is slightly different from the way accounting is handled by MPProf. In particular, for bytes received, the size of the receive buffer is used rather than the actual number of bytes received. For more information on how bytes are counted for MPI calls, refer to the Performance Analyzer documentation volume Program Performance Analysis Tools.


FIGURE 8-2 Performance Analyzer--Main View With Tracing Enabled

Graphic image of the Performance Analyzer -- main view with tracing enabled.


We can see how time is spent within a subroutine if the code was compiled and linked with the -g switch. This switch introduces minimal impact on optimization and parallelization, and it can be employed rather freely by performance-oriented users. When we click on the Source tab, the Performance Analyzer brings up a text editor for the highlighted function. The choice of text editor can be changed under the Options menu with the Text Editor Options selection. The displayed, annotated source code includes the selected metrics, the user source code, and compiler commentary. A small fragment is shown in FIGURE 8-3. In particular, notice that the Performance Analyzer highlights hot (time-consuming) lines of code. Only a small fragment is shown since, in practice, the annotated source can become rather wide.


FIGURE 8-3 Performance Analyzer--Source View

Graphic image of the Performance Analyzer -- source view


To get a better idea of where time is spent at a high level in the code, you can also click on the Callers-Callees tab. FIGURE 8-4 shows a possible Callers-Callees view. The selected function appears in the middle section, its callers appear above it, and its callees below. Selected metrics are shown for all displayed functions. By clicking on callers, you can see where, at a high level in the source code, the time spent in the selected function is incurred. By clicking on callees, you can find more detail on expensive calls a particular function might be making. This sort of analysis is probably familiar to gprof users, but the Performance Analyzer has features that go beyond some of gprof's limitations. For more information about gprof, see Using the gprof Utility.


FIGURE 8-4 Performance Analyzer--Callers-Callees View

Graphic image of the Performance Analyzer -- callers-callees view.


Different metrics can be selected for the various displays by clicking on the Metrics button, as seen at the bottom of FIGURE 8-1. You can choose which metrics are shown, the order in which they are shown, and which metric should be used for sorting.

Click on the Timeline tab to see a view similar to the one shown in FIGURE 8-5. Time is shown across the horizontal axis, while experiments appear on the vertical axis. Experiments are numbered sequentially, starting from 1. For example, MPI rank 7 might appear as experiment 8. The timeline view helps you see synchronization points, load imbalances, and other performance characteristics of your parallel MPI program.


FIGURE 8-5 Performance Analyzer--Timeline View

Graphic image of the Performance Analyzer -- timeline view


In FIGURE 8-6, another timeline view is shown. Click on an event to see its callstack in the lower right corner. When you zoom in to the time scale of the clock sampling (10 milliseconds by default), clock-sampled events appear discretely on the view. Using MPI tracing helps maintain resolution even to the highest levels of zoom.


FIGURE 8-6 Performance Analyzer--Timeline View With Callstack

Graphic image of the Performance Analyzer -- timeline view with callstack.


Because clocks on different nodes of a cluster are not guaranteed to be synchronized, and because the Performance Analyzer makes no attempt to synchronize time stamps from different nodes, timeline views of Sun MPI runs that were distributed over multiple nodes in a cluster are not guaranteed to be displayed properly.

Overview of Functions

In a profile, you will typically find many unfamiliar functions that do not appear explicitly in your code. Further, it can happen that none of the familiar MPI calls you do use will appear, such as MPI_Isend(), MPI_Irecv(), MPI_Wait(), or MPI_Waitall().

FIGURE 8-7 shows examples of functions you might find in your profiles, along with explanations of what you are seeing. The functions are organized by load object, such as an executable (here, a.out) or a dynamic library.


FIGURE 8-7 Examples of Functions That Might Appear in Profiles

Table of examples of functions that might appear in profiles


Note that:

For example, a C call to MPI_Send() will enter the function PMPI_Send.

Sun MPI uses a number of internal routines, which will appear in profiles.

One way to get an overview of which MPI calls are being made, and which are most important, is to look for the PMPI entry points into the MPI library. For our case study example:


% er_print -functions proc-0.er | grep PMPI_
 0.050  16.980   PMPI_Wait
 0.030   0.810   PMPI_Isend
 0.030  16.930   PMPI_Wait
 0.020   0.490   PMPI_Irecv
 0.      0.      PMPI_Finalize
 0.      0.150   PMPI_Init
 0.      1.630   PMPI_Waitall
 0.      1.630   PMPI_Waitall
%


In this example, roughly 20 seconds out of 146.93 seconds profiled are due to MPI calls. The exclusive times are typically small and meaningless. Synchronizing calls, such as PMPI_Wait and PMPI_Waitall, appear twice, once due to libmpi and once to libcollector. Such duplication can be ignored.

If ever there is a question about what role an unfamiliar (but possibly time-consuming) function is playing, within the Performance Analyzer you can:

MPI Wait Times

Time might be spent in MPI calls for a wide variety of reasons. Specifically, much of that time might be spent waiting for some condition (a message to arrive, internal buffering to become available, and so on), rather than in moving data.

While an MPI process waits, it might or might not tie up a CPU. Nevertheless, such wait time probably has a cost in terms of program performance.

Your options for ensuring that wait time is profiled include:
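
One such option, consistent with the environment variables used in the case study earlier in this chapter, is to make waiting processes spin rather than yield the CPU, so that wait time appears as user CPU time in clock-based profiles. This is only a sketch of one possibility; see Appendix C and the Sun MPI Programming and Reference Guide before adopting these settings.

% setenv MPI_SPIN 1
% mprun -np 16 collect a.out 3 5 341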


Other Profiling Approaches

Both Sun MPI and the Solaris OS environment offer useful profiling facilities. Using the MPI profiling interface, you can investigate MPI calls. Using your own timer calls, you can profile specific behaviors. Using the Solaris gprof utility, you can profile diverse multiprocess codes, including those using MPI.

Using the MPI Profiling Interface

The MPI standard supports a profiling interface, which allows any user to profile either individual MPI calls or the entire library. This interface supports two equivalent APIs for each MPI routine. One has the prefix MPI_, while the other has PMPI_. User codes typically call the MPI_ routines. A profiling routine or library will typically provide wrappers for the MPI_ routines that simply call the PMPI_ ones, with timer calls around the PMPI_ call.

You can use this interface to change the behavior of MPI routines without modifying your source code. For example, suppose you believe that most of the time spent in some collective call, such as MPI_Allreduce, is due to the synchronization of the processes that is implicit in such a call. Then, you can compile a wrapper such as the one that follows, and link it into your code before the -lmpi switch. The effect is that time attributed to MPI_Allreduce will be due exclusively to the reduction operation itself, with synchronization costs attributed to the barrier operation.


subroutine MPI_Allreduce(x,y,n,type,op,comm,ier)
integer x(*), y(*), n, type, op, comm, ier
call PMPI_Barrier(comm,ier)
call PMPI_Allreduce(x,y,n,type,op,comm,ier)
end 
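
To use such a wrapper, compile it and place the resulting object file ahead of the MPI library on the link line. The commands below are a sketch that assumes the mpf90 compiler driver and illustrative file names; the essential point is only that the wrapper object precedes -lmpi.

% mpf90 -c allreduce_wrap.f
% mpf90 -o a.out my_code.f allreduce_wrap.o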

Profiling wrappers or libraries can be used even with application binaries that have already been linked. Refer to the Solaris man page for ld for more information about the environment variable LD_PRELOAD.

You can get profiling libraries from independent sources for use with Sun MPI. An example of a profiling library is included in the multiprocessing environment (MPE) from Argonne National Laboratory. Several external profiling tools can be made to work with Sun MPI using this mechanism. For more information on this library and on the MPI profiling interface, refer to the Sun MPI Programming and Reference Guide.

Inserting MPI Timer Calls

Sun HPC ClusterTools implements the MPI timer call MPI_Wtime (demonstrated in the example that follows) with the high-resolution timer gethrtime. If you use MPI_Wtime calls, you should use them to measure sections that last more than several microseconds. Times on different processes are not guaranteed to be synchronized. For information about the gethrtime timer, see the gethrtime(3C) man page.

When profiling multiprocess codes, ensure that the timings are not distorted by the asynchrony of the various processes. For this purpose, you usually need to synchronize the processes before starting and before stopping the timer.

In the following example, unless a barrier is placed as indicated, most processes might accumulate time in the interesting, timed portion while waiting for process 0 (zero) to emerge from uninteresting initialization, which would skew your program's timings:


CALL MPI_COMM_RANK(MPI_COMM_WORLD,ME,IER)
IF ( ME .EQ. 0 ) THEN 
    initialization
END IF
! place barrier here
! CALL MPI_BARRIER(MPI_COMM_WORLD, IER)
T_START = MPI_WTIME()
    timed portion
T_END = MPI_WTIME() 

When stopping a timer, remember that measurements of elapsed time will differ on different processes. So, either execute another barrier before the "stop" timer, or report the maximum elapsed time over all processes.

Avoid timing very small fragments of code. This is good advice even for uniprocessor codes, and the consequences are greater with many processes. Code fragments perform differently when timed in isolation, and the introduction of barrier calls for timing purposes can be disruptive for short intervals.

Using the gprof Utility

The Solaris utility gprof can be used for multiprocess codes, such as those that use MPI. Several points should be noted:

Note, however, that the gprof utility has several limitations.

Note that the Performance Analyzer is simple to use, provides the profiling information that gprof does, offers additional functionality, and avoids these pitfalls. Thus, gprof users are strongly encouraged to migrate to the Performance Analyzer for both MPI and non-MPI codes.

For more information about the gprof utility, refer to the gprof man page. For more information about the Performance Analyzer, refer to the documentation supplied with the Sun Studio Compiler Collection.