CHAPTER 8

Profiling

An important component of performance tuning is profiling, through which you develop a picture of how well your code is running and what sorts of bottlenecks it might have. Profiling can be a difficult task even in the simplest of cases, and the complexities multiply with MPI programs because of their parallelism. Without profiling information, however, code optimization can be wasted effort.

This chapter describes:

This chapter includes a few case studies that examine some of the NAS Parallel Benchmarks 2.3. These are available from the NASA Ames Research Center at

http://www.nas.nasa.gov/Software/NPB/index.html


Note - The runs shown in this chapter were not optimized for the platforms on which they executed.




General Profiling Methodology

It is likely that only a few parts of a program account for most of its runtime. Profiling enables you to identify these "hot spots" and characterize their behavior. You can then focus your optimization efforts on the spots where they will have the most effect.

Profiling can be an experimental, exploratory procedure. So you might find yourself rerunning an experiment frequently. It is a challenge to design such runs so that they complete quickly, while still capturing the performance characteristics you are trying to study. There are several ways you can strip down your runs to achieve this balance, including reducing the data set and performing fewer loop iterations. However, regardless of which streamlining method you employ, keep the following caveats in mind:

Basic Approaches

There are various approaches to profiling Sun HPC ClusterTools programs:

Use the DTrace dynamic tracing utility. DTrace, a component of the Solaris 10 OS, is a comprehensive dynamic tracing facility that you can use to monitor the behavior of application programs as well as the operating system itself. For more information about DTrace, refer to the Solaris Dynamic Tracing Guide (Part Number 817-6223). This guide is part of the Solaris 10 OS Software Developer Collection.

Solaris 10 OS documentation can be found on the web at the following location:

http://www.sun.com/documentation
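
As a quick illustration, the DTrace pid provider can be attached to a running MPI process to count calls into the MPI library. The following one-liner is only a sketch: the process ID (12345) is a placeholder, and the libmpi* module pattern is an assumption about how the Sun MPI library is named on your system.

% dtrace -p 12345 -n 'pid$target:libmpi*:*MPI_*:entry { @[probefunc] = count(); }'

Press Control-C to detach; DTrace then prints a count of calls per function. As noted later in this chapter, calls may appear under their PMPI_ names.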

TABLE 8-1 outlines the advantages and disadvantages associated with each of these methods of profiling.


TABLE 8-1 Profiling Alternatives

MPProf

Advantages:
  • Very simple to use.
  • Generates a self-explanatory performance report.
  • Gives feedback in simple terms (such as which environment variables to set for better performance).

Disadvantages:
  • Simplicity.
  • Has no knowledge of user code.
  • Performance suggestions are fallible.

DTrace

Advantages:
  • Included with the Solaris 10 OS.
  • Can be used on live production systems to monitor behavior and track errors.
  • No need to recompile application code.
  • Can be attached to and detached from an MPI program multiple times without disturbing the run.
  • Use of a scripting language allows you to generate complex output.
  • The D scripting language has many built-in functions, so scripts are less complicated to write.

Disadvantages:
  • Need to learn another scripting language.
  • Very simple output processing (no GUI).
  • Very low-level tool.

Performance Analyzer Profiling

Advantages:
  • No recompilation or relinking is required.
  • Profiles whole programs: user computation and MPI message passing.
  • Identifies time-consuming routines.
  • With -g recompilation and relinking, gives attribution on a per-source-line basis with negligible loss in optimization level.
  • Shows caller-callee relationships.
  • Uses a style familiar to gprof users.
  • On UltraSPARC microprocessors, profiles based on hardware counters (floating-point operations, cache misses, and so forth).
  • Timeline functionality.

Disadvantages:
  • Has very limited knowledge of MPI or message passing.
  • Improper timeline support for multinode runs.

Timers

Advantages:
  • Very versatile.

Disadvantages:
  • Requires manual instrumentation.
  • Requires that you understand the code.

gprof

Advantages:
  • Familiar tool.
  • Provides an overview of user code.

Disadvantages:
  • Ignores time spent in MPI.

PMPI

Advantages:
  • You can instrument or modify MPI without modifying source.
  • Allows use of other profiling tools.

Disadvantages:
  • Profiles MPI usage only.
  • Requires integration effort.
The following are sample scenarios:

Using MPProf, you can easily see if your program is spending a lot of time in MPI. It will suggest environment variable values for you.

Using the Performance Analyzer, you can see which routines consume the most time.

Adding timers and other instrumentation around innermost loops might help you if you already have some idea about your code's performance.

If you already use other profiling tools, the MPI profiling interface might allow you to continue using them with Sun HPC ClusterTools programs, depending on how those tools were constructed.


MPProf Profiling Tool

Sun HPC ClusterTools software includes MPProf, a profiling tool for programs that use the Sun MPI library.

To collect profiling data on a Sun MPI program, set the MPI_PROFILE environment variable before running the program. To generate a report using this data, invoke the mpprof command-line utility. For example:


% setenv MPI_PROFILE 1
% mprun -np 16 a.out
% mpprof mpprof.index.rm.jid

If profiling has been enabled, Sun MPI creates a binary-format profiling data file for each MPI process. The MPProf utility digests the data from these numerous intermediate data files and generates statistical summaries in ASCII text. Wherever possible, MPProf adds direct tuning advice to the statistical summaries of MPI process behavior.

MPProf does not provide detailed information about the ordering of message-passing events or interprocess synchronization. That is, MPProf is not a trace-history viewer.



Note - To get information on user code as well as MPI message-passing activity, use Performance Analyzer.



Sample MPProf Output

Output from the MPProf report generator is self-explanatory and should require no further documentation. Nevertheless, sample output is included here for readers who do not have immediate access to this tool.

Overview

The first section (sample output shown in box below) is an overview that includes

Load Balance

The Load Balance section (sample output shown in box below) shows how the time spent in MPI was distributed over the processes. When a process spends a large fraction of its time in MPI, it is often because that process is waiting for another process. Thus, performance bottlenecks may lie in the processes that spend little time in MPI. Those processes, as well as the nodes they ran on, are identified in this section. Other tools, such as the Performance Analyzer, are needed to pinpoint bottlenecks further if load imbalances are apparent.


LOAD BALANCE
============
Data is being reported on 16 MPI processes. The following histogram
shows how these processes were distributed as a function of the
fraction of the time the processes spent  in MPI routines:
 
Number of MPI Processes
 
10-|
 9-|
 8-|
 7-|                                                 #
 6-|                                                 #
 5-|                                                ##
 4-|                                                ##
 3-|                                               ###
 2-|                                               ###
 1-|  #                                           ####
 0-+----+----+----+----+----+----+----+----+----+----+
 15.0 20.0 25.0 30.0 35.0 40.0 45.0 50.0 55.0 60.0 65.0
 
  Percentage time in MPI
 
Rank   Hostname   MPI Time
0       node0     17.39%
15      node0     61.04%
4       node0     62.03%
13      node0     62.08%
11      node0     62.12%
 
[...]

Sun MPI Environment Variables

As discussed in Chapter 7, Sun MPI environment variables are set by default to values that are appropriate in many cases, but runtime performance can in some cases be enhanced by further tuning.

In particular, Sun MPI uses internal shared buffers to implement two-sided message passing between processes that cannot directly access one another's address spaces. If these buffers become congested, performance can suffer. MPProf monitors these buffers, watches for congestion, and suggests environment variable values that trade higher memory usage for enhanced performance.

The report discusses its recommendations and then provides a summary (sample output shown in box below) that can be copied and pasted into a session window or run script. The code can then be run again to determine whether the recommendations yield better performance.


MPI ENVIRONMENT VARIABLES
=========================
[...]
 
SUGGESTION SUMMARY
==================
[...]
 
In the C shell, these environment variables may be set by the
following commands:
 
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
setenv MPI_SHM_SBPOOLSIZE 368640
setenv MPI_SPIN 1
 
In the Bourne or Korn shell, these environment variables may be
set by the following commands:
 
export MPI_POLLALL=0
export MPI_PROCBIND=1
export MPI_SHM_SBPOOLSIZE=368640
export MPI_SPIN=1

Breakdown by MPI Routine

The Breakdown by MPI Routine section of the report (sample output shown in box below) breaks down MPI usage by routine. Attributing bytes sent and received to MPI routines can be tricky because of nonblocking and collective operations. Note that:

Assumptions:

Time Dependence

Performance analysis tools are sometimes classified as profiling tools, which give an overall performance characterization of a program, or tracing tools, which give detailed time histories of events that occur during program execution.

While MPProf is a profiling tool, it includes some time-dependent information (a sample of the Time Dependence section of the MPProf report is shown below). This information is coarse but can prove useful in different ways:

In the following example, there is considerable broadcast and barrier activity before the program settles down to steady-state behavior, the performance of which is critical for long-running programs.


TIME DEPENDENCE
===============
Time periods may be missing if no MPI calls were made during the
period. Times for MPI calls that persist over multiple reporting
intervals will only be reported in a single interval; these
reported times may be greater than 100%.
 
period  MPI_Barrier  MPI_Bcast  MPI_Send  MPI_Waitall
 
 1         1.2%       6.9%      0.0%         0.0%
 5         0.0%       0.0%      0.0%       123.0%
 6         0.0%       0.0%      0.0%         4.2%
 7         0.0%     803.3%      0.1%         3.1%
16         0.0%     101.0%      3.9%         0.0%
17         0.0%       7.6%      0.0%         0.0%
18        32.5%       0.0%      0.5%        10.3%
19        18.9%       5.0%     52.4%         0.0%
20       191.9%       0.0%     22.6%        34.5%
21         0.0%       0.0%      0.0%         0.0%
22        10.9%       0.0%      0.0%         0.0%
23        16.7%       2.1%      0.2%         4.3%
24        21.7%       2.1%      0.3%         4.6%
25        21.0%       2.1%      0.3%         5.4%
26        21.1%       3.6%      0.3%         5.2%
27        20.1%       1.9%      0.4%         5.9%
28        20.6%       2.1%      0.3%         5.3%
29        20.4%       2.1%      0.3%         5.5%
30        17.9%       3.6%      0.2%         5.0%
31        22.6%       2.1%      0.3%         5.4%
 

Connections

A connection is a sender/receiver pair. MPProf shows how much point-to-point message traffic there is per connection, with one matrix showing numbers of messages and another showing numbers of bytes. Values are scaled.

A real application is likely to have various communication patterns in it, but the overall matrix may resemble some easy-to-recognize case. The following examples illustrate some simple patterns.


CONNECTIONS
===========
 
               sender
               0  1  2  3  4  5  6  7  8  9
receiver
 
            0  _ 99 99 99 99 99 99 99 99 99
            1  _  _  _  _  _  _  _  _  _  _
            2  _  _  _  _  _  _  _  _  _  _
            3  _  _  _  _  _  _  _  _  _  _
            4  _  _  _  _  _  _  _  _  _  _
            5  _  _  _  _  _  _  _  _  _  _
            6  _  _  _  _  _  _  _  _  _  _
            7  _  _  _  _  _  _  _  _  _  _
            8  _  _  _  _  _  _  _  _  _  _
            9  _  _  _  _  _  _  _  _  _  _


CONNECTIONS
===========
 
               _  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _
              99  _  _  _  _  _  _  _  _  _

 



CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _  _
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
               _  _  _  _  _  _  _  _ 99  _

 



CONNECTIONS
===========
 
               _ 99  _  _  _  _  _  _  _ 99
              99  _ 99  _  _  _  _  _  _  _
               _ 99  _ 99  _  _  _  _  _  _
               _  _ 99  _ 99  _  _  _  _  _
               _  _  _ 99  _ 99  _  _  _  _
               _  _  _  _ 99  _ 99  _  _  _
               _  _  _  _  _ 99  _ 99  _  _
               _  _  _  _  _  _ 99  _ 99  _
               _  _  _  _  _  _  _ 99  _ 99
              99  _  _  _  _  _  _  _ 99  _
 


CONNECTIONS
===========
 
              _9____99____________________9______
              9_9_____9____________________9_____
              _9_9_____9____________________9____
              __9_9_____9____________________9___
              ___9_9_____9____________________9__
              ____9_9_____9____________________9_
              9____9_______9____________________9
              9_______9____99____________________
              _9_____9_9_____9___________________
              __9_____9_9_____9__________________
              ___9_____9_9_____9_________________
              ____9_____9_9_____9________________
              _____9_____9_9_____9_______________
              ______99____9_______9______________
              _______9_______9____99_____________
              ________9_____9_9_____9____________
              _________9_____9_9_____9___________
              __________9_____9_9_____9__________
              ___________9_____9_9_____9_________
              ____________9_____9_9_____9________
              _____________99____9_______9_______
              ______________9_______9____99______
              _______________9_____9_9_____9_____
              ________________9_____9_9_____9____
              _________________9_____9_9_____9___
              __________________9_____9_9_____9__
              ___________________9_____9_9_____9_
              ____________________99____9_______9
              9____________________9_______9____9
              _9____________________9_____9_9____
              __9____________________9_____9_9___
              ___9____________________9_____9_9__
              ____9____________________9_____9_9_
              _____9____________________9_____9_9
              ______9____________________99____9_
 

Once the connectivity matrices have been displayed, MPProf prints the average message length:


The average length of point-to-point messages was 13916 bytes per message.

This message length can be compared to the product of some characteristic latency in your system and some characteristic bandwidth. For example, if "ping-pong" tests indicate an MPI latency of about 2 microseconds and bandwidth of about 500 Mbyte/sec, then the product is 1000 bytes. An average length of 13916 bytes suggests bandwidth is more important than latency to your application's performance.

Multithreaded Programs

MPProf profiling can be used with multithreaded MPI programs. If many threads per process are engaged in heavy MPI activity, however, MPProf data collection can slow the program measurably. MPProf aggregates data over all threads of a process.

The mpdump Utility

It is sometimes possible to extract more information from MPProf data files than the report generator prints out. You can use the mpdump utility to convert MPProf binary data files to an ASCII format and then process the data directly, perhaps with customized scripts. The format of the ASCII files is undocumented, but it is easy to interpret.
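
For example, a workflow of the following form might be used. The data file name and the exact mpdump invocation shown here are illustrative assumptions; consult the mpdump man page for the options supported by your release.

% mpdump mpprof.2.jid.node0 > rank2.txt
% grep -i send rank2.txt

The resulting ASCII file can then be fed to awk, Perl, or site-specific scripts for customized analysis.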

Managing Disk Files

MPProf writes relatively little data to disk during profiling runs. This means that it is relatively easy to use compared to tracing tools, the data volumes of which must be managed carefully.

Nevertheless, one file per process is generated and it is worth managing these files. Specifically, MPProf profiling generates one data file per process per run, as well as one index file per run that points to the data files. The data files, by default, are stored in /usr/tmp of the respective nodes where the processes ran.

Use

% mpprof -r -g directory mpprof.index.rm.jid

to collect data files from their respective nodes to an archival location if you want to save the files.

Use

% mpprof -r mpprof.index.rm.jid

to remove data files to clean up after a run.

Incorporating Environment Variable Suggestions

MPProf generates suggestions for tuning Sun MPI environment variables.

These suggestions are often helpful, but they do not guarantee performance improvement. Some experimentation may be necessary. If you incorporate MPProf's suggestions, the next profiling run may indicate further suggestions. Multiple reruns may be required before MPProf no longer has further recommendations, but most of the speedup, if any, is likely to result from the first round or two of experimentation.

Rerunning a program with MPProf's tuning suggestions incorporated can be automated. Here is an example.


CODE EXAMPLE 8-1 Sample MPProf Session, Rerunning a Program To Tune Performance
% cat my_mpprof_script.csh
#!/bin/csh
 
set LOGFILE = mpprofrun.logfile
 
# always run with this on 
setenv MPI_PROFILE 1
echo 
echo trial-0
echo setenv MPI_PROFILE 1
 
# iterate at most 10 times
@ iteration = 0
while ( $iteration < 10 )
 
  # run the job
  $* >& $LOGFILE
 
  # archive this run
 
mkdir trial-$iteration
mpprof -r -g trial-$iteration mpprof.index.* < /dev/null
mv $LOGFILE trial-$iteration/log
 
# go to the archive
cd trial-$iteration
 
# report the basics
mpprof mpprof.index.* < /dev/null > mpprof.output
grep "Total duration is" mpprof.output
grep "Fraction spent in" mpprof.output
 
# increment iteration count
@ iteration = $iteration + 1
 
# look at the environment variable suggestions
grep "setenv" mpprof.output | grep -v Suggestion > mpprof.envvars
set wclist = ( `wc -l mpprof.envvars` )
echo
 
# start reporting on next round
 
if ( $wclist[1] != 0 ) then
  echo trial-$iteration
  cat mpprof.envvars
  source mpprof.envvars
endif
 
# clean up
rm -f mpprof.envvars
cd ..
 
# convergence?
if ( $wclist[1] == 0 ) then
  break
endif
 
end
 
% my_mpprof_script.csh mprun -np 16 a.out
 
trial-0
setenv MPI_PROFILE 1
  Total duration is    23.702518 secs
  Fraction spent in MPI 21.9%
 
trial-1
setenv MPI_POLLALL 0
setenv MPI_PROCBIND 1
unsetenv MPI_SHM_CPOOLSIZE
setenv MPI_SHM_SBPOOLSIZE 533504
setenv MPI_SPIN 1
  Total duration is    21.697837 secs
  Fraction spent in MPI 14.5%
 
trial-2
setenv MPI_SHM_NUMPOSTBOX 66
setenv MPI_SHM_SBPOOLSIZE 1064960
  Total duration is    19.901631 secs
  Fraction spent in MPI  5.7%
 
trial-3
setenv MPI_SHM_SBPOOLSIZE 1598464
  Total duration is    19.993672 secs
  Fraction spent in MPI  5.9%
 
trial-4
setenv MPI_SHM_SBPOOLSIZE 2129920
  Total duration is    19.946838 secs
  Fraction spent in MPI  5.8%
 
trial-5
setenv MPI_SHM_SBPOOLSIZE 2661376
  Total duration is    19.906594 secs
  Fraction spent in MPI  5.7%
 
 
% ls -d trial-*
 
trial-0  trial-1  trial-2  trial-3  trial-4  trial-5


Performance Analyzer Profiling of Sun MPI Programs

The Performance Analyzer offers a good first step to understanding the performance characteristics of a program on Sun systems. It combines ease of use with powerful functionality.

As with most profiling tools, there are two basic steps to using Sun's performance analysis tools. The first step is to use the Collector to collect performance data. The second step is to use the Performance Analyzer to examine results. For example, this procedure can be as simple as replacing this:

% mprun -np 16 a.out 3 5 341

with this:

% mprun -np 16 collect a.out 3 5 341
% analyzer test.*.er

Sun's compilers and tools are usually located in the directory /opt/SUNWspro/bin. Check with your system administrator for details for your site.
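
If these tools are not already on your search path, you can add the directory in the C shell. The path below is the usual default and may differ at your site.

% set path = ( /opt/SUNWspro/bin $path )
% which collect analyzer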

The following sections show the use of the Performance Analyzer with Sun MPI programs, often revisiting variations of the preceding example.

Data Collection

There are several ways of using the Collector with MPI codes. For example, the simplest usage would be to add collect to the mprun command line:

% mprun -np 16 collect a.out 3 5 341

In this section we discuss some of the issues that can arise when using the Collector, and how to handle them.

Data Volume

The volume of collected data can grow large, especially for long-running or parallel programs. Though the Collector mitigates this problem, scaling to large data volumes remains an issue.

There are a number of useful strategies for managing these data volumes:

For example, to reduce the number of profiled events roughly by a factor of two, use:

% mprun -np 16 collect -p 20 a.out 3 5 341

This strategy should be a routine part of data collection.

A large parallel job might run out of a central NFS-mounted file system. While this might be adequate for jobs that are not I/O intensive, it can become a critical bottleneck for profiling sessions. If multiple MPI processes write large volumes of profiling data simultaneously over NFS to a single file system, contention for that file system, along with network congestion, can lead to tremendous slowdowns and perturbations of the program's performance. It is preferable to collect profiling data to local file systems and, perhaps, gather the data to a central directory after program execution.

To identify local file systems, use:

% /usr/bin/df -lk

on each node of the cluster you will use, or ask your system administrator about large-volume, high-performance disk space.

One possible choice of a local file system is /tmp. Note that /tmp on each node of a cluster refers to a different, node-local file system. Also, the /tmp file system might not be very large; if it fills up, general system operability can be seriously affected.
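
For example, you might collect into a node-local directory and gather the experiments to a central location afterward. The directory names below are illustrative; the collect -d option and the er_mv utility are described elsewhere in this chapter.

% mkdir /tmp/myrun
% mprun -np 16 collect -d /tmp/myrun a.out 3 5 341
% er_mv /tmp/myrun/*.er /net/fileserver/export/profiles/myrun

The /tmp/myrun directory must exist on every node where processes will run, and the er_mv step must be repeated on each node that holds experiment data.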

Data Organization

The Collector generates one "experiment" per MPI process. If there are multiple runs of a multiprocess job, therefore, the number of experiments can grow quickly.

To organize these experiments, it often makes sense to gather experiments from a run into a distinctive subdirectory. Use the commands er_mv, er_rm, and er_cp (again, typically under the directory /opt/SUNWspro/bin) to move, remove, or copy experiments, respectively. For more information on these utilities, refer to the corresponding man pages or the Sun Studio Compiler Collection documentation.

If you collect an experiment directly into a directory, make sure that the directory has already been created and, ideally, that no other experiments already exist in it.

Example


% mkdir run1
% er_rm -f run1/*.er
% mprun -np 16 collect -d run1 a.out 3 5 341
% mkdir run2
% er_rm -f run2/*.er
% mprun -np 16 collect -d run2 a.out 3 5 341
% mkdir run3
% er_rm -f run3/*.er
% mprun -np 16 collect -d run3 a.out 3 5 341
%

The er_rm steps are not required because (in this instance) we are using freshly created directories. Nevertheless, these steps serve as reminders to avoid the confusion that can result when too many experiments are gathered in the same directory.

Other Data Collection Issues

The Collector collects sampling information by default. Thus, Sun MPI calls will be profiled if they consume a sufficiently high fraction of the runtime. Sometimes it is interesting to capture MPI calls more methodically. In particular, for timeline analysis, it is desirable to mark the beginnings and endings of MPI calls with higher resolution than sampling provides. MPI call tracing can be activated with the -m on switch to collect.
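
For example, MPI call tracing can be added to the earlier data-collection command line:

% mprun -np 16 collect -m on a.out 3 5 341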

Sometimes it is desirable to collect data not only on a process, but on its descendants as well. The collect command will follow descendants when you specify the -F on switch.

For example, the Solaris 10 Operating System supports multiple page sizes with the /usr/bin/ppgsz command-line utility. To collect performance data on a code run with 4-Mbyte memory pages, our sample data collection becomes:

% mprun -np 16 collect -F on ppgsz -o heap=4M,stack=512K a.out 3 5 341

In the particular case of the ppgsz utility, a cleaner alternative may be to preload the mpss.so.1 shared object, so that the ppgsz utility need not appear on the command line. For more information on multiple page size support, see the ppgsz and mpss.so.1 man pages in the Solaris 10 Operating System.
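
For example, a preload-based run might look like the following in the C shell. This is only a sketch: the MPSSHEAP and MPSSSTACK environment variables are those described in the mpss.so.1 man page, and you should verify the names and values against your system's documentation.

% setenv LD_PRELOAD mpss.so.1
% setenv MPSSHEAP 4M
% setenv MPSSSTACK 512K
% mprun -np 16 collect a.out 3 5 341
% unsetenv LD_PRELOAD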

Analyzing Profiling Data

Once data has been gathered with the Collector, it can be viewed with the Performance Analyzer. For example:

% analyzer myrun/test.*.er

Use of the Performance Analyzer is illustrated in the case study.

There are also command-line utilities to aid in data analysis. For more information, see the er_print and er_src man pages. Here are examples of their use:

% er_print -functions test.*.er

% er_print -functions test.*.er | grep PMPI_

% er_print -callers-callees test.*.er

% er_print -source myfunction_ 1 test.*.er

The 1 distinguishes among multiple instances of the named function.

% er_src my_file.o

Case Study

In this case study, we examine the NPB 2.3 BT benchmark. We run the program using the following environment variable settings:


% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1 
% setenv MPI_POLLALL 0 

These settings are not required for Performance Analyzer profiling. We simply use them to profile our code as it would run in production. See Appendix C for more information about using Sun MPI environment variables for high performance. Also refer to the Sun MPI Programming and Reference Guide.

The job is run on a single, shared-memory node using 25 processes.
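
For reference, the profiling data for such a run could be collected with a command line of the following form. The executable name is an illustrative assumption, following the usual NPB naming convention of benchmark.class.nprocs.

% mprun -np 25 collect bt.A.25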

FIGURE 8-1 shows the Performance Analyzer's first view of the resulting profiling data. This default view shows how time is spent in different functions. Both exclusive and inclusive user CPU times are shown for each function, excluding and including, respectively, time spent in any functions it calls. The top line shows that a total of 3362.710 seconds, over all 25 processes, are profiled.

We see that the functions LHSX(), LHSY(), and LHSZ() account for 523.410 + 467.960 + 444.350 = 1435.72 seconds of that time.


FIGURE 8-1 Performance Analyzer--Main View

Graphic image of the Performance Analyzer -- main view


Fortran programmers will note that the term function is used in the C sense, to include all subprograms whether or not they return values. Further, function names are those generated by the Fortran compiler: by default, they are converted to lowercase and have an underscore appended. For example, the subroutine LHSX appears in the profile as lhsx_.

In FIGURE 8-2, we see how this overview appears when MPI tracing is turned on with the Collector's -m on switch. Note that accounting for bytes sent and received is slightly different from the way accounting is handled by MPProf. In particular, for bytes received, the size of the receive buffer is used rather than the actual number of bytes received. For more information on how bytes are counted for MPI calls, refer to the Performance Analyzer documentation volume Program Performance Analysis Tools.


FIGURE 8-2 Performance Analyzer--Main View With Tracing Enabled

Graphic image of the Performance Analyzer -- main view with tracing enabled.


We can see how time is spent within a subroutine if the code was compiled and linked with the -g switch. This switch introduces minimal impact on optimization and parallelization, and it can be employed rather freely by performance-oriented users. When we click on the Source tab, the Performance Analyzer brings up a text editor for the highlighted function. The choice of text editor can be changed under the Options menu with the Text Editor Options selection. The displayed, annotated source code includes the selected metrics, the user source code, and compiler commentary. A small fragment is shown in FIGURE 8-3. In particular, notice that the Performance Analyzer highlights hot (time-consuming) lines of code. Only a small fragment is shown since, in practice, the annotated source can become rather wide.


FIGURE 8-3 Performance Analyzer--Source View

Graphic image of the Performance Analyzer -- source view


To get a better idea of where time is spent at a high level in the code, you can also click on the Callers-Callees tab. FIGURE 8-4 shows a possible Callers-Callees view. The selected function appears in the middle section, its callers appear above it, and its callees below. Selected metrics are shown for all displayed functions. By clicking on callers, you can see where, at a high level in the source code, the time spent in the selected function is incurred. By clicking on callees, you can find more detail on expensive calls a particular function might be making. This sort of analysis is probably familiar to gprof users, but the Performance Analyzer has features that go beyond some of gprof's limitations. For more information about gprof, see Using the gprof Utility.


FIGURE 8-4 Performance Analyzer--Callers-Callees View

Graphic image of the Performance Analyzer -- callers-callees view.


Different metrics can be selected for the various displays by clicking on the Metrics button, as seen at the bottom of FIGURE 8-1. You can choose which metrics are shown, the order in which they are shown, and which metric should be used for sorting.

Click on the Timeline tab to see a view similar to the one shown in FIGURE 8-5. Time is shown across the horizontal axis, while experiments appear on the vertical axis. Experiments are numbered sequentially, starting from 1. For example, MPI rank 7 might appear as experiment 8. The timeline view helps you see synchronization points, load imbalances, and other performance characteristics of your parallel MPI program.


FIGURE 8-5 Performance Analyzer--Timeline View

Graphic image of the Performance Analyzer -- timeline view


In FIGURE 8-6, another timeline view is shown. Click on an event to see its callstack in the lower right corner. When you zoom in to the time scale of the clock sampling (10 milliseconds by default), clock-sampled events appear discretely on the view. Using MPI tracing helps maintain resolution even to the highest levels of zoom.


FIGURE 8-6 Performance Analyzer--Timeline View With Callstack

Graphic image of the Performance Analyzer -- timeline view with callstack.


Because clocks on different nodes of a cluster are not guaranteed to be synchronized, and because the Performance Analyzer makes no attempt to synchronize time stamps from different nodes, timeline views of Sun MPI runs that were distributed over multiple nodes in a cluster are not guaranteed to be displayed properly.

Overview of Functions

In a profile, you will typically find many unfamiliar functions that do not appear explicitly in your code. Further, it can happen that none of the familiar MPI calls you do use will appear, such as MPI_Isend(), MPI_Irecv(), MPI_Wait(), or MPI_Waitall().

FIGURE 8-7 shows examples of functions you might find in your profiles, along with explanations of what you are seeing. The functions are organized by load object, such as an executable (here, a.out) or a dynamic library.


FIGURE 8-7 Examples of Functions That Might Appear in Profiles

Table of examples of functions that might appear in profiles


Note that:

For example, a C call to MPI_Send() will enter the function PMPI_Send.

Sun MPI uses a number of internal routines, which will appear in profiles.

One way to get an overview of which MPI calls are being made, and which are most important, is to look for the PMPI entry points into the MPI library. For our case study example:


% er_print -functions proc-0.er | grep PMPI_
 0.050  16.980   PMPI_Wait
 0.030   0.810   PMPI_Isend
 0.030  16.930   PMPI_Wait
 0.020   0.490   PMPI_Irecv
 0.      0.      PMPI_Finalize
 0.      0.150   PMPI_Init
 0.      1.630   PMPI_Waitall
 0.      1.630   PMPI_Waitall
%


In this example, roughly 20 seconds out of 146.93 seconds profiled are due to MPI calls. The exclusive times are typically small and meaningless. Synchronizing calls, such as PMPI_Wait and PMPI_Waitall, appear twice, once due to libmpi and once to libcollector. Such duplication can be ignored.

If ever there is a question about what role an unfamiliar (but possibly time-consuming) function is playing, within the Performance Analyzer you can:

MPI Wait Times

Time might be spent in MPI calls for a wide variety of reasons. Specifically, much of that time might be spent waiting for some condition (a message to arrive, internal buffering to become available, and so on), rather than in moving data.

While an MPI process waits, it might or might not tie up a CPU. Nevertheless, such wait time probably has a cost in terms of program performance.

Your options for ensuring that wait time is profiled include:
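
One such option, consistent with the environment variables used in the case study earlier in this chapter, is to make waiting processes spin rather than yield the CPU, so that wait time appears as user CPU time in clock-based profiles. This is only a sketch of one possibility; see Appendix C and the Sun MPI Programming and Reference Guide before adopting these settings.

% setenv MPI_SPIN 1
% mprun -np 16 collect a.out 3 5 341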


Other Profiling Approaches

Both Sun MPI and the Solaris OS environment offer useful profiling facilities. Using the MPI profiling interface, you can investigate MPI calls. Using your own timer calls, you can profile specific behaviors. Using the Solaris gprof utility, you can profile diverse multiprocess codes, including those using MPI.

Using the MPI Profiling Interface

The MPI standard supports a profiling interface, which allows any user to profile either individual MPI calls or the entire library. This interface supports two equivalent APIs for each MPI routine. One has the prefix MPI_, while the other has PMPI_. User codes typically call the MPI_ routines. A profiling routine or library will typically provide wrappers for the MPI_ routines that simply call the PMPI_ ones, with timer calls around the PMPI_ call.

You can use this interface to change the behavior of MPI routines without modifying your source code. For example, suppose you believe that most of the time spent in some collective call, such as MPI_Allreduce, is due to the synchronization of the processes that is implicit in such a call. Then, you can compile a wrapper such as the one that follows, and link it into your code before the -lmpi switch. The effect is that time attributed to MPI_Allreduce will be due exclusively to the reduction operation itself, with synchronization costs attributed to the barrier operation.


subroutine MPI_Allreduce(x,y,n,type,op,comm,ier)
integer x(*), y(*), n, type, op, comm, ier
call PMPI_Barrier(comm,ier)
call PMPI_Allreduce(x,y,n,type,op,comm,ier)
end 
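
To use such a wrapper, compile it and place the resulting object file ahead of the MPI library on the link line. The commands below are a sketch that assumes the mpf90 compiler driver and illustrative file names; the essential point is only that the wrapper object precedes -lmpi.

% mpf90 -c allreduce_wrap.f
% mpf90 -o a.out my_code.f allreduce_wrap.o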

Profiling wrappers or libraries can be used even with application binaries that have already been linked. Refer to the Solaris man page for ld for more information about the environment variable LD_PRELOAD.

You can get profiling libraries from independent sources for use with Sun MPI. An example of a profiling library is included in the multiprocessing environment (MPE) from Argonne National Laboratory. Several external profiling tools can be made to work with Sun MPI using this mechanism. For more information on this library and on the MPI profiling interface, refer to the Sun MPI Programming and Reference Guide.

Inserting MPI Timer Calls

Sun HPC ClusterTools implements the MPI timer call MPI_Wtime (demonstrated in the example that follows) with the high-resolution timer gethrtime. If you use MPI_Wtime calls, you should use them to measure sections that last more than several microseconds. Times on different processes are not guaranteed to be synchronized. For information about the gethrtime timer, see the gethrtime(3C) man page.

When profiling multiprocess codes, ensure that the timings are not distorted by the asynchrony of the various processes. For this purpose, you usually need to synchronize the processes before starting and before stopping the timer.

In the following example, unless a barrier is placed as indicated, most processes might accumulate time in the interesting, timed portion while waiting for process 0 (zero) to emerge from uninteresting initialization, which would skew your program's timings:


CALL MPI_COMM_RANK(MPI_COMM_WORLD,ME,IER)
IF ( ME .EQ. 0 ) THEN 
    initialization
END IF
! place barrier here
! CALL MPI_BARRIER(MPI_COMM_WORLD, IER)
T_START = MPI_WTIME()
    timed portion
T_END = MPI_WTIME() 

When stopping a timer, remember that measurements of elapsed time will differ on different processes. So, either execute another barrier before the "stop" timer, or report the maximum elapsed time over all processes.

Avoid timing very small fragments of code. This is good advice even for uniprocessor codes, and the consequences are greater with many processes. Code fragments perform differently when timed in isolation, and the introduction of barrier calls for timing purposes can be disruptive for short intervals.

Using the gprof Utility

The Solaris utility gprof can be used for multiprocess codes, such as those that use MPI. Several points should be noted:

Note, however, that the gprof utility has several limitations.

Note that the Performance Analyzer is simple to use, provides the profiling information that gprof does, offers additional functionality, and avoids these pitfalls. Thus, gprof users are strongly encouraged to migrate to the Performance Analyzer for both MPI and non-MPI codes.

For more information about the gprof utility, refer to the gprof man page. For more information about the Performance Analyzer, refer to the documentation supplied with the Sun Studio Compiler Collection.