Sun HPC ClusterTools 6 Software Performance Guide
This list is a summary of the key performance tips found in this document. They are organized under the following categories:
Compilation and Linking
Compilation and linking are discussed in Chapter 6.
- Use Sun Studio Compiler Collection compilers for best performance. Sun HPC ClusterTools 6 supports versions 8, 9, 10, and 11 of the Sun Studio compilers for C, C++, and Fortran.
See Compiler Version.
- Use the mpf77, mpf90, mpcc, and mpCC utilities where possible. Link with -lmpi. For example:
% mpf90 -fast -g a.f -lmpi
See The mp* Utilities.
See The -fast Switch.
- As appropriate, add the following -xarch setting after -fast:

| Processor | 32-bit binary | 64-bit binary |
|---|---|---|
| UltraSPARC II (will also run on UltraSPARC III) | -xarch=v8plusa | -xarch=v9a |
| UltraSPARC III (will not run on UltraSPARC II) | -xarch=v8plusb | -xarch=v9b |
| UltraSPARC IV and IV+ | -xarch=v8plusb | -xarch=v9 |
| AMD Opteron processors | -xarch=i386 | -xarch=amd64 |
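For example, to build a 64-bit binary for UltraSPARC III systems (using the same hypothetical source file a.f as in the example above):
% mpf90 -fast -xarch=v9b -g a.f -lmpi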
See The -xarch Switch.
- Compile with -xalias=actual due to Fortran binding issues in the MPI standard. See The -xalias Switch.
- Compile and link with -g.
See The -g Switch.
- Link with -lopt for C programs.
- Compile and link with -xvector if math library intrinsics (logarithms, exponentials, or trigonometric functions) appear inside long loops.
- Compile with -xprefetch selectively.
- Compile with -xrestrict and -xalias_level, as appropriate, for C programs.
- Compile with -xsfpconst, as appropriate, for C programs.
- Compile with -stackvar, as appropriate, for Fortran programs.
See Other Useful Switches.
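For example, a C program might be compiled and linked with a command line such as the following (an illustrative sketch; prog.c is a hypothetical source file, and only the switches appropriate to your code should be used):
% mpcc -fast -xarch=v8plusb -xrestrict -g prog.c -lmpi -lopt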
MPProf
- Before running your Sun MPI program, set the MPI_PROFILE environment variable to 1.
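For example, with a csh-type shell:
% setenv MPI_PROFILE 1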
- After running your Sun MPI program, you will find a file of the form mpprof.index.rm.jid in your working directory. Type the following command:
% mpprof mpprof.index.rm.jid
- To archive profiling results, type the following command:
% mpprof -r -g archive_directory mpprof.index.rm.jid
- To clean up files, type the following command:
% mpprof -r mpprof.index.rm.jid
Analyzer Profiling
Use of the Performance Analyzer with Sun MPI programs is discussed in Chapter 7.
- Set your path to include the most recent compiler software, usually in /opt/SUNWspro/bin.
- The following examples show basic usage to collect performance data and analyze results:
% mprun -np 16 collect a.out 3 5 341
% analyzer test.*.er
- For more advanced data collection, use scripts. See the following example:
% cat csh-script
#!/bin/csh
# Rank 0 creates a directory to gather the experiments.
if ( $MP_RANK == 0 ) then
  mkdir myrun
endif
# Profile only the first 4 processes; the others simply run the program.
if ( $MP_RANK < 4 ) then
  collect -p 20 -m on -o /tmp/proc-$MP_RANK.er $*
  er_mv /tmp/proc-$MP_RANK.er myrun
else
  $*
endif
% mprun -np 16 csh-script a.out 3 5 341
Here, the following techniques have been used:
- Data volumes have been reduced by profiling only a subset of the processes.
- Data volumes have been reduced by lowering the profiling frequency (a longer profiling interval) with the collect -p switch.
- Data volumes have been handled by collecting to the local file system /tmp. Other fast file systems can be identified by your system administrator or with the command /usr/bin/df -lk.
- MPI wait tracing data is activated with the collect -m switch.
- Experiments have been named by process rank.
- Experiments have been gathered to a centralized location, directory myrun, after the MPI job finished.
- Analyzing data
- Basic view shows how much time is spent per function.
- Click on the Source button to see how much time is spent per source-code line. This requires that the code was compiled and linked with -g, which is compatible with high levels of optimization and parallelization.
- Click on Callers-Callees to see caller-callee relationships.
- Click on Timeline to see a timeline view.
- Select other metrics with the Metrics button.
- Use er_print to bypass the graphical interface:
% er_print -functions proc-0.er
% er_print -callers-callees proc-0.er
% er_print -source lhsx_ 1 proc-0.er
- Look at inclusive time for high-level MPI functions to filter out internal software layers of the Sun MPI library:
% er_print -functions proc-0.er | grep PMPI_
- To ensure that MPI wait times are profiled, select wall-clock time, instead of CPU time, as the profiling metric. Or, when collecting the data in the first place, set the following environment variables:
% setenv MPI_COSCHED 0
% setenv MPI_SPIN 1
- Loading data:
- The Performance Analyzer accepts experiment names on the command line, such as the following:
% analyzer
% analyzer proc-0.er
% analyzer run1/proc-*.er
- After the Performance Analyzer has been started, use the Experiment menu to Add and Drop individual experiments.
Job Launch on a Multinode Cluster
- Checking load (the following table shows CRE and UNIX commands useful for checking load):

| | CRE | UNIX |
|---|---|---|
| How high is the load? | % mpinfo -N | % uptime |
| What is causing the load? | % mpps -e | % ps -e |
See Running on a Dedicated System.
- Objectives for Job Launch
- Minimize internode communication.
- Run on one node if possible.
- Place heavily communicating processes on the same node as one another.
See Minimizing Communication Costs.
- Maximize bisection bandwidth.
- Run on one node if possible.
- Otherwise, spread over many nodes.
- For example, spread jobs that use multiple I/O servers.
See Controlling Bisection Bandwidth.
- Examples of Job Launch with CRE as the Resource Manager
- To run jobs in the background, perhaps from a shell script, use the -n option, as in the following commands:
% cat a.csh
#!/bin/csh
mprun -n -np 4 a.out
% a.csh
See Running Jobs in the Background.
- To eliminate core dumps, set the core dump size limit to zero in the parent shell:
% limit coredumpsize 0 (for csh)
$ ulimit -c 0 (for sh)
See Limiting Core Dumps.
- To run 32 processes, with each block of 4 consecutive processes mapped to a node:
% mprun -np 32 -Zt 4 a.out
or
% mprun -np 32 -Z 4 a.out
See Collocal Blocks of Processes.
- To run 16 processes, with no two mapped to the same node:
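For example (an illustrative command line; with CRE, the -Ns option without -W places at most one process on each node):
% mprun -Ns -np 16 a.out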
See Multithreaded Job.
- To map 32 processes in round-robin fashion to the nodes in the cluster, with possibly multiple processes per node:
% mprun -Ns -W -np 32 a.out
See Round-Robin Distribution of Processes.
- To map the first 4 processes to node0, the next 4 to node1, and the next 8 to node2, type the following:
% cat nodelist
node0 4
node1 4
node2 8
% mprun -np 16 -m nodelist a.out
See Detailed Mapping.
MPI Programming Tips
- Minimize number and volume of messages.
See Reducing Message Volume.
- Reduce serialization and improve load balancing.
See Reducing Serialization and Load Balancing.
- Minimize synchronizations:
- Generally reduce the amount of message passing.
- Reduce the amount of explicit synchronization (such as MPI_Barrier(), MPI_Ssend(), and so on).
- Post sends well ahead of when a receiver needs data.
- Ensure sufficient system buffering.
See Synchronization.
- Pay attention to buffering:
- Do not assume unlimited internal buffering by Sun MPI.
- Use nonblocking calls such as MPI_Isend() for finest control over user-specified buffering.
- Post receives early to relieve pressure on system buffers.
See Buffering.
- Replace blocking operations with nonblocking operations:
- Initiate nonblocking operations as soon as possible.
- Complete nonblocking operations as late as possible.
- Test the status of nonblocking operations periodically with MPI_Test() calls.
See Nonblocking Operations.
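The following is a minimal C sketch of this pattern, with illustrative names and a ring exchange that are not taken from this guide: the receive is posted early, the send is initiated as soon as the data is ready, MPI_Test() checks progress during local work, and the operations are completed with MPI_Wait() only when the data is needed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, left, right, done = 0;
    double sendval, recvval;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left = (rank + size - 1) % size;
    right = (rank + 1) % size;
    sendval = (double)rank;

    /* Post the receive before the matching send is expected to arrive. */
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &rreq);
    /* Initiate the send as soon as the data is ready. */
    MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &sreq);

    /* Overlap communication with local work, testing progress periodically. */
    MPI_Test(&rreq, &done, &status);

    /* Complete the operations as late as possible, when the data is needed. */
    MPI_Wait(&rreq, &status);
    MPI_Wait(&sreq, &status);

    printf("rank %d received %g from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}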
- Pay attention to polling:
- Match message-passing calls (receives to sends, collectives to collectives, and so on).
- Post MPI_Irecv() calls ahead of arrivals.
- Avoid MPI_ANY_SOURCE.
- Avoid MPI_Probe() and MPI_Iprobe().
- Set the environment variable MPI_POLLALL to 0 at run time.
See Polling.
- Take advantage of MPI collective operations.
See Sun MPI Collectives.
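As an illustration (a minimal C sketch with hypothetical variable names, not taken from this guide), a single MPI_Allreduce() call can replace a hand-coded sequence of point-to-point sends to one process followed by a broadcast of the result:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local_sum = (double)(rank + 1);   /* stands in for a locally computed partial result */

    /* One optimized collective replaces the explicit send/receive/broadcast code. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}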
- Use contiguous data types:
- If necessary, send some unneeded padding to keep the data contiguous.
- Pack your own data if you can outperform generalized MPI_Pack()/MPI_Unpack() routines.
See Contiguous Data Types.
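The following minimal C sketch (hypothetical array and sizes, not taken from this guide) copies a strided column of a row-major matrix into a contiguous buffer by hand and sends it as plain MPI_DOUBLE data, rather than describing the noncontiguous layout with a derived datatype or MPI_Pack(). Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank, i;
    double a[N][N], col[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < N; i++)
            a[i][0] = (double)i;      /* fill column 0 of the matrix */
        for (i = 0; i < N; i++)
            col[i] = a[i][0];         /* pack the strided column into a contiguous buffer */
        MPI_Send(col, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received: %g %g %g %g\n", col[0], col[1], col[2], col[3]);
    }

    MPI_Finalize();
    return 0;
}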
- Avoid congestion if your program will run over TCP:
- Avoid "hot receivers."
- Use blocking point-to-point communications.
- Use synchronous sends (MPI_Ssend() and related calls).
- Use MPI collectives such as MPI_Alltoall(), MPI_Alltoallv(), MPI_Gather(), or MPI_Gatherv(), as appropriate.
- At run time, set MPI_EAGERONLY to 0, and possibly lower MPI_TCP_RENDVSIZE.
See Special Considerations for Message Passing Over TCP.
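For example, with a csh-type shell:
% setenv MPI_EAGERONLY 0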