Sun HPC ClusterTools 6 Software Performance Guide
This list is a summary of the key performance tips found in this document. They are organized under the following categories:
Compilation and Linking
Compilation and linking are discussed in Chapter 6.
- Use Sun Studio Compiler Collection compilers for best performance. Sun HPC ClusterTools 6 supports versions 8, 9, 10, and 11 of the Sun Studio compilers for C, C++, and Fortran.
See Compiler Version.
- Use the mpf77, mpf90, mpcc, and mpCC utilities where possible. Link with -lmpi. For example:
% mpf90 -fast -g a.f -lmpi
See The mp* Utilities.
See The -fast Switch.
- As appropriate, add the following -xarch setting after -fast:

| Processor | 32-bit binary | 64-bit binary |
|---|---|---|
| UltraSPARC II (will also run on UltraSPARC III) | -xarch=v8plusa | -xarch=v9a |
| UltraSPARC III (will not run on UltraSPARC II) | -xarch=v8plusb | -xarch=v9b |
| UltraSPARC IV and IV+ | -xarch=v8plusb | -xarch=v9 |
| AMD Opteron processors | -xarch=i386 | -xarch=amd64 |
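For example, to build a 64-bit binary for UltraSPARC III systems (using the same hypothetical source file a.f as in the example above):
% mpf90 -fast -xarch=v9b -g a.f -lmpi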
See The -xarch Switch.
- Compile with -xalias=actual due to Fortran binding issues in the MPI standard. See The -xalias Switch.
- Compile and link with -g.
See The -g Switch.
- Link with -lopt for C programs.
- Compile and link with -xvector if math library intrinsics (logarithms, exponentials, or trigonometric functions) appear inside long loops.
- Compile with -xprefetch selectively.
- Compile with -xrestrict and -xalias_level, as appropriate, for C programs.
- Compile with -xsfpconst, as appropriate, for C programs.
- Compile with -stackvar, as appropriate, for Fortran programs.
See Other Useful Switches.
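For example, a C program might be compiled and linked with a command line such as the following (an illustrative sketch; prog.c is a hypothetical source file, and only the switches appropriate to your code should be used):
% mpcc -fast -xarch=v8plusb -xrestrict -g prog.c -lmpi -lopt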
MPProf
- Before running your Sun MPI program, set the MPI_PROFILE environment variable to 1.
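For example, with a csh-type shell:
% setenv MPI_PROFILE 1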
- After running your Sun MPI program, you will find a file of the form mpprof.index.rm.jid in your working directory. Type the following command:
% mpprof mpprof.index.rm.jid
- To archive profiling results, type the following command:
% mpprof -r -g archive_directory mpprof.index.rm.jid
- To clean up files, type the following command:
% mpprof -r mpprof.index.rm.jid
Analyzer Profiling
Use of the Performance Analyzer with Sun MPI programs is discussed in Chapter 7.
- Set your path to include the most recent compiler software, usually in /opt/SUNWspro/bin.
- The following examples show basic usage to collect performance data and analyze results:
% mprun -np 16 collect a.out 3 5 341
% analyzer test.*.er
- For more advanced data collection, use scripts. See the following example:
% cat csh-script
#!/bin/csh
# Rank 0 creates a directory to gather the experiments.
if ( $MP_RANK == 0 ) then
  mkdir myrun
endif
# Profile only the first 4 processes; the others simply run the program.
if ( $MP_RANK < 4 ) then
  collect -p 20 -m on -o /tmp/proc-$MP_RANK.er $*
  er_mv /tmp/proc-$MP_RANK.er myrun
else
  $*
endif
% mprun -np 16 csh-script a.out 3 5 341
Here, the following techniques have been used:
- Data volumes have been reduced by profiling only a subset of the processes.
- Data volumes have been reduced by lowering the profiling frequency (a longer profiling interval) with the collect -p switch.
- Data volumes have been handled by collecting to the local file system /tmp. Other fast file systems can be identified by your system administrator or with the command /usr/bin/df -lk.
- MPI wait tracing data is activated with the collect -m switch.
- Experiments have been named by process rank.
- Experiments have been gathered to a centralized location, directory myrun, after the MPI job finished.
- Analyzing data
- Basic view shows how much time is spent per function.
- Click on the Source button to see how much time is spent per source-code line. This requires that the code was compiled and linked with -g, which is compatible with high levels of optimization and parallelization.
- Click on Callers-Callees to see caller-callee relationships.
- Click on Timeline to see a timeline view.
- Select other metrics with the Metrics button.
- Use er_print to bypass the graphical interface:
% er_print -functions proc-0.er
% er_print -callers-callees proc-0.er
% er_print -source lhsx_ 1 proc-0.er
- Look at inclusive time for high-level MPI functions to filter out internal software layers of the Sun MPI library:
% er_print -functions proc-0.er | grep PMPI_
- To ensure that MPI wait times are profiled, select wall-clock time, instead of CPU time, as the profiling metric. Or, when collecting the data in the first place, set the following environment variables:
% setenv MPI_COSCHED 0
% setenv MPI_SPIN 1
- Loading data:
- The Performance Analyzer accepts experiment names on the command line, such as the following:
% analyzer
% analyzer proc-0.er
% analyzer run1/proc-*.er
- After the Performance Analyzer has been started, use the Experiment menu to Add and Drop individual experiments.
Job Launch on a Multinode Cluster
- Checking load (the following table shows CRE and UNIX commands useful for checking load):

| | CRE | UNIX |
|---|---|---|
| How high is the load? | % mpinfo -N | % uptime |
| What is causing the load? | % mpps -e | % ps -e |
See Running on a Dedicated System.
- Objectives for Job Launch
- Minimize internode communication.
- Run on one node if possible.
- Place heavily communicating processes on the same node as one another.
See Minimizing Communication Costs.
- Maximize bisection bandwidth.
- Run on one node if possible.
- Otherwise, spread over many nodes.
- For example, spread jobs that use multiple I/O servers.
See Controlling Bisection Bandwidth.
- Examples of Job Launch with CRE as the Resource Manager
- To run jobs in the background, perhaps from a shell script, use the -n option, as in the following commands:
% cat a.csh
#!/bin/csh
mprun -n -np 4 a.out
% a.csh
See Running Jobs in the Background.
- To eliminate core dumps, set the core dump size limit to zero in the parent shell:
% limit coredumpsize 0 (for csh)
$ ulimit -c 0 (for sh)
See Limiting Core Dumps.
- To run 32 processes, with each block of 4 consecutive processes mapped to a node:
% mprun -np 32 -Zt 4 a.out
or
% mprun -np 32 -Z 4 a.out
See Collocal Blocks of Processes.
- To run 16 processes, with no two mapped to the same node:
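For example (an illustrative command line; with CRE, the -Ns option without -W places at most one process on each node):
% mprun -Ns -np 16 a.out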
See Multithreaded Job.
- To map 32 processes in round-robin fashion to the nodes in the cluster, with possibly multiple processes per node:
% mprun -Ns -W -np 32 a.out
See Round-Robin Distribution of Processes.
- To map the first 4 processes to node0, the next 4 to node1, and the next 8 to node2, type the following:
% cat nodelist
node0 4
node1 4
node2 8
% mprun -np 16 -m nodelist a.out
See Detailed Mapping.
MPI Programming Tips
- Minimize number and volume of messages.
See Reducing Message Volume.
- Reduce serialization and improve load balancing.
See Reducing Serialization and Load Balancing.
- Minimize synchronizations:
- Generally reduce the amount of message passing.
- Reduce the amount of explicit synchronization (such as MPI_Barrier(), MPI_Ssend(), and so on).
- Post sends well ahead of when a receiver needs data.
- Ensure sufficient system buffering.
See Synchronization.
- Pay attention to buffering:
- Do not assume unlimited internal buffering by Sun MPI.
- Use nonblocking calls such as MPI_Isend() for finest control over user-specified buffering.
- Post receives early to relieve pressure on system buffers.
See Buffering.
- Replace blocking operations with nonblocking operations:
- Initiate nonblocking operations as soon as possible.
- Complete nonblocking operations as late as possible.
- Test the status of nonblocking operations periodically with MPI_Test() calls.
See Nonblocking Operations.
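The following is a minimal C sketch of this pattern, with illustrative names and a ring exchange that are not taken from this guide: the receive is posted early, the send is initiated as soon as the data is ready, MPI_Test() checks progress during local work, and the operations are completed with MPI_Wait() only when the data is needed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, left, right, done = 0;
    double sendval, recvval;
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left = (rank + size - 1) % size;
    right = (rank + 1) % size;
    sendval = (double)rank;

    /* Post the receive before the matching send is expected to arrive. */
    MPI_Irecv(&recvval, 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &rreq);
    /* Initiate the send as soon as the data is ready. */
    MPI_Isend(&sendval, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &sreq);

    /* Overlap communication with local work, testing progress periodically. */
    MPI_Test(&rreq, &done, &status);

    /* Complete the operations as late as possible, when the data is needed. */
    MPI_Wait(&rreq, &status);
    MPI_Wait(&sreq, &status);

    printf("rank %d received %g from rank %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}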
- Pay attention to polling:
- Match message-passing calls (receives to sends, collectives to collectives, and so on).
- Post MPI_Irecv() calls ahead of arrivals.
- Avoid MPI_ANY_SOURCE.
- Avoid MPI_Probe() and MPI_Iprobe().
- Set the environment variable MPI_POLLALL to 0 at run time.
See Polling.
- Take advantage of MPI collective operations.
See Sun MPI Collectives.
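As an illustration (a minimal C sketch with hypothetical variable names, not taken from this guide), a single MPI_Allreduce() call can replace a hand-coded sequence of point-to-point sends to one process followed by a broadcast of the result:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double local_sum, global_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local_sum = (double)(rank + 1);   /* stands in for a locally computed partial result */

    /* One optimized collective replaces the explicit send/receive/broadcast code. */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}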
- Use contiguous data types:
- If necessary, send some unneeded padding to keep the data contiguous.
- Pack your own data if you can outperform generalized MPI_Pack()/MPI_Unpack() routines.
See Contiguous Data Types.
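The following minimal C sketch (hypothetical array and sizes, not taken from this guide) copies a strided column of a row-major matrix into a contiguous buffer by hand and sends it as plain MPI_DOUBLE data, rather than describing the noncontiguous layout with a derived datatype or MPI_Pack(). Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

#define N 4

int main(int argc, char **argv)
{
    int rank, i;
    double a[N][N], col[N];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < N; i++)
            a[i][0] = (double)i;      /* fill column 0 of the matrix */
        for (i = 0; i < N; i++)
            col[i] = a[i][0];         /* pack the strided column into a contiguous buffer */
        MPI_Send(col, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(col, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received: %g %g %g %g\n", col[0], col[1], col[2], col[3]);
    }

    MPI_Finalize();
    return 0;
}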
- Avoid congestion if your program will run over TCP:
- Avoid "hot receivers."
- Use blocking point-to-point communications.
- Use synchronous sends (MPI_Ssend() and related calls).
- Use MPI collectives such as MPI_Alltoall(), MPI_Alltoallv(), MPI_Gather(), or MPI_Gatherv(), as appropriate.
- At run time, set MPI_EAGERONLY to 0, and possibly lower MPI_TCP_RENDVSIZE.
See Special Considerations for Message Passing Over TCP.
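For example, with a csh-type shell:
% setenv MPI_EAGERONLY 0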