Sun MPI 4.0 User's Guide: With CRE

Chapter 6 Performance Tuning

Sun MPI uses a variety of techniques to deliver high-performance, robust, and memory-efficient message passing under a wide set of circumstances. In certain situations, however, applications will benefit from nondefault behaviors. The Sun MPI environment variables discussed in this section allow you to tune these default behaviors. A list of all Sun MPI environment variables, with brief descriptions, can be found in Appendix A, Environment Variables and in the MPI man page.

Current Settings

User tuning of MPI environment variables can be restricted by the system administrator through a configuration file, hpc.conf. To determine whether such restrictions are in place on your local cluster, use the MPI_PRINTENV environment variable (described below) to verify settings.

In most cases, performance will be good without tuning any environment variables. Nevertheless, here are some performance guidelines for using MPI environment variables. In some cases, diagnosis of whether environment variables would be helpful is aided by Prism profiling with TNF probes, as described in the Prism User's Guide and the Sun MPI Programming and Reference Guide.

Runtime Diagnostic Information

Certain Sun MPI environment variables cause extra diagnostic information to be printed out at run time:

% setenv MPI_PRINTENV 1
% setenv MPI_SHOW_INTERFACES 3
% setenv MPI_SHOW_ERRORS 1 

Running on a Dedicated System

If your system has sufficient capacity for running your MPI job, you can commit processors aggressively to your job. At a minimum, the CPU load should not exceed the number of physical processors. The CPU load for your job is the number of MPI processes in the job, but the load is greater if your job is multithreaded. The load must also be shared with any other jobs running on the same system. You can check the current load with the mpinfo command.
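For example, a quick check of per-node load before launching a job might look like the following (the -N option, which lists per-node information, is assumed here; see the mpinfo man page for the options available on your installation):

% mpinfo -N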

To run your job more aggressively on a dedicated system, set the MPI_SPIN and MPI_PROCBIND environment variables:

% setenv MPI_SPIN 1

Use this only if you will leave at least one processor per node free to service system daemons. Profiling with Prism introduces background daemons that cause a slight but noticeable load, so you must be careful to avoid overloading when attempting to profile a code with this setting.

% setenv MPI_PROCBIND 1

Set the MPI_PROCBIND variable only if there are no other MPI jobs running and your job is single-threaded.
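For example, on a dedicated system you might combine both settings before launching with mprun (the process count and the executable name a.out are illustrative):

% setenv MPI_SPIN 1
% setenv MPI_PROCBIND 1
% mprun -np 16 a.out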

Safe Use of System Buffers

In some MPI programs, processes send large volumes of data with blocking sends before starting to receive messages. The MPI standard specifies that users must explicitly provide buffering in such cases, perhaps using MPI_Bsend calls. In practice, however, some users rely on the standard send routine (MPI_Send) to supply unlimited buffering. By default, Sun MPI prevents deadlock in such situations through general polling, which drains system buffers even when no receives have been posted by the user code.

For best performance on typical, safe programs, you can suppress general polling by setting MPI_POLLALL to 0:

% setenv MPI_POLLALL 0 

Trading Memory for Performance

Depending on message traffic, performance can stall if system buffers become congested, but it can be superior if buffers are large. Here, we examine performance for on-node messages via shared-memory buffers.

It is helpful to think of data traffic per connection, the "path" from a particular sender to a particular receiver, since many Sun MPI buffering resources are allocated on a per-connection basis. A sender may emit bursts of messages on a connection, during which time the corresponding receiver may not be depleting the buffers. For example, a sender may execute a sequence of send operations to one receiver during a period in which that receiver is not making any MPI calls whatsoever.

You may need to use profiling to diagnose such conditions. For more information on profiling, see the Prism User's Guide and the Sun MPI Programming and Reference Guide.

Rendezvous or Eager Protocol?

Is your program sending many long, unexpected messages? Sun MPI offers message rendezvous, which requires a receiver to echo a ready signal to the sender before data transmission can begin. This can improve performance when a pair of processes issue their sends in a different order than the corresponding receives, since receive-side buffering is reduced. To allow rendezvous behavior for long messages, set the MPI_EAGERONLY environment variable to 0:

% setenv MPI_EAGERONLY 0 

The threshold message size for rendezvous behavior can be tuned independently for each protocol with MPI_SHM_RENDVSIZE, MPI_TCP_RENDVSIZE, and MPI_RSM_RENDVSIZE.
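For example, to have on-node messages of 16 Kbytes or more use rendezvous, you might set the following (the threshold shown is illustrative, not a recommended value):

% setenv MPI_EAGERONLY 0
% setenv MPI_SHM_RENDVSIZE 16384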


Note -

Rendezvous will often degrade performance by coupling senders to receivers. Also, for some "unsafe" codes, it can produce deadlock.


Many Broadcasts or Reductions

Does your program include many broadcasts or reductions on large messages? Large broadcasts may benefit from increased values of MPI_SHM_BCASTSIZE, and large reductions from increased MPI_SHM_REDUCESIZE. Also, if many different communicators are involved, you may want to increase MPI_SHM_GBPOOLSIZE. In most cases, the default values will provide best performance.
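For example, a run dominated by large broadcasts and reductions over several communicators might raise these limits as follows (the values are illustrative only; start from the defaults and measure):

% setenv MPI_SHM_BCASTSIZE 65536
% setenv MPI_SHM_REDUCESIZE 1024
% setenv MPI_SHM_GBPOOLSIZE 41943040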

Shared-Memory Point-to-Point Message Passing

The size of each shared-memory buffer is fixed at 1 Kbyte. Most other quantities in shared-memory message passing are settable with MPI environment variables.

A short message, at most MPI_SHM_SHORTMSGSIZE bytes long, fits into one postbox, and no buffers are used. Above that size, message data is written into buffers that are controlled by postboxes.

Only starting at MPI_SHM_PIPESTART bytes, however, are multiple postboxes used, which is known as pipelining. The amount of buffer data controlled by any one postbox is at most MPI_SHM_PIPESIZE bytes. By default, MPI_SHM_PIPESTART is well below MPI_SHM_PIPESIZE. For the smallest pipelined messages, then, a message is broken roughly into two, and each of two postboxes controls roughly half the message.

Above MPI_SHM_CYCLESTART bytes, messages are fed cyclically through two sets of buffers, each set of size MPI_SHM_CYCLESIZE bytes. During a cyclic transfer, the footprint of the message in shared memory buffers is 2*MPI_SHM_CYCLESIZE bytes.

The postbox area consists of MPI_SHM_NUMPOSTBOX postboxes per connection. By default, each connection has its own pool of buffers, each pool of size MPI_SHM_CPOOLSIZE bytes.

By setting MPI_SHM_SBPOOLSIZE, users may specify that each sender has a pool of buffers, of MPI_SHM_SBPOOLSIZE bytes each, to be shared among its various connections. If MPI_SHM_CPOOLSIZE is also set, then any one connection may consume only that many bytes from its send-buffer pool at any one time.
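For example, to switch from per-connection pools to per-sender pools while still capping what any one connection may consume at a time, you might set (illustrative values, not recommendations):

% setenv MPI_SHM_SBPOOLSIZE 8388608
% setenv MPI_SHM_CPOOLSIZE 24576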

Memory Considerations

In all, the size of the shared-memory area devoted to point-to-point messages is

n * ( n - 1 ) * ( MPI_SHM_NUMPOSTBOX * ( 64 + MPI_SHM_SHORTMSGSIZE ) + MPI_SHM_CPOOLSIZE )

bytes when per-connection pools are used (that is, when MPI_SHM_SBPOOLSIZE is not set) and

n * ( n - 1 ) * MPI_SHM_NUMPOSTBOX * ( 64 + MPI_SHM_SHORTMSGSIZE ) + n * MPI_SHM_SBPOOLSIZE

bytes when per-sender pools are used (that is, when MPI_SHM_SBPOOLSIZE is set).
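As an illustration, suppose n = 4 processes on a node, with MPI_SHM_NUMPOSTBOX = 16, MPI_SHM_SHORTMSGSIZE = 256, and MPI_SHM_CPOOLSIZE = 24576 (these values are for illustration only; check the defaults for your release). With per-connection pools, the shared-memory area would be

4 * 3 * ( 16 * ( 64 + 256 ) + 24576 ) = 12 * 29696 = 356352

bytes, or roughly 350 Kbytes.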

Cyclic message passing limits the size of shared memory that is needed to transfer even arbitrarily large messages.

Shared-Memory Collectives

Collective operations in Sun MPI are highly optimized and make use of a "general buffer pool" within shared memory.

MPI_SHM_GBPOOLSIZE sets the amount of space available on a node for the "optimized" collectives in bytes. By default, it is set to 20971520 bytes. This space is used by MPI_Bcast, MPI_Reduce, MPI_Allreduce, MPI_Reduce_scatter, and MPI_Barrier, provided that two or more of the MPI processes are on the node.

When a communicator is created, space is reserved in the general buffer pool for performing barriers, short broadcasts, and a few other purposes.

For larger broadcasts, shared memory is allocated out of the general buffer pool. The maximum buffer-memory footprint in bytes of a broadcast operation is set by an environment variable as

(n/4) * 2 * MPI_SHM_BCASTSIZE

where n is the number of MPI processes on the node. If less memory is needed than this, then less memory is used. After the broadcast operation, the memory is returned to the general buffer pool.
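For example, with n = 16 processes on the node and MPI_SHM_BCASTSIZE set to 32768 (an illustrative value), a large broadcast could occupy up to (16/4) * 2 * 32768 = 262144 bytes of the general buffer pool while it is in progress.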

For reduce operations,

n * n * MPI_SHM_REDUCESIZE

bytes are borrowed from the general buffer pool.
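Continuing the illustration with n = 16 and MPI_SHM_REDUCESIZE set to 256 (again, an illustrative value), a reduce operation would borrow 16 * 16 * 256 = 65536 bytes from the general buffer pool.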

The broadcast and reduce operations are pipelined for very large messages. By increasing MPI_SHM_BCASTSIZE and MPI_SHM_REDUCESIZE, one can improve the efficiency of these collective operations for very large messages, but the amount of time it takes to fill the pipeline can also increase.

If MPI_SHM_GBPOOLSIZE proves to be too small and a collective operation happens to be unable to borrow memory from this pool, the operation will revert to slower algorithms. Hence, under certain circumstances, performance could dictate increasing MPI_SHM_GBPOOLSIZE.

Running over TCP

TCP ensures reliable dataflow, even over lossy networks, by retransmitting data as necessary. When the underlying network loses a lot of data, the rate of retransmission can be very high and delivered MPI performance will suffer accordingly. Increasing synchronization between senders and receivers by lowering the TCP rendezvous threshold with MPI_TCP_RENDVSIZE may help in certain cases. Generally, increased synchronization will hurt performance, but over a lossy network it may help mitigate catastrophic degradation.
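For example, on a demonstrably lossy network you might lower the TCP rendezvous threshold as follows (the value is illustrative; tune it for your network):

% setenv MPI_TCP_RENDVSIZE 49152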

If the network is not lossy, then lowering the rendezvous threshold would be counterproductive and, indeed, a Sun MPI safeguard (MPI_TCPSAFEGATHER) can be lifted. For reliable networks, use

% setenv MPI_TCPSAFEGATHER 0 

Remote Shared Memory (RSM) Point-to-Point Message Passing

The RSM protocol has some similarities with the shared memory protocol, but it also has substantial deviations, and environment variables are used differently.

The maximum size of a short message is MPI_RSM_SHORTMSGSIZE bytes, with a default value of 401 bytes. Short RSM messages can span multiple postboxes, but they still do not use any buffers.

The most data that will be sent under any one postbox for pipelined messages is MPI_RSM_PIPESIZE bytes. There are MPI_RSM_NUMPOSTBOX postboxes for each RSM connection.

If MPI_RSM_SBPOOLSIZE is unset, then each RSM connection has a buffer pool of MPI_RSM_CPOOLSIZE bytes. If MPI_RSM_SBPOOLSIZE is set, then each process has a pool of buffers that is MPI_RSM_SBPOOLSIZE bytes per remote node for sending messages to processes on the remote node.

Unlike the case of the shared-memory protocol, values of the MPI_RSM_PIPESIZE, MPI_RSM_CPOOLSIZE, and MPI_RSM_SBPOOLSIZE environment variables are merely requests. Values set with setenv, or printed when MPI_PRINTENV is used, might not reflect the values actually in effect. In particular, the RSM parameters are truly set only when connections are actually established. Indeed, the effective values could change over the course of program execution if lazy connections are employed.

Striping refers to passing messages over multiple links to get the speedup of their aggregate bandwidth. The number of stripes used is MPI_RSM_MAXSTRIPE or all physically available stripes, whichever is less.

Use of rendezvous for RSM messages is controlled with MPI_RSM_RENDVSIZE.

Memory Considerations

Memory is allocated on a node for each remote MPI process that sends messages to it over RSM. If np_local is the number of processes on a particular node, then the memory requirement on the node for RSM message passing from any one remote process is

np_local * ( MPI_RSM_NUMPOSTBOX * 128 + MPI_RSM_CPOOLSIZE )

bytes when MPI_RSM_SBPOOLSIZE is unset, and

np_local * MPI_RSM_NUMPOSTBOX * 128 + MPI_RSM_SBPOOLSIZE

bytes when MPI_RSM_SBPOOLSIZE is set.
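As an illustration, if np_local = 4, MPI_RSM_NUMPOSTBOX = 16, and MPI_RSM_CPOOLSIZE = 16384 (illustrative values, not necessarily the defaults), then, with MPI_RSM_SBPOOLSIZE unset and a single stripe, each sending remote process would require

4 * ( 16 * 128 + 16384 ) = 4 * 18432 = 73728

bytes on the node.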

The amount of memory actually allocated may be higher or lower than this requirement:

If less memory is allocated than is required, then requested values of MPI_RSM_CPOOLSIZE or MPI_RSM_SBPOOLSIZE may be reduced at run time. This can cause the requested value of MPI_RSM_PIPESIZE to be overridden as well.

Each remote MPI process requires its own allocation on the node as described above.

If multiple stripes are employed, the memory requirement increases correspondingly.

Performance Considerations

The pipe size should be at most half as big as the connection pool:

2 * MPI_RSM_PIPESIZE <= MPI_RSM_CPOOLSIZE

Otherwise, pipelined transfers will proceed slowly. The library adjusts MPI_RSM_PIPESIZE appropriately.

Reducing striping has no performance advantage, but varying MPI_RSM_MAXSTRIPE can give you insight into how application performance depends on internode bandwidth.
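For example, to gauge how sensitive your application is to internode bandwidth, you might run once with striping restricted to a single link and compare timings against an unrestricted run:

% setenv MPI_RSM_MAXSTRIPE 1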

For pipelined messages, a sender must synchronize with its receiver to ensure that remote writes to buffers have completed before postboxes are written. Long pipelined messages can absorb this synchronization cost, but performance for short pipelined messages will suffer. In some cases, raising MPI_RSM_SHORTMSGSIZE can mitigate this effect.