APPENDIX B

Sun MPI Environment Variables

This appendix describes some Sun MPI environment variables and their effects on program performance. It covers the following topics: yielding and descheduling, polling, shared-memory point-to-point message passing, shared-memory collectives, running over TCP, and a summary table of the environment variables.

Prescriptions for using MPI environment variables for performance tuning are provided in Chapter 7. Additional information on these and other environment variables can be found in the Sun MPI Programming and Reference Guide.

These environment variables are closely related to the details of the Sun MPI implementation, and their use requires an understanding of the implementation. More details on the Sun MPI implementation can be found in Appendix A.


Yielding and Descheduling

A blocking MPI communication call might not return until its operation has completed. If the operation has stalled, perhaps because there is insufficient buffer space to send or because there is no data ready to receive, Sun MPI will attempt to progress other outstanding, nonblocking messages. If no productive work can be performed, then in the most general case Sun MPI will yield the CPU to other processes, ultimately escalating to the point of descheduling the process by means of the spind daemon.

Setting MPI_COSCHED=0 specifies that processes should not be descheduled. This is the default behavior.

Setting MPI_SPIN=1 suppresses yields. The default value, 0, allows yields.
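
For example, on a system where each MPI process has its own dedicated processor, you might suppress both yields and descheduling before launching the job:

% setenv MPI_SPIN 1
% setenv MPI_COSCHED 0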


Polling

By default, Sun MPI polls generally for incoming messages, regardless of whether receives have been posted. To suppress general polling, use MPI_POLLALL=0.
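
For example, if your application typically posts its receives before the corresponding messages arrive, you might suppress general polling to reduce overhead:

% setenv MPI_POLLALL 0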


Shared-Memory Point-to-Point Message Passing

The size of each shared-memory buffer is fixed at 1 Kbyte. Most other quantities in shared-memory message passing are settable with MPI environment variables.

For any point-to-point message, Sun MPI will determine at runtime whether the message should be sent via shared memory, remote shared memory, or TCP. The flowchart in FIGURE B-1 illustrates what happens if a message of B bytes is to be sent over shared memory.


FIGURE B-1 Message of B Bytes Sent Over Shared Memory



For pipelined messages, MPI_SHM_PIPESIZE bytes are sent under the control of any one postbox. If the message is shorter than 2 x MPI_SHM_PIPESIZE bytes, the message is split roughly into halves.
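
For example, to move more data under the control of each postbox for pipelined messages, you might raise MPI_SHM_PIPESIZE from its default of 8192 bytes to an illustrative 16384 bytes (the value must be a multiple of 1024):

% setenv MPI_SHM_PIPESIZE 16384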

For cyclic messages, MPI_SHM_CYCLESIZE bytes are sent under the control of any one postbox, so that the footprint of the message in shared memory buffers is 2 x MPI_SHM_CYCLESIZE bytes.

The postbox area consists of MPI_SHM_NUMPOSTBOX postboxes per connection.

By default, each connection has its own pool of buffers, each pool of size MPI_SHM_CPOOLSIZE bytes.

By setting MPI_SHM_SBPOOLSIZE, users can specify that each sender has a pool of buffers, each pool having MPI_SHM_SBPOOLSIZE bytes, to be shared among its various connections. If MPI_SHM_CPOOLSIZE is also set, then any one connection can consume at most that many bytes from its send-buffer pool at any one time.
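
For example, to give each sender a single 1-Mbyte pool (an illustrative value) shared among its connections, while limiting any one connection to the default 24576 bytes at a time, you might set:

% setenv MPI_SHM_SBPOOLSIZE 1048576
% setenv MPI_SHM_CPOOLSIZE 24576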

Memory Considerations

In all, the size of the shared-memory area devoted to point-to-point messages is

n x ( n - 1 ) x ( MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + MPI_SHM_CPOOLSIZE )

bytes when per-connection pools are used (that is, when MPI_SHM_SBPOOLSIZE is not set), and

n x ( n - 1 ) x MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + n x MPI_SHM_SBPOOLSIZE

bytes when per-sender pools are used (that is, when MPI_SHM_SBPOOLSIZE is set).
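
As a rough illustration, with the default values (MPI_SHM_NUMPOSTBOX = 16, MPI_SHM_SHORTMSGSIZE = 256, and MPI_SHM_CPOOLSIZE = 24576) and n = 4 processes on a node, per-connection pools occupy

4 x ( 4 - 1 ) x ( 16 x ( 64 + 256 ) + 24576 ) = 356352

bytes, or 348 Kbytes, of shared memory for point-to-point messages.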

Performance Considerations

A sender should be able to deposit its message and complete its operation without waiting for any other process. You should typically set the shared-memory environment variables with this goal in mind.

In theory, rendezvous can improve performance for long messages if their receives are posted in a different order than their sends. In practice, the right set of conditions for overall performance improvement with rendezvous messages is rarely met.

Send-buffer pools can be used to provide reduced overall memory consumption for a particular value of MPI_SHM_CPOOLSIZE. If a process will only have outstanding messages to a few other processes at any one time, then set MPI_SHM_SBPOOLSIZE to the number of other processes times MPI_SHM_CPOOLSIZE. Multithreaded applications might suffer, however, since then a sender's threads would contend for a single send-buffer pool instead of for multiple, distinct connection pools.
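
For example, if each process is expected to have outstanding messages to at most four other processes at any one time, and MPI_SHM_CPOOLSIZE has its default value of 24576 bytes, you might set:

% setenv MPI_SHM_SBPOOLSIZE 98304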

Pipelining, including for cyclic messages, can roughly double the point-to-point bandwidth between two processes. This is a secondary performance effect, however, since processes tend to get considerably out of step with one another, and since the nodal backplane can become saturated with multiple processes exercising it at the same time.

Restrictions

The environment variables must satisfy the following restrictions:

( MPI_SHM_SHORTMSGSIZE - 8 ) x 1024 / 8

should be at least as large as

max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, MPI_SHM_CYCLESIZE )

MPI_SHM_CPOOLSIZE should be at least as large as

max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, 2 x MPI_SHM_CYCLESIZE )

MPI_SHM_SBPOOLSIZE should be at least as large as

( ( np - 1 ) + 1 ) x MPI_SHM_CYCLESIZE

where np is the number of MPI processes on the node.


Shared-Memory Collectives

Collective operations in Sun MPI are highly optimized and make use of a general buffer pool within shared memory. MPI_SHM_GBPOOLSIZE sets the amount of space, in bytes, available on a node for the "optimized" collectives. By default, it is set to 20971520 bytes. This space is used by MPI_Bcast(), MPI_Reduce(), MPI_Allreduce(), MPI_Reduce_scatter(), and MPI_Barrier(), provided that two or more of the MPI processes are on the node.

Memory is allocated from the general buffer pool in three different ways. For broadcast operations,

(n / 4) x 2 x MPI_SHM_BCASTSIZE

bytes are borrowed from the general buffer pool, where n is the number of MPI processes on the node. If less memory is needed than this, then less memory is used. After the broadcast operation, the memory is returned to the general buffer pool. For reduce operations,

n x n x MPI_SHM_REDUCESIZE

bytes are borrowed from the general buffer pool and returned after the operation.

In essence, MPI_SHM_BCASTSIZE and MPI_SHM_REDUCESIZE set the pipeline sizes for broadcast and reduce operations on large messages. Larger values can improve the efficiency of these operations for very large messages, but the amount of time it takes to fill the pipeline can also increase. Typically, the default values are suitable, but if your application relies exclusively on broadcasts or reduces of very large messages, then you can try doubling or quadrupling the corresponding environment variable using one of the following:


% setenv MPI_SHM_BCASTSIZE 65536 
% setenv MPI_SHM_BCASTSIZE 131072 
% setenv MPI_SHM_REDUCESIZE 512 
% setenv MPI_SHM_REDUCESIZE 1024

If MPI_SHM_GBPOOLSIZE proves to be too small and a collective operation happens to be unable to borrow memory from this pool, the operation will revert to slower algorithms. Hence, under certain circumstances, performance optimization could dictate increasing MPI_SHM_GBPOOLSIZE.
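
For example, to double the default general buffer pool (an illustrative value), you could set:

% setenv MPI_SHM_GBPOOLSIZE 41943040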


Running Over TCP

TCP ensures reliable dataflow, even over loss-prone networks, by retransmitting data as necessary. When the underlying network loses a lot of data, the rate of retransmission can be very high, and delivered MPI performance will suffer accordingly. Increasing synchronization between senders and receivers by lowering the TCP rendezvous threshold with MPI_TCP_RENDVSIZE might help in certain cases. Generally, increased synchronization will hurt performance, but over a loss-prone network it might help mitigate performance degradation.
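
For example, over a loss-prone network you might lower the rendezvous threshold from its default of 49152 bytes to an illustrative 16384 bytes:

% setenv MPI_TCP_RENDVSIZE 16384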

If the network is not lossy, then lowering the rendezvous threshold would be counterproductive and, indeed, a Sun MPI safeguard might be lifted. For reliable networks, use

% setenv MPI_TCP_SAFEGATHER 0

to speed MPI_Gather() and MPI_Gatherv() performance.


Summary Table of Environment Variables


TABLE B-1 Sun MPI Environment Variables

Name                    Units          Range               Default

Informational
MPI_PRINTENV            (None)         0 or 1              0
MPI_QUIET               (None)         0 or 1              0
MPI_SHOW_ERRORS         (None)         0 or 1              0
MPI_SHOW_INTERFACES     (None)         0 - 3               0

Shared Memory Point-to-Point
MPI_SHM_NUMPOSTBOX      Postboxes      >= 1                16
MPI_SHM_SHORTMSGSIZE    Bytes          Multiple of 64      256
MPI_SHM_PIPESIZE        Bytes          Multiple of 1024    8192
MPI_SHM_PIPESTART       Bytes          Multiple of 1024    2048
MPI_SHM_CYCLESIZE       Bytes          Multiple of 1024    8192
MPI_SHM_CYCLESTART      Bytes          --                  0x7fffffff (32-bit) or 0x7fffffffffffffff (64-bit); that is, by default there is no cyclic message passing
MPI_SHM_CPOOLSIZE       Bytes          Multiple of 1024    24576 if MPI_SHM_SBPOOLSIZE is not set; MPI_SHM_SBPOOLSIZE if it is set
MPI_SHM_SBPOOLSIZE      Bytes          Multiple of 1024    (Unset)

Shared Memory Collectives
MPI_SHM_BCASTSIZE       Bytes          Multiple of 128     32768
MPI_SHM_REDUCESIZE      Bytes          Multiple of 64      256
MPI_SHM_GBPOOLSIZE      Bytes          > 256               20971520

TCP
MPI_TCP_CONNTIMEOUT     Seconds        >= 0                600
MPI_TCP_CONNLOOP        Occurrences    >= 0                0
MPI_TCP_SAFEGATHER      (None)         0 or 1              1

One-Sided Communication
MPI_USE_AGENT_THREAD    (None)         0 or 1              0

Polling and Flow
MPI_FLOWCONTROL         Messages       >= 0                0
MPI_POLLALL             (None)         0 or 1              1

Dedicated Performance
MPI_PROCBIND            (None)         0 or 1              0
MPI_SPIN                (None)         0 or 1              0

Full vs. Lazy Connections
MPI_FULLCONNINIT        (None)         0 or 1              0

Eager vs. Rendezvous
MPI_EAGERONLY           (None)         0 or 1              1
MPI_SHM_RENDVSIZE       Bytes          >= 1                24576
MPI_TCP_RENDVSIZE       Bytes          >= 1                49152

Collectives
MPI_CANONREDUCE         (None)         0 or 1              0
MPI_OPTCOLL             (None)         0 or 1              1

Coscheduling
MPI_COSCHED             (None)         0 or 1              (Unset, or "2")
MPI_SPINDTIMEOUT        Milliseconds   >= 0                1000

Handles
MPI_MAXFHANDLES         Handles        >= 1                1024
MPI_MAXREQHANDLES       Handles        >= 1                1024