APPENDIX B

Sun MPI Environment Variables

This appendix describes some Sun MPI environment variables and their effects on program performance. It covers the following topics: yielding and descheduling, polling, shared-memory point-to-point message passing, shared-memory collectives, running over TCP, and a summary table of the environment variables.

Prescriptions for using MPI environment variables for performance tuning are provided in Chapter 7. Additional information on these and other environment variables can be found in the Sun MPI Programming and Reference Guide.

These environment variables are closely related to the details of the Sun MPI implementation, and their use requires an understanding of the implementation. More details on the Sun MPI implementation can be found in Appendix A.


Yielding and Descheduling

A blocking MPI communication call might not return until its operation has completed. If the operation has stalled, perhaps because there is insufficient buffer space to send or because there is no data ready to receive, Sun MPI will attempt to progress other outstanding, nonblocking messages. If no productive work can be performed, then in the most general case Sun MPI will yield the CPU to other processes, ultimately escalating to the point of descheduling the process by means of the spind daemon.

Setting MPI_COSCHED=0 specifies that processes should not be descheduled. This is the default behavior.

Setting MPI_SPIN=1 suppresses yields. The default value, 0, allows yields.
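
For example, on a system where each MPI process has its own dedicated processor, you might suppress both yields and descheduling before launching the job:

% setenv MPI_SPIN 1
% setenv MPI_COSCHED 0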


Polling

By default, Sun MPI polls generally for incoming messages, regardless of whether receives have been posted. To suppress general polling, use MPI_POLLALL=0.
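
For example, if your application typically posts its receives before the corresponding messages arrive, you might suppress general polling to reduce overhead:

% setenv MPI_POLLALL 0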


Shared-Memory Point-to-Point Message Passing

The size of each shared-memory buffer is fixed at 1 Kbyte. Most other quantities in shared-memory message passing are settable with MPI environment variables.

For any point-to-point message, Sun MPI will determine at runtime whether the message should be sent via shared memory, remote shared memory, or TCP. The flowchart in FIGURE B-1 illustrates what happens if a message of B bytes is to be sent over shared memory.


FIGURE B-1 Message of B Bytes Sent Over Shared Memory



For pipelined messages, MPI_SHM_PIPESIZE bytes are sent under the control of any one postbox. If the message is shorter than 2 x MPI_SHM_PIPESIZE bytes, the message is split roughly into halves.
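
For example, to move more data under the control of each postbox for pipelined messages, you might raise MPI_SHM_PIPESIZE from its default of 8192 bytes to an illustrative 16384 bytes (the value must be a multiple of 1024):

% setenv MPI_SHM_PIPESIZE 16384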

For cyclic messages, MPI_SHM_CYCLESIZE bytes are sent under the control of any one postbox, so that the footprint of the message in shared memory buffers is 2 x MPI_SHM_CYCLESIZE bytes.

The postbox area consists of MPI_SHM_NUMPOSTBOX postboxes per connection.

By default, each connection has its own pool of buffers, each pool of size MPI_SHM_CPOOLSIZE bytes.

By setting MPI_SHM_SBPOOLSIZE, users can specify that each sender has a pool of buffers, each pool having MPI_SHM_SBPOOLSIZE bytes, to be shared among its various connections. If MPI_SHM_CPOOLSIZE is also set, then any one connection can consume at most that many bytes from its send-buffer pool at any one time.
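
For example, to give each sender a single 1-Mbyte pool (an illustrative value) shared among its connections, while limiting any one connection to the default 24576 bytes at a time, you might set:

% setenv MPI_SHM_SBPOOLSIZE 1048576
% setenv MPI_SHM_CPOOLSIZE 24576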

Memory Considerations

In all, the size of the shared-memory area devoted to point-to-point messages is

n x ( n - 1 ) x ( MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + MPI_SHM_CPOOLSIZE )

bytes when per-connection pools are used (that is, when MPI_SHM_SBPOOLSIZE is not set), and

n x ( n - 1 ) x MPI_SHM_NUMPOSTBOX x ( 64 + MPI_SHM_SHORTMSGSIZE ) + n x MPI_SHM_SBPOOLSIZE

bytes when per-sender pools are used (that is, when MPI_SHM_SBPOOLSIZE is set).
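
As a rough illustration, with the default values (MPI_SHM_NUMPOSTBOX = 16, MPI_SHM_SHORTMSGSIZE = 256, and MPI_SHM_CPOOLSIZE = 24576) and n = 4 processes on a node, per-connection pools occupy

4 x ( 4 - 1 ) x ( 16 x ( 64 + 256 ) + 24576 ) = 356352

bytes, or 348 Kbytes, of shared memory for point-to-point messages.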

Performance Considerations

A sender should be able to deposit its message and complete its operation without waiting for any other process. You should typically set the shared-memory environment variables with this goal in mind.

In theory, rendezvous can improve performance for long messages if their receives are posted in a different order than their sends. In practice, the right set of conditions for overall performance improvement with rendezvous messages is rarely met.

Send-buffer pools can be used to provide reduced overall memory consumption for a particular value of MPI_SHM_CPOOLSIZE. If a process will only have outstanding messages to a few other processes at any one time, then set MPI_SHM_SBPOOLSIZE to the number of other processes times MPI_SHM_CPOOLSIZE. Multithreaded applications might suffer, however, since then a sender's threads would contend for a single send-buffer pool instead of for multiple, distinct connection pools.
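
For example, if each process is expected to have outstanding messages to at most four other processes at any one time, and MPI_SHM_CPOOLSIZE has its default value of 24576 bytes, you might set:

% setenv MPI_SHM_SBPOOLSIZE 98304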

Pipelining, including for cyclic messages, can roughly double the point-to-point bandwidth between two processes. This is a secondary performance effect, however, since processes tend to get considerably out of step with one another, and since the nodal backplane can become saturated with multiple processes exercising it at the same time.

Restrictions

The environment variables must satisfy the following restrictions:

( MPI_SHM_SHORTMSGSIZE - 8 ) x 1024 / 8

should be at least as large as

max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, MPI_SHM_CYCLESIZE )

MPI_SHM_CPOOLSIZE should be at least as large as

max( MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, 2 x MPI_SHM_CYCLESIZE )

MPI_SHM_SBPOOLSIZE should be at least as large as

( ( np - 1 ) + 1 ) x MPI_SHM_CYCLESIZE

where np is the number of MPI processes on the node.


Shared-Memory Collectives

Collective operations in Sun MPI are highly optimized and make use of a general buffer pool within shared memory. MPI_SHM_GBPOOLSIZE sets the amount of space, in bytes, available on a node for the "optimized" collectives. By default, it is set to 20971520 bytes. This space is used by MPI_Bcast(), MPI_Reduce(), MPI_Allreduce(), MPI_Reduce_scatter(), and MPI_Barrier(), provided that two or more of the MPI processes are on the node.

Memory is allocated from the general buffer pool in three different ways. For broadcast operations,

(n / 4) x 2 x MPI_SHM_BCASTSIZE

bytes are borrowed from the general buffer pool, where n is the number of MPI processes on the node. If less memory is needed than this, then less memory is used. After the broadcast operation, the memory is returned to the general buffer pool. For reduce operations,

n x n x MPI_SHM_REDUCESIZE

bytes are borrowed from the general buffer pool and returned after the operation.

In essence, MPI_SHM_BCASTSIZE and MPI_SHM_REDUCESIZE set the pipeline sizes for broadcast and reduce operations on large messages. Larger values can improve the efficiency of these operations for very large messages, but the amount of time it takes to fill the pipeline can also increase. Typically, the default values are suitable, but if your application relies exclusively on broadcasts or reduces of very large messages, then you can try doubling or quadrupling the corresponding environment variable using one of the following:


% setenv MPI_SHM_BCASTSIZE 65536 
% setenv MPI_SHM_BCASTSIZE 131072 
% setenv MPI_SHM_REDUCESIZE 512 
% setenv MPI_SHM_REDUCESIZE 1024

If MPI_SHM_GBPOOLSIZE proves to be too small and a collective operation happens to be unable to borrow memory from this pool, the operation will revert to slower algorithms. Hence, under certain circumstances, performance optimization could dictate increasing MPI_SHM_GBPOOLSIZE.
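
For example, to double the default general buffer pool (an illustrative value), you could set:

% setenv MPI_SHM_GBPOOLSIZE 41943040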


Running Over TCP

TCP ensures reliable dataflow, even over loss-prone networks, by retransmitting data as necessary. When the underlying network loses a lot of data, the rate of retransmission can be very high, and delivered MPI performance will suffer accordingly. Increasing synchronization between senders and receivers by lowering the TCP rendezvous threshold with MPI_TCP_RENDVSIZE might help in certain cases. Generally, increased synchronization will hurt performance, but over a loss-prone network it might help mitigate performance degradation.
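
For example, over a loss-prone network you might lower the rendezvous threshold from its default of 49152 bytes to an illustrative 16384 bytes:

% setenv MPI_TCP_RENDVSIZE 16384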

If the network is not lossy, then lowering the rendezvous threshold would be counterproductive and, indeed, a Sun MPI safeguard might be lifted. For reliable networks, use

% setenv MPI_TCP_SAFEGATHER 0

to speed MPI_Gather() and MPI_Gatherv() performance.


Summary Table of Environment Variables


TABLE B-1 Sun MPI Environment Variables

Name                    Units          Range               Default

Informational
MPI_PRINTENV            (None)         0 or 1              0
MPI_QUIET               (None)         0 or 1              0
MPI_SHOW_ERRORS         (None)         0 or 1              0
MPI_SHOW_INTERFACES     (None)         0 - 3               0

Shared Memory Point-to-Point
MPI_SHM_NUMPOSTBOX      Postboxes      >= 1                16
MPI_SHM_SHORTMSGSIZE    Bytes          Multiple of 64      256
MPI_SHM_PIPESIZE        Bytes          Multiple of 1024    8192
MPI_SHM_PIPESTART       Bytes          Multiple of 1024    2048
MPI_SHM_CYCLESIZE       Bytes          Multiple of 1024    8192
MPI_SHM_CYCLESTART      Bytes          --                  0x7fffffff (32-bit) or 0x7fffffffffffffff (64-bit); that is, by default there is no cyclic message passing
MPI_SHM_CPOOLSIZE       Bytes          Multiple of 1024    24576 if MPI_SHM_SBPOOLSIZE is not set; MPI_SHM_SBPOOLSIZE if it is set
MPI_SHM_SBPOOLSIZE      Bytes          Multiple of 1024    (Unset)

Shared Memory Collectives
MPI_SHM_BCASTSIZE       Bytes          Multiple of 128     32768
MPI_SHM_REDUCESIZE      Bytes          Multiple of 64      256
MPI_SHM_GBPOOLSIZE      Bytes          > 256               20971520

TCP
MPI_TCP_CONNTIMEOUT     Seconds        >= 0                600
MPI_TCP_CONNLOOP        Occurrences    >= 0                0
MPI_TCP_SAFEGATHER      (None)         0 or 1              1

One-Sided Communication
MPI_USE_AGENT_THREAD    (None)         0 or 1              0

Polling and Flow
MPI_FLOWCONTROL         Messages       >= 0                0
MPI_POLLALL             (None)         0 or 1              1

Dedicated Performance
MPI_PROCBIND            (None)         0 or 1              0
MPI_SPIN                (None)         0 or 1              0

Full vs. Lazy Connections
MPI_FULLCONNINIT        (None)         0 or 1              0

Eager vs. Rendezvous
MPI_EAGERONLY           (None)         0 or 1              1
MPI_SHM_RENDVSIZE       Bytes          >= 1                24576
MPI_TCP_RENDVSIZE       Bytes          >= 1                49152

Collectives
MPI_CANONREDUCE         (None)         0 or 1              0
MPI_OPTCOLL             (None)         0 or 1              1

Coscheduling
MPI_COSCHED             (None)         0 or 1              (Unset, or "2")
MPI_SPINDTIMEOUT        Milliseconds   >= 0                1000

Handles
MPI_MAXFHANDLES         Handles        >= 1                1024
MPI_MAXREQHANDLES       Handles        >= 1                1024