APPENDIX B

Environment Variables

Many environment variables are available for fine-tuning your Sun MPI environment. All Sun MPI environment variables are listed here with brief descriptions. The same descriptions are also available on the MPI man page. If you want to return to the default setting after setting a variable, simply unset it (using unsetenv). The effects of some of the variables are explained in more detail in the Sun HPC ClusterTools Software Performance Guide.

The environment variables are listed here in eight groups:


Informational Variables

MPI_PRINTENV

When set to 1, this variable causes other environment variables and the hpc.conf parameters associated with the MPI job to be printed. The default value is 0.

MPI_QUIET

If set to 1, this variable suppresses Sun MPI warning messages. The default value is 0.

MPI_SHOW_ERRORS

If set to 1, the MPI_ERRORS_RETURN error handler prints the error message and returns the error. The default value is 0.

MPI_SHOW_INTERFACES

When set to 1, 2, or 3, information regarding which interfaces are being used by an MPI application is printed to stdout. Set MPI_SHOW_INTERFACES to 1 to print the selected internode interface. Set it to 2 to print all the interfaces and their rankings. Set it to 3 for verbose output. The default value, 0, does not print information to stdout.
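
For example, to print all the interfaces and their rankings before a run (the process count and the program name a.out below are placeholders for your own job):

% setenv MPI_SHOW_INTERFACES 2
% mprun -np 4 a.out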


General Performance Tuning

MPI_POLLALL

When this variable is set to 1, the default value, all connections are polled for receives, also known as full polling. When set to 0, only those connections are polled where receives are posted. Full polling helps drain system buffers, lessening the chance of deadlock for "unsafe" codes. Well-written codes should set MPI_POLLALL to 0 for best performance.



Note - If MPI_POLLALL is set to 0 (zero) and your program performs an MPI_Send/MPI_Cancel without a corresponding MPI call on the receiving process, the MPI_Cancel may not succeed. Your program may hang.
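
For a well-written code in which every send is matched by a posted receive, full polling can be turned off for best performance (illustrative csh commands; the job invocation is a placeholder):

% setenv MPI_POLLALL 0
% mprun -np 4 a.out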



MPI_PROCBIND

Binds each MPI process to its own processor. The system administrator can allow or disable processor binding by setting the allow_pbind parameter in the CREOptions section of the hpc.conf file. If this parameter is set, the MPI_PROCBIND environment variable is disabled. Performance can be enhanced with processor binding, but very poor performance will result if processor binding is used for multithreaded jobs or for more than one job at a time.

By default, MPI_PROCBIND is set to 0, which turns off processor binding. To turn on processor binding, set the value to 1. With processor binding turned on, the processes in a job are assigned to all CPUs that are not already bound to other processes.

You can further control CPU binding by using these values for MPI_PROCBIND:

MPI_PROCBIND {P|L|T} [list | range]

To specify the CPUs to which the threads are bound, you use either a list or a range.

list
One or more CPU numbers following the P, L, or T specifier. If a node has fewer CPUs than a number you specify, CRE assigns the thread to the CPU whose number is the modulus of the number you specified divided by the number of CPUs. For example, a list entry of L12 on a node with only 8 CPUs results in the process being assigned to CPU number 4 (12 modulo 8). When you use a list, the CRE environment does not check whether those CPUs are already bound, so two threads could end up bound to the same CPU.

range
N-MxI. Specify the range with a starting number (N), an ending number (M), and a counting interval (I). The counting interval (I) is optional, and its default value is 1.

For example:

% setenv MPI_PROCBIND L0-11

The preceding setting searches through the first 12 CPUs and assigns processes to those that are unbound. If fewer CPUs are available than there are processes, the extra processes remain unbound.

The mprun -v (verbose) option prints the CPU assignments that have been made. You can also run pbind(1M) on each node to verify the CPU bindings.

MPI_SPIN

This variable sets the spin policy. The default value is 0, which causes MPI processes to spin nonaggressively, allowing best performance when the load is at least as great as the number of CPUs. A value of 1 causes MPI processes to spin aggressively, leading to best performance if extra CPUs are available on each node to handle system daemons and other background activities.
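
For example, if each node has idle CPUs beyond those used by the MPI job, aggressive spinning may give the best performance:

% setenv MPI_SPIN 1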


Tuning Memory for Point-to-Point Performance

MPI_RSM_CPOOLSIZE

This is the requested size, in bytes, to be allocated per stripe for the buffers of each RSM connection. This value can be overridden when connections are established, based on the size of the segment actually allocated. The default value is 256 Kbytes.

MPI_RSM_NUMPOSTBOX

This is the number of postboxes per stripe per RSM connection. The maximum number of postboxes depends on the value of rsm_maxsegsize. The default is 128 postboxes.

MPI_RSM_PIPESIZE

This limits the size (in bytes) of a message that can be sent over remote shared memory through the buffer list of one postbox per stripe. The default is 64 Kbytes. This size also depends on the block size used for sending data. The maximum size is min(cpoolsize/2, 10 * max(blk1sz, blk2sz)).

MPI_RSM_SBPOOLSIZE

If set, this variable specifies the requested size, in bytes, of each RSM send buffer pool. An RSM send buffer pool is the pool of buffers on a node that a remote process uses to send to processes on that node. A multiple of 1024 must be used. If unset, the size of the buffer pool is equal to cpoolsize times the number of processes per node. The maximum value allowed is maxsegsize minus the memory used for postboxes.
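
For example, to request a 1-Mbyte send buffer pool (an illustrative value chosen to satisfy the multiple-of-1024 requirement):

% setenv MPI_RSM_SBPOOLSIZE 1048576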

MPI_RSM_SHORTMSGSIZE

This variable specifies the maximum size, in bytes, of a message that will be sent by means of remote shared memory without using buffers. The default value is 3918 bytes. The upper limit is determined by the number of postboxes available.

MPI_RSM_STRONGPARTITION

If set to 1, the RSM protocol module uses strong partitioning to manage memory: every connection has a set of blocks in the buffer pool reserved for its communication. If set to 0, the default, the pool is shared by all the receivers.

MPI_SHM_CPOOLSIZE

This variable represents the amount of memory, in bytes, that can be allocated to each connection pool. When MPI_SHM_SBPOOLSIZE is not set, the default value is 24,576 bytes. Otherwise, the default value is MPI_SHM_SBPOOLSIZE.

MPI_SHM_CYCLESIZE

This variable represents the limit, in bytes, on the portion of a shared-memory message that will be sent via the buffer list of a single postbox during a cyclic transfer. The default value is 8192 bytes. A multiple of 1024 that is at most MPI_SHM_CPOOLSIZE/2 must be used.

MPI_SHM_CYCLESTART

Shared-memory transfers that are larger than MPI_SHM_CYCLESTART bytes are cyclic. The default value is 24,576 bytes.

MPI_SHM_NUMPOSTBOX

This variable represents the number of postboxes dedicated to each shared-memory connection. The default value is 16.

MPI_SHM_PIPESIZE

This variable represents the limit, in bytes, on the portion of a shared-memory message that will be sent via the buffer list of a single postbox during a pipeline transfer. The default value is 8192 bytes. The value must be a multiple of 1024.

MPI_SHM_PIPESTART

This variable represents the size, in bytes, at which shared-memory transfers start to be pipelined. The default value is 2048. Multiples of 1024 must be used.

MPI_SHM_SBPOOLSIZE

If set, this variable represents the size, in bytes, of the pool of shared-memory buffers dedicated to each sender. A multiple of 1024 must be used. If unset, then pools of shared-memory buffers are dedicated to connections rather than to senders.

MPI_SHM_SHORTMSGSIZE

This variable represents the size (in bytes) of the section of a postbox that contains either data or a buffer list. The default value is 256 bytes.



Note - If MPI_SHM_PIPESTART, MPI_SHM_PIPESIZE, or MPI_SHM_CYCLESIZE is increased to a size larger than 31,744 bytes, then MPI_SHM_SHORTMSGSIZE might also have to be increased. See the Sun HPC ClusterTools Software Performance Guide for more information.
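
As a sketch of such a combined change (the values below are illustrative only; they satisfy the multiple-of-1024 and MPI_SHM_CPOOLSIZE/2 constraints described above, but they are not tuned recommendations, and the appropriate MPI_SHM_SHORTMSGSIZE value should be taken from the Performance Guide):

% setenv MPI_SHM_CPOOLSIZE 262144
% setenv MPI_SHM_PIPESIZE 65536
% setenv MPI_SHM_CYCLESIZE 65536
% setenv MPI_SHM_SHORTMSGSIZE 1024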




Numerics

MPI_CANONREDUCE

Prevents reduction operations from using any optimizations that take advantage of the physical location of processors. This can provide more consistent results in the case of floating-point addition, for example. However, the operation can take longer to complete. The default value is 0, meaning that optimizations are allowed. To prevent optimizations, set the value to 1.
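
For example, to favor reproducible floating-point reductions over speed:

% setenv MPI_CANONREDUCE 1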


Synchronization of One-Sided Communications

MPI_USE_AGENT_THREAD

If the environment variable MPI_USE_AGENT_THREAD is set to 1, upon the first call to MPI_Win_create the Sun MPI library creates one agent thread for each process that needs such a thread. (If MPI_USE_AGENT_THREAD is not set, or is set to 0 [zero], no such thread is created.)

The two purposes of MPI_USE_AGENT_THREAD are to ensure progress in passive target RMA synchronization and to perform MPI RMA operations on local window memory on behalf of other processes when those processes do not have direct (shared-memory or RSM) access to window memory.

The agent thread does not run user code. Thread safety in the non-thread-safe MPI library is achieved by a monitor around MPI communication calls. If no windows requiring the use of an agent thread are active, the agent thread is suspended. If MPI_USE_AGENT_THREAD is not set, one-sided MPI operations can be delayed until the next synchronization call.

MPI_RSM_PUTSIZE

Use the environment variable MPI_RSM_PUTSIZE to control the method used to put small buffers over the RSM PM. Buffers passed to MPI_Put that are smaller than MPI_RSM_PUTSIZE are transferred to the remote side using a message protocol that has low latency but requires an implicit synchronization with the remote side, which may delay the sending side and reduce performance. The delay is usually small if both sides perform frequent small transfers. Buffers larger than MPI_RSM_PUTSIZE are transferred using a direct one-sided protocol, which does not require synchronization but has higher latency. The cost of the higher latency is not significant for large enough messages. Choose a value of 0 (zero) to always use the direct protocol, and a large value to always use the message protocol. The default value is 16 Kbytes.

MPI_RSM_GETSIZE

Use the environment variable MPI_RSM_GETSIZE to control the method used to get small buffers over the RSM PM. Buffers passed to MPI_Get that are larger than MPI_RSM_GETSIZE are transferred using a message protocol that achieves high bandwidth but requires an implicit synchronization with the remote side, which may delay the sending side and reduce performance. For large enough messages, the higher bandwidth offsets the cost of the synchronization delay. Buffers passed to MPI_Get that are smaller than MPI_RSM_GETSIZE are transferred using a direct one-sided protocol, which does not require synchronization but has lower bandwidth. For small enough messages, the lower bandwidth has little impact on performance. Choose a value of 0 (zero) to always use the message protocol, and a large value to always use the direct protocol. The default value is 16 Kbytes.
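
For example, to always use the direct one-sided protocol for both puts and gets, set MPI_RSM_PUTSIZE to 0 and MPI_RSM_GETSIZE to a large value (the value shown is only illustrative):

% setenv MPI_RSM_PUTSIZE 0
% setenv MPI_RSM_GETSIZE 1073741824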

 


Tuning Rendezvous

MPI_EAGERONLY

When this variable is set to 1, the default, only the eager protocol is used. When it is set to 0, both eager and rendezvous protocols are used.
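
For example, to allow the rendezvous protocol for messages larger than the thresholds described in the entries that follow:

% setenv MPI_EAGERONLY 0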

MPI_RSM_RENDVSIZE

Messages communicated by remote shared memory that are greater than this size use the rendezvous protocol unless the environment variable MPI_EAGERONLY is set to 1. The default value is 256 Kbytes.

MPI_SHM_RENDVSIZE

Messages communicated by shared memory that are greater than this size use the rendezvous protocol unless the environment variable MPI_EAGERONLY is set. The default value is 24,576 bytes.

MPI_TCP_RENDVSIZE

Messages communicated by TCP that contain data of this size and greater use the rendezvous protocol unless the environment variable MPI_EAGERONLY is set. The default value is 49,152 bytes.


MPProf

MPI_PROFILE

Setting this variable to 1 enables a profiling session. When profiling is enabled, profiling data for the MPI process ranks are written to a set of intermediate files, one file per process rank. MPProf also creates an index file of the form: mpprof.index.rm.jid (where rm is the resource manager and jid is the job ID) that contains pointers to the intermediate files of the form mpprof.n.rm.jid (where n is the process rank, rm is the resource manager, and jid is the job ID). If MPI_PROFILE is not set, program execution proceeds without generating profiling data.
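
A typical profiling session might look like the following sketch (the process count, the program name a.out, and the index file name are placeholders; the actual index file name depends on your resource manager and job ID, and the mpprof man page describes the command's options):

% setenv MPI_PROFILE 1
% mprun -np 4 a.out
% mpprof mpprof.index.rm.jid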

MPI_PROFDATADIR

By default, the temporary files generated during MPProf profiling are located in /usr/tmp/. To use an alternative location, set it as the value of the environment variable MPI_PROFDATADIR.

MPI_PROFINDEXDIR

By default, the index file for MPProf profiling is located in the current directory. To use an alternative, nondefault location, set it as the value of MPI_PROFINDEXDIR.

MPI_PROFINTERVAL

The variable MPI_PROFINTERVAL can be used to specify a time interval for controlling when snapshots of the profiling data will be written to the intermediate files.

Setting MPI_PROFINTERVAL to 0 forces a snapshot for every MPI call that is made. Setting MPI_PROFINTERVAL to Inf causes only one snapshot to be recorded at MPI_Finalize time. If MPI_PROFINTERVAL is unset or has no value, the default value of 60 seconds will be used.

If time intervals are used and an MPI program terminates before the MPI_Finalize call, any snapshots that were recorded can be used by mpprof to generate a profile of program operations up to the point of termination.
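
For example, to record a snapshot roughly every five minutes instead of every 60 seconds (illustrative value, in seconds):

% setenv MPI_PROFINTERVAL 300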

MPI_PROFMAXFILESIZE

This variable can be used to specify the maximum size, in Kbytes, that can be written to the intermediate files.

The default intermediate file size limit is 51,200 Kbytes (50 Mbytes). If a process records data that exceeds the file size limit, that write operation completes, but the process cannot record additional profiling data. Other intermediate files that have not reached the limit can continue to receive data. The file size limit can be removed altogether by setting MPI_PROFMAXFILESIZE to unlimited.


Miscellaneous

MPI_COSCHED

This variable specifies the user's preference regarding use of the spind daemon for coscheduling. The value can be 0 (prefer no use) or 1 (prefer use). This preference can be overridden by the system administrator's policy. This policy is set in the hpc.conf file and can be 0 (forbid use), 1 (require use), or 2 (no policy). If no policy is set and no user preference is specified, coscheduling is not used.



Note - If no user preference is specified, the value 2 is displayed when environment variables are printed with MPI_PRINTENV.



MPI_CHECK_ARGS

When this variable is set to 1, argument checking is performed on MPI calls, and errors are printed when they occur. The default is 0.

MPI_FLOWCONTROL

This variable limits the number of unexpected messages that can be queued from a particular connection. Once this quantity of unexpected messages has been received, polling the connection for incoming messages stops. The default value, 0, indicates that no limit is set. To limit flow, set the value to an integer greater than 0.

MPI_FULLCONNINIT

This variable ensures that all connections are established during initialization. By default, connections are established lazily. However, you can override this default by setting the environment variable MPI_FULLCONNINIT to 1, forcing full-connection initialization mode. The default value is 0.

MPI_MAXFHANDLES

This variable specifies the upper limit on the number of concurrently allocated Fortran handles for MPI objects other than requests. The default value is 1024. This variable is ignored in the default 32-bit library. Users should take care to free MPI objects that are no longer in use. There is no limit on handle allocation for C codes.

MPI_MAXPROCS

This variable overrides the value specified by maxprocs_default in hpc.conf; it cannot exceed the value specified by maxprocs_limit in hpc.conf. If the value does exceed the maxprocs_limit value, the job aborts with an error when the program calls MPI_Init. Note that the upper limit of support for RSM communication is 2048 processes.

MPI_MAXREQHANDLES

This variable specifies the upper limit on the number of concurrently allocated Fortran MPI request handles. The default value is 1024. This variable is ignored in the default 32-bit library. Users must take care to free up request handles by properly completing requests.

MPI_OPTCOLL

The MPI collectives are implemented using a variety of optimizations. Some of these optimizations can inhibit performance of point-to-point messages for "unsafe" programs. The default value of this variable, 1, indicates that optimized collectives are used. The optimizations can be turned off by setting the value to 0.

MPI_RSM_MAXSTRIPE

This variable specifies the maximum number of interfaces that can be used for striping data during communication over RSM. The value cannot exceed 64 or the number of installed interfaces, whichever is smaller. The default is 4.

MPI_SHM_BCASTSIZE

On SMPs, MPI_Bcast() is implemented for large messages using a double-buffering scheme. The size of each buffer (in bytes) is settable by using this environment variable. The default value is 32,768 bytes.

MPI_SHM_GBPOOLSIZE

This variable represents the amount of memory available, in bytes, to the general buffer pool for use by collective operations. The default value is 20,971,520 bytes.

MPI_SHM_REDUCESIZE

On SMPs, calling MPI_Reduce() causes all processors to participate in the reduce. Each processor works on a piece of data equal to the MPI_SHM_REDUCESIZE setting. The default value is 256 bytes.

You must take care when setting this variable because the system reserves MPI_SHM_REDUCESIZE * np * np memory to execute the reduce.
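
For example, with the default MPI_SHM_REDUCESIZE of 256 bytes and np = 16 processes, the system reserves 256 * 16 * 16 = 65,536 bytes for the reduce; doubling the setting (an illustrative value) doubles that reservation:

% setenv MPI_SHM_REDUCESIZE 512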

MPI_SPINDTIMEOUT

When coscheduling is enabled, this variable limits the length of time (in milliseconds) a message remains in the poll waiting for the spind daemon to return. If the timeout occurs before the daemon finds any messages, the process reenters the polling loop. The default value is 1000 milliseconds. A default can also be set by a system administrator in the hpc.conf file.

MPI_TCP_CONNLOOP

This variable sets the number of times MPI_TCP_CONNTIMEOUT occurs before signaling an error. The default value for this variable is 0, meaning that the program aborts on the first occurrence of MPI_TCP_CONNTIMEOUT.

MPI_TCP_CONNTIMEOUT

This variable sets the timeout value in seconds that is used for an accept() call. The default value for this variable is 600 seconds (10 minutes). This timeout can be triggered in both full- and lazy-connection initialization. After the timeout is reached, a warning message is printed. If MPI_TCP_CONNLOOP is set to 0, then the first timeout causes the program to abort.
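
For example, to use a shorter five-minute accept() timeout and allow it to occur twice before an error is signaled (illustrative values):

% setenv MPI_TCP_CONNTIMEOUT 300
% setenv MPI_TCP_CONNLOOP 2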

MPI_TCP_SAFEGATHER

This variable allows use of a congestion-avoidance algorithm for MPI_Gather() and MPI_Gatherv() over TCP. By default, MPI_TCP_SAFEGATHER is set to 1, which means that use of this algorithm is on. If you know that your underlying network can handle gathering large amounts of data on a single node, you might want to override this algorithm by setting MPI_TCP_SAFEGATHER to 0.