Sun MPI 4.0 Programming and Reference Guide

Chapter 2 The Sun MPI Library

This chapter describes the Sun MPI library:

"The Libraries"

"Sun MPI Routines"

"Programming With Sun MPI"

"Multithreaded Programming"

"Profiling Interface"

"MPE: Extensions to the Library"

The Libraries

Sun MPI comprises eight MPI libraries: four 32-bit versions and four 64-bit versions:

32- and 64-bit libraries - If you want to take advantage of the 64-bit capabilities of Sun MPI 4.0, you must explicitly link to the 64-bit libraries. The 32-bit libraries are the default in each category.

Note -
The 64-bit libraries are installed only when the installation system is running Solaris 7.

Thread-safe and non-thread-safe libraries - For multithreaded programs, the user must link with the thread-safe library in the appropriate category unless the program has only one thread calling MPI. For programs that are not multithreaded, the user can link against either the thread-safe or the default (non-thread-safe) library. However, non-multithreaded programs will have better performance using the default library, as it does not incur the extra overhead of providing thread-safety. Therefore, you should use the default libraries whenever possible for maximum performance.

Standard and trace libraries - The trace libraries are used to take advantage of Prism's MPI performance analysis features and to provide enhanced error reporting. These libraries are intended for development purposes only, as the overhead involved in their aggressive parameter-checking and probes degrades performance compared with the standard libraries.

The 32-bit libraries are the default, as are the standard (nontrace) libraries within the 32- or 64-bit categories. Within any given category (32- or 64-bit, standard or trace library), the non-thread-safe library is the default. For full information about linking to libraries, see "Compiling and Linking".

Sun MPI Routines

This section gives a brief description of the routines in the Sun MPI library. All the Sun MPI routines are listed in Appendix A, Sun MPI and Sun MPI I/O Routines with brief descriptions and their C syntax. For detailed descriptions of individual routines, see the man pages. For more complete information, see the MPI standard (see "Related Publications" of the preface).

Point-to-Point Routines

Point-to-point routines include the basic send and receive routines in both blocking and nonblocking forms and in four modes.

A blocking send blocks until its message buffer can be written with a new message. A blocking receive blocks until the received message is in the receive buffer.

Nonblocking sends and receives differ from blocking sends and receives in that they return immediately and their completion must be waited or tested for. It is expected that eventually nonblocking send and receive calls will allow the overlap of communication and computation.

MPI's four modes for point-to-point communication are:

Standard, in which the completion of a send implies that the message either is buffered internally or has been received. Users are free to overwrite the buffer that they passed in with any of the blocking send or receive routines, after the routine returns.

Buffered, in which the user guarantees a certain amount of buffering space.

Synchronous, in which rendezvous semantics occur between sender and receiver; that is, a send blocks until the corresponding receive has occurred.

Ready, in which a send can be started only if the matching receive is already posted. The ready mode for sends is a way for the programmer to notify the system that the receive has been posted, so that the underlying system can use a faster protocol if it is available.

Collective Communication

Collective communication routines are blocking routines that involve all processes in a communicator. Collective communication includes broadcasts and scatters, reductions and gathers, all-gathers and all-to-alls, scans, and a synchronizing barrier call.

Table 2-1 Collective Communication Routines


`MPI_Bcast`	Broadcasts from one process to all others in a communicator.
`MPI_Scatter`	Scatters from one process to all others in a communicator.
`MPI_Reduce`	Reduces from all to one in a communicator.
`MPI_Allreduce`	Reduces, then broadcasts result to all nodes in a communicator.
`MPI_Reduce_scatter`	Scatters a vector that contains results across the nodes in a communicator.
`MPI_Gather`	Gathers from all to one in a communicator.
`MPI_Allgather`	Gathers, then broadcasts the results of the gather in a communicator.
`MPI_Alltoall`	Performs a set of gathers in which each process receives a specific result in a communicator.
`MPI_Scan`	Scans (parallel prefix) across processes in a communicator.
`MPI_Barrier`	Synchronizes processes in a communicator (no data is transmitted).

Many of the collective communication calls have alternative vector forms, with which different amounts of data can be sent to or received from different processes.

The syntax and semantics of these routines are basically consistent with the point-to-point routines (upon which they are built), but there are restrictions to keep them from getting too complicated:

The amount of data sent must exactly match the amount of data specified by the receiver.

There is only one mode, a mode analogous to the standard mode of point-to-point routines.

Managing Groups, Contexts, and Communicators

A distinguishing feature of the MPI standard is that it includes a mechanism for creating separate worlds of communication, accomplished through communicators, contexts, and groups.

A communicator specifies a group of processes that will conduct communication operations within a specified context without affecting or being affected by operations occurring in other groups or contexts elsewhere in the program. A communicator also guarantees that, within any group and context, point-to-point and collective communication are isolated from each other.

A group is an ordered collection of processes. Each process has a rank in the group; the rank runs from 0 to n-1. A process can belong to more than one group; its rank in one group has nothing to do with its rank in any other group.

A context is the internal mechanism by which a communicator guarantees safe communication space to the group.

At program startup, two default communicators are defined: MPI_COMM_WORLD, which has as a process group all the processes of the job; and MPI_COMM_SELF, which is equivalent to an identity communicator. The process group that corresponds to MPI_COMM_WORLD is not predefined, but can be accessed using MPI_COMM_GROUP. One MPI_COMM_SELF communicator is defined for each process, each of which has rank zero in its own communicator. For many programs, these are the only communicators needed.

Communicators are of two kinds: intracommunicators, which conduct operations within a given group of processes; and intercommunicators, which conduct operations between two groups of processes.

Communicators provide a caching mechanism, which allows an application to attach attributes to communicators. Attributes can be user data or any other kind of information.

New groups and new communicators are constructed from existing ones. Group constructor routines are local, and their execution does not require interprocessor communication. Communicator constructor routines are collective, and their execution may require interprocess communication.

Note -

Users who do not need any communicator other than the default MPI_COMM_WORLD communicator -- that is, who do not need any sub- or supersets of processes -- can simply plug in MPI_COMM_WORLD wherever a communicator argument is requested. In these circumstances, users can ignore this section and the associated routines. (These routines can be identified from the listing in Appendix A, Sun MPI and Sun MPI I/O Routines.)

Data Types

All Sun MPI communication routines have a data type argument. These may be primitive data types, such as integers or floating-point numbers, or they may be user-defined, derived data types, which are specified in terms of primitive types.

Derived data types allow users to specify more general, mixed, and noncontiguous communication buffers, such as array sections and structures that contain combinations of primitive data types.

The basic data types that can be specified for the data-type argument correspond to the basic data types of the host language. Values for the data-type argument for Fortran and the corresponding Fortran types are listed in the following table.

Table 2-2 Possible Values for the Data Type Argument for Fortran


MPI Data Type	Fortran Data Type
`MPI_INTEGER`	`INTEGER`
`MPI_REAL`	`REAL`
`MPI_DOUBLE_PRECISION`	`DOUBLE PRECISION`
`MPI_COMPLEX`	`COMPLEX`
`MPI_LOGICAL`	`LOGICAL`
`MPI_CHARACTER`	`CHARACTER(1)`
`MPI_DOUBLE_COMPLEX`	`DOUBLE COMPLEX`
`MPI_REAL4`	`REAL*4`
`MPI_REAL8`	`REAL*8`
`MPI_INTEGER2`	`INTEGER*2`
`MPI_INTEGER4`	`INTEGER*4`
`MPI_BYTE`
`MPI_PACKED`

Values for the data-type argument in C and the corresponding C types are listed in the following table. .

Table 2-3 Possible Values for the Data Type Argument for C


MPI Data Type	C Data Type
`MPI_CHAR`	`signed char`
`MPI_SHORT`	`signed short int`
`MPI_INT`	`signed int`
`MPI_LONG`	`signed long int`
`MPI_UNSIGNED_CHAR`	`unsigned char`
`MPI_UNSIGNED_SHORT`	`unsigned short int`
`MPI_UNSIGNED`	`unsigned int`
`MPI_UNSIGNED_LONG`	`unsigned long int`
`MPI_FLOAT`	`float`
`MPI_DOUBLE`	`double`
`MPI_LONG_DOUBLE`	`long double`
`MPI_LONG_LONG_INT`	`long long int`
`MPI_BYTE`
`MPI_PACKED`

The data types MPI_BYTE and MPI_PACKED have no corresponding Fortran or C data types.

Persistent Communication Requests

Sometimes within an inner loop of a parallel computation, a communication with the same argument list is executed repeatedly. The communication can be slightly improved by using a persistent communication request, which reduces the overhead for communication between the process and the communication controller. A persistent request can be thought of as a communication port or "half-channel."

Managing Process Topologies

Process topologies are associated with communicators; they are optional attributes that can be given to an intracommunicator (not to an intercommunicator).

Recall that processes in a group are ranked from 0 to n-1. This linear ranking often reflects nothing of the logical communication pattern of the processes, which may be, for instance, a 2- or 3-dimensional grid. The logical communication pattern is referred to as a virtual topology (separate and distinct from any hardware topology). In MPI, there are two types of virtual topologies that can be created: Cartesian (grid) topology and graph topology.

You can use virtual topologies in your programs by taking physical processor organization into account to provide a ranking of processors that optimizes communications.

Environmental Inquiry Functions

Environmental inquiry functions include routines for starting up and shutting down, error-handling routines, and timers.

Few MPI routines may be called before MPI_Init or after MPI_Finalize. Examples include MPI_Initialized and MPI_Version. MPI_Finalize may be called only if there are no outstanding communications involving that process.

The set of errors handled by MPI is dependent upon the implementation. See Appendix B, Troubleshooting for tables listing the Sun MPI 4.0 error classes.

Programming With Sun MPI

Although there are about 190 (non-I/O) routines in the Sun MPI library, you can write programs for a wide range of problems using only six routines:

Table 2-4 Six Basic MPI Routines


`MPI_Init`	Initializes the MPI library.
`MPI_Finalize`	Finalizes the MPI library. This includes releasing resources used by the library.
`MPI_Comm_size`	Determines the number of processes in a specified communicator.
`MPI_Comm_rank`	Determines the rank of calling process within a communicator.
`MPI_Send`	Sends a message.
`MPI_Recv`	Receives a message.

This set of six routines includes the basic send and receive routines. Programs that depend heavily on collective communication may also include MPI_Bcast and MPI_Reduce.

The functionality of these routines means you can have the benefit of parallel operations without having to learn the whole library at once. As you become more familiar with programming for message passing, you can start learning the more complex and esoteric routines and add them to your programs as needed.

See "Sample Code", for two simple Sun MPI code samples, one in C and one in Fortran. See "Sun MPI Routines", for a complete list of Sun MPI routines.

Fortran Support

Sun MPI 4.0 provides basic Fortran support, as described in section 10.2 of the MPI-2 standard. Essentially, Fortran bindings and an mpif.h file are provided, as specified in the MPI-1 standard. The mpif.h file is valid for both fixed- and free-source form, as specified in the MPI-2 standard.

The MPI interface is known to violate the Fortran standard in several ways, which cause few problems for Fortran 77 programs. These standard violations can cause more significant problems for Fortran 90 programs, however, if you do not follow the guidelines recommended in the standard. If you are programming in Fortran, and particularly if you are using Fortran 90, you should consult section 10.2 of the MPI-2 standard for detailed information about basic Fortran support in an MPI implementation.

Recommendations for All-to-All and All-to-One Communication

The Sun MPI library uses the TCP protocol to communicate over a variety of networks. MPI depends on TCP to ensure reliable, correct data flow. TCP's reliability compensates for unreliability in the underlying network, as the TCP retransmission algorithms will handle any segments that are lost or corrupted. In most cases, this works well with good performance characteristics. However, when doing all-to-all and all-to-one communication over certain networks, a large number of TCP segments may be lost, resulting in poor performance.

You can compensate for this diminished performance over TCP in these ways:

When writing your own algorithms, avoid flooding one node with a lot of data.

If you need to do all-to-all or all-to-one communication, use one of the Sun MPI routines to do so. They are implemented in a way that avoids congesting a single node with lots of data. The following routines fall into this category:
- MPI_Alltoall and MPI_Alltoallv - These have been implemented using a pairwise communication pattern, so that every rank is communicating with only one other rank at a given time.
- MPI_Gather/MPI_Gatherv - The root process sends ready-to-send packets to each nonroot-rank process to tell the processes to send their data. In this way, the root process can regulate how much data it is receiving at any one time. Using this ready-to-send method is, however associated with a minor performance cost. For this reason, you can override this method by setting the MPI_TCPSAFEGATHER environment variable to 0. (See the Sun MPI user's guides for information about environment variables.)

Signals and MPI

When running the MPI library over TCP, nonfatal SIGPIPE signals may be generated. To handle them, the library sets the signal handler for SIGPIPE to ignore, overriding the default setting (terminate the process). In this way, the MPI library can recover in certain situations. You should therefore avoid changing the SIGPIPE signal handler.

The Sun MPI 4.0 Fortran and C⁺⁺ bindings are implemented as wrappers on top of the C bindings. The profiling interface is implemented using weak symbols. This means a profiling library need contain only a profiled version of C bindings.

The SIGPIPEs may occur when a process first starts communicating over TCP. This happens because the MPI library creates connections over TCP only when processes actually communicate with one another. There are some unavoidable conditions where SIGPIPEs may be generated when two processes establish a connection. If you want to avoid any SIGPIPEs, set the environment variable MPI_FULLCONNINIT, which creates all connections during MPI_Init() and avoids any situations which may generate a SIGPIPE. For more information about environment variables, see the Sun MPI user's guides.

Multithreaded Programming

When you are linked to one of the thread-safe libraries, Sun MPI calls are thread safe, in accordance with basic tenets of thread safety for MPI mentioned in the MPI-2 specification [Document for a Standard Message-Passing Interface. Please see the preface of this document for more information about this and other recommended reference material. ] . This means that:

When two concurrently running threads make MPI calls, the outcome will be as if the calls executed in some order.

Blocking MPI calls will block the calling thread only. A blocked calling thread will not prevent progress of other runnable threads on the same process, nor will it prevent them from executing MPI calls. Thus, multiple sends and receives are concurrent.

Guidelines for Thread-Safe Programming

Each thread within an MPI process may issue MPI calls; however, threads are not separately addressable. That is, the rank of a send or receive call identifies a process, not a thread, meaning that no order is defined for the case where two threads call MPI_Recv with the same tag and communicator. Such threads are said to be in conflict.

If threads within the same application post conflicting communication calls, data races will result. You can prevent such data races by using distinct communicators or tags for each thread.

In general, you will need to adhere to these guidelines:

You must not have a request serviced by more than one thread. Although you may have an operation posted in one thread and then completed in another, you may not have the operation completed in more than one thread.

A data type or communicator must not be freed by one thread while it is in use by another thread.

Once MPI_Finalize has been called, subsequent calls in any thread will fail.

You must ensure that a sufficient number of lightweight processes (LWPs) are available for your multithreaded program. Failure to do so may degrade performance or even result in deadlock.

You cannot stub the thread calls in your multithreaded program by omitting the threads libraries in the link line. The libmpi.so library automatically calls in the threads libraries, which effectively overrides any stubs.

The following sections describe more specific guidelines that apply for some routines. They also include some general considerations for collective calls and communicator operations that you should be aware of.

`MPI_Wait`, `MPI_Waitall`, `MPI_Waitany`, `MPI_Waitsome`

In a program where two or more threads call one of these routines, you must ensure that they are not waiting for the same request. Similarly, the same request cannot appear in the array of requests of multiple concurrent wait calls.

`MPI_Cancel`

One thread must not cancel a request while that request is being serviced by another thread.

`MPI_Probe`, `MPI_Iprobe`

A call to MPI_Probe or MPI_Iprobe from one thread on a given communicator should not have a source rank and tags that match those of any other probes or receives on the same communicator. Otherwise, correct matching of message to probe call may not occur.

Collective Calls

Collective calls are matched on a communicator according to the order in which the calls are issued at each processor. All the processes on a given communicator must make the same collective call. You can avoid the effects of this restriction on the threads on a given processor by using a different communicator for each thread.

No process that belongs to the communicator may omit making a particular collective call; that is, none should be left "dangling."

Communicator Operations

Each of the communicator functions operates simultaneously with each of the noncommunicator functions, regardless of what the parameters are and of whether the functions are on the same or different communicators. However, if you are using multiple instances of the same communicator function on the same communicator, where all parameters are the same, it cannot be determined which threads belong to which resultant communicator. Therefore, when concurrent threads issue such calls, you must assure that the calls are synchronized in such a way that threads in different processes participating in the same communicator operation are grouped together. Do this either by using a different base communicator for each call or by making the calls in single-thread mode before actually using them within the separate threads.

Please note also these special situations:

If you are using multiple instances of the same function with differing parameters and multiple threads, you must use different communicators. You must not use multiple instances of the same function on the same communicator with other differing parameters.

When using splits with multiple instances of the same function with the same parameters, but with different threads at the split, you must use different communicators.

For example, suppose you wish to produce several communicators in different sets of threads by performing MPI_Comm_split on some base communicator. To ensure proper, thread-safe operation, you should replicate the base communicator via MPI_Comm_dup (in the root thread or in one thread) and then perform MPI_Comm_split on the resulting duplicate communicators.

Do not free a communicator in one thread if it is still being used by another thread.

Error Handlers

When an error occurs as a result of an MPI call, the handler may not run on the same thread as the thread that made the error-raising call. In other words, you cannot assume that the error handler will execute in the local context of the thread that made the error-raising call. The error handler may be executed by another thread on the same process, distinct from the one that returns the error code. Therefore, you cannot rely on local variables for error handling in threads; instead, use global variables from the process.

Profiling Interface

Prism 6.0, a component of Sun HPC ClusterTools 3.0 software, can be used in conjunction with the TNF probes and libraries included with Sun MPI 4.0 for profiling your code. See Appendix C, TNF Probes for information about the TNF probes and "Choosing a Library Path" for information about linking to the "trace" or TNF libraries. See the Prism 6.0 User's Guide for more information about the TNF viewer built into Prism.

Sun MPI 4.0 also meets the requirements of the profiling interface described in Chapter 8 of the MPI-1 Standard. You may write your own profiling library or choose from a number of available profiling libraries, such as those included with the multiprocessing environment (MPE) from Argonne National Laboratory. (See "MPE: Extensions to the Library" for more information.) The User's Guide for mpich, a Portable Implementation of MPI, includes more detailed information about using profiling libraries. For information about this and other MPI- and MPICH-related publications, see "Related Publications".

The following figure illustrates how the software fits together. In this example, the user is linking against a profiling library that collects information on MPI_Send(). No profiling information is being collected for MPI_Recv().

To compile the program, the user's link line would look like this:

# cc ..... -llibrary-name -lmpi

Figure 2-1 Sun MPI Profiling Interface

MPE: Extensions to the Library

Although the Sun MPI library does not include or support the multiprocessing environment (MPE) available from Argonne National Laboratory (ANL), it is compatible with MPE. In case you would like to use these extensions to the MPI library, we have included some instructions for downloading it from ANL and building it yourself. Note that these procedures may change if ANL makes changes to MPE.

To Obtain and Build MPE

The MPE software is available from Argonne National Laboratory.

Use ftp to obtain the file.

ftp://ftp.mcs.anl.gov/pub/mpi/misc/mpe.tar.gz

The mpe.tar.gz file is about 240 Kbytes.

Use gunzip and tar to decompress the software.

# gunzip mpe.tar.gz
# tar xvf mpe.tar

Change your current working directory to the mpe directory, and execute configure with the arguments shown.

# cd mpe
# configure -cc=cc -fc=f77 -opt=-I/opt/SUNWhpc/include

Execute a make.

# make

This will build several libraries.