Sun MPI 4.0 Programming and Reference Guide

Chapter 2 The Sun MPI Library

This chapter describes the Sun MPI library:

The Libraries

Sun MPI comprises eight MPI libraries, four 32-bit versions and four 64-bit versions: within each word size there is a standard (nontrace) library, a thread-safe library, a trace library, and a thread-safe trace library.

The 32-bit libraries are the default, as are the standard (nontrace) libraries within the 32- or 64-bit categories. Within any given category (32- or 64-bit, standard or trace library), the non-thread-safe library is the default. For full information about linking to libraries, see "Compiling and Linking".

Sun MPI Routines

This section gives a brief description of the routines in the Sun MPI library. All the Sun MPI routines are listed in Appendix A, Sun MPI and Sun MPI I/O Routines, with brief descriptions and their C syntax. For detailed descriptions of individual routines, see the man pages. For more complete information, see the MPI standard (see "Related Publications" in the preface).

Point-to-Point Routines

Point-to-point routines include the basic send and receive routines in both blocking and nonblocking forms and in four modes.

A blocking send blocks until its message buffer can be written with a new message. A blocking receive blocks until the received message is in the receive buffer.

Nonblocking sends and receives differ from blocking sends and receives in that they return immediately; their completion must be waited for or tested. It is expected that eventually nonblocking send and receive calls will allow the overlap of communication and computation.
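
For illustration, the following minimal C sketch (it is not one of this guide's code samples, and it assumes the job runs with at least two processes) has rank 0 post a nonblocking receive, perform other work, and then wait for completion, while rank 1 issues a blocking, standard-mode send.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, value = 0;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* The nonblocking receive returns immediately; the message
           may not have arrived yet. */
        MPI_Irecv(&value, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &request);
        /* ... computation that does not touch 'value' can proceed here ... */
        MPI_Wait(&request, &status);   /* after this, 'value' is valid */
        printf("rank 0 received %d\n", value);
    } else if (rank == 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 99, MPI_COMM_WORLD);   /* blocking */
    }

    MPI_Finalize();
    return 0;
}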

MPI's four modes for point-to-point communication are:

  * Standard mode, in which a send may be buffered by the system; its completion does not necessarily mean that the matching receive has started.
  * Buffered mode, in which the send may use buffer space that the application has explicitly provided with MPI_Buffer_attach.
  * Synchronous mode, in which the send completes only after the matching receive has started.
  * Ready mode, in which the send may be started only if the matching receive has already been posted.

Collective Communication

Collective communication routines are blocking routines that involve all processes in a communicator. Collective communication includes broadcasts and scatters, reductions and gathers, all-gathers and all-to-alls, scans, and a synchronizing barrier call.

Table 2-1 Collective Communication Routines

MPI_Bcast             Broadcasts from one process to all others in a communicator.
MPI_Scatter           Scatters from one process to all others in a communicator.
MPI_Reduce            Reduces from all to one in a communicator.
MPI_Allreduce         Reduces, then broadcasts result to all nodes in a communicator.
MPI_Reduce_scatter    Scatters a vector that contains results across the nodes in a communicator.
MPI_Gather            Gathers from all to one in a communicator.
MPI_Allgather         Gathers, then broadcasts the results of the gather in a communicator.
MPI_Alltoall          Performs a set of gathers in which each process receives a specific result in a communicator.
MPI_Scan              Scans (parallel prefix) across processes in a communicator.
MPI_Barrier           Synchronizes processes in a communicator (no data is transmitted).

Many of the collective communication calls have alternative vector forms, with which different amounts of data can be sent to or received from different processes.
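
As an illustration of a vector form, the following sketch (not taken from this guide; the limit of eight processes is assumed only to keep the buffers static) uses MPI_Gatherv to gather a different number of integers from each process onto the root.

#include <mpi.h>

#define MAXPROCS 8
#define MAXDATA  (MAXPROCS * MAXPROCS)

int main(int argc, char *argv[])
{
    int rank, size, i;
    int sendbuf[MAXPROCS], recvbuf[MAXDATA];
    int recvcounts[MAXPROCS], displs[MAXPROCS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* assumes size <= MAXPROCS */

    for (i = 0; i <= rank; i++)
        sendbuf[i] = rank;                    /* rank+1 copies of this rank */

    for (i = 0; i < size; i++) {
        recvcounts[i] = i + 1;                /* process i contributes i+1 integers */
        displs[i] = (i * (i + 1)) / 2;        /* offset of process i's data at the root */
    }

    /* The root (rank 0) receives a different amount from each process. */
    MPI_Gatherv(sendbuf, rank + 1, MPI_INT,
                recvbuf, recvcounts, displs, MPI_INT, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}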

The syntax and semantics of these routines are basically consistent with the point-to-point routines (upon which they are built), but there are restrictions to keep them from getting too complicated:

Managing Groups, Contexts, and Communicators

A distinguishing feature of the MPI standard is that it includes a mechanism for creating separate worlds of communication, accomplished through communicators, contexts, and groups.

A communicator specifies a group of processes that will conduct communication operations within a specified context without affecting or being affected by operations occurring in other groups or contexts elsewhere in the program. A communicator also guarantees that, within any group and context, point-to-point and collective communication are isolated from each other.

A group is an ordered collection of processes. Each process has a rank in the group; the rank runs from 0 to n-1. A process can belong to more than one group; its rank in one group has nothing to do with its rank in any other group.

A context is the internal mechanism by which a communicator guarantees safe communication space to the group.

At program startup, two default communicators are defined: MPI_COMM_WORLD, whose process group contains all the processes of the job; and MPI_COMM_SELF, which is equivalent to an identity communicator. The process group that corresponds to MPI_COMM_WORLD is not predefined, but it can be accessed using MPI_Comm_group. One MPI_COMM_SELF communicator is defined for each process; each process has rank zero in its own MPI_COMM_SELF. For many programs, these are the only communicators needed.

Communicators are of two kinds: intracommunicators, which conduct operations within a given group of processes; and intercommunicators, which conduct operations between two groups of processes.

Communicators provide a caching mechanism, which allows an application to attach attributes to communicators. Attributes can be user data or any other kind of information.

New groups and new communicators are constructed from existing ones. Group constructor routines are local, and their execution does not require interprocessor communication. Communicator constructor routines are collective, and their execution may require interprocess communication.
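
For example, the following sketch (not from this guide; the static array assumes no more than 128 processes) extracts the group of MPI_COMM_WORLD with the local routine MPI_Comm_group, builds a subgroup of the even-ranked processes with MPI_Group_incl, and then creates a communicator for that subgroup with the collective routine MPI_Comm_create.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, nevens = 0;
    int evens[64];                        /* assumes at most 128 processes */
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (i = 0; i < size; i += 2)
        evens[nevens++] = i;              /* ranks to keep in the new group */

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);              /* local     */
    MPI_Group_incl(world_group, nevens, evens, &even_group);   /* local     */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);   /* collective */

    if (even_comm != MPI_COMM_NULL)       /* odd ranks receive MPI_COMM_NULL */
        MPI_Comm_free(&even_comm);
    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);

    MPI_Finalize();
    return 0;
}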


Note -

Users who do not need any communicator other than the default MPI_COMM_WORLD communicator -- that is, who do not need any sub- or supersets of processes -- can simply plug in MPI_COMM_WORLD wherever a communicator argument is requested. In these circumstances, users can ignore this section and the associated routines. (These routines can be identified from the listing in Appendix A, Sun MPI and Sun MPI I/O Routines.)


Data Types

All Sun MPI communication routines take a data type argument. The data types may be primitive, such as integers or floating-point numbers, or they may be user-defined, derived data types, which are specified in terms of primitive types.

Derived data types allow users to specify more general, mixed, and noncontiguous communication buffers, such as array sections and structures that contain combinations of primitive data types.
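
For example, the following sketch (not from this guide; it assumes at least two processes) builds a derived data type that describes one column of a 4x4 C matrix, a noncontiguous array section, and sends that column in a single message.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, i, j;
    double a[4][4];
    MPI_Datatype column;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            a[i][j] = 4 * i + j;

    /* Four blocks of one double each, with a stride of four doubles
       between blocks: one column of the row-major 4x4 matrix. */
    MPI_Type_vector(4, 1, 4, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&a[0][1], 1, column, 1, 0, MPI_COMM_WORLD);       /* column 1 */
    else if (rank == 1)
        MPI_Recv(&a[0][1], 1, column, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}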

The basic data types that can be specified for the data-type argument correspond to the basic data types of the host language. Values for the data-type argument for Fortran and the corresponding Fortran types are listed in the following table.

Table 2-2 Possible Values for the Data Type Argument for Fortran

MPI Data Type           Fortran Data Type
MPI_INTEGER             INTEGER
MPI_REAL                REAL
MPI_DOUBLE_PRECISION    DOUBLE PRECISION
MPI_COMPLEX             COMPLEX
MPI_LOGICAL             LOGICAL
MPI_CHARACTER           CHARACTER(1)
MPI_DOUBLE_COMPLEX      DOUBLE COMPLEX
MPI_REAL4               REAL*4
MPI_REAL8               REAL*8
MPI_INTEGER2            INTEGER*2
MPI_INTEGER4            INTEGER*4
MPI_BYTE
MPI_PACKED

Values for the data-type argument in C and the corresponding C types are listed in the following table.

Table 2-3 Possible Values for the Data Type Argument for C

MPI Data Type           C Data Type
MPI_CHAR                signed char
MPI_SHORT               signed short int
MPI_INT                 signed int
MPI_LONG                signed long int
MPI_UNSIGNED_CHAR       unsigned char
MPI_UNSIGNED_SHORT      unsigned short int
MPI_UNSIGNED            unsigned int
MPI_UNSIGNED_LONG       unsigned long int
MPI_FLOAT               float
MPI_DOUBLE              double
MPI_LONG_DOUBLE         long double
MPI_LONG_LONG_INT       long long int
MPI_BYTE
MPI_PACKED

The data types MPI_BYTE and MPI_PACKED have no corresponding Fortran or C data types.

Persistent Communication Requests

Sometimes within an inner loop of a parallel computation, a communication with the same argument list is executed repeatedly. The communication can be slightly improved by using a persistent communication request, which reduces the overhead for communication between the process and the communication controller. A persistent request can be thought of as a communication port or "half-channel."
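
The following sketch (not from this guide; the tag, message length, and iteration count are arbitrary) sets up a persistent send request once with MPI_Send_init and then restarts it with MPI_Start on every pass through the loop.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, iter, i, buf[100];
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Bind the argument list to a persistent request once ... */
        MPI_Send_init(buf, 100, MPI_INT, 1, 7, MPI_COMM_WORLD, &request);
        for (iter = 0; iter < 10; iter++) {
            for (i = 0; i < 100; i++)
                buf[i] = iter;             /* this iteration's data */
            MPI_Start(&request);           /* ... then restart it each time */
            MPI_Wait(&request, &status);   /* the request remains allocated */
        }
        MPI_Request_free(&request);
    } else if (rank == 1) {
        for (iter = 0; iter < 10; iter++)
            MPI_Recv(buf, 100, MPI_INT, 0, 7, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}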

Managing Process Topologies

Process topologies are associated with communicators; they are optional attributes that can be given to an intracommunicator (not to an intercommunicator).

Recall that processes in a group are ranked from 0 to n-1. This linear ranking often reflects nothing of the logical communication pattern of the processes, which may be, for instance, a 2- or 3-dimensional grid. The logical communication pattern is referred to as a virtual topology (separate and distinct from any hardware topology). In MPI, there are two types of virtual topologies that can be created: Cartesian (grid) topology and graph topology.

You can use virtual topologies in your programs by taking physical processor organization into account to provide a ranking of processors that optimizes communications.
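
For example, the following sketch (not from this guide) creates a periodic two-dimensional Cartesian topology, letting the library both choose the grid dimensions and reorder ranks to suit the underlying hardware, and then finds each process's neighbors along the first dimension.

#include <mpi.h>

int main(int argc, char *argv[])
{
    int size, myrank, dims[2] = { 0, 0 }, periods[2] = { 1, 1 };
    int coords[2], left, right;
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Dims_create(size, 2, dims);        /* choose a balanced 2-D grid        */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);

    MPI_Comm_rank(grid, &myrank);          /* rank may differ after reordering  */
    MPI_Cart_coords(grid, myrank, 2, coords);
    MPI_Cart_shift(grid, 0, 1, &left, &right);   /* neighbors in dimension 0    */

    MPI_Comm_free(&grid);
    MPI_Finalize();
    return 0;
}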

Environmental Inquiry Functions

Environmental inquiry functions include routines for starting up and shutting down, error-handling routines, and timers.

Only a few MPI routines may be called before MPI_Init or after MPI_Finalize; examples include MPI_Initialized and MPI_Get_version. MPI_Finalize may be called only if there are no outstanding communications involving that process.
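
For example, the following sketch (not from this guide) calls MPI_Initialized before MPI_Init and uses the MPI_Wtime and MPI_Wtick timer routines.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int flag;
    double t0, t1;

    MPI_Initialized(&flag);     /* legal before MPI_Init; flag will be 0 here */
    MPI_Init(&argc, &argv);

    t0 = MPI_Wtime();
    /* ... work to be timed ... */
    t1 = MPI_Wtime();
    printf("elapsed %f seconds (timer resolution %g)\n", t1 - t0, MPI_Wtick());

    MPI_Finalize();
    return 0;
}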

The set of errors handled by MPI is dependent upon the implementation. See Appendix B, Troubleshooting for tables listing the Sun MPI 4.0 error classes.

Programming With Sun MPI

Although there are about 190 (non-I/O) routines in the Sun MPI library, you can write programs for a wide range of problems using only six routines:

Table 2-4 Six Basic MPI Routines

MPI_Init          Initializes the MPI library.
MPI_Finalize      Finalizes the MPI library. This includes releasing resources used by the library.
MPI_Comm_size     Determines the number of processes in a specified communicator.
MPI_Comm_rank     Determines the rank of the calling process within a communicator.
MPI_Send          Sends a message.
MPI_Recv          Receives a message.

This set of six routines includes the basic send and receive routines. Programs that depend heavily on collective communication may also include MPI_Bcast and MPI_Reduce.

The functionality of these routines means you can have the benefit of parallel operations without having to learn the whole library at once. As you become more familiar with programming for message passing, you can start learning the more complex and esoteric routines and add them to your programs as needed.
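
The following minimal sketch (it is not one of the samples in "Sample Code"; it assumes the job runs with at least two processes) uses only these six routines:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 && size > 1) {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("rank 0 received %d from rank 1\n", value);
    } else if (rank == 1) {
        value = rank;
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}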

See "Sample Code", for two simple Sun MPI code samples, one in C and one in Fortran. See "Sun MPI Routines", for a complete list of Sun MPI routines.

Fortran Support

Sun MPI 4.0 provides basic Fortran support, as described in section 10.2 of the MPI-2 standard. Essentially, Fortran bindings and an mpif.h file are provided, as specified in the MPI-1 standard. The mpif.h file is valid for both fixed- and free-source form, as specified in the MPI-2 standard.

The MPI interface is known to violate the Fortran standard in several ways, which cause few problems for Fortran 77 programs. These standard violations can cause more significant problems for Fortran 90 programs, however, if you do not follow the guidelines recommended in the standard. If you are programming in Fortran, and particularly if you are using Fortran 90, you should consult section 10.2 of the MPI-2 standard for detailed information about basic Fortran support in an MPI implementation.

Recommendations for All-to-All and All-to-One Communication

The Sun MPI library uses the TCP protocol to communicate over a variety of networks. MPI depends on TCP to ensure reliable, correct data flow. TCP's reliability compensates for unreliability in the underlying network, as the TCP retransmission algorithms will handle any segments that are lost or corrupted. In most cases, this works well with good performance characteristics. However, when doing all-to-all and all-to-one communication over certain networks, a large number of TCP segments may be lost, resulting in poor performance.

You can compensate for this diminished performance over TCP in these ways:

Signals and MPI

When running the MPI library over TCP, nonfatal SIGPIPE signals may be generated. To handle them, the library sets the signal handler for SIGPIPE to ignore, overriding the default setting (terminate the process). In this way, the MPI library can recover in certain situations. You should therefore avoid changing the SIGPIPE signal handler.

SIGPIPEs may occur when a process first starts communicating over TCP, because the MPI library creates TCP connections only when processes actually communicate with one another. There are some unavoidable conditions under which SIGPIPEs may be generated when two processes establish a connection. If you want to avoid SIGPIPEs altogether, set the environment variable MPI_FULLCONNINIT, which causes all connections to be created during MPI_Init() and avoids any situation that might generate a SIGPIPE. For more information about environment variables, see the Sun MPI user's guides.

Multithreaded Programming

When you are linked to one of the thread-safe libraries, Sun MPI calls are thread safe, in accordance with the basic tenets of thread safety for MPI mentioned in the MPI-2 specification (Document for a Standard Message-Passing Interface; see the preface of this document for more information about this and other recommended reference material). This means that:

Guidelines for Thread-Safe Programming

Each thread within an MPI process may issue MPI calls; however, threads are not separately addressable. That is, the rank of a send or receive call identifies a process, not a thread, meaning that no order is defined for the case where two threads call MPI_Recv with the same tag and communicator. Such threads are said to be in conflict.

If threads within the same application post conflicting communication calls, data races will result. You can prevent such data races by using distinct communicators or tags for each thread.
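
For example, in the following sketch (not from this guide; it assumes a thread-safe library and exactly two processes), each process starts two receiver threads, and each thread receives only messages that carry its own tag, so the concurrent receives cannot conflict.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

static int peer;                     /* rank of the other process */

static void *recv_worker(void *arg)
{
    int tag = *(int *)arg;           /* the tag doubles as the thread's identity */
    int data;
    MPI_Status status;

    /* Only messages carrying this thread's tag match this receive, so the
       two receiver threads cannot intercept each other's messages. */
    MPI_Recv(&data, 1, MPI_INT, peer, tag, MPI_COMM_WORLD, &status);
    printf("thread %d received %d\n", tag, data);
    return NULL;
}

int main(int argc, char *argv[])
{
    int rank, size, i, msg, tags[2] = { 0, 1 };
    pthread_t tid[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2)
        MPI_Abort(MPI_COMM_WORLD, 1);     /* this sketch assumes two processes */
    peer = (rank + 1) % 2;

    for (i = 0; i < 2; i++)
        pthread_create(&tid[i], NULL, recv_worker, &tags[i]);

    /* The main thread sends one message per tag to the peer process. */
    for (i = 0; i < 2; i++) {
        msg = 100 * rank + i;
        MPI_Send(&msg, 1, MPI_INT, peer, i, MPI_COMM_WORLD);
    }

    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    MPI_Finalize();
    return 0;
}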

In general, you will need to adhere to these guidelines:

The following sections describe more specific guidelines that apply to some routines. They also include some general considerations for collective calls and communicator operations that you should be aware of.

MPI_Wait, MPI_Waitall, MPI_Waitany, MPI_Waitsome

In a program where two or more threads call one of these routines, you must ensure that they are not waiting for the same request. Similarly, the same request cannot appear in the array of requests of multiple concurrent wait calls.

MPI_Cancel

One thread must not cancel a request while that request is being serviced by another thread.

MPI_Probe, MPI_Iprobe

A call to MPI_Probe or MPI_Iprobe from one thread on a given communicator should not have a source rank and tags that match those of any other probes or receives on the same communicator. Otherwise, correct matching of message to probe call may not occur.

Collective Calls

Collective calls are matched on a communicator according to the order in which the calls are issued at each processor. All the processes on a given communicator must make the same collective call. You can avoid the effects of this restriction on the threads on a given processor by using a different communicator for each thread.

No process that belongs to the communicator may omit making a particular collective call; that is, none should be left "dangling."

Communicator Operations

Each of the communicator functions can operate concurrently with each of the noncommunicator functions, regardless of what the parameters are and of whether the functions are on the same or different communicators. However, if multiple instances of the same communicator function run concurrently on the same communicator with the same parameters, it cannot be determined which threads belong to which resultant communicator. Therefore, when concurrent threads issue such calls, you must ensure that the calls are synchronized in such a way that threads in different processes participating in the same communicator operation are grouped together. Do this either by using a different base communicator for each call or by making the calls in single-thread mode before actually using them within the separate threads.

Please note also these special situations:

For example, suppose you wish to produce several communicators in different sets of threads by performing MPI_Comm_split on some base communicator. To ensure proper, thread-safe operation, you should replicate the base communicator via MPI_Comm_dup (in the root thread or in one thread) and then perform MPI_Comm_split on the resulting duplicate communicators.
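
The following sketch (not from this guide; the two worker threads and the even/odd split are arbitrary choices) illustrates that pattern: the base communicator is duplicated once per thread in the main thread, and each thread then splits only its own duplicate.

#include <mpi.h>
#include <pthread.h>

#define NTHREADS 2

static MPI_Comm dup_comm[NTHREADS];       /* one duplicate per thread */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    int rank;
    MPI_Comm subcomm;

    MPI_Comm_rank(dup_comm[id], &rank);
    /* Each thread splits only its own duplicate, so concurrent splits in
       different threads cannot be confused with one another. */
    MPI_Comm_split(dup_comm[id], rank % 2, rank, &subcomm);
    /* ... use subcomm ... */
    MPI_Comm_free(&subcomm);
    return NULL;
}

int main(int argc, char *argv[])
{
    int i, ids[NTHREADS];
    pthread_t tid[NTHREADS];

    MPI_Init(&argc, &argv);

    /* Duplicate the base communicator in a single thread, before the
       worker threads are started. */
    for (i = 0; i < NTHREADS; i++)
        MPI_Comm_dup(MPI_COMM_WORLD, &dup_comm[i]);

    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    for (i = 0; i < NTHREADS; i++)
        MPI_Comm_free(&dup_comm[i]);

    MPI_Finalize();
    return 0;
}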

Error Handlers

When an error occurs as a result of an MPI call, the handler might not run on the thread that made the error-raising call; that is, you cannot assume that the error handler will execute in the local context of that thread. The error handler may be executed by another thread on the same process, distinct from the one that returned the error code. Therefore, you cannot rely on local variables for error handling in threads; instead, use variables that are global to the process.

Profiling Interface

Prism 6.0, a component of Sun HPC ClusterTools 3.0 software, can be used in conjunction with the TNF probes and libraries included with Sun MPI 4.0 for profiling your code. See Appendix C, TNF Probes for information about the TNF probes and "Choosing a Library Path" for information about linking to the "trace" or TNF libraries. See the Prism 6.0 User's Guide for more information about the TNF viewer built into Prism.

Sun MPI 4.0 also meets the requirements of the profiling interface described in Chapter 8 of the MPI-1 Standard. You may write your own profiling library or choose from a number of available profiling libraries, such as those included with the multiprocessing environment (MPE) from Argonne National Laboratory. (See "MPE: Extensions to the Library" for more information.) The User's Guide for mpich, a Portable Implementation of MPI, includes more detailed information about using profiling libraries. For information about this and other MPI- and MPICH-related publications, see "Related Publications".

The Sun MPI 4.0 Fortran and C++ bindings are implemented as wrappers on top of the C bindings. The profiling interface is implemented using weak symbols. This means a profiling library need contain only a profiled version of the C bindings.
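
As an illustration, a user-written profiling wrapper for MPI_Send might look like the following minimal sketch (this is not code shipped with Sun MPI): it records a statistic and then forwards the call to the underlying implementation through the PMPI_Send entry point. Compiled into a profiling library and linked ahead of -lmpi, it intercepts the application's calls to MPI_Send.

#include <mpi.h>

static int send_calls = 0;       /* hypothetical statistic kept by the wrapper */

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    send_calls++;                /* profiling action */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}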

The following figure illustrates how the software fits together. In this example, the user is linking against a profiling library that collects information on MPI_Send(). No profiling information is being collected for MPI_Recv().

To compile the program, the user's link line would look like this:

# cc ..... -llibrary-name -lmpi
Figure 2-1 Sun MPI Profiling Interface

MPE: Extensions to the Library

Although the Sun MPI library does not include or support the multiprocessing environment (MPE) available from Argonne National Laboratory (ANL), it is compatible with MPE. If you would like to use these extensions to the MPI library, the following instructions describe how to download MPE from ANL and build it yourself. Note that these procedures may change if ANL makes changes to MPE.

To Obtain and Build MPE

The MPE software is available from Argonne National Laboratory.

  1. Use ftp to obtain the file.

ftp://ftp.mcs.anl.gov/pub/mpi/misc/mpe.tar.gz 

The mpe.tar.gz file is about 240 Kbytes.

  2. Use gunzip and tar to decompress the software.

# gunzip mpe.tar.gz
# tar xvf mpe.tar
  3. Change your current working directory to the mpe directory, and execute configure with the arguments shown.

# cd mpe
# configure -cc=cc -fc=f77 -opt=-I/opt/SUNWhpc/include
  4. Execute a make.

# make

This will build several libraries.


Note -

Sun MPI does not include the MPE error handlers. You must call the debug routines MPE_Errors_call_dbx_in_xterm() and MPE_Signals_call_debugger() yourself.


Please refer to the User's Guide for mpich, a Portable Implementation of MPI, for information on how to use MPE. It is available at the Argonne National Laboratory web site:

http://www.mcs.anl.gov/mpi/mpich/