Sun MPI 4.0 Programming and Reference Guide

Routines

This release of Sun MPI includes all the MPI I/O routines, which are defined in Chapter 9, "I/O," of the MPI-2 specification. (See the preface for information about this specification.)

Code samples that use many of these routines are provided in "Sample Code".

File Manipulation

Collective coordination: MPI_File_open, MPI_File_close, MPI_File_set_size, MPI_File_preallocate

Noncollective coordination: MPI_File_delete, MPI_File_get_size, MPI_File_get_group, MPI_File_get_amode

MPI_File_open and MPI_File_close are collective operations that open and close a file, respectively -- that is, all processes in a communicator group must together open or close a file. To achieve a single-user, UNIX-like open, set the communicator to MPI_COMM_SELF.

MPI_File_delete deletes a specified file.

The routines MPI_File_set_size, MPI_File_get_size, MPI_File_get_group, and MPI_File_get_amode get and set information about a file. When using the collective routine MPI_File_set_size on a UNIX file, if the size that is set is smaller than the current file size, the file is truncated at the position defined by size. If size is set to be larger than the current file size, the file size becomes size.

When the file size is increased this way with MPI_File_set_size, new regions are created in the file with displacements between the old file size and the larger, newly set file size. Sun MPI I/O does not necessarily allocate file space for such new regions. You may reserve file space either by using MPI_File_preallocate or by performing a read or write to unallocated bytes. MPI_File_preallocate ensures that storage space is allocated for a set quantity of bytes for the specified file; however, its use is very "expensive" in terms of performance and disk space.

The routine MPI_File_get_group returns a communicator group, but it does not free the group.
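
The following sketch pulls these file-manipulation routines together. It is illustrative only: the file name scratch.dat and the 1-Mbyte preallocation size are arbitrary, and error checking is omitted.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_File   fh;
        MPI_Offset size;

        MPI_Init(&argc, &argv);

        /* Single-process, UNIX-like open: the communicator is MPI_COMM_SELF. */
        MPI_File_open(MPI_COMM_SELF, "scratch.dat",
                      MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

        /* Reserve storage for 1 Mbyte up front (note: potentially expensive). */
        MPI_File_preallocate(fh, 1048576);

        MPI_File_get_size(fh, &size);
        printf("file size is now %lld bytes\n", (long long)size);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }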

File Info

Noncollective coordination: MPI_File_get_info

Collective coordination: MPI_File_set_info

The opaque info object allows you to provide hints for optimization of your code, making it run faster or more efficiently, for example. These hints are set for each file, using the MPI_File_open, MPI_File_set_view, MPI_File_set_info, and MPI_File_delete routines. MPI_File_set_info sets new values for the specified file's hints. MPI_File_get_info returns all the hints that the system currently associates with the specified file.

When using UNIX files, Sun MPI I/O provides four hints for controlling how much buffer space it uses to satisfy I/O requests: noncoll_read_bufsize, noncoll_write_bufsize, coll_read_bufsize, and coll_write_bufsize. These hints may be tuned for your particular hardware configuration and application to improve performance for both noncollective and collective data accesses. For example, if your application uses a single MPI I/O call to request multiple noncontiguous chunks that form a regular strided pattern in the file, you may want to adjust the noncoll_write_bufsize to match the size of the stride. Note that these hints limit the size of MPI I/O's underlying buffers but do not limit how much data a user can read or write in a single request.
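
As a sketch of how such a hint is attached to a file, the fragment below assumes fh is a file handle that has already been opened; the 4-Mbyte value for coll_write_bufsize is only an illustration and should be tuned for your configuration.

    #include <mpi.h>

    /* Attach a collective-write buffer hint to an already open file,
     * then query the hints the system currently associates with it. */
    void tune_coll_write(MPI_File fh)
    {
        MPI_Info info, current;
        int      nkeys;

        MPI_Info_create(&info);
        MPI_Info_set(info, "coll_write_bufsize", "4194304");  /* illustrative value */
        MPI_File_set_info(fh, info);
        MPI_Info_free(&info);

        MPI_File_get_info(fh, &current);      /* all hints now in effect */
        MPI_Info_get_nkeys(current, &nkeys);  /* e.g., to inspect them   */
        MPI_Info_free(&current);
    }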

File Views

Noncollective coordination: MPI_File_get_view

Collective coordination: MPI_File_set_view

The MPI_File_set_view routine changes the process's view of the data in the file, specifying its displacement, elementary data type, and file type, as well as setting the individual file pointers and shared file pointer to 0. MPI_File_set_view is a collective routine; all processes in the group must pass identical values for the file handle and the elementary data type, although the values for the displacement, the file type, and the info object may vary. However, if you use the data-access routines that use file positioning with a shared file pointer, you must also give the displacement and the file type identical values. The data types passed in as the elementary data type and the file type must be committed.

You can also specify the type of data representation for the file. See "File Interoperability" for information about registering data representation identifiers.
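
A minimal sketch of setting a view follows; it assumes fh was opened collectively and gives each process an interleaved, strided view of the file built from a committed vector file type. The block size and count are arbitrary.

    #include <mpi.h>

    void set_interleaved_view(MPI_File fh, int rank, int nprocs)
    {
        MPI_Datatype filetype;
        MPI_Offset   disp = (MPI_Offset)rank * 4 * (MPI_Offset)sizeof(int);

        /* 1000 blocks of 4 ints each, one block per stride of 4*nprocs ints. */
        MPI_Type_vector(1000, 4, 4 * nprocs, MPI_INT, &filetype);
        MPI_Type_commit(&filetype);   /* the file type must be committed */

        /* Elementary data type MPI_INT, native data representation, no hints. */
        MPI_File_set_view(fh, disp, MPI_INT, filetype, "native", MPI_INFO_NULL);

        MPI_Type_free(&filetype);
    }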


Note -

Displacements within the file type and the elementary data type must be monotonically nondecreasing.


Data Access

The 35 data-access routines are categorized according to file positioning. Data access can be achieved by any of these three methods of file positioning: explicit offsets, individual file pointers, or shared file pointers.

In the following subsections, each of these methods is discussed in more detail.

While blocking I/O calls will not return until the request is completed, nonblocking calls do not wait for the I/O request to complete. A separate "request complete" call, such as MPI_Test or MPI_Wait, is needed to confirm that the buffer is ready to be used again. Nonblocking routines have the prefix MPI_File_i, where the i stands for immediate.

All the nonblocking collective routines for data access are "split" into two routines, each with _begin or _end as a suffix. These split collective routines are subject to the semantic rules described in Section 9.4.5 of the MPI-2 standard.
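
As a sketch of the nonblocking pattern, the fragment below starts an immediate read at an explicit offset (using one of the routines described in the next subsection) and completes it later with MPI_Wait; fh, offset, buf, and count are assumed to be supplied by the caller.

    #include <mpi.h>

    /* Start an immediate (nonblocking) read at an explicit offset, do other
     * work, then wait for the request before touching the buffer. */
    void overlap_read(MPI_File fh, MPI_Offset offset, int *buf, int count)
    {
        MPI_Request request;
        MPI_Status  status;

        MPI_File_iread_at(fh, offset, buf, count, MPI_INT, &request);

        /* ... unrelated computation may proceed here ... */

        MPI_Wait(&request, &status);  /* buf is safe to use only after this */
    }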

Data Access With Explicit Offsets

Blocking, noncollective coordination: MPI_File_read_at, MPI_File_write_at

Blocking, collective coordination: MPI_File_read_at_all, MPI_File_write_at_all

Nonblocking, noncollective coordination: MPI_File_iread_at, MPI_File_iwrite_at

Split collective, collective coordination: MPI_File_read_at_all_begin, MPI_File_read_at_all_end, MPI_File_write_at_all_begin, MPI_File_write_at_all_end

To access data at an explicit offset, specify the position in the file where the next data access for each process should begin. For each call to a data-access routine, a process attempts to access a specified number of file types of a specified data type (starting at the specified offset) into a specified user buffer.

The offset is measured in elementary data type units relative to the current view; moreover, "holes" are not counted when locating an offset. The data is read from (in the case of a read) or written into (in the case of a write) those parts of the file specified by the current view. These routines store the number of buffer elements of a particular data type actually read (or written) in the status object, and all the other fields associated with the status object are undefined. The number of elements that are read or written can be accessed using MPI_Get_count.

MPI_File_read_at attempts to read from the file via the associated file handle returned from a successful MPI_File_open. Similarly, MPI_File_write_at attempts to write data from a user buffer to a file. MPI_File_iread_at and MPI_File_iwrite_at are the nonblocking versions of MPI_File_read_at and MPI_File_write_at, respectively.

MPI_File_read_at_all and MPI_File_write_at_all are collective versions of MPI_File_read_at and MPI_File_write_at, in which each process provides an explicit offset. The split collective versions of these nonblocking routines are listed in the table at the beginning of this section.
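
A minimal sketch of an explicit-offset read follows, assuming the file was opened with the default view (so the offset is measured in bytes) and that each process owns a contiguous 100-int block; MPI_Get_count reports how many elements were actually read.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process reads its own 100-int block at a rank-dependent offset. */
    void read_my_block(MPI_File fh, int rank)
    {
        int        buf[100];
        int        nread;
        MPI_Status status;
        MPI_Offset offset = (MPI_Offset)rank * 100 * (MPI_Offset)sizeof(int);

        MPI_File_read_at(fh, offset, buf, 100, MPI_INT, &status);

        MPI_Get_count(&status, MPI_INT, &nread);   /* elements actually read */
        printf("process %d read %d ints\n", rank, nread);
    }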

Data Access With Individual File Pointers

Blocking, noncollective coordination: MPI_File_read, MPI_File_write

Blocking, collective coordination: MPI_File_read_all, MPI_File_write_all

Nonblocking, noncollective coordination: MPI_File_iread, MPI_File_iwrite

Split collective, collective coordination: MPI_File_read_all_begin, MPI_File_read_all_end, MPI_File_write_all_begin, MPI_File_write_all_end

For each open file, Sun MPI I/O maintains one individual file pointer per process per collective MPI_File_open. These data-access routines implicitly use the value of the individual file pointer; they use and update only that pointer, advancing it to point to the next elementary data item after the one most recently accessed. The individual file pointer is updated relative to the current view of the file. The shared file pointer is neither used nor updated. (For data access with shared file pointers, see the next section.)

These routines have similar semantics to the explicit-offset data-access routines, except that the offset is defined here to be the current value of the individual file pointer.

MPI_File_read_all and MPI_File_write_all are collective versions of MPI_File_read and MPI_File_write, with each process using its individual file pointer.

MPI_File_iread and MPI_File_iwrite are the nonblocking versions of MPI_File_read and MPI_File_write, respectively. The split collective versions of MPI_File_read_all and MPI_File_write_all are listed in the table at the beginning of this section.
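
As a sketch of the implicit pointer update, the fragment below (with fh, rec1, rec2, and count supplied by the caller) writes two records back to back; no offset is passed, because the individual file pointer advances past the data just written.

    #include <mpi.h>

    /* Two successive writes through the individual file pointer: the second
     * record lands immediately after the first. */
    void write_two_records(MPI_File fh, int *rec1, int *rec2, int count)
    {
        MPI_Status status;

        MPI_File_write(fh, rec1, count, MPI_INT, &status);
        MPI_File_write(fh, rec2, count, MPI_INT, &status);
    }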

Pointer Manipulation

MPI_File_seek
MPI_File_get_position
MPI_File_get_byte_offset

Each process can call the routine MPI_File_seek to update its individual file pointer according to the update mode. The update mode has the following possible values: MPI_SEEK_SET (the pointer is set to offset), MPI_SEEK_CUR (the pointer is set to the current pointer position plus offset), and MPI_SEEK_END (the pointer is set to the end of the file plus offset).

The offset can be negative for backwards seeking, but you cannot seek to a negative position in the file. The current position is defined as the elementary data item immediately following the last-accessed data item.

MPI_File_get_position returns the current position of the individual file pointer relative to the current displacement and file type.

MPI_File_get_byte_offset converts the offset specified for the current view to the displacement value, or absolute byte position, for the file.
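
The following sketch rewinds the individual file pointer and then queries its position in elementary-data-type units and the corresponding absolute byte offset; fh is assumed to be open with a view already set.

    #include <mpi.h>

    void rewind_and_query(MPI_File fh)
    {
        MPI_Offset pos, byte_offset;

        MPI_File_seek(fh, 0, MPI_SEEK_SET);              /* pointer := offset (0)       */
        MPI_File_get_position(fh, &pos);                 /* position in etype units     */
        MPI_File_get_byte_offset(fh, pos, &byte_offset); /* absolute byte position      */
    }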

Data Access With Shared File Pointers

Blocking, noncollective coordination: MPI_File_read_shared, MPI_File_write_shared

Blocking, collective coordination: MPI_File_read_ordered, MPI_File_write_ordered, MPI_File_seek_shared, MPI_File_get_position_shared

Nonblocking, noncollective coordination: MPI_File_iread_shared, MPI_File_iwrite_shared

Split collective, collective coordination: MPI_File_read_ordered_begin, MPI_File_read_ordered_end, MPI_File_write_ordered_begin, MPI_File_write_ordered_end

Sun MPI I/O maintains one shared file pointer per collective MPI_File_open (shared among processes in the communicator group that opened the file). As with the routines for data access with individual file pointers, you can also use the current value of the shared file pointer to specify the offset of data accesses implicitly. These routines use and update only the shared file pointer; the individual file pointers are neither used nor updated by any of these routines.

These routines have similar semantics to the explicit-offset data-access routines, except that the offset is defined here to be the current value of the shared file pointer, and calls made by the different processes in the group behave as if they were serialized in some order.

After a shared file pointer operation is initiated, the shared file pointer is updated, relative to the current view of the file, to point to the elementary data item immediately following the last one requested, regardless of the number of items actually accessed.

MPI_File_read_shared and MPI_File_write_shared are blocking routines that use the shared file pointer to read and write files, respectively. The order of serialization is not deterministic for these noncollective routines, so you need to use other methods of synchronization if you wish to impose a particular order.

MPI_File_iread_shared and MPI_File_iwrite_shared are the nonblocking versions of MPI_File_read_shared and MPI_File_write_shared, respectively.

MPI_File_read_ordered and MPI_File_write_ordered are the collective versions of MPI_File_read_shared and MPI_File_write_shared. They must be called by all processes in the communicator group associated with the file handle, and the accesses to the file occur in the order determined by the ranks of the processes within the group. After all the processes in the group have issued their respective calls, for each process in the group, these routines determine the position where the shared file pointer would be after all processes with ranks lower than this process's rank had accessed their data. Then data is accessed (read or written) at that position. The shared file pointer is then updated by the amount of data requested by all processes of the group.

The split collective versions of MPI_File_read_ordered and MPI_File_write_ordered are listed in the table at the beginning of this section.
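
A minimal sketch of a rank-ordered append through the shared file pointer follows; fh is assumed to be a collectively opened log file, and the message text is arbitrary.

    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    /* Every process appends one line; the collective routine serializes the
     * accesses in rank order through the shared file pointer. */
    void write_log_line(MPI_File fh, int rank)
    {
        char       line[64];
        MPI_Status status;

        snprintf(line, sizeof(line), "message from process %d\n", rank);
        MPI_File_write_ordered(fh, line, (int)strlen(line), MPI_CHAR, &status);
    }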

MPI_File_seek_shared is a collective routine; all processes in the communicator group associated with the particular file handle must call MPI_File_seek_shared with the same file offset and the same update mode. All the processes are synchronized with a barrier before the shared file pointer is updated.

The offset can be negative for backwards seeking, but you cannot seek to a negative position in the file. The current position is defined as the elementary data item immediately following the last-accessed data item, even if that location is a hole.

MPI_File_get_position_shared returns the current position of the shared file pointer relative to the current displacement and file type.

File Interoperability

MPI_Register_datarep
MPI_File_get_type_extent

Sun MPI I/O supports the basic data representations described in Section 9.5 of the MPI-2 standard: native, in which data is stored in the file exactly as it is in memory; internal, an implementation-dependent format that can be read back by the same MPI implementation; and external32, the portable, big-endian IEEE representation defined by the standard.

These data representations, as well as any user-defined representations, are specified as an argument to MPI_File_set_view.

You may create user-defined data representations with MPI_Register_datarep. Once a data representation has been defined with this routine, you may specify it as an argument to MPI_File_set_view, so that subsequent data-access operations will call the conversion functions specified with MPI_Register_datarep.

If the file data representation is anything but native, you must be careful when constructing elementary data types and file types. For those functions that accept displacements in bytes, the displacements must be specified in terms of their values in the file for the file data representation being used.

MPI_File_get_type_extent can be used to calculate the extents of data types in the file. The extent is the same for all processes accessing the specified file. If the current view uses a user-defined data representation, MPI_File_get_type_extent uses one of the functions specified in setting the data representation to calculate the extent.
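
For example, a sketch of querying the file extent of a basic type (assuming fh has a view set with the data representation of interest):

    #include <mpi.h>
    #include <stdio.h>

    /* Query how many file bytes one MPI_INT occupies under the data
     * representation of the current view. */
    void show_int_extent(MPI_File fh)
    {
        MPI_Aint extent;

        MPI_File_get_type_extent(fh, MPI_INT, &extent);
        printf("MPI_INT occupies %ld bytes in the file\n", (long)extent);
    }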

File Consistency and Semantics

Noncollective coordination: MPI_File_get_atomicity

Collective coordination: MPI_File_set_atomicity, MPI_File_sync

The routines ending in _atomicity allow you to set or query whether a file is in atomic or nonatomic mode. In atomic mode, all operations within the communicator group that opens a file are completed as if sequentialized into some serial order. In nonatomic mode, no such guarantee is made. In nonatomic mode, MPI_File_sync can be used to assure weak consistency.

The default mode varies with the number of nodes you are using. If you are running a job on a single node, a file is in nonatomic mode by default when it is opened. If you are running a job on more than one node, a file is in atomic mode by default.

MPI_File_set_atomicity is a collective call that sets the consistency semantics for data-access operations. All the processes in the group must pass identical values for both the file handle and the Boolean flag that indicates whether atomic mode is set.

MPI_File_get_atomicity returns the current consistency semantics for data-access operations. Again, a Boolean flag indicates whether the atomic mode is set.
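
As a sketch of the nonatomic-mode pattern, the fragment below switches an open file to nonatomic mode and later uses MPI_File_sync, together with a barrier, to make its writes visible to other processes; fh and comm are assumed to come from the caller, and error handling is omitted.

    #include <mpi.h>

    void flush_writes(MPI_File fh, MPI_Comm comm)
    {
        int atomic;

        MPI_File_set_atomicity(fh, 0);        /* collective; same flag on all ranks */
        MPI_File_get_atomicity(fh, &atomic);  /* atomic is now 0 (false)            */

        /* ... perform writes here ... */

        MPI_File_sync(fh);    /* transfer written data to the storage device        */
        MPI_Barrier(comm);    /* readers should also call MPI_File_sync after this  */
    }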


Note -

In some cases, setting atomicity to false may provide better performance. The default atomicity value on a cluster is true, because the distributed caches on the cluster nodes are not synchronized with one another and data might otherwise not end up in the state you expect. The trade-off is that, with atomicity set to true, you may see reduced performance, especially when data accesses overlap.