Sun MPI 4.0 Programming and Reference Guide

Chapter 4 Programming With Sun MPI I/O

File I/O in Sun MPI 4.0 is fully MPI-2 compliant. MPI I/O is specified as part of that standard, which was published in July, 1997. Its goal is to provide a library of routines featuring a portable parallel file system interface that is an extension of the MPI framework. See "Related Publications" for more information about the MPI-2 standard.

The closest thing to a standard in file I/O is the UNIX file interface, but UNIX does not provide efficient coordination among multiple simultaneous accesses to a file, particularly when those accesses originate on multiple machines in a cluster. Another drawback of the UNIX file interface is its single-offset interface, that is, its lack of aggregate requests, which can also lead to inefficient access. The MPI I/O library provides routines that accomplish this coordination. Furthermore, MPI I/O allows multiple simultaneous access requests to be made to take advantage of Sun HPC's parallel file system, PFS. It is currently the only application programming interface through which users can access Sun HPC's PFS. For more information about PFS, see the Sun HPC ClusterTools 3.0 Administrator's Guide: With LSF (if you are using LSF Suite) or the Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE (if you are using the CRE). Also see the pfsstat man page.


Note -

A direct interface to Sun HPC's PFS (parallel file system) is not available to the user in this release. Currently, the only way to access PFS is through Sun's implementation of MPI I/O or Solaris command-line utilities.


Using Sun MPI I/O

MPI I/O models file I/O on message passing; that is, writing to a file is analogous to sending a message, and reading from a file is analogous to receiving a message. The MPI library provides a high-level way of partitioning data among processes, which saves you from having to specify the details involved in making sure that the right pieces of data go to the right processes. This section describes basic MPI I/O concepts and the Sun MPI I/O routines.

Data Partitioning and Data Types

MPI I/O uses the MPI model of communicators and derived data types to describe communication between processes and I/O devices. MPI I/O determines which processes are communicating with a particular I/O device. Derived data types can be used to define the layout of data in memory and of data in a file on the I/O device. (For more information about derived data types, see "Data Types".) Because MPI I/O builds on MPI concepts, it's easy for a knowledgeable MPI programmer to add MPI I/O code to a program.

Data is stored in memory and in the file according to MPI data types. Herein lies one of MPI and MPI I/O's advantages: Because they provide a mechanism whereby you can create your own data types, you have more freedom and flexibility in specifying data layout in memory and in the file.

The library also simplifies the task of describing how your data moves from processor memory to file and back again. You create derived data types that describe how the data is arranged in each process's memory and how it should be arranged in that process's part of the disk file.

The Sun MPI I/O routines are described in "Routines". But first, to be able to define a data layout, you will need to understand some basic MPI I/O data-layout concepts. The next section explains some of the fundamental terms and concepts.

Definitions

The following terms are used to describe partitioning data among processes. Figure 4-1 illustrates some of these concepts.

Figure 4-1 Displacement, the Elementary Data Type, the File Type, and the View


For a more detailed description of MPI I/O, see Chapter 9, "I/O," of the MPI-2 standard.

Note for Fortran Users

When writing a Fortran program, you must declare the variable ADDRESS as

INTEGER*MPI_ADDRESS_KIND ADDRESS

MPI_ADDRESS_KIND is a constant defined in mpif.h, the Fortran include file. This constant defines the length of the declared integer.

Routines

This release of Sun MPI includes all the MPI I/O routines, which are defined in Chapter 9, "I/O," of the MPI-2 specification. (See the preface for information about this specification.)

Code samples that use many of these routines are provided in "Sample Code".

File Manipulation

Collective coordination:
  MPI_File_open
  MPI_File_close
  MPI_File_set_size
  MPI_File_preallocate

Noncollective coordination:
  MPI_File_delete
  MPI_File_get_size
  MPI_File_get_group
  MPI_File_get_amode

MPI_File_open and MPI_File_close are collective operations that open and close a file, respectively -- that is, all processes in a communicator group must together open or close a file. To achieve a single-user, UNIX-like open, set the communicator to MPI_COMM_SELF.

MPI_File_delete deletes a specified file.

The routines MPI_File_set_size, MPI_File_get_size, MPI_File_get_group, and MPI_File_get_amode get and set information about a file. When using the collective routine MPI_File_set_size on a UNIX file, if the size that is set is smaller than the current file size, the file is truncated at the position defined by size. If size is set to be larger than the current file size, the file size becomes size.

When the file size is increased this way with MPI_File_set_size, new regions are created in the file with displacements between the old file size and the larger, newly set file size. Sun MPI I/O does not necessarily allocate file space for such new regions. You may reserve file space either by using MPI_File_preallocate or by performing a read or write to unallocated bytes. MPI_File_preallocate ensures that storage space is allocated for a set quantity of bytes for the specified file; however, its use is very "expensive" in terms of performance and disk space.

The routine MPI_File_get_group returns a communicator group, but it does not free the group.
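
As a quick illustration, the following minimal sketch opens a file for a single process by passing MPI_COMM_SELF, queries its size, and closes it. The file name demo.dat is a placeholder, and error checking is omitted for brevity:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_File fh;
  MPI_Offset size;

  MPI_Init(&argc, &argv);

  /* Single-process, UNIX-like open: only the calling process
     participates because the communicator is MPI_COMM_SELF. */
  MPI_File_open(MPI_COMM_SELF, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Query the current size of the file in bytes. */
  MPI_File_get_size(fh, &size);
  printf("demo.dat is %lld bytes\n", (long long)size);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}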

File Info

Noncollective coordination:
  MPI_File_get_info

Collective coordination:
  MPI_File_set_info

The opaque info object allows you to provide hints that can optimize your code, for example by making it run faster or use resources more efficiently. These hints are set for each file, using the MPI_File_open, MPI_File_set_view, MPI_File_set_info, and MPI_File_delete routines. MPI_File_set_info sets new values for the specified file's hints. MPI_File_get_info returns all the hints that the system currently associates with the specified file.

When using UNIX files, Sun MPI I/O provides four hints for controlling how much buffer space it uses to satisfy I/O requests: noncoll_read_bufsize, noncoll_write_bufsize, coll_read_bufsize, and coll_write_bufsize. These hints may be tuned for your particular hardware configuration and application to improve performance for both noncollective and collective data accesses. For example, if your application uses a single MPI I/O call to request multiple noncontiguous chunks that form a regular strided pattern in the file, you may want to adjust the noncoll_write_bufsize to match the size of the stride. Note that these hints limit the size of MPI I/O's underlying buffers but do not limit how much data a user can read or write in a single request.
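
For illustration only, the following sketch shows one way such a hint might be passed. The file name demo.dat and the 1-Mbyte value for coll_write_bufsize are placeholders that you would tune for your own configuration, and error checking is omitted:

#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_File fh;
  MPI_Info info, current;

  MPI_Init(&argc, &argv);

  /* Build an info object that carries a buffering hint. */
  MPI_Info_create(&info);
  MPI_Info_set(info, "coll_write_bufsize", "1048576"); /* placeholder value */

  /* The hint can be supplied at open time... */
  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, info, &fh);

  /* ...or applied to an already open file. */
  MPI_File_set_info(fh, info);

  /* Retrieve the hints the system currently associates with the file. */
  MPI_File_get_info(fh, &current);

  MPI_Info_free(&current);
  MPI_Info_free(&info);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}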

File Views

Noncollective coordination:
  MPI_File_get_view

Collective coordination:
  MPI_File_set_view

The MPI_File_set_view routine changes the process's view of the data in the file, specifying its displacement, elementary data type, and file type, as well as setting the individual file pointers and shared file pointer to 0. MPI_File_set_view is a collective routine; all processes in the group must pass identical values for the file handle and the elementary data type, although the values for the displacement, the file type, and the info object may vary. However, if you use the data-access routines that use file positioning with a shared file pointer, you must also give the displacement and the file type identical values. The data types passed in as the elementary data type and the file type must be committed.

You can also specify the type of data representation for the file. See "File Interoperability" for information about registering data representation identifiers.
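
The following minimal sketch sets a simple per-process view, using MPI_INT as both the elementary data type and the file type, and then queries the view back with MPI_File_get_view. The file name and the rank-derived displacement are illustrative assumptions, and error checking is omitted:

#include "mpi.h"

#define NUM_INTS 100

int main(int argc, char **argv)
{
  MPI_File fh;
  MPI_Offset disp;
  MPI_Datatype etype, ftype;
  char datarep[MPI_MAX_DATAREP_STRING];
  int rank;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Collective call: every process passes the same elementary data
     type (MPI_INT) but its own displacement, so each process sees
     only its own slice of the file. */
  MPI_File_set_view(fh, (MPI_Offset)(rank * NUM_INTS * sizeof(int)),
                    MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

  /* Query the view back; datarep must be at least
     MPI_MAX_DATAREP_STRING characters long. */
  MPI_File_get_view(fh, &disp, &etype, &ftype, datarep);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}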


Note -

Displacements within the file type and the elementary data type must be monotonically nondecreasing.


Data Access

The 35 data-access routines are categorized according to file positioning. Data access can be achieved by any of these methods of file positioning:

  Explicit offsets
  Individual file pointers
  Shared file pointers

In the following subsections, each of these methods is discussed in more detail.

While blocking I/O calls will not return until the request is completed, nonblocking calls do not wait for the I/O request to complete. A separate "request complete" call, such as MPI_Test or MPI_Wait, is needed to confirm that the buffer is ready to be used again. Nonblocking routines have the prefix MPI_File_i, where the i stands for immediate.

All the nonblocking collective routines for data access are "split" into two routines, each with _begin or _end as a suffix. These split collective routines are subject to the semantic rules described in Section 9.4.5 of the MPI-2 standard.

Data Access With Explicit Offsets

Blocking

  Noncollective coordination:
    MPI_File_read_at
    MPI_File_write_at

  Collective coordination:
    MPI_File_read_at_all
    MPI_File_write_at_all

Nonblocking or split collective

  Noncollective coordination:
    MPI_File_iread_at
    MPI_File_iwrite_at

  Collective coordination:
    MPI_File_read_at_all_begin
    MPI_File_read_at_all_end
    MPI_File_write_at_all_begin
    MPI_File_write_at_all_end

To access data at an explicit offset, specify the position in the file where the next data access for each process should begin. For each call to a data-access routine, a process attempts to transfer a specified number of items of a specified data type, starting at the specified offset, between the file and a specified user buffer.

The offset is measured in elementary data type units relative to the current view; moreover, "holes" are not counted when locating an offset. The data is read from (in the case of a read) or written into (in the case of a write) those parts of the file specified by the current view. These routines store the number of buffer elements of a particular data type actually read (or written) in the status object, and all the other fields associated with the status object are undefined. The number of elements that are read or written can be accessed using MPI_Get_count.

MPI_File_read_at attempts to read from the file via the associated file handle returned from a successful MPI_File_open. Similarly, MPI_File_write_at attempts to write data from a user buffer to a file. MPI_File_iread_at and MPI_File_iwrite_at are the nonblocking versions of MPI_File_read_at and MPI_File_write_at, respectively.

MPI_File_read_at_all and MPI_File_write_at_all are collective versions of MPI_File_read_at and MPI_File_write_at, in which each process provides an explicit offset. The split collective versions of these nonblocking routines are listed in the table at the beginning of this section.
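
As a sketch of the split collective form, each process might begin a collective write at a rank-derived offset, overlap some work, and then complete it. The file name, offsets, and buffer contents are placeholders, and error checking is omitted:

#include "mpi.h"

#define NUM_INTS 100

int main(int argc, char **argv)
{
  int buff[NUM_INTS], i, rank;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < NUM_INTS; i++) buff[i] = i;

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Begin the split collective write at an explicit, rank-derived
     offset (in etype units of the default view, that is, bytes). */
  MPI_File_write_at_all_begin(fh, (MPI_Offset)(rank * sizeof(buff)),
                              buff, NUM_INTS, MPI_INT);

  /* Computation or communication could overlap the I/O here,
     but buff must not be modified until the _end call returns. */

  MPI_File_write_at_all_end(fh, buff, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}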

Data Access With Individual File Pointers

Blocking

  Noncollective coordination:
    MPI_File_read
    MPI_File_write

  Collective coordination:
    MPI_File_read_all
    MPI_File_write_all

Nonblocking or split collective

  Noncollective coordination:
    MPI_File_iread
    MPI_File_iwrite

  Collective coordination:
    MPI_File_read_all_begin
    MPI_File_read_all_end
    MPI_File_write_all_begin
    MPI_File_write_all_end

For each open file, Sun MPI I/O maintains one individual file pointer per process per collective MPI_File_open. For these data-access routines, MPI I/O implicitly uses the value of the individual file pointer. These routines use and update only the individual file pointers maintained by MPI I/O, advancing each pointer to the next elementary data item after the one most recently accessed. The individual file pointer is updated relative to the current view of the file. The shared file pointer is neither used nor updated. (For data access with shared file pointers, please see the next section.)

These routines have similar semantics to the explicit-offset data-access routines, except that the offset is defined here to be the current value of the individual file pointer.

MPI_File_read_all and MPI_File_write_all are collective versions of MPI_File_read and MPI_File_write, with each process using its individual file pointer.

MPI_File_iread and MPI_File_iwrite are the nonblocking versions of MPI_File_read and MPI_File_write, respectively. The split collective versions of MPI_File_read_all and MPI_File_write_all are listed in the table at the beginning of this section.
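
The following sketch shows two successive collective writes that rely on the individual file pointer rather than explicit offsets; the second write lands immediately after the first because the pointer advances automatically. The file name and view layout are illustrative assumptions, and error checking is omitted:

#include "mpi.h"

#define NUM_INTS 100

int main(int argc, char **argv)
{
  int first[NUM_INTS], second[NUM_INTS], i, rank;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < NUM_INTS; i++) { first[i] = i; second[i] = -i; }

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Start each process's view at its own region of the file; the
     individual file pointer begins at 0 relative to this view. */
  MPI_File_set_view(fh, (MPI_Offset)(rank * 2 * NUM_INTS * sizeof(int)),
                    MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

  /* No offset argument: both collective writes use, and advance, the
     individual file pointer, so the second write lands immediately
     after the first. */
  MPI_File_write_all(fh, first, NUM_INTS, MPI_INT, &status);
  MPI_File_write_all(fh, second, NUM_INTS, MPI_INT, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}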

Pointer Manipulation

MPI_File_seek
MPI_File_get_position
MPI_File_get_byte_offset

Each process can call the routine MPI_File_seek to update its individual file pointer according to the update mode. The update mode has the following possible values:

  MPI_SEEK_SET - The pointer is set to offset.
  MPI_SEEK_CUR - The pointer is set to the current pointer position plus offset.
  MPI_SEEK_END - The pointer is set to the end of the file plus offset.

The offset can be negative for backwards seeking, but you cannot seek to a negative position in the file. The current position is defined as the elementary data item immediately following the last-accessed data item.

MPI_File_get_position returns the current position of the individual file pointer relative to the current displacement and file type.

MPI_File_get_byte_offset converts the offset specified for the current view to the displacement value, or absolute byte position, for the file.
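
A minimal sketch of these calls, assuming the default view and a placeholder file name, with error checking omitted, might look like this:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_File fh;
  MPI_Offset pos, byte_offset;

  MPI_Init(&argc, &argv);

  MPI_File_open(MPI_COMM_SELF, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Absolute seek: set the individual file pointer to etype 10 of
     the current view (bytes, under the default view). */
  MPI_File_seek(fh, (MPI_Offset)10, MPI_SEEK_SET);

  /* Relative seek: move 5 more etypes forward. */
  MPI_File_seek(fh, (MPI_Offset)5, MPI_SEEK_CUR);

  /* Report the position in view-relative etype units and as an
     absolute byte offset in the file. */
  MPI_File_get_position(fh, &pos);
  MPI_File_get_byte_offset(fh, pos, &byte_offset);
  printf("position %lld, byte offset %lld\n",
         (long long)pos, (long long)byte_offset);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}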

Data Access With Shared File Pointers

Blocking

  Noncollective coordination:
    MPI_File_read_shared
    MPI_File_write_shared

  Collective coordination:
    MPI_File_read_ordered
    MPI_File_write_ordered
    MPI_File_seek_shared
    MPI_File_get_position_shared

Nonblocking or split collective

  Noncollective coordination:
    MPI_File_iread_shared
    MPI_File_iwrite_shared

  Collective coordination:
    MPI_File_read_ordered_begin
    MPI_File_read_ordered_end
    MPI_File_write_ordered_begin
    MPI_File_write_ordered_end

Sun MPI I/O maintains one shared file pointer per collective MPI_File_open (shared among processes in the communicator group that opened the file). As with the routines for data access with individual file pointers, you can also use the current value of the shared file pointer to specify the offset of data accesses implicitly. These routines use and update only the shared file pointer; the individual file pointers are neither used nor updated by any of these routines.

These routines have similar semantics to the explicit-offset data-access routines, except:

  The offset is defined to be the current value of the shared file pointer.
  The effect of multiple calls by different processes to these routines is as if the calls had been serialized.
  All processes must use the same file view.

After a shared file pointer operation is initiated, it is updated, relative to the current view of the file, to point to the elementary data item immediately following the last one requested, regardless of the number of items actually accessed.

MPI_File_read_shared and MPI_File_write_shared are blocking routines that use the shared file pointer to read and write files, respectively. The order of serialization is not deterministic for these noncollective routines, so you need to use other methods of synchronization if you wish to impose a particular order.

MPI_File_iread_shared and MPI_File_iwrite_shared are the nonblocking versions of MPI_File_read_shared and MPI_File_write_shared, respectively.

MPI_File_read_ordered and MPI_File_write_ordered are the collective versions of MPI_File_read_shared and MPI_File_write_shared. They must be called by all processes in the communicator group associated with the file handle, and the accesses to the file occur in the order determined by the ranks of the processes within the group. After all the processes in the group have issued their respective calls, for each process in the group, these routines determine the position where the shared file pointer would be after all processes with ranks lower than this process's rank had accessed their data. Then data is accessed (read or written) at that position. The shared file pointer is then updated by the amount of data requested by all processes of the group.

The split collective versions of MPI_File_read_ordered and MPI_File_write_ordered are listed in the table at the beginning of this section.

MPI_File_seek_shared is a collective routine; all processes in the communicator group associated with the particular file handle must call MPI_File_seek_shared with the same file offset and the same update mode. All the processes are synchronized with a barrier before the shared file pointer is updated.

The offset can be negative for backwards seeking, but you cannot seek to a negative position in the file. The current position is defined as the elementary data item immediately following the last-accessed data item, even if that location is a hole.

MPI_File_get_position_shared returns the current position of the shared file pointer relative to the current displacement and file type.
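
As an illustration of the log-file style mentioned above, the following sketch has every process append one record in rank order with MPI_File_write_ordered. The file name and message text are placeholders, and error checking is omitted:

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  char line[64];
  int rank;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_File_open(MPI_COMM_WORLD, "demo.log",
                MPI_MODE_WRONLY | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  sprintf(line, "message from rank %d\n", rank);

  /* Collective: the records appear in the file in rank order, and
     the shared file pointer advances past all of them. */
  MPI_File_write_ordered(fh, line, (int)strlen(line), MPI_CHAR, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}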

File Interoperability

MPI_Register_datarep
MPI_File_get_type_extent

Sun MPI I/O supports the basic data representations described in Section 9.5 of the MPI-2 standard:

  native - Data is stored in the file exactly as it is in memory.
  internal - Data is stored in an implementation-specific format that can be used with the same MPI implementation.
  external32 - Data is stored in the portable, standardized representation defined by the MPI-2 standard.

These data representations, as well as any user-defined representations, are specified as an argument to MPI_File_set_view.

You may create user-defined data representations with MPI_Register_datarep. Once a data representation has been defined with this routine, you may specify it as an argument to MPI_File_set_view, so that subsequent data-access operations will call the conversion functions specified with MPI_Register_datarep.

If the file data representation is anything but native, you must be careful when constructing elementary data types and file types. For those functions that accept displacements in bytes, the displacements must be specified in terms of their values in the file for the file data representation being used.

MPI_File_get_type_extent can be used to calculate the extents of data types in the file. The extent is the same for all processes accessing the specified file. If the current view uses a user-defined data representation, MPI_File_get_type_extent uses one of the functions specified in setting the data representation to calculate the extent.
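
A minimal sketch, assuming the external32 representation is available and using a placeholder file name with error checking omitted, queries the file extent of MPI_INT under that representation:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_File fh;
  MPI_Aint extent;

  MPI_Init(&argc, &argv);

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Select the portable external32 representation for this view. */
  MPI_File_set_view(fh, (MPI_Offset)0, MPI_INT, MPI_INT,
                    "external32", MPI_INFO_NULL);

  /* Ask how large an MPI_INT is in the file, which may differ from
     sizeof(int) in memory. */
  MPI_File_get_type_extent(fh, MPI_INT, &extent);
  printf("MPI_INT occupies %ld bytes in the file\n", (long)extent);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}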

File Consistency and Semantics

Noncollective coordination:
  MPI_File_get_atomicity

Collective coordination:
  MPI_File_set_atomicity
  MPI_File_sync

The routines ending in _atomicity allow you to set or query whether a file is in atomic or nonatomic mode. In atomic mode, all operations within the communicator group that opened the file are completed as if they had been executed in some serial order. In nonatomic mode, no such guarantee is made; MPI_File_sync can be used to ensure weak consistency.

The default mode varies with the number of nodes you are using. If you are running a job on a single node, a file is in nonatomic mode by default when it is opened. If you are running a job on more than one node, a file is in atomic mode by default.

MPI_File_set_atomicity is a collective call that sets the consistency semantics for data-access operations. All the processes in the group must pass identical values for both the file handle and the Boolean flag that indicates whether atomic mode is set.

MPI_File_get_atomicity returns the current consistency semantics for data-access operations. Again, a Boolean flag indicates whether the atomic mode is set.
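
The following sketch switches a file to nonatomic mode and then uses MPI_File_sync together with a barrier, in the spirit of the sync-barrier-sync pattern described in the MPI-2 standard, to make one process's writes visible to the others. The file name is a placeholder and error checking is omitted:

#include "mpi.h"

int main(int argc, char **argv)
{
  MPI_File fh;
  int atomic;

  MPI_Init(&argc, &argv);

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Collective: every process must pass the same flag value. */
  MPI_File_set_atomicity(fh, 0);       /* select nonatomic mode */
  MPI_File_get_atomicity(fh, &atomic); /* atomic is now 0 (false) */

  /* ... this process's writes would go here ... */

  /* Writers sync, everyone synchronizes, and readers sync again
     before reading the newly written data. */
  MPI_File_sync(fh);
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_File_sync(fh);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}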


Note -

In some cases, setting atomicity to false may provide better performance. The default atomicity value on a cluster is true because, without the synchronization that atomic mode provides, the distributed caches on a cluster can leave your data in an unintended state. That synchronization has a performance cost, however, so you may see reduced performance with atomicity set to true, especially when the data accesses overlap.


Sample Code

In this section, we give some sample code to get you started with programming your I/O using Sun MPI 4.0. We start with an example that shows how a parallel job can partition file data among its processes. Next we explore how you can adapt our initial example to use a broad range of other I/O programming styles supported by Sun MPI I/O. Finally, we present a sample code that illustrates the use of the nonblocking MPI I/O routines.

Before we start, remember that MPI I/O is part of MPI, so you must call MPI_Init before calling any MPI I/O routines and MPI_Finalize at the end of your program, even if you only use MPI I/O routines.

Partitioned Writing and Reading in a Parallel Job

MPI I/O was designed to enable processes in a parallel job to request multiple data items that are noncontiguous within a file. Typically, a parallel job partitions file data among the processes.

One method of partitioning a file is to derive the offset at which to access data from the rank of the process. The rich set of MPI derived types also allows us to partition file data easily. For example, we could create an MPI vector type as the filetype passed into MPI_File_set_view. Since vector types do not end with a hole, a call must be made, either to MPI_Type_create_resized or to MPI_Type_ub, to complete the partition. This call extends the type's extent to include a hole at its end, covering the blocks that belong to processes with higher ranks. We create a partitioned file by passing different displacements to MPI_File_set_view, each derived from the process's rank. Consequently, the offsets would not need to be derived from the ranks, because only the data in that process's portion of the partition would be visible in that process's view. A sketch of this approach follows.
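
For illustration, here is a minimal sketch of this derived-data-type approach. The file name, the block size, and the use of MPI_Type_contiguous followed by MPI_Type_create_resized are illustrative choices rather than the only way to build such a file type, and error checking is omitted:

#include "mpi.h"

#define NUM_INTS 100

int main(int argc, char **argv)
{
  int buff[NUM_INTS], i, rank, nprocs;
  MPI_Datatype block, filetype;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
  for (i = 0; i < NUM_INTS; i++) buff[i] = rank;

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* One block of NUM_INTS integers ... */
  MPI_Type_contiguous(NUM_INTS, MPI_INT, &block);
  /* ... resized so that its extent spans one block per process; the
     trailing hole is what makes the tiled file type skip the blocks
     belonging to the other processes. */
  MPI_Type_create_resized(block, (MPI_Aint)0,
                          (MPI_Aint)(nprocs * NUM_INTS * sizeof(int)),
                          &filetype);
  MPI_Type_commit(&filetype);

  /* The displacement, not the offset, is derived from the rank. */
  MPI_File_set_view(fh, (MPI_Offset)(rank * NUM_INTS * sizeof(int)),
                    MPI_INT, filetype, "native", MPI_INFO_NULL);

  /* Offset 0 in each process's view is its own first block. */
  MPI_File_write_all(fh, buff, NUM_INTS, MPI_INT, &status);

  MPI_Type_free(&filetype);
  MPI_Type_free(&block);
  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}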

In Example 4-1, we use the first method, deriving the file offsets directly from the process's rank. Each process writes and reads NUM_INTS integers starting at the offset rank * NUM_INTS. We pass an explicit offset to our MPI I/O data-access routines MPI_File_write_at and MPI_File_read_at. We call MPI_Get_elements to find out how many elements were written or read. To verify that the write was successful, we compare the data written and read, and we also set up an MPI_Barrier before calling MPI_File_get_size to verify that the file is the size we expect upon completion of all the processes' writes.

Observe that we called MPI_File_set_view to set our view of the file as essentially an array of integers instead of the UNIX-like view of the file as an array of bytes. Thus, the offsets that we pass to MPI_File_write_at and MPI_File_read_at are indices into an array of integers and not a byte offset.


Example 4-1 Example code in which each process writes and reads NUM_INTS integers to a file using MPI_File_write_at and MPI_File_read_at, respectively.

/* wr_at.c
 *
 * Example to demonstrate use of MPI_File_write_at and MPI_File_read_at
 *
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

#define NUM_INTS 100

void sample_error(int error, char *string)
{
  fprintf(stderr, "Error %d in %s\n", error, string);
  MPI_Finalize();
  exit(-1);
}

int
main( int argc, char **argv )
{  
  char filename[128];
  int i, rank,  comm_size;
  int *buff1, *buff2;
  MPI_File fh;
  MPI_Offset disp, offset, file_size;
  MPI_Datatype etype, ftype, buftype;
  MPI_Info info;
  MPI_Status status;
  int result, count, differs;

  if(argc < 2) {
    fprintf(stdout, "Missing argument: filename\n");
    exit(-1);
  }
  strcpy(filename, argv[1]);

  MPI_Init(&argc, &argv);

  /* get this processor's rank */
  result = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_Comm_rank");

  result = MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_Comm_size");
  
  /* communicator group MPI_COMM_WORLD opens file "foo" 
     for reading and writing (and creating, if necessary) */
  result = MPI_File_open(MPI_COMM_WORLD, filename, 
			 MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_open");

  /* Set the file view which tiles the file type MPI_INT, starting 
     at displacement 0.  In this example, the etype is also MPI_INT.
 */
  disp = 0;
  etype = MPI_INT;
  ftype = MPI_INT;
  info = MPI_INFO_NULL;
  result = MPI_File_set_view(fh, disp, etype, ftype, "native", info);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_set_view");

  /* Allocate and initialize a buffer (buff1) containing NUM_INTS
     integers, where the integer in location i is set to i. */
  buff1 = (int *)malloc(NUM_INTS*sizeof(int));
  for(i=0;i<NUM_INTS;i++) buff1[i] = i;

  /* Set the buffer type to also be MPI_INT, then write the buffer (buff1)
     starting at offset rank * NUM_INTS, i.e., at this process's
     partition of the file. */
  buftype = MPI_INT;
  offset = rank * NUM_INTS;
  result = MPI_File_write_at(fh, offset, buff1, NUM_INTS, buftype, &status);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_write_at");
  
  result = MPI_Get_elements(&status, MPI_BYTE, &count);
  if(result != MPI_SUCCESS)
    sample_error(result, "MPI_Get_elements");
  if(count != NUM_INTS*sizeof(int))
    fprintf(stderr, "Did not write the same number of bytes as requested\n");
  else
    fprintf(stdout, "Wrote %d bytes\n", count);

  /* Allocate another buffer (buff2) to read into, then read NUM_INTS
     integers into this buffer.  */
  buff2 = (int *)malloc(NUM_INTS*sizeof(int));
  result = MPI_File_read_at(fh, offset, buff2, NUM_INTS, buftype, &status); 
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_read_at");

  /* Find out how many bytes were read and compare to how many
     we expected */
  result = MPI_Get_elements(&status, MPI_BYTE, &count);
  if(result != MPI_SUCCESS)
    sample_error(result, "MPI_Get_elements");
  if(count != NUM_INTS*sizeof(int))
    fprintf(stderr, "Did not read the same number of bytes as requested\n");
  else
    fprintf(stdout, "Read %d bytes\n", count);
  
  /* Check to see that each integer read from each location is 
     the same as the integer written to that location. */
  differs = 0;
  for(i=0; i<NUM_INTS; i++) {
    if(buff1[i] != buff2[i]) {
      fprintf(stderr, "Integer number %d differs\n", i);
      differs = 1;
    }
  }
  if(!differs)
    fprintf(stdout, "Wrote and read the same data\n");

  MPI_Barrier(MPI_COMM_WORLD);
  
  result = MPI_File_get_size(fh, &file_size);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_get_size");

  /* Compare the file size with what we expect */
  /* To see a negative response, make the file preexist with a larger
     size than what is written by this program */
  if(file_size != (comm_size * NUM_INTS * sizeof(int)))
    fprintf(stderr, "File size is not equal to the write size\n");
  
  result = MPI_File_close(&fh);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_close");

  MPI_Finalize();

  free(buff1);
  free(buff2);

  return 0;
}

Data Access Styles

We can adapt our example above to support the I/O programming style that best suits our application. Essentially, there are three dimensions on which to choose an appropriate data access routine for your particular task: file pointer type, collective or noncollective, and blocking or nonblocking.

We need to choose which file pointer type to use: explicit, individual, or shared. In the example above, we used an explicit pointer and passed it directly as the offset parameter to the MPI_File_write_at and MPI_File_read_at routines. Using an explicit pointer is equivalent to calling MPI_File_seek to set the individual file pointer to offset, then calling MPI_File_write or MPI_File_read, which is directly analogous to calling UNIX lseek() and write() or read(). If each process accesses the file sequentially, individual file pointers save you the effort of recalculating the offset for each data access. We would use a shared file pointer in situations where all the processes need to access a file cooperatively and sequentially, for example, when writing log files.
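
For instance, the following sketch performs the seek-then-write sequence that corresponds to a single MPI_File_write_at call at a rank-derived byte offset. The file name is a placeholder, the default byte view is assumed, and error checking is omitted:

#include "mpi.h"

#define NUM_INTS 100

int main(int argc, char **argv)
{
  int buff[NUM_INTS], i, rank;
  MPI_Offset offset;
  MPI_File fh;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  for (i = 0; i < NUM_INTS; i++) buff[i] = i;
  offset = (MPI_Offset)(rank * NUM_INTS * sizeof(int));

  MPI_File_open(MPI_COMM_WORLD, "demo.dat",
                MPI_MODE_RDWR | MPI_MODE_CREATE, MPI_INFO_NULL, &fh);

  /* Position the individual file pointer, then write without an
     explicit offset; together these behave like MPI_File_write_at. */
  MPI_File_seek(fh, offset, MPI_SEEK_SET);
  MPI_File_write(fh, buff, NUM_INTS, MPI_INT, &status);

  MPI_File_close(&fh);
  MPI_Finalize();
  return 0;
}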

Collective data-access routines allow the user to enforce some implicit coordination among the processes in a parallel job when making data accesses. For example, if a parallel job alternately reads in a matrix and performs computation on it, but cannot progress to the next stage of computation until all processes have completed the last stage, then a coordinated effort between processes when accessing data might be more efficient. In the example above, we could easily append the suffix _all to MPI_File_write_at and MPI_File_read_at to make the accesses collective. By coordinating the processes, we could achieve greater efficiency in the MPI library or at the file system level in buffering or caching the next matrix. In contrast, noncollective accesses are used when it is not evident that any benefit would be gained by coordinating disparate accesses by each process. UNIX file accesses are noncollective.

Overlapping I/O With Computation and Communication

MPI I/O also supports nonblocking versions of each of the data-access routines, that is, the data-access routines that have the letter i before write or read in the routine name (the i stands for immediate). By definition, nonblocking I/O routines return immediately after the I/O request has been issued and do not wait until the I/O request has completed. This functionality allows you to perform computation and communication at the same time as the I/O. Since large I/O requests can take a long time to complete, this provides a way to use your program's waiting time more efficiently.

As in our example above, parallel jobs often partition large matrices stored in files. These parallel jobs may use many large matrices or matrices that are too large to fit into memory at once. Thus, each process may access the multiple and/or large matrices in stages. During each stage, a process reads in a chunk of data, then performs some computation on it (which may involve communicating with the other processes in the parallel job). While performing the computation and communication, the process could issue a nonblocking I/O read request for the next chunk of data. Similarly, once the computation on a particular chunk has completed, a nonblocking write request could be issued before performing computation and communication on the next chunk.

The following example code illustrates the use of a nonblocking data-access routine. Notice that, like nonblocking communication routines, the nonblocking I/O routines require a call to MPI_Wait to wait for the nonblocking request to complete or repeated calls to MPI_Test to determine when the nonblocking data access has completed. Once complete, the write or read buffer is available for use again by the program.


Example 4-2 Example code in which each process reads and writes NUM_BYTES bytes to a file using the nonblocking MPI I/O routines MPI_File_iread_at and MPI_File_iwrite_at, respectively. Note the use of MPI_Wait and MPI_Test to determine when the nonblocking requests have completed.

/* iwr_at.c
 *
 * Example to demonstrate use of MPI_File_iwrite_at and MPI_File_iread_at
 *
*/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mpi.h"

#define NUM_BYTES 100

void sample_error(int error, char *string)
{
  fprintf(stderr, "Error %d in %s\n", error, string);
  MPI_Finalize();
  exit(-1);
}

int
main( int argc, char **argv )
{  
  char filename[128];
  char *buff;
  MPI_File fh;
  MPI_Offset offset;
  MPI_Request request;
  MPI_Status status;
  int i, rank, flag, result;

  if(argc < 2) {
    fprintf(stdout, "Missing argument: filename\n");
    exit(-1);
  }
  strcpy(filename, argv[1]);

  MPI_Init(&argc, &argv);

  result = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_Comm_rank");

  result = MPI_File_open(MPI_COMM_WORLD, filename, 
			 MPI_MODE_RDWR | MPI_MODE_CREATE, 
			 MPI_INFO_NULL, &fh);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_open");

  buff = (char *)malloc(NUM_BYTES*sizeof(char));
  for(i=0;i<NUM_BYTES;i++) buff[i] = i;

  offset = rank * NUM_BYTES;
  result = MPI_File_iread_at(fh, offset, buff, NUM_BYTES, 
			     MPI_BYTE, &request); 
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_iread_at");

  /* Perform some useful computation and/or communication */

  result = MPI_Wait(&request, &status);

  /* Reinitialize the same buffer for the write; no new allocation is needed. */
  for(i=0;i<NUM_BYTES;i++) buff[i] = i;
  result = MPI_File_iwrite_at(fh, offset, buff, NUM_BYTES, 
			      MPI_BYTE, &request);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_iwrite_at");
  
  /* Perform some useful computation and/or communication */

  flag = 0;
  i = 0;
  while(!flag) {
     result = MPI_Test(&request, &flag, &status);
     i++;
     /* Perform some more computation or communication, if possible */
  }

  result = MPI_File_close(&fh);
  if(result != MPI_SUCCESS) 
    sample_error(result, "MPI_File_close");

  MPI_Finalize();

  fprintf(stdout, "Successful completion\n");

  free(buff);

  return 0;
}

For More Information

For more information on MPI I/O, refer to the documents listed in the section "Related Publications" of the preface.