In Sun S3L, arrays are distributed in blocks across multiple processes, allowing S3L operations to be performed in parallel on different sections of the array. These arrays are referred to in this manual as S3L arrays and, more generically, as parallel arrays.
Arrays passed to Sun S3L routines by C or F77 message-passing programs can have block, cyclic, or block-cyclic distributions. Regardless of the type of distribution specified by the calling program, Sun S3L will automatically select the distribution scheme that is most efficient for the routine being called. If that means Sun S3L changes the distribution method internally, it will restore the original distribution scheme on the resultant array before passing it back to the calling program.
Arrays from C and F77 message-passing programs can also be undistributed. That is, all the elements of the array can be located on the same process--a serial array in the conventional sense.
The balance of this chapter describes S3L arrays in more detail.
A principal attribute of S3L arrays is rank--the number of dimensions an array has. For example, an S3L array with three dimensions is called a rank-three array. S3L arrays can have up to 31 dimensions.
An S3L array is also defined by its extents--its length along each dimension--and by its type, which reflects the data type of its elements. S3L arrays can be of the following types:
S3L_integer (4-byte integer)
S3L_long_integer (8-byte integer)
S3L_float (4-byte floating point number)
S3L_double (8-byte double precision floating point number)
S3L_complex (8-byte complex number)
S3L_double_complex (16-byte complex number)
The C and Fortran equivalents of these array data types are described in Chapter 4, Sun S3L Data Types.
When an S3L array is declared, it is associated with a unique array handle. This is an S3L internal structure that fully describes the array. An S3L array handle contains all the information needed to define both the global and local characteristics of an S3L array. For example, an array handle includes
global features, such as the array's rank and information about how the array is distributed
local features, such as the extents of its local subgrid and the subgrid's location in memory on each process
By describing both local and global features of an array, an array handle makes it possible for any process to easily access data in array sections that are on other processes, not just data in its local section. That is, no matter how an array has been distributed, the associated S3L array handle ensures that its layout is understood by all participating processes.
In C programs, S3L array handles are declared as type S3L_array_t and in Fortran programs as type integer*8.
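For example, a minimal F77 declaration of an array handle looks like this (a sketch; the name a is illustrative, and the handle is filled in later by one of the S3L array allocation routines described below):

c     Handle for an S3L array (type S3L_array_t in C)
      integer*8 a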
In a Sun MPI application, each process is identified by a unique rank. This is an integer in the range 0 to np-1, where np is the total number of processes associated with the application.
This use of rank is unrelated to the rank of an S3L array. Process ranks correspond to MPI ranks, as used in interprocess communication; array ranks indicate the number of dimensions an array has.
Sun S3L maps each S3L array onto a logical arrangement of processes, referred to as a process grid. A process grid has the same number of dimensions as the S3L array with which it is associated. Each S3L array section that is distributed to a particular process is called a subgrid.
Sun S3L controls the ordering of the np processes within the n-dimensional process grid. Figure 2-1 through Figure 2-3 illustrate this with examples of how Sun S3L might arrange eight processes in one- and two-dimensional process grids.
In Figure 2-1, the eight processes form a one-dimensional grid.
Figure 2-2 and Figure 2-3 show the eight processes organized into rectangular 2x4 process grids. Although both have 2x4 extents, the two process grids differ in their majorness attribute. This attribute determines the order in which the processes are distributed onto a process grid's axes or local subgrid axes. The two possible modes are:
Column major - Processes are distributed along column axes first; that is, the process grid's row indices increase fastest.
Row major - Processes are distributed along row axes first; the process grid's column indices increase fastest.
In Figure 2-2, process distribution follows column-major order; in Figure 2-3, it follows row-major order.
In these examples, axis numbers are one-based (Fortran-style). For the C-language interface, reduce each value by 1.
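As an illustration of the difference between the two orderings (a sketch consistent with the definitions above, assuming the eight processes have MPI ranks 0 through 7), the ranks would be placed at the following 2x4 grid positions:

    Column-major:    0  2  4  6        Row-major:    0  1  2  3
                     1  3  5  7                      4  5  6  7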
When an S3L array is defined, the programmer has the choice of either defining a process grid explicitly, using the S3L_set_process_grid function, or letting Sun S3L define one using an internal algorithm. The following F77 example shows how to specify a two-dimensional process grid that is defined over a set of eight processes having MPI ranks 0 through 7. The process grid has extents of 2x4 and is assigned column-major ordering.
      include 's3l/s3l-f.h'

      integer*8 pg
      integer*4 rank
      integer*4 pext(2), process_list(8)
      integer*4 i, ier

c     2x4 process grid over MPI ranks 0-7,
c     with column-major process ordering
      rank = 2
      pext(1) = 2
      pext(2) = 4
      do i = 1, 8
         process_list(i) = i - 1
      end do
      call s3l_set_process_grid(pg, rank, S3L_MAJOR_COLUMN,
     &     pext, 8, process_list, ier)
A process grid can be defined over the full set of processes being used by an application or over any subset of those processes. This flexibility can be useful when circumstances call for setting up a process grid that does not include all available processes.
For example, if an application will be running in a two-node cluster where one node has 14 CPUs and the other has 10, better load balancing may be achieved by defining the process grid to have 10 processes in each node.
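For instance, the following F77 sketch sets up a 4x5 process grid over 20 of the 24 processes, assuming the application was started with one process per CPU and that ranks 0-13 reside on the 14-CPU node and ranks 14-23 on the 10-CPU node (this rank-to-node mapping is an assumption made for illustration):

      include 's3l/s3l-f.h'

      integer*8 pg
      integer*4 rank
      integer*4 pext(2), plist(20)
      integer*4 i, ier

c     Use ranks 0-9 from the first node and
c     ranks 14-23 from the second node
      do i = 1, 10
         plist(i) = i - 1
         plist(i+10) = i + 13
      end do

c     4x5 process grid, column-major ordering
      rank = 2
      pext(1) = 4
      pext(2) = 5
      call s3l_set_process_grid(pg, rank, S3L_MAJOR_COLUMN,
     &     pext, 20, plist, ier)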
For more information about explicitly defining process grids, see "S3L_set_process_grid" or the S3L_set_process_grid(3) man page.
Sun S3L provides two subroutines for declaring S3L arrays: S3L_declare and S3L_declare_detailed. The library also includes the S3L_DefineArray interface, which maintains compatibility with the Sun HPC 2.0 release of Sun S3L.
S3L_declare and S3L_declare_detailed perform the same function, except that S3L_declare_detailed provides additional arguments that allow more detailed control over the array features. Both require the programmer to specify
The array's rank
The array's extents
The array's type
Which axes will be distributed and which will be local (kept in a single block on one process)
The method by which the array is to be allocated
In addition, S3L_declare_detailed allows the programmer to specify the following array features:
The starting address of the local subgrid. This value is used only if the programmer elects to allocate array subgrids explicitly by disabling automatic array allocation.
The block size to be used in distributing the array along each axis. The programmer has the option of letting Sun S3L choose a default block size.
Which processes contain the start of each array axis. Again, the programmer can let Sun S3L specify default processes. To use this option, the programmer must specify a particular process grid, rather than using one provided by Sun S3L.
The following F77 example allocates a 100 x 100 x 100 double-precision array.
      integer*8 a, pg_a
      integer*4 ext_a(3), block_a(3), local_a(3)
      integer*4 ier

c     Extents of the 100 x 100 x 100 array
      ext_a(1) = 100
      ext_a(2) = 100
      ext_a(3) = 100

c     Make the first axis local; distribute the other two
      local_a(1) = 1
      local_a(2) = 0
      local_a(3) = 0

c     block_a holds the block size for each axis
c     (the values are not shown in this example)
      call s3l_declare_detailed(a, 0, 3, ext_a, S3L_double, block_a,
     &     -1, local_a, pg_a, S3L_USE_MALLOC, ier)
The S3L array a is distributed along each axis of the process grid. The block sizes for the three axes are specified in block_a. Because local_a(1) is set to 1, the first axis of a will be local to the first process in the process grid's first axis. The second and third axes of a are distributed along the corresponding axes of the process grid.
If local_a(1) had been set to 0 instead, all three array axes would be distributed along their respective process grid axes.
For more information about this function, see "S3L_declare_detailed" or the S3L_declare_detailed(3) man page.
The simpler and more compact S3L_declare involves fewer parameters and always block-distributes the arrays. The following C program example allocates a one-dimensional, double-precision array of length 1000.
#include <s3l/s3l-c.h>

int local, ext, ier;
S3L_array_t A;

local = 0;    /* distribute the (single) axis */
ext = 1000;   /* axis extent */
ier = S3L_declare(&A, 1, &ext, S3L_double, &local, S3L_USE_MALLOC);
This example illustrates use of the array_is_local parameter. This parameter consists of an array containing one element per axis. Each element of the array is either 1 or 0, depending on whether the corresponding array axis should be local to a process or distributed. If array_is_local(i) is 0, the array axis i will be distributed along the corresponding axis of the process grid. If it is 1, array axis i will not be distributed. Instead, the extent of that process grid axis will be regarded as 1 and the array axis will be local to the process.
In this S3L_declare example, the array has only one axis, so array_is_local has a single value, in this case 0. If the program containing this code is run on six processes, Sun S3L will associate a one-dimensional process grid of length 6 with the S3L array A. It will set the block size of the array distribution to ceiling(1000/6) = 167. As a result, processes 0 through 4 will each hold 167 local array elements, and process 5 will hold the remaining 1000 - 5*167 = 165.
If array_is_local had been set to 1, the entire array would have been allocated to process 0.
When S3L arrays are no longer needed, they should be deallocated so that the memory resources associated with them become available for other uses. S3L arrays are deallocated with S3L_free.
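For example, assuming a is the handle of an allocated S3L array, an F77 program would free it as follows (a minimal sketch; following the convention of the other F77 calls in this chapter, ier is the trailing error-status argument):

c     Release the S3L array and its memory resources
      call s3l_free(a, ier)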
Sun S3L distributes arrays in a block cyclic fashion. This means each array axis is partitioned into blocks of a certain block size and the blocks are distributed onto the processes in a cyclic fashion.
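For example, consider a single array axis of length 8, a block size of 2, and two processes, P0 and P1. The axis is partitioned into four blocks, which are dealt out to the processes in round-robin fashion:

    Elements:  1 2 | 3 4 | 5 6 | 7 8
    Process:   P0    P1    P0    P1

so P0 holds elements 1-2 and 5-6, and P1 holds elements 3-4 and 7-8.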
Block cyclic distribution is a superset of simple block distribution, a more commonly used array distribution scheme. Figure 2-4 through Figure 2-6 illustrate block and block cyclic distributions with a sample 8x8 array distributed onto a 2x2 process grid.
In Figure 2-4 and Figure 2-5, block size is set to 4 along both axes and the resulting blocks are distributed in pure block fashion. As a result, all the subgrid indices on any given process are contiguous along both axes.
The only difference between these two examples is that process grid ordering is column-major in Figure 2-4 and row-major in Figure 2-5.
Figure 2-6 shows block cyclic distribution of the same array. In this example, the block size for the first axis is set to 4 and the block size for the second axis is set to 2.
When no part of an S3L array is distributed--that is, when all axes are local--all elements of the array are on a single process. By default, this is the process with MPI rank 0. The programmer can request that an undistributed array be allocated to a particular process via the S3L_declare_detailed routine.
Although the elements of an undistributed array are defined only on a single process, the S3L array handle enables all other processes to access the undistributed array.
The Sun S3L utilities S3L_print_array and S3L_print_sub_array can be used to print the values of a distributed S3L array to standard output.
S3L_print_array prints the whole array, while S3L_print_sub_array prints a section of the array that is defined by programmer-specified lower and upper bounds.
The values of array elements will be printed out in column-major order; this is referred to as Fortran ordering, where the leftmost axis index varies fastest.
Each element value is accompanied by the array indices for that value. This is illustrated by the following example.
a is a 4 x 5 x 2 S3L array, which has been initialized to random double-precision values via a call to S3L_rand_lcg. A call to S3L_print_array will produce the following output:
call s3l_print_array(a)

(1,1,1) 0.000525
(2,1,1) 0.795124
(3,1,1) 0.225717
(4,1,1) 0.371280
(1,2,1) 0.225035
(2,2,1) 0.878745
(3,2,1) 0.047473
(4,2,1) 0.180571
(1,3,1) 0.432766
...
When an S3L array is large, it is often a good idea to print only a section of the array rather than the entire array. This not only reduces the time it takes to retrieve the data, it also avoids the difficulty of locating useful information in a display of a large amount of data. Printing selected sections of a large array can make the task of finding the data of interest much easier. This can be done with the function S3L_print_sub_array. The following example shows how to print only the first column of the array shown in the previous example:
      integer*4 lb(3), ub(3), st(3)

c     Specify the lower and upper bounds along each axis.
c     Elements whose coordinates are greater than or equal
c     to lb(i) and less than or equal to ub(i) (and at
c     stride st(i)) are printed to the output.
      lb(1) = 1
      ub(1) = 4
      st(1) = 1
      lb(2) = 1
      ub(2) = 1
      st(2) = 1
      lb(3) = 1
      ub(3) = 1
      st(3) = 1
      call s3l_print_sub_array(a, lb, ub, st, ier)
This call would produce the following output:
(1,1,1) 0.000525
(2,1,1) 0.795124
(3,1,1) 0.225717
(4,1,1) 0.371280
If a stride argument other than 1 is specified, only elements at the specified stride locations will be printed. For example, the following sets the stride for axis 1 to 2
st(1) = 2
which results in the following output:
(1,1,1) 0.000525
(3,1,1) 0.225717
S3L arrays can be visualized with Prism, the debugger that is part of Sun HPC ClusterTools 3.0. Before S3L arrays can be visualized, however, the programmer must instruct Prism that a variable of interest in an MPI code describes an S3L array.
For example, if variable a has been declared in a Fortran program to be of type integer*8 and a corresponding S3L array of type S3L_float has been allocated by a call to an S3L array allocation function, the programmer should enter the following at the prism command prompt:
type float a
Once this is done, Prism can print values of the distributed array:
print a(1:2,4:6)
Or it can assign values to it:
assign a(2,10)=2.0
Or it can visualize it:
print a on dedicated