Sun S3L 3.0 Programming and Reference Guide

Specifying Single-Instance vs. Multiple-Instance Operations

Sun S3L routines that support multiple instances have the same calling sequence for single-instance and multiple-instance operations. The methods for specifying single-instance and multiple-instance operations depend on which routine you are calling. The man pages for routines that are capable of multiple-instance operation contain specific information for their respective routines.

"Example 1: Matrix-Vector Multiplication" explains the differences between single- and multiple-instance operation for the matrix-vector-multiplication routine. "Example 2: Fast Fourier Transforms" discusses use of multiple instances in FFTs.

Example 1: Matrix-Vector Multiplication

When you call the matrix-vector-multiplication routine, S3L_mat_vec_mult, the dimensionality of the arguments you supply determines whether the routine performs a single-instance or multiple-instance operation. The F77 form of this Sun S3L function is

S3L_mat_vec_mult(y, a, x, y_vector_axis, row_axis, col_axis, x_vector_axis,
ier)


Note -

The S3L_mat_vec_mult routine requires you to specify which axes you are using as data axes for each matrix or vector argument.


Single-Instance Operation

To perform a single-instance operation, specify each vector argument as a 1D parallel array and each matrix argument as a 2D parallel array. (Alternatively, you can declare these arguments to have more dimensions, but all instance axes must have length 1.)

For example, a single-instance operation in F77 can be performed by first defining the block-distributed arrays:

 integer*8 a, x, y
 integer*4 ext(2), axis_is_local(2)
 integer*4 ier

 axis_is_local(1) = 0
 axis_is_local(2) = 0

 ext(1) = p
 ext(2) = q

 call s3l_declare(a, 2, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

 call s3l_declare(x, 2, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

 call s3l_declare(y, 2, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

and then using

call S3L_mat_vec_mult(y, a, x, 1, 1, 2, 1, ier)

Arrays x and y are 1D; the definitions of x_vector_axis = 1 and col_axis = 2 indicate that the product a(i, j) * x(j) will be evaluated for all values of j. These products will be summed over the first index of a (row_axis = 1), and the result added to the corresponding element in y. The equivalent code is

do i = 1, p
    sum = 0.0
    do j = i, q
        sum =
sum + a(i, j) * x(j)
    enddo
enddo

Multiple-Instance Operation

To perform a multiple-instance operation, embed the multiple instances of each vector argument in a parallel array of rank greater than 1, and embed the multiple instances of each matrix argument in a parallel array of rank greater than 2.

For example, the simplest multiple-instance matrix-vector multiplication involves the definition of one instance axis.

 integer*8 a, x, y
 integer*4 ext(3), axis_is_local(3)
 integer*4 ier

 axis_is_local(1) = 0
 axis_is_local(2) = 0
 axis_is_local(3) = 0

 ext(1) = p
 ext(2) = q
 ext(2) = r

 call s3l_declare(a, 3, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

 ext(1) = q
 ext(2) = r

 call s3l_declare(x, 2, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

 ext(1) = p
 ext(2) = r

 call s3l_declare(y, 2, ext, S3L_float, axis_is_local,
$    S3L_USE_MALLOC, ier)

In this code, all three arrays contain an instance axis of length r. In addition, each instance axis is the rightmost axis in the array declaration. In other words, the order of data axes and instance axes is the same in all three arrays. These axes definitions produce arrays whose geometries are outlined in Figure 5-1. In the illustration, r = 4. Multiplication using these arrays is then performed by

call S3L_mat_vec_mult(y, a, x, 1, 1, 2, 1, ier)

In analyzing the operations performed in this call, it is useful to define s0, the index along the instance axis. For a given value of s0, the following will be evaluated:

These two operations will be performed for all 1 <= s0 <= r. In other words, the matrix-vector multiplication will be evaluated for all instances

 y(:, s0) * a(:, :, s0) * x(:, s0)

The order in which these instances are evaluated depends on the layouts of the arrays. Since all arrays are block-distributed along all axes, it is possible for one set of processes to work on the first instance

 y(:, 1) = a(:, :, 1) * x(:< 1)

while another set of processors evaluates the Nth instance at the same time--that is, in parallel .

 y(:, N) = a(:, :, N) * x(:, N)

The extent of parallelism depends on the details of the data layouts, particularly on the mapping of the data and instance axes to the underlying process grid axes. The highest degree of parallelism is achieved when all data axes are local, and all instance axes are distributed.

The use of local data axes forces each cell (all data axes) to reside entirely in just one process. The use of distributed instance axes spreads the collection of cells over the process grid, resulting in better load-balancing among processes. Use of this data layout is discussed below.

Multiple-instance operations are usually most efficient when each cell (all of the data axes) resides on one process. Use of such a layout scheme is discussed in this section. In addition, the use of several instance axes are illustrated. Declarations of arrays containing these axes can be done as

 integer*8 a, x, y
 integer*4 mat_ext(5), mat_axis_is_local(5)
 integer*4 vec_ext(4), vec_axis_is_local(4)
 integer*4 ier

 mat_axis_is_local(1) = 1
 mat_axis_is_local(2) = 1
 mat_axis_is_local(3) = 0
 mat_axis_is_local(4) = 0
 mat_axis_is_local(5) = 0

 mat_ext(1) = p
 mat_ext(2) = q
 mat_ext(2) = k
 mat_ext(4) = m
 mat_ext(5) = n

 call s3l_declare(a, 5, mat_ext, S3L_float, mat_axis_is_local,)
$    S3L_USE_MALLOC, ier

 vec_axis_is_local(1) = 1
 vec_axis_is_local(2) = 1
 vec_axis_is_local(3) = 0
 vec_axis_is_local(4) = 0
 vec_axis_is_local(5) = 0

 vec_ext(1) = q
 vec_ext(2) = k
 vec_ext(2) = m
 vec_ext(4) = n

 call s3l_declare(x, 4, vec_ext, S3L_float, vec_axis_is_local,)
$    S3L_USE_MALLOC, ier

 vec_ext(1) = p
 vec_ext(2) = k
 vec_ext(2) = m
 vec_ext(4) = n

 call s3l_declare(y, 4, vec_ext, S3L_float, vec_axis_is_local,)
$    S3L_USE_MALLOC, ier

The data axes are defined to be local to a process. Each array has three instance axes, each of which is block distributed. Note that the order of instance axes is the same in all three arrays.

To analyze the results of the call

 call S3L_mat_vec_mult(y, a, x, 1, 1, 2, 1, ier)

s0, s1, and s2 are used to denote the index along each of the three instance axes. For a given set of s0, s1, and s2, the following will be evaluated:

These two operations will be performed for all 1 <= s0 <= k, 1 <= s1 <= m, and 1 <= s2 <= n. In other words, the matrix-vector multiplication will be evaluated for all instances

 y(:, s0, s1, s2) = A(:, :, s0, s1, s2) * x(:, s0, s1,
s2)

However, unlike the previous example, the data axes in this case are local. This means that the evaluation of each instance does not involve any interprocess communication. Each process independently works on its own set of instances, using a purely local matrix-vector-multiplication algorithm. These local algorithms are usually faster than their global counterparts, since no communication between processes is involved.

Source code for these operations is in the file matvec_mult.f. This can be found in the S3L examples directory examples/s3l/dense_matrix_ops-f/, the location of which is site-specific.

Example 2: Fast Fourier Transforms

When calling the detailed complex-to-complex FFT routine, S3L_fft_detailed, you can supply a multidimensional parallel array and specify whether you want to perform a forward transform, an inverse transform, or no transform along each axis. You can also specify axes along which no transform is performed, but address bits are reversed. The axes that are transformed or bit-reversed are the data axes, and define the cell; the axes along which you perform no transformation are the instance axes.


Note -

The simple FFT routine, S3L_fft, performs a transform along each axis of the supplied parallel array. Consequently, it does not support multiple instances.