To perform a multiple-instance operation, embed the multiple instances of each vector argument in a parallel array of rank greater than 1, and embed the multiple instances of each matrix argument in a parallel array of rank greater than 2.
For example, the simplest multiple-instance matrix-vector multiplication involves the definition of one instance axis.
| integer*8 a, x, y integer*4 ext(3), axis_is_local(3) integer*4 ier axis_is_local(1) = 0 axis_is_local(2) = 0 axis_is_local(3) = 0 ext(1) = p ext(2) = q ext(2) = r call s3l_declare(a, 3, ext, S3L_float, axis_is_local, $ S3L_USE_MALLOC, ier) ext(1) = q ext(2) = r call s3l_declare(x, 2, ext, S3L_float, axis_is_local, $ S3L_USE_MALLOC, ier) ext(1) = p ext(2) = r call s3l_declare(y, 2, ext, S3L_float, axis_is_local, $ S3L_USE_MALLOC, ier) | 
In this code, all three arrays contain an instance axis of length r. In addition, each instance axis is the rightmost axis in the array declaration. In other words, the order of data axes and instance axes is the same in all three arrays. These axes definitions produce arrays whose geometries are outlined in Figure 5-1. In the illustration, r = 4. Multiplication using these arrays is then performed by
| call S3L_mat_vec_mult(y, a, x, 1, 1, 2, 1, ier) | 
In analyzing the operations performed in this call, it is useful to define s0, the index along the instance axis. For a given value of s0, the following will be evaluated:
The values of x_vector_axis = 1 and col_axis = 2 indicate that the product a(i, j, s0) * x(j, s0) will be calculated for all j.
The above product will be summed over i, the first index of a (row_axis = 1), and the result added to y(i, s0).
These two operations will be performed for all 1 <= s0 <= r. In other words, the matrix-vector multiplication will be evaluated for all instances
| y(:, s0) * a(:, :, s0) * x(:, s0) | 
The order in which these instances are evaluated depends on the layouts of the arrays. Since all arrays are block-distributed along all axes, it is possible for one set of processes to work on the first instance
| y(:, 1) = a(:, :, 1) * x(:< 1) | 
while another set of processors evaluates the Nth instance at the same time--that is, in parallel .
| y(:, N) = a(:, :, N) * x(:, N) | 
The extent of parallelism depends on the details of the data layouts, particularly on the mapping of the data and instance axes to the underlying process grid axes. The highest degree of parallelism is achieved when all data axes are local, and all instance axes are distributed.
The use of local data axes forces each cell (all data axes) to reside entirely in just one process. The use of distributed instance axes spreads the collection of cells over the process grid, resulting in better load-balancing among processes. Use of this data layout is discussed below.
Multiple-instance operations are usually most efficient when each cell (all of the data axes) resides on one process. Use of such a layout scheme is discussed in this section. In addition, the use of several instance axes are illustrated. Declarations of arrays containing these axes can be done as
| integer*8 a, x, y integer*4 mat_ext(5), mat_axis_is_local(5) integer*4 vec_ext(4), vec_axis_is_local(4) integer*4 ier mat_axis_is_local(1) = 1 mat_axis_is_local(2) = 1 mat_axis_is_local(3) = 0 mat_axis_is_local(4) = 0 mat_axis_is_local(5) = 0 mat_ext(1) = p mat_ext(2) = q mat_ext(2) = k mat_ext(4) = m mat_ext(5) = n call s3l_declare(a, 5, mat_ext, S3L_float, mat_axis_is_local,) $ S3L_USE_MALLOC, ier vec_axis_is_local(1) = 1 vec_axis_is_local(2) = 1 vec_axis_is_local(3) = 0 vec_axis_is_local(4) = 0 vec_axis_is_local(5) = 0 vec_ext(1) = q vec_ext(2) = k vec_ext(2) = m vec_ext(4) = n call s3l_declare(x, 4, vec_ext, S3L_float, vec_axis_is_local,) $ S3L_USE_MALLOC, ier vec_ext(1) = p vec_ext(2) = k vec_ext(2) = m vec_ext(4) = n call s3l_declare(y, 4, vec_ext, S3L_float, vec_axis_is_local,) $ S3L_USE_MALLOC, ier | 
The data axes are defined to be local to a process. Each array has three instance axes, each of which is block distributed. Note that the order of instance axes is the same in all three arrays.
To analyze the results of the call
| call S3L_mat_vec_mult(y, a, x, 1, 1, 2, 1, ier) | 
s0, s1, and s2 are used to denote the index along each of the three instance axes. For a given set of s0, s1, and s2, the following will be evaluated:
The values of x_vector_axis = 1 and col_axis = 2 indicate that the product a(i, j, s0, s1, s2) * x(j, s0, s1, s2) will be calculated for all j.
This product will be summed over i, the first index of a (row_axis = 1), and the result added to y(i, s0, s1, s2).
These two operations will be performed for all 1 <= s0 <= k, 1 <= s1 <= m, and 1 <= s2 <= n. In other words, the matrix-vector multiplication will be evaluated for all instances
| y(:, s0, s1, s2) = A(:, :, s0, s1, s2) * x(:, s0, s1, s2) | 
However, unlike the previous example, the data axes in this case are local. This means that the evaluation of each instance does not involve any interprocess communication. Each process independently works on its own set of instances, using a purely local matrix-vector-multiplication algorithm. These local algorithms are usually faster than their global counterparts, since no communication between processes is involved.
Source code for these operations is in the file matvec_mult.f. This can be found in the S3L examples directory examples/s3l/dense_matrix_ops-f/, the location of which is site-specific.