Sun S3L 3.0 Programming and Reference Guide

S3L Array Layout and Performance

Most Sun S3L functions perform best when their operand arrays are block-distributed along all axes. There are, however, exceptions to this generalization, and this section discusses them.

Functions That Benefit From Cyclic Distributions

Functions that focus their operations on discrete subparts of an S3L array, rather than across the full array, are likely to provide better load balancing and performance when their array operands are distributed cyclically rather than in simple block fashion. This is particularly true of the LU decomposition (S3L_lu_factor and S3L_lu_solve) and Singular Value Decomposition (S3L_gen_svd) functions.

This is illustrated in Figure 3-1 and Figure 3-2, which show a 16x16 array distributed across a 1x4 process grid, first in simple block fashion and then in block-cyclic fashion.

In Figure 3-1, a block size of 4 is used for the second axis. This means that the second array axis will be distributed in one pass across the process grid's second axis--in other words, it will be block-distributed.

Figure 3-1 Block Distribution of a 16x16 S3L Array on a 1x4 Process Grid


If the nature of the operation is such that every process computes the sum of the elements in the lower triangular part of the array (the shaded portion) and sends the result to the next process, this distribution pattern will result in serious load imbalance among the processes. Because process 0 must perform many more iterations than the other processes, particularly process 3, overall computational time will be greater than it would be if the load were better balanced.

In Figure 3-2, a block size of 2 is chosen for the second axis. Although process 0 still has a larger section of the array operand than the other processes, the block-cyclic distribution reduces the load differences significantly.

Figure 3-2 Block-Cyclic Distribution of a 16x16 S3L Array on a 1x4 Process Grid

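The load difference between the two distributions can be quantified with a simple tally of how many lower-triangular elements each process owns. The following C sketch performs only that bookkeeping arithmetic and does not call any Sun S3L routine; the 16x16 extent, 1x4 process grid, and block sizes are taken from the figures. It prints the per-process element counts for block sizes 4, 2, and 1:

    /* Count the lower-triangular elements owned by each process when the
     * second axis of a 16x16 array is distributed over a 1x4 process grid
     * with a given block size. */
    #include <stdio.h>

    #define N      16   /* array extent along each axis          */
    #define NPROCS  4   /* processes along the second grid axis  */

    static void tally(int block_size)
    {
        int count[NPROCS] = { 0 };
        int j, p;

        for (j = 0; j < N; j++) {
            int owner = (j / block_size) % NPROCS;  /* block-cyclic owner of column j */
            count[owner] += N - j;                  /* lower-triangular elements in column j */
        }

        printf("block size %d:", block_size);
        for (p = 0; p < NPROCS; p++)
            printf("  P%d=%3d", p, count[p]);
        printf("\n");
    }

    int main(void)
    {
        tally(4);   /* simple block distribution (Figure 3-1)  */
        tally(2);   /* block-cyclic distribution (Figure 3-2)  */
        tally(1);   /* fully cyclic, for comparison            */
        return 0;
    }

With a block size of 4, processes 0 through 3 own 58, 42, 26, and 10 lower-triangular elements, respectively. With a block size of 2, the counts even out to 46, 38, 30, and 22.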

Note that there is usually a limit to the load-balancing gains that block-cyclic distribution can provide. In other words, setting the block size to 1 is not likely to maximize performance, even for operations like the one just described. This limit results from a trade-off between the gains in load balancing that small block sizes provide and the gains in cache-blocking efficiency that come from loading array elements with consecutive indices into cache.

In addition to this trade-off, most of the nodal codes that underlie Sun S3L implement simple block distribution, and their optimal block size has to be matched to the optimal partitioning of the Sun S3L array.

In algorithms that are naturally load balanced--that is, where the amount of data each process must handle is approximately equal--block-cyclic distribution has little effect on execution efficiency.

Distributing Only the Last Axis

The performance of some S3L functions can be enhanced by block-distributing only the last axis of the S3L array and making all other axes local. This rule applies to the FFT, sorting, and banded solver functions.

These functions are all optimized for S3L arrays that are distributed in this manner. If an array with a different distribution is passed as an argument, these functions automatically redistribute it, perform the parallel computation, and then restore the array to its original distribution. Because this redistribution introduces extra overhead, it is good practice to ensure that S3L arrays passed to these functions already follow this distribution scheme.
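
The benefit of this layout follows from the ownership rule it implies: when only the last axis is distributed, every element that shares a last-axis index resides on the same process, so operations that sweep along the other axes touch only local data. The following C sketch illustrates the rule; it is not an S3L call, and the extents, process count, and blocking arithmetic are illustrative assumptions:

    /* Owner of plane k of an nx x ny x nz array when only the last axis is
     * block-distributed over np processes and all other axes are local. */
    #include <stdio.h>

    int main(void)
    {
        const int nz = 64;                     /* extent of the last (distributed) axis */
        const int np = 4;                      /* number of processes                   */
        const int block = (nz + np - 1) / np;  /* block size along the last axis        */
        int k;

        /* All elements (i, j, k) with the same k live on process k / block,
         * so 1-D transforms along the first two axes need no
         * inter-process communication. */
        for (k = 0; k < nz; k += nz / np)
            printf("plane k = %2d -> process %d\n", k, k / block);
        return 0;
    }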

Allocating Arrays in Shared Memory

Sun S3L supports the allocation of S3L arrays in shared memory. When an MPI program runs on a cluster of nodes, processes collocated on the same node can allocate their local array parts in that node's shared memory. Storing array sections in shared memory allows collocated processes to access each other's array elements without going through MPI, which can yield significant performance improvements.


Note -

A special case of this would be an MPI application running on a single node. In this case, the entire S3L array could be allocated in shared memory.


Several Sun S3L functions are optimized for shared-memory access. That is, they employ different, more efficient algorithms when S3L arrays have been allocated in shared memory. These functions include the single- and multidimensional parallel FFTs, as well as the array transpose and sparse solver routines.

Use the S3L_declare or S3L_declare_detailed function to allocate a parallel array in shared memory, specifying the type of allocation to be used; the supported allocation types are listed in the descriptions of those functions.

Numbers of Processes

Many Sun S3L routines employ a serial algorithm when called from an application running on a single process and a different, parallel algorithm when called from a multiprocess application. When those Sun S3L routines are executed on a small number of processes--two or three--they are likely to be slower than the serial version running on a single process. This is because the overhead of the parallel algorithm can overshadow any gains resulting from parallelization of the operation.

This means that, in general, MPI applications that call Sun S3L routines should run on at least four processes.