Sun S3L 3.0 Programming and Reference Guide

Functions That Benefit From Cyclic Distributions

Functions that focus their operations on discrete subparts of an S3L array rather across the full array are likely to provide better load balancing and performance when their array operands are distributed cyclically rather than in simple block fashion. This is particularly true for the LU decomposition (S3L_lu_factor and S3L_lu_solve) and Singular Value Decomposition (S3L_gen_svd) functions.

This is illustrated by the examples shown in Figure 3-1 and Figure 3-2, which show a 16x16 array being distributed across a 1x4 process grid, first in simple block fashion and next in block cyclic fashion.

In Figure 3-1, a block size of 4 is used for the second axis. This means that the second array axis will be distributed in one pass across the process grid's second axis--in other words, it will be block-distributed.

Figure 3-1 Block Distribution of a 16x16 S3L Array on a 1x4 Process Grid

Graphic

If the nature of the operation is such that every process will compute the sum of elements in the lower triangular part of the array (shaded portion) and send the results to the next processor, this distribution pattern will result in serious load imbalance among the processes. Since process 0 must perform many more iterations than the other processes, especially than process 3, overall computational time will be greater than it would be if better load balancing could be achieved.

In Figure 3-2, a block size of 2 is chosen for second axis. Although process 0 still has a larger section of the array operand than the other processes, cyclic distribution reduces the load differences significantly.

Figure 3-2 Block-Cyclic Distribution of a 16x16 S3L Array on a 1x4 Process Grid

Graphic

Note that there is usually a limit to the load balancing gains that block-cyclic distribution can provide. In other words, setting block size to 1 is not likely to maximize performance, even for operations like the one just described. This limit results from a trade-off between the gains in load balancing that are provided by small block sizes and the gains in cache blocking efficiency that are achieved by loading array elements with consecutive indices into cache.

In addition to this trade-off, most of the nodal codes that underlay Sun S3L implement simple block distribution. Their optimal block size has to be matched to the optimal partitioning of the Sun S3L array.

In algorithms that are naturally load balanced--that is, where the amount of data each process has to process is approximately equal-- block-cyclic distribution has little effect on execution efficiency.