Sun S3L 3.0 Programming and Reference Guide

Dense Matrix Operations

Sun S3L includes optimized parallel functions for performing the following dense matrix operations:

matrix-matrix and matrix-vector multiplication
inner product computation
outer product computation
2-norm computation

These functions have been optimized for various multiprocess configurations and array sizes. They select different algorithms according to the particular configuration. The algorithms they may employ include Broadcast-Multiply-Roll, Cannon, and Broadcast-Broadcast Multiply.
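
As an illustration of one of the algorithms named above, the following is a minimal serial sketch of Cannon's algorithm in Python. It simulates a q x q process grid inside a single process and is not S3L code. The function names (split_blocks, block_product, cannon) are hypothetical, and the sketch assumes square matrices whose order is divisible by q.

```python
def block_product(a, b):
    """Plain triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def block_sum(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def split_blocks(m, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    s = len(m) // q
    return [[[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]
             for j in range(q)] for i in range(q)]


def cannon(a, b, q):
    """Multiply a and b by simulating Cannon's algorithm on a q x q grid."""
    s = len(a) // q
    A = split_blocks(a, q)
    B = split_blocks(b, q)
    # Initial skew: block row i of A shifts left by i positions,
    # block column j of B shifts up by j positions.
    A = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    B = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0] * s for _ in range(s)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Each simulated process multiplies the two blocks it currently holds.
        C = [[block_sum(C[i][j], block_product(A[i][j], B[i][j]))
              for j in range(q)] for i in range(q)]
        # Roll A one position left and B one position up for the next step.
        A = [[A[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        B = [[B[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the q x q grid of result blocks into one n x n matrix.
    return [sum((C[i][j][r] for j in range(q)), [])
            for i in range(q) for r in range(s)]
```

In a real distributed run, each list rotation above would correspond to a nearest-neighbor exchange between processes. Here the rotations are only index arithmetic.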

The source data to be used by the dense matrix operations should be aligned so that no extra redistribution costs are imposed. For example, suppose a rectangular matrix A, distributed along its last axis over a 1 x np process grid, is to be multiplied by an n x 1 vector x. The vector x should then be distributed across the np processes with the same block size that is used to distribute A along its second axis.
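
The alignment described above can be sketched in Python as follows. This is a single-process simulation under stated assumptions, not S3L code: the helper names (distribute_columns, local_matvec, aligned_matvec) are hypothetical, and the distribution is assumed to be pure block along the columns of A. Because x is cut with the same block size as the columns of A, each simulated process already holds exactly the elements of x that its columns need.

```python
def distribute_columns(a, x, nprocs, block):
    """Give process p columns [p*block, (p+1)*block) of A and the same slice of x."""
    pieces = []
    for p in range(nprocs):
        lo = p * block
        hi = min((p + 1) * block, len(x))
        a_local = [row[lo:hi] for row in a]   # column block of A
        x_local = x[lo:hi]                    # matching block of x
        pieces.append((a_local, x_local))
    return pieces


def local_matvec(a_local, x_local):
    """Partial product computed from purely local data."""
    return [sum(aij * xj for aij, xj in zip(row, x_local)) for row in a_local]


def aligned_matvec(a, x, nprocs, block):
    """Compute A*x from per-process partial products, then sum them."""
    partials = [local_matvec(a_local, x_local)
                for a_local, x_local in distribute_columns(a, x, nprocs, block)]
    # The only communication a real run would need is this final reduction.
    return [sum(values) for values in zip(*partials)]
```

If x were distributed with a different block size, some processes would have to fetch elements of x from their neighbors before the local products could start.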

In general, the performance of dense matrix operations is similar for both pure block and block-cyclic distributions.

If the two operand arrays in a dense matrix function do not have the same type of distribution--that is, one is pure block-distributed and the other block-cyclic--the dense matrix multiplication routine will automatically redistribute them as necessary to align the arrays. However, this redistribution can add considerable overhead to the operation, so it is best if the application ensures beforehand that the arrays have like distributions.
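
To give a feel for how much data a mismatch can affect, the following Python sketch counts the elements of a one-dimensional array whose owning process differs between two distributions. It is an illustrative model only: the function names are hypothetical, and it uses the conventional block-cyclic ownership rule, treating pure block as the special case in which the block size equals the array length divided by the number of processes.

```python
def owner(index, block_size, nprocs):
    """Process owning a global index under a block-cyclic distribution."""
    return (index // block_size) % nprocs


def elements_to_move(n, nprocs, block_size_a, block_size_b):
    """Count indices whose owner differs between the two distributions."""
    return sum(
        1 for i in range(n)
        if owner(i, block_size_a, nprocs) != owner(i, block_size_b, nprocs)
    )
```

For 16 elements on 4 processes, a pure block layout (block size 4) and a block-cyclic layout with block size 2 agree on the owner of only 4 elements, so elements_to_move(16, 4, 4, 2) reports that 12 elements would have to change process.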

In general, the dense matrix parallel algorithm is more efficient when the matrices being multiplied are large. This is because, for large matrices, the O(N^3) computational work dominates the O(N^2) communication requirements.
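
The effect of matrix size can be made concrete with a small Python estimate. This is a rough model, not a measurement: it assumes 2*N^3 floating-point operations for an N x N product and a communication volume of words_per_element * N^2 words, where words_per_element is a hypothetical constant chosen for illustration.

```python
def compute_to_comm_ratio(n, words_per_element=1):
    """Floating-point operations per word communicated, under the assumed model."""
    flops = 2 * n ** 3                  # one multiply and one add per inner step
    words = words_per_element * n ** 2  # assumed communication volume
    return flops / words
```

Under this model the ratio grows linearly with N, so multiplying the matrix order by ten gives each communicated word ten times as much computation to amortize it.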

The benefit of larger matrices can be offset, however, when the matrices are so large that they occupy nearly all of the total system memory. Because the parallel algorithm allocates additional internal data structures, the system may have to swap data out of memory, which can degrade performance.

When the multiple instance capability of the dense matrix functions is used, performance can be improved significantly by making the instances local.