Sun S3L includes dense linear system solvers that solve linear systems of equations for real and complex general matrices.
LU operations are carried out in two stages. First, one or more matrices A are decomposed into their LU factors using Gaussian elimination with partial pivoting. This is done using S3L_lu_factor. Then, the generated LU factors are used by S3L_lu_solve to solve the linear system AX=B or by S3L_lu_invert to compute the inverse of A.
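The two-stage pattern can be illustrated with a minimal serial sketch in plain Python. This is not the Sun S3L API: the function names lu_factor and lu_solve below are hypothetical stand-ins, the matrix is a small list of rows held in one process, and only the idea (factor once with partial pivoting, then reuse the factors to solve) is taken from the text above.

```python
def lu_factor(a):
    """Factor a square matrix (list of rows) as P*A = L*U with partial pivoting.

    Returns (lu, piv): lu holds L below the diagonal (unit diagonal implied)
    and U on and above it; piv[k] is the row swapped with row k at step k.
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]  # work on a copy
    piv = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest-magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
        piv[k] = p
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b for one right-hand side using factors from lu_factor."""
    n = len(lu)
    x = list(map(float, b))
    for k in range(n):  # apply the recorded row swaps to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):  # forward substitution: L y = P b
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution: U x = y
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


A = [[2, 1, 1], [4, -6, 0], [-2, 7, 2]]
factors, pivots = lu_factor(A)
solution = lu_solve(factors, pivots, [5, -2, 9])
```

Because the factors are kept, additional right-hand sides can be solved without refactoring A. In this sketch an inverse could be formed the same way, by solving against each column of the identity matrix.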
The LU decomposition routine, which is derived from the ScaLAPACK implementation, uses a parallel block-partitioned algorithm. The Sun S3L routine exploits the optimized nodal libraries, consisting primarily of a specialized matrix-matrix multiply routine, to speed up the computation. In addition, an optimized communication scheme is used to reduce the total number of interprocess communication steps.
The LU decomposition algorithm used in Sun S3L is aware of the size of the external cache memory and other CPU parameters and selects the most efficient methods based on that information.
The S3L LU decomposition routine is particularly efficient when factoring 64-bit (double-precision) floating-point matrices that have been allocated using the S3L_USE_MEMALIGN64 option, so that their local subgrids start at 64-byte-aligned memory addresses. In such cases, performance is best when a block size of 24 or 48 is used.
As mentioned in Chapter 2, Sun S3L Arrays, block-cyclic distribution should be used for LU factorization and solution routines.
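As a reminder of what block-cyclic distribution means along one array axis, the following sketch maps a global index to the process that owns it and to its position in that process's local storage. It is an illustration only, not Sun S3L code: the helper name block_cyclic_owner is hypothetical, indices are 0-based, and the first block is assumed to be placed on process 0.

```python
def block_cyclic_owner(g, nb, nprocs):
    """Map global index g to (process, local index) along one axis.

    nb is the block size and nprocs the number of processes along the axis.
    Assumes 0-based indices and that block 0 lives on process 0.
    """
    block = g // nb                  # which block the element falls in
    proc = block % nprocs            # blocks are dealt out round-robin
    local = (block // nprocs) * nb + g % nb  # offset within the owner
    return proc, local


# Example: 10 elements, block size 2, 3 processes along the axis.
layout = [block_cyclic_owner(g, 2, 3) for g in range(10)]
```

With a block size of 2 and 3 processes, elements 0 and 1 go to process 0, elements 2 and 3 to process 1, elements 4 and 5 to process 2, and elements 6 and 7 wrap around to process 0 again. This wrap-around is what keeps all processes busy as the factorization moves down the matrix.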
Also, process grids of the form 1 x np, where np is the total number of processes, provide the best performance when the number of processes and the size of the arrays are relatively small. However, when the array to be factored is very large, it is better to specify rectangular process grids of the form nq x np, rather than square process grids.
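When experimenting with grid shapes, it can help to list the candidates for a given process count. The sketch below assumes that the two grid extents multiply to the total number of processes; the helper name grid_shapes is hypothetical and is not part of Sun S3L. It only enumerates shapes and makes no claim about which one performs best for a given array.

```python
def grid_shapes(nprocs):
    """Return every (rows, cols) pair with rows * cols == nprocs."""
    return [(r, nprocs // r) for r in range(1, nprocs + 1) if nprocs % r == 0]


shapes = grid_shapes(12)
one_row = shapes[0]  # the 1 x nprocs grid suggested for small problems
# Non-square shapes with more than one row: candidates for very large arrays.
rectangular = [(r, c) for r, c in shapes if r != c and r > 1 and c > 1]
```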