This section discusses performance tuning for specific Sun S3L functions. It does not cover all Sun S3L functions.
Sun S3L provides two functions for computing the forward and backward Fourier transforms of one-, two-, and three-dimensional complex arrays: S3L_fft and S3L_ifft.
In Sun S3L, the term forward FFT is used for operations with a negative sign in the exponential factors.
Before calling either of these functions, however, an FFT setup must be computed, using S3L_fft_setup. This setup will contain the FFT twiddle factors and other internal data specific to the algorithm that will be used in the FFT computation.
The setup data depend only on the size, type and layout of the S3L array. Consequently, once they are computed they can be applied in the FFT computation of any other S3L arrays that have the same size, type and layout.
The S3L array must already have been allocated before calling S3L_fft_setup. The setup routine specifies the S3L array of interest by using its array handle, which was returned by a previous call to S3L_declare or S3L_declare_detailed.
The following code example shows S3L_fft_setup being used to create a setup for the S3L array a.
      integer*4 setup_id, ier
      integer*8 a
c
c     compute an FFT setup for a parallel
c     S3L array identified by a.
c
c     Note that a must have been properly allocated via
c     a call to S3L_declare or S3L_declare_detailed.
c
      call s3l_fft_setup(a,setup_id,ier)
S3L_fft_setup returns a value of type integer*4 (F77/F90) or int (C/C++) in setup_id. Subsequent calls to S3L_fft and S3L_ifft use the setup_id value to reference the FFT setup of interest.
Use S3L_fft to compute a forward FFT.
call s3l_fft(a,setup_id,ier)
Use S3L_ifft to compute the inverse FFT.
call s3l_ifft(a,setup_id,ier)
Note that the same setup can be used for both the forward and the inverse FFT. Neither S3L_fft nor S3L_ifft scales its results, so a forward FFT followed by an inverse FFT leaves the original data multiplied by the product of the array extents.
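The following F77 sketch illustrates this round trip. It assumes a is a previously allocated two-dimensional complex S3L array of extents NX x NY; the rescaling itself is left to the application (for example, a loop over the local subgrid elements), since only the scale factor is shown here.

c     Sketch only: a is assumed to be a previously allocated
c     NX x NY complex S3L array handle.
      integer*4 setup_id, ier
      integer   nx, ny
      integer*8 a
      real*8    scale
      parameter (nx = 512, ny = 512)

      call s3l_fft_setup(a, setup_id, ier)
      call s3l_fft(a, setup_id, ier)
      call s3l_ifft(a, setup_id, ier)

c     a now holds the original data multiplied by nx*ny.
c     Rescaling by scale (for example, in a loop over the
c     local subgrid elements) restores the original values.
      scale = 1.0d0 / dble(nx*ny)

      call s3l_fft_free_setup(setup_id, ier)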
Note also that, for one-dimensional FFTs, there are certain requirements regarding the size of the FFT and the number of processes used to compute it. For details, see "S3L_fft" and "S3L_ifft" or the S3L_fft and S3L_ifft man pages.
S3L_fft and S3L_ifft can be used to compute the Discrete Fourier Transform of arrays of up to three dimensions.
When arrays with more than three dimensions are to be transformed, or when more control over how the FFTs are applied to the array dimensions is desired, use S3L_fft_detailed. This function uses the same setup_id as S3L_fft and S3L_ifft, but accepts additional parameters. S3L_fft_detailed allows the programmer to specify (see the example following this list):
the axis along which the transform is to be applied
the direction of the FFT (1 for forward, -1 for inverse)
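The following F77 fragment sketches such a call. The argument order shown (array handle, setup ID, direction flag, axis, error code) is an assumption made for illustration only; consult the S3L_fft_detailed man page for the exact interface.

c     Sketch only: argument order is assumed, not authoritative.
c     Apply a forward (iflag = 1) transform along axis 2 of the
c     S3L array a, reusing the setup created by s3l_fft_setup.
      integer*4 setup_id, ier, iflag, axis
      integer*8 a

      iflag = 1
      axis  = 2
      call s3l_fft_detailed(a, setup_id, iflag, axis, ier)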
Once the FFT computations have completed, the resources associated with that FFT setup should be freed by calling S3L_fft_free_setup, as in the following F77 example.
call s3l_fft_free_setup(setup_id,ier)
The Sun S3L FFT is optimized for use by message-passing applications. It works for S3L arrays with arbitrary distributions on a multiprocess grid. For certain data distributions, maximum efficiency can be obtained.
For example, the algorithm used for a two-dimensional complex-to-complex parallel FFT is optimized for arrays distributed along their second dimension, where the number of processes is a power of two. For maximum efficiency, both axes of the array must be divisible by the number of processes.
When the array distribution allows fast algorithms to be used, the two-dimensional FFT consists essentially of multiple one-dimensional FFTs performed along the array columns that are local to each process. The algorithm then transposes the array in such a way that each process holds a number of complete rows locally. This is followed by a second set of local one-dimensional FFTs. Finally, the data are transposed back to the original distribution.
This sequence is illustrated by the diagrams in Figure 3-3 through Figure 3-5.
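The following serial F77 sketch is not S3L code; it simply illustrates the row-column decomposition described above, using a naive O(n**2) DFT in place of the optimized local FFTs so that the example stays self-contained.

      program rowcol
c     Serial illustration of a 2-D transform built from 1-D
c     transforms along the columns, a transpose, 1-D transforms
c     along the (now local) rows, and a transpose back.
      integer nx, ny, i, j
      parameter (nx = 4, ny = 4)
      complex a(nx,ny), t(ny,nx)

      do j = 1, ny
         do i = 1, nx
            a(i,j) = cmplx(real(i + j), 0.0)
         end do
      end do

c     1-D transforms along the columns
      do j = 1, ny
         call dft1d(a(1,j), nx)
      end do
c     transpose so the second dimension becomes local
      do j = 1, ny
         do i = 1, nx
            t(j,i) = a(i,j)
         end do
      end do
c     1-D transforms along the rows of the original array
      do i = 1, nx
         call dft1d(t(1,i), ny)
      end do
c     transpose back to the original layout
      do j = 1, ny
         do i = 1, nx
            a(i,j) = t(j,i)
         end do
      end do
      write (*,*) 'transformed a(1,1) =', a(1,1)
      end

      subroutine dft1d(x, n)
c     Naive O(n**2) discrete Fourier transform (forward, that is,
c     negative sign in the exponential), performed in place.
      integer n, k, m
      complex x(*), y(1024)
      real pi
      parameter (pi = 3.1415927)
      do k = 1, n
         y(k) = (0.0, 0.0)
         do m = 1, n
            y(k) = y(k) + x(m)*
     &             cexp(cmplx(0.0, -2.0*pi*real((k-1)*(m-1))/real(n)))
         end do
      end do
      do k = 1, n
         x(k) = y(k)
      end do
      end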
When the data are distributed in suboptimal ways (for example, along both dimensions of a rectangular process grid), a global communication step is usually performed first to redistribute the data to the optimal distribution, and then the optimized FFT is performed. This redistribution step increases execution overhead.
Sun S3L includes routines for performing the singular value decomposition of real single- or double-precision arrays. These routines include options for computing only the singular values or the right and/or left singular vectors.
For the S3L_gen_svd function, optimal performance is achieved when block-cyclic distribution is used on the S3L array operand. S3L_gen_svd works best on large arrays. Performance is also better when S3L_gen_svd executes on a small number of processes.
Sun S3L includes sorting routines for parallel sorting of one-dimensional arrays or multidimensional arrays along a user-specified axis. It also includes routines for computing the grade (rank) of the elements of an array.
The sorting and grading routines are most efficient when the arrays are block distributed. For multidimensional sorts and grades, performance is best when the axis to be sorted or graded is local to a process.
The sort routines use a variation of the sample sort algorithm. In a coordinated operation, all processes extract a random sample of the data. The distribution of this sample should match, as closely as possible, the distribution of the data to be sorted. Based on this sample, each process splits its data into np packets, where np is the number of processes, each packet destined for a particular process. A global interprocess communication stage then collects the packets into the appropriate processes. Each process then independently sorts its own data using a quicksort algorithm. Finally, all the processes coordinate to restore the data to its original distribution.
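The packet-splitting step can be pictured with a small serial F77 sketch (not S3L code): given np-1 splitters derived from the sorted sample, each process counts how many of its local elements belong in each of the np packets.

      program packets
c     Serial illustration of the packet-splitting step of a
c     sample sort: np-1 splitters partition the value range
c     into np packets, one per destination process.
      integer np, n, i, p
      parameter (np = 4, n = 12)
      real x(n), splitters(np-1)
      integer counts(np)
      data x /7., 2., 9., 4., 1., 8., 5., 3., 6., 11., 10., 0./
      data splitters /3., 6., 9./

      do p = 1, np
         counts(p) = 0
      end do
      do i = 1, n
         p = 1
c        advance past all splitters smaller than x(i)
 10      if (p .lt. np) then
            if (x(i) .gt. splitters(p)) then
               p = p + 1
               go to 10
            end if
         end if
         counts(p) = counts(p) + 1
      end do
      do p = 1, np
         write (*,*) 'packet', p, ':', counts(p), 'elements'
      end do
      end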
The first communication stage, where packets of data are sent to the appropriate processes, is more intensive than the operation that restores the original distribution. This is because the second communication stage only involves exchanges between processes with neighboring ranks.
In general, the parallel S3L sort routines exhibit good scaling characteristics. While some data distribution patterns can affect the quality of load balancing and the performance of local sorts, the internal parameters of the parallel algorithm have been chosen to achieve good performance for most cases of data distribution.
Sorting single-precision floating-point numbers is faster than sorting double-precision floating-point numbers. Also, sorting 64-bit integers can be slower than sorting 64-bit floating-point numbers.
Sun S3L includes dense linear systems solvers that provide solutions to linear systems of equations for real and complex general matrices.
LU operations are carried out in two stages. First, one or more matrices A are decomposed into their LU factors, using Gaussian elimination with partial pivoting; this is done by S3L_lu_factor. Then, the generated LU factors are used by S3L_lu_solve to solve the linear system AX = B, or by S3L_lu_invert to compute the inverse of A.
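A minimal F77 sketch of this two-stage sequence follows. The argument lists shown (with the error code last, as in the FFT examples) are assumptions for illustration only; see the S3L_lu_factor and S3L_lu_solve man pages for the exact interfaces.

c     Sketch only: a and b are assumed to be previously allocated
c     S3L array handles for the coefficient matrix A and the
c     right-hand side B; the argument order is assumed.
      integer*4 setup_id, ier
      integer*8 a, b

c     Stage 1: factor A into its LU factors (partial pivoting).
      call s3l_lu_factor(a, setup_id, ier)

c     Stage 2: use the factors to solve AX = B (the solution is
c     assumed to be returned in b).
      call s3l_lu_solve(b, a, setup_id, ier)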
The LU decomposition routine, which is derived from the ScaLAPACK implementation, uses a parallel block-partitioned algorithm. The Sun S3L routine exploits the optimized nodal libraries, consisting primarily of a specialized matrix-matrix multiply routine, to speed up the computation. In addition, an optimized communication scheme is used to reduce the total number of interprocess communication steps.
The LU decomposition algorithm used in Sun S3L is aware of the size of the external cache memory and other CPU parameters and selects the most efficient methods based on that information.
The S3L LU decomposition routine is particularly efficient when factoring 64-bit (double-precision) floating-point matrices that have been allocated using the S3L_USE_MEMALIGN64 option so that their local subgrids are in 64-byte aligned memory addresses. Performance is best in such cases when a block size of 24 or 48 is used.
As mentioned in Chapter 2, Sun S3L Arrays, block-cyclic distribution should be used for LU factorization and solution routines.
Also, process grids having the form 1 x np, where np is the total number of processes, provide the best performance when the number of processes and the size of the arrays are relatively small. However, when the size of the array to be factored is very large, it is better to specify rectangular process grids of the form nq x np, rather than square process grids.
S3L includes both a tridiagonal and a general banded linear solver.
The tridiagonal solver does not perform pivoting, so it should be used only for systems that are known to be stable, such as diagonally dominant tridiagonal systems.
The banded solver performs partial pivoting, so it exhibits better stability than the tridiagonal solver.
Both solvers will perform optimally when the LHS arrays are block-distributed along their last axis, the RHS arrays are block-distributed along their first axis, and the same block size is used for both distributions.
The S3L_gen_trid_factor function is used to factor a tridiagonal system of equations whose main diagonal, superdiagonal, and subdiagonal are given in three S3L arrays: D, U, and L.
S3L_gen_trid_solve is used to solve a tridiagonal system of equations whose RHS matrix is given in an S3L array B, using the factorization results from a previous call to S3L_gen_trid_factor.
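A hedged F77 sketch of the factor/solve sequence follows; the argument order shown (D, U, L, then the setup and error arguments) is an assumption based on the description above, so check the S3L_gen_trid_factor and S3L_gen_trid_solve man pages before relying on it.

c     Sketch only: d, u, l, and b are assumed to be previously
c     allocated S3L array handles holding the main diagonal, the
c     superdiagonal, the subdiagonal, and the right-hand side;
c     the argument order is assumed.
      integer*4 setup_id, ier
      integer*8 d, u, l, b

      call s3l_gen_trid_factor(d, u, l, setup_id, ier)
      call s3l_gen_trid_solve(b, setup_id, ier)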
The banded solvers have linear complexity, and their interprocess communication requirements are modest. Consequently, they can deliver performance on clusters of small nodes that is similar to what would be achieved on a single node with many CPUs.
Given a general square sparse matrix A and an RHS vector b, S3L_gen_iter_solve solves the linear system of equations Ax = b using an iterative algorithm, with or without preconditioning.
S3L_gen_iter_solve implements several different algorithms, which the programmer can select. These include
BiConjugate Gradient Stabilized (BiCGSTAB)
Conjugate Gradient (CG)
Conjugate Gradient Squared (CGS)
Conjugate Residuals (CR)
Restarted Generalized Minimum Residual (GMRES)
Quasi-Minimal Residual (QMR)
Richardson method
If preconditioning is used, the programmer can choose point Jacobi or incomplete LU as the preconditioning method. S3L_gen_iter_solve also allows the programmer to specify a number of other algorithm-specific parameters (illustrated in the sketch following this list), including
the convergence/divergence criteria
the initial guess
the maximum number of iterations
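These parameters map onto the usual structure of an iterative solver. The following serial F77 sketch of the Richardson method (one of the algorithms listed earlier) is purely illustrative and is not S3L code; it shows an initial guess, a convergence criterion, and a maximum iteration count in action on a small dense system.

      program richard
c     Serial illustration of the Richardson iteration
c     x(k+1) = x(k) + omega*(b - A*x(k)) for a small dense
c     system; S3L_gen_iter_solve applies this kind of control
c     logic to sparse parallel systems.
      integer n, maxit, i, j, k
      parameter (n = 3, maxit = 1000)
      real*8 a(n,n), b(n), x(n), r(n), omega, tol, rnorm
      data a /4.d0, 1.d0, 0.d0,
     &        1.d0, 4.d0, 1.d0,
     &        0.d0, 1.d0, 4.d0/
      data b /6.d0, 12.d0, 14.d0/

      omega = 0.2d0
      tol   = 1.d-10
c     initial guess
      do i = 1, n
         x(i) = 0.d0
      end do

      do k = 1, maxit
c        residual r = b - A*x
         do i = 1, n
            r(i) = b(i)
            do j = 1, n
               r(i) = r(i) - a(i,j)*x(j)
            end do
         end do
c        convergence criterion: 2-norm of the residual
         rnorm = 0.d0
         do i = 1, n
            rnorm = rnorm + r(i)*r(i)
         end do
         rnorm = sqrt(rnorm)
         if (rnorm .lt. tol) go to 20
c        Richardson update
         do i = 1, n
            x(i) = x(i) + omega*r(i)
         end do
      end do
 20   write (*,*) 'iterations:', k, '  residual:', rnorm
      write (*,*) 'x =', (x(i), i = 1, n)
      end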
Sun S3L includes a set of utilities for declaring and initializing sparse matrices. Other utilities for operating on sparse matrices are provided as well.
S3L_declare_sparse declares an S3L sparse array and allocates all required memory to store its elements and sparsity pattern representation.
S3L_rand_sparse creates a random sparse matrix with a random sparsity pattern in either the Coordinate (COO) format or the Compressed Sparse Row (CSR) format. Upon successful completion, it returns an S3L array handle representing this random sparse matrix.
S3L_read_sparse reads a sparse array from a data file. Similarly, S3L_write_sparse writes a sparse S3L array to a file. S3L_print_sparse prints a sparse array to standard output.
S3L_matvec_sparse multiplies a sparse array with a dense vector.
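To make the CSR storage scheme mentioned above concrete, here is a small serial F77 sketch (not S3L code) of a sparse matrix-vector product over a matrix stored in Compressed Sparse Row format.

      program csrmv
c     Serial illustration of y = A*x for a sparse matrix A held
c     in CSR format: val holds the nonzeros row by row, col the
c     corresponding column indices, and ptr(i) the index in val
c     where row i begins.
      integer n, nnz, i, k
      parameter (n = 4, nnz = 7)
      integer ptr(n+1), col(nnz)
      real*8 val(nnz), x(n), y(n)
c     a 4 x 4 matrix with 7 nonzero entries
      data ptr /1, 3, 5, 6, 8/
      data col /1, 3, 2, 4, 3, 1, 4/
      data val /4.d0, 1.d0, 3.d0, 2.d0, 5.d0, 1.d0, 6.d0/
      data x   /1.d0, 1.d0, 1.d0, 1.d0/

      do i = 1, n
         y(i) = 0.d0
         do k = ptr(i), ptr(i+1) - 1
            y(i) = y(i) + val(k)*x(col(k))
         end do
      end do
      write (*,*) (y(i), i = 1, n)
      end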
Sun S3L includes optimized parallel functions for performing the following dense matrix operations:
matrix-matrix and matrix-vector multiplication
inner product computation
outer product computation
2-norm computation
These functions have been optimized for various multiprocess configurations and array sizes. They implement different algorithms according to the particular configuration. Among the algorithms that these functions may employ are Broadcast-Multiply-Roll, Cannon, and Broadcast-Broadcast Multiply.
The source data to be used by the dense matrix operations should be aligned so that no extra redistribution costs are imposed. For example, if a rectangular matrix A, distributed along its last axis over a 1 x np process grid, will be multiplied by an n x 1 vector x, the vector x should be distributed onto the np processes with the same block size that is used for the distribution of A along its second axis.
In general the performance of dense matrix operations is similar for both pure block and block-cyclic distributions.
If the two operand arrays in a dense matrix function do not have the same type of distribution (that is, one is pure block-distributed and the other block-cyclic), the dense matrix multiplication routine automatically redistributes the arrays as necessary to align them properly. However, this redistribution can add considerable overhead to the operation, so it is best if the application ensures that the arrays have like distributions beforehand.
In general, the dense matrix parallel algorithm is more efficient when the matrices being multiplied are large. This is because the large matrices take advantage of the dominance of the O(N3) computational complexity over the O(N2) communication requirements.
The benefit of larger matrices can be offset, however, when the matrices are so large that they occupy nearly all of total system memory. Because additional internal data structures are allocated for the parallel algorithm, swapping out of memory may be required, which can degrade performance.
When the multiple instance capability of the dense matrix functions is used, performance can be significantly aided by making the instances local.
Sun S3L includes functions for performing various signal processing computations, such as convolving or deconvolving two one- or two-dimensional arrays and computing the correlation or autocorrelation of a signal. These functions are based on the Sun S3L FFT functions and have been optimized to eliminate unnecessary computations.
For example, the convolution of the two-dimensional arrays (images) A and B can be achieved by the following steps, sketched in the example after this list:
Computing the two-dimensional FFT of A
Computing the two-dimensional FFT of B
Finding the pointwise product of the two transforms
Computing the inverse two-dimensional FFT of the previous result
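The following F77 sketch outlines this sequence using the FFT calls shown earlier in this section. The element-wise product in step 3 is left to the application (for example, a loop over the local subgrid elements); no specific S3L routine is assumed for it.

c     Sketch only: a and b are assumed to be previously allocated
c     two-dimensional complex S3L arrays of identical size, type,
c     and layout, so a single FFT setup can serve both.
      integer*4 setup_id, ier
      integer*8 a, b

      call s3l_fft_setup(a, setup_id, ier)

c     Steps 1 and 2: forward 2-D FFTs of A and B
      call s3l_fft(a, setup_id, ier)
      call s3l_fft(b, setup_id, ier)

c     Step 3: form the pointwise product A(i,j)*B(i,j), storing
c     the result in a (for example, by looping over the local
c     subgrid elements on each process).

c     Step 4: inverse 2-D FFT of the product; remember that the
c     result is scaled by the product of the array extents.
      call s3l_ifft(a, setup_id, ier)

      call s3l_fft_free_setup(setup_id, ier)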