Sun Performance Library User's Guide
Using Sun Performance Library
This chapter describes how to use the Sun Performance Library to improve the execution speed of applications written in FORTRAN 77, Fortran 95, or C. Although some modifications to applications might be required to gain peak performance, many applications can benefit significantly from using Sun Performance Library without making source code changes or recompiling.
Improving Application Performance
Use Sun Performance Library in the following ways to improve the speed of user code without making any code changes:
- Use Sun Performance Library routines instead of the base Netlib routines. See the next section, Replacing Routines With Sun Performance Library Routines.
- Use Sun Performance Library to speed up the other libraries, if an application already uses libraries in addition to those in the Sun Performance Library. See Improving Performance of Other Libraries.
- Use tools that automatically modify an application to use Sun Performance Library. See Using Tools to Restructure Code.
Replacing Routines With Sun Performance Library Routines
Many applications are built using one or more of the base Netlib libraries supported by the Sun Performance Library. Third-party vendors can also use BLAS and LAPACK as building blocks in their applications. Because Sun Performance Library maintains the same interfaces and functionality as these libraries, base Netlib routines can be replaced with Sun Performance Library routines.
Sun Performance Library can be included in a user's development environment to improve application performance on single processor and multiprocessor (MP) platforms. Sun Performance Library routines can be faster than the corresponding Netlib routines or routines provided by other vendors that perform similar functions. The serial speed of many Sun Performance Library routines has been increased, and many routines have been parallelized that might be serial in other products.
Improving Performance of Other Libraries
Users of other mathematical libraries can replace the BLAS in their library with the BLAS in Sun Performance Library, while leaving other routines unchanged. This is helpful when an application has a dependency on proprietary interfaces in another library that prevent the other library from being completely replaced. Many commercial math libraries are built around a core of generic BLAS and LAPACK routines, so replacing those generic routines with the highly optimized BLAS and LAPACK routines in Sun Performance Library can give speed improvements on both serial and MP platforms. Because replacing the core routines does not require any code changes, the proprietary library features can still be used.
Even libraries that already have fast core routines may get additional speedups by using Sun Performance Library. For example, if another vendor's core routines are based on BLAS, these routines can be replaced with Sun Performance Library routines, which have SPARC-specific optimizations. Many Sun Performance Library routines have also been parallelized.
Using Tools to Restructure Code
In some cases, other libraries may not directly use the routines in the Sun Performance Library; however, there might be conversion aids available. For example, EISPACK users can refer to a conversion chart in the LAPACK Users' Manual that shows how to convert EISPACK calls to LAPACK calls.
Several vendors market automatic code restructuring tools that replace existing code with Sun Performance Library code. For example, a source-to-source conversion tool can replace existing BLAS code structures with calls to the BLAS in Sun Performance Library. These tools can also recognize many user-written matrix multiplications and replace them with calls to the matrix multiplication subroutine in Sun Performance Library.
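As an illustration of the kind of user-written matrix multiplication such a tool might recognize, the following is a minimal C sketch of a hand-coded loop nest. The function name naive_matmul and its argument list are hypothetical and are not part of Sun Performance Library; the sketch assumes column-major storage with explicit leading dimensions.

```c
#include <assert.h>

/* Hypothetical user-written code: c := c + a * b for column-major
 * matrices, where a is m-by-k, b is k-by-n, and c is m-by-n.
 * lda, ldb, and ldc are the leading dimensions of the three arrays. */
void naive_matmul(int m, int n, int k,
                  const double *a, int lda,
                  const double *b, int ldb,
                  double *c, int ldc)
{
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; j++)           /* column of c and b */
        for (int p = 0; p < k; p++)       /* inner dimension   */
            for (int i = 0; i < m; i++)   /* row: unit stride  */
                c[i + j * ldc] += a[i + p * lda] * b[p + j * ldb];
}
```

A loop nest of this shape computes the same result as a general matrix multiply, which is why it is a candidate for replacement by a single library call.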
Fortran f77/f95 Interfaces
The Sun Performance Library routines can be called from within a FORTRAN 77, Fortran 95, or C program. However, C programs must still use the FORTRAN 77 calling sequence.
Sun Performance Library f77/f95 interfaces use the following conventions:
- All arguments are passed by reference.
- The number of arguments to a routine is fixed.
- Types of arguments must match.
- Arrays are stored columnwise.
- Indices are based at one, in keeping with standard Fortran practice.
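For C programs that pass arrays through these interfaces, the columnwise and one-based conventions above determine where each element lives in memory. The following is a minimal C sketch of that mapping. The helper names colmajor_offset and ref_matvec are hypothetical and are not library routines; ref_matvec is only a plain-loop reference for y := A*x + y written against the same layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: maps the one-based Fortran element A(i,j) of a
 * column-major array with leading dimension lda to its zero-based
 * offset in a flat C buffer. */
size_t colmajor_offset(int i, int j, int lda)
{
    assert(i >= 1 && j >= 1 && i <= lda);
    return (size_t)(i - 1) + (size_t)(j - 1) * (size_t)lda;
}

/* Hypothetical reference loop: y := A*x + y for an m-by-n matrix A
 * stored columnwise. Indices i and j follow the Fortran convention. */
void ref_matvec(int m, int n, const double *a, int lda,
                const double *x, double *y)
{
    for (int j = 1; j <= n; j++)        /* one column at a time */
        for (int i = 1; i <= m; i++)    /* consecutive elements */
            y[i - 1] += a[colmajor_offset(i, j, lda)] * x[j - 1];
}
```

With lda equal to 3, the element A(1,2) sits at offset 3, immediately after the three elements of the first column.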
When calling Sun Performance Library routines:
- Do not prototype the subroutines with the Fortran 95 INTERFACE statement. Use the USE SUNPERF statement instead.
- Do not use -ext_names=plain to compile routines that call routines from Sun Performance Library.
Using Fortran 95 Features
This release supports Fortran 95 language features. To use the Sun Performance Library Fortran 95 modules and definitions, include the USE SUNPERF statement in the program. The USE SUNPERF statement enables the following features:
- Type Independence - In the FORTRAN 77 routines, the type must be specified as part of the name. DGEMM is a double precision matrix multiply and SGEMM is single precision. With the Fortran 95 interfaces, when calling GEMM, Fortran infers the type from the arguments that are passed. Passing single-precision arguments to GEMM gets results that are equivalent to specifying SGEMM, passing double-precision arguments gets results that are equivalent to DGEMM, and so on. For example, CALL DSCAL(20,5.26D0,X,1) could be changed to CALL SCAL(20, 5.26D0, X, 1).
- Compile-Time Checking - In FORTRAN 77, it is generally impossible for the compiler to determine what arguments should be passed to a particular routine. In Fortran 95, the USE SUNPERF statement allows the compiler to determine the number, type, size, and shape of each argument to each Sun Performance Library routine. The compiler can check each call against the expected arguments and display errors during compilation.
- Optional f95 Interfaces - In FORTRAN 77, all arguments must be specified in the order determined by the interface for all routines. All interfaces will support f95 style OPTIONAL attributes on arguments that are not required. To determine the optional arguments for a routine, refer to the man pages. Optional arguments are enclosed in square brackets [ ].
For example, the SAXPY routine is defined as follows in the man page:
SUBROUTINE SAXPY([N], ALPHA, X, [INCX], Y, [INCY])
REAL ALPHA
INTEGER INCX, INCY, N
REAL X(*), Y(*)
Note that the arguments N, INCX, and INCY are optional.
Suppose the user tries to call the SAXPY routine with the following arguments:
USE SUNPERF
COMPLEX ALPHA
REAL X(100), Y(100), XA(100,100), RALPHA
INTEGER INCX, INCY
If mismatches in the type, shape, or number of arguments occur, the compiler would issue the following error message:
ERROR: No specific match can be found for the generic subprogram call "AXPY".
Using the arguments defined above, the following examples show incorrect calls to the SAXPY routine due to type, shape, or number mismatches.
- Incorrect type of the arguments - If SAXPY is called as follows:
CALL AXPY(100, ALPHA, X, INCX, Y, INCY)
- A compiler error occurs because the variable ALPHA is type COMPLEX, but the interface describes it as being type REAL.
- Incorrect shape of the arguments - If SAXPY is called as follows:
CALL AXPY(N, RALPHA, XA, INCX, Y, INCY)
- A compiler error occurs because the XA argument is two dimensional, but the interface is expecting a one-dimensional argument.
- Incorrect number of arguments - If SAXPY is called as follows:
CALL AXPY(RALPHA, X, INCX, Y)
- A compiler error occurs because the compiler cannot find a routine in the AXPY interface group that takes four parameters of the form
AXPY(REAL, REAL 1-D ARRAY, INTEGER, REAL 1-D ARRAY)
- In the last example, the f95 keyword parameter passing capability allows a user to make essentially the same call:
CALL AXPY(ALPHA=RALPHA,X=X,INCX=INCX,Y=Y)
- This is a valid call to the AXPY interface. It is necessary to use keyword parameter passing on any parameter that appears in the list after the first OPTIONAL parameter is omitted.
The following calls to the AXPY interface are valid.
CALL AXPY(N,RALPHA,X,Y=Y,INCY=INCY)
CALL AXPY(N,RALPHA,X,INCX,Y)
CALL AXPY(N,RALPHA,X,Y=Y)
CALL AXPY(ALPHA=RALPHA,X=X,Y=Y)
Fortran Examples
Getting peak performance from Sun Performance Library for single processor applications is a matter of identifying code constructs in an application that can be replaced by calls to subroutines in Sun Performance Library. Multiprocessor applications can get additional speed by identifying opportunities for parallelization.
The easiest situation occurs when a block of user code exactly duplicates a capability of Sun Performance Library. Consider the code below:
DO I = 1, N
  DO J = 1, N
    Y(I) = Y(I) + A(I,J) * X(J)
  END DO
END DO
This is the matrix-vector product y ← Ax + y, which can be performed with the DGEMV subroutine.
As another example, consider the following code fragment:
DO I = 1, N
  IF (V2(I,K) .LT. 0.0) THEN
    V2(I,K) = 0.0
  ELSE
    DO J = 1, M
      X(J,I) = X(J,I) + V1(J,K) * V2(I,K)
    END DO
  END IF
END DO
In other cases, a block of code can be equivalent to several Sun Performance Library calls or contain a mixture of code that can be replaced together with code that has no natural replacement in Sun Performance Library. One way to rewrite the code with Sun Performance Library is shown below:
DO I = 1, N
  IF (V2(I,K) .LT. 0.0) THEN
    V2(I,K) = 0.0
  END IF
END DO
CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
An f95 specific example is also shown.
WHERE (V2(1:N,K) .LT. 0.0)
  V2(1:N,K) = 0.0
END WHERE
CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
The code to replace negative numbers with zero in V2 has no natural analog in Sun Performance Library, so that code is pulled out of the outer loop. With that code removed to its own loop, the rest of the loop can be recognized as a rank-1 update of the general matrix X, which can be accomplished using the DGER routine from BLAS.
Note that if there are many negative or zero values in V2, the majority of the time might not be spent in the rank-1 update, so replacing that code with the call to DGER might not bring a large payoff. It might be worthwhile to evaluate the reference to K. If it is a loop index, the loops shown here might be part of a larger code structure, and loops over DGEMV or DGER can often be converted to some form of matrix multiplication. If so, a single call to a matrix multiplication routine will probably bring a much larger payoff than a loop over calls to DGER.
All Sun Performance Library routines are MT-safe (multithread safe). Because the routines are MT-safe, additional performance is possible on MP platforms by using the auto-parallelizing compiler to parallelize loops that contain calls to Sun Performance Library.
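As a check on the restructuring discussed above, in which the replacement of negative values is separated from the rank-1 update, the following C sketch compares the fused loop with the split form. All three function names are hypothetical, and ref_rank1 is only a plain-loop stand-in for a rank-1 update, not the library's DGER. The sketch assumes finite input values and column-major storage, with v1 and v2 standing for column K of the corresponding arrays.

```c
#include <assert.h>

/* Hypothetical stand-in for a rank-1 update x := x + alpha * u * v^T
 * on an m-by-n column-major matrix with leading dimension ldx. */
void ref_rank1(int m, int n, double alpha,
               const double *u, const double *v, double *x, int ldx)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            x[j + i * ldx] += alpha * u[j] * v[i];
}

/* Original structure: clamp a negative v2[i], otherwise update column i. */
void fused_loop(int m, int n, const double *v1, double *v2,
                double *x, int ldx)
{
    for (int i = 0; i < n; i++) {
        if (v2[i] < 0.0) {
            v2[i] = 0.0;
        } else {
            for (int j = 0; j < m; j++)
                x[j + i * ldx] += v1[j] * v2[i];
        }
    }
}

/* Restructured form: clamp every entry first, then one rank-1 update.
 * Columns whose v2 entry was clamped receive a zero contribution. */
void split_loop(int m, int n, const double *v1, double *v2,
                double *x, int ldx)
{
    assert(m <= ldx);
    for (int i = 0; i < n; i++)
        if (v2[i] < 0.0)
            v2[i] = 0.0;
    ref_rank1(m, n, 1.0, v1, v2, x, ldx);
}
```

The two forms agree because a clamped entry multiplies its column update by zero, which is the same as skipping the update when the values involved are finite.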
The following example shows an effective combination of a Sun Performance Library routine with an auto-parallelizing compiler parallelization directive.
C$PAR DOALL
      DO I = 1, N
        CALL DGBMV ('No transpose', N, N, KL, KU, ALPHA, A, LDA,
     $              B(1,I), 1, BETA, C(1,I), 1)
      END DO
Sun Performance Library contains a routine named DGBMV to multiply a banded matrix by a vector. By putting this routine into a properly constructed loop, it is possible to use the routines in Sun Performance Library to multiply a banded matrix by a matrix. The compiler will not parallelize this loop by default because the presence of subroutine calls in a loop inhibits parallelization. However, because Sun Performance Library routines are MT-safe, a user may use parallelization directives as shown above to instruct the compiler to parallelize this loop.
Note that a user can also use compiler directives to parallelize a loop with a subroutine call that ordinarily would not be parallelizable. For example, it is ordinarily not possible to parallelize a loop containing a call to some of the linear system solvers, because some vendors have implemented those routines using code that is not MT-safe. Loops containing calls to the expert drivers of the linear system solvers (routines whose names end in SVX) are usually not parallelizable with other implementations of LAPACK. The implementation of LAPACK in Sun Performance Library allows parallelization of loops containing such calls. Because the versions in Sun Performance Library are MT-safe, users of MP platforms can get additional performance by parallelizing these loops.
C Interfaces
Sun Performance Library contains native C interfaces for each of the routines contained in LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK. The Sun Performance Library C interfaces have the following features:
- Function names have C names
- Function interfaces follow C conventions
- Functions do not take arguments that are redundant or unnecessary in C
The following example compares the standard LAPACK Fortran interface and the Sun Performance Library C interfaces for the DGBCON routine.
CALL DGBCON (NORM, N, NSUB, NSUPER, DA, LDA, IPIVOT, DANORM, DRCOND, DWORK, IWORK2, INFO)
void dgbcon(char norm, int n, int nsub, int nsuper, double *da, int lda,
            int *ipivot, double danorm, double *drcond, int *info)
Note that the names of the arguments are the same and that arguments with the same name have the same base type. Scalar arguments that are used only as input values, such as NORM and N, are passed by value in the C version. Arrays and scalars that will be used to return values are passed by reference.
The Sun Performance Library C interfaces improve on CLAPACK, available on Netlib, which is an f2c translation of the standard libraries. For example, all CLAPACK routine names end with a trailing underscore to maintain compatibility with Fortran compilers, which often postfix routine names in the object (.o) file with an underscore. The Sun Performance Library C interfaces do not require a trailing underscore.
Sun Performance Library C interfaces use the following conventions:
- Input-only scalars are passed by value rather than by reference, which gives added safety and allows constants to be passed without creating a separate variable to hold their value. Complex and double complex arguments are not considered scalars because they are not implemented as a scalar type by C.
- Complex scalars can be passed as either structures or arrays of length 2.
- Arguments relating to workspace are not used in Sun Performance Library.
- Types of arguments must match even after C does type conversion. For example, be careful when passing a single precision real value because a C compiler can automatically promote the argument to double precision.
- Arrays are stored columnwise.
- Array indices are based at zero in conformance with C conventions rather than being based at one to conform to Fortran conventions.
- For example, the Fortran interface to IDAMAX, which C programs access as idamax_, would return a 1 to indicate the first element in a vector. The C interface to idamax, which C programs access as idamax, would return a 0 to indicate the first element of a vector. This convention is observed in function return values, permutation vectors, and anywhere else that vector or array indices are used.
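The index convention above can be illustrated with a minimal C sketch. The routines ref_iamax0 and ref_iamax1 are hypothetical reference loops written for this illustration and are not the library's implementation; they return the position of the first element with the largest absolute value under the zero-based and one-based conventions respectively.

```c
#include <assert.h>

/* Hypothetical reference routine: zero-based index of the first element
 * of x with the largest absolute value, following the C convention.
 * incx is the stride between consecutive vector elements. */
int ref_iamax0(int n, const double *x, int incx)
{
    assert(n > 0 && incx > 0);
    int best = 0;
    double best_abs = x[0] < 0.0 ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        double v = x[i * incx];
        double v_abs = v < 0.0 ? -v : v;
        if (v_abs > best_abs) {     /* strict: keep the first maximum */
            best = i;
            best_abs = v_abs;
        }
    }
    return best;
}

/* The same position reported under the Fortran one-based convention. */
int ref_iamax1(int n, const double *x, int incx)
{
    return ref_iamax0(n, x, incx) + 1;
}
```

A zero-based result can be used directly as a C subscript, as in x[ref_iamax0(n, x, 1)], whereas a one-based result must be decremented first.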
Note: Some of the routines in Sun Performance Library use malloc internally, so user code that makes calls to both Sun Performance Library and sbrk may not work correctly.
Sun Performance Library uses global integer registers %g2, %g3, and %g4 in 32-bit mode and %g2 through %g5 in 64-bit mode as scratch registers. User code should not use these registers for temporary storage and then call a Sun Performance Library routine. The data will be overwritten when the Sun Performance Library routine uses these registers.
C Examples
The key to using Sun Performance Library to get peak performance from applications is to recognize opportunities to transform user-written code sequences into calls to Sun Performance Library functions. The following code sequence adapted from LAPACK shows one example:
int i;
float a[n], b[n], largest;
largest = a[0];
for (i = 0; i < n; i++)
{
    if (a[i] > largest)
        largest = a[i];
    if (b[i] > largest)
        largest = b[i];
}
There is no subroutine in Sun Performance Library that exactly replicates the functionality of the code above. However, the code can be accelerated by replacing it with several calls to Sun Performance Library as shown below:
int i, large_index;
float a[n], b[n], largest;
large_index = isamax (n, a, 1);
largest = a[large_index];
large_index = isamax (n, b, 1);
if (b[large_index] > largest)
    largest = b[large_index];
Note the differences between the call to the native C isamax in Sun Performance Library above and the call shown below to a comparable function in CLAPACK:
As an example of a program that uses Sun Performance Library routines from user-managed threads, consider a real-time signal processing application running on a 4-processor server with one processor dedicated to acquiring the data, two processors dedicated to performing FFTs on the data, and one processor dedicated to postprocessing the data after the FFTs. It begins by creating multiple running instances of the function that performs the FFT:
for (i = 0; i < NCPUS_FOR_FFT; i++) {
    who[i] = i;
    do_fft[i] = 0;
    fft_done_buff_available[i] = 1;
    (void)thr_create ((void *)0, (size_t)0, fft_func,
                      (void *)&who[i], (long)0, (thread_t *)0);
}
The code below is a simplified implementation of part of fft_func started by thr_create in the loop above. Note that production code should check the return value from thr_create above and should use semaphores rather than busy waits at the synchronization points in the code below.
Sun Microsystems, Inc. All rights reserved.