Sun Performance Library User's Guide
Using Sun Performance Library
This chapter describes how to use the Sun Performance Library to improve the execution speed of applications written in FORTRAN 77, Fortran 95, or C. Although some modifications to applications might be required to gain peak performance, many applications can benefit significantly from using Sun Performance Library without making source code changes or recompiling.
Improving Application Performance
Use Sun Performance Library in the following ways to improve the speed of user code without making any code changes:
- Use Sun Performance Library routines instead of the base Netlib routines. See the next section, Replacing Routines With Sun Performance Library Routines.
- Use Sun Performance Library to speed up the other libraries, if an application already uses libraries in addition to those in the Sun Performance Library. See Improving Performance of Other Libraries.
- Use tools that automatically modify an application to use Sun Performance Library. See Using Tools to Restructure Code.
Replacing Routines With Sun Performance Library Routines
Many applications are built using one or more of the base Netlib libraries supported by the Sun Performance Library. Third-party vendors can also use BLAS and LAPACK as building blocks in their applications. Because Sun Performance Library maintains the same interfaces and functionality of these libraries, base Netlib routines can be replaced with Sun Performance Library routines.
Sun Performance Library can be included in a user's development environment to improve application performance on single processor and multiprocessor (MP) platforms. Sun Performance Library routines can be faster than the corresponding Netlib routines or routines provided by other vendors that perform similar functions. The serial speed of many Sun Performance Library routines has been increased, and many routines have been parallelized that might be serial in other products.
Improving Performance of Other Libraries
Users of other mathematical libraries can replace the BLAS in their library with the BLAS in Sun Performance Library, while leaving other routines unchanged. This is helpful when an application has a dependency on proprietary interfaces in another library that prevent the other library from being completely replaced. Many commercial math libraries are built around a core of generic BLAS and LAPACK routines, so replacing those generic routines with the highly optimized BLAS and LAPACK routines in Sun Performance Library can give speed improvements on both serial and MP platforms. Because replacing the core routines does not require any code changes, the proprietary library features can still be used.
Even libraries that already have fast core routines may get additional speedups by using Sun Performance Library. For example, if another vendor's core routines are based on BLAS, these routines can be replaced with Sun Performance Library routines, which have SPARC specific optimizations. Many Sun Performance Library routines have also been parallelized.
Using Tools to Restructure Code
In some cases, other libraries may not directly use the routines in the Sun Performance Library; however, there might be conversion aids available. For example, EISPACK users can refer to a conversion chart in the LAPACK Users' Manual that shows how to convert EISPACK calls to LAPACK calls.
Several vendors market automatic code restructuring tools that replace existing code with Sun Performance Library code. For example, a source- to- source conversion tool can replace existing BLAS code structures with calls to the BLAS in Sun Performance Library. These tools can also recognize many user written matrix multiplications and replace them with calls to the matrix multiplication subroutine in Sun Performance Library.
Fortran f77/f95 Interfaces
The Sun Performance Library routines can be called from within a FORTRAN 77, Fortran 95, or C program. However, C programs must still use the FORTRAN 77 calling sequence.
Sun Performance Library f77/f95 interfaces use the following conventions:
- All arguments are passed by reference.
- The number of arguments to a routine is fixed.
- Types of arguments must match.
- Arrays are stored columnwise.
- Indices are based at one, in keeping with standard Fortran practice.
When calling Sun Performance Library routines:
- Do not prototype the subroutines with the Fortran 95 INTERFACE statement. Use the USE SUNPERF statement instead.
- Do not use -ext_names=plain to compile routines that call routines from Sun Performance Library.
Using Fortran 95 Features
This release supports Fortran 95 language features. To use the Sun Performance Library Fortran 95 modules and definitions, include the USE SUNPERF statement in the program. The USE SUNPERF statement enables the following features:
- Type Independence - In the FORTRAN 77 routines, the type must be specified as part of the name. DGEMM is a double precision matrix multiply and SGEMM is single precision. With the Fortran 95 interfaces, when calling GEMM, Fortran will infer the type from the arguments that are passed. Passing single-precision arguments to GEMM gets results that are equivalent to specifying SGEMM, passing double-precision arguments gets results that are equivalent to DGEMM, and so on. For example, CALL DSCAL(20,5.26D0,X,1) could be changed to CALL SCAL(20, 5.26D0, X, 1).
- Compile-Time Checking - In FORTRAN 77, it is generally impossible for the compiler to determine what arguments should be passed to a particular routine. In Fortran 95, the USE SUNPERF statement allows the compiler to determine the number, type, size, and shape of each argument to each Sun Performance Library routine. It can check the calls against the expected values and display errors during compilation.
- Optional f95 Interfaces - In FORTRAN 77, all arguments must be specified in the order determined by the interface for all routines. All interfaces support f95 style OPTIONAL attributes on arguments that are not required. To determine the optional arguments for a routine, refer to the man pages. Optional arguments are enclosed in square brackets [ ].
For example, the SAXPY routine is defined as follows in the man page:
    SUBROUTINE SAXPY([N], ALPHA, X, [INCX], Y, [INCY])
    REAL ALPHA
    INTEGER INCX, INCY, N
    REAL X(*), Y(*)
Note that the arguments N, INCX, and INCY are optional.
Suppose the user tries to call the SAXPY routine with the following arguments:
    USE SUNPERF
    COMPLEX ALPHA
    REAL X(100), Y(100), XA(100,100), RALPHA
    INTEGER INCX, INCY
If mismatches in the type, shape, or number of arguments occur, the compiler would issue the following error message:
    ERROR: No specific match can be found for the generic subprogram call "AXPY".
Using the arguments defined above, the following examples show incorrect calls to the SAXPY routine due to type, shape, or number mismatches.
- Incorrect type of the arguments - If SAXPY is called as follows:
      CALL AXPY(100, ALPHA, X, INCX, Y, INCY)
  A compiler error occurs because the variable ALPHA is type COMPLEX, but the interface describes it as being type REAL.
- Incorrect shape of the arguments - If SAXPY is called as follows:
      CALL AXPY(N, RALPHA, XA, INCX, Y, INCY)
  A compiler error occurs because the XA argument is two dimensional, but the interface is expecting a one-dimensional argument.
- Incorrect number of arguments - If SAXPY is called as follows:
      CALL AXPY(RALPHA, X, INCX, Y)
  A compiler error occurs because the compiler cannot find a routine in the AXPY interface group that takes four parameters of the form
      AXPY(REAL, REAL 1-D ARRAY, INTEGER, REAL 1-D ARRAY)
  In the last example, the f95 keyword parameter passing capability allows a user to make essentially the same call:
      CALL AXPY(ALPHA=RALPHA,X=X,INCX=INCX,Y=Y)
  This is a valid call to the AXPY interface. It is necessary to use keyword parameter passing on any parameter that appears in the list after the first OPTIONAL parameter is omitted.
The following calls to the AXPY interface are valid.
    CALL AXPY(N,RALPHA,X,Y=Y,INCY=INCY)
    CALL AXPY(N,RALPHA,X,INCX,Y)
    CALL AXPY(N,RALPHA,X,Y=Y)
    CALL AXPY(ALPHA=RALPHA,X=X,Y=Y)
Fortran Examples
Getting peak performance from Sun Performance Library for single processor applications is a matter of identifying code constructs in an application that can be replaced by calls to subroutines in Sun Performance Library. Multiprocessor applications can get additional speed by identifying opportunities for parallelization.
The easiest situation occurs when a block of user code exactly duplicates a capability of Sun Performance Library. Consider the code below:
    DO I = 1, N
       DO J = 1, N
          Y(I) = Y(I) + A(I,J) * X(J)
       END DO
    END DO
This is the matrix-vector product y ← Ax + y, which can be performed with the DGEMV subroutine.
As another example, consider the following code fragment:
    DO I = 1, N
       IF (V2(I,K) .LT. 0.0) THEN
          V2(I,K) = 0.0
       ELSE
          DO J = 1, M
             X(J,I) = X(J,I) + V1(J,K) * V2(I,K)
          END DO
       END IF
    END DO
In other cases, a block of code can be equivalent to several Sun Performance Library calls or contain a mixture of code that can be replaced together with code that has no natural replacement in Sun Performance Library. One way to rewrite the code with Sun Performance Library is shown below:
    DO I = 1, N
       IF (V2(I,K) .LT. 0.0) THEN
          V2(I,K) = 0.0
       END IF
    END DO
    CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
An f95 specific example is also shown.
    WHERE (V2(1:N,K) .LT. 0.0)
       V2(1:N,K) = 0.0
    END WHERE
    CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
The code to replace negative numbers with zero in V2 has no natural analog in Sun Performance Library, so that code is pulled out of the outer loop. With that code removed to its own loop, the rest of the loop can be recognized as being a rank-1 update of the general matrix X, which can be accomplished using the DGER routine from BLAS.
Note that if there are many negative or zero values in V2, the majority of the time might not be spent in the rank-1 update, so replacing that code with the call to DGER might not bring a large payoff. It might be worthwhile to evaluate the reference to K. If it is a loop index, the loops shown here may be part of a larger code structure, and loops over DGEMV or DGER can often be converted to some form of matrix multiplication. If so, a single call to a matrix multiplication routine will probably bring a much larger payoff than a loop over calls to DGER.
All Sun Performance Library routines are MT-safe (multithread safe). Because the routines are MT-safe, additional performance is possible on MP platforms by using the auto-parallelizing compiler to parallelize loops that contain calls to Sun Performance Library.
The following example shows an effective combination of a Sun Performance Library routine with an auto-parallelizing compiler parallelization directive.
    C$PAR DOALL
          DO I = 1, N
             CALL DGBMV ('No transpose', N, N, KL, KU, ALPHA, A, LDA,
         $               B(1,I), 1, BETA, C(1,I), 1)
          END DO
Sun Performance Library contains a routine named DGBMV to multiply a banded matrix by a vector. By putting this routine into a properly constructed loop, it is possible to use the routines in Sun Performance Library to multiply a banded matrix by a matrix. The compiler will not parallelize this loop by default because the presence of subroutine calls in a loop inhibits parallelization. However, because Sun Performance Library routines are MT-safe, a user may use parallelization directives as shown above to instruct the compiler to parallelize this loop.
Note that a user can also use compiler directives to parallelize a loop with a subroutine call that ordinarily would not be parallelizable. For example, it is ordinarily not possible to parallelize a loop containing a call to some of the linear system solvers, because some vendors have implemented those routines using code that is not MT-safe. Loops containing calls to the expert drivers of the linear system solvers (routines whose names end in SVX) are usually not parallelizable with other implementations of LAPACK. The implementation of LAPACK in Sun Performance Library allows parallelization of loops containing such calls. Because the versions in Sun Performance Library are MT-safe, users of MP platforms can get additional performance by parallelizing these loops.
C Interfaces
Sun Performance Library contains native C interfaces for each of the routines contained in LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK. The Sun Performance Library C interfaces have the following features:
- Function names have C names.
- Function interfaces follow C conventions.
- C functions do not contain arguments that would be redundant or unnecessary in C.
The following example compares the standard LAPACK Fortran interface and the Sun Performance Library C interfaces for the DGBCON routine.
    CALL DGBCON (NORM, N, NSUB, NSUPER, DA, LDA, IPIVOT, DANORM,
   $             DRCOND, DWORK, IWORK2, INFO)
    void dgbcon(char norm, int n, int nsub, int nsuper, double *da,
                int lda, int *ipivot, double danorm, double *drcond,
                int *info)
Note that the names of the arguments are the same and that arguments with the same name have the same base type. Scalar arguments that are used only as input values, such as NORM and N, are passed by value in the C version. Arrays and scalars that will be used to return values are passed by reference.
The Sun Performance Library C interfaces improve on CLAPACK, available on Netlib, which is an f2c translation of the standard libraries. For example, all of the CLAPACK routines are followed by a trailing underscore to maintain compatibility with Fortran compilers, which often postfix routine names in the object (.o) file with an underscore. The Sun Performance Library C interfaces do not require a trailing underscore.
Sun Performance Library C interfaces use the following conventions:
- Input-only scalars are passed by value rather than by reference, which gives added safety and allows constants to be passed without creating a separate variable to hold their value. Complex and double complex arguments are not considered scalars because they are not implemented as a scalar type by C.
- Complex scalars can be passed as either structures or arrays of length 2.
- Arguments relating to workspace are not used in Sun Performance Library.
- Types of arguments must match even after C does type conversion. For example, be careful when passing a single precision real value because a C compiler can automatically promote the argument to double precision.
- Arrays are stored columnwise.
- Array indices are based at zero in conformance with C conventions rather than being based at one to conform to Fortran conventions.
  For example, the Fortran interface to IDAMAX, which C programs access as idamax_, would return a 1 to indicate the first element in a vector. The C interface to idamax, which C programs access as idamax, would return a 0 to indicate the first element of a vector. This convention is observed in function return values, permutation vectors, and anywhere else that vector or array indices are used.
Note: Some of the routines in Sun Performance Library use malloc internally, so user codes that make calls to Sun Performance Library and to sbrk may not work correctly.
Sun Performance Library uses global integer registers %g2, %g3, and %g4 in 32-bit mode and %g2 through %g5 in 64-bit mode as scratch registers. User code should not use these registers for temporary storage and then call a Sun Performance Library routine, because the data will be overwritten when the Sun Performance Library routine uses these registers.
C Examples
The key to using Sun Performance Library to get peak performance from applications is to recognize opportunities to transform user-written code sequences into calls to Sun Performance Library functions. The following code sequence adapted from LAPACK shows one example:
    int i;
    float a[n], b[n], largest;

    largest = a[0];
    for (i = 0; i < n; i++) {
        if (a[i] > largest)
            largest = a[i];
        if (b[i] > largest)
            largest = b[i];
    }
There is no subroutine in Sun Performance Library that exactly replicates the functionality of the code above. However, the code can be accelerated by replacing it with several calls to Sun Performance Library, as shown below. Because isamax returns the index of the element with the largest absolute value, the replacement gives the same result as the loop above when the elements of a and b are non-negative.
    int i, large_index;
    float a[n], b[n], largest;

    large_index = isamax (n, a, 1);
    largest = a[large_index];
    large_index = isamax (n, b, 1);
    if (b[large_index] > largest)
        largest = b[large_index];
Note the differences between the call to the native C isamax in Sun Performance Library above and the call shown below to a comparable function in CLAPACK:
As an example of a program that uses Sun Performance Library routines from user-managed threads, consider a real-time signal processing application running on a 4-processor server with one processor dedicated to acquiring the data, two processors dedicated to performing FFTs on the data, and one processor dedicated to postprocessing the data after the FFTs. It begins by creating multiple running instances of the function that performs the FFT:
    for (i = 0; i < NCPUS_FOR_FFT; i++) {
        who[i] = i;
        do_fft[i] = 0;
        fft_done_buff_available[i] = 1;
        (void)thr_create ((void *)0, (size_t)0, fft_func,
                          (void *)&who[i], (long)0, (thread_t *)0);
    }
The code below is a simplified implementation of part of fft_func started by thr_create in the loop above. Note that production code should check the return value from thr_create above and should use semaphores rather than busy waits at the synchronization points in the code below.
Sun Microsystems, Inc. Copyright information. All rights reserved.