Using Sun Performance Library

Sun Performance Library User's Guide

Chapter 2

Using Sun Performance Library

This chapter describes how to use Sun Performance Library to improve the execution speed of applications written in FORTRAN 77, Fortran 95, or C. Although some modifications to applications might be required to gain peak performance, many applications can benefit significantly from using Sun Performance Library without making source code changes or recompiling.

Improving Application Performance

Use Sun Performance Library in the following ways to improve the speed of user code without making any code changes:

Use Sun Performance Library routines instead of the base Netlib routines. See Replacing Routines With Sun Performance Library Routines.
Use Sun Performance Library to speed up the other libraries, if an application already uses libraries in addition to those in the Sun Performance Library. See Improving Performance of Other Libraries.
Use tools that automatically modify an application to use Sun Performance Library. See Using Tools to Restructure Code.

Replacing Routines With Sun Performance Library Routines

Many applications are built using one or more of the base Netlib libraries supported by the Sun Performance Library. Third-party vendors can also use BLAS and LAPACK as building blocks in their applications. Because Sun Performance Library maintains the same interfaces and functionality as these libraries, base Netlib routines can be replaced with Sun Performance Library routines.

Sun Performance Library can be included in a user's development environment to improve application performance on single processor and multiprocessor (MP) platforms. Sun Performance Library routines can be faster than the corresponding Netlib routines or routines provided by other vendors that perform similar functions. The serial speed of many Sun Performance Library routines has been increased, and many routines have been parallelized that might be serial in other products.

Improving Performance of Other Libraries

Users of other mathematical libraries can replace the BLAS in their library with the BLAS in Sun Performance Library, while leaving other routines unchanged. This is helpful when an application has a dependency on proprietary interfaces in another library that prevent the other library from being completely replaced. Many commercial math libraries are built around a core of generic BLAS and LAPACK routines, so replacing those generic routines with the highly optimized BLAS and LAPACK routines in Sun Performance Library can give speed improvements on both serial and MP platforms. Because replacing the core routines does not require any code changes, the proprietary library features can still be used.

Even libraries that already have fast core routines may get additional speedups by using Sun Performance Library. For example, if another vendor's core routines are based on BLAS, these routines can be replaced with Sun Performance Library routines, which have SPARC specific optimizations. Many Sun Performance Library routines have also been parallelized.

Using Tools to Restructure Code

In some cases, other libraries may not directly use the routines in the Sun Performance Library; however, there might be conversion aids available. For example, EISPACK users can refer to a conversion chart in the LAPACK Users' Manual that shows how to convert EISPACK calls to LAPACK calls.

Several vendors market automatic code restructuring tools that replace existing code with Sun Performance Library code. For example, a source-to-source conversion tool can replace existing BLAS code structures with calls to the BLAS in Sun Performance Library. These tools can also recognize many user-written matrix multiplications and replace them with calls to the matrix multiplication subroutine in Sun Performance Library.

Fortran f77/f95 Interfaces

The Sun Performance Library routines can be called from within a FORTRAN 77, Fortran 95, or a C program. However, C programs must still use the FORTRAN 77 calling sequence.

Sun Performance Library f77/f95 interfaces use the following conventions:

All arguments are passed by reference.
The number of arguments to a routine is fixed.
Types of arguments must match.
Arrays are stored columnwise.
Indices are based at one, in keeping with standard Fortran practice.
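For a C program that uses the FORTRAN 77 calling sequence, the columnwise and one-based conventions decide where each matrix element must sit in memory. The following is a minimal sketch in plain C, not library code. The helper names colmajor_at and fill_example, and the parameter lda (the leading dimension, that is, the number of rows allocated per column), are assumptions made for this illustration.

```c
#include <stddef.h>

/* Hypothetical helper: address of element (i, j) of a column-major
 * matrix, using one-based indices as a Fortran routine would.
 * lda is the leading dimension (rows allocated per column). */
double *colmajor_at(double *a, int lda, int i, int j)
{
    return &a[(size_t)(j - 1) * (size_t)lda + (size_t)(i - 1)];
}

/* Fill an m-by-n column-major matrix so that A(i,j) = 10*i + j. */
void fill_example(double *a, int lda, int m, int n)
{
    for (int j = 1; j <= n; j++)      /* columns outermost: unit stride */
        for (int i = 1; i <= m; i++)
            *colmajor_at(a, lda, i, j) = 10.0 * i + j;
}
```

With m = 2 and n = 3, the six values land in memory in the order A(1,1), A(2,1), A(1,2), A(2,2), A(1,3), A(2,3), which is the layout a routine that stores arrays columnwise expects.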

When calling Sun Performance Library routines:

Do not prototype the subroutines with the Fortran 95 INTERFACE statement. Use the USE SUNPERF statement instead.
Do not use -ext_names=plain to compile routines that call routines from Sun Performance Library.

Using Fortran 95 Features

This release supports Fortran 95 language features. To use the Sun Performance Library Fortran 95 modules and definitions, include the USE SUNPERF statement in the program. The USE SUNPERF statement enables the following features:

Type Independence - In the FORTRAN 77 routines, the type must be specified as part of the name. DGEMM is a double precision matrix multiply and SGEMM is single precision. With the Fortran 95 interfaces, when calling GEMM, Fortran will infer the type from the arguments that are passed. Passing single-precision arguments to GEMM gets results that are equivalent to specifying SGEMM, passing double-precision arguments gets results that are equivalent to DGEMM, and so on. For example, CALL DSCAL(20,5.26D0,X,1) could be changed to CALL SCAL(20, 5.26D0, X, 1).
Compile-Time Checking - In FORTRAN 77, it is generally impossible for the compiler to determine what arguments should be passed to a particular routine. In Fortran 95, the USE SUNPERF statement allows the compiler to determine the number, type, size, and shape of each argument to each Sun Performance Library routine. It can check the calls against the expected value and display errors during compilation.
Optional f95 Interfaces - In FORTRAN 77, all arguments must be specified in the order determined by the interface for all routines. All interfaces will support f95 style OPTIONAL attributes on arguments that are not required. To determine the optional arguments for a routine, refer to the man pages. Optional arguments are enclosed in square brackets [ ].

For example, the SAXPY routine is defined as follows in the man page:
      SUBROUTINE SAXPY([N], ALPHA, X, [INCX], Y, [INCY])
      REAL ALPHA
      INTEGER INCX, INCY, N
      REAL X(*), Y(*)
Note that the arguments N, INCX, and INCY are optional.

Suppose the user tries to call the SAXPY routine with the following arguments:
      USE SUNPERF
      COMPLEX ALPHA
      REAL X(100), Y(100), XA(100,100), RALPHA
      INTEGER INCX, INCY, N
If mismatches in the type, shape, or number of arguments occur, the compiler issues the following error message:
      ERROR: No specific match can be found for the generic subprogram call "AXPY".
Using the arguments defined above, the following examples show incorrect calls to the SAXPY routine due to type, shape, or number mismatches.
Incorrect type of the arguments - If SAXPY is called as follows:
      CALL AXPY(100, ALPHA, X, INCX, Y, INCY)
A compiler error occurs because the variable ALPHA is type COMPLEX, but the interface describes it as being type REAL.
Incorrect shape of the arguments - If SAXPY is called as follows:
      CALL AXPY(N, RALPHA, XA, INCX, Y, INCY)
A compiler error occurs because the XA argument is two-dimensional, but the interface is expecting a one-dimensional argument.
Incorrect number of arguments - If SAXPY is called as follows:
      CALL AXPY(RALPHA, X, INCX, Y)
A compiler error occurs because the compiler cannot find a routine in the AXPY interface group that takes four parameters of the form
      AXPY(REAL, REAL 1-D ARRAY, INTEGER, REAL 1-D ARRAY)
In the last example, f95 keyword parameter passing allows the user to make essentially the same call:
      CALL AXPY(ALPHA=RALPHA,X=X,INCX=INCX,Y=Y)
This is a valid call to the AXPY interface. Once the first OPTIONAL parameter has been omitted, every parameter that appears after it in the list must be passed by keyword.
The following calls to the AXPY interface are valid.
      CALL AXPY(N,RALPHA,X,Y=Y,INCY=INCY)
      CALL AXPY(N,RALPHA,X,INCX,Y)
      CALL AXPY(N,RALPHA,X,Y=Y)
      CALL AXPY(ALPHA=RALPHA,X=X,Y=Y)
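
C programmers may find it helpful to compare the generic AXPY interface with compile-time dispatch in C. The sketch below is only an analogy and is not how the library is implemented. It uses the C11 _Generic selection to pick a type-specific routine from the argument type. The names ref_saxpy, ref_daxpy, and ref_axpy are hypothetical stand-ins, and unit strides are assumed for brevity.

```c
#include <stddef.h>

/* Stand-ins for type-specific routines (hypothetical names):
 * y <- alpha*x + y with unit stride. */
void ref_saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

void ref_daxpy(int n, double alpha, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Generic front end: the type of y selects the specific routine at
 * compile time. A type with no matching entry is a compile error,
 * loosely comparable to "no specific match can be found". */
#define ref_axpy(n, alpha, x, y) \
    _Generic((y),                \
        float *:  ref_saxpy,     \
        double *: ref_daxpy)((n), (alpha), (x), (y))
```

Calling ref_axpy with float arrays resolves to ref_saxpy, and calling it with double arrays resolves to ref_daxpy, much as the generic AXPY call resolves to SAXPY or DAXPY.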
Fortran Examples

Getting peak performance from Sun Performance Library for single processor applications is a matter of identifying code constructs in an application that can be replaced by calls to subroutines in Sun Performance Library. Multiprocessor applications can get additional speed by identifying opportunities for parallelization.

The easiest situation occurs when a block of user code exactly duplicates a capability of Sun Performance Library. Consider the code below:
      DO I = 1, N
         DO J = 1, N
            Y(I) = Y(I) + A(I,J) * X(J)
         END DO
      END DO
This is the matrix-vector product y ← Ax + y, which can be performed with the DGEMV subroutine.
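
To make the operation concrete for C readers, here is a minimal plain C sketch of the same product on a column-major matrix. It is a stand-in for illustration, not the library's DGEMV. The name ref_matvec_accumulate and the parameter lda (leading dimension) are assumptions.

```c
#include <stddef.h>

/* Plain C stand-in for y <- A*x + y on an n-by-n column-major matrix
 * with leading dimension lda (hypothetical name, not dgemv itself). */
void ref_matvec_accumulate(int n, const double *a, int lda,
                           const double *x, double *y)
{
    for (int j = 0; j < n; j++) {        /* walk A one column at a time */
        const double *col = a + (size_t)j * (size_t)lda;
        for (int i = 0; i < n; i++)
            y[i] += col[i] * x[j];
    }
}
```

The loops are interchanged relative to the Fortran fragment so that the inner loop walks down a column with unit stride. Each y[i] still receives its terms in the same order, so the result is the same.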

In other cases, a block of code can be equivalent to several Sun Performance Library calls, or can mix code that can be replaced with code that has no natural replacement in Sun Performance Library. As an example, consider the following code fragment:
      DO I = 1, N
         IF (V2(I,K) .LT. 0.0) THEN
            V2(I,K) = 0.0
         ELSE
            DO J = 1, M
               X(J,I) = X(J,I) + V1(J,K) * V2(I,K)
            END DO
         END IF
      END DO
One way to rewrite the code with Sun Performance Library is shown below:
      DO I = 1, N
         IF (V2(I,K) .LT. 0.0) THEN
            V2(I,K) = 0.0
         END IF
      END DO
      CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
An f95-specific version is also shown.
      WHERE (V2(1:N,K) .LT. 0.0)
         V2(1:N,K) = 0.0
      END WHERE
      CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)
The code to replace negative numbers with zero in V2 has no natural analog in Sun Performance Library, so that code is pulled out of the outer loop. Elements of V2 that have been set to zero contribute nothing to the update, so the result is unchanged. With that code removed to its own loop, the rest of the loop can be recognized as a rank-1 update of the general matrix X, which can be accomplished using the DGER routine from BLAS.
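
The restructuring can be checked with a small plain C sketch that runs both forms and compares the results. The routines below are stand-ins written for this illustration, with hypothetical names (fused_update, ref_rank1_update, split_update), and do not call the library. The sketch assumes finite values in v1, so that multiplying by a clamped zero adds exactly nothing.

```c
#include <stddef.h>

/* Original structure: test each element of v2 inside the update loop.
 * x is m-by-n, column-major, with leading dimension ldx. */
void fused_update(int m, int n, double *x, int ldx,
                  const double *v1, double *v2)
{
    for (int i = 0; i < n; i++) {
        if (v2[i] < 0.0) {
            v2[i] = 0.0;
        } else {
            for (int j = 0; j < m; j++)
                x[(size_t)i * ldx + j] += v1[j] * v2[i];
        }
    }
}

/* Stand-in for a rank-1 update X <- X + v1 * v2^T. */
void ref_rank1_update(int m, int n, double *x, int ldx,
                      const double *v1, const double *v2)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            x[(size_t)i * ldx + j] += v1[j] * v2[i];
}

/* Restructured version: clamp first, then one rank-1 update. */
void split_update(int m, int n, double *x, int ldx,
                  const double *v1, double *v2)
{
    for (int i = 0; i < n; i++)
        if (v2[i] < 0.0)
            v2[i] = 0.0;
    ref_rank1_update(m, n, x, ldx, v1, v2);
}
```

Running fused_update and split_update on copies of the same data leaves identical contents in x and v2.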

Note that if there are many negative or zero values in V2, it may be that the majority of the time is not spent in the rank-1 update and so replacing that code with the call to DGER might not bring a large payoff. It might be worthwhile to evaluate the reference to K. If it is a loop index, it may be that the loops shown here are part of a larger code structure, and loops over DGEMV or DGER can often be converted to some form of matrix multiplication. If so, a single call to a matrix multiplication routine will probably bring a much larger payoff than a loop over calls to DGER.
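
The following plain C sketch illustrates why a loop of rank-1 updates over K can collapse into one matrix product. It assumes that K is a loop index running over kk columns and that any clamping of V2 has already been done. Both routines are stand-ins with hypothetical names, not library calls.

```c
#include <stddef.h>

/* kk successive rank-1 updates: X <- X + V1(:,k) * V2(:,k)^T.
 * X is m-by-n, V1 is m-by-kk, V2 is n-by-kk, all column-major. */
void loop_of_rank1(int m, int n, int kk, double *x, int ldx,
                   const double *v1, int ldv1,
                   const double *v2, int ldv2)
{
    for (int k = 0; k < kk; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                x[(size_t)i * ldx + j] +=
                    v1[(size_t)k * ldv1 + j] * v2[(size_t)k * ldv2 + i];
}

/* The same result as a single product X <- X + V1 * V2^T, which is
 * the kind of operation a matrix multiplication routine performs. */
void single_matmul(int m, int n, int kk, double *x, int ldx,
                   const double *v1, int ldv1,
                   const double *v2, int ldv2)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (int k = 0; k < kk; k++)
                sum += v1[(size_t)k * ldv1 + j] * v2[(size_t)k * ldv2 + i];
            x[(size_t)i * ldx + j] += sum;
        }
}
```

The two routines add the same terms to each element of X. In floating-point arithmetic the sums are formed in a different order, so results can differ by rounding even though they are mathematically equal.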

All Sun Performance Library routines are MT-safe (multithread safe). Because the routines are MT-safe, additional performance is possible on MP platforms by using the auto-parallelizing compiler to parallelize loops that contain calls to Sun Performance Library.

The following example shows an effective combination of a Sun Performance Library routine with an auto-parallelizing compiler parallelization directive.
C$PAR DOALL
      DO I = 1, N
         CALL DGBMV ('No transpose', N, N, KL, KU, ALPHA, A, LDA,
     $               B(1,I), 1, BETA, C(1,I), 1)
      END DO
Sun Performance Library contains a routine named DGBMV to multiply a banded matrix by a vector. By putting this routine into a properly constructed loop, it is possible to use the routines in Sun Performance Library to multiply a banded matrix by a matrix. The compiler will not parallelize this loop by default because the presence of subroutine calls in a loop inhibits parallelization. However, because Sun Performance Library routines are MT-safe, a user may use parallelization directives as shown above to instruct the compiler to parallelize this loop.
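
The structure of that loop can be sketched in plain C. The code below is a stand-in written for illustration, not the library's DGBMV. It assumes the usual BLAS band layout, in which element a(i,j) of a matrix with kl subdiagonals and ku superdiagonals is stored at ab[(ku + i - j) + j*ldab] for zero-based i and j. The names ref_band_matvec and band_times_matrix are hypothetical.

```c
#include <stddef.h>

/* Stand-in for a banded matrix-vector product y <- alpha*A*x + beta*y.
 * A is n-by-n with kl subdiagonals and ku superdiagonals, held in
 * band storage with leading dimension ldab >= kl + ku + 1. */
void ref_band_matvec(int n, int kl, int ku, double alpha,
                     const double *ab, int ldab,
                     const double *x, double beta, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        int lo = (j - ku > 0) ? j - ku : 0;          /* first row in band */
        int hi = (j + kl < n - 1) ? j + kl : n - 1;  /* last row in band  */
        for (int i = lo; i <= hi; i++)
            y[i] += alpha * ab[(size_t)j * ldab + (ku + i - j)] * x[j];
    }
}

/* Banded matrix times dense matrix: one matvec per column of B.
 * Iteration j reads column j of B and writes only column j of C, so
 * the iterations do not depend on one another. */
void band_times_matrix(int n, int ncols, int kl, int ku, double alpha,
                       const double *ab, int ldab,
                       const double *b, int ldb,
                       double beta, double *c, int ldc)
{
    for (int j = 0; j < ncols; j++)
        ref_band_matvec(n, kl, ku, alpha, ab, ldab,
                        b + (size_t)j * ldb, beta, c + (size_t)j * ldc);
}
```

Because each iteration touches a different column of C, the outer loop is the natural candidate for a parallelization directive, provided the routine it calls is MT-safe.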

Note that a user can also use compiler directives to parallelize a loop with a subroutine call that ordinarily would not be parallelizable. For example, it is ordinarily not possible to parallelize a loop containing a call to some of the linear system solvers, because some vendors have implemented those routines using code that is not MT-safe. Loops containing calls to the expert drivers of the linear system solvers (routines whose names end in SVX) are usually not parallelizable with other implementations of LAPACK. The implementation of LAPACK in Sun Performance Library allows parallelization of loops containing such calls. Because the versions in Sun Performance Library are MT-safe, users of MP platforms can get additional performance by parallelizing these loops.

C Interfaces

Sun Performance Library contains native C interfaces for each of the routines contained in LAPACK, BLAS, FFTPACK, VFFTPACK, and LINPACK. The Sun Performance Library C interfaces have the following features:

Function names have C names
Function interfaces follow C conventions
C functions do not contain arguments that are redundant or unnecessary in C

The following example compares the standard LAPACK Fortran interface and the Sun Performance Library C interfaces for the DGBCON routine.
 CALL DGBCON (NORM, N, NSUB, NSUPER, DA, LDA, IPIVOT, DANORM,
              DRCOND, DWORK, IWORK2, INFO)

 void dgbcon(char norm, int n, int nsub, int nsuper, double *da,
             int lda, int *ipivot, double danorm, double *drcond,
             int *info)
Note that the names of the arguments are the same and that arguments with the same name have the same base type. Scalar arguments that are used only as input values, such as NORM and N, are passed by value in the C version. Arrays and scalars that will be used to return values are passed by reference.

The Sun Performance Library C interfaces improve on CLAPACK, available on Netlib, which is an f2c translation of the standard libraries. For example, all of the CLAPACK routines are followed by a trailing underscore to maintain compatibility with Fortran compilers, which often postfix routine names in the object (.o) file with an underscore. The Sun Performance Library C interfaces do not require a trailing underscore.

Sun Performance Library C interfaces use the following conventions:

Input-only scalars are passed by value rather than by reference, which gives added safety and allows constants to be passed without creating a separate variable to hold their value. Complex and double complex arguments are not considered scalars because they are not implemented as a scalar type by C.
Complex scalars can be passed as either structures or arrays of length 2.
Arguments relating to workspace are not used in Sun Performance Library.
Types of arguments must match even after C does type conversion. For example, be careful when passing a single precision real value because a C compiler can automatically promote the argument to double precision.
Arrays are stored columnwise.
Array indices are based at zero in conformance with C conventions rather than being based at one to conform to Fortran conventions.

For example, the Fortran interface to IDAMAX, which C programs access as idamax_, would return a 1 to indicate the first element in a vector. The C interface to idamax, which C programs access as idamax, would return a 0 to indicate the first element of a vector. This convention is observed in function return values, permutation vectors, and anywhere else that vector or array indices are used.
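
The relationship between the two conventions can be sketched in plain C. The functions below are stand-ins written for this illustration, with hypothetical names (ref_idamax_f77, ref_idamax_c), and are not the library's own idamax_ or idamax. The sketch assumes n >= 1 and incx >= 1.

```c
/* Absolute value without pulling in the math library. */
static double abs_val(double v)
{
    return v < 0.0 ? -v : v;
}

/* Fortran-style convention: every argument is passed by reference and
 * the result is a one-based index of the first element with the
 * largest absolute value. */
int ref_idamax_f77(const int *n, const double *x, const int *incx)
{
    int best = 1;
    for (int i = 2; i <= *n; i++)
        if (abs_val(x[(i - 1) * *incx]) > abs_val(x[(best - 1) * *incx]))
            best = i;
    return best;
}

/* C-style convention: input-only scalars by value, zero-based result. */
int ref_idamax_c(int n, const double *x, int incx)
{
    return ref_idamax_f77(&n, x, &incx) - 1;
}
```

For the vector {1.0, -7.0, 3.0, 7.0}, the Fortran-style stand-in returns 2 and the C-style stand-in returns 1. Both identify the same element, and the C result can be used directly as an array subscript.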
Note – Some of the routines in Sun Performance Library use malloc internally, so user codes that make calls to Sun Performance Library and to sbrk may not work correctly.

Sun Performance Library uses global integer registers %g2, %g3, and %g4 in 32-bit mode and %g2 through %g5 in 64-bit mode as scratch registers. User code should not use these registers for temporary storage and then call a Sun Performance Library routine, because the routine overwrites the data in these registers.

C Examples

The key to using Sun Performance Library to get peak performance from applications is to recognize opportunities to transform user-written code sequences into calls to Sun Performance Library functions. The following code sequence adapted from LAPACK shows one example:
    int i;
    float a[n], b[n], largest;

    largest = a[0];
    for (i = 0; i < n; i++)
    {
        if (a[i] > largest)
            largest = a[i];
        if (b[i] > largest)
            largest = b[i];
    }
There is no subroutine in Sun Performance Library that exactly replicates the functionality of the code above. However, the code can be accelerated by replacing it with several calls to Sun Performance Library as shown below. Because isamax returns the index of the element with the largest absolute value, the replacement is equivalent only when the elements of a and b are nonnegative.
    int i, large_index;
    float a[n], b[n], largest;

    large_index = isamax (n, a, 1);
    largest = a[large_index];
    large_index = isamax (n, b, 1);
    if (b[large_index] > largest)
        largest = b[large_index];
Note the differences between the call to the native C isamax in Sun Performance Library above and the call shown below to a comparable function in CLAPACK:
    /* 1. Declare scratch variable to allow 1 to be passed by reference */
    int one = 1;

    /* 2. Append underscore to conform to FORTRAN naming system */
    /* 3. Pass all arguments, even scalar input-only, by reference */
    /* 4. Subtract one to convert from FORTRAN indexing conventions */
    large_index = isamax_ (&n, a, &one) - 1;
    largest = a[large_index];
    large_index = isamax_ (&n, b, &one) - 1;
    if (b[large_index] > largest)
        largest = b[large_index];
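
The isamax-based rewrite of the search loop earlier in this section can be exercised without the library by using a stand-in. The sketch below uses a hypothetical ref_isamax that follows the zero-based C convention and, like the BLAS routine it imitates, compares absolute values. It assumes n >= 1 and incx >= 1. The helper names largest_by_loop and largest_by_isamax are also made up for this example.

```c
/* Stand-in for a zero-based isamax: index of the first element with
 * the largest absolute value. */
int ref_isamax(int n, const float *x, int incx)
{
    int best = 0;
    for (int i = 1; i < n; i++) {
        float cur = x[i * incx] < 0.0f ? -x[i * incx] : x[i * incx];
        float top = x[best * incx] < 0.0f ? -x[best * incx] : x[best * incx];
        if (cur > top)
            best = i;
    }
    return best;
}

/* Direct scan: largest signed value found in either array. */
float largest_by_loop(int n, const float *a, const float *b)
{
    float largest = a[0];
    for (int i = 0; i < n; i++) {
        if (a[i] > largest) largest = a[i];
        if (b[i] > largest) largest = b[i];
    }
    return largest;
}

/* Rewrite in terms of two index searches. */
float largest_by_isamax(int n, const float *a, const float *b)
{
    float largest = a[ref_isamax(n, a, 1)];
    int k = ref_isamax(n, b, 1);
    if (b[k] > largest)
        largest = b[k];
    return largest;
}
```

On nonnegative data the two functions agree. If an array holds a negative entry of large magnitude, such as {1.0, -9.0, 2.0}, the index search selects that entry and the two functions can return different values, so the rewrite is only safe when the data is known to be nonnegative.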
As an example of a program that uses Sun Performance Library routines from user-managed threads, consider a real-time signal processing application running on a 4-processor server with one processor dedicated to acquiring the data, two processors dedicated to performing FFTs on the data, and one processor dedicated to postprocessing the data after the FFTs. It begins by creating multiple running instances of the function that performs the FFT:
    for (i = 0; i < NCPUS_FOR_FFT; i++) {
        who[i] = i;
        do_fft[i] = 0;
        fft_done_buff_available[i] = 1;
        (void)thr_create ((void *)0, (size_t)0, fft_func,
                          (void *)&who[i], (long)0, (thread_t *)0);
    }
The code below is a simplified implementation of part of fft_func started by thr_create in the loop above. Note that production code should check the return value from thr_create above and should use semaphores rather than busy waits at the synchronization points in the code below.
    cpu_id = *who_am_i;
    while (1) {
        while (!do_fft[cpu_id]) {}
        rfftf (n, &dataset[0][cpu_id], &scratch[0][cpu_id]);
        while (!fft_done_buff_available[cpu_id]) {}
        fft_done_buff_available[cpu_id] = 0;
        scopy (n, &dataset[0][cpu_id], 1, &fft_done_buff[0][cpu_id], 1);
        do_fft[cpu_id] = 0;
    }
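
As a rough illustration of replacing the busy waits with semaphores, the sketch below hands work to a single worker thread and waits for each result. It swaps in POSIX threads and POSIX unnamed semaphores in place of the thr_create interface used above, and it assumes a platform that supports sem_init. The FFT and copy are replaced by a placeholder computation, and all names (struct worker, worker_main, run_worker) are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Hypothetical shared state for one worker. */
struct worker {
    sem_t work_ready;   /* posted by the producer when input is ready */
    sem_t work_done;    /* posted by the worker when output is ready  */
    int   input;
    int   output;
    int   stop;
};

/* Worker body: sleep on work_ready instead of spinning on a flag. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (;;) {
        sem_wait(&w->work_ready);
        if (w->stop)
            break;
        w->output = 2 * w->input;   /* placeholder for the FFT and copy */
        sem_post(&w->work_done);
    }
    return NULL;
}

/* Feed count items through one worker. Returns the sum of the results,
 * or -1 if the thread could not be created. */
int run_worker(int count)
{
    struct worker w = { .input = 0, .output = 0, .stop = 0 };
    pthread_t tid;
    int total = 0;

    sem_init(&w.work_ready, 0, 0);
    sem_init(&w.work_done, 0, 0);
    if (pthread_create(&tid, NULL, worker_main, &w) == 0) {
        for (int i = 1; i <= count; i++) {
            w.input = i;
            sem_post(&w.work_ready);   /* wake the worker            */
            sem_wait(&w.work_done);    /* block until it has finished */
            total += w.output;
        }
        w.stop = 1;
        sem_post(&w.work_ready);       /* wake the worker so it exits */
        pthread_join(tid, NULL);
    } else {
        total = -1;                    /* report thread creation failure */
    }
    sem_destroy(&w.work_ready);
    sem_destroy(&w.work_done);
    return total;
}
```

A blocked thread consumes no processor time while it waits, which matters when the processors are meant to be busy inside library routines. A fuller version would keep one such structure per FFT thread and would also check the return values of the semaphore calls.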
