CHAPTER 2

Using Sun Performance Library

This chapter describes using the Sun Performance Library to improve the execution speed of applications written in Fortran 95 or C. The performance of many applications can be increased by using Sun Performance Library without making source code changes or recompiling. However, some modifications to applications might be required to gain peak performance with Sun Performance Library.


2.1 Improving Application Performance

The following sections describe ways of using Sun Performance Library routines without making source code changes or recompiling.

2.1.1 Replacing Routines With Sun Performance Library Routines

Many applications use one or more of the base Netlib libraries, such as LAPACK or BLAS. Because Sun Performance Library maintains the same interfaces and functionality as these libraries, base Netlib routines can be replaced with Sun Performance Library routines. Application performance can increase, because Sun Performance Library routines can be faster than the corresponding Netlib routines or similar routines provided by other vendors.

2.1.2 Improving Performance of Other Libraries

Many commercial math libraries are built around a core of generic BLAS and LAPACK routines. When an application has a dependency on proprietary interfaces in another library that prevents the library from being completely replaced, the BLAS and LAPACK routines used in that library can be replaced with the Sun Performance Library BLAS and LAPACK routines. Because replacing the core routines does not require any code changes, the proprietary library features can still be used, and the other routines in the library can remain unchanged.

2.1.3 Using Tools to Restructure Code

Some libraries that do not directly use Sun Performance Library routines can be modified by using automatic code restructuring tools that replace existing code with Sun Performance Library code. For example, a source-to-source conversion tool can replace existing BLAS code structures with calls to the Sun Performance Library BLAS routines. These conversion tools can also recognize many user-written matrix multiplications and replace them with calls to the matrix multiplication subroutine in Sun Performance Library.


2.2 Fortran Interfaces

Sun Performance Library contains f95 interfaces and legacy f77 interfaces for maintaining compatibility with the standard LAPACK and BLAS libraries and existing codes. Sun Performance Library f95 and legacy f77 interfaces use the following conventions:

When calling Sun Performance Library routines:

2.2.1 Fortran SUNPERF Module for Use With Fortran 95

Sun Performance Library provides a Fortran module for additional ease-of-use features with Fortran 95 programs. To use this module, include the following line in Fortran 95 codes.

USE SUNPERF

USE statements must precede all other statements in the code, except for the PROGRAM or SUBROUTINE statement.

The SUNPERF module contains interfaces that simplify the calling sequences and provides the following features:

When using Sun Performance Library routines with optional arguments, the _64 suffix is required for 64-bit integers, as shown in the following code example:


SUBROUTINE SUB(N,ALPHA,X,Y)
USE SUNPERF
INTEGER(8) N
REAL(8) ALPHA, X(N), Y(N)
 
! EQUIVALENT TO DAXPY_64(N,ALPHA,X,1_8,Y,1_8)
CALL AXPY_64(ALPHA=ALPHA,X=X,Y=Y)
 
END

For a detailed description of using the Sun Performance Library 64-bit interfaces, see Compiling Code for a 64-Bit Enabled Operating Environment.

Because the sunperf.mod file is compiled with -dalign, any code that contains the USE SUNPERF statement must be compiled with -dalign. The following error occurs if the code is not compiled with -dalign.


 use sunperf
              ^     
    "test_code.f", Line = 2, Column = 11: ERROR: Procedure "SUNPERF" and this compilation must both be compiled with -a dalign, or without -a dalign. 

2.2.2 Optional Arguments

Sun Performance Library routines support Fortran 95 optional arguments, where argument values that can be inferred from other arguments can be omitted. For example, the SAXPY routine is defined as follows in the man page.


SUBROUTINE SAXPY([N], ALPHA, X, [INCX], Y, [INCY])
REAL ALPHA
INTEGER INCX, INCY, N
REAL X(*), Y(*)

The N, INCX, and INCY arguments are optional. Note the square bracket notation in the man pages that denotes the optional arguments.

Suppose the user tries to call the SAXPY routine with the following arguments.


USE SUNPERF
COMPLEX ALPHA
REAL    X(100), Y(100), XA(100,100), RALPHA
INTEGER INCX, INCY

If mismatches in the type, shape, or number of arguments occur, the compiler issues the following error message:

ERROR: No specific match can be found for the generic subprogram call "AXPY".

Using the arguments defined above, the following examples show incorrect calls to the SAXPY routine due to type, shape, or number mismatches.

A compiler error occurs because mixing parameter types, such as COMPLEX ALPHA and REAL X, is not supported.

A compiler error occurs because the XA argument is two dimensional, but the interface is expecting a one-dimensional argument.

A compiler error occurs because the compiler cannot find a routine in the AXPY interface group that takes four arguments of the following form.


	AXPY(REAL, REAL 1-D ARRAY, INTEGER, REAL 1-D ARRAY)

The f95 keyword parameter passing capability allows a user to make essentially the same four-argument call, as shown in the following example.


	CALL AXPY(ALPHA=RALPHA,X=X,INCX=INCX,Y=Y)

This is a valid call to the AXPY interface. Once an OPTIONAL parameter is omitted, every parameter that follows it in the argument list must be passed by keyword.

The following calls to the AXPY interface are valid.


	CALL AXPY(N,RALPHA,X,Y=Y,INCY=INCY)
	CALL AXPY(N,RALPHA,X,INCX,Y)
	CALL AXPY(N,RALPHA,X,Y=Y)
	CALL AXPY(ALPHA=RALPHA,X=X,Y=Y)


2.3 Fortran Examples

To increase the performance of single processor applications, identify code constructs in an application that can be replaced by calls to Sun Performance Library routines. Performance of multiprocessor applications can be increased by identifying opportunities for parallelization.

To increase application performance by modifying code to use Sun Performance Library routines, identify blocks of code that exactly duplicate the capability of a Sun Performance Library routine. The following code example computes the matrix-vector product y ← Ax + y, which can be replaced with a call to the DGEMV subroutine.


      DO I = 1, N
          DO J = 1, N
              Y(I) = Y(I) + A(I,J) * X(J)
          END DO
      END DO
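
As a hedged illustration in C, the following sketch spells out the operation that DGEMV performs in the no-transpose case, y ← alpha*A*x + beta*y, on a matrix stored in column-major (Fortran) order. The functions ref_dgemv_n and loop_matvec are hypothetical helpers written for this example and are not part of Sun Performance Library. With alpha = 1 and beta = 1, the reference routine produces the same result as the loop nest.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not a Sun Performance Library call):
 * computes y = alpha*A*x + beta*y for an m-by-n matrix A stored in
 * column-major order with leading dimension lda and unit strides. */
void ref_dgemv_n(int m, int n, double alpha, const double *a, int lda,
                 const double *x, double beta, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 0 ? m : 1));
    for (int i = 0; i < m; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++)          /* walk A one column at a time */
        for (int i = 0; i < m; i++)
            y[i] += alpha * a[i + (size_t)j * lda] * x[j];
}

/* The loop nest from the text, transcribed to C with 0-based indices. */
void loop_matvec(int n, const double *a, int lda, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            y[i] += a[i + (size_t)j * lda] * x[j];
}
```

The two functions accumulate terms in a different order, so results on general floating-point data can differ in the last bits even though the operation is mathematically the same.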

In other cases, a block of code can be equivalent to several Sun Performance Library calls or contain portions of code that can be replaced with calls to Sun Performance Library routines. Consider the following code example.


      DO I = 1, N
          IF (V2(I,K) .LT. 0.0) THEN
              V2(I,K) = 0.0
          ELSE
              DO J = 1, M
                  X(J,I) = X(J,I) + V1(J,K) * V2(I,K)
              END DO
          END IF 
      END DO

The code example can be rewritten to use the Sun Performance Library routine DGER, as shown here.


      DO I = 1, N
          IF (V2(I,K) .LT. 0.0) THEN
             V2(I,K) = 0.0
          END IF 
      END DO
      CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)

The same code example can also be rewritten using Fortran 95 specific statements, as shown here.


WHERE (V2(1:N,K) .LT. 0.0)
       V2(1:N,K) = 0.0
END WHERE
CALL DGER (M, N, 1.0D0, V1(1,K), 1, V2(1,K), 1, X, LDX)

Because the code to replace negative numbers with zero in V2 has no natural analog in Sun Performance Library, that code is pulled out of the outer loop. With that code removed to its own loop, the rest of the loop is a rank-1 update of the general matrix X that can be replaced with the DGER routine from BLAS. The update can then run unconditionally, because the entries of V2 that were set to zero contribute nothing to X.
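
The restructuring can be checked with a small C sketch. The function ref_dger is a hypothetical stand-in for DGER with unit strides (A ← A + alpha*x*yᵀ), and update_branchy and update_split are illustrative names introduced here, not library routines. For finite input data, both forms leave the same values in x and v2.

```c
#include <assert.h>
#include <stddef.h>

/* Original form: the branch sits inside the outer loop.
 * x is m-by-n, column-major, with leading dimension ldx;
 * v1 has m elements and v2 has n elements (column K of V1 and V2). */
void update_branchy(int m, int n, double *x, int ldx,
                    const double *v1, double *v2)
{
    for (int i = 0; i < n; i++) {
        if (v2[i] < 0.0) {
            v2[i] = 0.0;
        } else {
            for (int j = 0; j < m; j++)
                x[j + (size_t)i * ldx] += v1[j] * v2[i];
        }
    }
}

/* Hypothetical stand-in for DGER with unit strides:
 * A = A + alpha * x * y^T, A stored column-major. */
void ref_dger(int m, int n, double alpha, const double *x,
              const double *y, double *a, int lda)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 0 ? m : 1));
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            a[i + (size_t)j * lda] += alpha * x[i] * y[j];
}

/* Restructured form: clamp first, then one rank-1 update. */
void update_split(int m, int n, double *x, int ldx,
                  const double *v1, double *v2)
{
    for (int i = 0; i < n; i++)
        if (v2[i] < 0.0)
            v2[i] = 0.0;
    ref_dger(m, n, 1.0, v1, v2, x, ldx);
}
```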

The amount of performance increase can also depend on the data the Sun Performance Library routine uses. For example, if V2 contains many negative or zero values, the majority of the time might not be spent in the rank-1 update. In this case, replacing the code with a call to DGER might not increase performance.

Evaluating other loop indexes can affect the Sun Performance Library routine used. For example, if the reference to K is a loop index, the loops in the code sample shown above might be part of a larger code structure, where the loops over DGEMV or DGER could be converted to some form of matrix multiplication. If so, a single call to a matrix multiplication routine can increase performance more than using a loop with calls to DGER.
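
To make the last point concrete, the following C sketch assumes that K runs over kmax columns and that the clamping step has already been applied to every column of V2. Under that assumption, the accumulated rank-1 updates equal the single matrix product X ← X + V1*V2ᵀ, which in BLAS terms corresponds to one DGEMM call with the second operand transposed. The functions rank1_loop and ref_dgemm_nt are hypothetical reference routines written for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulate kmax rank-1 updates, one per column k of V1 and V2.
 * V1 is m-by-kmax, V2 is n-by-kmax, X is m-by-n, all column-major. */
void rank1_loop(int m, int n, int kmax, const double *v1, int ldv1,
                const double *v2, int ldv2, double *x, int ldx)
{
    assert(m >= 0 && n >= 0 && kmax >= 0);
    for (int k = 0; k < kmax; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                x[j + (size_t)i * ldx] +=
                    v1[j + (size_t)k * ldv1] * v2[i + (size_t)k * ldv2];
}

/* Hypothetical stand-in for a single matrix multiplication call:
 * X = X + V1 * V2^T. */
void ref_dgemm_nt(int m, int n, int kmax, const double *v1, int ldv1,
                  const double *v2, int ldv2, double *x, int ldx)
{
    assert(m >= 0 && n >= 0 && kmax >= 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            double sum = 0.0;
            for (int k = 0; k < kmax; k++)
                sum += v1[j + (size_t)k * ldv1] * v2[i + (size_t)k * ldv2];
            x[j + (size_t)i * ldx] += sum;
        }
}
```

The two routines sum the terms in a different order, so on general data the results can differ by rounding error.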

Because all Sun Performance Library routines are MT-safe (multithread safe), using the auto-parallelizing compiler to parallelize loops that contain calls to Sun Performance Library routines can increase performance on multiprocessor platforms.

An example of combining a Sun Performance Library routine with an auto-parallelizing compiler parallelization directive is shown in the following code example.


C     KL and KU: number of subdiagonals and superdiagonals of A
C$PAR DOALL
      DO I = 1, N
          CALL DGBMV ('No transpose', N, N, KL, KU, ALPHA, A, LDA,
     $                B(1,I), 1, BETA, C(1,I), 1)
      END DO

Sun Performance Library contains a routine named DGBMV to multiply a banded matrix by a vector. By putting this routine into a properly constructed loop, Sun Performance Library routines can be used to multiply a banded matrix by a matrix. The compiler will not parallelize this loop by default, because the presence of subroutine calls in a loop inhibits parallelization. However, Sun Performance Library routines are MT-safe, so a user can use parallelization directives that instruct the compiler to parallelize this loop.
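
The structure of such a loop can be sketched in C. This example substitutes an OpenMP pragma for the Sun directive, which is a different mechanism than the one described above; a compiler that is not run in OpenMP mode typically ignores the pragma and executes the loop serially. The functions ref_dgbmv_n and banded_times_matrix are hypothetical reference routines, and the band layout assumed here is the standard BLAS one, in which element A(i,j) is stored at row ku + i - j of column j (0-based). Each iteration writes a different column of C, which is what makes the iterations independent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for a banded matrix-vector product:
 * y = alpha*A*x + beta*y, where A is m-by-n with kl subdiagonals and
 * ku superdiagonals, held in band storage ab with leading dimension lda. */
void ref_dgbmv_n(int m, int n, int kl, int ku, double alpha,
                 const double *ab, int lda, const double *x,
                 double beta, double *y)
{
    assert(kl >= 0 && ku >= 0 && lda >= kl + ku + 1);
    for (int i = 0; i < m; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        int lo = (j - ku > 0) ? j - ku : 0;          /* first row in band */
        int hi = (j + kl < m - 1) ? j + kl : m - 1;  /* last row in band  */
        for (int i = lo; i <= hi; i++)
            y[i] += alpha * ab[(ku + i - j) + (size_t)j * lda] * x[j];
    }
}

/* Banded matrix times dense matrix, one column of B and C per iteration. */
void banded_times_matrix(int n, int ncols, int kl, int ku, double alpha,
                         const double *ab, int lda,
                         const double *b, int ldb,
                         double beta, double *c, int ldc)
{
    #pragma omp parallel for
    for (int col = 0; col < ncols; col++)
        ref_dgbmv_n(n, n, kl, ku, alpha, ab, lda,
                    b + (size_t)col * ldb, beta, c + (size_t)col * ldc);
}
```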

Compiler directives can also be used to parallelize a loop with a subroutine call that ordinarily would not be parallelizable. For example, it is ordinarily not possible to parallelize a loop containing a call to some of the linear system solvers, because some vendors have implemented those routines using code that is not MT-safe. Loops containing calls to the expert drivers of the linear system solvers (routines whose names end in SVX) are usually not parallelizable with other implementations of LAPACK. Because the implementation of LAPACK in Sun Performance Library allows parallelization of loops containing such calls, users of multiprocessor platforms can get additional performance by parallelizing these loops.


2.4 C Interfaces

The Sun Performance Library routines can be called from within a FORTRAN 77, Fortran 95, or C program. However, C programs must still use the FORTRAN 77 calling sequence.

Sun Performance Library contains native C interfaces for each of the routines contained in LAPACK, BLAS, FFTPACK, VFFTPACK, and SPARSE BLAS. The Sun Performance Library C interfaces have the following features:

The following example compares the standard LAPACK Fortran interface and the Sun Performance Library C interfaces for the DGBCON routine.


CALL DGBCON (NORM, N, NSUB, NSUPER, DA, LDA, IPIVOT, DANORM,
	DRCOND, DWORK, IWORK2, INFO)

void dgbcon(char norm, int n, int nsub, int nsuper, double *da,
	int lda, int *ipivot, double danorm, double *drcond,
	int *info);

Note that the names of the arguments are the same and that arguments with the same name have the same base type. Scalar arguments that are used only as input values, such as NORM and N, are passed by value in the C version. Arrays and scalars that will be used to return values are passed by reference.

The Sun Performance Library C interfaces improve on CLAPACK, available on Netlib, which is an f2c translation of the standard libraries. For example, all of the CLAPACK routines are followed by a trailing underscore to maintain compatibility with Fortran compilers, which often postfix routine names in the object (.o) file with an underscore. The Sun Performance Library C interfaces do not require a trailing underscore.
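
The difference between the two calling styles can be sketched as follows. Both functions are hypothetical stand-ins written for this example: demo_saxpy_ mimics a Fortran-style entry point (trailing underscore, every argument passed by reference), and demo_saxpy mimics a C-style interface that takes input-only scalars by value. The sketch does not describe how Sun Performance Library implements its interfaces.

```c
#include <assert.h>

/* Fortran-style entry point (hypothetical): trailing underscore,
 * all arguments passed by reference. Computes y = alpha*x + y. */
void demo_saxpy_(const int *n, const float *alpha, const float *x,
                 const int *incx, float *y, const int *incy)
{
    int sx = *incx, sy = *incy;
    assert(sx > 0 && sy > 0);   /* this sketch handles positive strides only */
    for (int i = 0; i < *n; i++)
        y[i * sy] += *alpha * x[i * sx];
}

/* C-style interface (hypothetical): no underscore, and input-only
 * scalars are passed by value; arrays are still passed as pointers. */
void demo_saxpy(int n, float alpha, const float *x, int incx,
                float *y, int incy)
{
    demo_saxpy_(&n, &alpha, x, &incx, y, &incy);
}
```

A caller of the C-style interface can pass literals directly, as in demo_saxpy(3, 2.0f, x, 1, y, 1), whereas a caller of the Fortran-style entry point must first store each scalar in a variable.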

Sun Performance Library C interfaces use the following conventions:

For example, the Fortran interface to IDAMAX, which C programs access as idamax_, would return a 1 to indicate the first element in a vector. The C interface to idamax, which C programs access as idamax, would also return a 1, to indicate the first element of a vector. This convention is observed in function return values, permutation vectors, and anywhere else that vector or array indices are used.



Note - Some Sun Performance Library routines use malloc internally, so user codes that make calls to Sun Performance Library and to sbrk might not work correctly.


The SPARC version of the Sun Performance Library uses global integer registers %g2, %g3, and %g4 in 32-bit mode and %g2 through %g5 in 64-bit mode as scratch registers. User code should not use these registers for temporary storage and then call a Sun Performance Library routine, because the data will be overwritten when the routine uses these registers.


2.5 C Examples

Transforming user-written code sequences into calls to Sun Performance Library routines can increase application performance. The following code example, adapted from LAPACK, shows one such sequence.


int    i;
float a[n], b[n], largest;

largest = a[0];
for (i = 0; i < n; i++)
{
    if (a[i] > largest)
        largest = a[i];
    if (b[i] > largest)
        largest = b[i];
}

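No Sun Performance Library routine exactly replicates the functionality of this code example. However, the code can be accelerated by replacing it with several calls to the Sun Performance Library routine isamax, as shown in the following code example. Because isamax returns the index of the element with the largest absolute value, the replacement is equivalent only when that element is also the largest element, for example when all elements are nonnegative.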


int    i, large_index; 
float a[n], b[n], largest;
 
large_index = isamax (n, a, 1) - 1;
largest = a[large_index];
large_index = isamax (n, b, 1) - 1;
if (b[large_index] > largest) 
     largest = b[large_index];
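
The index handling in this example can be tried without the library by using a reference routine. In the following sketch, ref_isamax is a hypothetical stand-in that follows the reference BLAS definition of ISAMAX (the 1-based index of the first element with the largest absolute value), and largest_of_two is an illustrative wrapper around the sequence shown above.

```c
#include <assert.h>

/* Hypothetical stand-in for the C isamax interface: returns the 1-based
 * index of the first element with the largest absolute value. */
int ref_isamax(int n, const float *x, int incx)
{
    assert(n > 0 && incx > 0);
    int best = 1;
    float best_abs = (x[0] < 0.0f) ? -x[0] : x[0];
    for (int i = 1; i < n; i++) {
        float v = x[i * incx];
        if (v < 0.0f)
            v = -v;
        if (v > best_abs) {
            best_abs = v;
            best = i + 1;               /* report a 1-based index */
        }
    }
    return best;
}

/* Largest element of a and b, found with two index searches.
 * Matches the explicit loop only when the largest-magnitude element
 * of each array is also its largest element. */
float largest_of_two(int n, const float *a, const float *b)
{
    int ia = ref_isamax(n, a, 1) - 1;   /* convert to a 0-based C index */
    int ib = ref_isamax(n, b, 1) - 1;
    return (b[ib] > a[ia]) ? b[ib] : a[ia];
}
```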

Compare the call to the native C isamax routine in Sun Performance Library, shown in the previous code example, with the call to the isamax routine in CLAPACK, shown in the following code example.


/* 1. Declare scratch variable to allow 1 to be passed by reference */
int one = 1;
/* 2. Append underscore to conform to FORTRAN naming system     */
/* 3. Pass all arguments, even scalar input-only, by reference  */
/* 4. Subtract one to convert from FORTRAN indexing conventions */
large_index = isamax_ (&n, a, &one) - 1;
largest = a[large_index];
large_index = isamax_ (&n, b, &one) - 1;
if (b[large_index] > largest)
     largest = b[large_index];