Fortran Programming Guide

Performance Option Reference

The compiler options listed in the following table provide the user with a repertoire of strategies to improve the performance of a program over default compilation. Only some of the compilers' more potent performance options appear in the table. A more complete list can be found in the Fortran User's Guide.

Table 9-1 Some Effective Performance Options

Action                                                                Option
Uses various optimization options together                            -fast
Sets compiler optimization level to n                                 -On (-O = -O3)
Specifies target hardware                                             -xtarget=sys
Optimizes using performance profile data (with -O5)                   -xprofile=use
Unrolls loops by n                                                    -unroll=n
Permits simplification and optimization of floating-point arithmetic  -fsimple=1|2
Performs dependency analysis to optimize loops                        -depend

Some of these options increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.

-fast

This single option selects a number of performance options that, working together, produce object code optimized for execution speed without an excessive increase in compilation time.

The options selected by -fast are subject to change from one release to another, and not all of them are available on every platform.

-fast provides a quick way to engage much of the optimizing power of the compilers. Each of the composite options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User's Guide). Following -fast with additional options adds further optimizations. For example:

f77 -fast -O5 ...

sets the optimization level to 5 instead of the 4 selected by -fast.


Note -

-fast includes -dalign and -native. These options may have unexpected side effects for some programs.


-On

The compilers perform no optimizations unless a -O option is specified explicitly (or implicitly with macro options such as -fast). In nearly all cases, specifying an optimization level improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.

For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds, among other things, automatic inlining of calls to routines contained in the same source file as the caller. Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be applied only to the routines that make up the most compute-intensive parts of the program, where they are most likely to improve performance. (There is no problem linking together parts of a program compiled with different optimization levels.)
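
For example, a compute-intensive routine might be compiled at -O5 while the rest of the program is compiled at -O3, and the resulting object files linked together (the file names here are only illustrative):

demo% f77 -c -O5 hotspot.f
demo% f77 -c -O3 support.f
demo% f77 -o prg hotspot.o support.o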

PRAGMA OPT=n

Use the C$ PRAGMA SUN OPT=n directive to set different optimization levels for individual routines in a source file. This directive overrides the -On level given on the compiler command line, but it must be used together with the -xmaxopt=n flag, which sets the maximum optimization level allowed. See the f77(1) and f90(1) man pages for details.
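
The sketch below shows one way the directive might be used (the routine and file names are illustrative). The directive immediately precedes the routine it applies to, and the file is compiled with an -xmaxopt value high enough to permit the requested level:

C$PRAGMA SUN OPT=2
      SUBROUTINE SETUP(A, N)
C     Optimized at level 2; the other routines in this file use the
C     level given on the command line.
      INTEGER N, I
      REAL A(N)
      DO I = 1, N
         A(I) = 0.0
      END DO
      END

compiled, for example, with:

demo% f77 -c -O4 -xmaxopt=5 src.f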

Optimization With Runtime Profile Feedback

The compiler applies its optimization strategies at levels -O3 and above much more effectively when combined with -xprofile=use. With this option, the optimizer is directed by a runtime execution profile produced by the program (compiled with -xprofile=collect) running on typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This may be particularly important with -O5. Here is a typical example of profile collection with higher optimization levels:


demo% f77 -o prg -fast -xprofile=collect prg.f ...
demo% prg    
demo% f77 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
demo% prgx  

The first compilation in the example generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.

(See the Fortran User's Guide for details on -xprofile options.)

-dalign

With -dalign, the compiler generates double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.

However, users should be aware that using -dalign (and therefore -fast) may cause problems with programs that have been coded expecting a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double- and quad-precision data (either REAL or COMPLEX) are aligned on double-word boundaries. As a result, both the size of a COMMON block and the placement of data within it can change.

For example, a program that writes data by aliasing an entire COMMON block of mixed data types as a single array might not work properly with -dalign because the block will be larger (due to padding of double and quad precision variables) than the program expects.
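
A minimal sketch of the kind of block affected (the names are illustrative): without padding, /MIX/ below occupies 12 bytes, but with -dalign the compiler may insert 4 bytes of padding after I so that D starts on a double-word boundary, making the block 16 bytes long.

      PROGRAM DEMO
C     Mixed-type COMMON block: a 4-byte INTEGER followed by an
C     8-byte DOUBLE PRECISION value. With -dalign, padding may be
C     inserted after I to double-word align D, enlarging the block.
      INTEGER I
      DOUBLE PRECISION D
      COMMON /MIX/ I, D
      I = 1
      D = 2.0D0
      PRINT *, I, D
      END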

SPARC: -depend

Adding -depend to optimization levels -O3 and higher (on the SPARC platform) extends the compiler's ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration loop dependencies to determine whether or not certain transformations of the loop structure can be performed. Only loops without dependencies can be restructured. However, the added analysis might increase compilation time.
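
For illustration, the first loop in the sketch below (routine and array names are illustrative) carries no dependence between iterations and is a candidate for restructuring, while the second uses a value computed in the previous iteration and is left in its original form:

      SUBROUTINE SWEEP(A, B, C, N)
      INTEGER N, I
      REAL A(N), B(N), C(N)
C     No cross-iteration dependence: each A(I) is computed only
C     from B(I) and C(I), so the loop can be restructured.
      DO I = 1, N
         A(I) = B(I) + C(I)
      END DO
C     A(I) depends on A(I-1) from the previous iteration, so this
C     loop is not restructured.
      DO I = 2, N
         A(I) = A(I-1) + B(I)
      END DO
      END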

-fsimple=2

Unless directed to do so, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). With the -fast option, -fsimple=1 is selected and some conservative assumptions are made. Adding -fsimple=2 enables the optimizer to make further simplifications, with the understanding that this might cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be compiled the same way to ensure consistent numerical accuracy.
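
For example, since later options override the settings made by earlier ones, placing -fsimple=2 after -fast replaces the -fsimple=1 that -fast selects:

demo% f77 -o prg -fast -fsimple=2 prg.f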

-unroll=n

Unrolling short loops with long iteration counts can be profitable for some routines. However, unrolling can also increase program size and might even degrade performance of other loops. With n=1, the default, no loops are unrolled automatically by the optimizer. With n greater than 1, the optimizer attempts to unroll loops up to a depth of n.

The compiler's code generator decides whether to unroll a loop based on a number of factors, and it might decline to unroll a loop even though this option is specified with n>1.

If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled, and a runtime test on the iteration count determines whether executing the unrolled version is appropriate. Loop unrolling, especially with simple one- or two-statement loops, increases the amount of computation done per iteration and provides the optimizer with better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation might be needed.

The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):


Original Loop:
    DO I=1,20000
       X(I) = X(I) + Y(I)*A(I)
    END DO

Unrolled by 4 compiles as if it were written:
    DO I=1, 19997,4
       TEMP1 = X(I) + Y(I)*A(I)
       TEMP2 = X(I+1) + Y(I+1)*A(I+1)
       TEMP3 = X(I+2) + Y(I+2)*A(I+2)
       X(I+3) = X(I+3) + Y(I+3)*A(I+3)
       X(I) = TEMP1
       X(I+1) = TEMP2
       X(I+2) = TEMP3
    END DO

This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.

-xtarget=platform

The performance of some programs might improve if the compiler has an accurate description of the target computer hardware. When program performance is critical, the proper specification of the target hardware could be very important. This is especially true when running on the newer SPARC processors. However, for most programs and older SPARC processors, the performance gain could be negligible and a generic specification might be sufficient.

The Fortran User's Guide lists all the system names recognized by -xtarget=. For any given system name (for example, ss1000, for SPARCserver 1000), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.

The special setting -xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture. Therefore, -xtarget=generic is the default, even though it might produce suboptimal performance.
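
For example, when the program will be compiled and run on the same machine, a command line might look like the following (the file names are illustrative):

demo% f77 -o prg -O4 -xtarget=native prg.f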