Sun Studio 12: Fortran Programming Guide

9.1.1 Performance Options

The compiler options listed in the following table provide a range of strategies for improving the performance of a program over default compilation. Only some of the compiler’s more potent performance options appear in the table; a more complete list can be found in the Fortran User’s Guide.

Table 9–1 Some Effective Performance Options


Action	Option
Uses a combination of optimization options together	`-fast`
Sets compiler optimization level to `n`	`-On` (`-O` = `-O3`)
Specifies general target hardware	`-xtarget=sys`
Specifies a particular Instruction Set Architecture	`-xarch=isa`
Optimizes using performance profile data (with `-O5`)	`-xprofile=use`
Unrolls loops by `n`	`-unroll=n`
Permits simplifications and optimization of floating-point arithmetic	`-fsimple=1|2`
Performs dependency analysis to optimize loops	`-depend`
Performs interprocedural optimizations	`-xipo`

Some of these options increase compilation time because they invoke a deeper analysis of the program. Some options work best when routines are collected into files along with the routines that call them (rather than splitting each routine into its own file); this allows the analysis to be global.

9.1.1.1 `-fast`

This single option selects a number of performance options.

Note –

This option is defined as a particular selection of other options that is subject to change from one release to another, and between compilers. Also, some of the options selected by -fast might not be available on all platforms. Compile with the -dryrun flag to see the expansion of -fast.

-fast provides high performance for certain benchmark applications. However, the particular choice of options may or may not be appropriate for your application. Use -fast as a good starting point for compiling your application for best performance; additional tuning may still be required. If your program behaves improperly when compiled with -fast, look closely at the individual options that make up -fast and invoke only those that are appropriate to your program and preserve correct behavior.

Note also that a program compiled with -fast may show good performance and accurate results with some data sets, but not with others. Avoid using -fast to compile programs that depend on particular properties of floating-point arithmetic.

Because some of the options selected by -fast have linking implications, be sure to link with -fast as well if you compile and link in separate steps.

-fast selects the following options:

-dalign
-depend
-fns
-fsimple=2
-ftrap=common
-fround=nearest (Solaris only)
-libmil
-xtarget=native
-O5
-xlibmopt (Solaris only)
-pad=local (SPARC only)
-xvector=lib (SPARC only)
-nofstore (x86 only)
-xregs=frameptr (x86 only)

-fast provides a quick way to engage much of the optimizing power of the compiler. Each of its component options may be specified individually, and each may have side effects to be aware of (discussed in the Fortran User’s Guide). Note also that the exact expansion of -fast may change with each compiler release. Compiling with -dryrun will show the expansion of all command-line flags.

Options that follow -fast on the command line add to, or override, the settings that -fast selects. For example:

f95 -fast -m64 ...

compiles for a 64-bit enabled platform.

Because -fast invokes -dalign, -fns, and -fsimple=2, programs compiled with -fast can exhibit nonstandard floating-point arithmetic, nonstandard alignment of data, and nonstandard ordering of expression evaluation. These selections might not be appropriate for most programs.

9.1.1.2 `-On`

The compiler performs no optimizations unless a -O option is specified explicitly (or implicitly with macro options like -fast). In nearly all cases, specifying an optimization level at compilation improves program execution performance. On the other hand, higher levels of optimization increase compilation time and may significantly increase code size.

For most cases, level -O3 is a good balance between performance gain, code size, and compilation time. Level -O4 adds automatic inlining of calls to routines contained in the same source file as the caller routine, among other things. (See the Fortran User’s Guide for further information about subprogram call inlining.)
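As a minimal sketch of the kind of call that same-file inlining targets, the program below keeps a small function in the same source file as the loop that calls it. The names inline_demo and axpy_elem are hypothetical, and whether the call is actually inlined depends on the compiler's own heuristics.

```fortran
! Hypothetical example: caller and callee live in one source file,
! so automatic inlining can consider the call inside the loop.
program inline_demo
  implicit none
  integer, parameter :: n = 1000
  double precision :: x(n), y(n)
  double precision, external :: axpy_elem
  integer :: i

  do i = 1, n
     x(i) = dble(i)
     y(i) = 1.0d0
  end do

  ! Each iteration calls a tiny function; inlining would remove
  ! the call overhead and expose the arithmetic to the optimizer.
  do i = 1, n
     y(i) = axpy_elem(2.0d0, x(i), y(i))
  end do

  print *, y(1), y(n)
end program inline_demo

! Small leaf routine: a natural inlining candidate.
double precision function axpy_elem(a, xi, yi)
  implicit none
  double precision, intent(in) :: a, xi, yi
  axpy_elem = a*xi + yi
end function axpy_elem
```

If axpy_elem were moved to a separate source file, this same-file form of inlining would presumably no longer apply to it.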

Level -O5 adds more aggressive optimization techniques that would not be applied at lower levels. In general, levels above -O3 should be specified only for those routines that make up the most compute-intensive parts of the program and are therefore most likely to show improved performance. (There is no problem linking together parts of a program compiled with different optimization levels.)

9.1.1.3 `PRAGMA OPT=n`

Use the C$ PRAGMA SUN OPT=n directive to set different optimization levels for individual routines in a source file. This directive will override the -On flag on the compiler command line, but must be used with the -xmaxopt=n flag to set a maximum optimization level. See the f95(1) man page for details.
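The following sketch shows how such a directive might look in practice. It assumes free-form source, where the directive sentinel is written !$PRAGMA, and it assumes the directive is placed immediately before the routine it applies to; both points should be checked against the f95(1) man page. The names opt_demo and scale_in_place are hypothetical.

```fortran
! Hypothetical example of a per-routine optimization level.
! To compilers that do not recognize it, the directive line
! is an ordinary comment.
program opt_demo
  implicit none
  double precision :: v(4)
  v = (/ 1.0d0, 2.0d0, 3.0d0, 4.0d0 /)
  call scale_in_place(v, 4, 0.5d0)
  print *, v
end program opt_demo

! Assumed placement: directly ahead of the routine it governs.
!$PRAGMA SUN OPT=2
subroutine scale_in_place(v, n, factor)
  implicit none
  integer, intent(in) :: n
  double precision, intent(inout) :: v(n)
  double precision, intent(in) :: factor
  integer :: i
  do i = 1, n
     v(i) = v(i)*factor
  end do
end subroutine scale_in_place
```

Under these assumptions, compiling the file with -O4 -xmaxopt=4 would be expected to compile scale_in_place at level 2 and the main program at level 4.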

9.1.1.4 Optimization With Runtime Profile Feedback

The compiler applies its optimization strategies at level -O3 and above much more effectively if combined with -xprofile=use. With this option, the optimizer is directed by a runtime execution profile produced by the program (compiled with -xprofile=collect) with typical input data. The feedback profile indicates to the compiler where optimization will have the greatest effect. This may be particularly important with -O5. Here is a typical example of profile collection with higher optimization levels:

demo% f95 -o prg -fast -xprofile=collect prg.f ...
demo% prg 
demo% f95 -o prgx -fast -O5 -xprofile=use:prg.profile prg.f ...
demo% prgx

The first compilation in the example generates an executable that produces statement coverage statistics when run. The second compilation uses this performance data to guide the optimization of the program.

(See the Fortran User’s Guide for details on -xprofile options.)

9.1.1.5 `-dalign`

With -dalign the compiler is able to generate double-word load/store instructions whenever possible. Programs that do much data motion may benefit significantly when compiled with this option. (It is one of the options selected by -fast.) The double-word instructions are almost twice as fast as the equivalent single-word operations.

However, users should be aware that using -dalign (and therefore -fast) may cause problems with some programs that have been coded expecting a specific alignment of data in COMMON blocks. With -dalign, the compiler may add padding to ensure that all double (and quad) precision data (either REAL or COMPLEX) are aligned on double-word boundaries, with the result that:

COMMON blocks might be larger than expected due to added padding.
All program units sharing COMMON must be compiled with -dalign if any one of them is compiled with -dalign.

For example, a program that writes data by aliasing an entire COMMON block of mixed data types as a single array might not work properly with -dalign because the block will be larger (due to padding of double and quad precision variables) than the program expects.
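The hazard can be sketched as follows, assuming a 4-byte default INTEGER and an 8-byte DOUBLE PRECISION. The names common_alias, dump_block, and blk are hypothetical, and the output is deliberately not shown because it depends on whether padding is inserted.

```fortran
! Hypothetical example: one unit declares mixed types in COMMON,
! another overlays the same block with a single INTEGER array.
program common_alias
  implicit none
  integer :: flag
  double precision :: dval
  common /blk/ flag, dval       ! 4 bytes + 8 bytes when unpadded

  flag = 7
  dval = 1.5d0
  call dump_block()
end program common_alias

subroutine dump_block()
  implicit none
  integer :: words(3)           ! assumes a 12-byte block with no padding
  common /blk/ words
  integer :: i
  ! If padding is inserted after flag to align dval on a double-word
  ! boundary, words(2:3) no longer overlay the storage of dval.
  do i = 1, 3
     print *, i, words(i)
  end do
end subroutine dump_block
```

Code written this way relies on an exact byte layout, which is the assumption that added padding invalidates.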

9.1.1.6 `-depend`

Adding -depend to optimization levels -O3 and higher extends the compiler’s ability to optimize DO loops and loop nests. With this option, the optimizer analyzes inter-iteration data dependences to determine whether or not certain transformations of the loop structure can be performed. Only loops without data dependences can be restructured. However, the added analysis might increase compilation time.
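To illustrate the distinction, the sketch below contrasts a loop with no inter-iteration dependence against one with a loop-carried dependence. The program name depend_demo and the array names are hypothetical; the example shows the property being analyzed, not what the compiler generates.

```fortran
program depend_demo
  implicit none
  integer, parameter :: n = 8
  double precision :: a(n), b(n), c(n)
  integer :: i

  do i = 1, n
     a(i) = 1.0d0
     b(i) = dble(i)
     c(i) = 2.0d0
  end do

  ! No inter-iteration dependence: iteration i touches only element i,
  ! so the iterations can be reordered or restructured.
  do i = 1, n
     a(i) = b(i) + c(i)
  end do

  ! Loop-carried dependence: iteration i reads a(i-1), which was
  ! written by iteration i-1, so the iteration order must be preserved.
  do i = 2, n
     a(i) = a(i-1) + b(i)
  end do

  print *, a
end program depend_demo
```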

9.1.1.7 `-fsimple=2`

Unless directed to, the compiler does not attempt to simplify floating-point computations (the default is -fsimple=0). -fsimple=2 enables the optimizer to make aggressive simplifications with the understanding that this might cause some programs to produce slightly different results due to rounding effects. If -fsimple level 1 or 2 is used, all program units should be similarly compiled to ensure consistent numerical accuracy. See the Fortran User’s Guide for important information about this option.
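Compensated (Kahan) summation is a well-known example of code that depends on floating-point rounding behavior, and it is offered here only as an illustration; the source does not name it. In the sketch below, the expression assigned to comp is zero in exact arithmetic, so an optimizer that is allowed to simplify algebraically might remove the correction. The program name kahan_demo is hypothetical.

```fortran
program kahan_demo
  implicit none
  integer, parameter :: n = 100000
  double precision :: total, comp, y, t
  integer :: i

  total = 0.0d0
  comp  = 0.0d0                 ! running compensation for lost low-order bits
  do i = 1, n
     y = 0.1d0 - comp
     t = total + y
     ! Algebraically (t - total) - y is zero; in floating point it
     ! captures the rounding error of the addition above.
     comp  = (t - total) - y
     total = t
  end do
  print *, total
end program kahan_demo
```

A routine like this is a candidate for compiling at -fsimple=0, although the guidance above about compiling all program units consistently still applies.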

9.1.1.8 `-unroll=n`

Unrolling short loops with long iteration counts can be profitable for some routines. However, unrolling can also increase program size and might even degrade performance of other loops. With n=1, the default, no loops are unrolled automatically by the optimizer. With n greater than 1, the optimizer attempts to unroll loops up to a depth of n.

The compiler’s code generator makes its decision to unroll loops depending on a number of factors. The compiler might decline to unroll a loop even though this option is specified with n>1.

If a DO loop with a variable loop limit can be unrolled, both an unrolled version and the original loop are compiled. A runtime test on iteration count determines if it is appropriate to execute the unrolled loop. Loop unrolling, especially with simple one- or two-statement loops, increases the amount of computation done per iteration and provides the optimizer with better opportunities to schedule registers and simplify operations. The tradeoff between number of iterations, loop complexity, and choice of unrolling depth is not easy to determine, and some experimentation might be needed.

The example that follows shows how a simple loop might be unrolled to a depth of four with -unroll=4 (the source code is not changed with this option):

Original Loop:

    DO I=1,20000
       X(I) = X(I) + Y(I)*A(I)
    END DO

Unrolled by 4 compiles as if it were written:

    DO I=1, 19997,4
       TEMP1 = X(I) + Y(I)*A(I)
       TEMP2 = X(I+1) + Y(I+1)*A(I+1)
       TEMP3 = X(I+2) + Y(I+2)*A(I+2)
       X(I+3) = X(I+3) + Y(I+3)*A(I+3)
       X(I) = TEMP1
       X(I+1) = TEMP2
       X(I+2) = TEMP3
    END DO

This example shows a simple loop with a fixed loop count. The restructuring is more complex with variable loop counts.
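For a variable loop count, one common hand-written form is an unrolled main loop followed by a cleanup loop for the leftover iterations. The sketch below shows that form under the assumption that the limit n is known only at run time; it is not necessarily the code the compiler generates, and the names unroll_demo, nmain, and xref are hypothetical.

```fortran
program unroll_demo
  implicit none
  integer, parameter :: nmax = 1003
  double precision :: x(nmax), y(nmax), a(nmax), xref(nmax)
  integer :: i, n, nmain

  n = nmax                      ! stands in for a limit known only at run time
  do i = 1, n
     x(i) = dble(i)
     y(i) = 2.0d0
     a(i) = 0.5d0
  end do
  xref(1:n) = x(1:n) + y(1:n)*a(1:n)   ! reference result for comparison

  nmain = n - mod(n, 4)         ! largest multiple of 4 not exceeding n
  do i = 1, nmain, 4            ! unrolled body, four elements per trip
     x(i)   = x(i)   + y(i)*a(i)
     x(i+1) = x(i+1) + y(i+1)*a(i+1)
     x(i+2) = x(i+2) + y(i+2)*a(i+2)
     x(i+3) = x(i+3) + y(i+3)*a(i+3)
  end do
  do i = nmain + 1, n           ! cleanup loop for the 0 to 3 leftover elements
     x(i) = x(i) + y(i)*a(i)
  end do

  print *, 'max difference:', maxval(abs(x(1:n) - xref(1:n)))
end program unroll_demo
```

The extra bookkeeping for the remainder is part of why unrolling pays off mainly for loops with long iteration counts.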

9.1.1.9 `-xtarget=platform`

The performance of some programs might improve if the compiler has an accurate description of the target computer hardware. When program performance is critical, the proper specification of the target hardware could be very important. This is especially true when running on the newer SPARC processors. However, for most programs and older SPARC processors, the performance gain could be negligible and a generic specification might be sufficient.

The Fortran User’s Guide lists all the system names recognized by -xtarget=. For any given system name (for example, ultra2, for UltraSPARC-II), -xtarget expands into a specific combination of -xarch, -xcache, and -xchip that properly matches that system. The optimizer uses these specifications to determine strategies to follow and instructions to generate.

The special setting -xtarget=native enables the optimizer to compile code targeted at the host system (the system doing the compilation). This is obviously useful when compilation and execution are done on the same system. When the execution system is not known, it is desirable to compile for a generic architecture. Therefore, -xtarget=generic is the default, even though it might produce suboptimal performance.

UltraSPARC-III and UltraSPARC-IV Support

Both the -xtarget and -xchip flags accept ultra3 and ultra3 variants and will generate optimized code for UltraSPARC-III and UltraSPARC-IV processors. When compiling and running an application on the latest UltraSPARC platforms, specify the -fast flag to automatically select the proper compiler optimization options for that platform.

For cross-compilations (compiling on a platform other than the latest UltraSPARC platforms but generating binaries intended to run on an UltraSPARC-III processor), use these flags:

-fast -xtarget=ultra3

Use -m64 to compile for 64-bit code generation.

See the Fortran User’s Guide for a list of -xtarget flags for the latest UltraSPARC processors.

Performance profiling, with -xprofile=collect: and -xprofile=use:, is particularly effective on the UltraSPARC-III and UltraSPARC-IV platforms because it allows the compiler to identify the most frequently executed sections of the program and perform localized optimizations to best advantage.

64-Bit x86 Platform Support

The Sun Studio Fortran compiler supports the compilation of 32-bit and 64-bit code for Solaris and Linux x86 platforms.

The -xtarget=pentium3 flag expands to: -xarch=sse -xchip=pentium3 -xcache=16/32/4:256/32/4.

For Pentium 4 systems, -xtarget=pentium4 expands to: -xarch=sse2 -xchip=pentium4 -xcache=8/64/4:256/128/8.

A new -m64 option specifies compilation for the 64-bit x64 instruction set.

A new -xtarget option, -xtarget=opteron, specifies the -xarch, -xchip, and -xcache settings for 32-bit AMD compilation.

You must specify -m64 after -fast and -xtarget on the command line to generate 64-bit code. The -xtarget option does not automatically generate 64-bit code, and neither does -fast, because it is a macro that also defines an -xtarget value. All the current -xtarget values (except -xtarget=native64 and -xtarget=generic64) result in 32-bit code, so it is necessary to specify -m64 after (to the right of) -fast or -xtarget to compile 64-bit code, as in:

% f95 -fast -m64
or
% f95 -xtarget=opteron -m64

The compilers now predefine __amd64 and __x86_64 when you specify -xarch=amd64.

Additional information about compilation and performance on 32-bit and 64-bit x86 platforms can be found in the Fortran User’s Guide.

9.1.1.10 Interprocedural Optimization With `-xipo`

This new f95 compiler flag, introduced with the release of Forte Developer 6 update 2, performs whole-program optimizations by invoking an interprocedural analysis pass. Unlike -xcrossfile, -xipo optimizes across all object files at the link step and is not limited to just the source files on the compile command.

-xipo is particularly useful when compiling and linking large multi-file applications. Object files compiled with -xipo have analysis information saved within them. This enables interprocedural analysis across source and pre-compiled program files.

For details on how to use interprocedural analysis effectively, see the Fortran User’s Guide.

9.1.1.11 Add `PRAGMA ASSUME` Assertions

By adding ASSUME directives at strategic points in the source code, you can help guide the compiler’s optimization strategy by revealing important information about the program that is not determinable any other way. For example, you can let the compiler know that the trip count of a DO loop is always greater than a value, or that there is a high probability that an IF branch will not be taken. The compiler can use these assertions to generate better code.

As an added bonus, the programmer can use the ASSUME pragma to validate the execution of the program by enabling warning messages to be issued whenever an assertion turns out to be false at run time.

For details, see the description of the ASSUME pragma in Chapter 2 of the Fortran User’s Guide, and the -xassume_control compiler command-line option in Chapter 3 of that manual.