Sun Studio 12: Fortran Programming Guide

10.1 Essential Concepts

Parallelizing (or multithreading) an application means compiling the program to run on a multiprocessor system or in a multithreaded environment. Parallelization enables a single task, such as a DO loop, to run over multiple processors (or threads), with a potentially significant execution speedup.

Before an application program can be run efficiently on a multiprocessor system like the Ultra™ 60, Sun Enterprise™ Server 6500, or Sun Enterprise Server 10000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations across multiple processors or threads.

Multithreading an application can be done manually by making appropriate calls to the libthread primitives. However, a significant amount of analysis and reprogramming might be required. (See the Solaris Multithreaded Programming Guide for more information.)

Sun compilers can automatically generate multithreaded object code to run on multiprocessor systems. The Fortran compilers focus on DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program.

The choice of which loops to parallelize and how to distribute them can be left entirely up to the compiler (-autopar), specified explicitly by the programmer with source code directives (-explicitpar), or done in combination (-parallel).

Note –

Programs that do their own (explicit) thread management should not be compiled with any of the compiler’s parallelization options. Explicit multithreading (calls to libthread primitives) cannot be combined with routines compiled with these parallelization options.

Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run more slowly when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.

Implicit loops (IF loops and Fortran 95 array syntax, for example) as well as explicit DO loops are candidates for automatic parallelization by the Fortran compilers.

f95 can detect loops that might be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the -loopinfo option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.

The Fortran 95 compiler provides explicit parallelization by implementing the OpenMP 2.0 Fortran API directives. For legacy programs, f95 also accepts the older Sun and Cray style directives, but use of these directives is now deprecated. OpenMP has become an informal standard for explicit parallelization in Fortran 95, C, and C++ and is recommended over the older directive styles.

For information on OpenMP, see the OpenMP API User’s Guide, or the OpenMP web site at http://www.openmp.org.

10.1.1 Speedups—What to Expect

If you parallelize a program so that it runs over four processors, can you expect it to take (roughly) one fourth the time that it did with a single processor (a fourfold speedup)?

Probably not. It can be shown (by Amdahl’s law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel. This is true no matter how many processors are applied. In fact, if p is the percentage of the total program execution time that runs in parallel mode, the theoretical speedup limit is 100/(100–p); therefore, if only 60% of a program’s execution runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. And with just four processors, the theoretical speedup for this program (assuming maximum efficiency) would be just 1.8 and not 4. With overhead, the actual speedup would be less.
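The figures quoted above can be checked against the usual statement of Amdahl's law. In the sketch below, f = p/100 is the parallel fraction and N is the number of processors; both symbols are introduced here for illustration only.

```latex
\begin{align*}
S(N) &= \frac{1}{(1 - f) + \dfrac{f}{N}},
&
\lim_{N \to \infty} S(N) &= \frac{1}{1 - f} = \frac{100}{100 - p} \\[6pt]
S(4)\big|_{f = 0.6} &= \frac{1}{0.4 + 0.15} = \frac{1}{0.55} \approx 1.82,
&
\lim_{N \to \infty} S(N)\big|_{f = 0.6} &= \frac{1}{0.4} = 2.5
\end{align*}
```

The serial 40% of the run time is untouched by adding processors, which is why the limit stays at 2.5 however large N becomes.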

As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the runtime must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.

Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).

10.1.2 Steps to Parallelizing a Program

Here is a very general outline of the steps needed to parallelize an application:

1. Optimize. Use the appropriate set of compiler options to get the best serial performance on a single processor.
2. Profile. Using typical test data, determine the performance profile of the program. Identify the most significant loops.
3. Benchmark. Determine that the serial test results are accurate. Use these results and the performance profile as the benchmark.
4. Parallelize. Use a combination of options and directives to compile and build a parallelized executable.
5. Verify. Run the parallelized program on a single processor and single thread and check results to find instabilities and programming errors that might have crept in. (Set $PARALLEL or $OMP_NUM_THREADS to 1; see 10.1.5 Number of Threads.)
6. Test. Make various runs on several processors to check results.
7. Benchmark. Make performance measurements with various numbers of processors on a dedicated system. Measure performance changes with changes in problem size (scalability).
8. Repeat steps 4 to 7. Make improvements to your parallelization scheme based on performance.

10.1.3 Data Dependence Issues

Not all loops are parallelizable. Running a loop in parallel over a number of processors usually results in iterations executing out of order. Moreover, the multiple processors executing the loop in parallel may interfere with each other whenever there are data dependencies in the loop.

Situations where data dependence issues arise include recurrence, reduction, indirect addressing, and data dependent loop iterations.

10.1.3.1 Data Dependent Loops

You might be able to rewrite a loop to eliminate data dependencies, making it parallelizable. However, extensive restructuring could be needed.

Some general rules are:

A loop is data independent only if all iterations write to distinct memory locations.
Iterations may read from the same locations as long as no iteration writes to them.

These are general conditions for parallelization. The compilers’ automatic parallelization analysis considers additional criteria when deciding whether to parallelize a loop. However, you can use directives to explicitly force loops to be parallelized, even loops that contain inhibitors and would therefore produce incorrect results when run in parallel.
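As an illustration of a loop that meets both rules, here is a minimal sketch. The module and routine names (INDEP_DEMO, VADD) are hypothetical and chosen for this example only. Each iteration writes only C(I) and reads only A(I) and B(I), so no iteration touches a location that another iteration writes. The OpenMP PARALLEL DO directive marks the loop explicitly; a compiler that is not processing OpenMP directives treats those lines as comments and runs the loop serially.

```fortran
      MODULE INDEP_DEMO
      IMPLICIT NONE
      CONTAINS

!     Hypothetical example routine: element-wise sum C = A + B.
!     Iteration I writes C(I) only and reads A(I) and B(I) only.
      SUBROUTINE VADD(A, B, C, N)
      INTEGER, INTENT(IN) :: N
      REAL, INTENT(IN) :: A(N), B(N)
      REAL, INTENT(OUT) :: C(N)
      INTEGER :: I
!$OMP PARALLEL DO
      DO I = 1, N
         C(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO
      END SUBROUTINE VADD

      END MODULE INDEP_DEMO
```

Because the iterations are independent, the result is the same whatever order they execute in, and whether or not the directive is honored.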

10.1.3.2 Recurrence

Variables that are set in one iteration of a loop and used in a subsequent iteration introduce cross-iteration dependencies, or recurrences. Recurrence in a loop requires that the iterations be executed in the proper order. For example:

   DO I=2,N
      A(I) = A(I-1)*B(I)+C(I)
   END DO

requires the value computed for A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. To produce correct results, iteration I must complete before iteration I+1 can execute.
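The order sensitivity can be observed without any threads at all. The sketch below runs the same recurrence once in serial order and once with the iterations reversed, using reversal as a simple stand-in for the unpredictable ordering of a parallel run (that substitution is an assumption of this example). The names RECUR_DEMO, RECUR_FORWARD, and RECUR_REVERSED are hypothetical.

```fortran
      MODULE RECUR_DEMO
      IMPLICIT NONE
      CONTAINS

!     Serial order: iteration I uses the A(I-1) written by I-1.
      SUBROUTINE RECUR_FORWARD(A, B, C, N)
      INTEGER, INTENT(IN) :: N
      REAL, INTENT(INOUT) :: A(N)
      REAL, INTENT(IN) :: B(N), C(N)
      INTEGER :: I
      DO I = 2, N
         A(I) = A(I-1)*B(I) + C(I)
      END DO
      END SUBROUTINE RECUR_FORWARD

!     Reversed order: iteration I uses the old, not yet updated,
!     value of A(I-1), so the results differ from the serial run.
      SUBROUTINE RECUR_REVERSED(A, B, C, N)
      INTEGER, INTENT(IN) :: N
      REAL, INTENT(INOUT) :: A(N)
      REAL, INTENT(IN) :: B(N), C(N)
      INTEGER :: I
      DO I = N, 2, -1
         A(I) = A(I-1)*B(I) + C(I)
      END DO
      END SUBROUTINE RECUR_REVERSED

      END MODULE RECUR_DEMO
```

Starting from A = (1, 0, 0, 0) with every B(I) = 2 and every C(I) = 1, the forward routine leaves A = (1, 3, 7, 15), while the reversed routine leaves A = (1, 3, 1, 1).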

10.1.3.3 Reduction

Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:

   DO K = 1,N
     SUM = SUM + A(K)*B(K)
   END DO

If each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, overwriting the value in SUM. For this to work, the processors must update SUM one at a time, although the order in which they do so is not significant.

Certain common reduction operations are recognized and handled as special cases by the compiler.
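When a loop is parallelized explicitly, the OpenMP REDUCTION clause is the standard way to express a summation of this kind. The following is a sketch only: the names REDUCE_DEMO, DOT_SUM, and TOTAL are hypothetical. The clause gives each thread a private copy of TOTAL and combines the copies when the loop ends, so the threads never update a shared variable at the same time.

```fortran
      MODULE REDUCE_DEMO
      IMPLICIT NONE
      CONTAINS

!     Hypothetical example: sum of the products A(K)*B(K).
      FUNCTION DOT_SUM(A, B, N) RESULT(S)
      INTEGER, INTENT(IN) :: N
      REAL, INTENT(IN) :: A(N), B(N)
      REAL :: S
      REAL :: TOTAL
      INTEGER :: K
      TOTAL = 0.0
!     Each thread accumulates into its own private TOTAL; the
!     private copies are added together at the end of the loop.
!$OMP PARALLEL DO REDUCTION(+:TOTAL)
      DO K = 1, N
         TOTAL = TOTAL + A(K)*B(K)
      END DO
!$OMP END PARALLEL DO
      S = TOTAL
      END FUNCTION DOT_SUM

      END MODULE REDUCE_DEMO
```

Because floating-point addition is not associative, a parallel reduction can differ from the serial sum in the last few digits; this is roundoff, not a data race.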

10.1.3.4 Indirect Addressing

Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known. For example, indirect addressing could be order dependent if there are repeated values in the index array:

   DO L = 1,NW
     A(ID(L)) = A(L) + B(L)
   END DO

In the example, repeated values in ID cause elements in A to be overwritten. In the serial case, the last store is the final value. In the parallel case, the order is not determined. The values of A(L) that are used, old or updated, are order dependent.
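One check a programmer might make before forcing such a loop to run in parallel is whether the index array contains repeated values. The helper below is a hypothetical sketch (the names INDEX_DEMO and HAS_DUPLICATES are not part of any Sun library) and uses a simple quadratic scan. Note that it tests only the stores: even when ID has no duplicates, the loop above still reads A(L), which another iteration may write through A(ID(L)), so a clean result is necessary but not sufficient for safe parallelization.

```fortran
      MODULE INDEX_DEMO
      IMPLICIT NONE
      CONTAINS

!     Hypothetical helper: .TRUE. if any value occurs more than
!     once in ID(1:NW), meaning two iterations would store into
!     the same element.
      FUNCTION HAS_DUPLICATES(ID, NW) RESULT(DUP)
      INTEGER, INTENT(IN) :: NW
      INTEGER, INTENT(IN) :: ID(NW)
      LOGICAL :: DUP
      INTEGER :: L, M
      DUP = .FALSE.
      DO L = 1, NW - 1
         DO M = L + 1, NW
            IF (ID(L) == ID(M)) DUP = .TRUE.
         END DO
      END DO
      END FUNCTION HAS_DUPLICATES

      END MODULE INDEX_DEMO
```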

10.1.4 Compiling for Parallelization

The Sun Studio compilers support the OpenMP parallelization model natively as the primary parallelization model. For information on OpenMP parallelization, see the OpenMP API User’s Guide. Sun and Cray-style parallelization refer to legacy applications and are no longer supported by current Sun Studio compilers.

Table 10–1 Fortran 95 Parallelization Options


Option	Flag
Automatic (only)	`-autopar`
Automatic and Reduction	`-autopar -reduction`
Show which loops are parallelized	`-loopinfo`
Show warnings with explicit parallelization	`-vpara`
Allocate local variables on stack	`-stackvar`
Compile for OpenMP parallelization	`-xopenmp`

Notes on these options:

Many of these options have equivalent synonyms, such as -autopar and -xautopar. Either may be used.
The compiler prof/gprof profiling options -p, -xpg, and -pg should not be used along with any of the parallelization options. The runtime support for these profiling options is not thread-safe. Invalid results or a segmentation fault could occur at runtime.
-reduction requires -autopar.
-autopar includes -depend and loop structure optimization.
-noautopar, -noreduction are the negations.
Parallelization options can be in any order, but they must be all lowercase.
Reduction operations are not analyzed in explicitly parallelized loops.
-xopenmp also invokes -stackvar automatically.
The options -loopinfo and -vpara must be used in conjunction with one of the parallelization options.

10.1.5 Number of Threads

The PARALLEL (or OMP_NUM_THREADS) environment variable tells the runtime system the maximum number of threads the program can use. The default is 1. In general, set PARALLEL or OMP_NUM_THREADS to the number of available virtual processors on the target platform.

The following example shows how to set it:

demo% setenv OMP_NUM_THREADS 4       C shell

-or-

demo$ OMP_NUM_THREADS=4               Bourne/Korn shell
demo$ export OMP_NUM_THREADS

In this example, setting OMP_NUM_THREADS to 4 enables the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads could run on the same processor as others, possibly degrading performance.
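A program compiled for OpenMP can confirm the thread limit it was given. The sketch below assumes the OMP_LIB module and the OMP_GET_MAX_THREADS function of the OpenMP Fortran API; the names THREADS_DEMO and MAX_THREADS are hypothetical. Lines that begin with the !$ sentinel are OpenMP conditional compilation: they are compiled only when OpenMP processing is enabled, so without it the function simply returns 1.

```fortran
      MODULE THREADS_DEMO
!$    USE OMP_LIB
      IMPLICIT NONE
      CONTAINS

!     Hypothetical helper: the maximum number of threads the
!     OpenMP runtime would use for a parallel region, or 1 when
!     the program is compiled without OpenMP.
      FUNCTION MAX_THREADS() RESULT(NT)
      INTEGER :: NT
      NT = 1
!$    NT = OMP_GET_MAX_THREADS()
      END FUNCTION MAX_THREADS

      END MODULE THREADS_DEMO
```

With OMP_NUM_THREADS set to 4 as shown above and the program compiled for OpenMP, MAX_THREADS() would be expected to return 4.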

The SunOS™ operating system command psrinfo(1M) displays a list of the processors available on a system:

demo% psrinfo
0      on-line   since 03/18/2007 15:51:03
1      on-line   since 03/18/2007 15:51:03
2      on-line   since 03/18/2007 15:51:03
3      on-line   since 03/18/2007 15:51:03

10.1.6 Stacks, Stack Sizes, and Parallelization

The executing program maintains a main memory stack for the initial thread executing the program, as well as distinct stacks for each helper thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations.

The default size of the main stack is about 8 megabytes. The Fortran compilers normally allocate local variables and arrays as STATIC (not on the stack). However, the -stackvar option forces the allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of -stackvar is recommended with parallelization because it improves the optimizer’s ability to parallelize subprogram calls in loops. -stackvar is required with explicitly parallelized loops containing subprogram calls. (See the discussion of -stackvar in the Fortran User’s Guide.)

Using the C shell (csh), the limit command displays the current main stack size as well as sets it:

demo% limit             C shell example
cputime       unlimited
filesize       unlimited
datasize       2097148 kbytes
stacksize       8192 kbytes            <- current main stack size
coredumpsize       0 kbytes
descriptors       64
memorysize       unlimited
demo% limit stacksize 65536       <- set main stack to 64 MB
demo% limit stacksize
stacksize       65536 kbytes

With Bourne or Korn shells, the corresponding command is ulimit:

demo$ ulimit -a         Korn Shell example
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         2097148
stack(kbytes)        8192
coredump(blocks)     0
nofiles(descriptors) 64
vmemory(kbytes)      unlimited
demo$ ulimit -s 65536
demo$ ulimit -s
65536

Each helper thread of a multithreaded program has its own thread stack. This stack mimics the initial thread stack but is unique to the thread. The thread’s PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 8 megabytes on 64-bit SPARC and 64-bit x86 platforms, and 4 megabytes otherwise. The size is set, in kilobytes, with the STACKSIZE environment variable:

demo% setenv STACKSIZE 8192    <- Set thread stack size to 8 MB    C shell
                          -or-
demo$ STACKSIZE=8192           Bourne/Korn Shell
demo$ export STACKSIZE

Setting the thread stack size to a value larger than the default may be necessary for some parallelized Fortran codes. However, it may not be possible to know just how large it should be, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.