Fortran Programming Guide

Essential Concepts

Parallelizing (or multithreading) an application recasts the compiled program to run on a multiprocessor system. Parallelization enables single tasks, such as a DO loop, to run over multiple processors with a potentially significant execution speedup.

Before an application program can be run efficiently on a multiprocessor system like the Ultra(TM) 60, Enterprise(TM) 450, or Ultra HPC 1000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations.

Multithreading an application can be done manually by making appropriate calls to the libthread primitives. However, a significant amount of analysis and reprogramming might be required. (See the Solaris Multithreaded Programming Guide for more information.)

Sun compilers can automatically generate multithreaded object code to run on multiprocessor systems. The Fortran compilers focus on DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program.

The choice of which loops to parallelize and how to distribute them can be left entirely up to the compiler (-autopar), determined explicitly by the programmer with source code directives (-explicitpar), or done in combination (-parallel).


Note -

Programs that do their own (explicit) thread management should not be compiled with any of the compiler's parallelization options. Explicit multithreading (calls to libthread primitives) cannot be combined with routines compiled with these parallelization options.


Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run more slowly when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.

Only explicit Fortran 90 DO loops are candidates for parallelization with f90.

Sun compilers can detect loops that might be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the -loopinfo option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.

Speedups--What to Expect

If you parallelize a program so that it runs over four processors, can you expect it to take (roughly) one fourth the time that it did with a single processor (a fourfold speedup)?

Probably not. It can be shown (by Amdahl's law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel. This is true no matter how many processors are applied. In fact, if c is the percentage of the execution time run in parallel, the theoretical speedup limit is 100/(100-c); therefore, if only 60% of a program runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. And with just four processors, the theoretical speedup for this program (assuming maximum efficiency) would be just 1.8 and not 4. With overhead, the actual speedup would be less.
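In symbols, writing f for the fraction of execution time that runs in parallel and n for the number of processors, the ideal speedup and the figures above work out as follows (a sketch of the arithmetic, with f = 0.6):

   S(n) = \frac{1}{(1 - f) + f/n}

   S(4) = \frac{1}{0.4 + 0.6/4} = \frac{1}{0.55} \approx 1.8

   \lim_{n \to \infty} S(n) = \frac{1}{1 - f} = \frac{1}{0.4} = 2.5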

As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the runtime must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.

Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).
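For example, in a matrix-multiply nest like the following (a sketch; the arrays are illustrative), the total work grows as the cube of N, while the iterations of the outer I loop are independent of one another, each updating its own row of C:

   DO I = 1, N
      DO J = 1, N
         DO K = 1, N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
         END DO
      END DO
   END DO

Parallelizing the I loop alone distributes the entire cubic workload over the available processors.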

Steps to Parallelizing a Program

Here is a very general outline of the steps needed to parallelize an application:

  1. Optimize. Use the appropriate set of compiler options to get the best serial performance on a single processor.

  2. Profile. Using typical test data, determine the performance profile of the program. Identify the most significant loops.

  3. Benchmark. Determine that the serial test results are accurate. Use these results and the performance profile as the benchmark.

  4. Parallelize. Use a combination of options and directives to compile and build a parallelized executable. (See the example following this list.)

  5. Verify. Run the parallelized program on a single processor and check results to find instabilities and programming errors that might have crept in.

  6. Test. Make various runs on several processors to check results.

  7. Benchmark. Make performance measurements with various numbers of processors on a dedicated system. Measure performance changes with changes in problem size (scalability).

  8. Repeat steps 4 to 7. Make improvements to the parallelization scheme based on performance.
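For example, the build in step 4 and the runs in steps 5 and 6 might look like the following (a sketch only; the file name prog.f and the processor count are placeholders):

demo% f77 -O4 -parallel -reduction -loopinfo -stackvar -o prog prog.f
demo% setenv PARALLEL 1        Verify results on a single processor
demo% prog
demo% setenv PARALLEL 4        Then test with multiple threads
demo% prog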

Data Dependency Issues

Not all loops are parallelizable. Running a loop in parallel over a number of processors may result in iterations executing out of order. Or the multiple processors executing the loop in parallel may interfere with each other. These situations arise whenever there are data dependencies in the loop.

Recurrence

Variables that are set in one iteration of a loop and used in a subsequent iteration introduce cross-iteration dependencies, or recurrences. Recurrence in a loop requires that the iterations be executed in the proper order. For example:


   DO I=2,N
      A(I) = A(I-1)*B(I)+C(I)
   END DO

requires the value computed for A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. For a run over multiple processors to produce the same results as a single-processor run, iteration I must complete before iteration I+1 can execute.
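The compiler will not parallelize such a loop automatically. With Sun-style directives (see Table 10-2), you could also mark it explicitly serial to document the dependence. A minimal sketch:

C$PAR DOSERIAL
   DO I=2,N
      A(I) = A(I-1)*B(I)+C(I)
   END DO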

Reduction

Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:


   DO K = 1,N
     SUM = SUM + A(K)*B(K)
   END DO

If each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, each overwriting the value in SUM. For the result to be correct, the processors must perform their updates of SUM one at a time, although the order in which they do so is not significant.

Certain common reduction operations are recognized and handled as special cases by the compiler.
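The summation above is one such case. As Table 10-1 shows, reduction recognition is enabled by adding -reduction to -autopar or -parallel; for example (the file name is a placeholder):

demo% f77 -O4 -autopar -reduction -loopinfo sum.f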

Indirect Addressing

Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known until run time. For example, indirect addressing could be order dependent if there are repeated values in the index array:


   DO L = 1,NW
     A(ID(L)) = A(L) + B(L)
   END DO

In the preceding, repeated values in ID would cause an element of A to be stored more than once. In the serial case, the last store determines its final value; in the parallel case, the order of the stores is not determined. Also, whether an iteration reads an old or an updated value of A(L) depends on the order of execution.
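If you can guarantee that the stores never collide, you can accept responsibility and force such a loop parallel with a directive. A sketch, assuming a variant in which the right-hand side does not read A and ID is known to contain no repeated values (the array C is illustrative):

C$PAR DOALL
   DO L = 1,NW
     A(ID(L)) = B(L) + C(L)
   END DO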

Data Dependent Loops

You might be able to rewrite a loop to eliminate data dependencies, making it parallelizable. However, extensive restructuring could be needed.

As a general rule, a loop can run safely in parallel only if its iterations are independent: no iteration may use a value that another iteration computes (recurrence), and no two iterations may store into the same location (as with reductions or repeated index values).

These are general conditions for parallelization. The compilers' automatic parallelization analysis considers additional criteria when deciding whether to parallelize a loop. You can, however, use directives to explicitly force a loop to be parallelized, even a loop containing inhibitors; doing so can produce incorrect results.

Parallel Options and Directives Summary

The following table shows the f77 5.0 and f90 2.0 compilation options related to parallelization.

Table 10-1 Parallelization Options

Option                                    Flag
Automatic (only)                          -autopar
Automatic and Reduction                   -autopar -reduction
Explicit (only)                           -explicitpar
Automatic and Explicit                    -parallel
Automatic and Reduction and Explicit      -parallel -reduction
Show which loops are parallelized         -loopinfo
Show warnings with explicit               -vpara
Allocate local variables on stack         -stackvar
Use Sun-style MP directives               -mp=sun
Use Cray-style MP directives              -mp=cray

Note that -loopinfo lists both the loops that were parallelized and those that were not, and that -stackvar, while recommended with any parallelization, is required when explicitly parallelized loops contain subprogram calls (see "Stacks, Stack Sizes, and Parallelization" later in this chapter).

The following table shows the parallel directives accepted by f77 and f90.

Table 10-2 Parallel Directives

Parallel Directive                     Purpose
C$PAR TASKCOMMON                       Declares a common block private
C$PAR DOALL [optional qualifiers]      Parallelizes next loop, if possible
C$PAR DOSERIAL                         Inhibits parallelization of next loop
C$PAR DOSERIAL*                        Inhibits parallelization of loop nest
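As a sketch of how these appear in source (the arrays are illustrative), a directive is placed on the line immediately before the loop it controls. The iterations of this loop are independent, so forcing it parallel is safe:

C$PAR DOALL
   DO I = 1, N
      A(I) = B(I) + C(I)
   END DO

Directives take effect only when the program is compiled with -explicitpar or -parallel.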

Number of Processors

The PARALLEL environment variable controls the maximum number of processors available to the program. The following example shows how to set it:


demo% setenv PARALLEL 4        C shell
          -or-
demo$ PARALLEL=4               Bourne/Korn shell
demo$ export PARALLEL

In this example, setting PARALLEL to four enables the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads could run on the same processor as others, possibly degrading performance.

The SunOS command psrinfo(1M) displays a list of the processors available on a system:


demo% psrinfo
0      on-line   since 03/18/96 15:51:03
1      on-line   since 03/18/96 15:51:03
2      on-line   since 03/18/96 15:51:03
3      on-line   since 03/18/96 15:51:03

Stacks, Stack Sizes, and Parallelization

The executing program maintains a main memory stack for the parent program and distinct stacks for each thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations.

The default size of the main stack is about 8 megabytes. The Fortran compilers normally allocate local variables and arrays as STATIC (not on the stack). However, the -stackvar option forces allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of -stackvar is recommended with parallelization because it improves the optimizer's ability to parallelize CALLs in loops. -stackvar is required with explicitly parallelized loops containing subprogram calls. (See the discussion of -stackvar in the Fortran User's Guide.)
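For example, an explicitly parallelized program containing CALLs inside its parallel loops might be built as follows (a sketch; the file name is a placeholder):

demo% f77 -O3 -explicitpar -stackvar -o prog prog.f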

With the C shell, the limit command displays the current main stack size and can also set it; with the Bourne or Korn shell, use ulimit:


demo% limit             C shell example
cputime       unlimited
filesize       unlimited
datasize       2097148 kbytes
stacksize       8192 kbytes            <- current main stack size
coredumpsize       0 kbytes
descriptors       64 
memorysize       unlimited
demo% limit stacksize 65536       <- set main stack to 64Mb
demo% limit stacksize
stacksize       65536 kbytes


demo$ ulimit -a          Korn Shell example
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         2097148
stack(kbytes)        8192
coredump(blocks)     0
nofiles(descriptors) 64
vmemory(kbytes)      unlimited
demo$ ulimit -s 65536
demo$ ulimit -s
65536

Each thread of a multithreaded program has its own thread stack. This stack mimics the main program stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 256 kilobytes. The size is set with the STACKSIZE environment variable:


demo% setenv STACKSIZE 8192    <- Set thread stack size to 8 Mb    C shell
          -or-
demo$ STACKSIZE=8192           Bourne/Korn shell
demo$ export STACKSIZE

Most parallelized Fortran codes require a thread stack size larger than the default. However, it may not be possible to know how large to set it except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program aborts with a segmentation fault.