Fortran Programming Guide |
SPARC: Parallelization
This chapter presents an overview of multiprocessor parallelization and describes the capabilities of Sun's Fortran compilers. Implementation differences between
f77
andf95
are noted.
Note Fortran parallelization features require a Sun WorkShop HPC license.
Essential Concepts
Parallelizing (or multithreading) an application compiles the program to run on a multiprocessor system or in a multithreaded environment. Parallelization enables a single task, such as a DO loop, to run over multiple processors (or threads) with a potentially significant execution speedup.
Before an application program can be run efficiently on a multiprocessor system like the UltraTM 60, Sun EnterpriseTM Server 6500, or Sun Enterprise Server 10000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations across multiple processors or threads.
Multithreading an application can be done manually by making appropriate calls to the
libthread
primitives. However, a significant amount of analysis and reprogramming might be required. (See the Solaris Multithreaded Programming Guide for more information.)Sun compilers can automatically generate multithreaded object code to run on multiprocessor systems. The Fortran compilers focus on DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program.
The choice of which loops to parallelize and how to distribute them can be left entirely up to the compiler (
-autopar
), specified explicitly by the programmer with source code directives (-explicitpar
), or done in combination (-parallel)
.
Note Programs that do their own (explicit) thread management should not be compiled with any of the compiler's parallelization options. Explicit multithreading (calls tolibthread
primitives) cannot be combined with routines compiled with these parallelization options.
Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run more slowly when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.
Implicit loops (
IF
loops and Fortran 95 array syntax, for example) as well as explicitDO
loops are candidates for automatic parallelization by the Fortran compilers.Sun WorkShop compilers can detect loops that might be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the
-loopinfo
option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.Both
f77
andf95
support two styles of explicit parallization directives--Sun style and Cray style. In addition,f95
supports the OpenMP 1.1 directives and runtime library routines. Explicit parallelization in Fortran is described on page 159.Speedups--What to Expect
If you parallelize a program so that it runs over four processors, can you expect it to take (roughly) one fourth the time that it did with a single processor (a fourfold speedup)?
Probably not. It can be shown (by Amdahl's law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel. This is true no matter how many processors are applied. In fact, if p is the percentage of the total program execution time that runs in parallel mode, the theoretical speedup limit is 100/(100-p); therefore, if only 60% of a program's execution runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. And with just four processors, the theoretical speedup for this program (assuming maximum efficiency) would be just 1.8 and not 4. With overhead, the actual speedup would be less.
As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the runtime must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.
Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).
Steps to Parallelizing a Program
Here is a very general outline of the steps needed to parallelize an application:
- Optimize. Use the appropriate set of compiler options to get the best serial performance on a single processor.
- Profile. Using typical test data, determine the performance profile of the program. Identify the most significant loops.
- Benchmark. Determine that the serial test results are accurate. Use these results and the performance profile as the benchmark.
- Parallelize. Use a combination of options and directives to compile and build a parallelized executable.
- Verify. Run the parallelized program on a single processor and single thread and check results to find instabilities and programming errors that might have crept in.(Set
$PARALLEL
or$OMB_NUM_THREADS
to 1; see page 151)- Test. Make various runs on several processors to check results.
- Benchmark. Make performance measurements with various numbers of processors on a dedicated system. Measure performance changes with changes in problem size (scalability).
- Repeat steps 4 to 7. Make improvements to your parallelization scheme based on performance.
Data Dependency Issues
Not all loops are parallelizable. Running a loop in parallel over a number of processors usually results in iterations executing out of order. Moreover, the multiple processors executing the loop in parallel may interfere with each other whenever there are data dependencies in the loop.
Situations where data depencency issues arise include recurrence, reduction, indirect addressing, and data dependent loop iterations.
Recurrence
Variables that are set in one iteration of a loop and used in a subsequent iteration introduce cross-iteration dependencies, or recurrences. Recurrence in a loop requires that the iterations to be executed in the proper order. For example:
DO I=2,NA(I) = A(I-1)*B(I)+C(I)END DOrequires the value computed for A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. To produce correct results, iteration I must complete before iteration I+1 can execute.
Reduction
Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:
DO K = 1,NSUM = SUM + A(I)*B(I)END DOIf each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, overwriting the value in SUM. For this to work, each processor must execute the summation one at a time, although the order is not significant.
Certain common reduction operations are recognized and handled as special cases by the compiler.
Indirect Addressing
Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known. For example, indirect addressing could be order dependent if there are repeated values in the index array:
DO L = 1,NWA(ID(L)) = A(L) + B(L)END DOIn the example, repeated values in ID cause elements in A to be overwritten. In the serial case, the last store is the final value. In the parallel case, the order is not determined. The values of A(L) that are used, old or updated, are order dependent.
Data Dependent Loops
You might be able to rewrite a loop to eliminate data dependencies, making it parallelizable. However, extensive restructuring could be needed.
- A loop is data independent only if all iterations write to distinct memory locations.
- Iterations may read from the same locations as long as no one iteration writes to them.
These are general conditions for parallelization. The compilers' automatic parallelization analysis considers additional criteria when deciding whether to parallelize a loop. However, you can use directives to explicitly force loops to be parallelized, even loops that contain inhibitors and produce incorrect results.
Parallel Options and Directives Summary
The following table shows the Sun WorkShop 6
f77
andf95
compilation options related to parallelization.
-reduction
requires-autopar
.-autopar
includes-depend
and loop structure optimization.-parallel
is equivalent to-autopar -explicitpar
.
-noautopar, -noexplicitpar, -noreduction
are the negations.- Parallelization options can be in any order, but they must be all lowercase.
- Reduction operations are not analyzed for explicitly parallelized loops.
- Use of any of the parallelization options requires a Sun WorkShop HPC license.
-openmp
is a macro for the combination of options:-mp=openmp -stackvar -explicitpar
- The options
-loopinfo
,-vpara
, and-mp
must be used in conjunction with one of the parallelization options-autopar
,-explicitpar
, or-parallel
.The following table summarizes the
f77
andf95
Sun-style parallel directives.
Cray-style directives are similar (see page 176), but use a
CMIC$
sentinel instead ofC$PAR
, and with different optional qualifiers on theDOALL
directive. Use of these directives is explained in the section, Explicit Parallelization. Appendix E of the Fortran User's Guide gives a detailed summary of all Fortran directives, including these and Fortran 95 OpenMP.Number of Threads
The
PARALLEL
(orOMP_NUM_THREADS
) environment variable controls the maximum number of threads available to the program. Setting the environment variable tells the runtime system the maximum number of threads the program can use. The default is 1. In general, set thePARALLEL
orOMP_NUM_THREADS
variable to the available number of processors on the target platform.The following example shows how to set it:
demo%-or-setenv PARALLEL 4
C shelldemo$PARALLEL=4
Bourne/Korn shelldemo$export PARALLEL
In this example, setting
PARALLEL
to four enables the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads could run on the same processor as others, possibly degrading performance.The SunOSTM operating system command
psrinfo
(1M) displays a list of the processors available on a system:
demo%psrinfo
0 on-line since 03/18/99 15:51:031 on-line since 03/18/99 15:51:032 on-line since 03/18/99 15:51:033 on-line since 03/18/99 15:51:03Stacks, Stack Sizes, and Parallelization
The executing program maintains a main memory stack for the initial thread executing the program, as well as distinct stacks for each helper thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations.
The default size of the main stack is about 8 megabytes. The Fortran compilers normally allocate local variables and arrays as STATIC (not on the stack). However, the
-stackvar
option forces the allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of-stackvar
is recommended with parallelization because it improves the optimizer's ability to parallelize subprogram calls in loops.-stackvar
is required with explicitly parallelized loops containing subprogram calls. (See the discussion of-stackvar
in the Fortran User's Guide.)Using the C shell (
csh
), thelimit
command displays the current main stack size as well as sets it:
With Bourne or Korn shells, the corresponding command is
ulimit
:
Each helper thread of a multithreaded program has its own thread stack. This stack mimics the initial thread stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 2 Megabytes on SPARC V9 (UltraSPARC) platforms, 1 Megabyte otherwise. The size is set with the
STACKSIZE
environment variable:
demo%setenv STACKSIZE 8192
<- Set thread stack size to 8 Mb C shell-or-demo$STACKSIZE=8192
Bourne/Korn Shelldemo$export STACKSIZE
Setting the thread stack size to a value larger than the default may be necessary for some parallelized Fortran codes. However, it may not be possible to know just how large it should be, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.
Automatic Parallelization
With the
-autopar
and-parallel
options, thef77
andf95
compilers automatically find DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.Loop Parallelization
The compiler's dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.
For example, with four CPUs and a parallelized loop with 1000 iterations, each thread would execute a chunk of 250 iterations:
Processor 1 executes iterations 1 through 250 Processor 2 executes iterations 251 through 500 Processor 3 executes iterations 501 through 750 Processor 4 executes iterations 751 through 1000
Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependence analysis rejects from parallelization those loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.
Note that the compiler always chooses to parallelize loops using a static loop scheduling--simply dividing the work in the loop into equal blocks of iterations. Other scheduling schemes may be specified using explicit parallelization directives described later in this chapter.
Arrays, Scalars, and Pure Scalars
A few definitions, from the point of view of automatic parallelization, are needed:
- An array is a variable that is declared with at least one dimension.
- A scalar is a variable that is not an array.
- A pure scalar is a scalar variable that is not aliased--not referenced in an
EQUIVALENCE
orPOINTER
statement.Example: Array/scalar:
dimension a(10)real m(100,10), s, u, x, zequivalence ( u, z )pointer ( px, x )s = 0.0...Both
m
anda
are array variables;s
is pure scalar. The variablesu,
x,
z,
andpx
are scalar variables, but not pure scalars.Automatic Parallelization Criteria
DO
loops that have no cross-iteration data dependencies are automatically parallelized by-autopar
or-parallel
. The general criteria for automatic parallelization are:
- Only explicit
DO
loops and implicit loops, such asIF
loops and Fortran 95 array syntax are parallelization candidates.- The values of array variables for each iteration of the loop must not depend on the values of array variables for any other iteration of the loop.
- Calculations within the loop must not conditionally change any pure scalar variable that is referenced after the loop terminates.
- Calculations within the loop must not change a scalar variable across iterations. This is called a loop-carried dependence.
- The amount of work within the body of the loop must outweigh the overhead of parallelization.
f77
: Apparent DependenciesThe
f77
compiler may automatically eliminate a reference that appears to create a data dependency in the loop. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.Example: Using
-autopar
, with dependencies eliminated by private arrays:
parameter (n=1000)real a(n), b(n), c(n,n)do i = 1, 1000 <--Parallelizeddo k = 1, na(k) = b(k) + 2.0end dodo j = 1, nc(i,j) = a(j) + 2.3end doend doendIn the example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array
a
appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent.Inhibitors to Automatic Parallelization
Under automatic parallelization, the compilers do not parallelize a loop if:
- The
DO
loop is nested inside anotherDO
loop that is parallelized.- Flow control allows jumping out of the
DO
loop.- A user-level subprogram is invoked inside the loop.
- An I/O statement is in the loop.
- Calculations within the loop change an aliased scalar variable.
Nested Loops
In a multithreaded, multiprocessor environment, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each thread. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
Automatic Parallelization With Reduction Operations
A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.
Example: Reduction summation of the elements of a vector:
s = 0.0do i = 1, 1000s = s + v(i)end dot(k) = sHowever, for some operations, if reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
Recognition of reduction operations is not included in the automatic parallelization analysis unless the
-reduction
compiler option is specified along with-autopar
or-parallel
.If a parallelizable loop contains one of the reduction operations listed in TABLE 10-3, the compiler will parallelize it if
-reduction
is specified.Recognized Reduction Operations
The following table lists the reduction operations that are recognized by
f77
andf95
.
All forms of the
MIN
andMAX
function are recognized.Numerical Accuracy and Reduction Operations
Floating-point sum or product reduction operations may be inaccurate due to the following conditions:
- The order in which the calculations are performed in parallel is not the same as when performed serially on a single processor.
- The order of calculation affects the sum or product of floating-point numbers. Hardware floating-point addition and multiplication are not associative. Roundoff, overflow, or underflow errors may result depending on how the operands associate. For example,
(X*Y)*Z
andX*(Y*Z)
may not have the same numerical significance.In some situations, the error may not be acceptable.
Example: Overflow and underflow, with and without reduction:
Example: Roundoff, get the sum of 100,000 random numbers between -1 and +1:
Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.
1 s = 0.568582080884714E+022 s = 0.568582080884722E+023 s = 0.568582080884721E+024 s = 0.568582080884724E+02
In this situation, roundoff error on the order of 10-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.
Explicit Parallelization
This section describes the source code directives recognized by
f77
andf95
to explicitly indicate which loops to parallelize and what strategy to use.The Sun WorkShop 6 Fortran compilers will accept both Sun-style and Cray-style parallelization directives to facilitate porting explicitly parallelized programs from other platforms.
The Fortran 95 compiler will also accept the OpenMP Fortran parallelization directives. The OpenMP Fortran specification is available at
http://www.openmp.org/
. The OpenMP directives, library routines, and environment variables are summarized in Appendix E of the Fortran User's Guide.Explicit parallelization of a program requires prior analysis and deep understanding of the application code as well as the concepts of shared-memory parallelization.
DO loops are marked for parallelization by directives placed immediately before them. The compiler options
-parallel
or-explicitpar
must be used for DO loops to be recognized and parallel code generated. Parallelization directives are comment lines that tell the compiler to parallelize (or not to parallelize) theDO
loop that follows the directive. Directives are also called pragmas.Take care when choosing which loops to mark for parallelization. The compiler generates threaded, parallel code for all loops marked with parallelization directives, even if there are data dependencies that will cause the loop to compute incorrect results when run in parallel.
If you do your own multithreaded coding using the
libthread
primitives, do not use any of the compilers' parallelization options--the compilers cannot parallelize code that has already been parallelized with user calls to the threads library.Parallelizable Loops
A loop is appropriate for explicit parallelization if:
- It is a
DO
loop, but not aDO
WHILE
or Fortran 95 array syntax.- The values of array variables for each iteration of the loop do not depend on the values of array variables for any other iteration of the loop.
- If the loop changes a scalar variable, that variable's value is not used after the loop terminates. Such scalar variables are not guaranteed to have a defined value after the loop terminates, since the compiler does not automatically ensure a proper storeback for them.
- For each iteration, any subprogram that is invoked inside the loop does not reference or change values of array variables for any other iteration.
- The
DO
loop index must be an integer.Scoping Rules: Private and Shared
A private variable or array is private to a single iteration of a loop. The value assigned to a private variable or array in one iteration is not propagated to any other iteration of the loop.
A shared variable or array is shared with all other iterations. The value assigned to a shared variable or array in an iteration is seen by other iterations of the loop.
If an explicitly parallelized loop contains shared references, then you must ensure that sharing does not cause correctness problems. The compiler does not synchronize on updates or accesses to shared variables.
If you specify a variable as private in one loop, and its only initialization is within some other loop, the value of that variable may be left undefined in the loop.
Subprogram Call in a Loop
A subprogram call in a loop (or in any subprograms called from within the called routine) may introduce data dependencies that could go unnoticed without a deep analysis of the data and control flow through the chain of calls. While it is best to parallelize outermost loops that do a significant amount of the work, these tend to be the very loops that involve subprogram calls.
Because such an interprocedural analysis is difficult and could greatly increase compilation time, automatic parallelization modes do not attempt it. With explicit parallelization, the compiler generates parallelized code for a loop marked with a DOALL directive even if it contains calls to subprograms. It is still the programmer's responsibility to insure that no data dependencies exist within the loop and all that the loop encloses, including called subprograms.
Multiple invocations of a routine by different threads can cause problems resulting from references to local static variables that interfere with each other. Making all the local variables in a routine automatic rather than static prevents this. Each invocation of a subprogram then has its own unique store of local variables maintained on the stack, and no two invocations will interfere with each other.
Local subprogram variables can be made automatic variables that reside on the stack either by listing them on an
AUTOMATIC
statement or by compiling the subprogram with the-stackvar
option. However, local variables initialized inDATA
statements must be rewritten to be initialized in actual assignments.
Note Allocating local variables to the stack can cause stack overflow. See Stacks, Stack Sizes, and Parallelization about increasing the size of the stack.
Inhibitors to Explicit Parallelization
In general, the compiler parallelizes a loop if you explicitly direct it to. There are exceptions--some loops the compiler will not parallelize.
The following are the primary detectable inhibitors that might prevent explicitly parallelizing a
DO
loop:
- The
DO
loop is nested inside anotherDO
loop that is parallelized.
- This exception holds for indirect nesting, too. If you explicitly parallelize a loop that includes a call to a subroutine, then even if you request the compiler to parallelize loops in that subroutine, those loops are not run in parallel at runtime.
- A flow control statement allows jumping out of the
DO
loop.- The index variable of the loop is subject to side effects, such as being equivalenced.
By compiling with
-vpara
, you will get diagnostic messages when the compiler detects a problem while explicitly parallelizing a loop. The compiler may still parallelize the loop.The following table lists typical parallelization problems detected by the compiler:
Example: Nested loops:
...
C$PAR DOALL
do 900 i = 1, 1000 !
Parallelized (outer loop)do 200 j = 1, 1000 !
Not parallelized, no warning...
200 continue900 continue...
Example: A parallelized loop in a subroutine:
In the example, the loop within the subroutine is not parallelized because the subroutine itself is run in parallel.
Example: Jumping out of a loop:
C$PAR DOALL
do i = 1, 1000 !
Not parallelized, warning issued
...if (a(i) .gt. min_threshold ) go to 20...end do20 continue...Example: A variable in a loop has a loop-carried dependency:
C$PAR DOALL
do 100 i = 1, 200 !
Parallelized, with warningy = y * i !
y has a loop-carried dependencya(i) = y100 continue...I/O With Explicit Parallelization
You can do I/O in a loop that executes in parallel, provided that:
- It does not matter that the output from different threads is interleaved (program output is nondeterministic.)
- You can ensure the safety of executing the loop in parallel.
Example: I/O statement in loop
In the example, the program may deadlock in
libF77_mt
and hang. Press Control-C to regain keyboard control.There are situations where the programmer might not be aware that I/O could take place within a parallelized loop. Consider a user-supplied exception handler that prints output when it catches an arithmetic exception (like divide by zero). If a parallelized loop provokes an exception, the implicit I/O from the handler may cause I/O deadlocks and a system hang.
- The library
libF77_mt
is MT safe, but mostly not MT hot.- You cannot do recursive (nested) I/O if you compile with
-mt
.As an informal definition, an interface is MT safe if:
- It can be simultaneously invoked by more than one thread of control.
- The caller is not required to do any explicit synchronization before calling the function.
- The interface is free of data races.
A data race occurs when the content of an address in memory is being updated by more than one thread, and that address is not protected by a lock. The value of that memory address is therefore nondeterministic--the two threads race to update the thread (but in this case, the one who gets there last, wins).
An interface is generally called MT hot if the implementation has been tuned for performance advantage, using the techniques of multithreading. See the Solaris Multithreaded Programming Guide for details.
Sun-Style Parallelization Directives
Sun-style directives are enabled by default (or with the
-mp=sun
option) when compiling with the-explicitpar
or-parallel
options.Sun Parallelization Directives Syntax
A parallel directive consists of one or more directive lines. A Sun-style directive line is defined as follows:
C$PAR Directive [ Qualifiers ] <- Initial directive lineC$PAR& [More_Qualifiers] <- Optional continuation lines
- A directive line is case-insensitive.
- A directive line begins with a five-character sentinel:
C$PAR
,*$PAR
, or!$PAR
.- With
f77
andf95
fixed-format:
- An initial directive line has a blank in column 6.
- A continuation directive line has a nonblank in column 6.
- Columns beyond 72 are ignored unless the
-e
option is specified.- With
f95
free format:- Qualifiers, if any, follow directives--on the same line or continuation lines.
- Multiple qualifiers on one line are separated by commas.
- Spaces before, after, or within a directive or qualifier are ignored.
The Sun-style parallel directives are:
Examples of Sun-style parallel directives:
TASKCOMMON
DirectiveThe
TASKCOMMON
directive declares variables in a global COMMON block as thread-private: Every variable declared in a common block becomes a private variable to the thread, but remains global within the thread. Only named COMMON blocks can be declaredTASKCOMMON
.The syntax of the directive is:
C$PAR TASKCOMMON
common_block_nameThe directive must appear immediately before or after every COMMON declaration for that named block.
This directive is effective only when compiled with
-explicitpar
or-parallel
. Otherwise, the directive is ignored and the block is treated as a regular COMMON block.Variables declared in
TASKCOMMON
blocks are treated as thread-private variables in all theDOALL
loops and routines called from within theDOALL
loops. Each thread gets its own copy of the COMMON block, so data written by one thread is not directly visible to other threads. During serial portions of the program, accesses are to the initial thread's copy of the COMMON block.Variables in
TASKCOMMON
blocks should not appear on anyDOALL
qualifiers, such asPRIVATE
,SHARED
,READONLY
, and so on.It is an error to declare a common block as task common in some but not all compilation units where the block is defined. A check at runtime for task common consistency can be enabled by compiling the program with the
-xcommonchk=yes
flag. (Enable the runtime check only during program development, as it can degrade performance.)
DOALL
Directive
The
DOALL
directive requests the compiler to generate parallel code for the one DO loop immediately following it (if compiled with the-parallel
or-explicitpar
options).
Note Analysis and transformation of reduction operations is not performed within explicitly parallelized loops.
Example: Explicit parallelization of a loop:
demo%cat t4.f
...C$PAR DOALL
do i = 1, na(i) = b(i) * c(i)end dodo k = 1, mx(k) = x(k) * z(k,k)end do...demo%f95 -explicitpar t4.f
DOALL
QualifiersAll qualifiers on the Sun-style
DOALL
directive are optional. The following table summarizes them:
PRIVATE(varlist)
The
PRIVATE(
varlist)
qualifier specifies that all scalars and arrays in the list varlist are private for theDOALL
loop. Both arrays and scalars can be specified as private. In the case of an array, each thread of theDOALL
loop gets a copy of the entire array. All other scalars and arrays referenced in theDOALL
loop, but not contained in the private list, conform to their appropriate default scoping rules. (See page 160).Example: Specify array
a
private in loopi
:
C$PAR DOALL PRIVATE(a)do i = 1, na(1) = b(i)do j = 2, na(j) = a(j-1) + b(j) * c(j)end dox(i) = f(a)end doSHARED(varlist)
The
SHARED
(varlist) qualifier specifies that all scalars and arrays in the list varlist are shared for theDOALL
loop. Both arrays and scalars can be specified as shared. Shared scalars and arrays can be accessed in all the iterations of aDOALL
loop. All other scalars and arrays referenced in theDOALL
loop, but not contained in the shared list, conform to their appropriate default scoping rules.Example: Specify a shared variable:
C$PAR DOALL SHARED(y)do i = 1,na(i) = yend doIn the example, the variable
y
has been specified as a variable whose value should be shared among the iterations of thei
loop.READONLY(varlist)
The
READONLY(
varlist)
qualifier specifies that all scalars and arrays in the list varlist are read-only for theDOALL
loop. Read-only scalars and arrays are a special class of shared scalars and arrays that are not modified in any iteration of theDOALL
loop. Specifying scalars and arrays asREADONLY
indicates to the compiler that it does not need to use a separate copy of that scalar variable or array for each thread of theDOALL
loop.Example: Specify a read-only variable:
x = 3C$PAR DOALL SHARED(x),READONLY(x)do i = 1, nb(i) = x + 1end doIn the preceding example,
x
is a shared variable, but the compiler can rely on the fact that its value will not be modified in any iteration of thei
loop because of itsREADONLY
specification.STOREBACK(varlist)
A
STOREBACK
scalar variable or array is one whose value is computed in aDOALL
loop. The computed value can be used after the termination of the loop. In other words, the last loop iteration values of storeback scalars or arrays are visible after theDOALL
loop.Example: Specify the loop index variable as storeback:
C$PAR DOALL PRIVATE(x), STOREBACK(x,i)do i = 1, nx = ...end do... = i... = xIn the preceding example, both the variables
x
andi
are STOREBACK variables, even though both variables are private to thei
loop. The value ofi
after the loop isn+1
, while the value ofx
is whatever value it had at the end of the last iteration.There are some potential problems for
STOREBACK
to be aware of.The
STOREBACK
operation occurs at the last iteration of the explicitly parallelized loop, even if this is not the same iteration that last updates the value of theSTOREBACK
variable or array.Example:
STOREBACK
variable potentially different from the serial version:
C$PAR DOALL PRIVATE(x), STOREBACK(x)do i = 1, nif (...) thenx = ...end ifend doprint *,xIn the preceding example, the value of the
STOREBACK
variablex
that is printed out might not be the same as that printed out by a serial version of thei
loop. In the explicitly parallelized case, the processor that processes the last iteration of thei
loop (wheni
=n
) and performs theSTOREBACK
operation forx
, might not be the same processor that currently contains the last updated value ofx
. The compiler issues a warning message about these potential problems.SAVELAST
The
SAVELAST
qualifier specifies that all private scalars and arrays areSTOREBACK
variables for theDOALL
loop.Example: Specify
SAVELAST
:
C$PAR DOALL PRIVATE(x,y), SAVELASTdo i = 1, nx = ...y = ...end do... = i... = x... = yIn the example, variables
x
,y,
andi
areSTOREBACK
variables.REDUCTION(varlist)
The
REDUCTION
(varlist) qualifier specifies that all variables in the list varlist are reduction variables for theDOALL
loop. A reduction variable (or array) is one whose partial values can be individually computed on various processors, and whose final value can be computed from all its partial values.The presence of a list of reduction variables requests the compiler to handle a
DOALL
loop as reduction loop by generating parallel reduction code for it.Example: Specify a reduction variable:
C$PAR DOALL REDUCTION(x)do i = 1, nx = x + a(i)end doIn the preceding example, the variable
x
is a (sum) reduction variable; thei
loop is a (sum) reduction loop.SCHEDTYPE(t)
SCHEDTYPE(
t)
specifies the scheduling type t be used to schedule theDOALL
loop.
Multiple Qualifiers
Qualifiers can appear multiple times with cumulative effect. In the case of conflicting qualifiers, the compiler issues a warning message, and the qualifier appearing last prevails.
Example: A three-line Sun-style directive (note conflicting
MAXCPUS
,SHARED
, andPRIVATE
qualifiers):
C$PAR DOALL MAXCPUS(4), READONLY(S), PRIVATE(A,B,X), MAXCPUS(2)C$PAR DOALL SHARED(B,X,Y), PRIVATE(Y,Z)C$PAR DOALL READONLY(T)Example: A one-line equivalent of the preceding three lines:
C$PAR DOALL MAXCPUS(2), PRIVATE(A,Y,Z), SHARED(B,X), READONLY(S,T)
DOSERIAL
Directive
The
DOSERIAL
directive disables parallelization of the specified loop. This directive applies to the one loop immediately following it.Example: Exclude one loop from parallelization:
do i = 1, nC$PAR DOSERIALdo j = 1, ndo k = 1, n...end doend doend doIn the example, when compiling with
-parallel
, thej
loop will not be parallelized by the compiler, but thei
ork
loop may be.
DOSERIAL*
Directive
The
DOSERIAL*
directive disables parallelization of the specified nest of loops. This directive applies to the whole nest of loops immediately following it.Example: Exclude a whole nest of loops from parallelization:
do i = 1, nC$PAR DOSERIAL*do j = 1, ndo k = 1, n...end doend doend doIn the example, when compiling with
-parallel
, thej
andk
loops will not be parallelized by the compiler, but thei
loop may be.Interaction Between
DOSERIAL*
andDOALL
If both
DOSERIAL*
andDOALL
are specified for the same loop, the last one prevails.Example: Specifying both
DOSERIAL*
andDOALL
:
C$PAR DOSERIAL*do i = 1, 1000C$PAR DOALLdo j = 1, 1000...end doend doIn the example, the
i
loop is not parallelized, but thej
loop is.Also, the scope of the
DOSERIAL*
directive does not extend beyond the textual loop nest immediately following it. The directive is limited to the same function or subroutine that it appears in.Example:
DOSERIAL*
does not extend to a loop in a called subroutine:
program callercommon /block/ a(10,10)C$PAR DOSERIAL*do i = 1, 10call callee(i)end doendsubroutine callee(k)common /block/a(10,10)do j = 1, 10a(j,k) = j + kend doreturnendIn the preceding example,
DOSERIAL*
applies only to thei
loop and not to thej
loop, regardless of whether the call to the subroutinecallee
is inlined.Default Scoping Rules for Sun-Style Directives
For Sun-style (
C$PAR
) explicit directives, the compiler uses default rules to determine whether a scalar or array is shared or private. You can override the default rules to specify the attributes of scalars or arrays referenced inside a loop. (With Cray-style!MIC$
directives, all variables that appear in the loop must be explicitly declared either shared or private on theDOALL
directive.)The compiler applies these default rules:
- All scalars are treated as private. A local copy of a scalar is made available for each thread executing the loop, and that local copy is used by that thread only.
- All array references are treated as shared references. Any write of an array element by one thread is visible to all threads. No synchronization is performed on accesses to shared variables.
If inter-iteration dependencies exist in a loop, then the execution may result in erroneous results. You must ensure that these cases do not arise. The compiler may sometimes be able to detect such a situation at compile time and issue a warning, but it does not disable parallelization of such loops.
Example: Potential problem through equivalence:
equivalence (a(1),y)C$PAR DOALLdo i = 1,ny = ia(i) = yend doIn the example, since the scalar variable
y
has been equivalenced toa(1)
, we have a conflict withy
as private anda(:)
as shared by default, leading to possibly erroneous results when the parallelizedi
loop is executed. No diagnostic is issued in these situations.You can fix the example by using
C$PAR DOALL
PRIVATE(y)
.Cray-Style Parallelization Directives
Parallel directives have two forms: Sun style and Cray style. The
f77
andf95
default is Sun style (-mp=sun
). To use Cray-style directives, you must compile with-mp=cray
.Mixing program units compiled with both Sun and Cray directives can produce incorrect results.
A major difference between Sun and Cray directives is that Cray style requires explicit scoping of every scalar and array in the loop as either SHARED or PRIVATE.
The following table shows Cray-style directive syntax.
!MIC$ DOALL
!MIC$& SHARED(
v1,
v2, ... )
!MIC$& PRIVATE(
u1,
u2, ... )
...
optional qualifiersCray Directive Syntax
A parallel directive consists of one or more directive lines. A directive line is defined with the same syntax as Sun-style (page 165), except:
- The sentinels are
CMIC$
,*MIC$
, or!MIC$
, but only!MIC$
is recognized withf95
free-format.- Every variable or array referenced in the loop appears in a
SHARED
orPRIVATE
qualifier.The Cray directives are similar to Sun-style:
Cray Directive Compared With Sun-Style DOALLdifferent set of qualifiers and scheduling TASKCOMMONsame as Sun-style DOSERIALsame as Sun-style DOSERIAL*same as Sun-style
DOALL
QualifiersFor Cray-style
DOALL
, thePRIVATE
qualifier is required. Each variable within theDO
loop must be qualified as private or shared, and theDO
loop index must always be private. The following table summarizes available Cray-style qualifiers.
For Cray-style directives, the
DOALL
directive allows a single scheduling qualifier, for example,!MIC$&
CHUNKSIZE(100)
. TABLE 10-8 shows the Cray-styleDOALL
directive scheduling qualifiers:
For both
f77
andf95
the default scheduling type (when no scheduling type is specified on a Cray-styleDOALL
directive) is the Sun-styleSTATIC
, for which there is no Cray-style equivalent.Environment Variables
There are three environment variables used with parallelization:
PARALLEL
SUNW_MP_THR_IDLE
- OMP_NUM_THREADS
(See also the
STACKSIZE
discussion on page 152)
PARALLEL
andOMP_NUM_THREADS
To run a parallelized program in a multithreaded environment, you must set either the
PARALLEL
orOMP_NUM_THREADS
environment variable prior to execution. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set thePARALLEL
orOMP_NUM_THREADS
variable to the available number of processors on the target platform.
SUNW_MP_THR_IDLE
Use the
SUNW_MP_THR_IDLE
environment variable to control the end-of-task status of each thread executing the parallel part of a program. You can set the value of this variable tospin
,sleep
ns
, orsleep
nms
. The default is spin, which means a thread spin-waits when it finishes its share of a parallel task until a new parallel task arrives. The other choice puts the thread to sleep after spin-waiting for n seconds (ns
) or n milliseconds (nms
). If a new task arrives before this wait-time the thread stops spinning and starts the new task.
%setenv SUNW_MP_THR_IDLE 50ms
%setenv PARALLEL 4
%myprog
In this example, at most four threads are created by the program. After finishing a parallel task, a thread spins for 50 ms. If within that time a new task has arrived for the thread, it executes it. Otherwise, the thread goes to sleep until a new task arrives.
Debugging Parallelized Programs
Debugging parallelized programs requires some extra effort. The following schemes suggest ways to approach this task.
First Steps at Debugging
There are some steps you can try immediately to determine the cause of errors.
- Turn off parallelization.
- You can do one of the following:
- Turn off the parallelization options--Verify that the program works correctly by compiling with
-O3
or-O4
, but without any parallelization.- Set the number of threads to one and compile with parallelization on--run the program with the environment variable
PARALLEL
set to1
.- Check also for out of bounds array references by compiling with
-C
.- Problems using
-autopar
may indicate that the compiler is parallelizing something it should not.- Turn off
-reduction
.
- If you are using the
-reduction
option, summation reduction may be occurring and yielding slightly different answers. Try running without this option.- Use the
DOSERIAL
directive to selectively disable automatic parallelization of individual loops.- Use
fsplit
.
- If you have many subroutines in your program, use
fsplit
(1) to break them into separate files. Then compile some files with and without-parallel
, and usef77
orf95
to link the.o
files. You must specify-parallel
on this link step. (See Fortran User's Guide section on consistent compiling and linking.)- Execute the binary and verify results.
- Repeat this process until the problem is narrowed down to one subroutine.
- Use
-loopinfo
.
- Check which loops are being parallelized and which loops are not.
- Use a dummy subroutine.
- Create a dummy subroutine or function that does nothing. Put calls to this subroutine in a few of the loops that are being parallelized. Recompile and execute. Use
-loopinfo
to see which loops are being parallelized.- Continue this process until you start getting the correct results.
- Use explicit parallelization.
- Add the
C$PAR DOALL
directive to a couple of the loops that are being parallelized. Compile with-explicitpar
, then execute and verify the results. Use-loopinfo
to see which loops are being parallelized. This method permits the addition of I/O statements to the parallelized loop.- Repeat this process until you find the loop that causes the wrong results.
- Note: if you need
-explicitpar
only (without-autopar
), do not compile with-explicitpar
and-depend
. This method is the same as compiling with-parallel
, which, of course, includes-autopar
.- Run loops backward serially.
- Replace
DO
I=1,N
withDO
I=N,1,-1
. Different results point to data dependencies.- Avoid using the loop index.
Replace: DO I=1,N...CALL SNUBBER(I)...ENDDOWith:DO I1=1,NI=I1
...CALL SNUBBER(I)...ENDDODebugging Parallel Code With
dbx
To use
dbx
on a parallel loop, temporarily rewrite the program as follows:
- Isolate the body of the loop in a file and subroutine of its own.
- In the original routine, replace loop body with a call to the new subroutine.
- Compile the new subroutine with
-g
and no parallelization options.- Compile the changed original routine with parallelization and no
-g
.
- Example: Manually transform a loop to allow using
dbx
in parallel:
Sun Microsystems, Inc. Copyright information. All rights reserved. Feedback |
Library | Contents | Previous | Next | Index |