This chapter presents an overview of multiprocessor parallelization and describes the capabilities of Sun's Fortran compilers. Implementation differences between f77 and f90 are noted.
Parallelization features are available only on SPARC platforms with Solaris 2.5.1, 2.6, and Solaris 7 operating environments, and require a Sun WorkShop license.
Parallelizing (or multithreading) an application recasts the compiled program to run on a multiprocessor system. Parallelization enables single tasks, such as a DO loop, to run over multiple processors with a potentially significant execution speedup.
Before an application program can be run efficiently on a multiprocessor system like the Ultra(TM) 60, Enterprise(TM) 450, or Ultra HPC 1000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations.
Multithreading an application can be done manually by making appropriate calls to the libthread primitives. However, a significant amount of analysis and reprogramming might be required. (See the Solaris Multithreaded Programming Guide for more information.)
Sun compilers can automatically generate multithreaded object code to run on multiprocessor systems. The Fortran compilers focus on DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program.
The choice of which loops to parallelize and how to distribute them can be left entirely up to the compiler (-autopar), determined explicitly by the programmer with source code directives (-explicitpar), or done in combination (-parallel).
Programs that do their own (explicit) thread management should not be compiled with any of the compiler's parallelization options. Explicit multithreading (calls to libthread primitives) cannot be combined with routines compiled with these parallelization options.
Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run more slowly when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.
Only explicit Fortran 90 DO loops are candidates for parallelization with f90.
Sun compilers can detect loops that might be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the -loopinfo option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.
If you parallelize a program so that it runs over four processors, can you expect it to take (roughly) one fourth the time that it did with a single processor (a fourfold speedup)?
Probably not. It can be shown (by Amdahl's law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel, no matter how many processors are applied. If c is the percentage of the execution time that runs in parallel, the theoretical speedup limit is 100/(100-c); therefore, if only 60% of a program runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. More generally, with P processors the bound is 100/((100-c) + c/P), so with just four processors the theoretical speedup for this program (assuming maximum efficiency) would be 100/(40 + 60/4), or about 1.8, and not 4. With overhead, the actual speedup would be less.
As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the runtime must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.
Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).
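For example, in the following hypothetical loop nest (a sketch, not taken from any real benchmark), the work grows as the cube of n. Doubling n multiplies the work inside the parallelized outer loop by eight, while the per-loop parallelization overhead stays roughly fixed:

      parameter ( n = 500 )
      real a(n,n), b(n,n), c(n,n)
      do i = 1, n                    <-- parallelizing this outer loop
         do j = 1, n
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do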
Here is a very general outline of the steps needed to parallelize an application:
Optimize. Use the appropriate set of compiler options to get the best serial performance on a single processor.
Profile. Using typical test data, determine the performance profile of the program. Identify the most significant loops.
Benchmark. Determine that the serial test results are accurate. Use these results and the performance profile as the benchmark.
Parallelize. Use a combination of options and directives to compile and build a parallelized executable.
Verify. Run the parallelized program on a single processor and check results to find instabilities and programming errors that might have crept in.
Test. Make various runs on several processors to check results.
Benchmark. Make performance measurements with various numbers of processors on a dedicated system. Measure performance changes with changes in problem size (scalability).
Repeat steps 4 to 7. Make improvements to the parallelization scheme based on performance. (The transcript following this list sketches some of these steps.)
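The following transcript sketches steps 1, 4, 5, and 6 for a hypothetical source file prog.f, using only options described in this chapter:

demo% f77 -O3 -o prog prog.f                           Step 1: serial optimization
demo% f77 -O3 -parallel -loopinfo -o prog_mp prog.f    Step 4: parallelize
demo% setenv PARALLEL 1
demo% prog_mp                                          Step 5: verify on one processor
demo% setenv PARALLEL 4
demo% prog_mp                                          Step 6: test on four processors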
Not all loops are parallelizable. Running a loop in parallel over a number of processors may result in iterations executing out of order. Or the multiple processors executing the loop in parallel may interfere with each other. These situations arise whenever there are data dependencies in the loop.
Variables that are set in one iteration of a loop and used in a subsequent iteration introduce cross-iteration dependencies, or recurrences. A recurrence in a loop requires that the iterations be executed in the proper order. For example:
      DO I=2,N
         A(I) = A(I-1)*B(I)+C(I)
      END DO
requires the value computed for A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. For a parallel run to produce the same results as a single-processor run, iteration I must complete before iteration I+1 can execute.
Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:
      DO I = 1,N
         SUM = SUM + A(I)*B(I)
      END DO
If each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, overwriting the value in SUM. For this to work, each processor must execute the summation one at a time, although the order is not significant.
Certain common reduction operations are recognized and handled as special cases by the compiler.
Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known. For example, indirect addressing could be order dependent if there are repeated values in the index array:
      DO L = 1,NW
         A(ID(L)) = A(L) + B(L)
      END DO
In the preceding, repeated values in ID cause elements in A to be overwritten. In the serial case, the last store is the final value. In the parallel case, the order is not determined. The values of A(L) that are used, old or updated, are order dependent.
You might be able to rewrite a loop to eliminate data dependencies, making it parallelizable; a small example follows the rules below. However, extensive restructuring could be needed.
Some general rules are:
A loop is data independent only if all iterations write to distinct memory locations.
Iterations may read from the same locations as long as no one iteration writes to them.
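For example, consider this hypothetical loop (a sketch of a rewrite you might make by hand), in which the scalar x carries a value from one iteration to the next:

      x = a(1)
      do i = 1, n-1
         b(i) = x
         x = a(i+1)
      end do

Because the carried value can be computed directly from the loop index, the loop can be rewritten so that every iteration writes only to its own element of b, removing the dependency:

      do i = 1, n-1
         b(i) = a(i)
      end do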
These are general conditions for parallelization. The compilers' automatic parallelization analysis considers additional criteria when deciding whether to parallelize a loop. However, you can use directives to explicitly force loops to be parallelized, even loops that contain inhibitors and would compute incorrect results in parallel.
The following table shows the f77 5.0 and f90 2.0 compilation options related to parallelization.
Table 10-1 Parallelization Options
Option | Flag
---|---
Automatic (only) | -autopar
Automatic and Reduction | -autopar -reduction
Explicit (only) | -explicitpar
Automatic and Explicit | -parallel
Automatic and Reduction and Explicit | -parallel -reduction
Show which loops are parallelized | -loopinfo
Show warnings with explicit | -vpara
Allocate local variables on stack | -stackvar
Use Sun-style MP directives | -mp=sun
Use Cray-style MP directives | -mp=cray
Notes on these options:
-reduction requires -autopar.
-autopar includes -depend and loop structure optimization.
-parallel is equivalent to -autopar -explicitpar (see the example following these notes).
-noautopar, -noexplicitpar, -noreduction are the negations.
Parallelization options can be in any order, but they must be all lowercase.
Reduction operations are not analyzed for explicitly parallelized loops.
Use of any of the parallelization options requires a WorkShop license.
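For example, by the equivalence noted above, these two compile commands (for a hypothetical file prog.f) request the same parallelization:

demo% f77 -parallel prog.f
demo% f77 -autopar -explicitpar prog.f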
The following table shows the parallel directives recognized by f77 and f90.
Table 10-2 Parallel Directives
Parallel Directive | Purpose
---|---
C$PAR TASKCOMMON | Declares a common block private
C$PAR DOALL [optional qualifiers] | Parallelizes next loop, if possible
C$PAR DOSERIAL | Inhibits parallelization of next loop
C$PAR DOSERIAL* | Inhibits parallelization of loop nest
The PARALLEL environment variable controls the maximum number of processors available to the program. The following example shows how to set it:
demo% setenv PARALLEL 4              C shell
   -or-
demo$ PARALLEL=4                     Bourne/Korn shell
demo$ export PARALLEL
In this example, setting PARALLEL to four enables the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads could run on the same processor as others, possibly degrading performance.
The SunOS command psrinfo(1M) displays a list of the processors available on a system:
demo% psrinfo
0       on-line   since 03/18/96 15:51:03
1       on-line   since 03/18/96 15:51:03
2       on-line   since 03/18/96 15:51:03
3       on-line   since 03/18/96 15:51:03
The executing program maintains a main memory stack for the parent program and distinct stacks for each thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations.
The default size of the main stack is about 8 megabytes. The Fortran compilers normally allocate local variables and arrays as STATIC (not on the stack). However, the -stackvar option forces allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of -stackvar is recommended with parallelization because it improves the optimizer's ability to parallelize CALLs in loops. -stackvar is required with explicitly parallelized loops containing subprogram calls. (See the discussion of -stackvar in the Fortran User's Guide.)
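For example, a typical compile command combining the options discussed so far (prog.f is a hypothetical file) might be:

demo% f77 -O3 -parallel -stackvar -loopinfo prog.f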
The limit command displays the current main stack size and can also set it:
demo% limit                          C shell example
cputime          unlimited
filesize         unlimited
datasize         2097148 kbytes
stacksize        8192 kbytes         <- current main stack size
coredumpsize     0 kbytes
descriptors      64
memorysize       unlimited
demo% limit stacksize 65536          <- set main stack to 64 Mb
demo% limit stacksize
stacksize        65536 kbytes
demo$ ulimit -a                      Korn shell example
time(seconds)         unlimited
file(blocks)          unlimited
data(kbytes)          2097148
stack(kbytes)         8192
coredump(blocks)      0
nofiles(descriptors)  64
vmemory(kbytes)       unlimited
demo$ ulimit -s 65536                <- set main stack to 64 Mb
demo$ ulimit -s
65536
Each thread of a multithreaded program has its own thread stack. This stack mimics the main program stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 256 kilobytes. The size is set with the STACKSIZE environment variable:
demo% setenv STACKSIZE 8192          <- Set thread stack size to 8 Mb, C shell
   -or-
demo$ STACKSIZE=8192                 Bourne/Korn shell
demo$ export STACKSIZE
Setting the thread stack size to a value larger than the default may be necessary for most parallelized Fortran codes. However, it may not be possible to know just how large to set it, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.
With the -autopar and -parallel options, the compilers automatically find DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.
The compiler's dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.
For example, with four CPUs and a parallelized loop with 1000 iterations:
Processor 1 executing iterations    1 through  250
Processor 2 executing iterations  251 through  500
Processor 3 executing iterations  501 through  750
Processor 4 executing iterations  751 through 1000
Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.
Note that the compiler always chooses to parallelize loops using a chunk distribution--simply dividing the work in the loop into equal blocks of iterations. Other distribution schemes may be specified using explicit parallelization directives described later in this chapter.
A few definitions, from the point of view of automatic parallelization, are needed:
An array is a variable that is declared with at least one dimension.
A scalar is a variable that is not an array.
A pure scalar is a scalar variable that is not aliased--not referenced in an EQUIVALENCE or POINTER statement.
      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...
Both m and a are array variables; s is pure scalar. The variables u, x, z, and px are scalar variables, but not pure scalars.
DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar or -parallel. The general criteria for automatic parallelization are:
DO loops are parallelized, but not DO WHILE or Fortran 90 array operations.
The values of array variables for each iteration of the loop must not depend on the values of array variables for any other iteration of the loop.
Calculations within the loop must not conditionally change any pure scalar variable that is referenced after the loop terminates.
Calculations within the loop must not change a scalar variable in a cumulative way across iterations. This is called a loop-carried dependency; see the sketch following this list.
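For example, in this sketch (hypothetical code), the scalar x changes cumulatively across iterations, so the loop is not automatically parallelized:

      x = 0.0
      do i = 1, n
         x = x + 1.0                 <-- loop-carried dependency on x
         a(i) = x
      end do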
The f77 compiler may automatically eliminate a reference that appears to create a data dependency by transforming the compiled code. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.
Example: Using -autopar, with dependencies eliminated by private arrays:
      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000               <-- Parallelized
         do k = 1, n
            a(k) = b(k) + 2.0
         end do
         do j = 1, n
            c(i,j) = a(j) + 2.3
         end do
      end do
      end
In the preceding example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a(*) appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent.
Under automatic parallelization, the compilers do not parallelize a loop if:
The DO loop is nested inside another DO loop that is parallelized.
Flow control allows jumping out of the DO loop.
A user-level subprogram is invoked inside the loop.
An I/O statement is in the loop.
Calculations within the loop change an aliased scalar variable.
On multiprocessor systems, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each processor. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.
Example: Reduction summation of the elements of a vector:
      s = 0.0
      do i = 1, 1000
         s = s + v(i)
      end do
      t(k) = s
However, for some operations, if the reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.
If a parallelizable loop contains one of the reduction operations listed in Table 10-3, the compiler will parallelize it if -reduction is specified.
The following table lists the reduction operations that are recognized by f77 and f90.
Table 10-3 Recognized Reduction Operations
Mathematical Operations | Fortran Statement Templates
---|---
Sum of the elements | s = s + v(i)
Product of the elements | s = s * v(i)
Dot product of two vectors | s = s + v(i) * u(i)
Minimum of the elements | s = amin( s, v(i) )
Maximum of the elements | s = amax( s, v(i) )
OR of the elements | do i = 1, n ; b = b .or. v(i) ; end do
AND of nonpositive elements | b = .true. ; do i = 1, n ; if (v(i) .le. 0) b = b .and. v(i) ; end do
Count nonzero elements | k = 0 ; do i = 1, n ; if ( v(i) .ne. 0 ) k = k + 1 ; end do
All forms of the MIN and MAX function are recognized.
Floating-point sum or product reduction operations may be inaccurate due to the following conditions:
The order in which the calculations were performed in parallel was not the same as when performed serially on a single processor.
The order of calculation affected the sum or product of floating-point numbers. Hardware floating-point addition and multiplication are not associative, so roundoff, overflow, or underflow errors may differ depending on how the operands associate. For example, (X*Y)*Z and X*(Y*Z) may not produce the same result.
In some situations, the error may not be acceptable.
Example: Overflow and underflow, with and without reduction:
demo% cat t3.f
      real A(10002), result, MAXFLOAT
      MAXFLOAT = r_max_normal()
      do 10 i = 1, 10000, 2
         A(i) = MAXFLOAT
         A(i+1) = -MAXFLOAT
10    continue
      A(5001) = -MAXFLOAT
      A(5002) = MAXFLOAT
      do 20 i = 1, 10002           !Add up the array
         RESULT = RESULT + A(i)
20    continue
      write(6,*) RESULT
      end
demo% setenv PARALLEL 2            {Number of processors is 2}
demo% f77 -silent -autopar t3.f
demo% a.out
  0.                               {Without reduction, 0. is correct}
demo% f77 -silent -autopar -reduction t3.f
demo% a.out
  Inf                              {With reduction, Inf is not correct}
demo%
Example: Roundoff, get the sum of 100,000 random numbers between -1 and +1:
demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub )    ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
         s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f77 -autopar -reduction t4.f
Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.
Number of Processors | Output
---|---
1 | s = 0.568582080884714E+02
2 | s = 0.568582080884722E+02
3 | s = 0.568582080884721E+02
4 | s = 0.568582080884724E+02
In this situation, roundoff error on the order of 10^-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.
This section describes the source code directives recognized by f77 5.0 and f90 2.0 to explicitly indicate which loops to parallelize and what strategy to use.
Explicit parallelization of a program requires prior analysis and deep understanding of the application code as well as the concepts of shared-memory parallelization.
DO loops are marked for parallelization by directives placed immediately before them. The compiler options -parallel and -explicitpar must be used for DO loops to be recognized and parallel code generated. Take care when choosing which loops to mark for parallelization. The compiler generates threaded, parallel code for all loops marked with DOALL directives, even if there are data dependencies that will cause the loop to compute incorrect results when run in parallel.
If you do your own multithreaded coding using the libthread primitives, do not use any of the compilers' parallelization options--the compilers cannot parallelize code that has already been parallelized with user calls to the threads library.
A loop is appropriate for explicit parallelization if:
It is a DO loop, not a DO WHILE loop or Fortran 90 array syntax.
The values of array variables for each iteration of the loop do not depend on the values of array variables for any other iteration of the loop.
If the loop changes a scalar, that scalar is not referenced after the loop terminates. Such scalar variables are not guaranteed to have a defined value after the loop terminates, since the compiler does not automatically ensure a proper storeback for them.
For each iteration, any subprogram that is invoked inside the loop does not reference or change values of array variables for any other iteration.
The DO loop index must be an integer.
A private variable or array is private to a single iteration of a loop. The value assigned to a private variable or array in one iteration is not propagated to any other iteration of the loop.
A shared variable or array is shared with all other iterations. The value assigned to a shared variable or array in an iteration is seen by other iterations of the loop.
If an explicitly parallelized loop contains shared references, then you must ensure that sharing does not cause correctness problems. The compiler does no synchronization on updates or accesses to shared variables.
If you specify a variable as private in one loop, and its only initialization is within some other loop, the value of that variable may be left undefined in the loop.
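For example, in this hypothetical sketch, x is assigned only in the first (serial) loop, so each thread's private copy of x in the explicitly parallelized loop starts out undefined:

      do i = 1, n
         x = a(i)                    <-- initializes only the original x
      end do
C$PAR DOALL PRIVATE(x)
      do j = 1, n
         b(j) = x + 1.0              <-- each thread's private x is undefined
      end do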
For Sun-style (C$PAR) explicit directives, the compiler uses default rules to determine whether a scalar or array is shared or private. You can override the default rules to specify the attributes of scalars or arrays referenced inside a loop. (With Cray-style !MIC$ directives, all variables that appear in the loop must be explicitly declared either shared or private on the DOALL directive.)
The compiler applies these default rules:
All scalars are treated as private. A local copy of the scalar is made on each processor, and that local copy is used within that processor.
All array references are treated as shared references. Any write of an array element by one processor is visible to all processors. No synchronization is performed on accesses to shared variables.
If inter-iteration dependencies exist in a loop, then the execution may result in erroneous results. You must ensure that these cases do not arise. The compiler may sometimes be able to detect such a situation at compile time and issue a warning, but it does not disable parallelization of such loops.
Example: Potential problem through equivalence:
      equivalence (a(1),y)
C$PAR DOALL
      do i = 1,n
         y = i
         a(i) = y
      end do
In the preceding example, since the scalar variable y has been equivalenced to a(1), it is no longer a private variable, even though the compiler treats it as such by the default scoping rule. Thus, the presence of the DOALL directive might lead to erroneous results when the parallelized i loop is executed.
You can fix the example by using C$PAR DOALL PRIVATE(y).
Parallelization directives are comment lines that tell the compiler to parallelize (or not to parallelize) the DO loop that follows the directive. Directives are also called pragmas.
A parallelization directive consists of one or more directive lines.
Sun-style directives are recognized by f77 and f90 by default (or with the -mp=sun option). A Sun-style directive line is defined as follows:
C$PAR Directive [ Qualifiers ]       <- Initial directive line
C$PAR& [More_Qualifiers]             <- Optional continuation lines
The letters of a directive line are case-insensitive.
The first five characters are C$PAR, *$PAR, or !$PAR.
An initial directive line has a blank in column 6.
A continuation directive line has a nonblank in column 6.
Directives are listed in columns 7 and beyond.
Qualifiers, if any, follow directives--on the same line or continuation lines.
Multiple qualifiers on one line are separated by commas.
Spaces before, after, or within a directive or qualifier are ignored.
Columns beyond 72 are ignored unless the -e option is specified.
The parallel directives and their actions are as follows:
Directive | Action
---|---
TASKCOMMON | Declares COMMON block private
DOALL | Parallelizes the next loop
DOSERIAL | Does not parallelize the next loop
DOSERIAL* | Does not parallelize the next nest of loops
Examples of f77 parallel directives:
C$PAR TASKCOMMON ALPHA               Declare block private
      COMMON /ALPHA/BZ,BY(100)

C$PAR DOALL                          No qualifiers

C$PAR DOSERIAL

C$PAR DOALL SHARED(I,K,X,V), PRIVATE(A)

The preceding one-line directive is equivalent to the three-line directive that follows:

C$PAR DOALL
C$PAR& SHARED(I,K,X,V)
C$PAR& PRIVATE(A)
The TASKCOMMON directive declares the variables in a global COMMON block as private. Every variable declared in a task common block becomes a private variable. Only named COMMON blocks can be declared TASKCOMMON.
The syntax of the directive is:
C$PAR TASKCOMMON common_block_name
The directive must appear immediately after the defining COMMON declaration.
This directive is effective only when compiled with -explicitpar or -parallel. Otherwise, the directive is ignored and the block is treated as a regular common block.
Variables declared in task common blocks are treated as private variables in every DOALL loop in which they appear explicitly, and in the routines called from such a loop where the specified common block is in scope.
It is an error to declare a common block as task common in some but not all compilation units where the block is defined. A check at runtime for task common consistency can be enabled by compiling the program with the -xcommonchk=yes flag. (Enable the runtime check only during program development, as it can degrade performance.)
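For example, a minimal sketch (the block name WORK and the variables are hypothetical):

      common /work/ tmp(1000)
C$PAR TASKCOMMON work
      ...
C$PAR DOALL
      do i = 1, n
         ...                         <-- each thread gets its own copy of tmp
      end do

demo% f77 -explicitpar -xcommonchk=yes prog.f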
The compilers will parallelize the DO loop following a DOALL directive (if compiled with the -parallel or -explicitpar options).
Analysis and transformation of reduction operations within loops is not done if they are explicitly parallelized.
Example: Explicit parallelization of a loop:
demo% cat t4.f
      ...
C$PAR DOALL
      do i = 1, n
         a(i) = b(i) * c(i)
      end do
      do k = 1, m
         x(k) = x(k) * z(k,k)
      end do
      ...
demo% f77 -explicitpar t4.f
A subprogram call in a loop (or in any subprograms called from within the called routine) may introduce data dependencies that could go unnoticed without a deep analysis of the data and control flow through the chain of calls. While it is best to parallelize outermost loops that do a significant amount of the work, these tend to be the very loops that involve subprogram calls.
Because such an interprocedural analysis is difficult and could greatly increase compilation time, the automatic parallelization modes do not attempt it. With explicit parallelization, the compiler generates parallelized code for a loop marked with a DOALL directive even if it contains calls to subprograms. It is still the programmer's responsibility to ensure that no data dependencies exist within the loop and everything the loop encloses, including called subprograms.
Multiple invocations of a routine from different processors can cause problems resulting from references to local static variables that interfere with each other. Making all the local variables in a routine automatic rather than static prevents this. Each invocation of a subprogram then has its own unique store of local variables maintained on the stack, and no two invocations will interfere with each other.
Local subprogram variables can be made automatic variables that reside on the stack either by listing them on an AUTOMATIC statement or by compiling the subprogram with the -stackvar option. However, local variables initialized in DATA statements must be rewritten to be initialized in actual assignments.
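For example, a hedged sketch of converting a subprogram's locals to automatic variables (the subprogram and variable names are hypothetical); note the DATA initialization rewritten as an assignment:

      subroutine work ( n, v )
      real v(n)
      automatic t, buf
      real t, buf(100)
*     data t / 0.0 /                 <-- DATA would make t static
      t = 0.0                        <-- initialize by assignment instead
      ...
      return
      end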
Allocating local variables to the stack can cause stack overflow. See "Stacks, Stack Sizes, and Parallelization" about increasing the size of the stack.
Data dependencies can still be introduced through the data passed down the call tree as arguments or through COMMON blocks. This data flow should be analyzed carefully before parallelizing a loop with subprogram calls.
All qualifiers on the DOALL directive are optional. The following table summarizes them:
Table 10-4 DOALL Qualifiers
Qualifier | Assertion | Syntax
---|---|---
PRIVATE | Do not share variables u1, u2, ... between iterations | DOALL PRIVATE(u1,u2,...)
SHARED | Share variables v1, v2, ... between iterations | DOALL SHARED(v1,v2,...)
MAXCPUS | Use no more than n CPUs | DOALL MAXCPUS(n)
READONLY | The listed variables are not modified in the DOALL loop | DOALL READONLY(v1,v2,...)
SAVELAST | Save the last DO iteration values of all private variables | DOALL SAVELAST
STOREBACK | Save the last DO iteration values of variables v1, v2, ... | DOALL STOREBACK(v1,v2,...)
REDUCTION | Treat the variables v1, v2, ... as reduction variables | DOALL REDUCTION(v1,v2,...)
SCHEDTYPE | Set the scheduling type to t | DOALL SCHEDTYPE(t)
The PRIVATE(varlist) qualifier specifies that all scalars and arrays in the list varlist are private for the DOALL loop. Both arrays and scalars can be specified as private. In the case of an array, each thread of the DOALL loop gets a copy of the entire array. All other scalars and arrays referenced in the DOALL loop, but not contained in the private list, conform to their appropriate default scoping rules.
Example: Specify a private array:
C$PAR DOALL PRIVATE(a)
      do i = 1, n
         a(1) = b(i)
         do j = 2, n
            a(j) = a(j-1) + b(j) * c(j)
         end do
         x(i) = f(a)
      end do
In the preceding example, the array a is specified as private to the i loop.
The SHARED(varlist) qualifier specifies that all scalars and arrays in the list varlist are shared for the DOALL loop. Both arrays and scalars can be specified as shared. Shared scalars and arrays are common to all the iterations of a DOALL loop. All other scalars and arrays referenced in the DOALL loop, but not contained in the shared list, conform to their appropriate default scoping rules.
Example: Specify a shared variable:
      equivalence (a(1),y)
C$PAR DOALL SHARED(y)
      do i = 1,n
         a(i) = y
      end do
In the preceding example, the variable y has been specified as a variable whose value should be shared among the iterations of the i loop.
The READONLY(varlist) qualifier specifies that all scalars and arrays in the list varlist are read-only for the DOALL loop. Read-only scalars and arrays are a special class of shared scalars and arrays that are not modified in any iteration of the DOALL loop. Specifying scalars and arrays as READONLY indicates to the compiler that it does not need to use a separate copy of that variable or array for each thread of the DOALL loop.
Example: Specify a read-only variable:
      x = 3
C$PAR DOALL SHARED(x), READONLY(x)
      do i = 1, n
         b(i) = x + 1
      end do
In the preceding example, x is a shared variable, but the compiler can rely on the fact that it will not change over each iteration of the i loop because of its READONLY specification.
A STOREBACK variable or array is one whose value is computed in a DOALL loop. The computed value can be used after the termination of the loop. In other words, the last loop iteration values of storeback scalars and arrays may be visible outside of the DOALL loop.
Example: Specify the loop index variable as storeback:
C$PAR DOALL PRIVATE(x), STOREBACK(x,i)
      do i = 1, n
         x = ...
      end do
      ... = i
      ... = x
In the preceding example, both the variables x and i are STOREBACK variables, even though both variables are private to the i loop.
There are some potential problems with STOREBACK, however.
The STOREBACK operation occurs at the last iteration of the explicitly parallelized loop, even if this is the same iteration that last updates the value of the STOREBACK variable or array.
Example: STOREBACK variable potentially different from the serial version:
C$PAR DOALL PRIVATE(x), STOREBACK(x)
      do i = 1, n
         if (...) then
            x = ...
         end if
      end do
      print *, x
In the preceding example, the value of the STOREBACK variable x that is printed out might not be the same as that printed out by a serial version of the i loop. In the explicitly parallelized case, the processor that processes the last iteration of the i loop (when i = n) and performs the STOREBACK operation for x, might not be the same processor that currently contains the last updated value of x. The compiler issues a warning message about these potential problems.
In an explicitly parallelized loop, arrays are not treated by default as STOREBACK, so include them in the list varlist if such a storeback operation is desired--for example, if the arrays have been declared as private.
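For example, a hedged sketch of a private array declared STOREBACK so that its values from the last iteration remain visible after the loop:

C$PAR DOALL PRIVATE(a), STOREBACK(a)
      do i = 1, n
         do j = 1, m
            a(j) = ...
         end do
         ...
      end do
      ... = a(1)                     <-- values from the i = n iteration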
The SAVELAST qualifier specifies that all private scalars and arrays are STOREBACK for the DOALL loop. A STOREBACK variable or array is one whose value is computed in a DOALL loop; this computed value can be used after the termination of the loop. In other words, the last loop iteration values of STOREBACK scalars and arrays may be visible outside of the DOALL loop.
C$PAR DOALL PRIVATE(x,y), SAVELAST
      do i = 1, n
         x = ...
         y = ...
      end do
      ... = i
      ... = x
      ... = y
In the preceding example, variables x, y, and i are STOREBACK variables.
The REDUCTION(varlist) qualifier specifies that all variables in the list varlist are reduction variables for the DOALL loop. A reduction variable (or array) is one whose partial values can be individually computed on various processors, and whose final value can be computed from all its partial values.
The presence of a list of reduction variables can aid the compiler in identifying if a DOALL loop is a reduction loop and in generating parallel reduction code for it.
Example: Specify a reduction variable:
C$PAR DOALL REDUCTION(x)
      do i = 1, n
         x = x + a(i)
      end do
In the preceding example, the variable x is a (sum) reduction variable; the i loop is a (sum) reduction loop.
The SCHEDTYPE(t) qualifier specifies that the specific scheduling type t be used to schedule the DOALL loop.
Table 10-5 DOALL SCHEDTYPE Qualifiers
Scheduling Type | Action
---|---
STATIC | Use static scheduling for this DO loop. Distribute all iterations uniformly to all available processors. Example: With 1000 iterations and 4 CPUs, each CPU gets a single iteration in turn until all the iterations have been distributed.
SELF[(chunksize)] | Use self-scheduling for this DO loop. Distribute chunksize iterations to each available processor; repeat with the remaining iterations until all the iterations have been processed. If chunksize is not provided, f77 selects a value. Example: With 1000 iterations and a chunksize of 4, distribute 4 iterations to each CPU.
FACTORING[(m)] | Use factoring scheduling for this DO loop. With n iterations initially and k CPUs, distribute n/(2k) iterations uniformly to each processor until all iterations have been processed. At least m iterations must be assigned to each processor; there can be one final smaller residual chunk. If m is not provided, f77 selects a value. Example: With 1000 iterations, FACTORING(4), and 4 CPUs, distribute 125 iterations to each CPU, then 62, then 31, and so on.
GSS[(m)] | Use guided self-scheduling for this DO loop. With n iterations initially and k CPUs, assign n/k iterations to the first processor, then the remaining iterations divided by k to the second processor, and so on, until all iterations have been processed. At least m iterations must be assigned to each CPU; there can be one final smaller residual chunk. If m is not provided, f77 selects a value. Example: With 1000 iterations, GSS(10), and 4 CPUs, distribute 250 iterations to the first CPU, then 187 to the second, then 140 to the third, and so on.
Qualifiers can appear multiple times with cumulative effect. In the case of conflicting qualifiers, the compiler issues a warning message, and the qualifier appearing last prevails.
Example: A three-line Sun-style directive:
C$PAR DOALL MAXCPUS(4), READONLY(S), PRIVATE(A,B,X), MAXCPUS(2)
C$PAR DOALL SHARED(B,X,Y), PRIVATE(Y,Z)
C$PAR DOALL READONLY(T)
Example: A one-line equivalent of the preceding three lines (note duplicate MAXCPUS and conflicting SHARED/PRIVATE):
C$PAR DOALL MAXCPUS(2), PRIVATE(A,Y,Z), SHARED(B,X), READONLY(S,T)
The DOSERIAL directive disables parallelization of the specified loop. This directive applies to the one loop immediately following it (if you compile it with -explicitpar or -parallel).
Example: Exclude one loop from parallelization:
      do i = 1, n
C$PAR DOSERIAL
         do j = 1, n
            do k = 1, n
               ...
            end do
         end do
      end do
In the preceding example, the j loop is not parallelized, but the i or k loop can be.
The DOSERIAL* directive disables parallelization of the specified nest of loops. This directive applies to the whole nest of loops immediately following it (if you compile with -explicitpar or -parallel).
Example: Exclude a whole nest of loops from parallelization:
      do i = 1, n
C$PAR DOSERIAL*
         do j = 1, n
            do k = 1, n
               ...
            end do
         end do
      end do
In the preceding loops, the j and k loops are not parallelized; the i loop could be.
If both DOSERIAL and DOALL are specified, the last one prevails.
Example: Specifying both DOSERIAL and DOALL:
C$PAR DOSERIAL*
      do i = 1, 1000
C$PAR DOALL
         do j = 1, 1000
            ...
         end do
      end do
In the preceding example, the i loop is not parallelized, but the j loop is.
Also, the scope of the DOSERIAL* directive does not extend beyond the textual loop nest immediately following it. The directive is limited to the same function or subroutine that it is in.
Example: DOSERIAL* does not extend to a loop of a called subroutine:
      program caller
      common /block/ a(10,10)
C$PAR DOSERIAL*
      do i = 1, 10
         call callee(i)
      end do
      end

      subroutine callee(k)
      common /block/ a(10,10)
      do j = 1, 10
         a(j,k) = j + k
      end do
      return
      end
In the preceding example, DOSERIAL* applies only to the i loop and not to the j loop, regardless of whether the call to the subroutine callee is inlined.
In general, the compiler parallelizes a loop if you explicitly direct it to. There are exceptions--some loops the compiler just cannot parallelize.
The following are the primary detectable inhibitors that might prevent explicitly parallelizing a DO loop:
The DO loop is nested inside another DO loop that is parallelized.
This exception holds for indirect nesting, too. If you explicitly parallelize a loop that includes a call to a subroutine, then even if you parallelize loops in that subroutine, those loops are not run in parallel at runtime.
A flow control statement allows jumping out of the DO loop.
The index variable of the loop is subject to side effects, such as being equivalenced.
If you compile with -vpara, you may get a warning message if f77/f90 detects a problem with explicitly parallelizing a loop; f77/f90 may still parallelize the loop. The following table lists typical parallelization problems, showing which loops are parallelized anyway and which generate warning messages with -vpara.
Table 10-6 Explicit Parallelization Problems
Problem | Parallelized | Message
---|---|---
Loop is nested inside another loop that is parallelized. | No | No
Loop is in a subroutine, and a call to the subroutine is in a parallelized loop. | No | No
Jumping out of loop is allowed by a flow control statement. | No | Yes
Index variable of loop is subject to side effects. | Yes | No
Some variable in the loop keeps a loop-carried dependency. | Yes | Yes
I/O statement in the loop--usually unwise, because the order of the output is not predictable. | Yes | No
Example: Nested loops:
      ...
C$PAR DOALL
      do 900 i = 1, 1000           ! Parallelized (outer loop)
         do 200 j = 1, 1000        ! Not parallelized, no warning
            ...
 200     continue
 900  continue
      ...
demo% f77 -explicitpar -vpara t6.f
Example: A parallelized loop in a subroutine:
C$PAR DOALL
      do 100 i = 1, 200
         ...
         call calc (a, x)
         ...
 100  continue
      ...

      subroutine calc ( b, y )
      ...
C$PAR DOALL
      do 1 m = 1, 1000
         ...
 1    continue
      return
      end

demo% f77 -explicitpar -vpara t.f

At runtime, the loop in the caller runs in parallel, but the loop in the subroutine does not.
In the preceding example, the loop within the subroutine is not parallelized because the subroutine itself is run in parallel.
Example: Jumping out of a loop:
C$PAR DOALL
      do i = 1, 1000               ! <- Not parallelized, with warning
         ...
         if (a(i) .gt. min_threshold ) go to 20
         ...
      end do
 20   continue
      ...
demo% f77 -explicitpar -vpara t9.f
Example: An index variable subject to side effects:
      equivalence ( a(1), y )      ! <- Source of possible side effects
      ...
C$PAR DOALL
      do i = 1, 2000               ! <- Parallelized: no warning, but not safe
         y = i
         a(i) = y
      end do
      ...
demo% f77 -explicitpar -vpara t11.f
Example: A variable in a loop has a loop-carried dependency:
C$PAR DOALL
      do 100 i = 1, 200            ! Parallelized, with warning
         y = y * i                 ! y has a loop-carried dependency
         a(i) = y
 100  continue
      ...
demo% f77 -explicitpar -vpara t12.f
You can do I/O in a loop that executes in parallel, provided that:
It does not matter that the output from different threads is interleaved (program output is nondeterministic).
You can ensure the safety of executing the loop in parallel.
Example: I/O statement in a loop:
C$PAR DOALL
      do i = 1, 10                 ! Parallelized with no warning (not advisable)
         k = i
         call show ( k )
      end do
      end
      subroutine show( j )
      write(6,1) j
 1    format('Line number ', i3, '.')
      end

demo% f77 -silent -explicitpar -vpara t13.f
demo% setenv PARALLEL 2
demo% a.out

(The output displays the numbers 1 through 10, but in a different order each time.)
Example: Recursive I/O:
      do i = 1, 10                 ! <- Parallelized with no warning -- unsafe
         k = i
         print *, list( k )        ! <- list is a function that does I/O
      end do
      end
      function list( j )
      write(6,"('Line number ', i3, '.')") j
      list = j
      end

demo% f77 -silent -mt t14.f
demo% setenv PARALLEL 2
demo% a.out
In the preceding example, the program may deadlock in libF77_mt and hang. Press Control-C to regain keyboard control.
There are situations where the programmer might not be aware that I/O could take place within a parallelized loop. Consider a user-supplied exception handler that prints output when it catches an arithmetic exception (like divide by zero). If a parallelized loop provokes an exception, the implicit I/O from the handler may cause I/O deadlocks and a system hang.
In general:
The library libF77_mt is MT safe, but mostly not MT hot.
You cannot do recursive (nested) I/O if you compile with -mt.
As an informal definition, an interface is MT safe if:
It can be simultaneously invoked by more than one thread of control.
The caller is not required to do any explicit synchronization before calling the function.
The interface is free of data races.
A data race occurs when memory is updated by more than one thread and that memory is not protected by a lock. The value of that memory is nondeterministic--the threads race to see which one gets to update it (though in this case, the one that gets there last wins).
An interface is colloquially called MT hot if the implementation has been tuned for performance advantage, using the techniques of multithreading. For some formal definitions of multithreading technology, read the Solaris Multithreaded Programming Guide.
Parallel directives have two forms: Sun style and Cray style. The f77 and f90 default is Sun style (-mp=sun). To use Cray-style directives, you must compile with -mp=cray.
Mixing program units compiled with both Sun and Cray directives can produce different results.
A major difference between Sun and Cray directives is that Cray style requires explicit scoping of every scalar and array in the loop as either SHARED or PRIVATE.
The following table shows Cray-style directive syntax.
!MIC$ DOALL
!MIC$& SHARED( v1, v2, ... )
!MIC$& PRIVATE( u1, u2, ... )
       ... optional qualifiers
A parallel directive consists of one or more directive lines. A directive line is defined as follows:
The directive line is case insensitive.
The first five characters are CMIC$, *MIC$, or !MIC$.
An initial directive line has a blank in column 6.
A continuation directive line has a nonblank in column 6.
Directives are listed in columns 7 and beyond.
Qualifiers, if any, follow directives--on the same line or continuation lines.
Multiple qualifiers on a line are separated by commas.
Every variable and array in the loop must appear in a SHARED or PRIVATE qualifier.
Spaces before, after, or within a directive or qualifier are ignored.
Columns beyond 72 are ignored.
With f90 -free free-format, leading blanks can appear before !MIC$.
For Cray-style directives, the PRIVATE qualifier is required. Each variable within the DO loop must be qualified as private or shared, and the DO loop index must always be private. The following table summarizes available Cray-style qualifiers.
Table 10-7 DOALL Qualifiers (Cray Style)
Qualifier | Assertion
---|---
SHARED( v1, v2, ... ) | Share the variables v1, v2, ... between parallel processes. That is, they are accessible to all the tasks.
PRIVATE( x1, x2, ... ) | Do not share the variables x1, x2, ... between parallel processes. That is, each task has its own private copy of these variables.
SAVELAST | Save the values of private variables from the last DO iteration.
MAXCPUS( n ) | Use no more than n CPUs.
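For example, a minimal sketch of a Cray-style parallel loop (compiled with -mp=cray -explicitpar); note that every variable, including the loop index, is explicitly scoped:

CMIC$ DOALL
CMIC$& SHARED( a, b, n )
CMIC$& PRIVATE( i )
      do i = 1, n
         a(i) = a(i) + b(i)
      end do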
For Cray-style directives, the DOALL directive allows a single scheduling qualifier, for example, !MIC$& CHUNKSIZE(100). Table 10-8 shows the Cray-style DOALL scheduling qualifiers.
Table 10-8 DOALL Cray Scheduling
Qualifier | Assertion
---|---
GUIDED | Distribute the iterations by use of guided self-scheduling. This distribution minimizes synchronization overhead, with acceptable dynamic load balancing.
SINGLE | Distribute one iteration to each available processor.
CHUNKSIZE( n ) | Distribute n iterations to each available processor. n may be an expression. For best performance, n must be an integer constant. Example: With 100 iterations and CHUNKSIZE(4), distribute 4 iterations to each CPU.
NUMCHUNKS( m ) | If there are n iterations, distribute n/m iterations to each available processor. There can be one smaller residual chunk. m may be an expression. For best performance, m must be an integer constant. Example: With 100 iterations and NUMCHUNKS(4), distribute 25 iterations to each CPU.
The f77 default scheduling type is the Sun-style STATIC. The f90 default is GUIDED.
In addition to the situations listed in "Inhibitors to Explicit Parallelization", the following also inhibit parallelization with f90:
The DO increment parameter, if specified, is a variable (see the sketch after this list).
There is an I/O statement in the loop.
Parallelized loops in subprograms called from parallelized loops are, in fact, not run in parallel.
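For example, in this hypothetical f90 fragment, the explicitly marked loop is not parallelized because its increment is a variable:

      m = 2
C$PAR DOALL
      do i = 1, n, m               ! variable increment inhibits parallelization
         a(i) = b(i)
      end do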
Compiling with the -g option cancels any of the parallelization options -autopar, -explicitpar, and -parallel, as well as -reduction and -depend. Some alternative ways to debug parallelized code are suggested in the following section.
Debugging parallelized programs requires some cleverness. The following schemes suggest ways to approach the problem:
Turn off parallelization.
You can do one of the following:
Turn off the parallelization options--Verify that the program works correctly by compiling with -O3 or -O4, but without any parallelization.
Set the CPUs to one--run the program with the environment variable PARALLEL=1.
If the problem disappears, then you know it was due to parallelization.
Check also for out of bounds array references by compiling with -C.
Problems using -autopar may indicate that the compiler is parallelizing something it should not.
Turn off -reduction.
If you are using the -reduction option, summation reduction may be occurring and yielding slightly different answers. Try running without this option.
Reduce the number of compile options.
Compile with just -parallel -O3 and check the results.
Use fsplit or f90split.
If you have a lot of subroutines in your program, use fsplit(1) to break them into separate files. (Use f90split(1) on Fortran 90 source codes.) Then compile some files with and without -parallel, and use f77 or f90 to link the .o files. You must specify -parallel on this link step as well. (See Fortran User's Guide section on consistent compiling and linking.)
Execute the binary and verify results.
Repeat this process until the problem is narrowed down to one subroutine.
You can proceed using a dummy subroutine or explicit parallelization to track down the loop that causes the problem.
Use -loopinfo.
Check which loops are being parallelized and which loops are not.
Use a dummy subroutine.
Create a dummy subroutine or function that does nothing. Put calls to this subroutine in a few of the loops that are being parallelized. Recompile and execute. Use -loopinfo to see which loops are being parallelized.
Continue this process until you start getting the correct results.
Then remove the calls from the other loops, compile, and execute to verify that you are getting the correct results.
Use explicit parallelization.
Add the C$PAR DOALL directive to a couple of the loops that are being parallelized. Compile with -explicitpar, then execute and verify the results. Use -loopinfo to see which loops are being parallelized. This method permits the addition of I/O statements to the parallelized loop.
Repeat this process until you find the loop that causes the wrong results.
If you want explicit parallelization only (without automatic parallelization), do not compile with both -explicitpar and -depend. That combination is equivalent to compiling with -parallel, which, of course, includes -autopar.
Run loops backward serially.
Replace DO I=1,N with DO I=N,1,-1. Different results point to data dependencies.
Avoid passing the loop index itself as an argument in a call; it is safer to copy it to a temporary variable in the loop body and pass the temporary instead:
Replace:

      DO I=1,N
         ...
         CALL SNUBBER(I)
         ...
      ENDDO

With:

      DO I1=1,N
         I=I1
         ...
         CALL SNUBBER(I)
         ...
      ENDDO
To use dbx on a parallel loop, temporarily rewrite the program as follows:
Isolate the body of the loop in a file and subroutine of its own.
In the original routine, replace loop body with a call to the new subroutine.
Compile the new subroutine with -g and no parallelization options.
Compile the changed original routine with parallelization and no -g.
Example: Manually transform a loop to allow using dbx in parallel:
Original code:

demo% cat loop.f
C$PAR DOALL
      DO i = 1,10
         WRITE(0,*) 'Iteration ', i
      END DO
      END

Split into two parts: caller loop and loop body as a subroutine:

demo% cat loop1.f
C$PAR DOALL
      DO i = 1,10
         k = i
         CALL loop_body ( k )
      END DO
      END

demo% cat loop2.f
      SUBROUTINE loop_body ( k )
      WRITE(0,*) 'Iteration ', k
      RETURN
      END

Compile the caller loop with parallelization but no debugging:

demo% f77 -O3 -c -explicitpar loop1.f

Compile the subprogram with debugging but not parallelized:

demo% f77 -c -g loop2.f

Link both parts into a.out:

demo% f77 loop1.o loop2.o -explicitpar

Run a.out under dbx and put a breakpoint in the loop body subroutine:

demo% dbx a.out
        <- Various dbx messages not shown
(dbx) stop in loop_body
(2) stop in loop_body
(dbx) run
Running: a.out
(process id 28163)
t@1 (l@1) stopped in loop_body at line 2 in file "loop2.f"
    2         write(0,*) 'Iteration ', k

Now show the value of k:

(dbx) print k
k = 1
        <- Various values other than 1 are possible
(dbx)