CHAPTER 10

Parallelization

This chapter presents an overview of multiprocessor parallelization and describes the capabilities of Fortran 95 on SPARC multiprocessor platforms. (Parallelization options are not currently available on x86 platforms.)

See also Techniques for Optimizing Applications: High Performance Computing by Rajat Garg and Ilya Sharapov, a Sun Microsystems BluePrints publication (http://www.sun.com/blueprints/pubs.html).


10.1 Essential Concepts

Parallelizing (or multithreading) an application means compiling the program to run on a multiprocessor system or in a multithreaded environment. Parallelization enables a single task, such as a DO loop, to run over multiple processors (or threads) with a potentially significant execution speedup.

Before an application program can be run efficiently on a multiprocessor system like the Ultra™ 60, Sun Enterprise™ Server 6500, or Sun Enterprise Server 10000, it needs to be multithreaded. That is, tasks that can be performed in parallel need to be identified and reprogrammed to distribute their computations across multiple processors or threads.

Multithreading an application can be done manually by making appropriate calls to the libthread primitives. However, a significant amount of analysis and reprogramming might be required. (See the Solaris Multithreaded Programming Guide for more information.)

Sun compilers can automatically generate multithreaded object code to run on multiprocessor systems. The Fortran compilers focus on DO loops as the primary language element supporting parallelism. Parallelization distributes the computational work of a loop over several processors without requiring modifications to the Fortran source program.

The choice of which loops to parallelize and how to distribute them can be left entirely up to the compiler (-autopar), specified explicitly by the programmer with source code directives (-explicitpar), or done in combination (-parallel).



Note - Programs that do their own (explicit) thread management should not be compiled with any of the compiler's parallelization options. Explicit multithreading (calls to libthread primitives) cannot be combined with routines compiled with these parallelization options.



Not all loops in a program can be profitably parallelized. Loops containing only a small amount of computational work (compared to the overhead spent starting and synchronizing parallel tasks) may actually run more slowly when parallelized. Also, some loops cannot be safely parallelized at all; they would compute different results when run in parallel due to dependencies between statements or iterations.

Implicit loops (IF loops and Fortran 95 array syntax, for example) as well as explicit DO loops are candidates for automatic parallelization by the Fortran compilers.
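For example, both forms of the following computation are candidates (a minimal sketch; the array names are hypothetical):

      real a(1000), b(1000)
      ...
      do i = 1, 1000         ! explicit DO loop
        a(i) = a(i) + b(i)
      end do
      ...
      a = a + b              ! equivalent Fortran 95 array syntax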

f95 can detect loops that might be safely and profitably parallelized automatically. However, in most cases, the analysis is necessarily conservative, due to the concern for possible hidden side effects. (A display of which loops were and were not parallelized can be produced by the -loopinfo option.) By inserting source code directives before loops, you can explicitly influence the analysis, controlling how a specific loop is (or is not) to be parallelized. However, it then becomes your responsibility to ensure that such explicit parallelization of a loop does not lead to incorrect results.

The Fortran 95 compiler provides explicit parallelization by implementing the OpenMP 2.0 Fortran API directives. For legacy programs, f95 also supports the older Sun and Cray style directives. OpenMP has become an informal standard for explicit parallelization in Fortran 95, C, and C++ and is recommended over the older directive styles.

For information on OpenMP, see the OpenMP API User's Guide, or the OpenMP web site at http://www.openmp.org/.

For a discussion of legacy parallelization directives, see Section 10.3.3, Sun-Style Parallelization Directives, and Section 10.3.4, Cray-Style Parallelization Directives.

10.1.1 Speedups--What to Expect

If you parallelize a program so that it runs over four processors, can you expect it to take (roughly) one fourth the time that it did with a single processor (a fourfold speedup)?

Probably not. It can be shown (by Amdahl's law) that the overall speedup of a program is strictly limited by the fraction of the execution time spent in code running in parallel. This is true no matter how many processors are applied. In fact, if p is the percentage of the total program execution time that runs in parallel mode, the theoretical speedup limit is 100/(100-p); therefore, if only 60% of a program's execution runs in parallel, the maximum increase in speed is 2.5, independent of the number of processors. And with just four processors, the theoretical speedup for this program (assuming maximum efficiency) would be just 1.8 and not 4. With overhead, the actual speedup would be less.
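Stated as a formula, if f = p/100 is the fraction of execution that runs in parallel and N is the number of processors, the expected speedup is 1/((1 - f) + f/N). For the example above, f = 0.60 and N = 4 give 1/(0.40 + 0.15), or about 1.8; letting N grow without bound gives the 100/(100-p) = 2.5 limit.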

As with any optimization, choice of loops is critical. Parallelizing loops that participate only minimally in the total program execution time has only minimal effect. To be effective, the loops that consume the major part of the runtime must be parallelized. The first step, therefore, is to determine which loops are significant and to start from there.

Problem size also plays an important role in determining the fraction of the program running in parallel and consequently the speedup. Increasing the problem size increases the amount of work done in loops. A triply nested loop could see a cubic increase in work. If the outer loop in the nest is parallelized, a small increase in problem size could contribute to a significant performance improvement (compared to the unparallelized performance).

10.1.2 Steps to Parallelizing a Program

Here is a very general outline of the steps needed to parallelize an application:

1. Optimize. Use the appropriate set of compiler options to get the best serial performance on a single processor.

2. Profile. Using typical test data, determine the performance profile of the program. Identify the most significant loops.

3. Benchmark. Determine that the serial test results are accurate. Use these results and the performance profile as the benchmark.

4. Parallelize. Use a combination of options and directives to compile and build a parallelized executable.

5. Verify. Run the parallelized program on a single processor and single thread and check results to find instabilities and programming errors that might have crept in. (Set $PARALLEL or $OMP_NUM_THREADS to 1; see Section 10.1.5, Number of Threads).

6. Test. Make various runs on several processors to check results.

7. Benchmark. Make performance measurements with various numbers of processors on a dedicated system. Measure performance changes with changes in problem size (scalability).

8. Repeat steps 4 to 7. Make improvements to your parallelization scheme based on performance.

10.1.3 Data Dependence Issues

Not all loops are parallelizable. Running a loop in parallel over a number of processors usually results in iterations executing out of order. Moreover, the multiple processors executing the loop in parallel may interfere with each other whenever there are data dependencies in the loop.

Situations where data dependence issues arise include recurrence, reduction, indirect addressing, and data dependent loop iterations.

10.1.3.1 Data Dependent Loops

You might be able to rewrite a loop to eliminate data dependencies, making it parallelizable. However, extensive restructuring could be needed.

As a general rule, a loop can be parallelized only if the values computed in one iteration do not depend on values computed in any other iteration. Recurrences, reductions into a shared variable, and stores through possibly repeated indirect addresses (described in the subsections that follow) all violate this condition.

These are general conditions for parallelization. The compilers' automatic parallelization analysis considers additional criteria when deciding whether to parallelize a loop. However, you can use directives to explicitly force loops to be parallelized, even loops that contain inhibitors and would produce incorrect results.

10.1.3.2 Recurrence

Variables that are set in one iteration of a loop and used in a subsequent iteration introduce cross-iteration dependencies, or recurrences. Recurrence in a loop requires that the iterations be executed in the proper order. For example:

   DO I=2,N
      A(I) = A(I-1)*B(I)+C(I)
   END DO

requires the value computed for A(I) in the previous iteration to be used (as A(I-1)) in the current iteration. To produce correct results, iteration I must complete before iteration I+1 can execute.

10.1.3.3 Reduction

Reduction operations reduce the elements of an array into a single value. For example, summing the elements of an array into a single variable involves updating that variable in each iteration:

   DO K = 1,N
     SUM = SUM + A(K)*B(K)
   END DO

If each processor running this loop in parallel takes some subset of the iterations, the processors will interfere with each other, overwriting the value in SUM. To compute the correct result, each processor must perform its update of SUM one at a time, although the order is not significant.

Certain common reduction operations are recognized and handled as special cases by the compiler.
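For example, marking the loop above with an OpenMP REDUCTION clause (described later in this chapter) lets each thread accumulate a private partial sum that is combined safely at the end; a minimal sketch:

!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO K = 1, N
        SUM = SUM + A(K)*B(K)
      END DO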

10.1.3.4 Indirect Addressing

Loop dependencies can result from stores into arrays that are indexed in the loop by subscripts whose values are not known. For example, indirect addressing could be order dependent if there are repeated values in the index array:

   DO L = 1,NW
     A(ID(L)) = A(L) + B(L)
   END DO

In the example, repeated values in ID cause elements in A to be overwritten. In the serial case, the last store is the final value. In the parallel case, the order is not determined. The values of A(L) that are used, old or updated, are order dependent.

10.1.4 Compiling for Parallelization

The following table shows the f95 compilation options related to parallelization.

TABLE 10-1 Parallelization Options

Option                                   Flag
Automatic (only)                         -autopar
Automatic and Reduction                  -autopar -reduction
Explicit (only)                          -explicitpar
Automatic and Explicit                   -parallel
Automatic and Reduction and Explicit     -parallel -reduction
Show which loops are parallelized        -loopinfo
Show warnings with explicit              -vpara
Allocate local variables on stack        -stackvar
Enable Sun-style MP directives           -mp=sun
Enable Cray-style MP directives          -mp=cray
Compile for OpenMP parallelization       -openmp


A note on these options: the Sun Studio compilers now support the OpenMP parallelization model natively as the primary parallelization model. Sun- and Cray-style parallelization, as described in this chapter, applies to legacy applications. For information on OpenMP parallelization, see the OpenMP API User's Guide.

10.1.5 Number of Threads

The PARALLEL (or OMP_NUM_THREADS) environment variable controls the maximum number of threads available to the program. Setting the environment variable tells the runtime system the maximum number of threads the program can use. The default is 1. In general, set the PARALLEL or OMP_NUM_THREADS variable to the available number of processors on the target platform.

The following example shows how to set it:

demo% setenv PARALLEL 4       C shell

-or-

demo$ PARALLEL=4               Bourne/Korn shell
demo$ export PARALLEL

In this example, setting PARALLEL to four enables the execution of a program using at most four threads. If the target machine has four processors available, the threads will map to independent processors. If there are fewer than four processors available, some threads could run on the same processor as others, possibly degrading performance.

The SunOS™ operating system command psrinfo(1M) displays a list of the processors available on a system:

demo% psrinfo
0      on-line   since 03/18/99 15:51:03
1      on-line   since 03/18/99 15:51:03
2      on-line   since 03/18/99 15:51:03
3      on-line   since 03/18/99 15:51:03

10.1.6 Stacks, Stack Sizes, and Parallelization

The executing program maintains a main memory stack for the initial thread executing the program, as well as distinct stacks for each helper thread. Stacks are temporary memory address spaces used to hold arguments and AUTOMATIC variables over subprogram invocations.

The default size of the main stack is about 8 megabytes. The Fortran compilers normally allocate local variables and arrays as STATIC (not on the stack). However, the -stackvar option forces the allocation of all local variables and arrays on the stack (as if they were AUTOMATIC variables). Use of -stackvar is recommended with parallelization because it improves the optimizer's ability to parallelize subprogram calls in loops. -stackvar is required with explicitly parallelized loops containing subprogram calls. (See the discussion of -stackvar in the Fortran User's Guide.)
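For example, a typical compile command combining these options might look like the following (the source file name is hypothetical):

demo% f95 -O4 -parallel -stackvar -loopinfo prog.f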

Using the C shell (csh), the limit command displays the current main stack size as well as sets it:

demo% limit             C shell example
cputime       unlimited
filesize       unlimited
datasize       2097148 kbytes
stacksize       8192 kbytes            <- current main stack size
coredumpsize       0 kbytes
descriptors       64 
memorysize       unlimited
demo% limit stacksize 65536       <- set main stack to 64Mb
demo% limit stacksize
stacksize       65536 kbytes

With Bourne or Korn shells, the corresponding command is ulimit:

demo$ ulimit -a         Korn Shell example
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         2097148
stack(kbytes)        8192
coredump(blocks)     0
nofiles(descriptors) 64
vmemory(kbytes)      unlimited
demo$ ulimit -s 65536
demo$ ulimit -s
65536

Each helper thread of a multithreaded program has its own thread stack. This stack mimics the initial thread stack but is unique to the thread. The thread's PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 8 Megabytes on SPARC V9 (UltraSPARC) platforms, 4 Megabytes otherwise. The size is set with the STACKSIZE environment variable:

demo% setenv STACKSIZE 8192    <- Set thread stack size to 8 Mb    C shell
                          -or-
demo$ STACKSIZE=8192           Bourne/Korn Shell
demo$ export STACKSIZE

Setting the thread stack size to a value larger than the default may be necessary for some parallelized Fortran codes. However, it may not be possible to know just how large it should be, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.


10.2 Automatic Parallelization

With the -autopar and -parallel options, the f95 compiler automatically finds DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.

10.2.1 Loop Parallelization

The compiler's dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.

For example, with four CPUs and a parallelized loop with 1000 iterations, each thread would execute a chunk of 250 iterations:

Processor 1 executes iterations    1 through  250
Processor 2 executes iterations  251 through  500
Processor 3 executes iterations  501 through  750
Processor 4 executes iterations  751 through 1000


Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependence analysis rejects from parallelization those loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.

Note that the compiler always parallelizes loops using static loop scheduling--simply dividing the work in the loop into equal blocks of iterations. Other scheduling schemes may be specified using the explicit parallelization directives described later in this chapter.

10.2.2 Arrays, Scalars, and Pure Scalars

A few definitions, from the point of view of automatic parallelization, are needed: an array is a variable declared with at least one dimension; a scalar is a variable that is not an array; a pure scalar is a scalar variable that does not appear in an EQUIVALENCE or POINTER statement.

Example: Array/scalar:

      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...

Both m and a are array variables; s is pure scalar. The variables u, x, z, and px are scalar variables, but not pure scalars.

10.2.3 Automatic Parallelization Criteria

DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar or -parallel. In general, a loop is a candidate only if the values computed in each iteration do not depend on values computed in any other iteration, and calculations within the loop do not change a scalar variable in a cumulative way across iterations (but see Section 10.2.4, Automatic Parallelization With Reduction Operations).

10.2.3.1 Apparent Dependencies

The compilers may automatically eliminate a reference that appears to create a data dependence in the loop. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.

Example: Using -autopar, with dependencies eliminated by private arrays:

      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000             <--Parallelized
        do k = 1, n                   
          a(k) = b(k) + 2.0
        end do
        do j = 1, n-1
          c(i,j) = a(j+1) + 2.3
        end do
      end do
      end

In the example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a appear to result in a data dependence, the compiler generates temporary private copies of the array to make the outer loop iterations independent.

10.2.3.2 Inhibitors to Automatic Parallelization

Under automatic parallelization, the compilers do not parallelize a loop if it contains a subprogram call (the automatic modes do not attempt the interprocedural analysis this would require), if it contains I/O statements, or if it has cross-iteration data dependencies that the compiler cannot eliminate.

10.2.3.3 Nested Loops

In a multithreaded, multiprocessor environment, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each thread. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
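For example, in the following nest only the outer loop is parallelized; the inner loop runs serially within each thread (a sketch):

      do i = 1, n            !  Parallelized
        do j = 1, n          !  Passed over
          a(j,i) = b(j,i) + c(j,i)
        end do
      end do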

10.2.4 Automatic Parallelization With Reduction Operations

A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.

Example: Reduction summation of the elements of a vector:

      s = 0.0
      do i = 1, 1000
        s = s + v(i)
      end do
      t(k) = s

However, for some operations, if reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.

Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.

If a parallelizable loop contains one of the reduction operations listed in TABLE 10-2, the compiler will parallelize it if -reduction is specified.

10.2.4.1 Recognized Reduction Operations

The following table lists the reduction operations that are recognized by the compiler.

TABLE 10-2 Recognized Reduction Operations

Mathematical Operations        Fortran Statement Templates

Sum                            s = s + v(i)

Product                        s = s * v(i)

Dot product                    s = s + v(i) * u(i)

Minimum                        s = amin( s, v(i))

Maximum                        s = amax( s, v(i))

OR                             do i = 1, n
                                 b = b .or. v(i)
                               end do

AND                            b = .true.
                               do i = 1, n
                                 b = b .and. v(i)
                               end do

Count of non-zero elements     k = 0
                               do i = 1, n
                                 if(v(i).ne.0) k = k + 1
                               end do


All forms of the MIN and MAX function are recognized.

10.2.4.2 Numerical Accuracy and Reduction Operations

Floating-point sum or product reduction operations may be inaccurate because the order in which partial results are combined in parallel differs from the serial order; floating-point addition and multiplication are not strictly associative, so regrouping the operations changes the accumulated roundoff.

In some situations, the error may not be acceptable.

Example: Roundoff, get the sum of 100,000 random numbers between -1 and +1:

demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub ) ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
        s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f95 -O4 -autopar -reduction t4.f

Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.

Number of Processors    Output
1                       s = 0.568582080884714E+02
2                       s = 0.568582080884722E+02
3                       s = 0.568582080884721E+02
4                       s = 0.568582080884724E+02


In this situation, roundoff error on the order of 1E-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.


10.3 Explicit Parallelization

This section describes the source code directives recognized by f95 to explicitly indicate which loops to parallelize and what strategy to use.

The Fortran 95 compiler now supports the OpenMP Fortran API as the primary parallelization model. See the OpenMP API User's Guide for additional information.

f95 will also accept legacy Sun-style and Cray-style parallelization directives to facilitate porting explicitly parallelized programs from other platforms.

Explicit parallelization of a program requires prior analysis and deep understanding of the application code as well as the concepts of shared-memory parallelization.

DO loops are marked for parallelization by directives placed immediately before them. Compile with -openmp to enable recognition of OpenMP Fortran 95 directives and generation of parallelized DO loop code. (Compile with -parallel or -explicitpar for legacy Sun or Cray directives.) Parallelization directives are comment lines that tell the compiler to parallelize (or not to parallelize) the DO loop that follows the directive. Directives are also called pragmas.
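For example, a minimal OpenMP directive placed immediately before a DO loop might look like this (a sketch; compile with -openmp):

!$OMP PARALLEL DO PRIVATE(i) SHARED(a,b,n)
      do i = 1, n
        a(i) = a(i) + b(i)
      end do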

Take care when choosing which loops to mark for parallelization. The compiler generates threaded, parallel code for all loops marked with parallelization directives, even if there are data dependencies that will cause the loop to compute incorrect results when run in parallel.

If you do your own multithreaded coding using the libthread primitives, do not use any of the compilers' parallelization options--the compilers cannot parallelize code that has already been parallelized with user calls to the threads library.

10.3.1 Parallelizable Loops

A loop is appropriate for explicit parallelization if its iterations are independent of one another (or you have accounted for any sharing between them), its iteration count is known when the loop is entered, and no flow control statement jumps out of the loop body.

10.3.1.1 Scoping Rules: Private and Shared

A private variable or array is private to a single iteration of a loop. The value assigned to a private variable or array in one iteration is not propagated to any other iteration of the loop.

A shared variable or array is shared with all other iterations. The value assigned to a shared variable or array in an iteration is seen by other iterations of the loop.

If an explicitly parallelized loop contains shared references, then you must ensure that sharing does not cause correctness problems. The compiler does not synchronize on updates or accesses to shared variables.

If you specify a variable as private in one loop, and its only initialization is within some other loop, the value of that variable may be left undefined in the loop.
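For example, in this sketch the private copy of x is undefined inside the loop because its only initialization occurs before the loop:

      x = 1.0
!$OMP PARALLEL DO PRIVATE(x)
      do i = 1, n
        a(i) = x + i        ! x is undefined here; each thread's copy is uninitialized
      end do

(OpenMP provides the FIRSTPRIVATE clause to initialize each private copy from the value the variable had before the loop.)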

10.3.1.2 Subprogram Call in a Loop

A subprogram call in a loop (or in any subprograms called from within the called routine) may introduce data dependencies that could go unnoticed without a deep analysis of the data and control flow through the chain of calls. While it is best to parallelize outermost loops that do a significant amount of the work, these tend to be the very loops that involve subprogram calls.

Because such an interprocedural analysis is difficult and could greatly increase compilation time, automatic parallelization modes do not attempt it. With explicit parallelization, the compiler generates parallelized code for a loop marked with a PARALLEL DO or DOALL directive even if it contains calls to subprograms. It is still the programmer's responsibility to ensure that no data dependencies exist within the loop or within anything the loop encloses, including called subprograms.

Multiple invocations of a routine by different threads can cause problems resulting from references to local static variables that interfere with each other. Making all the local variables in a routine automatic rather than static prevents this. Each invocation of a subprogram then has its own unique store of local variables maintained on the stack, and no two invocations will interfere with each other.

Local subprogram variables can be made automatic variables that reside on the stack either by listing them on an AUTOMATIC statement or by compiling the subprogram with the -stackvar option. However, local variables initialized in DATA statements must be rewritten to be initialized in actual assignments.
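A sketch of a subprogram rewritten this way, using the Sun AUTOMATIC declaration (the names are hypothetical):

      subroutine calc ( x, n )
      real x(n), temp(100)
      automatic temp        ! temp is now allocated on the stack per invocation
      s = 0.0               ! was: DATA s /0.0/ -- rewritten as an assignment
      do i = 1, min(n,100)
        temp(i) = 2.0 * x(i)
        s = s + temp(i)
      end do
      x(1) = s
      return
      end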



Note - Allocating local variables to the stack can cause stack overflow. See Section 10.1.6, Stacks, Stack Sizes, and Parallelization about increasing the size of the stack.



10.3.1.3 Inhibitors to Explicit Parallelization

In general, the compiler parallelizes a loop if you explicitly direct it to. There are exceptions--some loops the compiler will not parallelize.

The primary detectable inhibitors that might prevent explicit parallelization of a DO loop are those listed in TABLE 10-3: the loop is nested inside another loop that is parallelized, the loop is in a subroutine called within the body of a parallelized loop, or a flow control statement allows jumping out of the loop.

The nesting exception holds for indirect nesting, too. If you explicitly parallelize a loop that includes a call to a subroutine, then even if you request the compiler to parallelize loops in that subroutine, those loops are not run in parallel at runtime.

By compiling with -vpara and -loopinfo, you will get diagnostic messages if the compiler detects a problem while explicitly parallelizing a loop.

The following table lists typical parallelization problems detected by the compiler:

TABLE 10-3 Explicit Parallelization Problems

Problem                                                      Parallelized  Warning Message
Loop is nested inside another loop that is parallelized.     No            No
Loop is in a subroutine called within the body of a
parallelized loop.                                           No            No
Jumping out of loop is allowed by a flow control statement.  No            Yes
Index variable of loop is subject to side effects.           Yes           No
Some variable in the loop has a loop-carried dependency.     Yes           Yes
I/O statement in the loop--usually unwise, because the
order of the output is not predictable.                      Yes           No


Example: Nested loops:

      ...
!$OMP PARALLEL DO
      do 900 i = 1, 1000      !  Parallelized (outer loop)
        do 200 j = 1, 1000    !  Not parallelized, no warning
            ...
200   continue
900   continue
      ...

Example: A parallelized loop in a subroutine:

      program main
      ...
!$OMP PARALLEL DO
      do 100 i = 1, 200      <- parallelized
        ...
        call calc (a, x)
        ...
  100 continue
      ...
      end

      subroutine calc ( b, y )
      ...
!$OMP PARALLEL DO
      do 1 m = 1, 1000       <- not parallelized
        ...
    1 continue
      return
      end

In the example, the loop within the subroutine is not parallelized because the subroutine itself is run in parallel.

Example: Jumping out of a loop:

!$omp parallel do
      do i = 1, 1000     <- Not parallelized, error issued
        ...
        if (a(i) .gt. min_threshold ) go to 20
        ...
      end do
20      continue
      ...

The compiler issues an error diagnostic if there is a jump outside a loop marked for parallelization.

Example: A variable in a loop has a loop-carried dependency:

demo% cat vpfn.f
      real function fn (n,x,y,z)
      real y(*),x(*),z(*)
      s = 0.0
!$omp parallel do private(i,s) shared(x,y,z)
      do  i = 1, n
          x(i) = s
          s = y(i)*z(i)
      enddo
      fn=x(10)
      return
      end
demo% f95 -c -vpara -loopinfo -openmp -O4 vpfn.f
"vpfn.f", line 5: Warning: the loop may have parallelization inhibiting reference
"vpfn.f", line 5: PARALLELIZED, user pragma used

Here the loop is parallelized but the possible loop carried dependency is diagnosed in a warning. However, be aware that not all loop dependencies can be diagnosed by the compiler.

10.3.1.4 I/O With Explicit Parallelization

You can do I/O in a loop that executes in parallel, provided that it does not matter that the order of the output is unpredictable, and that the I/O is not recursive (see the note at the end of this section).

Example: I/O statement in loop

!$OMP PARALLEL DO PRIVATE(k)
      do i = 1, 10     !  Parallelized 
        k = i
        call show ( k ) 
      end do
      end
      subroutine show( j )
      write(6,1) j
1      format('Line number ', i3, '.')
      end
demo% f95 -openmp t13.f
demo% setenv PARALLEL 4
demo% a.out

Line number 9.
Line number 10.
Line number 4.
Line number 5.
Line number 6.
Line number 1.
Line number 2.
Line number 3.
Line number 7.
Line number 8.


However, I/O that is recursive, where an I/O statement contains a call to a function that itself does I/O, will cause a runtime error.

10.3.2 OpenMP Parallelization Directives

OpenMP is a parallel programming model for multi-processor platforms that is becoming standard programming practice for Fortran 95, C, and C++ applications. It is the preferred parallel programming model for Sun Studio compilers.

To enable OpenMP directives, compile with the -openmp option flag. Fortran 95 OpenMP directives are identified with the comment-like sentinel !$OMP followed by the directive name and subordinate clauses.

The !$OMP PARALLEL directive identifies the parallel regions in a program. The !$OMP DO directive identifies DO loops within a parallel region that are to be parallelized. These directives can be combined into a single !$OMP PARALLEL DO directive that must be placed immediately before the DO loop.
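A minimal sketch showing the combined directive and its equivalent separate form:

!$OMP PARALLEL DO
      do i = 1, n
        a(i) = b(i) + c(i)
      end do

!$OMP PARALLEL
!$OMP DO
      do i = 1, n
        a(i) = b(i) + c(i)
      end do
!$OMP END DO
!$OMP END PARALLEL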

The OpenMP specification includes a number of directives for sharing and synchronizing work in a parallel region of a program, and subordinate clauses for data scoping and control.

One major difference between OpenMP and legacy Sun-style directives is that OpenMP requires explicit data scoping as either private or shared.

For more information, including guidelines for converting legacy programs using Sun and Cray parallelization directives, see the OpenMP API User's Guide.

10.3.3 Sun-Style Parallelization Directives

Legacy Sun-style directives are enabled by default (or with the -mp=sun option) when compiling with the -explicitpar or -parallel options.

10.3.3.1 Sun Parallelization Directives Syntax

A parallel directive consists of one or more directive lines. A Sun-style directive line is defined as follows:

C$PAR Directive    [ Qualifiers ]       <- Initial directive line
C$PAR& [More_Qualifiers]               <- Optional continuation lines

The Sun-style parallel directives are:

Directive      Action
TASKCOMMON     Declares variables in a COMMON block to be thread-private
DOALL          Parallelizes the next loop
DOSERIAL       Does not parallelize the next loop
DOSERIAL*      Does not parallelize the next nest of loops


Examples of Sun-style parallel directives:

C$PAR TASKCOMMON ALPHA                  Declare block private 
      COMMON /ALPHA/BZ,BY(100)
 
C$PAR DOALL                             No qualifiers
 
C$PAR DOSERIAL
 
C$PAR DOALL SHARED(I,K,X,V), PRIVATE(A)  
            This one-line directive is equivalent to the three-line directive that follows.
C$PAR DOALL
C$PAR& SHARED(I,K,X,V)
C$PAR& PRIVATE(A)

10.3.3.2 TASKCOMMON Directive

The TASKCOMMON directive declares variables in a global COMMON block as thread-private: Every variable declared in a common block becomes a private variable to the thread, but remains global within the thread. Only named COMMON blocks can be declared TASKCOMMON.

The syntax of the directive is:

C$PAR TASKCOMMON common_block_name

The directive must appear immediately after every COMMON declaration for that named block.

This directive is effective only when compiled with -explicitpar or -parallel. Otherwise, the directive is ignored and the block is treated as a regular COMMON block.

Variables declared in TASKCOMMON blocks are treated as thread-private variables in all the DOALL loops and routines called from within the DOALL loops. Each thread gets its own copy of the COMMON block, so data written by one thread is not directly visible to other threads. During serial portions of the program, accesses are to the initial thread's copy of the COMMON block.

Variables in TASKCOMMON blocks should not appear on any DOALL qualifiers, such as PRIVATE, SHARED, READONLY, and so on.

It is an error to declare a common block as task common in some but not all compilation units where the block is defined. A check at runtime for task common consistency can be enabled by compiling the program with the -xcommonchk=yes flag. Enable the runtime check only during program development, as it can degrade performance.
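For example (the file names are hypothetical):

demo% f95 -parallel -xcommonchk=yes main.f calc.f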

10.3.3.3 DOALL Directive

The DOALL directive requests the compiler to generate parallel code for the one DO loop immediately following it (if compiled with the -parallel or -explicitpar options).



Note - Analysis and transformation of reduction operations is not performed within explicitly parallelized loops.



Example: Explicit parallelization of a loop:

demo% cat t4.f
      ...
C$PAR DOALL
      do i = 1, n        
         a(i) = b(i) * c(i)
      end do
      do k = 1, m     
         x(k) = x(k) * z(k,k)
      end do
      ...
demo% f95 -explicitpar t4.f

10.3.3.4 DOALL Qualifiers

All qualifiers on the Sun-style DOALL directive are optional. The following table summarizes them:

TABLE 10-4 DOALL Qualifiers

Qualifier and Syntax          Assertion
DOALL PRIVATE(u1,u2,...)      Do not share variables u1, u2, ... between iterations
DOALL SHARED(v1,v2,...)       Share variables v1, v2, ... between iterations
DOALL MAXCPUS(n)              Use no more than n CPUs (threads)
DOALL READONLY(v1,v2,...)     The listed variables are not modified in the DOALL loop
DOALL STOREBACK(v1,v2,...)    Save the last DO iteration values of variables v1, v2, ...
DOALL SAVELAST                Save the last DO iteration values of all private variables
DOALL REDUCTION(v1,v2,...)    Treat the variables v1, v2, ... as reduction variables
DOALL SCHEDTYPE(t)            Set the scheduling type to t


PRIVATE(varlist)

The PRIVATE(varlist) qualifier specifies that all scalars and arrays in the list varlist are private for the DOALL loop. Both arrays and scalars can be specified as private. In the case of an array, each thread of the DOALL loop gets a copy of the entire array. All other scalars and arrays referenced in the DOALL loop, but not contained in the private list, conform to their appropriate default scoping rules. (See Section 10.3.1.1, Scoping Rules: Private and Shared).

Example: Specify array a private in loop i:

C$PAR DOALL PRIVATE(a)
      do i = 1, n
        a(1) = b(i)
        do j = 2, n
          a(j) = a(j-1) + b(j) * c(j)
        end do
        x(i) = f(a)
      end do

SHARED(varlist)

The SHARED(varlist) qualifier specifies that all scalars and arrays in the list varlist are shared for the DOALL loop. Both arrays and scalars can be specified as shared. Shared scalars and arrays can be accessed in all the iterations of a DOALL loop. All other scalars and arrays referenced in the DOALL loop, but not contained in the shared list, conform to their appropriate default scoping rules.

Example: Specify a shared variable:

C$PAR DOALL SHARED(y)
      do i = 1,n
        a(i) = y
      end do

In the example, the variable y has been specified as a variable whose value should be shared among the iterations of the i loop.

READONLY(varlist)

The READONLY(varlist) qualifier specifies that all scalars and arrays in the list varlist are read-only for the DOALL loop. Read-only scalars and arrays are a special class of shared scalars and arrays that are not modified in any iteration of the DOALL loop. Specifying scalars and arrays as READONLY indicates to the compiler that it does not need to use a separate copy of that scalar variable or array for each thread of the DOALL loop.

Example: Specify a read-only variable:

      x = 3
C$PAR DOALL SHARED(x),READONLY(x)
      do i = 1, n
        b(i) = x + 1
      end do

In the preceding example, x is a shared variable, but the compiler can rely on the fact that its value will not be modified in any iteration of the i loop because of its READONLY specification.

STOREBACK(varlist)

A STOREBACK scalar variable or array is one whose value is computed in a DOALL loop. The computed value can be used after the termination of the loop. In other words, the last loop iteration values of storeback scalars or arrays are visible after the DOALL loop.

Example: Specify the loop index variable as storeback:

C$PAR DOALL PRIVATE(x), STOREBACK(x,i)
      do i = 1, n
        x = ...
      end do
      ... = i
      ... = x

In the preceding example, both the variables x and i are storeback variables, even though both variables are private to the i loop. The value of i after the loop is n+1, while the value of x is whatever value it had at the end of the last iteration.

There are some potential problems with STOREBACK to be aware of.

The STOREBACK operation occurs at the last iteration of the explicitly parallelized loop, even if this is not the same iteration that last updates the value of the STOREBACK variable or array.

Example: STOREBACK variable potentially different from the serial version:

C$PAR DOALL PRIVATE(x), STOREBACK(x)
      do i = 1, n
        if (...) then
            x = ...
        end if
      end do
      print *,x

In the preceding example, the value of the STOREBACK variable x that is printed out might not be the same as that printed out by a serial version of the i loop. In the explicitly parallelized case, the processor that processes the last iteration of the i loop (when i = n) and performs the STOREBACK operation for x, might not be the same processor that currently contains the last updated value of x. The compiler issues a warning message about these potential problems.

SAVELAST

The SAVELAST qualifier specifies that all private scalars and arrays are STOREBACK variables for the DOALL loop.

Example: Specify SAVELAST:

C$PAR DOALL PRIVATE(x,y), SAVELAST 
      do i = 1, n
        x = ...
        y = ...
      end do
      ... = i
      ... = x
      ... = y

In the example, variables x, y, and i are STOREBACK variables.

REDUCTION(varlist)

The REDUCTION(varlist) qualifier specifies that all variables in the list varlist are reduction variables for the DOALL loop. A reduction variable (or array) is one whose partial values can be individually computed on various processors, and whose final value can be computed from all its partial values.

The presence of a list of reduction variables requests the compiler to handle a DOALL loop as reduction loop by generating parallel reduction code for it.

Example: Specify a reduction variable:

C$PAR DOALL REDUCTION(x)
      do i = 1, n
        x = x + a(i)
      end do

In the preceding example, the variable x is a (sum) reduction variable; the i loop is a (sum) reduction loop.

SCHEDTYPE(t)

SCHEDTYPE(t) specifies that scheduling type t be used to schedule the DOALL loop.

TABLE 10-5 DOALL SCHEDTYPE Qualifiers

STATIC
Use static scheduling for this DO loop. (This is the default scheduling for Sun-style DOALL.) Distribute all iterations uniformly to all available threads.
Example: With 1000 iterations and 4 threads, each thread gets one chunk of 250 contiguous iterations.

SELF[(chunksize)]
Use self-scheduling for this DO loop. Each thread gets one chunk of chunksize iterations at a time, distributed in a nondeterministic order until all iterations are processed; chunks may not be distributed uniformly to all available threads. If chunksize is not provided, the compiler selects a value.
Example: With 1000 iterations and a chunksize of 4, each thread gets 4 iterations at a time until all iterations are processed.

FACTORING[(m)]
Use factoring scheduling for this DO loop. With n iterations initially and k threads, the iterations are divided into groups of chunks: the first group has k chunks of n/(2k) iterations each, the second group has k chunks of n/(4k) iterations each, and so on; the chunk size for each group is the remaining iterations divided by 2k. Because FACTORING is dynamic, there is no guarantee that each thread gets exactly one chunk from each group. At least m iterations must be assigned to each thread, and there can be one final smaller residual chunk. If m is not provided, the compiler selects a value.
Example: With 1000 iterations, FACTORING(3), and 4 threads, the first group has 4 chunks of 125 iterations each, the second group has 4 chunks of 62 iterations each, the third group has 4 chunks of 31 iterations each, and so on.

GSS[(m)]
Use guided self-scheduling for this DO loop. With n iterations initially and k threads, assign n/k iterations to the first thread, then the remaining iterations divided by k to the second thread, and so on until all iterations have been processed. GSS is dynamic, so there is no guarantee that chunks of iterations are uniformly distributed to all available threads. At least m iterations must be assigned to each thread, and there can be one final smaller residual chunk. If m is not provided, the compiler selects a value.
Example: With 1000 iterations, GSS(10), and 4 threads, distribute 250 iterations to the first thread, then 187 to the second thread, then 140 to the third thread, and so on.
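For example, a DOALL directive selecting guided self-scheduling with a minimum chunk of 10 iterations might be written as follows (a sketch):

C$PAR DOALL PRIVATE(i), SHARED(a,b,n), SCHEDTYPE(GSS(10))
      do i = 1, n
        a(i) = a(i) + b(i)
      end do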


Multiple Qualifiers

Qualifiers can appear multiple times with cumulative effect. In the case of conflicting qualifiers, the compiler issues a warning message, and the qualifier appearing last prevails.

Example: A three-line Sun-style directive (note conflicting MAXCPUS, SHARED, and PRIVATE qualifiers):

C$PAR DOALL MAXCPUS(4), READONLY(S), PRIVATE(A,B,X), MAXCPUS(2)
C$PAR DOALL SHARED(B,X,Y), PRIVATE(Y,Z)
C$PAR DOALL READONLY(T)

Example: A one-line equivalent of the preceding three lines:

C$PAR DOALL MAXCPUS(2), PRIVATE(A,Y,Z), SHARED(B,X), READONLY(S,T)

10.3.3.5 DOSERIAL Directive

The DOSERIAL directive disables parallelization of the specified loop. This directive applies to the one loop immediately following it.

Example: Exclude one loop from parallelization:

      do i = 1, n
C$PAR DOSERIAL
        do j = 1, n
          do k = 1, n
              ...
          end do
        end do
      end do

In the example, when compiling with -parallel, the j loop will not be parallelized by the compiler, but the i or k loop may be.

10.3.3.6 DOSERIAL* Directive

The DOSERIAL* directive disables parallelization of the specified nest of loops. This directive applies to the whole nest of loops immediately following it.

Example: Exclude a whole nest of loops from parallelization:

      do i = 1, n
C$PAR DOSERIAL*
        do j = 1, n
          do k = 1, n
              ...
          end do
        end do
      end do

In the example, when compiling with -parallel, the j and k loops will not be parallelized by the compiler, but the i loop may be.

10.3.3.7 Interaction Between DOSERIAL* and DOALL

If both DOSERIAL* and DOALL are specified for the same loop, the last one prevails.

Example: Specifying both DOSERIAL* and DOALL:

C$PAR DOSERIAL*
      do i = 1, 1000
C$PAR DOALL
        do j = 1, 1000
            ...
        end do
      end do

In the example, the i loop is not parallelized, but the j loop is.

Also, the scope of the DOSERIAL* directive does not extend beyond the textual loop nest immediately following it. The directive is limited to the same function or subroutine that it appears in.

Example: DOSERIAL* does not extend to a loop in a called subroutine:

      program caller
      common /block/ a(10,10)
C$PAR DOSERIAL*
      do i = 1, 10
        call callee(i)
      end do
      end
 
      subroutine callee(k)
      common /block/a(10,10)
      do j = 1, 10
        a(j,k) = j + k
      end do
      return
      end

In the preceding example, DOSERIAL* applies only to the i loop and not to the j loop, regardless of whether the call to the subroutine callee is inlined.

10.3.3.8 Default Scoping Rules for Sun-Style Directives

For Sun-style (C$PAR) explicit directives, the compiler uses default rules to determine whether a scalar or array is shared or private. You can override the default rules to specify the attributes of scalars or arrays referenced inside a loop. (With Cray-style !MIC$ directives, all variables that appear in the loop must be explicitly declared either shared or private on the DOALL directive.)

The compiler applies these default rules: scalar variables referenced in the loop are treated as PRIVATE by default, and arrays are treated as SHARED by default.

If inter-iteration dependencies exist in a loop, parallel execution may produce erroneous results. You must ensure that these cases do not arise. The compiler may sometimes be able to detect such a situation at compile time and issue a warning, but it does not disable parallelization of such loops.

Example: Potential problem through equivalence:

      equivalence (a(1),y)
C$PAR DOALL
      do i = 1,n
        y = i
        a(i) = y
      end do

In the example, since the scalar variable y has been equivalenced to a(1), we have a conflict with y as private and a(:) as shared by default, leading to possibly erroneous results when the parallelized i loop is executed. No diagnostic is issued in these situations.

You can fix the example by using C$PAR DOALL PRIVATE(y).
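That is, the directive explicitly scopes y as private, so each iteration works on its own copy rather than on the storage equivalenced to a(1):

C$PAR DOALL PRIVATE(y)
      do i = 1,n
        y = i
        a(i) = y
      end do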

10.3.4 Cray-Style Parallelization Directives

To use legacy Cray-style parallelization directives, you must compile with -mp=cray.

Mixing program units compiled with both Sun and Cray directives can produce incorrect results.

A major difference between Sun and Cray directives is that Cray style requires explicit scoping of every scalar and array in the loop as either SHARED or PRIVATE, unless AUTOSCOPE is specified.

The following table shows Cray-style directive syntax.

!MIC$ DOALL
!MIC$&  SHARED( v1, v2,  ... )
!MIC$&  PRIVATE( u1, u2,  ... )
    ...optional qualifiers

10.3.4.1 Cray Directive Syntax

A parallel directive consists of one or more directive lines. A directive line is defined with the same syntax as Sun-style (see Section 10.3.3, Sun-Style Parallelization Directives), except that the sentinel is !MIC$ on the initial line and !MIC$& on continuation lines, instead of C$PAR and C$PAR&.

The Cray directives are similar to Sun-style:

Cray Directive    Compared With Sun-Style
DOALL             Different set of qualifiers and scheduling
TASKCOMMON        Same as Sun-style
DOSERIAL          Same as Sun-style
DOSERIAL*         Same as Sun-style


10.3.4.2 DOALL Qualifiers

For Cray-style DOALL, the PRIVATE qualifier is required. Each variable within the DO loop must be qualified as private or shared, and the DO loop index must always be private. The following table summarizes available Cray-style qualifiers.

TABLE 10-6 DOALL Qualifiers (Cray Style)

Qualifier                Assertion
SHARED( v1, v2, ... )    Share the variables v1, v2, ... between iterations.
PRIVATE( x1, x2, ... )   Do not share the variables x1, x2, ... between
                         iterations; each thread has its own private copy.
AUTOSCOPE                Scope variables and arrays not explicitly scoped by
                         a PRIVATE or SHARED qualifier according to the
                         automatic scoping rules below.
SAVELAST                 Save the last DO-iteration values of all private
                         variables in the loop.
MAXCPUS( n )             Use no more than n threads.


AUTOSCOPE Automatic Scoping Rules

Specifying AUTOSCOPE directs the compiler to use the following rules to determine the scoping of a variable or array not explicitly scoped as PRIVATE or SHARED.

The rules are based on how each variable or array is referenced within the loop: broadly, one that is only read in the loop can be scoped as SHARED, while one that is written in every iteration before being read can be scoped as PRIVATE.

Still, AUTOSCOPE cannot always determine the scope of variables or arrays at compile time. Conditional paths through the loop, among other things, can alter the scoping in ways that cannot be determined by the compiler. It is much safer to scope variables explicitly with PRIVATE and SHARED qualifiers.

Cray-Style Scheduling Qualifiers

For Cray-style directives, the DOALL directive allows a single scheduling qualifier, for example, !MIC$& CHUNKSIZE(100). TABLE 10-7 shows the Cray-style DOALL directive scheduling qualifiers:

TABLE 10-7 DOALL Cray Scheduling

GUIDED
Distribute the iterations by use of guided self-scheduling. This distribution minimizes synchronization overhead, with acceptable dynamic load balancing. The default chunk size is 64. GUIDED is equivalent to Sun-style GSS(64).

SINGLE
Distribute one iteration to each available thread. SINGLE is dynamic and equivalent to Sun-style SELF(1).

CHUNKSIZE( n )
Distribute n iterations to each available thread. n must be an integer expression; for best performance, n should be an integer constant. CHUNKSIZE(n) is equivalent to Sun-style SELF(n).
Example: With 100 iterations and CHUNKSIZE(4), each thread gets 4 iterations at a time.

NUMCHUNKS( m )
If there are n iterations, distribute n/m iterations to each available thread; there can be one smaller residual chunk. m is an integer expression; for best performance, m should be an integer constant. NUMCHUNKS(m) is equivalent to Sun-style SELF(n/m), where n is the total number of iterations.
Example: With 100 iterations and NUMCHUNKS(4), each thread gets 25 iterations at a time.


The default scheduling type (when no scheduling type is specified on a Cray-style DOALL directive) is the Sun-style STATIC, for which there is no Cray-style equivalent.
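Putting the pieces together, a complete Cray-style directive with scoping and a scheduling qualifier might look like this (a sketch; compile with -mp=cray and -explicitpar or -parallel):

!MIC$ DOALL
!MIC$& PRIVATE(i)
!MIC$& SHARED(a,b,n)
!MIC$& CHUNKSIZE(100)
      do i = 1, n
        a(i) = a(i) + b(i)
      end do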


10.4 Environment Variables

There are four environment variables used with parallelization: PARALLEL, OMP_NUM_THREADS, SUNW_MP_WARN, and SUNW_MP_THR_IDLE. They are described in the sections that follow.

(See also the STACKSIZE discussion in Section 10.1.6, Stacks, Stack Sizes, and Parallelization.)

10.4.1 PARALLEL and OMP_NUM_THREADS

To run a parallelized program in a multithreaded environment, you must set either the PARALLEL or OMP_NUM_THREADS environment variable prior to execution. This tells the runtime system the maximum number of threads the program can create. The default is 1. In general, set the PARALLEL or OMP_NUM_THREADS variable to the available number of processors on the target platform.
Example: setenv PARALLEL 4

10.4.2 SUNW_MP_WARN

Controls warning messages issued by the runtime multitasking library. If set to TRUE, the library issues warning messages to stderr; FALSE disables warning messages and is the default. Example: setenv SUNW_MP_WARN TRUE

10.4.3 SUNW_MP_THR_IDLE

Use the SUNW_MP_THR_IDLE environment variable to control the end-of-task status of every thread, other than the master thread, executing the parallel part of the program. Set it to one of the following values:

Value          Meaning
SPIN           The thread spins (busy-waits) after completing a parallel task,
               until a new parallel task arrives. (Default)
SLEEP(time)    The thread spin-waits for the given amount of time after
               completing a parallel task. If a new task arrives while the
               thread is spinning, the thread executes it immediately;
               otherwise, the thread goes to sleep and is awakened when a new
               task arrives. time may be specified in seconds (ns, or just n)
               or milliseconds (nms). SLEEP with no argument puts the thread
               to sleep immediately after completing a parallel task. SLEEP,
               SLEEP(0), SLEEP(0s), and SLEEP(0ms) are all equivalent.


The default, without SUNW_MP_THR_IDLE explicitly specified, is SPIN.

Example:

% setenv SUNW_MP_THR_IDLE "SLEEP(50ms)"
% setenv PARALLEL 4
% myprog

In this example, at most four threads are created by the program. After finishing a parallel task, a thread spins for 50 ms. If within that time a new task has arrived for the thread, it executes it. Otherwise, the thread goes to sleep until a new task arrives.


10.5 Debugging Parallelized Programs

Debugging parallelized programs requires some extra effort. The following schemes suggest ways to approach this task.

10.5.1 First Steps at Debugging

There are some steps you can try immediately to determine the cause of errors.

You can do one of the following: compile without any parallelization options, or run the parallelized executable on a single thread by setting the PARALLEL (or OMP_NUM_THREADS) environment variable to 1.

If the problem disappears, then you can assume it was due to using multiple threads.

If you are using the -reduction option, summation reduction may be occurring and yielding slightly different answers. Try running without this option.

If you have many subroutines in your program, use fsplit(1) to break them into separate files. Then compile some files with and without -parallel, and use f95 to link the .o files. You must specify -parallel on this link step.

Execute the binary and verify results.

Repeat this process until the problem is narrowed down to one subroutine.

Check which loops are being parallelized and which loops are not.

Create a dummy subroutine or function that does nothing, and put calls to it in a few of the loops that are being parallelized; the calls prevent those loops from being parallelized automatically. Recompile and execute. Use -loopinfo to see which loops are now being parallelized.

Continue this process until you start getting the correct results.

Add the C$PAR DOALL directive to a couple of the loops that are being parallelized. Compile with -explicitpar, then execute and verify the results. Use -loopinfo to see which loops are being parallelized. This method permits the addition of I/O statements to the parallelized loop.

Repeat this process until you find the loop that causes the wrong results.

Note: if you need only explicit parallelization (without -autopar), do not compile with both -explicitpar and -depend. That combination is the same as compiling with -parallel, which, of course, includes -autopar.

Replace DO I=1,N with DO I=N,1,-1. Different results point to data dependencies.
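For instance, applying this to the recurrence from Section 10.1.3.2 (a sketch):

      do i = 2, n            ! original order
        a(i) = a(i-1)*b(i) + c(i)
      end do

      do i = n, 2, -1        ! reversed order
        a(i) = a(i-1)*b(i) + c(i)
      end do

Because each iteration reads the a(i-1) computed by the previous iteration, the reversed loop reads values not yet updated, and the differing results expose the dependency.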

10.5.2 Debugging Parallel Code With dbx

To use dbx on a parallel loop, temporarily rewrite the program as follows:

Example: Manually transform a loop to allow using dbx in parallel:

Original code:

demo% cat loop.f
C$PAR DOALL
      DO i = 1,10
            WRITE(0,*) 'Iteration ', i
      END DO
      END

Split into two parts: caller loop and loop body as a subroutine

demo% cat loop1.f
C$PAR DOALL
      DO i = 1,10
            k = i
            CALL loop_body ( k )
      END DO
      END
 
demo% cat loop2.f
      SUBROUTINE loop_body ( k )
      WRITE(0,*) 'Iteration ', k 
      RETURN
      END

Compile caller loop with parallelization but no debugging

demo% f95 -O3 -c -explicitpar loop1.f

Compile the subprogram with debugging but not parallelized

demo% f95 -c -g loop2.f

Link together both parts into a.out

demo% f95 loop1.o loop2.o -explicitpar

Run a.out under dbx and put breakpoint into loop body subroutine

demo% dbx a.out          <- Various dbx messages not shown
(dbx) stop in loop_body
(2) stop in loop_body
(dbx) run
Running: a.out
(process id 28163)

dbx stops at breakpoint

t@1 (l@1) stopped in loop_body at line 2 in file  
    "loop2.f"
    2           write(0,*) 'Iteration ', k

Now show value of k

(dbx) print k
k = 1                  <- Various values other than 1 are possible
(dbx)
 


10.6 Further Reading

The following provide more information: the OpenMP API User's Guide; the Solaris Multithreaded Programming Guide; the Sun Numerical Computation Guide; and Techniques for Optimizing Applications: High Performance Computing by Rajat Garg and Ilya Sharapov, a Sun BluePrints publication.