Parallelization directives are comment lines that tell the compiler to parallelize (or not to parallelize) the DO loop that follows the directive. Directives are also called pragmas.
A parallelization directive consists of one or more directive lines.
Sun-style directives are recognized by f77 and f90 by default (or with the -mp=sun option). A Sun-style directive line is defined as follows:
```
C$PAR Directive [ Qualifiers ]      <- Initial directive line
C$PAR& [More_Qualifiers]            <- Optional continuation lines
```
- The letters of a directive line are case-insensitive.
- The first five characters are C$PAR, *$PAR, or !$PAR.
- An initial directive line has a blank in column 6.
- A continuation directive line has a nonblank character in column 6.
- Directives are listed in columns 7 and beyond.
- Qualifiers, if any, follow directives, on the same line or on continuation lines.
- Multiple qualifiers on one line are separated by commas.
- Spaces before, after, or within a directive or qualifier are ignored.
- Columns beyond 72 are ignored unless the -e option is specified.
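The following sketch illustrates these layout rules; the directive and qualifier shown are arbitrary examples taken from this chapter:

```
C$PAR DOALL                 <- Initial line: C$PAR in columns 1-5, blank in column 6
C$PAR& PRIVATE(T)           <- Continuation line: nonblank (&) in column 6
*$PAR DOSERIAL              <- *$PAR (or !$PAR) is also accepted as a prefix
```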
The parallel directives and their actions are as follows:
| Directive | Action |
| --- | --- |
| TASKCOMMON | Declares a COMMON block private |
| DOALL | Parallelizes the next loop |
| DOSERIAL | Does not parallelize the next loop |
| DOSERIAL* | Does not parallelize the next nest of loops |
Examples of f77 parallel directives:
```
C$PAR TASKCOMMON ALPHA                     Declare COMMON /ALPHA/ BZ, BY(100) private
C$PAR DOALL                                No qualifiers
C$PAR DOSERIAL
C$PAR DOALL SHARED(I,K,X,V), PRIVATE(A)
```

The last one-line directive is equivalent to the three-line directive that follows:

```
C$PAR DOALL
C$PAR& SHARED(I,K,X,V)
C$PAR& PRIVATE(A)
```
The TASKCOMMON directive declares the variables in a global COMMON block as private. Every variable declared in a task common block becomes a private variable. Only named COMMON blocks can be declared task common.
The syntax of the directive is:
C$PAR TASKCOMMON common_block_name
The directive must appear immediately after the defining COMMON declaration.
This directive is effective only when compiled with -explicitpar or -parallel. Otherwise, the directive is ignored and the block is treated as a regular common block.
Variables declared in task common blocks are treated as private variables in all the DOALL loops they appear in explicitly, and in the routines called from a loop where the specified common block is in its scope.
It is an error to declare a common block as task common in some but not all compilation units where the block is defined. A check at runtime for task common consistency can be enabled by compiling the program with the -xcommonchk=yes flag. (Enable the runtime check only during program development, as it can degrade performance.)
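As a minimal sketch of TASKCOMMON usage (the file name task.f, the block name /scratch/, and the variables are illustrative, not from the text), the directive is placed immediately after the defining COMMON declaration, and each thread of the DOALL loop then works on its own copy of the block:

```
      program taskdemo
      real a(1000)
      common /scratch/ t              ! named COMMON block
C$PAR TASKCOMMON SCRATCH
C$PAR DOALL
      do i = 1, 1000
         t = sqrt(real(i))            ! each thread updates its own copy of t
         a(i) = t + 1.0
      end do
      print *, a(1), a(1000)
      end
```

Compiling this sketch with, for example, f77 -explicitpar -xcommonchk=yes task.f would honor the directive and enable the runtime consistency check described above.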
The compilers will parallelize the DO loop following a DOALL directive (if compiled with the -parallel or -explicitpar options).
Analysis and transformation of reduction operations within loops is not done if they are explicitly parallelized.
Example: Explicit parallelization of a loop:
```
demo% cat t4.f
      ...
C$PAR DOALL
      do i = 1, n
        a(i) = b(i) * c(i)
      end do
      do k = 1, m
        x(k) = x(k) * z(k,k)
      end do
      ...
demo% f77 -explicitpar t4.f
```
A subprogram call in a loop (or in any subprograms called from within the called routine) may introduce data dependencies that could go unnoticed without a deep analysis of the data and control flow through the chain of calls. While it is best to parallelize outermost loops that do a significant amount of the work, these tend to be the very loops that involve subprogram calls.
Because such an interprocedural analysis is difficult and could greatly increase compilation time, the automatic parallelization modes do not attempt it. With explicit parallelization, the compiler generates parallelized code for a loop marked with a DOALL directive even if it contains calls to subprograms. It is still the programmer's responsibility to ensure that no data dependencies exist within the loop and within everything the loop encloses, including called subprograms.
Multiple invocations of a routine from different processors can cause problems resulting from references to local static variables that interfere with each other. Making all the local variables in a routine automatic rather than static prevents this. Each invocation of a subprogram then has its own unique store of local variables maintained on the stack, and no two invocations will interfere with each other.
Local subprogram variables can be made automatic variables that reside on the stack either by listing them on an AUTOMATIC statement or by compiling the subprogram with the -stackvar option. However, local variables initialized in DATA statements must be rewritten to be initialized in actual assignments.
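For example, a local accumulator that might otherwise be a static, DATA-initialized variable can be listed on an AUTOMATIC statement and initialized by an assignment. This is only a sketch; the routine and variable names are illustrative:

```
      subroutine accum(x, n, result)
      integer n, i
      real x(n), result, temp
      automatic temp                  ! temp lives on the stack of each invocation
      temp = 0.0                      ! assignment replaces a DATA statement
      do i = 1, n
         temp = temp + x(i)
      end do
      result = temp
      end
```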
Allocating local variables to the stack can cause stack overflow. See "Stacks, Stack Sizes, and Parallelization" about increasing the size of the stack.
Data dependencies can still be introduced through the data passed down the call tree as arguments or through COMMON blocks. This data flow should be analyzed carefully before parallelizing a loop with subprogram calls.
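The following sketch (the names are illustrative) shows how such a dependency can hide in a callee: the loop body looks independent at the call site, but every call updates a shared variable in a COMMON block.

```
C$PAR DOALL
      do i = 1, n
         call update(a(i))            ! looks independent at the call site
      end do
      ...
      subroutine update(x)
      real x
      common /acc/ s
      s = s + x                       ! hidden loop-carried dependency on s
      end
```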
All qualifiers on the DOALL directive are optional. The following table summarizes them:
Table 10-4 DOALL Qualifiers
| Qualifier | Assertion | Syntax |
| --- | --- | --- |
| PRIVATE | Do not share variables u1, u2, ... between iterations | DOALL PRIVATE(u1,u2,...) |
| SHARED | Share variables v1, v2, ... between iterations | DOALL SHARED(v1,v2,...) |
| MAXCPUS | Use no more than n CPUs | DOALL MAXCPUS(n) |
| READONLY | The listed variables are not modified in the DOALL loop | DOALL READONLY(v1,v2,...) |
| SAVELAST | Save the last DO iteration values of all private variables | DOALL SAVELAST |
| STOREBACK | Save the last DO iteration values of variables v1, v2, ... | DOALL STOREBACK(v1,v2,...) |
| REDUCTION | Treat the variables v1, v2, ... as reduction variables | DOALL REDUCTION(v1,v2,...) |
| SCHEDTYPE | Set the scheduling type to t | DOALL SCHEDTYPE(t) |
The PRIVATE(varlist)qualifier specifies that all scalars and arrays in the list varlist are private for the DOALL loop. Both arrays and scalars can be specified as private. In the case of an array, each thread of the DOALL loop gets a copy of the entire array. All other scalars and arrays referenced in the DOALL loop, but not contained in the private list, conform to their appropriate default scoping rules.
Example: Specify a private array:
```
C$PAR DOALL PRIVATE(a)
      do i = 1, n
        a(1) = b(i)
        do j = 2, n
          a(j) = a(j-1) + b(j) * c(j)
        end do
        x(i) = f(a)
      end do
```
In the preceding example, the array a is specified as private to the i loop.
The SHARED(varlist) qualifier specifies that all scalars and arrays in the list varlist are shared for the DOALL loop. Both arrays and scalars can be specified as shared. Shared scalars and arrays are common to all the iterations of a DOALL loop. All other scalars and arrays referenced in the DOALL loop, but not contained in the shared list, conform to their appropriate default scoping rules.
Example: Specify a shared variable:
```
      equivalence (a(1), y)
C$PAR DOALL SHARED(y)
      do i = 1, n
        a(i) = y
      end do
```
In the preceding example, the variable y has been specified as a variable whose value should be shared among the iterations of the i loop.
The READONLY(varlist) qualifier specifies that all scalars and arrays in the list varlist are read-only for the DOALL loop. Read-only scalars and arrays are a special class of shared scalars and arrays that are not modified in any iteration of the DOALL loop. Specifying scalars and arrays as READONLY indicates to the compiler that it does not need to use a separate copy of that variable or array for each thread of the DOALL loop.
Example: Specify a read-only variable:
```
      x = 3
C$PAR DOALL SHARED(x), READONLY(x)
      do i = 1, n
        b(i) = x + 1
      end do
```
In the preceding example, x is a shared variable, but the compiler can rely on the fact that it will not change over each iteration of the i loop because of its READONLY specification.
A STOREBACK variable or array is one whose value is computed in a DOALL loop. The computed value can be used after the termination of the loop. In other words, the last loop iteration values of storeback scalars and arrays may be visible outside of the DOALL loop.
Example: Specify the loop index variable as storeback:
```
C$PAR DOALL PRIVATE(x), STOREBACK(x,i)
      do i = 1, n
        x = ...
      end do
      ... = i
      ... = x
```
In the preceding example, both the variables x and i are STOREBACK variables, even though both variables are private to the i loop.
There are some potential problems with STOREBACK, however.
The STOREBACK operation occurs at the last iteration of the explicitly parallelized loop, even if that is not the same iteration that last updates the value of the STOREBACK variable or array.
Example: STOREBACK variable potentially different from the serial version:
```
C$PAR DOALL PRIVATE(x), STOREBACK(x)
      do i = 1, n
        if (...) then
          x = ...
        end if
      end do
      print *, x
```
In the preceding example, the value of the STOREBACK variable x that is printed out might not be the same as that printed out by a serial version of the i loop. In the explicitly parallelized case, the processor that processes the last iteration of the i loop (when i = n) and performs the STOREBACK operation for x, might not be the same processor that currently contains the last updated value of x. The compiler issues a warning message about these potential problems.
In an explicitly parallelized loop, arrays are not treated as STOREBACK by default. If the storeback operation is desired for an array (for example, because it has been declared private), include the array in the varlist of a STOREBACK qualifier.
The SAVELAST qualifier specifies that all private scalars and arrays are STOREBACK for the DOALL loop. A STOREBACK variable or array is one whose value is computed in a DOALL loop; this computed value can be used after the termination of the loop. In other words, the last loop iteration values of STOREBACK scalars and arrays may be visible outside of the DOALL loop.
```
C$PAR DOALL PRIVATE(x,y), SAVELAST
      do i = 1, n
        x = ...
        y = ...
      end do
      ... = i
      ... = x
      ... = y
```
In the preceding example, variables x, y, and i are STOREBACK variables.
The REDUCTION(varlist) qualifier specifies that all variables in the list varlist are reduction variables for the DOALL loop. A reduction variable (or array) is one whose partial values can be individually computed on various processors, and whose final value can be computed from all its partial values.
The presence of a list of reduction variables can aid the compiler in identifying if a DOALL loop is a reduction loop and in generating parallel reduction code for it.
Example: Specify a reduction variable:
```
C$PAR DOALL REDUCTION(x)
      do i = 1, n
        x = x + a(i)
      end do
```
In the preceding example, the variable x is a (sum) reduction variable; the i loop is a (sum) reduction loop.
The SCHEDTYPE(t) qualifier specifies that the specific scheduling type t be used to schedule the DOALL loop.
Table 10-5 DOALL SCHEDTYPE Qualifiers
| Scheduling Type | Action |
| --- | --- |
| STATIC | Use static scheduling for this DO loop. Distribute all iterations uniformly to all available processors. Example: with 1000 iterations and 4 CPUs, each CPU gets a single iteration in turn, until all the iterations have been distributed. |
| SELF[(chunksize)] | Use self-scheduling for this DO loop. Distribute chunksize iterations to each available processor, and repeat with the remaining iterations until all the iterations have been processed. If chunksize is not provided, f77 selects a value. Example: with 1000 iterations and a chunksize of 4, distribute 4 iterations to each CPU at a time. |
| FACTORING[(m)] | Use factoring scheduling for this DO loop. With n iterations initially and k CPUs, distribute n/(2k) iterations uniformly to each processor until all iterations have been processed. At least m iterations must be assigned to each processor; there can be one final smaller residual chunk. If m is not provided, f77 selects a value. Example: with 1000 iterations, FACTORING(4), and 4 CPUs, distribute 125 iterations to each CPU, then 62 iterations, then 31 iterations, and so on. |
| GSS[(m)] | Use guided self-scheduling for this DO loop. With n iterations initially and k CPUs, assign n/k iterations to the first processor, then the remaining iterations divided by k to the second processor, and so on until all iterations have been processed. At least m iterations must be assigned to each CPU; there can be one final smaller residual chunk. If m is not provided, f77 selects a value. Example: with 1000 iterations, GSS(10), and 4 CPUs, distribute 250 iterations to the first CPU, then 187 to the second CPU, then 140 to the third CPU, and so on. |
Qualifiers can appear multiple times with cumulative effect. In the case of conflicting qualifiers, the compiler issues a warning message, and the qualifier appearing last prevails.
Example: A three-line Sun-style directive:
```
C$PAR DOALL MAXCPUS(4), READONLY(S), PRIVATE(A,B,X), MAXCPUS(2)
C$PAR DOALL SHARED(B,X,Y), PRIVATE(Y,Z)
C$PAR DOALL READONLY(T)
```
Example: A one-line equivalent of the preceding three lines (note duplicate MAXCPUS and conflicting SHARED/PRIVATE):
```
C$PAR DOALL MAXCPUS(2), PRIVATE(A,Y,Z), SHARED(B,X), READONLY(S,T)
```
The DOSERIAL directive disables parallelization of the specified loop. This directive applies to the one loop immediately following it (if you compile it with -explicitpar or -parallel).
Example: Exclude one loop from parallelization:
```
      do i = 1, n
C$PAR DOSERIAL
        do j = 1, n
          do k = 1, n
            ...
          end do
        end do
      end do
```
In the preceding example, the j loop is not parallelized, but the i or k loop can be.
The DOSERIAL* directive disables parallelization of the specified nest of loops. This directive applies to the whole nest of loops immediately following it (if you compile with -explicitpar or -parallel).
Example: Exclude a whole nest of loops from parallelization:
```
      do i = 1, n
C$PAR DOSERIAL*
        do j = 1, n
          do k = 1, n
            ...
          end do
        end do
      end do
```
In the preceding loops, the j and k loops are not parallelized; the i loop could be.
If both DOSERIAL and DOALL are specified, the last one prevails.
Example: Specifying both DOSERIAL and DOALL:
```
C$PAR DOSERIAL*
      do i = 1, 1000
C$PAR DOALL
        do j = 1, 1000
          ...
        end do
      end do
```
In the preceding example, the i loop is not parallelized, but the j loop is.
Also, the scope of the DOSERIAL* directive does not extend beyond the textual loop nest immediately following it. The directive is limited to the same function or subroutine that it is in.
Example: DOSERIAL* does not extend to a loop of a called subroutine:
```
      program caller
      common /block/ a(10,10)
C$PAR DOSERIAL*
      do i = 1, 10
        call callee(i)
      end do
      end

      subroutine callee(k)
      common /block/ a(10,10)
      do j = 1, 10
        a(j,k) = j + k
      end do
      return
      end
```
In the preceding example, DOSERIAL* applies only to the i loop and not to the j loop, regardless of whether the call to the subroutine callee is inlined.
In general, the compiler parallelizes a loop if you explicitly direct it to. There are exceptions--some loops the compiler just cannot parallelize.
The following are the primary detectable inhibitors that might prevent explicitly parallelizing a DO loop:
- The DO loop is nested inside another DO loop that is parallelized. This holds for indirect nesting, too: if you explicitly parallelize a loop that includes a call to a subroutine, then even if you parallelize loops in that subroutine, those loops are not run in parallel at runtime.
- A flow control statement allows jumping out of the DO loop.
- The index variable of the loop is subject to side effects, such as being equivalenced.
If you compile with -vpara, you may get a warning message when f77/f90 detects a problem with explicitly parallelizing a loop; f77/f90 may still parallelize the loop. The following table of typical parallelization problems shows those that are ignored by the compiler and those that generate messages with -vpara.
Table 10-6 Explicit Parallelization Problems
| Problem | Parallelized | Message |
| --- | --- | --- |
| Loop is nested inside another loop that is parallelized. | No | No |
| Loop is in a subroutine, and a call to the subroutine is in a parallelized loop. | No | No |
| Jumping out of the loop is allowed by a flow control statement. | No | Yes |
| Index variable of the loop is subject to side effects. | Yes | No |
| Some variable in the loop keeps a loop-carried dependency. | Yes | Yes |
| I/O statement in the loop (usually unwise, because the order of the output is not predictable). | Yes | No |
Example: A loop nested inside a parallelized loop:
```
      ...
C$PAR DOALL
      do 900 i = 1, 1000          ! Parallelized (outer loop)
        do 200 j = 1, 1000        ! Not parallelized, no warning
          ...
200     continue
900   continue
      ...
demo% f77 -explicitpar -vpara t6.f
```
Example: A parallelized loop in a subroutine:
```
C$PAR DOALL
      do 100 i = 1, 200
         ...
         call calc (a, x)
         ...
100   continue
      ...
demo% f77 -explicitpar -vpara t.f
```

```
      subroutine calc ( b, y )
      ...
C$PAR DOALL
      do 1 m = 1, 1000
         ...
1     continue
      return
      end
```

At runtime, the loop in the caller could run in parallel, but the loop in the subroutine does not run in parallel.
In the preceding example, the loop within the subroutine is not parallelized because the subroutine itself is run in parallel.
Example: Jumping out of a loop:
```
C$PAR DOALL
      do i = 1, 1000              ! Not parallelized, with warning
        ...
        if (a(i) .gt. min_threshold) go to 20
        ...
      end do
20    continue
      ...
demo% f77 -explicitpar -vpara t9.f
```
Example: An index variable subject to side effects:
```
      equivalence ( a(1), y )     ! Source of possible side effects
      ...
C$PAR DOALL
      do i = 1, 2000              ! Parallelized: no warning, but not safe
        y = i
        a(i) = y
      end do
      ...
demo% f77 -explicitpar -vpara t11.f
```
Example: A variable in a loop has a loop-carried dependency:
```
C$PAR DOALL
      do 100 i = 1, 200           ! Parallelized, with warning
        y = y * i                 ! y has a loop-carried dependency
        a(i) = y
100   continue
      ...
demo% f77 -explicitpar -vpara t12.f
```
You can do I/O in a loop that executes in parallel, provided that:
- It does not matter that the output from different threads is interleaved (program output is nondeterministic).
- You can ensure the safety of executing the loop in parallel.
Example: An I/O statement in a loop:
```
C$PAR DOALL
      do i = 1, 10                ! Parallelized with no warning (not advisable)
        k = i
        call show ( k )
      end do
      end

      subroutine show( j )
      write(6,1) j
1     format('Line number ', i3, '.')
      end

demo% f77 -silent -explicitpar -vpara t13.f
demo% setenv PARALLEL 2
demo% a.out
```

(The output displays the numbers 1 through 10, but in a different order each time.)
Example: Recursive I/O:
```
      do i = 1, 10                ! Parallelized with no warning -- unsafe
        k = i
        print *, list( k )        ! list is a function that does I/O
      end do
      end

      function list( j )
      write(6,"('Line number ', i3, '.')") j
      list = j
      end

demo% f77 -silent -mt t14.f
demo% setenv PARALLEL 2
demo% a.out
```
In the preceding example, the program may deadlock in libF77_mt and hang. Press Control-C to regain keyboard control.
There are situations where the programmer might not be aware that I/O could take place within a parallelized loop. Consider a user-supplied exception handler that prints output when it catches an arithmetic exception (like divide by zero). If a parallelized loop provokes an exception, the implicit I/O from the handler may cause I/O deadlocks and a system hang.
In general:
- The library libF77_mt is MT safe, but mostly not MT hot.
- You cannot do recursive (nested) I/O if you compile with -mt.
As an informal definition, an interface is MT safe if:
- It can be simultaneously invoked by more than one thread of control.
- The caller is not required to do any explicit synchronization before calling the function.
- The interface is free of data races.
A data race occurs when the content of memory is being updated by more than one thread and that bit of memory is not protected by a lock. The value of that bit of memory is then nondeterministic: the two threads race to see which one gets to update it (and in this case, the one that gets there last wins).
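A sketch of such a race in this setting (the variable names are illustrative): each iteration performs an unprotected read-modify-write of the shared scalar total, so the final value depends on thread timing. The REDUCTION qualifier described earlier is the directive-level way to express this pattern safely.

```
C$PAR DOALL SHARED(total, a)
      do i = 1, n
         total = total + a(i)     ! data race: unprotected update of total
      end do
```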
An interface is colloquially called MT hot if the implementation has been tuned for performance advantage, using the techniques of multithreading. For some formal definitions of multithreading technology, read the Solaris Multithreaded Programming Guide.