Parallel Pragmas (Sun WorkShop Compiler C 5.0 User's Guide)

Sun WorkShop Compiler C 5.0 User's Guide

Parallel Pragmas

There is one parallel pragma: #pragma MP taskloop [options].

The MP taskloop pragma can, optionally, take one or more of the following arguments.

maxcpus (number_of_processors)
private (list_of_private_variables)
shared (list_of_shared_variables)
readonly (list_of_readonly_variables)
storeback (list_of_storeback_variables)
savelast
reduction (list_of_reduction_variables)
schedtype (scheduling_type)

Only one option can be specified per MP taskloop pragma; however, the pragmas are cumulative and apply to the next for loop encountered within the current block in the source code:

#pragma MP taskloop maxcpus(4)

#pragma MP taskloop shared(a,b)

#pragma MP taskloop storeback(x)

These options may appear multiple times prior to the for loop to which they apply. In case of conflicting options, the compiler will issue a warning message.

Nesting of `for` loops

An MP taskloop pragma applies to the next for loop within the current block. There is no nesting of parallelized for loops by MP C.

Eligibility for Parallelizing

An MP taskloop pragma suggests to the compiler that, unless otherwise disallowed, the specified for loop should be parallelized.

For loops with irregular control flow and unknown loop iteration increment are not eligible for parallelization. For example, for loops containing setjmp, longjmp, exit, abort, return, goto, labels, and break should not be considered as candidates for parallelization.

Of particular importance is to note that for loops with inter-iteration dependencies can be eligible for explicit parallelization. This means that if a MP taskloop pragma is specified for such a loop the compiler will simply honor it, unless the for loop is disqualified. It is the user's responsibility to make sure that such explicit parallelization will not lead to incorrect results.

If both the serial_loop or serial_loop_nested and taskloop pragmas are specified for a for loop, the last one specified will prevail.

Consider the following example:

#pragma MP serial_loop_nested

for (i=0; i<100; i++) {

# pragma MP taskloop

for (j=0; j<1000; j++) {

...

}

The i loop will not be parallelized but the j loop might be.

Number of Processors

#pragma MP taskloop maxcpus (number_of_processors) specifies the number of processors to be used for this loop, if possible.

The value of maxcpus must be a positive integer. If maxcpus equals 1, then the specified loop will be executed in serial. (Note that setting maxcpus to be 1 is equivalent to specifying the serial_loop pragma.) The smaller of the values of maxcpus or the interpreted value of the PARALLEL environment variable will be used. When the environment variable PARALLEL is not specified, it is interpreted as having the value 1.

If more than one maxcpus pragma is specified for a for loop, the last one specified will prevail.

Classifying Variables

A variable used in a loop is classified as being either a "private", "shared", "reduction", or "readonly" variable. The variable will belong to only one of these classifications. A variable can only be classified as a reduction or readonly variable via an explicit pragma. See #pragma MP taskloop reduction and #pragma MP taskloop readonly. A variable can be classified as being either a "private or "shared" variable via an explicit pragma or through the following default scoping rules.

Default Scoping Rules for Private and Shared Variables

A private variable is one whose value is private to each processor processing some iterations of a for loop. In other words, the value assigned to a private variable in one iteration of a for loop is not propagated to other processors processing other iterations of that for loop. A shared variable, on the other hand, is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop. Loops being explicitly parallelized through use of #pragma MP taskloop directives, that contain references to shared variables, must ensure that such sharing of values does not cause any correctness problems (such as race conditions). No synchronization is provided by the compiler on updates and accesses to shared variables in an explicitly parallelized loop.

In analyzing explicitly parallelized loops, the compiler uses the following "default scoping rules" to determine whether a variable is private or shared:

If a variable is not explicitly classified via a pragma, the variable will default to being classified as a shared variable if it is declared as a pointer or array, and is only referenced using array syntax within the loop. Otherwise, it will be classified as a private variable.

The loop index variable is always treated as a private variable and is always a storeback variable.

It is highly recommended that all variables used in an explicitly parallelized for loop be explicitly classified as one of shared, private, reduction, or readonly, to avoid the "default scoping rules."

Since the compiler does not perform any synchronization on accesses to shared variables, extreme care must be exercised before using an MP taskloop pragma for a loop that contains, for example, array references. If inter-iteration data dependencies exist in such an explicitly parallelized loop, then its parallel execution may give erroneous results. The compiler may or may not be able to detect such a potential problem situation and issue a warning message. In any case, the compiler will not disable the explicit parallelization of loops with potential shared variable problems.

Private Variables

#pragma MP taskloop private (list_of_private_variables) specifies all the variables that should be treated as private variables for this loop. All other variables used in the loop that are not explicitly specified as shared, readonly, or reduction variables, will be either shared or private as defined by the default scoping rules.

A private variable is one whose value is private to each processor processing some iterations of a loop. In other words, the value assigned to a private variable by one of the processors working on iterations of a loop is not propagated to other processors processing other iterations of that loop. A private variable has no initial value at the start of each iteration of a loop and must be set to a value within the iteration of a loop prior to its first use within that iteration. Execution of a program with a loop containing an explicitly declared private variable whose value is used prior to being set will result in undefined behavior.

Shared Variables

#pragma MP taskloop shared (list_of_shared_variables) specifies all the variables that should be treated as shared variables for this loop. All other variables used in the loop that are not explicitly specified as private, readonly, storeback or reduction variables, will be either shared or private as defined by the default scoping rules.

A shared variable is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop.

Read-only Variables

Read-only variables are a special class of shared variables that are not modified in any iteration of a loop. #pragma MP taskloop readonly (list_of_readonly_variables) indicates to the compiler that it may use a separate copy of that variable's value for each processor processing iterations of the loop.

Storeback Variables

#pragma MP taskloop storeback (list_of_storeback_variables) specifies all the variables to be treated as storeback variables.

A storeback variable is one whose value is computed in a loop, and this computed value is then used after the termination of the loop. The last loop iteration values of storeback variables are available for use after the termination of the loop. Such a variable is a good candidate to be declared explicitly via this directive as a storeback variable when the variable is a private variable, whether by explicitly declaring the variable private or by the default scoping rules.

Note that the storeback operation for a storeback variable occurs at the last iteration of the explicitly parallelized loop, regardless of whether or not that iteration updates the value of the storeback variable. In other words the processor that processes the last iteration of a loop may not be the same processor that currently contains the last updated value for a storeback variable. Consider the following example:

#pragma MP taskloop private(x)
#pragma MP taskloop storeback(x)
   for (i=1; i <= n; i++) {
      if (...) {
          x=...
      }
}
   printf ("%d", x);

In the previous example the value of the storeback variable x printed out via the printf() call may not be the same as that printed out by a serial version of the i loop, because in the explicitly parallelized case, the processor that processes the last iteration of the loop (when i==n), which performs the storeback operation for x may not be the same processor that currently contains the last updated value for x. The compiler will attempt to issue a warning message to alert the user of such potential problems.

In an explicitly parallelized loop, variables referenced as arrays are not treated as storeback variables. Hence it is important to include them in the list_of_storeback_variables if such storeback operation is desired (for example, if the variables referenced as arrays have been declared as private variables).

Savelast

#pragma MP taskloop savelast specifies that all the private variables of a loop be treated as a storeback variables. The syntax of this pragma is as follows:

#pragma MP taskloop savelast

It is often convenient to use this form, rather than list out each private variable of a loop when declaring each variable as storeback variables.

Reduction Variables

#pragma MP taskloop reduction (list_of_reduction_variables) specifies that all the variables appearing in the reduction list will be treated as reduction variables for the loop. A reduction variable is one whose partial values can be individually computed by each of the processors processing iterations of the loop, and whose final value can be computed from all its partial values. The presence of a list of reduction variables can facilitate the compiler in identifying that the loop is a reduction loop, allowing generation of parallel reduction code for it. Consider the following example:

#pragma MP taskloop reduction(x)
    for (i=0; i<n; i++) {
         x = x + a[i];
}

the variable x is a (sum) reduction variable and the i loop is a(sum) reduction loop.

Scheduling Control

The MP C compiler supports several pragmas that can be used in conjunction with the taskloop pragma to control the loop scheduling strategy for a given loop. The syntax for this pragma is:

#pragma MP taskloop schedtype (scheduling_type)

This pragma can be used to specify the specific scheduling_type to be used to schedule the parallelized loop. Scheduling_type can be one of the following:

static

In static scheduling all the iterations of the loop are uniformly distributed among all the participating processors. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(static)
    for (i=0; i<1000; i++) {
...
}

In the above example, each of the four processors will process 250 iterations of the loop.

self [(chunk_size)]

In self scheduling, each participating processor processes a fixed number of iterations (called the "chunk size") until all the iterations of the loop have been processed. The optional chunk_size parameter specifies the "chunk size" to be used. Chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(self(120))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

120, 120, 120, 120, 120, 120, 120, 120, 40.

gss [(min_chunk_size)]

In guided self scheduling, each participating processor processes a variable number of iterations (called the "min chunk size") until all the iterations of the loop have been processed. The optional min_chunk_size parameter specifies that each variable chunk size used must be at least min_chunk_size in size. Min_chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable min_chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(gss(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

250, 188, 141, 106, 79, 59, 45, 33, 25, 19, 14, 11, 10, 10, 10.

factoring [(min_chunk_size)]

In factoring scheduling, each participating processor processes a variable number of iterations (called the "min chunk size") until all the iterations of the loop have been processed. The optional min_chunk_size parameter specifies that each variable chunk size used must be at least min_chunk_size in size. Min_chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable min_chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(factoring(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

125, 125, 125, 125, 62, 62, 62, 62, 32, 32, 32, 32, 16, 16, 16, 16, 10, 10, 10, 10, 10, 10.