Sun Studio 12: C User's Guide

3.8 Aliasing and Parallelization

ISO C aliasing can often prevent loops from getting parallelized. Aliasing occurs when there are two possible references to the same memory location. Consider the following example:

Example 3–21 A Loop With Two References to the Same Memory Location

void copy(float a[], float b[], int n) {
    int i;
    for (i=0; i < n; i++) {
            a[i] = b[i]; /* S1 */
    }
}

Since variables a and b are parameters, it is possible that a and b may be pointing to overlapping regions of memory; e.g, if copy were called as follows:

copy (x[10], x[11], 20);

In the called routine, two successive iterations of the copy loop may be reading and writing the same element of the array x. However, if the routine copy were called as follows then there is no possibility of overlap in any of the 20 iterations of the loop:

copy (x[10], x[40], 20);

In general, it is not possible for the compiler to analyze this situation correctly without knowing how the routine is called. The compiler provides a keyword extension to ISO C that lets you convey this kind of aliasing information. See 3.8.2 Restricted Pointers for more information.

3.8.1 Array and Pointer References

Part of the aliasing problem is that the C language can define array referencing and definition through pointer arithmetic. In order for the compiler to effectively parallelize loops, either automatically or explicitly with pragmas, all data that is laid out as an array must be referenced using C array reference syntax and not pointers. If pointer syntax is used, the compiler cannot determine the relationship of the data between different iterations of a loop. Thus it will be conservative and not parallelize the loop.

3.8.2 Restricted Pointers

In order for a compiler to effectively perform parallel execution of a loop, it needs to determine if certain lvalues designate distinct regions of storage. Aliases are lvalues whose regions of storage are not distinct. Determining if two pointers to objects are aliases is a difficult and time consuming process because it could require analysis of the entire program. Consider function vsq() below:

Example 3–22 A Loop With Two Pointers

void vsq(int n, double * a, double * b) {
    int i;
    for (i=0; i<n; i++) {
            b[i] = a[i] * a[i];
    }
}

The compiler can parallelize the execution of the different iterations of the loops if it knows that pointers a and b access different objects. If there is an overlap in objects accessed through pointers a and b then it would be unsafe for the compiler to execute the loops in parallel. At compile time, the compiler does not know if the objects accessed by a and b overlap by simply analyzing the function vsq(); the compiler may need to analyze the whole program to get this information.

Restricted pointers are used to specify pointers which designate distinct objects so that the compiler can perform pointer alias analysis. The following is an example of function vsq() in which function parameters are declared as restricted pointers:

void vsq(int n, double * restrict a, double * restrict b)

Pointers a and b are declared as restricted pointers, so the compiler knows that a and b point to distinct regions of storage. With this alias information, the compiler is able to parallelize the loop.

The keyword restrict is a type-qualifier, like volatile, and it shall only qualify pointer types. restrict is recognized as a keyword when you use -xc99=all (except with -Xs). There are situations in which you may not want to change the source code. You can specify that pointer-valued function-parameters be treated as restricted pointers by using the following command line option:

-xrestrict=[func1,…,funcn]

If a function list is specified, then pointer parameters in the specified functions are treated as restricted; otherwise, all pointer parameters in the entire C file are treated as restricted. For example, -xrestrict=vsq, qualifies the pointers a and b given in the first example of the function vsq() with the keyword restrict.

It is critical that you use restrict correctly. If pointers qualified as restricted pointers point to objects which are not distinct, the compiler can incorrectly parallelize loops resulting in undefined behavior. For example, assume that pointers a and b of function vsq() point to objects which overlap, such that b[i] and a[i+1] are the same object. If a and b are not declared as restricted pointers the loops will be executed serially. If a and b are incorrectly qualified as restricted pointers the compiler may parallelize the execution of the loops, which is not safe, because b[i+1] should only be computed after b[i] is computed.

3.8.3 Explicit Parallelization and Pragmas

Often, there is not enough information available for the compiler to make a decision on the legality or profitability of parallelization. The compiler supports pragmas that allow the programmer to effectively parallelize loops that otherwise would be too difficult or impossible for the compiler to handle. The Sun-Specific MP pragmas detailed in the rest of this section have been deprecated in favor of the OpenMP standard. See the OpenMP API User’s Guide for information to the directives of the standard.

3.8.3.1 Serial Pragmas

Note –

The Sun-specific MP pragmas have been deprecated and are no longer supported. However, the compiler supports the APIs specified by the OpenMP 2.5 standard instead. See the OpenMP API User’s Guide for migration information to the directives of the standard.

There are two serial pragmas, and both apply to for loops:

#pragma MP serial_loop
#pragma MP serial_loop_nested

The #pragma MP serial_loop pragma indicates to the compiler that the next for loop is not to be automatically parallelized.

The #pragma MP serial_loop_nested pragma indicates to the compiler that the next for loop and any for loops nested within the scope of this for loop are not to be automatically parallelized.

The scope of these pragmas begins with the pragma and ends with the beginning of the next block, the next for loop within the current block, or the end of the current block, which ever occurs first.

3.8.3.2 Parallel Pragma

Note –

There is one parallel pragma: #pragma MP taskloop [options].

The MP taskloop pragma can, optionally, take one or more of the following arguments.

maxcpus (number_of_processors)
private (list_of_private_variables)
shared (list_of_shared_variables)
readonly (list_of_readonly_variables)
storeback (list_of_storeback_variables)
savelast
reduction (list_of_reduction_variables)
schedtype (scheduling_type)

The scope of these pragmas begins with the pragma and ends with which ever of the following occurs first: the beginning of the next block, the next for loop within the current block, the end of the current block. The pragma applies to the next for loop prior to the end of the pragmas scope.

Only one option can be specified per MP taskloop pragma; however, the pragmas are cumulative and apply to the next for loop within the pragmas scope:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop shared(a,b)
#pragma MP taskloop storeback(x)

These options may appear multiple times prior to the for loop to which they apply. In case of conflicting options, the compiler will issue a warning message.

Nesting of `for` Loops

An MP taskloop pragma applies to the next for loop within the current block. There is no nesting of parallelized for loops by parallelized C.

Eligibility for Parallelizing

An MP taskloop pragma suggests to the compiler that, unless otherwise disallowed, the specified for loop should be parallelized.

Any for loop with irregular control flow and unknown loop iteration increment is ineligible for parallelization. For example, for loops containing setjmp, longjmp, exit, abort, return, goto, labels, and break should not be considered as candidates for parallelization.

Of particular importance is to note that for loops with inter-iteration dependencies can be eligible for explicit parallelization. This means that if an MP taskloop pragma is specified for such a loop the compiler will simply honor it, unless the for loop is disqualified. It is the user’s responsibility to make sure that such explicit parallelization will not lead to incorrect results.

If both the serial_loop or serial_loop_nested and taskloop pragmas are specified for a for loop, the last one specified will prevail.

Consider the following example:

#pragma MP serial_loop_nested
    for (i=0; i<100; i++) {
   # pragma MP taskloop
      for (j=0; j<1000; j++) {
      ...
 }
}

The i loop will not be parallelized but the j loop might be.

Number of Processors

#pragma MP taskloop maxcpus (number_of_processors) specifies the number of processors to be used for this loop, if possible.

The value of maxcpus must be a positive integer. If maxcpus equals 1, then the specified loop will be executed in serial. (Note that setting maxcpus to be 1 is equivalent to specifying the serial_loop pragma.) The smaller of the values of maxcpus or the interpreted value of the PARALLEL environment variable will be used. When the environment variable PARALLEL is not specified, it is interpreted as having the value 1.

If more than one maxcpus pragma is specified for a for loop, the last one specified will prevail.

Classifying Variables

A variable used in a loop is classified as being either a private, shared, reduction, or readonly variable. The variable belongs to only one of these classifications. A variable can only be classified as a reduction or readonly variable through an explicit pragma. See #pragma MP taskloop reduction and #pragma MP taskloop readonly. A variable can be classified as being either a private or shared variable through an explicit pragma or through the following default scoping rules.

Default Scoping Rules for `private` and `shared` Variables

A private variable is one whose value is private to each processor processing some iterations of a for loop. In other words, the value assigned to a private variable in one iteration of a for loop is not propagated to other processors processing other iterations of that for loop. A shared variable, on the other hand, is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop. Loops being explicitly parallelized through use of #pragma MP taskloop directives, that contain references to shared variables, must ensure that such sharing of values does not cause any correctness problems (such as race conditions). No synchronization is provided by the compiler on updates and accesses to shared variables in an explicitly parallelized loop.

In analyzing explicitly parallelized loops, the compiler uses the following “default scoping rules” to determine whether a variable is private or shared:

If a variable is not explicitly classified via a pragma, the variable will default to being classified as a shared variable if it is declared as a pointer or array, and is only referenced using array syntax within the loop. Otherwise, it will be classified as a private variable.
The loop index variable is always treated as a private variable and is always a storeback variable.

It is highly recommended that all variables used in an explicitly parallelized for loop be explicitly classified as one of shared, private, reduction, or readonly, to avoid the “default scoping rules.”

Since the compiler does not perform any synchronization on accesses to shared variables, extreme care must be exercised before using an MP taskloop pragma for a loop that contains, for example, array references. If inter-iteration data dependencies exist in such an explicitly parallelized loop, then its parallel execution may give erroneous results. The compiler may or may not be able to detect such a potential problem situation and issue a warning message. In any case, the compiler will not disable the explicit parallelization of loops with potential shared variable problems.

`private` Variables

#pragma MP taskloop private (list_of_private_variables)

Use this pragma to specify all the variables that should be treated as private variables for this loop. All other variables used in the loop that are not explicitly specified as shared, readonly, or reduction variables, are either shared or private as defined by the default scoping rules.

A private variable is one whose value is private to each processor processing some iterations of a loop. In other words, the value assigned to a private variable by one of the processors working on iterations of a loop is not propagated to other processors processing other iterations of that loop. A private variable has no initial value at the start of each iteration of a loop and must be set to a value within the iteration of a loop prior to its first use within that iteration. Execution of a program with a loop containing an explicitly declared private variable whose value is used prior to being set will result in undefined behavior.

`shared` Variables

#pragma MP taskloop shared (list_of_shared_variables)

Use this pragma to specify all the variables that should be treated as shared variables for this loop. All other variables used in the loop that are not explicitly specified as private, readonly, storeback or reduction variables, are either shared or private as defined by the default scoping rules.

A shared variable is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop.

`readonly` Variables

#pragma MP taskloop readonly (list_of_readonly_variables)

readonly variables are a special class of shared variables that are not modified in any iteration of a loop. Use this pragma to indicate to the compiler that it may use a separate copy of that variable’s value for each processor processing iterations of the loop.

`storeback` Variables

#pragma MP taskloop storeback (list_of_storeback_variables)

Use this pragma to specify all the variables to be treated as storeback variables.

A storeback variable is one whose value is computed in a loop, and this computed value is then used after the termination of the loop. The last loop iteration values of storeback variables are available for use after the termination of the loop. Such a variable is a good candidate to be declared explicitly via this directive as a storeback variable when the variable is a private variable, whether by explicitly declaring the variable private or by the default scoping rules.

Note that the storeback operation for a storeback variable occurs at the last iteration of the explicitly parallelized loop, regardless of whether or not that iteration updates the value of the storeback variable. In other words, the processor that processes the last iteration of a loop may not be the same processor that currently contains the last updated value for a storeback variable. Consider the following example:

#pragma MP taskloop private(x)
#pragma MP taskloop storeback(x)
   for (i=1; i <= n; i++) {
      if (...) {
          x=...
      }
}
   printf (“%d”, x);

In the previous example the value of the storeback variable x printed out via the printf() call may not be the same as that printed out by a serial version of the i loop, because in the explicitly parallelized case, the processor that processes the last iteration of the loop (when i==n), which performs the storeback operation for x may not be the same processor that currently contains the last updated value for x. The compiler will attempt to issue a warning message to alert the user of such potential problems.

In an explicitly parallelized loop, variables referenced as arrays are not treated as storeback variables. Hence it is important to include them in the list_of_storeback_variables if such storeback operation is desired (for example, if the variables referenced as arrays have been declared as private variables).

`savelast`

#pragma MP taskloop savelast

Use this pragma to specify all the private variables of a loop that you want to be treated as storeback variables. The syntax of this pragma is as follows:

#pragma MP taskloop savelast

It is often convenient to use this form, rather than list out each private variable of a loop when declaring each variable as storeback variables.

`reduction` Variables

#pragma MP taskloop reduction (list_of_reduction_variables) specifies that all the variables appearing in the reduction list will be treated as reduction variables for the loop. A reduction variable is one whose partial values can be individually computed by each of the processors processing iterations of the loop, and whose final value can be computed from all its partial values. The presence of a list of reduction variables can facilitate the compiler in identifying that the loop is a reduction loop, allowing generation of parallel reduction code for it. Consider the following example:

#pragma MP taskloop reduction(x)
    for (i=0; i<n; i++) {
         x = x + a[i];
}

the variable x is a (sum) reduction variable and the i loop is a(sum) reduction loop.

Scheduling Control

The Sun ISO C compiler supports several pragmas that can be used in conjunction with the taskloop pragma to control the loop scheduling strategy for a given loop. The syntax for this pragma is:

#pragma MP taskloop schedtype (scheduling_type)

This pragma can be used to specify the specific scheduling_type to be used to schedule the parallelized loop. Scheduling_type can be one of the following:

static

In static scheduling all the iterations of the loop are uniformly distributed among all the participating processors. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(static)
    for (i=0; i<1000; i++) {
...
}

In the above example, each of the four processors will process 250 iterations of the loop.

self [(chunk_size)]

In self scheduling, each participating processor processes a fixed number of iterations (called the “chunk size”) until all the iterations of the loop have been processed. The optional chunk_size parameter specifies the “chunk size” to be used. Chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable, chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(self(120))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

120, 120, 120, 120, 120, 120, 120, 120, 40.

gss [(min_chunk_size)]

In guided self scheduling, each participating processor processes a variable number of iterations (called the “min chunk size”) until all the iterations of the loop have been processed. The optional min_chunk_size parameter specifies that each variable chunk size used must be at least min_chunk_size in size. Min_chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable, min_chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(gss(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

250, 188, 141, 106, 79, 59, 45, 33, 25, 19, 14, 11, 10, 10, 10.

factoring [(min_chunk_size)]

In factoring scheduling, each participating processor processes a variable number of iterations (called the “min chunk size”) until all the iterations of the loop have been processed. The optional min_chunk_size parameter specifies that each variable chunk size used must be at least min_chunk_size in size. Min_chunk_size must be a positive integer constant, or variable of integral type. If specified as a variable min_chunk_size must evaluate to a positive integer value at the beginning of the loop. If this optional parameter is not specified or its value is not positive, the compiler will select the chunk size to be used. Consider the following example:

#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(factoring(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

125, 125, 125, 125, 62, 62, 62, 62, 32, 32, 32, 32, 16, 16, 16, 16, 10, 10, 10, 10, 10, 10

3.8 Aliasing and Parallelization

Example 3–21 A Loop With Two References to the Same Memory Location

3.8.1 Array and Pointer References

3.8.2 Restricted Pointers

Example 3–22 A Loop With Two Pointers

3.8.3 Explicit Parallelization and Pragmas

3.8.3.1 Serial Pragmas

3.8.3.2 Parallel Pragma

Nesting of for Loops

Eligibility for Parallelizing

Number of Processors

Classifying Variables

Default Scoping Rules for private and shared Variables

private Variables

shared Variables

readonly Variables

storeback Variables

savelast

reduction Variables

Scheduling Control

Nesting of `for` Loops

Default Scoping Rules for `private` and `shared` Variables

`private` Variables

`shared` Variables

`readonly` Variables

`storeback` Variables

`savelast`

`reduction` Variables