Oracle® Developer Studio 12.6: C User's Guide

Updated: July 2017

4.2 Automatic Parallelization

The C compiler generates parallel code for loops that it determines are safe to parallelize. Typically, such loops have iterations that are independent of one another: it does not matter in what order the iterations are executed, or whether they are executed in parallel. Many, though not all, vector loops fall into this category.

Because of the way aliasing works in C, determining whether parallelization is safe is difficult. To help the compiler, Oracle Developer Studio C offers pragmas and additional pointer qualifications that let you supply aliasing information the compiler cannot deduce on its own. See Type-Based Alias Analysis for more information.

The following command illustrates how to enable automatic parallelization:

% cc -fast -xO4 -xautopar example.c -o example

This compiler command generates an executable called example, which can be executed normally. To find out how to take advantage of multiprocessor execution, see -xautopar.

4.2.1 Data Dependence and Interference

The C compiler analyzes the loops in a program to determine whether executing different iterations of a loop in parallel is safe, that is, whether any two iterations could interfere with each other. Interference typically occurs when one iteration of a loop reads a variable while another iteration writes the same variable. Consider the following program fragment:

Example 2  Loop With Dependence
for (i=1; i < 1000; i++) {
    sum = sum + a[i]; /* S1 */
}

In this example, any two successive iterations, i and i+1, will write and read the same variable sum. Therefore, in order for these two iterations to execute in parallel, some form of locking on the variable would be required. Otherwise, allowing the two iterations to execute in parallel is not safe.

However, the use of locks imposes overhead that might slow down the program. The C compiler will not ordinarily parallelize the loop in the example above because there is a data dependence between two iterations of the loop. Consider another example:

Example 3  Loop Without Dependence
for (i=1; i < 1000; i++) {
    a[i] = 2 * a[i]; /* S1 */
}

In this case, each iteration of the loop references a different array element. Therefore different iterations of the loop can be executed in any order. They may be executed in parallel without any locks because no two data elements of different iterations can possibly interfere.

The analysis performed by the compiler to determine whether two different iterations of a loop could reference the same variable is called data dependence analysis. Data dependences prevent loop parallelization if one of the references writes to the variable. The dependence analysis performed by the compiler can have three outcomes:

  • There is a dependence, in which case, executing the loop in parallel is not safe.

  • There is no dependence, in which case the loop may safely execute in parallel using an arbitrary number of processors.

  • The dependence cannot be determined. The compiler assumes, for safety, that a dependence might prevent parallel execution of the loop and will not parallelize the loop.

In the following example, whether two iterations of the loop write to the same element of array a depends on whether array b contains duplicate elements. Unless the compiler can determine this fact, it must assume there might be a dependence and not parallelize the loop.

Example 4  Loop That Might Contain Dependencies
for (i=1; i < 1000; i++) {
    a[b[i]] = 2 * a[i];
}

4.2.2 Private Scalars and Private Arrays

For some data dependences, the compiler might still be able to parallelize a loop. Consider the following example.

Example 5  Parallelizable Loop With Dependence
for (i=1; i < 1000; i++) {
    t = 2 * a[i];           /* S1 */
    b[i] = t;               /* S2 */
}

In this example, assuming that arrays a and b are non-overlapping arrays, there appears to be a data dependence in any two iterations due to the variable t. The following statements execute during iterations one and two.

Example 6  Iterations One and Two
t = 2*a[1];  /* 1 */
b[1] = t;    /* 2 */
t = 2*a[2];  /* 3 */
b[2] = t;    /* 4 */

Because statements one and three modify the variable t, the compiler cannot execute them in parallel. However, because the value of t is always computed and used in the same iteration, the compiler can use a separate copy of t for each iteration. This method eliminates the interference between different iterations due to such variables. In effect, variable t is used as a private variable for each thread executing that iteration, as shown in the following example.

Example 7  Variable t as a Private Variable for Each Thread
for (i=1; i < 1000; i++) {
    pt[i] = 2 * a[i];       /* S1 */
    b[i] = pt[i];           /* S2 */
}

In this example, each reference to the scalar variable t is replaced by a reference to an element of the array pt. Each iteration now uses a different element of pt, eliminating any data dependence between iterations. One problem with this approach is that it can require an extra, potentially large, array. In practice, the compiler allocates only one copy of the variable for each thread that participates in executing the loop. Each such copy is, in effect, private to its thread.

The compiler can also privatize array variables to create opportunities for parallel execution of loops. Consider the following example:

Example 8  Parallelizable Loop With an Array Variable
for (i=1; i < 1000; i++) {
    for (j=1; j < 1000; j++) {
            x[j] = 2 * a[i];        /* S1 */
            b[i][j] = x[j];         /* S2 */
    }
}

In this example, different iterations of the outer loop modify the same elements of array x, and thus the outer loop cannot be parallelized. However, if each thread executing the outer loop iterations has a private copy of the entire array x, then no interference between any two iterations of the outer loop would occur. The following example illustrates this point.

Example 9  Parallelizable Loop Using a Privatized Array
for (i=1; i < 1000; i++) {
    for (j=1; j < 1000; j++) {
            px[i][j] = 2 * a[i];    /* S1 */
            b[i][j] = px[i][j];     /* S2 */
    }
}

As in the case of private scalars, the array needs to be expanded only up to the number of threads executing in the system. The compiler performs this expansion automatically by allocating one copy of the original array in the private space of each thread.

4.2.3 Storeback

Privatizing variables can be very useful for improving parallelism in a program. However, if the privatized variable is referenced after the loop, the compiler must ensure that the original variable is left with the correct value. Consider the following example:

Example 10  Parallelized Loop Using Storeback
for (i=1; i < 1000; i++) {
    t = 2 * a[i];           /* S1 */
    b[i] = t;               /* S2 */
}
x = t;                      /* S3 */

In this example, the value of t referenced in statement S3 is the final value of t computed by the loop. After the variable t has been privatized and the loop has finished executing, the correct value must be stored back into the original variable. This process, called storeback, is accomplished by copying the value of t from the final iteration back to the original location of variable t. In many cases the compiler can do this automatically, but sometimes the last value cannot be computed easily.

Example 11  Loop That Cannot Use Storeback
for (i=1; i < 1000; i++) {
    if (c[i] > x[i] ) {         /* C1 */
            t = 2 * a[i];           /* S1 */
            b[i] = t;               /* S2 */
    }
}
x = t*t;                       /* S3 */

For correct execution, the value of t in statement S3 is usually not the value of t from the final iteration of the loop: it is the value from the last iteration for which the condition C1 was true. Computing that final value is difficult in general, so in cases like this example the compiler does not parallelize the loop.

4.2.4 Reduction Variables

There are cases when there is a real dependence between iterations of a loop that cannot be removed by simply privatizing the variables causing the dependence. For example, look at the following code where values are being accumulated from one iteration to the next.

Example 12  Loop That Might Be Parallelized
for (i=1; i < 1000; i++) {
    sum += a[i]*b[i]; /* S1 */
}

In this example, the loop computes the vector product of two arrays into a common variable called sum. This loop cannot be parallelized in a simple manner. However, the compiler can take advantage of the associative nature of the computation in statement S1 and allocate a private variable, psum, for each thread. Each thread's copy of psum is initialized to 0, and each thread accumulates its own partial sum in its copy. Before the threads synchronize at the end of the loop, each partial sum is added to the original variable sum. The variable sum is called a reduction variable because it computes a sum-reduction. One danger of promoting scalar variables to reduction variables is that, for floating-point types, the order in which rounded intermediate values are accumulated can change the final value of sum. Therefore, the compiler performs this transformation only if you specifically give it permission to do so.

4.2.5 Loop Transformations

The compiler performs several loop restructuring transformations to help improve the parallelization of loops. Some of these transformations can also improve single-processor execution of loops. The transformations performed by the compiler are described in this section.

4.2.5.1 Loop Distribution

Loops often contain a few statements that cannot be executed in parallel alongside many statements that can. Loop distribution attempts to move the sequential statements into a separate loop and gather the parallelizable statements into another. This process is illustrated in the following example:

Example 13  Candidate for Loop Distribution
for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];               /* S1 */
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
    y[i] = z[i] - x[i];                    /* S3 */
}

Assuming that arrays x, y, w, a, and z do not overlap, statements S1 and S3 can be parallelized but statement S2 cannot be. The following example shows how the loop looks after it is split or distributed into two different loops.

Example 14  Distributed Loop
/* L1: parallel loop */
for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];              /* S1 */
    y[i] = z[i] - x[i];                   /* S3 */
}
/* L2: sequential loop */
for (i=0; i < n; i++) {
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
}

After this transformation, loop L1 does not contain any statements that prevent the parallelization of the loop and may be executed in parallel. Loop L2, however, still has a non-parallelizable statement from the original loop.

Loop distribution is not always profitable or safe to perform. The compiler performs analysis to determine the safety and profitability of distribution.

4.2.5.2 Loop Fusion

If the granularity of a loop, or the work performed by a loop, is small, the performance gain from distribution might be insignificant because the overhead of parallel loop startup is too high compared to the loop workload. In such situations, the compiler uses loop fusion to combine several loops into a single parallel loop, increasing the granularity of the loop. Loop fusion is easy and safe when loops with identical trip counts are adjacent to each other. Consider the following example:

Example 15  Loops With Small Work Loads
/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];        /* S1 */
}
/* L2: another short parallel loop */
for (i=0; i < 100; i++) {
    b[i] = a[i] * d[i];        /* S2 */
}

The two short parallel loops are next to each other, and can be safely combined as follows:

Example 16  The Two Loops Fused
/* L3: a larger parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
    b[i] = a[i] * d[i];       /* S2 */
}

The new loop incurs half the parallel-loop startup overhead. Loop fusion can also help in other ways. For example, if the same data is referenced in both loops, combining them can improve the locality of reference.

However, loop fusion is not always safe to perform. If loop fusion creates a data dependence that did not exist previously, the fusion could result in incorrect execution. Consider the following example:

Example 17  Unsafe Fusion Candidates
/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];      /* S1 */
}
/* L2: a short loop with data dependence */
for (i=0; i < 100; i++) {
    a[i+1] = a[i] * d[i];    /* S2 */
}

If the loops in this example are fused, a data dependence from statement S2 to statement S1 is created: the value of a[i] on the right side of statement S1 would have been computed by statement S2 of the previous iteration. If the loops are not fused, this dependence does not occur. The compiler performs safety and profitability analysis to determine whether loop fusion should be done. The compiler can often fuse an arbitrary number of adjacent loops. Increasing the granularity in this way can sometimes make parallelization profitable for the combined loop where it was not for the individual loops.

4.2.5.3 Loop Interchange

Parallelizing the outermost loop in a loop nest is generally more profitable because the parallelization overhead is incurred only once. However, parallelizing the outermost loop is not always safe because of dependences that might be carried by that loop. The following example illustrates this situation.

Example 18  Nested Loop That Cannot Be Parallelized
for (i=0; i < n; i++) {
    for (j=0; j < n; j++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

In this example, the loop with the index variable i cannot be parallelized because of a dependence between successive iterations of that loop. The two loops can be interchanged so that the parallelizable loop (the j loop) becomes the outer loop:

Example 19  Loops Interchanged
for (j=0; j < n; j++) {
    for (i=0; i < n; i++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

The resulting loop incurs an overhead of parallel work distribution only once, while previously the overhead was incurred n times. The compiler performs safety and profitability analysis to determine whether to perform loop interchange.

4.2.6 Aliasing and Parallelization

Aliasing, which ISO C permits, can often prevent loops from being parallelized. Aliasing occurs when there are two possible references to the same memory location. Consider the following example:

Example 20  Loop With Two References to the Same Memory Location
void copy(float a[], float b[], int n) {
    int i;
    for (i=0; i < n; i++) {
            a[i] = b[i]; /* S1 */
    }
}

Because a and b are parameters, they might point to overlapping regions of memory, for example, if copy were called as follows:

copy(&x[10], &x[11], 20);

In the called routine, two successive iterations of the copy loop might be reading and writing the same element of the array x. However, if the routine copy were called as follows, there is no possibility of overlap in any of the 20 iterations of the loop:

copy(&x[10], &x[40], 20);

The compiler cannot analyze this situation correctly without information about how the routine is called. However, the Oracle Developer Studio C compiler does provide a keyword extension to standard ISO C to convey this kind of aliasing information. See Restricted Pointers for more information.

4.2.6.1 Array and Pointer References

Part of the aliasing problem is that the C language allows arrays to be referenced and modified through pointer arithmetic. For the compiler to parallelize loops automatically and effectively, all data that is laid out as an array must be referenced using C array-reference syntax rather than pointers. If pointer syntax is used, the compiler cannot determine the relationship of the data between different iterations of a loop, so it conservatively declines to parallelize the loop.

4.2.6.2 Restricted Pointers

For a compiler to parallelize the execution of a loop effectively, it needs to determine whether certain lvalues designate distinct regions of storage. Aliases are lvalues whose regions of storage are not distinct. Determining whether two pointers are aliases is a difficult and time-consuming process because it could require analysis of the entire program. Consider function vsq() in the following example:

Example 21  Loop With Two Pointers
void vsq(int n, double * a, double * b) {
    int i;
    for (i=0; i<n; i++) {
            b[i] = a[i] * a[i];
    }
}

The compiler can parallelize the execution of the different iterations of the loops if it has determined that pointers a and b access different objects. If there is an overlap in objects accessed through pointers a and b then it would be unsafe for the compiler to execute the loops in parallel. At compile time, the compiler cannot determine whether the objects accessed by a and b overlap by simply analyzing the function vsq(). The compiler might need to analyze the whole program to get this information.

Restricted pointers are used to specify pointers that designate distinct objects so that the compiler can perform pointer alias analysis. The following example shows function vsq() in which function parameters are declared as restricted pointers:

void vsq(int n, double * restrict a, double * restrict b)

Pointers a and b are declared as restricted pointers, so the compiler knows that a and b point to distinct regions of storage. With this alias information, the compiler is able to parallelize the loop.

The keyword restrict is a type qualifier, like volatile, and it may qualify only pointer types. If you do not want to change the source code, you can specify that pointer-valued function parameters be treated as restricted pointers by using the following command-line option:

-xrestrict=[func1,…,funcn]

If a function list is specified, pointer parameters in the specified functions are treated as restricted; otherwise, all pointer parameters in the entire C file are treated as restricted. For example, -xrestrict=vsq qualifies the pointers a and b of the function vsq() shown in the first example with the keyword restrict.

It is critical that you use restrict correctly. If pointers qualified as restricted actually point to objects that are not distinct, the compiler can incorrectly parallelize loops, resulting in undefined behavior. For example, assume that pointers a and b of function vsq() point to objects that overlap such that b[i] and a[i+1] are the same object. If a and b are not declared as restricted pointers, the loops are executed serially and the results are well defined. If a and b are incorrectly qualified as restricted pointers, the compiler may parallelize the execution of the loops, which is not safe, because b[i+1] should be computed only after b[i] is computed.