Sun Studio 12: C User's Guide

Chapter 3 Parallelizing Sun C Code

The Sun C compiler can optimize code to run on SPARC shared-memory multiprocessor machines. The process is called parallelizing. The compiled code can execute in parallel using the multiple processors on the system. This chapter explains how you can take advantage of the compiler’s parallelizing features.

3.1 Overview

The C compiler generates parallel code for those loops that it determines are safe to parallelize. Typically, these loops have iterations that are independent of each other. For such loops, it does not matter in what order the iterations are executed or if they are executed in parallel. Many, though not all, vector loops fall into this category.

Because of the way aliasing works in C, it is difficult to determine the safety of parallelization. To help the compiler, Sun C offers pragmas and additional pointer qualifications to provide aliasing information known to the programmer that the compiler cannot determine. See Chapter 5, Type-Based Alias Analysis for more information.

3.1.1 Example of Use

The following example illustrates how to enable and control parallelized C:


% cc -fast -xO4 -xautopar example.c -o example

This generates an executable called example, which can be executed normally. If you wish to take advantage of multiprocessor execution, see B.2.69 -xautopar.

3.2 Parallelizing for OpenMP

You can compile your code so that it complies with the OpenMP specification. For more information on the OpenMP specification for C, visit the web site at http://www.openmp.org/specs/.

To take advantage of the compiler's OpenMP support, you need to specify the compiler's -xopenmp option. See B.2.118 -xopenmp[=i].
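As a brief illustration of the kind of directive this option enables, the following is a minimal, generic OpenMP work-sharing loop written in standard OpenMP C; it is not specific to this compiler, and the variable names are illustrative only:


#include <stdio.h>

int main(void)
{
    int i;
    double a[1000];

    #pragma omp parallel for      /* iterations are divided among the threads */
    for (i = 0; i < 1000; i++) {
        a[i] = 2.0 * i;
    }

    printf("a[999] = %f\n", a[999]);
    return 0;
}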

See the OpenMP API User’s Guide for information on migrating to the directives of the standard.

3.2.1 Handling OpenMP Runtime Warnings

The OpenMP runtime system can issue warnings for non-fatal errors. Use the following function to register a callback function to handle these warnings:

int sunw_mp_register_warn(void (*func) (void *) )

You can access the prototype for this function by issuing a #include preprocessor directive for <sunw_mp_misc.h>.
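The following is a minimal sketch of registering such a handler. It assumes that the void * argument carries the text of the warning message and that a return value of zero indicates successful registration; check these assumptions against <sunw_mp_misc.h> on your system.


#include <stdio.h>
#include <sunw_mp_misc.h>    /* declares sunw_mp_register_warn() */

/* Assumed: the void * argument points to the warning message text. */
static void handle_mp_warning(void *msg)
{
    fprintf(stderr, "MP warning: %s\n", (const char *)msg);
}

int main(void)
{
    /* Assumed: a return value of 0 indicates successful registration. */
    if (sunw_mp_register_warn(handle_mp_warning) != 0) {
        fprintf(stderr, "could not register the MP warning handler\n");
    }
    /* ... the rest of the parallelized program ... */
    return 0;
}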

If you do not want to register a function, set the environment variable SUNW_MP_WARN to TRUE so that the warning messages are sent to stderr. For more information on SUNW_MP_WARN, see 3.3.3 SUNW_MP_WARN.

For information specific to this implementation of OpenMP, see the OpenMP API User’s Guide.

3.3 Environment Variables

There are four environment variables that relate to parallelized C: PARALLEL, SUNW_MP_THR_IDLE, SUNW_MP_WARN, and STACKSIZE.

3.3.1 PARALLEL

Set the PARALLEL environment variable if you can take advantage of multiprocessor execution. The PARALLEL environment variable specifies the number of processors available to the program. The following example sets PARALLEL to two:


% setenv PARALLEL 2

If the target machine has multiple processors, the threads can map to independent processors. Running the program leads to the creation of two threads that execute the parallelized portions of the program.

3.3.2 SUNW_MP_THR_IDLE

Currently, the starting thread of a program creates bound threads. Once created, these bound threads participate in executing the parallel part of a program (parallel loop, parallel region, etc.) and keep spin-waiting while the sequential part of the program runs. These bound threads never sleep or stop until the program terminates. Having these threads spin-wait generally gives the best performance when a parallelized program runs on a dedicated system. However, threads that are spin-waiting use system resources.

Use the SUNW_MP_THR_IDLE environment variable to control the status of each thread after it finishes its share of a parallel job.


% setenv SUNW_MP_THR_IDLE value

You can substitute either spin or sleep[n s|n ms] for value. The default is sleep, which puts a thread to sleep after it has spin-waited for n units. The wait unit can be seconds (s, the default unit) or milliseconds (ms); for example, 1s means one second and 10ms means ten milliseconds. sleep with no arguments puts the thread to sleep immediately after it completes a parallel task. sleep, sleep0, sleep0s, and sleep0ms are all equivalent. If a new job arrives before n units have elapsed, the thread stops spin-waiting and starts the new job.

The other choice, spin, means that the thread busy-waits after completing a parallel task until a new parallel task arrives.

If SUNW_MP_THR_IDLE contains an illegal value or is not set, sleep is used as the default.
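For example, the following illustrative settings select busy-waiting, sleeping after a two-second spin-wait, and sleeping after a 100-millisecond spin-wait, respectively:


% setenv SUNW_MP_THR_IDLE spin
% setenv SUNW_MP_THR_IDLE sleep2s
% setenv SUNW_MP_THR_IDLE sleep100ms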

3.3.3 SUNW_MP_WARN

Set this environment variable to TRUE to print warning messages from OpenMP and other parallelization runtime systems.


% setenv SUNW_MP_WARN TRUE

If you registered a function with sunw_mp_register_warn() to handle warning messages, then no warning messages are printed, even if you set SUNW_MP_WARN to TRUE. If you did not register a function and you set SUNW_MP_WARN to TRUE, the warning messages are printed to stderr. If you do not register a function and you do not set SUNW_MP_WARN, no warning messages are issued. For more information on sunw_mp_register_warn(), see 3.2.1 Handling OpenMP Runtime Warnings.

3.3.4 STACKSIZE

The executing program maintains a main memory stack for the master thread and distinct stacks for each slave thread. Stacks are temporary memory address spaces used to hold arguments and automatic variables over subprogram invocations.

The default size of the main stack is about eight megabytes. Use the limit command to display the current main stack size as well as set it.


% limit
cputime unlimited
filesize unlimited
datasize 2097148 kbytes
stacksize 8192 kbytes <- current main stack size
coredumpsize 0 kbytes
descriptors 256
memorysize unlimited
% limit stacksize 65536 <- set main stack to 64Mb

Each slave thread of a multithreaded program has its own thread stack. This stack mimics the main stack of the master thread but is unique to the thread. The thread’s private arrays and variables (local to the thread) are allocated on the thread stack.

All slave threads have the same stack size, which is four megabytes for 32-bit applications and eight megabytes for 64-bit applications by default. The size is set with the STACKSIZE environment variable:


% setenv STACKSIZE 16384 <- Set thread stack size to 16 MB

Setting the thread stack size to a value larger than the default may be necessary for some parallelized code.

Sometimes the compiler may generate a warning message that indicates a bigger stack size is needed. However, it may not be possible to know just how large to set it, except by trial and error, especially if private/local arrays are involved. If the stack size is too small for a thread to run, the program will abort with a segmentation fault.

3.3.5 Keyword

The keyword restrict can be used with parallelized C. Proper use of restrict helps the optimizer understand the aliasing of data, which is required to determine whether a code sequence can be parallelized. Refer to D.1.2 C99 Keywords for details.

3.4 Data Dependence and Interference

The C compiler performs analysis on loops in programs to determine whether it is safe to execute different iterations of a loop in parallel. The purpose of this analysis is to determine whether any two iterations of the loop could interfere with each other. Typically, this happens if one iteration could read a variable while another iteration is writing the very same variable. Consider the following program fragment:


Example 3–1 A Loop With Dependence


for (i=1; i < 1000; i++) {
    sum = sum + a[i]; /* S1 */
}

In Example 3–1, any two successive iterations, i and i+1, will write and read the same variable sum. Therefore, in order for these two iterations to execute in parallel, some form of locking on the variable would be required. Otherwise, it is not safe to allow the two iterations to execute in parallel.

However, the use of locks imposes overhead that might slow down the program. The C compiler does not ordinarily parallelize the loop in Example 3–1, because there is a data dependence between two iterations of the loop. Consider another example:


Example 3–2 A Loop Without Dependence


for (i=1; i < 1000; i++) {
    a[i] = 2 * a[i]; /* S1 */
}

In this case each iteration of the loop references a different array element. Therefore different iterations of the loop can be executed in any order. They may be executed in parallel without any locks because no two data elements of different iterations can possibly interfere.

The analysis performed by the compiler to determine if two different iterations of a loop could reference the same variable is called data dependence analysis. Data dependences prevent loop parallelization if one of the references writes to the variable. The dependence analysis performed by the compiler can have three outcomes: there is a dependence, in which case it is not safe to execute the loop in parallel; there is no dependence, in which case the loop may safely be executed in parallel using an arbitrary number of processors; or the dependence cannot be determined, in which case the compiler assumes, for safety, that there is a dependence and does not parallelize the loop.

In Example 3–3, whether or not two iterations of the loop write to the same element of array a depends on whether or not array b contains duplicate elements. Unless the compiler can determine this fact, it assumes there is a dependence and does not parallelize the loop.


Example 3–3 A Loop That May or May Not Contain Dependencies


for (i=1; i < 1000; i++) {
    a[b[i]] = 2 * a[i];
}

3.4.1 Parallel Execution Model

The parallel execution of loops is performed by Solaris threads. The thread starting the initial execution of the program is called the master thread. At program start-up the master thread creates multiple slave threads as shown in the following figure. At the end of the program all the slave threads are terminated. Slave thread creation is performed exactly once to minimize the overhead.

Figure 3–1 Master and Slave Threads

A figure which shows a Master Thread spawning slave threads.

After start-up, the master thread starts the execution of the program while slave threads wait idly. When the master thread encounters a parallel loop, different iterations of the loop are distributed among the slave and master threads which start the execution of the loop. After each thread finishes execution of its chunk it synchronizes with the remaining threads. This synchronization point is called a barrier. The master thread cannot continue executing the remainder of the program until all the threads have finished their work and reached the barrier. The slave threads go into a wait state after the barrier waiting for more parallel work, and the master thread continues to execute the program.

During this process, overhead is incurred in distributing the work among the threads and in synchronizing the threads at the barrier.

In general, there may be some parallel loops for which the amount of useful work performed is not enough to justify the overhead. For such loops, there may be appreciable slowdown. In the following figure, a loop is parallelized. However the barriers, represented by horizontal bars, introduce significant overhead. The work between the barriers is performed serially or in parallel as indicated. The amount of time required to execute the loop in parallel is considerably less than the amount of time required to synchronize the master and slave threads at the barriers.

Figure 3–2 Parallel Execution of a Loop

Figure showing parallel execution of a loop.

3.4.2 Private Scalars and Private Arrays

There are some data dependences for which the compiler may still be able to parallelize a loop. Consider the following example.


Example 3–4 A Parallelizable Loop With Dependence


for (i=1; i < 1000; i++) {
    t = 2 * a[i];           /* S1 */
    b[i] = t;               /* S2 */
}

In this example, assuming that arrays a and b are non-overlapping arrays, there appears to be a data dependence in any two iterations due to the variable t. The following statements execute during iterations one and two.


Example 3–5 Iterations One and Two


t = 2*a[1];  /* 1 */
b[1] = t;    /* 2 */
t = 2*a[2];  /* 3 */
b[2] = t;    /* 4 */

Because statements one and three modify the variable t, the compiler cannot execute them in parallel. However, the value of t is always computed and used in the same iteration, so the compiler can use a separate copy of t for each iteration. This eliminates the interference between different iterations due to such variables. In effect, we have made variable t a private variable for each thread executing that iteration. This can be illustrated as follows:


Example 3–6 Variable t as a Private Variable for Each Thread


for (i=1; i < 1000; i++) {
    pt[i] = 2 * a[i];       /* S1 */
    b[i] = pt[i];           /* S2 */
}

Example 3–6 is essentially the same as Example 3–4, but each reference to the scalar variable t is now replaced by a reference to an element of the array pt. Each iteration now uses a different element of pt, and this eliminates any data dependencies between any two iterations. Of course, one problem with this illustration is that it may lead to an extremely large array. In practice, the compiler allocates only one copy of the variable for each thread that participates in the execution of the loop. Each such copy is, in effect, private to the thread.

The compiler can also privatize array variables to create opportunities for parallel execution of loops. Consider the following example:


Example 3–7 A Parallelizable Loop With an Array Variable


for (i=1; i < 1000; i++) {
    for (j=1; j < 1000; j++) {
            x[j] = 2 * a[i];        /* S1 */
            b[i][j] = x[j];         /* S2 */
    }
}

In Example 3–7, different iterations of the outer loop modify the same elements of array x, and thus the outer loop cannot be parallelized. However, if each thread executing the outer loop iterations has a private copy of the entire array x, then there is no interference between any two iterations of the outer loop. This is illustrated as follows:


Example 3–8 A Parallelizable Loop Using a Privatized Array


for (i=1; i < 1000; i++) {
    for (j=1; j < 1000; j++) {
            px[i][j] = 2 * a[i];    /* S1 */
            b[i][j] = px[i][j];     /* S2 */
    }
}

As in the case of private scalars, it is not necessary to expand the array for all the iterations, but only up to the number of threads executing in the system. The compiler does this automatically by allocating one copy of the original array in the private space of each thread.

3.4.3 Storeback

Privatization of variables can be very useful for improving the parallelism in the program. However, if the private variable is referenced outside the loop, then the compiler needs to ensure that it has the right value. Consider the following example:


Example 3–9 A Parallelized Loop Using Storeback


for (i=1; i < 1000; i++) {
    t = 2 * a[i];           /* S1 */
    b[i] = t;               /* S2 */
}
x = t;                      /* S3 */

In Example 3–9, the value of t referenced in statement S3 is the final value of t computed by the loop. After the variable t has been privatized and the loop has finished executing, the right value of t needs to be stored back into the original variable. This is called storeback, and it is done by copying the value of t from the final iteration back to the original location of variable t. In many cases the compiler can do this automatically. However, there are situations where the last value cannot be computed so easily:


Example 3–10 A Loop That Cannot Use Storeback


for (i=1; i < 1000; i++) {
    if (c[i] > x[i] ) {         /* C1 */
            t = 2 * a[i];           /* S1 */
            b[i] = t;               /* S2 */
    }
}
x = t*t;                       /* S3 */

For correct execution, the value of t in statement S3 is not, in general, the value of t on the final iteration of the loop. It is, in fact, the value from the last iteration for which the condition C1 is true. Computing the final value of t is quite hard in the general case, and in cases like this the compiler does not parallelize the loop.

3.4.4 Reduction Variables

There are cases when there is a real dependence between iterations of a loop and the variables causing the dependence cannot simply be privatized. This can arise, for example, when values are being accumulated from one iteration to the next.


Example 3–11 A Loop That May or May Not Be Parallelized


for (i=1; i < 1000; i++) {
    sum += a[i]*b[i]; /* S1 */
}

In Example 3–11, the loop computes the vector product of two arrays into a common variable called sum. This loop cannot be parallelized in a simple manner. However, the compiler can take advantage of the associative nature of the computation in statement S1 and allocate a private variable psum[i] for each thread. Each copy of the variable psum[i] is initialized to 0. Each thread computes its own partial sum in its own copy of the variable psum[i]. Before crossing the barrier, all the partial sums are added onto the original variable sum. In this example, the variable sum is called a reduction variable because it computes a sum-reduction. However, one danger of promoting scalar variables to reduction variables is that the manner in which rounded values are accumulated can change the final value of sum. The compiler performs this transformation only if you specifically give permission for it to do so.
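The following is a minimal sketch, in ordinary serial C, of the transformation described above; it is not the compiler's generated code. The names NTHREADS, psum, lo, and hi are illustrative, and the outer loop over t stands in for work that the runtime actually performs concurrently:


#include <stdio.h>

#define N        1000
#define NTHREADS 4

int main(void)
{
    double a[N], b[N], psum[NTHREADS], sum = 0.0;
    int i, t;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    for (t = 0; t < NTHREADS; t++) {           /* conceptually, one thread per t */
        int lo = t * (N / NTHREADS);
        int hi = lo + N / NTHREADS;
        psum[t] = 0.0;                         /* each thread's private copy starts at 0 */
        for (i = lo; i < hi; i++)
            psum[t] += a[i] * b[i];            /* accumulate without touching sum */
    }
    for (t = 0; t < NTHREADS; t++)             /* combine partial sums before the barrier */
        sum += psum[t];

    printf("sum = %f\n", sum);
    return 0;
}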

3.5 Speedups

If the compiler does not parallelize a portion of a program where a significant amount of time is spent, then no speedup occurs. This is basically a consequence of Amdahl's Law. For example, if a loop that accounts for five percent of the execution time of a program is parallelized, then the overall speedup is limited to five percent. However, there may not be any improvement at all, depending on the size of the workload and the parallel execution overheads.

As a general rule, the larger the fraction of program execution that is parallelized, the greater the likelihood of a speedup.

Each parallel loop incurs a small overhead during start-up and shutdown. The start-up overhead includes the cost of work distribution, and the shutdown overhead includes the cost of the barrier synchronization. If the total amount of work performed by the loop is not large enough, then no speedup occurs; in fact, the loop might even slow down. So if a large fraction of program execution is accounted for by a large number of short parallel loops, then the whole program may slow down instead of speeding up.

The compiler performs several loop transformations that try to increase the granularity of the loops. Some of these transformations are loop interchange and loop fusion. So in general, if the amount of parallelism in a program is small or is fragmented among small parallel regions, then the speedup is less.

Often, scaling up a problem size improves the fraction of parallelism in a program. For example, consider a problem that consists of two parts: a quadratic part that is sequential, and a cubic part that is parallelizable. For this problem, the parallel part of the workload grows faster than the sequential part, so at some point the problem will speed up nicely, unless it runs into resource limitations.

It is beneficial to experiment with tuning, directives, problem sizes, and program restructuring in order to achieve benefits from parallel C.

3.5.1 Amdahl’s Law

Fixed problem-size speedup is generally governed by Amdahl's Law, which simply says that the amount of parallel speedup in a given problem is limited by the sequential portion of the problem. The following equation describes the speedup of a problem where F is the fraction of time spent in the sequential region, and the remaining fraction of the time is spent uniformly among P processors. If the second term of the equation drops to zero, the total speedup is bounded by the first term, which remains fixed.

1/S = F + (1 - F)/P
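For example, with a sequential fraction F = 0.1 and P = 8 processors, 1/S = 0.1 + 0.9/8 = 0.2125, so the speedup S is about 4.7. Even with an unlimited number of processors, the speedup for such a program cannot exceed 1/F = 10.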

The following figure illustrates this concept diagrammatically. The darkly shaded portion represents the sequential part of the program, and remains constant for one, two, four, and eight processors, while the lightly shaded portion represents the parallel portion of the program that can be divided uniformly among an arbitrary number of processors.

Figure 3–3 Fixed Problem Speedups

As the number of processors increases, the amount of time required for the parallel portion of each program decreases, whereas the serial portion of each program stays the same.

In reality, however, you may incur overheads due to communication and the distribution of work to multiple processors. These overheads may or may not be fixed for an arbitrary number of processors.

The following figure illustrates the ideal speedups for a program containing 0%, 2%, 5%, and 10% sequential portions. Here, no overhead is assumed.

Figure 3–4 Amdahl's Law Speedup Curve

The graph shows the ideal speedups for a program containing 0%, 2%, 5%, and 10% sequential portions, assuming no overhead. The x-axis measures the number of processors and the y-axis measures the speedup; the greatest speedup occurs for the program that has no sequential portion.

3.5.1.1 Overheads

Once the overheads are incorporated in the model, the speedup curves change dramatically. For purposes of illustration, assume that the overhead consists of two parts: a fixed part which is independent of the number of processors, and a non-fixed part that grows quadratically with the number of processors used:

1/S = F + (1 - F)/P + K1 + K2*P^2

In this equation, K1 and K2 are fixed factors. Under these assumptions, the speedup curve is shown in the following figure. It is interesting to note that in this case the speedup peaks: after a certain point, adding more processors is detrimental to performance, as shown in the following figure.
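For instance, taking F = 0 (a fully parallel loop) and illustrative constants K1 = 0.01 and K2 = 0.001, the quantity 1/P + 0.01 + 0.001*P^2 is smallest at about P = 8, where the speedup is roughly 5; beyond that point the quadratic overhead term dominates and the speedup declines.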

Figure 3–5 Speedup Curve With Overheads

The graph shows that all programs reach their greatest speedup at five processors and then lose this benefit as up to eight processors are added. The x-axis measures the number of processors and the y-axis measures the speedup.

3.5.1.2 Gustafson’s Law

Amdahl's Law can be misleading for predicting parallel speedups in real problems. The fraction of time spent in sequential sections of the program sometimes depends on the problem size. That is, by scaling the problem size, you may improve the chances of speedup. The following example demonstrates this.


Example 3–12 Scaling the Problem Size May Improve Chances of Speedup


/*
* initialize the arrays
*/
for (i=0; i < n; i++) {
    for (j=0; j < n; j++) {
            a[i][j] = 0.0;
            b[i][j] = ...
            c[i][j] = ...
    }
}
/*
* matrix multiply
*/
for (i=0; i < n; i++) {
    for(j=0; j < n; j++) {
            for (k=0; k < n; k++) {
                a[i][j] = a[i][j] + b[i][k]*c[k][j];
            }
    }
}

Assume an ideal overhead of zero and assume that only the second loop nest is executed in parallel. For small problem sizes (that is, small values of n), the sequential and parallel parts of the program take comparable amounts of time. However, as n grows larger, the time spent in the parallel part of the program grows faster than the time spent in the sequential part. For this problem, it is beneficial to increase the number of processors as the problem size increases.
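To make this concrete, the initialization loop performs on the order of 3n^2 assignments, while the matrix multiply performs on the order of n^3 operations. For n = 100 that is roughly 30,000 sequential operations against 1,000,000 parallelizable ones (a sequential fraction of about 3%), whereas for n = 1000 the sequential fraction drops to about 0.3%.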

3.6 Load Balance and Loop Scheduling

Loop scheduling is the process of distributing iterations of a parallel loop to multiple threads. In order to maximize the speedup, it is important that the work be distributed evenly among the threads while not imposing too much overhead. The compiler offers several types of scheduling for different situations.

3.6.1 Static or Chunk Scheduling

It is beneficial to divide the work evenly among the different threads on the system when the work performed by different iterations of a loop is the same. This approach is known as static scheduling.


Example 3–13 A Good Loop for Static Scheduling


for (i=1; i < 1000; i++) {
    sum += a[i]*b[i];       /* S1 */
}

Under static or chunk scheduling, each thread gets the same number of iterations. If there were 4 threads, then in the above example each thread would get about 250 iterations. Provided there are no interruptions and each thread progresses at the same rate, all the threads complete at the same time.

3.6.2 Self Scheduling

Static scheduling does not, in general, achieve good load balance when the work performed by each iteration varies. In static scheduling, each thread is assigned an equal-sized chunk of iterations; upon completion of its chunk, each thread except the master thread waits to participate in the next parallel loop execution, while the master thread continues execution of the program. In self scheduling, each thread grabs a different small chunk of iterations and, after completing its assigned chunk, tries to acquire more chunks from the same loop.

3.6.3 Guided Self Scheduling

In guided self scheduling (GSS), each thread gets successively smaller chunks. In cases where the size of each iteration varies, GSS can help balance the load.

3.7 Loop Transformations

The compiler performs several loop restructuring transformations to help improve the parallelization of loops in programs. Some of these transformations can also improve the single-processor execution of loops. The transformations performed by the compiler are described below.

3.7.1 Loop Distribution

Often loops contain a few statements that cannot be executed in parallel and many statements that can. Loop distribution attempts to move the sequential statements into a separate loop and gather the parallelizable statements into a different loop. This is illustrated in the following example:


Example 3–14 A Candidate for Loop Distribution


for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];               /* S1 */
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
    y[i] = z[i] - x[i];                    /* S3 */
}

Assuming that arrays x, y, w, a, and z do not overlap, statements S1 and S3 can be parallelized but statement S2 cannot be. Here is how the loop looks after it is split or distributed into two different loops:


Example 3–15 The Distributed Loop


/* L1: parallel loop */
for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];              /* S1 */
    y[i] = z[i] - x[i];                   /* S3 */
}
/* L2: sequential loop */
for (i=0; i < n; i++) {
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
}

After this transformation, loop L1 does not contain any statements that prevent the parallelization of the loop and may be executed in parallel. Loop L2, however, still has a non-parallelizable statement from the original loop.

Loop distribution is not always profitable or safe to perform. The compiler performs analysis to determine the safety and profitability of distribution.

3.7.2 Loop Fusion

If the granularity of a loop, or the work performed by a loop, is small, the performance gain from distribution may be insignificant. This is because the overhead of parallel loop start-up is too high compared to the loop workload. In such situations, the compiler uses loop fusion to combine several loops into a single parallel loop, and thus increase the granularity of the loop. Loop fusion is easy and safe when loops with identical trip counts are adjacent to each other. Consider the following example:


Example 3–16 Loops With Small Work Loads


/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];        /* S1 */
}
/* L2: another short parallel loop */
for (i=0; i < 100; i++) {
    b[i] = a[i] * d[i];        /* S2 */
}

The two short parallel loops are next to each other, and can be safely combined as follows:


Example 3–17 The Two Loops Fused


/* L3: a larger parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
    b[i] = a[i] * d[i];       /* S2 */
}

The new loop incurs half the parallel loop execution overhead. Loop fusion can also help in other ways; for example, if the same data is referenced in two loops, then combining them can improve the locality of reference.

However, loop fusion is not always safe to perform. If loop fusion creates a data dependence that did not exist before then the fusion may result in incorrect execution. Consider the following example:


Example 3–18 Unsafe Fusion Candidates


/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];      /* S1 */
}
/* L2: a short loop with data dependence */
for (i=0; i < 100; i++) {
    a[i+1] = a[i] * d[i];    /* S2 */
}

If the loops in Example 3–18 are fused, a data dependence is created from statement S2 to statement S1: the value of a[i] on the right-hand side of statement S1 would then be computed in statement S2, which does not happen if the loops are kept separate. The compiler performs safety and profitability analysis to determine whether loop fusion should be done. Often, the compiler can fuse an arbitrary number of loops. Increasing the granularity in this manner can sometimes make a loop profitable enough to parallelize.

3.7.3 Loop Interchange

It is generally more profitable to parallelize the outermost loop in a nest of loops, since the overheads incurred are small. However, it is not always safe to parallelize the outermost loops due to dependences that might be carried by such loops. This is illustrated in the following:


Example 3–19 Nested Loop That Cannot Be Parallelized


for (i=0; i <n; i++) {
    for (j=0; j <n; j++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

In this example, the loop with the index variable i cannot be parallelized, because of a dependency between two successive iterations of the loop. The two loops can be interchanged and the parallel loop (the j-loop) becomes the outer loop:


Example 3–20 The Loops Interchanged


for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
            a[j][i+1] = 2.0*a[j][i-1];
    }
}

The resulting loop incurs an overhead of parallel work distribution only once, while previously, the overhead was incurred n times. The compiler performs safety and profitability analysis to determine whether to perform loop interchange.

3.8 Aliasing and Parallelization

ISO C aliasing can often prevent loops from getting parallelized. Aliasing occurs when there are two possible references to the same memory location. Consider the following example:


Example 3–21 A Loop With Two References to the Same Memory Location


void copy(float a[], float b[], int n) {
    int i;
    for (i=0; i < n; i++) {
            a[i] = b[i]; /* S1 */
    }
}

Since the variables a and b are parameters, it is possible that a and b point to overlapping regions of memory, for example, if copy were called as follows:


copy (&x[10], &x[11], 20);

In the called routine, two successive iterations of the copy loop may be reading and writing the same element of the array x. However, if the routine copy were called as follows, then there is no possibility of overlap in any of the 20 iterations of the loop:


copy (&x[10], &x[40], 20);

In general, it is not possible for the compiler to analyze this situation correctly without knowing how the routine is called. The compiler provides a keyword extension to ISO C that lets you convey this kind of aliasing information. See 3.8.2 Restricted Pointers for more information.

3.8.1 Array and Pointer References

Part of the aliasing problem is that the C language allows arrays to be referenced and defined through pointer arithmetic. In order for the compiler to effectively parallelize loops, either automatically or explicitly with pragmas, all data that is laid out as an array must be referenced using C array reference syntax and not pointers. If pointer syntax is used, the compiler cannot determine the relationship of the data between different iterations of a loop, so it is conservative and does not parallelize the loop.
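The following sketch, with illustrative names, contrasts the two styles by writing the same loop twice, once with array reference syntax and once with pointer arithmetic:


void vector_add(float a[], const float b[], const float c[], int n)
{
    int i;
    float *pa;
    const float *pb, *pc;

    /* Array reference syntax: the per-iteration accesses are explicit,
     * so the loop is a candidate for parallelization. */
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];

    /* Equivalent pointer-arithmetic form: the compiler treats the
     * accesses conservatively and does not parallelize the loop. */
    for (pa = a, pb = b, pc = c; pa < a + n; pa++, pb++, pc++)
        *pa = *pb + *pc;
}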

3.8.2 Restricted Pointers

In order for a compiler to effectively perform parallel execution of a loop, it needs to determine if certain lvalues designate distinct regions of storage. Aliases are lvalues whose regions of storage are not distinct. Determining if two pointers to objects are aliases is a difficult and time consuming process because it could require analysis of the entire program. Consider function vsq() below:


Example 3–22 A Loop With Two Pointers


void vsq(int n, double * a, double * b) {
    int i;
    for (i=0; i<n; i++) {
            b[i] = a[i] * a[i];
    }
}

The compiler can parallelize the execution of the different iterations of the loop if it knows that pointers a and b access different objects. If there is an overlap in the objects accessed through pointers a and b, then it would be unsafe for the compiler to execute the loop in parallel. The compiler cannot tell at compile time whether the objects accessed by a and b overlap simply by analyzing the function vsq(); it may need to analyze the whole program to get this information.

Restricted pointers are used to specify pointers which designate distinct objects so that the compiler can perform pointer alias analysis. The following is an example of function vsq() in which function parameters are declared as restricted pointers:


void vsq(int n, double * restrict a, double * restrict b)

Pointers a and b are declared as restricted pointers, so the compiler knows that a and b point to distinct regions of storage. With this alias information, the compiler is able to parallelize the loop.

The keyword restrict is a type qualifier, like volatile, and it can qualify only pointer types. restrict is recognized as a keyword when you use -xc99=all (except with -Xs). There are situations in which you may not want to change the source code. In such cases, you can specify that pointer-valued function parameters be treated as restricted pointers by using the following command-line option:


-xrestrict=[func1,…,funcn]

If a function list is specified, then pointer parameters in the specified functions are treated as restricted; otherwise, all pointer parameters in the entire C file are treated as restricted. For example, -xrestrict=vsq qualifies the pointers a and b in the first version of the function vsq() with the keyword restrict.
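For example, assuming the function is in a file named vsq.c (an illustrative file name), a compile line such as the following would combine automatic parallelization with the option:


% cc -fast -xO4 -xautopar -xrestrict=vsq -c vsq.c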

It is critical that you use restrict correctly. If pointers qualified as restricted pointers point to objects that are not distinct, the compiler can incorrectly parallelize loops, resulting in undefined behavior. For example, assume that pointers a and b of function vsq() point to objects that overlap, such that b[i] and a[i+1] are the same object. If a and b are not declared as restricted pointers, the loop is executed serially. If a and b are incorrectly qualified as restricted pointers, the compiler may parallelize the execution of the loop, which is not safe, because b[i+1] should only be computed after b[i] is computed.
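A minimal sketch of such an overlapping call, using an illustrative array x, is:


double x[1001];

vsq(1000, x, x + 1);  /* b[i] and a[i+1] both refer to x[i+1] */

Declaring a and b as restrict in vsq() would be incorrect for this call.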

3.8.3 Explicit Parallelization and Pragmas

Often, there is not enough information available for the compiler to make a decision on the legality or profitability of parallelization. The compiler supports pragmas that allow the programmer to effectively parallelize loops that would otherwise be too difficult or impossible for the compiler to handle. The Sun-specific MP pragmas detailed in the rest of this section have been deprecated in favor of the OpenMP standard. See the OpenMP API User’s Guide for information on migrating to the directives of the standard.

3.8.3.1 Serial Pragmas


Note –

The Sun-specific MP pragmas have been deprecated and are no longer supported. However, the compiler supports the APIs specified by the OpenMP 2.5 standard instead. See the OpenMP API User’s Guide for information on migrating to the directives of the standard.


There are two serial pragmas, and both apply to for loops:

The #pragma MP serial_loop pragma indicates to the compiler that the next for loop is not to be automatically parallelized.

The #pragma MP serial_loop_nested pragma indicates to the compiler that the next for loop and any for loops nested within the scope of this for loop are not to be automatically parallelized.

The scope of these pragmas begins with the pragma and ends with whichever of the following occurs first: the beginning of the next block, the next for loop within the current block, or the end of the current block.

3.8.3.2 Parallel Pragma


Note –

The Sun-specific MP pragmas have been deprecated and are no longer supported. However, the compiler supports the APIs specified by the OpenMP 2.5 standard instead. See the OpenMP API User’s Guide for information on migrating to the directives of the standard.


There is one parallel pragma: #pragma MP taskloop [options].

The MP taskloop pragma can optionally take one or more of the following arguments: maxcpus, private, shared, readonly, storeback, savelast, reduction, and schedtype. Each of these is described in the sections that follow.

The scope of these pragmas begins with the pragma and ends with whichever of the following occurs first: the beginning of the next block, the next for loop within the current block, or the end of the current block. The pragma applies to the next for loop prior to the end of the pragma's scope.

Only one option can be specified per MP taskloop pragma; however, the pragmas are cumulative and apply to the next for loop within their scope:


#pragma MP taskloop maxcpus(4)
#pragma MP taskloop shared(a,b)
#pragma MP taskloop storeback(x)

These options may appear multiple times prior to the for loop to which they apply. In case of conflicting options, the compiler will issue a warning message.

Nesting of for Loops

An MP taskloop pragma applies to the next for loop within the current block. Parallelized C does not support nesting of parallelized for loops.

Eligibility for Parallelizing

An MP taskloop pragma suggests to the compiler that, unless otherwise disallowed, the specified for loop should be parallelized.

Any for loop with irregular control flow or an unknown loop iteration increment is ineligible for parallelization. For example, for loops containing setjmp, longjmp, exit, abort, return, goto, labels, or break should not be considered candidates for parallelization.

It is particularly important to note that for loops with inter-iteration dependencies can be eligible for explicit parallelization. This means that if an MP taskloop pragma is specified for such a loop, the compiler simply honors it, unless the for loop is disqualified. It is the user’s responsibility to make sure that such explicit parallelization does not lead to incorrect results.

If both a serial_loop (or serial_loop_nested) pragma and a taskloop pragma are specified for a for loop, the last one specified prevails.

Consider the following example:


#pragma MP serial_loop_nested
for (i=0; i<100; i++) {
    #pragma MP taskloop
    for (j=0; j<1000; j++) {
        ...
    }
}

The i loop will not be parallelized but the j loop might be.

Number of Processors

#pragma MP taskloop maxcpus (number_of_processors) specifies the number of processors to be used for this loop, if possible.

The value of maxcpus must be a positive integer. If maxcpus equals 1, then the specified loop will be executed in serial. (Note that setting maxcpus to be 1 is equivalent to specifying the serial_loop pragma.) The smaller of the values of maxcpus or the interpreted value of the PARALLEL environment variable will be used. When the environment variable PARALLEL is not specified, it is interpreted as having the value 1.

If more than one maxcpus pragma is specified for a for loop, the last one specified will prevail.

Classifying Variables

A variable used in a loop is classified as being either a private, shared, reduction, or readonly variable. The variable belongs to only one of these classifications. A variable can only be classified as a reduction or readonly variable through an explicit pragma. See #pragma MP taskloop reduction and #pragma MP taskloop readonly. A variable can be classified as being either a private or shared variable through an explicit pragma or through the following default scoping rules.

Default Scoping Rules for private and shared Variables

A private variable is one whose value is private to each processor processing some iterations of a for loop. In other words, the value assigned to a private variable in one iteration of a for loop is not propagated to other processors processing other iterations of that for loop. A shared variable, on the other hand, is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop. For loops explicitly parallelized with #pragma MP taskloop directives that contain references to shared variables, you must ensure that such sharing of values does not cause any correctness problems (such as race conditions). The compiler provides no synchronization on updates and accesses to shared variables in an explicitly parallelized loop.

In analyzing explicitly parallelized loops, the compiler uses the following “default scoping rules” to determine whether a variable is private or shared:

It is highly recommended that all variables used in an explicitly parallelized for loop be explicitly classified as one of shared, private, reduction, or readonly, to avoid the “default scoping rules.”

Since the compiler does not perform any synchronization on accesses to shared variables, extreme care must be exercised before using an MP taskloop pragma for a loop that contains, for example, array references. If inter-iteration data dependencies exist in such an explicitly parallelized loop, then its parallel execution may give erroneous results. The compiler may or may not be able to detect such a potential problem situation and issue a warning message. In any case, the compiler will not disable the explicit parallelization of loops with potential shared variable problems.

private Variables

#pragma MP taskloop private (list_of_private_variables)

Use this pragma to specify all the variables that should be treated as private variables for this loop. All other variables used in the loop that are not explicitly specified as shared, readonly, or reduction variables, are either shared or private as defined by the default scoping rules.

A private variable is one whose value is private to each processor processing some iterations of a loop. In other words, the value assigned to a private variable by one of the processors working on iterations of a loop is not propagated to other processors processing other iterations of that loop. A private variable has no initial value at the start of each iteration of a loop and must be set to a value within the iteration of a loop prior to its first use within that iteration. Execution of a program with a loop containing an explicitly declared private variable whose value is used prior to being set will result in undefined behavior.
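For illustration, the loop from Example 3–4 could be explicitly parallelized (using the deprecated MP pragmas) with t declared private, since t is set before it is used in every iteration:


#pragma MP taskloop private(t)
for (i=1; i < 1000; i++) {
    t = 2 * a[i];           /* t is assigned before its first use */
    b[i] = t;
}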

shared Variables

#pragma MP taskloop shared (list_of_shared_variables)

Use this pragma to specify all the variables that should be treated as shared variables for this loop. All other variables used in the loop that are not explicitly specified as private, readonly, storeback or reduction variables, are either shared or private as defined by the default scoping rules.

A shared variable is a variable whose current value is accessible by all processors processing iterations of a for loop. The value assigned to a shared variable by one processor working on iterations of a loop may be seen by other processors working on other iterations of the loop.

readonly Variables

#pragma MP taskloop readonly (list_of_readonly_variables)

readonly variables are a special class of shared variables that are not modified in any iteration of a loop. Use this pragma to indicate to the compiler that it may use a separate copy of that variable’s value for each processor processing iterations of the loop.
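For example, a scale factor that is read but never written in the loop could be declared readonly; the variable name scale here is illustrative:


#pragma MP taskloop readonly(scale)
for (i=1; i < 1000; i++) {
    b[i] = scale * a[i];    /* scale is never modified in the loop */
}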

storeback Variables

#pragma MP taskloop storeback (list_of_storeback_variables)

Use this pragma to specify all the variables to be treated as storeback variables.

A storeback variable is one whose value is computed in a loop and then used after the termination of the loop. The last loop iteration values of storeback variables are available for use after the loop terminates. Such a variable is a good candidate to be declared explicitly as a storeback variable via this directive when it is a private variable, whether declared private explicitly or by the default scoping rules.

Note that the storeback operation for a storeback variable occurs at the last iteration of the explicitly parallelized loop, regardless of whether or not that iteration updates the value of the storeback variable. In other words, the processor that processes the last iteration of a loop may not be the same processor that currently contains the last updated value for a storeback variable. Consider the following example:


#pragma MP taskloop private(x)
#pragma MP taskloop storeback(x)
for (i=1; i <= n; i++) {
    if (...) {
        x = ...
    }
}
printf("%d", x);

In the previous example, the value of the storeback variable x printed by the printf() call may not be the same as that printed by a serial version of the i loop. In the explicitly parallelized case, the processor that processes the last iteration of the loop (when i==n) and performs the storeback operation for x may not be the same processor that currently contains the last updated value of x. The compiler attempts to issue a warning message to alert the user to such potential problems.

In an explicitly parallelized loop, variables referenced as arrays are not treated as storeback variables. Hence it is important to include them in the list_of_storeback_variables if such storeback operation is desired (for example, if the variables referenced as arrays have been declared as private variables).

savelast

#pragma MP taskloop savelast

Use this pragma to specify that all the private variables of the loop are to be treated as storeback variables. It is often convenient to use this form rather than listing each private variable of the loop in a storeback pragma.

reduction Variables

#pragma MP taskloop reduction (list_of_reduction_variables) specifies that all the variables appearing in the reduction list are to be treated as reduction variables for the loop. A reduction variable is one whose partial values can be individually computed by each of the processors processing iterations of the loop, and whose final value can be computed from all of its partial values. The presence of a list of reduction variables helps the compiler identify the loop as a reduction loop, allowing it to generate parallel reduction code. Consider the following example:


#pragma MP taskloop reduction(x)
    for (i=0; i<n; i++) {
         x = x + a[i];
}

In this example, the variable x is a (sum) reduction variable and the i loop is a (sum) reduction loop.

Scheduling Control

The Sun ISO C compiler supports several pragmas that can be used in conjunction with the taskloop pragma to control the loop scheduling strategy for a given loop. The syntax for this pragma is:

#pragma MP taskloop schedtype (scheduling_type)

This pragma can be used to specify the scheduling_type to be used to schedule the parallelized loop. Scheduling_type can be one of static, self, gss, or factoring. The first three correspond to the strategies described in 3.6 Load Balance and Loop Scheduling; as the examples below show, self, gss, and factoring take a chunk-size argument.


#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(static)
    for (i=0; i<1000; i++) {
...
}

In the above example, each of the four processors will process 250 iterations of the loop.


#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(self(120))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

120, 120, 120, 120, 120, 120, 120, 120, 40.


#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(gss(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

250, 188, 141, 106, 79, 59, 45, 33, 25, 19, 14, 11, 10, 10, 10.


#pragma MP taskloop maxcpus(4)
#pragma MP taskloop schedtype(factoring(10))
for (i=0; i<1000; i++) {
...
}

In the above example, the number of iterations of the loop assigned to each participating processor, in order of work request, are:

125, 125, 125, 125, 62, 62, 62, 62, 32, 32, 32, 32, 16, 16, 16, 16, 10, 10, 10, 10, 10, 10