Oracle Solaris Studio 12.2: C User's Guide

3.7 Loop Transformations

The compiler performs several loop restructuring transformations to improve the parallelization of loops. Some of these transformations can also improve the single-processor execution of loops. The transformations performed by the compiler are described below.

3.7.1 Loop Distribution

Loops often contain a few statements that cannot be executed in parallel alongside many statements that can. Loop distribution moves the sequential statements into a separate loop and gathers the parallelizable statements into another. This is illustrated in the following example:

Example 3-14 A Candidate for Loop Distribution

for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];               /* S1 */
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
    y[i] = z[i] - x[i];                    /* S3 */
}

Assuming that the arrays x, y, w, a, and z do not overlap, statements S1 and S3 can be parallelized, but statement S2 cannot. Here is how the loop looks after it is split, or distributed, into two different loops:

Example 3-15 The Distributed Loop

/* L1: parallel loop */
for (i=0; i < n; i++) {
    x[i] = y[i] + z[i]*w[i];              /* S1 */
    y[i] = z[i] - x[i];                   /* S3 */
}
/* L2: sequential loop */
for (i=0; i < n; i++) {
    a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
}

After this transformation, loop L1 contains no statements that prevent parallelization and can be executed in parallel. Loop L2, however, still contains the non-parallelizable statement from the original loop.

Loop distribution is not always profitable or safe to perform. The compiler performs analysis to determine the safety and profitability of distribution.
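
As a hedged illustration, here is how the distributed loops might be expressed with explicit OpenMP directives; the compiler's automatic parallelizer applies an equivalent transformation internally. The function name is illustrative, the restrict qualifiers encode the assumption that the arrays do not overlap, and the lower bound of L2 is adjusted to 1 so that the a[i-1] reference stays in range for an ordinary array:

#include <omp.h>

void distributed(int n, double *restrict x, double *restrict y,
                 double *restrict z, double *restrict w, double *a)
{
    int i;
    /* L1: iterations are independent, so the loop can run in parallel */
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        x[i] = y[i] + z[i]*w[i];               /* S1 */
        y[i] = z[i] - x[i];                    /* S3 */
    }
    /* L2: each iteration reads a[i-1], written by a previous
       iteration, so the loop must remain sequential */
    for (i = 1; i < n; i++) {
        a[i+1] = (a[i-1] + a[i] + a[i+1])/3.0; /* S2 */
    }
}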

3.7.2 Loop Fusion

If the granularity of a loop, or the work performed by a loop, is small, the performance gain from distribution may be insignificant. This is because the overhead of parallel loop start-up is too high compared to the loop workload. In such situations, the compiler uses loop fusion to combine several loops into a single parallel loop, and thus increase the granularity of the loop. Loop fusion is easy and safe when loops with identical trip counts are adjacent to each other. Consider the following example:

Example 3-16 Loops With Small Work Loads

/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];        /* S1 */
}
/* L2: another short parallel loop */
for (i=0; i < 100; i++) {
    b[i] = a[i] * d[i];        /* S2 */
}

The two short parallel loops are adjacent and can be safely combined as follows:

Example 3-17 The Two Loops Fused

/* L3: a larger parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
    b[i] = a[i] * d[i];       /* S2 */
}

The fused loop incurs half the parallel loop start-up overhead of the original pair. Loop fusion can help in other ways as well. For example, if the same data is referenced in both loops, combining them can improve the locality of reference.
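
To make the overhead difference concrete, here is a hedged sketch using explicit OpenMP directives (the automatic parallelizer creates a similar structure internally). Before fusion, each loop is a separate parallel region and pays its own start-up cost; after fusion, that cost is paid once:

/* Before fusion: two parallel regions, two start-up costs */
#pragma omp parallel for
for (i = 0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
}
#pragma omp parallel for
for (i = 0; i < 100; i++) {
    b[i] = a[i] * d[i];       /* S2 */
}

/* After fusion: one parallel region, one start-up cost */
#pragma omp parallel for
for (i = 0; i < 100; i++) {
    a[i] = a[i] + b[i];       /* S1 */
    b[i] = a[i] * d[i];       /* S2 */
}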

However, loop fusion is not always safe to perform. If loop fusion creates a data dependence that did not exist before, the fusion can result in incorrect execution. Consider the following example:

Example 3-18 Unsafe Fusion Candidates

/* L1: short parallel loop */
for (i=0; i < 100; i++) {
    a[i] = a[i] + b[i];      /* S1 */
}
/* L2: a short loop with data dependence */
for (i=0; i < 100; i++) {
    a[i+1] = a[i] * d[i];    /* S2 */
}

If the loops in Example 3-18 are fused, a data dependence is created from statement S2 to statement S1: the value of a[i] read on the right-hand side of S1 in one iteration is now the value computed by S2 in the previous iteration. If the loops are not fused, this does not happen. The compiler performs safety and profitability analysis to determine whether loop fusion should be done. The compiler can often fuse an arbitrary number of loops. Increasing the granularity in this manner can sometimes make a loop profitable enough to parallelize.
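
The effect can be demonstrated directly. The following hypothetical program runs the unfused loops of Example 3-18 and a hand-fused version on identical inputs and reports where the results diverge; the names and initial values are illustrative only:

#include <stdio.h>

#define N 100

int main(void)
{
    double a1[N+1], a2[N+1], b[N], d[N];
    int i;

    for (i = 0; i < N; i++) {
        b[i] = 1.0;
        d[i] = 2.0;
    }
    for (i = 0; i <= N; i++) {
        a1[i] = a2[i] = 1.0;
    }

    /* Unfused: every S1 executes before any S2 */
    for (i = 0; i < N; i++) {
        a1[i] = a1[i] + b[i];      /* S1 */
    }
    for (i = 0; i < N; i++) {
        a1[i+1] = a1[i] * d[i];    /* S2 */
    }

    /* Fused: S2 in iteration i writes the a[i+1] that S1 reads
       in iteration i+1, a dependence the original loops lacked */
    for (i = 0; i < N; i++) {
        a2[i] = a2[i] + b[i];      /* S1 */
        a2[i+1] = a2[i] * d[i];    /* S2 */
    }

    for (i = 0; i <= N; i++) {
        if (a1[i] != a2[i]) {
            printf("results diverge at i=%d: %g vs %g\n",
                   i, a1[i], a2[i]);
            break;
        }
    }
    return 0;
}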

3.7.3 Loop Interchange

It is generally more profitable to parallelize the outermost loop in a loop nest, since the parallelization overhead is then incurred only once. However, it is not always safe to parallelize the outermost loop, because of dependences that might be carried by that loop. This is illustrated in the following example:

Example 3-19 Nested Loop That Cannot Be Parallelized

for (i=0; i < n; i++) {
    for (j=0; j < n; j++) {
        a[j][i+1] = 2.0*a[j][i-1];
    }
}

In this example, the loop with index variable i cannot be parallelized because of a dependence carried across its iterations: iteration i reads a[j][i-1], which was written two iterations earlier as a[j][i+1]. The two loops can be interchanged so that the parallelizable loop (the j loop) becomes the outer loop:

Example 3-20 The Loops Interchanged

for (j=0; j < n; j++) {
    for (i=0; i < n; i++) {
        a[j][i+1] = 2.0*a[j][i-1];
    }
}

The resulting loop incurs the overhead of parallel work distribution only once, whereas previously it was incurred n times. The compiler performs safety and profitability analysis to determine whether to perform loop interchange.
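
As a minimal sketch, the interchanged loop nest could be parallelized explicitly as follows; the automatic parallelizer generates an equivalent structure. The function name and the C99 variable-length-array parameter are illustrative, and the inner loop starts at i = 1 so that the a[j][i-1] reference stays in range for an ordinary array:

#include <omp.h>

/* Each row j is independent of the others, so the outer j loop
   can run in parallel; the i loop stays serial within a thread
   because of the dependence it carries. */
void interchanged(int n, double a[n][n+1])
{
    int j, i;
    #pragma omp parallel for private(i)
    for (j = 0; j < n; j++) {
        for (i = 1; i < n; i++) {
            a[j][i+1] = 2.0*a[j][i-1];
        }
    }
}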