Oracle Solaris Studio 12.2: C User's Guide

3.5 Speedups

If the compiler does not parallelize a portion of a program where a significant amount of time is spent, then no speedup occurs. This is a consequence of Amdahl’s Law. For example, if a loop that accounts for five percent of the execution time of a program is parallelized, then the overall speedup is limited to about five percent. Depending on the size of the workload and the parallel execution overheads, there may be no improvement at all.

As a general rule, the larger the fraction of program execution that is parallelized, the greater the likelihood of a speedup.

Each parallel loop incurs a small overhead during start-up and shutdown. The start-up overhead includes the cost of work distribution, and the shutdown overhead includes the cost of the barrier synchronization. If the total amount of work performed by the loop is not large enough, no speedup occurs, and the loop might even slow down. So if a large share of program execution is accounted for by a large number of short parallel loops, the whole program may slow down instead of speeding up.
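One way to keep short loops from paying the parallel start-up and barrier costs is to parallelize a loop only when its trip count is large. The following is a minimal sketch that uses the OpenMP if clause for this purpose. The function name scale and the cutoff PAR_THRESHOLD are hypothetical and chosen for illustration only; a suitable cutoff has to be measured for a given machine and loop body.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff: at or below this trip count the loop runs serially.
 * The right value depends on the machine and the loop body. */
#define PAR_THRESHOLD 10000

/* Scale x[0..n-1] by alpha. The if clause requests parallel execution
 * only when n is large enough to repay the start-up and barrier costs.
 * If OpenMP is not enabled at compile time, the pragma is ignored and
 * the loop runs serially. */
void scale(double *x, size_t n, double alpha)
{
    long i;

    assert(x != NULL || n == 0);
#pragma omp parallel for if (n > PAR_THRESHOLD)
    for (i = 0; i < (long)n; i++) {
        x[i] *= alpha;
    }
}
```

The result is the same whether the loop runs serially or in parallel; only the execution time differs.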

The compiler performs several loop transformations, including loop interchange and loop fusion, that try to increase the granularity of the loops. In general, if the amount of parallelism in a program is small or is fragmented among small parallel regions, then the speedup is smaller.
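As an illustration of how fusion increases granularity, the following sketch shows the same work written as two short loops and as one fused loop. The function names are hypothetical, and the code assumes that the arrays a, b, and c do not overlap; it shows the idea by hand and does not describe what the compiler emits.

```c
#include <assert.h>
#include <stddef.h>

/* Before fusion: two short loops. If each were parallelized separately,
 * each would pay its own start-up and barrier cost. */
void update_separate(double *a, double *b, const double *c, size_t n)
{
    size_t i;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (i = 0; i < n; i++) {
        a[i] = c[i] + 1.0;
    }
    for (i = 0; i < n; i++) {
        b[i] = c[i] * 2.0;
    }
}

/* After fusion: one loop with twice the work per iteration, so a single
 * start-up and barrier cost is spread over more computation.
 * Assumes a, b, and c do not overlap. */
void update_fused(double *a, double *b, const double *c, size_t n)
{
    size_t i;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    for (i = 0; i < n; i++) {
        a[i] = c[i] + 1.0;
        b[i] = c[i] * 2.0;
    }
}
```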

Often, scaling up the problem size improves the fraction of parallelism in a program. For example, consider a problem that consists of two parts: a quadratic part that is sequential, and a cubic part that is parallelizable. For this problem, the parallel part of the workload grows faster than the sequential part. So at some point the problem speeds up well, unless it runs into resource limitations.

It is beneficial to experiment with tuning, directives, problem sizes, and program restructuring in order to get the most benefit from parallel C.

3.5.1 Amdahl’s Law

Fixed problem-size speedup is generally governed by Amdahl’s Law, which says that the parallel speedup of a given problem is limited by the sequential portion of the problem. The following equation describes the speedup S of a problem where F is the fraction of time spent in the sequential region, and the remaining fraction of the time is divided uniformly among P processors. As P grows, the second term of the equation drops toward zero, and the total speedup is bounded by the first term, which remains fixed (so S can never exceed 1/F).

$$\frac{1}{S} = F + \frac{1 - F}{P}$$
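The equation can be evaluated directly. The following is a minimal sketch; the function names amdahl_speedup and amdahl_limit are hypothetical and introduced here only for illustration.

```c
#include <assert.h>

/* Speedup S predicted by Amdahl's Law, 1/S = F + (1 - F)/P.
 *   f: fraction of time spent in the sequential region, 0 <= f <= 1
 *   p: number of processors, p >= 1
 * No overhead is modeled. */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(p >= 1);
    return 1.0 / (f + (1.0 - f) / (double)p);
}

/* Upper bound on S as P grows without limit: the second term vanishes
 * and S approaches 1/F. Requires f > 0, since the bound is infinite
 * when there is no sequential portion. */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}
```

For example, with a 10% sequential portion (F = 0.10) and eight processors, the predicted speedup is about 4.7. If only a loop that accounts for five percent of the execution time is parallelized, then F = 0.95 and the speedup can never exceed 1/0.95, or about 1.05.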

The following figure illustrates this concept diagrammatically. The darkly shaded portion represents the sequential part of the program, and remains constant for one, two, four, and eight processors, while the lightly shaded portion represents the parallel portion of the program that can be divided uniformly among an arbitrary number of processors.

Figure 3-3 Fixed Problem Speedups

As the number of processors increases, the amount of time required for the parallel portion of each program decreases whereas the serial portion of each program stays the same.

In reality, however, you may incur overheads due to communication and the distribution of work to multiple processors. These overheads may or may not be fixed as the number of processors changes.

The following figure illustrates the ideal speedups for a program containing 0%, 2%, 5%, and 10% sequential portions. Here, no overhead is assumed.

Figure 3-4 Amdahl's Law Speedup Curve

The graph shows the ideal speedups for programs containing 0%, 2%, 5%, and 10% sequential portions, assuming no overhead. The x-axis measures the number of processors and the y-axis measures the speedup. The greatest speedup occurs with the program that has no sequential portion.

Overheads

Once the overheads are incorporated in the model, the speedup curves change dramatically. For the purposes of illustration, assume that the overheads consist of two parts: a fixed part that is independent of the number of processors, and a non-fixed part that grows quadratically with the number of processors used:

$$\frac{1}{S} = F + \frac{1 - F}{P} + K_1 + K_2 P^2$$

In this equation, K1 and K2 are fixed factors. Under these assumptions, the speedups peak: after a certain point, adding more processors is detrimental to performance, as shown in the following figure.
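The overhead model can also be explored numerically. The following sketch treats the two overhead terms as additions to the time terms of Amdahl's equation, that is, 1/S = F + (1 - F)/P + K1 + K2 P^2. The function names are hypothetical, and any values passed for K1 and K2 are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Speedup S under the overhead model
 *   1/S = F + (1 - F)/P + K1 + K2 * P^2
 *   f:  fraction of time spent in the sequential region
 *   p:  number of processors, p >= 1
 *   k1: fixed overhead, independent of p
 *   k2: coefficient of the overhead that grows quadratically with p */
double speedup_with_overhead(double f, int p, double k1, double k2)
{
    assert(f >= 0.0 && f <= 1.0);
    assert(p >= 1);
    assert(k1 >= 0.0 && k2 >= 0.0);
    return 1.0 / (f + (1.0 - f) / (double)p + k1 + k2 * (double)p * (double)p);
}

/* Return the processor count in 1..max_p that gives the largest
 * speedup under the model. Ties go to the smaller count. */
int best_processor_count(double f, int max_p, double k1, double k2)
{
    int p;
    int best = 1;
    double best_s;

    assert(max_p >= 1);
    best_s = speedup_with_overhead(f, 1, k1, k2);
    for (p = 2; p <= max_p; p++) {
        double s = speedup_with_overhead(f, p, k1, k2);
        if (s > best_s) {
            best_s = s;
            best = p;
        }
    }
    return best;
}
```

With the illustrative values F = 0.05, K1 = 0, and K2 = 0.005 (chosen for this sketch, not taken from the figure), the model peaks at five processors out of eight. With both overhead factors set to zero, the model reduces to Amdahl's Law and the best count is always the largest one available.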

Figure 3-5 Speedup Curve With Overheads

The graph shows that all programs reach the greatest speedup at five processors and then lose this benefit as up to eight processors are added. The x-axis measures the number of processors and the y-axis measures the speedup.

Gustafson’s Law

Amdahl’s Law can be misleading for predicting parallel speedups in real problems. The fraction of time spent in the sequential sections of a program sometimes depends on the problem size. That is, by scaling up the problem size, you may improve the chances of speedup. The following example demonstrates this.

Example 3-12 Scaling the Problem Size May Improve Chances of Speedup

/* initialize the arrays */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        a[i][j] = 0.0;
        b[i][j] = ...
        c[i][j] = ...
    }
}
/* matrix multiply: accumulate into a[i][j], which was zeroed above */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            a[i][j] += b[i][k] * c[k][j];
        }
    }
}

Assume an ideal overhead of zero and assume that only the second loop nest is executed in parallel. For small problem sizes (that is, small values of n), the time spent in the sequential and parallel parts of the program is comparable. However, as n grows larger, the time spent in the parallel part grows faster than the time spent in the sequential part: the initialization does work on the order of n^2, while the matrix multiply does work on the order of n^3. For this problem, it is beneficial to increase the number of processors as the problem size increases.
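The effect of problem size on the sequential fraction can be estimated with a simple work model. The following sketch assumes that the initialization costs n^2 units of work, that the matrix multiply costs n^3 units, and that each unit takes the same time. These are simplifying assumptions, and the function names are hypothetical.

```c
#include <assert.h>

/* Fraction of the total work that is sequential for problem size n,
 * assuming n^2 units of sequential work (initialization) and n^3 units
 * of parallelizable work (matrix multiply). Simplifies to 1/(1 + n). */
double sequential_fraction(double n)
{
    assert(n > 0.0);
    return (n * n) / (n * n + n * n * n);
}

/* Ideal speedup on p processors for problem size n: Amdahl's equation
 * with F taken from the work model above and zero overhead. */
double scaled_speedup(double n, int p)
{
    double f = sequential_fraction(n);

    assert(p >= 1);
    return 1.0 / (f + (1.0 - f) / (double)p);
}
```

Under this model, n = 9 gives a sequential fraction of 0.1 and a speedup of about 4.7 on eight processors, while n = 999 gives a sequential fraction of 0.001 and a speedup of about 7.9.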