Loop Parallelization (Fortran Programming Guide)

Loop Parallelization

Using dependency analysis, the compiler transforms a DO loop into a parallelizable task. It may restructure the loop to split out sections that cannot be parallelized; those sections run serially. It then distributes the remaining work evenly over the available processors, so that each processor executes a different chunk of iterations.
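The kind of restructuring described above can be pictured by hand. The following is only a sketch under stated assumptions: the program name, array names, and the particular split are hypothetical, and the compiler's actual transformation may differ. It shows a loop that mixes an independent statement with a recurrence, then the same work split into one loop that could be parallelized and one that must stay serial.

```fortran
program split_loop
  implicit none
  integer, parameter :: n = 1000
  integer :: b(n), c(n)
  integer :: a1(n), a2(n), s1(0:n), s2(0:n)
  integer :: i

  ! Hypothetical input data, chosen only for illustration.
  do i = 1, n
     b(i) = i
     c(i) = 2 * i
  end do

  ! Original loop: the first statement is independent across iterations,
  ! the second is a recurrence (s1(i) needs s1(i-1)).
  s1(0) = 0
  do i = 1, n
     a1(i) = b(i) * c(i)
     s1(i) = s1(i-1) + b(i)
  end do

  ! Restructured by hand: this loop has no cross-iteration dependence,
  ! so it is a candidate for parallel execution.
  do i = 1, n
     a2(i) = b(i) * c(i)
  end do

  ! The recurrence is split out and remains a serial loop.
  s2(0) = 0
  do i = 1, n
     s2(i) = s2(i-1) + b(i)
  end do

  ! Both forms must produce identical results.
  if (any(a1 /= a2) .or. any(s1 /= s2)) error stop 'split changed the results'
  print *, 'split loops reproduce the original results'
end program split_loop
```

The final check only confirms that the hand-written split preserves the results; it says nothing about which loops a given compiler will actually parallelize.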

For example, with four CPUs and a parallelized loop with 1000 iterations:

Processor 1 executing iterations     1 through  250
Processor 2 executing iterations   251 through  500
Processor 3 executing iterations   501 through  750
Processor 4 executing iterations   751 through 1000
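The arithmetic behind this division can be sketched as follows. This is an illustrative calculation, not the compiler's internal algorithm: the names n_iter, n_proc, and chunk are hypothetical, and the use of ceiling division to handle iteration counts that do not divide evenly is an assumption.

```fortran
program chunk_bounds
  implicit none
  integer, parameter :: n_iter = 1000   ! total loop iterations
  integer, parameter :: n_proc = 4      ! available processors
  integer :: p, chunk, first, last

  ! Ceiling division, so every iteration is covered even when n_iter
  ! is not a multiple of n_proc (an assumption made for this sketch).
  chunk = (n_iter + n_proc - 1) / n_proc

  do p = 1, n_proc
     first = (p - 1) * chunk + 1
     last  = min(p * chunk, n_iter)
     print '(a,i0,a,i0,a,i0)', 'Processor ', p, ': iterations ', first, &
          ' through ', last
  end do
end program chunk_bounds
```

With 1000 iterations and four processors, the ranges printed are the same as those listed above. For counts that do not divide evenly, this scheme gives the last processors a shorter (possibly empty) range.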

Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize the loop. It may also choose not to parallelize a loop if it determines that the performance gain would not justify the overhead.
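The difference between an order-independent loop and one with an inherent data dependency can be demonstrated by running each loop forward and then in reverse. This is a minimal sketch with hypothetical array names; reversing the loop stands in for just one of the many execution orders a parallel run could produce.

```fortran
program order_dependence
  implicit none
  integer, parameter :: n = 8
  integer :: b(n), fwd(n), rev(n), i

  b = [(i, i = 1, n)]

  ! Independent loop: each iteration reads and writes only its own element.
  do i = 1, n
     fwd(i) = 2 * b(i)
  end do
  do i = n, 1, -1
     rev(i) = 2 * b(i)
  end do
  print *, 'independent loop, same result in either order: ', all(fwd == rev)

  ! Recurrence: iteration i reads the element written by iteration i-1,
  ! so the result depends on the order of execution.
  fwd = b
  do i = 2, n
     fwd(i) = fwd(i) + fwd(i-1)
  end do
  rev = b
  do i = n, 2, -1
     rev(i) = rev(i) + rev(i-1)
  end do
  print *, 'recurrence, same result in either order:       ', all(fwd == rev)
end program order_dependence
```

The first comparison reports true and the second false. The second loop is the kind that dependency analysis rejects.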

Note that the compiler always parallelizes loops using a chunk distribution, which simply divides the work in the loop into equal blocks of iterations. Other distribution schemes may be specified using the explicit parallelization directives described later in this chapter.