DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar. The general criteria for automatic parallelization are:
Only explicit DO loops and implicit loops, such as IF loops and Fortran 95 array syntax are parallelization candidates.
The values of array variables for each iteration of the loop must not depend on the values of array variables for any other iteration of the loop.
Calculations within the loop must not conditionally change any pure scalar variable that is referenced after the loop terminates.
Calculations within the loop must not change a scalar variable across iterations. This is called a loop-carried dependence.
The amount of work within the body of the loop must outweigh the overhead of parallelization.
The compilers may automatically eliminate a reference that appears to create a data dependence in the loop. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.
Example: Using -autopar, with dependencies eliminated by private arrays:
parameter (n=1000) real a(n), b(n), c(n,n) do i = 1, 1000 <--Parallelized do k = 1, n a(k) = b(k) + 2.0 end do do j = 1, n-1 c(i,j) = a(j+1) + 2.3 end do end do end |
In the example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a appear to result in a data dependence, the compiler generates temporary private copies of the array to make the outer loop iterations independent.
Under automatic parallelization, the compilers do not parallelize a loop if:
The DO loop is nested inside another DO loop that is parallelized
Flow control allows jumping out of the DO loop
A user-level subprogram is invoked inside the loop
An I/O statement is in the loop
Calculations within the loop change an aliased scalar variable
In a multithreaded, multiprocessor environment, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each thread. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.