With the -autopar option, the f95 compiler automatically finds DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.
The compiler’s dependence analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.
For example, with four CPUs and a parallelized loop with 1000 iterations, each thread would execute a chunk of 250 iterations:
| Processor | Iterations Executed |
|---|---|
| Processor 1 | 1 through 250 |
| Processor 2 | 251 through 500 |
| Processor 3 | 501 through 750 |
| Processor 4 | 751 through 1000 |
Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler’s dependence analysis rejects from parallelization those loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.
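For illustration, the first loop in the following sketch carries a dependence between iterations (each iteration reads the result of the previous one) and would be rejected, while the second loop has fully independent iterations and is a candidate:

```
      real x(1000), y(1000)
      y = 1.0                    ! Fortran 95 array assignment
      x(1) = 0.0
      do i = 2, 1000
        x(i) = x(i-1) + y(i)     ! reads the previous iteration's result; not parallelized
      end do
      do i = 1, 1000
        x(i) = 2.0 * y(i)        ! each iteration is independent; a candidate for -autopar
      end do
      end
```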
Note that the compiler always parallelizes loops using static loop scheduling, simply dividing the work in the loop into equal blocks of iterations. Other scheduling schemes can be specified with the explicit parallelization directives described later in this chapter.
A few definitions, from the point of view of automatic parallelization, are needed:
An array is a variable that is declared with at least one dimension.
A pure scalar is a scalar variable that is not aliased—not referenced in an EQUIVALENCE or POINTER statement.
Example: Array/scalar:
```
      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...
```
Both m and a are array variables; s is a pure scalar. The variables u, x, z, and px are scalar variables, but not pure scalars.
DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar. The general criteria for automatic parallelization are:
Only explicit DO loops and implicit loops, such as IF loops and Fortran 95 array syntax, are parallelization candidates.
The values of array variables for each iteration of the loop must not depend on the values of array variables for any other iteration of the loop.
Calculations within the loop must not conditionally change any pure scalar variable that is referenced after the loop terminates. (An illustrative sketch of this case follows the list.)
Calculations within the loop must not change a scalar variable in a cumulative way across iterations. This is called a loop-carried dependence.
The amount of work within the body of the loop must outweigh the overhead of parallelization.
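To illustrate the criterion on conditionally changed scalars, the following sketch conditionally assigns the pure scalars big and ibig inside the loop and then reads them after the loop, so the loop would not be parallelized automatically:

```
      real v(1000), big
      integer ibig
      call random_number(v)
      big = v(1)
      ibig = 1
      do i = 2, 1000
        if (v(i) .gt. big) then  ! conditionally changes pure scalars big and ibig
          big = v(i)
          ibig = i
        end if
      end do
      print *, big, ibig         ! both scalars are referenced after the loop
      end
```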
The compilers may automatically eliminate a reference that appears to create a data dependence in the loop. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.
Example: Using -autopar, with dependencies eliminated by private arrays:
```
      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000             !  <-- Parallelized
        do k = 1, n
          a(k) = b(k) + 2.0
        end do
        do j = 1, n-1
          c(i,j) = a(j+1) + 2.3
        end do
      end do
      end
```
In the example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a appear to result in a data dependence, the compiler generates temporary private copies of the array to make the outer loop iterations independent.
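Conceptually, the privatization gives each thread its own working copy of a. If the same transformation were written by hand with an explicit OpenMP directive (explicit parallelization directives are described later in this chapter), it might look like the following sketch; with -autopar no directive is needed, because the compiler applies the transformation automatically:

```
      parameter (n=1000)
      real a(n), b(n), c(n,n)
!$omp parallel do private(a, j, k)
      do i = 1, 1000
        do k = 1, n
          a(k) = b(k) + 2.0      ! each thread fills its own private copy of a
        end do
        do j = 1, n-1
          c(i,j) = a(j+1) + 2.3
        end do
      end do
      end
```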
Under automatic parallelization, the compilers do not parallelize a loop if any of the following is true (a short illustrative sketch follows the list):
The DO loop is nested inside another DO loop that is parallelized
Flow control allows jumping out of the DO loop
A user-level subprogram is invoked inside the loop
An I/O statement is in the loop
Calculations within the loop change an aliased scalar variable
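For illustration, either of the marked constructs in the following sketch keeps the loop serial:

```
      real v(1000), total
      call random_number(v)
      total = 0.0
      do i = 1, 1000
        if (v(i) .gt. 0.999) exit    ! flow control can jump out of the loop
        write(*,*) i, v(i)           ! an I/O statement inside the loop
        total = total + v(i)
      end do
      print *, total
      end
```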
In a multithreaded, multiprocessor environment, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each thread. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
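For example, in the following sketch the outer loop carries a dependence on x and remains serial, so the analysis moves inward and considers the independent inner loop instead:

```
      real x(1000), y(1000,1000)
      x(1) = 1.0
      do i = 2, 1000
        x(i) = x(i-1) + 1.0          ! loop-carried dependence keeps the outer loop serial
        do j = 1, 1000
          y(j,i) = x(i) * real(j)    ! inner loop iterations are independent
        end do
      end do
      end
```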
A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.
Example: Reduction summation of the elements of a vector:
```
      s = 0.0
      do i = 1, 1000
        s = s + v(i)
      end do
      t(k) = s
```
However, for some operations, if reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.
If a parallelizable loop contains one of the reduction operations listed in Table 10–2, the compiler will parallelize it if -reduction is specified.
The following table lists the reduction operations that are recognized by the compiler.
Table 10–2 Recognized Reduction Operations
| Mathematical Operations | Fortran Statement Templates |
|---|---|
| Sum | s = s + v(i) |
| Product | s = s * v(i) |
| Dot product | s = s + v(i) * u(i) |
| Minimum | s = amin( s, v(i)) |
| Maximum | s = amax( s, v(i)) |
| OR | do i = 1, n <br> b = b .or. v(i) <br> end do |
| AND | b = .true. <br> do i = 1, n <br> b = b .and. v(i) <br> end do |
| Count of non-zero elements | k = 0 <br> do i = 1, n <br> if(v(i).ne.0) k = k + 1 <br> end do |
All forms of the MIN and MAX functions are recognized.
Floating-point sum or product reduction operations may be inaccurate due to the following conditions:
The order in which the calculations are performed in parallel is not the same as when performed serially on a single processor.
The order of calculation affects the sum or product of floating-point numbers. Hardware floating-point addition and multiplication are not associative. Roundoff, overflow, or underflow errors may result depending on how the operands associate. For example, (X*Y)*Z and X*(Y*Z) may not have the same numerical significance.
In some situations, the error may not be acceptable.
Example: Roundoff when summing 100,000 random numbers between -1 and +1:
```
demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub )   ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
        s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f95 -O4 -autopar -reduction t4.f
```
Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.
| Number of Processors | Output |
|---|---|
| 1 | s = 0.568582080884714E+02 |
| 2 | s = 0.568582080884722E+02 |
| 3 | s = 0.568582080884721E+02 |
| 4 | s = 0.568582080884724E+02 |
In this situation, roundoff error on the order of 10⁻¹⁴ is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.