With the -autopar and -parallel options, the compilers automatically find DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.
The compiler's dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.
For example, with four CPUs and a parallelized loop with 1000 iterations:
Processor 1 executes iterations 1 through 250
Processor 2 executes iterations 251 through 500
Processor 3 executes iterations 501 through 750
Processor 4 executes iterations 751 through 1000
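The split shown above follows from a simple calculation. The following Python sketch (purely illustrative; not part of the compiler) reproduces the chunk bounds for 1000 iterations over 4 processors, including the case where the iteration count does not divide evenly:

```python
def chunk_bounds(n_iters, n_procs):
    """Split iterations 1..n_iters into contiguous, nearly equal
    chunks, one per processor, as a chunk distribution does."""
    base, extra = divmod(n_iters, n_procs)
    bounds, start = [], 1
    for p in range(n_procs):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size - 1))
        start += size
    return bounds

print(chunk_bounds(1000, 4))
# [(1, 250), (251, 500), (501, 750), (751, 1000)]
```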
Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.
Note that the compiler always chooses to parallelize loops using a chunk distribution--simply dividing the work in the loop into equal blocks of iterations. Other distribution schemes may be specified using explicit parallelization directives described later in this chapter.
A few definitions, from the point of view of automatic parallelization, are needed:
An array is a variable that is declared with at least one dimension.
A scalar is a variable that is not an array.
A pure scalar is a scalar variable that is not aliased--not referenced in an EQUIVALENCE or POINTER statement.
      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...
Both m and a are array variables; s is pure scalar. The variables u, x, z, and px are scalar variables, but not pure scalars.
DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar or -parallel. The general criteria for automatic parallelization are:
DO loops are parallelized, but not DO WHILE or Fortran 90 array operations.
The values of array variables for each iteration of the loop must not depend on the values of array variables for any other iteration of the loop.
Calculations within the loop must not conditionally change any pure scalar variable that is referenced after the loop terminates.
Calculations within the loop must not change a scalar variable across iterations. This is called a loop-carried dependency.
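The difference between independent iterations and a loop-carried dependency can be sketched as follows (Python, for illustration only). In the first loop, each iteration reads and writes only its own elements, so execution order does not matter; in the second, each iteration reads a value written by the previous one, so reordering changes the result:

```python
b = [1.0, 2.0, 3.0, 4.0]

# Independent iterations: a[i] depends only on b[i], so even running
# the loop backwards gives the same result.
a = [0.0] * 4
for i in reversed(range(4)):
    a[i] = b[i] + 2.0
assert a == [3.0, 4.0, 5.0, 6.0]

# Loop-carried dependency: x[i] reads x[i-1] written by the previous
# iteration, so the iterations must run in order.
x = [0.0] * 5
for i in range(1, 5):
    x[i] = x[i - 1] + b[i - 1]
assert x == [0.0, 1.0, 3.0, 6.0, 10.0]

xr = [0.0] * 5
for i in reversed(range(1, 5)):     # same statements, reversed order
    xr[i] = xr[i - 1] + b[i - 1]
assert xr != x                      # the result changes
```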
The f77 compiler may automatically eliminate a reference that appears to create a dependency by transforming the compiled code. One such transformation makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.
Example: Using -autopar, with dependencies eliminated by private arrays:
      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000             <--Parallelized
         do k = 1, n
            a(k) = b(k) + 2.0
         end do
         do j = 1, n
            c(i,j) = a(j) + 2.3
         end do
      end do
      end
In the preceding example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a(*) appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent.
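The private-copy transformation can be mimicked in Python (a sketch, not the compiler's actual code generation): when each outer iteration builds its own private copy of the temporary array a, the iterations no longer share storage and can run in any order:

```python
n = 4
b = [float(k) for k in range(n)]
c = [[0.0] * n for _ in range(n)]

def outer_iteration(i):
    # Each call builds its own private a, mirroring the compiler's
    # private per-processor copies of the temporary array.
    a = [b[k] + 2.0 for k in range(n)]
    for j in range(n):
        c[i][j] = a[j] + 2.3

# With private copies, the outer iterations can run in any order
# (or on different processors) and produce the same c.
for i in reversed(range(n)):
    outer_iteration(i)

assert c[2][1] == b[1] + 2.0 + 2.3
```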
Under automatic parallelization, the compilers do not parallelize a loop if:
The DO loop is nested inside another DO loop that is parallelized.
Flow control allows jumping out of the DO loop.
A user-level subprogram is invoked inside the loop.
An I/O statement is in the loop.
Calculations within the loop change an aliased scalar variable.
On multiprocessor systems, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each processor. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
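The outside-in search can be sketched as follows (illustrative Python; the nest is modeled simply as a list of levels, each tagged with whether it is parallelizable):

```python
def choose_parallel_loop(nest):
    """nest: loop levels from outermost to innermost, each marked
    True (parallelizable) or False. Return the index of the loop the
    analysis would parallelize: the outermost parallelizable one.
    Loops nested inside it are passed over."""
    for level, parallelizable in enumerate(nest):
        if parallelizable:
            return level
    return None  # no loop in the nest can be parallelized

# Outer loop not parallelizable, middle one is: the middle loop is
# chosen, and the innermost loop runs serially inside it.
assert choose_parallel_loop([False, True, True]) == 1
```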
A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.
Example: Reduction summation of the elements of a vector:
      s = 0.0
      do i = 1, 1000
         s = s + v(i)
      end do
      t(k) = s
However, for some operations, if the reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
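What the compiler does for a recognized sum reduction can be sketched like this (Python, illustration only): each processor accumulates a private partial sum over its own chunk, and the partial sums are combined when the chunks finish:

```python
v = [float(i) for i in range(1, 1001)]

# Serial reduction, as written in the source.
s_serial = 0.0
for x in v:
    s_serial = s_serial + x

# Parallel-style reduction: one private partial sum per chunk,
# combined at the end.
n_procs = 4
chunk = len(v) // n_procs
partials = []
for p in range(n_procs):
    ps = 0.0
    for x in v[p * chunk:(p + 1) * chunk]:
        ps = ps + x
    partials.append(ps)
s_parallel = sum(partials)

assert s_serial == 500500.0      # sum of 1..1000, exact in floating point
assert s_parallel == 500500.0
```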
Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.
If a parallelizable loop contains one of the reduction operations listed in Table 10-3, the compiler will parallelize it if -reduction is specified.
The following table lists the reduction operations that are recognized by f77 and f90.
Table 10-3 Recognized Reduction Operations
    Mathematical Operations          Fortran Statement Templates
    ------------------------------   ----------------------------------
    Sum of the elements              s = s + v(i)
    Product of the elements          s = s * v(i)
    Dot product of two vectors       s = s + v(i) * u(i)
    Minimum of the elements          s = amin( s, v(i) )
    Maximum of the elements          s = amax( s, v(i) )
    OR of the elements               do i = 1, n
                                        b = b .or. v(i)
                                     end do
    AND of nonpositive elements      b = .true.
                                     do i = 1, n
                                        if (v(i) .le. 0) b = b .and. v(i)
                                     end do
    Count nonzero elements           k = 0
                                     do i = 1, n
                                        if ( v(i) .ne. 0 ) k = k + 1
                                     end do
All forms of the MIN and MAX function are recognized.
Floating-point sum or product reduction operations may be inaccurate due to the following conditions:
The order in which the calculations were performed in parallel was not the same as when performed serially on a single processor.
The order of calculation affected the sum or product of floating-point numbers. Hardware floating-point addition and multiplication are not associative. Roundoff, overflow, or underflow errors may result depending on how the operands associate. For example, (X*Y)*Z and X*(Y*Z) may not have the same numerical significance.
In some situations, the error may not be acceptable.
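The non-associativity behind the second condition is easy to demonstrate directly (Python; the same behavior holds for any IEEE floating-point arithmetic):

```python
# Addition: the grouping changes the rounded result.
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)

# Multiplication: one grouping overflows to infinity, the other
# stays finite.
big = 1e308
assert (big * 10.0) * 0.1 == float('inf')
assert big * (10.0 * 0.1) == 1e308
```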
Example: Overflow and underflow, with and without reduction:
demo% cat t3.f
      real A(10002), result, MAXFLOAT
      MAXFLOAT = r_max_normal()
      do 10 i = 1, 10000, 2
         A(i) = MAXFLOAT
         A(i+1) = -MAXFLOAT
10    continue
      A(5001) = -MAXFLOAT
      A(5002) = MAXFLOAT
      do 20 i = 1, 10002          !Add up the array
         RESULT = RESULT + A(i)
20    continue
      write(6,*) RESULT
      end
demo% setenv PARALLEL 2           {Number of processors is 2}
demo% f77 -silent -autopar t3.f
demo% a.out
  0.                              {Without reduction, 0. is correct}
demo% f77 -silent -autopar -reduction t3.f
demo% a.out
  Inf                             {With reduction, Inf is not correct}
demo%
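The effect in t3.f can be reproduced in Python (a sketch, with 1e308 standing in for MAXFLOAT): summed in serial order, the values cancel pairwise and the result is 0; summed as two per-chunk partial sums, the second chunk begins with two positive MAXFLOAT values and overflows before anything can cancel:

```python
maxfloat = 1e308                   # stand-in for r_max_normal() in t3.f
n = 10002
a = [0.0] * n
for i in range(0, 10000, 2):       # Fortran: do 10 i = 1, 10000, 2
    a[i] = maxfloat
    a[i + 1] = -maxfloat
a[5000] = -maxfloat                # A(5001) = -MAXFLOAT
a[5001] = maxfloat                 # A(5002) =  MAXFLOAT

# Serial sum: the values cancel in order, and the result is 0.
s = 0.0
for x in a:
    s = s + x
assert s == 0.0

# Two-chunk reduction, as with PARALLEL=2: the second chunk begins
# maxfloat, maxfloat, ... and overflows to infinity.
half = n // 2                      # 5001 elements per chunk
partials = []
for lo in (0, half):
    ps = 0.0
    for x in a[lo:lo + half]:
        ps = ps + x
    partials.append(ps)
assert partials[0] + partials[1] == float('inf')
```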
Example: Roundoff, get the sum of 100,000 random numbers between -1 and +1:
demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub ) ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
         s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f77 -autopar -reduction t4.f
Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.
    Number of Processors    Output
    --------------------    --------------------------
    1                       s = 0.568582080884714E+02
    2                       s = 0.568582080884722E+02
    3                       s = 0.568582080884721E+02
    4                       s = 0.568582080884724E+02
In this situation, roundoff error on the order of 1E-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.
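The variation in the table can be mimicked without d_lcrans (Python sketch; a seeded pseudo-random vector stands in for the Fortran random numbers): summing the same values in different chunkings produces results that agree to many significant digits but generally not exactly.

```python
import random

random.seed(1)
v = [random.uniform(-1.0, 1.0) for _ in range(100000)]

def chunked_sum(v, n_procs):
    """Sum v as n_procs per-chunk partial sums, then combine,
    as a parallel reduction would."""
    chunk = len(v) // n_procs
    total = 0.0
    for p in range(n_procs):
        ps = 0.0
        for x in v[p * chunk:(p + 1) * chunk]:
            ps = ps + x
        total = total + ps
    return total

sums = [chunked_sum(v, p) for p in (1, 2, 4)]
# All chunkings agree to high precision, though the low-order
# digits may differ.
for s in sums[1:]:
    assert abs(s - sums[0]) < 1e-8
```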