Fortran Programming Guide

Automatic Parallelization

With the -autopar and -parallel options, the compilers automatically find DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.
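For example, a typical compile-and-run sequence looks like the following. This mirrors the transcripts shown later in this section; prog.f is a placeholder file name, and the PARALLEL environment variable sets the number of processors the program uses at run time:


demo% f77 -autopar prog.f        {Compile with automatic parallelization}
demo% setenv PARALLEL 4          {Number of processors is 4}
demo% a.out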

Loop Parallelization

Guided by its dependency analysis, the compiler transforms a DO loop into a parallelizable task. It may restructure the loop to split out unparallelizable sections that must run serially. The work is then distributed evenly over the available processors, with each processor executing a different chunk of iterations.

For example, with four CPUs and a parallelized loop with 1000 iterations:

Processor 1 executes iterations 1 through 250
Processor 2 executes iterations 251 through 500
Processor 3 executes iterations 501 through 750
Processor 4 executes iterations 751 through 1000

Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.

Note that the compilers always parallelize loops using a chunk distribution: the work in the loop is simply divided into equal blocks of consecutive iterations, as in the sketch below. Other distribution schemes can be specified using the explicit parallelization directives described later in this chapter.
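As an illustrative sketch (not actual compiler or runtime code), the bounds of each processor's chunk can be computed as follows. The program and its names (p for the processor count, n for the iteration count, lo and hi for the chunk bounds) are hypothetical:


      program chunks
c     Sketch: compute the block of iterations each processor
c     receives under a chunk distribution of n iterations
c     over p processors.
      integer p, n, k, lo, hi
      parameter ( p = 4, n = 1000 )
      do k = 1, p
        lo = (k - 1) * n / p + 1
        hi = k * n / p
        write(*, '(a, i2, a, i5, a, i5)')
     &      ' Processor ', k, ' executes iterations ', lo,
     &      ' through ', hi
      end do
      end


For p = 4 and n = 1000, this prints the same distribution shown in the preceding figure: 250 iterations per processor.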

Arrays, Scalars, and Pure Scalars

A few definitions, from the point of view of automatic parallelization, are needed:

An array is a variable that has at least one dimension. A scalar is a variable that is not an array. A pure scalar is a scalar variable that cannot be aliased, that is, one that does not appear in an EQUIVALENCE or POINTER statement.

Example: Array/scalar:


      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...

Both m and a are array variables; s is a pure scalar. The variables u, x, z, and px are scalars, but not pure scalars: u and z appear in an EQUIVALENCE statement, and x and px appear in a POINTER statement.

Automatic Parallelization Criteria

DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar or -parallel. In general, a loop is a candidate for automatic parallelization if it is a DO loop, if values computed in one iteration do not depend on values computed in any other iteration, and if calculations within the loop do not change a scalar variable in a cumulative way across iterations (but see "Automatic Parallelization With Reduction Operations" later in this section).
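For example (an illustrative sketch, not from the original guide), the first loop below meets these criteria, while the second does not, because each of its iterations reads a value computed by the previous iteration:


      real a(1000), b(1000), c(1000)
      ...
      do i = 1, 1000                     <--Parallelizable
        a(i) = b(i) + c(i)
      end do

      do i = 2, 1000                     <--Not parallelizable: a(i) uses a(i-1)
        a(i) = a(i-1) + b(i)
      end do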

f77: Apparent Dependencies

The f77 compiler may automatically eliminate a reference that appears to create a data dependency by transforming the compiled code. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.

Example: Using -autopar, with dependencies eliminated by private arrays:


      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000                             <--Parallelized
        do k = 1, n                   
          a(k) = b(k) + 2.0
        end do
        do j = 1, n
          c(i,j) = a(j) + 2.3
        end do
      end do
      end

In the preceding example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a(*) appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent.

Inhibitors to Automatic Parallelization

Under automatic parallelization, the compilers do not parallelize a loop if it contains a data dependency that cannot be eliminated, if the data flow through the loop cannot be fully determined, if the estimated performance gain does not justify the parallelization overhead, or if the loop is nested inside a loop that has already been parallelized.

Nested Loops

On multiprocessor systems, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each processor. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
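For example (a hypothetical loop nest, not from the original guide), if the outermost loop carries a cross-iteration dependency, the compiler moves inward and may parallelize an inner loop instead:


      parameter ( n = 1000 )
      real a(n,n), b(n,n)
      do i = 2, n                        <--Runs serially: row i uses row i-1
        do j = 1, n                      <--Parallelized
          a(i,j) = a(i-1,j) + b(i,j)
        end do
      end do
      end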

Automatic Parallelization With Reduction Operations

A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.

Example: Reduction summation of the elements of a vector:


      s = 0.0
      do i = 1, 1000
        s = s + v(i)
      end do
      t(k) = s

However, for some operations, if the reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
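Conceptually, a recognized sum reduction is computed with private partial results that are combined when the loop completes. The following hand-written sketch shows the idea for two processors; the code the compiler actually generates differs:


      s = 0.0
c     Partial sum for the first chunk of iterations
      s1 = 0.0
      do i = 1, 500
        s1 = s1 + v(i)
      end do
c     Partial sum for the second chunk of iterations
      s2 = 0.0
      do i = 501, 1000
        s2 = s2 + v(i)
      end do
c     Combine the partial sums after both chunks complete
      s = s1 + s2


Because the additions are grouped differently than in the serial loop, the result can differ slightly; this is the source of the accuracy issues discussed later in this section.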

Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.

If a parallelizable loop contains one of the reduction operations listed in Table 10-3, the compiler will parallelize it if -reduction is specified.

Recognized Reduction Operations

The following table lists the reduction operations that are recognized by f77 and f90.

Table 10-3 Recognized Reduction Operations

Mathematical Operation           Fortran Statement Template

Sum of the elements              s = s + v(i)

Product of the elements          s = s * v(i)

Dot product of two vectors       s = s + v(i) * u(i)

Minimum of the elements          s = amin( s, v(i))

Maximum of the elements          s = amax( s, v(i))

OR of the elements               do i = 1, n
                                   b = b .or. v(i)
                                 end do

AND of nonpositive elements      b = .true.
                                 do i = 1, n
                                   if (v(i) .le. 0) b = b .and. v(i)
                                 end do

Count of nonzero elements        k = 0
                                 do i = 1, n
                                   if ( v(i) .ne. 0 ) k = k + 1
                                 end do

All forms of the MIN and MAX functions are recognized.
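For example, a sketch of a minimum-of-the-elements loop written with the generic MIN intrinsic, which -reduction should likewise recognize (illustrative, not from the original guide):


      s = v(1)
      do i = 2, n
        s = min( s, v(i) )
      end do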

Numerical Accuracy and Reduction Operations

Floating-point sum or product reduction operations may be inaccurate because parallelization changes the order in which the calculations are performed, and floating-point arithmetic is not exactly associative: a different order of evaluation can change the result through roundoff, or can trigger overflow or underflow that serial execution avoids. In some situations, the error may not be acceptable.

Example: Overflow and underflow, with and without reduction:


demo% cat t3.f
      real A(10002), result, MAXFLOAT
      MAXFLOAT = r_max_normal()
      do 10 i = 1, 10001, 2
        A(i) = MAXFLOAT
        A(i+1) = -MAXFLOAT
10    continue

      A(5001)=-MAXFLOAT
      A(5002)=MAXFLOAT
 
      RESULT = 0.0
      do 20 i = 1, 10002        !Add up the array
        RESULT = RESULT + A(i)
20    continue
      write(6,*) RESULT
      end
demo% setenv PARALLEL 2          {Number of processors is 2}
demo% f77 -silent -autopar t3.f 
demo% a.out
   0.                            {Without reduction, 0. is correct}
demo% f77 -silent -autopar -reduction t3.f
demo% a.out
  Inf                            {With reduction, Inf. is not correct}
demo% 

Example: Roundoff error when summing 100,000 random numbers between -1 and +1:


demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub ) ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
        s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f77 -autopar -reduction t4.f

Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.

Number of Processors     Output

1                        s = 0.568582080884714E+02
2                        s = 0.568582080884722E+02
3                        s = 0.568582080884721E+02
4                        s = 0.568582080884724E+02

In this situation, roundoff error on the order of 1E-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.