Fortran Programming Guide

Automatic Parallelization

With the -autopar and -parallel options, the compilers automatically find DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.
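For example, a typical compile-and-run sequence looks like the following. This mirrors the transcripts shown later in this section; prog.f is a placeholder file name, and the PARALLEL environment variable sets the number of processors the program uses at run time:


demo% f77 -autopar prog.f        {Compile with automatic parallelization}
demo% setenv PARALLEL 4          {Number of processors is 4}
demo% a.out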

Loop Parallelization

Guided by its dependency analysis, the compiler transforms a DO loop into a parallelizable task. It may restructure the loop to split out unparallelizable sections that must run serially. The work is then distributed evenly over the available processors, with each processor executing a different chunk of iterations.

For example, with four CPUs and a parallelized loop with 1000 iterations:

Processor 1 executes iterations 1 through 250
Processor 2 executes iterations 251 through 500
Processor 3 executes iterations 501 through 750
Processor 4 executes iterations 751 through 1000

Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler's dependency analysis rejects loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.

Note that the compilers always parallelize loops using a chunk distribution: the work in the loop is simply divided into equal blocks of consecutive iterations, as in the sketch below. Other distribution schemes can be specified using the explicit parallelization directives described later in this chapter.
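As an illustrative sketch (not actual compiler or runtime code), the bounds of each processor's chunk can be computed as follows. The program and its names (p for the processor count, n for the iteration count, lo and hi for the chunk bounds) are hypothetical:


      program chunks
c     Sketch: compute the block of iterations each processor
c     receives under a chunk distribution of n iterations
c     over p processors.
      integer p, n, k, lo, hi
      parameter ( p = 4, n = 1000 )
      do k = 1, p
        lo = (k - 1) * n / p + 1
        hi = k * n / p
        write(*, '(a, i2, a, i5, a, i5)')
     &      ' Processor ', k, ' executes iterations ', lo,
     &      ' through ', hi
      end do
      end


For p = 4 and n = 1000, this prints the same distribution shown in the preceding figure: 250 iterations per processor.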

Arrays, Scalars, and Pure Scalars

A few definitions, from the point of view of automatic parallelization, are needed:

An array is a variable that has at least one dimension. A scalar is a variable that is not an array. A pure scalar is a scalar variable that cannot be aliased, that is, one that does not appear in an EQUIVALENCE or POINTER statement.

Example: Array/scalar:


      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...

Both m and a are array variables; s is a pure scalar. The variables u, x, z, and px are scalars, but not pure scalars: u and z appear in an EQUIVALENCE statement, and x and px appear in a POINTER statement.

Automatic Parallelization Criteria

DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar or -parallel. In general, a loop is a candidate for automatic parallelization if it is a DO loop, if values computed in one iteration do not depend on values computed in any other iteration, and if calculations within the loop do not change a scalar variable in a cumulative way across iterations (but see "Automatic Parallelization With Reduction Operations" later in this section).
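For example (an illustrative sketch, not from the original guide), the first loop below meets these criteria, while the second does not, because each of its iterations reads a value computed by the previous iteration:


      real a(1000), b(1000), c(1000)
      ...
      do i = 1, 1000                     <--Parallelizable
        a(i) = b(i) + c(i)
      end do

      do i = 2, 1000                     <--Not parallelizable: a(i) uses a(i-1)
        a(i) = a(i-1) + b(i)
      end do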

f77: Apparent Dependencies

The f77 compiler may automatically eliminate a reference that appears to create a data dependency by transforming the compiled code. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.

Example: Using -autopar, with dependencies eliminated by private arrays:


      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000                             <--Parallelized
        do k = 1, n                   
          a(k) = b(k) + 2.0
        end do
        do j = 1, n
          c(i,j) = a(j) + 2.3
        end do
      end do
      end

In the preceding example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a(*) appear to result in a data dependency, the compiler generates temporary private copies of the array to make the outer loop iterations independent.

Inhibitors to Automatic Parallelization

Under automatic parallelization, the compilers do not parallelize a loop if it contains a data dependency that cannot be eliminated, if the data flow through the loop cannot be fully determined, if the estimated performance gain does not justify the parallelization overhead, or if the loop is nested inside a loop that has already been parallelized.

Nested Loops

On multiprocessor systems, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each processor. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.
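For example (a hypothetical loop nest, not from the original guide), if the outermost loop carries a cross-iteration dependency, the compiler moves inward and may parallelize an inner loop instead:


      parameter ( n = 1000 )
      real a(n,n), b(n,n)
      do i = 2, n                        <--Runs serially: row i uses row i-1
        do j = 1, n                      <--Parallelized
          a(i,j) = a(i-1,j) + b(i,j)
        end do
      end do
      end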

Automatic Parallelization With Reduction Operations

A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.

Example: Reduction summation of the elements of a vector:


      s = 0.0
      do i = 1, 1000
        s = s + v(i)
      end do
      t(k) = s

However, for some operations, if the reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.
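Conceptually, a recognized sum reduction is computed with private partial results that are combined when the loop completes. The following hand-written sketch shows the idea for two processors; the code the compiler actually generates differs:


      s = 0.0
c     Partial sum for the first chunk of iterations
      s1 = 0.0
      do i = 1, 500
        s1 = s1 + v(i)
      end do
c     Partial sum for the second chunk of iterations
      s2 = 0.0
      do i = 501, 1000
        s2 = s2 + v(i)
      end do
c     Combine the partial sums after both chunks complete
      s = s1 + s2


Because the additions are grouped differently than in the serial loop, the result can differ slightly; this is the source of the accuracy issues discussed later in this section.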

Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.

If a parallelizable loop contains one of the reduction operations listed in Table 10-3, the compiler will parallelize it if -reduction is specified.

Recognized Reduction Operations

The following table lists the reduction operations that are recognized by f77 and f90.

Table 10-3 Recognized Reduction Operations

Mathematical Operation           Fortran Statement Template

Sum of the elements              s = s + v(i)

Product of the elements          s = s * v(i)

Dot product of two vectors       s = s + v(i) * u(i)

Minimum of the elements          s = amin( s, v(i))

Maximum of the elements          s = amax( s, v(i))

OR of the elements               do i = 1, n
                                   b = b .or. v(i)
                                 end do

AND of nonpositive elements      b = .true.
                                 do i = 1, n
                                   if (v(i) .le. 0) b = b .and. v(i)
                                 end do

Count of nonzero elements        k = 0
                                 do i = 1, n
                                   if ( v(i) .ne. 0 ) k = k + 1
                                 end do

All forms of the MIN and MAX functions are recognized.
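For example, a sketch of a minimum-of-the-elements loop written with the generic MIN intrinsic, which -reduction should likewise recognize (illustrative, not from the original guide):


      s = v(1)
      do i = 2, n
        s = min( s, v(i) )
      end do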

Numerical Accuracy and Reduction Operations

Floating-point sum or product reduction operations may be inaccurate because parallelization changes the order in which the calculations are performed, and floating-point arithmetic is not exactly associative: a different order of evaluation can change the result through roundoff, or can trigger overflow or underflow that serial execution avoids. In some situations, the error may not be acceptable.

Example: Overflow and underflow, with and without reduction:


demo% cat t3.f
      real A(10002), result, MAXFLOAT
      MAXFLOAT = r_max_normal()
      do 10 i = 1, 10001, 2
        A(i) = MAXFLOAT
        A(i+1) = -MAXFLOAT
10    continue

      A(5001)=-MAXFLOAT
      A(5002)=MAXFLOAT
 
      RESULT = 0.0
      do 20 i = 1, 10002        !Add up the array
        RESULT = RESULT + A(i)
20    continue
      write(6,*) RESULT
      end
demo% setenv PARALLEL 2          {Number of processors is 2}
demo% f77 -silent -autopar t3.f 
demo% a.out
   0.                            {Without reduction, 0. is correct}
demo% f77 -silent -autopar -reduction t3.f
demo% a.out
  Inf                            {With reduction, Inf. is not correct}
demo% 

Example: Roundoff error when summing 100,000 random numbers between -1 and +1:


demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub ) ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
        s = s + v(i)
      end do
      write(*, '(" s = ", e21.15)') s
      end
demo% f77 -autopar -reduction t4.f

Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between -1 and +1.

Number of Processors     Output

1                        s = 0.568582080884714E+02
2                        s = 0.568582080884722E+02
3                        s = 0.568582080884721E+02
4                        s = 0.568582080884724E+02

In this situation, roundoff error on the order of 1E-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.