Sun Studio 12: Fortran Programming Guide

10.2 Automatic Parallelization

With the -autopar option, the f95 compiler automatically finds DO loops that can be parallelized effectively. These loops are then transformed to distribute their iterations evenly over the available processors. The compiler generates the thread calls needed to make this happen.

10.2.1 Loop Parallelization

The compiler’s dependency analysis transforms a DO loop into a parallelizable task. The compiler may restructure the loop to split out unparallelizable sections that will run serially. It then distributes the work evenly over the available processors. Each processor executes a different chunk of iterations.

For example, with four CPUs and a parallelized loop with 1000 iterations, each thread would execute a chunk of 250 iterations:

Processor 1 executes iterations 

through 

250 

Processor 2 executes iterations 

251 

through 

500 

Processor 3 executes iterations 

501 

through 

750 

Processor 4 executes iterations 

751 

through 

1000 

Only loops that do not depend on the order in which the computations are performed can be successfully parallelized. The compiler’s dependence analysis rejects from parallelization those loops with inherent data dependencies. If it cannot fully determine the data flow in a loop, the compiler acts conservatively and does not parallelize. Also, it may choose not to parallelize a loop if it determines the performance gain does not justify the overhead.

Note that the compiler always chooses to parallelize loops using a static loop scheduling—simply dividing the work in the loop into equal blocks of iterations. Other scheduling schemes may be specified using explicit parallelization directives described later in this chapter.

10.2.2 Arrays, Scalars, and Pure Scalars

A few definitions, from the point of view of automatic parallelization, are needed:

Example: Array/scalar:


      dimension a(10)
      real m(100,10), s, u, x, z
      equivalence ( u, z )
      pointer ( px, x )
      s = 0.0
      ...

Both m and a are array variables; s is pure scalar. The variables u, x, z, and px are scalar variables, but not pure scalars.

10.2.3 Automatic Parallelization Criteria

DO loops that have no cross-iteration data dependencies are automatically parallelized by -autopar. The general criteria for automatic parallelization are:

10.2.3.1 Apparent Dependencies

The compilers may automatically eliminate a reference that appears to create a data dependence in the loop. One of the many such transformations makes use of private versions of some of the arrays. Typically, the compiler does this if it can determine that such arrays are used in the original loops only as temporary storage.

Example: Using -autopar, with dependencies eliminated by private arrays:


      parameter (n=1000)
      real a(n), b(n), c(n,n)
      do i = 1, 1000             <--Parallelized
        do k = 1, n
          a(k) = b(k) + 2.0
        end do
        do j = 1, n-1
          c(i,j) = a(j+1) + 2.3
        end do
      end do
      end

In the example, the outer loop is parallelized and run on independent processors. Although the inner loop references to array a appear to result in a data dependence, the compiler generates temporary private copies of the array to make the outer loop iterations independent.

10.2.3.2 Inhibitors to Automatic Parallelization

Under automatic parallelization, the compilers do not parallelize a loop if:

10.2.3.3 Nested Loops

In a multithreaded, multiprocessor environment, it is most effective to parallelize the outermost loop in a loop nest, rather than the innermost. Because parallel processing typically involves relatively large loop overhead, parallelizing the outermost loop minimizes the overhead and maximizes the work done for each thread. Under automatic parallelization, the compilers start their loop analysis from the outermost loop in a nest and work inward until a parallelizable loop is found. Once a loop within the nest is parallelized, loops contained within the parallel loop are passed over.

10.2.4 Automatic Parallelization With Reduction Operations

A computation that transforms an array into a scalar is called a reduction operation. Typical reduction operations are the sum or product of the elements of a vector. Reduction operations violate the criterion that calculations within a loop not change a scalar variable in a cumulative way across iterations.

Example: Reduction summation of the elements of a vector:


      s = 0.0
      do i = 1, 1000
        s = s + v(i)
      end do
      t(k) = s

However, for some operations, if reduction is the only factor that prevents parallelization, it is still possible to parallelize the loop. Common reduction operations occur so frequently that the compilers are capable of recognizing and parallelizing them as special cases.

Recognition of reduction operations is not included in the automatic parallelization analysis unless the -reduction compiler option is specified along with -autopar or -parallel.

If a parallelizable loop contains one of the reduction operations listed in Table 10–2, the compiler will parallelize it if -reduction is specified.

10.2.4.1 Recognized Reduction Operations

The following table lists the reduction operations that are recognized by the compiler.

Table 10–2 Recognized Reduction Operations

Mathematical Operations  

Fortran Statement Templates  

Sum 

s = s + v(i)

Product 

s = s * v(i)

Dot product 

s = s + v(i) * u(i)

Minimum 

s = amin( s, v(i))

Maximum 

s = amax( s, v(i))

OR

do i = 1, n

b = b .or. v(i)

end do

AND

b = .true.

do i = 1, n

b = b .and. v(i)

end do

Count of non-zero elements 

k = 0

do i = 1, n

if(v(i).ne.0) k = k + 1

end do

All forms of the MIN and MAX function are recognized.

10.2.4.2 Numerical Accuracy and Reduction Operations

Floating-point sum or product reduction operations may be inaccurate due to the following conditions:

In some situations, the error may not be acceptable.

Example: Roundoff, get the sum of 100,000 random numbers between– 1 and +1:


demo% cat t4.f
      parameter ( n = 100000 )
      double precision d_lcrans, lb / -1.0 /, s, ub / +1.0 /, v(n)
      s = d_lcrans ( v, n, lb, ub ) ! Get n random nos. between -1 and +1
      s = 0.0
      do i = 1, n
        s = s + v(i)
      end do
      write(*, ’(" s = ", e21.15)’) s
      end
demo% f95 -O4 -autopar -reduction t4.f

Results vary with the number of processors. The following table shows the sum of 100,000 random numbers between– 1 and +1.

Number of Processors  

Output  

s = 0.568582080884714E+02

s = 0.568582080884722E+02

s = 0.568582080884721E+02

s = 0.568582080884724E+02

In this situation, roundoff error on the order of 10-14 is acceptable for data that is random to begin with. For more information, see the Sun Numerical Computation Guide.