Use compiler commentary to check autoscoping results and to see if any parallel regions are serialized because autoscoping failed.
The compiler produces inline commentary when the program is compiled with the -g debug option. The generated commentary can be viewed with the er_src command, as shown in the second example below. (The er_src command is provided as part of the Sun Studio software; for more information, see the er_src(1) man page or the Sun Studio Performance Analyzer manual.)
A good place to start is to compile with the -vpara option. A warning message is printed if autoscoping fails, as shown below.
% cat t.f
      INTEGER X(100), Y(100), I, T
C$OMP PARALLEL DO DEFAULT(__AUTO)
      DO I=1, 100
         T = Y(I)
         CALL FOO(X)
         X(I) = T*T
      END DO
C$OMP END PARALLEL DO
      END
% f95 -xopenmp -xO3 -vpara -c t.f
"t.f", line 2: Warning: parallel region will be executed by a single
  thread because the autoscoping of following variables failed - x
Use -vpara with f95 and -xvpara with cc. (This option has not yet been implemented in CC.)
% cat t.f
      INTEGER X(100), Y(100), I, T
C$OMP PARALLEL DO DEFAULT(__AUTO)
      DO I=1, 100
         T = Y(I)
         X(I) = T*T
      END DO
C$OMP END PARALLEL DO
      END
% f95 -xopenmp -xO3 -g -c t.f
% er_src t.o
Source file: ./t.f
Object file: ./t.o
Load Object: ./t.o

     1.         INTEGER X(100), Y(100), I, T

   Source OpenMP region below has tag R1
   Variables autoscoped as SHARED in R1: x, y
   Variables autoscoped as PRIVATE in R1: t, i
   Private variables in R1: i, t
   Shared variables in R1: y, x
     2.   C$OMP PARALLEL DO DEFAULT(__AUTO)

   <Function: _$d1A2.MAIN_>

   Source loop below has tag L1
   L1 parallelized by explicit user directive
   L1 parallel loop-body code placed in function _$d1A2.MAIN_ along with 0 inner loops

   Copy in M-function of loop below has tag L2
   L2 scheduled with steady-state cycle count = 3
   L2 unrolled 4 times
   L2 has 0 loads, 0 stores, 2 prefetches, 0 FPadds, 0 FPmuls, and 0 FPdivs per iteration
   L2 has 1 int-loads, 1 int-stores, 4 alu-ops, 1 muls, 0 int-divs and 1 shifts per iteration
     3.         DO I=1, 100
     4.            T = Y(I)
     5.            X(I) = T*T
     6.         END DO
     7.   C$OMP END PARALLEL DO
     8.         END
Next, consider a more complicated example that illustrates how the autoscoping rules work.
 1.      REAL FUNCTION FOO (N, X, Y)
 2.      INTEGER       N, I
 3.      REAL          X(*), Y(*)
 4.      REAL          W, MM, M
 5.
 6.      W = 0.0
 7.
 8. C$OMP PARALLEL DEFAULT(__AUTO)
 9.
10. C$OMP SINGLE
11.      M = 0.0
12. C$OMP END SINGLE
13.
14.      MM = 0.0
15.
16. C$OMP DO
17.      DO I = 1, N
18.         T = X(I)
19.         Y(I) = T
20.         IF (MM .LT. T) THEN
21.            W = W + T
22.            MM = T
23.         END IF
24.      END DO
25. C$OMP END DO
26.
27. C$OMP CRITICAL
28.      IF (MM .GT. M) THEN
29.         M = MM
30.      END IF
31. C$OMP END CRITICAL
32.
33. C$OMP END PARALLEL
34.
35.      FOO = W - M
36.
37.      RETURN
38.      END
The function FOO() contains a parallel region, which in turn contains a SINGLE construct, a work-sharing DO construct, and a CRITICAL construct. Ignoring all the OpenMP constructs, the code in the parallel region does the following:
Copy the values in array X to array Y.
Find the maximum positive value in X, and store it in M.
Accumulate the values of some elements of X into variable W.
Let’s see how the compiler uses the above rules to find the appropriate scopes for the variables in the parallel region.
The following variables are used in the parallel region: I, N, MM, T, W, M, X, and Y. The compiler determines their scopes as follows.
Scalar I is the loop index of the work-sharing DO loop. The OpenMP specification mandates that I be scoped PRIVATE.
Scalar N is only read in the parallel region and therefore cannot cause a data race, so it is scoped SHARED following rule S1.
Any thread executing the parallel region executes statement 14, which sets the value of scalar MM to 0.0. This write causes a data race, so rule S1 does not apply. The write happens before any read of MM in the same thread, so MM is scoped PRIVATE according to rule S2.
Similarly, scalar T is scoped as PRIVATE.
Scalar W is read and then written at statement 21, so rules S1 and S2 do not apply. The addition operation is both associative and commutative; therefore, W is scoped REDUCTION(+) according to rule S3.
Scalar M is written in statement 11, which is inside a SINGLE construct. The implicit barrier at the end of the SINGLE construct ensures that the write in statement 11 cannot happen concurrently with either the read in statement 28 or the write in statement 29, and the latter two cannot happen at the same time because both are inside the same CRITICAL construct. No two threads can access M at the same time, so the writes and reads of M in the parallel region do not cause a data race, and, following rule S1, M is scoped SHARED.
Array X is only read and not written in the region, so it is scoped as SHARED by rule A1.
The writes to array Y are distributed among the threads, and no two threads write to the same element of Y. As there is no data race, Y is scoped SHARED according to rule A1.