Analyzing Program Performance With Sun WorkShop

Chapter 3 Loop Analysis Tools

The Fortran and C compilers automatically parallelize loops for which they determine that it is safe and profitable to do so. LoopTool is a performance analysis tool that reads loop timing files created by these compilers. LoopTool uses a graphical user interface (GUI). LoopReport is the command-line version of LoopTool.

Basic Concepts

LoopTool and LoopReport enable you to:

Time all loops, whether serial or parallel.
Produce a table of loop timings.
Collect hints from the compiler during compilation.
LoopTool displays a graph of loop runtimes and shows which loops were parallelized. You can go directly from the graphical display of loops to the source code for any loop you want, so you can edit your source code while in LoopTool.
LoopReport reports loop runtimes in an ASCII file instead of a graphical display.

There are four basic steps for using LoopTool and LoopReport:

Setting up environment variables
Compiling the program with the options required to create a timing file for loop analysis
Running the program to generate a timing file
Invoking LoopTool or LoopReport on the timing file

Note -
The examples in this section use the Fortran (f77 and f90) compilers. The options shown (such as -xparallel, -Zlp) also work for C.

Setting Up Your Environment

Before running an executable compiled with -Zlp, set the environment variable PARALLEL to the number of processors on your machine.

The following command makes use of psrinfo, a system utility. Note the backquotes:

% setenv PARALLEL `/usr/sbin/psrinfo | wc -l`

You may want to put this command in a shell startup file (such as .cshrc or .profile).
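
The psrinfo utility prints one line per processor, so the command above counts those lines. If you want a comparable count from inside a program, the following sketch asks the system for the number of processors currently online. It assumes a system (such as Solaris or Linux) whose sysconf call supports _SC_NPROCESSORS_ONLN; the helper name online_processors is made up for illustration and is not part of Sun WorkShop.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: number of processors currently online.
 * Falls back to 1 if the system cannot report a count. */
long online_processors(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1)
        n = 1;
    assert(n >= 1);
    return n;
}
```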

Creating a Loop Timing File

To create a loop timing file, you compile your program with compiler options that automatically parallelize and optimize your code (-xparallel and -xO4). You also add the -Zlp option to compile for LoopTool or LoopReport. When you run the program compiled with these options, Sun WorkShop creates a timing file for LoopTool or LoopReport to process.

The three compiler options are illustrated in this example:

% f77 -xO4 -xparallel -Zlp source_file

Note -

All examples apply to FORTRAN 77, Fortran 90, and C programs.

There are a number of other useful options for looking at and parallelizing loops:

Option	Effect
-o program	Renames the executable to program
-xexplicitpar	Parallelizes loops marked with DOALL pragma
-xloopinfo	Prints hints to stderr for redirection to files

Other Compilation Options

Many combinations of compiler options work for LoopTool and LoopReport.

To compile for automatic parallelization, typical compilation switches are -xparallel and -xO4. To compile for LoopTool and LoopReport, add -Zlp.

% f77 -xO4 -xparallel -Zlp source_file

You can use either -xO3 or -xO4 with -xparallel. If you don't specify -xO3 or -xO4 but you do use -xparallel, then the compiler uses -xO3. Table 3-1 summarizes how optimization level options are added for specific options.

Table 3-1 Optimization Level Options and What They Imply


You type:	Bumped Up To:
-xparallel	-xparallel -xO3
-xparallel -Zlp	-xparallel -xO3 -Zlp
-xexplicitpar	-xexplicitpar -xO3
-xexplicitpar -Zlp	-xexplicitpar -xO3 -Zlp
-Zlp	-xdepend -xO3 -Zlp

Other compilation options include -xexplicitpar and -xloopinfo.

The Fortran compiler option -xexplicitpar is used with the pragma DOALL. If you insert DOALL before a loop in your source code, you are explicitly marking that loop for parallelization. The compiler parallelizes the loop when you compile with -xexplicitpar.

The following code fragment shows how to mark a loop explicitly for parallelization.

	subroutine adj(a,b,c,x,n)
	  real*8 a(n), b(n), c(-n:0), x
	    integer n
c$par DOALL
	do 19 i = 1, n*n
	  do 29 k = i, n*n
	    a(i) = a(i) + x*b(k)*c(i-k)
29	  continue
19	continue
	return
	end

When you use -Zlp by itself, -xdepend and -xO3 are added. The -xdepend option instructs the compiler to perform the data dependency analysis that it needs to do to identify loops. The option -xparallel includes -xdepend, but -xdepend does not imply (or trigger) -xparallel.

The -xloopinfo option prints hints about loops to stderr (the UNIX standard error file, on file descriptor 2) when you compile your program. The hints include the routine names, the line number for the start of the loop, whether the loop was parallelized, and the reason it was not parallelized, if applicable.

The following example redirects hints about loops in the source file gamteb.F to the file gamteb.loopinfo:

% f77 -xO3 -parallel -xloopinfo -Zlp gamteb.F 2> gamteb.loopinfo

The main difference between -Zlp and -xloopinfo is that, in addition to providing compiler hints about loops, -Zlp instruments your program so that timing statistics are recorded at runtime. For this reason, LoopTool and LoopReport analyze only programs that have been compiled with -Zlp.

Running the Program

After compiling with -Zlp, run the executable. This creates the loop timing file, program.looptimes. Both LoopTool and LoopReport process two files: the instrumented executable and the loop timing file.

Starting LoopTool

You can start LoopTool by giving it the name of a program (an executable) to load:

% looptool program &

If you start LoopTool without specifying a file, the Open File dialog box is displayed, allowing you to select a file to examine:

% looptool &

Loading a Timing File

LoopTool reads the timing file associated with your program. The timing file contains information about loops. Typically, this file has a name of the format program.looptimes and is in the same directory as your program.

By default, LoopTool looks in the executable's directory for a timing file. Therefore, if the timing file is there (the usual case), you don't need to specify where to look for it:

% looptool program &

If you name a timing file on the command line, then LoopTool and LoopReport use it.

% looptool program program.looptimes &

If you use the command line option -p, LoopTool and LoopReport check for a timing file in the directory indicated by -p:

% looptool -p timing_file_directory program &

If the environment variable LVPATH is set, the tools check that directory for a timing file.

% setenv LVPATH timing_file_directory
% looptool program &

Using LoopTool

The main window displays the runtimes of your program's loops in a bar chart arranged in the order that the source files were presented to the compiler.

Figure 3-1 shows the components of the LoopTool window.

Figure 3-1 LoopTool Main Window

Opening Files

To open executable and timing files, choose File > Open in the main window.

There are two ways to specify the files you want to open:

Type in the names of the files to open.
Bring up a file chooser.

Once you enter the executable's path, you don't need to type in the timing file, unless it's in a different directory or has a non-default name (or both).

For more information about opening files, see the Analyzing the Loops in Your Program section of the Sun WorkShop Online Help.

Creating a Report on All Loops

To open a window with detailed information on all the loops in your program, choose File > Create Report in the main window (see Figure 3-2). The generated report is identical to that produced by LoopReport.

The Help button in the report window links to the Sun WorkShop online help section containing compiler hints.

Figure 3-2 LoopReport

Printing the LoopTool Graph

To print the LoopTool graph, choose File > Print Graph in the main window and type the name of your chosen printer. To save the graph to a file, type a filename instead of a printer name.

For more information about printing, see the Sun WorkShop online help.

Choosing an Editor

Choose File > Options in the main window to open the Options dialog box, where you can choose an editor for editing source code. The available editors are vi, gnuemacs, and xemacs.

Note -

vi and xemacs are installed with LoopTool into your install directory (usually /opt/SUNWspro/bin) if they're not already on your system. You must provide gnuemacs yourself. In all cases, the editor you want must be in a directory in your search path in order for LoopTool to find it. For example, your PATH environment variable should include /usr/local if that's where gnuemacs is located on your system.

For more information about choosing an editor, see the Sun WorkShop online help.

Getting Hints and Editing Source Code

Clicking a loop in the main window (see Figure 3-1) does two things:

It brings up a window in which you can edit your source code (see Figure 3-3). The available editors are vi, xemacs, and gnuemacs.

For information on vi, see the vi(1) manual page. xemacs and gnuemacs have online help (click the Help button).

The Sun WorkShop vi editor has a special Version menu that allows you to make use of the Source Code Control System (SCCS) utility for sharing files. See the online help, as well as the sccs(1) manual page, for more information.

It brings up a separate window that displays one or more hints about the loop you've selected. The Help button in this window displays the Sun WorkShop online help compiler hints section. See also "Compiler Hints", which explains the hints in detail.

Figure 3-3 shows an xemacs editor window with a loop selected, and a hint window with an explanation of a compiler hint.

Figure 3-3 The Text Editor and Hints Windows

Caution -

If you edit your source code, line numbers shown by LoopTool may become inconsistent with the source. You must save and recompile the edited source and then run LoopTool with the new executable, producing new loop information, for the line numbers to remain consistent.

Starting LoopReport

When it starts up, LoopReport expects to be given the name of your program. Type loopreport and the name of the program (an executable) you want examined.

% loopreport program

You can also start LoopReport without naming a program. In that case, it looks for a file named a.out in the current working directory.

% loopreport > a.out.loopreport

You can also direct the output into a file, or pipe it into another command:

% loopreport program > program.loopreport
% loopreport program | more

Timing File

LoopReport also reads the timing file associated with your program. The timing file is created when you run a program compiled with the -Zlp option, and contains information about loops. Typically, this file has a name of the format program.looptimes, and is found in the same directory as your program.

In all, there are four ways to locate a timing file: the default location described above and the three listed below. LoopReport chooses a timing file according to these rules.

If a timing file is named on the command line, LoopReport uses that file.
```
% loopreport program newtimes > program.loopreport
```

If the command-line option -p is used, LoopReport looks in the directory named by -p for a timing file.
```
% loopreport program -p /home/timingfiles > program.loopreport
```

If the environment variable LVPATH is set, LoopReport looks in that directory for a timing file.
```
% setenv LVPATH /home/timingfiles
% loopreport program > program.loopreport
```
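
The lookup rules above can be pictured as a small selection function. The sketch below is illustrative only: it assumes that the rules are tried in the order listed, with the executable's directory as the fallback, and that a directory rule yields a file named program.looptimes inside that directory. The tools' real precedence is not confirmed here, and choose_timing_file and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: picks a timing file path, trying the rules in the
 * order they are listed. The real tools' precedence is an assumption.
 *   explicit_file      timing file named on the command line, or NULL
 *   p_dir              directory given with -p, or NULL
 *   lvpath             value of LVPATH, or NULL
 *   exe_dir, exe_name  directory and base name of the executable
 * Writes the chosen path into out (size out_len) and returns out. */
char *choose_timing_file(const char *explicit_file, const char *p_dir,
                         const char *lvpath, const char *exe_dir,
                         const char *exe_name, char *out, size_t out_len)
{
    assert(exe_dir != NULL && exe_name != NULL && strlen(exe_name) > 0);
    assert(out != NULL && out_len > 0);

    if (explicit_file != NULL) {            /* rule 1: named explicitly */
        snprintf(out, out_len, "%s", explicit_file);
        return out;
    }

    const char *dir = exe_dir;              /* default: beside the program */
    if (p_dir != NULL)
        dir = p_dir;                        /* rule 2: -p directory */
    else if (lvpath != NULL)
        dir = lvpath;                       /* rule 3: LVPATH directory */

    snprintf(out, out_len, "%s/%s.looptimes", dir, exe_name);
    return out;
}
```

In a real caller, the lvpath argument would typically come from getenv("LVPATH").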

LoopReport writes the table of loop statistics to stdout--the standard output. You can also redirect the output to a file, or pipe it into another command:
```
% loopreport program > program.loopreport
% loopreport program | more
```

Figure 3-4 Sample Loop Report

Fields in the Loop Report

The descriptions below apply equally to LoopTool's "Create Report" output and LoopReport's output.

The loop report contains the following information:

LoopID

An arbitrary number assigned by the compiler at compile time. It is only an internal identifier, useful for referring to loops, and is not otherwise related to your program.

Line #

The line number of the first statement of the loop in the source file.

Par?

Par is short for "Parallelized by the compiler?" Y means that this loop was marked for parallelization; N means that the loop was not.

Hints

Number corresponding to hint text in the "Legend for compiler hints" list.

Entries

Number of times this loop was entered from above. This is distinct from the number of loop iterations, which is the total number of times the loop body executes. For example, consider these two loops in Fortran:
```
do 10 i=1,17
	do 10 j=1,50
		...some code...
	10 continue
```

The first loop is entered once, and it iterates 17 times. The second loop is entered 17 times, and it iterates 17*50 = 850 times.
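
To make the distinction concrete, the following C sketch counts entries and iterations by hand for the same 17-by-50 nest. It only illustrates the two definitions; it is not how -Zlp instruments a program, and the struct and function names are invented for this example.

```c
#include <assert.h>

/* Counters for one loop: how often it was entered, and how many
 * times its body ran in total. */
struct loop_counts {
    long entries;
    long iterations;
};

/* Mirrors the Fortran example: an outer loop of 17 iterations
 * containing an inner loop of 50 iterations. */
void count_nested(struct loop_counts *outer, struct loop_counts *inner)
{
    assert(outer != 0 && inner != 0);
    outer->entries = outer->iterations = 0;
    inner->entries = inner->iterations = 0;

    outer->entries++;                    /* the outer loop is reached once */
    for (int i = 1; i <= 17; i++) {
        outer->iterations++;
        inner->entries++;                /* inner loop entered once per i */
        for (int j = 1; j <= 50; j++) {
            inner->iterations++;         /* ...some code... */
        }
    }
}
```

After a call to count_nested, the outer counters hold 1 entry and 17 iterations, and the inner counters hold 17 entries and 850 iterations.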

Nest

Nesting level of the loop. If a loop is a top-level loop, its nesting level is 0. If the loop is the child of a top-level loop, its nesting level is 1, and so on for deeper loops.

For example, in this C code, the i loop has a nesting level of 0, the j loop has a nesting level of 1, and the k loop has a nesting level of 2.
```
for (i=0; i<17; i++)
	for (j=0; j<42; j++)
		for (k=0; k<1000; k++)
			do something;
```

Wallclock

The total amount of elapsed wallclock time spent executing this loop for the whole program. The elapsed time for an outer loop includes the elapsed time for an inner loop. For example:
```
for (i=1; i<10; i++)
	for (j=1; j<10; j++)
		do something;
```
The time assigned to the outer loop (the i loop) might be 10 seconds, and the time assigned to the inner loop (the j loop) might be 9.9 seconds.
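
The inclusive accounting can be reproduced with manual timers, as in the sketch below. This is hand-written instrumentation for illustration, not the mechanism that -Zlp uses. It assumes a C11 library that provides timespec_get, and the helper names now_ns and time_nest are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* Current wall-clock time in nanoseconds (C11 timespec_get). */
static long long now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

volatile double sink;   /* keeps the compiler from deleting the work */

/* Times a 9x9 loop nest. *outer_ns covers the whole i loop;
 * *inner_ns is the sum of the time spent in every run of the j loop. */
void time_nest(long long *outer_ns, long long *inner_ns)
{
    assert(outer_ns != 0 && inner_ns != 0);
    *inner_ns = 0;
    long long outer_start = now_ns();
    for (int i = 1; i < 10; i++) {
        long long inner_start = now_ns();
        for (int j = 1; j < 10; j++)
            sink += (double)i * j;       /* do something */
        *inner_ns += now_ns() - inner_start;
    }
    *outer_ns = now_ns() - outer_start;
}
```

Because every inner interval lies inside the outer interval, the outer total is never smaller than the inner total, provided the system clock is not adjusted during the run.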

Percentage

The percentage of total program runtime measured as wallclock time spent executing this loop. As with wallclock time, outer loops are credited with time spent in loops they contain.

Variables

The names of the variables that cause a data dependency in this loop. This field appears only when the compiler hint indicates that this loop suffers from a data dependency. A data dependency occurs when a loop cannot be parallelized safely because the values computed in one iteration of the loop are used in another. The following illustrates a data dependency:
```
do i = 1, N
	a(i) = b(i) + c(i)
	b(i) = 2 * a(i + 1)
end do
```
If the example loop above is run in parallel, iteration 1, which recomputes b(1) based on the value of a(2), may run after iteration 2, which has recomputed a(2). The value of b(1) is then determined by the new value of a(2) rather than by the original value, as it would be if the loop were not parallelized.
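
The effect of iteration order can be seen without any threads. The C sketch below runs the same loop body (with 0-based indices) either first to last or last to first; the reversed order is only a crude stand-in for the reordering that parallel execution can cause. The function name run_loop and the sample data are assumptions made for this illustration.

```c
#include <assert.h>

/* Runs the example loop over i = 0 .. n-1 (0-based here).
 * a must hold n+1 elements; b and c hold n elements.
 * If reverse is nonzero the iterations run from last to first, a crude
 * stand-in for the reordering that parallel execution can cause. */
void run_loop(int n, int *a, int *b, const int *c, int reverse)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++) {
        int i = reverse ? n - 1 - k : k;
        a[i] = b[i] + c[i];
        b[i] = 2 * a[i + 1];   /* reads a value another iteration writes */
    }
}
```

With a = {1, 2, 3, 4, 5}, b = {10, 20, 30, 40}, and c set to all ones, the forward order leaves b = {4, 6, 8, 10}, while the reversed order leaves b = {42, 62, 82, 10}. The results differ because each iteration reads an element of a that a later iteration overwrites.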

Compiler Hints

LoopTool and LoopReport present hints about the optimizations applied to a particular loop, and about why a loop might not have been parallelized. The hints are heuristics gathered by the compiler during optimization. They should be understood in that context; they are not absolute facts about the code generated for a given loop. However, the hints are often very useful indications of how you can transform your code so that the compiler can perform more aggressive optimizations, including parallelizing loops.

For some useful explanations and tips, read the sections in the Sun WorkShop Fortran User's Guide that address parallelization.

Table 3-2 lists the hints about optimizations applied to loops.

Table 3-2 Loop Optimization Hints


Hint #	Hint Definition
0	No hint available
1	Loop contains procedure call
2	Compiler generated two versions of this loop
3	The variable(s) "`list`" cause a data dependency in this loop
4	Loop was significantly transformed during optimization
5	Loop may or may not hold enough work to be profitably parallelized
6	Loop was marked by user-inserted pragma, `DOALL`
7	Loop contains multiple exits
8	Loop contains I/O, or other function calls, that are not MT safe
9	Loop contains backward flow of control
10	Loop may have been distributed
11	Two or more loops may have been fused
12	Two or more loops may have been interchanged

0. No Hint Available

None of the other hints applied to this loop. This hint does not mean that none of the other hints might apply; it means that the compiler did not infer any of those hints.

1. Loop contains procedure call

The loop could not be parallelized since it contains a procedure call that is not MT safe. If such a loop were parallelized, multiple copies of the loop might instantiate the function call simultaneously, trample on each other's use of any variables local to that function, or trample on return values, and generally invalidate the function's purpose. If you are certain that the procedure calls in this loop are MT safe, you can direct the compiler to parallelize this loop no matter what by inserting the DOALL pragma before the body of the loop. For example, if foo is an MT-safe function call, then you can force it to be parallelized by inserting c$par DOALL:

c$par DOALL
	do 19 i = 1, n*n
	  do 29 k = i, n*n
	    a(i) = a(i) + x*b(k)*c(i-k)
	    call foo()
29	  continue
19	continue

The compiler interprets the DOALL pragmas only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler ignores the DOALL pragmas.

2. Compiler generated two versions of this loop

The compiler could not tell at compile time if the loop contained enough work to be profitable to parallelize. The compiler generated two versions of the loop, a serial version and a parallel version, and a runtime check that will choose at runtime which version to execute. The runtime check determines the amount of work that the loop has to do by checking the loop iteration values.
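
Written out by hand, this kind of loop versioning might look like the C sketch below. It is an assumption-laden illustration, not the code the compiler generates: the cutoff MIN_PARALLEL_ITERS is a made-up value, and the "parallel" version merely processes chunks one after another so that the example stays portable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cutoff; a real compiler derives its own estimate. */
#define MIN_PARALLEL_ITERS 1000

/* Serial version of the loop. */
static double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Placeholder for the parallel version: the work is split into chunks
 * that could each be handed to a separate thread. Here they simply run
 * one after another so the sketch stays portable. */
static double sum_chunked(const double *x, size_t n, size_t chunks)
{
    assert(chunks > 0);
    double total = 0.0;
    for (size_t c = 0; c < chunks; c++) {
        size_t lo = n * c / chunks;
        size_t hi = n * (c + 1) / chunks;
        total += sum_serial(x + lo, hi - lo);
    }
    return total;
}

/* Runtime check: pick a version from the iteration count.
 * Sets *used_parallel so a caller can see which path ran. */
double sum_versioned(const double *x, size_t n, int *used_parallel)
{
    assert(used_parallel != NULL);
    *used_parallel = (n >= MIN_PARALLEL_ITERS);
    return *used_parallel ? sum_chunked(x, n, 4) : sum_serial(x, n);
}
```

The point of the design is that both versions compute the same result, so the choice made at runtime affects only performance.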

3. The variable(s) "`list`" cause a data dependency in this loop

A variable inside the loop is affected by the value of a variable in a previous iteration of the loop. For example:

do 99 i=1,n
	do 99 j = 1,m
		a(i, j+1) = a(i,j) + a(i,j-1)
99 continue

This is a contrived example, since for such a simple loop the optimizer would simply swap the inner and outer loops, so that the inner loop could be parallelized. But this example demonstrates the concept of data dependency, often referred to as "loop-carried data dependency."

The compiler can often tell you the names of the variables that cause the loop-carried data dependency. If you rearrange your program to remove (or minimize) such dependencies, then the compiler can perform more aggressive optimizations.

4. Loop was significantly transformed during optimization

The compiler performed some optimizations on this loop that might make it almost impossible to associate the generated code with the source code. For this reason, line numbers may be incorrect. Examples of optimizations that can radically alter a loop are loop distribution, loop fusion, and loop interchange (see Hint 10, Hint 11, and Hint 12).

5. Loop may or may not hold enough work to be profitably parallelized

The compiler was not able to determine at compile time whether this loop held enough work to warrant parallelizing. Often loops that are labeled with this hint may also be labeled "parallelized," meaning that the compiler generated two versions of the loop (see Hint 2), and that it will be decided at runtime whether the parallel version or the serial version should be used.

Since all the compiler hints, including the flag that indicates whether or not a loop is parallelized, are generated at compile time, there's no way to be certain that a loop labeled "parallelized" actually executes in parallel.

6. Loop was marked by user-inserted pragma, `DOALL`

This loop was parallelized because the compiler was instructed to do so by the DOALL pragma. This hint is a useful reminder to help you easily identify those loops that you explicitly wanted to parallelize.

The DOALL pragmas are interpreted by the compiler only when you compile with -parallel or -explicitpar; if you compile with -autopar, then the compiler will ignore the DOALL pragmas.

7. Loop contains multiple exits

The loop contains a GOTO or some other branch out of the loop other than the natural loop end point. For this reason, it is not safe to parallelize the loop, since the compiler has no way of predicting the loop's runtime behavior.

8. Loop contains I/O, or other function calls, that are not MT safe

This hint is similar to Hint 1. The difference is that this hint often focuses on I/O that is not multithread-safe, whereas Hint 1 can refer to any sort of multithread-unsafe function call.

9. Loop contains backward flow of control

The loop contains a GOTO or other control flow up and out of the body of the loop. That is, some statement inside the loop appears to the compiler to jump back to some previously executed portion of code. As with the case of a loop that contains multiple exits, this loop is not safe to parallelize.

If you can reduce or minimize backward flows of control, the compiler will be able to perform more aggressive optimizations.

10. Loop may have been distributed

The contents of the loop may have been distributed over several loops. That is, the compiler may have been able to rewrite the body of the loop so that it could be parallelized. However, since this rewriting takes place in the language of the internal representation of the optimizer, it's very difficult to associate the original source code with the rewritten version. For this reason, hints about a distributed loop may refer to line numbers that don't correspond to line numbers in your source code.
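
As a source-level picture of loop distribution in its usual sense, the C sketch below splits one loop into two, so that a statement with a loop-carried dependency no longer blocks an independent statement. The compiler works on its internal representation, so this is only an analogy; the function names and loop bodies are invented for the example.

```c
#include <assert.h>

/* Original loop: one statement carries a dependency from iteration
 * i-1 to iteration i, the other does not. */
void combined(int n, int *a, int *b, const int *c)
{
    assert(n >= 1);
    for (int i = 1; i < n; i++) {
        a[i] = a[i - 1] + c[i];   /* depends on the previous iteration */
        b[i] = 2 * c[i];          /* independent of other iterations */
    }
}

/* After distribution: the dependent statement stays in a serial loop,
 * and the independent statement gets a loop of its own that is a
 * candidate for parallelization. */
void distributed(int n, int *a, int *b, const int *c)
{
    assert(n >= 1);
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + c[i];
    for (int i = 1; i < n; i++)
        b[i] = 2 * c[i];
}
```

Both functions leave identical values in a and b; only the loop structure differs.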

11. Two or more loops may have been fused

Two consecutive loops were combined into one, so the resulting larger loop contains enough work to be profitably parallelized. Again, in this case, source line numbers for the loop may be misleading.

12. Two or more loops may have been interchanged

The loop indices of an inner and an outer loop have been swapped, to move data dependencies as far away from the inner loop as possible, and to enable this nested loop to be parallelized. In the case of deeply nested loops, the interchange may have occurred with more than two loops.
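
A source-level analogy of interchange is sketched below in C, using a recurrence like the one shown for Hint 3. The array sizes and function names are assumptions for illustration, and the compiler's actual transformation happens on its internal representation.

```c
#include <assert.h>

#define ROWS 3
#define COLS 6

/* Original order: the inner j loop carries the dependency, because
 * a[i][j+1] needs a[i][j] and a[i][j-1] from earlier j iterations. */
void original_order(int a[ROWS][COLS])
{
    assert(a != 0);
    for (int i = 0; i < ROWS; i++)
        for (int j = 1; j < COLS - 1; j++)
            a[i][j + 1] = a[i][j] + a[i][j - 1];
}

/* Interchanged: j is now the outer loop, so for a fixed j the inner
 * i iterations touch different rows and do not depend on each other. */
void interchanged(int a[ROWS][COLS])
{
    assert(a != 0);
    for (int j = 1; j < COLS - 1; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j + 1] = a[i][j] + a[i][j - 1];
}
```

Both orders produce the same array contents, because the interchange preserves the order of the dependent j steps within each row.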

How Optimization Affects Loops

As you might infer from the descriptions of the compiler hints, associating optimized code with source code can be tricky. Clearly, you would prefer to see information from the compiler presented to you in a way that relates as directly as possible to your source code. Unfortunately, the compiler optimizer "reads" your program in terms of its internal language, and although it tries to relate that to your source code, it is not always successful.

Some particular optimizations that can cause confusion are described in the following sections.

Inlining

Inlining is an optimization applied only at optimization level -O4 and only for functions contained within one file. That is, if one file contains 17 Fortran functions, 16 of those can be inlined into the first function, and you compile at -O4, then the source code for those 16 functions may be copied into the body of the first function. Then, when further optimizations are applied, it becomes difficult to determine which loop on which source line number was subjected to which optimization.

If the compiler hints seem particularly opaque, consider compiling with -O3 -parallel -Zlp, so that you can see what the compiler says about your loops before it tries to inline any of your functions.

In particular, "phantom" loops--that is, loops that the compiler claims exist, but you know do not exist in your source code--could well be a symptom of inlining.

Loop Transformations--Unrolling, Jamming, Splitting, and Transposing

The compiler performs many loop optimizations that radically change the body of the loop. These optimizations include unrolling, jamming, splitting, and transposing.

LoopTool and LoopReport attempt to provide hints that make as much sense as possible, but given the nature of the problem of associating optimized code with source code, the hints may be misleading.
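
To show how much one of these transformations can change a loop body, the C sketch below unrolls a simple summation loop by hand. The unroll factor of four and the function names are assumptions chosen for illustration; the compiler picks its own factors and works on its internal representation.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop. */
long sum_simple(const int *x, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* The same loop unrolled by four, with a cleanup loop for the
 * leftover iterations when n is not a multiple of four. */
long sum_unrolled(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += (long)x[i] + x[i + 1] + x[i + 2] + x[i + 3];
    for (; i < n; i++)
        s += x[i];
    return s;
}
```

The unrolled version has two loops where the source had one, which is one reason a hint or line number can appear to refer to a loop you did not write.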

Parallel Loops Nested Inside Serial Loops

If a parallel loop is nested inside a serial loop, the runtime information reported by LoopTool and LoopReport may be misleading, because each loop is assigned the wall-clock time of each of its iterations. If an inner loop is parallelized, it is assigned the wall-clock time of each iteration, although some of those iterations are running in parallel.

However, the outer loop is assigned only the runtime of its child, the parallel loop, which will be the runtime of the longest parallel instantiation of the inner loop. This double timing leads to the anomaly of the outer loop apparently consuming less time than the inner loop.
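
The arithmetic behind this anomaly can be modeled in a few lines of C. The sketch below is a simplified model of the accounting described above, not the tools' actual algorithm: it assumes the inner loop is credited with the time of all its iterations added together across threads, while the outer loop is credited with the time of the slowest thread. The function name and sample figures are hypothetical.

```c
#include <assert.h>

/* Model of the reported times for one parallelized inner loop.
 * per_thread_ms[t] is the wall-clock time thread t spent in the loop.
 * *inner_ms: iteration times added up across all threads.
 * *outer_ms: time until the slowest thread finished. */
void reported_times(const double *per_thread_ms, int threads,
                    double *inner_ms, double *outer_ms)
{
    assert(threads > 0);
    *inner_ms = 0.0;
    *outer_ms = 0.0;
    for (int t = 0; t < threads; t++) {
        *inner_ms += per_thread_ms[t];
        if (per_thread_ms[t] > *outer_ms)
            *outer_ms = per_thread_ms[t];
    }
}
```

Under this model, four threads that each spend 25 ms in the inner loop yield 100 ms for the inner loop but only 25 ms for the enclosing serial loop.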