Fortran Programming Guide

Chapter 8 Performance Profiling

This chapter describes how to measure and display program performance. Knowing where a program is spending most of its compute cycles and how efficiently it uses system resources is a prerequisite for performance tuning.

The time Command

The simplest way to gather basic data about program performance and resource utilization is to use the time (1) command or, in csh, the set time command.

Running the program with the time command prints a line of timing information on program termination.


demo% time myprog 
   The Answer is: 543.01 
6.5u 17.1s 1:16 31% 11+21k 354+210io 135pf+0w
demo%

The interpretation is:

user system wallclock resources memory I/O paging

Multiprocessor Interpretation of time Output

Timing results are interpreted in a different way when the program is run in parallel in a multiprocessor environment. Since /bin/time accumulates the user time on different threads, only wall clock time is used.

Since the user time displayed includes the time spent on all the processors, it can be quite large and is not a good measure of performance. A better measure is the real time, which is the wall clock time. This also means that to get an accurate timing of a parallelized program you must run it on a quiet system dedicated to just your program.

The gprof Profiling Command

The gprof(1) command provides a detailed postmortem analysis of program timing at the subprogram level, including how many times a subprogram was called, who called it, whom it called, and how much time was spent in the routine and by the routines it called.

To enable gprof profiling, compile and link the program with the -pg option:


demo% f77 -o Myprog -fast -pg Myprog.f ...
demo% Myprog
demo% gprof Myprog

The program must complete normally for gprof to obtain meaningful timing information.

At program termination, the file gmon.out is automatically written in the working directory. This file contains the profiling data that will be interpreted by gprof.

Invoking gprof produces a report on standard output. An example is shown on the next pages. Not only the routines in your program are listed but also the library procedures and the routines they call.

The report is mostly two profiles of how the total time is distributed across the program procedures: the call graph and the flat profile. They are preceded by an explanation of the column labels, followed by an index. (The gprof -b option eliminates the explanatory text; see the gprof(1) man page for other options that can be used to limit the amount of output generated.)

In the graph profile, each procedure (subprogram, procedure) is presented in a call-tree representation. The line representing a procedure in its call-tree is called the function line, and is identified by an index number in the leftmost column, within square brackets; the lines above it are the parent lines; the lines below it, the descendant lines.


---------------------------------------------------------------
				parent line									caller 1
				parent line									caller 2
					....
[n]   time				function line									function name
				descendant line									called 1
				descendant line									called 2
					....
---------------------------------------------------------------

The call graph profile is followed by a flat profile that provides a routine-by-routine overview. An (edited) example of gprof output follows.


Note -

User-defined functions appear with their Fortran names followed by an underscore. Library routines appear with leading underscores.


The call graph profile:


granularity: each sample hit covers 2 byte(s) for 0.08% of 12.78 seconds

                                  called/total       parents 
index  %time    self descendents  called+self    name          index
                                  called/total       children

                0.00       12.66       1/1           main [1]
[3]     99.1    0.00       12.66       1         MAIN_ [3]
                0.92       10.99    1000/1000        diffr_ [4]
                0.62        0.00    2000/2001        code_ [9]
                0.11        0.00    1000/1000        shock_ [11]
                0.02        0.00    1000/1000        bndry_ [14]
                0.00        0.00       1/1           init_ [24]
                0.00        0.00       2/2           output_ [40]
                0.00        0.00       1/1           input_ [47]

-----------------------------------------------

                0.92       10.99    1000/1000        MAIN_ [3]
[4]     93.2    0.92       10.99    1000         diffr_ [4]
                1.11        4.52    3000/3000        deriv_ [7]
                1.29        2.91    3000/6000        cheb1_ [5]
                1.17        0.00    3000/3000        dissip_ [8]

-----------------------------------------------

                1.29        2.91    3000/6000        deriv_ [7]
                1.29        2.91    3000/6000        diffr_ [4]
[5]     65.7    2.58        5.81    6000         cheb1_ [5]
                5.81        0.00    6000/6000        fftb_ [6]
                0.00        0.00     128/321         cos [21]
                0.00        0.00     128/192         __sin [279]

-----------------------------------------------

                5.81        0.00    6000/6000        cheb1_ [5]
[6]     45.5    5.81        0.00    6000         fftb_ [6]
                0.00        0.00      64/321         cos [21]
                0.00        0.00      64/192         __sin [279]

-----------------------------------------------
...

The flat profile overview:


granularity: each sample hit covers 2 byte(s) for 0.08% of 12.84 seconds

   %  cumulative    self              self    total          
 time   seconds   seconds    calls  ms/call  ms/call name    
 45.2       5.81     5.81    6000     0.97     0.97  fftb_   [6]
 20.1       8.39     2.5     6000     0.43     1.40  cheb1_  [5]
  9.1       9.56     1.17    3000     0.39     0.39  dissip_ [8]
  8.6      10.67     1.11    3000     0.37     1.88  deriv_  [7]
  7.1      11.58     0.92    1000     0.92    11.91  diffr_  [4]
  4.8      12.20     0.62    2001     0.31     0.31  code_   [9]
  2.5      12.53     0.33    69000     0.00     0.00  __exp  [10]
  0.9      12.64     0.11     1000     0.11     0.11  shock_ [11]
...

The function line [5] in the preceding example reveals that:

Overhead Considerations

Profiling (compiling with the -pg option) may greatly increase the running time of a program. This is due to the extra overhead required to clock program performance and subprogram calls. Profiling tools like gprof attempt to subtract an approximate overhead factor when computing relative runtime percentages. All other timings shown may not be accurate due to UNIX and hardware timekeeping inaccuracies.

Programs with short execution times are the most difficult to profile because the overhead may be a significant fraction of the total execution time. The best practice is to choose input data for the profiling run that will result in a realistic test of the program's performance. If this is not possible, consider enclosing the main computational part of the program within a loop that effectively runs the program N times. Estimate actual performance by dividing the profile results by N.

The Fortran library includes two routines that return the total time used by the calling process. See the man pages for dtime(3F) and etime(3F).

The tcov Profiling Command

The tcov(1) command, when used with programs compiled with the -a, -xa, or -xprofile=tcov options, produces a statement-by-statement profile of the source code showing which statements executed and how often. It also gives a summary of information about the basic block structure of the program.

There are two implementations of tcov coverage analysis. The original tcov is invoked by the -a or -xa compiler options. Enhanced statement level coverage is invoked by the -xprofile=tcov compiler option and the -x tcov option. In either case, the output is a copy of the source files annotated with statement execution counts in the margin. Although these two versions of tcov are essentially the same as far as the Fortran user is concerned (most of the enhancements apply to C++ programs), there will be some performance improvement with the newer style.

"Old Style" tcov Coverage Analysis

Compile the program with the -a (or -xa) option. This produces the file $TCOVDIR/file.d for each source .f file in the compilation. (If environment variable $TCOVDIR is not set at compile time, the .d files are stored in the current directory.)

Run the program (execution must complete normally). This produces updated information in the .d files. To view the coverage analysis merged with the individual source files, run tcov on the source files. The annotated source files are named $TCOVDIR/file.tcov for each source file.

The output produced by tcov shows the number of times each statement was actually executed. Statements that were not executed are marked with ####-> to the left of the statement.

Here is a simple example:


demo% f77 -a -o onetwo -silent one.f two.f
demo% onetwo
       ... output from program
demo% tcov one.f two.f
demo% cat one.tcov two.tcov
                       program one
      1 ->             do i=1,10
     10 ->                   call two(i)
                       end do
      1 ->             end

             Top 10 Blocks
             Line         Count
                3            10
                2             1
                5             1

          3       Basic blocks in this file
          3       Basic blocks executed
      100.00       Percent of the file executed
           12       Total basic block executions
         4.00       Average executions per basic block

                       subroutine two(i)
     10 ->             print*, "two called", i
                       return
                       end

             Top 10 Blocks
             Line         Count
                2            10

          1       Basic blocks in this file
          1       Basic blocks executed
      100.00       Percent of the file executed
           10       Total basic block executions
        10.00       Average executions per basic block
demo%

"New Style" Enhanced tcov Analysis

To use new style tcov, compile with -xprofile=tcov. When the program is run, coverage data is stored in program.profile/tcovd, where program is the name of the executable file. (If the executable were a.out, a.out.profile/tcovd would be created.)

Run tcov -x dirname source_files to create the coverage analysis merged with each source file. The report is written to file.tcov in the current directory.

Running a simple example:


demo% f77 -o onetwo -silent -xprofile=tcov one.f two.f
demo% onetwo
       ... output from program
demo% tcov -x onetwo.profile one.f two.f
demo% cat one.f.tcov two.f.tcov
                       program one
      1 ->             do i=1,10
     10 ->                   call two(i)
                       end do
      1 ->             end
       .....etc
demo%

Environment variables $SUN_PROFDATA and $SUN_PROFDATA_DIR can be used to specify where the intermediary data collection files are kept. These are the *.d and tcovd files created by old and new style tcov, respectively.

Each subsequent run accumulates more coverage data into the tcovd file. Data for each object file is zeroed out the first time the program is executed after the corresponding source file has been recompiled. Data for the entire program is zeroed out by removing the tcovd file.

These environment variables can be used to separate the collected data from different runs. With these variables set, the running program writes execution data to the files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/.

Similarly, the directory that tcov reads is specified by tcov -x $SUN_PROFDATA. If $SUN_PROFDATA_DIR is set, tcov will prepend it, looking for files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/, and not in the working directory.

For the details, see the tcov(1) man page.

f77 I/O Profiling

You can obtain a report about how much data was transferred by your program. For each Fortran unit, the report shows the file name, the number of I/O statements, the number of bytes, and some statistics on these items.

To obtain an I/O profiling report, insert calls to the library routines start_iostats and end_iostats around the parts of the program you wish to measure. (A call to end_iostats is required if the program terminates with an END or STOP statement rather than a CALL EXIT.)


Note -

The I/O statements profiled are: READ, WRITE, PRINT, OPEN, CLOSE, INQUIRE, BACKSPACE, ENDFILE, and REWIND. The runtime system opens stdin, stdout, and stderr before the first executable statement of your program, so you must explicitly reopen these units after the call to start_iostats.


Example: Profile stdin, stdout, and stderr:


      EXTERNAL start_iostats 
      ... 
      CALL start_iostats
      OPEN(5)
      OPEN(6)
      OPEN(0)

If you want to measure only part of the program, call end_iostats to stop the process. A call to end_iostats may also be required if your program terminates with an END or STOP statement rather than CALL EXIT.

The program must be compiled with the -pg option. When the program terminates, the I/O profile report is produced on the file name.io_stats, where name is the name of the executable file.

Here is an example:


demo% f77 -o myprog -pg -silent myprog.f
demo% myprog
   ... output from program
demo% cat myprog.io_stats
                   INPUT REPORT
1. unit    2. file name             3. input data             4. map
                           cnt     total       avg   std dev    (cnt)
------------------------------------------------------------------------
      0          stderr      0         0         0         0    No
                             0         0         0         0
      5           stdin      2         8         4         0    No
                             1         8         8         0
      6          stdout      0         0         0         0    No
                             0         0         0         0
     19         fort.19      8        48         6     4.276    No
                             4        48        12         0
     20         fort.20      8        48         6     4.276    No
                             4        48        12         0
     21         fort.21      8        48         6     4.276    No
                             4        48        12         0
     22         fort.22      8        48         6     4.276    No
                             4        48        12         0

                   OUTPUT REPORT
1. unit             5. output data            6. blk size  7. fmt  8. direct
           cnt     total       avg   std dev                        (rec len)
-----------------------------------------------------------------------------
      0      4        40        10         0           -1     Yes     seq
             1        40        40         0
      5      0         0         0         0           -1     Yes     seq
             0         0         0         0
      6     26       248     9.538      1.63           -1     Yes     seq
             6       248     41.33     3.266
     19      8        48         6     4.276       500548     Yes     seq
             4        48        12         0
     20      8        48         6     4.276       503116      No     seq
             4        48        12         0
     21      8        48         6     4.276       503116     Yes     dir
             4        48        12         0                        (   12)
     22      8        48         6     4.276       503116      No     dir
             4        48        12         0                        (   12)
     ...

Each pair of lines in the report displays information about an I/O unit. One section shows input operations and another shows output. The first line of a pair displays statistics on the number of data elements transferred before the unit was closed. The second row of statistics is based on the number of I/O statements processed.

In the example, there were 6 calls to write a total of 26 data elements to standard output. A total of 248 bytes was transferred. The display also shows the average and standard deviation in bytes transferred per I/O statement (9.538 and 1.63, respectively), and the average and standard deviation per I/O statement call (42.33 and 3.266, respectively).

The input report also contains a column to indicate whether a unit was memory mapped or not. If mapped, the number of mmap() calls is recorded in parentheses in the second row of the pair.

The output report indicates block sizes, formatting, and access type. A file opened for direct access shows its defined record length in parentheses in the second row of the pair.


Note -

Compiling with environment variable LD_LIBRARY_PATH set might disable I/O profiling, which relies on its profiling I/O library being in a standard location.