Sun Studio 12: Fortran Programming Guide

Chapter 8 Performance Profiling

This chapter describes how to measure and display program performance. Knowing where a program is spending most of its compute cycles and how efficiently it uses system resources is a prerequisite for performance tuning.

8.1 Sun Studio Performance Analyzer

Developing high performance applications requires a combination of compiler features, libraries of optimized routines, and tools for performance analysis.

Sun Studio software provides a sophisticated pair of tools for collecting and analyzing program performance data: the Collector, which gathers the data, and the Performance Analyzer, which displays it.

These two tools help to answer the following kinds of questions:

The main window of the Performance Analyzer displays a list of the program's functions, with exclusive and inclusive metrics for each function. The list can be filtered by load object, by thread, by light-weight process (LWP), and by time slice. For a selected function, a subsidiary window displays the callers and callees of the function. This window can be used to navigate the call tree, for example in search of high metric values. Two more windows display the source code, annotated line by line with performance metrics and interleaved with compiler commentary, and the disassembly code, annotated with metrics for each instruction. If available, source code and compiler commentary are interleaved with the instructions in the disassembly.

The Collector and Analyzer are designed for use by any software developer, even if performance tuning is not the developer's main responsibility. They provide a more flexible, detailed, and accurate analysis than the commonly used profiling tools prof and gprof, and they are not subject to the attribution error that affects gprof.

Command-line equivalents of the Collector and Analyzer are available. Details can be found in the Sun Studio Program Performance Analysis Tools manual.

8.2 The time Command

The simplest way to gather basic data about program performance and resource utilization is to use the time(1) command or, in csh, the set time command.

Running the program with the time command prints a line of timing information on program termination.

demo% time myprog
   The Answer is: 543.01
6.5u 17.1s 1:16 31% 11+21k 354+210io 135pf+0w

The interpretation is:

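user   system   wallclock   resources   memory   I/O         paging
6.5u   17.1s    1:16        31%         11+21k   354+210io   135pf+0w

6.5 seconds in user code, approximately
17.1 seconds in system code, approximately
1 minute 16 seconds of wall clock time
31% resources: CPU time (user plus system) as a share of wall clock time
11+21k memory: average shared plus unshared memory use, in kilobytes
354 input and 210 output operations
135 page faults, 0 swaps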
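The fields of the timing line can also be pulled apart with standard tools. The following is a minimal sketch in POSIX sh and awk, assuming the seven-field csh layout shown above. The function name parse_time_line is hypothetical and is not part of Sun Studio. The percentage is recomputed only as a cross-check of the reported value.

```shell
#!/bin/sh
# parse_time_line: hypothetical helper, not part of Sun Studio.
# Splits one csh-style timing line into labeled fields and recomputes
# the CPU percentage as (user + system) / wall clock.
parse_time_line() {
    echo "$1" | awk '{
        user = $1; sub(/u$/, "", user)       # "6.5u"  -> 6.5
        sys  = $2; sub(/s$/, "", sys)        # "17.1s" -> 17.1
        n = split($3, t, ":")                # "1:16"  -> 76 seconds
        wall = 0
        for (i = 1; i <= n; i++) wall = wall * 60 + t[i]
        printf "user=%s system=%s wall=%d reported=%s computed=%d%%\n",
               user, sys, wall, $4, int(100 * (user + sys) / wall)
    }'
}

parse_time_line "6.5u 17.1s 1:16 31% 11+21k 354+210io 135pf+0w"
```

For the sample line, the recomputed figure is (6.5 + 17.1) / 76, which truncates to 31% and agrees with the reported value.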

8.2.1 Multiprocessor Interpretation of time Output

Timing results are interpreted differently when the program runs in parallel in a multiprocessor environment. Because /bin/time accumulates the user time across all threads, the user time displayed includes the time spent on all the processors. It can be quite large and is not a good measure of performance.

A better measure is the real time, which is the wall clock time. This also means that to get an accurate timing of a parallelized program you must run it on a quiet system dedicated to just your program.
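One way to see the effect is to compare accumulated CPU time with wall clock time. The following is a minimal sketch in POSIX sh and awk; the function name cpu_wall_ratio and the timings passed to it are hypothetical and are not measured data.

```shell
#!/bin/sh
# cpu_wall_ratio: hypothetical helper; arguments are user, system, and
# real (wall clock) times in seconds. Prints (user + system) / real.
cpu_wall_ratio() {
    awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.2f\n", (u + s) / r }'
}

# Hypothetical timings for a parallel run, not measured data.
cpu_wall_ratio 36.0 1.5 12.5
```

A ratio above 1 means that, on average, more than one processor was busy during the run. It indicates how the CPU time was accumulated and is not itself a measure of speedup, which is why the wall clock time remains the figure to compare between runs.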

8.3 The tcov Profiling Command

The tcov(1) command, when used with programs compiled with the -xprofile=tcov option, produces a statement-by-statement profile of the source code showing which statements executed and how often. It also gives a summary of information about the basic block structure of the program.

Enhanced statement level coverage is invoked by the -xprofile=tcov compiler option and the tcov -x option. The output is a copy of the source files annotated with statement execution counts in the margin.

Note – The code coverage report produced by tcov is unreliable if the compiler has inlined calls to routines. The compiler inlines calls where appropriate at optimization levels above -O3, and as directed by the -inline option. Inlining replaces a call to a routine with the code of the called routine. Because no call remains, tcov does not report references to the inlined routines. To get an accurate coverage report, do not enable compiler inlining.

8.3.1 Enhanced tcov Analysis

To use tcov, compile with -xprofile=tcov. When the program is run, coverage data is stored in program.profile/tcovd, where program is the name of the executable file. (If the executable were a.out, a.out.profile/tcovd would be created.)

Run tcov -x dirname source_files to create the coverage analysis merged with each source file. For each source file, the report is written to file.tcov in the current directory, where file is the name of the source file.

Running a simple example:

demo% f95 -o onetwo -xprofile=tcov one.f two.f
demo% onetwo
       ... output from program
demo% tcov -x onetwo.profile one.f two.f
demo% cat one.f.tcov two.f.tcov
                       program one
      1 ->             do i=1,10
     10 ->                   call two(i)
                       end do
      1 ->             end
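An annotated listing like the one above can be scanned mechanically. The following is a minimal sketch in POSIX sh and awk, assuming the margin format shown in this example (an execution count, then ->, then the statement). The function name hottest_statement is hypothetical.

```shell
#!/bin/sh
# hottest_statement: hypothetical helper that reads a tcov-annotated
# listing on standard input and prints the highest execution count
# followed by the statement it belongs to.
hottest_statement() {
    awk '$1 ~ /^[0-9]+$/ && $2 == "->" {
        if ($1 + 0 > max) {
            max = $1 + 0
            stmt = $0
            sub(/^[ \t]*[0-9]+[ \t]*->[ \t]*/, "", stmt)   # drop the margin
        }
    }
    END { if (max > 0) print max, stmt }'
}

hottest_statement <<'EOF'
                       program one
      1 ->             do i=1,10
     10 ->                   call two(i)
                       end do
      1 ->             end
EOF
```

For this listing the sketch reports the call to two inside the loop, which executed 10 times.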

Environment variables $SUN_PROFDATA and $SUN_PROFDATA_DIR can be used to specify where the intermediary data collection files are kept. These are the *.d and tcovd files created by old and new style tcov, respectively.

These environment variables can be used to separate the collected data from different runs. With these variables set, the running program writes execution data to the files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/.

Similarly, the directory that tcov reads is specified by tcov -x $SUN_PROFDATA. If $SUN_PROFDATA_DIR is set, tcov will prepend it, looking for files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/, and not in the working directory.
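The lookup rule described above can be sketched as follows. This is a minimal illustration in POSIX sh and not the implementation used by tcov. The function name profdata_dir, the directory /tmp/coverage, and the run name run1 are all hypothetical.

```shell
#!/bin/sh
# profdata_dir: hypothetical helper that prints the directory tcov would
# be pointed at, following the rule described in the text. The argument
# is the name given to "tcov -x".
profdata_dir() {
    if [ -n "${SUN_PROFDATA_DIR:-}" ]; then
        printf '%s/%s\n' "$SUN_PROFDATA_DIR" "$1"   # prefix is prepended
    else
        printf '%s\n' "$1"                          # relative to the working directory
    fi
}

# Hypothetical directory and run names, for illustration only.
SUN_PROFDATA_DIR=/tmp/coverage
SUN_PROFDATA=run1
profdata_dir "$SUN_PROFDATA"
```

Choosing a different $SUN_PROFDATA value for each run keeps the data from those runs in separate directories under the same $SUN_PROFDATA_DIR.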

Each subsequent run accumulates more coverage data into the tcovd file. Data for each object file is zeroed out the first time the program is executed after the corresponding source file has been recompiled. Data for the entire program is zeroed out by removing the tcovd file.
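Resetting the accumulated data can be scripted. The following is a minimal sketch in POSIX sh, assuming the default data location program.profile/tcovd described earlier. The function name reset_coverage is hypothetical.

```shell
#!/bin/sh
# reset_coverage: hypothetical helper that discards the accumulated
# coverage data for one executable by removing its tcovd file.
# Assumes the default location <program>.profile/tcovd.
reset_coverage() {
    rm -f "$1.profile/tcovd"
}

# Example: make the next run of onetwo start from zero counts.
reset_coverage onetwo
```

Only the tcovd file is removed; the profile directory itself is left in place.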

For the details, see the tcov(1) man page.