CHAPTER 8

Performance Profiling

This chapter describes how to measure and display program performance. Knowing where a program is spending most of its compute cycles and how efficiently it uses system resources is a prerequisite for performance tuning.


8.1 Sun Studio Performance Analyzer

Developing high performance applications requires a combination of compiler features, libraries of optimized routines, and tools for performance analysis.

Sun Studio software provides a sophisticated pair of tools for collecting and analyzing program performance data: the Collector, which gathers performance data while the program runs, and the Performance Analyzer, which displays and examines the data the Collector records.

The Performance Analyzer can also help you fine-tune your application's performance by creating a mapfile you can use to improve the order of function loading in the application's address space.

These two tools help to answer questions such as: Which functions or load objects consume the most resources? Which source lines and instructions are responsible for that consumption? How did the program arrive at a particular point in its execution?

The main window of the Performance Analyzer displays a list of functions for the program, with exclusive and inclusive metrics for each function. The list can be filtered by load object, by thread, by lightweight process (LWP), and by time slice. For a selected function, a subsidiary window displays the callers and callees of the function; this window can be used to navigate the call tree, for example to search for high metric values. Two further windows display source code annotated line by line with performance metrics and interleaved with compiler commentary, and disassembly code annotated with metrics for each instruction. In the disassembly display, source code and compiler commentary are interleaved with the instructions where available.

The Collector and Analyzer are designed for use by any software developer, even if performance tuning is not the developer's main responsibility. They provide a more flexible, detailed, and accurate analysis than the commonly used profiling tools prof and gprof, and are not subject to the attribution errors that affect gprof.

Command-line equivalents of the Collector and Analyzer are also available: the collect(1) command gathers performance data into an experiment, and the er_print(1) utility displays the recorded data in text form.
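
For example, a minimal command-line session might look like the following sketch. The program name myprog is illustrative, and test.1.er is the default name collect gives to the first recorded experiment:

demo% f95 -g -o myprog myprog.f
demo% collect myprog
       ... program runs and performance data is recorded in test.1.er
demo% er_print -functions test.1.er
demo% analyzer test.1.er

The er_print command prints the function list with its metrics as text, while analyzer opens the same experiment in the graphical Performance Analyzer.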

Details can be found in the Sun Studio Program Performance Analysis Tools manual.


8.2 The time Command

The simplest way to gather basic data about program performance and resource utilization is to use the time(1) command or, in csh, the set time command.

Running the program with the time command prints a line of timing information on program termination.


demo% time myprog 
   The Answer is: 543.01 
6.5u 17.1s 1:16 31% 11+21k 354+210io 135pf+0w
demo%

The interpretation is:

   6.5u         user: user CPU time (seconds)
   17.1s        system: system CPU time (seconds)
   1:16         wallclock: elapsed real time (minutes:seconds)
   31%          resources: CPU time as a percentage of wallclock time
   11+21k       memory: average shared + unshared memory usage (Kbytes)
   354+210io    I/O: block input + output operations
   135pf+0w     paging: page faults + swaps

8.2.1 Multiprocessor Interpretation of time Output

Timing results are interpreted differently when the program is run in parallel in a multiprocessor environment. Because /bin/time accumulates the user time spent on all the threads, and therefore on all the processors, the user time it reports can be quite large and is not a good measure of performance. A better measure is the real time, which is the wall clock time. This also means that to get an accurate timing of a parallelized program you must run it on a quiet system dedicated to just your program.
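
For example, timing a program that has been parallelized to run on four processors might look like the following sketch (the PARALLEL setting assumes an automatically parallelized program; the program name and the output values are illustrative only):

demo% setenv PARALLEL 4
demo% /bin/time myprog
       ... output from program

real       40.2
user      150.8
sys         1.4

The user time (roughly 150 seconds accumulated over four processors) greatly exceeds the real time of about 40 seconds, so the real time is the figure to use when judging parallel performance.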


8.3 The tcov Profiling Command

The tcov(1) command, when used with programs compiled with the -xprofile=tcov option, produces a statement-by-statement profile of the source code showing which statements executed and how often. It also gives a summary of information about the basic block structure of the program.

Enhanced statement-level coverage is invoked with the -xprofile=tcov compiler option and the tcov -x option. The output is a copy of the source files annotated in the margin with statement execution counts.



Note - The code coverage report produced by tcov will be unreliable if the compiler has inlined calls to routines. The compiler inlines calls whenever appropriate at optimization levels above -O3, and according to the -inline option. With inlining, the compiler replaces a call to a routine with the actual code for the called routine. Since there is no longer a call, references to those inlined routines are not reported by tcov. Therefore, to get an accurate coverage report, do not enable compiler inlining.
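
For example, one way to keep the coverage report accurate is to request no more than -O3 optimization and omit the -inline option when compiling for coverage analysis (a sketch, using the two-file program from the next section):

demo% f95 -O3 -xprofile=tcov -o onetwo one.f two.f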



8.3.1 Enhanced tcov Analysis

To use tcov, compile with -xprofile=tcov. When the program is run, coverage data is stored in program.profile/tcovd, where program is the name of the executable file. (If the executable were a.out, a.out.profile/tcovd would be created.)

Run tcov -x dirname source_files to create the coverage analysis merged with each source file. The report is written to file.tcov in the current directory.

Running a simple example:


demo% f95 -o onetwo -xprofile=tcov one.f two.f
demo% onetwo
       ... output from program
demo% tcov -x onetwo.profile one.f two.f
demo% cat one.f.tcov two.f.tcov
                       program one
      1 ->             do i=1,10
     10 ->                   call two(i)
                       end do
      1 ->             end
       .....etc
demo%

The environment variables $SUN_PROFDATA and $SUN_PROFDATA_DIR can be used to specify where the intermediate data collection files are kept. These are the *.d and tcovd files created by old-style and new-style tcov, respectively.

These environment variables can be used to separate the collected data from different runs. With these variables set, the running program writes execution data to the files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/.

Similarly, the directory that tcov reads is specified by tcov -x $SUN_PROFDATA. If $SUN_PROFDATA_DIR is set, tcov will prepend it, looking for files in $SUN_PROFDATA_DIR/$SUN_PROFDATA/, and not in the working directory.
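
For example, the following sketch collects data for a run into a separate directory and then points tcov at it (the directory and file names are illustrative; csh setenv syntax is shown):

demo% setenv SUN_PROFDATA_DIR /scratch/coverage
demo% setenv SUN_PROFDATA run1.profile
demo% onetwo
       ... coverage data is written to /scratch/coverage/run1.profile/tcovd
demo% tcov -x $SUN_PROFDATA one.f two.f
demo% cat one.f.tcov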

Each subsequent run accumulates more coverage data into the tcovd file. Data for each object file is zeroed out the first time the program is executed after the corresponding source file has been recompiled. Data for the entire program is zeroed out by removing the tcovd file.
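
For example, assuming the onetwo program from above with its default profile directory, two runs accumulate their counts, and removing the tcovd file starts the counts from zero again:

demo% onetwo
demo% onetwo
demo% tcov -x onetwo.profile one.f two.f
       ... counts in one.f.tcov and two.f.tcov reflect both runs
demo% rm onetwo.profile/tcovd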

For the details, see the tcov(1) man page.