ChorusOS 4.0 Introduction

Chapter 11 Performance Profiling

This chapter explains how to analyze the performance of a ChorusOS operating system and its applications by generating a performance profile report.

"Introduction to Performance Profiling" explains why a performance profile is useful and how it can be used.
"Preparing to Create a Performance Profile" explains how to configure your system so that you can generate a performance profile.
"Running a Performance Profiling Session" explains how to create a performance profile.
"Analyzing Performance Profiling Reports" explains how to analyze the performance profile.
"Performance Profiler Description" gives more information on how the performance profiler works.

Introduction to Performance Profiling

The ChorusOS operating system performance profiling system contains a set of tools that facilitate the analysis and optimization of the performance of the ChorusOS operating system and applications. These tools concern only system components sharing the system address space, that is, the ChorusOS operating system components and supervisor application actors. This set of tools is composed of a profiling server, libraries for building profiled actors, a target controller and a host utility.

Software performance profiling consists of collecting data about the dynamic behavior of the software, to gain knowledge of the time distribution within the software. For example, the performance profiling system is able to report the time spent within each procedure, as well as providing a dynamically constructed call graph.

The typical steps of an optimization project are:

To bench a set of typical applications, using the ChorusOS operating system and applications at peak performance. The selection of these applications is critical, as the system will eventually be tuned for this type of application.
To evaluate and record the output of the benchmarks.
To use the performance profiling system to collect raw data about the dynamic behavior of the applications.
To generate, evaluate and record the performance profiling reports.
To plan and implement optimizations such as rewriting certain time-critical routines in assembly language, using in-line functions or tuning algorithms.

The performance profiling tools provide two different classes of service, depending on the way in which the software being measured has been prepared:

The performance profiling system is applied to software generated in the standard way, (the same version as used for benchmarking). In this case, the performance profiling reports only minimal information, which consists mainly of the percentage of time spent in each routine of the software. The corresponding performance profiling report is called simple form.
The performance profiling system is applied to software regenerated exclusively for performance profiling; software is completely recompiled, using the performance profiling C compiler option (usually the -p option). This allows the performance profiling system to report much more information by dynamically counting routine invocations and building a complete call graph. The corresponding performance profiling report is called full form.

Note -

The standard (binary) version of the ChorusOS operating system is not compiled with the performance profiling option: profiling the system will only generate a simple form. Non-profiled components (or components for which a simple report form is sufficient) do not need to be compiled with the performance profiling option.

In order to obtain a full form for ChorusOS operating system components, a source product distribution is needed. In this case, it is necessary to regenerate the system components with the performance profiling option set.

Preparing to Create a Performance Profile

Configuring the System

In order to perform system performance profiling using the ChorusOS Profiler, a ChorusOS target system must include the ACTOR_EXTENDED_MNGT and NFS_CLIENT feature options.

Launch the performance profiling server (the PROF actor) dynamically, using:

% rsh -n target arun PROF &

Compiling the Application

If you require full report forms, the profiled components must be compiled using the performance profiling compiler options (usually, the -p option).

If you are using the imake environment provided with the ChorusOS operating system, you can set the profiling option in the Project.tmpl file if you want to profile the whole project hierarchy, or in each Imakefile of the directories that you want to profile if you want to profile only a subset of your project hierarchy. In either case, add the following line:

PROF=$(PROF_ON)

You can also add the performance profiling option dynamically by calling make with the compiler profiling option:

% make PROF=-p

in the directory of the program that is to be performance profiled.

Launching the Performance Profiled Application

In this section, it is assumed that the application consists of a single supervisor actor, the_actor, it is also assumed that the target system is named trumpet, and that the target tree is mounted under the $CHORUS_ROOT host directory.

In order to be performance profiled, an application may be either:

launched at system boot time, as part of the system image, or
dynamically launched using the arun service, using the -k option, with the following command:
% rsh trumpet arun -k "the_actor"

Running a Performance Profiling Session

Starting the Performance Profiling Session

Performance profiling is initiated by running the profctl utility on the target system, using the -start option. This utility (see "Security" for more details) considers the components to be profiled as arguments.

If the_actor was part of the system image:

% rsh trumpet arun profctl -start -b the_actor

Otherwise, if the_actor was loaded dynamically:

% rsh trumpet arun profctl -start -a the_actor aid

where aid is the numeric identifier of the actor (as returned by the arun or aps commands).

Note -

Several components may be specified to the profctl utility. See "Security" for more details.

Run the application.

Stopping the Performance Profiling Session

Performance profiling is stopped by running the profctl utility again, using the -stop option:

% rsh trumpet arun profctl -stop

When performance profiling is stopped, a raw data file is generated for each profiled component within the /tmp directory of the target file system. The name of the file consists of the component name, to which the suffix .prof is added. For example, if only the_actor was profiled, the file $CHORUSUS_ROOT/tmp/the_actor.prof would be created.

Generating Performance Profiling Reports

Performance profiling reports are generated by the profrpg host utility (see "Security" for details on reporting options).

Use the report generator to produce a report for each profiled component; as follows:

% cd $CHORUSUS_ROOT/tmp

% profrpg the_actor > the_actor.rpg

In order to track the benefits of optimization, the reports should be archived.

Analyzing Performance Profiling Reports

Performance profiling is applied to a user-selected set of components (ChorusOS operating system kernel, supervisor actors). The result of the performance profiling consists of a set of reports, one per profiled component.

A performance profiling report consists of two parts:

A global report that provides general information about the profiling session, including clock attributes, CPU attributes, and the distribution of CPU time between idle threads, user actors, non-profiled supervisor components, and each of the profiled supervisor components.
A component-based function table that indicates the distribution of CPU time inside the profiled component.

For each function, the performance profile report displays the following information:

Function header.
- Function number. This field indicates the function number in the current report. It is provided in order to facilitate study of the report using a text editor.
- Function name. This field indicates the name of the function.
- Size. This field indicates the size of the function in bytes.
- Time spent in function. This field indicates the flat time spent in the body of the function (the number of profiling ticks that occurred while an instruction was executed within the function). This value is followed by the percentage of the total component time it represents. This is the most valuable information and the report can be sorted by this key if desired.
- Total time spent in function. This field indicates the aggregated-time spent within the function and called functions. The value is given as a percentage of total actor time. By default, the report generator sorts the table by this key. This field is computed by the report generator, and assumes that each call to a given routine lasts the same amount of time. This information is only provided in the full profiling form. In the simple form, this information is the same as the flat-time information.
- Recursion indicator. This field indicates that the procedure was found in a recursive loop. As the profiling system is not fully set up for multithreading, this indicator might be erroneously set. This information is only provided in the full profiling form.
Call graph description.
- List of callers. This field details a list of the functions calling the profiled function. For each caller, the report provides:
  - the caller's function number
  - the number of calls
  - the caller's name and the offset of the call in the caller's body. When a function calls another function from several locations, several entries are made in the list of callers
- List of called functions. For each called function, the report provides:
  - the callee's function number
  - the number of calls
  - the percentage of the total function time that is charged to the callee
  - the name of the function

Shown below is an example of a profiling report.

  overhead=2.468
  memcpy 4 K=18.834
  memcpy 16 K=51.936
  memcpy 64 K=185.579
  memcpy 256 K=801.300
  sysTime=2.576
  threadSelf=2.210
  thread switch=5.777
  threadCreate (active)=8.062
  threadCreate (active, preempt)=10.071
  threadPriority (self)=3.789
  threadPriority (self, high)=3.195
  threadResume (preempt)=6.999
  threadResume (awake)=4.014
 ...
  ipcCall (null, timeout)=35.732
  ipcSend (null, funcmode)=7.723
  ipcCall (null, funcmode)=31.762
  ipcSend (null, funcumode)=7.924
  ipcCall (null, funcumode)=31.864
  ipcSend (annex)=8.294
  ipcReceive (annex)=7.086
  ipcCall (annex)=33.708
  ipcSend (body 4b)=8.020
  ipcReceive (body 4b)=6.822
  ipcCall (body 4b)=32.558
  ipcSend (annex, body 4b)=8.684
  ipcReceive (annex, body 4b)=7.495
  ipcCall (annex, body 4b)=34.849

Performance Profiler Description

This section provides information about the performance profiling system's design, to help you understand the sequence of events that occurs before the generation of a performance profiling report.

The performance profiling tool set consists of:

The profiler server, PROF, a supervisor actor. This actor first interprets performance profiling requests issued by the PROF utility, and then executes the performance profiling function at a selected profiling clock rate on the target. See PROF(1CC) for more details.
The profctl target utility (see profctl(1CC)). This utility sends performance profiling requests to the profiler server, PROF, on the target.
The profrpg (see profrpg(1CC)) host utility. This command interprets profiling data and produces coherent profiling reports on the development host.

The Performance Profiling Library

When the performance profiling compiler option (generally -p) is used, the compiler provides each function entry point with a call to a routine, normally called mcount. For each function, the compiler also sets up a static counter, and passes the address of this counter to mcount. The counter is initialized at zero.

What is done by mcount is defined by the application. Low-end performance profilers simply count the number of times the routine is called. ChorusOS Profiler provides a sophisticated mcount routine within the profiled library that constructs the runtime call graph. Note that you can supply your own mcount routine, for example to assert predicates when debugging a component.

The Performance Profiler Server

The profiler server, PROF, is a supervisor actor that can locate and modify static data within the memory context of the profiled actors, using the embedded symbol tables. The profiler server also dynamically creates and deletes the memory regions that are used to construct the call graph and count the profiling ticks (see below).

The Performance Profiling Clock

While the performance profiler is active, the system is regularly interrupted by the profiling clock, which by default is the system clock. At each clock tick, the instruction pointer is sampled, the active procedure is located and a counter associated with the interrupted procedure is incremented. A high rate performance profiling clock could use a significant amount of system time, which could lead to the system appearing to run more slowly. A rapid sampling clock could jeopardize the system's real-time requirements.

Notes About Accuracy

Significant disruptions in the real-time capabilities of the profiled programs must be expected, because performance profiling is performed by software (rather than by hardware with an external bus analyzer or equivalent device). Performance profiling using software slows down the processor, and the profiled applications may behave differently when being profiled compared to when running at full processor speed.

When profiling, the processor can spend more than fifty percent of the processing time profiling clock interrupts. Similarly, the time spent recording the call graph is significant, and tends to bias the profiling results in a non-linear manner.

The accuracy of the reported percentage of time spent is about five percent when the number of profiling ticks is in the order of magnitude of ten times the number of bytes in the profiled programs. In other words, in order to profile a program of 1 million bytes with any degree of accuracy, at least 10 millions ticks should be used. This level of accuracy is usually sufficient to plan code optimizations, which is the primary goal of the profiler, but the operator should beware of using all the fractional digits of the reported figures.

If more accuracy is needed, the operator can experiment with different combinations of the rate of the profiling clock, the type of profiling clock and the time spent profiling.