4 - C H A P T E R -

C H A P T E R 4

Collecting Performance Data

The first stage of performance analysis is data collection. This chapter describes what is required for data collection, where the data is stored, how to collect data and how to manage the data collection. For more information about the data itself, see .

This chapter covers the following topics.

Preparing Your Program for Data Collection and Analysis

Compiling and Linking Your Program

Limitations on Data Collection

Where the Data Is Stored

Estimating Storage Requirements

Collecting Data Using the collect Command

Collecting Data From the Integrated Development Environment

Collecting Data Using the dbx collector Subcommands

Collecting Data From a Running Process

Collecting Data From MPI Programs

Preparing Your Program for Data Collection and Analysis

For most programs, you do not need to do anything special to prepare your program for data collection and analysis. You should read one or more of the subsections below if your program does any of the following:

Installs a signal handler

Explicitly dynamically loads a system library

Dynamically loads a module (.o file)

Dynamically compiles functions

Creates descendant processes

Uses the asynchronous I/O library

Uses the profiling timer or hardware counter API directly

Calls setuid(2) or executes a setuid file.

Also, if you want to control data collection from your program you should read the relevant subsection.

Use of System Libraries

The Collector interposes on functions from various system libraries, to collect tracing data and to ensure the integrity of data collection. The following list describes situations in which the Collector interposes on calls to library functions.

Collection of synchronization wait tracing data. The Collector interposes on functions from the threads library, libthread.so.

Collection of heap tracing data. The Collector interposes on the functions malloc, realloc, memalign and free. Versions of these functions are found in the C standard library, libc.so and also in other libraries such as libmalloc.so and libmtmalloc.so.

Collection of MPI tracing data. The Collector interposes on functions from the MPI library, libmpi.so.

Ensuring the integrity of clock data. The Collector interposes on setitimer and prevents the program from using the profiling timer.

Ensuring the integrity of hardware counter data. The Collector interposes on functions from the hardware counter library, libcpc.so and prevents the program from using the counters. Calls from the program to functions from this library return with a return value of -1.

Enabling data collection on descendant processes. The Collector interposes on the functions fork(2), fork1(2), vfork(2), fork(3F), system(3C), system(3F), sh(3F), popen(3C), and exec(2) and its variants. Calls to vfork are replaced internally by calls to fork1. These interpositions are only done for the collect command.

Guaranteeing the handling of the SIGPROF and SIGEMT signals by the Collector. The Collector interposes on sigaction to ensure that its signal handler is the primary signal handler for these signals.

There are some circumstances in which the interposition does not succeed:

Statically linking a program with any of the libraries that contain functions that are interposed.

Attaching dbx to a running application that does not have the collector library preloaded.

Dynamically loading one of these libraries and resolving the symbols by searching only within the library.

The failure of interposition by the Collector can cause loss or invalidation of performance data.

Use of Signal Handlers

The Collector uses two signals to collect profiling data, SIGPROF and SIGEMT. The Collector installs a signal handler for each of these signals, which intercept and process the signals, but pass on signals they do not use to any other signal handlers that are installed. If a program installs its own signal handler for these signals, the Collector re-installs its signal handler as the primary handler to guarantee the integrity of the performance data.

The collect command can also use user-specified signals for pausing and resuming data collection and for recording samples. These signals are not protected by the Collector. It is the responsibility of the user to ensure that there is no conflict between use of the specified signals by the Collector and any use made by the application of the same signals.

The signal handlers installed by the Collector set a flag that ensures that system calls are not interrupted for signal delivery. This flag setting could change the behavior of the program if the program's signal handler sets the flag to permit interruption of system calls. One important example of a change in behavior occurs for the asynchronous I/O library, libaio.so, which uses SIGPROF for asynchronous cancel operations, and which does interrupt system calls. If the collector library, libcollector.so, is installed, the cancel signal arrives late.

If you attach dbx to a process without preloading the collector library and enable performance data collection, and the program subsequently installs its own signal handler, the Collector does not re-install its own signal handler. In this case, the program's signal handler must ensure that the SIGPROF and SIGEMT signals are passed on so that performance data is not lost. If the program's signal handler interrupts system calls, both the program behavior and the profiling behavior will be different from when the collector library is preloaded.

Use of `setuid`

There are restrictions enforced by the dynamic loader that make it difficult to use setuid(2) and collect performance data. If your program calls setuid or executes a setuid file, it is likely that the Collector cannot write an experiment file because it lacks the necessary permissions for the new user ID.

Controlling Data Collection From Your Program

If you want to control data collection from your program, the Collector shared library, libcollector.so contains some API functions that you can use in your program. The functions are written in C, and a Fortran interface is provided. Both the C interface and the Fortran interface are defined in header files that are provided with the library.

To use the API functions from C or C++, insert the following statement.

#include "libcollector.h"

The functions are defined as follows.

void collector_sample(char *name);

void collector_pause(void);

void collector_resume(void);

void collector_terminate_expt(void);

To use the API functions from Fortran, insert the following statement:.

include libfcollector.h

When you link your program, link with -lfcollector.

Caution - Do not link a program in any language with -lcollector. If you do, the Collector can exhibit unpredictable behavior.

The C include file contains macros that bypass the calls to the real API functions if data is not being collected. In this case the functions are not dynamically loaded. The Fortran API subroutines call the C API functions if performance data is being collected, otherwise they return. The overhead for the checking is very small and should not significantly affect program performance.

To collect performance data you must run your program using the Collector, as described later in this chapter. Inserting calls to the API functions does not enable data collection.

If you intend to use the API functions in a multithreaded program, you should ensure that they are only called by one thread. The API functions perform actions that apply to the process and not to individual threads. If each thread calls the API functions, the data that is recorded might not be what you expect. For example, if collector_pause() or collector_terminate_expt() is called by one thread before the other threads have reached the same point in the program, collection is paused or terminated for all threads, and data can be lost from the threads that were executing code before the API call.

The descriptions of the four API functions follow.

`collector_sample(char *name)` (C and C++)

`collector_sample(string)` (Fortran)

Record a sample packet and label the sample with the given name or string. The label is not currently used by the Performance Analyzer. The Fortran argument string is of type character.

Sample points contain data for the process and not for individual threads. In a multithreaded application, the collector_sample() API function ensures that only one sample is written if another call is made while it is recording a sample. The number of samples recorded can be less than the number of threads making the call.

The Performance Analyzer does not distinguish between samples recorded by different mechanisms. If you want to see only the samples recorded by API calls, you should turn off all other sampling modes when you record performance data.

`collector_pause()`

Stop writing event-specific data to the experiment. The experiment remains open, and global data continues to be written. The call is ignored if no experiment is active or if data recording is already stopped.

`collector_resume()`

Resume writing event-specific data to the experiment after a call to collector_pause(). The call is ignored if no experiment is active or if data recording is active.

`collector_terminate_expt()`

Terminate the experiment whose data is being collected. No further data is collected, but the program continues to run normally. The call is ignored if no experiment is active.

Dynamic Functions and Modules

If your C program or C++ program dynamically compiles functions or dynamically loads modules (.o files) into the data space of the program, you must supply information to the Collector if you want to see data for the dynamic function or module in the Performance Analyzer. The information is passed by calls to collector API functions. The definitions of the API functions are as follows.

void collector_func_load(char *name, char *alias,

    char *sourcename, void *vaddr, int size, int lntsize,

    Lineno *lntable);

void collector_func_unload(void *vaddr);

void collector_module_load(char *modulename, void *vaddr);

void collector_module_unload(void *vaddr);

You do not need to use these API functions for Java methods that are compiled by the Java HotSpot virtual machine, for which a different interface is used. The Java interface provides the name of the method that was compiled to the Collector. You can see function data and annotated disassembly listings for Java compiled methods, but not annotated source listings.

The descriptions of the four API functions follow.

`collector_func_load()`

Pass information about dynamically compiled functions to the Collector for recording in the experiment. The parameter list is described in the following table.


Parameter	Definition
`name`	The name of the dynamically compiled function that is used by the performance tools. The name does not have to be the actual name of the function. The name need not follow any of the normal naming conventions of functions, although it should not contain embedded blanks or embedded quote characters.
`alias`	An arbitrary string used to describe the function. It can be `NULL`. It is not interpreted in any way, and can contain embedded blanks. It is displayed in the Summary tab of the Analyzer. It can be used to indicate what the function is, or why the function was dynamically constructed.
`sourcename`	The path to the source file from which the function was constructed. It can be `NULL`. The source file is used for annotated source listings.
`vaddr`	The address at which the function was loaded.
`size`	The size of the function in bytes.
`lntsize`	A count of the number of entries in the line number table. It should be zero if line number information is not provided.
`lntable`	A table containing `lntsize` entries, each of which is a pair of integers. The first integer is an offset, and the second entry is a line number. All instructions between an offset in one entry and the offset given in the next entry are attributed to the line number given in the first entry. Offsets must be in increasing numeric order, but the order of line numbers is arbitrary. If `lntable` is `NULL`, no source listings of the function are possible, although disassembly listings are available.

`collector_func_unload()`

Inform the collector that the dynamic function at the address vaddr has been unloaded.

`collector_module_load()`

Used to inform the collector that the module modulename has been loaded into the address space at address vaddr by the program. The module is read to determine its functions and the source and line number mappings for these functions.

`collector_module_unload()`

Inform the collector that the module that was loaded at the address vaddr has been unloaded.

Compiling and Linking Your Program

You can collect and analyze data for a program compiled with almost any option, but some choices affect what you can collect or what you can see in the Performance Analyzer. The issues that you should take into account when you compile and link your program are described in the following subsections.

Source Code Information

To see source code information, you must use the -g compiler option (-g0 for C++ to ensure that front-end inlining is enabled). When this option is used the compiler generates symbol tables that are used by the Performance Analyzer to obtain source line numbers and file names and print compiler commentary messages. Without this option you cannot view annotated source code listings or compiler commentary, and you might not have all function names in the main Performance Analyzer display. You must also use the -g (or -xF) compiler option if you want to generate a mapfile.

If you need to move or remove the object (.o) files for any reason, you can load your program with the -xs option. With this option, all the information on the source files is put into the executable. This option makes it easier to move the experiment and the program-related files to a new location before analyzing it, for example.

Static Linking

When you compile your program, you must not disable dynamic linking, which is done with the -dn and -Bstatic compiler options. If you try to collect data for a program that is entirely statically linked, the Collector prints an error message and does not collect data. This is because the collector library, among others, is dynamically loaded when you run the Collector.

You should not statically link any of the system libraries. If you do, you might not be able to collect any kind of tracing data. Nor should you link to the Collector library, libcollector.so.

Optimization

If you compile your program with optimization turned on at some level, the compiler can rearrange the order of execution so that it does not strictly follow the sequence of lines in your program. The Performance Analyzer can analyze experiments collected on optimized code, but the data it presents at the disassembly level is often difficult to relate to the original source code lines. In addition, the call sequence can appear to be different from what you expect if the compiler performs tail-call optimizations.

If you compile a C program on an IA platform with an optimization level of 4 or 5, the Collector is unable to reliably unwind the call stack. As a consequence, only the exclusive metrics for a function are reliable. If you compile a C++ program on an IA platform, you can use any optimization level, as long as you do not use the -noex (or -features=no@except) compiler option to disable C++ exceptions. If you do use this option the Collector is unable to reliably unwind the call stack, and only the exclusive metrics for a function are reliable.

Intermediate Files

If you generate intermediate files using the -E or -P compiler options, the Performance Analyzer uses the intermediate file for annotated source code, not the original source file. The #line directives generated with -E can cause problems in the assignment of metrics to source lines.

Limitations on Data Collection

This section describes the limitations on data collection that are imposed by the hardware, the operating environment, the way you run your program or by the Collector itself.

Limitations on Clock-based Profiling

The profiling interval must be a multiple of the system clock resolution. The default resolution is 10 milliseconds. If you want to do profiling at higher resolution, you can change the system clock rate to give a resolution of 1 millisecond. If you have root privilege, you can do this by adding the following line to the file /etc/system, and then rebooting.

set hires_tick=1

See the Solaris Tunable Parameters Reference Manual for more information.

Limitations on Collection of Tracing Data

You cannot collect any kind of tracing data from a program that is already running unless the Collector library, libcollector.so, has been preloaded. See Collecting Data From a Running Process for more information.

Limitations on Hardware-Counter Overflow Profiling

There are several limitations on hardware counter overflow profiling:

You can only collect hardware-counter overflow data on processors that have hardware counters and that support overflow profiling. On other systems, hardware-counter overflow profiling is disabled. UltraSPARC processors prior to the UltraSPARC III processor family do not support hardware-counter overflow profiling.

You cannot collect hardware-counter overflow data with versions of the operating environment that precede the Solaris 8 release.

You can record data for at most two hardware counters in an experiment. To record data for more than two hardware counters or for counters that use the same register you must run separate experiments.

You cannot collect hardware-counter overflow data on a system while cpustat(1) is running, because cpustat takes control of the counters and does not let a user process use the counters. If cpustat is started during data collection, the experiment is terminated.

You cannot use the hardware counters in your own code via the libcpc(3) API if you are doing hardware-counter overflow profiling. The Collector interposes on the libcpc library functions and returns with a return value of -1 if the call did not come from the Collector.

If you try to collect hardware counter data on a running program that is using the hardware counter library, by attaching dbx to the process, the experiment is corrupted.

Limitations on Data Collection for Descendant Processes

You can collect data on descendant processes subject to the following limitations:

If you want to collect data for all descendant processes that are followed by the Collector, you must use the collect command with the -F on option.

You can collect data automatically for calls to fork and its variants and exec and its variants. Calls to system, popen, and sh are not followed by the Collector.

If you want to collect data for individual descendant processes, you must attach dbx to the process. See Appendix Collecting Data From a Running Process for more information.

Limitations on Java Profiling

You can collect data on Java programs subject to the following limitations:

You must use a version of the Java 2 Software Development Kit no earlier than 1.4. The path to the Java virtual machine ^[1] should be specified in one of the following four environment variables: JDK_1_4_HOME, JDK_HOME, JAVA_PATH, PATH. The Collector verifies that the version of java it finds in these environment variables is an ELF executable, and if it is not, an error message is printed, indicating which environment variable was used, and the full path name that was tried.

You cannot collect tracing data for Java monitors or Java allocations. However, you can collect tracing data for any C or C++ functions that are called from a Java method.

You must use the collect command to collect data. You cannot use the dbx collector subcommands or the data collection capabilities of the IDE.

If you want to use the 64 bit JVM machine, it must either be the default, or you must specify the path to it when you collect data. Do not use java -d64 to collect data using the 64 bit JVM machine. If you do, no data is collected.

Where the Data Is Stored

The data collected during one execution of your application is called an experiment. The experiment consists of a set of files that are stored in a directory. The name of the experiment is the name of the directory.

In addition to recording the experiment data, the Collector creates its own archives of the load objects used by the program. These archives contain the addresses, sizes and names of each object file and each function in the load object, as well as the address of the load object and a time stamp for its last modification.

Experiments are stored by default in the current directory. If this directory is on a networked file system, storing the data takes longer than on a local file system, and can distort the performance data. You should always try to record experiments on a local file system if possible. You can change the storage location when you run the Collector.

Experiments for descendant processes are stored inside the experiment for the founder process.

Experiment Names

The default name for a new experiment is test.1.er. The suffix .er is mandatory: if you give a name that does not have it, an error message is displayed and the name is not accepted.

If you choose a name with the format experiment.n.er, where n is a positive integer, the Collector automatically increments n by one in the names of subsequent experiments--for example, mytest.1.er is followed by mytest.2.er, mytest.3.er, and so on. The Collector also increments n if the experiment already exists, and continues to increment n until it finds an experiment name that is not in use. If the experiment name does not contain n and the experiment exists, the Collector prints an error message.

Experiments can be collected into groups. The group is defined in an experiment group file, which is stored by default in the current directory. The experiment group file is a plain text file with a special header line and an experiment name on each subsequent line. The default name for an experiment group file is test.erg. If the name does not end in .erg, an error is displayed and the name is not accepted. Once you have created an experiment group, any experiments you run with that group name are added to the group.

The default experiment name is different for experiments collected from MPI programs, which create one experiment for each MPI process. The default experiment name is test.m.er, where m is the MPI rank of the process. If you specify an experiment group group.erg, the default experiment name is group.m.er. If you specify an experiment name, it overrides these defaults. See Collecting Data From MPI Programs for more information.

Experiments for descendant processes are named with their lineage as follows. To form the experiment name for a descendant process, an underscore, a code letter and a number are added to the stem of its creator's experiment name. The code letter is f for a fork and x for an exec. The number is the index of the fork or exec (whether successful or not). For example, if the experiment name for the founder process is test.1.er, the experiment for the child process created by the third call to fork is test.1.er/_f3.er. If that child process calls exec successfully, the experiment name for the new descendant process is test.1.er/_f3_x1.er.

Moving Experiments

If you want to move an experiment to another computer to analyze it, you should be aware of the dependencies of the analysis on the operating environment in which the experiment was recorded.

The archive files contain all the information necessary to compute metrics at the function level and to display the timeline. However, if you want to see annotated source code or annotated disassembly code, you must have access to versions of the load objects or source files that are identical to the ones used when the experiment was recorded.

The Performance Analyzer searches for the source, object and executable files in the following locations in turn, and stops when it finds a file of the correct basename:

The experiment.

The absolute pathname as recorded in the executable.

The current working directory.

To ensure that you see the correct annotated source code and annotated disassembly code for your program, you can copy the source code, the object files and the executable into the experiment before you move or copy the experiment. If you don't want to copy the object files, you can link your program with -xs to ensure that the information on source lines and file locations are inserted into the executable.

Estimating Storage Requirements

In this section some guidelines are given for estimating the amount of disk space needed to record an experiment. The size of the experiment depends directly on the size of the data packets and the rate at which they are recorded, the number of LWPs used by the program, and the execution time of the program.

The data packets contain event-specific data and data that depends on the program structure (the call stack). The amount of data that depends on the data type is approximately 50 to 100 bytes. The call stack data consists of return addresses for each call, and contains 4 bytes (8 bytes on 64 bit SPARC architecture) per address. Data packets are recorded for each LWP in the experiment.

The rate at which profiling data packets are recorded is controlled by the profiling interval for clock data and by the overflow value for hardware counter data. However, the choice of these parameters also affects the data quality and the distortion of program performance due to the data collection overhead. Smaller values of these parameters give better statistics but also increase the overhead. The default values of the profiling interval and the overflow value have been carefully chosen as a compromise between obtaining good statistics and minimizing the overhead. Smaller values also mean more data.

For a clock-based profiling experiment with a profiling interval of 10ms and a small call stack, such that the packet size is 100 bytes, data is recorded at a rate of 10 kbytes/sec per LWP. For a hardware counter overflow profiling experiment collecting data for CPU cycles and instructions executed on a 750MHz processor with an overflow value of 1000000 and a packet size of 100 bytes, data is recorded at a rate of 150 kbytes/sec per LWP. Applications that have call stacks with a depth of hundreds of calls could easily record data at ten times these rates.

Your estimate of the size of the experiment should also take into account the disk space used by the archive files, which is usually a small fraction of the total disk space requirement (see the previous section). If you are not sure how much space you need, try running your experiment for a short time. From this test you can obtain the size of the archive files, which are independent of the data collection time, and scale the size of the profile files to obtain an estimate of the size for the full-length experiment.

As well as allocating disk space, the Collector allocates buffers in memory to store the profile data before writing it to disk. There is currently no way to specify the size of these buffers. If the Collector runs out of memory, you should try to reduce the amount of data collected.

If your estimate of the space required to store the experiment is larger than the space you have available, you can consider collecting data for part of the run rather than the whole run. You can do this with the collect command, with the dbx collector subcommands, or by inserting calls in your program to the collector API. You can also limit the total amount of profiling and tracing data collected with the collect command or with the dbx collector subcommands.

Note - The Performance Analyzer cannot read more than 2 GB of performance data.

Collecting Data Using the `collect` Command

To run the Collector from the command line using the collect command, type the following.

% collect collect-options program program-arguments

Here, collect-options are the collect command options, program is the name of the program you want to collect data on, and program-arguments are its arguments.

If no command arguments are given, the default is to turn on clock-based profiling with a profiling interval of 10 milliseconds.

To obtain a list of options and a list of the names of any hardware counters that are available for profiling, type the collect command with no arguments.

% collect

For a description of the list of hardware counters, see . See also Limitations on Hardware-Counter Overflow Profiling.

Data Collection Options

These options control the types of data that are collected. See for a description of the data types.

If no data collection options are given, the default is -p on, which enables clock-based profiling with the default profiling interval of 10 milliseconds. The default is turned off by the -h option but not by any of the other data collection options.

If clock-based profiling is explicitly disabled, and neither any kind of tracing nor hardware counter overflow profiling is enabled, the collect command prints a warning message, and collects global data only.

`-p` option

Collect clock-based profiling data. The allowed values of option are:

off - Turn off clock-based profiling.

on - Turn on clock-based profiling with the default profiling interval of 10 milliseconds.

lo[w] - Turn on clock-based profiling with the low-resolution profiling interval of 100 milliseconds.

hi[gh] - Turn on clock-based profiling with the high-resolution profiling interval of 1 millisecond. High-resolution profiling must be explicitly enabled. See Limitations on Clock-based Profiling for information on enabling high-resolution profiling.

value - Turn on clock-based profiling and set the profiling interval to value, given in milliseconds. The value should be a multiple of the system clock resolution. If it is larger but not a multiple it is rounded down. If it is smaller, a warning message is printed and it is set to the system clock resolution. See Limitations on Clock-based Profiling for information on enabling high-resolution profiling.

Collecting clock-based profiling data is the default action of the collect command.

`-h` counter`[,`value`[,`counter2`[,`value2`]]]`

Collect hardware counter overflow profiling data. The counter names counter and counter2 can be one of the following:

An aliased counter name

An internal name, as used by cputrack(1). If the counter can use either event register, the event register to be used can be specified by appending /0 or /1 to the internal name.

If two counters are specified, they must use different registers. If they do not use different registers, the collect command prints an error message and exits. Some counters can count on either register.

To obtain a list of available counters, type collect with no arguments in a terminal window. A description of the counter list is given in the section .

The overflow value is the number of events counted at which the hardware counter overflows and the overflow event is recorded. The overflow values can be specified using value and value2, which can be set to one of the following:

hi[gh] - The high-resolution value for the chosen counter is used. The abbreviation h is also supported for compatibility with previous software releases.

lo[w] - The low-resolution value for the chosen counter is used.

number - The overflow value. Must be a positive integer.

0, on, or a null string - The default overflow value is used.

The default is the normal threshold, which is predefined for each counter and which appears in the counter list. See also Limitations on Hardware-Counter Overflow Profiling.

If you use the -h option without explicitly specifying a -p option, clock-based profiling is turned off. To collect both hardware counter data and clock-based data, you must specify both a -h option and a -p option.

`-s` option

Collect synchronization wait tracing data. The allowed values of option are:

all - Turn on synchronization wait tracing with a zero threshold. This option will force all synchronization events to be recorded.

calibrate - Turn on synchronization wait tracing and set the threshold value by calibration at runtime. (Equivalent to on.)

off - Turn off synchronization wait tracing.

on - Turn on synchronization wait tracing with the default threshold, which is to set the value by calibration at runtime. (Equivalent to calibrate.)

value - Set the threshold to value, given as a positive integer in microseconds.

Synchronization wait tracing data is not recorded for Java monitors.

`-H` option

Collect heap tracing data. The allowed values of option are:

on - Turn on tracing of heap allocation and deallocation requests.

off - Turn off heap tracing.

Heap tracing is turned off by default.

Heap tracing data is not recorded for Java memory allocations.

`-m` option

Collect MPI tracing data. The allowed values of option are:

on - Turn on tracing of MPI calls.

off - Turn off tracing of MPI calls.

MPI tracing is turned off by default.

See for more information about the MPI functions whose calls are traced and the metrics that are computed from the tracing data.

`-S` option

Record sample packets periodically. The allowed values of option are:

off - Turn off periodic sampling.

on - Turn on periodic sampling with the default sampling interval of 1 second.

value - Turn on periodic sampling and set the sampling interval to value. The interval value must be an integer, and is given in seconds.

By default, periodic sampling at 1 second intervals is enabled.

Experiment Control Options

`-F` option

Control whether or not descendant processes should have their data recorded. The allowed values of option are:

on - Record experiments on all descendant processes that are followed by the Collector.

off - Do not record experiments on descendant processes.

The Collector follows processes created by calls to the functions fork(2), fork1(2), fork(3F), vfork(2), and exec(2) and its variants. The call to vfork is replaced internally by a call to fork1. The Collector does not follow processes created by calls to system(3C), system(3F), sh(3F), and popen(3C).

`-j` option

Enable Java profiling for a nonstandard Java installation, or choose whether to collect data on methods compiled by the Java HotSpot virtual machine. The allowed values of option are:

on - Recognize methods compiled by the Java HotSpot virtual machine.

off - Do not attempt to recognize methods compiled by the Java HotSpot virtual machine.

This option is not needed if you want to collect data on a .class file or a .jar file, provided that the path to the java executable is in one of the following environment variables: JDK_1_4_HOME, JDK_HOME, JAVA_PATH, or PATH. You can then specify program as the .class file or the .jar file, with or without the extension.

If you cannot define the path to java in any of these variables, or if you want to disable the recognition of methods compiled by the Java HotSpot virtual machine you can use this option. If you use this option, program must be a Java virtual machine whose version is not earlier than 1.4. The collect command does not verify that program is a JVM machine, and collection can fail if it is not. However it does verify that program is an ELF executable, and if it is not, the collect command prints an error message.

If you want to collect data using the 64 bit JVM machine, you must not use the -d64 option to java for a 32 bit JVM machine. If you do, no data is collected. Instead you must specify the path to the 64 bit JVM machine either in program or in one of the environment variables given in this section.

`-l` signal

Record a sample packet when the signal named signal is delivered to the process.

The signal can be specified by the full signal name, by the signal name without the initial letters SIG, or by the signal number. Do not use a signal that is used by the program or that would terminate execution. Suggested signals are SIGUSR1 and SIGUSR2. Signals can be delivered to a process by the kill(1) command.

If you use both the -l and the -y options, you must use different signals for each option.

If you use this option and your program has its own signal handler, you should make sure that the signal that you specify with -l is passed on to the Collector's signal handler, and is not intercepted or ignored.

See the signal(3HEAD) man page for more information about signals.

`-x`

Leave the target process stopped on exit from the exec system call in order to allow a debugger to attach to it. If you attach dbx to the process, use the dbx commands ignore PROF and ignore EMT to ensure that collection signals are passed on to the collect command.

`-y` signal[`,r`]

Control recording of data with the signal named signal. Whenever the signal is delivered to the process, it switches between the paused state, in which no data is recorded, and the recording state, in which data is recorded. Sample points are always recorded, regardless of the state of the switch.

If you use both the -l and the -y options, you must use different signals for each option.

When the -y option is used, the Collector is started in the recording state if the optional r argument is given, otherwise it is started in the paused state. If the -y option is not used, the Collector is started in the recording state.

If you use this option and your program has its own signal handler, you should make sure that the signal that you specify with -y is passed on to the Collector's signal handler, and is not intercepted or ignored.

See the signal(3HEAD) man page for more information about signals.

Output Options

`-d` directory-name

Place the experiment in directory directory-name. This option only applies to individual experiments and not to experiment groups. If the directory does not exist, the collect command prints an error message and exits.

`-g` group-name

Make the experiment part of experiment group group-name. If group-name does not end in .erg, the collect command prints an error message and exits. If the group exists, the experiment is added to it. The experiment group is placed in the current directory unless group-name includes a path.

`-o` experiment-name

Use experiment-name as the name of the experiment to be recorded. If experiment-name does not end in .er, the collect command prints an error message and exits. See Experiment Names for more information on experiment names and how the Collector handles them.

`-L` size

Limit the amount of profiling data recorded to size megabytes. The limit applies to the sum of the amounts of clock-based profiling data, hardware-counter overflow profiling data, and synchronization wait tracing data, but not to sample points. The limit is only approximate, and can be exceeded.

When the limit is reached, no more profiling data is recorded but the experiment remains open until the target process terminates. If periodic sampling is enabled, sample points continue to be written.

The default limit on the amount of data recorded is 2000 Mbytes. This limit was chosen because the Performance Analyzer cannot process experiments that contain more than 2 Gbytes of data.

Other Options

`-n`

Do not run the target but print the details of the experiment that would be generated if the target were run. This is a "dry run" option.

Note - This option has changed from the Forte Developer 6 update 2 release.

`-R`

Display the text version of the performance tools readme in the terminal window. If the readme is not found, a warning is printed.

`-V`

Print the current version of the collect command. No further arguments are examined, and no further processing is done.

`-v`

Print the current version of the collect command and detailed information about the experiment being run.

Obsolete Options

`-a`

Address space data collection and display is no longer supported. This option is ignored with a warning.

Collecting Data From the Integrated Development Environment

Note - The Performance Analyzer GUI and the IDE are part of the Forte for Java 4, Enterprise Edition for the Solaris operating environment, versions 8 and 9.

You can collect performance data using the Debugger in the Solaris Native Language Support module of the IDE. For information on how to collect performance data in the IDE, refer to the online help for the Solaris Native Language Support module.

Collecting Data Using the `dbx` `collector` Subcommands

To run the Collector from dbx:

1. Load your program into dbx by typing the following command.

% dbx program

2. Use the collector command to enable data collection, select the data types, and set any optional parameters.

(dbx) collector subcommand

To get a listing of available collector subcommands, type:

(dbx) help collector

You must use one collector command for each subcommand.

3. Set up any dbx options you wish to use and run the program.

If a subcommand is incorrectly given, a warning message is printed and the subcommand is ignored. A complete listing of the collector subcommands follows.

Data Collection Subcommands

The following subcommands control the types of data that are collected by the Collector. They are ignored with a warning if an experiment is active.

`profile` option

Controls the collection of clock-based profiling data. The allowed values for option are:

on - Enables clock-based profiling with the default profiling interval of 10 ms.

off - Disables clock-based profiling.

timer value - Sets the profiling interval to value milliseconds. The default setting is 10 ms. The value should be a multiple of the system clock resolution. If the value is larger than the system clock resolution but not a multiple it is rounded down. If the value is smaller than the system clock resolution it is set to the system clock resolution. In both cases a warning message is printed. See Limitations on Clock-based Profiling to find out how to enable high-resolution profiling.

The Collector collects clock-based profiling data by default, unless the collection of hardware-counter overflow profiling data is turned on using the hwprofile subcommand.

`hwprofile` option

Controls the collection of hardware-counter overflow profiling data. If you attempt to enable hardware-counter overflow profiling on systems that do not support it, dbx returns a warning message and the command is ignored. The allowed values for option are:

on - Turns on hardware-counter overflow profiling. The default action is to collect data for the cycles counter at the normal overflow value.

off - Turns off hardware-counter overflow profiling.

list - Returns a list of available counters See for a description of the list. If your system does not support hardware-counter overflow profiling, dbx returns a warning message.

counter name value [ name2 value2 ] - Selects the hardware counter name, and sets its overflow value to value; optionally selects a second hardware counter name2 and sets its overflow value to value2. An overflow value of 0 is interpreted as the default overflow value. The two counters must use different registers. If they do not, a warning message is printed and the command is ignored.

The Collector does not collect hardware-counter overflow profiling data by default. If hardware-counter overflow profiling is enabled and a profile command has not been given, clock-based profiling is turned off.

`synctrace` option

Controls the collection of synchronization wait tracing data. The allowed values for option are

on - Enables synchronization wait tracing.

off - Disables synchronization wait tracing.

threshold value - Sets the threshold for the minimum synchronization delay. The allowed values for value are calibrate, to use a calibrated threshold determined at runtime, or a value given in microseconds. Setting value to 0 (zero) causes the Collector to trace all events, regardless of wait time. The default setting is calibrate.

By default, the Collector does not collect synchronization wait tracing data.

`heaptrace` option

Controls the collection of heap tracing data. The allowed values for option are

on - Enables heap tracing.

off - Disables heap tracing.

By default, the Collector does not collect heap tracing data.

`mpitrace` option

Controls the collection of MPI tracing data. The allowed values for option are

on - Enables tracing of MPI calls.

off - Disables tracing of MPI calls.

By default, the Collector does not collect MPI tracing data.

`sample` option

Controls the sampling mode. The allowed values for option are:

periodic - Enables periodic sampling.

manual - Disables periodic sampling. Manual sampling remains enabled.

period value - Sets the sampling interval to value, given in seconds.

By default, periodic sampling is enabled, with a sampling interval value of 1 second.

`dbxsample` { `on` | `off` }

Controls the recording of samples when dbx stops the target process. The meanings of the keywords are as follows:

on - A sample is recorded each time dbx stops the target process.

off - Samples are not recorded when dbx stops the target process.

By default, samples are recorded when dbx stops the target process.

Experiment Control Subcommands

`disable`

Disables data collection. If a process is running and collecting data, it terminates the experiment and disables data collection. If a process is running and data collection is disabled, it is ignored with a warning. If no process is running, it disables data collection for subsequent runs.

`enable`

Enables data collection. If a process is running but data collection is disabled, it enables data collection and starts a new experiment. If a process is running and data collection is disabled, it is ignored with a warning. If no process is running, it enables data collection for subsequent runs.

You can enable and disable data collection as many times as you like during the execution of any process. Each time you enable data collection, a new experiment is created.

`pause`

Suspends the collection of data, but leaves the experiment open. Sample points are still recorded. This subcommand is ignored if data collection is already paused.

`resume`

Resumes data collection after a pause has been issued. This subcommand is ignored if data is being collected.

`sample record` name

Record a sample packet with the label name. The label is not currently used.

Output Subcommands

The following subcommands define storage options for the experiment. They are ignored with a warning if an experiment is active.

`limit` value

Limit the amount of profiling data recorded to value megabytes. The limit applies to the sum of the amounts of clock-based profiling data, hardware-counter overflow profiling data, and synchronization wait tracing data, but not to sample points. The limit is only approximate, and can be exceeded.

When the limit is reached, no more profiling data is recorded but the experiment remains open and sample points continue to be recorded.

The default limit on the amount of data recorded is 2000 Mbytes. This limit was chosen because the Performance Analyzer cannot process experiments that contain more than 2 Gbytes of data.

`store` option

Governs where the experiment is stored. This command is ignored with a warning if an experiment is active. The allowed values for option are:

directory directory-name - Sets the directory where the experiment is stored. This subcommand is ignored with a warning if the directory does not exist.

experiment experiment-name - Sets the name of the experiment. If the experiment name does not end in .er, the subcommand is ignored with a warning. See Where the Data Is Stored for more information on experiment names and how the Collector handles them.

group group-name - Sets the name of the experiment group. If the group name does not end in .erg, the subcommand is ignored with a warning. If the group already exists, the experiment is added to the group.

The filename option is obsolete. It has been replaced by experiment. It is accepted as a synonym for experiment for compatibility with the previous Forte Developer software release.

Information Subcommands

`show`

Shows the current setting of every Collector control.

`status`

Reports on the status of any open experiment.

Obsolete Subcommands

`address_space`

Address space data collection is no longer supported. This subcommand is ignored with a warning.

`close`

Synonym for disable.

`enable_once`

Formerly used to enable data collection for one run only. This subcommand is ignored with a warning.

`quit`

Synonym for disable.

`store filename`

Synonym for store experiment.

Collecting Data From a Running Process

The Collector allows you to collect data from a running process. If the process is already under the control of dbx (either in the command line version or in the IDE), you can pause the process and enable data collection using the methods described in previous sections.

Note - The Performance Analyzer GUI and the IDE are part of the Forte for Java 4, Enterprise Edition for the Solaris operating environment, versions 8 and 9.

If the process is not under the control of dbx, you can attach dbx to it, collect performance data, and then detach from the process, leaving it to continue. If you want to collect performance data for selected descendant processes, you must attach dbx to each process.

To collect data from a running process that is not under the control of dbx:

1. Determine the program's process ID (PID).

If you started the program from the command line and put it in the background, its PID will be printed to standard output by the shell. Otherwise you can determine the program's PID by typing the following.

% ps -ef | grep program-name

2. Attach to the process.

From the Debug menu of the IDE, choose Debug Attach to Solaris Process and select the process using the dialog box. Use the online help for instructions.

From dbx, type the following.

(dbx) attach program-name pid

If dbx is not already running, type the following.

% dbx program-name pid

See the manual, Debugging a Program With dbx, for more details on attaching to a process. Attaching to a running process pauses the process.

3. Start data collection.

From the Debug menu of the IDE, choose Performance Toolkit Enable Collector and use the dialog box to set up the data collection parameters. Then choose Debug Continue to resume execution of the process.

From dbx, use the collector command to set up the data collection parameters and the cont command to resume execution of the process.

4. Detach from the process.

When you have finished collecting data, pause the program and then detach the process from dbx.

In the IDE, right-click the session for the process in the Sessions view of the Debugger window and choose Detach from the contextual menu. If the Sessions view is not displayed, click the Sessions button at the top of the Debugger window.

From dbx, type the following.

(dbx) detach

If you want to collect any kind of tracing data, you must preload the Collector library, libcollector.so, before you run your program, because the library provides wrappers to the real functions that enable data collection to take place. In addition, the Collector adds wrapper functions to other system library calls to guarantee the integrity of performance data. If you do not preload the Collector library, these wrapper functions cannot be inserted. See Use of System Libraries for more information on how the Collector interposes on system library functions.

To preload libcollector.so, you must set both the name of the library and the path to the library using environment variables. Use the environment variable LD_PRELOAD to set the name of the library. Use the environment variable LD_LIBRARY_PATH to set the path to the library. If you are using SPARC V9 64 bit architecture, you must also set the environment variable LD_LIBRARY_PATH_64. If you have already defined these environment variables, add the new values to them. The values of the environment variables are shown in TABLE 4-2.


Environment variable	Value
`LD_PRELOAD`	`libcollector.so`
`LD_LIBRARY_PATH`	`/opt/SUNWspro/lib`
`LD_LIBRARY_PATH_64`	`/opt/SUNWspro/lib/v9`

If your Forte Developer software is not installed in /opt/SUNWspro, ask your system administrator for the correct path. You can set the full path in LD_PRELOAD, but doing this can create complications when using SPARC V9 64-bit architecture.

Note - Remove the LD_PRELOAD and LD_LIBRARY_PATH settings after the run, so they do not remain in effect for other programs that are started from the same shell.

If you want to collect data from an MPI program that is already running, you must attach a separate instance of dbx to each process and enable the Collector for each process. When you attach dbx to the processes in an MPI job, each process will be halted and restarted at a different time. The time difference could change the interaction between the MPI processes and affect the performance data you collect. To minimize this problem, one solution is to use pstop(1) to halt all the processes. However, once you attach dbx to the processes, you must restart them from dbx, and there will be a timing delay in restarting the processes, which can affect the synchronization of the MPI processes. See also Collecting Data From MPI Programs.

Collecting Data From MPI Programs

The Collector can collect performance data from multi-process programs that use the Sun Message Passing Interface (MPI) library. The MPI library is included in the Sun HPC ClusterTools software. You should use the latest version of the ClusterTools software if possible, which is 4.0, but you can use 3.1 or a compatible version. To start the parallel jobs, use the Sun Cluster Runtime Environment (CRE) command mprun. See the Sun HPC ClusterTools documentation for more information. For information about MPI and the MPI standard, see the MPI web site http://www.mcs.anl.gov/mpi.

Because of the way MPI and the Collector are implemented, each MPI process records a separate experiment. Each experiment must have a unique name. Where and how the experiment is stored depends on the kinds of file systems that are available to your MPI job. Issues about storing experiments are discussed in the next subsection.

To collect data from MPI jobs, you can either run the collect command under MPI or start dbx under MPI and use the dbx collector subcommands. Each of these options is discussed in subsequent subsections.

Storing MPI Experiments

Because multiprocessing environments can be complex, there are some issues about storing MPI experiments you should be aware of when you collect performance data from MPI programs. These issues concern the efficiency of data collection and storage, and the naming of experiments. See Where the Data Is Stored for information on naming experiments, including MPI experiments.

Each MPI process that collects performance data creates its own experiment. When an MPI process creates an experiment, it locks the experiment directory. All other MPI processes must wait until the lock is released before they can use the directory. Thus, if you store the experiments on a file system that is accessible to all MPI processes, the experiments are created sequentially, but if you store the experiments on file systems that are local to each MPI process, the experiments are created concurrently.

If you store the experiments on a common file system and specify an experiment name in the standard format, experiment.n.er, the experiments have unique names. The value of n is determined by the order in which MPI processes obtain a lock on the experiment directory, and cannot be guaranteed to correspond to the MPI rank of the process. If you attach dbx to MPI processes in a running MPI job, n will be determined by the order of attachment.

If you store the experiments on a local file system and specify an experiment name in the standard format, the names are not unique. For example, suppose you ran an MPI job on a machine with 4 single-processor nodes labelled node0, node1, node2 and node3. Each node has a local disk called /scratch, and you store the experiments in directory username on this disk. The experiments created by the MPI job have the following full path names.

node0:/scratch/username/test.1.er

node1:/scratch/username/test.1.er

node2:/scratch/username/test.1.er

node3:/scratch/username/test.1.er

The full name including the node name is unique, but in each experiment directory there is an experiment named test.1.er. If you move the experiments to a common location after the MPI job is completed, you must make sure that the names remain unique. For example, to move these experiments to your home directory, which is assumed to be accessible from all nodes, and rename the experiments, type the following commands.

rsh node0 'er_mv /scratch/username/test.1.er test.0.er'

rsh node1 'er_mv /scratch/username/test.1.er test.1.er'

rsh node2 'er_mv /scratch/username/test.1.er test.2.er'

rsh node3 'er_mv /scratch/username/test.1.er test.3.er'

For large MPI jobs, you might want to move the experiments to a common location using a script. Do not use the Unix commands cp or mv; see Manipulating Experiments for information on how to copy and move experiments.

If you do not specify an experiment name, the Collector uses the MPI rank to construct an experiment name with the standard form experiment.n.er, but in this case n is the MPI rank. The stem, experiment, is the stem of the experiment group name if you specify an experiment group, otherwise it is test. The experiment names are unique, regardless of whether you use a common file system or a local file system. Thus, if you use a local file system to record the experiments and copy them to a common file system, you will not have to rename the experiments when you copy them and reconstruct any experiment group file.

If you do not know which local file systems are available to you, use the df -lk command or ask your system administrator. You should always make sure that the experiments are stored in a directory that already exists, that is uniquely defined and that is not in use for any other experiment. You should also make sure that the file system has enough space for the experiments. See Estimating Storage Requirements for information on how to estimate the space needed.

Note - If you copy or move experiments between computers or nodes you cannot view the annotated source code or source lines in the annotated disassembly code unless you have access to the load objects and source files that were used to run the experiment, or a copy with the same path and timestamp.

Running the `collect` Command Under MPI

To collect data with the collect command under the control of MPI, use the following syntax.

% mprun -np n collect [collect-arguments] program-name [program-arguments]

Here, n is the number of processes to be created by MPI. This procedure creates n separate instances of collect, each of which records an experiment. Read the section Where the Data Is Stored for information on where and how to store the experiments.

To ensure that the sets of experiments from different MPI runs are stored separately, you can create an experiment group with the -g option for each MPI run. The experiment group should be stored on a file system that is accessible to all MPI processes. Creating an experiment group also makes it easier to load the set of experiments for a single MPI run into the Performance Analyzer. An alternative to creating a group is to specify a separate directory for each MPI run with the -d option.

Collecting Data by Starting `dbx` Under MPI

To start dbx and collect data under the control of MPI, use the following syntax.

% mprun -np n dbx program-name < collection-script

Here, n is the number of processes to be created by MPI and collection-script is a dbx script that contains the commands necessary to set up and start data collection. This procedure creates n separate instances of dbx, each of which records an experiment on one of the MPI processes. If you do not define the experiment name, the experiment will be labelled with the MPI rank. Read the section Storing MPI Experiments for information on where and how to store the experiments.

You can name the experiments with the MPI rank by using the collection script and a call to MPI_Comm_rank() in your program. For example, in a C program you would insert the following line.

ier = MPI_Comm_rank(MPI_COMM_WORLD,&me);

In a Fortran program you would insert the following line.

call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)

If this call was inserted at line 17, for example, you could use a script like this.

stop at 18

run program-arguments

rank=$[me]

collector enable

collector store filename experiment.$rank.er

cont

quit

^{1 (FootNote) The terms "Java virtual machine" and "JVM" mean a virtual machine for the Java platform.}

Preparing Your Program for Data Collection and Analysis

Use of System Libraries

Use of Signal Handlers

Use of setuid

Controlling Data Collection From Your Program

collector_sample(char *name) (C and C++)

collector_sample(string) (Fortran)

collector_pause()

collector_resume()

collector_terminate_expt()

Dynamic Functions and Modules

collector_func_load()

collector_func_unload()

collector_module_load()

collector_module_unload()