Sun Studio 12 Update 1: Performance Analyzer

Chapter 3 Collecting Performance Data

The first stage of performance analysis is data collection. This chapter describes what is required for data collection, where the data is stored, how to collect data, and how to manage the data collection. For more information about the data itself, see Chapter 2, Performance Data.

This chapter covers the following topics.

Compiling and Linking Your Program

You can collect and analyze data for a program compiled with almost any compiler option, but some choices affect what you can collect or what you can see in the Performance Analyzer. The issues that you should take into account when you compile and link your program are described in the following subsections.

Source Code Information

To see source code in annotated Source and Disassembly analyses, and source lines in the Lines analyses, you must compile the source files of interest with the -g compiler option (-g0 for C++ to enable front-end inlining) to generate debug symbol information. The format of the debug symbol information can be either DWARF2 or stabs, as specified by -xdebugformat=(dwarf|stabs) . The default debug format is dwarf.

To prepare compilation objects with debug information that allows dataspace profiles, currently only for SPARC® processors, compile by specifying -xhwcprof and any level of optimization. (Currently, this functionality is not available without optimization.) To see program data objects in Data Objects analyses, also add -g (or -g0 for C++) to obtain full symbolic information.

Executables and libraries built with DWARF format debugging symbols automatically include a copy of each constituent object file’s debugging symbols. Executables and libraries built with stabs format debugging symbols also include a copy of each constituent object file’s debugging symbols if they are linked with the -xs option, which leaves stabs symbols in the various object files as well as the executable. The inclusion of this information is particularly useful if you need to move or remove the object files. With all of the debugging symbols in the executables and libraries themselves, it is easier to move the experiment and the program-related files to a new location.

Static Linking

When you compile your program, you must not disable dynamic linking, which is done with the -dn and -Bstatic compiler options. If you try to collect data for a program that is entirely statically linked, the Collector prints an error message and does not collect data. The error occurs because the collector library, among others, is dynamically loaded when you run the Collector.

Do not statically link any of the system libraries. If you do, you might not be able to collect any kind of tracing data. Also, do not link to the Collector library, libcollector.so.

Shared Object Handling

Normally the collect command causes data to be collected for all shared objects in the address space of the target, whether they are on the initial library list, or are explicitly loaded with dlopen(). However, under some circumstances some shared objects are not profiled:

When the target program is invoked with lazy loading. In such cases, the library is not loaded at startup time, and is not loaded by explicitly calling dlopen(), so shared object is not included in the experiment, and all PCs from it are mapped to the <Unknown> function. The workaround is to set the LD_BIND_NOW environment variable, which forces the library to be loaded at startup time.
When the executable was built with the -B option. In this case, the object is dynamically loaded by a call specifically to the dynamic linker entry point of dlopen()(), and the libcollector interposition is bypassed. The shared object name is not included in the experiment, and all PCs from it are mapped to the <Unknown>() function. The workaround is to not use the -B option.

Optimization at Compile Time

If you compile your program with optimization turned on at some level, the compiler can rearrange the order of execution so that it does not strictly follow the sequence of lines in your program. The Performance Analyzer can analyze experiments collected on optimized code, but the data it presents at the disassembly level is often difficult to relate to the original source code lines. In addition, the call sequence can appear to be different from what you expect if the compiler performs tail-call optimizations. See Tail-Call Optimization for more information.

Compiling Java Programs

No special action is required for compiling Java programs with the javac command.

Preparing Your Program for Data Collection and Analysis

You do not need to do anything special to prepare most programs for data collection and analysis. You should read one or more of the subsections below if your program does any of the following:

Installs a signal handler
Explicitly dynamically loads a system library
Dynamically compiles functions
Creates descendant processes that you want to profile
Uses the asynchronous I/O library
Uses the profiling timer or hardware counter API directly
Calls setuid(2) or executes a setuid file

Also, if you want to control data collection from your program, you should read the relevant subsection.

Using Dynamically Allocated Memory

Many programs rely on dynamically-allocated memory, using features such as:

malloc, valloc, alloca (C/C++)
new (C++)
Stack local variables (Fortran)
MALLOC, MALLOC64 (Fortran)

You must take care to ensure that a program does not rely on the initial contents of dynamically allocated memory, unless the memory allocation method is explicitly documented as setting an initial value: for example, compare the descriptions of calloc and malloc in the man page for malloc(3C).

Occasionally, a program that uses dynamically-allocated memory might appear to work correctly when run alone, but might fail when run with performance data collection enabled. Symptoms might include unexpected floating point behavior, segmentation faults, or application-specific error messages.

Such behavior might occur if the uninitialized memory is, by chance, set to a benign value when the application is run alone, but is set to a different value when the application is run in conjunction with the performance data collection tools. In such cases, the performance tools are not at fault. Any application that relies on the contents of dynamically allocated memory has a latent bug: an operating system is at liberty to provide any content whatsoever in dynamically allocated memory, unless explicitly documented otherwise. Even if an operating system happens to always set dynamically allocated memory to a certain value today, such latent bugs might cause unexpected behavior with a later revision of the operating system, or if the program is ported to a different operating system in the future.

The following tools may help in finding such latent bugs:

f95 -xcheck=init_local

For more information, see the Fortran User’s Guide or the f95(1) man page.
lint utility

For more information, see the C User’s Guide or the lint(1) man page.
Runtime checking under dbx

For more information, see the Debugging a Program With dbx manual or the dbx(1) man page.
Rational Purify

Using System Libraries

The Collector interposes on functions from various system libraries, to collect tracing data and to ensure the integrity of data collection. The following list describes situations in which the Collector interposes on calls to library functions.

Collecting synchronization wait tracing data. The Collector interposes on functions from the Solaris C library, libc.so, on the Solaris 10 OS.
Collecting heap tracing data. The Collector interposes on the functions malloc, realloc, memalign and free. Versions of these functions are found in the C standard library, libc.so and also in other libraries such as libmalloc.so and libmtmalloc.so.
Collecting MPI tracing data. The Collector interposes on functions from the Solaris MPI library.
Ensuring the integrity of clock data. The Collector interposes on setitimer and prevents the program from using the profiling timer.
Ensuring the integrity of hardware counter data. The Collector interposes on functions from the hardware counter library, libcpc.so and prevents the program from using the counters. Calls from the program to functions from this library return a value of -1.
Enabling data collection on descendant processes. The Collector interposes on the functions fork(2), fork1(2), vfork(2), fork(3F), system(3C), system(3F), sh(3F), popen(3C), and exec(2) and its variants. Calls to vfork are replaced internally by calls to fork1. These interpositions are only done for the collect command.
Guaranteeing the handling of the SIGPROF and SIGEMT signals by the Collector. The Collector interposes on sigaction to ensure that its signal handler is the primary signal handler for these signals.

Under some circumstances the interposition does not succeed:

Statically linking a program with any of the libraries that contain functions that are interposed.
Attaching dbx to a running application that does not have the collector library preloaded.
Dynamically loading one of these libraries and resolving the symbols by searching only within the library.

The failure of interposition by the Collector can cause loss or invalidation of performance data.

The er_sync.so, er_heap.so, and er_mpviewn.so (where n indicates the MPI version) libraries are loaded only if synchronization wait tracing data, heap tracing data, or MPI tracing data, respectively, are requested.

Using Signal Handlers

The Collector uses two signals to collect profiling data: SIGPROF for all experiments, and SIGEMT (on Solaris platforms) or SIGIO (on Linux platforms) for hardware counter experiments only. The Collector installs a signal handler for each of these signals. The signal handler intercepts and processes its own signal, but passes other signals on to any other signal handlers that are installed. If a program installs its own signal handler for these signals, the Collector reinstalls its signal handler as the primary handler to guarantee the integrity of the performance data.

The collect command can also use user-specified signals for pausing and resuming data collection and for recording samples. These signals are not protected by the Collector although a warning is written to the experiment if a user handler is installed. It is your responsibility to ensure that there is no conflict between use of the specified signals by the Collector and any use made by the application of the same signals.

The signal handlers installed by the Collector set a flag that ensures that system calls are not interrupted for signal delivery. This flag setting could change the behavior of the program if the program’s signal handler sets the flag to permit interruption of system calls. One important example of a change in behavior occurs for the asynchronous I/O library, libaio.so, which uses SIGPROF for asynchronous cancel operations, and which does interrupt system calls. If the collector library, libcollector.so, is installed, the cancel signal invariably arrives too late to cancel the asynchronous I/O operation.

If you attach dbx to a process without preloading the collector library and enable performance data collection, and the program subsequently installs its own signal handler, the Collector does not reinstall its own signal handler. In this case, the program’s signal handler must ensure that the SIGPROF and SIGEMT signals are passed on so that performance data is not lost. If the program’s signal handler interrupts system calls, both the program behavior and the profiling behavior are different from when the collector library is preloaded.

Using `setuid` and `setgid`

Restrictions enforced by the dynamic loader make it difficult to use setuid(2) and collect performance data. If your program calls setuid or executes a setuid file, it is likely that the Collector cannot write an experiment file because it lacks the necessary permissions for the new user ID.

The collect command operates by inserting a shared library, libcollector.so, into the target's address space (LD_PRELOAD). Several problems might arise if you invoke the collect command invoked on executables that call setuid or setgid, or that create descendant processes that call setuid or setgid. If you are not root when you run an experiment, collection fails because the shared libraries are not installed in a trusted directory. The workaround is to run the experiments as root, or use crle(1) to grant permission. Take great care when circumventing security barriers; you do so at your own risk.

When running the collect command, your umask must be set to allow write permission for you, and for any users or groups that are set by the setuid attributes and setgid attributes of a program being executed with exec(), and for any user or group to which that program sets itself. If the mask is not set properly, some files might not be written to the experiment, and processing of the experiment might not be possible. If the log file can be written, an error is shown when you attempt to process the experiment.

Other problems can arise if the target itself makes any of the system calls to set UID or GID, or if it changes its umask and then forks or runs exec() on some other executable, or crle was used to configure how the runtime linker searches for shared objects.

If an experiment is started as root on a target that changes its effective GID, the er_archive process that is automatically run when the experiment terminates fails, because it needs a shared library that is not marked as trusted. In that case, you can run the er_archive utility (or the er_print utility or the analyzer command) explicitly by hand, on the machine on which the experiment was recorded, immediately following the termination of the experiment.

Program Control of Data Collection

If you want to control data collection from your program, the Collector shared library, libcollector.so contains some API functions that you can use. The functions are written in C. A Fortran interface is also provided. Both C and Fortran interfaces are defined in header files that are provided with the library.

The API functions are defined as follows.

void collector_sample(char *name);
void collector_pause(void);
void collector_resume(void);
void collector_terminate_expt(void);

Similar functionality is provided for Java^TM programs by the CollectorAPI class, which is described in The Java Interface.

The C and C++ Interface

You can access the C and C++ interface in two ways:

Include collectorAPI.h and link with -lcollectorAPI, which contains real functions to check for the existence of the underlying libcollector.so API functions.

This way requires that you link with an API library, and works under all circumstances. If no experiment is active, the API calls are ignored.
Include libcollector.h, which contains macros that check for the existence of the underlying libcollector.so API functions.

This way works when used in the main executable, and when data collection is started at the same time the program starts. This way does not always work when dbx is used to attach to the process, nor when used from within a shared library that is dlopen’d by the process. This second way is provided for backward compatibility only, and its use is discouraged for any other purpose.

Note –
Do not link a program in any language with -lcollector. If you do, the Collector can exhibit unpredictable behavior.

The Fortran Interface

The Fortran API libfcollector.h file defines the Fortran interface to the library. The application must be linked with -lcollectorAPI to use this library. (An alternate name for the library, -lfcollector, is provided for backward compatibility.) The Fortran API provides the same features as the C and C++ API, excluding the dynamic function and thread pause and resume calls.

Insert the following statement to use the API functions for Fortran:

include "libfcollector.h"

Note –

Do not link a program in any language with -lcollector. If you do, the Collector can exhibit unpredictable behavior.

The Java Interface

Use the following statement to import the CollectorAPI class and access the Java API. Note however that your application must be invoked with a classpath pointing to / installation_directory/lib/collector.jar where installation-directory is the directory in which the Sun Studio software is installed.

import com.sun.forte.st.collector.CollectorAPI;

The Java CollectorAPI methods are defined as follows:

CollectorAPI.sample(String name)
CollectorAPI.pause()
CollectorAPI.resume()
CollectorAPI.terminate()

The Java API includes the same functions as the C and C++ API, excluding the dynamic function API.

The C include file libcollector.h contains macros that bypass the calls to the real API functions if data is not being collected. In this case the functions are not dynamically loaded. However, using these macros is risky because the macros do not work well under some circumstances. It is safer to use collectorAPI.h because it does not use macros. Rather, it refers directly to the functions.

The Fortran API subroutines call the C API functions if performance data is being collected, otherwise they return. The overhead for the checking is very small and should not significantly affect program performance.

To collect performance data you must run your program using the Collector, as described later in this chapter. Inserting calls to the API functions does not enable data collection.

If you intend to use the API functions in a multithreaded program, you should ensure that they are only called by one thread. The API functions perform actions that apply to the process and not to individual threads. If each thread calls the API functions, the data that is recorded might not be what you expect. For example, if collector_pause() or collector_terminate_expt() is called by one thread before the other threads have reached the same point in the program, collection is paused or terminated for all threads, and data can be lost from the threads that were executing code before the API call.

The C, C++, Fortran, and Java API Functions

The descriptions of the API functions follow.

C and C++: collector_sample(char *name)

Fortran: collector_sample(string name)

Java: CollectorAPI.sample(String name)

Record a sample packet and label the sample with the given name or string. The label is displayed by the Performance Analyzer in the Event tab. The Fortran argument string is of type character.

Sample points contain data for the process and not for individual threads. In a multithreaded application, the collector_sample() API function ensures that only one sample is written if another call is made while it is recording a sample. The number of samples recorded can be less than the number of threads making the call.

The Performance Analyzer does not distinguish between samples recorded by different mechanisms. If you want to see only the samples recorded by API calls, you should turn off all other sampling modes when you record performance data.
C, C++, Fortran: collector_pause()

Java: CollectorAPI.pause()

Stop writing event-specific data to the experiment. The experiment remains open, and global data continues to be written. The call is ignored if no experiment is active or if data recording is already stopped. This function stops the writing of all event-specific data even if it is enabled for specific threads by the collector_thread_resume() function.
C, C++, Fortran: collector_resume()

Java: CollectorAPI.resume()

Resume writing event-specific data to the experiment after a call to collector_pause() . The call is ignored if no experiment is active or if data recording is active.
C, C++, Fortran: collector_terminate_expt()

Java: CollectorAPI.terminate

Terminate the experiment whose data is being collected. No further data is collected, but the program continues to run normally. The call is ignored if no experiment is active.

Dynamic Functions and Modules

If your C or C++ program dynamically compiles functions into the data space of the program, you must supply information to the Collector if you want to see data for the dynamic function or module in the Performance Analyzer. The information is passed by calls to collector API functions. The definitions of the API functions are as follows.

void collector_func_load(char *name, char *alias,
    char *sourcename, void *vaddr, int size, int lntsize,
    Lineno *lntable);
void collector_func_unload(void *vaddr);

You do not need to use these API functions for Java methods that are compiled by the Java HotSpot^TM virtual machine, for which a different interface is used. The Java interface provides the name of the method that was compiled to the Collector. You can see function data and annotated disassembly listings for Java compiled methods, but not annotated source listings.

The descriptions of the API functions follow.

`collector_func_load()`

Pass information about dynamically compiled functions to the Collector for recording in the experiment. The parameter list is described in the following table.

Table 3–1 Parameter List for collector_func_load()


Parameter	Definition
`name`	The name of the dynamically compiled function that is used by the performance tools. The name does not have to be the actual name of the function. The name need not follow any of the normal naming conventions of functions, although it should not contain embedded blanks or embedded quote characters.
`alias`	An arbitrary string used to describe the function. It can be `NULL`. It is not interpreted in any way, and can contain embedded blanks. It is displayed in the Summary tab of the Analyzer. It can be used to indicate what the function is, or why the function was dynamically constructed.
`sourcename`	The path to the source file from which the function was constructed. It can be `NULL`. The source file is used for annotated source listings.
`vaddr`	The address at which the function was loaded.
`size`	The size of the function in bytes.
`lntsize`	A count of the number of entries in the line number table. It should be zero if line number information is not provided.
`lntable`	A table containing `lntsize` entries, each of which is a pair of integers. The first integer is an offset, and the second entry is a line number. All instructions between an offset in one entry and the offset given in the next entry are attributed to the line number given in the first entry. Offsets must be in increasing numeric order, but the order of line numbers is arbitrary. If `lntable` is `NULL`, no source listings of the function are possible, although disassembly listings are available.

`collector_func_unload()`

Inform the collector that the dynamic function at the address vaddr has been unloaded.

Limitations on Data Collection

This section describes the limitations on data collection that are imposed by the hardware, the operating system, the way you run your program, or by the Collector itself.

There are no limitations on simultaneous collection of different data types: you can collect any data type with any other data type, with the exception of count data.

The Collector can support up to 16K user threads. Data from additional threads is discarded, and a collector error is generated. To support more threads, set the SP_COLLECTOR_NUMTHREADS environment variable to a larger number.

By default, the Collector collects stacks that are, at most, up to 256 frames deep. To support deeper stacks, set the SP_COLLECTOR_STACKBUFSZ environment variable to a larger number.

Limitations on Clock-Based Profiling

The minimum value of the profiling interval and the resolution of the clock used for profiling depend on the particular operating environment. The maximum value is set to 1 second. The value of the profiling interval is rounded down to the nearest multiple of the clock resolution. The minimum and maximum value and the clock resolution can be found by typing the collect command with no arguments.

Runtime Distortion and Dilation with Clock-profiling

Clock-based profiling records data when a SIGPROF signal is delivered to the target. It causes dilation to process that signal, and unwind the call stack. The deeper the call stack, and the more frequent the signals, the greater the dilation. To a limited extent, clock-based profiling shows some distortion, deriving from greater dilation for those parts of the program executing with the deepest stacks.

Where possible, a default value is set not to an exact number of milliseconds, but to slightly more or less than an exact number (for example, 10.007 ms or 0.997 ms) to avoid correlations with the system clock, which can also distort the data. Set custom values the same way on SPARC platforms (not possible on Linux platforms).

Limitations on Collection of Tracing Data

You cannot collect any kind of tracing data from a program that is already running unless the Collector library, libcollector.so, had been preloaded. See Collecting Tracing Data From a Running Program for more information.

Runtime Distortion and Dilation with Tracing

Tracing data dilates the run in proportion to the number of events that are traced. If done with clock-based profiling, the clock data is distorted by the dilation induced by tracing events.

Limitations on Hardware Counter Overflow Profiling

Hardware counter overflow profiling has several limitations:

You can only collect hardware counter overflow data on processors that have hardware counters and that support overflow profiling. On other systems, hardware counter overflow profiling is disabled. UltraSPARC® processors prior to the UltraSPARC III processor family do not support hardware counter overflow profiling.
You cannot collect hardware counter overflow data on a system running the Solaris OS while the cpustat(1) command is running, because cpustat takes control of the counters and does not let a user process use the counters. If cpustat is started during data collection, the hardware counter overflow profiling is terminated and an error is recorded in the experiment.
You cannot use the hardware counters in your own code if you are doing hardware counter overflow profiling. The Collector interposes on the libcpc library functions and returns with a return value of -1 if the call did not come from the Collector. Your program should be coded so as to work correctly if it fails to get access to the hardware counters. If not coded to handle this, the program will fail under hardware counter profiling, or if the superuser invokes system-wide tools that also use the counters, or if the counters are not supported on that system.
If you try to collect hardware counter data on a running program that is using the hardware counter library by attaching dbx to the process, the experiment may be corrupted.

Note –
To view a list of all available counters, run the collect command with no arguments.

Runtime Distortion and Dilation With Hardware Counter Overflow Profiling

Hardware counter overflow profiling records data when a SIGEMT signal (on Solaris platforms) or a SIGIO signal (on Linux platforms) is delivered to the target. It causes dilation to process that signal, and unwind the call stack. Unlike clock-based profiling, for some hardware counters, different parts of the program might generate events more rapidly than other parts, and show dilation in that part of the code. Any part of the program that generates such events very rapidly might be significantly distorted. Similarly, some events might be generated in one thread disproportionately to the other threads.

Limitations on Data Collection for Descendant Processes

You can collect data on descendant processes subject to some limitations.

If you want to collect data for all descendant processes that are followed by the Collector, you must use the collect command with the one of the following options:

-F on option enables you to collect data automatically for calls to fork and its variants and exec and its variants.
-F all option causes the Collector to follow all descendant processes, including those due to calls to system, popen, and sh.
-F '=regexp' option enables data to be collected on all descendant processes whose name or lineage matches the specified regular expression.

See Experiment Control Options for more information about the -F option.

Limitations on OpenMP Profiling

Collecting OpenMP data during the execution of the program can be very expensive. You can suppress that cost by setting the SP_COLLECTOR_NO_OMP environment variable. If you do so, the program will have substantially less dilation, but you will not see the data from slave threads propagate up to the caller, and eventually to main()(), as it normally will if that variable is not set.

A new collector for OpenMP 3.0 is enabled by default in this release. It can profile programs that use explicit tasking. Programs built with earlier compilers can be profiled with the new collector only if a patched version of libmtsk.so is available. If this patched version is not installed, you can switch data collection to use the old collector by setting the SP_COLLECTOR_OLDOMP environment variable.

OpenMP profiling functionality is available only for applications compiled with the Sun Studio compilers, since it depends on the Sun Studio compiler runtime. For applications compiled with GNU compilers, only machine-level call stacks are displayed.

Limitations on Java Profiling

You can collect data on Java programs subject to the following limitations:

You should use a version of the Java 2 Software Development Kit (JDK) no earlier than JDK 6, Update 3. The Collector first looks for the JDK in the path set in either the JDK_HOME environment variable or the JAVA_PATH environment variable. If neither of these variables is set, it looks for a JDK in your PATH. If there is no JDK in your PATH, it looks for the java executable in /usr/java/bin/java. The Collector verifies that the version of the java executable it finds is an ELF executable, and if it is not, an error message is printed, indicating which environment variable or path was used, and the full path name that was tried.
You must use the collect command to collect data. You cannot use the dbx collector subcommands or the data collection capabilities of the IDE.
Applications that create descendant processes that run JVM software cannot be profiled.
If you want to use the 64-bit JVM software, you must use the -j on flag and specify the 64-bit JVM software as the target. Do not use java -d64 to collect data using the 64-bit JVM software. If you do so, no data is collected.
Some applications are not pure Java, but are C or C++ applications that invoke dlopen() to load libjvm.so, and then start the JVM software by calling into it. To profile such applications, set the SP_COLLECTOR_USE_JAVA_OPTIONS environment variable, and add the -j on option to the collect command line. Do not set the LD_LIBRARY_PATH environment variable for this scenario.

Runtime Performance Distortion and Dilation for Applications Written in the Java Programming Language

Java profiling uses the Java Virtual Machine Tools Interface (JVMTI), which can cause some distortion and dilation of the run.

For clock-based profiling and hardware counter overflow profiling, the data collection process makes various calls into the JVM software, and handles profiling events in signal handlers. The overhead of these routines, and the cost of writing the experiments to disk will dilate the runtime of the Java program. Such dilation is typically less than 10%.

Where the Data Is Stored

The data collected during one execution of your application is called an experiment. The experiment consists of a set of files that are stored in a directory. The name of the experiment is the name of the directory.

In addition to recording the experiment data, the Collector creates its own archives of the load objects used by the program. These archives contain the addresses, sizes and names of each object file and each function in the load object, as well as the address of the load object and a time stamp for its last modification.

Experiments are stored by default in the current directory. If this directory is on a networked file system, storing the data takes longer than on a local file system, and can distort the performance data. You should always try to record experiments on a local file system if possible. You can specify the storage location when you run the Collector.

Experiments for descendant processes are stored inside the experiment for the founder process.

Experiment Names

The default name for a new experiment is test.1.er . The suffix .er is mandatory: if you give a name that does not have it, an error message is displayed and the name is not accepted.

If you choose a name with the format experiment.n.er, where n is a positive integer, the Collector automatically increments n by one in the names of subsequent experiments. For example, mytest.1.er is followed by mytest.2.er, mytest.3.er , and so on. The Collector also increments n if the experiment already exists, and continues to increment n until it finds an experiment name that is not in use. If the experiment name does not contain n and the experiment exists, the Collector prints an error message.

Experiments can be collected into groups. The group is defined in an experiment group file, which is stored by default in the current directory. The experiment group file is a plain text file with a special header line and an experiment name on each subsequent line. The default name for an experiment group file is test.erg. If the name does not end in .erg, an error is displayed and the name is not accepted. Once you have created an experiment group, any experiments you run with that group name are added to the group.

You can manually create an experiment group file by creating a plain text file whose first line is

#analyzer experiment group

and adding the names of the experiments on subsequent lines. The name of the file must end in .erg.

You can also create an experiment group by using the -g argument to the collect command.

Experiments for descendant processes are named with their lineage as follows. To form the experiment name for a descendant process, an underscore, a code letter and a number are added to the stem of its creator’s experiment name. The code letter is f for a fork, x for an exec, and c for combination. The number is the index of the fork or exec (whether successful or not). For example, if the experiment name for the founder process is test.1.er, the experiment for the child process created by the third call to fork is test.1.er/_f3.er . If that child process calls exec successfully, the experiment name for the new descendant process is test.1.er/_f3_x1.er .

Moving Experiments

If you want to move an experiment to another computer to analyze it, you should be aware of the dependencies of the analysis on the operating environment in which the experiment was recorded.

The archive files contain all the information necessary to compute metrics at the function level and to display the timeline. However, if you want to see annotated source code or annotated disassembly code, you must have access to versions of the load objects or source files that are identical to the ones used when the experiment was recorded.

The Performance Analyzer searches for the source, object and executable files in the following locations in turn, and stops when it finds a file of the correct basename:

The archive directories of experiments.
The current working directory.
The absolute pathname as recorded in the executables or compilation objects.

You can change the search order or add other search directories from the Analyzer GUI or by using the setpath (see setpath path_list) and addpath ( see addpath path_list) directives. You can also augment the search with the pathmap command.

To ensure that you see the correct annotated source code and annotated disassembly code for your program, you can copy the source code, the object files and the executable into the experiment before you move or copy the experiment. If you don’t want to copy the object files, you can link your program with -xs to ensure that the information on source lines and file locations are inserted into the executable. You can automatically copy the load objects into the experiment using the -A copy option of the collect command or the dbx collector archive command.

Estimating Storage Requirements

This section gives some guidelines for estimating the amount of disk space needed to record an experiment. The size of the experiment depends directly on the size of the data packets and the rate at which they are recorded, the number of LWPs used by the program, and the execution time of the program.

The data packets contain event-specific data and data that depends on the program structure (the call stack). The amount of data that depends on the data type is approximately 50 to 100 bytes. The call stack data consists of return addresses for each call, and contains 4 bytes per address, or 8 bytes per address on 64 bit executables. Data packets are recorded for each LWP in the experiment. Note that for Java programs, there are two call stacks of interest: the Java call stack and the machine call stack, which therefore result in more data being written to disk.

The rate at which profiling data packets are recorded is controlled by the profiling interval for clock data, the overflow value for hardware counter data, and for tracing of functions, the rate of occurrences of traced functions. The choice of profiling interval parameters affects the data quality and the distortion of program performance due to the data collection overhead. Smaller values of these parameters give better statistics but also increase the overhead. The default values of the profiling interval and the overflow value have been carefully chosen as a compromise between obtaining good statistics and minimizing the overhead. Smaller values also mean more data.

For a clock-based profiling experiment or hardware counter overflow profiling experiment with a profiling interval of about 100 samples per second, and a packet size ranging from 80 bytes for a small call stack up to 120 bytes for a large call stack, data is recorded at a rate of 10 kbytes per second per thread. Applications that have call stacks with a depth of hundreds of calls could easily record data at ten times these rates.

For MPI tracing experiments, the data volume is 100‐150 bytes per traced MPI call, depending on the number of messages sent and the depth of the call stack. In addition, clock profiling is enabled by default when you use the -M option of the collect command, so add the estimated numbers for a clock profiling experiment. You can reduce data volume for MPI tracing by disabling clock profiling with the -p off option.

Note –

The Collector stores MPI tracing data in its own format (mpview.dat3) and also in the VampirTrace OTF format (a.otf, a.*.z). You can remove the OTF format files without affecting the Analyzer.

Your estimate of the size of the experiment should also take into account the disk space used by the archive files, which is usually a small fraction of the total disk space requirement (see the previous section). If you are not sure how much space you need, try running your experiment for a short time. From this test you can obtain the size of the archive files, which are independent of the data collection time, and scale the size of the profile files to obtain an estimate of the size for the full-length experiment.

As well as allocating disk space, the Collector allocates buffers in memory to store the profile data before writing it to disk. Currently no way exists to specify the size of these buffers. If the Collector runs out of memory, try to reduce the amount of data collected.

If your estimate of the space required to store the experiment is larger than the space you have available, consider collecting data for part of the run rather than the whole run. You can collect data on part of the run with the collect command with -y or -t options, with the dbx collector subcommands, or by inserting calls in your program to the collector API. You can also limit the total amount of profiling and tracing data collected with the collect command with the -L option, or with the dbx collector subcommands.

Note –

The Performance Analyzer cannot read more than 2 GB of performance data.

Collecting Data

You can collect performance data in either the standalone Performance Analyzer or the Analyzer window in the IDE in several ways:

Using the collect command from the command line (see Collecting Data Using the collect Command and the collect(1) man page). The collect command-line tool has smaller data collection overheads than dbx so this method can be superior to the others.
Using the Sun Studio Collect dialog box in the Performance Analyzer (see “Collecting Performance Data Using the Sun Studio Collect Dialog Box” in the Performance Analyzer online help).
Using the Project Properties dialog box in the IDE (see “Collecting Performance Data Using the IDE” in the Performance Analyzer online help).
Using the collector command from the dbx command line (see Collecting Data Using the dbx collector Subcommands.

The following data collection capabilities are available only with the Sun Studio Collect dialog box and the collect command:

Collecting data on Java programs. If you try to collect data on a Java program with Collector Dialog in the Debugger in the IDE or with the collector command in dbx, the information that is collected is for the JVM software, not the Java program.
Collecting data automatically on descendant processes.

Collecting Data Using the `collect` Command

To run the Collector from the command line using the collect command, type the following.

% collect collect-options program program-arguments

Here, collect-options are the collect command options, program is the name of the program you want to collect data on, and program-arguments are the program's arguments.

If no collect-options are given, the default is to turn on clock-based profiling with a profiling interval of approximately 10 milliseconds.

To obtain a list of options and a list of the names of any hardware counters that are available for profiling, type the collect command with no arguments.

% collect

For a description of the list of hardware counters, see Hardware Counter Overflow Profiling Data. See also Limitations on Hardware Counter Overflow Profiling.

Data Collection Options

These options control the types of data that are collected. See What Data the Collector Collects for a description of the data types.

If you do not specify data collection options, the default is -p on, which enables clock-based profiling with the default profiling interval of approximately 10 milliseconds. The default is turned off by the -h option but not by any of the other data collection options.

If you explicitly disable clock-based profiling, and do not enable tracing or hardware counter overflow profiling, the collect command prints a warning message, and collects global data only.

`-p` `option`

Collect clock-based profiling data. The allowed values of option are:

off– Turn off clock-based profiling.
on– Turn on clock-based profiling with the default profiling interval of approximately 10 milliseconds.
lo[w]– Turn on clock-based profiling with the low-resolution profiling interval of approximately 100 milliseconds.
hi[gh]– Turn on clock-based profiling with the high-resolution profiling interval of approximately 1 millisecond. See Limitations on Clock-Based Profiling for information on enabling high-resolution profiling.
[+]value– Turn on clock-based profiling and set the profiling interval to value. The default units for value are milliseconds. You can specify value as an integer or a floating-point number. The numeric value can optionally be followed by the suffix m to select millisecond units or u to select microsecond units. The value should be a multiple of the clock resolution. If it is larger but not a multiple it is rounded down. If it is smaller, a warning message is printed and it is set to the clock resolution.

On SPARC platforms, any value can be prepended with a + sign to enable clock-based dataspace profiling, as is done for hardware counter profiling.

Collecting clock-based profiling data is the default action of the collect command.

`-h` `counter_definition_1` `...[,counter_definition_n]`

Collect hardware counter overflow profiling data. The number of counter definitions is processor-dependent.

This option is now available on systems running the Linux operating system if you have installed the perfctr patch, which you can download from http://user.it.uu.se/~mikpe/linux/perfctr/2.6/ . Instructions for installation are contained within the tar file. The user-level libperfctr.so libraries are searched for using the value of the LD_LIBRARY_PATH environment variable, then in /usr/local/lib, /usr/lib, and /lib for the 32–bit versions, or /usr/local/lib64, /usr/lib64, and /lib64 for the 64–bit versions.

To obtain a list of available counters, type collect with no arguments in a terminal window. A description of the counter list is given in the section Hardware Counter Lists. On most systems, even if a counter is not listed, you can still specify it by a numeric value, either in hexadecimal or decimal.

A counter definition can take one of the following forms, depending on whether the processor supports attributes for hardware counters.

[+]counter_name[/ register_number][,interval ]

[+]counter_name[~ attribute_1=value_1]...[~attribute_n =value_n][/ register_number][,interval ]

The processor-specific counter_name can be one of the following:

An aliased counter name
A raw name
A numeric value in either decimal or hexadecimal

If you specify more than one counter, they must use different registers. If they do not use different registers, the collect command prints an error message and exits.

If the hardware counter counts events that relate to memory access, you can prefix the counter name with a + sign to turn on searching for the true program counter address (PC) of the instruction that caused the counter overflow. This backtracking works on SPARC processors, and only with counters of type load , store , or load-store. If the search is successful, the virtual PC, the physical PC, and the effective address that was referenced are stored in the event data packet.

On some processors, attribute options can be associated with a hardware counter. If a processor supports attribute options, then running the collect command with no arguments lists the counter definitions including the attribute names. You can specify attribute values in decimal or hexadecimal format.

The interval (overflow value) is the number of events or cycles counted at which the hardware counter overflows and the overflow event is recorded. The interval can be set to one of the following:

on, or a null string– The default overflow value, which you can determine by typing collect with no arguments.
hi[gh]– The high-resolution value for the chosen counter, which is approximately ten times shorter than the default overflow value. The abbreviation h is also supported for compatibility with previous software releases.
lo[w]– The low-resolution value for the chosen counter, which is approximately ten times longer than the default overflow value.
interval– A specific overflow value, which must be a positive integer and can be in decimal or hexadecimal format.

The default is the normal threshold, which is predefined for each counter and which appears in the counter list. See also Limitations on Hardware Counter Overflow Profiling.

If you use the -h option without explicitly specifying a-p option, clock-based profiling is turned off. To collect both hardware counter data and clock-based data, you must specify both a -h option and a -p option.

`-s` `option`

Collect synchronization wait tracing data. The allowed values of option are:

all– Enable synchronization wait tracing with a zero threshold. This option forces all synchronization events to be recorded.
calibrate– Enable synchronization wait tracing and set the threshold value by calibration at runtime. (Equivalent to on.)
off– Disable synchronization wait tracing.
on– Enable synchronization wait tracing with the default threshold, which is to set the value by calibration at runtime. (Equivalent to calibrate.)
value– Set the threshold to value, given as a positive integer in microseconds.

Synchronization wait tracing data is not recorded for Java programs; specifying it is treated as an error.

`-H` `option`

Collect heap tracing data. The allowed values of option are:

on– Turn on tracing of heap allocation and deallocation requests.
off– Turn off heap tracing.

Heap tracing is turned off by default. Heap tracing is not supported for Java programs; specifying it is treated as an error.

`-M` `option`

Specify collection of an MPI experiment. The target of the collect command should be the mpirun command, and its options should be separated from the target programs to be run by the mpirun command by a ‐‐ option. (Always use the ‐‐ option with the mpirun command so that you can collect an experiment by prepending the collect command and its option to the mpirun command line.) The experiment is named as usual and is referred to as the founder experiment; its directory contains subexperiments for each of the MPI processes, named by rank.

The allowed values of option are:

MPI-version-Turn on collection of an MPI experiment, assuming the specified MPI version.
off-Turn off collection of an MPI experiment.

By default, turn off collection of an MPI experiment. When an MPI experiment is turned on, the default setting for the -m option is changed to on.

The supported versions of MPI are printed when you type the collect command with no options, or if you specify an unrecognized version with the -M option.

`-m` `option`

Collect MPI tracing data. The allowed values of option are:

on– Turn on MPI tracing information.
off– Turn off MPI tracing information.

MPI tracing is turned off by default unless the -M option is enabled, in which case MPI tracing is turned on by default. Normally MPI experiments are collected with the -M option, and no user control of MPI tracing is needed. If you want to collect an MPI experiment, but not collect MPI tracing data, use the explicit options -M MPI-version -m off.

See MPI Tracing Data for more information about the MPI functions whose calls are traced and the metrics that are computed from the tracing data.

`-S` `option`

Record sample packets periodically. The allowed values of option are:

off– Turn off periodic sampling.
on– Turn on periodic sampling with the default sampling interval of 1 second.
value– Turn on periodic sampling and set the sampling interval to value. The interval value must be positive, and is given in seconds.

By default, periodic sampling at 1 second intervals is enabled.

`-c` `option`

Record count data, for SPARC processors only.

Note –

This feature requires you to install the Binary Interface Tool (BIT), which is part of the Add-on Cool Tools for OpenSPARC, available at http://cooltools.sunsource.net/. BIT is a tool for measuring performance or test suite coverage of SPARC binaries.

The allowed values of option are

on– Turn on collection of function and instruction count data. Count data and simulated count data are recorded for the executable and for any shared objects that are instrumented and that the executable statically links with, provided that those executables and shared objects were compiled with the -xbinopt=prepare option. Any other shared objects that are statically linked but not compiled with the -xbinopt=prepare option are not included in the data. Any shared objects that are dynamically opened are not included in the simulated count data. The data is viewed in the Instruction-Frequency tab in Performance Analyzer, or with the er_print ifreq command.
off‐ Turn off collection of count data.
static– Generates an experiment with the assumption that every instruction in the target executable and any statically linked shared objects was executed exactly once. As with the -c on option, the -c static option requires that the executables and shared objects are compiled with the -xbinopt=prepare flag.

By default, turn off collection of count data. Count data cannot be collected with any other type of data.

`-I` `directory`

Specify a directory for bit() instrumentation. This option is available only on SPARC-based systems, and is meaningful only when the -c option is also specified.

`-N` `library_name`

Specify a library to be excluded from bit() instrumentation, whether the library is linked into the executable or loaded with dlopen()(). This option is available only on SPARC-based systems, and is meaningful only when the -c option is also specified. You can specify multiple -N options.

`-r` `option`

Collect data for data race detection or deadlock detection for the Thread Analyzer. The allowed values are:

on- Turn on thread analyzer data-race-detection data
off– Turn off thread analyzer data
all– Turn on all thread analyzer data
race- Turn on thread analyzer data-race-detection data
deadlock– Collect deadlock and potential-deadlock data
dtN– Turn on specific thread analyzer data types, as named by the dt* parameters

For more information about the collect -r command and Thread Analyzer, see the Sun Studio 12: Thread Analyzer User’s Guide and the tha.1 man page.

Experiment Control Options

`-F` `option`

Control whether or not descendant processes should have their data recorded. The allowed values of option are:

on– Record experiments only on descendant processes that are created by functions fork, exec, and their variants.
all– Record experiments on all descendant processes.
off– Do not record experiments on descendant processes.
= regexp– Record experiments on all descendant processes whose name or lineage matches the specified regular expression.

If you specify the -F on option, the Collector follows processes created by calls to the functions fork(2), fork1(2), fork(3F), vfork(2), and exec(2) and its variants. The call to vfork is replaced internally by a call to fork1.

If you specify the -F all option, the Collector follows all descendant processes including those created by calls to system(3C), system(3F), sh(3F), and popen(3C), and similar functions, and their associated descendant processes.

If you specify the -F '= regexp' option, the Collector follows all descendant processes whose name or subexperiment name matches the specified regular expression. See the regexp(5) man page for information about regular expressions.

When you collect data on descendant processes, the Collector opens a new experiment for each descendant process inside the founder experiment. These new experiments are named by adding an underscore, a letter, and a number to the experiment suffix, as follows:

The letter is either an “f” to indicate a fork, an “x” to indicate an exec, or “c” to indicate any other descendant process.
The number is the index of the fork or exec (whether successful or not) or other call.

For example, if the experiment name for the initial process is test.1.er , the experiment for the child process created by its third fork is test.1.er/_f3.er. If that child process execs a new image, the corresponding experiment name is test.1.er/_f3_x1.er. If that child creates another process using a popen call, the experiment name is test.1.er/_f3_x1_c1.er.

The Analyzer and the er_print utility automatically read experiments for descendant processes when the founder experiment is read, but the experiments for the descendant processes are not selected for data display.

To select the data for display from the command line, specify the path name explicitly to either er_print or analyzer. The specified path must include the founder experiment name, and descendant experiment name inside the founder directory.

For example, here’s what you specify to see the data for the third fork of the test.1.er experiment:

er_print test.1.er/_f3.er

analyzer test.1.er/_f3.er

Alternatively, you can prepare an experiment group file with the explicit names of the descendant experiments in which you are interested.

To examine descendant processes in the Analyzer, load the founder experiment and select Filter Data from the View menu. A list of experiments is displayed with only the founder experiment checked. Uncheck it and check the descendant experiment of interest.

Note –

If the founder process exits while descendant processes are being followed, collection of data from descendants might continue. The founder experiment directory continues to grow accordingly.

`-j` `option`

Enable Java profiling when the target program is a JVM. The allowed values of option are:

on – Recognize methods compiled by the Java HotSpot virtual machine, and attempt to record Java call stacks.
off – Do not attempt to recognize methods compiled by the Java HotSpot virtual machine.
path – Record profiling data for the JVM installed in the specified path.

The -j option is not needed if you want to collect data on a .class file or a .jar file, provided that the path to the java executable is in either the JDK_HOME environment variable or the JAVA_PATH environment variable. You can then specify the target program on the collect command line as the .class file or the .jar file, with or without the extension.

If you cannot define the path to the java executable in the JDK_HOME or JAVA_PATH environment variables, or if you want to disable the recognition of methods compiled by the Java HotSpot virtual machine you can use the -j option. If you use this option, the program specified on the collect command line must be a Java virtual machine whose version is not earlier than JDK 6, Update 3. The collect command verifies that program is a JVM, and is an ELF executable; if it is not, the collect command prints an error message.

If you want to collect data using the 64-bit JVM, you must not use the -d64 option to the java command for a 32-bit JVM. If you do so, no data is collected. Instead you must specify the path to the 64-bit JVM either in the program argument to the collect command or in the JDK_HOME or JAVA_PATH environment variable.

`-J` `java_argument`

Specify additional arguments to be passed to the JVM used for profiling. If you specify the -J option, but do not specify Java profiling, an error is generated, and no experiment is run. The java_argument must be enclosed in quotation marks if it contains more than one argument. It must consist of a set of tokens separated by blanks or tabs. Each token is passed as a separate argument to the JVM. Most arguments to the JVM must begin with a “-” character.

`-l` `signal`

Record a sample packet when the signal named signal is delivered to the process.

You can specify the signal by the full signal name, by the signal name without the initial letters SIG, or by the signal number. Do not use a signal that is used by the program or that would terminate execution. Suggested signals are SIGUSR1 and SIGUSR2. Signals can be delivered to a process by the kill command.

If you use both the -l and the -y options, you must use different signals for each option.

If you use this option and your program has its own signal handler, you should make sure that the signal that you specify with -l is passed on to the Collector’s signal handler, and is not intercepted or ignored.

See the signal(3HEAD) man page for more information about signals.

`-t` `duration`

Specify a time range for data collection.

The duration can be specified as a single number, with an optional m or s suffix, to indicate the time in minutes or seconds at which the experiment should be terminated. By default, the duration is in seconds. The duration can also be specified as two such numbers separated by a hyphen, which causes data collection to pause until the first time elapses, and at that time data collection begins. When the second time is reached, data collection terminates. If the second number is a zero, data will be collected after the initial pause until the end of the program's run. Even if the experiment is terminated, the target process is allowed to run to completion.

`-x`

Leave the target process stopped on exit from the exec system call in order to allow a debugger to attach to it. If you attach dbx to the process, use the dbx commands ignore PROF and ignore EMT to ensure that collection signals are passed on to the collect command.

`-y` `signal`[ `,r`]

Control recording of data with the signal named signal. Whenever the signal is delivered to the process, it switches between the paused state, in which no data is recorded, and the recording state, in which data is recorded. Sample points are always recorded, regardless of the state of the switch.

The signal can be specified by the full signal name, by the signal name without the initial letters SIG, or by the signal number. Do not use a signal that is used by the program or that would terminate execution. Suggested signals are SIGUSR1 and SIGUSR2. Signals can be delivered to a process by the kill(1) command.

If you use both the -l and the -y options, you must use different signals for each option.

When the -y option is used, the Collector is started in the recording state if the optional r argument is given, otherwise it is started in the paused state. If the -y option is not used, the Collector is started in the recording state.

If you use this option and your program has its own signal handler, make sure that the signal that you specify with -y is passed on to the Collector’s signal handler, and is not intercepted or ignored.

See the signal(3HEAD) man page for more information about signals.

Output Options

`-o` `experiment_name`

Use experiment_name as the name of the experiment to be recorded. The experiment_name string must end in the string “.er”; if not, the collect utility prints an error message and exits.

If you do not specify the -o option, give the experiment a name of the form stem.n.er, where stem is a string, and n is a number. If you have specified a group name with the -g option, set stem to the group name without the .erg suffix. If you have not specified a group name, set stem to the string test.

If you are invoking the collect command from one of the commands used to run MPI jobs, for example, mpirun, but without the -M MPI-version option and the -o option, take the value of n used in the name from the environment variable used to define the MPI rank of that process. Otherwise, set n to one greater than the highest integer currently in use.

If the name is not specified in the form stem.n.er, and the given name is in use, an error message is displayed and the experiment is not run. If the name is of the form stem.n.er and the name supplied is in use, the experiment is recorded under a name corresponding to one greater than the highest value of n that is currently in use. A warning is displayed if the name is changed.

`-d` `directory-name`

Place the experiment in directory directory-name. This option only applies to individual experiments and not to experiment groups. If the directory does not exist, the collect utility prints an error message and exits. If a group is specified with the -g option, the group file is also written to directory-name.

For the lightest-weight data collection, it is best to record data to a local file, using the -d option to specify a directory in which to put the data. However, for MPI experiments on a cluster, the founder experiment must be available at the same path for all processes to have all data recorded into the founder experiment.

Experiments written to long-latency file systems are especially problematic, and might progress very slowly, especially if Sample data is collected (-S on option, the default). If you must record over a long-latency connection, disable Sample data.

`-g` `group-name`

Make the experiment part of experiment group group-name. If group-name does not end in .erg, the collect utility prints an error message and exits. If the group exists, the experiment is added to it. If group-name is not an absolute path, the experiment group is placed in the directory directory-name if a directory has been specified with -d, otherwise it is placed in the current directory.

`-A` `option`

Control whether or not load objects used by the target process should be archived or copied into the recorded experiment. The allowed values of option are:

off– do not archive load objects into the experiment.
on– archive load objects into the experiment.
copy– copy and archive load objects (the target and any shared objects it uses) into the experiment.

If you expect to copy experiments to a different machine from which they were recorded, or to read the experiments from a different machine, specify - A copy. Using this option does not copy any source files or object (.o) files into the experiment. Ensure that those files are accessible and unchanged from the machine on which you are examining the experiment.

`-L` `size`

Limit the amount of profiling data recorded to size megabytes. The limit applies to the sum of the amounts of clock-based profiling data, hardware counter overflow profiling data, and synchronization wait tracing data, but not to sample points. The limit is only approximate, and can be exceeded.

When the limit is reached, no more profiling data is recorded but the experiment remains open until the target process terminates. If periodic sampling is enabled, sample points continue to be written.

The default limit on the amount of data recorded is 2000 Mbytes. To remove the limit, set size to unlimited or none.

`-O` `file`

Append all output from collect itself to the name file, but do not redirect the output from the spawned target. If file is set to /dev/null, suppress all output from collect, including any error messages.

Other Options

`-P` `process_id`

Write a script for dbx to attach to the process with the given process_id, collect data from it, and then invoke dbx on the script. You can specify only profiling data, not tracing data, and timed runs (-t option) are not supported.

`-C` `comment`

Put the comment into the notes file for the experiment. You can supply up to ten -C options. The contents of the notes file are prepended to the experiment header.

`-n`

Do not run the target but print the details of the experiment that would be generated if the target were run. This option is a dry run option.

`-R`

Display the text version of the Performance Analyzer Readme in the terminal window. If the readme is not found, a warning is printed. No further arguments are examined, and no further processing is done.

`-V`

Print the current version of the collect command. No further arguments are examined, and no further processing is done.

`-v`

Print the current version of the collect command and detailed information about the experiment being run.

Collecting Data From a Running Process Using the `collect` Utility

In the Solaris OS only, the -P pid option can be used with the collect utility to attach to the process with the specified PID, and collect data from the process. The other options to the collect command are translated into a script for dbx, which is then invoked to collect the data. Only clock-based profile data (-p option) and hardware counter overflow profile data (-h option) can be collected. Tracing data is not supported.

If you use the -h option without explicitly specifying a -p option, clock-based profiling is turned off. To collect both hardware counter data and clock-based data, you must specify both a -h option and a -p option.

To Collect Data From a Running Process Using the `collect` Utility

Determine the program’s process ID (PID).

If you started the program from the command line and put it in the background, its PID will be printed to standard output by the shell. Otherwise you can determine the program’s PID by typing the following.
% ps -ef | grep program-name

Use the collect command to enable data collection on the process, and set any optional parameters.
% collect -P pid collect-options
The collector options are described in Data Collection Options. For information about clock-based profiling, see -p option. For information about hardware clock profiling, see -h option.

Collecting Data Using the `dbx` `collector` Subcommands

This section shows how to run the Collector from dbx, and then explains each of the subcommands that you can use with the collector command within dbx.

To Run the Collector From `dbx`:

Load your program into dbx by typing the following command.
% dbx program

Use the collector command to enable data collection, select the data types, and set any optional parameters.
(dbx) collector subcommand
To get a listing of available collector subcommands, type:
(dbx) help collector
You must use one collector command for each subcommand.

Set up any dbx options you wish to use and run the program.

If a subcommand is incorrectly given, a warning message is printed and the subcommand is ignored. A complete listing of the collector subcommands follows.

Data Collection Subcommands

The following subcommands control the types of data that are collected by the Collector. They are ignored with a warning if an experiment is active.

`profile` `option`

Controls the collection of clock-based profiling data. The allowed values for option are:

on– Enables clock-based profiling with the default profiling interval of 10 ms.
off– Disables clock-based profiling.
timer interval– Sets the profiling interval. The allowed values of interval are
- on– Use the default profiling interval of approximately 10 milliseconds.
- lo[w]– Use the low-resolution profiling interval of approximately 100 milliseconds.
- hi[gh]– Use the high-resolution profiling interval of approximately 1 millisecond. See Limitations on Clock-Based Profiling for information on enabling high-resolution profiling.
- value– Set the profiling interval to value. The default units for value are milliseconds. You can specify value as an integer or a floating-point number. The numeric value can optionally be followed by the suffix m to select millisecond units or u to select microsecond units. The value should be a multiple of the clock resolution. If the value is larger than the clock resolution but not a multiple it is rounded down. If the value is smaller than the clock resolution it is set to the clock resolution. In both cases a warning message is printed.
  
  The default setting is approximately 10 milliseconds.
  
  The Collector collects clock-based profiling data by default, unless the collection of hardware-counter overflow profiling data is turned on using the hwprofile subcommand.

`hwprofile` `option`

Controls the collection of hardware counter overflow profiling data. If you attempt to enable hardware counter overflow profiling on systems that do not support it, dbx returns a warning message and the command is ignored. The allowed values for option are:

on– Turns on hardware counter overflow profiling. The default action is to collect data for the cycles counter at the normal overflow value.
off– Turns off hardware counter overflow profiling.
list– Returns a list of available counters. See Hardware Counter Lists for a description of the list. If your system does not support hardware counter overflow profiling, dbx returns a warning message.
counter counter_definition... [, counter_definition ]– A counter definition takes the following form.

[+]counter_name[~ attribute_1=value_1]...[~attribute_n =value_n][/ register_number][,interval ]

Selects the hardware counter name, and sets its overflow value to interval; optionally selects additional hardware counter names and sets their overflow values to the specified intervals. The overflow value can be one of the following.
- on, or a null string– The default overflow value, which you can determine by typing collect with no arguments.
- hi[gh]– The high-resolution value for the chosen counter, which is approximately ten times shorter than the default overflow value. The abbreviation h is also supported for compatibility with previous software releases.
- lo[w]– The low-resolution value for the chosen counter, which is approximately ten times longer than the default overflow value.
- interval– A specific overflow value, which must be a positive integer and can be in decimal or hexadecimal format.
  
  If you specify more than one counter, they must use different registers. If they do not, a warning message is printed and the command is ignored.
  
  If the hardware counter counts events that relate to memory access, you can prefix the counter name with a + sign to turn on searching for the true PC of the instruction that caused the counter overflow. If the search is successful, the PC and the effective address that was referenced are stored in the event data packet.
  
  The Collector does not collect hardware counter overflow profiling data by default. If hardware-counter overflow profiling is enabled and a profile command has not been given, clock-based profiling is turned off.
  
  See also Limitations on Hardware Counter Overflow Profiling.

`synctrace` `option`

Controls the collection of synchronization wait tracing data. The allowed values for option are

on– Enable synchronization wait tracing with the default threshold.
off– Disable synchronization wait tracing.
threshold value– Sets the threshold for the minimum synchronization delay. The allowed values for value are:
- all– Use a zero threshold. This option forces all synchronization events to be recorded.
- calibrate– Set the threshold value by calibration at runtime. (Equivalent to on.)
- off– Turn off synchronization wait tracing.
- on– Use the default threshold, which is to set the value by calibration at runtime. (Equivalent to calibrate.)
- number– Set the threshold to number, given as a positive integer in microseconds. If value is 0, all events are traced.
  
  By default, the Collector does not collect synchronization wait tracing data.

`heaptrace` `option`

Controls the collection of heap tracing data. The allowed values for option are

on– Enables heap tracing.
off– Disables heap tracing.

By default, the Collector does not collect heap tracing data.

`tha` `option`

Collect data for data race detection or deadlock detection for the Thread Analyzer. The allowed values are:

on- Turn on thread analyzer data-race-detection data
off– Turn off thread analyzer data
all– Turn on all thread analyzer data
race- Turn on thread analyzer data-race-detection data
deadlock– Collect deadlock and potential-deadlock data
dtN– Turn on specific thread analyzer data types, as named by the dt* parameters

For more information about the Thread Analyzer, see the Sun Studio 12: Thread Analyzer User’s Guide and the tha.1 man page.

`sample` `option`

Controls the sampling mode. The allowed values for option are:

periodic– Enables periodic sampling.
manual– Disables periodic sampling. Manual sampling remains enabled.
period value– Sets the sampling interval to value, given in seconds.

By default, periodic sampling is enabled, with a sampling interval value of 1 second.

`dbxsample` { `on` | `off` }

Controls the recording of samples when dbx stops the target process. The meanings of the keywords are as follows:

on– A sample is recorded each time dbx stops the target process.
off– Samples are not recorded when dbx stops the target process.

By default, samples are recorded when dbx stops the target process.

Experiment Control Subcommands

`disable`

Disables data collection. If a process is running and collecting data, it terminates the experiment and disables data collection. If a process is running and data collection is disabled, it is ignored with a warning. If no process is running, it disables data collection for subsequent runs.

`enable`

Enables data collection. If a process is running but data collection is disabled, it enables data collection and starts a new experiment. If a process is running and data collection is enabled, it is ignored with a warning. If no process is running, it enables data collection for subsequent runs.

You can enable and disable data collection as many times as you like during the execution of any process. Each time you enable data collection, a new experiment is created.

`pause`

Suspends the collection of data , but leaves the experiment open. Sample points are not recorded while the Collector is paused. A sample is generated prior to a pause, and another sample is generated immediately following a resume. This subcommand is ignored if data collection is already paused.

`resume`

Resumes data collection after a pause has been issued. This subcommand is ignored if data is being collected.

`sample record` `name`

Record a sample packet with the label name. The label is displayed in the Event tab of the Performance Analyzer.

Output Subcommands

The following subcommands define storage options for the experiment. They are ignored with a warning if an experiment is active.

`archive` `mode`

Set the mode for archiving the experiment. The allowed values for mode are

on– normal archiving of load objects
off– no archiving of load objects
copy– copy load objects into experiment in addition to normal archiving

If you intend to move the experiment to a different machine, or read it from another machine, you should enable the copying of load objects. If an experiment is active, the command is ignored with a warning. This command does not copy source files or object files into the experiment.

`limit` `value`

Limit the amount of profiling data recorded to value megabytes. The limit applies to the sum of the amounts of clock-based profiling data, hardware counter overflow profiling data, and synchronization wait tracing data, but not to sample points. The limit is only approximate, and can be exceeded.

When the limit is reached, no more profiling data is recorded but the experiment remains open and sample points continue to be recorded.

The default limit on the amount of data recorded is 2000 Mbytes. This limit was chosen because the Performance Analyzer cannot process experiments that contain more than 2 Gbytes of data. To remove the limit, set value to unlimited or none.

`store` `option`

Governs where the experiment is stored. This command is ignored with a warning if an experiment is active. The allowed values for option are:

directory directory-name– Sets the directory where the experiment and any experiment group is stored. This subcommand is ignored with a warning if the directory does not exist.
experiment experiment-name– Sets the name of the experiment. If the experiment name does not end in .er, the subcommand is ignored with a warning. See Where the Data Is Stored for more information on experiment names and how the Collector handles them.
group group-name– Sets the name of the experiment group. If the group name does not end in .erg, the subcommand is ignored with a warning. If the group already exists, the experiment is added to the group. If the directory name has been set using the store directory subcommand and the group name is not an absolute path, the group name is prefixed with the directory name.

Information Subcommands

`show`

Shows the current setting of every Collector control.

`status`

Reports on the status of any open experiment.

Collecting Data From a Running Process With `dbx` on Solaris Platforms

On Solaris platforms, the Collector allows you to collect data from a running process. If the process is already under the control of dbx, you can pause the process and enable data collection using the methods described in previous sections. Starting data collection on a running process is not supported on Linux platforms.

If the process is not under the control of dbx, the collect –P pid command can be used to collect data from a running process, as described in Collecting Data From a Running Process Using the collect Utility. You can also attach dbx to it, collect performance data, and then detach from the process, leaving it to continue. If you want to collect performance data for selected descendant processes, you must attach dbx to each process.

To Collect Data From a Running Process That is Not Under the Control of `dbx`

Determine the program’s process ID (PID).

If you started the program from the command line and put it in the background, its PID will be printed to standard output by the shell. Otherwise you can determine the program’s PID by typing the following.
% ps -ef | grep program-name

Attach to the process.

From dbx, type the following.
(dbx) attach program-name pid
If dbx is not already running, type the following.
% dbx program-name pid
Attaching to a running process pauses the process.

See the manual Sun Studio 12 Update 1: Debugging a Program With dbx for more information about attaching to a process.

Start data collection.

From dbx, use the collector command to set up the data collection parameters and the cont command to resume execution of the process.

Detach from the process.

When you have finished collecting data, pause the program and then detach the process from dbx.

From dbx, type the following.
(dbx) detach

Collecting Tracing Data From a Running Program

If you want to collect any kind of tracing data, you must preload the Collector library, libcollector.so , before you run your program. To collect heap tracing data or synchronization wait tracing data, you must also preload er_heap.so and er_sync.so, respectively. These libraries provide wrappers to the real functions that enable data collection to take place. In addition, the Collector adds wrapper functions to other system library calls to guarantee the integrity of performance data. If you do not preload the libraries, these wrapper functions cannot be inserted. See Using System Libraries for more information on how the Collector interposes on system library functions.

To preload libcollector.so, you must set both the name of the library and the path to the library using environment variables, as shown in the table below. Use the environment variable LD_PRELOAD to set the name of the library. Use the environment variables LD_LIBRARY_PATH, LD_LIBRARY_PATH_32, or LD_LIBRARY_PATH_64 to set the path to the library . LD_LIBRARY_PATH is used if the _32 and _64 variants are not defined. If you have already defined these environment variables, add new values to them.

Table 3–2 Environment Variable Settings for Preloading libcollector.so, er_sync.so, and er_heap.so


Environment Variable	Value
`LD_PRELOAD`	`libcollector.so`
`LD_PRELOAD`	`er_heap.so`
`LD_PRELOAD`	`er_sync.so`
`LD_LIBRARY_PATH`	`/opt/sunstudio12.1/prod/lib/dbxruntime`
`LD_LIBRARY_PATH_32`	`/opt/sunstudio12.1/prod/lib/dbxruntime`
`LD_LIBRARY_PATH_64`	`/opt/sunstudio12.1/prod/lib/v9/dbxruntime`
`LD_LIBRARY_PATH_64`	`/opt/sunstudio12.1/prod/lib/amd64/dbxruntime`

If your Sun Studio software is not installed in /opt/sunstudio12.1, ask your system administrator for the correct path. You can set the full path in LD_PRELOAD, but doing this can create complications when using SPARC V9 64-bit architecture.

Note –

Remove the LD_PRELOAD and LD_LIBRARY_PATH settings after the run, so they do not remain in effect for other programs that are started from the same shell.

Collecting Data From MPI Programs

The Collector can collect performance data from multi-process programs that use the Message Passing Interface (MPI). The Collector supports OpenMPI-based MPIs, including Sun HPC ClusterTools^TM 7, Sun HPC ClusterTools 8 software, and MPICH2–based MPIs including MVAPICH2 and Intel MPI. The ClusterTools MPI software is available at http://www.sun.com/software/products/clustertools

For information about MPI and the MPI standard, see the MPI web site http://www.mcs.anl.gov/mpi/ . For more information about Open MPI, see the web site http://www.open-mpi.org/ .

To collect data from MPI jobs, you must use the collect command; the dbx collector subcommands cannot be used to start MPI data collection. Details are provided in Running the collect Command for MPI.

Running the `collect` Command for MPI

The collect command can be used to trace and profile MPI applications.

To collect data, use the following syntax:

collect [collect-arguments] mpirun [mpirun-arguments] -- program-name [program-arguments]

For example, the following command runs MPI tracing and profiling on each of the 16 MPI processes, storing the data in a single MPI experiment:

collect -M CT8.2 mpirun -np 16 -- a.out 3 5

The initial collect process reformats the mpirun command to specify running the collect command with appropriate arguments on each of the individual MPI processes.

The ‐‐ argument immediately before the program_name is required for MPI profiling. If you do not include the ‐‐ argument, the collect command displays an error message and no experiment is collected.

Note –

The technique of using the mpirun command to spawn explicit collect commands on the MPI processes is no longer supported for collecting MPI trace data. You can still use this technique for collecting other types of data.

Storing MPI Experiments

Because multiprocessing environments can be complex, you should be aware of some issues about storing MPI experiments when you collect performance data from MPI programs. These issues concern the efficiency of data collection and storage, and the naming of experiments. See Where the Data Is Stored for information on naming experiments, including MPI experiments.

Each MPI process that collects performance data creates its own subexperiment. While an MPI process creates an experiment, it locks the experiment directory; all other MPI processes must wait until the lock is released before they can use the directory. Store your experiments on a file system that is accessible to all MPI processes.

If you do not specify an experiment name, the default experiment name is used. Within the experiment, the Collector will create one subexperiment for each MPI rank. The Collector uses the MPI rank to construct a subexperiment name with the form M_rm.er, where m is the MPI rank.

If you plan to move the experiment to a different location after it is complete, then specify the -A copy option with the collect command. To copy or move the experiment, do not use the UNIX® cp or mv command; instead, use the er_cp or er_mv command as described in Chapter 9, Manipulating Experiments.

MPI tracing creates temporary files in /tmp/a.*.z on each node. These files are removed during the MPI_finalize() function call. Make sure that the file systems have enough space for the experiments. Before collecting data on a long running MPI application, do a short duration trial run to verify file sizes. Also see Estimating Storage Requirements for information on how to estimate the space needed.

MPI profiling is based on the open source VampirTrace 5.5.3 release. It recognizes several supported VampirTrace environment variables, and a new one, VT_STACKS, which controls whether or not call stacks are recorded in the data. For further information on the meaning of these variables, see the VampirTrace 5.5.3 documentation.

The default values of the environment variables VT_BUFFER_SIZE and VT_MAX_FLUSHES limit the internal buffer of the MPI API trace collector to 64 Mbytes and the number of times that the buffer is flushed to 1, respectively. After the limit has been reached for a particular MPI process, events are no longer written into the trace file for that process. The result can be an incomplete experiment, and in some cases, the experiment might not be readable.

To remove the limit and get a complete trace of an application, set VT_MAX_FLUSHES to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full. To change the size of the buffer, use the environment variable VT_BUFFER_SIZE. The optimal value for this variable depends on the application that is to be traced. Setting a small value will increase the memory available to the application but will trigger frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value, like 2 Gbytes, will minimize buffer flushes by the MPI API trace collector, but decrease the memory available to the application. If not enough memory is available to hold the buffer and the application data this might cause parts of the application to be swapped to disk leading also to a significant change in the behavior of the application.

Another important variable is VT_VERBOSE, which turns on various error and status messages. Set this variable to 2 or higher if problems arise.

Note –

If you copy or move experiments between computers or nodes, you cannot view the annotated source code or source lines in the annotated disassembly code unless you have access to the source files or a copy with the same timestamp. You can put a symbolic link to the original source file in the current directory in order to see the annotated source. You can also use settings in the Set Data Presentation dialog box: the Search Path tab (see Search Path Tab) lets you manage a list of directories to be used for searching for source files, the Pathmaps tab (see Pathmaps Tab) enables you to map the leading part of a file path from one location to another.

Using `collect` With `ppgsz`

You can use collect with ppgsz(1) by running collect on the ppgsz command and specifying the -F on or -F all flag. The founder experiment is on the ppgsz executable and uninteresting. If your path finds the 32-bit version of ppgsz, and the experiment is run on a system that supports 64-bit processes, the first thing it will do is exec its 64-bit version, creating _x1.er. That executable forks, creating _x1_f1.er.

The child process attempts to exec the named target in the first directory on your path, then in the second, and so forth, until one of the exec attempts succeeds. If, for example, the third attempt succeeds, the first two descendant experiments are named _x1_f1_x1.er and _x1_f1_x2.er, and both are completely empty. The experiment on the target is the one from the successful exec, the third one in the example, and is named _x1_f1_x3.er, stored under the founder experiment. It can be processed directly by invoking the Analyzer or the er_print utility on test.1.er/_x1_f1_x3.er.

If the 64-bit ppgsz is the initial process, or if the 32-bit ppgsz is invoked on a 32-bit kernel, the fork child that execs the real target has its data in _f1.er , and the real target’s experiment is in _f1_x3.er, assuming the same path properties as in the example above.

Chapter 3 Collecting Performance Data

Compiling and Linking Your Program

Source Code Information

Static Linking

Shared Object Handling

Optimization at Compile Time

Compiling Java Programs

Preparing Your Program for Data Collection and Analysis

Using Dynamically Allocated Memory

Using System Libraries

Using Signal Handlers

Using setuid and setgid

Program Control of Data Collection

The C and C++ Interface

The Fortran Interface

The Java Interface

The C, C++, Fortran, and Java API Functions

Dynamic Functions and Modules

collector_func_load()

collector_func_unload()

Limitations on Data Collection

Limitations on Clock-Based Profiling

Runtime Distortion and Dilation with Clock-profiling

Limitations on Collection of Tracing Data

Runtime Distortion and Dilation with Tracing

Limitations on Hardware Counter Overflow Profiling

Runtime Distortion and Dilation With Hardware Counter Overflow Profiling

Limitations on Data Collection for Descendant Processes

Limitations on OpenMP Profiling

Limitations on Java Profiling

Runtime Performance Distortion and Dilation for Applications Written in the Java Programming Language

Where the Data Is Stored

Experiment Names

Moving Experiments

Estimating Storage Requirements

Collecting Data

Collecting Data Using the collect Command

Data Collection Options

-p option

-h counter_definition_1 ...[,counter_definition_n]

-s option

-H option

-M option

-m option

-S option

-c option

-I directory

-N library_name

-r option

Experiment Control Options

-F option

-j option

-J java_argument

-l signal

-t duration

-x

-y signal[ ,r]

Output Options

-o experiment_name

-d directory-name

-g group-name

-A option

-L size

-O file

Other Options

-P process_id

-C comment

-n

-R

-V

-v

Collecting Data From a Running Process Using the collect Utility

To Collect Data From a Running Process Using the collect Utility

Collecting Data Using the dbx collector Subcommands

To Run the Collector From dbx:

Data Collection Subcommands

profile option

hwprofile option

synctrace option

heaptrace option

Using `setuid` and `setgid`

`collector_func_load()`

`collector_func_unload()`

Collecting Data Using the `collect` Command

`-p` `option`

`-h` `counter_definition_1` `...[,counter_definition_n]`

`-s` `option`

`-H` `option`

`-M` `option`

`-m` `option`

`-S` `option`

`-c` `option`

`-I` `directory`

`-N` `library_name`

`-r` `option`

`-F` `option`

`-j` `option`

`-J` `java_argument`

`-l` `signal`

`-t` `duration`

`-x`

`-y` `signal`[ `,r`]

`-o` `experiment_name`

`-d` `directory-name`

`-g` `group-name`

`-A` `option`

`-L` `size`

`-O` `file`

`-P` `process_id`

`-C` `comment`

`-n`

`-R`

`-V`

`-v`

Collecting Data From a Running Process Using the `collect` Utility

To Collect Data From a Running Process Using the `collect` Utility

Collecting Data Using the `dbx` `collector` Subcommands

To Run the Collector From `dbx`:

`profile` `option`

`hwprofile` `option`

`synctrace` `option`

`heaptrace` `option`

`tha` `option`

`sample` `option`

`dbxsample` { `on` | `off` }

`disable`

`enable`

`pause`

`resume`

`sample record` `name`

`archive` `mode`

`limit` `value`

`store` `option`

`show`

`status`

Collecting Data From a Running Process With `dbx` on Solaris Platforms

To Collect Data From a Running Process That is Not Under the Control of `dbx`

Running the `collect` Command for MPI

Using `collect` With `ppgsz`