Oracle Solaris Studio 12.4 Man Pages

Updated: January 2015

collect(1)

Name

collect - command used to collect program performance data

Synopsis

collect collect-arguments target target-arguments
collect
collect -V

Description

The collect command runs the target process and records performance data and global data for the process. Performance data is collected using profiling or tracing techniques. The data can be examined with a GUI program (analyzer) or a command-line program (er_print). The data collection software run by the collect command is referred to here as the Collector.

The data from a single run of the collect command is called an experiment. The experiment is represented in the file system as a directory, with various files inside that directory.

The target is the path name of the executable, Java(TM) .jar file, or Java .class file for which you want to collect performance data. (For more information about Java profiling, see JAVA PROFILING, below.) Executables that are targets for the collect command can be compiled with any level of optimization, but must use dynamic linking. If a program is statically linked, the collect command prints an error message. In order to see annotated source using analyzer or er_print, targets should be compiled with the -g flag, and should not be stripped.

In order to enable dataspace profiling, executables should be compiled with the -xhwcprof -xdebugformat=dwarf -g flags. These flags are applicable to compiling with the C, C++ and Fortran compilers, but the -xhwcprof flag is only meaningful on SPARC[R] platforms; it is ignored on other platforms. If they are not compiled with those flags, the data_layout, data_single, and data_objects commands from er_print will not show the data. Memory Object reports will show data even if the target is compiled without the -xhwcprof flag. See the section "MEMORYSPACE AND DATASPACE PROFILING" below.
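
For example, a C target might be compiled and profiled as follows; the source-file, target, and experiment names here are only illustrative:

cc -O -g -xhwcprof -xdebugformat=dwarf -o myapp myapp.c
collect -o myapp.1.er ./myapp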

The collect command uses the following strategy to find its target:

  • If a file with the specified target name exists, has execute permission set, and is an ELF executable, the collect command verifies that it can run on the current machine and then runs it. If the file is not an ELF executable, the collect command assumes it is a script, and runs it.

  • If a file with the specified target name exists, but does not have execute permission, collect checks whether the file is a Java[TM] jar file (target name ends in .jar) or class file (target name ends in .class). If the file is a jar file or class file, collect inserts the Java[TM] virtual machine (JVM) software as the target, with any necessary flags, and collects data on that JVM machine. (The terms "Java virtual machine" and "JVM" mean a virtual machine for the Java[TM] platform.) See the section on "JAVA PROFILING", below.

  • If a file with the specified target name is not found, collect searches your path to find an executable; if an executable file is found, collect verifies it as described above.

  • If a file of the target name is also not found in your path, the command looks for a file with that name and the string .class appended; if a file with the class name is found, collect inserts the JVM machine with the appropriate flags, as above.

  • If none of these procedures can find the target, the command fails.

Options

If invoked with no arguments, collect prints a usage summary, including the default configuration of the experiment.

If invoked with only the -h argument, collect prints hardware counter information. If the processor supports hardware counter overflow profiling, collect prints two lists containing information about hardware counters. The first list contains "aliased" hardware counters; the second list contains "raw" hardware counters. The output also contains the specification for the default HWC experiment for that processor. For more details, see the "Hardware Counter Overflow Profiling" section below.

If the processor does not support hardware counter overflow profiling, the output says so.

Data Specifications

-p option

Collect clock-based profiling data. The allowed values of option are:

off

Turn off clock-based profiling

on

Turn on clock-based profiling with the default profiling interval of approximately 10 milliseconds.

lo[w]

Turn on clock-based profiling with the low-resolution profiling interval of approximately 100 milliseconds.

hi[gh]

Turn on clock-based profiling with the high-resolution profiling interval of approximately 1 millisecond.

n

Turn on clock-based profiling with a profiling interval of n. The value n can be an integer or a floating-point number, with a suffix of u for values in microseconds, or m for values in milliseconds. If no suffix is used, assume the value to be in milliseconds.

If the value is smaller than the clock profiling minimum, set it to the minimum; if it is not a multiple of the clock profiling resolution, round down to the nearest multiple of the clock resolution. If it exceeds the clock profiling maximum, report an error. If it is negative or zero, report an error.

If no explicit -p argument is given, and neither count data, race-detection data, nor deadlock data is specified, turn on clock-based profiling. If -h high or -h low is specified, requesting the default counter set for that chip at high or low frequency, the default clock-profiling rate is also set to high or low; an explicit -p argument is respected.

Clock-profiling-based dataspace and memoryspace profiling is no longer supported; all supported machines have hardware counters for memory operations.
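
For example, the following two commands are equivalent ways to request clock profiling at approximately 1-millisecond resolution on a hypothetical target:

collect -p 1m a.out
collect -p hi a.out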

-h parameter

Collect hardware counter overflow (HWC) profiling data. The allowed values of parameter are:

off

Do not collect any HW counter profiling data. If off is specified, no other -h argument may be specified. If any other value is specified, multiple -h arguments may be given and the counters from each will be used.

on

Use the default counter set defined for the specific system. That set is shown in the output from collect -h. Not all systems have a default counter set defined; if none is defined, -h on will generate an error.

hi|high

Use the default counter set defined for the specific system, but profile at high rate. Not all systems have a default counter set defined; if none is defined, -h hi will generate an error.

lo|low

Use the default counter set defined for the specific system, but profile at low rate. Not all systems have a default counter set defined; if none is defined, -h lo will generate an error.

ctr_def...[,ctr_n_def]

Collect hardware counter overflow profiles using one or more specified counters. The maximum number of counters supported (ctr_def through ctr_n_def) is processor-dependent. You can determine the maximum number of hardware counter definitions for profiling on the current machine, and see the full list of available hardware counters, as well as the default counter set, by running collect -h with no other arguments on the current machine.

Each counter definition takes the following form:

[+|-]ctr[~attr=val]...[~attrN=valN][/reg#],[interval]

The meanings of the counter definition options are as follows:

+|-

Optional parameter that can be applied to memory-related counters. + requests dataspace and memoryspace profiling data be recorded. - requests that it not be recorded.

Memory-related counters are those with type load, store, or load-store, as displayed in the counter list obtained by running the collect -h command without any other command-line arguments. Some such counters are also labeled precise.

For precise counters, on either SPARC or x86, the + is not needed; dataspace and memoryspace data will be recorded by default. A - will turn off dataspace and memoryspace data.

For SPARC only, a + specified on a non-precise memory counter will cause collect to collect dataspace data by finding the instruction that triggered the overflow, and the virtual and physical addresses of the memory reference. Such data is not recorded by default and a - is not needed to turn it off. See the section "MEMORYSPACE AND DATASPACE PROFILING" below.

ctr

Processor-specific counter name. You can ascertain the list of counter names by running the collect -h command without any other command-line arguments. On most systems, even if a counter is not listed, it can still be specified by a numeric value, either in hexadecimal (0x1234) or decimal. Drivers for older chips do not support numeric values, but drivers for more recent chips do. When a counter is specified numerically, the register number should also be specified. The numeric values to use are found in the chip-specific manufacturer's manuals. The name of the manual is given in the collect -h output. Some counters are only described in proprietary vendor manuals.

~attr=val

One or more optional attributes. On some processors, attribute options can be associated with a hardware counter. If the processor supports attribute options, then running collect -h without any other command-line arguments will also provide a list of attribute names to use for attr. The value, val, can be in decimal or hexadecimal format. Hexadecimal numbers use C program format, prefixed by a zero and a lower-case x (0xhex_number). Multiple attributes are concatenated to the counter name. The ~ in front of each attr name is required.

/reg#

Hardware register to use for the counter. If not specified, collect attempts to place the counter into the first available register and as a result, might be unable to place subsequent counters due to register conflicts. If you specify more than one counter, the counters must use different registers. You can see a list of allowable register numbers by running the collect -h command without any other command-line arguments. The / character is required if the register is specified.

interval

The sampling frequency, set by defining the counter overflow value. Valid values are as follows:

on

Use a default rate that attempts to match the resolution of the default clock-profiling rate, which is approximately 100 samples per second. On Solaris, the HWC rate can adapt at run time in response to intensive workloads. On Linux, the rate remains fixed, and so the default might not be a suitable choice for some raw counters.

hi

Set interval to approximately 10 times shorter than on.

lo

Set interval to approximately 10 times longer than on.

value

Set interval to a specific value, specified in decimal or hexadecimal format. Exercise caution in setting a numerical value, especially as setting the interval too low can overload your application or even your entire system. As a rule of thumb, aim for fewer than 1000 events per second per thread.

The interval may be omitted, in which case the value for on will be used. Even when the interval is omitted, the comma in front of it is required (except for the last counter in a -h parameter).

For raw counters, the values for hi, lo, and on are estimates, but the appropriate interval is very hard to guess for any particular program. If you specify on/hi/lo for any raw counters, and the events come in faster than 100/1000/10 per second per thread, respectively, the interval will be throttled down to a more reasonable maximum on Oracle Solaris systems.

EXAMPLES: Some valid examples of -h usage:

 
-h on
-h lo
-h hi
    Enable the default counters with default, low, or
    high rates, respectively

-h cycles,,insts,,+dcm
-h cycles -h insts -h +dcm
  Both have the same meaning: three counters: cycles, insts 
  and dataspace-profiling of D-cache misses (SPARC only)

-h cycles~system=1
  Count cycles in both user and system modes

-h 0xc0/0,10000003
  On Nehalem, that is equivalent to
  -h inst_retired.any_p/0,10000003

Some invalid examples of -h usage:

 
-h cycles -h off
  Can't use off with any other -h arguments
-h cycles,insts
  Missing comma, and "insts" does not parse as a number for 
  <interval>

If the -h argument specifies the use of hardware counters but hardware counters are in use by others at the time the command is given, the collect command will report an error and no experiment will be run.

If no -h argument is given, no HW counter profiling data will be collected. An experiment can specify both hardware counter overflow profiling and clock-based profiling. Specifying hardware counter overflow profiling will not disable clock-profiling, even if it is only enabled by default.

For more information on hardware counters, see the "Hardware Counter Overflow Profiling" section below.

-s option

Collect synchronization tracing data.

The minimum delay threshold for tracing events is set using option. The allowed values of option are:

on

Turn on synchronization delay tracing and set the threshold value by calibration at runtime

calibrate

Same as on

off

Turn off synchronization delay tracing

n

Turn on synchronization delay tracing with a threshold value of n microseconds; if n is zero, trace all events

all

Turn on synchronization delay tracing and trace all synchronization events

By default, turn off synchronization delay tracing.

On Solaris, the following functions are traced: mutex_lock(), rw_rdlock(), rw_wrlock(), cond_wait(), cond_timedwait(), cond_reltimedwait(), thr_join(), sema_wait(), pthread_mutex_lock(), pthread_rwlock_rdlock(), pthread_rwlock_wrlock(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_cond_reltimedwait_np(), pthread_join(), and sem_wait().

On Linux, the following functions are traced: pthread_mutex_lock(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_join(), and sem_wait().

For Java programs, record synchronization events for Java monitors in user code, but not for native synchronization calls within the JVM machine.
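
For example, to trace only those synchronization events whose delay exceeds 50 microseconds (the threshold and target here are illustrative):

collect -s 50 a.out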

-H option

Collect heap trace data. The allowed values of option are:

on

Turn on tracing of memory allocation requests

off

Turn off tracing of memory allocation requests

By default, turn off heap tracing.

Record heap-tracing events for any native calls. Treat calls to mmap as memory allocations.

Heap profiling is not supported for Java programs. Specifying it is treated as an error.

Note that heap tracing might produce very large experiments. Such experiments are very slow to load and browse.

-i option

Collect I/O trace data. The allowed values of option are:

on

Turn on tracing of I/O operations

off

Turn off tracing of I/O operations

By default, turn off tracing of I/O operations.

Note that I/O tracing might produce very large experiments. Such experiments are very slow to load and browse.

-M option

Specify collection of an MPI experiment. (See MPI PROFILING, below.) The target of collect should be mpirun, and its arguments should be separated from the user target (that is, the program to be run by mpirun) by an inserted -- argument. The experiment is named as usual, and is referred to as the "founder experiment"; its directory contains subexperiments for each of the MPI processes, named by rank. It is recommended that the -- argument always be used with mpirun, so that an experiment can be collected by prepending collect and its options to the mpirun command line.

The allowed values of option are:

MPI-version

Turn on collection of an MPI experiment, assuming the MPI version named. The recognized versions of MPI are printed when you type collect with no arguments, or in response to an unrecognized version specified with -M.

off

Turn off collection of an MPI experiment.

By default, turn off collection of an MPI experiment. When an MPI experiment is turned on, the default setting for -m (see below) is changed to on.

-m option

Collect MPI tracing data. (See MPI PROFILING, below.)

The allowed values of option are:

on

Turn on MPI tracing information.

off

Turn off MPI tracing information.

By default, turn off MPI tracing, except if the -M flag is enabled, in which case MPI tracing is turned on by default. Normally, MPI experiments are collected with -M, and no user control of MPI tracing is needed. If you want to collect an MPI experiment, but not collect MPI trace data, you can use the explicit flags:

-M MPI-version -m off

-c option

Collect count data. The allowed values of option are:

on

Turn on count data.

static

Turn on simulated count data, based on the assumption that every instruction was executed exactly once.

off

Turn off count data.

By default, turn off count data. Count data cannot be collected with any other type of data. For count data or simulated count data, the executable and any shared-objects that are instrumented and statically linked are counted; for count data, but not simulated count data, dynamically loaded shared objects are also instrumented and counted.

On Solaris, no special compilation is needed, although the count option is incompatible with the compile flags -p, -pg, -qp, -xpg, and -xlinkopt. On Linux, the executable must be compiled with the -xannotate=yes flag in order to collect count data.

-I directory

Specify a directory for count data instrumentation.

-N libname

Specify a library to be excluded from instrumentation for count data, whether the library is linked into the executable or loaded with dlopen(3C). Multiple -N options can be specified.
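
For example, count data might be collected while excluding a system library from instrumentation; the instrumentation directory and library name below are only illustrative:

collect -c on -I /tmp/count.instr -N libc.so.1 a.out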

-r option

Collect data for data race detection or deadlock detection for the Thread Analyzer.

The allowed values of option are:

race

Collect data for detecting data races.

deadlock

Collect data for detecting deadlocks and potential deadlocks.

all

Collect data for detecting data races, deadlocks, and potential deadlocks. Can also be specified as race,deadlock.

off

Turn off data collection for data races, deadlocks, and potential deadlocks.

on

Collect data for detecting data races (same as race).

terminate

If an unrecoverable error is detected, terminate the target process.

abort

If an unrecoverable error is detected, terminate the target process with a core dump.

continue

If an unrecoverable error is detected, allow the process to continue.

By default, turn off collection of all Thread Analyzer data.

The terminate, abort, and continue options may be added to any data-collection options, and govern the behavior when an unrecoverable error, such as a real (not potential) deadlock, is detected. The default behavior is terminate.

Thread Analyzer data cannot be collected with any tracing data, but can be collected in conjunction with clock- or hardware counter profiling data. Thread Analyzer data significantly slows down the execution of the target, and profiles might not be meaningful as applied to the user code.

Thread Analyzer experiments can be examined with either analyzer or with tha. The latter displays a simplified list of default tabs, but is otherwise identical.

In order to enable data-race detection, executables must be instrumented, either at compile time, or by invoking a postprocessor. If the target is not instrumented, and none of the shared objects on its library list is instrumented, a warning is displayed, but the experiment is run. Other Thread Analyzer data do not require instrumentation.

See the tha (1) man page or the Thread Analyzer User's Guide for more detail.
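
For example, a data-race detection experiment might be recorded and then examined with tha; the experiment name below assumes the default naming convention:

collect -r race a.out
tha test.1.er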

-S interval

Collect periodic samples at the interval specified (in seconds). Record data samples from the process, and include a timestamp and execution statistics from the kernel, among other things. The allowed values of interval are:

off

Turn off periodic sampling.

on

Turn on periodic sampling with the default sampling interval (1 second).

n

Turn on periodic sampling with a sampling interval of n in seconds; n must be positive.

By default, turn on periodic sampling.

If no data specification arguments are supplied, collect clock-based profiling data, using the default resolution.

If clock-based profiling is explicitly disabled, and neither hardware counter overflow profiling nor any kind of tracing is enabled, display a warning that no function-level data is being collected, then execute the target and record global data.

Experiment Controls

-L size

Limit the amount of profiling and tracing data recorded to size megabytes. The limit applies to the sum of all profiling data and tracing data, but not to sample points. The limit is only approximate, and can be exceeded. When the limit is reached, stop recording profiling and tracing data, but keep the experiment open and record samples until the target process terminates. The allowed values of size are:

unlimited or none

Do not impose a size limit on the experiment.

n

Impose a limit of n megabytes. The value of n must be greater than zero.

By default, there is no limit on the amount of data recorded.

-F option

Control whether or not descendant processes should have their data recorded. (Data is always collected on the founder process, independent of any -F setting.) The allowed values of option are:

on | all

Record experiments on all descendant processes.

off

Do not record experiments on any descendant processes.

=<regex>

Record experiments on those descendant processes whose executable name (a.out name) matches the regular expression. Only the basename of the executable is used, not the full path. If the <regex> that you use contains blanks or characters interpreted by your shell, be sure to enclose the full =<regex> argument in single quotes.

By default, record experiments on all descendant processes. For more details, read the sections "FOLLOWING DESCENDANT PROCESSES" and "PROFILING SCRIPTS" below.

-A option

Control whether or not load objects used by the target process should be archived or copied into the recorded experiment. The allowed values of option are:

on

Copy load objects (the target and any shared objects it uses) into the experiment. Also copy any .anc files and .o files which have Stabs or DWARF information not in the load object.

src

In addition to copying load objects as in -A on, copy into the experiment all source files and .anc files that can be found.

used[src]

In addition to copying load objects as in -A on, copy into the experiment all source files and .anc files that are referenced in the recorded data and can be found.

off

Do not copy or archive load objects or source files into the experiment.

If you copy experiments onto a different machine, or read the experiments from a different machine, specify -A on. Doing so will consume more disk space but allow the experiment to be read on other machines.

Note that -A on does not copy any sources or object files (.o's); it is your responsibility to ensure that those files are accessible from the machine where the experiment is being examined, and that they are not changed or rebuilt after the experiment was recorded.

Archiving of experiments at collection time, especially for experiments with many descendant processes, can be very expensive. A better strategy is to collect the data with -A off, and then run er_archive(1) with the -s flag after the run has terminated.

The default setting for -A is on.
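
For example, a large experiment might be recorded without archiving, and archived afterwards; the -s argument to er_archive shown here is an assumption, so check er_archive(1) for the exact values it accepts:

collect -A off -o test.1.er a.out
er_archive -s all test.1.er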

-j option

Control Java profiling when the target is a JVM machine. The allowed values of option are:

on

Record profiling data for the JVM machine, recognize methods compiled by the Java HotSpot[TM] virtual machine, and record Java call stacks.

off

Do not record Java profiling data.

<path>

Record profiling data for the JVM, and use the JVM as installed in <path>.

See the section "JAVA PROFILING", below.

You must use -j on to obtain profiling data if the target is a JVM machine. The -j on option is not needed if the target is a class or jar file. If you are using a 64-bit JVM machine, you must specify its path explicitly as the target; do not use the -d64 option for a 32-bit JVM machine. If the -j on option is specified, but the target is not a JVM machine, an invalid argument might be passed to the target, and no data would be recorded. The collect command validates the version of the JVM machine specified for Java profiling.

-J java_arg

Specify additional arguments to be passed to the JVM used for profiling. If -J is specified, Java profiling (-j on) will be enabled. The java_arg must be surrounded by quotes if it contains more than one argument. It consists of a set of tokens, separated by either a blank or a tab; each token is passed as a separate argument to the JVM. Note that most arguments to the JVM must begin with a "-" character.
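
For example, to profile a jar file while passing extra arguments to the JVM (the jar name and JVM arguments are hypothetical):

collect -J "-Xmx512m -verbose:gc" myapp.jar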

-l signal

Record a sample point whenever the given signal is delivered to the process.

See the section "DATA COLLECTION AND SIGNALS" below for more information about choosing a signal.

-y signal[,r]

Control recording of data with signal, referred to as the pause-resume signal. Whenever the given signal is delivered to the process, switch between paused (no data is recorded) and resumed (data is recorded) states. Start in the resumed state if the optional ,r flag is given, otherwise start in the paused state. This option does not affect the recording of sample points.

One use of the pause-resume signal is to start a target without collecting data, allow it to reach a steady state, and then enable data recording.

See the section "DATA COLLECTION AND SIGNALS" below for more information about choosing a signal.

Output Controls

-o experiment_name

Use experiment_name as the name of the experiment to be recorded. The experiment_name must end in the string .er; if not, print an error message and do not run the experiment.

If -o is not specified, give the experiment a name of the form stem.n.er, where stem is a string, and n is a number. If a group name has been specified with -g, set stem to the group name without the .erg suffix. If no group name has been specified, set stem to the string "test".

If invoked from one of the commands used to run MPI jobs, for example, mpirun, but without -M MPI-version, and -o is not specified, take the value of n used in the name from the environment variable used to define the MPI rank of that process. Otherwise, set n to one greater than the highest integer currently in use. (See MPI PROFILING, below.)

If the name is not specified in the form stem.n.er, and the given name is in use, print an error message and do not run the experiment. If the name is of the form stem.n.er and the name supplied is in use, record the experiment under a name corresponding to one greater than the highest value of n that is currently in use. Print a warning if the name is changed.

-d directory_name

Place the experiment in directory directory_name. If no directory is given, place the experiment in the current working directory. If a group is specified (see -g, below), the group file is also written to the directory named by -d.

For the lightest-weight data collection, it is best to record data to a local file, with -d used to specify a directory in which to put the data. However, for MPI experiments on a cluster, the founder experiment must be available at the same path to all processes to have all data recorded into the founder experiment.

Experiments written to long-latency file systems are especially problematic, and might progress very slowly, especially if Sample data is collected (-S on, the default). If you must record over a long-latency connection, disable Sample data.
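
For example, to record an experiment under an explicit name in a local scratch directory (both names hypothetical):

collect -o run.1.er -d /var/tmp/experiments a.out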

-g group_name

Add the experiment to the experiment group group_name. The group_name string must end in the string .erg; if not, report an error and do not run the experiment. The first line of a group file must contain the string

#analyzer experiment group

and each subsequent line is the name of an experiment.

-O file

Append all output from collect itself to the named file, but do not redirect the output from the spawned target, nor from dbx (as invoked with the -P argument), nor from the processes involved in recording count data (as invoked with the -c argument). If file is set to /dev/null, suppress all output from collect, including any error messages.

-t duration

Collect data for the specified duration. duration can be a single number, followed by either m, specifying minutes, or s, specifying seconds (default), or two such numbers separated by a - sign. If one number is given, data is collected from the start of the run until the given time; if two numbers are given, data is collected from the first time to the second. If the second time is zero, data is collected until the end of the run. If two non-zero numbers are given, the first must be less than the second.
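
For example, to skip the first 30 seconds of the run and stop collecting data 90 seconds after the run starts (target hypothetical):

collect -t 30-90 a.out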

Other Arguments

-P <pid>

Write a script for dbx to attach to the process with the given PID and collect data from it, then invoke dbx with that script. Clock or HW counter profiling data may be specified, but neither tracing nor count data is supported. See the collector(1) man page for more information.

When attaching to a process, the directory is created with the umask of the user running collect -P, but the experiment is written as the user running the process being attached to. If the user doing the attach is root, and the umask is not zero, the experiment will fail.

-n

Dry run: do not run the target, but print all the details of the experiment that would be run. Turn on -v.

-R

Obsolete. Will print a message to that effect and exit. This option will be removed in a future release.

-V

Print the current version. Do not examine further arguments and do no further processing.

-v

Print the current version and further detailed information about the experiment being run.

-x

Leave the target process stopped on the exit from the exec system call, in order to allow a debugger to attach to it. The collect command prints a message with the process PID.

To attach a debugger to the target once it is stopped by collect, you can follow the procedure below.

  • Obtain the PID of the process from the message printed by the collect -x command

  • Start the debugger

  • Configure the debugger to ignore SIGPROF and, if you chose to collect hardware counter data, SIGEMT on Solaris or SIGIO on Linux

  • Attach to the process using dbx's attach command.

  • Set the collector parameters for the experiment you wish to collect

  • Issue the collector enable command

  • Issue the cont command to allow the target process to run

As the process runs under the control of the debugger, the Collector records an experiment.
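
The following sketch illustrates that procedure; the target name and PID are placeholders, the collector parameters would be set with dbx's collector subcommands before enabling, and the exact dbx syntax, particularly for ignoring signals, may differ, so consult dbx(1):

collect -x a.out
    Note the PID printed by collect; the target is left stopped
dbx a.out <pid>
(dbx) ignore PROF EMT
(dbx) collector enable
(dbx) cont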

Alternatively, you can attach to the process and collect an experiment using the collect -P PID command.

FOLLOWING DESCENDANT PROCESSES

Data from the initial process spawned by collect, called the founder process, is always collected. Processes can create descendant processes by calling system library functions, including the variants of fork, exec, system, etc. If a -F argument is used, the collector can collect data for descendant processes, and it opens a new experiment for each descendant process inside the parent experiment. These new experiments are named with their lineage as follows:

  • An underscore is appended to the creator's experiment name.

  • A code letter is added: either "f" for a fork, or "x" for other descendants, including exec. On Linux, "C" is used for a descendant generated by clone(2).

  • A number is added after the code letter, which is the index of the descendant.

  • The experiment suffix, ".er" is appended to the lineage.

For example, if the experiment name for the initial process is "test.1.er", the experiment for the descendant process created by its third fork is "test.1.er/_f3.er". If that descendant process execs a new image, the corresponding experiment name is "test.1.er/_f3_x1.er".

If the default, -F on, is used, descendant processes initiated by calls to fork(2), fork1(2), fork(3F), vfork(2), and exec(2) and its variants are followed. The call to vfork is replaced internally by a call to fork1. Descendants created by calls to system(3C), system(3F), sh(3F), popen(3C), and similar functions, and their associated descendant processes, are also followed. On Linux, descendants created by clone() without the CLONE_VM flag are followed by default; descendants created with the CLONE_VM flag are treated as threads, rather than processes, and are always followed, independent of the -F setting.

If the -F =<regex> argument is used, all descendants whose name matches the regular expression are followed. When matching names, only the basename of the executable is used, not the full path, and not any arguments.

For example, to capture data on the descendant process of the first exec from the first fork from the first call to system in the founder, use:

collect -F '=_x1_f1_x1'

To capture data on all the variants of exec, but not fork, use:

collect -F '=.*_x[0-9]/*'

To capture data from a call to system("echo hello") but not system("goodbye"), use:

collect -F '=echo hello'

The Analyzer and er_print automatically read experiments for descendant processes when the founder experiment is read, and the experiments for the descendant processes are selected for data display.

To specifically select the data for display from the command line, specify the path name explicitly to either er_print or Analyzer. The specified path must include the founder experiment name, and the descendant experiment's name inside the founder directory.

For example, to see the data for the third fork of the test.1.er experiment:

er_print test.1.er/_f3.er
analyzer test.1.er/_f3.er

You can prepare an experiment group file with the explicit names of descendant experiments of interest.

To examine descendant processes in the Analyzer, load the founder experiment and choose View > Filter data. The Analyzer displays a list of experiments with only the founder experiment checked. Uncheck the founder experiment and check the descendant experiment of interest.

PROFILING SCRIPTS

The collect command no longer requires that its target be an ELF executable. If collect is invoked on a script, data is collected on the program launched to execute the script, and on all descendant processes. To collect data only on a specific process, use the -F flag to specify the name of the executable to follow.

For example, to profile the script foo.sh, but collect data primarily from the executable bar, use the command:

collect -F =bar foo.sh

Data will be collected on the founder process launched to execute the script, and all bar processes spawned from the script, but not for other processes.

JAVA PROFILING

Java profiling consists of collecting a performance experiment on the JVM machine as it runs your .class or .jar files. If possible, call stacks are collected in both the Java model and in the machine model.

Data can be shown with view mode set to User, Expert, or Machine. User mode shows each method by name, with data for interpreted and HotSpot-compiled methods aggregated together; it also suppresses data for non-user-Java threads. Expert mode separates HotSpot-compiled methods from interpreted methods, and does not suppress non-user Java threads. Machine mode shows data for interpreted Java methods against the JVM machine as it does the interpreting, while data for methods compiled with the Java HotSpot virtual machine is reported for named methods. All threads are shown. In all three modes, data is reported in the usual way for any non-OpenMP C, C++, or Fortran code called by a Java target. Such code corresponds to Java native methods. The Analyzer and the er_print utility can switch between the view mode User, view mode Expert, and view mode Machine, with User being the default.

Clock-based profiling and hardware counter overflow profiling are supported. Synchronization tracing collects data only on the Java monitor calls, and synchronization calls from native code; it does not collect data about internal synchronization calls within the JVM.

Heap tracing is not supported for Java, and generates an error if specified.

Some Java codes have shared objects contained within a jar file. The shared objects are extracted to a temporary directory when the application runs, and are deleted when the application terminates. The shared-object names are recorded in the experiment map file, but the jar file name is not. To read such experiments, be sure to add an addpath directive listing the jar file to your .er.rc file, or add the path from the Analyzer GUI, or with the addpath command in er_print. If the addpath directive is in your .er.rc file at the time the experiment is archived, the shared objects will be archived.

When collect inserts a target name of java into the argument list, it examines environment variables for a path to the java target, in the order JDK_HOME, and then JAVA_PATH. For the first of these environment variables that is set, the resultant target is verified as an ELF executable. If it is not, collect fails with an error indicating which environment variable was used, and the full path name that was tried.

If neither of those environment variables is set, the collect command uses the version set by your PATH. If there is no java in your PATH, a system default of /usr/java/bin/java is tried.

JAVA PROFILING WITH A DLOPEN

Some applications are not pure Java, but are C or C++ applications that invoke dlopen to load libjvm.so, and then start the JVM by calling into it. The collector sets an environment variable so that Java profiling is automatically enabled.

SHARED OBJECT HANDLING

Normally, the collect command causes data to be collected for all shared objects in the address space of the target, whether on the initial library list, or explicitly dlopen'd. However, there are some circumstances under which some shared objects are not profiled.

One such scenario is when the target program is built to use lazy loading of libraries. In such cases, the library is not loaded at startup time, and is not loaded by explicitly calling dlopen, so the shared object name is not included in the experiment, and all PCs from it are mapped to the <Unknown> function. The workaround is to set the LD_BIND_NOW environment variable, which forces the library to be loaded at startup time.
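
For example, assuming a Bourne-compatible shell and a hypothetical target, the variable can be set for a single collect run as follows:

LD_BIND_NOW=1 collect a.out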

Another such scenario is when the executable is built with the -B direct linking option. In that case the object is dynamically loaded by a call specifically to the dynamic linker entry point of dlopen, and the libcollector interposition is bypassed. The shared object name is not included in the experiment, and all PCs from it are mapped to the <Unknown> function. The workaround is to not use -B direct.

DATA COLLECTION AND SIGNALS

Profiling Signals

Signals are used for both clock- and hardware-counter-overflow profiling. SIGPROF is used in data collection for all experiments. The period for generating the signal depends on the data being collected. SIGEMT (Solaris) or SIGIO (Linux) is used for hardware counter overflow profiling. The overflow interval depends on the user parameter for profiling. Any user code that uses or manipulates the profiling signals may potentially interfere with data collection. When the Collector installs its signal handler for a profile signal, it sets a flag that ensures that system calls are not interrupted to deliver signals. This setting could change the behavior of a target program that uses the profiling signals for other purposes.

When the Collector installs its signal handler for a profile signal, it remembers whether or not the target had installed its own signal handler. The Collector also interposes on some signal-handling routines and does not allow the user to install a signal handler for these signals; it saves the user's handler, just as it does when the Collector replaces a user handler on starting the experiment.

Profiling signals are delivered from the profiling timer or the hardware-counter-overflow handling code in the kernel, or in response to: the kill(2), sigsend(2), tkill(2), tgkill(2), or _lwp_kill(2) system calls; the raise(3C) or sigqueue(3C) library calls; or the kill(1) command. A signal code is delivered with the signal so that the Collector can distinguish the origin. If it is delivered for profiling, it is processed by the Collector; if it is not delivered for profiling, it is delivered to the target signal handler.

When the Collector is running under dbx, the profiling signal delivered occasionally has its signal code corrupted, and a profile signal may be treated as if it were generated from a system or library call or a command. In that case, it will be incorrectly delivered to the user's handler. If the user handler was set to SIG_DFL, it will cause the process to fail with a core dump.

When the Collector is invoked after attaching to a target process, it will install its signal handler, but it cannot interpose on the signal-handling routines. If the user code installs a signal handler after the attach, that handler will override the Collector's signal handler, and data will be lost.

Note that any signal, including either of the profiling signals, may cause premature termination of a system call, and the program must be prepared to handle that behavior. When libcollector installs the signal handlers for data collection, it specifies restarting those system calls that are restartable, but some, like sleep(3C), will return early without reporting an error.

Sample and Pause-resume Signals

Signals may be specified by the user as a sample signal (-l) or a pause-resume signal (-y). SIGUSR1 or SIGUSR2 are recommended for this use, but any signal that is not used by the target may be used.

The profiling signals may be used if the process does not otherwise use them, but they should be used only if no other signal is available. The Collector interposes on some signal-handling routines and does not allow the user to install a signal handler for these signals; it saves the user's handler, just as it does when the Collector replaces a user handler on starting the experiment.

If the Collector is invoked after attaching to a target process, and the user code installs a signal handler for the sample or pause-resume signal, those signals will no longer operate as specified.

OPENMP PROFILING

Data collection for OpenMP programs collects data that can be displayed in any of the three view modes, just as for Java programs. In User mode, slave threads are shown as if they were really cloned from the master thread, and have call stacks matching those from the master thread. Frames in the call stack coming from the OpenMP runtime code (libmtsk.so) are suppressed. In Expert mode, the master and slave threads are shown differently, the explicit functions generated by the compiler are visible, and the frames from the OpenMP runtime code (libmtsk.so) are suppressed. In Machine mode, the actual native stacks are shown.

In User mode, various artificial functions are introduced as the leaf function of a call stack whenever the runtime library is in one of several states. These functions are <OMP-overhead>, <OMP-idle>, <OMP-reduction>, <OMP-implicit_barrier>, <OMP-explicit_barrier>, <OMP-lock_wait>, <OMP-critical_section_wait>, and <OMP-ordered_section_wait>.

Three additional clock-profiling metrics are added to the data for clock-profiling experiments:

 
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)

OpenMP Work is counted when the OpenMP runtime thinks the code is doing work. It includes time when the process is consuming User-CPU time, but it also can include time when the process is consuming System-CPU time, waiting for page faults, waiting for the CPU, etc. Hence, OpenMP Work can exceed User-CPU time. OpenMP Wait is accumulated when the OpenMP runtime thinks the process is waiting. OpenMP Wait can include User-CPU time for busy-waits (spin-waits), but it also includes Other-Wait time for sleep-waits.

Master Thread Time is the total time spent in the master thread. It is only available from Solaris experiments. It corresponds to wall-clock time.

The inclusive metrics are visible by default; the exclusive metrics are not. Together, OpenMP Work and OpenMP Wait sum to the Total Thread Time metric. These metrics are added for all clock-profiling and hardware counter profiling experiments.

Collecting information for every parallel-region entry in the execution of the program can be very expensive. You can suppress that cost by setting the environment variable SP_COLLECTOR_NO_OMP. If you set SP_COLLECTOR_NO_OMP, the program will have substantially less dilation, but you will not see the data from slave threads propagate up to their callers, and eventually to main(), as you would when the variable is not set.

A new collector for OpenMP 3.0 is enabled by default in this release. It can profile programs that use explicit tasking. Programs built with earlier compilers can be profiled with the new collector only if a patched version of libmtsk.so is available. If it is not installed, you can switch data collection to use the old collector by setting the environment variable SP_COLLECTOR_OLDOMP.
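
For example, assuming a Bourne-compatible shell and a hypothetical target, either variable can be set for a single collect run:

SP_COLLECTOR_NO_OMP=1 collect a.out
SP_COLLECTOR_OLDOMP=1 collect a.out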

Note that the OpenMP profiling functionality is only available for applications compiled with the Oracle Solaris Studio compilers, since it depends on the Oracle Solaris Studio compiler runtime. GNU-compiled code will only see machine-level call stacks.

MEMORYSPACE AND DATASPACE PROFILING

A memoryspace profile is a profile in which memory-related events, such as cache misses, are reported against the physical structures of the machine, such as cache lines, memory banks, or pages.

A dataspace profile is a profile in which those memory-related events are reported against the data structures whose references cause the events, rather than just the instructions where the events occur. Dataspace profiling is only available on SPARC systems running Oracle Solaris. It is not yet available on x86 systems running either Oracle Solaris or Linux.

For either memoryspace or dataspace profiling, the data collected must be hardware counter profiles using a memory-based counter. For precise counters, on either SPARC or x86 Oracle Solaris platforms, memoryspace/dataspace data is collected by default.

In order to support dataspace profiling, executables should be compiled with the -xhwcprof flag. This flag is applicable to compiling with the C, C++ and Fortran compilers, but is only meaningful on SPARC[R] platforms; the flag is ignored on other platforms. If executables are not compiled with -xhwcprof, the data_layout, data_single, and data_objects commands from er_print will not show the data. Memoryspace profiling does not require -xhwcprof for precise counters.

On machines with precise interrupts, memoryspace profiling does not require the -xhwcprof flag for compilation. Dataspace profiling, even on such machines, does require the flag.

With the data collected, the er_print utility allows three additional commands: data_objects, data_single, and data_layout, as well as various commands relating to Memory Objects. See the er_print(1) man page for more information.

In addition, Performance Analyzer includes two data views related to dataspace profiling, labeled DataObjects and DataLayout, as well as a set of views relating to Memory Objects.

MPI PROFILING

The collect command can be used for MPI profiling to manage collection of the data from the constituent MPI processes, collect MPI trace data, and organize the data into a single "founder" experiment, with "subexperiments" for each MPI process.

The collect command can be used with MPI by simply prefacing the command that starts the MPI job and its arguments with the desired collect command and its arguments (assuming you have inserted the -- argument to indicate the end of the mpirun arguments). For example, on an SMP machine,

% mpirun -np 16 -- a.out 3 5

can be replaced by

% collect -M OMPT mpirun -np 16 -- a.out 3 5

This command runs an MPI tracing experiment on each of the 16 MPI processes, collecting them all in an MPI experiment, named by the usual conventions for naming experiments. It assumes use of the Oracle Message Passing Toolkit (previously known as Sun HPC ClusterTools) version of MPI.

The initial collect process reformats the mpirun command to specify running collect with appropriate arguments on each of the individual MPI processes.

Note that the -- argument immediately before the target name is required for MPI profiling (although it is optional for mpirun itself), so that collect can separate the mpirun arguments from the target and its arguments. If the -- argument is not supplied, collect prints an error message, and no experiment is run.

Furthermore, a -x PATH argument is added to the mpirun arguments by collect, so that the remote collect commands can find their targets. If any environment variables in your environment begin with "VT_" or with "SP_COLLECTOR_", they are passed to the remote collect commands with -x flags for each.

MIMD MPI runs are supported, with the similar requirement that there must be a "--" argument after each ":" (indicating a new target and local mpirun arguments for it). If the -- argument is not supplied, collect prints an error message, and no experiment is run.

Some versions of Oracle Message Passing Toolkit (formerly Sun HPC ClusterTools) have functionality for MPI State profiling. When clock-profiling data is collected on an MPI experiment run with such a version of MPI, two additional metrics can be shown:

 
MPI Work (mpiwork)
MPI Wait (mpiwait)

MPI Work accumulates when the process is inside the MPI runtime doing work, such as processing requests or messages; MPI Wait accumulates when the process is inside the MPI runtime, but waiting for an event, buffer, or message.

On Solaris systems, MPI Wait is accumulated whether the MPI library sleeps or spins when waiting. On Linux systems, MPI Wait is accumulated when the MPI library spins when waiting; it is not accumulated if the MPI library sleeps (yields the CPU) when waiting, and will be undercounted relative to the real wait time.

In the Analyzer, when MPI trace data is collected, two additional tabs are shown, MPI Timeline and MPI Chart.

The technique of using mpirun to spawn explicit collect commands on the MPI processes is no longer supported to collect MPI trace data, and should not be used. It can still be used for all other types of data.

MPI profiling is based on the open source VampirTrace 5.5.3 release. It recognizes several VampirTrace environment variables, and a new one, VT_STACKS, which controls whether or not call stacks are recorded in the data. For further information on the meaning of these variables, see the VampirTrace 5.5.3 documentation.

The default value of the environment variable VT_BUFFER_SIZE limits the internal buffer of the MPI API trace collector to 64 MB, and the default value of VT_MAX_FLUSHES limits the number of times that the buffer is flushed to 1. Events that are to be recorded after the limits have been reached are no longer written into the trace file. The environment variables apply to every process of a parallel application, meaning that applications with n processes will typically create trace files n times the size of a serial application.

To remove the limit and get a complete trace of an application, set VT_MAX_FLUSHES to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full. To change the size of the buffer, use the environment variable VT_BUFFER_SIZE. The optimal value for this variable depends on the application which is to be traced. Setting a small value will increase the memory available to the application but will trigger frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value, like 2G, will minimize buffer flushes by the MPI API trace collector, but decrease the memory available to the application. If not enough memory is available to hold the buffer and the application data this might cause parts of the application to be swapped to disk leading also to a significant change in the behavior of the application.

Another important variable is VT_VERBOSE, which turns on various error and status messages, and setting it to 2 or higher is recommended if problems arise.
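
For example, assuming a Bourne-compatible shell, the limits might be relaxed for the mpirun example shown earlier; the buffer size here is only illustrative:

VT_BUFFER_SIZE=256M VT_MAX_FLUSHES=0 VT_VERBOSE=2 collect -M OMPT mpirun -np 16 -- a.out 3 5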

Normally, MPI trace output data is post-processed when the mpirun target exits; a processed data file is written to the experiment, and information about the post-processing time is written into the experiment header. MPI post-processing is not done if MPI tracing is explicitly disabled.

In the event of a failure in post-processing, an error is reported, and no MPI Tabs or MPI tracing metrics will be available.

If the mpirun target does not actually invoke MPI, an experiment will still be recorded, but no MPI trace data will be produced. The experiment will report an MPI post-processing error, and no MPI Tabs or MPI tracing metrics will be available.

If the environment variable VT_UNIFY is set to "0", the post-processing routines er_vtunify and er_mpipp are not run by collect. They are run the first time either er_print or analyzer is invoked on the experiment.

USING COLLECT WITH PPGSZ

The collect command can be used with ppgsz by running the collect command on the ppgsz command, and specifying the -F on flag. The founder experiment is on the ppgsz executable and is uninteresting. If your path finds the 32-bit version of ppgsz, and the experiment is being run on a system that supports 64-bit processes, the first thing the collect command does is execute an exec function on its 64-bit version, creating _x1.er. That executable forks, creating _x1_f1.er. The descendant process attempts to execute an exec function on the named target, in the first directory on your path, then in the second, and so forth, until one of the exec functions succeeds. If, for example, the third attempt succeeds, the first two descendant experiments are named _x1_f1_x1.er and _x1_f1_x2.er, and both are completely empty. The experiment on the target is the one from the successful exec, the third one in the example, and is named _x1_f1_x3.er, stored under the founder experiment. It can be processed directly by invoking the Analyzer or the er_print utility on test.1.er/_x1_f1_x3.er.

If the 64-bit ppgsz is the initial process run, or if the 32-bit ppgsz is invoked on a 32-bit kernel, the fork descendant that executes exec on the real target has its data in _f1.er, and the real target's experiment is in _f1_x3.er, assuming the same path properties as in the example above.

See the section "FOLLOWING DESCENDANT PROCESSES", above. For more information on hardware counters, see the "Hardware Counter Overflow Profiling" section below.

USING COLLECT ON SETUID/SETGID TARGETS

The collect command operates by inserting a shared library, libcollector.so, into the target's address space (LD_PRELOAD), along with additional shared libraries for specific tracing data collection. Those shared libraries write the files that constitute the experiment.

Several problems might arise if collect is invoked on executables that call setuid or setgid, or that create descendant processes that call setuid or setgid. If the user running the experiment is not root, collection fails because the shared libraries are not installed in a trusted directory. The workaround is to run the experiments as root, or use crle (1) to grant permission. Users should, of course, take great care when circumventing security barriers, and do so at their own risk.

In addition, the umask for the user running the collect command must be set to allow write permission for that user, and for any users or groups that are set by the setuid/setgid attributes of a program being exec'd and for any user or group to which that program sets itself. If the mask is not set properly, some files might not be written to the experiment, and processing of the experiment might not be possible. If the log file can be written, an error is shown when the user attempts to process the experiment.

Note that when attaching as one user to a process that is owned by another user, the umask must be set to allow writing by the user owning the process to which you are attaching.

Other problems can arise if the target itself makes any of the system calls to set UID or GID, or if it changes its umask and then forks or runs exec on some other process, or crle was used to configure how the runtime linker searches for shared objects.

If an experiment is started as root on a target that changes its effective GID, the er_archive process that is automatically run when the experiment terminates fails, because it needs a shared library that is not marked as trusted. In that case, you can run er_archive (or er_print or Analyzer) explicitly by hand, on the machine on which the experiment was recorded, immediately following the termination of the experiment.

DATA COLLECTED

Three types of data are collected: profiling data, tracing data and sampling data. The data packets recorded in profiling and tracing include the callstack of each LWP, the LWP, thread, and CPU IDs, and some event-specific data. The data packets recorded in sampling contain global data such as execution statistics, but no program-specific or event-specific data. All data packets include a timestamp.

The description of each data type below lists the metrics derived from that data, both as a display name and as the keyword the user would specify in a metrics command when examining an experiment.

Clock-based Profiling

The event-specific data recorded in clock-based profiling is an array of counts for each accounting microstate. The microstate array is incremented by the system at a prescribed frequency, and is recorded by the Collector when a profiling signal is processed.

Clock-based profiling can run at a range of frequencies which must be multiples of the clock resolution used for the profiling timer. If you try to do high-resolution profiling on a machine with an operating system that does not support it, the command prints a warning message and uses the highest resolution supported. Similarly, a custom setting that is not a multiple of the resolution supported by the system is rounded down to the nearest non-zero multiple of that resolution, and a warning message is printed.
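
For example, the following commands (a sketch; ./a.out is an illustrative target) request high-resolution and low-resolution clock-based profiling, respectively; collect adjusts the rate and prints a warning as described above if the system cannot honor the request:

collect -p hi ./a.out
collect -p lo ./a.out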

Clock-based profiling data is converted into the following metrics:

 
Total Thread Time (total)
Total CPU Time (totalcpu)
User CPU Time (user)
System CPU Time (system)
Trap CPU Time (trap)
User Lock Time (lock)
Data Page Fault Time (datapfault)
Text Page Fault Time (textpfault)
Kernel Page Fault Time (kernelpfault)
Stopped Time (stop)
Wait CPU Time (wait)
Sleep Time (sleep)

For experiments on multithreaded applications, all of the times are summed across all threads in the process. Total Thread Time adds up to the real elapsed time, multiplied by the average number of threads in the process.
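
For example, the metric keywords above can be used with the er_print utility. The following sketch (test.1.er is an illustrative experiment name) displays a function list with exclusive and inclusive User CPU Time:

er_print -metrics e.user:i.user -functions test.1.er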

If clock-based profiling is performed on an OpenMP program, three additional metrics:

 
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)

are provided. On Solaris, OpenMP Work accumulates when work is being done in parallel. OpenMP Wait accumulates when the OpenMP runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel, but the thread is not scheduled on a CPU. Master Thread Time represents time in the master thread only.

On Linux, OpenMP Work and OpenMP Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that OpenMP should do a busy wait, OpenMP Wait on Linux will not be useful. Master Thread Time is not provided on Linux.

If clock-based profiling is performed on an MPI program, run under Oracle Message Passing Toolkit or Sun HPC ClusterTools release 8.1 or later, two additional metrics:

 
MPI Work (mpiwork)
MPI Wait (mpiwait)

are provided. On Solaris, MPI Work accumulates when the MPI runtime is active. MPI Wait accumulates when the MPI runtime is waiting for the send or receive of a message, or when the MPI runtime is active, but the thread is not running on a CPU.

On Linux, MPI Work and MPI Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that MPI should do a busy wait, MPI Wait on Linux will not be useful.

Hardware Counter Overflow Profiling

Hardware counter overflow profiling records the number of events counted by the hardware counter at the time the overflow signal was processed.

The counters available depend on the specific processor chip and operating system. Running the command collect -h with no other arguments describes the processor and the number of hardware counters available, along with a list of all counters and a default hardware-counter set for that processor. The counters that are aliased to common names are displayed first in the list, followed by a list of the raw hardware counters. After the list of known counters, the name of the reference manual for the chip and the default counter set defined for that chip are printed.

If neither the performance counter subsystem nor collect knows the names of the counters on a specific chip, the tables are empty. Even so, the counters can still be specified numerically, as described above. The lines of output are formatted like the following:

Aliased HW counters available for profiling:

 
cycles[/{0|1}],<interval> ('CPU Cycles', alias for Cycle_cnt; CPU-cycles)
insts[/{0|1}],<interval> ('Instructions Executed', alias for Instr_cnt; events)
dcrm[/1],<interval> ('D$ Read Misses', alias for DC_rd_miss; load events)
...

Raw HW counters available for profiling:

 
Cycle_cnt[/{0|1}],<interval> (CPU-cycles)
Instr_cnt[/{0|1}],<interval> (events)
DC_rd[/0],<interval> (load events)
SI_snoop[/0],<interval> (not-program-related events)
...

In the first line of aliased counter output, the first field, "cycles", gives the counter name that can be used in a -h argument. It is followed by a specification of which registers can be used for that counter. The metric name is "CPU Cycles", and the raw hardware counter name is "Cycle_cnt". The last field, "CPU-cycles", specifies the type of units being counted. There can be up to two words for the type of information. The second or only word of the type information can be either "CPU-cycles" or "events". If the counter can be used to provide a time-based metric, the value is CPU-cycles; otherwise it is events.

The second output line of the aliased counter output above has "events" instead of "CPU-cycles" at the end of the line, indicating that it counts events, and cannot be converted to a time.

The third output line above has two words of type information, "load events", at the end of the line. The first word of type information can have the value of "load", "store", "load-store", or "not-program-related". The first three of these type values indicate that the counter is memory-related and the counter name can be preceded by the "+" sign when used in the collect -h command. The "+" sign indicates the request for data collection to attempt to find the precise instruction and virtual address that caused the event on the counter that overflowed.

On some chips, the counter interrupts are precise, and no "+" sign is needed. Such counters are indicated by the word "(precise)" following the event type.

The "not-program-related" value indicates that the counter captures events initiated by some other program, such as CPU-to-CPU cache snoops. Using the counter for profiling generates a warning and profiling does not record a call stack. It does, however, show the time being spent in an artificial function called "collector_not_program_related". Thread IDs and LWP IDs are recorded, but are meaningless.

Each line in the raw hardware counter list includes the internal counter name as used by cputrack (1), the register number(s) on which that counter can be used, the default overflow value, and the counter units, which is either CPU-cycles or events.

The metrics reported from hardware counter data are named by the counter used. If the counter measures in cycles, the data will be converted to time; if it measures in events, the data will be reported as an event count. A user option allows cycle-based counters to be shown as events, too.

If two specific counters, "cycles" and "insts", are collected, two additional metrics are available, "CPI" and "IPC", meaning cycles-per-instruction and instructions-per-cycle, respectively. They are always shown as a ratio, and not as a time, count, or percentage. A high value of CPI or a low value of IPC indicates code that runs inefficiently in the machine; conversely, a low value of CPI or a high value of IPC indicates code that runs efficiently in the pipeline.
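
For example, on a processor that supports both counters, the following command (a sketch; ./a.out is an illustrative target, and on selects the default overflow interval on Solaris) collects both counters so that the CPI and IPC metrics are available:

collect -h cycles,on,insts,on ./a.out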

EXAMPLES:

Example 1: Using the aliased counter information listed in the above sample output, the following command:

collect -h cycles,3600003

enables CPU Cycles profiling, with the 3600003 chosen to generate a peak event rate of approximately 1000 events/second/thread on a 3.6 GHz system. (Note that generating too high an event rate will ultimately perturb the performance you are trying to profile. On Solaris systems, you can use on/high/low instead of numeric overflow intervals.)

Example 2:

Running the collect -h command with no other arguments on an AMD Opteron machine would produce a raw hardware counter output similar to the following:

 
FP_dispatched_fpu_ops[/{0|1|2|3}],<interval> (events)
FP_cycles_no_fpu_ops_retired[/{0|1|2|3}],<interval> (CPU-cycles)
...

Using the above raw hardware counter output, the following command:

collect -h FP_dispatched_fpu_ops~umask=0x3/2,10007

enables the Floating Point Add and Multiply operations to be tracked at the rate of 1 capture every 10007 events. (For more details on valid attribute values, refer to the processor documentation.) The "/2" value specifies that the data is to be captured using hardware register 2.

Supported Solaris systems, and supported versions of Linux running the Linux kernel with version number 2.6.32 or later, have the necessary OS support for hardware counter overflow profiling already installed.

Supported Linux systems with kernels earlier than 2.6.32 use the perfctr framework; you are responsible for installing the required perfctr patch on the system. You can find the patch by searching the Web for "perfctr patch." Instructions for installation are contained within a tar file at the patch download location. The Collector searches for user-level libperfctr.so libraries using LD_LIBRARY_PATH, and then in /usr/local/lib, /usr/lib/, and /lib/ for the 32-bit versions, or /usr/local/lib64/, /usr/lib64/, and /lib64/ for the 64-bit versions.

Synchronization Delay Tracing

Synchronization delay tracing records all calls to the various thread synchronization routines where the real-time delay in the call exceeds a specified threshold. The data packet contains timestamps for entry and exit to the synchronization routines, the thread ID, and the LWP ID at the time the request is initiated. (Synchronization requests from a thread can be initiated on one LWP, but complete on another.)
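
For example, the following command (a sketch; ./a.out is an illustrative target) enables synchronization delay tracing with a threshold determined by runtime calibration:

collect -s on ./a.out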

Synchronization delay tracing data is converted into the following metrics:

 
Synchronization Wait Time (sync)
Synchronization Delay Events (syncn)

Heap Tracing

Heap tracing records all calls to malloc, free, realloc, memalign, and valloc with the size of the block requested, its address, and for realloc, the previous address. Calls to calloc are recorded on Oracle Solaris but not on Linux.

Heap tracing data is converted into the following metrics:

 
Allocations (heapalloccnt)
Bytes Allocated (heapallocbytes)
Leaks (heapleakcnt)
Bytes Leaked (heapleakbytes)

Leaks are defined as allocations that are not freed. If a zero-length block is allocated, it counts as an allocation with zero bytes allocated. If a zero-length block is not freed, it counts as a leak with zero bytes leaked.

Heap tracing experiments can be very large, and might be slow to process.
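
For example, the following command (a sketch; ./a.out is an illustrative target) enables heap tracing:

collect -H on ./a.out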

IO Tracing

IO tracing records all calls to the standard IO routines and all IO system calls.
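
For example, the following command (a sketch; ./a.out is an illustrative target) enables IO tracing:

collect -i on ./a.out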

IO tracing data is converted into the following metrics:

 
Bytes Read (ioreadbytes)
Read Count (ioreadcnt)
Read Time (ioreadtime)
Bytes Written (iowritebytes)
Write Count (iowritecnt)
Write Time (iowritetime)
Other IO Count (ioothercnt)
Other IO Time (ioothertime)
IO Error Count (ioerrorcnt)
IO Error Time (ioerrortime)

MPI Tracing

MPI tracing records calls to the MPI library for functions that can take a significant amount of time to complete. MPI tracing is implemented using the open-source VampirTrace code.

MPI tracing data is converted into the following metrics:

 
MPI Time (mpitime)
MPI Sends (mpisendcnt)
MPI Bytes Sent (mpisendbytes)
MPI Receives (mpirecvcnt)
MPI Bytes Received (mpirecvbytes)
Other MPI Events (mpiothercnt)

MPI Time is the total thread time spent in MPI functions. If MPI state times are also collected, MPI Work Time plus MPI Wait Time for all MPI functions other than MPI_Init and MPI_Finalize should approximately equal MPI Time. On Linux, MPI Wait and MPI Work are based on user+system CPU time, while MPI Time is based on real time, so the numbers will not match.

The MPI Bytes Received metric counts the actual number of bytes received in all messages. MPI Bytes Sent counts the actual number of bytes sent in all messages. MPI Sends counts the number of messages sent, and MPI Receives counts the number of messages received. MPI_Sendrecv counts as both a send and a receive. Other MPI Events counts the events in the trace that are neither sends nor receives.
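
For example, a sketch of collecting an MPI experiment under the Oracle Message Passing Toolkit (the process count and the target ./a.out are illustrative; the target and its arguments are separated from the mpirun arguments by --):

collect -M OMPT mpirun -np 4 -- ./a.out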

Count Data

Count data is recorded by instrumenting the executable, and counting the number of times each instruction was executed. It also counts the number of times the first instruction in a function is executed, and reports that count as the function execution count. On SPARC systems only, it also counts the number of times an instruction in a branch-delay slot is annulled.

Count data is converted into the following metrics:

 
Bit Func Count (bit_fcount)
Bit Inst Exec (bit_instx)
Bit Inst Annul (bit_annul) -- SPARC only

Data-race Detection Data

Data-race detection data consists of pairs of race-access events that constitute a race. The events are combined into a race, and races for which the call stacks of the two accesses are identical are merged into a race group.

Data-race detection data is converted into the following metric:

 
Race Accesses (raccess)

Deadlock Detection Data

Deadlock detection data consists of pairs of threads with conflicting locks.
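
For example, both data-race detection data and deadlock detection data are requested with the -r option of collect, and the resulting experiments are normally examined with tha (1). A sketch, with ./a.out as an illustrative instrumented target:

collect -r race ./a.out
collect -r deadlock ./a.out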

Deadlock detection data is converted into the following metric:

 
Deadlocks (deadlocks)

Sampling and Global Data

Sampling refers to the process of generating markers along the time line of execution. At each sample point, execution statistics are recorded. All of the data recorded at sample points is global to the program, and does not map to function-level metrics.

Samples are always taken at the start of the process and at its termination. Samples are also taken periodically, either at the default interval or at the interval given by a non-zero -S argument. In addition, samples can be taken by using the libcollector(3) API.

The data recorded at each sample point consists of microstate accounting information from the kernel, along with various other statistics maintained within the kernel.
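
For example, the following command (a sketch; the 10-second interval and the target ./a.out are illustrative) records periodic samples every 10 seconds in addition to the samples taken at process start and termination:

collect -S 10 ./a.out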

RESTRICTIONS

Most of the Performance Analyzer binaries depend on finding a shared library from the installation containing the binaries. Users must not set LD_LIBRARY_PATH to include library directories from a different installation of the tools; if LD_LIBRARY_PATH points to a different installation, the binaries may fail to execute.

The Collector can support up to 32K user threads. Data from additional threads is discarded, and a collector error generated. To support more threads, set the environment variable SP_COLLECTOR_NUMTHREADS to a larger number.

By default, the Collector collects stacks that are 256 frames deep. To support deeper stacks, set the environment variable SP_COLLECTOR_STACKBUFSZ to a larger number.
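
For example, a sketch of raising both limits from a Bourne-compatible shell before starting the run (the values shown are illustrative):

export SP_COLLECTOR_NUMTHREADS=65536
export SP_COLLECTOR_STACKBUFSZ=1024
collect ./a.out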

The Collector interposes on some signal-handling routines to protect its use of SIGPROF signals for clock-based profiling and SIGEMT (Solaris) or SIGIO (Linux) for hardware counter overflow profiling against disruption by the target program. See the section "DATA COLLECTION AND SIGNALS" above.

The Collector interposes on setitimer (2) to ensure that the profiling timer is not available to the target program if clock-based profiling is enabled.

The Collector interposes on functions in the hardware counter library, libcpc.so, so that an application cannot use hardware counters while the Collector is collecting performance data. The interposed functions return a value of -1.

Dataspace profiling is not available on systems running the Linux OS, nor on x86 based systems running the Solaris OS.

For this release, the data from collecting periodic samples is not reliable on systems running the Linux OS.

For this release, wide data discrepancies are observed when profiling multithreaded applications on systems running the RedHat Enterprise Linux OS.

Hardware counter overflow profiling cannot be run on a system where cpustat is running, because cpustat takes control of the counters, and does not let a user process use them.

Java Profiling requires the Java[TM] 2 SDK (JDK) 7, Update 11 or later.

collect cannot be used on executables compiled with the -xprofile=tcov flag.

Data is not collected on descendant processes that are created with the setuid attribute, nor on any descendant process created with an exec call for an executable that is not dynamically linked. Furthermore, subsequent descendant processes might produce corrupted or unreadable experiments. The workaround is to ensure that all spawned processes are dynamically linked and do not have the setuid attribute.

Applications that call vfork (2) have these calls replaced by a call to fork1 (2).

Count data (collect -c) cannot be collected on Linux 5 systems; count data cannot be collected for 32-bit binaries on any Linux system at all.

See also

analyzer (1), collector (1), dbx (1), er_archive (1), er_cp (1), er_export (1), er_mv (1), er_print (1), er_rm (1), tha (1), libcollector (3)

Performance Analyzer manual