Oracle Developer Studio 12.6 Man Pages

Updated: June 2017

collect(1)

Name

collect - command used to collect program performance data

Synopsis

collect collect-arguments target target-arguments
collect
collect -V

Description

The collect command runs the target process and records performance data and global data for the process. Performance data is collected using profiling or tracing techniques. The data can be examined with the Performance Analyzer graphical tool (analyzer) or a command-line program (er_print). The data collection software run by the collect command is referred to here as the Collector.

The data from a single run of the collect command is called an experiment. The experiment is represented in the file system as a directory, with various files inside that directory.

The target is the path name of the executable, Java .jar file, or Java .class file for which you want to collect performance data. For more information about Java profiling, see JAVA PROFILING, below.

Executables that are targets for the collect command can be compiled with any level of optimization, but must use dynamic linking. If a program is statically linked, the collect command prints an error message. In order to see annotated source using analyzer or er_print, targets should be compiled with the -g flag, and should not be stripped.

The collect command uses the following strategy to find its target:

  • If a file with the specified target name exists, has execute permission set, and is an ELF executable, the collect command verifies that it can run on the current machine and then runs it. If the file is not an ELF executable, the collect command assumes it is a script, and runs it.

  • If a file with the specified target name exists but does not have execute permission, collect checks whether the file is a Java jar file (target name ends in .jar) or class file (target name ends in .class). If the file is a jar file or class file, collect inserts the Java virtual machine (JVM) software as the target, with any necessary flags, and collects data on that JVM. See JAVA PROFILING below.

  • If a file with the specified target name is not found, collect searches your path to find an executable; if an executable file is found, collect verifies it as described above.

  • If a file of the target name is also not found in your path, the command looks for a file with that name and the string .class appended; if a file with the class name is found, collect inserts the JVM machine with the appropriate flags, as above.

  • If none of these procedures can find the target, the command fails.

Options

If invoked with no arguments, collect prints a usage summary, including the default configuration of the experiment.

Data Specifications

-p option

Collect clock-based profiling data. The allowed values of option are:

off

Turns off clock-based profiling.

lo[w]

Turns on clock-based profiling with a per-thread rate of approximately 10 samples per second.

on

Turns on clock-based profiling with a per-thread rate of approximately 100 samples per second.

hi[gh]

Turns on clock-based profiling with a per-thread rate of approximately 1000 samples per second.

n

Turns on clock-based profiling with a profile timer period of n. The value n can be an integer or a floating-point number, with a suffix of u for values in microseconds, or m for values in milliseconds. If no suffix is used, assume the value to be in milliseconds.

If the value is smaller than the clock profiling minimum, set it to the minimum; if it is not a multiple of the clock profiling resolution, round down to the nearest multiple of the clock resolution. If it exceeds the clock profiling maximum, report an error. If it is negative or zero, report an error.
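
For example, the following hypothetical invocation (a.out and its argument are placeholders) requests clock-based profiling with a period of 2.5 milliseconds, or approximately 400 samples per second per thread; the period is adjusted as described above if it does not match the clock-profiling resolution:

collect -p 2.5m a.out arg1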

If no explicit -p argument is given, and neither count data nor race-detection or deadlock data is specified, turn on clock-based profiling. If -h high or -h low is specified, requesting the default counter set for that chip at high or low frequency, the default clock-profiling rate is also set to high or low; an explicit -p argument will be respected.

Clock-profiling-based dataspace and memoryspace profiling is no longer supported; all supported machines have hardware counters for memory operations.

-h [parameter]

Hardware counter overflow profiling.

-h

Shows extended help for collect hardware counter overflow (HWC) profiling.

If the –h option is specified without a value for parameter, collect prints hardware counter information. If the processor supports hardware counter overflow profiling, collect prints two lists containing information about hardware counters. The first list contains "aliased" hardware counters; the second list contains "raw" hardware counters. The output also contains the specification for the default HWC experiment for that processor. For more details, see the "Hardware Counter Overflow Profiling" section below.

If the processor does not support hardware counter overflow profiling, the output says so.

The value of parameter can select the default set of counters at a specific rate, a specific counter, or a set of counters.

–h {auto | lo | on | hi}

Turns on Hardware Counter overflow (HWC) profiling data for a default set of counters at the specified rate:

auto

Matches the rate used by clock-profiling. If clock-profiling is disabled, use the per-thread maximum rate of approximately 100 samples per second. auto is the default and preferred setting.

lo|low

Uses per-thread maximum rate of approximately 10 samples per second.

on

Uses per-thread maximum rate of approximately 100 samples per second.

hi|high

Uses per-thread maximum rate of approximately 1000 samples per second.

Alternatively, you can use specific counters:

–h ctr_def[,ctr_def]...

Collects hardware-counter-overflow profiles using one or more specified counters. The maximum number of counters supported is processor-dependent. You can see the maximum number of hardware counter definitions for profiling on the current machine, the full list of available hardware counters, and the default counter set by running collect -h with no other arguments on the current machine.

Each counter definition takes the following form:

[+|-]ctr[~attr=val]...[~attrN=valN][/reg#],[rate]

The meanings of the counter definition options are as follows:

+|-

Optional parameter that can be applied to precise, memory-related counters, which are the counters used for memoryspace and dataspace profiling.

A + is the default and is not needed.

A - collects only normal hardware-counter information and not the extra information that is used for memoryspace and dataspace profiling.

See the section "MEMORYSPACE AND DATASPACE PROFILING" below.

ctr

Counter name. You can see the list of counter names for your processor by running the collect -h command without any other command-line arguments. On most systems, you can specify a counter using a numeric value in hexadecimal (such as 0x00c3) or decimal even if a counter is not listed in collect -h output. The numeric values for counters are specified in the processor manufacturer's manuals. The name of the relevant manual is shown in the collect -h output. Some counters are only described in proprietary vendor manuals. On Oracle Solaris, when a counter is specified numerically it can help to specify the register number also.

~attr=val

Optional one or more attribute options. On some processors, attribute options can be associated with a hardware counter. If the processor supports attribute options, collect -h provides a list of attribute names to use for attr. The value val can be in decimal or hexadecimal format. Hexadecimal format numbers are in C program format where the number is prepended by a zero and lower-case x (0xhex_number). Multiple attributes are concatenated to the counter name. The ~ tilde character in front of each attr name is required.

/reg#

On Oracle Solaris, hardware register to use for the counter. If not specified, collect attempts to place the counter into the first available register and as a result, might be unable to place subsequent counters due to register conflicts. If you specify more than one counter, the counters must use different registers. You can see a list of allowable register numbers by running the collect -h command without any other command-line arguments. The / character is required if the register is specified.

rate

The sampling frequency. Valid values are as follows:

auto

Matches the rate used by clock profiling. If clock profiling is disabled, use the per-thread maximum rate of 100 samples per second. auto is the default and preferred value.

lo

Uses per-thread maximum rate of approximately 10 samples per second.

on

Uses per-thread maximum rate of approximately 100 samples per second.

hi

Uses per-thread maximum rate of approximately 1000 samples per second.

value

Specifies a fixed event interval value to trigger a sample, rather than a sampling rate. When specifying value, note that the actual frequency is dependent on the selected counter and the program under test.

The event interval can be specified in decimal or hexadecimal format. Exercise caution in setting a numerical value, especially as setting the interval too low can overload your application or even your entire system. As a rule of thumb, aim for fewer than 1000 events per second per thread. You can use the Performance Analyzer Timeline view to visually estimate the rate of samples.

The rate can be omitted, in which case auto will be used. Even when the rate is omitted, the comma in front of it is required (except for the last counter in a -h parameter).

EXAMPLES: Some valid examples of -h usage:

 
-h auto
-h lo
-h hi
   Enable the default counters with default, low, or
   high rates, respectively

-h cycles,,insts,,dcm
-h cycles -h insts -h dcm
   Both have the same meaning: three counters: cycles, insts 
   and D-cache misses.

-p lo -h cycles,,insts,,dcm
   Select a low rate of profiling for clock and HWC cycles, insts 
   and D-cache misses. A low rate of profiling can be used to 
   reduce data collection overhead and experiment size when
   dealing with long-running or highly multi-threaded applications.

-h cycles~system=1
  Count cycles, explicitly including cycles in system mode.

-h 0xc0/0,10000003
   On Nehalem, this is equivalent to
   -h inst_retired.any_p/0,10000003

Some invalid examples of -h usage:

 
-h cycles -h off
  Can't use off with any other -h arguments
-h cycles,insts
  Missing comma, and "insts" does not parse as a number for 
  <interval>

If the -h argument specifies the use of hardware counters but hardware counters are in use by others at the time the command is given, the collect command will report an error and no experiment will be run.

If no -h argument is given, no HW counter profiling data will be collected. An experiment can specify both hardware counter overflow profiling and clock-based profiling. Specifying hardware counter overflow profiling will not disable clock-profiling, even if it is enabled only by default.

For more information on hardware counters, see the "Hardware Counter Overflow Profiling" section below.

-s option[,scope]

Collect synchronization tracing data.

The minimum delay threshold for tracing events is set using option, and the scope of the APIs traced can optionally be set by scope.

The allowed values of option are:

on

Turns on synchronization delay tracing and sets the threshold value by calibration at runtime

calibrate

Same as on

off

Turns off synchronization delay tracing

n

Turns on synchronization delay tracing with a threshold value of n microseconds; if n is zero, trace all events

all

Turns on synchronization delay tracing and trace all synchronization events

By default, turns off synchronization delay tracing.

For native API tracing on Oracle Solaris, the following functions are traced: mutex_lock(), rw_rdlock(), rw_wrlock(), cond_wait(), cond_timedwait(), cond_reltimedwait(), thr_join(), sema_wait(), pthread_mutex_lock(), pthread_rwlock_rdlock(), pthread_rwlock_wrlock(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_cond_reltimedwait_np(), pthread_join(), and sem_wait().

On Linux, the following functions are traced: pthread_mutex_lock(), pthread_cond_wait(), pthread_cond_timedwait(), pthread_join(), and sem_wait().

For Java programs, record synchronization events for Java monitors in user code.

The allowed values of scope are:

n

Traces native APIs.

j

Traces Java APIs

nj

Traces native and Java APIs

By default, trace both native and Java APIs.
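
For example, the following hypothetical invocation (a.out is a placeholder target) traces synchronization events with delays of 50 microseconds or more, for native APIs only:

collect -s 50,n a.out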

-H option

Collects heap trace data. The allowed values of option are:

on

Turns on tracing of memory allocation requests

off

Turns off tracing of memory allocation requests

By default, turns off heap tracing.

Records heap-tracing events for any native calls. Treat calls to mmap as memory allocations.

Heap profiling for Java programs traces native allocations only, not Java allocations.

Note that heap tracing might produce very large experiments. Such experiments are very slow to load and browse.
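
For example, the following hypothetical invocation (a.out is a placeholder target) turns on heap tracing; because no -p, count, or Thread Analyzer option is given, the default clock-based profiling remains enabled (see -p above):

collect -H on a.out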

-i option

Collects I/O trace data. The allowed values of option are:

on

Turns on tracing of I/O operations

off

Turns off tracing of I/O operations

By default, turns off I/O tracing.

Note that I/O tracing might produce very large experiments. Such experiments are very slow to load and browse.

-M option

Specifies collection of an MPI experiment. (See MPI PROFILING, below.) The target of collect should be mpirun, and its arguments should be separated from the user target (that is, the programs that are to be run by mpirun) by an inserted -- argument. The experiment is named as usual, and is referred to as the "founder experiment"; its directory contains subexperiments for each of the MPI processes, named by rank. It is recommended that the -- argument always be used with mpirun, so that an experiment can be collected by prepending collect and its options to the mpirun command line.

The allowed values of option are:

MPI-version

Turns on collection of an MPI experiment, assuming the MPI version named. The recognized versions of MPI are printed when you type collect with no arguments, or in response to an unrecognized version specified with -M.

off

Turns off collection of an MPI experiment.

By default, turns off collection of an MPI experiment. When an MPI experiment is turned on, the default setting for -m (see below) is changed to on.

-m option

Collect MPI tracing data. (See MPI PROFILING, below.)

The allowed values of option are:

on

Turns on MPI tracing information.

off

Turns off MPI tracing information.

By default, turn off MPI tracing, except if the -M flag is enabled, in which case MPI tracing is turned on by default. Normally, MPI experiments are collected with -M, and no user control of MPI tracing is needed. If you want to collect an MPI experiment, but not collect MPI trace data, you can use the explicit flags:

-M MPI-version -m off

-c option

Collects count data. The allowed values of option are:

on

Turns on count data.

static

Turns on simulated count data, based on the assumption that every instruction was executed exactly once.

off

Turns off count data.

By default, turn off count data. Count data cannot be collected with any other type of data. For count data or simulated count data, the executable and any shared-objects that are instrumented and statically linked are counted; for count data, but not simulated count data, dynamically loaded shared objects are also instrumented and counted.

On Oracle Solaris, no special compilation is needed, although the count option is incompatible with compile flags -p, -pg, -qp, -xpg, and -xlinkopt. On Linux, the executable must be compiled with the -xannotate=yes flag in order to collect count data.
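
For example, the following hypothetical invocation (a.out is a placeholder target) collects count data; no other data type can be collected in the same experiment:

collect -c on a.out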

-I directory

Specifies a directory for count data instrumentation.

-N libname

Specifies a library to be excluded from instrumentation for count data, whether the library is linked into the executable, or loaded with dlopen(3C). Multiple -N options can be specified.

-r option

Collects data for data race detection or deadlock detection for the Thread Analyzer.

The allowed values of option are:

race

Collects data for detecting data races.

deadlock

Collects data for detecting deadlocks and potential deadlocks.

all

Collects data for detecting data races, deadlocks, and potential deadlocks. Can also be specified as race,deadlock.

off

Turns off data collection for data races, deadlocks, and potential deadlocks.

on

Collects data for detecting data races (same as race).

terminate

If an unrecoverable error is detected, terminates the target process.

abort

If an unrecoverable error is detected, terminates the target process with a core dump.

continue

If an unrecoverable error is detected, enables the process to continue.

By default, turn off collection of all Thread Analyzer data.

The terminate, abort, and continue options can be added to any data-collection options, and govern the behavior when an unrecoverable error, such as a real (not potential) deadlock, is detected. The default behavior is terminate.

Thread Analyzer data cannot be collected with any tracing data, but can be collected in conjunction with clock- or hardware counter profiling data. Thread Analyzer data significantly slows down the execution of the target, and profiles might not be meaningful as applied to the user code.

Thread Analyzer experiments can be examined with either analyzer or with tha. The latter displays a simplified list of default tabs, but is otherwise identical.

In order to enable data-race detection, executables must be instrumented, either at compile time, or by invoking a post-processor. If the target is not instrumented, and none of the shared objects on its library list is instrumented, a warning is displayed, but the experiment is run. Other Thread Analyzer data do not require instrumentation.

See the tha(1) man page or the Oracle Developer Studio 12.6: Thread Analyzer User’s Guide for more detail.
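
For example, a hypothetical data-race run on a placeholder target a.out, examined afterward with tha; the experiment name test.1.er assumes the default naming described under -o below:

collect -r race a.out
tha test.1.er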

-S interval

Periodically samples process-wide resource utilization at the interval specified (in seconds). The allowed values of interval are:

off

Turns off periodic sampling.

on

Turns on periodic sampling with the default sampling interval (1 second).

n

Turns on periodic sampling with a sampling interval of n seconds; n must be positive.

By default, turn on periodic sampling.

Experiment Controls

-L size

Limit the amount of profiling and tracing data recorded to size megabytes. The limit applies to the sum of all profiling data and tracing data, but not to process-wide resource-utilization samples. The limit is only approximate, and can be exceeded. When the limit is reached, stop recording profiling and tracing data, but keep the experiment open and record samples until the target process terminates. The allowed values of size are:

unlimited or none

Do not impose a size limit on the experiment.

n

Imposes a limit of n megabytes. The value of n must be greater than zero.

By default, there is no limit on the amount of data recorded.
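
For example, the following hypothetical invocation (a.out is a placeholder target) limits the recorded profiling and tracing data to approximately 2000 megabytes:

collect -L 2000 a.out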

-F option

Controls whether descendant processes should have their data recorded. (Data is always collected on the founder process, independent of any -F setting.) The allowed values of option are:

on | all

Records experiments on all descendant processes.

off

Does not record experiments on any descendant processes.

=<regex>

Records experiments on those descendant processes whose executable name (a.out name) matches the regular expression. Only the basename of the executable is used, not the full path. If the <regex> that you use contains blanks or characters interpreted by your shell, be sure to enclose the full =<regex> argument in single quotes.

By default, record experiments on all descendant processes. For more details, read the sections "FOLLOWING DESCENDANT PROCESSES" and "PROFILING SCRIPTS" below.

-A option

Controls whether to perform archiving as part of data collection. Archiving is required to make an experiment self-contained and portable. The allowed values of option are:

on

Copies load objects (the target and any shared objects it uses) into the experiment. Also copies any ancillary files (.anc) and object files (.o) that have Stabs or DWARF information not present in the load object.

src

In addition to copying load objects as in -A on, copies into the experiment all source files and ancillary files (.anc) that can be found.

usedsrc

Similar to –A src, but copies only the source files, ancillary files (.anc), and load objects that are needed for analysis and can be found. This option might require additional processing time, but might result in smaller experiment sizes.

off

Does not copy or archive load objects or source files into the experiment.

Archiving will not be performed in the following circumstances:

  • A profiled process is terminated before it exits normally

  • –A off is specified

In such cases, you must run er_archive explicitly on the same machine where the profiling data was recorded.

When many processes are being profiled, enabling archiving as part of data collection can be very expensive and might change the timing of the application run. With many processes, a better strategy is to collect the data with –A off and later, when the profiling is complete, archive the experiment using er_archive -s all. In this case all binaries and source files will be saved in the experiment.
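
A hypothetical sequence following that strategy, with a placeholder target a.out and the default experiment name test.1.er:

collect -A off a.out
er_archive -s all test.1.er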

The minimum archiving that enables an experiment to be accessed on another machine is –A on. When using this option, note that –A on does not copy any sources or object files (.o's); it is your responsibility to ensure that those files are accessible from the machine where the experiment is being examined, and that they are not changed or rebuilt after the experiment was recorded.

The default setting for -A is on.

-j option

Controls Java profiling when the target is a JVM machine. The allowed values of option are:

on

Records profiling data for the JVM machine, recognizes methods compiled by the Java HotSpot virtual machine, and records Java call stacks. This is the default.

off

Does not record Java profiling data. Profiling data for native call stacks is still recorded.

<path>

Records profiling data for the JVM, and uses the JVM installed in <path>.

See the section "JAVA PROFILING", below.

-J java_arg

Specifies additional arguments to be passed to the JVM used for profiling. If -J is specified, Java profiling (-j on) will be enabled. The java_arg must be surrounded by quotes if it contains more than one argument. It consists of a set of tokens, separated by either a blank or a tab; each token is passed as a separate argument to the JVM. Note that most arguments to the JVM must begin with a "-" character.
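
For example, the following hypothetical invocation profiles the placeholder jar file myapp.jar and passes a larger maximum heap size (a standard JVM option, shown only for illustration) to the JVM used for profiling:

collect -J "-Xmx1024m" myapp.jar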

-l signal

Samples process-wide resource-utilization whenever the given signal is delivered to the process.

See the section "DATA COLLECTION AND SIGNALS" below for more information about choosing a signal.

-y signal[,r]

Controls recording of data with signal, referred to as the pause-resume signal. Whenever the given signal is delivered to the process, switch between paused (no data is recorded) and resumed (data is recorded) states. Start in the resumed state if the optional ,r flag is given, otherwise start in the paused state. This option does not affect the recording of process-wide resource-utilization samples.

One use of the pause-resume signal is to start a target without collecting data, allowing it to reach steady-state, and then enabling the data.

See the section "DATA COLLECTION AND SIGNALS" below for more information about choosing a signal.
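
For example, with a placeholder target a.out, the following starts the run in the paused state and toggles recording each time SIGUSR1 is delivered, for instance from another shell once the process has reached steady state (12345 is a placeholder PID):

collect -y SIGUSR1 a.out
kill -USR1 12345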

Output Controls

-o experiment_name

Uses experiment_name as the name of the experiment to be recorded. The experiment_name must end in the string .er; if not, print an error message and do not run the experiment.

If -o is not specified, give the experiment a name of the form stem.n.er, where stem is a string, and n is a number. If a group name has been specified with -g, set stem to the group name without the .erg suffix. If no group name has been specified, set stem to the string "test".

If invoked from one of the commands used to run MPI jobs, for example, mpirun, but without -M MPI-versions, and -o is not specified, take the value of n used in the name from the environment variable used to define the MPI rank of that process. Otherwise, set n to one greater than the highest integer currently in use. (See MPI PROFILING, below.)

If the name is not specified in the form stem.n.er, and the given name is in use, print an error message and do not run the experiment. If the name is of the form stem.n.er and the name supplied is in use, record the experiment under a name corresponding to one greater than the highest value of n that is currently in use. Print a warning if the name is changed.

-d directory_name

Places the experiment in directory directory_name. If no directory is given, place the experiment in the current working directory. If a group is specified (see -g, below), the group file is also written to the directory named by -d.

For the lightest-weight data collection, it is best to record data to a local file, with -d used to specify a directory in which to put the data. However, for MPI experiments on a cluster, the founder experiment must be available at the same path to all processes to have all data recorded into the founder experiment.

Experiments written to long-latency file systems are especially problematic, and might progress very slowly.

-g group_name

Adds the experiment to the experiment group group_name. The group_name string must end in the string .erg; if not, report an error and do not run the experiment. The first line of a group file must contain the string

#analyzer experiment group

and each subsequent line is the name of an experiment.

-O file

Appends all output from collect itself to the named file, but does not redirect the output from the spawned target, from dbx (as invoked with the -P argument), or from the processes involved in recording count data (as invoked with the -c argument). If file is set to /dev/null, all output from collect is suppressed, including any error messages.

-t duration

Collects data for the specified duration. duration can be a single number followed by either m to specify minutes, or s to specify seconds (default), or two such numbers separated by a - sign. If one number is given, data is collected from the start of the run until the given time; if two numbers are given, data is collected from the first time to the second. If the second time is zero, data is collected until the end of the run. If two non-zero numbers are given, the first must be less than the second.

Although you specify duration in minutes or seconds, the start and end of data collection is recognized with greater accuracy. If clock profiling is enabled, the accuracy is approximately twice the clock profiling interval. If clock profiling is not enabled, the accuracy is 200 milliseconds.
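
For example, the following hypothetical invocation (a.out is a placeholder target) collects data from 30 seconds after the start of the run until 90 seconds after the start:

collect -t 30-90 a.out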

Other Arguments

-C comment

Puts the comment into the notes file for the experiment. Up to ten –C arguments can be supplied.

-P <pid>

Writes a script for dbx to attach to the process with the given PID and collect data from it, and then invokes dbx with that script. Clock-based or hardware counter profiling data may be specified, but neither tracing nor count data is supported. See the collector(1) man page for more information.

When attaching to a process, the directory is created with the umask of the user running collect -P, but the experiment is written as the user running the process which is being attached to. If the user doing the attach is root, and the umask is not zero, the experiment will fail.


Note -  On Linux, attaching to a multithreaded process, including Java, will not properly collect data. Data for the thread that was attached to will be captured, but not data for other threads.
-n

Dry run: do not run the target, but print all the details of the experiment that would be run. Turn on -v.

-V

Prints the current version. Do not examine further arguments and do no further processing.

-v

Prints the current version and further detailed information about the experiment being run.

-x

Leaves the target process stopped on the exit from the exec system call, in order to allow a debugger to attach to it. The collect command prints a message with the process PID.

To attach a debugger to the target once it is stopped by collect, you can follow the procedure below.

  • Obtain the PID of the process from the message printed by the collect -x command

  • Start the debugger

  • Configure the debugger to ignore SIGPROF and, if you choose to collect hardware counter data, SIGEMT on Solaris or SIGIO on Linux

  • Attach to the process using dbx's attach command.

  • Set the collector parameters for the experiment you wish to collect

  • Issue the collector enable command

  • Issue the cont command to allow the target process to run

As the process runs under the control of the debugger, the Collector records an experiment.

Alternatively, you can attach to the process and collect an experiment using the collect -P PID command.
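
The following is a sketch of such a session, assuming that hardware counter data is not being collected and that the PID printed by collect -x was 12345; the dbx command spellings (signal names without the SIG prefix, attaching by PID) are assumptions, and details may differ on your system:

% collect -x a.out
% dbx
(dbx) ignore PROF
(dbx) attach 12345
(dbx) collector enable
(dbx) cont

Collector parameters for the experiment can be set with additional collector subcommands before the enable command; see the collector(1) man page.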

FOLLOWING DESCENDANT PROCESSES

Data from the initial process spawned by collect, called the founder process, is always collected. Processes can create descendant processes by calling system library functions, including the variants of fork, exec, system, etc. If a -F argument is used, the collector can collect data for descendant processes, and it opens a new experiment for each descendant process inside the parent experiment. These new experiments are named with their lineage as follows:

  • An underscore is appended to the creator's experiment name.

  • A code letter is added: either "f" for a fork, or "x" for other descendants, including exec. On Linux, "C" is used for a descendant generated by clone(2).

  • A number is added after the code letter, which is the index of the descendant.

  • The experiment suffix, ".er" is appended to the lineage.

For example, if the experiment name for the initial process is "test.1.er", the experiment for the descendant process created by its third fork is "test.1.er/_f3.er". If that descendant process execs a new image, the corresponding experiment name is "test.1.er/_f3_x1.er".

If the default, -F on, is used, descendant processes initiated by calls to fork(2), fork1(2), fork(3F), vfork(2), and exec(2) and its variants are followed. The call to vfork is replaced internally by a call to fork1. Descendants created by calls to system(3C), system(3F), sh(3F), popen(3C) , and similar functions, and their associated descendant processes, are also followed. On Linux, descendants created by clone() without the CLONE_VM flag are followed by default; descendants created with the CLONE_VM flag are treated as threads, rather than processes, and are always followed, independent of the -F setting.

If the -F =<regex> argument is used, all descendants whose name matches the regular expression are followed. When matching names, only the basename of the executable is used, not the full path, and not any arguments.

For example, to capture data on the descendant process of the first exec from the first fork from the first call to system in the founder, use:

collect -F '=_x1_f1_x1'

To capture data on all the variants of exec, but not fork, use:

collect -F '=.*_x[0-9]/*'

To capture data from a call to system("echo hello") but not system("goodbye"), use:

collect -F '=echo hello'

The Analyzer and er_print automatically read experiments for descendant processes when the founder experiment is read, and the experiments for the descendant processes are selected for data display.

To specifically select the data for display from the command line, specify the path name explicitly to either er_print or Analyzer. The specified path must include the founder experiment name, and the descendant experiment's name inside the founder directory.

For example, to see the data for the third fork of the test.1.er experiment:

er_print test.1.er/_f3.er
analyzer test.1.er/_f3.er

You can prepare an experiment group file with the explicit names of descendant experiments of interest.

To examine descendant processes in the Analyzer, load the founder experiment and choose View > Filter data. The Analyzer displays a list of experiments with only the founder experiment checked. Uncheck the founder experiment and check the descendant experiment of interest.

PROFILING SCRIPTS

By default, collect no longer requires that its target be an ELF executable. If collect is invoked on a script, data is collected on the program launched to execute the script, and on all descendant processes. To collect data only on a specific process, use the -F flag to specify the name of the executable to follow.

For example, to profile the script foo.sh, but collect data primarily from the executable bar, use the command:

collect -F =bar foo.sh

Data will be collected on the founder process launched to execute the script, and all bar processes spawned from the script, but not for other processes.

JAVA PROFILING

Java profiling consists of collecting a performance experiment on the JVM machine as it runs your .class or .jar files. If possible, call stacks are collected in both the Java model and in the machine model. On x86 platforms, if Java applications crash during data collection, disabling capture of machine model call stacks with the SP_COLLECTOR_NATIVE_MAX_STACKDEPTH environment variable might help. See "Environment Variables" below.

Data can be shown with view mode set to User, Expert, or Machine. User mode shows each method by name, with data for interpreted and HotSpot-compiled methods aggregated together; it also suppresses data for non-user-Java threads. Expert mode separates HotSpot-compiled methods from interpreted methods, and does not suppress non-user Java threads. Machine mode shows data for interpreted Java methods against the JVM machine as it does the interpreting, while data for methods compiled with the Java HotSpot virtual machine is reported for named methods. All threads are shown. In all three modes, data is reported in the usual way for any non-OpenMP C, C++, or Fortran code called by a Java target. Such code corresponds to Java native methods. The Analyzer and the er_print utility can switch between the view mode User, view mode Expert, and view mode Machine, with User being the default.

Clock-based profiling and hardware counter overflow profiling are supported. Synchronization tracing collects data only on the Java monitor calls, and synchronization calls from native code; it does not collect data about internal synchronization calls within the JVM.

Heap tracing is not supported for Java, and generates an error if specified.

Some Java codes have shared objects contained within a jar file. The shared objects are extracted to a temporary directory when the application runs, and are deleted when the application terminates. The shared-object names are recorded in the experiment map file, but the jar file name is not. To read such experiments, be sure to add an addpath directive listing the jar file to your .er.rc file, or add the path from the Analyzer GUI, or with the addpath command in er_print. If the addpath directive is in your .er.rc file at the time the experiment is archived, the shared objects will be archived.

When collect inserts a target name of java into the argument list, it examines environment variables for a path to the java target, in the order JDK_HOME, and then JAVA_PATH. For the first of these environment variables that is set, the resultant target is verified as an ELF executable. If it is not, collect fails with an error indicating which environment variable was used, and the full path name that was tried.

If neither of those environment variables is set, the collect command uses the version set by your PATH. If there is no java in your PATH, a system default of /usr/java/bin/java is tried.

JAVA PROFILING WITH A DLOPEN

Some applications are not pure Java, but are C or C++ applications that invoke dlopen to load libjvm.so, and then start the JVM by calling into it. The collector sets an environment variable so that Java profiling is automatically enabled.

SHARED OBJECT HANDLING

Normally, the collect command causes data to be collected for all shared objects in the address space of the target, whether on the initial library list, or explicitly dlopen'd. However, there are some circumstances under which some shared objects are not profiled.

One such scenario is when the target program is invoked with lazy loading. In such cases, the library is not loaded at startup time and is not loaded by explicitly calling dlopen, so the shared object name is not included in the experiment, and all PCs from it are mapped to the <Unknown> function. The workaround is to set the LD_BIND_NOW environment variable, which forces the library to be loaded at startup time.
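
For example, using Bourne-shell syntax with a placeholder target a.out (any non-null value of LD_BIND_NOW is assumed to suffice):

LD_BIND_NOW=1 collect a.out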

Another such scenario is when the executable is built with the -B direct linking option. In that case the object is dynamically loaded by a call specifically to the dynamic linker entry point of dlopen, and the libcollector interposition is bypassed. The shared object name is not included in the experiment, and all PCs from it are mapped to the <Unknown> function. The workaround is to not use -B direct.

DATA COLLECTION AND SIGNALS

Profiling Signals

Signals are used for both clock- and hardware-counter-overflow profiling. SIGPROF is used in data collection for all experiments. The period for generating the signal depends on the data being collected. SIGEMT (Solaris) or SIGIO (Linux) is used for hardware counter overflow profiling. The overflow interval depends on the user parameter for profiling. Any user code that uses or manipulates the profiling signals may potentially interfere with data collection. When the Collector installs its signal handler for a profile signal, it sets a flag that ensures that system calls are not interrupted to deliver signals. This setting could change the behavior of a target program that uses the profiling signals for other purposes.

When the Collector installs its signal handler for a profile signal, it remembers whether or not the target had installed its own signal handler. The Collector also interposes on some signal-handling routines and does not allow the user to install a signal handler for these signals; it saves the user's handler, just as it does when the Collector replaces a user handler on starting the experiment.

Profiling signals are delivered from the profiling timer or hardware-counter-overflow handling code in the kernel, or in response to: the kill(2), sigsend(2), tkill(2), tgkill(2), or _lwp_kill(2) system calls; the raise(3C) or sigqueue(3C) library calls; or the kill(1) command. A signal code is delivered with the signal so that the Collector can distinguish the origin. If the signal was delivered for profiling, it is processed by the Collector; if it was not, it is delivered to the target signal handler.

When the Collector is running under dbx, the profiling signal delivered occasionally has its signal code corrupted, and a profile signal may be treated as if it were generated from a system or library call or a command. In that case, it will be incorrectly delivered to the user's handler. If the user handler was set to SIG_DFL, it will cause the process to fail with a core dump.

When the Collector is invoked after attaching to a target process, it will install its signal handler, but it cannot interpose on the signal-handling routines. If the user code installs a signal handler after the attach, that handler will override the Collector's signal handler, and data will be lost.

Note that any signal, including either of the profiling signals, may cause premature termination of a system call, and the program must be prepared to handle that behavior. When libcollector installs the signal handlers for data collection, it specifies restarting those system calls that are restartable, but some, like sleep(3C), will return early without reporting an error.

Process-Wide Sample and Pause-Resume Signals

Signals can be specified by the user as a sample signal (–l) or a pause-resume signal (–y). SIGUSR1 or SIGUSR2 are recommended for this use, but any signal that is not used by the target can be used.

The profiling signals can be used if the process does not otherwise use them, but they should be used only if no other signal is available. The Collector interposes on some signal-handling routines and does not allow the user to install a signal handler for these signals; it saves the user's handler, just as it does when the Collector replaces a user handler on starting the experiment.

If the Collector is invoked after attaching to a target process, and the user code installs a signal handler for the sample or pause-resume signal, those signals will no longer operate as specified.

OPENMP PROFILING

Data collection for OpenMP programs collects data that can be displayed in any of the three view modes, just as for Java programs. In User mode, slave threads are shown as if they were really cloned from the master thread, and have call stacks matching those from the master thread. Frames in the call stack coming from the OpenMP runtime code (libmtsk.so) are suppressed. In Expert mode, the master and slave threads are shown differently, the explicit functions generated by the compiler are visible, and the frames from the OpenMP runtime code (libmtsk.so) are suppressed. In Machine mode, the actual native stacks are shown.

In User mode, various artificial functions are introduced as the leaf function of a call stack whenever the runtime library is in one of several states. These functions are <OMP-overhead>, <OMP-idle>, <OMP-reduction>, <OMP-implicit_barrier>, <OMP-explicit_barrier>, <OMP-lock_wait>, <OMP-critical_section_wait>, and <OMP-ordered_section_wait>.

Three additional clock-profiling metrics are added to the data for clock-profiling experiments:

 
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)

OpenMP Work is counted when the OpenMP runtime thinks the code is doing work. It includes time when the process is consuming User-CPU time, but it also can include time when the process is consuming System-CPU time, waiting for page faults, waiting for the CPU, etc. Hence, OpenMP Work can exceed User-CPU time. OpenMP Wait is accumulated when the OpenMP runtime thinks the process is waiting. OpenMP Wait can include User-CPU time for busy-waits (spin-waits), but it also includes Other-Wait time for sleep-waits.

Master Thread Time is the total time spent in the master thread. It is only available from Oracle Solaris experiments. It corresponds to wall-clock time.

The inclusive metrics are visible by default; the exclusive are not. Together, the sum of those two metrics equals the Total Thread Time metric. These metrics are added for all clock- and hardware counter profiling experiments.

Collecting information for every parallel-region entry in the execution of the program can be very expensive. You can suppress that cost by setting the environment variable SP_COLLECTOR_NO_OMP. If you set SP_COLLECTOR_NO_OMP, the program will have substantially less dilation, but you will not see the data from slave threads propagate up to the caller, and eventually to main(), as you would when the variable is not set.

A collector for OpenMP 3.0 is enabled by default in this release. It can profile programs that use explicit tasking. Programs built with earlier compilers can be profiled with the new collector only if a patched version of libmtsk.so is available. If it is not installed, you can switch data collection to use the old collector by setting the environment variable SP_COLLECTOR_OLDOMP.
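
For example, using Bourne-shell syntax with a placeholder target a.out (the value 1 is arbitrary; the behavior is assumed to be triggered simply by the variable being set):

SP_COLLECTOR_NO_OMP=1 collect -p on a.out
SP_COLLECTOR_OLDOMP=1 collect -p on a.out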

Note that the OpenMP profiling functionality is only available for applications compiled with the Oracle Developer Studio compilers, since it depends on the Oracle Developer Studio compiler runtime. GNU-compiled code will only see machine-level call stacks.

MEMORYSPACE AND DATASPACE PROFILING

A memoryspace profile is a profile in which memory-related events, such as cache misses, are reported against the physical structures of the machine, such as cache lines, memory banks, or pages. Memoryspace profiling is available on Oracle SPARC systems, and on Intel systems running Oracle Solaris.

A dataspace profile is a profile in which those memory-related events are reported against the data structures whose references cause the events rather than just the instructions where the memory-related events occur. Dataspace profiling is only available on SPARC systems running Oracle Solaris.

For either memoryspace or dataspace profiling, you must collect hardware counter profiles on an Oracle Solaris system using precise, memory-related counters. Such counters are found in the counter list obtained by running the collect -h command without any other command-line arguments; the counters are annotated memoryspace.

Further, in order to support dataspace profiling, executables should be compiled for a SPARC platform with the -xhwcprof -xdebugformat=dwarf -g flags.

Memoryspace profiling data can be viewed with er_print commands or Performance Analyzer views relating to Memory Objects.

Dataspace profiling data can be viewed with the er_print utility commands data_objects, data_single, and data_layout or with Performance Analyzer using the data views labeled DataObjects and DataLayout.
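
For example, assuming a dataspace experiment named test.1.er, the dataspace views can be printed from the command line with:

er_print -data_objects -data_layout test.1.er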

MPI PROFILING

The collect command can be used for MPI profiling to manage collection of the data from the constituent MPI processes, collect MPI trace data, and organize the data into a single "founder" experiment, with "subexperiments" for each MPI process.

The collect command can be used with MPI by simply prefacing the command that starts the MPI job and its arguments with the desired collect command and its arguments (assuming you have inserted the -- argument to indicate the end of the mpirun arguments). For example, on an SMP machine,

% mpirun -np 16 -- a.out 3 5

can be replaced by

% collect -M OMPT mpirun -np 16 -- a.out 3 5

This command runs an MPI tracing experiment on each of the 16 MPI processes, collecting them all in an MPI experiment, named by the usual conventions for naming experiments. It assumes use of the Oracle Message Passing Toolkit (previously known as Sun HPC ClusterTools) version of MPI.

The initial collect process reformats the mpirun command to specify running collect with appropriate arguments on each of the individual MPI processes.

Note that the -- argument immediately before the target name is required for MPI profiling (although it is optional for mpirun itself), so that collect can separate the mpirun arguments from the target and its arguments. If the -- argument is not supplied, collect prints an error message, and no experiment is run.

Furthermore, a -x PATH argument is added to the mpirun arguments by collect, so that the remote collect commands can find their targets. If any environment variables in your environment begin with "VT_" or with "SP_COLLECTOR_", they are passed to the remote collect with a -x flag for each.

MIMD MPI runs are supported, with the similar requirement that there must be a "--" argument after each ":" (indicating a new target and local mpirun arguments for it). If the -- argument is not supplied, collect prints an error message, and no experiment is run.

Some versions of Oracle Message Passing Toolkit, or Sun HPC ClusterTools have functionality for MPI State profiling. When clock-profiling data is collected on an MPI experiment run with such a version of MPI, two additional metrics can be shown:

 
MPI Work (mpiwork)
MPI Wait (mpiwait)

MPI Work accumulates when the process is inside the MPI runtime doing work, such as processing requests or messages; MPI Wait accumulates when the process is inside the MPI runtime, but waiting for an event, buffer, or message.

On Oracle Solaris systems, MPI Wait is accumulated whether the MPI library sleeps or spins when waiting. On Linux systems, MPI Wait is accumulated when the MPI library spins when waiting; it is not accumulated if the MPI library sleeps (yields the CPU) when waiting, and will be undercounted relative to the real wait time.

In the Analyzer, when MPI trace data is collected, two additional tabs are shown, MPI Timeline and MPI Chart.

The technique of using mpirun to spawn explicit collect commands on the MPI processes is no longer supported to collect MPI trace data, and should not be used. It can still be used for all other types of data.

MPI profiling is based on the open source VampirTrace 5.5.3 release. It recognizes several VampirTrace environment variables, and a new one, VT_STACKS, which controls whether or not call stacks are recorded in the data. For further information on the meaning of these variables, see the VampirTrace 5.5.3 documentation.

The default value of the environment variable VT_BUFFER_SIZE limits the internal buffer of the MPI API trace collector to 64 MB, and the default value of VT_MAX_FLUSHES limits the number of times that the buffer is flushed to 1. Events that are to be recorded after the limits have been reached are no longer written into the trace file. The environment variables apply to every process of a parallel application, meaning that applications with n processes will typically create trace files n times the size of a serial application.

To remove the limit and get a complete trace of an application, set VT_MAX_FLUSHES to 0. This setting causes the MPI API trace collector to flush the buffer to disk whenever the buffer is full. To change the size of the buffer, use the environment variable VT_BUFFER_SIZE. The optimal value for this variable depends on the application that is to be traced. Setting a small value will increase the memory available to the application but will trigger frequent buffer flushes by the MPI API trace collector. These buffer flushes can significantly change the behavior of the application. On the other hand, setting a large value, like 2G, will minimize buffer flushes by the MPI API trace collector, but decrease the memory available to the application. If not enough memory is available to hold the buffer and the application data, parts of the application might be swapped to disk, also leading to a significant change in the behavior of the application.
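
For example, a hypothetical Oracle Message Passing Toolkit run that enlarges the trace buffer and removes the flush limit (Bourne-shell syntax; the 256M buffer size is an arbitrary illustration). Because the variable names begin with "VT_", they are forwarded to the remote collect commands as described above:

VT_BUFFER_SIZE=256M VT_MAX_FLUSHES=0 collect -M OMPT mpirun -np 16 -- a.out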

Another important variable is VT_VERBOSE, which turns on various error and status messages, and setting it to 2 or higher is recommended if problems arise.

Normally, MPI trace output data is post-processed when the mpirun target exits; a processed data file is written to the experiment, and information about the post-processing time is written into the experiment header. MPI post-processing is not done if MPI tracing is explicitly disabled.

In the event of a failure in post-processing, an error is reported, and no MPI Tabs or MPI tracing metrics will be available.

If the mpirun target does not actually invoke MPI, an experiment will still be recorded, but no MPI trace data will be produced. The experiment will report an MPI post-processing error, and no MPI Tabs or MPI tracing metrics will be available.

If the environment variable VT_UNIFY is set to "0", the post-processing routines, er_vtunify and er_mpipp, are not run by collect. They are run the first time either er_print or analyzer is invoked on the experiment.

USING COLLECT WITH PPGSZ

The collect command can be used with ppgsz by running the collect command on the ppgsz command, and specifying the -F on flag. The founder experiment is on the ppgsz executable and is uninteresting. If your path finds the 32-bit version of ppgsz, and the experiment is being run on a system that supports 64-bit processes, the first thing the collect command does is execute an exec function on its 64-bit version, creating _x1.er. That executable forks, creating _x1_f1.er. The descendant process attempts to execute an exec function on the named target, in the first directory on your path, then in the second, and so forth, until one of the exec functions succeeds. If, for example, the third attempt succeeds, the first two descendant experiments are named _x1_f1_x1.er and _x1_f1_x2.er, and both are completely empty. The experiment on the target is the one from the successful exec, the third one in the example, and is named _x1_f1_x3.er, stored under the founder experiment. It can be processed directly by invoking the Analyzer or the er_print utility on test.1.er/_x1_f1_x3.er.

If the 64-bit ppgsz is the initial process run, or if the 32-bit ppgsz is invoked on a 32-bit kernel, the fork descendant that executes exec on the real target has its data in _f1.er, and the real target's experiment is in _f1_x3.er, assuming the same path properties as in the example above.

See the section "FOLLOWING DESCENDANT PROCESSES", above. For more information on hardware counters, see the "Hardware Counter Overflow Profiling" section below.

USING COLLECT ON SETUID/SETGID TARGETS

The collect command operates by inserting a shared library, libcollector.so, into the target's address space (LD_PRELOAD), along with additional shared libraries for specific tracing data collection. Those shared libraries write the files that constitute the experiment.

Several problems might arise if collect is invoked on executables that call setuid or setgid, or that create descendant processes that call setuid or setgid. If the user running the experiment is not root, collection fails because the shared libraries are not installed in a trusted directory. The workaround is to run the experiments as root, or use crle(1) to grant permission. Users should, of course, take great care when circumventing security barriers, and do so at their own risk.

In addition, the umask for the user running the collect command must be set to allow write permission for that user, and for any users or groups that are set by the setuid/setgid attributes of a program being exec'd and for any user or group to which that program sets itself. If the mask is not set properly, some files might not be written to the experiment, and processing of the experiment might not be possible. If the log file can be written, an error is shown when the user attempts to process the experiment.

Note that when attaching as one user to a process that is owned by another user, umask must be set to allow writing by the user owning the process to which you are attaching.

Other problems can arise if the target itself makes any of the system calls to set UID or GID, or if it changes its umask and then forks or runs exec on some other process, or crle was used to configure how the runtime linker searches for shared objects.

If an experiment is started as root on a target that changes its effective GID, the er_archive process that is automatically run when the experiment terminates fails, because it needs a shared library that is not marked as trusted. In that case, you can run er_archive (or er_print or Analyzer) explicitly by hand, on the machine on which the experiment was recorded, immediately following the termination of the experiment.

DATA COLLECTED

Three types of data are collected: profiling data, tracing data, and process-wide resource-utilization data. The data packets recorded in profiling and tracing include the callstack of each LWP, the LWP, thread, and CPU IDs, and some event-specific data. The data packets recorded in process-wide resource-utilization samples contain global data such as execution statistics, but no program-specific or event-specific data. All data packets include a timestamp.

The description of each data type below lists the metrics derived from that data, both by name and by the string the user would use in a metrics command when examining an experiment.

Clock-based Profiling

The event-specific data recorded in clock-based profiling is an array of counts for each accounting microstate. The microstate array is incremented by the system at a prescribed frequency, and is recorded by the Collector when a profiling signal is processed.

Clock-based profiling can run at a range of frequencies which must be multiples of the clock resolution used for the profiling timer. If you try to do high-resolution profiling on a machine with an operating system that does not support it, the command prints a warning message and uses the highest resolution supported. Similarly, a custom setting that is not a multiple of the resolution supported by the system is rounded down to the nearest non-zero multiple of that resolution, and a warning message is printed.

On Oracle Solaris, clock-based profiling data is converted into the following metrics:

 
Total Thread Time (total) = sum over all ten microstates
Total CPU Time (totalcpu) = user + system + trap
User CPU Time (user)
System CPU Time (system)
Trap CPU Time (trap)
User Lock Time (lock)
Data Page Fault Time (datapfault)
Text Page Fault Time (textpfault)
Kernel Page Fault Time (kernelpfault)
Stopped Time (stop)
Wait CPU Time (wait)
Sleep Time (sleep)

For experiments on multithreaded applications, all of the times are summed across all threads in the process. Total Thread Time adds up to the real elapsed time, multiplied by the average number of threads in the process.

On Linux, clock-based profiling data produces one metric: Total CPU Time (totalcpu).

If clock-based profiling is performed on an OpenMP program, three additional metrics are provided:

 
OpenMP Work (ompwork)
OpenMP Wait (ompwait)
Master Thread Time (masterthread)

On Oracle Solaris, OpenMP Work accumulates when work is being done in parallel. OpenMP Wait accumulates when the OpenMP runtime is waiting for synchronization, and accumulates whether the wait is using CPU time or sleeping, or when work is being done in parallel, but the thread is not scheduled on a CPU. Master Thread Time represents time in the master thread only.

On Linux, OpenMP Work and OpenMP Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that OpenMP should do a busy wait, OpenMP Wait on Linux will not be useful. Master Thread Time is not provided on Linux.
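One hedged workaround on Linux is to ask the OpenMP runtime to busy-wait through the standard OMP_WAIT_POLICY environment variable, so that waiting is reflected in CPU time (support depends on the OpenMP runtime in use; the target ./a.out is a placeholder):

OMP_WAIT_POLICY=ACTIVE collect ./a.out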

If clock-based profiling is performed on an MPI program, run under Oracle Message Passing Toolkit or Sun HPC ClusterTools release 8.1 or later, two additional metrics are provided:

 
MPI Work (mpiwork)
MPI Wait (mpiwait)

On Oracle Solaris, MPI Work accumulates when the MPI runtime is active. MPI Wait accumulates when the MPI runtime is waiting for the send or receive of a message, or when the MPI runtime is active, but the thread is not running on a CPU.

On Linux, MPI Work and MPI Wait are accumulated only when the process is active in either user or system mode. Unless you have specified that MPI should do a busy wait, MPI Wait on Linux will not be useful.

Hardware Counter Overflow Profiling

Hardware counter overflow profiling records the number of events counted by the hardware counter at the time the overflow signal was processed.

The counters available depend on the specific processor chip and operating system. Running the command collect -h with no other arguments will describe the processor, and the number of hardware counters available, along with a list of all counters and a default hardware-counter set for that processor. The counters that are aliased to common names are displayed first in the list, followed by a list of the raw hardware counters. After the list of known counters is printed, the name of the reference manual for the chip, and the default counter set defined for that chip is printed.

If neither the performance counter subsystem nor collect knows the names for the counters on a specific chip, the tables are empty. Even so, the counters can be specified numerically as described above.

The lines of output are formatted similar to the following:

 
Aliases for most useful HW counters:

    alias      raw name         type         units       regs  description

    cycles     Cycles_user                   CPU-cycles  0123  CPU Cycles
    insts      Instr_all                     events      0123  Instructions Executed
    c_stalls   Commit_0_cyc                  CPU-cycles  0123  Stall Cycles
    loads      Instr_ld         memoryspace  events      0123  Load Instructions
    stores     Instr_st         memoryspace  events      0123  Store Instructions
    dcm        DC_miss_commit   memoryspace  events      0123  L1 D-cache Misses
...

Raw HW counters:

    name                    type         units       regs  description

    Sel_pipe_drain_cyc                   CPU-cycles  0123
    Sel_0_wait_cyc                       CPU-cycles  0123
    Sel_0_ready_cyc                      CPU-cycles  0123
...

The top section labeled Aliases for most useful HW counters contains the following columns.

alias

Gives a convenient non-processor-specific alias that can be used in a -h argument.

raw name

Lists the real unaliased processor-specific counter name.

type

Lists counter type information, when applicable. Counters of type memoryspace can be used for memoryspace and, where available, dataspace profiling. Rarely, a not-program-related type appears, indicating a counter that captures events that cannot be attributed directly to your program. Specifying such a counter produces a warning: profiling does not record a call stack, time is attributed to an artificial function called collector_not_program_related, and Thread IDs and LWP IDs are meaningless.

units

Shows either CPU-cycles which can approximately be converted to time during analysis, or events which are raw hardware counts.

regs

Specifies which registers can be used for the counter.

description

Provides a description of the counter.

The Raw HW counters section is similar except that no aliases are listed. Introductory paragraphs describing the counters might be available for certain processors.

If the two aliases cycles and insts are collected, two additional metrics are available, CPI (cycles per instruction) and IPC (instructions per cycle). A high CPI ratio or a low IPC ratio indicates code that runs inefficiently in the machine. A low CPI ratio or a high IPC ratio indicates code that runs efficiently in the pipeline.

EXAMPLES:

Example 1: Using the aliased counter information listed in the above sample output, the following command:

collect -p hi -h cycles

enables clock-based profiling at the high rate of approximately 1000 samples per second per thread, together with CPU Cycles hardware counter overflow profiling. Note that generating too high an event rate will ultimately distort the performance you are trying to profile.
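As a further sketch, assuming the cycles and insts aliases shown above are available on the target processor (the target ./a.out is a placeholder), both counters can be profiled together so that the derived CPI and IPC metrics become available:

collect -h cycles,on,insts,on ./a.out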

Synchronization Delay Tracing

Synchronization delay tracing records all calls to the various thread synchronization routines where the real-time delay in the call exceeds a specified threshold. The data packet contains timestamps for entry and exit to the synchronization routines, the thread ID, and the LWP ID at the time the request is initiated. Synchronization requests from a thread can be initiated on one LWP, but complete on another.
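A hedged example invocation that turns on synchronization delay tracing with the -s option (the target ./a.out is a placeholder):

collect -s on ./a.out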

Synchronization delay tracing data is converted into the following metrics:

 
Synchronization Wait Time (sync)
Synchronization Delay Events (syncn)

Heap Tracing

Heap tracing records all calls to malloc, free, realloc, memalign, and valloc with the size of the block requested, its address, and for realloc, the previous address. Calls to calloc are recorded on Oracle Solaris but not on Linux.

Heap tracing data is converted into the following metrics:

 
Allocations (heapalloccnt)
Bytes Allocated (heapallocbytes)
Leaks (heapleakcnt)
Bytes Leaked (heapleakbytes)

Leaks are defined as allocations that are not freed. If a zero-length block is allocated, it counts as an allocation with zero bytes allocated. If a zero-length block is not freed, it counts as a leak with zero bytes leaked.

Heap tracing experiments can be very large, and might be slow to process.
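A minimal hedged invocation that turns on heap tracing with the -H option (the target ./a.out is a placeholder):

collect -H on ./a.out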

IO Tracing

IO tracing records all calls to the standard IO routines and all IO system calls.
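For instance, IO tracing can be turned on with the -i option (a hedged sketch; the target ./a.out is a placeholder):

collect -i on ./a.out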

IO tracing data is converted into the following metrics:

 
Bytes Read (ioreadbytes)
Read Count (ioreadcnt)
Read Time (ioreadtime)
Bytes Written (iowritebytes)
Write Count (iowritecnt)
Write Time (iowritetime)
Other IO Count (ioothercnt)
Other IO Time (ioothertime)
IO Error Count (ioerrorcnt)
IO Error Time (ioerrortime)

MPI Tracing

MPI tracing records calls to the MPI library for functions that can take a significant amount of time to complete. MPI tracing is implemented using the open source VampirTrace code.
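One hedged sketch of an MPI tracing run, assuming collect is used to launch the MPI job; the launcher arguments and the target ./a.out are placeholders that must match the installed MPI:

collect -m on mpirun -np 4 -- ./a.out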

MPI tracing data is converted into the following metrics:

 
MPI Time (mpitime)
MPI Sends (mpisendcnt)
MPI Bytes Sent (mpisendbytes)
MPI Receives (mpirecvcnt)
MPI Bytes Received (mpirecvbytes)
Other MPI Events (mpiothercnt)

MPI Time is the total thread time spent in the MPI functions. If MPI state times are also collected, MPI Work Time plus MPI Wait Time for all MPI functions other than MPI_Init and MPI_Finalize should approximately equal MPI Time. On Linux, MPI Wait and MPI Work are based on user+system CPU time, while MPI Time is based on real time, so the numbers will not match.

The MPI Bytes Received metric counts the actual number of bytes received in all messages. MPI Bytes Sent counts the actual number of bytes sent in all messages. MPI Sends counts the number of messages sent, and MPI Receives counts the number of messages received. A call to MPI_Sendrecv counts as both a send and a receive. Other MPI Events counts the events in the trace that are neither sends nor receives.

Count Data

Count data is recorded by instrumenting the executable and counting the number of times each instruction is executed. The number of times the first instruction in a function is executed is reported as the function execution count. On SPARC systems only, it also counts the number of times an instruction in a branch-delay slot is annulled.
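A hedged example invocation that records count data with the -c option (the target ./a.out is a placeholder):

collect -c on ./a.out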

Count data is converted into the following metrics:

 
Bit Func Count (bit_fcount)
Bit Inst Exec (bit_instx)
Bit Inst Annul (bit_annul) -- SPARC only

Data-race Detection Data

Data-race detection data consists of pairs of race-access events that constitute a race. The events are combined into a race, and races for which the call stacks of the two accesses are identical are merged into a race group.
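A hedged sketch, assuming the target ./a.out (a placeholder) was built with the instrumentation that data-race detection requires; the resulting experiment can be examined with tha(1):

collect -r race ./a.out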

Data-race detection data is converted into the following metric:

 
Race Accesses (raccess)

Deadlock Detection Data

Deadlock detection data consists of pairs of threads with conflicting locks.

Deadlock detection data is converted into the following metric:

 
Deadlocks (deadlocks)

Process-Wide Resource-Utilization Samples

Process-wide resource utilization can be sampled occasionally. The data is attributed to the process and does not map to function-level metrics.

Process-wide resource utilization is always sampled at the start and termination of the process. By default or if a non-zero -S argument is specified, samples are taken periodically at the specified interval. In addition, samples can be taken by using the libcollector(3) API.
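For instance, a hedged invocation that takes process-wide resource-utilization samples every 10 seconds with the -S option (the target ./a.out is a placeholder):

collect -S 10 ./a.out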

The data recorded at each sample consists of microstate accounting information from the kernel, along with various other statistics maintained within the kernel.

Environment Variables

SP_COLLECTOR_JAVA_MAX_STACKDEPTH

Set the maximum number of callstack frames captured, or set to '0' to prevent capturing Java callstacks. The default behavior is to capture up to 256 frames.

SP_COLLECTOR_NATIVE_MAX_STACKDEPTH

Set the maximum number of callstack frames captured, or set to '0' to prevent capturing native callstacks. The default behavior is to capture up to 256 frames. When profiling Java on x86 systems, setting SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=0 might reduce the risk of fatal errors related to native stack unwind. When native callstacks are disabled, JNI and assembly stacks will not be captured.
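A hedged example for Java profiling on x86 in a Bourne-compatible shell (app.jar is a placeholder):

SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=0 collect app.jar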

SP_COLLECTOR_NO_VALIDATE

Define this variable to disable checking hardware, system, and Java versions. The default is to do all checks. Setting this variable will significantly speed up the start-up of the collect command.

SP_COLLECTOR_OUTPUT

Set this variable to a file name to redirect the output of the collect command to that file.

SP_COLLECTOR_SIZE_LIMIT

When the -c on option is used, this variable specifies the maximum size of the experiment in megabytes. For all collect options except -c on, you can use -L to specify a maximum experiment size.
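For example, in a Bourne-compatible shell, a count-data experiment could be capped at 2048 megabytes (the limit and the target ./a.out are placeholders):

SP_COLLECTOR_SIZE_LIMIT=2048 collect -c on ./a.out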

SP_ER_PRINT_ALLOW_COREDUMP

Define this variable to allow the operating system to generate a core file if the analyzer back-end (er_print process) encounters a fatal error. If not defined, the analyzer back-end will not generate core files, but will instead create an error report located at /tmp/analyzer.process-ID/crash.sigsignal.process-ID where process-ID is the Process ID and signal is the signal number.

SP_COLLECTOR_HWC_DEFAULT

Define this variable to turn on profiling with the default hardware counters. This is equivalent to using the -h auto option.

SP_COLLECTOR_NO_OMP

Define this variable to suppress tracking of parallel regions. The program will have substantially less dilation, but the data from slave threads will not propagate to main().

SP_COLLECTOR_OLDOMP

Define this variable to profile a program built with compilers from Sun Studio 12.0 or earlier versions.

RESTRICTIONS

Most of the Performance Analyzer binaries depend on finding a shared library from the installation containing the binaries. Users must not set LD_LIBRARY_PATH to include any library directories from a different installation of the tools. The binaries might fail to execute if the LD_LIBRARY_PATH is set to a different installation.

By default, the Collector collects stacks that are 256 frames deep. To support deeper stacks, set the environment variable SP_COLLECTOR_NATIVE_MAX_STACKDEPTH to a larger number. If you are profiling a Java binary, set the SP_COLLECTOR_JAVA_MAX_STACKDEPTH environment variable.
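As a sketch, in a Bourne-compatible shell (the depth 512 and the target ./a.out are placeholders):

SP_COLLECTOR_NATIVE_MAX_STACKDEPTH=512 collect ./a.out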

The Collector interposes on some signal-handling routines to protect its use of SIGPROF signals for clock-based profiling and SIGEMT (Oracle Solaris) or SIGIO (Linux) for hardware counter overflow profiling against disruption by the target program. See the section "DATA COLLECTION AND SIGNALS" above.

The Collector interposes on setitimer(2) for clock profiling, periodic sampling, and hardware counter checking. Any setitimer calls from target programs will fail.

On Oracle Solaris, the Collector interposes on functions in the hardware counter library, libcpc.so, so that an application cannot use hardware counters while the Collector is collecting performance data. The interposed functions return a value of -1.

Dataspace profiling is only available on SPARC systems running Oracle Solaris.

For this release, the data from process-wide resource utilization samples might not be reliable on systems running the Linux OS.

Hardware counter overflow profiling cannot be run on an Oracle Solaris system where cpustat is running, because cpustat takes control of the counters, and does not let a user process use them.

Java Profiling requires Java 2 SDK (JDK) 7, Update 11, or a later JDK.

collect cannot be used on executables compiled with the -xprofile=tcov flag.

Data is not collected on descendant processes that are created to use the setuid attribute, nor on any descendant processes created with an exec call for an executable that is not dynamically linked. Furthermore, subsequent descendant processes might produce corrupted or unreadable experiments. The workaround is to ensure that all spawned processes are dynamically linked and do not have the setuid attribute.

Applications that call vfork(2) have these calls replaced by a call to fork1(2).

Count data (collect -c) cannot be collected on Oracle Linux 5 systems; count data cannot be collected for 32-bit binaries on any Linux system at all.

On Linux systems, data cannot be collected on applications using clone(2) with the CLONE_VM flag.

See Also

analyzer(1), collector(1), dbx(1), er_archive(1), er_cp(1), er_export(1), er_mv(1), er_print(1), er_rm(1), tha(1), libcollector(3)

Performance Analyzer manual