Man Page collect.1

NAME

     collect - command used to collect program performance data

SYNOPSIS

     collect collect-arguments target target-arguments
     collect
     collect -V
     collect -R

DESCRIPTION

     The collect command runs the target process and records per-
     formance  data and global data for the process.  Performance
     data is collected using  profiling  or  tracing  techniques.
     The  data can be examined with a GUI program (analyzer) or a
     command-line  program  (er_print).   The   data   collection
     software  run  by the collect command is referred to here as
     the Collector.

     The data from a single run of the collect command is  called
     an  experiment.   The  experiment is represented in the file
     system as a directory, with various files inside that direc-
     tory.

     The target is the path name of the executable, Java(TM) .jar
     file, or Java .class file for which you want to collect per-
     formance data.  (For more information about Java  profiling,
     see  JAVA  PROFILING,  below.)  Executables that are targets
     for the collect command can be compiled with  any  level  of
     optimization, but must use dynamic linking.  If a program is
     statically linked, the collect command prints an error  mes-
     sage.   In  order  to see annotated source using analyzer or
     er_print, targets should be compiled with the -g  flag,  and
     should not be stripped.

     In order to enable dataspace profiling, executables must  be
     compiled  with  the  -xhwcprof -xdebugformat=dwarf -g flags.
     These flags are valid for the C, C++ and Fortran  compilers,
     but  only on SPARC[R] platforms.  See the section "DATASPACE
     PROFILING", below.

     The collect command uses the following strategy to find  its
     target:

     - If there is a file with the name of  the  target  that  is
       marked  executable, the file is verified as an ELF execut-
       able that can run on the target machine. If  the  file  is
       not  such  a  valid  ELF  executable,  the collect command
       fails.

     - If there is a file with the name of the  target,  and  the
       file is not executable, collect checks whether the file is
       a Java[TM] jar file or class file. If the file is  a  Java
       jar file or class file, the Java[TM] virtual machine (JVM)
       software is inserted as the  target,  with  any  necessary
       flags,  and  data  is collected on that JVM machine.  (The
       terms "Java virtual machine"  and  "JVM"  mean  a  virtual
       machine  for  the  Java[TM] platform.)  See the section on
       "JAVA PROFILING", below.

     - If there is no file with  the  name  of  the  target,  the
       user's  path is searched to find an executable; if an exe-
       cutable is found, it is verified as described above.

     - If no file with the given name is found, the command looks
       for  a file with that name and the string .class appended;
       if such a file is found, the JVM machine is inserted as the
       target, with the appropriate flags, as above.

     - If none of these procedures can find the target, the  com-
       mand fails.
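
     The search order above can be sketched  in  Python.   This  is
     only  an illustration, not the code used by collect: the func-
     tion name classify_target is hypothetical, the  "can  run  on
     the  target  machine"  check  is  reduced  to  a  test  of the
     file's magic bytes, and rejecting a non-executable  file  that
     is neither a jar file nor a class file is an assumption.

```python
import os
import shutil

ELF_MAGIC = b"\x7fELF"             # first four bytes of an ELF file
CLASS_MAGIC = b"\xca\xfe\xba\xbe"  # first four bytes of a Java class file
ZIP_MAGIC = b"PK"                  # jar files are zip archives


def _head(path, n=4):
    """Read the first n bytes of a file."""
    with open(path, "rb") as f:
        return f.read(n)


def classify_target(name):
    """Return (kind, path), where kind is "elf" or "java".

    Raises ValueError in the cases where the search fails.
    """
    if os.path.isfile(name):
        if os.access(name, os.X_OK):
            # Executable file: it must be an ELF executable.
            if _head(name) == ELF_MAGIC:
                return ("elf", name)
            raise ValueError(name + ": not a valid ELF executable")
        # Non-executable file: accept a jar file or class file.
        head = _head(name)
        if head == CLASS_MAGIC or head[:2] == ZIP_MAGIC:
            return ("java", name)
        # Assumption: anything else is rejected.
        raise ValueError(name + ": not a jar or class file")
    # No such file: search the user's path for an executable.
    found = shutil.which(name)
    if found is not None:
        if _head(found) == ELF_MAGIC:
            return ("elf", found)
        raise ValueError(found + ": not a valid ELF executable")
    # Last resort: try the name with ".class" appended.
    if os.path.isfile(name + ".class"):
        return ("java", name + ".class")
    raise ValueError(name + ": target not found")
```

     A result of "java" corresponds to the cases in which  the  JVM
     machine is inserted as the target.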

OPTIONS

     If invoked with no arguments, print a usage summary, includ-
     ing the default configuration of the experiment. If the pro-
     cessor supports hardware counter overflow  profiling,  print
     two  lists  containing  information about hardware counters.
     The first list contains "well known" hardware counters;  the
     second   list  contains  raw  hardware  counters.  For  more
     details, see the "Hardware Counter Overflow Profiling"  sec-
     tion below.

  Data Specifications
     -p option
          Collect clock-based profiling data.  The allowed values
          of option are:

          Value     Meaning

          off       Turn off clock-based profiling

          on        Turn  on  clock-based  profiling   with   the
                    default  profiling  interval of approximately
                    10 milliseconds.

          lo[w]     Turn on clock-based profiling with  the  low-
                    resolution  profiling  interval  of  approxi-
                    mately 100 milliseconds.

          hi[gh]    Turn on clock-based profiling with the  high-
                    resolution  profiling  interval  of  approxi-
                    mately 1 millisecond.

          n         Turn  on   clock-based   profiling   with   a
                    profiling  interval of n.  The value n can be
                    an integer or a floating-point number, with a
                    suffix  of u for values in microseconds, or m
                    for values in milliseconds.  If no suffix  is
                    used, assume the value to be in milliseconds.

                    If the value is smaller than the  clock  pro-
                    filing  minimum, set it to the minimum; if it
                    is not a  multiple  of  the  clock  profiling
                    resolution,  round down to the nearest multi-
                    ple of the clock resolution.  If  it  exceeds
                    the clock profiling maximum, report an error.
                    If it is negative or zero, report  an  error.
                    If  invoked  with  no  arguments,  report the
                    clock-profiling intervals.

          An optional + may be prepended to  the  clock-profiling
          interval,  specifying  that  collect  capture dataspace
          data.  It will do so by backtracking  one  instruction,
          and  if  that  instruction  is a memory instruction, it
          will assume that  the  delay  was  attributed  to  that
          instruction and record the event, including the virtual
          and physical addresses of the memory reference.

          Caution  must  be  used  in  interpreting   clock-based
          dataspace  data;  the delay may be completely unrelated
          to the memory instruction that happened to precede  the
          instruction with the clock-profile hit.  For example, if
          a memory instruction hits in the cache,  but  is  in  a
          loop  executed many times, high counts on that instruc-
          tion may appear to indicate memory stall delays when
          there are none.  This situation may be disambiguated by
          examining the disassembly around the instruction indicat-
          ing  the  stall.   If the surrounding instructions also
          have high clock-profiling metrics, the memory delay  is
          likely to be spurious.

          Clock-based dataspace profiling should only be used  on
          machines  that  do  not support HW counter profiling on
          memory-based counters.

          See the section "DATASPACE PROFILING", below.

          If no  explicit  -p  off  argument  is  given,  and  no
          hardware  counter overflow profiling is specified, turn
          on clock-based profiling.
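
          The interval rules for -p can be sketched in Python  as
          follows.   This  is  an illustration only: the function
          name is hypothetical, and the minimum,  resolution  and
          maximum are system-dependent, so they are passed in by
          the caller rather than taken from any  real  platform.
          The optional + prefix is not handled.

```python
from decimal import Decimal

# Approximate intervals for the named values, in microseconds.
NAMED_US = {"on": 10000, "lo": 100000, "low": 100000,
            "hi": 1000, "high": 1000}


def clock_interval_us(spec, minimum_us, resolution_us, maximum_us):
    """Convert a -p interval string to microseconds."""
    if spec in NAMED_US:
        return NAMED_US[spec]
    if spec.endswith("u"):
        value_us = Decimal(spec[:-1])          # microseconds
    elif spec.endswith("m"):
        value_us = Decimal(spec[:-1]) * 1000   # milliseconds
    else:
        value_us = Decimal(spec) * 1000        # no suffix: milliseconds
    if value_us <= 0:
        raise ValueError("interval must be positive")
    if value_us > maximum_us:
        raise ValueError("interval exceeds the clock profiling maximum")
    if value_us < minimum_us:
        return minimum_us                      # raise to the minimum
    # Round down to a multiple of the clock profiling resolution.
    return int(value_us // resolution_us) * resolution_us
```

          With a hypothetical minimum of 500  microseconds  and  a
          resolution  of  1000  microseconds, a request of 2500u is
          rounded down to 2000 microseconds.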

     -h ctr_def...[,ctr_n_def]
          Collect hardware counter overflow profiles. The  number
          of  counter  definitions (ctr_def through ctr_n_def) is
          processor-dependent. For example, on an UltraSPARC  III
          system,  up  to  two  counters may be programmed; on an
          Intel Pentium IV with Hyperthreading, up to 18 counters
          are  available.  The  user  can  ascertain  the maximum
          number of hardware counter  definitions  for  profiling
          on  a  target  system,  and  the full list of available
          hardware  counters,  by  running  the  collect  command
          without any arguments.

          This option is now available  on  systems  running  the
          Linux  OS.  The  user is responsible for installing the
          required perfctr patch on the system; that patch can be
          downloaded from:
          http://user.it.uu.se/~mikpe/linux/perfctr/2.6/perfctr-2.6.15.tar.gz
          Instructions for installation are contained within that
          tar file.

          Each counter definition  takes  one  of  the  following
          forms,  depending  on  whether  attributes for hardware
          counters are supported on the processor:

          1. [+]ctr[/reg#][,interval]

          2. [+]ctr[~attr=val]...[~attrN=valN][/reg#][,interval]

          The meanings of the counter definition options  are  as
          follows:

          Value     Meaning

          +         Optional parameter that  can  be  applied  to
                    memory-related  counters.  Causes  collect to
                    collect dataspace  data  by  backtracking  to
                    find the instruction that triggered the over-
                    flow, and to find the  virtual  and  physical
                    addresses of the memory reference. Backtrack-
                    ing works on SPARC processors, and only  with
                    counters  of type load, store, or load-store,
                    as displayed in the counter list obtained  by
                    running   the  collect  command  without  any
                    command-line  arguments.   See  the   section
                    "DATASPACE PROFILING", below.

          ctr       Processor-specific counter name. The user can
                    ascertain  the  list of counter names by run-
                    ning  the  collect   command    without   any
                    command-line arguments.

          attr=val  On some processors, attribute options can  be
                    associated  with  a  hardware counter. If the
                    processor supports  attribute  options,  then
                    running   collect  without  any  command-line
                    arguments shows the counter definition,
                    ctr_def, in the second form listed above, and
                    provides a list of attribute names to use for
                    attr.  Value val can be in decimal or hexade-
                    cimal format. Hexadecimal numbers are in C
                    program format, where the number is prefixed
                    by a zero and a lower-case x (0xhex_number).

          reg#      Hardware register to use for the counter.  If
                    not  specified, collect attempts to place the
                    counter into the first available register and
                    as  a result, might be unable to place subse-
                    quent counters due to register conflicts.  If
                    the user specifies more than one counter, the
                    counters must use different  registers.   The
                    list  of  allowable  register  numbers can be
                    ascertained by running  the  collect  command
                    without any command-line arguments.

          interval  Sampling  frequency,  set  by  defining   the
                    counter  overflow value.  Valid values are as
                    follows:

                    Value     Meaning

                    on        Select the default rate, which  can
                              be  determined  by running the col-
                              lect command without  any  command-
                              line   arguments.   Note  that  the
                              default value for all raw  counters
                              is  the  same, and might not be the
                              most suitable value for a  specific
                              counter.

                    hi        Set interval  to  approximately  10
                              times shorter than on.

                    lo        Set interval  to  approximately  10
                              times longer than on.

                    value     Set interval to a  specific  value,
                              specified in decimal or hexadecimal
                              format.

          An experiment can specify both hardware  counter  over-
          flow  profiling and clock-based profiling.  If hardware
          counter overflow profiling  is  specified,  but  clock-
          based  profiling  is not explicitly specified, turn off
          clock-based profiling.
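
          The two counter definition forms listed  above  can  be
          taken  apart  with  a  short  Python  sketch.   It is an
          illustration of the syntax only, not the parser used by
          collect.  The function name is hypothetical, and so are
          the counter and attribute names in the usage note;  real
          names  are  those  listed by running collect without any
          arguments.  The sketch handles a single  counter  defin-
          ition, not a full -h argument.

```python
import re

_CTR_DEF = re.compile(
    r"^(?P<plus>\+)?"                     # optional dataspace request
    r"(?P<ctr>[^~/,]+)"                   # counter name
    r"(?P<attrs>(?:~[^~/,=]+=[^~/,]+)*)"  # zero or more ~attr=val
    r"(?:/(?P<reg>\d+))?"                 # optional register number
    r"(?:,(?P<interval>[^,]+))?$"         # optional interval
)


def _number(text):
    """Parse a decimal or 0x-prefixed hexadecimal number."""
    if text.lower().startswith("0x"):
        return int(text, 16)
    return int(text, 10)


def parse_ctr_def(text):
    """Split one counter definition into its parts."""
    m = _CTR_DEF.match(text)
    if m is None:
        raise ValueError("malformed counter definition: " + text)
    attrs = {}
    for item in m.group("attrs").split("~")[1:]:
        name, value = item.split("=", 1)
        attrs[name] = _number(value)
    interval = m.group("interval")        # None when omitted
    if interval is not None and interval not in ("on", "hi", "lo"):
        interval = _number(interval)
    reg = m.group("reg")
    return {
        "dataspace": m.group("plus") is not None,
        "ctr": m.group("ctr"),
        "attrs": attrs,
        "reg": None if reg is None else int(reg),
        "interval": interval,
    }
```

          For example, with the made-up  names  ecstall  and  umask,
          parse_ctr_def("+ecstall~umask=0x1/1,hi")    reports     a
          dataspace request for counter ecstall, attribute umask
          set to 1, register 1, and the interval hi.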

          For more information  on  hardware  counters,  see  the
          "Hardware Counter Overflow Profiling" section below.

     -s option
          Collect synchronization tracing data.

          The minimum delay threshold for tracing events  is  set
          using option.  The allowed values of option are:

          Value     Meaning

          on        Turn on synchronization delay tracing and set
                    the threshold value by calibration at runtime

          calibrate Same as on

          off       Turn off synchronization delay tracing

          n         Turn on synchronization delay tracing with  a
                    threshold  value  of  n microseconds; if n is
                    zero, trace all events

          all       Turn on  synchronization  delay  tracing  and
                    trace all synchronization events

          By default, turn off synchronization delay tracing.

          Record synchronization events for  Java  monitors,  but
          not for native synchronization within the JVM machine.

     -H option
          Collect heap trace data. The allowed values  of  option
          are:

          Value     Meaning

          on        Turn on tracing of memory allocation requests

          off       Turn  off  tracing   of   memory   allocation
                    requests

          By default, turn off heap tracing.

          Record heap-tracing events for any native calls.  Treat
          calls to mmap as memory allocations.

          Heap profiling is  not  supported  for  Java  programs.
          Specifying it is treated as an error.

          Note that heap tracing may produce a very large experi-
          ment.   Such  experiments  are  very  slow  to load and
          browse.


     -m option
          Collect MPI tracing data.

          The allowed values of option are:

          Value     Meaning

          on        Turn on tracing of MPI calls

          off       Turn off tracing of MPI calls

          By default, turn off MPI tracing.


     -c option
          Collect count data, using bit(1) instrumentation.  This
          option is available only on SPARC systems.

          The allowed values of option are:

          Value     Meaning

          on        Turn on count data

          static    Turn  on  simulated  count data, based on the
                    assumption that every  instruction  was  exe-
                    cuted exactly once.

          off       Turn off count data

          By default, turn off count data.  Count data may not be
          collected  with any other type of data.  For count data
          or  simulated  count  data,  the  executable  and   any
          shared objects  that  are  instrumented  and statically
          linked are counted; dynamically loaded  shared  objects
          are neither instrumented nor counted.

          In order to collect count data, the executable must  be
          compiled with the -xbinopt=prepare flag.


     -r option
          Collect thread-analyzer data.

          The allowed values of option are:

          Value     Meaning

          on        Turn on thread analyzer data-race-detection
                    data

          all       Turn on all thread analyzer data

          off       Turn off thread analyzer data

          dt1,...,dtN
                    Turn on specific thread analyzer data types,
                    as named by the dt* parameters.

          The specific types of thread analyzer data that may  be
          requested are:

          Value     Meaning

          race      Collect data-race data

          deadlock  Collect deadlock and potential-deadlock data

          By default, turn off all thread-analyzer data.   Thread
          Analyzer  data  may  not  be collected with any tracing
          data, but may be collected in conjunction with
          clock- or HWC-profile data.  Collecting Thread Analyzer
          data significantly slows down  the  execution  of  the
          target,  and  profiles may not be meaningful as applied
          to the user code.

          Thread Analyzer experiments can be examined with either
          analyzer or tha.  The latter presents a simplified list
          of default tabs, but is otherwise identical.

          In order to enable  data-race  detection,  executables
          must  be  instrumented,  either  at compile time or by
          invoking a postprocessor.  If the target is not instru-
          mented,  and  none of the shared objects on its library
          list is instrumented, a warning is displayed,  but  the
          experiment  is  run.   Other Thread Analyzer data do not
          require instrumentation.

          For OpenMP race-detection, a new version of  libmtsk.so
          is  needed;  at FCS time, a patch for that version will
          be issued, but prior to that a copy of that library  is
          installed  with  the  bits,  and  will be automatically
          picked up by collect.

          See the tha(1) man page for more detail.

     -S interval
          Collect periodic samples at the interval specified  (in
          seconds).   Record  data  samples from the process, and
          include a timestamp and execution statistics from  the
          kernel,  among other things.  The allowed values of in-
          terval are:

          Value     Meaning

          off       Turn off periodic sampling

          on        Turn on periodic sampling with  the  default
                    sampling interval (1 second)

          n         Turn on periodic sampling  with  a  sampling
                    interval of n in seconds; n must be positive.

          By default, turn on periodic sampling.

          If no data specification arguments are  supplied,  col-
          lect  clock-based  profiling  data,  using  the default
          resolution.

          If clock-based profiling is  explicitly  disabled,  and
          neither  hardware-counter  overflow  profiling  nor any
          kind of tracing is enabled, display a warning  that  no
          function-level  data  is  being collected, then execute
          the target and record global data.
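
          The way the -p and -h defaults interact, and  the  con-
          dition  for  the warning above, can be written out as a
          small Python sketch.  The function names are hypotheti-
          cal  and  the  code is only a restatement of the rules
          given in this section, not part of collect.

```python
def clock_profiling_enabled(p_option=None, h_specified=False):
    """Decide whether clock-based profiling is on.

    p_option is the value given to -p, or None when -p is absent.
    """
    if p_option is not None:
        return p_option != "off"   # an explicit -p setting always wins
    return not h_specified         # default: on unless -h is given


def warns_no_function_data(p_option=None, h_specified=False,
                           tracing=False):
    """True when no function-level data would be collected."""
    clock = clock_profiling_enabled(p_option, h_specified)
    return not clock and not h_specified and not tracing
```

          Under these rules, giving -h alone turns clock-based pro-
          filing  off,  giving both -h and -p on keeps both kinds
          of data, and giving only -p off triggers the warning.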

  Experiment Controls
     -L size
          Limit the amount of profiling and tracing data recorded
          to size megabytes.  The limit applies to the sum of all
          profiling data and tracing  data,  but  not  to  sample
          points.  The  limit  is  only  approximate,  and can be
          exceeded.  When the limit is  reached,  stop  profiling
          and  tracing  data,  but  keep  the experiment open and
          record samples until  the  target  process  terminates.
          The allowed values of size are:

          Value     Meaning

          unlimited or none
                    Do not impose a size limit on the experiment

          n         Impose a limit of n MB; n  must  be  greater
                    than zero.

          The default limit on the amount  of  data  recorded  is
          2000 Mbytes.

     -F option
          Control whether or not descendant processes should have
          their data recorded.  The allowed values of option are:

          Value     Meaning

          on        Record experiments  on  descendant  processes
                    from fork and exec

          all       Record   experiments   on   all    descendant
                    processes

          off       Do  not  record  experiments  on   descendant
                    processes

          =<regex>  Record experiments on all descendant processes
                    whose executable name (a.out name) or lineage
                    matches the regular expression.

          By default, do not record  descendant  processes.   For
          more  details, users should read the section "FOLLOWING
          DESCENDANT PROCESSES", below.

     -A option
          Control whether or not load objects used by the  target
          process  should be archived or copied into the recorded
          experiment.  The allowed values of option are:

          Value     Meaning

          on        Archive load objects into the experiment.

          off       Do not archive load objects into the  experi-
                    ment.

          copy      Copy and archive load objects into the exper-
                    iment.

          A  user  that  copies  experiments  onto  a   different
          machine,  or  reads  the  experiments  from a different
          machine, should specify -A copy.  Note  that  doing  so
          does  not  copy  any sources or object files; it is the
          user's responsibility to ensure that  those  files  are
          accessible on the machine where the experiment is being
          run.

          The default setting for -A is on.

     -j option
          Control  Java  profiling  when  the  target  is  a  JVM
          machine. The allowed values of option are:

          Value     Meaning

          on        Record profiling data for  the  JVM  machine,
                    and  recognize  methods  compiled by the Java
                    HotSpot[TM] virtual machine, and also  record
                    Java callstacks.

          off       Do not record Java profiling data.

          <path>    Record profiling data for the  JVM,  and  use
                    the JVM as installed in <path>.

          See the section "JAVA PROFILING", below.

          The user must use -j on to obtain profiling data if the
          target  is  a  JVM  machine.   The  -j on option is not
          needed if the target is a class or jar file.  Users  on
          a  64-bit  JVM machine must specify its path explicitly
          as the target; do not use the -d64 option for a  32-bit
          JVM machine.  If the -j on option is specified, but the
          target is not a JVM machine, an invalid  argument might
          be passed to the target, and no data would be recorded.
          The collect command validates the version  of  the  JVM
          machine specified for Java profiling.

     -J java_arg
          Specify a single argument to be passed to the JVM  used
          for profiling.  If  -J is specified, but Java profiling
          is not specified, an error is generated, and no experi-
          ment is run.  The argument is passed as a single argument
          to the JVM.  If multiple arguments are needed,  do  not
          use  -J, but rather specify the path to the JVM expli-
          citly, use  -j on, and add the arguments  for  the  JVM
          after the path to the JVM.

     -l signal
          Record a sample point  whenever  the  given  signal  is
          delivered to the process.

     -y signal[,r]
          Control recording of data with  signal.   Whenever  the
          given  signal  is  delivered  to  the  process,  switch
          between paused (no data is recorded) and resumed  (data
          is  recorded) states. Start in the resumed state if the
          optional ,r flag  is  given,  otherwise  start  in  the
          paused  state.  This option does not affect the record-
          ing of sample points.

  Output Controls
     -o experiment_name
          Use experiment_name as the name of the experiment to be
          recorded.   The  experiment_name must end in the string
          .er; if not, print an error message and do not run  the
          experiment.

          If -o is not specified, give the experiment a  name  of
          the  form stem.n.er, where stem is a string, and n is a
          number.  If a group name has been  specified  with  -g,
          set  stem to the group name without the .erg suffix. If
          no group name has  been  specified,  set  stem  to  the
          string "test".

          If invoked from one of the commands  used  to  run  MPI
          jobs  and -o is not specified, take the value of n used
          in the name  from  the  environment  variable  used  to
          define  the  MPI rank of that process. Otherwise, set n
          to one greater than the highest  integer  currently  in
          use.

          If the name is not specified in the form stem.n.er, and
          the given name is in use, print an error message and do
          not run the experiment.  If the name  is  of  the  form
          stem.n.er  and  the name supplied is in use, record the
          experiment under a name corresponding  to  one  greater
          than  the  highest value of n that is currently in use.
          Print a warning if the name is changed.
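
          The naming rules for -o can  be  sketched  in  Python.
          This  is  an  illustration, not the code used by collect.
          The function names are hypothetical,  the  names  already
          in use are passed in as a list rather than read from the
          directory, numbering is assumed to start at 1,  and  the
          MPI rank case is left out.

```python
import re

_NUMBERED = re.compile(r"^(?P<stem>.+)\.(?P<n>\d+)\.er$")


def next_number(stem, existing):
    """One greater than the highest n used by stem.n.er names."""
    used = [0]                     # assumption: numbering starts at 1
    for name in existing:
        m = _NUMBERED.match(name)
        if m and m.group("stem") == stem:
            used.append(int(m.group("n")))
    return max(used) + 1


def choose_name(existing, requested=None, group=None):
    """Pick an experiment name from -o (requested) and -g (group)."""
    if requested is None:
        # No -o: the stem is the group name without .erg, or "test".
        stem = group[:-len(".erg")] if group else "test"
        return "%s.%d.er" % (stem, next_number(stem, existing))
    if not requested.endswith(".er"):
        raise ValueError("experiment name must end in .er")
    if requested not in existing:
        return requested
    m = _NUMBERED.match(requested)
    if m is None:
        # In use and not of the form stem.n.er: an error.
        raise ValueError(requested + " is already in use")
    stem = m.group("stem")
    return "%s.%d.er" % (stem, next_number(stem, existing))
```

          For example, if test.1.er and test.4.er already exist and
          -o is not given, the sketch chooses test.5.er.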

     -d directory_name
          Place the experiment in directory  directory_name.   If
          no  directory  is  given,  place  the experiment in the
          current working directory.  If  a  group  is  specified
          (see  -g, below), the group file is also written to the
          directory named by -d.

     -g group_name
          Add the experiment to the experiment group  group_name.
          The  group_name  string must end in the string .erg; if
          not, report an error and do not run the experiment.
          The first line of a group file must contain the string
               #analyzer experiment group
          and each subsequent line is the name of an experiment.
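
          A group file in this layout can be written and  read
          back  with a few lines of Python.  The function names are
          hypothetical, and skipping blank lines  when  reading  is
          an assumption, not documented behavior.

```python
GROUP_HEADER = "#analyzer experiment group"


def format_group_file(experiments):
    """Return the text of a group file listing the experiments."""
    return "\n".join([GROUP_HEADER] + list(experiments)) + "\n"


def parse_group_file(text):
    """Return the experiment names listed in a group file."""
    lines = text.splitlines()
    if not lines or GROUP_HEADER not in lines[0]:
        raise ValueError("missing group file header")
    # Assumption: blank lines are ignored.
    return [line for line in lines[1:] if line.strip()]
```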

     -O file
          Append all output from  collect  itself  to  the  named
          file,  but  do not redirect the output from the spawned
          target.  If file is set to /dev/null suppress all  out-
          put from collect, including any error messages.

     -t duration
          Collect  data for the specified duration.  duration may
          be a single number, followed by  either  m,  specifying
          minutes,  or  s,  specifying  seconds (default), or two
          such numbers separated by a - sign.  If one  number  is
          given, data will be collected from the start of the run
          until the given time; if two numbers  are  given,  data
          will  be  collected  from the first time to the second.
          If the second time is  zero,  data  will  be  collected
          until  the end of the run.  If two non-zero numbers are
          given, the first must be less than the second.
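
          The duration syntax can be sketched in Python as  fol-
          lows.   The  function  names  are  hypothetical, and the
          sketch assumes whole numbers; it illustrates the  rules
          above and is not the parser used by collect.

```python
def _seconds(token):
    """Convert "N", "Ns" or "Nm" to a number of seconds."""
    if token.endswith("m"):
        return int(token[:-1]) * 60   # minutes
    if token.endswith("s"):
        return int(token[:-1])        # seconds
    return int(token)                 # no suffix: seconds


def parse_duration(spec):
    """Return (start, end) in seconds; end is None for end of run."""
    parts = spec.split("-")
    if len(parts) == 1:
        # One number: from the start of the run to the given time.
        return (0, _seconds(parts[0]))
    if len(parts) != 2:
        raise ValueError("expected one number, or two separated by -")
    start, end = _seconds(parts[0]), _seconds(parts[1])
    if end == 0:
        return (start, None)          # collect until the run ends
    if start != 0 and start >= end:
        raise ValueError("first time must be less than the second")
    return (start, end)
```

          For example, 1m-5m collects from 60  seconds  to  300
          seconds  into  the run, and 10s-0 collects from 10 seconds
          until the end of the run.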

  Other Arguments
     -P <pid>
          Write  a  script  for dbx to attach to the process with
          the given PID, and collect data from it.  Only  profil-
          ing  data, not tracing data, may be specified, and timed
          runs (-t) are not supported.

     -C comment
          Put the comment into the notes file for the experiment.
          Up to ten -C arguments may be supplied.

     -n   Dry run: do not run  the  target,  but  print  all  the
          details  of  the experiment that would be run.  Turn on
          -v.

     -R   Display the  text  version  of  the  performance  tools
          README  in  the  terminal  window. If the README is not
          found, print a warning.  Do not examine  further  argu-
          ments and do no further processing.

     -V   Print the current  version.   Do  not  examine  further
          arguments and do no further processing.

     -v   Print the current version and further detailed informa-
          tion about the experiment being run.

     -x   Leave the target process stopped on the exit  from  the
          exec  system  call,  in  order  to  allow a debugger to
          attach to it.  The collect command will print a message
          with the process PID.

          To attach a debugger to the target once it  is  stopped
          by collect, the user must follow the procedure below.

          - Obtain the  PID  of  the  process  from  the  message
            printed by the collect -x command

          - Start the debugger

          - Configure the debugger to ignore SIGPROF and, if  you
            chose  to  collect  hardware  counter data, SIGEMT on
            Solaris or SIGIO on Linux

          - Attach to the process using the PID.

          As the process runs under the control of the  debugger,
          the Collector records an experiment.

FOLLOWING DESCENDANT PROCESSES

     Data from the initial process spawned by collect, called the
     founder  process, is always collected.  Processes can create
     descendant processes by calling  system  library  functions,
     including the variants of fork, exec, system, etc.  If a  -F
     argument is used, the Collector can collect data for descen-
     dant  processes, and it opens a new experiment for each des-
     cendant process inside the  parent  experiment.   These  new
     experiments are named with their lineage as follows:

     - An underscore is  appended  to  the  creator's  experiment
       name.

     - A code letter is added: either "f" for a  fork,  "x"  for
       an exec, or "c" for other descendants.

     - A number is added after the  code  letter,  which  is  the
       index  of  the fork or exec.  This number is assigned
       whether or not the process was started successfully.

     - The experiment suffix, ".er", is appended to the lineage.

     For example, if the experiment name for the initial process
     is "test.1.er", the experiment for the descendant process
     created by its third fork is "test.1.er/_f3.er".  If that
     descendant process execs a new image, the corresponding
     experiment name is "test.1.er/_f3_x1.er".
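
     The naming rules above can be sketched in a few lines of
     Python.  This is only an illustration of the rules as
     described here, not the Collector's own code; the function
     name descendant_experiment and the (code, index) event
     pairs are hypothetical.

```python
def descendant_experiment(founder, events):
    """Build the path of a descendant experiment inside `founder`.

    `events` is a list of (code, index) pairs describing the lineage:
    code is "f" (fork), "x" (exec), or "c" (other descendants), and
    index is the number of that call within its creator.
    """
    for code, index in events:
        if code not in ("f", "x", "c"):
            raise ValueError(f"unknown lineage code: {code!r}")
        if index < 1:
            raise ValueError(f"lineage index must be positive: {index}")
    # Each step contributes an underscore, a code letter, and a number.
    lineage = "".join(f"_{code}{index}" for code, index in events)
    # The ".er" suffix is appended once, after the whole lineage.
    return f"{founder}/{lineage}.er"


print(descendant_experiment("test.1.er", [("f", 3)]))
print(descendant_experiment("test.1.er", [("f", 3), ("x", 1)]))
```

     The two calls reproduce the names used in the example
     above.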

     If the -F on argument is used, descendant processes initiated
     by calls to fork(2), fork1(2), fork(3F), vfork(2), and
     exec(2) and its variants are followed.  The call to vfork is
     replaced internally by a call to fork1.  Descendants created
     by calls to system(3C), system(3F), sh(3F), popen(3C), and
     similar functions, and their associated descendant
     processes, are not followed.

     If the -F all argument is used, all descendants are
     followed, including those from system(3C), system(3F),
     sh(3F), popen(3C), and similar functions.

     If the -F =<regex> argument is used, all descendants whose
     name or lineage matches the regular expression are
     followed.  When matching lineage, the ".er" should be
     omitted.  When matching names, both the command and its
     arguments are part of the expression.

     For example, to capture data on the descendant process of
     the first exec from the first fork from the first call to
     system in the founder, use:
          collect -F '=_c1_f1_x1'

     To capture data on all the variants of exec, but not fork,
     use:
          collect -F '=.*_x[0-9]/*'

     To capture data from a call to system("echo hello") but not
     system("goodbye"), use:
          collect -F '=echo hello'
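
     The selection rule can be approximated with the sketch
     below.  It assumes an unanchored search against both the
     lineage (without ".er") and the command line, and it uses
     Python's re module as a stand-in; the regular expression
     dialect and matching semantics actually used by the
     Collector may differ.  The function is_followed is
     hypothetical.

```python
import re


def is_followed(pattern, lineage, command):
    """Return True if `pattern` matches the lineage or the command line.

    `lineage` is the descendant's lineage with ".er" omitted, for
    example "_c1_f1_x1"; `command` is the command and its arguments.
    Assumption: an unanchored search, which may not be what the
    Collector does.
    """
    regex = re.compile(pattern)
    return bool(regex.search(lineage) or regex.search(command))


# Patterns are given here without the leading "=" of the -F argument.
print(is_followed("_c1_f1_x1", "_c1_f1_x1", "a.out"))
print(is_followed(".*_x[0-9]/*", "_f1", "a.out"))
print(is_followed("echo hello", "_c1", "echo hello"))
```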

     The Analyzer and er_print automatically read experiments for
     descendant  processes  when  the founder experiment is read,
     but the experiments for the  descendant  processes  are  not
     selected for data display.

     To select the  data  for  display  from  the  command  line,
     specify  the  path  name  explicitly  to  either er_print or
     Analyzer. The specified path must include the founder exper-
     iment  name, and the descendant experiment's name inside the
     founder directory.

     For example, to see the data  for  the  third  fork  of  the
     test.1.er experiment:
          er_print test.1.er/_f3.er
          analyzer test.1.er/_f3.er

     You can prepare an experiment group file with  the  explicit
     names of descendant experiments of interest.

     To examine descendant processes in the  Analyzer,  load  the
     founder  experiment  and  select "Filter data" from the View
     menu. The analyzer will display a list of  experiments  with
     only  the  founder  experiment  checked. Uncheck the founder
     experiment and check the descendant experiment of interest.

JAVA PROFILING

     Java profiling consists of collecting a performance  experi-
     ment on the JVM machine as it runs the user's .class or .jar
     files.  If possible, callstacks are collected  in  both  the
     Java model and in the machine model.

     Data can be shown with view mode set  to  User,  Expert,  or
     Machine.  User mode shows each method by name, with data for
     interpreted   and   HotSpot-compiled   methods    aggregated
     together; it also suppresses data for non-user-Java threads.
     Expert mode separates HotSpot-compiled methods  from  inter-
     preted methods, and does not suppress non-user Java threads.
     Machine mode shows data for interpreted Java methods against
     the  JVM machine as it does the interpreting, while data for
     methods compiled with the Java HotSpot  virtual  machine  is
     reported  for named methods.  All threads are shown.  In all
     three modes, data is reported in the usual way for any  non-
     OpenMP  C,  C++,  or  Fortran  code called by a Java target.
     Such code corresponds to Java native methods.  The  Analyzer
     and  the  er_print  utility can switch between the view mode
     User, view mode Expert, and view  mode  Machine,  with  User
     being the default.

     Clock-based profiling and hardware counter overflow  profil-
     ing  are  supported.   Synchronization tracing collects data
     only on the Java monitor calls,  and  synchronization  calls
     from  native  code;  it does not collect data about internal
     synchronization calls within the JVM.

     Heap tracing is not supported for  Java,  and  generates  an
     error if specified.

     When collect inserts a target name of java into the argument
     list,  it  examines  environment variables for a path to the
     java target, in the order JDK_HOME, and then JAVA_PATH.  For
     the  first  of  these environment variables that is set, the
     resultant target is verified as an ELF executable. If it  is
     not,  collect  fails with an error indicating which environ-
     ment variable was used, and the  full  path  name  that  was
     tried.

     If none of those environment variables is set,  the  collect
     command uses the default path where the Java[TM] 2 Platform,
     Standard Edition technology was installed with the  release,
     if  any,  and  if it was not installed, as set by the user's
     PATH.
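
     The lookup order can be illustrated as follows.  This is a
     sketch under stated assumptions: it only picks which
     environment variable would be consulted and checks a file
     for the 4-byte ELF magic number.  It treats an empty
     variable as unset, does not derive the java path from the
     variable's value, and does not check that the executable
     can run on the machine.  Both function names are
     hypothetical.

```python
import os

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def java_path_variable(environ=os.environ):
    """Return the first of JDK_HOME and JAVA_PATH that is set, else None."""
    for name in ("JDK_HOME", "JAVA_PATH"):
        if environ.get(name):  # assumption: empty string counts as unset
            return name
    return None  # caller falls back to the installed default, then PATH


def is_elf(path):
    """Return True if the file at `path` starts with the ELF magic number."""
    try:
        with open(path, "rb") as handle:
            return handle.read(4) == ELF_MAGIC
    except OSError:
        return False


print(java_path_variable({"JAVA_PATH": "/opt/java"}))
```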

     Java Profiling requires the Java[TM] 2 SDK, version 1.5.0_03
     or later; some earlier versions (but no earlier than the
     Java[TM] 2 SDK, version 1.4.2_02) may work, but are not
     supported.

OPENMP PROFILING

     Data collection for OpenMP programs collects data  that  can
     be  displayed  in  any  of the three view modes, just as for
     Java programs.  The presentation is identical for user  mode
     and  expert  mode.   Slave threads are shown as if they were
     really forked from the master thread, and have  call  stacks
     matching  the master thread. Frames in the call stack coming
     from the OpenMP runtime code  (libmtsk.so)  are  suppressed.
     For machine mode, the actual native stacks are shown.

     In user mode, various artificial functions are introduced as
     the  leaf  function  of  a  callstack  whenever  the runtime
     library is in one of several states.   These  functions  are
     <OMP-overhead>,     <OMP-idle>,    <OMP-reduction>,    <OMP-
     implicit_barrier>, <OMP-explicit_barrier>,  <OMP-lock_wait>,
     <OMP-critical_section_wait>, and <OMP-ordered_section_wait>.

     Two additional clock-profiling metrics are added to the data
     for clock-profiling experiments: OMP Work and OMP Wait.  OMP
     Work is counted when the OpenMP runtime thinks the code is
     doing work.  It includes time when the process is consuming
     User-CPU time, but it may also include time when the process
     is consuming System-CPU time, waiting for page faults,
     waiting for the CPU, and so on.  Hence, OMP Work may exceed
     User-CPU time.  OMP Wait is accumulated when the OpenMP
     runtime thinks the process is waiting.  It may include
     User-CPU time for busy-waits (spin-waits), but it also
     includes Other-Wait time for sleep-waits.

     The inclusive metrics are visible by default; the exclusive
     metrics are not.  The sum of OMP Work and OMP Wait equals
     the Total LWP Time metric.  These metrics are added for all
     clock-profiling and hardware-counter experiments.

DATASPACE PROFILING

     A dataspace profile is a data collection  in  which  memory-
     related  events,  such as cache misses, are reported against
     the data object references that cause the events rather than
     just the instructions where the memory-related events occur.
     Dataspace profiling is not available on systems running  the
     Linux OS, nor on x86 based systems running the Solaris OS.

     To allow dataspace profiling, the target may be written in
     C, C++ or Fortran, and must be compiled for SPARC
     architecture, with the -xhwcprof -xdebugformat=dwarf -g
     flags, as described above.  Furthermore, the data collected
     must be hardware counter profiles and the optional + must be
     prepended to the counter name.  If the optional + is
     prepended to one memory-related counter, but not all, the
     counters without the + will report dataspace data against
     the <Unknown> data object, with subtype (Dataspace data not
     requested during data collection).

     With the data collected, the er_print utility  allows  three
     additional   commands:    data_objects,   data_single,   and
     data_layout, as well as various commands relating to  Memory
     Objects.  See the er_print(1) man page for more information.

     In addition, the Analyzer now includes two tabs  related  to
     dataspace  profiling, labeled DataObjects and DataLayout, as
     well as a set of tabs relating to Memory Objects.   See  the
     analyzer(1) man page for more information.

     Clock-based dataspace profiling should only be used on
     machines that do not support HW counter profiling with
     memory-based counters.  It requires the same compilation
     flags as for HW counter profiling.  Data should be
     interpreted with care, as explained above.

USING COLLECT WITH MPI

     The collect command can be used with MPI by simply prefacing
     the  target  and  its arguments with the collect command and
     its arguments in the command line that starts the  MPI  job.
     For example, on an SMP machine,
          % mprun -np 16 a.out 3 5
     can be replaced by
          % mprun -np 16 collect -m on -d /tmp/mydirectory \
                -g run1.erg a.out 3 5
     This command runs an MPI tracing experiment on each  of  the
     16  MPI  processes, collecting them all in a specific direc-
     tory, and collecting them as a group.  The individual exper-
     iments  are  named by the MPI rank, as described above under
     the -o option.  The experiments, as specified above, contain
     clock-based  profiling  data, which is turned on by default,
     and MPI tracing data.

     On a cluster, local file systems like /tmp may be private to
     a  node.   If experiments are collected on node-private file
     systems, you should gather those experiments to  a  globally
     visible  file  system  after the experiments have completed,
     and edit any group file to reflect the new location of those
     experiments.

USING COLLECT WITH PPGSZ

     The collect command can be used with ppgsz  by  running  the
     collect  command on the ppgsz command, and specifying the -F
     on flag.  The founder experiment is on the ppgsz  executable
     and is uninteresting.  If your path finds the 32-bit version
     of ppgsz, and the experiment is being run on a  system  that
     supports  64-bit processes, the first thing the collect com-
     mand does is execute an exec function on its 64-bit version,
     creating _x1.er.  That executable forks, creating _x1_f1.er.
     The descendant process attempts to execute an exec  function
     on  the  named  target, in the first directory on your path,
     then in the second, and so forth,  until  one  of  the  exec
     functions  succeeds.   If,  for  example,  the third attempt
     succeeds, the first two  descendant  experiments  are  named
     _x1_f1_x1.er  and  _x1_f1_x2.er,  and  both  are  completely
     empty.  The experiment on the target is  the  one  from  the
     successful  exec, the third one in the example, and is named
     _x1_f1_x3.er, stored under the founder experiment.   It  can
     be  processed  directly  by  invoking  the  Analyzer  or the
     er_print utility on test.1.er/_x1_f1_x3.er.

     If the 64-bit ppgsz is the initial process run,  or  if  the
     32-bit ppgsz is invoked on a 32-bit kernel, the fork descen-
     dant that executes exec on the real target has its  data  in
     _f1.er,  and  the  real target's experiment is in _f1_x3.er,
     assuming the same path properties as in the example above.

     See the section  "FOLLOWING  DESCENDANT  PROCESSES",  above.
     For more information on hardware counters, see the "Hardware
     Counter Overflow Profiling" section below.

     The collect command operates by inserting a shared  library,
     libcollector.so,    into    the   target's   address   space
     (LD_PRELOAD),  and  by  using  a  second   shared   library,
     collaudit.so,  to  record shared-object use with the runtime
     linker's  audit  interface  (LD_AUDIT).   Those  two  shared
     libraries write the files that constitute the experiment.

     Several problems may arise if collect is invoked on  execut-
     ables  that call setuid or setgid, or that create descendant
     processes that call setuid or setgid.  If the  user  running
     the  experiment  is  not  root, collection fails because the
     shared libraries are not installed in a  trusted  directory.
     The workaround is to run the experiments as root.

     In addition, the umask for the user running the collect com-
     mand  must  be  set to allow write permission for that user,
     and  for  any  users  or  groups  that  are   set   by   the
     setuid/setgid  attributes  of a program being exec'd and for
     any user or group to which that program sets itself.  If the
     mask  is  not set properly, some files may not be written to
     the experiment, and processing of the experiment may not  be
     possible.  If the log file can be written, an error is shown
     when the user attempts to process the experiment.

     Other problems can arise if the target itself makes  any  of
     the  system  calls  to  set UID or GID, or if it changes its
     umask and then forks or runs exec on some other process,  or
     crle  was  used to configure how the runtime linker searches
     for shared objects.

DATA COLLECTED

Three types of data are collected: profiling data, tracing
data and sampling data. The data packets recorded in profil-
ing and tracing include the callstack of each LWP, the LWP,
thread and CPU IDs, and some event-specific data. The data
packets recorded in sampling contain global data such as
execution statistics, but no program-specific or event-
specific data. All data packets include a timestamp.

Clock-based Profiling
The event-specific data recorded in clock-based profil-
ing is an array of counts for each accounting micro-
state. The microstate array is incremented by the sys-
tem at a prescribed frequency, and is recorded by the
Collector when a profiling signal is processed.

Clock-based profiling can run at a range of frequencies
which must be multiples of the clock resolution used
for the profiling timer. If you try to do high-
resolution profiling on a machine with an operating
system that does not support it, the command prints a
warning message and uses the highest resolution sup-
ported. Similarly, a custom setting that is not a mul-
tiple of the resolution supported by the system is
rounded down to the nearest non-zero multiple of that
resolution, and a warning message is printed.
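
The rounding rule can be written out as a short sketch. It
assumes the requested interval and the system resolution are
given as integers in the same unit (microseconds in the
example); the function name effective_interval is hypothetical
and the warning message is omitted.

```python
def effective_interval(requested, resolution):
    """Round `requested` down to the nearest non-zero multiple of `resolution`.

    Both values are integers in the same time unit. A request finer
    than the resolution is raised to one full resolution step.
    """
    if requested <= 0 or resolution <= 0:
        raise ValueError("interval and resolution must be positive")
    multiples = max(1, requested // resolution)  # never round down to zero
    return multiples * resolution


print(effective_interval(25000, 10000))  # not a multiple: rounded down
print(effective_interval(3000, 10000))   # finer than the resolution
```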

Clock-based profiling data is converted into the fol-
lowing metrics:

User CPU Time
Wall Time
Total LWP Time
System CPU Time
Wait CPU Time
User Lock Time
Text Page Fault Time
Data Page Fault Time
Other Wait Time

For experiments on multithreaded applications, all of
the times, other than Wall Time, are summed across all
LWPs in the process; Wall Time is the time spent in
all states for LWP 1 only. Total LWP Time adds up to
the real elapsed time, multiplied by the average number
of LWPs in the process.

If clock-based dataspace profiling is specified, an
additional metric is provided:

Max. Mem Stalls

Hardware Counter Overflow Profiling
Hardware counter overflow profiling records the number
of events counted by the hardware counter at the time
the overflow signal was processed. This type of profil-
ing is now available on systems running the Linux OS,
provided that they have the Perfctr patch installed.

Hardware counter overflow profiling can be done on sys-
tems that support overflow profiling and that include
the hardware counter shared library, libcpc.so(3). You
must use a version of the Solaris OS no earlier than
the Solaris 8 OS. On UltraSPARC[R] computers, you must
use a version of the hardware no earlier than the
UltraSPARC III hardware. On computers that do not sup-
port overflow profiling, an attempt to select hardware
counter overflow profiling generates an error.

The counters available depend on the specific CPU pro-
cessor and operating system. Running the collect com-
mand with no arguments prints out a usage message that
contains the names of the counters. The counters that
are considered well-known are displayed first in the
list, followed by a list of the raw hardware counters.

The lines of output are formatted similarly to the
following:

Well known HW counters available for profiling:
cycles[/{0|1}],9999991 ('CPU Cycles', alias for Cycle_cnt; CPU-cycles)
insts[/{0|1}],9999991 ('Instructions Executed', alias for Instr_cnt; events)
dcrm[/1],100003 ('D$ Read Misses', alias for DC_rd_miss; load events)
...
Raw HW counters available for profiling:
Cycle_cnt[/{0|1}],1000003 (CPU-cycles)
Instr_cnt[/{0|1}],1000003 (events)
DC_rd[/0],1000003 (load events)
SI_snoop[/0],1000003 (not-program-related events)
...

In the first line of the well-known counter output, the
first field, "cycles", gives the well-known counter
name that can be used in the -h counter... argument. It
is followed by a specification of which registers can
be used for that counter. The next field, "9999991",
is the default overflow value for that counter. The
following field in parentheses, "CPU Cycles", is the
metric name, followed by the raw hardware counter name.
The last field, "CPU-cycles", specifies the type of
units being counted. There can be up to two words for
the type of information. The second or only word of
the type information may be either "CPU-cycles" or
"events". If the counter can be used to provide a
time-based metric, the value is CPU-cycles; otherwise
it is events.

The second output line of the well-known counter output
above has "events" instead of "CPU-cycles" at the end
of the line, indicating that it counts events and
cannot be converted to a time.

The third output line above has two words of type
information, "load events", at the end of the line. The
first word of type information may have the value of
"load", "store", "load-store", or "not-program-
related". The first three of these type values indicate
that the counter is memory-related and the counter name
can be preceded by the "+" sign when used in the col-
lect -h command. The "+" sign indicates the request
for data collection to attempt to find the precise
instruction and virtual address that caused the event
on the counter that overflowed.

The "not-program-related" value indicates that the
counter captures events initiated by some other pro-
gram, such as CPU-to-CPU cache snoops. Using the
counter for profiling generates a warning and profiling
does not record a call stack. It does, however, show
the time being spent in an artificial function called
"collector_not_program_related". Thread IDs and LWP IDs
are recorded, but are meaningless.

The information included in the raw hardware counter
list is a subset of the well-known counter list. Each
line includes the internal counter name as used by cpu-
track(1), the register number(s) on which that counter
can be used, the default overflow value, and the
counter units, which are either CPU-cycles or events.
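
The field layout described above can be made concrete with a
small parser. This is a sketch that assumes the lines look
exactly like the sample output shown earlier; the regular
expression, the parse_counter function, and the dictionary keys
are all hypothetical, and real output may contain variations
that this pattern does not cover.

```python
import re

# name[/registers],overflow ('Metric', alias for raw; units)
# The quoted metric and alias appear only in the well-known list.
COUNTER_LINE = re.compile(
    r"^(?P<name>\w+)"
    r"\[/(?P<regs>[^\]]+)\]"
    r",(?P<overflow>\d+)"
    r" \((?:'(?P<metric>[^']*)', alias for (?P<raw>\w+); )?"
    r"(?P<units>[^)]+)\)$"
)

MEMORY_TYPES = {"load", "store", "load-store"}


def parse_counter(line):
    """Split one counter description line into its fields."""
    match = COUNTER_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized counter line: {line!r}")
    words = match["units"].split()
    return {
        "name": match["name"],
        # "{0|1}" and "1" both become a list of register numbers.
        "registers": [int(r) for r in match["regs"].strip("{}").split("|")],
        "overflow": int(match["overflow"]),
        "metric": match["metric"],      # None for raw counters
        "raw_name": match["raw"],       # None for raw counters
        "time_based": words[-1] == "CPU-cycles",
        # Memory-related counters may take the "+" prefix with -h.
        "memory_related": len(words) == 2 and words[0] in MEMORY_TYPES,
    }


print(parse_counter(
    "dcrm[/1],100003 ('D$ Read Misses', alias for DC_rd_miss; load events)"
))
```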

EXAMPLES:

Example 1: Using the well-known counter information
listed in the above sample output, the following com-
mand:

collect -h cycles/0,hi,+dcrm,9999

enables the CPU Cycle profiling on register 0. The "hi"
value enables a sample rate that is approximately 10
times faster than the default rate of 9999991. The
"dcrm" value enables the D$ Read Miss profiling on
register 1 and the preceding "+" enables Dataspace pro-
filing for the dcrm. The "9999" value sets the sampling
to be done every 9999 read misses, instead of the
default value of every 100003 read misses.

Example 2:

Running the collect command with no arguments on an AMD
Opteron machine would produce a raw hardware counter
output similar to the following:

FP_dispatched_fpu_ops[/{0|1|2|3}],1000003 (events)
FP_cycles_no_fpu_ops_retired[/{0|1|2|3}],1000003 (CPU-cycles)
...

Using the above raw hardware counter output, the fol-
lowing command:

collect -h FP_dispatched_fpu_ops~umask=0x3/2,10007

enables the Floating Point Add and Multiply operations
to be tracked at the rate of 1 capture every 10007
events. (For more details on valid attribute values,
refer to the processor documentation.) The "/2" value
specifies that the data is to be captured using register
2 of the hardware.

Synchronization Delay Tracing
Synchronization delay tracing records all calls to the
various thread synchronization routines where the
real-time delay in the call exceeds a specified thres-
hold. The data packet contains timestamps for entry and
exit to the synchronization routines, the thread ID,
and the LWP ID at the time the request is initiated.
(Synchronization requests from a thread can be ini-
tiated on one LWP, but complete on another.)

Synchronization delay tracing data is converted into
the following metrics:

Synchronization Delay Events
Synchronization Wait Time
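
As an illustration of the threshold rule, the sketch below
derives the two metrics from a list of entry and exit
timestamps. It assumes that Synchronization Wait Time is the sum
of the recorded delays, which this page does not state
explicitly; the function name and the data layout are
hypothetical.

```python
def sync_delay_metrics(calls, threshold):
    """Summarize synchronization calls whose delay exceeds `threshold`.

    `calls` is an iterable of (entry, exit) timestamps in the same
    unit as `threshold`. Calls at or below the threshold are not
    recorded, so they contribute to neither metric.
    """
    delays = [
        exit_time - entry_time
        for entry_time, exit_time in calls
        if exit_time - entry_time > threshold
    ]
    return {"events": len(delays), "wait_time": sum(delays)}


# Delays are 5, 30, and 100; only the last two exceed the threshold.
print(sync_delay_metrics([(0, 5), (10, 40), (50, 150)], threshold=20))
```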

Heap Tracing
Heap tracing records all calls to malloc, free, real-
loc, memalign, and valloc with the size of the block
requested, its address, and for realloc, the previous
address.

Heap tracing data is converted into the following
metrics:

Leaks
Bytes Leaked
Allocations
Bytes Allocated

Leaks are defined as allocations that are not freed.
If a zero-length block is allocated, it counts as an
allocation with zero bytes allocated. If a zero-length
block is not freed, it counts as a leak with zero bytes
leaked.
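
The leak definition can be sketched as follows. The example
handles only malloc and free events in a simplified, hypothetical
tuple format, and ignores realloc, memalign, and valloc; it is
meant to show how the four metrics relate, not how the Collector
computes them.

```python
def heap_metrics(events):
    """Compute heap metrics from (kind, address, size) events.

    `kind` is "malloc" or "free"; `size` is ignored for "free".
    A block that is allocated and never freed counts as a leak,
    including a zero-length block (a leak of zero bytes).
    """
    live = {}  # address -> size of blocks not yet freed
    allocations = 0
    bytes_allocated = 0
    for kind, address, size in events:
        if kind == "malloc":
            allocations += 1
            bytes_allocated += size
            live[address] = size
        elif kind == "free":
            live.pop(address, None)
    return {
        "allocations": allocations,
        "bytes_allocated": bytes_allocated,
        "leaks": len(live),
        "bytes_leaked": sum(live.values()),
    }


trace = [
    ("malloc", 0x10, 16),
    ("malloc", 0x20, 0),   # zero-length allocation
    ("malloc", 0x30, 32),
    ("free", 0x10, 0),
]
print(heap_metrics(trace))
```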

For applications written in the Java[TM] programming
language, leaks are defined as allocations that have
not been garbage-collected. Heap profiling for such
applications is obsolescent and will not be supported
in future releases.

Heap tracing experiments can be very large, and may be
slow to process.

MPI Tracing
MPI tracing records calls to the MPI library for func-
tions that can take a significant amount of time to
complete. MPI tracing is not available on systems run-
ning the Linux OS.

The following functions from the MPI library are
traced:

MPI_Allgather
MPI_Allgatherv
MPI_Allreduce
MPI_Alltoall
MPI_Alltoallv
MPI_Barrier
MPI_Bcast
MPI_Bsend
MPI_Gather
MPI_Gatherv
MPI_Irecv
MPI_Isend
MPI_Recv
MPI_Reduce
MPI_Reduce_scatter
MPI_Rsend
MPI_Scan
MPI_Scatter
MPI_Scatterv
MPI_Send
MPI_Sendrecv
MPI_Sendrecv_replace
MPI_Ssend
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome
MPI_Win_fence
MPI_Win_lock

MPI tracing data is converted into the following
metrics:

MPI Time
MPI Sends
MPI Bytes Sent
MPI Receives
MPI Bytes Received
Other MPI Calls

MPI Time is the total LWP time spent in the MPI func-
tion.

The MPI Bytes Received metric uses the actual number of
bytes for blocking calls, but uses the buffer size for
non-blocking calls. Metrics that are computed for col-
lective operations such as gather, scatter, and reduce
have the maximum possible values for these operations.
No reduction in the values is made due to optimization
of the collective operations.

Note that the MPI Bytes Received as reported may seri-
ously overestimate the actual transmissions whenever
the buffer used for a non-blocking receive is signifi-
cantly larger than the size needed for the receive.

Count Data
Count data is recorded by instrumenting the executable,
and counting the number of times each instruction was
executed. It also counts the number of times the first
instruction in a function is executed, and calls that
the function execution count.

Count data is converted into the following metrics:

Bit Func Count
Bit Inst Exec
Bit Inst Annul

Data-race Detection Data
Data-race detection data consists of pairs of race-
access events that constitute a race. The events are
combined into a race, and races for which the
callstacks for the two accesses are identical are merged
into a race group.

Data-race detection data is converted into the follow-
ing metric:

Race Accesses

Deadlock Detection Data
Deadlock detection data consists of pairs of threads
with conflicting locks.

Deadlock detection data is converted into the following
metric:

Deadlocks

Sampling and Global Data
Sampling refers to the process of generating markers
along the time line of execution. At each sample point,
execution statistics are recorded. All of the data
recorded at sample points is global to the program, and
does not map to function-level metrics.

Samples are always taken at the start of the process,
and at its termination. By default or if a non-zero -S
argument is specified, samples are taken periodically
at the specified interval. In addition, samples can be
taken by using the libcollector(3) API.

The data recorded at each sample point consists of
microstate accounting information from the kernel,
along with various other statistics maintained within
the kernel.

RESTRICTIONS

     The Collector interposes on some signal-handling routines to
     ensure that its use of SIGPROF signals for clock-based
     profiling and SIGEMT (Solaris) or SIGIO (Linux) for hardware
     counter overflow profiling is not disrupted by the target
     program.  The Collector library re-installs its own signal
     handler if the target program installs a signal handler.  The
     Collector's signal handler sets a  flag  that  ensures  that
     system  calls  are  not interrupted to deliver signals. This
     setting could change the behavior of the target program.

     The Collector interposes on setitimer(2) to ensure that  the
     profiling  timer  is  not available to the target program if
     clock-based profiling is enabled.

     The  Collector  interposes  on  functions  in  the  hardware
     counter  library,  libcpc.so,  so that an application cannot
     use hardware counters while the Collector is collecting per-
     formance  data.  The  interposed functions return a value of
     -1.


     Dataspace profiling is not available on systems running the
     Linux OS.

     For this release, the data from collecting periodic  samples
     is not reliable on systems running the Linux OS.

     For this release, wide data discrepancies are observed  when
     profiling  multithreaded applications on systems running the
     RedHat Enterprise Linux OS.

     Hardware counter overflow profiling cannot be run on a  sys-
     tem  where cpustat is running, because cpustat takes control
     of the counters, and does not let a user process use them.

     Java Profiling requires the Java[TM] 2 SDK, version 1.5.0_03
     or later updates of 1.5.0, or Java[TM] 2 SDK, version 1.6.0
     or later updates of 1.6.0.

     Data is not  collected  on  descendant  processes  that  are
     created  to  use the setuid attribute, nor on any descendant
     processes created with an exec function run on an executable
     that  is  not  dynamically  linked.  Furthermore, subsequent
     descendant processes may  produce  corrupted  or  unreadable
     experiments.  The workaround is to ensure that all processes
     spawned are dynamically-linked and do not  have  the  setuid
     attribute.

     Applications that call vfork(2) have these calls replaced by
     a call to fork1(2).