Chapter 12 User Process Tracing

DTrace is a powerful tool for understanding the behavior of user processes. DTrace can be invaluable when debugging and analyzing performance problems, or for simply understanding the behavior of a complex application. This chapter focuses on the DTrace facilities that are relevant to tracing user process activity and provides examples that illustrate their use.

12.1 copyin and copyinstr Subroutines

DTrace's interaction with processes is slightly different than most traditional debuggers and observability tools. Many such tools appear to execute within the scope of the process, allowing users dereference pointers to program variables directly. Rather than appearing to execute within or as part of the process itself, DTrace probes execute in the Oracle Linux kernel. To access process data, a probe uses the copyin or copyinstr subroutines to copy user process data into the address space of the kernel.

For example, consider the following write() system call:

ssize_t write(int fd, const void *buf, size_t nbytes);

The following D program illustrates an incorrect attempt to print the contents of a string that is passed to the write() system call:

syscall::write:entry
{
  printf("%s", stringof(arg1)); /* incorrect use of arg1 */
}

If you attempt to run this script, DTrace produces error messages similar to the following:

dtrace: error on enabled probe ID 1 (ID 37: syscall::write:entry): \
invalid address (0x10038a000) in action #1

The arg1 variable, which contains the value of the buf parameter, is an address that refers to memory in the process executing the system call. To read the string at that address, use the copyinstr subroutine and record its result with the printf action, for example:

syscall::write:entry
{
  printf("%s", copyinstr(arg1)); /* correct use of arg1 */
}

In the previous script, the output shows all of the strings that are being passed to the write() system call. Occasionally, however, you might see irregular output similar to the following:

 0         37                     write:entry mada&^%**&

The copyinstr subroutine acts on an input argument, which is the user address of a null-terminated ASCII string, but buffers that are passed to the write() system call might refer to binary data rather than ASCII strings or to ASCII strings that do not include a terminating null byte. To print only as much of the string as the caller intended, use the two parameter version of the copyinstr subroutine, which includes the size of the targeted string buffer:

syscall::write:entry { printf("%s", copyinstr(arg1, arg2)); }

Alternatively, you can use the copyin subroutine, which takes an address and size, for example:

syscall::write:entry
{
  printf("%s", stringof(copyin(arg1, arg2)));
}

Note that the stringof operator is necessary so that DTrace properly converts the user data that is retrieved by copyin to a string. The use of stringof is not necessary with the copyinstr subroutine because it always returns the type string.

12.1.1 Avoiding Errors

The copyin and copyinstr subroutines cannot read from user addresses that have not yet been touched, so even a valid address could cause an error if the page containing that address has not yet been faulted in by being accessed. Consider the following example:

# dtrace -n syscall::open:entry'{ trace(copyinstr(arg0)); }'
dtrace: description 'syscall::open:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  1      8                       open:entry   /dev/sr0                         
  1      8                       open:entry   /var/run/utmp                    
  1      8                       open:entry   /dev/sr0
dtrace: error on enabled probe ID 2 (ID 8: syscall::open:entry): \
invalid address (0x9af1b) in action #1 at DIF offset 52

In the example output, the application was functioning properly, and the address in arg0 was valid, but it referred to a page that had not yet been accessed by the corresponding process. To resolve this issue, you would need to wait for the kernel or an application to use the data before tracing it.

For example, you might wait until the system call returns to apply copyinstr, as shown here:

# dtrace -n syscall::open:entry'{ self->file = arg0; }' \
  -n syscall::open:return'{ trace(copyinstr(self->file)); self->file = 0; }'
dtrace: description 'syscall::open:entry' matched 1 probe
dtrace: description 'syscall::open:return' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0      9                      open:return   /dev/sr0                         
  1      9                      open:return   /usr/lib64/gconv/gconv-modules.cache
  0      9                      open:return   /dev/sr0                         
  0      9                      open:return   public/pickup                    
  1      9                      open:return   maildrop                         
  1      9                      open:return   /dev/sr0                         
  1      9                      open:return   /dev/sr0                         
  1      9                      open:return   /var/run/utmp
...

12.2 Eliminating dtrace Interference

If you trace every call to the write() system call, it causes a cascade of output because each call causes the dtrace command to call write() as it displays the output, and so on. This feedback loop is a good example of how the dtrace command can interfere with the desired data. To prevent this type of unwanted data from being traced, use a simple predicate like the one that is shown in the following example and save it in a file named stringof.d:

syscall::write:entry
/pid != $pid/
{
  printf("%s", stringof(copyin(arg1, arg2)));
}

In the previous example, the $pid macro variable expands to the process identifier of the process that enabled the probes. The pid variable contains the process identifier of the process whose thread was running on the CPU where the probe was fired. Therefore, the predicate /pid != $pid/ ensures that the script does not trace any events related to itself.

12.3 Using the syscall Provider

The syscall provider enables you to trace every system call entry and return. System calls can be a good starting point for understanding the behavior of a process, especially if the process seems to be spending a large amount of time executing or blocked in the kernel, as shown in the output of commands such as ps and top.

For example, consider a process with a process ID of 31337 that is consuming a large amount of system time. One possible explanation for this behavior is that the process is executing a large number of system calls. You can specify a simple D program on the command line to see which system calls are happening most often:

# dtrace -n syscall:::entry'/pid == 31337/{ @syscalls[probefunc] = count(); }'
dtrace: description ’syscall:::entry’ matched 215 probes
^C

  kill                                                              1
  clone                                                             4
  pipe                                                              4
  setpgid                                                           4
  rt_sigreturn                                                      6
  sendmsg                                                           7
  socket                                                            7
  access                                                            8
  getegid                                                           8
  geteuid                                                           8
  getgid                                                            8
  getuid                                                            8
  wait4                                                            12
  close                                                            15
  read                                                             23
  newstat                                                          25
  write                                                            42
  ioctl                                                            65
  rt_sigaction                                                    168
  rt_sigprocmask                                                  198
  write                                                          1092

The previous report shows the system calls that are being called most often, which in this case, is the write() system call.

You can use the syscall provider to further examine the source of all of the write() system calls, for example:

# dtrace -n syscall::write:entry'/pid == 31337/{ @writes = quantize(arg2); }'
dtrace: description ’syscall::write:entry’ matched 1 probe
^C

          value  ------------- Distribution ------------- count
              0 |                                         0
              1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1037
              2 |@                                        3
              4 |                                         0
              8 |                                         0
             16 |                                         0
             32 |@                                        3
             64 |                                         0
            128 |                                         0
            256 |                                         0
            512 |                                         0
           1024 |@                                        5

The previous output shows that the process is executing many write() system calls with a relatively small amount of data. The ratio could be the source of the performance problem for this particular process. This example illustrates a general methodology for investigating system call behavior.

12.4 ustack Action

Note

If you want to perform symbol lookup in a stripped executable, you must specify the --export-dynamic option when linking the program. This option causes the linker to add all symbols to the dynamic symbol table (the set of symbols that are visible from dynamic objects at run time). If you use gcc to link the objects, specify the option as -Wl,--export-dynamic to pass the correct option to the linker.

If you want to look up symbols in shared libraries or unstripped executables, the --export-dynamic option is not required.

Tracing a process thread's stack when a particular probe is activated is often useful for examining a problem in more detail. The ustack action traces the user thread's stack. For example, if a process that opens many files occasionally fails in the open() system call, you can use the ustack action to discover the code path that executes the failed open. Type the following source code and save it in a file named badopen.d:

syscall::open:entry
/pid == $1/
{
  self->path = copyinstr(arg0);
}

syscall::open:return
/self->path != NULL && errno != 0/
{
  printf("open for '%s' failed", self->path);
  ustack();
}

This script also illustrates the use of the $1 macro variable, which takes the value of the first operand that is specified on the dtrace command line:

# dtrace -s ./badopen.d 3430
dtrace: script './badopen.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1    489                    openat:return open for '/usr/lib/foo' failed
              libc.so.6`sleep+0xe0
              ld-linux-x86-64.so.2`do_lookup_x+0x847
              libc.so.6`0x3cb8003630
              libc.so.6`0x3cb8003c48
              libc.so.6`0x3cb800e2c8
              libc.so.6`0x3cb8003c48
              looper`0x400612
              libc.so.6`getenv+0x2a
              looper`0x4003c8
              looper`0x4009b0
              libc.so.6`0x3cb800e2c8
              looper`0x4009b0
              looper`doOpenLoop+0x33
              looper`0x400e9c
              looper`main+0x5f
              looper`0x400ea9
              libc.so.6`__libc_start_main+0xfd
              looper`main
              looper`0x4009b0
              looper`__libc_csu_init

The ustack action records program counter (PC) values for the stack and the dtrace command resolves the PC values to symbol names by looking though the process's symbol tables. If dtrace cannot resolve the PC value to a symbol, it prints out the value as a hexadecimal integer.

If a process exits or is killed before the ustack data is formatted for output, dtrace might be unable to convert the PC values in the stack trace to symbol names and the command displays them as hexadecimal integers.

12.5 uregs[] Array

The uregs[] array enables you to access individual user registers. See Table 12.1, “x86 uregs[] Constants”, which lists the index constants into the uregs[] array for each supported architecture.

The following table lists the index constants into the uregs[] array for each supported architecture.

Table 12.1 x86 uregs[] Constants

Constant

Register

Architecture

R_PC

program counter register

x86, AMD64

R_SP

stack pointer register

x86, AMD64

R_R0

first return code

x86, AMD64

R_R1

second return code

x86, AMD64

R_CS

%cs

x86, AMD64

R_GS

%gs

x86, AMD64

R_ES

%es

x86, AMD64

R_DS

%ds

x86, AMD64

R_EDI

%ed

x86, AMD64

R_ESI

%es

x86, AMD64

R_EBP

%ebp

x86, AMD64

R_EAX

%eax

x86, AMD64

R_ESP

%esp

x86, AMD64

R_EAX

%eax

x86, AMD64

R_EBX

%ebx

x86, AMD64

R_ECX

%ecx

x86, AMD64

R_EDX

%edx

x86, AMD64

R_TRAPNO

%trapno

x86, AMD64

R_ERR

%err

x86, AMD64

R_EIP

%eip

x86, AMD64

R_CS

%cs

x86, AMD64

R_EFL

%efl

x86, AMD64

R_UESP

%uesp

x86, AMD64

R_SS

%ss

x86, AMD64

R_RSP

%rsp

AMD64

R_RFL

%rfl

AMD64

R_RIP

%rip

AMD64

R_RAX

%rax

AMD64

R_RCX

%rcx

AMD64

R_RDX

%rdx

AMD64

R_RBP

%rbp

AMD64

R_RSI

%rsi

AMD64

R_RDI

%rdi

AMD64

R_R8

%r8

AMD64

R_R9

%r9

AMD64

R_R10

%r10

AMD64

R_R11

%r11

AMD64

R_R12

%r12

AMD64

R_R13

%r13

AMD64

R_R14

%r14

AMD64

R_R15

%r15

AMD64


12.6 Using the pid Provider

The pid provider enables you to trace any instruction in a process. Unlike most other providers, pid probes are created on demand, based on the probe descriptions that are found in your D programs. As a result, no pid probes are listed in the output of the dtrace -l command until you enable them.

12.6.1 User Function Boundary Tracing

The simplest mode of operation for the pid provider is to provide function boundary tracing in user space. The following example program traces all of the function entries and returns that are made from a single function. The $1 macro variable, the first operand on the command line, is the process ID for the process to trace. The $2 macro variable, the second operand on the command line, is the name of the function from which to trace all function calls. Type the following source code and save it in a file named userfunc.d:

#!/usr/sbin/dtrace -s
#pragma D option flowindent

pid$1::$2:entry
{
  self->trace = 1;
}

pid$1:::entry,
pid$1:::return
/self->trace/
{
}

pid$1::$2:return
/self->trace/
{
  self->trace = 0;
}

Type the previous example script and save it in a file named userfunc.d, then use the chmod command to make the file executable. This script produces output with more details on the principal buffer:

# ./userfunc.d 123 execute
dtrace: script ’./userfunc.d’ matched 11594 probes
  0  -> execute
  0    -> execute
  0      -> Dfix
  0      <- Dfix
  0      -> s_strsave
  0        -> malloc
  0        <- malloc
  0      <- s_strsave
  0      -> set
  0        -> malloc
  0        <- malloc
  0      <- set
  0      -> set1
  0        -> tglob
  0        <- tglob
  0      <- set1
  0      -> setq
  0        -> s_strcmp
  0        <- s_strcmp
...

The pid provider can only be used on processes that are already running. You can use the $target macro variable (see Chapter 9, Scripting) and the dtrace command with the -c and -p options to create and grab processes of interest and instrument them by using DTrace.

For example, you can use the following D script to determine the distribution of function calls that are made to libc by a particular subject process. Type the following source code and save it in a file named libc.d:

pid$target:libc.so::entry
{
  @[probefunc] = count();
}

To determine the distribution of such calls that are made by the date command, save the script in a file named libc.d and run the following command:

# dtrace -s libc.d -c date
dtrace: script ’libc.d’ matched 2476 probes
Fri Jul 30 14:08:54 PDT 2004
dtrace: pid 109196 has exited

  pthread_rwlock_unlock                         1
  _fflush_u                                     1
  rwlock_lock                                   1
  rw_write_held                                 1
  strftime                                      1
  _close                                        1
  _read                                         1
  __open                                        1
  _open                                         1
  strstr                                        1
  load_zoneinfo                                 1
...
  _ti_bind_guard                               47
  _ti_bind_clear                               94

12.6.2 Tracing Arbitrary Instructions

You can use the pid provider to trace any instruction in any user function. Upon demand, the pid provider creates a probe for every instruction in a function. The name of each probe is the offset of its corresponding instruction in the function and is expressed as a hexadecimal integer. For example, to enable a probe that is associated with the instruction at offset 0x1c in the function foo of the module bar.so in the process with PID 123, you would use the following command:

# dtrace -n pid123:bar.so:foo:1c

To enable all of the probes in the function foo, including the probe for each instruction, you would use the following command:

# dtrace -n pid123:bar.so:foo:

Using the previous command demonstrates an extremely powerful technique for debugging and analyzing user applications. Infrequent errors can be difficult to debug because they can be difficult to reproduce. Often, you can identify a problem after the failure has occurred, which is too late to reconstruct the code path.

The following example demonstrates how to combine the pid provider with speculative tracing to solve this problem by tracing every instruction in a function. See Chapter 7, Speculative Tracing for a description.

Type the following source code and save it in a file named errorpath.d:

pid$1::$2:entry
{
  self->spec = speculation();
  speculate(self->spec);
  printf("%x %x %x %x %x", arg0, arg1, arg2, arg3, arg4);
}

pid$1::$2:
/self->spec/
{
  speculate(self->spec);
}

pid$1::$2:return
/self->spec && arg1 == 0/
{
  discard(self->spec);
  self->spec = 0;
}

pid$1::$2:return
/self->spec && arg1 != 0/
{
  commit(self->spec);
  self->spec = 0;
}

Executing the errorpath.d script results in output similar to the following:

# ./errorpath.d 123 _chdir
dtrace: script ’./errorpath.d’ matched 19 probes
CPU    ID                     FUNCTION:NAME
  0 25253                      _chdir:entry 81e08 6d140 ffbfcb20 656c73 0
  0 25253                      _chdir:entry
  0 25269                      _chdir:0
  0 25270                      _chdir:4
  0 25271                      _chdir:8
  0 25272                      _chdir:c
  0 25273                      _chdir:10
  0 25274                      _chdir:14
  0 25275                      _chdir:18
  0 25276                      _chdir:1c
  0 25277                      _chdir:20
  0 25278                      _chdir:24
  0 25279                      _chdir:28
  0 25280                      _chdir:2c
  0 25268                      _chdir:return