Solaris Dynamic Tracing Guide

Chapter 33 User Process Tracing

DTrace is an extremely powerful tool for understanding the behavior of user processes. DTrace can be invaluable when debugging, analyzing performance problems, or simply understanding the behavior of a complex application. This chapter focuses on the DTrace facilities relevant for tracing user process activity and provides examples to illustrate their use.

copyin() and copyinstr() Subroutines

DTrace's interaction with processes is a little different than most traditional debuggers or observability tools. Many such tools appear to execute within the scope of the process, letting users dereference pointers to program variables directly. Rather than appearing to execute within or as part of the process itself, DTrace probes execute in the Solaris kernel. To access process data, a probe needs to use the copyin() or copyinstr() subroutines to copy user process data into the address space of the kernel.

For example, consider the following write(2) system call:

ssize_t write(int fd, const void *buf, size_t nbytes);

The following D program illustrates an incorrect attempt to print the contents of a string passed to the write(2) system call:

syscall::write:entry
{
	printf("%s", stringof(arg1)); /* incorrect use of arg1 */
}

If you try to run this script, DTrace will produce error messages similar to the following example:


dtrace: error on enabled probe ID 1 (ID 37: syscall::write:entry): \
    invalid address (0x10038a000) in action #1

The arg1 variable, containing the value of the buf parameter, is an address that refers to memory in the process executing the system call. To read the string at that address, use the copyinstr() subroutine and record its result with the printf() action:

syscall::write:entry
{
	printf("%s", copyinstr(arg1)); /* correct use of arg1 */

The output of this script shows all of the strings being passed to the write(2) system call. Occasionally, however, you might see irregular output similar to the following example:


  0     37                      write:entry mada���

The copyinstr() subroutine acts on an input argument that is the user address of a null-terminated ASCII string. However, buffers passed to the write(2) system call might refer to binary data rather than ASCII strings. To print only as much of the string as the caller intended, use the copyin() subroutine, which takes a size as its second argument:

syscall::write:entry
{
	printf("%s", stringof(copyin(arg1, arg2)));
}

Notice that the stringof operator is necessary so that DTrace properly converts the user data retrieved using copyin() to a string. The use of stringof is not necessary when using copyinstr() because this function always returns type string.

Avoiding Errors

The copyin() and copyinstr() subroutines cannot read from user addresses which have not yet been touched so even a valid address may cause an error if the page containing that address has not yet been faulted in by being accessed. Consider the following example:


# dtrace -n syscall::open:entry'{ trace(copyinstr(arg0)); }'
dtrace: description 'syscall::open:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
dtrace: error on enabled probe ID 2 (ID 50: syscall::open:entry): invalid address
(0x9af1b) in action #1 at DIF offset 52

In the above example output, the application was functioning properly, and the address in arg0 was valid, but it referred to a page that had not yet been accessed by the corresponding process. To resolve this issue, wait for kernel or application to use the data before tracing it. For example, you might wait until the system call returns to apply copyinstr(), as shown in the following example:


# dtrace -n syscall::open:entry'{ self->file = arg0; }' \
-n syscall::open:return'{ trace(copyinstr(self->file)); self->file = 0; }'
dtrace: description 'syscall::open:entry' matched 1 probe
CPU     ID                    FUNCTION:NAME
  2     51                      open:return   /dev/null                        

Eliminating dtrace(1M) Interference

If you trace every call to the write(2) system call, you will cause a cascade of output. Each call to write() causes the dtrace(1M) command to call write() as it displays the output, and so on. This feedback loop is a good example of how the dtrace command can interfere with the desired data. You can use a simple predicate to prevent these unwanted data from being traced:

syscall::write:entry
/pid != $pid/
{
	printf("%s", stringof(copyin(arg1, arg2)));
}

The $pid macro variable expands to the process identifier of the process that enabled the probes. The pid variable contains the process identifier of the process whose thread was running on the CPU where the probe was fired. Therefore the predicate /pid != $pid/ ensures that the script does not trace any events related to the running of this script itself.

syscall Provider

The syscall provider enables you to trace every system call entry and return. System calls can be a good starting point for understanding a process's behavior, especially if the process seems to be spending a large amount of time executing or blocked in the kernel. You can use the prstat(1M) command to see where processes are spending time:


$ prstat -m -p 31337
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP  
 13499 user1     53  44 0.0 0.0 0.0 0.0 2.5 0.0  4K  24  9K   0 mystery/6

This example shows that the process is consuming a large amount of system time. One possible explanation for this behavior is that the process is executing a large number of system calls. You can use a simple D program specified on the command-line to see which system calls are happening most often:


# dtrace -n syscall:::entry'/pid == 31337/{ @syscalls[probefunc] = count(); }'
dtrace: description 'syscall:::entry' matched 215 probes
^C

  open                                                              1
  lwp_park                                                          2
  times                                                             4
  fcntl                                                             5
  close                                                             6
  sigaction                                                         6
  read                                                             10
  ioctl                                                            14
  sigprocmask                                                     106
  write                                                          1092

This report shows which system calls are being called most often, in this case, the write(2) system call. You can use the syscall provider to further examine the source of all the write() system calls:


# dtrace -n syscall::write:entry'/pid == 31337/{ @writes[arg2] = quantize(); }'
dtrace: description 'syscall::write:entry' matched 1 probe
^C

           value  ------------- Distribution ------------- count    
               0 |                                         0        
               1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@   1037
               2 |@                                        3        
               4 |                                         0        
               8 |                                         0        
              16 |                                         0        
              32 |@                                        3        
              64 |                                         0        
             128 |                                         0        
             256 |                                         0        
             512 |                                         0        
            1024 |@                                        5        
            2048 |                                         0        

The output shows that the process is executing many write() system calls with a relatively small amount of data. This ratio could be the source of the performance problem for this particular process. This example illustrates a general methodology for investigating system call behavior.

ustack() Action

Tracing a process thread's stack at the time a particular probe is activated is often useful for examining a problem in more detail. The ustack() action traces the user thread's stack. If, for example, a process that opens many files occasionally fails in the open(2) system call, you can use the ustack() action to discover the code path that executes the failed open():

syscall::open:entry
/pid == $1/
{
	self->path = copyinstr(arg0);
}

syscall::open:return
/self->path != NULL && arg1 == -1/
{
	printf("open for '%s' failed", self->path);
	ustack();
}

This script also illustrates the use of the $1 macro variable which takes the value of the first operand specified on the dtrace(1M) command-line:


# dtrace -s ./badopen.d 31337
dtrace: script './badopen.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  0     40                      open:return open for '/usr/lib/foo' failed
              libc.so.1`__open+0x4
              libc.so.1`open+0x6c
              420b0
              tcsh`dosource+0xe0
              tcsh`execute+0x978
              tcsh`execute+0xba0
              tcsh`process+0x50c
              tcsh`main+0x1d54
              tcsh`_start+0xdc

The ustack() action records program counter (PC) values for the stack and dtrace(1M) resolves those PC values to symbol names by looking though the process's symbol tables. If dtrace can't resolve the PC value to a symbol, it will print out the value as a hexadecimal integer.

If a process exits or is killed before the ustack() data is formatted for output, dtrace might be unable to convert the PC values in the stack trace to symbol names, and will be forced to display them as hexadecimal integers. To work around this limitation, specify a process of interest with the -c or -p option to dtrace. See Chapter 14, dtrace(1M) Utility for details on these and other options. If the process ID or command is not known in advance, the following example D program that can be used to work around the limitation:

/*
 * This example uses the open(2) system call probe, but this technique
 * is applicable to any script using the ustack() action where the stack
 * being traced is in a process that may exit soon.
 */
 syscall::open:entry
{
	ustack();
	stop_pids[pid] = 1;
}

syscall::rexit:entry
/stop_pids[pid] != 0/
{
	printf("stopping pid %d", pid);
	stop();
	stop_pids[pid] = 0;
}

The above script stops a process just before it exits if the ustack() action has been applied to a thread in that process. This technique ensures that the dtrace command will be able to resolve the PC values to symbolic names. Notice that the value of stop_pids[pid] is set to 0 after it has been used to clear the dynamic variable. Remember to set stopped processes running again using the prun(1) command or your system will accumulate many stopped processes.

uregs[] Array

The uregs[] array enables you to access individual user registers. The following tables list indices into the uregs[] array corresponding to each supported Solaris system architecture.

Table 33–1 SPARC uregs[] Constants

Constant 

Register 

R_G0..R_G7

%g0..%g7 global registers

R_O0..R_O7

%o0..%o7 out registers

R_L0..R_L7

%l0..%l7 local registers

R_I0..R_I7

%i0..%i7 in registers

R_CCR

%ccr condition code register

R_PC

%pc program counter

R_NPC

%npc next program counter

R_Y

%y multiply/divide register

R_ASI

%asi address space identifier register

R_FPRS

%fprs floating-point registers state

Table 33–2 x86 uregs[] Constants

Constant 

Register 

R_CS

%cs

R_GS

%gs

R_ES

%es

R_DS

%ds

R_EDI

%edi

R_ESI

%esi

R_EBP

%ebp

R_EAX

%eax

R_ESP

%esp

R_EAX

%eax

R_EBX

%ebx

R_ECX

%ecx

R_EDX

%edx

R_TRAPNO

%trapno

R_ERR

%err

R_EIP

%eip

R_CS

%cs

R_ERR

%err

R_EFL

%efl

R_UESP

%uesp

R_SS

%ss

On AMD64 platforms, the uregs array has the same content as it does on x86 platforms, plus the additional elements listed in the following table:

Table 33–3 amd64 uregs[] Constants

Constant 

Register 

R_RSP

%rsp

R_RFL

%rfl

R_RIP

%rip

R_RAX

%rax

R_RCX

%rcx

R_RDX

%rdx

R_RBX

%rbx

R_RBP

%rbp

R_RSI

%rsi

R_RDI

%rdi

R_R8

%r8

R_R9

%r9

R_R10

%r10

R_R11

%r11

R_R12

%r12

R_R13

%r13

R_R14

%r14

R_R15

%r15

The aliases listed in the following table can be used on all platforms:

Table 33–4 Common uregs[] Constants

Constant 

Register 

R_PC

program counter register 

R_SP

stack pointer register 

R_R0

first return code 

R_R1

second return code 

pid Provider

The pid provider enables you to trace any instruction in a process. Unlike most other providers, pid probes are created on demand based on the probe descriptions found in your D programs. As a result, no pid probes are listed in the output of dtrace -l until you have enabled them yourself.

User Function Boundary Tracing

The simplest mode of operation for the pid provider is as the user space analogue to the fbt provider. The following example program traces all function entries and returns that are made from a single function. The $1 macro variable (the first operand on the command line) is the process ID for the process to trace. The $2 macro variable (the second operand on the command line) is the name of the function from which to trace all function calls.


Example 33–1 userfunc.d: Trace User Function Entry and Return

pid$1::$2:entry
{
	self->trace = 1;
}

pid$1::$2:return
/self->trace/
{
	self->trace = 0;
}

pid$1:::entry,
pid$1:::return
/self->trace/
{
}

Type in the above example script and save it in a file named userfunc.d, and then chmod it to be executable. This script produces output similar to the following example:


# ./userfunc.d 15032 execute
dtrace: script './userfunc.d' matched 11594 probes
  0  -> execute                               
  0    -> execute                             
  0      -> Dfix                              
  0      <- Dfix                              
  0      -> s_strsave                         
  0        -> malloc                          
  0        <- malloc                          
  0      <- s_strsave                         
  0      -> set                               
  0        -> malloc                          
  0        <- malloc                          
  0      <- set                               
  0      -> set1                              
  0        -> tglob                           
  0        <- tglob                           
  0      <- set1                              
  0      -> setq                              
  0        -> s_strcmp                        
  0        <- s_strcmp                        
...

The pid provider can only be used on processes that are already running. You can use the $target macro variable (see Chapter 15, Scripting) and the dtrace -c and -p options to create and grab processes of interest and instrument them using DTrace. For example, the following D script can be used to determine the distribution of function calls made to libc by a particular subject process:

pid$target:libc.so::entry
{
	@[probefunc] = count();
}

To determine the distribution of such calls made by the date(1) command, save the script in a file named libc.d and execute the following command:


# dtrace -s libc.d -c date
dtrace: script 'libc.d' matched 2476 probes
Fri Jul 30 14:08:54 PDT 2004
dtrace: pid 109196 has exited

  pthread_rwlock_unlock                                             1
  _fflush_u                                                         1
  rwlock_lock                                                       1
  rw_write_held                                                     1
  strftime                                                          1
  _close                                                            1
  _read                                                             1
  __open                                                            1
  _open                                                             1
  strstr                                                            1
  load_zoneinfo                                                     1

...
  _ti_bind_guard                                                   47
  _ti_bind_clear                                                   94

Tracing Arbitrary Instructions

You can use the pid provider to trace any instruction in any user function. Upon demand, the pid provider will create a probe for every instruction in a function. The name of each probe is the offset of its corresponding instruction in the function expressed as a hexadecimal integer. For example, to enable a probe associated with the instruction at offset 0x1c in function foo of module bar.so in the process with PID 123, you can use the following command:


# dtrace -n pid123:bar.so:foo:1c

To enable all of the probes in the function foo, including the probe for each instruction, you can use the command:


# dtrace -n pid123:bar.so:foo:

This command demonstrates an extremely powerful technique for debugging and analyzing user applications. Infrequent errors can be difficult to debug because they can be difficult to reproduce. Often, you can identify a problem after the failure has occurred, too late to reconstruct the code path. The following example demonstrates how to combine the pid provider with speculative tracing (see Chapter 13, Speculative Tracing) to solve this problem by tracing every instruction in a function.


Example 33–2 errorpath.d: Trace User Function Call Error Path

pid$1::$2:entry
{
	self->spec = speculation();
	speculate(self->spec);
	printf("%x %x %x %x %x", arg0, arg1, arg2, arg3, arg4);
}

pid$1::$2:
/self->spec/
{
	speculate(self->spec);
}

pid$1::$2:return
/self->spec && arg1 == 0/
{
	discard(self->spec);
	self->spec = 0;
}

pid$1::$2:return
/self->spec && arg1 != 0/
{
	commit(self->spec);
	self->spec = 0;
}

Executing errorpath.d results in output similar to the following example:


# ./errorpath.d 100461 _chdir
dtrace: script './errorpath.d' matched 19 probes
CPU     ID                    FUNCTION:NAME
  0  25253                     _chdir:entry 81e08 6d140 ffbfcb20 656c73 0
  0  25253                     _chdir:entry
  0  25269                         _chdir:0
  0  25270                         _chdir:4
  0  25271                         _chdir:8
  0  25272                         _chdir:c
  0  25273                        _chdir:10
  0  25274                        _chdir:14
  0  25275                        _chdir:18
  0  25276                        _chdir:1c
  0  25277                        _chdir:20
  0  25278                        _chdir:24
  0  25279                        _chdir:28
  0  25280                        _chdir:2c
  0  25268                    _chdir:return