Chapter 11 DTrace Providers

This chapter describes some of the existing DTrace providers. Note that the list of providers discussed in this chapter is not exhaustive. To display the providers that are available on your system, use the dtrace -l command. Detailed information about translators for important data structures can be found in /usr/lib64/dtrace/version/*.d files.

11.1 dtrace Provider

The dtrace provider includes several probes that are related to DTrace itself. You can use these probes to initialize state before tracing begins, to process state after tracing has completed, and to handle unexpected execution errors in other probes.

11.1.1 BEGIN Probe

The BEGIN probe fires before any other probe. No other probe fires until all BEGIN clauses have completed. This probe can be used to initialize any state that is needed in other probes. The following example shows how to use the BEGIN probe to initialize an associative array to map between mmap() protection bits and a textual representation:

BEGIN
{
  prot[0] = "---";
  prot[1] = "r--";
  prot[2] = "-w-";
  prot[3] = "rw-";
  prot[4] = "--x";
  prot[5] = "r-x";
  prot[6] = "-wx";
  prot[7] = "rwx";
}

syscall::mmap:entry
{
  printf("mmap with prot = %s", prot[arg2 & 0x7]);
}

The BEGIN probe fires in an unspecified context, which means the output of stack or ustack, and the value of context-specific variables such as execname, are all arbitrary. These values should not be relied upon or interpreted to infer any meaningful information. No arguments are defined for the BEGIN probe.

11.1.2 END Probe

The END probe fires after all other probes. This probe will not fire until all other probe clauses have completed. This probe can be used to process state that has been gathered or to format the output. The printa action is therefore often used in the END probe. The BEGIN and END probes can be used together to measure the total time that is spent tracing, for example:

BEGIN
{
  start = timestamp;
}

/*
 * ... other tracing actions...
 */

END
{
  printf("total time: %d secs", (timestamp - start) / 1000000000);
}
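A common companion to the timing example above is to gather data in an aggregation while tracing and then format it with printa when the END probe fires. The following is a minimal sketch (the aggregation name @calls is arbitrary) that counts system calls by name and prints the totals when tracing ends:

syscall:::entry
{
  @calls[probefunc] = count();
}

END
{
  printa("%-40s %@d\n", @calls);
}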

See Section 3.5, “Data Normalization” and Section 6.2, “printa Action” for other common uses of the END probe.

As with the BEGIN probe, no arguments are defined for the END probe. The context in which the END probe fires is arbitrary and should not be depended upon.

When tracing with the bufpolicy option set to fill, adequate space is reserved to accommodate any records that are traced in the END probe. See Section 5.2.3, “fill Policy and END Probes” for details.

Note

The exit action causes tracing to stop and the END probe to fire. However, there is some delay between the invocation of the exit action and when the END probe fires. During this delay, no probes fire. After a probe invokes the exit action, the END probe does not fire until the DTrace consumer determines that exit has been called and stops tracing. The rate at which the exit status is checked can be set by using the statusrate option. For more information, see Chapter 10, Options and Tunables.

11.1.3 ERROR Probe

The ERROR probe fires when a runtime error occurs during the execution of a clause for a DTrace probe. For example, if a clause attempts to dereference a NULL pointer, the ERROR probe fires, as the following program shows. Save it in a file named error.d:

BEGIN
{
  *(char *)NULL;
}

ERROR
{
  printf("Hit an error!");
}

When you run this program, output similar to the following is displayed:

# dtrace -s error.d 
dtrace: script 'error.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1      3                           :ERROR Hit an error!
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN):
invalid address (0x0) in action #1 at DIF offset 16
^C

The previous output indicates that the ERROR probe fired and that dtrace reported the error. dtrace has its own enabling of the ERROR probe so that it can report errors. Using the ERROR probe, you can create your own custom error handling.

The arguments to the ERROR probe are described in the following table.

Argument   Description
--------   -----------
arg1       The enabled probe identifier (EPID) of the probe that caused the error.
arg2       The index of the action that caused the fault.
arg3       The DIF offset into the action, or -1 if not applicable.
arg4       The fault type.
arg5       A value that is particular to the fault type.

The following table describes the various fault types that can be specified in arg4 and the values that arg5 can take for each fault type.

arg4 Value            Description                                                arg5 Meaning
----------            -----------                                                ------------
DTRACEFLT_UNKNOWN     Unknown fault type                                         None
DTRACEFLT_BADADDR     Access to an unmapped or invalid address                   Address accessed
DTRACEFLT_BADALIGN    Unaligned memory access                                    Address accessed
DTRACEFLT_ILLOP       Illegal or invalid operation                               None
DTRACEFLT_DIVZERO     Integer divide by zero                                     None
DTRACEFLT_NOSCRATCH   Insufficient scratch memory to satisfy a                   None
                      scratch allocation
DTRACEFLT_KPRIV       Attempt to access a kernel address or property             Address accessed, or
                      without sufficient privileges                              0 if not applicable
DTRACEFLT_UPRIV       Attempt to access a user address or property               Address accessed, or
                      without sufficient privileges                              0 if not applicable
DTRACEFLT_TUPOFLOW    DTrace internal parameter stack overflow                   None
DTRACEFLT_BADSTACK    Invalid user process stack                                 Address of invalid
                                                                                 stack pointer

If the actions that are taken in the ERROR probe cause an error, that error is silently dropped. The ERROR probe is not recursively invoked.
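These arguments allow for more detailed error handling than the earlier example. The following sketch assumes that the DTRACEFLT_BADADDR constant is visible to D programs on your system; it reports the action index and the faulting address whenever a bad-address fault occurs:

ERROR
/arg4 == DTRACEFLT_BADADDR/
{
  printf("bad address 0x%x in action #%d\n", arg5, arg2);
}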

11.1.4 dtrace Stability

The dtrace provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Stable           Stable           Common
Module      Private          Private          Unknown
Function    Private          Private          Unknown
Name        Stable           Stable           Common
Arguments   Stable           Stable           Common

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.2 profile Provider

The profile provider includes probes that are associated with an interrupt that fires at some regular, specified time interval. Such probes are not associated with any particular point of execution, but rather with the asynchronous interrupt event. You can use these probes to sample some aspect of the system state and then use the samples to infer system behavior. If the sampling rate is high or the sampling time is long, an accurate inference is possible. Using DTrace actions, you can use the profile provider to sample practically any aspect of the system. For example, you could sample the state of the current thread, the state of the CPU, or the current machine instruction.

11.2.1 profile-n Probes

The profile-n probes fire at a fixed interval, at a high interrupt level on all active CPUs. The units of n default to a frequency that is expressed as a rate of firing per second, but the value can also have an optional suffix, as shown in Table 11.1, “Valid Time Suffixes”, which specifies either a time interval or a frequency. The following table describes the valid time suffixes for a profile-n probe.

Table 11.1 Valid Time Suffixes

Suffix       Time Units
------       ----------
nsec or ns   nanoseconds
usec or us   microseconds
msec or ms   milliseconds
sec or s     seconds
min or m     minutes
hour or h    hours
day or d     days
hz           hertz (frequency expressed as a rate per second)


11.2.2 tick-n Probes

The tick-n probes fire at fixed intervals, at a high interrupt level on only one CPU per interval. Unlike profile-n probes, which fire on every CPU, tick-n probes fire on only one CPU per interval and the CPU on which they fire can change over time. The units of n default to a frequency expressed as a rate of firing per second, but the value can also have an optional time suffix as shown in Table 11.1, “Valid Time Suffixes”, which specifies either a time interval or a frequency.

The tick-n probes have several uses, such as providing some periodic output or taking a periodic action.
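For example, the following sketch prints a message every second and exits after ten ticks (the counter variable i is illustrative):

tick-1sec
{
  printf("%d seconds elapsed\n", ++i);
}

tick-1sec
/i == 10/
{
  exit(0);
}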

Note

By default, the highest supported tick frequency is 5000 Hz (tick-5000).

11.2.3 profile Probe Arguments

The following table describes the arguments for the profile probes.

Table 11.2 profile Probe Arguments

Probe       arg0   arg1   arg2
-----       ----   ----   ----
profile-n   pc     upc    nsecs
tick-n      pc     upc

The arguments are as follows:

  • pc: kernel program counter

  • upc: user-space program counter

  • nsecs: elapsed number of nanoseconds
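These arguments make the profile probes well suited to sampling. In the profile probes, arg0 is nonzero only if the CPU was executing in the kernel when the interrupt fired, and arg1 is nonzero only if it was executing in user space. The following sketch uses a 997 Hz rate (a prime rate, chosen to avoid sampling in lock step with other periodic activity) and assumes that the func and ufunc conversion functions are available on your system:

profile-997
/arg0/
{
  @kernel[func(arg0)] = count();
}

profile-997
/arg1/
{
  @user[ufunc(arg1)] = count();
}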

11.2.4 profile Probe Creation

Unlike other providers, the profile provider creates probes dynamically on an as-needed basis. Thus, the desired probe might not appear in a listing of all probes, for example, when using the dtrace -l -P profile command, but the probe is created when it is explicitly enabled.

A time interval that is too short would cause the machine to continuously field time-based interrupts, effectively denying service on the machine. For this reason, the profile provider silently refuses to create a probe that would result in an interval of less than two hundred microseconds.

11.2.5 prof Stability

The profile provider uses DTrace's stability mechanism to describe its stabilities. These stability values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Evolving         Evolving         Common
Module      Unstable         Unstable         Unknown
Function    Private          Private          Unknown
Name        Evolving         Evolving         Common
Arguments   Evolving         Evolving         Common

For more information, see Chapter 16, DTrace Stability Features.

11.3 fbt Provider

The fbt (Function Boundary Tracing) provider includes probes that are associated with the entry to and return from most functions in the Oracle Linux kernel. Therefore, there could well be tens of thousands of fbt probes.

To confirm that the fbt provider is available for your processor's architecture, load the module that provides the fbt instrumentation and list its probes. Note that listing the probes can take several seconds because there are so many of them. For example, run the following commands as root:

# dtrace -l -P fbt | wc -l
dtrace: failed to match fbt:::: No probe matches description
1
# modprobe fbt
# dtrace -l -P fbt | wc -l
88958

In the previous example, the first dtrace command automatically loads the modules that are listed in /etc/dtrace-modules, but its output confirms that fbt is not among them. After fbt is loaded manually, many fbt probes appear. For more information, see Section 11.3.4, “Module Loading and fbt”.

Like other DTrace providers, Function Boundary Tracing (FBT) has no probe effect when it is not explicitly enabled. When enabled, FBT induces a probe effect only in the probed functions. Although the FBT implementation is highly specific to the instruction set architecture, FBT has been implemented on both the x86 and 64-bit Arm platforms. For each instruction set, a small number of functions cannot be instrumented by FBT, typically leaf functions that do not call other functions and that the compiler has highly optimized. Probes for these functions are not present in DTrace.

An effective use of FBT probes requires knowledge of the operating system implementation. It is therefore recommended that you use FBT only when developing kernel software or when other providers are not sufficient. You can use other DTrace providers such as syscall, sched, proc, and io to answer most system analysis questions without requiring operating system implementation knowledge.

11.3.1 fbt Probes

FBT provides a probe at the entry and return of most functions in the kernel, named entry and return, respectively. All FBT probes have a function name and module name.

11.3.2 fbt Probe Arguments

The arguments to entry probes are the same as the arguments to the corresponding operating system kernel function. These arguments can be accessed as int64_t values by using the arg0, arg1, arg2, ... variables.

If the function has a return value, the return value is stored in arg1 of the return probe. If a function does not have a return value, arg1 is not defined.

While a given function only has a single point of entry, it might have many different points where it returns to its caller. FBT collects a function's multiple return sites into a single return probe. If you want to know the exact return path, you can examine the return probe arg0 value, which indicates the offset in bytes of the returning instruction in the function text.
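For example, the following sketch counts how often each return site in the kernel function vfs_read is taken (the function name is illustrative; substitute any kernel function of interest):

fbt::vfs_read:return
{
  @sites[arg0] = count();
}

END
{
  printa("return at offset 0x%x taken %@d times\n", @sites);
}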

11.3.3 fbt Examples

You can easily use the fbt provider to explore the kernel's implementation. The following example script records the first gettimeofday call from any clock process and then follows the subsequent code path through the kernel. Type the following D source code and save it in a file named xgettimeofday.d:

/*
 * To make the output more readable, indent every function entry
 * and unindent every function return.  This is done by setting the
 * "flowindent" option.
 */
#pragma D option flowindent

syscall::gettimeofday:entry
/execname == "clock" && guard++ == 0/
{
        self->traceme = 1;
        printf("start");
}

fbt:::
/self->traceme/
{}

syscall::gettimeofday:return
/self->traceme/
{
        self->traceme = 0;
        exit(0);
}

Running this script results in output that is similar to the following:

# dtrace -s ./xgettimeofday.d
dtrace: script './xgettimeofday.d' matched 92115 probes
CPU FUNCTION
  0  => gettimeofday                          start
  0    -> SyS_gettimeofday
  0      -> getnstimeofday64
  0        -> __getnstimeofday64
  0        <- __getnstimeofday64
  0      <- getnstimeofday64
  0      -> _copy_to_user
  0      <- _copy_to_user
  0    <- SyS_gettimeofday
  0  <= gettimeofday

The previous output shows the internal kernel functions that are called when the gettimeofday system call is made.

11.3.4 Module Loading and fbt

While the Oracle Linux kernel can dynamically load and unload kernel modules, for fbt probes, the fbt kernel module must be loaded to support the instrumentation. For more information about loading kernel modules, see the note in Section 1.1, “Getting Started With DTrace”. If fbt is not listed in /etc/dtrace-modules, or if the dtrace -l command lists no fbt probes, use the following command:

# modprobe fbt

Conversely, you can unload the fbt instrumentation with the following command:

# modprobe -r fbt

When the fbt module is loaded, FBT automatically provides probes to instrument all other loaded modules, including any new modules that are dynamically loaded. If a loaded module has no enabled FBT probes, the module might be unloaded and the corresponding probes are destroyed as the module is unloaded. If a loaded module has enabled FBT probes, the module is considered busy and cannot be unloaded.

11.3.5 fbt Stability

The fbt provider uses DTrace's stability mechanism to describe its stabilities. These stability values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Evolving         Evolving         Common
Module      Private          Private          Unknown
Function    Private          Private          ISA
Name        Evolving         Evolving         Common
Arguments   Private          Private          ISA

For more information, see Chapter 16, DTrace Stability Features.

11.4 syscall Provider

The syscall provider makes available a probe at the entry to and return from every system call in the system. Because system calls are the primary interface between user-level applications and the operating system kernel, the syscall provider can offer tremendous insight into application behavior with respect to the system.

11.4.1 syscall Probes

syscall provides a pair of probes for each system call: an entry probe that fires before the system call is entered, and a return probe that fires after the system call has completed, but before control has been transferred back to user-level. For all syscall probes, the function name is set as the name of the instrumented system call.

Often, the system call names that are provided by syscall correspond to names in the Section 2 manual pages. However, some syscall provider probes do not directly correspond to any documented system call. Some common reasons for this discrepancy are described in the following sections.

11.4.1.1 System Call Anachronisms

In some cases, the name of the system call, as provided by the syscall provider, might be a reflection of an ancient implementation detail.

11.4.1.2 Subcoded System Calls

Some system calls might be implemented as sub operations of another system call. For example, socketcall(), is the common kernel entry point for the socket system calls.

11.4.1.3 New System Calls

Oracle Linux implements at-suffixed system interfaces as individual system calls, for example:

  • faccessat()

  • fchmodat()

  • fchownat()

  • fstatat64()

  • futimensat()

  • linkat()

  • mkdirat()

  • mknodat()

  • name_to_handle_at()

  • newfstatat()

  • open_by_handle_at()

  • openat()

  • readlinkat()

  • renameat()

  • symlinkat()

  • unlinkat()

  • utimensat()

These system calls implement a superset of the functionality of their older counterparts that lack the at suffix. Each takes an additional first argument that is either an open directory file descriptor, in which case a relative pathname is interpreted relative to the specified directory, or the reserved value AT_FDCWD, in which case the operation takes place relative to the current working directory.
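For example, to observe openat() calls that resolve relative to the current working directory, you can test the first argument against AT_FDCWD, which has the value -100 on Linux (shown literally here because the symbolic constant might not be visible to D programs):

syscall::openat:entry
/(int)arg0 == -100/
{
  printf("%s opened %s relative to the CWD\n", execname, copyinstr(arg1));
}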

11.4.1.4 Replaced System Calls

In Oracle Linux, the following older system calls have been replaced by newer system calls and are no longer invoked by the glibc interfaces. The legacy interfaces remain, but they are reimplemented as calls to the newer system calls rather than as system calls in their own right. The following table lists each legacy call and its new equivalent.

Legacy System Call    New System Call
------------------    ---------------
access(p, m)          faccessat(AT_FDCWD, p, m, 0)
chmod(p, m)           fchmodat(AT_FDCWD, p, m, 0)
chown(p, u, g)        fchownat(AT_FDCWD, p, u, g, 0)
creat(p, m)           openat(AT_FDCWD, p, O_WRONLY|O_CREAT|O_TRUNC, m)
fchmod(fd, m)         fchmodat(fd, NULL, m, 0)
fchown(fd, u, g)      fchownat(fd, NULL, u, g, 0)
fstat(fd, s)          fstatat(fd, NULL, s, 0)
lchown(p, u, g)       fchownat(AT_FDCWD, p, u, g, AT_SYMLINK_NOFOLLOW)
link(p1, p2)          linkat(AT_FDCWD, p1, AT_FDCWD, p2, 0)
lstat(p, s)           fstatat(AT_FDCWD, p, s, AT_SYMLINK_NOFOLLOW)
mkdir(p, m)           mkdirat(AT_FDCWD, p, m)
mknod(p, m, d)        mknodat(AT_FDCWD, p, m, d)
open(p, o, m)         openat(AT_FDCWD, p, o, m)
readlink(p, b, s)     readlinkat(AT_FDCWD, p, b, s)
rename(p1, p2)        renameat(AT_FDCWD, p1, AT_FDCWD, p2)
rmdir(p)              unlinkat(AT_FDCWD, p, AT_REMOVEDIR)
stat(p, s)            fstatat(AT_FDCWD, p, s, 0)
symlink(p1, p2)       symlinkat(p1, AT_FDCWD, p2)
unlink(p)             unlinkat(AT_FDCWD, p, 0)

11.4.1.5 Large File System Calls

A 32-bit program that supports large files that exceed two gigabytes in size must be able to process 64-bit file offsets. Because large files require the use of large offsets, large files are manipulated through a parallel set of system interfaces. The following table lists some of the syscall probes for the large file system call interfaces.

Table 11.3 syscall Large File Probes

Large File syscall Probe   System Call
------------------------   -----------
getdents64                 getdents()
pread64 *                  pread()
pwrite64 *                 pwrite()


11.4.1.6 Private System Calls

Some system calls are private implementation details of Oracle Linux subsystems that span the user-kernel boundary.

11.4.2 syscall Probe Arguments

For entry probes, the arguments, arg0 ... argn, are arguments to the system call. For return probes, both arg0 and arg1 contain the return value. A non-zero value in the D variable errno indicates a system call failure.
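For example, the following sketch uses the errno variable to count failed system calls, keyed by system call name and error number:

syscall:::return
/errno != 0/
{
  @fails[probefunc, errno] = count();
}

END
{
  printa("%-20s errno %-4d %@d\n", @fails);
}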

11.4.3 syscall Stability

The syscall provider uses DTrace's stability mechanism to describe its stabilities. These stability values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Evolving         Evolving         Common
Module      Private          Private          Unknown
Function    Private          Private          Instruction set architecture (ISA)
Name        Evolving         Evolving         Common
Arguments   Private          Private          ISA

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.5 sdt provider

The Statically Defined Tracing (SDT) provider (sdt) creates probes at sites that a software programmer has formally designated. The SDT mechanism enables programmers to consciously choose locations of interest to users of DTrace and to convey some semantic knowledge about each location through the probe name.

Importantly, SDT can act as a metaprovider by registering probes so that they appear to come from other providers, such as io, proc, and sched, which do not have dedicated modules of their own. Thus, the SDT provider is chiefly of interest to developers of new providers. Most users access SDT only indirectly, by using other providers.

Note

Because the sdt probes that are defined for the Oracle Linux kernel are likely to change over time, they are not listed here. Both the name stability and the data stability of the probes are Private, which reflects the kernel's implementation and should not be interpreted as a commitment to preserve these interfaces. For more information, see Chapter 16, DTrace Stability Features.

11.5.1 Creating sdt Probes

If you are a device driver developer, you might be interested in creating your own sdt probes for your Oracle Linux driver. The disabled probe effect of SDT is essentially the cost of several no-operation machine instructions. You are therefore encouraged to add sdt probes to your device drivers as needed. Unless these probes negatively affect performance, you can leave them in your shipping code. See Chapter 14, Statically Defined Tracing of Kernel Modules.

DTrace also provides a mechanism for application developers to define user-space static probes. See Chapter 13, Statically Defined Tracing of User Applications.

11.5.1.1 Declaring Probes

The sdt probes are declared by using the DTRACE_PROBE macro from <linux/sdt.h>.

The module name and function name of an SDT-based probe correspond to the kernel module and function of the probe, respectively. DTrace includes the kernel module name and function name as part of the tuple identifying a probe, so you do not need to include this information in the probe name to prevent name space collisions. Use the dtrace -l -m module command to list the probes that your driver module has installed and the full names that are seen by DTrace users.

The name of the probe depends on the name that is provided in the DTRACE_PROBE macro. If the name does not contain two consecutive underscores (__), the name of the probe is as written in the macro. If the name contains two consecutive underscores, the probe name converts the consecutive underscores to a single dash (-). For example, if a DTRACE_PROBE macro specifies transaction__start, the SDT probe is named transaction-start. This substitution enables C code to provide macro names that are not valid C identifiers without specifying a string.

As noted earlier, SDT can also act as a metaprovider, registering probes that appear to come from other providers such as io, proc, and sched. For example, kernel/exit.c contains calls to the DTRACE_PROC macro, which is defined as follows in <linux/sdt.h>:

# define DTRACE_PROC(name) \
         DTRACE_PROBE(__proc_##name);

Probes that use such macros appear to come from a provider other than sdt. The leading double underscore, provider name, and trailing underscore in the name argument are used to match the provider and are not included in the probe name. Note that the functionality for creating probes for providers other than those that are hard-coded into DTrace is not currently available.

11.5.1.2 sdt Probe Arguments

The arguments for each sdt probe are the arguments that are specified in the kernel source code in the corresponding DTRACE_PROBE macro reference. When declaring your sdt probes, you can minimize their disabled probe effect by not dereferencing pointers and by not loading from global variables in the probe arguments. Both pointer dereferencing and global variable loading may be done safely in D actions that enable probes, so DTrace users can request these actions only when they are needed.

11.5.2 sdt Stability

The sdt provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Evolving         Evolving         ISA
Module      Private          Private          Unknown
Function    Private          Private          Unknown
Name        Private          Private          ISA
Arguments   Private          Private          ISA

For more information about the stability mechanism, refer to Chapter 16, DTrace Stability Features.

11.6 pid Provider

The pid provider enables tracing of any user process, as specified by its pid.

The pid provider enables tracing of function entry and return in user programs, just as the fbt provider does for the kernel. Note that most of the examples in this guide that use the fbt provider to trace kernel function calls can be modified slightly to apply to user processes.

The pid provider also enables tracing of any instruction, as specified by an absolute address or function offset.

The pid provider has no probe effect when probes are not enabled. When probes are enabled, the probes only induce probe effect on those processes that are traced.

Note

When the compiler inlines a function, the pid provider's probe does not fire. Use one of the following methods to compile a particular C function so that it will not be inlined.

  • Sun Studio: #pragma no_inline (funcname[, funcname])

  • gcc: funcname __attribute__ ((noinline))

Consult your compiler documentation for updates.

11.6.1 Naming pid Probes

The pid provider actually defines a class of providers. Each process can potentially have its own associated pid provider. For example, a process with ID 123 would be traced by using the pid123 provider.

The module portion of the probe description refers to an object loaded in the corresponding process's address space. To see which objects will be loaded for my_exec or are loaded for process ID 123, use the following commands:

# ldd my_exec
...
# pldd 123
123:  /tmp/my_exec
linux-vdso.so.1
/lib64/libc.so.6
/lib64/ld-linux-x86-64.so.2

In the probe description, you name the object by the name of its file, not by its full path name. You can also omit the .6 version suffix or the entire .so.6 suffix. All of the following examples name the same probe:

pid123:libc.so.6:strcpy:entry
pid123:libc.so:strcpy:entry
pid123:libc:strcpy:entry

The first example is the actual name of the probe. The other examples are convenient aliases that are replaced with the full load object name internally.

For the load object of the executable, you can use the a.out alias. The following two probe descriptions name the same probe:

pid123:my_exec:main:return
pid123:a.out:main:return

The function field of the probe description names a function in the module. A user application binary might have several names for the same function. For example, __gnu_get_libc_version might be an alternate name for the function gnu_get_libc_version in libc.so.6. DTrace chooses one canonical name for such a function and uses that name internally.

The following example illustrates how DTrace internally remaps module and function names to a canonical form:

# dtrace -q -n 'pid123:libc:__gnu_get_libc_version:
    { printf("%s\n%s\n", probemod, probefunc)}'
libc.so.6
gnu_get_libc_version

For examples of how to use the pid provider effectively, see Chapter 12, User Process Tracing.

11.6.2 pid Probe Arguments

An entry probe fires when the traced function is invoked. The arguments to entry probes are the values of the arguments to the traced function.

A return probe fires when the traced function returns or makes a tail call to another function. The arg1 probe argument holds the function return value.

An offset probe fires whenever execution reaches the instruction at the specified offset in the function. For example, to trace the instruction at the address 4 bytes into function main, you can use pid123:a.out:main:4. The arguments for offset probes are undefined. The uregs[] array will help you when examining the process state at these probe sites. See Section 12.5, “uregs[] Array”.
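For example, the following sketch (for an illustrative process ID 123 and a hypothetical function named do_work) prints both the return value and the offset of the returning instruction each time the function returns:

pid123:a.out:do_work:return
{
  printf("do_work returned %d from offset 0x%x\n", arg1, arg0);
}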

11.6.3 pid Stability

The pid provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element     Name Stability   Data Stability   Dependency Class
-------     --------------   --------------   ----------------
Provider    Evolving         Evolving         ISA
Module      Private          Private          Unknown
Function    Private          Private          Unknown
Name        Evolving         Evolving         ISA
Arguments   Private          Private          Unknown

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.7 proc Provider

The proc provider makes available the probes that pertain to the following activities: process creation and termination, LWP creation and termination, execution of new program images, and signal sending and handling.

11.7.1 proc Probes

The probes for the proc provider are listed in the following table.

Table 11.4 proc Probes

Probe

Description

create

Fires when a process (or process thread) is created using fork() or vfork(), which both invoke clone(). The psinfo_t corresponding to the new child process is pointed to by args[0].

exec

Fires whenever a process loads a new process image using a variant of the execve() system call. The exec probe fires before the process image is loaded. Process variables like execname and curpsinfo therefore contain the process state before the image is loaded. Some time after the exec probe fires, either the exec-failure or exec-success probe subsequently fires in the same thread. The path of the new process image is pointed to by args[0].

exec-failure

Fires when an exec() variant has failed. The exec-failure probe fires only after the exec probe has fired in the same thread. The errno value is provided in args[0].

exec-success

Fires when an exec() variant has succeeded. Like the exec-failure probe, the exec-success probe fires only after the exec probe has fired in the same thread. By the time that the exec-success probe fires, process variables like execname and curpsinfo contain the process state after the new process image has been loaded.

exit

Fires when the current process is exiting. The reason for exit, which is expressed as one of the SIGCHLD <asm-generic/signal.h> codes, is contained in args[0].

lwp-create

Fires when a process thread is created, typically as a result of pthread_create(). The lwpsinfo_t corresponding to the new thread is pointed to by args[0]. The psinfo_t of the process that created the thread is pointed to by args[1].

lwp-exit

Fires when a process or process thread is exiting, due either to a signal or to an explicit call to exit() or pthread_exit().

lwp-start

Fires within the context of a newly created process or process thread. The lwp-start probe fires before any user-level instructions are executed. If the thread is the first created for the process, the start probe fires, followed by lwp-start.

signal-clear

Fires when a pending signal is cleared because the target thread was waiting for the signal in sigwait(), sigwaitinfo(), or sigtimedwait(). Under these conditions, the pending signal is cleared and the signal number is returned to the caller. The signal number is in args[0]. signal-clear fires in the context of the formerly waiting thread.

signal-discard

Fires when a signal is sent to a single-threaded process and the signal is both unblocked and ignored by the process. Under these conditions, the signal is discarded on generation. The lwpsinfo_t and psinfo_t of the target process and thread are in args[0] and args[1], respectively. The signal number is in args[2].

signal-handle

Fires immediately before a thread handles a signal. The signal-handle probe fires in the context of the thread that will handle the signal. The signal number is in args[0]. A pointer to the siginfo_t structure that corresponds to the signal is in args[1]. The address of the signal handler in the process is in args[2].

signal-send

Fires when a signal is sent to a process or to a thread created by a process. The signal-send probe fires in the context of the sending process or thread. The lwpsinfo_t and psinfo_t of the receiving process and thread are in args[0] and args[1], respectively. The signal number is in args[2]. signal-send is always followed by signal-handle or signal-clear in the receiving process and thread.

start

Fires in the context of a newly created process. The start probe fires before any user-level instructions are executed in the process.


Note

In Linux, there is no fundamental difference between a process and a thread that a process creates. The threads of a process are set up so that they can share resources, but each thread has its own entry in the process table with its own process ID.
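
For example, a minimal sketch (not one of this chapter's worked examples) that counts thread creations per process using the lwp-create probe might look like this:

#pragma D option quiet

proc:::lwp-create
{
  /* args[1] is the psinfo_t of the process that created the thread */
  @creations[stringof(args[1]->pr_fname)] = count();
}

END
{
  printa("%-20s %@d\n", @creations);
}

Running such a script on a busy desktop would show which programs create threads most frequently.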

11.7.2 proc Probe Arguments

The following table lists the argument types for the proc probes. See Table 11.4, “proc Probes” for a description of the arguments.

Table 11.5 proc Probe Arguments

Probe            args[0]         args[1]        args[2]

create           psinfo_t *
exec             char *
exec-failure     int
exec-success
exit             int
lwp-create       lwpsinfo_t *    psinfo_t *
lwp-exit
lwp-start
signal-clear     int
signal-discard   lwpsinfo_t *    psinfo_t *     int
signal-handle    int             siginfo_t *    void (*)(void)
signal-send      lwpsinfo_t *    psinfo_t *     int
start


11.7.2.1 lwpsinfo_t

Several proc probes have arguments of type lwpsinfo_t. Detailed information about this data structure can be found in /usr/lib64/dtrace/version/procfs.d. The definition of the lwpsinfo_t structure, as available to DTrace consumers, is as follows:

typedef struct lwpsinfo {
  int pr_flag;                /* flags */
  id_t pr_lwpid;              /* thread id */
  uintptr_t pr_addr;          /* internal address of thread */
  uintptr_t pr_wchan;         /* wait addr for sleeping lwp (NULL on Linux) */
  char pr_stype;              /* sync event type (0 on Linux) */
  char pr_state;              /* numeric thread state */
  char pr_sname;              /* printable character for pr_state */
  int pr_pri;                 /* priority, high value = high priority */
  char pr_name[PRCLSZ];       /* scheduling class name */
  processorid_t pr_onpro;     /* processor which last ran this thread */
} lwpsinfo_t;
Note

Lightweight processes do not exist in Linux. Rather, in Oracle Linux, processes and threads are represented by process descriptors of type struct task_struct in the task list. DTrace translates the members of lwpsinfo_t from the task_struct for the Oracle Linux process.

The pr_flag field is set to 1 if the thread is stopped; otherwise, it is set to 0.

In Oracle Linux, the pr_stype field is unsupported, and hence is always 0.

The following table describes the values that pr_state can take, as well as the corresponding character values for pr_sname.

Table 11.6 pr_state Values

SRUN (2), pr_sname "R"

The thread is runnable or is currently running on a CPU. The sched:::enqueue probe fires immediately before a thread's state transitions to SRUN. The sched:::on-cpu probe fires a short time after the thread starts to run.

The equivalent Oracle Linux task state is TASK_RUNNING.

SSLEEP (1), pr_sname "S"

The thread is sleeping. The sched:::sleep probe fires immediately before a thread's state transitions to SSLEEP.

The equivalent Oracle Linux task state is TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.

SSTOP (4), pr_sname "T"

The thread is stopped, either due to an explicit proc directive or some other stopping mechanism.

The equivalent Oracle Linux task state is __TASK_STOPPED or __TASK_TRACED.

SWAIT (7), pr_sname "W"

The thread is waiting on a wait queue. The sched:::cpucaps-sleep probe fires immediately before the thread's state transitions to SWAIT.

The equivalent Oracle Linux task state is TASK_WAKEKILL or TASK_WAKING.

SZOMB (3), pr_sname "Z"

The thread is a zombie.

The equivalent Oracle Linux task state is EXIT_ZOMBIE, EXIT_DEAD, or TASK_DEAD.
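
In addition to being descriptive, pr_state is directly testable in D predicates. The following sketch (not part of the original examples) counts wakeups by the process name of the sleeping thread, using the sched:::wakeup probe described later in this chapter:

sched:::wakeup
/args[0]->pr_state == SSLEEP/
{
  /* args[1] is the psinfo_t of the process that owns the sleeping thread */
  @[stringof(args[1]->pr_fname)] = count();
}

The predicate confirms that the thread being woken is in the SSLEEP state before counting it.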


11.7.2.2 psinfo_t

Several proc probes have an argument of type psinfo_t. Detailed information about this data structure can be found in /usr/lib64/dtrace/version/procfs.d. The definition of the psinfo_t structure, as available to DTrace consumers, is as follows:

typedef struct psinfo {
        int pr_nlwp;                    /* not supported */
        pid_t pr_pid;                   /* unique process id */
        pid_t pr_ppid;                  /* process id of parent */
        pid_t pr_pgid;                  /* pid of process group leader */
        pid_t pr_sid;                   /* session id */
        uid_t pr_uid;                   /* real user id */
        uid_t pr_euid;                  /* effective user id */
        uid_t pr_gid;                   /* real group id */
        uid_t pr_egid;                  /* effective group id */
        uintptr_t pr_addr;              /* address of process */
        size_t pr_size;                 /* not supported */
        size_t pr_rssize;               /* not supported */
        struct tty_struct *pr_ttydev;   /* controlling tty (or -1) */
        ushort_t pr_pctcpu;             /* not supported */
        ushort_t pr_pctmem;             /* not supported */
        timestruc_t pr_start;           /* not supported */
        timestruc_t pr_time;            /* not supported */
        timestruc_t pr_ctime;           /* not supported */
        char pr_fname[16];              /* name of exec'ed file */
        char pr_psargs[80];             /* initial chars of arg list */
        int pr_wstat;                   /* not supported */
        int pr_argc;                    /* initial argument count */
        uintptr_t pr_argv;              /* address of initial arg vector */
        uintptr_t pr_envp;              /* address of initial env vector */
        char pr_dmodel;                 /* data model */
        taskid_t pr_taskid;             /* not supported */
        projid_t pr_projid;             /* not supported */
        int pr_nzomb;                   /* not supported */
        poolid_t pr_poolid;             /* not supported */
        zoneid_t pr_zoneid;             /* not supported */
        id_t pr_contract;               /* not supported */
        lwpsinfo_t pr_lwp;              /* not supported */
} psinfo_t;
Note

Lightweight processes do not exist in Linux. In Oracle Linux, processes and threads are represented by process descriptors of type struct task_struct in the task list. DTrace translates the members of psinfo_t from the task_struct for the Oracle Linux process.

pr_dmodel is set to either PR_MODEL_ILP32, denoting a 32-bit process, or PR_MODEL_LP64, denoting a 64-bit process.
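
As an illustrative sketch (assuming the PR_MODEL_* constants are visible to the script through the procfs.d translator, as the other fields are), pr_dmodel can be used to count newly exec'ed programs by data model:

proc:::exec-success
{
  /* Classify each successfully exec'ed image as 32-bit or 64-bit */
  @[curpsinfo->pr_dmodel == PR_MODEL_ILP32 ? "32-bit" : "64-bit"] = count();
}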

11.7.3 proc Examples

The following examples illustrate the use of the probes that are published by the proc provider.

11.7.3.1 exec

The following example shows how you can use the exec probe to easily determine which programs are being executed, and by whom. Type the following D source code and save it in a file named whoexec.d:

#pragma D option quiet

proc:::exec
{
  self->parent = execname;
}

proc:::exec-success
/self->parent != NULL/
{
  @[self->parent, execname] = count();
  self->parent = NULL;
}

proc:::exec-failure
/self->parent != NULL/
{
  self->parent = NULL;
}

END
{
  printf("%-20s %-20s %s\n", "WHO", "WHAT", "COUNT");
  printa("%-20s %-20s %@d\n", @);
}

Running the example script for a short period of time results in output similar to the following:

# dtrace -s ./whoexec.d 
^C
WHO                  WHAT                 COUNT
abrtd                abrt-handle-eve      1
firefox              basename             1
firefox              mkdir                1
firefox              mozilla-plugin-      1
firefox              mozilla-xremote      1
firefox              run-mozilla.sh       1
firefox              uname                1
gnome-panel          firefox              1
kworker/u:1          modprobe             1
modprobe             modprobe.ksplic      1
mozilla-plugin-      plugin-config        1
mozilla-plugin-      uname                1
nice                 sosreport            1
run-mozilla.sh       basename             1
run-mozilla.sh       dirname              1
run-mozilla.sh       firefox              1
run-mozilla.sh       uname                1
sh                   abrt-action-sav      1
sh                   blkid                1
sh                   brctl                1
sh                   cut                  1
...

11.7.3.2 start and exit Probes

If you want to know how long programs are running, from creation to termination, you can enable the start and exit probes, as shown in the following example. Save it in a file named progtime.d:

proc:::start
{
  self->start = timestamp;
}

proc:::exit
/self->start/
{
  @[execname] = quantize(timestamp - self->start);
  self->start = 0;
}

Running the example script on a build server for several seconds results in output similar to the following:

# dtrace -s ./progtime.d
dtrace: script './progtime.d' matched 2 probes
^C
...
cc
          value  ------------- Distribution ------------- count
       33554432 |                                         0
       67108864 |@@@                                      3
      134217728 |@                                        1
      268435456 |                                         0
      536870912 |@@@@                                     4
     1073741824 |@@@@@@@@@@@@@@                           13
     2147483648 |@@@@@@@@@@@@                             11
     4294967296 |@@@                                      3
     8589934592 |                                         0

sh
          value  ------------- Distribution ------------- count
         262144 |                                         0
         524288 |@                                        5
        1048576 |@@@@@@@                                  29
        2097152 |                                         0
        4194304 |                                         0
        8388608 |@@@                                      12
       16777216 |@@                                       9
       33554432 |@@                                       9
       67108864 |@@                                       8
      134217728 |@                                        7
      268435456 |@@@@@                                    20
      536870912 |@@@@@@                                   26
     1073741824 |@@@                                      14
     2147483648 |@@                                       11
     4294967296 |                                         3
     8589934592 |                                         1
    17179869184 |                                         0
...

11.7.3.3 signal-send

The following example shows how you can use the signal-send probe to determine the sending and receiving processes associated with any signal. Type the following D source code and save it in a file named sig.d:

#pragma D option quiet

proc:::signal-send
{
  @[execname, stringof(args[1]->pr_fname), args[2]] = count();
}

END
{
  printf("%20s %20s %12s %s\n",
      "SENDER", "RECIPIENT", "SIG", "COUNT");
  printa("%20s %20s %12d %@d\n", @);
}

Running this script results in output similar to the following:

# dtrace -s sig.d
^C
              SENDER            RECIPIENT          SIG COUNT
         gnome-panel                 Xorg           29 1
         kworker/0:2               dtrace            2 1
                Xorg                 Xorg           29 3
                java                 Xorg           29 6
             firefox                 Xorg           29 14
         kworker/0:0                 Xorg           29 1135

11.7.4 proc Stability

The proc provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element      Name Stability   Data Stability   Dependency Class

Provider     Evolving         Evolving         ISA
Module       Private          Private          Unknown
Function     Private          Private          Unknown
Name         Evolving         Evolving         ISA
Arguments    Evolving         Evolving         ISA

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.8 sched Provider

The sched provider makes available probes that are related to CPU scheduling. Because CPUs are the one resource that all threads must consume, the sched provider is very useful for understanding systemic behavior. For example, using the sched provider, you can understand when and why threads sleep, run, change priority, or wake other threads.

11.8.1 sched Probes

The following table describes the probes for the sched provider.

Table 11.7 sched Probes

Probe

Description

change-pri

Fires whenever a thread's priority is about to be changed. The lwpsinfo_t of the thread is pointed to by args[0]. The thread's current priority is in the pr_pri field of this structure. The psinfo_t of the process containing the thread is pointed to by args[1]. The thread's new priority is contained in args[2].

dequeue

Fires immediately before a runnable thread is dequeued from a run queue. The lwpsinfo_t of the thread being dequeued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU from which the thread is being dequeued is pointed to by args[2]. If the thread is being dequeued from a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1.

enqueue

Fires immediately before a runnable thread is enqueued to a run queue. The lwpsinfo_t of the thread being enqueued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU to which the thread is being enqueued is pointed to by args[2]. If the thread is being enqueued to a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1. The value in args[3] is a boolean that indicates whether the thread will be enqueued at the front of the run queue: it is non-zero if the thread will be enqueued at the front, and zero if the thread will be enqueued at the back.

off-cpu

Fires when the current CPU is about to end execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is ending execution. The lwpsinfo_t of the thread that the current CPU will next execute is pointed to by args[0]. The psinfo_t of the process containing the next thread is pointed to by args[1].

on-cpu

Fires when a CPU has just begun execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is beginning execution. The curpsinfo variable describes the process containing the current thread.

preempt

Fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the off-cpu probe will fire for the current thread. In some cases, a thread on one CPU will be preempted, but the preempting thread will run on another CPU in the meantime. In this situation, the preempt probe will fire, but the dispatcher will be unable to find a higher priority thread to run and the remain-cpu probe will fire instead of the off-cpu probe.

remain-cpu

Fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is beginning execution. The curpsinfo variable describes the process containing the current thread.

sleep

Fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t pointed to by curlwpsinfo. The address of the synchronization object is contained in the pr_wchan member of the lwpsinfo_t pointed to by curlwpsinfo. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.

surrender

Fires when a CPU has been instructed by another CPU to make a scheduling decision — often because a higher-priority thread has become runnable. The lwpsinfo_t of the current thread is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1].

tick

Fires as a part of clock tick-based accounting. In clock tick-based accounting, CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. The lwpsinfo_t that corresponds to the thread that is being assigned CPU time is pointed to by args[0]. The psinfo_t that corresponds to the process that contains the thread is pointed to by args[1].

wakeup

Fires immediately before the current thread wakes a thread sleeping on a synchronization object. The lwpsinfo_t of the sleeping thread is pointed to by args[0]. The psinfo_t of the process containing the sleeping thread is pointed to by args[1]. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t of the sleeping thread. The address of the synchronization object is contained in the pr_wchan member of the lwpsinfo_t of the sleeping thread. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.
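
As a brief sketch of combining these arguments (this example is not part of the original text), the change-pri probe can show the size and direction of each priority adjustment, keyed by process name:

sched:::change-pri
{
  /* args[2] is the new priority; args[0]->pr_pri is the current one */
  @[stringof(args[1]->pr_fname)] =
      lquantize(args[2] - args[0]->pr_pri, -32, 32, 4);
}

Positive values in the resulting distribution indicate priority boosts, and negative values indicate demotions.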


11.8.2 sched Probe Arguments

The following table describes the argument types for the sched probes. See Table 11.7, “sched Probes” for descriptions of the arguments.

Table 11.8 sched Probe Arguments

Probe            args[0]         args[1]        args[2]        args[3]

change-pri       lwpsinfo_t *    psinfo_t *     int
dequeue          lwpsinfo_t *    psinfo_t *     cpuinfo_t *
enqueue          lwpsinfo_t *    psinfo_t *     cpuinfo_t *    int
off-cpu          lwpsinfo_t *    psinfo_t *
on-cpu
preempt
remain-cpu
sleep
surrender        lwpsinfo_t *    psinfo_t *
tick             lwpsinfo_t *    psinfo_t *
wakeup           lwpsinfo_t *    psinfo_t *


11.8.2.1 cpuinfo_t

The cpuinfo_t structure defines a CPU. Per the information in Table 11.8, “sched Probe Arguments”, arguments to both the enqueue and dequeue probes include a pointer to a cpuinfo_t. Additionally, the cpuinfo_t that corresponds to the current CPU is pointed to by the curcpu variable.

The definition of the cpuinfo_t structure is as follows:

typedef struct cpuinfo {
  processorid_t cpu_id;      /* CPU identifier */
  psetid_t cpu_pset;         /* not supported */
  chipid_t cpu_chip;         /* chip identifier */
  lgrp_id_t cpu_lgrp;        /* not supported */
  cpuinfo_arch_t *cpu_info;  /* CPU information */
} cpuinfo_t;

cpu_id: The processor identifier.

cpu_chip: The identifier of the physical chip. Physical chips can contain several CPU cores.

cpu_info: A pointer to the cpuinfo_arch_t structure that is associated with the CPU.
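
As a hypothetical sketch, curcpu can be used to see how often the tick probe fires on each CPU and chip; whether ticks are evenly spread depends on how the kernel performs its clock-tick accounting:

sched:::tick
{
  /* curcpu points to the cpuinfo_t of the CPU on which the probe fired */
  @[curcpu->cpu_id, curcpu->cpu_chip] = count();
}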

11.8.3 sched Examples

The following examples show the use of sched probes.

11.8.3.1 on-cpu and off-cpu Probes

One common question that you might want answered is which CPUs are running threads and for how long? The following example shows how you can use the on-cpu and off-cpu probes to easily answer this question on a system-wide basis. Type the following D source code and save it in a file named where.d:

sched:::on-cpu
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = quantize(timestamp - self->ts);
  self->ts = 0;
}

Running the previous script results in output that is similar to the following:

# dtrace -s ./where.d
dtrace: script './where.d' matched 2 probes
^C
        0
          value  ------------- Distribution ------------- count
           2048 |                                         0
           4096 |@@                                       37
           8192 |@@@@@@@@@@@@@                            212
          16384 |@                                        30
          32768 |                                         10
          65536 |@                                        17
         131072 |                                         12
         262144 |                                         9
         524288 |                                         6
        1048576 |                                         5
        2097152 |                                         1
        4194304 |                                         3
        8388608 |@@@@                                     75
       16777216 |@@@@@@@@@@@@                             201
       33554432 |                                         6
       67108864 |                                         0

        1
          value  ------------- Distribution ------------- count
           2048 |                                         0
           4096 |@                                        6
           8192 |@@@@                                     23
          16384 |@@@                                      18
          32768 |@@@@                                     22
          65536 |@@@@                                     22
         131072 |@                                        7
         262144 |                                         5
         524288 |                                         2
        1048576 |                                         3
        2097152 |@                                        9
        4194304 |                                         4
        8388608 |@@@                                      18
       16777216 |@@@                                      19
       33554432 |@@@                                      16
       67108864 |@@@@                                     21
      134217728 |@@                                       14
      268435456 |                                         0

The previous output shows that on CPU 1, threads tend to run either for less than 131072 nanoseconds (on the order of 100 microseconds) at a stretch, or for 8388608 to 134217728 nanoseconds (approximately 10 to 100 milliseconds). The histogram shows a noticeable gap between the two clusters of data. You might also be interested in knowing which CPUs are running a particular process.

You can also use the on-cpu and off-cpu probes to answer this question. The following script displays which CPUs run a specified application over a period of ten seconds. Save it in a file named whererun.d:

#pragma D option quiet
dtrace:::BEGIN
{
  start = timestamp;
}

sched:::on-cpu
/execname == $$1/
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = sum(timestamp - self->ts);
  self->ts = 0;
}

profile:::tick-1sec
/++x >= 10/
{
  exit(0);
}

dtrace:::END
{
  printf("CPU distribution over %d seconds:\n\n",
    (timestamp - start) / 1000000000);
  printf("CPU microseconds\n--- ------------\n");
  normalize(@, 1000);
  printa("%3d %@d\n", @);
}

Running the previous script on a large mail server and specifying the IMAP daemon results in output that is similar to the following:

# dtrace -s ./whererun.d imapd
CPU distribution over 10 seconds:

CPU microseconds
--- ------------
 15 10102
 12 16377
 21 25317
 19 25504
 17 35653
 13 41539
 14 46669
 20 57753
 22 70088
 16 115860
 23 127775
 18 160517

Oracle Linux takes into account the amount of time that a thread has been sleeping when selecting a CPU on which to run the thread, as a thread that has been sleeping for less time tends not to migrate. Use the off-cpu and on-cpu probes to observe this behavior. Type the following source code and save it in a file named howlong.d:

sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
 self->cpu = cpu;
 self->ts = timestamp;
}

sched:::on-cpu
/self->ts/
{
 @[self->cpu == cpu ?
   "sleep time, no CPU migration" : "sleep time, CPU migration"] =
   lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
 self->ts = 0;
 self->cpu = 0;
}

Running the previous script for approximately 30 seconds results in output that is similar to the following:

# dtrace -s ./howlong.d
dtrace: script './howlong.d' matched 2 probes
^C
 sleep time, CPU migration
          value  ------------- Distribution ------------- count
            < 0 |                                         0
              0 |@@@@@@@                                  6838
             25 |@@@@@                                    4714
             50 |@@@                                      3108
             75 |@                                        1304
            100 |@                                        1557
            125 |@                                        1425
            150 |                                         894
            175 |@                                        1526
            200 |@@                                       2010
            225 |@@                                       1933
            250 |@@                                       1982
            275 |@@                                       2051
            300 |@@                                       2021
            325 |@                                        1708
            350 |@                                        1113
            375 |                                         502
            400 |                                         220
            425 |                                         106
            450 |                                         54
            475 |                                         40
         >= 500 |@                                        1716

 sleep time, no CPU migration
          value  ------------- Distribution ------------- count
            < 0 |                                         0
              0 |@@@@@@@@@@@@                             58413
             25 |@@@                                      14793
             50 |@@                                       10050
             75 |                                         3858
            100 |@                                        6242
            125 |@                                        6555
            150 |                                         3980
            175 |@                                        5987
            200 |@                                        9024
            225 |@                                        9070
            250 |@@                                       10745
            275 |@@                                       11898
            300 |@@                                       11704
            325 |@@                                       10846
            350 |@                                        6962
            375 |                                         3292
            400 |                                         1713
            425 |                                         585
            450 |                                         201
            475 |                                         96
         >= 500 |                                         3946

The previous output reveals that there are many more occurrences of non-migration than migration. Also, when sleep times are longer, migrations are more likely. The distributions are noticeably different in the sub-100 millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision when a certain threshold is exceeded.

11.8.3.2 enqueue and dequeue Probes

You might want to know on which CPUs processes and threads are waiting to run. You can use the enqueue probe along with the dequeue probe to answer this question. Type the following source code and save it in a file named qtime.d:

sched:::enqueue
{
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] =
  timestamp;
}

sched:::dequeue
/a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]/
{
  @[args[2]->cpu_id] = quantize(timestamp -
    a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]);
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = 0;
}

Running the previous script for several seconds results in output that is similar to the following:

# dtrace -s qtime.d 
dtrace: script 'qtime.d' matched 16 probes
^C

        1
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |                                         1        
           32768 |@                                        47       
           65536 |@@@@@@@                                  365      
          131072 |@@@@@@@@@@@@                             572      
          262144 |@@@@@@@@@@@@                             570      
          524288 |@@@@@@@                                  354      
         1048576 |@                                        57       
         2097152 |                                         7        
         4194304 |                                         1        
         8388608 |                                         1        
        16777216 |                                         0        

        0
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |                                         6        
           32768 |@                                        49       
           65536 |@@@@@                                    261      
          131072 |@@@@@@@@@@@@@                            753      
          262144 |@@@@@@@@@@@@                             704      
          524288 |@@@@@@@@                                 455      
         1048576 |@                                        74       
         2097152 |                                         9        
         4194304 |                                         2        
         8388608 |                                         0

Rather than looking at wait times, you might want to examine the length of the run queue over time. Using the enqueue and dequeue probes, you can set up an associative array to track the queue length. Type the following source code and save it in a file named qlen.d:

sched:::enqueue
{
  this->len = qlen[args[2]->cpu_id]++;
  @[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  qlen[args[2]->cpu_id]--;
}

Running the previous script on a largely idle dual-core processor system for approximately 30 seconds results in output that is similar to the following:

# dtrace -s qlen.d 
dtrace: script 'qlen.d' matched 16 probes
^C

        1
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        8124     
               1 |@@@@@@                                   1558     
               2 |@                                        160      
               3 |                                         51       
               4 |                                         24       
               5 |                                         13       
               6 |                                         11       
               7 |                                         9        
               8 |                                         6        
               9 |                                         0        

        0
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           8569     
               1 |@@@@@@@@@                                2429     
               2 |@                                        292      
               3 |                                         25       
               4 |                                         8        
               5 |                                         5        
               6 |                                         4        
               7 |                                         4        
               8 |                                         1        
               9 |                                         0

The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread is enqueued, the run queues were very short (three or fewer threads in length). However, given that the system was largely idle, the exceptional data points at the bottom of each table might be unexpected. For example, why were the run queues as long as 8 runnable threads? To explore this question further, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.

For this type of script, you would enable the enqueue and dequeue probes and then use both speculations and associative arrays. Whenever a thread is enqueued, the script increments the length of the queue and records the timestamp in an associative array keyed by the thread. You cannot use a thread-local variable in this case because a thread might be enqueued by another thread. The script then checks to see if the queue length exceeds the maximum, and if so, the script starts a new speculation, and records the timestamp and the new maximum. Then, when a thread is dequeued, the script compares the enqueue timestamp to the timestamp of the longest length: if the thread was enqueued before the timestamp of the longest length, the thread was in the queue when the longest length was recorded. In this case, the script speculatively traces the thread's information. When the kernel dequeues the last thread that was enqueued at the timestamp of the longest length, the script commits the speculation data. Type the following source code and save it in a file named whoqueue.d:

#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k

int maxlen;
int spec[int];
sched:::enqueue
{
  this->len = ++qlen[this->cpu = args[2]->cpu_id];
  in[args[0]->pr_addr] = timestamp;
}

sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
  /*
   * There is already a speculation for this CPU. We just set a new
   * record, so we’ll discard the old one.
   */
  discard(spec[this->cpu]);
}

sched:::enqueue
/this->len > maxlen/
{
  /*
   * We have a winner. Set the new maximum length and set the timestamp
   * of the longest length.
   */
  maxlen = this->len;
  longtime[this->cpu] = timestamp;
  /*
   * Now start a new speculation, and speculatively trace the length.
   */
  this->spec = spec[this->cpu] = speculation();
  speculate(this->spec);
  printf("Run queue of length %d:\n", this->len);
}

sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
  this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
  speculate(spec[this->cpu]);
  printf(" %d/%d (%s)\n",
    args[1]->pr_pid, args[0]->pr_lwpid,
    stringof(args[1]->pr_fname));
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  in[args[0]->pr_addr] = 0;
  this->len = --qlen[args[2]->cpu_id];
}

sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
  /*
   * We just processed the last thread that was enqueued at the time
   * of longest length; commit the speculation, which by now contains
   * each thread that was enqueued when the queue was longest.
   */
  commit(spec[this->cpu]);
  spec[this->cpu] = 0;
}

Running the previous script on the same system results in output that is similar to the following:

# dtrace -s whoqueue.d
Run queue of length 1:
 2850/2850 (java)
Run queue of length 2:
 4034/4034 (kworker/0:1)
 16/16 (sync_supers)
Run queue of length 3:
 10/10 (ksoftirqd/1)
 1710/1710 (hald-addon-inpu)
 25350/25350 (dtrace)
Run queue of length 4:
 2852/2852 (java)
 2850/2850 (java)
 1710/1710 (hald-addon-inpu)
 2099/2099 (Xorg)
Run queue of length 5:
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
Run queue of length 9:
 3685/3685 (firefox)
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2852/2852 (java)
 2452/2452 (nautilus)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
 2749/2749 (gnome-terminal)
^C

11.8.3.3 sleep and wakeup Probes

The following example shows how you might use the wakeup probe to determine what is waking a particular process, and when, over a given period. Type the following source code and save it in a file named gterm.d:

#pragma D option quiet

dtrace:::BEGIN
{
  start = timestamp;
}

sched:::wakeup
/stringof(args[1]->pr_fname) == "gnome-terminal"/
{
  @[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}

profile:::tick-1sec
/++x == 10/
{
  exit(0);
}

The output from running this script is as follows:

# dtrace -s gterm.d

  Xorg                                              
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@                          69       
               1 |@@@@@@@@                                 35       
               2 |@@@@@@@@@                                42       
               3 |                                         2        
               4 |                                         0        
               5 |                                         0        
               6 |                                         0        
               7 |@@@@                                     16       
               8 |                                         0        
               9 |@@@                                      15       
           >= 10 |                                         0  

This output shows that the X server is waking the gnome-terminal process as you interact with the system.
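More generally, you could turn this script around to count every waker/wakee pairing on the system, rather than tracking only wakeups of gnome-terminal. The following is a minimal sketch, not part of the original example set; it assumes the same wakeup probe argument layout described above, where args[1] is the psinfo_t of the process being woken and execname names the waking process:

#pragma D option quiet

sched:::wakeup
{
  /* Key the aggregation on waker name and wakee name. */
  @[execname, stringof(args[1]->pr_fname)] = count();
}

END
{
  printa("  %16s woke %-16s %@8d times\n", @);
}

Run for a few seconds and interrupt with Ctrl-C; the END clause prints the pairings sorted by count.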

Additionally, you could use the sleep probe with the wakeup probe to understand which applications are blocking on other applications, and for how long. Type the following source code and save it in a file named whofor.d:

#pragma D option quiet
sched:::sleep
{
  bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
  @[stringof(args[1]->pr_fname), execname] =
    quantize(timestamp - bedtime[args[0]->pr_addr]);
  bedtime[args[0]->pr_addr] = 0;
}

END
{
  printa("%s sleeping on %s:\n%@d\n", @);
}

The tail of the output from running the previous example script on a desktop system for several seconds is similar to the following:

# dtrace -s whofor.d 
^C
...
Xorg sleeping on metacity:

           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
          262144 |                                         0        

gnome-power-man sleeping on Xorg:

           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
          524288 |                                         0
...

11.8.3.4 preempt and remain-cpu Probes

Because Oracle Linux is a preemptive system, higher priority threads preempt lower priority threads. Preemption can induce a significant latency bubble in the lower priority thread. Therefore, you might want to know which threads are being preempted by other threads.
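As a first pass before attributing individual preemptions, you could simply count how often each process is preempted. This is a sketch, not part of the original example set; it assumes the preempt probe fires in the context of the thread that is about to be preempted, as described above:

sched:::preempt
{
  /* Count preemption events by the name of the preempted process. */
  @[execname] = count();
}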

The following example shows how you would use the preempt and remain-cpu probes to display this information. Type the following source code and save it in a file named whopreempt.d:

#pragma D option quiet

sched:::preempt
{
  self->preempt = 1;
}

sched:::remain-cpu
/self->preempt/
{
  self->preempt = 0;
}

sched:::off-cpu
/self->preempt/
{
  /*
   * If we were told to preempt ourselves, see who we ended up giving
   * the CPU to.
   */
  @[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
    curlwpsinfo->pr_pri] = count();
  self->preempt = 0;
}

END
{
  printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
    "PREEMPTED", "PRI", "#");
  printa("%30s %3d %30s %3d %5@d\n", @);
}

Running the previous script on a desktop system for several seconds results in output that is similar to the following:

# dtrace -s whopreempt.d 
^C
                     PREEMPTOR PRI                      PREEMPTED PRI     #
                       firefox 120                    kworker/0:0 120     1
                   gnome-panel 120                        swapper 120     1
                   gnome-panel 120                    wnck-applet 120     1
                   jbd2/dm-0-8 120                        swapper 120     1
                    khugepaged 139                    kworker/0:0 120     1
                   ksoftirqd/1 120                    kworker/0:0 120     1
                   kworker/0:0 120                 gnome-terminal 120     1
                   kworker/0:2 120                           Xorg 120     1
                   kworker/0:2 120                           java 120     1
                   kworker/1:0 120                           Xorg 120     1
                      nautilus 120                           Xorg 120     1
                  rtkit-daemon   0                   rtkit-daemon 120     1
                  rtkit-daemon 120                        swapper 120     1
                    watchdog/0   0                        swapper 120     1
                    watchdog/1   0                    kworker/0:0 120     1
                   wnck-applet 120                           Xorg 120     1
                   wnck-applet 120                        swapper 120     1
                     automount 120                    kworker/0:0 120     2
               gnome-power-man 120                    kworker/0:0 120     2
                   kworker/0:0 120                        swapper 120     2
                   kworker/1:0 120                         dtrace 120     2
                      metacity 120                    kworker/0:0 120     2
               notification-da 120                        swapper 120     2
                 udisks-daemon 120                    kworker/0:0 120     2
                     automount 120                        swapper 120     3
                   gnome-panel 120                           Xorg 120     3
               gnome-settings- 120                           Xorg 120     3
               gnome-settings- 120                        swapper 120     3
                gnome-terminal 120                        swapper 120     3
                          java 120                    kworker/0:0 120     3
                   ksoftirqd/0 120                        swapper 120     3
                   kworker/0:2 120                        swapper 120     3
                      metacity 120                           Xorg 120     3
                      nautilus 120                    kworker/0:0 120     3
                         qpidd 120                        swapper 120     3
                      metacity 120                        swapper 120     4
               gvfs-afc-volume 120                        swapper 120     5
                          java 120                           Xorg 120     5
               notification-da 120                           Xorg 120     5
               notification-da 120                    kworker/0:0 120     5
                          Xorg 120                    kworker/0:0 120     6
                   wnck-applet 120                    kworker/0:0 120    10
                   VBoxService 120                        swapper 120    13
                        dtrace 120                        swapper 120    14
                   kworker/1:0 120                    kworker/0:0 120    16
                        dtrace 120                    kworker/0:0 120    20
                          Xorg 120                        swapper 120    90
               hald-addon-inpu 120                        swapper 120   100
                          java 120                        swapper 120   108
                gnome-terminal 120                    kworker/0:0 120   110

11.8.3.5 tick Probe

If NOHZ is set to off, Oracle Linux uses tick-based CPU accounting, where a system clock interrupt fires at a fixed interval and attributes CPU utilization to the processes that are running at the time of the tick. The following example shows how you would use the tick probe to observe this attribution.

# dtrace -n sched:::tick'{ @[stringof(args[1]->pr_fname)] = count() }'
dtrace: description 'sched:::tick' matched 1 probe
^C

  VBoxService                                                       1
  gpk-update-icon                                                   1
  hald-addon-inpu                                                   1
  jbd2/dm-0-8                                                       1
  automount                                                         2
  gnome-session                                                     2
  hald                                                              2
  gnome-power-man                                                   3
  ksoftirqd/0                                                       3
  kworker/0:2                                                       3
  notification-da                                                   4
  devkit-power-da                                                   6
  nautilus                                                          9
  dbus-daemon                                                      11
  gnome-panel                                                      11
  gnome-settings-                                                  11
  dtrace                                                           19
  khugepaged                                                       22
  metacity                                                         27
  kworker/0:0                                                      41
  swapper                                                          56
  firefox                                                          58
  wnck-applet                                                      61
  gnome-terminal                                                   67
  java                                                             84
  Xorg                                                            227

One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system either over-accounts or under-accounts for the thread, depending on whether the accounting is done before or after time-related dispatching scheduling activity. If accounting is performed before time-related dispatching, the system under-accounts for threads running at a regular interval. If such threads run for less than the clock tick interval, they can effectively hide behind the clock tick.

The following example examines whether a system has any such threads. Type the following source code and save it in a file named tick.d:

sched:::tick,
sched:::enqueue
{
  @[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}

The output of the example script is two distributions of the millisecond offset within a ten millisecond interval, one for the tick probe and another for enqueue:

# dtrace -s tick.d
dtrace: script 'tick.d' matched 9 probes
^C

  tick                                              
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@                                    29       
               1 |@@@@@@@@@@@@@@@@@@@                      106      
               2 |@@@@@                                    27       
               3 |@                                        7        
               4 |@@                                       10       
               5 |@@                                       12       
               6 |@                                        4        
               7 |@                                        8        
               8 |@@                                       9        
               9 |@@@                                      17       
           >= 10 |                                         0        

  enqueue                                           
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@                                     82       
               1 |@@@@                                     86       
               2 |@@@@                                     76       
               3 |@@@                                      65       
               4 |@@@@@                                    101      
               5 |@@@@                                     79       
               6 |@@@@                                     75       
               7 |@@@@                                     76       
               8 |@@@@                                     89       
               9 |@@@@                                     75       
           >= 10 |                                         0 

The output histogram named tick shows that the clock tick is firing at a 1 millisecond offset. In this example, the output for enqueue is evenly spread across the ten millisecond interval and no spike is visible at 1 millisecond, so it appears that threads are not being scheduled on a time basis.

11.8.4 sched Stability

The sched provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element     Name Stability    Data Stability    Dependency Class

Provider    Evolving          Evolving          ISA
Module      Private           Private           Unknown
Function    Private           Private           Unknown
Name        Evolving          Evolving          ISA
Arguments   Evolving          Evolving          ISA

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.9 io Provider

The io provider makes available probes that relate to data input and output. The io provider enables quick exploration of behavior that is observed through I/O monitoring tools such as iostat. For example, you can use the io provider to understand I/O by device, I/O type, I/O size, process, or application name.

11.9.1 io Probes

The following table describes the probes for the io provider.

Table 11.9 io Probes

Probe        Description

start        Fires when an I/O request is about to be made either to a
             peripheral device or to an NFS server.

done         Fires after an I/O request has been fulfilled. The done probe
             fires after the I/O completes, but before completion processing
             has been performed on the buffer. As a result, B_DONE is not set
             in b_flags at the time the done probe fires.

wait-start   Fires immediately before a thread begins to wait pending
             completion of a given I/O request. Some time after the
             wait-start probe fires, the wait-done probe fires in the same
             thread.

wait-done    Fires when a thread finishes waiting for the completion of a
             given I/O request. The wait-done probe fires only after the
             wait-start probe has fired in the same thread.


The io probes fire for all I/O requests to peripheral devices, and for all file read and file write requests to an NFS server. Requests for metadata from an NFS server, such as a readdir() request, do not trigger io probes.
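To illustrate the wait-start and wait-done pairing described in the table, the following sketch times how long threads block waiting on I/O. It is not part of the original example set; it relies on the documented guarantee that wait-done fires in the same thread as wait-start, and it records nothing for asynchronous I/O, for which these probes do not fire:

io:::wait-start
{
  self->ts = timestamp;
}

io:::wait-done
/self->ts/
{
  /* Aggregate the nanoseconds spent waiting, per process. */
  @["I/O wait (ns)", execname] = quantize(timestamp - self->ts);
  self->ts = 0;
}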

11.9.2 io Probe Arguments

The following table describes the arguments for the io probes.

Table 11.10 io Probe Arguments

Argument    Type            Description

args[0]     bufinfo_t *     The bufinfo_t for the corresponding I/O request.

args[1]     devinfo_t *     The devinfo_t for the device for the
                            corresponding I/O request.

args[2]     fileinfo_t *    The fileinfo_t for the file for the corresponding
                            I/O request.


Note

DTrace does not currently support the use of fileinfo_t with io probes. In Oracle Linux, information about the file from which an I/O request originated is not readily accessible at the level where the io probes fire.

11.9.2.1 bufinfo_t

The bufinfo_t structure is the abstraction that describes an I/O request. The buffer that corresponds to an I/O request is pointed to by args[0] in the start, done, wait-start, and wait-done probes. Detailed information about this data structure can be found in /usr/lib64/dtrace/version/io.d. The definition of bufinfo_t is as follows:

typedef struct bufinfo {
  int b_flags;         /* flags */
  size_t b_bcount;     /* number of bytes */
  caddr_t b_addr;      /* buffer address */
  uint64_t b_blkno;    /* expanded block # on device */
  uint64_t b_lblkno;   /* logical block # on device */
  size_t b_resid;      /* not supported */
  size_t b_bufsize;    /* size of allocated buffer */
  caddr_t b_iodone;    /* I/O completion routine */
  int b_error;         /* not supported */
  dev_t b_edev;        /* extended device */
} bufinfo_t;

Note

DTrace translates the members of bufinfo_t from the buffer_head or bio for the Oracle Linux I/O request structure, depending on the kernel version.

b_flags indicates the state of the I/O buffer, and consists of a bitwise-or of different state values. The following table describes the values for the supported states.

Table 11.11 b_flags Values

b_flags     Value       Description

B_ASYNC     0x000400    Indicates that the I/O request is asynchronous and is
                        not waited upon. The wait-start and wait-done probes
                        do not fire for asynchronous I/O requests.

                        Note: Some I/Os directed to be asynchronous might not
                        set B_ASYNC. The asynchronous I/O subsystem could
                        implement the asynchronous request by having a
                        separate worker thread perform a synchronous I/O
                        operation.

B_PAGEIO    0x000010    Indicates that the buffer is being used in a paged
                        I/O request.

B_PHYS      0x000020    Indicates that the buffer is being used for physical
                        (direct) I/O to a user data area.

B_READ      0x000040    Indicates that data is to be read from the peripheral
                        device into main memory.

B_WRITE     0x000100    Indicates that the data is to be transferred from
                        main memory to the peripheral device.
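To see how these flags combine in practice, you could aggregate I/O requests by direction and synchronicity. This is a sketch, not part of the original example set; it assumes that testing b_flags against B_READ and B_ASYNC, as in the earlier examples, is sufficient to classify each request:

io:::start
{
  /* Classify each request as read/write and sync/async. */
  @[args[0]->b_flags & B_READ ? "read" : "write",
    args[0]->b_flags & B_ASYNC ? "async" : "sync"] = count();
}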


b_bcount: Is the number of bytes to be transferred as part of the I/O request.

b_addr: Is the virtual address of the I/O request, when known.

b_blkno: Identifies which block on the device is to be accessed.

b_lblkno: Identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (such as the cylinder, track, and so on) is defined by the device.

b_bufsize: Contains the size of the allocated buffer.

b_iodone: Identifies a specific routine in the kernel that is called when the I/O is complete.

b_edev: Contains the major and minor device numbers of the device accessed. You can use the D subroutines getmajor and getminor to extract the major and minor device numbers from the b_edev field.
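As a brief illustration of the getmajor and getminor subroutines, the following sketch (not part of the original example set) prints the device numbers alongside the iostat-style device name for each I/O request:

io:::start
{
  /* Extract major/minor numbers from b_edev for each request. */
  printf("%d/%d %10s\n", getmajor(args[0]->b_edev),
      getminor(args[0]->b_edev), args[1]->dev_statname);
}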

11.9.2.2 devinfo_t

The devinfo_t structure provides information about a device. The devinfo_t structure that corresponds to the destination device of an I/O is pointed to by args[1] in the start, done, wait-start, and wait-done probes. Detailed information about this data structure can be found in /usr/lib64/dtrace/version/io.d. The definition of devinfo_t is as follows:

typedef struct devinfo {
  int dev_major;           /* major number */
  int dev_minor;           /* minor number */
  int dev_instance;        /* not supported */
  string dev_name;         /* name of device */
  string dev_statname;     /* name of device + instance/minor */
  string dev_pathname;     /* pathname of device */
} devinfo_t;

Note

DTrace translates the members of devinfo_t from the buffer_head for the Oracle Linux I/O request structure.

dev_major: Is the major number of the device.

dev_minor: Is the minor number of the device.

dev_name: Is the name of the device driver that manages the device.

dev_statname: Is the name of the device as reported by iostat. This field is provided so that aberrant iostat output can be quickly correlated to actual I/O activity.

dev_pathname: Is the full path of the device. The path that is specified by dev_pathname includes components expressing the device node, the instance number, and the minor node. However, note that all three of these elements are not necessarily expressed in the statistics name. For some devices, the statistics name consists of the device name and the instance number. For other devices, the name consists of the device name and the number of the minor node. As a result, two devices that have the same dev_statname might differ in their dev_pathname.
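To correlate the two naming schemes on your own system, you could aggregate I/O counts keyed by both fields. This is a sketch, not part of the original example set; it simply shows each dev_statname together with the dev_pathname it maps to:

#pragma D option quiet

io:::start
{
  /* Count I/Os per (statistics name, full device path) pair. */
  @[args[1]->dev_statname, args[1]->dev_pathname] = count();
}

END
{
  printa("%10s  %-30s %@8d\n", @);
}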

11.9.2.3 fileinfo_t

Note

DTrace does not currently support the use of fileinfo_t with the args[2] argument of the io probes. You can use the fileinfo_t structure to obtain information about a process's open files by using the fds[] array. See Section 2.9.5, “Built-In Variables”.

The fileinfo_t structure provides information about a file. args[2] in the start, done, wait-start, and wait-done probes points to the file to which an I/O request corresponds. The presence of file information is contingent upon the file system providing this information when dispatching I/O requests. Some file systems, especially third-party file systems, might not provide this information. Also, I/O requests might emanate from a file system for which no file information exists. For example, any I/O from or to file system metadata is not associated with any one file. Finally, some highly optimized file systems might aggregate I/O from disjoint files into a single I/O request. In this case, the file system might provide the file information either for the file that represents the majority of the I/O or for the file that represents some of the I/O. Alternatively, the file system might provide no file information at all in this case.

Detailed information about this data structure can be found in /usr/lib64/dtrace/version/io.d. The definition of fileinfo_t is as follows:

typedef struct fileinfo {
  string fi_name;           /* name (basename of fi_pathname) */
  string fi_dirname;        /* directory (dirname of fi_pathname) */
  string fi_pathname;       /* full pathname */
  loff_t fi_offset;         /* offset within file */
  string fi_fs;             /* file system */
  string fi_mount;          /* not supported */
  int fi_oflags;            /* open() flags for file descriptor */
} fileinfo_t;

The fi_name field contains the name of the file but does not include any directory components. If no file information is associated with an I/O, the fi_name field is set to the string <none>. In some rare cases, the pathname that is associated with a file might be unknown. In this case, the fi_name field is set to the string <unknown>.

The fi_dirname field contains only the directory component of the file name. As with fi_name, this string can be set to <none>, if no file information is present, or <unknown> if the pathname that is associated with the file is not known.

The fi_pathname field contains the full pathname to the file. As with fi_name, this string can be set to <none>, if no file information is present, or <unknown> if the pathname that is associated with the file is not known.

The fi_offset field contains the offset within the file, or -1 if either file information is not present or the offset is otherwise unspecified by the file system.

The fi_fs field contains the name of the file system type, or <none>, if no information is present.

The fi_oflags field contains the flags that were specified when opening the file.
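Although the io probes do not currently supply a usable fileinfo_t, the same translator can be applied to a process's open files through the fds[] array, as noted above. The following sketch is not part of the original example set; it assumes the fds[] built-in variable is available and that arg0 of the write() system call is a file descriptor:

syscall::write:entry
{
  /* Look up the fileinfo_t for the file descriptor being written. */
  printf("%s writing to %s (fs: %s)\n", execname,
      fds[arg0].fi_pathname, fds[arg0].fi_fs);
}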

11.9.3 io Examples

The following example script displays information for every I/O as it is issued. Type the following source code and save it in a file named iosnoop.d:

#pragma D option quiet

BEGIN
{
  printf("%10s %2s\n", "DEVICE", "RW");
}

io:::start
{
  printf("%10s %2s\n", args[1]->dev_statname,
  args[0]->b_flags & B_READ ? "R" : "W");
}

The output from this script is similar to the following:

# dtrace -s ./iosnoop.d
    DEVICE RW
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
     dm-00  R
...

You can make the example script slightly more sophisticated by using an associative array to track the time (in milliseconds) spent on each I/O, as shown in the following example:

#pragma D option quiet

BEGIN
{
  printf("%10s %2s %7s\n", "DEVICE", "RW", "MS");
}

io:::start
{
  start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
  this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
  printf("%10s %2s %3d.%03d\n", args[1]->dev_statname,
  args[0]->b_flags & B_READ ? "R" : "W",
  this->elapsed / 1000000, (this->elapsed / 1000) % 1000);
  start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

The modified script adds a MS (milliseconds) column to the output.

You can also aggregate the number of bytes transferred by device, application, and process ID, as shown in the following example. Type the following source code and save it in a file named whoio.d:

#pragma D option quiet

io:::start
{
  @[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}

END
{
  printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
  printa("%10s %20s %10d %15@d\n", @);
}

Running this script for a few seconds results in output that is similar to the following:

# dtrace -s whoio.d 
^C
    DEVICE                  APP        PID           BYTES
     dm-00               evince      14759           16384
     dm-00          flush-252:0       1367           45056
     dm-00                 bash      14758          131072
     dm-00       gvfsd-metadata       2787          135168
     dm-00               evince      14758          139264
     dm-00               evince      14338          151552
     dm-00          jbd2/dm-0-8        390          356352

If you are copying data from one device to another, you might want to know whether one of the devices acts as a limiter on the copy. To answer this question, you need to know the effective throughput of each device, rather than the number of bytes per second that each device is transferring. For example, you can determine throughput by using the following script. Save it in a file named copy.d:

#pragma D option quiet

io:::start
{
  start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
  /*
   * We want to get an idea of our throughput to this device in KB/sec.
   * What we have, however, is nanoseconds and bytes. That is we want
   * to calculate:
   *
   * bytes / 1024
   * ------------------------
   * nanoseconds / 1000000000
   *
   * But we cannot calculate this using integer arithmetic without losing
   * precision (the denominator, for one, is between 0 and 1 for nearly
   * all I/Os). So we restate the fraction, and cancel:
   *
   * bytes       1000000000      bytes       976562
   * --------- * ------------- = --------- * -------------
   * 1024        nanoseconds     1           nanoseconds
   *
   * This is easy to calculate using integer arithmetic.
   */
  this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
  @[args[1]->dev_statname, args[1]->dev_pathname] =
    quantize((args[0]->b_bcount * 976562) / this->elapsed);
  start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
  printa(" %s (%s)\n%@d\n", @);
}

Running the previous script for several seconds while copying data from a hard disk to a USB drive yields the following output:

# dtrace -s copy.d
^C
sdc1 (/dev/sdc1)

           value  ------------- Distribution ------------- count    
              32 |                                         0
              64 |                                         3
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2257
             512 |                                         1
            1024 |                                         0       

 dm-00 (/dev/dm-00)

           value  ------------- Distribution ------------- count    
             128 |                                         0
             256 |                                         1
             512 |                                         0
            1024 |                                         2
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@@@@@@@@@                       172
           16384 |@@@@@                                    52
           32768 |@@@@@@@@@@@                              108
           65536 |@@@                                      34
          131072 |                                         0     

The previous output shows that the USB drive (sdc1) is clearly the limiting device. The throughput of sdc1 is between 256 KB/sec and 512 KB/sec, while dm-00 delivered I/O at anywhere from 8 MB/sec to over 64 MB/sec.

11.9.4 io Stability

The io provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element      Name Stability   Data Stability   Dependency Class
---------    --------------   --------------   ----------------
Provider     Evolving         Evolving         ISA
Module       Private          Private          Unknown
Function     Private          Private          Unknown
Name         Evolving         Evolving         ISA
Arguments    Evolving         Evolving         ISA

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.

11.10 fasttrap Provider

The fasttrap provider performs dynamic instrumentation of arbitrary instructions in user-space threads. Unlike most other DTrace providers, the fasttrap provider is not designed for tracing system activity. Rather, this provider is intended as a way for DTrace consumers to inject information into the DTrace framework by activating the fasttrap probe.

For more information about enabling statically defined probes in user-space programs, see Chapter 13, Statically Defined Tracing of User Applications.

11.10.1 fasttrap Probes

The fasttrap provider makes available a single probe that fires whenever a user-level process makes a certain DTrace call into the kernel. The DTrace call to activate the probe is not available.

11.10.2 fasttrap Stability

The fasttrap provider uses DTrace's stability mechanism to describe its stabilities. These values are listed in the following table.

Element      Name Stability   Data Stability   Dependency Class
---------    --------------   --------------   ----------------
Provider     Evolving         Evolving         ISA
Module       Private          Private          Unknown
Function     Private          Private          Unknown
Name         Evolving         Evolving         ISA
Arguments    Private          Private          Unknown

For more information about the stability mechanism, see Chapter 16, DTrace Stability Features.