11 DTrace Providers
WARNING:
Oracle Linux 7 is now in Extended Support. See Oracle Linux Extended Support and Oracle Open Source Support Policies for more information.
Migrate applications and data to Oracle Linux 8 or Oracle Linux 9 as soon as possible.
For more information about DTrace, see Oracle Linux: DTrace Release Notes and Oracle Linux: Using DTrace for System Tracing.
This chapter describes some of the existing DTrace providers. Note
that the list of providers discussed in this chapter is not
exhaustive. To display the providers that are available on your
system, use the dtrace -l command. Detailed
information about translators for important data structures can be
found in
/usr/lib64/dtrace/version/*.d
files.
dtrace Provider
The dtrace provider includes several probes
that are related to DTrace itself. You can use these probes to
initialize state before tracing begins, to process state after
tracing has completed, and to handle unexpected execution errors
in other probes.
BEGIN Probe
The BEGIN probe fires before any other probe.
No other probe fires until all BEGIN clauses
have completed. This probe can be used to initialize any state
that is needed in other probes. The following example shows how
to use the BEGIN probe to initialize an
associative array to map between mmap()
protection bits and a textual representation:
BEGIN
{
  prot[0] = "---";
  prot[1] = "r--";
  prot[2] = "-w-";
  prot[3] = "rw-";
  prot[4] = "--x";
  prot[5] = "r-x";
  prot[6] = "-wx";
  prot[7] = "rwx";
}
syscall::mmap:entry
{
  printf("mmap with prot = %s", prot[arg2 & 0x7]);
}
The BEGIN probe fires in an unspecified
context, which means the output of stack or
ustack, and the value of context-specific
variables such as execname, are all
arbitrary. These values should not be relied upon or interpreted
to infer any meaningful information. No arguments are defined
for the BEGIN probe.
END Probe
The END probe fires after all other probes.
This probe will not fire until all other probe clauses have
completed. This probe can be used to process state that has been
gathered or to format the output. The printa
action is therefore often used in the END
probe. The BEGIN and END
probes can be used together to measure the total time that is
spent tracing, for example:
BEGIN
{
  start = timestamp;
}
/*
 * ... other tracing actions...
 */
END
{
  printf("total time: %d secs", (timestamp - start) / 1000000000);
}
See Data Normalization and
printa Action for other common uses of the
END probe.
As with the BEGIN probe, no arguments are
defined for the END probe. The context in
which the END probe fires is arbitrary and
should not be depended upon.
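As a further illustration of formatting gathered state in the END probe, the following minimal sketch counts system calls by name while tracing runs and prints only the ten most frequent ones when tracing stops. The aggregation name @calls and the limit of ten entries are arbitrary choices made for this illustration.

```d
#pragma D option quiet

syscall:::entry
{
	/* Count every system call by name while tracing is active. */
	@calls[probefunc] = count();
}

END
{
	/* Keep only the ten most frequent system calls, then print them. */
	trunc(@calls, 10);
	printa("%-24s %@8d\n", @calls);
}
```

Stop tracing with Ctrl-C to make the END probe fire and print the summary.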
When tracing with the bufpolicy option set to
fill, adequate space is reserved to
accommodate any records that are traced in the
END probe. See
fill Policy and END Probes for details.
Note:
The exit action causes tracing to stop and
the END probe to fire. However, there is
some delay between the invocation of the
exit action and when the
END probe fires. During this delay, no
probes will fire. After a probe invokes the
exit action, the END
probe is not fired until the DTrace consumer determines that
exit has been called and stops tracing. The
rate at which the exit status is checked can be set by using the
statusrate option. For more information,
see Options and Tunables.
ERROR Probe
The ERROR probe fires when a runtime error
occurs during the execution of a clause for a DTrace probe. As
shown in the following example, if a clause attempts to
dereference a NULL pointer, the
ERROR probe fires. Save it in a file named
error.d:
BEGIN
{
  *(char *)NULL;
}
ERROR
{
  printf("Hit an error!");
}
When you run this program, output similar to the following is displayed:
# dtrace -s error.d
dtrace: script 'error.d' matched 2 probes
CPU     ID                    FUNCTION:NAME
  1      3                           :ERROR Hit an error!
dtrace: error on enabled probe ID 1 (ID 1: dtrace:::BEGIN): invalid address (0x0) in action #1 at DIF offset 16
^C
The previous output indicates that the ERROR
probe fired and that dtrace reported the
error. dtrace has its own enabling of the
ERROR probe so that it can report errors.
Using the ERROR probe, you can create your
own custom error handling.
The arguments to the ERROR probe are
described in the following table.
| Argument | Description |
|---|---|
| arg1 | The enabled probe identifier (EPID) of the probe that caused the error. |
| arg2 | The index of the action that caused the fault. |
| arg3 | The DIF offset into the action or -1 if not applicable. |
| arg4 | The fault type. |
| arg5 | Value that is particular to the fault type. |
The following table describes the various fault types that can
be specified in arg4 and the values that
arg5 can take for each fault type.
| arg4 Value | Description | arg5 Meaning |
|---|---|---|
| DTRACEFLT_UNKNOWN | Unknown fault type | None |
| DTRACEFLT_BADADDR | Access to unmapped or invalid address | Address accessed |
| DTRACEFLT_BADALIGN | Unaligned memory access | Address accessed |
| DTRACEFLT_ILLOP | Illegal or invalid operation | None |
| DTRACEFLT_DIVZERO | Integer divide by zero | None |
| DTRACEFLT_NOSCRATCH | Insufficient scratch memory to satisfy scratch allocation | None |
| DTRACEFLT_KPRIV | Attempt to access a kernel address or property without sufficient privileges | Address accessed or 0 if not applicable |
| DTRACEFLT_UPRIV | Attempt to access a user address or property without sufficient privileges | Address accessed or 0 if not applicable |
| DTRACEFLT_TUPOFLOW | DTrace internal parameter stack overflow | None |
| DTRACEFLT_BADSTACK | Invalid user process stack | Address of invalid stack pointer |
If the actions that are taken in the ERROR
probe cause an error, that error is silently dropped. The
ERROR probe is not recursively invoked.
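The arguments described above can be used to build a custom error handler. The following minimal sketch reuses the deliberate NULL dereference from error.d and prints each ERROR argument. The output format is an arbitrary choice for this illustration, and the fault type is printed as a raw number rather than translated to a name.

```d
#pragma D option quiet

BEGIN
{
	/* Deliberately dereference a NULL pointer to raise a fault. */
	*(char *)NULL;
}

ERROR
{
	/* arg1 through arg5 are the ERROR probe arguments described above. */
	printf("EPID %d, action %d, DIF offset %d, fault type %d, value 0x%x\n",
	    arg1, arg2, arg3, arg4, arg5);
	exit(1);
}
```

Because dtrace has its own enabling of the ERROR probe, its standard error report is still expected to appear alongside the custom message.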
dtrace Stability
The dtrace provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Stable | Stable | Common |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Stable | Stable | Common |
| Arguments | Stable | Stable | Common |
For more information about the stability mechanism, see DTrace Stability Features.
profile Provider
The profile provider includes probes that are
associated with an interrupt that fires at some regular, specified
time interval. Such probes are not associated with any particular
point of execution, but rather with the asynchronous interrupt
event. You can use these probes to sample some aspect of the
system state and then use the samples to infer system behavior. If
the sampling rate is high or the sampling time is long, an
accurate inference is possible. Using DTrace actions, you can use
the profile provider to sample practically any
aspect of the system. For example, you could sample the state of
the current thread, the state of the CPU, or the current machine
instruction.
profile-n Probes
The profile-n probes fire at a fixed interval, at a high interrupt
level on all active CPUs. By default, n is a frequency that is
expressed as a rate of firing per second, but the value can also
take an optional suffix, as shown in Table 11-1, which specifies
either a time interval or a frequency. The following table
describes the valid suffixes for a profile-n or tick-n probe.
Table 11-1 Valid Time Suffixes
| Suffix | Time Units |
|---|---|
| nsec or ns | nanoseconds |
| usec or us | microseconds |
| msec or ms | milliseconds |
| sec or s | seconds |
| min or m | minutes |
| hour or h | hours |
| day or d | days |
| hz | hertz (frequency expressed as rate per second) |
tick-n Probes
The tick-n probes fire at fixed intervals, at a high interrupt level
on only one CPU per interval. Unlike profile-n probes, which fire on
every CPU, tick-n probes fire on a single CPU, and the CPU on which
they fire can change over time. By default, n is a frequency
expressed as a rate of firing per second, but the value can also
take an optional suffix, as shown in Table 11-1, which specifies
either a time interval or a frequency.
The tick-n probes have several uses, such as providing some periodic
output or taking a periodic action.
Note:
By default, the highest supported tick frequency is 5000 Hz
(tick-5000).
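As an illustration of periodic output with tick-n probes, the following minimal sketch prints the system-wide system call rate once per second and stops after ten seconds. The aggregation name @rate and the ten-second duration are arbitrary choices for this example.

```d
#pragma D option quiet

syscall:::entry
{
	/* Count all system calls seen since the last report. */
	@rate = count();
}

tick-1sec
{
	/* Print the count for the last second, then reset it to zero. */
	printa("%@d syscalls/sec\n", @rate);
	clear(@rate);
}

tick-10sec
{
	exit(0);
}
```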
profile Probe Arguments
The following table describes the arguments for the
profile probes.
Table 11-2 profile Probe Arguments
| Probe | arg0 | arg1 | arg2 |
|---|---|---|---|
| profile-n | pc | upc | nsecs |
| tick-n | pc | upc | — |
The arguments are as follows:
- pc: kernel program counter
- upc: user-space program counter
- nsecs: elapsed number of nanoseconds
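The following minimal sketch shows one way the pc and upc arguments might be used to split samples between kernel and user-space execution. It assumes that the argument that does not apply to a given sample is zero, as on other DTrace implementations. The 997 Hz rate (chosen to avoid sampling in lockstep with periodic activity) and the aggregation names are arbitrary choices for this illustration.

```d
#pragma D option quiet

profile-997
/arg0 != 0/
{
	/* Non-zero pc: assume the CPU was executing kernel code. */
	@kernel[execname] = count();
}

profile-997
/arg1 != 0/
{
	/* Non-zero upc: assume the CPU was executing user-space code. */
	@user[execname] = count();
}

tick-5sec
{
	printf("kernel samples:\n");
	printa("  %-20s %@d\n", @kernel);
	printf("user samples:\n");
	printa("  %-20s %@d\n", @user);
	exit(0);
}
```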
profile Probe Creation
Unlike other providers, the profile provider
creates probes dynamically on an as-needed basis. Thus, the
desired probe might not appear in a listing of all probes, for
example, when using the dtrace -l -P profile
command, but the probe is created when it is explicitly enabled.
A time interval that is too short causes the machine to
continuously field time-based interrupts and denies service on
the machine. The profile provider silently
refuses to create a probe that would result in an interval of
less than two hundred microseconds.
profile Stability
The profile provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | Common |
| Module | Unstable | Unstable | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | Common |
| Arguments | Evolving | Evolving | Common |
For more information, see DTrace Stability Features.
fbt Provider
The fbt (Function Boundary Tracing) provider
includes probes that are associated with the entry to and return
from most functions in the Oracle Linux kernel. Therefore, there could
well be tens of thousands of fbt probes.
To confirm that the fbt provider is available
on your processor's architecture, you should be able to load the
module that provides fbt instrumentation and
successfully list several probes. Note that this process could
take several seconds due to the large number of such probes. For
example, consider the following command, which is executed as
root:
# dtrace -l -P fbt | wc -l
dtrace: failed to match fbt:::: No probe matches description
1
# modprobe fbt
# dtrace -l -P fbt | wc -l
88958
In the previous example, the first dtrace
command automatically loads modules that are listed in
/etc/dtrace-modules, but also confirms that
fbt was not among them. After
fbt is loaded manually, many
fbt probes appear. For more information, see
Module Loading and fbt.
Like other DTrace providers, Function Boundary Tracing (FBT) has no probe effect when not explicitly enabled. When enabled, FBT only induces a probe effect in probed functions. Although the FBT implementation is highly specific to the instruction set architecture, FBT has been implemented on both x86 and 64-bit Arm platforms. On each instruction set, a small number of leaf functions, which do not call other functions and are highly optimized by the compiler, cannot be instrumented by FBT. Probes for these functions are not present in DTrace.
An effective use of FBT probes requires knowledge of the operating
system implementation. It is therefore recommended that you use
FBT only when developing kernel software or when other providers
are not sufficient. You can use other DTrace providers such as
syscall, sched,
proc, and io to answer most
system analysis questions without requiring operating system
implementation knowledge.
fbt Probes
FBT provides a probe at the entry and return of most functions
in the kernel, named entry and
return, respectively. All FBT probes have a
function name and module name.
fbt Probe Arguments
The arguments to entry probes are the same as
the arguments to the corresponding operating system kernel
function. These arguments can be accessed as
int64_t values by using the
arg0, arg1,
arg2, ... variables.
If the function has a return value, the return value is stored
in arg1 of the return
probe. If a function does not have a return value,
arg1 is not defined.
While a given function only has a single point of entry, it
might have many different points where it returns to its caller.
FBT collects a function's multiple return sites into a single
return probe. If you want to know the exact
return path, you can examine the return probe
arg0 value, which indicates the offset in
bytes of the returning instruction in the function text.
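The following minimal sketch illustrates the entry and return arguments described above. It assumes that the kernel function vfs_read exists on the running kernel and is instrumentable by FBT, which might not hold for every kernel version. The thread-local variable self->ts is a name chosen for this example.

```d
fbt::vfs_read:entry
{
	/* Remember when this thread entered the function. */
	self->ts = timestamp;
}

fbt::vfs_read:return
/self->ts/
{
	/* arg0: offset of the returning instruction; arg1: return value. */
	printf("%s: returned %d at offset 0x%x after %d ns\n",
	    execname, arg1, arg0, timestamp - self->ts);
	self->ts = 0;
}
```

Grouping on arg0 in an aggregation instead of printing each event would show which return paths are taken most often.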
fbt Examples
You can easily use the fbt provider to
explore the kernel's implementation. The following example
script records the first gettimeofday call
from any clock process and then follows the
subsequent code path through the kernel. Type the following D
source code and save it in a file named
xgettimeofday.d:
/*
 * To make the output more readable, indent every function entry
 * and unindent every function return. This is done by setting the
 * "flowindent" option.
 */
#pragma D option flowindent
syscall::gettimeofday:entry
/execname == "clock" && guard++ == 0/
{
  self->traceme = 1;
  printf("start");
}
fbt:::
/self->traceme/
{}
syscall::gettimeofday:return
/self->traceme/
{
  self->traceme = 0;
  exit(0);
}
Running this script results in output that is similar to the following:
# dtrace -s ./xgettimeofday.d
dtrace: script './xgettimeofday.d' matched 92115 probes
CPU FUNCTION
  0  => gettimeofday                          start
  0    -> SyS_gettimeofday
  0      -> getnstimeofday64
  0        -> __getnstimeofday64
  0        <- __getnstimeofday64
  0      <- getnstimeofday64
  0      -> _copy_to_user
  0      <- _copy_to_user
  0    <- SyS_gettimeofday
  0  <= gettimeofday
The previous output shows the internal kernel functions that are
called when the gettimeofday system call is
made.
Module Loading and fbt
While the Oracle Linux kernel can dynamically load and unload kernel
modules, for fbt probes, the
fbt kernel module must be loaded to support
the instrumentation. For more information about loading kernel
modules, see the note in Getting Started With DTrace. If
fbt is not listed in
/etc/dtrace-modules, or if the
dtrace -l command lists no
fbt probes, use the following command:
# modprobe fbt
Conversely, you can unload the fbt
instrumentation with the following command:
# modprobe -r fbt
When the fbt module is loaded, FBT
automatically provides probes to instrument all other loaded
modules, including any new modules that are dynamically loaded.
If a loaded module has no enabled FBT probes, the module might
be unloaded and the corresponding probes are destroyed as the
module is unloaded. If a loaded module has enabled FBT probes,
the module is considered busy and cannot be unloaded.
fbt Stability
The fbt provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | Common |
| Module | Private | Private | Unknown |
| Function | Private | Private | ISA |
| Name | Evolving | Evolving | Common |
| Arguments | Private | Private | ISA |
For more information, see DTrace Stability Features.
syscall Provider
The syscall provider makes available a probe at
the entry to and return from every system call in the system.
Because system calls are the primary interface between user-level
applications and the operating system kernel, the
syscall provider can offer tremendous insight
into application behavior with respect to the system.
syscall Probes
syscall provides a pair of probes for each
system call: an entry probe that fires before
the system call is entered, and a return
probe that fires after the system call has completed, but before
control has been transferred back to user-level. For all
syscall probes, the function name is set as
the name of the instrumented system call.
Often, the system call names that are provided by
syscall correspond to names in the Section 2
manual pages. However, some syscall provider
probes do not directly correspond to any documented system call.
Some common reasons for this discrepancy are described in the
following sections.
System Call Anachronisms
In some cases, the name of the system call, as provided by the
syscall provider, might be a reflection of
an ancient implementation detail.
Subcoded System Calls
Some system calls might be implemented as suboperations of
another system call. For example,
socketcall() is the common kernel entry
point for the socket system calls.
New System Calls
Oracle Linux implements at-suffixed system
interfaces as individual system calls, for example:
- faccessat()
- fchmodat()
- fchownat()
- fstatat64()
- futimensat()
- linkat()
- mkdirat()
- mknodat()
- name_to_handle_at()
- newfstatat()
- open_by_handle_at()
- openat()
- readlinkat()
- renameat()
- symlinkat()
- unlinkat()
- utimensat()
These system calls implement a superset of the functionality
of their old non-at-suffixed counterparts.
They take an additional first argument that is either an open
directory file descriptor, in which case the operation on a
relative pathname is taken relative to the specified
directory, or the reserved value
AT_FDCWD, in which case the operation takes
place relative to the current working directory.
Replaced System Calls
In Oracle Linux, the following older system calls have been replaced
and are not called by the newer glibc
interfaces. These legacy interfaces remain, but are
reimplemented, not as system calls in their own right, but as
calls to the newer system calls. The following table lists the
legacy call and its new call equivalent.
| Legacy System Call | New System Call |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Large File System Calls
A 32-bit program that supports large
files that exceed two gigabytes in size must be
able to process 64-bit file offsets. Because large files
require the use of large offsets, large files are manipulated
through a parallel set of system interfaces. The following
table lists some of the syscall probes for
the large file system call interfaces.
Table 11-3 syscall Large File Probes
| Large File syscall Probe | System Call |
|---|---|
|
|
|
|
|
|
|
|
|
syscall Probe Arguments
For entry probes, the arguments, arg0 ... argn, are the
arguments to the system call. For return probes, both
arg0 and arg1 contain the
return value. A non-zero value in the D variable
errno indicates a system call failure.
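The following minimal sketch combines entry arguments with the errno variable at return to report failed openat() calls. It assumes that the pathname is the second argument (arg1) of openat(), as in the Linux system call signature. The thread-local variable self->path is a name chosen for this example.

```d
#pragma D option quiet

syscall::openat:entry
{
	/* arg1 is the pathname in openat(dirfd, pathname, flags, ...). */
	self->path = copyinstr(arg1);
}

syscall::openat:return
/self->path != NULL && errno != 0/
{
	printf("%s: openat(\"%s\") failed with errno %d\n",
	    execname, self->path, errno);
}

syscall::openat:return
/self->path != NULL/
{
	/* Release the thread-local variable whether or not the call failed. */
	self->path = NULL;
}
```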
syscall Stability
The syscall provider uses DTrace's stability
mechanism to describe its stabilities. These stability values
are listed in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | Common |
| Module | Private | Private | Unknown |
| Function | Private | Private | Instruction set architecture (ISA) |
| Name | Evolving | Evolving | Common |
| Arguments | Private | Private | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
sdt Provider
The Statically Defined Tracing (SDT) provider
(sdt) creates probes at sites that a software
programmer has formally designated. The SDT mechanism enables
programmers to consciously choose locations of interest to users
of DTrace and to convey some semantic knowledge about each
location through the probe name.
Importantly, SDT can act as a metaprovider by registering probes
so that they appear to come from other providers, such as
io, proc, and
sched, which do not have dedicated modules of
their own. Thus, the SDT provider is chiefly of interest only to
developers of new providers. Most users will access SDT only
indirectly by using other providers.
Note:
Because the sdt probes that are defined for
the Oracle Linux kernel are likely to change over time, they are not
listed here. Both the name stability and the data stability of
the probes are Private, which reflects the kernel's
implementation and should not be interpreted as a commitment to
preserve these interfaces. For more information, see
DTrace Stability Features.
Creating sdt Probes
If you are a device driver developer, you might be interested in
creating your own sdt probes for your Oracle Linux
driver. The disabled probe effect of SDT is essentially the cost
of several no-operation machine instructions. You are therefore
encouraged to add sdt probes to your device
drivers as needed. Unless these probes negatively affect
performance, you can leave them in your shipping code. See
Statically Defined Tracing of Kernel Modules.
DTrace also provides a mechanism for application developers to define user-space static probes. See Statically Defined Tracing of User Applications.
Declaring Probes
The sdt probes are declared by using the
DTRACE_PROBE macro from
<linux/sdt.h>.
The module name and function name of an SDT-based probe correspond to the kernel module and function of the probe, respectively. DTrace includes the kernel module name and function name as part of the tuple identifying a probe, so you do not need to include this information in the probe name to prevent name space collisions. Use the dtrace -l -m module command to list the probes that your driver module has installed and the full names that are seen by DTrace users.
The name of the probe depends on the name that is provided in
the DTRACE_PROBE macro. If the name does
not contain two consecutive underscores
(__), the name of the probe is as written
in the macro. If the name contains two consecutive
underscores, the probe name converts the consecutive
underscores to a single dash (-). For
example, if a DTRACE_PROBE macro specifies
transaction__start, the SDT probe is named
transaction-start. This substitution
enables C code to provide macro names that are not valid C
identifiers without specifying a string.
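From the D side, a probe declared with a double underscore is enabled by using the dash form of its name. The following minimal sketch assumes a hypothetical driver that declares a probe as transaction__start; no such probe is known to exist in the Oracle Linux kernel, so the script only matches once a module provides it.

```d
/* Hypothetical probe: declared in C as transaction__start. */
sdt:::transaction-start
{
	/* Count firings by the module and function that contain the probe. */
	@[probemod, probefunc, probename] = count();
}
```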
SDT can also act as a metaprovider by registering probes so
that they appear to come from other providers, such as
io, proc, and
sched, which do not have dedicated modules
of their own. For example, kernel/exit.c
contains calls to the DTRACE_PROC macro,
which are defined as follows in
<linux/sdt.h>:
# define DTRACE_PROC(name) \
DTRACE_PROBE(__proc_##name);
Probes that use such macros appear to come from a provider
other than sdt. The leading double
underscore, provider name, and trailing underscore in the
name argument are used to match the
provider and are not included in the probe name. Note that the
functionality for creating probes for providers other than
those that are hard-coded into DTrace is not currently
available.
sdt Probe Arguments
The arguments for each sdt probe are the
arguments that are specified in the kernel source code in the
corresponding DTRACE_PROBE macro reference.
When declaring your sdt probes, you can
minimize their disabled probe effect by not dereferencing
pointers and by not loading from global variables in the probe
arguments. Both pointer dereferencing and global variable
loading may be done safely in D actions that enable probes, so
DTrace users can request these actions only when they are
needed.
sdt Stability
The sdt provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Private | Private | ISA |
| Arguments | Private | Private | ISA |
For more information about the stability mechanism, refer to DTrace Stability Features.
pid Provider
The pid provider enables tracing of any user
process, as specified by its pid.
The pid provider enables tracing of function entry
and return in user programs, just as the fbt
provider does for the kernel. Most
of the examples in this guide that use the fbt
provider to trace kernel function calls can be modified slightly
to apply to user processes.
The pid provider also enables tracing of any
instruction, as specified by an absolute address or function
offset.
The pid provider has no probe effect when
probes are not enabled. When probes are enabled, the probes only
induce probe effect on those processes that are traced.
Note:
When the compiler inlines a function, the pid
provider's probe does not fire. Use one of the following methods
to compile a particular C function so that it will not be
inlined.
- Sun Studio: #pragma no_inline (funcname[, funcname])
- gcc: funcname __attribute__ ((noinline))
Consult your compiler documentation for updates.
Naming pid Probes
The pid provider actually defines a class of
providers. Each process can potentially have its own associated
pid provider. For example, a process with ID
123 would be traced by using the pid123
provider.
The module portion of the probe description refers to an object
loaded in the corresponding process's address space. To see
which objects will be loaded for my_exec or
are loaded for process ID 123, use the following commands:
# ldd my_exec
...
# pldd 123
123: /tmp/my_exec
linux-vdso.so.1
/lib64/libc.so.6
/lib64/ld-linux-x86-64.so.2
In the probe description, you name the object by the name of the
file, not by its full path name. You can also omit the
.6 or .so.6 suffix. All of
the following examples name the same probe:
pid123:libc.so.6:strcpy:entry
pid123:libc.so:strcpy:entry
pid123:libc:strcpy:entry
The first example is the actual name of the probe. The other examples are convenient aliases that are replaced with the full load object name internally.
For the load object of the executable, you can use the
a.out alias. The following two probe
descriptions name the same probe:
pid123:my_exec:main:return
pid123:a.out:main:return
The function field of the probe description names a function in
the module. A user application binary might have several names
for the same function. For example,
__gnu_get_libc_version might be an alternate
name for the function gnu_get_libc_version in
libc.so.6. DTrace chooses one canonical name
for such a function and uses that name internally.
The following example illustrates how DTrace internally remaps module and function names to a canonical form:
# dtrace -q -n 'pid123:libc:__gnu_get_libc_version:
{ printf("%s\n%s\n", probemod, probefunc)}'
libc.so.6
gnu_get_libc_version
For examples of how to use the pid provider
effectively, see User Process Tracing.
pid Probe Arguments
An entry probe fires when the traced function
is invoked. The arguments to entry probes are the values of the
arguments to the traced function.
A return probe fires when the traced function
returns or makes a tail call to another function.
The arg1 probe argument holds the function
return value.
An offset probe fires whenever execution
reaches the instruction at the specified offset in the function.
For example, to trace the instruction at the address 4 bytes
into function main, you can use
pid123:a.out:main:4. The arguments for offset
probes are undefined. The uregs[] array will
help you when examining the process state at these probe sites.
See uregs[] Array.
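The following minimal sketch illustrates entry and return arguments with the pid provider by tracing malloc() in a target process. It assumes that the process resolves malloc from libc and uses the $target macro variable, which is set when dtrace is started with the -p or -c option. The thread-local variable self->size is a name chosen for this example.

```d
#pragma D option quiet

pid$target:libc:malloc:entry
{
	/* arg0 is the first argument to malloc(): the requested size. */
	self->size = arg0;
}

pid$target:libc:malloc:return
/self->size/
{
	/* arg1 holds the function return value: the allocated address. */
	printf("malloc(%d) = 0x%x\n", self->size, arg1);
	self->size = 0;
}
```

If the script is saved as malloc.d (a hypothetical file name), it could be run against process 123 with dtrace -s malloc.d -p 123.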
pid Stability
The pid provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Private | Private | Unknown |
For more information about the stability mechanism, see DTrace Stability Features.
proc Provider
The proc provider makes available the probes
that pertain to the following activities: process creation and
termination, LWP creation and termination, execution of new
program images, and signal sending and handling.
proc Probes
The probes for the proc provider are listed
in the following table.
Table 11-4 proc Probes
| Probe | Description |
|---|---|
|
|
Fires when a process (or process thread) is created
using |
|
|
Fires whenever a process loads a new process image
using a variant of the |
|
|
Fires when an |
|
|
Fires when an |
|
|
Fires when the current process is exiting. The reason
for exit, which is expressed as one of the
|
|
|
Fires when a process thread is created, the latter
typically as a result of
|
|
|
Fires when a process or process thread is exiting, due
either to a signal or to an explicit call to
|
|
|
Fires within the context of a newly created process or
process thread. The |
|
|
Fires when a pending signal is cleared
because the target thread was waiting for the signal
in |
|
|
Fires when a signal is sent to a single-threaded
process and the signal is both unblocked and ignored
by the process. Under these conditions, the signal is
discarded on generation. The
|
|
|
Fires immediately before a thread handles a signal.
The |
|
|
Fires when a signal is sent to a process or to a
thread created by a process. The
|
|
|
Fires in the context of a newly created process. The
|
Note:
In Linux, there is no fundamental difference between a process and a thread that a process creates. The threads of a process are set up so that they can share resources, but each thread has its own entry in the process table with its own process ID.
proc Probe Arguments
The following table lists the argument types for the
proc probes. See Table 11-4
for a description of the arguments.
Table 11-5 proc Probe Arguments
| Probe | args[0] | args[1] | args[2] |
|---|---|---|---|
|
|
|
— |
— |
|
|
|
— |
— |
|
|
|
— |
— |
|
|
— |
— |
— |
|
|
|
— |
— |
|
|
|
|
— |
|
|
— |
— |
— |
|
|
— |
— |
— |
|
|
|
— |
— |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
— |
— |
— |
lwpsinfo_t
Several proc probes have arguments of type
lwpsinfo_t. Detailed information about this
data structure can be found in
/usr/lib64/dtrace/version/procfs.d.
The definition of the lwpsinfo_t structure,
as available to DTrace consumers, is as follows:
typedef struct lwpsinfo {
int pr_flag; /* flags */
id_t pr_lwpid; /* thread id */
uintptr_t pr_addr; /* internal address of thread */
uintptr_t pr_wchan; /* wait addr for sleeping lwp (NULL on Linux) */
char pr_stype; /* sync event type (0 on Linux) */
char pr_state; /* numeric thread state */
char pr_sname; /* printable character for pr_state */
int pr_pri; /* priority, high value = high priority */
char pr_name[PRCLSZ]; /* scheduling class name */
processorid_t pr_onpro; /* processor which last ran this thread */
} lwpsinfo_t;
Note:
Lightweight processes do not exist in Linux. Rather, in
Oracle Linux, processes and threads are represented by process
descriptors of type struct task_struct in
the task list. DTrace translates the members of
lwpsinfo_t from the
task_struct for the Oracle Linux process.
The pr_flag is set to 1
if the thread is stopped. Otherwise, it is set to
0.
In Oracle Linux, the pr_stype field is unsupported,
and hence is always 0.
The following table describes the values that
pr_state can take, as well as the
corresponding character values for
pr_sname.
Table 11-6 pr_state Values
| pr_state Value | pr_sname Value | Description |
|---|---|---|
|
|
R
|
The thread is runnable or is currently running on a
CPU. The
The equivalent Oracle Linux task state is
|
|
|
S
|
The thread is sleeping. The
The equivalent Oracle Linux task state is
|
|
|
T
|
The thread is stopped, either due to an explicit
The equivalent Oracle Linux task state is
|
|
|
W
|
The thread is waiting on wait queue. The
The equivalent Oracle Linux task state is
|
|
|
Z
|
The thread is a zombie.
The equivalent Oracle Linux task state is
|
psinfo_t
Several proc probes have an argument of
type psinfo_t. Detailed information about
this data structure can be found in
/usr/lib64/dtrace/version/procfs.d.
The definition of the psinfo_t structure,
as available to DTrace consumers, is as follows:
typedef struct psinfo {
int pr_nlwp; /* not supported */
pid_t pr_pid; /* unique process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgid; /* pid of process group leader */
pid_t pr_sid; /* session id */
uid_t pr_uid; /* real user id */
uid_t pr_euid; /* effective user id */
uid_t pr_gid; /* real group id */
uid_t pr_egid; /* effective group id */
uintptr_t pr_addr; /* address of process */
size_t pr_size; /* not supported */
size_t pr_rssize; /* not supported */
struct tty_struct *pr_ttydev; /* controlling tty (or -1) */
ushort_t pr_pctcpu; /* not supported */
ushort_t pr_pctmem; /* not supported */
timestruc_t pr_start; /* not supported */
timestruc_t pr_time; /* not supported */
timestruc_t pr_ctime; /* not supported */
char pr_fname[16]; /* name of exec'ed file */
char pr_psargs[80]; /* initial chars of arg list */
int pr_wstat; /* not supported */
int pr_argc; /* initial argument count */
uintptr_t pr_argv; /* address of initial arg vector */
uintptr_t pr_envp; /* address of initial env vector */
char pr_dmodel; /* data model */
taskid_t pr_taskid; /* not supported */
projid_t pr_projid; /* not supported */
int pr_nzomb; /* not supported */
poolid_t pr_poolid; /* not supported */
zoneid_t pr_zoneid; /* not supported */
id_t pr_contract; /* not supported */
lwpsinfo_t pr_lwp; /* not supported */
} psinfo_t;
Note:
Lightweight processes do not exist in Linux. In Oracle Linux,
processes and threads are represented by process descriptors
of type struct task_struct in the task
list. DTrace translates the members of
psinfo_t from the
task_struct for the Oracle Linux process.
pr_dmodel is set to either
PR_MODEL_ILP32, denoting a 32-bit
process, or PR_MODEL_LP64, denoting a
64-bit process.
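The following sketch shows one way that the supported psinfo_t members might be read. It assumes that the curpsinfo built-in variable (the psinfo_t counterpart of curlwpsinfo) is available in the proc:::exec-success probe, and it prints only members that are not marked as unsupported in the structure definition above. Treat it as an illustration rather than a definitive script.

```d
#pragma D option quiet

proc:::exec-success
{
    /* Report the new program image, its parent, its owner, and its data model. */
    printf("%d (ppid %d, uid %d, %s): %s\n",
        curpsinfo->pr_pid, curpsinfo->pr_ppid, curpsinfo->pr_uid,
        curpsinfo->pr_dmodel == PR_MODEL_LP64 ? "64-bit" : "32-bit",
        curpsinfo->pr_psargs);
}
```

Because pr_psargs holds only the initial characters of the argument list, long command lines appear truncated.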
proc Examples
The following examples illustrate the use of the probes that are
published by the proc provider.
exec
The following example shows how you can use the
exec probe to easily determine which
programs are being executed, and by whom. Type the following D
source code and save it in a file named
whoexec.d:
#pragma D option quiet

proc:::exec
{
    self->parent = execname;
}

proc:::exec-success
/self->parent != NULL/
{
    @[self->parent, execname] = count();
    self->parent = NULL;
}

proc:::exec-failure
/self->parent != NULL/
{
    self->parent = NULL;
}

END
{
    printf("%-20s %-20s %s\n", "WHO", "WHAT", "COUNT");
    printa("%-20s %-20s %@d\n", @);
}
Running the example script for a short period of time results in output similar to the following:
# dtrace -s ./whoexec.d
^C
WHO                  WHAT                 COUNT
abrtd                abrt-handle-eve      1
firefox              basename             1
firefox              mkdir                1
firefox              mozilla-plugin-      1
firefox              mozilla-xremote      1
firefox              run-mozilla.sh       1
firefox              uname                1
gnome-panel          firefox              1
kworker/u:1          modprobe             1
modprobe             modprobe.ksplic      1
mozilla-plugin-      plugin-config        1
mozilla-plugin-      uname                1
nice                 sosreport            1
run-mozilla.sh       basename             1
run-mozilla.sh       dirname              1
run-mozilla.sh       firefox              1
run-mozilla.sh       uname                1
sh                   abrt-action-sav      1
sh                   blkid                1
sh                   brctl                1
sh                   cut                  1
...
start and exit Probes
If you want to know how long programs are running, from
creation to termination, you can enable the
start and exit probes,
as shown in the following example. Save it in a file named
progtime.d:
proc:::start
{
    self->start = timestamp;
}

proc:::exit
/self->start/
{
    @[execname] = quantize(timestamp - self->start);
    self->start = 0;
}
Running the example script on a build server for several seconds results in output similar to the following:
# dtrace -s ./progtime.d
dtrace: script './progtime.d' matched 2 probes
^C
...
cc
value ------------- Distribution ------------- count
33554432 | 0
67108864 |@@@ 3
134217728 |@ 1
268435456 | 0
536870912 |@@@@ 4
1073741824 |@@@@@@@@@@@@@@ 13
2147483648 |@@@@@@@@@@@@ 11
4294967296 |@@@ 3
8589934592 | 0
sh
value ------------- Distribution ------------- count
262144 | 0
524288 |@ 5
1048576 |@@@@@@@ 29
2097152 | 0
4194304 | 0
8388608 |@@@ 12
16777216 |@@ 9
33554432 |@@ 9
67108864 |@@ 8
134217728 |@ 7
268435456 |@@@@@ 20
536870912 |@@@@@@ 26
1073741824 |@@@ 14
2147483648 |@@ 11
4294967296 | 3
8589934592 | 1
17179869184 | 0
...
signal-send
The following example shows how you can use the
signal-send probe to determine the sending
and receiving processes associated with any signal. Type the
following D source code and save it in a file named
sig.d:
#pragma D option quiet

proc:::signal-send
{
    @[execname, stringof(args[1]->pr_fname), args[2]] = count();
}

END
{
    printf("%20s %20s %12s %s\n",
        "SENDER", "RECIPIENT", "SIG", "COUNT");
    printa("%20s %20s %12d %@d\n", @);
}
Running this script results in output similar to the following:
# dtrace -s sig.d
^C
SENDER RECIPIENT SIG COUNT
gnome-panel Xorg 29 1
kworker/0:2 dtrace 2 1
Xorg Xorg 29 3
java Xorg 29 6
firefox Xorg 29 14
kworker/0:0 Xorg 29 1135
proc Stability
The proc provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
sched Provider
The sched provider makes available probes that
are related to CPU scheduling. Because CPUs are the one resource
that all threads must consume, the sched
provider is very useful for understanding systemic behavior. For
example, using the sched provider, you can
understand when and why threads sleep, run, change priority, or
wake other threads.
sched Probes
The following table describes the probes for the
sched provider.
Table 11-7 sched Probes
| Probe | Description |
|---|---|
| … | Fires whenever a thread's priority is about to be changed. The … |
| dequeue | Fires immediately before a runnable thread is dequeued from a run queue. The … |
| enqueue | Fires immediately before a runnable thread is enqueued to a run queue. The … |
| off-cpu | Fires when the current CPU is about to end execution of a thread. The … |
| on-cpu | Fires when a CPU has just begun execution of a thread. The … |
| preempt | Fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the … |
| remain-cpu | Fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. The … |
| sleep | Fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the … |
| … | Fires when a CPU has been instructed by another CPU to make a scheduling decision, often because a higher-priority thread has become runnable. The … |
| tick | Fires as a part of clock tick-based accounting. In clock tick-based accounting, CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. The … |
| wakeup | Fires immediately before the current thread wakes a thread sleeping on a synchronization object. The … |
sched Probe Arguments
The following table describes the argument types for the
sched probes. See
Table 11-7 for descriptions of the arguments.
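Table 11-8 sched Probe Arguments
| Probe | args[0] | args[1] | args[2] | args[3] |
|---|---|---|---|---|
| … | … | … | … | — |
| dequeue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | — |
| enqueue | lwpsinfo_t * | psinfo_t * | cpuinfo_t * | … |
| off-cpu | lwpsinfo_t * | psinfo_t * | — | — |
| on-cpu | — | — | — | — |
| preempt | — | — | — | — |
| remain-cpu | — | — | — | — |
| sleep | — | — | — | — |
| … | … | … | — | — |
| tick | … | psinfo_t * | — | — |
| wakeup | lwpsinfo_t * | psinfo_t * | — | — |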
cpuinfo_t
The cpuinfo_t structure defines a CPU. Per
the information in Table 11-8, arguments to
both the enqueue and
dequeue probes include a pointer to a
cpuinfo_t. Additionally, the
cpuinfo_t that corresponds to the current
CPU is pointed to by the curcpu variable.
The definition of the cpuinfo_t structure
is as follows:
typedef struct cpuinfo {
processorid_t cpu_id; /* CPU identifier */
psetid_t cpu_pset; /* not supported */
chipid_t cpu_chip; /* chip identifier */
lgrp_id_t cpu_lgrp; /* not supported */
cpuinfo_arch_t *cpu_info; /* CPU information */
} cpuinfo_t;
cpu_id: Is the processor identifier.
cpu_chip: Is the identifier of the physical
chip. Physical chips can contain several CPU cores.
cpu_info: Is a pointer to the
cpuinfo_arch_t structure that is associated
with the CPU.
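As a minimal sketch of how the cpuinfo_t members can be read through the curcpu variable, the following clause counts how often threads are placed on each CPU, grouped by physical chip. The choice of the on-cpu probe is only for illustration; any probe that fires in a normal CPU context would serve.

```d
sched:::on-cpu
{
    /* Key on the chip identifier and the CPU identifier of the current CPU. */
    @[curcpu->cpu_chip, curcpu->cpu_id] = count();
}
```

On a system where several cores share one chip, rows with the same first key belong to the same physical package.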
sched Examples
The following examples show the use of sched
probes.
on-cpu and off-cpu Probes
One common question that you might want answered is which CPUs
are running threads and for how long? The following example
shows how you can use the on-cpu and
off-cpu probes to easily answer this
question on a system-wide basis. Type the following D source
code and save it in a file named where.d:
sched:::on-cpu
{
    self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
    @[cpu] = quantize(timestamp - self->ts);
    self->ts = 0;
}
Running the previous script results in output that is similar to the following:
# dtrace -s ./where.d
dtrace: script './where.d' matched 2 probes
^C
0
value ------------- Distribution ------------- count
2048 | 0
4096 |@@ 37
8192 |@@@@@@@@@@@@@ 212
16384 |@ 30
32768 | 10
65536 |@ 17
131072 | 12
262144 | 9
524288 | 6
1048576 | 5
2097152 | 1
4194304 | 3
8388608 |@@@@ 75
16777216 |@@@@@@@@@@@@ 201
33554432 | 6
67108864 | 0
1
value ------------- Distribution ------------- count
2048 | 0
4096 |@ 6
8192 |@@@@ 23
16384 |@@@ 18
32768 |@@@@ 22
65536 |@@@@ 22
131072 |@ 7
262144 | 5
524288 | 2
1048576 | 3
2097152 |@ 9
4194304 | 4
8388608 |@@@ 18
16777216 |@@@ 19
33554432 |@@@ 16
67108864 |@@@@ 21
134217728 |@@ 14
268435456 | 0
The previous output shows that on CPU 1 threads tend to run for less than 131072 nanoseconds (on the order of 100 microseconds) at a stretch, or for 8388608 to 134217728 nanoseconds (approximately 10 to 100 milliseconds). A noticeable gap between the two clusters of data is shown in the histogram.
You also might be interested in knowing which CPUs are running a particular process.
You can use the on-cpu and
off-cpu probes to answer this question as well.
The following script displays which CPUs run a specified
application over a period of ten seconds. Save it in a file
named whererun.d:
#pragma D option quiet

dtrace:::BEGIN
{
    start = timestamp;
}

sched:::on-cpu
/execname == $$1/
{
    self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
    @[cpu] = sum(timestamp - self->ts);
    self->ts = 0;
}

profile:::tick-1sec
/++x >= 10/
{
    exit(0);
}

dtrace:::END
{
    printf("CPU distribution over %d seconds:\n\n",
        (timestamp - start) / 1000000000);
    printf("CPU microseconds\n--- ------------\n");
    normalize(@, 1000);
    printa("%3d %@d\n", @);
}
Running the previous script on a large mail server and specifying the IMAP daemon results in output that is similar to the following:
# dtrace -s ./whererun.d imapd
CPU distribution of imapd over 10 seconds:

CPU microseconds
--- ------------
 15 10102
 12 16377
 21 25317
 19 25504
 17 35653
 13 41539
 14 46669
 20 57753
 22 70088
 16 115860
 23 127775
 18 160517
Oracle Linux takes into account the amount of time that a thread has
been sleeping when selecting a CPU on which to run the thread,
as a thread that has been sleeping for less time tends not to
migrate. Use the off-cpu and
on-cpu probes to observe this behavior.
Type the following source code and save it in a file named
howlong.d:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
    self->cpu = cpu;
    self->ts = timestamp;
}

sched:::on-cpu
/self->ts/
{
    @[self->cpu == cpu ?
        "sleep time, no CPU migration" : "sleep time, CPU migration"] =
        lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
    self->ts = 0;
    self->cpu = 0;
}
Running the previous script for approximately 30 seconds results in output that is similar to the following:
# dtrace -s ./howlong.d
dtrace: script './howlong.d' matched 2 probes
^C
sleep time, CPU migration
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@ 6838
25 |@@@@@ 4714
50 |@@@ 3108
75 |@ 1304
100 |@ 1557
125 |@ 1425
150 | 894
175 |@ 1526
200 |@@ 2010
225 |@@ 1933
250 |@@ 1982
275 |@@ 2051
300 |@@ 2021
325 |@ 1708
350 |@ 1113
375 | 502
400 | 220
425 | 106
450 | 54
475 | 40
>= 500 |@ 1716
sleep time, no CPU migration
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@ 58413
25 |@@@ 14793
50 |@@ 10050
75 | 3858
100 |@ 6242
125 |@ 6555
150 | 3980
175 |@ 5987
200 |@ 9024
225 |@ 9070
250 |@@ 10745
275 |@@ 11898
300 |@@ 11704
325 |@@ 10846
350 |@ 6962
375 | 3292
400 | 1713
425 | 585
450 | 201
475 | 96
>= 500 | 3946
The previous output reveals that there are many more occurrences of non-migration than migration. Also, when sleep times are longer, migrations are more likely. The distributions are noticeably different in the sub-100 millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision when a certain threshold is exceeded.
enqueue and dequeue Probes
You might want to know on which CPUs processes and threads are
waiting to run. You can use the enqueue
probe along with the dequeue probe to
answer this question. Type the following source code and save
it in a file named qtime.d:
sched:::enqueue
{
    a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] =
        timestamp;
}

sched:::dequeue
/a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]/
{
    @[args[2]->cpu_id] = quantize(timestamp -
        a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]);
    a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = 0;
}
Running the previous script for several seconds results in output that is similar to the following:
# dtrace -s qtime.d
dtrace: script 'qtime.d' matched 16 probes
^C
1
value ------------- Distribution ------------- count
8192 | 0
16384 | 1
32768 |@ 47
65536 |@@@@@@@ 365
131072 |@@@@@@@@@@@@ 572
262144 |@@@@@@@@@@@@ 570
524288 |@@@@@@@ 354
1048576 |@ 57
2097152 | 7
4194304 | 1
8388608 | 1
16777216 | 0
0
value ------------- Distribution ------------- count
8192 | 0
16384 | 6
32768 |@ 49
65536 |@@@@@ 261
131072 |@@@@@@@@@@@@@ 753
262144 |@@@@@@@@@@@@ 704
524288 |@@@@@@@@ 455
1048576 |@ 74
2097152 | 9
4194304 | 2
8388608 | 0
Rather than looking at wait times, you might want to examine
the length of the run queue over time. Using the
enqueue and dequeue
probes, you can set up an associative array to track the queue
length. Type the following source code and save it in a file
named qlen.d:
sched:::enqueue
{
    this->len = qlen[args[2]->cpu_id]++;
    @[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
    qlen[args[2]->cpu_id]--;
}
Running the previous script on a largely idle dual-core processor system for approximately 30 seconds results in output that is similar to the following:
# dtrace -s qlen.d
dtrace: script 'qlen.d' matched 16 probes
^C
1
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8124
1 |@@@@@@ 1558
2 |@ 160
3 | 51
4 | 24
5 | 13
6 | 11
7 | 9
8 | 6
9 | 0
0
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8569
1 |@@@@@@@@@ 2429
2 |@ 292
3 | 25
4 | 8
5 | 5
6 | 4
7 | 4
8 | 1
9 | 0
The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread is enqueued, the run queues were very short (three or fewer threads in length). However, given that the system was largely idle, the exceptional data points at the bottom of each table might be unexpected. For example, why were the run queues as long as 8 runnable threads? To explore this question further, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.
For this type of script, you would enable the
enqueue and dequeue
probes and then use both speculations and associative arrays.
Whenever a thread is enqueued, the script increments the
length of the queue and records the timestamp in an
associative array keyed by the thread. You cannot use a
thread-local variable in this case because a thread might be
enqueued by another thread. The script then checks to see if
the queue length exceeds the maximum, and if so, the script
starts a new speculation, and records the timestamp and the
new maximum. Then, when a thread is dequeued, the script
compares the enqueue timestamp to the
timestamp of the longest length: if the thread was enqueued
before the timestamp of the longest length, the thread was in
the queue when the longest length was recorded. In this case,
the script speculatively traces the thread's information. When
the kernel dequeues the last thread that was enqueued at the
timestamp of the longest length, the script commits the
speculation data. Type the following source code and save it
in a file named whoqueue.d:
#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k

int maxlen;
int spec[int];

sched:::enqueue
{
    this->len = ++qlen[this->cpu = args[2]->cpu_id];
    in[args[0]->pr_addr] = timestamp;
}

sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
    /*
     * There is already a speculation for this CPU. We just set a new
     * record, so we'll discard the old one.
     */
    discard(spec[this->cpu]);
}

sched:::enqueue
/this->len > maxlen/
{
    /*
     * We have a winner. Set the new maximum length and set the timestamp
     * of the longest length.
     */
    maxlen = this->len;
    longtime[this->cpu] = timestamp;
    /*
     * Now start a new speculation, and speculatively trace the length.
     */
    this->spec = spec[this->cpu] = speculation();
    speculate(this->spec);
    printf("Run queue of length %d:\n", this->len);
}

sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
    this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
    speculate(spec[this->cpu]);
    printf(" %d/%d (%s)\n",
        args[1]->pr_pid, args[0]->pr_lwpid,
        stringof(args[1]->pr_fname));
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
    in[args[0]->pr_addr] = 0;
    this->len = --qlen[args[2]->cpu_id];
}

sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
    /*
     * We just processed the last thread that was enqueued at the time
     * of longest length; commit the speculation, which by now contains
     * each thread that was enqueued when the queue was longest.
     */
    commit(spec[this->cpu]);
    spec[this->cpu] = 0;
}
Running the previous script on the same system results in output that is similar to the following:
# dtrace -s whoqueue.d
Run queue of length 1:
 2850/2850 (java)
Run queue of length 2:
 4034/4034 (kworker/0:1)
 16/16 (sync_supers)
Run queue of length 3:
 10/10 (ksoftirqd/1)
 1710/1710 (hald-addon-inpu)
 25350/25350 (dtrace)
Run queue of length 4:
 2852/2852 (java)
 2850/2850 (java)
 1710/1710 (hald-addon-inpu)
 2099/2099 (Xorg)
Run queue of length 5:
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
Run queue of length 9:
 3685/3685 (firefox)
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2852/2852 (java)
 2452/2452 (nautilus)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
 2749/2749 (gnome-terminal)
^C
sleep and wakeup Probes
The following example shows how you might use the
wakeup probe to determine what is waking a
particular process, and when, over a given period. Type the
following source code and save it in a file named
gterm.d:
#pragma D option quiet

dtrace:::BEGIN
{
    start = timestamp;
}

sched:::wakeup
/stringof(args[1]->pr_fname) == "gnome-terminal"/
{
    @[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}

profile:::tick-1sec
/++x == 10/
{
    exit(0);
}
The output from running this script is as follows:
# dtrace -s gterm.d
Xorg
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@ 69
1 |@@@@@@@@ 35
2 |@@@@@@@@@ 42
3 | 2
4 | 0
5 | 0
6 | 0
7 |@@@@ 16
8 | 0
9 |@@@ 15
>= 10 | 0
This output shows that the X server is waking the
gnome-terminal process as you interact with
the system.
Additionally, you could use the sleep probe
with the wakeup probe to understand which
applications are blocking on other applications, and for how
long. Type the following source code and save it in a file
named whofor.d:
#pragma D option quiet

sched:::sleep
{
    bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
    @[stringof(args[1]->pr_fname), execname] =
        quantize(timestamp - bedtime[args[0]->pr_addr]);
    bedtime[args[0]->pr_addr] = 0;
}

END
{
    printa("%s sleeping on %s:\n%@d\n", @);
}
The tail of the output from running the previous example script on a desktop system for several seconds is similar to the following:
# dtrace -s whofor.d
^C
...
Xorg sleeping on metacity:
value ------------- Distribution ------------- count
65536 | 0
131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2
262144 | 0
gnome-power-man sleeping on Xorg:
value ------------- Distribution ------------- count
131072 | 0
262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
524288 | 0
...
preempt and remain-cpu Probes
Because Oracle Linux is a preemptive system, higher priority threads preempt lower priority threads. Preemption can induce a significant latency bubble in the lower priority thread. Therefore, you might want to know which threads are being preempted by other threads.
The following example shows how you would use the
preempt and remain-cpu
probes to display this information. Type the following source
code and save it in a file named
whopreempt.d:
#pragma D option quiet

sched:::preempt
{
    self->preempt = 1;
}

sched:::remain-cpu
/self->preempt/
{
    self->preempt = 0;
}

sched:::off-cpu
/self->preempt/
{
    /*
     * If we were told to preempt ourselves, see who we ended up giving
     * the CPU to.
     */
    @[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
        curlwpsinfo->pr_pri] = count();
    self->preempt = 0;
}

END
{
    printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
        "PREEMPTED", "PRI", "#");
    printa("%30s %3d %30s %3d %5@d\n", @);
}
Running the previous script on a desktop system for several seconds results in output that is similar to the following:
# dtrace -s whopreempt.d
^C
PREEMPTOR PRI PREEMPTED PRI #
firefox 120 kworker/0:0 120 1
gnome-panel 120 swapper 120 1
gnome-panel 120 wnck-applet 120 1
jbd2/dm-0-8 120 swapper 120 1
khugepaged 139 kworker/0:0 120 1
ksoftirqd/1 120 kworker/0:0 120 1
kworker/0:0 120 gnome-terminal 120 1
kworker/0:2 120 Xorg 120 1
kworker/0:2 120 java 120 1
kworker/1:0 120 Xorg 120 1
nautilus 120 Xorg 120 1
rtkit-daemon 0 rtkit-daemon 120 1
rtkit-daemon 120 swapper 120 1
watchdog/0 0 swapper 120 1
watchdog/1 0 kworker/0:0 120 1
wnck-applet 120 Xorg 120 1
wnck-applet 120 swapper 120 1
automount 120 kworker/0:0 120 2
gnome-power-man 120 kworker/0:0 120 2
kworker/0:0 120 swapper 120 2
kworker/1:0 120 dtrace 120 2
metacity 120 kworker/0:0 120 2
notification-da 120 swapper 120 2
udisks-daemon 120 kworker/0:0 120 2
automount 120 swapper 120 3
gnome-panel 120 Xorg 120 3
gnome-settings- 120 Xorg 120 3
gnome-settings- 120 swapper 120 3
gnome-terminal 120 swapper 120 3
java 120 kworker/0:0 120 3
ksoftirqd/0 120 swapper 120 3
kworker/0:2 120 swapper 120 3
metacity 120 Xorg 120 3
nautilus 120 kworker/0:0 120 3
qpidd 120 swapper 120 3
metacity 120 swapper 120 4
gvfs-afc-volume 120 swapper 120 5
java 120 Xorg 120 5
notification-da 120 Xorg 120 5
notification-da 120 kworker/0:0 120 5
Xorg 120 kworker/0:0 120 6
wnck-applet 120 kworker/0:0 120 10
VBoxService 120 swapper 120 13
dtrace 120 swapper 120 14
kworker/1:0 120 kworker/0:0 120 16
dtrace 120 kworker/0:0 120 20
Xorg 120 swapper 120 90
hald-addon-inpu 120 swapper 120 100
java 120 swapper 120 108
gnome-terminal 120 kworker/0:0 120 110
tick
If NOHZ is set to off,
Oracle Linux uses tick-based CPU accounting,
where a system clock interrupt fires at a fixed interval and
attributes CPU utilization to the processes that are running
at the time of the tick. The following example shows how you
would use the tick probe to observe this
attribution.
# dtrace -n sched:::tick'{ @[stringof(args[1]->pr_fname)] = count() }'
dtrace: description 'sched:::tick' matched 1 probe
^C
VBoxService 1
gpk-update-icon 1
hald-addon-inpu 1
jbd2/dm-0-8 1
automount 2
gnome-session 2
hald 2
gnome-power-man 3
ksoftirqd/0 3
kworker/0:2 3
notification-da 4
devkit-power-da 6
nautilus 9
dbus-daemon 11
gnome-panel 11
gnome-settings- 11
dtrace 19
khugepaged 22
metacity 27
kworker/0:0 41
swapper 56
firefox 58
wnck-applet 61
gnome-terminal 67
java 84
Xorg 227
One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system either over-accounts or under-accounts for the thread, depending on whether the accounting is done before or after the time-related dispatching activity. If accounting is performed before time-related dispatching, the system under-accounts for threads running at a regular interval. If such threads run for less than the clock tick interval, they can effectively hide behind the clock tick.
The following example examines whether a system has any such
threads. Type the following source code and save it in a file
named tick.d:
sched:::tick,
sched:::enqueue
{
    @[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}
The output of the example script is two distributions of the
millisecond offset within a ten millisecond interval, one for
the tick probe and another for
enqueue:
# dtrace -s tick.d
dtrace: script 'tick.d' matched 9 probes
^C
tick
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@ 29
1 |@@@@@@@@@@@@@@@@@@@ 106
2 |@@@@@ 27
3 |@ 7
4 |@@ 10
5 |@@ 12
6 |@ 4
7 |@ 8
8 |@@ 9
9 |@@@ 17
>= 10 | 0
enqueue
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@ 82
1 |@@@@ 86
2 |@@@@ 76
3 |@@@ 65
4 |@@@@@ 101
5 |@@@@ 79
6 |@@@@ 75
7 |@@@@ 76
8 |@@@@ 89
9 |@@@@ 75
>= 10 | 0
The output histogram named tick shows that
the clock tick is firing at a 1 millisecond offset. In this
example, the output for enqueue is evenly
spread across the ten millisecond interval and no spike is
visible at 1 millisecond, so it appears that the threads are
not being scheduled on a time basis.
sched Stability
The sched provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
io Provider
The io provider makes available probes that
relate to data input and output. The io
provider enables quick exploration of behavior that is observed
through I/O monitoring tools such as iostat.
For example, you can use the io provider to
understand I/O by device, I/O type, I/O size, process, or
application name.
io Probes
The following table describes the probes for the
io provider.
Table 11-9 io Probes
| Probe | Description |
|---|---|
| start | Fires when an I/O request is about to be made either to a peripheral device or to an NFS server. |
| done | Fires after an I/O request has been fulfilled. The done probe fires after the I/O completes, but before completion processing has been performed on the buffer. As a result … |
| wait-start | Fires immediately before a thread begins to wait pending completion of a given I/O request. Some time after the … |
| wait-done | Fires when a thread finishes waiting for the completion of a given I/O request. The … |
The io probes fire for all I/O requests to
peripheral devices, and for all file read and file write
requests to an NFS server. Requests for metadata from an NFS
server, such as those that result from a readdir() request,
do not trigger io probes.
io Probe Arguments
The following table describes the arguments for the
io probes.
Table 11-10 io Probe Arguments
| Argument | Type | Description |
|---|---|---|
| args[0] | bufinfo_t * | The … |
| args[1] | devinfo_t * | The … |
| args[2] | fileinfo_t * | The … |
Note:
DTrace does not currently support the use of
fileinfo_t with io
probes. In Oracle Linux, no information is readily accessible at the
level where the io probes fire about the
file where an I/O request originated.
bufinfo_t
The bufinfo_t structure is the abstraction
that describes an I/O request. The buffer that corresponds to
an I/O request is pointed to by args[0] in
the start, done,
wait-start, and
wait-done probes. Detailed information
about this data structure can be found in
/usr/lib64/dtrace/version/io.d.
The definition of bufinfo_t is as follows:
typedef struct bufinfo {
int b_flags; /* flags */
size_t b_bcount; /* number of bytes */
caddr_t b_addr; /* buffer address */
uint64_t b_blkno; /* expanded block # on device */
uint64_t b_lblkno; /* logical block # on device */
size_t b_resid; /* not supported */
size_t b_bufsize; /* size of allocated buffer */
caddr_t b_iodone; /* I/O completion routine */
int b_error; /* not supported */
dev_t b_edev; /* extended device */
} bufinfo_t;
Note:
DTrace translates the members of
bufinfo_t from the
buffer_head or bio for
the Oracle Linux I/O request structure, depending on the kernel
version.
b_flags indicates the state of the I/O
buffer, and consists of a bitwise-or of different state
values. The following table describes the values for the
supported states.
Table 11-11 b_flags Values
| b_flags | Value | Description |
|---|---|---|
| … | … | Indicates that the I/O request is asynchronous and is not waited upon. The … Note: Some I/Os directed to be asynchronous might not set … |
| … | … | Indicates that the buffer is being used in a paged I/O request. |
| … | … | Indicates that the buffer is being used for physical (direct) I/O to a user data area. |
| … | … | Indicates that data is to be read from the peripheral device into main memory. |
| … | … | Indicates that the data is to be transferred from main memory to the peripheral device. |
b_bcount: Is the number of bytes to be
transferred as part of the I/O request.
b_addr: Is the virtual address of the I/O
request, when known.
b_blkno: Identifies which block on the
device is to be accessed.
b_lblkno: Identifies which logical block on
the device is to be accessed. The mapping from a logical block
to a physical block (such as the cylinder, track, and so on)
is defined by the device.
b_bufsize: Contains the size of the
allocated buffer.
b_iodone: Identifies a specific routine in
the kernel that is called when the I/O is complete.
b_edev: Contains the major and minor device
numbers of the device accessed. You can use the D subroutines
getmajor and getminor to
extract the major and minor device numbers from the
b_edev field.
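The field descriptions above can be tied together with a short sketch that sums the bytes requested per device, with the device identified by the major and minor numbers extracted from b_edev. The aggregation name @bytes and the column layout are illustrative choices, not part of the provider.

```d
#pragma D option quiet

io:::start
{
    /* Sum the requested bytes, keyed by the device's major and minor numbers. */
    @bytes[getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] =
        sum(args[0]->b_bcount);
}

END
{
    printf("%5s %5s %s\n", "MAJOR", "MINOR", "BYTES");
    printa("%5d %5d %@d\n", @bytes);
}
```

The major and minor numbers that are printed can be compared with those shown by ls -l for the device nodes under /dev.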
devinfo_t
The devinfo_t structure provides
information about a device. The devinfo_t
structure that corresponds to the destination device of an I/O
is pointed to by args[1] in the
start, done,
wait-start, and
wait-done probes. Detailed information
about this data structure can be found in
/usr/lib64/dtrace/version/io.d.
The definition of devinfo_t is as follows:
typedef struct devinfo {
int dev_major; /* major number */
int dev_minor; /* minor number */
int dev_instance; /* not supported */
string dev_name; /* name of device */
string dev_statname; /* name of device + instance/minor */
string dev_pathname; /* pathname of device */
} devinfo_t;
Note:
DTrace translates the members of
devinfo_t from the
buffer_head for the Oracle Linux I/O request
structure.
dev_major: Is the major number of the
device.
dev_minor: Is the minor number of the
device.
dev_name: Is the name of the device driver
that manages the device.
dev_statname: Is the name of the device as
reported by iostat. This field is provided
so that aberrant iostat output can be
quickly correlated to actual I/O activity.
dev_pathname: Is the full path of the
device. The path that is specified by
dev_pathname includes components expressing
the device node, the instance number, and the minor node.
However, the statistics name does not necessarily express all
three of these elements. For some
devices, the statistics name consists of the device name and
the instance number. For other devices, the name consists of
the device name and the number of the minor node. As a result,
two devices that have the same dev_statname
might differ in their dev_pathname.
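Because two devices can share a dev_statname, it can help to key an aggregation on both names. The following is a sketch only, using the devinfo_t members described above. The column widths and headings are arbitrary choices for illustration.

```d
#pragma D option quiet

io:::start
{
        /* Key on both names so devices that share a statname stay distinct. */
        @[args[1]->dev_statname, args[1]->dev_pathname,
            args[1]->dev_major, args[1]->dev_minor] = count();
}

END
{
        printf("%-12s %-20s %5s %5s %8s\n",
            "STATNAME", "PATHNAME", "MAJOR", "MINOR", "COUNT");
        printa("%-12s %-20s %5d %5d %@8d\n", @);
}
```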
fileinfo_t
Note:
DTrace does not currently support the use of
fileinfo_t with the
args[2] argument of the
io probes. You can use the
fileinfo_t structure to obtain
information about a process's open files by using the
fds[] array. See
Built-In Variables.
The fileinfo_t structure provides
information about a file. args[2] in the
start, done,
wait-start, and
wait-done probes points to the file to
which an I/O request corresponds. The presence of file
information is contingent upon the file system providing this
information when dispatching I/O requests. Some file systems,
especially third-party file systems, might not provide this
information. Also, I/O requests might emanate from a file
system for which no file information exists. For example, any
I/O from or to file system metadata is not associated with any
one file. Finally, some highly optimized file systems might
aggregate I/O from disjoint files into a single I/O request.
In this case, the file system might provide the file
information either for the file that represents the majority
of the I/O or for the file that represents some of the I/O.
Alternatively, the file system might provide no file
information at all in this case.
Detailed information about this data structure can be found in
/usr/lib64/dtrace/version/io.d.
The definition of fileinfo_t is as follows:
typedef struct fileinfo {
string fi_name; /* name (basename of fi_pathname) */
string fi_dirname; /* directory (dirname of fi_pathname) */
string fi_pathname; /* full pathname */
loff_t fi_offset; /* offset within file */
string fi_fs; /* file system */
string fi_mount; /* not supported */
int fi_oflags; /* open() flags for file descriptor */
} fileinfo_t;
The fi_name field contains the name of the
file but does not include any directory components. If no file
information is associated with an I/O, the
fi_name field is set to the string
<none>. In some rare cases, the
pathname that is associated with a file might be unknown. In
this case, the fi_name field is set to the
string <unknown>.
The fi_dirname field contains only the
directory component of the file name. As with
fi_name, this string can be set to
<none>, if no file information is
present, or <unknown> if the pathname
that is associated with the file is not known.
The fi_pathname field contains the full
pathname to the file. As with fi_name, this
string can be set to <none>, if no
file information is present, or
<unknown> if the pathname that is
associated with the file is not known.
The fi_offset field contains the offset
within the file, or -1 if either file
information is not present or the offset is otherwise
unspecified by the file system.
The fi_fs field contains the name of the
file system type, or <none>, if no
information is present.
The fi_oflags field contains the flags that
were specified when opening the file.
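Because args[2] is not currently supported in the io probes, the fds[] array is the way to reach fileinfo_t members. The following sketch counts write() calls by file system type and pathname. It assumes that arg0 of syscall::write:entry is the file descriptor. The choice of the write system call and the output widths are illustrative.

```d
#pragma D option quiet

syscall::write:entry
{
        /* fds[] is indexed by file descriptor; arg0 is the fd given to write(). */
        @[execname, fds[arg0].fi_fs, fds[arg0].fi_pathname] = count();
}

END
{
        printa("%-16s %-10s %-40s %@8d\n", @);
}
```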
io Examples
The following example script displays information for every I/O
as it is issued. Type the following source code and save it in a
file named iosnoop.d.
#pragma D option quiet
BEGIN
{
printf("%10s %2s\n", "DEVICE", "RW");
}
io:::start
{
printf("%10s %2s\n", args[1]->dev_statname,
args[0]->b_flags & B_READ ? "R" : "W");
}
The output from this script is similar to the following:
# dtrace -s ./iosnoop.d
DEVICE RW
dm-00 R
dm-00 R
dm-00 R
dm-00 R
dm-00 R
dm-00 R
...
You can make the example script slightly more sophisticated by using an associative array to track the time (in milliseconds) spent on each I/O, as shown in the following example:
#pragma D option quiet
BEGIN
{
printf("%10s %2s %7s\n", "DEVICE", "RW", "MS");
}
io:::start
{
start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
printf("%10s %2s %3d.%03d\n", args[1]->dev_statname,
args[0]->b_flags & B_READ ? "R" : "W",
this->elapsed / 1000000, (this->elapsed / 1000) % 1000);
start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
The modified script adds an MS (milliseconds) column to the output.
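When per-I/O lines are too verbose, the same start and done pairing can feed an aggregation instead. The following sketch records a power-of-two distribution of I/O latency in microseconds for each device. It reuses the keying scheme from the script above. The choice of microseconds is an assumption made for readability.

```d
#pragma D option quiet

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        /* Convert nanoseconds to microseconds before bucketing. */
        @[args[1]->dev_statname] = quantize(
            (timestamp - start[args[0]->b_edev, args[0]->b_blkno]) / 1000);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
```

DTrace prints the aggregation automatically when tracing ends, so no END clause is needed.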
You can also aggregate the bytes transferred by device,
application, and process ID. Save the following script in a file
named whoio.d:
#pragma D option quiet
io:::start
{
@[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}
END
{
printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
printa("%10s %20s %10d %15@d\n", @);
}
Running this script for a few seconds results in output that is similar to the following:
# dtrace -s whoio.d
^C
DEVICE APP PID BYTES
dm-00 evince 14759 16384
dm-00 flush-252:0 1367 45056
dm-00 bash 14758 131072
dm-00 gvfsd-metadata 2787 135168
dm-00 evince 14758 139264
dm-00 evince 14338 151552
dm-00 jbd2/dm-0-8 390 356352
If you are copying data from one device to another, you might
want to know if one of the devices acts as a limiter on the
copy. To answer this question, you need to know the effective
throughput of each device, rather than the number of bytes per
second that each device is transferring. For example, you can
determine throughput by using the following script and saving it
in a file named copy.d:
#pragma D option quiet
io:::start
{
start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}
io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
/*
* We want to get an idea of our throughput to this device in KB/sec.
* What we have, however, is nanoseconds and bytes. That is, we want
* to calculate:
*
*       bytes / 1024
* ------------------------
* nanoseconds / 1000000000
*
* But we cannot calculate this using integer arithmetic without losing
* precision (the denominator, for one, is between 0 and 1 for nearly
* all I/Os). So we restate the fraction, and cancel:
*
*   bytes      1000000000       bytes        976562
* --------- * ------------- = --------- * -------------
*   1024       nanoseconds        1        nanoseconds
*
* This is easy to calculate using integer arithmetic.
*/
this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
@[args[1]->dev_statname, args[1]->dev_pathname] =
quantize((args[0]->b_bcount * 976562) / this->elapsed);
start[args[0]->b_edev, args[0]->b_blkno] = 0;
}
END
{
printa(" %s (%s)\n%@d\n", @);
}
Running the previous script for several seconds while copying data from a hard disk to a USB drive yields the following output:
# dtrace -s copy.d
^C
sdc1 (/dev/sdc1)
value ------------- Distribution ------------- count
32 | 0
64 | 3
128 | 1
256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2257
512 | 1
1024 | 0
dm-00 (/dev/dm-00)
value ------------- Distribution ------------- count
128 | 0
256 | 1
512 | 0
1024 | 2
2048 | 0
4096 | 2
8192 |@@@@@@@@@@@@@@@@@@ 172
16384 |@@@@@ 52
32768 |@@@@@@@@@@@ 108
65536 |@@@ 34
131072 | 0
The previous output shows that the USB drive
(sdc1) is clearly the limiting device. The
throughput of sdc1 is between 256 KB/sec and
512 KB/sec, while dm-00 delivered I/O at
anywhere from 8 MB/sec to over 64 MB/sec.
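To check the integer form of the throughput calculation in copy.d, the following sketch evaluates it once for a hypothetical I/O of 4096 bytes that completes in 500000 nanoseconds. Both values are invented for illustration. The casts to uint64_t are an assumption made so that the multiplication is done in 64-bit arithmetic, as it is with b_bcount in the script.

```d
#pragma D option quiet

BEGIN
{
        /* Hypothetical I/O: 4096 bytes completed in 500000 nanoseconds. */
        this->bytes = (uint64_t)4096;
        this->nsecs = (uint64_t)500000;

        /* Same expression as in copy.d; 976562 is 1000000000 / 1024, truncated. */
        printf("%d KB/sec\n", (this->bytes * 976562) / this->nsecs);
        exit(0);
}
```

The exact throughput for these values is 8000 KB/sec. Because the constant is truncated from 976562.5 and the division is integral, the expression should yield 7999, which quantize() places in the 4096 bucket.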
io Stability
The io provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see DTrace Stability Features.
fasttrap Provider
The fasttrap provider performs dynamic
instrumentation of arbitrary instructions in user-space threads.
Unlike most other DTrace providers, the
fasttrap provider is not designed for tracing
system activity. Rather, this provider is intended as a way for
DTrace consumers to inject information into the DTrace framework
by activating the fasttrap probe.
For more information about enabling statically defined probes in user-space programs, see Statically Defined Tracing of User Applications.
fasttrap Probes
The fasttrap provider makes available a
single probe that fires whenever a user-level process makes a
certain DTrace call into the kernel. The DTrace call to activate
the probe is not available.
fasttrap Stability
The fasttrap provider uses DTrace's stability
mechanism to describe its stabilities. These values are listed
in the following table.
| Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
| Provider | Evolving | Evolving | ISA |
| Module | Private | Private | Unknown |
| Function | Private | Private | Unknown |
| Name | Evolving | Evolving | ISA |
| Arguments | Private | Private | Unknown |
For more information about the stability mechanism, see DTrace Stability Features.