Starting with the Solaris 2.5 release, the SunOS kernel has a small number of TNF probes built into it. The probes record kernel events such as system calls, thread state transitions, page faults, swapping, and I/O. You can use these probes to obtain detailed traces of kernel activity under your application workloads. The probes have negligible impact on the performance of the running kernel when the probes are disabled.
The method for tracing the kernel is similar to tracing a user-level process, although there are some differences. To understand the following discussion, you should have read the sections "Running prex" and "Reading the Trace File".You need super-user privileges to trace the kernel.
Use the prex utility to control kernel probes. The standard prex commands to list and manipulate probes are available to you, along with commands to set up and manage kernel tracing.
Kernel probes write trace records into a kernel trace buffer. You must copy the buffer into a TNF file for post-processing; use the tnfxtract utility for this.
You use the tnfdump utility to examine a kernel trace file. This is exactly the same as examining a user-level trace file.
The steps you typically follow to take a kernel trace are:
Become superuser (su).
Allocate a kernel trace buffer of the desired size (prex).
Select the probes you want to trace and enable (prex).
Turn kernel tracing on (prex).
Run your application.
Turn kernel tracing off (prex).
Extract the kernel trace buffer (tnfxtract).
Disable all probes (prex).
Deallocate the kernel trace buffer (prex).
Examine the trace file (tnfdump).
A convenient way to follow these steps is to use two shell windows; run an interactive prex session in one, and run your application and tnfxtract in the other.You might find it simpler to use the ktrace shell script shown in " Shell Script for Kernel Tracing".
Start prex on the kernel with the -k flag (make sure you are root). After prex successfully attaches to the kernel, it prompts you for commands.
# prex -k prex>
Only one prex session can be attached to the kernel at any given time.
The first step in taking a kernel trace is to allocate a kernel trace buffer. The trace buffer is circular, which means that newer data overwrites older data after the buffer fills up.
The default size for the buffer is 384 kilobytes, but you can override this when you allocate the buffer. When deciding on the buffer size, keep in mind that the buffer occupies both physical memory and kernel virtual memory, so it will impact the system you are tracing--the more RAM you have, the less the impact of measurement on your experiment, and the bigger the buffer you can allocate.
Use the buffer command in prex to allocate the buffer. For example, you could allocate a 512 kilobyte buffer as follows:
prex> buffer # do we have a buffer? No trace buffer allocated prex> buffer alloc 512k# allocate 512KB buffer Buffer of size 524288 bytes allocated prex>
The minimum buffer size is 128 kilobytes; prex allocates a buffer of this size if you request anything smaller.
Use the standard prex list command to list kernel probes and their attributes. For example, to list the name and keys attributes of the probe named pagein:
prex> list name=pagein keys name=pagein keys=vm pageio io
To list the name and keys attributes of all probes in the io group:
prex> list name keys probes io name=biodone keys=io blockio name=physio_start keys=io rawio name=pagein keys=vm pageio io name=pageout keys=vm pageio io name=physio_end keys=io rawio name=strategy keys=io blockio prex>
The next step is to trace and enable the probes you need. For example, to select all probes whose keys specify thread, vm, and io, and prepare them for tracing:
prex> trace thread vm io # attach trace function to probes prex> enable thread vm io # enable probes prex>
Unlike user-level probes, kernel probes are not automatically `traced' (in prex terminology) when prex attaches to the kernel. You must explicitly issue the trace and enable commands as shown above.
At this point the relevant probes are ready for tracing, but kernel tracing has not been globally enabled so no trace records are being written.
If you want to trace all system activity (for example, on a busy server), you can proceed to "Enabling and Disabling Tracing". This is the most common use of kernel tracing, and usually the most informative.
However, you also have the option of restricting trace data generation to selected processes; this can reduce the amount of data you need to collect and analyze.
prex uses two abstractions to allow you to do process filtering.
The process filter set is a list of process identifiers (PIDs) for those processes that you want enabled for tracing; no trace data will be written by (threads belonging to) processes that are not in the filter set. The default filter set is empty.
The process filter mode is a global flag that selects whether process filtering is enabled or disabled in the kernel. The default is that process filtering is disabled, which means that all processes (and threads) write trace records. When it is enabled, only (threads belonging to) the processes in the filter set write trace records.
Use the prex pfilter command to control process filtering.
prex> pfilter # are we filtering? Process filtering is off Process filter set is empty. prex> pfilter add 408 # add PID 408 to filter set prex> pfilter Process filtering is off Process filter set is {408} prex> pfilter on # enable process filter mode prex>
System threads (such as interrupt threads) are treated as belonging to process 0.
The final step in starting up tracing is to globally enable kernel tracing. When you do this, the probes you have enabled write records into the kernel trace buffer as they are encountered.
prex> ktrace # check tracing status Tracing is off prex> ktrace on # enable kernel tracing prex>
To trace the kernel behavior of your application (perhaps with user-level tracing enabled as well), start it running at this time. When your application finishes, or when you decide you have enough trace data to sample its behavior, globally disable kernel tracing.
prex> ktrace off # disable kernel tracing prex>
The trace buffer is still present in the kernel. Copy it out into a TNF file using tnfxtract. The procedure that you follow is explained in the subsection " Extracting Kernel Trace Data (tnfxtract)".
After you have copied the trace buffer into a TNF file, reset kernel tracing by disabling all probes and deallocating the buffer. This is important because it brings the performance of the kernel back to where it was before you started tracing.
Use the standard prex commands to disable and untrace all probes.
prex> disable $all # disable all probes prex> untrace $all # untrace all probes prex>
Finally, use the prex buffer command to deallocate the trace buffer.
Do not deallocate the trace buffer until you have copied it out into a trace file. Otherwise, you will lose the trace data you collected for your experiment.
prex> buffer dealloc # deallocate buffer buffer deallocated prex>
You can now quit prex and examine the trace data that you have collected.
prex> quit #
Use the tnfxtract utility to make a copy (or snapshot) of the active kernel trace buffer into an external TNF trace file. You typically run the utility after tracing has been disabled, although you can also run it concurrently with tracing. The utility ensures that it reads and writes only consistent TNF data.
Specify a file name that tnfxtract can use to hold the extracted kernel trace data. This file will be overwritten and truncated to the size of the kernel trace buffer. For example, to extract it into a temporary file named ktrace.tnf:
# tnfxtract /tmp/ktrace.tnf # ls -l /tmp/ktrace.tnf -rw------- 1 root other 524288 Aug 15 16:00 /tmp/ktrace.tnf #
You might find it convenient to change the permissions on the trace file so that it is world-readable. This makes it easier for you to run analysis tools over the file when you are not superuser.
You can also use tnfxtract on a kernel crash dump; see tnfxtract(1) for details.
Use tnfdump to get a time-ordered ASCII dump of a kernel trace file on stdout (exactly as in user-level tracing). Because the trace output might be very large, you typically run it through a pager (such as more) or redirect it to an output file.
# tnfdump /tmp/ktrace.tnf | more
tnfdump accepts multiple TNF files as input; it produces a time-sorted output of all event records in its input files, so you can use it to combine multiple user-level trace files along with the kernel trace file.
The SunOS kernel probes supply information approximately at the level of statistics-based Solaris performance monitoring tools such as vmstat, iostat, and sar. However, they provide more detail -- each probe records a high-resolution timestamp, a thread (LWP, thread, and process identifiers), a CPU, and individual probes identify the system resources (such as disks, files, CPUs, and so on) associated with an event.
You can use kernel tracing, along with user-level tracing, to correlate the events recorded by your application or library probes with the events recorded by the kernel probes. Thus you can get a detailed view of how your code uses kernel services, and how the demands your application places on system resources interact with the demands placed by other clients.
The text below summarizes the SunOS kernel probes; see the tnf_probes(4) manual page for detailed information on the event record fields.
The thread_create probe traces kernel thread creation. It records the process identifier, the kernel thread identifier, and the kernel address of the start routine of the new thread.
The thread_exit probe records the termination of the current thread.
Probes that trace thread-state transitions are called microstate probes. They record a thread state and, optionally, a thread identifier. If a thread identifier is provided, the state change applies to the indicated thread. Otherwise, it applies to the writing thread.
Thread state values use the microstate constants defined in <sys/msacct.h>. The states are recorded as shown in Table 1-7.
Table 1-7 Thread Microstate Constants
State |
Explanation |
---|---|
user |
Running in user mode |
system |
Running in system mode |
tfault |
Initial state for user text fault |
dfault |
Initial state for user data fault |
trap |
Initial state for other trap |
user_lock |
Asleep, waiting for user-mode lock |
sleep |
Asleep for any other reason |
wait_cpu |
Waiting for a CPU (runnable) |
stopped |
Stopped (/proc, jobcontrol, lwp_stop) |
The kernel does not trace transitions between the system and user states that are implied by system calls. This is to reduce trace volume. You need to enable the system call probes to get this information; for convenience, they are automatically enabled for you when you enable probes with the thread key.
System call entry and exit probes identify the system operations explicitly requested by user code.
The syscall_start probe marks the start of a system call, and records the system call number. The probe does not capture system call arguments, because this is fairly expensive. (Some of this information can be captured by interposing on the C library entry points at the user level.) The syscall_start probe also implicitly marks the current thread as entering the system state.
The syscall_end probe marks the end of a system call, and records the two return values of the call, as well as the errno value. The syscall_end probe also implicitly marks the current thread as entering the user state.
System call implementation at this level can change from release to release. Do not rely on a consistent mapping of system calls to numbers.
The virtual memory subsystem (VM) probes provide information on page faults, page I/O, the page daemon, and the swapper.
Page fault probes relate virtual addresses with fault types and with files (vnodes).
The address_fault probe traces address-space faults; it records the faulting virtual address, the type of fault, and the desired access protection.
The fault type and access type values use the constants defined in <vm/seg_enum.h>. Fault types are invalid page (minor fault), protection fault, or software requests to lock and unlock pages for physical I/O (softlock and softunlock). Access types are read, write, execute and create.
The major_fault probe traces major page faults; it records the vnode and offset (which together identify a file system page) from which the fault should be resolved. This data can be correlated with the immediately preceding address_fault event for the current thread to obtain the faulting virtual address.
The anon_private probe traces copy-on-write faults.
The anon_zero probe traces zero-fill faults.
The page_unmap probe marks the dissociation between a physical page and a file system page (for example, when a page is renamed or destroyed).
The pagein probe traces the initiation of a pagein request; it records the vnode, offset, and size of the pagein.
The pageout probe traces the completion of a pageout request; it records the number of pages paged out, the number of pages freed, and the number of pages reclaimed after the pageout.
Iterations of the page daemon (page stealer) are traced by two probes: pageout_scan_start and pageout_scan_end. The probes report the number of free pages needed before the scan, the number of pages examined during the scan, and the free page counts before and after the scan. Potentially, more pages are freed when pageout requests queued by the scan are completed.
Three probes trace the activity of the swapper.
The swapout_process probe traces the swapping out of a process address space; it records the process identifier and the total number of pages either freed or queued for output.
The swapout_lwp probe traces the swapping out of an LWP's stack pages; it records the LWP's identity and the number of pages queued for output.
The swapin_lwp probe traces the swapping in of an LWP's stack pages; it records the LWP's identity and the number of stack pages brought in.
The strategy probe traces the initiation of local block device I/O by the kernel. It records the device number, logical block number, size, buffer pointer, and buffer flags associated with the transfer. The flag values are the buffer status flags as defined in <sys/buf.h>.
The biodone probe traces the completion of a buffered I/O transfer, that is, calls to the kernel biodone(9f) routine. It records the device number, logical block number, and buffer pointer associated with the transfer.
Physical (raw) I/O is traced by two probes in physio(9f): physio_start and physio_end. These probes record the device number, offset, size, and direction of the I/O transfer.
The thread_queue probe traces thread scheduling; it records the thread identifier of the scheduled thread, the CPU associated with the dispatch queue on which it is placed, the thread's dispatch priority, and the current number of runnable threads on the dispatch queue.
#!/bin/sh # # ktrace # Reset and enable kernel tracing # Run command in background, if specified # Sleep for duration, if specified # Wait for command completion, timeout, or keyboard interrupt # (Note: keyboard interrupt kills command) # Disable kernel tracing # Extract kernel trace buffer # Reset kernel tracing, if specified # TMPDIR=${TMPDIR:-/tmp} output_file=$TMPDIR/ktrace.tnf buffer_size=512k duration= events= command= do_reset=y child= alarm= # usage message usage() { echo "" echo $1 echo " Usage: ktrace [-o <output_file>] # default /tmp/ktrace.tnf [-s <buffer_size>] # default 512k -e <events> # kernel probes (keys) to enable [-r] # don't reset kernel tracing # default is to reset after command <cmd> | -t <seconds> Eg, # ktrace -e `thread vm io' -t 10 # ktrace -e `thread' -s 256k myapp ... " exit 1 } # failure message fail() { while [ $# -gt 0 ] do echo $1 shift done echo "ktrace failed" exit 2 } # Reset kernel tracing ktrace_reset() { if [ $1 = "y" ]; then echo "Resetting kernel tracing" prex -k >/dev/null 2>&1 <<-EOF ktrace off untrace /.*/ disable /.*/ buffer dealloc EOF test $? -ne 0 && fail "Could not reset kernel tracing" \ "`su root' and retry" fi } # Enable kernel tracing ktrace_on() { echo "Enabling kernel tracing" prex -k >/dev/null 2>&1 <<-EOF buffer alloc $buffer_size trace $events enable $events ktrace on EOF test $? -ne 0 && fail "Could not enable kernel tracing" \ "Check syntax of `-e' argument" \ "Check buffer size is not too high" } # Disable kernel tracing ktrace_off() { prex -k >/dev/null 2>&1 <<-EOF ktrace off EOF test $? -ne 0 && fail "Could not disable kernel tracing" echo "Kernel tracing disabled" } # Extract kernel trace buffer ktrace_xtract() { echo "Extracting kernel trace buffer" tnfxtract $output_file || fail "Could not extract kernel trace buffer" ls -l $output_file } # Run command, sleep for duration, or wait for command to complete run_command() { trap `interrupt' 0 1 2 3 15 if [ "$command" ]; then $command & child=$! echo "`$command' pid is $child" fi if [ "$duration" ]; then sleep $duration & alarm=$! wait $alarm # XXX test -z "$child" || kill -15 $child else wait $child fi trap 0 1 2 3 15 } # Keyboard interrupt interrupt() { test -z "$alarm" || kill -15 $alarm test -z "$child" || kill -15 $child } # Parse options while getopts o:s:t:e:r opt do case $opt in o) output_file="$OPTARG";; s) buffer_size="$OPTARG";; t) duration="$OPTARG";; e) events="$OPTARG";; r) do_reset="n";; \?) usage;; esac done shift `expr $OPTIND - 1` # Get command to run test $# -gt 0 && command="$*" # Check and normalize options test -z "$events" && usage "No kernel events specified" test -z "$command" && test -z "$duration" && \ usage "No command or time duration specified" # Perform experiment ktrace_reset y # Reset kernel tracing ktrace_on # Enable kernel tracing run_command # Run command, wait, or sleep ktrace_off # Disable kernel tracing ktrace_xtract # Extract buffer ktrace_reset $do_reset # Reset kernel tracing again exit 0 |