The sched
provider makes available probes related to CPU scheduling. Because CPUs are the one resource
that all threads must consume, the sched provider is
very useful for understanding systemic behavior. For example, using the
sched provider, you can understand when and why threads sleep,
run, change priority, or wake other threads.
Table 11.7, “sched Probes” lists the sched probes.
Table 11.7. sched Probes
|
Probe |
Description |
|---|---|
|
|
Fires whenever a thread's priority is about to be changed. The
|
|
|
Fires immediately before a runnable thread is dequeued from a run queue. The
|
|
|
Fires immediately before a runnable thread is enqueued to a run queue. The
|
|
|
Fires when the current CPU is about to end execution of a thread. The
|
|
|
Fires when a CPU has just begun execution of a thread. The
|
|
|
Fires immediately before the current thread is preempted. After this probe
fires, the current thread will select a thread to run and the
|
|
|
Fires when a scheduling decision has been made, but the dispatcher has elected
to continue to run the current thread. The |
|
|
Fires immediately before the current thread sleeps on a synchronization
object. The type of the synchronization object is contained in the
|
|
|
Fires when a CPU has been instructed by another CPU to make a scheduling decision — often because a higher-priority thread has become runnable. |
|
|
Fires as a part of clock tick-based accounting. In clock tick-based
accounting, CPU accounting is performed by examining which threads and processes
are running when a fixed-interval interrupt fires. The
|
|
|
Fires immediately before the current thread wakes a thread sleeping on a
synchronization object. The |
Table 11.8, “sched Probe Arguments” lists the argument types for the sched probes;
Table 11.7, “sched Probes” describes the arguments.
Table 11.8. sched Probe Arguments
|
Probe |
|
|
|
|
|---|---|---|---|---|
|
|
|
|
|
— |
|
|
|
|
|
— |
|
|
|
|
|
|
|
|
|
|
— |
— |
|
|
— |
— |
— |
— |
|
|
— |
— |
— |
— |
|
|
— |
— |
— |
— |
|
|
— |
— |
— |
— |
|
|
|
|
— |
— |
|
|
|
|
— |
— |
|
|
|
|
— |
— |
The cpuinfo_t structure defines a
CPU. As Table 11.8, “sched Probe Arguments” indicates, arguments to both the enqueue and
dequeue probes include a pointer to a cpuinfo_t.
Additionally, the cpuinfo_t corresponding to the
current CPU is pointed to by the curcpu variable. The
definition of the cpuinfo_t structure is as
follows:
typedef struct cpuinfo {
processorid_t cpu_id; /* CPU identifier */
psetid_t cpu_pset; /* not supported */
chipid_t cpu_chip; /* chip identifier */
lgrp_id_t cpu_lgrp; /* not supported */
processor_info_t cpu_info; /* CPU information */
} cpuinfo_t;cpu_id is the processor identifier.
cpu_chip is the identifier of the physical chip. Physical chips may
contain several CPU cores.
cpu_info is the processor_info_t structure
associated with the CPU.
The following sections contain examples of using sched probes.
One common question you might want answered is which CPUs are
running threads and for how long. You can use the on-cpu and
off-cpu probes to easily
answer this question on a system-wide basis as shown in the following example:
sched:::on-cpu
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = quantize(timestamp - self->ts);
self->ts = 0;
}Running the above script results in output similar to the following example:
#dtrace -s ./where.ddtrace: script ’./where.d’ matched 5 probes^C0 value ------------- Distribution ------------- count 2048 | 0 4096 |@@ 37 8192 |@@@@@@@@@@@@@ 212 16384 |@ 30 32768 | 10 65536 |@ 17 131072 | 12 262144 | 9 524288 | 6 1048576 | 5 2097152 | 1 4194304 | 3 8388608 |@@@@ 75 16777216 |@@@@@@@@@@@@ 201 33554432 | 6 67108864 | 0 1 value ------------- Distribution ------------- count 2048 | 0 4096 |@ 6 8192 |@@@@ 23 16384 |@@@ 18 32768 |@@@@ 22 65536 |@@@@ 22 131072 |@ 7 262144 | 5 524288 | 2 1048576 | 3 2097152 |@ 9 4194304 | 4 8388608 |@@@ 18 16777216 |@@@ 19 33554432 |@@@ 16 67108864 |@@@@ 21 134217728 |@@ 14 268435456 | 0
This output shows that on CPU 1 threads tend to run for less than 100 microseconds
at a stretch, or for approximately 10 milliseconds. A noticable gap between the two
clusters of data is shown in the histogram. You also might be interested in knowing which
CPUs are running a particular process. You can use the on-cpu and
off-cpu probes for answering this question as well. The following
script displays which CPUs run a specified application over a period of ten
seconds:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::on-cpu
/execname == $$1/
{
self->ts = timestamp;
}
sched:::off-cpu
/self->ts/
{
@[cpu] = sum(timestamp - self->ts);
self->ts = 0;
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}
dtrace:::END
{
printf("CPU distribution over %d seconds:\n\n",
(timestamp - start) / 1000000000);
printf("CPU microseconds\n--- ------------\n");
normalize(@, 1000);
printa("%3d %@d\n", @);
}Running the above script on a large mail server and specifying the IMAP daemon results in output similar to the following example:
# dtrace -s ./whererun.d imapd
CPU distribution of imapd over 10 seconds:
CPU microseconds
--- ------------
15 10102
12 16377
21 25317
19 25504
17 35653
13 41539
14 46669
20 57753
22 70088
16 115860
23 127775
18 160517Oracle Linux takes into account the amount of time that a thread has been sleeping
when selecting a CPU on which to run the thread: a thread that has been sleeping for less
time tends not to migrate. You can use the off-cpu and
on-cpu probes to observe this behavior:
sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
self->cpu = cpu;
self->ts = timestamp;
}
sched:::on-cpu
/self->ts/
{
@[self->cpu == cpu ?
"sleep time, no CPU migration" : "sleep time, CPU migration"] =
lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
self->ts = 0;
self->cpu = 0;
}Running the above script for approximately 30 seconds results in output similar to the following example:
#dtrace -s ./howlong.ddtrace: script ’./howlong.d’ matched 5 probes^Csleep time, CPU migration value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@ 6838 25 |@@@@@ 4714 50 |@@@ 3108 75 |@ 1304 100 |@ 1557 125 |@ 1425 150 | 894 175 |@ 1526 200 |@@ 2010 225 |@@ 1933 250 |@@ 1982 275 |@@ 2051 300 |@@ 2021 325 |@ 1708 350 |@ 1113 375 | 502 400 | 220 425 | 106 450 | 54 475 | 40 >= 500 |@ 1716 sleep time, no CPU migration value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@ 58413 25 |@@@ 14793 50 |@@ 10050 75 | 3858 100 |@ 6242 125 |@ 6555 150 | 3980 175 |@ 5987 200 |@ 9024 225 |@ 9070 250 |@@ 10745 275 |@@ 11898 300 |@@ 11704 325 |@@ 10846 350 |@ 6962 375 | 3292 400 | 1713 425 | 585 450 | 201 475 | 96 >= 500 | 3946
The example output shows that there are many more occurrences of non-migration than migration. Also, when sleep times are longer, migrations are more likely. The distributions are noticeably different in the sub 100 millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision once a certain threshold is exceeded.
When a CPU becomes idle, the dispatcher looks for work enqueued on other (non-idle)
CPUs. The following sample code uses the dequeue probe to understand
how often applications are transferred and by which CPU:
#pragma D option quiet
sched:::dequeue
/args[2]->cpu_id != -1 && cpu != args[2]->cpu_id &&
(curlwpsinfo->pr_flag & PR_IDLE)/
{
@[stringof(args[1]->pr_fname), args[2]->cpu_id] =
lquantize(cpu, 0, 100);
}
END
{
printa("%s stolen from CPU %d by:\n%@d\n", @);
}The following output is an extract from running the above script on a 4-CPU system:
#dtrace -s ./whosteal.d^C... nscd stolen from CPU 1 by: value -------------- Distribution ------------ count 1 | 0 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28 3 | 0 snmpd stolen from CPU 1 by: value -------------- Distribution ------------ count < 0 | 0 0 |@ 1 1 | 0 2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 31 3 |@@ 2 4 | 0 ...
Instead of knowing which CPUs took which work, you might want to know the CPUs on which processes and threads are waiting to run. You can use the enqueue and dequeue probes together to answer this question:
sched:::enqueue
{
a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] =
timestamp;
}
sched:::dequeue
/a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]/
{
@[args[2]->cpu_id] = quantize(timestamp -
a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]);
a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = 0;
}Running the above script for several seconds results in output similar to the following example:
#dtrace -s qtime.ddtrace: script 'qtime.d' matched 2 probes^C1 value ------------- Distribution ------------- count 8192 | 0 16384 | 1 32768 |@ 47 65536 |@@@@@@@ 365 131072 |@@@@@@@@@@@@ 572 262144 |@@@@@@@@@@@@ 570 524288 |@@@@@@@ 354 1048576 |@ 57 2097152 | 7 4194304 | 1 8388608 | 1 16777216 | 0 0 value ------------- Distribution ------------- count 8192 | 0 16384 | 6 32768 |@ 49 65536 |@@@@@ 261 131072 |@@@@@@@@@@@@@ 753 262144 |@@@@@@@@@@@@ 704 524288 |@@@@@@@@ 455 1048576 |@ 74 2097152 | 9 4194304 | 2 8388608 | 0
Instead of looking at wait times, you might want to examine the
length of the run queue over time. Using the enqueue
and dequeue probes, you can
set up an associative array to track the queue length:
sched:::enqueue
{
this->len = qlen[args[2]->cpu_id]++;
@[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
qlen[args[2]->cpu_id]--;
}Running this script for approximately 30 seconds on a largely idle dual-core processor system results in output similar to the following example:
#dtrace -s qlen.ddtrace: script 'qlen.d' matched 2 probes^C1 value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8124 1 |@@@@@@ 1558 2 |@ 160 3 | 51 4 | 24 5 | 13 6 | 11 7 | 9 8 | 6 9 | 0 0 value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 8569 1 |@@@@@@@@@ 2429 2 |@ 292 3 | 25 4 | 8 5 | 5 6 | 4 7 | 4 8 | 1 9 | 0
The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread is enqueued, the run queues were very short (three or fewer threads in length). However, given that the system was largely idle, the exceptional data points at the bottom of each table might be unexpected. For example, why were the run queues as long as 8 runnable threads? To explore this question, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.
For this type of script, you would enable the enqueue and
dequeue probes and use both speculations and associative arrays.
Whenever a thread is enqueued, the script increments the length of the queue and records
the timestamp in an associative array keyed by the thread. You cannot use a thread-local
variable in this case because a thread might be enqueued by another thread. The script
then checks to see if the queue length exceeds the maximum. If it does, the script starts
a new speculation, and records the timestamp and the new maximum. Then, when a thread is
dequeued, the script compares the enqueue timestamp to the timestamp of the longest
length: if the thread was enqueued before the timestamp of the longest length, the thread
was in the queue when the longest length was recorded. In this case, the script
speculatively traces the thread's information. Once the kernel dequeues the last thread
that was enqueued at the timestamp of the longest length, the script commits the
speculation data. This script is shown below:
#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k
int maxlen;
int spec[int];
sched:::enqueue
{
this->len = ++qlen[this->cpu = args[2]->cpu_id];
in[args[0]->pr_addr] = timestamp;
}
sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
/*
* There is already a speculation for this CPU. We just set a new
* record, so we’ll discard the old one.
*/
discard(spec[this->cpu]);
}
sched:::enqueue
/this->len > maxlen/
{
/*
* We have a winner. Set the new maximum length and set the timestamp
* of the longest length.
*/
maxlen = this->len;
longtime[this->cpu] = timestamp;
/*
* Now start a new speculation, and speculatively trace the length.
*/
this->spec = spec[this->cpu] = speculation();
speculate(this->spec);
printf("Run queue of length %d:\n", this->len);
}
sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
speculate(spec[this->cpu]);
printf(" %d/%d (%s)\n",
args[1]->pr_pid, args[0]->pr_lwpid,
stringof(args[1]->pr_fname));
}
sched:::dequeue
/qlen[args[2]->cpu_id]/
{
in[args[0]->pr_addr] = 0;
this->len = --qlen[args[2]->cpu_id];
}
sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
/*
* We just processed the last thread that was enqueued at the time
* of longest length; commit the speculation, which by now contains
* each thread that was enqueued when the queue was longest.
*/
commit(spec[this->cpu]);
spec[this->cpu] = 0;
}Running the above script on the same system results in output similar to the following example:
#dtrace -s whoqueue.dRun queue of length 1: 2850/2850 (java) Run queue of length 2: 4034/4034 (kworker/0:1) 16/16 (sync_supers) Run queue of length 3: 10/10 (ksoftirqd/1) 1710/1710 (hald-addon-inpu) 25350/25350 (dtrace) Run queue of length 4: 2852/2852 (java) 2850/2850 (java) 1710/1710 (hald-addon-inpu) 2099/2099 (Xorg) Run queue of length 5: 3149/3149 (notification-da) 2417/2417 (gnome-settings-) 2437/2437 (gnome-panel) 2461/2461 (wnck-applet) 2432/2432 (metacity) Run queue of length 9: 3685/3685 (firefox) 3149/3149 (notification-da) 2417/2417 (gnome-settings-) 2437/2437 (gnome-panel) 2852/2852 (java) 2452/2452 (nautilus) 2461/2461 (wnck-applet) 2432/2432 (metacity) 2749/2749 (gnome-terminal)^C
You can use the wakeup probe to determine what is waking a
particular process, and when over a given period, as shown in the following
example:
#pragma D option quiet
dtrace:::BEGIN
{
start = timestamp;
}
sched:::wakeup
/stringof(args[1]->pr_fname) == "gnome-terminal"/
{
@[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}
profile:::tick-1sec
/++x == 10/
{
exit(0);
}The output from running this script is shown below:
# dtrace -s gterm.d
Xorg
value ------------- Distribution ------------- count
< 0 | 0
0 |@@@@@@@@@@@@@@@ 69
1 |@@@@@@@@ 35
2 |@@@@@@@@@ 42
3 | 2
4 | 0
5 | 0
6 | 0
7 |@@@@ 16
8 | 0
9 |@@@ 15
>= 10 | 0 The output shows that the X server is waking the gnome-terminal
process, as you interact with the system.
You can additionally use the sleep probe along with the wakeup
probe to understand which applications are blocking on which applications, and for how
long, as shown in the following example:
#pragma D option quiet
sched:::sleep
{
bedtime[curlwpsinfo->pr_addr] = timestamp;
}
sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
@[stringof(args[1]->pr_fname), execname] =
quantize(timestamp - bedtime[args[0]->pr_addr]);
bedtime[args[0]->pr_addr] = 0;
}
END
{
printa("%s sleeping on %s:\n%@d\n", @);
}The tail of the output from running the example script for several seconds on a desktop system resembles the following example:
#dtrace -s whofor.d^C... Xorg sleeping on metacity: value ------------- Distribution ------------- count 65536 | 0 131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2 262144 | 0 gnome-power-man sleeping on Xorg: value ------------- Distribution ------------- count 131072 | 0 262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1 524288 | 0 ...
Because Oracle Linux is a preemptive system, higher priority threads preempt lower
priority ones. Preemption can induce a significant latency bubble in the lower priority
thread, so you might want to know which threads are being preempted by which other
threads. The following example shows how to use the preempt and
remain-cpu probes to display this information:
#pragma D option quiet
sched:::preempt
{
self->preempt = 1;
}
sched:::remain-cpu
/self->preempt/
{
self->preempt = 0;
}
sched:::off-cpu
/self->preempt/
{
/*
* If we were told to preempt ourselves, see who we ended up giving
* the CPU to.
*/
@[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
curlwpsinfo->pr_pri] = count();
self->preempt = 0;
}
END
{
printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
"PREEMPTED", "PRI", "#");
printa("%30s %3d %30s %3d %5@d\n", @);
}Running the above script for several seconds on a desktop system results in output similar to the following example:
#dtrace -s whopreempt.d^CPREEMPTOR PRI PREEMPTED PRI # firefox 120 kworker/0:0 120 1 gnome-panel 120 swapper 120 1 gnome-panel 120 wnck-applet 120 1 jbd2/dm-0-8 120 swapper 120 1 khugepaged 139 kworker/0:0 120 1 ksoftirqd/1 120 kworker/0:0 120 1 kworker/0:0 120 gnome-terminal 120 1 kworker/0:2 120 Xorg 120 1 kworker/0:2 120 java 120 1 kworker/1:0 120 Xorg 120 1 nautilus 120 Xorg 120 1 rtkit-daemon 0 rtkit-daemon 120 1 rtkit-daemon 120 swapper 120 1 watchdog/0 0 swapper 120 1 watchdog/1 0 kworker/0:0 120 1 wnck-applet 120 Xorg 120 1 wnck-applet 120 swapper 120 1 automount 120 kworker/0:0 120 2 gnome-power-man 120 kworker/0:0 120 2 kworker/0:0 120 swapper 120 2 kworker/1:0 120 dtrace 120 2 metacity 120 kworker/0:0 120 2 notification-da 120 swapper 120 2 udisks-daemon 120 kworker/0:0 120 2 automount 120 swapper 120 3 gnome-panel 120 Xorg 120 3 gnome-settings- 120 Xorg 120 3 gnome-settings- 120 swapper 120 3 gnome-terminal 120 swapper 120 3 java 120 kworker/0:0 120 3 ksoftirqd/0 120 swapper 120 3 kworker/0:2 120 swapper 120 3 metacity 120 Xorg 120 3 nautilus 120 kworker/0:0 120 3 qpidd 120 swapper 120 3 metacity 120 swapper 120 4 gvfs-afc-volume 120 swapper 120 5 java 120 Xorg 120 5 notification-da 120 Xorg 120 5 notification-da 120 kworker/0:0 120 5 Xorg 120 kworker/0:0 120 6 wnck-applet 120 kworker/0:0 120 10 VBoxService 120 swapper 120 13 dtrace 120 swapper 120 14 kworker/1:0 120 kworker/0:0 120 16 dtrace 120 kworker/0:0 120 20 Xorg 120 swapper 120 90 hald-addon-inpu 120 swapper 120 100 java 120 swapper 120 108 gnome-terminal 120 kworker/0:0 120 110
If NOHZ is set to off, Oracle Linux uses
tick-based CPU accounting, in which a system clock interrupt fires
at a fixed interval and attributes CPU utilization to the processes running at the time of
the tick. The following example shows how to use the tick probe to
observe this attribution:
#dtrace -n sched:::tick'{ @[stringof(args[1]->pr_fname)] = count() }'dtrace: description 'sched:::tick' matched 1 probe^CVBoxService 1 gpk-update-icon 1 hald-addon-inpu 1 jbd2/dm-0-8 1 automount 2 gnome-session 2 hald 2 gnome-power-man 3 ksoftirqd/0 3 kworker/0:2 3 notification-da 4 devkit-power-da 6 nautilus 9 dbus-daemon 11 gnome-panel 11 gnome-settings- 11 dtrace 19 khugepaged 22 metacity 27 kworker/0:0 41 swapper 56 firefox 58 wnck-applet 61 gnome-terminal 67 java 84 Xorg 227
One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system either over-accounts or under-accounts for the thread, depending on whether the accounting is done before or after time-related dispatching scheduling activity. If accounting is performed before time-related dispatching, the system under-accounts for threads running at a regular interval. If such threads run for less than the clock tick interval, they can effectively hide behind the clock tick. The following example examine whether a system has any such threads:
sched:::tick,
sched:::enqueue
{
@[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}The output of the example script is two distributions of the millisecond offset
within a ten millisecond interval, one for the tick probe and another
for enqueue:
#dtrace -s tick.ddtrace: script 'tick.d' matched 2 probes^Ctick value ------------- Distribution ------------- count < 0 | 0 0 |@@@@@ 29 1 |@@@@@@@@@@@@@@@@@@@ 106 2 |@@@@@ 27 3 |@ 7 4 |@@ 10 5 |@@ 12 6 |@ 4 7 |@ 8 8 |@@ 9 9 |@@@ 17 >= 10 | 0 enqueue value ------------- Distribution ------------- count < 0 | 0 0 |@@@@ 82 1 |@@@@ 86 2 |@@@@ 76 3 |@@@ 65 4 |@@@@@ 101 5 |@@@@ 79 6 |@@@@ 75 7 |@@@@ 76 8 |@@@@ 89 9 |@@@@ 75 >= 10 | 0
The output histogram named tick shows that the clock tick is
firing at a 1 millisecond offset because the system clock frequency is 1000Hz). In this
example, the output for enqueue is evenly spread across the ten
millisecond interval and no spike is visible at 1 millisecond, so it appears that the
threads are being not being scheduled on a time basis.
The sched provider uses DTrace's stability mechanism to describe its
stabilities, as shown in the following table.
Element | Name Stability | Data Stability | Dependency Class |
|---|---|---|---|
Provider | Evolving | Evolving | ISA |
Module | Private | Private | Unknown |
Function | Private | Private | Unknown |
Name | Evolving | Evolving | ISA |
Arguments | Evolving | Evolving | ISA |
For more information about the stability mechanism, see Chapter 14, Stability.