11.6 sched Provider

11.6.1 Probes
11.6.2 Arguments
11.6.3 cpuinfo_t
11.6.4 Examples
11.6.5 Stability

The sched provider makes available probes related to CPU scheduling. Because CPUs are the one resource that all threads must consume, the sched provider is very useful for understanding systemic behavior. For example, using the sched provider, you can understand when and why threads sleep, run, change priority, or wake other threads.

11.6.1 Probes

Table 11.6, “sched Probes” lists the sched probes.

Table 11.6 sched Probes

change-pri
    Fires whenever a thread's priority is about to be changed. The lwpsinfo_t of the thread is pointed to by args[0]; the thread's current priority is in the pr_pri field of this structure. The psinfo_t of the process containing the thread is pointed to by args[1]. The thread's new priority is contained in args[2].

dequeue
    Fires immediately before a runnable thread is dequeued from a run queue. The lwpsinfo_t of the thread being dequeued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU from which the thread is being dequeued is pointed to by args[2]. If the thread is being dequeued from a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1.

enqueue
    Fires immediately before a runnable thread is enqueued to a run queue. The lwpsinfo_t of the thread being enqueued is pointed to by args[0]. The psinfo_t of the process containing the thread is pointed to by args[1]. The cpuinfo_t of the CPU to which the thread is being enqueued is pointed to by args[2]. If the thread is being enqueued to a run queue that is not associated with a particular CPU, the cpu_id member of this structure will be -1. The value in args[3] is a boolean indicating whether the thread will be enqueued at the front of the run queue: non-zero if the thread will be enqueued at the front, and zero if it will be enqueued at the back.

off-cpu
    Fires when the current CPU is about to end execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is ending execution. The lwpsinfo_t structure of the thread that the current CPU will execute next is pointed to by args[0]. The psinfo_t of the process containing the next thread is pointed to by args[1].

on-cpu
    Fires when a CPU has just begun execution of a thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is beginning execution. The curpsinfo variable describes the process containing the current thread.

preempt
    Fires immediately before the current thread is preempted. After this probe fires, the current thread will select a thread to run and the off-cpu probe will fire for the current thread. In some cases, a thread on one CPU will be preempted, but the preempting thread will run on another CPU in the meantime. In this situation, the preempt probe will fire, but the dispatcher will be unable to find a higher-priority thread to run, and the remain-cpu probe will fire instead of the off-cpu probe.

remain-cpu
    Fires when a scheduling decision has been made, but the dispatcher has elected to continue to run the current thread. The curcpu variable indicates the current CPU. The curlwpsinfo variable indicates the thread that is continuing execution. The curpsinfo variable describes the process containing the current thread.

sleep
    Fires immediately before the current thread sleeps on a synchronization object. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t pointed to by curlwpsinfo. The address of the synchronization object is contained in the pr_wchan member of the lwpsinfo_t pointed to by curlwpsinfo. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.

surrender
    Fires when a CPU has been instructed by another CPU to make a scheduling decision, often because a higher-priority thread has become runnable.

tick
    Fires as a part of clock tick-based accounting, in which CPU accounting is performed by examining which threads and processes are running when a fixed-interval interrupt fires. The lwpsinfo_t that corresponds to the thread that is being assigned CPU time is pointed to by args[0]. The psinfo_t that corresponds to the process that contains the thread is pointed to by args[1].

wakeup
    Fires immediately before the current thread wakes a thread sleeping on a synchronization object. The lwpsinfo_t of the sleeping thread is pointed to by args[0]. The psinfo_t of the process containing the sleeping thread is pointed to by args[1]. The type of the synchronization object is contained in the pr_stype member of the lwpsinfo_t of the sleeping thread, and its address is contained in the pr_wchan member. The meaning of this address is a private implementation detail, but the address value may be treated as a token unique to the synchronization object.
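
For example, you could use the change-pri probe to see how far priority changes move threads. The following one-clause script is a minimal sketch (not one of the worked examples in Section 11.6.4) that aggregates the difference between the new priority in args[2] and the current priority in args[0]->pr_pri, keyed by process name:

sched:::change-pri
{
  /*
   * Record how far each priority change moves the thread, keyed by
   * the name of the process that contains the thread.
   */
  @[stringof(args[1]->pr_fname)] =
    lquantize(args[2] - args[0]->pr_pri, -50, 50, 5);
}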


11.6.2 Arguments

Table 11.7, “sched Probe Arguments” lists the argument types for the sched probes; Table 11.6, “sched Probes” describes the arguments.

Table 11.7 sched Probe Arguments

Probe       args[0]       args[1]     args[2]      args[3]
----------  ------------  ----------  -----------  -------
change-pri  lwpsinfo_t *  psinfo_t *  pri_t        -
dequeue     lwpsinfo_t *  psinfo_t *  cpuinfo_t *  -
enqueue     lwpsinfo_t *  psinfo_t *  cpuinfo_t *  int
off-cpu     lwpsinfo_t *  psinfo_t *  -            -
on-cpu      -             -           -            -
preempt     -             -           -            -
remain-cpu  -             -           -            -
sleep       -             -           -            -
surrender   lwpsinfo_t *  psinfo_t *  -            -
tick        lwpsinfo_t *  psinfo_t *  -            -
wakeup      lwpsinfo_t *  psinfo_t *  -            -


11.6.3 cpuinfo_t

The cpuinfo_t structure defines a CPU. As Table 11.7, “sched Probe Arguments” indicates, arguments to both the enqueue and dequeue probes include a pointer to a cpuinfo_t. Additionally, the cpuinfo_t corresponding to the current CPU is pointed to by the curcpu variable. The definition of the cpuinfo_t structure is as follows:

typedef struct cpuinfo {
  processorid_t cpu_id;      /* CPU identifier */
  psetid_t cpu_pset;         /* not supported */
  chipid_t cpu_chip;         /* chip identifier */
  lgrp_id_t cpu_lgrp;        /* not supported */
  processor_info_t cpu_info; /* CPU information */
} cpuinfo_t;

cpu_id is the processor identifier.

cpu_chip is the identifier of the physical chip. Physical chips may contain several CPU cores.

cpu_info is the processor_info_t structure associated with the CPU.

The cpu_pset and cpu_lgrp members are not supported.
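
As a brief illustration (a minimal sketch, not one of the worked examples that follow), you can dereference curcpu directly, for instance to count how often threads go on CPU on each physical chip:

sched:::on-cpu
{
  /* Count on-cpu events per physical chip via curcpu. */
  @[curcpu->cpu_chip] = count();
}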

11.6.4 Examples

The following sections contain examples of using sched probes.

11.6.4.1 on-cpu and off-cpu

One common question you might want answered is which CPUs are running threads and for how long. You can use the on-cpu and off-cpu probes to easily answer this question on a system-wide basis as shown in the following example:

sched:::on-cpu
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = quantize(timestamp - self->ts);
  self->ts = 0;
}

Running the above script results in output similar to the following example:

# dtrace -s ./where.d
dtrace: script './where.d' matched 5 probes
^C
        0
          value  ------------- Distribution ------------- count
           2048 |                                         0
           4096 |@@                                       37
           8192 |@@@@@@@@@@@@@                            212
          16384 |@                                        30
          32768 |                                         10
          65536 |@                                        17
         131072 |                                         12
         262144 |                                         9
         524288 |                                         6
        1048576 |                                         5
        2097152 |                                         1
        4194304 |                                         3
        8388608 |@@@@                                     75
       16777216 |@@@@@@@@@@@@                             201
       33554432 |                                         6
       67108864 |                                         0

        1
          value  ------------- Distribution ------------- count
           2048 |                                         0
           4096 |@                                        6
           8192 |@@@@                                     23
          16384 |@@@                                      18
          32768 |@@@@                                     22
          65536 |@@@@                                     22
         131072 |@                                        7
         262144 |                                         5
         524288 |                                         2
        1048576 |                                         3
        2097152 |@                                        9
        4194304 |                                         4
        8388608 |@@@                                      18
       16777216 |@@@                                      19
       33554432 |@@@                                      16
       67108864 |@@@@                                     21
      134217728 |@@                                       14
      268435456 |                                         0

This output shows that on CPU 0 threads tend to run either for less than 100 microseconds at a stretch or for approximately 10 milliseconds; a noticeable gap between the two clusters of data is visible in the histogram. You might also be interested in knowing which CPUs are running a particular process. You can use the on-cpu and off-cpu probes to answer this question as well. The following script displays which CPUs run a specified application over a period of ten seconds:

#pragma D option quiet
dtrace:::BEGIN
{
  start = timestamp;
}

sched:::on-cpu
/execname == $$1/
{
  self->ts = timestamp;
}

sched:::off-cpu
/self->ts/
{
  @[cpu] = sum(timestamp - self->ts);
  self->ts = 0;
}

profile:::tick-1sec
/++x == 10/
{
  exit(0);
}

dtrace:::END
{
  printf("CPU distribution over %d seconds:\n\n",
    (timestamp - start) / 1000000000);
  printf("CPU microseconds\n--- ------------\n");
  normalize(@, 1000);
  printa("%3d %@d\n", @);
}

Running the above script on a large mail server and specifying the IMAP daemon results in output similar to the following example:

# dtrace -s ./whererun.d imapd
CPU distribution of imapd over 10 seconds:

CPU microseconds
--- ------------
 15 10102
 12 16377
 21 25317
 19 25504
 17 35653
 13 41539
 14 46669
 20 57753
 22 70088
 16 115860
 23 127775
 18 160517

Oracle Linux takes into account the amount of time that a thread has been sleeping when selecting a CPU on which to run the thread: a thread that has been sleeping for less time tends not to migrate. You can use the off-cpu and on-cpu probes to observe this behavior:

sched:::off-cpu
/curlwpsinfo->pr_state == SSLEEP/
{
  self->cpu = cpu;
  self->ts = timestamp;
}

sched:::on-cpu
/self->ts/
{
  @[self->cpu == cpu ?
    "sleep time, no CPU migration" : "sleep time, CPU migration"] =
    lquantize((timestamp - self->ts) / 1000000, 0, 500, 25);
  self->ts = 0;
  self->cpu = 0;
}

Running the above script for approximately 30 seconds results in output similar to the following example:

# dtrace -s ./howlong.d
dtrace: script './howlong.d' matched 5 probes
^C
 sleep time, CPU migration
          value  ------------- Distribution ------------- count
            < 0 |                                         0
              0 |@@@@@@@                                  6838
             25 |@@@@@                                    4714
             50 |@@@                                      3108
             75 |@                                        1304
            100 |@                                        1557
            125 |@                                        1425
            150 |                                         894
            175 |@                                        1526
            200 |@@                                       2010
            225 |@@                                       1933
            250 |@@                                       1982
            275 |@@                                       2051
            300 |@@                                       2021
            325 |@                                        1708
            350 |@                                        1113
            375 |                                         502
            400 |                                         220
            425 |                                         106
            450 |                                         54
            475 |                                         40
         >= 500 |@                                        1716

 sleep time, no CPU migration
          value  ------------- Distribution ------------- count
            < 0 |                                         0
              0 |@@@@@@@@@@@@                             58413
             25 |@@@                                      14793
             50 |@@                                       10050
             75 |                                         3858
            100 |@                                        6242
            125 |@                                        6555
            150 |                                         3980
            175 |@                                        5987
            200 |@                                        9024
            225 |@                                        9070
            250 |@@                                       10745
            275 |@@                                       11898
            300 |@@                                       11704
            325 |@@                                       10846
            350 |@                                        6962
            375 |                                         3292
            400 |                                         1713
            425 |                                         585
            450 |                                         201
            475 |                                         96
         >= 500 |                                         3946

The example output shows many more occurrences of non-migration than migration, and that migrations become more likely as sleep times grow longer. The distributions are noticeably different in the sub-100-millisecond range, but look very similar as the sleep times get longer. This result would seem to indicate that sleep time is not factored into the scheduling decision once a certain threshold is exceeded.

11.6.4.2 enqueue and dequeue

When a CPU becomes idle, the dispatcher looks for work enqueued on other (non-idle) CPUs. The following sample code uses the dequeue probe to understand how often applications are transferred and by which CPU:

#pragma D option quiet

sched:::dequeue
/args[2]->cpu_id != -1 && cpu != args[2]->cpu_id &&
  (curlwpsinfo->pr_flag & PR_IDLE)/
{
  @[stringof(args[1]->pr_fname), args[2]->cpu_id] =
  lquantize(cpu, 0, 100);
}

END
{
  printa("%s stolen from CPU %d by:\n%@d\n", @);
}

The following output is an extract from running the above script on a 4-CPU system:

# dtrace -s ./whosteal.d
^C
...
  nscd stolen from CPU 1 by:

        value  -------------- Distribution ------------ count
            1 |                                         0
            2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 28
            3 |                                         0

  snmpd stolen from CPU 1 by:

        value  -------------- Distribution ------------ count
          < 0 |                                         0
            0 |@                                        1
            1 |                                         0
            2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     31
            3 |@@                                       2
            4 |                                         0
...

Instead of knowing which CPUs took which work, you might want to know the CPUs on which processes and threads are waiting to run. You can use the enqueue and dequeue probes together to answer this question:

sched:::enqueue
{
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] =
  timestamp;
}

sched:::dequeue
/a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]/
{
  @[args[2]->cpu_id] = quantize(timestamp -
    a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id]);
  a[args[0]->pr_lwpid, args[1]->pr_pid, args[2]->cpu_id] = 0;
}

Running the above script for several seconds results in output similar to the following example:

# dtrace -s qtime.d 
dtrace: script 'qtime.d' matched 2 probes
^C

        1
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |                                         1        
           32768 |@                                        47       
           65536 |@@@@@@@                                  365      
          131072 |@@@@@@@@@@@@                             572      
          262144 |@@@@@@@@@@@@                             570      
          524288 |@@@@@@@                                  354      
         1048576 |@                                        57       
         2097152 |                                         7        
         4194304 |                                         1        
         8388608 |                                         1        
        16777216 |                                         0        

        0
           value  ------------- Distribution ------------- count    
            8192 |                                         0        
           16384 |                                         6        
           32768 |@                                        49       
           65536 |@@@@@                                    261      
          131072 |@@@@@@@@@@@@@                            753      
          262144 |@@@@@@@@@@@@                             704      
          524288 |@@@@@@@@                                 455      
         1048576 |@                                        74       
         2097152 |                                         9        
         4194304 |                                         2        
         8388608 |                                         0

Instead of looking at wait times, you might want to examine the length of the run queue over time. Using the enqueue and dequeue probes, you can set up an associative array to track the queue length:

sched:::enqueue
{
  this->len = qlen[args[2]->cpu_id]++;
  @[args[2]->cpu_id] = lquantize(this->len, 0, 100);
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  qlen[args[2]->cpu_id]--;
}

Running this script for approximately 30 seconds on a largely idle dual-core processor system results in output similar to the following example:

# dtrace -s qlen.d 
dtrace: script 'qlen.d' matched 2 probes
^C

        1
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        8124     
               1 |@@@@@@                                   1558     
               2 |@                                        160      
               3 |                                         51       
               4 |                                         24       
               5 |                                         13       
               6 |                                         11       
               7 |                                         9        
               8 |                                         6        
               9 |                                         0        

        0
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           8569     
               1 |@@@@@@@@@                                2429     
               2 |@                                        292      
               3 |                                         25       
               4 |                                         8        
               5 |                                         5        
               6 |                                         4        
               7 |                                         4        
               8 |                                         1        
               9 |                                         0

The output is roughly what you would expect for an idle system: the majority of the time that a runnable thread was enqueued, the run queues were very short (three or fewer threads). However, given that the system was largely idle, the exceptional data points at the bottom of each table might be unexpected. For example, why were the run queues as long as eight runnable threads? To explore this question, you could write a D script that displays the contents of the run queue when the length of the run queue is long. This problem is complicated because D enablings cannot iterate over data structures, and therefore cannot simply iterate over the entire run queue. Even if D enablings could do so, you should avoid dependencies on the kernel's internal data structures.

For this type of script, you would enable the enqueue and dequeue probes and use both speculations and associative arrays. Whenever a thread is enqueued, the script increments the length of the queue and records the timestamp in an associative array keyed by the thread. You cannot use a thread-local variable in this case because a thread might be enqueued by another thread. The script then checks to see if the queue length exceeds the maximum. If it does, the script starts a new speculation, and records the timestamp and the new maximum.

Then, when a thread is dequeued, the script compares the enqueue timestamp to the timestamp of the longest length: if the thread was enqueued before the timestamp of the longest length, the thread was in the queue when the longest length was recorded. In this case, the script speculatively traces the thread's information. Once the kernel dequeues the last thread that was enqueued at the timestamp of the longest length, the script commits the speculation data. This script is shown below:

#pragma D option quiet
#pragma D option nspec=4
#pragma D option specsize=100k

int maxlen;
int spec[int];
sched:::enqueue
{
  this->len = ++qlen[this->cpu = args[2]->cpu_id];
  in[args[0]->pr_addr] = timestamp;
}

sched:::enqueue
/this->len > maxlen && spec[this->cpu]/
{
  /*
   * There is already a speculation for this CPU. We just set a new
   * record, so we'll discard the old one.
   */
  discard(spec[this->cpu]);
}

sched:::enqueue
/this->len > maxlen/
{
  /*
   * We have a winner. Set the new maximum length and set the timestamp
   * of the longest length.
   */
  maxlen = this->len;
  longtime[this->cpu] = timestamp;
  /*
   * Now start a new speculation, and speculatively trace the length.
   */
  this->spec = spec[this->cpu] = speculation();
  speculate(this->spec);
  printf("Run queue of length %d:\n", this->len);
}

sched:::dequeue
/(this->in = in[args[0]->pr_addr]) &&
  this->in <= longtime[this->cpu = args[2]->cpu_id]/
{
  speculate(spec[this->cpu]);
  printf(" %d/%d (%s)\n",
    args[1]->pr_pid, args[0]->pr_lwpid,
    stringof(args[1]->pr_fname));
}

sched:::dequeue
/qlen[args[2]->cpu_id]/
{
  in[args[0]->pr_addr] = 0;
  this->len = --qlen[args[2]->cpu_id];
}

sched:::dequeue
/this->len == 0 && spec[this->cpu]/
{
  /*
   * We just processed the last thread that was enqueued at the time
   * of longest length; commit the speculation, which by now contains
   * each thread that was enqueued when the queue was longest.
   */
  commit(spec[this->cpu]);
  spec[this->cpu] = 0;
}

Running the above script on the same system results in output similar to the following example:

# dtrace -s whoqueue.d
Run queue of length 1:
 2850/2850 (java)
Run queue of length 2:
 4034/4034 (kworker/0:1)
 16/16 (sync_supers)
Run queue of length 3:
 10/10 (ksoftirqd/1)
 1710/1710 (hald-addon-inpu)
 25350/25350 (dtrace)
Run queue of length 4:
 2852/2852 (java)
 2850/2850 (java)
 1710/1710 (hald-addon-inpu)
 2099/2099 (Xorg)
Run queue of length 5:
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
Run queue of length 9:
 3685/3685 (firefox)
 3149/3149 (notification-da)
 2417/2417 (gnome-settings-)
 2437/2437 (gnome-panel)
 2852/2852 (java)
 2452/2452 (nautilus)
 2461/2461 (wnck-applet)
 2432/2432 (metacity)
 2749/2749 (gnome-terminal)
^C

11.6.4.3 sleep and wakeup

You can use the wakeup probe to determine what is waking a particular process, and when, over a given period, as shown in the following example:

#pragma D option quiet

dtrace:::BEGIN
{
  start = timestamp;
}

sched:::wakeup
/stringof(args[1]->pr_fname) == "gnome-terminal"/
{
  @[execname] = lquantize((timestamp - start) / 1000000000, 0, 10);
}

profile:::tick-1sec
/++x == 10/
{
  exit(0);
}

The output from running this script is shown below:

# dtrace -s gterm.d

  Xorg                                              
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@@@@@@@@@@@                          69       
               1 |@@@@@@@@                                 35       
               2 |@@@@@@@@@                                42       
               3 |                                         2        
               4 |                                         0        
               5 |                                         0        
               6 |                                         0        
               7 |@@@@                                     16       
               8 |                                         0        
               9 |@@@                                      15       
           >= 10 |                                         0  

The output shows that the Xorg X server is waking the gnome-terminal process as you interact with the system.

You can additionally use the sleep probe along with the wakeup probe to understand which applications are blocking on which applications, and for how long, as shown in the following example:

#pragma D option quiet
sched:::sleep
{
  bedtime[curlwpsinfo->pr_addr] = timestamp;
}

sched:::wakeup
/bedtime[args[0]->pr_addr]/
{
  @[stringof(args[1]->pr_fname), execname] =
    quantize(timestamp - bedtime[args[0]->pr_addr]);
  bedtime[args[0]->pr_addr] = 0;
}

END
{
  printa("%s sleeping on %s:\n%@d\n", @);
}

The tail of the output from running the example script for several seconds on a desktop system resembles the following example:

# dtrace -s whofor.d 
^C
...
Xorg sleeping on metacity:

           value  ------------- Distribution ------------- count    
           65536 |                                         0        
          131072 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 2        
          262144 |                                         0        

gnome-power-man sleeping on Xorg:

           value  ------------- Distribution ------------- count    
          131072 |                                         0        
          262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1        
          524288 |                                         0
...

11.6.4.4 preempt and remain-cpu

Because Oracle Linux is a preemptive system, higher priority threads preempt lower priority ones. Preemption can induce a significant latency bubble in the lower priority thread, so you might want to know which threads are being preempted by which other threads. The following example shows how to use the preempt and remain-cpu probes to display this information:

#pragma D option quiet

sched:::preempt
{
  self->preempt = 1;
}

sched:::remain-cpu
/self->preempt/
{
  self->preempt = 0;
}

sched:::off-cpu
/self->preempt/
{
  /*
   * If we were told to preempt ourselves, see who we ended up giving
   * the CPU to.
   */
  @[stringof(args[1]->pr_fname), args[0]->pr_pri, execname,
    curlwpsinfo->pr_pri] = count();
  self->preempt = 0;
}

END
{
  printf("%30s %3s %30s %3s %5s\n", "PREEMPTOR", "PRI",
    "PREEMPTED", "PRI", "#");
  printa("%30s %3d %30s %3d %5@d\n", @);
}

Running the above script for several seconds on a desktop system results in output similar to the following example:

# dtrace -s whopreempt.d 
^C
                     PREEMPTOR PRI                      PREEMPTED PRI     #
                       firefox 120                    kworker/0:0 120     1
                   gnome-panel 120                        swapper 120     1
                   gnome-panel 120                    wnck-applet 120     1
                   jbd2/dm-0-8 120                        swapper 120     1
                    khugepaged 139                    kworker/0:0 120     1
                   ksoftirqd/1 120                    kworker/0:0 120     1
                   kworker/0:0 120                 gnome-terminal 120     1
                   kworker/0:2 120                           Xorg 120     1
                   kworker/0:2 120                           java 120     1
                   kworker/1:0 120                           Xorg 120     1
                      nautilus 120                           Xorg 120     1
                  rtkit-daemon   0                   rtkit-daemon 120     1
                  rtkit-daemon 120                        swapper 120     1
                    watchdog/0   0                        swapper 120     1
                    watchdog/1   0                    kworker/0:0 120     1
                   wnck-applet 120                           Xorg 120     1
                   wnck-applet 120                        swapper 120     1
                     automount 120                    kworker/0:0 120     2
               gnome-power-man 120                    kworker/0:0 120     2
                   kworker/0:0 120                        swapper 120     2
                   kworker/1:0 120                         dtrace 120     2
                      metacity 120                    kworker/0:0 120     2
               notification-da 120                        swapper 120     2
                 udisks-daemon 120                    kworker/0:0 120     2
                     automount 120                        swapper 120     3
                   gnome-panel 120                           Xorg 120     3
               gnome-settings- 120                           Xorg 120     3
               gnome-settings- 120                        swapper 120     3
                gnome-terminal 120                        swapper 120     3
                          java 120                    kworker/0:0 120     3
                   ksoftirqd/0 120                        swapper 120     3
                   kworker/0:2 120                        swapper 120     3
                      metacity 120                           Xorg 120     3
                      nautilus 120                    kworker/0:0 120     3
                         qpidd 120                        swapper 120     3
                      metacity 120                        swapper 120     4
               gvfs-afc-volume 120                        swapper 120     5
                          java 120                           Xorg 120     5
               notification-da 120                           Xorg 120     5
               notification-da 120                    kworker/0:0 120     5
                          Xorg 120                    kworker/0:0 120     6
                   wnck-applet 120                    kworker/0:0 120    10
                   VBoxService 120                        swapper 120    13
                        dtrace 120                        swapper 120    14
                   kworker/1:0 120                    kworker/0:0 120    16
                        dtrace 120                    kworker/0:0 120    20
                          Xorg 120                        swapper 120    90
               hald-addon-inpu 120                        swapper 120   100
                          java 120                        swapper 120   108
                gnome-terminal 120                    kworker/0:0 120   110

11.6.4.5 tick

If NOHZ is set to off, Oracle Linux uses tick-based CPU accounting, in which a system clock interrupt fires at a fixed interval and attributes CPU utilization to the processes running at the time of the tick. The following example shows how to use the tick probe to observe this attribution:

# dtrace -n sched:::tick'{ @[stringof(args[1]->pr_fname)] = count() }'
dtrace: description 'sched:::tick' matched 1 probe
^C

  VBoxService                                                       1
  gpk-update-icon                                                   1
  hald-addon-inpu                                                   1
  jbd2/dm-0-8                                                       1
  automount                                                         2
  gnome-session                                                     2
  hald                                                              2
  gnome-power-man                                                   3
  ksoftirqd/0                                                       3
  kworker/0:2                                                       3
  notification-da                                                   4
  devkit-power-da                                                   6
  nautilus                                                          9
  dbus-daemon                                                      11
  gnome-panel                                                      11
  gnome-settings-                                                  11
  dtrace                                                           19
  khugepaged                                                       22
  metacity                                                         27
  kworker/0:0                                                      41
  swapper                                                          56
  firefox                                                          58
  wnck-applet                                                      61
  gnome-terminal                                                   67
  java                                                             84
  Xorg                                                            227

One deficiency of tick-based accounting is that the system clock that performs accounting is often also responsible for dispatching any time-related scheduling activity. As a result, if a thread is to perform some amount of work every clock tick (that is, every 10 milliseconds), the system either over-accounts or under-accounts for the thread, depending on whether the accounting is done before or after time-related dispatching activity. If accounting is performed before time-related dispatching, the system under-accounts for threads running at a regular interval. If such threads run for less than the clock tick interval, they can effectively hide behind the clock tick. The following example examines whether a system has any such threads:

sched:::tick,
sched:::enqueue
{
  @[probename] = lquantize((timestamp / 1000000) % 10, 0, 10);
}

The output of the example script is two distributions of the millisecond offset within a ten millisecond interval, one for the tick probe and another for enqueue:

# dtrace -s tick.d
dtrace: script 'tick.d' matched 2 probes
^C

  tick                                              
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@@                                    29       
               1 |@@@@@@@@@@@@@@@@@@@                      106      
               2 |@@@@@                                    27       
               3 |@                                        7        
               4 |@@                                       10       
               5 |@@                                       12       
               6 |@                                        4        
               7 |@                                        8        
               8 |@@                                       9        
               9 |@@@                                      17       
           >= 10 |                                         0        

  enqueue                                           
           value  ------------- Distribution ------------- count    
             < 0 |                                         0        
               0 |@@@@                                     82       
               1 |@@@@                                     86       
               2 |@@@@                                     76       
               3 |@@@                                      65       
               4 |@@@@@                                    101      
               5 |@@@@                                     79       
               6 |@@@@                                     75       
               7 |@@@@                                     76       
               8 |@@@@                                     89       
               9 |@@@@                                     75       
           >= 10 |                                         0 

The output histogram named tick shows that the clock tick fires at a 1 millisecond offset (the system clock frequency is 1000 Hz). In this example, the output for enqueue is spread evenly across the ten millisecond interval and no spike is visible at 1 millisecond, so it appears that threads are not being scheduled on a time basis.

11.6.5 Stability

The sched provider uses DTrace's stability mechanism to describe its stabilities, as shown in the following table.

Element    Name Stability  Data Stability  Dependency Class
---------  --------------  --------------  ----------------
Provider   Evolving        Evolving        ISA
Module     Private         Private         Unknown
Function   Private         Private         Unknown
Name       Evolving        Evolving        ISA
Arguments  Evolving        Evolving        ISA

For more information about the stability mechanism, see Chapter 15, Stability.