Using the io Provider

Using the `io` Provider

The following example script displays pertinent information for every I/O as it is issued:

#pragma D option quiet

BEGIN
{
        printf("%10s %58s %2s\n", "DEVICE", "FILE", "RW");
}

io:::start
{
        printf("%10s %58s %2s\n", args[1]->dev_statname,
            args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W");
}

The output of the example when cold-starting Acrobat Reader on an x86 laptop system resembles the following example:

# dtrace -s ./iosnoop.d
     DEVICE                                                       FILE RW
     cmdk0                                 /opt/Acrobat4/bin/acroread  R
     cmdk0                                 /opt/Acrobat4/bin/acroread  R
     cmdk0                                                  <unknown>  R
     cmdk0                           /opt/Acrobat4/Reader/AcroVersion  R
     cmdk0                                                  <unknown>  R
     cmdk0                                                  <unknown>  R
     cmdk0                                                     <none>  R
     cmdk0                                                  <unknown>  R
     cmdk0                                                     <none>  R
     cmdk0                 /usr/lib/locale/iso_8859_1/iso_8859_1.so.3  R
     cmdk0                 /usr/lib/locale/iso_8859_1/iso_8859_1.so.3  R
     cmdk0                 /usr/lib/locale/iso_8859_1/iso_8859_1.so.3  R
     cmdk0                                                     <none>  R
     cmdk0                                                  <unknown>  R
     cmdk0                                                  <unknown>  R
     cmdk0                                                  <unknown>  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0                                                     <none>  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0                                                  <unknown>  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0                                                     <none>  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0   /opt/Acrobat4/Reader/intelsolaris/lib/libreadcore.so.4.0  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0             /opt/Acrobat4/Reader/intelsolaris/bin/acroread  R
     cmdk0                                                  <unknown>  R
     cmdk0        /opt/Acrobat4/Reader/intelsolaris/lib/libAGM.so.3.0  R
     cmdk0                                                     <none>  R
     cmdk0        /opt/Acrobat4/Reader/intelsolaris/lib/libAGM.so.3.0  R
     cmdk0        /opt/Acrobat4/Reader/intelsolaris/lib/libAGM.so.3.0  R
       ...

The <none> entries in the output indicate that the I/O does not correspond to the data in any particular file: these I/Os are due to metadata of one form or another. The <unknown> entries in the output indicate that the path name for the file is not known. This situation is relatively rare.

You could make the example script slightly more sophisticated by using an associative array to track the time spent on each I/O, as shown in the following example:

#pragma D option quiet

BEGIN
{
        printf("%10s %58s %2s %7s\n", "DEVICE", "FILE", "RW", "MS");
}

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        printf("%10s %58s %2s %3d.%03d\n", args[1]->dev_statname,
            args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W",
            this->elapsed / 10000000, (this->elapsed / 1000) % 1000);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

The output of the preceding example while hot-plugging a USB storage device into an otherwise idle x86 laptop system is shown in the following example:

# dtrace -s ./iotime.d
    DEVICE                                                 FILE RW      MS
     cmdk0                                 /kernel/drv/scsa2usb  R  24.781
     cmdk0                                 /kernel/drv/scsa2usb  R  25.208
     cmdk0                                    /var/adm/messages  W  25.981
     cmdk0                                 /kernel/drv/scsa2usb  R   5.448
     cmdk0                                               <none>  W   4.172
     cmdk0                                 /kernel/drv/scsa2usb  R   2.620
     cmdk0                                    /var/adm/messages  W   0.252
     cmdk0                                            <unknown>  R   3.213
     cmdk0                                               <none>  W   3.011
     cmdk0                                            <unknown>  R   2.197
     cmdk0                                    /var/adm/messages  W   2.680
     cmdk0                                               <none>  W   0.436
     cmdk0                                    /var/adm/messages  W   0.542
     cmdk0                                               <none>  W   0.339
     cmdk0                                    /var/adm/messages  W   0.414
     cmdk0                                               <none>  W   0.344
     cmdk0                                    /var/adm/messages  W   0.361
     cmdk0                                               <none>  W   0.315
     cmdk0                                    /var/adm/messages  W   0.421
     cmdk0                                               <none>  W   0.349
     cmdk0                                               <none>  R   1.524
     cmdk0                                            <unknown>  R   3.648
     cmdk0                                 /usr/lib/librcm.so.1  R   2.553
     cmdk0                                 /usr/lib/librcm.so.1  R   1.332
     cmdk0                                 /usr/lib/librcm.so.1  R   0.222
     cmdk0                                 /usr/lib/librcm.so.1  R   0.228
     cmdk0                                 /usr/lib/librcm.so.1  R   0.927
     cmdk0                                               <none>  R   1.189
       ...
     cmdk0                            /usr/lib/devfsadm/linkmod  R   1.110
  ...

You can make several observations about the mechanics of the system based on this output. First, note the long time to perform the first several I/Os, which took about 25 milliseconds each. This time might have been due to the cmdk0 device having been power managed on the laptop. Second, observe the I/O due to the scsa2usb driver loading to deal with USB Mass Storage device. Third, note the writes to /var/adm/messages as the device is reported. Finally, observe the reading of the device link generators (the files ending in link.so), which presumably deal with the new device.

The io provider enables in-depth understanding of iostat output. Assume you observe iostat output similar to the following example:

extended device statistics                   
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b 
cmdk0        8.0    0.0  399.8    0.0  0.0  0.0    0.8   0   1 
sd0          0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
sd2          0.0  109.0    0.0  435.9  0.0  1.0    8.9   0  97 
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0 
nfs2         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0

You can use the iotime.d script to see these I/Os as they happen, as shown in the following example:

DEVICE                                               FILE RW      MS
       sd2                                  /mnt/archives.tar  W   0.856
       sd2                                  /mnt/archives.tar  W   0.729
       sd2                                  /mnt/archives.tar  W   0.890
       sd2                                  /mnt/archives.tar  W   0.759
       sd2                                  /mnt/archives.tar  W   0.884
       sd2                                  /mnt/archives.tar  W   0.746
       sd2                                  /mnt/archives.tar  W   0.891
       sd2                                  /mnt/archives.tar  W   0.760
       sd2                                  /mnt/archives.tar  W   0.889
     cmdk0                      /export/archives/archives.tar  R   0.827
       sd2                                  /mnt/archives.tar  W   0.537
       sd2                                  /mnt/archives.tar  W   0.887
       sd2                                  /mnt/archives.tar  W   0.763
       sd2                                  /mnt/archives.tar  W   0.878
       sd2                                  /mnt/archives.tar  W   0.751
       sd2                                  /mnt/archives.tar  W   0.884
       sd2                                  /mnt/archives.tar  W   0.760
       sd2                                  /mnt/archives.tar  W   3.994
       sd2                                  /mnt/archives.tar  W   0.653
       sd2                                  /mnt/archives.tar  W   0.896
       sd2                                  /mnt/archives.tar  W   0.975
       sd2                                  /mnt/archives.tar  W   1.405
       sd2                                  /mnt/archives.tar  W   0.724
       sd2                                  /mnt/archives.tar  W   1.841
     cmdk0                      /export/archives/archives.tar  R   0.549
       sd2                                  /mnt/archives.tar  W   0.543
       sd2                                  /mnt/archives.tar  W   0.863
       sd2                                  /mnt/archives.tar  W   0.734
       sd2                                  /mnt/archives.tar  W   0.859
       sd2                                  /mnt/archives.tar  W   0.754
       sd2                                  /mnt/archives.tar  W   0.914
       sd2                                  /mnt/archives.tar  W   0.751
       sd2                                  /mnt/archives.tar  W   0.902
       sd2                                  /mnt/archives.tar  W   0.735
       sd2                                  /mnt/archives.tar  W   0.908
       sd2                                  /mnt/archives.tar  W   0.753

This output appears to show that the file archives.tar is being read from cmdk0 in /export/archives, and being written to device sd2 in /mnt. This existence of two files named archives.tar that are being operated on separately in parallel seems unlikely. To investigate further, you can aggregate on device, application, process ID and bytes transferred, as shown in the following example:

#pragma D option quiet

io:::start
{
        @[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}

END
{
        printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
        printa("%10s %20s %10d %15@d\n", @);
}

Running this script for a few seconds results in output similar to the following example:

# dtrace -s ./whoio.d
^C
    DEVICE                  APP        PID           BYTES
     cmdk0                   cp        790         1515520
       sd2                   cp        790         1527808

This output shows that this activity is a copy of the file archives.tar from one device to another. This conclusion leads to another natural question: is one of these devices faster than the other? Which device acts as the limiter on the copy? To answer these questions, you need to know the effective throughput of each device rather than the number of bytes per second each device is transferring. You can determine the throughput with the following example script:

#pragma D option quiet

io:::start
{
        start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
        /*
         * You want to get an idea of the throughput to this device in KB/sec.
         * What you have, however, is nanoseconds and bytes.  That is what you want
         * to calculate:
         *
         *                        bytes / 1024
         *                  ------------------------
         *                  nanoseconds / 1000000000
         *
         * But you cannot calculate this using integer arithmetic without losing
         * precision (the denominator, for one, is between 0 and 1 for nearly
         * all I/Os).  So restate the fraction, and cancel:
         * 
         *     bytes      1000000000         bytes        976562
         *   --------- * -------------  =  --------- * -------------  
         *      1024      nanoseconds          1        nanoseconds
         *
         * This is easy to calculate using integer arithmetic; this is what
         * you do below.
         */
        this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
        @[args[1]->dev_statname, args[1]->dev_pathname] =
            quantize((args[0]->b_bcount * 976562) / this->elapsed);
        start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
        printa("  %s (%s)\n%@d\n", @);
}

Running the example script for several seconds yields the following output:

sd2 (/devices/pci@0,0/pci1179,1@1d/storage@2/disk@0,0:r)

           value  ------------- Distribution ------------- count    
              32 |                                         0        
              64 |                                         3        
             128 |                                         1        
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2257     
             512 |                                         1        
            1024 |                                         0        

  cmdk0 (/devices/pci@0,0/pci-ide@1f,1/ide@0/cmdk@0,0:a)

           value  ------------- Distribution ------------- count    
             128 |                                         0        
             256 |                                         1        
             512 |                                         0        
            1024 |                                         2        
            2048 |                                         0        
            4096 |                                         2        
            8192 |@@@@@@@@@@@@@@@@@@                       172      
           16384 |@@@@@                                    52       
           32768 |@@@@@@@@@@@                              108      
           65536 |@@@                                      34       
          131072 |                                         0

The output shows that sd2 is clearly the limiting device. The sd2 throughput is between 256K/sec and 512K/sec, while cmdk0 is delivering I/O at anywhere from 8 MB/second to over 64 MB/second. The script prints out both the name as seen in iostat, and the full path of the device. To find out more about the device, you could specify the device path to prtconf, as shown in the following example:

# prtconf -v /devices/pci@0,0/pci1179,1@1d/storage@2/disk@0,0
disk, instance #2 (driver name: sd)
    Driver properties:
        name='lba-access-ok' type=boolean dev=(29,128)
        name='removable-media' type=boolean dev=none
        name='pm-components' type=string items=3 dev=none
            value='NAME=spindle-motor' + '0=off' + '1=on'
        name='pm-hardware-state' type=string items=1 dev=none
            value='needs-suspend-resume'
        name='ddi-failfast-supported' type=boolean dev=none
        name='ddi-kernel-ioctl' type=boolean dev=none
    Hardware properties:
        name='inquiry-revision-id' type=string items=1
            value='1.04'
        name='inquiry-product-id' type=string items=1
            value='STORAGE DEVICE'
        name='inquiry-vendor-id' type=string items=1
            value='Generic'
        name='inquiry-device-type' type=int items=1
            value=00000000
        name='usb' type=boolean
        name='compatible' type=string items=1
            value='sd'
        name='lun' type=int items=1
            value=00000000
        name='target' type=int items=1
            value=00000000

As the emphasized terms indicate, this device is a removable USB storage device.

The examples in this section have explored all I/O requests. However, you might only be interested in one type of request. The following example tracks the directories in which writes are occurring, along with the applications performing the writes:

#pragma D option quiet

io:::start
/args[0]->b_flags & B_WRITE/
{
        @[execname, args[2]->fi_dirname] = count();
}

END
{
        printf("%20s %51s %5s\n", "WHO", "WHERE", "COUNT");
        printa("%20s %51s %5@d\n", @);
}

Running this example script on a desktop workload for a period of time yields some interesting results, as shown in the following example output:

# dtrace -s ./whowrite.d
^C
              WHO                                             WHERE COUNT
               su                                          /var/adm     1
          fsflush                                              /etc     1
          fsflush                                                 /     1
          fsflush                                          /var/log     1
          fsflush                                  /export/bmc/lisa     1
              esd   /export/bmc/.phoenix/default/78cxczuy.slt/Cache     1
          fsflush                              /export/bmc/.phoenix     1
              esd         /export/bmc/.phoenix/default/78cxczuy.slt     1
               vi                                          /var/tmp     2
               vi                                              /etc     2
              cat                                            <none>     2
             bash                                                 /     2
               vi                                            <none>     3
            xterm                                          /var/adm     3
          fsflush                                       /export/bmc     7
  MozillaFirebird                                            <none>     8
              vim                                       /export/bmc     9
  MozillaFirebird                                       /export/bmc    10
          fsflush                                          /var/adm    11
         devfsadm                                              /dev    14
              ksh                                            <none>    71
              ksh                                       /export/bmc    71
          fsflush         /export/bmc/.phoenix/default/78cxczuy.slt   119
  MozillaFirebird         /export/bmc/.phoenix/default/78cxczuy.slt   119
          fsflush                                            <none>   211
  MozillaFirebird   /export/bmc/.phoenix/default/78cxczuy.slt/Cache   591
          fsflush   /export/bmc/.phoenix/default/78cxczuy.slt/Cache   666
            sched                                            <none>  2385

As the output indicates, virtually all writes are associated with the Mozilla Firebird cache. The writes labeled <none> are likely due to writes associated with the UFS log, writes that are themselves induced by other writes in the filesystem. See the ufs(4FS) man page for details on logging. This example shows how to use the io provider to discover a problem at a much higher layer of software. In this case, the script has revealed a configuration problem: the web browser would induce much less I/O (and quite likely none at all) if its cache were in a directory in a tmpfs filesystem.

The previous examples have used only the start and done probes. You can use the wait-start and wait-done probes to understand why applications block for I/O – and for how long. The following example script uses both io probes and sched probes to derive CPU time compared to I/O wait time for the StarOffice software.

#pragma D option quiet

sched:::on-cpu
/execname == "soffice.bin"/
{
        self->on = vtimestamp;
}

sched:::off-cpu
/self->on/
{
        @time["<on cpu>"] = sum(vtimestamp - self->on);
        self->on = 0;
}

io:::wait-start
/execname == "soffice.bin"/
{
        self->wait = timestamp;
}

io:::wait-done
/self->wait/
{
        @io[args[2]->fi_name] = sum(timestamp - self->wait);
        @time["<I/O wait>"] = sum(timestamp - self->wait);
        self->wait = 0;
}

END
{
        printf("Time breakdown (milliseconds):\n");
        normalize(@time, 1000000);
        printa("  %-50s %15@d\n", @time);

        printf("\nI/O wait breakdown (milliseconds):\n");
        normalize(@io, 1000000);
        printa("  %-50s %15@d\n", @io);
}

Running the example script during a cold start of the StarOffice software yields the following output:

Time breakdown (milliseconds):
  <on cpu>                                                      3634
  <I/O wait>                                                   13114

I/O wait breakdown (milliseconds):
  soffice.tmp                                                      0
  Office                                                           0
  unorc                                                            0
  sbasic.cfg                                                       0
  en                                                               0
  smath.cfg                                                        0
  toolboxlayout.xml                                                0
  sdraw.cfg                                                        0
  swriter.cfg                                                      0
  Linguistic.dat                                                   0
  scalc.cfg                                                        0
  Views.dat                                                        0
  Store.dat                                                        0
  META-INF                                                         0
  Common.xml.tmp                                                   0
  afm                                                              0
  libsimreg.so                                                     1
  xiiimp.so.2                                                      3
  outline                                                          4
  Inet.dat                                                         6
  fontmetric                                                       6
  ...
  
  libj641si_g.so                                                  46
  libX11.so.4                                                     46
  liblng641si.so                                                  48
  swriter.db                                                      53
  libwrp641si.so                                                  53
  liblocaledata_ascii.so                                          56
  libi18npool641si.so                                             65
  libdbtools2.so                                                  69
  ofa64101.res                                                    74
  libxcr641si.so                                                  82
  
  libsot641si.so                                                  86
  libcppuhelper3C52.so                                            98
  libfwl641si.so                                                 100
  libsb641si.so                                                  104
  libcomphelp2.so                                                105
  libxo641si.so                                                  106
  libucpfile1.so                                                 110
  libcppu.so.3                                                   111
  sw64101.res                                                    114
  libdb-3.2.so                                                   119
  libtk641si.so                                                  126
  libdtransX11641si.so                                           127
  libgo641si.so                                                  132
  libfwe641si.so                                                 150
  libi18n641si.so                                                152
  libfwi641si.so                                                 154
  libso641si.so                                                  173
  libpsp641si.so                                                 186
  libtl641si.so                                                  189
  <unknown>                                                      189
  libucbhelper1C52.so                                            195
  libutl641si.so                                                 213
  libofa641si.so                                                 216
  libfwk641si.so                                                 229
  libsvl641si.so                                                 261
  libcfgmgr2.so                                                  368
  libsvt641si.so                                                 373
  libvcl641si.so                                                 741
  libsvx641si.so                                                 885
  libsfx641si.so                                                 993
  <none>                                                        1096
  libsw641si.so                                                 1365
  applicat.rdb                                                  1580

As this output shows, much of the cold StarOffice start time is due to waiting for I/O. 13.1 seconds waiting for I/O as opposed to 3.6 seconds on CPU. Running the script on a warm start of the StarOffice software reveals that page caching has eliminated the I/O time, as shown in the following example output:

Time breakdown (milliseconds):
  <I/O wait>                                                       0
  <on cpu>                                                      2860

I/O wait breakdown (milliseconds):
  temp                                                             0
  soffice.tmp                                                      0
  <unknown>                                                        0
  Office                                                           0

The cold start output shows that the file applicat.rdb accounts for more I/O wait time than any other file. This result is presumably due to many I/Os to the file. To explore the I/Os performed to this file, you can use the following D script:

io:::start
/execname == "soffice.bin" && args[2]->fi_name == "applicat.rdb"/
{
        @ = lquantize(args[2]->fi_offset != -1 ?
            args[2]->fi_offset / (1000 * 1024) : -1, 0, 1000);
}

This script uses the fi_offset field of the fileinfo_t structure to understand which parts of the file are being accessed, at the granularity of a megabyte. Running this script during a cold start of the StarOffice software results in output similar to the following example:

# dtrace -s ./applicat.d
dtrace: script './applicat.d' matched 4 probes
^C

           value  ------------- Distribution ------------  count    
             < 0 |                                         0        
               0 |@@@                                      28       
               1 |@@                                       17       
               2 |@@@@                                     35       
               3 |@@@@@@@@@                                72       
               4 |@@@@@@@@@@                               78       
               5 |@@@@@@@@                                 65       
               6 |                                         0

This output indicates that only the first six megabytes of the file are accessed, perhaps because the file is six megabytes in size. The output also indicates that the entire file is not accessed. If you wanted to improve the cold start time of StarOffice, you might want to understand the access pattern of the file. If the needed sections of the file could be largely contiguous, one way to improve StarOffice cold start time might be to have a scout thread run ahead of the application, inducing the I/O to the file before it's needed. This approach is straightforward if the file is accessed using mmap. However, the approximate 1.6 seconds that this strategy would gain in cold start time does not merit the additional complexity and maintenance burden in the application. Either way, the data gathered with the io provider allows a precise understanding of the benefit that such work could ultimately deliver.