11.7 io Provider

11.7.1 Probes
11.7.2 Arguments
11.7.3 bufinfo_t
11.7.4 devinfo_t
11.7.5 fileinfo_t
11.7.6 Examples
11.7.7 Stability

The io provider makes available probes that relate to data input and output. The io provider enables quick exploration of behavior observed through I/O monitoring tools such as iostat. For example, you can use the io provider to understand I/O by device, I/O type, I/O size, process, or application name.

11.7.1 Probes

Table 11.8, “io Probes” lists the io probes.

Table 11.8 io Probes

Probe

Description

start

Fires when an I/O request is about to be made either to a peripheral device or to an NFS server. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The devinfo_t of the device to which the I/O is being issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Note that file information availability depends on the file system making the I/O request. See Section 11.7.5, “fileinfo_t” for more information.

done

Fires after an I/O request has been fulfilled. The bufinfo_t corresponding to the I/O request is pointed to by args[0]. The done probe fires after the I/O completes, but before completion processing has been performed on the buffer. As a result, B_DONE is not set in b_flags at the time the done probe fires. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2].

wait-start

Fires immediately before a thread begins to wait pending completion of a given I/O request. The bufinfo_t corresponding to the I/O request for which the thread waits is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. Some time after the wait-start probe fires, the wait-done probe fires in the same thread.

wait-done

Fires when a thread finishes waiting for the completion of a given I/O request. The bufinfo_t corresponding to the I/O request for which the thread was waiting is pointed to by args[0]. The devinfo_t of the device to which the I/O was issued is pointed to by args[1]. The fileinfo_t of the file that corresponds to the I/O request is pointed to by args[2]. The wait-done probe fires only after the wait-start probe has fired in the same thread.


The io probes fire for all I/O requests to peripheral devices, and for all file read and file write requests to an NFS server. Requests for metadata from an NFS server, such as those that result from a readdir() request, do not trigger io probes.
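Because wait-start and wait-done fire in the same thread, a thread-local variable can pair them. The following is a minimal sketch, not a definitive tool: the variable name self->wait_ts and the aggregation name @wait are illustrative, and the script assumes that a thread does not begin a second wait before the first one finishes.

```d
#pragma D option quiet

/* Record when this thread starts waiting for an I/O request. */
io:::wait-start
{
  self->wait_ts = timestamp;
}

/* Match the wait-done that fires later in the same thread. */
io:::wait-done
/self->wait_ts/
{
  /* Distribution of wait times in nanoseconds, keyed by application. */
  @wait[execname] = quantize(timestamp - self->wait_ts);
  self->wait_ts = 0;
}

END
{
  printa(" %s\n%@d\n", @wait);
}
```

The predicate on wait-done discards any wait whose start was not observed, for example a wait that began before tracing was enabled.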

11.7.2 Arguments

Table 11.9, “io Probe Arguments” lists the argument types for io probes. The arguments are described in Table 11.8, “io Probes”.

Table 11.9 io Probe Arguments

Probe         args[0]         args[1]        args[2]
start         struct buf *    devinfo_t *    fileinfo_t *
done          struct buf *    devinfo_t *    fileinfo_t *
wait-start    struct buf *    devinfo_t *    fileinfo_t *
wait-done     struct buf *    devinfo_t *    fileinfo_t *


Each io probe has arguments consisting of a pointer to a buf structure, a pointer to a devinfo_t structure, and a pointer to a fileinfo_t structure. These structures are described in the following sections.

Note

DTrace does not currently support the use of fileinfo_t with io probes. In Oracle Linux, no information is readily accessible at the level where the io probes fire about the file where an I/O request originated.

11.7.3 bufinfo_t

The bufinfo_t structure is the abstraction that describes an I/O request. The buffer corresponding to an I/O request is pointed to by args[0] in the start, done, wait-start, and wait-done probes. The definition of bufinfo_t is as follows:

typedef struct bufinfo {
  int b_flags;         /* flags */
  size_t b_bcount;     /* number of bytes */
  caddr_t b_addr;      /* buffer address */
  uint64_t b_blkno;    /* expanded block # on device */
  uint64_t b_lblkno;   /* block # on device */
  size_t b_resid;      /* not supported */
  size_t b_bufsize;    /* size of allocated buffer */
  caddr_t b_iodone;    /* I/O completion routine */
  int b_error;         /* not supported */
  dev_t b_edev;        /* extended device */
} bufinfo_t;

Note

DTrace translates the members of bufinfo_t from the buffer_head for the Oracle Linux I/O request structure.

b_flags indicates the state of the I/O buffer, and consists of a bitwise-or of different state values. Table 11.10, “b_flags Values” lists the values of the supported states.

Table 11.10 b_flags Values

b_flags    Value       Description
B_READ     0x000040    Indicates that data is to be read from the peripheral device into main memory.
B_WRITE    0x000100    Indicates that the data is to be transferred from main memory to the peripheral device.


b_bcount is the number of bytes to be transferred as part of the I/O request.
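As an illustration of how b_flags and b_bcount can be combined, the following sketch builds a distribution of request sizes per device, split into reads and writes. The aggregation name @sizes and the "read" and "write" labels are illustrative choices, and any request without B_READ set is assumed to be a write.

```d
#pragma D option quiet

io:::start
{
  /* Key on device name and direction; quantize the request size in bytes. */
  @sizes[args[1]->dev_statname,
      (args[0]->b_flags & B_READ) ? "read" : "write"] =
      quantize(args[0]->b_bcount);
}

END
{
  printa(" %s (%s)\n%@d\n", @sizes);
}
```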

b_addr is the virtual address of the I/O request, unless B_PAGEIO is set. The address is a kernel virtual address unless B_PHYS is set, in which case it is a user virtual address. If B_PAGEIO is set, the b_addr field contains kernel private data. B_PHYS and B_PAGEIO are mutually exclusive: at most one of the two flags can be set.

b_lblkno identifies which logical block on the device is to be accessed. The mapping from a logical block to a physical block (such as the cylinder, track, and so on) is defined by the device.

b_bufsize contains the size of the allocated buffer.

b_iodone identifies a specific routine in the kernel that is called when the I/O is complete.

b_edev contains the major and minor device numbers of the device accessed. You can use the D subroutines getmajor and getminor to extract the major and minor device numbers from the b_edev field.
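For example, the following sketch counts I/O requests by the major and minor numbers extracted from b_edev. The aggregation name @devs and the column layout are illustrative.

```d
#pragma D option quiet

io:::start
{
  /* Split the device number into its major and minor components. */
  @devs[getmajor(args[0]->b_edev), getminor(args[0]->b_edev)] = count();
}

END
{
  printf("%5s %5s %10s\n", "MAJOR", "MINOR", "COUNT");
  printa("%5d %5d %@10d\n", @devs);
}
```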

11.7.4 devinfo_t

The devinfo_t structure provides information about a device. The devinfo_t structure corresponding to the destination device of an I/O is pointed to by args[1] in the start, done, wait-start, and wait-done probes. The definition of devinfo_t is as follows:

typedef struct devinfo {
  int dev_major;           /* major number */
  int dev_minor;           /* minor number */
  int dev_instance;        /* not supported */
  string dev_name;         /* name of device */
  string dev_statname;     /* name of device + instance/minor */
  string dev_pathname;     /* pathname of device */
} devinfo_t;

Note

DTrace translates the members of devinfo_t from the buffer_head for the Oracle Linux I/O request structure.

dev_major is the major number of the device.

dev_minor is the minor number of the device.

dev_name is the name of the device driver that manages the device.

dev_statname is the name of the device as reported by iostat. This field is provided so that aberrant iostat output can be quickly correlated to actual I/O activity.

dev_pathname is the full path of the device. The path specified by dev_pathname includes components expressing the device node, the instance number, and the minor node. However, not all three of these elements are necessarily expressed in the statistics name. For some devices, the statistics name consists of the device name and the instance number. For other devices, the name consists of the device name and the number of the minor node. As a result, two devices that have the same dev_statname may differ in their dev_pathname.
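To see how these fields relate on a particular system, a script along the following lines can list each device that receives I/O together with its driver name, statistics name, and device numbers. This is a sketch only; the column widths are arbitrary and the values reported depend on the devices present.

```d
#pragma D option quiet

io:::start
{
  /* One aggregation key per distinct device seen while tracing. */
  @[args[1]->dev_name, args[1]->dev_statname,
      args[1]->dev_major, args[1]->dev_minor] = count();
}

END
{
  printf("%-12s %-12s %5s %5s %8s\n",
      "DRIVER", "STATNAME", "MAJOR", "MINOR", "COUNT");
  printa("%-12s %-12s %5d %5d %@8d\n", @);
}
```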

11.7.5 fileinfo_t

Note

DTrace does not currently support the use of fileinfo_t with the args[2] argument of io probes. You can use the fileinfo_t structure to obtain information about a process's open files via the fds[] array. See Section 2.9.5, “Built-in Variables”.

The fileinfo_t structure provides information about a file. args[2] in the start, done, wait-start, and wait-done probes points to the file to which an I/O request corresponds. The presence of file information is contingent upon the file system providing this information when dispatching I/O requests. Some file systems, especially third-party file systems, might not provide this information. Also, I/O requests might emanate from a file system for which no file information exists. For example, any I/O from or to file system metadata is not associated with any one file. Finally, some highly optimized file systems might aggregate I/O from disjoint files into a single I/O request. In this case, the file system might provide the file information either for the file that represents the majority of the I/O or for the file that represents some of the I/O. Alternatively, the file system might provide no file information at all in this case.

The definition of fileinfo_t is as follows:

typedef struct fileinfo {
  string fi_name;           /* name (basename of fi_pathname) */
  string fi_dirname;        /* directory (dirname of fi_pathname) */
  string fi_pathname;       /* full pathname */
  offset_t fi_offset;       /* offset within file */
  string fi_fs;             /* file system */
  string fi_mount;          /* not supported */
  int fi_oflags;            /* open() flags for file descriptor */
} fileinfo_t;

The fi_name field contains the name of the file but does not include any directory components. If no file information is associated with an I/O, the fi_name field is set to the string <none>. In some rare cases, the pathname associated with a file might be unknown. In this case, the fi_name field is set to the string <unknown>.

The fi_dirname field contains only the directory component of the file name. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.

The fi_pathname field contains the full pathname to the file. As with fi_name, this string may be set to <none> if no file information is present, or <unknown> if the pathname associated with the file is not known.

The fi_offset field contains the offset within the file, or -1 if either file information is not present or if the offset is otherwise unspecified by the file system.

The fi_fs field contains the name of the file system type, or <none> if no information is present.

The fi_oflags field contains the flags that were specified when opening the file.
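Because fileinfo_t is available through the fds[] array, file-level information can be gathered at the system call layer instead of from args[2]. The following sketch counts write() calls by application and file. It assumes that the syscall provider is available and that the first argument of write() (arg0) is the file descriptor; the aggregation name @writes is illustrative.

```d
#pragma D option quiet

syscall::write:entry
{
  /* fds[] translates a file descriptor of the current process to a fileinfo_t. */
  @writes[execname, fds[arg0].fi_pathname] = count();
}

END
{
  printa("%-16s %-50s %@8d\n", @writes);
}
```

This approach counts write requests as processes issue them, not device I/O, so the results are not directly comparable with counts taken from the io probes.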

11.7.6 Examples

The following example script displays information for every I/O as it is issued:

#pragma D option quiet

BEGIN
{
  printf("%10s %58s %2s\n", "DEVICE", "FILE", "RW");
}

io:::start
{
  printf("%10s %58s %2s\n", args[1]->dev_statname,
  args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W");
}

The output from this script resembles the following example:

# dtrace -s ./iosnoop.d
    DEVICE                             FILE RW
     dm-00                  /usr/bin/evince  R
     dm-00                  /usr/bin/evince  R
     dm-00                        <unknown>  R
     dm-00                        <unknown>  R
     dm-00                        <unknown>  R
     dm-00                           <none>  R
...

The <none> entries in the output indicate that the I/O request does not correspond to the data in any particular file. Such I/O requests are due to metadata of one form or another. The <unknown> entries in the output indicate that the pathname for the file is not known.

You can make the example script slightly more sophisticated by using an associative array to track the time in milliseconds spent on each I/O, as shown in the following example:

#pragma D option quiet

BEGIN
{
  printf("%10s %58s %2s %7s\n", "DEVICE", "FILE", "RW", "MS");
}

io:::start
{
  start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
  this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
  /* elapsed is in nanoseconds: print whole milliseconds, then microseconds */
  printf("%10s %58s %2s %3d.%03d\n", args[1]->dev_statname,
    args[2]->fi_pathname, args[0]->b_flags & B_READ ? "R" : "W",
    this->elapsed / 1000000, (this->elapsed / 1000) % 1000);
  start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

The modified script adds an MS (milliseconds) column to the output.

You can aggregate on device, application, process ID and bytes transferred, as shown in the following example:

#pragma D option quiet

io:::start
{
  @[args[1]->dev_statname, execname, pid] = sum(args[0]->b_bcount);
}

END
{
  printf("%10s %20s %10s %15s\n", "DEVICE", "APP", "PID", "BYTES");
  printa("%10s %20s %10d %15@d\n", @);
}

Running this script for a few seconds results in output similar to the following example:

# dtrace -s whoio.d 
^C
    DEVICE                  APP        PID           BYTES
     dm-00               evince      14759           16384
     dm-00          flush-252:0       1367           45056
     dm-00                 bash      14758          131072
     dm-00       gvfsd-metadata       2787          135168
     dm-00               evince      14758          139264
     dm-00               evince      14338          151552
     dm-00          jbd2/dm-0-8        390          356352

If you are copying data from one device to another, you might want to know if one of the devices acts as a limiter on the copy. To answer this question, you need to know the effective throughput of each device rather than the number of bytes per second that each device is transferring. You can determine throughput with the following example script:

#pragma D option quiet

io:::start
{
  start[args[0]->b_edev, args[0]->b_blkno] = timestamp;
}

io:::done
/start[args[0]->b_edev, args[0]->b_blkno]/
{
  /*
   * We want to get an idea of our throughput to this device in KB/sec.
   * What we have, however, is nanoseconds and bytes. That is we want
   * to calculate:
   *
   * bytes / 1024
   * ------------------------
   * nanoseconds / 1000000000
   *
   * But we cannot calculate this using integer arithmetic without losing
   * precision (the denominator, for one, is between 0 and 1 for nearly
   * all I/Os). So we restate the fraction, and cancel:
   *
   * bytes       1000000000      bytes       976562
   * --------- * ------------- = --------- * -------------
   * 1024        nanoseconds     1           nanoseconds
   *
   * This is easy to calculate using integer arithmetic.
   */
  this->elapsed = timestamp - start[args[0]->b_edev, args[0]->b_blkno];
  @[args[1]->dev_statname, args[1]->dev_pathname] =
    quantize((args[0]->b_bcount * 976562) / this->elapsed);
  start[args[0]->b_edev, args[0]->b_blkno] = 0;
}

END
{
  printa(" %s (%s)\n%@d\n", @);
}

Running the example script for several seconds while copying data from a hard disk to a USB drive yields the following output:

 sdc1 (/dev/sdc1)

           value  ------------- Distribution ------------- count    
              32 |                                         0
              64 |                                         3
             128 |                                         1
             256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  2257
             512 |                                         1
            1024 |                                         0       

 dm-00 (/dev/dm-00)

           value  ------------- Distribution ------------- count    
             128 |                                         0
             256 |                                         1
             512 |                                         0
            1024 |                                         2
            2048 |                                         0
            4096 |                                         2
            8192 |@@@@@@@@@@@@@@@@@@                       172
           16384 |@@@@@                                    52
           32768 |@@@@@@@@@@@                              108
           65536 |@@@                                      34
          131072 |                                         0     

The output shows that the USB drive (sdc1) is clearly the limiting device. The throughput of sdc1 is between 256 KB/sec and 512 KB/sec, while dm-00 delivered I/O at anywhere from 8 MB/sec to over 64 MB/sec.

11.7.7 Stability

The io provider uses DTrace's stability mechanism to describe its stabilities, as shown in the following table.

Element      Name Stability    Data Stability    Dependency Class
Provider     Evolving          Evolving          ISA
Module       Private           Private           Unknown
Function     Private           Private           Unknown
Name         Evolving          Evolving          ISA
Arguments    Evolving          Evolving          ISA

For more information about the stability mechanism, see Chapter 15, Stability.