Writing Device Drivers

Chapter 11 Drivers for Block Devices

This chapter describes the structure of block device drivers. The kernel views a block device as a set of randomly accessible logical blocks. The file system buffers the data blocks between a block device and the user space using a list of buf(9S) structures. Only block devices can support a file system.

This chapter provides information on the following subjects:

Block Driver Structure Overview

Figure 11–1 shows data structures and routines that define the structure of a block device driver. Device drivers typically include the following:

Device-loadable driver section
Device configuration section
Device access section

The shaded device access section in Figure 11–1 illustrates block driver entry points.

Figure 11–1 Block Driver Roadmap

Diagram shows structures and entry points for block device drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. See Chapter 5, Driver Autoconfiguration, for details regarding driver data structures.

Note –

Some of the entry points can be replaced by nodev(9F) or nulldev(9F) as appropriate.

File I/O

A file system is a tree-structured hierarchy of directories and files. Some file systems, such as the UNIX File System (UFS), reside on block-oriented devices. File systems are created by format(1M) and newfs(1M).

When an application issues a read(2) or write(2) system call to an ordinary file on the UFS file system, the file system can call the device driver strategy(9E) entry point for the block device on which the file system resides. The file system code can call strategy(9E) several times for a single read(2) or write(2) system call.

The file system code determines the logical device address, or logical block number, for each ordinary file block and builds a block I/O request in the form of a buf(9S) structure directed at the block device. The driver strategy(9E) entry point then interprets the buf(9S) structure and completes the request.

Block Device Autoconfiguration

attach(9E) should perform the common initialization tasks for each instance of a device. Typically, these tasks include:

Allocating per-instance state structures
Mapping the device's registers
Registering device interrupts
Initializing mutex and condition variables
Creating power manageable components
Creating minor nodes

Block device drivers create minor nodes of type S_IFBLK. This causes a block special file representing the node to eventually appear in the /devices hierarchy.

Logical device names for block devices appear in the /dev/dsk directory, and consist of a controller number, bus-address number, disk number, and slice number. These names are created by the devfsadm(1M) program if the node type is set to DDI_NT_BLOCK or DDI_NT_BLOCK_CHAN. DDI_NT_BLOCK_CHAN should be specified if the device communicates on a channel (a bus with an additional level of addressability), such as SCSI disks, and causes a bus-address field (tN) to appear in the logical name. DDI_NT_BLOCK should be used for most other devices.

For each minor device (which corresponds to each partition on the disk), the driver must also create an nblocks or Nblocks property. This integer property gives the number of blocks supported by the minor device, expressed in units of DEV_BSIZE (512 bytes). The file system uses the nblocks and Nblocks properties to determine device limits. Nblocks is the 64-bit version of nblocks and should be used with storage devices that provide more than 1 Tbyte of storage per disk. See Device Properties for more information.
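The unit conversion and the 1 Tbyte limit can be checked with a little arithmetic. The following is a minimal user-space sketch, not driver code: XX_DEV_BSIZE and both helper names are hypothetical stand-ins, and it assumes that the nblocks value is carried in a 32-bit signed int.

```c
#include <stdint.h>

#define XX_DEV_BSIZE 512u   /* hypothetical stand-in for DEV_BSIZE */

/* Number of whole DEV_BSIZE blocks in a device of the given byte size. */
static uint64_t
xx_bytes_to_blocks(uint64_t nbytes)
{
    return (nbytes / XX_DEV_BSIZE);
}

/*
 * Returns 1 if the block count can be stored in a 32-bit signed
 * property value (the assumption made here for "nblocks"), 0 if the
 * 64-bit "Nblocks" property is needed instead.
 */
static int
xx_fits_nblocks(uint64_t blocks)
{
    return (blocks <= (uint64_t)INT32_MAX);
}
```

Under that assumption, a device of exactly 2^40 bytes has 2^31 blocks, one more than INT32_MAX, which is why the sketch reports that such a device needs Nblocks.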

Example 11–1 shows a typical attach(9E) entry point with emphasis on creating the device's minor node and the Nblocks property. Note that because this example uses Nblocks and not nblocks, it calls ddi_prop_update_int64(9F) instead of ddi_prop_update_int(9F).

As a side note, this example shows the use of makedevice(9F) to create a device number for ddi_prop_update_int64(9F). The major number passed to makedevice(9F) comes from ddi_driver_major(9F), which derives a major number from a pointer to a dev_info_t structure, much as getmajor(9F) extracts the major number from a dev_t device number.

Example 11–1 Block Driver attach(9E) Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
     int instance = ddi_get_instance(dip);
     struct xxstate *xsp;
     major_t maj_number;

     switch (cmd) {
     case DDI_ATTACH:
          /*
           * Pseudocode for the common initialization steps:
           *   allocate a state structure (xsp) and initialize it
           *   map the device's registers
           *   add the device driver's interrupt handler(s)
           *   initialize any mutexes and condition variables
           *   read label information if the device is a disk
           *   create power manageable components
           */
          /*
           * Create the device minor node. Note that the node_type
           * argument is set to DDI_NT_BLOCK.
           */
          if (ddi_create_minor_node(dip, "minor_name", S_IFBLK,
              instance, DDI_NT_BLOCK, 0) == DDI_FAILURE) {
               /* free resources allocated so far */
               /* Remove any previously allocated minor nodes */
               ddi_remove_minor_node(dip, NULL);
               return (DDI_FAILURE);
          }
          /*
           * Create driver properties like "Nblocks". If the device
           * is a disk, the Nblocks property is usually calculated from
           * information in the disk label.  Use "Nblocks" instead of
           * "nblocks" to ensure the property works for large disks.
           */
          /* Pseudocode: xsp->Nblocks = size of device in 512-byte blocks */
          maj_number = ddi_driver_major(dip);
          if (ddi_prop_update_int64(makedevice(maj_number, instance), dip,
              "Nblocks", xsp->Nblocks) != DDI_PROP_SUCCESS) {
               cmn_err(CE_CONT, "%s: cannot create Nblocks property\n",
                   ddi_get_name(dip));
               /* free resources allocated so far */
               return (DDI_FAILURE);
          }
          xsp->open = 0;
          xsp->nlayered = 0;
          ...
          return (DDI_SUCCESS);

     case DDI_RESUME:
          /* For information, see Chapter 9, Power Management */
          /* FALLTHROUGH */
     default:
          return (DDI_FAILURE);
     }
}

Controlling Device Access

This section describes aspects of the open() and close() entry points that are specific to block device drivers. See Chapter 10, Drivers for Character Devices for more information on open(9E) and close(9E).

`open()` Entry Point (Block Drivers)

The open(9E) entry point is used to gain access to a given device. The open(9E) routine of a block driver is called when a user thread issues an open(2) or mount(2) system call on a block special file associated with the minor device, or when a layered driver calls open(9E). See File I/O for more information.

The open(9E) entry point should check for the following:

The device can be opened; for example, it is online and ready.
The device can be opened as requested; the device supports the operation, and the device's current state does not conflict with the request.
The caller has permission to open the device.

Example 11–2 demonstrates a block driver open(9E) entry point.

Example 11–2 Block Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flags, int otyp, cred_t *credp)
{
    minor_t instance;
    struct xxstate *xsp;

    instance = getminor(*devp);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL)
        return (ENXIO);
    mutex_enter(&xsp->mu);
    /*
     * only honor FEXCL. If a regular open or a layered open
     * is still outstanding on the device, the exclusive open
     * must fail.
     */
    if ((flags & FEXCL) && (xsp->open || xsp->nlayered)) {
        mutex_exit(&xsp->mu);
        return (EAGAIN);
    }
    switch (otyp) {
    case OTYP_LYR:
        xsp->nlayered++;
        break;
    case OTYP_BLK:
        xsp->open = 1;
        break;
    default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
    }
    mutex_exit(&xsp->mu);
    return (0);
}

The otyp argument is used to specify the type of open on the device. OTYP_BLK is the typical open type for a block device. A device can be opened several times with otyp set to OTYP_BLK, although close(9E) will be called only once when the final close of type OTYP_BLK has occurred for the device. otyp is set to OTYP_LYR if the device is being used as a layered device. For every open of type OTYP_LYR, the layering driver issues a corresponding close of type OTYP_LYR. The example keeps track of each type of open so the driver can determine when the device is not being used in close(9E).

`close()` Entry Point (Block Drivers)

The arguments of the close(9E) entry point are identical to arguments of open(9E), except that dev is the device number, as opposed to a pointer to the device number.

The close(9E) routine should verify otyp in the same way as was described for the open(9E) entry point. In Example 11–3, close(9E) must determine when the device can really be closed based on the number of block opens and layered opens.

Example 11–3 Block Device close(9E) Routine

static int
xxclose(dev_t dev, int flag, int otyp, cred_t *credp)
{
    minor_t instance;
    struct xxstate *xsp;

    instance = getminor(dev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL)
        return (ENXIO);
    mutex_enter(&xsp->mu);
    switch (otyp) {
    case OTYP_LYR:
        xsp->nlayered--;
        break;
    case OTYP_BLK:
        xsp->open = 0;
        break;
    default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
    }

    if (xsp->open || xsp->nlayered) {
        /* not done yet */
        mutex_exit(&xsp->mu);
        return (0);
    }
    /* cleanup (rewind tape, free memory, etc.) */
    /* wait for I/O to drain */
    mutex_exit(&xsp->mu);

    return (0);
}
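The bookkeeping shared by the open and close examples can be exercised outside the kernel. The following is a minimal user-space sketch under stated assumptions: the xx_otyp enumeration, the xx_opens structure, and both function names are hypothetical stand-ins for the OTYP_* constants and the state structure fields, and locking is omitted because the model is single-threaded.

```c
#include <errno.h>

/* Hypothetical stand-ins for OTYP_BLK, OTYP_LYR, and any other type. */
enum xx_otyp { XX_OTYP_BLK, XX_OTYP_LYR, XX_OTYP_OTHER };

struct xx_opens {
    int open;       /* nonzero while a block open is outstanding */
    int nlayered;   /* number of outstanding layered opens */
};

/* Mirrors the open logic: refuse an exclusive open while in use. */
static int
xx_model_open(struct xx_opens *s, enum xx_otyp otyp, int exclusive)
{
    if (exclusive && (s->open || s->nlayered))
        return (EAGAIN);
    switch (otyp) {
    case XX_OTYP_LYR:
        s->nlayered++;
        break;
    case XX_OTYP_BLK:
        s->open = 1;
        break;
    default:
        return (EINVAL);
    }
    return (0);
}

/* Mirrors the close logic: *idle is set once no opens remain. */
static int
xx_model_close(struct xx_opens *s, enum xx_otyp otyp, int *idle)
{
    switch (otyp) {
    case XX_OTYP_LYR:
        s->nlayered--;
        break;
    case XX_OTYP_BLK:
        s->open = 0;
        break;
    default:
        return (EINVAL);
    }
    *idle = !(s->open || s->nlayered);
    return (0);
}
```

A flag is enough for block opens because close(9E) is called only once for all OTYP_BLK opens, whereas every layered open has a matching close and therefore needs a counter.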

`strategy()` Entry Point

The strategy(9E) entry point is used to read and write data buffers to and from a block device. The name strategy refers to the fact that this entry point might implement some optimal strategy for ordering requests to the device.

strategy(9E) can be written to process one request at a time (synchronous transfer), or to queue multiple requests to the device (asynchronous transfer). When choosing a method, the abilities and limitations of the device should be taken into account.

The strategy(9E) routine is passed a pointer to a buf(9S) structure. This structure describes the transfer request, and contains status information on return. buf(9S) and strategy(9E) are the focus of block device operations.

`buf` Structure

The following buf structure members are important to block drivers:

     int              b_flags;         /* Buffer status */
     struct buf       *av_forw;        /* Driver work list link */
     struct buf       *av_back;        /* Driver work list link */
     size_t           b_bcount;        /* # of bytes to transfer */
     union {
         caddr_t      b_addr;          /* Buffer's virtual address */
     } b_un;
     daddr_t          b_blkno;         /* Block number on device */
     diskaddr_t       b_lblkno;        /* Expanded block number on device */
     size_t           b_resid;         /* # of bytes not transferred */
                                       /* after error */
     int              b_error;         /* Expanded error field */
     void             *b_private;      /* "opaque" driver private area */
     dev_t            b_edev;          /* expanded dev field */

b_flags contains status and transfer attributes of the buf structure. If B_READ is set, the buf structure indicates a transfer from the device to memory; otherwise, it indicates a transfer from memory to the device. If the driver encounters an error during data transfer, the B_ERROR bit in b_flags must be set and a more specific error value placed in b_error. Drivers should use bioerror(9F) to do both rather than setting B_ERROR directly.

Caution –

Drivers should never clear b_flags.

av_forw and av_back: Pointers that the driver can use to manage a list of buffers. See Asynchronous Data Transfers (Block Drivers) for a discussion of the av_forw and av_back pointers.
b_bcount: Specifies the number of bytes to be transferred by the device.
b_un.b_addr: The kernel virtual address of the data buffer. Valid only after a call to bp_mapin(9F).
b_blkno: The starting 32-bit logical block number on the device for the data transfer, expressed in DEV_BSIZE (512 bytes) units. The driver should use either b_blkno or b_lblkno, but not both.
b_lblkno: The starting 64-bit logical block number on the device for the data transfer, expressed in DEV_BSIZE (512 bytes) units. The driver should use either b_blkno or b_lblkno, but not both.
b_resid: Set by the driver to indicate the number of bytes that were not transferred because of an error. See Example 11–8 for an example of setting b_resid. The b_resid member is overloaded: it is also used by disksort(9F).
b_error: Set to an error number by the driver when a transfer error occurs. It is set in conjunction with the b_flags B_ERROR bit. See Intro(9E) for details regarding error values. Drivers should use bioerror(9F) rather than setting b_error directly.
b_private: For exclusive use by the driver to store driver-private data.
b_edev: Contains the device number of the device involved in the transfer.
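The interaction between b_flags and b_error can be modeled in a few lines. The following is a user-space sketch, not the kernel implementation: struct xx_buf, xx_bioerror(), xx_is_read(), and the flag values are hypothetical stand-ins. It assumes the behavior documented for bioerror(9F), where a nonzero error marks the buffer as failed and an error of 0 clears the error condition.

```c
#include <errno.h>   /* EIO and friends, for callers of xx_bioerror() */

/* Illustrative flag values; the real ones come from <sys/buf.h>. */
#define XX_B_ERROR 0x0004
#define XX_B_READ  0x0040

struct xx_buf {
    int b_flags;    /* status and transfer attributes */
    int b_error;    /* specific error value */
};

/*
 * Model of bioerror(9F): touch only the error bit, never the other
 * bits in b_flags.
 */
static void
xx_bioerror(struct xx_buf *bp, int error)
{
    if (error != 0)
        bp->b_flags |= XX_B_ERROR;
    else
        bp->b_flags &= ~XX_B_ERROR;
    bp->b_error = error;
}

/* Nonzero if the request transfers from the device to memory. */
static int
xx_is_read(const struct xx_buf *bp)
{
    return ((bp->b_flags & XX_B_READ) != 0);
}
```

The model changes a single bit and leaves transfer attributes such as the read flag intact, which is the reason a driver should never clear b_flags wholesale.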

`bp_mapin()` Function

When a buf structure pointer is passed into the device driver's strategy(9E) routine, the data buffer referred to by b_un.b_addr is not necessarily mapped in the kernel's address space. This means that the driver cannot directly access the data. Most block-oriented devices have DMA capability, and therefore do not need to access the data buffer directly. Instead, they use the DMA mapping routines to allow the device's DMA engine to do the data transfer. For details about using DMA, see Chapter 8, Direct Memory Access (DMA).

If a driver needs to directly access the data buffer (as opposed to having the device access the data), it must first map the buffer into the kernel's address space using bp_mapin(9F). bp_mapout(9F) should be used when the driver no longer needs to access the data directly.

Caution –

bp_mapout(9F) should only be called on buffers that have been allocated and are owned by the device driver. It must not be called on buffers passed to the driver through the strategy(9E) entry point (for example, by a file system). Because bp_mapin(9F) does not keep a reference count, bp_mapout(9F) will remove any kernel mapping that a layer above the device driver might rely on.

Synchronous Data Transfers (Block Drivers)

This section presents a simple method for performing synchronous I/O transfers. It assumes that the hardware is a simple disk device that can transfer only one data buffer at a time using DMA, and that the disk can be spun up and spun down by software command. The device driver's strategy(9E) routine waits for the current request to be completed before accepting a new one. The device interrupts when the transfer is complete or when an error occurs.

Check for invalid buf(9S) requests.

Check the buf(9S) structure passed to strategy(9E) for validity. All drivers should check that:
- The request begins at a valid block. The driver converts the b_blkno field to the correct device offset and then determines if the offset is valid for the device.
- The request does not go beyond the last block on the device.
- Device-specific requirements are met.
If an error is encountered, the driver should indicate the appropriate error with bioerror(9F) and complete the request by calling biodone(9F). biodone(9F) notifies the caller of strategy(9E) that the transfer is complete (in this case, because of an error).
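The first two checks can be sketched as a standalone range test. The following is a user-space illustration under stated assumptions: XX_DEV_BSIZE and xx_request_in_range() are hypothetical names, the block number and byte count correspond to b_blkno and b_bcount, and the device size corresponds to the Nblocks value kept in the driver state.

```c
#include <stddef.h>
#include <stdint.h>

#define XX_DEV_BSIZE 512u   /* hypothetical stand-in for DEV_BSIZE */

/*
 * Returns 1 if a request of bcount bytes starting at block blkno lies
 * entirely within a device of nblocks blocks, 0 otherwise.
 */
static int
xx_request_in_range(int64_t blkno, size_t bcount, uint64_t nblocks)
{
    uint64_t nblk;

    /* The request must begin at a valid block. */
    if (blkno < 0 || (uint64_t)blkno >= nblocks)
        return (0);

    /* Round the byte count up to whole blocks. */
    nblk = (uint64_t)bcount / XX_DEV_BSIZE;
    if (bcount % XX_DEV_BSIZE != 0)
        nblk++;

    /*
     * The request must not go beyond the last block. Comparing
     * against the blocks that remain avoids overflow in blkno + nblk.
     */
    return (nblk <= nblocks - (uint64_t)blkno);
}
```

A driver that finds a request out of range would then report the failure with bioerror(9F) and biodone(9F) as described above.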

Check whether the device is busy.

Synchronous data transfers allow single-threaded access to the device. The device driver enforces this by maintaining a busy flag (guarded by a mutex), and by waiting on a condition variable with cv_wait(9F) when the device is busy.

If the device is busy, the thread waits until a cv_broadcast(9F) or cv_signal(9F) from the interrupt handler indicates that the device is no longer busy. See Chapter 3, Multithreading for details on condition variables.

When the device is no longer busy, the strategy(9E) routine marks it as busy and prepares the buffer and the device for the transfer.
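The busy-flag protocol has a close user-space analogue. The following sketch uses POSIX threads in place of the kernel primitives, so pthread_mutex_lock() and pthread_cond_wait() stand in for mutex_enter(9F) and cv_wait(9F); struct xx_gate and its functions are hypothetical names, not part of any driver interface.

```c
#include <pthread.h>

struct xx_gate {
    pthread_mutex_t mu;   /* guards busy */
    pthread_cond_t cv;    /* signaled when busy is cleared */
    int busy;
};

static void
xx_gate_init(struct xx_gate *g)
{
    pthread_mutex_init(&g->mu, NULL);
    pthread_cond_init(&g->cv, NULL);
    g->busy = 0;
}

/* What strategy does: wait until the device is free, then claim it. */
static void
xx_gate_acquire(struct xx_gate *g)
{
    pthread_mutex_lock(&g->mu);
    while (g->busy)
        pthread_cond_wait(&g->cv, &g->mu);
    g->busy = 1;
    pthread_mutex_unlock(&g->mu);
}

/* What the interrupt handler does: release the device, wake a waiter. */
static void
xx_gate_release(struct xx_gate *g)
{
    pthread_mutex_lock(&g->mu);
    g->busy = 0;
    pthread_cond_signal(&g->cv);
    pthread_mutex_unlock(&g->mu);
}
```

The wait is wrapped in a while loop rather than an if so that a woken thread rechecks the flag, because another thread might have claimed the device first.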

Set up the buffer for DMA.

Prepare the data buffer for a DMA transfer by allocating a DMA handle using ddi_dma_alloc_handle(9F) and binding the data buffer to the handle using ddi_dma_buf_bind_handle(9F). See Chapter 8, Direct Memory Access (DMA) for information on setting up DMA resources and related data structures.

Begin the transfer.

At this point, a pointer to the buf(9S) structure is saved in the state structure of the device. The interrupt routine can then complete the transfer by calling biodone(9F).

The device driver then accesses device registers to initiate a data transfer. In most cases, the driver should protect the device registers from other threads by using mutexes. In this case, because strategy(9E) is single-threaded, guarding the device registers is not necessary. (See Chapter 3, Multithreading for details about data locks.)

Once the executing thread has started the device's DMA engine, the driver can return execution control to the calling routine, as shown in Example 11–4:

Example 11–4 Synchronous Block Driver strategy(9E) Routine

static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    struct device_reg *regp;
    minor_t instance;
    ddi_dma_cookie_t cookie;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL) {
           bioerror(bp, ENXIO);
           biodone(bp);
           return (0);
    }
    /* validate the transfer request */
    if ((bp->b_blkno >= xsp->Nblocks) || (bp->b_blkno < 0)) {
           bioerror(bp, EINVAL);    
           biodone(bp);
           return (0);
    }
    /*
     * Hold off all threads until the device is not busy.
     */
    mutex_enter(&xsp->mu);
    while (xsp->busy) {
           cv_wait(&xsp->cv, &xsp->mu);
    }
    xsp->busy = 1;
    mutex_exit(&xsp->mu);
    /*
     * If the device has power manageable components (see Chapter 9,
     * Power Management), mark the device busy with
     * pm_busy_component(9F), and then ensure that the device is
     * powered up by calling ddi_dev_is_needed(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F), which fills in cookie.
     */


    xsp->bp = bp;
    regp = xsp->regp;
    ddi_put32(xsp->data_access_handle, &regp->dma_addr,
            cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &regp->dma_size,
             (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &regp->csr,
             ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}

Handle the interrupting device.

When the device finishes the data transfer it generates an interrupt, which eventually results in the driver's interrupt routine being called. Most drivers specify the state structure of the device as the argument to the interrupt routine when registering interrupts (see the ddi_add_intr(9F) man page and Registering Interrupts). The interrupt routine can then access the buf(9S) structure being transferred, plus any other information available from the state structure.

The interrupt handler should check the device's status register to determine if the transfer completed without error. If an error occurred, the handler should indicate the appropriate error with bioerror(9F). The handler should also clear the pending interrupt for the device and then complete the transfer by calling biodone(9F).

As the final task, the handler clears the busy flag and calls cv_signal(9F) or cv_broadcast(9F) on the condition variable, signaling that the device is no longer busy. This allows other threads waiting for the device (in strategy(9E)) to proceed with the next data transfer.

Example 11–5 Synchronous Block Driver Interrupt Routine

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
           mutex_exit(&xsp->mu);
           return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
           /* failure */
           bp->b_resid = bp->b_bcount;
           bioerror(bp, EIO);
    } else {
           /* success */
           bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
           CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were marked
     * busy in strategy(9E), mark them idle now with
     * pm_idle_component(9F). Release any resources used in the
     * transfer, such as DMA resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     */

    /* Let the next I/O thread have access to the device */
    xsp->busy = 0;
    cv_signal(&xsp->cv);
    mutex_exit(&xsp->mu);
    return (DDI_INTR_CLAIMED);
}

Asynchronous Data Transfers (Block Drivers)

This section presents a method for performing asynchronous I/O transfers. The driver queues the I/O requests and then returns control to the caller. Again, the assumption is that the hardware is a simple disk device that allows one transfer at a time. The device interrupts when a data transfer has completed or when an error occurs.

Check for invalid buf(9S) requests.

As in the synchronous case, the device driver should check the buf(9S) structure passed to strategy(9E) for validity. See Synchronous Data Transfers (Block Drivers) for more details.

Enqueue the request.

Unlike synchronous data transfers, a driver does not wait for an asynchronous request to complete. Instead, it adds the request to a queue. The head of the queue can be the current transfer, or a separate field in the state structure can be used to hold the active request (as in Example 11–6). If the queue was initially empty, then the hardware is not busy, and strategy(9E) starts the transfer before returning. Otherwise, whenever a transfer completes and the queue is non-empty, the interrupt routine begins a new transfer. This example actually places the decision of whether to start a new transfer into a separate routine for convenience.

The driver can use the av_forw and the av_back members of the buf(9S) structure to manage a list of transfer requests. A single pointer can be used to manage a singly linked list, or both pointers can be used together to build a doubly linked list. The device hardware specification specifies which type of list management (such as insertion policies) will optimize the performance of the device. The transfer list is a per-device list, so the head and tail of the list are stored in the state structure.

Example 11–6 allows multiple threads access to the driver shared data, so you must identify any such data (such as the transfer list) and protect it with a mutex. (See Chapter 3, Multithreading for more details about mutex locks.)

Example 11–6 Asynchronous Block Driver strategy(9E) Routine
```
static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    minor_t instance;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    ...
    /* validate transfer request */
    ...
    /*
     * Add the request to the end of the queue. Depending on the device,
     * a sorting algorithm, such as disksort(9F), may be used if it
     * improves the performance of the device.
     */
    mutex_enter(&xsp->mu);
    bp->av_forw = NULL;
    if (xsp->list_head) {
           /* Non-empty transfer list */
           xsp->list_tail->av_forw = bp;
           xsp->list_tail = bp;
    } else {
           /* Empty Transfer list */
           xsp->list_head = bp;
           xsp->list_tail = bp;
    }
    mutex_exit(&xsp->mu);
    /* Start the transfer if possible */
    (void) xxstart((caddr_t)xsp);
    return (0);
}
```

Start the first transfer.

Device drivers that implement queuing usually have a start() routine. start() dequeues the next request and starts the data transfer to or from the device. In this example, start() processes all requests, regardless of the state of the device (busy or free).

Note –

start() must be written so that it can be called from any context, because it can be called by both the strategy routine (in kernel context) and the interrupt routine (in interrupt context).

start() is called by strategy(9E) every time it queues a request so that an idle device can be started. If the device is busy, start() returns immediately.

start() is also called by the interrupt handler before it returns from a claimed interrupt so that a nonempty queue can be serviced. If the queue is empty, start() returns immediately.

Because start() is a private driver routine, it can take any arguments and return any type. Example 11–7 is written as if it will also be used as a DMA callback (although that portion is not shown), so it must take a caddr_t as an argument and return an int. See Handling Resource Allocation Failures for more information about DMA callback routines.

Example 11–7 Block Driver `start()` Routine

static int
xxstart(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    ddi_dma_cookie_t cookie;

    mutex_enter(&xsp->mu);
    /*
     * If there is nothing more to do, or the device is
     * busy, return.
     */
    if (xsp->list_head == NULL || xsp->busy) {
           mutex_exit(&xsp->mu);
           return (0);
    }
    xsp->busy = 1;
    /* Get the first buffer off the transfer list */
    bp = xsp->list_head;
    /* Update the head and tail pointer */
    xsp->list_head = xsp->list_head->av_forw;
    if (xsp->list_head == NULL)
           xsp->list_tail = NULL;
    bp->av_forw = NULL;
    mutex_exit(&xsp->mu);
    
    /*
     * If the device has power manageable components (see Chapter 9,
     * Power Management), mark the device busy with
     * pm_busy_component(9F), and then ensure that the device is
     * powered up by calling ddi_dev_is_needed(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F), which fills in the DMA cookie.
     */
    xsp->bp = bp;
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_addr,
            cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_size,
             (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
             ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}

Handle the interrupting device.

The interrupt routine is similar to the synchronous version, with the addition of the call to start() and the removal of the call to cv_signal(9F).

Example 11–8 Asynchronous Block Driver Interrupt Routine

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
            mutex_exit(&xsp->mu);
            return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
            /* failure */
            bp->b_resid = bp->b_bcount;
            bioerror(bp, EIO);
    } else {
            /* success */
            bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
            CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were marked
     * busy in strategy(9E), mark them idle now with
     * pm_idle_component(9F). Release any resources used in the
     * transfer, such as DMA resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     */
    /* Let the next I/O thread have access to the device */
    xsp->busy = 0;
    mutex_exit(&xsp->mu);
    (void) xxstart((caddr_t)xsp);
    return (DDI_INTR_CLAIMED);
}
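The interplay among the three asynchronous routines can be traced with a small simulation. The following is a single-threaded user-space sketch under stated assumptions: struct xx_req and struct xx_dev are hypothetical stand-ins for buf(9S) and the driver state structure, the done field stands in for biodone(9F), and locking is omitted because nothing runs concurrently here.

```c
#include <stddef.h>

struct xx_req {
    struct xx_req *av_forw;   /* work list link */
    int id;
    int done;                 /* set where a driver would call biodone */
};

struct xx_dev {
    struct xx_req *list_head;
    struct xx_req *list_tail;
    struct xx_req *active;    /* request currently "on the hardware" */
    int busy;
};

/* Start the next queued request unless the device is busy or idle. */
static void
xx_sim_start(struct xx_dev *d)
{
    struct xx_req *r;

    if (d->list_head == NULL || d->busy)
        return;
    d->busy = 1;
    r = d->list_head;
    d->list_head = r->av_forw;
    if (d->list_head == NULL)
        d->list_tail = NULL;
    r->av_forw = NULL;
    d->active = r;
}

/* Queue a request at the tail, then try to start the device. */
static void
xx_sim_strategy(struct xx_dev *d, struct xx_req *r)
{
    r->av_forw = NULL;
    if (d->list_head != NULL)
        d->list_tail->av_forw = r;
    else
        d->list_head = r;
    d->list_tail = r;
    xx_sim_start(d);
}

/* Complete the active request, then start the next one, if any. */
static void
xx_sim_intr(struct xx_dev *d)
{
    struct xx_req *r = d->active;

    if (r == NULL)
        return;               /* nothing outstanding: unclaimed */
    d->active = NULL;
    r->done = 1;
    d->busy = 0;
    xx_sim_start(d);
}
```

Queuing three requests in a row starts only the first. Each simulated interrupt then completes the active request and starts the next, so requests finish in the order in which they were queued.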

Miscellaneous Entry Points

This section discusses the dump(9E) and print(9E) entry points.

`dump()` Entry Point (Block Drivers)

The dump(9E) entry point is used to copy a portion of virtual address space directly to the specified device in the case of a system failure. It is also used to copy the state of the kernel out to disk during a checkpoint operation (see the cpr(7) and dump(9E) man pages). It must be capable of performing this operation without the use of interrupts, since they are disabled during the checkpoint operation.

int dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblk)

dev is the device number of the device to dump to, addr is the base kernel virtual address at which to start the dump, blkno is the first block to dump to, and nblk is the number of blocks to dump. The dump depends upon the existing driver working properly.
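Because interrupts are unavailable, a dump routine typically has to poll the device for completion instead of waiting for the interrupt handler. The following is a minimal user-space sketch of such a bounded polling loop. Everything in it is hypothetical: the status bits, the fake device that reports completion after a fixed number of reads, and the return codes are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical status register bits. */
#define XX_ST_DONE  0x01u
#define XX_ST_ERROR 0x02u

/* Fake device: reports completion after a fixed number of reads. */
struct xx_fake_dev {
    int reads_until_done;
    int fail;               /* nonzero: complete with an error */
};

static uint8_t
xx_read_status(struct xx_fake_dev *d)
{
    if (d->reads_until_done > 0) {
        d->reads_until_done--;
        return (0);
    }
    return (d->fail ? (XX_ST_DONE | XX_ST_ERROR) : XX_ST_DONE);
}

/*
 * Poll until the transfer completes or max_polls is exhausted.
 * Returns 0 on success, -1 on a device error, -2 on timeout.
 */
static int
xx_poll_done(struct xx_fake_dev *d, int max_polls)
{
    int i;

    for (i = 0; i < max_polls; i++) {
        uint8_t st = xx_read_status(d);

        if (st & XX_ST_DONE)
            return ((st & XX_ST_ERROR) ? -1 : 0);
        /* A real driver would busy-wait briefly here. */
    }
    return (-2);
}
```

Bounding the number of polls keeps a failed device from hanging the dump indefinitely.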

`print()` Entry Point (Block Drivers)

int print(dev_t dev, char *str)

The print(9E) entry point is called by the system to display a message about an exception it has detected. print(9E) should call cmn_err(9F) to post the message to the console on behalf of the system. Here is an example:

static int
xxprint(dev_t dev, char *str)
{
    cmn_err(CE_CONT, "xx: %s\n", str);
    return (0);
}

Disk Device Drivers

Disk devices represent an important class of block device drivers.

Disk `ioctl`s

Solaris disk drivers must support a minimum set of disk-specific ioctl commands. These I/O controls are specified in the dkio(7) manual page. Disk I/O controls transfer disk information to or from the device driver. A Solaris disk device is one that is supported by disk utility commands such as format(1M) and newfs(1M). Table 11–1 lists the mandatory Solaris disk I/O controls.

Table 11–1 Mandatory Solaris Disk ioctls


`ioctl`	Description
`DKIOCINFO`	Returns information describing the disk controller
`DKIOCGAPART`	Returns a disk's partition map
`DKIOCSAPART`	Sets a disk's partition map
`DKIOCGGEOM`	Returns a disk's geometry
`DKIOCSGEOM`	Sets a disk's geometry
`DKIOCGVTOC`	Returns a disk's Volume Table of Contents
`DKIOCSVTOC`	Sets a disk's Volume Table of Contents

Disk Performance

The Solaris DDI/DKI provides facilities to optimize I/O transfers for improved file system performance. It supports a mechanism to manage the list of I/O requests so as to optimize disk access for a file system. See Asynchronous Data Transfers (Block Drivers) for a description of enqueuing an I/O request.

The diskhd structure is used to manage a linked list of I/O requests.

struct diskhd {
    long     b_flags;                 /* not used, needed for consistency*/
    struct   buf *b_forw,    *b_back;       /* queue of unit queues */
    struct   buf *av_forw,    *av_back;    /* queue of bufs for this unit */
    long     b_bcount;                    /* active flag */
};

The diskhd data structure has two buf pointers that the driver can manipulate. The av_forw pointer points to the first active I/O request. The second pointer, av_back, points to the last active request on the list.

A pointer to this structure is passed as an argument to disksort(9F), along with a pointer to the current buf structure being processed. The disksort(9F) routine sorts the buf requests in a fashion that optimizes disk seek time and then inserts the buf pointer into the diskhd list. The disksort(9F) routine uses the value in the b_resid member of the buf structure as a sort key. The driver is responsible for setting this value. Most Sun disk drivers use the cylinder group as the sort key. This tends to optimize the file system read-ahead accesses.
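The idea of keying a request list on b_resid can be illustrated with a simplified sorted insert. The following is a user-space sketch and not the disksort(9F) algorithm, which is documented as a one-way elevator: it only keeps the list in ascending key order. struct xx_sbuf, struct xx_sorthd, and xx_sorted_insert() are hypothetical names that mirror the av_forw and av_back pointers described above.

```c
#include <stddef.h>

struct xx_sbuf {
    struct xx_sbuf *av_forw;   /* next request in the list */
    unsigned long b_resid;     /* sort key set by the driver */
};

struct xx_sorthd {
    struct xx_sbuf *av_forw;   /* first request in the list */
    struct xx_sbuf *av_back;   /* last request in the list */
};

/* Insert bp so that the list stays in ascending sort-key order. */
static void
xx_sorted_insert(struct xx_sorthd *hd, struct xx_sbuf *bp)
{
    struct xx_sbuf **pp = &hd->av_forw;

    /* Skip keys <= the new key so equal keys keep arrival order. */
    while (*pp != NULL && (*pp)->b_resid <= bp->b_resid)
        pp = &(*pp)->av_forw;

    bp->av_forw = *pp;
    *pp = bp;
    if (bp->av_forw == NULL)
        hd->av_back = bp;      /* bp is the new tail */
}
```

Inserting requests with keys 30, 10, and 20 leaves them linked in the order 10, 20, 30, so a start routine that always takes the first request services them in key order.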

Once data has been added to the diskhd list, the device needs to transfer the data. If the device is not busy processing a request, the xxstart() routine pulls the first buf structure off the diskhd list and starts a transfer.

If the device is busy, the driver should return from the xxstrategy() entry point. Once the hardware is done with the data transfer, it generates an interrupt. The driver's interrupt routine is then called to service the device. After servicing the interrupt, the driver can then call the start() routine to process the next buf structure in the diskhd list.