Writing Device Drivers

Chapter 16 Drivers for Block Devices

This chapter describes the structure of block device drivers. The kernel views a block device as a set of randomly accessible logical blocks. The file system uses a list of buf(9S) structures to buffer the data blocks between a block device and the user space. Only block devices can support a file system.

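This chapter provides information on the following subjects:

Block Driver Structure Overview
File I/O
Block Device Autoconfiguration
Controlling Device Access
Synchronous Data Transfers (Block Drivers)
Asynchronous Data Transfers (Block Drivers)
`dump()` and `print()` Entry Points
Disk Device Drivers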

Block Driver Structure Overview

Figure 16–1 shows data structures and routines that define the structure of a block device driver. Device drivers typically include the following elements:

Device-loadable driver section
Device configuration section
Device access section

The shaded device access section in the following figure illustrates entry points for block drivers.

Figure 16–1 Block Driver Roadmap

Diagram shows structures and entry points for block device drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. See Chapter 6, Driver Autoconfiguration for details on driver data structures.

Block device drivers provide these entry points:

Note –

Some of the entry points can be replaced by nodev(9F) or nulldev(9F) as appropriate.

File I/O

A file system is a tree-structured hierarchy of directories and files. Some file systems, such as the UNIX File System (UFS), reside on block-oriented devices. File systems are created by format(1M) and newfs(1M).

When an application issues a read(2) or write(2) system call to an ordinary file on the UFS file system, the file system can call the device driver strategy(9E) entry point for the block device on which the file system resides. The file system code can call strategy(9E) several times for a single read(2) or write(2) system call.

The file system code determines the logical device address, or logical block number, for each ordinary file block. A block I/O request is then built in the form of a buf(9S) structure directed at the block device. The driver strategy(9E) entry point then interprets the buf(9S) structure and completes the request.

Block Device Autoconfiguration

attach(9E) should perform the common initialization tasks for each instance of a device:

Allocating per-instance state structures
Mapping the device's registers
Registering device interrupts
Initializing mutex and condition variables
Creating power manageable components
Creating minor nodes

Block device drivers create minor nodes of type S_IFBLK. As a result, a block special file that represents the node appears in the /devices hierarchy.

Logical device names for block devices appear in the /dev/dsk directory, and consist of a controller number, bus-address number, disk number, and slice number. These names are created by the devfsadm(1M) program if the node type is set to DDI_NT_BLOCK or DDI_NT_BLOCK_CHAN. DDI_NT_BLOCK_CHAN should be specified if the device communicates on a channel, that is, a bus with an additional level of addressability. SCSI disks are a good example. DDI_NT_BLOCK_CHAN causes a bus-address field (tN) to appear in the logical name. DDI_NT_BLOCK should be used for most other devices.
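The naming rule above can be illustrated with a small user-space sketch. It is not how devfsadm(1M) builds names; it only shows the conventional cNtNdNsN layout and how the tN field drops out for a node that is not on a channel. The function name format_dsk_name and its parameters are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustrative only: format a /dev/dsk logical name from its fields.
 * has_target models a DDI_NT_BLOCK_CHAN node (1), which gets a tN
 * field, versus a DDI_NT_BLOCK node (0), which does not.
 * Returns the number of characters written, or -1 if the name does
 * not fit in the buffer.
 */
int
format_dsk_name(char *out, size_t outlen, int has_target,
    int ctrl, int target, int disk, int slice)
{
	int n;

	if (has_target)
		n = snprintf(out, outlen, "/dev/dsk/c%dt%dd%ds%d",
		    ctrl, target, disk, slice);
	else
		n = snprintf(out, outlen, "/dev/dsk/c%dd%ds%d",
		    ctrl, disk, slice);
	return ((n < 0 || (size_t)n >= outlen) ? -1 : n);
}
```

For example, controller 0, target 3, disk 0, slice 2 on a channel device yields /dev/dsk/c0t3d0s2, while the same fields without a channel yield /dev/dsk/c0d0s2.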

A minor device refers to a partition on the disk. For each minor device, the driver must create an nblocks or Nblocks property. This integer property gives the number of blocks supported by the minor device expressed in units of DEV_BSIZE, that is, 512 bytes. The file system uses the nblocks and Nblocks properties to determine device limits. Nblocks is the 64-bit version of nblocks. Nblocks should be used with storage devices that can hold over 1 Tbyte of storage per disk. See Device Properties for more information.
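The arithmetic behind the choice between nblocks and Nblocks can be sketched in user space as follows. The sketch assumes that nblocks is limited to a 32-bit signed integer, which is what makes 1 Tbyte (2^31 blocks of 512 bytes) the crossover point. The names XX_DEV_BSIZE, xx_bytes_to_blocks, and xx_needs_Nblocks are hypothetical and are not part of the DDI.

```c
#include <stdint.h>

#define	XX_DEV_BSIZE	512	/* stand-in for the kernel's DEV_BSIZE */

/* Number of whole 512-byte blocks in a device of the given capacity. */
uint64_t
xx_bytes_to_blocks(uint64_t capacity_bytes)
{
	return (capacity_bytes / XX_DEV_BSIZE);
}

/*
 * Returns 1 if the block count does not fit in a 32-bit signed
 * integer, in which case the 64-bit Nblocks property is required.
 * Returns 0 if the count would also fit in nblocks.
 */
int
xx_needs_Nblocks(uint64_t nblocks)
{
	return (nblocks > (uint64_t)INT32_MAX);
}
```

A driver that always creates Nblocks avoids the decision entirely, which is the approach the attach(9E) example in this section takes.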

Example 16–1 shows a typical attach(9E) entry point with emphasis on creating the device's minor node and the Nblocks property. Note that because this example uses Nblocks and not nblocks, ddi_prop_update_int64(9F) is called instead of ddi_prop_update_int(9F).

As a side note, this example shows the use of makedevice(9F) to create a device number for ddi_prop_update_int64(). The major number that is passed to makedevice() comes from ddi_driver_major(9F), which returns the major number of the driver that is bound to a dev_info_t node. Using ddi_driver_major() is similar to using getmajor(9F), which extracts the major number from a dev_t device number.

Example 16–1 Block Driver `attach()` Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
     int instance = ddi_get_instance(dip);
     struct xxstate *xsp;
     major_t maj_number;
     uint64_t size;

     switch (cmd) {
     case DDI_ATTACH:
        /* Allocate the per-instance state structure. */
        if (ddi_soft_state_zalloc(statep, instance) != DDI_SUCCESS)
           return (DDI_FAILURE);
        xsp = ddi_get_soft_state(statep, instance);
        /*
         * initialize the state structure
         * map the device's registers
         * add the device driver's interrupt handler(s)
         * initialize any mutexes and condition variables
         * read label information if the device is a disk
         * create power manageable components
         *
         * Create the device minor node. Note that the node_type
         * argument is set to DDI_NT_BLOCK.
         */
        if (ddi_create_minor_node(dip, "minor_name", S_IFBLK,
           instance, DDI_NT_BLOCK, 0) == DDI_FAILURE) {
           /* free resources allocated so far */
           /* Remove any previously allocated minor nodes */
           ddi_remove_minor_node(dip, NULL);
           ddi_soft_state_free(statep, instance);
           return (DDI_FAILURE);
        }
        /*
         * Create driver properties like "Nblocks". If the device
         * is a disk, the Nblocks property is usually calculated from
         * information in the disk label.  Use "Nblocks" instead of
         * "nblocks" to ensure the property works for large disks.
         *
         * size is the size of the device in 512 byte blocks. The
         * code that computes size is device specific and is not
         * shown here.
         */
        xsp->Nblocks = size;
        maj_number = ddi_driver_major(dip);
        if (ddi_prop_update_int64(makedevice(maj_number, instance), dip,
           "Nblocks", xsp->Nblocks) != DDI_PROP_SUCCESS) {
           cmn_err(CE_CONT, "%s: cannot create Nblocks property\n",
              ddi_get_name(dip));
           /* free resources allocated so far */
           ddi_remove_minor_node(dip, NULL);
           ddi_soft_state_free(statep, instance);
           return (DDI_FAILURE);
        }
        xsp->open = 0;
        xsp->nlayered = 0;
        /* ... */
        return (DDI_SUCCESS);

     case DDI_RESUME:
        /* For information, see Chapter 12, "Power Management," in this book. */
        /* FALLTHROUGH */
     default:
        return (DDI_FAILURE);
     }
}

Controlling Device Access

This section describes the entry points for open() and close() functions in block device drivers. See Chapter 15, Drivers for Character Devices for more information on open(9E) and close(9E).

`open()` Entry Point (Block Drivers)

The open(9E) entry point is used to gain access to a given device. The open(9E) routine of a block driver is called when a user thread issues an open(2) or mount(2) system call on a block special file associated with the minor device, or when a layered driver calls open(9E). See File I/O for more information.

The open() entry point should check for the following conditions:

The device can be opened, that is, the device is online and ready.
The device can be opened as requested. The device supports the operation. The device's current state does not conflict with the request.
The caller has permission to open the device.

The following example demonstrates a block driver open(9E) entry point.

Example 16–2 Block Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flags, int otyp, cred_t *credp)
{
     minor_t instance;
     struct xxstate *xsp;

     instance = getminor(*devp);
     xsp = ddi_get_soft_state(statep, instance);
     if (xsp == NULL)
        return (ENXIO);
     mutex_enter(&xsp->mu);
     /*
      * only honor FEXCL. If a regular open or a layered open
      * is still outstanding on the device, the exclusive open
      * must fail.
      */
     if ((flags & FEXCL) && (xsp->open || xsp->nlayered)) {
        mutex_exit(&xsp->mu);
        return (EAGAIN);
     }
     switch (otyp) {
     case OTYP_LYR:
        xsp->nlayered++;
        break;
     case OTYP_BLK:
        xsp->open = 1;
        break;
     default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
     }
     mutex_exit(&xsp->mu);
     return (0);
}

The otyp argument is used to specify the type of open on the device. OTYP_BLK is the typical open type for a block device. A device can be opened several times with otyp set to OTYP_BLK. close(9E) is called only once when the final close of type OTYP_BLK has occurred for the device. otyp is set to OTYP_LYR if the device is being used as a layered device. For every open of type OTYP_LYR, the layering driver issues a corresponding close of type OTYP_LYR. The example keeps track of each type of open so the driver can determine when the device is not being used in close(9E).

`close()` Entry Point (Block Drivers)

The close(9E) entry point uses the same arguments as open(9E) with one exception. dev is the device number rather than a pointer to the device number.

The close() routine should verify otyp in the same way as was described for the open(9E) entry point. In the following example, close() must determine when the device can really be closed. Closing is affected by the number of block opens and layered opens.

Example 16–3 Block Device close(9E) Routine

static int
xxclose(dev_t dev, int flag, int otyp, cred_t *credp)
{
     minor_t instance;
     struct xxstate *xsp;

     instance = getminor(dev);
     xsp = ddi_get_soft_state(statep, instance);
     if (xsp == NULL)
        return (ENXIO);
     mutex_enter(&xsp->mu);
     switch (otyp) {
     case OTYP_LYR:
        xsp->nlayered--;
        break;
     case OTYP_BLK:
        xsp->open = 0;
        break;
     default:
        mutex_exit(&xsp->mu);
        return (EINVAL);
     }

     if (xsp->open || xsp->nlayered) {
        /* not done yet */
        mutex_exit(&xsp->mu);
        return (0);
     }
     /* cleanup (rewind tape, free memory, etc.) */
     /* wait for I/O to drain */
     mutex_exit(&xsp->mu);

     return (0);
}

`strategy()` Entry Point

The strategy(9E) entry point is used to read and write data buffers to and from a block device. The name strategy refers to the fact that this entry point might implement some optimal strategy for ordering requests to the device.

strategy(9E) can be written to process one request at a time, that is, a synchronous transfer. strategy() can also be written to queue multiple requests to the device, as in an asynchronous transfer. When choosing a method, the abilities and limitations of the device should be taken into account.

The strategy(9E) routine is passed a pointer to a buf(9S) structure. This structure describes the transfer request, and contains status information on return. buf(9S) and strategy(9E) are the focus of block device operations.

`buf` Structure

The following buf structure members are important to block drivers:

     int           b_flags;      /* Buffer status */
     struct buf    *av_forw;     /* Driver work list link */
     struct buf    *av_back;     /* Driver work list link */
     size_t        b_bcount;     /* # of bytes to transfer */
     union {
         caddr_t   b_addr;       /* Buffer's virtual address */
     } b_un;
     daddr_t       b_blkno;      /* Block number on device */
     diskaddr_t    b_lblkno;     /* Expanded block number on device */
     size_t        b_resid;      /* # of bytes not transferred */
                                 /* after error */
     int           b_error;      /* Expanded error field */
     void          *b_private;   /* "opaque" driver private area */
     dev_t         b_edev;       /* expanded dev field */

where:

av_forw and av_back

Pointers that the driver can use to manage a list of buffers. See Asynchronous Data Transfers (Block Drivers) for a discussion of the av_forw and av_back pointers.

b_bcount

Specifies the number of bytes to be transferred by the device.

b_un.b_addr

The kernel virtual address of the data buffer. This address is valid only after a call to bp_mapin(9F).

b_blkno

The starting 32-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.

b_lblkno

The starting 64-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.

b_resid

Set by the driver to indicate the number of bytes that were not transferred because of an error. See Example 16–7 for an example of setting b_resid. The b_resid member is overloaded. b_resid is also used by disksort(9F).

b_error

Set to an error number by the driver when a transfer error occurs. b_error is set in conjunction with the b_flags B_ERROR bit. See the Intro(9E) man page for details about error values. Drivers should use bioerror(9F) rather than setting b_error directly.

b_flags

Flags with status and transfer attributes of the buf structure. If B_READ is set, the buf structure indicates a transfer from the device to memory. Otherwise, this structure indicates a transfer from memory to the device. If the driver encounters an error during data transfer, the driver should set the B_ERROR flag in the b_flags member. In addition, the driver should provide a more specific error value in b_error. Drivers should use bioerror(9F) rather than setting B_ERROR directly.

Caution –

Drivers should never clear b_flags.

b_private

For exclusive use by the driver to store driver-private data.

b_edev

Contains the device number of the device that was used in the transfer.
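As an illustration of how these members relate to one another, the following user-space sketch uses a reduced stand-in for buf(9S). The structure xx_buf, the flag value XX_B_READ, and the helper names are hypothetical; a real driver uses the kernel's definitions from the system headers.

```c
#include <stddef.h>
#include <stdint.h>

#define	XX_DEV_BSIZE	512	/* stand-in for the kernel's DEV_BSIZE */
#define	XX_B_READ	0x1	/* hypothetical value, not the kernel's B_READ */

/* Reduced, user-space stand-in for the buf(9S) members discussed above. */
struct xx_buf {
	int	b_flags;	/* transfer attributes */
	size_t	b_bcount;	/* bytes requested */
	int64_t	b_lblkno;	/* starting block, in 512-byte units */
	size_t	b_resid;	/* bytes not transferred */
};

/* Byte offset on the device at which the transfer starts. */
uint64_t
xx_start_offset(const struct xx_buf *bp)
{
	return ((uint64_t)bp->b_lblkno * XX_DEV_BSIZE);
}

/* Bytes actually moved, given the residual count the driver reported. */
size_t
xx_bytes_done(const struct xx_buf *bp)
{
	return (bp->b_bcount - bp->b_resid);
}

/* Returns 1 for a device-to-memory transfer, 0 for memory-to-device. */
int
xx_is_read(const struct xx_buf *bp)
{
	return ((bp->b_flags & XX_B_READ) != 0);
}
```

The multiplication is done in 64-bit arithmetic so that a large block number does not overflow when it is converted to a byte offset.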

`bp_mapin()` Function

A buf structure pointer can be passed into the device driver's strategy(9E) routine. However, the data buffer referred to by b_un.b_addr is not necessarily mapped in the kernel's address space. Therefore, the driver cannot directly access the data. Most block-oriented devices have DMA capability and therefore do not need to access the data buffer directly. Instead, these devices use the DMA mapping routines to enable the device's DMA engine to do the data transfer. For details about using DMA, see Chapter 9, Direct Memory Access (DMA).

If a driver needs to access the data buffer directly, that driver must first map the buffer into the kernel's address space by using bp_mapin(9F). bp_mapout(9F) should be used when the driver no longer needs to access the data directly.

Caution –

bp_mapout(9F) should only be called on buffers that have been allocated and are owned by the device driver. bp_mapout() must not be called on buffers that are passed to the driver through the strategy(9E) entry point, such as buffers that originate from a file system. bp_mapin(9F) does not keep a reference count, so bp_mapout(9F) removes any kernel mapping on which a layer over the device driver might rely.

Synchronous Data Transfers (Block Drivers)

This section presents a simple method for performing synchronous I/O transfers. This method assumes that the hardware is a simple disk device that can transfer only one data buffer at a time by using DMA. Another assumption is that the disk can be spun up and spun down by software command. The device driver's strategy(9E) routine waits for the current request to be completed before accepting a new request. The device interrupts when the transfer is complete. The device also interrupts if an error occurs.

The steps for performing a synchronous data transfer for a block driver are as follows:

Check for invalid buf(9S) requests.

Check the buf(9S) structure that is passed to strategy(9E) for validity. All drivers should check the following conditions:
- The request begins at a valid block. The driver converts the b_blkno field to the correct device offset and then determines whether the offset is valid for the device.
- The request does not go beyond the last block on the device.
- Device-specific requirements are met.
If an error is encountered, the driver should indicate the appropriate error with bioerror(9F). The driver should then complete the request by calling biodone(9F). biodone() notifies the caller of strategy(9E) that the transfer is complete. In this case, the transfer has stopped because of an error.

Check whether the device is busy.

Synchronous data transfers allow single-threaded access to the device. The device driver enforces this access in two ways:
- The driver maintains a busy flag that is guarded by a mutex.
- The driver waits on a condition variable with cv_wait(9F), when the device is busy.
If the device is busy, the thread waits until the interrupt handler indicates that the device is no longer busy. The available status can be indicated by either the cv_broadcast(9F) or the cv_signal(9F) function. See Chapter 3, Multithreading for details on condition variables.

When the device is no longer busy, the strategy(9E) routine marks the device as busy. strategy() then prepares the buffer and the device for the transfer.

Set up the buffer for DMA.

Prepare the data buffer for a DMA transfer by using ddi_dma_alloc_handle(9F) to allocate a DMA handle. Use ddi_dma_buf_bind_handle(9F) to bind the data buffer to the handle. For information on setting up DMA resources and related data structures, see Chapter 9, Direct Memory Access (DMA).

Begin the transfer.

At this point, a pointer to the buf(9S) structure is saved in the state structure of the device. The interrupt routine can then complete the transfer by calling biodone(9F).

The device driver then accesses device registers to initiate a data transfer. In most cases, the driver should protect the device registers from other threads by using mutexes. In this case, because strategy(9E) is single-threaded, guarding the device registers is not necessary. See Chapter 3, Multithreading for details about data locks.

When the executing thread has started the device's DMA engine, the driver can return execution control to the calling routine, as follows:

static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    struct device_reg *regp;
    minor_t instance;
    ddi_dma_cookie_t cookie;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL) {
       bioerror(bp, ENXIO);
       biodone(bp);
       return (0);
    }
    /* validate the transfer request */
    if ((bp->b_blkno >= xsp->Nblocks) || (bp->b_blkno < 0)) {
       bioerror(bp, EINVAL);    
       biodone(bp);
       return (0);
    }
    /*
     * Hold off all threads until the device is not busy.
     */
    mutex_enter(&xsp->mu);
    while (xsp->busy) {
       cv_wait(&xsp->cv, &xsp->mu);
    }
    xsp->busy = 1;
    mutex_exit(&xsp->mu);
    /* 
     * If the device has power manageable components, 
     * mark the device busy with pm_busy_component(9F),
     * and then ensure that the device 
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    regp = xsp->regp;
    ddi_put32(xsp->data_access_handle, &regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &regp->dma_size,
         (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &regp->csr,
         ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}

Handle the interrupting device.

When the device finishes the data transfer, the device generates an interrupt, which eventually results in the driver's interrupt routine being called. Most drivers specify the state structure of the device as the argument to the interrupt routine when registering interrupts. See the ddi_add_intr(9F) man page and Registering Interrupts. The interrupt routine can then access the buf(9S) structure being transferred, plus any other information that is available from the state structure.

The interrupt handler should check the device's status register to determine whether the transfer completed without error. If an error occurred, the handler should indicate the appropriate error with bioerror(9F). The handler should also clear the pending interrupt for the device and then complete the transfer by calling biodone(9F).

As the final task, the handler clears the busy flag. The handler then calls cv_signal(9F) or cv_broadcast(9F) on the condition variable, signaling that the device is no longer busy. This notification enables other threads waiting for the device in strategy(9E) to proceed with the next data transfer.

The following example shows a synchronous interrupt routine.

Example 16–4 Synchronous Interrupt Routine for Block Drivers

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
       mutex_exit(&xsp->mu);
       return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
       /* failure */
       bp->b_resid = bp->b_bcount;
       bioerror(bp, EIO);
    } else {
       /* success */
       bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
       CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in strategy(9E), mark them idle now with
     * pm_idle_component(9F).
     * Release any resources used in the transfer, such as DMA
     * resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     *
     * Let the next I/O thread have access to the device.
     */
    xsp->busy = 0;
    cv_signal(&xsp->cv);
    mutex_exit(&xsp->mu);
    return (DDI_INTR_CLAIMED);
}
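The busy-flag handshake between strategy() and the interrupt routine can be modeled in user space with POSIX threads, which offer close analogues of the kernel's mutex and condition variable primitives. This is only a sketch of the locking pattern and does not involve a device. The structure xx_dev and the function names are hypothetical.

```c
#include <pthread.h>

/* User-space stand-in for the locking fields of the driver state. */
struct xx_dev {
	pthread_mutex_t	mu;	/* protects busy */
	pthread_cond_t	cv;	/* signaled when busy is cleared */
	int		busy;
};

void
xx_dev_init(struct xx_dev *dp)
{
	pthread_mutex_init(&dp->mu, NULL);
	pthread_cond_init(&dp->cv, NULL);
	dp->busy = 0;
}

/* strategy() side: wait until the device is idle, then claim it. */
void
xx_dev_acquire(struct xx_dev *dp)
{
	pthread_mutex_lock(&dp->mu);
	while (dp->busy)
		pthread_cond_wait(&dp->cv, &dp->mu);
	dp->busy = 1;
	pthread_mutex_unlock(&dp->mu);
}

/* Interrupt side: mark the device idle and wake one waiter. */
void
xx_dev_release(struct xx_dev *dp)
{
	pthread_mutex_lock(&dp->mu);
	dp->busy = 0;
	pthread_cond_signal(&dp->cv);
	pthread_mutex_unlock(&dp->mu);
}
```

The while loop, rather than a single if test, matters in both the kernel and the user-space version: a woken thread must recheck the flag because another thread might have claimed the device first.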

Asynchronous Data Transfers (Block Drivers)

This section presents a method for performing asynchronous I/O transfers. The driver queues the I/O requests and then returns control to the caller. Again, the assumption is that the hardware is a simple disk device that allows one transfer at a time. The device interrupts when a data transfer has completed. An interrupt also takes place if an error occurs. The basic steps for performing asynchronous data transfers are:

Check for invalid buf(9S) requests.
Enqueue the request.
Start the first transfer.
Handle the interrupting device.

Checking for Invalid `buf` Requests

As in the synchronous case, the device driver should check the buf(9S) structure passed to strategy(9E) for validity. See Synchronous Data Transfers (Block Drivers) for more details.
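The validity checks can be sketched as a standalone function. This is a user-space illustration under stated assumptions, not driver code: the whole-block requirement stands in for a device-specific rule, the use of EINVAL follows the synchronous strategy() example, and the names xx_check_request and XX_DEV_BSIZE are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define	XX_DEV_BSIZE	512	/* stand-in for the kernel's DEV_BSIZE */

/*
 * Check a request that starts at block blkno and transfers bcount
 * bytes on a device that holds nblocks blocks.
 * Returns 0 if the request is acceptable, or an errno value that a
 * driver could pass to bioerror(9F).
 */
int
xx_check_request(int64_t blkno, size_t bcount, uint64_t nblocks)
{
	uint64_t nblks;

	/* The request must begin at a valid block. */
	if (blkno < 0 || (uint64_t)blkno >= nblocks)
		return (EINVAL);
	/* Hypothetical device-specific rule: nonempty, whole blocks only. */
	if (bcount == 0 || bcount % XX_DEV_BSIZE != 0)
		return (EINVAL);
	/* The request must not go beyond the last block on the device. */
	nblks = bcount / XX_DEV_BSIZE;
	if (nblks > nblocks - (uint64_t)blkno)
		return (EINVAL);
	return (0);
}
```

The last check compares against the remaining block count instead of adding blkno and nblks, which avoids an overflow for very large values.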

Enqueuing the Request

Unlike synchronous data transfers, a driver does not wait for an asynchronous request to complete. Instead, the driver adds the request to a queue. The head of the queue can be the current transfer. The head of the queue can also be a separate field in the state structure for holding the active request, as in Example 16–5.

If the queue is initially empty, then the hardware is not busy and strategy(9E) starts the transfer before returning. Otherwise, if a transfer completes with a non-empty queue, the interrupt routine begins a new transfer. Example 16–5 places the decision of whether to start a new transfer into a separate routine for convenience.

The driver can use the av_forw and the av_back members of the buf(9S) structure to manage a list of transfer requests. A single pointer can be used to manage a singly linked list, or both pointers can be used together to build a doubly linked list. The device hardware specification specifies which type of list management, such as insertion policies, is used to optimize the performance of the device. The transfer list is a per-device list, so the head and tail of the list are stored in the state structure.

The following example provides multiple threads with access to the driver shared data, such as the transfer list. You must identify the shared data and must protect the data with a mutex. See Chapter 3, Multithreading for more details about mutex locks.

Example 16–5 Enqueuing Data Transfer Requests for Block Drivers

static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    minor_t instance;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    /* ... */
    /* validate transfer request */
    /* ... */
    /*
     * Add the request to the end of the queue. Depending on the device, a sorting
     * algorithm, such as disksort(9F) can be used if it improves the
     * performance of the device.
     */
    mutex_enter(&xsp->mu);
    bp->av_forw = NULL;
    if (xsp->list_head) {
       /* Non-empty transfer list */
       xsp->list_tail->av_forw = bp;
       xsp->list_tail = bp;
    } else {
       /* Empty Transfer list */
       xsp->list_head = bp;
       xsp->list_tail = bp;
    }
    mutex_exit(&xsp->mu);
    /* Start the transfer if possible */
    (void) xxstart((caddr_t)xsp);
    return (0);
}

Starting the First Transfer

Device drivers that implement queuing usually have a start() routine. start() dequeues the next request and starts the data transfer to or from the device. In this example, start() processes all requests regardless of the state of the device, whether busy or free.

Note –

start() must be written to be called from any context. start() can be called by both the strategy routine in kernel context and the interrupt routine in interrupt context.

start() is called by strategy(9E) every time strategy() queues a request so that an idle device can be started. If the device is busy, start() returns immediately.

start() is also called by the interrupt handler before the handler returns from a claimed interrupt so that a nonempty queue can be serviced. If the queue is empty, start() returns immediately.

Because start() is a private driver routine, start() can take any arguments and can return any type. The following code sample is written to be used as a DMA callback, although that portion is not shown. Accordingly, the example must take a caddr_t as an argument and return an int. See Handling Resource Allocation Failures for more information about DMA callback routines.

Example 16–6 Starting the First Data Request for a Block Driver

static int
xxstart(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    ddi_dma_cookie_t cookie;    /* filled in by the DMA setup, not shown */

    mutex_enter(&xsp->mu);
    /*
     * If there is nothing more to do, or the device is
     * busy, return.
     */
    if (xsp->list_head == NULL || xsp->busy) {
       mutex_exit(&xsp->mu);
       return (0);
    }
    xsp->busy = 1;
    /* Get the first buffer off the transfer list */
    bp = xsp->list_head;
    /* Update the head and tail pointer */
    xsp->list_head = xsp->list_head->av_forw;
    if (xsp->list_head == NULL)
       xsp->list_tail = NULL;
    bp->av_forw = NULL;
    mutex_exit(&xsp->mu);
    /*
     * If the device has power manageable components,
     * mark the device busy with pm_busy_component(9F),
     * and then ensure that the device
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_size,
         (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
         ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}
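The list handling in the enqueue and start routines can be exercised outside the kernel with a reduced model. The sketch below keeps only the av_forw link, the head and tail pointers, and the busy flag. Locking is omitted because the model is single-threaded. The names xx_req, xx_queue, xx_enqueue, xx_start, and xx_complete are hypothetical.

```c
#include <stddef.h>

/* Reduced stand-in for a buf(9S) structure: link field plus an ID. */
struct xx_req {
	struct xx_req	*av_forw;
	int		id;
};

/* Reduced stand-in for the queue fields of the driver state. */
struct xx_queue {
	struct xx_req	*list_head;
	struct xx_req	*list_tail;
	int		busy;
};

/* Append a request to the tail of the transfer list. */
void
xx_enqueue(struct xx_queue *q, struct xx_req *rp)
{
	rp->av_forw = NULL;
	if (q->list_head != NULL)
		q->list_tail->av_forw = rp;
	else
		q->list_head = rp;
	q->list_tail = rp;
}

/*
 * Dequeue the next request and mark the device busy.
 * Returns NULL if the list is empty or a transfer is in progress.
 */
struct xx_req *
xx_start(struct xx_queue *q)
{
	struct xx_req *rp;

	if (q->list_head == NULL || q->busy)
		return (NULL);
	q->busy = 1;
	rp = q->list_head;
	q->list_head = rp->av_forw;
	if (q->list_head == NULL)
		q->list_tail = NULL;
	rp->av_forw = NULL;
	return (rp);
}

/* Interrupt side: the transfer is finished, so the device is idle. */
void
xx_complete(struct xx_queue *q)
{
	q->busy = 0;
}
```

Calling xx_start() while a transfer is outstanding returns NULL, which mirrors the way the driver's start routine returns immediately when the device is busy.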

Handling the Interrupting Device

The interrupt routine is similar to the synchronous version, with the addition of the call to start() and the removal of the call to cv_signal(9F).

Example 16–7 Block Driver Routine for Asynchronous Interrupts

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
        /* failure */
        bp->b_resid = bp->b_bcount;
        bioerror(bp, EIO);
    } else {
        /* success */
        bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in xxstart(), mark them idle now with
     * pm_idle_component(9F).
     * Release any resources used in the transfer, such as DMA
     * resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     *
     * Mark the device idle and start the next queued request,
     * if any.
     */
    xsp->busy = 0;
    mutex_exit(&xsp->mu);
    (void) xxstart((caddr_t)xsp);
    return (DDI_INTR_CLAIMED);
}

`dump()` and `print()` Entry Points

This section discusses the dump(9E) and print(9E) entry points.

`dump()` Entry Point (Block Drivers)

The dump(9E) entry point is used to copy a portion of virtual address space directly to the specified device in the case of a system failure. dump() is also used to copy the state of the kernel out to disk during a checkpoint operation. See the cpr(7) and dump(9E) man pages for more information. The entry point must be capable of performing this operation without the use of interrupts, because interrupts are disabled during the checkpoint operation.

int dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblk)

where:

dev: Device number of the device to receive the dump.
addr: Base kernel virtual address at which to start the dump.
blkno: Block at which the dump is to start.
nblk: Number of blocks to dump.

The dump depends upon the existing driver working properly.
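Because dump() cannot rely on interrupts, a driver typically has to poll the device for completion. The following user-space sketch shows the shape of such a bounded polling loop. The status bits XX_DONE and XX_ERROR and the function xx_poll_done are hypothetical. A real driver would read the register through its access handle, for example with ddi_get8(9F), and would usually delay between polls.

```c
#include <stdint.h>

#define	XX_DONE		0x01	/* hypothetical "transfer complete" bit */
#define	XX_ERROR	0x02	/* hypothetical "transfer failed" bit */

/*
 * Poll a status register until the device reports completion or an
 * error, or until max_polls reads have been made.
 * Returns 0 on successful completion, -1 on error or timeout.
 */
int
xx_poll_done(volatile const uint8_t *csr, unsigned long max_polls)
{
	while (max_polls-- > 0) {
		uint8_t status = *csr;

		if (status & XX_ERROR)
			return (-1);
		if (status & XX_DONE)
			return (0);
	}
	return (-1);
}
```

Bounding the loop matters here: during a system failure the device might never respond, and an unbounded wait would hang the dump.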

`print()` Entry Point (Block Drivers)

int print(dev_t dev, char *str)

The print(9E) entry point is called by the system to display a message about an exception that has been detected. print(9E) should call cmn_err(9F) to post the message to the console on behalf of the system. The following example demonstrates a typical print() entry point.

static int
xxprint(dev_t dev, char *str)
{
     cmn_err(CE_CONT, "xx: %s\n", str);
     return (0);
}

Disk Device Drivers

Disk devices represent an important class of block device drivers.

Disk `ioctl`s

Solaris disk drivers must support a minimum set of disk-specific ioctl commands. These I/O controls are specified in the dkio(7I) manual page. Disk I/O controls transfer disk information to or from the device driver. A Solaris disk device is supported by disk utility commands such as format(1M) and newfs(1M). The mandatory Sun disk I/O controls are as follows:

DKIOCINFO: Returns information that describes the disk controller
DKIOCGAPART: Returns a disk's partition map
DKIOCSAPART: Sets a disk's partition map
DKIOCGGEOM: Returns a disk's geometry
DKIOCSGEOM: Sets a disk's geometry
DKIOCGVTOC: Returns a disk's Volume Table of Contents
DKIOCSVTOC: Sets a disk's Volume Table of Contents

Disk Performance

The Solaris DDI/DKI provides facilities to optimize I/O transfers for improved file system performance. A mechanism manages the list of I/O requests so as to optimize disk access for a file system. See Asynchronous Data Transfers (Block Drivers) for a description of enqueuing an I/O request.

The diskhd structure is used to manage a linked list of I/O requests.

struct diskhd {
    long     b_flags;         /* not used, needed for consistency*/
    struct   buf *b_forw,    *b_back;       /* queue of unit queues */
    struct   buf *av_forw,    *av_back;    /* queue of bufs for this unit */
    long     b_bcount;            /* active flag */
};

The diskhd data structure has two buf pointers that the driver can manipulate. The av_forw pointer points to the first active I/O request. The second pointer, av_back, points to the last active request on the list.

A pointer to this structure is passed as an argument to disksort(9F), along with a pointer to the current buf structure being processed. The disksort() routine sorts the buf requests to reduce disk seek time. The routine then inserts the buf pointer into the diskhd list. The disksort() routine uses the value that is in b_resid of the buf structure as a sort key. The driver is responsible for setting this value. Most Sun disk drivers use the cylinder group as the sort key. This approach optimizes the file system read-ahead accesses.

When data has been added to the diskhd list, the device needs to transfer the data. If the device is not busy processing a request, the xxstart() routine pulls the first buf structure off the diskhd list and starts a transfer.

If the device is busy, the driver should return from the xxstrategy() entry point. When the hardware is done with the data transfer, an interrupt is generated. The driver's interrupt routine is then called to service the device. After servicing the interrupt, the driver can then call the start() routine to process the next buf structure in the diskhd list.
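The idea of keeping the request list ordered by a sort key can be shown with a simplified user-space model. This sketch inserts in plain ascending key order and is not the algorithm that disksort(9F) implements. The structures xx_req and xx_diskhd and the function xx_sort_insert are hypothetical stand-ins for buf, diskhd, and disksort().

```c
#include <stddef.h>

/* Reduced stand-in for a buf: link field plus the sort key. */
struct xx_req {
	struct xx_req	*av_forw;
	unsigned long	key;	/* plays the role of b_resid */
};

/* Reduced stand-in for diskhd: first and last request on the list. */
struct xx_diskhd {
	struct xx_req	*av_forw;
	struct xx_req	*av_back;
};

/*
 * Insert rp so that the list stays in ascending key order.
 * Requests with equal keys keep their arrival order.
 */
void
xx_sort_insert(struct xx_diskhd *dp, struct xx_req *rp)
{
	struct xx_req *prev = NULL;
	struct xx_req *cur = dp->av_forw;

	while (cur != NULL && cur->key <= rp->key) {
		prev = cur;
		cur = cur->av_forw;
	}
	rp->av_forw = cur;
	if (prev == NULL)
		dp->av_forw = rp;	/* new first request */
	else
		prev->av_forw = rp;
	if (cur == NULL)
		dp->av_back = rp;	/* new last request */
}
```

With keys that correspond to cylinder positions, servicing the list from the head visits the cylinders in increasing order, which is the effect that sorting by b_resid is meant to achieve.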