Writing Device Drivers

Part II Designing Specific Kinds of Device Drivers

The second part of the book provides design information that is specific to the type of driver:

Chapter 15, Drivers for Character Devices describes drivers for character-oriented devices.
Chapter 16, Drivers for Block Devices describes drivers for a block-oriented devices.
Chapter 17, SCSI Target Drivers outlines the Sun Common SCSI Architecture (SCSA) and the requirements for SCSI target drivers.
Chapter 18, SCSI Host Bus Adapter Drivers explains how to apply SCSA to SCSI Host Bus Adapter (HBA) drivers.
Chapter 19, Drivers for Network Devices describes the Generic LAN driver (GLD). The GLDv3 framework is a function calls-based interface of MAC plugins and MAC driver service routines and structures.
Chapter 20, USB Drivers describes how to write a client USB device driver using the USBA 2.0 framework.

Chapter 15 Drivers for Character Devices

A character device does not have physically addressable storage media, such as tape drives or serial ports, where I/O is normally performed in a byte stream. This chapter describes the structure of a character device driver, focusing in particular on entry points for character drivers. In addition, this chapter describes the use of physio(9F) and aphysio(9F) in the context of synchronous and asynchronous I/O transfers.

This chapter provides information on the following subjects:

Overview of the Character Driver Structure

Figure 15–1 shows data structures and routines that define the structure of a character device driver. Device drivers typically include the following elements:

Device-loadable driver section
Device configuration section
Character driver entry points

The shaded device access section in the following figure illustrates character driver entry points.

Figure 15–1 Character Driver Roadmap

Diagram shows structures and entry points for character
device drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. These structures contain pointers to the driver entry points:

Note –

Some of these entry points can be replaced with nodev(9F) or nulldev(9F) as appropriate.

Character Device Autoconfiguration

The attach(9E) routine should perform the common initialization tasks that all devices require, such as:

Allocating per-instance state structures
Registering device interrupts
Mapping the device's registers
Initializing mutex variables and condition variables
Creating power-manageable components
Creating minor nodes

See attach() Entry Point for code examples of these tasks.

Character device drivers create minor nodes of type S_IFCHR. A minor node of S_IFCHR causes a character special file that represents the node to eventually appear in the /devices hierarchy.

The following example shows a typical attach(9E) routine for character drivers. Properties that are associated with the device are commonly declared in an attach() routine. This example uses a predefined Size property. Size is the equivalent of the Nblocks property for getting the size of partition in a block device. If, for example, you are doing character I/O on a disk device, you might use Size to get the size of a partition. Since Size is a 64-bit property, you must use a 64-bit property interface. In this case, you use ddi_prop_update_int64(9F). See Device Properties for more information about properties.

Example 15–1 Character Driver `attach()` Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
  int instance = ddi_get_instance(dip);
  switch (cmd) {
  case DDI_ATTACH:
      /* 
       * Allocate a state structure and initialize it.
       * Map the device's registers.
       * Add the device driver's interrupt handler(s).
       * Initialize any mutexes and condition variables.
       * Create power manageable components.
       *
       * Create the device's minor node. Note that the node_type
       * argument is set to DDI_NT_TAPE.
       */
       if (ddi_create_minor_node(dip, minor_name, S_IFCHR,
           instance, DDI_NT_TAPE, 0) == DDI_FAILURE) {
           /* Free resources allocated so far. */
           /* Remove any previously allocated minor nodes. */
           ddi_remove_minor_node(dip, NULL);
           return (DDI_FAILURE);
       }
      /*
       * Create driver properties like "Size." Use "Size" 
       * instead of "size" to ensure the property works 
       * for large bytecounts.
       */
       xsp->Size = size_of_device_in_bytes;
       maj_number = ddi_driver_major(dip);
       if (ddi_prop_update_int64(makedevice(maj_number, instance), 
           dip, "Size", xsp->Size) != DDI_PROP_SUCCESS) {
           cmn_err(CE_CONT, "%s: cannot create Size property\n",
               ddi_get_name(dip));
               /* Free resources allocated so far. */
           return (DDI_FAILURE);
       }
      /* ... */
      return (DDI_SUCCESS);    
case DDI_RESUME:
      /* See the "Power Management" chapter in this book. */
default:
      return (DDI_FAILURE);
  }
}

Device Access (Character Drivers)

Access to a device by one or more application programs is controlled through the open(9E) and close(9E) entry points. An open(2) system call to a special file representing a character device always causes a call to the open(9E) routine for the driver. For a particular minor device, open(9E) can be called many times. The close(9E) routine is called only when the final reference to a device is removed. If the device is accessed through file descriptors, the final call to close(9E) can occur as a result of a close(2) or exit(2) system call. If the device is accessed through memory mapping, the final call to close(9E) can occur as a result of a munmap(2) system call.

`open()` Entry Point (Character Drivers)

The primary function of open() is to verify that the open request is allowed. The syntax for open(9E) is as follows:

int xxopen(dev_t *devp, int flag, int otyp, cred_t *credp);

where:

devp

Pointer to a device number. The open() routine is passed a pointer so that the driver can change the minor number. With this pointer, drivers can dynamically create minor instances of the device. An example would be a pseudo terminal driver that creates a new pseudo-terminal whenever the driver is opened. A driver that dynamically chooses the minor number normally creates only one minor device node in attach(9E) with ddi_create_minor_node(9F), then changes the minor number component of *devp using makedevice(9F) and getmajor(9F):

    *devp = makedevice(getmajor(*devp), new_minor);

You do not have to call ddi_create_minor_node(9F) for the new minor. A driver must not change the major number of *devp. The driver must keep track of available minor numbers internally.

flag

Flag with bits to indicate whether the device is opened for reading (FREAD), writing (FWRITE), or both. User threads issuing the open(2) system call can also request exclusive access to the device (FEXCL) or specify that the open should not block for any reason (FNDELAY), but the driver must enforce both cases. A driver for a write-only device such as a printer might consider an open(9E) for reading invalid.

otyp

Integer that indicates how open() was called. The driver must check that the value of otyp is appropriate for the device. For character drivers, otyp should be OTYP_CHR (see the open(9E) man page).

credp

Pointer to a credential structure containing information about the caller, such as the user ID and group IDs. Drivers should not examine the structure directly, but should instead use drv_priv(9F) to check for the common case of root privileges. In this example, only root or a user with the PRIV_SYS_DEVICES privilege is allowed to open the device for writing.

The following example shows a character driver open(9E) routine.

Example 15–2 Character Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flag, int otyp, cred_t *credp)
{
    minor_t        instance;

    if (getminor(*devp)         /* if device pointer is invalid */
        return (EINVAL);
    instance = getminor(*devp); /* one-to-one example mapping */
    /* Is the instance attached? */
    if (ddi_get_soft_state(statep, instance) == NULL)
        return (ENXIO);
    /* verify that otyp is appropriate */
    if (otyp != OTYP_CHR)
        return (EINVAL);
    if ((flag & FWRITE) && drv_priv(credp) == EPERM)
        return (EPERM);
    return (0);
}

`close()` Entry Point (Character Drivers)

The syntax for close(9E) is as follows:

int xxclose(dev_t dev, int flag, int otyp, cred_t *credp);

close() should perform any cleanup necessary to finish using the minor device, and prepare the device (and driver) to be opened again. For example, the open routine might have been invoked with the exclusive access (FEXCL) flag. A call to close(9E) would allow additional open routines to continue. Other functions that close(9E) might perform are:

Waiting for I/O to drain from output buffers before returning
Rewinding a tape (tape device)
Hanging up the phone line (modem device)

A driver that waits for I/O to drain could wait forever if draining stalls due to external conditions such as flow control. See Threads Unable to Receive Signals for information about how to avoid this problem.

I/O Request Handling

This section discusses I/O request processing in detail.

User Addresses

When a user thread issues a write(2) system call, the thread passes the address of a buffer in user space:

    char buffer[] = "python";
    count = write(fd, buffer, strlen(buffer) + 1);

The system builds a uio(9S) structure to describe this transfer by allocating an iovec(9S) structure and setting the iov_base field to the address passed to write(2), in this case, buffer. The uio(9S) structure is passed to the driver write(9E) routine. See Vectored I/O for more information about the uio(9S) structure.

The address in the iovec(9S) is in user space, not kernel space. Thus, the address is neither guaranteed to be currently in memory nor to be a valid address. In either case, accessing a user address directly from the device driver or from the kernel could crash the system. Thus, device drivers should never access user addresses directly. Instead, a data transfer routine in the Solaris DDI/DKI should be used to transfer data into or out of the kernel. These routines can handle page faults. The DDI/DKI routines can bring in the proper user page to continue the copy transparently. Alternatively, the routines can return an error on an invalid access.

copyout(9F) can be used to copy data from kernel space to user space. copyin(9F) can copy data from user space to kernel space. ddi_copyout(9F) and ddi_copyin(9F) operate similarly but are to be used in the ioctl(9E) routine. copyin(9F) and copyout(9F) can be used on the buffer described by each iovec(9S) structure, or uiomove(9F) can perform the entire transfer to or from a contiguous area of driver or device memory.

Vectored I/O

In character drivers, transfers are described by a uio(9S) structure. The uio(9S) structure contains information about the direction and size of the transfer, plus an array of buffers for one end of the transfer. The other end is the device.

The uio(9S) structure contains the following members:

iovec_t       *uio_iov;       /* base address of the iovec */
                              /* buffer description array */
int           uio_iovcnt;     /* the number of iovec structures */
off_t         uio_offset;     /* 32-bit offset into file where */
                              /* data is transferred from or to */
offset_t      uio_loffset;    /* 64-bit offset into file where */
                              /* data is transferred from or to */
uio_seg_t     uio_segflg;     /* identifies the type of I/O transfer */
                              /* UIO_SYSSPACE:  kernel <-> kernel */
                              /* UIO_USERSPACE: kernel <-> user */
short         uio_fmode;      /* file mode flags (not driver setTable) */
daddr_t       uio_limit;      /* 32-bit ulimit for file (maximum */
                              /* block offset). not driver settable. */
diskaddr_t    uio_llimit;     /* 64-bit ulimit for file (maximum block */
                              /* block offset). not driver settable. */
int           uio_resid;      /* amount (in bytes) not */
                              /* transferred on completion */

A uio(9S) structure is passed to the driver read(9E) and write(9E) entry points. This structure is generalized to support what is called gather-write and scatter-read. When writing to a device, the data buffers to be written do not have to be contiguous in application memory. Similarly, data that is transferred from a device into memory comes off in a contiguous stream but can go into noncontiguous areas of application memory. See the readv(2), writev(2), pread(2), and pwrite(2) man pages for more information on scatter-gather I/O.

Each buffer is described by an iovec(9S) structure. This structure contains a pointer to the data area and the number of bytes to be transferred.

caddr_t    iov_base;    /* address of buffer */
int        iov_len;     /* amount to transfer */

The uio structure contains a pointer to an array of iovec(9S) structures. The base address of this array is held in uio_iov, and the number of elements is stored in uio_iovcnt.

The uio_offset field contains the 32-bit offset into the device at which the application needs to begin the transfer. uio_loffset is used for 64-bit file offsets. If the device does not support the notion of an offset, these fields can be safely ignored. The driver should interpret either uio_offset or uio_loffset, but not both. If the driver has set the D_64BIT flag in the cb_ops(9S) structure, that driver should use uio_loffset.

The uio_resid field starts out as the number of bytes to be transferred, that is, the sum of all the iov_len fields in uio_iov. This field must be set by the driver to the number of bytes that were not transferred before returning. The read(2) and write(2) system calls use the return value from the read(9E) and write(9E) entry points to determine failed transfers. If a failure occurs, these routines return -1. If the return value indicates success, the system calls return the number of bytes requested minus uio_resid. If uio_resid is not changed by the driver, the read(2) and write(2) calls return 0. A return value of 0 indicates end-of-file, even though all the data has been transferred.

The support routines uiomove(9F), physio(9F), and aphysio(9F) update the uio(9S) structure directly. These support routines update the device offset to account for the data transfer. Neither the uio_offset or uio_loffset fields need to be adjusted when the driver is used with a seekable device that uses the concept of position. I/O performed to a device in this manner is constrained by the maximum possible value of uio_offset or uio_loffset. An example of such a usage is raw I/O on a disk.

If the device has no concept of position, the driver can take the following steps:

Save uio_offset or uio_loffset.
Perform the I/O operation.
Restore uio_offset or uio_loffset to the field's initial value.

I/O that is performed to a device in this manner is not constrained by the maximum possible value of uio_offset or uio_loffset. An example of this type of usage is I/O on a serial line.

The following example shows one way to preserve uio_loffset in the read(9E) function.

static int
xxread(dev_t dev, struct uio *uio_p, cred_t *cred_p)
{
    offset_t off;
    /* ... */
    off = uio_p->uio_loffset;  /* save the offset */
    /* do the transfer */
    uio_p->uio_loffset = off;  /* restore it */
}

Differences Between Synchronous and Asynchronous I/O

Data transfers can be synchronous or asynchronous. The determining factor is whether the entry point that schedules the transfer returns immediately or waits until the I/O has been completed.

The read(9E) and write(9E) entry points are synchronous entry points. The transfer must not return until the I/O is complete. Upon return from the routines, the process knows whether the transfer has succeeded.

The aread(9E) and awrite(9E) entry points are asynchronous entry points. Asynchronous entry points schedule the I/O and return immediately. Upon return, the process that issues the request knows that the I/O is scheduled and that the status of the I/O must be determined later. In the meantime, the process can perform other operations.

With an asynchronous I/O request to the kernel, the process is not required to wait while the I/O is in process. A process can perform multiple I/O requests and allow the kernel to handle the data transfer details. Asynchronous I/O requests enable applications such as transaction processing to use concurrent programming methods to increase performance or response time. Any performance boost for applications that use asynchronous I/O, however, comes at the expense of greater programming complexity.

Data Transfer Methods

Data can be transferred using either programmed I/O or DMA. These data transfer methods can be used either by synchronous or by asynchronous entry points, depending on the capabilities of the device.

Programmed I/O Transfers

Programmed I/O devices rely on the CPU to perform the data transfer. Programmed I/O data transfers are identical to other read and write operations for device registers. Various data access routines are used to read or store values to device memory.

uiomove(9F) can be used to transfer data to some programmed I/O devices. uiomove(9F) transfers data between the user space, as defined by the uio(9S) structure, and the kernel. uiomove() can handle page faults, so the memory to which data is transferred need not be locked down. uiomove() also updates the uio_resid field in the uio(9S) structure. The following example shows one way to write a ramdisk read(9E) routine. It uses synchronous I/O and relies on the presence of the following fields in the ramdisk state structure:

caddr_t    ram;        /* base address of ramdisk */
int        ramsize;    /* size of the ramdisk */

Example 15–3 Ramdisk read(9E) Routine Using uiomove(9F)

static int
rd_read(dev_t dev, struct uio *uiop, cred_t *credp)
{
     rd_devstate_t     *rsp;

     rsp = ddi_get_soft_state(rd_statep, getminor(dev));
     if (rsp == NULL)
       return (ENXIO);
     if (uiop->uio_offset >= rsp->ramsize)
       return (EINVAL);
     /*
      * uiomove takes the offset into the kernel buffer,
      * the data transfer count (minimum of the requested and
      * the remaining data), the UIO_READ flag, and a pointer
      * to the uio structure.
      */
     return (uiomove(rsp->ram + uiop->uio_offset,
         min(uiop->uio_resid, rsp->ramsize - uiop->uio_offset),
         UIO_READ, uiop));
}

Another example of programmed I/O would be a driver that writes data one byte at a time directly to the device's memory. Each byte is retrieved from the uio(9S) structure by using uwritec(9F). The byte is then sent to the device. read(9E) can use ureadc(9F) to transfer a byte from the device to the area described by the uio(9S) structure.

Example 15–4 Programmed I/O write(9E) Routine Using uwritec(9F)

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{
    int    value;
    struct xxstate     *xsp;

    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL)
        return (ENXIO);
    /* if the device implements a power manageable component, do this: */
    pm_busy_component(xsp->dip, 0);
    if (xsp->pm_suspended)
        pm_raise_power(xsp->dip, normal power);

    while (uiop->uio_resid > 0) {
        /*
         * do the programmed I/O access
         */
        value = uwritec(uiop);
        if (value == -1)
               return (EFAULT);
        ddi_put8(xsp->data_access_handle, &xsp->regp->data,
            (uint8_t)value);
        ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
            START_TRANSFER);
        /*
         * this device requires a ten microsecond delay
         * between writes
         */
        drv_usecwait(10);
    }
    pm_idle_component(xsp->dip, 0);
    return (0);
}

DMA Transfers (Synchronous)

Character drivers generally use physio(9F) to do the setup work for DMA transfers in read(9E) and write(9E), as is shown in Example 15–5.

int physio(int (*strat)(struct buf *), struct buf *bp,
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct uio *uio);

physio(9F) requires the driver to provide the address of a strategy(9E) routine. physio(9F) ensures that memory space is locked down, that is, memory cannot be paged out, for the duration of the data transfer. This lock-down is necessary for DMA transfers because DMA transfers cannot handle page faults. physio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information.

Example 15–5 read(9E) and write(9E) Routines Using physio(9F)

static int
xxread(dev_t dev, struct uio *uiop, cred_t *credp)
{
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_READ, xxminphys, uiop);
     return (ret);
}    

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{     
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_WRITE, xxminphys, uiop);
     return (ret);
}

In the call to physio(9F), xxstrategy is a pointer to the driver strategy() routine. Passing NULL as the buf(9S) structure pointer tells physio(9F) to allocate a buf(9S) structure. If the driver must provide physio(9F) with a buf(9S) structure, getrbuf(9F) should be used to allocate the structure. physio(9F) returns zero if the transfer completes successfully, or an error number on failure. After calling strategy(9E), physio(9F) calls biowait(9F) to block until the transfer either completes or fails. The return value of physio(9F) is determined by the error field in the buf(9S) structure set by bioerror(9F).

DMA Transfers (Asynchronous)

Character drivers that support aread(9E) and awrite(9E) use aphysio(9F) instead of physio(9F).

int aphysio(int (*strat)(struct buf *), int (*cancel)(struct buf *),
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct aio_req *aio_reqp);

Note –

The address of anocancel(9F) is the only value that can currently be passed as the second argument to aphysio(9F).

aphysio(9F) requires the driver to pass the address of a strategy(9E) routine. aphysio(9F) ensures that memory space is locked down, that is, cannot be paged out, for the duration of the data transfer. This lock-down is necessary for DMA transfers because DMA transfers cannot handle page faults. aphysio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information.

Example 15–5 and Example 15–6 demonstrate that the aread(9E) and awrite(9E) entry points differ only slightly from the read(9E) and write(9E) entry points. The difference is primarily in their use of aphysio(9F) instead of physio(9F).

Example 15–6 aread(9E) and awrite(9E) Routines Using aphysio(9F)

static int
xxaread(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
         return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_READ,
     xxminphys, aiop));
}

static int
xxawrite(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_WRITE,
     xxminphys,aiop));  
}

In the call to aphysio(9F), xxstrategy() is a pointer to the driver strategy routine. aiop is a pointer to the aio_req(9S) structure. aiop is passed to aread(9E) and awrite(9E). aio_req(9S) describes where the data is to be stored in user space. aphysio(9F) returns zero if the I/O request is scheduled successfully or an error number on failure. After calling strategy(9E), aphysio(9F) returns without waiting for the I/O to complete or fail.

`minphys()` Entry Point

The minphys() entry point is a pointer to a function to be called by physio(9F) or aphysio(9F). The purpose of xxminphys is to ensure that the size of the requested transfer does not exceed a driver-imposed limit. If the user requests a larger transfer, strategy(9E) is called repeatedly, requesting no more than the imposed limit at a time. This approach is important because DMA resources are limited. Drivers for slow devices, such as printers, should be careful not to tie up resources for a long time.

Usually, a driver passes the address of the kernel function minphys(9F), but the driver can define its own xxminphys() routine instead. The job of xxminphys() is to keep the b_bcount field of the buf(9S) structure under a driver's limit. The driver should adhere to other system limits as well. For example, the driver's xxminphys() routine should call the system minphys(9F) routine after setting the b_bcount field and before returning.

Example 15–7 minphys(9F) Routine

#define XXMINVAL (512 << 10)    /* 512 KB */
static void
xxminphys(struct buf *bp)
{
       if (bp->b_bcount > XXMINVAL)
        bp->b_bcount = XXMINVAL
      minphys(bp);
}

`strategy()` Entry Point

The strategy(9E) routine originated in block drivers. The strategy function got its name from implementing a strategy for efficient queuing of I/O requests to a block device. A driver for a character-oriented device can also use a strategy(9E) routine. In the character I/O model presented here, strategy(9E) does not maintain a queue of requests, but rather services one request at a time.

In the following example, the strategy(9E) routine for a character-oriented DMA device allocates DMA resources for synchronous data transfer. strategy() starts the command by programming the device register. See Chapter 9, Direct Memory Access (DMA) for a detailed description.

Note –

strategy(9E) does not receive a device number (dev_t) as a parameter. Instead, the device number is retrieved from the b_edev field of the buf(9S) structure passed to strategy(9E).

Example 15–8 strategy(9E) Routine

static int
xxstrategy(struct buf *bp)
{
     minor_t            instance;
     struct xxstate     *xsp;
     ddi_dma_cookie_t   cookie;

     instance = getminor(bp->b_edev);
     xsp = ddi_get_soft_state(statep, instance);
     /* ... */
      * If the device has power manageable components,
      * mark the device busy with pm_busy_components(9F),
      * and then ensure that the device is
      * powered up by calling pm_raise_power(9F).
      */
     /* Set up DMA resources with ddi_dma_alloc_handle(9F) and
      * ddi_dma_buf_bind_handle(9F).
      */
     xsp->bp = bp; /* remember bp */
     /* Program DMA engine and start command */
     return (0);
}

Note –

Although strategy() is declared to return an int, strategy() must always return zero.

On completion of the DMA transfer, the device generates an interrupt, causing the interrupt routine to be called. In the following example, xxintr() receives a pointer to the state structure for the device that might have generated the interrupt.

Example 15–9 Interrupt Routine

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     if ( /* device did not interrupt */ ) {
        return (DDI_INTR_UNCLAIMED);
     }
     if ( /* error */ ) {
        /* error handling */
     }
     /* Release any resources used in the transfer, such as DMA resources.
      * ddi_dma_unbind_handle(9F) and ddi_dma_free_handle(9F)
      * Notify threads that the transfer is complete.
      */
     biodone(xsp->bp);
     return (DDI_INTR_CLAIMED);
}

The driver indicates an error by calling bioerror(9F). The driver must call biodone(9F) when the transfer is complete or after indicating an error with bioerror(9F).

Mapping Device Memory

Some devices, such as frame buffers, have memory that is directly accessible to user threads by way of memory mapping. Drivers for these devices typically do not support the read(9E) and write(9E) interfaces. Instead, these drivers support memory mapping with the devmap(9E) entry point. For example, a frame buffer driver might implement the devmap(9E) entry point to enable the frame buffer to be mapped in a user thread.

The devmap(9E) entry point is called to export device memory or kernel memory to user applications. The devmap() function is called from devmap_setup(9F) inside segmap(9E) or on behalf of ddi_devmap_segmap(9F).

The segmap(9E) entry point is responsible for setting up a memory mapping requested by an mmap(2) system call. Drivers for many memory-mapped devices use ddi_devmap_segmap(9F) as the entry point rather than defining their own segmap(9E) routine.

See Chapter 10, Mapping Device and Kernel Memory and Chapter 11, Device Context Management for details.

Multiplexing I/O on File Descriptors

A thread sometimes needs to handle I/O on more than one file descriptor. One example is an application program that needs to read the temperature from a temperature-sensing device and then report the temperature to an interactive display. A program that makes a read request with no data available should not block while waiting for the temperature before interacting with the user again.

The poll(2) system call provides users with a mechanism for multiplexing I/O over a set of file descriptors that reference open files. poll(2) identifies those file descriptors on which a program can send or receive data without blocking, or on which certain events have occurred.

To enable a program to poll a character driver, the driver must implement the chpoll(9E) entry point. The system calls chpoll(9E) when a user process issues a poll(2) system call on a file descriptor associated with the device. The chpoll(9E) entry point routine is used by non-STREAMS character device drivers that need to support polling.

The chpoll(9E) function uses the following syntax:

int xxchpoll(dev_t dev, short events, int anyyet, short *reventsp,
     struct pollhead **phpp);

In the chpoll(9E) entry point, the driver must follow these rules:

Implement the following algorithm when the chpoll(9E) entry point is called:
```
if ( /* events are satisfied now */ ) {     
    *reventsp = mask_of_satisfied_events
} else {
    *reventsp = 0;
    if (!anyyet)
        *phpp = &local_pollhead_structure;
}
return (0);
```
See the chpoll(9E) man page for a discussion of events to check. The chpoll(9E) entry point should then return the mask of satisfied events by setting the return events in *reventsp.

If no events have occurred, the return field for the events is cleared. If the anyyet field is not set, the driver must return an instance of the pollhead structure. The pollhead structure is usually allocated in a state structure. The pollhead structure should be treated as opaque by the driver. None of the pollhead fields should be referenced.
Call pollwakeup(9F) whenever a device condition of type events, listed in Example 15–10, occurs. This function should be called only with one event at a time. You can call pollwakeup(9F) in the interrupt routine when the condition has occurred.

Example 15–10 and Example 15–11 show how to implement the polling discipline and how to use pollwakeup(9F).

The following example shows how to handle the POLLIN and POLLERR events. The driver first reads the status register to determine the current state of the device. The parameter events specifies which conditions the driver should check. If an appropriate condition has occurred, the driver sets that bit in *reventsp. If none of the conditions has occurred and if anyyet is not set, the address of the pollhead structure is returned in *phpp.

Example 15–10 chpoll(9E) Routine

static int
xxchpoll(dev_t dev, short events, int anyyet,
    short *reventsp, struct pollhead **phpp)
{
     uint8_t status;
     short revent;
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
         return (ENXIO);
     revent = 0;
     /*
    * Valid events are:
    * POLLIN | POLLOUT | POLLPRI | POLLHUP | POLLERR
    * This example checks only for POLLIN and POLLERR.
    */
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if ((events & POLLIN) && data available to read) {
        revent |= POLLIN;
     }
     if (status & DEVICE_ERROR) {
        revent |= POLLERR;
     }
     /* if nothing has occurred */
     if (revent == 0) {
        if (!anyyet) {
        *phpp = &xsp->pollhead;
        }
     }
       *reventsp = revent;
     return (0);
}

The following example shows how to use the pollwakeup(9F) function. The pollwakeup(9F) function usually is called in the interrupt routine when a supported condition has occurred. The interrupt routine reads the status from the status register and checks for the conditions. The routine then calls pollwakeup(9F) for each event to possibly notify polling threads that they should check again. Note that pollwakeup(9F) should not be called with any locks held, since deadlock could result if another routine tried to enter chpoll(9E) and grab the same lock.

Example 15–11 Interrupt Routine Supporting chpoll(9E)

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     uint8_t    status;
     /* normal interrupt processing */
     /* ... */
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if (status & DEVICE_ERROR) {
        pollwakeup(&xsp->pollhead, POLLERR);
     }
     if ( /* just completed a read */ ) {
        pollwakeup(&xsp->pollhead, POLLIN);
     }
     /* ... */
     return (DDI_INTR_CLAIMED);
}

Miscellaneous I/O Control

The ioctl(9E) routine is called when a user thread issues an ioctl(2) system call on a file descriptor associated with the device. The I/O control mechanism is a catchall for getting and setting device-specific parameters. This mechanism is frequently used to set a device-specific mode, either by setting internal driver software flags or by writing commands to the device. The control mechanism can also be used to return information to the user about the current device state. In short, the control mechanism can do whatever the application and driver need to have done.

`ioctl()` Entry Point (Character Drivers)

int xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
     cred_t *credp, int *rvalp);

The cmd parameter indicates which command ioctl(9E) should perform. By convention, the driver with which an I/O control command is associated is indicated in bits 8-15 of the command. Typically, the ASCII code of a character represents the driver. The driver-specific command in bits 0-7. The creation of some I/O commands is illustrated in the following example:

#define XXIOC    (`x' << 8)     /* `x' is a character representing */
                                      /* device xx */
#define XX_GET_STATUS    (XXIOC | 1)  /* get status register */
#define XX_SET_CMD       (XXIOC | 2)  /* send command */

The interpretation of arg depends on the command. I/O control commands should be documented in the driver documentation or a man page. The command should also be defined in a public header file, so that applications can determine the name of the command, what the command does, and what the command accepts or returns as arg. Any data transfer of arg into or out of the driver must be performed by the driver.

Certain classes of devices such as frame buffers or disks must support standard sets of I/O control requests. These standard I/O control interfaces are documented in the Solaris 8 Reference Manual Collection. For example, fbio(7I) documents the I/O controls that frame buffers must support, and dkio(7I) documents standard disk I/O controls. See Miscellaneous I/O Control for more information on I/O controls.

Drivers must use ddi_copyin(9F) to transfer arg data from the user-level application to the kernel level. Drivers must use ddi_copyout(9F) to transfer data from the kernel to the user level. Failure to use ddi_copyin(9F) or ddi_copyout(9F) can result in panics under two conditions. A panic occurs if the architecture separates the kernel and user address spaces, or if the user address has been swapped out.

ioctl(9E) is usually a switch statement with a case for each supported ioctl(9E) request.

Example 15–12 ioctl(9E) Routine

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
     uint8_t        csr;
     struct xxstate     *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL) {
        return (ENXIO);
     }
     switch (cmd) {
     case XX_GET_STATUS:
       csr = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
       if (ddi_copyout(&csr, (void *)arg,
           sizeof (uint8_t), mode) != 0) {
           return (EFAULT);
       }
       break;
     case XX_SET_CMD:
       if (ddi_copyin((void *)arg, &csr,
         sizeof (uint8_t), mode) != 0) {
         return (EFAULT);
       }
       ddi_put8(xsp->data_access_handle, &xsp->regp->csr, csr);
       break;
     default:
       /* generic "ioctl unknown" error */
       return (ENOTTY);
     }
     return (0);
}

The cmd variable identifies a specific device control operation. A problem can occur if arg contains a user virtual address. ioctl(9E) must call ddi_copyin(9F) or ddi_copyout(9F) to transfer data between the data structure in the application program pointed to by arg and the driver. In Example 15–12, for the case of an XX_GET_STATUS request, the contents of xsp->regp->csr are copied to the address in arg. ioctl(9E) can store in *rvalp any integer value as the return value to the ioctl(2) system call that makes a successful request. Negative return values, such as -1, should be avoided. Many application programs assume that negative values indicate failure.

The following example demonstrates an application that uses the I/O controls discussed in the previous paragraph.

Example 15–13 Using ioctl(9E)

#include <sys/types.h>
#include "xxio.h"     /* contains device's ioctl cmds and args */
int
main(void)
{
     uint8_t    status;
     /* ... */
     /*
      * read the device status
      */
     if (ioctl(fd, XX_GET_STATUS, &status) == -1) {
         /* error handling */
     }
     printf("device status %x\n", status);
     exit(0);
}

I/O Control Support for 64-Bit Capable Device Drivers

The Solaris kernel runs in 64-bit mode on suitable hardware, supporting both 32-bit applications and 64-bit applications. A 64-bit device driver is required to support I/O control commands from programs of both sizes. The difference between a 32-bit program and a 64-bit program is the C language type model. A 32-bit program is ILP32, and a 64-bit program is LP64. See Appendix C, Making a Device Driver 64-Bit Ready for information on C data type models.

If data that flows between programs and the kernel is not identical in format, the driver must be able to handle the model mismatch. Handling a model mismatch requires making appropriate adjustments to the data.

To determine whether a model mismatch exists, the ioctl(9E) mode parameter passes the data model bits to the driver. As Example 15–14 shows, the mode parameter is then passed to ddi_model_convert_from(9F) to determine whether any model conversion is necessary.

A flag subfield of the mode argument is used to pass the data model to the ioctl(9E) routine. The flag is set to one of the following:

DATAMODEL_ILP32
DATAMODEL_LP64

FNATIVE is conditionally defined to match the data model of the kernel implementation. The FMODELS mask should be used to extract the flag from the mode argument. The driver can then examine the data model explicitly to determine how to copy the application data structure.

The DDI function ddi_model_convert_from(9F) is a convenience routine that can assist some drivers with their ioctl() calls. The function takes the data type model of the user application as an argument and returns one of the following values:

DDI_MODEL_ILP32 – Convert from ILP32 application
DDI_MODEL_NONE – No conversion needed

DDI_MODEL_NONE is returned if no data conversion is necessary, as occurs when the application and driver have the same data model. DDI_MODEL_ILP32 is returned to a driver that is compiled to the LP64 model and that communicates with a 32-bit application.

In the following example, the driver copies a data structure that contains a user address. The data structure changes size from ILP32 to LP64. Accordingly, the 64-bit driver uses a 32-bit version of the structure when communicating with a 32-bit application.

Example 15–14 ioctl(9E) Routine to Support 32-bit Applications and 64-bit Applications

struct args32 {
    uint32_t    addr;    /* 32-bit address in LP64 */
    int     len;
}
struct args {
    caddr_t     addr;    /* 64-bit address in LP64 */
    int     len;
}

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
    struct  xxstate  *xsp;
    struct  args     a;
    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL) {
        return (ENXIO);
    }
    switch (cmd) {
    case XX_COPYIN_DATA:
        switch(ddi_model_convert_from(mode)) {
        case DDI_MODEL_ILP32:
        {
            struct args32 a32;

            /* copy 32-bit args data shape */
            if (ddi_copyin((void *)arg, &a32,
                sizeof (struct args32), mode) != 0) {
                return (EFAULT);
            }
            /* convert 32-bit to 64-bit args data shape */
            a.addr = a32.addr;
            a.len = a32.len;
            break;
        }
        case DDI_MODEL_NONE:
            /* application and driver have same data model. */
            if (ddi_copyin((void *)arg, &a, sizeof (struct args),
                mode) != 0) {
                return (EFAULT);
            }
        }
        /* continue using data shape in native driver data model. */
        break;

    case XX_COPYOUT_DATA:
        /* copyout handling */
        break;
    default:
        /* generic "ioctl unknown" error */
        return (ENOTTY);
    }
    return (0);
}

Handling `copyout()` Overflow

Sometimes a driver needs to copy out a native quantity that no longer fits in the 32-bit sized structure. In this case, the driver should return EOVERFLOW to the caller. EOVERFLOW serves as an indication that the data type in the interface is too small to hold the value to be returned, as shown in the following example.

Example 15–15 Handling copyout(9F) Overflow

int
    xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
     cred_t *cr, int *rval_p)
    {
        struct resdata res;
        /* body of driver */
        switch (ddi_model_convert_from(mode & FMODELS)) {
        case DDI_MODEL_ILP32: {
            struct resdata32 res32;

            if (res.size > UINT_MAX)
                    return (EOVERFLOW);    
            res32.size = (size32_t)res.size;
            res32.flag = res.flag;
            if (ddi_copyout(&res32,
                (void *)arg, sizeof (res32), mode))
                    return (EFAULT);
        }
        break;

        case DDI_MODEL_NONE:
            if (ddi_copyout(&res, (void *)arg, sizeof (res), mode))
                    return (EFAULT);
            break;
        }
        return (0);
    }

32-bit and 64-bit Data Structure Macros

The method in Example 15–15 works well for many drivers. An alternate scheme is to use the data structure macros that are provided in <sys/model.h>to move data between the application and the kernel. These macros make the code less cluttered and behave identically, from a functional perspective.

Example 15–16 Using Data Structure Macros to Move Data

int
    xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
        cred_t *cr, int *rval_p)
    {    
        STRUCT_DECL(opdata, op);

        if (cmd != OPONE)
            return (ENOTTY);

        STRUCT_INIT(op, mode);

        if (copyin((void *)arg,
            STRUCT_BUF(op), STRUCT_SIZE(op)))
                return (EFAULT);

        if (STRUCT_FGET(op, flag) != XXACTIVE ||     
            STRUCT_FGET(op, size) > XXSIZE)
                return (EINVAL);
        xxdowork(device_state, STRUCT_FGET(op, size));
        return (0);
}

How Do the Structure Macros Work?

In a 64-bit device driver, structure macros enable the use of the same piece of kernel memory by data structures of both sizes. The memory buffer holds the contents of the native form of the data structure, that is, the LP64 form, and the ILP32 form. Each structure access is implemented by a conditional expression. When compiled as a 32-bit driver, only one data model, the native form, is supported. No conditional expression is used.

The 64-bit versions of the macros depend on the definition of a shadow version of the data structure. The shadow version describes the 32-bit interface with fixed-width types. The name of the shadow data structure is formed by appending “32” to the name of the native data structure. For convenience, place the definition of the shadow structure in the same file as the native structure to ease future maintenance costs.

The macros can take the following arguments:

structname: The structure name of the native form of the data structure as entered after the struct keyword.
umodel: A flag word that contains the user data model, such as FILP32 or FLP64, extracted from the mode parameter of ioctl(9E).
handle: The name used to refer to a particular instance of a structure that is manipulated by these macros.
fieldname: The name of the field within the structure.

When to Use Structure Macros

Macros enable you to make in-place references only to the fields of a data item. Macros do not provide a way to take separate code paths that are based on the data model. Macros should be avoided if the number of fields in the data structure is large. Macros should also be avoided if the frequency of references to these fields is high.

Macros hide many of the differences between data models in the implementation of the macros. As a result, code written with this interface is generally easier to read. When compiled as a 32-bit driver, the resulting code is compact without needing clumsy #ifdefs, but still preserves type checking.

Declaring and Initializing Structure Handles

STRUCT_DECL(9F) and STRUCT_INIT(9F) can be used to declare and initialize a handle and space for decoding an ioctl on the stack. STRUCT_HANDLE(9F) and STRUCT_SET_HANDLE(9F) declare and initialize a handle without allocating space on the stack. The latter macros can be useful if the structure is very large, or is contained in some other data structure.

Note –

Because the STRUCT_DECL(9F) and STRUCT_HANDLE(9F) macros expand to data structure declarations, these macros should be grouped with such declarations in C code.

The macros for declaring and initializing structures are as follows:

STRUCT_DECL(structname, handle): Declares a structure handlethat is called handle for a structname data structure. STRUCT_DECL allocates space for its native form on the stack. The native form is assumed to be larger than or equal to the ILP32 form of the structure.
STRUCT_INIT(handle, umodel): Initializes the data model for handle to umodel. This macro must be invoked before any access is made to a structure handle declared with STRUCT_DECL(9F).
STRUCT_HANDLE(structname, handle): Declares a structure handle that is called handle. Contrast with STRUCT_DECL(9F).
STRUCT_SET_HANDLE(handle, umodel, addr): Initializes the data model for handle to umodel, and sets addr as the buffer used for subsequent manipulation. Invoke this macro before accessing a structure handle declared with STRUCT_DECL(9F).

Operations on Structure Handles

The macros for performing operations on structures are as follows:

size_t STRUCT_SIZE(handle): Returns the size of the structure referred to by handle, according to its embedded data model.
typeof fieldname STRUCT_FGET(handle, fieldname): Returns the indicated field in the data structure referred to by handle. This field is a non-pointer type.
typeof fieldname STRUCT_FGETP(handle, fieldname): Returns the indicated field in the data structure referred to by handle. This field is a pointer type.
STRUCT_FSET(handle, fieldname, val): Sets the indicated field in the data structure referred to by handle to value val. The type of val should match the type of fieldname. The field is a non-pointer type.
STRUCT_FSETP(handle, fieldname, val): Sets the indicated field in the data structure referred to by handle to value val. The field is a pointer type.
typeof fieldname *STRUCT_FADDR(handle, fieldname): Returns the address of the indicated field in the data structure referred to by handle.
struct structname *STRUCT_BUF(handle): Returns a pointer to the native structure described by handle.

Other Operations

Some miscellaneous structure macros follow:

size_t SIZEOF_STRUCT(struct_name, datamodel): Returns the size of struct_name, which is based on the given data model.
size_t SIZEOF_PTR(datamodel): Returns the size of a pointer based on the given data model.

Chapter 16 Drivers for Block Devices

This chapter describes the structure of block device drivers. The kernel views a block device as a set of randomly accessible logical blocks. The file system uses a list of buf(9S) structures to buffer the data blocks between a block device and the user space. Only block devices can support a file system.

This chapter provides information on the following subjects:

Block Driver Structure Overview

Figure 16–1 shows data structures and routines that define the structure of a block device driver. Device drivers typically include the following elements:

Device-loadable driver section
Device configuration section
Device access section

The shaded device access section in the following figure illustrates entry points for block drivers.

Figure 16–1 Block Driver Roadmap

Diagram shows structures and entry points for block device
drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. See Chapter 6, Driver Autoconfiguration for details on driver data structures.

Block device drivers provide these entry points:

Note –

Some of the entry points can be replaced by nodev(9F) or nulldev(9F) as appropriate.

File I/O

A file system is a tree-structured hierarchy of directories and files. Some file systems, such as the UNIX File System (UFS), reside on block-oriented devices. File systems are created by format(1M) and newfs(1M).

When an application issues a read(2) or write(2) system call to an ordinary file on the UFS file system, the file system can call the device driver strategy(9E) entry point for the block device on which the file system resides. The file system code can call strategy(9E) several times for a single read(2) or write(2) system call.

The file system code determines the logical device address, or logical block number, for each ordinary file block. A block I/O request is then built in the form of a buf(9S) structure directed at the block device. The driver strategy(9E) entry point then interprets the buf(9S) structure and completes the request.

Block Device Autoconfiguration

attach(9E) should perform the common initialization tasks for each instance of a device:

Allocating per-instance state structures
Mapping the device's registers
Registering device interrupts
Initializing mutex and condition variables
Creating power manageable components
Creating minor nodes

Block device drivers create minor nodes of type S_IFBLK. As a result, a block special file that represents the node appears in the /devices hierarchy.

Logical device names for block devices appear in the /dev/dsk directory, and consist of a controller number, bus-address number, disk number, and slice number. These names are created by the devfsadm(1M) program if the node type is set to DDI_NT_BLOCK or DDI_NT_BLOCK_CHAN. DDI_NT_BLOCK_CHAN should be specified if the device communicates on a channel, that is, a bus with an additional level of addressability. SCSI disks are a good example. DDI_NT_BLOCK_CHAN causes a bus-address field (tN) to appear in the logical name. DDI_NT_BLOCK should be used for most other devices.

A minor device refers to a partition on the disk. For each minor device, the driver must create an nblocks or Nblocks property. This integer property gives the number of blocks supported by the minor device expressed in units of DEV_BSIZE, that is, 512 bytes. The file system uses the nblocks and Nblocks properties to determine device limits. Nblocks is the 64-bit version of nblocks. Nblocks should be used with storage devices that can hold over 1 Tbyte of storage per disk. See Device Properties for more information.

Example 16–1 shows a typical attach(9E) entry point with emphasis on creating the device's minor node and the Nblocks property. Note that because this example uses Nblocks and not nblocks, ddi_prop_update_int64(9F) is called instead of ddi_prop_update_int(9F).

As a side note, this example shows the use of makedevice(9F) to create a device number for ddi_prop_update_int64(). The makedevice function makes use of ddi_driver_major(9F), which generates a major number from a pointer to a dev_info_t structure. Using ddi_driver_major() is similar to using getmajor(9F), which gets a dev_t structure pointer.

Example 16–1 Block Driver `attach()` Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
     int instance = ddi_get_instance(dip);
     switch (cmd) {
       case DDI_ATTACH:
       /*
        * allocate a state structure and initialize it
        * map the devices registers
        * add the device driver's interrupt handler(s)
        * initialize any mutexes and condition variables
        * read label information if the device is a disk
        * create power manageable components
        *
        * Create the device minor node. Note that the node_type
        * argument is set to DDI_NT_BLOCK.
        */
       if (ddi_create_minor_node(dip, "minor_name", S_IFBLK,
          instance, DDI_NT_BLOCK, 0) == DDI_FAILURE) {
          /* free resources allocated so far */
          /* Remove any previously allocated minor nodes */
          ddi_remove_minor_node(dip, NULL);
          return (DDI_FAILURE);
        }
       /*
        * Create driver properties like "Nblocks". If the device
        * is a disk, the Nblocks property is usually calculated from
        * information in the disk label.  Use "Nblocks" instead of
        * "nblocks" to ensure the property works for large disks.
        */
       xsp->Nblocks = size;
       /* size is the size of the device in 512 byte blocks */
       maj_number = ddi_driver_major(dip);
       if (ddi_prop_update_int64(makedevice(maj_number, instance), dip, 
          "Nblocks", xsp->Nblocks) != DDI_PROP_SUCCESS) {
          cmn_err(CE_CONT, "%s: cannot create Nblocks property\n",
               ddi_get_name(dip));
         /* free resources allocated so far */
         return (DDI_FAILURE);
       }
       xsp->open = 0;
       xsp->nlayered = 0;
       /* ... */
       return (DDI_SUCCESS);

    case DDI_RESUME:
       /* For information, see Chapter 12, "Power Management," in this book. */
       default:
          return (DDI_FAILURE);
     }
}

Controlling Device Access

This section describes the entry points for open() and close() functions in block device drivers. See Chapter 15, Drivers for Character Devices for more information on open(9E) and close(9E).

`open()` Entry Point (Block Drivers)

The open(9E) entry point is used to gain access to a given device. The open(9E) routine of a block driver is called when a user thread issues an open(2) or mount(2) system call on a block special file associated with the minor device, or when a layered driver calls open(9E). See File I/O for more information.

The open() entry point should check for the following conditions:

The device can be opened, that is, the device is online and ready.
The device can be opened as requested. The device supports the operation. The device's current state does not conflict with the request.
The caller has permission to open the device.

The following example demonstrates a block driver open(9E) entry point.

Example 16–2 Block Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flags, int otyp, cred_t *credp)
{
       minor_t         instance;
       struct xxstate        *xsp;

     instance = getminor(*devp);
     xsp = ddi_get_soft_state(statep, instance);
     if (xsp == NULL)
           return (ENXIO);
     mutex_enter(&xsp->mu);
     /*
    * only honor FEXCL. If a regular open or a layered open
    * is still outstanding on the device, the exclusive open
    * must fail.
    */
     if ((flags & FEXCL) && (xsp->open || xsp->nlayered)) {
       mutex_exit(&xsp->mu);
       return (EAGAIN);
     }
     switch (otyp) {
       case OTYP_LYR:
         xsp->nlayered++;
         break;
      case OTYP_BLK:
         xsp->open = 1;
         break;
     default:
         mutex_exit(&xsp->mu);
         return (EINVAL);
     }
   mutex_exit(&xsp->mu);
      return (0);
}

The otyp argument is used to specify the type of open on the device. OTYP_BLK is the typical open type for a block device. A device can be opened several times with otyp set to OTYP_BLK. close(9E) is called only once when the final close of type OTYP_BLK has occurred for the device. otyp is set to OTYP_LYR if the device is being used as a layered device. For every open of type OTYP_LYR, the layering driver issues a corresponding close of type OTYP_LYR. The example keeps track of each type of open so the driver can determine when the device is not being used in close(9E).

`close()` Entry Point (Block Drivers)

The close(9E) entry point uses the same arguments as open(9E) with one exception. dev is the device number rather than a pointer to the device number.

The close() routine should verify otyp in the same way as was described for the open(9E) entry point. In the following example, close() must determine when the device can really be closed. Closing is affected by the number of block opens and layered opens.

Example 16–3 Block Device close(9E) Routine

static int
xxclose(dev_t dev, int flag, int otyp, cred_t *credp)
{
     minor_t instance;
     struct xxstate *xsp;

     instance = getminor(dev);
     xsp = ddi_get_soft_state(statep, instance);
       if (xsp == NULL)
          return (ENXIO);
     mutex_enter(&xsp->mu);
     switch (otyp) {
       case OTYP_LYR:
       xsp->nlayered--;
       break;
      case OTYP_BLK:
       xsp->open = 0;
       break;
     default:
       mutex_exit(&xsp->mu);
       return (EINVAL);
       }

     if (xsp->open || xsp->nlayered) {
       /* not done yet */
       mutex_exit(&xsp->mu);
       return (0);
     }
       /* cleanup (rewind tape, free memory, etc.) */
   /* wait for I/O to drain */
     mutex_exit(&xsp->mu);

     return (0);
}

`strategy()` Entry Point

The strategy(9E) entry point is used to read and write data buffers to and from a block device. The name strategy refers to the fact that this entry point might implement some optimal strategy for ordering requests to the device.

strategy(9E) can be written to process one request at a time, that is, a synchronous transfer. strategy() can also be written to queue multiple requests to the device, as in an asynchronous transfer. When choosing a method, the abilities and limitations of the device should be taken into account.

The strategy(9E) routine is passed a pointer to a buf(9S) structure. This structure describes the transfer request, and contains status information on return. buf(9S) and strategy(9E) are the focus of block device operations.

`buf` Structure

The following buf structure members are important to block drivers:

     int          b_flags;     /* Buffer Status */
     struct buf       *av_forw;    /* Driver work list link */
     struct buf       *av_back;    /* Driver work list link */
     size_t       b_bcount;    /* # of bytes to transfer */
     union {
     caddr_t      b_addr;      /* Buffer's virtual address */
     } b_un;
     daddr_t      b_blkno;     /* Block number on device */
     diskaddr_t       b_lblkno;    /* Expanded block number on device */
     size_t       b_resid;     /* # of bytes not transferred */
                       /* after error */
     int          b_error;     /* Expanded error field */
     void         *b_private;      /* “opaque” driver private area */
     dev_t        b_edev;      /* expanded dev field */

where:

av_forw and av_back

Pointers that the driver can use to manage a list of buffers by the driver. See Asynchronous Data Transfers (Block Drivers) for a discussion of the av_forw and av_back pointers.

b_bcount

Specifies the number of bytes to be transferred by the device.

b_un.b_addr

The kernel virtual address of the data buffer. Only valid after bp_mapin(9F) call.

b_blkno

The starting 32-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.

b_lblkno

The starting 64-bit logical block number on the device for the data transfer, which is expressed in 512-byte DEV_BSIZE units. The driver should use either b_blkno or b_lblkno but not both.

b_resid

Set by the driver to indicate the number of bytes that were not transferred because of an error. See Example 16–7 for an example of setting b_resid. The b_resid member is overloaded. b_resid is also used by disksort(9F).

b_error

Set to an error number by the driver when a transfer error occurs. b_error is set in conjunction with the b_flags B_ERROR bit. See the Intro(9E) man page for details about error values. Drivers should use bioerror(9F) rather than setting b_error directly.

b_flags

Flags with status and transfer attributes of the buf structure. If B_READ is set, the buf structure indicates a transfer from the device to memory. Otherwise, this structure indicates a transfer from memory to the device. If the driver encounters an error during data transfer, the driver should set the B_ERROR field in the b_flags member. In addition, the driver should provide a more specific error value in b_error. Drivers should use bioerror(9F) rather than setting B_ERROR.

Caution –

Drivers should never clear b_flags.

b_private

For exclusive use by the driver to store driver-private data.

b_edev

Contains the device number of the device that was used in the transfer.

`bp_mapin` Structure

A buf structure pointer can be passed into the device driver's strategy(9E) routine. However, the data buffer referred to by b_un.b_addr is not necessarily mapped in the kernel's address space. Therefore, the driver cannot directly access the data. Most block-oriented devices have DMA capability and therefore do not need to access the data buffer directly. Instead, these devices use the DMA mapping routines to enable the device's DMA engine to do the data transfer. For details about using DMA, see Chapter 9, Direct Memory Access (DMA).

If a driver needs to access the data buffer directly, that driver must first map the buffer into the kernel's address space by using bp_mapin(9F). bp_mapout(9F) should be used when the driver no longer needs to access the data directly.

Caution –

bp_mapout(9F) should only be called on buffers that have been allocated and are owned by the device driver. bp_mapout() must not be called on buffers that are passed to the driver through the strategy(9E) entry point, such as a file system. bp_mapin(9F) does not keep a reference count. bp_mapout(9F) removes any kernel mapping on which a layer over the device driver might rely.

Synchronous Data Transfers (Block Drivers)

This section presents a simple method for performing synchronous I/O transfers. This method assumes that the hardware is a simple disk device that can transfer only one data buffer at a time by using DMA. Another assumption is that the disk can be spun up and spun down by software command. The device driver's strategy(9E) routine waits for the current request to be completed before accepting a new request. The device interrupts when the transfer is complete. The device also interrupts if an error occurs.

The steps for performing a synchronous data transfer for a block driver are as follows:

Check for invalid buf(9S) requests.

Check the buf(9S) structure that is passed to strategy(9E) for validity. All drivers should check the following conditions:
- The request begins at a valid block. The driver converts the b_blkno field to the correct device offset and then determines whether the offset is valid for the device.
- The request does not go beyond the last block on the device.
- Device-specific requirements are met.
If an error is encountered, the driver should indicate the appropriate error with bioerror(9F). The driver should then complete the request by calling biodone(9F). biodone() notifies the caller of strategy(9E) that the transfer is complete. In this case, the transfer has stopped because of an error.
Check whether the device is busy.

Synchronous data transfers allow single-threaded access to the device. The device driver enforces this access in two ways:
- The driver maintains a busy flag that is guarded by a mutex.
- The driver waits on a condition variable with cv_wait(9F), when the device is busy.
If the device is busy, the thread waits until the interrupt handler indicates that the device is not longer busy. The available status can be indicated by either the cv_broadcast(9F) or the cv_signal(9F) function. See Chapter 3, Multithreading for details on condition variables.

When the device is no longer busy, the strategy(9E) routine marks the device as available. strategy() then prepares the buffer and the device for the transfer.
Set up the buffer for DMA.

Prepare the data buffer for a DMA transfer by using ddi_dma_alloc_handle(9F) to allocate a DMA handle. Use ddi_dma_buf_bind_handle(9F) to bind the data buffer to the handle. For information on setting up DMA resources and related data structures, see Chapter 9, Direct Memory Access (DMA).

Begin the transfer.

At this point, a pointer to the buf(9S) structure is saved in the state structure of the device. The interrupt routine can then complete the transfer by calling biodone(9F).

The device driver then accesses device registers to initiate a data transfer. In most cases, the driver should protect the device registers from other threads by using mutexes. In this case, because strategy(9E) is single-threaded, guarding the device registers is not necessary. See Chapter 3, Multithreading for details about data locks.

When the executing thread has started the device's DMA engine, the driver can return execution control to the calling routine, as follows:

static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    struct device_reg *regp;
    minor_t instance;
    ddi_dma_cookie_t cookie;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL) {
       bioerror(bp, ENXIO);
       biodone(bp);
       return (0);
    }
    /* validate the transfer request */
    if ((bp->b_blkno >= xsp->Nblocks) || (bp->b_blkno < 0)) {
       bioerror(bp, EINVAL);    
       biodone(bp);
       return (0);
    }
    /*
     * Hold off all threads until the device is not busy.
     */
    mutex_enter(&xsp->mu);
    while (xsp->busy) {
       cv_wait(&xsp->cv, &xsp->mu);
    }
    xsp->busy = 1;
    mutex_exit(&xsp->mu);
    /* 
     * If the device has power manageable components, 
     * mark the device busy with pm_busy_components(9F),
     * and then ensure that the device 
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    regp = xsp->regp;
    ddi_put32(xsp->data_access_handle, &regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &regp->dma_size,
         (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &regp->csr,
         ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}

Handle the interrupting device.

When the device finishes the data transfer, the device generates an interrupt, which eventually results in the driver's interrupt routine being called. Most drivers specify the state structure of the device as the argument to the interrupt routine when registering interrupts. See the ddi_add_intr(9F) man page and Registering Interrupts. The interrupt routine can then access the buf(9S) structure being transferred, plus any other information that is available from the state structure.

The interrupt handler should check the device's status register to determine whether the transfer completed without error. If an error occurred, the handler should indicate the appropriate error with bioerror(9F). The handler should also clear the pending interrupt for the device and then complete the transfer by calling biodone(9F).

As the final task, the handler clears the busy flag. The handler then calls cv_signal(9F) or cv_broadcast(9F) on the condition variable, signaling that the device is no longer busy. This notification enables other threads waiting for the device in strategy(9E) to proceed with the next data transfer.

The following example shows a synchronous interrupt routine.

Example 16–4 Synchronous Interrupt Routine for Block Drivers

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
       mutex_exit(&xsp->mu);
       return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
       /* failure */
       bp->b_resid = bp->b_bcount;
       bioerror(bp, EIO);
    } else {
       /* success */
       bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
       CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in strategy(9F), mark them idle now with
     * pm_idle_component(9F)
     * Release any resources used in the transfer, such as DMA
     * resources ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F).
     *
     * Let the next I/O thread have access to the device.
     */
    xsp->busy = 0;
    cv_signal(&xsp->cv);
    mutex_exit(&xsp->mu);
    return (DDI_INTR_CLAIMED);
}

Asynchronous Data Transfers (Block Drivers)

This section presents a method for performing asynchronous I/O transfers. The driver queues the I/O requests and then returns control to the caller. Again, the assumption is that the hardware is a simple disk device that allows one transfer at a time. The device interrupts when a data transfer has completed. An interrupt also takes place if an error occurs. The basic steps for performing asynchronous data transfers are:

Check for invalid buf(9S) requests.
Enqueue the request.
Start the first transfer.
Handle the interrupting device.

Checking for Invalid `buf` Requests

As in the synchronous case, the device driver should check the buf(9S) structure passed to strategy(9E) for validity. See Synchronous Data Transfers (Block Drivers) for more details.

Enqueuing the Request

Unlike synchronous data transfers, a driver does not wait for an asynchronous request to complete. Instead, the driver adds the request to a queue. The head of the queue can be the current transfer. The head of the queue can also be a separate field in the state structure for holding the active request, as in Example 16–5.

If the queue is initially empty, then the hardware is not busy and strategy(9E) starts the transfer before returning. Otherwise, if a transfer completes with a non-empty queue, the interrupt routine begins a new transfer. Example 16–5 places the decision of whether to start a new transfer into a separate routine for convenience.

The driver can use the av_forw and the av_back members of the buf(9S) structure to manage a list of transfer requests. A single pointer can be used to manage a singly linked list, or both pointers can be used together to build a doubly linked list. The device hardware specification specifies which type of list management, such as insertion policies, is used to optimize the performance of the device. The transfer list is a per-device list, so the head and tail of the list are stored in the state structure.

The following example provides multiple threads with access to the driver shared data, such as the transfer list. You must identify the shared data and must protect the data with a mutex. See Chapter 3, Multithreading for more details about mutex locks.

Example 16–5 Enqueuing Data Transfer Requests for Block Drivers

static int
xxstrategy(struct buf *bp)
{
    struct xxstate *xsp;
    minor_t instance;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    /* ... */
    /* validate transfer request */
    /* ... */
    /*
     * Add the request to the end of the queue. Depending on the device, a sorting
     * algorithm, such as disksort(9F) can be used if it improves the
     * performance of the device.
     */
    mutex_enter(&xsp->mu);
    bp->av_forw = NULL;
    if (xsp->list_head) {
       /* Non-empty transfer list */
       xsp->list_tail->av_forw = bp;
       xsp->list_tail = bp;
    } else {
       /* Empty Transfer list */
       xsp->list_head = bp;
       xsp->list_tail = bp;
    }
    mutex_exit(&xsp->mu);
    /* Start the transfer if possible */
    (void) xxstart((caddr_t)xsp);
    return (0);
}

Starting the First Transfer

Device drivers that implement queuing usually have a start() routine. start() dequeues the next request and starts the data transfer to or from the device. In this example, start() processes all requests regardless of the state of the device, whether busy or free.

Note –

start() must be written to be called from any context. start() can be called by both the strategy routine in kernel context and the interrupt routine in interrupt context.

start() is called by strategy(9E) every time strategy() queues a request so that an idle device can be started. If the device is busy, start() returns immediately.

start() is also called by the interrupt handler before the handler returns from a claimed interrupt so that a nonempty queue can be serviced. If the queue is empty, start() returns immediately.

Because start() is a private driver routine, start() can take any arguments and can return any type. The following code sample is written to be used as a DMA callback, although that portion is not shown. Accordingly, the example must take a caddr_t as an argument and return an int. See Handling Resource Allocation Failures for more information about DMA callback routines.

Example 16–6 Starting the First Data Request for a Block Driver

static int
xxstart(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;

    mutex_enter(&xsp->mu);
    /*
     * If there is nothing more to do, or the device is
     * busy, return.
     */
    if (xsp->list_head == NULL || xsp->busy) {
       mutex_exit(&xsp->mu);
       return (0);
    }
    xsp->busy = 1;
    /* Get the first buffer off the transfer list */
    bp = xsp->list_head;
    /* Update the head and tail pointer */
    xsp->list_head = xsp->list_head->av_forw;
    if (xsp->list_head == NULL)
       xsp->list_tail = NULL;
    bp->av_forw = NULL;
    mutex_exit(&xsp->mu);
    /*
     * If the device has power manageable components,
     * mark the device busy with pm_busy_components(9F),
     * and then ensure that the device
     * is powered up by calling pm_raise_power(9F).
     *
     * Set up DMA resources with ddi_dma_alloc_handle(9F) and
     * ddi_dma_buf_bind_handle(9F).
     */
    xsp->bp = bp;
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_addr,
        cookie.dmac_address);
    ddi_put32(xsp->data_access_handle, &xsp->regp->dma_size,
         (uint32_t)cookie.dmac_size);
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
         ENABLE_INTERRUPTS | START_TRANSFER);
    return (0);
}

Handling the Interrupting Device

The interrupt routine is similar to the asynchronous version, with the addition of the call to start() and the removal of the call to cv_signal(9F).

Example 16–7 Block Driver Routine for Asynchronous Interrupts

static u_int
xxintr(caddr_t arg)
{
    struct xxstate *xsp = (struct xxstate *)arg;
    struct buf *bp;
    uint8_t status;
    mutex_enter(&xsp->mu);
    status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
    if (!(status & INTERRUPTING)) {
        mutex_exit(&xsp->mu);
        return (DDI_INTR_UNCLAIMED);
    }
    /* Get the buf responsible for this interrupt */
    bp = xsp->bp;
    xsp->bp = NULL;
    /*
     * This example is for a simple device which either
     * succeeds or fails the data transfer, indicated in the
     * command/status register.
     */
    if (status & DEVICE_ERROR) {
        /* failure */
        bp->b_resid = bp->b_bcount;
        bioerror(bp, EIO);
    } else {
        /* success */
        bp->b_resid = 0;
    }
    ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
        CLEAR_INTERRUPT);
    /* The transfer has finished, successfully or not */
    biodone(bp);
    /*
     * If the device has power manageable components that were
     * marked busy in strategy(9F), mark them idle now with
     * pm_idle_component(9F)
     * Release any resources used in the transfer, such as DMA
     * resources (ddi_dma_unbind_handle(9F) and
     * ddi_dma_free_handle(9F)).
     *
     * Let the next I/O thread have access to the device.
     */
    xsp->busy = 0;
    mutex_exit(&xsp->mu);
    (void) xxstart((caddr_t)xsp);
    return (DDI_INTR_CLAIMED);
}

`dump()` and `print()` Entry Points

This section discusses the dump(9E) and print(9E) entry points.

`dump()` Entry Point (Block Drivers)

The dump(9E) entry point is used to copy a portion of virtual address space directly to the specified device in the case of a system failure. dump() is also used to copy the state of the kernel out to disk during a checkpoint operation. See the cpr(7) and dump(9E) man pages for more information. The entry point must be capable of performing this operation without the use of interrupts, because interrupts are disabled during the checkpoint operation.

int dump(dev_t dev, caddr_t addr, daddr_t blkno, int nblk)

where:

dev: Device number of the device to receive the dump.
addr: Base kernel virtual address at which to start the dump.
blkno: Block at which the dump is to start.
nblk: Number of blocks to dump.

The dump depends upon the existing driver working properly.

`print()` Entry Point (Block Drivers)

int print(dev_t dev, char *str)

The print(9E) entry point is called by the system to display a message about an exception that has been detected. print(9E) should call cmn_err(9F) to post the message to the console on behalf of the system. The following example demonstrates a typical print() entry point.

static int
 xxprint(dev_t dev, char *str)
 {
     cmn_err(CE_CONT, “xx: %s\n”, str);
     return (0);
 }

Disk Device Drivers

Disk devices represent an important class of block device drivers.

Disk `ioctl`s

Solaris disk drivers need to support a minimum set of ioctl commands specific to Solaris disk drivers. These I/O controls are specified in the dkio(7I) manual page. Disk I/O controls transfer disk information to or from the device driver. A Solaris disk device is supported by disk utility commands such as format(1M) and newfs(1M). The mandatory Sun disk I/O controls are as follows:

DKIOCINFO: Returns information that describes the disk controller
DKIOCGAPART: Returns a disk's partition map
DKIOCSAPART: Sets a disk's partition map
DKIOCGGEOM: Returns a disk's geometry
DKIOCSGEOM: Sets a disk's geometry
DKIOCGVTOC: Returns a disk's Volume Table of Contents
DKIOCSVTOC: Sets a disk's Volume Table of Contents

Disk Performance

The Solaris DDI/DKI provides facilities to optimize I/O transfers for improved file system performance. A mechanism manages the list of I/O requests so as to optimize disk access for a file system. See Asynchronous Data Transfers (Block Drivers) for a description of enqueuing an I/O request.

The diskhd structure is used to manage a linked list of I/O requests.

struct diskhd {
    long     b_flags;         /* not used, needed for consistency*/
    struct   buf *b_forw,    *b_back;       /* queue of unit queues */
    struct   buf *av_forw,    *av_back;    /* queue of bufs for this unit */
    long     b_bcount;            /* active flag */
};

The diskhd data structure has two buf pointers that the driver can manipulate. The av_forw pointer points to the first active I/O request. The second pointer, av_back, points to the last active request on the list.

A pointer to this structure is passed as an argument to disksort(9F), along with a pointer to the current buf structure being processed. The disksort() routine sorts the buf requests to optimize disk seek. The routine then inserts the buf pointer into the diskhd list. The disksort() program uses the value that is in b_resid of the buf structure as a sort key. The driver is responsible for setting this value. Most Sun disk drivers use the cylinder group as the sort key. This approach optimizes the file system read-ahead accesses.

When data has been added to the diskhd list, the device needs to transfer the data. If the device is not busy processing a request, the xxstart() routine pulls the first buf structure off the diskhd list and starts a transfer.

If the device is busy, the driver should return from the xxstrategy() entry point. When the hardware is done with the data transfer, an interrupt is generated. The driver's interrupt routine is then called to service the device. After servicing the interrupt, the driver can then call the start() routine to process the next buf structure in the diskhd list.

Chapter 17 SCSI Target Drivers

The Solaris DDI/DKI divides the software interface to SCSI devices into two major parts: target drivers and host bus adapter (HBA) drivers. Target refers to a driver for a device on a SCSI bus, such as a disk or a tape drive. Host bus adapter refers to the driver for the SCSI controller on the host machine. SCSA defines the interface between these two components. This chapter discusses target drivers only. See Chapter 18, SCSI Host Bus Adapter Drivers for information on host bus adapter drivers.

Note –

The terms “host bus adapter” and “HBA” are equivalent to “host adapter,” which is defined in SCSI specifications.

This chapter provides information on the following subjects:

Introduction to Target Drivers

Target drivers can be either character or block device drivers, depending on the device. Drivers for tape drives are usually character device drivers, while disks are handled by block device drivers. This chapter describes how to write a SCSI target driver. The chapter discusses the additional requirements that SCSA places on block and character drivers for SCSI target devices.

The following reference documents provide supplemental information needed by the designers of target drivers and host bus adapter drivers.

Small Computer System Interface 2 (SCSI-2), ANSI/NCITS X3.131-1994, Global Engineering Documents, 1998. ISBN 1199002488.

The Basics of SCSI, Fourth Edition, ANCOT Corporation, 1998. ISBN 0963743988.

Refer also to the SCSI command specification for the target device, provided by the hardware vendor.

Sun Common SCSI Architecture Overview

The Sun Common SCSI Architecture (SCSA) is the Solaris DDI/DKI programming interface for the transmission of SCSI commands from a target driver to a host bus adapter driver. This interface is independent of the type of host bus adapter hardware, the platform, the processor architecture, and the SCSI command being transported across the interface.

Conforming to the SCSA enables the target driver to pass SCSI commands to target devices without knowledge of the hardware implementation of the host bus adapter.

The SCSA conceptually separates building the SCSI command from transporting the command with data across the SCSI bus. The architecture defines the software interface between high-level and low-level software components. The higher level software component consists of one or more SCSI target drivers, which translate I/O requests into SCSI commands appropriate for the peripheral device. The following example illustrates the SCSI architecture.

Figure 17–1 SCSA Block Diagram

Diagram shows the role of the Sun Common SCSI Architecture
in relation to SCSI drivers in the operating system.

The lower-level software component consists of a SCSA interface layer and one or more host bus adapter drivers. The target driver is responsible for the generation of the proper SCSI commands required to execute the desired function and for processing the results.

General Flow of Control

Assuming no transport errors occur, the following steps describe the general flow of control for a read or write request.

The target driver's read(9E) or write(9E) entry point is invoked. physio(9F) is used to lock down memory, prepare a buf structure, and call the strategy routine.
The target driver's strategy(9E) routine checks the request. strategy() then allocates a scsi_pkt(9S) by using scsi_init_pkt(9F). The target driver initializes the packet and sets the SCSI command descriptor block (CDB) using the scsi_setup_cdb(9F) function. The target driver also specifies a timeout. Then, the driver provides a pointer to a callback function. The callback function is called by the host bus adapter driver on completion of the command. The buf(9S) pointer should be saved in the SCSI packet's target-private space.
The target driver submits the packet to the host bus adapter driver by using scsi_transport(9F). The target driver is then free to accept other requests. The target driver should not access the packet while the packet is in transport. If either the host bus adapter driver or the target supports queueing, new requests can be submitted while the packet is in transport.
As soon as the SCSI bus is free and the target not busy, the host bus adapter driver selects the target and passes the CDB. The target driver executes the command. The target then performs the requested data transfers.
After the target sends completion status and the command completes, the host bus adapter driver notifies the target driver. To perform the notification, the host calls the completion function that was specified in the SCSI packet. At this time the host bus adapter driver is no longer responsible for the packet, and the target driver has regained ownership of the packet.
The SCSI packet's completion routine analyzes the returned information. The completion routine then determines whether the SCSI operation was successful. If a failure has occurred, the target driver retries the command by calling scsi_transport(9F) again. If the host bus adapter driver does not support auto request sense, the target driver must submit a request sense packet to retrieve the sense data in the event of a check condition.
After successful completion or if the command cannot be retried, the target driver calls scsi_destroy_pkt(9F). scsi_destroy_pkt() synchronizes the data. scsi_destroy_pkt() then frees the packet. If the target driver needs to access the data before freeing the packet, scsi_sync_pkt(9F) is called.
Finally, the target driver notifies the requesting application that the read or write transaction is complete. This notification is made by returning from the read(9E) entry point in the driver for character devices. Otherwise, notification is made indirectly through biodone(9F).

SCSA allows the execution of many of such operations, both overlapped and queued, at various points in the process. The model places the management of system resources on the host bus adapter driver. The software interface enables the execution of target driver functions on host bus adapter drivers by using SCSI bus adapters of varying degrees of sophistication.

SCSA Functions

SCSA defines functions to manage the allocation and freeing of resources, the sensing and setting of control states, and the transport of SCSI commands. These functions are listed in the following table.

Table 17–1 Standard SCSA Functions


Function Name	Category
scsi_abort(9F)	Error handling
scsi_alloc_consistent_buf(9F)
scsi_destroy_pkt(9F)
scsi_dmafree(9F)
scsi_free_consistent_buf(9F)
scsi_ifgetcap(9F)	Transport information and control
scsi_ifsetcap(9F)
scsi_init_pkt(9F)	Resource management
scsi_poll(9F)	Polled I/O
scsi_probe(9F)	Probe functions
scsi_reset(9F)
scsi_setup_cdb(9F)	CDB initialization function
scsi_sync_pkt(9F)
scsi_transport(9F)	Command transport
scsi_unprobe(9F)

Note –

If your driver needs to work with a SCSI-1 device, use the makecom(9F).

Hardware Configuration File

Because SCSI devices are not self-identifying, a hardware configuration file is required for a target driver. See the driver.conf(4) and scsi_free_consistent_buf(9F) man pages for details. The following is a typical configuration file:

    name="xx" class="scsi" target=2 lun=0;

The system reads the file during autoconfiguration. The system uses the class property to identify the driver's possible parent. Then, the system attempts to attach the driver to any parent driver that is of class scsi. All host bus adapter drivers are of this class. Using the class property rather than the parent property is preferred. This approach enables any host bus adapter driver that finds the expected device at the specified target and lun IDs to attach to the target. The target driver is responsible for verifying the class in its probe(9E) routine.

Declarations and Data Structures

Target drivers must include the header file <sys/scsi/scsi.h>.

SCSI target drivers must use the following command to generate a binary module:

ld -r xx xx.o -N"misc/scsi"

`scsi_device` Structure

The host bus adapter driver allocates and initializes a scsi_device(9S) structure for the target driver before either the probe(9E) or attach(9E) routine is called. This structure stores information about each SCSI logical unit, including pointers to information areas that contain both generic and device-specific information. One scsi_device(9S) structure exists for each logical unit that is attached to the system. The target driver can retrieve a pointer to this structure by calling ddi_get_driver_private(9F).

Caution –

Because the host bus adapter driver uses the private field in the target device's dev_info structure, target drivers must not use ddi_set_driver_private(9F).

The scsi_device(9S) structure contains the following fields:

struct scsi_device {
    struct scsi_address           sd_address;    /* opaque address */
    dev_info_t                    *sd_dev;       /* device node */
    kmutex_t                      sd_mutex;
    void                          *sd_reserved;
    struct scsi_inquiry           *sd_inq;
    struct scsi_extended_sense    *sd_sense;
    caddr_t                       sd_private;
};

where:

sd_address: Data structure that is passed to the routines for SCSI resource allocation.
sd_dev: Pointer to the target's dev_info structure.
sd_mutex: Mutex for use by the target driver. This mutex is initialized by the host bus adapter driver and can be used by the target driver as a per-device mutex. Do not hold this mutex across a call to scsi_transport(9F) or scsi_poll(9F). See Chapter 3, Multithreading for more information on mutexes.
sd_inq: Pointer for the target device's SCSI inquiry data. The scsi_probe(9F) routine allocates a buffer, fills the buffer in with inquiry data, and attaches the buffer to this field.
sd_sense: Pointer to a buffer to contain SCSI request sense data from the device. The target driver must allocate and manage this buffer. See attach() Entry Point (SCSI Target Drivers).
sd_private: Pointer field for use by the target driver. This field is commonly used to store a pointer to a private target driver state structure.

`scsi_pkt` Structure (Target Drivers)

The scsi_pkt structure contains the following fields:

struct scsi_pkt {
    opaque_t  pkt_ha_private;         /* private data for host adapter */
    struct scsi_address pkt_address;  /* destination packet is for */
    opaque_t  pkt_private;            /* private data for target driver */
    void     (*pkt_comp)(struct scsi_pkt *);  /* completion routine */
    uint_t   pkt_flags;               /* flags */
    int      pkt_time;                /* time allotted to complete command */
    uchar_t  *pkt_scbp;               /* pointer to status block */
    uchar_t  *pkt_cdbp;               /* pointer to command block */
    ssize_t  pkt_resid;               /* data bytes not transferred */
    uint_t   pkt_state;               /* state of command */
    uint_t   pkt_statistics;          /* statistics */
    uchar_t  pkt_reason;              /* reason completion called */
};

where:

pkt_address: Target device's address set by scsi_init_pkt(9F).
pkt_private: Place to store private data for the target driver. pkt_private is commonly used to save the buf(9S) pointer for the command.
pkt_comp: Address of the completion routine. The host bus adapter driver calls this routine when the driver has transported the command. Transporting the command does not mean that the command succeeded. The target might have been busy. Another possibility is that the target might not have responded before the time out period elapsed. See the description for pkt_time field. The target driver must supply a valid value in this field. This value can be NULL if the driver does not want to be notified.

Note –

Two different SCSI callback routines are provided. The pkt_comp field identifies a completion callback routine, which is called when the host bus adapter completes its processing. A resource callback routine is also available, which is called when currently unavailable resources are likely to be available. See the scsi_init_pkt(9F) man page.

pkt_flags

Provides additional control information, for example, to transport the command without disconnect privileges (FLAG_NODISCON) or to disable callbacks (FLAG_NOINTR). See the scsi_pkt(9S) man page for details.

pkt_time

Time out value in seconds. If the command is not completed within this time, the host bus adapter calls the completion routine with pkt_reason set to CMD_TIMEOUT. The target driver should set this field to longer than the maximum time the command might take. If the timeout is zero, no timeout is requested. Timeout starts when the command is transmitted on the SCSI bus.

pkt_scbp

Pointer to the block for SCSI status completion. This field is filled in by the host bus adapter driver.

pkt_cdbp

Pointer to the SCSI command descriptor block, the actual command to be sent to the target device. The host bus adapter driver does not interpret this field. The target driver must fill the field in with a command that the target device can process.

pkt_resid

Residual of the operation. The pkt_resid field has two different uses depending on how pkt_resid is used. When pkt_resid is used to allocate DMA resources for a command scsi_init_pkt(9F), pkt_resid indicates the number of unallocable bytes. DMA resources might not be allocated due to DMA hardware scatter-gather or other device limitations. After command transport, pkt_resid indicates the number of non-transferable data bytes. The field is filled in by the host bus adapter driver before the completion routine is called.

pkt_state

Indicates the state of the command. The host bus adapter driver fills in this field as the command progresses. One bit is set in this field for each of the five following command states:

STATE_GOT_BUS – Acquired the bus
STATE_GOT_TARGET – Selected the target
STATE_SENT_CMD – Sent the command
STATE_XFERRED_DATA – Transferred data, if appropriate
STATE_GOT_STATUS – Received status from the device

pkt_statistics

Contains transport-related statistics set by the host bus adapter driver.

pkt_reason

Gives the reason the completion routine was called. The completion routine decodes this field. The routine then takes the appropriate action. If the command completes, that is, no transport errors occur, this field is set to CMD_CMPLT. Other values in this field indicate an error. After a command is completed, the target driver should examine the pkt_scbp field for a check condition status. See the scsi_pkt(9S) man page for more information.

Autoconfiguration for SCSI Target Drivers

SCSI target drivers must implement the standard autoconfiguration routines _init(9E), _fini(9E), and _info(9E). See Loadable Driver Interfaces for more information.

The following routines are also required, but these routines must perform specific SCSI and SCSA processing:

`probe()` Entry Point (SCSI Target Drivers)

SCSI target devices are not self-identifying, so target drivers must have a probe(9E) routine. This routine must determine whether the expected type of device is present and responding.

The general structure and the return codes of the probe(9E) routine are the same as the structure and return codes for other device drivers. SCSI target drivers must use the scsi_probe(9F) routine in their probe(9E) entry point. scsi_probe(9F) sends a SCSI inquiry command to the device and returns a code that indicates the result. If the SCSI inquiry command is successful, scsi_probe(9F) allocates a scsi_inquiry(9S) structure and fills the structure in with the device's inquiry data. Upon return from scsi_probe(9F), the sd_inq field of the scsi_device(9S) structure points to this scsi_inquiry(9S) structure.

Because probe(9E) must be stateless, the target driver must call scsi_unprobe(9F) before probe(9E) returns, even if scsi_probe(9F) fails.

Example 17–1 shows a typical probe(9E) routine. The routine in the example retrieves the scsi_device(9S) structure from the private field of its dev_info structure. The routine also retrieves the device's SCSI target and logical unit numbers for printing in messages. The probe(9E) routine then calls scsi_probe(9F) to verify that the expected device, a printer in this case, is present.

If successful, scsi_probe(9F) attaches the device's SCSI inquiry data in a scsi_inquiry(9S) structure to the sd_inq field of the scsi_device(9S) structure. The driver can then determine whether the device type is a printer, which is reported in the inq_dtype field. If the device is a printer, the type is reported with scsi_log(9F), using scsi_dname(9F) to convert the device type into a string.

Example 17–1 SCSI Target Driver probe(9E) Routine

static int
xxprobe(dev_info_t *dip)
{
    struct scsi_device *sdp;
    int rval, target, lun;
    /*
     * Get a pointer to the scsi_device(9S) structure
     */
    sdp = (struct scsi_device *)ddi_get_driver_private(dip);

    target = sdp->sd_address.a_target;
    lun = sdp->sd_address.a_lun;
    /*
     * Call scsi_probe(9F) to send the Inquiry command. It will
     * fill in the sd_inq field of the scsi_device structure.
     */
    switch (scsi_probe(sdp, NULL_FUNC)) {
    case SCSIPROBE_FAILURE:
    case SCSIPROBE_NORESP:
    case SCSIPROBE_NOMEM:
       /*
        * In these cases, device might be powered off,
        * in which case we might be able to successfully
        * probe it at some future time - referred to
        * as `deferred attach'.
        */
        rval = DDI_PROBE_PARTIAL;
        break;
    case SCSIPROBE_NONCCS:
    default:
        /*
         * Device isn't of the type we can deal with,
         * and/or it will never be usable.
         */
        rval = DDI_PROBE_FAILURE;
        break;
    case SCSIPROBE_EXISTS:
        /*
         * There is a device at the target/lun address. Check
         * inq_dtype to make sure that it is the right device
         * type. See scsi_inquiry(9S)for possible device types.
         */
        switch (sdp->sd_inq->inq_dtype) {
        case DTYPE_PRINTER:
        scsi_log(sdp, "xx", SCSI_DEBUG,
           "found %s device at target%d, lun%d\n",
            scsi_dname((int)sdp->sd_inq->inq_dtype),
            target, lun);
        rval = DDI_PROBE_SUCCESS;
        break;
        case DTYPE_NOTPRESENT:
        default:
        rval = DDI_PROBE_FAILURE;
        break;     
        }    
    }
    scsi_unprobe(sdp);
    return (rval);
}

A more thorough probe(9E) routine could check scsi_inquiry(9S) to make sure that the device is of the type expected by a particular driver.

`attach()` Entry Point (SCSI Target Drivers)

After the probe(9E) routine has verified that the expected device is present, attach(9E) is called. attach() performs these tasks:

Allocates and initializes any per-instance data.
Creates minor device node information.
Restores the hardware state of a device after a suspension of the device or the system. See attach() Entry Point for details.

A SCSI target driver needs to call scsi_probe(9F) again to retrieve the device's inquiry data. The driver must also create a SCSI request sense packet. If the attach is successful, the attach() function should not call scsi_unprobe(9F).

Three routines are used to create the request sense packet: scsi_alloc_consistent_buf(9F), scsi_init_pkt(9F), and scsi_setup_cdb(9F). scsi_alloc_consistent_buf(9F) allocates a buffer that is suitable for consistent DMA. scsi_alloc_consistent_buf() then returns a pointer to a buf(9S) structure. The advantage of a consistent buffer is that no explicit synchronization of the data is required. In other words, the target driver can access the data after the callback. The sd_sense element of the device's scsi_device(9S) structure must be initialized with the address of the sense buffer. scsi_init_pkt(9F) creates and partially initializes a scsi_pkt(9S) structure. scsi_setup_cdb(9F) creates a SCSI command descriptor block, in this case by creating a SCSI request sense command.

Note that a SCSI device is not self-identifying and does not have a reg property. As a result, the driver must set the pm-hardware-state property. Setting pm-hardware-state informs the framework that this device needs to be suspended and then resumed.

The following example shows the SCSI target driver's attach() routine.

Example 17–2 SCSI Target Driver attach(9E) Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    struct xxstate         *xsp;
    struct scsi_pkt        *rqpkt = NULL;
    struct scsi_device     *sdp;
    struct buf         *bp = NULL;
    int            instance;
    instance = ddi_get_instance(dip);
    switch (cmd) {
        case DDI_ATTACH:
        break;
        case DDI_RESUME:
        /* For information, see the "Directory Memory Access (DMA)" */
        /* chapter in this book. */
        default:
        return (DDI_FAILURE);
    }
    /*
     * Allocate a state structure and initialize it.
     */
    xsp = ddi_get_soft_state(statep, instance);
    sdp = (struct scsi_device *)ddi_get_driver_private(dip);
    /*
     * Cross-link the state and scsi_device(9S) structures.
     */
    sdp->sd_private = (caddr_t)xsp;
    xsp->sdp = sdp;
    /*
     * Call scsi_probe(9F) again to get and validate inquiry data.
     * Allocate a request sense buffer. The buf(9S) structure
     * is set to NULL to tell the routine to allocate a new one.
     * The callback function is set to NULL_FUNC to tell the
     * routine to return failure immediately if no
     * resources are available.
     */
    bp = scsi_alloc_consistent_buf(&sdp->sd_address, NULL,
    SENSE_LENGTH, B_READ, NULL_FUNC, NULL);
    if (bp == NULL)
        goto failed;
    /*
     * Create a Request Sense scsi_pkt(9S) structure.
     */
    rqpkt = scsi_init_pkt(&sdp->sd_address, NULL, bp,
    CDB_GROUP0, 1, 0, PKT_CONSISTENT, NULL_FUNC, NULL);
    if (rqpkt == NULL)
        goto failed;
    /*
     * scsi_alloc_consistent_buf(9F) returned a buf(9S) structure.
     * The actual buffer address is in b_un.b_addr.
     */
    sdp->sd_sense = (struct scsi_extended_sense *)bp->b_un.b_addr;
    /*
     * Create a Group0 CDB for the Request Sense command
     */
    if (scsi_setup_cdb((union scsi_cdb *)rqpkt->pkt_cdbp,
        SCMD_REQUEST_SENSE, 0, SENSE__LENGTH, 0) == 0)
         goto failed;;
    /*
     * Fill in the rest of the scsi_pkt structure.
     * xxcallback() is the private command completion routine.
     */
    rqpkt->pkt_comp = xxcallback;
    rqpkt->pkt_time = 30; /* 30 second command timeout */
    rqpkt->pkt_flags |= FLAG_SENSING;
    xsp->rqs = rqpkt;
    xsp->rqsbuf = bp;
    /*
     * Create minor nodes, report device, and do any other initialization. */
     * Since the device does not have the 'reg' property,
     * cpr will not call its DDI_SUSPEND/DDI_RESUME entries.
     * The following code is to tell cpr that this device
     * needs to be suspended and resumed.
     */
    (void) ddi_prop_update_string(device, dip,
     "pm-hardware-state", "needs-suspend-resume");
    xsp->open = 0;
    return (DDI_SUCCESS);
failed:
    if (bp)
        scsi_free_consistent_buf(bp);
    if (rqpkt)
        scsi_destroy_pkt(rqpkt);
    sdp->sd_private = (caddr_t)NULL;
    sdp->sd_sense = NULL;
    scsi_unprobe(sdp);
    /* Free any other resources, such as the state structure. */
    return (DDI_FAILURE);
}

`detach()` Entry Point (SCSI Target Drivers)

The detach(9E) entry point is the inverse of attach(9E). detach() must free all resources that were allocated in attach(). If successful, the detach should call scsi_unprobe(9F). The following example shows a target driver detach() routine.

Example 17–3 SCSI Target Driver detach(9E) Routine

static int
xxdetach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
    struct xxstate *xsp;
    switch (cmd) {
    case DDI_DETACH:
      /*
       * Normal detach(9E) operations, such as getting a
       * pointer to the state structure
       */
      scsi_free_consistent_buf(xsp->rqsbuf);
      scsi_destroy_pkt(xsp->rqs);
      xsp->sdp->sd_private = (caddr_t)NULL;
      xsp->sdp->sd_sense = NULL;
      scsi_unprobe(xsp->sdp);
      /*
       * Remove minor nodes.
       * Free resources, such as the state structure and properties.
       */
          return (DDI_SUCCESS);
    case DDI_SUSPEND:
      /* For information, see the "Directory Memory Access (DMA)" */
      /* chapter in this book. */
    default:
      return (DDI_FAILURE);
    }
}

`getinfo()` Entry Point (SCSI Target Drivers)

The getinfo(9E) routine for SCSI target drivers is much the same as for other drivers (see getinfo() Entry Point for more information on DDI_INFO_DEVT2INSTANCE case). However, in the DDI_INFO_DEVT2DEVINFO case of the getinfo() routine, the target driver must return a pointer to its dev_info node. This pointer can be saved in the driver state structure or can be retrieved from the sd_dev field of the scsi_device(9S) structure. The following example shows an alternative SCSI target driver getinfo() code fragment.

Example 17–4 Alternative SCSI Target Driver `getinfo()` Code Fragment

case DDI_INFO_DEVT2DEVINFO:
    dev = (dev_t)arg;
    instance = getminor(dev);
    xsp = ddi_get_soft_state(statep, instance);
    if (xsp == NULL)
        return (DDI_FAILURE);
    *result = (void *)xsp->sdp->sd_dev;
    return (DDI_SUCCESS);

Resource Allocation

To send a SCSI command to the device, the target driver must create and initialize a scsi_pkt(9S) structure. This structure must then be passed to the host bus adapter driver.

`scsi_init_pkt()` Function

The scsi_init_pkt(9F) routine allocates and zeroes a scsi_pkt(9S) structure. scsi_init_pkt() also sets pointers to pkt_private, *pkt_scbp, and *pkt_cdbp. Additionally, scsi_init_pkt() provides a callback mechanism to handle the case where resources are not available. This function has the following syntax:

struct scsi_pkt *scsi_init_pkt(struct scsi_address *ap,
     struct scsi_pkt *pktp, struct buf *bp, int cmdlen,
     int statuslen, int privatelen, int flags,
     int (*callback)(caddr_t), caddr_t arg)

where:

ap

Pointer to a scsi_address structure. ap is the sd_address field of the device's scsi_device(9S) structure.

pktp

Pointer to the scsi_pkt(9S) structure to be initialized. If this pointer is set to NULL, a new packet is allocated.

bp

Pointer to a buf(9S) structure. If this pointer is not null and has a valid byte count, DMA resources are allocated.

cmdlen

Length of the SCSI command descriptor block in bytes.

statuslen

Required length of the SCSI status completion block in bytes.

privatelen

Number of bytes to allocate for the pkt_private field.

flags

Set of flags:

PKT_CONSISTENT – This bit must be set if the DMA buffer was allocated using scsi_alloc_consistent_buf(9F). In this case, the host bus adapter driver guarantees that the data transfer is properly synchronized before performing the target driver's command completion callback.
PKT_DMA_PARTIAL – This bit can be set if the driver accepts a partial DMA mapping. If set, scsi_init_pkt(9F) allocates DMA resources with the DDI_DMA_PARTIAL flag set. The pkt_resid field of the scsi_pkt(9S) structure can be returned with a nonzero residual. A nonzero value indicates the number of bytes for which scsi_init_pkt(9F) was unable to allocate DMA resources.

callback

Specifies the action to take if resources are not available. If set to NULL_FUNC, scsi_init_pkt(9F) returns the value NULL immediately. If set to SLEEP_FUNC, scsi_init_pkt() does not return until resources are available. Any other valid kernel address is interpreted as the address of a function to be called when resources are likely to be available.

arg

Parameter to pass to the callback function.

The scsi_init_pkt() routine synchronizes the data prior to transport. If the driver needs to access the data after transport, the driver should call scsi_sync_pkt(9F) to flush any intermediate caches. The scsi_sync_pkt() routine can be used to synchronize any cached data.

`scsi_sync_pkt()` Function

If the target driver needs to resubmit the packet after changing the data, scsi_sync_pkt(9F) must be called before calling scsi_transport(9F). However, if the target driver does not need to access the data, scsi_sync_pkt() does not need to be called after the transport.

`scsi_destroy_pkt()` Function

The scsi_destroy_pkt(9F) routine synchronizes any remaining cached data that is associated with the packet, if necessary. The routine then frees the packet and associated command, status, and target driver-private data areas. This routine should be called in the command completion routine.

`scsi_alloc_consistent_buf()` Function

For most I/O requests, the data buffer passed to the driver entry points is not accessed directly by the driver. The buffer is just passed on to scsi_init_pkt(9F). If a driver sends SCSI commands that operate on buffers that the driver itself examines, the buffers should be DMA consistent. The SCSI request sense command is a good example. The scsi_alloc_consistent_buf(9F) routine allocates a buf(9S) structure and a data buffer that is suitable for DMA-consistent operations. The HBA performs any necessary synchronization of the buffer before performing the command completion callback.

Note –

scsi_alloc_consistent_buf(9F) uses scarce system resources. Thus, you should use scsi_alloc_consistent_buf() sparingly.

`scsi_free_consistent_buf()` Function

scsi_free_consistent_buf(9F) releases a buf(9S) structure and the associated data buffer allocated with scsi_alloc_consistent_buf(9F). See attach() Entry Point (SCSI Target Drivers) and detach() Entry Point (SCSI Target Drivers) for examples.

Building and Transporting a Command

The host bus adapter driver is responsible for transmitting the command to the device. Furthermore, the driver is responsible for handling the low-level SCSI protocol. The scsi_transport(9F) routine hands a packet to the host bus adapter driver for transmission. The target driver has the responsibility to create a valid scsi_pkt(9S) structure.

Building a Command

The routine scsi_init_pkt(9F) allocates space for a SCSI CDB, allocates DMA resources if necessary, and sets the pkt_flags field, as shown in this example:

pkt = scsi_init_pkt(&sdp->sd_address, NULL, bp,
CDB_GROUP0, 1, 0, 0, SLEEP_FUNC, NULL);

This example creates a new packet along with allocating DMA resources as specified in the passed buf(9S) structure pointer. A SCSI CDB is allocated for a Group 0 (6-byte) command. The pkt_flags field is set to zero, but no space is allocated for the pkt_private field. This call to scsi_init_pkt(9F), because of the SLEEP_FUNC parameter, waits indefinitely for resources if no resources are currently available.

The next step is to initialize the SCSI CDB, using the scsi_setup_cdb(9F) function:

    if (scsi_setup_cdb((union scsi_cdb *)pkt->pkt_cdbp,
     SCMD_READ, bp->b_blkno, bp->b_bcount >> DEV_BSHIFT, 0) == 0)
     goto failed;

This example builds a Group 0 command descriptor block. The example fills in the pkt_cdbp field as follows:

The command itself is in byte 0. The command is set from the parameter SCMD_READ.
The address field is in bits 0-4 of byte 1 and bytes 2 and 3. The address is set from bp->b_blkno.
The count field is in byte 4. The count is set from the last parameter. In this case, count is set to bp->b_bcount >> DEV_BSHIFT, where DEV_BSHIFT is the byte count of the transfer converted to the number of blocks.

Note –

scsi_setup_cdb(9F) does not support setting a target device's logical unit number (LUN) in bits 5-7 of byte 1 of the SCSI command block. This requirement is defined by SCSI-1. For SCSI-1 devices that require the LUN bits set in the command block, use makecom_g0(9F) or some equivalent rather than scsi_setup_cdb(9F).

After initializing the SCSI CDB, initialize three other fields in the packet and store as a pointer to the packet in the state structure.

pkt->pkt_private = (opaque_t)bp;
pkt->pkt_comp = xxcallback;
pkt->pkt_time = 30;
xsp->pkt = pkt;

The buf(9S) pointer is saved in the pkt_private field for later use in the completion routine.

Setting Target Capabilities

The target drivers use scsi_ifsetcap(9F) to set the capabilities of the host adapter driver. A cap is a name-value pair, consisting of a null-terminated character string and an integer value. The current value of a capability can be retrieved using scsi_ifgetcap(9F). scsi_ifsetcap(9F) allows capabilities to be set for all targets on the bus.

In general, however, setting capabilities of targets that are not owned by the target driver is not recommended. This practice is not universally supported by HBA drivers. Some capabilities, such as disconnect and synchronous, can be set by default by the HBA driver. Other capabilities might need to be set explicitly by the target driver. Wide-xfer and tagged-queueing must be set by the target drive, for example.

Transporting a Command

After the scsi_pkt(9S) structure is filled in, use scsi_transport(9F) to hand the structure to the bus adapter driver:

    if (scsi_transport(pkt) != TRAN_ACCEPT) {
     bp->b_resid = bp->b_bcount;
     bioerror(bp, EIO);
     biodone(bp);
     }

The other return values from scsi_transport(9F) are as follows:

TRAN_BUSY – A command for the specified target is already in progress.
TRAN_BADPKT – The DMA count in the packet was too large, or the host adapter driver rejected this packet for other reasons.

TRAN_FATAL_ERROR – The host adapter driver is unable to accept this packet.

Note –

The mutex sd_mutex in the scsi_device(9S) structure must not be held across a call to scsi_transport(9F).

If scsi_transport(9F) returns TRAN_ACCEPT, the packet becomes the responsibility of the host bus adapter driver. The packet should not be accessed by the target driver until the command completion routine is called.

Synchronous `scsi_transport()` Function

If FLAG_NOINTR is set in the packet, then scsi_transport(9F) does not return until the command is complete. No callback is performed.

Note –

Do not use FLAG_NOINTR in interrupt context.

Command Completion

When the host bus adapter driver is through with the command, the driver invokes the packet's completion callback routine. The driver then passes a pointer to the scsi_pkt(9S) structure as a parameter. After decoding the packet, the completion routine takes the appropriate action.

Example 17–5 presents a simple completion callback routine. This code checks for transport failures. In case of failure, the routine gives up rather than retrying the command. If the target is busy, extra code is required to resubmit the command at a later time.

If the command results in a check condition, the target driver needs to send a request sense command unless auto request sense has been enabled.

Otherwise, the command succeeded. At the end of processing for the command, the command destroys the packet and calls biodone(9F).

In the event of a transport error, such as a bus reset or parity problem, the target driver can resubmit the packet by using scsi_transport(9F). No values in the packet need to be changed prior to resubmitting.

The following example does not attempt to retry incomplete commands.

Note –

Normally, the target driver's callback function is called in interrupt context. Consequently, the callback function should never sleep.

Example 17–5 Completion Routine for a SCSI Driver

static void
xxcallback(struct scsi_pkt *pkt)
{
    struct buf        *bp;
    struct xxstate    *xsp;
    minor_t           instance;
    struct scsi_status *ssp;
    /*
     * Get a pointer to the buf(9S) structure for the command
     * and to the per-instance data structure.
     */
    bp = (struct buf *)pkt->pkt_private;
    instance = getminor(bp->b_edev);
    xsp = ddi_get_soft_state(statep, instance);
    /*
     * Figure out why this callback routine was called
     */
    if (pkt->pkt_reason != CMP_CMPLT) {
       bp->b_resid = bp->b_bcount;
       bioerror(bp, EIO);
       scsi_destroy_pkt(pkt);          /* Release resources */
       biodone(bp);                    /* Notify waiting threads */ ;
    } else {
       /*
        * Command completed, check status.
        * See scsi_status(9S)
        */
       ssp = (struct scsi_status *)pkt->pkt_scbp;
       if (ssp->sts_busy) {
           /* error, target busy or reserved */
       } else if (ssp->sts_chk) {
           /* Send a request sense command. */
       } else {
        bp->b_resid = pkt->pkt_resid;  /* Packet completed OK */
        scsi_destroy_pkt(pkt);
        biodone(bp);
       }
    }
}

Reuse of Packets

A target driver can reuse packets in the following ways:

Resubmit the packet unchanged.
Use scsi_sync_pkt(9F) to synchronize the data. Then, process the data in the driver. Finally, resubmit the packet.
Free DMA resources, using scsi_dmafree(9F), and pass the pkt pointer to scsi_init_pkt(9F) for binding to a new bp. The target driver is responsible for reinitializing the packet. The CDB has to have the same length as the previous CDB.
If only partial DMA is allocated during the first call to scsi_init_pkt(9F), subsequent calls to scsi_init_pkt(9F) can be made for the same packet. Calls can be made to bp as well to adjust the DMA resources to the next portion of the transfer.

Auto-Request Sense Mode

Auto-request sense mode is most desirable if queuing is used, whether the queuing is tagged or untagged. A contingent allegiance condition is cleared by any subsequent command and, consequently, the sense data is lost. Most HBA drivers start the next command before performing the target driver callback. Other HBA drivers can use a separate, lower-priority thread to perform the callbacks. This approach might increase the time needed to notify the target driver that the packet completed with a check condition. In this case, the target driver might not be able to submit a request sense command in time to retrieve the sense data.

To avoid this loss of sense data, the HBA driver, or controller, should issue a request sense command if a check condition has been detected. This mode is known as auto-request sense mode. Note that not all HBA drivers are capable of auto-request sense mode, and some drivers can only operate with auto-request sense mode enabled.

A target driver enables auto-request-sense mode by using scsi_ifsetcap(9F). The following example shows auto-request sense enabling.

Example 17–6 Enabling Auto-Request Sense Mode

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
    struct xxstate *xsp;
    struct scsi_device *sdp = (struct scsi_device *)
    ddi_get_driver_private(dip);
   /*
    * Enable auto-request-sense; an auto-request-sense cmd might
    * fail due to a BUSY condition or transport error. Therefore,
    * it is recommended to allocate a separate request sense
    * packet as well.
    * Note that scsi_ifsetcap(9F) can return -1, 0, or 1
    */
    xsp->sdp_arq_enabled =
    ((scsi_ifsetcap(ROUTE, “auto-rqsense”, 1, 1) == 1) ? 1 :
0);
   /*
    * If the HBA driver supports auto request sense then the
    * status blocks should be sizeof (struct scsi_arq_status);
    */
else
   /*
    * One byte is sufficient
    */
    xsp->sdp_cmd_stat_size =  (xsp->sdp_arq_enabled ?
    sizeof (struct scsi_arq_status) : 1);
   /* ... */
}

If a packet is allocated using scsi_init_pkt(9F) and auto-request sense is desired on this packet, additional space is needed. The target driver must request this space for the status block to hold the auto-request sense structure. The sense length used in the request sense command is sizeof, from struct scsi_extended_sense. Auto-request sense can be disabled per individual packet by allocating sizeof, from struct scsi_status, for the status block.

The packet is submitted using scsi_transport(9F) as usual. When a check condition occurs on this packet, the host adapter driver takes the following steps:

Issues a request sense command if the controller does not have auto-request sense capability
Obtains the sense data
Fills in the scsi_arq_status information in the packet's status block
Sets STATE_ARQ_DONE in the packet's pkt_state field
Calls the packet's callback handler (pkt_comp())

The target driver's callback routine should verify that sense data is available by checking the STATE_ARQ_DONE bit in pkt_state. STATE_ARQ_DONE implies that a check condition has occurred and that a request sense has been performed. If auto-request sense has been temporarily disabled in a packet, subsequent retrieval of the sense data cannot be guaranteed.

The target driver should then verify whether the auto-request sense command completed successfully and decode the sense data.

Dump Handling

The dump(9E) entry point copies a portion of virtual address space directly to the specified device in the case of system failure or checkpoint operation. See the cpr(7) and dump(9E) man pages. The dump(9E) entry point must be capable of performing this operation without the use of interrupts.

The arguments for dump() are as follows:

dev: Device number of the dump device
addr: Kernel virtual address at which to start the dump
blkno: First destination block on the device
nblk: Number of blocks to dump

Example 17–7 dump(9E) Routine

static int
xxdump(dev_t dev, caddr_t addr, daddr_t blkno, int nblk)
{
    struct xxstate     *xsp;
    struct buf         *bp;
    struct scsi_pkt    *pkt;
    int    rval;
    int    instance;

    instance = getminor(dev);
    xsp = ddi_get_soft_state(statep, instance);

    if (tgt->suspended) {
    (void) pm_raise_power(DEVINFO(tgt), 0, 1);
    }

    bp = getrbuf(KM_NOSLEEP);
    if (bp == NULL) {
    return (EIO);
    }

/* Calculate block number relative to partition. */
    
bp->b_un.b_addr = addr;
    bp->b_edev = dev;
    bp->b_bcount = nblk * DEV_BSIZE;
    bp->b_flags = B_WRITE | B_BUSY;
    bp->b_blkno = blkno;

    pkt = scsi_init_pkt(ROUTE(tgt), NULL, bp, CDB_GROUP1,
    sizeof (struct scsi_arq_status),
    sizeof (struct bst_pkt_private), 0, NULL_FUNC, NULL);
    if (pkt == NULL) {
    freerbuf(bp);
    return (EIO);
    }
    (void) scsi_setup_cdb((union scsi_cdb *)pkt->pkt_cdbp,
        SCMD_WRITE_G1, blkno, nblk, 0);
    /*
     * While dumping in polled mode, other cmds might complete
     * and these should not be resubmitted. we set the
     * dumping flag here which prevents requeueing cmds.
     */
    tgt->dumping = 1;
    rval = scsi_poll(pkt);
    tgt->dumping = 0;

    scsi_destroy_pkt(pkt);
    freerbuf(bp);

    if (rval != DDI_SUCCESS) {
    rval = EIO;
    }

    return (rval);
}

SCSI Options

SCSA defines a global variable, scsi_options, for control and debugging. The defined bits in scsi_options can be found in the file <sys/scsi/conf/autoconf.h>. The scsi_options uses the bits as follows:

SCSI_OPTIONS_DR: Enables global disconnect or reconnect.
SCSI_OPTIONS_FAST: Enables global FAST SCSI support: 10 Mbytes/sec transfers. The HBA should not operate in FAST SCSI mode unless the SCSI_OPTIONS_FAST (0x100) bit is set.
SCSI_OPTIONS_FAST20: Enables global FAST20 SCSI support: 20 Mbytes/sec transfers. The HBA should not operate in FAST20 SCSI mode unless the SCSI_OPTIONS_FAST20 (0x400) bit is set.
SCSI_OPTIONS_FAST40: Enables global FAST40 SCSI support: 40 Mbytes/sec transfers. The HBA should not operate in FAST40 SCSI mode unless the SCSI_OPTIONS_FAST40 (0x800) bit is set.
SCSI_OPTIONS_FAST80: Enables global FAST80 SCSI support: 80 Mbytes/sec transfers. The HBA should not operate in FAST80 SCSI mode unless the SCSI_OPTIONS_FAST80 (0x1000) bit is set.
SCSI_OPTIONS_FAST160: Enables global FAST160 SCSI support: 160 Mbytes/sec transfers. The HBA should not operate in FAST160 SCSI mode unless the SCSI_OPTIONS_FAST160 (0x2000) bit is set.
SCSI_OPTIONS_FAST320: Enables global FAST320 SCSI support: 320 Mbytes/sec transfers. The HBA should not operate in FAST320 SCSI mode unless the SCSI_OPTIONS_FAST320 (0x4000) bit is set.
SCSI_OPTIONS_LINK: Enables global link support.
SCSI_OPTIONS_PARITY: Enables global parity support.
SCSI_OPTIONS_QAS: Enables the Quick Arbitration Select feature. QAS is used to decrease protocol overhead when devices arbitrate for and access the bus. QAS is only supported on Ultra4 (FAST160) SCSI devices, although not all such devices support QAS. The HBA should not operate in QAS SCSI mode unless the SCSI_OPTIONS_QAS (0x100000) bit is set. Consult the appropriate Sun hardware documentation to determine whether your machine supports QAS.
SCSI_OPTIONS_SYNC: Enables global synchronous transfer capability.
SCSI_OPTIONS_TAG: Enables global tagged queuing support.
SCSI_OPTIONS_WIDE: Enables global WIDE SCSI.

Note –

The setting of scsi_options affects all host bus adapter drivers and all target drivers that are present on the system. Refer to the scsi_hba_attach(9F) man page for information on controlling these options for a particular host adapter.

Chapter 18 SCSI Host Bus Adapter Drivers

This chapter contains information on creating SCSI host bus adapter (HBA) drivers. The chapter provides sample code illustrating the structure of a typical HBA driver. The sample code shows the use of the HBA driver interfaces that are provided by the Sun Common SCSI Architecture (SCSA). This chapter provides information on the following subjects:

Introduction to Host Bus Adapter Drivers

As described in Chapter 17, SCSI Target Drivers, the DDI/DKI divides the software interface to SCSI devices into two major parts:

Target devices and drivers
Host bus adapter devices and drivers

Target device refers to a device on a SCSI bus, such as a disk or a tape drive. Target driver refers to a software component installed as a device driver. Each target device on a SCSI bus is controlled by one instance of the target driver.

Host bus adapter device refers to HBA hardware, such as an SBus or PCI SCSI adapter card. Host bus adapter driver refers to a software component that is installed as a device driver. Some examples are the esp driver on a SPARC machine, the ncrs driver on an x86 machine, and the isp driver, which works on both architectures. An instance of the HBA driver controls each of its host bus adapter devices that are configured in the system.

The Sun Common SCSI Architecture (SCSA) defines the interface between the target and HBA components.

Note –

Understanding SCSI target drivers is an essential prerequisite to writing effective SCSI HBA drivers. For information on SCSI target drivers, see Chapter 17, SCSI Target Drivers. Target driver developers can also benefit from reading this chapter.

The host bus adapter driver is responsible for performing the following tasks:

Managing host bus adapter hardware
Accepting SCSI commands from the SCSI target driver
Transporting the commands to the specified SCSI target device
Performing any data transfers that the command requires
Collecting status
Handling auto-request sense (optional)
Informing the target driver of command completion or failure

SCSI Interface

SCSA is the DDI/DKI programming interface for the transmission of SCSI commands from a target driver to a host adapter driver. By conforming to the SCSA, the target driver can easily pass any combination of SCSI commands and sequences to a target device. Knowledge of the hardware implementation of the host adapter is not necessary. Conceptually, SCSA separates the building of a SCSI command from the transporting of the command with data to the SCSI bus. SCSA manages the connections between the target and HBA drivers through an HBA transportlayer, as shown in the following figure.

Figure 18–1 SCSA Interface

Diagram shows the host bus adapter transport layer between
a target driver and SCSI devices.

The HBA transport layer is a software and hardware layer that is responsible for transporting a SCSI command to a SCSI target device. The HBA driver provides resource allocation, DMA management, and transport services in response to requests made by SCSI target drivers through SCSA. The host adapter driver also manages the host adapter hardware and the SCSI protocols necessary to perform the commands. When a command has been completed, the HBA driver calls the target driver's SCSI pkt command completion routine.

The following example illustrates this flow, with emphasis on the transfer of information from target drivers to SCSA to HBA drivers. The figure also shows typical transport entry points and function calls.

Figure 18–2 Transport Layer Flow

Diagram shows how commands flow through the HBA transport
layer.

SCSA HBA Interfaces

SCSA HBA interfaces include HBA entry points, HBA data structures, and an HBA framework.

SCSA HBA Entry Point Summary

SCSA defines a number of HBA driver entry points. These entry points are listed in the following table. The entry points are called by the system when a target driver instance connected to the HBA driver is configured. The entry points are also called when the target driver makes a SCSA request. See Entry Points for SCSA HBA Drivers for more information.

Table 18–1 SCSA HBA Entry Point Summary


Function Name	Called as a Result of
tran_abort(9E)	Target driver calling scsi_abort(9F)
tran_bus_reset(9E)	System resetting bus
tran_destroy_pkt(9E)	Target driver calling scsi_destroy_pkt(9F)
tran_dmafree(9E)	Target driver calling scsi_dmafree(9F)
tran_getcap(9E)	Target driver calling scsi_ifgetcap(9F)
tran_init_pkt(9E)	Target driver calling scsi_init_pkt(9F)
tran_quiesce(9E)	System quiescing bus
tran_reset(9E)	Target driver calling `scsi_reset`(9F)
tran_reset_notify(9E)	Target driver calling scsi_reset_notify(9F)
tran_setcap(9E)	Target driver calling scsi_ifsetcap(9F)
tran_start(9E)	Target driver calling scsi_transport(9F)
tran_sync_pkt(9E)	Target driver calling scsi_sync_pkt(9F)
tran_tgt_free(9E)	System detaching target device instance
tran_tgt_init(9E)	System attaching target device instance
tran_tgt_probe(9E)	Target driver calling scsi_probe(9F)
tran_unquiesce(9E)	System resuming activity on bus

SCSA HBA Data Structures

SCSA defines data structures to enable the exchange of information between the target and HBA drivers. The following data structures are included:

`scsi_hba_tran()` Structure

Each instance of an HBA driver must allocate a scsi_hba_tran(9S) structure by using the scsi_hba_tran_alloc(9F) function in the attach(9E) entry point. The scsi_hba_tran_alloc() function initializes the scsi_hba_tran structure. The HBA driver must initialize specific vectors in the transport structure to point to entry points within the HBA driver. After the scsi_hba_tran structure is initialized, the HBA driver exports the transport structure to SCSA by calling the scsi_hba_attach_setup(9F) function.

Caution –

Because SCSA keeps a pointer to the transport structure in the driver-private field on the devinfo node, HBA drivers must not use ddi_set_driver_private(9F). HBA drivers can, however, use ddi_get_driver_private(9F) to retrieve the pointer to the transport structure.

The SCSA interfaces require the HBA driver to supply a number of entry points that are callable through the scsi_hba_tran structure. See Entry Points for SCSA HBA Drivers for more information.

The scsi_hba_tran structure contains the following fields:

struct scsi_hba_tran {
    dev_info_t          *tran_hba_dip;          /* HBAs dev_info pointer */
    void                *tran_hba_private;      /* HBA softstate */
    void                *tran_tgt_private;      /* HBA target private pointer */
    struct scsi_device  *tran_sd;               /* scsi_device */
    int                 (*tran_tgt_init)();     /* Transport target */
                                                /* Initialization */
    int                 (*tran_tgt_probe)();    /* Transport target probe */
    void                (*tran_tgt_free)();     /* Transport target free */
    int                 (*tran_start)();        /* Transport start */
    int                 (*tran_reset)();        /* Transport reset */
    int                 (*tran_abort)();        /* Transport abort */
    int                 (*tran_getcap)();       /* Capability retrieval */
    int                 (*tran_setcap)();       /* Capability establishment */
    struct scsi_pkt     *(*tran_init_pkt)();    /* Packet and DMA allocation */
    void                (*tran_destroy_pkt)();  /* Packet and DMA */
                                                /* Deallocation */
    void                (*tran_dmafree)();      /* DMA deallocation */
    void                (*tran_sync_pkt)();     /* Sync DMA */
    void                (*tran_reset_notify)(); /* Bus reset notification */
    int                 (*tran_bus_reset)();    /* Reset bus only */
    int                 (*tran_quiesce)();      /* Quiesce a bus */
    int                 (*tran_unquiesce)();    /* Unquiesce a bus */
    int                 tran_interconnect_type; /* transport interconnect */
};

The following descriptions give more information about these scsi_hba_tran structure fields:

tran_hba_dip: Pointer to the HBA device instance dev_info structure. The function scsi_hba_attach_setup(9F) sets this field.
tran_hba_private: Pointer to private data maintained by the HBA driver. Usually, tran_hba_private contains a pointer to the state structure of the HBA driver.
tran_tgt_private: Pointer to private data maintained by the HBA driver when using cloning. By specifying SCSI_HBA_TRAN_CLONE when calling scsi_hba_attach_setup(9F), the scsi_hba_tran(9S) structure is cloned once per target. This approach enables the HBA to initialize this field to point to a per-target instance data structure in the tran_tgt_init(9E) entry point. If SCSI_HBA_TRAN_CLONE is not specified, tran_tgt_private is NULL, and tran_tgt_private must not be referenced. See Transport Structure Cloning for more information.
tran_sd: Pointer to a per-target instance scsi_device(9S) structure used when cloning. If SCSI_HBA_TRAN_CLONE is passed to scsi_hba_attach_setup(9F), tran_sd is initialized to point to the per-target scsi_device structure. This initialization takes place before any HBA functions are called on behalf of that target. If SCSI_HBA_TRAN_CLONE is not specified, tran_sd is NULL, and tran_sd must not be referenced. See Transport Structure Cloning for more information.
tran_tgt_init: Pointer to the HBA driver entry point that is called when initializing a target device instance. If no per-target initialization is required, the HBA can leave tran_tgt_init set to NULL.
tran_tgt_probe: Pointer to the HBA driver entry point that is called when a target driver instance calls scsi_probe(9F). This routine is called to probe for the existence of a target device. If no target probing customization is required for this HBA, the HBA should set tran_tgt_probe to scsi_hba_probe(9F).
tran_tgt_free: Pointer to the HBA driver entry point that is called when a target device instance is destroyed. If no per-target deallocation is necessary, the HBA can leave tran_tgt_free set to NULL.
tran_start: Pointer to the HBA driver entry point that is called when a target driver calls scsi_transport(9F).
tran_reset: Pointer to the HBA driver entry point that is called when a target driver calls scsi_reset(9F).
tran_abort: Pointer to the HBA driver entry point that is called when a target driver calls scsi_abort(9F).
tran_getcap: Pointer to the HBA driver entry point that is called when a target driver calls scsi_ifgetcap(9F).
tran_setcap: Pointer to the HBA driver entry point that is called when a target driver calls scsi_ifsetcap(9F).
tran_init_pkt: Pointer to the HBA driver entry point that is called when a target driver calls scsi_init_pkt(9F).
tran_destroy_pkt: Pointer to the HBA driver entry point that is called when a target driver calls scsi_destroy_pkt(9F).
tran_dmafree: Pointer to the HBA driver entry point that is called when a target driver calls scsi_dmafree(9F).
tran_sync_pkt: Pointer to the HBA driver entry point that is called when a target driver calls scsi_sync_pkt(9F).
tran_reset_notify: Pointer to the HBA driver entry point that is called when a target driver calls tran_reset_notify(9E).
tran_bus_reset: The function entry that resets the SCSI bus without resetting targets.
tran_quiesce: The function entry that waits for all outstanding commands to complete and blocks (or queues) any I/O requests issued.
tran_unquiesce: The function entry that allows I/O activities to resume on the SCSI bus.
tran_interconnect_type: Integer value denoting interconnect type of the transport as defined in the services.h header file.

`scsi_address` Structure

The scsi_address(9S) structure provides transport and addressing information for each SCSI command that is allocated and transported by a target driver instance.

The scsi_address structure contains the following fields:

struct scsi_address {
    struct scsi_hba_tran    *a_hba_tran;    /* Transport vectors */
    ushort_t                a_target;       /* Target identifier */
    uchar_t                 a_lun;          /* LUN on that target */
    uchar_t                 a_sublun;       /* Sub LUN on that LUN */
                                            /* Not used */
};

a_hba_tran: Pointer to the scsi_hba_tran(9S) structure, as allocated and initialized by the HBA driver. If SCSI_HBA_TRAN_CLONE was specified as the flag to scsi_hba_attach_setup(9F), a_hba_tran points to a copy of that structure.
a_target: Identifies the SCSI target on the SCSI bus.
a_lun: Identifies the SCSI logical unit on the SCSI target.

`scsi_device` Structure

The HBA framework allocates and initializes a scsi_device(9S) structure for each instance of a target device. The allocation and initialization occur before the framework calls the HBA driver's tran_tgt_init(9E) entry point. This structure stores information about each SCSI logical unit, including pointers to information areas that contain both generic and device-specific information. One scsi_device(9S) structure exists for each target device instance that is attached to the system.

If the per-target initialization is successful, the HBA framework sets the target driver's per-instance private data to point to the scsi_device(9S) structure, using ddi_set_driver_private(9F). Note that an initialization is successful if tran_tgt_init() returns success or if the vector is null.

The scsi_device(9S) structure contains the following fields:

struct scsi_device {
    struct scsi_address           sd_address;    /* routing information */
    dev_info_t                    *sd_dev;       /* device dev_info node */
    kmutex_t                      sd_mutex;      /* mutex used by device */
    void                          *sd_reserved;
    struct scsi_inquiry           *sd_inq;
    struct scsi_extended_sense    *sd_sense;
    caddr_t                       sd_private;    /* for driver's use */
};

where:

sd_address: Data structure that is passed to the routines for SCSI resource allocation.
sd_dev: Pointer to the target's dev_info structure.
sd_mutex: Mutex for use by the target driver. This mutex is initialized by the HBA framework. The mutex can be used by the target driver as a per-device mutex. This mutex should not be held across a call to scsi_transport(9F) or scsi_poll(9F). See Chapter 3, Multithreading for more information on mutexes.
sd_inq: Pointer for the target device's SCSI inquiry data. The scsi_probe(9F) routine allocates a buffer, fills the buffer in, and attaches the buffer to this field.
sd_sense: Pointer to a buffer to contain request sense data from the device. The target driver must allocate and manage this buffer itself. See the target driver's attach(9E) routine in attach() Entry Point for more information.
sd_private: Pointer field for use by the target driver. This field is commonly used to store a pointer to a private target driver state structure.

`scsi_pkt` Structure (HBA)

To execute SCSI commands, a target driver must first allocate a scsi_pkt(9S) structure for the command. The target driver must then specify its own private data area length, the command status, and the command length. The HBA driver is responsible for implementing the packet allocation in the tran_init_pkt(9E) entry point. The HBA driver is also responsible for freeing the packet in its tran_destroy_pkt(9E) entry point. See scsi_pkt Structure (Target Drivers) for more information.

The scsi_pkt(9S) structure contains these fields:

struct scsi_pkt {
    opaque_t pkt_ha_private;             /* private data for host adapter */
    struct scsi_address pkt_address;     /* destination address */
    opaque_t pkt_private;                /* private data for target driver */
    void (*pkt_comp)(struct scsi_pkt *); /* completion routine */
    uint_t  pkt_flags;                   /* flags */
    int     pkt_time;                    /* time allotted to complete command */
    uchar_t *pkt_scbp;                   /* pointer to status block */
    uchar_t *pkt_cdbp;                   /* pointer to command block */
    ssize_t pkt_resid;                   /* data bytes not transferred */
    uint_t  pkt_state;                   /* state of command */
    uint_t  pkt_statistics;              /* statistics */
    uchar_t pkt_reason;                  /* reason completion called */
};

where:

pkt_ha_private: Pointer to per-command HBA-driver private data.
pkt_address: Pointer to the scsi_address(9S) structure providing address information for this command.
pkt_private: Pointer to per-packet target-driver private data.
pkt_comp: Pointer to the target-driver completion routine called by the HBA driver when the transport layer has completed this command.
pkt_flags: Flags for the command.
pkt_time: Specifies the completion timeout in seconds for the command.
pkt_scbp: Pointer to the status completion block for the command.
pkt_cdbp: Pointer to the command descriptor block (CDB) for the command.
pkt_resid: Count of the data bytes that were not transferred when the command completed. This field can also be used to specify the amount of data for which resources have not been allocated. The HBA must modify this field during transport.
pkt_state: State of the command. The HBA must modify this field during transport.
pkt_statistics: Provides a history of the events that the command experienced while in the transport layer. The HBA must modify this field during transport.
pkt_reason: Reason for command completion. The HBA must modify this field during transport.

Per-Target Instance Data

An HBA driver must allocate a scsi_hba_tran(9S) structure during attach(9E). The HBA driver must then initialize the vectors in this transport structure to point to the required entry points for the HBA driver. This scsi_hba_tran structure is then passed into scsi_hba_attach_setup(9F).

The scsi_hba_tran structure contains a tran_hba_private field, which can be used to refer to the HBA driver's per-instance state.

Each scsi_address(9S) structure contains a pointer to the scsi_hba_tran structure. In addition, the scsi_address structure provides the target, that is, a_target, and logical unit (a_lun) addresses for the particular target device. Each entry point for the HBA driver is passed a pointer to the scsi_address structure, either directly or indirectly through the scsi_device(9S) structure. As a result, the HBA driver can reference its own state. The HBA driver can also identify the target device that is addressed.

The following figure illustrates the HBA data structures for transport operations.

Figure 18–3 HBA Transport Structures

Diagram shows the relationships of structures involved
in the HBA transport layer.

Transport Structure Cloning

Cloning can be useful if an HBA driver needs to maintain per-target private data in the scsi_hba_tran(9S) structure. Cloning can also be used to maintain a more complex address than is provided in the scsi_address(9S) structure.

In the cloning process, the HBA driver must still allocate a scsi_hba_tran structure at attach(9E) time. The HBA driver must also initialize the tran_hba_private soft state pointer and the entry point vectors for the HBA driver. The difference occurs when the framework begins to connect an instance of a target driver to the HBA driver. Before calling the HBA driver's tran_tgt_init(9E) entry point, the framework clones the scsi_hba_tran structure that is associated with that instance of the HBA. Accordingly, each scsi_address structure that is allocated and initialized for a particular target device instance points to a per-target instance copy of the scsi_hba_tran structure. The scsi_address structures do not point to the scsi_hba_tran structure that is allocated by the HBA driver at attach() time.

An HBA driver can use two important pointers when cloning is specified. These pointers are contained in the scsi_hba_tran structure. The first pointer is the tran_tgt_private field, which the driver can use to point to per-target HBA private data. The tran_tgt_private pointer is useful, for example, if an HBA driver needs to maintain a more complex address than a_target and a_lun provide. The second pointer is the tran_sd field, which is a pointer to the scsi_device(9S) structure referring to the particular target device.

When specifying cloning, the HBA driver must allocate and initialize the per-target data. The HBA driver must then initialize the tran_tgt_private field to point to this data during its tran_tgt_init(9E) entry point. The HBA driver must free this per-target data during its tran_tgt_free(9E) entry point.

When cloning, the framework initializes the tran_sd field to point to the scsi_device structure before the HBA driver tran_tgt_init() entry point is called. The driver requests cloning by passing the SCSI_HBA_TRAN_CLONE flag to scsi_hba_attach_setup(9F). The following figure illustrates the HBA data structures for cloning transport operations.

Figure 18–4 Cloning Transport Operation

Diagram shows an example of cloned HBA structures.

SCSA HBA Functions

SCSA also provides a number of functions. The functions are listed in the following table, for use by HBA drivers.

Table 18–2 SCSA HBA Functions


Function Name	Called by Driver Entry Point
scsi_hba_init(9F)	_init(9E)
scsi_hba_fini(9F)	_fini(9E)
scsi_hba_attach_setup(9F)	attach(9E)
scsi_hba_detach(9F)	detach(9E)
scsi_hba_tran_alloc(9F)	attach(9E)
scsi_hba_tran_free(9F)	detach(9E)
scsi_hba_probe(9F)	tran_tgt_probe(9E)
scsi_hba_pkt_alloc(9F)	tran_init_pkt(9E)
scsi_hba_pkt_free(9F)	tran_destroy_pkt(9E)
scsi_hba_lookup_capstr(9F)	tran_getcap(9E) and tran_setcap(9E)

HBA Driver Dependency and Configuration Issues

In addition to incorporating SCSA HBA entry points, structures, and functions into a driver, a developer must deal with driver dependency and configuration issues. These issues involve configuration properties, dependency declarations, state structure and per-command structure, entry points for module initialization, and autoconfiguration entry points.

Declarations and Structures

HBA drivers must include the following header files:

#include <sys/scsi/scsi.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

To inform the system that the module depends on SCSA routines, the driver binary must be generated with the following command. See SCSA HBA Interfaces for more information on SCSA routines.

% ld -r xx.o -o xx -N "misc/scsi"

The code samples are derived from a simplified isp driver for the QLogic Intelligent SCSI Peripheral device. The isp driver supports WIDE SCSI, with up to 15 target devices and 8 logical units (LUNs) per target.

Per-Command Structure

An HBA driver usually needs to define a structure to maintain state for each command submitted by a target driver. The layout of this per-command structure is entirely up to the device driver writer. The layout needs to reflect the capabilities and features of the hardware and the software algorithms that are used in the driver.

The following structure is an example of a per-command structure. The remaining code fragments of this chapter use this structure to illustrate the HBA interfaces.

struct isp_cmd {
     struct isp_request     cmd_isp_request;
     struct isp_response    cmd_isp_response;
     struct scsi_pkt        *cmd_pkt;
     struct isp_cmd         *cmd_forw;
     uint32_t               cmd_dmacount;
     ddi_dma_handle_t       cmd_dmahandle;
     uint_t                 cmd_cookie;
     uint_t                 cmd_ncookies;
     uint_t                 cmd_cookiecnt;
     uint_t                 cmd_nwin;
     uint_t                 cmd_curwin;
     off_t                  cmd_dma_offset;
     uint_t                 cmd_dma_len;
     ddi_dma_cookie_t       cmd_dmacookies[ISP_NDATASEGS];
     u_int                  cmd_flags;
     u_short                cmd_slot;
     u_int                  cmd_cdblen;
     u_int                  cmd_scblen;
 };

Entry Points for Module Initialization

This section describes the entry points for operations that are performed by SCSI HBA drivers.

The following code for a SCSI HBA driver illustrates a representative dev_ops(9S) structure. The driver must initialize the devo_bus_ops field in this structure to NULL. A SCSI HBA driver can provide leaf driver interfaces for special purposes, in which case the devo_cb_ops field might point to a cb_ops(9S) structure. In this example, no leaf driver interfaces are exported, so the devo_cb_ops field is initialized to NULL.

`_init()` Entry Point (SCSI HBA Drivers)

The _init(9E) function initializes a loadable module. _init() is called before any other routine in the loadable module.

In a SCSI HBA, the _init() function must call scsi_hba_init(9F) to inform the framework of the existence of the HBA driver before calling mod_install(9F). If scsi_hba__init() returns a nonzero value,_init() should return this value. Otherwise, _init() must return the value returned by mod_install(9F).

The driver should initialize any required global state before calling mod_install(9F).

If mod_install() fails, the _init() function must free any global resources allocated. _init() must call scsi_hba_fini(9F) before returning.

The following example uses a global mutex to show how to allocate data that is global to all instances of a driver. The code declares global mutex and soft-state structure information. The global mutex and soft state are initialized during _init().

`_fini()` Entry Point (SCSI HBA Drivers)

The _fini(9E) function is called when the system is about to try to unload the SCSI HBA driver. The _fini() function must call mod_remove(9F) to determine whether the driver can be unloaded. If mod_remove() returns 0, the module can be unloaded. The HBA driver must deallocate any global resources allocated in _init(9E). The HBA driver must also call scsi_hba_fini(9F).

_fini() must return the value returned by mod_remove().

Note –

The HBA driver must not free any resources or call scsi_hba_fini(9F) unless mod_remove(9F) returns 0.

Example 18–1 shows module initialization for SCSI HBA.

Example 18–1 Module Initialization for SCSI HBA

static struct dev_ops isp_dev_ops = {
    DEVO_REV,       /* devo_rev */
    0,              /* refcnt  */
    isp_getinfo,    /* getinfo */
    nulldev,        /* probe */
    isp_attach,     /* attach */
    isp_detach,     /* detach */
    nodev,          /* reset */
    NULL,           /* driver operations */
    NULL,           /* bus operations */
    isp_power,      /* power management */
};

/*
 * Local static data
 */
static kmutex_t      isp_global_mutex;
static void          *isp_state;

int
_init(void)
{
    int     err;
    
    if ((err = ddi_soft_state_init(&isp_state,
        sizeof (struct isp), 0)) != 0) {
        return (err);
    }
    if ((err = scsi_hba_init(&modlinkage)) == 0) {
        mutex_init(&isp_global_mutex, "isp global mutex",
        MUTEX_DRIVER, NULL);
        if ((err = mod_install(&modlinkage)) != 0) {
            mutex_destroy(&isp_global_mutex);
            scsi_hba_fini(&modlinkage);
            ddi_soft_state_fini(&isp_state);    
        }
    }
    return (err);
}

int
_fini(void)
{
    int     err;
    
    if ((err = mod_remove(&modlinkage)) == 0) {
        mutex_destroy(&isp_global_mutex);
        scsi_hba_fini(&modlinkage);
        ddi_soft_state_fini(&isp_state);
    }
    return (err);
}

Autoconfiguration Entry Points

Associated with each device driver is a dev_ops(9S) structure, which enables the kernel to locate the autoconfiguration entry points of the driver. A complete description of these autoconfiguration routines is given in Chapter 6, Driver Autoconfiguration. This section describes only those entry points associated with operations performed by SCSI HBA drivers. These entry points include attach(9E) and detach(9E).

`attach()` Entry Point (SCSI HBA Drivers)

The attach(9E) entry point for a SCSI HBA driver performs several tasks when configuring and attaching an instance of the driver for the device. For a typical driver of real devices, the following operating system and hardware concerns must be addressed:

Soft-state structure
DMA
Transport structure
Attaching an HBA driver
Register mapping
Interrupt specification
Interrupt handling
Create power manageable components
Report attachment status

Soft-State Structure

When allocating the per-device-instance soft-state structure, a driver must clean up carefully if an error occurs.

DMA

The HBA driver must describe the attributes of its DMA engine by properly initializing the ddi_dma_attr_t structure.

static ddi_dma_attr_t isp_dma_attr = {
     DMA_ATTR_V0,        /* ddi_dma_attr version */
     0,                  /* low address */
     0xffffffff,         /* high address */
     0x00ffffff,         /* counter upper bound */
     1,                  /* alignment requirements */
     0x3f,               /* burst sizes */
     1,                  /* minimum DMA access */
     0xffffffff,         /* maximum DMA access */
     (1<<24)-1,          /* segment boundary restrictions */
     1,                  /* scatter-gather list length */
     512,                /* device granularity */
     0                   /* DMA flags */
};

The driver, if providing DMA, should also check that its hardware is installed in a DMA-capable slot:

if (ddi_slaveonly(dip) == DDI_SUCCESS) {
    return (DDI_FAILURE);
}

Transport Structure

The driver should further allocate and initialize a transport structure for this instance. The tran_hba_private field is set to point to this instance's soft-state structure. The tran_tgt_probe field can be set to NULL to achieve the default behavior, if no special probe customization is needed.

tran = scsi_hba_tran_alloc(dip, SCSI_HBA_CANSLEEP);

isp->isp_tran                   = tran;
isp->isp_dip                    = dip;

tran->tran_hba_private          = isp;
tran->tran_tgt_private          = NULL;
tran->tran_tgt_init             = isp_tran_tgt_init;
tran->tran_tgt_probe            = scsi_hba_probe;
tran->tran_tgt_free             = (void (*)())NULL;

tran->tran_start                = isp_scsi_start;
tran->tran_abort                = isp_scsi_abort;
tran->tran_reset                = isp_scsi_reset;
tran->tran_getcap               = isp_scsi_getcap;
tran->tran_setcap               = isp_scsi_setcap;
tran->tran_init_pkt             = isp_scsi_init_pkt;
tran->tran_destroy_pkt          = isp_scsi_destroy_pkt;
tran->tran_dmafree              = isp_scsi_dmafree;
tran->tran_sync_pkt             = isp_scsi_sync_pkt;
tran->tran_reset_notify         = isp_scsi_reset_notify;
tran->tran_bus_quiesce          = isp_tran_bus_quiesce
tran->tran_bus_unquiesce        = isp_tran_bus_unquiesce
tran->tran_bus_reset            = isp_tran_bus_reset
tran->tran_interconnect_type    = isp_tran_interconnect_type

Attaching an HBA Driver

The driver should attach this instance of the device, and perform error cleanup if necessary.

i = scsi_hba_attach_setup(dip, &isp_dma_attr, tran, 0);
if (i != DDI_SUCCESS) {
    /* do error recovery */
    return (DDI_FAILURE);
}

Register Mapping

The driver should map in its device's registers. The driver need to specify the following items:

Register set index
Data access characteristics of the device
Size of the register to be mapped

ddi_device_acc_attr_t    dev_attributes;

     dev_attributes.devacc_attr_version = DDI_DEVICE_ATTR_V0;
     dev_attributes.devacc_attr_dataorder = DDI_STRICTORDER_ACC;
     dev_attributes.devacc_attr_endian_flags = DDI_STRUCTURE_LE_ACC;

     if (ddi_regs_map_setup(dip, 0, (caddr_t *)&isp->isp_reg,
     0, sizeof (struct ispregs), &dev_attributes,
     &isp->isp_acc_handle) != DDI_SUCCESS) {
        /* do error recovery */
        return (DDI_FAILURE);
     }

Adding an Interrupt Handler

The driver must first obtain the iblock cookie to initialize any mutexes that are used in the driver handler. Only after those mutexes have been initialized can the interrupt handler be added.

i = ddi_get_iblock_cookie(dip, 0, &isp->iblock_cookie};
if (i != DDI_SUCCESS) {
    /* do error recovery */
    return (DDI_FAILURE);
}

mutex_init(&isp->mutex, "isp_mutex", MUTEX_DRIVER,
(void *)isp->iblock_cookie);
i = ddi_add_intr(dip, 0, &isp->iblock_cookie,
0, isp_intr, (caddr_t)isp);
if (i != DDI_SUCCESS) {
    /* do error recovery */
    return (DDI_FAILURE);
}

If a high-level handler is required, the driver should be coded to provide such a handler. Otherwise, the driver must be able to fail the attach. See Handling High-Level Interrupts for a description of high-level interrupt handling.

Create Power Manageable Components

With power management, if the host bus adapter only needs to power down when all target adapters are at power level 0, the HBA driver only needs to provide a power(9E) entry point. Refer to Chapter 12, Power Management. The HBA driver also needs to create a pm-components(9P) property that describes the components that the device implements.

Nothing more is necessary, since the components will default to idle, and the power management framework's default dependency processing will ensure that the host bus adapter will be powered up whenever an target adapter is powered up. Provided that automatic power management is enabled automatically, the processing will also power down the host bus adapter when all target adapters are powered down ().

Report Attachment Status

Finally, the driver should report that this instance of the device is attached and return success.

ddi_report_dev(dip);
    return (DDI_SUCCESS);

`detach()` Entry Point (SCSI HBA Drivers)

The driver should perform standard detach operations, including calling scsi_hba_detach(9F).

Entry Points for SCSA HBA Drivers

An HBA driver can work with target drivers through the SCSA interface. The SCSA interfaces require the HBA driver to supply a number of entry points that are callable through the scsi_hba_tran(9S) structure.

These entry points fall into five functional groups:

Target driver instance initialization
Resource allocation and deallocation
Command transport
Capability management
Abort and reset handling
Dynamic reconfiguration

The following table lists the entry points for SCSA HBA by function groups.

Table 18–3 SCSA Entry Points


Function Groups	Entry Points Within Group	Description
Target Driver Instance Initialization	tran_tgt_init(9E)	Performs per-target initialization (optional)
	tran_tgt_probe(9E)	Probes SCSI bus for existence of a target (optional)
	tran_tgt_free(9E)	Performs per-target deallocation (optional)
Resource Allocation	tran_init_pkt(9E)	Allocates SCSI packet and DMA resources
	tran_destroy_pkt(9E)	Frees SCSI packet and DMA resources
	tran_sync_pkt(9E)	Synchronizes memory before and after DMA
	tran_dmafree(9E)	Frees DMA resources
Command Transport	tran_start(9E)	Transports a SCSI command
Capability Management	tran_getcap(9E)	Inquires about a capability's value
	tran_setcap(9E)	Sets a capability's value
Abort and Reset	tran_abort(9E)	Aborts outstanding SCSI commands
	tran_reset(9E)	Resets a target device or the SCSI bus
	tran_bus_reset(9E)	Resets the SCSI bus
	tran_reset_notify(9E)	Request to notify target of bus reset (optional)
Dynamic Reconfiguration	tran_quiesce(9E)	Stops activity on the bus
	tran_unquiesce(9E)	Resumes activity on the bus

Target Driver Instance Initialization

The following sections describe target entry points.

`tran_tgt_init()` Entry Point

The tran_tgt_init(9E) entry point enables the HBA to allocate and initialize any per-target resources. tran_tgt_init() also enables the HBA to qualify the device's address as valid and supportable for that particular HBA. By returning DDI_FAILURE, the instance of the target driver for that device is not probed or attached.

tran_tgt_init() is not required. If tran_tgt_init() is not supplied, the framework attempts to probe and attach all possible instances of the appropriate target drivers.

static int
isp_tran_tgt_init(
    dev_info_t            *hba_dip,
    dev_info_t            *tgt_dip,
    scsi_hba_tran_t       *tran,
    struct scsi_device    *sd)
{
    return ((sd->sd_address.a_target < N_ISP_TARGETS_WIDE &&
        sd->sd_address.a_lun < 8) ? DDI_SUCCESS : DDI_FAILURE);
}

`tran_tgt_probe()` Entry Point

The tran_tgt_probe(9E) entry point enables the HBA to customize the operation of scsi_probe(9F), if necessary. This entry point is called only when the target driver calls scsi_probe().

The HBA driver can retain the normal operation of scsi_probe() by calling scsi_hba_probe(9F) and returning its return value.

This entry point is not required, and if not needed, the HBA driver should set the tran_tgt_probe vector in the scsi_hba_tran(9S) structure to point to scsi_hba_probe().

scsi_probe() allocates a scsi_inquiry(9S) structure and sets the sd_inq field of the scsi_device(9S) structure to point to the data in scsi_inquiry. scsi_hba_probe() handles this task automatically. scsi_unprobe(9F) then frees the scsi_inquiry data.

Except for the allocation of scsi_inquiry data, tran_tgt_probe() must be stateless, because the same SCSI device might call tran_tgt_probe() several times. Normally, allocation of scsi_inquiry data is handled by scsi_hba_probe().

Note –

The allocation of the scsi_inquiry(9S) structure is handled automatically by scsi_hba_probe(). This information is only of concern if you want custom scsi_probe() handling.

static int
isp_tran_tgt_probe(
    struct scsi_device    *sd,
    int                   (*callback)())
{
    /*
     * Perform any special probe customization needed.
     * Normal probe handling.
     */
    return (scsi_hba_probe(sd, callback));
}

`tran_tgt_free()` Entry Point

The tran_tgt_free(9E) entry point enables the HBA to perform any deallocation or clean-up procedures for an instance of a target. This entry point is optional.

static void
isp_tran_tgt_free(
    dev_info_t            *hba_dip,
    dev_info_t            *tgt_dip,
    scsi_hba_tran_t       *hba_tran,
    struct scsi_device    *sd)
{
    /*
     * Undo any special per-target initialization done
     * earlier in tran_tgt_init(9F) and tran_tgt_probe(9F)
     */
}

Resource Allocation

The following sections discuss resource allocation.

`tran_init_pkt()` Entry Point

The tran_init_pkt(9E) entry point allocates and initializes a scsi_pkt(9S) structure and DMA resources for a target driver request.

The tran_init_pkt(9E) entry point is called when the target driver calls the SCSA function scsi_init_pkt(9F).

Each call of the tran_init_pkt(9E) entry point is a request to perform one or more of three possible services:

Allocation and initialization of a scsi_pkt(9S) structure
Allocation of DMA resources for data transfer
Reallocation of DMA resources for the next portion of the data transfer

Allocation and Initialization of a scsi_pkt(9S) Structure

The tran_init_pkt(9E) entry point must allocate a scsi_pkt(9S) structure through scsi_hba_pkt_alloc(9F) if pkt is NULL.

scsi_hba_pkt_alloc(9F) allocates space for the following items:

scsi_pkt(9S)
SCSI CDB of length cmdlen
Completion area for SCSI status of length statuslen
Per-packet target driver private data area of length tgtlen
Per-packet HBA driver private data area of length hbalen

The scsi_pkt(9S) structure members, including pkt, must be initialized to zero except for the following members:

pkt_scbp – Status completion
pkt_cdbp – CDB
pkt_ha_private – HBA driver private data
pkt_private – Target driver private data

These members are pointers to memory space where the values of the fields are stored, as shown in the following figure. For more information, refer to scsi_pkt Structure (HBA).

Figure 18–5 scsi_pkt(9S) Structure Pointers

Diagram shows the scsi_pkt structure with those members
that point to values rather than being initialized to zero.

The following example shows allocation and initialization of a scsi_pkt structure.

Example 18–2 HBA Driver Initialization of a SCSI Packet Structure

static struct scsi_pkt                 *
isp_scsi_init_pkt(
    struct scsi_address    *ap,
    struct scsi_pkt        *pkt,
    struct buf             *bp,
    int                    cmdlen,
    int                    statuslen,
    int                    tgtlen,
    int                    flags,
    int                    (*callback)(),
    caddr_t                arg)
{
    struct isp_cmd         *sp;
    struct isp             *isp;
    struct scsi_pkt        *new_pkt;

    ASSERT(callback == NULL_FUNC || callback == SLEEP_FUNC);

    isp = (struct isp *)ap->a_hba_tran->tran_hba_private;
   /*
    * First step of isp_scsi_init_pkt:  pkt allocation
    */
    if (pkt == NULL) {
        pkt = scsi_hba_pkt_alloc(isp->isp_dip, ap, cmdlen,
            statuslen, tgtlen, sizeof (struct isp_cmd),
            callback, arg);
        if (pkt == NULL) {
            return (NULL);
       }

       sp = (struct isp_cmd *)pkt->pkt_ha_private;
      /*
       * Initialize the new pkt
       */
       sp->cmd_pkt         = pkt;
       sp->cmd_flags       = 0;
       sp->cmd_scblen      = statuslen;
       sp->cmd_cdblen      = cmdlen;
       sp->cmd_dmahandle   = NULL;
       sp->cmd_ncookies    = 0;
       sp->cmd_cookie      = 0; 
       sp->cmd_cookiecnt   = 0;
       sp->cmd_nwin        = 0;
       pkt->pkt_address    = *ap;
       pkt->pkt_comp       = (void (*)())NULL;
       pkt->pkt_flags      = 0;
       pkt->pkt_time       = 0;
       pkt->pkt_resid      = 0;
       pkt->pkt_statistics = 0;
       pkt->pkt_reason     = 0;

       new_pkt = pkt;
    } else {
       sp = (struct isp_cmd *)pkt->pkt_ha_private;
       new_pkt = NULL;
    }
   /*
    * Second step of isp_scsi_init_pkt:  dma allocation/move
    */
    if (bp && bp->b_bcount != 0) {
        if (sp->cmd_dmahandle == NULL) {
            if (isp_i_dma_alloc(isp, pkt, bp,
            flags, callback) == 0) {
            if (new_pkt) {
                scsi_hba_pkt_free(ap, new_pkt);
            }
            return ((struct scsi_pkt *)NULL);
        }
        } else {
            ASSERT(new_pkt == NULL);
            if (isp_i_dma_move(isp, pkt, bp) == 0) {
                return ((struct scsi_pkt *)NULL);
            }
        }
    }
    return (pkt);
}

Allocation of DMA Resources

The tran_init_pkt(9E) entry point must allocate DMA resources for a data transfer if the following conditions are true:

bp is not null.
bp->b_bcount is not zero.
DMA resources have not yet been allocated for this scsi_pkt(9S).

The HBA driver needs to track how DMA resources are allocated for a particular command. This allocation can take place with a flag bit or a DMA handle in the per-packet HBA driver private data.

The PKT_DMA_PARTIAL flag in the pkt enables the target driver to break up a data transfer into multiple SCSI commands to accommodate the complete request. This approach is useful when the HBA hardware scatter-gather capabilities or system DMA resources cannot complete a request in a single SCSI command.

The PKT_DMA_PARTIAL flag enables the HBA driver to set the DDI_DMA_PARTIAL flag. The DDI_DMA_PARTIAL flag is useful when the DMA resources for this SCSI command are allocated. For example the ddi_dma_buf_bind_handle(9F)) command can be used to allocate DMA resources. The DMA attributes used when allocating the DMA resources should accurately describe any constraints placed on the ability of the HBA hardware to perform DMA. If the system can only allocate DMA resources for part of the request, ddi_dma_buf_bind_handle(9F) returns DDI_DMA_PARTIAL_MAP.

The tran_init_pkt(9E) entry point must return the amount of DMA resources not allocated for this transfer in the field pkt_resid.

A target driver can make one request to tran_init_pkt(9E) to simultaneously allocate both a scsi_pkt(9S) structure and DMA resources for that pkt. In this case, if the HBA driver is unable to allocate DMA resources, that driver must free the allocated scsi_pkt(9S) before returning. The scsi_pkt(9S) must be freed by calling scsi_hba_pkt_free(9F).

The target driver might first allocate the scsi_pkt(9S) and allocate DMA resources for this pkt at a later time. In this case, if the HBA driver is unable to allocate DMA resources, the driver must not free pkt. The target driver in this case is responsible for freeing the pkt.

Example 18–3 HBA Driver Allocation of DMA Resources

static int
isp_i_dma_alloc(
    struct isp         *isp,
    struct scsi_pkt    *pkt,
    struct buf         *bp,
    int                flags,
    int                (*callback)())
{
    struct isp_cmd     *sp  = (struct isp_cmd *)pkt->pkt_ha_private;
    int                dma_flags;
    ddi_dma_attr_t     tmp_dma_attr;
    int                (*cb)(caddr_t);
    int                i;

    ASSERT(callback == NULL_FUNC || callback == SLEEP_FUNC);

    if (bp->b_flags & B_READ) {
        sp->cmd_flags &= ~CFLAG_DMASEND;
        dma_flags = DDI_DMA_READ;
    } else {
        sp->cmd_flags |= CFLAG_DMASEND;
        dma_flags = DDI_DMA_WRITE;
    }
    if (flags & PKT_CONSISTENT) {
        sp->cmd_flags |= CFLAG_CMDIOPB;
        dma_flags |= DDI_DMA_CONSISTENT;
    }
    if (flags & PKT_DMA_PARTIAL) {
        dma_flags |= DDI_DMA_PARTIAL;
    }

    tmp_dma_attr = isp_dma_attr;
    tmp_dma_attr.dma_attr_burstsizes = isp->isp_burst_size;

    cb = (callback == NULL_FUNC) ? DDI_DMA_DONTWAIT :
DDI_DMA_SLEEP;

    if ((i = ddi_dma_alloc_handle(isp->isp_dip, &tmp_dma_attr,
      cb, 0, &sp->cmd_dmahandle)) != DDI_SUCCESS) {
        switch (i) {
        case DDI_DMA_BADATTR:
            bioerror(bp, EFAULT);
            return (0);
        case DDI_DMA_NORESOURCES:
            bioerror(bp, 0);
            return (0);
        }
    }

    i = ddi_dma_buf_bind_handle(sp->cmd_dmahandle, bp, dma_flags,
    cb, 0, &sp->cmd_dmacookies[0], &sp->cmd_ncookies);

    switch (i) {
    case DDI_DMA_PARTIAL_MAP:
    if (ddi_dma_numwin(sp->cmd_dmahandle, &sp->cmd_nwin) ==
            DDI_FAILURE) {
        cmn_err(CE_PANIC, "ddi_dma_numwin() failed\n");
    }

    if (ddi_dma_getwin(sp->cmd_dmahandle, sp->cmd_curwin,
        &sp->cmd_dma_offset, &sp->cmd_dma_len,
        &sp->cmd_dmacookies[0], &sp->cmd_ncookies) ==
           DDI_FAILURE) {
        cmn_err(CE_PANIC, "ddi_dma_getwin() failed\n");
    }
    goto get_dma_cookies;

    case DDI_DMA_MAPPED:
    sp->cmd_nwin = 1;
    sp->cmd_dma_len = 0;
    sp->cmd_dma_offset = 0;

get_dma_cookies:
    i = 0;
    sp->cmd_dmacount = 0;
    for (;;) {
        sp->cmd_dmacount += sp->cmd_dmacookies[i++].dmac_size;

        if (i == ISP_NDATASEGS || i == sp->cmd_ncookies)
        break;
        ddi_dma_nextcookie(sp->cmd_dmahandle,
        &sp->cmd_dmacookies[i]);
    }
    sp->cmd_cookie = i;
    sp->cmd_cookiecnt = i;

    sp->cmd_flags |= CFLAG_DMAVALID;
    pkt->pkt_resid = bp->b_bcount - sp->cmd_dmacount;
    return (1);

    case DDI_DMA_NORESOURCES:
    bioerror(bp, 0);
    break;

    case DDI_DMA_NOMAPPING:
    bioerror(bp, EFAULT);
    break;

    case DDI_DMA_TOOBIG:
    bioerror(bp, EINVAL);
    break;

    case DDI_DMA_INUSE:
    cmn_err(CE_PANIC, "ddi_dma_buf_bind_handle:"
        " DDI_DMA_INUSE impossible\n");

    default:
    cmn_err(CE_PANIC, "ddi_dma_buf_bind_handle:"
        " 0x%x impossible\n", i);
    }

    ddi_dma_free_handle(&sp->cmd_dmahandle);
    sp->cmd_dmahandle = NULL;
    sp->cmd_flags &= ~CFLAG_DMAVALID;
    return (0);
}

Reallocation of DMA Resources for Data Transfer

For a previously allocated packet with data remaining to be transferred, the tran_init_pkt(9E) entry point must reallocate DMA resources when the following conditions apply:

Partial DMA resources have already been allocated.
A non-zero pkt_resid was returned in the previous call to tran_init_pkt(9E).
bp is not null.
bp->b_bcount is not zero.

When reallocating DMA resources to the next portion of the transfer, tran_init_pkt(9E) must return the amount of DMA resources not allocated for this transfer in the field pkt_resid.

If an error occurs while attempting to move DMA resources, tran_init_pkt(9E) must not free the scsi_pkt(9S). The target driver in this case is responsible for freeing the packet.

If the callback parameter is NULL_FUNC, the tran_init_pkt(9E) entry point must not sleep or call any function that might sleep. If the callback parameter is SLEEP_FUNC and resources are not immediately available, the tran_init_pkt(9E) entry point should sleep. Unless the request is impossible to satisfy, tran_init_pkt() should sleep until resources become available.

Example 18–4 DMA Resource Reallocation for HBA Drivers

static int
isp_i_dma_move(
    struct isp         *isp,
    struct scsi_pkt    *pkt,
    struct buf         *bp)
{
    struct isp_cmd     *sp  = (struct isp_cmd *)pkt->pkt_ha_private;
    int                i;

    ASSERT(sp->cmd_flags & CFLAG_COMPLETED);
    sp->cmd_flags &= ~CFLAG_COMPLETED;
   /*
    * If there are no more cookies remaining in this window,
    * must move to the next window first.
    */
    if (sp->cmd_cookie == sp->cmd_ncookies) {
   /*
    * For small pkts, leave things where they are
    */
    if (sp->cmd_curwin == sp->cmd_nwin && sp->cmd_nwin == 1)
        return (1);
   /*
    * At last window, cannot move
    */
    if (++sp->cmd_curwin >= sp->cmd_nwin)
        return (0);
    if (ddi_dma_getwin(sp->cmd_dmahandle, sp->cmd_curwin,
        &sp->cmd_dma_offset, &sp->cmd_dma_len,
        &sp->cmd_dmacookies[0], &sp->cmd_ncookies) ==
        DDI_FAILURE)
        return (0);
        sp->cmd_cookie = 0;
    } else {
   /*
    * Still more cookies in this window - get the next one
    */
    ddi_dma_nextcookie(sp->cmd_dmahandle,
        &sp->cmd_dmacookies[0]);
    }
   /*
    * Get remaining cookies in this window, up to our maximum
    */
    i = 0;
    for (;;) {
    sp->cmd_dmacount += sp->cmd_dmacookies[i++].dmac_size;
    sp->cmd_cookie++;
    if (i == ISP_NDATASEGS ||
        sp->cmd_cookie == sp->cmd_ncookies)
            break;
    ddi_dma_nextcookie(sp->cmd_dmahandle,
        &sp->cmd_dmacookies[i]);
    }
    sp->cmd_cookiecnt = i;
    pkt->pkt_resid = bp->b_bcount - sp->cmd_dmacount;
    return (1);
}

`tran_destroy_pkt()` Entry Point

The tran_destroy_pkt(9E) entry point is the HBA driver function that deallocates scsi_pkt(9S) structures. The tran_destroy_pkt() entry point is called when the target driver calls scsi_destroy_pkt(9F).

The tran_destroy_pkt() entry point must free any DMA resources that have been allocated for the packet. An implicit DMA synchronization occurs if the DMA resources are freed and any cached data remains after the completion of the transfer. The tran_destroy_pkt() entry point frees the SCSI packet by calling scsi_hba_pkt_free(9F).

Example 18–5 HBA Driver tran_destroy_pkt(9E) Entry Point

static void
isp_scsi_destroy_pkt(
    struct scsi_address    *ap,
    struct scsi_pkt    *pkt)
{
    struct isp_cmd *sp = (struct isp_cmd *)pkt->pkt_ha_private;
   /*
    * Free the DMA, if any
    */
    if (sp->cmd_flags & CFLAG_DMAVALID) {
        sp->cmd_flags &= ~CFLAG_DMAVALID;
        (void) ddi_dma_unbind_handle(sp->cmd_dmahandle);
        ddi_dma_free_handle(&sp->cmd_dmahandle);
        sp->cmd_dmahandle = NULL;
    }
   /*
    * Free the pkt
    */
    scsi_hba_pkt_free(ap, pkt);
}

`tran_sync_pkt()` Entry Point

The tran_sync_pkt(9E) entry point synchronizes the DMA object allocated for the scsi_pkt(9S) structure before or after a DMA transfer. The tran_sync_pkt() entry point is called when the target driver calls scsi_sync_pkt(9F).

If the data transfer direction is a DMA read from device to memory, tran_sync_pkt() must synchronize the CPU's view of the data. If the data transfer direction is a DMA write from memory to device, tran_sync_pkt() must synchronize the device's view of the data.

Example 18–6 HBA Driver tran_sync_pkt(9E) Entry Point

static void
isp_scsi_sync_pkt(
    struct scsi_address    *ap,
    struct scsi_pkt        *pkt)
{
    struct isp_cmd *sp = (struct isp_cmd *)pkt->pkt_ha_private;

    if (sp->cmd_flags & CFLAG_DMAVALID) {
        (void)ddi_dma_sync(sp->cmd_dmahandle, sp->cmd_dma_offset,
        sp->cmd_dma_len,
        (sp->cmd_flags & CFLAG_DMASEND) ?
        DDI_DMA_SYNC_FORDEV : DDI_DMA_SYNC_FORCPU);
    }
}

`tran_dmafree()` Entry Point

The tran_dmafree(9E) entry point deallocates DMA resources that have been allocated for a scsi_pkt(9S) structure. The tran_dmafree() entry point is called when the target driver calls scsi_dmafree(9F).

tran_dmafree() must free only DMA resources allocated for a scsi_pkt(9S) structure, not the scsi_pkt(9S) itself. When DMA resources are freed, a DMA synchronization is implicitly performed.

Note –

The scsi_pkt(9S) is freed in a separate request to tran_destroy_pkt(9E). Because tran_destroy_pkt() must also free DMA resources, the HBA driver must keep accurate note of whether scsi_pkt() structures have DMA resources allocated.

Example 18–7 HBA Driver tran_dmafree(9E) Entry Point

static void
isp_scsi_dmafree(
    struct scsi_address    *ap,
    struct scsi_pkt        *pkt)
{
    struct isp_cmd    *sp = (struct isp_cmd *)pkt->pkt_ha_private;

    if (sp->cmd_flags & CFLAG_DMAVALID) {
        sp->cmd_flags &= ~CFLAG_DMAVALID;
        (void)ddi_dma_unbind_handle(sp->cmd_dmahandle);
        ddi_dma_free_handle(&sp->cmd_dmahandle);
        sp->cmd_dmahandle = NULL;
    }
}

Command Transport

An HBA driver goes through the following steps as part of command transport:

Accept a command from the target driver.
Issue the command to the device hardware.
Service any interrupts that occur.
Manage time outs.

`tran_start()` Entry Point

The tran_start(9E) entry point for a SCSI HBA driver is called to transport a SCSI command to the addressed target. The SCSI command is described entirely within the scsi_pkt(9S) structure, which the target driver allocated through the HBA driver's tran_init_pkt(9E) entry point. If the command involves a data transfer, DMA resources must also have been allocated for the scsi_pkt(9S) structure.

The tran_start() entry point is called when a target driver calls scsi_transport(9F).

tran_start() should perform basic error checking along with any initialization that is required by the command. The FLAG_NOINTR flag in the pkt_flags field of the scsi_pkt(9S) structure can affect the behavior of tran_start(). If FLAG_NOINTR is not set, tran_start() must queue the command for execution on the hardware and return immediately. Upon completion of the command, the HBA driver should call the pkt completion routine.

If the FLAG_NOINTR is set, then the HBA driver should not call the pkt completion routine.

The following example demonstrates how to handle the tran_start(9E) entry point. The ISP hardware provides a queue per-target device. For devices that can manage only one active outstanding command, the driver is typically required to manage a per-target queue. The driver then starts up a new command upon completion of the current command in a round-robin fashion.

Example 18–8 HBA Driver tran_start(9E) Entry Point

static int
isp_scsi_start(
    struct scsi_address    *ap,
    struct scsi_pkt        *pkt)
{
    struct isp_cmd         *sp;
    struct isp             *isp;
    struct isp_request     *req;
    u_long                 cur_lbolt;
    int                    xfercount;
    int                    rval = TRAN_ACCEPT;
    int                    i;

    sp = (struct isp_cmd *)pkt->pkt_ha_private;
    isp = (struct isp *)ap->a_hba_tran->tran_hba_private;

    sp->cmd_flags = (sp->cmd_flags & ~CFLAG_TRANFLAG) |
                CFLAG_IN_TRANSPORT;
    pkt->pkt_reason = CMD_CMPLT;
   /*
    * set up request in cmd_isp_request area so it is ready to
    * go once we have the request mutex
    */
    req = &sp->cmd_isp_request;

    req->req_header.cq_entry_type = CQ_TYPE_REQUEST;
    req->req_header.cq_entry_count = 1;
    req->req_header.cq_flags        = 0;
    req->req_header.cq_seqno = 0;
    req->req_reserved = 0;
    req->req_token = (opaque_t)sp;
    req->req_target = TGT(sp);
    req->req_lun_trn = LUN(sp);
    req->req_time = pkt->pkt_time;
    ISP_SET_PKT_FLAGS(pkt->pkt_flags, req->req_flags);
   /*
    * Set up data segments for dma transfers.
    */
    if (sp->cmd_flags & CFLAG_DMAVALID) {

        if (sp->cmd_flags & CFLAG_CMDIOPB) {
            (void) ddi_dma_sync(sp->cmd_dmahandle,
            sp->cmd_dma_offset, sp->cmd_dma_len,
            DDI_DMA_SYNC_FORDEV);
        }

        ASSERT(sp->cmd_cookiecnt > 0 &&
            sp->cmd_cookiecnt <= ISP_NDATASEGS);

        xfercount = 0;
        req->req_seg_count = sp->cmd_cookiecnt;
        for (i = 0; i < sp->cmd_cookiecnt; i++) {
            req->req_dataseg[i].d_count =
            sp->cmd_dmacookies[i].dmac_size;
            req->req_dataseg[i].d_base =
            sp->cmd_dmacookies[i].dmac_address;
            xfercount +=
            sp->cmd_dmacookies[i].dmac_size;
        }

        for (; i < ISP_NDATASEGS; i++) {
            req->req_dataseg[i].d_count = 0;
            req->req_dataseg[i].d_base = 0;
        }

        pkt->pkt_resid = xfercount;

        if (sp->cmd_flags & CFLAG_DMASEND) {
            req->req_flags |= ISP_REQ_FLAG_DATA_WRITE;
        } else {
            req->req_flags |= ISP_REQ_FLAG_DATA_READ;
        }
    } else {
        req->req_seg_count = 0;
        req->req_dataseg[0].d_count = 0;
    }
   /*
    * Set up cdb in the request
    */
    req->req_cdblen = sp->cmd_cdblen;
    bcopy((caddr_t)pkt->pkt_cdbp, (caddr_t)req->req_cdb,
    sp->cmd_cdblen);
   /*
    * Start the cmd.  If NO_INTR, must poll for cmd completion.
    */
    if ((pkt->pkt_flags & FLAG_NOINTR) == 0) {
        mutex_enter(ISP_REQ_MUTEX(isp));
        rval = isp_i_start_cmd(isp, sp);
        mutex_exit(ISP_REQ_MUTEX(isp));
    } else {
        rval = isp_i_polled_cmd_start(isp, sp);
    }
    return (rval);
}

Interrupt Handler and Command Completion

The interrupt handler must check the status of the device to be sure the device is generating the interrupt in question. The interrupt handler must also check for any errors that have occurred and service any interrupts generated by the device.

If data is transferred, the hardware should be checked to determine how much data was actually transferred. The pkt_resid field in the scsi_pkt(9S) structure should be set to the residual of the transfer.

Commands that are marked with the PKT_CONSISTENT flag when DMA resources are allocated through tran_init_pkt(9E) take special handling. The HBA driver must ensure that the data transfer for the command is correctly synchronized before the target driver's command completion callback is performed.

Once a command has completed, you need to act on two requirements:

If a new command is queued up, start the command on the hardware as quickly as possible.
Call the command completion callback. The callback has been set up in the scsi_pkt(9S) structure by the target driver to notify the target driver when the command is complete.

Start a new command on the hardware, if possible, before calling the PKT_COMP command completion callback. The command completion handling can take considerable time. Typically, the target driver calls functions such as biodone(9F) and possibly scsi_transport(9F) to begin a new command.

The interrupt handler must return DDI_INTR_CLAIMED if this interrupt is claimed by this driver. Otherwise, the handler returns DDI_INTR_UNCLAIMED.

The following example shows an interrupt handler for the SCSI HBA isp driver. The caddr_t parameter is set up when the interrupt handler is added in attach(9E). This parameter is typically a pointer to the state structure, which is allocated on a per instance basis.

Example 18–9 HBA Driver Interrupt Handler

static u_int
isp_intr(caddr_t arg)
{
    struct isp_cmd         *sp;
    struct isp_cmd         *head, *tail;
    u_short                response_in;
    struct isp_response    *resp;
    struct isp             *isp = (struct isp *)arg;
    struct isp_slot        *isp_slot;
    int                    n;

    if (ISP_INT_PENDING(isp) == 0) {
        return (DDI_INTR_UNCLAIMED);
    }

    do {
again:
       /*
        * head list collects completed packets for callback later
        */
        head = tail = NULL;
       /*
        * Assume no mailbox events (e.g., mailbox cmds, asynch
        * events, and isp dma errors) as common case.
        */
        if (ISP_CHECK_SEMAPHORE_LOCK(isp) == 0) {
            mutex_enter(ISP_RESP_MUTEX(isp));
           /*
            * Loop through completion response queue and post
            * completed pkts.  Check response queue again
            * afterwards in case there are more.
            */
            isp->isp_response_in =
            response_in = ISP_GET_RESPONSE_IN(isp);
           /*
            * Calculate the number of requests in the queue
            */
            n = response_in - isp->isp_response_out;
            if (n < 0) {
                n = ISP_MAX_REQUESTS -
                isp->isp_response_out + response_in;
            }
            while (n-- > 0) {
                ISP_GET_NEXT_RESPONSE_OUT(isp, resp);
                sp = (struct isp_cmd *)resp->resp_token;
               /*
                * Copy over response packet in sp
                */
                isp_i_get_response(isp, resp, sp);
            }
            if (head) {
                tail->cmd_forw = sp;
                tail = sp;
                tail->cmd_forw = NULL;
            } else {
                tail = head = sp;
                sp->cmd_forw = NULL;
            }
            ISP_SET_RESPONSE_OUT(isp);
            ISP_CLEAR_RISC_INT(isp);
            mutex_exit(ISP_RESP_MUTEX(isp));

            if (head) {
                isp_i_call_pkt_comp(isp, head);
            }
        } else {
            if (isp_i_handle_mbox_cmd(isp) != ISP_AEN_SUCCESS) {
                return (DDI_INTR_CLAIMED);
            }
           /*
            * if there was a reset then check the response
            * queue again
            */
            goto again;    
        }

    } while (ISP_INT_PENDING(isp));

    return (DDI_INTR_CLAIMED);
}

static void
isp_i_call_pkt_comp(
    struct isp             *isp,
    struct isp_cmd         *head)
{
    struct isp             *isp;
    struct isp_cmd         *sp;
    struct scsi_pkt        *pkt;
    struct isp_response    *resp;
    u_char                 status;

    while (head) {
        sp = head;
        pkt = sp->cmd_pkt;
        head = sp->cmd_forw;

        ASSERT(sp->cmd_flags & CFLAG_FINISHED);

        resp = &sp->cmd_isp_response;

        pkt->pkt_scbp[0] = (u_char)resp->resp_scb;
        pkt->pkt_state = ISP_GET_PKT_STATE(resp->resp_state);
        pkt->pkt_statistics = (u_long)
            ISP_GET_PKT_STATS(resp->resp_status_flags);
        pkt->pkt_resid = (long)resp->resp_resid;
       /*
        * If data was xferred and this is a consistent pkt,
        * do a dma sync
        */
        if ((sp->cmd_flags & CFLAG_CMDIOPB) &&
            (pkt->pkt_state & STATE_XFERRED_DATA)) {
                (void) ddi_dma_sync(sp->cmd_dmahandle,
                sp->cmd_dma_offset, sp->cmd_dma_len,
                DDI_DMA_SYNC_FORCPU);
        }

        sp->cmd_flags = (sp->cmd_flags & ~CFLAG_IN_TRANSPORT) |
            CFLAG_COMPLETED;
       /*
        * Call packet completion routine if FLAG_NOINTR is not set.
        */
        if (((pkt->pkt_flags & FLAG_NOINTR) == 0) &&
            pkt->pkt_comp) {
                (*pkt->pkt_comp)(pkt);
        }
    }
}

Timeout Handler

The HBA driver is responsible for enforcing time outs. A command must be complete within a specified time unless a zero time out has been specified in the scsi_pkt(9S) structure.

When a command times out, the HBA driver should mark the scsi_pkt(9S) with pkt_reason set to CMD_TIMEOUT and pkt_statistics OR'd with STAT_TIMEOUT. The HBA driver should also attempt to recover the target and bus. If this recovery can be performed successfully, the driver should mark the scsi_pkt(9S) using pkt_statistics OR'd with either STAT_BUS_RESET or STAT_DEV_RESET.

After the recovery attempt has completed, the HBA driver should call the command completion callback.

Note –

If recovery was unsuccessful or not attempted, the target driver might attempt to recover from the timeout by calling scsi_reset(9F).

The ISP hardware manages command timeout directly and returns timed-out commands with the necessary status. The timeout handler for the isp sample driver checks active commands for the time out state only once every 60 seconds.

The isp sample driver uses the timeout(9F) facility to arrange for the kernel to call the timeout handler every 60 seconds. The caddr_t argument is the parameter set up when the timeout is initialized at attach(9E) time. In this case, the caddr_t argument is a pointer to the state structure allocated per driver instance.

If timed-out commands have not been returned as timed-out by the ISP hardware, a problem has occurred. The hardware is not functioning correctly and needs to be reset.

Capability Management

The following sections discuss capability management.

`tran_getcap()` Entry Point

The tran_getcap(9E) entry point for a SCSI HBA driver is called by scsi_ifgetcap(9F). The target driver calls scsi_ifgetcap() to determine the current value of one of a set of SCSA-defined capabilities.

The target driver can request the current setting of the capability for a particular target by setting the whom parameter to nonzero. A whom value of zero indicates a request for the current setting of the general capability for the SCSI bus or for adapter hardware.

The tran_getcap() entry point should return -1 for undefined capabilities or the current value of the requested capability.

The HBA driver can use the function scsi_hba_lookup_capstr(9F) to compare the capability string against the canonical set of defined capabilities.

Example 18–10 HBA Driver tran_getcap(9E) Entry Point

static int
isp_scsi_getcap(
    struct scsi_address    *ap,
    char                   *cap,
    int                    whom)
{
    struct isp             *isp;
    int                    rval = 0;
    u_char                 tgt = ap->a_target;
   /*
    * We don't allow getting capabilities for other targets
    */
    if (cap == NULL || whom  == 0) {
        return (-1);
    }
    isp = (struct isp *)ap->a_hba_tran->tran_hba_private;
    ISP_MUTEX_ENTER(isp);

    switch (scsi_hba_lookup_capstr(cap)) {
    case SCSI_CAP_DMA_MAX:
        rval = 1 << 24; /* Limit to 16MB max transfer */
        break;
    case SCSI_CAP_MSG_OUT:
        rval = 1;
        break;
    case SCSI_CAP_DISCONNECT:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_DR) == 0) {
            break;
        } else if (
            (isp->isp_cap[tgt] & ISP_CAP_DISCONNECT) == 0) {
            break;
        }
        rval = 1;
        break;
    case SCSI_CAP_SYNCHRONOUS:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_SYNC) == 0) {
            break;
        } else if (
            (isp->isp_cap[tgt] & ISP_CAP_SYNC) == 0) {
            break;
        }
        rval = 1;
        break;
    case SCSI_CAP_WIDE_XFER:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_WIDE) == 0) {
            break;
        } else if (
            (isp->isp_cap[tgt] & ISP_CAP_WIDE) == 0) {
            break;
        }
        rval = 1;
        break;
    case SCSI_CAP_TAGGED_QING:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_DR) == 0 ||
            (isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_TAG) == 0) {
            break;
        } else if (
            (isp->isp_cap[tgt] & ISP_CAP_TAG) == 0) {
            break;
        }
        rval = 1;
        break;
    case SCSI_CAP_UNTAGGED_QING:
        rval = 1;
        break;
    case SCSI_CAP_PARITY:
        if (isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_PARITY) {
            rval = 1;
        }
        break;
    case SCSI_CAP_INITIATOR_ID:
        rval = isp->isp_initiator_id;
        break;
    case SCSI_CAP_ARQ:
        if (isp->isp_cap[tgt] & ISP_CAP_AUTOSENSE) {
            rval = 1;
        }
        break;
    case SCSI_CAP_LINKED_CMDS:
        break;
    case SCSI_CAP_RESET_NOTIFICATION:
        rval = 1;
        break;
    case SCSI_CAP_GEOMETRY:
        rval = (64 << 16) | 32;
        break;
    default:
        rval = -1;
        break;
    }

    ISP_MUTEX_EXIT(isp);
    return (rval);
}

`tran_setcap()` Entry Point

The tran_setcap(9E) entry point for a SCSI HBA driver is called by scsi_ifsetcap(9F). A target driver calls scsi_ifsetcap() to change the current one of a set of SCSA-defined capabilities.

The target driver might request that the new value be set for a particular target by setting the whom parameter to nonzero. A whom value of zero means the request is to set the new value for the SCSI bus or for adapter hardware in general.

tran_setcap() should return the following values as appropriate:

-1 for undefined capabilities
0 if the HBA driver cannot set the capability to the requested value
1 if the HBA driver is able to set the capability to the requested value

The HBA driver can use the function scsi_hba_lookup_capstr(9F) to compare the capability string against the canonical set of defined capabilities.

Example 18–11 HBA Driver tran_setcap(9E) Entry Point

static int
isp_scsi_setcap(
    struct scsi_address    *ap,
    char                   *cap,
    int                    value,
    int                    whom)
{
    struct isp             *isp;
    int                    rval = 0;
    u_char                 tgt = ap->a_target;
    int                    update_isp = 0;
   /*
    * We don't allow setting capabilities for other targets
    */
    if (cap == NULL || whom == 0) {
        return (-1);
    }

    isp = (struct isp *)ap->a_hba_tran->tran_hba_private;
    ISP_MUTEX_ENTER(isp);

    switch (scsi_hba_lookup_capstr(cap)) {
    case SCSI_CAP_DMA_MAX:
    case SCSI_CAP_MSG_OUT:
    case SCSI_CAP_PARITY:
    case SCSI_CAP_UNTAGGED_QING:
    case SCSI_CAP_LINKED_CMDS:
    case SCSI_CAP_RESET_NOTIFICATION:
   /*
    * None of these are settable through
    * the capability interface.
    */
        break;
    case SCSI_CAP_DISCONNECT:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_DR) == 0) {
                break;
        } else {
            if (value) {
                isp->isp_cap[tgt] |= ISP_CAP_DISCONNECT;
            } else {
                isp->isp_cap[tgt] &= ~ISP_CAP_DISCONNECT;
            }
        }
        rval = 1;
        break;
    case SCSI_CAP_SYNCHRONOUS:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_SYNC) == 0) {
                break;
        } else {
            if (value) {
                isp->isp_cap[tgt] |= ISP_CAP_SYNC;
            } else {
                isp->isp_cap[tgt] &= ~ISP_CAP_SYNC;
            }
        }
        rval = 1;
        break;
    case SCSI_CAP_TAGGED_QING:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_DR) == 0 ||
            (isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_TAG) == 0) {
                break;
        } else {
            if (value) {
                isp->isp_cap[tgt] |= ISP_CAP_TAG;
            } else {
            isp->isp_cap[tgt] &= ~ISP_CAP_TAG;
            }
        }
        rval = 1;
        break;
    case SCSI_CAP_WIDE_XFER:
        if ((isp->isp_target_scsi_options[tgt] &
            SCSI_OPTIONS_WIDE) == 0) {
                break;
        } else {
            if (value) {
                isp->isp_cap[tgt] |= ISP_CAP_WIDE;
            } else {
                isp->isp_cap[tgt] &= ~ISP_CAP_WIDE;
            }
        }
        rval = 1;
        break;
    case SCSI_CAP_INITIATOR_ID:
        if (value < N_ISP_TARGETS_WIDE) {
            struct isp_mbox_cmd mbox_cmd;
            isp->isp_initiator_id = (u_short) value;
           /*
            * set Initiator SCSI ID
            */
            isp_i_mbox_cmd_init(isp, &mbox_cmd, 2, 2,
            ISP_MBOX_CMD_SET_SCSI_ID,
            isp->isp_initiator_id,
            0, 0, 0, 0);
            if (isp_i_mbox_cmd_start(isp, &mbox_cmd) == 0) {
                rval = 1;
            }
        }
        break;
    case SCSI_CAP_ARQ:
        if (value) {
            isp->isp_cap[tgt] |= ISP_CAP_AUTOSENSE;
        } else {
            isp->isp_cap[tgt] &= ~ISP_CAP_AUTOSENSE;
        }
        rval = 1;
        break;
    default:
        rval = -1;
        break;
    }
    ISP_MUTEX_EXIT(isp);
    return (rval);
}

Abort and Reset Management

The following sections discuss the abort and reset entry points for SCSI HBA.

`tran_abort()` Entry Point

The tran_abort(9E) entry point for a SCSI HBA driver is called to abort any commands that are currently in transport for a particular target. This entry point is called when a target driver calls scsi_abort(9F).

The tran_abort() entry point should attempt to abort the command denoted by the pkt parameter. If the pkt parameter is NULL, tran_abort() should attempt to abort all outstanding commands in the transport layer for the particular target or logical unit.

Each command successfully aborted must be marked with pkt_reason CMD_ABORTED and pkt_statistics OR'd with STAT_ABORTED.

`tran_reset()` Entry Point

The tran_reset(9E) entry point for a SCSI HBA driver is called to reset either the SCSI bus or a particular SCSI target device. This entry point is called when a target driver calls scsi_reset(9F).

The tran_reset() entry point must reset the SCSI bus if level is RESET_ALL. If level is RESET_TARGET, just the particular target or logical unit must be reset.

Active commands affected by the reset must be marked with pkt_reason CMD_RESET. The type of reset determines whether STAT_BUS_RESET or STAT_DEV_RESET should be used to OR pkt_statistics.

Commands in the transport layer, but not yet active on the target, must be marked with pkt_reason CMD_RESET, and pkt_statistics OR'd with STAT_ABORTED.

`tran_bus_reset()` Entry Point

tran_bus_reset(9E) must reset the SCSI bus without resetting targets.

#include <sys/scsi/scsi.h>

int tran_bus_reset(dev_info_t *hba-dip, int level);

where:

*hba-dip: Pointer associated with the SCSI HBA
level: Must be set to RESET_BUS so that only the SCSI bus is reset, not the targets

The tran_bus_reset() vector in the scsi_hba_tran(9S) structure should be initialized during the HBA driver's attach(9E). The vector should point to an HBA entry point that is to be called when a user initiates a bus reset.

Implementation is hardware specific. If the HBA driver cannot reset the SCSI bus without affecting the targets, the driver should fail RESET_BUS or not initialize this vector.

`tran_reset_notify()` Entry Point

Use the tran_reset_notify(9E) entry point when a SCSI bus reset occurs. This function requests the SCSI HBA driver to notify the target driver by callback.

Example 18–12 HBA Driver tran_reset_notify(9E) Entry Point

isp_scsi_reset_notify(
    struct scsi_address    *ap,
    int                    flag,
    void                   (*callback)(caddr_t),
    caddr_t                arg)
{
    struct isp                       *isp;
    struct isp_reset_notify_entry    *p, *beforep;
    int                              rval = DDI_FAILURE;

    isp = (struct isp *)ap->a_hba_tran->tran_hba_private;
    mutex_enter(ISP_REQ_MUTEX(isp));
   /*
    * Try to find an existing entry for this target
    */
    p = isp->isp_reset_notify_listf;
    beforep = NULL;

    while (p) {
        if (p->ap == ap)
            break;
        beforep = p;
        p = p->next;
    }

    if ((flag & SCSI_RESET_CANCEL) && (p != NULL)) {
        if (beforep == NULL) {
            isp->isp_reset_notify_listf = p->next;
        } else {
            beforep->next = p->next;
        }
        kmem_free((caddr_t)p, sizeof (struct
            isp_reset_notify_entry));
        rval = DDI_SUCCESS;
    } else if ((flag & SCSI_RESET_NOTIFY) && (p == NULL)) {
        p = kmem_zalloc(sizeof (struct isp_reset_notify_entry),
            KM_SLEEP);
        p->ap = ap;
        p->callback = callback;
        p->arg = arg;
        p->next = isp->isp_reset_notify_listf;
        isp->isp_reset_notify_listf = p;
        rval = DDI_SUCCESS;
    }

    mutex_exit(ISP_REQ_MUTEX(isp));
    return (rval);
}

Dynamic Reconfiguration

To support the minimal set of hot-plugging operations, drivers might need to implement support for bus quiesce, bus unquiesce, and bus reset. The scsi_hba_tran(9S) structure supports these operations. If quiesce, unquiesce, or reset are not required by hardware, no driver changes are needed.

The scsi_hba_tran structure includes the following fields:

int (*tran_quiesce)(dev_info_t *hba-dip);
int (*tran_unquiesce)(dev_info_t *hba-dip);
int (*tran_bus_reset)(dev_info_t *hba-dip, int level);

These interfaces quiesce and unquiesce a SCSI bus.

#include <sys/scsi/scsi.h>

int prefixtran_quiesce(dev_info_t *hba-dip);
int prefixtran_unquiesce(dev_info_t *hba-dip);

tran_quiesce(9E) and tran_unquiesce(9E) are used for SCSI devices that are not designed for hot-plugging. These functions must be implemented by an HBA driver to support dynamic reconfiguration (DR).

The tran_quiesce() and tran_unquiesce() vectors in the scsi_hba_tran(9S) structure should be initialized to point to HBA entry points during attach(9E). These functions are called when a user initiates quiesce and unquiesce operations.

The tran_quiesce() entry point stops all activity on a SCSI bus prior to and during the reconfiguration of devices that are attached to the SCSI bus. The tran_unquiesce() entry point is called by the SCSA framework to resume activity on the SCSI bus after the reconfiguration operation has been completed.

HBA drivers are required to handle tran_quiesce() by waiting for all outstanding commands to complete before returning success. After the driver has quiesced the bus, any new I/O requests must be queued until the SCSA framework calls the corresponding tran_unquiesce() entry point.

HBA drivers handle calls to tran_unquiesce() by starting any target driver I/O requests in the queue.

SCSI HBA Driver Specific Issues

The section covers issues specific to SCSI HBA drivers.

Installing HBA Drivers

A SCSI HBA driver is installed in similar fashion to a leaf driver. See Chapter 21, Compiling, Loading, Packaging, and Testing Drivers. The difference is that the add_drv(1M) command must specify the driver class as SCSI, such as:

# add_drv -m" * 0666 root root" -i'"pci1077,1020"' -c scsi isp

HBA Configuration Properties

When attaching an instance of an HBA device, scsi_hba_attach_setup(9F) creates a number of SCSI configuration properties for that HBA instance. A particular property is created only if no existing property of the same name is already attached to the HBA instance. This restriction avoids overriding any default property values in an HBA configuration file.

An HBA driver must use ddi_prop_get_int(9F) to retrieve each property. The HBA driver then modifies or accepts the default value of the properties to configure its specific operation.

`scsi-reset-delay` Property

The scsi-reset-delay property is an integer specifying the recovery time in milliseconds for a reset delay by either a SCSI bus or SCSI device.

`scsi-options` Property

The scsi-options property is an integer specifying a number of options through individually defined bits:

SCSI_OPTIONS_DR (0x008) – If not set, the HBA should not grant disconnect privileges to a target device.
SCSI_OPTIONS_LINK (0x010) – If not set, the HBA should not enable linked commands.
SCSI_OPTIONS_SYNC (0x020) – If not set, the HBA driver must not negotiate synchronous data transfer. The driver should reject any attempt to negotiate synchronous data transfer initiated by a target.
SCSI_OPTIONS_PARITY (0x040) – If not set, the HBA should run the SCSI bus without parity.
SCSI_OPTIONS_TAG (0x080) – If not set, the HBA should not operate in Command Tagged Queuing mode.
SCSI_OPTIONS_FAST (0x100) – If not set, the HBA should not operate the bus in FAST SCSI mode.
SCSI_OPTIONS_WIDE (0x200) – If not set, the HBA should not operate the bus in WIDE SCSI mode.

Per-Target `scsi-options`

An HBA driver might support a per-target scsi-options feature in the following format:

target<n>-scsi-options=<hex value>

In this example, < n> is the target ID. If the per-target scsi-options property is defined, the HBA driver uses that value rather than the per-HBA driver instance scsi-options property. This approach can provide more precise control if, for example, synchronous data transfer needs to be disabled for just one particular target device. The per-target scsi-options property can be defined in the driver.conf(4) file.

The following example shows a per-target scsi-options property definition to disable synchronous data transfer for target device 3:

target3-scsi-options=0x2d8

x86 Target Driver Configuration Properties

Some x86 SCSI target drivers, such as the driver for cmdk disk, use the following configuration properties:

disk
queue
flow_control

If you use the cmdk sample driver to write an HBA driver for an x86 platform, any appropriate properties must be defined in the driver.conf(4) file.

Note –

These property definitions should appear only in an HBA driver's driver.conf(4) file. The HBA driver itself should not inspect or attempt to interpret these properties in any way. These properties are advisory only and serve as an adjunct to the cmdk driver. The properties should not be relied upon in any way. The property definitions might not be used in future releases.

The disk property can be used to define the type of disk supported by cmdk. For a SCSI HBA, the only possible value for the disk property is:

disk="scdk" – Disk type is a SCSI disk

The queue property defines how the disk driver sorts the queue of incoming requests during strategy(9E). Two values are possible:

queue="qsort" – One-way elevator queuing model, provided by disksort(9F)
queue="qfifo" – FIFO, that is, first in, first out queuing model

The flow_control property defines how commands are transported to the HBA driver. Three values are possible:

flow_control="dsngl" – Single command per HBA driver
flow_control="dmult" – Multiple commands per HBA driver. When the HBA queue is full, the driver returns TRAN_BUSY.
flow_control="duplx" – The HBA can support separate read and write queues, with multiple commands per queue. FIFO ordering is used for the write queue. The queuing model that is used for the read queue is described by the queue property. When an HBA queue is full, the driver returns TRAN_BUSY

The following example is a driver.conf(4) file for use with an x86 HBA PCI device that has been designed for use with the cmdk sample driver:

#
# config file for ISP 1020 SCSI HBA driver     
#
       flow_control="dsngl" queue="qsort" disk="scdk"
       scsi-initiator-id=7;

Support for Queuing

For a definition of tagged queuing, refer to the SCSI-2 specification. To support tagged queuing, first check the scsi_options flag SCSI_OPTIONS_TAG to see whether tagged queuing is enabled globally. Next, check to see whether the target is a SCSI-2 device and whether the target has tagged queuing enabled. If these conditions are all true, attempt to enable tagged queuing by using scsi_ifsetcap(9F).

If tagged queuing fails, you can attempt to set untagged queuing. In this mode, you submit as many commands as you think necessary or optimal to the host adapter driver. Then the host adapter queues the commands to the target one command at a time, in contrast to tagged queuing. In tagged queuing, the host adapter submits as many commands as possible until the target indicates that the queue is full.

Chapter 19 Drivers for Network Devices

To write a network driver for the Solaris OS, use the Solaris Generic LAN Driver (GLD) framework.

For new Ethernet drivers, use the GLDv3 framework. See GLDv3 Network Device Driver Framework. The GLDv3 framework is a function calls-based interface.
To maintain older Ethernet, Token Ring, or FDDI drivers, use the GLDv2 framework. See GLDv2 Network Device Driver Framework. The GLDv2 is a kernel module that provides common code for drivers to share.

GLDv3 Network Device Driver Framework

The GLDv3 framework is a function calls-based interface of MAC plugins and MAC driver service routines and structures. The GLDv3 framework implements the necessary STREAMS entry points on behalf of GLDv3 compliant drivers and handles DLPI compatibility.

This section discusses the following topics:

GLDv3 MAC Registration

GLDv3 defines a driver API for drivers that register with a plugin type of MAC_PLUGIN_IDENT_ETHER.

GLDv3 MAC Registration Process

A GLDv3 device driver must perform the following steps to register with the MAC layer:

Include the following three MAC header files: sys/mac.h, sys/mac_ether.h, and sys/mac_provider.h. Do not include any other MAC-related header file in your driver.
Populate the mac_callbacks structure.
Invoke the mac_init_ops() function in its _init() entry point.
Invoke the mac_alloc() function in its attach() entry point to allocate a mac_register structure.
Populate the mac_register structure and invoke the mac_register() function in its attach() entry point.
Invoke the mac_unregister() function in its detach() entry point.
Invoke the mac_fini_ops() function in its _fini() entry point.
Link with a dependency on misc/mac:
```
# ld -N"misc/mac" xx.o -o xx
```

GLDv3 MAC Registration Functions

The GLDv3 interface includes driver entry points that are advertised during registration with the MAC layer and MAC entry points that are invoked by drivers.

The `mac_init_ops()` and `mac_fini_ops()` Functions

void mac_init_ops(struct dev_ops *ops, const char *name);

A GLDv3 device driver must invoke the mac_init_ops(9F) function in its _init(9E) entry point before calling mod_install(9F).

void mac_fini_ops(struct dev_ops *ops);

A GLDv3 device driver must invoke the mac_fini_ops(9F) function in its _fini(9E) entry point after calling mod_remove(9F).

Example 19–1 The `mac_init_ops()` and `mac_fini_ops()` Functions

int
_init(void)
{
        int     rv;
        mac_init_ops(&xx_devops, "xx");
        if ((rv = mod_install(&xx_modlinkage)) != DDI_SUCCESS) {
                mac_fini_ops(&xx_devops);
        }
        return (rv);
}

int
_fini(void)
{
        int     rv;
        if ((rv = mod_remove(&xx_modlinkage)) == DDI_SUCCESS) {
                mac_fini_ops(&xx_devops);
        }
        return (rv);
}

The `mac_alloc()` and `mac_free()` Functions

mac_register_t *mac_alloc(uint_t version);

The mac_alloc(9F) function allocates a new mac_register structure and returns a pointer to it. Initialize the structure members before you pass the new structure to mac_register(). MAC-private elements are initialized by the MAC layer before mac_alloc() returns. The value of version must be MAC_VERSION_V1.

void mac_free(mac_register_t *mregp);

The mac_free(9F) function frees a mac_register structure that was previously allocated by mac_alloc().

The `mac_register()` and `mac_unregister()` Functions

int mac_register(mac_register_t *mregp, mac_handle_t *mhp);

To register a new instance with the MAC layer, a GLDv3 driver must invoke the mac_register(9F) function in its attach(9E) entry point. The mregp argument is a pointer to a mac_register registration information structure. On success, the mhp argument is a pointer to a MAC handle for the new MAC instance. This handle is needed by other routines such as mac_tx_update(), mac_link_update(), and mac_rx().

Example 19–2 The `mac_alloc()`, `mac_register()`, and `mac_free()` Functions and `mac_register` Structure

int
xx_attach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        mac_register_t        *macp;

/* ... */

        if ((macp = mac_alloc(MAC_VERSION)) == NULL) {
                xx_error(dip, "mac_alloc failed");
                goto failed;
        }

        macp->m_type_ident = MAC_PLUGIN_IDENT_ETHER;
        macp->m_driver = xxp;
        macp->m_dip = dip;
        macp->m_src_addr = xxp->xx_curraddr;
        macp->m_callbacks = &xx_m_callbacks;
        macp->m_min_sdu = 0;
        macp->m_max_sdu = ETHERMTU;
        macp->m_margin = VLAN_TAGSZ;

        if (mac_register(macp, &xxp->xx_mh) == DDI_SUCCESS) {
                mac_free(macp);
                return (DDI_SUCCESS);
        }

/* failed to register with MAC */
        mac_free(macp);
failed:
        /* ... */
}

int mac_unregister(mac_handle_t mh);

The mac_unregister(9F) function unregisters a MAC instance that was previously registered with mac_register(). The mh argument is the MAC handle that was allocated by mac_register(). Invoke mac_unregister() from the detach(9E) entry point.

Example 19–3 The `mac_unregister()` Function

int
xx_detach(dev_info_t *dip, ddi_detach_cmd_t cmd)
{
        xx_t        *xxp; /* driver soft state */

        /* ... */

        switch (cmd) {
        case DDI_DETACH:

                if (mac_unregister(xxp->xx_mh) != 0) {
                        return (DDI_FAILURE);
                }
        /* ... */
}

GLDv3 MAC Registration Data Structures

The structures described in this section are defined in the sys/mac_provider.h header file. Include the following three MAC header files in your GLDv3 driver: sys/mac.h, sys/mac_ether.h, and sys/mac_provider.h. Do not include any other MAC-related header file.

The mac_register(9S) data structure is the MAC registration information structure that is allocated by mac_alloc() and passed to mac_register(). Initialize the structure members before you pass the new structure to mac_register(). MAC-private elements are initialized by the MAC layer before mac_alloc() returns. The m_version structure member is the MAC version. Do not modify the MAC version. The m_type_ident structure member is the MAC type identifier. Set the MAC type identifier to MAC_PLUGIN_IDENT_ETHER. The m_callbacks member of the mac_register structure is a pointer to an instance of the mac_callbacks structure.

The mac_callbacks(9S) data structure is the structure that your device driver uses to expose its entry points to the MAC layer. These entry points are used by the MAC layer to control the driver. These entry points are used to do tasks such as start and stop the adapters, manage multicast addresses, set promiscuous mode, query the capabilities of the adapter, and get and set properties. See Table 19–1 for a complete list of required and optional GLDv3 entry points. Provide a pointer to your mac_callbacks structure in the m_callbacks field of the mac_register structure.

The mc_callbacks member of the mac_callbacks structure is a bit mask that is a combination of the following flags that specify which of the optional entry points are implemented by the driver. Other members of the mac_callbacks structure are pointers to each of the entry points of the driver.

MC_IOCTL: The mc_ioctl() entry point is present.
MC_GETCAPAB: The mc_getcapab() entry point is present.
MC_SETPROP: The mc_setprop() entry point is present.
MC_GETPROP: The mc_getprop() entry point is present.
MC_PROPINFO: The mc_propinfo() entry point is present.
MC_PROPERTIES: All properties entry points are present. Setting MC_PROPERTIES is equivalent to setting all three flags: MC_SETPROP, MC_GETPROP, and MC_PROPINFO.

Example 19–4 The `mac_callbacks` Structure

#define XX_M_CALLBACK_FLAGS \
    (MC_IOCTL | MC_GETCAPAB | MC_PROPERTIES)

static mac_callbacks_t xx_m_callbacks = {
        XX_M_CALLBACK_FLAGS,
        xx_m_getstat,     /* mc_getstat() */
        xx_m_start,       /* mc_start() */
        xx_m_stop,        /* mc_stop() */
        xx_m_promisc,     /* mc_setpromisc() */
        xx_m_multicst,    /* mc_multicst() */
        xx_m_unicst,      /* mc_unicst() */
        xx_m_tx,          /* mc_tx() */
        NULL,             /* Reserved, do not use */
        xx_m_ioctl,       /* mc_ioctl() */
        xx_m_getcapab,    /* mc_getcapab() */
        NULL,             /* Reserved, do not use */
        NULL,             /* Reserved, do not use */
        xx_m_setprop,     /* mc_setprop() */
        xx_m_getprop,     /* mc_getprop() */
        xx_m_propinfo     /* mc_propinfo() */
};

GLDv3 Capabilities

GLDv3 implements a capability mechanism that allows the framework to query and enable capabilities that are supported by the GLDv3 driver. Use the mc_getcapab(9E)entry point to report capabilities. If a capability is supported by the driver, pass information about that capability, such as capability-specific entry points or flags through mc_getcapab(). Pass a pointer to the mc_getcapab() entry point in the mac_callback structure. See GLDv3 MAC Registration Data Structures for more information about the mac_callbacks structure.

boolean_t mc_getcapab(void *driver_handle, mac_capab_t cap, void *cap_data);

The cap argument specifies the type of capability being queried. The value of cap can be either MAC_CAPAB_HCKSUM (hardware checksum offload) or MAC_CAPAB_LSO (large segment offload). Use the cap_data argument to return the capability data to the framework.

If the driver supports the cap capability, the mc_getcapab() entry point must return B_TRUE. If the driver does not support the cap capability, mc_getcapab() must return B_FALSE.

Example 19–5 The `mc_getcapab()` Entry Point

static boolean_t
xx_m_getcapab(void *arg, mac_capab_t cap, void *cap_data)
{
        switch (cap) {
        case MAC_CAPAB_HCKSUM: {
                uint32_t *txflags = cap_data;
                *txflags = HCKSUM_INET_FULL_V4 | HCKSUM_IPHDRCKSUM;
                break;
        }
        case MAC_CAPAB_LSO: {
                /* ... */
                break;
        }
        default:
                return (B_FALSE);
        }
        return (B_TRUE);
}

The following sections describe the supported capabilities and the corresponding capability data to return.

Hardware Checksum Offload

To get data about support for hardware checksum offload, the framework sends MAC_CAPAB_HCKSUM in the cap argument. See Hardware Checksum Offload Capability Information.

To query checksum offload metadata and retrieve the per-packet hardware checksumming metadata when hardware checksumming is enabled, use mac_hcksum_get(9F). See The mac_hcksum_get() Function Flags.

To set checksum offload metadata, use mac_hcksum_set(9F). See The mac_hcksum_set() Function Flags.

See Hardware Checksumming: Hardware and Hardware Checksumming: MAC Layer for more information.

Hardware Checksum Offload Capability Information

To pass information about the MAC_CAPAB_HCKSUM capability to the framework, the driver must set a combination of the following flags in cap_data, which points to a uint32_t. These flags indicate the level of hardware checksum offload that the driver is capable of performing for outbound packets.

HCKSUM_INET_PARTIAL: Partial 1's complement checksum ability
HCKSUM_INET_FULL_V4: Full 1's complement checksum ability for IPv4 packets
HCKSUM_INET_FULL_V6: Full 1's complement checksum ability for IPv6 packets
HCKSUM_IPHDRCKSUM: IPv4 Header checksum offload capability

The `mac_hcksum_get()` Function Flags

The flags argument of mac_hcksum_get() is a combination of the following values:

HCK_FULLCKSUM: Compute the full checksum for this packet.
HCK_FULLCKSUM_OK: The full checksum was verified in hardware and is correct.
HCK_PARTIALCKSUM: Compute the partial 1's complement checksum based on other parameters passed to mac_hcksum_get(). HCK_PARTIALCKSUM is mutually exclusive with HCK_FULLCKSUM.
HCK_IPV4_HDRCKSUM: Compute the IP header checksum.
HCK_IPV4_HDRCKSUM_OK: The IP header checksum was verified in hardware and is correct.

The `mac_hcksum_set()` Function Flags

The flags argument of mac_hcksum_set() is a combination of the following values:

HCK_FULLCKSUM: The full checksum was computed and passed through the value argument.
HCK_FULLCKSUM_OK: The full checksum was verified in hardware and is correct.
HCK_PARTIALCKSUM: The partial checksum was computed and passed through the value argument. HCK_PARTIALCKSUM is mutually exclusive with HCK_FULLCKSUM.
HCK_IPV4_HDRCKSUM: The IP header checksum was computed and passed through the value argument.
HCK_IPV4_HDRCKSUM_OK: The IP header checksum was verified in hardware and is correct.

Large Segment (or Send) Offload

To query support for large segment (or send) offload, the framework sends MAC_CAPAB_LSO in the cap argument and expects the information back in cap_data, which points to a mac_capab_lso(9S) structure. The framework allocates the mac_capab_lso structure and passes a pointer to this structure in cap_data. The mac_capab_lso structure consists of an lso_basic_tcp_ipv4(9S) structure and an lso_flags member. If the driver instance supports LSO for TCP on IPv4, set the LSO_TX_BASIC_TCP_IPV4 flag in lso_flags and set the lso_max member of the lso_basic_tcp_ipv4 structure to the maximum payload size supported by the driver instance.

Use mac_lso_get(9F) to obtain per-packet LSO metadata. If LSO is enabled for this packet, the HW_LSO flag is set in the mac_lso_get() flags argument. The maximum segment size (MSS) to be used during segmentation of the large segment is returned through the location pointed to by the mss argument. See Large Segment Offload for more information.

GLDv3 Data Paths

Data-path entry points are comprised of the following components:

Callbacks exported by the driver and invoked by the GLDv3 framework for sending packets.
GLDv3 framework entry points called by the driver for transmit flow control and for receiving packets.

Transmit Data Path

The GLDv3 framework uses the transmit entry point, mc_tx(9E), to pass a chain of message blocks to the driver. Provide a pointer to the mc_tx() entry point in your mac_callbacks structure. See GLDv3 MAC Registration Data Structures for more information about the mac_callbacks structure.

Example 19–6 The `mc_tx()` Entry Point

mblk_t *
xx_m_tx(void *arg, mblk_t *mp)
{
        xx_t    *xxp = arg;
        mblk_t   *nmp;

        mutex_enter(&xxp->xx_xmtlock);

        if (xxp->xx_flags & XX_SUSPENDED) {
                while ((nmp = mp) != NULL) {
                        xxp->xx_carrier_errors++;
                        mp = mp->b_next;
                        freemsg(nmp);
                }
                mutex_exit(&xxp->xx_xmtlock);
                return (NULL);
        }

        while (mp != NULL) {
                nmp = mp->b_next;
                mp->b_next = NULL;

                if (!xx_send(xxp, mp)) {
                        mp->b_next = nmp;
                        break;
                }
                mp = nmp;
        }
        mutex_exit(&xxp->xx_xmtlock);

        return (mp);
}

The following sections discuss topics related to transmitting data to the hardware.

Flow Control

If the driver cannot send the packets because of insufficient hardware resources, the driver returns the sub-chain of packets that could not be sent. When more descriptors become available at a later time, the driver must invoke mac_tx_update(9F) to notify the framework.

Hardware Checksumming: Hardware

If the driver specified hardware checksum support (see Hardware Checksum Offload), then the driver must do the following tasks:

Use mac_hcksum_get(9F) to check every packet for hardware checksum metadata.
Program the hardware to perform the required checksum calculation.

Large Segment Offload

If the driver specified LSO capabilities (see Large Segment (or Send) Offload), then the driver must use mac_lso_get(9F) to query whether LSO must be performed on the packet.

Virtual LAN: Hardware

When the administrator configures VLANs, the MAC layer inserts the needed VLAN headers on the outbound packets before they are passed to the driver through the mc_tx() entry point.

Receive Data Path

Call the mac_rx(9F) function in your driver's interrupt handler to pass a chain of one or more packets up the stack to the MAC layer. Avoid holding mutex or other locks during the call to mac_rx(). In particular, do not hold locks that could be taken by a transmit thread during a call to mac_rx(). See mc_unicst(9E) for information about the packets that must be sent up to the MAC layer.

The following sections discuss topics related to sending data to the MAC layer.

Hardware Checksumming: MAC Layer

If the driver specified hardware checksum support (see Hardware Checksum Offload), then the driver must use the mac_hcksum_set(9F) function to associate hardware checksumming metadata with the packet.

Virtual LAN: MAC Layer

VLAN packets must be passed with their tags to the MAC layer. Do not strip the VLAN headers from the packets.

GLDv3 State Change Notifications

A driver can call the following functions to notify the network stack that the driver's state has changed.

void mac_tx_update(mac_handle_t mh);

The mac_tx_update(9F) function notifies the framework that more TX descriptors are available. If mc_tx() returns a non-empty chain of packets, then the driver must call mac_tx_update() as soon as possible after resources are available to inform the MAC layer to retry the packets that were returned as not sent. See Transmit Data Path for more information about the mc_tx() entry point.

void mac_link_update(mac_handle_t mh, link_state_t new_state);

The mac_link_update(9F) function notifies the MAC layer that the state of the media link has changed. The new_state argument must be one of the following values:

LINK_STATE_UP: The media link is up.
LINK_STATE_DOWN: The media link is down.
LINK_STATE_UNKNOWN: The media link is unknown.

GLDv3 Network Statistics

Device drivers maintain a set of statistics for the device instances they manage. The MAC layer queries these statistics through the mc_getstat(9E) entry point of the driver.

int mc_getstat(void *driver_handle, uint_t stat, uint64_t *stat_value);

The GLDv3 framework uses stat to specify the statistic being queried. The driver uses stat_value to return the value of the statistic specified by stat. If the value of the statistic is returned, mc_getstat() must return 0. If the stat statistic is not supported by the driver, mc_getstat() must return ENOTSUP.

The GLDv3 statistics that are supported are the union of generic MAC statistics and Ethernet-specific statistics. See the mc_getstat(9E) man page for a complete list of supported statistics.

Example 19–7 The `mc_getstat()` Entry Point

int
xx_m_getstat(void *arg, uint_t stat, uint64_t *val)
{
        xx_t    *xxp = arg;

        mutex_enter(&xxp->xx_xmtlock);
        if ((xxp->xx_flags & (XX_RUNNING|XX_SUSPENDED)) == XX_RUNNING)
                xx_reclaim(xxp);
        mutex_exit(&xxp->xx_xmtlock);

        switch (stat) {
        case MAC_STAT_MULTIRCV:
                *val = xxp->xx_multircv;
                break;
        /* ... */
        case ETHER_STAT_MACRCV_ERRORS:
                *val = xxp->xx_macrcv_errors;
                break;
        /* ... */
        default:
                return (ENOTSUP);
        }
        return (0);
}

GLDv3 Properties

Use the mc_propinfo(9E) entry point to return immutable attributes of a property. This information includes permissions, default values, and allowed value ranges. Use mc_setprop(9E) to set the value of a property for this particular driver instance. Use mc_getprop(9E) to return the current value of a property.

See the mc_propinfo(9E) man page for a complete list of properties and their types.

The mc_propinfo() entry point should invoke the mac_prop_info_set_perm(), mac_prop_info_set_default(), and mac_prop_info_set_range() functions to associate specific attributes of the property being queried, such as default values, permissions, or allowed value ranges.

The mac_prop_info_set_default_uint8(9F), mac_prop_info_set_default_str(9F), and mac_prop_info_set_default_link_flowctrl(9F) functions associate a default value with a specific property. The mac_prop_info_set_range_uint32(9F) function associates an allowed range of values for a specific property.

The mac_prop_info_set_perm(9F) function specifies the permission of the property. The permission can be one of the following values:

MAC_PROP_PERM_READ: The property is read-only
MAC_PROP_PERM_WRITE: The property is write-only
MAC_PROP_PERM_RW: The property can be read and written

If the mc_propinfo() entry point does not call mac_prop_info_set_perm() for a particular property, the GLDv3 framework assumes that the property has read and write permissions, corresponding to MAC_PROP_PERM_RW.

In addition to the properties listed in the mc_propinfo(9E) man page, drivers can also expose driver-private properties. Use the m_priv_props field of the mac_register structure to specify driver-private properties supported by the driver. The framework passes the MAC_PROP_PRIVATE property ID in mc_setprop(), mc_getprop(), or mc_propinfo(). See the mc_propinfo(9E) man page for more information.

Summary of GLDv3 Interfaces

The following table lists entry points, other DDI functions, and data structures that are part of the GLDv3 network device driver framework.

Table 19–1 GLDv3 Interfaces


Interface Name	Description
Required Entry Points
mc_getstat(9E)	Retrieve network statistics from the driver. See GLDv3 Network Statistics.
mc_start(9E)	Start a driver instance. The GLDv3 framework invokes the start entry point before any operation is attempted.
mc_stop(9E)	Stop a driver instance. The MAC layer invokes the stop entry point before the device is detached.
mc_setpromisc(9E)	Change the promiscuous mode of the device driver instance.
mc_multicst(9E)	Add or remove a multicast address.
mc_unicst(9E)	Set the primary unicast address. The device must start passing back through `mac_rx()` the packets with a destination MAC address that matches the new unicast address. See Receive Data Path for information about `mac_rx()`.
mc_tx(9E)	Send one or more packets. See Transmit Data Path.
Optional Entry Points
mc_ioctl(9E)	Optional ioctl driver interface. This facility is intended to be used only for debugging purposes.
mc_getcapab(9E)	Retrieve capabilities. See GLDv3 Capabilities.
mc_setprop(9E)	Set a property value. See GLDv3 Properties.
mc_getprop(9E)	Get a property value. See GLDv3 Properties.
mc_propinfo(9E)	Get information about a property. See GLDv3 Properties.
Data Structures
mac_register(9S)	Registration information. See GLDv3 MAC Registration Data Structures.
mac_callbacks(9S)	Driver callbacks. See GLDv3 MAC Registration Data Structures.
mac_capab_lso(9S)	LSO metadata. See Large Segment (or Send) Offload.
lso_basic_tcp_ipv4(9S)	LSO metadata for TCP/IPv4. See Large Segment (or Send) Offload.
MAC Registration Functions
mac_alloc(9F)	Allocate a new `mac_register` structure. See GLDv3 MAC Registration.
mac_free(9F)	Free a `mac_register` structure.
mac_register(9F)	Register with the MAC layer.
mac_unregister(9F)	Unregister from the MAC layer.
mac_init_ops(9F)	Initialize the driver's dev_ops(9S) structure.
mac_fini_ops(9F)	Release the driver's `dev_ops` structure.
Data Transfer Functions
mac_rx(9F)	Pass up received packets. See Receive Data Path.
mac_tx_update(9F)	TX resources are available. See GLDv3 State Change Notifications.
mac_link_update(9F)	Link state has changed.
mac_hcksum_get(9F)	Retrieve hardware checksum information. See Hardware Checksum Offload and Transmit Data Path.
mac_hcksum_set(9F)	Attach hardware checksum information. See Hardware Checksum Offload and Receive Data Path.
mac_lso_get(9F)	Retrieve LSO information. See Large Segment (or Send) Offload.
Properties Functions
mac_prop_info_set_perm(9F)	Set the permission of a property. See GLDv3 Properties.
mac_prop_info_set_default_uint8(9F), mac_prop_info_set_default_str(9F), mac_prop_info_set_default_link_flowctrl(9F)	Set a property value.
mac_prop_info_set_range_uint32(9F)	Set a property values range.

GLDv2 Network Device Driver Framework

GLDv2 is a multi-threaded, clonable, loadable kernel module that provides support to device drivers for local area networks. Local area network (LAN) device drivers in the Solaris OS are STREAMS-based drivers that use the Data Link Provider Interface (DLPI) to communicate with network protocol stacks. These protocol stacks use the network drivers to send and receive packets on a LAN. The GLDv2 implements much of the STREAMS and DLPI functionality for a Solaris LAN driver. The GLDv2 provides common code that many network drivers can share. Using the GLDv2 reduces duplicate code and simplifies your network driver.

For more information about GLDv2, see the gld(7D) man page.

STREAMS drivers are documented in Part II, Kernel Interface, in STREAMS Programming Guide. Specifically, see Chapter 9, “STREAMS Drivers,” in the STREAMS guide. The STREAMS framework is a message-based framework. Interfaces that are unique to STREAMS drivers include STREAMS message queue processing entry points.

The DLPI specifies an interface to the Data Link Services (DLS) of the Data Link Layer of the OSI Reference Model. The DLPI enables a DLS user to access and use any of a variety of conforming DLS providers without special knowledge of the provider's protocol. The DLPI specifies access to the DLS provider in the form of M_PROTO and M_PCPROTO type STREAMS messages. A DLPI module uses STREAMS ioctl calls to link to the MAC sub-layer. For more information about the DLPI protocol, including Solaris-specific DPLI extensions, see the dlpi(7P) man page. For general information about DLPI, see the DLPI standard at http://www.opengroup.org/pubs/catalog/c811.htm.

A Solaris network driver that is implemented using GLDv2 has two distinct parts:

Generic component. Handles STREAMS and DLPI interfaces.
Device-specific component. Handles the particular hardware device.
- Indicates the driver's dependence on the GLDv2 module by linking with a dependency on misc/gld. The GLDv2 module can be found at /kernel/misc/sparcv9/gld on SPARC systems, /kernel/misc/amd64/gld on 64–bit x86 systems, and /kernel/misc/gld on 32–bit x86 systems.
- Registers with GLDv2: In its attach(9E) entry point, the driver provides GLDv2 with pointers to its other entry points. GLDv2 uses these pointers to call into the gld(9E) entry point.
- Calls gld(9F) functions to process data or to use some other GLDv2 service. The gld_mac_info(9S) structure is the primary data interface between GLDv2 and the device-specific driver.

GLDv2 drivers must process fully formed MAC-layer packets and must not perform logical link control (LLC) handling.

This section discusses the following topics:

GLDv2 Device Support

The GLDv2 framework supports the following types of devices:

DL_ETHER: ISO 8802-3, IEEE 802.3 protocol
DL_TPR: IEEE 802.5, Token Passing Ring
DL_FDDI: ISO 9314-2, Fibre Distributed Data Interface

Ethernet V2 and ISO 8802-3 (IEEE 802.3)

For devices that are declared to be type DL_ETHER, GLDv2 provides support for both Ethernet V2 and ISO 8802-3 (IEEE 802.3) packet processing. Ethernet V2 enables a user to access a conforming provider of data link services without special knowledge of the provider's protocol. A service access point (SAP) is the point through which the user communicates with the service provider.

Streams bound to SAP values in the range [0-255] are treated as equivalent and denote that the user wants to use 8802-3 mode. If the SAP value of the DL_BIND_REQ is within this range, GLDv2 computes the length of each subsequent DL_UNITDATA_REQ message on that stream. The length does not include the 14-byte media access control (MAC) header. GLDv2 then transmits 8802-3 frames that have those lengths in the MAC frame header type fields. Such lengths do not exceed 1500.

Frames that have a type field in the range [0-1500] are assumed to be 8802-3 frames. These frames are routed up all open streams in 8802-3 mode. Those streams with SAP values in the [0-255] range are considered to be in 8802-3 mode. If more than one stream is in 8802-3 mode, the incoming frame is duplicated and routed up these streams.

Those streams that are bound to SAP values that are greater than 1500 are assumed to be in Ethernet V2 mode. These streams receive incoming packets whose Ethernet MAC header type value exactly matches the value of the SAP to which the stream is bound.

TPR and FDDI: SNAP Processing

For media types DL_TPR and DL_FDDI, GLDv2 implements minimal SNAP (Sub-Net Access Protocol) processing. This processing is for any stream that is bound to a SAP value that is greater than 255. SAP values in the range [0-255] are LLC SAP values. Such values are carried naturally by the media packet format. SAP values that are greater than 255 require a SNAP header, subordinate to the LLC header, to carry the 16-bit Ethernet V2-style SAP value.

SNAP headers are carried under LLC headers with destination SAP 0xAA. Outbound packets with SAP values that are greater than 255 require an LLC+SNAP header take the following form:

AA AA 03 00 00 00 XX XX

XX XX represents the 16-bit SAP, corresponding to the Ethernet V2 style type. This header is unique in supporting non-zero organizational unique identifier fields. LLC control fields other than 03 are considered to be LLC packets with SAP 0xAA. Clients that want to use SNAP formats other than this format must use LLC and bind to SAP 0xAA.

Incoming packets are checked for conformance with the above format. Packets that conform are matched to any streams that have been bound to the packet's 16-bit SNAP type. In addition, these packets are considered to match the LLC SNAP SAP 0xAA.

Packets received for any LLC SAP are passed up all streams that are bound to an LLC SAP, as described for media type DL_ETHER.

TPR: Source Routing

For type DL_TPR devices, GLDv2 implements minimal support for source routing.

Source routing support includes the following tasks:

Specify routing information for a packet to be sent across a bridged medium. The routing information is stored in the MAC header. This information is used to determine the route.
Learn routes.
Solicit and respond to requests for information about possible multiple routes.
Select among available routes.

Source routing adds routing information fields to the MAC headers of outgoing packets. In addition, this support recognizes such fields in incoming packets.

GLDv2 source routing support does not implement the full route determination entity (RDE) specified in Section 9 of ISO 8802-2 (IEEE 802.2). However, this support can interoperate with any RDE implementations that might exist in the same or a bridged network.

GLDv2 DLPI Providers

GLDv2 implements both Style 1 and Style 2 DLPI providers. A physical point of attachment (PPA) is the point at which a system attaches itself to a physical communication medium. All communication on that physical medium funnels through the PPA. The Style 1 provider attaches the streams to a particular PPA based on the major or minor device that has been opened. The Style 2 provider requires the DLS user to explicitly identify the desired PPA using DL_ATTACH_REQ. In this case, open(9E) creates a stream between the user and GLDv2, and DL_ATTACH_REQ subsequently associates a particular PPA with that stream. Style 2 is denoted by a minor number of zero. If a device node whose minor number is not zero is opened, Style 1 is indicated and the associated PPA is the minor number minus 1. In both Style 1 and Style 2 opens, the device is cloned.

GLDv2 DLPI Primitives

GLDv2 implements several DLPI primitives. The DL_INFO_REQ primitive requests information about the DLPI streams. The message consists of one M_PROTO message block. GLDv2 returns device-dependent values in the DL_INFO_ACK response to this request. These values are based on information that the GLDv2-based driver specified in the gld_mac_info(9S) structure that was passed to the gld_register(9F) function.

GLDv2 returns the following values on behalf of all GLDv2-based drivers:

Version is DL_VERSION_2.
Service mode is DL_CLDLS. GLDv2 implements connectionless-mode service.
Provider style is DL_STYLE1 or DL_STYLE2, depending on how the stream was opened.
No optional Quality of Service (QOS) support is present. The QOS fields are zero.

Note –

Contrary to the DLPI specification, GLDv2 returns the correct address length and broadcast address of the device in DL_INFO_ACK even before the stream has been attached to a PPA.

The DL_ATTACH_REQ primitive is used to associate a PPA with a stream. This request is needed for Style 2 DLS providers to identify the physical medium over which the communication is sent. Upon completion, the state changes from DL_UNATTACHED to DL_UNBOUND. The message consists of one M_PROTO message block. This request is not allowed when Style 1 mode is used. Streams that are opened using Style 1 are already attached to a PPA by the time the open completes.

The DL_DETACH_REQ primitive requests to detach the PPA from the stream. This detachment is allowed only if the stream was opened using Style 2.

The DL_BIND_REQ and DL_UNBIND_REQ primitives bind and unbind a DLSAP (data link service access point) to the stream. The PPA that is associated with a stream completes initialization before the completion of the processing of the DL_BIND_REQ on that stream. You can bind multiple streams to the same SAP. Each stream in this case receives a copy of any packets that were received for that SAP.

The DL_ENABMULTI_REQ and DL_DISABMULTI_REQ primitives enable and disable reception of individual multicast group addresses. Through iterative use of these primitives, an application or other DLS user can create or modify a set of multicast addresses. The streams must be attached to a PPA for these primitives to be accepted.

The DL_PROMISCON_REQ and DL_PROMISCOFF_REQ primitives turn promiscuous mode on or off on a per-stream basis. These controls operate at either at a physical level or at the SAP level. The DL Provider routes all received messages on the media to the DLS user. Routing continues until a DL_DETACH_REQ is received, a DL_PROMISCOFF_REQ is received, or the stream is closed. You can specify physical level promiscuous reception of all packets on the medium or of multicast packets only.

Note –

The streams must be attached to a PPA for these promiscuous mode primitives to be accepted.

The DL_UNITDATA_REQ primitive is used to send data in a connectionless transfer. Because this service is not acknowledged, delivery is not guaranteed. The message consists of one M_PROTO message block followed by one or more M_DATA blocks containing at least one byte of data.

The DL_UNITDATA_IND type is used when a packet is to be passed on upstream. The packet is put into an M_PROTO message with the primitive set to DL_UNITDATA_IND.

The DL_PHYS_ADDR_REQ primitive requests the MAC address currently associated with the PPA attached to the streams. The address is returned by the DL_PHYS_ADDR_ACK primitive. When using Style 2, this primitive is only valid following a successful DL_ATTACH_REQ.

The DL_SET_PHYS_ADDR_REQ primitive changes the MAC address currently associated with the PPA attached to the streams. This primitive affects all other current and future streams attached to this device. Once changed, all streams currently or subsequently opened and attached to this device obtain this new physical address. The new physical address remains in effect until this primitive changes the physical address again or the driver is reloaded.

Note –

The superuser is allowed to change the physical address of a PPA while other streams are bound to the same PPA.

The DL_GET_STATISTICS_REQ primitive requests a DL_GET_STATISTICS_ACK response containing statistics information associated with the PPA attached to the stream. Style 2 Streams must be attached to a particular PPA using DL_ATTACH_REQ before this primitive can succeed.

GLDv2 I/O Control Functions

GLDv2 implements the ioctl ioc_cmd function described below. If GLDv2 receives an unrecognizable ioctl command, GLDv2 passes the command to the device-specific driver's gldm_ioctl() routine, as described in gld(9E).

The DLIOCRAW ioctl function is used by some DLPI applications, most notably the snoop(1M) command. The DLIOCRAW command puts the stream into a raw mode. In raw mode, the driver passes full MAC-level incoming packets upstream in M_DATA messages instead of transforming the packets into the DL_UNITDATA_IND form. The DL_UNITDATA_IND form is normally used for reporting incoming packets. Packet SAP filtering is still performed on streams that are in raw mode. If a stream user wants to receive all incoming packets, the user must also select the appropriate promiscuous modes. After successfully selecting raw mode, the application is also allowed to send fully formatted packets to the driver as M_DATA messages for transmission. DLIOCRAW takes no arguments. Once enabled, the stream remains in this mode until closed.

GLDv2 Driver Requirements

GLDv2-based drivers must include the header file <sys/gld.h>.

GLDv2-based drivers must be linked with the -N“misc/gld” option:

%ld -r -N"misc/gld" xx.o -o xx

GLDv2 implements the following functions on behalf of the device-specific driver:

open(9E)
close(9E)
put(9E), required for STREAMS
srv(9E), required for STREAMS
getinfo(9E)

The mi_idname element of the module_info(9S) structure is a string that specifies the name of the driver. This string must exactly match the name of the driver module as defined in the file system.

The read-side qinit(9S) structure should specify the following elements:

qi_putp: NULL
qi_srvp: gld_rsrv
qi_qopen: gld_open
qi_qclose: gld_close

The write-side qinit(9S) structure should specify these elements:

qi_putp: gld_wput
qi_srvp: gld_wsrv
qi_qopen: NULL
qi_qclose: NULL

The devo_getinfo element of the dev_ops(9S) structure should specify gld_getinfo as the getinfo(9E) routine.

The driver's attach(9E) function associates the hardware-specific device driver with the GLDv2 facility. attach() then prepares the device and driver for use.

The attach(9E) function allocates a gld_mac_info(9S) structure using gld_mac_alloc(). The driver usually needs to save more information per device than is defined in the macinfo structure. The driver should allocate the additional required data structure and save a pointer to the structure in the gldm_private member of the gld_mac_info(9S) structure.

The attach(9E) routine must initialize the macinfo structure as described in the gld_mac_info(9S) man page. The attach() routine should then call gld_register() to link the driver with the GLDv2 module. The driver should map registers if necessary and be fully initialized and prepared to accept interrupts before calling gld_register(). The attach(9E) function should add interrupts but should not enable the device to generate these interrupts. The driver should reset the hardware before calling gld_register() to ensure the hardware is quiescent. A device must not be put into a state where the device might generate an interrupt before gld_register() is called. The device is started later when GLDv2 calls the driver's gldm_start() entry point, which is described in the gld(9E) man page. After gld_register() succeeds, the gld(9E) entry points might be called by GLDv2 at any time.

The attach(9E) routine should return DDI_SUCCESS if gld_register() succeeds. If gld_register() fails, DDI_FAILURE is returned. If a failure occurs, the attach(9E) routine should deallocate any resources that were allocated before gld_register() was called. The attach routine should then also return DDI_FAILURE. A failed macinfo structure should never be reused. Such a structure should be deallocated using gld_mac_free().

The detach(9E)function should attempt to unregister the driver from GLDv2 by calling gld_unregister(). For more information about gld_unregister(), see the gld(9F) man page. The detach(9E) routine can get a pointer to the needed gld_mac_info(9S) structure from the device's private data using ddi_get_driver_private(9F). gld_unregister() checks certain conditions that could require that the driver not be detached. If the checks fail, gld_unregister() returns DDI_FAILURE, in which case the driver's detach(9E) routine must leave the device operational and return DDI_FAILURE.

If the checks succeed, gld_unregister() ensures that the device interrupts are stopped. The driver's gldm_stop() routine is called if necessary. The driver is unlinked from the GLDv2 framework. gld_unregister() then returns DDI_SUCCESS. In this case, the detach(9E) routine should remove interrupts and use gld_mac_free() to deallocate any macinfo data structures that were allocated in the attach(9E) routine. The detach() routine should then return DDI_SUCCESS. The routine must remove the interrupt before calling gld_mac_free().

GLDv2 Network Statistics

Solaris network drivers must implement statistics variables. GLDv2 tallies some network statistics, but other statistics must be counted by each GLDv2-based driver. GLDv2 provides support for GLDv2-based drivers to report a standard set of network driver statistics. Statistics are reported by GLDv2 using the kstat(7D) and kstat(9S) mechanisms. The DL_GET_STATISTICS_REQ DLPI command can also be used to retrieve the current statistics counters. All statistics are maintained as unsigned. The statistics are 32 bits unless otherwise noted.

GLDv2 maintains and reports the following statistics.

rbytes64: Total bytes successfully received on the interface. Stores 64-bit statistics.
rbytes: Total bytes successfully received on the interface
obytes64: Total bytes that have requested transmission on the interface. Stores 64-bit statistics.
obytes: Total bytes that have requested transmission on the interface.
ipackets64: Total packets successfully received on the interface. Stores 64-bit statistics.
ipackets: Total packets successfully received on the interface.
opackets64: Total packets that have requested transmission on the interface. Stores 64-bit statistics.
opackets: Total packets that have requested transmission on the interface.
multircv: Multicast packets successfully received, including group and functional addresses (long).
multixmt: Multicast packets requested to be transmitted, including group and functional addresses (long).
brdcstrcv: Broadcast packets successfully received (long).
brdcstxmt: Broadcast packets that have requested transmission (long).
unknowns: Valid received packets not accepted by any stream (long).
noxmtbuf: Packets discarded on output because transmit buffer was busy, or no buffer could be allocated for transmit (long).
blocked: Number of times a received packet could not be put up a stream because the queue was flow-controlled (long).
xmtretry: Times transmit was retried after having been delayed due to lack of resources (long).
promisc: Current “promiscuous” state of the interface (string).

The device-dependent driver tracks the following statistics in a private per-instance structure. To report statistics, GLDv2 calls the driver's gldm_get_stats() entry point. gldm_get_stats() then updates device-specific statistics in the gld_stats(9S) structure. See the gldm_get_stats(9E) man page for more information. GLDv2 then reports the updated statistics using the named statistics variables that are shown below.

ifspeed: Current estimated bandwidth of the interface in bits per second. Stores 64-bit statistics.
media: Current media type in use by the device (string).
intr: Number of times that the interrupt handler was called, causing an interrupt (long).
norcvbuf: Number of times a valid incoming packet was known to have been discarded because no buffer could be allocated for receive (long).
ierrors: Total number of packets that were received but could not be processed due to errors (long).
oerrors: Total packets that were not successfully transmitted because of errors (long).
missed: Packets known to have been dropped by the hardware on receive (long).
uflo: Times FIFO underflowed on transmit (long).
oflo: Times receiver overflowed during receive (long).

The following group of statistics applies to networks of type DL_ETHER. These statistics are maintained by device-specific drivers of that type, as shown previously.

align_errors: Packets that were received with framing errors, that is, the packets did not contain an integral number of octets (long).
fcs_errors: Packets received with CRC errors (long).
duplex: Current duplex mode of the interface (string).
carrier_errors: Number of times carrier was lost or never detected on a transmission attempt (long).
collisions: Ethernet collisions during transmit (long).
ex_collisions: Frames where excess collisions occurred on transmit, causing transmit failure (long).
tx_late_collisions: Number of times a transmit collision occurred late, that is, after 512 bit times (long).
defer_xmts: Packets without collisions where first transmit attempt was delayed because the medium was busy (long).
first_collisions: Packets successfully transmitted with exactly one collision.
multi_collisions: Packets successfully transmitted with multiple collisions.
sqe_errors: Number of times that SQE test error was reported.
macxmt_errors: Packets encountering transmit MAC failures, except carrier and collision failures.
macrcv_errors: Packets received with MAC errors, except align_errors, fcs_errors, and toolong_errors.
toolong_errors: Packets received larger than the maximum allowed length.
runt_errors: Packets received smaller than the minimum allowed length (long).

The following group of statistics applies to networks of type DL_TPR. These statistics are maintained by device-specific drivers of that type, as shown above.

line_errors: Packets received with non-data bits or FCS errors.
burst_errors: Number of times an absence of transitions for five half-bit timers was detected.
signal_losses: Number of times loss of signal condition on the ring was detected.
ace_errors: Number of times that an AMP or SMP frame, in which A is equal to C is equal to 0, is followed by another SMP frame without an intervening AMP frame.
internal_errors: Number of times the station recognized an internal error.
lost_frame_errors: Number of times the TRR timer expired during transmit.
frame_copied_errors: Number of times a frame addressed to this station was received with the FS field `A' bit set to 1.
token_errors: Number of times the station acting as the active monitor recognized an error condition that needed a token transmitted.
freq_errors: Number of times the frequency of the incoming signal differed from the expected frequency.

The following group of statistics applies to networks of type DL_FDDI. These statistics are maintained by device-specific drivers of that type, as shown above.

mac_errors: Frames detected in error by this MAC that had not been detected in error by another MAC.
mac_lost_errors: Frames received with format errors such that the frame was stripped.
mac_tokens: Number of tokens that were received, that is, the total of non-restricted and restricted tokens.
mac_tvx_expired: Number of times that TVX has expired.
mac_late: Number of TRT expirations since either this MAC was reset or a token was received.
mac_ring_ops: Number of times the ring has entered the “Ring Operational” state from the “Ring Not Operational” state.

GLDv2 Declarations and Data Structures

This section describes the gld_mac_info(9S) and gld_stats structures.

`gld_mac_info` Structure

The GLDv2 MAC information (gld_mac_info) structure is the main data interface that links the device-specific driver with GLDv2. This structure contains data required by GLDv2 and a pointer to an optional additional driver-specific information structure.

Allocate the gld_mac_info structure using gld_mac_alloc(). Deallocate the structure using gld_mac_free(). Drivers must not make any assumptions about the length of this structure, which might vary in different releases of the Solaris OS, GLDv2, or both. Structure members private to GLDv2, not documented here, should neither be set nor be read by the device-specific driver.

The gld_mac_info(9S) structure contains the following fields.

caddr_t              gldm_private;              /* Driver private data */
int                  (*gldm_reset)();           /* Reset device */
int                  (*gldm_start)();           /* Start device */
int                  (*gldm_stop)();            /* Stop device */
int                  (*gldm_set_mac_addr)();    /* Set device phys addr */
int                  (*gldm_set_multicast)();   /* Set/delete multicast addr */
int                  (*gldm_set_promiscuous)(); /* Set/reset promiscuous mode */
int                  (*gldm_send)();            /* Transmit routine */
uint_t               (*gldm_intr)();            /* Interrupt handler */
int                  (*gldm_get_stats)();       /* Get device statistics */
int                  (*gldm_ioctl)();           /* Driver-specific ioctls */
char                 *gldm_ident;               /* Driver identity string */
uint32_t             gldm_type;                 /* Device type */
uint32_t             gldm_minpkt;               /* Minimum packet size */
                                                /* accepted by driver */
uint32_t             gldm_maxpkt;               /* Maximum packet size */
                                                /* accepted by driver */
uint32_t             gldm_addrlen;              /* Physical address length */
int32_t              gldm_saplen;               /* SAP length for DL_INFO_ACK */
unsigned char        *gldm_broadcast_addr;      /* Physical broadcast addr */
unsigned char        *gldm_vendor_addr;         /* Factory MAC address */
t_uscalar_t          gldm_ppa;                  /* Physical Point of */
                                                /* Attachment (PPA) number */
dev_info_t           *gldm_devinfo;             /* Pointer to device's */
                                                /* dev_info node */
ddi_iblock_cookie_t  gldm_cookie;               /* Device's interrupt */
                                                /* block cookie */

The gldm_private structure member is visible to the device driver. gldm_private is also private to the device-specific driver. gldm_private is not used or modified by GLDv2. Conventionally, gldm_private is used as a pointer to private data, pointing to a per-instance data structure that is both defined and allocated by the driver.

The following group of structure members must be set by the driver before calling gld_register(), and should not thereafter be modified by the driver. Because gld_register() might use or cache the values of structure members, changes made by the driver after calling gld_register() might cause unpredictable results. For more information on these structures, see the gld(9E) man page.

gldm_reset