Writing Device Drivers

Chapter 15 Drivers for Character Devices

A character device, such as a tape drive or serial port, does not have physically addressable storage media. I/O on such a device is normally performed in a byte stream. This chapter describes the structure of a character device driver, focusing in particular on entry points for character drivers. In addition, this chapter describes the use of physio(9F) and aphysio(9F) in the context of synchronous and asynchronous I/O transfers.

This chapter provides information on the following subjects:

Overview of the Character Driver Structure

Figure 15–1 shows data structures and routines that define the structure of a character device driver. Device drivers typically include the following elements:

The shaded device access section in the following figure illustrates character driver entry points.

Figure 15–1 Character Driver Roadmap

Diagram shows structures and entry points for character device drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. These structures contain pointers to the driver entry points:


Note –

Some of these entry points can be replaced with nodev(9F) or nulldev(9F) as appropriate.


Character Device Autoconfiguration

The attach(9E) routine should perform the common initialization tasks that all devices require, such as:

See attach() Entry Point for code examples of these tasks.

Character device drivers create minor nodes of type S_IFCHR. A minor node of S_IFCHR causes a character special file that represents the node to eventually appear in the /devices hierarchy.

The following example shows a typical attach(9E) routine for character drivers. Properties that are associated with the device are commonly declared in an attach() routine. This example uses a predefined Size property. Size is the equivalent of the Nblocks property for getting the size of partition in a block device. If, for example, you are doing character I/O on a disk device, you might use Size to get the size of a partition. Since Size is a 64-bit property, you must use a 64-bit property interface. In this case, you use ddi_prop_update_int64(9F). See Device Properties for more information about properties.


Example 15–1 Character Driver attach() Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
  int instance = ddi_get_instance(dip);
  major_t maj_number;
  struct xxstate *xsp;    /* per-instance state, allocated below */
  switch (cmd) {
  case DDI_ATTACH:
      /* 
       * Allocate a state structure and initialize it.
       * Map the device's registers.
       * Add the device driver's interrupt handler(s).
       * Initialize any mutexes and condition variables.
       * Create power manageable components.
       *
       * Create the device's minor node. Note that the node_type
       * argument is set to DDI_NT_TAPE.
       */
       if (ddi_create_minor_node(dip, minor_name, S_IFCHR,
           instance, DDI_NT_TAPE, 0) == DDI_FAILURE) {
           /* Free resources allocated so far. */
           /* Remove any previously allocated minor nodes. */
           ddi_remove_minor_node(dip, NULL);
           return (DDI_FAILURE);
       }
      /*
       * Create driver properties like "Size." Use "Size" 
       * instead of "size" to ensure the property works 
       * for large bytecounts.
       */
       xsp->Size = size_of_device_in_bytes;
       maj_number = ddi_driver_major(dip);
       if (ddi_prop_update_int64(makedevice(maj_number, instance), 
           dip, "Size", xsp->Size) != DDI_PROP_SUCCESS) {
           cmn_err(CE_CONT, "%s: cannot create Size property\n",
               ddi_get_name(dip));
               /* Free resources allocated so far. */
           return (DDI_FAILURE);
       }
      /* ... */
      return (DDI_SUCCESS);    
  case DDI_RESUME:
      /* See the "Power Management" chapter in this book. */
  default:
      return (DDI_FAILURE);
  }
}

Device Access (Character Drivers)

Access to a device by one or more application programs is controlled through the open(9E) and close(9E) entry points. An open(2) system call to a special file representing a character device always causes a call to the open(9E) routine for the driver. For a particular minor device, open(9E) can be called many times. The close(9E) routine is called only when the final reference to a device is removed. If the device is accessed through file descriptors, the final call to close(9E) can occur as a result of a close(2) or exit(2) system call. If the device is accessed through memory mapping, the final call to close(9E) can occur as a result of a munmap(2) system call.

open() Entry Point (Character Drivers)

The primary function of open() is to verify that the open request is allowed. The syntax for open(9E) is as follows:

int xxopen(dev_t *devp, int flag, int otyp, cred_t *credp);

where:

devp

Pointer to a device number. The open() routine is passed a pointer so that the driver can change the minor number. With this pointer, drivers can dynamically create minor instances of the device. An example would be a pseudo terminal driver that creates a new pseudo-terminal whenever the driver is opened. A driver that dynamically chooses the minor number normally creates only one minor device node in attach(9E) with ddi_create_minor_node(9F), then changes the minor number component of *devp using makedevice(9F) and getmajor(9F):

    *devp = makedevice(getmajor(*devp), new_minor);

You do not have to call ddi_create_minor_node(9F) for the new minor. A driver must not change the major number of *devp. The driver must keep track of available minor numbers internally.
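The bookkeeping of available minor numbers is left to the driver. One simple approach is a fixed-size in-use table. The following is a minimal user-space sketch of that idea, not kernel code: the names xx_minor_alloc(), xx_minor_free(), and XX_MAX_MINORS are hypothetical, and a real driver would also need to protect the table with a mutex.

```c
#include <assert.h>
#include <stdbool.h>

#define XX_MAX_MINORS 8    /* hypothetical limit on dynamically chosen minors */

static bool xx_minor_used[XX_MAX_MINORS];    /* true = minor is in use */

/* Return the lowest free minor number, or -1 if all are in use. */
int
xx_minor_alloc(void)
{
    for (int m = 0; m < XX_MAX_MINORS; m++) {
        if (!xx_minor_used[m]) {
            xx_minor_used[m] = true;
            return (m);
        }
    }
    return (-1);
}

/* Release a minor number previously returned by xx_minor_alloc(). */
void
xx_minor_free(int m)
{
    assert(m >= 0 && m < XX_MAX_MINORS && xx_minor_used[m]);
    xx_minor_used[m] = false;
}
```

Under this sketch, open() would pass the value returned by xx_minor_alloc() as new_minor in the makedevice(9F) call shown above, and close() would hand the minor back with xx_minor_free().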

flag

Flag with bits to indicate whether the device is opened for reading (FREAD), writing (FWRITE), or both. User threads issuing the open(2) system call can also request exclusive access to the device (FEXCL) or specify that the open should not block for any reason (FNDELAY), but the driver must enforce both cases. A driver for a write-only device such as a printer might consider an open(9E) for reading invalid.

otyp

Integer that indicates how open() was called. The driver must check that the value of otyp is appropriate for the device. For character drivers, otyp should be OTYP_CHR (see the open(9E) man page).

credp

Pointer to a credential structure containing information about the caller, such as the user ID and group IDs. Drivers should not examine the structure directly, but should instead use drv_priv(9F) to check for the common case of root privileges. In Example 15–2, only root or a user with the PRIV_SYS_DEVICES privilege is allowed to open the device for writing.

The following example shows a character driver open(9E) routine.


Example 15–2 Character Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flag, int otyp, cred_t *credp)
{
    minor_t        instance;

    if (devp == NULL)           /* if device pointer is invalid */
        return (EINVAL);
    instance = getminor(*devp); /* one-to-one example mapping */
    /* Is the instance attached? */
    if (ddi_get_soft_state(statep, instance) == NULL)
        return (ENXIO);
    /* verify that otyp is appropriate */
    if (otyp != OTYP_CHR)
        return (EINVAL);
    if ((flag & FWRITE) && drv_priv(credp) == EPERM)
        return (EPERM);
    return (0);
}

close() Entry Point (Character Drivers)

The syntax for close(9E) is as follows:

int xxclose(dev_t dev, int flag, int otyp, cred_t *credp);

close() should perform any cleanup necessary to finish using the minor device, and prepare the device (and driver) to be opened again. For example, the open routine might have been invoked with the exclusive access (FEXCL) flag. A call to close(9E) would allow additional open routines to continue. Other functions that close(9E) might perform are:

A driver that waits for I/O to drain could wait forever if draining stalls due to external conditions such as flow control. See Threads Unable to Receive Signals for information about how to avoid this problem.

I/O Request Handling

This section discusses I/O request processing in detail.

User Addresses

When a user thread issues a write(2) system call, the thread passes the address of a buffer in user space:

    char buffer[] = "python";
    count = write(fd, buffer, strlen(buffer) + 1);

The system builds a uio(9S) structure to describe this transfer by allocating an iovec(9S) structure and setting the iov_base field to the address passed to write(2), in this case, buffer. The uio(9S) structure is passed to the driver write(9E) routine. See Vectored I/O for more information about the uio(9S) structure.

The address in the iovec(9S) is in user space, not kernel space. Thus, the address is neither guaranteed to be currently in memory nor to be a valid address. In either case, accessing a user address directly from the device driver or from the kernel could crash the system. Thus, device drivers should never access user addresses directly. Instead, a data transfer routine in the Solaris DDI/DKI should be used to transfer data into or out of the kernel. These routines can handle page faults. The DDI/DKI routines can bring in the proper user page to continue the copy transparently. Alternatively, the routines can return an error on an invalid access.

copyout(9F) can be used to copy data from kernel space to user space. copyin(9F) can copy data from user space to kernel space. ddi_copyout(9F) and ddi_copyin(9F) operate similarly but are to be used in the ioctl(9E) routine. copyin(9F) and copyout(9F) can be used on the buffer described by each iovec(9S) structure, or uiomove(9F) can perform the entire transfer to or from a contiguous area of driver or device memory.

Vectored I/O

In character drivers, transfers are described by a uio(9S) structure. The uio(9S) structure contains information about the direction and size of the transfer, plus an array of buffers for one end of the transfer. The other end is the device.

The uio(9S) structure contains the following members:

iovec_t       *uio_iov;       /* base address of the iovec */
                              /* buffer description array */
int           uio_iovcnt;     /* the number of iovec structures */
off_t         uio_offset;     /* 32-bit offset into file where */
                              /* data is transferred from or to */
offset_t      uio_loffset;    /* 64-bit offset into file where */
                              /* data is transferred from or to */
uio_seg_t     uio_segflg;     /* identifies the type of I/O transfer */
                              /* UIO_SYSSPACE:  kernel <-> kernel */
                              /* UIO_USERSPACE: kernel <-> user */
short         uio_fmode;      /* file mode flags (not driver settable) */
daddr_t       uio_limit;      /* 32-bit ulimit for file (maximum */
                              /* block offset). not driver settable. */
diskaddr_t    uio_llimit;     /* 64-bit ulimit for file (maximum */
                              /* block offset). not driver settable. */
int           uio_resid;      /* amount (in bytes) not */
                              /* transferred on completion */

A uio(9S) structure is passed to the driver read(9E) and write(9E) entry points. This structure is generalized to support what is called gather-write and scatter-read. When writing to a device, the data buffers to be written do not have to be contiguous in application memory. Similarly, data that is transferred from a device into memory comes off in a contiguous stream but can go into noncontiguous areas of application memory. See the readv(2), writev(2), pread(2), and pwrite(2) man pages for more information on scatter-gather I/O.

Each buffer is described by an iovec(9S) structure. This structure contains a pointer to the data area and the number of bytes to be transferred.

caddr_t    iov_base;    /* address of buffer */
int        iov_len;     /* amount to transfer */

The uio structure contains a pointer to an array of iovec(9S) structures. The base address of this array is held in uio_iov, and the number of elements is stored in uio_iovcnt.
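Seen from the application side, a gather-write is a single writev(2) call that names several buffers. The following user-space sketch (the function name gather_write() is hypothetical) writes two separate strings in one call. A request like this would typically reach a character driver's write(9E) routine as one uio(9S) structure whose uio_iov array has two elements.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Write two separate buffers with a single writev(2) call.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
gather_write(int fd, const char *first, const char *second)
{
    struct iovec iov[2];

    assert(first != NULL && second != NULL);
    iov[0].iov_base = (void *)first;    /* writev(2) only reads the data */
    iov[0].iov_len = strlen(first);
    iov[1].iov_base = (void *)second;
    iov[1].iov_len = strlen(second);
    return (writev(fd, iov, 2));
}
```

The two buffers need not be adjacent in application memory; the device still sees one contiguous stream.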

The uio_offset field contains the 32-bit offset into the device at which the application needs to begin the transfer. uio_loffset is used for 64-bit file offsets. If the device does not support the notion of an offset, these fields can be safely ignored. The driver should interpret either uio_offset or uio_loffset, but not both. If the driver has set the D_64BIT flag in the cb_ops(9S) structure, that driver should use uio_loffset.

The uio_resid field starts out as the number of bytes to be transferred, that is, the sum of all the iov_len fields in uio_iov. This field must be set by the driver to the number of bytes that were not transferred before returning. The read(2) and write(2) system calls use the return value from the read(9E) and write(9E) entry points to determine failed transfers. If a failure occurs, these routines return -1. If the return value indicates success, the system calls return the number of bytes requested minus uio_resid. If uio_resid is not changed by the driver, the read(2) and write(2) calls return 0. A return value of 0 indicates end-of-file, even though all the data has been transferred.
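The relationship between uio_resid and the system call return value can be modeled in a few lines. This is a user-space sketch of the rule described above, not the kernel's actual code; the function name xx_syscall_result() is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>

/*
 * Model of the value that read(2) or write(2) hands back to the caller.
 *   requested: byte count the application asked for
 *   resid:     uio_resid as left by the driver on return
 *   drv_error: return value of the driver's read(9E) or write(9E)
 */
ssize_t
xx_syscall_result(ssize_t requested, ssize_t resid, int drv_error)
{
    assert(resid >= 0 && resid <= requested);
    if (drv_error != 0)
        return (-1);                /* failure reported to the caller */
    return (requested - resid);     /* bytes actually transferred */
}
```

With requested equal to 100, a driver that leaves uio_resid at 40 yields 60, while a driver that never touches uio_resid yields 0, which the application interprets as end-of-file.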

The support routines uiomove(9F), physio(9F), and aphysio(9F) update the uio(9S) structure directly. These support routines update the device offset to account for the data transfer. Neither the uio_offset nor the uio_loffset field needs to be adjusted when the driver is used with a seekable device that uses the concept of position. I/O performed to a device in this manner is constrained by the maximum possible value of uio_offset or uio_loffset. An example of such a usage is raw I/O on a disk.

If the device has no concept of position, the driver can take the following steps:

  1. Save uio_offset or uio_loffset.

  2. Perform the I/O operation.

  3. Restore uio_offset or uio_loffset to the field's initial value.

I/O that is performed to a device in this manner is not constrained by the maximum possible value of uio_offset or uio_loffset. An example of this type of usage is I/O on a serial line.

The following example shows one way to preserve uio_loffset in the read(9E) function.

static int
xxread(dev_t dev, struct uio *uio_p, cred_t *cred_p)
{
    offset_t off;
    int error = 0;
    /* ... */
    off = uio_p->uio_loffset;  /* save the offset */
    /* do the transfer, setting error on failure */
    uio_p->uio_loffset = off;  /* restore it */
    return (error);
}

Differences Between Synchronous and Asynchronous I/O

Data transfers can be synchronous or asynchronous. The determining factor is whether the entry point that schedules the transfer returns immediately or waits until the I/O has been completed.

The read(9E) and write(9E) entry points are synchronous entry points. These entry points must not return until the I/O is complete. Upon return from the routines, the process knows whether the transfer has succeeded.

The aread(9E) and awrite(9E) entry points are asynchronous entry points. Asynchronous entry points schedule the I/O and return immediately. Upon return, the process that issues the request knows that the I/O is scheduled and that the status of the I/O must be determined later. In the meantime, the process can perform other operations.

With an asynchronous I/O request to the kernel, the process is not required to wait while the I/O is in process. A process can perform multiple I/O requests and allow the kernel to handle the data transfer details. Asynchronous I/O requests enable applications such as transaction processing to use concurrent programming methods to increase performance or response time. Any performance boost for applications that use asynchronous I/O, however, comes at the expense of greater programming complexity.

Data Transfer Methods

Data can be transferred using either programmed I/O or DMA. These data transfer methods can be used either by synchronous or by asynchronous entry points, depending on the capabilities of the device.

Programmed I/O Transfers

Programmed I/O devices rely on the CPU to perform the data transfer. Programmed I/O data transfers are identical to other read and write operations for device registers. Various data access routines are used to read or store values to device memory.

uiomove(9F) can be used to transfer data to some programmed I/O devices. uiomove(9F) transfers data between the user space, as defined by the uio(9S) structure, and the kernel. uiomove() can handle page faults, so the memory to which data is transferred need not be locked down. uiomove() also updates the uio_resid field in the uio(9S) structure. The following example shows one way to write a ramdisk read(9E) routine. It uses synchronous I/O and relies on the presence of the following fields in the ramdisk state structure:

caddr_t    ram;        /* base address of ramdisk */
int        ramsize;    /* size of the ramdisk */

Example 15–3 Ramdisk read(9E) Routine Using uiomove(9F)

static int
rd_read(dev_t dev, struct uio *uiop, cred_t *credp)
{
     rd_devstate_t     *rsp;

     rsp = ddi_get_soft_state(rd_statep, getminor(dev));
     if (rsp == NULL)
       return (ENXIO);
     if (uiop->uio_offset >= rsp->ramsize)
       return (EINVAL);
     /*
      * uiomove takes the offset into the kernel buffer,
      * the data transfer count (minimum of the requested and
      * the remaining data), the UIO_READ flag, and a pointer
      * to the uio structure.
      */
     return (uiomove(rsp->ram + uiop->uio_offset,
         min(uiop->uio_resid, rsp->ramsize - uiop->uio_offset),
         UIO_READ, uiop));
}

Another example of programmed I/O would be a driver that writes data one byte at a time directly to the device's memory. Each byte is retrieved from the uio(9S) structure by using uwritec(9F). The byte is then sent to the device. read(9E) can use ureadc(9F) to transfer a byte from the device to the area described by the uio(9S) structure.


Example 15–4 Programmed I/O write(9E) Routine Using uwritec(9F)

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{
    int    value;
    int    error = 0;
    struct xxstate     *xsp;

    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL)
        return (ENXIO);
    /* if the device implements a power manageable component, do this: */
    pm_busy_component(xsp->dip, 0);
    if (xsp->pm_suspended) {
        /* normal_power is a placeholder for the full-power level */
        pm_raise_power(xsp->dip, 0, normal_power);
    }

    while (uiop->uio_resid > 0) {
        /*
         * do the programmed I/O access
         */
        value = uwritec(uiop);
        if (value == -1) {
            error = EFAULT;
            break;    /* still mark the component idle below */
        }
        ddi_put8(xsp->data_access_handle, &xsp->regp->data,
            (uint8_t)value);
        ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
            START_TRANSFER);
        /*
         * this device requires a ten microsecond delay
         * between writes
         */
        drv_usecwait(10);
    }
    pm_idle_component(xsp->dip, 0);
    return (error);
}

DMA Transfers (Synchronous)

Character drivers generally use physio(9F) to do the setup work for DMA transfers in read(9E) and write(9E), as is shown in Example 15–5.

int physio(int (*strat)(struct buf *), struct buf *bp,
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct uio *uio);

physio(9F) requires the driver to provide the address of a strategy(9E) routine. physio(9F) ensures that memory space is locked down, that is, memory cannot be paged out, for the duration of the data transfer. This lock-down is necessary for DMA transfers because DMA transfers cannot handle page faults. physio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information.


Example 15–5 read(9E) and write(9E) Routines Using physio(9F)

static int
xxread(dev_t dev, struct uio *uiop, cred_t *credp)
{
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_READ, xxminphys, uiop);
     return (ret);
}    

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{     
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_WRITE, xxminphys, uiop);
     return (ret);
}

In the call to physio(9F), xxstrategy is a pointer to the driver strategy() routine. Passing NULL as the buf(9S) structure pointer tells physio(9F) to allocate a buf(9S) structure. If the driver must provide physio(9F) with a buf(9S) structure, getrbuf(9F) should be used to allocate the structure. physio(9F) returns zero if the transfer completes successfully, or an error number on failure. After calling strategy(9E), physio(9F) calls biowait(9F) to block until the transfer either completes or fails. The return value of physio(9F) is determined by the error field in the buf(9S) structure set by bioerror(9F).

DMA Transfers (Asynchronous)

Character drivers that support aread(9E) and awrite(9E) use aphysio(9F) instead of physio(9F).

int aphysio(int (*strat)(struct buf *), int (*cancel)(struct buf *),
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct aio_req *aio_reqp);

Note –

The address of anocancel(9F) is the only value that can currently be passed as the second argument to aphysio(9F).


aphysio(9F) requires the driver to pass the address of a strategy(9E) routine. aphysio(9F) ensures that memory space is locked down, that is, cannot be paged out, for the duration of the data transfer. This lock-down is necessary for DMA transfers because DMA transfers cannot handle page faults. aphysio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information.

Example 15–5 and Example 15–6 demonstrate that the aread(9E) and awrite(9E) entry points differ only slightly from the read(9E) and write(9E) entry points. The difference is primarily in their use of aphysio(9F) instead of physio(9F).


Example 15–6 aread(9E) and awrite(9E) Routines Using aphysio(9F)

static int
xxaread(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
         return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_READ,
         xxminphys, aiop));
}

static int
xxawrite(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
        return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_WRITE,
         xxminphys, aiop));
}

In the call to aphysio(9F), xxstrategy() is a pointer to the driver strategy routine. aiop is a pointer to the aio_req(9S) structure. aiop is passed to aread(9E) and awrite(9E). aio_req(9S) describes where the data is to be stored in user space. aphysio(9F) returns zero if the I/O request is scheduled successfully or an error number on failure. After calling strategy(9E), aphysio(9F) returns without waiting for the I/O to complete or fail.

minphys() Entry Point

The minphys() entry point is a pointer to a function to be called by physio(9F) or aphysio(9F). The purpose of xxminphys is to ensure that the size of the requested transfer does not exceed a driver-imposed limit. If the user requests a larger transfer, strategy(9E) is called repeatedly, requesting no more than the imposed limit at a time. This approach is important because DMA resources are limited. Drivers for slow devices, such as printers, should be careful not to tie up resources for a long time.

Usually, a driver passes the address of the kernel function minphys(9F), but the driver can define its own xxminphys() routine instead. The job of xxminphys() is to keep the b_bcount field of the buf(9S) structure under a driver's limit. The driver should adhere to other system limits as well. For example, the driver's xxminphys() routine should call the system minphys(9F) routine after setting the b_bcount field and before returning.


Example 15–7 minphys(9F) Routine

#define XXMINVAL (512 << 10)    /* 512 KB */
static void
xxminphys(struct buf *bp)
{
       if (bp->b_bcount > XXMINVAL)
           bp->b_bcount = XXMINVAL;
       minphys(bp);
}
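To see the effect of such a limit, the splitting of a large request can be modeled in user space. This is only a sketch of the behavior described above, not the physio(9F) implementation; xx_clamp(), xx_count_chunks(), and XX_MAX_XFER are hypothetical names, and the model ignores any additional system limit applied by minphys(9F).

```c
#include <assert.h>
#include <stddef.h>

#define XX_MAX_XFER ((size_t)512 << 10)    /* hypothetical 512 KB limit */

/* Model of a mincnt routine: clamp one request to the driver limit. */
size_t
xx_clamp(size_t bcount)
{
    return (bcount > XX_MAX_XFER ? XX_MAX_XFER : bcount);
}

/*
 * Count how many strategy calls a transfer of 'total' bytes would need
 * if each call moves at most xx_clamp() bytes.
 */
unsigned int
xx_count_chunks(size_t total)
{
    unsigned int calls = 0;

    while (total > 0) {
        size_t n = xx_clamp(total);

        assert(n > 0);    /* a zero limit would loop forever */
        total -= n;
        calls++;
    }
    return (calls);
}
```

In this model, a request of exactly 512 KB needs one call, and a request one byte larger needs two.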

strategy() Entry Point

The strategy(9E) routine originated in block drivers. The strategy function got its name from implementing a strategy for efficient queuing of I/O requests to a block device. A driver for a character-oriented device can also use a strategy(9E) routine. In the character I/O model presented here, strategy(9E) does not maintain a queue of requests, but rather services one request at a time.

In the following example, the strategy(9E) routine for a character-oriented DMA device allocates DMA resources for synchronous data transfer. strategy() starts the command by programming the device register. See Chapter 9, Direct Memory Access (DMA) for a detailed description.


Note –

strategy(9E) does not receive a device number (dev_t) as a parameter. Instead, the device number is retrieved from the b_edev field of the buf(9S) structure passed to strategy(9E).



Example 15–8 strategy(9E) Routine

static int
xxstrategy(struct buf *bp)
{
     minor_t            instance;
     struct xxstate     *xsp;
     ddi_dma_cookie_t   cookie;

     instance = getminor(bp->b_edev);
     xsp = ddi_get_soft_state(statep, instance);
     /* ... */
     /*
      * If the device has power manageable components,
      * mark the device busy with pm_busy_component(9F),
      * and then ensure that the device is
      * powered up by calling pm_raise_power(9F).
      */
     /*
      * Set up DMA resources with ddi_dma_alloc_handle(9F) and
      * ddi_dma_buf_bind_handle(9F).
      */
     xsp->bp = bp; /* remember bp */
     /* Program DMA engine and start command */
     return (0);
}


Note –

Although strategy() is declared to return an int, strategy() must always return zero.


On completion of the DMA transfer, the device generates an interrupt, causing the interrupt routine to be called. In the following example, xxintr() receives a pointer to the state structure for the device that might have generated the interrupt.


Example 15–9 Interrupt Routine

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     if ( /* device did not interrupt */ ) {
        return (DDI_INTR_UNCLAIMED);
     }
     if ( /* error */ ) {
        /* error handling */
     }
     /*
      * Release any resources used in the transfer, such as DMA resources,
      * with ddi_dma_unbind_handle(9F) and ddi_dma_free_handle(9F).
      * Notify threads that the transfer is complete.
      */
     biodone(xsp->bp);
     return (DDI_INTR_CLAIMED);
}

The driver indicates an error by calling bioerror(9F). The driver must call biodone(9F) when the transfer is complete or after indicating an error with bioerror(9F).

Mapping Device Memory

Some devices, such as frame buffers, have memory that is directly accessible to user threads by way of memory mapping. Drivers for these devices typically do not support the read(9E) and write(9E) interfaces. Instead, these drivers support memory mapping with the devmap(9E) entry point. For example, a frame buffer driver might implement the devmap(9E) entry point to enable the frame buffer to be mapped in a user thread.

The devmap(9E) entry point is called to export device memory or kernel memory to user applications. The devmap() function is called from devmap_setup(9F) inside segmap(9E) or on behalf of ddi_devmap_segmap(9F).

The segmap(9E) entry point is responsible for setting up a memory mapping requested by an mmap(2) system call. Drivers for many memory-mapped devices use ddi_devmap_segmap(9F) as the entry point rather than defining their own segmap(9E) routine.

See Chapter 10, Mapping Device and Kernel Memory and Chapter 11, Device Context Management for details.

Multiplexing I/O on File Descriptors

A thread sometimes needs to handle I/O on more than one file descriptor. One example is an application program that needs to read the temperature from a temperature-sensing device and then report the temperature to an interactive display. A program that makes a read request with no data available should not block while waiting for the temperature before interacting with the user again.

The poll(2) system call provides users with a mechanism for multiplexing I/O over a set of file descriptors that reference open files. poll(2) identifies those file descriptors on which a program can send or receive data without blocking, or on which certain events have occurred.
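From the application side, a non-blocking readiness check with poll(2) might look like the following user-space sketch. The function name xx_readable() is hypothetical. When the descriptor refers to a character device, a call like this is what causes the driver's polling entry point to run.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Return 1 if fd has data to read, 0 if no data became readable
 * (timeout or some other event), or -1 on error.
 * A timeout_ms of 0 makes the call return immediately.
 */
int
xx_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int n;

    assert(fd >= 0);
    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return (-1);
    return ((n > 0 && (pfd.revents & POLLIN)) ? 1 : 0);
}
```

A program such as the temperature display could call xx_readable() with a zero timeout on the sensor descriptor and issue read(2) only when the result is 1, so that the user interface never stalls.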

To enable a program to poll a character driver, the driver must implement the chpoll(9E) entry point. The system calls chpoll(9E) when a user process issues a poll(2) system call on a file descriptor associated with the device. The chpoll(9E) entry point routine is used by non-STREAMS character device drivers that need to support polling.

The chpoll(9E) function uses the following syntax:

int xxchpoll(dev_t dev, short events, int anyyet, short *reventsp,
     struct pollhead **phpp);

In the chpoll(9E) entry point, the driver must follow these rules:

Example 15–10 and Example 15–11 show how to implement the polling discipline and how to use pollwakeup(9F).

The following example shows how to handle the POLLIN and POLLERR events. The driver first reads the status register to determine the current state of the device. The parameter events specifies which conditions the driver should check. If an appropriate condition has occurred, the driver sets that bit in *reventsp. If none of the conditions has occurred and if anyyet is not set, the address of the pollhead structure is returned in *phpp.


Example 15–10 chpoll(9E) Routine

static int
xxchpoll(dev_t dev, short events, int anyyet,
    short *reventsp, struct pollhead **phpp)
{
     uint8_t status;
     short revent;
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
         return (ENXIO);
     revent = 0;
     /*
      * Valid events are:
      * POLLIN | POLLOUT | POLLPRI | POLLHUP | POLLERR
      * This example checks only for POLLIN and POLLERR.
      */
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if ((events & POLLIN) && /* data available to read */ ) {
        revent |= POLLIN;
     }
     if (status & DEVICE_ERROR) {
        revent |= POLLERR;
     }
     /* if nothing has occurred */
     if (revent == 0) {
        if (!anyyet) {
            *phpp = &xsp->pollhead;
        }
     }
     *reventsp = revent;
     return (0);
}

The following example shows how to use the pollwakeup(9F) function. The pollwakeup(9F) function usually is called in the interrupt routine when a supported condition has occurred. The interrupt routine reads the status from the status register and checks for the conditions. The routine then calls pollwakeup(9F) for each event to possibly notify polling threads that they should check again. Note that pollwakeup(9F) should not be called with any locks held, since deadlock could result if another routine tried to enter chpoll(9E) and grab the same lock.


Example 15–11 Interrupt Routine Supporting chpoll(9E)

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     uint8_t    status;
     /* normal interrupt processing */
     /* ... */
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if (status & DEVICE_ERROR) {
        pollwakeup(&xsp->pollhead, POLLERR);
     }
     if ( /* just completed a read */ ) {
        pollwakeup(&xsp->pollhead, POLLIN);
     }
     /* ... */
     return (DDI_INTR_CLAIMED);
}

Miscellaneous I/O Control

The ioctl(9E) routine is called when a user thread issues an ioctl(2) system call on a file descriptor associated with the device. The I/O control mechanism is a catchall for getting and setting device-specific parameters. This mechanism is frequently used to set a device-specific mode, either by setting internal driver software flags or by writing commands to the device. The control mechanism can also be used to return information to the user about the current device state. In short, the control mechanism can do whatever the application and driver need to have done.

ioctl() Entry Point (Character Drivers)

int xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
     cred_t *credp, int *rvalp);

The cmd parameter indicates which command ioctl(9E) should perform. By convention, the driver with which an I/O control command is associated is indicated in bits 8-15 of the command. Typically, the ASCII code of a character represents the driver. The driver-specific command is carried in bits 0-7. The creation of some I/O commands is illustrated in the following example:

#define XXIOC    ('x' << 8)     /* 'x' is a character representing */
                                      /* device xx */
#define XX_GET_STATUS    (XXIOC | 1)  /* get status register */
#define XX_SET_CMD       (XXIOC | 2)  /* send command */

The interpretation of arg depends on the command. I/O control commands should be documented in the driver documentation or a man page. The command should also be defined in a public header file, so that applications can determine the name of the command, what the command does, and what the command accepts or returns as arg. Any data transfer of arg into or out of the driver must be performed by the driver.

Certain classes of devices such as frame buffers or disks must support standard sets of I/O control requests. These standard I/O control interfaces are documented in the Solaris 8 Reference Manual Collection. For example, fbio(7I) documents the I/O controls that frame buffers must support, and dkio(7I) documents standard disk I/O controls. See Miscellaneous I/O Control for more information on I/O controls.

Drivers must use ddi_copyin(9F) to transfer arg data from the user-level application to the kernel level. Drivers must use ddi_copyout(9F) to transfer data from the kernel to the user level. Failure to use ddi_copyin(9F) or ddi_copyout(9F) can result in panics under two conditions. A panic occurs if the architecture separates the kernel and user address spaces, or if the user address has been swapped out.

ioctl(9E) is usually a switch statement with a case for each supported ioctl(9E) request.


Example 15–12 ioctl(9E) Routine

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
     uint8_t        csr;
     struct xxstate     *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL) {
        return (ENXIO);
     }
     switch (cmd) {
     case XX_GET_STATUS:
       csr = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
       if (ddi_copyout(&csr, (void *)arg,
           sizeof (uint8_t), mode) != 0) {
           return (EFAULT);
       }
       break;
     case XX_SET_CMD:
       if (ddi_copyin((void *)arg, &csr,
         sizeof (uint8_t), mode) != 0) {
         return (EFAULT);
       }
       ddi_put8(xsp->data_access_handle, &xsp->regp->csr, csr);
       break;
     default:
       /* generic "ioctl unknown" error */
       return (ENOTTY);
     }
     return (0);
}

The cmd variable identifies a specific device control operation. A problem can occur if arg contains a user virtual address. ioctl(9E) must call ddi_copyin(9F) or ddi_copyout(9F) to transfer data between the data structure in the application program pointed to by arg and the driver. In Example 15–12, for the case of an XX_GET_STATUS request, the contents of xsp->regp->csr are copied to the address in arg. ioctl(9E) can store in *rvalp any integer value as the return value to the ioctl(2) system call that makes a successful request. Negative return values, such as -1, should be avoided. Many application programs assume that negative values indicate failure.

The following example demonstrates an application that uses the I/O controls discussed in the previous paragraph.


Example 15–13 Using ioctl(9E)

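#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "xxio.h"     /* contains device's ioctl cmds and args */
int
main(void)
{
     int        fd;        /* file descriptor of the opened device */
     uint8_t    status;
     /* ... */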
     /*
      * read the device status
      */
     if (ioctl(fd, XX_GET_STATUS, &status) == -1) {
         /* error handling */
     }
     printf("device status %x\n", status);
     exit(0);
}

I/O Control Support for 64-Bit Capable Device Drivers

The Solaris kernel runs in 64-bit mode on suitable hardware, supporting both 32-bit applications and 64-bit applications. A 64-bit device driver is required to support I/O control commands from programs of both sizes. The difference between a 32-bit program and a 64-bit program is the C language type model. A 32-bit program is ILP32, and a 64-bit program is LP64. See Appendix C, Making a Device Driver 64-Bit Ready for information on C data type models.

If data that flows between programs and the kernel is not identical in format, the driver must be able to handle the model mismatch. Handling a model mismatch requires making appropriate adjustments to the data.

To determine whether a model mismatch exists, the ioctl(9E) mode parameter passes the data model bits to the driver. As Example 15–14 shows, the mode parameter is then passed to ddi_model_convert_from(9F) to determine whether any model conversion is necessary.

A flag subfield of the mode argument is used to pass the data model to the ioctl(9E) routine. The flag is set to one of the following:

FNATIVE is conditionally defined to match the data model of the kernel implementation. The FMODELS mask should be used to extract the flag from the mode argument. The driver can then examine the data model explicitly to determine how to copy the application data structure.
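
The masking step can be sketched in user space as follows. The XX_ constants are hypothetical stand-ins with illustrative values. A real driver takes FMODELS and the data model flags from the system headers rather than defining its own. The helper names are also assumptions made for this sketch.

```c
/*
 * Hypothetical stand-ins for the data model subfield of mode.
 * The values are illustrative only.
 */
#define XX_FMODELS   0x0ff00000   /* mask covering the model subfield */
#define XX_FILP32    0x00100000   /* caller uses the ILP32 model */
#define XX_FLP64     0x00200000   /* caller uses the LP64 model */

/* Extract the data model flag and discard unrelated mode bits. */
static int
xx_user_model(int mode)
{
	return (mode & XX_FMODELS);
}

/* Nonzero if the caller's model differs from the driver's native model. */
static int
xx_needs_conversion(int mode, int native_model)
{
	return (xx_user_model(mode) != native_model);
}
```

A driver built for the LP64 model would treat a mode value carrying the ILP32 flag as a mismatch, regardless of which other mode bits are set.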

The DDI function ddi_model_convert_from(9F) is a convenience routine that can assist some drivers with their ioctl() calls. The function takes the data type model of the user application as an argument and returns one of the following values:

DDI_MODEL_NONE is returned if no data conversion is necessary, as occurs when the application and driver have the same data model. DDI_MODEL_ILP32 is returned to a driver that is compiled to the LP64 model and that communicates with a 32-bit application.

In the following example, the driver copies a data structure that contains a user address. The data structure changes size from ILP32 to LP64. Accordingly, the 64-bit driver uses a 32-bit version of the structure when communicating with a 32-bit application.


Example 15–14 ioctl(9E) Routine to Support 32-bit Applications and 64-bit Applications

struct args32 {
    uint32_t    addr;    /* 32-bit address in LP64 */
    int     len;
};
struct args {
    caddr_t     addr;    /* 64-bit address in LP64 */
    int     len;
};

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
    struct  xxstate  *xsp;
    struct  args     a;
    xsp = ddi_get_soft_state(statep, getminor(dev));
    if (xsp == NULL) {
        return (ENXIO);
    }
    switch (cmd) {
    case XX_COPYIN_DATA:
        switch (ddi_model_convert_from(mode & FMODELS)) {
        case DDI_MODEL_ILP32:
        {
            struct args32 a32;

            /* copy 32-bit args data shape */
            if (ddi_copyin((void *)arg, &a32,
                sizeof (struct args32), mode) != 0) {
                return (EFAULT);
            }
            /* convert 32-bit to 64-bit args data shape */
            a.addr = (caddr_t)(uintptr_t)a32.addr;
            a.len = a32.len;
            break;
        }
        case DDI_MODEL_NONE:
            /* application and driver have same data model. */
            if (ddi_copyin((void *)arg, &a, sizeof (struct args),
                mode) != 0) {
                return (EFAULT);
            }
            break;
        }
        /* continue using data shape in native driver data model. */
        break;

    case XX_COPYOUT_DATA:
        /* copyout handling */
        break;
    default:
        /* generic "ioctl unknown" error */
        return (ENOTTY);
    }
    return (0);
}
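
The size change that motivates the separate 32-bit structure in Example 15–14 can be observed in user space. The following is a minimal sketch, assuming char * in place of caddr_t and a hypothetical helper named args32_to_native(). It is not driver code.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width shape: the same size under every data model. */
struct args32 {
	uint32_t	addr;
	int32_t		len;
};

/* Native shape: the pointer member is 8 bytes under LP64. */
struct args {
	char	*addr;
	int	len;
};

/* Hypothetical helper: widen the fixed-width shape into the native one. */
static void
args32_to_native(const struct args32 *a32, struct args *a)
{
	/* Go through uintptr_t so the integer-to-pointer cast is explicit. */
	a->addr = (char *)(uintptr_t)a32->addr;
	a->len = a32->len;
}
```

The 32-bit structure always occupies 8 bytes, while the native structure grows when pointers are 64 bits wide. This is why a single ddi_copyin(9F) of sizeof (struct args) would read past the end of a 32-bit caller's structure.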

Handling copyout() Overflow

Sometimes a driver needs to copy out a native quantity that no longer fits in the 32-bit sized structure. In this case, the driver should return EOVERFLOW to the caller. EOVERFLOW serves as an indication that the data type in the interface is too small to hold the value to be returned, as shown in the following example.


Example 15–15 Handling copyout(9F) Overflow

int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *cr, int *rval_p)
{
    struct resdata res;
    /* body of driver */
    switch (ddi_model_convert_from(mode & FMODELS)) {
    case DDI_MODEL_ILP32: {
        struct resdata32 res32;

        if (res.size > UINT_MAX)
            return (EOVERFLOW);
        res32.size = (size32_t)res.size;
        res32.flag = res.flag;
        if (ddi_copyout(&res32,
            (void *)arg, sizeof (res32), mode))
            return (EFAULT);
        break;
    }

    case DDI_MODEL_NONE:
        if (ddi_copyout(&res, (void *)arg, sizeof (res), mode))
            return (EFAULT);
        break;
    }
    return (0);
}
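
The range check itself can be exercised outside the kernel. The sketch below assumes hypothetical layouts for struct resdata and struct resdata32, because the text does not show their definitions, and uses a hypothetical helper named resdata_to_32().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical native and 32-bit layouts; assumed for illustration. */
struct resdata {
	uint64_t	size;
	int		flag;
};

struct resdata32 {
	uint32_t	size;
	int32_t		flag;
};

/*
 * Narrow the native structure into the 32-bit shape.
 * Returns EOVERFLOW if size cannot be represented in 32 bits.
 */
static int
resdata_to_32(const struct resdata *res, struct resdata32 *res32)
{
	if (res->size > UINT32_MAX)
		return (EOVERFLOW);
	res32->size = (uint32_t)res->size;
	res32->flag = res->flag;
	return (0);
}
```

Checking before the narrowing assignment means the caller never receives a silently truncated value.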

32-bit and 64-bit Data Structure Macros

The method in Example 15–15 works well for many drivers. An alternative scheme is to use the data structure macros that are provided in <sys/model.h> to move data between the application and the kernel. These macros make the code less cluttered and behave identically, from a functional perspective.


Example 15–16 Using Data Structure Macros to Move Data

int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *cr, int *rval_p)
{
    STRUCT_DECL(opdata, op);

    if (cmd != OPONE)
        return (ENOTTY);

    STRUCT_INIT(op, mode);

    /* copy in only as many bytes as the caller's data model defines */
    if (ddi_copyin((void *)arg,
        STRUCT_BUF(op), STRUCT_SIZE(op), mode))
        return (EFAULT);

    if (STRUCT_FGET(op, flag) != XXACTIVE ||
        STRUCT_FGET(op, size) > XXSIZE)
        return (EINVAL);
    xxdowork(device_state, STRUCT_FGET(op, size));
    return (0);
}

How Do the Structure Macros Work?

In a 64-bit device driver, structure macros enable the use of the same piece of kernel memory by data structures of both sizes. The memory buffer holds either the native form of the data structure, that is, the LP64 form, or the ILP32 form. Each structure access is implemented by a conditional expression that selects the form in use. When compiled as a 32-bit driver, only one data model, the native form, is supported, so no conditional expression is used.

The 64-bit versions of the macros depend on the definition of a shadow version of the data structure. The shadow version describes the 32-bit interface with fixed-width types. The name of the shadow data structure is formed by appending “32” to the name of the native data structure. Place the definition of the shadow structure in the same file as the native structure so that the two definitions are maintained together.

The macros can take the following arguments:

structname

The structure name of the native form of the data structure as entered after the struct keyword.

umodel

A flag word that contains the user data model, such as FILP32 or FLP64, extracted from the mode parameter of ioctl(9E).

handle

The name used to refer to a particular instance of a structure that is manipulated by these macros.

fieldname

The name of the field within the structure.
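
The shared-buffer idea can be illustrated with a simplified user-space sketch. This is not the contents of <sys/model.h>. The XX_ macros, the model tags, and the field layouts of opdata and opdata32 are all assumptions chosen to show how one buffer and a conditional expression can serve both forms.

```c
#include <stdint.h>

#define XX_MODEL_ILP32   1   /* hypothetical model tags */
#define XX_MODEL_NATIVE  2

struct opdata {              /* native form (assumed layout) */
	uint64_t	size;
	int		flag;
};

struct opdata32 {            /* shadow form with fixed-width types */
	uint32_t	size;
	int32_t		flag;
};

/* One buffer shared by both forms, plus the model currently in use. */
struct opdata_handle {
	int	model;
	union {
		struct opdata	native;
		struct opdata32	ilp32;
	} buf;
};

/* Each field access is a conditional expression on the model. */
#define XX_FGET(h, field) \
	(((h).model == XX_MODEL_ILP32) ? \
	    (h).buf.ilp32.field : (h).buf.native.field)

/* The size to copy depends on which form the buffer holds. */
#define XX_SIZE(h) \
	(((h).model == XX_MODEL_ILP32) ? \
	    sizeof (struct opdata32) : sizeof (struct opdata))
```

In a build that supports only one data model, both macros could collapse to a plain member access and a plain sizeof, which matches the behavior described for 32-bit drivers.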

When to Use Structure Macros

Macros enable you to make in-place references only to the fields of a data item. Macros do not provide a way to take separate code paths that are based on the data model. Macros should be avoided if the number of fields in the data structure is large. Macros should also be avoided if the frequency of references to these fields is high.

Macros hide many of the differences between data models in the implementation of the macros. As a result, code written with this interface is generally easier to read. When compiled as a 32-bit driver, the resulting code is compact without needing clumsy #ifdefs, but still preserves type checking.

Declaring and Initializing Structure Handles

STRUCT_DECL(9F) and STRUCT_INIT(9F) can be used to declare and initialize a handle and space for decoding an ioctl on the stack. STRUCT_HANDLE(9F) and STRUCT_SET_HANDLE(9F) declare and initialize a handle without allocating space on the stack. The latter macros can be useful if the structure is very large, or is contained in some other data structure.


Note –

Because the STRUCT_DECL(9F) and STRUCT_HANDLE(9F) macros expand to data structure declarations, these macros should be grouped with such declarations in C code.


The macros for declaring and initializing structures are as follows:

STRUCT_DECL(structname, handle)

Declares a structure handle that is called handle for a structname data structure. STRUCT_DECL allocates space for its native form on the stack. The native form is assumed to be larger than or equal to the ILP32 form of the structure.

STRUCT_INIT(handle, umodel)

Initializes the data model for handle to umodel. This macro must be invoked before any access is made to a structure handle declared with STRUCT_DECL(9F).

STRUCT_HANDLE(structname, handle)

Declares a structure handle that is called handle. Contrast with STRUCT_DECL(9F).

STRUCT_SET_HANDLE(handle, umodel, addr)

Initializes the data model for handle to umodel, and sets addr as the buffer used for subsequent manipulation. Invoke this macro before accessing a structure handle declared with STRUCT_HANDLE(9F).

Operations on Structure Handles

The macros for performing operations on structures are as follows:

size_t STRUCT_SIZE(handle)

Returns the size of the structure referred to by handle, according to its embedded data model.

typeof fieldname STRUCT_FGET(handle, fieldname)

Returns the indicated field in the data structure referred to by handle. This field is a non-pointer type.

typeof fieldname STRUCT_FGETP(handle, fieldname)

Returns the indicated field in the data structure referred to by handle. This field is a pointer type.

STRUCT_FSET(handle, fieldname, val)

Sets the indicated field in the data structure referred to by handle to value val. The type of val should match the type of fieldname. The field is a non-pointer type.

STRUCT_FSETP(handle, fieldname, val)

Sets the indicated field in the data structure referred to by handle to value val. The field is a pointer type.

typeof fieldname *STRUCT_FADDR(handle, fieldname)

Returns the address of the indicated field in the data structure referred to by handle.

struct structname *STRUCT_BUF(handle)

Returns a pointer to the native structure described by handle.

Other Operations

Some miscellaneous structure macros follow:

size_t SIZEOF_STRUCT(struct_name, datamodel)

Returns the size of struct_name, which is based on the given data model.

size_t SIZEOF_PTR(datamodel)

Returns the size of a pointer based on the given data model.
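
The idea behind a per-model size query can be sketched in user space. The model tags and the helpers xx_sizeof_ptr() and xx_ptr_array_bytes() are hypothetical and stand in for the kernel macros only for the purpose of illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define XX_MODEL_ILP32  1   /* hypothetical model tags */
#define XX_MODEL_LP64   2

/* Size of a pointer as seen by a caller that uses the given model. */
static size_t
xx_sizeof_ptr(int model)
{
	return (model == XX_MODEL_ILP32 ?
	    sizeof (uint32_t) : sizeof (uint64_t));
}

/* Bytes occupied in the caller's memory by an array of n pointers. */
static size_t
xx_ptr_array_bytes(int model, size_t n)
{
	return (n * xx_sizeof_ptr(model));
}
```

A driver that copies in an array of user pointers needs this kind of calculation, because the byte count depends on the caller's data model and not on the kernel's.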