Writing Device Drivers

Chapter 10 Drivers for Character Devices

Character devices are devices that do not have physically addressable storage media, such as tape drives or serial ports, where I/O is normally performed in a byte stream. This chapter describes the structure of a character device driver, focusing in particular on character driver entry points. In addition, this chapter describes the use of physio(9F) (in read(9E) and write(9E)) and aphysio(9F) (in aread(9E) and awrite(9E)) in the context of synchronous and asynchronous I/O transfers.

Character Driver Structure Overview

Figure 10–1 shows data structures and routines that define the structure of a character device driver. Device drivers typically include the following:

Device-loadable driver section
Device configuration section
Character driver entry points

The shaded device access section in Figure 10–1 illustrates character driver entry points.

Figure 10–1 Character Driver Roadmap

Diagram shows structures and entry points for character device drivers.

Associated with each device driver is a dev_ops(9S) structure, which in turn refers to a cb_ops(9S) structure. These structures contain pointers to the driver entry points. Note that some of these entry points can be replaced with nodev(9F) or nulldev(9F) as appropriate.

Character Device Autoconfiguration

The attach(9E) routine should perform the common initialization tasks that all devices require. Typically, these tasks include:

Allocating per-instance state structures
Registering device interrupts
Mapping the device's registers
Initializing mutex and condition variables
Creating power-manageable components
Creating minor nodes

See The attach() Entry Point for code examples of these tasks.

Character device drivers create minor nodes of type S_IFCHR. This causes a character special file representing the node to eventually appear in the /devices hierarchy.

Example 10–1 shows a sample attach(9E) routine. It is common to declare any properties associated with the device in an attach() routine; this example creates the predefined Size property. Size is the equivalent of the Nblocks property used for getting the size of a partition in a block device. If, for example, you are doing character I/O on a disk device, you might use Size to get the size of a partition. Because Size is a 64-bit property (the 32-bit version is size), you must use a 64-bit property interface, in this case ddi_prop_update_int64(9F). See Device Properties for more on properties.

Example 10–1 Character Driver attach(9E) Routine

static int
xxattach(dev_info_t *dip, ddi_attach_cmd_t cmd)
{
        int             instance = ddi_get_instance(dip);
        major_t         maj_number;
        struct xxstate  *xsp;

        switch (cmd) {
        case DDI_ATTACH:
                /*
                 * Allocate a state structure (xsp) and initialize it.
                 * Map the device's registers.
                 * Add the device driver's interrupt handler(s).
                 * Initialize any mutexes and condition variables.
                 * Create power manageable components.
                 */
                /*
                 * Create the device's minor node. Note that the node_type
                 * argument is set to DDI_NT_TAPE.
                 */
                if (ddi_create_minor_node(dip, "minor_name", S_IFCHR,
                    instance, DDI_NT_TAPE, 0) == DDI_FAILURE) {
                        /* free resources allocated so far */
                        /* Remove any previously allocated minor nodes */
                        ddi_remove_minor_node(dip, NULL);
                        return (DDI_FAILURE);
                }
                /*
                 * Create driver properties like "Size." Use "Size"
                 * instead of "size" to ensure the property works
                 * for large bytecounts.
                 */
                /* set xsp->Size to the size of the device in bytes */
                maj_number = ddi_driver_major(dip);
                if (ddi_prop_update_int64(makedevice(maj_number, instance),
                    dip, "Size", xsp->Size) != DDI_PROP_SUCCESS) {
                        cmn_err(CE_CONT, "%s: cannot create Size property\n",
                            ddi_get_name(dip));
                        /* free resources allocated so far */
                        return (DDI_FAILURE);
                }
                ...
                return (DDI_SUCCESS);
        case DDI_RESUME:
                /* For information, see Chapter 9, Power Management */
        default:
                return (DDI_FAILURE);
        }
}

Device Access (Character Drivers)

Access to a device by one or more application programs is controlled through the open(9E) and close(9E) entry points. The open(9E) routine of a character driver is always called whenever an open(2) system call is issued on a special file representing the device. For a particular minor device, open(9E) can be called many times, but the close(9E) routine is called only when the final reference to a device is removed. If the device is accessed through file descriptors, the final call to close(9E) may occur as a result of a close(2) or exit(2) system call. If the device is accessed through memory mapping, the final call to close(9E) may occur as a result of a munmap(2) system call.

open() Entry Point (Character Drivers)

The syntax for open(9E) is as follows:

int xxopen(dev_t *devp, int flag, int otyp, cred_t *credp);

The primary function of open() is to verify that the open request is allowed. devp is a pointer to a device number. The open(9E) routine is passed a pointer so that the driver can change the minor number. This enables drivers to dynamically create minor instances of the device. An example of this might be a pseudo-terminal driver that creates a new pseudo-terminal whenever the driver is opened. A driver that dynamically chooses the minor number normally creates only one minor device node in attach(9E) with ddi_create_minor_node(9F), then changes the minor number component of *devp using makedevice(9F) and getmajor(9F):

    *devp = makedevice(getmajor(*devp), new_minor);

It is not necessary to call ddi_create_minor_node(9F) for the new minor. A driver may not change the major number of *devp. The driver must keep track of available minor numbers internally.
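The internal bookkeeping of minor numbers can be as simple as a bitmap. The following is a minimal user-space sketch of that idea only, not driver code: the names minor_pool_t, minor_pool_init(), minor_alloc(), minor_free(), and the XX_MAX_MINORS limit are all hypothetical, and a real driver would protect the pool with a mutex and pass the chosen number to makedevice(9F) as shown above.

```c
#include <stdint.h>

#define XX_MAX_MINORS 64        /* hypothetical limit for this sketch */

/* One bit per minor number; a set bit means the minor is in use. */
typedef struct minor_pool {
        uint64_t in_use;
} minor_pool_t;

void
minor_pool_init(minor_pool_t *mp)
{
        mp->in_use = 0;
}

/* Return the lowest free minor number, or -1 if none is available. */
int
minor_alloc(minor_pool_t *mp)
{
        int m;

        for (m = 0; m < XX_MAX_MINORS; m++) {
                if ((mp->in_use & ((uint64_t)1 << m)) == 0) {
                        mp->in_use |= (uint64_t)1 << m;
                        return (m);
                }
        }
        return (-1);
}

/* Release a minor number so that a later open can reuse it. */
void
minor_free(minor_pool_t *mp, int m)
{
        if (m >= 0 && m < XX_MAX_MINORS)
                mp->in_use &= ~((uint64_t)1 << m);
}
```

In this model, the open routine would call minor_alloc() to pick new_minor, and the close routine would call minor_free() for the minor being closed.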

otyp indicates how open(9E) was called. The driver must check that the value of otyp is appropriate for the device. For character drivers, otyp should be OTYP_CHR (see the open(9E) manual page).

flag contains bits indicating whether the device is being opened for reading (FREAD), writing (FWRITE), or both. User threads issuing the open(2) system call can also request exclusive access to the device (FEXCL) or specify that the open should not block for any reason (FNDELAY), but the driver must enforce both cases. A driver for a write-only device such as a printer might consider an open(9E) for reading invalid.

credp is a pointer to a credential structure containing information about the caller, such as the user ID and group IDs. Drivers should not examine the structure directly, but should instead use drv_priv(9F) to check for the common case of root privileges. In Example 10–2, only root is allowed to open the device for writing.

Example 10–2 shows a character driver open(9E) routine.

Example 10–2 Character Driver open(9E) Routine

static int
xxopen(dev_t *devp, int flag, int otyp, cred_t *credp)
{
        minor_t            instance;

        /* pseudocode: if getminor(*devp) is invalid, return (EINVAL) */
        instance = getminor(*devp); /* one-to-one example mapping */
        /* Is the instance attached? */
        if (ddi_get_soft_state(statep, instance) == NULL)
                return (ENXIO);
        /* verify that otyp is appropriate */
        if (otyp != OTYP_CHR)
                return (EINVAL);
        if ((flag & FWRITE) && drv_priv(credp) == EPERM)
                return (EPERM);
        return (0);
}

close() Entry Point (Character Drivers)

The syntax for close(9E) is as follows:

int xxclose(dev_t dev, int flag, int otyp, cred_t *credp);

close() should perform any cleanup necessary to finish using the minor device, and prepare the device (and driver) to be opened again. For example, the open routine might have been invoked with the exclusive access (FEXCL) flag. A call to close(9E) would allow further open routines to continue. Other functions that close(9E) might perform are:

Waiting for I/O to drain from output buffers before returning
Rewinding a tape (tape device)
Hanging up the phone line (modem device)
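The interplay between an exclusive open and the final close can be modeled in a few lines. This is a user-space sketch under stated assumptions: open_state_t, model_open(), model_close(), and MODEL_FEXCL are hypothetical stand-ins, and returning EBUSY for a refused open is a common convention rather than something this chapter prescribes.

```c
#include <errno.h>

#define MODEL_FEXCL 0x1         /* stand-in for the kernel's FEXCL flag */

typedef struct open_state {
        int is_open;            /* nonzero while at least one open is active */
        int is_excl;            /* nonzero if the device was opened with FEXCL */
} open_state_t;

/* Model of the access check an open(9E) routine might make. */
int
model_open(open_state_t *sp, int flag)
{
        /* An exclusive holder blocks every later open. */
        if (sp->is_excl)
                return (EBUSY);
        /* An exclusive request fails if the device is already open. */
        if ((flag & MODEL_FEXCL) && sp->is_open)
                return (EBUSY);
        sp->is_open = 1;
        if (flag & MODEL_FEXCL)
                sp->is_excl = 1;
        return (0);
}

/* Model of close(9E): it runs once, on the last close, so reset everything. */
void
model_close(open_state_t *sp)
{
        sp->is_open = 0;
        sp->is_excl = 0;
}
```

Because close(9E) is called only when the final reference goes away, the model does not count opens; it simply clears both flags so the device can be opened again.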

I/O Request Handling

This section gives the details of I/O request processing: from the application to the kernel, the driver, the device, the interrupt handler, and back to the user.

User Addresses

When a user thread issues a write(2) system call, it passes the address of a buffer in user space:

    char buffer[] = "python";
    count = write(fd, buffer, strlen(buffer) + 1);

The system builds a uio(9S) structure to describe this transfer by allocating an iovec(9S) structure and setting the iov_base field to the address passed to write(2), in this case, buffer. The uio(9S) structure is passed to the driver write(9E) routine (see Vectored I/O for more information about the uio(9S) structure).

The address in the iovec(9S) is in user space, not kernel space, and so is not guaranteed to be currently in memory. It is not even guaranteed to be a valid address. In either case, accessing a user address directly from the device driver or from the kernel could crash the system, so device drivers should never access user addresses directly. Instead, they should always use one of the data transfer routines in the Solaris 9 DDI/DKI that transfer data into or out of the kernel. These routines are able to handle page faults, either by bringing the proper user page in and continuing the copy transparently, or by returning an error on an invalid access.

Two routines commonly used are copyout(9F) to copy data from kernel space to user space and copyin(9F) to copy data from user space to kernel space. ddi_copyout(9F) and ddi_copyin(9F) operate similarly but are to be used in the ioctl(9E) routine. copyin(9F) and copyout(9F) can be used on the buffer described by each iovec(9S) structure, or uiomove(9F) can perform the entire transfer to or from a contiguous area of driver (or device) memory.

Vectored I/O

In character drivers, transfers are described by a uio(9S) structure. The uio(9S) structure contains information about the direction and size of the transfer, plus an array of buffers for one end of the transfer (the other end is the device).

The uio(9S) structure contains the following members:

iovec_t        *uio_iov;         /* base address of the iovec */
                                 /* buffer description array */
int            uio_iovcnt;       /* the number of iovec structures */
off_t          uio_offset;       /* 32-bit offset into file where */
                                 /* data is transferred from or to */
offset_t       uio_loffset;      /* 64-bit offset into file where */
                                 /* data is transferred from or to */
uio_seg_t      uio_segflg;       /* identifies the type of I/O */
                                 /* transfer: */
                                 /*  UIO_SYSSPACE:  kernel <-> kernel */
                                 /*  UIO_USERSPACE: kernel <-> user */
short          uio_fmode;        /* file mode flags (not driver settable) */
daddr_t        uio_limit;        /* 32-bit ulimit for file (maximum */
                                 /* block offset). not driver settable. */
diskaddr_t     uio_llimit;       /* 64-bit ulimit for file (maximum */
                                 /* block offset). not driver settable. */
int            uio_resid;        /* amount (in bytes) not */
                                 /* transferred on completion */

A uio(9S) structure is passed to the driver read(9E) and write(9E) entry points. This structure is generalized to support what is called gather-write and scatter-read. When writing to a device, the data buffers to be written do not have to be contiguous in application memory. Similarly, when reading from a device into memory, the data comes off the device in a contiguous stream but can go into noncontiguous areas of application memory. See the readv(2), writev(2), pread(2), and pwrite(2) man pages for more information on scatter-gather I/O.
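From the application side, a gather-write looks like the following user-space sketch. It uses a pipe in place of a device so that it can run anywhere; the function name gather_demo() and the sample strings are illustrative only. Each struct iovec here corresponds to one iovec(9S) entry that a driver would see through the uio(9S) structure.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Gather three separate buffers into one writev(2) call on a pipe, then
 * read the bytes back as a single contiguous stream. Returns the number
 * of bytes read into out, or -1 on failure.
 */
ssize_t
gather_demo(char *out, size_t outlen)
{
        char a[] = "scatter", b[] = "-", c[] = "gather";
        struct iovec iov[3];
        int fds[2];
        ssize_t n;

        if (pipe(fds) == -1)
                return (-1);

        iov[0].iov_base = a;
        iov[0].iov_len = strlen(a);
        iov[1].iov_base = b;
        iov[1].iov_len = strlen(b);
        iov[2].iov_base = c;
        iov[2].iov_len = strlen(c);

        /* One system call, three noncontiguous source buffers. */
        n = writev(fds[1], iov, 3);
        if (n != -1)
                n = read(fds[0], out, outlen);

        (void) close(fds[0]);
        (void) close(fds[1]);
        return (n);
}
```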

Each buffer is described by an iovec(9S) structure. This structure contains a pointer to the data area and the number of bytes to be transferred.

caddr_t        iov_base;        /* address of buffer */
int            iov_len;         /* amount to transfer */

The uio structure contains a pointer to an array of iovec(9S) structures. The base address of this array is held in uio_iov, and the number of elements is stored in uio_iovcnt.

The uio_offset field contains the 32-bit offset into the device at which the application needs to begin the transfer. uio_loffset is used for 64-bit file offsets. If the device does not support the notion of an offset, these fields can be safely ignored. The driver should interpret either uio_offset or uio_loffset (but not both). If the driver has set the D_64BIT flag in the cb_ops(9S) structure, it should use uio_loffset.

The uio_resid field starts out as the number of bytes to be transferred (the sum of all the iov_len fields in uio_iov) and must be set by the driver to the number of bytes not transferred before returning. The read(2) and write(2) system calls use the return value from the read(9E) and write(9E) entry points to determine if the transfer failed (and then return -1). If the return value indicates success, the system calls return the number of bytes requested minus uio_resid. If uio_resid is not changed by the driver, the read(2) and write(2) calls will return 0 (indicating end-of-file), even though all the data was transferred.
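The residual arithmetic can be illustrated with a small user-space model. This is a sketch, not the kernel implementation: model_uio_t, model_uiomove(), and model_read() are hypothetical names, and the model uses a single buffer instead of an iovec array.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for uio(9S): one buffer, an offset, and a residual. */
typedef struct model_uio {
        char    *buf;           /* destination in "user" memory */
        long    offset;         /* plays the role of uio_offset */
        size_t  resid;          /* plays the role of uio_resid */
} model_uio_t;

/* Copy up to n bytes and update the bookkeeping the way uiomove(9F) does. */
void
model_uiomove(const char *src, size_t n, model_uio_t *up)
{
        if (n > up->resid)
                n = up->resid;
        memcpy(up->buf, src, n);
        up->buf += n;
        up->offset += (long)n;
        up->resid -= n;
}

/* Model of read(9E) on a device of devsize bytes; returns bytes delivered. */
size_t
model_read(const char *dev, size_t devsize, model_uio_t *up)
{
        size_t requested = up->resid;

        if (up->offset >= 0 && (size_t)up->offset < devsize) {
                model_uiomove(dev + up->offset,
                    devsize - (size_t)up->offset, up);
        }
        /* What read(2) reports: bytes requested minus bytes left over. */
        return (requested - up->resid);
}
```

If the transfer routine never decremented resid, model_read() would return 0 even though data had been copied, which mirrors the end-of-file pitfall described above.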

The support routines uiomove(9F), physio(9F), and aphysio(9F) update the uio(9S) structure directly, updating the device offset to account for the data transfer. When used with a seekable device, for which the concept of position is relevant, the driver does not need to adjust either the uio_offset or uio_loffset fields. I/O performed to a device in this manner is constrained by the maximum possible value of uio_offset or uio_loffset. An example of such a usage is raw I/O on a disk.

When performing I/O on a device on which the concept of position has no relevance, the driver can save uio_offset or uio_loffset, perform the I/O operation, then restore uio_offset or uio_loffset to the field's initial value. I/O performed to a device in this manner is not constrained by the maximum possible value of uio_offset or uio_loffset. An example of such a usage is I/O on a serial line.

The following example shows one way to preserve uio_loffset in the read(9E) function.

static int
xxread(dev_t dev, struct uio *uio_p, cred_t *cred_p)
{
     offset_t off;
     ...

     off = uio_p->uio_loffset;  /* save the offset */
     /* do the transfer */
     uio_p->uio_loffset = off;  /* restore it */
     return (0);                /* or the error from the transfer */
}

Synchronous Versus Asynchronous I/O

Data transfers can be synchronous or asynchronous depending on whether the entry point scheduling the transfer returns immediately or waits until the I/O has been completed.

The read(9E) and write(9E) entry points are synchronous entry points; they must not return until the I/O is complete. Upon return from the routines, the process knows whether the transfer has succeeded.

The aread(9E) and awrite(9E) entry points are asynchronous entry points. They schedule the I/O and return immediately. Upon return, the process issuing the request knows that the I/O has been scheduled and that the status of the I/O must be determined later. In the meantime, the process can perform other operations.

When an asynchronous I/O request is made to the kernel by a user process, the process is not required to wait while the I/O is in progress. A process can perform multiple I/O requests and let the kernel handle the data transfer details. This is useful in applications such as transaction processing where concurrent programming methods can take advantage of asynchronous kernel I/O operations to increase performance or response time. Any performance boost for applications using asynchronous I/O, however, comes at the expense of greater programming complexity.

Data Transfer Methods

Data can be transferred using either programmed I/O or DMA. These data transfer methods can be used by either synchronous or asynchronous entry points, depending on the capabilities of the device.

Programmed I/O Transfers

Programmed I/O devices rely on the CPU to perform the data transfer. Programmed I/O data transfers are identical to other device register read and write operations. Various data access routines are used to read or store values to device memory.

uiomove(9F) can be used to transfer data to some programmed I/O devices. uiomove(9F) transfers data between the user space (defined by the uio(9S) structure) and the kernel. uiomove(9F) can handle page faults, so the memory to which data is transferred need not be locked down. It also updates the uio_resid field in the uio(9S) structure. Example 10–3 shows one way to write a ramdisk read(9E) routine. It uses synchronous I/O and relies on the presence of the following fields in the ramdisk state structure:

caddr_t        ram;            /* base address of ramdisk */
int            ramsize;        /* size of the ramdisk */

Example 10–3 Ramdisk read(9E) Routine Using uiomove(9F)

static int
rd_read(dev_t dev, struct uio *uiop, cred_t *credp)
{
     rd_devstate_t     *rsp;

     rsp = ddi_get_soft_state(rd_statep, getminor(dev));
     if (rsp == NULL)
           return (ENXIO);
     if (uiop->uio_offset >= rsp->ramsize)
           return (EINVAL);
     /*
      * uiomove takes the offset into the kernel buffer,
      * the data transfer count (minimum of the requested and
      * the remaining data), the UIO_READ flag, and a pointer
      * to the uio structure.
      */
     return (uiomove(rsp->ram + uiop->uio_offset,
             min(uiop->uio_resid, rsp->ramsize - uiop->uio_offset),
             UIO_READ, uiop));
}

Another example of programmed I/O might be a driver writing data one byte at a time directly to the device's memory. Each byte is retrieved from the uio(9S) structure using uwritec(9F), then sent to the device. read(9E) can use ureadc(9F) to transfer a byte from the device to the area described by the uio(9S) structure.

Example 10–4 Programmed I/O write(9E) Routine Using uwritec(9F)

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{
        int    value;
        struct xxstate     *xsp;

        xsp = ddi_get_soft_state(statep, getminor(dev));
        if (xsp == NULL)
                return (ENXIO);
        /* if the device implements a power manageable component, do this: */
        pm_busy_component(xsp->dip, 0);
        if (xsp->pm_suspended) {
                /* bring the component to normal power with ddi_dev_is_needed(9F) */
        }

        while (uiop->uio_resid > 0) {
                /*
                 * do the programmed I/O access
                 */
                value = uwritec(uiop);
                if (value == -1) {
                        /* balance the earlier pm_busy_component() call */
                        pm_idle_component(xsp->dip, 0);
                        return (EFAULT);
                }
                ddi_put8(xsp->data_access_handle, &xsp->regp->data,
                    (uint8_t)value);
                ddi_put8(xsp->data_access_handle, &xsp->regp->csr,
                    START_TRANSFER);
                /*
                 * this device requires a ten microsecond delay
                 * between writes
                 */
                drv_usecwait(10);
        }
        pm_idle_component(xsp->dip, 0);
        return (0);
}

DMA Transfers (Synchronous)

Most character drivers use physio(9F) to handle the bulk of the setup work for DMA transfers in read(9E) and write(9E), as shown in Example 10–5.

int physio(int (*strat)(struct buf *), struct buf *bp,
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct uio *uio);

physio(9F) requires the driver to provide the address of a strategy(9E) routine. physio(9F) ensures that memory space is locked down (cannot be paged out) for the duration of the data transfer. This is necessary for DMA transfers because they cannot handle page faults. physio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information.

Example 10–5 read(9E) and write(9E) Routines Using physio(9F)

static int
xxread(dev_t dev, struct uio *uiop, cred_t *credp)
{
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
            return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_READ, xxminphys, uiop);
     return (ret);
}    

static int
xxwrite(dev_t dev, struct uio *uiop, cred_t *credp)
{         
     struct xxstate *xsp;
     int ret;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
            return (ENXIO);
     ret = physio(xxstrategy, NULL, dev, B_WRITE, xxminphys, uiop);
     return (ret);
}

In the call to physio(9F), xxstrategy() is a pointer to the driver strategy routine. Passing NULL as the buf(9S) structure pointer tells physio(9F) to allocate a buf(9S) structure. If the driver must provide physio(9F) with a buf(9S) structure, getrbuf(9F) should be used to allocate one. physio(9F) returns zero if the transfer completes successfully, or an error number on failure. After calling strategy(9E), physio(9F) calls biowait(9F) to block until the transfer is completed or fails. The return value of physio(9F) is determined by the error field in the buf(9S) structure set by bioerror(9F).

DMA Transfers (Asynchronous)

Character drivers supporting aread(9E) and awrite(9E) use aphysio(9F) instead of physio(9F).

int aphysio(int (*strat)(struct buf *), int (*cancel)(struct buf *),
     dev_t dev, int rw, void (*mincnt)(struct buf *),
     struct aio_req *aio_reqp);

Note –

The address of anocancel(9F) is the only value that can currently be passed as the second argument to aphysio(9F).

aphysio(9F) requires the driver to pass the address of a strategy(9E) routine. aphysio(9F) ensures that memory space is locked down (cannot be paged out) for the duration of the data transfer. This is necessary for DMA transfers because they cannot handle page faults. aphysio(9F) also provides an automated way of breaking a larger transfer into a series of smaller, more manageable ones. See minphys() Entry Point for more information. Example 10–5 and Example 10–6 demonstrate that the aread(9E) and awrite(9E) entry points differ only slightly from the read(9E) and write(9E) entry points; the difference lies mainly in their use of aphysio(9F) instead of physio(9F).

Example 10–6 aread(9E) and awrite(9E) Routines Using aphysio(9F)

static int
xxaread(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
             return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_READ,
         xxminphys, aiop));
}

static int
xxawrite(dev_t dev, struct aio_req *aiop, cred_t *cred_p)
{
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
            return (ENXIO);
     return (aphysio(xxstrategy, anocancel, dev, B_WRITE,
         xxminphys, aiop));
}

In the call to aphysio(9F), xxstrategy() is a pointer to the driver strategy routine. aiop is a pointer to the aio_req(9S) structure and is also passed to aread(9E) and awrite(9E). aio_req(9S) describes where the data is to be stored in user space. aphysio(9F) returns zero if the I/O request is scheduled successfully or an error number on failure. After calling strategy(9E), aphysio(9F) returns without waiting for the I/O to complete or fail.

minphys() Entry Point

xxminphys() is a pointer to a function to be called by physio(9F) or aphysio(9F) to ensure that the size of the requested transfer does not exceed a driver-imposed limit. If the user requests a larger transfer, strategy(9E) will be called repeatedly, requesting no more than the imposed limit at a time. This is important because DMA resources are limited. Drivers for slow devices, such as printers, should be careful not to tie up resources for a long time.

Usually, a driver passes the address of the kernel function minphys(9F), but the driver can define its own xxminphys() routine instead. The job of xxminphys() is to keep the b_bcount field of the buf(9S) structure at or below a driver limit. There might be additional system limits that the driver should not circumvent, so the driver xxminphys() routine should call the system minphys(9F) routine after setting the b_bcount field and before returning.

Example 10–7 minphys(9F) Routine

#define XXMINVAL (512 << 10)    /* 512 KB */
static void
xxminphys(struct buf *bp)
{
       if (bp->b_bcount > XXMINVAL)
            bp->b_bcount = XXMINVAL;
       minphys(bp);
}
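The effect of the size limit on a large request can be modeled in user space. The sketch below only counts how a transfer is split; model_buf_t, model_minphys(), model_physio(), and the 512-byte MODEL_MAXXFER limit are hypothetical, and the real physio(9F) also locks down memory and waits for each piece to complete.

```c
#include <stddef.h>

#define MODEL_MAXXFER 512       /* hypothetical driver limit, in bytes */

typedef struct model_buf {
        size_t b_bcount;        /* bytes requested for this strategy call */
} model_buf_t;

/* Clamp one request to the driver limit, as a minphys routine would. */
void
model_minphys(model_buf_t *bp)
{
        if (bp->b_bcount > MODEL_MAXXFER)
                bp->b_bcount = MODEL_MAXXFER;
}

/*
 * Split a transfer of total bytes into clamped pieces. Returns the number
 * of pieces, which corresponds to the number of strategy(9E) calls, and
 * stores the size of the final piece in *lastp.
 */
int
model_physio(size_t total, size_t *lastp)
{
        model_buf_t b;
        int calls = 0;

        *lastp = 0;
        while (total > 0) {
                b.b_bcount = total;
                model_minphys(&b);
                /* a real driver's strategy(9E) routine would run here */
                total -= b.b_bcount;
                *lastp = b.b_bcount;
                calls++;
        }
        return (calls);
}
```

With this limit, a 1300-byte request is served as three pieces of 512, 512, and 276 bytes.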

strategy() Entry Point

The strategy(9E) routine originated in block drivers and is so called because it can implement a strategy for efficient queuing of I/O requests to a block device. A driver for a character-oriented device can also use a strategy(9E) routine. In the character I/O model presented here, strategy(9E) does not maintain a queue of requests, but rather services one request at a time.

In Example 10–8, the strategy(9E) routine for a character-oriented DMA device allocates DMA resources for synchronous data transfer and starts the command by programming the device register (see Chapter 8, Direct Memory Access (DMA) for a detailed description).

Note –

strategy(9E) does not receive a device number (dev_t) as a parameter; instead, this is retrieved from the b_edev field of the buf(9S) structure passed to strategy(9E).

Example 10–8 strategy(9E) Routine

static int
xxstrategy(struct buf *bp)
{
     minor_t                    instance;
     struct xxstate             *xsp;
     ddi_dma_cookie_t           cookie;

     instance = getminor(bp->b_edev);
     xsp = ddi_get_soft_state(statep, instance);
     ...
     /*
      * If the device has power manageable components, mark the device
      * busy with pm_busy_component(9F), and then ensure that the device
      * is powered up by calling ddi_dev_is_needed(9F).
      */

     /*
      * Set up DMA resources with ddi_dma_alloc_handle(9F) and
      * ddi_dma_buf_bind_handle(9F).
      */
     xsp->bp = bp; /* remember bp */
     /* program DMA engine and start command */
     return (0);
}

Note –

Although strategy(9E) is declared to return an int, it must always return zero.

On completion of the DMA transfer, the device generates an interrupt, causing the interrupt routine to be called. In Example 10–9, xxintr() receives a pointer to the state structure for the device that might have generated the interrupt.

Example 10–9 Interrupt Routine

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     if (device did not interrupt) {
            return (DDI_INTR_UNCLAIMED);
     }
     if (error) {
            /* error handling: indicate the error with bioerror(9F) */
     }
     /*
      * release any resources used in the transfer, such as DMA resources:
      * ddi_dma_unbind_handle(9F) and ddi_dma_free_handle(9F)
      */
     /* notify threads that the transfer is complete */
     biodone(xsp->bp);
     return (DDI_INTR_CLAIMED);
}

The driver indicates an error by calling bioerror(9F). The driver must call biodone(9F) when the transfer is complete or after indicating an error with bioerror(9F).

Mapping Device Memory

Some devices, such as frame buffers, have memory that is directly accessible to user threads by way of memory mapping. Drivers for these devices typically do not support the read(9E) and write(9E) interfaces. Instead, these drivers support memory mapping with the devmap(9E) entry point. A typical example is a frame buffer driver that implements the devmap(9E) entry point to allow the frame buffer to be mapped in a user thread.

segmap() Entry Point

int xxsegmap(dev_t dev, off_t off, struct as *asp, caddr_t *addrp,
     off_t len, unsigned int prot, unsigned int maxprot,
     unsigned int flags, cred_t *credp);

segmap(9E) is the entry point responsible for actually setting up a memory mapping requested by the system on behalf of an mmap(2) system call. Drivers for many memory-mapped devices will use ddi_devmap_segmap(9F) as the entry point rather than defining their own segmap(9E) routine.

If a driver wants to check mapping permissions or allocate private mapping resources before setting up the mapping, the driver can provide its own segmap(9E) entry point. segmap(9E) must call devmap_setup(9F) before returning.

In Example 10–10, the driver controls a frame buffer that allows write-only mappings. The driver returns EINVAL if the application tries to gain read access and then calls devmap_setup(9F) to set up the user mapping.

Example 10–10 segmap(9E) Routine

static int
xxsegmap(dev_t dev, off_t off, struct as *asp, caddr_t *addrp,
    off_t len, unsigned int prot, unsigned int maxprot,
    unsigned int flags, cred_t *credp)
{
        if (prot & PROT_READ)
                return (EINVAL);
        return (devmap_setup(dev, (offset_t)off, asp, addrp,
            (size_t)len, prot, maxprot, flags, credp));
}

devmap() Entry Point

int xxdevmap(dev_t dev, devmap_cookie_t handle, offset_t off,
     size_t len, size_t *maplen, uint_t model);

This entry point is called to export device memory or kernel memory to user applications. devmap(9E) is called from devmap_setup(9F) inside segmap(9E) or on behalf of ddi_devmap_segmap(9F). See Chapter 12, Mapping Device and Kernel Memory and Chapter 13, Device Context Management for details.

Multiplexing I/O on File Descriptors

A thread sometimes needs to handle I/O on more than one file descriptor. One example is an application program that needs to read the temperature from a temperature-sensing device and then report the temperature to an interactive display. If the program makes a read request and there is no data available, it should not block waiting for the temperature before interacting with the user again.

The poll(2) system call provides users with a mechanism for multiplexing I/O over a set of file descriptors that reference open files. poll(2) identifies those file descriptors on which a program can send or receive data without blocking, or on which certain events have occurred.
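On the application side, a non-blocking readiness check with poll(2) can be sketched as follows. The helper name readable_now() is illustrative; the descriptor could refer to any open file, including a character device whose driver implements polling support.

```c
#include <poll.h>
#include <unistd.h>

/*
 * Return 1 if fd has data to read right now, 0 if not, -1 on error.
 * A timeout of 0 makes poll(2) return immediately instead of blocking.
 */
int
readable_now(int fd)
{
        struct pollfd pfd;
        int n;

        pfd.fd = fd;
        pfd.events = POLLIN;
        pfd.revents = 0;

        n = poll(&pfd, 1, 0);
        if (n == -1)
                return (-1);
        return ((n == 1 && (pfd.revents & POLLIN)) ? 1 : 0);
}
```

A program such as the temperature monitor described above could call this helper before each read and return to its user-interface work whenever the result is 0.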

To allow a program to poll a character driver, the driver must implement the chpoll(9E) entry point. Its syntax is:

int xxchpoll(dev_t dev, short events, int anyyet, short *reventsp,
     struct pollhead **phpp);

The system calls chpoll(9E) when a user process issues a poll(2) system call on a file descriptor associated with the device. The chpoll(9E) entry point routine is used by non-STREAMS character device drivers that need to support polling.

In chpoll(9E), the driver must follow these rules:

Implement the following algorithm when the chpoll(9E) entry point is called:

if (events are satisfied now) {     
        *reventsp = mask of satisfied events;
} else {
        *reventsp = 0;
        if (!anyyet)
                *phpp = &local pollhead structure;
}
return (0);

xxchpoll() should check to see if certain events have occurred; see the chpoll(9E) man page. It should then return the mask of satisfied events by setting the return events in *reventsp.

If no events have occurred, the return field for the events is cleared. If the anyyet argument is zero, the driver must return a pointer to its pollhead structure in *phpp. The pollhead structure is usually allocated in a state structure and should be treated as opaque by the driver. None of its fields should be referenced.

Call pollwakeup(9F) whenever a device condition of type events, listed in Example 10–11, occurs. This function should be called only with one event at a time. pollwakeup(9F) might be called in the interrupt routine when the condition has occurred.

Example 10–11 and Example 10–12 show how to implement the polling discipline and how to use pollwakeup(9F).

Example 10–11 chpoll(9E) Routine

static int
xxchpoll(dev_t dev, short events, int anyyet,
        short *reventsp, struct pollhead **phpp)
{
     uint8_t status;
     short revent;
     struct xxstate *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL)
             return (ENXIO);
     revent = 0;
     /*
      * Valid events are:
      * POLLIN | POLLOUT | POLLPRI | POLLHUP | POLLERR
      * This example checks only for POLLIN and POLLERR.
      */
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if ((events & POLLIN) && data available to read) {
            revent |= POLLIN;
     }
     if ((events & POLLERR) && (status & DEVICE_ERROR)) {
            revent |= POLLERR;
     }
     /* if nothing has occurred */
     if (revent == 0) {
            if (!anyyet) {
                *phpp = &xsp->pollhead;
            }
     }
     *reventsp = revent;
     return (0);
}

In Example 10–11, the driver can handle the POLLIN and POLLERR events. The driver first reads the status register to determine the current state of the device. The events parameter specifies which conditions the driver should check. If a requested condition has occurred, the driver sets the corresponding bit in *reventsp. If none of the conditions has occurred and anyyet is not set, the address of the pollhead structure is returned in *phpp.

Example 10–12 Interrupt Routine Supporting chpoll(9E)

static u_int
xxintr(caddr_t arg)
{
     struct xxstate *xsp = (struct xxstate *)arg;
     uint8_t        status;

     /* normal interrupt processing */
     ...
     status = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
     if (status & DEVICE_ERROR) {
            pollwakeup(&xsp->pollhead, POLLERR);
     }
     if (just completed a read) {
            pollwakeup(&xsp->pollhead, POLLIN);
     }
     ...
     return (DDI_INTR_CLAIMED);
}

pollwakeup(9F) is usually called in the interrupt routine when a supported condition has occurred. The interrupt routine reads the status from the status register and checks for the conditions. It then calls pollwakeup(9F) for each event to possibly notify polling threads that they should check again. Note that pollwakeup(9F) should not be called with any locks held, as it could cause the chpoll(9E) routine to be entered, resulting in deadlock if that routine tries to grab the same lock.

Miscellaneous I/O Control

The ioctl(9E) routine is called when a user thread issues an ioctl(2) system call on a file descriptor associated with the device. The I/O control mechanism is a catchall for getting and setting device-specific parameters. It is frequently used to set a device-specific mode, either by setting internal driver software flags or by writing commands to the device. It can also be used to return information to the user about the current device state. In short, it can do whatever the application and driver need it to do.

`ioctl()` Entry Point (Character Drivers)

int xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
     cred_t *credp, int *rvalp);

The cmd parameter indicates which command ioctl(9E) should perform. By convention, I/O control commands indicate the driver they belong to in bits 8-15 of the command (usually given by the ASCII code of a character representing the driver), and the driver-specific command in bits 0-7. They are usually created in the following way:

#define XXIOC        ('x' << 8)        /* 'x' is a character representing */
                                       /* device xx */

#define XX_GET_STATUS                  (XXIOC | 1) /* get status register */
#define XX_SET_CMD                     (XXIOC | 2) /* send command */
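The bit layout described above can be checked with a small user-space sketch. The helpers xx_cmd_group and xx_cmd_number are hypothetical names introduced only for this illustration; they are not part of the DDI.

```c
/* Same layout as the definitions above. */
#define XXIOC           ('x' << 8)
#define XX_GET_STATUS   (XXIOC | 1)
#define XX_SET_CMD      (XXIOC | 2)

/* Driver identifier character stored in bits 8-15 of cmd. */
int
xx_cmd_group(int cmd)
{
        return ((cmd >> 8) & 0xff);
}

/* Driver-specific command number stored in bits 0-7 of cmd. */
int
xx_cmd_number(int cmd)
{
        return (cmd & 0xff);
}
```

With the ASCII code of 'x' (0x78) in the upper byte, XX_GET_STATUS evaluates to 0x7801 and XX_SET_CMD to 0x7802, so commands that belong to different drivers are unlikely to collide.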

The interpretation of arg depends on the command. I/O control commands should be documented (in the driver documentation or a manual page) and defined in a public header file, so that applications can determine the names, what they do, and what they accept or return as arg. Any data transfer of arg (into or out of the driver) must be performed by the driver.

Certain classes of devices such as frame buffers or disks must support standard sets of I/O control requests. These standard I/O control interfaces are documented in the Solaris 8 Reference Manual Collection. For example, fbio(7I) documents the I/O controls that frame buffers must support, and dkio(7I) documents standard disk I/O controls. See Miscellaneous I/O Control for more information on I/O control.

Drivers must use ddi_copyin(9F) to transfer arg data from the userland application to the kernel and ddi_copyout(9F) from kernel to userland. Failure to use ddi_copyin(9F) or ddi_copyout(9F) will result in panics on architectures that separate kernel and user address spaces, or if the user address has been swapped out.

An ioctl(9E) routine is usually structured as a switch statement with one case for each supported I/O control request.

Example 10–13 ioctl(9E) Routine

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
     uint8_t                csr;
     struct xxstate         *xsp;

     xsp = ddi_get_soft_state(statep, getminor(dev));
     if (xsp == NULL) {
            return (ENXIO);
     }
     switch (cmd) {
     case XX_GET_STATUS:
           csr = ddi_get8(xsp->data_access_handle, &xsp->regp->csr);
           if (ddi_copyout(&csr, (void *)arg,
               sizeof (uint8_t), mode) != 0) {
                   return (EFAULT);
           }
           break;
     case XX_SET_CMD:
           if (ddi_copyin((void *)arg, &csr,
             sizeof (uint8_t), mode) != 0) {
                 return (EFAULT);
           }
           ddi_put8(xsp->data_access_handle, &xsp->regp->csr, csr);
           break;
     default:
           /* generic "ioctl unknown" error */
           return (ENOTTY);
     }
     return (0);
}

The cmd variable identifies a specific device control operation. If arg contains a user virtual address, ioctl(9E) must call ddi_copyin(9F) or ddi_copyout(9F) to transfer data between the data structure in the application program pointed to by arg and the driver. In Example 10–13, for the case of an XX_GET_STATUS request the contents of xsp->regp->csr are copied to the address in arg. When a request succeeds, ioctl(9E) can store in *rvalp any integer value to be the return value of the ioctl(2) system call that made the request. Negative return values, such as -1, should be avoided, as they usually indicate the system call failed, and many application programs assume that negative values indicate failure.

An application that uses the I/O controls discussed above could look like Example 10–14.

Example 10–14 Using ioctl(9E)

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include "xxio.h"     /* contains device's ioctl cmds and args */

int
main(void)
{
     uint8_t        status;
     ...

     /*
      * read the device status
      */
     if (ioctl(fd, XX_GET_STATUS, &status) == -1) {
             /* error handling */
     }
     printf("device status %x\n", status);
     exit(0);
}

I/O Control Support for 64-Bit Capable Device Drivers

The Solaris kernel runs in 64-bit mode on suitable hardware and supports both 32-bit and 64-bit applications. A 64-bit device driver is required to support I/O control commands from 32-bit and 64-bit user mode programs. The difference between a 32-bit program and a 64-bit program is its C language type model: a 32-bit program is ILP32 and a 64-bit program is LP64. See Appendix C, Making a Device Driver 64-Bit Ready for information on C data type models.

Data that flows between user programs and the kernel (for example, through ddi_copyin(9F) or ddi_copyout(9F)) must satisfy one of two conditions. Either its format is identical regardless of the type model of the kernel and the application, or the device driver must detect a model mismatch between itself and the application and adjust the data format accordingly.
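One way to keep the format identical in both models is to build the argument structure only from fixed-width types. The following user-space sketch illustrates the idea; struct xx_xfer and xx_xfer_layout_ok are hypothetical names, and the layout check assumes a conventional ABI in which the fixed-width types have no internal padding.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical ioctl argument whose layout does not depend on the
 * data model: it contains no long and no pointer members, and the
 * explicit pad member keeps the size a multiple of 8 bytes.
 */
struct xx_xfer {
        uint64_t    user_addr;  /* user address carried as an integer */
        uint32_t    len;        /* transfer length in bytes */
        uint32_t    pad;        /* explicit padding, always zero */
};

/* Returns 1 if the compiler laid the structure out as intended. */
int
xx_xfer_layout_ok(void)
{
        return (sizeof (struct xx_xfer) == 16 &&
            offsetof(struct xx_xfer, user_addr) == 0 &&
            offsetof(struct xx_xfer, len) == 8 &&
            offsetof(struct xx_xfer, pad) == 12);
}
```

A structure of this kind can be copied in with a single size for both ILP32 and LP64 callers, at the cost of converting user_addr to a pointer explicitly inside the driver.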

So that the driver can detect a model mismatch, the mode parameter of ioctl(9E) carries the data model bits of the calling application. As Example 10–15 shows, the driver passes these bits to ddi_model_convert_from(9F) to determine whether any model conversion is necessary.

The data model is passed to the ioctl(9E) routine using a flag subfield of the mode argument. The flag will be set to one of:

DATAMODEL_ILP32
DATAMODEL_LP64

with FNATIVE conditionally defined to match the data model of the kernel implementation. The flag should be extracted from the mode argument using the FMODELS mask. The driver can then determine the data model explicitly to work out how to copy the application data structure.

The DDI function ddi_model_convert_from(9F) is a convenience routine that can assist some drivers with their ioctl() calls. The function takes the data type model of the user application as an argument and returns one of the following values:

DDI_MODEL_ILP32 — Convert from ILP32 application
DDI_MODEL_NONE — No conversion needed

DDI_MODEL_NONE is returned if no data conversion is necessary. This is the case when the application and driver have the same data model. DDI_MODEL_ILP32 is returned if the driver is compiled to the LP64 data model and is communicating with a 32-bit application.
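The decision just described can be modeled in user space as follows. This is only a sketch of the logic: every name prefixed with XX_ or xx_ is a hypothetical stand-in, and the numeric values are assumptions chosen for the illustration. A real driver takes the flags from the system headers and calls ddi_model_convert_from(9F) itself.

```c
/*
 * Illustrative stand-ins only; the values are assumptions, not
 * definitions copied from the system headers.
 */
#define XX_MODEL_ILP32  0x00100000
#define XX_MODEL_LP64   0x00200000
#define XX_FMODELS      0x0ff00000      /* mask for the model subfield */

enum xx_convert { XX_CONVERT_NONE, XX_CONVERT_ILP32 };

/*
 * Decide whether a driver compiled for native_model must convert data
 * that comes from a caller whose ioctl mode word is mode.
 */
enum xx_convert
xx_model_convert_from(int mode, int native_model)
{
        int caller_model = mode & XX_FMODELS;   /* drop unrelated bits */

        if (native_model == XX_MODEL_LP64 &&
            caller_model == XX_MODEL_ILP32)
                return (XX_CONVERT_ILP32);
        return (XX_CONVERT_NONE);
}
```

Masking first matters because the mode word also carries flags that are unrelated to the data model.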

In the following example, the driver copies a data structure that contains a user address. Because the data structure changes size from ILP32 to LP64, the 64-bit driver uses a 32-bit version of the structure when communicating with a 32-bit application.

Example 10–15 ioctl(9E) Routine to Support 32-bit and 64-bit Applications

struct args32 {
        uint32_t    addr;    /* 32-bit address in LP64 */
        int         len;
};
struct args {
        caddr_t     addr;    /* 64-bit address in LP64 */
        int         len;
};

static int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *credp, int *rvalp)
{
        struct  xxstate  *xsp;
        struct  args     a;
        xsp = ddi_get_soft_state(statep, getminor(dev));
        if (xsp == NULL) {
                return (ENXIO);
        }
        switch (cmd) {
        case XX_COPYIN_DATA:
                switch (ddi_model_convert_from(mode & FMODELS)) {
                case DDI_MODEL_ILP32:
                {
                        struct args32 a32;

                        /* copy 32-bit args data shape */
                        if (ddi_copyin((void *)arg, &a32,
                            sizeof (struct args32), mode) != 0) {
                                return (EFAULT);
                        }
                        /* convert 32-bit to 64-bit args data shape */
                        a.addr = (caddr_t)(uintptr_t)a32.addr;
                        a.len = a32.len;
                        break;
                }
                case DDI_MODEL_NONE:
                        /* application and driver have same data model. */
                        if (ddi_copyin((void *)arg, &a, sizeof (struct args),
                            mode) != 0) {
                                return (EFAULT);
                        }
                        break;
                }
                /* continue using data shape in native driver data model. */
                break;

        case XX_COPYOUT_DATA:
                /* copyout handling */
                break;
        default:
                /* generic "ioctl unknown" error */
                return (ENOTTY);
        }
        return (0);
}

Handling `copyout()` Overflow

Sometimes a driver needs to copy out a native quantity that no longer fits in the 32-bit sized structure. In this case, the driver should return EOVERFLOW to the caller as an indication that the data type in the interface is too small to hold the value to be returned, as shown in Example 10–16.

Example 10–16 Handling copyout(9F) Overflow

int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *cr, int *rval_p)
{
        struct resdata res;

        ... body of driver code ...

        switch (ddi_model_convert_from(mode & FMODELS)) {
        case DDI_MODEL_ILP32: {
                struct resdata32 res32;

                if (res.size > UINT_MAX)
                        return (EOVERFLOW);
                res32.size = (size32_t)res.size;
                res32.flag = res.flag;
                if (ddi_copyout(&res32,
                    (void *)arg, sizeof (res32), mode))
                        return (EFAULT);
                break;
        }

        case DDI_MODEL_NONE:
                if (ddi_copyout(&res, (void *)arg, sizeof (res), mode))
                        return (EFAULT);
                break;
        }
        return (0);
}
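The narrowing check can be exercised outside the kernel with a sketch like the one below. The field layouts of struct resdata and struct resdata32 are assumptions made for the illustration, since the example above does not define them, and resdata_to_32 is a hypothetical helper.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed native (LP64) shape of the result. */
struct resdata {
        uint64_t    size;
        int         flag;
};

/* Assumed 32-bit shape of the same result. */
struct resdata32 {
        uint32_t    size;
        int32_t     flag;
};

/*
 * Narrow a native result into its 32-bit shape.
 * Returns 0 on success, or EOVERFLOW if size does not fit in 32 bits,
 * in which case *out is left untouched.
 */
int
resdata_to_32(const struct resdata *in, struct resdata32 *out)
{
        if (in->size > UINT32_MAX)
                return (EOVERFLOW);
        out->size = (uint32_t)in->size;
        out->flag = in->flag;
        return (0);
}
```

Performing the range check before any field is written means that a failed conversion never hands the caller a partially filled structure.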

32–bit and 64–bit Data Structure Macros

While the method shown in the previous example works well for many drivers, an alternate scheme is to use the data structure macros provided in <sys/model.h> to move data between the application and the kernel. These macros make the code less cluttered and behave identically, from a functional perspective.

Example 10–17 Using Data Structure Macros to Move Data

int
xxioctl(dev_t dev, int cmd, intptr_t arg, int mode,
    cred_t *cr, int *rval_p)
{
        STRUCT_DECL(opdata, op);

        if (cmd != OPONE)
                return (ENOTTY);

        STRUCT_INIT(op, mode);

        /* copy in the caller's structure in the caller's data model */
        if (ddi_copyin((void *)arg,
            STRUCT_BUF(op), STRUCT_SIZE(op), mode))
                return (EFAULT);

        if (STRUCT_FGET(op, flag) != XXACTIVE ||
            STRUCT_FGET(op, size) > XXSIZE)
                return (EINVAL);
        xxdowork(device_state, STRUCT_FGET(op, size));
        return (0);
}

How Do the Structure Macros Work?

In a 64-bit device driver, these macros do all that is necessary to use the same piece of kernel memory as a buffer for the contents of the native form of the data structure (that is, the LP64 form), and for the ILP32 form of the same structure. This usually means that each structure access is implemented by a conditional expression. When compiled as a 32-bit driver, only one data model is supported and only the native form exists, so no conditional expression is used.

The 64-bit versions of the macros depend on the definition of a shadow version of the data structure that describes the 32-bit interface using fixed-width types. The name of the shadow data structure is formed by appending “32” to the name of the native data structure. For convenience, place the definition of the shadow structure in the same file as the native structure to ease future maintenance costs.

The macros take arguments such as:

structname: The structure name (as would appear after the struct keyword) of the native form of the data structure
umodel: A flag word containing the user data model, such as FILP32 or FLP64, extracted from the mode parameter of ioctl(9E)
handle: The name used to refer to a particular instance of a structure that is manipulated by these macros
fieldname: The name of the field within the structure
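The conditional-expression technique can be illustrated with a deliberately simplified user-space model. The XX_ macros below are hypothetical stand-ins written for this sketch; they are not the definitions from <sys/model.h>, the model flag values are arbitrary, and memcpy() stands in for ddi_copyin(9F).

```c
#include <stdint.h>
#include <string.h>

/* Arbitrary model flags for the sketch. */
#define XX_ILP32        1
#define XX_LP64         2

/* Native (LP64) form and its shadow, named by appending "32". */
struct opdata {
        uint64_t    size;
        int         flag;
};
struct opdata32 {
        uint32_t    size;
        int32_t     flag;
};

/* One buffer, large enough for either form, plus the recorded model. */
#define XX_STRUCT_DECL(name, h)                                 \
        struct {                                                \
                int model;                                      \
                union {                                         \
                        struct name     m64;                    \
                        struct name##32 m32;                    \
                } buf;                                          \
        } h
#define XX_STRUCT_INIT(h, umodel)       ((h).model = (umodel))
#define XX_STRUCT_BUF(h)                ((void *)&(h).buf)
#define XX_STRUCT_SIZE(h)                                       \
        ((h).model == XX_ILP32 ?                                \
            sizeof ((h).buf.m32) : sizeof ((h).buf.m64))
/* Each field access is a conditional expression on the model. */
#define XX_STRUCT_FGET(h, field)                                \
        ((h).model == XX_ILP32 ?                                \
            (h).buf.m32.field : (h).buf.m64.field)

/* Decode the size field from raw, which is laid out per umodel. */
uint64_t
xx_decode_size(const void *raw, int umodel)
{
        XX_STRUCT_DECL(opdata, op);

        XX_STRUCT_INIT(op, umodel);
        memcpy(XX_STRUCT_BUF(op), raw, XX_STRUCT_SIZE(op));
        return (XX_STRUCT_FGET(op, size));
}
```

The calling code names the field once and never mentions the shadow structure, which is what keeps drivers written with the real macros free of per-model branches.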

When to Use Structure Macros

Macros enable you to make in-place references only to the fields of a data item. They do not provide a way to take separate code paths based on the data model. They should be avoided if the number of fields in the data structure is large or the frequency of references to these fields is high.

Because the macros hide many of the differences between data models in the implementation of the macros, code written with this interface is generally easier to read. When compiled as a 32-bit driver, the resulting code is compact without needing clumsy #ifdefs, but still preserves type checking.

Declaring and Initializing Structure Handles

STRUCT_DECL(9F) and STRUCT_INIT(9F) can be used to declare and initialize a handle and space for decoding an ioctl on the stack. STRUCT_HANDLE(9F) and STRUCT_SET_HANDLE(9F) declare and initialize a handle without allocating space on the stack. The latter macros can be useful if the structure is very large, or is contained in some other data structure.

Note –

Because the STRUCT_DECL(9F) and STRUCT_HANDLE(9F) macros expand to data structure declarations, they should be grouped with such declarations in C code.

STRUCT_DECL(structname, handle): Declares a structure handle called handle for a struct structname data structure, and allocates space for its native form on the stack. The native form is assumed to be larger than or equal to the ILP32 form of the structure.
STRUCT_INIT(handle, umodel): Initializes the data model for handle to umodel. This macro must be invoked before any access is made to a structure handle declared with STRUCT_DECL(9F).
STRUCT_HANDLE(structname, handle): Declares a structure handle called handle. Contrast with STRUCT_DECL(9F).
STRUCT_SET_HANDLE(handle, umodel, addr): Initializes the data model for handle to umodel, and sets addr as the buffer used for subsequent manipulation. Invoke this macro before accessing a structure handle declared with STRUCT_HANDLE(9F).

Operations on Structure Handles

size_t STRUCT_SIZE(handle): Returns the size of the structure referred to by handle, according to its embedded data model.
typeof fieldname STRUCT_FGET(handle, fieldname): Returns the indicated field (non-pointer type) in the data structure referred to by handle.
typeof fieldname STRUCT_FGETP(handle, fieldname): Returns the indicated field (pointer type) in the data structure referred to by handle.
STRUCT_FSET(handle, fieldname, val): Sets the indicated field (non-pointer type) in the data structure referred to by handle to value val. The type of val should match the type of fieldname.
STRUCT_FSETP(handle, fieldname, val): Sets the indicated field (pointer type) in the data structure referred to by handle to value val.
typeof fieldname *STRUCT_FADDR(handle, fieldname): Returns the address of the indicated field in the data structure referred to by handle.
struct structname *STRUCT_BUF(handle): Returns a pointer to the native structure described by handle.

Other Operations

size_t SIZEOF_STRUCT(struct_name, datamodel): Returns the size of struct_name based on the given data model.
size_t SIZEOF_PTR(datamodel): Returns the size of a pointer based on the given data model.

