Writing Device Drivers

Appendix G Advanced Topics

This appendix collects several advanced topics. Not every driver needs to address all of the issues covered here.

Multithreading

This section supplements the guidelines presented in Chapter 4, Multithreading, for writing an MT-safe driver, a driver that safely supports multiple threads.

Lock Granularity

Here are some issues to consider when deciding on how many locks to use in a driver:

The driver should allow as many threads as possible into the driver: this leads to fine-grained locking.
However, it should not spend too much time executing the locking primitives: this approach leads to coarse-grained locking.
Driver code should be simple and maintainable.
Avoid lock contention for shared data.
Write re-entrant code wherever possible. This makes it possible for many threads to execute without grabbing any locks.
Use locks to protect the data and not the code path.
Keep in mind the level of concurrency provided by the device; if the controller can only handle one request at a time, there is no point in spending excessive time making the driver handle multiple threads.

A little thought in reorganizing the ordering and types of locks around shared data can lead to considerable savings.

Avoiding Unnecessary Locks

To avoid unnecessary locks, note the following:

Use the multithreading semantics of the entry points to your advantage.

If an element of a device's state structure is read-mostly--for example, initialized in attach(9E), and destroyed in detach(9E), but only read in other entry points--there is no need to acquire a mutex to read that element of the structure. Indiscriminately adding calls to mutex_enter(9F) and mutex_exit(9F) around every access to such a variable can lead to unnecessary locking overhead.
Make all entry points re-entrant and reduce the amount of shared data by changing static variables to automatic, or by adding them to your state structure.

Note -

Kernel-thread stacks are small (currently 8 Kbytes), so do not allocate large automatic variables, and avoid deep recursion.
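The following is a minimal user-space sketch of the read-mostly guideline, not driver code. POSIX mutexes stand in for mutex_enter(9F) and mutex_exit(9F), and the structure xx_state, its fields, and the functions xx_attach(), xx_unit(), and xx_request() are hypothetical names chosen for illustration. The unit field is written once before any other routine can run and is then read without a lock, while the counter that changes on every request is protected.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-instance state; all names here are illustrative. */
struct xx_state {
	int		unit;		/* written once at attach time, then only read */
	pthread_mutex_t	lock;		/* protects nrequests only */
	unsigned long	nrequests;	/* modified on every request */
};

/* Analogue of attach(9E): runs before any other routine can see sp. */
static void
xx_attach(struct xx_state *sp, int unit)
{
	assert(sp != NULL);
	sp->unit = unit;
	sp->nrequests = 0;
	pthread_mutex_init(&sp->lock, NULL);
}

/* Read-mostly field: no lock is taken because nothing writes it after attach. */
static int
xx_unit(const struct xx_state *sp)
{
	return (sp->unit);
}

/* Shared, frequently written field: the lock is held only around the update. */
static unsigned long
xx_request(struct xx_state *sp)
{
	unsigned long n;

	pthread_mutex_lock(&sp->lock);
	n = ++sp->nrequests;
	pthread_mutex_unlock(&sp->lock);
	return (n);
}
```

Because the counter lives in the per-instance structure rather than in a static variable, each instance has its own lock and instances do not contend with one another.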

Locking Order

When acquiring multiple mutexes, be sure to acquire them in the same order on each code path. For example, mutexes A and B are used to protect two resources in the following ways:

Code Path 1             Code Path 2
mutex_enter(&A);        mutex_enter(&B);
    ...                     ...
mutex_enter(&B);        mutex_enter(&A);
    ...                     ...
mutex_exit(&B);         mutex_exit(&A);
    ...                     ...
mutex_exit(&A);         mutex_exit(&B);

If thread one is executing code path 1 and thread two is executing code path 2, the following could occur:

Thread one acquires mutex A.
Thread two acquires mutex B.
Thread one needs mutex B, so it blocks holding mutex A.
Thread two needs mutex A, so it blocks holding mutex B.

These threads are now deadlocked. Deadlocks of this kind are hard to track down, particularly because real code paths are rarely so straightforward. The deadlock also does not occur on every run, because it depends on the relative timing of threads one and two.
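One way to remove the deadlock is to make both code paths take the mutexes in the same order. The following user-space sketch illustrates this under stated assumptions: POSIX mutexes stand in for kernel mutexes, and the resources res_a and res_b, the invariant that they stay equal, and the function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;	/* always acquired first */
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;	/* always acquired second */
static int res_a;	/* protected by A */
static int res_b;	/* protected by B */

/* Code path 1: acquires A, then B. */
static void
code_path_1(void)
{
	pthread_mutex_lock(&A);
	pthread_mutex_lock(&B);
	res_a++;
	res_b++;
	assert(res_a == res_b);	/* invariant holds while both locks are held */
	pthread_mutex_unlock(&B);
	pthread_mutex_unlock(&A);
}

/*
 * Code path 2: works on the B resource first, but still acquires
 * A before B so that it can never hold B while waiting for A.
 */
static void
code_path_2(void)
{
	pthread_mutex_lock(&A);
	pthread_mutex_lock(&B);
	res_b += 10;
	res_a += 10;
	assert(res_a == res_b);
	pthread_mutex_unlock(&B);
	pthread_mutex_unlock(&A);
}
```

With a single agreed order, a thread that holds B always holds A as well, so the circular wait described above cannot form.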

Scope of a Lock

Experience has shown that it is easier to deal with locks that are either held throughout the execution of a routine, or locks that are both acquired and released in one routine. Avoid nesting like this:

static void
xxfoo(...)
{
	mutex_enter(&softc->lock);
	...
	xxbar();
}

static void
xxbar(...)
{
	...
	mutex_exit(&softc->lock);
}

This example works, but will almost certainly lead to maintenance problems.
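A possible restructuring is sketched below as a user-space analogue, with POSIX mutexes in place of kernel mutexes. Here xxfoo() both acquires and releases the lock, and xxbar() only documents and checks that its caller holds it. The lock_held flag is a hypothetical debugging aid for this sketch; in a driver, a check such as ASSERT(mutex_owned(&softc->lock)) using mutex_owned(9F) would typically serve the same purpose. The structure and its count field are also illustrative.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical soft state; the fields are illustrative only. */
struct xx_softc {
	pthread_mutex_t	lock;
	int		lock_held;	/* debugging aid for this sketch */
	int		count;		/* data protected by lock */
};

/* Helper: the caller must hold softc->lock for the whole call. */
static void
xxbar(struct xx_softc *softc)
{
	assert(softc->lock_held);
	softc->count++;
}

/* The lock is acquired and released in the same routine. */
static void
xxfoo(struct xx_softc *softc)
{
	pthread_mutex_lock(&softc->lock);
	softc->lock_held = 1;
	xxbar(softc);
	softc->lock_held = 0;
	pthread_mutex_unlock(&softc->lock);
}
```

A reader of xxfoo() can now see the full extent of the critical section without following the call into xxbar().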

If contention is likely in a particular code path, try to hold locks for a short time. In particular, arrange to drop locks before calling kernel routines that might block. For example:

mutex_enter(&softc->lock);
...
softc->foo = bar;
softc->thingp = kmem_alloc(sizeof(thing_t), KM_SLEEP);
...
mutex_exit(&softc->lock);

This is better coded as:

thingp = kmem_alloc(sizeof(thing_t), KM_SLEEP);
mutex_enter(&softc->lock);
...
softc->foo = bar;
softc->thingp = thingp;
...
mutex_exit(&softc->lock);

Potential Panics

Here is a set of mutex-related panics:

panic: recursive mutex_enter. mutex %x caller %x

A thread cannot re-enter a mutex that it already owns. Calling mutex_enter(9F) on a mutex that the current thread already holds causes this panic.

panic: mutex_adaptive_exit: mutex not held by thread

Releasing a mutex that the current thread does not hold causes this panic.

panic: lock_set: lock held and only one CPU

This panic occurs only on a uniprocessor. It indicates that a thread tried to acquire a spin mutex that is already held; the thread would spin forever, because there is no other CPU on which the holder could run to release the mutex. This can happen because the driver forgot to release the mutex on one code path, or blocked while holding it.

A common cause of this panic is that the device's interrupt is high-level (see ddi_intr_hilevel(9F) and Intro(9F)), and is calling a routine that blocks the interrupt handler while holding a spin mutex. This is obvious if the driver explicitly calls cv_wait(9F), but might not be so if it's blocking while grabbing an adaptive mutex with mutex_enter(9F).

Note -

In principle, this is only a problem for drivers that operate above lock level.

Sun Disk Device Drivers

Sun disk devices represent an important class of block device drivers. A Sun disk device is one that is supported by disk utility commands such as format(1M) and newfs(1M).

Disk I/O Controls

Sun disk drivers need to support a minimum set of I/O controls specific to Sun disk drivers. These I/O controls are specified in the dkio(7) manual page. Disk I/O controls transfer disk information to or from the device driver. When data is copied out of the driver to the user, use ddi_copyout(9F) to copy the information into the user's address space. When data is copied from the user to the driver, use ddi_copyin(9F) to copy the data into the kernel's address space. Table G-1 lists the mandatory Sun disk I/O controls.

Table G-1 Mandatory Sun Disk I/O Controls


I/O Control	Description
`DKIOCINFO`	Returns information describing the disk controller.
`DKIOCGAPART`	Returns a disk's partition map.
`DKIOCSAPART`	Sets a disk's partition map.
`DKIOCGGEOM`	Returns a disk's geometry.
`DKIOCSGEOM`	Sets a disk's geometry.
`DKIOCGVTOC`	Returns a disk's Volume Table of Contents.
`DKIOCSVTOC`	Sets a disk's Volume Table of Contents.

Disk Performance

The Solaris 7 DDI/DKI provides facilities to optimize I/O transfers for improved file system performance. It supports a mechanism to manage the list of I/O requests so as to optimize disk access for a file system. See "Asynchronous Data Transfers" for a description of enqueuing an I/O request.

The diskhd structure is used to manage a linked list of I/O requests.

struct diskhd {
	long		b_flags;		/* not used, needed for consistency */
	struct buf	*b_forw, *b_back;	/* queue of unit queues */
	struct buf	*av_forw, *av_back;	/* queue of bufs for this unit */
	long		b_bcount;		/* active flag */
};

The diskhd data structure has two buf pointers that the driver can manipulate. The av_forw pointer points to the first active I/O request. The second pointer, av_back, points to the last active request on the list.

A pointer to this structure is passed as an argument to disksort(9F), along with a pointer to the current buf structure being processed. The disksort(9F) routine sorts the buf requests in an order that optimizes disk seeks and then inserts the buf pointer into the diskhd list. The routine uses the value in b_resid of the buf structure as a sort key. The driver is responsible for setting this value. Most Sun disk drivers use the cylinder group as the sort key. This tends to optimize the file system read-ahead accesses.

Once data has been added to the diskhd list, the device needs to transfer the data. If the device is not busy processing a request, the xxstart() routine pulls the first buf structure off the diskhd list and starts a transfer.

If the device is busy, the driver should return from the xxstrategy() entry point. Once the hardware is done with the data transfer, it generates an interrupt. The driver's interrupt routine is then called to service the device. After servicing the interrupt, the driver can call the xxstart() routine to process the next buf structure in the diskhd list.
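The queue handling described above can be sketched in user space as follows. This is a deliberately simplified model and not the disksort(9F) implementation: it inserts requests in plain ascending key order, whereas the real routine's ordering policy may be more elaborate. The types xbuf and xdiskhd are cut-down, hypothetical stand-ins for struct buf and struct diskhd, and xx_enqueue() and xx_dequeue() are illustrative names for the sorting step and for the part of xxstart() that removes the first request.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-ins for struct buf and struct diskhd (hypothetical). */
struct xbuf {
	struct xbuf	*av_forw;	/* next request on the queue */
	long		b_resid;	/* sort key, set by the driver */
};

struct xdiskhd {
	struct xbuf	*av_forw;	/* first request on the queue */
	struct xbuf	*av_back;	/* last request on the queue */
};

/* Insert bp in ascending key order; equal keys stay in arrival order. */
static void
xx_enqueue(struct xdiskhd *dp, struct xbuf *bp)
{
	struct xbuf *prev = NULL;
	struct xbuf *cur;

	assert(dp != NULL && bp != NULL);
	for (cur = dp->av_forw; cur != NULL && cur->b_resid <= bp->b_resid;
	    cur = cur->av_forw)
		prev = cur;

	bp->av_forw = cur;
	if (prev == NULL)
		dp->av_forw = bp;	/* new first request */
	else
		prev->av_forw = bp;
	if (cur == NULL)
		dp->av_back = bp;	/* new last request */
}

/* Remove and return the first request, or NULL if the queue is empty. */
static struct xbuf *
xx_dequeue(struct xdiskhd *dp)
{
	struct xbuf *bp = dp->av_forw;

	if (bp != NULL) {
		dp->av_forw = bp->av_forw;
		if (dp->av_forw == NULL)
			dp->av_back = NULL;
		bp->av_forw = NULL;
	}
	return (bp);
}
```

In a driver, the strategy routine would perform the enqueue step and the interrupt routine would trigger the dequeue step once the current transfer completes, both under the mutex that protects the queue.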

SCSA

Global Data Definitions

The following is information for debugging, useful when a driver experiences bus-wide problems. One global data variable has been defined for the SCSA implementation: scsi_options. This variable is a SCSA configuration longword used for debug and control. The defined bits in the scsi_options longword can be found in the file <sys/scsi/conf/autoconf.h>. Table G-2 shows their meanings when set.

Table G-2 SCSA Options


Option	Description
`SCSI_OPTIONS_DR`	Enables global disconnect/reconnect.
`SCSI_OPTIONS_SYNC`	Enables global synchronous transfer capability.
`SCSI_OPTIONS_LINK`	Enables global link support.
`SCSI_OPTIONS_PARITY`	Enables global parity support.
`SCSI_OPTIONS_TAG`	Enables global tagged queuing support.
`SCSI_OPTIONS_FAST`	Enables global FAST SCSI support: 10 MB/sec transfers, as opposed to 5 MB/sec.
`SCSI_OPTIONS_FAST20`	Enables global FAST20 SCSI support: 20 MB/sec transfers.
`SCSI_OPTIONS_FAST40`	Enables global FAST40 SCSI support: 40 MB/sec transfers.
`SCSI_OPTIONS_FAST80`	Enables global FAST80 SCSI support: 80 MB/sec transfers.
`SCSI_OPTIONS_WIDE`	Enables global WIDE SCSI.

Note -

The setting of scsi_options affects all host adapter and target drivers present on the system (as opposed to scsi_ifsetcap(9F)). Refer to scsi_hba_attach(9F) in the Solaris 2.6 Reference Manual for information on controlling these options for a particular host adapter.

The default setting for scsi_options has these values set:

SCSI_OPTIONS_DR
SCSI_OPTIONS_SYNC
SCSI_OPTIONS_LINK
SCSI_OPTIONS_PARITY
SCSI_OPTIONS_TAG
SCSI_OPTIONS_FAST
SCSI_OPTIONS_FAST20
SCSI_OPTIONS_FAST40
SCSI_OPTIONS_FAST80
SCSI_OPTIONS_WIDE

Tagged Queuing

For a definition of tagged queuing, refer to the SCSI-2 specification. To support tagged queuing, first check the scsi_options flag SCSI_OPTIONS_TAG to see if tagged queuing is enabled globally. Next, check whether the target is a SCSI-2 device and whether it has tagged queuing enabled. If all of these conditions are true, attempt to enable tagged queuing by using scsi_ifsetcap(9F). Example G-1 shows how to support tagged queuing.

Example G-1 Supporting SCSI Tagged Queuing

#define ROUTE (&devp->sd_address)
	...
	/*
	 * If SCSI-2 tagged queuing is supported by the disk drive and
	 * by the host adapter, enable it. Otherwise fall back to
	 * untagged queuing, or to one command at a time.
	 */
	xsp->tagflags = 0;
	if ((scsi_options & SCSI_OPTIONS_TAG) &&
	    (devp->sd_inq->inq_rdf == RDF_SCSI2) &&
	    (devp->sd_inq->inq_cmdque)) {
		if (scsi_ifsetcap(ROUTE, "tagged-qing", 1, 1) == 1) {
			xsp->tagflags = FLAG_STAG;
			xsp->throttle = 256;
		} else if (scsi_ifgetcap(ROUTE, "untagged-qing", 0) == 1) {
			xsp->dp->options |= XX_QUEUEING;
			xsp->throttle = 3;
		} else {
			xsp->dp->options &= ~XX_QUEUEING;
			xsp->throttle = 1;
		}
	}

Untagged Queuing

If tagged queuing fails, you can attempt to set untagged queuing. In this mode, you submit as many commands as you think necessary or optimal to the host adapter driver. The host adapter then queues the commands to the target one at a time. This differs from tagged queuing, where the host adapter submits as many commands as it can until the target indicates that the queue is full.