Solaris 8 Software Developer Supplement

Additional Driver Hardening Considerations

In addition to the requirements discussed in the previous sections, the driver developer must consider a few other issues such as:

Thread Interaction

Kernel panics in a device driver are often caused by the unexpected interaction of kernel threads after a device failure. When a device fails, threads can interact in ways that the designer had not anticipated.

For example, if processing routines terminate early, they might fail to signal other threads that are waiting on condition variables. Attempting to inform other modules of the failure or handling unanticipated callbacks can result in undesirable thread interactions. Examine the sequence of mutex acquisition and relinquishment that can occur during device failures.

Threads that originate in an upstream STREAMS module can run into unfortunate paradoxes if used to call back into that module unexpectedly. You might use alternative threads to handle exception messages. For instance, a wput procedure might use a read-side service routine to communicate an M_ERROR, rather than doing it directly with a read-side putnext.

A failing STREAMS device that cannot be quiesced during close (because of the fault) can generate an interrupt after the Stream has been dismantled. The interrupt handler must not attempt to use a stale Stream pointer to try to process the message.

Threats From Top-Down Requests

While protecting the system from defective hardware, the driver designer also needs to protect against driver misuse. Although the driver can assume that the kernel infrastructure is always correct (a trusted core), user requests passed to it can be potentially destructive.

For example, a user can request an action to be performed on a user-supplied data block (M_IOCTL) that is smaller than that indicated in the control part of the message. The driver should never trust a user application.

The design should consider the construction of each type of ioctl that it can receive with a view to the potential harm that it could cause. The driver should make checks to be sure that it does not process malformed ioctls.

Adaptive Strategies

A driver can continue to provide service with faulty hardware and attempt to work around the identified problem by using an alternative strategy for accessing the device. Given that broken hardware is unpredictable and given the risk associated with additional design complexity, adaptive strategies are not always wise. At most, they should be limited to periodic interrupt polling and retry attempts. Periodically retrying the device lets the driver know when a device has recovered. Periodic polling can control the interrupt mechanism after a driver has been forced to disable interrupts.

Ideally, a system always has an alternative device to provide a vital system service. Service multiplexors in kernel or user space offer the best method of maintaining system services when a device fails. Such practices are beyond the scope of this chapter.