Solaris 8 Software Developer Supplement

Chapter 3 High Availability Drivers

Note -

For the most up-to-date man pages, use the man command. The Solaris 8 Update release man pages include new feature information not found in the Solaris 8 Reference Manual Collection.

This functionality is new in the Solaris 8 10/00 release.

Availability is a function of both failure rate and speed of repair. In many cases, the failure of an individual device need not result in a total system failure. Redundant hardware components, together with drivers designed to support High Availability, can allow a system to continue operation even in the face of individual component failure. In many cases, such drivers can allow the system to be repaired even while it continues to provide service.

The programmatic elimination of driver failures resulting from device failures is called driver hardening. A hardened driver can tolerate and protect the rest of the system from errors that might otherwise propagate from a faulty device.

Functions within a driver that help isolate faults and assist in more rapid recovery and repair improve the system Serviceability; this improves Availability by reducing time to repair.

Additional information about how to create a Solaris device driver can be found in Writing Device Drivers.

Driver Hardening

Hardening is the process of ensuring that a driver works correctly in spite of faults in the I/O device that it controls or other faults originating outside the system core. A hardened driver must not panic, hang the system, or allow the uncontrolled spread of corrupted data as the result of any such faults.

The driver developer must take responsibility for:

Correct use of the DDI functions
Detecting and reporting any corruption of device I/O
Handling devices with deviant interrupt logic

All Solaris drivers should be hardened. Hardened drivers obey these rules:

Each piece of hardware should be controlled by a separate instance of the device driver.
Programmed I/O (PIO) must be performed only through the DDI access functions, using the appropriate data access handle.
The device driver must assume that data it receives from the device could be corrupted. The driver must check the integrity of the data before using it.
The driver must control the effects of any faults that it detects. Known bad data must not be released to the rest of the system.
The driver must ensure that all writes by the device into DMA buffers (DDI_DMA_READ) are contained within pages of memory controlled entirely by the driver. This prevents a DMA fault from corrupting an arbitrary part of the system's main memory.
The device driver must not be an unlimited drain on system resources if the device locks up. It should time-out if a device claims to be continuously busy. The driver should also detect a pathological (stuck) interrupt request and take appropriate action.
The driver must free up resources after a fault. For example, the system must be able to close all minor devices and detach driver instances even after the hardware fails.

Device Driver Instances

The Solaris kernel allows multiple instances of a driver. Each instance has its own data space but shares the text and some global data with other instances. The device is managed on a per-instance basis. Hardened drivers should use a separate instance for each piece of hardware unless the driver is designed to handle fail-over internally. There can be multiple instances of a driver per slot, for example, multi-function cards, which is standard behavior for Solaris device drivers.

Exclusive Use of DDI Access Handles

All programmed I/O (PIO) access by a hardened driver must use Solaris DDI access functions from the ddi_getX, ddi_putX, ddi_rep_getX, and ddi_rep_putX families of routines. The driver should not directly access the mapped registers by the address returned from ddi_regs_map_setup(9F). Using an access handle ensures that an I/O fault is controlled and its effects confined to the returned value, rather than possibly corrupting other parts of the machine state. (Avoid the ddi_peek(9F) and ddi_poke(9F) routines because they do not use access handles.)

The DDI access mechanism is important because it provides an opportunity to control how data is read into the kernel. DDI access routines provide protection by constraining the effect of bus timeout traps.

Detecting Corrupted Data

The following sections consider where data corruption can occur and the steps you can take to detect it.

Corruption of Device Management and Control Data

The driver should assume that any data obtained from the device, whether by PIO or DMA, could have been corrupted. In particular, extreme care should be taken with pointers, memory offsets, or array indexes read or calculated from data supplied by the device. Such values can be malignant, meaning they can cause a kernel panic if dereferenced. All such values should be checked for range and alignment (if required) before use.

Even if a pointer is not malignant, it can still be misleading. For example, it can point at a valid instance of an object, but not the correct one. Where possible, the driver should cross-check the pointer with the pointed-to object, or otherwise validate the data obtained through it.

Other types of data can also be misleading, such as packet lengths, status words, or channel IDs. Each should be checked to the extent possible: a packet length can be range-checked to ensure that it is not negative or larger than the containing buffer; a status word can be checked for "impossible" bits; and a channel ID can be matched against a list of valid IDs.

Where a value is used to identify a Stream, the driver must ensure that the Stream still exists. The asynchronous nature of STREAMS processing means that a Stream can be dismantled while device interrupts are still outstanding.

The driver should not reread data from the device; the data should be read once, validated, and stored in the driver's local state. This avoids the hazard presented by data that, although correct when initially read and validated, is incorrect when reread later.

The driver should also ensure that all loops are bounded, so that a device returning a continuous BUSY status, or claiming that another buffer needs to be processed, does not lock up the entire system.

Corruption of Received Data

Device errors can result in corrupted data being placed in receive buffers. Such corruption is indistinguishable from corruption that occurs beyond the domain of the device--for example, within a network. Typically, existing software is already in place to handle such corruption; for example, through integrity checks at the transport layer of a protocol stack or within the application using the device.

If the received data will not be checked for integrity at a higher layer--as in the case of a disk driver, for example--it can be integrity-checked within the driver itself. Methods of detecting corruption in received data are typically device-specific (checksums, CRC, and so forth).

Detecting Faults

Any ancestor of a device driver can disable the data path to the device if it detects a fault. When PIO access is disabled, any reads from the device return undefined values, while writes are ignored. If DMA access is disabled, the device might be prevented from accessing memory, or it might receive undefined data on reads and have writes discarded.

A device driver can detect that a data path has been disabled using the following DDI routines:

ddi_check_acc_handle(9F)
ddi_check_dma_handle(9F)

Each function checks whether any faults affecting the data path represented by the supplied handle have been detected. If one of these functions returns DDI_FAILURE, indicating that the data path has failed, the driver should report the fault using ddi_dev_report_fault(9F), perform any necessary cleanup, and, where possible, return an appropriate error to its caller.

Containment of Faults

Preservation of system integrity requires that faults be detected before they alter the system state. Consequently, the driver must test for faults whenever data returned from the device is going to be used by the system.

The ddi_check_acc_handle(9F) and ddi_check_dma_handle(9F) calls should be made at significant junctures, such as just before passing a data block to the upper layers.
Data must not be forwarded out of the driver if the device has failed.
The driver must consider other possible impacts of the failure on the integrity of the system. The driver must ensure that kernel resources, such as memory, are not permanently lost when data cannot be forwarded. Threads should not remain blocked waiting for signals that will never be generated.
The driver should limit its processing while in the failed state (for example, freeing messages in wput routines, attempting to permanently disable interrupts from a failed board, and so forth).

DMA Isolation

A defective device might initiate an improper DMA transfer over the bus. This data transfer could corrupt good data that was previously delivered. A device that fails might generate a corrupt address that can contaminate memory that does not even belong to its own driver.

In systems with an IOMMU, a device can write only to pages mapped as writable for DMA. Therefore, pages that are to be the target of DMA writes should be owned solely by one driver instance and not shared with any other kernel structure. While the page in question is mapped as writable for DMA, the driver should be suspicious of data in that page. The page must be unmapped from the IOMMU before it is passed beyond the driver, or before any validation of the data.

You can use ddi_umem_alloc(9F) to guarantee that a whole aligned page is allocated, or allocate multiple pages and ignore the memory below the first page boundary. You can find the size of an IOMMU page by using ddi_ptob(9F).

Alternatively, the driver can choose to copy the data into a safe part of memory before processing it. If this is done, the data must first be synchronized using ddi_dma_sync(9F).

Calls to ddi_dma_sync(9F) should specify SYNC_FOR_DEV before using DMA to transfer data to a device, and SYNC_FOR_CPU after using DMA to transfer data from the device to memory.

On some PCI-based systems with an IOMMU, devices may be able to use PCI dual address cycles (64-bit addresses) to bypass the IOMMU. This gives the device the potential to corrupt any region of main memory. Hardened device drivers must not attempt to use such a mode and should disable it.

Handling Stuck Interrupts

The driver must identify stuck interrupts because a persistently asserted interrupt severely affects system performance, almost certainly stalling a single-processor machine.

Sometimes it is difficult for the driver to identify a particular interrupt as bogus. For network drivers, if a receive interrupt is indicated but no new buffers have been made available, no work was needed. When this is an isolated occurrence, it is not a problem, as the actual work might already have been completed by another routine (read service, for example).

On the other hand, continuous interrupts with no work for the driver to process can indicate a stuck interrupt line. For this reason, all platforms allow a number of apparently bogus interrupts to occur before taking defensive action.

A hung device, while appearing to have work to do, might be failing to update its buffer descriptors. The driver should defend against such repetitive requests.

In some cases, platform-specific bus drivers might be capable of identifying a persistently unclaimed interrupt and can disable the offending device. However, this relies on the driver's ability to identify the valid interrupts and return the appropriate value. The driver should therefore return a DDI_INTR_UNCLAIMED result unless it detects that the device legitimately asserted an interrupt (that is, the device actually requires the driver to do some useful work).

The legitimacy of other, more incidental, interrupts is much harder to certify. To this end, an interrupt-expected flag is a useful tool for evaluating whether an interrupt is valid. Consider an interrupt such as descriptor free, which can be generated if all the device's descriptors had been previously allocated. If the driver detects that it has taken the last descriptor from the card, it can set an interrupt-expected flag. If this flag is not set when the associated interrupt is delivered, it is suspicious.

Some informative interrupts might not be predictable, such as one indicating that a medium has become disconnected or frame sync has been lost. The easiest method of detecting whether such an interrupt is stuck is to mask this particular source on first occurrence until the next polling cycle.

If the interrupt occurs again while disabled, this should be considered a false interrupt. Some devices have interrupt status bits that can be read even if the mask register has disabled the associated source and might not be causing the interrupt. Driver designers can devise more appropriate algorithms specific to their devices.

Avoid looping on interrupt status bits indefinitely. Break such loops if none of the status bits set at the start of a pass requires any real work.

Additional Driver Hardening Considerations

In addition to the requirements discussed in the previous sections, the driver developer must consider a few other issues. These are:

Thread interaction
Threats from top-down requests
Adaptive strategies

Thread Interaction

Kernel panics in a device driver are often caused by unexpected interaction of kernel threads after a device failure. When a device fails, threads can interact in ways that the designer had not anticipated.

For example, if processing routines terminate early, they may fail to signal other threads that are waiting on condition variables. Attempting to inform other modules of the failure or handling unanticipated callbacks can result in undesirable thread interactions. Examine the sequence of mutex acquisition and relinquishment that can occur during device failures.

Threads that originate in an upstream STREAMS module can run into unfortunate paradoxes if used to call back into that module unexpectedly. You might use alternative threads to handle exception messages. For instance, a wput procedure might use a read-side service routine to communicate an M_ERROR, rather than doing it directly with a read-side putnext.

A failing STREAMS device that cannot be quiesced during close (because of the fault) can generate an interrupt after the Stream has been dismantled. The interrupt handler must not attempt to use a stale Stream pointer to try to process the message.

Threats From Top-Down Requests

While protecting the system from defective hardware, the driver designer also needs to protect against driver misuse. Although the driver can assume that the kernel infrastructure is always correct (a trusted core), user requests passed to it can be potentially destructive.

For example, a user can request an action to be performed upon a user-supplied data block (M_IOCTL) that is smaller than that indicated in the control part of the message. The driver should never trust a user application.

The design should consider the construction of each type of ioctl that it can receive with a view to the potential harm that it could cause. The driver should make checks to be sure that it does not process malformed ioctls.

Adaptive Strategies

A driver can continue to provide service with faulty hardware, attempting to work around the identified problem by using an alternative strategy for accessing the device. Given that broken hardware is unpredictable and given the risk associated with additional design complexity, adaptive strategies are not always wise. At most, they should be limited to periodic interrupt polling and retry attempts. Periodically retrying the device lets the driver know when a device has recovered. Periodic polling can control the interrupt mechanism after a driver has been forced to disable interrupts.

Ideally, a system always has an alternative device to provide a vital system service. Service multiplexors in kernel or user space offer the best method of maintaining system services when a device fails. Such practices are beyond the scope of this chapter.

Serviceability

To ensure serviceability, the driver must be enabled to do the following:

Detect faulty devices and report the fault
Remove a device (as supported by the Solaris hot-plug model)
Add a new device (as supported by the Solaris hot-plug model)
Perform periodic health checks to enable the detection of latent faults

Checking the Current Device State

A driver must check its device state at appropriate points in order to avoid needlessly committing resources. The ddi_get_devstate(9F) function enables the driver to determine the device's current state, as maintained by the framework.

ddi_devstate_t ddi_get_devstate(dev_info_t *dip);

The driver is not normally called upon to handle a device that is OFFLINE. Generally, the device state will reflect earlier device fault reports, possibly modified by any reconfiguration activities that have taken place.

Correct Behavior When a Device Has Failed

The system must report a fault in terms of the impact it has on the ability of the device to provide service. Typically, loss of service is expected when:

A PIO or DMA error is detected
Data corruption is detected
The device is locked or hung (for example, when a command never completes)
A condition has occurred that the driver does not handle because it was regarded as impossible when the driver was designed

If the device state, returned by ddi_get_devstate(9F), indicates that the device is not usable, the driver should reject all new and outstanding I/O requests, returning (if possible) an appropriate error code (for example, EIO). For a STREAMS driver, M_ERROR or M_HANGUP, as appropriate, should be put upstream to indicate that the driver is not usable.

The state of the device should be checked at each major entry point, optionally before committing resources to an operation, and after reporting a fault. If at any stage the device is found to be unusable, the driver should perform any cleanup actions that are required (for example, releasing resources) and return in a timely fashion. It should not attempt any retry or recovery action, nor does it need to report a fault. The state is not a fault, and it is already known to the framework and management agents. It should mark the current request and any other outstanding or queued requests as complete, again with an error indication if possible.

The ioctl() entry point presents a problem in this respect: ioctl operations that imply I/O to the device (for example, formatting a disk) should fail if the device is unusable, while others (such as recovering error status) should continue to work. The state check might therefore need to be on a per-command basis. Alternatively, you can implement those operations that work in any state through another entry point or minor device mode, although this might be constrained by issues of compatibility with existing applications

Note that close() should always complete successfully, even if the device is unusable. If the device is unusable, the interrupt handler should return DDI_INTR_UNCLAIMED for all subsequent interrupts. If interrupts continue to be generated, this will eventually result in the interrupt being disabled.

Fault Reporting

This following function notifies the system that your driver has discovered a device fault.

void ddi_dev_report_fault(dev_info_t *dip, ddi_fault_impact_t impact,
             ddi_fault_location_t location, const char *message);

The impact parameter indicates the impact of the fault on the device's ability to provide normal service, and is used by the fault management components of the system to determine the appropriate action to take in response to the fault. This action can cause a change in the device state. A service-lost fault will cause the device state to be changed to DOWN and a service-degraded fault will cause the device state to be changed to DEGRADED.

A device should be reported as faulty if:

A PIO error is detected
Corrupted data is detected
The device has locked up

Drivers should avoid reporting the same fault repeatedly, if possible. In particular, it is redundant (and undesirable) for drivers to report any errors if the device is already in an unusable state (see ddi_get_devstate(9F)).

If a hardware fault is detected during the attach process, the driver must report the fault using ddi_dev_report_fault(9F) as well as returning DDI_FAILURE.

Periodic Health Checks

A latent fault is one that does not show itself until some other action occurs. For example, a hardware failure occurring in a device that is a cold stand-by could remain undetected until a fault occurs on the master device. At this point, it will be discovered that the system now contains two defective devices and might be unable to continue operation.

As a general rule, latent faults that are allowed to remain undetected will eventually cause system failure. Without latent fault checking, the overall availability of a redundant system is jeopardized. To avoid this, a device driver must detect latent faults and report them in the same way as other faults.

The driver should ensure that it has a mechanism for making periodic health checks on the device. In a fault-tolerant situation where the device can be the secondary or fail-over device, early detection of a failed secondary device is essential to ensure that it can be repaired or replaced before any failure in the primary device occurs.

Periodic health checks can:

Run a quick access check on the board (write, read), then check the device with the ddi_check_acc_handle(9F) routine.
Check a register or memory location on the device whose value the driver expects to have been deterministically altered since the last poll.

Features of a device that typically exhibit deterministic behavior include heartbeat semaphores, device timers (for example, local lbolt used by download), and event counters. Reading an updated predictable value from the device gives a degree of confidence that things are proceeding satisfactorily.
Time-stamp outgoing requests (transmit blocks or commands) when issued by the driver.

The periodic health check can look for any over-age requests that have not completed.
Initiate an action on the device that should be completed before the next scheduled check.

If this action is an interrupt, this is an ideal way of ensuring that the device's circuitry is still capable of delivering an interrupt.