Solaris 8 Software Developer Supplement

Serviceability

To ensure serviceability, you must enable the driver to do the following:

Detect faulty devices and report the fault
Remove a device (as supported by the Solaris hot-plug model)
Add a new device (as supported by the Solaris hot-plug model)
Perform periodic health checks to enable the detection of latent faults

Checking the Current Device State

A driver must check its device state at appropriate points in order to avoid needlessly committing resources. The ddi_get_devstate(9F) function enables the driver to determine the device's current state, as maintained by the framework.

ddi_devstate_t ddi_get_devstate(dev_info_t *dip);

The driver is not normally called on to handle a device that is OFFLINE. Generally, the device state reflects earlier device fault reports, possibly modified by any reconfiguration activities that have occurred.

Correct Behavior When a Device Has Failed

The system must report a fault in terms of the impact it has on the ability of the device to provide service. Typically, loss of service is expected when:

A PIO or DMA error is detected.
Data corruption is detected.
The device is locked or hung (for example, when a command never completes).
A condition has occurred that the driver does not handle because it was regarded as impossible when the driver was designed.

If the device state, returned by ddi_get_devstate(9F), indicates that the device is not usable, the driver should reject all new and outstanding I/O requests and return (if possible) an appropriate error code (for example, EIO). For a STREAMS driver, M_ERROR or M_HANGUP, as appropriate, should be put upstream to indicate that the driver is not usable.

The state of the device should be checked at each major entry point, optionally before committing resources to an operation, and after reporting a fault. If at any stage the device is found to be unusable, the driver should perform any cleanup actions that are required (for example, releasing resources) and return in a timely way. The driver should not attempt any retry or recovery action, nor does it need to report a fault: the unusable state is not itself a fault, and it is already known to the framework and management agents. The driver should mark the current request and any other outstanding or queued requests as complete, again with an error indication if possible.
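The entry-point check described above might be sketched as follows. This is only an illustration that compiles outside a kernel build: the dev_info_t layout, the enumeration values, and the ddi_get_devstate() body are stand-ins for the real DDI definitions, and xx_softc, xx_req, xx_dev_usable(), xx_fail_all(), and xx_submit() are hypothetical names. Treating only DOWN and OFFLINE as unusable is also an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the DDI types (assumed values; a real driver uses the DDI headers). */
typedef enum {
    DDI_DEVSTATE_OFFLINE = -1,
    DDI_DEVSTATE_DOWN = 0,
    DDI_DEVSTATE_QUIESCED = 1,
    DDI_DEVSTATE_DEGRADED = 2,
    DDI_DEVSTATE_UP = 3
} ddi_devstate_t;

typedef struct dev_info { ddi_devstate_t state; } dev_info_t; /* hypothetical layout */

/* Stub: the real ddi_get_devstate(9F) returns the state kept by the framework. */
static ddi_devstate_t ddi_get_devstate(dev_info_t *dip) { return dip->state; }

/* Hypothetical request and per-instance structures. */
struct xx_req { int done; int error; struct xx_req *next; };
struct xx_softc { dev_info_t *dip; struct xx_req *queue; };

/* Assumption: DOWN and OFFLINE are the states treated as unusable. */
static int xx_dev_usable(dev_info_t *dip)
{
    ddi_devstate_t s = ddi_get_devstate(dip);
    return s != DDI_DEVSTATE_DOWN && s != DDI_DEVSTATE_OFFLINE;
}

/* Complete every queued request with an error indication. */
static void xx_fail_all(struct xx_softc *sc, int err)
{
    struct xx_req *r;
    while ((r = sc->queue) != NULL) {
        sc->queue = r->next;
        r->next = NULL;
        r->done = 1;
        r->error = err;
    }
}

/* Entry point: check the state before committing resources. */
static int xx_submit(struct xx_softc *sc, struct xx_req *req)
{
    if (!xx_dev_usable(sc->dip)) {
        /* No retry, no recovery, no new fault report: clean up and return. */
        req->done = 1;
        req->error = EIO;
        xx_fail_all(sc, EIO);
        return EIO;
    }
    req->next = sc->queue;  /* device usable: queue the request */
    sc->queue = req;
    return 0;
}
```

In this sketch the rejected request and every request already queued are completed with EIO, which corresponds to marking outstanding requests complete with an error indication.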

The ioctl() entry point presents a problem in this respect: ioctl operations that imply I/O to the device (for example, formatting a disk) should fail if the device is unusable, while others (such as recovering error status) should continue to work. The state check might therefore need to be on a per-command basis. Alternatively, you can implement those operations that work in any state through another entry point or minor device mode, although this might be constrained by issues of compatibility with existing applications.
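A per-command state check could look roughly like the following. The command codes XX_IOC_FORMAT and XX_IOC_GETERR and the function xx_ioctl_check() are hypothetical, and the enumeration values are stand-ins for the real DDI definitions. The sketch shows only the decision, not a complete ioctl() entry point.

```c
#include <errno.h>

/* Stand-in for the DDI state type (assumed values). */
typedef enum {
    DDI_DEVSTATE_OFFLINE = -1,
    DDI_DEVSTATE_DOWN = 0,
    DDI_DEVSTATE_QUIESCED = 1,
    DDI_DEVSTATE_DEGRADED = 2,
    DDI_DEVSTATE_UP = 3
} ddi_devstate_t;

/* Hypothetical ioctl command codes for an example driver. */
enum { XX_IOC_FORMAT = 1, XX_IOC_GETERR = 2 };

/*
 * Decide, per command, whether the current device state allows the
 * operation. Returns 0 to proceed or an errno value to fail the call.
 */
static int xx_ioctl_check(ddi_devstate_t state, int cmd)
{
    int unusable = (state == DDI_DEVSTATE_DOWN ||
        state == DDI_DEVSTATE_OFFLINE);

    switch (cmd) {
    case XX_IOC_FORMAT:     /* implies I/O to the device */
        return unusable ? EIO : 0;
    case XX_IOC_GETERR:     /* recovers error status: works in any state */
        return 0;
    default:                /* unknown command */
        return ENOTTY;
    }
}
```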

Note that close() should always complete successfully, even if the device is unusable. If the device is unusable, the interrupt handler should return DDI_INTR_UNCLAIMED for all subsequent interrupts. If interrupts continue to be generated, the eventual result is that the interrupt is disabled.
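The interrupt-handler behavior can be illustrated with a small sketch. The constants and the state enumeration are stand-in values, the handler signature is simplified, and xx_softc with its fields is hypothetical; a real handler would read the state with ddi_get_devstate(9F) and the interrupt status from the device.

```c
/* Stand-in values; a real driver takes these from the DDI headers. */
#define DDI_INTR_UNCLAIMED 0
#define DDI_INTR_CLAIMED   1

typedef enum {
    DDI_DEVSTATE_OFFLINE = -1,
    DDI_DEVSTATE_DOWN = 0,
    DDI_DEVSTATE_QUIESCED = 1,
    DDI_DEVSTATE_DEGRADED = 2,
    DDI_DEVSTATE_UP = 3
} ddi_devstate_t;

/* Hypothetical per-instance state. */
struct xx_softc {
    ddi_devstate_t state;   /* stand-in for ddi_get_devstate(dip) */
    int intr_pending;       /* stand-in for the device's interrupt status bit */
    unsigned long serviced; /* interrupts actually handled */
};

static unsigned int xx_intr(void *arg)
{
    struct xx_softc *sc = arg;

    /* Unusable device: do not touch the hardware, do not claim. */
    if (sc->state == DDI_DEVSTATE_DOWN || sc->state == DDI_DEVSTATE_OFFLINE)
        return DDI_INTR_UNCLAIMED;

    /* Not our interrupt. */
    if (!sc->intr_pending)
        return DDI_INTR_UNCLAIMED;

    sc->intr_pending = 0;   /* acknowledge and service */
    sc->serviced++;
    return DDI_INTR_CLAIMED;
}
```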

Fault Reporting

The following function notifies the system that your driver has discovered a device fault.

void ddi_dev_report_fault(dev_info_t *dip, ddi_fault_impact_t impact,
             ddi_fault_location_t location, const char *message);

The impact parameter indicates the impact of the fault on the device's ability to provide normal service, and is used by the fault management components of the system to determine the appropriate action to take in response to the fault. This action can cause a change in the device state. A service-lost fault causes the device state to be changed to DOWN and a service-degraded fault causes the device state to be changed to DEGRADED.

A device should be reported as faulty if:

A PIO error is detected.
Corrupted data is detected.
The device has locked up.

Drivers should avoid reporting the same fault repeatedly, if possible. In particular, it is redundant (and undesirable) for drivers to report any errors if the device is already in an unusable state (see ddi_get_devstate(9F)).

If a hardware fault is detected during the attach process, the driver must report the fault by using ddi_dev_report_fault(9F) as well as by returning DDI_FAILURE.
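One way to avoid redundant reports is to wrap ddi_dev_report_fault(9F) in a helper that first consults the device state. The sketch below is an illustration only: the enumeration values and the dev_info_t layout are stand-ins, the two ddi_* bodies are stubs that model the state changes described above (the real framework decides the resulting state), and xx_report_fault() is a hypothetical helper.

```c
/* Stand-ins for the DDI types (assumed values). */
typedef enum {
    DDI_DEVSTATE_OFFLINE = -1,
    DDI_DEVSTATE_DOWN = 0,
    DDI_DEVSTATE_QUIESCED = 1,
    DDI_DEVSTATE_DEGRADED = 2,
    DDI_DEVSTATE_UP = 3
} ddi_devstate_t;

typedef enum {
    DDI_SERVICE_LOST = -32,
    DDI_SERVICE_DEGRADED = -16,
    DDI_SERVICE_UNAFFECTED = 0,
    DDI_SERVICE_RESTORED = 1
} ddi_fault_impact_t;

typedef enum {
    DDI_DATAPATH_FAULT = -32,
    DDI_DEVICE_FAULT = -16,
    DDI_EXTERNAL_FAULT = 0
} ddi_fault_location_t;

/* Hypothetical layout; the real dev_info_t is opaque to drivers. */
typedef struct dev_info { ddi_devstate_t state; int reports; } dev_info_t;

/* Stub for ddi_get_devstate(9F). */
static ddi_devstate_t ddi_get_devstate(dev_info_t *dip) { return dip->state; }

/* Stub for ddi_dev_report_fault(9F): counts reports, models state changes. */
static void ddi_dev_report_fault(dev_info_t *dip, ddi_fault_impact_t impact,
    ddi_fault_location_t location, const char *message)
{
    (void)location;
    (void)message;
    dip->reports++;
    if (impact == DDI_SERVICE_LOST)
        dip->state = DDI_DEVSTATE_DOWN;
    else if (impact == DDI_SERVICE_DEGRADED)
        dip->state = DDI_DEVSTATE_DEGRADED;
}

/* Hypothetical helper: report only if the device is not already unusable. */
static void xx_report_fault(dev_info_t *dip, ddi_fault_impact_t impact,
    ddi_fault_location_t location, const char *message)
{
    ddi_devstate_t s = ddi_get_devstate(dip);

    if (s == DDI_DEVSTATE_DOWN || s == DDI_DEVSTATE_OFFLINE)
        return;     /* already known to the framework */
    ddi_dev_report_fault(dip, impact, location, message);
}
```

A driver that detects a hung device twice in a row would, with this helper, produce a single report; the second call returns without reporting because the state is already DOWN.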

Periodic Health Checks

A latent fault is one that does not show itself until some other action occurs. For example, a hardware failure occurring in a device that is a cold stand-by could remain undetected until a fault occurs on the master device. At this point, it will be discovered that the system now contains two defective devices and might be unable to continue operation.

Generally, latent faults that are allowed to remain undetected will eventually cause system failure. Without latent fault checking, the overall availability of a redundant system is jeopardized. To avoid this, a device driver must detect latent faults and report them in the same way as other faults.

The driver should ensure that it has a mechanism for making periodic health checks on the device. In a fault-tolerant situation in which the device can be the secondary or failover device, early detection of a failed secondary device is essential to ensure that it can be repaired or replaced before any failure in the primary device occurs.

Periodic health checks can:

Run a quick access check on the board (write, read), then check the device with the ddi_check_acc_handle(9F) routine.
Check a register or memory location on the device that has a value the driver expects to have been deterministically altered since the last poll. Features of a device that typically exhibit deterministic behavior include heartbeat semaphores, device timers (for example, local lbolt that is used by download), and event counters. Reading an updated predictable value from the device gives a degree of confidence that things are proceeding satisfactorily.
Time-stamp outgoing requests (transmit blocks or commands) when issued by the driver. The periodic health check can look for any overaged requests that have not completed.
Initiate an action on the device that should be completed before the next scheduled check. If this action is an interrupt, this is an ideal way of ensuring that the device's circuitry is still capable of delivering an interrupt.
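Two of the checks listed above, the deterministically changing value and the time-stamped requests, can be combined in a single polling routine. The following is a minimal sketch with hypothetical names throughout (xx_health, xx_req, xx_health_check(), and the XX_* result codes). It assumes the caller reads the heartbeat value from the device and supplies the current time, and that the routine is run periodically, for example from a timeout(9F) callback.

```c
#include <stddef.h>

/* Hypothetical outstanding request, time-stamped when issued. */
struct xx_req {
    long issued;            /* time at which the request was issued */
    int done;               /* nonzero once the device has completed it */
    struct xx_req *next;
};

/* Hypothetical health-check bookkeeping. */
struct xx_health {
    unsigned long last_heartbeat;   /* value seen at the previous poll */
    long max_age;                   /* longest a request may stay pending */
};

enum { XX_HEALTHY = 0, XX_HEARTBEAT_STALLED = 1, XX_REQUEST_OVERAGED = 2 };

/*
 * One periodic poll. heartbeat_now is the counter just read from the
 * device; now is the current time in the same units as xx_req.issued.
 */
static int xx_health_check(struct xx_health *h, unsigned long heartbeat_now,
    const struct xx_req *pending, long now)
{
    const struct xx_req *r;

    /* The counter should have changed since the last poll. */
    if (heartbeat_now == h->last_heartbeat)
        return XX_HEARTBEAT_STALLED;
    h->last_heartbeat = heartbeat_now;

    /* Look for requests that have been outstanding for too long. */
    for (r = pending; r != NULL; r = r->next) {
        if (!r->done && now - r->issued > h->max_age)
            return XX_REQUEST_OVERAGED;
    }
    return XX_HEALTHY;
}
```

A result other than XX_HEALTHY would then be reported in the same way as any other fault, provided that the device is not already in an unusable state.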