A latent fault is one that does not show itself until some other action occurs. For example, a hardware failure occurring in a device that is a cold standby could remain undetected until a fault occurs on the master device. At this point, the system now contains two defective devices and might be unable to continue operation.
Latent faults that remain undetected typically cause system failure eventually. Without latent fault checking, the overall availability of a redundant system is jeopardized. To avoid this situation, a device driver must detect latent faults and report them in the same way as other faults.
You should provide the driver with a mechanism for making periodic health checks on the device. In a fault-tolerant situation where the device can be the secondary or failover device, early detection of a failed secondary device is essential to ensure that the secondary device can be repaired or replaced before any failure in the primary device occurs.
Periodic health checks can be used to perform the following activities:
Check any register or memory location on the device whose value might have been altered since the last poll.
Features of a device that typically exhibit deterministic behavior include heartbeat semaphores, device timers (for example, local lbolt used by download), and event counters. Reading an updated predictable value from the device gives a degree of confidence that things are proceeding satisfactorily.
Timestamp outgoing requests such as transmit blocks or commands that are issued by the driver.
The periodic health check can look for any suspect requests that have not completed.
Initiate an action on the device that should be completed before the next scheduled check.
If this action is an interrupt, this check is an ideal way to ensure that the device's circuitry can deliver an interrupt.