Writing Device Drivers

Diagnosis, Suspect Lists, and Fault Events

Once a diagnosis has been made, the diagnosis is output in the form of a list.suspect event. A list.suspect event is an event comprised of one or more possible fault or defect events. Sometimes the diagnosis cannot narrow the cause of errors to a single fault or defect. For example, the underlying problem might be a broken wire connecting controllers to the main system bus. The problem might be with a component on the bus or with the bus itself. In this specific case, the list.suspect event will contain multiple fault events: one for each controller attached to the bus, and one for the bus itself.

In addition to describing the fault that was diagnosed, a fault event also contains four payload members for which the diagnosis is applicable.

For example, after receiving a certain number of ECC correctable errors in a given amount of time for a particular memory location, the CPU and memory diagnosis engine issues a diagnosis (list.suspect event) for a faulty DIMM.

# fmdump -v -u 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c
TIME                 UUID                                 SUNW-MSG-ID
Oct 31 13:40:18.1864 38bd6f1b-a4de-4c21-db4e-ccd26fa8573c AMD-8000-8L
100%  fault.cpu.amd.icachetag

Problem in: hc:///motherboard=0/chip=0/cpu=0
Affects: cpu:///cpuid=0
FRU: hc:///motherboard=0/chip=0
Location: SLOT 2

In this example, fmd(1M) has identified a problem in a resource, specifically a CPU (hc:///motherboard=0/chip=0/cpu=0). To suppress further error symptoms and to prevent an uncorrectable error from occurring, an ASRU, (cpu:///cpuid=0), is identified for retirement. The component that needs to be replaced is the FRU (hc:///motherboard=0/chip=0).