Fault Management Overview

The Oracle Solaris Fault Management feature includes the following components:

An architecture for building resilient error handlers
Structured error telemetry
Automated diagnostic software
Response agents
Structured messaging

Many parts of the software stack participate in fault management, including the CPU, memory and I/O subsystems, Oracle Solaris ZFS, and many device drivers.

The Fault Management Architecture (FMA) can help with both faults and defects:

Faults – A faulted component is a component that used to work but no longer works.
Defects – A defective component is a component that never worked correctly.

Hardware can experience both faults and defects. Most software problems are defects or are caused by configuration issues. Fault management and system services often interact. For example, a hardware problem might cause services to be stopped or restarted. An SMF service error might cause FMA to report a defect.

The fault management stack includes error detectors, a diagnosis engine, and response agents.

Error detectors

Error detectors detect errors in the system and perform any handling that is required immediately. They issue well-defined error reports, called ereports, to a diagnosis engine.

Diagnosis engine

The diagnosis engine interprets ereports and determines whether a fault or defect is present in the system. When such a determination is made, the diagnosis engine issues a suspect list that describes the resource or set of resources that might be the cause of the problem. The resource might have an associated Field Replaceable Unit (FRU), a label, or an Automatic System Reconfiguration Unit (ASRU). An ASRU might be immediately removed from service to mitigate the problem until the FRU is replaced.

When the suspect list includes multiple suspects (for example, if the diagnosis engine cannot isolate a single suspect), each suspect is assigned a probability of being the key suspect. The probabilities in this list sum to 100 percent. Suspect lists are interpreted by response agents.
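The rule that suspect probabilities sum to 100 percent can be illustrated with a small data model. This is a minimal sketch, not the Fault Manager's own data structure: the `Suspect` class, its field names, and the helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Suspect:
    """One candidate cause in a suspect list (hypothetical model, not an fmd type)."""
    resource: str               # name of the suspected resource
    certainty: int              # probability, in percent, that this is the cause
    fru: Optional[str] = None   # associated FRU, if any
    asru: Optional[str] = None  # associated ASRU, if any


def validate_suspect_list(suspects: List[Suspect]) -> None:
    """Raise ValueError unless the list is non-empty and certainties total 100."""
    if not suspects:
        raise ValueError("a suspect list needs at least one suspect")
    total = sum(s.certainty for s in suspects)
    if total != 100:
        raise ValueError(f"certainties sum to {total}, expected 100")


def most_likely(suspects: List[Suspect]) -> Suspect:
    """Return the suspect with the highest certainty."""
    validate_suspect_list(suspects)
    return max(suspects, key=lambda s: s.certainty)
```

A response agent written against such a model could, for example, act first on the result of `most_likely()` when the diagnosis engine could not isolate a single suspect.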

Response agents

Response agents attempt to take action based on the suspect list. Responses include logging messages, taking CPU strands offline, retiring memory pages, and retiring I/O devices.

Error detectors, diagnosis engines, and response agents are connected by the Fault Manager daemon, fmd, which acts as a multiplexor between the various components, as shown in the following figure.

Figure 1-1 Fault Management Architecture Components

[Image: Relationships between the Fault Manager daemon, error detectors, diagnosis engines, and response agents.]
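The multiplexing role described above can be sketched as a simple publish and subscribe router. This is an illustrative toy under stated assumptions, not the interface of fmd: the `Multiplexer` class, the event class names `"ereport"` and `"suspect-list"`, and the rule used by the sample diagnosis engine are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Handler = Callable[[Event], List[Event]]


class Multiplexer:
    """Toy event router standing in for the role fmd plays (not fmd's real API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, event_class: str, handler: Handler) -> None:
        """Register a handler for one class of event."""
        self._subscribers.setdefault(event_class, []).append(handler)

    def publish(self, event: Event) -> None:
        """Deliver an event; handlers may return follow-up events to route."""
        pending = [event]
        while pending:
            current = pending.pop(0)
            for handler in self._subscribers.get(current["class"], []):
                pending.extend(handler(current))


log: List[str] = []


def diagnosis_engine(ereport: Event) -> List[Event]:
    # Hypothetical rule: blame the resource named in the ereport.
    return [{"class": "suspect-list", "suspects": [ereport["resource"]]}]


def logging_agent(suspect_list: Event) -> List[Event]:
    # A response agent that only logs a message and emits nothing further.
    log.append("suspects: " + ", ".join(suspect_list["suspects"]))
    return []


mux = Multiplexer()
mux.subscribe("ereport", diagnosis_engine)
mux.subscribe("suspect-list", logging_agent)

# An error detector reports an error against a hypothetical resource.
mux.publish({"class": "ereport", "resource": "cpu0"})
```

In this sketch the detector, the diagnosis engine, and the agent never call each other directly. Each one only publishes to or subscribes through the router, which mirrors the arrangement in the figure.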

The lifecycle of a problem managed by the Fault Manager can include the following stages:

Diagnose: A new problem has been diagnosed by the Fault Manager. The diagnosis includes a list of one or more suspects. A suspect might have been automatically isolated to prevent further errors from occurring. The problem is identified by a UUID in the event payload, and later events that describe the resolution lifecycle of this problem carry the same UUID.
Update: One or more of the suspect resources in a problem diagnosis have been repaired, replaced, or acquitted, or a resource has faulted again. The suspect list still contains at least one faulted resource. A repair might have been made by executing an fmadm command, or the system might have detected a repair such as a changed serial number for a part. The fmadm command is described in Chapter 3, Repairing Faults.
Repair: All of the suspect resources in a problem diagnosis have been repaired, resolved, or acquitted. Some or all of the resources might still be isolated.
Resolve: All of the suspect resources in a problem diagnosis have been repaired, resolved, or acquitted and are no longer isolated. For example, a CPU that was a suspect and was taken offline is now back online. Taking resources offline and bringing them back online is usually automatic.
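The distinction between the Update, Repair, and Resolve stages can be expressed as a small classification function. This is a simplified sketch: the `Stage` enum, the `SuspectState` class, and `stage_after_change()` are hypothetical names, and the sketch assumes that a problem with no faulted and no isolated suspects goes straight to Resolve.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Stage(Enum):
    DIAGNOSE = "diagnose"
    UPDATE = "update"
    REPAIR = "repair"
    RESOLVE = "resolve"


@dataclass
class SuspectState:
    faulted: bool   # False once repaired, replaced, or acquitted
    isolated: bool  # True while the resource is kept out of service


def stage_after_change(suspects: List[SuspectState]) -> Stage:
    """Stage reached after a change to an already diagnosed problem."""
    if any(s.faulted for s in suspects):
        return Stage.UPDATE   # at least one suspect is still faulted
    if any(s.isolated for s in suspects):
        return Stage.REPAIR   # nothing faulted, but something is still isolated
    return Stage.RESOLVE      # nothing faulted and nothing isolated
```

For example, a problem with two suspects where one has been acquitted and the other is still faulted stays in the Update stage until the second suspect is also repaired or acquitted.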

The Fault Manager daemon is a Service Management Facility (SMF) service. The svc:/system/fmd service is enabled by default. See Managing System Services in Oracle Solaris 11.2 for more information about SMF services. See the fmd(1M) man page for more information about the Fault Manager daemon.

The fmadm config command shows the name, description, and status of each module in the Fault Manager. These modules diagnose and repair problems on the system. The fmstat command displays additional information about these modules, as shown in Fault Statistics.