Fault Management Overview

The Oracle Solaris Fault Management feature includes the following components:

An architecture for building resilient error handlers
Structured telemetry
Automated diagnostic software
Response agents
Structured messaging

Many parts of the system participate in fault management, including the CPU, memory, and I/O subsystems, Oracle Solaris ZFS, and many device drivers.

The Fault Management Architecture (FMA) can diagnose and manage faults, defects, and alerts:

Faults – A fault is a type of problem where something that used to work no longer does. A fault typically describes a failed hardware component.
Defects – A defect is a type of problem where something never worked. A defect typically describes a software component.
Alerts – An alert is neither a fault nor a defect. An alert can represent a problem or can be simply informational.

Most software problems are defects or are caused by configuration issues. Fault management and system services often interact. For example, a hardware problem might cause services to be stopped or restarted. An SMF service error might cause FMA to report a defect.

Fault Management Architecture

The fault management stack includes error and observation detectors, a diagnosis engine, and response agents.

Error detectors

Error detectors detect errors in the system and perform any immediate, required handling. An error detector issues a well-defined error report (ereport) or informational report (ireport) to a diagnosis engine.

Observation detectors

Observation detectors report conditions in the system that are symptoms of neither faults nor defects. An observation detector issues a well-defined informational report (ireport) that might go to a diagnosis engine or might simply be logged.

Diagnosis engine

The diagnosis engine interprets ereports and ireports and determines whether a fault, defect, or alert should be diagnosed. When such a determination is made, the diagnosis engine issues a suspect list that describes the resource or set of resources that might be the cause of the problem or condition. The resource might have an associated Field Replaceable Unit (FRU), a label, or an Automatic System Reconfiguration Unit (ASRU). An ASRU might be immediately removed from service to mitigate the problem until the FRU is replaced. See Fault Management Glossary for definitions of resource, FRU, label, and ASRU.

When the suspect list includes multiple suspects (for example, if the diagnosis engine cannot isolate a single suspect), each suspect is assigned a probability of being the key suspect. The probabilities in this list sum to 100 percent. Suspect lists are interpreted by response agents.
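As a rough illustration of the suspect list described above, the following Python sketch models suspects with percentage certainties and checks that they sum to 100. The Suspect type, the most_likely_suspect helper, and the resource names are hypothetical and are not part of any Fault Manager interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Suspect:
    resource: str   # hypothetical resource name, for illustration only
    certainty: int  # probability of being the key suspect, in percent


def most_likely_suspect(suspects):
    """Return the suspect with the highest certainty.

    Raises ValueError if the certainties do not sum to 100 percent,
    mirroring the rule stated in the text.
    """
    total = sum(s.certainty for s in suspects)
    if total != 100:
        raise ValueError(f"certainties sum to {total}, expected 100")
    return max(suspects, key=lambda s: s.certainty)


# Two suspects that the diagnosis could not tell apart with confidence.
suspects = [Suspect("cpu-module-0", 67), Suspect("cpu-module-1", 33)]
print(most_likely_suspect(suspects).resource)  # prints "cpu-module-0"
```

A response agent would act on the whole list rather than only the most likely entry. The helper only shows how the probabilities relate to one another.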

Response agents

Response agents attempt to take action based on the suspect list. Responses include logging messages, taking CPU strands offline, retiring memory pages, and retiring I/O devices.

When specific hardware faults occur, Oracle Auto Service Request (ASR) can automatically open an Oracle service request. See the Oracle Auto Service Request (ASR) support document for more information.

Error detectors, observation detectors, diagnosis engines, and response agents are connected by the Fault Manager daemon, fmd, which acts as a multiplexor between the various components, as shown in the following figure.

Figure 1 Fault Management Architecture Components

[Figure: relationships between the Fault Manager daemon, error detectors, alerts, diagnosis engines, and response agents]
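The multiplexor role can be pictured as a small publish and subscribe hub. The following Python sketch is only a conceptual model, not how fmd is implemented. The Multiplexor class, the prefix-matching rule, the ereport.example.error class name, and the payload are all assumptions made for illustration.

```python
from collections import defaultdict


class Multiplexor:
    """Toy event hub: routes events to handlers by class-name prefix."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, prefix, handler):
        self._subscribers[prefix].append(handler)

    def publish(self, event_class, payload):
        """Deliver the event to every handler whose prefix matches."""
        delivered = 0
        for prefix, handlers in self._subscribers.items():
            if event_class == prefix or event_class.startswith(prefix + "."):
                for handler in handlers:
                    handler(event_class, payload)
                    delivered += 1
        return delivered


mux = Multiplexor()
seen = []


def response_agent(event_class, payload):
    # Stands in for a response agent that reacts to list events.
    seen.append(("agent", event_class))


def diagnosis_engine(event_class, payload):
    # Stands in for a diagnosis engine: consumes an error report and
    # publishes a suspect list back through the same hub.
    seen.append(("engine", event_class))
    mux.publish("list.suspect", {"uuid": payload["uuid"]})


mux.subscribe("list", response_agent)
mux.subscribe("ereport", diagnosis_engine)

# An error detector publishes a (hypothetical) error report.
mux.publish("ereport.example.error", {"uuid": "hypothetical-uuid"})
print(seen)
```

The point of the sketch is that detectors, engines, and agents never call each other directly. Each one only publishes to or subscribes through the hub.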

Lifecycle of a Problem or Condition Managed by the Fault Manager

The lifecycle of a problem or condition managed by the Fault Manager can include the following stages. Each of these lifecycle state changes is associated with the publication of a unique list event.

Diagnose: A new diagnosis has been made by the Fault Manager. The diagnosis includes a list of one or more suspects. A list.suspect event is published. The diagnosis is identified by a UUID in the event payload, and further events describing the resolution lifecycle of this diagnosis quote a matching UUID.
Isolate: A suspect has been automatically isolated to prevent further errors from occurring. A list.isolated event is published. For example, a CPU or disk has been offlined.
Update: One or more of the suspect resources in a problem diagnosis has been repaired, replaced, or acquitted, or the resource has faulted again. A list.updated event is published. The suspect list still contains at least one faulted resource. A repair might have been made by running an fmadm command, or the system might have detected a repair such as a changed serial number for a part. The fmadm command is described in Repairing Faults and Defects and Clearing Alerts.
Repair: All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted. A list.repaired event is published. Some or all of the resources might still be isolated.
Resolve: All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted and are no longer isolated. A list.resolved event is published. For example, a CPU that was a suspect and was offlined is now back online again. Offlining and onlining resources is usually automatic.
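The mapping between lifecycle stages and list events can be summarized in a short Python sketch. The event names come from the list above. The ListEvent enum, the helper functions, and the placeholder UUID strings are hypothetical and only show how events that quote the same UUID belong to one diagnosis.

```python
from enum import Enum


class ListEvent(Enum):
    """Lifecycle stage -> list event published at that stage."""
    DIAGNOSE = "list.suspect"
    ISOLATE = "list.isolated"
    UPDATE = "list.updated"
    REPAIR = "list.repaired"
    RESOLVE = "list.resolved"


def lifecycle_by_uuid(events):
    """Group (uuid, event_class) pairs into a per-diagnosis history."""
    history = {}
    for uuid, event_class in events:
        history.setdefault(uuid, []).append(ListEvent(event_class))
    return history


def is_resolved(stages):
    """A diagnosis is finished once a list.resolved event was seen."""
    return ListEvent.RESOLVE in stages


# Placeholder UUIDs; real events carry a full UUID in the payload.
events = [
    ("uuid-a", "list.suspect"),
    ("uuid-b", "list.suspect"),
    ("uuid-a", "list.isolated"),
    ("uuid-a", "list.repaired"),
    ("uuid-a", "list.resolved"),
]

history = lifecycle_by_uuid(events)
for uuid, stages in history.items():
    state = "resolved" if is_resolved(stages) else "open"
    print(uuid, [s.name for s in stages], state)
```

Not every diagnosis passes through every stage. For example, the sketch accepts a history with no Update event, which matches the statement that the lifecycle "can include" these stages.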

The Fault Manager daemon is a Service Management Facility (SMF) service. The svc:/system/fmd service is enabled by default. See Managing System Services in Oracle Solaris 11.4 for more information about SMF services. See the fmd(8) man page for more information about the Fault Manager daemon.

The fmadm config command shows the name, description, and status of each module in the Fault Manager. These modules diagnose problems, isolate resources, generate notifications, and automatically repair problems in the system. The fmstat command displays additional information about these modules, as shown in Fault Manager and Module Statistics.

Fault Management Glossary

ASRU: An Automatic System Reconfiguration Unit (ASRU) is associated with a resource and is the hardware or software component in the system that can be disabled to mitigate the effects of problems in the resource. For example, a CPU thread is an ASRU that can be offlined in response to a CPU fault. An ASRU can also be a hardware or software component in the system whose service state is impacted by the fault. The ASRU is named in the Affects field in fmadm list or fmdump -v output.

chassis: A chassis is associated with an FRU and identifies where the FRU resides. To replace an FRU, you must know the chassis location and the FRU location within that chassis. The chassis location can be /SYS for the main system chassis, chassis_name.chassis_serial_number for an external chassis, or a user-defined alias for the chassis. See also label below.

diagnosis class: The diagnosis class is an identifier of the form sub-class1.sub-class2...sub-classN that uniquely identifies the type of fault, defect, or alert event associated with a diagnosis. The diagnosis class is also called the problem class.
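The dotted form of a diagnosis class is easy to take apart. The following Python sketch splits a class into its sub-classes. The split_diagnosis_class helper and the class string fault.example.component are hypothetical. Treating the first sub-class as the problem type (fault, defect, or alert) is an assumption made here for illustration and is not stated in this glossary entry.

```python
def split_diagnosis_class(diagnosis_class):
    """Split sub-class1.sub-class2...sub-classN into its sub-classes."""
    parts = diagnosis_class.split(".")
    if any(not part for part in parts):
        raise ValueError(f"empty sub-class in {diagnosis_class!r}")
    return parts


# Hypothetical class string, not taken from a real system.
parts = split_diagnosis_class("fault.example.component")

# Assumption: the leading sub-class names the problem type.
problem_type = parts[0]
print(problem_type, parts[1:])  # prints: fault ['example', 'component']
```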

FMRI: A Fault Management Resource Identifier (FMRI) is used to identify resources, FRUs, and ASRUs. FMRIs have a scheme and a scheme-specific syntax. See the fmri(7) man page for more information. You can see FMRIs by using the fmdump -v command.
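Because every FMRI has a scheme followed by scheme-specific syntax, a first step in handling one is to separate the two. The following Python sketch does only that. The split_fmri helper is hypothetical, and it assumes the scheme is the text before the first colon. The fmri(7) man page remains the authority on the syntax of each scheme.

```python
def split_fmri(fmri):
    """Return (scheme, scheme_specific_part) for an FMRI string.

    Assumes the scheme is everything before the first colon; the rest
    is left untouched because its syntax depends on the scheme.
    """
    scheme, sep, rest = fmri.partition(":")
    if not sep or not scheme:
        raise ValueError(f"no scheme found in {fmri!r}")
    return scheme, rest


# The service name used for the Fault Manager daemon in this document.
print(split_fmri("svc:/system/fmd"))  # prints ('svc', '/system/fmd')
```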

FRU: A Field Replaceable Unit (FRU) is associated with a resource and is the hardware or software component in the system that can be replaced or repaired to fix a problem. For example, a CPU module is an FRU that can be replaced in response to a CPU fault.

label: A label is associated with an FRU and identifies the physical marking on the hardware that can be used to locate a specific FRU within a chassis. See also chassis above. Location fields in fmdump and fmadm list command output give the /dev/chassis path, which is a combination of the chassis and a label, or possibly a hierarchical set of labels. See the Location fields in the examples in Displaying Fault, Defect, and Alert Information. For more information about the /dev/chassis path, see the devchassis(4FS) man page.

resource: A resource is a physical or abstract entity in the system against which diagnoses can be made.