Lifecycle of a Problem or Condition Managed By the Fault Manager
The lifecycle of a problem or condition managed by the Fault Manager can include the following stages. Each of these lifecycle state changes is associated with the publication of a unique list event.
- Diagnose – A new diagnosis has been made by the Fault Manager. The diagnosis includes a list of one or more suspects. A list.suspect event is published. The diagnosis is identified by a UUID in the event payload, and further events describing the resolution lifecycle of this diagnosis quote a matching UUID.
- Isolate – A suspect has been automatically isolated to prevent further errors from occurring. A list.isolated event is published. For example, a CPU core or memory page has been offlined.
- Update – One or more of the suspect resources in a problem diagnosis has been repaired, replaced, or acquitted, or the resource has faulted again. A list.updated event is published. The suspect list still contains at least one faulted resource. A repair might have been made by executing an fmadm command, or the system might have detected a repair such as a changed serial number for a part. The fmadm command is described in Repairing Faults and Defects and Clearing Alerts.
- Repair – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted. A list.repaired event is published. Some or all of the resources might still be isolated.
- Resolve – All of the suspect resources in a diagnosis have been repaired, resolved, or acquitted and are no longer isolated. A list.resolved event is published. For example, a CPU core that was a suspect and was offlined is now back online again. Offlining and onlining resources is usually automatic.
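The relationship between list events and lifecycle stages can be sketched in a few lines of Python. This is an illustration only, not part of the Fault Manager: the DiagnosisHistory class, the track function, and the UUID strings are hypothetical, and the sketch assumes each event has already been reduced to a (UUID, event class) pair.

```python
from dataclasses import dataclass, field

# Event class -> lifecycle stage, taken from the list above.
STAGE_BY_EVENT = {
    "list.suspect": "Diagnose",
    "list.isolated": "Isolate",
    "list.updated": "Update",
    "list.repaired": "Repair",
    "list.resolved": "Resolve",
}


@dataclass
class DiagnosisHistory:
    """Stages seen so far for one diagnosis UUID (hypothetical helper)."""
    uuid: str
    stages: list = field(default_factory=list)


def track(events):
    """Group (uuid, event_class) pairs into per-diagnosis stage histories."""
    histories = {}
    for uuid, event_class in events:
        stage = STAGE_BY_EVENT.get(event_class)
        if stage is None:
            continue  # not one of the five list events; ignore it
        history = histories.setdefault(uuid, DiagnosisHistory(uuid))
        history.stages.append(stage)
    return histories


# Made-up UUIDs for illustration; real UUIDs come from the event payload.
events = [
    ("uuid-A", "list.suspect"),
    ("uuid-A", "list.isolated"),
    ("uuid-B", "list.suspect"),
    ("uuid-A", "list.repaired"),
    ("uuid-A", "list.resolved"),
]
histories = track(events)
```

Because every later event quotes the UUID of the original diagnosis, keying the histories on the UUID is enough to follow one problem from Diagnose through Resolve while other diagnoses are still open.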
The Fault Manager daemon is a service enabled by default when using the Oracle Hardware Management Pack installer. See the fmd man page for more information about the Fault Manager daemon.
The fmadm config command shows the name, description, and status of each module in the Fault Manager. These modules diagnose, isolate resources, generate notifications, and auto-repair problems in the system.
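If the module list needs to be processed by a script, the output of fmadm config can be split into fields. The following Python sketch assumes the output consists of a header row followed by one row per module, with whitespace-separated MODULE, VERSION, and STATUS columns and a free-text DESCRIPTION. The sample text is illustrative, not captured from a real system, and parse_fmadm_config is a hypothetical helper name. Verify the column layout on your own system before relying on it.

```python
def parse_fmadm_config(text):
    """Parse 'fmadm config' style output into a list of dicts.

    Assumes a header row, then rows of: MODULE VERSION STATUS DESCRIPTION.
    """
    modules = []
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        parts = line.strip().split(None, 3)  # description may contain spaces
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        name, version, status = parts[:3]
        description = parts[3] if len(parts) > 3 else ""
        modules.append({
            "name": name,
            "version": version,
            "status": status,
            "description": description,
        })
    return modules


# Illustrative sample only; actual module names and versions vary by system.
SAMPLE = """\
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-retire            1.1     active  CPU/Memory Retire Agent
eft                      1.16    active  eft diagnosis engine
"""

modules = parse_fmadm_config(SAMPLE)
inactive = [m["name"] for m in modules if m["status"] != "active"]
```

On a live system, the text would typically come from running the command with subprocess.run(["fmadm", "config"], capture_output=True, text=True), which usually requires administrative privileges.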