Fault Management Architecture Overview

The Oracle Linux Fault Management Architecture (FMA) supplements the existing Linux error detection and recovery mechanisms. It lets system administrators on the host view, act on, and clear faults detected by the Linux kernel. It uses the Oracle ILOM fault manager on the service processor to diagnose CPU and memory errors captured from the host and records each diagnosis in a standard fault format that is stored in a fault management database.

This database contains all detected faults (both those captured by Oracle Linux FMA and those captured by Oracle ILOM FMA) and is maintained on both the host and the service processor.

In the Oracle Linux operating system, CPU and memory errors are generated at the kernel level as machine check events. These events are stored in the Linux mcelog database. The Linux mcelog daemon, mcelogd, retrieves the errors stored in the database and converts them to human-readable messages that are written to the console, to the mcelog file (/var/log/mcelog), and to the Linux system log. The mcelog daemon also takes action based on a set of rules stored in a configuration file. For example, it might retire a memory page from service because the page contains uncorrectable errors.

The messages logged by mcelog might not contain enough information to identify a bad component (such as a memory DIMM). The Oracle Linux FMA Fault Manager daemon, fmd, scans the mcelog file, retrieves the errors stored there, and translates them into the ereport format supported by Oracle ILOM. It then forwards each ereport to the service processor using the internal Host-to-ILOM interconnect port. The Oracle ILOM fault manager uses the ereport to diagnose the fault. Oracle ILOM then logs the fault in its own fault management database and sends a copy to the fault management database that resides on the Linux host.
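
The translate-and-forward step described above can be sketched as follows. This is only an illustration under stated assumptions: the MceRecord and Ereport types, their field names, the "ereport.example.mce" class prefix, and the send callable are all hypothetical and do not reflect the actual fmd data structures or the Oracle ILOM ereport schema. The only hardware detail used is the UC (uncorrected) flag, bit 61 of the x86 machine check status register.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Bit 61 (UC) of an x86 machine check status register marks an
# uncorrected error.
MCI_STATUS_UC = 1 << 61


@dataclass
class MceRecord:
    """Hypothetical, simplified view of one decoded machine check event."""
    cpu: int
    bank: int
    status: int
    address: int


@dataclass
class Ereport:
    """Hypothetical ereport: an event class plus a flat payload."""
    event_class: str
    payload: Dict[str, int] = field(default_factory=dict)


def to_ereport(record: MceRecord) -> Ereport:
    """Translate one machine check record into an (illustrative) ereport."""
    severity = "ue" if record.status & MCI_STATUS_UC else "ce"
    return Ereport(
        event_class=f"ereport.example.mce.{severity}",  # made-up class name
        payload={
            "cpu": record.cpu,
            "bank": record.bank,
            "status": record.status,
            "address": record.address,
        },
    )


def forward(ereport: Ereport, send: Callable[[Ereport], None]) -> None:
    """Hand the ereport to a transport; 'send' stands in for the
    Host-to-ILOM interconnect, which is not modelled here."""
    send(ereport)
```

In this sketch the transport is passed in as a plain callable, so the translation logic can be exercised without a service processor, for example by passing the append method of a list as send.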

Using this method, all system hardware faults in the database can be viewed and acted on using a similar set of fault management commands whether from the host OS or from Oracle ILOM.


Graphic showing how Linux FMA works.

Oracle server platforms running Oracle Linux include error detectors, diagnosis engines, and response agents. Error detectors and response agents reside on the Oracle Linux host. The diagnosis engines reside on the server's service processor.

  • Error detectors – These detect errors in the system and perform any immediate, required handling. They also generate well-defined error reports, or ereports, and send them to a diagnosis engine. In Linux, the mcelog daemon detects errors, and the Oracle Linux Fault Management software collects them, reformats them into ereports, and forwards the ereports to the service processor for fault diagnosis.

  • Diagnosis engines – A set of diagnosis engines located on the service processor interprets ereports and determines whether a fault or defect is present. When such a determination is made, the diagnosis engine creates a suspect list that describes the resource or set of resources that might be the cause of the problem. The resource might or might not have an associated field-replaceable unit (FRU) or a label.

    When the diagnosis engine cannot isolate a single suspect, the suspect list includes multiple suspects, and each suspect is assigned a probability of being the actual cause. The probabilities in this list add up to 100 percent.

    Error detectors and diagnosis engines are connected by the Fault Manager daemon on the service processor, which acts as a multiplexer between the various components, as shown in the following figure.


    Graphic showing the interrelationship between the Fault Manager daemon, the error detectors, and the diagnosis engines.
  • Response agents – These agents attempt to take action based on the type of error. On the host side, the mcelog daemon acts as the response agent. Responses include logging messages and retiring memory pages.
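
The suspect-list rule described under diagnosis engines (several suspects, each with a probability, all adding up to 100 percent) can be illustrated with a short sketch. The Suspect type, the build_suspect_list function, the integer weights, and the resource names are hypothetical and chosen only for illustration; the source does not describe how the Oracle ILOM diagnosis engines actually compute these values.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Suspect:
    resource: str   # illustrative resource name, for example "DIMM0"
    certainty: int  # percent


def build_suspect_list(weights: Dict[str, int]) -> List[Suspect]:
    """Scale positive integer weights to whole percentages that sum to 100.

    Uses largest-remainder rounding: every suspect gets the floor of its
    share, and the leftover points go to the largest remainders.
    """
    if not weights:
        raise ValueError("at least one suspect is required")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive integers")

    total = sum(weights.values())
    shares = {name: (w * 100) // total for name, w in weights.items()}
    leftover = 100 - sum(shares.values())

    # Largest remainder first; ties are broken by name for determinism.
    by_remainder = sorted(
        weights, key=lambda name: (-((weights[name] * 100) % total), name)
    )
    for name in by_remainder[:leftover]:
        shares[name] += 1

    # Most likely suspect first.
    ranked = sorted(shares.items(), key=lambda item: (-item[1], item[0]))
    return [Suspect(resource=name, certainty=pct) for name, pct in ranked]
```

With a single suspect the function returns one entry at 100 percent, which corresponds to the case where a single suspect has been isolated.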

The Oracle Linux Fault Manager daemon, fmd, is itself a service. The service can be enabled and controlled as a scriptless daemon, or by using init.d scripts for greater manageability. Fault management commands supported in this version of Oracle Linux FMA include:

  • fmadm – Used by administrators and service personnel to view and clear faults maintained by the Oracle Linux Fault Manager, fmd.

  • fmdump – Used to display the contents of any of the log files associated with the Oracle Linux Fault Manager, fmd.
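
For scripted use, the two commands could be wrapped as in the sketch below. The helper names fma_command and run_if_available are hypothetical. The sketch also assumes that fmadm accepts a faulty subcommand to list active faults, as it does in other Oracle fault management implementations; confirm this against the fmadm man page for your installed version. No output format is assumed, so the raw text is returned unparsed.

```python
import shutil
import subprocess
from typing import List, Optional

ALLOWED_TOOLS = ("fmadm", "fmdump")


def fma_command(tool: str, *args: str) -> List[str]:
    """Build an argument vector for one of the two FMA commands."""
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unsupported tool: {tool}")
    return [tool, *args]


def run_if_available(argv: List[str]) -> Optional[str]:
    """Run the command and return its standard output as text.

    Returns None when the executable is not on PATH, for example on a
    host where Oracle Linux FMA is not installed.
    """
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Assumption: "faulty" is a valid fmadm subcommand on this system.
    output = run_if_available(fma_command("fmadm", "faulty"))
    print(output if output is not None else "fmadm not found on this host")
```

These commands typically require root privileges, so a script like this would normally be run by an administrator.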