Fault Management Architecture Terms

The following table contains descriptions for Fault Management Architecture terms used in this document.

Term Description

CRU

A CRU is a customer-replaceable unit (such as a memory DIMM).

Diagnosis class

The diagnosis class is an identifier of the form sub-class1.sub-class2...sub-classN that uniquely identifies the type of fault, defect, or alert event associated with a diagnosis. The diagnosis class is also called the problem class.
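The dotted form described above can be illustrated with a short sketch that splits a diagnosis class into its ordered sub-classes. The function name and the class string "fault.memory.dimm" are hypothetical and chosen only for illustration; they are not taken from an actual FMA event catalog.

```python
from typing import List


def split_diagnosis_class(diagnosis_class: str) -> List[str]:
    """Split a dotted diagnosis class into its ordered sub-classes."""
    parts = diagnosis_class.split(".")
    # Reject empty sub-classes such as "fault..dimm" or an empty string.
    if not all(parts):
        raise ValueError(f"empty sub-class in {diagnosis_class!r}")
    return parts


# Hypothetical class string, used only for illustration.
example_class = "fault.memory.dimm"
sub_classes = split_diagnosis_class(example_class)
print(sub_classes)
```

Reading the sub-classes from left to right moves from the most general category to the most specific one.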

Diagnosis engines

Oracle Linux FMA uses diagnosis engines that reside on the service processor to process hardware event ereports, including those generated by mcelog. For a list of the diagnosis engines that the fault management architecture supports for Oracle ILOM, see the Oracle ILOM documentation.

Error report (Ereport)

Error reports describe error events. They include raw device and error type information so that the fault manager can diagnose the error and create an appropriate fault diagnosis message.

Fault

A fault indicates that a hardware component is present but is unusable or degraded because one or more problems have been diagnosed by the fault manager. The component has been disabled to prevent further damage to the system.

Fault case

When a problem is diagnosed, the fault manager logs a fault diagnosis message containing a case ID (represented by a UUID) that references the problem.

FRU

A FRU is a field-replaceable unit (such as a processor).

Label

A label is a location string (also called a FRU label), such as "/SYS/MB/P1", which represents processor 1 on the motherboard of the system. The quoted value is intended to match the label on the physical hardware and the label shown in Oracle ILOM.
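As a minimal sketch, a label such as "/SYS/MB/P1" can be read as a slash-separated path from the system down to the component. The helper names below are hypothetical, and the sketch assumes only the path-like form shown in the example above.

```python
from typing import List


def label_components(label: str) -> List[str]:
    """Return the path components of a FRU label, outermost first."""
    if not label.startswith("/"):
        raise ValueError(f"label must start with '/': {label!r}")
    return [part for part in label.split("/") if part]


def parent_label(label: str) -> str:
    """Return the label of the enclosing component (assumed path semantics)."""
    components = label_components(label)
    return "/" + "/".join(components[:-1])


print(label_components("/SYS/MB/P1"))
print(parent_label("/SYS/MB/P1"))
```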

Machine check events

Platform errors detected by the hardware and reported to the OS. A reported error might be correctable or uncorrectable, recoverable or fatal. In Linux, the mcelog daemon captures these errors.

mcelog

mcelog provides error handling and predictive failure analysis on x86 Linux systems. The mcelog daemon processes CPU and memory machine check events and executes actions based on configurable error thresholds. A range of actions can be configured, including bad memory page retirement, CPU core offlining, and automatic cache error handling. User-defined actions can also be configured.

Oracle Linux FMA captures errors processed by mcelog and stored in the mcelog log file, converts them to a standard Oracle fault format, and adds them to a synced fault management database available on both the host and Oracle ILOM.
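The idea of running an action once a configurable error threshold is crossed can be sketched as a sliding-window counter. This is a generic illustration only: the class name, parameters, and window logic are assumptions and do not reproduce mcelog's actual accounting or configuration syntax.

```python
from collections import deque


class ErrorThreshold:
    """Flag when more than max_errors occur within window_seconds."""

    def __init__(self, max_errors: int, window_seconds: float) -> None:
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp: float) -> bool:
        """Record one error; return True if the threshold is now exceeded."""
        self._timestamps.append(timestamp)
        # Discard errors that have fallen out of the time window.
        cutoff = timestamp - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors


# Hypothetical policy: act on more than 2 errors within 10 seconds.
threshold = ErrorThreshold(max_errors=2, window_seconds=10.0)
results = [threshold.record(t) for t in (0.0, 1.0, 2.0, 20.0)]
print(results)
```

In a real deployment, a result of True would be the point at which a configured action, such as page retirement, is triggered.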

Page retirement

A kernel facility in newer Linux releases in which an OS memory page that corresponds to a defective physical memory location is removed from service, if possible. This feature helps increase system availability.
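To show how a defective physical memory location relates to a memory page, the following sketch maps a physical address to its page frame number and to the address range of that page. It assumes a 4 KiB page size, which is common on x86 but is not stated in this document; the function names and the example address are hypothetical.

```python
PAGE_SIZE = 4096  # Assumption: 4 KiB pages.


def page_frame_number(phys_addr: int) -> int:
    """Return the number of the page that contains phys_addr."""
    return phys_addr // PAGE_SIZE


def page_address_range(phys_addr: int) -> tuple:
    """Return the first and last byte address of the containing page."""
    start = page_frame_number(phys_addr) * PAGE_SIZE
    return start, start + PAGE_SIZE - 1


# Hypothetical address of a defective memory location.
bad_addr = 0x12345678
pfn = page_frame_number(bad_addr)
first, last = page_address_range(bad_addr)
print(hex(pfn), hex(first), hex(last))
```

Every address in the returned range belongs to the same page, so retiring that page removes the defective location from use.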

Proactive self-healing

Proactive self-healing is a fault management architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions. Proactive self-healing reduces the time required to debug a hardware or software problem and provides the system administrator or Oracle Services personnel with detailed data about each fault. The architecture consists of the Linux mcelog event management protocol, the Fault Manager, and service processor-based diagnosis engines that convert errors received from the host OS into standard FMA fault cases.

Resource

A resource is a physical or abstract entity in the system against which diagnoses can be made.

Service processor (SP)

Most Oracle servers ship with a service processor that controls chassis functions such as power budgeting and control, system health monitoring, and FMA activities including error reporting and fault diagnosis.

Universal unique identifier (UUID)

A UUID is used to uniquely identify a problem across any set of systems.
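As an illustration, the following sketch uses the Python standard library uuid module to generate a UUID and to check whether a string is well formed. The UUID version that the fault manager uses is not stated here, so the random (version 4) UUID below is an assumption made only for the example.

```python
import uuid


def is_valid_uuid(text: str) -> bool:
    """Return True if text parses as a UUID."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


# Assumption: a random version 4 UUID stands in for a real case ID.
case_id = str(uuid.uuid4())
print(case_id, is_valid_uuid(case_id))
print(is_valid_uuid("not-a-uuid"))
```

Note that uuid.UUID also accepts some alternative spellings, such as a string without hyphens, so this check is more lenient than a strict format match.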