|
|
CRU
|
A CRU is a customer-replaceable unit (such as a memory
DIMM).
|
Diagnosis engines
|
Oracle Linux FMA utilizes diagnosis engines that reside on the
service processor to process hardware event ereports, including
those generated by mcelog. For a list of diagnosis engines supported
in the fault management architecture for Oracle ILOM, see the Oracle
ILOM documentation.
|
Error report (Ereport)
|
Error reports describe error events. They include raw device and
error type information so that the fault manager can diagnose the
error and create an appropriate fault diagnosis message.
|
Fault
|
A fault indicates that a hardware component is present but is
unusable or degraded because one or more problems have been
diagnosed by the fault manager. The component has been disabled to
prevent further damage to the system.
|
Fault case
|
When a problem is diagnosed, the fault manager logs a fault
diagnosis message that contains a case ID (represented by a UUID),
which references the problem.
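
As a rough illustration of a case ID represented by a UUID, the sketch below builds a record with a freshly generated identifier. The `FaultCase` class and its field names are hypothetical and do not reflect the actual Oracle Linux FMA message format; only Python's standard `uuid` module is assumed.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FaultCase:
    """Hypothetical fault case record; not the real FMA data layout."""
    fru_label: str    # location string of the suspect component
    description: str  # human-readable diagnosis text
    # Each new case receives its own randomly generated UUID.
    case_id: uuid.UUID = field(default_factory=uuid.uuid4)


case = FaultCase("/SYS/MB/P1", "example processor fault")
print(case.case_id)  # 36-character string, different on every run
```

Because the identifier is generated independently for each case, two cases opened on different systems can be referenced without ambiguity.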
|
FRU
|
A FRU is a field-replaceable unit (such as a processor).
|
Label
|
A location string (also called a FRU label) such as "/SYS/MB/P1",
which represents processor 1 on the system motherboard. The quoted
value is intended to match both the label on the physical hardware
and the label shown in Oracle ILOM.
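
A location string can be treated as a slash-separated path. The sketch below splits one into its segments. The helper name `parse_fru_label` is hypothetical, and it assumes that every label is an absolute path in the style of the example above.

```python
def parse_fru_label(label):
    """Split a location string such as "/SYS/MB/P1" into path segments."""
    if not label.startswith("/"):
        raise ValueError(f"expected an absolute location string, got {label!r}")
    # Drop the empty string produced by the leading slash.
    return [segment for segment in label.split("/") if segment]


segments = parse_fru_label("/SYS/MB/P1")
print(segments)  # ['SYS', 'MB', 'P1']
```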
|
Machine check events
|
Platform errors detected by the hardware and reported to the OS.
A reported error might be correctable or uncorrectable, and
recoverable or fatal. On Linux, mcelog captures these errors.
|
mcelog
|
mcelog provides error handling and predictive failure analysis in
x86 Linux systems. The mcelog daemon processes CPU and memory
machine check events and executes actions based on configurable
error thresholds. A range of actions can be configured, including
bad memory page retirement, CPU core offlining, and automatic cache
error handling. User-defined actions can also be configured.
Oracle Linux FMA captures errors processed by mcelog and stored in
the mcelog log file, converts them to a standard Oracle fault
format, and adds them to a synced fault management database
available on both the host and Oracle ILOM.
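
The idea of acting on a configurable error threshold can be sketched as a counter over a sliding time window. This is an illustrative model only, not mcelog's implementation or configuration syntax; the `ErrorThreshold` class and its parameters are hypothetical.

```python
from collections import deque


class ErrorThreshold:
    """Report when `limit` errors fall within the last `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one error; return True if the threshold is now reached."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Discard errors that are too old to count toward the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.limit


threshold = ErrorThreshold(limit=3, window_seconds=60)
results = [threshold.record(t) for t in (0, 10, 20, 100)]
print(results)  # [False, False, True, False]
```

In this model, a `True` result is the point at which a configured action, such as retiring a memory page, would be triggered.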
|
Page retirement
|
A kernel facility in newer Linux releases that removes from
service, where possible, an OS memory page corresponding to a
defective physical memory location. This feature helps increase
system availability.
|
Proactive self-healing
|
Proactive self-healing is a fault management architecture and
methodology for automatically diagnosing, reporting, and handling
software and hardware fault conditions. Proactive self-healing
reduces the time required to debug a hardware or software problem
and provides the system administrator or Oracle Services personnel
with detailed data about each fault. The architecture consists of
the Linux mcelog event management protocol, the Fault Manager, and
service processor-based diagnosis engines that convert errors
received from the host OS into standard FMA fault cases.
|
Service processor (SP)
|
Most Oracle servers ship with a service processor that controls
chassis functions such as power budgeting and control, system health
monitoring, and FMA activities including error reporting and fault
diagnosis.
|
Universally unique identifier (UUID)
|
A UUID is used to uniquely identify a problem across any set of
systems.
|