Oracle® ILOM User's Guide for System Monitoring and Diagnostics Firmware Release 3.2.x

Updated: April 2018

Fault Management Terminology

Term
Description
Proactive self-healing
Proactive self-healing is a fault management architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions. Proactive self-healing reduces the time required to debug a hardware or software problem and provides the system administrator or Oracle Services personnel with detailed data about each fault. The architecture consists of an event management protocol, the Fault Manager, fault-handling agents, and diagnosis engines.
Diagnosis engines
The fault management architecture in Oracle ILOM includes diagnosis engines that broadcast fault events for detected system errors. For a list of the diagnosis engines supported in the fault management architecture for Oracle ILOM, see fmstat Report Example and Description.
Health states
Oracle ILOM associates a health state with every resource for which telemetry information has been received. The possible states presented in Oracle ILOM interfaces are:
  • ok – The hardware resource is present in the chassis and in use. No known problems have been detected.

  • unknown – The hardware resource is not present or not usable, but no known problems are detected. This management state can indicate that the suspect resource is disabled by the system administrator.

  • faulted – The hardware resource is present in the chassis but is unusable because one or more problems have been detected. The hardware resource is disabled (offline) to prevent further damage to the system.

  • degraded – The hardware resource is present and usable, but one or more problems have been detected. If all affected hardware resources are in the same state, this status is reflected in the event message at the end of the list. Otherwise, a separate health state is provided for each affected resource.
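
The four health states can be modeled as a small enumeration, which makes the usable/unusable and problem/no-problem distinctions explicit. The following is a minimal sketch based only on the descriptions above; the names HealthState, is_usable, has_detected_problem, and parse_health_state are hypothetical and are not part of any Oracle ILOM interface.

```python
from enum import Enum


class HealthState(Enum):
    """The four health states listed above, keyed by their displayed names."""
    OK = "ok"
    UNKNOWN = "unknown"
    FAULTED = "faulted"
    DEGRADED = "degraded"


def is_usable(state: HealthState) -> bool:
    # Per the descriptions above, only ok and degraded resources are usable.
    return state in (HealthState.OK, HealthState.DEGRADED)


def has_detected_problem(state: HealthState) -> bool:
    # faulted and degraded both mean one or more problems have been detected.
    return state in (HealthState.FAULTED, HealthState.DEGRADED)


def parse_health_state(text: str) -> HealthState:
    # Hypothetical helper: maps a displayed state string to the enum.
    # Raises ValueError for any string outside the four listed states.
    return HealthState(text.strip().lower())
```

A script that reads health states as text could call parse_health_state on each value and then use has_detected_problem to decide which resources need attention.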

Fault
A fault indicates that a hardware component is present but is unusable or degraded because one or more problems have been diagnosed by the Oracle ILOM Fault Manager. The component has been disabled to prevent further damage to the system.
Managed device
A managed device can be an Oracle rackmounted server, blade server, or blade chassis.
FRU
A FRU is a field-replaceable unit (such as a drive, memory DIMM, or printed circuit board).
CRU
A CRU is a customer-replaceable unit (such as a NEM in an Oracle blade chassis).
Universal unique identifier (UUID)
A UUID is used to uniquely identify a problem across any set of systems.
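
To illustrate why a UUID is useful as a problem identifier, the sketch below keys problem records from more than one system by UUID so that a single lookup finds the right record. It is an illustration only: how Oracle ILOM generates its UUIDs is not described here, so a random version-4 UUID from the Python standard library is used as a stand-in, and the record fields and the lookup helper are hypothetical.

```python
import uuid

# Stand-in for an identifier assigned to a diagnosed problem (assumption:
# a random version-4 UUID; the real generation scheme is not described here).
problem_uuid = uuid.uuid4()

# Hypothetical record store keyed by the UUID string.
problems = {
    str(problem_uuid): {"system": "server-a", "component": "example-dimm"},
}


def lookup(problems_by_uuid, uuid_text):
    # Normalize through uuid.UUID so that letter case does not matter.
    key = str(uuid.UUID(uuid_text))
    return problems_by_uuid.get(key)
```

Because the identifier is unique across systems, records collected from several managed devices can share one store without key collisions.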