Fault Management Terminology

Term Description

Proactive self-healing

Proactive self-healing is a fault management architecture and methodology for automatically diagnosing, reporting, and handling software and hardware fault conditions. Proactive self-healing reduces the time required to debug a hardware or software problem and provides the system administrator or Oracle Services personnel with detailed data about each fault. The architecture consists of an event management protocol, the Fault Manager, and fault-handling agents and diagnosis engines.

Diagnosis engines

The fault management architecture in Oracle ILOM includes diagnosis engines that broadcast fault events for detected system errors. For a list of the diagnosis engines that this architecture supports, see fmstat Report Example and Description.

Health states

Oracle ILOM associates the following health states with every resource for which telemetry information has been received. The possible states presented in Oracle ILOM interfaces include:

  • ok – The hardware resource is present in the chassis and in use. No known problems have been detected.
  • unknown – The hardware resource is not present or not usable, but no known problems have been detected. This health state can indicate that the resource has been disabled by the system administrator.
  • faulted – The hardware resource is present in the chassis but is unusable because one or more problems have been detected. The hardware resource is disabled (offline) to prevent further damage to the system.
  • degraded – The hardware resource is present and usable, but one or more problems have been detected. If all affected hardware resources are in the same state, this status is reflected in the event message at the end of the list. Otherwise, a separate health state is provided for each affected resource.
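The four health states above can be summarized in a small lookup structure. The following is only an illustrative sketch in Python, not part of Oracle ILOM: the names HealthState, is_usable, and has_known_problem are hypothetical, and the usable/problem classification is taken directly from the descriptions in the list.

```python
from enum import Enum


class HealthState(Enum):
    """Health states as described in the list above (hypothetical type name)."""
    OK = "ok"
    UNKNOWN = "unknown"
    FAULTED = "faulted"
    DEGRADED = "degraded"


def is_usable(state: HealthState) -> bool:
    # Per the descriptions above, only "ok" and "degraded" resources are usable.
    return state in (HealthState.OK, HealthState.DEGRADED)


def has_known_problem(state: HealthState) -> bool:
    # "faulted" and "degraded" both mean one or more problems were detected.
    return state in (HealthState.FAULTED, HealthState.DEGRADED)
```

For example, HealthState("degraded") is usable and has a known problem, whereas HealthState("unknown") is neither usable nor associated with a known problem.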

Fault

A fault indicates that a hardware component is present but is unusable or degraded because one or more problems have been diagnosed by the Oracle ILOM Fault Manager. An unusable component is disabled to prevent further damage to the system.

FRU

A FRU is a field-replaceable unit (such as a drive, memory DIMM, or printed circuit board).

CRU

A CRU is a customer-replaceable unit.

Universally unique identifier (UUID)

A UUID is used to uniquely identify a problem across any set of systems.
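As a rough illustration of the identifier format only, the sketch below uses Python's standard uuid module to check whether a string copied from a fault report parses as a UUID. Oracle ILOM assigns these identifiers itself; the helper name is_valid_uuid is hypothetical and is not an Oracle ILOM interface.

```python
import uuid


def is_valid_uuid(text: str) -> bool:
    """Return True if text parses as a UUID (hypothetical helper).

    uuid.UUID is lenient: it also accepts forms without hyphens,
    with surrounding braces, or with a "urn:uuid:" prefix.
    """
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True
```

A randomly generated value such as str(uuid.uuid4()) passes this check, while an arbitrary string such as "not-a-uuid" does not.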