3.4 Detecting Component Failures and Self-healing Autonomously
Improved ability to detect component failures and self-heal autonomously improves business continuity.
Cluster Health Monitor introduces a new diagnostic feature that identifies critical component events that indicate pending or actual failures and provides recommendations for corrective action. These actions may sometimes be performed autonomously. Such events and actions are then captured and admins are notified through components such as Oracle Trace File Analyzer.
Terms Associated with Diagnosability
CHMDiag:
CHMDiag
is a python daemon managed by osysmond
that listens for events and takes actions. Upon receiving various events/actions,
CHMDiag
validates them for correctness, does flow control, and
schedules the actions for runs. CHMDiag
monitors each action to its
completion, and kills an action if it takes longer than pre-configured time specific
to that action.
This JSON file describes all events/actions and their respective attributes. All
events/actions have uniquely identifiable IDs. This file also contains various
configurable properties for various actions/events. CHMDiag
loads
this file during its startup.
CRFE API: CRFE API is used by all C clients to send events to
CHMDiag
. This API is used by internal clients like components
(RDBMS/CSS/GIPC) to publish events/actions.
This API also provides support for both synchronous and asynchronous publication of events. Asynchronous publication of events is done through a background thread which will be shared by all CRFE API clients within a process.
CHMDIAG_BASE: This directory resides in
ORACLEB_BASE/hostname/crf/chmdiag
.
This directory path contains following directories, which are populated or managed
by CHMDiag
.
- ActionsResults: Contains all results for all of the invoked actions with a subdirectory for each action.
- EventsLog: Contains a log of all the events/actions received by
CHMDiag
and the location of their respective action results. These log files are also auto-rotated after reaching a fixed size. - CHMDiagLog: Contains
CHMDiag
daemon logs. Log files are auto-rotated and once they reach a specific size. Logs should have sufficient debug information to diagnose any problems thatCHMDiag
could run into. - Config: Contains a run sub-directory for
CHMDiag
processpid
file management.
- oclumon chmdiag description: Use the
oclumon chmdiag description
command to get a detailed description of all the supported events and actions. - oclumon chmdiag query: Use the
oclumon chmdiag query
command to query CHMDiag events/actions sent by various components and generate an HTML or a text report. - oclumon chmdiag collect: Use the
oclumon chmdiag collect
command to collect all events/actions data generated byCHMDiag
into the specified output directory location.
Parent topic: Collecting Operating System Resources Metrics