3.4 Detecting Component Failures and Self-healing Autonomously

An improved ability to detect component failures and self-heal autonomously strengthens business continuity.

Cluster Health Monitor introduces a new diagnostic feature that identifies critical component events indicating pending or actual failures and recommends corrective actions. Some of these actions can be performed autonomously. The events and actions are captured, and administrators are notified through components such as Oracle Trace File Analyzer.

Terms Associated with Diagnosability

CHMDiag: CHMDiag is a Python daemon managed by osysmond that listens for events and takes actions. On receiving events/actions, CHMDiag validates them for correctness, applies flow control, and schedules the actions to run. CHMDiag monitors each action until it completes, and kills an action if it runs longer than the pre-configured time limit specific to that action.
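The time-limit behavior described above can be illustrated with a minimal sketch. This is not CHMDiag's implementation: the function name `run_action`, its return values, and the use of a child process per action are assumptions made for illustration only.

```python
import subprocess
import sys


def run_action(argv, timeout_seconds):
    """Run one action as a child process; kill it if it exceeds its time limit.

    Hypothetical helper, not part of CHMDiag.
    Returns (status, returncode), where status is "completed" or "killed".
    """
    proc = subprocess.Popen(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        # Wait for the action, but no longer than its per-action limit.
        returncode = proc.wait(timeout=timeout_seconds)
        return "completed", returncode
    except subprocess.TimeoutExpired:
        proc.kill()  # the action overran its limit
        proc.wait()  # reap the child so it does not linger as a zombie
        return "killed", proc.returncode


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_action(quick, timeout_seconds=10))
    print(run_action(slow, timeout_seconds=0.5))
```

In this sketch the limit is passed in by the caller; a daemon like the one described would presumably look it up from its per-action configuration.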

CHMDiag JSON Description: This JSON file describes all events/actions and their respective attributes. Each event/action has a uniquely identifiable ID. The file also contains configurable properties for the actions/events. CHMDiag loads this file during startup.
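As a rough illustration of loading such a file and enforcing unique IDs, consider the sketch below. The schema is entirely hypothetical: the keys `events`, `actions`, `id`, `description`, and `timeout_seconds`, and the sample IDs, are assumptions, because the real file's layout is not shown in this section.

```python
import json

# Hypothetical description content; the real file's schema is not documented here.
SAMPLE = """
{
  "events": [
    {"id": "EVT_0001", "description": "example event", "action": "ACT_0001"}
  ],
  "actions": [
    {"id": "ACT_0001", "description": "example action", "timeout_seconds": 60}
  ]
}
"""


def load_descriptions(text):
    """Parse the description text and index every event/action by its ID.

    Raises ValueError if two entries share an ID.
    """
    doc = json.loads(text)
    by_id = {}
    for kind in ("events", "actions"):
        for entry in doc.get(kind, []):
            entry_id = entry["id"]
            if entry_id in by_id:
                raise ValueError(f"duplicate ID: {entry_id}")
            by_id[entry_id] = (kind, entry)
    return by_id


if __name__ == "__main__":
    table = load_descriptions(SAMPLE)
    for entry_id, (kind, entry) in table.items():
        print(entry_id, kind, entry["description"])
```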

CRFE API: The CRFE API is used by all C clients to send events to CHMDiag. Internal client components such as RDBMS, CSS, and GIPC use this API to publish events/actions.

This API supports both synchronous and asynchronous publication of events. Asynchronous publication is handled by a background thread that is shared by all CRFE API clients within a process.
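The CRFE API itself is a C interface and its functions are not listed in this section. The pattern it describes, synchronous publication alongside asynchronous publication through one shared background thread, can be sketched in Python as follows. The class `EventPublisher` and all of its method names are hypothetical and do not correspond to real CRFE API calls.

```python
import queue
import threading


class EventPublisher:
    """Hypothetical publisher: one background thread serves all async callers."""

    _lock = threading.Lock()
    _instance = None

    def __init__(self, send):
        self._send = send  # callable that actually delivers an event
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    @classmethod
    def shared(cls, send):
        """Return the process-wide publisher, creating it on first use.

        The `send` argument is only used by the call that creates the instance.
        """
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(send)
            return cls._instance

    def publish_sync(self, event):
        """Deliver the event on the caller's thread before returning."""
        self._send(event)

    def publish_async(self, event):
        """Queue the event and return immediately."""
        self._queue.put(event)

    def flush(self):
        """Block until every queued event has been handled."""
        self._queue.join()

    def _drain(self):
        while True:
            event = self._queue.get()
            try:
                self._send(event)
            except Exception:
                pass  # a real publisher would report the failure
            finally:
                self._queue.task_done()


if __name__ == "__main__":
    delivered = []
    publisher = EventPublisher.shared(delivered.append)
    publisher.publish_sync("sync-event")
    publisher.publish_async("async-event")
    publisher.flush()
    print(delivered)
```

Sharing one worker per process keeps the thread count constant no matter how many clients publish, at the cost of serializing all asynchronous deliveries through a single queue.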

CHMDIAG_BASE: This directory resides in ORACLE_BASE/hostname/crf/chmdiag. It contains the following directories, which are populated or managed by CHMDiag.

  • ActionsResults: Contains all results for all of the invoked actions with a subdirectory for each action.
  • EventsLog: Contains a log of all the events/actions received by CHMDiag and the location of their respective action results. These log files are also auto-rotated after reaching a fixed size.
  • CHMDiagLog: Contains CHMDiag daemon logs. Log files are auto-rotated once they reach a specific size. Logs should have sufficient debug information to diagnose any problems that CHMDiag could run into.
  • Config: Contains a run sub-directory for CHMDiag process pid file management.
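The layout above can be expressed as a small path helper, which may be convenient in scripts that inspect CHMDiag output. This is a sketch only: the function name `chmdiag_layout` and the example values `/u01/app/grid` and `node1` are hypothetical, and the on-disk directory names are assumed to match the names in the list exactly.

```python
from pathlib import Path

# Subdirectory names as given in the list above (exact on-disk casing is an assumption).
SUBDIRS = ("ActionsResults", "EventsLog", "CHMDiagLog", "Config")


def chmdiag_layout(oracle_base, hostname):
    """Build CHMDIAG_BASE and its subdirectory paths; nothing is created on disk."""
    base = Path(oracle_base) / hostname / "crf" / "chmdiag"
    layout = {name: base / name for name in SUBDIRS}
    # Config contains a run sub-directory used for pid file management.
    layout["run"] = layout["Config"] / "run"
    return base, layout


if __name__ == "__main__":
    # Hypothetical ORACLE_BASE and host name, for illustration only.
    base, layout = chmdiag_layout("/u01/app/grid", "node1")
    print(base)
    for name, path in layout.items():
        print(f"{name}: {path}")
```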

The following new commands query, collect, and describe CHMDiag events/actions sent by various components:
  • oclumon chmdiag description: Use the oclumon chmdiag description command to get a detailed description of all the supported events and actions.
  • oclumon chmdiag query: Use the oclumon chmdiag query command to query CHMDiag events/actions sent by various components and generate an HTML or a text report.
  • oclumon chmdiag collect: Use the oclumon chmdiag collect command to collect all events/actions data generated by CHMDiag into the specified output directory location.