The Cluster Health Monitor (CHM) detects and analyzes operating system and cluster resource-related degradation and failures. CHM stores real-time operating system metrics in the Oracle Grid Infrastructure Management Repository that you can use for later triage with the help of My Oracle Support should you have cluster issues.
This section includes the following CHM topics:
System Monitor Service
There is one system monitor service on every node. The system monitor service (osysmond) is a real-time monitoring and operating system metric collection service that sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in an Oracle Grid Infrastructure Management Repository database.
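As a quick way to confirm that the CHM daemons are up on a node, you can query the Clusterware resource they run under. The sketch below assumes the standard `ora.crf` resource name used for CHM in Oracle Clusterware 12c; verify the resource name on your installation.

```shell
# List the local-node (init) Clusterware resources in tabular form and
# check the state of ora.crf, under which osysmond and ologgerd run.
$ Grid_home/bin/crsctl stat res ora.crf -init -t
```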
Cluster Logger Service
There is one cluster logger service (OLOGGERD) for every 32 nodes in a cluster. Another OLOGGERD is spawned for every additional 32 nodes (which can be a sum of Hub and Leaf Nodes). If the cluster logger service fails (because the service is not able to come up after a fixed number of retries or the node where it was running is down), then Oracle Clusterware starts OLOGGERD on a different node. The cluster logger service manages the operating system metric database in the Oracle Grid Infrastructure Management Repository.
Oracle Grid Infrastructure Management Repository
Is an Oracle database that stores real-time operating system metrics collected by CHM. You configure the Oracle Grid Infrastructure Management Repository during an installation of or upgrade to Oracle Clusterware 12c on a cluster.
If you are upgrading Oracle Clusterware to Oracle Clusterware 12c and Oracle Cluster Registry (OCR) and the voting file are stored on raw or block devices, then you must move them to Oracle ASM or a shared file system before you upgrade your software.
Runs on one node in the cluster (this must be a Hub Node in an Oracle Flex Cluster configuration), and must support failover to another node in case of node or storage failure.
You can locate the Oracle Grid Infrastructure Management Repository on the same node as the OLOGGERD to improve performance and decrease private network traffic.
Communicates with any cluster clients (such as OLOGGERD and OCLUMON) through the private network. Oracle Grid Infrastructure Management Repository communicates with external clients over the public network only.
Data files are located in the same disk group as the OCR and voting file.
If OCR is stored in an Oracle ASM disk group called +MYDG, then configuration scripts will use the same disk group to store the Oracle Grid Infrastructure Management Repository.
Oracle increased the Oracle Clusterware shared storage requirement to accommodate the Oracle Grid Infrastructure Management Repository, which can be a network file system (NFS), cluster file system, or an Oracle ASM disk group.
Size and retention are managed with OCLUMON.
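As a sketch of that management interface, the following OCLUMON invocations show and adjust repository size and retention. The verbs shown (`-get repsize reppath`, `-repos changerepossize`, `-repos changeretentiontime`) are as documented for Oracle Clusterware 12c; confirm the options against your release before use.

```shell
# Display the current repository retention (in seconds) and its location.
$ Grid_home/bin/oclumon manage -get repsize reppath

# Change the repository size (value in MB); retention adjusts accordingly.
$ Grid_home/bin/oclumon manage -repos changerepossize 4000

# Change the retention time (value in seconds).
$ Grid_home/bin/oclumon manage -repos changeretentiontime 86400
```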
Oracle recommends that, when you run the diagcollection.pl script to collect CHM data, you run the script on all nodes in the cluster to ensure that all of the information needed for analysis is gathered.
You must run this script as a privileged user.
See "Diagnostics Collection Script" for more information.
To run the data collection script on only the node where the cluster logger service is running:
Run the following command to identify the node running the cluster logger service:
$ Grid_home/bin/oclumon manage -get master
Run the following command from a writable directory outside the Grid home as a privileged user on the cluster logger service node to collect all the available data in the Oracle Grid Infrastructure Management Repository:
# Grid_home/bin/diagcollection.pl --collect
On Windows, run the following command:
C:\Grid_home\perl\bin\perl.exe C:\Grid_home\bin\diagcollection.pl --collect
The diagcollection.pl script creates a .tar.gz file.
To limit the amount of data you want collected, enter the following command on a single line:
# Grid_home/bin/diagcollection.pl --collect --chmos --incidenttime time --incidentduration duration
In the preceding command, the format for the --incidenttime argument is MM/DD/YYYY24HH:MM:SS and the format for the --incidentduration argument is HH:MM. For example:
# Grid_home/bin/diagcollection.pl --collect --crshome Grid_home --chmos --incidenttime 07/21/2013 01:00:00 --incidentduration 00:30