3.3 Operating System Metrics Collected by Cluster Health Monitor
Review the metrics collected by CHM.
Overview of Metrics
CHM groups the operating system data collected into a Nodeview. A Nodeview is a grouping of metric sets where each metric set contains detailed metrics of a unique system resource.
Brief descriptions of the metric sets are as follows:
- CPU metric set: Metrics for top 127 CPUs sorted by usage percentage
- Device metric set: Metrics for 127 devices that include ASM/VD/OCR along with those having a high average wait time
- Process metric set: Metrics for 127 processes
  - Top 25 CPU consumers (idle processes not reported)
  - Top 25 Memory consumers (RSS < 1% of total RAM not reported)
  - Top 25 I/O consumers
  - Top 25 File Descriptors consumers (helps to identify top inode consumers)
  - Process Aggregation: Metrics summarized by foreground and background processes for all Oracle Database and Oracle ASM instances
- Network metric set: Metrics for 16 NICs that include public and private interconnects
- NFS metric set: Metrics for 32 NFS ordered by round trip time
- Protocol metric set: Metrics for protocol groups TCP, UDP, and IP
- Filesystem metric set: Metrics for filesystem utilization
- Critical resources metric set: Metrics for critical system resource utilization
  - CPU Metrics: system-wide CPU utilization statistics
  - Memory Metrics: system-wide memory statistics
  - Device Metrics: system-wide device statistics distinct from individual device metric set
  - NFS Metrics: Total NFS devices collected every 30 seconds
  - Process Metrics: system-wide unique process metrics
 
CPU Metric Set
Contains metrics from all CPU cores ordered by usage percentage.
Table 3-1 CPU Metric Set
| Metric Name (units) | Description | 
|---|---|
| system [%] | Percentage of CPU utilization that occurred while running at the system level (kernel). | 
| user [%] | Percentage of CPU utilization that occurred while running at the user level (application). | 
| usage [%] | Total utilization (system[%] + user[%]). | 
| nice [%] | Percentage of CPU utilization that occurred while running at the user level with nice priority. | 
| ioWait [%] | Percentage of time that the CPU was idle during which the system had an outstanding disk I/O request. | 
| steal [%] | Percentage of time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another virtual processor. | 
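
The relationship between usage, system, and user in Table 3-1 can be illustrated with a short sketch. The sample values and dictionary keys below are invented for illustration and do not represent a CHM output format; only the formula usage[%] = system[%] + user[%] and the ordering by usage come from the text above.

```python
def cpu_usage(system_pct: float, user_pct: float) -> float:
    """Total utilization as defined in Table 3-1: system[%] + user[%]."""
    return system_pct + user_pct


# Hypothetical per-core samples (values invented for illustration).
cores = [
    {"cpu": 0, "system": 12.5, "user": 40.0},
    {"cpu": 1, "system": 3.0, "user": 5.5},
    {"cpu": 2, "system": 20.0, "user": 61.0},
]

# Order cores by usage percentage, highest first.
ranked = sorted(
    cores,
    key=lambda c: cpu_usage(c["system"], c["user"]),
    reverse=True,
)

for core in ranked:
    print(core["cpu"], cpu_usage(core["system"], core["user"]))
```
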
Device Metric Set
Contains metrics from all disk devices/partitions ordered by their service time in milliseconds.
Table 3-2 Device Metric Set
| Metric Name (units) | Description | 
|---|---|
| ioR [KB/s] | Amount of data read from the device. | 
| ioW [KB/s] | Amount of data written to the device. | 
| numIOs [#/s] | Average disk I/O operations. | 
| qLen [#] | Number of queued I/O requests, that is, requests in a wait state. | 
| aWait [msec] | Average wait time per I/O. | 
| svcTm [msec] | Average service time per I/O request. | 
| util [%] | Percent utilization of the device (same as the %util metric from the iostat -x command. Represents the percentage of time the device was active). | 
Process Metric Set
Contains multiple categories of summarized metric data computed across all system processes.
Table 3-3 Process Metric Set
| Metric Name (units) | Description | 
|---|---|
| pid | Process ID. | 
| pri | Process priority (raw value from the operating system). | 
| psr | The processor that the process is currently assigned to or running on. | 
| pPid | Parent process ID. | 
| nice | Nice value of the process. | 
| state | State of the process. For example, R -> Running, S -> Interruptible sleep, and so on. | 
| class | Scheduling class of the process. For example, RR -> Round robin, FF -> First in, first out, B -> Batch scheduling, and so on. | 
| fd [#] | Number of file descriptors opened by this process, which is updated every 30 seconds. | 
| name | Name of the process. | 
| cpu [%] | Process CPU utilization across cores. For example, 50% => 50% of single core, 400% => 100% usage of 4 cores. | 
| thrds [#] | Number of threads created by this process. | 
| vmem [KB] | Process virtual memory usage (KB). | 
| shMem [KB] | Process shared memory usage (KB). | 
| rss [KB] | Process memory-resident set size (KB). | 
| ioR [KB/s] | I/O read in kilobytes per second. | 
| ioW [KB/s] | I/O write in kilobytes per second. | 
| ioT [KB/s] | I/O total in kilobytes per second. | 
| cswch [#/s] | Context switch per second. Collected only for a few critical Oracle Database processes. | 
| nvcswch [#/s] | Non-voluntary context switch per second. Collected only for a few critical Oracle Database processes. | 
| cumulativeCpu [ms] | Amount of CPU used so far by the process in microseconds. | 
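
Because the per-process cpu [%] value is expressed relative to a single core, it can exceed 100. The following sketch converts it into a core count and, as an assumption not stated in the source, into a share of the whole machine by dividing by the number of logical CPUs (the vCpus value from the system metric set). The function names are hypothetical.

```python
def cores_in_use(cpu_pct: float) -> float:
    """Convert process cpu[%] to an equivalent number of fully busy cores.

    Follows the example in Table 3-3: 50% is half of one core,
    400% is 100% usage of 4 cores.
    """
    return cpu_pct / 100.0


def share_of_machine(cpu_pct: float, vcpus: int) -> float:
    """Assumed normalization: process cpu[%] as a percentage of all logical CPUs."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return cpu_pct / vcpus


print(cores_in_use(400.0))         # cores fully consumed by the process
print(share_of_machine(400.0, 8))  # percentage of an 8-vCPU machine
```
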
NIC Metric Set
Contains metrics from all network interfaces ordered by their total rate in kilobytes per second.
Table 3-4 NIC Metric Set
| Metric Name (units) | Description | 
|---|---|
| name | Name of the interface. | 
| tag | Tag for the interface, for example, public, private, and so on. | 
| mtu [B] | Size of the maximum transmission unit in bytes supported for the interface. | 
| rx [Kbps] | Average network receive rate. | 
| tx [Kbps] | Average network send rate. | 
| total [Kbps] | Average network transmission rate (rx[Kbps] + tx[Kbps]). | 
| rxPkt [#/s] | Average incoming packet rate. | 
| txPkt [#/s] | Average outgoing packet rate. | 
| pkt [#/s] | Average rate of packet transmission (rxPkt[#/s] + txPkt[#/s]). | 
| rxDscrd [#/s] | Average rate of dropped/discarded incoming packets. | 
| txDscrd [#/s] | Average rate of dropped/discarded outgoing packets. | 
| rxUnicast [#/s] | Average rate of unicast packets received. | 
| rxNonUnicast [#/s] | Average rate of multicast packets received. | 
| dscrd [#/s] | Average rate of total discarded packets (rxDscrd + txDscrd). | 
| rxErr [#/s] | Average error rate for incoming packets. | 
| txErr [#/s] | Average error rate for outgoing packets. | 
| Err [#/s] | Average error rate of total transmission (rxErr[#/s] + txErr[#/s]). | 
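
Several NIC metrics in Table 3-4 are sums of a receive value and a transmit value. The sketch below models one sample and derives those totals; the class and field names are illustrative only, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass
class NicSample:
    """One sample for a network interface (field names are illustrative)."""
    rx_kbps: float
    tx_kbps: float
    rx_pkt: float
    tx_pkt: float
    rx_dscrd: float
    tx_dscrd: float
    rx_err: float
    tx_err: float

    @property
    def total_kbps(self) -> float:
        # total = rx + tx
        return self.rx_kbps + self.tx_kbps

    @property
    def pkt(self) -> float:
        # pkt = rxPkt + txPkt
        return self.rx_pkt + self.tx_pkt

    @property
    def dscrd(self) -> float:
        # dscrd = rxDscrd + txDscrd
        return self.rx_dscrd + self.tx_dscrd

    @property
    def err(self) -> float:
        # Err = rxErr + txErr
        return self.rx_err + self.tx_err


sample = NicSample(
    rx_kbps=800.0, tx_kbps=200.0,
    rx_pkt=120.0, tx_pkt=80.0,
    rx_dscrd=2.0, tx_dscrd=1.0,
    rx_err=0.0, tx_err=3.0,
)
print(sample.total_kbps, sample.pkt, sample.dscrd, sample.err)
```
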
NFS Metric Set
Contains the top 32 NFS devices ordered by round trip time. This metric set is collected once every 30 seconds.
Table 3-5 NFS Metric Set
| Metric Name (units) | Description | 
|---|---|
| op [#/s] | Number of read/write operations issued to a filesystem per second. | 
| bytes [#/sec] | Number of bytes read from or written to a filesystem per second. | 
| rtt [s] | Duration from the time that the client's kernel sends the RPC request until the time it receives the reply. | 
| exe [s] | Duration from the time that the NFS client issues the RPC request to its kernel until the RPC request is completed. This includes the rtt time above. | 
| retrains [%] | Retransmission frequency, as a percentage. | 
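
Since exe includes rtt, subtracting one from the other gives the part of the execution time spent outside the RPC round trip. This derived value is an interpretation for illustration, not a metric that CHM reports, and the function name is hypothetical.

```python
def time_outside_rtt(exe_s: float, rtt_s: float) -> float:
    """Portion of exe[s] not accounted for by rtt[s].

    Assumes both values come from the same sample, so exe >= rtt.
    """
    if exe_s < rtt_s:
        raise ValueError("exe is expected to include rtt")
    return exe_s - rtt_s


# Invented sample: 0.75 s total execution, of which 0.25 s is round trip.
print(time_outside_rtt(0.75, 0.25))
```
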
Protocol Metric Set
Contains specific metrics for protocol groups TCP, UDP, and IP. Metric values are cumulative since system startup.
Table 3-6 TCP Metric Set
| Metric Name (units) | Description | 
|---|---|
| failedConnErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state. | 
| estResetErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state. | 
| segRetransErr [#] | Total number of TCP segments retransmitted. | 
| rxSeg [#] | Total number of TCP segments received at the TCP layer. | 
| txSeg [#] | Total number of TCP segments sent from the TCP layer. | 
Table 3-7 UDP Metric Set
| Metric Name (units) | Description | 
|---|---|
| unkPortErr [#] | Total number of received datagrams for which there was no application at the destination port. | 
| rxErr [#] | Number of received datagrams that could not be delivered for reasons other than the lack of an application at the destination port. | 
| rxPkt [#] | Total number of packets received. | 
| txPkt [#] | Total number of packets sent. | 
Table 3-8 IP Metric Set
| Metric Name (units) | Description | 
|---|---|
| ipHdrErr [#] | Number of input datagrams discarded due to errors in their IPv4 headers. | 
| addrErr [#] | Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity. | 
| unkProtoErr [#] | Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol. | 
| reasFailErr [#] | Number of failures detected by the IPv4 reassembly algorithm. | 
| fragFailErr [#] | Number of IPv4 datagrams discarded due to fragmentation failures. | 
| rxPkt [#] | Total number of packets received at the IP layer. | 
| txPkt [#] | Total number of packets sent from the IP layer. | 
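
Because the protocol metric values are cumulative, a single sample says little about current behavior; the change between two samples is usually more informative. The following is a minimal sketch of that calculation. The function name, the interval, and the sample values are assumptions, and treating a decreasing counter as a reset (for example, after a restart) is a design choice of this sketch rather than documented CHM behavior.

```python
from typing import Optional


def counter_rate(prev: int, curr: int, interval_s: float) -> Optional[float]:
    """Per-second rate of a cumulative counter between two samples.

    Returns None when the counter went backwards, which this sketch
    treats as a counter reset.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = curr - prev
    if delta < 0:
        return None
    return delta / interval_s


# Invented segRetransErr readings taken 30 seconds apart.
print(counter_rate(1000, 1600, 30.0))
```
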
Filesystem Metric Set
Contains metrics for filesystem utilization. Collected only for the GRID_HOME filesystem.
Table 3-9 Filesystem Metric Set
| Metric Name (units) | Description | 
|---|---|
| mount | Mount point. | 
| type | Filesystem type, for example, ext4. | 
| tag | Filesystem tag, for example, GRID_HOME. | 
| total [KB] | Total amount of space (KB). | 
| used [KB] | Amount of used space (KB). | 
| avbl [KB] | Amount of available space (KB). | 
| used [%] | Percentage of used space. | 
| ifree [%] | Percentage of free file nodes. | 
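
As a rough illustration of how the space metrics relate, the sketch below derives a used percentage from total [KB] and used [KB]. This is an assumed formula: the source does not state how CHM computes used [%], and tools that account for reserved blocks can report a different value. The function name and sample values are hypothetical.

```python
def used_percent(total_kb: int, used_kb: int) -> float:
    """Used space as a percentage of total space (assumed formula)."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * used_kb / total_kb


# Invented sample: 250 KB used out of 1000 KB.
print(used_percent(1000, 250))
```
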
System Metric Set
Contains a summarized metric set of critical system resource utilization.
Table 3-10 CPU Metrics
| Metric Name (units) | Description | 
|---|---|
| pCpus [#] | Number of physical processing units in the system. | 
| Cores [#] | Number of cores for all CPUs in the system. | 
| vCpus [#] | Number of logical processing units in the system. | 
| cpuHt | CPU Hyperthreading enabled (Y) or disabled (N). | 
| osName | Name of the operating system. | 
| chipName | Name of the chip of the processing unit. | 
| system [%] | Percentage of CPU utilization that occurred while running at the system level (kernel). | 
| user [%] | Percentage of CPU utilization that occurred while running at the user level (application). | 
| usage [%] | Total CPU utilization (system[%] + user[%]). | 
| nice [%] | Percentage of CPU utilization that occurred while running at the user level with nice priority. | 
| ioWait [%] | Percentage of time that the CPUs were idle during which the system had an outstanding disk I/O request. | 
| Steal [%] | Percentage of time spent in involuntary wait by the virtual CPUs while the hypervisor was servicing another virtual processor. | 
| cpuQ [#] | Number of processes waiting in the run queue within the current sample interval. | 
| loadAvg1 | Average system load calculated over a period of one minute. | 
| loadAvg5 | Average system load calculated over a period of five minutes. | 
| loadAvg15 | Average system load calculated over a period of 15 minutes. High load averages imply that a system is overloaded; many processes are waiting for CPU time. | 
| Intr [#/s] | Number of interrupts that occurred per second in the system. | 
| ctxSwitch [#/s] | Number of context switches that occurred per second in the system. | 
Table 3-11 Memory Metrics
| Metric Name (units) | Description | 
|---|---|
| totalMem [KB] | Amount of total usable RAM (KB). | 
| freeMem [KB] | Amount of free RAM (KB). | 
| avblMem [KB] | Amount of memory available to start a new process without swapping. | 
| shMem [KB] | Memory used (mostly) by tmpfs. | 
| swapTotal [KB] | Total amount of physical swap memory (KB). | 
| swapFree [KB] | Amount of swap memory free (KB). | 
| swpIn [KB/s] | Average swap-in rate within the current sample interval (KB/sec). | 
| swpOut [KB/s] | Average swap-out rate within the current sample interval (KB/sec). | 
| pgIn [#/s] | Average page-in rate within the current sample interval (pages/sec). | 
| pgOut [#/s] | Average page-out rate within the current sample interval (pages/sec). | 
| slabReclaim [KB] | The part of the slab that might be reclaimed such as caches. | 
| buffer [KB] | Memory used by kernel buffers. | 
| Cache [KB] | Memory used by the page cache and slabs. | 
| bufferAndCache [KB] | Total size of buffer and cache (buffer[KB] + Cache[KB]). | 
| hugePageTotal [#] | Total number of huge pages present in the system for the current sample interval. | 
| hugePageFree [#] | Total number of free huge pages in the system for the current sample interval. | 
| hugePageSize [KB] | Size of one huge page in KB, depends on the operating system version. Typically the same for all samples for a particular host. | 
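
Two of the memory values combine arithmetically: bufferAndCache is defined above as buffer[KB] + Cache[KB], and the memory reserved for huge pages can be estimated as hugePageTotal multiplied by hugePageSize. The second calculation is a derived illustration, not a metric listed in Table 3-11, and the function names and sample values are hypothetical.

```python
def buffer_and_cache_kb(buffer_kb: int, cache_kb: int) -> int:
    """bufferAndCache[KB] as defined in Table 3-11: buffer[KB] + Cache[KB]."""
    return buffer_kb + cache_kb


def huge_page_pool_kb(huge_page_total: int, huge_page_size_kb: int) -> int:
    """Derived estimate: memory reserved for huge pages, in KB."""
    return huge_page_total * huge_page_size_kb


# Invented sample values.
print(buffer_and_cache_kb(150_000, 2_350_000))
print(huge_page_pool_kb(512, 2048))
```
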
Table 3-12 Device Metrics
| Metric Name (units) | Description | 
|---|---|
| disks [#] | Number of disks configured in the system. | 
| ioR [KB/s] | Aggregate read rate across all devices. | 
| ioW [KB/s] | Aggregate write rate across all devices. | 
| numIOs [#/s] | Aggregate I/O operation rate across all devices. | 
Table 3-13 NFS Metrics
| Metric Name (units) | Description | 
|---|---|
| nfs [#] | Total NFS devices. | 
Table 3-14 Process Metrics
| Metric Name (units) | Description | 
|---|---|
| fds [#] | Number of open file structs in system. | 
| procs [#] | Number of processes. | 
| rtProcs [#] | Number of real-time processes. | 
| procsInDState | Number of processes in uninterruptible sleep. | 
| sysFdLimit [#] | System limit on the number of file structs. | 
| procsOnCpu [#] | Number of processes currently running on CPU. | 
| procsBlocked [#] | Number of processes waiting for some event or resource to become available, such as the completion of an I/O operation. | 
Process Aggregates Metric Set
Contains aggregated metrics for all processes by process groups.
Table 3-15 Process Aggregates Metric Set
| Metric Name (units) | Description | 
|---|---|
| DBBG | User Oracle Database background process group. | 
| DBFG | User Oracle Database foreground process group. | 
| MDBBG | MGMTDB background processes group. | 
| MDBFG | MGMTDB foreground processes group. | 
| ASMBG | ASM background processes group. | 
| ASMFG | ASM foreground processes group. | 
| IOXBG | IOS background processes group. | 
| IOXFG | IOS foreground processes group. | 
| APXBG | APX background processes group. | 
| APXFG | APX foreground processes group. | 
| CLUST | Clusterware processes group. | 
| OTHER | Default group. | 
For each group, the following metrics are aggregated to report a group summary.
| Metric Name (units) | Description | 
|---|---|
| processes [#] | Total number of processes in the group. | 
| cpu [%] | Aggregated CPU utilization. | 
| rss [KB] | Aggregated resident set size. | 
| shMem [KB] | Aggregated shared memory usage. | 
| thrds [#] | Aggregated thread count. | 
| fds [#] | Aggregated count of open file descriptors. | 
| cpuWeight [%] | Contribution of the group in overall CPU utilization of the machine. | 