3.3 Operating System Metrics Collected by Cluster Health Monitor
Review the metrics collected by CHM.
Overview of Metrics
CHM groups the operating system data collected into a Nodeview. A Nodeview is a grouping of metric sets where each metric set contains detailed metrics of a unique system resource.
Brief descriptions of the metric sets are as follows:
- CPU metric set: Metrics for the top 127 CPUs sorted by usage percentage
- Device metric set: Metrics for 127 devices that include ASM/VD/OCR along with those having a high average wait time
- Process metric set: Metrics for 127 processes
  - Top 25 CPU consumers (idle processes not reported)
  - Top 25 Memory consumers (RSS < 1% of total RAM not reported)
  - Top 25 I/O consumers
  - Top 25 File Descriptors consumers (helps to identify top inode consumers)
- Process Aggregation: Metrics summarized by foreground and background processes for all Oracle Database and Oracle ASM instances
- Network metric set: Metrics for 16 NICs that include public and private interconnects
- NFS metric set: Metrics for 32 NFS filesystems ordered by round trip time
- Protocol metric set: Metrics for protocol groups TCP, UDP, and IP
- Filesystem metric set: Metrics for filesystem utilization
- Critical resources metric set: Metrics for critical system resource utilization
  - CPU Metrics: system-wide CPU utilization statistics
  - Memory Metrics: system-wide memory statistics
  - Device Metrics: system-wide device statistics distinct from individual device metric set
  - NFS Metrics: Total NFS devices collected every 30 seconds
  - Process Metrics: system-wide unique process metrics
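The grouping described above can be pictured as a simple container: one Nodeview per node and sample, holding named metric sets. The following is a minimal sketch only. The class name, field names, and the `add` method are hypothetical and are not part of any CHM interface; the per-set limits are the entry counts listed in the overview, read here as upper bounds.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Entry limits taken from the overview list; the keys are illustrative names.
SET_LIMITS = {"cpu": 127, "device": 127, "process": 127, "nic": 16, "nfs": 32}


@dataclass
class Nodeview:
    """Hypothetical container: metric-set name -> list of metric rows."""
    node: str
    metric_sets: Dict[str, List[dict]] = field(default_factory=dict)

    def add(self, set_name: str, rows: List[dict]) -> None:
        # Assumption: a set with a documented limit keeps at most that many rows.
        limit = SET_LIMITS.get(set_name)
        self.metric_sets[set_name] = rows[:limit] if limit else list(rows)


view = Nodeview(node="node1")
view.add("nic", [{"name": f"eth{i}"} for i in range(20)])
print(len(view.metric_sets["nic"]))  # 16
```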
CPU Metric Set
Contains metrics from all CPU cores ordered by usage percentage.
Table 3-1 CPU Metric Set
Metric Name (units) | Description |
---|---|
system [%] | Percentage of CPU utilization that occurred while running at the system level (kernel). |
user [%] | Percentage of CPU utilization that occurred while running at the user level (application). |
usage [%] | Total utilization (system[%] + user[%]). |
nice [%] | Percentage of CPU utilization that occurred while running at the user level with nice priority. |
ioWait [%] | Percentage of time that the CPU was idle during which the system had an outstanding disk I/O request. |
steal [%] | Percentage of time spent in involuntary wait by the virtual CPU while the hypervisor was servicing another virtual processor. |
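As an illustration of how the usage column relates to the other columns and how the set is ordered, the sketch below derives usage as system + user for each CPU and sorts by it. The function name and the dictionary layout of a sample are assumptions made for this example, not a CHM data format.

```python
def top_cpus(samples, limit=127):
    """Return per-CPU rows with a derived usage field, highest usage first."""
    # usage [%] = system [%] + user [%], as defined in Table 3-1.
    rows = [dict(s, usage=s["system"] + s["user"]) for s in samples]
    rows.sort(key=lambda r: r["usage"], reverse=True)
    return rows[:limit]


samples = [
    {"cpu": 0, "system": 5.0, "user": 20.0},
    {"cpu": 1, "system": 10.0, "user": 60.0},
    {"cpu": 2, "system": 1.0, "user": 2.0},
]
print([r["cpu"] for r in top_cpus(samples, limit=2)])  # [1, 0]
```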
Device Metric Set
Contains metrics from all disk devices/partitions ordered by their service time in milliseconds.
Table 3-2 Device Metric Set
Metric Name (units) | Description |
---|---|
ioR [KB/s] | Amount of data read from the device. |
ioW [KB/s] | Amount of data written to the device. |
numIOs [#/s] | Average disk I/O operations. |
qLen [#] | Number of I/O queued requests, that is, in a wait state. |
aWait [msec] | Average wait time per I/O. |
svcTm [msec] | Average service time per I/O request. |
util [%] | Percent utilization of the device (same as the %util metric from the iostat -x command; represents the percentage of time the device was active). |
Process Metric Set
Contains multiple categories of summarized metric data computed across all system processes.
Table 3-3 Process Metric Set
Metric Name (units) | Description |
---|---|
pid | Process ID. |
pri | Process priority (raw value from the operating system). |
psr | The processor that the process is currently assigned to or running on. |
pPid | Parent process ID. |
nice | Nice value of the process. |
state | State of the process. For example, R->Running, S->Interruptible sleep, and so on. |
class | Scheduling class of the process. For example, RR->Round robin, FF->First in first out, B->Batch scheduling, and so on. |
fd [#] | Number of file descriptors opened by this process, which is updated every 30 seconds. |
name | Name of the process. |
cpu [%] | Process CPU utilization across cores. For example, 50% => 50% of a single core, 400% => 100% usage of 4 cores. |
thrds [#] | Number of threads created by this process. |
vmem [KB] | Process virtual memory usage (KB). |
shMem [KB] | Process shared memory usage (KB). |
rss [KB] | Process memory-resident set size (KB). |
ioR [KB/s] | I/O read in kilobytes per second. |
ioW [KB/s] | I/O write in kilobytes per second. |
ioT [KB/s] | I/O total in kilobytes per second. |
cswch [#/s] | Context switches per second. Collected only for a few critical Oracle Database processes. |
nvcswch [#/s] | Involuntary context switches per second. Collected only for a few critical Oracle Database processes. |
cumulativeCpu [ms] | Amount of CPU used so far by the process in microseconds. |
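Because the per-process cpu [%] value is expressed relative to a single core, it can exceed 100. The sketch below shows one way to read it. Both function names are hypothetical, and the second function assumes that total machine capacity equals the logical CPU count times 100%.

```python
def cores_used(cpu_pct):
    """Convert a process cpu [%] value into the equivalent number of fully busy cores."""
    return cpu_pct / 100.0


def machine_share_pct(cpu_pct, vcpus):
    """Share of total machine CPU, assuming capacity = vcpus * 100%."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return cpu_pct / vcpus


print(cores_used(400))             # 4.0
print(machine_share_pct(400, 16))  # 25.0
```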
NIC Metric Set
Contains metrics from all network interfaces ordered by their total rate in kilobytes per second.
Table 3-4 NIC Metric Set
Metric Name (units) | Description |
---|---|
name | Name of the interface. |
tag | Tag for the interface, for example, public, private, and so on. |
mtu [B] | Size of the maximum transmission unit in bytes supported for the interface. |
rx [Kbps] | Average network receive rate. |
tx [Kbps] | Average network send rate. |
total [Kbps] | Average network transmission rate (rx[Kbps] + tx[Kbps]). |
rxPkt [#/s] | Average incoming packet rate. |
txPkt [#/s] | Average outgoing packet rate. |
pkt [#/s] | Average rate of packet transmission (rxPkt[#/s] + txPkt[#/s]). |
rxDscrd [#/s] | Average rate of dropped/discarded incoming packets. |
txDscrd [#/s] | Average rate of dropped/discarded outgoing packets. |
rxUnicast [#/s] | Average rate of unicast packets received. |
rxNonUnicast [#/s] | Average rate of multicast packets received. |
dscrd [#/s] | Average rate of total discarded packets (rxDscrd[#/s] + txDscrd[#/s]). |
rxErr [#/s] | Average error rate for incoming packets. |
txErr [#/s] | Average error rate for outgoing packets. |
Err [#/s] | Average error rate of total transmission (rxErr[#/s] + txErr[#/s]). |
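Four rows in Table 3-4 are sums of a receive value and a send value. The following sketch recomputes them from the directional metrics. The function name and the dictionary layout are assumptions for illustration; the key names mirror the metric names in the table.

```python
def nic_totals(nic):
    """Recompute the combined NIC metrics from their rx/tx components."""
    return {
        "total": nic["rx"] + nic["tx"],             # rx + tx
        "pkt": nic["rxPkt"] + nic["txPkt"],         # rxPkt + txPkt
        "dscrd": nic["rxDscrd"] + nic["txDscrd"],   # rxDscrd + txDscrd
        "Err": nic["rxErr"] + nic["txErr"],         # rxErr + txErr
    }


eth0 = {"name": "eth0", "rx": 800, "tx": 200, "rxPkt": 120, "txPkt": 80,
        "rxDscrd": 1, "txDscrd": 0, "rxErr": 0, "txErr": 2}
print(nic_totals(eth0))  # {'total': 1000, 'pkt': 200, 'dscrd': 1, 'Err': 2}
```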
NFS Metric Set
Contains metrics for the top 32 NFS filesystems ordered by round trip time. This metric set is collected once every 30 seconds.
Table 3-5 NFS Metric Set
Metric Name (units) | Description |
---|---|
op [#/s] | Number of read/write operations issued to a filesystem per second. |
bytes [#/sec] | Number of bytes read or written per second on a filesystem. |
rtt [s] | Duration from the time that the client's kernel sends the RPC request until the time it receives the reply. |
exe [s] | Duration from the time that the NFS client issues the RPC request to its kernel until the RPC request completes. This includes the RTT described above. |
retrains [%] | Retransmission frequency as a percentage. |
Protocol Metric Set
Contains specific metrics for protocol groups TCP, UDP, and IP. Metric values are cumulative since system start.
Table 3-6 TCP Metric Set
Metric Name (units) | Description |
---|---|
failedConnErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state. |
estResetErr [#] | Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state. |
segRetransErr [#] | Total number of TCP segments retransmitted. |
rxSeg [#] | Total number of TCP segments received on TCP layer. |
txSeg [#] | Total number of TCP segments sent from TCP layer. |
Table 3-7 UDP Metric Set
Metric Name (units) | Description |
---|---|
unkPortErr [#] | Total number of received datagrams for which there was no application at the destination port. |
rxErr [#] | Number of received datagrams that could not be delivered for reasons other than the lack of an application at the destination port. |
rxPkt [#] | Total number of packets received. |
txPkt [#] | Total number of packets sent. |
Table 3-8 IP Metric Set
Metric Name (units) | Description |
---|---|
ipHdrErr [#] | Number of input datagrams discarded due to errors in their IPv4 headers. |
addrErr [#] | Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity. |
unkProtoErr [#] | Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol. |
reasFailErr [#] | Number of failures detected by the IPv4 reassembly algorithm. |
fragFailErr [#] | Number of IPv4 discarded datagrams due to fragmentation failures. |
rxPkt [#] | Total number of packets received on IP layer. |
txPkt [#] | Total number of packets sent from IP layer. |
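Since the protocol counters are cumulative, a rate over a period has to be derived from the difference between two samples. The sketch below shows one way to do that. The function name, the dictionary layout, and the choice to report `None` when a counter goes backwards are assumptions for this example, not CHM behavior.

```python
def counter_rates(prev, curr, interval_s):
    """Per-second rate for each cumulative counter present in both samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name, value in curr.items():
        if name in prev:
            delta = value - prev[name]
            # A negative delta suggests the counter restarted (for example,
            # after a reboot); report None rather than guess a rate.
            rates[name] = delta / interval_s if delta >= 0 else None
    return rates


prev = {"segRetransErr": 100, "rxSeg": 5000}
curr = {"segRetransErr": 130, "rxSeg": 8000}
print(counter_rates(prev, curr, 30))  # {'segRetransErr': 1.0, 'rxSeg': 100.0}
```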
Filesystem Metric Set
Contains metrics for filesystem utilization. Collected only for the GRID_HOME filesystem.
Table 3-9 Filesystem Metric Set
Metric Name (units) | Description |
---|---|
mount | Mount point. |
type | Filesystem type, for example, ext4. |
tag | Filesystem tag, for example, GRID_HOME. |
total [KB] | Total amount of space (KB). |
used [KB] | Amount of used space (KB). |
avbl [KB] | Amount of available space (KB). |
used [%] | Percentage of used space. |
ifree [%] | Percentage of free file nodes. |
System Metric Set
Contains a summarized metric set of critical system resource utilization.
Table 3-10 CPU Metrics
Metric Name (units) | Description |
---|---|
pCpus [#] | Number of physical processing units in the system. |
Cores [#] | Number of cores for all CPUs in the system. |
vCpus [#] | Number of logical processing units in the system. |
cpuHt | CPU Hyperthreading enabled (Y) or disabled (N). |
osName | Name of the operating system. |
chipName | Name of the chip of the processing unit. |
system [%] | Percentage of CPU utilization that occurred while running at the system level (kernel). |
user [%] | Percentage of CPU utilization that occurred while running at the user level (application). |
usage [%] | Total CPU utilization (system[%] + user[%]). |
nice [%] | Percentage of CPU utilization that occurred while running at the user level with nice priority. |
ioWait [%] | Percentage of time that the CPUs were idle during which the system had an outstanding disk I/O request. |
Steal [%] | Percentage of time spent in involuntary wait by the virtual CPUs while the hypervisor was servicing another virtual processor. |
cpuQ [#] | Number of processes waiting in the run queue within the current sample interval. |
loadAvg1 | Average system load calculated over a period of one minute. |
loadAvg5 | Average system load calculated over a period of five minutes. |
loadAvg15 | Average system load calculated over a period of 15 minutes. High load averages imply that a system is overloaded; many processes are waiting for CPU time. |
Intr [#/s] | Number of interrupts that occurred per second in the system. |
ctxSwitch [#/s] | Number of context switches that occurred per second in the system. |
Table 3-11 Memory Metrics
Metric Name (units) | Description |
---|---|
totalMem [KB] | Amount of total usable RAM (KB). |
freeMem [KB] | Amount of free RAM (KB). |
avblMem [KB] | Amount of memory available to start a new process without swapping. |
shMem [KB] | Memory used (mostly) by tmpfs. |
swapTotal [KB] | Total amount of physical swap memory (KB). |
swapFree [KB] | Amount of swap memory free (KB). |
swpIn [KB/s] | Average swap in rate within the current sample interval (KB/sec). |
swpOut [KB/s] | Average swap-out rate within the current sample interval (KB/sec). |
pgIn [#/s] | Average page in rate within the current sample interval (pages/sec). |
pgOut [#/s] | Average page out rate within the current sample interval (pages/sec). |
slabReclaim [KB] | The part of the slab that might be reclaimed such as caches. |
buffer [KB] | Memory used by kernel buffers. |
Cache [KB] | Memory used by the page cache and slabs. |
bufferAndCache [KB] | Total size of buffer and cache (buffer[KB] + Cache[KB]). |
hugePageTotal [#] | Total number of huge pages present in the system for the current sample interval. |
hugePageFree [KB] | Total number of free huge pages in the system for the current sample interval. |
hugePageSize [KB] | Size of one huge page in KB, depends on the operating system version. Typically the same for all samples for a particular host. |
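A few values in Table 3-11 follow directly from others in the same table. The sketch below recomputes bufferAndCache and derives two convenience figures that the table itself does not list: swap in use and the total memory reserved for huge pages. The function name and the output key names `swapUsed` and `hugePageTotalKB` are hypothetical.

```python
def memory_summary(m):
    """Derive combined memory figures from one sample of Table 3-11 metrics."""
    return {
        # bufferAndCache [KB] = buffer [KB] + Cache [KB], as defined in the table.
        "bufferAndCache": m["buffer"] + m["Cache"],
        # Not a CHM metric: swap in use, from swapTotal and swapFree.
        "swapUsed": m["swapTotal"] - m["swapFree"],
        # Not a CHM metric: huge page count times the size of one huge page.
        "hugePageTotalKB": m["hugePageTotal"] * m["hugePageSize"],
    }


sample = {"buffer": 1000, "Cache": 4000, "swapTotal": 8000, "swapFree": 6000,
          "hugePageTotal": 512, "hugePageSize": 2048}
print(memory_summary(sample))
```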
Table 3-12 Device Metrics
Metric Name (units) | Description |
---|---|
disks [#] | Number of disks configured in the system. |
ioR [KB/s] | Aggregate read rate across all devices. |
ioW [KB/s] | Aggregate write rate across all devices. |
numIOs [#/s] | Aggregate I/O operation rate across all devices. |
Table 3-13 NFS Metrics
Metric Name (units) | Description |
---|---|
nfs [#] | Total NFS devices. |
Table 3-14 Process Metrics
Metric Name (units) | Description |
---|---|
fds [#] | Number of open file structs in system. |
procs [#] | Number of processes. |
rtProcs [#] | Number of real-time processes. |
procsInDState | Number of processes in uninterruptible sleep. |
sysFdLimit [#] | System limit on the number of file structs. |
procsOnCpu [#] | Number of processes currently running on CPU. |
procsBlocked [#] | Number of processes waiting for an event or resource to become available, such as the completion of an I/O operation. |
Process Aggregates Metric Set
Contains aggregated metrics for all processes by process groups.
Table 3-15 Process Aggregates Metric Set
Metric Name (units) | Description |
---|---|
DBBG | User Oracle Database background process group. |
DBFG | User Oracle Database foreground process group. |
MDBBG | MGMTDB background process group. |
MDBFG | MGMTDB foreground process group. |
ASMBG | ASM background process group. |
ASMFG | ASM foreground process group. |
IOXBG | IOS background process group. |
IOXFG | IOS foreground process group. |
APXBG | APX background process group. |
APXFG | APX foreground process group. |
CLUST | Clusterware process group. |
OTHER | Default group. |
For each group, the following metrics are aggregated to report a group summary.
Metric Name (units) | Description |
---|---|
processes [#] | Total number of processes in the group. |
cpu [%] | Aggregated CPU utilization. |
rss [KB] | Aggregated resident set size. |
shMem [KB] | Aggregated shared memory usage. |
thrds [#] | Aggregated thread count. |
fds [#] | Aggregated number of open file descriptors. |
cpuWeight [%] | Contribution of the group to the overall CPU utilization of the machine. |
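The group summary can be thought of as a sum over the per-process metrics of every process in a group. The sketch below illustrates that idea under several assumptions: the `group` key on each process row is hypothetical (this section does not say how processes are assigned to groups), and cpuWeight is computed here as the group's cpu [%] divided by the logical CPU count, which is only one plausible reading of the description.

```python
from collections import defaultdict


def aggregate_groups(procs, vcpus):
    """Sum per-process metrics by group; rows without a group fall into OTHER."""
    groups = defaultdict(lambda: {"processes": 0, "cpu": 0.0, "rss": 0,
                                  "shMem": 0, "thrds": 0, "fds": 0})
    for p in procs:
        g = groups[p.get("group", "OTHER")]
        g["processes"] += 1
        for key in ("cpu", "rss", "shMem", "thrds"):
            g[key] += p[key]
        g["fds"] += p["fd"]  # per-process metric is fd [#]; group metric is fds [#]
    for g in groups.values():
        # Assumption: weight relative to machine capacity of vcpus * 100%.
        g["cpuWeight"] = g["cpu"] / vcpus
    return dict(groups)


procs = [
    {"group": "DBFG", "cpu": 100.0, "rss": 10, "shMem": 5, "thrds": 1, "fd": 20},
    {"group": "DBFG", "cpu": 300.0, "rss": 30, "shMem": 5, "thrds": 4, "fd": 30},
    {"cpu": 40.0, "rss": 8, "shMem": 0, "thrds": 2, "fd": 6},
]
summary = aggregate_groups(procs, vcpus=8)
print(summary["DBFG"]["cpu"], summary["DBFG"]["cpuWeight"])  # 400.0 50.0
```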