6 Alerts
Alerts are used to detect abnormal conditions in CNE and notify the user when any of the common services are not operating normally.
Each alert rule uses the values of one or more metrics stored in Prometheus to identify abnormal conditions. Prometheus periodically evaluates each rule to verify that CNE is operating normally. When a rule evaluation indicates an abnormal condition, Prometheus sends an alert to AlertManager. The resulting alert contains information about which part of the CNE cluster is affected, to aid troubleshooting. Each alert is assigned a severity level to inform the user of the seriousness of the alerting condition. This section provides details about CNE alerts.
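The following is a minimal sketch of how the currently firing alerts can be inspected from a host with kubectl access (typically a Bastion Host). The AlertManager service name and namespace are assumptions; adjust them to match your deployment.

```bash
# In one terminal, forward the AlertManager port locally
# (service name and namespace are assumptions; adjust to your deployment).
kubectl -n occne-infra port-forward svc/alertmanager 9093:9093

# In another terminal, query the AlertManager v2 API for the alerts that are currently firing.
curl -s http://localhost:9093/api/v2/alerts | \
  jq -r '.[] | "\(.labels.alertname)  severity=\(.labels.severity)  \(.annotations.summary)"'
```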
6.1 Kubernetes Alerts
This section provides details about Kubernetes alerts.
Table 6-1 DISK_SPACE_LOW
| Field | Details |
|---|---|
| Description | Cluster-name : {{ $externalLabels.cluster }} Disk space is almost RUNNING OUT for Kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}. Available space is {{ $value }}% (< 20% left). Instance = {{ $labels.instance }} |
| Summary | Cluster-name : {{ $externalLabels.cluster }} Disk space is RUNNING OUT on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} |
| Cause | Disk space is running out on the node. More than 80% of the allocated disk space on the node is consumed. |
| Severity | Critical |
| SNMP Trap ID | 1001 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
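As a starting point for investigating the DISK_SPACE_LOW alert, the following sketch shows how to confirm usage on the affected partition and locate the largest consumers. The mount point and cleanup targets are examples; what can safely be removed depends on what is consuming the space.

```bash
# On the affected node, confirm overall usage of the reported partition
# (the mount point shown in the alert; /var is only an example here).
df -h /var

# Find the largest directories under the partition to decide what to clean up.
du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -20

# Old rotated journal logs are a common candidate for cleanup.
sudo journalctl --vacuum-size=500M
```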
Table 6-2 CPU_LOAD_HIGH
| Field | Details |
|---|---|
| Description | CPU load is high on host <node name>. CPU load: {{ $value }}%. Instance: {{ $labels.instance }} |
| Summary | CPU load is high on host {{ $labels.kubernetes_node }} |
| Cause | CPU load is more than 80% of the allocated resources on the node. |
| Severity | Major |
| SNMP Trap ID | 1002 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
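A sketch of the first checks for the CPU_LOAD_HIGH alert, assuming a metrics provider such as metrics-server is available to kubectl; the node name is a placeholder.

```bash
# Confirm overall CPU usage on the node reported in the alert.
kubectl top node <node-name>

# Identify the pods consuming the most CPU across the cluster.
kubectl top pods -A --sort-by=cpu | head -20
```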
Table 6-3 LOW_MEMORY
| Field | Details |
|---|---|
| Description | Node {{ $labels.kubernetes_node }} available memory at {{ $value | humanize }} percent. |
| Summary | Node {{ $labels.kubernetes_node }} running out of memory |
| Cause | More than 80% of the allocated memory on the node is consumed. |
| Severity | Major |
| SNMP Trap ID | 1007 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
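A sketch of the first checks for the LOW_MEMORY alert; the same checks apply to the OUT_OF_MEMORY alert (Table 6-4). The node name is a placeholder, and the kubectl check assumes a metrics provider is available.

```bash
# Confirm memory pressure on the node reported in the alert.
kubectl top node <node-name>

# On the node itself, check remaining memory and the processes using the most of it.
free -h
ps -eo pid,comm,rss --sort=-rss | head -15
```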
Table 6-4 OUT_OF_MEMORY
| Field | Details |
|---|---|
| Description | Node {{ $labels.kubernetes_node }} out of memory |
| Summary | Node {{ $labels.kubernetes_node }} out of memory |
| Cause | More than 90% of the allocated memory on the node is consumed. |
| Severity | Critical |
| SNMP Trap ID | 1008 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-5 NTP_SANITY_CHECK_FAILED
| Field | Details |
|---|---|
| Description | NTP service sanity check failed on node {{ $labels.kubernetes_node }} |
| Summary | Clock is not synchronized on node {{ $labels.kubernetes_node }} |
| Cause | Clock is not synchronized on the node. |
| Severity | Minor |
| SNMP Trap ID | 1009 |
| Affects Service (Y/N) | N |
| Recommended Actions | Synchronize chronyd on the node. The steps are outlined after this table. |
Table 6-6 NETWORK_INTERFACE_FAILED
| Field | Details |
|---|---|
| Description | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
| Summary | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
| Cause | Network interface is unavailable on the node. |
| Severity | Critical |
| SNMP Trap ID | 1010 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-7 PVC_NEARLY_FULL
| Field | Details |
|---|---|
| Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
| Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full. |
| Cause | PVC storage is filled to 80% of allocated space. |
| Severity | Major |
| SNMP Trap ID | 1011 |
| Affects Service (Y/N) | N |
| Recommended Actions | Use the relevant procedure to increase the size of the PVC. Example commands follow this table. |
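A sketch of how the PVC size can be inspected and, if the storage class supports volume expansion, increased. The PVC name, namespace, storage class, and target size are placeholders.

```bash
# Check the current size and storage class of the nearly full PVC.
kubectl -n <namespace> get pvc <pvc-name>

# Verify that the storage class allows expansion before patching.
kubectl get storageclass <storage-class> -o jsonpath='{.allowVolumeExpansion}'

# Request a larger size; the volume is resized only if expansion is allowed.
kubectl -n <namespace> patch pvc <pvc-name> \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```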
Table 6-8 PVC_FULL
| Field | Details |
|---|---|
| Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
| Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is full. |
| Cause | PVC storage is filled to 90% of the allocated space. |
| Severity | Critical |
| SNMP Trap ID | 1012 |
| Affects Service (Y/N) | Y |
| Recommended Actions | NA |
Table 6-9 NODE_UNAVAILABLE
| Field | Details |
|---|---|
| Description | Kubernetes node {{ $labels.kubernetes_node }} is not in Ready state. |
| Summary | Kubernetes node {{ $labels.kubernetes_node }} is unavailable. |
| Cause | Node is not in ready state. |
| Severity | Critical |
| SNMP Trap ID | 1013 |
| Affects Service (Y/N) | Y |
| Recommended Actions | First, check whether the given node is in the running or shutoff state. If the node is in the shutoff state, try restarting it from OpenStack or iLO. If the node is in the running state, perform the checks outlined after this table. |
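If the node is powered on but NotReady, the following sketch shows typical first checks; the node name is a placeholder.

```bash
# Confirm the node state and look at its recent conditions and events.
kubectl get nodes
kubectl describe node <node-name>

# On the node itself, check the kubelet and restart it if needed.
systemctl status kubelet
sudo systemctl restart kubelet
```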
Table 6-10 ETCD_NODE_DOWN
| Field | Details |
|---|---|
| Description | Etcd is not running or is unavailable. |
| Summary | Etcd is down. |
| Cause | Etcd is not running. |
| Severity | Critical |
| SNMP Trap ID | 1014 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Refer to the etcd restore documentation to restore the failed etcd. A basic health check is outlined after this table. |
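A sketch of a basic etcd health check from a controller node. The endpoint and certificate paths are assumptions and depend on how etcd is deployed in your cluster.

```bash
# Check the health of the local etcd member (certificate paths are assumed
# defaults and may differ in your deployment).
sudo etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key

# If etcd runs under systemd on the controller nodes, check its service state.
systemctl status etcd
```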
Table 6-11 APISERVER_CERTIFICATE_EXPIRATION_90D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate is expiring in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates soon. |
| Summary | The Kubernetes API Server client certificate expires in less than 90 days. |
| Cause | The cluster has not been upgraded in the last 275 days and the certificate expires in 90 days. |
| Severity | Warning |
| SNMP Trap ID | 1033 |
| Affects Service (Y/N) | N |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-12 APISERVER_CERTIFICATE_EXPIRATION_30D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate expires in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates soon. |
| Summary | The Kubernetes API server client certificate expires in less than 30 days. |
| Cause | The cluster has not been upgraded in the last 335 days and the certificate expires in 30 days. |
| Severity | Major |
| SNMP Trap ID | 1034 |
| Affects Service (Y/N) | N |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-13 APISERVER_CERTIFICATE_EXPIRATION_7D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate will expire in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates immediately. |
| Summary | The Kubernetes API Server client certificate expires in less than 7 days. |
| Cause | The cluster has not been upgraded in the last 358 days and the certificate expires in 7 days. |
| Severity | Critical |
| SNMP Trap ID | 1035 |
| Affects Service (Y/N) | Y |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-14 CEPH_OSD_NEARLY_FULL
| Field | Details |
|---|---|
| Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 75% on host {{ $labels.hostname }}. |
| Summary | OSD storage device is nearly full. |
| Cause | OSD storage device is 75% full. |
| Severity | Major |
| SNMP Trap ID | 1036 |
| Affects Service (Y/N) | N |
| Recommended Actions | Contact Oracle support. |
Table 6-15 CEPH_OSD_FULL
| Field | Details |
|---|---|
| Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 80% on host {{ $labels.hostname }}. |
| Summary | OSD storage device is critically full. |
| Cause | OSD storage device is 80% full. |
| Severity | Critical |
| SNMP Trap ID | 1037 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-16 CEPH_OSD_DOWN
| Field | Details |
|---|---|
| Description | Storage node {{ $labels.ceph_daemon }} is down. |
| Summary | Storage node {{ $labels.ceph_daemon }} is down. |
| Cause | Ceph OSD is down. |
| Severity | Major |
| SNMP Trap ID | 1038 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-17 VSPHERE_CSI_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The vSphere CSI controller process failed. |
| Summary | The vSphere CSI controller process failed. |
| Cause | The vsphere_csi_controller process is down. |
| Severity | Critical |
| SNMP Trap ID | 1042 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
6.2 Common Services Alerts
This section provides details about common services alerts.
Table 6-18 OPENSEARCH_CLUSTER_HEALTH_RED
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} All the primary and replica shards are not allocated in Oracle OpenSearch cluster {{ $labels.cluster }} for instance {{ $labels.instance }} |
| Summary | Cluster Name : {{ $externalLabels.cluster }} Both primary and replica shards are not available. |
| Cause | Some or all of the primary shards are not ready. |
| Severity | Critical |
| SNMP Trap ID | 1043 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Check the index for which the primary and replica shards cannot be created, and check for indices that are in the yellow or red state. Remove them using the procedure outlined after this table. If this does not resolve the issue, clean up all the indices to restore OpenSearch to the GREEN state. |
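A sketch of how the problematic indices can be identified and removed through the OpenSearch REST API. The host, port, and credentials are placeholders, and deleting an index permanently removes its data.

```bash
# Check the overall cluster health and list indices in the red or yellow state.
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cluster/health?pretty"
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=red&v"
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=yellow&v"

# Delete an index whose shards cannot be allocated (this permanently removes its data).
curl -sk -u <user>:<password> -X DELETE "https://<opensearch-host>:9200/<index-name>"
```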
Table 6-19 OPENSEARCH_CLUSTER_HEALTH_YELLOW
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} The primary shard has been allocated in {{ $labels.cluster }} for instance {{ $labels.instance }}, but replicas for the shard could not be allocated. |
| Summary | Cluster Name : {{ $externalLabels.cluster }} The primary shard is allocated but replicas are not. |
| Cause | Indicates that OpenSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated. This issue is observed in some cases after a node restart or shutdown. |
| Severity | Major |
| SNMP Trap ID | 1044 |
| Affects Service (Y/N) | N |
| Recommended Actions | Yellow alarms are often observed after a node shutdown or restart. In most cases, Oracle OpenSearch recovers on its own. If it does not, remove the replicas from the problematic index whose replicas cannot be allocated, as shown after this table. |
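If the cluster does not recover on its own, the replicas of the problematic index can be removed as shown in this sketch; the host, credentials, and index name are placeholders.

```bash
# Find the indices whose replicas are unassigned.
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=yellow&v"

# Set the replica count of the problematic index to zero so the cluster returns to green.
curl -sk -u <user>:<password> -X PUT \
  "https://<opensearch-host>:9200/<index-name>/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```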
Table 6-20 OPENSEARCH_TOO_FEW_DATA_NODES_RUNNING
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} There are only {{ $value }} OpenSearch data nodes running in the {{ $labels.cluster }} cluster. |
| Summary | Cluster Name : {{ $externalLabels.cluster }} The {{ $labels.cluster }} cluster is running with fewer than the total number of data nodes. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1045 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-21 PROMETHEUS_NODE_EXPORTER_NOT_RUNNING
| Field | Details |
|---|---|
| Description | Prometheus Node Exporter is NOT running on host {{ $labels.kubernetes_node }}. |
| Summary | Prometheus Node Exporter is NOT running. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1006 |
| Affects Service (Y/N) | Y |
| Recommended Actions | See the pod diagnosis example after this table. |
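The same basic pod diagnosis applies to this alert and to the other alerts in this section whose cause is a repeatedly crashing pod. The namespace and pod selector below are assumptions.

```bash
# Locate the failing pod and inspect why it is not running.
kubectl -n occne-infra get pods | grep -i node-exporter
kubectl -n occne-infra describe pod <pod-name>

# Check the container logs, including those of the previously crashed container.
kubectl -n occne-infra logs <pod-name>
kubectl -n occne-infra logs <pod-name> --previous
```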
Table 6-22 FLUENTD_OPENSEARCH_NOT_AVAILABLE
| Field | Details |
|---|---|
| Description | Fluentd-OpenSearch is not running or is otherwise unavailable. |
| Summary | Fluentd-OpenSearch is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1050 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-23 OPENSEARCH_DOWN
| Field | Details |
|---|---|
| Description | OpenSearch is not running or is otherwise unavailable. |
| Summary | OpenSearch is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1047 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-24 OPENSEARCH_DASHBOARD_DOWN
| Field | Details |
|---|---|
| Description | OpenSearch dashboard is not running or is otherwise unavailable. |
| Summary | OpenSearch dashboard is down. |
| Cause | |
| Severity | Major |
| SNMP Trap ID | 1049 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-25 PROMETHEUS_DOWN
| Field | Details |
|---|---|
| Description | All Prometheus instances are down. No metrics will be collected until at least one Prometheus instance is restored. |
| Summary | Metrics collection is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1017 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-26 ALERT_MANAGER_DOWN
| Field | Details |
|---|---|
| Description | All alert manager instances are down. No alerts will be received until at least one alert manager instance is restored. |
| Summary | Alert notification is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1018 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-27 SNMP_NOTIFIER_DOWN
| Field | Details |
|---|---|
| Description | SNMP Notifier is not running or is unavailable. |
| Summary | SNMP Notifier is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1019 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-28 JAEGER_DOWN
| Field | Details |
|---|---|
| Description | Jaeger collector is not running or is unavailable. |
| Summary | Jaeger is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1020 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-29 METALLB_SPEAKER_DOWN
| Field | Details |
|---|---|
| Description | The MetalLB speaker on worker node {{ $labels.instance }} is down. |
| Summary | A MetalLB speaker is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1021 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-30 METALLB_CONTROLLER_DOWN
| Field | Details |
|---|---|
| Description | The MetalLB controller is not running or is unavailable. |
| Summary | The MetalLB controller is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1022 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-31 GRAFANA_DOWN
| Field | Details |
|---|---|
| Description | Grafana is not running or is unavailable. |
| Summary | Grafana is down. |
| Cause | |
| Severity | Major |
| SNMP Trap ID | 1024 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-32 LOAD_BALANCER_NO_HA
| Field | Details |
|---|---|
| Description | A single load balancer serving the {{ $labels.external_network }} network has failed. Load balancing will continue to operate in simplex mode. |
| Summary | A load balancer for the {{ $labels.external_network }} network is down. |
| Cause | One of the LBVMs is down. |
| Severity | Major |
| SNMP Trap ID | 1025 |
| Affects Service (Y/N) | N |
| Recommended Actions | Replace the failed LBVM. See the procedure for replacing a failed LBVM. |
Table 6-33 LOAD_BALANCER_NO_SERVICE
| Field | Details |
|---|---|
| Description | All Load Balancers serving the {{ $labels.external_network }} network have failed. External access for all services on this network is unavailable. |
| Summary | Load balancing for the {{ $labels.external_network }} network is unavailable. |
| Cause | Both LBVMs are down; as a result, the external network is down. |
| Severity | Critical |
| SNMP Trap ID | 1026 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Replace one LBVM, wait for lb_monitor to convert it from the STANDBY to the ACTIVE state (run lb_monitor.py manually if needed), and then replace the other LBVM. See the procedure for replacing a failed LBVM. |
Table 6-34 LOAD_BALANCER_FAILED
| Field | Details |
|---|---|
| Description | Load balancer {{ $labels.name }} at IP {{ $labels.ip_address }} on the {{ $labels.external_network }} network has failed. Perform the load balancer recovery procedure to restore it. |
| Summary | A load balancer failed. |
| Cause | One or both of the LBVMs are down. |
| Severity | Major |
| SNMP Trap ID | 1027 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Although this alert is not always service affecting, the load balancer must be restored to restore high availability for load balancing. Replace one or both of the LBVMs. See the procedure for replacing a failed LBVM. |
Table 6-35 PROMETHEUS_NO_HA
| Field | Details |
|---|---|
| Description | A Prometheus instance has failed. Metrics collection will continue to operate in simplex mode. |
| Summary | A Prometheus instance is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1028 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-36 ALERT_MANAGER_NO_HA
| Field | Details |
|---|---|
| Description | An AlertManager instance has failed. Alert management will continue to operate in simplex mode. |
| Summary | An AlertManager instance is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1029 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-37 PROMXY_METRICS_AGGREGATOR_DOWN
| Field | Details |
|---|---|
| Description | Promxy failed. Metrics will be retrieved from a single Prometheus instance only. |
| Summary | Promxy is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1032 |
| Affects Service (Y/N) | Y |
| Recommended Actions | As metrics are retrieved from a single Prometheus instance, there may be gaps in the retrieved data. Promxy must be restarted to restore the full data retrieval capabilities. |
Table 6-38 VCNE_LB_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The vCNE LB Controller process failed. |
| Summary | The vCNE LB Controller process failed. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1039 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-39 VMWARE_CSI_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The VMware CSI Controller process failed. |
| Summary | The VMware CSI Controller process failed. |
| Cause | The CSI Controller process failed. Note: This alert is raised only when CNE is installed on a VMware infrastructure. |
| Severity | Critical |
| SNMP Trap ID | 1042 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-40 EGRESS_CONTROLLER_NOT_AVAILABLE
| Field | Details |
|---|---|
| Description | Egress controller is not running or is unavailable. |
| Summary | Egress controller is down |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1048 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-41 OPENSEARCH_DATA_PVC_NEARLY_FULL
| Field | Details |
|---|---|
| Description | OpenSearch data volume {{ $persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. Once full, the OpenSearch cluster starts throwing index_block_exceptions. Either increase the OpenSearch data PVC size or remove unnecessary indices. |
| Summary | OpenSearch Data Volume is nearly full. |
| Cause | OpenSearch data PVCs are nearly full. |
| Severity | Major |
| SNMP Trap ID | 1051 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Perform one of the following: remove unnecessary indices or increase the size of the OpenSearch data PVC. Example commands follow this table. |
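A sketch of how to see which indices consume the most space before deciding whether to remove old indices or expand the data PVC. The OpenSearch host, credentials, and namespace are placeholders.

```bash
# List indices sorted by on-disk size to find candidates for removal.
curl -sk -u <user>:<password> \
  "https://<opensearch-host>:9200/_cat/indices?v&s=store.size:desc" | head -20

# Alternatively, check the current size of the OpenSearch data PVCs before expanding them.
kubectl -n occne-infra get pvc | grep -i opensearch
```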
6.3 Bastion Host Alerts
This section provides details about Bastion Host alerts.
Table 6-42 BASTION_HOST_FAILED
| Field | Details |
|---|---|
| Description | Bastion Host {{ $labels.name }} at IP address {{ $labels.ip_address }} is unavailable. |
| Summary | Bastion Host {{ $labels.name }} is unavailable. |
| Cause | One of the Bastion Hosts failed to respond to liveness tests. |
| Severity | Major |
| SNMP Trap ID | 1040 |
| Affects Service (Y/N) | N |
| Recommended Actions | Contact Oracle support. |
Table 6-43 ALL_BASTION_HOSTS_FAILED
| Field | Details |
|---|---|
| Description | All Bastion Hosts are unavailable. |
| Summary | All Bastion Hosts are unavailable. |
| Cause | All Bastion Hosts fail to respond to liveness tests. |
| Severity | Critical |
| SNMP Trap ID | 1041 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |