6 Alerts
Alerts are used to detect abnormal conditions in CNE and notify the user when any of the common services are not operating normally.
Each alert rule uses the values of one or more metrics stored in Prometheus to identify an abnormal condition. Prometheus periodically evaluates each rule to verify that CNE is operating normally. When rule evaluation indicates an abnormal condition, Prometheus sends an alert to AlertManager. The resulting alert contains information about the affected part of the CNE cluster to aid troubleshooting. Each alert is assigned a severity level to inform the user of the seriousness of the alerting condition. This section provides details about CNE alerts.
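When verifying that one of the alerts described in this section has actually fired, the list of active alerts can be read directly from Prometheus before the notification reaches AlertManager or the SNMP Notifier. The following is a minimal sketch only; the service name occne-prometheus-server and port 80 are assumptions and must be replaced with the Prometheus endpoint exposed in your deployment.

```
# List currently firing alerts from the Prometheus HTTP API (endpoint name is a placeholder).
curl -s http://occne-prometheus-server:80/api/v1/alerts | \
  jq '.data.alerts[] | {alertname: .labels.alertname, severity: .labels.severity, state: .state}'
```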
6.1 Kubernetes Alerts
This section provides details about Kubernetes alerts.
Table 6-1 DISK_SPACE_LOW
Field | Details |
---|---|
Description | Cluster-name : {{ $externalLabels.cluster }} Disk space is almost RUNNING OUT for kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}. Available space is {{ $value }}% (< 20% left). Instance = {{ $labels.instance }} |
Summary | Cluster-name : {{ $externalLabels.cluster }} Disk space is RUNNING OUT on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} |
Cause | Disk space is running out on the node. More than 80% of the allocated disk space on the node is consumed. |
Severity | Critical |
SNMP Trap ID | 1001 |
Affects Service (Y/N) | N |
Recommended Actions |
|
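The Recommended Actions for DISK_SPACE_LOW are not reproduced in this table. As a hedged starting point only, the partition reported by the alert can be inspected directly on the affected node to find what is consuming space; the mountpoint /var below is just a placeholder for {{ $labels.mountpoint }}.

```
# Inspect the partition named in the alert (placeholder mountpoint shown).
df -h /var
# Find the largest directories on that filesystem without crossing mount points.
du -xh --max-depth=1 /var | sort -rh | head -20
```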
Table 6-2 CPU_LOAD_HIGH
Field | Details |
---|---|
Description | CPU load is high on host <node name>. CPU load: {{ $value }}%. Instance : {{ $labels.instance }} |
Summary | CPU load is high on host {{ $labels.kubernetes_node }} |
Cause | CPU load is more than 80% of the allocated resources on the node. |
Severity | Major |
SNMP Trap ID | 1002 |
Affects Service (Y/N) | N |
Recommended Actions |
|
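The Recommended Actions for CPU_LOAD_HIGH are not listed in this table. A hedged sketch of typical first checks is shown below; kubectl top requires a metrics source (such as metrics-server) to be available in the cluster.

```
# Confirm overall CPU usage on the node reported by the alert.
kubectl top node <node-name>
# Identify the heaviest CPU consumers across all namespaces.
kubectl top pods -A --sort-by=cpu | head -20
```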
Table 6-3 LOW_MEMORY
Field | Details |
---|---|
Description | Node {{ $labels.kubernetes_node }} available memory at {{ $value | humanize }} percent. |
Summary | Node {{ $labels.kubernetes_node }} running out of memory |
Cause | More than 80% of the node's allocated memory is consumed. |
Severity | Major |
SNMP Trap ID | 1007 |
Affects Service (Y/N) | N |
Recommended Actions |
|
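The Recommended Actions for LOW_MEMORY are not listed in this table. As an illustrative sketch only, memory pressure can be confirmed on the node and the heaviest consumers identified before deciding on corrective action.

```
# On the affected node: overall and available memory.
free -h
# From a kubectl client: pods consuming the most memory (requires a metrics source).
kubectl top pods -A --sort-by=memory | head -20
```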
Table 6-4 OUT_OF_MEMORY
Field | Details |
---|---|
Description | Node {{ $labels.kubernetes_node }} out of memory |
Summary | Node {{ $labels.kubernetes_node }} out of memory |
Cause | More than 90% of the node's allocated memory is consumed. |
Severity | Critical |
SNMP Trap ID | 1008 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-5 NTP_SANITY_CHECK_FAILED
Field | Details |
---|---|
Description | NTP service sanity check failed on node {{ $labels.kubernetes_node }} |
Summary | Clock is not synchronized on node {{ $labels.kubernetes_node }} |
Cause | Clock is not synchronized on the node. |
Severity | Minor |
SNMP Trap ID | 1009 |
Affects Service (Y/N) | N |
Recommended Actions |
Steps to synchronize chronyd on the node:
|
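The chronyd synchronization steps referenced in Table 6-5 are not reproduced above. As a hedged sketch of the usual checks and recovery commands on the affected node:

```
# Check synchronization status and configured time sources.
chronyc tracking
chronyc sources -v
# If the clock is not synchronized, restart chronyd and force an immediate correction if needed.
sudo systemctl restart chronyd
sudo chronyc makestep
```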
Table 6-6 NETWORK_INTERFACE_FAILED
Field | Details |
---|---|
Description | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
Summary | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
Cause | Network interface is unavailable on the node. |
Severity | Critical |
SNMP Trap ID | 1010 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
Table 6-7 PVC_NEARLY_FULL
Field | Details |
---|---|
Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full. |
Cause | PVC storage is filled to 80% of allocated space. |
Severity | Major |
SNMP Trap ID | 1011 |
Affects Service (Y/N) | N |
Recommended Actions |
2. Use the following procedures to increase the size of the PVC.
|
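The procedures referenced in Table 6-7 for increasing the PVC size are not included here. As a hedged sketch only: online expansion works only if the backing StorageClass has allowVolumeExpansion enabled, and the claim name, namespace, and size below are placeholders.

```
# Confirm which claims exist and whether the StorageClass allows expansion.
kubectl get pvc -A
kubectl get storageclass
# Request a larger size for the claim (placeholder name, namespace, and size).
kubectl patch pvc <pvc-name> -n <namespace> \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```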
Table 6-8 PVC_FULL
Field | Details |
---|---|
Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is full. |
Cause | PVC storage is filled to 90% of the allocated space. |
Severity | Critical |
SNMP Trap ID | 1012 |
Affects Service (Y/N) | Y |
Recommended Actions | NA |
Table 6-9 NODE_UNAVAILABLE
Field | Details |
---|---|
Description | Kubernetes node {{ $labels.kubernetes_node }} is not in Ready state. |
Summary | Kubernetes node {{ $labels.kubernetes_node }} is unavailable. |
Cause | Node is not in ready state. |
Severity | Critical |
SNMP Trap ID | 1013 |
Affects Service (Y/N) | Y |
Recommended Actions | First, check whether the given node is in the running or shutoff state. If the node is in the shutoff state, try restarting it from OpenStack or iLO. If the node is in the running state, then perform the following steps:
|
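The follow-up steps referenced in Table 6-9 are not reproduced above. As an illustrative sketch of the usual checks when the node is running but reported NotReady:

```
# Confirm which node is NotReady and inspect its conditions and recent events.
kubectl get nodes -o wide
kubectl describe node <node-name>
# On the node itself (if reachable), check that the kubelet is healthy.
sudo systemctl status kubelet
sudo journalctl -u kubelet --since "30 min ago" | tail -50
```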
Table 6-10 ETCD_NODE_DOWN
Field | Details |
---|---|
Description | Etcd is not running or is unavailable. |
Summary | Etcd is down. |
Cause | Etcd is not running. |
Severity | Critical |
SNMP Trap ID | 1014 |
Affects Service (Y/N) | Y |
Recommended Actions |
Refer to the following document to restore the failed etcd: |
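The restore document referenced in Table 6-10 is not reproduced here. Before starting a restore, etcd health can be confirmed from a controller node; the certificate paths below follow a common kubeadm layout and are assumptions that may differ in a CNE deployment.

```
# Check the health of all etcd cluster members (certificate paths are placeholders).
sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health --cluster
```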
Table 6-11 CEPH_OSD_NEARLY_FULL
Field | Details |
---|---|
Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 75% on host {{ $labels.hostname }}. |
Summary | OSD storage device is nearly full. |
Cause | OSD storage device is 75% full. |
Severity | Major |
SNMP Trap ID | 1036 |
Affects Service (Y/N) | N |
Recommended Actions | Contact Oracle support. |
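The recommended action for CEPH_OSD_NEARLY_FULL is to contact Oracle support; the commands below are only a hedged way to confirm OSD utilization beforehand, and they assume Ceph is deployed through Rook with the standard toolbox deployment (namespace and deployment names may differ).

```
# Overall cluster health and per-OSD utilization via the Rook toolbox (names are assumptions).
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df
```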
Table 6-12 CEPH_OSD_FULL
Field | Details |
---|---|
Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 80% on host {{ $labels.hostname }}. |
Summary | OSD storage device is critically full. |
Cause | OSD storage device is 80% full. |
Severity | Critical |
SNMP Trap ID | 1037 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
Table 6-13 CEPH_OSD_DOWN
Field | Details |
---|---|
Description | Storage node {{ $labels.ceph_daemon }} is down. |
Summary | Storage node {{ $labels.ceph_daemon }} is down. |
Cause | Ceph OSD is down. |
Severity | Major |
SNMP Trap ID | 1038 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
Table 6-14 VSPHERE_CSI_CONTROLLER_FAILED
Field | Details |
---|---|
Description | The vSphere CSI controller process failed. |
Summary | The vSphere CSI controller process failed. |
Cause | Vsphere_csi_controller is down. |
Severity | Critical |
SNMP Trap ID | 1042 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
6.2 Common Services Alerts
This section provides details about common services alerts.
Table 6-15 OPENSEARCH_CLUSTER_HEALTH_RED
Field | Details |
---|---|
Description | Cluster Name : {{ $externalLabels.cluster }} All the primary and replica shards are not allocated in Oracle OpenSearch cluster {{ $labels.cluster }} for instance {{ $labels.instance }} |
Summary | Cluster Name : {{ $externalLabels.cluster }} Both primary and replica shards are not available. |
Cause | Some or all of the primary shards are not ready. |
Severity | Critical |
SNMP Trap ID | 1043 |
Affects Service (Y/N) | Y |
Recommended Actions |
Check the index for which the primary and replica shards could not be created, and check for the indices that are in yellow or red state. Remove them by using the following procedure:
If this procedure does not resolve the issue, then clean up all the indices to restore OpenSearch to the GREEN state. |
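The index cleanup procedure referenced in Table 6-15 is not reproduced above. As a hedged sketch of the underlying OpenSearch API calls, where the endpoint URL and index name are placeholders:

```
# Placeholder endpoint for the OpenSearch HTTP service.
OS_URL=http://occne-opensearch-client:9200
# Overall cluster status (green/yellow/red) and the indices currently in red state.
curl -s "$OS_URL/_cluster/health?pretty"
curl -s "$OS_URL/_cat/indices?health=red&v"
# Delete an unrecoverable index (placeholder name); this removes its data permanently.
curl -s -X DELETE "$OS_URL/<problem-index>"
```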
Table 6-16 OPENSEARCH_CLUSTER_HEALTH_YELLOW
Field | Details |
---|---|
Description | Cluster Name : {{ $externalLabels.cluster }} The primary shard has been allocated in {{ $labels.cluster }} for Instance {{ $labels.instance }} but replicas for the shard could not be allocated. |
Summary | Cluster Name : {{ $externalLabels.cluster }} The primary shard is allocated but replicas are not. |
Cause | Indicates that OpenSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated. This issue is observed in some cases after a node restart or shutdown. |
Severity | Major |
SNMP Trap ID | 1044 |
Affects Service (Y/N) | N |
Recommended Actions |
Yellow alarms are often observed after a node shutdown or restart. Most of the time, Oracle OpenSearch recovers on its own. If not, perform the following procedure to remove the replicas from the problematic index whose replicas cannot be allocated.
|
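The replica-removal procedure referenced in Table 6-16 is not reproduced above. As a hedged sketch of the underlying OpenSearch API calls (endpoint and index names are placeholders):

```
OS_URL=http://occne-opensearch-client:9200            # placeholder endpoint
# Find indices whose replicas are unassigned and see why allocation is failing.
curl -s "$OS_URL/_cat/indices?health=yellow&v"
curl -s "$OS_URL/_cluster/allocation/explain?pretty"
# Drop replicas for the problematic index; note the index data is then unreplicated.
curl -s -X PUT "$OS_URL/<problem-index>/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index":{"number_of_replicas":0}}'
```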
Table 6-17 OPENSEARCH_TOO_FEW_DATA_NODES_RUNNING
Field | Details |
---|---|
Description | Cluster Name : {{ $externalLabels.cluster }} There are only {{$value}} OpenSearch data nodes running in {{ $labels.cluster }} cluster. |
Summary | Cluster Name : {{ $externalLabels.cluster }} {{ $labels.cluster }} cluster running on less than total number of data nodes. |
Cause |
|
Severity | Critical |
SNMP Trap ID | 1045 |
Affects Service (Y/N) | N |
Recommended Actions |
|
Table 6-18 PROMETHEUS_NODE_EXPORTER_NOT_RUNNING
Field | Details |
---|---|
Description | Prometheus Node Exporter is NOT running on host {{ $labels.kubernetes_node }}. |
Summary | Prometheus Node Exporter is NOT running. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Critical |
SNMP Trap ID | 1006 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
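The Recommended Actions for this alert are not listed in this table. The cause above is the generic pod-failure pattern that also applies to several of the alerts that follow (Fluentd, AlertManager, SNMP Notifier, MetalLB, and others), so a hedged sketch of the usual diagnosis is shown once here; the search string and placeholders must be adapted to the failing component.

```
# Locate the failing pod (search string is a placeholder for the affected component).
kubectl get pods -A -o wide | grep -i node-exporter
# Check Events for OOMKilled or image pull errors, then read logs from the last crash.
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
```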
Table 6-19 FLUENTD_OPENSEARCH_NOT_AVAILABLE
Field | Details |
---|---|
Description | Fluentd-OpenSearch is not running or is otherwise unavailable. |
Summary | Fluentd-OpenSearch is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Critical |
SNMP Trap ID | 1050 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-20 OPENSEARCH_DOWN
Field | Details |
---|---|
Description | OpenSearch is not running or is otherwise unavailable. |
Summary | OpenSearch is down. |
Cause |
|
Severity | Critical |
SNMP Trap ID | 1047 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-21 OPENSEARCH_DASHBOARD_DOWN
Field | Details |
---|---|
Description | OpenSearch dashboard is not running or is otherwise unavailable. |
Summary | OpenSearch dashboard is down. |
Cause |
|
Severity | Major |
SNMP Trap ID | 1049 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
Table 6-22 PROMETHEUS_DOWN
Field | Details |
---|---|
Description | All Prometheus instances are down. No metrics will be collected until at least one Prometheus instance is restored. |
Summary | Metrics collection is down. |
Cause |
|
Severity | Critical |
SNMP Trap ID | 1017 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-23 ALERT_MANAGER_DOWN
Field | Details |
---|---|
Description | All alert manager instances are down. No alerts will be received until at least one alert manager instance is restored. |
Summary | Alert notification is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Critical |
SNMP Trap ID | 1018 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-24 SNMP_NOTIFIER_DOWN
Field | Details |
---|---|
Description | SNMP Notifier is not running or is unavailable. |
Summary | SNMP Notifier is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Critical |
SNMP Trap ID | 1019 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-25 JAEGER_DOWN
Field | Details |
---|---|
Description | Jaeger collector is not running or is unavailable. |
Summary | Jaeger is down. |
Cause |
|
Severity | Critical |
SNMP Trap ID | 1020 |
Affects Service (Y/N) | N |
Recommended Actions |
|
Table 6-26 METALLB_SPEAKER_DOWN
Field | Details |
---|---|
Description | The MetalLB speaker on worker node {{ $labels.instance }} is down. |
Summary | A MetalLB speaker is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Major |
SNMP Trap ID | 1021 |
Affects Service (Y/N) | N |
Recommended Actions |
|
Table 6-27 METALLB_CONTROLLER_DOWN
Field | Details |
---|---|
Description | The MetalLB controller is not running or is unavailable. |
Summary | The MetalLB controller is down. |
Cause |
|
Severity | Critical |
SNMP Trap ID | 1022 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-28 GRAFANA_DOWN
Field | Details |
---|---|
Description | Grafana is not running or is unavailable. |
Summary | Grafana is down. |
Cause |
|
Severity | Major |
SNMP Trap ID | 1024 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-29 LOAD_BALANCER_NO_HA
Field | Details |
---|---|
Description | A single load balancer serving the {{ $labels.external_network }} network has failed. Load balancing will continue to operate in simplex mode. |
Summary | A load balancer for the {{ $labels.external_network }} network is down. |
Cause | One of the LBVMs is down. |
Severity | Major |
SNMP Trap ID | 1025 |
Affects Service (Y/N) | N |
Recommended Actions | Replace the failed LBVM.
For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide. |
Table 6-30 LOAD_BALANCER_NO_SERVICE
Field | Details |
---|---|
Description | All Load Balancers serving the {{ $labels.external_network }} network have failed. External access for all services on this network is unavailable. |
Summary | Load balancing for the {{ $labels.external_network }} network is unavailable. |
Cause | Both LBVMs are down; as a result, the external network is down. |
Severity | Critical |
SNMP Trap ID | 1026 |
Affects Service (Y/N) | Y |
Recommended Actions | Replace one LBVM, wait for lb_monitor to convert it from the STANDBY to ACTIVE state (run lb_monitor.py manually if needed), and then replace the other LBVM.
For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide. |
Table 6-31 LOAD_BALANCER_FAILED
Field | Details |
---|---|
Description | Load balancer {{ $labels.name }} at IP {{ $labels.ip_address }} on the {{ $labels.external_network }} network has failed. Perform Load Balancer recovery procedure to restore. |
Summary | A load balancer failed. |
Cause | One or both of the LBVMs are down. |
Severity | Major |
SNMP Trap ID | 1027 |
Affects Service (Y/N) | Y |
Recommended Actions | Although this alert is not always service affecting, the Load Balancer must be restored to restore high availability for load balancing. Replace one or both of the LBVMs.
For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide. |
Table 6-32 PROMETHEUS_NO_HA
Field | Details |
---|---|
Description | A Prometheus instance has failed. Metrics collection will continue to operate in simplex mode. |
Summary | A Prometheus instance is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Major |
SNMP Trap ID | 1028 |
Affects Service (Y/N) | N |
Recommended Actions |
|
Table 6-33 ALERT_MANAGER_NO_HA
Field | Details |
---|---|
Description | An AlertManager instance has failed. Alert management will continue to operate in simplex mode. |
Summary | An AlertManager instance is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Major |
SNMP Trap ID | 1029 |
Affects Service (Y/N) | N |
Recommended Actions |
|
Table 6-34 PROMXY_METRICS_AGGREGATOR_DOWN
Field | Details |
---|---|
Description | Promxy failed. Metrics will be retrieved from a single Prometheus instance only. |
Summary | Promxy is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Major |
SNMP Trap ID | 1032 |
Affects Service (Y/N) | Y |
Recommended Actions | As metrics are retrieved from a single Prometheus instance, there may be gaps in the retrieved data. Promxy must be restarted to restore the full data retrieval capabilities.
|
Table 6-35 VCNE_LB_CONTROLLER_FAILED
Field | Details |
---|---|
Description | The vCNE LB Controller process failed. |
Summary | The vCNE LB Controller process failed. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Major |
SNMP Trap ID | 1039 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-36 VMWARE_CSI_CONTROLLER_FAILED
Field | Details |
---|---|
Description | The VMware CSI Controller process failed. |
Summary | The VMware CSI Controller process failed. |
Cause | The CSI Controller process failed.
Note: This alert is raised only when CNE is installed on a VMware infrastructure. |
Severity | Critical |
SNMP Trap ID | 1042 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |
Table 6-37 EGRESS_CONTROLLER_NOT_AVAILABLE
Field | Details |
---|---|
Description | Egress controller is not running or is unavailable. |
Summary | Egress controller is down. |
Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
Severity | Critical |
SNMP Trap ID | 1048 |
Affects Service (Y/N) | Y |
Recommended Actions |
|
Table 6-38 OPENSEARCH_DATA_PVC_NEARLY_FULL
Field | Details |
---|---|
Description | OpenSearch Data Volume {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. Once full, this will cause the OpenSearch cluster to start throwing index_block_exceptions; either increase the OpenSearch data PVC or remove unnecessary indices. |
Summary | OpenSearch Data Volume is nearly full. |
Cause | OpenSearch data PVCs are nearly full. |
Severity | Major |
SNMP Trap ID | 1051 |
Affects Service (Y/N) | Y |
Recommended Actions | Perform one of the following recommendations:
|
6.3 Bastion Host Alerts
This section provides details about Bastion Host alerts.
Table 6-39 BASTION_HOST_FAILED
Field | Details |
---|---|
Description | Bastion Host {{ $labels.name }} at IP address {{ $labels.ip_address }} is unavailable. |
Summary | Bastion Host {{ $labels.name }} is unavailable. |
Cause | One of the Bastion Hosts failed to respond to liveness tests. |
Severity | Major |
SNMP Trap ID | 1040 |
Affects Service (Y/N) | N |
Recommended Actions | Contact Oracle support. |
Table 6-40 ALL_BASTION_HOSTS_FAILED
Field | Details |
---|---|
Description | All Bastion Hosts are unavailable. |
Summary | All Bastion Hosts are unavailable. |
Cause | All Bastion Hosts failed to respond to liveness tests. |
Severity | Critical |
SNMP Trap ID | 1041 |
Affects Service (Y/N) | Y |
Recommended Actions | Contact Oracle support. |