Oracle Communications Cloud Native Environment Alerts

This chapter provides information about the Oracle Communications Cloud Native Environment (OCCNE) alerts, and the alert rules used to implement them.

General Alerts

Alert Name Summary Description Severity Expression For SNMP Trap ID Notes
SOFTWARE_INSTALLED New software has been installed {{ $labels.product_name }} release {{ $labels.engineering_release }} has been installed info software_deployment == 0 N/A 1100 software_deployment metric values: 0 = installed, 1 = upgrade in progress, 2 = upgrade failed, 3 = removed
UPGRADE_IN_PROGRESS {{ $labels.product_name }} is being upgraded {{ $labels.product_name }} is being upgraded to release {{ $labels.engineering_release }} info software_deployment == 1 N/A 1101
UPGRADE_FAILED {{ $labels.product_name }} upgrade failed {{ $labels.product_name }} upgrade to release {{ $labels.engineering_release }} failed major software_deployment == 2 N/A 1102
SOFTWARE_REMOVED Software removed or replaced {{ $labels.product_name }} release {{ $labels.engineering_release }} was removed info software_deployment == 3 N/A 1103 This alert should clear automatically after two to three days.
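
The rows above translate into standard Prometheus alerting rules. The following is a minimal sketch of the UPGRADE_FAILED alert as a rule file; the group name and file layout are illustrative assumptions, while the alert name, expression, severity, and annotation text are taken from the table:

    groups:
      - name: occne-general-alerts        # assumed group name, not necessarily the shipped rule file
        rules:
          - alert: UPGRADE_FAILED
            # software_deployment values: 0 = installed, 1 = upgrade in progress,
            # 2 = upgrade failed, 3 = removed
            expr: software_deployment == 2
            # No "for" clause (the table lists N/A), so the alert fires as soon as
            # the expression is true
            labels:
              severity: major
            annotations:
              summary: '{{ $labels.product_name }} upgrade failed'
              description: '{{ $labels.product_name }} upgrade to release {{ $labels.engineering_release }} failed'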

Kubernetes Alerts

Alert Name Summary Description Severity Expression For SNMP Trap ID Notes
DISK_SPACE_LOW Disk space is RUNNING OUT on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} Disk space is almost RUNNING OUT for kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} (< 20% left) critical ((node_filesystem_free_bytes / node_filesystem_size_bytes) * 100) < 20 1m 1001
CPU_LOAD_HIGH CPU load is high on host {{ $labels.kubernetes_node }} CPU load is high on host {{ $labels.kubernetes_node }}. CPU load: {{ $value }}%. Instance: {{ $labels.instance }} warning round((1 - (sum(node_cpu_seconds_total{mode="idle"}) by (kubernetes_node, instance) / sum(node_cpu_seconds_total) by (kubernetes_node, instance) )) * 100 , .01) > 80 2m 1002
LOW_MEMORY Node {{ $labels.kubernetes_node }} running out of memory Node {{ $labels.kubernetes_node }} available memory at {{ $value }} percent warning avg BY (kubernetes_node) (avg_over_time(node_memory_MemAvailable[10m])) / avg BY (kubernetes_node) (avg_over_time(node_memory_MemTotal[10m])) * 100 <= 20 1m 1007
OUT_OF_MEMORY Node {{ $labels.kubernetes_node }} out of memory Node {{ $labels.kubernetes_node }} available memory at < 1 percent critical avg BY (kubernetes_node) (avg_over_time( node_memory_MemAvailable[1m])) / avg BY (kubernetes_node) (avg_over_time( node_memory_MemTotal[1m])) * 100 < 1 N/A 1008 Averaging over a smaller interval, and not requiring the OOM condition to persist, to get a more responsive alert. If the node has (almost) no free memory for 1 minute then we alert immediately.
NTP_SANITY_CHECK_FAILED Clock not synchronized on node {{ $labels.kubernetes_node }} NTP service sanity check failed on node {{ $labels.kubernetes_node }} minor node_timex_sync_status == 0 1m 1009
NETWORK_UNAVAILABLE Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable critical node_network_up{device=~"(eno|eth).+"} == 0 30s 1010 On bare metal, external network interfaces are assumed to start with the prefix "eno". On vCNE, they are assumed to start with "eth". Kubernetes creates many virtual network interfaces, some of which are always down, so we need to specifically select these external-facing interfaces when alarming.
PVC_NEARLY_FULL Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. warning (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 5 10m 1011
PVC_FULL Persistent volume claim {{ $labels.persistentvolumeclaim }} is full Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. major (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 0.1 10m 1012
NODE_UNAVAILABLE Kubernetes node {{ $labels.node }} is unavailable Kubernetes node {{ $labels.node }} is not in Ready state critical kube_node_status_condition{condition="Ready", status="true"} == 0 30s 1013
ETCD_NODE_DOWN Etcd is down Etcd is not running or is otherwise unavailable critical sum(up{job=~".*etcd.*"} == 1) == 0 30s 1014
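
To show how the For column maps onto a rule definition, here is a similar sketch of the NETWORK_UNAVAILABLE alert (group name assumed; expression, duration, severity, and summary taken from the table above):

    groups:
      - name: occne-kubernetes-alerts     # assumed group name
        rules:
          - alert: NETWORK_UNAVAILABLE
            # Match only external-facing interfaces: names starting with "eno" on
            # bare metal or "eth" on vCNE, skipping Kubernetes' virtual interfaces
            expr: node_network_up{device=~"(eno|eth).+"} == 0
            # The interface must stay down for 30 seconds before the alert fires
            for: 30s
            labels:
              severity: critical
            annotations:
              summary: 'Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable'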

Common Service Alerts

Alert Name Summary Description Severity Expression For SNMP Trap ID Notes
ELASTICSEARCH_CLUSTER_HEALTH_RED Both primary and replica shards are not available Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} critical elasticsearch_cluster_health_status{color="red"} == 1 1m 1003
ELASTICSEARCH_CLUSTER_HEALTH_YELLOW The primary shard is allocated but replicas are not Instance {{ $labels.instance }}: not all replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} warning elasticsearch_cluster_health_status{color="yellow"} == 1 1m 1004
ELASTICSEARCH_DOWN Elasticsearch is down Elasticsearch is not running or is otherwise unavailable critical elasticsearch_cluster_health_up == 0 10s 1016
ELASTICSEARCH_TOO_FEW_DATA_NODES_RUNNING {{ $labels.cluster }} cluster running on fewer than 3 data nodes There are only {{$value}} Elasticsearch data nodes running in {{ $labels.cluster }} cluster; the required number of data nodes is 3 or higher. critical elasticsearch_cluster_health_number_of_data_nodes < 3 2m 1005
FLUENTD_NOT_AVAILABLE Fluentd is down Fluentd is not running or is otherwise unavailable critical kube_daemonset_status_number_ready{daemonset="occne-logs-fluentd-elasticsearch"} == 0 10s 1015 Fluentd runs as a DaemonSet, that is, one replica on each worker node. Unfortunately, there is no easy way to trace a replica failure to a specific worker node, and the kube_pod_status_ready metric seems to keep reporting on failed pods from the past, which would lead to false alerts. All we can do here is alert if all Fluentd replicas are down.
GRAFANA_DOWN Grafana is down Grafana is not running or is otherwise unavailable major up{app="grafana"} == 0 30s 1024
JAEGER_DOWN Jaeger is down Jaeger collector is not running or is otherwise unavailable critical kube_replicaset_status_ready_replicas{replicaset=~"occne-tracer-jaeger-collector-.*"} == 0 10s 1020 Reporting on the Jaeger collector only.
KIBANA_DOWN Kibana is down Kibana is not running or is otherwise unavailable major (kube_deployment_status_replicas_unavailable{deployment="occne-kibana"} == kube_deployment_status_replicas{deployment="occne-kibana"}) 30s 1023
METALLB_CONTROLLER_DOWN The MetalLB controller is down The MetalLB controller is not running or is otherwise unavailable critical up{app="metallb", component="controller"} == 0 30s 1022
METALLB_SPEAKER_DOWN A MetalLB speaker is down The MetalLB speaker on worker node {{ $labels.instance }} is down major up{app="metallb", component="speaker"} == 0 10s 1021 The up metric does not directly identify which worker node the MetalLB speaker was running on, but it does provide the worker node IP.
PROMETHEUS_DOWN Prometheus is down Prometheus is not running or is otherwise unavailable critical kube_deployment_status_replicas_available{deployment="occne-prometheus-server"} == 0 10s 1017
PROMETHEUS_NODE_EXPORTER_NOT_RUNNING Prometheus Node Exporter is NOT running Prometheus Node Exporter is NOT running on host {{ $labels.kubernetes_node }} critical up{app="prometheus-node-exporter"} == 0 1m 1006
SNMP_NOTIFIER_DOWN SNMP Notifier is down SNMP Notifier is not running or is otherwise unavailable critical kube_deployment_status_replicas_available{deployment="occne-snmp-notifier"} == 0 10s 1019
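
The SNMP Trap ID column indicates that firing alerts are forwarded to the SNMP Notifier, which converts them into SNMP traps. A minimal sketch of the Alertmanager side of that wiring is shown below; the service URL is an assumption (9464 is the default snmp_notifier listen port), and the mapping from alert to trap OID is deployment specific:

    # Alertmanager configuration fragment (sketch)
    route:
      receiver: snmp-notifier
    receivers:
      - name: snmp-notifier
        webhook_configs:
          - url: http://occne-snmp-notifier:9464/alerts   # assumed service name and default snmp_notifier port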

Node Status Alerts and Alarms

Alert Name Summary Severity Expression For SNMP Trap ID Notes
NODE_DOWN MySQL {{ $labels.node_type }} node with node ID {{ $labels.node_id }} is down major db_tier_data_node_status == 0 N/A 2001 A value of 0 indicates that the node is down; 1 indicates that the node is up.
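
The same metric can be queried interactively, for example from the Prometheus or Grafana console, when troubleshooting. These are ad-hoc query sketches, not shipped rules:

    # List any MySQL nodes that are currently down (0 = down, 1 = up)
    db_tier_data_node_status == 0

    # Count nodes that are up, grouped by node type
    sum BY (node_type) (db_tier_data_node_status)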

CPU Alerts and Alarms

Alert Name Summary Severity Expression For SNMP Trap ID Notes
HIGH_CPU Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} percent. warning (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) BY (node_id))) >= 85 1m 2002 Alerting on average CPU utilization over the prior 10 minutes, rather than requiring the CPU utilization for every reporting period over a 10 minute interval to be > 85%.
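
The note above distinguishes averaging over a window from requiring the condition to persist. The following PromQL sketch contrasts the two styles; the stricter variant is hypothetical and is not an OCCNE rule:

    # Style used by HIGH_CPU: fire when the 10-minute average utilization is at
    # or above 85%, even if individual samples dip below the threshold
    (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) BY (node_id))) >= 85

    # Hypothetical stricter variant: every sample must exceed the threshold for
    # the full "for" duration before the alert fires
    #   expr: (100 - db_tier_cpu_os_idle) >= 85
    #   for: 10m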

Memory Utilization Alerts and Alarms

Alert Name Summary Severity Expression For SNMP Trap ID Notes
LOW_MEMORY Node ID {{ $labels.node_id }} memory utilization at {{ $value }} percent. warning (avg BY (node_id, memory_type) (avg_over_time(db_tier_memory_used_bytes[10m])) / avg BY (node_id, memory_type) (avg_over_time(db_tier_memory_total_bytes[10m]))) * 100 >= 85 1m 2003 Alerting on average memory utilization over the prior 10 minutes, rather than requiring the memory utilization for every reporting period over a 10 minute interval to be > 85%.
OUT_OF_MEMORY Node ID {{ $labels.node_id }} out of memory. critical db_tier_memory_used_bytes >= on (node_id, memory_type) db_tier_memory_total_bytes N/A 2004 Any OOM condition should be alerted; the condition does not need to persist for any length of time.
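
As with the other tables, the OUT_OF_MEMORY row translates into a rule without a "for" clause; a minimal sketch (group name assumed, everything else from the table above):

    groups:
      - name: occne-db-tier-alerts        # assumed group name
        rules:
          - alert: OUT_OF_MEMORY
            # Compare used against total memory per node and memory type; with no
            # "for" clause, any OOM sample raises the alert immediately
            expr: db_tier_memory_used_bytes >= on (node_id, memory_type) db_tier_memory_total_bytes
            labels:
              severity: critical
            annotations:
              summary: 'Node ID {{ $labels.node_id }} out of memory.'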