Oracle Communications Cloud Native Environment Alerts
This chapter provides information about the Oracle Communications Cloud Native Environment (OCCNE) alerts and the alert rules used to implement them.
General Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
SOFTWARE_INSTALLED | New software has been installed | {{ $labels.product_name }} release {{ $labels.engineering_release }} has been installed | info | software_deployment BY (engineering_release) == 0 | N/A | 1100 | software_deployment metric values: 0 = installed, 1 = upgrade in progress, 2 = upgrade failed, 3 = removed. A rule sketch for this alert appears after this table. |
UPGRADE_IN_PROGRESS | {{ $labels.product_name }} is being upgraded | {{ $labels.product_name }} is being upgraded to release {{ $labels.engineering_release }} | info | software_deployment BY (engineering_release) == 1 | N/A | 1101 | |
UPGRADE_FAILED | {{ $labels.product_name }} upgrade failed | {{ $labels.product_name }} upgrade to release {{ $labels.engineering_release }} failed | major | software_deployment BY (engineering_release) == 2 | N/A | 1102 | |
SOFTWARE_REMOVED | Software removed or replaced | {{ $labels.product_name }} release {{ $labels.engineering_release }} was removed | info | software_deployment BY (engineering_release) == 3 | N/A | 1103 | This alert should auto-clear after 2-3 days. |
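Each row in the table above corresponds to a Prometheus alerting rule. The following is a minimal sketch of how the SOFTWARE_INSTALLED row might be expressed as a rule, assuming a hypothetical group name and a severity label used for routing; the BY clause shown in the Expression column is omitted here because the software_deployment metric already carries the product_name and engineering_release labels. The other rows in this table follow the same pattern with a different comparison value and severity.

```yaml
groups:
  - name: occne_general_alerts        # hypothetical group name
    rules:
      - alert: SOFTWARE_INSTALLED
        # software_deployment values: 0 = installed, 1 = upgrade in progress,
        # 2 = upgrade failed, 3 = removed.
        expr: software_deployment == 0
        labels:
          severity: info              # assumed severity label
        annotations:
          summary: "New software has been installed"
          description: "{{ $labels.product_name }} release {{ $labels.engineering_release }} has been installed"
```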
Kubernetes Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
DISK_SPACE_LOW | Disk space is running out on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} | Disk space is running out on kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} (< 20% left) | critical | ((node_filesystem_free_bytes / node_filesystem_size_bytes) * 100) < 20 | 1m | 1001 | A rule sketch for this alert appears after this table. |
CPU_LOAD_HIGH | CPU load is high on host {{ $labels.kubernetes_node }} | CPU load is high on host {{ $labels.kubernetes_node }}. CPU load: {{ $value }}%. Instance: {{ $labels.instance }} | warning | round((1 - (sum(node_cpu_seconds_total{mode="idle"}) by (kubernetes_node, instance) / sum(node_cpu_seconds_total) by (kubernetes_node, instance) )) * 100 , .01) > 80 | 2m | 1002 | |
LOW_MEMORY | Node {{ $labels.kubernetes_node }} running out of memory | Node {{ $labels.kubernetes_node }} available memory at {{ $value }} percent | warning | avg BY (kubernetes_node) (avg_over_time( node_memory_MemAvailable[10m])) / avg BY (kubernetes_node) (avg_over_time( node_memory_MemTotal[10m])) * 100 <= 20 | 1m | 1007 | |
OUT_OF_MEMORY | Node {{ $labels.kubernetes_node }} out of memory | Node {{ $labels.kubernetes_node }} available memory at < 1 percent | critical | avg BY (kubernetes_node) (avg_over_time( node_memory_MemAvailable[1m])) / avg BY (kubernetes_node) (avg_over_time( node_memory_MemTotal[1m])) * 100 < 1 | N/A | 1008 | Averaging over a smaller interval, and not requiring the OOM condition to persist, to get a more responsive alert. If the node has (almost) no free memory for 1 minute then we alert immediately. |
NTP_SANITY_CHECK_FAILED | Clock not synchronized on node {{ $labels.kubernetes_node }} | NTP service sanity check failed on node {{ $labels.kubernetes_node }} | minor | node_timex_sync_status == 0 | 1m | 1009 | |
NETWORK_UNAVAILABLE | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable | critical | node_network_up{device=~"(eno|eth).+"} == 0 | 30s | 1010 | On bare metal, external network interfaces are assumed to start with the prefix "eno". On vCNE, they are assumed to start with "eth". Kubernetes creates many virtual network interfaces, some of which are always down, so the expression must specifically select these external-facing interfaces when alerting. |
PVC_NEARLY_FULL | Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. | warning | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 5 | 10m | 1011 | |
PVC_FULL | Persistent volume claim {{ $labels.persistentvolumeclaim }} is full | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. | major | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 0.1 | 10m | 1012 | |
NODE_UNAVAILABLE | Kubernetes node {{ $labels.node }} is unavailable | Kubernetes node {{ $labels.node }} is not in Ready state | critical | kube_node_status_condition{condition="Ready", status="true"} == 0 | 30s | 1013 | |
ETCD_NODE_DOWN | Etcd is down | Etcd is not running or is otherwise unavailable | critical | sum(up{job=~".*etcd.*"} == 1) == 0 | 30s | 1014 | |
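To illustrate how the Expression, For, and Severity columns combine, the following is a minimal sketch of the DISK_SPACE_LOW row as a Prometheus rule; the group name and severity label are illustrative assumptions, and the annotations simply reuse the Summary and Description text from the table.

```yaml
groups:
  - name: occne_kubernetes_alerts     # hypothetical group name
    rules:
      - alert: DISK_SPACE_LOW
        # Fires when free space on a filesystem stays below 20% of its size
        # for 1 minute (the For column).
        expr: ((node_filesystem_free_bytes / node_filesystem_size_bytes) * 100) < 20
        for: 1m
        labels:
          severity: critical          # assumed severity label
        annotations:
          summary: "Disk space is running out on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}"
          description: "Disk space is running out on kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} (< 20% left)"
```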
Common Service Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
ELASTICSEARCH_CLUSTER_HEALTH_RED | Both primary and replica shards are not available | Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} | critical | elasticsearch_cluster_health_status{color="red"} == 1 | 1m | 1003 | |
ELASTICSEARCH_CLUSTER_HEALTH_YELLOW | The primary shard is allocated but replicas are not | Instance {{ $labels.instance }}: not all replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} | warning | elasticsearch_cluster_health_status{color="yellow"} == 1 | 1m | 1004 | |
ELASTICSEARCH_DOWN | Elasticsearch is down | Elasticsearch is not running or is otherwise unavailable | critical | elasticsearch_cluster_health_up == 0 | 10s | 1016 | |
ELASTICSEARCH_TOO_FEW_DATA_NODES_RUNNING | {{ $labels.cluster }} cluster running on fewer than 3 data nodes | Only {{ $value }} Elasticsearch data nodes are running in the {{ $labels.cluster }} cluster; at least 3 data nodes are required. | critical | elasticsearch_cluster_health_number_of_data_nodes < 3 | 2m | 1005 | |
FLUENTD_NOT_AVAILABLE | Fluentd is down | Fluentd is not running or is otherwise unavailable | critical | kube_daemonset_status_number_ready{daemonset="occne-logs-fluentd-elasticsearch"} == 0 | 10s | 1015 | Fluentd runs as a daemonset, with one replica on each worker node. There is no easy way to trace a replica failure to a specific worker node, and the kube_pod_status_ready metric appears to keep reporting on failed pods from the past, which would lead to false alerts. The alert therefore fires only when all Fluentd replicas are down. A rule sketch for this alert appears after this table. |
GRAFANA_DOWN | Grafana is down | Grafana is not running or is otherwise unavailable | major | up{app="grafana"} == 0 | 30s | 1024 | |
JAEGER_DOWN | Jaeger is down | Jaeger collector is not running or is otherwise unavailable | critical | kube_replicaset_status_ready_replicas{replicaset=~"occne-tracer-jaeger-collector-.*"} == 0 | 10s | 1020 | Reporting on the Jaeger collector only. |
KIBANA_DOWN | Kibana is down | Kibana is not running or is otherwise unavailable | major | (kube_deployment_status_replicas_unavailable{deployment="occne-kibana"} == kube_deployment_status_replicas{deployment="occne-kibana"}) | 30s | 1023 | |
METALLB_CONTROLLER_DOWN | The MetalLB controller is down | The MetalLB controller is not running or is otherwise unavailable | critical | up{app="metallb", component="controller"} == 0 | 30s | 1022 | |
METALLB_SPEAKER_DOWN | A MetalLB speaker is down | The MetalLB speaker on worker node {{ $labels.instance }} is down | major | up{app="metallb", component="speaker"} == 0 | 10s | 1021 | The up metric does not directly identify the worker node that the MetalLB speaker was running on, but it does give the worker node IP. |
PROMETHEUS_DOWN | Prometheus is down | Prometheus is not running or is otherwise unavailable | critical | kube_deployment_status_replicas_available{deployment="occne-prometheus-server"} == 0 | 10s | 1017 | |
PROMETHEUS_NODE_EXPORTER_NOT_RUNNING | Prometheus Node Exporter is not running | Prometheus Node Exporter is not running on host {{ $labels.kubernetes_node }} | critical | up{app="prometheus-node-exporter"} == 0 | 1m | 1006 | |
SNMP_NOTIFIER_DOWN | SNMP Notifier is down | SNMP Notifier is not running or is otherwise unavailable | critical | kube_deployment_status_replicas_available{deployment="occne-snmp-notifier"} == 0 | 10s | 1019 | |
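The FLUENTD_NOT_AVAILABLE note above explains why the alert is keyed to the DaemonSet as a whole rather than to individual pods. A minimal sketch of that rule is shown below, assuming a hypothetical group name and a severity label; the expression and timings are taken from the table.

```yaml
groups:
  - name: occne_common_service_alerts   # hypothetical group name
    rules:
      - alert: FLUENTD_NOT_AVAILABLE
        # Fluentd is a DaemonSet (one replica per worker node). Per-node
        # tracking is unreliable, so the alert fires only when no replica
        # at all is ready.
        expr: kube_daemonset_status_number_ready{daemonset="occne-logs-fluentd-elasticsearch"} == 0
        for: 10s
        labels:
          severity: critical            # assumed severity label
        annotations:
          summary: "Fluentd is down"
          description: "Fluentd is not running or is otherwise unavailable"
```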
Node status alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
NODE_DOWN | MySQL {{ $labels.node_type }} node with node ID {{ $labels.node_id }} is down | major | db_tier_data_node_status == 0 | N/A | 2001 | A value of 0 indicates that the node is down; 1 indicates that the node is up. A rule sketch for this alert appears after this table. |
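A minimal sketch of the NODE_DOWN row as a Prometheus rule follows; the group name and severity label are illustrative assumptions.

```yaml
groups:
  - name: occne_db_tier_node_alerts   # hypothetical group name
    rules:
      - alert: NODE_DOWN
        # db_tier_data_node_status: 1 = node up, 0 = node down.
        expr: db_tier_data_node_status == 0
        labels:
          severity: major             # assumed severity label
        annotations:
          summary: "MySQL {{ $labels.node_type }} node with node ID {{ $labels.node_id }} is down"
```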
CPU alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
HIGH_CPU | Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} percent. | warning | (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) BY (node_id))) >= 85 | 1m | 2002 | Alerting on average CPU utilization over the prior 10 minutes, rather than requiring the CPU utilization for every reporting period over a 10 minute interval to be > 85%. A rule sketch for this alert appears after this table. |
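The note above describes alerting on the 10-minute average rather than on every individual sample. A minimal sketch of that rule is shown below, assuming a hypothetical group name and a severity label.

```yaml
groups:
  - name: occne_db_tier_cpu_alerts    # hypothetical group name
    rules:
      - alert: HIGH_CPU
        # 100 minus the 10-minute average idle percentage gives average
        # utilization; averaging avoids requiring every sample in the
        # window to exceed 85%.
        expr: (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 85
        for: 1m
        labels:
          severity: warning           # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} percent."
```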
Memory utilization alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
LOW_MEMORY | Node ID {{ $labels.node_id }} memory utilization at {{ $value }} percent. | warning | (avg_over_time(db_tier_memory_used_bytes[10m]) BY (node_id, memory_type) / avg_over_time(db_tier_memory_total_bytes[10m]) BY (node_id, memory_type)) * 100 >= 85 | 1m | 2003 | Alerting on average memory utilization over the prior 10 minutes, rather than requiring the memory utilization for every reporting period over a 10 minute interval to be > 85%. |
OUT_OF_MEMORY | Node ID {{ $labels.node_id }} out of memory. | critical | (db_tier_memory_used_bytes) BY (node_id, memory_type) >= (db_tier_memory_total_bytes) BY (node_id, memory_type) | N/A | 2004 | Any OOM condition is alerted immediately; the condition does not need to persist for any length of time. A rule sketch for these alerts appears after this table. |
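A minimal sketch of the two memory rules follows; the group name and severity labels are illustrative assumptions. The BY clauses listed in the table are omitted because a PromQL binary operation already matches series on their shared labels (node_id, memory_type).

```yaml
groups:
  - name: occne_db_tier_memory_alerts  # hypothetical group name
    rules:
      - alert: LOW_MEMORY
        # 10-minute average used memory as a percentage of total memory,
        # matched per node_id and memory_type.
        expr: (avg_over_time(db_tier_memory_used_bytes[10m]) / avg_over_time(db_tier_memory_total_bytes[10m])) * 100 >= 85
        for: 1m
        labels:
          severity: warning            # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} memory utilization at {{ $value }} percent."
      - alert: OUT_OF_MEMORY
        # No "for" clause: an out-of-memory condition is alerted immediately.
        expr: db_tier_memory_used_bytes >= db_tier_memory_total_bytes
        labels:
          severity: critical           # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} out of memory."
```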