Oracle Communications Cloud Native Environment Alerts
This chapter provides information about the Oracle Communications Cloud Native Environment (OCCNE) alerts and the alert rules used to implement them.
General Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
SOFTWARE_INSTALLED | New software has been installed | {{ $labels.product_name }} release {{ $labels.engineering_release }} has been installed | info | software_deployment BY (engineering_release) == 0 | N/A | 1100 | software_deployment metric values: 0 = installed, 1 = upgrade in progress, 2 = upgrade failed, 3 = removed. A rule sketch for this alert appears after this table. |
UPGRADE_IN_PROGRESS | {{ $labels.product_name }} is being upgraded | {{ $labels.product_name }} is being upgraded to release {{ $labels.engineering_release }} | info | software_deployment BY (engineering_release) == 1 | N/A | 1101 | |
UPGRADE_FAILED | {{ $labels.product_name }} upgrade failed | {{ $labels.product_name }} upgrade to release {{ $labels.engineering_release }} failed | major | software_deployment BY (engineering_release) == 2 | N/A | 1102 | |
SOFTWARE_REMOVED | Software removed or replaced | {{ $labels.product_name }} release {{ $labels.engineering_release }} was removed | info | software_deployment BY (engineering_release) == 3 | N/A | 1103 | This alert should auto-clear after 2-3 days. |
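Each row in the table above corresponds to a Prometheus alerting rule. The following is a minimal sketch of how the SOFTWARE_INSTALLED row might be expressed as a rule, assuming a hypothetical group name and a severity label used for routing; the BY clause shown in the Expression column is omitted here because the software_deployment metric already carries the product_name and engineering_release labels. The other rows in this table follow the same pattern with a different comparison value and severity.

```yaml
groups:
  - name: occne_general_alerts        # hypothetical group name
    rules:
      - alert: SOFTWARE_INSTALLED
        # software_deployment values: 0 = installed, 1 = upgrade in progress,
        # 2 = upgrade failed, 3 = removed.
        expr: software_deployment == 0
        labels:
          severity: info              # assumed severity label
        annotations:
          summary: "New software has been installed"
          description: "{{ $labels.product_name }} release {{ $labels.engineering_release }} has been installed"
```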
Kubernetes Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
DISK_SPACE_LOW | Disk space is running out on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} | Disk space is running out on kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} (< 20% left) | critical | ((node_filesystem_free_bytes / node_filesystem_size_bytes) * 100) < 20 | 1m | 1001 | A rule sketch for this alert appears after this table. |
CPU_LOAD_HIGH | CPU load is high on host {{ $labels.kubernetes_node }} | CPU load is high on host {{ $labels.kubernetes_node }}. CPU load: {{ $value }}%. Instance: {{ $labels.instance }} | warning | round((1 - (sum(node_cpu_seconds_total{mode="idle"}) by (kubernetes_node, instance) / sum(node_cpu_seconds_total) by (kubernetes_node, instance) )) * 100 , .01) > 80 | 2m | 1002 | |
LOW_MEMORY | Node {{ $labels.kubernetes_node }} running out of memory | Node {{ $labels.kubernetes_node }} available memory at {{ $value }} percent | warning | avg BY (kubernetes_node) (avg_over_time( node_memory_MemAvailable[10m])) / avg BY (kubernetes_node) (avg_over_time( node_memory_MemTotal[10m])) * 100 <= 20 | 1m | 1007 | |
OUT_OF_MEMORY | Node {{ $labels.kubernetes_node }} out of memory | Node {{ $labels.kubernetes_node }} available memory at < 1 percent | critical | avg BY (kubernetes_node) (avg_over_time( node_memory_MemAvailable[1m])) / avg BY (kubernetes_node) (avg_over_time( node_memory_MemTotal[1m])) * 100 < 1 | N/A | 1008 | Averaging over a smaller interval, and not requiring the OOM condition to persist, to get a more responsive alert. If the node has (almost) no free memory for 1 minute then we alert immediately. |
NTP_SANITY_CHECK_FAILED | Clock not synchronized on node {{ $labels.kubernetes_node }} | NTP service sanity check failed on node {{ $labels.kubernetes_node }} | minor | node_timex_sync_status == 0 | 1m | 1009 | |
NETWORK_UNAVAILABLE | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable | critical | node_network_up{device=~"(eno|eth).+"} == 0 | 30s | 1010 | On bare metal, external network interfaces are assumed to start with the prefix "eno". On vCNE, they are assumed to start with "eth". Kubernetes creates many virtual network interfaces, some of which are always down, so the expression must specifically select these external-facing interfaces when alerting. |
PVC_NEARLY_FULL | Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. | warning | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 5 | 10m | 1011 | |
PVC_FULL | Persistent volume claim {{ $labels.persistentvolumeclaim }} is full | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. | major | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 0.1 | 10m | 1012 | |
NODE_UNAVAILABLE | Kubernetes node {{ $labels.node }} is unavailable | Kubernetes node {{ $labels.node }} is not in Ready state | critical | kube_node_status_condition{condition="Ready", status="true"} == 0 | 30s | 1013 | |
ETCD_NODE_DOWN | Etcd is down | Etcd is not running or is otherwise unavailable | critical | sum(up{job=~".*etcd.*"} == 1) == 0 | 30s | 1014 | |
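To illustrate how the Expression, For, and Severity columns combine, the following is a minimal sketch of the DISK_SPACE_LOW row as a Prometheus rule; the group name and severity label are illustrative assumptions, and the annotations simply reuse the Summary and Description text from the table.

```yaml
groups:
  - name: occne_kubernetes_alerts     # hypothetical group name
    rules:
      - alert: DISK_SPACE_LOW
        # Fires when free space on a filesystem stays below 20% of its size
        # for 1 minute (the For column).
        expr: ((node_filesystem_free_bytes / node_filesystem_size_bytes) * 100) < 20
        for: 1m
        labels:
          severity: critical          # assumed severity label
        annotations:
          summary: "Disk space is running out on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}"
          description: "Disk space is running out on kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} (< 20% left)"
```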
Common Service Alerts
Alert Name | Summary | Description | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|---|
ELASTICSEARCH_CLUSTER_HEALTH_RED | Both primary and replica shards are not available | Instance {{ $labels.instance }}: not all primary and replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} | critical | elasticsearch_cluster_health_status{color="red"} == 1 | 1m | 1003 | |
ELASTICSEARCH_CLUSTER_HEALTH_YELLOW | The primary shard is allocated but replicas are not | Instance {{ $labels.instance }}: not all replica shards are allocated in elasticsearch cluster {{ $labels.cluster }} | warning | elasticsearch_cluster_health_status{color="yellow"} == 1 | 1m | 1004 | |
ELASTICSEARCH_DOWN | Elasticsearch is down | Elasticsearch is not running or is otherwise unavailable | critical | elasticsearch_cluster_health_up == 0 | 10s | 1016 | |
ELASTICSEARCH_TOO_FEW_DATA_NODES_RUNNING | {{ $labels.cluster }} cluster running on fewer than 3 data nodes | Only {{ $value }} Elasticsearch data nodes are running in the {{ $labels.cluster }} cluster; at least 3 data nodes are required. | critical | elasticsearch_cluster_health_number_of_data_nodes < 3 | 2m | 1005 | |
FLUENTD_NOT_AVAILABLE | Fluentd is down | Fluentd is not running or is otherwise unavailable | critical | kube_daemonset_status_number_ready{daemonset="occne-logs-fluentd-elasticsearch"} == 0 | 10s | 1015 | Fluentd runs as a daemonset, with one replica on each worker node. There is no easy way to trace a replica failure to a specific worker node, and the kube_pod_status_ready metric appears to keep reporting on failed pods from the past, which would lead to false alerts. The alert therefore fires only when all Fluentd replicas are down. A rule sketch for this alert appears after this table. |
GRAFANA_DOWN | Grafana is down | Grafana is not running or is otherwise unavailable | major | up{app="grafana"} == 0 | 30s | 1024 | |
JAEGER_DOWN | Jaeger is down | Jaeger collector is not running or is otherwise unavailable | critical | kube_replicaset_status_ready_replicas{replicaset=~"occne-tracer-jaeger-collector-.*"} == 0 | 10s | 1020 | Reporting on the Jaeger collector only. |
KIBANA_DOWN | Kibana is down | Kibana is not running or is otherwise unavailable | major | (kube_deployment_status_replicas_unavailable{deployment="occne-kibana"} == kube_deployment_status_replicas{deployment="occne-kibana"}) | 30s | 1023 | |
METALLB_CONTROLLER_DOWN | The MetalLB controller is down | The MetalLB controller is not running or is otherwise unavailable | critical | up{app="metallb", component="controller"} == 0 | 30s | 1022 | |
METALLB_SPEAKER_DOWN | A MetalLB speaker is down | The MetalLB speaker on worker node {{ $labels.instance }} is down | major | up{app="metallb", component="speaker"} == 0 | 10s | 1021 | The up metric does not directly identify the worker node that the MetalLB speaker was running on, but it does give the worker node IP. |
PROMETHEUS_DOWN | Prometheus is down | Prometheus is not running or is otherwise unavailable | critical | kube_deployment_status_replicas_available{deployment="occne-prometheus-server"} == 0 | 10s | 1017 | |
PROMETHEUS_NODE_EXPORTER_NOT_RUNNING | Prometheus Node Exporter is not running | Prometheus Node Exporter is not running on host {{ $labels.kubernetes_node }} | critical | up{app="prometheus-node-exporter"} == 0 | 1m | 1006 | |
SNMP_NOTIFIER_DOWN | SNMP Notifier is down | SNMP Notifier is not running or is otherwise unavailable | critical | kube_deployment_status_replicas_available{deployment="occne-snmp-notifier"} == 0 | 10s | 1019 | |
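The FLUENTD_NOT_AVAILABLE note above explains why the alert is keyed to the DaemonSet as a whole rather than to individual pods. A minimal sketch of that rule is shown below, assuming a hypothetical group name and a severity label; the expression and timings are taken from the table.

```yaml
groups:
  - name: occne_common_service_alerts   # hypothetical group name
    rules:
      - alert: FLUENTD_NOT_AVAILABLE
        # Fluentd is a DaemonSet (one replica per worker node). Per-node
        # tracking is unreliable, so the alert fires only when no replica
        # at all is ready.
        expr: kube_daemonset_status_number_ready{daemonset="occne-logs-fluentd-elasticsearch"} == 0
        for: 10s
        labels:
          severity: critical            # assumed severity label
        annotations:
          summary: "Fluentd is down"
          description: "Fluentd is not running or is otherwise unavailable"
```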
Node status alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
NODE_DOWN | MySQL {{ $labels.node_type }} node with node ID {{ $labels.node_id }} is down | major | db_tier_data_node_status == 0 | N/A | 2001 | A value of 0 indicates that the node is down; 1 indicates that the node is up. A rule sketch for this alert appears after this table. |
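A minimal sketch of the NODE_DOWN row as a Prometheus rule follows; the group name and severity label are illustrative assumptions.

```yaml
groups:
  - name: occne_db_tier_node_alerts   # hypothetical group name
    rules:
      - alert: NODE_DOWN
        # db_tier_data_node_status: 1 = node up, 0 = node down.
        expr: db_tier_data_node_status == 0
        labels:
          severity: major             # assumed severity label
        annotations:
          summary: "MySQL {{ $labels.node_type }} node with node ID {{ $labels.node_id }} is down"
```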
CPU alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
HIGH_CPU | Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} percent. | warning | (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) BY (node_id))) >= 85 | 1m | 2002 | Alerting on average CPU utilization over the prior 10 minutes, rather than requiring the CPU utilization for every reporting period over a 10 minute interval to be > 85%. A rule sketch for this alert appears after this table. |
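The note above describes alerting on the 10-minute average rather than on every individual sample. A minimal sketch of that rule is shown below, assuming a hypothetical group name and a severity label.

```yaml
groups:
  - name: occne_db_tier_cpu_alerts    # hypothetical group name
    rules:
      - alert: HIGH_CPU
        # 100 minus the 10-minute average idle percentage gives average
        # utilization; averaging avoids requiring every sample in the
        # window to exceed 85%.
        expr: (100 - (avg(avg_over_time(db_tier_cpu_os_idle[10m])) by (node_id))) >= 85
        for: 1m
        labels:
          severity: warning           # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} CPU utilization at {{ $value }} percent."
```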
Memory utilization alerts and alarms
Alert Name | Summary | Severity | Expression | For | SNMP Trap ID | Notes |
---|---|---|---|---|---|---|
LOW_MEMORY | Node ID {{ $labels.node_id }} memory utilization at {{ $value }} percent. | warning | (avg_over_time(db_tier_memory_used_bytes[10m]) BY (node_id, memory_type) / avg_over_time(db_tier_memory_total_bytes[10m]) BY (node_id, memory_type)) * 100 >= 85 | 1m | 2003 | Alerting on average memory utilization over the prior 10 minutes, rather than requiring the memory utilization for every reporting period over a 10 minute interval to be > 85%. |
OUT_OF_MEMORY | Node ID {{ $labels.node_id }} out of memory. | critical | (db_tier_memory_used_bytes) BY (node_id, memory_type) >= (db_tier_memory_total_bytes) BY (node_id, memory_type) | N/A | 2004 | Any OOM condition is alerted immediately; the condition does not need to persist for any length of time. A rule sketch for these alerts appears after this table. |
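A minimal sketch of the two memory rules follows; the group name and severity labels are illustrative assumptions. The BY clauses listed in the table are omitted because a PromQL binary operation already matches series on their shared labels (node_id, memory_type).

```yaml
groups:
  - name: occne_db_tier_memory_alerts  # hypothetical group name
    rules:
      - alert: LOW_MEMORY
        # 10-minute average used memory as a percentage of total memory,
        # matched per node_id and memory_type.
        expr: (avg_over_time(db_tier_memory_used_bytes[10m]) / avg_over_time(db_tier_memory_total_bytes[10m])) * 100 >= 85
        for: 1m
        labels:
          severity: warning            # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} memory utilization at {{ $value }} percent."
      - alert: OUT_OF_MEMORY
        # No "for" clause: an out-of-memory condition is alerted immediately.
        expr: db_tier_memory_used_bytes >= db_tier_memory_total_bytes
        labels:
          severity: critical           # assumed severity label
        annotations:
          summary: "Node ID {{ $labels.node_id }} out of memory."
```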