6 Alerts
Alerts are used to detect abnormal conditions in CNE and notify the user when any of the common services are not operating normally.
Each alert rule uses the values of one or more metrics stored in Prometheus to identify abnormal conditions. Prometheus periodically evaluates each rule to verify that CNE is operating normally. When a rule evaluation indicates an abnormal condition, Prometheus sends an alert to AlertManager. The resulting alert contains information about which part of the CNE cluster is affected, to aid troubleshooting. Each alert is assigned a severity level to inform the user of the seriousness of the alerting condition. This section provides details about CNE alerts.
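The following is a minimal sketch of how the currently firing alerts can be inspected from a host with kubectl access (typically a Bastion Host). The AlertManager service name and namespace are assumptions; adjust them to match your deployment.

```bash
# In one terminal, forward the AlertManager port locally
# (service name and namespace are assumptions; adjust to your deployment).
kubectl -n occne-infra port-forward svc/alertmanager 9093:9093

# In another terminal, query the AlertManager v2 API for the alerts that are currently firing.
curl -s http://localhost:9093/api/v2/alerts | \
  jq -r '.[] | "\(.labels.alertname)  severity=\(.labels.severity)  \(.annotations.summary)"'
```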
6.1 Kubernetes Alerts
This section provides details about Kubernetes alerts.
Table 6-1 DISK_SPACE_LOW
| Field | Details |
|---|---|
| Description | Cluster-name : {{ $externalLabels.cluster }} Disk space is almost RUNNING OUT for Kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}. Available space is {{ $value }}% (< 20% left). Instance = {{ $labels.instance }} |
| Summary | Cluster-name : {{ $externalLabels.cluster }} Disk space is RUNNING OUT on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }} |
| Cause | Disk space is running out on the node. More than 80% of the allocated disk space on the node is consumed. |
| Severity | Critical |
| SNMP Trap ID | 1001 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
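As a starting point for investigating the DISK_SPACE_LOW alert, the following sketch shows how to confirm usage on the affected partition and locate the largest consumers. The mount point and cleanup targets are examples; what can safely be removed depends on what is consuming the space.

```bash
# On the affected node, confirm overall usage of the reported partition
# (the mount point shown in the alert; /var is only an example here).
df -h /var

# Find the largest directories under the partition to decide what to clean up.
du -xh --max-depth=2 /var 2>/dev/null | sort -rh | head -20

# Old rotated journal logs are a common candidate for cleanup.
sudo journalctl --vacuum-size=500M
```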
Table 6-2 CPU_LOAD_HIGH
| Field | Details |
|---|---|
| Description | CPU load is high on host <node name>. CPU load: {{ $value }}%. Instance: {{ $labels.instance }} |
| Summary | CPU load is high on host {{ $labels.kubernetes_node }} |
| Cause | CPU load is more than 80% of the allocated resources on the node. |
| Severity | Major |
| SNMP Trap ID | 1002 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
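A sketch of the first checks for the CPU_LOAD_HIGH alert, assuming a metrics provider such as metrics-server is available to kubectl; the node name is a placeholder.

```bash
# Confirm overall CPU usage on the node reported in the alert.
kubectl top node <node-name>

# Identify the pods consuming the most CPU across the cluster.
kubectl top pods -A --sort-by=cpu | head -20
```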
Table 6-3 LOW_MEMORY
| Field | Details |
|---|---|
| Description | Node {{ $labels.kubernetes_node }} available memory at {{ $value | humanize }} percent. |
| Summary | Node {{ $labels.kubernetes_node }} running out of memory |
| Cause | More than 80% of the allocated memory on the node is consumed. |
| Severity | Major |
| SNMP Trap ID | 1007 |
| Affects Service (Y/N) | N |
| Recommended Actions | See the example commands after this table. |
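A sketch of the first checks for the LOW_MEMORY alert; the same checks apply to the OUT_OF_MEMORY alert (Table 6-4). The node name is a placeholder, and the kubectl check assumes a metrics provider is available.

```bash
# Confirm memory pressure on the node reported in the alert.
kubectl top node <node-name>

# On the node itself, check remaining memory and the processes using the most of it.
free -h
ps -eo pid,comm,rss --sort=-rss | head -15
```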
Table 6-4 OUT_OF_MEMORY
| Field | Details |
|---|---|
| Description | Node {{ $labels.kubernetes_node }} out of memory |
| Summary | Node {{ $labels.kubernetes_node }} out of memory |
| Cause | More than 90% of the allocated memory on the node is consumed. |
| Severity | Critical |
| SNMP Trap ID | 1008 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-5 NTP_SANITY_CHECK_FAILED
| Field | Details |
|---|---|
| Description | NTP service sanity check failed on node {{ $labels.kubernetes_node }} |
| Summary | Clock is not synchronized on node {{ $labels.kubernetes_node }} |
| Cause | Clock is not synchronized on the node. |
| Severity | Minor |
| SNMP Trap ID | 1009 |
| Affects Service (Y/N) | N |
| Recommended Actions | Synchronize chronyd on the node. The steps are outlined after this table. |
Table 6-6 NETWORK_INTERFACE_FAILED
| Field | Details |
|---|---|
| Description | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
| Summary | Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable. |
| Cause | Network interface is unavailable on the node. |
| Severity | Critical |
| SNMP Trap ID | 1010 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-7 PVC_NEARLY_FULL
| Field | Details |
|---|---|
| Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
| Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full. |
| Cause | PVC storage is filled to 80% of allocated space. |
| Severity | Major |
| SNMP Trap ID | 1011 |
| Affects Service (Y/N) | N |
| Recommended Actions | Use the relevant procedure to increase the size of the PVC. Example commands follow this table. |
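A sketch of how the PVC size can be inspected and, if the storage class supports volume expansion, increased. The PVC name, namespace, storage class, and target size are placeholders.

```bash
# Check the current size and storage class of the nearly full PVC.
kubectl -n <namespace> get pvc <pvc-name>

# Verify that the storage class allows expansion before patching.
kubectl get storageclass <storage-class> -o jsonpath='{.allowVolumeExpansion}'

# Request a larger size; the volume is resized only if expansion is allowed.
kubectl -n <namespace> patch pvc <pvc-name> \
  -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
```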
Table 6-8 PVC_FULL
| Field | Details |
|---|---|
| Description | Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. |
| Summary | Persistent volume claim {{ $labels.persistentvolumeclaim }} is full. |
| Cause | PVC storage is filled to 90% of the allocated space. |
| Severity | Critical |
| SNMP Trap ID | 1012 |
| Affects Service (Y/N) | Y |
| Recommended Actions | NA |
Table 6-9 NODE_UNAVAILABLE
| Field | Details |
|---|---|
| Description | Kubernetes node {{ $labels.kubernetes_node }} is not in Ready state. |
| Summary | Kubernetes node {{ $labels.kubernetes_node }} is unavailable. |
| Cause | Node is not in ready state. |
| Severity | Critical |
| SNMP Trap ID | 1013 |
| Affects Service (Y/N) | Y |
| Recommended Actions | First, check whether the given node is in the running or shutoff state. If the node is in the shutoff state, try restarting it from OpenStack or iLO. If the node is in the running state, perform the checks outlined after this table. |
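If the node is powered on but NotReady, the following sketch shows typical first checks; the node name is a placeholder.

```bash
# Confirm the node state and look at its recent conditions and events.
kubectl get nodes
kubectl describe node <node-name>

# On the node itself, check the kubelet and restart it if needed.
systemctl status kubelet
sudo systemctl restart kubelet
```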
Table 6-10 ETCD_NODE_DOWN
| Field | Details |
|---|---|
| Description | Etcd is not running or is unavailable. |
| Summary | Etcd is down. |
| Cause | Etcd is not running. |
| Severity | Critical |
| SNMP Trap ID | 1014 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Refer to the etcd restore documentation to restore the failed etcd. A basic health check is outlined after this table. |
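A sketch of a basic etcd health check from a controller node. The endpoint and certificate paths are assumptions and depend on how etcd is deployed in your cluster.

```bash
# Check the health of the local etcd member (certificate paths are assumed
# defaults and may differ in your deployment).
sudo etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/server.crt \
  --key=/etc/etcd/server.key

# If etcd runs under systemd on the controller nodes, check its service state.
systemctl status etcd
```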
Table 6-11 APISERVER_CERTIFICATE_EXPIRATION_90D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate is expiring in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates soon. |
| Summary | The Kubernetes API Server client certificate expires in less than 90 days. |
| Cause | The cluster has not been upgraded in the last 275 days and the certificate expires in 90 days. |
| Severity | Warning |
| SNMP Trap ID | 1033 |
| Affects Service (Y/N) | N |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-12 APISERVER_CERTIFICATE_EXPIRATION_30D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate expires in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates soon. |
| Summary | The Kubernetes API server client certificate expires in less than 30 days. |
| Cause | The cluster has not been upgraded in the last 335 days and the certificate expires in 30 days. |
| Severity | Major |
| SNMP Trap ID | 1034 |
| Affects Service (Y/N) | N |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-13 APISERVER_CERTIFICATE_EXPIRATION_7D
| Field | Details |
|---|---|
| Description | The Kubernetes API server client certificate will expire in {{ $value }} days. Upgrade CNE or renew your Kubernetes client certificates immediately. |
| Summary | The Kubernetes API Server client certificate expires in less than 7 days. |
| Cause | The cluster has not been upgraded in the last 358 days and the certificate expires in 7 days. |
| Severity | Critical |
| SNMP Trap ID | 1035 |
| Affects Service (Y/N) | Y |
| Recommended Actions | See Renewing Kubernetes Certificates section to resolve this alert. |
Table 6-14 CEPH_OSD_NEARLY_FULL
| Field | Details |
|---|---|
| Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 75% on host {{ $labels.hostname }}. |
| Summary | OSD storage device is nearly full. |
| Cause | OSD storage device is 75% full. |
| Severity | Major |
| SNMP Trap ID | 1036 |
| Affects Service (Y/N) | N |
| Recommended Actions | Contact Oracle support. |
Table 6-15 CEPH_OSD_FULL
| Field | Details |
|---|---|
| Description | Utilization of storage device {{ $labels.ceph_daemon }} has crossed 80% on host {{ $labels.hostname }}. |
| Summary | OSD storage device is critically full. |
| Cause | OSD storage device is 80% full. |
| Severity | Critical |
| SNMP Trap ID | 1037 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-16 CEPH_OSD_DOWN
| Field | Details |
|---|---|
| Description | Storage node {{ $labels.ceph_daemon }} is down. |
| Summary | Storage node {{ $labels.ceph_daemon }} is down. |
| Cause | Ceph OSD is down. |
| Severity | Major |
| SNMP Trap ID | 1038 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-17 VSPHERE_CSI_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The vSphere CSI controller process failed. |
| Summary | The vSphere CSI controller process failed. |
| Cause | The vsphere_csi_controller process is down. |
| Severity | Critical |
| SNMP Trap ID | 1042 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
6.2 Common Services Alerts
This section provides details about common services alerts.
Table 6-18 OPENSEARCH_CLUSTER_HEALTH_RED
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} All the primary and replica shards are not allocated in Oracle OpenSearch cluster {{ $labels.cluster }} for instance {{ $labels.instance }} |
| Summary | Cluster Name : {{ $externalLabels.cluster }} Both primary and replica shards are not available. |
| Cause | Some or all of the primary shards are not ready. |
| Severity | Critical |
| SNMP Trap ID | 1043 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Check the index for which the primary and replica shards cannot be created, and check for indices that are in the yellow or red state. Remove them using the procedure outlined after this table. If this does not resolve the issue, clean up all the indices to restore OpenSearch to the GREEN state. |
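A sketch of how the problematic indices can be identified and removed through the OpenSearch REST API. The host, port, and credentials are placeholders, and deleting an index permanently removes its data.

```bash
# Check the overall cluster health and list indices in the red or yellow state.
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cluster/health?pretty"
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=red&v"
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=yellow&v"

# Delete an index whose shards cannot be allocated (this permanently removes its data).
curl -sk -u <user>:<password> -X DELETE "https://<opensearch-host>:9200/<index-name>"
```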
Table 6-19 OPENSEARCH_CLUSTER_HEALTH_YELLOW
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} The primary shard has been allocated in {{ $labels.cluster }} for instance {{ $labels.instance }}, but replicas for the shard could not be allocated. |
| Summary | Cluster Name : {{ $externalLabels.cluster }} The primary shard is allocated but replicas are not. |
| Cause | Indicates that OpenSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated. This issue is observed in some cases after a node restart or shutdown. |
| Severity | Major |
| SNMP Trap ID | 1044 |
| Affects Service (Y/N) | N |
| Recommended Actions | Yellow alarms are often observed after a node shutdown or restart. In most cases, Oracle OpenSearch recovers on its own. If it does not, remove the replicas from the problematic index whose replicas cannot be allocated, as shown after this table. |
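If the cluster does not recover on its own, the replicas of the problematic index can be removed as shown in this sketch; the host, credentials, and index name are placeholders.

```bash
# Find the indices whose replicas are unassigned.
curl -sk -u <user>:<password> "https://<opensearch-host>:9200/_cat/indices?health=yellow&v"

# Set the replica count of the problematic index to zero so the cluster returns to green.
curl -sk -u <user>:<password> -X PUT \
  "https://<opensearch-host>:9200/<index-name>/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'
```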
Table 6-20 OPENSEARCH_TOO_FEW_DATA_NODES_RUNNING
| Field | Details |
|---|---|
| Description | Cluster Name : {{ $externalLabels.cluster }} There are only {{ $value }} OpenSearch data nodes running in the {{ $labels.cluster }} cluster. |
| Summary | Cluster Name : {{ $externalLabels.cluster }} The {{ $labels.cluster }} cluster is running with fewer than the total number of data nodes. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1045 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-21 PROMETHEUS_NODE_EXPORTER_NOT_RUNNING
| Field | Details |
|---|---|
| Description | Prometheus Node Exporter is NOT running on host {{ $labels.kubernetes_node }}. |
| Summary | Prometheus Node Exporter is NOT running. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1006 |
| Affects Service (Y/N) | Y |
| Recommended Actions | See the pod diagnosis example after this table. |
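The same basic pod diagnosis applies to this alert and to the other alerts in this section whose cause is a repeatedly crashing pod. The namespace and pod selector below are assumptions.

```bash
# Locate the failing pod and inspect why it is not running.
kubectl -n occne-infra get pods | grep -i node-exporter
kubectl -n occne-infra describe pod <pod-name>

# Check the container logs, including those of the previously crashed container.
kubectl -n occne-infra logs <pod-name>
kubectl -n occne-infra logs <pod-name> --previous
```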
Table 6-22 FLUENTD_OPENSEARCH_NOT_AVAILABLE
| Field | Details |
|---|---|
| Description | Fluentd-OpenSearch is not running or is otherwise unavailable. |
| Summary | Fluentd-OpenSearch is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1050 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-23 OPENSEARCH_DOWN
| Field | Details |
|---|---|
| Description | OpenSearch is not running or is otherwise unavailable. |
| Summary | OpenSearch is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1047 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-24 OPENSEARCH_DASHBOARD_DOWN
| Field | Details |
|---|---|
| Description | OpenSearch dashboard is not running or is otherwise unavailable. |
| Summary | OpenSearch dashboard is down. |
| Cause | |
| Severity | Major |
| SNMP Trap ID | 1049 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-25 PROMETHEUS_DOWN
| Field | Details |
|---|---|
| Description | All Prometheus instances are down. No metrics will be collected until at least one Prometheus instance is restored. |
| Summary | Metrics collection is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1017 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-26 ALERT_MANAGER_DOWN
| Field | Details |
|---|---|
| Description | All alert manager instances are down. No alerts will be received until at least one alert manager instance is restored. |
| Summary | Alert notification is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1018 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-27 SNMP_NOTIFIER_DOWN
| Field | Details |
|---|---|
| Description | SNMP Notifier is not running or is unavailable. |
| Summary | SNMP Notifier is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1019 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-28 JAEGER_DOWN
| Field | Details |
|---|---|
| Description | Jaeger collector is not running or is unavailable. |
| Summary | Jaeger is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1020 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-29 METALLB_SPEAKER_DOWN
| Field | Details |
|---|---|
| Description | The MetalLB speaker on worker node {{ $labels.instance }} is down. |
| Summary | A MetalLB speaker is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1021 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-30 METALLB_CONTROLLER_DOWN
| Field | Details |
|---|---|
| Description | The MetalLB controller is not running or is unavailable. |
| Summary | The MetalLB controller is down. |
| Cause | |
| Severity | Critical |
| SNMP Trap ID | 1022 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-31 GRAFANA_DOWN
| Field | Details |
|---|---|
| Description | Grafana is not running or is unavailable. |
| Summary | Grafana is down. |
| Cause | |
| Severity | Major |
| SNMP Trap ID | 1024 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-32 LOAD_BALANCER_NO_HA
| Field | Details |
|---|---|
| Description | A single load balancer serving the {{ $labels.external_network }} network has failed. Load balancing will continue to operate in simplex mode. |
| Summary | A load balancer for the {{ $labels.external_network }} network is down. |
| Cause | One of the LBVMs is down. |
| Severity | Major |
| SNMP Trap ID | 1025 |
| Affects Service (Y/N) | N |
| Recommended Actions | Replace the failed LBVM. See the procedure for replacing a failed LBVM. |
Table 6-33 LOAD_BALANCER_NO_SERVICE
| Field | Details |
|---|---|
| Description | All Load Balancers serving the {{ $labels.external_network }} network have failed. External access for all services on this network is unavailable. |
| Summary | Load balancing for the {{ $labels.external_network }} network is unavailable. |
| Cause | Both LBVMs are down; as a result, the external network is down. |
| Severity | Critical |
| SNMP Trap ID | 1026 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Replace one LBVM, wait for lb_monitor to convert it from the STANDBY to the ACTIVE state (run lb_monitor.py manually if needed), and then replace the other LBVM. See the procedure for replacing a failed LBVM. |
Table 6-34 LOAD_BALANCER_FAILED
| Field | Details |
|---|---|
| Description | Load balancer {{ $labels.name }} at IP {{ $labels.ip_address }} on the {{ $labels.external_network }} network has failed. Perform the load balancer recovery procedure to restore it. |
| Summary | A load balancer failed. |
| Cause | One or both of the LBVMs are down. |
| Severity | Major |
| SNMP Trap ID | 1027 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Although this alert is not always service affecting, the load balancer must be restored to restore high availability for load balancing. Replace one or both of the LBVMs. See the procedure for replacing a failed LBVM. |
Table 6-35 PROMETHEUS_NO_HA
| Field | Details |
|---|---|
| Description | A Prometheus instance has failed. Metrics collection will continue to operate in simplex mode. |
| Summary | A Prometheus instance is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1028 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-36 ALERT_MANAGER_NO_HA
| Field | Details |
|---|---|
| Description | An AlertManager instance has failed. Alert management will continue to operate in simplex mode. |
| Summary | An AlertManager instance is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1029 |
| Affects Service (Y/N) | N |
| Recommended Actions | |
Table 6-37 PROMXY_METRICS_AGGREGATOR_DOWN
| Field | Details |
|---|---|
| Description | Promxy failed. Metrics will be retrieved from a single Prometheus instance only. |
| Summary | Promxy is down. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1032 |
| Affects Service (Y/N) | Y |
| Recommended Actions | As metrics are retrieved from a single Prometheus instance, there may be gaps in the retrieved data. Promxy must be restarted to restore the full data retrieval capabilities. |
Table 6-38 VCNE_LB_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The vCNE LB Controller process failed. |
| Summary | The vCNE LB Controller process failed. |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Major |
| SNMP Trap ID | 1039 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-39 VMWARE_CSI_CONTROLLER_FAILED
| Field | Details |
|---|---|
| Description | The VMware CSI Controller process failed. |
| Summary | The VMware CSI Controller process failed. |
| Cause | The CSI Controller process failed. Note: This alert is raised only when CNE is installed on a VMware infrastructure. |
| Severity | Critical |
| SNMP Trap ID | 1042 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |
Table 6-40 EGRESS_CONTROLLER_NOT_AVAILABLE
| Field | Details |
|---|---|
| Description | Egress controller is not running or is unavailable. |
| Summary | Egress controller is down |
| Cause | Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod. |
| Severity | Critical |
| SNMP Trap ID | 1048 |
| Affects Service (Y/N) | Y |
| Recommended Actions | |
Table 6-41 OPENSEARCH_DATA_PVC_NEARLY_FULL
| Field | Details |
|---|---|
| Description | OpenSearch data volume {{ $persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. Once full, the OpenSearch cluster starts throwing index_block_exceptions. Either increase the OpenSearch data PVC size or remove unnecessary indices. |
| Summary | OpenSearch Data Volume is nearly full. |
| Cause | OpenSearch data PVCs are nearly full. |
| Severity | Major |
| SNMP Trap ID | 1051 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Perform one of the following: remove unnecessary indices or increase the size of the OpenSearch data PVC. Example commands follow this table. |
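A sketch of how to see which indices consume the most space before deciding whether to remove old indices or expand the data PVC. The OpenSearch host, credentials, and namespace are placeholders.

```bash
# List indices sorted by on-disk size to find candidates for removal.
curl -sk -u <user>:<password> \
  "https://<opensearch-host>:9200/_cat/indices?v&s=store.size:desc" | head -20

# Alternatively, check the current size of the OpenSearch data PVCs before expanding them.
kubectl -n occne-infra get pvc | grep -i opensearch
```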
6.3 Bastion Host Alerts
This section provides details about Bastion Host alerts.
Table 6-42 BASTION_HOST_FAILED
| Field | Details |
|---|---|
| Description | Bastion Host {{ $labels.name }} at IP address {{ $labels.ip_address }} is unavailable. |
| Summary | Bastion Host {{ $labels.name }} is unavailable. |
| Cause | One of the Bastion Hosts failed to respond to liveness tests. |
| Severity | Major |
| SNMP Trap ID | 1040 |
| Affects Service (Y/N) | N |
| Recommended Actions | Contact Oracle support. |
Table 6-43 ALL_BASTION_HOSTS_FAILED
| Field | Details |
|---|---|
| Description | All Bastion Hosts are unavailable. |
| Summary | All Bastion Hosts are unavailable. |
| Cause | All Bastion Hosts fail to respond to liveness tests. |
| Severity | Critical |
| SNMP Trap ID | 1041 |
| Affects Service (Y/N) | Y |
| Recommended Actions | Contact Oracle support. |