6 Alerts

Alerts are used to detect abnormal conditions in CNE and notify the user when any of the common services are not operating normally.

Each alert rule uses the values of one or more metrics stored in Prometheus to identify abnormal conditions. Prometheus periodically evaluates each rule to ensure that CNE is operating normally. When rule evaluation indicates an abnormal condition, Prometheus sends an alert to the AlertManager. The resulting alert contains information about which part of the CNE cluster is affected, to aid troubleshooting. Each alert is assigned a severity level to inform the user of the seriousness of the alerting condition. This section provides details about CNE alerts.

6.1 Kubernetes Alerts

This section provides details about Kubernetes alerts.

Table 6-1 DISK_SPACE_LOW

Field Details
Description Cluster-name : {{ $externalLabels.cluster }} Disk space is almost RUNNING OUT for kubernetes node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}. Available space is {{$value }}%(< 20% left). Instance = {{ $labels.instance }}
Summary Cluster-name : {{ $externalLabels.cluster }} Disk space is RUNNING OUT on node {{ $labels.kubernetes_node }} for partition {{ $labels.mountpoint }}
Cause Disk space is running out on the node. More than 80% of the allocated disk space on the node is consumed.
Severity Critical
SNMP Trap ID 1001
Affects Service (Y/N) N
Recommended Actions
  • In case of vCNE, the worker nodes can be resized to a larger flavor with more storage.
  • Additional space can also be reclaimed by running "podman system prune -fa" to remove any unreferenced image layers.
  • Verify how much space is consumed by the /var/log partition. If it is consuming a lot of space, logs can be rotated or shrunk to reclaim space (see the example below).
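
The following is a minimal sketch of commands that can be run on the affected worker node to identify what is consuming space and to reclaim unreferenced image layers; the /var/log path is only an example of a common consumer.

  # Check disk usage per mount point
  df -h
  # Identify the largest consumers under /var/log (illustrative path)
  sudo du -sh /var/log/* | sort -h | tail
  # Remove unreferenced container image layers
  sudo podman system prune -fa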

Table 6-2 CPU_LOAD_HIGH

Field Details
Description CPU load is high on host <node name>. CPU load: {{ $value }}%. Instance: {{ $labels.instance }}
Summary CPU load is high on host {{ $labels.kubernetes_node }}
Cause CPU load is more than 80% of the allocated resources on the node.
Severity Major
SNMP Trap ID 1002
Affects Service (Y/N) N
Recommended Actions
  • In case of vCNE, the worker nodes can be resized to a larger flavor with more vCPUs.
  • Manually evict unnecessary pods from the node with high CPU load to reduce load on the node.
  • Draining and then uncordoning the node also helps rebalance the CPU load across worker nodes (see the example below).
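
A minimal sketch of the drain and uncordon sequence, where <node-name> is a placeholder for the overloaded node; the flags shown are those commonly needed to evict pods managed by DaemonSets and pods that use emptyDir volumes.

  # Cordon the node and evict its pods so that they reschedule on other workers
  kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
  # Allow the node to accept new pods again once the load subsides
  kubectl uncordon <node-name>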

Table 6-3 LOW_MEMORY

Field Details
Description Node {{ $labels.kubernetes_node }} available memory at {{ $value | humanize }} percent.
Summary Node {{ $labels.kubernetes_node }} running out of memory
Cause More than 80% of the node's allocated memory is consumed.
Severity Major
SNMP Trap ID 1007
Affects Service (Y/N) N
Recommended Actions
  • In case of vCNE, the worker nodes can be resized to a larger flavor with more RAM.
  • Manually evict unnecessary pods from the node with high memory load to reduce load on the node (see the sketch below for identifying memory-heavy pods).
  • Draining and then uncordoning the node also helps rebalance the load across worker nodes.
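
A sketch for locating memory-heavy pods on the affected node; <node-name> is a placeholder, and the kubectl top command assumes that the Kubernetes metrics API is available in the cluster.

  # List the pods scheduled on the affected node
  kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
  # Rank pods by memory usage (requires the metrics API)
  kubectl top pods --all-namespaces --sort-by=memory | head -20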

Table 6-4 OUT_OF_MEMORY

Field Details
Description Node {{ $labels.kubernetes_node }} out of memory
Summary Node {{ $labels.kubernetes_node }} out of memory
Cause More than 90% of the node's allocated memory is consumed.
Severity Critical
SNMP Trap ID 1008
Affects Service (Y/N) Y
Recommended Actions
  • In case of vCNE, the worker nodes can be resized to a larger flavor with more RAM.
  • Manually evict unnecessary pods from the node with high memory load to reduce load on that node.

Table 6-5 NTP_SANITY_CHECK_FAILED

Field Details
Description NTP service sanity check failed on node {{ $labels.kubernetes_node }}
Summary Clock is not synchronized on node {{ $labels.kubernetes_node }}
Cause Clock is not synchronized on the node.
Severity Minor
SNMP Trap ID 1009
Affects Service (Y/N) N
Recommended Actions

Steps to synchronize chronyd on node:

  1. Log in to the node on which you want to synchronize the clock.
  2. Run the following command: sudo su
  3. Run the following command: systemctl restart chronyd
  4. Monitor the synchronization status by running: chronyc tracking
  5. Run the following command: sudo reboot
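
The following is a consolidated sketch of the same steps with an indication of what to check; chronyc tracking and chronyc sources are standard chrony utilities available on the node.

  # Restart the time service and confirm that the clock converges
  sudo systemctl restart chronyd
  chronyc tracking       # the "System time" offset should trend toward zero
  chronyc sources -v     # verify that at least one NTP source is reachable and selected (*)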

Table 6-6 NETWORK_INTERFACE_FAILED

Field Details
Description Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable.
Summary Network interface {{ $labels.device }} on node {{ $labels.kubernetes_node }} is unavailable.
Cause Network interface is unavailable on the node.
Severity Critical
SNMP Trap ID 1010
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

Table 6-7 PVC_NEARLY_FULL

Field Details
Description Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining.
Summary Persistent volume claim {{ $labels.persistentvolumeclaim }} is nearly full.
Cause PVC storage is filled to 80% of allocated space.
Severity Major
SNMP Trap ID 1011
Affects Service (Y/N) N
Recommended Actions
  1. Manually clean up the PVC data.
    For Prometheus:
    a. Identify the Prometheus pods and their PVCs:
      $ kubectl get pods -n occne-infra -o wide
      $ kubectl get pvc -n occne-infra
    b. Log in to the node where Prometheus is deployed and locate the mount path of the PVC identified in the previous step:
      $ sudo su; lsblk
    c. Change into the PVC mount path and remove the Prometheus data:
      $ cd prometheus-db/
      $ rm -rf *
    For Oracle OpenSearch:
    a. Identify the OpenSearch data and master pods and their PVCs:
      $ kubectl get pods -n occne-infra -o wide
      $ kubectl get pvc -n occne-infra
    b. Log in to the node where the OpenSearch data or master pod is deployed and locate the mount path of the PVC identified in the previous step:
      $ sudo su; lsblk
    c. Change into the PVC mount path and remove the OpenSearch data:
      $ cd nodes/0
      $ rm -rf *

2. Use the following procedures to increase the size of PVC.

Table 6-8 PVC_FULL

Field Details
Description Persistent volume claim {{ $labels.persistentvolumeclaim }} has {{ $value }}% of allocated space remaining.
Summary Persistent volume claim {{ $labels.persistentvolumeclaim }} is full.
Cause PVC storage is filled to 90% of the allocated space.
Severity Critical
SNMP Trap ID 1012
Affects Service (Y/N) Y
Recommended Actions NA

Table 6-9 NODE_UNAVAILABLE

Field Details
Description Kubernetes node {{ $labels.kubernetes_node }} is not in Ready state.
Summary Kubernetes node {{ $labels.kubernetes_node }} is unavailable.
Cause Node is not in ready state.
Severity Critical
SNMP Trap ID 1013
Affects Service (Y/N) Y
Recommended Actions First, check if the given node is in running or shutoff state.

If the node is in the shutoff state, try restarting it from OpenStack or iLO.

If the node is in running state, then perform the following steps:
  1. Log in to the node.
  2. Check the kubelet status and restart it if it is not running (see the example below).
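
A minimal sketch of the kubelet check on the affected node; the journalctl line is only for reviewing recent kubelet errors.

  # Check whether kubelet is active, restart it if needed, and review recent errors
  sudo systemctl status kubelet
  sudo systemctl restart kubelet
  sudo journalctl -u kubelet --since "15 minutes ago" | tail -50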

Table 6-10 ETCD_NODE_DOWN

Field Details
Description Etcd is not running or is unavailable.
Summary Etcd is down.
Cause Etcd is not running.
Severity Critical
SNMP Trap ID 1014
Affects Service (Y/N) Y
Recommended Actions

Refer to the following document to restore the failed etcd:

https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/#restoring-an-etcd-cluster
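
Before restoring, it can help to confirm the state of the local etcd member from a controller node. The following is a sketch only; the etcd service name, endpoint, and certificate paths depend on how etcd is deployed in the cluster, so the certificate paths below are placeholders.

  # Check the etcd service and member health on a controller node
  sudo systemctl status etcd
  sudo ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=<path-to-etcd-ca.crt> --cert=<path-to-etcd-client.crt> --key=<path-to-etcd-client.key>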

Table 6-11 CEPH_OSD_NEARLY_FULL

Field Details
Description Utilization of storage device {{ $labels.ceph_daemon }} has crossed 75% on host {{ $labels.hostname }}.
Summary OSD storage device is nearly full.
Cause OSD storage device is 75% full.
Severity Major
SNMP Trap ID 1036
Affects Service (Y/N) N
Recommended Actions Contact Oracle support.

Table 6-12 CEPH_OSD_FULL

Field Details
Description Utilization of storage device {{ $labels.ceph_daemon }} has crossed 80% on host {{ $labels.hostname }}.
Summary OSD storage device is critically full.
Cause OSD storage device is 80% full.
Severity Critical
SNMP Trap ID 1037
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

Table 6-13 CEPH_OSD_DOWN

Field Details
Description Storage node {{ $labels.ceph_daemon }} is down.
Summary Storage node {{ $labels.ceph_daemon }} is down.
Cause Ceph OSD is down.
Severity Major
SNMP Trap ID 1038
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

Table 6-14 VSPHERE_CSI_CONTROLLER_FAILED

Field Details
Description The vSphere CSI controller process failed.
Summary The vSphere CSI controller process failed.
Cause Vsphere_csi_controller is down.
Severity Critical
SNMP Trap ID 1042
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

6.2 Common Services Alerts

This section provides details about common services alerts.

Table 6-15 OPENSEARCH_CLUSTER_HEALTH_RED

Field Details
Description Cluster Name : {{ $externalLabels.cluster }} All the primary and replica shards are not allocated in Oracle OpenSearch cluster {{ $labels.cluster }} for instance {{ $labels.instance }}
Summary Cluster Name : {{ $externalLabels.cluster }} Both primary and replica shards are not available.
Cause Some or all of the primary shards are not ready.
Severity Critical
SNMP Trap ID 1043
Affects Service (Y/N) Y
Recommended Actions
Check which indices have unassigned primary or replica shards, that is, the indices in yellow or red state, and remove them by using the following procedure:
  1. Run the following command on Bastion Host to check if any indices are in yellow or red state:
    kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl localhost:9200/_cat/indices
    
  2. Run the following command to delete the indices in yellow or red state:
    kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl localhost:9200/_cat/indices | grep 'yellow\|red' | awk '{ print $3 }' | xargs -I{} kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl -XDELETE localhost:9200/{}
    
  3. Run the following command to verify if the indices with yellow or red state are deleted:
    kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl localhost:9200/_cat/indices
    
  4. Restart the OpenSearch cluster in the following sequence: Master → Data → Client.

If this procedure does not resolve the issue, then clean up all the indices to restore the OpenSearch cluster to the GREEN state.
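
To confirm the overall cluster state before and after the cleanup, the cluster health API can be queried through the client pod, for example:

  # The status field should return to "green"
  kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl 'localhost:9200/_cluster/health?pretty'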

Table 6-16 OPENSEARCH_CLUSTER_HEALTH_YELLOW

Field Details
Description Cluster Name : {{ $externalLabels.cluster }} The primary shard has been allocated in {{ $labels.cluster }} for Instance {{ $labels.instance }} but replicas for the shard could not be allocated.
Summary Cluster Name : {{ $externalLabels.cluster }} The primary shard is allocated but replicas are not.
Cause Indicates that OpenSearch has allocated all of the primary shards, but some or all of the replicas have not been allocated. This issue is observed in some cases after a node restart or shutdown.
Severity Major
SNMP Trap ID 1044
Affects Service (Y/N) N
Recommended Actions

Yellow alarms are often observed after a node shutdown or restart. In most cases, Oracle OpenSearch recovers on its own. If it does not, perform the following procedure to remove the replicas from the problematic index whose replicas cannot be allocated:

PUT /logstash-2021.08.21/_settings
{
  "index" : {
    "number_of_replicas" : 0
  }
}
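
The same setting can be applied from the Bastion Host through the OpenSearch client pod; the index name below is an example and must be replaced with the problematic index:

  kubectl -n occne-infra exec -it occne-opensearch-client-0 -- \
    curl -XPUT -H 'Content-Type: application/json' \
    'localhost:9200/logstash-2021.08.21/_settings' \
    -d '{"index":{"number_of_replicas":0}}'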

Table 6-17 OPENSEARCH_TOO_FEW_DATA_NODES_RUNNING

Field Details
Description Cluster Name : {{ $externalLabels.cluster }} There are only {{$value}} OpenSearch data nodes running in {{ $labels.cluster }} cluster.
Summary Cluster Name : {{ $externalLabels.cluster }} {{ $labels.cluster }} cluster running on less than total number of data nodes.
Cause
  1. Data nodes have either crashed or are in the 0/1 state due to insufficient space in the PVC.
  2. PVC is full.
Severity Critical
SNMP Trap ID 1045
Affects Service (Y/N) N
Recommended Actions
  • Check the running status of the OpenSearch data and master pods (see the example below).
  • If any of the OpenSearch pods is in the "not ready" state but running, check whether its associated PVC is full.
  • If the PVC is not full, then contact Oracle support.
  • Ensure that the count of the data nodes is equal to the value of the opensearch_data_replicas_count variable (for example, 5). When more data nodes are added, the alert must be updated accordingly.
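
A sketch of the status check, filtering the infrastructure namespace by name pattern:

  # Check the OpenSearch pods and their PVCs
  kubectl -n occne-infra get pods | grep opensearch
  kubectl -n occne-infra get pvc | grep opensearch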

Table 6-18 PROMETHEUS_NODE_EXPORTER_NOT_RUNNING

Field Details
Description Prometheus Node Exporter is NOT running on host {{ $labels.kubernetes_node }}.
Summary Prometheus Node Exporter is NOT running.
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Critical
SNMP Trap ID 1006
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Node Exporter daemonset: search for the resources section and increase the CPU or RAM accordingly (see the example resources section below).
    $ kubectl edit ds occne-kube-prom-stack-prometheus-node-exporter -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.
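
Many recommendations in this section refer to the resources section of a workload. The following is an illustrative fragment of what that section typically looks like in the manifest opened by kubectl edit; the CPU and memory values are examples only and must be tuned to the deployment.

  # Example resources section inside a container spec (values are illustrative)
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi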

Table 6-19 FLUENTD_OPENSEARCH_NOT_AVAILABLE

Field Details
Description Fluentd-OpenSearch is not running or is otherwise unavailable.
Summary Fluentd-OpenSearch is down.
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Critical
SNMP Trap ID 1050
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Fluentd-OpenSearch daemonset: search for the resources section and increase the CPU or RAM accordingly (see the diagnostic sketch after this list to confirm the cause).
    $ kubectl edit ds occne-fluentd-opensearch -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.
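
To confirm whether resource pressure or an image issue is the cause, the failing pod's status and previous logs can be inspected; the pod name below is a placeholder:

  # Inspect why the Fluentd-OpenSearch pods are failing
  kubectl -n occne-infra get pods | grep fluentd
  kubectl -n occne-infra describe pod <fluentd-pod-name>
  kubectl -n occne-infra logs <fluentd-pod-name> --previous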

Table 6-20 OPENSEARCH_DOWN

Field Details
Description OpenSearch is not running or is otherwise unavailable.
Summary OpenSearch is down.
Cause
  1. Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory or CPU, or issues with the image used by the pod.
  2. OpenSearch cluster is unavailable.
Severity Critical
SNMP Trap ID 1047
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the "occne-opensearch-cluster" statefulset. Search for the resources section in the statefulset and increase the CPU or RAM accordingly (see the example below).
  2. If resource utilization is not the issue, then contact Oracle support.
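
A sketch for locating and editing the OpenSearch statefulsets; the exact statefulset names can vary per deployment, so list them first and substitute the affected name:

  # List the OpenSearch statefulsets and edit the affected one
  kubectl -n occne-infra get statefulsets | grep opensearch
  kubectl -n occne-infra edit statefulset <statefulset-name>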

Table 6-21 OPENSEARCH_DASHBOARD_DOWN

Field Details
Description OpenSearch dashboard is not running or is otherwise unavailable.
Summary OpenSearch dashboard is down.
Cause
  1. Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
  2. OpenSearch dashboard is unavailable.
Severity Major
SNMP Trap ID 1049
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

Table 6-22 PROMETHEUS_DOWN

Field Details
Description All Prometheus instances are down. No metrics will be collected until at least one Prometheus instance is restored.
Summary Metrics collection is down.
Cause
  1. Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
  2. PVC is full.
Severity Critical
SNMP Trap ID 1017
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Prometheus CRD. Search for the resources section in the CRD and increase the CPU or RAM accordingly.
    kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.
  3. Increase the PVC Size by referring to the Changing Metrics Storage Allocation section.
  4. Run the following commands to manually clean up the PVC data:
    a. Identify the Prometheus pods and their PVCs:
      $ kubectl get pods -n occne-infra -o wide
      $ kubectl get pvc -n occne-infra
    b. Log in to the node where Prometheus is deployed and locate the mount path of the PVC identified in the previous step:
      $ sudo su; lsblk
    c. Change into the PVC mount path and remove the Prometheus data:
      $ cd prometheus-db/
      $ rm -rf *

Table 6-23 ALERT_MANAGER_DOWN

Field Details
Description All alert manager instances are down. No alerts will be received until at least one alert manager instance is restored.
Summary Alert notification is down.
Cause Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory or CPU, or issues with the image used by the pod.
Severity Critical
SNMP Trap ID 1018
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Alertmanager CRD. Search for resources section in the Alertmanager CRD and increase the CPU or RAM accordingly.
    kubectl edit alertmanager occne-kube-prom-stack-kube-alertmanager -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-24 SNMP_NOTIFIER_DOWN

Field Details
Description SNMP Notifier is not running or is unavailable.
Summary SNMP Notifier is down.
Cause Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory or CPU, or issues with the image used by the pod.
Severity Critical
SNMP Trap ID 1019
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the "occne-snmp-notifier" deployment (see the command below). Search for the resources section in the deployment and increase the CPU or RAM accordingly.
  2. If resource utilization is not the issue, then contact Oracle support.
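
A sketch of the corresponding edit command, following the same naming and namespace pattern as the other workloads in this section:

  $ kubectl edit deployment occne-snmp-notifier -n occne-infra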

Table 6-25 JAEGER_DOWN

Field Details
Description Jaeger collector is not running or is unavailable.
Summary Jaeger is down.
Cause
  1. Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
  2. OpenSearch is not available.
Severity Critical
SNMP Trap ID 1020
Affects Service (Y/N) N
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the "occne-tracer-jaeger-collector" deployment: search for the resources section and increase the CPU or RAM accordingly.
  2. If resource utilization is not the issue, then contact Oracle support.
  3. Bring the OpenSearch cluster back to a healthy state by following the resolution described for the OPENSEARCH_CLUSTER_HEALTH_RED alert. All master, client, and data pods must be up and running.

Table 6-26 METALLB_SPEAKER_DOWN

Field Details
Description The MetalLB speaker on worker node {{ $labels.instance }} is down.
Summary A MetalLB speaker is down.
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Major
SNMP Trap ID 1021
Affects Service (Y/N) N
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the MetalLB speaker daemonset. Search for the resources section in the daemonset and increase the CPU or RAM accordingly.
    $ kubectl edit ds occne-metallb-speaker -n occne-infra
    
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-27 METALLB_CONTROLLER_DOWN

Field Details
Description The MetalLB controller is not running or is unavailable.
Summary The MetalLB controller is down.
Cause
  1. Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory or CPU, or issues with the image used by the pod.
Severity Critical
SNMP Trap ID 1022
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the "occne-metallb-controller" deployment. Search for the resources section in the deployment and increase the CPU or RAM accordingly.
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-28 GRAFANA_DOWN

Field Details
Description Grafana is not running or is unavailable.
Summary Grafana is down.
Cause
  1. Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
  2. Prometheus is not available.
Severity Major
SNMP Trap ID 1024
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the "occne-kube-prom-stack-grafana" deployment. Search for the resources section in the deployment and increase the CPU or RAM accordingly.
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-29 LOAD_BALANCER_NO_HA

Field Details
Description A single load balancer serving the {{ $labels.external_network }} network has failed. Load balancing will continue to operate in simplex mode.
Summary A load balancer for the {{ $labels.external_network }} network is down.
Cause One of the LBVMs is down.
Severity Major
SNMP Trap ID 1025
Affects Service (Y/N) N
Recommended Actions Replace the failed LBVM.

For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.

Table 6-30 LOAD_BALANCER_NO_SERVICE

Field Details
Description All Load Balancers serving the {{ $labels.external_network }} network have failed. External access for all services on this network is unavailable.
Summary Load balancing for the {{ $labels.external_network }} network is unavailable.
Cause Both LBVMs are down; as a result, the external network is down.
Severity Critical
SNMP Trap ID 1026
Affects Service (Y/N) Y
Recommended Actions Replace one LBVM, wait for lb_monitor to convert it from the STANDBY to the ACTIVE state (run lb_monitor.py manually if needed), and then replace the other LBVM.

For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.

Table 6-31 LOAD_BALANCER_FAILED

Field Details
Description Load balancer {{ $labels.name }} at IP {{ $labels.ip_address }} on the {{ $labels.external_network }} network has failed. Perform Load Balancer recovery procedure to restore.
Summary A load balancer failed.
Cause One of the LBVMs or both the LBVMs are down.
Severity Major
SNMP Trap ID 1027
Affects Service (Y/N) Y
Recommended Actions Although this alert is not always service affecting, the load balancer must be restored to reestablish high availability for load balancing. Replace one or both of the LBVMs, as needed.

For the procedure to replace a failed LBVM, see the "Restoring a Failed Load Balancer" section in Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.

Table 6-32 PROMETHEUS_NO_HA

Field Details
Description A Prometheus instance has failed. Metrics collection will continue to operate in simplex mode.
Summary A Prometheus instance is down.
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Major
SNMP Trap ID 1028
Affects Service (Y/N) N
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Prometheus CRD. Search for the resources section in the Prometheus CRD and increase the CPU or RAM accordingly.
    kubectl edit prometheus occne-kube-prom-stack-kube-prometheus -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.
  3. Increase PVC size by referring to the Changing Metrics Storage Allocation section.
  4. Manually clean up the PVC data:
    a. Identify the Prometheus pods and their PVCs:
      $ kubectl get pods -n occne-infra -o wide
      $ kubectl get pvc -n occne-infra
    b. Log in to the node where Prometheus is deployed and locate the mount path of the PVC identified in the previous step:
      $ sudo su; lsblk
    c. Change into the PVC mount path and remove the Prometheus data:
      $ cd prometheus-db/
      $ rm -rf *

Table 6-33 ALERT_MANAGER_NO_HA

Field Details
Description An AlertManager instance has failed. Alert management will continue to operate in simplex mode.
Summary An AlertManager instance is down.
Cause Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory or CPU, or issues with the image used by the pod.
Severity Major
SNMP Trap ID 1029
Affects Service (Y/N) N
Recommended Actions
  1. If resource utilization is the issue, then increase the resources by editing the Alertmanager CRD. Search for the resources section in the Alertmanager CRD and increase the CPU or RAM accordingly.
    kubectl edit alertmanager occne-kube-prom-stack-kube-alertmanager -n occne-infra
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-34 PROMXY_METRICS_AGGREGATOR_DOWN

Field Details
Description Promxy failed. Metrics will be retrieved from a single Prometheus instance only.
Summary Promxy is down.
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Major
SNMP Trap ID 1032
Affects Service (Y/N) Y
Recommended Actions As metrics are retrieved from a single Prometheus instance, there may be gaps in the retrieved data. Promxy must be restarted to restore the full data retrieval capabilities.
  1. If resource utilization is the issue, then increase the resources by editing the "occne-promxy" deployment. Search for the resources section in the deployment and increase the CPU or RAM accordingly.
  2. If resource utilization is not the issue, then contact Oracle support.

Table 6-35 VCNE_LB_CONTROLLER_FAILED

Field Details
Description The vCNE LB Controller process failed.
Summary The vCNE LB Controller process failed.
Cause Pod is repeatedly crashing and is in the "CrashLoopBackOff", 0/1, or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with the image used by the pod.
Severity Major
SNMP Trap ID 1039
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the cause, then increase the resource by editing the "occne-lb-controller-server" deployment. Search for the resources section in the deployment and increase the CPU or RAM accordingly.
  2. If resource utilization is not the cause, then contact Oracle support.

Table 6-36 VMWARE_CSI_CONTROLLER_FAILED

Field Details
Description The VMware CSI Controller process failed.
Summary The VMware CSI Controller process failed.
Cause The CSI Controller process failed.

Note: This alert is raised only when CNE is installed on a VMware infrastructure.

Severity Critical
SNMP Trap ID 1042
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.

Table 6-37 EGRESS_CONTROLLER_NOT_AVAILABLE

Field Details
Description Egress controller is not running or is unavailable.
Summary Egress controller is down
Cause Pod is repeatedly crashing and is in "CrashLoopBackOff" or 0/1 or "ImagePullBackOff" state due to insufficient memory/CPU, or issues with image used by pod.
Severity Critical
SNMP Trap ID 1048
Affects Service (Y/N) Y
Recommended Actions
  1. If resource utilization is the cause, then increase the resource by editing the "occne-egress-controller" daemonset. Search for the resources section in the daemonset and increase the CPU or RAM accordingly.
  2. If resource utilization is not the cause, then contact Oracle support.

Table 6-38 OPENSEARCH_DATA_PVC_NEARLY_FULL

Field Details
Description OpenSearch Data Volume {{ $persistentvolumeclaim }} has {{ $value }}% of allocated space remaining. Once full, this causes the OpenSearch cluster to start throwing index_block_exceptions; either increase the OpenSearch data PVC or remove unnecessary indices.
Summary OpenSearch Data Volume is nearly full.
Cause OpenSearch data PVCs are nearly full.
Severity Major
SNMP Trap ID 1051
Affects Service (Y/N) Y
Recommended Actions Perform one of the following recommendations:
  • Increase the size of the OpenSearch data PVC for which the alert is raised.
  • Delete the old indices from OpenSearch Dashboards (Dev Tools) by running DELETE <index_name_to_be_deleted> (see the example below).
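
The same deletion can also be performed from the Bastion Host through the OpenSearch client pod; the index name is a placeholder:

  # List the indices, then delete an old index that is no longer needed
  kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl localhost:9200/_cat/indices
  kubectl -n occne-infra exec -it occne-opensearch-client-0 -- curl -XDELETE localhost:9200/<index_name_to_be_deleted>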

6.3 Bastion Host Alerts

This section provides details about Bastion Host alerts.

Table 6-39 BASTION_HOST_FAILED

Field Details
Description Bastion Host {{ $labels.name }} at IP address {{ $labels.ip_address }} is unavailable.
Summary Bastion Host {{ $labels.name }} is unavailable.
Cause One of the Bastion Hosts failed to respond to liveness tests.
Severity Major
SNMP Trap ID 1040
Affects Service (Y/N) N
Recommended Actions Contact Oracle support.

Table 6-40 ALL_BASTION_HOSTS_FAILED

Field Details
Description All Bastion Hosts are unavailable.
Summary All Bastion Hosts are unavailable.
Cause All Bastion Hosts fail to respond to liveness tests.
Severity Critical
SNMP Trap ID 1041
Affects Service (Y/N) Y
Recommended Actions Contact Oracle support.