5.1 System Level Alerts

This section lists the system-level alerts for OCNADD.

Table 5-1 OCNADD_POD_CPU_USAGE_ALERT

Field Details
Triggering Condition POD CPU usage is above the set threshold (default 85%)
Severity Major
Description OCNADD Pod High CPU usage detected for a continuous period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '

PromQL Expression:

expr:

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*export.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*4) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storageadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingressadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3)
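The expression above is Helm-templated PromQL. The following is a minimal sketch, assuming a rule group name, label set, and annotation text that are illustrative and not taken from the shipped OCNADD rule file, of how such an expression is typically wrapped in a Prometheus alerting rule so that it fires only after the continuous 5-minute period mentioned in the Description (only the first per-service clause is shown):

# Illustrative Prometheus alerting rule skeleton; not the shipped OCNADD rule file.
groups:
  - name: ocnadd-system-alerts          # assumed group name
    rules:
      - alert: OCNADD_POD_CPU_USAGE_ALERT
        # Only the aggregation clause is shown; the remaining clauses from the
        # expression above are appended with "or".
        expr: >-
          (sum(rate(container_cpu_usage_seconds_total{image!="", pod=~".*aggregation.*"}[5m]))
          by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2)
        for: 5m                          # enforces the continuous 5-minute breach
        labels:
          severity: major                # matches the severity in Table 5-1
        annotations:
          summary: "Pod CPU usage is above the configured threshold"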

Alert Details OCI

Summary:

Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>=85||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*85||pod_cpu_usage_seconds_total[10m]{pod=~"*corr*"}.rate().groupby(namespace,pod).sum()*100>=3*85||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85", with a trigger delay of 1 minute

where X = FIRING/OK and n = the number of services that violated the rule.

MQL Expression:

pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"corr*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}

Note: The CPU threshold is assigned while executing the Terraform script.

OID 1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used

container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the CPU utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.
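The cpu_threshold and memory_threshold values referenced in the expressions through .Values.global.cluster.* would typically sit in the custom values file in a form similar to the following sketch; the exact surrounding keys are an assumption, and only the defaults stated in Tables 5-1 and 5-2 are reflected here:

# Illustrative excerpt of ocnadd-custom-values.yaml; key paths follow the
# .Values.global.cluster.* references used in the PromQL expressions.
global:
  cluster:
    cpu_threshold: 85      # percent; default per Table 5-1
    memory_threshold: 90   # percent; default per Table 5-2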

Table 5-2 OCNADD_POD_MEMORY_USAGE_ALERT

Field Details
Triggering Condition POD Memory usage is above the set threshold (default 90%)
Severity Major
Description OCNADD Pod High Memory usage detected for a continuous period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % '

PromQL Expression:

expr:

(sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*export.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*storageadapter.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ingressadapter.*"}) by (pod,namespace) > 8*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)
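As a worked example of how these clauses evaluate, the per-service multipliers appear to encode the memory sizing in GiB: with the default 90% threshold, the kafka clause fires when the pod working-set memory exceeds 64*1024*1024*1024*90/100, that is, approximately 61.8 GB (about 57.6 GiB), while a 2 GiB service such as admin, health, alarm, or ui alerts at roughly 1.8 GiB.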

Alert Details OCI

Summary:

Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute

where X = FIRING/OK and n = the number of services that violated the rule.

MQL Expression:

container_memory_usage_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used

container_memory_working_set_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the memory utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.

Table 5-3 OCNADD_POD_RESTARTED

Field Details
Triggering Condition A POD has restarted
Severity Minor
Description A POD has restarted in the last 2 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'

PromQL Expression:

expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1
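Note that kube_pod_container_status_restarts_total is a cumulative counter, so the expression above compares the total restart count against 1. Purely as an illustrative sketch, and not the expression shipped with OCNADD, a window-based variant is a common way to detect only restarts that occurred within the 2-minute period mentioned in the Description:

expr: increase(kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"}[2m]) > 0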

Alert Details OCI

MQL Expression:

No MQL equivalent is available

OID 1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used

kube_pod_container_status_restarts_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically if the specific pod is up.

Steps:

1. Check the application logs. Look for database-related failures, such as connectivity or Kubernetes secret issues (see the command sketch after these steps).

2. Run the following commands to check the orchestration logs for liveness or readiness probe failures:

kubectl get po -n <namespace>

Note the full name of the pod that is not running, and use it in the following command:

kubectl describe pod <desired full pod name> -n <namespace>

3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide".

4. If the issue persists, capture the outputs from the above steps and contact My Oracle Support for guidance.
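The following is a minimal command sketch supporting steps 1 and 2; the pod and namespace names are placeholders, and the fields worth inspecting depend on the failure being investigated:

# View current and previous-container logs for the affected pod (step 1):
kubectl logs <desired full pod name> -n <namespace>
kubectl logs <desired full pod name> -n <namespace> --previous

# Show per-container restart counts and the last termination reason (step 2):
kubectl get pod <desired full pod name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restarts="}{.restartCount}{" lastReason="}{.lastState.terminated.reason}{"\n"}{end}'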