5.1 System Level Alerts
This section lists the system level alerts for OCNADD.
Table 5-1 OCNADD_POD_CPU_USAGE_ALERT
Field | Details |
---|---|
Triggering Condition | POD CPU usage is above the set threshold (default 85%) |
Severity | Major |
Description | OCNADD Pod high CPU usage detected for a continuous period of 5 minutes |
Alert Details CNE | Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % ' PromQL Expression: expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*export.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*4) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storageadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingressadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) |
Alert Details OCI | Summary: Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>=85||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*85||pod_cpu_usage_seconds_total[10m]{pod=~"*corr*"}.rate().groupby(namespace,pod).sum()*100>=3*85||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"corr*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} Note: The CPU threshold value is assigned while executing the Terraform script. |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.4002 |
Metric Used | container_cpu_usage_seconds_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric exposed by the monitoring system. |
Resolution | The alert is cleared when the CPU utilization falls below the critical threshold (a verification sketch follows this table). Note: The threshold is configurable through the cpu_threshold value used in the alert rule. |
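The per-service multipliers in the PromQL expression above (for example, *2 for aggregation pods and *6 for kafka pods) apply a different effective limit to each service group. The following is a minimal verification sketch, not part of the product: it runs the aggregation-pod sub-expression against the Prometheus HTTP API so that the current per-pod value can be compared with the configured threshold. The Prometheus URL and namespace shown are assumptions; substitute the values used in your deployment.

```bash
# Verification sketch (assumed values): query the same metric the alert uses.
PROM_URL="http://occne-prometheus-server.occne-infra"   # placeholder Prometheus endpoint
NAMESPACE="ocnadd-deploy"                                # placeholder OCNADD namespace

# Per-pod CPU rate for aggregation pods, as in the alert expression above.
# Compare the result with cpu_threshold * 2 (the aggregation multiplier).
QUERY='sum(rate(container_cpu_usage_seconds_total{image!="",pod=~".*aggregation.*",namespace="'"$NAMESPACE"'"}[5m])) by (pod,namespace)'

curl -s --get "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[] | "\(.metric.pod): \(.value[1])"'
```

The same query can be repeated for the other pod name patterns in the expression (kafka, adapter, correlation, and so on) with their respective multipliers.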
Table 5-2 OCNADD_POD_MEMORY_USAGE_ALERT
Field | Details |
---|---|
Triggering Condition | POD Memory usage is above the set threshold (default 90%) |
Severity | Major |
Description | OCNADD Pod high memory usage detected for a continuous period of 5 minutes |
Alert Details CNE | Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % ' PromQL Expression: expr: (sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*export.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*storageadapter.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ingressadapter.*"}) by (pod,namespace) > 8*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) |
Alert Details OCI | Summary: Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: container_memory_usage_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.4005 |
Metric Used | container_memory_working_set_bytes Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric exposed by the monitoring system. |
Resolution | The alert is cleared when the memory utilization falls below the critical threshold (a verification sketch follows this table). Note: The threshold is configurable through the memory_threshold value used in the alert rule. |
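Each branch of the PromQL expression above compares the working-set memory of a service group with a fixed size (for example, 64 GiB for kafka pods) multiplied by memory_threshold/100; at the default 90%, the kafka branch therefore fires above roughly 57.6 GiB. The sketch below, using an assumed Prometheus endpoint and namespace, reads the same metric per Kafka pod in GiB so it can be compared against that limit.

```bash
# Verification sketch (assumed values): working-set memory per Kafka pod in GiB.
PROM_URL="http://occne-prometheus-server.occne-infra"   # placeholder Prometheus endpoint
NAMESPACE="ocnadd-deploy"                                # placeholder OCNADD namespace

# Same metric as the alert; the kafka branch fires above 64 GiB * memory_threshold / 100.
QUERY='sum(container_memory_working_set_bytes{image!="",pod=~".*kafka.*",namespace="'"$NAMESPACE"'"}) by (pod,namespace) / 1024^3'

curl -s --get "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[] | "\(.metric.pod): \(.value[1]) GiB"'
```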
Table 5-3 OCNADD_POD_RESTARTED
Field | Details |
---|---|
Triggering Condition | A POD has restarted |
Severity | Minor |
Description | A POD has restarted in the last 2 minutes |
Alert Details CNE | Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted' PromQL Expression: expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1 |
Alert Details OCI | MQL Expression: No MQL equivalent is available |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5006 |
Metric Used | kube_pod_container_status_restarts_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric exposed by the monitoring system. |
Resolution | The alert is cleared automatically once the affected pod is up again. Steps: 1. Check the application logs for database-related failures such as connectivity errors, missing Kubernetes secrets, and so on. 2. Check the orchestration logs for liveness or readiness probe failures: run kubectl get po -n <namespace>, note the full name of the pod that is not running, and use it in kubectl describe pod <desired full pod name> -n <namespace> (a command sketch follows this table). 3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide". 4. If the issue persists, capture the outputs from the above steps and contact My Oracle Support if guidance is required. |
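The following is a sketch of the diagnostic commands referenced in the resolution steps above. The namespace value is a placeholder, and the pod name must be taken from the kubectl get po output.

```bash
NAMESPACE="ocnadd-deploy"   # placeholder: use the OCNADD deployment namespace

# List pods and note the full name of the pod that restarted or is not running.
kubectl get po -n "$NAMESPACE"

# Inspect events for liveness or readiness probe failures (replace the pod name).
kubectl describe pod <desired full pod name> -n "$NAMESPACE"

# Application logs, including the previous (restarted) container instance.
kubectl logs <desired full pod name> -n "$NAMESPACE" --previous
```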