5.1 System Level Alerts

This section lists the system-level alerts for OCNADD.

Table 5-1 OCNADD_POD_CPU_USAGE_ALERT

Field Details
Triggering Condition POD CPU usage is above the set threshold (default 85%)
Severity Major
Description OCNADD Pod High CPU usage detected for a continuous period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '

PromQL Expression:

expr:

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*export.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*4) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storageadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingressadapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3)
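The expression above is Helm-templated PromQL. The following is a minimal sketch, assuming a rule group name, label set, and annotation text that are illustrative and not taken from the shipped OCNADD rule file, of how such an expression is typically wrapped in a Prometheus alerting rule so that it fires only after the continuous 5-minute period mentioned in the Description (only the first per-service clause is shown):

# Illustrative Prometheus alerting rule skeleton; not the shipped OCNADD rule file.
groups:
  - name: ocnadd-system-alerts          # assumed group name
    rules:
      - alert: OCNADD_POD_CPU_USAGE_ALERT
        # Only the aggregation clause is shown; the remaining clauses from the
        # expression above are appended with "or".
        expr: >-
          (sum(rate(container_cpu_usage_seconds_total{image!="", pod=~".*aggregation.*"}[5m]))
          by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2)
        for: 5m                          # enforces the continuous 5-minute breach
        labels:
          severity: major                # matches the severity in Table 5-1
        annotations:
          summary: "Pod CPU usage is above the configured threshold"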

Alert Details OCI

Summary:

Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>=85||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*85||pod_cpu_usage_seconds_total[10m]{pod=~"*corr*"}.rate().groupby(namespace,pod).sum()*100>=3*85||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85|| pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*85", with a trigger delay of 1 minute

where X = FIRING/OK and n = the number of services that violated the rule.

MQL Expression:

pod_cpu_usage_seconds_total[10m]{pod=~"*alarm*|*admin*|*health*|*config*|*kraft*"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*ui*|*aggregation*|*filter*"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"corr*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}||pod_cpu_usage_seconds_total[10m]{pod=~"*kafka*"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*export*"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*storageadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"*ingressadapter*"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }}

Note: The CPU threshold is assigned while executing the Terraform script.

OID 1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used

container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the CPU utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.
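The cpu_threshold and memory_threshold values referenced in the expressions through .Values.global.cluster.* would typically sit in the custom values file in a form similar to the following sketch; the exact surrounding keys are an assumption, and only the defaults stated in Tables 5-1 and 5-2 are reflected here:

# Illustrative excerpt of ocnadd-custom-values.yaml; key paths follow the
# .Values.global.cluster.* references used in the PromQL expressions.
global:
  cluster:
    cpu_threshold: 85      # percent; default per Table 5-1
    memory_threshold: 90   # percent; default per Table 5-2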

Table 5-2 OCNADD_POD_MEMORY_USAGE_ALERT

Field Details
Triggering Condition POD Memory usage is above the set threshold (default 90%)
Severity Major
Description OCNADD Pod High Memory usage detected for a continuous period of 5 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % '

PromQL Expression:

expr:

(sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*export.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*storageadapter.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ingressadapter.*"}) by (pod,namespace) > 8*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)
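As a worked example of how these clauses evaluate, the per-service multipliers appear to encode the memory sizing in GiB: with the default 90% threshold, the kafka clause fires when the pod working-set memory exceeds 64*1024*1024*1024*90/100, that is, approximately 61.8 GB (about 57.6 GiB), while a 2 GiB service such as admin, health, alarm, or ui alerts at roughly 1.8 GiB.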

Alert Details OCI

Summary:

Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|*corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute

where X = FIRING/OK and n = the number of services that violated the rule.

MQL Expression:

container_memory_usage_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"*adapter*|*kafka*|*kraft*|*ocnadd*|corr*|*export*|*storageadapter*|*ingressadapter*"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}

OID 1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used

container_memory_working_set_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the memory utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd-custom-values.yaml file. If guidance is required, contact My Oracle Support.

Table 5-3 OCNADD_POD_RESTARTED

Field Details
Triggering Condition A POD has restarted
Severity Minor
Description A POD has restarted in the last 2 minutes
Alert Details CNE

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'

PromQL Expression:

expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1
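Note that kube_pod_container_status_restarts_total is a cumulative counter, so the expression above compares the total restart count against 1. Purely as an illustrative sketch, and not the expression shipped with OCNADD, a window-based variant is a common way to detect only restarts that occurred within the 2-minute period mentioned in the Description:

expr: increase(kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"}[2m]) > 0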

Alert Details OCI

MQL Expression:

No MQL equivalent is available

OID 1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used

kube_pod_container_status_restarts_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically if the specific pod is up.

Steps:

1. Check the application logs. Look for database-related failures, such as connectivity or Kubernetes secret issues (see the command sketch after these steps).

2. Run the following commands to check the orchestration logs for liveness or readiness probe failures:

kubectl get po -n <namespace>

Note the full name of the pod that is not running, and use it in the following command:

kubectl describe pod <desired full pod name> -n <namespace>

3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide".

4. If the issue persists, capture the outputs from the above steps and contact My Oracle Support for guidance.
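The following is a minimal command sketch supporting steps 1 and 2; the pod and namespace names are placeholders, and the fields worth inspecting depend on the failure being investigated:

# View current and previous-container logs for the affected pod (step 1):
kubectl logs <desired full pod name> -n <namespace>
kubectl logs <desired full pod name> -n <namespace> --previous

# Show per-container restart counts and the last termination reason (step 2):
kubectl get pod <desired full pod name> -n <namespace> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restarts="}{.restartCount}{" lastReason="}{.lastState.terminated.reason}{"\n"}{end}'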