5 List of Alerts

This section provides detailed information about the alert rules defined for OCNADD.

5.1 System Level Alerts

This section lists the system level alerts for OCNADD.

Management Group Alerts

OCNADD_POD_CPU_USAGE_ALERT

Table 5-1 OCNADD_POD_CPU_USAGE_ALERT

Field	Details
Triggering Condition	POD CPU usage is above the set threshold (default 85%)
Severity	Major
Description	OCNADD Pod High CPU usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % ' PromQL Expression: expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddalarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddalarm.ocnaddalarm.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddconfiguration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddconfiguration.ocnaddconfiguration.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddgui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddgui.ocnaddgui.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddhealthmonitoring.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddhealthmonitoring.ocnaddhealthmonitoring.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnadduirouter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnadduirouter.ocnadduirouter.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddexport.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddexport.ocnaddexport.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddmanagementgateway.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmanagement.ocnaddmanagementgateway.ocnaddmanagementgateway.resources.limits.cpu }})
Alert Details OCI	Summary: Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"alarm|admin|health|config|management"}.rate().groupby(namespace,pod).sum()*100>=85 || pod_cpu_usage_seconds_total[10m]{pod=~"ui"}.rate().groupby(namespace,pod).sum()*100>=2*85 || pod_cpu_usage_seconds_total[10m]{pod=~"export"}.rate().groupby(namespace,pod).sum()*100>=4*85", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: pod_cpu_usage_seconds_total[10m]{pod=~"alarm|admin|health|config|management"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"ui"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"export"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} Note: The CPU Threshold value is assigned while executing the Terraform script.
OID	1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used	container_cpu_usage_seconds_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared when the CPU utilization falls below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-management-custom-values.yaml files. If guidance is required, contact My Oracle Support.
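
As a rough illustration of the shape of the OCI trigger rule above, the following Python sketch compares a pod's CPU rate (in cores, scaled by 100) against the threshold percentage multiplied by a per-service factor. It assumes that the multipliers in the MQL expression (1, 2, and 4) correspond to each service's CPU limit in cores; the function name and the sample values are hypothetical and are not part of OCNADD.

```python
def cpu_alert_fires(cpu_cores_used: float,
                    cpu_limit_cores: float,
                    threshold_percent: float = 85.0) -> bool:
    """Mirror the rule shape: usage * 100 >= limit * threshold.

    cpu_cores_used    -- summed CPU rate for the pod, in cores (assumed input)
    cpu_limit_cores   -- per-service multiplier, assumed to be the CPU limit
    threshold_percent -- CPU threshold in percent (default 85 per the table)
    """
    return cpu_cores_used * 100 >= cpu_limit_cores * threshold_percent


if __name__ == "__main__":
    # Hypothetical pod with a 2-core limit.
    for used in (1.5, 1.8):
        print(f"{used} cores used -> fires: {cpu_alert_fires(used, 2)}")
```

With a 2-core limit and the default 85% threshold, the comparison value is 170, so the sketch reports a breach only once the pod's scaled usage reaches that value.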

OCNADD_POD_MEMORY_USAGE_ALERT

Table 5-2 OCNADD_POD_MEMORY_USAGE_ALERT

Field	Details
Triggering Condition	POD Memory usage is above the set threshold (default 90%)
Severity	Major
Description	OCNADD Pod High Memory usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % ' PromQL Expression: expr: (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddalarm.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddalarm.ocnaddalarm.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddconfiguration.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddconfiguration.ocnaddconfiguration.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddgui.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddgui.ocnaddgui.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddhealthmonitoring.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddhealthmonitoring.ocnaddhealthmonitoring.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnadduirouter.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnadduirouter.ocnadduirouter.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddexport.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddexport.ocnaddexport.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddmanagementgateway.*"}) by (pod,namespace) > {{ .Values.ocnaddmanagement.ocnaddmanagementgateway.ocnaddmanagementgateway.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)
Alert Details OCI	Summary: Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"alarm|admin|health|config|management|export|ui"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"alarm|admin|health|config|management|export|ui"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: container_memory_usage_bytes[10m]{pod=~"alarm|admin|health|config|management|export|ui"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"alarm|admin|health|config|management|export|ui"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used	container_memory_working_set_bytes Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared when the memory utilization falls below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-management-custom-values.yaml files. If guidance is required, contact My Oracle Support.
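
The CNE expression above derives a byte threshold from each service's memory limit: it takes the leading digits of the limit string (the regexFind "[0-9]+" step), scales them by 1024 three times, and applies the threshold percentage. The Python sketch below reproduces that arithmetic under the assumption that limits are given in Gi; the function name and sample values are hypothetical.

```python
import re

GIB = 1024 * 1024 * 1024  # bytes per GiB, matching the 1024*1024*1024 factor


def memory_threshold_bytes(memory_limit: str,
                           threshold_percent: float = 90.0) -> float:
    """Return the byte value that working-set memory is compared against.

    memory_limit      -- Helm-style limit string such as "4Gi" (assumed unit: Gi)
    threshold_percent -- memory threshold in percent (default 90 per the table)
    """
    match = re.search(r"[0-9]+", memory_limit)  # same idea as regexFind "[0-9]+"
    if match is None:
        raise ValueError(f"no numeric part in memory limit: {memory_limit!r}")
    return int(match.group()) * GIB * threshold_percent / 100


if __name__ == "__main__":
    print(memory_threshold_bytes("4Gi"))
```

Because only the digits are extracted, a limit written in another unit (for example, a hypothetical "512Mi") would still be treated as gibibytes by this arithmetic, which is worth keeping in mind when changing limits in the custom values files.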

OCNADD_POD_RESTARTED

Table 5-3 OCNADD_POD_RESTARTED

Field	Details
Triggering Condition	A POD has restarted
Severity	Minor
Description	A POD has restarted in the last 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted' PromQL Expression: expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.ocnaddmanagement.cluster.nameSpace.name }}"} > 1
Alert Details OCI	MQL Expression: No MQL equivalent is available
OID	1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used	kube_pod_container_status_restarts_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the specific pod is up. Steps: 1. Check the application logs. Check for database-related failures such as connectivity, Kubernetes secrets, and so on. 2. Run the following command to check orchestration logs for liveness or readiness probe failures: kubectl get po -n <namespace> Note the full name of the pod that is not running, and use it in the following command: kubectl describe pod <desired full pod name> -n <namespace> 3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide". 4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.
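
Step 2 of the resolution above asks for the full name of any pod that is not running. As a minimal sketch, the Python helper below picks those names out of the default text output of kubectl get po; it assumes the standard NAME and STATUS column headers, and the pod names in the sample are hypothetical.

```python
from typing import List

# Hypothetical sample of `kubectl get po -n <namespace>` output.
SAMPLE_OUTPUT = """\
NAME                        READY   STATUS             RESTARTS   AGE
ocnaddalarm-abc             1/1     Running            0          2d
ocnaddconfiguration-xyz     0/1     CrashLoopBackOff   7          2d
"""


def pods_not_running(kubectl_get_output: str) -> List[str]:
    """Return names of pods whose STATUS column is not 'Running'."""
    lines = [ln for ln in kubectl_get_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    names = []
    for line in lines[1:]:
        cols = line.split()
        # STATUS comes before RESTARTS, so extra tokens there do not shift it.
        if len(cols) > status_idx and cols[status_idx] != "Running":
            names.append(cols[name_idx])
    return names


if __name__ == "__main__":
    for pod in pods_not_running(SAMPLE_OUTPUT):
        print(f"kubectl describe pod {pod} -n <namespace>")
```

The helper only looks at the STATUS column, so a pod that is Running but not yet ready (for example, READY 0/1) is not reported; the describe command from step 2 remains the place to check probe failures.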

RelayAgent Group Alerts

OCNADD_POD_CPU_USAGE_ALERT

Table 5-4 OCNADD_POD_CPU_USAGE_ALERT

Field	Details
Triggering Condition	POD CPU usage is above the set threshold (default 85%)
Severity	Major
Description	OCNADD Pod High CPU usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % ' PromQL Expression: expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddbsfaggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddbsfaggregation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddnrfaggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddnrfaggregation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddpcfaggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddpcfaggregation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddscpaggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddscpaggregation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddseppaggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddseppaggregation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka-broker.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddkafka.ocnadd.kafkaBroker.resource.limit.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft-controller.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddkafka.ocnadd.kraftController.resource.limit.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddrelayagentgateway.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddrelayagent.ocnaddrelayagentgateway.ocnaddrelayagentgateway.resources.limits.cpu }})
Alert Details OCI	Summary: Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"kraft|relay"}.rate().groupby(namespace,pod).sum()*100>=85 || pod_cpu_usage_seconds_total[10m]{pod=~"aggregation"}.rate().groupby(namespace,pod).sum()*100>=2*85 || pod_cpu_usage_seconds_total[10m]{pod=~"kafka"}.rate().groupby(namespace,pod).sum()*100>=6*85", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: pod_cpu_usage_seconds_total[10m]{pod=~"relay|kraft"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"aggregation"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"kafka"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} Note: The CPU Threshold value is assigned while executing the Terraform script.
OID	1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used	container_cpu_usage_seconds_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert gets cleared when the CPU utilization is below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-relayagent-custom-values.yaml files. If guidance is required, contact My Oracle Support.

OCNADD_POD_MEMORY_USAGE_ALERT

Table 5-5 OCNADD_POD_MEMORY_USAGE_ALERT

Field	Details
Triggering Condition	POD Memory usage is above the set threshold (default 90%)
Severity	Major
Description	OCNADD Pod High Memory usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % ' PromQL Expression: expr: (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddbsfaggregation.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddbsfaggregation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddnrfaggregation.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddnrfaggregation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddpcfaggregation.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddpcfaggregation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddscpaggregation.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddscpaggregation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddseppaggregation.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddaggregation.ocnaddseppaggregation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka-broker.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddkafka.ocnadd.kafkaBroker.resource.limit.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft-controller.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddkafka.ocnadd.kraftController.resource.limit.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddrelayagentgateway.*"}) by (pod,namespace) > {{ .Values.ocnaddrelayagent.ocnaddrelayagentgateway.ocnaddrelayagentgateway.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)
Alert Details OCI	Summary: Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"aggregation|kafka|kraft|relay"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"aggregation|kafka|kraft|relay"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: container_memory_usage_bytes[10m]{pod=~"aggregation|kafka|kraft|relay"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"aggregation|kafka|kraft|relay"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used	container_memory_working_set_bytes Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert gets cleared when the memory utilization is below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-relayagent-custom-values.yaml files. If guidance is required, contact My Oracle Support.

OCNADD_POD_RESTARTED

Table 5-6 OCNADD_POD_RESTARTED

Field	Details
Triggering Condition	A POD has restarted
Severity	Minor
Description	A POD has restarted in the last 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted' PromQL Expression: expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.ocnaddrelayagent.cluster.nameSpace.name }}"} > 1
Alert Details OCI	MQL Expression: No MQL equivalent is available
OID	1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used	kube_pod_container_status_restarts_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the specific pod is up. Steps: 1. Check the application logs. Check for database-related failures such as connectivity, Kubernetes secrets, and so on. 2. Run the following command to check orchestration logs for liveness or readiness probe failures: kubectl get po -n <namespace> Note the full name of the pod that is not running, and use it in the following command: kubectl describe pod <desired full pod name> -n <namespace> 3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide". 4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

Mediation Group Alerts

OCNADD_POD_CPU_USAGE_ALERT

Table 5-7 OCNADD_POD_CPU_USAGE_ALERT

Field	Details
Triggering Condition	POD CPU usage is above the set threshold (default 85%)
Severity	Major
Description	OCNADD Pod High CPU usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % ' PromQL Expression: expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddadminservice.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.admin.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddfilter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddfilter.ocnaddfilter.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka-broker.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddkafka.ocnadd.kafkaBroker.resource.limit.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kraft-controller.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddkafka.ocnadd.kraftController.resource.limit.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image=~".*consumeradapter.*", pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.consumeradapter.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.correlation.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*storage-adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.storageAdapter.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ingress-adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.ingressadapter.resources.limits.cpu }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnaddmediationgateway.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*{{ .Values.ocnaddmediation.ocnaddmediationgateway.ocnaddmediationgateway.resources.limits.cpu }})
Alert Details OCI	Summary: Alarm "OCNADD_POD_CPU_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "pod_cpu_usage_seconds_total[10m]{pod=~"admin|kraft|mediation"}.rate().groupby(namespace,pod).sum()*100>=85 || pod_cpu_usage_seconds_total[10m]{pod=~"corr"}.rate().groupby(namespace,pod).sum()*100>=3*85 || pod_cpu_usage_seconds_total[10m]{pod=~"kafka"}.rate().groupby(namespace,pod).sum()*100>=6*85 || pod_cpu_usage_seconds_total[10m]{pod=~"export"}.rate().groupby(namespace,pod).sum()*100>=4*85 || pod_cpu_usage_seconds_total[10m]{pod=~"storageadapter"}.rate().groupby(namespace,pod).sum()*100>=3*85 || pod_cpu_usage_seconds_total[10m]{pod=~"ingressadapter"}.rate().groupby(namespace,pod).sum()*100>=3*85 || pod_cpu_usage_seconds_total[10m]{pod=~"filter"}.rate().groupby(namespace,pod).sum()*100>=2*85", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: pod_cpu_usage_seconds_total[10m]{pod=~"admin|kraft|mediation"}.rate().groupby(namespace,pod).sum()*100>={{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"corr"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"kafka"}.rate().groupby(namespace,pod).sum()*100>=6*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"export"}.rate().groupby(namespace,pod).sum()*100>=4*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"storageadapter"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"ingressadapter"}.rate().groupby(namespace,pod).sum()*100>=3*{{ CPU Threshold }} || pod_cpu_usage_seconds_total[10m]{pod=~"filter"}.rate().groupby(namespace,pod).sum()*100>=2*{{ CPU Threshold }} Note: The CPU Threshold value is assigned while executing the Terraform script.
OID	1.3.6.1.4.1.323.5.3.53.1.29.4002
Metric Used	container_cpu_usage_seconds_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert gets cleared when the CPU utilization is below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-mediation-custom-values.yaml files. If guidance is required, contact My Oracle Support.

OCNADD_POD_MEMORY_USAGE_ALERT

Table 5-8 OCNADD_POD_MEMORY_USAGE_ALERT

Field	Details
Triggering Condition	POD Memory usage is above the set threshold (default 90%)
Severity	Major
Description	OCNADD Pod High Memory usage detected for a continuous period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % ' PromQL Expression: expr: (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddfilter.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddfilter.ocnaddfilter.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka-broker.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddkafka.ocnadd.kafkaBroker.resource.limit.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kraft-controller.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddkafka.ocnadd.kraftController.resource.limit.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image=~".*consumeradapter.*", pod=~".*adapter.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.consumeradapter.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.correlation.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*storage-adapter.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.storageAdapter.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ingress-adapter.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddadmin.ocnadd.ingressadapter.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnaddmediationgateway.*"}) by (pod,namespace) > {{ .Values.ocnaddmediation.ocnaddmediationgateway.ocnaddmediationgateway.resources.limits.memory | regexFind "[0-9]+" }}*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)
Alert Details OCI	Summary: Alarm "OCNADD_POD_MEMORY_USAGE_ALERT" is in a "X" state; because n metrics meet the trigger rule: "container_memory_usage_bytes[5m]{pod=~"adapter|kafka|kraft|mediation|corr|export|storageadapter|ingressadapter|filter"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[5m]{pod=~"adapter|kafka|kraft|mediation|corr|export|storageadapter|ingressadapter|filter"}.groupby(namespace,pod).sum()*100>=90", with a trigger delay of 1 minute, where X = FIRING/OK and n = the number of services that violated the rule. MQL Expression: container_memory_usage_bytes[10m]{pod=~"adapter|kafka|kraft|mediation|corr|export|storageadapter|ingressadapter|filter"}.groupby(namespace,pod).sum()/container_spec_memory_limit_bytes[10m]{pod=~"adapter|kafka|kraft|mediation|corr|export|storageadapter|ingressadapter|filter"}.groupby(namespace,pod).sum()*100>={{ Memory Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.4005
Metric Used	container_memory_working_set_bytes Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert gets cleared when the memory utilization is below the critical threshold. Note: The thresholds are configurable in the ocnadd-common-custom-values.yaml and ocnadd-mediation-custom-values.yaml files. If guidance is required, contact My Oracle Support.

OCNADD_POD_RESTARTED

Table 5-9 OCNADD_POD_RESTARTED

Field	Details
Triggering Condition	A POD has restarted
Severity	Minor
Description	A POD has restarted in the last 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted' PromQL Expression: expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.ocnaddmediation.cluster.nameSpace.name }}"} > 1
Alert Details OCI	MQL Expression: No MQL equivalent is available
OID	1.3.6.1.4.1.323.5.3.53.1.29.5006
Metric Used	kube_pod_container_status_restarts_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the specific pod is up. Steps: 1. Check the application logs. Check for database-related failures such as connectivity, Kubernetes secrets, and so on. 2. Run the following command to check orchestration logs for liveness or readiness probe failures: kubectl get po -n <namespace> Note the full name of the pod that is not running, and use it in the following command: kubectl describe pod <desired full pod name> -n <namespace> 3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide". 4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.

5.2 Application Level Alerts

This section lists the application level alerts for OCNADD.

Management Group Alerts

OCNADD_CONFIG_SVC_DOWN

Table 5-10 OCNADD_CONFIG_SVC_DOWN

Field	Details
Triggering Condition	The configuration service went down or is not accessible
Severity	Critical
Description	OCNADD Configuration service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down' PromQL Expression: expr: up{service="ocnaddconfiguration"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_CONFIG_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.20.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Configuration service becomes available. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support.
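
The CNE rule above fires when the Prometheus up metric for the service is not 1. As a minimal sketch, the Python helper below applies the same comparison to a response from the Prometheus instant-query HTTP API (/api/v1/query) for up{service="ocnaddconfiguration"}. The sample response, its pod names, and the function name are hypothetical; fetching the response from a real Prometheus endpoint is left out.

```python
from typing import Any, Dict, List

# Hypothetical instant-query response for: up{service="ocnaddconfiguration"}
SAMPLE_RESPONSE: Dict[str, Any] = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"service": "ocnaddconfiguration", "pod": "cfg-0"},
             "value": [1700000000.0, "1"]},
            {"metric": {"service": "ocnaddconfiguration", "pod": "cfg-1"},
             "value": [1700000000.0, "0"]},
        ],
    },
}


def down_targets(query_response: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return the label sets of samples whose value is not 1 (up != 1)."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    down = []
    for sample in query_response["data"]["result"]:
        _timestamp, value = sample["value"]  # value arrives as a string
        if float(value) != 1:
            down.append(sample["metric"])
    return down


if __name__ == "__main__":
    for labels in down_targets(SAMPLE_RESPONSE):
        print("down:", labels)
```

An empty result list means that no series matched the selector at all, which the comparison alone does not flag; in that case the pod status and Helm status checks from the steps above are the more useful signal.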

OCNADD_ALARM_SVC_DOWN

Table 5-11 OCNADD_ALARM_SVC_DOWN

Field	Details
Triggering Condition	The alarm service went down or is not accessible
Severity	Critical
Description	OCNADD Alarm service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down' PromQL Expression: expr: up{service="ocnaddalarm"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_ALARM_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.24.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Alarm service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.
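
The CNE expression can also be evaluated on demand through the Prometheus HTTP API (/api/v1/query). The sketch below only parses an instant-query response and lists the targets whose up value is not 1. The sample payload, namespace, pod name, and function name are made up for illustration, and fetching the response from an actual Prometheus server is left out.

```python
import json

# Hypothetical response body for the query: up{service="ocnaddalarm"} != 1
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "up", "service": "ocnaddalarm",
                   "namespace": "ocnadd-deploy", "pod": "ocnaddalarm-0"},
        "value": [1700000000.0, "0"]
      }
    ]
  }
}
"""


def down_targets(response_text: str) -> list[dict]:
    """Return the label sets of all series whose sample value is not 1."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise ValueError("query failed: %s" % body.get("error", "unknown error"))
    down = []
    for series in body["data"]["result"]:
        _timestamp, value = series["value"]  # the sample value is a string
        if float(value) != 1.0:
            down.append(series["metric"])
    return down


if __name__ == "__main__":
    for labels in down_targets(SAMPLE_RESPONSE):
        print(f"{labels['namespace']}/{labels['pod']}: {labels['service']} is down")
```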

OCNADD_HEALTH_MONITORING_SVC_DOWN

Table 5-12 OCNADD_HEALTH_MONITORING_SVC_DOWN

Field	Details
Triggering Condition	The health monitoring service went down or is not accessible
Severity	Critical
Description	OCNADD Health monitoring service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down' PromQL Expression: expr: up{service="ocnaddhealthmonitoring"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_HEALTH_MONITORING_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.28.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Health monitoring service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.
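
The OCI rule evaluates the mean of podStatus over a 10-minute window, so a single unhealthy sample inside the window is enough to breach it. A minimal sketch of that evaluation follows. It assumes that podStatus reports 1 for a healthy pod and treats an empty window as not breached, neither of which is stated in this guide, and the function name is hypothetical.

```python
from statistics import mean


def mql_rule_breached(pod_status_samples: list[float]) -> bool:
    """Return True when the mean of the samples in the window is not 1."""
    if not pod_status_samples:
        # No data in the window: treated as "not breached" in this sketch only.
        return False
    return mean(pod_status_samples) != 1


if __name__ == "__main__":
    healthy_window = [1, 1, 1, 1]
    degraded_window = [1, 1, 0, 1]  # one unhealthy sample in the window
    print(mql_rule_breached(healthy_window))
    print(mql_rule_breached(degraded_window))
```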

OCNADD_EXPORT_SVC_DOWN

Table 5-13 OCNADD_EXPORT_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Export service went down or is not accessible
Severity	Critical
Description	OCNADD Export service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Export service is down' PromQL Expression: expr: up{service=~".*export.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_EXPORT_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="export"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="export"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.39.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Export service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_MANGEMENT_GW_SVC_DOWN

Table 5-14 OCNADD_MANGEMENT_GW_SVC_DOWN

Field	Details
Triggering Condition	The Management Gateway service went down or is not accessible
Severity	Critical
Description	OCNADD Management Gateway service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddmanagementgateway service is down' PromQL Expression: expr: up{service="ocnaddmanagementgateway"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_MANGEMENT_GW_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddmanagementgateway"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddmanagementgateway"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Management Gateway service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.
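
The service-down descriptions state that a service must be unavailable for more than 2 minutes before the alert is raised. The sketch below simulates such a hold period over a series of up samples. It assumes that the hold behaves like a Prometheus "for: 2m" clause (the clause itself is not shown in this guide), and the function name and sample data are illustrative.

```python
from datetime import datetime, timedelta

# Hold period taken from the alert description; assumed to act like "for: 2m".
HOLD = timedelta(minutes=2)


def firing_since(samples, hold=HOLD):
    """Return the time at which the alert starts firing, or None.

    samples is a list of (timestamp, up_value) pairs in ascending time order.
    The condition (up != 1) must hold continuously for at least `hold`.
    """
    pending_since = None
    for timestamp, up_value in samples:
        if up_value != 1:
            if pending_since is None:
                pending_since = timestamp  # condition became true
            if timestamp - pending_since >= hold:
                return timestamp
        else:
            pending_since = None  # recovery resets the hold period
    return None


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 0, 0, 0)
    # One sample every 30 seconds; the service goes down at +30 s and stays down.
    outage = [(start + timedelta(seconds=30 * i), 1 if i == 0 else 0) for i in range(7)]
    print(firing_since(outage))
```

A short outage that recovers before the hold period elapses returns None, which is why brief restarts do not raise the alert under this assumption.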

RelayAgent Group Alerts

OCNADD_SCP_AGGREGATION_SVC_DOWN

Table 5-15 OCNADD_SCP_AGGREGATION_SVC_DOWN

Field	Details
Triggering Condition	The SCP Aggregation service went down or is not accessible
Severity	Critical
Description	OCNADD SCP Aggregation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down' PromQL Expression: expr: up{service="ocnaddscpaggregation"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_SCP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.22.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_NRF_AGGREGATION_SVC_DOWN

Table 5-16 OCNADD_NRF_AGGREGATION_SVC_DOWN

Field	Details
Triggering Condition	The NRF Aggregation service went down or is not accessible
Severity	Critical
Description	OCNADD NRF Aggregation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down' PromQL Expression: expr: up{service="ocnaddnrfaggregation"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_NRF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.31.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_SEPP_AGGREGATION_SVC_DOWN

Table 5-17 OCNADD_SEPP_AGGREGATION_SVC_DOWN

Field	Details
Triggering Condition	The SEPP Aggregation service went down or is not accessible
Severity	Critical
Description	OCNADD SEPP Aggregation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddseppaggregation service is down' PromQL Expression: expr: up{service="ocnaddseppaggregation"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_SEPP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.32.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD SEPP Aggregation service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_BSF_AGGREGATION_SVC_DOWN

Table 5-18 OCNADD_BSF_AGGREGATION_SVC_DOWN

Field	Details
Triggering Condition	The BSF Aggregation service went down or is not accessible
Severity	Critical
Description	OCNADD BSF Aggregation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddbsfaggregation service is down' PromQL Expression: expr: up{service="ocnaddbsfaggregation"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_BSF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.40.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD BSF Aggregation service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_PCF_AGGREGATION_SVC_DOWN

Table 5-19 OCNADD_PCF_AGGREGATION_SVC_DOWN

Field	Details
Triggering Condition	The PCF Aggregation service went down or is not accessible
Severity	Critical
Description	OCNADD PCF Aggregation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddpcfaggregation service is down' PromQL Expression: expr: up{service="ocnaddpcfaggregation"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_PCF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD PCF Aggregation service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_RELAYAGENT_GW_SVC_DOWN

Table 5-20 OCNADD_RELAYAGENT_GW_SVC_DOWN

Field	Details
Triggering Condition	The RelayAgent Gateway service went down or is not accessible
Severity	Critical
Description	OCNADD RelayAgent Gateway service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddrelayagentgateway service is down' PromQL Expression: expr: up{service="ocnaddrelayagentgateway"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_RELAYAGENT_GW_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddrelayagentgateway"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddrelayagentgateway"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD RelayAgent Gateway service becomes available. Steps: 1. Check for service-specific alerts that may be causing the issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in a running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.
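
The RelayAgent service-down alerts in Tables 5-15 to 5-20 all share one expression shape and differ only in the service label. As a sketch, the mapping below rebuilds each CNE expression from the alert names and service names listed in those tables; the helper name up_expression is hypothetical.

```python
# Alert name -> Prometheus "service" label, as listed in Tables 5-15 to 5-20.
RELAY_AGENT_SERVICES = {
    "OCNADD_SCP_AGGREGATION_SVC_DOWN": "ocnaddscpaggregation",
    "OCNADD_NRF_AGGREGATION_SVC_DOWN": "ocnaddnrfaggregation",
    "OCNADD_SEPP_AGGREGATION_SVC_DOWN": "ocnaddseppaggregation",
    "OCNADD_BSF_AGGREGATION_SVC_DOWN": "ocnaddbsfaggregation",
    "OCNADD_PCF_AGGREGATION_SVC_DOWN": "ocnaddpcfaggregation",
    "OCNADD_RELAYAGENT_GW_SVC_DOWN": "ocnaddrelayagentgateway",
}


def up_expression(service: str) -> str:
    """Build the availability expression for one service label."""
    # Doubled braces are literal braces in an f-string.
    return f'up{{service="{service}"}} != 1'


if __name__ == "__main__":
    for alert, service in RELAY_AGENT_SERVICES.items():
        print(f"{alert}: {up_expression(service)}")
```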

OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Table 5-21 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total ingress MPS crossed the warning threshold of 80% of the supported MPS
Severity	Warn
Description	Total Ingress Message Rate is above the configured warning threshold (80%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5007
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%.

OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Table 5-22 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity	Minor
Description	Total Ingress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5008
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%.

OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Table 5-23 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity	Major
Description	Total Ingress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5009
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%.

OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Table 5-24 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity	Critical
Description	Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5010
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS.
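
The four ingress threshold alerts form a ladder over the same metric, at 80%, 90%, 95%, and 100% of the supported MPS. The following sketch, with a hypothetical helper name and example MPS figures, shows the highest severity reached for a given ingress rate. In practice every alert whose threshold is exceeded fires, not only the highest one.

```python
# (severity, fraction of the supported MPS), ordered from highest to lowest.
THRESHOLDS = [
    ("Critical", 1.0),
    ("Major", 0.95),
    ("Minor", 0.9),
    ("Warn", 0.8),
]


def ingress_severity(ingress_mps: float, supported_mps: float):
    """Return the highest severity whose threshold is exceeded, or None."""
    for severity, fraction in THRESHOLDS:
        if ingress_mps > fraction * supported_mps:
            return severity
    return None


if __name__ == "__main__":
    supported = 100_000  # example value for the supported MPS, not a product default
    for rate in (50_000, 85_000, 92_000, 97_000, 120_000):
        print(rate, ingress_severity(rate, supported))
```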

OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Table 5-25 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Field	Details
Triggering Condition	The packet drop rate in Kafka streams is above the configured major threshold of 1% of the total supported MPS
Severity	Major
Description	The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% threshold of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_task_dropped_records_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5020
Metric Used	kafka_stream_task_dropped_records_total
Resolution	The alert is cleared automatically when the packet drop rate goes below the major threshold (1%) alert level of the supported MPS.

OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Table 5-26 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Field	Details
Triggering Condition	The packet drop rate in Kafka streams is above the configured critical threshold of 10% of the total supported MPS
Severity	Critical
Description	The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% threshold of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_task_dropped_records_total[10m]{microservice=~"aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5021
Metric Used	kafka_stream_task_dropped_records_total
Resolution	The alert is cleared automatically when the packet drop rate goes below the critical threshold (10%) alert level of the supported MPS.
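
Both packet-drop alerts compare a per-second drop rate with a fraction of the supported MPS (1% for Major, 10% for Critical). The sketch below works through that arithmetic with made-up counter samples. The rate helper is a simplification that ignores counter resets and the extrapolation that the Prometheus rate() function applies, and all names and numbers are assumptions.

```python
def per_second_rate(first, last):
    """Average per-second increase between two (timestamp_seconds, counter) samples."""
    (t0, c0), (t1, c1) = first, last
    if t1 <= t0:
        raise ValueError("samples must be in ascending time order")
    return (c1 - c0) / (t1 - t0)


def drop_severity(drop_rate: float, supported_mps: float):
    """Return the highest packet-drop severity reached, or None."""
    if drop_rate > 0.10 * supported_mps:
        return "Critical"
    if drop_rate > 0.01 * supported_mps:
        return "Major"
    return None


if __name__ == "__main__":
    supported = 100_000  # example value for the supported MPS
    # Hypothetical dropped-records counter readings taken 5 minutes (300 s) apart.
    rate = per_second_rate((0, 1_000), (300, 601_000))
    print(rate, drop_severity(rate, supported))
```

With a supported MPS of 100,000, the Major alert corresponds to more than 1,000 dropped records per second and the Critical alert to more than 10,000.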

OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Table 5-27 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Field	Details
Triggering Condition	The ingress traffic increase is more than 10% of the supported MPS
Severity	Major
Description	The ingress traffic increase is more than 10% of the supported MPS in the last 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) >= 1.1
Alert Details OCI	Not Available
OID	1.3.6.1.4.1.323.5.3.53.1.29.5027
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the increase in MPS falls back below 10% of the supported MPS.

OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Table 5-28 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Field	Details
Triggering Condition	The ingress traffic decrease is more than 10% of the supported MPS
Severity	Major
Description	The ingress traffic decrease is more than 10% of the supported MPS in the last 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) <= 0.9
Alert Details OCI	Not Available
OID	1.3.6.1.4.1.323.5.3.53.1.29.5028
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the decrease in MPS comes back to lower than 10% of the supported MPS
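
Both spike alerts above divide the current ingress rate by the rate observed 5 minutes earlier and compare the ratio with 1.1 or 0.9. The following Python sketch illustrates only that arithmetic. The function name and the sample rates are hypothetical, and returning None for a zero baseline is an assumption of this sketch, not a statement about how Prometheus evaluates the rule.

```python
def classify_traffic_change(current_rate, previous_rate):
    """Return the spike alert whose ratio condition holds, or None.

    current_rate  -- ingress rate now (messages per second)
    previous_rate -- ingress rate 5 minutes earlier (messages per second)
    """
    if previous_rate <= 0:
        # Assumption of this sketch: treat a zero baseline as undefined.
        return None
    ratio = current_rate / previous_rate
    if ratio >= 1.1:
        return "OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT"
    if ratio <= 0.9:
        return "OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT"
    return None


if __name__ == "__main__":
    # Hypothetical sample rates, for illustration only.
    for now, before in [(1200.0, 1000.0), (850.0, 1000.0), (1000.0, 1000.0)]:
        print(now, before, classify_traffic_change(now, before))
```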

Mediation Group Alerts

OCNADD_ADMIN_SVC_DOWN

Table 5-29 OCNADD_ADMIN_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Admin service is down or not accessible
Severity	Critical
Description	OCNADD Admin service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down' PromQL Expression: expr: up{service="ocnaddadminservice"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_ADMIN_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.30.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Admin service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Table 5-30 OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Consumer Adapter service is down or not accessible
Severity	Critical
Description	OCNADD Consumer Adapter service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down' PromQL Expression: expr: up{service=~".*adapter.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_CONSUMER_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner=~"adapter*"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner=~"adapter*"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.25.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Consumer Adapter service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_FILTER_SVC_DOWN

Table 5-31 OCNADD_FILTER_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Filter service is down or not accessible
Severity	Critical
Description	OCNADD Filter service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Filter service is down' PromQL Expression: expr: up{service=~".*filter.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_FILTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.34.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Filter service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_CORRELATION_SVC_DOWN

Table 5-32 OCNADD_CORRELATION_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Correlation service is down or not accessible
Severity	Critical
Description	OCNADD Correlation service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Correlation service is down' PromQL Expression: expr: up{service=~".*correlation.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_CORRELATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="correlation"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.33.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Correlation service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_MEDIATION_GW_SVC_DOWN

Table 5-33 OCNADD_MEDIATION_GW_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Mediation Gateway service is down or not accessible
Severity	Critical
Description	OCNADD Mediation Gateway service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: ocnaddmediationgateway service is down' PromQL Expression: expr: up{service="ocnaddmediationgateway"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_MEDIATION_GW_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddmediationgateway"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddmediationgateway"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.41.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Mediation Gateway service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_STORAGE_ADAPTER_SVC_DOWN

Table 5-34 OCNADD_STORAGE_ADAPTER_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Storage adapter service is down or not accessible
Severity	Critical
Description	OCNADD Storage adapter service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Storage adapter service is down' PromQL Expression: expr: up{service=~".*storageadapter.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_STORAGE_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="storageadapter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="storageadapter"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.38.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Storage adapter service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.

OCNADD_INGRESS_ADAPTER_SVC_DOWN

Table 5-35 OCNADD_INGRESS_ADAPTER_SVC_DOWN

Field	Details
Triggering Condition	The OCNADD Ingress Adapter service is down or not accessible
Severity	Critical
Description	OCNADD Ingress Adapter service not available for more than 2 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Ingress Adapter service is down' PromQL Expression: expr: up{service=~".*ingressadapter.*"} != 1
Alert Details OCI	Summary: Alarm "OCNADD_INGRESS_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ingressadapter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ingressadapter"}.mean()!=1
OID	1.3.6.1.4.1.323.5.3.53.1.36.2002
Metric Used	'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system.
Resolution	The alert is cleared automatically when the OCNADD Ingress Adapter service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check whether the pod is in the “Running” state: kubectl -n <namespace> get pod If it is not in the “Running” state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If the status is not “STATUS: DEPLOYED”, capture the logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required.
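
The resolution steps for the service-down alerts above use the same three commands. The following Python sketch only assembles those command lines for a given namespace and Helm release so that they can be reviewed or logged before being run. The function name, the namespace "ocnadd-deploy", and the release name "ocnadd" are hypothetical placeholders. The kubectl sort flag is spelled --sort-by.

```python
import shlex


def build_diagnostic_commands(namespace, helm_release):
    """Return the diagnostic commands from the resolution steps as argv lists."""
    return [
        # Step 2: check whether the pods are in the Running state.
        ["kubectl", "-n", namespace, "get", "pod"],
        # Step 2: list events ordered by creation time.
        ["kubectl", "get", "events",
         "--sort-by=.metadata.creationTimestamp", "-n", namespace],
        # Step 4: check the Helm release status.
        ["helm", "status", helm_release, "-n", namespace],
    ]


if __name__ == "__main__":
    # Hypothetical namespace and release name, for illustration only.
    for argv in build_diagnostic_commands("ocnadd-deploy", "ocnadd"):
        print(shlex.join(argv))
```

The sketch prints the commands instead of executing them; each list can be passed to subprocess.run when execution is wanted.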

OCNADD_MPS_CRITICAL_INGRESS_ADAPTER_THRESHOLD_CROSSED

Table 5-36 OCNADD_MPS_CRITICAL_INGRESS_ADAPTER_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total ingress MPS for the ingress adapter crossed the critical threshold alert level of 100% of the supported MPS
Severity	Critical
Description	The total Ingress Message Rate for the ingress adapter is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*ingress-adapter.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_CRITICAL_INGRESS_ADAPTER_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"ingress-adapter"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"ingress-adapter"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.36.5010
Metric Used	kafka_stream_processor_node_process_total
Resolution	The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Table 5-37 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS
Severity	Warn
Description	The total Egress Message Rate is above the configured warning threshold alert (80%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5011
Metric Used	ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of supported MPS

OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Table 5-38 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity	Minor
Description	The total Egress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5012
Metric Used	ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of supported MPS

OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Table 5-39 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total egress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity	Major
Description	The total Egress Message Rate is above the configured major threshold alert (95%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5013
Metric Used	ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of supported MPS

OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Table 5-40 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity	Critical
Description	The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5014
Metric Used	ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS
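
The four egress MPS alerts above compare the same egress rate against 80%, 90%, 95%, and 100% of the supported MPS, and each CNE expression has only a lower bound. The following Python sketch shows which of those conditions hold for a given rate. The function name and the supported MPS value of 100000 are hypothetical and used for illustration only.

```python
# Threshold factors taken from the CNE expressions in Tables 5-37 to 5-40.
EGRESS_THRESHOLDS = [
    ("OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED", 0.80),
    ("OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED", 0.90),
    ("OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED", 0.95),
    ("OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED", 1.0),
]


def egress_conditions_met(egress_rate, supported_mps):
    """Return the names of the egress MPS alerts whose condition holds."""
    return [
        name
        for name, factor in EGRESS_THRESHOLDS
        if egress_rate > factor * supported_mps
    ]


if __name__ == "__main__":
    # Hypothetical supported MPS, for illustration only.
    supported = 100000
    for rate in (50000, 92000, 120000):
        print(rate, egress_conditions_met(rate, supported))
```

Because the expressions are not banded, a rate that satisfies a higher-severity condition also satisfies every lower-severity condition in this sketch.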

OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Table 5-41 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Field	Details
Triggering Condition	The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer
Severity	Critical
Description	The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (rate(ocnadd_egress_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }}
Alert Details OCI	Summary: Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}
OID	1.3.6.1.4.1.323.5.3.53.1.29.5015
Metric Used	ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS

OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Table 5-42 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total observed latency is above the configured warning threshold alert level of 80%
Severity	Warn
Description	Average E2E Latency is above the configured warning threshold alert level (80%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }}
Alert Details OCI	Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05
OID	1.3.6.1.4.1.323.5.3.53.1.29.5016
Metric Used	ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution	The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of permissible latency

OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Table 5-43 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total observed latency is above the configured minor threshold alert level of 90%
Severity	Minor
Description	Average E2E Latency is above the configured minor threshold alert level (90%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }}
Alert Details OCI	Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05
OID	1.3.6.1.4.1.323.5.3.53.1.29.5017
Metric Used	ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution	The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of permissible latency

OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Table 5-44 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total observed latency is above the configured major threshold alert level of 95%
Severity	Major
Description	Average E2E Latency is above the configured major threshold alert level (95%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }}
Alert Details OCI	Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05
OID	1.3.6.1.4.1.323.5.3.53.1.29.5018
Metric Used	ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution	The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of permissible latency

OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Table 5-45 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Field	Details
Triggering Condition	The total observed latency is above the configured critical threshold alert level of 100%
Severity	Critical
Description	Average E2E Latency is above the configured critical threshold alert level (100%) for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }}
Alert Details OCI	Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05
OID	1.3.6.1.4.1.323.5.3.53.1.29.5019
Metric Used	ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count
Resolution	The alert is cleared automatically when the average latency goes below the critical threshold alert level of permissible latency
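
The major and critical latency rules above divide the rate of the latency sum metric by the rate of the latency count metric and compare the result with a fraction of the permissible latency. The following is a minimal Python sketch of that arithmetic, not the rule engine itself. It assumes the 0.05 value that the OCI note uses for {{ .Values.global.cluster.max_latency }}; the function names are hypothetical, and lower-severity bands defined elsewhere in this guide are not modelled.

```python
from typing import Optional

# Assumption: 0.05 mirrors the value the OCI note uses for max_latency.
MAX_LATENCY = 0.05


def average_latency(latency_sum_rate: float, latency_count_rate: float) -> float:
    """Average per-record latency: rate of the _sum series over rate of the _count series."""
    if latency_count_rate <= 0:
        raise ValueError("no records were processed in the evaluation window")
    return latency_sum_rate / latency_count_rate


def latency_alert(avg: float, max_latency: float = MAX_LATENCY) -> Optional[str]:
    """Return the major or critical alert name whose band contains avg, else None."""
    if avg > max_latency:
        # Above 100% of the permissible latency
        return "OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED"
    if avg > 0.95 * max_latency:
        # Above 95% and at most 100% of the permissible latency
        return "OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED"
    return None  # lower-severity bands are outside this sketch


if __name__ == "__main__":
    avg = average_latency(latency_sum_rate=4.8, latency_count_rate=100.0)
    print(avg, latency_alert(avg))
```

Because the major band is bounded above by the critical threshold, at most one of the two alerts applies to a given average.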

OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT

Table 5-46 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT

Field	Details
Triggering Condition	The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 0.1% of the total supported MPS
Severity	Info
Description	Egress external connection failure rate towards 3rd party application is crossing the info threshold of 0.1% in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 0.1 < 10
Alert Details OCI	Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_01PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1
OID	1.3.6.1.4.1.323.5.3.53.1.29.5022
Metric Used	ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the failure rate towards 3rd party consumers goes below the threshold (0.1%) alert level of supported MPS

OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT

Table 5-47 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT

Field	Details
Triggering Condition	The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 1% of the total supported MPS
Severity	Warn
Description	Egress external connection failure rate towards 3rd party application is crossing the warning threshold of 1% in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 1 < 10
Alert Details OCI	Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10
OID	1.3.6.1.4.1.323.5.3.53.1.29.5023
Metric Used	ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the failure rate towards 3rd party consumers goes below the threshold (1%) alert level of supported MPS

OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT

Table 5-48 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT

Field	Details
Triggering Condition	The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 10% of the total supported MPS
Severity	Minor
Description	Egress external connection failure rate towards 3rd party application is crossing the minor threshold of 10% in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 10 < 25
Alert Details OCI	Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25
OID	1.3.6.1.4.1.323.5.3.53.1.29.5024
Metric Used	ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the failure rate towards 3rd party consumers goes below the threshold (10%) alert level of supported MPS

OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT

Table 5-49 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT

Field	Details
Triggering Condition	The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 25% of the total supported MPS
Severity	Major
Description	Egress external connection failure rate towards 3rd party application is crossing the major threshold of 25% in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 25 < 50
Alert Details OCI	Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50
OID	1.3.6.1.4.1.323.5.3.53.1.29.5025
Metric Used	ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the failure rate towards 3rd party consumers goes below the threshold (25%) alert level of supported MPS

OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT

Table 5-50 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT

Field	Details
Triggering Condition	The Egress adapter failure rate towards the 3rd party application is above the configured threshold of 50% of the total supported MPS
Severity	Critical
Description	Egress external connection failure rate towards 3rd party application is crossing the critical threshold of 50% in the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 50
Alert Details OCI	Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50
OID	1.3.6.1.4.1.323.5.3.53.1.29.5026
Metric Used	ocnadd_egress_failed_request_total, ocnadd_egress_requests_total
Resolution	The alert is cleared automatically when the failure rate towards 3rd party consumers goes below the threshold (50%) alert level of supported MPS
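
The five egress failure rate alerts form a ladder of non-overlapping bands in the OCI rules: 0.1% to below 1%, 1% to below 10%, 10% to below 25%, 25% to below 50%, and 50% or more. The following Python sketch illustrates how a measured failure percentage maps to one alert under those bands. It is an illustration only; the function and constant names are hypothetical, and the CNE rule for the 0.1% alert is written with an upper bound of 10 rather than 1, which this sketch does not reproduce.

```python
from typing import Optional, Tuple

# (lower bound in percent, alert name, severity), highest tier first.
# Bounds and severities are taken from the egress failure rate tables above.
FAILURE_RATE_TIERS = [
    (50.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT", "Critical"),
    (25.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT", "Major"),
    (10.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT", "Minor"),
    (1.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT", "Warn"),
    (0.1, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT", "Info"),
]


def failure_rate_percent(failed_rate: float, total_rate: float) -> float:
    """Failed egress requests as a percentage of all egress requests."""
    if total_rate <= 0:
        raise ValueError("no egress requests in the evaluation window")
    return failed_rate / total_rate * 100


def failure_rate_alert(percent: float) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for the band containing percent, else None."""
    for lower_bound, name, severity in FAILURE_RATE_TIERS:
        if percent >= lower_bound:
            return name, severity
    return None  # below 0.1%: no failure rate alert applies


if __name__ == "__main__":
    for pct in (0.05, 0.5, 5.0, 12.0, 30.0, 75.0):
        print(pct, failure_rate_alert(pct))
```

Checking the tiers from the highest bound downward means each band's upper limit is implied by the tier above it, so only the lower bounds need to be listed.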

OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED

Table 5-51 OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED

Field	Details
Triggering Condition	The total transaction success xDRs rate has dropped below the critical threshold alert level of 90%
Severity	Critical
Description	The total transaction success xDRs rate has dropped below the critical threshold alert level of 90% for the period of 5 min
Alert Details CNE	Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . \| first \| value \| humanizeTimestamp }}{{ "{{" }} end }}: Transaction Success Rate is below 90% per hour:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(ocnadd_total_transactions_total{service=~".correlation.",status="SUCCESS"}[5m])) by (namespace,service) / sum(irate(ocnadd_total_transactions_total{service=~".correlation."}[5m])) by (namespace,service) *100 < 90
Alert Details OCI	Summary: Alarm "OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"corr"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"corr"}.rate().groupBy(workername,app).sum()*100<90", with a trigger delay of 1 minute MQL Expression: ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"corr"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"corr"}.rate().groupBy(workername,app).sum()*100<90
OID	1.3.6.1.4.1.323.5.3.53.1.33.5029
Metric Used	ocnadd_total_transactions_total
Resolution	The alert is cleared automatically when the transaction success rate goes above the critical threshold alert level of 90%
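
The transaction success rule divides the rate of transactions with status SUCCESS by the rate of all transactions and raises the alert when the resulting percentage is below 90. The following Python sketch works through that comparison under stated assumptions: the function names are hypothetical and the input rates are example values, not measurements.

```python
# Threshold from the table above: the alert applies when success is below 90%.
SUCCESS_THRESHOLD_PERCENT = 90.0


def transaction_success_percent(success_rate: float, total_rate: float) -> float:
    """Successful transactions as a percentage of all transactions."""
    if total_rate <= 0:
        raise ValueError("no transactions in the evaluation window")
    return success_rate / total_rate * 100


def success_alert_fires(percent: float,
                        threshold: float = SUCCESS_THRESHOLD_PERCENT) -> bool:
    """True when the success percentage is strictly below the threshold."""
    return percent < threshold


if __name__ == "__main__":
    # Example values only: 850 successful out of 1000 transactions per second.
    pct = transaction_success_percent(success_rate=850.0, total_rate=1000.0)
    print(pct, success_alert_fires(pct))
```

The comparison is strict, so a success rate of exactly 90% does not raise the alert, which matches the `< 90` condition in both expressions.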