7 Alerts
This section provides information on Oracle Communications Network Analytics Data Director (OCNADD) alerts and their configuration.
7.1 Configuring Alerts
This section describes how to configure alerts in OCNADD.
Note:
Here, the label used to update the Prometheus server is role: cnc-alerting-rules, which is added by default in the Helm charts.
Note:
Update the release: prom-operator label with role: cnc-alerting-rules in the ocnadd-alerting-rules.yaml file.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: prom-operator
  name: ocnadd-alerting-rules
  namespace: {{ .Values.global.cluster.nameSpace.name }}
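The alerting rules are delivered through the OCNADD Helm charts by default. If the rendered rules file ever needs to be applied manually, a kubectl apply similar to the following sketch can be used; the file name ocnadd-alerting-rules.yaml and the placeholder <namespace> are assumptions based on the snippet above, not a prescribed procedure.
# Minimal sketch, assuming the rules have been rendered from the Helm template
# into ocnadd-alerting-rules.yaml and <namespace> is the OCNADD namespace
$ kubectl apply -f ocnadd-alerting-rules.yaml -n <namespace>
# Optional check that the PrometheusRule object exists (requires the Prometheus Operator CRDs)
$ kubectl get prometheusrules -n <namespace>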
7.2 Alert Forwarding Using Simple Network Management Protocol (SNMP)
OCNADD forwards the Prometheus alerts as Simple Network Management Protocol (SNMP) traps to the southbound SNMP servers. OCNADD uses two SNMP MIB files to generate the traps. The alert manager configuration is modified by updating the alertmanager.yaml file. In the alertmanager.yaml file, the alerts can be grouped based on podname, alertname, severity, namespace, and so on. The Prometheus alert manager is integrated with Oracle Communications Cloud Native Core, Cloud Native Environment (CNE) snmp-notifier service. The external SNMP servers are set up to receive the Prometheus alerts as SNMP traps. The operator must update the MIB files along with the alert manager file to fetch the SNMP traps in their environment.
Note:
- Only a user with admin privileges can perform the following procedures.
Alert Manager Configuration
- Run the following command to obtain the Alert Manager Secret configuration from the Bastion Host and save it to a file:
$ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -o yaml -n occne-infra > alertmanager-secret-k8s.yaml
Sample output:
apiVersion: v1
data:
  alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA==
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: occne-kube-prom-stack
    meta.helm.sh/release-namespace: occne-infra
  creationTimestamp: "2022-01-24T22:46:34Z"
  labels:
    app: kube-prometheus-stack-alertmanager
    app.kubernetes.io/instance: occne-kube-prom-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 18.0.1
    chart: kube-prometheus-stack-18.0.1
    heritage: Helm
    release: occne-kube-prom-stack
  name: alertmanager-occne-kube-prom-stack-kube-alertmanager
  namespace: occne-infra
  resourceVersion: "5175"
  uid: a38eb420-a4d0-4020-a375-ab87421defde
type: Opaque
- Extract the Alert Manager configuration. In the alertmanager-secret-k8s.yaml file, the alertmanager.yaml key (the third line) holds the Alert Manager configuration encoded in base64 format. To extract the configuration, decode that value by running the following command:
echo 'Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCnJvdXRlOgogIGdyb3VwX2J5OgogIC0gam9iCiAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgZ3JvdXBfd2FpdDogMzBzCiAgcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJvdXRlczoKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbA==' | base64 --decode
Sample output:
global:
  resolve_timeout: 5m
receivers:
- name: default-receiver
  webhook_configs:
  - url: http://occne-snmp-notifier:9464/alerts
route:
  group_by:
  - job
  group_interval: 5m
  group_wait: 30s
  receiver: default-receiver
  repeat_interval: 12h
  routes:
  - match:
      alertname: Watchdog
    receiver: default-receiver
templates:
- /etc/alertmanager/config/*.tmpl
- Update the alertmanager.yaml file. Alerts can be grouped based on the following:
- podname
- alertname
- severity
- namespace
Save the changes to the alertmanager.yaml file.
For example:
route:
  group_by: [podname, alertname, severity, namespace]
  group_interval: 5m
  group_wait: 30s
  receiver: default-receiver
  repeat_interval: 12h
- Encode the updated alertmanager.yaml file by running the following command:
$ cat alertmanager.yaml | base64 -w0
Sample output:
Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogNW0KcmVjZWl2ZXJzOgotIG5hbWU6IGRlZmF1bHQtcmVjZWl2ZXIKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyOjk0NjQvYWxlcnRzCi0gbmFtZTogbmV3LXJlY2VpdmVyLTEKICB3ZWJob29rX2NvbmZpZ3M6CiAgLSB1cmw6IGh0dHA6Ly9vY2NuZS1zbm1wLW5vdGlmaWVyLTE6OTQ2NS9hbGVydHMKcm91dGU6CiAgZ3JvdXBfYnk6CiAgLSBqb2IKICBncm91cF9pbnRlcnZhbDogNW0KICBncm91cF93YWl0OiAzMHMKICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIHJlcGVhdF9pbnRlcnZhbDogMTJoCiAgcm91dGVzOgogIC0gcmVjZWl2ZXI6IGRlZmF1bHQtcmVjZWl2ZXIKICAgIGdyb3VwX3dhaXQ6IDMwcwogICAgZ3JvdXBfaW50ZXJ2YWw6IDVtCiAgICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIC0gcmVjZWl2ZXI6IG5ldy1yZWNlaXZlci0xCiAgICBncm91cF93YWl0OiAzMHMKICAgIGdyb3VwX2ludGVydmFsOiA1bQogICAgcmVwZWF0X2ludGVydmFsOiAxMmgKICAtIG1hdGNoOgogICAgICBhbGVydG5hbWU6IFdhdGNoZG9nCiAgICByZWNlaXZlcjogZGVmYXVsdC1yZWNlaXZlcgogIC0gbWF0Y2g6CiAgICAgIGFsZXJ0bmFtZTogV2F0Y2hkb2cKICAgIHJlY2VpdmVyOiBuZXctcmVjZWl2ZXItMQp0ZW1wbGF0ZXM6Ci0gL2V0Yy9hbGVydG1hbmFnZXIvY29uZmlnLyoudG1wbAo=
- Edit the alertmanager-secret-k8s.yaml file created in step 1. Replace the alertmanager.yaml encoded content with the output generated in the previous step.
For example:
$ vi alertmanager-secret-k8s.yaml

apiVersion: v1
data:
  alertmanager.yaml: <paste here the encoded content of alertmanager.yaml>
kind: Secret
metadata:
  annotations:
    meta.helm.sh/release-name: occne-kube-prom-stack
    meta.helm.sh/release-namespace: occne-infra
  creationTimestamp: "2023-02-16T09:44:58Z"
  labels:
    app: kube-prometheus-stack-alertmanager
    app.kubernetes.io/instance: occne-kube-prom-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 36.2.0
    chart: kube-prometheus-stack-36.2.0
    heritage: Helm
    release: occne-kube-prom-stack
  name: alertmanager-occne-kube-prom-stack-kube-alertmanager
  namespace: occne-infra
  resourceVersion: "8211"
  uid: 9b499b32-6ad2-4754-8691-70665f9daab4
type: Opaque
- Apply the updated secret by running the following command:
$ kubectl apply -f alertmanager-secret-k8s.yaml -n occne-infra
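Optionally, to confirm that the secret now carries the updated configuration, the stored value can be decoded again. This verification step is a sketch and is not part of the original procedure.
$ kubectl get secret alertmanager-occne-kube-prom-stack-kube-alertmanager -n occne-infra \
    -o jsonpath='{.data.alertmanager\.yaml}' | base64 --decode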
Integrate the Alert Manager with snmp-notifier Service
- Update the SNMP client destination in the occne-snmp-notifier service with the SNMP destination client IP.
Note:
For a persistent client configuration, edit the values of the snmp-notifier in the Helm charts and perform a Helm upgrade. Add "warn" to the alert severities to receive warning alerts from OCNADD.
Run the following command:
$ kubectl edit deployment -n occne-infra occne-snmp-notifier
1. Update the field "--snmp.destination=<IP>:<port>" inside the args of the container and add the SNMP client destination IP.
Example:
spec:
  containers:
  - args:
    - --snmp.destination=10.20.30.40:162
2. Add "warn" to the severity list because some of the OCNADD alerts are raised with severity: warn.
Example:
- --alert.severities=critical,major,minor,warning,info,clear,warn
- To verify the SNMP notifications, check the new notifications in the logs of the occne-snmp-notifier pod. Run the following command to view the logs:
$ kubectl logs -n occne-infra <occne-snmp-notifier-pod-name>
Sample output:
10.20.30.50 - - [26/Mar/2023:13:58:14 +0000] "POST /alerts HTTP/1.1" 200 0
10.20.30.60 - - [26/Mar/2023:14:02:51 +0000] "POST /alerts HTTP/1.1" 200 0
10.20.30.70 - - [26/Mar/2023:14:03:14 +0000] "POST /alerts HTTP/1.1" 200 0
10.20.30.80 - - [26/Mar/2023:14:07:51 +0000] "POST /alerts HTTP/1.1" 200 0
10.20.30.90 - - [26/Mar/2023:14:08:14 +0000] "POST /alerts HTTP/1.1" 200 0
OCNADD MIB Files
Two OCNADD MIB files are used to generate the traps. The operator has to update the MIB files and the alert manager file to obtain the traps in their environment. The files are:
- OCNADD-MIB-TC-23.4.0.mib: This is the top-level MIB file, in which the objects and their data types are defined.
- OCNADD-MIB-23.4.0.mib: This file imports the objects from the top-level MIB file; based on the alert notification, the relevant objects are selected for display.
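As a quick sanity check before configuring the SNMP manager, the MIB files can be loaded with the standard net-snmp tools. The following is a sketch that assumes the net-snmp utilities are installed and that both .mib files have been copied to /path/to/ocnadd-mibs; the grep filter assumes the OCNADD object names contain "ocnadd".
# Minimal sketch: load the OCNADD MIB files and print the resulting object tree
$ snmptranslate -M +/path/to/ocnadd-mibs -m ALL -Tp | grep -i -A 2 ocnadd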
Note:
MIB files are packaged along with OCNADD Custom Templates. Download the files from MOS. See Oracle Communications Cloud Native Core Network Analytics Data Director Installation, Upgrade, and Fault Recovery Guide for more information.
7.3 List of Alerts
This section provides detailed information about the alert rules defined for OCNADD.
7.3.1 System Level Alerts
This section lists the system level alerts for OCNADD.
OCNADD_POD_CPU_USAGE_ALERT
Table 7-1 OCNADD_POD_CPU_USAGE_ALERT
Field | Details |
---|---|
Triggering Condition | POD CPU usage is above the set threshold (default 70%) |
Severity | Major |
Description | OCNADD Pod High CPU usage detected for a continuous period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '
Expression: expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*aggregation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*6) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*zookeeper.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*3) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*correlation.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*filter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*configuration.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*admin.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*health.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*alarm.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) or (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ui.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*1) |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.4002 |
Metric Used |
container_cpu_usage_seconds_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert gets cleared when the CPU utilization is below the critical threshold. Note: The threshold is configurable through the cpu_threshold parameter (.Values.global.cluster.cpu_threshold) in the OCNADD Helm values. |
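To inspect the metric behind this alert directly, the underlying PromQL can be evaluated against the Prometheus HTTP API. The following is a sketch that assumes Prometheus is reachable at <prometheus-host>:<port> and uses the aggregation pods as an example; substitute the pod pattern for other services as needed.
# Minimal sketch: per-pod CPU usage of the aggregation pods over the last 5 minutes
$ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
    --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total{image!="",pod=~".*aggregation.*"}[5m])) by (pod,namespace)'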
OCNADD_POD_MEMORY_USAGE_ALERT
Table 7-2 OCNADD_POD_MEMORY_USAGE_ALERT
Field | Details |
---|---|
Triggering Condition | POD Memory usage is above set threshold (default 70%) |
Severity | Major |
Description | OCNADD Pod High Memory usage detected for a continuous period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % ' Expression: expr: (sum(container_memory_working_set_bytes{image!="" , pod=~".*aggregation.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 64*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*zookeeper.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*filter.*"}) by (pod,namespace) > 3*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*correlation.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*configuration.*"}) by (pod,namespace) > 4*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*admin.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*health.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*alarm.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or (sum(container_memory_working_set_bytes{image!="" , pod=~".*ui.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.4005 |
Metric Used |
container_memory_working_set_bytes Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert gets cleared when the memory utilization is below the critical threshold. Note: The threshold is configurable through the memory_threshold parameter (.Values.global.cluster.memory_threshold) in the OCNADD Helm values. |
OCNADD_POD_RESTARTED
Table 7-3 OCNADD_POD_RESTARTED
Field | Details |
---|---|
Triggering Condition | A POD has restarted |
Severity | Minor |
Description | A POD has restarted in the last 2 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'
Expression: expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1 |
OID | 1.3.6.1.4.1.323.5.3.51.29.5006 |
Metric Used |
kube_pod_container_status_restarts_total Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically if the specific pod is up. Steps: 1. Check the application logs. Check for database related failures such as connectivity, Kubernetes secrets, and so on. 2. Run the following command to check orchestration logs for liveness or readiness probe failures: kubectl get po -n <namespace> Note the full name of the pod that is not running, and use it in the following command: kubectl describe pod <desired full pod name> -n <namespace> 3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide". 4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_CORRELATION_SVC_DOWN
Table 7-4 OCNADD_CORRELATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Correlation service is down or not accessible |
Severity | Critical |
Description | OCNADD Correlation service not available for more than 2 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Correlation service is down'Expression: expr: up{service=~".*correlation.*"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.33.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Correlation service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
7.3.2 Application Level Alerts
This section lists the application level alerts for OCNADD.
OCNADD_ADMIN_SVC_DOWN
Table 7-5 OCNADD_ADMIN_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Admin service is down or not accessible. |
Severity | Critical |
Description | OCNADD Admin service not available for more than 2 min |
Alert Details |
Summary: ''namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down'Expression: expr: up{service="ocnaddadminservice"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.30.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Admin service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_ALARM_SVC_DOWN
Table 7-6 OCNADD_ALARM_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The alarm service is down or not accessible. |
Severity | Critical |
Description | OCNADD Alarm service not available for more than 2 min |
Alert Details |
Summary: namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down'
Expression: expr: up{service="ocnaddalarm"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.24.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Alarm service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_CONFIG_SVC_DOWN
Table 7-7 OCNADD_CONFIG_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The configuration service is down or not accessible. |
Severity | Critical |
Description | OCNADD Configuration service not available for more than 2 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down'
Expression: expr: up{service="ocnaddconfiguration"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.20.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Configuration service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_CONSUMER_ADAPTER_SVC_DOWN
Table 7-8 OCNADD_CONSUMER_ADAPTER_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Consumer Adapter service is down or not accessible |
Severity | Critical |
Description | OCNADD Consumer Adapter service not available for more than 2 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down' Expression: expr: up{service=~".*adapter.*"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.25.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Consumer Adapter service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT
Table 7-9 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 0.1% of the total supported MPS |
Severity | Info |
Description | Egress external connection failure rate towards third-party application is crossing info threshold of 0.1% in the period of 5 min |
Alert Details | Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections' Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 0.1 < 10 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5022 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers goes below the threshold (0.1%) alert level of supported MPS. |
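The current egress failure percentage behind this group of alerts can be checked ad hoc with the same ratio used in the alert expression. This is a sketch under the assumption that Prometheus is reachable at <prometheus-host>:<port>.
# Minimal sketch: current egress failure rate (percent) per namespace
$ curl -s 'http://<prometheus-host>:<port>/api/v1/query' \
    --data-urlencode 'query=(sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace)) / (sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) * 100'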
OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT
Table 7-10 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third party application is above the configured threshold of 10% of total supported MPS. |
Severity | Minor |
Description | Egress external connection failure rate towards third-party application is crossing minor threshold of 10% in the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections'Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 10 < 25 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5024 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third party consumer goes below the threshold (10%) alert level of supported MPS. |
OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT
Table 7-11 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 1% of the total supported MPS. |
Severity | Warn |
Description | Egress external connection failure rate towards third-party application is crossing the warning threshold of 1% in the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{"{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections'Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 1 < 10 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5023 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third party consumer goes below the threshold (1%) alert level of supported MPS. |
OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT
Table 7-12 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third party application is above the configured threshold of 25% of total supported MPS. |
Severity | Major |
Description | Egress external connection failure rate towards third-party application is crossing major threshold of 25% in the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections' Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 25 < 50 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5025 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third party consumer goes below the threshold (25%) alert level of supported MPS. |
OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT
Table 7-13 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 50% of total supported MPS. |
Severity | Critical |
Description | Egress external connection failure rate towards third-party application is crossing critical threshold of 50% in the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections' Expression: expr:(sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 50 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5026 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumer goes below the threshold (50%) alert level of supported MPS. |
OCNADD_HEALTH_MONITORING_SVC_DOWN
Table 7-14 OCNADD_HEALTH_MONITORING_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The health monitoring service is down or not accessible. |
Severity | Critical |
Description | OCNADD Health monitoring service not available for more than 2 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down' Expression: expr: up{service="ocnaddhealthmonitoring"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.28.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Health monitoring service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT
Table 7-15 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT
Field | Details |
---|---|
Triggering Condition | The ingress traffic decrease is more than 10% of the supported MPS. |
Severity | Major |
Description | The ingress traffic decrease is more than 10% of the supported MPS in last 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS'Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) <= 0.9 |
OID | 1.3.6.1.4.1.323.5.3.51.29.5013 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the decrease in MPS falls back below 10% of the supported MPS. |
OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT
Table 7-16 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT
Field | Details |
---|---|
Triggering Condition | The ingress traffic increase is more than 10% of the supported MPS. |
Severity | Major |
Description | The ingress traffic increase is more than 10% of the supported MPS in last 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS' Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) >= 1.1 |
OID | 1.3.6.1.4.1.323.5.3.51.29.5013 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the increase in MPS falls back below 10% of the supported MPS. |
OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS
Table 7-17 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS
Field | Details |
---|---|
Triggering Condition | The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS. |
Severity | Critical |
Description | The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% thereshold of Max messages per second:{{ .Values.global.cluster.mps }}' Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5011 |
Metric Used | kafka_stream_task_dropped_records_total |
Resolution | The alert is cleared automatically when the packet drop rate goes below the critical threshold (10%) alert level of supported MPS. |
OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS
Table 7-18 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS
Field | Details |
---|---|
Triggering Condition | The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS. |
Severity | Major |
Description | The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% thereshold of Max messages per second:{{ .Values.global.cluster.mps }}' Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5011 |
Metric Used | kafka_stream_task_dropped_records_total |
Resolution | The alert is cleared automatically when the packet drop rate goes below the major threshold (1%) alert level of supported MPS. |
OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED
Table 7-19 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS. |
Severity | Critical |
Description | Total Ingress Message Rate is above configured critical threshold alert (100%) for the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}' Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5007 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS. |
OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED
Table 7-20 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS. |
Severity | Major |
Description | Total Ingress Message Rate is above configured major threshold alert (95%) for the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5007 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%. |
OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED
Table 7-21 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS. |
Severity | Minor |
Description | Total Ingress Message Rate is above configured minor threshold alert (90%) for the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5007 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%. |
OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED
Table 7-22 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the warning threshold of 80% of the supported MPS. |
Severity | Warn |
Description | Total Ingress Message Rate is above configured warning threshold (80%) for the period of 5 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.51.29.5007 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%. |
OCNADD_NRF_AGGREGATION_SVC_DOWN
Table 7-23 OCNADD_NRF_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The NRF Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD NRF Aggregation service not available for more than 2 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down' Expression: expr: up{service="ocnaddnrfaggregation"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.31.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_SCP_AGGREGATION_SVC_DOWN
Table 7-24 OCNADD_SCP_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The SCP Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD SCP Aggregation service not available for more than 2 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down'Expression: expr: up{service="ocnaddscpaggregation"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.51.22.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_SEPP_AGGREGATION_SVC_DOWN
Table 7-25 OCNADD_SEPP_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The SEPP Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD SEPP Aggregation service not available for more than 2 min. |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddseppaggregation service is down' Expression: expr: up{service="ocnaddseppaggregation"} != 1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.32.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD SEPP Aggregation service becomes available again. Steps: 1. Check for service-specific alerts that may be causing issues with service exposure. 2. Run the following command to check if the pod's status is in the "Running" state: kubectl -n <namespace> get pod If it is not in the running state, capture the pod logs and events. Run the following command to fetch the events: kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace> 3. Refer to the application logs and check for database related failures such as connectivity, invalid secrets, and so on. 4. Run the following command to check the Helm status and make sure there are no errors: helm status <helm release name of data director> -n <namespace> If it is not in "STATUS: DEPLOYED", capture logs and events again. 5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support, if guidance is required. |
OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED
Table 7-26 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS |
Severity | Warn |
Description | The total Egress Message Rate is above the configured warning threshold alert (80%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5011 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of supported MPS |
OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED
Table 7-27 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS |
Severity | Minor |
Description | The total Egress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum(irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5012 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of supported MPS |
OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED
Table 7-28 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total egress MPS crossed the major threshold alert level of 95% of the supported MPS |
Severity | Major |
Description | The total Egress Message Rate is above the configured major threshold alert (95%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5013 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of supported MPS |
OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED
Table 7-29 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS |
Severity | Critical |
Description | The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5014 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS |
OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER
Table 7-30 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER
Field | Description |
---|---|
Triggering Condition | The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer |
Severity | Critical |
Description | The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'Expression: expr: sum (rate(ocnadd_egress_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5015 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS |
OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED
Table 7-31 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total observed latency is above the configured warning threshold alert level of 80% |
Severity | Warn |
Description | Average E2E Latency is above the configured warning threshold alert level (80%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5016 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of permissible latency |
OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED
Table 7-32 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total observed latency is above the configured minor threshold alert level of 90% |
Severity | Minor |
Description | Average E2E Latency is above the configured minor threshold alert level (90%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5017 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of permissible latency |
OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED
Table 7-33 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total observed latency is above the configured major threshold alert level of 95% |
Severity | Major |
Description | Average E2E Latency is above the configured major threshold alert level (95%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5018 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of permissible latency |
OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED
Table 7-34 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED
Field | Description |
---|---|
Triggering Condition | The total observed latency is above the configured critical threshold alert level of 100% |
Severity | Critical |
Description | Average E2E Latency is above the configured critical threshold alert level (100%) for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5019 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the critical threshold alert level (100%) of the permissible latency |
OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED
Table 7-35 OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED
Field | Description |
---|---|
Triggering Condition | The total transaction success xDRs rate has dropped below the critical threshold alert level of 90% |
Severity | Critical |
Description | The total transaction success xDRs rate has dropped below the critical threshold alert level of 90% for the period of 5 min |
Alert Details |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }}$labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Transaction Success Rate is below 90% per hour:{{ .Values.global.cluster.mps }}'Expression: expr: sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*",status="SUCCESS"}[5m]))by (namespace,service) / sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*"}[5m]))by (namespace,service) *100 < 90 |
OID | 1.3.6.1.4.1.323.5.3.53.1.33.5029 |
Metric Used | ocnadd_total_transactions_total |
Resolution | The alert is cleared automatically when the transaction success rate goes above the critical threshold alert level of 90% |