6 Alerts

This section provides information on Oracle Communications Network Analytics Data Director (OCNADD) alerts and their configuration.

Configuring Alerts

This section describes how to configure alerts in OCNADD.

If OCNADD is installed in the OCCNE setup, Prometheus monitors all services by default, so no changes to the Helm chart are required. All Prometheus alert rules present in the Helm chart are loaded into the Prometheus server.

Note:

Here, the label used to update the Prometheus server is role: cnc-alerting-rules, which is added by default in the Helm charts.

If OCNADD is installed in the TANZU setup, one of the files in the Helm charts must be modified with the following parameters.

Note:

In the ocnadd-alerting-rules.yaml file, replace the role: cnc-alerting-rules label with release: prom-operator, as shown in the following snippet:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    release: prom-operator
  name: ocnadd-alerting-rules
  namespace: {{ .Values.global.cluster.nameSpace.name }}

List of Alerts

This section provides detailed information about the alert rules defined for OCNADD.

System Level Alerts

This section lists the system level alerts for OCNADD.

OCNADD_POD_CPU_USAGE_ALERT

Table 6-1 OCNADD_POD_CPU_USAGE_ALERT

Field Details
Triggering Condition Pod CPU usage is above the set threshold (default 70%)
Severity Major
Description OCNADD Pod High CPU usage detected for a continuous period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: CPU usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.cpu_threshold }} % '

Expression:

expr: (sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*ocnadd.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*kafka.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*zookeeper.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*egw.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2) or

(sum(rate(container_cpu_usage_seconds_total{image!="" , pod=~".*adapter.*"}[5m])) by (pod,namespace) > {{ .Values.global.cluster.cpu_threshold }}*2)

OID 1.3.6.1.4.1.323.5.3.51.29.4002
Metric Used

container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the CPU utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd/values.yaml file. If guidance is required, contact My Oracle Support (MOS).

OCNADD_POD_MEMORY_USAGE_ALERT

Table 6-2 OCNADD_POD_MEMORY_USAGE_ALERT

Field Details
Triggering Condition Pod memory usage is above the set threshold (default 70%)
Severity Major
Description OCNADD Pod High Memory usage detected for a continuous period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Memory usage is {{ "{{" }} $value | printf "%.2f" }} which is above threshold {{ .Values.global.cluster.memory_threshold }} % '

Expression:

(sum(container_memory_working_set_bytes{image!="" , pod=~".*ocnadd.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*kafka.*"}) by (pod,namespace) > 24*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*zookeeper.*"}) by (pod,namespace) > 1*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*egw.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100) or

(sum(container_memory_working_set_bytes{image!="" , pod=~".*adapter.*"}) by (pod,namespace) > 2*1024*1024*1024*{{ .Values.global.cluster.memory_threshold }}/100)

OID 1.3.6.1.4.1.323.5.3.51.29.4005
Metric Used

container_memory_working_set_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert gets cleared when the memory utilization is below the critical threshold.

Note: The threshold is configurable in the ocnadd/values.yaml file. If guidance is required, contact My Oracle Support (MOS).
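As a rough illustration of how the memory expression above turns the percentage threshold into a byte limit, the following Python sketch reproduces the arithmetic for each pod group. The per-group sizes (2, 24, 1, 2, and 2 GiB) are read from the expression and 70 is the documented default threshold; the function and variable names are hypothetical and are not part of OCNADD.

```python
# Per-pod-group memory sizes in GiB, as they appear in the alert expression.
POD_GROUP_MEMORY_GIB = {
    "ocnadd": 2,
    "kafka": 24,
    "zookeeper": 1,
    "egw": 2,
    "adapter": 2,
}

GIB = 1024 * 1024 * 1024


def memory_alert_limit_bytes(pod_group, memory_threshold=70):
    """Return the working-set size in bytes above which the alert condition holds.

    Mirrors the expression: <GiB> * 1024 * 1024 * 1024 * memory_threshold / 100
    """
    return POD_GROUP_MEMORY_GIB[pod_group] * GIB * memory_threshold / 100


if __name__ == "__main__":
    for group in POD_GROUP_MEMORY_GIB:
        limit = memory_alert_limit_bytes(group)
        print(f"{group}: {limit / GIB:.2f} GiB ({limit:.0f} bytes)")
```

With the default threshold, a Kafka pod would therefore meet the condition once its working set exceeds 16.8 GiB, and a ZooKeeper pod once it exceeds 0.7 GiB.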

OCNADD_POD_RESTARTED

Table 6-3 OCNADD_POD_RESTARTED

Field Details
Triggering Condition A POD has restarted
Severity Minor
Description A pod has restarted in the last 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: A Pod has restarted'

Expression:

expr: kube_pod_container_status_restarts_total{namespace="{{ .Values.global.cluster.nameSpace.name }}"} > 1

OID 1.3.6.1.4.1.323.5.3.51.29.5006
Metric Used

kube_pod_container_status_restarts_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically if the specific pod is up.

Steps:

1. Check the application logs. Check for database related failures such as connectivity, Kubernetes secrets, and so on.

2. Run the following command to check orchestration logs for liveness or readiness probe failures:

kubectl get po -n <namespace>

Note the full name of the pod that is not running, and use it in the following command:

kubectl describe pod <desired full pod name> -n <namespace>

3. Check the database status. For more information, see "Oracle Communications Cloud Native Core DBTier User Guide".

4. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

Application Level Alerts

This section lists the application level alerts for OCNADD.

OCNADD_CONFIG_SVC_DOWN

Table 6-4 OCNADD_CONFIG_SVC_DOWN

Field Details
Triggering Condition The configuration service is down or not accessible
Severity Critical
Description OCNADD Configuration service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down'

Expression:

expr: up{service="ocnaddconfiguration"} != 1

OID 1.3.6.1.4.1.323.5.3.51.20.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Configuration service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_ALARM_SVC_DOWN

Table 6-5 OCNADD_ALARM_SVC_DOWN

Field Details
Triggering Condition The alarm service is down or not accessible
Severity Critical
Description OCNADD Alarm service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down'

Expression:

expr: up{service="ocnaddalarm"} != 1

OID 1.3.6.1.4.1.323.5.3.51.24.2002
Metric Used

'up'

Note:

This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Alarm service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_HEALTH_MONITORING_SVC_DOWN

Table 6-6 OCNADD_HEALTH_MONITORING_SVC_DOWN

Field Details
Triggering Condition The health monitoring service is down or not accessible
Severity Critical
Description OCNADD Health monitoring service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down'

Expression:

expr: up{service="ocnaddhealthmonitoring"} != 1

OID 1.3.6.1.4.1.323.5.3.51.28.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Health monitoring service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_SCP_AGGREGATION_SVC_DOWN

Table 6-7 OCNADD_SCP_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The SCP Aggregation service is down or not accessible
Severity Critical
Description OCNADD SCP Aggregation service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down'

Expression:

expr: up{service="ocnaddscpaggregation"} != 1

OID 1.3.6.1.4.1.323.5.3.51.22.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_NRF_AGGREGATION_SVC_DOWN

Table 6-8 OCNADD_NRF_AGGREGATION_SVC_DOWN

Field Details
Triggering Condition The NRF Aggregation service is down or not accessible
Severity Critical
Description OCNADD NRF Aggregation service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down'

Expression:

expr: up{service="ocnaddnrfaggregation"} != 1

OID 1.3.6.1.4.1.323.5.3.51.22.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_ADMIN_SVC_DOWN

Table 6-9 OCNADD_ADMIN_SVC_DOWN

Field Details
Triggering Condition The OCNADD Admin service is down or not accessible
Severity Critical
Description OCNADD Admin service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down'

Expression:

expr: up{service="ocnaddadminservice"} != 1

OID 1.3.6.1.4.1.323.5.3.51.30.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Admin service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_EGW_SVC_DOWN

Table 6-10 OCNADD_EGW_SVC_DOWN

Field Details
Triggering Condition The OCNADD Egress Gateway service is down or not accessible
Severity Critical
Description OCNADD Egress Gateway service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress Gw service is down'

Expression:

expr: up{service=~".*egw.*"} != 1

OID 1.3.6.1.4.1.323.5.3.51.23.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Egress Gateway service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Table 6-11 OCNADD_CONSUMER_ADAPTER_SVC_DOWN

Field Details
Triggering Condition The OCNADD Consumer Adapter service is down or not accessible
Severity Critical
Description OCNADD Consumer Adapter service not available for more than 2 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down'

Expression:

expr: up{service=~".*adapter.*"} != 1

OID 1.3.6.1.4.1.323.5.3.51.25.2002
Metric Used

'up'

Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use the similar metric as exposed by the monitoring system.

Resolution

The alert is cleared automatically when the OCNADD Consumer Adapter service becomes available.

Steps:

1. Check for service-specific alerts that may be causing issues with service exposure.

2. Run the following command to check whether the pod is in the “Running” state:

kubectl -n <namespace> get pod

If the pod is not in the Running state, capture the pod logs and events. Run the following command to fetch the events:

kubectl get events --sort-by=.metadata.creationTimestamp -n <namespace>

3. Refer to the application logs and check for database-related failures such as connectivity, invalid secrets, and so on.

4. Run the following command to check the Helm status and make sure there are no errors:

helm status <helm release name of data director> -n <namespace>

If the status is not “STATUS: DEPLOYED”, capture the logs and events again.

5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support (MOS) if guidance is required.

OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Table 6-12 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the warning threshold of 80% of the supported MPS
Severity Warn
Description Total Ingress Message Rate is above configured warning threshold (80%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_processor_node_process_total[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%.

OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Table 6-13 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description Total Ingress Message Rate is above configured minor threshold alert (90%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_processor_node_process_total[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%.

OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Table 6-14 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description Total Ingress Message Rate is above configured major threshold alert (95%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_processor_node_process_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%.

OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Table 6-15 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Ingress Message Rate is above configured critical threshold alert (100%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_processor_node_process_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5007
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS.

OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Table 6-16 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS
Severity Warn
Description Total Egress Message Rate is above configured warning threshold alert (80%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum (rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5008
Metric Used ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of the supported MPS.

OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Table 6-17 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS
Severity Minor
Description Total Egress Message Rate is above configured minor threshold alert (90%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum (rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5008
Metric Used ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of the supported MPS.

OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Table 6-18 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the major threshold alert level of 95% of the supported MPS
Severity Major
Description Total Egress Message Rate is above configured major threshold alert (95%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum (rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5008
Metric Used ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of the supported MPS.

OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Table 6-19 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS
Severity Critical
Description Total Egress Message Rate is above configured critical threshold alert (100%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum (rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5008
Metric Used ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS.
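The ingress and egress MPS expressions above each compare the summed message rate with a fixed fraction of the supported MPS and have no upper bound, so, reading the expressions as written, a rate that satisfies a higher-severity condition also satisfies every lower one. The Python sketch below illustrates that comparison only; the function name and sample values are hypothetical, and how the alerting pipeline deduplicates or inhibits overlapping alerts is not covered here.

```python
# Threshold fractions of the supported MPS, as used in the alert expressions.
MPS_THRESHOLDS = {
    "warning": 0.80,
    "minor": 0.90,
    "major": 0.95,
    "critical": 1.00,
}


def mps_conditions_met(observed_mps, supported_mps):
    """List every severity whose expression (rate > fraction * mps) is true."""
    return [
        severity
        for severity, fraction in MPS_THRESHOLDS.items()
        if observed_mps > fraction * supported_mps
    ]


if __name__ == "__main__":
    # Hypothetical supported MPS of 1000 (global.cluster.mps).
    for rate in (500, 920, 960, 1200):
        print(rate, mps_conditions_met(rate, 1000))
```

For example, with a supported MPS of 1000, an observed rate of 920 satisfies both the warning and the minor conditions.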

OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Table 6-20 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER

Field Details
Triggering Condition The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer
Severity Critical
Description Total Egress Message Rate is above configured critical threshold alert (100%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum (rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5009
Metric Used ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of supported MPS
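
The two critical egress expressions differ only in their grouping: by (namespace) for the total and by (namespace, instance_identifier) for a single consumer. The following Python sketch illustrates how that grouping changes which series cross the threshold. The function name breaches, the sample rates, and the consumer names are hypothetical and are not part of OCNADD; the mps argument stands in for the global.cluster.mps Helm value.

```python
from collections import defaultdict


def breaches(rates, mps, per_consumer=False):
    """Return the groups whose summed rate exceeds 1.0 * mps.

    rates is a list of (namespace, instance_identifier, msgs_per_sec)
    tuples, standing in for the per-series output of rate(...[5m]).
    """
    totals = defaultdict(float)
    for namespace, consumer, rate in rates:
        # Mirror "by (namespace)" or "by (namespace, instance_identifier)".
        key = (namespace, consumer) if per_consumer else (namespace,)
        totals[key] += rate
    return {key: total for key, total in totals.items() if total > 1.0 * mps}


# Hypothetical sample: two series for consumer-a, one for consumer-b.
sample = [
    ("ocnadd", "consumer-a", 700.0),
    ("ocnadd", "consumer-a", 400.0),
    ("ocnadd", "consumer-b", 300.0),
]

total_alerts = breaches(sample, mps=1000)
consumer_alerts = breaches(sample, mps=1000, per_consumer=True)
print(total_alerts)
print(consumer_alerts)
```

With this sample, the namespace total and consumer-a both exceed the assumed limit of 1000 MPS, while consumer-b does not.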
OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Table 6-21 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured warning threshold alert level of 80%
Severity Warn
Description Average E2E Latency is above configured warning threshold alert level (80%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

Expression:

expr: (sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }}

OID 1.3.6.1.4.1.323.5.3.51.29.5010
Metric Used ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum, ocnadd_egressgateway_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of permissible latency
OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Table 6-22 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured minor threshold alert level of 90%
Severity Minor
Description Average E2E Latency is above configured minor threshold alert level (90%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

Expression:

expr: (sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }}

OID 1.3.6.1.4.1.323.5.3.51.29.5010
Metric Used ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum, ocnadd_egressgateway_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of permissible latency
OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Table 6-23 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured major threshold alert level of 95%
Severity Major
Description Average E2E Latency is above configured major threshold alert level (95%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

Expression:

expr: (sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }}

OID 1.3.6.1.4.1.323.5.3.51.29.5010
Metric Used ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum, ocnadd_egressgateway_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of permissible latency
OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Table 6-24 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED

Field Details
Triggering Condition The total observed latency is above the configured critical threshold alert level of 100%
Severity Critical
Description Average E2E Latency is above configured critical threshold alert level (100%) for the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms'

Expression:

expr: (sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egressgateway_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }}

OID 1.3.6.1.4.1.323.5.3.51.29.5010
Metric Used ocnadd_egressgateway_e2e_request_processing_latency_seconds_sum, ocnadd_egressgateway_e2e_request_processing_latency_seconds_count
Resolution The alert is cleared automatically when the average latency goes below the critical threshold alert level of 100% of permissible latency
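
Read together, the four latency expressions split the average E2E latency into non-overlapping bands: above 80% up to 90% (warning), above 90% up to 95% (minor), above 95% up to 100% (major), and above 100% (critical) of max_latency. The Python sketch below shows that banding. The function name latency_severity and the sample values are hypothetical, and both arguments are assumed to be in the same unit.

```python
def latency_severity(avg_latency, max_latency):
    """Map an average latency to the severity band implied by the expressions.

    Each check mirrors the lower bound of one alert; testing the highest
    band first reproduces the "<=" upper bounds of the lower bands.
    """
    if avg_latency > 1.0 * max_latency:
        return "critical"
    if avg_latency > 0.95 * max_latency:
        return "major"
    if avg_latency > 0.90 * max_latency:
        return "minor"
    if avg_latency > 0.80 * max_latency:
        return "warning"
    return None  # below every threshold, no alert


# Hypothetical readings against an assumed max_latency of 100.
for reading in (50, 85, 92, 97, 120):
    print(reading, latency_severity(reading, 100))
```

Because the bands do not overlap, at most one of the four latency alerts should be active for a namespace at any given time.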
OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Table 6-25 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS
Severity Major
Description The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% thereshold of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5011
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the major threshold (1%) alert level of supported MPS
OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Table 6-26 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS

Field Details
Triggering Condition The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS
Severity Critical
Description The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% thereshold of Max messages per second:{{ .Values.global.cluster.mps }}'

Expression:

expr: sum(rate(kafka_stream_task_dropped_records_total[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }}

OID 1.3.6.1.4.1.323.5.3.51.29.5011
Metric Used kafka_stream_task_dropped_records_total
Resolution The alert is cleared automatically when the packet drop rate goes below the critical threshold (10%) alert level of supported MPS
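
Unlike the latency alerts, the two packet drop expressions have no upper bound, so a drop rate above 10% of the supported MPS satisfies both of them. The Python sketch below illustrates this. The function name packet_drop_alerts and the sample rates are hypothetical; mps stands in for the global.cluster.mps Helm value.

```python
def packet_drop_alerts(drop_rate, mps):
    """Return the packet drop alerts whose expression a drop rate satisfies.

    drop_rate stands in for the namespace sum of
    rate(kafka_stream_task_dropped_records_total[5m]).
    """
    alerts = []
    if drop_rate > 0.01 * mps:
        alerts.append("OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS")
    if drop_rate > 0.1 * mps:
        alerts.append("OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS")
    return alerts


# Hypothetical drop rates against an assumed supported rate of 1000 MPS.
for rate in (5, 50, 150):
    print(rate, packet_drop_alerts(rate, 1000))
```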
OCNADD_EGW_FAILURE_RATE_THRESHOLD_0.1PERCENT

Table 6-27 OCNADD_EGW_FAILURE_RATE_THRESHOLD_0.1PERCENT

Field Details
Triggering Condition The Egress gateway failure rate towards the 3rd party application is above the configured threshold of 0.1% of total supported MPS
Severity Info
Description Egress external connection failure rate towards 3rd party application is crossing info threshold of 0.1% in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections'

Expression:

expr: (sum(rate(ocnadd_egressgateway_third_party_connection_failure_total[5m])) by (namespace))/(sum(rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace)) *100 >= 0.1 < 10

OID 1.3.6.1.4.1.323.5.3.51.29.5012
Metric Used ocnadd_egressgateway_third_party_connection_failure_total, ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the failure rate towards 3rd party consumer goes below the threshold (0.1%) alert level of supported MPS
OCNADD_EGW_FAILURE_RATE_THRESHOLD_1PERCENT

Table 6-28 OCNADD_EGW_FAILURE_RATE_THRESHOLD_1PERCENT

Field Details
Triggering Condition The Egress gateway failure rate towards the 3rd party application is above the configured threshold of 1% of total supported MPS
Severity Warn
Description Egress external connection failure rate towards 3rd party application is crossing warning threshold of 1% in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections'

Expression:

expr: (sum(rate(ocnadd_egressgateway_third_party_connection_failure_total[5m])) by (namespace))/(sum(rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace)) *100 >= 1 < 10

OID 1.3.6.1.4.1.323.5.3.51.29.5012
Metric Used ocnadd_egressgateway_third_party_connection_failure_total, ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the failure rate towards 3rd party consumer goes below the threshold (1%) alert level of supported MPS
OCNADD_EGW_FAILURE_RATE_THRESHOLD_10PERCENT

Table 6-29 OCNADD_EGW_FAILURE_RATE_THRESHOLD_10PERCENT

Field Details
Triggering Condition The Egress gateway failure rate towards the 3rd party application is above the configured threshold of 10% of total supported MPS
Severity Minor
Description Egress external connection failure rate towards 3rd party application is crossing minor threshold of 10% in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections'

Expression:

expr: (sum(rate(ocnadd_egressgateway_third_party_connection_failure_total[5m])) by (namespace))/(sum(rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace)) *100 >= 10 < 25

OID 1.3.6.1.4.1.323.5.3.51.29.5012
Metric Used ocnadd_egressgateway_third_party_connection_failure_total, ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the failure rate towards 3rd party consumer goes below the threshold (10%) alert level of supported MPS
OCNADD_EGW_FAILURE_RATE_THRESHOLD_25PERCENT

Table 6-30 OCNADD_EGW_FAILURE_RATE_THRESHOLD_25PERCENT

Field Details
Triggering Condition The Egress gateway failure rate towards the 3rd party application is above the configured threshold of 25% of total supported MPS
Severity Major
Description Egress external connection failure rate towards 3rd party application is crossing major threshold of 25% in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections'

Expression:

expr: (sum(rate(ocnadd_egressgateway_third_party_connection_failure_total[5m])) by (namespace))/(sum(rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace)) *100 >= 25 < 50

OID 1.3.6.1.4.1.323.5.3.51.29.5012
Metric Used ocnadd_egressgateway_third_party_connection_failure_total, ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the failure rate towards 3rd party consumer goes below the threshold (25%) alert level of supported MPS
OCNADD_EGW_FAILURE_RATE_THRESHOLD_50PERCENT

Table 6-31 OCNADD_EGW_FAILURE_RATE_THRESHOLD_50PERCENT

Field Details
Triggering Condition The Egress gateway failure rate towards the 3rd party application is above the configured threshold of 50% of total supported MPS
Severity Critical
Description Egress external connection failure rate towards 3rd party application is crossing critical threshold of 50% in the period of 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections'

Expression:

expr: (sum(rate(ocnadd_egressgateway_third_party_connection_failure_total[5m])) by (namespace))/(sum(rate(ocnadd_egressgateway_http_requests_total[5m])) by (namespace)) *100 >= 50

OID 1.3.6.1.4.1.323.5.3.51.29.5012
Metric Used ocnadd_egressgateway_third_party_connection_failure_total, ocnadd_egressgateway_http_requests_total
Resolution The alert is cleared automatically when the failure rate towards 3rd party consumer goes below the threshold (50%) alert level of supported MPS
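
The five egress gateway failure rate expressions compute the same percentage (failures divided by HTTP requests, multiplied by 100) and differ only in their bounds. The Python sketch below evaluates those bounds exactly as written in the expressions. As written, the 0.1% and 1% bands both end at 10, so a failure rate between 1% and 10% satisfies both. The names BANDS and firing_alerts and the sample rates are hypothetical.

```python
# (alert name, inclusive lower bound, exclusive upper bound or None),
# copied from the comparison operators in the expressions.
BANDS = [
    ("OCNADD_EGW_FAILURE_RATE_THRESHOLD_0.1PERCENT", 0.1, 10),
    ("OCNADD_EGW_FAILURE_RATE_THRESHOLD_1PERCENT", 1, 10),
    ("OCNADD_EGW_FAILURE_RATE_THRESHOLD_10PERCENT", 10, 25),
    ("OCNADD_EGW_FAILURE_RATE_THRESHOLD_25PERCENT", 25, 50),
    ("OCNADD_EGW_FAILURE_RATE_THRESHOLD_50PERCENT", 50, None),
]


def firing_alerts(failures_per_sec, requests_per_sec):
    """Return the alerts whose band contains the failure percentage.

    This sketch assumes a positive request rate and does not model how
    Prometheus treats division by zero.
    """
    if requests_per_sec <= 0:
        raise ValueError("requests_per_sec must be positive in this sketch")
    pct = failures_per_sec / requests_per_sec * 100
    return [
        name
        for name, low, high in BANDS
        if pct >= low and (high is None or pct < high)
    ]


# Hypothetical failure rates against 1000 requests per second.
for failures in (5, 50, 300, 600):
    print(failures, firing_alerts(failures, 1000))
```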
OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Table 6-32 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic increase is more than 10% of the supported MPS
Severity Major
Description The ingress traffic increase is more than 10% of the supported MPS in last 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS'

Expression:

expr: sum(irate(kafka_stream_processor_node_process_total[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total[5m] offset 5m)) by (namespace) >= 1.1

OID 1.3.6.1.4.1.323.5.3.51.29.5013
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the increase in MPS comes back to lower than 10% of the supported MPS
OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Table 6-33 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT

Field Details
Triggering Condition The ingress traffic decrease is more than 10% of the supported MPS
Severity Major
Description The ingress traffic decrease is more than 10% of the supported MPS in last 5 min
Alert Details

Summary:

'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS'

Expression:

expr: sum(irate(kafka_stream_processor_node_process_total[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total[5m] offset 5m)) by (namespace) <= 0.9

OID 1.3.6.1.4.1.323.5.3.51.29.5013
Metric Used kafka_stream_processor_node_process_total
Resolution The alert is cleared automatically when the decrease in MPS comes back to lower than 10% of the supported MPS
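
As written, the two ingress spike expressions divide the current processing rate by the rate observed 5 minutes earlier (offset 5m) and compare the ratio with 1.1 and 0.9. The Python sketch below illustrates that ratio test. The function name traffic_spike and the sample rates are hypothetical, and the sketch returns no result when the earlier rate is not positive rather than modelling how Prometheus handles that case.

```python
def traffic_spike(current_rate, previous_rate):
    """Classify the change between two rate readings taken 5 minutes apart.

    current_rate and previous_rate stand in for the namespace sums of
    irate(kafka_stream_processor_node_process_total[5m]) with and
    without "offset 5m".
    """
    if previous_rate <= 0:
        return None  # ratio undefined in this sketch
    ratio = current_rate / previous_rate
    if ratio >= 1.1:
        return "OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT"
    if ratio <= 0.9:
        return "OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT"
    return None


# Hypothetical readings compared with an earlier rate of 1000 MPS.
for current in (1200, 1050, 800):
    print(current, traffic_spike(current, 1000))
```

Note that irate() uses only the two most recent samples in the range, so this ratio reacts quickly to short bursts.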