5.2 Application Level Alerts
This section lists the application level alerts for OCNADD.
Table 5-4 OCNADD_CONFIG_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The configuration service is down or not accessible |
Severity | Critical |
Description | OCNADD Configuration service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddconfiguration service is down' PromQL Expression: expr: up{service="ocnaddconfiguration"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_CONFIG_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddconfiguration"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.20.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Configuration service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
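The resolution steps for the service-down alerts can be gathered into a small helper. The following is a minimal sketch, not an Oracle-provided script: the function names `run` and `collect_diagnostics` and the `DRY_RUN` switch are hypothetical, and the `kubectl get events` line is an assumption, since the steps above do not show the events command.

```shell
#!/usr/bin/env bash
# Sketch only: the namespace and Helm release name are supplied by the caller.
# Set DRY_RUN=1 to print the commands instead of running them.

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

collect_diagnostics() {
  local namespace="$1" release="$2"
  # Step 2: list the pods and their status in the namespace.
  run kubectl -n "$namespace" get pod
  # Assumption: one common way to list events, oldest first.
  run kubectl -n "$namespace" get events --sort-by=.metadata.creationTimestamp
  # Step 4: check the status of the Helm release.
  run helm status "$release" -n "$namespace"
}
```

Calling `DRY_RUN=1 collect_diagnostics <namespace> <helm release name of data director>` prints the three commands without contacting the cluster, which is a convenient way to review them before a real run.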
Table 5-5 OCNADD_ALARM_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The alarm service is down or not accessible |
Severity | Critical |
Description | OCNADD Alarm service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddalarm service is down' PromQL Expression: expr: up{service="ocnaddalarm"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_ALARM_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddalarm"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.24.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Alarm service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
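To confirm what a service-down alert is reporting on CNE, the `up` expression from the alert can be evaluated directly against the Prometheus HTTP API. This is a sketch under assumptions: `PROM_URL` is a placeholder for the Prometheus base URL in your deployment, and the helper names `up_query` and `query_prometheus` are hypothetical.

```shell
# Build the PromQL expression used by the *_SVC_DOWN alerts for one service.
up_query() {
  printf 'up{service="%s"} != 1' "$1"
}

# Run an instant query against the Prometheus HTTP API (/api/v1/query).
# PROM_URL is a placeholder, for example http://localhost:9090.
query_prometheus() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `query_prometheus "$(up_query ocnaddalarm)"` returns the series for which `up` is not 1. An empty result list suggests that no scraped target of that service is currently failing, although a target that has disappeared entirely also produces no series.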
Table 5-6 OCNADD_HEALTH_MONITORING_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The health monitoring service is down or not accessible |
Severity | Critical |
Description | OCNADD Health monitoring service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddhealthmonitoring service is down' PromQL Expression: expr: up{service="ocnaddhealthmonitoring"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_HEALTH_MONITORING_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddhealthmonitoring"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.28.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Health monitoring service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-7 OCNADD_SCP_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The SCP Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD SCP Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddscpaggregation service is down' PromQL Expression: expr: up{service="ocnaddscpaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_SCP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddscpaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.22.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD SCP Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-8 OCNADD_NRF_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The NRF Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD NRF Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnrfaggregation service is down' PromQL Expression: expr: up{service="ocnaddnrfaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_NRF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddnrfaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.31.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD NRF Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-9 OCNADD_SEPP_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The SEPP Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD SEPP Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddseppaggregation service is down' PromQL Expression: expr: up{service="ocnaddseppaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_SEPP_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddseppaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.32.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD SEPP Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-10 OCNADD_BSF_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The BSF Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD BSF Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddbsfaggregation service is down' PromQL Expression: expr: up{service="ocnaddbsfaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_BSF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddbsfaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.40.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD BSF Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-11 OCNADD_PCF_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The PCF Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD PCF Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddpcfaggregation service is down' PromQL Expression: expr: up{service="ocnaddpcfaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_PCF_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddpcfaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.41.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD PCF Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support if guidance is required. |
Table 5-12 OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The Non Oracle Aggregation service is down or not accessible |
Severity | Critical |
Description | OCNADD Non Oracle Aggregation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddnonoracleaggregation service is down' PromQL Expression: expr: up{service="ocnaddnonoracleaggregation"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_NON_ORACLE_AGGREGATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ocnaddnonoracleaggregation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.37.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Non Oracle Aggregation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-13 OCNADD_ADMIN_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Admin service is down or not accessible |
Severity | Critical |
Description | OCNADD Admin service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: ocnaddadminservice service is down' PromQL Expression: expr: up{service="ocnaddadminservice"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_ADMIN_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner=" MQL Expression: podStatus[10m]{podOwner="ocnaddadminservice"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.30.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Admin service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-14 OCNADD_CONSUMER_ADAPTER_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Consumer Adapter service is down or not accessible |
Severity | Critical |
Description | OCNADD Consumer Adapter service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Consumer Adapter service is down' PromQL Expression: expr: up{service=~".*adapter.*", role="adapter"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_CONSUMER_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner=~"adapter*"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner=~"adapter*"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.25.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Consumer Adapter service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
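The Consumer Adapter expression selects services with a regular expression, `service=~".*adapter.*"`, rather than an exact name. PromQL regex matchers are fully anchored, so the leading and trailing `.*` are what allow a match anywhere in the label value. The sketch below emulates that behavior with `grep -x`; the function name `matches_regex` and the service names in the loop are hypothetical and serve only as an illustration.

```shell
# Emulate PromQL's fully anchored regex matching with grep -x (whole-line match).
# $1: label value, $2: regular expression
matches_regex() {
  printf '%s\n' "$1" | grep -Eqx -- "$2"
}

# Hypothetical service names, used only for illustration.
for svc in example-adapter example-storage-adapter ocnaddfilter; do
  if matches_regex "$svc" '.*adapter.*'; then
    echo "$svc: matches"
  else
    echo "$svc: no match"
  fi
done
```

Both names that contain "adapter" match, including the storage adapter style name. This is presumably why the alert expression also carries the `role="adapter"` label, which narrows the selection to Consumer Adapter instances.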
Table 5-15 OCNADD_FILTER_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Filter service is down or not accessible |
Severity | Critical |
Description | OCNADD Filter service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Filter service is down' PromQL Expression: expr: up{service=~".*filter.*"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_FILTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ocnaddfilter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner=" |
OID | 1.3.6.1.4.1.323.5.3.53.1.34.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Filter service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-16 OCNADD_CORRELATION_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Correlation service is down or not accessible |
Severity | Critical |
Description | OCNADD Correlation service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Correlation service is down' PromQL Expression: expr: up{service=~".*correlation.*"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_CORRELATION_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="correlation"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="correlation"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.33.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Correlation service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-17 OCNADD_EXPORT_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Export service is down or not accessible |
Severity | Critical |
Description | OCNADD Export service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Export service is down' PromQL Expression: expr: up{service=~".*export.*"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_EXPORT_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="export"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="export"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.39.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Export service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-18 OCNADD_STORAGE_ADAPTER_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Storage adapter service is down or not accessible |
Severity | Critical |
Description | OCNADD Storage adapter service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Storage adapter service is down' PromQL Expression: expr: up{service=~".*storage-adapter.*", role="storageAdapter"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_STORAGE_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="storageadapter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="storageadapter"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.38.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Storage adapter service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-19 OCNADD_INGRESS_ADAPTER_SVC_DOWN
Field | Details |
---|---|
Triggering Condition | The OCNADD Ingress Adapter service is down or not accessible |
Severity | Critical |
Description | OCNADD Ingress Adapter service not available for more than 2 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress Adapter service is down' PromQL Expression: expr: up{service=~".*ingress-adapter.*", role="ingressadapter"} != 1 |
Alert Details OCI |
Summary: Alarm "OCNADD_INGRESS_ADAPTER_SVC_DOWN" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "podStatus[10m]{podOwner="ingressadapter"}.mean()!=1", with a trigger delay of 1 minute MQL Expression: podStatus[10m]{podOwner="ingressadapter"}.mean()!=1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.36.2002 |
Metric Used |
'up' Note: This is a Prometheus metric used for instance availability monitoring. If this metric is not available, use a similar metric as exposed by the monitoring system. |
Resolution |
The alert is cleared automatically when the OCNADD Ingress Adapter service becomes available. Steps:
1. Check for service-specific alerts that may be causing issues with service exposure.
2. Run the following command to check whether the pod is in the “Running” state:
   kubectl -n <namespace> get pod
   If the pod is not in the Running state, capture the pod logs and events.
3. Refer to the application logs and check for database-related failures such as connectivity issues, invalid secrets, and so on.
4. Run the following command to check the Helm status and make sure there are no errors:
   helm status <helm release name of data director> -n <namespace>
   If the status is not “STATUS: DEPLOYED”, capture the logs and events again.
5. If the issue persists, capture all the outputs from the above steps and contact My Oracle Support. |
Table 5-20 OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the warning threshold of 80% of the supported MPS |
Severity | Warn |
Description | Total Ingress Message Rate is above the configured warning threshold (80%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.8*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.8*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5007 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the warning threshold level of 80%. |
Table 5-21 OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the minor threshold alert level of 90% of the supported MPS |
Severity | Minor |
Description | Total Ingress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.9*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>.9*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5008 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90%. |
Table 5-22 OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the major threshold alert level of 95% of the supported MPS |
Severity | Major |
Description | Total Ingress Message Rate is above the configured major threshold alert (95%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95 Percent of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>0.95*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5009 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95%. |
Table 5-23 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS |
Severity | Critical |
Description | Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.36.5010 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS. |
Table 5-24 OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total ingress MPS crossed the critical threshold alert level of 100% of the supported MPS |
Severity | Critical |
Description | Total Ingress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above the supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_processor_node_process_total[10m]{microservice=~"*aggregation*"}.rate().groupBy(k8Namespace).sum()>1.0*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5010 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS. |
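The ingress threshold alerts above each compare the summed message rate against a fixed fraction of the configured MPS, and none of the CNE expressions sets an upper bound, so a rate above 90% also satisfies the 80% rule. The following is a minimal Python sketch of that arithmetic only. The function name `ingress_alerts_firing` and the sample figure of 100000 MPS are illustrative assumptions, not part of OCNADD.

```python
# Threshold factors copied from the CNE PromQL expressions in the tables above.
INGRESS_THRESHOLDS = [
    ("OCNADD_MPS_WARNING_INGRESS_THRESHOLD_CROSSED", 0.8),
    ("OCNADD_MPS_MINOR_INGRESS_THRESHOLD_CROSSED", 0.9),
    ("OCNADD_MPS_MAJOR_INGRESS_THRESHOLD_CROSSED", 0.95),
    ("OCNADD_MPS_CRITICAL_INGRESS_THRESHOLD_CROSSED", 1.0),
]


def ingress_alerts_firing(observed_mps, configured_mps):
    """Return the alert names whose condition `rate > factor * mps` holds.

    configured_mps stands in for {{ .Values.global.cluster.mps }}.
    """
    return [
        name
        for name, factor in INGRESS_THRESHOLDS
        if observed_mps > factor * configured_mps
    ]


# Hypothetical deployment sized for 100000 MPS, observed at 92000 MPS.
print(ingress_alerts_firing(92000, 100000))
```

With these sample values both the warning and the minor conditions hold, which is why several ingress alerts can be active at the same time.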
Table 5-25 OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total egress MPS crossed the warning threshold alert level of 80% of the supported MPS |
Severity | Warn |
Description | The total Egress Message Rate is above the configured warning threshold alert (80%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 80% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.80*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_WARNING_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.80*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5011 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the warning threshold alert level of 80% of the supported MPS. |
Table 5-26 OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total egress MPS crossed the minor threshold alert level of 90% of the supported MPS |
Severity | Minor |
Description | The total Egress Message Rate is above the configured minor threshold alert (90%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 90% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.90*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_MINOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.90*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5012 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the minor threshold alert level of 90% of the supported MPS. |
Table 5-27 OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total egress MPS crossed the major threshold alert level of 95% of the supported MPS |
Severity | Major |
Description | The total Egress Message Rate is above the configured major threshold alert (95%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above 95% of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 0.95*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_MAJOR_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>0.95*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5013 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the major threshold alert level of 95% of the supported MPS. |
Table 5-28 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS |
Severity | Critical |
Description | The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (irate(ocnadd_egress_requests_total[5m])) by (namespace) > 1.0*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5014 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS. |
Table 5-29 OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER
Field | Details |
---|---|
Triggering Condition | The total egress MPS crossed the critical threshold alert level of 100% of the supported MPS for a consumer |
Severity | Critical |
Description | The total Egress Message Rate is above the configured critical threshold alert (100%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Message Rate is above supported Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum (rate(ocnadd_egress_requests_total[5m])) by (namespace, instance_identifier) > 1.0*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_MPS_CRITICAL_EGRESS_THRESHOLD_CROSSED_FOR_A_CONSUMER" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_requests_total[10m]{app=~"*adapter"}.rate().groupBy(worker_group,app).sum()>1.0*{{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5015 |
Metric Used | ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the MPS rate goes below the critical threshold alert level of 100% of the supported MPS. |
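The total egress alert and the per-consumer egress alert use the same metric and the same 100% factor in their CNE expressions. They differ only in grouping: the former sums `by (namespace)`, the latter `by (namespace, instance_identifier)`. The Python sketch below illustrates how the grouping changes what is compared against the threshold. The helper `sum_by`, the label values `feed-a` and `feed-b`, and the rates are hypothetical sample data.

```python
from collections import defaultdict


def sum_by(samples, keys):
    """Sum per-series rates grouped by the given label names."""
    totals = defaultdict(float)
    for labels, rate in samples:
        totals[tuple(labels[k] for k in keys)] += rate
    return dict(totals)


# Hypothetical per-pod egress rates in messages per second.
samples = [
    ({"namespace": "ocnadd", "instance_identifier": "feed-a"}, 70000.0),
    ({"namespace": "ocnadd", "instance_identifier": "feed-a"}, 40000.0),
    ({"namespace": "ocnadd", "instance_identifier": "feed-b"}, 20000.0),
]
configured_mps = 100000  # stands in for {{ .Values.global.cluster.mps }}

# Grouping used by the total egress alert.
total = sum_by(samples, ["namespace"])
# Grouping used by the per-consumer alert.
per_consumer = sum_by(samples, ["namespace", "instance_identifier"])

total_breach = {k for k, v in total.items() if v > 1.0 * configured_mps}
consumer_breach = {k for k, v in per_consumer.items() if v > 1.0 * configured_mps}
print(total_breach, consumer_breach)
```

In this sample the namespace total exceeds the threshold, and the per-consumer grouping additionally identifies `feed-a` as the consumer responsible.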
Table 5-30 OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total observed latency is above the configured warning threshold alert level of 80% |
Severity | Warn |
Description | Average E2E Latency is above the configured warning threshold alert level (80%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 80% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .80*{{ .Values.global.cluster.max_latency }} <= .90*{{ .Values.global.cluster.max_latency }} |
Alert Details OCI |
Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_WARNING_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.040&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.045 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5016 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the warning threshold alert level of 80% of the permissible latency. |
Table 5-31 OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total observed latency is above the configured minor threshold alert level of 90% |
Severity | Minor |
Description | Average E2E Latency is above the configured minor threshold alert level (90%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 90% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .90*{{ .Values.global.cluster.max_latency }} <= 0.95*{{ .Values.global.cluster.max_latency }} |
Alert Details OCI |
Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MINOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.045&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.0475 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5017 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the minor threshold alert level of 90% of the permissible latency. |
Table 5-32 OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total observed latency is above the configured major threshold alert level of 95% |
Severity | Major |
Description | Average E2E Latency is above the configured major threshold alert level (95%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 95% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > .95*{{ .Values.global.cluster.max_latency }} <= 1.0*{{ .Values.global.cluster.max_latency }} |
Alert Details OCI |
Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_MAJOR_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.0475&&ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()<=0.05 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5018 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the major threshold alert level of 95% of the permissible latency. |
Table 5-33 OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED
Field | Details |
---|---|
Triggering Condition | The total observed latency is above the configured critical threshold alert level of 100% |
Severity | Critical |
Description | Average E2E Latency is above the configured critical threshold alert level (100%) for the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: E2E Latency is above 100% of Maximum permissable latency {{ .Values.global.cluster.max_latency }} ms' PromQL Expression: expr: (sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_sum[5m])) by (namespace)) /(sum (irate(ocnadd_egress_e2e_request_processing_latency_seconds_count[5m])) by (namespace)) > 1.0*{{ .Values.global.cluster.max_latency }} |
Alert Details OCI |
Summary: Alarm "OCNADD_E2E_AVG_RECORD_LATENCY_CRITICAL_THRESHOLD_CROSSED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_e2e_request_processing_latency_seconds_sum[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_e2e_request_processing_latency_seconds_count[10m].rate().groupBy(worker_group,app).sum()>0.05 Note: {{ .Values.global.cluster.max_latency }} is considered to be 0.05 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5019 |
Metric Used | ocnadd_egress_e2e_request_processing_latency_seconds_sum, ocnadd_egress_e2e_request_processing_latency_seconds_count |
Resolution | The alert is cleared automatically when the average latency goes below the critical threshold alert level of 100% of the permissible latency. |
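Unlike the MPS alerts, the first three latency alerts are banded: each expression has a lower bound and an upper bound, so at most one latency severity applies to a given average. The average itself is the rate of the `_sum` series divided by the rate of the `_count` series. The Python sketch below reproduces that calculation and the band boundaries from the expressions above, using the 0.05 value that the OCI notes assume for `max_latency`. The function names are illustrative, not part of OCNADD.

```python
def average_latency(latency_sum_rate, request_count_rate):
    """Average seconds per request: rate of _sum divided by rate of _count."""
    return latency_sum_rate / request_count_rate


def latency_severity(avg_latency, max_latency=0.05):
    """Return the single severity band the average falls into, or None."""
    # (severity, lower factor, upper factor) from the CNE expressions.
    bands = [
        ("Warn", 0.80, 0.90),
        ("Minor", 0.90, 0.95),
        ("Major", 0.95, 1.00),
    ]
    for severity, low, high in bands:
        if low * max_latency < avg_latency <= high * max_latency:
            return severity
    if avg_latency > max_latency:
        return "Critical"
    return None


# Hypothetical rates: 4.6 seconds of latency accumulated per second
# across 100 requests per second.
avg = average_latency(4.6, 100.0)
print(avg, latency_severity(avg))
```

With `max_latency` at 0.05, the bands start at 0.040, 0.045, 0.0475, and 0.05, which matches the literal values in the OCI MQL expressions.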
Table 5-34 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS
Field | Details |
---|---|
Triggering Condition | The packet drop rate in Kafka streams is above the configured major threshold of 1% of the total supported MPS |
Severity | Major |
Description | The packet drop rate in Kafka streams is above the configured major threshold of 1% of total supported MPS in the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 1% threshold of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.01*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_1PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100> {{ MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5020 |
Metric Used | kafka_stream_task_dropped_records_total |
Resolution | The alert is cleared automatically when the packet drop rate goes below the major threshold alert level of 1% of the supported MPS. |
Table 5-35 OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS
Field | Details |
---|---|
Triggering Condition | The packet drop rate in Kafka streams is above the configured critical threshold of 10% of the total supported MPS |
Severity | Critical |
Description | The packet drop rate in Kafka streams is above the configured critical threshold of 10% of total supported MPS in the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Packet Drop rate is above 10% threshold of Max messages per second:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(rate(kafka_stream_task_dropped_records_total{service=~".*aggregation.*"}[5m])) by (namespace) > 0.1*{{ .Values.global.cluster.mps }} |
Alert Details OCI |
Summary: Alarm "OCNADD_KAFKA_PACKET_DROP_THRESHOLD_10PERCENT_MPS" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }}", with a trigger delay of 1 minute MQL Expression: kafka_stream_task_dropped_records_total[10m]{microservice=~"*aggregation"}.rate().groupBy(k8Namespace).sum()*100>10*{{MPS Threshold }} |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5021 |
Metric Used | kafka_stream_task_dropped_records_total |
Resolution | The alert is cleared automatically when the packet drop rate goes below the critical threshold alert level of 10% of the supported MPS. |
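The CNE and OCI expressions for the two packet drop alerts are written differently: the PromQL form compares the drop rate with `0.01*mps` or `0.1*mps`, while the MQL form multiplies the drop rate by 100 and compares it with the MPS threshold or ten times that threshold. The short Python check below is a sketch showing that the two forms are algebraically the same condition. It uses `Fraction` to avoid floating-point rounding at the boundaries, and the function names and sample values are assumptions made for illustration.

```python
from fractions import Fraction


def cne_fires(dropped_per_second, mps, factor):
    """PromQL-style condition: drop rate > factor * mps."""
    return dropped_per_second > Fraction(factor) * mps


def oci_fires(dropped_per_second, mps, multiplier):
    """MQL-style condition: drop rate * 100 > multiplier * mps."""
    return dropped_per_second * 100 > multiplier * mps


configured_mps = 100000  # hypothetical value for the MPS threshold
for dropped in (500, 1000, 1001, 9999, 10000, 10001):
    # 1% alert: 0.01*mps on CNE, *100 > mps on OCI.
    assert cne_fires(dropped, configured_mps, "0.01") == oci_fires(dropped, configured_mps, 1)
    # 10% alert: 0.1*mps on CNE, *100 > 10*mps on OCI.
    assert cne_fires(dropped, configured_mps, "0.1") == oci_fires(dropped, configured_mps, 10)
```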
Table 5-36 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_0.1PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 0.1% of the total supported MPS |
Severity | Info |
Description | Egress external connection failure rate towards the third-party application is crossing the info threshold of 0.1% in the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 0.1 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 0.1 < 10 |
Alert Details OCI |
Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_01PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=0.1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<1 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5022 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers goes below the info threshold alert level of 0.1%. |
Table 5-37 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 1% of the total supported MPS |
Severity | Warn |
Description | Egress external connection failure rate towards the third-party application is crossing the warning threshold of 1% in the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 1 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 1 < 10 |
Alert Details OCI |
Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=1&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<10 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5023 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers goes below the warning threshold alert level of 1%. |
Table 5-38 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 10% of the total supported MPS |
Severity | Minor |
Description | Egress external connection failure rate towards 3rd party application is crossing a minor threshold of 10% in the period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 10 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 10 < 25 |
Alert Details OCI |
Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=10&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<25 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5024 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers drops below the 10% threshold alert level of the supported MPS |
Table 5-39 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 25% of the total supported MPS |
Severity | Major |
Description | The Egress external connection failure rate towards the third-party application has crossed the major threshold of 25% over a period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 25 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 25 < 50 |
Alert Details OCI |
Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=25&&ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100<50 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5025 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers drops below the 25% threshold alert level of the supported MPS |
Table 5-40 OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT
Field | Details |
---|---|
Triggering Condition | The Egress adapter failure rate towards the third-party application is above the configured threshold of 50% of the total supported MPS |
Severity | Critical |
Description | The Egress external connection failure rate towards the third-party application has crossed the critical threshold of 50% over a period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Egress external connection Failure Rate detected above 50 Percent of Total Egress external connections' PromQL Expression: expr: (sum(rate(ocnadd_egress_failed_request_total[5m])) by (namespace))/(sum(rate(ocnadd_egress_requests_total[5m])) by (namespace)) *100 >= 50 |
Alert Details OCI |
Summary: Alarm "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50", with a trigger delay of 1 minute MQL Expression: ocnadd_egress_failed_request_total[10m].rate().groupBy(worker_group,app).sum()/ocnadd_egress_requests_total[10m].rate().groupBy(worker_group,app).sum()*100>=50 |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5026 |
Metric Used | ocnadd_egress_failed_request_total, ocnadd_egress_requests_total |
Resolution | The alert is cleared automatically when the failure rate towards third-party consumers drops below the 50% threshold alert level of the supported MPS |
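The four egress failure-rate alerts split the failure percentage into non-overlapping bands (1 to 10, 10 to 25, 25 to 50, and 50 or more). The following Python sketch is not part of OCNADD; it only mirrors the thresholds in the PromQL expressions to show which alert a given pair of rates would fall under. The function name `egress_failure_alert` and its two inputs are hypothetical, and it assumes the per-second rates have already been read from the monitoring system. Treating zero total traffic as "no alert" is a simplification made for this sketch.

```python
from typing import Optional

# (inclusive lower bound, exclusive upper bound or None, alert name),
# taken from the ">= X < Y" comparisons in the PromQL expressions.
EGRESS_BANDS = [
    (50.0, None, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_50PERCENT"),
    (25.0, 50.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_25PERCENT"),
    (10.0, 25.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_10PERCENT"),
    (1.0, 10.0, "OCNADD_EGRESS_FAILURE_RATE_THRESHOLD_1PERCENT"),
]


def egress_failure_alert(failed_rate: float, total_rate: float) -> Optional[str]:
    """Return the name of the alert band the failure percentage falls in.

    failed_rate and total_rate stand in for the rate() of
    ocnadd_egress_failed_request_total and ocnadd_egress_requests_total.
    """
    if total_rate <= 0:
        # Assumption for this sketch: no egress traffic means no alert.
        return None
    failure_percent = 100.0 * failed_rate / total_rate
    for lower, upper, name in EGRESS_BANDS:
        if failure_percent >= lower and (upper is None or failure_percent < upper):
            return name
    return None  # below 1%, so no alert band matches


if __name__ == "__main__":
    # 30 failed out of 100 requests per second lands in the 25% band.
    print(egress_failure_alert(30.0, 100.0))
```

Because the bands do not overlap, at most one of the four alerts matches a given failure percentage at any time.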
Table 5-41 OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT
Field | Details |
---|---|
Triggering Condition | The ingress traffic increase is more than 10% of the supported MPS |
Severity | Major |
Description | The ingress traffic increase is more than 10% of the supported MPS in the last 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS increase is more than 10 Percent of current supported MPS' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) >= 1.1 |
Alert Details OCI | Not Available |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5027 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the increase in MPS falls back below 10% of the supported MPS |
Table 5-42 OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT
Field | Details |
---|---|
Triggering Condition | The ingress traffic decrease is more than 10% of the supported MPS |
Severity | Major |
Description | The ingress traffic decrease is more than 10% of the supported MPS in the last 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Ingress MPS decrease is more than 10 Percent of current supported MPS' PromQL Expression: expr: sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m])) by (namespace)/sum(irate(kafka_stream_processor_node_process_total{service=~".*aggregation.*"}[5m] offset 5m)) by (namespace) <= 0.9 |
Alert Details OCI | Not Available |
OID | 1.3.6.1.4.1.323.5.3.53.1.29.5028 |
Metric Used | kafka_stream_processor_node_process_total |
Resolution | The alert is cleared automatically when the decrease in MPS falls back below 10% of the supported MPS |
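As written, the two ingress spike expressions divide the current processing rate by the rate observed 5 minutes earlier (`offset 5m`) and compare the ratio against 1.1 and 0.9. The Python sketch below illustrates that comparison only. It is not OCNADD code: the function name `ingress_spike_alert` and its parameters are hypothetical, and returning no alert when the earlier rate is zero is an assumption made to keep the example simple.

```python
from typing import Optional


def ingress_spike_alert(current_rate: float, previous_rate: float) -> Optional[str]:
    """Compare the current ingress rate with the rate from 5 minutes earlier.

    Both arguments stand in for the summed irate() of
    kafka_stream_processor_node_process_total, now and at "offset 5m".
    """
    if previous_rate <= 0:
        # Assumption for this sketch: no baseline traffic means no alert.
        return None
    ratio = current_rate / previous_rate
    if ratio >= 1.1:
        return "OCNADD_INGRESS_TRAFFIC_RATE_INCREASE_SPIKE_10PERCENT"
    if ratio <= 0.9:
        return "OCNADD_INGRESS_TRAFFIC_RATE_DECREASE_SPIKE_10PERCENT"
    return None  # change is within 10% in either direction


if __name__ == "__main__":
    # Traffic rose from 1000 to 1200 messages per second: a 20% increase.
    print(ingress_spike_alert(1200.0, 1000.0))
```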
Table 5-43 OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED
Field | Details |
---|---|
Triggering Condition | The total transaction success xDRs rate has dropped below the critical threshold alert level of 90% |
Severity | Critical |
Description | The total transaction success xDRs rate has dropped below the critical threshold alert level of 90% for a period of 5 min |
Alert Details CNE |
Summary: 'namespace: {{ "{{" }}$labels.namespace}}, workergroup: {{ "{{" }} $labels.worker_group }}, podname: {{ "{{" }}$labels.pod}}, timestamp: {{ "{{" }} with query "time()" }}{{ "{{" }} . | first | value | humanizeTimestamp }}{{ "{{" }} end }}: Transaction Success Rate is below 90% per hour:{{ .Values.global.cluster.mps }}' PromQL Expression: expr: sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*",status="SUCCESS"}[5m])) by (namespace,service) / sum(irate(ocnadd_total_transactions_total{service=~".*correlation.*"}[5m])) by (namespace,service) *100 < 90 |
Alert Details OCI |
Summary: Alarm "OCNADD_TRANSACTION_SUCCESS_CRITICAL_THRESHOLD_DROPPED" is in a "OK/FIRING" state; because 0/1 metrics meet the trigger rule: "ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90", with a trigger delay of 1 minute MQL Expression: ocnadd_total_transactions_total[10m]{status="SUCCESS",app=~"*corr*"}.rate().groupBy(workername,app).sum()/ocnadd_total_transactions_total[10m]{app=~"*corr*"}.rate().groupBy(workername,app).sum()*100<90 |
OID | 1.3.6.1.4.1.323.5.3.53.1.33.5029 |
Metric Used | ocnadd_total_transactions_total |
Resolution | The alert is cleared automatically when the transaction success rate goes above the critical threshold alert level of 90% |
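The transaction success alert fires when successful transactions make up less than 90% of all transactions handled by the correlation service. The Python sketch below shows the same arithmetic for illustration; it is not OCNADD code. The function name `transaction_success_alert`, its parameters, and the choice to return `False` when there are no transactions are assumptions made for this example.

```python
def transaction_success_alert(
    success_rate: float,
    total_rate: float,
    threshold_percent: float = 90.0,
) -> bool:
    """Return True when the success percentage is below the threshold.

    success_rate stands in for the rate of ocnadd_total_transactions_total
    with status="SUCCESS"; total_rate is the same metric over all statuses.
    """
    if total_rate <= 0:
        # Assumption for this sketch: no transactions means no alert.
        return False
    success_percent = 100.0 * success_rate / total_rate
    return success_percent < threshold_percent


if __name__ == "__main__":
    # 85 successful out of 100 transactions per second is below 90%.
    print(transaction_success_alert(85.0, 100.0))
```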