4 Alerts

Oracle Communications Cloud Native Core, Unified Data Repository (UDR) uses alerts to track the following scenarios:
  • Pod not running or down
  • Pod restarts
  • Transaction traffic reaches the maximum threshold
  • Subscriber not found
  • XFCC validation failure rate
  • Invalid user agent

Note:

The performance and capacity of the UDR system may vary based on the call model, feature or interface configuration, and the underlying CNE and hardware environment, including, but not limited to, the size of the JSON payload, operation type, and traffic model.

If any of the above scenarios occur, an alert is triggered in Prometheus. Alerts help you handle a scenario before it results in a failure.

Configuring the Alerts Manually

Use the following files to configure the alerts manually. These files are shared as part of the custom templates of UDR:
  • ocudr_alerts_haprom.yaml to configure alerts in UDR when using CNE 25.1.2xx versions.
  • ocudr_alerts_non_haprom.yaml to configure alerts in UDR when using an environment other than CNE.
  • ocslf_alerts_non_haprom.yaml to configure alerts in SLF when using OSO 25.1.2xx versions.
  • ocslf_alerts_haprom.yaml to configure alerts in SLF when using CNE 25.1.2xx versions.
  • oceir_alerts_haprom.yaml to configure alerts in EIR when using CNE 25.1.2xx versions.
  • oceir_alerts_non_haprom.yaml to configure alerts in EIR when using an environment other than CNE.

Manually Configuring the Alerts in SLF when using OSO 25.1.2xx versions:

In the SLF_Alerts.yaml file, update the namespace, and then perform the following steps to configure the alerts manually:
  1. Run the following command to take a backup of the current Prometheus configuration map:
    kubectl get configmaps occne-prometheus-server -o yaml -n occne-infra > /tmp/tempConfig.yaml
  2. Run the following commands to add the UDR alerts file reference to the Prometheus ConfigMap YAML file:
    sed -i '/etc\/config\/alertsudr/d' /tmp/tempConfig.yaml
    sed -i '/rule_files:/a\ \   - /etc/config/alertsudr' /tmp/tempConfig.yaml
  3. Run the following command to update the configuration map with the updated SLF alert file:
    kubectl replace configmap occne-prometheus-server -f /tmp/tempConfig.yaml
  4. Run the following command to add the UDR alert rules to the configuration map under the SLF alert file name:
    kubectl patch configmap occne-prometheus-server -n occne-infra --type merge --patch "$(cat ~/ocslf_alerts_non_haprom_25.1.200.yaml)"

    Note:

    The Prometheus server picks up the updated ConfigMap automatically after some time (approximately 20 seconds). You can verify the result as shown in the example below.
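
For example, the following commands check that the rule file reference and the patched alert rules are present in the Prometheus ConfigMap. The ConfigMap name and namespace are taken from the steps above; the grep strings are illustrative only.

    # Confirm that the rule_files entry added in step 2 is present
    kubectl get configmap occne-prometheus-server -n occne-infra -o yaml | grep -A 2 "rule_files:"
    # Count the alert rules patched in step 4
    kubectl get configmap occne-prometheus-server -n occne-infra -o yaml | grep -c "alert:"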

Manually Configuring the Alerts in UDR, SLF, and EIR when using CNE 25.1.2xx version:

In the UDR, SLF, and EIR Alerts.yaml, update the namespace and then perform the following steps to configure the alerts manually:

Run the following command for:

UDR:
kubectl create -f ocudr_alerts_haprom_25.1.200.yaml -n <namespace>

SLF:
kubectl create -f ocslf_alerts_haprom_25.1.200.yaml -n <namespace>

EIR:
kubectl create -f oceir_alerts_haprom_25.1.200.yaml -n <namespace>
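
In CNE environments, these files are typically applied as PrometheusRule resources for the HA Prometheus stack. As an optional check, assuming the prometheus-operator CRDs are available in the cluster and the applied file defines a PrometheusRule resource, you can list the created resources; the rule name placeholder below is hypothetical.

    kubectl get prometheusrules -n <namespace>
    kubectl describe prometheusrule <rule-name-from-applied-file> -n <namespace>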

Manually Disabling the Alerts

Perform the following steps to disable the alerts manually in Prometheus:

  1. Edit the ocslf_alerts_non_haprom.yaml file to remove the specific alert. For example, to disable the OcudrTrafficRateAboveMinorThreshold alert, locate the following block:
    ## ALERT SAMPLE START##
    
          - alert: OcudrTrafficRateAboveMinorThreshold
            annotations:
              description: 'Ingress traffic Rate is above minor threshold i.e. 800 mps (current value is: {{ $value }})'
              summary: 'Traffic Rate is above 80 Percent of Max requests per second(1000)'
            expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="ocudr"}[2m])) >= 800 < 900
            labels:
              severity: Minor
    
    ## ALERT SAMPLE END##
  2. Remove the content of the alert that needs to be disabled.
  3. Configure the alerts again. For more information, see Configuring the Alerts Manually. Before reapplying, you can optionally validate the edited rules, as shown in the sketch below.
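
A minimal validation sketch, assuming promtool is available on the host and that the loaded alert rules are stored under the alertsudr key of the Prometheus ConfigMap, as in the configuration steps above:

    # Extract the currently loaded rules and run a syntax check on them
    kubectl get configmap occne-prometheus-server -n occne-infra -o jsonpath='{.data.alertsudr}' > /tmp/alertsudr-rules.yaml
    promtool check rules /tmp/alertsudr-rules.yaml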

OSO Alerts Automation

Alerts are automated by using the Helm upgrade command with the Helm chart provided as part of the OSO software package. The alert automation process is as follows:
  1. A new oso-alr-config Helm chart is provided as part of the OSO software package from the 25.1.200 release onwards. For information on downloading the OSO software package, see Oracle Communications, Cloud Native Core, Operations Services Overlay Installation and Upgrade Guide.
  2. The oso-alr-config Helm chart must be deployed once OSO is installed.
  3. This separate Helm chart allows the Helm install command to run without an input alert file:
    helm install oso-alr-config ocoso_csar_abc_25_1_200_0_0-rc.1_alert_config_charts.tgz -f ocoso_csar_abc_25_1_200_0_0-rc.1_alert_config_custom_values.yaml -n <oso deployed namespace>
  4. When the oso-alr-config Helm chart installation is complete, the oso-alr-config release is ready to use.
  5. If you are enabling this feature after the UDR deployment, run a Helm upgrade. Run the following Helm upgrade command on the oso-alr-config release to apply the UDR alert file:
    helm upgrade oso-alr-config oso-alr-config/ -f ocoso_csar_abc_25_1_200_0_0-rc.1_alert_config_custom_values.yaml -f ocslf_alerts_non_haprom_25.1.200.yaml -n <oso deployed namespace>
  6. Once the Helm upgrade is complete, you can view the alerts file that is applied to the OSO Prometheus ConfigMap. The applied alerts can also be viewed in the Prometheus Graphical User Interface (GUI).
  7. You can also update the same alert file and perform another Helm upgrade; the applied alert rules are then updated with the latest changes. A verification sketch follows this list.
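
For example, you can confirm which alert values were applied to the oso-alr-config release. The release name is taken from the commands above; the output format is indicative only.

    # Show the values (including the alert rules file) applied to the release
    helm get values oso-alr-config -n <oso deployed namespace>
    # Show the upgrade history of the release
    helm history oso-alr-config -n <oso deployed namespace>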
Perform the following steps to clear the alerts:
  1. An empty ocslf_alertrules_empty_<version>.yaml file is delivered as part of the OSO software package. For information on downloading the OSO software package, see Oracle Communications, Cloud Native Core, Operations Services Overlay Installation and Upgrade Guide. You must provide this ocslf_alertrules_empty_<version>.yaml file during the Helm upgrade.
  2. This file removes all the alerts when it is provided as an input file to the Helm upgrade command. This removes the alerts from the OSO Prometheus ConfigMap and the Prometheus GUI, keeps the reference under rule_files ("/etc/config/alertsudr"), and leaves the alert rules empty ("alertsudr: { }").
  3. For example, a sample Helm upgrade command to clean up alert rules is as follows:
    helm upgrade oso-alr-config oso-alr-config/ -f ocoso_csar_abc_25_1_200_0_0-rc.1_alert_config_custom_values.yaml -f ocslf_alertrules_empty_25.1.200.yaml -n <oso deployed namespace>
A sample empty alert file is as follows:
apiVersion: v1
data:
  alertsudr: |
    {}
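
After the cleanup upgrade, you can optionally confirm that the rules key is empty. The ConfigMap name below is a placeholder; replace it with the Prometheus server ConfigMap used in your OSO deployment. The expected output after cleanup is an empty rule set ({}).

    kubectl get configmap <oso-prometheus-server-configmap> -n <oso deployed namespace> -o jsonpath='{.data.alertsudr}'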

Observe

For more information on metrics and KPIs, see the UDR Metrics and UDR KPIs sections, respectively.

4.1 Alert Details

This section describes alerts in detail.

Note:

The maximum Ingress request rate considered is 1000 requests per second.

Table 4-1 Alerts Levels or Severity Types

Alerts Levels / Severity Types Definition
Critical Indicates a severe issue that poses a significant risk to safety, security, or operational integrity. It requires an immediate response to address the situation and prevent serious consequences. Raised for conditions that may affect the service of UDR.
Major Indicates a more significant issue that has an impact on operations or poses a moderate risk. It requires prompt attention and action to mitigate potential escalation. Raised for conditions that may affect the service of UDR.
Minor Indicates a situation that is low in severity and does not pose an immediate risk to safety, security, or operations. It requires attention but does not demand urgent action. Raised for conditions that may affect the service of UDR.
Info or Warn (Informational) Provides general information or updates that are not related to immediate risks or actions. These alerts are for awareness and do not typically require any specific response. WARN and INFO alerts may not impact the service of UDR.

The following table provides the alert names for UDR/SLF and EIR.

Table 4-2 Alert names for UDR/SLF and EIR

UDR/SLF EIR
OcudrTrafficRateAboveMajorThreshold OceirTrafficRateAboveMajorThreshold
OcudrTrafficRateAboveMinorThreshold OceirTrafficRateAboveMinorThreshold
OcudrTrafficRateAboveCriticalThreshold OceirTrafficRateAboveCriticalThreshold
OcudrTransactionErrorRateAbove0.1Percent OceirTransactionErrorRateAbove0.1Percent
OcudrTransactionErrorRateAbove1Percent OceirTransactionErrorRateAbove1Percent
OcudrTransactionErrorRateAbove10Percent OceirTransactionErrorRateAbove10Percent
OcudrTrafficRateAboveCriticalThreshold OceirTrafficRateAboveCriticalThreshold
OcudrTrafficRateAboveMajorThreshold OceirTrafficRateAboveMajorThreshold
OcudrTrafficRateAboveMinorThreshold OceirTrafficRateAboveMinorThreshold
OcudrTransactionErrorRateAbove0.1Percent OceirTransactionErrorRateAbove0.1Percent
OcudrTransactionErrorRateAbove1Percent OceirTransactionErrorRateAbove1Percent
OcudrTransactionErrorRateAbove10Percent OceirTransactionErrorRateAbove10Percent
OcudrTransactionErrorRateAbove25Percent OceirTransactionErrorRateAbove25Percent
OcudrTransactionErrorRateAbove50Percent OceirTransactionErrorRateAbove50Percent
OcudrSubscriberNotFoundAbove1Percent OceirSubscriberNotFoundAbove1Percent
OcudrSubscriberNotFoundAbove10Percent OceirSubscriberNotFoundAbove10Percent
OcudrSubscriberNotFoundAbove25Percent OceirSubscriberNotFoundAbove25Percent
OcudrSubscriberNotFoundAbove50Percent OceirSubscriberNotFoundAbove50Percent
OcudrPodsRestart OceirPodsRestart
NudrServiceDown NudrServiceDown
NudrProvServiceDown NudrProvServiceDown
NudrNotifyServiceServiceDown NA
NudrNRFClientServiceDown NudrNRFClientServiceDown
NudrConfigServiceDown NudrConfigServiceDown
NudrDiameterProxyServiceDown NudrDiameterProxyServiceDown
NudrOnDemandMigrationServiceDown NA
OcudrIngressGatewayServiceDown OceirIngressGatewayServiceDown
OcudrEgressGatewayServiceDown OceirEgressGatewayServiceDown
OcudrDbServiceDown OceirDbServiceDown
OcudrXFCCValidationFailureAbove10Percent OceirXFCCValidationFailureAbove10Percent
OcudrXFCCValidationFailureAbove20Percent OceirXFCCValidationFailureAbove20Percent
OcudrXFCCValidationFailureAbove50Percent OceirXFCCValidationFailureAbove50Percent
DRServiceOverload60Percent DRServiceOverload60Percent
DRServiceOverload75Percent DRServiceOverload75Percent
DRServiceOverload80Percent DRServiceOverload80Percent
DRServiceOverload90Percent DRServiceOverload90Percent
SLFSucessTxnDefaultGroupIdRateAbove1Percent NA
SLFSucessTxnDefaultGroupIdRateAbove10Percent NA
SLFSucessTxnDefaultGroupIdRateAbove25Percent NA
SLFSucessTxnDefaultGroupIdRateAbove50Percent NA
OcudrDiameterCongestionCongestedState OceirDiameterCongestionCongestedState
OcudrDiameterCongestionDocState OceirDiameterCongestionDocState
DRProvServiceOverload60Percent DRProvServiceOverload60Percent
DRProvServiceOverload75Percent DRProvServiceOverload75Percent
DRProvServiceOverload80Percent DRProvServiceOverload80Percent
DRProvServiceOverload90Percent DRProvServiceOverload90Percent
OcudrIngressGatewayProvServiceDown OceirIngressGatewayProvServiceDown
OcudrProvisioningTrafficRateAboveMajorThreshold OceirProvisioningTrafficRateAboveMajorThreshold
OcudrProvisioningTrafficRateAboveCriticalThreshold OceirProvisioningTrafficRateAboveCriticalThreshold
OcudrProvisioningTransactionErrorRateAbove25Percent OceirProvisioningTransactionErrorRateAbove25Percent
OcudrProvisioningTransactionErrorRateAbove50Percent OceirProvisioningTransactionErrorRateAbove50Percent
PVCFullForSLFExport NA
FailedExtractForSLFExport NA
BulkImportTransferInFailed BulkImportTransferInFailed
BulkImportTransferOutFailed BulkImportTransferOutFailed
ExportToolTransferOutFailed ExportToolTransferOutFailed
PVCFullForXMLBulkImport PVCFullForXMLBulkImport
PVCFullForBulkImport PVCFullForBulkImport
OperationalStatusCompleteShutdown OperationalStatusCompleteShutdown
NFScoreCalculationFailed NFScoreCalculationFailed
PVCFullForUDRExport NA
UDRExportFailed NA
IngressgatewayPodProtectionDocState IngressgatewayPodProtectionDocState
IngressgatewayPodProtectionCongestedState IngressgatewayPodProtectionCongestedState
RetryNotificationRecordsMaxLimitExceeded RetryNotificationRecordsMaxLimitExceeded
UserAgentHeaderNotFoundMorethan10PercentRequest NA
EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold
EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold
EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold
NudrDiameterGatewayDown NudrDiameterGatewayDown
DiameterPeerConnectionsDropped DiameterPeerConnectionsDropped
IGWSignallingPodProtectionDOCState NA
IGWSignallingPodProtectionCongestedState NA
IGWSignallingPodProtectionByRateLimitRejectedRequest NA

Note:

For the following alert details, only the UDR alert names are provided. The corresponding EIR alert names are listed in Table 4-2.

4.1.1 System Level Alerts

This section lists the system level alerts.

4.1.1.1 OcudrSubscriberNotFoundAbove1Percent

Table 4-3 OcudrSubscriberNotFoundAbove1Percent

Field Details
Description Total number of responses with subscriber not found is above 1% of the ingress traffic
Summary Total number of responses with subscriber not found is above 1% of the ingress traffic
Severity Warning
Condition Alert if the number of subscriber-not-found responses exceeds 1% of all ingress traffic
OID 1.3.6.1.4.1.323.5.3.43.1.2.7009
Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 1% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support. A sketch of this type of rule expression follows this list.
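
The packaged alert files contain the authoritative rule expression. A minimal sketch of this kind of ratio check, reusing the label names from the sample alert shown earlier in this chapter (the 1% and 10% bounds follow the alert name and are illustrative), could look like:

    expr: (sum(rate(udr_subscriber_not_found_total{kubernetes_namespace="ocudr"}[5m])) / sum(rate(oc_ingressgateway_http_requests_total{kubernetes_namespace="ocudr"}[5m]))) * 100 >= 1 < 10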
4.1.1.2 OcudrSubscriberNotFoundAbove10Percent

Table 4-4 OcudrSubscriberNotFoundAbove10Percent

Field Details
Description Total number of responses with subscriber not found is above 10% of the ingress traffic
Summary Total number of responses with subscriber not found is above 10% of the ingress traffic
Severity Minor
Condition Alert if the number of subscriber-not-found responses exceeds 10% of all ingress traffic
OID 1.3.6.1.4.1.323.5.3.43.1.2.7010
Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 10% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.3 OcudrSubscriberNotFoundAbove25Percent

Table 4-5 OcudrSubscriberNotFoundAbove25Percent

Field Details
Description Total number of responses with subscriber not found is above 25% of the ingress traffic
Summary Total number of responses with subscriber not found is above 25% of the ingress traffic
Severity Major
Condition Alert if the number of subscriber-not-found responses exceeds 25% of all ingress traffic
OID 1.3.6.1.4.1.323.5.3.43.1.2.7011
Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 25% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.4 OcudrSubscriberNotFoundAbove50Percent

Table 4-6 OcudrSubscriberNotFoundAbove50Percent

Field Details
Description Total number of responses with subscriber not found is above 50% of the ingress traffic
Summary Total number of responses with subscriber not found is above 50% of the ingress traffic
Severity Critical
Condition Alert if the number of subscriber-not-found responses exceeds 50% of all ingress traffic
OID 1.3.6.1.4.1.323.5.3.43.1.2.7012
Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 50% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.5 OcudrPodsRestart

Table 4-7 OcudrPodsRestart

Field Details
Description Pod {{$labels.pod}} has restarted.
Summary namespace: {{$labels.namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : A Pod has restarted
Severity Major
Condition Alert if any of the pod got restarted
OID 1.3.6.1.4.1.323.5.3.43.1.2.7014
Metric Used kube_pod_container_status_restarts_total
Recommended Actions

The alert is cleared automatically if the specific pod is up.

Steps:

  1. Refer to the application logs on Kibana and filter based on the pod name. Check for database-related failures such as connectivity, Kubernetes secrets, and so on.
  2. Check orchestration logs for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running and use it in the following command.

    kubectl describe pod <desired full pod name> -n <namespace>

  3. Check the DB status. For more information, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide. A minimal expression sketch for this alert follows.
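
As a reference, a minimal expression sketch using the metric listed above; the namespace value is illustrative and the packaged alert files contain the authoritative expression:

    expr: increase(kube_pod_container_status_restarts_total{namespace="ocudr"}[5m]) > 0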

4.1.1.6 NudrServiceDown

Table 4-8 NudrServiceDown

Field Details
Description OCUDR Nudr_DRService {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Service is down
Severity Critical
Condition Alert if Nudr-dr service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7015
Metric Used app_kubernetes_io_name="nudr-drservice
Recommended Actions

The alert is cleared when the NudrService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.7 NudrProvServiceDown

Table 4-9 NudrProvServiceDown

Field Details
Description OCUDR Nudr_DR_PROVService {{$labels.app_kubernetes_io_name}} is down
Summary 'namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Prov Service is down'
Severity Critical
Condition Alert if Nudr-dr-prov service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7015
Metric Used app_kubernetes_io_name="nudr-dr-provservice
Recommended Actions

The alert is cleared when the NudrProvService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.8 NudrNotifyServiceServiceDown

Table 4-10 NudrNotifyServiceServiceDown

Field Details
Description OCUDR NudrNotifyServiceService {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Nudr Notify Service down.
Severity Critical
Condition Alert if Nudr Notify service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7016
Metric Used app_kubernetes_io_name="nudr-notify-service"
Recommended Actions

The alert is cleared when the NotifyService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.9 NudrNRFClientServiceDown

Table 4-11 NudrNRFClientServiceDown

Field Details
Description OCUDR NRFClient service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NRF Client service down
Severity Critical
Condition Alert if Nudr Nrf Client service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7017
Metric Used app_kubernetes_io_name="nrf-client-nfmanagement
Recommended Actions

The alert is cleared when the NRFClientService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.10 NudrConfigServiceDown

Table 4-12 NudrConfigServiceDown

Field Details
Description OCUDR config service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-config service down
Severity Critical
Condition Alert if Nudr Config service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7020
Metric Used app_kubernetes_io_name="nudr-config"
Recommended Actions

The alert is cleared when the ConfigService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.11 NudrDiameterProxyServiceDown

Table 4-13 NudrDiameterProxyServiceDown

Field Details
Description OCUDR diameterproxy service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-diameterproxy service is down
Severity Critical
Condition Alert if Nudr Diameter Proxy is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7018
Metric Used app_kubernetes_io_name="nudr-diameterproxy"
Recommended Actions

The alert is cleared when the DiameterProxyService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.12 NudrOnDemandMigrationServiceDown

Table 4-14 NudrOnDemandMigrationServiceDown

Field Details
Description OCUDR ondemand-migration service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-ondemand-migration service is down
Severity Critical
Condition Alert if Nudr On Demand Migration is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7019
Metric Used app_kubernetes_io_name="nudr-ondemand-migration"
Recommended Actions

The alert is cleared when the OnDemandMigrationService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the appinfo service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.13 OcudrIngressGatewayServiceDown

Table 4-15 OcudrIngressGatewayServiceDown

Field Details
Description OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity Critical
Condition Alert if Ingress Service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7021
Metric Used app_kubernetes_io_name="ingressgateway"
Recommended Actions

The alert is cleared when the ingressgateway service is available.

Steps:

  1. Check the orchestration logs of ingress-gateway service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the ingress-gateway service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.14 OcudrEgressGatewayServiceDown

Table 4-16 OcudrEgressGatewayServiceDown

Field Details
Description OCUDR Egress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Egress-Gateway service down
Severity Critical
Condition Alert if Egress Service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7022
Metric Used app_kubernetes_io_name="egressgateway"
Recommended Actions

The alert is cleared when the egressgateway service is available.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

  1. Check the orchestration logs of egress-gateway service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the egress-gateway service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.15 OcudrDbServiceDown

Table 4-17 OcudrDbServiceDown

Field Details
Description Mysql connectivity service is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : MySQL connectivity service down
Severity Critical
Condition Alert if Mysql connectivity is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7023
Metric Used appinfo_service_running
Recommended Actions This alert clears when the microservice nudr-drservice is up and running.
4.1.1.16 OcudrIngressGatewayProvServiceDown

Table 4-18 OcudrIngressGatewayProvServiceDown

Field Details
Description OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity Critical
Condition Alert if Ingressgateway-prov service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7043
Metric Used app_kubernetes_io_name="ingressgateway-prov"
Recommended Actions The alert is cleared when the ingress-gateway service is available.

Steps:

  1. Check the orchestration logs of the ingress-gateway service and check for liveness or readiness probe failures using the following commands:

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the ingress-gateway service name. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.2 Application Level Alerts

This section lists the application level alerts.

4.1.2.1 OcudrTrafficRateAboveMajorThreshold

Table 4-19 OcudrTrafficRateAboveMajorThreshold

Field Details
Description Ingress traffic Rate is above major threshold i.e. 900 requests per second
Summary Traffic Rate is above 90 Percent of Max requests per second (1000)
Severity Major
Condition Alert if Ingress traffic reaches 90% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7002
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared either when the Ingress traffic rate falls below the Major threshold or when it crosses the Critical threshold, in which case the OcudrTrafficRateAboveCriticalThreshold alert is raised.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors. An expression sketch for this alert follows this list.
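
A minimal expression sketch for this alert, mirroring the sample minor-threshold rule shown earlier and using the 900 and 950 requests-per-second bounds from this table; the packaged alert files contain the authoritative expression:

    expr: sum(rate(oc_ingressgateway_http_requests_total{app_kubernetes_io_name="ingressgateway",kubernetes_namespace="ocudr"}[2m])) >= 900 < 950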
4.1.2.2 OcudrTrafficRateAboveMinorThreshold

Table 4-20 OcudrTrafficRateAboveMinorThreshold

Field Details
Description Ingress traffic rate is above minor threshold i.e. 800 requests per second
Summary Traffic rate is above 80 Percent of Max requests per second(1000)
Severity Minor
Condition Alert if Ingress traffic reaches 80% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7001
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared either when the total Ingress traffic rate falls below the Minor threshold or when the total traffic rate crosses the Major threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.3 OcudrTrafficRateAboveCriticalThreshold

Table 4-21 OcudrTrafficRateAboveCriticalThreshold

Field Details
Description Ingress traffic Rate is above critical threshold i.e. 950 requests per second
Summary Traffic Rate is above 95 Percent of Max requests per second (1000)
Severity Critical
Condition Alert if Ingress traffic reaches 95% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7003
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the Critical threshold.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.4 OcudrTransactionErrorRateAbove0.1Percent

Table 4-22 OcudrTransactionErrorRateAbove0.1Percent

Field Details
Description Transaction error rate is above 0.1 Percent of Total Transactions
Summary Transaction Error Rate detected above 0.1 Percent of Total Transactions
Severity Warning
Condition Alert if all error rate exceeds 0.1% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7004
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 0.1 percent of the total transactions or when it crosses the 1% threshold, in which case the OcudrTransactionErrorRateAbove1Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support. A sketch of this kind of error-rate expression follows this list.
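
A minimal sketch of this kind of error-rate check; the Status label matching below is illustrative only, and the packaged alert files contain the authoritative expression:

    expr: (sum(rate(oc_ingressgateway_http_responses_total{kubernetes_namespace="ocudr",Status=~"4.*|5.*"}[5m])) / sum(rate(oc_ingressgateway_http_responses_total{kubernetes_namespace="ocudr"}[5m]))) * 100 >= 0.1 < 1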
4.1.2.5 OcudrTransactionErrorRateAbove1Percent

Table 4-23 OcudrTransactionErrorRateAbove1Percent

Field Details
Description Transaction Error rate is above 1 Percent of Total Transactions
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Warning
Condition Alert if all error rate exceeds 1% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7005
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 1% of the total transactions or when it crosses the 10% threshold, in which case the OcudrTransactionErrorRateAbove10Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.6 OcudrTransactionErrorRateAbove10Percent

Table 4-24 OcudrTransactionErrorRateAbove10Percent

Field Details
Description Transaction error rate is above 10 Percent of Total Transactions
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Minor
Condition Alert if all error rate exceeds 10% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7006
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 10% of the total transactions or when it crosses the 25% threshold, in which case the OcudrTransactionErrorRateAbove25Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.7 OcudrTrafficRateAboveCriticalThreshold

Table 4-25 OcudrTrafficRateAboveCriticalThreshold

Field Details
Description Ingress traffic rate is above critical threshold i.e. 950 requests per second
Summary Traffic rate is above 95 Percent of Max requests per second(1000)
Severity Critical
Condition Alert if Ingress traffic reaches 95% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7003
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the Critical threshold.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.8 OcudrTrafficRateAboveMajorThreshold

Table 4-26 OcudrTrafficRateAboveMajorThreshold

Field Details
Description Ingress traffic rate is above major threshold i.e. 900 requests per second
Summary Traffic rate is above 90 Percent of Max requests per second(1000)
Severity Major
Condition Alert if Ingress traffic reaches 90% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7002
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared either when the Ingress traffic rate falls below the Major threshold or when it crosses the Critical threshold, in which case the OcudrTrafficRateAboveCriticalThreshold alert is raised.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.9 OcudrTrafficRateAboveMinorThreshold

Table 4-27 OcudrTrafficRateAboveMinorThreshold

Field Details
Description Ingress traffic Rate is above minor threshold i.e. 800 requests per second
Summary Traffic Rate is above 80 Percent of Max requests per second (1000)
Severity Minor
Condition Alert if Ingress traffic reaches 80% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7001
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared either when the total Ingress traffic rate falls below the Minor threshold or when the total traffic rate crosses the Major threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes.
  3. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.10 OcudrTransactionErrorRateAbove0.1Percent

Table 4-28 OcudrTransactionErrorRateAbove0.1Percent

Field Details
Description Transaction Error rate is above 0.1 Percent of Total Transactions
Summary Transaction Error Rate detected above 0.1 Percent of Total Transactions
Severity Warning
Condition Alert if all error rate exceeds 0.1% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7004
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 0.1 percent of the total transactions or when it crosses the 1% threshold, in which case the OcudrTransactionErrorRateAbove1Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.11 OcudrTransactionErrorRateAbove1Percent

Table 4-29 OcudrTransactionErrorRateAbove1Percent

Field Details
Description Transaction Error rate is above 1 Percent of Total Transactions
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Warning
Condition Alert if all error rate exceeds 1% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7005
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 1% of the total transactions or when it crosses the 10% threshold, in which case the OcudrTransactionErrorRateAbove10Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.12 OcudrTransactionErrorRateAbove10Percent

Table 4-30 OcudrTransactionErrorRateAbove10Percent

Field Details
Description Transaction Error rate is above 10 Percent of Total Transactions
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Minor
Condition Alert if all error rate exceeds 10% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7006
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 10% of the total transactions or when it crosses the 25% threshold, in which case the OcudrTransactionErrorRateAbove25Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.13 OcudrTransactionErrorRateAbove25Percent

Table 4-31 OcudrTransactionErrorRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error Rate detected above 25 Percent of Total Transactions
Severity Major
Condition Alert if all error rate exceeds 25% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7007
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 25% of the total transactions or when it crosses the 50% threshold, in which case the OcudrTransactionErrorRateAbove50Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.14 OcudrTransactionErrorRateAbove50Percent

Table 4-32 OcudrTransactionErrorRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error Rate detected above 50 Percent of Total Transactions
Severity Critical
Condition Alert if all error rate exceeds 50% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7008
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions falls below 50 percent of the total transactions.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.15 OcudrXFCCValidationFailureAbove10Percent

Table 4-33 OcudrXFCCValidationFailureAbove10Percent

Field Details
Description Total number of responses with XFCC validation failure is above 10% of the ingress traffic
Summary Total number of responses with XFCC validation failure is above 10% of the ingress traffic
Severity Minor
Condition Alert if XFCC validation failures reach 10% of the total XFCC validations
OID 1.3.6.1.4.1.323.5.3.43.1.2.7024
Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 10% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support. A sketch of this kind of ratio expression follows this list.
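
A minimal sketch of this kind of ratio check; the validation result label below is hypothetical and shown only to illustrate the shape of the rule, and the packaged alert files contain the authoritative expression:

    expr: (sum(rate(oc_ingressgateway_xfcc_header_validate_total{kubernetes_namespace="ocudr",validation_result="failure"}[5m])) / sum(rate(oc_ingressgateway_xfcc_header_validate_total{kubernetes_namespace="ocudr"}[5m]))) * 100 >= 10 < 20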
4.1.2.16 OcudrXFCCValidationFailureAbove20Percent

Table 4-34 OcudrXFCCValidationFailureAbove20Percent

Field Details
Description Total number of responses with XFCC validation failure is above 20% of the ingress traffic
Summary Total number of responses with XFCC validation failure is above 20% of the ingress traffic
Severity Major
Condition Alert if XFCC validation failures reach 20% of the total XFCC validations
OID 1.3.6.1.4.1.323.5.3.43.1.2.7025
Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 20% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.17 OcudrXFCCValidationFailureAbove50Percent

Table 4-35 OcudrXFCCValidationFailureAbove50Percent

Field Details
Description Total number of responses with XFCC validation failure is above 50% of the ingress traffic
Summary Total number of responses with XFCC validation failure is above 50% of the ingress traffic
Severity Critical
Condition Alert if XFCC validation failures reach 50% of the total XFCC validations
OID 1.3.6.1.4.1.323.5.3.43.1.2.7026
Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 50% of the total.

Steps:

  1. Check the Service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.18 DRServiceOverload60Percent

Table 4-36 DRServiceOverload60Percent

Field Details
Description This alert is fired when the application reaches the Warn overload level
Summary This alert is fired when the application reaches the Warn overload level
Severity Warning
Condition Alert if the application overload level reaches 60%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7027
Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic is reduced to below Warn level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example: udr_rest_failure_response_total
  2. If guidance is required, contact My Oracle Support.
4.1.2.19 DRServiceOverload75Percent

Table 4-37 DRServiceOverload75Percent

Field Details
Description This alert is fired when the application reaches the Minor overload level
Summary This alert is fired when the application reaches the Minor overload level
Severity Minor
Condition Alert if the application overload level reaches 75%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7028
Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic is reduced to below Minor level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example: udr_rest_failure_response_total
  2. If guidance is required, contact My Oracle Support.
4.1.2.20 DRServiceOverload80Percent

Table 4-38 DRServiceOverload80Percent

Field Details
Description This alert is fired when the application reaches the Major overload level
Summary This alert is fired when the application reaches the Major overload level
Severity Major
Condition Alert if the application overload level reaches 80%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7029
Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic is reduced to below Major level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example: udr_rest_failure_response_total
  2. If guidance is required, contact My Oracle Support.
4.1.2.21 DRServiceOverload90Percent

Table 4-39 DRServiceOverload90Percent

Field Details
Description This alert is fired when the application reaches the Critical overload level
Summary This alert is fired when the application reaches the Critical overload level
Severity Critical
Condition Alert if the application overload level reaches 90%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7030
Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic is reduced to below Critical level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example: udr_rest_failure_response_total
  2. If guidance is required, contact My Oracle Support.
4.1.2.22 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Table 4-40 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Field Details
Description Transaction Error Rate detected above 1 Percent of Total Transactions
Summary Transaction Error rate is above 1 Percent of Total Transactions
Severity Warning
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.43.1.2.7031
Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the rate of SLF Lookup requests for subscribers that are not provisioned reduces.

Steps:

Check the subscriber range received for lookup and ensure that there are no unexpected out-of-range subscribers.

4.1.2.23 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Table 4-41 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Field Details
Description Transaction Error Rate detected above 10 Percent of Total Transactions
Summary Transaction Error rate is above 10 Percent of Total Transactions
Severity Minor
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.43.1.2.7032
Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the rate of SLF Lookup requests for subscribers that are not provisioned reduces.

Steps:

Check the subscriber range received for lookup and ensure that there are no unexpected out-of-range subscribers.

4.1.2.24 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Table 4-42 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error rate is above 25 Percent of Total Transactions
Severity Major
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 25% of the total responses.
OID 1.3.6.1.4.1.323.5.3.43.1.2.7033
Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the rate of SLF Lookup requests for subscribers that are not provisioned reduces.

Steps:

Check the subscriber range received for lookup and ensure that there are no unexpected out-of-range subscribers.

4.1.2.25 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Table 4-43 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error rate is above 50 Percent of Total Transactions
Severity Critical
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 50% of the total responses.
OID 1.3.6.1.4.1.323.5.3.43.1.2.7034
Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the rate of SLF Lookup requests for subscribers that are not provisioned reduces.

Steps:

Check the subscriber range received for lookup and ensure that there are no unexpected out-of-range subscribers.

4.1.2.26 OcudrDiameterCongestionCongestedState

Table 4-44 OcudrDiameterCongestionCongestedState

Field Details
Description Alert will be raised if the diameter gateway pod is in CONGESTED state.
Summary Alert will be raised if the diameter gateway pod is in CONGESTED state.
Severity Critical
Condition Alert will be raised if the diameter gateway pod is in CONGESTED state.
Metric Used ocudr_pod_congestion_state == 2
Recommended Actions

This alert is raised when the Diameter Gateway pod congestion level is set to the CONGESTED state.

Steps:

  1. Decrease the traffic rate or use appropriate performance resources.
  2. Check the pod congestion configurations and resource limits in CNC Console.
4.1.2.27 OcudrDiameterCongestionDocState

Table 4-45 OcudrDiameterCongestionDocState

Field Details
Description Alert will be raised if the diameter gateway pod is in the Danger of Congestion (DOC) state.
Summary Alert will be raised if the diameter gateway pod is in the Danger of Congestion (DOC) state.
Severity Major
Condition Alert will be raised if the diameter gateway pod is in the Danger of Congestion (DOC) state.
Metric Used ocudr_pod_congestion_state == 1
Recommended Actions

This alert is raised when the Diameter Gateway pod congestion level is set to the Danger of Congestion (DOC) state.

Steps:

  1. Decrease the traffic rate or use appropriate performance resources.
  2. Check the pod congestion configurations and resource limits in CNC Console.
4.1.2.28 DRProvServiceOverload60Percent

Table 4-46 DRProvServiceOverload60Percent

Field Details
Description This alert is fired when the application reaches the Warn overload level
Summary This alert is fired when the application reaches the Warn overload level
Severity Warning
Condition Alert if the application overload level reaches 60%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7036
Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic is reduced to below Warn level.

Steps:

  1. Check the service specific metrics to understand the specific service request errors, as shown in the query sketch after this list.

    Example: udr_rest_failure_response_total

  2. Contact My Oracle Support if guidance is required.
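As an illustration of step 1, the following PromQL sketch summarizes the failure-response rate reported by udr_rest_failure_response_total. The grouping label (service) and the namespace label value are assumptions; use whichever labels the metric exposes in your deployment.

  # Failure responses per second over the last 5 minutes, grouped by service
  # (the "service" grouping label is an assumption)
  sum by (service) (rate(udr_rest_failure_response_total{namespace="ocudr"}[5m]))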
4.1.2.29 DRProvServiceOverload75Percent

Table 4-47 DRProvServiceOverload75Percent

Field Details
Description This alert is fired when the application reaches the Minor overload level
Summary This alert is fired when the application reaches the Minor overload level
Severity Minor
Condition Alert if the application load reaches 75%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7037
Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic is reduced to below Minor level.

Steps:

  1. Check the service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. Contact My Oracle Support if guidance is required.
4.1.2.30 DRProvServiceOverload80Percent

Table 4-48 DRProvServiceOverload80Percent

Field Details
Description This alert is fired when the application reaches the Major overload level
Summary This alert is fired when the application reaches the Major overload level
Severity Major
Condition Alert if the application load reaches 80%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7038
Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic is reduced to below Major level.

Steps:

  1. Check the service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. Contact My Oracle Support if guidance is required.
4.1.2.31 DRProvServiceOverload90Percent

Table 4-49 DRProvServiceOverload90Percent

Field Details
Description This alert is fired when the application reaches the Critical overload level
Summary This alert is fired when the application reaches the Critical overload level
Severity Critical
Condition Alert if the application load reaches 90%
OID 1.3.6.1.4.1.323.5.3.43.1.2.7039
Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic is reduced to below critical level.

Steps:

  1. Check the service specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. Contact My Oracle Support if guidance is required.
4.1.2.32 Diameter-Gateway pod congestion Danger of congestion state

Table 4-50 Diameter-Gateway pod congestion Danger of congestion state

Field Details
Description DiameterGateway pod at Danger of Congestion state
Summary DiameterGateway pod at Danger of Congestion state
Severity Major
Condition Alert if the diameter gateway pod is in Danger of Congestion (DOC) state
OID 1.3.6.1.4.1.323.5.3.43.1.2.7041
Metric Used occnp_pod_congestion_state==1
Recommended Actions

This alert is raised when the diameter gateway pod congestion level is set to the Danger of Congestion (DOC) state.

Steps:

  1. Reduce the traffic rate or allocate appropriate performance resources.
  2. Check the pod congestion configurations and resource limits in the CNE GUI.
4.1.2.33 Diameter-Gateway pod CONGESTED state

Table 4-51 Diameter-Gateway pod CONGESTED state

Field Details
Description DiameterGateway pod at Congested state
Summary DiameterGateway pod at Congested state
Severity Critical
Condition Alert if the diameter gateway pod is in CONGESTED state
OID 1.3.6.1.4.1.323.5.3.43.1.2.7042
Metric Used occnp_pod_congestion_state==2
Recommended Actions

This alert is raised when the diameter gateway pod congestion level is set to the CONGESTED state

Steps:

  1. Reduce the traffic rate or allocate appropriate performance resources.
  2. Check the pod congestion configurations and resource limits in the CNE GUI.
4.1.2.34 OcudrProvisioningTrafficRateAboveMajorThreshold

Table 4-52 OcudrProvisioningTrafficRateAboveMajorThreshold

Field Details
Description Ingress traffic Rate is above critical threshold, that is, 950 requests per second
Summary Traffic Rate is above 95 Percent of Max requests per second (1000)
Severity Critical
Condition Alert if Ingress traffic reaches 95% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7044
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared when the Ingress Traffic rate falls below the Critical threshold.

Note:

The threshold is configurable in UDR_Alertrules.yaml.

Steps:

Reassess why OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario). If this is unexpected, contact My Oracle Support.
  1. Refer to Grafana to determine the service that is receiving high traffic.
  2. Refer to the Ingress gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check the Ingress gateway logs on Kibana to determine the reason for the errors.
4.1.2.35 OcudrProvisioningTrafficRateAboveCriticalThreshold

Table 4-53 OcudrProvisioningTrafficRateAboveCriticalThreshold

Field Details
Description Ingress traffic Rate is above major threshold, that is, 900 requests per second
Summary Traffic Rate is above 90 Percent of Max requests per second (1000)
Severity Major
Condition Alert if Ingress traffic reaches 90% of max TPS
OID 1.3.6.1.4.1.323.5.3.43.1.2.7045
Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the total Ingress Traffic rate falls below the Major threshold, or when the total traffic rate exceeds the Critical threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised.

Note:

The threshold is configurable in UDR_Alertrules.yaml.

Steps:

Reassess why OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario). If this is unexpected, contact My Oracle Support.
  1. Refer to Grafana to determine the service that is receiving high traffic (a query sketch for checking the overall ingress rate is shown after this list).
  2. Refer to the Ingress gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check the Ingress gateway logs on Kibana to determine the reason for the errors.
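To check how close the ingress traffic rate is to the thresholds above, you can query the ingress request counter directly. The following PromQL sketch is illustrative only: the namespace label value, and the Status label used to isolate 4xx and 5xx responses, are assumptions based on the examples elsewhere in this section and may differ in your deployment.

  # Total ingress request rate in requests per second (compare with the 900 and 950 thresholds)
  sum(rate(oc_ingressgateway_http_requests_total{namespace="ocudr"}[5m]))

  # Rate of 4xx and 5xx responses, grouped by status code
  # (the "Status" label name is an assumption)
  sum by (Status) (rate(oc_ingressgateway_http_responses_total{namespace="ocudr",Status=~"4.*|5.*"}[5m]))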
4.1.2.36 OcudrProvisioningTransactionErrorRateAbove25Percent

Table 4-54 OcudrProvisioningTransactionErrorRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error Rate detected above 25 Percent of Total Transactions
Severity Major
Condition Alert if all error rate exceeds 25% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7046
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failure transactions is below 25% of the total transactions, or when the number of failure transactions exceeds the 50% threshold, in which case the OcudrProvisioningTransactionErrorRateAbove50Percent alert is raised.

Steps:

  1. Check the metrics per service and per method; for example, failed requests can be identified from the following metric and labels (see the query sketch after this list).

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. Contact My Oracle Support if guidance is required.
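As an illustration of step 1, the following PromQL sketch computes the share of GET requests answered with 503 SERVICE_UNAVAILABLE out of all ingress responses. The Method and Status label names are taken from the step above; the namespace label value is an assumption.

  # Percentage of responses that are 503 SERVICE_UNAVAILABLE for GET requests
  (
    sum(rate(oc_ingressgateway_http_responses_total{namespace="ocudr",Method="GET",Status="503 SERVICE_UNAVAILABLE"}[5m]))
    /
    sum(rate(oc_ingressgateway_http_responses_total{namespace="ocudr"}[5m]))
  ) * 100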
4.1.2.37 OcudrProvisioningTransactionErrorRateAbove50Percent

Table 4-55 OcudrProvisioningTransactionErrorRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error Rate detected above 50 Percent of Total Transactions
Severity Critical
Condition Alert if all error rate exceeds 50% of the total transactions
OID 1.3.6.1.4.1.323.5.3.43.1.2.7047
Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failure transactions is below 50 percent of the total transactions.

Steps:

  1. Check the metrics per service and per method; for example, failed requests can be identified from the following metric and labels.

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. Contact My Oracle Support if guidance is required.
4.1.2.38 PVCFullForSLFExport

Table 4-56 PVCFullForSLFExport

Field Details
Description Storage for Export tool is full
Summary Storage for Export tool is full
Severity Critical
Condition Alert if PVC allocated for export tool dump path is full
Metric Used export_tool_full_usage
Recommended Actions The alert is cleared when the PVC usage is optimized. Configure maxDumps to a lower value to clear old dumps. Remove old dumps, if any, from the export tool container.
4.1.2.39 FailedExtractForSLFExport

Table 4-57 FailedExtractForSLFExport

Field Details
Description Export tool job has failed
Summary Export tool job has failed
Severity Critical
Condition Alert if the export operation fails
Metric Used export_failure
Recommended Actions Check the logs for the failure. The alert is cleared when the next export job succeeds.
4.1.2.40 BulkImportTransferInFailed

Table 4-58 BulkImportTransferInFailed

Field Details
Description Transfer-in failed for bulk import
Summary Transfer-in failed for bulk import
Severity Major
Condition Alert will be raised if the transfer-in from the remote server to the PVC fails
Metric Used bulkimport_transfer_in_status
Recommended Actions This alert is cleared when the bulk import transfer-in succeeds. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.41 ExportToolTransferOutFailed

Table 4-59 ExportToolTransferOutFailed

Field Details
Description Transfer-out failed for export-tool
Summary Transfer-out failed for export-tool
Severity Major
Condition Alert will be raised if the transfer-out from the PVC to the remote server fails
Metric Used sftp_transfer_status
Recommended Actions This alert is cleared when the export tool transfer-out succeeds. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.42 BulkImportTransferOutFailed

Table 4-60 BulkImportTransferOutFailed

Field Details
Description Transfer-out failed for bulk import
Summary Transfer-out failed for bulk import
Severity Major
Condition Alert will be raised if the transfer-out from the PVC to the remote server fails
Metric Used bulkimport_transfer_out_status
Recommended Actions This alert is cleared when the bulk import transfer-out succeeds. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.43 PVCFullForXMLBulkImport

Table 4-61 PVCFullForXMLBulkImport

Field Details
Description Storage for XML Bulk Import tool is full
Summary Storage for XML Bulk Import tool is full
Severity Critical
Condition Alert will be raised if the PVC for the XML-to-CSV container is full
Metric Used nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-xmltocsv",kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the PVC usage returns to normal. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.44 PVCFullForBulkImport

Table 4-62 PVCFullForBulkImport

Field Details
Description Storage for Bulk Import tool is full
Summary Storage for Bulk Import tool is full
Severity Critical
Condition Alert will be raised if the PVC for the bulk import container is full
Metric Used nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-bulk-import",kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the PVC usage returns to normal. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
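To check whether any of the import or export PVCs are currently reported as full, you can query the usage metrics from these tables directly. This is a minimal sketch that simply filters for the alerting value; the namespace label values are copied from the expressions above and may differ in your deployment.

  # Returns a series for every import or export PVC that is currently reported as full
  nudr_bulk_import_tool_pvc_full_usage{kubernetes_namespace="ocudr"} == 1
  or
  export_tool_full_usage{namespace="ocudr"} == 1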
4.1.2.45 OperationalStatusCompleteShutdown

Table 4-63 OperationalStatusCompleteShutdown

Field Details
Description Operational state is complete shutdown
Summary Operational state is complete shutdown
Severity Critical
Condition Alert will be raised if the operational state of UDR, SLF, or EIR is COMPLETE_SHUTDOWN
Metric Used nudr_config_operational_status{kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the operational status returns to normal. Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.46 NFScoreCalculationFailed

Table 4-64 NFScoreCalculationFailed

Field Details
Description NFScoreCalculationFailed
Summary NFScoreCalculationFailed
Severity Major
Condition Alert is raised if the NF score calculation fails for any of the scoring factors
Metric Used nfscore{kubernetes_namespace="ocudr" ,factor=~"successTPS|signallingConnections|serviceHealth|replicationHealth|localityPreference|bulkImport|bulkExport",calculatedStatus="failed"}
Recommended Actions

This alert is cleared when the NF score calculation is successful.

Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
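As an illustration of the metric above, the following PromQL sketch lists the scoring factors whose NF score calculation is currently reported as failed. It is derived directly from the expression in the table and assumes the same label names.

  # Scoring factors whose NF score calculation is currently failing
  nfscore{kubernetes_namespace="ocudr",calculatedStatus="failed"}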
4.1.2.47 PVCFullForUDRExport

Table 4-65 PVCFullForUDRExport

Field Details
Description Storage for Export tool is full
Summary Storage for Export tool is full
Severity Critical
Condition Alert is raised if PVC allocated for export tool dump path is full.
Metric Used export_tool_full_usage{namespace="ocudr"}==1
Recommended Actions

Alert is cleared when the PVC usage is optimized. You must configure maxDumps to a lower value to clear old dumps.

Steps:
  1. If present, remove the old dumps from the export tool container.
4.1.2.48 UDRExportFailed

Table 4-66 UDRExportFailed

Field Details
Description Export tool job has failed
Summary Export tool job has failed
Severity Critical
Condition Alert is raised if the export operation fails in UDR mode
Metric Used export_failure{namespace="ocudr"}== 1
Recommended Actions

Check the logs for the failure. The alert is cleared when the next export job succeeds.

4.1.2.49 IngressgatewayPodProtectionDocState

Table 4-67 IngressgatewayPodProtectionDocState

Field Details
Description Ingress congestion is in DOC state
Summary Ingress congestion is in DOC state
Severity Critical
Condition Alert is raised if the ingress congestion is in DOC state.
Metric Used oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==1
Recommended Actions This alert is cleared when the ingress gateway returns to the normal state.
Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.50 IngressgatewayPodProtectionCongestedState

Table 4-68 IngressgatewayPodProtectionCongestedState

Field Details
Description Ingress congestion is in CONGESTED state
Summary Ingress congestion is in CONGESTED state
Severity Critical
Condition Alert is raised if the ingress congestion is in CONGESTED state.
Metric Used oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==2
Recommended Actions This alert is cleared when the ingress gateway returns to the normal state.
Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.51 RetryNotificationRecordsMaxLimitExceeded

Table 4-69 RetryNotificationRecordsMaxLimitExceeded

Field Details
Description Alert will be raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Summary Alert will be raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Severity Critical
Condition Alert will be raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Metric Used nudr_notif_records_limit_exceeded{namespace="ocudr"}==1
Recommended Actions

This alert is raised when there are a large number of notification failures and more than 50,000 retry notifications are stored in the database.

Steps:
  1. Check the notification failure rate and fix the reason for the failures. This reduces the number of notifications marked for retry that are stored in the UDR database.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.52 UserAgentHeaderNotFoundMorethan10PercentRequest

Table 4-70 UserAgentHeaderNotFoundMorethan10PercentRequest

Field Details
Description Alert will be raised if the total number of requests without the User-Agent header is 10% or more of the ingress traffic when the suppress notification feature is enabled.
Summary Alert will be raised if the total number of requests without the User-Agent header is 10% or more of the ingress traffic when the suppress notification feature is enabled.
Severity Critical
Condition Alert will be raised if the total number of requests without the User-Agent header is 10% or more of the ingress traffic.
Metric Used (sum by(namespace)(rate(suppress_user_agent_not_found_total{namespace="ocudr"}[5m]))/sum by(namespace)(rate(oc_ingressgateway_http_requests_total{namespace="ocudr"}[5m])))*100 >= 10
Recommended Actions

This alert is cleared if the total number of requests without the User-Agent header is less than 10% of the ingress traffic.

Steps:
  1. Check the service specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.53 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Table 4-71 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
Severity Minor
Condition Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1300000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the minor threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not releasing enough memory on its own to fall below the threshold.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.54 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Table 4-72 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
Severity Major
Condition Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1500000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the major threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not releasing enough memory on its own to fall below the threshold.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.55 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Table 4-73 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
Severity Critical
Condition Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1800000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the critical threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not releasing enough memory on its own to fall below the threshold.
  2. Contact My Oracle Support, if guidance is required.
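To track how close each egress gateway pod is to these thresholds, the aggregation used in the expressions above can be run on its own. This is a minimal sketch; the namespace label value and the pod-name regular expression are copied from the expressions in the tables and may differ in your deployment.

  # Current JVM buffer memory per egress gateway pod, in bytes
  # (compare with the thresholds above: 1300000000 minor, 1500000000 major, 1800000000 critical)
  sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"})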
4.1.2.56 NudrDiameterGatewayDown

Table 4-74 NudrDiameterGatewayDown

Field Details
Description Alert will be raised if Nudr-diam-gateway service is down.
Summary Alert will be raised if Nudr-diam-gateway service is down.
Severity Critical
Condition Alert will be raised if Nudr-diam-gateway service is down.
Metric Used absent(up{container="nudr-diam-gateway",namespace="ocudr"}) or up{container="nudr-diam-gateway",namespace="ocudr"} == 0
Recommended Actions

This alert is cleared when the NudrDiamGateway service is available.

Steps:
    • Run the following command to check the status of the pods and look for liveness or readiness probe failures.
      kubectl get po -n <namespace>
    • Run the following command using the full name of the pod that is not running.
      kubectl describe pod <specific desired full pod name> -n <namespace>
  1. Refer to the application logs on Kibana and filter based on the service names. Check for ERROR and WARNING logs related to thread exceptions.
  2. Perform the resolution steps depending on the reason for failure.
  3. Contact My Oracle Support, if guidance is required.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.2.57 DiameterPeerConnectionsDropped

Table 4-75 DiameterPeerConnectionsDropped

Field Details
Description Alert will be raised if there are no connections between diameter peer and diameter gateway.
Summary Alert will be raised if there are no connections between diameter peer and diameter gateway.
Severity Major
Condition Alert will be raised if there are no connections between diameter peer and diameter gateway.
Metric Used sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0))< 2 or sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0)) < 2 or (sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",kubernetes_namespace="ocudr"} or vector(0)) + sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"}) or vector(0)) < 5
Recommended Actions

This alert is cleared when the NudrDiamGateway service is available.

Steps:
    • Run the following command to check the status of the pods and look for liveness or readiness probe failures.
      kubectl get po -n <namespace>
    • Run the following command using the full name of the pod that is not running.
      kubectl describe pod <specific desired full pod name> -n <namespace>
  1. Refer to the application logs on Kibana and filter based on the service names. Check for ERROR and WARNING logs related to thread exceptions.
  2. Perform the resolution steps depending on the reason for failure.
  3. Contact My Oracle Support, if guidance is required.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
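To see the current number of Diameter peer connections reported by the gateway, a simplified form of the expression above can be used. This is a minimal sketch: it drops the deployment-specific origHost filters shown in the table, and the namespace label value is an assumption.

  # Total Diameter peer connections currently reported by the Diameter Gateway
  sum(ocudr_diam_conn_network{container="nudr-diam-gateway",namespace="ocudr"})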

4.1.2.58 IGWSignallingPodProtectionDOCState

Table 4-76 IGWSignallingPodProtectionDOCState

Field Details
Description Alert will be raised when the ingress gateway signaling traffic is at DOC state.
Summary Alert will be raised when the ingress gateway signaling traffic is at DOC state.
Severity Major
Condition Alert will be raised when the ingress gateway signaling traffic is at DOC state.
Metric Used sum(oc_ingressgateway_congestion_system_state{namespace="ocudr",container="ingressgateway-sig"}) by (pod) == 2
Recommended Actions

This alert is cleared when the signaling traffic reaches NORMAL state.

Steps:
  1. Check the service specific metrics for the specific service request errors. For example, oc_ingressgateway_congestion_system_state.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.59 IGWSignallingPodProtectionCongestedState

Table 4-77 IGWSignallingPodProtectionCongestedState

Field Details
Description Alert will be raised when the ingress gateway signaling traffic is at CONGESTED state.
Summary Alert will be raised when the ingress gateway signaling traffic is at CONGESTED state.
Severity Critical
Condition Alert will be raised when the ingress gateway signaling traffic is at CONGESTED state.
Metric Used sum(oc_ingressgateway_congestion_system_state{namespace="ocudr",container="ingressgateway-sig"}) by (pod) == 3
Recommended Actions

This alert is cleared when the signaling traffic reaches NORMAL or DOC state.

Steps:
  1. Check the service specific metrics for the specific service request errors, for example, oc_ingressgateway_congestion_system_state (a query sketch is shown after this list).
  2. Contact My Oracle Support, if guidance is required.
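The congestion state of each signaling ingress gateway pod can be inspected directly with the metric used by these alerts. The following PromQL sketch is derived from the expressions in the tables above; the namespace and container label values are copied from those expressions and may differ in your deployment.

  # Current congestion state per signaling ingress gateway pod
  # (per the alerts above: 2 = DOC, 3 = CONGESTED)
  sum by (pod) (oc_ingressgateway_congestion_system_state{namespace="ocudr",container="ingressgateway-sig"})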
4.1.2.60 IGWSignallingPodProtectionByRateLimitRejectedRequest

Table 4-78 IGWSignallingPodProtectionByRateLimitRejectedRequest

Field Details
Description Alert will be raised when the total rejections exceed 1% of the total incoming traffic.
Summary Alert will be raised when the total rejections exceed 1% of the total incoming traffic.
Severity Critical
Condition Alert will be raised when the total rejections exceed 1% of the total incoming traffic.
Metric Used (sum (rate(oc_ingressgateway_http_request_ratelimit_denied_count_total{Action="REJECT",namespace="ocudr"}[2m]) or (up * 0 ) ) )/ sum(rate(oc_ingressgateway_http_requests_total{container="ingressgateway-sig",namespace="ocudr"}[2m])) * 100 >= 1
Recommended Actions

This alert is cleared when the rejections reduce to less than 1% of the total traffic.

Steps:
  1. Check the service specific metrics for the specific service request errors. For example, oc_ingressgateway_congestion_system_state.
  2. Contact My Oracle Support, if guidance is required.