4 Alert Configuration

This section describes how to configure alert rules for the UDR. The alerting system evaluates metrics reported by UDR microservices against the configured rule conditions and issues notifications when those conditions are met. For more information about configuring UDR alerts in Prometheus, see the “Alert Configuration” section in Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.

4.1 Alert Details

This section describes alerts in detail.

Note:

The maximum ingress request rate considered for these alerts is 1000 requests per second.

Table 4-1 Alert Levels or Severity Types

Alert Levels / Severity Types Definition
Critical Indicates a severe issue that poses a significant risk to safety, security, or operational integrity. It requires immediate response to address the situation and prevent serious consequences. Raised for conditions that may affect the service of UDR.
Major Indicates a more significant issue that has an impact on operations or poses a moderate risk. It requires prompt attention and action to mitigate potential escalation. Raised for conditions that may affect the service of UDR.
Minor Indicates a situation that is low in severity and does not pose an immediate risk to safety, security, or operations. It requires attention but does not demand urgent action. Raised for conditions that may affect the service of UDR.
Info or Warn (Informational) Provides general information or updates that are not related to immediate risks or actions. These alerts are for awareness and do not typically require any specific response. WARN and INFO alerts may not impact the service of UDR.

The following table lists the alert names for UDR/SLF and EIR.

Table 4-2 Alert names for UDR/SLF and EIR

UDR/SLF EIR
OcudrTrafficRateAboveMajorThreshold OceirTrafficRateAboveMajorThreshold
OcudrTrafficRateAboveMinorThreshold OceirTrafficRateAboveMinorThreshold
OcudrTrafficRateAboveCriticalThreshold OceirTrafficRateAboveCriticalThreshold
OcudrTransactionErrorRateAbove0.1Percent OceirTransactionErrorRateAbove0.1Percent
OcudrTransactionErrorRateAbove1Percent OceirTransactionErrorRateAbove1Percent
OcudrTransactionErrorRateAbove10Percent OceirTransactionErrorRateAbove10Percent
OcudrTransactionErrorRateAbove25Percent OceirTransactionErrorRateAbove25Percent
OcudrTransactionErrorRateAbove50Percent OceirTransactionErrorRateAbove50Percent
OcudrSubscriberNotFoundAbove1Percent OceirSubscriberNotFoundAbove1Percent
OcudrSubscriberNotFoundAbove10Percent OceirSubscriberNotFoundAbove10Percent
OcudrSubscriberNotFoundAbove25Percent OceirSubscriberNotFoundAbove25Percent
OcudrSubscriberNotFoundAbove50Percent OceirSubscriberNotFoundAbove50Percent
OcudrPodsRestart OceirPodsRestart
NudrServiceDown NudrServiceDown
NudrProvServiceDown NudrProvServiceDown
NudrNotifyServiceServiceDown NA
NudrNRFClientServiceDown NudrNRFClientServiceDown
NudrConfigServiceDown NudrConfigServiceDown
NudrDiameterProxyServiceDown NudrDiameterProxyServiceDown
NudrOnDemandMigrationServiceDown NA
OcudrIngressGatewayServiceDown OceirIngressGatewayServiceDown
OcudrEgressGatewayServiceDown OceirEgressGatewayServiceDown
OcudrDbServiceDown OceirDbServiceDown
OcudrXFCCValidationFailureAbove10Percent OceirXFCCValidationFailureAbove10Percent
OcudrXFCCValidationFailureAbove20Percent OceirXFCCValidationFailureAbove20Percent
OcudrXFCCValidationFailureAbove50Percent OceirXFCCValidationFailureAbove50Percent
DRServiceOverload60Percent DRServiceOverload60Percent
DRServiceOverload75Percent DRServiceOverload75Percent
DRServiceOverload80Percent DRServiceOverload80Percent
DRServiceOverload90Percent DRServiceOverload90Percent
SLFSucessTxnDefaultGroupIdRateAbove1Percent NA
SLFSucessTxnDefaultGroupIdRateAbove10Percent NA
SLFSucessTxnDefaultGroupIdRateAbove25Percent NA
SLFSucessTxnDefaultGroupIdRateAbove50Percent NA
OcudrDiameterCongestionCongestedState OceirDiameterCongestionCongestedState
OcudrDiameterCongestionDocState OceirDiameterCongestionDocState
DRProvServiceOverload60Percent DRProvServiceOverload60Percent
DRProvServiceOverload75Percent DRProvServiceOverload75Percent
DRProvServiceOverload80Percent DRProvServiceOverload80Percent
DRProvServiceOverload90Percent DRProvServiceOverload90Percent
OcudrIngressGatewayProvServiceDown OceirIngressGatewayProvServiceDown
OcudrProvisioningTrafficRateAboveMajorThreshold OceirProvisioningTrafficRateAboveMajorThreshold
OcudrProvisioningTrafficRateAboveCriticalThreshold OceirProvisioningTrafficRateAboveCriticalThreshold
OcudrProvisioningTransactionErrorRateAbove25Percent OceirProvisioningTransactionErrorRateAbove25Percent
OcudrProvisioningTransactionErrorRateAbove50Percent OceirProvisioningTransactionErrorRateAbove50Percent
PVCFullForSLFExport NA
FailedExtractForSLFExport NA
BulkImportTransferInFailed BulkImportTransferInFailed
BulkImportTransferOutFailed BulkImportTransferOutFailed
ExportToolTransferOutFailed ExportToolTransferOutFailed
PVCFullForXMLBulkImport PVCFullForXMLBulkImport
PVCFullForBulkImport PVCFullForBulkImport
OperationalStatusCompleteShutdown OperationalStatusCompleteShutdown
NFScoreCalculationFailed NFScoreCalculationFailed
PVCFullForUDRExport NA
UDRExportFailed NA
IngressgatewayPodProtectionDocState IngressgatewayPodProtectionDocState
IngressgatewayPodProtectionCongestedState IngressgatewayPodProtectionCongestedState
RetryNotificationRecordsMaxLimitExceeded RetryNotificationRecordsMaxLimitExceeded
UserAgentHeaderNotFoundMorethan10PercentRequest NA
EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold
EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold
EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold
NudrDiameterGatewayDown NudrDiameterGatewayDown
DiameterPeerConnectionsDropped DiameterPeerConnectionsDropped
IGWSignallingPodProtectionDOCState NA
IGWSignallingPodProtectionCongestedState NA
IGWSignallingPodProtectionByRateLimitRejectedRequest NA

Note:

In the following alert details, only the UDR alert names are provided. The corresponding EIR alert names can be found in Table 4-2.

4.1.1 System Level Alerts

This section lists the system level alerts.

4.1.1.1 OcudrSubscriberNotFoundAbove1Percent

Table 4-3 OcudrSubscriberNotFoundAbove1Percent

Field Details
Description Total number of response if subscriber not found is about 1% of ingress traffic
Summary Total number of response if subscriber not found is about 1% of ingress traffic
Severity Warning
Condition Alert if the number of Subscriber Not Found responses exceeds 1% of all ingress traffic
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7003

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7003

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7003 (For EIR alert name, see Alert Details)

Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 1% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.2 OcudrSubscriberNotFoundAbove10Percent

Table 4-4 OcudrSubscriberNotFoundAbove10Percent

Field Details
Description Total number of response if subscriber not found is about 10% of ingress traffic
Summary Total number of response if subscriber not found is about 10% of ingress traffic
Severity Minor
Condition Alert if the number of Subscriber Not Found responses exceeds 10% of all ingress traffic
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7003

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7003

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7003 (For EIR alert name, see Alert Details)

Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 10% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.3 OcudrSubscriberNotFoundAbove25Percent

Table 4-5 OcudrSubscriberNotFoundAbove25Percent

Field Details
Description Total number of response if subscriber not found is about 25% of ingress traffic
Summary Total number of response if subscriber not found is about 25% of ingress traffic
Severity Major
Condition Alert if the number of Subscriber Not Found responses exceeds 25% of all ingress traffic
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7003

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7003

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7003 (For EIR alert name, see Alert Details)

Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 25% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.1.4 OcudrSubscriberNotFoundAbove50Percent

Table 4-6 OcudrSubscriberNotFoundAbove50Percent

Field Details
Description Total number of response if subscriber not found is about 50% of ingress traffic
Summary Total number of response if subscriber not found is about 50% of ingress traffic
Severity Critical
Condition Alert if the number of Subscriber Not Found responses exceeds 50% of all ingress traffic
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7003

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7003

EIR: NA

Metric Used udr_subscriber_not_found_total
Recommended Actions

The alert is cleared when the number of Subscriber Not Found failures falls below 50% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
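
The four OcudrSubscriberNotFound alerts differ only in their threshold and severity. As a rough illustration, and not the shipped alert rule, the following Python sketch maps a Subscriber Not Found ratio to the highest matching severity. The function name and the use of raw request counts are assumptions made for this example; the actual rules are evaluated by Prometheus against udr_subscriber_not_found_total.

```python
# Thresholds (percent of ingress traffic) and severities from Tables 4-3 to 4-6,
# ordered from the highest threshold to the lowest.
NOT_FOUND_TIERS = [
    (50.0, "Critical"),
    (25.0, "Major"),
    (10.0, "Minor"),
    (1.0, "Warning"),
]


def subscriber_not_found_severity(not_found, ingress_total):
    """Return the highest severity whose threshold is exceeded, or None.

    Hypothetical helper: both arguments are request counts over the same window.
    """
    if ingress_total <= 0:
        return None  # no traffic, nothing to evaluate
    percent = 100.0 * not_found / ingress_total
    for threshold, severity in NOT_FOUND_TIERS:
        if percent > threshold:
            return severity
    return None
```

For example, 120 Subscriber Not Found responses out of 1000 ingress requests (12%) map to Minor. In a real deployment the lower-threshold alerts may remain active at the same time; returning only the highest tier is a simplification.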
4.1.1.5 OcudrNfStatusUnavailable

Table 4-7 OcudrNfStatusUnavailable

Field Details
Description OCUDR services unavailable
Summary OCUDR services unavailable
Severity Critical
Condition This alert is triggered if OCUDR services are unavailable.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7004

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7004

Metric Used absent(up{app_kubernetes_io_part_of="ocudr",kubernetes_namespace="ocudr"}) or sum(up{app_kubernetes_io_part_of="ocudr",kubernetes_namespace="ocudr"}) == 0
Recommended Actions The alert is cleared when all the OCUDR services are available.
Steps:
  1. Check the service-specific metrics to understand the specific service request errors.

    For example: absent(up{app_kubernetes_io_part_of="ocudr",kubernetes_namespace="ocudr"}) or sum(up{app_kubernetes_io_part_of="ocudr",kubernetes_namespace="ocudr"}) == 0

  2. If guidance is required, contact My Oracle Support.
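
The expression in the Metric Used field combines two cases: no matching up series exists at all, or every matching series reports 0. The following Python sketch mirrors that logic for a plain list of sample values. The function name and the list-based input are assumptions for illustration only.

```python
def nf_status_unavailable(up_samples):
    """Mirror `absent(up{...}) or sum(up{...}) == 0` for a list of 0/1 `up` values.

    An empty list stands for "no matching series" (the absent() case).
    """
    if not up_samples:
        return True
    return sum(up_samples) == 0
```

With this sketch, an empty list and a list of all zeros both report the NF as unavailable, while a single healthy target is enough to keep the alert from firing.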
4.1.1.6 OcudrPodsRestart

Table 4-8 OcudrPodsRestart

Field Details
Description Pod {{$labels.pod}} has restarted.
Summary namespace: {{$labels.namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : A Pod has restarted
Severity Major
Condition Alert if any pod has restarted
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7005

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7005

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7005 (For EIR alert name, see Alert Details)

Metric Used kube_pod_container_status_restarts_total
Recommended Actions

The alert is cleared automatically if the specific pod is up.

Steps:

  1. Refer to the application logs on Kibana and filter based on the pod name. Check for database-related failures such as connectivity issues, Kubernetes secrets, and so on.
  2. Check orchestration logs for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running and use it in the following command.

    kubectl describe pod <desired full pod name> -n <namespace>

  3. Check the DB status. For more information, see Oracle Communications Cloud Native Core, cnDBTier User Guide.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
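
When many pods are listed, the output of the first command can also be inspected programmatically. The following Python sketch is one possible way to do that, assuming the command is run with the standard -o json output option and its text is passed in as a string. The function name is hypothetical; the fields read (metadata.name, status.phase, status.containerStatuses[].restartCount) are standard Kubernetes Pod fields.

```python
import json


def pods_needing_attention(kubectl_json):
    """Return (pod name, phase, restart count) for pods that are not Running
    or that have at least one container restart.

    `kubectl_json` is the text printed by `kubectl get po -n <namespace> -o json`.
    """
    findings = []
    for pod in json.loads(kubectl_json).get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        # Sum restarts over all containers reported for the pod.
        restarts = sum(
            cs.get("restartCount", 0)
            for cs in status.get("containerStatuses", [])
        )
        if phase != "Running" or restarts > 0:
            findings.append((name, phase, restarts))
    return findings
```

Each pod name returned can then be used with kubectl describe pod, as described in the steps above. Note that this sketch also flags pods in the Succeeded phase, which may be expected for completed jobs.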

4.1.1.7 NudrServiceDown

Table 4-9 NudrServiceDown

Field Details
Description OCUDR Nudr_DRService {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Service is down
Severity Critical
Condition Alert if Nudr-dr service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7006

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7006

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7006

Metric Used app_kubernetes_io_name="nudr-drservice"
Recommended Actions

The alert is cleared when the NudrService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.8 NudrProvServiceDown

Table 4-10 NudrProvServiceDown

Field Details
Description OCUDR Nudr_DR_PROVService {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Prov Service is down
Severity Critical
Condition Alert if Nudr-dr-prov service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7016

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7015

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7014

Metric Used app_kubernetes_io_name="nudr-dr-provservice"
Recommended Actions

The alert is cleared when the NudrProvService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.9 NudrNotifyServiceServiceDown

Table 4-11 NudrNotifyServiceServiceDown

Field Details
Description OCUDR NudrNotifyServiceService {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Nudr Notify Service down.
Severity Critical
Condition Alert if Nudr Notify service is down
OID 1.3.6.1.4.1.323.5.3.43.1.2.7016
Metric Used app_kubernetes_io_name="nudr-notify-service"
Recommended Actions

The alert is cleared when the NotifyService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.10 NudrNRFClientServiceDown

Table 4-12 NudrNRFClientServiceDown

Field Details
Description OCUDR NRFClient service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NRF Client service down
Severity Critical
Condition Alert if Nudr Nrf Client service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7007

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7007

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7007

Metric Used app_kubernetes_io_name="nrf-client-nfmanagement"
Recommended Actions

The alert is cleared when the NRFClientService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.11 NudrConfigServiceDown

Table 4-13 NudrConfigServiceDown

Field Details
Description OCUDR config service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-config service down
Severity Critical
Condition Alert if Nudr Config service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7010

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7008

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7008

Metric Used app_kubernetes_io_name="nudr-config"
Recommended Actions

The alert is cleared when the ConfigService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.12 NudrDiameterProxyServiceDown

Table 4-14 NudrDiameterProxyServiceDown

Field Details
Description OCUDR diameterproxy service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-diameterproxy service is down
Severity Critical
Condition Alert if Nudr Diameter Proxy is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7008

SLF: NA

EIR: NA

Metric Used app_kubernetes_io_name="nudr-diameterproxy"
Recommended Actions

The alert is cleared when the DiameterProxyService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.13 NudrOnDemandMigrationServiceDown

Table 4-15 NudrOnDemandMigrationServiceDown

Field Details
Description OCUDR ondemand-migration service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NFSubscription service is down
Severity Critical
Condition Alert if Nudr On Demand Migration is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7009

SLF: NA

EIR: NA

Metric Used app_kubernetes_io_name="nudr-ondemand-migration"
Recommended Actions

The alert is cleared when the OnDemandMigrationService service is available.

Steps:

  1. Check the orchestration logs of appinfo service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.14 OcudrIngressGatewayServiceDown

Table 4-16 OcudrIngressGatewayServiceDown

Field Details
Description OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity Critical
Condition Alert if Ingress Service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7011

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7009

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7009 (For EIR alert name, see Alert Details)

Metric Used app_kubernetes_io_name="ingressgateway"
Recommended Actions

The alert is cleared when the ingressgateway service is available.

Steps:

  1. Check the orchestration logs of ingress-gateway service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on ingress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.15 OcudrEgressGatewayServiceDown

Table 4-17 OcudrEgressGatewayServiceDown

Field Details
Description OCUDR Egress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Egress-Gateway service down
Severity Critical
Condition Alert if Egress Service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7012

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7010

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7010 (For EIR alert name, see Alert Details)

Metric Used app_kubernetes_io_name="egressgateway"
Recommended Actions

The alert is cleared when the egressgateway service is available.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

  1. Check the orchestration logs of egress-gateway service and check for liveness or readiness probe failures using the following commands.

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on egress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. If the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note: Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.16 OcudrDbServiceDown

Table 4-18 OcudrDbServiceDown

Field Details
Description Mysql connectivity service is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : MySQL connectivity service down
Severity Critical
Condition Alert if Mysql connectivity is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7013

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7011

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7011 (For EIR alert name, see Alert Details)

Metric Used appinfo_service_running
Recommended Actions This alert clears when the microservice nudr-drservice is up and running.
4.1.1.17 OcudrIngressGatewayProvServiceDown

Table 4-19 OcudrIngressGatewayProvServiceDown

Field Details
Description OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity Critical
Condition Alert if Ingressgateway-prov service is down
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7019

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7017

EIR: NA

Metric Used app_kubernetes_io_name="ingressgateway-prov"
Recommended Actions The alert is cleared when the ingress-gateway service is available.

Steps:

  1. Check the orchestration logs of the ingress-gateway service and check for liveness or readiness probe failures using the following commands:

    kubectl get po -n <namespace>

    Note the full name of the pod that is not running. It must be used in the following command:

    kubectl describe pod <specific desired full pod name> -n <namespace>

  2. Refer to the application logs on Kibana and filter based on the ingress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions.
  3. Depending on the failure reason, take the resolution steps.
  4. In case the issue persists, capture all the outputs for the above steps and contact My Oracle Support.

    Note:

    Use the CNC NF Data Collector tool for capturing logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.2 Application Level Alerts

This section lists the application level alerts.

4.1.2.1 OcudrSignallingTrafficRateAboveMajorThreshold

Table 4-20 OcudrSignallingTrafficRateAboveMajorThreshold

Field Details
Description Ingress traffic Rate is above major threshold i.e. 900 requests per second
Summary Traffic Rate is above 90 Percent of Max requests per second(1000)
Severity Major
Condition Alert if Ingress traffic reaches 90% of max TPS
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7001

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7001

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7001 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the Major threshold.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.2 OcudrSignallingTrafficRateAboveMinorThreshold

Table 4-21 OcudrSignallingTrafficRateAboveMinorThreshold

Field Details
Description Ingress traffic rate is above minor threshold i.e. 800 requests per second
Summary Traffic rate is above 80 Percent of Max requests per second(1000)
Severity Minor
Condition Alert if Ingress traffic reaches 80% of max TPS
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7001

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7001

EIR: NA

Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared either when the total Ingress Traffic rate falls below the Minor threshold or when the total traffic rate crosses the Major threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a georedundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether there is an increase in 4xx and 5xx error codes.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.3 OcudrSignallingTrafficRateAboveCriticalThreshold

Table 4-22 OcudrSignallingTrafficRateAboveCriticalThreshold

Field Details
Description Ingress traffic Rate is above critical threshold i.e. 950 requests per second
Summary Traffic Rate is above 95 Percent of Max requests per second(1000)
Severity Critical
Condition Alert if Ingress traffic reaches 95% of max TPS
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7001

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7001

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7001 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the Ingress Traffic rate falls below the Critical threshold.

Note: The threshold is configurable in the UDR_Alertrules.yaml

Steps:

Reassess why the OCUDR is receiving additional traffic (Example: Mated site OCUDR is unavailable in geo redundancy scenario).

If this is unexpected, contact My Oracle Support and:
  1. Refer to Grafana to determine which service is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased.
  3. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
4.1.2.4 OcudrSignallingTransactionErrorRateAbove0.1Percent

Table 4-23 OcudrSignallingTransactionErrorRateAbove0.1Percent

Field Details
Description Transaction error rate is above 0.1 Percent of Total Transactions
Summary Transaction Error Rate detected above 0.1 Percent of Total Transactions
Severity Warning
Condition Alert if the overall error rate exceeds 0.1% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7002

SLF: NA

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7002 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 0.1 percent of the total transactions or when the number of failed transactions crosses the 1% threshold, in which case the OcudrSignallingTransactionErrorRateAbove1Percent alert is raised.

Steps:

  1. Check metrics per service, per method

    For example, discovery requests can be deduced from these metrics

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
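
Step 1 can be approximated with a Prometheus recording rule that tracks one method and status combination. This is a minimal sketch: the group name, the recorded series name, and the 5-minute window are assumptions, and the `Method` and `Status` label names are taken from the example values listed in the step.

```yaml
groups:
  - name: example-udr-troubleshooting             # hypothetical group name
    rules:
      # Per-second rate of GET requests answered with 503. One series is
      # produced per remaining label combination on the source metric.
      - record: example:ingress_get_503:rate5m    # hypothetical series name
        expr: |
          rate(oc_ingressgateway_http_responses_total{Method="GET",Status="503 SERVICE_UNAVAILABLE"}[5m])
```

Changing the `Method` or `Status` matcher narrows the view to a different request type or error code.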
4.1.2.5 OcudrSignallingTransactionErrorRateAbove1Percent

Table 4-24 OcudrSignallingTransactionErrorRateAbove1Percent

Field Details
Description Transaction Error rate is above 1 Percent of Total Transactions
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Warning
Condition Alert if the overall error rate exceeds 1% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7002

SLF: NA

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7002 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 1% of the total transactions or when the number of failed transactions crosses the 10% threshold, in which case the OcudrSignallingTransactionErrorRateAbove10Percent alert is raised.

Steps:

  1. Check metrics per service, per method.

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.6 OcudrSignallingTransactionErrorRateAbove10Percent

Table 4-25 OcudrSignallingTransactionErrorRateAbove10Percent

Field Details
Description Transaction error rate is above 10 Percent of Total Transactions
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Minor
Condition Alert if the overall error rate exceeds 10% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7002

SLF: NA

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7002 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 10% of the total transactions or when the number of failed transactions crosses the 25% threshold, in which case the OcudrSignallingTransactionErrorRateAbove25Percent alert is raised.

Steps:

  1. Check metrics per service, per method.

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.7 OcudrSignallingTransactionErrorRateAbove25Percent

Table 4-26 OcudrSignallingTransactionErrorRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error Rate detected above 25 Percent of Total Transactions
Severity Major
Condition Alert if the overall error rate exceeds 25% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7002

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7002

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7002 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 25% of the total transactions or when the number of failed transactions crosses the 50% threshold, in which case the OcudrSignallingTransactionErrorRateAbove50Percent alert is raised.

Steps:

  1. Check metrics per service, per method.

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.8 OcudrSignallingTransactionErrorRateAbove50Percent

Table 4-27 OcudrSignallingTransactionErrorRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error Rate detected above 50 Percent of Total Transactions
Severity Critical
Condition Alert if the overall error rate exceeds 50% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7002

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7002

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7002 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 50 percent of the total transactions.

Steps:

  1. Check metrics per service, per method.

    For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
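
The five signalling error-rate alerts above form a ladder of non-overlapping bands. A hedged sketch of that ladder as Prometheus rules follows. The group, series, and alert names, the 5-minute window, and the choice to count every non-2xx `Status` value as an error are assumptions made for illustration. They are not the shipped rule definitions.

```yaml
groups:
  - name: example-udr-signalling-error-rate       # hypothetical group name
    rules:
      # Percentage of non-2xx responses over the last 5 minutes (assumed window).
      - record: example:ingress_error_percent:rate5m    # hypothetical series name
        expr: |
          100 * sum(rate(oc_ingressgateway_http_responses_total{Status!~"2.*"}[5m]))
            / sum(rate(oc_ingressgateway_http_responses_total[5m]))
      # Each band ends where the next one starts, so one tier fires at a time.
      - alert: ExampleErrorRateAbove0_1Percent           # hypothetical alert names
        expr: example:ingress_error_percent:rate5m > 0.1 <= 1
        labels:
          severity: warning
      - alert: ExampleErrorRateAbove1Percent
        expr: example:ingress_error_percent:rate5m > 1 <= 10
        labels:
          severity: warning
      - alert: ExampleErrorRateAbove10Percent
        expr: example:ingress_error_percent:rate5m > 10 <= 25
        labels:
          severity: minor
      - alert: ExampleErrorRateAbove25Percent
        expr: example:ingress_error_percent:rate5m > 25 <= 50
        labels:
          severity: major
      - alert: ExampleErrorRateAbove50Percent
        expr: example:ingress_error_percent:rate5m > 50
        labels:
          severity: critical
```

Recording the percentage once and reusing it in each alert keeps the thresholds easy to compare. The severities mirror the Severity fields in the tables above.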
4.1.2.9 OcudrXFCCValidationFailureAbove10Percent

Table 4-28 OcudrXFCCValidationFailureAbove10Percent

Field Details
Description Total number of responses with XFCC validation failure is above 10% of ingress traffic
Summary Total number of responses with XFCC validation failure is above 10% of ingress traffic
Severity Minor
Condition Alert if XFCC validation failures exceed 10% of the total XFCC validations
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7014

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7012

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7012 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 10% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.10 OcudrXFCCValidationFailureAbove20Percent

Table 4-29 OcudrXFCCValidationFailureAbove20Percent

Field Details
Description Total number of responses with XFCC validation failure is above 20% of ingress traffic
Summary Total number of responses with XFCC validation failure is above 20% of ingress traffic
Severity Major
Condition Alert if XFCC validation failures exceed 20% of the total XFCC validations
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7014

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7012

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7012 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 20% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.11 OcudrXFCCValidationFailureAbove50Percent

Table 4-30 OcudrXFCCValidationFailureAbove50Percent

Field Details
Description Total number of responses with XFCC validation failure is above 50% of ingress traffic
Summary Total number of responses with XFCC validation failure is above 50% of ingress traffic
Severity Critical
Condition Alert if XFCC validation failures exceed 50% of the total XFCC validations
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7014

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7012

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7012 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_xfcc_header_validate_total
Recommended Actions

The alert is cleared when the number of XFCC validation failures falls below 50% of the total.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.12 DRServiceOverload60Percent

Table 4-31 DRServiceOverload60Percent

Field Details
Description This alert is raised when the application reaches the Warn overload level
Summary This alert is raised when the application reaches the Warn overload level
Severity Warning
Condition Alert if the application overloads at 60%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7013

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7013

Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic falls below the Warn level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. If guidance is required, contact My Oracle Support.
4.1.2.13 DRServiceOverload75Percent

Table 4-32 DRServiceOverload75Percent

Field Details
Description This alert is raised when the application reaches the Minor overload level
Summary This alert is raised when the application reaches the Minor overload level
Severity Minor
Condition Alert if the application overloads at 75%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7013

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7013

Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic falls below the Minor level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. If guidance is required, contact My Oracle Support.
4.1.2.14 DRServiceOverload80Percent

Table 4-33 DRServiceOverload80Percent

Field Details
Description This alert is raised when the application reaches the Major overload level
Summary This alert is raised when the application reaches the Major overload level
Severity Major
Condition Alert if the application overloads at 80%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7013

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7013

Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic falls below the Major level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. If guidance is required, contact My Oracle Support.
4.1.2.15 DRServiceOverload90Percent

Table 4-34 DRServiceOverload90Percent

Field Details
Description This alert is raised when the application reaches the Critical overload level
Summary This alert is raised when the application reaches the Critical overload level
Severity Critical
Condition Alert if the application overloads at 90%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7013

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7013

Metric Used load_level
Recommended Actions This alert is cleared when the incoming traffic falls below the Critical level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. If guidance is required, contact My Oracle Support.
4.1.2.16 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Table 4-35 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Field Details
Description Transaction Error Rate detected above 1 Percent of Total Transactions
Summary Transaction Error rate is above 1 Percent of Total Transactions
Severity Warning
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 1% of the total responses.
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7014

EIR: NA

Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the number of SLF Lookup requests for subscribers that are not provisioned decreases.

Steps:

Check the subscriber range received in Lookup requests and make sure that no requests are sent for unexpected, out-of-range subscribers.

4.1.2.17 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Table 4-36 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Field Details
Description Transaction Error Rate detected above 10 Percent of Total Transactions
Summary Transaction Error rate is above 10 Percent of Total Transactions
Severity Minor
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 10% of the total responses.
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7014

EIR: NA

Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the number of SLF Lookup requests for subscribers that are not provisioned decreases.

Steps:

Check the subscriber range received in Lookup requests and make sure that no requests are sent for unexpected, out-of-range subscribers.

4.1.2.18 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Table 4-37 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error rate is above 25 Percent of Total Transactions
Severity Major
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 25% of the total responses.
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7014

EIR: NA

Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the number of SLF Lookup requests for subscribers that are not provisioned decreases.

Steps:

Check the subscriber range received in Lookup requests and make sure that no requests are sent for unexpected, out-of-range subscribers.

4.1.2.19 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Table 4-38 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error rate is above 50 Percent of Total Transactions
Severity Critical
Condition Alert if number of SLF Lookup requests responded with default Group ID exceeds 50% of the total responses.
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7014

EIR: NA

Metric Used slf_sucess_txn_default_grp_id_total
Recommended Actions

This alert is cleared when the number of SLF Lookup requests for subscribers that are not provisioned decreases.

Steps:

Check the subscriber range received in Lookup requests and make sure that no requests are sent for unexpected, out-of-range subscribers.

4.1.2.20 OcudrDiameterCongestionCongestedState

Table 4-39 OcudrDiameterCongestionCongestedState

Field Details
Description Alert will be raised if the diameter gateway pod is in CONGESTED state.
Summary Alert will be raised if the diameter gateway pod is in CONGESTED state.
Severity Critical
Condition Alert will be raised if the diameter gateway pod is in CONGESTED state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7018

SLF: NA

EIR: NA

Metric Used ocudr_pod_congestion_state == 2
Recommended Actions

This alert is raised when the Diameter Gateway pod congestion level is set to the CONGESTED state.

Steps:

  1. Reduce the traffic load or allocate resources appropriate for the expected performance.
  2. Check the pod congestion configurations and resource limits in CNC Console.
4.1.2.21 OcudrDiameterCongestionDocState

Table 4-40 OcudrDiameterCongestionDocState

Field Details
Description Alert will be raised if the diameter gateway pod is in Danger of Congestion (DOC) state.
Summary Alert will be raised if the diameter gateway pod is in Danger of Congestion (DOC) state.
Severity Major
Condition Alert will be raised if the diameter gateway pod is in Danger of Congestion (DOC) state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7018

SLF: NA

EIR: NA

Metric Used ocudr_pod_congestion_state == 1
Recommended Actions

This alert is raised when the Diameter Gateway pod congestion level is set to the Danger of Congestion (DOC) state.

Steps:

  1. Reduce the traffic load or allocate resources appropriate for the expected performance.
  2. Check the pod congestion configurations and resource limits in CNC Console.
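
The two Diameter Gateway congestion alerts differ only in the value of `ocudr_pod_congestion_state` that they test. A minimal sketch of the pair as Prometheus rules is shown below. The group name is hypothetical, and label selectors and `for` durations are omitted, so this is an outline of the condition rather than the shipped rule.

```yaml
groups:
  - name: example-udr-diameter-congestion         # hypothetical group name
    rules:
      - alert: OcudrDiameterCongestionDocState
        expr: ocudr_pod_congestion_state == 1     # 1 = Danger of Congestion (DOC)
        labels:
          severity: major
      - alert: OcudrDiameterCongestionCongestedState
        expr: ocudr_pod_congestion_state == 2     # 2 = CONGESTED
        labels:
          severity: critical
```

Because each expression matches a single state value, a pod that moves from DOC to CONGESTED resolves the Major alert and raises the Critical one.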
4.1.2.22 DRProvServiceOverload60Percent

Table 4-41 DRProvServiceOverload60Percent

Field Details
Description This alert is raised when the application reaches the Warn overload level
Summary This alert is raised when the application reaches the Warn overload level
Severity Warning
Condition Alert if the application overloads at 60%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7017

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7016

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic falls below the Warn level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.23 DRProvServiceOverload75Percent

Table 4-42 DRProvServiceOverload75Percent

Field Details
Description This alert is raised when the application reaches the Minor overload level
Summary This alert is raised when the application reaches the Minor overload level
Severity Minor
Condition Alert if the application overloads at 75%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7017

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7016

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic falls below the Minor level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.24 DRProvServiceOverload80Percent

Table 4-43 DRProvServiceOverload80Percent

Field Details
Description This alert is raised when the application reaches the Major overload level
Summary This alert is raised when the application reaches the Major overload level
Severity Major
Condition Alert if the application overloads at 80%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7017

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7016

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic falls below the Major level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.25 DRProvServiceOverload90Percent

Table 4-44 DRProvServiceOverload90Percent

Field Details
Description This alert is raised when the application reaches the Critical overload level
Summary This alert is raised when the application reaches the Critical overload level
Severity Critical
Condition Alert if the application overloads at 90%
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7017

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7016

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7015

Metric Used load_level
Recommended Actions

This alert is cleared when the incoming traffic falls below the Critical level.

Steps:

  1. Check the service-specific metrics to understand the specific service request errors.

    Example: udr_rest_failure_response_total

  2. If guidance is required, contact My Oracle Support.
4.1.2.26 OcudrProvisioningTrafficRateAboveMajorThreshold

Table 4-45 OcudrProvisioningTrafficRateAboveMajorThreshold

Field Details
Description Ingress traffic Rate is above major threshold, that is, 900 requests per second
Summary Traffic Rate is above 90 Percent of Max requests per second (1000)
Severity Major
Condition Alert if Ingress traffic reaches 90% of max TPS
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7020

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7018

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7017 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_requests_total
Recommended Actions The alert is cleared when the total Ingress traffic rate falls below the Major threshold or when the total traffic rate exceeds the Critical threshold, in which case the OcudrProvisioningTrafficRateAboveCriticalThreshold alert is raised.

Note:

The threshold is configurable in UDR_Alertrules.yaml.

Steps:

Reassess why OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support.
  1. Refer to Grafana to determine the service that is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased.
  3. Check the Ingress gateway logs on Kibana to determine the reason for the errors.
4.1.2.27 OcudrProvisioningTrafficRateAboveCriticalThreshold

Table 4-46 OcudrProvisioningTrafficRateAboveCriticalThreshold

Field Details
Description Ingress traffic Rate is above critical threshold, that is, 950 requests per second
Summary Traffic Rate is above 95 Percent of Max requests per second (1000)
Severity Critical
Condition Alert if Ingress traffic reaches 95% of max TPS
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7020

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7018

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7017 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_requests_total
Recommended Actions

The alert is cleared when the Ingress traffic rate falls below the Critical threshold.

Note:

The threshold is configurable in UDR_Alertrules.yaml.

Steps:

Reassess why OCUDR is receiving additional traffic (for example, the mated site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support.
  1. Refer to Grafana to determine the service that is receiving high traffic.
  2. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased.
  3. Check the Ingress gateway logs on Kibana to determine the reason for the errors.
4.1.2.28 OcudrProvisioningTransactionErrorRateAbove25Percent

Table 4-47 OcudrProvisioningTransactionErrorRateAbove25Percent

Field Details
Description Transaction Error Rate detected above 25 Percent of Total Transactions
Summary Transaction Error Rate detected above 25 Percent of Total Transactions
Severity Major
Condition Alert if the overall error rate exceeds 25% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7021

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7019

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7018 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions

The alert is cleared when the number of failed transactions is below 25% of the total transactions or when the number of failed transactions exceeds the 50% threshold, in which case the OcudrProvisioningTransactionErrorRateAbove50Percent alert is raised.

Steps:

  1. Check the metrics per service and per method. For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.29 OcudrProvisioningTransactionErrorRateAbove50Percent

Table 4-48 OcudrProvisioningTransactionErrorRateAbove50Percent

Field Details
Description Transaction Error Rate detected above 50 Percent of Total Transactions
Summary Transaction Error Rate detected above 50 Percent of Total Transactions
Severity Critical
Condition Alert if the overall error rate exceeds 50% of the total transactions
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7021

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7019

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7018 (For EIR alert name, see Alert Details)

Metric Used oc_ingressgateway_http_responses_total
Recommended Actions The alert is cleared when the number of failed transactions is below 50 percent of the total transactions.

Steps:

  1. Check the metrics per service and per method. For example, discovery requests can be deduced from these metrics:

    Metrics="oc_ingressgateway_http_responses_total"

    Method="GET"

    Status="503 SERVICE_UNAVAILABLE"

  2. If guidance is required, contact My Oracle Support.
4.1.2.30 PVCFullForSLFExport

Table 4-49 PVCFullForSLFExport

Field Details
Description Storage for Export tool is full
Summary Storage for Export tool is full
Severity Critical
Condition Alert if PVC allocated for export tool dump path is full
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7020

EIR: NA

Metric Used export_tool_full_usage
Recommended Actions The alert is cleared when the PVC usage is optimized. Configure maxDumps to a lower value to clear old dumps, and remove any old dumps from the export tool container.
4.1.2.31 FailedExtractForSLFExport

Table 4-50 FailedExtractForSLFExport

Field Details
Description Export tool job has failed
Summary Export tool job has failed
Severity Critical
Condition Alert if the export operation fails
OID UDR: NA

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7021

EIR: NA

Metric Used export_failure
Recommended Actions Check the logs for the cause of the failure. The alert is cleared when the next export job succeeds.
4.1.2.32 BulkImportTransferInFailed

Table 4-51 BulkImportTransferInFailed

Field Details
Description Transfer-in failed for bulk import
Summary Transfer-in failed for bulk import
Severity Major
Condition Alert is raised if Transfer-In from Remote to PVC fails
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7022

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7022

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7019

Metric Used bulkimport_transfer_in_status
Recommended Actions This alert is cleared when the transfer-in for bulk import succeeds.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.33 ExportToolTransferOutFailed

Table 4-52 ExportToolTransferOutFailed

Field Details
Description Transfer-out failed for export-tool
Summary Transfer-out failed for export-tool
Severity Major
Condition Alert is raised if Transfer-Out from PVC to Remote fails
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7024

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7024

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7021

Metric Used sftp_transfer_status
Recommended Actions This alert is cleared when the transfer-out from the export tool succeeds.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.34 BulkImportTransferOutFailed

Table 4-53 BulkImportTransferOutFailed

Field Details
Description Transfer-out failed for bulk import
Summary Transfer-out failed for bulk import
Severity Major
Condition Alert is raised if Transfer-Out from PVC to Remote fails
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7023

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7023

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7020

Metric Used bulkimport_transfer_out_status
Recommended Actions This alert is cleared when the transfer-out for bulk import succeeds.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.35 PVCFullForXMLBulkImport

Table 4-54 PVCFullForXMLBulkImport

Field Details
Description Storage for XML Bulk Import tool is full
Summary Storage for XML Bulk Import tool is full
Severity Critical
Condition Alert will be raised if the PVC is full for xml-csv container
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7025

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7025

EIR: NA

Metric Used nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-xmltocsv",kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the PVC usage is back to normal.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.36 PVCFullForBulkImport

Table 4-55 PVCFullForBulkImport

Field Details
Description Storage for Bulk Import tool is full
Summary Storage for Bulk Import tool is full
Severity Critical
Condition Alert will be raised if the PVC is full for bulk import container
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7026

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7026

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7025

Metric Used nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-bulk-import",kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the PVC usage is back to normal.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.37 OperationalStatusCompleteShutdown

Table 4-56 OperationalStatusCompleteShutdown

Field Details
Description Operational state is complete shutdown
Summary Operational state is complete shutdown
Severity Critical
Condition Alert will be raised if the operational state of the UDR, SLF, or EIR is COMPLETE_SHUTDOWN
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7027

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7027

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7026

Metric Used nudr_config_operational_status{kubernetes_namespace="ocudr"}==1
Recommended Actions This alert is cleared when the operational status is back to normal.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.38 NFScoreCalculationFailed

Table 4-57 NFScoreCalculationFailed

Field Details
Description NFScoreCalculationFailed
Summary NFScoreCalculationFailed
Severity Major
Condition Alert is raised if the NF Score calculation fails for any of the scoring factors
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7028

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7028

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7027

Metric Used nfscore{kubernetes_namespace="ocudr",factor=~"successTPS|signallingConnections|serviceHealth|replicationHealth|localityPreference|bulkImport|bulkExport",calculatedStatus="failed"}
Recommended Actions

This alert is cleared when the NF score calculation is successful.

Steps:
  1. Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.39 PVCFullForUDRExport

Table 4-58 PVCFullForUDRExport

Field Details
Description Storage for Export tool is full
Summary Storage for Export tool is full
Severity Critical
Condition Alert is raised if PVC allocated for export tool dump path is full.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7030

SLF: NA

EIR: NA

Metric Used export_tool_full_usage{namespace="ocudr"}==1
Recommended Actions

Alert is cleared when the PVC usage is optimized. You must configure maxDumps to a lower value to clear old dumps.

Steps:
  1. If present, remove the old dumps from the export tool container.
4.1.2.40 UDRExportFailed

Table 4-59 UDRExportFailed

Field Details
Description Export tool job has failed
Summary Export tool job has failed
Severity Critical
Condition Alert is raised if the export operation fails for UDR Mode
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7031

SLF: NA

EIR: NA

Metric Used export_failure{namespace="ocudr"}== 1
Recommended Actions

Check the logs for the cause of the failure. The alert is cleared when the next export job succeeds.

4.1.2.41 IngressgatewayPodProtectionDocState

Table 4-60 IngressgatewayPodProtectionDocState

Field Details
Description Ingress congestion in Danger of Congestion (DOC) state
Summary Ingress congestion in Danger of Congestion (DOC) state
Severity Critical
Condition Alert is raised if Ingress congestion is in the Danger of Congestion (DOC) state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7032

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7029

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7028

Metric Used oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==1
Recommended Actions

This alert is cleared when the ingress gateway returns to normal state.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.42 IngressgatewayPodProtectionCongestedState

Table 4-61 IngressgatewayPodProtectionCongestedState

Field Details
Description Ingress congestion in Congested state
Summary Ingress congestion in Congested state
Severity Critical
Condition Alert is raised if ingress congestion is in congested state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7033

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7030

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7029

Metric Used oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==2
Recommended Actions

This alert is cleared when the ingress gateway returns to normal state.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.43 RetryNotificationRecordsMaxLimitExceeded

Table 4-62 RetryNotificationRecordsMaxLimitExceeded

Field Details
Description Alert will be raised if the retry notifications stored in the UDR database exceed the maximum limit.
Summary Alert will be raised if the retry notifications stored in the UDR database exceed the maximum limit.
Severity Critical
Condition Alert will be raised if the retry notifications stored in the UDR database exceed the maximum limit.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7036

SLF: NA

EIR: NA

Metric Used nudr_notif_records_limit_exceeded{namespace="ocudr"}==1
Recommended Actions

This alert is raised when notification failures increase and more than 50k retry notifications are stored in the database.

Steps:
  1. Check the notification failure rate and fix the reason for failures. This reduces the number of notifications marked for retry that are stored in the UDR database.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.44 UserAgentHeaderNotFoundMorethan10PercentRequest

Table 4-63 UserAgentHeaderNotFoundMorethan10PercentRequest

Field Details
Description Alert will be raised if the total number of requests not having User-Agent header is 10% or more of ingress traffic when suppress notification feature is enabled.
Summary Alert will be raised if the total number of requests not having User-Agent header is 10% or more of ingress traffic when suppress notification feature is enabled.
Severity Critical
Condition Alert will be raised if the total number of requests not having User-Agent header is 10% or more of ingress traffic.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7035

SLF: NA

EIR: NA

Metric Used (sum by(namespace)(rate(suppress_user_agent_not_found_total{namespace="ocudr"}[5m]))/sum by(namespace)(rate(oc_ingressgateway_http_requests_total{namespace="ocudr"}[5m])))*100 >= 10
Recommended Actions

This alert is cleared if the total number of requests not having User-Agent header is less than 10% of ingress traffic.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
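The percentage condition in the Metric Used field above can be illustrated with plain arithmetic. The following Python sketch is for illustration only and is not part of UDR or Prometheus. The function names and the sample rates are hypothetical, and the sketch assumes that the two per-second rates have already been computed over the 5-minute window.

```python
def missing_user_agent_percent(missing_rate: float, ingress_rate: float) -> float:
    """Return the share of ingress requests without a User-Agent header, in percent."""
    if ingress_rate <= 0:
        # Simplification: with no ingress traffic, report 0% instead of
        # the NaN or Inf value that a PromQL division would produce.
        return 0.0
    return missing_rate * 100 / ingress_rate


def alert_fires(missing_rate: float, ingress_rate: float, threshold: float = 10.0) -> bool:
    """Mirror the '>= 10' comparison used in the alert expression."""
    return missing_user_agent_percent(missing_rate, ingress_rate) >= threshold
```

For example, with a hypothetical 1000 requests/sec of ingress traffic, 100 requests/sec without the header is exactly 10% and meets the condition, while 50 requests/sec (5%) does not.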
4.1.2.45 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Table 4-64 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
Severity Minor
Condition Alert will be raised if egress gateway JVM buffer memory is above the minor threshold limit.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7034

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1300000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the minor threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not freeing sufficient memory by itself to fall below the threshold limit.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.46 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Table 4-65 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
Severity Major
Condition Alert will be raised if egress gateway JVM buffer memory is above the major threshold limit.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7034

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1500000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the major threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not freeing sufficient memory by itself to fall below the threshold limit.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.47 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Table 4-66 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Field Details
Description Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
Summary Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
Severity Critical
Condition Alert will be raised if egress gateway JVM buffer memory is above the critical threshold limit.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7034

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7034

Metric Used sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1800000000
Recommended Actions

This alert is cleared if the egress gateway JVM buffer memory is below the critical threshold limit.

Steps:
  1. Check why the egress gateway JVM buffer memory is above the threshold limit and why it is not freeing sufficient memory by itself to fall below the threshold limit.
  2. Contact My Oracle Support, if guidance is required.
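The three egress gateway JVM buffer memory alerts above differ only in their byte thresholds. The following Python sketch shows how a measured value maps to a severity. It is an illustration under stated assumptions: the function name is hypothetical, and the thresholds are copied from the Metric Used expressions of the three alerts. Because those expressions have no upper bound, a value above the critical threshold also satisfies the minor and major conditions; the sketch reports only the highest matching severity.

```python
from typing import Optional

# Thresholds in bytes, copied from the three alert expressions above.
THRESHOLDS = [
    ("Critical", 1_800_000_000),
    ("Major", 1_500_000_000),
    ("Minor", 1_300_000_000),
]


def highest_severity(used_bytes: float) -> Optional[str]:
    """Return the highest severity whose threshold is met, or None."""
    for severity, limit in THRESHOLDS:
        if used_bytes >= limit:
            return severity
    return None
```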
4.1.2.48 NudrDiameterGatewayDown

Table 4-67 NudrDiameterGatewayDown

Field Details
Description Alert will be raised if Nudr-diam-gateway service is down.
Summary Alert will be raised if Nudr-diam-gateway service is down.
Severity Critical
Condition Alert will be raised if Nudr-diam-gateway service is down.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7037

SLF: NA

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7037

Metric Used absent(up{container="nudr-diam-gateway",namespace="ocudr"}) or up{container="nudr-diam-gateway",namespace="ocudr"} == 0
Recommended Actions

This alert is cleared when the NudrDiamGateway service is available.

Steps:
    • Run the following command to check the orchestration logs of the appinfo service for liveness or readiness probe failures:
      kubectl get po -n <namespace>
    • Run the following command using the full name of the pod that is not running:
      kubectl describe pod <specific desired full pod name> -n <namespace>
  1. Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  2. Perform the resolution steps depending on the reason for failure.
  3. Contact My Oracle Support, if guidance is required.

    Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.2.49 DiameterPeerConnectionsDropped

Table 4-68 DiameterPeerConnectionsDropped

Field Details
Description Alert will be raised if there are no connections between diameter peer and diameter gateway.
Summary Alert will be raised if there are no connections between diameter peer and diameter gateway.
Severity Major
Condition Alert will be raised if there are no connections between diameter peer and diameter gateway.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7029

SLF: NA

EIR: NA

Metric Used sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0))< 2 or sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0)) < 2 or (sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",kubernetes_namespace="ocudr"} or vector(0)) + sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"}) or vector(0)) < 5
Recommended Actions

This alert is cleared when the NudrDiamGateway service is available.

Steps:
    • Run the following command to check the orchestration logs of the appinfo service for liveness or readiness probe failures:
      kubectl get po -n <namespace>
    • Run the following command using the full name of the pod that is not running:
      kubectl describe pod <specific desired full pod name> -n <namespace>
  1. Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions.
  2. Perform the resolution steps depending on the reason for failure.
  3. Contact My Oracle Support, if guidance is required.

    Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
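The Metric Used expression for DiameterPeerConnectionsDropped combines two per-site checks (fewer than 2 connections for origin hosts matching CHI or IND) with a combined check (fewer than 5 connections in total). The following Python sketch restates that logic in simplified form. The function name and the sample host names are hypothetical, and a site with no reported connections is treated as 0, which approximates the "or vector(0)" fallback in the expression.

```python
from typing import Dict, Sequence


def connections_dropped(
    conns_by_host: Dict[str, int],
    sites: Sequence[str] = ("CHI", "IND"),
    per_site_min: int = 2,
    total_min: int = 5,
) -> bool:
    """Return True when the simplified alert condition is met."""
    # A regex such as ".*CHI.*" matches any origHost that contains "CHI".
    per_site = {
        site: sum(n for host, n in conns_by_host.items() if site in host)
        for site in sites
    }
    too_few_per_site = any(count < per_site_min for count in per_site.values())
    too_few_total = sum(per_site.values()) < total_min
    return too_few_per_site or too_few_total
```

For example, 3 connections from a CHI peer and 2 from an IND peer satisfy both the per-site and the combined minimums, so the condition is not met; 2 connections from each site pass the per-site checks but fail the combined check.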

4.1.2.50 IGWSignallingPodProtectionDOCState

Table 4-69 IGWSignallingPodProtectionDOCState

Field Details
Description Alert will be raised when the ingress gateway signaling traffic is in DOC state.
Summary Alert will be raised when the ingress gateway signaling traffic is in DOC state.
Severity Major
Condition Alert will be raised when the ingress gateway signaling traffic is in DOC state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7038

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7038

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7038

Metric Used sum(oc_ingressgateway_congestion_system_state{namespace="ocudr",container="ingressgateway-sig"}) by (pod) == 2
Recommended Actions

This alert is cleared when the signaling traffic reaches NORMAL state.

Steps:
  1. Check the service-specific metrics for the specific service request errors. For example, oc_ingressgateway_congestion_system_state.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.51 IGWSignallingPodProtectionCongestedState

Table 4-70 IGWSignallingPodProtectionCongestedState

Field Details
Description Alert will be raised when the ingress gateway signaling traffic is in Congested state.
Summary Alert will be raised when the ingress gateway signaling traffic is in Congested state.
Severity Critical
Condition Alert will be raised when the ingress gateway signaling traffic is in Congested state.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7038

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7038

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7038

Metric Used sum(oc_ingressgateway_congestion_system_state{namespace="ocudr",container="ingressgateway-sig"}) by (pod) == 3
Recommended Actions

This alert is cleared when the signaling traffic reaches NORMAL or DOC state.

Steps:
  1. Check the service-specific metrics for the specific service request errors. For example, oc_ingressgateway_congestion_system_state.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.52 IGWSignallingPodProtectionByRateLimitRejectedRequest

Table 4-71 IGWSignallingPodProtectionByRateLimitRejectedRequest

Field Details
Description Alert will be raised when total rejections reach 1% or more of the total incoming traffic.
Summary Alert will be raised when total rejections reach 1% or more of the total incoming traffic.
Severity Critical
Condition Alert will be raised when total rejections reach 1% or more of the total incoming traffic.
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7039

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7039

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7039

Metric Used (sum (rate(oc_ingressgateway_http_request_ratelimit_denied_count_total{Action="REJECT",namespace="ocudr"}[2m]) or (up * 0 ) ) )/ sum(rate(oc_ingressgateway_http_requests_total{container="ingressgateway-sig",namespace="ocudr"}[2m])) * 100 >= 1
Recommended Actions

This alert is cleared when rejections fall below 1% of the total traffic.

Steps:
  1. Check the service-specific metrics for the specific service request errors. For example, oc_ingressgateway_congestion_system_state.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.53 DRServiceRequestLatencyMajor

Table 4-72 DRServiceRequestLatencyMajor

Field Details
Description DR service request latency is more than 100ms
Summary DR service request latency is above 100ms
Severity Major
Condition Alert will be raised when DR service request latency exceeds 100ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7046

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7046

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7046

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_request_processing_time_seconds_bucket{namespace="ocudr",container="nudr-drservice"}[5m])))*1000 >= 100 < 250
Recommended Actions

The alert is cleared when DR service latency falls below 100ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.54 DRServiceRequestLatencyCritical

Table 4-73 DRServiceRequestLatencyCritical

Field Details
Description DR service request latency is more than 250ms
Summary DR service request latency is above 250ms
Severity Critical
Condition Alert will be raised when DR service request latency exceeds 250ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7046

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7046

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7046

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_request_processing_time_seconds_bucket{namespace="ocudr",container="nudr-drservice"}[5m])))*1000 >= 250
Recommended Actions

The alert is cleared when DR service latency falls below 250ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
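The DR service latency alerts above evaluate the 95th percentile of a latency histogram and convert it from seconds to milliseconds. The following Python sketch approximates how such a percentile is estimated from cumulative buckets by linear interpolation, and how the result maps to the Major (>= 100 and < 250) and Critical (>= 250) conditions. It is a simplified illustration, not the Prometheus implementation; the function names and the sample bucket counts are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative histogram buckets.

    Each bucket is (upper_bound_seconds, cumulative_count); the bucket with
    the largest bound must be +Inf, as with Prometheus "le" buckets.
    """
    ordered = sorted(buckets)
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(ordered) if count >= rank)
    upper, count = ordered[index]
    if math.isinf(upper):
        # The quantile lies beyond the highest finite bound; report that bound.
        return ordered[-2][0]
    lower, below = ordered[index - 1] if index > 0 else (0.0, 0.0)
    if count == below:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


def latency_severity(latency_ms: float, major_ms: float, critical_ms: float) -> Optional[str]:
    """Map a latency in milliseconds to the severity of the alert that applies."""
    if latency_ms >= critical_ms:
        return "Critical"
    if latency_ms >= major_ms:
        return "Major"
    return None
```

With hypothetical cumulative counts of 60, 90, 98, and 100 for the bounds 0.05, 0.1, 0.25, and +Inf seconds, the 95th percentile falls inside the 0.1 to 0.25 second bucket, which corresponds to the Major range for the request latency alerts.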
4.1.2.55 DRServiceDBLatencyMajor

Table 4-74 DRServiceDBLatencyMajor

Field Details
Description DR service DB latency is more than 25ms
Summary DR service DB latency is above 25ms
Severity Major
Condition Alert will be raised when DR service DB latency exceeds 25ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7047

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7047

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7047

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_db_processing_time_seconds_bucket{namespace="ocudr",container="nudr-drservice"}[5m])))*1000 >= 25 < 50
Recommended Actions

The alert is cleared when DR service DB latency falls below 25ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.56 DRServiceDBLatencyCritical

Table 4-75 DRServiceDBLatencyCritical

Field Details
Description DR service DB latency is more than 50ms
Summary DR service DB latency is above 50ms
Severity Critical
Condition Alert will be raised when DR service DB latency exceeds 50ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7047

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7047

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7047

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_db_processing_time_seconds_bucket{namespace="ocudr",container="nudr-drservice"}[5m])))*1000 >= 50
Recommended Actions

The alert is cleared when DR service DB latency falls below 50ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.57 IGWSignallingTotalAvgLatencyMajor

Table 4-76 IGWSignallingTotalAvgLatencyMajor

Field Details
Description IGW signalling average latency is more than 250ms
Summary IGW signalling average latency is above 250ms
Severity Major
Condition Alert will be raised when IGW signalling average latency exceeds 250ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7048

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7048

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7048

Metric Used ((sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ) + (sum(irate(oc_ingressgateway_request_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_request_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ) + (sum(irate(oc_ingressgateway_response_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_response_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ))*1000 >= 250 < 500
Recommended Actions

The alert is cleared when IGW signalling average latency falls below 250ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.58 IGWSignallingTotalAvgLatencyCritical

Table 4-77 IGWSignallingTotalAvgLatencyCritical

Field Details
Description IGW signalling average latency is more than 500ms
Summary IGW signalling average latency is above 500ms
Severity Critical
Condition Alert will be raised when IGW signalling average latency exceeds 500ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7048

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7048

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7048

Metric Used ((sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ) + (sum(irate(oc_ingressgateway_request_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_request_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ) + (sum(irate(oc_ingressgateway_response_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-sig"}[2m])) / sum(irate(oc_ingressgateway_response_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-sig"}[2m])) ))*1000 >= 500
Recommended Actions

The alert is cleared when IGW signalling average latency falls below 500ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
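The IGW signalling average latency expressions above add three averages (backend invocation, request processing, and response processing), each computed as a latency sum divided by a count, and multiply the result by 1000 to obtain milliseconds. The following Python sketch restates that arithmetic. It is an illustration only: the function name, the stage labels, and the sample values are hypothetical, and a stage with a zero count yields NaN, which never satisfies a threshold comparison.

```python
import math
from typing import Dict, Tuple


def total_avg_latency_ms(stages: Dict[str, Tuple[float, float]]) -> float:
    """Sum the per-stage average latencies and return milliseconds.

    Each value is (latency_sum_rate_seconds, request_count_rate).
    """
    total_seconds = 0.0
    for sum_rate, count_rate in stages.values():
        if count_rate == 0:
            # No samples for this stage: the average is undefined.
            return math.nan
        total_seconds += sum_rate / count_rate
    return total_seconds * 1000
```

For example, hypothetical stage averages of 150 ms, 50 ms, and 80 ms add up to 280 ms, which falls in the Major range (>= 250 and < 500).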
4.1.2.59 DRProvServiceRequestLatencyMajor

Table 4-78 DRProvServiceRequestLatencyMajor

Field Details
Description DR provisioning service request latency is more than 100ms
Summary DR provisioning service request latency is above 100ms
Severity Major
Condition Alert will be raised when DR provisioning service request latency exceeds 100ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7049

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7049

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7049

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_request_processing_time_seconds_bucket{namespace="ocudr",container="nudr-dr-provservice"}[5m])))*1000 >= 100 < 250
Recommended Actions

The alert is cleared when DR provisioning service latency falls below 100ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.60 DRProvServiceRequestLatencyCritical

Table 4-79 DRProvServiceRequestLatencyCritical

Field Details
Description DR provisioning service request latency is more than 250ms
Summary DR provisioning service request latency is above 250ms
Severity Critical
Condition Alert will be raised when DR provisioning service request latency exceeds 250ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7049

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7049

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7049

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_request_processing_time_seconds_bucket{namespace="ocudr",container="nudr-dr-provservice"}[5m])))*1000 >= 250
Recommended Actions

The alert is cleared when DR provisioning service latency falls below 250ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.61 DRProvServiceDBLatencyMajor

Table 4-80 DRProvServiceDBLatencyMajor

Field Details
Description DR provisioning service DB latency is more than 25ms
Summary DR provisioning service DB latency is above 25ms
Severity Major
Condition Alert will be raised when DR provisioning service DB latency exceeds 25ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7050

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7050

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7050

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_db_processing_time_seconds_bucket{namespace="ocudr",container="nudr-dr-provservice"}[5m])))*1000 >= 25 < 50
Recommended Actions

The alert is cleared when DR provisioning service DB latency falls below 25ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.62 DRProvServiceDBLatencyCritical

Table 4-81 DRProvServiceDBLatencyCritical

Field Details
Description DR provisioning service DB latency is more than 50ms
Summary DR provisioning service DB latency is above 50ms
Severity Critical
Condition Alert will be raised when DR provisioning service DB latency exceeds 50ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7050

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7050

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7050

Metric Used histogram_quantile(95 / 100, sum by(le) (rate(udr_db_processing_time_seconds_bucket{namespace="ocudr",container="nudr-dr-provservice"}[5m])))*1000 >= 50
Recommended Actions

The alert is cleared when DR provisioning service DB latency falls below 50ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.63 IGWProvisioningTotalAvgLatencyMajor

Table 4-82 IGWProvisioningTotalAvgLatencyMajor

Field Details
Description IGW provisioning average latency is more than 250ms
Summary IGW provisioning average latency is above 250ms
Severity Major
Condition Alert will be raised when IGW provisioning average latency exceeds 250ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7051

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7051

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7051

Metric Used ((sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ) + (sum(irate(oc_ingressgateway_request_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_request_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ) + (sum(irate(oc_ingressgateway_response_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_response_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ))*1000 >= 250 < 500
Recommended Actions

The alert is cleared when IGW provisioning average latency falls below 250ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.
4.1.2.64 IGWProvisioningTotalAvgLatencyCritical

Table 4-83 IGWProvisioningTotalAvgLatencyCritical

Field Details
Description IGW provisioning average latency is more than 500ms
Summary IGW provisioning average latency is above 500ms
Severity Critical
Condition Alert will be raised when IGW provisioning average latency exceeds 500ms
OID UDR: 1.3.6.1.4.1.323.5.3.43.1.2.7051

SLF: 1.3.6.1.4.1.323.5.3.43.1.2.7051

EIR: 1.3.6.1.4.1.323.5.3.43.1.2.7051

Metric Used ((sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_backend_invocation_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ) + (sum(irate(oc_ingressgateway_request_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_request_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ) + (sum(irate(oc_ingressgateway_response_processing_latency_seconds_sum{namespace="ocudr",container="ingressgateway-prov"}[2m])) / sum(irate(oc_ingressgateway_response_processing_latency_seconds_count{namespace="ocudr",container="ingressgateway-prov"}[2m])) ))*1000 >= 500
Recommended Actions

The alert is cleared when IGW provisioning average latency falls below 500ms.

Steps:
  1. Check the service-specific metrics to understand the specific service request errors.
  2. Contact My Oracle Support, if guidance is required.