4 Alert Configuration

This section describes how to configure alert rules for UDR. UDR alert rules are measurement based: the alerting system evaluates the metrics reported by UDR microservices against the conditions defined in each rule and raises an alert, with a notification, when a condition is met. For more information about configuring UDR alerts in Prometheus, see the “Alert Configuration” section in Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.

4.1 Alert Details

This section describes alerts in detail.

Note:

The maximum ingress request rate considered in this section is 1000 requests per second.

Table 4-1 Alert Levels or Severity Types

Alert Level / Severity Type	Definition
Critical	Indicates a severe issue that poses a significant risk to safety, security, or operational integrity. It requires an immediate response to address the situation and prevent serious consequences. Raised for conditions that may affect the UDR service.
Major	Indicates a significant issue that impacts operations or poses a moderate risk. It requires prompt attention and action to prevent escalation. Raised for conditions that may affect the UDR service.
Minor	Indicates a low-severity situation that does not pose an immediate risk to safety, security, or operations. It requires attention but not urgent action. Raised for conditions that may affect the UDR service.
Info or Warn (Informational)	Provides general information or updates that are not related to immediate risks or actions. These alerts are for awareness and typically do not require a specific response. WARN and INFO alerts might not impact the UDR service.

The following table lists the UDR alert names and the corresponding EIR alert names.

Table 4-2 Alert names for UDR and EIR

UDR	EIR
OcudrTrafficRateAboveMajorThreshold	OceirTrafficRateAboveMajorThreshold
OcudrTrafficRateAboveMinorThreshold	OceirTrafficRateAboveMinorThreshold
OcudrTrafficRateAboveCriticalThreshold	OceirTrafficRateAboveCriticalThreshold
OcudrTransactionErrorRateAbove0.1Percent	OceirTransactionErrorRateAbove0.1Percent
OcudrTransactionErrorRateAbove1Percent	OceirTransactionErrorRateAbove1Percent
OcudrTransactionErrorRateAbove10Percent	OceirTransactionErrorRateAbove10Percent
OcudrTransactionErrorRateAbove25Percent	OceirTransactionErrorRateAbove25Percent
OcudrTransactionErrorRateAbove50Percent	OceirTransactionErrorRateAbove50Percent
OcudrSubscriberNotFoundAbove1Percent	OceirSubscriberNotFoundAbove1Percent
OcudrSubscriberNotFoundAbove10Percent	OceirSubscriberNotFoundAbove10Percent
OcudrSubscriberNotFoundAbove25Percent	OceirSubscriberNotFoundAbove25Percent
OcudrSubscriberNotFoundAbove50Percent	OceirSubscriberNotFoundAbove50Percent
OcudrPodsRestart	OceirPodsRestart
NudrServiceDown	NudrServiceDown
NudrProvServiceDown	NudrProvServiceDown
NudrNotifyServiceServiceDown	NA
NudrNRFClientServiceDown	NudrNRFClientServiceDown
NudrConfigServiceDown	NudrConfigServiceDown
NudrDiameterProxyServiceDown	NudrDiameterProxyServiceDown
NudrOnDemandMigrationServiceDown	NA
OcudrIngressGatewayServiceDown	OceirIngressGatewayServiceDown
OcudrEgressGatewayServiceDown	OceirEgressGatewayServiceDown
OcudrDbServiceDown	OceirDbServiceDown
OcudrXFCCValidationFailureAbove10Percent	OceirXFCCValidationFailureAbove10Percent
OcudrXFCCValidationFailureAbove20Percent	OceirXFCCValidationFailureAbove20Percent
OcudrXFCCValidationFailureAbove50Percent	OceirXFCCValidationFailureAbove50Percent
DRServiceOverload60Percent	DRServiceOverload60Percent
DRServiceOverload75Percent	DRServiceOverload75Percent
DRServiceOverload80Percent	DRServiceOverload80Percent
DRServiceOverload90Percent	DRServiceOverload90Percent
SLFSucessTxnDefaultGroupIdRateAbove1Percent	NA
SLFSucessTxnDefaultGroupIdRateAbove10Percent	NA
SLFSucessTxnDefaultGroupIdRateAbove25Percent	NA
SLFSucessTxnDefaultGroupIdRateAbove50Percent	NA
OcudrDiameterCongestionCongestedState	OceirDiameterCongestionCongestedState
OcudrDiameterCongestionDocState	OceirDiameterCongestionDocState
DRProvServiceOverload60Percent	DRProvServiceOverload60Percent
DRProvServiceOverload75Percent	DRProvServiceOverload75Percent
DRProvServiceOverload80Percent	DRProvServiceOverload80Percent
DRProvServiceOverload90Percent	DRProvServiceOverload90Percent
OcudrIngressGatewayProvServiceDown	OceirIngressGatewayProvServiceDown
OcudrProvisioningTrafficRateAboveMajorThreshold	OceirProvisioningTrafficRateAboveMajorThreshold
OcudrProvisioningTrafficRateAboveCriticalThreshold	OceirProvisioningTrafficRateAboveCriticalThreshold
OcudrProvisioningTransactionErrorRateAbove25Percent	OceirProvisioningTransactionErrorRateAbove25Percent
OcudrProvisioningTransactionErrorRateAbove50Percent	OceirProvisioningTransactionErrorRateAbove50Percent
PVCFullForSLFExport	NA
FailedExtractForSLFExport	NA
BulkImportTransferInFailed	BulkImportTransferInFailed
BulkImportTransferOutFailed	BulkImportTransferOutFailed
ExportToolTransferOutFailed	ExportToolTransferOutFailed
PVCFullForXMLBulkImport	PVCFullForXMLBulkImport
PVCFullForBulkImport	PVCFullForBulkImport
OperationalStatusCompleteShutdown	OperationalStatusCompleteShutdown
NFScoreCalculationFailed	NFScoreCalculationFailed
PVCFullForEXMLExport	NA
EXMLExportFailed	NA
IngressgatewayPodProtectionDocState	IngressgatewayPodProtectionDocState
IngressgatewayPodProtectionCongestedState	IngressgatewayPodProtectionCongestedState
RetryNotificationRecordsMaxLimitExceeded	RetryNotificationRecordsMaxLimitExceeded
UserAgentHeaderNotFoundMorethan10PercentRequest	NA
EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold	EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold
EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold	EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold
EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold	EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold
NudrDiameterGatewayDown	NudrDiameterGatewayDown
DiameterPeerConnectionsDropped	DiameterPeerConnectionsDropped

Note:

The following alert details provide only the UDR alert names. For the corresponding EIR alert names, see Table 4-2.
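
The naming pattern in Table 4-2 is regular enough to script, for example when translating a UDR alert name from the following sections into its EIR counterpart. The following Python sketch encodes the pattern observed in the table: the Ocudr prefix becomes Oceir, entries marked NA have no EIR equivalent, and all other names are shared. The helper name is hypothetical, and the mapping is only valid for names listed in Table 4-2.

```python
from typing import Optional

# UDR alert names that Table 4-2 maps to NA (no EIR equivalent).
NO_EIR_EQUIVALENT = frozenset({
    "NudrNotifyServiceServiceDown",
    "NudrOnDemandMigrationServiceDown",
    "SLFSucessTxnDefaultGroupIdRateAbove1Percent",
    "SLFSucessTxnDefaultGroupIdRateAbove10Percent",
    "SLFSucessTxnDefaultGroupIdRateAbove25Percent",
    "SLFSucessTxnDefaultGroupIdRateAbove50Percent",
    "PVCFullForSLFExport",
    "FailedExtractForSLFExport",
    "PVCFullForEXMLExport",
    "EXMLExportFailed",
    "UserAgentHeaderNotFoundMorethan10PercentRequest",
})


def eir_alert_name(udr_alert_name: str) -> Optional[str]:
    """Return the EIR alert name for a UDR alert name from Table 4-2, or None for NA."""
    if udr_alert_name in NO_EIR_EQUIVALENT:
        return None
    if udr_alert_name.startswith("Ocudr"):
        # Ocudr-prefixed alerts are renamed with the Oceir prefix for EIR.
        return "Oceir" + udr_alert_name[len("Ocudr"):]
    # Every other alert in Table 4-2 uses the same name for UDR and EIR.
    return udr_alert_name
```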

4.1.1 System Level Alerts

This section lists the system level alerts.

4.1.1.1 OcudrSubscriberNotFoundAbove1Percent

Table 4-3 OcudrSubscriberNotFoundAbove1Percent

Field	Details
Description	Subscriber Not Found responses are about 1% of the total ingress traffic
Summary	Subscriber Not Found responses are about 1% of the total ingress traffic
Severity	Warning
Condition	Alert if the number of Subscriber Not Found responses reaches 1% of all ingress traffic
OID	1.3.6.1.4.1.323.5.3.43.1.2.7009
Metric Used	udr_subscriber_not_found_total
Recommended Actions	The alert is cleared when Subscriber Not Found responses fall below 1% of the total ingress traffic. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.1.2 OcudrSubscriberNotFoundAbove10Percent

Table 4-4 OcudrSubscriberNotFoundAbove10Percent

Field	Details
Description	Subscriber Not Found responses are about 10% of the total ingress traffic
Summary	Subscriber Not Found responses are about 10% of the total ingress traffic
Severity	Minor
Condition	Alert if the number of Subscriber Not Found responses reaches 10% of all ingress traffic
OID	1.3.6.1.4.1.323.5.3.43.1.2.7010
Metric Used	udr_subscriber_not_found_total
Recommended Actions	The alert is cleared when Subscriber Not Found responses fall below 10% of the total ingress traffic. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.1.3 OcudrSubscriberNotFoundAbove25Percent

Table 4-5 OcudrSubscriberNotFoundAbove25Percent

Field	Details
Description	Subscriber Not Found responses are about 25% of the total ingress traffic
Summary	Subscriber Not Found responses are about 25% of the total ingress traffic
Severity	Major
Condition	Alert if the number of Subscriber Not Found responses reaches 25% of all ingress traffic
OID	1.3.6.1.4.1.323.5.3.43.1.2.7011
Metric Used	udr_subscriber_not_found_total
Recommended Actions	The alert is cleared when Subscriber Not Found responses fall below 25% of the total ingress traffic. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.1.4 OcudrSubscriberNotFoundAbove50Percent

Table 4-6 OcudrSubscriberNotFoundAbove50Percent

Field	Details
Description	Subscriber Not Found responses are about 50% of the total ingress traffic
Summary	Subscriber Not Found responses are about 50% of the total ingress traffic
Severity	Critical
Condition	Alert if the number of Subscriber Not Found responses reaches 50% of all ingress traffic
OID	1.3.6.1.4.1.323.5.3.43.1.2.7012
Metric Used	udr_subscriber_not_found_total
Recommended Actions	The alert is cleared when Subscriber Not Found responses fall below 50% of the total ingress traffic. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.
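
The four Subscriber Not Found alerts apply escalating thresholds to the same ratio. A minimal Python sketch of that classification follows, assuming both inputs are counter increases over the same evaluation window: the numerator from udr_subscriber_not_found_total and the denominator from total ingress requests. The denominator metric is not named in these tables, so it is an assumption, as are the `>=` comparison and the function name; the actual expressions are defined in the UDR alert rules file.

```python
from typing import Optional, Tuple

# (percent threshold, severity, alert name) from Tables 4-3 to 4-6, highest first.
SUBSCRIBER_NOT_FOUND_LEVELS = [
    (50.0, "Critical", "OcudrSubscriberNotFoundAbove50Percent"),
    (25.0, "Major", "OcudrSubscriberNotFoundAbove25Percent"),
    (10.0, "Minor", "OcudrSubscriberNotFoundAbove10Percent"),
    (1.0, "Warning", "OcudrSubscriberNotFoundAbove1Percent"),
]


def subscriber_not_found_alert(not_found: float, total_ingress: float) -> Optional[Tuple[str, str]]:
    """Return (severity, alert name) for the highest threshold reached, or None.

    Both arguments are assumed to be counter increases over the same window.
    """
    if total_ingress <= 0:
        return None  # no ingress traffic, so there is no ratio to evaluate
    percent = 100.0 * not_found / total_ingress
    for threshold, severity, alert in SUBSCRIBER_NOT_FOUND_LEVELS:
        if percent >= threshold:
            return severity, alert
    return None
```

Returning only the highest severity is a choice made for this sketch. Prometheus evaluates each alert rule independently, so more than one of these alerts can be active at the same time.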

4.1.1.5 OcudrPodsRestart

Table 4-7 OcudrPodsRestart

Field	Details
Description	Pod {{$labels.pod}} has restarted.
Summary	namespace: {{$labels.namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : A Pod has restarted
Severity	Major
Condition	Alert if any pod has restarted
OID	1.3.6.1.4.1.323.5.3.43.1.2.7014
Metric Used	kube_pod_container_status_restarts_total
Recommended Actions	The alert is cleared automatically when the affected pod is up. Steps: Refer to the application logs on Kibana, filter based on the pod name, and check for database-related failures such as connectivity, Kubernetes secrets, and so on. Check the orchestration logs for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Check the DB status. For more information, see Oracle Communications Cloud Native Core, cnDBTier User Guide. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
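
OcudrPodsRestart is driven by kube_pod_container_status_restarts_total, a counter of container restarts that kube-state-metrics exposes per namespace, pod, and container. The Python sketch below illustrates the idea behind the condition by comparing two snapshots of that counter and reporting the pods whose count increased. The snapshot layout and function name are hypothetical, and the actual alert rule expression may differ.

```python
from typing import Dict, List, Tuple

# Key: (namespace, pod, container); value: a kube_pod_container_status_restarts_total sample.
RestartSnapshot = Dict[Tuple[str, str, str], float]


def restarted_pods(previous: RestartSnapshot, current: RestartSnapshot) -> List[Tuple[str, str]]:
    """Return sorted (namespace, pod) pairs whose restart counter increased between snapshots."""
    restarted = set()
    for key, count in current.items():
        # Containers absent from the previous snapshot have no baseline, so they are skipped.
        if key in previous and count > previous[key]:
            namespace, pod, _container = key
            restarted.add((namespace, pod))
    return sorted(restarted)
```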

4.1.1.6 NudrServiceDown

Table 4-8 NudrServiceDown

Field	Details
Description	OCUDR Nudr_DRService {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Service is down
Severity	Critical
Condition	Alert if the Nudr DR service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7015
Metric Used	app_kubernetes_io_name="nudr-drservice"
Recommended Actions	The alert is cleared when the nudr-drservice service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
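
The first two troubleshooting steps above (list the pods, then pick out the ones that are not running) can be scripted. A minimal Python sketch follows that parses the default table printed by `kubectl get po -n <namespace>`, whose first three columns are NAME, READY, and STATUS. The function name is hypothetical, and skipping Completed pods assumes that finished Job pods are expected; for scripting, `kubectl get po -o json` is more robust than parsing columns.

```python
from typing import List


def pods_not_running(kubectl_output: str) -> List[str]:
    """Return names of pods that are not Running or not fully ready.

    Expects the default table from `kubectl get po -n <namespace>`, whose first
    three columns are NAME, READY (for example, 1/2) and STATUS.
    """
    unhealthy = []
    lines = kubectl_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed lines
        name, ready, status = fields[0], fields[1], fields[2]
        if status == "Completed":
            continue  # finished Job pods are expected to be not ready
        ready_count, _, total_count = ready.partition("/")
        if status != "Running" or ready_count != total_count:
            unhealthy.append(name)
    return unhealthy
```

Each returned name is a candidate for `kubectl describe pod <full pod name> -n <namespace>`.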

4.1.1.7 NudrProvServiceDown

Table 4-9 NudrProvServiceDown

Field	Details
Description	OCUDR Nudr_DR_PROVService {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : DR Prov Service is down
Severity	Critical
Condition	Alert if the Nudr DR provisioning service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7015
Metric Used	app_kubernetes_io_name="nudr-dr-provservice"
Recommended Actions	The alert is cleared when the nudr-dr-provservice service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.8 NudrNotifyServiceServiceDown

Table 4-10 NudrNotifyServiceServiceDown

Field	Details
Description	OCUDR NudrNotifyServiceService {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Nudr Notify Service down.
Severity	Critical
Condition	Alert if the Nudr Notify service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7016
Metric Used	app_kubernetes_io_name="nudr-notify-service"
Recommended Actions	The alert is cleared when the nudr-notify-service service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.9 NudrNRFClientServiceDown

Table 4-11 NudrNRFClientServiceDown

Field	Details
Description	OCUDR NRFClient service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NRF Client service down
Severity	Critical
Condition	Alert if the Nudr NRF Client service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7017
Metric Used	app_kubernetes_io_name="nrf-client-nfmanagement"
Recommended Actions	The alert is cleared when the nrf-client-nfmanagement service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.10 NudrConfigServiceDown

Table 4-12 NudrConfigServiceDown

Field	Details
Description	OCUDR config service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-config service down
Severity	Critical
Condition	Alert if the Nudr Config service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7020
Metric Used	app_kubernetes_io_name="nudr-config"
Recommended Actions	The alert is cleared when the nudr-config service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.11 NudrDiameterProxyServiceDown

Table 4-13 NudrDiameterProxyServiceDown

Field	Details
Description	OCUDR diameterproxy service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : nudr-diameterproxy service is down
Severity	Critical
Condition	Alert if the Nudr Diameter Proxy service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7018
Metric Used	app_kubernetes_io_name="nudr-diameterproxy"
Recommended Actions	The alert is cleared when the nudr-diameterproxy service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.12 NudrOnDemandMigrationServiceDown

Table 4-14 NudrOnDemandMigrationServiceDown

Field	Details
Description	OCUDR ondemand-migration service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : NFSubscription service is down
Severity	Critical
Condition	Alert if the Nudr On Demand Migration service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7019
Metric Used	app_kubernetes_io_name="nudr-ondemand-migration"
Recommended Actions	The alert is cleared when the nudr-ondemand-migration service is available. Steps: Check the orchestration logs of the appinfo service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the appinfo service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.13 OcudrIngressGatewayServiceDown

Table 4-15 OcudrIngressGatewayServiceDown

Field	Details
Description	OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity	Critical
Condition	Alert if the Ingress Gateway service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7021
Metric Used	app_kubernetes_io_name="ingressgateway"
Recommended Actions	The alert is cleared when the ingressgateway service is available. Steps: Check the orchestration logs of the ingress-gateway service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the ingress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.14 OcudrEgressGatewayServiceDown

Table 4-16 OcudrEgressGatewayServiceDown

Field	Details
Description	OCUDR Egress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Egress-Gateway service down
Severity	Critical
Condition	Alert if the Egress Gateway service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7022
Metric Used	app_kubernetes_io_name="egressgateway"
Recommended Actions	The alert is cleared when the egressgateway service is available. Note: The threshold is configurable in the UDR_Alertrules.yaml file. Steps: Check the orchestration logs of the egress-gateway service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the egress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.1.15 OcudrDbServiceDown

Table 4-17 OcudrDbServiceDown

Field	Details
Description	MySQL connectivity service is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : MySQL connectivity service down
Severity	Critical
Condition	Alert if MySQL connectivity is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7023
Metric Used	appinfo_service_running
Recommended Actions	The alert is cleared when the nudr-drservice microservice is up and running.

4.1.1.16 OcudrIngressGatewayProvServiceDown

Table 4-18 OcudrIngressGatewayProvServiceDown

Field	Details
Description	OCUDR Ingress-Gateway service {{$labels.app_kubernetes_io_name}} is down
Summary	namespace: {{$labels.kubernetes_namespace}}, podname: {{$labels.kubernetes_pod_name}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} : Ingress-gateway service down
Severity	Critical
Condition	Alert if the ingressgateway-prov service is down
OID	1.3.6.1.4.1.323.5.3.43.1.2.7043
Metric Used	app_kubernetes_io_name="ingressgateway-prov"
Recommended Actions	The alert is cleared when the ingressgateway-prov service is available. Steps: Check the orchestration logs of the ingress-gateway service for liveness or readiness probe failures using the following command: `kubectl get po -n <namespace>` Note the full name of the pod that is not running and use it in the following command: `kubectl describe pod <full pod name> -n <namespace>` Refer to the application logs on Kibana and filter based on the ingress-gateway service names. Check for ERROR and WARNING logs related to thread exceptions. Depending on the failure reason, take the appropriate resolution steps. If the issue persists, capture the output of all the above steps and contact My Oracle Support. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.

4.1.2 Application Level Alerts

This section lists the application level alerts.

4.1.2.1 OcudrTrafficRateAboveMajorThreshold

Table 4-19 OcudrTrafficRateAboveMajorThreshold

Field	Details
Description	Ingress traffic rate is above the major threshold, that is, 900 requests per second
Summary	Traffic rate is above 90 Percent of Max requests per second (1000)
Severity	Major
Condition	Alert if ingress traffic reaches 90% of the maximum TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7002
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared when the ingress traffic rate falls below the Major threshold. Note: The threshold is configurable in the UDR_Alertrules.yaml file. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a georedundancy scenario). If this is unexpected, contact My Oracle Support and: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

4.1.2.2 OcudrTrafficRateAboveMinorThreshold

Table 4-20 OcudrTrafficRateAboveMinorThreshold

Field	Details
Description	Ingress traffic rate is above the minor threshold, that is, 800 requests per second
Summary	Traffic rate is above 80 Percent of Max requests per second (1000)
Severity	Minor
Condition	Alert if ingress traffic reaches 80% of the maximum TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7001
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared either when the total ingress traffic rate falls below the Minor threshold or when it crosses the Major threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised. Note: The threshold is configurable in the UDR_Alertrules.yaml file. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a georedundancy scenario). If this is unexpected, contact My Oracle Support and: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.

4.1.2.3 OcudrTrafficRateAboveCriticalThreshold

Table 4-21 OcudrTrafficRateAboveCriticalThreshold

Field	Details
Description	Ingress traffic rate is above the critical threshold, that is, 950 requests per second
Summary	Traffic rate is above 95 Percent of Max requests per second (1000)
Severity	Critical
Condition	Alert if ingress traffic reaches 95% of the maximum TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7003
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared when the ingress traffic rate falls below the Critical threshold. Note: The threshold is configurable in the UDR_Alertrules.yaml file. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a georedundancy scenario). If this is unexpected, contact My Oracle Support and: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine an increase in 4xx and 5xx error codes. Check the Ingress Gateway logs on Kibana to determine the reason for the errors.
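
The three traffic rate alerts compare the same ingress request rate against 80%, 90%, and 95% of the 1000 requests per second maximum, that is, 800, 900, and 950 requests per second. The Python sketch below computes a per-second rate from two samples of the oc_ingressgateway_http_requests_total counter and maps it to the highest alert reached. The function names, the `>=` comparison, and the counter-reset handling are assumptions of this sketch; in Prometheus the rate is normally computed with `rate()`, and the thresholds themselves are configurable in UDR_Alertrules.yaml.

```python
from typing import Optional, Tuple

MAX_INGRESS_RPS = 1000  # maximum ingress rate considered in this section

# (percent of maximum, severity, alert name), checked highest first.
TRAFFIC_LEVELS = [
    (95, "Critical", "OcudrTrafficRateAboveCriticalThreshold"),
    (90, "Major", "OcudrTrafficRateAboveMajorThreshold"),
    (80, "Minor", "OcudrTrafficRateAboveMinorThreshold"),
]


def request_rate(prev_count: float, curr_count: float, interval_s: float) -> float:
    """Per-second rate between two samples of oc_ingressgateway_http_requests_total."""
    increase = curr_count - prev_count
    if increase < 0:
        # The counter was reset (for example, by a pod restart); assume it restarted from 0.
        increase = curr_count
    return increase / interval_s


def traffic_alert(rate_rps: float) -> Optional[Tuple[str, str]]:
    """Return (severity, alert name) for the highest threshold reached, or None."""
    for percent, severity, alert in TRAFFIC_LEVELS:
        threshold_rps = MAX_INGRESS_RPS * percent / 100  # 950, 900, and 800 requests per second
        if rate_rps >= threshold_rps:
            return severity, alert
    return None
```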

4.1.2.4 OcudrTransactionErrorRateAbove0.1Percent

Table 4-22 OcudrTransactionErrorRateAbove0.1Percent

Field	Details
Description	Transaction error rate is above 0.1 Percent of Total Transactions
Summary	Transaction Error Rate detected above 0.1 Percent of Total Transactions
Severity	Warning
Condition	Alert if the total error rate exceeds 0.1% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7004
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 0.1% of the total transactions or when they cross the 1% threshold, in which case the OcudrTransactionErrorRateAbove1Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified using Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.5 OcudrTransactionErrorRateAbove1Percent

Table 4-23 OcudrTransactionErrorRateAbove1Percent

Field	Details
Description	Transaction error rate is above 1 Percent of Total Transactions
Summary	Transaction Error Rate detected above 1 Percent of Total Transactions
Severity	Warning
Condition	Alert if the total error rate exceeds 1% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7005
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 1% of the total transactions or when they cross the 10% threshold, in which case the OcudrTransactionErrorRateAbove10Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified using Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.6 OcudrTransactionErrorRateAbove10Percent

Table 4-24 OcudrTransactionErrorRateAbove10Percent

Field	Details
Description	Transaction error rate is above 10 Percent of Total Transactions
Summary	Transaction Error Rate detected above 10 Percent of Total Transactions
Severity	Minor
Condition	Alert if the total error rate exceeds 10% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7006
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 10% of the total transactions or when they cross the 25% threshold, in which case the OcudrTransactionErrorRateAbove25Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified using Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.
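
The transaction error rate alerts divide failed responses by all responses counted in oc_ingressgateway_http_responses_total. The Python sketch below shows that calculation, plus a label filter that mirrors the Method and Status example in the Recommended Actions. It assumes that any 4xx or 5xx status counts as a failure and that the labels are named exactly Method and Status as in that example; both are assumptions, and the actual rule may use different labels or status classes. The strict `>` follows the word "exceeds" in the Condition rows, and only the 0.1%, 1%, and 10% levels described in this section are included.

```python
from typing import Dict, List, Optional, Tuple

# One entry per label combination of oc_ingressgateway_http_responses_total:
# (labels, increase of the counter over the evaluation window).
Sample = Tuple[Dict[str, str], float]

# Thresholds from Tables 4-22 to 4-24, highest first.
ERROR_RATE_LEVELS = [
    (10.0, "Minor", "OcudrTransactionErrorRateAbove10Percent"),
    (1.0, "Warning", "OcudrTransactionErrorRateAbove1Percent"),
    (0.1, "Warning", "OcudrTransactionErrorRateAbove0.1Percent"),
]


def status_code(labels: Dict[str, str]) -> int:
    # Status values look like "503 SERVICE_UNAVAILABLE"; keep the numeric part.
    return int(labels["Status"].split()[0])


def error_rate_percent(samples: List[Sample]) -> float:
    total = sum(count for _, count in samples)
    if total <= 0:
        return 0.0
    # Assumption: any 4xx or 5xx response counts as a failed transaction.
    errors = sum(count for labels, count in samples if status_code(labels) >= 400)
    return 100.0 * errors / total


def error_rate_alert(samples: List[Sample]) -> Optional[Tuple[str, str]]:
    """Return (severity, alert name) for the highest threshold exceeded, or None."""
    percent = error_rate_percent(samples)
    for threshold, severity, alert in ERROR_RATE_LEVELS:
        if percent > threshold:
            return severity, alert
    return None


def matching(samples: List[Sample], **wanted: str) -> float:
    """Sum counts whose labels match, e.g. Method="GET", Status="503 SERVICE_UNAVAILABLE"."""
    return sum(count for labels, count in samples
               if all(labels.get(key) == value for key, value in wanted.items()))
```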

4.1.2.7 OcudrTrafficRateAboveCriticalThreshold

Table 4-25 OcudrTrafficRateAboveCriticalThreshold

Field	Details
Description	Ingress traffic rate is above the critical threshold, that is, 950 requests per second
Summary	Traffic rate is above 95 Percent of Max requests per second (1000)
Severity	Critical
Condition	Alert if Ingress traffic reaches 95% of max TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7003
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared when the ingress traffic rate falls below the Critical threshold. Note: The threshold is configurable in UDR_Alertrules.yaml. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support and perform the following checks: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased. Check the Ingress Gateway logs in Kibana to determine the reason for the errors.

4.1.2.8 OcudrTrafficRateAboveMajorThreshold

Table 4-26 OcudrTrafficRateAboveMajorThreshold

Field	Details
Description	Ingress traffic rate is above the major threshold, that is, 900 requests per second
Summary	Traffic rate is above 90 Percent of Max requests per second (1000)
Severity	Major
Condition	Alert if Ingress traffic reaches 90% of max TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7002
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared either when the total ingress traffic rate falls below the Major threshold or when it crosses the Critical threshold, in which case the OcudrTrafficRateAboveCriticalThreshold alert is raised. Note: The threshold is configurable in UDR_Alertrules.yaml. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support and perform the following checks: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased. Check the Ingress Gateway logs in Kibana to determine the reason for the errors.

4.1.2.9 OcudrTrafficRateAboveMinorThreshold

Table 4-27 OcudrTrafficRateAboveMinorThreshold

Field	Details
Description	Ingress traffic rate is above the minor threshold, that is, 800 requests per second
Summary	Traffic rate is above 80 Percent of Max requests per second (1000)
Severity	Minor
Condition	Alert if Ingress traffic reaches 80% of max TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7001
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared either when the total ingress traffic rate falls below the Minor threshold or when it crosses the Major threshold, in which case the OcudrTrafficRateAboveMajorThreshold alert is raised. Note: The threshold is configurable in UDR_Alertrules.yaml. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support and perform the following checks: Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased. Check the Ingress Gateway logs in Kibana to determine the reason for the errors.
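The three traffic alerts form a ladder in which only the highest crossed threshold is reported, as the clearing rules above describe. The Python sketch below derives a requests-per-second figure from two readings of oc_ingressgateway_http_requests_total and picks the matching alert. It assumes the default 80, 90, and 95 percent thresholds of a 1000 requests/second maximum and treats "reaches" as greater than or equal; the function names are illustrative only, and the real thresholds come from UDR_Alertrules.yaml.

```python
from typing import Optional, Tuple

MAX_TPS = 1000  # maximum ingress rate assumed in this section
# (percent of MAX_TPS, alert name, severity), highest tier first
TRAFFIC_TIERS = [
    (95, "OcudrTrafficRateAboveCriticalThreshold", "Critical"),
    (90, "OcudrTrafficRateAboveMajorThreshold", "Major"),
    (80, "OcudrTrafficRateAboveMinorThreshold", "Minor"),
]

def rate_per_second(count_start: float, count_end: float, window_seconds: float) -> float:
    """Average request rate between two readings of a counter (no reset handling)."""
    return (count_end - count_start) / window_seconds

def active_traffic_alert(tps: float, max_tps: float = MAX_TPS) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for the highest threshold reached, or None."""
    for percent, name, severity in TRAFFIC_TIERS:
        # Comparing tps * 100 with percent * max_tps keeps thresholds as whole percentages.
        if tps * 100 >= percent * max_tps:
            return name, severity
    return None

tps = rate_per_second(100_000, 154_600, 60)  # 910.0 requests per second
print(tps, active_traffic_alert(tps))  # 910.0 ('OcudrTrafficRateAboveMajorThreshold', 'Major')
```

Checking the tiers from the top down is what guarantees that a rate of 960 requests/second reports only the Critical alert rather than all three.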

4.1.2.10 OcudrTransactionErrorRateAbove0.1Percent

Table 4-28 OcudrTransactionErrorRateAbove0.1Percent

Field	Details
Description	Transaction Error rate is above 0.1 Percent of Total Transactions
Summary	Transaction Error Rate detected above 0.1 Percent of Total Transactions
Severity	Warning
Condition	Alert if failed transactions exceed 0.1% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7004
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 0.1% of the total transactions, or when they exceed the 1% threshold, in which case the OcudrTransactionErrorRateAbove1Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.11 OcudrTransactionErrorRateAbove1Percent

Table 4-29 OcudrTransactionErrorRateAbove1Percent

Field	Details
Description	Transaction Error rate is above 1 Percent of Total Transactions
Summary	Transaction Error Rate detected above 1 Percent of Total Transactions
Severity	Warning
Condition	Alert if failed transactions exceed 1% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7005
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 1% of the total transactions, or when they exceed the 10% threshold, in which case the OcudrTransactionErrorRateAbove10Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.12 OcudrTransactionErrorRateAbove10Percent

Table 4-30 OcudrTransactionErrorRateAbove10Percent

Field	Details
Description	Transaction Error rate is above 10 Percent of Total Transactions
Summary	Transaction Error Rate detected above 10 Percent of Total Transactions
Severity	Minor
Condition	Alert if failed transactions exceed 10% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7006
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 10% of the total transactions, or when they exceed the 25% threshold, in which case the OcudrTransactionErrorRateAbove25Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.13 OcudrTransactionErrorRateAbove25Percent

Table 4-31 OcudrTransactionErrorRateAbove25Percent

Field	Details
Description	Transaction Error Rate detected above 25 Percent of Total Transactions
Summary	Transaction Error Rate detected above 25 Percent of Total Transactions
Severity	Major
Condition	Alert if failed transactions exceed 25% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7007
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 25% of the total transactions, or when they exceed the 50% threshold, in which case the OcudrTransactionErrorRateAbove50Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.14 OcudrTransactionErrorRateAbove50Percent

Table 4-32 OcudrTransactionErrorRateAbove50Percent

Field	Details
Description	Transaction Error Rate detected above 50 Percent of Total Transactions
Summary	Transaction Error Rate detected above 50 Percent of Total Transactions
Severity	Critical
Condition	Alert if failed transactions exceed 50% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7008
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 50% of the total transactions. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.
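Tables 4-28 through 4-32 define a second ladder, on the transaction error rate. A minimal Python sketch of how the active alert could be selected is shown below. It assumes you already know the failed and total transaction counts for the evaluation window; which response codes count as failures is defined by the rule expressions in UDR_Alertrules.yaml and is not modeled here. "Exceeds" is treated as strictly greater than.

```python
from typing import Optional, Tuple

# (threshold in percent, alert name, severity) from Tables 4-28 to 4-32, highest first.
ERROR_TIERS = [
    (50.0, "OcudrTransactionErrorRateAbove50Percent", "Critical"),
    (25.0, "OcudrTransactionErrorRateAbove25Percent", "Major"),
    (10.0, "OcudrTransactionErrorRateAbove10Percent", "Minor"),
    (1.0, "OcudrTransactionErrorRateAbove1Percent", "Warning"),
    (0.1, "OcudrTransactionErrorRateAbove0.1Percent", "Warning"),
]

def error_rate_percent(failed: float, total: float) -> float:
    """Failed transactions as a percentage of all transactions in the window."""
    return 0.0 if total == 0 else 100.0 * failed / total

def active_error_alert(failed: float, total: float) -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for the highest threshold exceeded, or None."""
    rate = error_rate_percent(failed, total)
    for threshold, name, severity in ERROR_TIERS:
        if rate > threshold:
            return name, severity
    return None

print(active_error_alert(120, 10_000))  # ('OcudrTransactionErrorRateAbove1Percent', 'Warning')
```

Because each threshold is strict, a rate of exactly 10% still reports the 1% alert; adjust the comparison if your rule expressions use `>=`.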

4.1.2.15 OcudrXFCCValidationFailureAbove10Percent

Table 4-33 OcudrXFCCValidationFailureAbove10Percent

Field	Details
Description	Total number of responses with XFCC validation failures is above 10% of ingress traffic
Summary	Total number of responses with XFCC validation failures is above 10% of ingress traffic
Severity	Minor
Condition	Alert if XFCC validation failures exceed 10% of the total XFCC validations
OID	1.3.6.1.4.1.323.5.3.43.1.2.7024
Metric Used	oc_ingressgateway_xfcc_header_validate_total
Recommended Actions	The alert is cleared when XFCC validation failures fall below 10% of the total XFCC validations. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.16 OcudrXFCCValidationFailureAbove20Percent

Table 4-34 OcudrXFCCValidationFailureAbove20Percent

Field	Details
Description	Total number of responses with XFCC validation failures is above 20% of ingress traffic
Summary	Total number of responses with XFCC validation failures is above 20% of ingress traffic
Severity	Major
Condition	Alert if XFCC validation failures exceed 20% of the total XFCC validations
OID	1.3.6.1.4.1.323.5.3.43.1.2.7025
Metric Used	oc_ingressgateway_xfcc_header_validate_total
Recommended Actions	The alert is cleared when XFCC validation failures fall below 20% of the total XFCC validations. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.17 OcudrXFCCValidationFailureAbove50Percent

Table 4-35 OcudrXFCCValidationFailureAbove50Percent

Field	Details
Description	Total number of responses with XFCC validation failures is above 50% of ingress traffic
Summary	Total number of responses with XFCC validation failures is above 50% of ingress traffic
Severity	Critical
Condition	Alert if XFCC validation failures exceed 50% of the total XFCC validations
OID	1.3.6.1.4.1.323.5.3.43.1.2.7026
Metric Used	oc_ingressgateway_xfcc_header_validate_total
Recommended Actions	The alert is cleared when XFCC validation failures fall below 50% of the total XFCC validations. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.
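Because oc_ingressgateway_xfcc_header_validate_total is a counter, the 10, 20, and 50 percent conditions compare how much the failure count grew against how much the total grew over a window, and a pod restart resets a counter to zero. The Python sketch below shows that calculation with basic reset handling, in the spirit of the Prometheus `increase()` function but without its extrapolation to the window edges. It assumes the failed and total validations are available as two separate sample series; how the metric's labels distinguish them is not described in this section, and the sample values are made up.

```python
from typing import Sequence

def counter_increase(samples: Sequence[float]) -> float:
    """Growth of a monotonic counter across samples; a drop is treated as a reset to zero."""
    growth = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            growth += current - previous
        else:
            growth += current  # counter restarted, so everything since the reset is new
    return growth

def xfcc_failure_percent(failed: Sequence[float], total: Sequence[float]) -> float:
    """Percentage of XFCC validations in the window that failed."""
    total_growth = counter_increase(total)
    if total_growth == 0:
        return 0.0
    return 100.0 * counter_increase(failed) / total_growth

total = [1000, 1500, 200, 700]  # a reset happens between 1500 and 200
failed = [100, 160, 30, 150]
print(xfcc_failure_percent(failed, total))  # 17.5, above the 10% (Minor) threshold only
```

Ignoring the reset and subtracting the first sample from the last would give a negative total here, which is why counter-aware functions such as `increase()` are used in alert rules instead of raw differences.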

4.1.2.18 DRServiceOverload60Percent

Table 4-36 DRServiceOverload60Percent

Field	Details
Description	This alert is fired when the application reaches the Warn overload level
Summary	This alert is fired when the application reaches the Warn overload level
Severity	Warning
Condition	Alert if the application overload level reaches 60%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7027
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Warn level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.19 DRServiceOverload75Percent

Table 4-37 DRServiceOverload75Percent

Field	Details
Description	This alert is fired when the application reaches the Minor overload level
Summary	This alert is fired when the application reaches the Minor overload level
Severity	Minor
Condition	Alert if the application overload level reaches 75%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7028
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Minor level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.20 DRServiceOverload80Percent

Table 4-38 DRServiceOverload80Percent

Field	Details
Description	This alert is fired when the application reaches the Major overload level
Summary	This alert is fired when the application reaches the Major overload level
Severity	Major
Condition	Alert if the application overload level reaches 80%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7029
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Major level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.21 DRServiceOverload90Percent

Table 4-39 DRServiceOverload90Percent

Field	Details
Description	This alert is fired when the application reaches the Critical overload level
Summary	This alert is fired when the application reaches the Critical overload level
Severity	Critical
Condition	Alert if the application overload level reaches 90%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7030
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Critical level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.
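The four DRServiceOverload alerts, and the DRProvServiceOverload alerts later in this section, use the same 60, 75, 80, and 90 percent steps. The Python sketch below maps a load figure to the alert name for either family. It treats the load as a plain percentage and the thresholds as inclusive ("overloads at"), which is an assumption: how `load_level` encodes the overload level in your deployment is not described here.

```python
from typing import Optional, Tuple

# (threshold percent, severity), highest first; shared by both overload alert families.
OVERLOAD_TIERS = [(90, "Critical"), (80, "Major"), (75, "Minor"), (60, "Warning")]

def overload_alert(load_percent: float, family: str = "DRService") -> Optional[Tuple[str, str]]:
    """Return (alert name, severity) for the highest overload step reached, or None.

    family is "DRService" or "DRProvService", matching the alert name prefixes.
    """
    for threshold, severity in OVERLOAD_TIERS:
        if load_percent >= threshold:
            return f"{family}Overload{threshold}Percent", severity
    return None

print(overload_alert(78))                   # ('DRServiceOverload75Percent', 'Minor')
print(overload_alert(92, "DRProvService"))  # ('DRProvServiceOverload90Percent', 'Critical')
```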

4.1.2.22 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Table 4-40 SLFSucessTxnDefaultGroupIdRateAbove1Percent

Field	Details
Description	Transaction Error Rate detected above 1 Percent of Total Transactions
Summary	Transaction Error rate is above 1 Percent of Total Transactions
Severity	Warning
Condition	Alert if number of SLF Lookup requests responded with default Group ID exceeds 1% of the total responses.
OID	1.3.6.1.4.1.323.5.3.43.1.2.7031
Metric Used	slf_sucess_txn_default_grp_id_total
Recommended Actions	This alert is cleared when the number of SLF lookup requests for subscribers that are not provisioned decreases. Steps: Check the subscriber range received in lookup requests and make sure that no unexpected out-of-range subscribers are being queried.

4.1.2.23 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Table 4-41 SLFSucessTxnDefaultGroupIdRateAbove10Percent

Field	Details
Description	Transaction Error Rate detected above 10 Percent of Total Transactions
Summary	Transaction Error rate is above 10 Percent of Total Transactions
Severity	Minor
Condition	Alert if number of SLF Lookup requests responded with default Group ID exceeds 10% of the total responses.
OID	1.3.6.1.4.1.323.5.3.43.1.2.7032
Metric Used	slf_sucess_txn_default_grp_id_total
Recommended Actions	This alert is cleared when the number of SLF lookup requests for subscribers that are not provisioned decreases. Steps: Check the subscriber range received in lookup requests and make sure that no unexpected out-of-range subscribers are being queried.

4.1.2.24 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Table 4-42 SLFSucessTxnDefaultGroupIdRateAbove25Percent

Field	Details
Description	Transaction Error Rate detected above 25 Percent of Total Transactions
Summary	Transaction Error rate is above 25 Percent of Total Transactions
Severity	Major
Condition	Alert if number of SLF Lookup requests responded with default Group ID exceeds 25% of the total responses.
OID	1.3.6.1.4.1.323.5.3.43.1.2.7033
Metric Used	slf_sucess_txn_default_grp_id_total
Recommended Actions	This alert is cleared when the number of SLF lookup requests for subscribers that are not provisioned decreases. Steps: Check the subscriber range received in lookup requests and make sure that no unexpected out-of-range subscribers are being queried.

4.1.2.25 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Table 4-43 SLFSucessTxnDefaultGroupIdRateAbove50Percent

Field	Details
Description	Transaction Error Rate detected above 50 Percent of Total Transactions
Summary	Transaction Error rate is above 50 Percent of Total Transactions
Severity	Critical
Condition	Alert if number of SLF Lookup requests responded with default Group ID exceeds 50% of the total responses.
OID	1.3.6.1.4.1.323.5.3.43.1.2.7034
Metric Used	slf_sucess_txn_default_grp_id_total
Recommended Actions	This alert is cleared when the number of SLF lookup requests for subscribers that are not provisioned decreases. Steps: Check the subscriber range received in lookup requests and make sure that no unexpected out-of-range subscribers are being queried.
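The recommended action for the default group ID alerts is to compare the subscriber identifiers arriving in SLF lookups with the ranges that are actually provisioned. As a rough illustration, the Python sketch below reports lookup identifiers that fall outside every provisioned range. The numeric MSISDN-style identifiers, the inclusive (start, end) range format, and the `out_of_range` helper are hypothetical; in practice the inputs would come from your provisioning data and request logs.

```python
import bisect
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive (start, end)

def merge_ranges(ranges: List[Range]) -> List[Range]:
    """Sort ranges and merge any that overlap or touch."""
    merged: List[Range] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def out_of_range(lookups: List[str], ranges: List[Range]) -> List[str]:
    """Return the lookup identifiers that no provisioned range covers."""
    merged = merge_ranges(ranges)
    starts = [start for start, _ in merged]
    misses = []
    for identifier in lookups:
        value = int(identifier)
        i = bisect.bisect_right(starts, value) - 1  # last range starting at or below value
        if i < 0 or value > merged[i][1]:
            misses.append(identifier)
    return misses

provisioned = [(14155550000, 14155559999), (14155560000, 14155560999)]
lookups = ["14155550123", "14155561500", "14155540000", "14155560500"]
print(out_of_range(lookups, provisioned))  # ['14155561500', '14155540000']
```

Merging the ranges first keeps the binary search correct even when provisioned blocks overlap.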

4.1.2.26 OcudrDiameterCongestionCongestedState

Table 4-44 OcudrDiameterCongestionCongestedState

Field	Details
Description	Alert is raised if the Diameter Gateway pod is in the CONGESTED state.
Summary	Alert is raised if the Diameter Gateway pod is in the CONGESTED state.
Severity	Critical
Condition	Alert is raised if the Diameter Gateway pod is in the CONGESTED state.
Metric Used	ocudr_pod_congestion_state == 2
Recommended Actions	This alert is raised when the Diameter Gateway pod congestion level is set to the CONGESTED state. Steps: Reduce the traffic load or allocate adequate performance resources. Check the pod congestion configuration and resource limits in the CNC Console.

4.1.2.27 OcudrDiameterCongestionDocState

Table 4-45 OcudrDiameterCongestionDocState

Field	Details
Description	Alert is raised if the Diameter Gateway pod is in the Danger of Congestion (DOC) state.
Summary	Alert is raised if the Diameter Gateway pod is in the Danger of Congestion (DOC) state.
Severity	Major
Condition	Alert is raised if the Diameter Gateway pod is in the Danger of Congestion (DOC) state.
Metric Used	ocudr_pod_congestion_state == 1
Recommended Actions	This alert is raised when the Diameter Gateway pod congestion level is set to the Danger of Congestion (DOC) state. Steps: Reduce the traffic load or allocate adequate performance resources. Check the pod congestion configuration and resource limits in the CNC Console.
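Both Diameter Gateway congestion alerts read the same gauge, ocudr_pod_congestion_state, and differ only in the value they compare against. The Python sketch below makes that mapping explicit. Values 1 (DOC) and 2 (CONGESTED) come from the tables above; treating 0 as the normal, non-alerting state is an assumption, and any other value is rejected.

```python
from enum import IntEnum
from typing import Dict, Optional, Tuple

class CongestionState(IntEnum):
    NORMAL = 0     # assumption: value 0 is not described in the tables above
    DOC = 1        # Danger of Congestion
    CONGESTED = 2

# Alert raised for each non-normal state, per Tables 4-44 and 4-45.
ALERTS: Dict[CongestionState, Tuple[str, str]] = {
    CongestionState.DOC: ("OcudrDiameterCongestionDocState", "Major"),
    CongestionState.CONGESTED: ("OcudrDiameterCongestionCongestedState", "Critical"),
}

def congestion_alert(metric_value: int) -> Optional[Tuple[str, str]]:
    """Map a sample of ocudr_pod_congestion_state to (alert name, severity), or None."""
    state = CongestionState(metric_value)  # raises ValueError for undefined values
    return ALERTS.get(state)

print(congestion_alert(1))  # ('OcudrDiameterCongestionDocState', 'Major')
```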

4.1.2.28 DRProvServiceOverload60Percent

Table 4-46 DRProvServiceOverload60Percent

Field	Details
Description	This alert is fired when the application reaches the Warn overload level
Summary	This alert is fired when the application reaches the Warn overload level
Severity	Warning
Condition	Alert if the application overload level reaches 60%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7036
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Warn level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.29 DRProvServiceOverload75Percent

Table 4-47 DRProvServiceOverload75Percent

Field	Details
Description	This alert is fired when the application reaches the Minor overload level
Summary	This alert is fired when the application reaches the Minor overload level
Severity	Minor
Condition	Alert if the application overload level reaches 75%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7037
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Minor level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.30 DRProvServiceOverload80Percent

Table 4-48 DRProvServiceOverload80Percent

Field	Details
Description	This alert is fired when the application reaches the Major overload level
Summary	This alert is fired when the application reaches the Major overload level
Severity	Major
Condition	Alert if the application overload level reaches 80%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7038
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Major level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.31 DRProvServiceOverload90Percent

Table 4-49 DRProvServiceOverload90Percent

Field	Details
Description	This alert is fired when the application reaches the Critical overload level
Summary	This alert is fired when the application reaches the Critical overload level
Severity	Critical
Condition	Alert if the application overload level reaches 90%
OID	1.3.6.1.4.1.323.5.3.43.1.2.7039
Metric Used	load_level
Recommended Actions	This alert is cleared when the incoming traffic is reduced below the Critical level. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. If guidance is required, contact My Oracle Support.

4.1.2.32 Diameter-Gateway pod congestion Danger of congestion state

Table 4-50 Diameter-Gateway pod congestion Danger of congestion state

Field	Details
Description	DiameterGateway pod at Danger of Congestion state
Summary	DiameterGateway pod at Danger of Congestion state
Severity	Major
Condition	Alert if the diameter gateway pod is in Danger of Congestion (DOC) state
OID	1.3.6.1.4.1.323.5.3.43.1.2.7041
Metric Used	occnp_pod_congestion_state==1
Recommended Actions	This alert is raised when the Diameter Gateway pod congestion level is set to the Danger of Congestion (DOC) state. Steps: Reduce the traffic load or allocate adequate performance resources. Verify the pod congestion configuration and resource limits in the CNE GUI.

4.1.2.33 Diameter-Gateway pod CONGESTED state

Table 4-51 Diameter-Gateway pod CONGESTED state

Field	Details
Description	DiameterGateway pod at Congested state
Summary	DiameterGateway pod at Congested state
Severity	Critical
Condition	Alert if the diameter gateway pod is in CONGESTED state
OID	1.3.6.1.4.1.323.5.3.43.1.2.7042
Metric Used	occnp_pod_congestion_state==2
Recommended Actions	This alert is raised when the Diameter Gateway pod congestion level is set to the CONGESTED state. Steps: Reduce the traffic load or allocate adequate performance resources. Verify the pod congestion configuration and resource limits in the CNE GUI.

4.1.2.34 OcudrProvisioningTrafficRateAboveMajorThreshold

Table 4-52 OcudrProvisioningTrafficRateAboveMajorThreshold

Field	Details
Description	Ingress traffic rate is above the critical threshold, that is, 950 requests per second
Summary	Traffic Rate is above 95 Percent of Max requests per second (1000)
Severity	Critical
Condition	Alert if Ingress traffic reaches 95% of max TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7044
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared when the ingress traffic rate falls below the Critical threshold. Note: The threshold is configurable in `UDR_Alertrules.yaml`. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support. Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased. Check the Ingress Gateway logs in Kibana to determine the reason for the errors.

4.1.2.35 OcudrProvisioningTrafficRateAboveCriticalThreshold

Table 4-53 OcudrProvisioningTrafficRateAboveCriticalThreshold

Field	Details
Description	Ingress traffic rate is above the major threshold, that is, 900 requests per second
Summary	Traffic Rate is above 90 Percent of Max requests per second (1000)
Severity	Major
Condition	Alert if Ingress traffic reaches 90% of max TPS
OID	1.3.6.1.4.1.323.5.3.43.1.2.7045
Metric Used	oc_ingressgateway_http_requests_total
Recommended Actions	The alert is cleared when the total ingress traffic rate falls below the Major threshold, or when it exceeds the Critical threshold, in which case the corresponding Critical threshold alert is raised. Note: The threshold is configurable in `UDR_Alertrules.yaml`. Steps: Reassess why OCUDR is receiving additional traffic (for example, the mated-site OCUDR is unavailable in a geo-redundancy scenario). If this is unexpected, contact My Oracle Support. Refer to Grafana to determine which service is receiving high traffic. Refer to the Ingress Gateway section in Grafana to determine whether 4xx and 5xx error codes have increased. Check the Ingress Gateway logs in Kibana to determine the reason for the errors.

4.1.2.36 OcudrProvisioningTransactionErrorRateAbove25Percent

Table 4-54 OcudrProvisioningTransactionErrorRateAbove25Percent

Field	Details
Description	Transaction Error Rate detected above 25 Percent of Total Transactions
Summary	Transaction Error Rate detected above 25 Percent of Total Transactions
Severity	Major
Condition	Alert if failed transactions exceed 25% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7046
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 25% of the total transactions, or when they exceed the 50% threshold, in which case the OcudrProvisioningTransactionErrorRateAbove50Percent alert is raised. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.37 OcudrProvisioningTransactionErrorRateAbove50Percent

Table 4-55 OcudrProvisioningTransactionErrorRateAbove50Percent

Field	Details
Description	Transaction Error Rate detected above 50 Percent of Total Transactions
Summary	Transaction Error Rate detected above 50 Percent of Total Transactions
Severity	Critical
Condition	Alert if failed transactions exceed 50% of the total transactions
OID	1.3.6.1.4.1.323.5.3.43.1.2.7047
Metric Used	oc_ingressgateway_http_responses_total
Recommended Actions	The alert is cleared when failed transactions fall below 50% of the total transactions. Steps: Check the metrics per service and per method. For example, GET requests that failed with 503 can be identified from this metric: Metrics="oc_ingressgateway_http_responses_total" Method="GET" Status="503 SERVICE_UNAVAILABLE". If guidance is required, contact My Oracle Support.

4.1.2.38 PVCFullForSLFExport

Table 4-56 PVCFullForSLFExport

Field	Details
Description	Storage for Export tool is full
Summary	Storage for Export tool is full
Severity	Critical
Condition	Alert if PVC allocated for export tool dump path is full
Metric Used	export_tool_full_usage
Recommended Actions	The alert is cleared when the PVC usage is optimized. Configure maxDumps to a lower value to clear old dumps. Remove old dumps, if any, from the export tool container.
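Both remedies in the table above, lowering maxDumps and removing old dumps, come down to keeping only the newest N files in the dump path. The Python sketch below illustrates that retention rule on a local directory. It assumes dumps are ordinary files whose modification time reflects their age; the directory, file pattern, and `prune_dumps` function are hypothetical and are not the export tool's own cleanup logic.

```python
from pathlib import Path
from typing import List

def prune_dumps(dump_dir: Path, max_dumps: int, pattern: str = "*") -> List[Path]:
    """Delete the oldest files matching pattern so that at most max_dumps remain.

    Returns the paths that were removed. Age is taken from the file modification time.
    """
    if max_dumps < 0:
        raise ValueError("max_dumps must be non-negative")
    dumps = sorted(
        (path for path in dump_dir.glob(pattern) if path.is_file()),
        key=lambda path: path.stat().st_mtime,
        reverse=True,  # newest first
    )
    removed = dumps[max_dumps:]
    for path in removed:
        path.unlink()
    return removed
```

For example, `prune_dumps(Path("/tmp/export-dumps"), max_dumps=3)` would keep the three newest files in that hypothetical directory and delete the rest.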

4.1.2.39 FailedExtractForSLFExport

Table 4-57 FailedExtractForSLFExport

Field	Details
Description	Export tool job failed
Summary	Export tool job failed
Severity	Critical
Condition	Alert if the export operation fails
Metric Used	export_failure
Recommended Actions	Check the logs for the failure. The alert is cleared the next time the export job succeeds.

4.1.2.40 BulkImportTransferInFailed

Table 4-58 BulkImportTransferInFailed

Field	Details
Description	Transfer-in failed for bulk import
Summary	Transfer-in failed for bulk import
Severity	Major
Condition	Alert is raised if the transfer-in from the remote server to the PVC fails
Metric Used	bulkimport_transfer_in_status
Recommended Actions	This alert is cleared when the bulk import transfer-in succeeds. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. Contact My Oracle Support if guidance is required.

4.1.2.41 ExportToolTransferOutFailed

Table 4-59 ExportToolTransferOutFailed

Field	Details
Description	Transfer-out failed for export-tool
Summary	Transfer-out failed for export-tool
Severity	Major
Condition	Alert is raised if the transfer-out from the PVC to the remote server fails
Metric Used	sftp_transfer_status
Recommended Actions	This alert is cleared when the export tool transfer-out succeeds. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. Contact My Oracle Support if guidance is required.

4.1.2.42 BulkImportTransferOutFailed

Table 4-60 BulkImportTransferOutFailed

Field	Details
Description	Transfer-out failed for bulk import
Summary	Transfer-out failed for bulk import
Severity	Major
Condition	Alert is raised if the transfer-out from the PVC to the remote server fails
Metric Used	bulkimport_transfer_out_status
Recommended Actions	This alert is cleared when the bulk import transfer-out succeeds. Steps: Check the service-specific metrics, for example, udr_rest_failure_response_total, to understand the specific service request errors. Contact My Oracle Support if guidance is required.

4.1.2.43 PVCFullForXMLBulkImport

Table 4-61 PVCFullForXMLBulkImport

Field	Details
Description	Storage for XML Bulk Import tool is full
Summary	Storage for XML Bulk Import tool is full
Severity	Critical
Condition	Alert is raised if the PVC for the xml-csv container is full
Metric Used	nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-xmltocsv",kubernetes_namespace="ocudr"}==1
Recommended Actions	This alert will be cleared when the PVC is back to normal. Steps: Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total. Contact My Oracle Support, if guidance is required.

4.1.2.44 PVCFullForBulkImport

Table 4-62 PVCFullForBulkImport

Field	Details
Description	Storage for Bulk Import tool is full
Summary	Storage for Bulk Import tool is full
Severity	Critical
Condition	Alert is raised if the PVC for the bulk import container is full
Metric Used	nudr_bulk_import_tool_pvc_full_usage{app_kubernetes_io_name="nudr-bulk-import",kubernetes_namespace="ocudr"}==1
Recommended Actions	This alert will be cleared when the PVC is back to normal. Steps: Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total. Contact My Oracle Support, if guidance is required.

4.1.2.45 OperationalStatusCompleteShutdown

Table 4-63 OperationalStatusCompleteShutdown

Field	Details
Description	Operational state is control shutdown
Summary	Operational state is control shutdown
Severity	Critical
Condition	Alert is raised if the operational state of the UDR, SLF, or EIR is COMPLETE_SHUTDOWN
Metric Used	nudr_config_operational_status{kubernetes_namespace="ocudr"}==1
Recommended Actions	This alert will be cleared when the operational status is back to normal. Steps: Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total. Contact My Oracle Support, if guidance is required.

4.1.2.46 NFScoreCalculationFailed

Table 4-64 NFScoreCalculationFailed

Field	Details
Description	NFScoreCalculationFailed
Summary	NFScoreCalculationFailed
Severity	Major
Condition	Alert is raised if the NF score calculation fails for any of the scoring factors
Metric Used	nfscore{kubernetes_namespace="ocudr",factor=~"successTPS|signallingConnections|serviceHealth|replicationHealth|localityPreference|bulkImport|bulkExport",calculatedStatus="failed"}
Recommended Actions	This alert is cleared when the NF score calculation is successful. Steps: Check the service specific metrics to understand the specific service request errors. For example, udr_rest_failure_response_total. Contact My Oracle Support, if guidance is required.
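The `factor=~"..."` matcher in the NFScoreCalculationFailed metric is a regular expression, and Prometheus anchors such matchers at both ends. The Python sketch below uses `re.fullmatch` to mimic that anchoring. It shows which factor values a plain alternation matches, and why backslash-escaped pipes (as sometimes appear when a selector is copied from a Markdown table) do not work: `\|` is a literal pipe character, not alternation. Python's `re` and the RE2 engine used by Prometheus differ in some features, but not for a simple alternation like this one.

```python
import re

# Alternation of the scoring factors listed in the NFScoreCalculationFailed selector.
FACTORS = (
    "successTPS|signallingConnections|serviceHealth|replicationHealth"
    "|localityPreference|bulkImport|bulkExport"
)

def label_matches(value: str, pattern: str = FACTORS) -> bool:
    """Mimic a PromQL =~ matcher, which must match the whole label value."""
    return re.fullmatch(pattern, value) is not None

print(label_matches("serviceHealth"))       # True
print(label_matches("serviceHealthCheck"))  # False: anchored, so no partial match
print(label_matches("successTPS", r"successTPS\|bulkImport"))  # False: "\|" is a literal pipe
```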

4.1.2.47 PVCFullForEXMLExport

Table 4-65 PVCFullForEXMLExport

Field	Details
Description	Storage for Export tool is full
Summary	Storage for Export tool is full
Severity	Critical
Condition	Alert is raised if PVC allocated for export tool dump path is full.
Metric Used	export_tool_full_usage{namespace="ocudr"}==1
Recommended Actions	Alert is cleared when the PVC usage is optimized. You must configure maxDumps to a lower value to clear old dumps. Steps: If present, remove the old dumps from the export tool container.

4.1.2.48 EXMLExportFailed

Table 4-66 EXMLExportFailed

Field	Details
Description	Export tool job failed
Summary	Export tool job failed
Severity	Critical
Condition	Alert is raised if the export operation fails for EXML mode
Metric Used	export_failure{namespace="ocudr"}==1
Recommended Actions	You must check the logs for the failure. When the next export job is successful, the alert is cleared.

4.1.2.49 IngressgatewayPodProtectionDocState

Table 4-67 IngressgatewayPodProtectionDocState

Field	Details
Description	Ingress congestion in DOC state
Summary	Ingress congestion in DOC state
Severity	Critical
Condition	Alert is raised if the ingress gateway pod congestion state is DOC (Danger of Congestion).
Metric Used	oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==1
Recommended Actions	This alert is cleared when the ingress gateway returns to the normal state. Steps: Check the service-specific metrics, such as udr_rest_failure_response_total, to understand the specific service request errors. Contact My Oracle Support if guidance is required.

4.1.2.50 IngressgatewayPodProtectionCongestedState

Table 4-68 IngressgatewayPodProtectionCongestedState

Field	Details
Description	Ingress congestion in Congested state
Summary	Ingress congestion in Congested state
Severity	Critical
Condition	Alert is raised if the ingress gateway pod congestion state is Congested.
Metric Used	oc_ingressgateway_pod_congestion_state{namespace="ocudr"}==2
Recommended Actions	This alert is cleared when the ingress gateway returns to the normal state. Steps: Check the service-specific metrics, such as udr_rest_failure_response_total, to understand the specific service request errors. Contact My Oracle Support if guidance is required.
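IngressgatewayPodProtectionDocState and IngressgatewayPodProtectionCongestedState read the same gauge, `oc_ingressgateway_pod_congestion_state`, and differ only in the value they compare against: 1 for DOC and 2 for Congested. Because both use `==`, at most one of the two fires for a given pod at a time. The Python sketch below maps a sampled gauge value to the alert it would raise. The helper name is hypothetical, and any value other than 1 or 2 is simply treated as raising neither alert; the sketch does not assume which value represents the normal state.

```python
from typing import Optional, Tuple

# Gauge values taken from the two pod protection alert expressions
ALERT_BY_STATE = {
    1: ("IngressgatewayPodProtectionDocState", "Critical"),
    2: ("IngressgatewayPodProtectionCongestedState", "Critical"),
}


def congestion_alert(state_value: int) -> Optional[Tuple[str, str]]:
    """Hypothetical helper: return (alert name, severity), or None if neither fires."""
    return ALERT_BY_STATE.get(state_value)


for value in (1, 2, 0):
    print(value, congestion_alert(value))
```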

4.1.2.51 RetryNotificationRecordsMaxLimitExceeded

Table 4-69 RetryNotificationRecordsMaxLimitExceeded

Field	Details
Description	Alert is raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Summary	Alert is raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Severity	Critical
Condition	Alert is raised if the number of retry notifications stored in the UDR database exceeds the maximum limit.
Metric Used	nudr_notif_records_limit_exceeded{namespace="ocudr"}==1
Recommended Actions	This alert is raised when notification failures increase and the number of retry notifications stored in the database exceeds 50,000. Steps: Check the notification failure rate and fix the cause of the failures. This reduces the number of notifications marked for retry that are stored in the UDR database. Contact My Oracle Support if guidance is required.

4.1.2.52 UserAgentHeaderNotFoundMorethan10PercentRequest

Table 4-70 UserAgentHeaderNotFoundMorethan10PercentRequest

Field	Details
Description	Alert is raised if requests without a User-Agent header make up 10% or more of the ingress traffic when the suppress notification feature is enabled.
Summary	Alert is raised if requests without a User-Agent header make up 10% or more of the ingress traffic when the suppress notification feature is enabled.
Severity	Critical
Condition	Alert is raised if requests without a User-Agent header make up 10% or more of the ingress traffic.
Metric Used	(sum by(namespace)(rate(suppress_user_agent_not_found_total{namespace="ocudr"}[5m]))/sum by(namespace)(rate(oc_ingressgateway_http_requests_total{namespace="ocudr"}[5m])))*100 >= 10
Recommended Actions	This alert is cleared when requests without a User-Agent header fall below 10% of the ingress traffic. Steps: Check the service-specific metrics to understand the specific service request errors. Contact My Oracle Support if guidance is required.
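The expression divides the per-second rate of requests without a User-Agent header by the per-second rate of all ingress gateway requests, both over a 5-minute window, and multiplies the result by 100. The Python sketch below reproduces that arithmetic from two counter readings taken 300 seconds apart. It is a simplification: it ignores counter resets and the extrapolation that PromQL's `rate()` performs, and the counter values are invented for illustration.

```python
WINDOW_SECONDS = 300      # the [5m] range used in the alert expression
THRESHOLD_PERCENT = 10    # the ">= 10" comparison


def per_second_rate(start: float, end: float) -> float:
    """Simplified rate(): no counter resets, no extrapolation."""
    return (end - start) / WINDOW_SECONDS


def missing_user_agent_percent(no_ua_start: float, no_ua_end: float,
                               total_start: float, total_end: float) -> float:
    total_rate = per_second_rate(total_start, total_end)
    if total_rate == 0:
        # No traffic: PromQL computes 0/0 = NaN, which never satisfies ">= 10"
        return 0.0
    no_ua_rate = per_second_rate(no_ua_start, no_ua_end)
    return no_ua_rate / total_rate * 100


# Illustrative counters: 300,000 requests in 5 minutes (1000/s),
# 36,000 of them without a User-Agent header
percent = missing_user_agent_percent(0, 36_000, 0, 300_000)
print(f"{percent:.1f}% of requests lack User-Agent; alert: {percent >= THRESHOLD_PERCENT}")
```

At the 1000 requests/second ingress rate noted at the start of this section, 100 requests/second without a User-Agent header is enough to reach the threshold.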

4.1.2.53 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Table 4-71 EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold

Field	Details
Description	Alert is raised if the egress gateway JVM buffer memory usage is at or above the minor threshold.
Summary	Alert is raised if the egress gateway JVM buffer memory usage is at or above the minor threshold.
Severity	Minor
Condition	Alert is raised if the egress gateway JVM buffer memory usage is at or above the minor threshold of 1,300,000,000 bytes.
Metric Used	sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1300000000
Recommended Actions	This alert is cleared when the egress gateway JVM buffer memory usage falls below the minor threshold. Steps: Check why the egress gateway JVM buffer memory usage is above the threshold and why the gateway is not releasing enough memory on its own to fall below it. Contact My Oracle Support if guidance is required.

4.1.2.54 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Table 4-72 EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold

Field	Details
Description	Alert is raised if the egress gateway JVM buffer memory usage is at or above the major threshold.
Summary	Alert is raised if the egress gateway JVM buffer memory usage is at or above the major threshold.
Severity	Major
Condition	Alert is raised if the egress gateway JVM buffer memory usage is at or above the major threshold of 1,500,000,000 bytes.
Metric Used	sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1500000000
Recommended Actions	This alert is cleared when the egress gateway JVM buffer memory usage falls below the major threshold. Steps: Check why the egress gateway JVM buffer memory usage is above the threshold and why the gateway is not releasing enough memory on its own to fall below it. Contact My Oracle Support if guidance is required.

4.1.2.55 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Table 4-73 EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold

Field	Details
Description	Alert is raised if the egress gateway JVM buffer memory usage is at or above the critical threshold.
Summary	Alert is raised if the egress gateway JVM buffer memory usage is at or above the critical threshold.
Severity	Critical
Condition	Alert is raised if the egress gateway JVM buffer memory usage is at or above the critical threshold of 1,800,000,000 bytes.
Metric Used	sum by (id, pod) (jvm_buffer_memory_used_bytes{namespace="ocudr",pod=~".*egress.*"}) >= 1800000000
Recommended Actions	This alert is cleared when the egress gateway JVM buffer memory usage falls below the critical threshold. Steps: Check why the egress gateway JVM buffer memory usage is above the threshold and why the gateway is not releasing enough memory on its own to fall below it. Contact My Oracle Support if guidance is required.
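Each of the three egress gateway buffer memory expressions checks only a lower bound (`>=`), so the rules are independent: a series that crosses the major threshold also satisfies the minor one, and both alerts are active at the same time. The Python sketch below illustrates this using the thresholds from the three expressions, evaluated for a single `(id, pod)` series. The helper names and the sample value are hypothetical.

```python
from typing import List, Optional

# (threshold in bytes, severity, alert name), copied from the three expressions
THRESHOLDS = [
    (1_800_000_000, "Critical", "EgressGatewayJVMBufferMemoryUsedAboveCriticalThreshold"),
    (1_500_000_000, "Major", "EgressGatewayJVMBufferMemoryUsedAboveMajorThreshold"),
    (1_300_000_000, "Minor", "EgressGatewayJVMBufferMemoryUsedAboveMinorThreshold"),
]


def firing_alerts(buffer_bytes: int) -> List[str]:
    """Each rule only checks '>= threshold', so every lower tier fires too."""
    return [name for limit, _severity, name in THRESHOLDS if buffer_bytes >= limit]


def highest_severity(buffer_bytes: int) -> Optional[str]:
    """Most severe tier reached; THRESHOLDS is ordered from highest down."""
    for limit, severity, _name in THRESHOLDS:
        if buffer_bytes >= limit:
            return severity
    return None


# Illustrative value for one (id, pod) series: 1.6 GB of buffer memory in use
print(firing_alerts(1_600_000_000))
print(highest_severity(1_600_000_000))
```

When routing notifications, an inhibition rule that suppresses the lower tiers while a higher tier is active is one way to avoid duplicate pages; that configuration is outside the scope of this sketch.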

4.1.2.56 NudrDiameterGatewayDown

Table 4-74 NudrDiameterGatewayDown

Field	Details
Description	Alert is raised if the nudr-diam-gateway service is down.
Summary	Alert is raised if the nudr-diam-gateway service is down.
Severity	Critical
Condition	Alert is raised if the nudr-diam-gateway service is down.
Metric Used	absent(up{container="nudr-diam-gateway",namespace="ocudr"}) or up{container="nudr-diam-gateway",namespace="ocudr"} == 0
Recommended Actions	This alert is cleared when the nudr-diam-gateway service is available. Steps: Run the following command to check the orchestration logs of the appinfo service and look for liveness or readiness probe failures: `kubectl get po -n <namespace>` Run the following command using the full name of the pod that is not running: `kubectl describe pod <specific desired full pod name> -n <namespace>` Refer to the application logs in Kibana and filter based on the appinfo service names. Check for `ERROR` and `WARNING` logs related to thread exceptions. Perform the resolution steps depending on the reason for the failure. Contact My Oracle Support if guidance is required. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
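The NudrDiameterGatewayDown expression combines two checks. `absent(up{...})` returns a result when no `up` series exists for the container at all, for example when no gateway target is being scraped, while `up{...} == 0` catches a target that is still scraped but whose scrape failed. The Python sketch below models the scraped `up` samples as a plain list of `(container, namespace, value)` tuples, which is a simplification for illustration, to show that either case raises the alert. The helper name is hypothetical.

```python
from typing import List, Tuple

# Hypothetical model of scraped "up" samples: (container, namespace, value)
UpSample = Tuple[str, str, float]


def diam_gateway_down(samples: List[UpSample],
                      container: str = "nudr-diam-gateway",
                      namespace: str = "ocudr") -> bool:
    matching = [value for c, ns, value in samples
                if c == container and ns == namespace]
    if not matching:
        # absent(up{...}) returns a series: no target is being scraped at all
        return True
    # up{...} == 0 returns a series for every target whose scrape failed
    return any(value == 0 for value in matching)


# Two gateway targets scraped; one scrape failed, so the alert condition holds
print(diam_gateway_down([("nudr-diam-gateway", "ocudr", 1.0),
                         ("nudr-diam-gateway", "ocudr", 0.0)]))
```

Note that with several gateway pods, a single failed target is enough to satisfy `up == 0`, even if other pods are healthy.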

4.1.2.57 DiameterPeerConnectionsDropped

Table 4-75 DiameterPeerConnectionsDropped

Field	Details
Description	Alert is raised if the number of connections between the Diameter peers and the Diameter gateway falls below the required count.
Summary	Alert is raised if the number of connections between the Diameter peers and the Diameter gateway falls below the required count.
Severity	Major
Condition	Alert is raised if there are fewer than two connections for CHI origin hosts, fewer than two connections for IND origin hosts, or fewer than five connections in total.
Metric Used	sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0)) < 2 or sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0)) < 2 or (sum(ocudr_diam_conn_network{origHost=~".*CHI.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0)) + sum(ocudr_diam_conn_network{origHost=~".*IND.*",container="nudr-diam-gateway",namespace="ocudr"} or vector(0))) < 5
Recommended Actions	This alert is cleared when the Diameter peer connections are restored to the required count. Steps: Run the following command to check the orchestration logs of the appinfo service and look for liveness or readiness probe failures: `kubectl get po -n <namespace>` Run the following command using the full name of the pod that is not running: `kubectl describe pod <specific desired full pod name> -n <namespace>` Refer to the application logs in Kibana and filter based on the appinfo service names. Check for `ERROR` and `WARNING` logs related to thread exceptions. Perform the resolution steps depending on the reason for the failure. Contact My Oracle Support if guidance is required. Note: Use the CNC NF Data Collector tool to capture logs. For more information, see Oracle Communications Cloud Native Core, Network Function Data Collector User Guide.
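The DiameterPeerConnectionsDropped expression raises the alert when any of three checks holds: fewer than two connections for origin hosts matching CHI, fewer than two for origin hosts matching IND, or fewer than five connections across both groups. The `or vector(0)` terms make a group with no matching series count as zero instead of silently producing no result. The Python sketch below evaluates the same logic on connection samples. It assumes that each connected peer reports the value 1, and it treats CHI and IND as substrings of the origin host, which is an assumption about the intent of the `origHost` patterns; the host names are invented for illustration.

```python
from typing import List, Tuple

# Hypothetical ocudr_diam_conn_network samples: (origHost, value)
ConnSample = Tuple[str, float]


def site_total(samples: List[ConnSample], site: str) -> float:
    # Mirrors sum(... or vector(0)): no matching series counts as 0
    return sum(value for host, value in samples if site in host)


def peer_connections_dropped(samples: List[ConnSample]) -> bool:
    chi = site_total(samples, "CHI")
    ind = site_total(samples, "IND")
    return chi < 2 or ind < 2 or (chi + ind) < 5


# Illustrative peers: two CHI and two IND connections, four in total
peers = [("peer1.CHI.example", 1.0), ("peer2.CHI.example", 1.0),
         ("peer1.IND.example", 1.0), ("peer2.IND.example", 1.0)]
print(peer_connections_dropped(peers))
```

In this example each group meets its per-site minimum of two, but the total of four is below five, so the alert condition still holds. Adding a third CHI connection brings the total to five and clears it.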