8 Alerts

This section provides information on Policy alerts and their configuration.

Note:

The performance and capacity of the system can vary based on the call model and the configuration, including, but not limited to, the deployed policies and their corresponding data, for example, policy tables.

You can configure alerts in Prometheus using the Alertrules.yaml file.

The following table describes the various severity types of alerts generated by Policy:

Table 8-1 Alerts Levels or Severity Types

Alerts Levels / Severity Types Definition
Critical Indicates a severe issue that poses a significant risk to safety, security, or operational integrity. It requires an immediate response to address the situation and prevent serious consequences. Raised for conditions that can affect the service of Policy.
Major Indicates a more significant issue that has an impact on operations or poses a moderate risk. It requires prompt attention and action to mitigate potential escalation. Raised for conditions that can affect the service of Policy.
Minor Indicates a situation that is low in severity and does not pose an immediate risk to safety, security, or operations. It requires attention but does not demand urgent action. Raised for conditions that can affect the service of Policy.
Info or Warn (Informational) Provides general information or updates that are not related to immediate risks or actions. These alerts are for awareness and do not typically require any specific response. WARN and INFO alerts may not impact the service of Policy.

For details on how to configure Policy alerts, see Configuring Alerts section in Oracle Communications Cloud Native Core, Converged Policy Installation, Upgrade, and Fault Recovery Guide.

For details on how to configure SNMP Notifier, see Configuring SNMP Notifier section in Oracle Communications Cloud Native Core, Converged Policy Installation, Upgrade, and Fault Recovery Guide.
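Each alert described in this chapter is defined as a Prometheus alerting rule in the Alertrules.yaml file. The following is a minimal sketch of what such a rule can look like, using the POD_CONGESTION_L1 values from this chapter and the standard Prometheus rule format; the grouping, labels, and annotations in the shipped Alertrules.yaml file may differ:

```yaml
groups:
  - name: policy-alerts
    rules:
      - alert: PodCongestionL1
        # Expression, severity, and summary as listed in Table 8-2.
        expr: occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
        labels:
          severity: critical
        annotations:
          summary: 'Alert when cpu of pod is in CONGESTION_L1 state.'
```

When Prometheus evaluates the rule and the expression returns a result, the alert fires with the listed severity and can be forwarded through the SNMP Notifier using the OID from the corresponding alert table.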

8.1 List of Alerts

This section provides detailed information about the alert rules defined for Policy. It consists of the following three types of alerts:
  1. Common Alerts - This category of alerts is common and required for all three modes of deployment.
  2. PCF Alerts - This category of alerts is specific to PCF microservices and required for Converged and PCF only modes of deployment.
  3. PCRF Alerts - This category of alerts is specific to PCRF microservices and required for Converged and PCRF only modes of deployment.

8.1.1 Common Alerts

This section provides information about alerts that are common for PCF and PCRF.

8.1.1.1 POD_CONGESTION_L1

Table 8-2 POD_CONGESTION_L1

Field Details
Name in Alert Yaml File PodCongestionL1
Description Alert when cpu of pod is in CONGESTION_L1 state.
Summary Alert when cpu of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.71
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.2 POD_CONGESTION_L2

Table 8-3 POD_CONGESTION_L2

Field Details
Name in Alert Yaml File PodCongestionL2
Description Alert when cpu of pod is in CONGESTION_L2 state.
Summary Alert when cpu of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.72
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.3 POD_PENDING_REQUEST_CONGESTION_L1

Table 8-4 POD_PENDING_REQUEST_CONGESTION_L1

Field Details
Name in Alert Yaml File PodPendingRequestCongestionL1
Description Alert when queue of pod is in CONGESTION_L1 state.
Summary Alert when queue of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.73
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.4 POD_PENDING_REQUEST_CONGESTION_L2

Table 8-5 POD_PENDING_REQUEST_CONGESTION_L2

Field Details
Name in Alert Yaml File PodPendingRequestCongestionL2
Description Alert when queue of pod is in CONGESTION_L2 state.
Summary Alert when queue of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.74
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.5 POD_CPU_CONGESTION_L1

Table 8-6 POD_CPU_CONGESTION_L1

Field Details
Name in Alert Yaml File PodCPUCongestionL1
Description Alert when cpu of pod is in CONGESTION_L1 state.
Summary Alert when cpu of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.73
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.6 POD_CPU_CONGESTION_L2

Table 8-7 POD_CPU_CONGESTION_L2

Field Details
Name in Alert Yaml File PodCPUCongestionL2
Description Alert when cpu of pod is in CONGESTION_L2 state.
Summary Alert when cpu of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.74
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.7 Pod_Memory_DoC

Table 8-8 Pod_Memory_DoC

Field Details
Description Pod Resource Congestion status of {{$labels.service}} service is DoC for Memory type
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for Memory type
Severity Major
Expression occnp_pod_resource_congestion_state{type="memory"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.31
Metric Used occnp_pod_resource_congestion_state
Recommended Actions
The alert triggers based on the resource limit usage and the load shedding configurations in congestion control. The CPU, memory, and queue usage can be viewed on the Grafana dashboard.

Note:

Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.8 Pod_Memory_Congested

Table 8-9 Pod_Memory_Congested

Field Details
Description Pod Resource Congestion status of {{$labels.service}} service is congested for Memory type
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for Memory type
Severity Critical
Expression occnp_pod_resource_congestion_state{type="memory"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.32
Metric Used occnp_pod_resource_congestion_state
Recommended Actions

The alert triggers based on the resource limit usage and the load shedding configurations in congestion control. The CPU, memory, and queue usage can be viewed on the Grafana dashboard.

For any additional guidance, contact My Oracle Support.

8.1.1.9 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-10 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the critical threshold limit.
Summary RAA Rx fail count exceeds the critical threshold limit.
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.10 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-11 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the major threshold limit.
Summary RAA Rx fail count exceeds the major threshold limit.
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.11 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-12 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the minor threshold limit.
Summary RAA Rx fail count exceeds the minor threshold limit.
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
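The three RAA Rx failure alerts above form mutually exclusive bands over the same failure percentage (responses whose Diameter result code does not start with 2, as a share of all RAA responses): Minor covers above 60% up to 80%, Major covers above 80% up to 90%, and Critical covers anything above 90%, so at most one of the three alerts is active at a time. As a sketch of the Major band in the standard Prometheus rule format (the shipped Alertrules.yaml file may group and label the rules differently):

```yaml
# Failure percentage = 100 * failed RAA responses / all RAA responses
# over a 5-minute window, on the Rx interface (appId 16777236).
- alert: RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD
  expr: >
    sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m]))
    / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 > 80
    and
    sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m]))
    / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 90
  labels:
    severity: major
```

The upper bound (`<= 90`) is what clears the Major alert when the failure percentage escalates into the Critical band.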
8.1.1.12 ASA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-13 ASA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the critical threshold limit.
Summary ASA Rx fail count exceeds the critical threshold limit.
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.13 ASA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-14 ASA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the major threshold limit.
Summary ASA Rx fail count exceeds the major threshold limit.
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.14 ASA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-15 ASA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the minor threshold limit.
Summary ASA Rx fail count exceeds the minor threshold limit.
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.15 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-16 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the minor threshold limit
Summary ASA Rx timeout count exceeds the minor threshold limit
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.16 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-17 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the major threshold limit
Summary ASA Rx timeout count exceeds the major threshold limit
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.17 ASA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-18 ASA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the critical threshold limit
Summary ASA Rx timeout count exceeds the critical threshold limit
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.18 SCP_PEER_UNAVAILABLE

Table 8-19 SCP_PEER_UNAVAILABLE

Field Details
Description Configured SCP peer is unavailable.
Summary Configured SCP peer is unavailable.
Severity Major
Expression occnp_oc_egressgateway_peer_health_status != 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.60
Metric Used occnp_oc_egressgateway_peer_health_status
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.19 SCP_PEER_SET_UNAVAILABLE

Table 8-20 SCP_PEER_SET_UNAVAILABLE

Field Details
Description None of the SCP peers are available for the configured peer set.
Summary {{ $value }} SCP peers under peer set {{$labels.peerset}} are currently unavailable.
Severity Critical
Expression (occnp_oc_egressgateway_peer_count > 0 and (occnp_oc_egressgateway_peer_available_count) == 0)
OID 1.3.6.1.4.1.323.5.3.52.1.2.61
Metric Used occnp_oc_egressgateway_peer_count and occnp_oc_egressgateway_peer_available_count
Recommended Actions

The NF clears the critical alarm when at least one SCP peer in a peer set becomes available, even if all other SCP peers in the given peer set remain unavailable.

For any additional guidance, contact My Oracle Support.
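The SCP_PEER_SET_UNAVAILABLE expression combines two gauges: it fires only when peers are configured for a peer set (`occnp_oc_egressgateway_peer_count > 0`) but none of them are currently available (`occnp_oc_egressgateway_peer_available_count == 0`). A sketch in the standard Prometheus rule format, using the expression and summary from Table 8-20 (grouping and labels in the shipped file may differ):

```yaml
- alert: SCP_PEER_SET_UNAVAILABLE
  # Fires per peer set: peers are configured but none are reachable.
  expr: (occnp_oc_egressgateway_peer_count > 0 and (occnp_oc_egressgateway_peer_available_count) == 0)
  labels:
    severity: critical
  annotations:
    summary: '{{ $value }} SCP peers under peer set {{ $labels.peerset }} are currently unavailable.'
```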
8.1.1.20 STALE_CONFIGURATION

Table 8-21 STALE_CONFIGURATION

Field Details
Description In the last 10 minutes, the current service config_level does not match the config_level from the config-server.
Summary In the last 10 minutes, the current service config_level does not match the config_level from the config-server.
Severity Major
Expression (sum by(namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"})) / (count by(namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"})) != (sum by(namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"})) / (count by(namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
OID 1.3.6.1.4.1.323.5.3.52.1.2.62
Metric Used topic_version
Recommended Actions For any additional guidance, contact My Oracle Support.
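The STALE_CONFIGURATION expression compares the average `config.level` topic version reported by config-server pods against the average reported by all other services; any divergence sustained over the evaluation period indicates that at least one service is running with stale configuration. The same expression, reformatted for readability as a sketch in the standard Prometheus rule format:

```yaml
- alert: STALE_CONFIGURATION
  # Average config.level topic version on config-server pods
  # compared with the average on all other services, per namespace.
  expr: >
    (sum by (namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"}))
    / (count by (namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"}))
    !=
    (sum by (namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
    / (count by (namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
  labels:
    severity: major
```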
8.1.1.21 POLICY_SERVICES_DOWN

Table 8-22 POLICY_SERVICES_DOWN

Field Details
Name in Alert Yaml File PCF_SERVICES_DOWN
Description {{$labels.service}} service is not running.
Summary {{$labels.service}} service is not running.
Severity Critical
Expression appinfo_service_running{vendor="Oracle", application="occnp", category!=""} != 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.1
Metric Used appinfo_service_running
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.22 DIAM_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-23 DIAM_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File DiamTrafficRateAboveThreshold
Description Diameter Connector Ingress traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second.
Severity Major
Expression The total Ingress traffic rate for Diameter connector has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in Common_Alertrules.yaml file is when Diameter Connector Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.6
Metric Used ocpm_ingress_request_total
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
For any additional guidance, contact My Oracle Support.
8.1.1.23 DIAM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-24 DIAM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File DiamIngressErrorRateAbove10Percent
Description Transaction Error Rate detected above 10 Percent of Total on Diameter Connector (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions.
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions on Diameter Connector.
OID 1.3.6.1.4.1.323.5.3.36.1.2.7
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_ingress_response_total{servicename_3gpp="rx",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
For any additional guidance, contact My Oracle Support.
8.1.1.24 DIAM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-25 DIAM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File DiamEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total on Diameter Connector (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total Egress Gateway transactions on Diameter Connector.
OID 1.3.6.1.4.1.323.5.3.36.1.2.8
Metric Used ocpm_egress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the errors. For instance: ocpm_egress_response_total{servicename_3gpp="rx",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
For any additional guidance, contact My Oracle Support.
8.1.1.25 UDR_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-26 UDR_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PcfUdrIngressTrafficRateAboveThreshold
Description User service Ingress traffic Rate from UDR is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total User Service Ingress traffic rate from UDR has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in Common_Alertrules.yaml file is when user service Ingress Rate from UDR crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.9
Metric Used ocpm_userservice_inbound_count_total{service_resource="udr-service"}
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.26 UDR_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-27 UDR_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PcfUdrEgressErrorRateAbove10Percent
Description Egress Transaction Error Rate detected above 10 Percent of Total on User service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions from UDR is more than 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.10
Metric Used ocpm_udr_tracking_response_total{servicename_3gpp="nudr-dr",response_code!~"2.*"}
Recommended Actions The alert gets cleared when the number of failed transactions falls below the configured threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for failed transactions. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Egress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Egress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.27 POLICYDS_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-28 POLICYDS_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PolicyDsIngressTrafficRateAboveThreshold
Description Ingress Traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Critical
Expression The total PolicyDS Ingress message rate has crossed the configured threshold of 900 TPS (90% of the maximum Ingress request rate).

Default value of this alert trigger point in Common_Alertrules.yaml file is when PolicyDS Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.13
Metric Used client_request_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.28 POLICYDS_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-29 POLICYDS_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PolicyDsIngressErrorRateAbove10Percent
Description Ingress Transaction Error Rate detected above 10 Percent of Total on PolicyDS service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions for PolicyDS service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.14
Metric Used client_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: client_response_total{response!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.29 POLICYDS_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-30 POLICYDS_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File PolicyDsEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total on PolicyDS service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total transactions for PolicyDS service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.15
Metric Used server_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: server_response_total{response!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.30 UDR_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Table 8-31 UDR_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File PcfUdrIngressTimeoutErrorAboveMajorThreshold
Description Ingress Timeout Error Rate detected above 10 Percent of Total towards UDR service (current value is: {{ $value }})
Summary Timeout Error Rate detected above 10 Percent of Total Transactions
Severity Major
Expression The number of failed transactions due to timeout is above 10 percent of the total transactions for UDR service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.16
Metric Used ocpm_udr_tracking_request_timeout_total{servicename_3gpp="nudr-dr"}
Recommended Actions The alert gets cleared when the number of failed transactions due to timeout falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_udr_tracking_request_timeout_total{servicename_3gpp="nudr-dr"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.31 DB_TIER_DOWN_ALERT

Table 8-32 DB_TIER_DOWN_ALERT

Field Details
Name in Alert Yaml File DBTierDownAlert
Description DB is not reachable.
Summary DB is not reachable.
Severity Critical
Expression Database is not available.
OID 1.3.6.1.4.1.323.5.3.36.1.2.18
Metric Used appinfo_category_running{category="database"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.32 CPU_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Table 8-33 CPU_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveMinorThreshold
Description CPU usage for {{$labels.service}} service is above 60
Summary CPU usage for {{$labels.service}} service is above 60
Severity Minor
Expression A service pod has reached the configured minor threshold (60%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.19
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the minor threshold or crosses the major threshold, in which case CPUUsagePerServiceAboveMajorThreshold alert shall be raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.33 CPU_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Table 8-34 CPU_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveMajorThreshold
Description CPU usage for {{$labels.service}} service is above 80
Summary CPU usage for {{$labels.service}} service is above 80
Severity Major
Expression A service pod has reached the configured major threshold (80%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.20
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the major threshold or crosses the critical threshold, in which case CPUUsagePerServiceAboveCriticalThreshold alert shall be raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.34 CPU_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Table 8-35 CPU_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveCriticalThreshold
Description CPU usage for {{$labels.service}} service is above 90
Summary CPU usage for {{$labels.service}} service is above 90
Severity Critical
Expression A service pod has reached the configured critical threshold (90%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.21
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the critical threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.
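The three CPU usage alerts above form escalating tiers over the same measurement: Minor fires between 60% and 80% of the CPU limit, Major between 80% and 90%, and Critical above 90%, with each alert clearing when usage leaves its band. A sketch of how such tiered rules can be expressed; the recording rule `service_cpu_usage_percent` and its expression are hypothetical illustrations built on `container_cpu_usage_seconds_total` (the guide lists only the metric), and the actual expressions in the PCF_Alertrules.yaml file may differ:

```yaml
groups:
  - name: cpu-usage-tiers
    rules:
      # Hypothetical recording rule: per-service CPU usage as a percentage
      # of the configured CPU limit (limit metric from kube-state-metrics;
      # adjust labels to match your deployment).
      - record: service_cpu_usage_percent
        expr: >
          100 * sum by (service) (rate(container_cpu_usage_seconds_total[5m]))
          / sum by (service) (kube_pod_container_resource_limits{resource="cpu"})
      - alert: CPUUsagePerServiceAboveMinorThreshold
        expr: service_cpu_usage_percent > 60 and service_cpu_usage_percent <= 80
        labels:
          severity: minor
      - alert: CPUUsagePerServiceAboveMajorThreshold
        expr: service_cpu_usage_percent > 80 and service_cpu_usage_percent <= 90
        labels:
          severity: major
      - alert: CPUUsagePerServiceAboveCriticalThreshold
        expr: service_cpu_usage_percent > 90
        labels:
          severity: critical
```

The banded upper bounds are what let a lower-tier alert clear automatically when usage escalates into the next tier.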

8.1.1.35 MEMORY_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Table 8-36 MEMORY_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveMinorThreshold
Description Memory usage for {{$labels.service}} service is above 60
Summary Memory usage for {{$labels.service}} service is above 60
Severity Minor
Expression A service pod has reached the configured minor threshold (60%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.22
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the minor threshold or crosses the major threshold, in which case the MemoryUsagePerServiceAboveMajorThreshold alert is raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.36 MEMORY_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Table 8-37 MEMORY_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveMajorThreshold
Description Memory usage for {{$labels.service}} service is above 80%
Summary Memory usage for {{$labels.service}} service is above 80%
Severity Major
Expression A service pod has reached the configured major threshold (80%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.23
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the major threshold or crosses the critical threshold, in which case the MemoryUsagePerServiceAboveCriticalThreshold alert is raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.37 MEMORY_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Table 8-38 MEMORY_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveCriticalThreshold
Description Memory usage for {{$labels.service}} service is above 90%
Summary Memory usage for {{$labels.service}} service is above 90%
Severity Critical
Expression A service pod has reached the configured critical threshold (90%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.24
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the critical threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.
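The three memory alerts above differ only in threshold and severity. A hedged sketch of the critical tier, assuming the same limit-ratio shape as the CPU alerts (the minor and major tiers would use `> 60 and <= 80` and `> 80 and <= 90` respectively); verify the exact expression against the shipped PCF_Alertrules.yaml:

```yaml
- alert: MemoryUsagePerServiceAboveCriticalThreshold
  # Container memory usage vs. the pod memory limit, per service (sketch).
  expr: |
    sum by (service) (container_memory_usage_bytes)
      / sum by (service) (kube_pod_container_resource_limits{resource="memory"})
      * 100 > 90
  labels:
    severity: critical
  annotations:
    summary: 'Memory usage for {{ $labels.service }} service is above 90%'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.24'
```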

8.1.1.38 POD_CONGESTED

Table 8-39 POD_CONGESTED

Field Details
Name in Alert Yaml File PodCongested
Description The pod congestion status is set to congested.
Summary Pod Congestion status of {{$labels.service}} service is congested
Severity Critical
Expression occnp_pod_congestion_state == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.26
Metric Used occnp_pod_congestion_state
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.
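Since the table gives the exact expression, the corresponding rule is straightforward; only the annotation wording and the absence of a `for` duration are assumptions here:

```yaml
- alert: PodCongested
  expr: occnp_pod_congestion_state == 4   # 4 = congested
  labels:
    severity: critical
  annotations:
    summary: 'Pod Congestion status of {{ $labels.service }} service is congested'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.26'
```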

8.1.1.39 POD_DANGER_OF_CONGESTION

Table 8-40 POD_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to Danger of Congestion.
Summary Pod Congestion status of {{$labels.service}} service is DoC
Severity Major
Expression occnp_pod_congestion_state == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.25
Metric Used occnp_pod_congestion_state
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

8.1.1.40 POD_PENDING_REQUEST_CONGESTED

Table 8-41 POD_PENDING_REQUEST_CONGESTED

Field Details
Name in Alert Yaml File PodPendingRequestCongested
Description The pod congestion status is set to congested for PendingRequest.
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for PendingRequest type.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue"} == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.28
Metric Used occnp_pod_resource_congestion_state{type="queue"}
Recommended Actions The alert gets cleared when the number of pending requests in the queue falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.41 POD_PENDING_REQUEST_DANGER_OF_CONGESTION

Table 8-42 POD_PENDING_REQUEST_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to DoC for pending requests.
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for PendingRequest type.
Severity Major
Expression occnp_pod_resource_congestion_state{type="queue"} == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.27
Metric Used occnp_pod_resource_congestion_state{type="queue"}
Recommended Actions The alert gets cleared when the number of pending requests in the queue falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.42 POD_CPU_CONGESTED

Table 8-43 POD_CPU_CONGESTED

Field Details
Name in Alert Yaml File PodCPUCongested
Description The pod congestion status is set to congested for CPU.
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for CPU type.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.30
Metric Used occnp_pod_resource_congestion_state{type="cpu"}
Recommended Actions The alert gets cleared when the system CPU usage falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.43 POD_CPU_DANGER_OF_CONGESTION

Table 8-44 POD_CPU_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to DoC for CPU.
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for CPU type.
Severity Major
Expression occnp_pod_resource_congestion_state{type="cpu"} == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.29
Metric Used occnp_pod_resource_congestion_state{type="cpu"}
Recommended Actions The alert gets cleared when the system CPU usage falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.44 SERVICE_OVERLOADED

Table 8-45 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L1
Summary Overload Level of {{$labels.service}} service is L1
Severity Minor
Expression The overload level of the service is L1.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-46 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L2
Summary Overload Level of {{$labels.service}} service is L2
Severity Major
Expression The overload level of the service is L2.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-47 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L3
Summary Overload Level of {{$labels.service}} service is L3
Severity Critical
Expression The overload level of the service is L3.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.
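All three SERVICE_OVERLOADED tables share the load_level metric and OID; only the overload level and severity change. A sketch, assuming load_level encodes L1, L2, and L3 as the values 1, 2, and 3 (the encoding is an assumption to verify in the shipped rules file):

```yaml
# One rule per overload level; the numeric encoding of L1/L2/L3 is assumed.
- alert: ServiceOverloadedL1
  expr: load_level == 1
  labels: { severity: minor }
- alert: ServiceOverloadedL2
  expr: load_level == 2
  labels: { severity: major }
- alert: ServiceOverloadedL3
  expr: load_level == 3
  labels: { severity: critical }
```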

8.1.1.45 SERVICE_RESOURCE_OVERLOADED

Alerts when service is in overload state due to memory usage

Table 8-48 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-49 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-50 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type.
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to CPU usage

Table 8-51 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-52 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-53 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to number of pending messages

Table 8-54 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-55 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-56 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to number of failed requests

Table 8-57 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type.
Summary {{$labels.service}} service is L1 for {{$labels.type}} type.
Severity Minor
Expression The overload level of the service is L1 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-58 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type.
Summary {{$labels.service}} service is L2 for {{$labels.type}} type.
Severity Major
Expression The overload level of the service is L2 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-59 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type.
Summary {{$labels.service}} service is L3 for {{$labels.type}} type.
Severity Critical
Expression The overload level of the service is L3 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.
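The twelve SERVICE_RESOURCE_OVERLOADED tables all read the same metric and OID; the `type` label selects the triggering resource and the level selects the severity. A sketch of the critical tier, assuming L3 is encoded as the value 3:

```yaml
- alert: ServiceResourceOverloadedCritical
  # type is one of: memory, cpu, svc_pending_count, svc_failure_count.
  expr: service_resource_overload_level{type=~"memory|cpu|svc_pending_count|svc_failure_count"} == 3
  labels:
    severity: critical
  annotations:
    summary: '{{ $labels.service }} service is L3 for {{ $labels.type }} type'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.41'
```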

8.1.1.46 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_CRITICAL_THRESHOLD

Table 8-60 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the critical threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the critical threshold limit for a given Subscriber Notification server
Severity Critical
Expression The number of error responses for a given subscriber notification server exceeds the critical threshold of 1000.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

Table 8-61 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the major threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the major threshold limit for a given Subscriber Notification server
Severity Major
Expression The number of error responses for a given subscriber notification server exceeds the major threshold value, that is, between 750 and 1000.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

Table 8-62 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_MINOR_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the minor threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the minor threshold limit for a given Subscriber Notification server
Severity Minor
Expression The number of error responses for a given subscriber notification server exceeds the minor threshold value, that is, between 500 and 750.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.
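The three notification-error alerts count non-2xx responses against fixed thresholds (500, 750, 1000). A sketch of the critical tier; the counting window and the grouping label (notificationServer is hypothetical) are assumptions:

```yaml
- alert: SubscriberNotificationErrorExceedsCriticalThreshold
  # Non-2xx notification responses per server over an assumed 5m window.
  expr: |
    sum by (notificationServer) (
      increase(http_notification_response_total{responseCode!~"2.*"}[5m])
    ) > 1000
  labels:
    severity: critical
  annotations:
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.42'
```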

8.1.1.47 SYSTEM_IMPAIRMENT_MAJOR

Table 8-63 SYSTEM_IMPAIRMENT_MAJOR

Field Details
Description Major impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 10 minutes.
Summary Major impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 10 minutes.
Severity Major
Expression (db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="standby"})) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="failed"})) or (avg_over_time(db_tier_binlog_used_bytes_percentage[5m])>= 80)
OID 1.3.6.1.4.1.323.5.3.52.1.2.43
Metric Used db_tier_replication_status and db_tier_binlog_used_bytes_percentage
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.48 SYSTEM_IMPAIRMENT_CRITICAL

Table 8-64 SYSTEM_IMPAIRMENT_CRITICAL

Field Details
Description Critical impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 30 minutes.
Summary Critical impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 30 minutes.
Severity Critical
Expression (db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="standby"})) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="failed"})) or (avg_over_time(db_tier_binlog_used_bytes_percentage[5m])>= 80)
OID 1.3.6.1.4.1.323.5.3.52.1.2.43
Metric Used db_tier_replication_status and db_tier_binlog_used_bytes_percentage
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.49 SYSTEM_OPERATIONAL_STATE_PARTIAL_SHUTDOWN

Table 8-65 SYSTEM_OPERATIONAL_STATE_PARTIAL_SHUTDOWN

Field Details
Description System Operational State is now in partial shutdown state.
Summary System Operational State is now in partial shutdown state.
Severity Major
Expression system_operational_state == 2
OID 1.3.6.1.4.1.323.5.3.37.1.2.17
Metric Used system_operational_state
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.50 SYSTEM_OPERATIONAL_STATE_COMPLETE_SHUTDOWN

Table 8-66 SYSTEM_OPERATIONAL_STATE_COMPLETE_SHUTDOWN

Field Details
Description System Operational State is now in complete shutdown state
Summary System Operational State is now in complete shutdown state
Severity Critical
Expression system_operational_state == 3
OID 1.3.6.1.4.1.323.5.3.37.1.2.17
Metric Used system_operational_state
Recommended Actions

For any additional guidance, contact My Oracle Support.
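Both shutdown alerts key off the same gauge; only the compared value and severity differ. As a rule sketch, using the expressions from the tables:

```yaml
- alert: SystemOperationalStatePartialShutdown
  expr: system_operational_state == 2   # 2 = partial shutdown
  labels: { severity: major }
- alert: SystemOperationalStateCompleteShutdown
  expr: system_operational_state == 3   # 3 = complete shutdown
  labels: { severity: critical }
```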

8.1.1.51 TDF_CONNECTION_DOWN

Table 8-67 TDF_CONNECTION_DOWN

Field Details
Description TDF connection is down.
Summary TDF connection is down.
Severity Critical
Expression occnp_diam_conn_app_network{applicationName="Sd"} == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.48
Metric Used occnp_diam_conn_app_network
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.52 DIAM_CONN_PEER_DOWN

Table 8-68 DIAM_CONN_PEER_DOWN

Field Details
Description Diameter connection to peer {{ $labels.peerHost }} is down.
Summary Diameter connection to peer is down.
Severity Major
Expression Diameter connection to the peer (peerHost) in the given namespace is down.
OID 1.3.6.1.4.1.323.5.3.52.1.2.50
Metric Used occnp_diam_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.53 DIAM_CONN_NETWORK_DOWN

Table 8-69 DIAM_CONN_NETWORK_DOWN

Field Details
Description All the diameter network connections are down.
Summary All the diameter network connections are down.
Severity Critical
Expression sum by (kubernetes_namespace)(occnp_diam_conn_network) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.51
Metric Used occnp_diam_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support.
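The table's expression aggregates connection counts per namespace, so the alert fires only when every network-side Diameter connection is down, not when a single peer drops (that case is covered by DIAM_CONN_PEER_DOWN). As a rule sketch:

```yaml
- alert: DiamConnNetworkDown
  # Fires when the per-namespace sum of network connections reaches zero.
  expr: sum by (kubernetes_namespace) (occnp_diam_conn_network) == 0
  labels:
    severity: critical
  annotations:
    summary: 'All the diameter network connections are down'
    oid: '1.3.6.1.4.1.323.5.3.52.1.2.51'
```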

8.1.1.54 DIAM_CONN_BACKEND_DOWN

Table 8-70 DIAM_CONN_BACKEND_DOWN

Field Details
Description All the diameter backend connections are down.
Summary All the diameter backend connections are down.
Severity Critical
Expression sum by (kubernetes_namespace)(occnp_diam_conn_backend) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.52
Metric Used occnp_diam_conn_backend
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.55 PerfInfoActiveOverloadThresholdFetchFailed

Table 8-71 PerfInfoActiveOverloadThresholdFetchFailed

Field Details
Description The application fails to get the current active overload level threshold data.
Summary The application fails to get the current active overload level threshold data.
Severity Major
Expression active_overload_threshold_fetch_failed == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.53
Metric Used active_overload_threshold_fetch_failed
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.56 SLA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-72 SLA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description SLA Sy fail count exceeds the critical threshold limit
Summary SLA Sy fail count exceeds the critical threshold limit
Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.57 SLA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-73 SLA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

SLA Sy fail count exceeds the major threshold limit

Summary

SLA Sy fail count exceeds the major threshold limit

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.58 SLA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-74 SLA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

SLA Sy fail count exceeds the minor threshold limit

Summary

SLA Sy fail count exceeds the minor threshold limit

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.
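The three SLA alerts compute the same failure-rate ratio and differ only in the threshold band. Wrapped as a rule using the critical expression from the table (the `for` duration is an assumption):

```yaml
- alert: SLASyFailCountExceedsCriticalThreshold
  expr: |
    sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m]))
      / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 90
  for: 5m
  labels:
    severity: critical
  annotations:
    oid: '1.3.6.1.4.1.323.5.3.52.1.2.58'
```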

8.1.1.59 STA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-75 STA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the critical threshold limit.

Summary

STA Sy fail count exceeds the critical threshold limit.

Severity Critical
Expression

The failure rate of Sy STA responses is more than 90% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.60 STA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-76 STA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the major threshold limit.

Summary

STA Sy fail count exceeds the major threshold limit.

Severity Major
Expression

The failure rate of Sy STA responses is more than 80% and less than or equal to 90% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.61 STA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-77 STA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the minor threshold limit.

Summary

STA Sy fail count exceeds the minor threshold limit.

Severity Minor
Expression

The failure rate of Sy STA responses is more than 60% and less than or equal to 80% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.62 SMSC_CONNECTION_DOWN

Table 8-78 SMSC_CONNECTION_DOWN

Field Details
Description This alert is triggered when connection to SMSC host is down.
Summary Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Severity Major
Expression sum by(namespace, pod, smscName)(occnp_active_smsc_conn_count) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.63
Metric Used occnp_active_smsc_conn_count
Recommended Actions

Check the connectivity between the notifier service pod(s) and the SMSC host, and ensure that the SMSC peer is reachable.

For any additional guidance, contact My Oracle Support.

8.1.1.63 STA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-79 STA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 90% of the total responses.

Summary

STA Rx fail count exceeds the critical threshold limit.

Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.64 STA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-80 STA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 80% and less than or equal to 90% of the total responses.

Summary

STA Rx fail count exceeds the major threshold limit.

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.65 STA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-81 STA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 60% and less than or equal to 80% of the total responses.

Summary

STA Rx fail count exceeds the minor threshold limit.

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.66 SNA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-82 SNA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 90% of the total responses.

Summary

SNA Sy fail count exceeds the critical threshold limit

Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present.

Check that the session and user have not been removed from the OCS configuration; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.67 SNA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-83 SNA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 80% and less than or equal to 90% of the total responses.

Summary

SNA Sy fail count exceeds the major threshold limit

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between the diam-gw pod(s) and the OCS server, and ensure that connectivity is present.

Check that the session and user have not been removed from the OCS configuration, then configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.68 SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-84 SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 60% and less than or equal to 80% of the total responses.

Summary

SNA Sy fail count exceeds the minor threshold limit

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between the diam-gw pod(s) and the OCS server, and ensure that connectivity is present.

Check that the session and user have not been removed from the OCS configuration, then configure the user(s).

For any additional guidance, contact My Oracle Support.
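The threshold-banded expressions above can be captured as Prometheus alerting rules in the Alertrules.yaml file. The following is an illustrative sketch of the minor-band rule only; the group name, label values, and annotation text are placeholders, not the shipped configuration:

```yaml
groups:
  - name: policy-sy-sna-alerts        # illustrative group name
    rules:
      - alert: SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD
        # The band (60%, 80%] is expressed by repeating the failure
        # ratio and joining the two comparisons with "and".
        expr: >
          sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m]))
            / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 60
          and
          sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m]))
            / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 80
        labels:
          severity: minor             # placeholder label value
        annotations:
          summary: "SNA Sy fail count exceeds the minor threshold limit"
```

The major and critical rules follow the same shape with the bounds shifted to (80%, 90%] and (90%, 100%].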

8.1.1.69 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Table 8-85 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Field Details
Description This alert is triggered when more than 10% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 10% of the Diameter requests are being discarded due to processing timeouts.
Severity Minor
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.70 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Table 8-86 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 20% of the Diameter requests are being discarded due to processing timeouts.
Severity Major
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.71 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Table 8-87 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 30% of the Diameter requests are being discarded due to processing timeouts.
Severity Critical
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.72 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MINOR

Table 8-88 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MINOR

Field Details
Description Certificate expiry in less than 6 months.
Summary Certificate expiry in less than 6 months.
Severity Minor
Expression dgw_tls_cert_expiration_seconds - time() <= 15724800
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.73 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MAJOR

Table 8-89 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MAJOR

Field Details
Description Certificate expiry in less than 3 months.
Summary Certificate expiry in less than 3 months.
Severity Major
Expression dgw_tls_cert_expiration_seconds - time() <= 7862400
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.74 DIAM_GATEWAY_CERTIFICATE_EXPIRY_CRITICAL

Table 8-90 DIAM_GATEWAY_CERTIFICATE_EXPIRY_CRITICAL

Field Details
Description Certificate expiry in less than 1 month.
Summary Certificate expiry in less than 1 month.
Severity Critical
Expression dgw_tls_cert_expiration_seconds - time() <= 2592000
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
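The expiry thresholds in these certificate-expiry expressions are plain second counts comparing the certificate's expiration timestamp against the current time. As a quick sanity check, the three values correspond to 182 days (about 6 months), 91 days (about 3 months), and 30 days. A minimal sketch verifying the arithmetic:

```python
# Certificate-expiry alert thresholds, expressed in seconds, as used in
# the expressions `dgw_tls_cert_expiration_seconds - time() <= <threshold>`.
SECONDS_PER_DAY = 86400

thresholds = {
    "minor (about 6 months, 182 days)": 182 * SECONDS_PER_DAY,  # 15724800
    "major (about 3 months, 91 days)": 91 * SECONDS_PER_DAY,    # 7862400
    "critical (1 month, 30 days)": 30 * SECONDS_PER_DAY,        # 2592000
}

for name, seconds in thresholds.items():
    print(f"{name}: {seconds}")
```
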

8.1.1.75 DGW_TLS_CONNECTION_FAILURE

Table 8-91 DGW_TLS_CONNECTION_FAILURE

Field Details
Description Alert for TLS connection establishment.
Summary TLS Connection failure when Diam gateway is an initiator.
Severity Major
Expression sum by (namespace,reason)(occnp_diam_failed_conn_network) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.81
Metric Used occnp_diam_failed_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.76 POLICY_CONNECTION_FAILURE

Table 8-92 POLICY_CONNECTION_FAILURE

Field Details
Description Connection failure on Egress and Ingress Gateways for incoming and outgoing connections.
Summary Connection failure on Egress and Ingress Gateways for incoming and outgoing connections.
Severity Major
Expression sum(increase(occnp_oc_ingressgateway_connection_failure_total[5m]) >0 or (occnp_oc_ingressgateway_connection_failure_total unless occnp_oc_ingressgateway_connection_failure_total offset 5m )) by (namespace,app, error_reason) > 0

or

sum(increase(occnp_oc_egressgateway_connection_failure_total[5m]) >0 or (occnp_oc_egressgateway_connection_failure_total unless occnp_oc_egressgateway_connection_failure_total offset 5m )) by (namespace,app, error_reason) > 0

OID 1.3.6.1.4.1.323.5.3.52.1.2.76
Metric Used occnp_oc_ingressgateway_connection_failure_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.77 AUDIT_NOT_RUNNING

Table 8-93 AUDIT_NOT_RUNNING

Field Details
Description Audit has not been running for at least 1 hour.
Summary Audit has not been running for at least 1 hour.
Severity CRITICAL
Expression (absent_over_time(data_repository_invocations_seconds_count{method="getQueuedTablesToAudit"}[1h]) == 1) OR (sum(increase(data_repository_invocations_seconds_count{method="getQueuedTablesToAudit"}[1h])) == 0)
OID 1.3.6.1.4.1.323.5.3.52.1.2.78
Metric Used data_repository_invocations_seconds_count
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.78 DIAMETER_POD_ERROR_RESPONSE_MINOR

Table 8-94 DIAMETER_POD_ERROR_RESPONSE_MINOR

Field Details
Description At least 1% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Summary At least 1% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Severity MINOR
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=1
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.79 DIAMETER_POD_ERROR_RESPONSE_MAJOR

Table 8-95 DIAMETER_POD_ERROR_RESPONSE_MAJOR

Field Details
Description At least 5% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Summary At least 5% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Severity MAJOR
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=5
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.80 DIAMETER_POD_ERROR_RESPONSE_CRITICAL

Table 8-96 DIAMETER_POD_ERROR_RESPONSE_CRITICAL

Field Details
Description At least 10% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER
Summary At least 10% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER
Severity CRITICAL
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=10
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.81 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Table 8-97 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsCriticalThreshold
Description The count of lock requests that fail to acquire the lock exceeds the critical threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 75 percent of total transactions.
Severity Critical
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >=75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, more than 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 75%.
8.1.1.82 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Table 8-98 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMajorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the major threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 50 percent of total transactions.
Severity Major
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 50 and (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 50% and 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 50%. If the rate exceeds 75%, a higher-severity alert is triggered.
8.1.1.83 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Table 8-99 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMinorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the minor threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 20 percent of total transactions.
Severity Minor
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20 and (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 20% and 50% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 20%. If the rate exceeds 50%, a higher-severity alert is triggered.
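PromQL has no chained comparison syntax, so a bounded threshold band such as the major and minor lock alerts above is written by repeating the failure ratio and joining the two comparisons with `and`. The following Alertrules.yaml sketch illustrates the minor-band rule; the group name, label values, and annotation text are placeholders, not the shipped configuration:

```yaml
groups:
  - name: bulwark-lock-alerts         # illustrative group name
    rules:
      - alert: LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD
        # Band [20%, 50%): the same per-namespace failure ratio is
        # evaluated twice, once per bound, and joined with "and".
        expr: >
          (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
            / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20
          and
          (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
            / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 50
        labels:
          severity: minor             # placeholder label value
        annotations:
          summary: "Bulwark lock acquisition failure rate above 20 percent of total transactions"
```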
8.1.1.84 CERTIFICATE_EXPIRY_MINOR

Table 8-100 CERTIFICATE_EXPIRY_MINOR

Field Details
Description Certificate expiry in less than 6 months
Summary Certificate expiry in less than 6 months
Severity MINOR
Expression security_cert_x509_expiration_seconds - time() <= 15724800
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.85 CERTIFICATE_EXPIRY_MAJOR

Table 8-101 CERTIFICATE_EXPIRY_MAJOR

Field Details
Description Certificate expiry in less than 3 months
Summary Certificate expiry in less than 3 months
Severity MAJOR
Expression security_cert_x509_expiration_seconds - time() <= 7862400
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.86 CERTIFICATE_EXPIRY_CRITICAL

Table 8-102 CERTIFICATE_EXPIRY_CRITICAL

Field Details
Description Certificate expiry in less than 1 month
Summary Certificate expiry in less than 1 month
Severity CRITICAL
Expression security_cert_x509_expiration_seconds - time() <= 2592000
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.87 PERF_INFO_ACTIVE_OVERLOADTHRESHOLD_DATA_PRESENT

Table 8-103 PERF_INFO_ACTIVE_OVERLOADTHRESHOLD_DATA_PRESENT

Field Details
Description -
Summary -
Severity MINOR
Expression active_overload_threshold_fetch_failed == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.53
Metric Used -
Recommended Actions -
8.1.1.88 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-104 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description More than 10% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 10% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity MINOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.89 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-105 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description More than 20% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 20% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.90 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-106 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description More than 30% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 30% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity CRITICAL
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.91 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-107 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description More than 10% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 10% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity MINOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.92 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-108 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description More than 20% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 20% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.93 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-109 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description More than 30% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 30% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity CRITICAL
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.94 EGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Table 8-110 EGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Field Details
Description Policy Egress Gateway Data Director unreachable for {{$labels.namespace}}.
Summary kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} Policy Egress Gateway Data Director unreachable
Severity Major
Expression sum(oc_egressgateway_dd_unreachable) by(namespace,container) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.84
Metric Used oc_egressgateway_dd_unreachable
Recommended Actions Alert gets cleared automatically when the connection with data director is established.
8.1.1.95 INGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Table 8-111 INGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Field Details
Description Policy Ingress Gateway Data Director unreachable for {{$labels.namespace}}.
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} Policy Ingress Gateway Data Director unreachable'
Severity Major
Expression sum(oc_ingressgateway_dd_unreachable) by(namespace,container) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.83
Metric Used oc_ingressgateway_dd_unreachable
Recommended Actions Alert gets cleared automatically when the connection with data director is established.
8.1.1.96 STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-112 STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.97 STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-113 STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.98 STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-114 STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.99 STALE_BINDING_REQUEST_REJECTION_CRITICAL

Table 8-115 STALE_BINDING_REQUEST_REJECTION_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 30% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Critical
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total{microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.100 STALE_BINDING_REQUEST_REJECTION_MAJOR

Table 8-116 STALE_BINDING_REQUEST_REJECTION_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 20% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Major
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.101 STALE_BINDING_REQUEST_REJECTION_MINOR

Table 8-117 STALE_BINDING_REQUEST_REJECTION_MINOR

Field Details
Description This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 10% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Minor
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"} [5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
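In the stale-binding expressions above, late-arrival rejections are added to the denominator as well as the numerator because such requests are rejected before they are counted as inbound. A minimal sketch of the ratio with illustrative numbers (the function name and counts are hypothetical, chosen only to mirror the PromQL):

```python
# Illustrative stale-binding rejection ratio, mirroring the PromQL:
# (late_processing + late_arrival) / (inbound + late_arrival) * 100
def stale_rejection_percent(late_processing: float,
                            late_arrival: float,
                            inbound: float) -> float:
    """Percentage of binding requests discarded as stale."""
    return (late_processing + late_arrival) / (inbound + late_arrival) * 100

# Example: 6 late-processing and 4 late-arrival rejections against 90
# inbound requests gives (6 + 4) / (90 + 4) * 100, about 10.64%, which
# crosses the 10% minor threshold.
print(round(stale_rejection_percent(6, 4, 90), 2))
```
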
8.1.1.102 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-118 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions -
8.1.1.103 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-119 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.104 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-120 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.105 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-121 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.106 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-122 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.107 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-123 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.108 UPDATE_NOTIFY_TIMEOUT_ABOVE_70_PERCENT

Table 8-124 UPDATE_NOTIFY_TIMEOUT_ABOVE_70_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 70% in a given time period.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.109 UPDATE_NOTIFY_TIMEOUT_ABOVE_50_PERCENT

Table 8-125 UPDATE_NOTIFY_TIMEOUT_ABOVE_50_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 50% but less than 70% in a given time period.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.110 UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT

Table 8-126 UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
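The three UPDATE_NOTIFY_TIMEOUT alerts share one failure ratio and differ only in thresholds. In PromQL, a chained comparison such as `>= 30 < 50` works by filtering: the first comparison drops samples below 30, and the second drops the survivors at or above 50, leaving only the [30, 50) band. A minimal sketch of the Minor tier as a rule entry (rule layout assumed; the expression is taken from the table above):

```yaml
# Sketch of the Minor tier of the UPDATE_NOTIFY timeout alerts.
# ">= 30 < 50" keeps only samples in the [30, 50) band; the Major and
# Critical tiers apply ">= 50 < 70" and ">= 70" to the same ratio.
- alert: UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT
  expr: |
    (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m]))
     / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])))
    * 100 >= 30 < 50
  labels:
    severity: minor
```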

8.1.1.111 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MINOR

Table 8-127 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MINOR

Field Details
Description This alert is raised when more than 30% and up to 50% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 30% and up to 50% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 30 <= 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.112 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MAJOR

Table 8-128 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MAJOR

Field Details
Description This alert is raised when more than 50% and up to 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 50% and up to 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 50 <= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.113 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_CRITICAL

Table 8-129 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_CRITICAL

Field Details
Description This alert is raised when more than 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.114 POLICYDS_EXPIRED_SUBSCRIPTION

Table 8-130 POLICYDS_EXPIRED_SUBSCRIPTION

Field Details
Description This alert is raised when more than 10% of the audited subscriptions are expired.
Summary This alert is raised when more than 10% of the audited subscriptions are expired.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_request_total{expiryStatus="EXPIRED"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_request_total[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.130
Metric Used occnp_policy_data_resubscription_request_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.115 LDAP_PEER_CONNECTION_LOST

Table 8-131 LDAP_PEER_CONNECTION_LOST

Field Details
Name in Alert Yaml File LDAP_PEER_CONNECTION_LOST
Description This alert is triggered when the LDAP Gateway loses its connection to one or more LDAP peers, that is, when the value of the occnp_ldap_conn_total metric falls to zero. Reconnection attempts and alert clearance are governed by the LDAP_CONNECTION_REVERT_DELAY configuration parameter.
Summary LDAP Gateway loses connection to its LDAP peer(s).
Severity Major
Expression sum by (namespace,peer)(occnp_ldap_conn_total) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.113
Metric Used occnp_ldap_conn_total
Recommended Actions
  • Verify that the LDAP server is running and connectivity between the PCF and LDAP peers is available.
  • If LDAP is reachable, check the configured LDAP_CONNECTION_REVERT_DELAY value since reconnection attempts and alert clearance depend on this setting.
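Because the expression aggregates with `sum by (namespace, peer)`, the `peer` label is preserved and the alert fires once per disconnected peer. A minimal rule sketch follows; the `for` duration is an assumption (tune it with the LDAP_CONNECTION_REVERT_DELAY behavior in mind), not a documented value:

```yaml
# Sketch of the LDAP_PEER_CONNECTION_LOST rule. Fires per LDAP peer
# whose connection count has dropped to zero in a namespace.
- alert: LDAP_PEER_CONNECTION_LOST
  expr: sum by (namespace, peer) (occnp_ldap_conn_total) == 0
  for: 1m                      # assumed; reconnection cadence depends on LDAP_CONNECTION_REVERT_DELAY
  labels:
    severity: major
  annotations:
    summary: 'LDAP Gateway lost connection to LDAP peer {{ $labels.peer }}'
```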
8.1.1.116 IGW_POD_PROTECTION_DOC_STATE

Table 8-132 IGW_POD_PROTECTION_DOC_STATE

Field Details
Description The Ingress Gateway is in Danger_of_Congestion Level for the pod {{$labels.pod}} in namespace {{$labels.namespace}} ( current congestion level: {{ $value }} % )
Summary Ingress Gateway pod congestion state in Danger_of_Congestion Level.
Severity Minor
Expression oc_ingressgateway_congestion_system_state{microservice=~".*ingress-gateway"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.123
Metric Used oc_ingressgateway_congestion_system_state
Recommended Actions

The alert is cleared when the pod CPU consumption drops below the configured abatement value for the DOC level.

8.1.1.117 IGW_POD_PROTECTION_CONGESTED_STATE

Table 8-133 IGW_POD_PROTECTION_CONGESTED_STATE

Field Details
Description The Ingress Gateway is in Congested Level for the pod {{$labels.pod}} in namespace {{$labels.namespace}} ( current congestion level: {{ $value }} % )
Summary Ingress Gateway pod congestion state in Congested level.
Severity Critical
Expression sum(oc_ingressgateway_congestion_system_state{app_kubernetes_io_name="occnp-ingress-gateway"}) by (pod) == 4
OID 1.3.6.1.4.1.323.5.3.52.1.2.123
Metric Used oc_ingressgateway_congestion_system_state
Recommended Actions The alert is cleared when the pod CPU consumption drops below the configured abatement value for the Congested level.
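The two Ingress Gateway pod-protection alerts key off discrete values of the same congestion-state gauge: value 1 maps to the Danger_of_Congestion level and value 4 to the Congested level. A sketch of both rules side by side (rule layout assumed; the expressions are taken from the tables above):

```yaml
# Pod-protection alerts driven by the congestion-state gauge:
# value 1 = Danger_of_Congestion (DOC), value 4 = Congested.
- alert: IGW_POD_PROTECTION_DOC_STATE
  expr: oc_ingressgateway_congestion_system_state{microservice=~".*ingress-gateway"} == 1
  labels:
    severity: minor
- alert: IGW_POD_PROTECTION_CONGESTED_STATE
  expr: sum(oc_ingressgateway_congestion_system_state{app_kubernetes_io_name="occnp-ingress-gateway"}) by (pod) == 4
  labels:
    severity: critical
```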

8.1.2 PCF Alerts

This section provides information on PCF alerts.

8.1.2.1 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-134 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description UDR returning with POST subscribe response but without user data for SM as part of immediate reporting occurring above 10% for service {{$labels.microservice}} in {{$labels.namespace}} ( current value: {{ $value }} % )
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (indicates the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no SM user data was returned from UDR as part of the Immediate Reporting capability)

    • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without SM user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.
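The check described above is a ratio of `imm_reports_present="false"` responses to all POST sm-data responses. A hedged sketch of how the Minor rule might appear in Alertrules.yaml (rule layout assumed; the expression is taken from the table above):

```yaml
# Sketch: share of POST sm-data subscribe responses that arrived
# without SM user data (imm_reports_present="false"), per microservice
# and namespace. Major and Critical tiers raise the thresholds.
- alert: UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR
  expr: |
    (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m])))
    / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m])))
    * 100 >= 10 < 20
  labels:
    severity: minor
```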

8.1.2.2 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-135 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no SM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without user data for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and still no SM user data is retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.3 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-136 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no SM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without user data for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in the request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.4 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-137 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.5 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-138 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in the request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.6 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-139 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (indicates the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • immediate_report_pcc = "false" (indicates that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 when converted to hex (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.7 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR

Table 8-140 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR

Field Details
Description The Diameter requests are being discarded due to timeout processing occurring above 10% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 10% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Minor
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 10% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and check timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.
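With the label matcher written as a quoted string, as PromQL requires (`microservice="diam-connector"`), the Minor rule could be expressed as the following sketch (rule layout assumed):

```yaml
# Sketch of the Minor stale-request rule for the Diameter Connector.
# Label matcher values must be quoted strings in PromQL; DWR and CER
# messages are excluded from the request total.
- alert: STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR
  expr: |
    (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m])))
    / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER",microservice="diam-connector"}[5m])))
    * 100 >= 10
  labels:
    severity: minor
```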

8.1.2.8 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MAJOR

Table 8-141 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MAJOR

Field Details
Description The Diameter requests are being discarded due to timeout processing occurring above 20% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 20% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Major
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 20% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and adjust the timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.

8.1.2.9 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_CRITICAL

Table 8-142 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_CRITICAL

Field Details
Description Diameter requests are being discarded due to timeout processing above 30% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 30% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Critical
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 30% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and adjust the timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.

8.1.2.10 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_CRITICAL_THRESHOLD

Table 8-143 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary 70% or more of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Critical
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.11 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MAJOR_THRESHOLD

Table 8-144 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary Between 50% and 70% of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Major
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 50 < 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.12 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MINOR_THRESHOLD

Table 8-145 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MINOR_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary Between 30% and 50% of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Minor
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 30 < 50

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.13 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_CRITICAL_THRESHOLD

Table 8-146 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary 70% or more of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Critical
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.14 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MAJOR_THRESHOLD

Table 8-147 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary Between 50% and 70% of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Major
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 50 < 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.15 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MINOR_THRESHOLD

Table 8-148 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MINOR_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary Between 30% and 50% of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Minor
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 30 < 50

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.16 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-149 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 10% but less than 50% of total PA creates.
Severity Minor
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 10 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.17 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-150 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 50% but less than 75% of total PA creates.
Severity Major
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 50 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.18 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-151 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 75% of total PA creates.
Severity Critical
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
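
When investigating these lookup-failure alerts, an operator can evaluate the documented expression manually through the standard Prometheus HTTP API (`/api/v1/query`). A small sketch of building that instant query; the in-cluster Prometheus base URL below is an assumption for your deployment:

```python
# Sketch: checking the N7 optimized lookup failure rate manually through the
# standard Prometheus HTTP API. The base URL is a hypothetical in-cluster
# address; the query string mirrors the documented expression.
from urllib.parse import urlencode

PROMQL = (
    'sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total'
    '{status="failed"}[30m])) / '
    'sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100'
)

def build_query_url(base_url: str) -> str:
    """Return the instant-query URL for the failure-rate expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': PROMQL})}"

# Example (hypothetical address); fetch with any HTTP client and read
# .data.result[].value[1] from the JSON response.
print(build_query_url("http://prometheus.occne-infra:9090").split("?")[0])
```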

8.1.2.19 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MINOR

Table 8-152 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MINOR

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 10% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}
Severity Minor
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.20 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MAJOR

Table 8-153 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MAJOR

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 20% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}
Severity Major
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.21 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_CRITICAL

Table 8-154 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_CRITICAL

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 30% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}.
Severity Critical
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.22 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD

Table 8-155 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD
Description More than 70% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 70% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 70% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 70% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure", errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.23 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD

Table 8-156 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD
Description More than 80% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 80% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 80% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 80% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure",errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.24 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD

Table 8-157 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 90% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 90% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 90% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure",errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.25 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD

Table 8-158 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD
Description More than 70% of the timer capacity has been occupied for AMF discovery.
Summary More than 70% of the timer capacity has been occupied for AMF discovery.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to AMF discovery reaches 70% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for NRF discovery failures and, if needed, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 70% of the timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a backoff, and creates a timer. This alert is triggered when the capacity of timers corresponding to AMF discovery exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact My Oracle Support.

8.1.2.26 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD

Table 8-159 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD
Description More than 80% of the timer capacity has been occupied for AMF discovery.
Summary More than 80% of the timer capacity has been occupied for AMF discovery.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to AMF discovery reaches 80% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for NRF discovery failures and, if needed, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 80% of the timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a backoff, and creates a timer. This alert is triggered when the capacity of timers corresponding to AMF discovery exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact My Oracle Support.

8.1.2.27 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD

Table 8-160 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of the timer capacity has been occupied for AMF discovery.
Summary More than 90% of the timer capacity has been occupied for AMF discovery.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to AMF discovery reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures with NRF discovery and, if required, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 90% of timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a back-off, and creates a timer. This alert is triggered when the timer capacity corresponding to AMF discovery exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact the Support team.
8.1.2.28 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD

Table 8-161 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD
Description More than 70% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 70% of timer capacity has been occupied for n1n2 subscribe.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 70% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 70% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.

  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.

  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.

  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.

  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.

  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.29 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD

Table 8-162 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD
Description More than 80% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 80% of timer capacity has been occupied for n1n2 subscribe.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 80% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 80% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.

  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.

  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.

  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.

  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.

  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.30 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD

Table 8-163 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 90% of timer capacity has been occupied for n1n2 subscribe.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 90% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.
  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.
  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.
  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.
  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.
  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.31 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD

Table 8-164 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD
Description More than 70% of timer capacity has been occupied for n1n2 transfer.
Summary More than 70% of timer capacity has been occupied for n1n2 transfer.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 70% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 70% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.32 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD

Table 8-165 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD
Description More than 80% of timer capacity has been occupied for n1n2 transfer.
Summary More than 80% of timer capacity has been occupied for n1n2 transfer.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 80% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 80% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.33 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD

Table 8-166 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of timer capacity has been occupied for n1n2 transfer.
Summary More than 90% of timer capacity has been occupied for n1n2 transfer.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 90% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.34 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-167 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of n1n2 subscribe reattempt failed.
Summary More than 25% of n1n2 subscribe reattempt failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 25% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
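The reattempt-failure alerts all evaluate the same ratio: failed (non-2xx) reattempts divided by total reattempts over a 5-minute window, expressed as a percentage and compared against the 25/50/75 thresholds. The following sketch is illustrative only (it is not product code; the function and sample data are assumptions for clarity), showing how a window of reattempt response counts maps to the Minor, Major, or Critical severity used by these alerts:

```python
# Illustrative sketch (not product code): mirrors the alert ratio
#   failed reattempts / total reattempts * 100
# that Prometheus computes over a 5-minute window with increase(),
# using the 25/50/75 thresholds of the MINOR/MAJOR/CRITICAL
# reattempt-failure alerts.

def reattempt_failure_severity(responses):
    """responses: (responseCode, count) pairs for reattempted requests."""
    total = sum(count for _, count in responses)
    if total == 0:
        return "OK"  # no reattempts in the window, nothing to alert on
    # Any response code outside the 2xx range counts as a failure,
    # matching the responseCode!~"2.*" label matcher in the expression.
    failed = sum(count for code, count in responses
                 if not code.startswith("2"))
    pct = failed / total * 100
    if pct > 75:
        return "CRITICAL"
    if pct > 50:
        return "MAJOR"
    if pct > 25:
        return "MINOR"
    return "OK"

# Example window: 40 failures out of 100 reattempts is a 40% failure
# rate, which crosses the 25% Minor threshold but not the Major one.
sample = [("201", 60), ("504", 25), ("503", 15)]
print(reattempt_failure_severity(sample))
```

In the sample window, 40 of 100 reattempts failed (40%), so only the Minor-threshold alert would fire.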
8.1.2.35 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-168 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of n1n2 subscribe reattempt failed.
Summary More than 50% of n1n2 subscribe reattempt failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 50% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
8.1.2.36 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-169 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of n1n2 subscribe reattempt failed.
Summary More than 75% of n1n2 subscribe reattempt failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 75% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
8.1.2.37 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-170 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of n1n2 transfer reattempt failed.
Summary More than 25% of n1n2 transfer reattempt failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 transfer has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 message transfer is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 25% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.38 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-171 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of N1N2 transfer reattempts failed.
Summary More than 50% of N1N2 transfer reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert notifies the operator when a certain percentage of UE N1N2 transfer reattempts fail. If failures increase, the operator can review why the flow triggering the N1N2 message transfer is failing, or whether the AMF receiving the requests is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 50% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.39 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-172 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of N1N2 transfer reattempts failed.
Summary More than 75% of N1N2 transfer reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert notifies the operator when a certain percentage of UE N1N2 transfer reattempts fail. If failures increase, the operator can review why the flow triggering the N1N2 message transfer is failing, or whether the AMF receiving the requests is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 75% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.40 SM_STALE_REQUEST_PROCESSING_REJECT_MINOR

Table 8-173 SM_STALE_REQUEST_PROCESSING_REJECT_MINOR

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_MINOR
Description More than 10% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 10% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Minor
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 10 < 20

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 10% or more (but less than 20%) of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.
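The staleness computation described above can be sketched as follows. This is a minimal illustration, not product code: the header values are taken from the example above, and the function name is hypothetical.

```python
# Minimal sketch of the stale-request check, assuming the header semantics
# described above: a request is stale once 'now' passes
# sender timestamp + maximum response time.
from datetime import datetime, timedelta, timezone

def is_stale(sender_timestamp: str, max_rsp_time_ms: str, now: datetime) -> bool:
    """True when 'now' is past sender timestamp + max response time."""
    sent = datetime.fromisoformat(sender_timestamp.replace("Z", "+00:00"))
    deadline = sent + timedelta(milliseconds=int(max_rsp_time_ms))
    return now > deadline

# Header values from the example: the request goes stale at 09:48:04.000Z.
checked_at = datetime(2025, 11, 3, 9, 48, 5, tzinfo=timezone.utc)
print(is_stale("2025-11-03T09:48:01.000Z", "3000", checked_at))  # True
```

A check performed before the deadline (for example at 09:48:03Z) would return False and the request would be processed normally.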

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.41 SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Table 8-174 SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR
Description More than 20% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 20% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Major
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 20 < 30

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 20% or more (but less than 30%) of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.42 SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Table 8-175 SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL
Description More than 30% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 30% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Critical
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 30

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 30% or more of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.43 UE_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Table 8-176 UE_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Field Details
Description This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Major
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 20% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.44 UE_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Table 8-177 UE_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Critical
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 30% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.45 UE_STALE_REQUEST_PROCESSING_REJECT_MINOR

Table 8-178 UE_STALE_REQUEST_PROCESSING_REJECT_MINOR

Field Details
Description This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Minor
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 10% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.46 UE_STALE_REQUEST_ARRIVAL_REJECT_MINOR

Table 8-179 UE_STALE_REQUEST_ARRIVAL_REJECT_MINOR

Field Details
Description This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 10% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.
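The arrival-time check above, including the fallback from the sender timestamp to the origination timestamp, can be sketched as follows. This is illustrative only; the function name and the header-dictionary shape are assumptions, not product code.

```python
# Hedged sketch of the arrival-time staleness check described above:
# prefer 3gpp-Sbi-Sender-Timestamp, fall back to
# 3gpp-Sbi-Origination-Timestamp, and treat the request as stale when it
# arrives later than timestamp + 3gpp-Sbi-Max-Rsp-Time.
from datetime import datetime, timedelta, timezone

def arrival_is_stale(headers: dict, arrival: datetime) -> bool:
    """True when the request arrived after timestamp + max response time."""
    stamp = (headers.get("3gpp-Sbi-Sender-Timestamp")
             or headers.get("3gpp-Sbi-Origination-Timestamp"))
    max_rsp_ms = headers.get("3gpp-Sbi-Max-Rsp-Time")
    if stamp is None or max_rsp_ms is None:
        return False  # required headers absent: no staleness check performed
    sent = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return arrival > sent + timedelta(milliseconds=int(max_rsp_ms))

# Header values from the example: the deadline is 09:48:04.000Z.
headers = {"3gpp-Sbi-Sender-Timestamp": "2025-11-03T09:48:01.000Z",
           "3gpp-Sbi-Max-Rsp-Time": "3000"}
late_arrival = datetime(2025, 11, 3, 9, 48, 6, tzinfo=timezone.utc)
print(arrival_is_stale(headers, late_arrival))  # True
```

A request arriving before the deadline (for example at 09:48:02Z) would not be counted in the metric.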

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
    • Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.47 UE_STALE_REQUEST_ARRIVAL_REJECT_MAJOR

Table 8-180 UE_STALE_REQUEST_ARRIVAL_REJECT_MAJOR

Field Details
Description This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Major
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 20% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
    • Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.48 UE_STALE_REQUEST_ARRIVAL_REJECT_CRITICAL

Table 8-181 UE_STALE_REQUEST_ARRIVAL_REJECT_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 30% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.49 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-182 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of N1N2 transfer failure notification reattempts failed.
Summary More than 75% of N1N2 transfer failure notification reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 75% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).
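The ratio the alert expression computes can be illustrated with a small sketch. This is a hypothetical helper, not Policy code; it mirrors the increase-of-non-2xx-responses over increase-of-requests calculation from the Expression field.

```python
def reattempt_failure_pct(response_counts, requests_total):
    """Percentage of reattempt responses that are non-2xx over the window.

    response_counts maps responseCode -> increase over the window (e.g. 5m);
    requests_total is the matching http_out_conn_request_total increase.
    """
    if requests_total == 0:
        return 0.0
    failures = sum(n for code, n in response_counts.items() if not code.startswith("2"))
    return 100 * failures / requests_total

# 8 of 10 reattempts failed -> 80%, above the 75% critical threshold.
pct = reattempt_failure_pct({"200": 2, "504": 5, "503": 3}, 10)
print(pct, pct > 75)  # 80.0 True
```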

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery :

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.50 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-183 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of N1N2 transfer failure notification reattempts failed.
Summary More than 50% of N1N2 transfer failure notification reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 50% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery:

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.51 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-184 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of N1N2 transfer failure notification reattempts failed.
Summary More than 25% of N1N2 transfer failure notification reattempts failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 25% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery:

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.52 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-185 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of AMF discovery reattempts failed.
Summary More than 75% of AMF discovery reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.53 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-186 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of AMF discovery reattempts failed.
Summary More than 50% of AMF discovery reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.54 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-187 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of AMF discovery reattempts failed.
Summary More than 25% of AMF discovery reattempts failed.
Severity Minor
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.55 INGRESS_ERROR_RATE_ABOVE_10_PERCENT_PER_POD

Table 8-188 INGRESS_ERROR_RATE_ABOVE_10_PERCENT_PER_POD

Field Details
Name in Alert Yaml File IngressErrorRateAbove10PercentPerPod
Description Ingress Error Rate above 10 Percent in {{$labels.kubernetes_name}} in {{$labels.kubernetes_namespace}}
Summary Transaction Error Rate in {{$labels.kubernetes_node}} (current value is: {{ $value }})
Severity Critical
Expression (sum by(pod)(rate(ocpm_ingress_response_total{response_code!~"2.*"}[24h]) or (up * 0)) / sum by(pod)(rate(ocpm_ingress_response_total[24h]))) * 100 >= 10

OID 1.3.6.1.4.1.323.5.3.52.1.2.2
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors.
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

Cause

This alert fires when 10% or more of ingress (incoming) HTTP requests handled by any individual pod result in non-2xx (unsuccessful) responses, measured over a 1-day window. A high ingress error rate per pod suggests issues that could impact application availability, reliability, or user experience.

Common causes include:

  • Application-level errors (returning 4xx or 5xx status codes) due to bugs, configuration issues, invalid client requests, or backend failures
  • Resource exhaustion (CPU, memory, open connections) or saturation within the affected pod
  • Dependency failures (database, cache, or external service outages), causing the pod to respond with errors
  • Recent deployments, rollouts, or configuration changes introducing regressions or incompatibilities
  • Network problems or timeouts impacting request processing
  • Unhandled exceptions or circuit breaker activations
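The `or (up * 0)` clause in the alert expression substitutes a zero error rate for pods that have produced no error samples, so healthy pods still yield a valid (zero) ratio instead of a missing series. A rough Python equivalent of the per-pod computation (illustrative only, not Policy code):

```python
def pod_error_rate_pct(error_rates, total_rates):
    """Per-pod error percentage; a pod with no error series counts as 0 errors,
    mirroring the `or (up * 0)` fallback in the alert expression."""
    result = {}
    for pod, total in total_rates.items():
        if total == 0:
            continue  # no traffic for this pod, nothing to report
        errors = error_rates.get(pod, 0.0)  # missing error series -> treated as 0
        result[pod] = 100 * errors / total
    return result

# pod-a: 1.5 err/s out of 10 req/s -> 15% (fires the >= 10% alert);
# pod-b served traffic but emitted no error samples -> 0%.
print(pod_error_rate_pct({"pod-a": 1.5}, {"pod-a": 10.0, "pod-b": 8.0}))
```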

Diagnostic Information

  • Identify affected pods from alert labels
  • Review pod logs to categorize errors by type (4xx client errors, 5xx server errors, timeouts, etc.)
  • Correlate errors with spikes in traffic, resource usage, or specific endpoints
  • Examine resource utilization and health metrics (CPU, memory, connection pools, thread pools)
  • Check readiness/liveness probe status and pod restart history
  • Review changes in deployments, configurations, or dependencies preceding the alert
  • Investigate for signs of dependency issues, cascading failures, or external API problems

Recovery

  • Isolate and address root cause: Use logs, error breakdowns, and metrics to determine if issues are within the pod, code, dependencies, or external factors
  • Rollback if needed: If problems started following a recent deployment or config change, consider reverting
  • Increase resources or scale out: Add capacity if the pod is resource-constrained
  • Fix code or configuration: Resolve bugs, correct misconfigurations, or address unhandled cases
  • Remediate downstream/third-party issues: Work with owners of failing dependencies if external
Alert resolution: The alert will auto-resolve when the pod’s ingress error rate falls below 10% for the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.56 SM_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-189 SM_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File SMTrafficRateAboveThreshold
Description SM service Ingress traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total SM service Ingress traffic rate has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in PCF_Alertrules.yaml file is when SM service Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.3
Metric Used ocpm_ingress_request_total{servicename_3gpp="npcf-smpolicycontrol"}
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

Cause:

The metric ocpm_ingress_request_total is incremented for every inbound HTTP request reaching the SM service with the dimension servicename_3gpp="npcf-smpolicycontrol". If the 2-minute average rate exceeds 900 MPS, the system may be experiencing an overload or an abnormal spike in traffic.
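Prometheus derives this rate from successive samples of the counter; the arithmetic can be sketched as follows (a simplification: the real `rate()` function also compensates for counter resets):

```python
def counter_rate(earlier, later, window_seconds):
    """Per-second rate of a monotonically increasing counter between two samples.

    Simplified: assumes no counter reset occurred inside the window.
    """
    return (later - earlier) / window_seconds

# ocpm_ingress_request_total rose by 114,000 over a 2-minute window:
mps = counter_rate(1_000_000, 1_114_000, 120)
print(mps, mps > 900)  # 950.0 True -> above the 900 MPS threshold
```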

Diagnostic Information:

Examine Current Rate:

Query ocpm_ingress_request_total with servicename_3gpp="npcf-smpolicycontrol" to assess the current ingress traffic rate.

Review Upstream Sources:

Identify if request rates from any upstream SMF, AF, or TDF instances have increased.

Inspect Application Logs:

Check for WARN or ERROR messages in logs related to overload or congestion control rejections, which can help determine if the system is rejecting requests or experiencing resource pressure.

Recovery:

  • Throttle or Rate-Limit: Apply or adjust overload/congestion control configurations to throttle or rate-limit requests from SMF as appropriate, to restore rate to expected levels.
  • Scale Resources: Add more replicas to the sm-service deployment if needed to reduce the average rate per instance.
  • Threshold Adjustment: Adjust the alert threshold if normal traffic patterns or business requirements change.
Alert Resolution: When the sustained request rate stays below 900 MPS, Prometheus automatically clears the SM_TRAFFIC_RATE_ABOVE_THRESHOLD alert.

For any additional guidance, contact My Oracle Support.

8.1.2.57 SM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-190 SM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File SMIngressErrorRateAbove10Percent
Description Transaction Error Rate detected above 10 Percent of Total on SM service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.4
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_ingress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
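These filtered queries can also be issued programmatically through the Prometheus HTTP API. The sketch below is illustrative only: `PROM_URL` is a placeholder for your Prometheus endpoint, and the `method` label name is an assumption for per-method filtering, not confirmed by this guide.

```python
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # placeholder: your Prometheus endpoint

def error_rate_query(method=None):
    """Build the step-1 PromQL, optionally narrowed to one HTTP method.

    Assumption: errors per method are exposed via a `method` label.
    """
    selector = 'servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"'
    if method:
        selector += ',method="%s"' % method
    return "sum by(response_code)(increase(ocpm_ingress_response_total{%s}[5m]))" % selector

def run_query(promql):
    """Execute the query via the Prometheus HTTP API (needs network access)."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return resp.read()

print(error_rate_query("PUT"))
```

For example, `run_query(error_rate_query("PUT"))` would return the non-2xx response counts for PUT requests over the last five minutes, broken down by response code.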

Cause

This alert fires when more than 10% of all HTTP responses returned by the SM service (npcf-smpolicycontrol) over the past day are non-2xx (that is, not successful). This may be due to:

  • Upstream or downstream system failures
  • Application-level errors (5xx codes)
  • Client-side or bad requests (4xx codes)
  • Misconfiguration, rate limiting, or resource exhaustion

Diagnostic Information

  • Break down error rates by response code to differentiate client, server, and other errors.
  • Search for error messages, stack traces, and signs of repeated failure or congestion.
  • Validate that dependencies (upstream services, DB) are functioning correctly.
  • Analyze recent deployments or config changes
  • Check for network latency

Recovery:

  • Identify and Address Root Cause: Use error breakdown and logs to pinpoint and fix the underlying issue.
  • Rollback Recent Changes: If a recent deployment is responsible, consider rolling back temporarily.
  • Scale or Resource Adjustment: Add resources if you detect resource exhaustion.
  • Rate Limiting or Throttling: Apply throttling to minimize error propagation from upstream.
Alert Resolution: Once the error rate remains below 10% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.58 SM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-191 SM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File SMEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total Transactions (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.5
Metric Used ocpm_egress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
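
The breakdown described in these steps can be run as ad hoc Prometheus queries. The following sketch assumes the metric carries a method label in addition to the labels shown above; verify the actual label names in your deployment:

```promql
# Non-2xx egress responses for the SM service, split by response code
sum by (response_code) (rate(ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}[5m]))

# The same errors split by HTTP method (method label is an assumption)
sum by (method) (rate(ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*",method=~"GET|PUT|POST|DELETE|PATCH"}[5m]))
```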

Cause

This alert fires when more than 1% of all HTTP responses returned by the SM Service (npcf-smpolicycontrol) over the past day are non-2xx (i.e., not successful). This may be due to:

  • Upstream or downstream system failures
  • Application-level errors (5xx codes)
  • Client-side or bad requests (4xx codes)
  • Misconfiguration, rate limiting, or resource exhaustion

Diagnostic Information

  • Break down error rates by response code to differentiate client, server, and other errors.
  • Search for error messages, stack traces, and signs of repeated failure or congestion.
  • Validate that dependencies (upstream services, DB) are functioning correctly.
  • Analyze recent deployments or configuration changes.
  • Check for network latency.

Recovery:

  • Identify and Address Root Cause: Use error breakdown and logs to pinpoint and fix the underlying issue.
  • Rollback Recent Changes: If a recent deployment is responsible, consider rolling back temporarily.
  • Scale or Resource Adjustment: Add resources if you detect resource exhaustion.
  • Rate Limiting or Throttling: Apply throttling to minimize error propagation from upstream.
Alert Resolution: Once the error rate remains below 1% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.59 PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-192 PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PcfChfIngressTrafficRateAboveThreshold
Description User service Ingress traffic Rate from CHF is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total User Service Ingress traffic rate from CHF has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in PCF_Alertrules.yaml file is when user service Ingress Rate from CHF crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.11
Metric Used ocpm_userservice_inbound_count_total{service_resource="chf-service"}
Recommended Actions

Cause:

The metric ocpm_userservice_inbound_count_total with dimension service_resource="chf-service" is incremented for every inbound HTTP request reaching the CHF connector service. If the 2-minute average exceeds 900 MPS, this indicates that the system may be experiencing an overload or an abnormal spike in traffic.

Diagnostic Information:

Examine Current Rate:

Query ocpm_userservice_inbound_count_total for service_resource="chf-service" to assess the current ingress traffic rate.
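
As a sketch, the current rate can be checked directly in Prometheus; the 2-minute window matches the averaging period mentioned in the Cause above, and the label names are taken from the Metric Used field:

```promql
# Average ingress requests per second toward the CHF connector over 2 minutes
sum(rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[2m]))
```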

Review Upstream Sources:

Identify whether request rates from any upstream CHF, SMF, or AMF instances have increased.

Inspect Application Logs:

Check for WARN or ERROR messages in logs related to overload or congestion control rejections, which can help determine if the system is rejecting requests or experiencing resource pressure.

Recovery:

  • Throttle or Rate-Limit: Apply or adjust congestion control configurations to throttle requests from downstream services as appropriate, to restore rate to expected levels.
  • Scale Resources: Add more replicas to the CHF connector deployment if needed to reduce the average rate per instance.
  • Threshold Adjustment: Adjust the alert threshold if normal traffic patterns or business requirements change.
Alert Resolution: When the sustained request rate stays below 900 MPS, Prometheus will automatically clear the PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD alert.

For any additional guidance, contact My Oracle Support.

8.1.2.60 PCF_CHF_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-193 PCF_CHF_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PcfChfEgressErrorRateAbove10Percent
Description The number of failed transactions from CHF is more than 10 percent of the total transactions.
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression

(sum(rate(ocpm_chf_tracking_response_total {servicename_3gpp="nchf-spendinglimitcontrol",response_code!~"2.*"} [24h]) or (up * 0 ) ) / sum(rate(ocpm_chf_tracking_response_total {servicename_3gpp="nchf-spendinglimitcontrol"} [24h]))) * 100 >= 10

OID 1.3.6.1.4.1.323.5.3.36.1.2.12
Metric Used ocpm_chf_tracking_response_total
Recommended Actions The alert gets cleared when the number of failure transactions falls below the configured threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

It is recommended to assess the reason for failed transactions. Perform the following steps to analyze the cause of the errors:
  1. Refer to the Egress Gateway section in Grafana to determine the increase in 4xx and 5xx error response codes.
  2. Check the Egress Gateway logs on Kibana to determine the reason for the errors.
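
An illustrative fragment of how this rule and its configurable threshold might appear in the PCF_Alertrules.yaml file (the rule name is taken from the table above; the exact structure and annotations in your deployed file may differ):

```yaml
- alert: PcfChfEgressErrorRateAbove10Percent
  expr: (sum(rate(ocpm_chf_tracking_response_total{servicename_3gpp="nchf-spendinglimitcontrol",response_code!~"2.*"}[24h]) or (up * 0)) / sum(rate(ocpm_chf_tracking_response_total{servicename_3gpp="nchf-spendinglimitcontrol"}[24h]))) * 100 >= 10
  labels:
    severity: critical
  annotations:
    summary: 'Transaction Error Rate detected above 10 Percent of Total Transactions'
```

The trailing `>= 10` comparison is where the threshold level is adjusted.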

Cause:

This alert fires when more than 10% of all HTTP responses for the PCF CHF connector (the PCF component that calls the external CHF via nchf-spendinglimitcontrol) over the past day are non-2xx (i.e., not successful). This may be due to:

  • External CHF partial outage or dependency failures.
  • Application-level errors (5xx) or timeouts on the CHF path.
  • Client/bad requests (4xx) from the CHF connector due to schema/version or auth issues.
  • Misconfiguration, rate limiting/throttling, TLS/mTLS or DNS problems, or resource exhaustion.

Diagnostic Information:

  • Break down error rates by response class (4xx vs 5xx vs timeouts/TLS/connect resets).
  • Search CHF connector service logs and traces for recurring errors, stack traces, circuit-breaker events, or congestion.
  • Validate external CHF health and dependencies (service/DB), and check for throttling indicators.
  • Analyze recent deployments or configuration changes in PCF or CHF (endpoints, timeouts, retries, API versions).
  • Check for traffic spikes, connection pool saturation, CPU/memory pressure, or elevated latency.

Recovery:

  • Identify and address root cause: Use error breakdown, logs, and traces to pinpoint whether the issue is in the PCF CHF client, network/TLS/auth, or the external CHF.
  • Roll back recent changes: Temporarily revert relevant PCF/CHF deployments or configs if correlated with the onset.
  • Scale or resource adjustment: Increase capacity or tune connection/thread pools; enable autoscaling if appropriate.
  • Rate limiting or throttling: Use bounded retries with backoff and apply throttling to reduce cascading failures.
Alert resolution: Once the non-2xx rate remains below 10% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.61 PCF_CHF_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Table 8-194 PCF_CHF_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Field Details
Description Ingress Timeout Error Rate detected above 10 Percent of Total towards CHF service (current value is: {{ $value }})
Summary Timeout Error Rate detected above 10 Percent of Total Transactions
Severity Major
Expression The number of failed transactions due to timeout is above 10 percent of the total transactions for CHF service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.17
Metric Used ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol"}
Recommended Actions The alert gets cleared when the number of failed transactions due to timeout is below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
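
Step 2 above can be expressed as a Prometheus query; the method label is an assumption and should be checked against the labels actually exported by the metric:

```promql
# Timed-out CHF requests split by HTTP method
sum by (method) (rate(ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol",method=~"GET|PUT|POST|DELETE|PATCH"}[5m]))
```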

Cause:

This alert is triggered when more than 10% of all inbound requests from PCF (Policy Control Function) to the CHF (nchf-spendinglimitcontrol) time out over a 1-day window. This may impact charging, quota enforcement, or service delivery.

Common causes include:

  • Network latency, intermittent packet loss, or connectivity issues between PCF and CHF
  • Overload, resource congestion, or unresponsiveness in the CHF or its dependencies
  • Resource exhaustion or scaling limits in the PCF, CHF, or intermediary components
  • Misconfiguration of timeout thresholds, retries, or circuit breaker settings
  • Downstream service or database issues affecting CHF’s ability to respond in time
  • Recent changes or deployments that introduced performance bottlenecks or regressions

Diagnostic Information:

  • Identify which part of the infrastructure is experiencing timeouts: is it consistent across all traffic or localized?
  • Review logs from PCF, CHF, and network/security appliances for repeated timeout, retry, or connection reset events
  • Check health dashboards for CHF (CPU, memory, response latency, DB availability, etc.)
  • Analyze request/response timings, queue lengths, and backlog at ingress points
  • Correlate with recent deployment, scaling, or network changes
  • Examine resource usage and pod health for PCF and CHF components

Recovery:

  • Isolate the root cause: Use logs and health metrics to determine if the problem is with CHF availability, network path, or PCF.
  • Scale or optimize: Increase resources, scale instances, or optimize configuration for PCF and CHF services as needed.
  • Rollback if needed: If the alert correlates with new deployments or config changes, consider reverting.
  • Network remediation: Address any identified network latency, packet loss, or DNS resolution issues.
  • Tune configuration: Adjust timeout settings, connection pools, and retry logic based on observed conditions.
  • Coordinate: Engage CHF, PCF, and platform support teams as needed for collaborative troubleshooting.

Alert Resolution: This alert will auto-resolve once the ingress timeout error rate drops below 10% of total requests to CHF over the evaluation window.

For any additional guidance, contact My Oracle Support.

8.1.2.62 PCF_PENDING_BINDING_SITE_TAKEOVER

Table 8-195 PCF_PENDING_BINDING_SITE_TAKEOVER

Field Details
Description The site takeover configuration has been activated
Summary The site takeover configuration has been activated
Severity CRITICAL
Expression sum by (application, container, namespace) (changes(occnp_pending_binding_site_takeover[2m])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.45
Metric Used occnp_pending_binding_site_takeover
Recommended Actions

Cause:

This alert fires when the site takeover functionality is engaged to handle geo-redundancy scenarios. Site takeover is typically activated when a site in a distributed PCF deployment is down or unreachable, empowering another site to process that site’s pending binding operations for service continuity.

Diagnostic Information:

  • Check configuration to confirm the alternate site profile is correctly set and the takeover flag is enabled.
  • Examine PendingOperation records to ensure the alternate site is processing entries from the down site’s site ID.
  • Review service logs for site takeover-related events, handoff messages, and any associated errors during takeover or operation processing.

Recovery & Actions:

  • Verify that site takeover activation was intentional and aligns with fail-over or DR (Disaster Recovery) procedures.
  • Monitor processing of pending operations for successful handoff and completion under the alternate site.
  • Communicate with relevant operations/support teams about the takeover to prevent conflicting operations.
  • Disable site takeover once the original site is restored to normal operation, so pending operations revert to their standard ownership and workflow.
  • Audit for any missed or failed operations during the site handover, and remediate as needed.

Alert Resolution: The alert will auto-resolve once there are no new site takeover events, and the takeover configuration is deactivated or no longer required.

For any additional guidance, contact My Oracle Support.

8.1.2.63 PCF_PENDING_BINDING_THRESHOLD_LIMIT_REACHED

Table 8-196 PCF_PENDING_BINDING_THRESHOLD_LIMIT_REACHED

Field Details
Description The Pending Operation table threshold has been reached.
Summary The Pending Operation table threshold has been reached.
Severity CRITICAL
Expression sum by (application, container, namespace) (changes(occnp_threshold_limit_reached_total[2m])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.46
Metric Used occnp_threshold_limit_reached_total
Recommended Actions

Cause

This alert fires when the number of records in the Pending Operation table (used to reattempt binding registration in BSF at a later time) reaches a predefined threshold. This means the system's retry or pending queue for binding operations is saturated and may be at risk of delaying or failing new operations. Exceeding this threshold typically signals that pending binding registrations are not clearing at the expected rate.

Common causes include:

  • Persistent errors or failures from BSF in response to binding attempts, triggering retries
  • Widespread or systemic service degradation in BSF, Binding Service, or network paths
  • Application bugs resulting in stuck or orphaned PendingOperation records
  • Misconfigured thresholds, retry intervals, or logic in SM or Binding Service
  • Resource starvation (CPU, memory, DB connections) preventing timely processing of pending operations
  • Recent deployments, configuration updates, or load spikes overwhelming the binding flow

Diagnostic Information

  • Check the volume, age, and growth trend of records in the Pending Operation table
  • Correlate with other alerts or incident tickets related to BSF, Binding Service, network, or DB health
  • Analyze logs from SM Service, Binding Service, and (if applicable) Audit Service for repeated errors, retry loops, or slow processing
  • Review recent deployments or configuration changes to PCF Service components
  • Inspect resource utilization for relevant pods, containers, and backend storage
  • Confirm correct configuration of the threshold limit, retry intervals, and error code handling
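
The occupancy trend of the Pending Operation table can be watched with the occnp_pending_operation_records_count metric used by the related PCF_PENDING_BINDING_RECORDS_COUNT alert (a sketch; grouping labels follow the alert expressions in this section):

```promql
# Current Pending Operation record count per application/container/namespace
sum by (application, container, namespace) (occnp_pending_operation_records_count)
```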

Recovery

  • Prioritize clearing pending records: Investigate and remediate the root cause(s) of unprocessed binding operations (BSF issues, infra bottlenecks, logic bugs)
  • Scale resources or prioritize processing: Add capacity or redistribute load if resource constraints are found
  • Tune configuration: Adjust thresholds, error code mappings, and retry intervals as necessary
  • Audit retry and cleanup logic: Ensure orphaned or stale records are purged and retry logic is functioning as intended
  • Rollback if needed: If issue began with a recent deployment or config change, consider reverting
  • Coordinate across teams: Engage with BSF, Infrastructure, and DB owners as required
Alert resolution: The alert will auto-resolve once the number of records in the Pending Operation table returns below the configured threshold and normal processing resumes.

For any additional guidance, contact My Oracle Support.

8.1.2.64 PCF_PENDING_BINDING_RECORDS_COUNT

Table 8-197 PCF_PENDING_BINDING_RECORDS_COUNT

Field Details
Description An attempt to internally recreate a PCF binding has been triggered by PCF
Summary An attempt to internally recreate a PCF binding has been triggered by PCF
Severity MINOR
Expression sum by (application, container, namespace) (changes(occnp_pending_operation_records_count[10s])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.47
Metric Used occnp_pending_operation_records_count
Recommended Actions

Cause

This alert fires when a new pending binding operation is inserted into the system by the SM Service (to reattempt binding registration in BSF at a later time). This typically happens when the BSF reattempt settings are configured and the response from BSF to a binding registration indicates an error condition that requires a retry (as per pre-configured error codes).

Common causes for entries in the PendingOperation table include:

  • BSF returns a transient or retry-eligible error code in response to binding requests.
  • Temporary unavailability or instability of BSF or related network paths.
  • Application bugs leading to improper handling of BSF responses or retry logic.
  • Recent configuration changes impacting retry or error handling logic.

Diagnostic Information

  • Review SM Service and binding service logs to trace binding requests, BSF response codes, and the creation/updating of PendingOperations.
  • Verify resource utilization and health across relevant pods or containers.
  • Analyze timing and volume of pending operation records; spikes may indicate regression or external service instability.

Recovery

  • Monitor pending operation clearance: Confirm that retries triggered by Audit Service notifications are processed and successfully clear pending records.
  • Investigate recurring or persistent errors: If retries are frequently required or repeatedly fail, drill down to BSF responses, retry outcomes, and any correlated infrastructure issues.
  • Coordinate with BSF/service owners: If an underlying BSF or network problem persists, work with those teams to restore normal registration flow.
  • Tune configuration: Adjust error code mapping, retry intervals, or thresholds based on observed workload and service behavior.
  • Rollback if needed: Revert recent deployments or config updates if they correlate with spikes in pending operations.
Alert resolution: The alert will auto-resolve when new pending binding operation records are no longer being routinely created, retries are succeeding, and the overall pending queue stabilizes or clears.

For any additional guidance, contact My Oracle Support.

8.1.2.65 AUTONOMOUS_SUBSCRIPTION_FAILURE

Table 8-198 AUTONOMOUS_SUBSCRIPTION_FAILURE

Field Details
Description Autonomous subscription failed for a configured Slice Load Level
Summary Autonomous subscription failed for a configured Slice Load Level
Severity Critical
Expression The number of failed Autonomous Subscription for a configured Slice Load Level in nwdaf-agent is greater than zero.
OID 1.3.6.1.4.1.323.5.3.52.1.2.49
Metric Used subscription_failure{requestType="autonomous"}
Recommended Actions The alert gets cleared when the failed Autonomous Subscription is corrected.
To clear the alert, perform the following steps:
  1. Delete the Slice Load Level configuration.
  2. Re-provision the Slice Load Level configuration.
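
To see which slice is affected, the failure metric from the table can be queried directly; the name of the label carrying the S-NSSAI is an assumption, so inspect the metric's labels in your deployment:

```promql
# Autonomous subscription failures, grouped by the slice identifier label
sum by (snssai) (subscription_failure{requestType="autonomous"}) > 0
```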

Cause:

This alert activates when there is at least one autonomous subscription (such as the NWDAF event subscription process) failure detected for a given S-NSSAI, indicating that the system was unable to successfully initiate or maintain a subscription for a specific network slice. Common causes may include:

  • Remote service (e.g., NWDAF) is unavailable, responds with a failure, or returns an error code.
  • Authentication/authorization failures (invalid tokens, credentials, certificates).
  • Incorrect, missing, or unsupported subscription parameters (S-NSSAI, event types, notification targets).
  • API version or schema mismatches between subscribing and serving systems.
  • Rate limiting, resource exhaustion, or capacity constraints in remote service.
  • Network or DNS/connectivity problems between components.
  • Recent deployment or configuration change introducing new issues.

Diagnostic Information:

  • Check which S-NSSAI (network slice) is affected using the alert labels.
  • Review NWDAF Agent service logs, and collect relevant error codes and messages from the failed subscription attempts.
  • Examine recent changes or deployments to the NWDAF Agent, remote NWDAF, or related interfaces/services.
  • Assess service health and connectivity between the agent and NWDAF (latency, errors, authentication status).
  • Validate the subscription request payload, endpoint URLs, and configuration for the target S-NSSAI.
  • Look for evidence of transient or repeated network/service issues.

Recovery:

  • Identify the failed subscription(s): Use the alert labels and logs to pinpoint the slice(s) affected.
  • Resolve remote or local service issues: Work with relevant teams to restore NWDAF or agent functionality, address authentication or network problems, or resolve configuration mismatches.
  • Retry or re-initiate subscriptions as needed after addressing the root cause.
  • Rollback changes if the alert coincides with recent deployments, configuration modifications, or rollouts.
Alert Resolution: This alert will automatically resolve once the system detects that there are no new autonomous subscription failures (i.e., no new increments in the failure counter) for the affected S-NSSAI(s) within the evaluation window. Successful re-establishment or correction will clear the alert.

For any additional guidance, contact My Oracle Support.

8.1.2.66 AM_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Table 8-199 AM_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description AM Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Summary AM Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Severity MINOR
Expression (sum(rate(http_out_conn_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d])) / sum(rate(http_out_conn_response_total{pod=~".*amservice.*",servicename3gpp="npcf-am-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.54
Metric Used http_out_conn_response_total
Recommended Actions

Cause

This alert triggers when 1% or more of notification requests sent from the AM service (part of PCF) to the AMF (npcf-am-policy-control endpoint) result in non-2xx (unsuccessful) responses over a 1-day window. These notifications inform AMF about access or mobility events. A significant portion of errors could be 404 responses, which occur when AMF does not have the corresponding session in its context. This may indicate attempts to notify AMF about sessions that have already ended or were never established.

Other possible causes include:

  • Partial outage, degradation, or overload in the AMF
  • Application errors in the AM service or AMF (e.g., other 4xx or 5xx codes)
  • Schema or API mismatches due to recent deployments or configuration changes
  • Authentication, authorization, or TLS certificate issues
  • Network/connectivity problems
  • Resource exhaustion in the AMF

Diagnostic Information

  • Break down non-2xx responses by HTTP status code, especially 404 versus other 4xx/5xx
  • Examine AM service and AMF logs for detailed error messages and patterns
  • Review session establishment, update, and termination flows in both AM service and AMF
  • Investigate recent deployments, configuration changes, or spikes in error rates
  • Assess resource usage and health of both AM service and AMF
  • Validate API contracts, payload formats, and endpoint configurations
  • Check for authentication/authorization or certificate issues
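
The status-code breakdown in the first diagnostic step above can be sketched with the labels from the alert expression:

```promql
# Non-2xx AM notification responses toward AMF, split by status code
sum by (responseCode) (rate(http_out_conn_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d]))
```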

Recovery

  • Identify and resolve the root cause: Use logs, traces, and error breakdowns to determine if high 404 rates are expected (due to session lifecycle), or if there is a systematic issue such as stale notifications
  • Tune notification logic: Adjust workflows to minimize duplicate or late notifications when sessions may already have ended
  • Rollback or adjust recent changes: If errors correlate with deployments or config updates, consider reverting them
  • Scale or adjust resources: Add capacity or tune connection/timeouts if resource exhaustion is present
  • Remediate network or security problems: Ensure stable communication and correct authentication/certificates between PCF and AMF
Alert resolution: The alert will auto-resolve when the error rate drops below 1% over the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.67 AM_AR_ERROR_RATE_ABOVE_1_PERCENT

Table 8-200 AM_AR_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description Alternate Routing Error Rate detected above 1 Percent of Total on AM Service (current value is: {{ $value }})
Summary Alternate Routing Error Rate detected above 1 Percent of Total on AM Service (current value is: {{ $value }})
Severity MINOR
Expression (sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d])) / sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*amservice.*",servicename3gpp="npcf-am-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.55
Metric Used ocpm_ar_response_total
Recommended Actions

Cause

This alert fires when 1% or more of alternate routing (AR) requests initiated by the AM service (as part of PCF) to AMF (npcf-am-policy-control) result in non-2xx (unsuccessful) responses over a 1-day window, grouped by FQDN.

Alternate routing is the process of retrying the original request to a different AMF instance when the initial attempt fails. A rising AR error rate suggests persistent issues with connectivity, service health, or configuration for primary or alternate AMF endpoints.

Typical causes include:

  • Persistent unavailability, overload, or partial outages affecting some or all AMF instances
  • Application-level errors from AMF (many 4xx/5xx responses, including 404s for missing sessions)
  • Schema or API incompatibility after deployments or configuration changes
  • Authentication, authorization, or certificate-related failures during retries
  • Network or DNS problems affecting communication with one or more AMF instances
  • Resource exhaustion, scaling issues, or retry storm in the AM service
  • Misconfiguration of alternate endpoint lists or retry logic

Diagnostic Information

  • Break down failed AR responses by HTTP status code (4xx, 5xx, timeouts) to pinpoint the failure type
  • Review AM service logs to identify why alternate routing was triggered and the response from each retry
  • Inspect AMF logs for errors and session context associated with AR requests
  • Assess health, status, and readiness of all AMF endpoints relevant to the alerting FQDN
  • Check authentication credentials, certificate validity, and endpoint configuration
  • Correlate AR error spikes with recent deployments, updates, scaling actions, or network incidents
  • Analyze retry logic to ensure backoff and failover policies are working as expected
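
The per-endpoint breakdown described above can be sketched as follows (labels taken from the alert expression):

```promql
# Failed alternate-routing responses per AMF FQDN, split by status code
sum by (fqdn, responseCode) (rate(ocpm_ar_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d]))
```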

Recovery

  • Isolate the root cause: Use logs and metrics to determine if AR failures are due to persistent AMF unavailability, configuration problems, or retry logic bugs
  • Remediate endpoint or network issues: Restore AMF health, increase capacity, or fix network connectivity to all AMF endpoints
  • Fix authentication or certificate problems: Update or refresh security credentials as necessary
  • Adjust or rollback changes as needed: If increased errors align with a recent deployment or config update
  • Tune retry/backoff policies: Update AR configuration to minimize repeated failures or retry storms
Alert resolution: The alert auto-resolves once the AR error rate drops below 1% over the measurement window

For any additional guidance, contact My Oracle Support.

8.1.2.68 UE_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Table 8-201 UE_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description UE Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Summary UE Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Severity MINOR
Expression (sum(rate(http_out_conn_response_total{pod=~".*ueservice.*",responseCode!~"2.*",servicename3gpp="npcf-ue-policy-control"}[1d])) / sum(rate(http_out_conn_response_total{pod=~".*ueservice.*",servicename3gpp="npcf-ue-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.56
Metric Used http_out_conn_response_total
Recommended Actions

Cause

This alert triggers when 1% or more of notification requests sent from the UE service (part of PCF) to the AMF (npcf-ue-policy-control endpoint) result in non-2xx (unsuccessful) responses over a 1-day window. These notifications inform AMF about UE policy events. A significant portion of errors could be 404 responses, which occur when AMF does not have the corresponding session in its context. This may indicate attempts to notify AMF about sessions that have already ended or were never established.

Other possible causes include:

  • Partial outage, degradation, or overload in the AMF
  • Application errors in the UE service or AMF (e.g., other 4xx or 5xx codes)
  • Schema or API mismatches due to recent deployments or configuration changes
  • Authentication, authorization, or TLS certificate issues
  • Network/connectivity problems
  • Resource exhaustion in the AMF

Diagnostic Information

  • Break down non-2xx responses by HTTP status code, especially 404 versus other 4xx/5xx
  • Examine UE service and AMF logs for detailed error messages and patterns
  • Review session establishment, update, and termination flows in both UE service and AMF
  • Investigate recent deployments, configuration changes, or spikes in error rates
  • Assess resource usage and health of both UE service and AMF
  • Validate API contracts, payload formats, and endpoint configurations
  • Check for authentication/authorization or certificate issues

Recovery

  • Identify and resolve the root cause: Use logs, traces, and error breakdowns to determine if high 404 rates are expected (due to session lifecycle), or if there is a systematic issue such as stale notifications
  • Tune notification logic: Adjust workflows to minimize duplicate or late notifications when sessions may already have ended
  • Rollback or adjust recent changes: If errors correlate with deployments or config updates, consider reverting them
  • Scale or adjust resources: Add capacity or tune connection/timeouts if resource exhaustion is present
  • Remediate network or security problems: Ensure stable communication and correct authentication/certificates between PCF and AMF
Alert resolution: The alert will auto-resolve when the error rate drops below 1% over the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.69 UE_AR_FAILURE_RATE_ABOVE_1_PERCENT

Table 8-202 UE_AR_FAILURE_RATE_ABOVE_1_PERCENT

Field Details
Description Alternate Routing Error Rate detected above 1 Percent of Total on UE Service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions on UE Alternate Routing
Severity MINOR
Expression (sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*ueservice.*",responseCode!~"2.*",servicename3gpp="npcf-ue-policy-control"}[1d])) / sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*ueservice.*",servicename3gpp="npcf-ue-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.57
Metric Used ocpm_ar_response_total
Recommended Actions

Cause

This alert fires when 1% or more of alternate routing (AR) requests initiated by the UE service (as part of PCF) to AMF (npcf-ue-policy-control) result in non-2xx (unsuccessful) responses over a 1-day window, grouped by FQDN.

Alternate routing is the process of retrying the original request to a different AMF instance when the initial attempt fails. A rising AR error rate suggests persistent issues with connectivity, service health, or configuration for primary or alternate AMF endpoints.

Typical causes include:

  • Persistent unavailability, overload, or partial outages affecting some or all AMF instances
  • Application-level errors from AMF (many 4xx/5xx responses, including 404s for missing sessions)
  • Schema or API incompatibility after deployments or configuration changes
  • Authentication, authorization, or certificate-related failures during retries
  • Network or DNS problems affecting communication with one or more AMF instances
  • Resource exhaustion, scaling issues, or retry storm in the AM service
  • Misconfiguration of alternate endpoint lists or retry logic

Diagnostic Information

  • Break down failed AR responses by HTTP status code (4xx, 5xx, timeouts) to pinpoint the failure type
  • Review AM service logs to identify why alternate routing was triggered and the response from each retry
  • Inspect AMF logs for errors and session context associated with AR requests
  • Assess health, status, and readiness of all AMF endpoints relevant to the alerting FQDN
  • Check authentication credentials, certificate validity, and endpoint configuration
  • Correlate AR error spikes with recent deployments, updates, scaling actions, or network incidents
  • Analyze retry logic to ensure backoff and failover policies are working as expected
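As a starting point for the diagnostic steps above, the failed AR responses can be broken down by status code directly in Prometheus. This query reuses the labels from the alert expression; a shorter window such as [1h] is used here to localize recent failures, and the exact label set may vary by release:

```promql
# Failed alternate-routing responses per FQDN and HTTP status code (last hour)
sum by (fqdn, responseCode) (
  rate(ocpm_ar_response_total{
    pod=~".*ueservice.*",
    responseCode!~"2.*",
    servicename3gpp="npcf-ue-policy-control"
  }[1h])
)
```

A dominant 5xx code points at AMF-side failures, while 4xx codes usually indicate request, session-state, or authorization problems.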

Recovery

  • Isolate the root cause: Use logs and metrics to determine if AR failures are due to persistent AMF unavailability, configuration problems, or retry logic bugs
  • Remediate endpoint or network issues: Restore AMF health, increase capacity, or fix network connectivity to all AMF endpoints
  • Fix authentication or certificate problems: Update or refresh security credentials as necessary
  • Adjust or rollback changes as needed: If increased errors align with a recent deployment or config update
  • Tune retry/backoff policies: Update AR configuration to minimize repeated failures or retry storms
Alert resolution: The alert auto-resolves once the AR error rate drops below 1% over the measurement window

For any additional guidance, contact My Oracle Support.

8.1.2.70 SMSC_CONNECTION_DOWN

Table 8-203 SMSC_CONNECTION_DOWN

Field Details
Description Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Summary Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Severity MAJOR
Expression sum by(namespace, pod, smscName)(occnp_active_smsc_conn_count) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.63
Metric Used occnp_active_smsc_conn_count
Recommended Actions

Cause

This alert fires when the connection count to a specific SMSC (Short Message Service Center) peer (smscName) drops to zero in a notifier service pod. This means that the notifier service in the indicated pod has lost connectivity with the SMSC peer, which may halt or delay SMS delivery for affected sessions.

Common causes include:

  • Network connectivity issues between the notifier pod and the SMSC peer (latency, packet loss, firewall changes)
  • SMSC peer instance is offline, unresponsive, or undergoing maintenance
  • Unexpected restart or crash of the notifier service pod
  • TCP session timeout, reset, or socket exhaustion
  • TLS/certificate negotiation failures (if applicable)
  • Misconfiguration of SMSC endpoint, port, or authentication details
  • Recent pod or infrastructure changes affecting networking or endpoints

Diagnostic Information

  • Identify which namespace, pod, and smscName are affected from alert labels
  • Check notifier pod logs for errors, timeouts, or repeated reconnection attempts to the SMSC
  • Confirm SMSC peer health and status via monitoring tools or coordination with peer’s operations
  • Validate network connectivity (test with ping/telnet/traceroute), DNS resolution, and firewall or security rules
  • Review recent changes in deployment, SMSC endpoint configuration, or certificate rotation
  • Check for underlying resource issues (CPU, memory, open file/socket limits) on the notifier pod
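To confirm which peers and pods are affected, the alert's own metric can be queried per pod and peer; any series reporting zero identifies a broken connection:

```promql
# Active SMSC connections per notifier pod and peer; 0 means the link is down
sum by (namespace, pod, smscName) (occnp_active_smsc_conn_count)
```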

Recovery

  • Restore connectivity: Address any network or firewall problems between the notifier pod and SMSC peer
  • Restart services: If the notifier pod is in a bad state, restart it to reestablish the connection
  • Engage SMSC operations: If the peer is down, coordinate with the SMSC provider/team to restore service
  • Correct configuration: Verify endpoint settings, authentication, and port assignments in both notifier and SMSC
  • Rollback recent changes: If disconnection began after deployment or configuration change, consider reverting
Alert resolution: The alert will auto-resolve once the connection count returns above zero for the affected pod and SMSC

For any additional guidance, contact My Oracle Support.

8.1.2.71 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Table 8-204 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMinorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the minor threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 20 Percent of Total Transactions.
Severity Minor
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 20% and 50% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking
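The current failure ratio per namespace can be inspected with the same expression the alert uses, minus the threshold, to see how close each namespace is to the 20%/50%/75% boundaries:

```promql
# Percentage of acquireLock requests that failed over the last 5 minutes
(
  sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
/
  sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))
) * 100
```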

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 20%. If the rate exceeds 50%, a higher severity alert will trigger.
8.1.2.72 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Table 8-205 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMajorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the major threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 50 Percent of Total Transactions.
Severity Major
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 50 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 50% and 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 50%. If the rate exceeds 75%, a higher severity alert will trigger.
8.1.2.73 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Table 8-206 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsCriticalThreshold
Description The count of lock requests that fail to acquire the lock exceeds the critical threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 75 Percent of Total Transactions.
Severity Critical
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, above 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 75%.
8.1.2.74 SM_UPDATE_NOTIFY_FAILED_ABOVE_50_PERCENT

Table 8-207 SM_UPDATE_NOTIFY_FAILED_ABOVE_50_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 50 < 60
Summary Update Notify Terminate sent to SMF failed >= 50 < 60
Severity MINOR
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 50 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, between 50% and 60% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF
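For the first diagnostic step, the non-2xx terminate_notify responses can be split by status code to separate expected 404s (sessions already removed on SMF) from 5xx and other 4xx failures. The labels follow the alert expression above:

```promql
# Failed terminate_notify responses to SMF, broken down by HTTP status code
sum by (responseCode) (
  occnp_http_out_conn_response_total{
    operationType="terminate_notify",
    pod=~".*smservice.*",
    servicename3gpp="npcf-smpolicycontrol",
    responseCode!~"2.*"
  }
)
```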

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 50% for the evaluation window. A higher-severity alert may trigger if failures exceed 60%.

For any additional guidance, contact My Oracle Support.

8.1.2.75 SM_UPDATE_NOTIFY_FAILED_ABOVE_60_PERCENT

Table 8-208 SM_UPDATE_NOTIFY_FAILED_ABOVE_60_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 60 < 70
Summary Update Notify Terminate sent to SMF failed >= 60 < 70
Severity MAJOR
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 60 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, between 60% and 70% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 60% for the evaluation window. A higher-severity alert may trigger if failures exceed 70%.

For any additional guidance, contact My Oracle Support.

8.1.2.76 SM_UPDATE_NOTIFY_FAILED_ABOVE_70_PERCENT

Table 8-209 SM_UPDATE_NOTIFY_FAILED_ABOVE_70_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 70
Summary Update Notify Terminate sent to SMF failed >= 70
Severity CRITICAL
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, above 70% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 70%.

For any additional guidance, contact My Oracle Support.

8.1.2.77 UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT

Table 8-210 UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT

Field Details
Description {{ $value }} % of update notify sent to SMF that failed.
Summary More than 30% of update notify sent to SMF failed
Severity Minor
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.
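For reference, a rule of this shape in the Alertrules.yaml file would look roughly as follows. This is a sketch assembled from the fields in the table above; the group name and annotation wording are illustrative, and the shipped file may differ:

```yaml
groups:
  - name: occnp-alerts            # illustrative group name
    rules:
      - alert: UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT
        expr: >
          (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m]))
          / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
        labels:
          severity: minor
        annotations:
          summary: 'More than 30% of update notify sent to SMF failed'
          description: '{{ $value }} % of update notify sent to SMF that failed.'
```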

For any additional guidance, contact My Oracle Support.

8.1.2.78 UPDATE_NOTIFY_FAILURE_ABOVE_50_PERCENT

Table 8-211 UPDATE_NOTIFY_FAILURE_ABOVE_50_PERCENT

Field Details
Description Number of Update notify that failed is equal or above 50% but less than 70% in a given time period
Summary Number of Update notify that failed is equal or above 50% but less than 70% in a given time period
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.

For any additional guidance, contact My Oracle Support.

8.1.2.79 UPDATE_NOTIFY_FAILURE_ABOVE_70_PERCENT

Table 8-212 UPDATE_NOTIFY_FAILURE_ABOVE_70_PERCENT

Field Details
Description {{ $value }} % of update notify sent to SMF that failed
Summary More than 70% of update notify sent to SMF failed
Severity Critical
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.

For any additional guidance, contact My Oracle Support.

8.1.2.80 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST

Table 8-213 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST

Field Details
Description Ingress Gateway traffic gets rejected more than 1% because of ratelimiting.
Summary Ingress Gateway traffic gets rejected more than 1% because of ratelimiting.
Severity Major
Expression (sum by (namespace,pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total {Allowed="false",app_kubernetes_io_name="occnp-ingress-gateway"}[2m])))/ (sum by (namespace,pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total {app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.103
Metric Used oc_ingressgateway_http_request_ratelimit_values_total
Recommended Actions

Cause:

This alert is triggered when the percentage of denied requests is above 1% of the total TPS.

Diagnostic Information:

  • Metric involved: oc_ingressgateway_http_request_ratelimit_values_total
  • Error observed: 429 Too Many Requests, NF_CONGESTION_RISK
  • Cause value: Allowed="false"
  • Condition: podProtectionByRateLimiting.enabled = true and podProtectionByRateLimiting.fillRate settings
  • Verification steps:
    • Set podProtectionByRateLimiting.fillRate to a lower value and podProtectionByRateLimiting.deniedRequestActions.action=REJECT for a lower congestion level.
    • Run 4500 TPS or above for SM traffic.
    • Confirm that some requests are dropped with error 429.
    • Verify that the alert gets triggered.
  • Monitoring recommendations:
    • Monitor 4xx errors and counter increases for oc_ingressgateway_http_request_ratelimit_values_total{Allowed="false"}
    • Watch for spikes following client deployments.
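The monitoring recommendation above can be implemented with a query such as the following, which tracks the denied-request share per Ingress Gateway pod — the same ratio the alert thresholds at 1%:

```promql
# Percentage of requests denied by rate limiting, per Ingress Gateway pod
(
  sum by (namespace, pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total{Allowed="false",app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))
/
  sum by (namespace, pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total{app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))
) * 100
```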

Recovery:

  • Check for network traffic bursts and storms
  • Investigate traffic load balancer and network issues
  • Review SM Service resources
  • Restart or scale up resources temporarily if the system is congested
  • Reconfigure podProtectionByRateLimiting.fillRate to a higher value and assign podProtectionByRateLimiting.deniedRequestActions.action=REJECT to a higher congestion level
  • Disable the feature: if this flow is the only one affected, the feature can be disabled as a last resort

For any additional guidance, contact My Oracle Support.

8.1.2.81 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MINOR_THRESHOLD

Table 8-214 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 20 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 20 Percent of Total n1n2 notify Request.
Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment).

So, the operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.82 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-215 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 50 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 50 Percent of Total n1n2 notify Request.
Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment). The operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.83 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-216 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 75 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 75 Percent of Total n1n2 notify Request.
Severity CRITICAL
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment). The operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.84 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MINOR_THRESHOLD

Table 8-217 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description

Over 20% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 20% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from AMF.

In this case, the operator needs to check for connectivity issues between AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to check whether AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.85 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-218 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description

Over 50% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 50% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from AMF.

In this case, the operator needs to check for connectivity issues between AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to check whether AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.86 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-219 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description

Over 75% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 75% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Critical
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from the AMF.

In this case, the operator must check for connectivity issues between the AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to verify that the AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.87 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD

Table 8-220 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description

Over 20% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 20% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.
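For reference, an alert of this family might be expressed in the Alertrules.yaml file along the following lines, using the expression documented above. This is an illustrative sketch only; the group name, label values, and annotation text shown here are assumptions, not the shipped configuration:

```yaml
groups:
  - name: policy-ue-n1n2-alerts        # illustrative group name
    rules:
      - alert: UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD
        expr: |
          sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m]))
            / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
        labels:
          severity: minor              # illustrative label value
        annotations:
          summary: >
            Over 20% of UE N1N2 transfers had T3501 expire before the
            N1N2 notify was received from AMF.
```

A rules file of this shape can be syntax-checked with Prometheus's promtool before it is loaded.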

8.1.2.88 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-221 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description

Over 50% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 50% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.

8.1.2.89 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-222 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description

Over 75% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 75% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Critical
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.

8.1.2.90 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_CRITICAL_THRESHOLD

Table 8-223 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 70% in a given time period.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends with an error.

  • Metrics: ocpm_handle_update_notify_error_response_as_pending_confirmation_total
    • This is incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, the specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and a timeout happens during an Update Notify triggered by AAR-I or AAR-U.
  • Alarm Condition:
    • If more than or equal to 70% of the total update_notify requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in network which is resulting in timeouts
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state
  • Review application
    • Verify that SM is not congested.
    • Check for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded
  • Disable feature
    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.
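The RX_PENDING_CONFIRMATION alerts at the three severities partition the failure percentage into non-overlapping bands (>= 70 critical, >= 50 and < 70 major, >= 30 and < 50 minor). The following sketch (illustrative Python, not product code) shows that banding:

```python
# Hedged sketch of the severity banding used by the three
# RX_PENDING_CONFIRMATION alert expressions. Only one band can match
# a given percentage, so at most one severity fires for a value.

def pending_confirmation_severity(error_percent: float):
    """Band the Update Notify failure percentage the way the three
    alert expressions do; returns None when no alert is raised."""
    if error_percent >= 70:
        return "critical"
    if error_percent >= 50:
        return "major"
    if error_percent >= 30:
        return "minor"
    return None


for pct in (25, 30, 55, 70):
    print(pct, pending_confirmation_severity(pct))
```

Because the bands share their boundary values exactly (a reading of 50% is major, not minor), clearing one alert while crossing into the next band raises the adjacent severity rather than leaving a gap.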

8.1.2.91 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MAJOR_THRESHOLD

Table 8-224 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MAJOR_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 50% but less than 70% in a given time period.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends with an error.

  • Metrics: ocpm_handle_update_notify_error_response_as_pending_confirmation_total
    • This is incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, the specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and a timeout happens during an Update Notify triggered by AAR-I or AAR-U.
  • Alarm Condition:
    • If more than or equal to 50% but less than 70% of the total update_notify requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in network which is resulting in timeouts
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state
  • Review application
    • Verify that SM is not congested.
    • Check for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded
  • Disable feature
    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.92 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MINOR_THRESHOLD

Table 8-225 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MINOR_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 30% but less than 50% of the total Rx sessions.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm", responseCode=~"5xx/4xx"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the operation Update Notify towards SMF ends up with an error.

Metrics:

  • ocpm_handle_update_notify_error_response_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and

      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and

      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 30% and < 50% of update_notify total requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.93 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_CRITICAL_THRESHOLD

Table 8-226 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_CRITICAL_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 70% in a given time period.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends up with a timeout.

Metrics:

  • ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and
      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and
      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 70% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.94 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MAJOR_THRESHOLD

Table 8-227 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MAJOR_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 50% but less than 70% in a given time period.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the operation Update Notify towards SMF ends up with a timeout.

  • Metrics: ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled and a specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and a timeout happens during update notify triggered by AAR-I and AAR-U.
  • Alarm Condition:

    • If more than or equal to 50% but less than 70% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in the network which are resulting in timeouts.
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct.
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state.
  • Review application
    • Verify that SM is not congested.
    • Look for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded.
  • Disable feature
    • If this flow is the only one affected, you can disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.95 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MINOR_THRESHOLD

Table 8-228 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MINOR_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends up with a timeout.

Metrics:

  • ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and
      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and
      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 30% and < 50% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.96 PCF_STATE_NON_FUNCTIONAL_CRITICAL

Table 8-229 PCF_STATE_NON_FUNCTIONAL_CRITICAL

Field Details
Description Policy is in non functional state due to DB cluster state down.
Summary Policy is in non functional state due to DB cluster state down.
Severity Critical
Expression appinfo_nfDbFunctionalState_current{nfDbFunctionalState="Not_Running"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.102
Metric Used appinfo_nfDbFunctionalState_current
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.2.97 UDR_GET_REVALIDATION_FAILURE_ABOVE_MAJOR_PERCENT

Table 8-230 UDR_GET_REVALIDATION_FAILURE_ABOVE_MAJOR_PERCENT

Field Details
Description This alert is triggered when more than or equal to 50% but less than 70% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 50% but less than 70% of the UDR revalidation GET calls failed.
Severity Major
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever a response is received from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation exceeds the configured threshold.

This alert is triggered when more than or equal to 50% but less than 70% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation) sent by the PCF-UserService fail (that is, receive non-2xx HTTP response codes).

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from the PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses—look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess whether there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate whether there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check whether other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.
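The failure selector in these expressions, response_code!~"2.*", counts every non-2xx response toward the failure percentage. The following sketch (illustrative Python; the re module is used here as a stand-in for Prometheus's RE2 label matching, which behaves identically for this simple pattern) shows which codes the selector treats as failures:

```python
import re

# The alert's failure selector: any response code NOT fully matching "2.*",
# i.e. any code that does not start with "2".
NON_FAILURE = re.compile(r"2.*")


def is_revalidation_failure(response_code: str) -> bool:
    """Mirrors response_code!~"2.*": True when the code is not 2xx."""
    return not NON_FAILURE.fullmatch(response_code)


# Only the non-2xx codes count toward the failure numerator.
print([c for c in ("200", "201", "404", "500", "503")
       if is_revalidation_failure(c)])
```

Note that Prometheus anchors label-matcher regexes at both ends, which is why fullmatch (rather than search) is the faithful analogue here.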

8.1.2.98 UDR_GET_REVALIDATION_FAILURE_ABOVE_CRITICAL_PERCENT

Table 8-231 UDR_GET_REVALIDATION_FAILURE_ABOVE_CRITICAL_PERCENT

Field Details
Description This alert is triggered when more than or equal to 70% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 70% of the UDR revalidation GET calls failed.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever a response is received from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation exceeds the configured threshold.

This alert is triggered when more than or equal to 70% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation) sent by the PCF-UserService fail (that is, receive non-2xx HTTP response codes).

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses — look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess whether there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate whether there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check whether other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.

8.1.2.99 UDR_GET_REVALIDATION_FAILURE_ABOVE_MINOR_PERCENT

Table 8-232 UDR_GET_REVALIDATION_FAILURE_ABOVE_MINOR_PERCENT

Field Details
Description This alert is triggered when more than or equal to 30% but less than 50% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 30% but less than 50% of the UDR revalidation GET calls failed.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever we receive a response from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation is above the configured threshold.

This alert is triggered when more than or equal to 30% but less than 50% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from the PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses — look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess if there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate if there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check if other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.

8.1.2.100 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_CRITICAL_PERCENT

Table 8-233 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_CRITICAL_PERCENT

Field Details
Description This alert is triggered when 70% or more of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 70% or more of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.
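The ratio-and-threshold logic of the expression, together with the Major and Minor variants of this alert, can be mirrored in a short sketch. The counter values are hypothetical; the thresholds are the 70/50/30 tiers used by this alert family:

```python
# Sketch of the tiering applied by the UDR_GET_REVALIDATION_404 alert
# family: failed 404 responses as a percentage of all revalidation
# responses, mapped to Critical (>= 70), Major (>= 50), or Minor (>= 30).
# Counter values are illustrative, not real metric samples.

def severity_for_404_ratio(failed_404, total):
    """Return the alert tier for the failure percentage, or None."""
    if total == 0:
        return None  # no revalidation traffic in the window
    pct = failed_404 / total * 100
    if pct >= 70:
        return "Critical"
    if pct >= 50:
        return "Major"
    if pct >= 30:
        return "Minor"
    return None

print(severity_for_404_ratio(75, 100))  # Critical
```

Note that the PromQL expressions achieve the non-overlapping bands by chaining comparison filters (for example, `>= 50 < 70`) rather than explicit if/else logic.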

Cause:

This alert is triggered when more than 70% of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate Missing Subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit Subscription Lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review Recent Deployments or Changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.101 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MAJOR_PERCENT

Table 8-234 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MAJOR_PERCENT

Field Details
Description This alert is triggered when 50% or more but less than 70% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 50% or more but less than 70% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Major
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.

Cause:

This alert is triggered when more than 50% (but less than 70%) of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate missing subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit subscription lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review recent deployments or changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.102 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MINOR_PERCENT

Table 8-235 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MINOR_PERCENT

Field Details
Description This alert is triggered when 30% or more but less than 50% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 30% or more but less than 50% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.

Cause:

This alert is triggered when more than 30% (but less than 50%) of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate missing subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit subscription lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review recent deployments or changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.103 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-236 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (indicates the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no AM user data was returned from UDR as part of the Immediate Reporting capability)

    • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without AM user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, ask the UDR operators to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.
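The suppFeat check in recovery step 1 can be sketched as follows. It assumes ImmReportPcc corresponds to the bit mask 0x40000000, matching the example value 40000000 used in this section; confirm the exact feature-to-bit mapping against the applicable 3GPP supported-features definition.

```python
# Sketch: check whether the ImmReportPcc bit is set in a suppFeat hex
# string. The mask 0x40000000 is taken from the example value in this
# section; treat the exact bit position as an assumption to verify.

IMM_REPORT_PCC_MASK = 0x40000000  # assumed bit for ImmReportPcc

def imm_report_pcc_negotiated(supp_feat_hex):
    """True if the ImmReportPcc bit is set in the suppFeat bitmask."""
    return bool(int(supp_feat_hex, 16) & IMM_REPORT_PCC_MASK)

print(imm_report_pcc_negotiated("40000000"))  # True
print(imm_report_pcc_negotiated("00000000"))  # False
```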

For any additional guidance, contact My Oracle Support.

8.1.2.104 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-237 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no AM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without user data for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and still no AM user data is retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.105 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-238 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no AM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without user data for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.106 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-239 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.
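As a minimal sketch of recovery steps 1 and 2, the two attributes can be set together when building the UDR POST subscription request. Only immRep and suppFeat are named in this section; the remaining payload fields and their names are illustrative placeholders, not the actual Nudr schema:

```python
# Sketch: assemble the two attributes the diagnostics above ask you to
# verify in the UDR POST subscription request. "immRep" and "suppFeat"
# come from this section; everything else is a hypothetical placeholder.
import json

def build_subscription_payload(notification_uri):
    return {
        "notificationUri": notification_uri,  # hypothetical field name
        "immRep": True,                       # request immediate reporting
        "suppFeat": "40000000",               # ImmReportPcc bit set
    }

payload = build_subscription_payload("http://pcf.example.com/notify")
print(json.dumps(payload, indent=2))
```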

For any additional guidance, contact My Oracle Support.

8.1.2.107 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-240 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.108 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-241 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "am-data" (indicates the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • immediate_report_pcc = "false" (indicates that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.109 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-242 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (indicates the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.
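Assuming each response can be reduced to the three label values above, the dimension check that pegs the alarming time series can be sketched as:

```python
# Sketch: decide whether a UDR response carries the label combination
# that feeds this alarm (label values taken from the bullets above;
# the flat-argument representation of a response is illustrative).

def matches_missing_ue_data(service_subresource, operation_type,
                            imm_reports_present):
    return (service_subresource == "ue-policy-set"
            and operation_type.lower() == "post"
            and not imm_reports_present)

print(matches_missing_ue_data("ue-policy-set", "POST", False))  # True
```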

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.110 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-243 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.111 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-244 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.
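As noted in the introduction to this chapter, alerts are configured in the Alertrules.yaml file. A sketch of how this table's expression, severity, and summary could be assembled into a Prometheus alerting rule; the group name and the `for` duration are illustrative assumptions, not values taken from this guide:

```yaml
groups:
  - name: policy-udr-immrep          # illustrative group name
    rules:
      - alert: UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL
        expr: |
          (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m])))
          / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
        for: 5m                      # illustrative hold duration
        labels:
          severity: critical
        annotations:
          summary: >-
            30% or more of the traffic: UDR returned a POST subscribe
            response without UE user data as part of immediate reporting
```

The deployed Alertrules.yaml shipped with Policy remains the authoritative definition; this fragment only illustrates the mapping from the table fields to rule syntax.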

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.112 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-245 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.113 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-246 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.114 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-247 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the request payload for the UDR POST.

  3. Verify UDR Profile

    • Ensure that User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.115 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST_EGW

Table 8-248 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST_EGW

Field Details
Description More than 1% of Egress Gateway traffic is being rejected because of rate limiting.
Summary More than 1% of Egress Gateway traffic is being rejected because of rate limiting.
Severity Major
Expression (sum(rate(oc_egressgateway_http_request_ratelimit_values_total {allowed="false",app_kubernetes_io_name="egress-gateway",namespace="$NAMESPACE"}[2m]) or (up * 0 ) ) )/sum(rate(oc_egressgateway_http_request_ratelimit_values_total {app_kubernetes_io_name="egress-gateway",namespace="$NAMESPACE"}[2m])) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.114
Metric Used oc_egressgateway_http_request_ratelimit_values_total
Recommended Actions

The alert is cleared when the failure rate goes below 1% of the total TPS.
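The threshold arithmetic behind this expression is a simple ratio of rejected to total request rates. A minimal Python sketch; the sample rates are hypothetical, not taken from a live system:

```python
def rejection_percent(rejected_rate: float, total_rate: float) -> float:
    """Percentage of Egress Gateway requests rejected by rate limiting."""
    if total_rate == 0:
        return 0.0  # no traffic in the window, nothing to alert on
    return rejected_rate / total_rate * 100

# Hypothetical per-second rates over the 2-minute window.
percent = rejection_percent(rejected_rate=1.5, total_rate=100.0)
print(percent)                              # 1.5
print(percent >= 1)                         # True: the alert fires
print(rejection_percent(0.5, 100.0) >= 1)   # False: the alert clears
```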

For any additional guidance, contact My Oracle Support.

8.1.2.116 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-249 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 60% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false" but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.
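The first recommended action, checking which error codes are being returned, amounts to grouping responses by code. A minimal sketch, assuming the responseCode label values have already been exported from Prometheus or logs; the sample data is hypothetical:

```python
from collections import Counter

def top_error_codes(response_codes: list[str]) -> list[tuple[str, int]]:
    """Tally non-2xx response codes, most frequent first."""
    errors = [code for code in response_codes if not code.startswith("2")]
    return Counter(errors).most_common()

# Hypothetical sample of responseCode label values.
sample = ["201", "403", "403", "403", "201", "400", "403"]
print(top_error_codes(sample))  # [('403', 4), ('400', 1)]
```

A dominant 403 count here points at the Requested Service Not Authorized condition this alert describes, while a mix of other 4xx codes suggests a different caller-side problem.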

8.1.2.117 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-250 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 40% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 40 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false" but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.

8.1.2.118 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-251 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 20% in a given time period.
Severity Minor
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 20 < 40
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false", but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.

8.1.2.119 AF_MANDATORY_IE_MISSING_SC_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-252 AF_MANDATORY_IE_MISSING_SC_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 60% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.
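Mandatory-IE checks like the one in recovery step 3 can be automated before redeploying a client. A minimal sketch, assuming the request payload is available as a JSON object and using the IEs this section names (sponId, aspId, afAppId) as the required set:

```python
REQUIRED_IES = ("sponId", "aspId", "afAppId")

def missing_mandatory_ies(payload: dict) -> list[str]:
    """Return the mandatory IEs absent or empty in a Sponsored Connectivity payload."""
    return [ie for ie in REQUIRED_IES
            if ie not in payload or payload[ie] in (None, "")]

print(missing_mandatory_ies(
    {"sponId": "sp1", "aspId": "asp1", "afAppId": "app1"}))  # []
print(missing_mandatory_ies({"sponId": "sp1"}))              # ['aspId', 'afAppId']
```

An empty result means the payload passes this check; a non-empty result lists exactly the IEs to restore before resending, mirroring the 400 MANDATORY_IE_MISSING cause.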

8.1.2.120 AF_MANDATORY_IE_MISSING_SC_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-253 AF_MANDATORY_IE_MISSING_SC_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 40% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 40 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.

8.1.2.121 AF_MANDATORY_IE_MISSING_SC_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-254 AF_MANDATORY_IE_MISSING_SC_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 20% in a given time period.
Severity Minor
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 20 < 40
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.

8.1.3 PCRF Alerts

This section provides information about PCRF alerts.

8.1.3.1 PRE_UNREACHABLE_EXCEEDS_CRITICAL_THRESHOLD

Table 8-255 PRE_UNREACHABLE_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description PRE fail count exceeds the critical threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition PRE fail count exceeds the critical threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.2 PRE_UNREACHABLE_EXCEEDS_MAJOR_THRESHOLD

Table 8-256 PRE_UNREACHABLE_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description PRE fail count exceeds the major threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition PRE fail count exceeds the major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.3 PRE_UNREACHABLE_EXCEEDS_MINOR_THRESHOLD

Table 8-257 PRE_UNREACHABLE_EXCEEDS_MINOR_THRESHOLD

Field Details
Description PRE fail count exceeds the minor threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition PRE fail count exceeds the minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.4 PCRF_DOWN

Table 8-258 PCRF_DOWN

Field Details
Description PCRF Service is down
Summary Alert PCRF_DOWN NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition None of the pods of the PCRF service are available.
OID 1.3.6.1.4.1.323.5.3.44.1.2.33
Metric Used appinfo_service_running{service=~".*pcrf-core"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.5 CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-259 CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description CCA fail count exceeds the critical threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of CCA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.6 CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-260 CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description CCA fail count exceeds the major threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of CCA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.7 CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-261 CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description CCA fail count exceeds the minor threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of CCA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.8 AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-262 AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description AAA fail count exceeds the critical threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of AAA messages has exceeded the critical threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.9 AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-263 AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description AAA fail count exceeds the major threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of AAA messages has exceeded the major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.10 AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-264 AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description AAA fail count exceeds the minor threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of AAA messages has exceeded the minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.11 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-265 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the critical threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of RAA Rx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.12 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-266 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the major threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of RAA Rx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.13 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-267 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the minor threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of RAA Rx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.14 RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-268 RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the critical threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of RAA Gx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.15 RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-269 RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the major threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of RAA Gx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.16 RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-270 RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the minor threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of RAA Gx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.17 ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-271 ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA fail count exceeds the critical threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of ASA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.18 ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-272 ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA fail count exceeds the major threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of ASA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.19 ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-273 ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA fail count exceeds the minor threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of ASA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.20 STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-274 STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description STA fail count exceeds the critical threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of STA messages has exceeded the configured critical threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.21 STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-275 STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description STA fail count exceeds the major threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of STA messages has exceeded the configured major threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.22 STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-276 STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description STA fail count exceeds the minor threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of STA messages has exceeded the configured minor threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 60
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
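The STA threshold expressions above can be expressed as a Prometheus alerting rule in the Alertrules.yaml file. The following is an illustrative sketch only: the expression and alert name are taken from the table above, but the group name, `for` duration, and annotation text are assumptions and not taken from the product rule file.

```yaml
# Illustrative sketch of an Alertrules.yaml entry.
# The alert name and expr follow the STA entries above; the group name,
# "for" duration, and annotation text are assumptions.
groups:
  - name: sta-fail-threshold-alerts
    rules:
      - alert: STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD
        expr: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'STA failure rate above the critical threshold in {{ $labels.kubernetes_namespace }}'
```

The major and minor variants differ only in the comparison value (> 80 and > 60) and the severity label.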
8.1.3.23 ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-277 ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA timeout count exceeds the critical threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of ASA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.24 ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-278 ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA timeout count exceeds the major threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of ASA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.25 ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-279 ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA timeout count exceeds the minor threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of ASA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.26 RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-280 RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the critical threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of RAA Gx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.27 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-281 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the major threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of RAA Gx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.28 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-282 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the minor threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of RAA Gx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.29 RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-283 RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the critical threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of RAA Rx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.30 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-284 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the major threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of RAA Rx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.31 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-285 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the minor threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of RAA Rx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.32 RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-286 RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 10 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.33 RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-287 RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 5 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.34 RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-288 RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 1 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
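For the combined response error-rate alerts above, the document lists only the metric and the threshold percentage. Modeled on the STA rule expressions earlier in this section, the condition could plausibly be expressed as follows; the rate window and exact expression are assumptions, not taken from the product rule file.

```yaml
# Sketch of a possible Alertrules.yaml rule body; the expression is an
# assumption modeled on the STA entries (5m rate window assumed).
- alert: RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT
  expr: sum(rate(occnp_diam_response_local_total{responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total[5m])) * 100 > 10
  labels:
    severity: critical
```

The major and minor variants would compare against 5 and 1 percent respectively.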
8.1.3.35 Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-289 Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description Rx error rate is above 10 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of Rx responses is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.36 Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-290 Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description Rx error rate is above 5 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of Rx responses is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.37 Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-291 Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description Rx error rate is above 1 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of Rx responses is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.38 Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-292 Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description Gx error rate is above 10 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of Gx responses is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.39 Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-293 Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description Gx error rate is above 5 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of Gx responses is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.40 Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-294 Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description Gx error rate is above 1 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of Gx responses is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.41 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Table 8-295 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 30%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 30%
Severity Critical
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.42 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Table 8-296 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 20%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 20%
Severity Major
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.43 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Table 8-297 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 10%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 10%
Severity Minor
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
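The stale Diameter request cleanup alerts above include their full PromQL condition, so the corresponding Alertrules.yaml rule can be sketched directly. In the following sketch, the expression is the Condition shown in Table 8-295 verbatim; the group name, `for` duration, and annotation text are illustrative assumptions.

```yaml
# Illustrative sketch; expr is taken verbatim from the Condition field,
# while the group name, "for" duration, and annotation are assumptions.
groups:
  - name: stale-diameter-request-alerts
    rules:
      - alert: STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL
        expr: (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Diameter request discard rate due to timeout processing is above 30% on {{ $labels.pod }}'
```

The major and minor variants differ only in the comparison value (>= 20 and >= 10) and the severity label.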