8 Alerts

This section provides information on Policy alerts and their configuration.

Note:

The performance and capacity of the system can vary based on the call model and the configuration, including, but not limited to, the deployed policies and their corresponding data, for example, policy tables.

You can configure alerts in Prometheus using the Alertrules.yaml file.

The following table describes the various severity types of alerts generated by Policy:

Table 8-1 Alerts Levels or Severity Types

Alerts Levels / Severity Types Definition
Critical Indicates a severe issue that poses a significant risk to safety, security, or operational integrity. It requires an immediate response to address the situation and prevent serious consequences. Raised for conditions that can affect the service of Policy.
Major Indicates a more significant issue that has an impact on operations or poses a moderate risk. It requires prompt attention and action to mitigate potential escalation. Raised for conditions that can affect the service of Policy.
Minor Indicates a situation that is low in severity and does not pose an immediate risk to safety, security, or operations. It requires attention but does not demand urgent action. Raised for conditions that can affect the service of Policy.
Info or Warn (Informational) Provides general information or updates that are not related to immediate risks or actions. These alerts are for awareness and do not typically require any specific response. WARN and INFO alerts may not impact the service of Policy.

For details on how to configure Policy alerts, see Configuring Alerts section in Oracle Communications Cloud Native Core, Converged Policy Installation, Upgrade, and Fault Recovery Guide.

For details on how to configure SNMP Notifier, see Configuring SNMP Notifier section in Oracle Communications Cloud Native Core, Converged Policy Installation, Upgrade, and Fault Recovery Guide.
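Each alert described in this chapter is defined as a Prometheus alerting rule in the Alertrules.yaml file. The following is a minimal sketch of what such a rule can look like, using the POD_CONGESTION_L1 values from this chapter and the standard Prometheus rule format; the grouping, labels, and annotations in the shipped Alertrules.yaml file may differ:

```yaml
groups:
  - name: policy-alerts
    rules:
      - alert: PodCongestionL1
        # Expression, severity, and summary as listed in Table 8-2.
        expr: occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
        labels:
          severity: critical
        annotations:
          summary: 'Alert when cpu of pod is in CONGESTION_L1 state.'
```

When Prometheus evaluates the rule and the expression returns a result, the alert fires with the listed severity and can be forwarded through the SNMP Notifier using the OID from the corresponding alert table.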

8.1 List of Alerts

This section provides detailed information about the alert rules defined for Policy. It consists of the following three types of alerts:
  1. Common Alerts - This category of alerts is common and required for all three modes of deployment.
  2. PCF Alerts - This category of alerts is specific to PCF microservices and required for Converged and PCF only modes of deployment.
  3. PCRF Alerts - This category of alerts is specific to PCRF microservices and required for Converged and PCRF only modes of deployment.

8.1.1 Common Alerts

This section provides information about alerts that are common for PCF and PCRF.

8.1.1.1 POD_CONGESTION_L1

Table 8-2 POD_CONGESTION_L1

Field Details
Name in Alert Yaml File PodCongestionL1
Description Alert when cpu of pod is in CONGESTION_L1 state.
Summary Alert when cpu of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.71
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.2 POD_CONGESTION_L2

Table 8-3 POD_CONGESTION_L2

Field Details
Name in Alert Yaml File PodCongestionL2
Description Alert when cpu of pod is in CONGESTION_L2 state.
Summary Alert when cpu of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.72
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.3 POD_PENDING_REQUEST_CONGESTION_L1

Table 8-4 POD_PENDING_REQUEST_CONGESTION_L1

Field Details
Name in Alert Yaml File PodPendingRequestCongestionL1
Description Alert when queue of pod is in CONGESTION_L1 state.
Summary Alert when queue of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.73
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.4 POD_PENDING_REQUEST_CONGESTION_L2

Table 8-5 POD_PENDING_REQUEST_CONGESTION_L2

Field Details
Name in Alert Yaml File PodPendingRequestCongestionL2
Description Alert when queue of pod is in CONGESTION_L2 state.
Summary Alert when queue of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.74
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.5 POD_CPU_CONGESTION_L1

Table 8-6 POD_CPU_CONGESTION_L1

Field Details
Name in Alert Yaml File PodCPUCongestionL1
Description Alert when cpu of pod is in CONGESTION_L1 state.
Summary Alert when cpu of pod is in CONGESTION_L1 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu",container!~"bulwark|diam-gateway"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.73
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.6 POD_CPU_CONGESTION_L2

Table 8-7 POD_CPU_CONGESTION_L2

Field Details
Name in Alert Yaml File PodCPUCongestionL2
Description Alert when cpu of pod is in CONGESTION_L2 state.
Summary Alert when cpu of pod is in CONGESTION_L2 state.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 3
OID 1.3.6.1.4.1.323.5.3.52.1.2.74
Metric Used occnp_pod_resource_congestion_state
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.7 Pod_Memory_DoC

Table 8-8 Pod_Memory_DoC

Field Details
Description Pod Resource Congestion status of {{$labels.service}} service is DoC for Memory type
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for Memory type
Severity Major
Expression occnp_pod_resource_congestion_state{type="memory"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.31
Metric Used occnp_pod_resource_congestion_state
Recommended Actions
The alert triggers based on the resource limit usage and the load shedding configurations in congestion control. The CPU, memory, and queue usage can be viewed on the Grafana dashboard.

Note:

Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.8 Pod_Memory_Congested

Table 8-9 Pod_Memory_Congested

Field Details
Description Pod Resource Congestion status of {{$labels.service}} service is congested for Memory type
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for Memory type
Severity Critical
Expression occnp_pod_resource_congestion_state{type="memory"} == 2
OID 1.3.6.1.4.1.323.5.3.52.1.2.32
Metric Used occnp_pod_resource_congestion_state
Recommended Actions

The alert triggers based on the resource limit usage and the load shedding configurations in congestion control. The CPU, memory, and queue usage can be viewed on the Grafana dashboard.

For any additional guidance, contact My Oracle Support.

8.1.1.9 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-10 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the critical threshold limit.
Summary RAA Rx fail count exceeds the critical threshold limit.
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.10 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-11 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the major threshold limit.
Summary RAA Rx fail count exceeds the major threshold limit.
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="RAA", appId="16777236"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.11 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-12 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the minor threshold limit.
Summary RAA Rx fail count exceeds the minor threshold limit.
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.35
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
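The three RAA Rx failure alerts above form mutually exclusive bands over the same failure percentage (responses whose Diameter result code does not start with 2, as a share of all RAA responses): Minor covers above 60% up to 80%, Major covers above 80% up to 90%, and Critical covers anything above 90%, so at most one of the three alerts is active at a time. As a sketch of the Major band in the standard Prometheus rule format (the shipped Alertrules.yaml file may group and label the rules differently):

```yaml
# Failure percentage = 100 * failed RAA responses / all RAA responses
# over a 5-minute window, on the Rx interface (appId 16777236).
- alert: RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD
  expr: >
    sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m]))
    / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 > 80
    and
    sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA",responseCode!~"2.*"}[5m]))
    / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="RAA"}[5m])) * 100 <= 90
  labels:
    severity: major
```

The upper bound (`<= 90`) is what clears the Major alert when the failure percentage escalates into the Critical band.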
8.1.1.12 ASA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-13 ASA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the critical threshold limit.
Summary ASA Rx fail count exceeds the critical threshold limit.
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.13 ASA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-14 ASA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the major threshold limit.
Summary ASA Rx fail count exceeds the major threshold limit.
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.14 ASA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-15 ASA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA Rx fail count exceeds the minor threshold limit.
Summary ASA Rx fail count exceeds the minor threshold limit.
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.66
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.15 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-16 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the minor threshold limit
Summary ASA Rx timeout count exceeds the minor threshold limit
Severity MINOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.16 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-17 ASA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the major threshold limit
Summary ASA Rx timeout count exceeds the major threshold limit
Severity MAJOR
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 <= 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.17 ASA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-18 ASA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA Rx timeout count exceeds the critical threshold limit
Summary ASA Rx timeout count exceeds the critical threshold limit
Severity CRITICAL
Expression sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA",responseCode="timeout"}[5m])) / sum(rate(occnp_diam_response_local_total{appId="16777236",msgType="ASA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.67
Metric Used occnp_diam_response_local_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.18 SCP_PEER_UNAVAILABLE

Table 8-19 SCP_PEER_UNAVAILABLE

Field Details
Description Configured SCP peer is unavailable.
Summary Configured SCP peer is unavailable.
Severity Major
Expression occnp_oc_egressgateway_peer_health_status != 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.60
Metric Used occnp_oc_egressgateway_peer_health_status
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.19 SCP_PEER_SET_UNAVAILABLE

Table 8-20 SCP_PEER_SET_UNAVAILABLE

Field Details
Description None of the SCP peers are available for the configured peer set.
Summary {{ $value }} SCP peers under peer set {{$labels.peerset}} are currently unavailable.
Severity Critical
Expression (occnp_oc_egressgateway_peer_count > 0 and (occnp_oc_egressgateway_peer_available_count) == 0)
OID 1.3.6.1.4.1.323.5.3.52.1.2.61
Metric Used occnp_oc_egressgateway_peer_count and occnp_oc_egressgateway_peer_available_count
Recommended Actions

The NF clears the critical alarm when at least one SCP peer in a peer set becomes available, even if all other SCP peers in the given peer set remain unavailable.

For any additional guidance, contact My Oracle Support.
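The SCP_PEER_SET_UNAVAILABLE expression combines two gauges: it fires only when peers are configured for a peer set (`occnp_oc_egressgateway_peer_count > 0`) but none of them are currently available (`occnp_oc_egressgateway_peer_available_count == 0`). A sketch in the standard Prometheus rule format, using the expression and summary from Table 8-20 (grouping and labels in the shipped file may differ):

```yaml
- alert: SCP_PEER_SET_UNAVAILABLE
  # Fires per peer set: peers are configured but none are reachable.
  expr: (occnp_oc_egressgateway_peer_count > 0 and (occnp_oc_egressgateway_peer_available_count) == 0)
  labels:
    severity: critical
  annotations:
    summary: '{{ $value }} SCP peers under peer set {{ $labels.peerset }} are currently unavailable.'
```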
8.1.1.20 STALE_CONFIGURATION

Table 8-21 STALE_CONFIGURATION

Field Details
Description In the last 10 minutes, the current service config_level does not match the config_level from the config-server.
Summary In the last 10 minutes, the current service config_level does not match the config_level from the config-server.
Severity Major
Expression (sum by(namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"})) / (count by(namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"})) != (sum by(namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"})) / (count by(namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
OID 1.3.6.1.4.1.323.5.3.52.1.2.62
Metric Used topic_version
Recommended Actions For any additional guidance, contact My Oracle Support.
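The STALE_CONFIGURATION expression compares the average `config.level` topic version reported by config-server pods against the average reported by all other services; any divergence sustained over the evaluation period indicates that at least one service is running with stale configuration. The same expression, reformatted for readability as a sketch in the standard Prometheus rule format:

```yaml
- alert: STALE_CONFIGURATION
  # Average config.level topic version on config-server pods
  # compared with the average on all other services, per namespace.
  expr: >
    (sum by (namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"}))
    / (count by (namespace) (topic_version{app_kubernetes_io_name="config-server",topicName="config.level"}))
    !=
    (sum by (namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
    / (count by (namespace) (topic_version{app_kubernetes_io_name!="config-server",topicName="config.level"}))
  labels:
    severity: major
```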
8.1.1.21 POLICY_SERVICES_DOWN

Table 8-22 POLICY_SERVICES_DOWN

Field Details
Name in Alert Yaml File PCF_SERVICES_DOWN
Description {{$labels.service}} service is not running.
Summary {{$labels.service}} service is not running.
Severity Critical
Expression appinfo_service_running{vendor="Oracle", application="occnp", category!=""} != 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.1
Metric Used appinfo_service_running
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.22 DIAM_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-23 DIAM_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File DiamTrafficRateAboveThreshold
Description Diameter Connector Ingress traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second.
Severity Major
Expression The total Ingress traffic rate for Diameter connector has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in Common_Alertrules.yaml file is when Diameter Connector Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.6
Metric Used ocpm_ingress_request_total
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.
For any additional guidance, contact My Oracle Support.
8.1.1.23 DIAM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-24 DIAM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File DiamIngressErrorRateAbove10Percent
Description Transaction Error Rate detected above 10 Percent of Total on Diameter Connector (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions.
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions on Diameter Connector.
OID 1.3.6.1.4.1.323.5.3.36.1.2.7
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_ingress_response_total{servicename_3gpp="rx",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
For any additional guidance, contact My Oracle Support.
8.1.1.24 DIAM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-25 DIAM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File DiamEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total on Diameter Connector (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total Egress Gateway transactions on Diameter Connector.
OID 1.3.6.1.4.1.323.5.3.36.1.2.8
Metric Used ocpm_egress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the errors. For instance: ocpm_egress_response_total{servicename_3gpp="rx",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
For any additional guidance, contact My Oracle Support.
8.1.1.25 UDR_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-26 UDR_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PcfUdrIngressTrafficRateAboveThreshold
Description User service Ingress traffic Rate from UDR is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total User Service Ingress traffic rate from UDR has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in Common_Alertrules.yaml file is when user service Ingress Rate from UDR crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.9
Metric Used ocpm_userservice_inbound_count_total{service_resource="udr-service"}
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.26 UDR_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-27 UDR_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PcfUdrEgressErrorRateAbove10Percent
Description Egress Transaction Error Rate detected above 10 Percent of Total on User service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions from UDR is more than 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.10
Metric Used ocpm_udr_tracking_response_total{servicename_3gpp="nudr-dr",response_code!~"2.*"}
Recommended Actions The alert gets cleared when the number of failed transactions falls below the configured threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for failed transactions. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Egress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Egress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.27 POLICYDS_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-28 POLICYDS_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PolicyDsIngressTrafficRateAboveThreshold
Description Ingress Traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Critical
Expression The total PolicyDS Ingress message rate has crossed the configured threshold of 900 TPS (90% of the maximum Ingress request rate).

Default value of this alert trigger point in Common_Alertrules.yaml file is when PolicyDS Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.13
Metric Used client_request_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the Common_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine any increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

For any additional guidance, contact My Oracle Support.

8.1.1.28 POLICYDS_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-29 POLICYDS_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PolicyDsIngressErrorRateAbove10Percent
Description Ingress Transaction Error Rate detected above 10 Percent of Total on PolicyDS service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions for PolicyDS service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.14
Metric Used client_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: client_response_total{response!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.29 POLICYDS_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-30 POLICYDS_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File PolicyDsEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total on PolicyDS service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total transactions for PolicyDS service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.15
Metric Used server_response_total
Recommended Actions The alert gets cleared when the number of failed transactions falls below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: server_response_total{response!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.30 UDR_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Table 8-31 UDR_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File PcfUdrIngressTimeoutErrorAboveMajorThreshold
Description Ingress Timeout Error Rate detected above 10 Percent of Total towards UDR service (current value is: {{ $value }})
Summary Timeout Error Rate detected above 10 Percent of Total Transactions
Severity Major
Expression The number of failed transactions due to timeout is above 10 percent of the total transactions for UDR service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.16
Metric Used ocpm_udr_tracking_request_timeout_total{servicename_3gpp="nudr-dr"}
Recommended Actions The alert gets cleared when the number of failed transactions due to timeout falls below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_udr_tracking_request_timeout_total{servicename_3gpp="nudr-dr"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

For any additional guidance, contact My Oracle Support.

8.1.1.31 DB_TIER_DOWN_ALERT

Table 8-32 DB_TIER_DOWN_ALERT

Field Details
Name in Alert Yaml File DBTierDownAlert
Description DB is not reachable.
Summary DB is not reachable.
Severity Critical
Expression Database is not available.
OID 1.3.6.1.4.1.323.5.3.36.1.2.18
Metric Used appinfo_category_running{category="database"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.32 CPU_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Table 8-33 CPU_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveMinorThreshold
Description CPU usage for {{$labels.service}} service is above 60
Summary CPU usage for {{$labels.service}} service is above 60
Severity Minor
Expression A service pod has reached the configured minor threshold (60%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.19
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the minor threshold or crosses the major threshold, in which case CPUUsagePerServiceAboveMajorThreshold alert shall be raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.33 CPU_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Table 8-34 CPU_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveMajorThreshold
Description CPU usage for {{$labels.service}} service is above 80
Summary CPU usage for {{$labels.service}} service is above 80
Severity Major
Expression A service pod has reached the configured major threshold (80%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.20
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the major threshold or crosses the critical threshold, in which case CPUUsagePerServiceAboveCriticalThreshold alert shall be raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.34 CPU_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Table 8-35 CPU_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File CPUUsagePerServiceAboveCriticalThreshold
Description CPU usage for {{$labels.service}} service is above 90
Summary CPU usage for {{$labels.service}} service is above 90
Severity Critical
Expression A service pod has reached the configured critical threshold (90%) of its CPU usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.21
Metric Used container_cpu_usage_seconds_total

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the CPU utilization falls below the critical threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.
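The three CPU usage alerts above form escalating tiers over the same measurement: Minor fires between 60% and 80% of the CPU limit, Major between 80% and 90%, and Critical above 90%, with each alert clearing when usage leaves its band. A sketch of how such tiered rules can be expressed; the recording rule `service_cpu_usage_percent` and its expression are hypothetical illustrations built on `container_cpu_usage_seconds_total` (the guide lists only the metric), and the actual expressions in the PCF_Alertrules.yaml file may differ:

```yaml
groups:
  - name: cpu-usage-tiers
    rules:
      # Hypothetical recording rule: per-service CPU usage as a percentage
      # of the configured CPU limit (limit metric from kube-state-metrics;
      # adjust labels to match your deployment).
      - record: service_cpu_usage_percent
        expr: >
          100 * sum by (service) (rate(container_cpu_usage_seconds_total[5m]))
          / sum by (service) (kube_pod_container_resource_limits{resource="cpu"})
      - alert: CPUUsagePerServiceAboveMinorThreshold
        expr: service_cpu_usage_percent > 60 and service_cpu_usage_percent <= 80
        labels:
          severity: minor
      - alert: CPUUsagePerServiceAboveMajorThreshold
        expr: service_cpu_usage_percent > 80 and service_cpu_usage_percent <= 90
        labels:
          severity: major
      - alert: CPUUsagePerServiceAboveCriticalThreshold
        expr: service_cpu_usage_percent > 90
        labels:
          severity: critical
```

The banded upper bounds are what let a lower-tier alert clear automatically when usage escalates into the next tier.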

8.1.1.35 MEMORY_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Table 8-36 MEMORY_USAGE_PER_SERVICE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveMinorThreshold
Description Memory usage for {{$labels.service}} service is above 60
Summary Memory usage for {{$labels.service}} service is above 60
Severity Minor
Expression A service pod has reached the configured minor threshold (60%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.22
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If the metric is not available, use similar metrics exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the minor threshold or crosses the major threshold, in which case the MemoryUsagePerServiceAboveMajorThreshold alert is raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.36 MEMORY_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Table 8-37 MEMORY_USAGE_PER_SERVICE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveMajorThreshold
Description Memory usage for {{$labels.service}} service is above 80%
Summary Memory usage for {{$labels.service}} service is above 80%
Severity Major
Expression A service pod has reached the configured major threshold (80%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.23
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the major threshold or crosses the critical threshold, in which case the MemoryUsagePerServiceAboveCriticalThreshold alert is raised.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.

8.1.1.37 MEMORY_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Table 8-38 MEMORY_USAGE_PER_SERVICE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File MemoryUsagePerServiceAboveCriticalThreshold
Description Memory usage for {{$labels.service}} service is above 90%
Summary Memory usage for {{$labels.service}} service is above 90%
Severity Critical
Expression A service pod has reached the configured critical threshold (90%) of its memory usage limits.
OID 1.3.6.1.4.1.323.5.3.36.1.2.24
Metric Used container_memory_usage_bytes

Note: This is a Kubernetes metric used for instance availability monitoring. If this metric is not available, use a similar metric exposed by the monitoring system.

Recommended Actions The alert gets cleared when the memory utilization falls below the critical threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

For any additional guidance, contact My Oracle Support.
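The three memory alerts above differ only in threshold and severity. A hedged sketch of the critical tier, assuming the same limit-ratio shape as the CPU alerts (the minor and major tiers would use `> 60 and <= 80` and `> 80 and <= 90` respectively); verify the exact expression against the shipped PCF_Alertrules.yaml:

```yaml
- alert: MemoryUsagePerServiceAboveCriticalThreshold
  # Container memory usage vs. the pod memory limit, per service (sketch).
  expr: |
    sum by (service) (container_memory_usage_bytes)
      / sum by (service) (kube_pod_container_resource_limits{resource="memory"})
      * 100 > 90
  labels:
    severity: critical
  annotations:
    summary: 'Memory usage for {{ $labels.service }} service is above 90%'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.24'
```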

8.1.1.38 POD_CONGESTED

Table 8-39 POD_CONGESTED

Field Details
Name in Alert Yaml File PodCongested
Description The pod congestion status is set to congested.
Summary Pod Congestion status of {{$labels.service}} service is congested
Severity Critical
Expression occnp_pod_congestion_state == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.26
Metric Used occnp_pod_congestion_state
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.
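Since the table gives the exact expression, the corresponding rule is straightforward; only the annotation wording and the absence of a `for` duration are assumptions here:

```yaml
- alert: PodCongested
  expr: occnp_pod_congestion_state == 4   # 4 = congested
  labels:
    severity: critical
  annotations:
    summary: 'Pod Congestion status of {{ $labels.service }} service is congested'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.26'
```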

8.1.1.39 POD_DANGER_OF_CONGESTION

Table 8-40 POD_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to Danger of Congestion.
Summary Pod Congestion status of {{$labels.service}} service is DoC
Severity Major
Expression occnp_pod_congestion_state == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.25
Metric Used occnp_pod_congestion_state
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

8.1.1.40 POD_PENDING_REQUEST_CONGESTED

Table 8-41 POD_PENDING_REQUEST_CONGESTED

Field Details
Name in Alert Yaml File PodPendingRequestCongested
Description The pod congestion status is set to congested for PendingRequest.
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for PendingRequest type.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="queue"} == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.28
Metric Used occnp_pod_resource_congestion_state{type="queue"}
Recommended Actions The alert gets cleared when the number of pending requests in the queue falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.41 POD_PENDING_REQUEST_DANGER_OF_CONGESTION

Table 8-42 POD_PENDING_REQUEST_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to DoC for pending requests.
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for PendingRequest type.
Severity Major
Expression occnp_pod_resource_congestion_state{type="queue"} == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.27
Metric Used occnp_pod_resource_congestion_state{type="queue"}
Recommended Actions The alert gets cleared when the number of pending requests in the queue falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.42 POD_CPU_CONGESTED

Table 8-43 POD_CPU_CONGESTED

Field Details
Name in Alert Yaml File PodCPUCongested
Description The pod congestion status is set to congested for CPU.
Summary Pod Resource Congestion status of {{$labels.service}} service is congested for CPU type.
Severity Critical
Expression occnp_pod_resource_congestion_state{type="cpu"} == 4
OID 1.3.6.1.4.1.323.5.3.36.1.2.30
Metric Used occnp_pod_resource_congestion_state{type="cpu"}
Recommended Actions The alert gets cleared when the system CPU usage falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.43 POD_CPU_DANGER_OF_CONGESTION

Table 8-44 POD_CPU_DANGER_OF_CONGESTION

Field Details
Description The pod congestion status is set to DoC for CPU.
Summary Pod Resource Congestion status of {{$labels.service}} service is DoC for CPU type.
Severity Major
Expression occnp_pod_resource_congestion_state{type="cpu"} == 1
OID 1.3.6.1.4.1.323.5.3.36.1.2.29
Metric Used occnp_pod_resource_congestion_state{type="cpu"}
Recommended Actions The alert gets cleared when the system CPU usage falls below the configured threshold value.

For any additional guidance, contact My Oracle Support.

8.1.1.44 SERVICE_OVERLOADED

Table 8-45 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L1
Summary Overload Level of {{$labels.service}} service is L1
Severity Minor
Expression The overload level of the service is L1.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-46 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L2
Summary Overload Level of {{$labels.service}} service is L2
Severity Major
Expression The overload level of the service is L2.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-47 SERVICE_OVERLOADED

Field Details
Description Overload Level of {{$labels.service}} service is L3
Summary Overload Level of {{$labels.service}} service is L3
Severity Critical
Expression The overload level of the service is L3.
OID 1.3.6.1.4.1.323.5.3.36.1.2.40
Metric Used load_level
Recommended Actions The alert gets cleared when the system is back to normal state.

For any additional guidance, contact My Oracle Support.
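All three SERVICE_OVERLOADED tables share the load_level metric and OID; only the overload level and severity change. A sketch, assuming load_level encodes L1, L2, and L3 as the values 1, 2, and 3 (the encoding is an assumption to verify in the shipped rules file):

```yaml
# One rule per overload level; the numeric encoding of L1/L2/L3 is assumed.
- alert: ServiceOverloadedL1
  expr: load_level == 1
  labels: { severity: minor }
- alert: ServiceOverloadedL2
  expr: load_level == 2
  labels: { severity: major }
- alert: ServiceOverloadedL3
  expr: load_level == 3
  labels: { severity: critical }
```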

8.1.1.45 SERVICE_RESOURCE_OVERLOADED

Alerts when service is in overload state due to memory usage

Table 8-48 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-49 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-50 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type.
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to memory usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="memory"}
Recommended Actions The alert gets cleared when the memory usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to CPU usage

Table 8-51 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-52 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-53 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to CPU usage.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="cpu"}
Recommended Actions The alert gets cleared when the CPU usage of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to number of pending messages

Table 8-54 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type
Summary {{$labels.service}} service is L1 for {{$labels.type}} type
Severity Minor
Expression The overload level of the service is L1 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-55 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type
Summary {{$labels.service}} service is L2 for {{$labels.type}} type
Severity Major
Expression The overload level of the service is L2 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-56 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type
Summary {{$labels.service}} service is L3 for {{$labels.type}} type
Severity Critical
Expression The overload level of the service is L3 due to number of pending messages.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_pending_count"}
Recommended Actions The alert gets cleared when the number of pending messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Alerts when service is in overload state due to number of failed requests

Table 8-57 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L1 for {{$labels.type}} type.
Summary {{$labels.service}} service is L1 for {{$labels.type}} type.
Severity Minor
Expression The overload level of the service is L1 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-58 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L2 for {{$labels.type}} type.
Summary {{$labels.service}} service is L2 for {{$labels.type}} type.
Severity Major
Expression The overload level of the service is L2 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.

Table 8-59 SERVICE_RESOURCE_OVERLOADED

Field Details
Description {{$labels.service}} service is L3 for {{$labels.type}} type.
Summary {{$labels.service}} service is L3 for {{$labels.type}} type.
Severity Critical
Expression The overload level of the service is L3 due to number of failed requests.
OID 1.3.6.1.4.1.323.5.3.36.1.2.41
Metric Used service_resource_overload_level{type="svc_failure_count"}
Recommended Actions The alert gets cleared when the number of failed messages of the service is back to normal state.

For any additional guidance, contact My Oracle Support.
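The twelve SERVICE_RESOURCE_OVERLOADED tables all read the same metric and OID; the `type` label selects the triggering resource and the level selects the severity. A sketch of the critical tier, assuming L3 is encoded as the value 3:

```yaml
- alert: ServiceResourceOverloadedCritical
  # type is one of: memory, cpu, svc_pending_count, svc_failure_count.
  expr: service_resource_overload_level{type=~"memory|cpu|svc_pending_count|svc_failure_count"} == 3
  labels:
    severity: critical
  annotations:
    summary: '{{ $labels.service }} service is L3 for {{ $labels.type }} type'
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.41'
```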

8.1.1.46 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_CRITICAL_THRESHOLD

Table 8-60 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the critical threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the critical threshold limit for a given Subscriber Notification server
Severity Critical
Expression The number of error responses for a given subscriber notification server exceeds the critical threshold of 1000.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

Table 8-61 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the major threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the major threshold limit for a given Subscriber Notification server
Severity Major
Expression The number of error responses for a given subscriber notification server exceeds the major threshold value, that is, between 750 and 1000.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

Table 8-62 SUBSCRIBER_NOTIFICATION_ERROR_EXCEEDS_MINOR_THRESHOLD

Field Details
Description Notification Transaction Error exceeds the minor threshold limit for a given Subscriber Notification server
Summary Transaction Error exceeds the minor threshold limit for a given Subscriber Notification server
Severity Minor
Expression The number of error responses for a given subscriber notification server exceeds the minor threshold value, that is, between 500 and 750.
OID 1.3.6.1.4.1.323.5.3.36.1.2.42
Metric Used http_notification_response_total{responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.
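The three notification-error alerts count non-2xx responses against fixed thresholds (500, 750, 1000). A sketch of the critical tier; the counting window and the grouping label (notificationServer is hypothetical) are assumptions:

```yaml
- alert: SubscriberNotificationErrorExceedsCriticalThreshold
  # Non-2xx notification responses per server over an assumed 5m window.
  expr: |
    sum by (notificationServer) (
      increase(http_notification_response_total{responseCode!~"2.*"}[5m])
    ) > 1000
  labels:
    severity: critical
  annotations:
    oid: '1.3.6.1.4.1.323.5.3.36.1.2.42'
```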

8.1.1.47 SYSTEM_IMPAIRMENT_MAJOR

Table 8-63 SYSTEM_IMPAIRMENT_MAJOR

Field Details
Description Major impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 10 minutes.
Summary Major impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 10 minutes.
Severity Major
Expression (db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="standby"})) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="failed"})) or (avg_over_time(db_tier_binlog_used_bytes_percentage[5m])>= 80)
OID 1.3.6.1.4.1.323.5.3.52.1.2.43
Metric Used db_tier_replication_status and db_tier_binlog_used_bytes_percentage
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.48 SYSTEM_IMPAIRMENT_CRITICAL

Table 8-64 SYSTEM_IMPAIRMENT_CRITICAL

Field Details
Description Critical impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 30 minutes.
Summary Critical impairment alert raised when REPLICATION_FAILED, REPLICATION_CHANNEL_DOWN, or BINLOG_STORAGE usage is more than 80% for 30 minutes.
Severity Critical
Expression (db_tier_replication_status{role="failed"} == 0) or (db_tier_replication_status{role="active"} == 0) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="standby"})) or (count by (site_name) (db_tier_replication_status) == count by (site_name) (db_tier_replication_status{role="failed"})) or (avg_over_time(db_tier_binlog_used_bytes_percentage[5m])>= 80)
OID 1.3.6.1.4.1.323.5.3.52.1.2.43
Metric Used db_tier_replication_status and db_tier_binlog_used_bytes_percentage
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.49 SYSTEM_OPERATIONAL_STATE_PARTIAL_SHUTDOWN

Table 8-65 SYSTEM_OPERATIONAL_STATE_PARTIAL_SHUTDOWN

Field Details
Description System Operational State is now in partial shutdown state.
Summary System Operational State is now in partial shutdown state.
Severity Major
Expression system_operational_state == 2
OID 1.3.6.1.4.1.323.5.3.37.1.2.17
Metric Used system_operational_state
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.50 SYSTEM_OPERATIONAL_STATE_COMPLETE_SHUTDOWN

Table 8-66 SYSTEM_OPERATIONAL_STATE_COMPLETE_SHUTDOWN

Field Details
Description System Operational State is now in complete shutdown state
Summary System Operational State is now in complete shutdown state
Severity Critical
Expression system_operational_state == 3
OID 1.3.6.1.4.1.323.5.3.37.1.2.17
Metric Used system_operational_state
Recommended Actions

For any additional guidance, contact My Oracle Support.
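Both shutdown alerts key off the same gauge; only the compared value and severity differ. As a rule sketch, using the expressions from the tables:

```yaml
- alert: SystemOperationalStatePartialShutdown
  expr: system_operational_state == 2   # 2 = partial shutdown
  labels: { severity: major }
- alert: SystemOperationalStateCompleteShutdown
  expr: system_operational_state == 3   # 3 = complete shutdown
  labels: { severity: critical }
```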

8.1.1.51 TDF_CONNECTION_DOWN

Table 8-67 TDF_CONNECTION_DOWN

Field Details
Description TDF connection is down.
Summary TDF connection is down.
Severity Critical
Expression occnp_diam_conn_app_network{applicationName="Sd"} == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.48
Metric Used occnp_diam_conn_app_network
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.52 DIAM_CONN_PEER_DOWN

Table 8-68 DIAM_CONN_PEER_DOWN

Field Details
Description Diameter connection to peer {{ $labels.peerHost }} is down.
Summary Diameter connection to peer is down.
Severity Major
Expression Diameter connection to the peer (peerHost) in the given namespace is down.
OID 1.3.6.1.4.1.323.5.3.52.1.2.50
Metric Used occnp_diam_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.53 DIAM_CONN_NETWORK_DOWN

Table 8-69 DIAM_CONN_NETWORK_DOWN

Field Details
Description All the diameter network connections are down.
Summary All the diameter network connections are down.
Severity Critical
Expression sum by (kubernetes_namespace)(occnp_diam_conn_network) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.51
Metric Used occnp_diam_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support.
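The table's expression aggregates connection counts per namespace, so the alert fires only when every network-side Diameter connection is down, not when a single peer drops (that case is covered by DIAM_CONN_PEER_DOWN). As a rule sketch:

```yaml
- alert: DiamConnNetworkDown
  # Fires when the per-namespace sum of network connections reaches zero.
  expr: sum by (kubernetes_namespace) (occnp_diam_conn_network) == 0
  labels:
    severity: critical
  annotations:
    summary: 'All the diameter network connections are down'
    oid: '1.3.6.1.4.1.323.5.3.52.1.2.51'
```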

8.1.1.54 DIAM_CONN_BACKEND_DOWN

Table 8-70 DIAM_CONN_BACKEND_DOWN

Field Details
Description All the diameter backend connections are down.
Summary All the diameter backend connections are down.
Severity Critical
Expression sum by (kubernetes_namespace)(occnp_diam_conn_backend) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.52
Metric Used occnp_diam_conn_backend
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.55 PerfInfoActiveOverloadThresholdFetchFailed

Table 8-71 PerfInfoActiveOverloadThresholdFetchFailed

Field Details
Description The application fails to get the current active overload level threshold data.
Summary The application fails to get the current active overload level threshold data.
Severity Major
Expression active_overload_threshold_fetch_failed == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.53
Metric Used active_overload_threshold_fetch_failed
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.56 SLA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-72 SLA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description SLA Sy fail count exceeds the critical threshold limit
Summary SLA Sy fail count exceeds the critical threshold limit
Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.57 SLA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-73 SLA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

SLA Sy fail count exceeds the major threshold limit

Summary

SLA Sy fail count exceeds the major threshold limit

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.58 SLA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-74 SLA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

SLA Sy fail count exceeds the minor threshold limit

Summary

SLA Sy fail count exceeds the minor threshold limit

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.58

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.
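The three SLA alerts compute the same failure-rate ratio and differ only in the threshold band. Wrapped as a rule using the critical expression from the table (the `for` duration is an assumption):

```yaml
- alert: SLASyFailCountExceedsCriticalThreshold
  expr: |
    sum(rate(occnp_diam_response_local_total{msgType="SLA", responseCode!~"2.*"}[5m]))
      / sum(rate(occnp_diam_response_local_total{msgType="SLA"}[5m])) * 100 > 90
  for: 5m
  labels:
    severity: critical
  annotations:
    oid: '1.3.6.1.4.1.323.5.3.52.1.2.58'
```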

8.1.1.59 STA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-75 STA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the critical threshold limit.

Summary

STA Sy fail count exceeds the critical threshold limit.

Severity Critical
Expression

The failure rate of Sy STA responses is more than 90% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.60 STA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-76 STA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the major threshold limit.

Summary

STA Sy fail count exceeds the major threshold limit.

Severity Major
Expression

The failure rate of Sy STA responses is more than 80% and less than or equal to 90% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.61 STA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-77 STA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

STA Sy fail count exceeds the minor threshold limit.

Summary

STA Sy fail count exceeds the minor threshold limit.

Severity Minor
Expression

The failure rate of Sy STA responses is more than 60% and less than or equal to 80% of the total responses:

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777302"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.59

Metric Used

occnp_diam_response_local_total

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present. If the user has not been added to the OCS configuration, configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.62 SMSC_CONNECTION_DOWN

Table 8-78 SMSC_CONNECTION_DOWN

Field Details
Description This alert is triggered when connection to SMSC host is down.
Summary Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Severity Major
Expression sum by(namespace, pod, smscName)(occnp_active_smsc_conn_count) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.63
Metric Used occnp_active_smsc_conn_count
Recommended Actions

Check the connectivity between the notifier service pod(s) and the SMSC host, and ensure that the SMSC peer is reachable.

For any additional guidance, contact My Oracle Support.

8.1.1.63 STA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-79 STA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 90% of the total responses.

Summary

STA Rx fail count exceeds the critical threshold limit.

Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.64 STA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-80 STA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 80% and less than or equal to 90% of the total responses.

Summary

STA Rx fail count exceeds the major threshold limit.

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.65 STA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-81 STA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

The failure rate of Rx STA responses is more than 60% and less than or equal to 80% of the total responses.

Summary

STA Rx fail count exceeds the minor threshold limit.

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA", appId="16777236"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.64

Metric Used

occnp_diam_response_local_total{msgType="STA", appId="16777236", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the AF and ensure connectivity is present.

Check that the session and user are valid and have not been removed from the Policy database; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.66 SNA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-82 SNA_SY_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 90% of the total responses.

Summary

SNA Sy fail count exceeds the critical threshold limit

Severity Critical
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between diam-gw pod(s) and the OCS server and ensure connectivity is present.

Check that the session and user have not been removed from the OCS configuration; reconfigure the user(s) if needed.

For any additional guidance, contact My Oracle Support.

8.1.1.67 SNA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-83 SNA_SY_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 80% and less than or equal to 90% of the total responses.

Summary

SNA Sy fail count exceeds the major threshold limit

Severity Major
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 80 and sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 90

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between the diam-gw pod(s) and the OCS server, and ensure that connectivity is present.

Check that the session and user have not been removed from the OCS configuration, then configure the user(s).

For any additional guidance, contact My Oracle Support.

8.1.1.68 SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-84 SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description

The failure rate of Sy SNA responses is more than 60% and less than or equal to 80% of the total responses.

Summary

SNA Sy fail count exceeds the minor threshold limit

Severity Minor
Expression

sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 60 and sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 80

OID

1.3.6.1.4.1.323.5.3.52.1.2.65

Metric Used

occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}

Recommended Actions

Check the connectivity between the diam-gw pod(s) and the OCS server, and ensure that connectivity is present.

Check that the session and user have not been removed from the OCS configuration, then configure the user(s).

For any additional guidance, contact My Oracle Support.
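The threshold-banded expressions above can be captured as Prometheus alerting rules in the Alertrules.yaml file. The following is an illustrative sketch of the minor-band rule only; the group name, label values, and annotation text are placeholders, not the shipped configuration:

```yaml
groups:
  - name: policy-sy-sna-alerts        # illustrative group name
    rules:
      - alert: SNA_SY_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD
        # The band (60%, 80%] is expressed by repeating the failure
        # ratio and joining the two comparisons with "and".
        expr: >
          sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m]))
            / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 > 60
          and
          sum(rate(occnp_diam_response_local_total{msgType="SNA", responseCode!~"2.*"}[5m]))
            / sum(rate(occnp_diam_response_local_total{msgType="SNA"}[5m])) * 100 <= 80
        labels:
          severity: minor             # placeholder label value
        annotations:
          summary: "SNA Sy fail count exceeds the minor threshold limit"
```

The major and critical rules follow the same shape with the bounds shifted to (80%, 90%] and (90%, 100%].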

8.1.1.69 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Table 8-85 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Field Details
Description This alert is triggered when more than 10% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 10% of the Diameter requests are being discarded due to processing timeouts.
Severity Minor
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.70 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Table 8-86 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 20% of the Diameter requests are being discarded due to processing timeouts.
Severity Major
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.71 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Table 8-87 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received Diameter requests are cancelled because they are stale (received too late, or took too long to process).
Summary More than 30% of the Diameter requests are being discarded due to processing timeouts.
Severity Critical
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used
  • occnp_stale_diam_request_cleanup_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.72 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MINOR

Table 8-88 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MINOR

Field Details
Description Certificate expiry in less than 6 months.
Summary Certificate expiry in less than 6 months.
Severity Minor
Expression dgw_tls_cert_expiration_seconds - time() <= 15724800
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.73 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MAJOR

Table 8-89 DIAM_GATEWAY_CERTIFICATE_EXPIRY_MAJOR

Field Details
Description Certificate expiry in less than 3 months.
Summary Certificate expiry in less than 3 months.
Severity Major
Expression dgw_tls_cert_expiration_seconds - time() <= 7862400
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.74 DIAM_GATEWAY_CERTIFICATE_EXPIRY_CRITICAL

Table 8-90 DIAM_GATEWAY_CERTIFICATE_EXPIRY_CRITICAL

Field Details
Description Certificate expiry in less than 1 month.
Summary Certificate expiry in less than 1 month.
Severity Critical
Expression dgw_tls_cert_expiration_seconds - time() <= 2592000
OID 1.3.6.1.4.1.323.5.3.37.1.2.47
Metric Used dgw_tls_cert_expiration_seconds
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
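The expiry thresholds in these certificate-expiry expressions are plain second counts comparing the certificate's expiration timestamp against the current time. As a quick sanity check, the three values correspond to 182 days (about 6 months), 91 days (about 3 months), and 30 days. A minimal sketch verifying the arithmetic:

```python
# Certificate-expiry alert thresholds, expressed in seconds, as used in
# the expressions `dgw_tls_cert_expiration_seconds - time() <= <threshold>`.
SECONDS_PER_DAY = 86400

thresholds = {
    "minor (about 6 months, 182 days)": 182 * SECONDS_PER_DAY,  # 15724800
    "major (about 3 months, 91 days)": 91 * SECONDS_PER_DAY,    # 7862400
    "critical (1 month, 30 days)": 30 * SECONDS_PER_DAY,        # 2592000
}

for name, seconds in thresholds.items():
    print(f"{name}: {seconds}")
```
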

8.1.1.75 DGW_TLS_CONNECTION_FAILURE

Table 8-91 DGW_TLS_CONNECTION_FAILURE

Field Details
Description Alert for TLS connection establishment.
Summary TLS Connection failure when Diam gateway is an initiator.
Severity Major
Expression sum by (namespace,reason)(occnp_diam_failed_conn_network) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.81
Metric Used occnp_diam_failed_conn_network
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.76 POLICY_CONNECTION_FAILURE

Table 8-92 POLICY_CONNECTION_FAILURE

Field Details
Description Connection failure on Egress and Ingress Gateways for incoming and outgoing connections.
Summary Connection failure on Egress and Ingress Gateways for incoming and outgoing connections.
Severity Major
Expression sum(increase(occnp_oc_ingressgateway_connection_failure_total[5m]) >0 or (occnp_oc_ingressgateway_connection_failure_total unless occnp_oc_ingressgateway_connection_failure_total offset 5m )) by (namespace,app, error_reason) > 0

or

sum(increase(occnp_oc_egressgateway_connection_failure_total[5m]) >0 or (occnp_oc_egressgateway_connection_failure_total unless occnp_oc_egressgateway_connection_failure_total offset 5m )) by (namespace,app, error_reason) > 0

OID 1.3.6.1.4.1.323.5.3.52.1.2.76
Metric Used occnp_oc_ingressgateway_connection_failure_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.77 AUDIT_NOT_RUNNING

Table 8-93 AUDIT_NOT_RUNNING

Field Details
Description Audit has not been running for at least 1 hour.
Summary Audit has not been running for at least 1 hour.
Severity CRITICAL
Expression (absent_over_time(data_repository_invocations_seconds_count{method="getQueuedTablesToAudit"}[1h]) == 1) OR (sum(increase(data_repository_invocations_seconds_count{method="getQueuedTablesToAudit"}[1h])) == 0)
OID 1.3.6.1.4.1.323.5.3.52.1.2.78
Metric Used data_repository_invocations_seconds_count
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.78 DIAMETER_POD_ERROR_RESPONSE_MINOR

Table 8-94 DIAMETER_POD_ERROR_RESPONSE_MINOR

Field Details
Description At least 1% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Summary At least 1% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Severity MINOR
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=1
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.79 DIAMETER_POD_ERROR_RESPONSE_MAJOR

Table 8-95 DIAMETER_POD_ERROR_RESPONSE_MAJOR

Field Details
Description At least 5% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Summary At least 5% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER.
Severity MAJOR
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=5
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.80 DIAMETER_POD_ERROR_RESPONSE_CRITICAL

Table 8-96 DIAMETER_POD_ERROR_RESPONSE_CRITICAL

Field Details
Description At least 10% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER
Summary At least 10% of the Diam Response connection requests failed with error DIAMETER_UNABLE_TO_DELIVER
Severity CRITICAL
Expression (topk(1,((sort_desc(sum by (pod) (rate(ocbsf_diam_response_network_total{responseCode="3002"}[2m])))/ (sum by (pod) (rate(ocbsf_diam_response_network_total[2m])))) * 100))) >=10
OID 1.3.6.1.4.1.323.5.3.52.1.2.79
Metric Used ocbsf_diam_response_network_total
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.1.81 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Table 8-97 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsCriticalThreshold
Description The count of lock requests that fail to acquire the lock exceeds the critical threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 75 percent of total transactions.
Severity Critical
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >=75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, more than 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 75%.
8.1.1.82 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Table 8-98 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMajorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the major threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 50 percent of total transactions.
Severity Major
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 50 and (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 50% and 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 50%. If the rate exceeds 75%, a higher-severity alert is triggered.
8.1.1.83 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Table 8-99 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMinorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the minor threshold limit. (Current value: {{ $value }})
Summary Keys used in Bulwark lock requests that are already in a locked state were detected above 20 percent of total transactions.
Severity Minor
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20 and (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 20% and 50% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: The alert auto-resolves once the lock acquisition failure rate in the namespace drops below 20%. If the rate exceeds 50%, a higher-severity alert is triggered.
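PromQL has no chained comparison syntax, so a bounded threshold band such as the major and minor lock alerts above is written by repeating the failure ratio and joining the two comparisons with `and`. The following Alertrules.yaml sketch illustrates the minor-band rule; the group name, label values, and annotation text are placeholders, not the shipped configuration:

```yaml
groups:
  - name: bulwark-lock-alerts         # illustrative group name
    rules:
      - alert: LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD
        # Band [20%, 50%): the same per-namespace failure ratio is
        # evaluated twice, once per bound, and joined with "and".
        expr: >
          (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
            / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20
          and
          (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
            / sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 < 50
        labels:
          severity: minor             # placeholder label value
        annotations:
          summary: "Bulwark lock acquisition failure rate above 20 percent of total transactions"
```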
8.1.1.84 CERTIFICATE_EXPIRY_MINOR

Table 8-100 CERTIFICATE_EXPIRY_MINOR

Field Details
Description Certificate expiry in less than 6 months
Summary Certificate expiry in less than 6 months
Severity MINOR
Expression security_cert_x509_expiration_seconds - time() <= 15724800
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.85 CERTIFICATE_EXPIRY_MAJOR

Table 8-101 CERTIFICATE_EXPIRY_MAJOR

Field Details
Description Certificate expiry in less than 3 months
Summary Certificate expiry in less than 3 months
Severity MAJOR
Expression security_cert_x509_expiration_seconds - time() <= 7862400
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.86 CERTIFICATE_EXPIRY_CRITICAL

Table 8-102 CERTIFICATE_EXPIRY_CRITICAL

Field Details
Description Certificate expiry in less than 1 month
Summary Certificate expiry in less than 1 month
Severity CRITICAL
Expression security_cert_x509_expiration_seconds - time() <= 2592000
OID 1.3.6.1.4.1.323.5.3.52.1.2.77
Metric Used -
Recommended Actions -
8.1.1.87 PERF_INFO_ACTIVE_OVERLOADTHRESHOLD_DATA_PRESENT

Table 8-103 PERF_INFO_ACTIVE_OVERLOADTHRESHOLD_DATA_PRESENT

Field Details
Description -
Summary -
Severity MINOR
Expression active_overload_threshold_fetch_failed == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.53
Metric Used -
Recommended Actions -
8.1.1.88 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-104 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description More than 10% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 10% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity MINOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.89 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-105 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description More than 20% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 20% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.90 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-106 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description More than 30% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 30% of incoming requests towards the UDR connector are rejected because the request is stale on arrival or during processing by the connector
Severity CRITICAL
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="UDR-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="udr-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="UDR-C"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.85
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.91 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-107 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description More than 10% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 10% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity MINOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.92 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-108 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description More than 20% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 20% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.93 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-109 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description More than 30% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Summary More than 30% of incoming requests towards the CHF connector are rejected because the request is stale on arrival or during processing by the connector
Severity CRITICAL
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{mode="CHF-C"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m])))/(sum by (namespace) (rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[5m])) + sum by (namespace) (rate(occnp_late_arrival_rejection_total{mode="CHF-C"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.86
Metric Used occnp_late_processing_rejection_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.94 EGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Table 8-110 EGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Field Details
Description Policy Egress Gateway Data Director unreachable for {{$labels.namespace}}.
Summary kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} Policy Egress Gateway Data Director unreachable
Severity Major
Expression sum(oc_egressgateway_dd_unreachable) by(namespace,container) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.84
Metric Used oc_egressgateway_dd_unreachable
Recommended Actions Alert gets cleared automatically when the connection with data director is established.
8.1.1.95 INGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Table 8-111 INGRESS_GATEWAY_DD_UNREACHABLE_MAJOR

Field Details
Description Policy Ingress Gateway Data Director unreachable for {{$labels.namespace}}.
Summary 'kubernetes_namespace: {{$labels.kubernetes_namespace}}, timestamp: {{ with query "time()" }}{{ . | first | value | humanizeTimestamp }}{{ end }} Policy Ingress Gateway Data Director unreachable'
Severity Major
Expression sum(oc_ingressgateway_dd_unreachable) by(namespace,container) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.83
Metric Used oc_ingressgateway_dd_unreachable
Recommended Actions Alert gets cleared automatically when the connection with data director is established.
8.1.1.96 STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-112 STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.97 STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-113 STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.98 STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-114 STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • ocpm_late_processing_rejection_total
  • occnp_diam_request_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.99 STALE_BINDING_REQUEST_REJECTION_CRITICAL

Table 8-115 STALE_BINDING_REQUEST_REJECTION_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 30% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Critical
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total{microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.100 STALE_BINDING_REQUEST_REJECTION_MAJOR

Table 8-116 STALE_BINDING_REQUEST_REJECTION_MAJOR

Field Details
Description This alert is triggered when more than 20% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 20% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Major
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.101 STALE_BINDING_REQUEST_REJECTION_MINOR

Table 8-117 STALE_BINDING_REQUEST_REJECTION_MINOR

Field Details
Description This alert is triggered when more than 10% of the received HTTP requests are cancelled because they are stale (received too late, or took too long to process).
Summary '{{ $value }} % of requests are being discarded by the binding service due to the request being stale either on arrival or during processing. More than 10% of the Binding requests failed with error TIMED_OUT_REQUEST'
Severity Minor
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"} [5m])))/(sum by (namespace) (rate(ocpm_binding_inbound_request_total {microservice=~".*binding"}[5m]))+sum by (namespace) (rate(occnp_late_arrival_rejection_total{microservice=~".*binding"}[5m]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.87
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_binding_inbound_request_total
Recommended Actions For any additional guidance, contact My Oracle Support.
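In the stale-binding expressions above, late-arrival rejections are added to the denominator as well as the numerator because such requests are rejected before they are counted as inbound. A minimal sketch of the ratio with illustrative numbers (the function name and counts are hypothetical, chosen only to mirror the PromQL):

```python
# Illustrative stale-binding rejection ratio, mirroring the PromQL:
# (late_processing + late_arrival) / (inbound + late_arrival) * 100
def stale_rejection_percent(late_processing: float,
                            late_arrival: float,
                            inbound: float) -> float:
    """Percentage of binding requests discarded as stale."""
    return (late_processing + late_arrival) / (inbound + late_arrival) * 100

# Example: 6 late-processing and 4 late-arrival rejections against 90
# inbound requests gives (6 + 4) / (90 + 4) * 100, about 10.64%, which
# crosses the 10% minor threshold.
print(round(stale_rejection_percent(6, 4, 90), 2))
```
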
8.1.1.102 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-118 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions -
8.1.1.103 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-119 UDR_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.104 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-120 UDR_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.105 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Table 8-121 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MINOR

Field Details
Description At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 10% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Minor
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.106 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Table 8-122 CHF_C_STALE_HTTP_REQUEST_CLEANUP_MAJOR

Field Details
Description At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 20% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Major
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.107 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Table 8-123 CHF_C_STALE_HTTP_REQUEST_CLEANUP_CRITICAL

Field Details
Description At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Summary At least 30% of the received HTTP requests are cancelled, per operation type, because they are stale (received too late, or took too long to process).
Severity Critical
Expression -
OID -
Metric Used
  • occnp_late_arrival_rejection_total
  • occnp_late_processing_rejection_total
  • ocpm_userservice_inbound_count_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.1.108 UPDATE_NOTIFY_TIMEOUT_ABOVE_70_PERCENT

Table 8-124 UPDATE_NOTIFY_TIMEOUT_ABOVE_70_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 70% in a given time period.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.109 UPDATE_NOTIFY_TIMEOUT_ABOVE_50_PERCENT

Table 8-125 UPDATE_NOTIFY_TIMEOUT_ABOVE_50_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 50% but less than 70% in a given time period.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.110 UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT

Table 8-126 UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT

Field Details
Description The number of Update Notify requests that failed because of a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Summary The number of Update Notify requests that failed because of a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID -
Metric Used
  • ocpm_handle_update_notify_timeout_for_rx_collision_total
  • occnp_http_out_conn_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
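The three UPDATE_NOTIFY_TIMEOUT alerts share one failure ratio and differ only in thresholds. In PromQL, a chained comparison such as `>= 30 < 50` works by filtering: the first comparison drops samples below 30, and the second drops the survivors at or above 50, leaving only the [30, 50) band. A minimal sketch of the Minor tier as a rule entry (rule layout assumed; the expression is taken from the table above):

```yaml
# Sketch of the Minor tier of the UPDATE_NOTIFY timeout alerts.
# ">= 30 < 50" keeps only samples in the [30, 50) band; the Major and
# Critical tiers apply ">= 50 < 70" and ">= 70" to the same ratio.
- alert: UPDATE_NOTIFY_TIMEOUT_ABOVE_30_PERCENT
  expr: |
    (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_for_rx_collision_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m]))
     / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])))
    * 100 >= 30 < 50
  labels:
    severity: minor
```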

8.1.1.111 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MINOR

Table 8-127 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MINOR

Field Details
Description This alert is raised when more than 30% and up to 50% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 30% and up to 50% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 30 <= 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.112 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MAJOR

Table 8-128 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_MAJOR

Field Details
Description This alert is raised when more than 50% and up to 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 50% and up to 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 50 <= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.113 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_CRITICAL

Table 8-129 POLICYDS_PREEXPIRY_RESUBSCRIBE_FAILURE_CRITICAL

Field Details
Description This alert is raised when more than 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Summary This alert is raised when more than 70% of the subscriptions in the PRE_EXPIRY period fail to resubscribe.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY",response!~"2.*"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_response_total{expiryStatus="PRE_EXPIRY"}[5m]))) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.129
Metric Used occnp_policy_data_resubscription_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.114 POLICYDS_EXPIRED_SUBSCRIPTION

Table 8-130 POLICYDS_EXPIRED_SUBSCRIPTION

Field Details
Description This alert is raised when more than 10% of the audited subscriptions are expired.
Summary This alert is raised when more than 10% of the audited subscriptions are expired.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_request_total{expiryStatus="EXPIRED"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_policy_data_resubscription_request_total[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.130
Metric Used occnp_policy_data_resubscription_request_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.1.115 LDAP_PEER_CONNECTION_LOST

Table 8-131 LDAP_PEER_CONNECTION_LOST

Field Details
Name in Alert Yaml File LDAP_PEER_CONNECTION_LOST
Description This alert is triggered when the LDAP Gateway loses its connection to one or more LDAP peers, that is, when the value of the occnp_ldap_conn_total metric falls to zero. Reconnection attempts and alert clearance are governed by the LDAP_CONNECTION_REVERT_DELAY configuration parameter.
Summary LDAP Gateway loses connection to its LDAP peer(s).
Severity Major
Expression sum by (namespace,peer)(occnp_ldap_conn_total) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.113
Metric Used occnp_ldap_conn_total
Recommended Actions
  • Verify that the LDAP server is running and connectivity between the PCF and LDAP peers is available.
  • If LDAP is reachable, check the configured LDAP_CONNECTION_REVERT_DELAY value since reconnection attempts and alert clearance depend on this setting.
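Because the expression aggregates with `sum by (namespace, peer)`, the `peer` label is preserved and the alert fires once per disconnected peer. A minimal rule sketch follows; the `for` duration is an assumption (tune it with the LDAP_CONNECTION_REVERT_DELAY behavior in mind), not a documented value:

```yaml
# Sketch of the LDAP_PEER_CONNECTION_LOST rule. Fires per LDAP peer
# whose connection count has dropped to zero in a namespace.
- alert: LDAP_PEER_CONNECTION_LOST
  expr: sum by (namespace, peer) (occnp_ldap_conn_total) == 0
  for: 1m                      # assumed; reconnection cadence depends on LDAP_CONNECTION_REVERT_DELAY
  labels:
    severity: major
  annotations:
    summary: 'LDAP Gateway lost connection to LDAP peer {{ $labels.peer }}'
```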
8.1.1.116 IGW_POD_PROTECTION_DOC_STATE

Table 8-132 IGW_POD_PROTECTION_DOC_STATE

Field Details
Description The Ingress Gateway is in Danger_of_Congestion Level for the pod {{$labels.pod}} in namespace {{$labels.namespace}} ( current congestion level: {{ $value }} % )
Summary Ingress Gateway pod congestion state in Danger_of_Congestion Level.
Severity Minor
Expression oc_ingressgateway_congestion_system_state{microservice=~".*ingress-gateway"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.123
Metric Used oc_ingressgateway_congestion_system_state
Recommended Actions

The alert is cleared when the pod CPU consumption drops below the configured abatement value for the DOC level.

8.1.1.117 IGW_POD_PROTECTION_CONGESTED_STATE

Table 8-133 IGW_POD_PROTECTION_CONGESTED_STATE

Field Details
Description The Ingress Gateway is in Congested Level for the pod {{$labels.pod}} in namespace {{$labels.namespace}} ( current congestion level: {{ $value }} % )
Summary Ingress Gateway pod congestion state in Congested level.
Severity Critical
Expression sum(oc_ingressgateway_congestion_system_state{app_kubernetes_io_name="occnp-ingress-gateway"}) by (pod) == 4
OID 1.3.6.1.4.1.323.5.3.52.1.2.123
Metric Used oc_ingressgateway_congestion_system_state
Recommended Actions The alert is cleared when the pod CPU consumption drops below the configured abatement value for the Congested level.
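The two Ingress Gateway pod-protection alerts key off discrete values of the same congestion-state gauge: value 1 maps to the Danger_of_Congestion level and value 4 to the Congested level. A sketch of both rules side by side (rule layout assumed; the expressions are taken from the tables above):

```yaml
# Pod-protection alerts driven by the congestion-state gauge:
# value 1 = Danger_of_Congestion (DOC), value 4 = Congested.
- alert: IGW_POD_PROTECTION_DOC_STATE
  expr: oc_ingressgateway_congestion_system_state{microservice=~".*ingress-gateway"} == 1
  labels:
    severity: minor
- alert: IGW_POD_PROTECTION_CONGESTED_STATE
  expr: sum(oc_ingressgateway_congestion_system_state{app_kubernetes_io_name="occnp-ingress-gateway"}) by (pod) == 4
  labels:
    severity: critical
```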

8.1.2 PCF Alerts

This section provides information on PCF alerts.

8.1.2.1 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-134 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description UDR returning with POST subscribe response but without user data for SM as part of immediate reporting occurring above 10% for service {{$labels.microservice}} in {{$labels.namespace}} ( current value: {{ $value }} % )
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (indicates the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no SM user data was returned from UDR as part of the Immediate Reporting capability)

    • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without SM user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.
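The check described above is a ratio of `imm_reports_present="false"` responses to all POST sm-data responses. A hedged sketch of how the Minor rule might appear in Alertrules.yaml (rule layout assumed; the expression is taken from the table above):

```yaml
# Sketch: share of POST sm-data subscribe responses that arrived
# without SM user data (imm_reports_present="false"), per microservice
# and namespace. Major and Critical tiers raise the thresholds.
- alert: UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MINOR
  expr: |
    (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m])))
    / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m])))
    * 100 >= 10 < 20
  labels:
    severity: minor
```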

8.1.2.2 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-135 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no SM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without user data for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and still no SM user data is retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.3 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-136 UDR_SM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without SM user data as part of immediate reporting.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.127
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing SM user data check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no SM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without user data for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in the request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.4 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-137 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Minor
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.5 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-138 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Major
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (to indicate the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in the request payload has the 30th bit set to 1 when converted to hex (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved, inform the UDR operators whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.6 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-139 UDR_SM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for SM as part of immediate reporting.
Severity Critical
Expression (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (microservice, namespace) (rate(occnp_immrep_response_total{service_subresource="sm-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.128
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "sm-data" (indicates the UDR POST was to get SM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • immediate_report_pcc = "false" (indicates that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for SM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 when converted to hex (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and SM user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.7 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR

Table 8-140 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR

Field Details
Description The Diameter requests are being discarded due to timeout processing occurring above 10% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 10% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Minor
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 10% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and check timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.
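With the label matcher written as a quoted string, as PromQL requires (`microservice="diam-connector"`), the Minor rule could be expressed as the following sketch (rule layout assumed):

```yaml
# Sketch of the Minor stale-request rule for the Diameter Connector.
# Label matcher values must be quoted strings in PromQL; DWR and CER
# messages are excluded from the request total.
- alert: STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MINOR
  expr: |
    (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m])))
    / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER",microservice="diam-connector"}[5m])))
    * 100 >= 10
  labels:
    severity: minor
```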

8.1.2.8 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MAJOR

Table 8-141 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_MAJOR

Field Details
Description The Diameter requests are being discarded due to timeout processing occurring above 20% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 20% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Major
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 20% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and adjust the timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.

8.1.2.9 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_CRITICAL

Table 8-142 STALE_DIAMETER_CONNECTOR_REQUEST_CLEANUP_CRITICAL

Field Details
Description Diameter requests are being discarded due to timeout processing above 30% inside pod {{$labels.pod}} for service {{$labels.microservice}} in {{$labels.namespace}}
Summary More than 30% of the Diam Connector requests failed with error DIAMETER_ERROR_TIMED_OUT_REQUEST.
Severity Critical
Expression (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total{microservice="diam-connector"}[5m]))) / (sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER", microservice="diam-connector"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.88
Metric Used
  • occnp_diam_request_local_total
  • occnp_stale_diam_request_cleanup_total
Recommended Actions

The alert gets cleared when the number of stale requests is below 30% of the total requests. To troubleshoot and resolve the issue, perform the following steps:

  1. Identify the root cause of the timeout processing by reviewing the logs for the pod {{$labels.pod}} and service {{$labels.microservice}} in {{$labels.namespace}}.
  2. Verify the performance and resource utilization (CPU, memory) of the pod and make sure it has sufficient resources to process the requests in a timely manner.
  3. Review the configuration settings of the Diameter connector and adjust the timeout settings if necessary.
  4. Ensure that the backend services that the Diameter connector communicates with are healthy and responsive.

For further assistance, contact My Oracle Support.

8.1.2.10 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_CRITICAL_THRESHOLD

Table 8-143 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary 70% or more of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Critical
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.11 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MAJOR_THRESHOLD

Table 8-144 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary Between 50% and 70% of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Major
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 50 < 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.12 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MINOR_THRESHOLD

Table 8-145 SESSION_BINDING_MISSING_FROM_BSF_EXCEEDS_MINOR_THRESHOLD

Field Details
Description {{ $value }}% of bindings were missing but were restored from BSF, out of all bindings audited in {{$labels.namespace}}.
Summary Between 30% and 50% of bindings were missing but were restored from BSF, out of all bindings audited.
Severity Minor
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx",action="restored"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding",response_code="2xx"}[5m]))) * 100 >= 30 < 50

OID 1.3.6.1.4.1.323.5.3.52.1.2.89
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.13 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_CRITICAL_THRESHOLD

Table 8-146 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary 70% or more of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Critical
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.14 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MAJOR_THRESHOLD

Table 8-147 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary Between 50% and 70% of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Major
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 50 < 70

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.15 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MINOR_THRESHOLD

Table 8-148 SESSION_BINDING_REVALIDATION_WITH_BSF_FAILURE_EXCEEDS_MINOR_THRESHOLD

Field Details
Description {{ $value }}% of Revalidation Responses received from BSF failed, out of total Revalidation Responses in {{$labels.namespace}}.
Summary Between 30% and 50% of the Revalidation Responses received from BSF failed, out of total Revalidation Responses.
Severity Minor
Expression

(sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding", response_code!~"2.*"}[5m])) /sum by (namespace)(rate(occnp_session_binding_revalidation_response_total{microservice=~".*binding"}[5m]))) * 100 >= 30 < 50

OID 1.3.6.1.4.1.323.5.3.52.1.2.90
Metric Used occnp_session_binding_revalidation_response_total
Recommended Actions

Verify the health condition of BSF Management Service.

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.16 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-149 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 10% but less than 50% of total PA creates.
Severity Minor
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 10 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.17 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-150 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 50% but less than 75% of total PA creates.
Severity Major
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 50 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.18 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-151 N7_OPTIMIZED_LOOKUP_ERROR_RATE_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }}% of primary key lookups failed during PA create in namespace {{$labels.namespace}}.
Summary Primary key lookup failures are equal to or above 75% of total PA creates.
Severity Critical
Expression sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total{status="failed"}[30m])) / sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100 >= 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.124
Metric Used occnp_optimized_smpolicyassociation_lookup_query_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).
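
When investigating these lookup-failure alerts, an operator can evaluate the documented expression manually through the standard Prometheus HTTP API (`/api/v1/query`). A small sketch of building that instant query; the in-cluster Prometheus base URL below is an assumption for your deployment:

```python
# Sketch: checking the N7 optimized lookup failure rate manually through the
# standard Prometheus HTTP API. The base URL is a hypothetical in-cluster
# address; the query string mirrors the documented expression.
from urllib.parse import urlencode

PROMQL = (
    'sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total'
    '{status="failed"}[30m])) / '
    'sum by (namespace)(increase(occnp_optimized_smpolicyassociation_lookup_query_total[30m])) * 100'
)

def build_query_url(base_url: str) -> str:
    """Return the instant-query URL for the failure-rate expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': PROMQL})}"

# Example (hypothetical address); fetch with any HTTP client and read
# .data.result[].value[1] from the JSON response.
print(build_query_url("http://prometheus.occne-infra:9090").split("?")[0])
```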

8.1.2.19 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MINOR

Table 8-152 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MINOR

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 10% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}
Severity Minor
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.20 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MAJOR

Table 8-153 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_MAJOR

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 20% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}
Severity Major
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.21 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_CRITICAL

Table 8-154 SM_SVC_REQ_ENHANCED_OVERLOAD_REJECTION_CRITICAL

Field Details
Description {{ $value }}% of incoming requests towards the pcf_sm service are rejected due to the enhanced overload control mechanism.
Summary At least 30% of the received Requests have been rejected due to Overload state of pcf-sm service in namespace {{$labels.namespace}}.
Severity Critical
Expression ( sum by (namespace) (rate(occnp_enhanced_overload_reject_total{microservice=~".*pcf_sm"}[2m])) / (sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) + (sum by (namespace) (rate(session_oam_request_total{microservice=~".*pcf_sm"}[2m]) or occnp_enhanced_overload_reject_total * 0) ) ) ) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.125
Metric Used occnp_enhanced_overload_reject_total
Recommended Actions

For any additional guidance, contact My Oracle Support (https://support.oracle.com).

8.1.2.22 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD

Table 8-155 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MINOR_THRESHOLD
Description More than 70% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 70% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 70% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 70% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure", errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.23 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD

Table 8-156 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_MAJOR_THRESHOLD
Description More than 80% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 80% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 80% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 80% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure",errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.24 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD

Table 8-157 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of the timer capacity has been occupied for N1N2 transfer failure notification.
Summary More than 90% of the timer capacity has been occupied for N1N2 transfer failure notification.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2TransferFailure"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.107
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to the N1N2 transfer failure notification reaches 90% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for the failures in the flow that triggers the N1N2 transfer failure notification and, if needed, enable retransmission.

Cause:

This alert indicates sustained high utilization of the UE N1N2 Transfer Failure Notification timer pool. The occnp_timer_capacity gauge tracks the current number of outstanding timers per timerName, updated every timer scan.

  • These timers are created when the UE cannot deliver URSP rules and the system initiates a reattempt flow using backoff with a timer. High utilization suggests many failures are triggering the N1N2 transfer failure notification flow.
  • The alert notifies when utilization for timerName "UE_N1N2TransferFailure" exceeds 90% of a baseline capacity of 360,000.

    Dimensions:

    timerName: UE_N1N2TransferFailure

    namespace: as per Prometheus label used in aggregation

    siteId: underlying metric label; rule aggregates with max by (namespace)

Diagnostic Information:

  1. Validate the alert metric:

    Inspect occnp_timer_capacity{timerName="UE_N1N2TransferFailure"} in Prometheus/Grafana (and /actuator/prometheus) and review trends around the alert window.

  2. Correlate with triggering failures:

    Check for spikes in N1N2 transfer failure notifications and URSP delivery failures within the same time window.

  3. Review logs around the alert window:

    In PCF-UE and related components/egress, look for errors leading to N1N2 transfer failure notifications; align timestamps with the alert period.

  4. Verify retransmission/backoff settings:

    Ensure retransmission is enabled; confirm backoff parameters are appropriate (not overly conservative).

  5. Check downstream/egress health:

    Validate connectivity and response health for AMF or upstream endpoints; look for elevated error rates/timeouts.

  6. Confirm processing throughput:

    Verify rate_per_second for this timerName, worker thread health, and pod readiness/liveness; ensure backlog is draining.

  7. Watch for capacity rejections:

    Observe occnp_timer_create_failure_total{timerName="UE_N1N2TransferFailure",errorCause="TIMER_CAPACITY_EXCEEDS"} for signs of hard-cap hits.

Recovery:

  1. Resolve underlying failures:

    Work with upstream/AMF and correct misconfigurations causing the flow to trigger N1N2 transfer failure notifications at high rates.

  2. Enable or optimize retransmission:

    Turn on retransmission if disabled; tune backoff to improve success while avoiding downstream overload.

  3. Increase draining capacity:

    Temporarily raise rate_per_second and/or scale pods to drain outstanding timers faster.

  4. Adjust capacity if needed:

    Temporarily increase the registered timer_capacity baseline for this timerName while addressing root causes.

  5. Reduce new load temporarily:

    Throttle or defer non-critical timer creates for this timerName until utilization drops.

  6. Monitor until recovered:

    Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.

8.1.2.25 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD

Table 8-158 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MINOR_THRESHOLD
Description More than 70% of the timer capacity has been occupied for AMF discovery.
Summary More than 70% of the timer capacity has been occupied for AMF discovery.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to AMF discovery reaches 70% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for NRF discovery failures and, if needed, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 70% of the timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a backoff, and creates a timer. This alert is triggered when the capacity of timers corresponding to AMF discovery exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact My Oracle Support.

8.1.2.26 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD

Table 8-159 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_MAJOR_THRESHOLD
Description More than 80% of the timer capacity has been occupied for AMF discovery.
Summary More than 80% of the timer capacity has been occupied for AMF discovery.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and provides the current timer count. These timers are created when the UE is unable to deliver the URSP rules and a reattempt is scheduled with backoff. The alert is triggered when the timer capacity corresponding to AMF discovery reaches 80% of the maximum limit of 360K. The operator can troubleshoot to identify the reasons for NRF discovery failures and, if needed, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 80% of the timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a backoff, and creates a timer. This alert is triggered when the capacity of timers corresponding to AMF discovery exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact My Oracle Support.

8.1.2.27 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD

Table 8-160 AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_AMF_DISCOVERY_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of the timer capacity has been occupied for AMF discovery.
Summary More than 90% of the timer capacity has been occupied for AMF discovery.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_AMFDiscovery"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.95
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to AMF discovery reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures with NRF discovery and, if required, enable direct or indirect alternate routing from the NRF client.

Cause:

More than 90% of timer capacity has been occupied for AMF discovery. The occnp_timer_capacity metric records the current timer count. These timers are created when the User Equipment (UE) cannot deliver URSP rules, retries with a back-off, and creates a timer. This alert is triggered when the timer capacity corresponding to AMF discovery exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • High Rate of UE Failures: Many User Equipment (UE) devices are unable to deliver URSP (User Route Selection Policy) rules, causing increased retries and timer creation for AMF discovery.

  • Network Function (NRF or AMF) Issues: Problems or instability with the AMF (Access and Mobility Management Function) or related NRF (Network Repository Function) components might prevent successful discovery or rule delivery, resulting in more timer retries.

  • Resource Bottlenecks: Network resource constraints or congestion could delay or prevent successful URSP rule delivery, again resulting in repeated retries and high timer usage.

  • Excessively Short Timer Values: If the back-off or retry timers are set too short, UEs may repeat attempts too rapidly, compounding timer consumption.

Recovery:

  • Review the logs and monitor for trends in UE failures with AMF discovery.
  • Consider enabling direct or indirect alternate routing from the NRF-client to mitigate timer capacity issues.
  • Investigate any recent configuration or software changes, check for network health (especially AMF and NRF), and verify timer-related configurations.
  • If the issue persists, contact the Support team.
8.1.2.28 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD

Table 8-161 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MINOR_THRESHOLD
Description More than 70% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 70% of timer capacity has been occupied for n1n2 subscribe.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 70% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 70% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.

  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.

  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.

  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.

  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.

  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.29 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD

Table 8-162 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_MAJOR_THRESHOLD
Description More than 80% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 80% of timer capacity has been occupied for n1n2 subscribe.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 80% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 80% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.

  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.

  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.

  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.

  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.

  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.30 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD

Table 8-163 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_SUBSCRIBE_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of timer capacity has been occupied for n1n2 subscribe.
Summary More than 90% of timer capacity has been occupied for n1n2 subscribe.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageSubscribe"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.96
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 subscribe reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 subscription, or on the AMF side, and, if required, enable direct or indirect alternate routing.

Cause:

More than 90% of timer capacity has been occupied for N1N2 subscription. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after back-off, and creates a new timer. This alert is triggered when timer capacity for N1N2 subscription exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Subscription Flows: Multiple User Equipment (UE) devices may be repeatedly failing to complete N1N2 subscription actions, resulting in retries and new timer creations.
  • Persistent Delivery or Communication Issues: Failures in delivering URSP rules or problems communicating with the AMF or other network functions might cause UEs to retrigger the N1N2 subscription flow.
  • Underlying AMF or Network Instability: Instability, health issues, or misconfigurations in the AMF (Access and Mobility Management Function) could prevent successful subscription completion, leading to increased timers.
  • High Traffic Volume or Spikes: Unexpectedly high volumes of N1N2 subscription requests can cause a large number of timers to be in use concurrently.
  • Resource Limitations or Performance Bottlenecks: Processing delays or resource bottlenecks (CPU, memory, network) within the UE-service or supporting backend could slow down or block subscription handling, causing timers to accumulate.
  • Improper Timer or Retry Configuration: Short retry intervals or misconfigured back-off could lead to rapid, repeated subscription attempts and excessive timer usage.

Recovery:

  • Review logs and N1N2 subscription flow metrics for unusual error patterns.
  • Investigate AMF and related network function health and recent changes.
  • Check configuration for timer parameters and adjust if necessary.
  • Monitor for spikes in traffic or unusual load patterns.
  • If the issue persists, contact the Support team.
8.1.2.31 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD

Table 8-164 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MINOR_THRESHOLD
Description More than 70% of timer capacity has been occupied for n1n2 transfer.
Summary More than 70% of timer capacity has been occupied for n1n2 transfer.
Severity Minor
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 70% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 70% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 70% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.32 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD

Table 8-165 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_MAJOR_THRESHOLD
Description More than 80% of timer capacity has been occupied for n1n2 transfer.
Summary More than 80% of timer capacity has been occupied for n1n2 transfer.
Severity Major
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 80% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 80% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 80% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.33 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD

Table 8-166 AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File AUDIT_TIMER_CAPACITY_FOR_UE_N1N2_TRANSFER_ABOVE_CRITICAL_THRESHOLD
Description More than 90% of timer capacity has been occupied for n1n2 transfer.
Summary More than 90% of timer capacity has been occupied for n1n2 transfer.
Severity Critical
Expression (max by (namespace) (occnp_timer_capacity{timerName="UE_N1N2MessageTransfer"})/360000) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.52.1.2.97
Metric Used occnp_timer_capacity
Recommended Actions

The occnp_timer_capacity metric is pegged during each timer scan and reports the current timer count. These timers are created when the UE is unable to deliver the URSP rules and reattempts with back-off. This alert is triggered when the timer capacity corresponding to N1N2 transfer reaches 90% of the maximum capacity of 360K timers. The operator can troubleshoot to identify the reasons for failures in the flow that triggers the N1N2 transfer and, if required, enable direct or indirect alternate routing.

Cause:

More than 90% of timer capacity has been occupied for N1N2 transfer. The occnp_timer_capacity metric tracks the current count of active timers. These timers are created when the User Equipment (UE) fails to deliver URSP rules, retries after a back-off, and creates a new timer. This alert is triggered when the timer capacity for N1N2 transfer exceeds 90% of the total 360K capacity.

Diagnostic Information:

  • Frequent UE Failures in N1N2 Transfer Flows: Many User Equipment (UE) devices are failing to complete N1N2 transfer operations successfully. Each failure leads to retries and the creation of new timers.
  • Delivery or Communication Issues: Persistent network issues preventing successful URSP rule delivery or failures in communication between the UE, AMF (Access and Mobility Management Function), or other relevant network functions can result in repeated N1N2 transfer attempts.
  • Resource Constraints or Performance Bottlenecks: Limited processing resources, high latency, or overload conditions (e.g., CPU/memory/network contention) can slow down or block the completion of transfer requests, causing timers to accumulate.
  • High Volume of Requests: An increased volume of N1N2 transfer requests due to network events or abnormal UE behavior can lead to a rapid consumption of available timer capacity.
  • Improper Timer Configuration: Short back-off intervals or aggressive retry settings can cause repeated rapid reattempts, increasing the number of concurrent timers.
  • AMF or Other NF Instability: Outages or instability in the AMF or related network functions may cause requests to go unprocessed, triggering continual retries from UEs.

Recovery:

  • Review recent logs and metrics related to N1N2 transfer failures.
  • Investigate the health status of the AMF and other supporting NFs.
  • Check resource utilization and adjust timer back-off/retry configuration if needed.
  • Look for recent network changes or spikes in request volume.
  • If the issue persists, contact the Support team.
8.1.2.34 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-167 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of n1n2 subscribe reattempt failed.
Summary More than 25% of n1n2 subscribe reattempt failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 25% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
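The reattempt-failure alerts all evaluate the same ratio: failed (non-2xx) reattempts divided by total reattempts over a 5-minute window, expressed as a percentage and compared against the 25/50/75 thresholds. The following sketch is illustrative only (it is not product code; the function and sample data are assumptions for clarity), showing how a window of reattempt response counts maps to the Minor, Major, or Critical severity used by these alerts:

```python
# Illustrative sketch (not product code): mirrors the alert ratio
#   failed reattempts / total reattempts * 100
# that Prometheus computes over a 5-minute window with increase(),
# using the 25/50/75 thresholds of the MINOR/MAJOR/CRITICAL
# reattempt-failure alerts.

def reattempt_failure_severity(responses):
    """responses: (responseCode, count) pairs for reattempted requests."""
    total = sum(count for _, count in responses)
    if total == 0:
        return "OK"  # no reattempts in the window, nothing to alert on
    # Any response code outside the 2xx range counts as a failure,
    # matching the responseCode!~"2.*" label matcher in the expression.
    failed = sum(count for code, count in responses
                 if not code.startswith("2"))
    pct = failed / total * 100
    if pct > 75:
        return "CRITICAL"
    if pct > 50:
        return "MAJOR"
    if pct > 25:
        return "MINOR"
    return "OK"

# Example window: 40 failures out of 100 reattempts is a 40% failure
# rate, which crosses the 25% Minor threshold but not the Major one.
sample = [("201", 60), ("504", 25), ("503", 15)]
print(reattempt_failure_severity(sample))
```

In the sample window, 40 of 100 reattempts failed (40%), so only the Minor-threshold alert would fire.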
8.1.2.35 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-168 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of n1n2 subscribe reattempt failed.
Summary More than 50% of n1n2 subscribe reattempt failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 50% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
8.1.2.36 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-169 UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_SUBSCRIBE_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of n1n2 subscribe reattempt failed.
Summary More than 75% of n1n2 subscribe reattempt failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",operationType="subscribe",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",operationType="subscribe"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.99
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 subscribe has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 subscription is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An elevated percentage of reattempt failures has been detected for UE N1N2 subscriptions. The http_out_conn_response_total metric increments whenever the PCF-UE receives a response for outbound messages, specifically tracking reattempts where the operation type is "subscribe" and the response code is not in the 2xx (success) range. This alert triggers when more than 75% of such reattempts fail over a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: The target AMF may be experiencing outages, heavy load, or is otherwise unhealthy, causing it to reject or fail to respond to subscription requests.
  • Network Issues or Communication Failures: Network congestion, routing problems, or transient communication errors may prevent successful delivery of N1N2 subscription requests or receipt of responses.
  • Configuration Errors: Misconfiguration of endpoints (such as incorrect URLs, authentication, or authorization settings) may cause subscription requests to be rejected or fail.
  • High Load or Resource Exhaustion: If the AMF or intermediate network components are overloaded or have run out of necessary resources (e.g., memory, threads, process slots), reattempted requests may be rejected.
  • Timeouts or Latency Issues: Prolonged delays in response times could cause requests to time out, leading to apparent failures.

Recovery:

  • Review logs and error codes for patterns or specific failure reasons.
  • Check the health and recent activity of the AMF(s) and relevant network paths.
  • Examine configuration settings related to N1N2 subscriptions and ensure they are correct.
  • Investigate any spikes in load or indications of resource bottlenecks.
  • Correlate with recent changes or deployments in the environment.
  • If the issue persists, contact the Support team.
8.1.2.37 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-170 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of n1n2 transfer reattempt failed.
Summary More than 25% of n1n2 transfer reattempt failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert indicates that the reattempt failure rate for UE N1N2 transfer has crossed the threshold. If failures increase, the operator can investigate why the flow triggering the N1N2 message transfer is failing, or whether the AMF that the requests are routed to is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 25% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.38 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-171 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of N1N2 transfer reattempts failed.
Summary More than 50% of N1N2 transfer reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert notifies the operator when a certain percentage of UE N1N2 transfer reattempts fail. If failures increase, the operator can review why the flow triggering the N1N2 message transfer is failing, or whether the AMF receiving the requests is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 50% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.39 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-172 UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of N1N2 transfer reattempts failed.
Summary More than 75% of N1N2 transfer reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2MessageTransfer", operationType="transfer"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.100
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message sent out of the NF. This alert notifies the operator when a certain percentage of UE N1N2 transfer reattempts fail. If failures increase, the operator can review why the flow triggering the N1N2 message transfer is failing, or whether the AMF receiving the requests is unhealthy.

Cause:

An increased percentage of reattempt failures has been detected for UE N1N2 message transfers. The http_out_conn_response_total metric increments when the PCF-UE receives a response for messages being sent out of the Network Function (NF), specifically monitoring reattempts of the "transfer" operation where the response code is not 2xx (success). This alert is triggered when over 75% of such reattempts result in failure within a 5-minute period.

Diagnostic Information:

  • AMF (Access and Mobility Management Function) Unavailability or Instability: If the target AMF is down, overloaded, or behaving unpredictably, message transfer requests (especially retries) are more likely to fail.
  • Network Path Issues: Transient or persistent network failures, high latency, or packet loss between the PCF-UE and the target network function can disrupt the successful transfer of N1N2 messages.
  • Configuration Errors: Misconfiguration in endpoints, credentials, or other protocol parameters can cause messages to be consistently rejected or fail to deliver.
  • System Resource Constraints: Resource exhaustion (CPU, memory, file descriptors, etc.) on either the PCF-UE or the AMF could prevent successful handling of transfer requests.
  • Timeouts and Slow Processing: Delayed responses or timeouts can be interpreted as failures, particularly if the operation times out consistently during high load or due to backend issues.

Recovery:

  • Review and analyze failure logs and returned error codes.
  • Check the operational health and resource status of the AMF and other involved NFs.
  • Validate network connectivity and latency between all relevant components.
  • Inspect configuration and recent changes for potential misalignments.
  • Correlate the timing of increased failures with network incidents, maintenance windows, or new deployments.
If the issue persists, contact the Support team.
8.1.2.40 SM_STALE_REQUEST_PROCESSING_REJECT_MINOR

Table 8-173 SM_STALE_REQUEST_PROCESSING_REJECT_MINOR

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_MINOR
Description More than 10% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 10% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Minor
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 10 < 20

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 10% or more (but less than 20%) of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.
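The staleness computation described above can be sketched as follows. This is a minimal illustration, not product code: the header values are taken from the example above, and the function name is hypothetical.

```python
# Minimal sketch of the stale-request check, assuming the header semantics
# described above: a request is stale once 'now' passes
# sender timestamp + maximum response time.
from datetime import datetime, timedelta, timezone

def is_stale(sender_timestamp: str, max_rsp_time_ms: str, now: datetime) -> bool:
    """True when 'now' is past sender timestamp + max response time."""
    sent = datetime.fromisoformat(sender_timestamp.replace("Z", "+00:00"))
    deadline = sent + timedelta(milliseconds=int(max_rsp_time_ms))
    return now > deadline

# Header values from the example: the request goes stale at 09:48:04.000Z.
checked_at = datetime(2025, 11, 3, 9, 48, 5, tzinfo=timezone.utc)
print(is_stale("2025-11-03T09:48:01.000Z", "3000", checked_at))  # True
```

A check performed before the deadline (for example at 09:48:03Z) would return False and the request would be processed normally.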

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.41 SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Table 8-174 SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_MAJOR
Description More than 20% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 20% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Major
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 20 < 30

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 20% or more (but less than 30%) of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.42 SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Table 8-175 SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Field Details
Name in Alert Yaml File SM_STALE_REQUEST_PROCESSING_REJECT_CRITICAL
Description More than 30% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Summary More than 30% of the Ingress requests failed with error 504 GATEWAY_TIMEOUT due to the request being stale
Severity Critical
Expression (sum by (namespace,pod) (rate(occnp_late_processing_rejection_total{microservice=~"occnp_pcf_sm"}[5m])))/(sum by (namespace,pod) (rate(ocpm_ingress_request_total{microservice=~"occnp_pcf_sm"}[5m]))) * 100 >= 30

OID 1.3.6.1.4.1.323.5.3.52.1.2.101
Metric Used occnp_late_processing_rejection_total, ocpm_ingress_request_total
Recommended Actions The metric occnp_late_processing_rejection_total is pegged when Late Processing finds a stale session.

Cause:

The metric occnp_late_processing_rejection_total is incremented when the SM Service determines that a request has become stale.

For example, if a request includes the following header parameters:

  • sbiSenderTimestamp3GPP='2025-11-03T09:48:01.000Z' (sender timestamp)
  • sbiMaxRSPTime3GPP='3000' (maximum response time in milliseconds)

In this scenario, if there is a delay in receiving a response from the external Network Function (NF), a stale check is later performed. If the request is deemed stale during this check, it is counted in the metric.

This alarm is raised when 30% or more of the Ingress requests fail with error 504 GATEWAY_TIMEOUT.

Diagnostic Information:

  • Validate Timestamps: Ensure that system clocks are synchronized (e.g., via NTP).
  • Analyze Latency: Use tracing or metric data to identify bottlenecks in response time—look for patterns in external NF response delays.
  • Review Configurations: Confirm that max response times (sbiMaxRSPTime3GPP) are correctly set as per the service contract.
  • Scale System Resources: Check for resource constraints (CPU, memory, bandwidth) and scale up your system or services as needed to handle the incoming request load within the allowed response time.

Recovery:

Once the recommended diagnostic actions are implemented and responses from the external NF are received within the expected timeframe, the percentage of rejected messages will begin to decline, ultimately clearing the alert.
8.1.2.43 UE_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Table 8-176 UE_STALE_REQUEST_PROCESSING_REJECT_MAJOR

Field Details
Description This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Major
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 20% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.44 UE_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Table 8-177 UE_STALE_REQUEST_PROCESSING_REJECT_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Critical
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 30% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.45 UE_STALE_REQUEST_PROCESSING_REJECT_MINOR

Table 8-178 UE_STALE_REQUEST_PROCESSING_REJECT_MINOR

Field Details
Description This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Summary This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the request going stale while being processed by the service.
Severity Minor
Expression (sum by (namespace) (rate(occnp_late_processing_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace) (rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.104
Metric Used occnp_late_processing_rejection_total
Recommended Actions Metric occnp_late_processing_rejection_total is pegged when requests being processed become stale.

Cause:

More than 10% of incoming requests to the ue-service have been rejected because they became stale during processing. The service flags a request as stale when its processing exceeds an acceptable time window.

Diagnostic Information:

  • High System Load or Resource Contention: The ue-service or its backend components may be overloaded (e.g., CPU, memory, I/O), delaying request processing.
  • Inefficient Request Handling or Bottlenecks: There may be inefficiencies or slow operations within the service logic, such as database queries, API calls, or complex computations causing extended processing times.
  • Network Latency or Downstream Delays: High network latency or slow responses from dependent services or databases could increase the time required to process requests.
  • Increased Volume of Requests: A spike in incoming requests can overwhelm the service, leading to request queues and increased wait times.

Recovery:

  • Monitor system and service resource utilization.
  • Review recent changes to workload, configuration, or deployments.
  • Tune timeouts and thresholds appropriately based on observed service latency.
  • Analyze logs to pinpoint where delays are occurring in the request processing workflow.
If the issue persists, contact the Support team.
8.1.2.46 UE_STALE_REQUEST_ARRIVAL_REJECT_MINOR

Table 8-179 UE_STALE_REQUEST_ARRIVAL_REJECT_MINOR

Field Details
Description This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 10% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 10% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.
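The arrival-time check above, including the fallback from the sender timestamp to the origination timestamp, can be sketched as follows. This is illustrative only; the function name and the header-dictionary shape are assumptions, not product code.

```python
# Hedged sketch of the arrival-time staleness check described above:
# prefer 3gpp-Sbi-Sender-Timestamp, fall back to
# 3gpp-Sbi-Origination-Timestamp, and treat the request as stale when it
# arrives later than timestamp + 3gpp-Sbi-Max-Rsp-Time.
from datetime import datetime, timedelta, timezone

def arrival_is_stale(headers: dict, arrival: datetime) -> bool:
    """True when the request arrived after timestamp + max response time."""
    stamp = (headers.get("3gpp-Sbi-Sender-Timestamp")
             or headers.get("3gpp-Sbi-Origination-Timestamp"))
    max_rsp_ms = headers.get("3gpp-Sbi-Max-Rsp-Time")
    if stamp is None or max_rsp_ms is None:
        return False  # required headers absent: no staleness check performed
    sent = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
    return arrival > sent + timedelta(milliseconds=int(max_rsp_ms))

# Header values from the example: the deadline is 09:48:04.000Z.
headers = {"3gpp-Sbi-Sender-Timestamp": "2025-11-03T09:48:01.000Z",
           "3gpp-Sbi-Max-Rsp-Time": "3000"}
late_arrival = datetime(2025, 11, 3, 9, 48, 6, tzinfo=timezone.utc)
print(arrival_is_stale(headers, late_arrival))  # True
```

A request arriving before the deadline (for example at 09:48:02Z) would not be counted in the metric.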

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
    • Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.47 UE_STALE_REQUEST_ARRIVAL_REJECT_MAJOR

Table 8-180 UE_STALE_REQUEST_ARRIVAL_REJECT_MAJOR

Field Details
Description This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 20% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Major
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 20% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
    • Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.48 UE_STALE_REQUEST_ARRIVAL_REJECT_CRITICAL

Table 8-181 UE_STALE_REQUEST_ARRIVAL_REJECT_CRITICAL

Field Details
Description This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Summary This alert is triggered when more than 30% of the incoming requests towards the UE Policy service are rejected due to the requests being stale upon arrival at the service.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_late_arrival_rejection_total{microservice=~".*pcf_ueservice"}[5m])) / sum by (namespace)(rate(ocpm_ingress_request_total{microservice=~".*pcf_ueservice"}[5m]))) * 100 > 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.109

Metric Used ocpm_late_arrival_rejection_total
Recommended Actions Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

Cause:

Metric ocpm_late_arrival_rejection_total is pegged when a received request is stale.

  • Metric: ocpm_late_arrival_rejection_total
    • Increments when the UE Service determines incoming requests are stale (arrived too late to process).
    • The staleness check is based on:
      • 3gpp-Sbi-Sender-Timestamp (preferred)
      • 3gpp-Sbi-Origination-Timestamp (fallback if sender timestamp is unavailable)
      • 3gpp-Sbi-Max-Rsp-Time (maximum allowed response time, in ms)
  • Request Example:
    • 3gpp-Sbi-Sender-Timestamp='2025-11-03T09:48:01.000Z'
    • 3gpp-Sbi-Max-Rsp-Time='3000' (i.e., 3 seconds)
  • If a request arrives after (Sender-Timestamp + Max-Rsp-Time), it is considered stale and counted in the metric.
  • Alarm Condition:
    • If more than 30% of ingress requests result in 504 GATEWAY_TIMEOUT errors due to staleness, an alarm is raised.

Diagnostic Information:

  1. Verify Time Synchronization
    • Ensure all Network Functions (NFs) have synchronized system clocks (using NTP).
    • Time drift between sender and UE Service may falsely trigger staleness.
  2. Check Network Latency
    • Investigate possible network delays or congestion between external NF and the UE Service.
    • High or unstable latency can lead to late arrival of requests.
  3. Analyze Sender Behavior
    • Validate that the sending NF populates 3gpp-Sbi-Sender-Timestamp (or Origination-Timestamp) correctly.
    • Misconfigured or delayed timestamping can corrupt staleness calculation.
  4. Assess Max Response Time Values
    • Review if the 3gpp-Sbi-Max-Rsp-Time value is appropriate for your network and application conditions.
    • Very short response times may not be feasible under current latency conditions.
  5. Review Application Load
    • Monitor system/resource utilization (CPU, memory, queue lengths) on the UE Service.
    • Resource exhaustion may delay request processing, even if requests arrive on time.
  6. Correlation with Other Metrics
    • Examine related metrics such as total request counts, processing times, error types, etc., to identify trends.
    • Check if certain sources or request types are consistently late.
  7. Check for Backlogs
    • Review UE Service logs for any signs of backlogs, bottlenecks, or spikes in the request handling pipeline.

Recovery:

  1. Verify Time Synchronization
    • Ensure all relevant Network Functions (NFs) have correct system time. Resynchronize clocks if any drift is detected.
  2. Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the UE Service. Resolve any high latency or packet loss immediately if detected.
  3. Review UE Service Application & Resources
    • Check the UE Service for high CPU/memory usage or any request processing backlogs.
    • Restart or scale up resources temporarily if the system is overloaded.
  4. Contact Upstream NF Owners
Notify owners of external NFs if they are sending delayed or incorrectly timestamped requests so they can take corrective action.
8.1.2.49 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-182 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of N1N2 transfer failure notification reattempts failed.
Summary More than 75% of N1N2 transfer failure notification reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 75% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).
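The ratio the alert expression computes can be illustrated with a small sketch. This is a hypothetical helper, not Policy code; it mirrors the increase-of-non-2xx-responses over increase-of-requests calculation from the Expression field.

```python
def reattempt_failure_pct(response_counts, requests_total):
    """Percentage of reattempt responses that are non-2xx over the window.

    response_counts maps responseCode -> increase over the window (e.g. 5m);
    requests_total is the matching http_out_conn_request_total increase.
    """
    if requests_total == 0:
        return 0.0
    failures = sum(n for code, n in response_counts.items() if not code.startswith("2"))
    return 100 * failures / requests_total

# 8 of 10 reattempts failed -> 80%, above the 75% critical threshold.
pct = reattempt_failure_pct({"200": 2, "504": 5, "503": 3}, 10)
print(pct, pct > 75)  # 80.0 True
```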

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery :

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.50 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-183 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of N1N2 transfer failure notification reattempts failed.
Summary More than 50% of N1N2 transfer failure notification reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 50% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery:

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.51 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-184 UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_N1N2_TRANSFER_FAILURE_NOTIFICATION_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of N1N2 transfer failure notification reattempts failed.
Summary More than 25% of N1N2 transfer failure notification reattempts failed.
Severity Minor
Expression (sum by (namespace) (increase(http_out_conn_response_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(http_out_conn_request_total{isReattempt="true",reattemptType="UE_N1N2TransferFailure",operationType="transfer"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.106
Metric Used http_out_conn_response_total, http_out_conn_request_total
Recommended Actions
The http_out_conn_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the reattempt failure rate for UE N1N2 transfer failure notifications crosses the threshold. If failures increase, the operator can investigate:
  • Why the flow triggering the N1N2 transfer failure notification is failing, or
  • The health of the AMF to which the requests are sent

Cause:

The http_out_conn_response_total metric with the indicated dimensions is pegged when PCF-UE receives a response to an outgoing reattempt transfer request triggered by the N1N2TransferFailure notification.

Dimensions:

isReattempt : true

reattemptType : UE_N1N2TransferFailure

operationType : transfer

responseCode : !2xx

In this case, more than 25% of outgoing transfer reattempts (triggered by the N1N2TransferFailure notified by AMF) received a non-2xx (failure) response in the last 5 minutes (or the selected sample window).

Diagnostic Information:

  1. Check Recent Logs:
    • Analyze logs for both PCF-UE and Egress Gateway in the relevant namespace for error details (timestamps matching the period of alert).
    • Focus on error responses: look for 4xx/5xx HTTP responses and their reasons.
  2. Correlate with Traffic Patterns:
    • Determine if failures are specific to certain AMFs or random.
    • Check if there's a sudden surge in failures (indicating a broader issue).
  3. Inspect Network Health and Configuration:
    • Ensure connectivity and correct routing between PCF-UE and its downstream targets.
    • Validate configurations, especially recently changed ones.
  4. Cross-check Incident/Event Timeline:
    • Review recent maintenance, deployments, or network events that could correlate with the increase in failures.
  5. Evaluate for Service Overload:
    • Examine resource metrics (CPU, memory, request rate) of the affected services (PCF-UE, PCF-EGW) to determine if they are overloaded.
  6. Check with Peers:
    • See if corresponding namespaces (other tenants/products) are seeing similar issues; this could indicate a platform or shared service problem.

Recovery:

  1. Resolve Underlying Service Issues:
    • If the upstream service (e.g., AMF or other network function) is unhealthy, work with the respective team to restore normal operation.
    • Address any misconfiguration or errors causing repeated non-2xx responses.
  2. Revert Recent Changes:
    • If the issue correlates with recent deployments or configuration changes, consider rolling back to the previous stable state after assessing impact.
  3. Mitigate Service Overload:
    • If resource constraints are detected (CPU, memory, connections), scale up resources or reduce load by throttling non-critical requests where possible.
  4. Network Remediation:
    • Resolve any detected connectivity or routing issues between PCF-UE and the egress gateway or upstream endpoints.
  5. Monitor and Confirm Recovery:
    • Continue monitoring the alert metric after remedial actions to confirm the failure rate falls below the alert threshold.
Ensure related services in affected namespaces also recover.
8.1.2.52 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Table 8-185 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_CRITICAL_THRESHOLD
Description More than 75% of AMF discovery reattempts failed.
Summary More than 75% of AMF discovery reattempts failed.
Severity Critical
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.53 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Table 8-186 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MAJOR_THRESHOLD
Description More than 50% of AMF discovery reattempts failed.
Summary More than 50% of AMF discovery reattempts failed.
Severity Major
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.54 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Table 8-187 UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File UE_AMF_DISCOVERY_REATTEMPT_FAILURE_ABOVE_MINOR_THRESHOLD
Description More than 25% of AMF discovery reattempts failed.
Summary More than 25% of AMF discovery reattempts failed.
Severity Minor
Expression (sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_response_total{operationType="timer_expiry_notification",responseCode!~"2.*"}[5m])) / sum by (namespace) (increase(occnp_ue_nf_discovery_reattempt_request_total{operationType="timer_expiry_notification"}[5m]))) * 100 > 25
OID 1.3.6.1.4.1.323.5.3.52.1.2.105
Metric Used occnp_ue_nf_discovery_reattempt_response_total
Recommended Actions The occnp_ue_nf_discovery_reattempt_response_total metric is pegged when PCF-UE receives a response to a message that goes out of the NF. This alert is raised when the number of reattempt failures while discovering AMF crosses the threshold. If failures increase, the operator can investigate:
  • Why the AMF discovery flow is failing, or
  • The health of the AMF to which the requests are sent.

Cause:

The main cause of the occnp_ue_nf_discovery_reattempt_response_total metric being pegged—indicating a notable number of reattempt failures during AMF discovery—is that the PCF-UE (Policy Control Function - User Equipment) is receiving non-success responses (failures) when retrying AMF (Access and Mobility Management Function) discovery requests.

Diagnostic Information:

  • AMF Unavailability or Health Issues: The target AMF may be down, unresponsive, overloaded, or otherwise unhealthy, resulting in failed or rejected discovery attempts.
  • Network Issues or Latency: Communication issues such as network congestion, high latency, or dropped packets between the PCF-UE and the AMF (or intermediary NFs) can cause discovery attempts to fail.
  • Incorrect Configuration: Misconfigurations in the PCF-UE or AMF—such as wrong endpoint addresses, security settings, or authentication parameters—may prevent the successful completion of discovery requests.
  • NRF (Network Repository Function) Problems: If AMF discovery relies on the NRF and the NRF is unhealthy or misconfigured, the PCF-UE may be unable to retrieve up-to-date or correct AMF information.
  • Resource Exhaustion: If the system is under heavy load or resources (CPU, memory, threads) are depleted, discovery requests may not be handled on time.
  • Timeouts and Slow Processing: Slow responses from the AMF or network timeouts can contribute to repeated reattempts and failures.

Recovery:

  • Review logs and error responses associated with AMF discovery attempts.
  • Check the health status and recent operational history of the target AMF and NRF.
  • Verify network health and connectivity between all relevant components.
  • Validate all associated configurations (PCF-UE, AMF, NRF).
If the issue persists, contact My Oracle Support.
8.1.2.55 INGRESS_ERROR_RATE_ABOVE_10_PERCENT_PER_POD

Table 8-188 INGRESS_ERROR_RATE_ABOVE_10_PERCENT_PER_POD

Field Details
Name in Alert Yaml File IngressErrorRateAbove10PercentPerPod
Description Ingress Error Rate above 10 Percent in {{$labels.kubernetes_name}} in {{$labels.kubernetes_namespace}}
Summary Transaction Error Rate in {{$labels.kubernetes_node}} (current value is: {{ $value }})
Severity Critical
Expression (sum by(pod)(rate(ocpm_ingress_response_total{response_code!~"2.*"}[24h]) or (up * 0)) / sum by(pod)(rate(ocpm_ingress_response_total[24h]))) * 100 >= 10

OID 1.3.6.1.4.1.323.5.3.52.1.2.2
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors.
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.

Cause

This alert fires when 10% or more of ingress (incoming) HTTP requests handled by any individual pod result in non-2xx (unsuccessful) responses, measured over a 1-day window. A high ingress error rate per pod suggests issues that could impact application availability, reliability, or user experience.

Common causes include:

  • Application-level errors (returning 4xx or 5xx status codes) due to bugs, configuration issues, invalid client requests, or backend failures
  • Resource exhaustion (CPU, memory, open connections) or saturation within the affected pod
  • Dependency failures (database, cache, or external service outages), causing the pod to respond with errors
  • Recent deployments, rollouts, or configuration changes introducing regressions or incompatibilities
  • Network problems or timeouts impacting request processing
  • Unhandled exceptions or circuit breaker activations
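The `or (up * 0)` clause in the alert expression substitutes a zero error rate for pods that have produced no error samples, so healthy pods still yield a valid (zero) ratio instead of a missing series. A rough Python equivalent of the per-pod computation (illustrative only, not Policy code):

```python
def pod_error_rate_pct(error_rates, total_rates):
    """Per-pod error percentage; a pod with no error series counts as 0 errors,
    mirroring the `or (up * 0)` fallback in the alert expression."""
    result = {}
    for pod, total in total_rates.items():
        if total == 0:
            continue  # no traffic for this pod, nothing to report
        errors = error_rates.get(pod, 0.0)  # missing error series -> treated as 0
        result[pod] = 100 * errors / total
    return result

# pod-a: 1.5 err/s out of 10 req/s -> 15% (fires the >= 10% alert);
# pod-b served traffic but emitted no error samples -> 0%.
print(pod_error_rate_pct({"pod-a": 1.5}, {"pod-a": 10.0, "pod-b": 8.0}))
```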

Diagnostic Information

  • Identify affected pods from alert labels
  • Review pod logs to categorize errors by type (4xx client errors, 5xx server errors, timeouts, etc.)
  • Correlate errors with spikes in traffic, resource usage, or specific endpoints
  • Examine resource utilization and health metrics (CPU, memory, connection pools, thread pools)
  • Check readiness/liveness probe status and pod restart history
  • Review changes in deployments, configurations, or dependencies preceding the alert
  • Investigate for signs of dependency issues, cascading failures, or external API problems

Recovery

  • Isolate and address root cause: Use logs, error breakdowns, and metrics to determine if issues are within the pod, code, dependencies, or external factors
  • Rollback if needed: If problems started following a recent deployment or config change, consider reverting
  • Increase resources or scale out: Add capacity if the pod is resource-constrained
  • Fix code or configuration: Resolve bugs, correct misconfigurations, or address unhandled cases
  • Remediate downstream/third-party issues: Work with owners of failing dependencies if external
Alert resolution: The alert will auto-resolve when the pod’s ingress error rate falls below 10% for the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.56 SM_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-189 SM_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File SMTrafficRateAboveThreshold
Description SM service Ingress traffic Rate is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total SM service Ingress traffic rate has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in PCF_Alertrules.yaml file is when SM service Ingress Rate crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.3
Metric Used ocpm_ingress_request_total{servicename_3gpp="npcf-smpolicycontrol"}
Recommended Actions The alert gets cleared when the Ingress traffic rate falls below the threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

It is recommended to assess the reason for additional traffic. Perform the following steps to analyze the cause of increased traffic:
  1. Refer to the Ingress Gateway section in Grafana to determine the increase in 4xx and 5xx error response codes.
  2. Check Ingress Gateway logs on Kibana to determine the reason for the errors.

Cause:

The metric ocpm_ingress_request_total is incremented for every inbound HTTP request reaching the SM service with the dimension servicename_3gpp="npcf-smpolicycontrol". If the 2-minute average rate exceeds 900 MPS, the system may be experiencing an overload or an abnormal spike in traffic.
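Prometheus derives this rate from successive samples of the counter; the arithmetic can be sketched as follows (a simplification: the real `rate()` function also compensates for counter resets):

```python
def counter_rate(earlier, later, window_seconds):
    """Per-second rate of a monotonically increasing counter between two samples.

    Simplified: assumes no counter reset occurred inside the window.
    """
    return (later - earlier) / window_seconds

# ocpm_ingress_request_total rose by 114,000 over a 2-minute window:
mps = counter_rate(1_000_000, 1_114_000, 120)
print(mps, mps > 900)  # 950.0 True -> above the 900 MPS threshold
```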

Diagnostic Information:

Examine Current Rate:

Query ocpm_ingress_request_total with servicename_3gpp="npcf-smpolicycontrol" to assess the current ingress traffic rate.

Review Upstream Sources:

Identify if request rates from any upstream SMF, AF, or TDF instances have increased.

Inspect Application Logs:

Check for WARN or ERROR messages in logs related to overload or congestion control rejections, which can help determine if the system is rejecting requests or experiencing resource pressure.

Recovery:

  • Throttle or Rate-Limit: Apply or adjust overload/congestion control configurations to throttle or rate-limit requests from SMF as appropriate, to restore rate to expected levels.
  • Scale Resources: Add more replicas to the sm-service deployment if needed to reduce the average rate per instance.
  • Threshold Adjustment: Adjust the alert threshold if normal traffic patterns or business requirements change.
Alert Resolution: When the sustained request rate stays below 900 MPS, Prometheus automatically clears the SM_TRAFFIC_RATE_ABOVE_THRESHOLD alert.

For any additional guidance, contact My Oracle Support.

8.1.2.57 SM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-190 SM_INGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File SMIngressErrorRateAbove10Percent
Description Transaction Error Rate detected above 10 Percent of Total on SM service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression The number of failed transactions is above 10 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.4
Metric Used ocpm_ingress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_ingress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
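These filtered queries can also be issued programmatically through the Prometheus HTTP API. The sketch below is illustrative only: `PROM_URL` is a placeholder for your Prometheus endpoint, and the `method` label name is an assumption for per-method filtering, not confirmed by this guide.

```python
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus:9090"  # placeholder: your Prometheus endpoint

def error_rate_query(method=None):
    """Build the step-1 PromQL, optionally narrowed to one HTTP method.

    Assumption: errors per method are exposed via a `method` label.
    """
    selector = 'servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"'
    if method:
        selector += ',method="%s"' % method
    return "sum by(response_code)(increase(ocpm_ingress_response_total{%s}[5m]))" % selector

def run_query(promql):
    """Execute the query via the Prometheus HTTP API (needs network access)."""
    url = PROM_URL + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url) as resp:
        return resp.read()

print(error_rate_query("PUT"))
```

For example, `run_query(error_rate_query("PUT"))` would return the non-2xx response counts for PUT requests over the last five minutes, broken down by response code.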

Cause

This alert fires when more than 10% of all HTTP responses returned by the SM service (npcf-smpolicycontrol) over the past day are non-2xx (that is, not successful). This may be due to:

  • Upstream or downstream system failures
  • Application-level errors (5xx codes)
  • Client-side or bad requests (4xx codes)
  • Misconfiguration, rate limiting, or resource exhaustion

Diagnostic Information

  • Break down error rates by response code to differentiate client, server, and other errors.
  • Search for error messages, stack traces, and signs of repeated failure or congestion.
  • Validate that dependencies (upstream services, DB) are functioning correctly.
  • Analyze recent deployments or config changes
  • Check for network latency

Recovery:

  • Identify and Address Root Cause: Use error breakdown and logs to pinpoint and fix the underlying issue.
  • Rollback Recent Changes: If a recent deployment is responsible, consider rolling back temporarily.
  • Scale or Resource Adjustment: Add resources if you detect resource exhaustion.
  • Rate Limiting or Throttling: Apply throttling to minimize error propagation from upstream.
Alert Resolution: Once the error rate remains below 10% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.58 SM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Table 8-191 SM_EGRESS_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Name in Alert Yaml File SMEgressErrorRateAbove1Percent
Description Egress Transaction Error Rate detected above 1 Percent of Total Transactions (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions
Severity Minor
Expression The number of failed transactions is above 1 percent of the total transactions.
OID 1.3.6.1.4.1.323.5.3.36.1.2.5
Metric Used ocpm_egress_response_total
Recommended Actions The alert gets cleared when the number of failed transactions are below 1% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
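
The breakdown described in these steps can be run as ad hoc Prometheus queries. The following sketch assumes the metric carries a method label in addition to the labels shown above; verify the actual label names in your deployment:

```promql
# Non-2xx egress responses for the SM service, split by response code
sum by (response_code) (rate(ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*"}[5m]))

# The same errors split by HTTP method (method label is an assumption)
sum by (method) (rate(ocpm_egress_response_total{servicename_3gpp="npcf-smpolicycontrol",response_code!~"2.*",method=~"GET|PUT|POST|DELETE|PATCH"}[5m]))
```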

Cause

This alert fires when more than 1% of all HTTP responses returned by the SM Service (npcf-smpolicycontrol) over the past day are non-2xx (i.e., not successful). This may be due to:

  • Upstream or downstream system failures
  • Application-level errors (5xx codes)
  • Client-side or bad requests (4xx codes)
  • Misconfiguration, rate limiting, or resource exhaustion

Diagnostic Information

  • Break down error rates by response code to differentiate client, server, and other errors.
  • Search for error messages, stack traces, and signs of repeated failure or congestion.
  • Validate that dependencies (upstream services, DB) are functioning correctly.
  • Analyze recent deployments or configuration changes.
  • Check for network latency.

Recovery:

  • Identify and Address Root Cause: Use error breakdown and logs to pinpoint and fix the underlying issue.
  • Rollback Recent Changes: If a recent deployment is responsible, consider rolling back temporarily.
  • Scale or Resource Adjustment: Add resources if you detect resource exhaustion.
  • Rate Limiting or Throttling: Apply throttling to minimize error propagation from upstream.
Alert Resolution: Once the error rate remains below 1% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.59 PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Table 8-192 PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD

Field Details
Name in Alert Yaml File PcfChfIngressTrafficRateAboveThreshold
Description User service Ingress traffic Rate from CHF is above threshold of Max MPS (current value is: {{ $value }})
Summary Traffic Rate is above 90 Percent of Max requests per second
Severity Major
Expression The total User Service Ingress traffic rate from CHF has crossed the configured threshold of 900 TPS.

Default value of this alert trigger point in PCF_Alertrules.yaml file is when user service Ingress Rate from CHF crosses 90% of maximum ingress requests per second.

OID 1.3.6.1.4.1.323.5.3.36.1.2.11
Metric Used ocpm_userservice_inbound_count_total{service_resource="chf-service"}
Recommended Actions

Cause:

The metric ocpm_userservice_inbound_count_total with dimension service_resource="chf-service" is incremented for every inbound HTTP request reaching the CHF connector service. If the 2-minute average exceeds 900 MPS, this indicates that the system may be experiencing an overload or an abnormal spike in traffic.

Diagnostic Information:

Examine Current Rate:

Query ocpm_userservice_inbound_count_total for service_resource="chf-service" to assess the current ingress traffic rate.
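
As a sketch, the current rate can be checked directly in Prometheus; the 2-minute window matches the averaging period mentioned in the Cause above, and the label names are taken from the Metric Used field:

```promql
# Average ingress requests per second toward the CHF connector over 2 minutes
sum(rate(ocpm_userservice_inbound_count_total{service_resource="chf-service"}[2m]))
```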

Review Upstream Sources:

Identify whether request rates from any upstream CHF, SMF, or AMF instances have increased.

Inspect Application Logs:

Check for WARN or ERROR messages in logs related to overload or congestion control rejections, which can help determine if the system is rejecting requests or experiencing resource pressure.

Recovery:

  • Throttle or Rate-Limit: Apply or adjust congestion control configurations to throttle requests from downstream services as appropriate, to restore rate to expected levels.
  • Scale Resources: Add more replicas to the CHF connector deployment if needed to reduce the average rate per instance.
  • Threshold Adjustment: Adjust the alert threshold if normal traffic patterns or business requirements change.
Alert Resolution: When the sustained request rate stays below 900 MPS, Prometheus will automatically clear the PCF_CHF_INGRESS_TRAFFIC_RATE_ABOVE_THRESHOLD alert.

For any additional guidance, contact My Oracle Support.

8.1.2.60 PCF_CHF_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Table 8-193 PCF_CHF_EGRESS_ERROR_RATE_ABOVE_10_PERCENT

Field Details
Name in Alert Yaml File PcfChfEgressErrorRateAbove10Percent
Description The number of failed transactions from CHF is more than 10 percent of the total transactions.
Summary Transaction Error Rate detected above 10 Percent of Total Transactions
Severity Critical
Expression

(sum(rate(ocpm_chf_tracking_response_total {servicename_3gpp="nchf-spendinglimitcontrol",response_code!~"2.*"} [24h]) or (up * 0 ) ) / sum(rate(ocpm_chf_tracking_response_total {servicename_3gpp="nchf-spendinglimitcontrol"} [24h]))) * 100 >= 10

OID 1.3.6.1.4.1.323.5.3.36.1.2.12
Metric Used ocpm_chf_tracking_response_total
Recommended Actions The alert gets cleared when the number of failure transactions falls below the configured threshold.

Note: Threshold levels can be configured using the PCF_Alertrules.yaml file.

It is recommended to assess the reason for failed transactions. Perform the following steps to analyze the cause of the errors:
  1. Refer to the Egress Gateway section in Grafana to determine the increase in 4xx and 5xx error response codes.
  2. Check the Egress Gateway logs on Kibana to determine the reason for the errors.
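
An illustrative fragment of how this rule and its configurable threshold might appear in the PCF_Alertrules.yaml file (the rule name is taken from the table above; the exact structure and annotations in your deployed file may differ):

```yaml
- alert: PcfChfEgressErrorRateAbove10Percent
  expr: (sum(rate(ocpm_chf_tracking_response_total{servicename_3gpp="nchf-spendinglimitcontrol",response_code!~"2.*"}[24h]) or (up * 0)) / sum(rate(ocpm_chf_tracking_response_total{servicename_3gpp="nchf-spendinglimitcontrol"}[24h]))) * 100 >= 10
  labels:
    severity: critical
  annotations:
    summary: 'Transaction Error Rate detected above 10 Percent of Total Transactions'
```

The trailing `>= 10` comparison is where the threshold level is adjusted.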

Cause:

This alert fires when more than 10% of all HTTP responses for the PCF CHF connector (the PCF component that calls the external CHF via nchf-spendinglimitcontrol) over the past day are non-2xx (i.e., not successful). This may be due to:

  • External CHF partial outage or dependency failures.
  • Application-level errors (5xx) or timeouts on the CHF path.
  • Client/bad requests (4xx) from the CHF connector due to schema/version or auth issues.
  • Misconfiguration, rate limiting/throttling, TLS/mTLS or DNS problems, or resource exhaustion.

Diagnostic Information:

  • Break down error rates by response class (4xx vs 5xx vs timeouts/TLS/connect resets).
  • Search CHF connector service logs and traces for recurring errors, stack traces, circuit-breaker events, or congestion.
  • Validate external CHF health and dependencies (service/DB), and check for throttling indicators.
  • Analyze recent deployments or configuration changes in PCF or CHF (endpoints, timeouts, retries, API versions).
  • Check for traffic spikes, connection pool saturation, CPU/memory pressure, or elevated latency.

Recovery:

  • Identify and address root cause: Use error breakdown, logs, and traces to pinpoint whether the issue is in the PCF CHF client, network/TLS/auth, or the external CHF.
  • Roll back recent changes: Temporarily revert relevant PCF/CHF deployments or configs if correlated with the onset.
  • Scale or resource adjustment: Increase capacity or tune connection/thread pools; enable autoscaling if appropriate.
  • Rate limiting or throttling: Use bounded retries with backoff and apply throttling to reduce cascading failures.
Alert resolution: Once the non-2xx rate remains below 10% for a sustained period (1 day), the alert will auto-resolve.

For any additional guidance, contact My Oracle Support.

8.1.2.61 PCF_CHF_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Table 8-194 PCF_CHF_INGRESS_TIMEOUT_ERROR_ABOVE_MAJOR_THRESHOLD

Field Details
Description Ingress Timeout Error Rate detected above 10 Percent of Total towards CHF service (current value is: {{ $value }})
Summary Timeout Error Rate detected above 10 Percent of Total Transactions
Severity Major
Expression The number of failed transactions due to timeout is above 10 percent of the total transactions for CHF service.
OID 1.3.6.1.4.1.323.5.3.36.1.2.17
Metric Used ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol"}
Recommended Actions The alert gets cleared when the number of failed transactions due to timeout is below 10% of the total transactions.
To assess the reason for failed transactions, perform the following steps:
  1. Check the service specific metrics to understand the service specific errors. For instance: ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol"}
  2. The service specific errors can be further filtered for errors specific to a method such as GET, PUT, POST, DELETE, and PATCH.
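
Step 2 above can be expressed as a Prometheus query; the method label is an assumption and should be checked against the labels actually exported by the metric:

```promql
# Timed-out CHF requests split by HTTP method
sum by (method) (rate(ocpm_chf_tracking_request_timeout_total{servicename_3gpp="nchf-spendinglimitcontrol",method=~"GET|PUT|POST|DELETE|PATCH"}[5m]))
```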

Cause:

This alert is triggered when more than 10% of all inbound requests from PCF (Policy Control Function) to the CHF (nchf-spendinglimitcontrol) time out over a 1-day window. This may impact charging, quota enforcement, or service delivery.

Common causes include:

  • Network latency, intermittent packet loss, or connectivity issues between PCF and CHF
  • Overload, resource congestion, or unresponsiveness in the CHF or its dependencies
  • Resource exhaustion or scaling limits in the PCF, CHF, or intermediary components
  • Misconfiguration of timeout thresholds, retries, or circuit breaker settings
  • Downstream service or database issues affecting CHF’s ability to respond in time
  • Recent changes or deployments that introduced performance bottlenecks or regressions

Diagnostic Information:

  • Identify which part of the infrastructure is experiencing timeouts: is it consistent across all traffic or localized?
  • Review logs from PCF, CHF, and network/security appliances for repeated timeout, retry, or connection reset events
  • Check health dashboards for CHF (CPU, memory, response latency, DB availability, etc.)
  • Analyze request/response timings, queue lengths, and backlog at ingress points
  • Correlate with recent deployment, scaling, or network changes
  • Examine resource usage and pod health for PCF and CHF components

Recovery:

  • Isolate the root cause: Use logs and health metrics to determine if the problem is with CHF availability, network path, or PCF.
  • Scale or optimize: Increase resources, scale instances, or optimize configuration for PCF and CHF services as needed.
  • Rollback if needed: If the alert correlates with new deployments or config changes, consider reverting.
  • Network remediation: Address any identified network latency, packet loss, or DNS resolution issues.
  • Tune configuration: Adjust timeout settings, connection pools, and retry logic based on observed conditions.
  • Coordinate: Engage CHF, PCF, and platform support teams as needed for collaborative troubleshooting.

Alert Resolution: This alert will auto-resolve once the ingress timeout error rate drops below 10% of total requests to CHF over the evaluation window.

For any additional guidance, contact My Oracle Support.

8.1.2.62 PCF_PENDING_BINDING_SITE_TAKEOVER

Table 8-195 PCF_PENDING_BINDING_SITE_TAKEOVER

Field Details
Description The site takeover configuration has been activated
Summary The site takeover configuration has been activated
Severity CRITICAL
Expression sum by (application, container, namespace) (changes(occnp_pending_binding_site_takeover[2m])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.45
Metric Used occnp_pending_binding_site_takeover
Recommended Actions

Cause:

This alert fires when the site takeover functionality is engaged to handle geo-redundancy scenarios. Site takeover is typically activated when a site in a distributed PCF deployment is down or unreachable, empowering another site to process that site’s pending binding operations for service continuity.

Diagnostic Information:

  • Check configuration to confirm the alternate site profile is correctly set and the takeover flag is enabled.
  • Examine PendingOperation records to ensure the alternate site is processing entries from the down site’s site ID.
  • Review service logs for site takeover-related events, handoff messages, and any associated errors during takeover or operation processing.

Recovery & Actions:

  • Verify that site takeover activation was intentional and aligns with fail-over or DR (Disaster Recovery) procedures.
  • Monitor processing of pending operations for successful handoff and completion under the alternate site.
  • Communicate with relevant operations/support teams about the takeover to prevent conflicting operations.
  • Disable site takeover once the original site is restored to normal operation, so pending operations revert to their standard ownership and workflow.
  • Audit for any missed or failed operations during the site handover, and remediate as needed.

Alert Resolution: The alert will auto-resolve once there are no new site takeover events, and the takeover configuration is deactivated or no longer required.

For any additional guidance, contact My Oracle Support.

8.1.2.63 PCF_PENDING_BINDING_THRESHOLD_LIMIT_REACHED

Table 8-196 PCF_PENDING_BINDING_THRESHOLD_LIMIT_REACHED

Field Details
Description The Pending Operation table threshold has been reached.
Summary The Pending Operation table threshold has been reached.
Severity CRITICAL
Expression sum by (application, container, namespace) (changes(occnp_threshold_limit_reached_total[2m])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.46
Metric Used occnp_threshold_limit_reached_total
Recommended Actions

Cause

This alert fires when the number of records in the Pending Operation table (used to reattempt binding registration in BSF at a later time) reaches a predefined threshold. This means the system's retry or pending queue for binding operations is saturated and may be at risk of delaying or failing new operations. Exceeding this threshold typically signals that pending binding registrations are not clearing at the expected rate.

Common causes include:

  • Persistent errors or failures from BSF in response to binding attempts, triggering retries
  • Widespread or systemic service degradation in BSF, Binding Service, or network paths
  • Application bugs resulting in stuck or orphaned PendingOperation records
  • Misconfigured thresholds, retry intervals, or logic in SM or Binding Service
  • Resource starvation (CPU, memory, DB connections) preventing timely processing of pending operations
  • Recent deployments, configuration updates, or load spikes overwhelming the binding flow

Diagnostic Information

  • Check the volume, age, and growth trend of records in the Pending Operation table
  • Correlate with other alerts or incident tickets related to BSF, Binding Service, network, or DB health
  • Analyze logs from SM Service, Binding Service, and (if applicable) Audit Service for repeated errors, retry loops, or slow processing
  • Review recent deployments or configuration changes to PCF Service components
  • Inspect resource utilization for relevant pods, containers, and backend storage
  • Confirm correct configuration of the threshold limit, retry intervals, and error code handling
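
The occupancy trend of the Pending Operation table can be watched with the occnp_pending_operation_records_count metric used by the related PCF_PENDING_BINDING_RECORDS_COUNT alert (a sketch; grouping labels follow the alert expressions in this section):

```promql
# Current Pending Operation record count per application/container/namespace
sum by (application, container, namespace) (occnp_pending_operation_records_count)
```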

Recovery

  • Prioritize clearing pending records: Investigate and remediate the root cause(s) of unprocessed binding operations (BSF issues, infra bottlenecks, logic bugs)
  • Scale resources or prioritize processing: Add capacity or redistribute load if resource constraints are found
  • Tune configuration: Adjust thresholds, error code mappings, and retry intervals as necessary
  • Audit retry and cleanup logic: Ensure orphaned or stale records are purged and retry logic is functioning as intended
  • Rollback if needed: If issue began with a recent deployment or config change, consider reverting
  • Coordinate across teams: Engage with BSF, Infrastructure, and DB owners as required
Alert resolution: The alert will auto-resolve once the number of records in the Pending Operation table returns below the configured threshold and normal processing resumes.

For any additional guidance, contact My Oracle Support.

8.1.2.64 PCF_PENDING_BINDING_RECORDS_COUNT

Table 8-197 PCF_PENDING_BINDING_RECORDS_COUNT

Field Details
Description An attempt to internally recreate a PCF binding has been triggered by PCF
Summary An attempt to internally recreate a PCF binding has been triggered by PCF
Severity MINOR
Expression sum by (application, container, namespace) (changes(occnp_pending_operation_records_count[10s])) > 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.47
Metric Used occnp_pending_operation_records_count
Recommended Actions

Cause

This alert fires when a new pending binding operation is inserted into the system by the SM Service (to reattempt binding registration in BSF at a later time). This typically happens when the BSF reattempt settings are configured and the response from BSF to a binding registration indicates an error condition that requires a retry (as per pre-configured error codes).

Common causes for entries in the PendingOperation table include:

  • BSF returns a transient or retry-eligible error code in response to binding requests.
  • Temporary unavailability or instability of BSF or related network paths.
  • Application bugs leading to improper handling of BSF responses or retry logic.
  • Recent configuration changes impacting retry or error handling logic.

Diagnostic Information

  • Review SM Service and binding service logs to trace binding requests, BSF response codes, and the creation/updating of PendingOperations.
  • Verify resource utilization and health across relevant pods or containers.
  • Analyze timing and volume of pending operation records; spikes may indicate regression or external service instability.

Recovery

  • Monitor pending operation clearance: Confirm that retries triggered by Audit Service notifications are processed and successfully clear pending records.
  • Investigate recurring or persistent errors: If retries are frequently required or repeatedly fail, drill down to BSF responses, retry outcomes, and any correlated infrastructure issues.
  • Coordinate with BSF/service owners: If an underlying BSF or network problem persists, work with those teams to restore normal registration flow.
  • Tune configuration: Adjust error code mapping, retry intervals, or thresholds based on observed workload and service behavior.
  • Rollback if needed: Revert recent deployments or config updates if they correlate with spikes in pending operations.
Alert resolution: The alert will auto-resolve when new pending binding operation records are no longer being routinely created, retries are succeeding, and the overall pending queue stabilizes or clears.

For any additional guidance, contact My Oracle Support.

8.1.2.65 AUTONOMOUS_SUBSCRIPTION_FAILURE

Table 8-198 AUTONOMOUS_SUBSCRIPTION_FAILURE

Field Details
Description Autonomous subscription failed for a configured Slice Load Level
Summary Autonomous subscription failed for a configured Slice Load Level
Severity Critical
Expression The number of failed Autonomous Subscription for a configured Slice Load Level in nwdaf-agent is greater than zero.
OID 1.3.6.1.4.1.323.5.3.52.1.2.49
Metric Used subscription_failure{requestType="autonomous"}
Recommended Actions The alert gets cleared when the failed Autonomous Subscription is corrected.
To clear the alert, perform the following steps:
  1. Delete the Slice Load Level configuration.
  2. Re-provision the Slice Load Level configuration.
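
To see which slice is affected, the failure metric from the table can be queried directly; the name of the label carrying the S-NSSAI is an assumption, so inspect the metric's labels in your deployment:

```promql
# Autonomous subscription failures, grouped by the slice identifier label
sum by (snssai) (subscription_failure{requestType="autonomous"}) > 0
```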

Cause:

This alert activates when there is at least one autonomous subscription (such as the NWDAF event subscription process) failure detected for a given S-NSSAI, indicating that the system was unable to successfully initiate or maintain a subscription for a specific network slice. Common causes may include:

  • Remote service (e.g., NWDAF) is unavailable, responds with a failure, or returns an error code.
  • Authentication/authorization failures (invalid tokens, credentials, certificates).
  • Incorrect, missing, or unsupported subscription parameters (S-NSSAI, event types, notification targets).
  • API version or schema mismatches between subscribing and serving systems.
  • Rate limiting, resource exhaustion, or capacity constraints in remote service.
  • Network or DNS/connectivity problems between components.
  • Recent deployment or configuration change introducing new issues.

Diagnostic Information:

  • Check which S-NSSAI (network slice) is affected using the alert labels.
  • Review NWDAF Agent service logs, and collect relevant error codes and messages from the failed subscription attempts.
  • Examine recent changes or deployments to the NWDAF Agent, remote NWDAF, or related interfaces/services.
  • Assess service health and connectivity between the agent and NWDAF (latency, errors, authentication status).
  • Validate the subscription request payload, endpoint URLs, and configuration for the target S-NSSAI.
  • Look for evidence of transient or repeated network/service issues.

Recovery:

  • Identify the failed subscription(s): Use the alert labels and logs to pinpoint the slice(s) affected.
  • Resolve remote or local service issues: Work with relevant teams to restore NWDAF or agent functionality, address authentication or network problems, or resolve configuration mismatches.
  • Retry or re-initiate subscriptions as needed after addressing the root cause.
  • Rollback changes if the alert coincides with recent deployments, configuration modifications, or rollouts.
Alert Resolution: This alert will automatically resolve once the system detects that there are no new autonomous subscription failures (i.e., no new increments in the failure counter) for the affected S-NSSAI(s) within the evaluation window. Successful re-establishment or correction will clear the alert.

For any additional guidance, contact My Oracle Support.

8.1.2.66 AM_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Table 8-199 AM_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description AM Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Summary AM Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Severity MINOR
Expression (sum(rate(http_out_conn_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d])) / sum(rate(http_out_conn_response_total{pod=~".*amservice.*",servicename3gpp="npcf-am-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.54
Metric Used http_out_conn_response_total
Recommended Actions

Cause

This alert triggers when 1% or more of notification requests sent from the AM service (part of PCF) to the AMF (npcf-am-policy-control endpoint) result in non-2xx (unsuccessful) responses over a 1-day window. These notifications inform AMF about access or mobility events. A significant portion of errors could be 404 responses, which occur when AMF does not have the corresponding session in its context. This may indicate attempts to notify AMF about sessions that have already ended or were never established.

Other possible causes include:

  • Partial outage, degradation, or overload in the AMF
  • Application errors in the AM service or AMF (e.g., other 4xx or 5xx codes)
  • Schema or API mismatches due to recent deployments or configuration changes
  • Authentication, authorization, or TLS certificate issues
  • Network/connectivity problems
  • Resource exhaustion in the AMF

Diagnostic Information

  • Break down non-2xx responses by HTTP status code, especially 404 versus other 4xx/5xx
  • Examine AM service and AMF logs for detailed error messages and patterns
  • Review session establishment, update, and termination flows in both AM service and AMF
  • Investigate recent deployments, configuration changes, or spikes in error rates
  • Assess resource usage and health of both AM service and AMF
  • Validate API contracts, payload formats, and endpoint configurations
  • Check for authentication/authorization or certificate issues
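
The status-code breakdown in the first diagnostic step above can be sketched with the labels from the alert expression:

```promql
# Non-2xx AM notification responses toward AMF, split by status code
sum by (responseCode) (rate(http_out_conn_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d]))
```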

Recovery

  • Identify and resolve the root cause: Use logs, traces, and error breakdowns to determine if high 404 rates are expected (due to session lifecycle), or if there is a systematic issue such as stale notifications
  • Tune notification logic: Adjust workflows to minimize duplicate or late notifications when sessions may already have ended
  • Rollback or adjust recent changes: If errors correlate with deployments or config updates, consider reverting them
  • Scale or adjust resources: Add capacity or tune connection/timeouts if resource exhaustion is present
  • Remediate network or security problems: Ensure stable communication and correct authentication/certificates between PCF and AMF
Alert resolution: The alert will auto-resolve when the error rate drops below 1% over the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.67 AM_AR_ERROR_RATE_ABOVE_1_PERCENT

Table 8-200 AM_AR_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description Alternate Routing Error Rate detected above 1 Percent of Total on AM Service (current value is: {{ $value }})
Summary Alternate Routing Error Rate detected above 1 Percent of Total on AM Service (current value is: {{ $value }})
Severity MINOR
Expression (sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d])) / sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*amservice.*",servicename3gpp="npcf-am-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.55
Metric Used ocpm_ar_response_total
Recommended Actions

Cause

This alert fires when 1% or more of alternate routing (AR) requests initiated by the AM service (as part of PCF) to AMF (npcf-am-policy-control) result in non-2xx (unsuccessful) responses over a 1-day window, grouped by FQDN.

Alternate routing is the process of retrying the original request to a different AMF instance when the initial attempt fails. A rising AR error rate suggests persistent issues with connectivity, service health, or configuration for primary or alternate AMF endpoints.

Typical causes include:

  • Persistent unavailability, overload, or partial outages affecting some or all AMF instances
  • Application-level errors from AMF (many 4xx/5xx responses, including 404s for missing sessions)
  • Schema or API incompatibility after deployments or configuration changes
  • Authentication, authorization, or certificate-related failures during retries
  • Network or DNS problems affecting communication with one or more AMF instances
  • Resource exhaustion, scaling issues, or retry storm in the AM service
  • Misconfiguration of alternate endpoint lists or retry logic

Diagnostic Information

  • Break down failed AR responses by HTTP status code (4xx, 5xx, timeouts) to pinpoint the failure type
  • Review AM service logs to identify why alternate routing was triggered and the response from each retry
  • Inspect AMF logs for errors and session context associated with AR requests
  • Assess health, status, and readiness of all AMF endpoints relevant to the alerting FQDN
  • Check authentication credentials, certificate validity, and endpoint configuration
  • Correlate AR error spikes with recent deployments, updates, scaling actions, or network incidents
  • Analyze retry logic to ensure backoff and failover policies are working as expected
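
The per-endpoint breakdown described above can be sketched as follows (labels taken from the alert expression):

```promql
# Failed alternate-routing responses per AMF FQDN, split by status code
sum by (fqdn, responseCode) (rate(ocpm_ar_response_total{pod=~".*amservice.*",responseCode!~"2.*",servicename3gpp="npcf-am-policy-control"}[1d]))
```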

Recovery

  • Isolate the root cause: Use logs and metrics to determine if AR failures are due to persistent AMF unavailability, configuration problems, or retry logic bugs
  • Remediate endpoint or network issues: Restore AMF health, increase capacity, or fix network connectivity to all AMF endpoints
  • Fix authentication or certificate problems: Update or refresh security credentials as necessary
  • Adjust or rollback changes as needed: If increased errors align with a recent deployment or config update
  • Tune retry/backoff policies: Update AR configuration to minimize repeated failures or retry storms
Alert resolution: The alert auto-resolves once the AR error rate drops below 1% over the measurement window

For any additional guidance, contact My Oracle Support.

8.1.2.68 UE_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Table 8-201 UE_NOTIFICATION_ERROR_RATE_ABOVE_1_PERCENT

Field Details
Description UE Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Summary UE Notification Error Rate detected above 1 Percent of Total (current value is: {{ $value }})
Severity MINOR
Expression (sum(rate(http_out_conn_response_total{pod=~".*ueservice.*",responseCode!~"2.*",servicename3gpp="npcf-ue-policy-control"}[1d])) / sum(rate(http_out_conn_response_total{pod=~".*ueservice.*",servicename3gpp="npcf-ue-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.56
Metric Used http_out_conn_response_total
Recommended Actions

Cause

This alert triggers when 1% or more of notification requests sent from the UE service (part of PCF) to the AMF (npcf-ue-policy-control endpoint) result in non-2xx (unsuccessful) responses over a 1-day window. These notifications inform AMF about UE policy events. A significant portion of errors could be 404 responses, which occur when AMF does not have the corresponding session in its context. This may indicate attempts to notify AMF about sessions that have already ended or were never established.

Other possible causes include:

  • Partial outage, degradation, or overload in the AMF
  • Application errors in the UE service or AMF (e.g., other 4xx or 5xx codes)
  • Schema or API mismatches due to recent deployments or configuration changes
  • Authentication, authorization, or TLS certificate issues
  • Network/connectivity problems
  • Resource exhaustion in the AMF

Diagnostic Information

  • Break down non-2xx responses by HTTP status code, especially 404 versus other 4xx/5xx
  • Examine UE service and AMF logs for detailed error messages and patterns
  • Review session establishment, update, and termination flows in both UE service and AMF
  • Investigate recent deployments, configuration changes, or spikes in error rates
  • Assess resource usage and health of both UE service and AMF
  • Validate API contracts, payload formats, and endpoint configurations
  • Check for authentication/authorization or certificate issues

Recovery

  • Identify and resolve the root cause: Use logs, traces, and error breakdowns to determine if high 404 rates are expected (due to session lifecycle), or if there is a systematic issue such as stale notifications
  • Tune notification logic: Adjust workflows to minimize duplicate or late notifications when sessions may already have ended
  • Rollback or adjust recent changes: If errors correlate with deployments or config updates, consider reverting them
  • Scale or adjust resources: Add capacity or tune connection/timeouts if resource exhaustion is present
  • Remediate network or security problems: Ensure stable communication and correct authentication/certificates between PCF and AMF
Alert resolution: The alert will auto-resolve when the error rate drops below 1% over the measuring window

For any additional guidance, contact My Oracle Support.

8.1.2.69 UE_AR_FAILURE_RATE_ABOVE_1_PERCENT

Table 8-202 UE_AR_FAILURE_RATE_ABOVE_1_PERCENT

Field Details
Description Alternate Routing Error Rate detected above 1 Percent of Total on UE Service (current value is: {{ $value }})
Summary Transaction Error Rate detected above 1 Percent of Total Transactions on UE Alternate Routing
Severity MINOR
Expression (sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*ueservice.*",responseCode!~"2.*",servicename3gpp="npcf-ue-policy-control"}[1d])) / sum by (fqdn) (rate(ocpm_ar_response_total{pod=~".*ueservice.*",servicename3gpp="npcf-ue-policy-control"}[1d]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.57
Metric Used ocpm_ar_response_total
Recommended Actions

Cause

This alert fires when 1% or more of alternate routing (AR) requests initiated by the UE service (as part of PCF) to AMF (npcf-ue-policy-control) result in non-2xx (unsuccessful) responses over a 1-day window, grouped by FQDN.

Alternate routing is the process of retrying the original request to a different AMF instance when the initial attempt fails. A rising AR error rate suggests persistent issues with connectivity, service health, or configuration for primary or alternate AMF endpoints.

Typical causes include:

  • Persistent unavailability, overload, or partial outages affecting some or all AMF instances
  • Application-level errors from AMF (many 4xx/5xx responses, including 404s for missing sessions)
  • Schema or API incompatibility after deployments or configuration changes
  • Authentication, authorization, or certificate-related failures during retries
  • Network or DNS problems affecting communication with one or more AMF instances
  • Resource exhaustion, scaling issues, or retry storm in the AM service
  • Misconfiguration of alternate endpoint lists or retry logic

Diagnostic Information

  • Break down failed AR responses by HTTP status code (4xx, 5xx, timeouts) to pinpoint the failure type
  • Review AM service logs to identify why alternate routing was triggered and the response from each retry
  • Inspect AMF logs for errors and session context associated with AR requests
  • Assess health, status, and readiness of all AMF endpoints relevant to the alerting FQDN
  • Check authentication credentials, certificate validity, and endpoint configuration
  • Correlate AR error spikes with recent deployments, updates, scaling actions, or network incidents
  • Analyze retry logic to ensure backoff and failover policies are working as expected
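As a starting point for the diagnostic steps above, the failed AR responses can be broken down by status code directly in Prometheus. This query reuses the labels from the alert expression; a shorter window such as [1h] is used here to localize recent failures, and the exact label set may vary by release:

```promql
# Failed alternate-routing responses per FQDN and HTTP status code (last hour)
sum by (fqdn, responseCode) (
  rate(ocpm_ar_response_total{
    pod=~".*ueservice.*",
    responseCode!~"2.*",
    servicename3gpp="npcf-ue-policy-control"
  }[1h])
)
```

A dominant 5xx code points at AMF-side failures, while 4xx codes usually indicate request, session-state, or authorization problems.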

Recovery

  • Isolate the root cause: Use logs and metrics to determine if AR failures are due to persistent AMF unavailability, configuration problems, or retry logic bugs
  • Remediate endpoint or network issues: Restore AMF health, increase capacity, or fix network connectivity to all AMF endpoints
  • Fix authentication or certificate problems: Update or refresh security credentials as necessary
  • Adjust or rollback changes as needed: If increased errors align with a recent deployment or config update
  • Tune retry/backoff policies: Update AR configuration to minimize repeated failures or retry storms
Alert resolution: The alert auto-resolves once the AR error rate drops below 1% over the measurement window

For any additional guidance, contact My Oracle Support.

8.1.2.70 SMSC_CONNECTION_DOWN

Table 8-203 SMSC_CONNECTION_DOWN

Field Details
Description Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Summary Connection to SMSC peer {{$labels.smscName}} is down in notifier service pod {{$labels.pod}}
Severity MAJOR
Expression sum by(namespace, pod, smscName)(occnp_active_smsc_conn_count) == 0
OID 1.3.6.1.4.1.323.5.3.52.1.2.63
Metric Used occnp_active_smsc_conn_count
Recommended Actions

Cause

This alert fires when the connection count to a specific SMSC (Short Message Service Center) peer (smscName) drops to zero in a notifier service pod. This means that the notifier service in the indicated pod has lost connectivity with the SMSC peer, which may halt or delay SMS delivery for affected sessions.

Common causes include:

  • Network connectivity issues between the notifier pod and the SMSC peer (latency, packet loss, firewall changes)
  • SMSC peer instance is offline, unresponsive, or undergoing maintenance
  • Unexpected restart or crash of the notifier service pod
  • TCP session timeout, reset, or socket exhaustion
  • TLS/certificate negotiation failures (if applicable)
  • Misconfiguration of SMSC endpoint, port, or authentication details
  • Recent pod or infrastructure changes affecting networking or endpoints

Diagnostic Information

  • Identify which namespace, pod, and smscName are affected from alert labels
  • Check notifier pod logs for errors, timeouts, or repeated reconnection attempts to the SMSC
  • Confirm SMSC peer health and status via monitoring tools or coordination with peer’s operations
  • Validate network connectivity (test with ping/telnet/traceroute), DNS resolution, and firewall or security rules
  • Review recent changes in deployment, SMSC endpoint configuration, or certificate rotation
  • Check for underlying resource issues (CPU, memory, open file/socket limits) on the notifier pod
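To confirm which peers and pods are affected, the alert's own metric can be queried per pod and peer; any series reporting zero identifies a broken connection:

```promql
# Active SMSC connections per notifier pod and peer; 0 means the link is down
sum by (namespace, pod, smscName) (occnp_active_smsc_conn_count)
```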

Recovery

  • Restore connectivity: Address any network or firewall problems between the notifier pod and SMSC peer
  • Restart services: If the notifier pod is in a bad state, restart it to reestablish the connection
  • Engage SMSC operations: If the peer is down, coordinate with the SMSC provider/team to restore service
  • Correct configuration: Verify endpoint settings, authentication, and port assignments in both notifier and SMSC
  • Rollback recent changes: If disconnection began after deployment or configuration change, consider reverting
Alert resolution: The alert will auto-resolve once the connection count returns above zero for the affected pod and SMSC

For any additional guidance, contact My Oracle Support.

8.1.2.71 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Table 8-204 LOCK_ACQUISITION_EXCEEDS_MINOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMinorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the minor threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 20 Percent of Total Transactions.
Severity Minor
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 20 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 20% and 50% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking
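The current failure ratio per namespace can be inspected with the same expression the alert uses, minus the threshold, to see how close each namespace is to the 20%/50%/75% boundaries:

```promql
# Percentage of acquireLock requests that failed over the last 5 minutes
(
  sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m]))
/
  sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))
) * 100
```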

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 20%. If the rate exceeds 50%, a higher severity alert will trigger.
8.1.2.72 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Table 8-205 LOCK_ACQUISITION_EXCEEDS_MAJOR_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsMajorThreshold
Description The count of lock requests that fail to acquire the lock exceeds the major threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 50 Percent of Total Transactions.
Severity Major
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 50 < 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, between 50% and 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 50%. If the rate exceeds 75%, a higher severity alert will trigger.
8.1.2.73 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Table 8-206 LOCK_ACQUISITION_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Name in Alert Yaml File lockAcquisitionExceedsCriticalThreshold
Description The count of lock requests that fail to acquire the lock exceeds the critical threshold limit (current value is: {{ $value }})
Summary Keys used in Bulwark lock request which are already in locked state detected above 75 Percent of Total Transactions.
Severity Critical
Expression (sum by (namespace) (increase(lock_response_total{requestType="acquireLock",responseType="failure"}[5m])) /sum by (namespace) (increase(lock_request_total{requestType="acquireLock"}[5m]))) * 100 >= 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.69
Metric Used lock_request_total
Recommended Actions

Cause

This alert fires when, within a 5-minute window, above 75% of lock acquisition requests (acquireLock) to the Bulwark service in any namespace fail. Elevated lock acquisition failure rates may indicate:

  • Lock contention, with multiple clients attempting to acquire the same lock/resource concurrently (hot spots)
  • Stale or orphaned locks that are not being properly released
  • Performance degradation or partial outages in the Coherence distributed cache backend used by Bulwark
  • Misconfigured lock TTL (time to live), expiry, or retry/backoff policies
  • Recent deployment, scaling events, or increased load causing higher lock demand or contention
  • Bugs in the client logic resulting in frequent or incorrect lock requests

Diagnostic Information

  • Identify affected namespaces and resources prone to high contention or failure
  • Examine Bulwark and application logs for specific lock acquisition errors or contention/wait messages
  • Review the health of the Bulwark service (and Coherence cluster), including resource utilization (CPU, memory)
  • Check lock TTL and cleanup mechanisms to ensure timely lock release by both typical and failure pathways
  • Analyze trends following deployments, configuration changes, or traffic spikes
  • Assess and validate the configuration for Bulwark (connection pools, timeouts, backoff settings)
  • Investigate for node clock skew, which can impact distributed locking

Recovery

  • Reduce Contention: Identify and resolve any traffic pattern that causes lock contention
  • Backend Remediation: Scale or optimize Bulwark and address any backend health issues
  • Configuration Tuning: Adjust TTLs, retry intervals, and backoff strategies for optimal application behavior
  • Rollback if Needed: Revert recent changes to Bulwark deployments or configurations if correlated to failure spikes
Alert Resolution: Alert will auto-resolve once lock acquisition failure rates in a namespace drop below 75%.
8.1.2.74 SM_UPDATE_NOTIFY_FAILED_ABOVE_50_PERCENT

Table 8-207 SM_UPDATE_NOTIFY_FAILED_ABOVE_50_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 50 < 60
Summary Update Notify Terminate sent to SMF failed >= 50 < 60
Severity MINOR
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 50 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, between 50% and 60% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF
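For the first diagnostic step, the non-2xx terminate_notify responses can be split by status code to separate expected 404s (sessions already removed on SMF) from 5xx and other 4xx failures. The labels follow the alert expression above:

```promql
# Failed terminate_notify responses to SMF, broken down by HTTP status code
sum by (responseCode) (
  occnp_http_out_conn_response_total{
    operationType="terminate_notify",
    pod=~".*smservice.*",
    servicename3gpp="npcf-smpolicycontrol",
    responseCode!~"2.*"
  }
)
```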

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 50% for the evaluation window. A higher-severity alert may trigger if failures exceed 60%.

For any additional guidance, contact My Oracle Support.

8.1.2.75 SM_UPDATE_NOTIFY_FAILED_ABOVE_60_PERCENT

Table 8-208 SM_UPDATE_NOTIFY_FAILED_ABOVE_60_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 60 < 70
Summary Update Notify Terminate sent to SMF failed >= 60 < 70
Severity MAJOR
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 60 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, between 60% and 70% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 60% for the evaluation window. A higher-severity alert may trigger if failures exceed 70%.

For any additional guidance, contact My Oracle Support.

8.1.2.76 SM_UPDATE_NOTIFY_FAILED_ABOVE_70_PERCENT

Table 8-209 SM_UPDATE_NOTIFY_FAILED_ABOVE_70_PERCENT

Field Details
Description Update Notify Terminate sent to SMF failed >= 70
Summary Update Notify Terminate sent to SMF failed >= 70
Severity CRITICAL
Expression (sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol",responseCode!~"2.*"})*100)/ sum(occnp_http_out_conn_response_total{operationType="terminate_notify",pod=~".*smservice.*",servicename3gpp="npcf-smpolicycontrol"}) >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.80
Metric Used occnp_http_out_conn_response_total
Recommended Actions

Cause

This alert fires when, over the evaluation period, above 70% of terminate_notify HTTP outbound requests sent from PCF (SM Service pods) to SMF result in non-2xx (failed) HTTP responses. In this workflow, PCF notifies SMF to terminate a session. Notably, SMF will return a 404 error if the session does not exist in its current context. Elevated rates of 404 errors could indicate attempts to terminate already-removed sessions or stale references.

Other common causes include:

  • SMF service partial outage or overload
  • Application-level errors (4xx other than 404, 5xx)
  • Network issues
  • Configuration mistakes
  • Recent deployments or system changes

Diagnostic Information

  • Break down non-2xx responses by HTTP code (especially distinguishing 404s from 5xx or other 4xx)
  • Check PCF and SM Service logs for error details related to terminate_notify requests
  • Review SMF logs for the context and reasoning behind 404 responses
  • Analyze the timing and volume of session termination requests compared to active session counts
  • Correlate with recent maintenance, scaling events, or deployment changes
  • Evaluate resource utilization and connectivity between PCF and SMF

Recovery

  • Determine Root Cause: Use error codes, logs, and traces to identify whether high 404s are expected (e.g., requests for sessions already removed) or whether there are issues with session tracking, race conditions, or stale data
  • Rollback if Needed: If recent changes coincide with failures, consider rolling back deployments or configurations.
  • Scale/Resources: Address resource exhaustion or performance bottlenecks as needed
Alert Resolution: The alert will auto-resolve once failed response rates fall below 70%.

For any additional guidance, contact My Oracle Support.

8.1.2.77 UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT

Table 8-210 UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT

Field Details
Description {{ $value }} % of update notify sent to SMF that failed.
Summary More than 30% of update notify sent to SMF failed
Severity Minor
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.
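For reference, a rule of this shape in the Alertrules.yaml file would look roughly as follows. This is a sketch assembled from the fields in the table above; the group name and annotation wording are illustrative, and the shipped file may differ:

```yaml
groups:
  - name: occnp-alerts            # illustrative group name
    rules:
      - alert: UPDATE_NOTIFY_FAILURE_ABOVE_30_PERCENT
        expr: >
          (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m]))
          / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
        labels:
          severity: minor
        annotations:
          summary: 'More than 30% of update notify sent to SMF failed'
          description: '{{ $value }} % of update notify sent to SMF that failed.'
```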

For any additional guidance, contact My Oracle Support.

8.1.2.78 UPDATE_NOTIFY_FAILURE_ABOVE_50_PERCENT

Table 8-211 UPDATE_NOTIFY_FAILURE_ABOVE_50_PERCENT

Field Details
Description Number of Update notify that failed is equal or above 50% but less than 70% in a given time period
Summary Number of Update notify that failed is equal or above 50% but less than 70% in a given time period
Severity MAJOR
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.

For any additional guidance, contact My Oracle Support.

8.1.2.79 UPDATE_NOTIFY_FAILURE_ABOVE_70_PERCENT

Table 8-212 UPDATE_NOTIFY_FAILURE_ABOVE_70_PERCENT

Field Details
Description {{ $value }} % of update notify sent to SMF that failed
Summary More than 70% of update notify sent to SMF failed
Severity Critical
Expression (sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm",responseCode!~"2.*"}[5m])) / sum by (namespace) (rate(occnp_http_out_conn_response_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.94
Metric Used occnp_http_out_conn_response_total
Recommended Actions

The occnp_http_out_conn_response_total metric is pegged when PCF receives a response to a message sent out of the NF.

In this case, the alert indicates that a significant share of update-notify requests sent to the SMF are failing.

If update-notify failures increase, the operator can check whether all the flows that trigger update-notify are failing, identify which flow fails the most, or verify whether the SMF that the requests are routed to is unhealthy.

For any additional guidance, contact My Oracle Support.

8.1.2.80 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST

Table 8-213 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST

Field Details
Description Ingress Gateway traffic gets rejected more than 1% because of ratelimiting.
Summary Ingress Gateway traffic gets rejected more than 1% because of ratelimiting.
Severity Major
Expression (sum by (namespace,pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total {Allowed="false",app_kubernetes_io_name="occnp-ingress-gateway"}[2m])))/ (sum by (namespace,pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total {app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.103
Metric Used oc_ingressgateway_http_request_ratelimit_values_total
Recommended Actions

Cause:

This alert is triggered when the percentage of denied requests is above 1% of the total TPS.

Diagnostic Information:

  • Metric involved: oc_ingressgateway_http_request_ratelimit_values_total
  • Error observed: 429 Too Many Requests, NF_CONGESTION_RISK
  • Cause value: Allowed="false"
  • Condition: podProtectionByRateLimiting.enabled = true and podProtectionByRateLimiting.fillRate settings
  • Verification steps:
    • Set podProtectionByRateLimiting.fillRate to a lower value and podProtectionByRateLimiting.deniedRequestActions.action=REJECT for a lower congestion level.
    • Run 4500 TPS or above for SM traffic.
    • Confirm that some requests are dropped with error 429.
    • Verify that the alert gets triggered.
  • Monitoring recommendations:
    • Monitor 4xx errors and counter increases for oc_ingressgateway_http_request_ratelimit_values_total{Allowed="false"}
    • Watch for spikes following client deployments.
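The monitoring recommendation above can be implemented with a query such as the following, which tracks the denied-request share per Ingress Gateway pod — the same ratio the alert thresholds at 1%:

```promql
# Percentage of requests denied by rate limiting, per Ingress Gateway pod
(
  sum by (namespace, pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total{Allowed="false",app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))
/
  sum by (namespace, pod) (rate(oc_ingressgateway_http_request_ratelimit_values_total{app_kubernetes_io_name="occnp-ingress-gateway"}[2m]))
) * 100
```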

Recovery:

  • Check for network traffic bursts and storms
  • Investigate traffic load balancer and network issues
  • Review SM Service resources
  • Restart or scale up resources temporarily if the system is congested
  • Reconfigure podProtectionByRateLimiting.fillRate to a higher value and assign podProtectionByRateLimiting.deniedRequestActions.action=REJECT to a higher congestion level
  • Disable the feature: if this flow is the only one affected, the feature can be disabled as a last resort

For any additional guidance, contact My Oracle Support.

8.1.2.81 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MINOR_THRESHOLD

Table 8-214 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 20 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 20 Percent of Total n1n2 notify Request.
Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment).

So, the operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.82 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-215 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 50 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 50 Percent of Total n1n2 notify Request.
Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment). The operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.83 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-216 UE_N1N2_NOTIFY_REJECTION_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 75 Percent of Total n1n2 notify Request.
Summary UE N1N2 Notification Rate containing request of MANAGE_UE_POLICY_COMMAND_REJECT from AMF is detected to be above 75 Percent of Total n1n2 notify Request.
Severity CRITICAL
Expression sum by (namespace) (rate(ue_n1_transfer_ue_notification_total{commandType="MANAGE_UE_POLICY_COMMAND_REJECT"}[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.91
Metric Used ue_n1_transfer_ue_notification_total
Recommended Actions

The ue_n1_transfer_ue_notification_total metric is pegged when a fragment delivered by the PCF (pcf-ue service) is rejected by the UE (User Equipment). The operator needs to check on the AMF/UE side why these UPSI/URSP rules were rejected.

For any additional guidance, contact My Oracle Support.

8.1.2.84 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MINOR_THRESHOLD

Table 8-217 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description

Over 20% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 20% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from AMF.

In this case, the operator needs to check for connectivity issues between AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to check whether AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.85 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-218 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description

Over 50% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 50% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from AMF.

In this case, the operator needs to check for connectivity issues between AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to check whether AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.86 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-219 UE_N1N2_TRANSFER_FAILURE_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description

Over 75% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Summary

Over 75% of total N1N2 transfer requests from AMF are N1N2 transfer failure notification requests.

Severity Critical
Expression sum by (namespace) (rate(ue_n1_transfer_failure_notification_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.92
Metric Used ue_n1_transfer_failure_notification_total
Recommended Actions

The ue_n1_transfer_failure_notification_total metric is pegged when PCF receives a transfer failure notification from the AMF.

In this case, the operator must check for connectivity issues between the AMF and the UE to determine why the fragment transfer to the UE failed.

The operator might also have to verify that the AMF has proper retransmission and reattempt configurations in place.

For any additional guidance, contact My Oracle Support.

8.1.2.87 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD

Table 8-220 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD

Field Details
Description

Over 20% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 20% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Minor
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.
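For reference, an alert of this family might be expressed in the Alertrules.yaml file along the following lines, using the expression documented above. This is an illustrative sketch only; the group name, label values, and annotation text shown here are assumptions, not the shipped configuration:

```yaml
groups:
  - name: policy-ue-n1n2-alerts        # illustrative group name
    rules:
      - alert: UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MINOR_THRESHOLD
        expr: |
          sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m]))
            / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 20
        labels:
          severity: minor              # illustrative label value
        annotations:
          summary: >
            Over 20% of UE N1N2 transfers had T3501 expire before the
            N1N2 notify was received from AMF.
```

A rules file of this shape can be syntax-checked with Prometheus's promtool before it is loaded.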

8.1.2.88 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MAJOR_THRESHOLD

Table 8-221 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_MAJOR_THRESHOLD

Field Details
Description

Over 50% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 50% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Major
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.

8.1.2.89 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_CRITICAL_THRESHOLD

Table 8-222 UE_N1N2_TRANSFER_T3501_TIMER_EXPIRY_RATE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description

Over 75% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Summary

Over 75% of UE N1N2 transfers have T3501 timer expiry before the N1N2 notify is received from AMF for the respective transfer.

Severity Critical
Expression sum by (namespace) (rate(ue_n1_transfer_t3501_expiry_total[5m])) / sum by (namespace) (rate(ue_n1_transfer_response_total[5m])) * 100 > 75
OID 1.3.6.1.4.1.323.5.3.52.1.2.93
Metric Used ue_n1_transfer_t3501_expiry_total
Recommended Actions

The ue_n1_transfer_t3501_expiry_total metric is pegged when PCF does not receive an N1N2 notification message from the AMF before the T3501 timer expires.

In this case, the operator must check on the AMF side why the N1N2 message was delayed. The connectivity between PCF and AMF must also be checked.

If the connection between PCF and AMF is not the issue, then as a workaround the operator can increase the T3501 timer duration in the PCF GUI: navigate to the Service Configuration -> PCF UE Timer Setting section and set the T3501 Timer Duration field to a larger value.

For any additional guidance, contact My Oracle Support.

8.1.2.90 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_CRITICAL_THRESHOLD

Table 8-223 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_CRITICAL_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 70% in a given time period.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends with an error.

  • Metrics: ocpm_handle_update_notify_error_response_as_pending_confirmation_total
    • This is incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, the specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and a timeout happens during an Update Notify triggered by AAR-I or AAR-U.
  • Alarm Condition:
    • If more than or equal to 70% of the total update_notify requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in network which is resulting in timeouts
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state
  • Review application
    • Verify that SM is not congested.
    • Check for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded
  • Disable feature
    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.
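The RX_PENDING_CONFIRMATION alerts at the three severities partition the failure percentage into non-overlapping bands (>= 70 critical, >= 50 and < 70 major, >= 30 and < 50 minor). The following sketch (illustrative Python, not product code) shows that banding:

```python
# Hedged sketch of the severity banding used by the three
# RX_PENDING_CONFIRMATION alert expressions. Only one band can match
# a given percentage, so at most one severity fires for a value.

def pending_confirmation_severity(error_percent: float):
    """Band the Update Notify failure percentage the way the three
    alert expressions do; returns None when no alert is raised."""
    if error_percent >= 70:
        return "critical"
    if error_percent >= 50:
        return "major"
    if error_percent >= 30:
        return "minor"
    return None


for pct in (25, 30, 55, 70):
    print(pct, pending_confirmation_severity(pct))
```

Because the bands share their boundary values exactly (a reading of 50% is major, not minor), clearing one alert while crossing into the next band raises the adjacent severity rather than leaving a gap.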

8.1.2.91 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MAJOR_THRESHOLD

Table 8-224 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MAJOR_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 50% but less than 70% in a given time period.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends with an error.

  • Metrics: ocpm_handle_update_notify_error_response_as_pending_confirmation_total
    • This is incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, the specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and a timeout happens during an Update Notify triggered by AAR-I or AAR-U.
  • Alarm Condition:
    • If more than or equal to 50% but less than 70% of the total update_notify requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in network which is resulting in timeouts
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state
  • Review application
    • Verify that SM is not congested.
    • Check for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded
  • Disable feature
    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.92 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MINOR_THRESHOLD

Table 8-225 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_ERROR_RESPONSE_ABOVE_MINOR_THRESHOLD

Field Details
Description This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 30% but less than 50% of the total Rx sessions.
Summary This alert is triggered when the percentage of Update Notify requests that failed with a configured error response is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_error_response_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm", responseCode=~"5xx/4xx"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.111
Metric Used ocpm_handle_update_notify_error_response_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_error_response_as_pending_confirmation_total is pegged when the operation Update Notify towards SMF ends up with an error.

Metrics:

  • ocpm_handle_update_notify_error_response_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and

      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.RESPONSE_CODE, and

      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 30% and < 50% of update_notify total requests fail with the configured error code, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.93 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_CRITICAL_THRESHOLD

Table 8-226 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_CRITICAL_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 70% in a given time period.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 70% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends up with a timeout.

Metrics:

  • ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and
      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and
      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 70% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.94 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MAJOR_THRESHOLD

Table 8-227 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MAJOR_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 50% but less than 70% in a given time period.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 50% but less than 70% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the operation Update Notify towards SMF ends up with a timeout.

  • Metrics: ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when the configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled and a specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and a timeout happens during update notify triggered by AAR-I and AAR-U.
  • Alarm Condition:

    • If more than or equal to 50% but less than 70% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency
    • Investigate possible delays in the network which are resulting in timeouts.
  • Verify sender information
    • Verify if the notifUri where we are sending the information is correct.
  • Verify receiver NF
    • Verify that the SMF that is receiving the traffic is in a healthy state.
  • Review application
    • Verify that SM is not congested.
    • Look for signs such as constant error logs.
    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity
    • Investigate any current network issues or bottlenecks between the external NF and the SM Service. Resolve any high latency or packet loss immediately if detected.
  • Review SM Service Application and Resources
    • Restart or scale up resources temporarily if the system is overloaded.
  • Disable feature
    • If this flow is the only one affected, you can disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.95 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MINOR_THRESHOLD

Table 8-228 RX_PENDING_CONFIRMATION_UPDATE_NOTIFY_TIMEOUT_ABOVE_MINOR_THRESHOLD

Field Details
Description This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Summary This alert is triggered when the number of Update Notify requests that failed due to a timeout is equal to or above 30% but less than 50% of the total Rx sessions.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_handle_update_notify_timeout_as_pending_confirmation_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m])) / sum by (namespace) (rate(ocpm_rx_update_notify_request_total{operationType="update_notify",microservice=~".*pcf_sm"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.112
Metric Used ocpm_handle_update_notify_timeout_as_pending_confirmation_total
Recommended Actions

Cause:

Metric ocpm_handle_update_notify_timeout_as_pending_confirmation_total is pegged when the Update Notify operation towards SMF ends up with a timeout.

Metrics:

  • ocpm_handle_update_notify_timeout_as_pending_confirmation_total

    • This will be incremented when:

      • Configuration flag SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.ENABLED is enabled, and
      • A specific error is added in SYSTEM.RX.UPDATE_NOTIFY.RULES.PENDING_CONFIRMATION.EXCEPTIONS, and
      • A timeout happens during Update Notify triggered by AAR-I and AAR-U.

Alarm Condition:

  • If ≥ 30% and < 50% of update_notify total requests fail with a timeout, an alarm is raised.

Diagnostic Information:

  • Check Network Latency

    • Investigate possible delays in the network that are resulting in timeouts.

  • Verify Sender Information

    • Verify if the notifUri where we are sending the information is correct.

  • Verify Receiver NF

    • Verify that the SMF receiving the traffic is in a healthy state.

  • Review Application

    • Verify that SM is not congested.

    • Check for constant error logs.

    • Monitor system/resource utilization (CPU, memory, queues).

Recover:

  • Check Network Latency and Connectivity

    • Investigate any current network issues or bottlenecks between the external NF and the SM service.

    • Resolve any high latency or packet loss immediately if detected.

  • Review SM Service Application and Resources

    • Restart or scale up resources temporarily if the system is overloaded.

  • Disable Feature

    • If this flow is the only one affected, disable this feature as a last resort.

For any additional guidance, contact My Oracle Support.

8.1.2.96 PCF_STATE_NON_FUNCTIONAL_CRITICAL

Table 8-229 PCF_STATE_NON_FUNCTIONAL_CRITICAL

Field Details
Description Policy is in non functional state due to DB cluster state down.
Summary Policy is in non functional state due to DB cluster state down.
Severity Critical
Expression appinfo_nfDbFunctionalState_current{nfDbFunctionalState="Not_Running"} == 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.102
Metric Used appinfo_nfDbFunctionalState_current
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.2.97 UDR_GET_REVALIDATION_FAILURE_ABOVE_MAJOR_PERCENT

Table 8-230 UDR_GET_REVALIDATION_FAILURE_ABOVE_MAJOR_PERCENT

Field Details
Description This alert is triggered when more than or equal to 50% but less than 70% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 50% but less than 70% of the UDR revalidation GET calls failed.
Severity Major
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever a response is received from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation exceeds the configured threshold.

This alert is triggered when more than or equal to 50% but less than 70% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation) sent by the PCF-UserService fail (that is, receive non-2xx HTTP response codes).

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from the PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses—look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess whether there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate whether there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check whether other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.
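The failure selector in these expressions, response_code!~"2.*", counts every non-2xx response toward the failure percentage. The following sketch (illustrative Python; the re module is used here as a stand-in for Prometheus's RE2 label matching, which behaves identically for this simple pattern) shows which codes the selector treats as failures:

```python
import re

# The alert's failure selector: any response code NOT fully matching "2.*",
# i.e. any code that does not start with "2".
NON_FAILURE = re.compile(r"2.*")


def is_revalidation_failure(response_code: str) -> bool:
    """Mirrors response_code!~"2.*": True when the code is not 2xx."""
    return not NON_FAILURE.fullmatch(response_code)


# Only the non-2xx codes count toward the failure numerator.
print([c for c in ("200", "201", "404", "500", "503")
       if is_revalidation_failure(c)])
```

Note that Prometheus anchors label-matcher regexes at both ends, which is why fullmatch (rather than search) is the faithful analogue here.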

8.1.2.98 UDR_GET_REVALIDATION_FAILURE_ABOVE_CRITICAL_PERCENT

Table 8-231 UDR_GET_REVALIDATION_FAILURE_ABOVE_CRITICAL_PERCENT

Field Details
Description This alert is triggered when more than or equal to 70% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 70% of the UDR revalidation GET calls failed.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever a response is received from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation exceeds the configured threshold.

This alert is triggered when more than or equal to 70% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation) sent by the PCF-UserService fail (that is, receive non-2xx HTTP response codes).

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses — look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess whether there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate whether there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check whether other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.

8.1.2.99 UDR_GET_REVALIDATION_FAILURE_ABOVE_MINOR_PERCENT

Table 8-232 UDR_GET_REVALIDATION_FAILURE_ABOVE_MINOR_PERCENT

Field Details
Description This alert is triggered when more than or equal to 30% but less than 50% of the UDR revalidation GET calls failed.
Summary This alert is triggered when more than or equal to 30% but less than 50% of the UDR revalidation GET calls failed.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code!~"2.*",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.108
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever a response from UDR is received in the UDR Connector. This alert notifies when the number of failed responses received from UDR for the resubscribe operation is above the mentioned threshold.

Cause:

The ocpm_udr_tracking_response_total metric is pegged whenever we receive a response from the UDR in the UDR Connector.

In this case, alerts are triggered when the number of failed responses received from UDR for the resubscribe operation is above the configured threshold.

This alert is triggered when more than or equal to 30% but less than 50% of GET calls for UDR revalidation (operation_type=resubscribe, service_resource=subscription-revalidation

Diagnostic Information:

  1. Check Recent Logs

    • Review logs from the PCF UDR Connector and Egress Gateway for the relevant time intervals.

    • Review errors at SCP routing.

    • Identify the failure responses — look for non-2xx HTTP status codes and any error payloads.

  2. Analyze Failure Patterns

    • Determine if failures are tied to specific UDRs, subscriber groups, or are distributed across all revalidations.

    • Assess if there is a spike in failed revalidations or if failures are intermittent.

  3. Inspect UDR Health and Reachability

    • Verify the health and responsiveness of the UDR service.

    • Check network connectivity from the PCF to UDR; look for timeouts, DNS errors, or other connectivity issues in intermediary services (EGW, SCP).

  4. Review PCF–UDR Connector Configuration

    • Ensure proper configuration of endpoints, service credentials, and connection settings between PCF and UDR.

    • Review any recent configuration or deployment changes that might correspond to the start of failures.

  5. Check for Resource or Rate Limiting

    • Evaluate if there are signs of resource exhaustion (CPU, memory, network) on either service.

    • Investigate if the UDR is rate-limiting incoming requests or experiencing overload.

  6. Correlate with Related Alerts or Incidents

    • Cross-check if other alerts in the same namespace indicate broader issues (e.g., infrastructure, dependency outages, authentication errors).

Recovery:

  1. Restore UDR Service Health

    • Address any service outages, restarts, or degraded performance on the UDR side.

    • If resource constraints are detected, consider scaling UDR or optimizing load.

  2. Fix Connectivity or Configuration Issues

    • Resolve network issues (latency, DNS, firewall).

    • Correct any erroneous endpoint URLs or authentication parameters in PCF User or UDR Connector configurations.

For any additional guidance, contact My Oracle Support.

8.1.2.100 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_CRITICAL_PERCENT

Table 8-233 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_CRITICAL_PERCENT

Field Details
Description This alert is triggered when 70% or more of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 70% or more of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Critical
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.
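The ratio-and-threshold logic of the expression, together with the Major and Minor variants of this alert, can be mirrored in a short sketch. The counter values are hypothetical; the thresholds are the 70/50/30 tiers used by this alert family:

```python
# Sketch of the tiering applied by the UDR_GET_REVALIDATION_404 alert
# family: failed 404 responses as a percentage of all revalidation
# responses, mapped to Critical (>= 70), Major (>= 50), or Minor (>= 30).
# Counter values are illustrative, not real metric samples.

def severity_for_404_ratio(failed_404, total):
    """Return the alert tier for the failure percentage, or None."""
    if total == 0:
        return None  # no revalidation traffic in the window
    pct = failed_404 / total * 100
    if pct >= 70:
        return "Critical"
    if pct >= 50:
        return "Major"
    if pct >= 30:
        return "Minor"
    return None

print(severity_for_404_ratio(75, 100))  # Critical
```

Note that the PromQL expressions achieve the non-overlapping bands by chaining comparison filters (for example, `>= 50 < 70`) rather than explicit if/else logic.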

Cause:

This alert is triggered when more than 70% of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate Missing Subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit Subscription Lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review Recent Deployments or Changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.101 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MAJOR_PERCENT

Table 8-234 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MAJOR_PERCENT

Field Details
Description This alert is triggered when 50% or more but less than 70% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 50% or more but less than 70% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Major
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 50 < 70
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.

Cause:

This alert is triggered when more than 50% (but less than 70%) of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate missing subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit subscription lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review recent deployments or changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.102 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MINOR_PERCENT

Table 8-235 UDR_GET_REVALIDATION_404_FAILURE_ABOVE_MINOR_PERCENT

Field Details
Description This alert is triggered when 30% or more but less than 50% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Summary This alert is triggered when 30% or more but less than 50% of UDR revalidation GET requests fail with status code 404 NOT FOUND.
Severity Minor
Expression (sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",response_code="404",service_resource="subscription-revalidation"}[5m])) / sum by (namespace) (rate(ocpm_udr_tracking_response_total{operation_type="resubscribe",microservice=~".*pcf_user",service_resource="subscription-revalidation"}[5m]))) * 100 >= 30 < 50
OID 1.3.6.1.4.1.323.5.3.52.1.2.110
Metric Used ocpm_udr_tracking_response_total
Recommended Actions

The ocpm_udr_tracking_response_total metric is pegged whenever the UDR Connector receives a response from UDR. This alert is raised when the percentage of responses for the resubscribe operation that failed with a 404 Not Found exceeds the stated threshold.

Cause:

This alert is triggered when more than 30% (but less than 50%) of UDR revalidation GET operations managed by PCF fail with an HTTP 404 (Not Found) response code within the specified window.

A 404 response indicates that the requested subscription for revalidation was not found in UDR.

Diagnostic Information:

  1. Check UDR logs: Look for 404 errors and the accompanying trigger request details (trigger of UDR revalidation request) for the affected time period.

  2. Verify subscription states: Ensure that the subscriptions expected to be present actually exist and are not being deleted, expired, or unavailable.

  3. Investigate missing subscriptions:

    • Determine why the revalidation call is being made for a non-existent subscription ID. If the subscription has gone stale, verify why an audit was not triggered for it.

    • Check if there is a data synchronization issue between the originator and UDR.

  4. Look for patterns: Determine whether the 404s are concentrated in a particular user group.

Recovery:

  1. Audit subscription lifecycle: Ensure proper creation, update, and deletion workflows so stale subscription IDs are not reused or referenced.

  2. Review recent deployments or changes: Check whether recent code or configuration changes in pcf_user or UDR might have led to increased 404s.

For any additional guidance, contact My Oracle Support.

8.1.2.103 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-236 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (indicates the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no AM user data was returned from UDR as part of the Immediate Reporting capability)

    • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without AM user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, ask the UDR operators to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.
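The suppFeat check in recovery step 1 can be sketched as follows. It assumes ImmReportPcc corresponds to the bit mask 0x40000000, matching the example value 40000000 used in this section; confirm the exact feature-to-bit mapping against the applicable 3GPP supported-features definition.

```python
# Sketch: check whether the ImmReportPcc bit is set in a suppFeat hex
# string. The mask 0x40000000 is taken from the example value in this
# section; treat the exact bit position as an assumption to verify.

IMM_REPORT_PCC_MASK = 0x40000000  # assumed bit for ImmReportPcc

def imm_report_pcc_negotiated(supp_feat_hex):
    """True if the ImmReportPcc bit is set in the suppFeat bitmask."""
    return bool(int(supp_feat_hex, 16) & IMM_REPORT_PCC_MASK)

print(imm_report_pcc_negotiated("40000000"))  # True
print(imm_report_pcc_negotiated("00000000"))  # False
```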

For any additional guidance, contact My Oracle Support.

8.1.2.104 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-237 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no AM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without user data for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and still no AM user data is retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.105 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-238 UDR_AM_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without user data for AM as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.116
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The missing AM user data check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no AM user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without user data for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.106 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-239 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.
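As a minimal sketch of recovery steps 1 and 2, the two attributes can be set together when building the UDR POST subscription request. Only immRep and suppFeat are named in this section; the remaining payload fields and their names are illustrative placeholders, not the actual Nudr schema:

```python
# Sketch: assemble the two attributes the diagnostics above ask you to
# verify in the UDR POST subscription request. "immRep" and "suppFeat"
# come from this section; everything else is a hypothetical placeholder.
import json

def build_subscription_payload(notification_uri):
    return {
        "notificationUri": notification_uri,  # hypothetical field name
        "immRep": True,                       # request immediate reporting
        "suppFeat": "40000000",               # ImmReportPcc bit set
    }

payload = build_subscription_payload("http://pcf.example.com/notify")
print(json.dumps(payload, indent=2))
```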

For any additional guidance, contact My Oracle Support.

8.1.2.107 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-240 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

    • The failed feature negotiation check is based on:

      • service_subresource = "am-data" (to indicate the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 in its hex representation (for example, 40000000). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.108 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-241 UDR_AM_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for AM as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="am-data",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.117
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "am-data" (indicates the UDR POST was to get AM user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • immediate_report_pcc = "false" (indicates that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response with failed feature negotiation for AM as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and AM user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.109 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MINOR

Table 8-242 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (indicates the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (indicates this is a POST call)

      • imm_reports_present = "false" (indicates no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.
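Assuming each response can be reduced to the three label values above, the dimension check that pegs the alarming time series can be sketched as:

```python
# Sketch: decide whether a UDR response carries the label combination
# that feeds this alarm (label values taken from the bullets above;
# the flat-argument representation of a response is illustrative).

def matches_missing_ue_data(service_subresource, operation_type,
                            imm_reports_present):
    return (service_subresource == "ue-policy-set"
            and operation_type.lower() == "post"
            and not imm_reports_present)

print(matches_missing_ue_data("ue-policy-set", "POST", False))  # True
```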

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.110 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Table 8-243 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

The metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR for the POST REST API call in its request payload has the 30th bit set to 1 in its hex representation (for example, 40000000).

    • This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved:

      • Inform the UDR operators.

      • Ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit allotted for ImmReportPcc set to 1.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.111 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Table 8-244 UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Summary For 30% or more of the traffic, UDR returned a POST subscribe response without user data for UE as part of immediate reporting.
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.118
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for a POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The missing UE user data check is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • imm_reports_present = "false" (to indicate no UE user data was returned from UDR as part of the Immediate Reporting capability)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic: UDR returned a POST Subscribe response without UE user data as part of Immediate Reporting.
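As noted in the introduction to this chapter, alerts are configured in the Alertrules.yaml file. A sketch of how this table's expression, severity, and summary could be assembled into a Prometheus alerting rule; the group name and the `for` duration are illustrative assumptions, not values taken from this guide:

```yaml
groups:
  - name: policy-udr-immrep          # illustrative group name
    rules:
      - alert: UDR_UE_IMMREP_RESPONSE_MISSING_DATA_CRITICAL
        expr: |
          (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",imm_reports_present="false"}[5m])))
          / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
        for: 5m                      # illustrative hold duration
        labels:
          severity: critical
        annotations:
          summary: >-
            30% or more of the traffic: UDR returned a POST subscribe
            response without UE user data as part of immediate reporting
```

The deployed Alertrules.yaml shipped with Policy remains the authoritative definition; this fragment only illustrates the mapping from the table fields to rule syntax.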

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.112 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Table 8-245 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MINOR

Field Details
Description For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 10% or more but less than 20% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Minor
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 10 < 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 10% but less than 20% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.113 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Table 8-246 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_MAJOR

Field Details
Description For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 20% or more but less than 30% of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Major
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 20 < 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 20% but less than 30% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the UDR POST request payload.

  3. Verify UDR Profile

    • Ensure User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.114 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Table 8-247 UDR_UE_IMMREP_FEATURE_NEGOTIATION_FAILED_CRITICAL

Field Details
Description For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Summary For 30% or more of the traffic, UDR returned a POST subscribe response with failed feature negotiation for the UE as part of immediate reporting
Severity Critical
Expression (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post",immediate_report_pcc="false"}[5m]))) / (sum by (namespace) (rate(occnp_immrep_response_total{service_subresource="ue-policy-set",operation_type="post"}[5m]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.119
Metric Used occnp_immrep_response_total
Recommended Actions

Cause:

Metric occnp_immrep_response_total is pegged when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

Metric:

  • occnp_immrep_response_total

    • Increments when UDR-C receives a user data response from UDR for POST Subscription with Immediate Reporting.

    • The failed feature negotiation is based on:

      • service_subresource = "ue-policy-set" (to indicate the UDR POST was to get UE user data from UDR)

      • operation_type = "POST" (to determine this is a POST call)

      • immediate_report_pcc = "false" (to indicate that no feature negotiation happened with UDR on the ImmReportPcc feature)

  • If these metric dimensions are satisfied, then the alarm will trigger.

Alarm Condition:

  • More than or equal to 30% of the traffic:

    UDR returned a POST Subscribe response with failed feature negotiation for UE as part of Immediate Reporting.

Diagnostic Information:

  1. Verify ImmReportPcc

    • Ensure the suppFeat attribute sent towards UDR in the POST REST API request payload has the 30th bit set to 1 (for example, "40000000" in hexadecimal). This is crucial for feature negotiation with UDR.

  2. Verify immRep

    • Ensure the immRep attribute is set to true in the request payload for the UDR POST.

  3. Verify UDR Profile

    • Ensure that User Data is requested only for those UDR profiles that PCF obtained from NRF with the ImmReportPcc feature enabled.

  4. Last Resort – Inform UDR Operator

    • If the above points are validated and UE user data is still not retrieved, inform the UDR operators and ask them to verify whether the Immediate Reporting feature is working and negotiated from their end.

Recovery:

  1. Verify the suppFeat attribute is sent with the 30th bit set for ImmReportPcc.

  2. Verify immRep is being sent as true.

  3. Verify the UDR profile being chosen to perform the UDR POST has ImmReportPcc enabled after on-demand/autonomous UDR discovery.

For any additional guidance, contact My Oracle Support.

8.1.2.115 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST_EGW

Table 8-248 POD_PROTECTION_BY_RATELIMIT_REJECTED_REQUEST_EGW

Field Details
Description More than 1% of Egress Gateway traffic is being rejected because of rate limiting.
Summary More than 1% of Egress Gateway traffic is being rejected because of rate limiting.
Severity Major
Expression (sum(rate(oc_egressgateway_http_request_ratelimit_values_total {allowed="false",app_kubernetes_io_name="egress-gateway",namespace="$NAMESPACE"}[2m]) or (up * 0 ) ) )/sum(rate(oc_egressgateway_http_request_ratelimit_values_total {app_kubernetes_io_name="egress-gateway",namespace="$NAMESPACE"}[2m])) * 100 >= 1
OID 1.3.6.1.4.1.323.5.3.52.1.2.114
Metric Used oc_egressgateway_http_request_ratelimit_values_total
Recommended Actions

The alert is cleared when the failure rate goes below 1% of the total TPS.
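The threshold arithmetic behind this expression is a simple ratio of rejected to total request rates. A minimal Python sketch; the sample rates are hypothetical, not taken from a live system:

```python
def rejection_percent(rejected_rate: float, total_rate: float) -> float:
    """Percentage of Egress Gateway requests rejected by rate limiting."""
    if total_rate == 0:
        return 0.0  # no traffic in the window, nothing to alert on
    return rejected_rate / total_rate * 100

# Hypothetical per-second rates over the 2-minute window.
percent = rejection_percent(rejected_rate=1.5, total_rate=100.0)
print(percent)                              # 1.5
print(percent >= 1)                         # True: the alert fires
print(rejection_percent(0.5, 100.0) >= 1)   # False: the alert clears
```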

For any additional guidance, contact My Oracle Support.

8.1.2.116 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-249 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 60% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false" but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.
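The first recommended action, checking which error codes are being returned, amounts to grouping responses by code. A minimal sketch, assuming the responseCode label values have already been exported from Prometheus or logs; the sample data is hypothetical:

```python
from collections import Counter

def top_error_codes(response_codes: list[str]) -> list[tuple[str, int]]:
    """Tally non-2xx response codes, most frequent first."""
    errors = [code for code in response_codes if not code.startswith("2")]
    return Counter(errors).most_common()

# Hypothetical sample of responseCode label values.
sample = ["201", "403", "403", "403", "201", "400", "403"]
print(top_error_codes(sample))  # [('403', 4), ('400', 1)]
```

A dominant 403 count here points at the Requested Service Not Authorized condition this alert describes, while a mix of other 4xx codes suggests a different caller-side problem.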

8.1.2.117 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-250 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 40% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 40 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false" but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.

8.1.2.118 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-251 SMF_REQUESTED_SERVICE_NOT_AUTHORIZED_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 20% in a given time period.
Severity Minor
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="403"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 20 < 40
OID 1.3.6.1.4.1.323.5.3.52.1.2.120
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 403 Requested Service Not Authorized response. This occurs when the client sends a Sponsored request with umcDataIncluded="false", but the requested service is not authorized. As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 403 Requested Service Not Authorized

  • Condition: Unauthorized Sponsored Connectivity requests

Verification steps:

  1. Send a Sponsored Connectivity request with valid authorization and supported features.

  2. Confirm the request succeeds.

  3. Verify that the 403 / Requested Service Not Authorized ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Track the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Pay special attention to spikes after client deployments or policy/configuration changes.

Recovery:

  1. Identify the failing caller.

  2. Review the request payload, entitlement, and policy configuration.

  3. Confirm that the sponsor/ASP is authorized for the requested service.

  4. Correct any misconfigurations in policy rules or subscription data.

  5. Ensure that Sponsored Connectivity is supported in both the PCF SM Service and the PA Service.

  6. Escalate if the issue persists after authorization or policy fixes, or if it impacts multiple tenants or partners.

8.1.2.119 AF_MANDATORY_IE_MISSING_SC_ABOVE_CRITICAL_THRESHOLD_PERCENT

Table 8-252 AF_MANDATORY_IE_MISSING_SC_ABOVE_CRITICAL_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 60% in a given time period.
Severity Critical
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.
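Mandatory-IE checks like the one in recovery step 3 can be automated before redeploying a client. A minimal sketch, assuming the request payload is available as a JSON object and using the IEs this section names (sponId, aspId, afAppId) as the required set:

```python
REQUIRED_IES = ("sponId", "aspId", "afAppId")

def missing_mandatory_ies(payload: dict) -> list[str]:
    """Return the mandatory IEs absent or empty in a Sponsored Connectivity payload."""
    return [ie for ie in REQUIRED_IES
            if ie not in payload or payload[ie] in (None, "")]

print(missing_mandatory_ies(
    {"sponId": "sp1", "aspId": "asp1", "afAppId": "app1"}))  # []
print(missing_mandatory_ies({"sponId": "sp1"}))              # ['aspId', 'afAppId']
```

An empty result means the payload passes this check; a non-empty result lists exactly the IEs to restore before resending, mirroring the 400 MANDATORY_IE_MISSING cause.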

8.1.2.120 AF_MANDATORY_IE_MISSING_SC_ABOVE_MAJOR_THRESHOLD_PERCENT

Table 8-253 AF_MANDATORY_IE_MISSING_SC_ABOVE_MAJOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 40% in a given time period.
Severity Major
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 40 < 60
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.

8.1.2.121 AF_MANDATORY_IE_MISSING_SC_ABOVE_MINOR_THRESHOLD_PERCENT

Table 8-254 AF_MANDATORY_IE_MISSING_SC_ABOVE_MINOR_THRESHOLD_PERCENT

Field Details
Description {{ $value }} % of patch requests failed in {{$labels.namespace}}.
Summary This alert is triggered when the percentage of PATCH requests that failed is equal to or above 20% in a given time period.
Severity Minor
Expression (sum by (namespace) (rate(occnp_pa_sponsored_sessions_total{responseCode="400",cause="MANDATORY_IE_MISSING"}[5m])))/(sum by (namespace) (rate(occnp_pa_sponsored_sessions_total[5m]))) * 100 >= 20 < 40
OID 1.3.6.1.4.1.323.5.3.52.1.2.122
Metric Used occnp_pa_sponsored_sessions_total
Recommended Actions If this alert is triggered, use Prometheus metrics or other tools to check which error codes are being returned and to identify whether the error originates from the NF being reached (in this case, SM).

Cause:

Alerts are triggered when Sponsored Connectivity requests processed by PA-Service fail with a 400 Bad Request due to cause="MANDATORY_IE_MISSING". This happens when the client sends a Sponsored Connectivity request missing one or more mandatory Information Elements (IEs). As a result, PA rejects the request and increments the occnp_pa_sponsored_sessions_total metric.

Diagnostic Information:

  • Metric involved: occnp_pa_sponsored_sessions_total

  • Error observed: 400 Bad Request

  • Cause value: MANDATORY_IE_MISSING

  • Condition: Sponsored Connectivity requests missing mandatory IE fields

  • Common missing IEs: sponId, aspId, afAppId

Verification steps:

  1. Send a valid Sponsored Connectivity request including all mandatory IEs.

  2. Ensure sponId and aspId are present and that Sponsored Connectivity is negotiated.

  3. Confirm the request succeeds.

  4. Verify that the 400 / MANDATORY_IE_MISSING ratio drops below the alert threshold within one evaluation window.

Monitoring recommendations:

  • Monitor the 4xx error ratio by caller/tenant and by sponsor/ASP.

  • Watch for spikes following client deployments or gateway transformation changes.

Recovery:

  1. Identify the failing caller.

  2. Compare the request payload against the API contract.

  3. Restore all mandatory IE fields (sponId, aspId, afAppId, etc.).

  4. Review and fix any gateway or payload transformation issues.

  5. Redeploy the corrected configuration or client.

  6. Escalate if the issue persists after fixes or impacts multiple tenants.

8.1.3 PCRF Alerts

This section provides information about PCRF alerts.

8.1.3.1 PRE_UNREACHABLE_EXCEEDS_CRITICAL_THRESHOLD

Table 8-255 PRE_UNREACHABLE_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description PRE fail count exceeds the critical threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition PRE fail count exceeds the critical threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.2 PRE_UNREACHABLE_EXCEEDS_MAJOR_THRESHOLD

Table 8-256 PRE_UNREACHABLE_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description PRE fail count exceeds the major threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition PRE fail count exceeds the major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.3 PRE_UNREACHABLE_EXCEEDS_MINOR_THRESHOLD

Table 8-257 PRE_UNREACHABLE_EXCEEDS_MINOR_THRESHOLD

Field Details
Description PRE fail count exceeds the minor threshold limit.
Summary Alert PRE unreachable NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition PRE fail count exceeds the minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.9
Metric Used http_out_conn_response_total{container="pcrf-core", responseCode!~"2.*", serviceResource="PRE"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.4 PCRF_DOWN

Table 8-258 PCRF_DOWN

Field Details
Description PCRF Service is down
Summary Alert PCRF_DOWN NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition None of the pods of the PCRF service are available.
OID 1.3.6.1.4.1.323.5.3.44.1.2.33
Metric Used appinfo_service_running{service=~".*pcrf-core"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.5 CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-259 CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description CCA fail count exceeds the critical threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of CCA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.6 CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-260 CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description CCA fail count exceeds the major threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of CCA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.7 CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-261 CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description CCA fail count exceeds the minor threshold limit
Summary Alert CCA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of CCA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.13
Metric Used occnp_diam_response_local_total{msgType=~"CCA.*", responseCode!~"2.*"}
Recommended Actions

For any additional guidance, contact My Oracle Support.

8.1.3.8 AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-262 AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description AAA fail count exceeds the critical threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of AAA messages has exceeded the critical threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.9 AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-263 AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description AAA fail count exceeds the major threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of AAA messages has exceeded the major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.10 AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-264 AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description AAA fail count exceeds the minor threshold limit
Summary Alert AAA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of AAA messages has exceeded the minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.34
Metric Used occnp_diam_response_local_total{msgType=~"AAA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.11 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-265 RAA_RX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the critical threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of RAA Rx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.12 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-266 RAA_RX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the major threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of RAA Rx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.13 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-267 RAA_RX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx fail count exceeds the minor threshold limit
Summary Alert RAA_Rx_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of RAA Rx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.35
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.14 RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-268 RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the critical threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of RAA Gx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.15 RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-269 RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the major threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of RAA Gx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.16 RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-270 RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Gx fail count exceeds the minor threshold limit
Summary Alert RAA_GX_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of RAA Gx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.18
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.17 ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-271 ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA fail count exceeds the critical threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of ASA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.18 ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-272 ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA fail count exceeds the major threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of ASA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.19 ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-273 ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA fail count exceeds the minor threshold limit
Summary Alert ASA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of ASA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.17
Metric Used occnp_diam_response_local_total{msgType=~"ASA.*", responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.20 STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-274 STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description STA fail count exceeds the critical threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of STA messages has exceeded the configured critical threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 90
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.21 STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-275 STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description STA fail count exceeds the major threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of STA messages has exceeded the configured major threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 80
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.22 STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-276 STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description STA fail count exceeds the minor threshold limit.
Summary Alert STA_FAIL_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of STA messages has exceeded the configured minor threshold limit: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 60
OID 1.3.6.1.4.1.323.5.3.44.1.2.19
Metric Used occnp_diam_response_local_total
Recommended Actions For any additional guidance, contact My Oracle Support.
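The STA threshold expressions above can be expressed as a Prometheus alerting rule in the Alertrules.yaml file. The following is an illustrative sketch only: the expression and alert name are taken from the table above, but the group name, `for` duration, and annotation text are assumptions and not taken from the product rule file.

```yaml
# Illustrative sketch of an Alertrules.yaml entry.
# The alert name and expr follow the STA entries above; the group name,
# "for" duration, and annotation text are assumptions.
groups:
  - name: sta-fail-threshold-alerts
    rules:
      - alert: STA_FAIL_COUNT_EXCEEDS_CRITICAL_THRESHOLD
        expr: sum(rate(occnp_diam_response_local_total{msgType="STA", responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total{msgType="STA"}[5m])) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'STA failure rate above the critical threshold in {{ $labels.kubernetes_namespace }}'
```

The major and minor variants differ only in the comparison value (> 80 and > 60) and the severity label.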
8.1.3.23 ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-277 ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description ASA timeout count exceeds the critical threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of ASA messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.24 ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-278 ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description ASA timeout count exceeds the major threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of ASA messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.25 ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-279 ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description ASA timeout count exceeds the minor threshold limit
Summary Alert ASA_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of ASA messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.31
Metric Used occnp_diam_response_local_total{msgType="ASA", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.26 RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-280 RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the critical threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of RAA Gx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.27 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-281 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the major threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of RAA Gx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.28 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-282 RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Gx timeout count exceeds the minor threshold limit
Summary Alert RAA_GX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of RAA Gx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.44.1.2.32
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Gx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.29 RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Table 8-283 RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the critical threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_CRITICAL_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The timeout rate of RAA Rx messages has exceeded the configured threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.30 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Table 8-284 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the major threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_MAJOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The timeout rate of RAA Rx messages has exceeded the configured major threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.31 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Table 8-285 RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD

Field Details
Description RAA Rx timeout count exceeds the minor threshold limit
Summary Alert RAA_RX_TIMEOUT_COUNT_EXCEEDS_MINOR_THRESHOLD NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The timeout rate of RAA Rx messages has exceeded the configured minor threshold limit.
OID 1.3.6.1.4.1.323.5.3.36.1.2.36
Metric Used occnp_diam_response_local_total{msgType="RAA", appType="Rx", responseCode="timeout"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.32 RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-286 RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 10 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.33 RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-287 RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 5 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.34 RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-288 RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description The combined error rate of CCA, AAA, RAA, ASA, and STA messages is above 1 percent
Summary Alert RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The combined failure rate of CCA, AAA, RAA, ASA, and STA messages is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.37
Metric Used occnp_diam_response_local_total{responseCode!~"2.*"}
Recommended Actions For any additional guidance, contact My Oracle Support.
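For the combined response error-rate alerts above, the document lists only the metric and the threshold percentage. Modeled on the STA rule expressions earlier in this section, the condition could plausibly be expressed as follows; the rate window and exact expression are assumptions, not taken from the product rule file.

```yaml
# Sketch of a possible Alertrules.yaml rule body; the expression is an
# assumption modeled on the STA entries (5m rate window assumed).
- alert: RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT
  expr: sum(rate(occnp_diam_response_local_total{responseCode!~"2.*"}[5m])) / sum(rate(occnp_diam_response_local_total[5m])) * 100 > 10
  labels:
    severity: critical
```

The major and minor variants would compare against 5 and 1 percent respectively.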
8.1.3.35 Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-289 Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description Rx error rate is above 10 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of Rx responses is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.36 Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-290 Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description Rx error rate is above 5 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of Rx responses is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.37 Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-291 Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description Rx error rate is above 1 percent
Summary Alert Rx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of Rx responses is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.38
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Rx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.38 Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Table 8-292 Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT

Field Details
Description Gx error rate is above 10 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_CRITICAL_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Critical
Condition The failure rate of Gx responses is more than 10% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.39 Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Table 8-293 Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT

Field Details
Description Gx error rate is above 5 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_MAJOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Major
Condition The failure rate of Gx responses is more than 5% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.40 Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Table 8-294 Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT

Field Details
Description Gx error rate is above 1 percent
Summary Alert Gx_RESPONSE_ERROR_RATE_ABOVE_MINOR_PERCENT NS:{{ $labels.kubernetes_namespace }}, PODNAME:{{ $labels.kubernetes_pod_name }}, INST:{{ $labels.instance }} REL:{{ $labels.release }}
Severity Minor
Condition The failure rate of Gx responses is more than 1% of the total responses.
OID 1.3.6.1.4.1.323.5.3.36.1.2.39
Metric Used occnp_diam_response_local_total{responseCode!~"2.*", appType="Gx"}
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.41 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Table 8-295 STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 30%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 30%
Severity Critical
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.42 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Table 8-296 STALE_DIAMETER_REQUEST_CLEANUP_MAJOR

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 20%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 20%
Severity Major
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 20
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
8.1.3.43 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Table 8-297 STALE_DIAMETER_REQUEST_CLEANUP_MINOR

Field Details
Description Diameter requests are being discarded due to timeout processing; the discard rate is above 10%
Summary Diameter requests are being discarded due to timeout processing; the discard rate is above 10%
Severity Minor
Condition (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 10
OID 1.3.6.1.4.1.323.5.3.52.1.2.82
Metric Used occnp_stale_diam_request_cleanup_total
Recommended Actions For any additional guidance, contact My Oracle Support.
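The stale Diameter request cleanup alerts above include their full PromQL condition, so the corresponding Alertrules.yaml rule can be sketched directly. In the following sketch, the expression is the Condition shown in Table 8-295 verbatim; the group name, `for` duration, and annotation text are illustrative assumptions.

```yaml
# Illustrative sketch; expr is taken verbatim from the Condition field,
# while the group name, "for" duration, and annotation are assumptions.
groups:
  - name: stale-diameter-request-alerts
    rules:
      - alert: STALE_DIAMETER_REQUEST_CLEANUP_CRITICAL
        expr: (sum by (namespace, microservice, pod) (increase(occnp_stale_diam_request_cleanup_total[24h])) / sum by (namespace, microservice, pod) (increase(occnp_diam_request_local_total{msgType!~"DWR|CER"}[24h]))) * 100 >= 30
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'Diameter request discard rate due to timeout processing is above 30% on {{ $labels.pod }}'
```

The major and minor variants differ only in the comparison value (>= 20 and >= 10) and the severity label.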