
4.4 Service Related Issues

This section describes the most common service related issues and their resolution steps.

4.4.1 Resolving Microservices related Issues through Metrics and ConfigDB

This section describes how to troubleshoot issues related to UDR microservices using metrics.

nudr-drservice

If requests for nudr-drservice fail, try to find the root cause from metrics using the following guidelines:

If the count of the “udr_schema_operations_failure_total” measurement is increasing, check the content of the incoming request and make sure that the incoming JSON data blob is well formed and conforms to the specification.
If the “udr_db_operations_failure_total” measurement is increasing:
- Make sure that there is connectivity between the microservices and the MySQL DB nodes.
- Make sure that you are not trying to insert duplicate keys.
- Make sure that the DB nodes have enough resources available.
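
The checks above come down to comparing two samples of the same counters. The following is a minimal Python sketch, assuming the counter values have already been collected into plain dictionaries (for example from two Prometheus queries). The helper name `increasing_counters` and the sample values are illustrative and not part of UDR.

```python
def increasing_counters(previous, current, watched):
    """Return the watched counter names whose value grew between two samples.

    previous/current: dicts mapping metric name -> counter value
    (assumed to be collected by the reader; not a UDR API).
    """
    grown = []
    for name in watched:
        before = previous.get(name, 0)
        after = current.get(name, 0)
        if after > before:
            grown.append(name)
    return grown


WATCHED = [
    "udr_schema_operations_failure_total",
    "udr_db_operations_failure_total",
]

# Hypothetical sample values, for illustration only.
earlier = {"udr_schema_operations_failure_total": 4, "udr_db_operations_failure_total": 10}
later = {"udr_schema_operations_failure_total": 4, "udr_db_operations_failure_total": 17}

print(increasing_counters(earlier, later, WATCHED))
```

With these sample values only the database failure counter is reported, which points to the connectivity, duplicate key, and resource checks rather than to the request payload.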

nudr-dr-provservice

If requests for nudr-dr-provservice fail, try to find the root cause from metrics using the following guidelines:

If the count of the “udr_schema_operations_failure_total” measurement is increasing, check the content of the incoming request and ensure that the incoming JSON data blob is well formed and conforms to the specification.
If the “udr_db_operations_failure_total” measurement is increasing, ensure that:
- there is connectivity between the microservices and the MySQL DB nodes.
- you are not trying to insert duplicate keys.
- the database nodes have enough resources available.

nudr-nrf-nfmanagement

If requests for nudr-nrf-nfmanagement fail, try to find the root cause from metrics using the following guidelines:

Check the current health status of NRF using the nrfclient_nrf_operative_status metric. A value of 0 means that NRF is UNHEALTHY or UNAVAILABLE.
Check the current NF status using the nrfclient_nrf_operative_status metric, and the NF status with NRF using the nrfclient_nf_status_with_nrf metric.
If the NF status is 0, check the appinfo_service_running metric for the services configured in the app-info section, depending on the UDR mode.

nudr-nrf-client-service

If requests for nudr-nrf-client-service fail, try to find the root cause from metrics using the following guidelines:

Check the current health status of NRF using the "nrfclient_nrf_operative_status" metric. A value of 0 means that NRF is UNHEALTHY or UNAVAILABLE.
Check the current network function status using the "nrfclient_nrf_operative_status" metric, and the network function status with NRF using the "nrfclient_nf_status_with_nrf" metric.
If the network function status is 0, check the "appinfo_service_running" metric for the services configured in the app-info section, depending on the UDR mode.

ocudr-nudr-notify-service

If requests for ocudr-nudr-notify-service fail, try to find the root cause from metrics using the following guidelines:

Measurements such as “nudr_notif_notifications_ack_2xx_total”, “nudr_notif_notifications_ack_4xx_total”, and “nudr_notif_notifications_ack_5xx_total” give information about the response code returned in the notification response.
If the count of the “nudr_notif_notifications_send_fail_total” measurement is increasing, make sure that the notification server specified in NOTIFICATION_URI during the subscription request, which is expected to receive the notifications, is up and running.
The default retry count for failed notifications is two. It is configurable through the retrycount parameter in the custom values yaml file. Perform the following steps if an alert is raised for exceeding the notifications table limit threshold:
- Log in to the MySQL database terminal and check the number of records in the NOTIFICATIONS table of the UDR subscriber database (select count(*) from NOTIFICATIONS).
- Perform the following steps if the notification record count is consistently above 50k:
  - Check whether notifications sent from the notify service are failing more often, using the nudr_notif_notifications_ack_4xx_total and nudr_notif_notifications_ack_5xx_total metrics.
  - Check the reason for the failure and resolve it.
  - If the failure is temporary and cannot be avoided, use the notifications configuration REST API or CNC Console to reduce the retrycount to 0 or 1. This slows the growth of the table.
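
The decision flow for the NOTIFICATIONS table can be sketched in Python as follows. This is an illustration only: the function name, the reading of "consistently above 50k" as "every sampled count is above 50,000", and the fallback message are assumptions, not UDR behaviour.

```python
THRESHOLD = 50_000  # record count named in the steps above


def notification_backlog_action(record_counts, ack_4xx_delta, ack_5xx_delta):
    """Suggest the next step for a NOTIFICATIONS table backlog.

    record_counts: successive results of `select count(*) from NOTIFICATIONS`.
    ack_4xx_delta / ack_5xx_delta: growth of the 4xx and 5xx ack counters
    over the same period (collected by the reader).
    """
    # "Consistently above" is interpreted as: every sample exceeds the threshold.
    if not record_counts or any(count <= THRESHOLD for count in record_counts):
        return "no action: count is not consistently above the threshold"
    if ack_4xx_delta > 0 or ack_5xx_delta > 0:
        return "investigate failure responses; if unavoidable, reduce retrycount to 0 or 1"
    # Not covered by the steps above; left to the operator.
    return "count is high but 4xx/5xx acks are not growing"


print(notification_backlog_action([60_000, 72_000], ack_4xx_delta=5, ack_5xx_delta=0))
```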

ocudr-nudr-config

If requests for ocudr-nudr-config fail, try to find the root cause from metrics using the following guidelines:

Measurements such as “nudr_config_total_requests_total{Method='GET'}”, “nudr_config_total_requests_total{Method='POST'}”, and “nudr_config_total_requests_total{Method='PUT'}” give the total number of requests pegged for the GET, POST, and PUT methods respectively.
If the count of the “nudr_config_total_responses_total{Method='GET/POST/PUT',StatusCode="400/404/405/500"}” measurement is increasing, requests are not being processed and are resulting in failures.
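
To see which method and status code combinations account for the failures, the response counter can be grouped by its labels. The following Python sketch assumes the metrics are available as Prometheus text exposition lines without timestamps. The sample values are hypothetical and the helper `config_failure_counts` is illustrative only.

```python
import re

# Matches lines of the form: metric_name{label="value",...} 12
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)\{([^}]*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')
FAILURE_CODES = {"400", "404", "405", "500"}  # status codes listed above


def config_failure_counts(metrics_text):
    """Sum nudr_config_total_responses_total per (Method, StatusCode) for failure codes."""
    totals = {}
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group(1) != "nudr_config_total_responses_total":
            continue
        labels = dict(LABEL_RE.findall(match.group(2)))
        status = labels.get("StatusCode")
        if status in FAILURE_CODES:
            key = (labels.get("Method"), status)
            totals[key] = totals.get(key, 0.0) + float(match.group(3))
    return totals


# Hypothetical scrape output, for illustration only.
SAMPLE = """\
nudr_config_total_responses_total{Method="GET",StatusCode="200"} 120
nudr_config_total_responses_total{Method="GET",StatusCode="400"} 3
nudr_config_total_responses_total{Method="PUT",StatusCode="500"} 1
"""

print(config_failure_counts(SAMPLE))
```

Running the function on two scrapes taken some time apart and comparing the results shows whether a particular failure combination is still growing.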

If requests for ocudr-nudr-config fail, try to find the root cause from configdb using the following guidelines:

If you get a BAD REQUEST for a GET API, make sure that all the tables shown below are present in configdb.

Figure 4-37 Configdb Table
If all the tables are present and you still get a BAD REQUEST for a GET API, verify the configuration item table shown below.

Figure 4-38 Configuration Item Table
If you get a BAD REQUEST or NOT FOUND for the Import and Export API, verify the import and export data table shown below.

Figure 4-39 Import and Export Data

ocudr-nudr-bulk-import

The following are some of the known errors and how to address them.

If the bulk-import logs show "dr-service is down. Job cannot be executed", check whether dr-service and Ingress Gateway are in the running state.
If the count of the nudr_bulk_import_csvfile_records_read_total{Method="DELETE/PUT/POST", Status="Failure"} metric is increasing, the CSV file records are not valid. To resolve this, provide the correct keyType, KeyValue, operationType, nfType, and jsonPayload.
If the count of the nudr_bulk_import_records_processed_total{Method="POST/PUT/DELETE", StatusCode="201/204", Status="Success"} metric is increasing, the records are being processed by UDR correctly.
To find the number of requests processed successfully for PCF, check the count of the Nudr_bulk_import_PCF_total{StatusCode="204/201", Status="Success"} metric.

For information about bulk import metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

ocudr-nudr-xmltocsv

After copying the ixml file using the kubectl cp command, log in to the xmltocsv container and change to the target directory to check whether the file has been copied:

kubectl exec -it <pod name> -c nudr-xmltocsv -n <namespace> -- bash
cd /home/udruser/xml

If the count of the nudr_xmltocsv_xmlfile_records_read_total{Status="Failure"} metric is increasing, the records in the ixml file are not valid. Ensure that the correct ixml file is provided.

If the count of the nudr_xmltocsv_records_processed_total{Method="POST/PUT/DELETE/PATCH", Status="Success"} metric is increasing, the records are being processed successfully.

For information about xmltocsv metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

ocudr-nudr-diameterproxy

If diameterproxy restarts, then make sure the database configurations are correct. For information about ocudr-nudr-diameterproxy metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

diam-gateway

If the Diameter Gateway sends a CEA message with the DIAMETER_UNKNOWN_PEER result code, the client peer is not configured correctly. Configure the allowedClientNodes section of the Diameter Gateway service using the REST API.

If the Diameter Gateway sends a successful CEA message but other Sh message responses carry the DIAMETER_UNABLE_TO_COMPLY or DIAMETER_MISSING_AVP result code, the problem may lie in the requested Sh message.

If the Diameter Gateway error logs show errors such as connection refused for an IP address and port, the peer node configured at that address is not able to accept the CER request from the Diameter Gateway, and the Diameter Gateway retries the connection to that peer.

If you get the DIAMETER_UNABLE_TO_DELIVER error message, diameterproxy is turned off or not running. If the Diameter Gateway goes into the CrashLoopBackOff state, an incorrect peer node is configured.

Use the ocudr_diam_conn_network metric to verify the active connections with the peer nodes.

For information about diam-gateway metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

nudr-migration

If a pod is in the Pending state, the required resources are not available in the CNE. If a pod is in the ImagePullBackOff state, the image cannot be fetched from the repository. Run the following command to check the details:

kubectl describe pod <pod-name> -n <namespace>

If the pod is in the running state and data migration has not happened, then:

Check the logs and search for ERROR entries.
Check whether the source UDR or the target UDR is down. Verify this in the logs.

If you are not able to connect to 4G UDR, then:

Check logs for DIAMETER_UNABLE_TO_COMPLY in CER/CEA messages.
Check whether UDR/UDA messages are received from 4G UDR.
Check whether K8S_HOST_IP is the same as the external IP address of the Kubernetes node that you specified in affinity. If they are different, you get DIAMETER_UNABLE_TO_COMPLY in the CEA response.

For information about nudr-migration metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

overload-manager

To troubleshoot errors related to overload-manager, consider the following points:

In the global section, if the overloadmanager flag is disabled, the overload manager REST APIs of Ingress Gateway and the perf-info microservice do not load.
If the overload manager data is not present in the common_configuration table, ensure that the overloadmanager flag is enabled at the global level.
The svcName configured in the ocpolicymapping API must be taken from the routesConfig section. If the svcName configured in policymapping differs from the svcName configured in routesConfig, the overload manager does not trigger.
To check the load level of a specific metric, check the perf-info logs, which contain the load level of each metric.
If the alerts are not raised for overload manager, then ensure the alerts are properly loaded and are not loaded from Prometheus.
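
The svcName consistency check mentioned above can be sketched as a set comparison. This Python example assumes the svcName values have already been extracted from the policy mapping and routesConfig configurations into plain lists. The service names shown are hypothetical placeholders, not actual UDR values.

```python
def unmatched_svc_names(policy_mapping_svc_names, routes_config_svc_names):
    """Return svcName values used in policy mapping that routesConfig does not define."""
    known = set(routes_config_svc_names)
    return sorted(name for name in set(policy_mapping_svc_names) if name not in known)


# Hypothetical names, for illustration only.
routes_config = ["example-svc-a", "example-svc-b"]
policy_mapping = ["example-svc-a", "example-svc-c"]

print(unmatched_svc_names(policy_mapping, routes_config))
```

Any name returned by the function is a candidate explanation for the overload manager not triggering for that service.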

On-demand migration Range Support

To troubleshoot errors related to on-demand migration range support, consider the following points:

By default, if the configuration is unchanged, on-demand migration works for all key types and key values. Check the REST configuration of the global section for the key type and key range.
If on-demand migration does not trigger after the key type and key range are set through the global configuration API, perform the following step:
- Check that the key type and key range specified in the configuration API match the key type and key range used for the test. Valid key types are Mobile Station International Subscriber Directory Number (MSISDN) and International Mobile Subscriber Identity (IMSI).
If the on-demand migration range support feature is not used, you can set the default key type and key range from the global configuration API as follows:
```
"keyType": "msisdn",
"keyRange": "000000-000000"
```

For information about on-demand migration metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

4.4.2 Debugging Errors from Egress Gateway

If the traffic is not routed through Egress Gateway, then check the following:

Check whether global.egress is enabled.
Check whether the Egress pod is running. To check, run the following kubectl command:
kubectl get pods -n <Release.name>
To enable outgoing traffic over HTTPS, set the enableOutgoingHttps parameter to 'true'.
Create unique certificates and keys for all Egress and respective Ingress NFs. This is the same as in Ingress debugging.

Debugging Errors When SCP Integration is Enabled

UDR Egress Gateway route configurations are performed to route all the notifications through SCP, and the NRF traffic is sent directly to the NRF host. If the routing does not work, then configure the routes as follows:

Figure 4-40 Routes Config

Note:

The above configuration is present as part of default values.

If you want to send notifications through SCP, configure Egress Gateway as shown in the following image. If setId 0 is used, configure both httpConfigs and httpsConfigs as shown in the image. For a setId that has a static host configuration, it is mandatory to configure httpsConfigs (even if it is not used) using dummy values as shown in the image. If it is not configured, the Egress Gateway log shows a NullPointerException.

Figure 4-41 Sending Notification Through SCP

If setId 1 or 2 is used, enable the Alternate Route service and configure the proper host details for Egress Gateway to communicate with it. If the configuration is not as expected, Egress Gateway returns a 425 error, which is the default error configured for virtual FQDN lookup failure. If you see 503 or other 4xx errors, the actual endpoint or SCP is not reachable.

Figure 4-42 Using setId 1 or 2

Figure 4-43 Using setId

Figure 4-44 Using setId 1 or 2 (cont..)

Figure 4-45 Using setId 1 or 2 (cont..)
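
The status code interpretation given above for setId 1 or 2 can be summarised in a small Python helper. This is only a restatement of the text as code. The function name and the fallback message are illustrative, and 425 is the default error code mentioned above, which a deployment may have changed.

```python
def classify_egress_status(status_code):
    """Map an Egress Gateway error status to the likely cause described in this section."""
    if status_code == 425:
        # Default error configured for virtual FQDN lookup failure.
        return "virtual FQDN lookup failure: check the Alternate Route service configuration"
    if status_code == 503 or 400 <= status_code < 500:
        return "actual endpoint or SCP is not reachable"
    return "not covered by this section"


for code in (425, 503, 404):
    print(code, "->", classify_egress_status(code))
```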

Whether a failed request is retried on another SCP depends on the failure code and the operation performed. If they are not in the configured list, no retry is attempted. The number of retries depends on the retries configuration, as follows:

Figure 4-46 SCP Retry

Also, ensure that scpRerouteEnabled is set to true.

Figure 4-47 scpRerouteEnabled set to true

If DNS resolution from core-dns service does not happen, check whether the following configuration is enabled on alternate-route service.

Figure 4-48 DNS Srv Configuration

4.4.3 Debugging Errors from Ingress Gateway

The possible errors that you may encounter from Ingress Gateway are:

Check for 404 Error: If the request fails with a 404 status code and the following ProblemDetails, there may be issues with the routeConfig in the ingressgateway custom values file.
```
{"title":"404 NOT_FOUND","status":404,"detail":"udr001.oracle.com: ingressgateway: NOT_FOUND: OUDR-IGWSIG-E183"}
```
Check the custom values.yaml file for the essential route configurations. If they are not present, add them.
Check for 503 Error: If the request fails with a 503 status code and "SERVICE_UNAVAILABLE" in ProblemDetails, the nudr-drservice pod is not reachable.
```
{"title":"Service Unavailable","status":503,"detail":"udr001.oracle.com: ingressgateway: Service Unavailable: OUDR-IGWSIG-E003","cause":"Encountered unknown host exception at IGW"}
```
You can confirm this in the error and exception logs of the ocudr-ingressgateway pod. Check the ocudr-nudr-drservice pod status and fix the issue.
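
When triaging many failed requests, it can help to pull the status and error code out of the ProblemDetails body automatically. The following Python sketch assumes the detail field always has the four colon-separated parts seen in the two examples above (host, component, reason, error code). That layout is inferred from the examples, not from a specification, and the function names are illustrative.

```python
import json


def summarize_problem(body):
    """Extract status, component, and error code from a ProblemDetails JSON body."""
    problem = json.loads(body)
    parts = [part.strip() for part in problem.get("detail", "").split(":")]
    has_layout = len(parts) == 4  # host, component, reason, error code (assumed)
    return {
        "status": problem.get("status"),
        "component": parts[1] if has_layout else None,
        "error_code": parts[3] if has_layout else None,
        "cause": problem.get("cause"),
    }


def next_check(status):
    """Return the follow-up check this section gives for a status code."""
    if status == 404:
        return "check the route configuration in the custom values file"
    if status == 503:
        return "check the ocudr-nudr-drservice pod status"
    return "not covered by this section"


body = (
    '{"title":"404 NOT_FOUND","status":404,'
    '"detail":"udr001.oracle.com: ingressgateway: NOT_FOUND: OUDR-IGWSIG-E183"}'
)
summary = summarize_problem(body)
print(summary["error_code"], "->", next_check(summary["status"]))
```

If the detail field does not split into exactly four parts, for example because the host contains a port, the component and error code are returned as None rather than guessed.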

4.4.4 Debugging Errors from nudr-config

The Cloud Native Core (CNC) Console GUI uses nudr-config to update or view the configuration items. The details of the errors returned by nudr-config are as follows:

Check for 400 Error: If the following request fails with a 400 status code and "404 Not Found", the logging level information is not present in the database or the microservice is not enabled.

Figure 4-49 Checking for 400 Error

If common_config_hook is unable to create the configuration item for common services such as ingress-gateway, egress-gateway, or alternate-route, the GET request for logging gives the following response:

Figure 4-50 Response of Get Request for Logging

4.4.5 Debugging Notification Issues

If UDR does not generate any notifications, check the notify service port configuration in the values.yaml file. These ports must be the same as the ports on which the notify service is running.

nudr-drservice:
...
...
...
    notify:
        port:
            http: 5001
            https: 5002
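
The port comparison can be sketched in Python as follows, assuming the configured ports and the ports the notify service actually listens on have been collected into dictionaries. The "running" values below are hypothetical and the helper name is illustrative.

```python
def port_mismatches(configured, running):
    """Return {scheme: (configured_port, running_port)} for every port that differs."""
    mismatches = {}
    for scheme in sorted(set(configured) | set(running)):
        if configured.get(scheme) != running.get(scheme):
            mismatches[scheme] = (configured.get(scheme), running.get(scheme))
    return mismatches


configured = {"http": 5001, "https": 5002}  # values from the snippet above
running = {"http": 5001, "https": 5003}     # hypothetical ports of the notify service

print(port_mismatches(configured, running))
```

An empty result means the configured ports match the ports in use, so the missing notifications have another cause.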