4 Troubleshooting Unified Data Repository

This chapter provides information to troubleshoot common errors that can be encountered during the preinstallation, installation, upgrade, and rollback procedures of Oracle Communications Cloud Native Core, Unified Data Repository (UDR).

4.1 Generic Checklist

The following sections provide a generic checklist for troubleshooting UDR:

Deployment Related Checklist

  • Run the following command to check the installation of kubectl.
    $ kubectl
    If kubectl is not installed, see https://kubernetes.io/docs/tasks/tools/install-kubectl/ for installation instructions.
  • Run the following command to check the installation of UDR.
    $ kubectl get pods -n <ocudr-namespace>

    Figure 4-1 Sample Output: UDR Pods Status


    Note:

    Ensure that the STATUS of all the pods is 'Running'.
  • Run the following command to view all the events related to a particular namespace.
    kubectl get events -n <ocudr-namespace>
  • Ensure the preinstall job is in the completed state and all the UDR microservices are in the running state.
  • To verify the database and user creation, the following guidelines must be followed:
    • If the preinstall pod is in the 'ERROR' state, run the following command to check the logs to debug the issue.
      kubectl logs -n <namespace> <pre install pod name>
    • If you see the following message in logs, it is possibly because the MySQL server does not allow remote connections to the privileged users.
      {"thrown":{"commonElementCount":0,"localizedMessage":"Access denied for user 'root'@'10.233.118.132' to database 'saqdb'","message":"Access denied for user 'root'@'%' to database 'saqdb'","name":"java.sql.SQLSyntaxErrorException","extendedStackTrace":"java.sql.SQLSyntaxErrorException: Access denied for user 'root'@'%' to database 'saqdb'\n\tat
      To fix the user access error, perform the following steps on all the SQL nodes to modify the user table in the MySQL DB.
      
      1. mysql> update mysql.user set host='%' where User='<privileged username>';
         Query OK, 0 rows affected (0.00 sec)
         Rows matched: 1  Changed: 0  Warnings: 0
      2. mysql> flush privileges;
         Query OK, 0 rows affected (0.06 sec)
    • If the preinstall job is complete but the dr-service and notify-service pods are crashing with a similar error message in the logs, the database user may not have been created. To fix this, set the value of the createUser field in the custom-values.yaml file to 'true' before installing UDR.

      Note:

      For more information on creating a database user, see the Creating Database User or Group section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.
      preInstall:
          image:
            name: nudr_pre_install_hook
            tag: 25.1.202
          config:
            logLevel: WARN
          # Flag to enable user creation. Keep this flag true.
          # Change to false when installed with vDBTier. For vDBTier installation, user creation
          # on the DB must be done manually.
          createUser: true
    • If the preinstall pod is in the ERROR state with the following error message in logs:
      "message":"Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'createUser': Invocation of init method failed; nested exception is java.sql.SQLException: NDB_STORED_USER privilege is not supported. Please use MySQL version 8.0.22 or higher",
      

      Then, it could be because the data tier you are trying to connect to has a MySQL package installed that does not support the NDB_STORED_USER privilege. To fix this, set the createUser flag to 'false' and create the user manually on all the SQL nodes.

    • If the preinstall hook logs contain the error message "The database secret is empty" or "Invalid data present in the secret", create the secret as mentioned in the Installing Unified Data Repository chapter in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide. Check the case sensitivity of the keys in the secret, for example, encryptionKey, dsusername, and so on.
  • Run the following command to verify whether UDR specific pods are working as expected:
    $ kubectl get pods -n <ocudr-namespace>

    Figure 4-2 Sample Output: UDR Pods Status



    Result: In the figure above, the status of all the pods is 'Running'.

    Note:

    The number of pods for each service depends on Helm configuration. In addition, all pods must be in a ready state and you need to ensure that there are no continuous restarts.
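The pod checks in the list above can be scripted. The following sketch filters `kubectl get pods` output for pods that are not in the Running or Completed state; the sample output and pod names below are illustrative placeholders, not real deployment values.

```shell
# Filter pods that are not Running or Completed (sample data; names are hypothetical).
# Against a live cluster you would pipe the real output instead:
#   kubectl get pods -n <ocudr-namespace> | awk 'NR>1 && $3!="Running" && $3!="Completed"'
sample='NAME                        READY   STATUS    RESTARTS   AGE
nudr-drservice-7d9f-abcde   1/1     Running   0          5m
nudr-pre-install-hook-xyz   0/1     Error     3          5m'
echo "$sample" | awk 'NR>1 && $3!="Running" && $3!="Completed" {print $1, $3}'
```

An empty result means all pods are healthy; any line printed names a pod to investigate with `kubectl describe` or `kubectl logs`.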

Helm Installation Checklist

Run the following command to check the installation of helm.

$ helm ls

If helm is not installed, run the following set of commands one after another to install helm:

  1. curl -o /tmp/helm.tgz https://storage.googleapis.com/kubernetes-helm/helm-v2.9.1-linux-amd64.tar.gz
    Replace the URL with the latest Helm download link.
  2. tar -xzvf /tmp/helm.tgz -C /usr/local/bin --strip-components=1 linux-amd64/helm
    rm -f /tmp/helm.tgz
  3. kubectl create serviceaccount --namespace kube-system tiller
  4. kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
  5. helm init --service-account tiller
  6. kubectl get po -n kube-system
    # Wait for the tiller pod to be up
  7. helm ls
    # Does not return any error. Try again if an error is returned as the tiller pod may be coming up.
  8. helm install
    # If this command fails immediately with a syntax error, check that the required arguments for the helm install command are provided.

Database Related Checklist

To verify database connectivity:

  • Log in to the NDB cluster and verify the creation of the UDR database with all the tables. To check the entries in the database tables, run the following command:
    select count(*) from RESOURCE_MAP;
    A successful query confirms that the connection is working and the database was created successfully. The count differs based on the udrServices option selected under the global section of the custom-values.yaml file, but this table cannot be empty.

    Figure 4-3 Sample Output: Verifying Table Entries in Database



  • To verify UDR subscribers, check the provisioning flow on UDR. Use the following provisioning URL supported on UDR to verify the provisioning flow:
    • If you use external tools, such as Postman or an HTTP/2-capable curl, use the following URL:

      http://<ocudr-ingress-gateway-ip>:<http-external-port>/nudr-dr-prov/v1/profile-data/msisdn-1111111113

      In case of curl, the client must support HTTP/2.
    • If HTTPS is enabled in UDR Ingress Gateway, then follow this URL:

      https://<ocudr-ingress-gateway-ip>:<https-external-port>/nudr-dr-prov/v1/profile-data/msisdn-1111111113

    Verifying the provisioning flow on UDR also confirms the udrdb status on the NDB cluster.

  • Check the nudr-nrf-client-nfmanagement logs to confirm that there are no 503 errors. This helps to find out whether all the FQDNs configured in the values file as part of the Helm configuration are resolvable.
  • Verify NRF registration by checking the nrfclient_current_nf_status and nrfclient_nf_status_with_nrf metrics on Prometheus.
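As a sketch of the provisioning check above, the URL can be composed from deployment values and then fetched with an HTTP/2-capable curl. The host, port, and MSISDN below are hypothetical placeholders; the curl invocation is shown as a comment because it needs a reachable UDR Ingress Gateway.

```shell
# Compose the provisioning URL used to verify the flow (placeholder values only).
IGW_HOST="10.0.0.10"   # stands in for <ocudr-ingress-gateway-ip>
IGW_PORT="80"          # stands in for <http-external-port>
MSISDN="1111111113"
URL="http://${IGW_HOST}:${IGW_PORT}/nudr-dr-prov/v1/profile-data/msisdn-${MSISDN}"
echo "$URL"
# With an HTTP/2-capable curl, the GET would then be:
#   curl --http2-prior-knowledge -s "$URL"
```

For the HTTPS case, swap the scheme and port for the `https` values shown above.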

4.2 Database Related Issues

This section describes the database related issues.

Verifying SQL Exception Failures with nudr-pre-install-hook pod

The nudr-pre-install-hook pod creates the UDR database along with the required tables. If it does not create the database, perform the following steps to debug the pod failure.

  • Verify whether the helm install command hangs for a long time or fails with the BackOffLimit Exceeded error.
  • Watch the kubectl get pods command based on the release namespace.
  • Check whether the nudr-preinstall pod is in the Error state. This means that the DB creation has failed or the connection to the DB is unsuccessful.
  • Run the following command to check the logs:
    kubectl logs <udr-pre-install-hook pod id> -n <ocudr-namespace>
  • Check the log output of the pod continuously for warnings or SQL exceptions using the above command. If any warning or SQL exception is found, there is an issue with the SQL connection or the SQL node. Examine each exception thoroughly to find the root cause.
  • Verify the following information in the values.yaml file.
    global:
    
      ...
    
      ...
    
      ...
    
      # MYSQL Connectivity Configurations
      mysql:
        dbServiceName: &dbHostName "mysql-connectivity-service.occne-ndb"  #This is a read only parameter. Use the default value.
        port: &dbPortNumber "3306"
        configdbname: &configdbname udrconfigdb
        dbname: &dbname udrdb
        # Do not change the below values
        dbUNameLiteral: &dbUserName dsusername
        dbPwdLiteral: &dbUserPass dspassword
        dbEngine: &dbEngine NDBCLUSTER
    
      nrfClientDbName: *configdbname
      dbCredSecretName: &dbSecretName 'ocudr-secrets'
  • Ensure that the following service is available in the Cloud Native Environment (CNE).

    Figure 4-4 Service Availability in CNE

  • Check whether Kubernetes secrets are present. If secrets exist, then check their encrypted details like username, password, and DB name. If these details do not exist, then update the secrets.
  • After making any changes, run the following command to upgrade Helm.
    helm upgrade <release> <helm chart> [--version <OCUDR version>] --namespace <ocudr-namespace> -f <ocudr_values.yaml>

    For more information, see the Creating Kubernetes Secret - DBName, Username, Password, and Encryption Key section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.
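When checking the Kubernetes secret from the steps above, remember that secret values are base64-encoded; decoding them locally makes typos and case errors in keys such as dsusername and dspassword easy to spot. The encoded value below is a made-up example, and the secret name mirrors the `ocudr-secrets` value from the sample configuration.

```shell
# Secret values are base64-encoded; decode them locally to check for typos.
# Against a live cluster you would fetch a value with, for example:
#   kubectl get secret ocudr-secrets -n <ocudr-namespace> \
#     -o jsonpath='{.data.dsusername}' | base64 -d
encoded="dWRydXNlcg=="           # illustrative value only
echo "$encoded" | base64 -d      # decodes to: udruser
```

Verify that every key name in the secret matches the literals expected in values.yaml (dsusername, dspassword, and so on), including case.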

Verifying SQL Exception Failures with Common Services pre-install-hook pod

  • Run the following command to check the logs:
    kubectl logs <failed-pre-install-hook-pod> -n <ocudr-namespace>
  • Check the log output of the pod continuously for warnings or SQL exceptions using the above command. If any warning or SQL exception is found, there is an issue with the SQL connection or the SQL node. Examine each exception thoroughly to find the root cause.
  • Verify the following information in the values.yaml file.
    global:
    
      ...
    
      ...
    
      ...
    
      # MYSQL Connectivity Configurations
      mysql:
        dbServiceName: &dbHostName "mysql-connectivity-service.occne-ndb"  #This is a read only parameter. Use the default value.
        port: &dbPortNumber "3306"
        configdbname: &configdbname udrconfigdb
        dbname: &dbname udrdb
        # Do not change the below values
        dbUNameLiteral: &dbUserName dsusername
        dbPwdLiteral: &dbUserPass dspassword
        dbEngine: &dbEngine NDBCLUSTER
    
      nrfClientDbName: *configdbname
      dbCredSecretName: &dbSecretName 'ocudr-secrets'
  • Ensure that the following service is available in the Cloud Native Environment (CNE).

    Figure 4-5 Service Availability in CNE

  • Check whether Kubernetes secrets are present. If secrets exist, then check their encrypted details like username, password, and DB name. If these details do not exist, then update the secrets.
  • After making any changes, run the following command to upgrade Helm.
    helm upgrade <release> <helm chart> [--version <OCUDR version>] --namespace <ocudr-namespace> -f <ocudr_values.yaml>

    For more information, see the Creating Kubernetes Secret - DBName, Username, Password, and Encryption Key section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.

Verifying SQL Exception Failure with nudr-pre-upgrade-hook pod

The nudr-pre-upgrade-hook pod takes care of the database schema upgrade of UDR. It adds new tables if required, along with a few more entries to the existing tables. Perform the following steps to debug this pod failure when there is an issue with the database upgrade:
  1. Check whether the helm upgrade command hangs for a long time or fails with the BackOffLimit Exceeded error.
  2. Ensure that the pre_upgrade_hook.yaml file is present in the templates directory of the target charts, with the required annotation. This is for the nudr-pre-upgrade-hook pod to come up.

    "helm.sh/hook": "pre-upgrade"

  3. Watch the kubectl get pods command based on the release namespace.
  4. Run the following command to check whether the nudr-pre-upgrade pod is in the Error state, which means that the DB schema upgrade has failed or the connection to the DB is unsuccessful.

    kubectl logs <nudr-pre-upgrade-hook pod id> -n <ocudr-namespace>

  5. Check the log output of the pod for any warning or SQL Exception. If there is any, it means there is an issue with the SQL connection or the SQL Node. Check the Exception details to get the root cause.
  6. After the upgrade completes, run the following command to verify whether all the pods are running containers with the updated images.

    kubectl describe pod <pod id> -n <ocudr-namespace>

  7. If the nudr-pre-upgrade pod throws an error, check the logs. If the logs contain the "Change in UDR Mode not allowed" error, check whether the configuration of udrServices in the values.yaml file is different from the previous version. If the logs contain the "Change in VSA Level not allowed" error, check whether the configuration of vsaLevel in the values.yaml file is different from the previous version.
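To confirm step 6 across all pods at once, the container image tags can be listed and compared against the target release. The sketch below parses a saved pod/image list; the pod names, images, and target tag are illustrative, not authoritative values.

```shell
# Flag pods whose container image does not carry the target release tag.
# Against a live cluster, the pod/image list would come from:
#   kubectl get pods -n <ocudr-namespace> -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].image}{"\n"}{end}'
target="25.1.202"      # example target tag
sample='nudr-drservice-abc ocudr/nudr_datarepository_service:25.1.202
nudr-notify-service-xyz ocudr/nudr_notify_service:25.1.201'
echo "$sample" | awk -v t=":$target" 'index($2, t) == 0 {print $1, "still on", $2}'
```

Any pod printed is still running a container from the previous release and should be checked for a stuck rollout.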

4.3 Deployment Related Issues

This section describes the most common deployment related issues and their resolution steps. It is recommended that users attempt the resolutions provided in this section before contacting Oracle Support.

4.3.1 Debugging Pre-Installation Related Issues

As of now, there are no known preinstallation related issues that you may encounter before installing UDR. However, it is recommended to see the Prerequisites and PreInstallation Tasks section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide to prepare for UDR installation.

4.3.2 Debugging Installation Related Issues

This section describes how to troubleshoot installation related issues. It is recommended to see the Generic Checklist section in addition to the information shared in this section.

4.3.2.1 Debugging Pod Creation Failure

A pod creation can fail due to various reasons. Some of the possible scenarios are as follows:

Verifying Pod Image Correctness

To verify pod image:

  • Check whether any of the pods is in the ImagePullBackOff state.
  • Check whether the image name used for any pod is incorrect. Verify the following values in the custom-values.yaml file.
    global:
      dockerRegistry: ocudr-registry.us.oracle.com:5000/ocudr
     
    nudr-drservice:
      image:
        name: nudr_datarepository_service
        tag: 25.1.202
     
    nudr-dr-provservice:
      image:
        name: nudr_datarepository_service
        tag: 25.1.202
     
     
    nudr-notify-service:
      image:
        name: nudr_notify_service
        tag: 25.1.202
     
    nudr-config:
      image:
        name: nudr_config
        tag: 25.1.202
     
    config-server:
      # Image details
      image: ocpm_config_server
      imageTag: 25.1.207
      pullPolicy: IfNotPresent
      
    ingressgateway-sig:
      image:
        name: ocingress_gateway
        tag: 25.1.210
      initContainersImage:
        name: configurationinit
        tag: 25.1.210
      updateContainersImage:
        name: configurationupdate
        tag: 25.1.210
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    ingressgateway-prov:
      image:
        name: ocingress_gateway
        tag: 25.1.210
      initContainersImage:
        name: configurationinit
        tag: 25.1.210
      updateContainersImage:
        name: configurationupdate
        tag: 25.1.210
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    egressgateway:
      image:
        name: ocegress_gateway
        tag: 25.1.210
      initContainersImage:
        name: configurationinit
        tag: 25.1.210
      updateContainersImage:
        name: configurationupdate
        tag: 25.1.210
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    nudr-diameterproxy:
      image:
        name: nudr_diameterproxy
        tag: 25.1.202
         
    nudr-ondemandmigration:
      image:
        name: nudr_ondemandmigration
        tag: 25.1.202
     
    alternate-route:
      deploymentDnsSrv:
        image: alternate_route
        tag: 25.1.210
        pullPolicy: IfNotPresent
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    perf-info:
      image: perf-info
      imageTag: 25.1.207
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    app-info:
      image: app-info
      imageTag: 25.1.207
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    nrf-client:
      image: nrf-client
      imageTag: 25.1.208
      imagepullPolicy: Always
     
      dbHookImage:
        name: common_config_hook
        tag: 25.1.210
        pullPolicy: IfNotPresent
     
    nudr-dbcr-auditor-service:
      image:
        name: nudr_dbcr_auditor_service
        tag: 25.1.202
        pullPolicy: IfNotPresent
  • After updating the values.yaml file, run the following command for helm upgrade:

    helm upgrade <helm chart> [--version <OCUDR version>] --name <release> --namespace <ocudr-namespace> -f <ocudr_values.yaml>

  • If the helm install command is stuck for a long time or fails with a timeout error, verify whether the preinstall hooks have come up and check for any ImagePullBackOff status as follows.

    hookImageDetails

    global:
      dockerRegistry: ocudr-registry.us.oracle.com:5000/ocudr
     
      preInstall:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
       
      preUpgrade:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
     
      postUpgrade:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
     
      postInstall:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
     
      preRollback:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
     
      postRollback:
        image:
          name: nudr_common_hooks
          tag: 25.1.202
     
      test:
        image:
          name: nf_test
          tag: 25.1.204

    After updating these values, you can purge the deployment and install helm again.

Verifying Resource Allocation Failure

To verify any resource allocation failure:

  • Run the following command to verify whether any pod is in the Pending state.

    kubectl describe pod <nudr-drservice pod id> -n <ocudr-namespace>

  • Verify whether any warning on insufficient CPU exists in the describe output of the respective pod. If it exists, it means there are insufficient CPUs for the pods to start. Address this hardware issue.
  • If any preinstall hooks are in the Pending state, check the resources allocated for hooks. Do not allocate higher values for hooks. If hooks with lower CPU or memory are going to the Pending state, there is an issue with the available resources on the cluster. Check the resources and reduce the number of CPUs allotted to the pod in the values.yaml file.

    hookresources

    global:
      hookJobResources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 1
          memory: 1Gi

    resources

    nudr-drservice:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-dr-provservice:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-notify-service:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    nudr-config:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi
     
    config-server:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 512Mi
     
    nudr-client:
      resources:
        limits:
          cpu: 1
          memory: 2Gi
        requests:
          cpu: 1
          memory: 512Mi    
     
    nudr-diameterproxy:
      resources:
        limits:
          cpu: 3
          memory: 4Gi
        requests:
          cpu: 3
          memory: 4Gi   
     
    nudr-ondemand-migration:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 2
          memory: 2Gi      
     
    ingressgateway:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    egressgateway:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          initServiceCpu: 1
          initServiceMemory: 1Gi
          updateServiceCpu: 1
          updateServiceMemory: 1Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    alternate-route:
      resources:
        limits:
          cpu: 2
          memory: 2Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
        requests:
          cpu: 2
          memory: 2Gi
          commonHooksCpu: 1
          commonHooksMemory: 1Gi
     
    perf-info:
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
     
    app-info:
      resources:
        limits:
          cpu: 1
          memory: 1Gi
        requests:
          cpu: 1
          memory: 1Gi
  • Run the following helm upgrade command after updating the values.yaml file.

    helm upgrade <release> <helm chart> [--version <OCUDR version>] --namespace <ocudr-namespace> -f <ocudr_values.yaml>
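When pods stay Pending for lack of CPU, it can help to total the requests in the sample profile above against the cluster's free capacity. The following rough sketch sums one replica per service, ignoring init containers and hooks; the service list and values mirror the sample profile and are not authoritative for any given deployment.

```shell
# Sum per-service CPU requests (values from the sample profile above, one
# replica each: dr, prov, notify, config, config-server, client, diameterproxy,
# ondemand-migration, ingress, egress, alternate-route, perf-info, app-info).
total=0
for cpu in 2 2 2 2 2 1 3 2 2 2 2 1 1; do
  total=$((total + cpu))
done
echo "total requested vCPU (one replica each): $total"   # prints 24
```

Compare the total against the allocatable CPU reported by `kubectl describe nodes`, remembering that replica counts multiply these figures.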

Verifying Resource Allocation Issues on Webscale Environment

Webscale environment has openshift container installed. There can be cases where,

  • Pods do not scale after you run the installation command, and the helm install command fails with a timeout error. In this case, check for preinstall hook failures. Run the oc get job command to list the jobs. Describe the job for which the pods are not getting scaled and check whether there are quota limit exceeded errors for CPU or memory.
  • Any of the actual microservice pods do not scale after the hooks are complete. In this case, run the oc get rs command to get the list of replica sets created for the NF deployment. Then, describe the replica set for which the pods are not getting scaled and check for resource quota limit exceeded errors for CPU or memory.
  • The helm install command times out after all the microservice pods are scaled as expected with the expected number of replicas. In this case, check for postinstall hook failures. Run the oc get job command to get the postinstall jobs, describe the job for which the pods are not getting scaled, and check whether there are quota limit exceeded errors for CPU or memory.
  • The resource quota is exceeded beyond the configured limits.
4.3.2.2 Debugging Pod Startup Failure
Follow the guidelines below to debug pod startup failures and liveness check issues:
  • If dr-service, diameter-proxy, and diam-gateway services are stuck in the CrashLoopBackOff state, then the reason could be that config-server is not yet up. A sample log on these services is as follows:
    "Config Server is Not yet Up, Wait For config server to be up."

    To resolve this, make sure that the dependent services nudr-config and nudr-config-server are up; otherwise, the startup probe attempts to restart the pod at the configured interval.

  • If the notify-service and on-demand migration service are stuck in the Init state, the reason could be that the dr-service is not yet up. A sample log on these services is as follows:
    "DR Service is Not yet Up, Wait For dr service to be up."

    To resolve this, check for failures on the dr-service; otherwise, the startup probe attempts to restart the pod at the configured interval.

  • If the microservices connecting to the MySQL database are stuck in the CrashLoopBackOff state, check for MySQL exceptions in the logs and fix them accordingly. If you receive the error messages "The database secret is empty" or "Invalid data present in the secret" in the main service container logs, make sure that the secret is created as mentioned in the document, and check the case sensitivity of the keys in the secret. For example, encryptionKey, dsusername, and so on.
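The known startup messages listed above can be scanned for in one pass over saved logs. The sketch below greps sample log text for those messages; the log lines and timestamps are illustrative, and against a live pod the input would come from `kubectl logs`.

```shell
# Scan saved pod logs for the known startup-dependency and secret errors.
# Against a live pod: kubectl logs <pod> -n <ocudr-namespace> | grep -E '...'
logs='2025-01-01 INFO  application starting
2025-01-01 WARN  Config Server is Not yet Up, Wait For config server to be up.'
echo "$logs" | grep -E 'Config Server is Not yet Up|DR Service is Not yet Up|The database secret is empty|Invalid data present in the secret'
```

Each matched line points at the dependency to check first (config-server, dr-service, or the database secret) before restarting the affected pod.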
4.3.2.3 Debugging UDR with Service Mesh Failure
There are some known failure scenarios that you can encounter while installing UDR with service mesh. The scenarios along with their solutions are as follows:
  • Istio-Proxy sidecar container not attached to the pod: This failure arises when istio injection is not enabled on the namespace where the NF is installed. Run the following command to verify:

    kubectl get namespace -L istio-injection

    Figure 4-6 Verifying Istio-Proxy


    To enable the istio injection, run the following command:

    kubectl label --overwrite namespace <nf-namespace> istio-injection=enabled

  • If any of the hook pods is stuck in the 'NotReady' state and is not cleared after completion, check whether the following configuration is set to 'true' under the global section. Also, ensure that the URL configured for istioSidecarQuitUrl is correct.

    Figure 4-7 When Hook Pod is NotReady

  • When Prometheus does not scrape metrics from nudr-nrf-client-service, see if the following annotation is present under nudr-nrf-client-service:

    Figure 4-8 nrf-client service

  • If there are issues in viewing UDR metrics on OSO Prometheus, you need to ensure that the following highlighted annotation is added to all deployments for the NF.

    Figure 4-9 Issues in Viewing UDR Metrics - Add Annotation

  • When vDBTier is used as the backend and there are connectivity issues when nudr-preinstall communicates with the DB (visible in the error logs of the preinstall hook pod), create the destination rule and service entry for mysql-connectivity-service in the occne-infra namespace.
  • When installed on ASM, if the ingressgateway, egressgateway, or alternate-route services go into CrashLoopBackOff, check whether the coherence ports are excluded for inbound and outbound traffic on the istio-proxy.
  • On the latest F5 versions, if the default resources assigned to istio-proxy are insufficient, make sure that you assign a minimum of one CPU and 1 GB RAM for all UDR services. The traffic handling services must be the same as mentioned in the resource profile. If the pods crash due to insufficient memory, check the configuration. You can refer to the following annotations in the custom values file.
    deployment:
      # Replica count for deployment
      replicaCount: 2
      # Microservice specific annotations for deployment
      customExtension:
        labels: {}
        annotations:
          sidecar.istio.io/proxyCPU: "1000m"
          sidecar.istio.io/proxyCPULimit: "1000m"
          sidecar.istio.io/proxyMemory: "1Gi"
          sidecar.istio.io/proxyMemoryLimit: "1Gi"
          proxy.istio.io/config: |
            terminationDrainDuration: 60s
4.3.2.4 Debugging SLF Default Group ID related Issues
SLF default group ID is added to the SLF_GROUP_NAME table through Helm hooks during UDR installation or upgrade. If a subscriber is not found and the default group ID is enabled, a response with the default group ID is sent. If the default group ID is not found in the response, use the following API to add the default group ID (this is similar to the PUT operation for other SLF groups).
http://localhost:8080/slf-group-prov/v1/slf-group
{
	"slfGroupName": "DefaultGrp",
	"slfGroupType": "LteHss",
	"nfGroupIDs": {
		"NEF": "nef-group-default",
		"UDM": "udm-group-default",
		"PCF": "pcf-group-default",
		"AUSF": "ausf-group-default",
		"CHF": "chf-group-default"
	}
}

The default group name is dynamically editable through CNC Console (CNCC). If the user changes the default group name on CNCC and does not add the same name to the SLF_GROUP_NAME table, the new default group name can be added through the API as mentioned above.
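A sketch of the PUT described above: the payload mirrors the sample body, and the host and port come from the example URL. The curl invocation is left as a comment because it needs a reachable UDR; only the payload construction runs locally.

```shell
# Build the default SLF group payload (mirrors the sample body above).
cat > /tmp/default-slf-group.json <<'EOF'
{
  "slfGroupName": "DefaultGrp",
  "slfGroupType": "LteHss",
  "nfGroupIDs": {
    "NEF": "nef-group-default",
    "UDM": "udm-group-default",
    "PCF": "pcf-group-default",
    "AUSF": "ausf-group-default",
    "CHF": "chf-group-default"
  }
}
EOF
# The PUT against a reachable UDR would then be:
#   curl -sS -X PUT -H 'Content-Type: application/json' \
#     -d @/tmp/default-slf-group.json \
#     http://localhost:8080/slf-group-prov/v1/slf-group
```

If the default group name was changed through CNCC, update "DefaultGrp" in the payload to match before sending the request.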

4.3.2.5 Debugging Subscriber Activity Logging
This section describes how to troubleshoot the subscriber activity logging related issues.
  • If subscriber activity logging is not enabled, check the subscriberActivityEnabled flag and subscriberIdentifiers keys in the Global Configurations Parameters. For more information, see Oracle Communications Cloud Native Core, Unified Data Repository User Guide.
  • If you are not getting the subscriber logs after enabling the flag, make sure that the subscriber identifiers mentioned in the configuration API contain the same key values as the subscriber identifiers under test.
  • Each subscriber identifier type can be configured with up to 100 keys using CNC Console or REST API.
  • You can remove a subscriber from this feature by removing the subscriber identifier keys from the Global Configurations Parameters as shown below:
    "subscriberActivityEnabled": true,
    "subscriberIdentifiers": {
      "nai": [],
      "imsi": [
        "1111111127", "1111111128"
      ],
      "extid": [],
      "msisdn": [
        "1111111129", "1111111130"
      ]
    }
4.3.2.6 Debugging Subscriber Bulk Import Tool Related Issues
The subscriber bulk import tool pod can run into the Pending state during installation. The reasons could be as follows:
  • Resources are not available. In this case, allocate more resources for the namespace. The kubectl describe pod command gives you more details on this issue.
  • PVC allocation failed for the subscriber bulk import tool. During reinstallation, there can be a case where the existing PVC is not linked to the subscriber bulk import tool. You can debug the issue based on the details from the describe pod output.
  • The storage class is not configured correctly. In this case, check the correctness of the configuration as below.

    Figure 4-10 Bulk Import Persistent Claim



  • When the subscriber bulk import tool installation is complete, there can be a case where the REST API configurations are not working. In this case, make sure that the following configuration is updated in the custom-values.yaml file.

    Figure 4-11 OCUDR Release name



  • If the transfer-in and transfer-out functionality is not working after the remote host transfer is enabled, verify the following to resolve the issue:
    • The Kubernetes secrets created are correct for the private and public keys.
    • The remote host configured and the secrets created are for the same remote host.
    • The remote path is correct.
    • The space remaining on the remote host is sufficient for the size of the file that you are transferring.
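
As a quick first check for the Pending-state scenarios above, you can list only the pods that are stuck in Pending and then run "kubectl describe pod" on them. A hedged sketch, shown here against sample output (replace the here-document with the live kubectl command):

```shell
# Hedged sketch: extract pods in the Pending state from "kubectl get pods"
# output. Against a live cluster, pipe in:
#   kubectl get pods -n <ocudr-namespace> --no-headers
pending_pods() {
  awk '$3 == "Pending" {print $1}'
}
# Sample output with one healthy pod and one stuck bulk import pod:
pending_pods <<'EOF'
ocudr-nudr-drservice-7d9f-abcde    1/1  Running  0  2d
ocudr-nudr-bulk-import-6c4b-fghij  0/1  Pending  0  5m
EOF
```

The printed names can then be passed to "kubectl describe pod" to see scheduling or PVC binding failures.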
4.3.2.7 Debugging NF Scoring for a Site
If there are issues related to NF Scoring, then perform the following steps:
  • Perform a GET request using http://<nudr-config-host>:<nudr-config-port>/udr/nf-common-component/v1/app-info/nfScoring to check whether the NF Scoring feature is enabled. If the feature is disabled, the request returns an "ERROR feature not enabled" response. To enable the feature, set the feature flag to true using the REST API and then fetch the NF score.
  • To get detailed information on provisioning and signaling, the multiple ingress gateways option must be set to true for UDR and SLF.
  • If Custom Criteria is enabled and the NF score calculation for custom criteria fails, check the metric name and the other details configured in the custom criteria.
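
The feature-status check above can be scripted as follows. The host and port values are illustrative placeholders for your nudr-config service, and the curl line is left commented because it requires a live deployment:

```shell
# Hedged sketch: build the NF Scoring status URL. The host and port below
# are illustrative placeholders, not values from this guide.
NUDR_CONFIG_HOST=nudr-config.ocudr-ns.svc
NUDR_CONFIG_PORT=5001
URL="http://${NUDR_CONFIG_HOST}:${NUDR_CONFIG_PORT}/udr/nf-common-component/v1/app-info/nfScoring"
echo "$URL"
# curl -s "$URL"   # run against a live deployment to check the feature state
```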
4.3.2.8 Debugging Subscriber Export Tool Related Issues
The Subscriber Export tool pod can run into the Pending state during installation. The reasons could be as follows:
  • Resources are not available. In this case, allocate more resources for the namespace. The "kubectl describe pod" output gives more details on this issue.
  • PVC allocation failed for the subscriber export tool. During reinstallation, there can be a case where the existing PVC is not linked to the subscriber export tool. You can debug the issue based on the details from the describe pod output.
  • The storage class is not configured correctly. In this case, check the correctness of the configuration as shown below.

    Figure 4-12 Export Tool Persistent Claim



  • When the subscriber export tool installation is complete, there can be a case where the REST API configuration is not working. In this case, make sure that the below configuration is updated in the custom-values.yaml file.

    Figure 4-13 OCUDR Release name



  • If the export dump is not generated, check the logs for more details and verify that the configuration is updated correctly as shown below.

    Figure 4-14 Export Tool Persistent Claim Standard



  • If the transfer-in and transfer-out functionality is not working after the remote host transfer is enabled, verify the following to resolve the issue:
    • The Kubernetes secrets created are correct for the private and public keys.
    • The remote host configured and the secrets created are for the same remote host.
    • The remote path is correct.
    • The space remaining on the remote host is sufficient for the size of the file that you are transferring.
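
For the last bullet, the remote-host space check can be automated before starting a transfer. The hedged sketch below demonstrates the comparison locally against /tmp; on a real deployment you would run the df step on the configured remote host, for example over ssh:

```shell
# Hedged sketch: verify that the file to transfer fits into the space
# remaining at the destination. Demonstrated locally; adapt the df call to
# run on the remote host for a real transfer-out check.
FILE=$(mktemp)                                   # stand-in for the export dump
need_kb=$(du -k "$FILE" | awk '{print $1}')      # file size in KB
avail_kb=$(df -Pk /tmp | awk 'NR==2 {print $4}') # free KB at the destination
if [ "$avail_kb" -gt "$need_kb" ]; then
  echo "enough space"
else
  echo "not enough space"
fi
rm -f "$FILE"
```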
4.3.2.9 Debugging Controlled Shutdown Related Issues
If there are issues related to controlled shutdown, then perform the following steps:
  • If controlled shutdown is not enabled, check the GLOBAL configuration section of the REST API. Make sure the enableControlledShutdown parameter is set to true to enable the feature.
  • Once the flag is set to true, you can send a PUT request to udr/nf-common-component/v1/operationalState. The PUT request throws an error if the flag is disabled.
  • When the operational state is set to COMPLETE_SHUTDOWN, all the ingress gateway requests are rejected with the configured error codes. If requests are not rejected, check whether the feature flag is enabled and send a GET request on udr/nf-common-component/v1/operationalState.
  • The subscriber export tool and subscriber import tool reject all new requests that are queued for processing.
  • When the operational state is COMPLETE_SHUTDOWN, the NF status is updated as SUSPENDED at NRF. Check the app-info logs if the status is not updated to SUSPENDED. The logs contain the operational state COMPLETE_SHUTDOWN.
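
The operational-state change described above can be sketched as a PUT request. The host and port are illustrative placeholders, and the body shape is an assumption based on the operationalState resource; the curl line is commented because it needs a live deployment:

```shell
# Hedged sketch: set the operational state through the controlled shutdown
# API. Host, port, and the exact body shape are assumptions.
UDR_CONFIG=http://nudr-config.ocudr-ns.svc:5001
STATE_URL="$UDR_CONFIG/udr/nf-common-component/v1/operationalState"
BODY='{"operationalState": "COMPLETE_SHUTDOWN"}'
echo "PUT $STATE_URL"
# curl -s -X PUT -H 'Content-Type: application/json' -d "$BODY" "$STATE_URL"
```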
4.3.2.10 Debug Readiness Failure
During the lifecycle of a pod, if the pod containers are in the NotReady state, the reasons could be as follows:
  • Make sure that the dependent services are up. Check the logs for the below content:

    Dependent services down, Set readiness state to REFUSING_TRAFFIC

  • Make sure that the database is available and app-info is ready to monitor the database. Check the logs for the below content:

    DB connection down, Set readiness state to REFUSING_TRAFFIC

  • The readiness failure can occur if the resource map or key map table in the database does not have proper content. Check the logs for the below content:

    ReourceMap/KeyMap Entries missing, Set readiness state to REFUSING_TRAFFIC
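
The log checks above can be combined into a single filter that prints the first REFUSING_TRAFFIC reason found. Shown here against sample log lines; replace the here-document with "kubectl logs -n <namespace> <pod-name>":

```shell
# Hedged sketch: surface the readiness-failure reason from pod logs.
readiness_reason() {
  grep 'REFUSING_TRAFFIC' | head -n 1
}
# Sample log lines; pipe real "kubectl logs" output instead.
readiness_reason <<'EOF'
2024-01-01T00:00:00Z INFO  starting service
2024-01-01T00:00:05Z WARN  DB connection down, Set readiness state to REFUSING_TRAFFIC
EOF
```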

4.3.2.11 Enable cnDBTier Metrics with OSO Prometheus
Apply the following YAML file to the cnDBTier setup. For example:
kubectl create -f <.yaml> -n <nsdbtier>
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cndb-to-mysql-external-se
spec:
  exportTo:
  - "."
  hosts:
  - mysql-connectivity-service
  location: MESH_EXTERNAL
  ports:
  - number: 3306
    name: mysql2
    protocol: MySQL
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nf-to-nf
spec:
  exportTo:
  - "."
  hosts:
  - "*.$DOMAIN_NAME"   # DOMAIN_NAME must be replaced with the deployed CNE Domain name
  location: MESH_EXTERNAL
  ports:
  - number: 80
    name: HTTP2-80
    protocol: TCP
  - number: 8080
    name: HTTP2-8080
    protocol: TCP
  - number: 3306
    name: TCP-3306
    protocol: TCP
  - number: 1186
    name: TCP-1186
    protocol: TCP
  - number: 2202
    name: TCP-2202
    protocol: TCP
  resolution: NONE
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: PERMISSIVE
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: nf-to-kube-api-server
spec:
  hosts:
  - kubernetes.default.svc.$DOMAIN_NAME  # DOMAIN_NAME must be replaced with the deployed CNE Domain name
  exportTo:
  - "."
  addresses:
  - 172.16.13.4
  location: MESH_INTERNAL
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  - number: 6443
    name: https-1
    protocol: HTTPS
  resolution: NONE
---
Install Operations Services Overlay (OSO) Prometheus with the cnDBTier namespace configured in the OSO custom-values.yaml file so that the cnDBTier metrics are retrieved. The sample yaml file is given below:
##################################################################################
#                                                                                #
# Copyright (c) 2022 Oracle and/or its affiliates. All rights reserved.          #
#                                                                                #
##################################################################################
 
nameOverride: prom
 
## Helm-Test (Optional)
# Values needed for helm-test, Comment the entire Section if Helm-test not needed.
helmtestimage: occne-repo-host:5000/k8s.gcr.io/ingress-nginx/controller:v1.3.1
useasm: false
namespace: ocudr-ns
clustername: cne-23-1-rc2
resources:
  limits:
    cpu: 10m
    memory: 32Mi
  requests:
    cpu: 10m
    memory: 32Mi
promsvcname: oso-prom-svr
almsvcname: oso-prom-alm
prometheushealthyurl: /prometheus/-/healthy
prometheusreadyurl: /prometheus/-/ready
 
# Note: There are 3 types of label definitions provided in this custom values file
# TYPE1: Global(allResources)
# TYPE2: lb & nonlb TYPE label only
# TYPE3: service specific label
## NOTE: POD level labels can be inserted using the specific pod label sections, every pod/container has this label defined below in all components sections.
# ******** Custom Extension Global Parameters ********
#**************************************************************************
 
global_oso:
# Prefix & Suffix that will be added to containers
  k8Resource:
    container:
      prefix:
      suffix:
 
# Service account for Prometheus, Alertmanagers
  serviceAccountNamePromSvr: ""
  serviceAccountNameAlertMgr: ""
 
  customExtension:
# TYPE1 Label
    allResources:
      labels: {}
 
# TYPE2 Labels
    lbServices:
      labels: {}
 
    nonlbServices:
      labels: {}
 
    lbDeployments:
      labels: {}
 
    nonlbDeployments:
      labels: {}
 
    lbStatefulSets:
      labels: {}
 
# Add annotations for disabling sidecar injections into oso pods here
# eg: annotations:
#       - sidecar.istio.io/inject: "false"
annotations:
  - sidecar.istio.io/inject: "false"
 
## Setting this parameter to false will disable creation of all default clusterrole, clusterrolebinding, role, rolebindings for the components that are packaged in this csar.
rbac:
  create: true
 
podSecurityPolicy:
  enabled: false
 
## Define serviceAccount names for components. Defaults to component's fully qualified name.
##
serviceAccounts:
  alertmanager:
    create: true
    name:
    annotations: {}
  nodeExporter:
    create: false
    name:
    annotations: {}
  pushgateway:
    create: false
    name:
    annotations: {}
  server:
    create: true
    name:
    annotations: {}
 
alertmanager:
  enabled: true
 
  ## Use a ClusterRole (and ClusterRoleBinding)
  ## - If set to false - Define a Role and RoleBinding in the defined namespaces ONLY
  ## This makes alertmanager work - for users who do not have ClusterAdmin privs, but wants alertmanager to operate on their own namespaces, instead of clusterwide.
  useClusterRole: false
 
  ## Set to a rolename to use existing role - skipping role creating - but still doing serviceaccount and rolebinding to the rolename set here.
  useExistingRole: false
 
  ## alertmanager resources name
  name: alm
  image:
    repository: occne-repo-host:5000/quay.io/prometheus/alertmanager
    tag: v0.24.0
    pullPolicy: IfNotPresent
  extraArgs:
    data.retention: 120h
  prefixURL: /cne-23-1-rc2/alertmanager
  baseURL: "http://localhost/cne-23-1-rc2/alertmanager"
  configFileName: alertmanager.yml
  nodeSelector: {}
  affinity: {}
  podDisruptionBudget:
    enabled: true
    minAvailable: 1
  persistentVolume:
    enabled: true
    accessModes:
      - ReadWriteOnce
    annotations: {}
    ## alertmanager data Persistent Volume existing claim name
    ## Requires alertmanager.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    mountPath: /data
    size: 2Gi
    storageClass: "standard"
 
  ## Annotations to be added to alertmanager pods
  ##
  podAnnotations: {}
 
  ## Labels to be added to Prometheus AlertManager pods
  ##
  podLabels: {}
 
  replicaCount: 2
 
  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}
 
  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows to scale replicas to more than 1 pod
    ##
    enabled: true
    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady
 
    ## Alertmanager headless service to use for the statefulset
    ##
    headless:
      annotations: {}
      labels: {}
 
      ## Enabling peer mesh service end points for enabling the HA alert manager
      ## Ref: https://github.com/prometheus/alertmanager/blob/master/README.md
      enableMeshPeer: true
 
      servicePort: 80
 
  ## alertmanager resource requests and limits
  ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
  ##
  resources:
    limits:
      cpu: 20m
      memory: 64Mi
    requests:
      cpu: 20m
      memory: 64Mi
 
  service:
    annotations: {}
    labels: {}
    clusterIP: ""
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 80
    # nodePort: 30000
    sessionAffinity: None
    type: ClusterIP
 
## Monitors ConfigMap changes and POSTs to a URL
## Ref: https://github.com/jimmidyson/configmap-reload
##
configmapReload:
  prometheus:
    enabled: true
    ## configmap-reload container name
    ##
    name: configmap-reload
    image:
      repository: occne-repo-host:5000/docker.io/jimmidyson/configmap-reload
      tag: v0.8.0
      pullPolicy: IfNotPresent
 
    # containerPort: 9533
 
    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
 
    ## Security context to be added to configmap-reload container
    containerSecurityContext: {}
 
    ## configmap-reload resource requests and limits
    ## Ref: http://kubernetes.io/docs/user-guide/compute-resources/
    ##
    resources:
      limits:
        cpu: 10m
        memory: 32Mi
      requests:
        cpu: 10m
        memory: 32Mi
 
  alertmanager:
    enabled: true
    name: configmap-reload
    image:
      repository: occne-repo-host:5000/docker.io/jimmidyson/configmap-reload
      tag: v0.8.0
      pullPolicy: IfNotPresent
 
    # containerPort: 9533
    ## Additional configmap-reload mounts
    ##
    extraConfigmapMounts: []
      # - name: prometheus-alerts
      #   mountPath: /etc/alerts.d
      #   subPath: ""
      #   configMap: prometheus-alerts
      #   readOnly: true
 
    resources:
      limits:
        cpu: 10m
        memory: 32Mi
      requests:
        cpu: 10m
        memory: 32Mi
 
kubeStateMetrics:
  enabled: false
 
nodeExporter:
  enabled: false
 
server:
  enabled: true
  ## namespaces to monitor (instead of monitoring all - clusterwide). Needed if you want to run without Cluster-admin privileges.
  namespaces: []
  #  - ocudr-ns
  name: svr
  image:
    repository: occne-repo-host:5000/quay.io/prometheus/prometheus
    tag: v2.39.1
    pullPolicy: IfNotPresent
  prefixURL: /cne-23-1-rc2/prometheus
  baseURL: "http://localhost/cne-23-1-rc2/prometheus"
 
  ## Additional server container environment variables
  env: []
 
  # List of flags to override default parameters, e.g:
  # - --enable-feature=agent
  # - --storage.agent.retention.max-time=30m
  defaultFlagsOverride: []
 
  extraFlags:
    - web.enable-lifecycle
    ## web.enable-admin-api flag controls access to the administrative HTTP API which includes functionality such as
    ## deleting time series. This is disabled by default.
    # - web.enable-admin-api
    ##
    ## storage.tsdb.no-lockfile flag controls DB locking
    # - storage.tsdb.no-lockfile
    ##
    ## storage.tsdb.wal-compression flag enables compression of the write-ahead log (WAL)
    # - storage.tsdb.wal-compression
 
  ## Path to a configuration file on prometheus server container FS
  configPath: /etc/config/prometheus.yml
 
  global:
    scrape_interval: 1m
    scrape_timeout: 30s
    evaluation_interval: 1m
  #remoteWrite:
    #- url OSO_CORTEX_URL
    # remote_timout (default = 30s)
      #remote_timeout: OSO_REMOTE_WRITE_TIMEOUT
    # bearer_token for cortex server to be configured
    #      bearer_token: BEARER_TOKEN
 
  extraArgs:
    storage.tsdb.retention.size: 1GB
  ## Additional Prometheus server Volume mounts
  ##
  extraVolumeMounts: []
 
  ## Additional Prometheus server Volumes
  ##
  extraVolumes: []
 
  ## Additional Prometheus server hostPath mounts
  ##
  extraHostPathMounts: []
    # - name: certs-dir
    #   mountPath: /etc/kubernetes/certs
    #   subPath: ""
    #   hostPath: /etc/kubernetes/certs
    #   readOnly: true
 
  extraConfigmapMounts: []
 
  nodeSelector: {}
  affinity: {}
  podDisruptionBudget:
    enabled: false
    maxUnavailable: 1
 
  persistentVolume:
    enabled: true
    accessModes:
      - ReadWriteOnce
    annotations: {}
 
    ## Prometheus server data Persistent Volume existing claim name
    ## Requires server.persistentVolume.enabled: true
    ## If defined, PVC must be created manually before volume will be bound
    existingClaim: ""
    size: 2Gi
    storageClass: "standard"
 
  ## Annotations to be added to Prometheus server pods
  ##
  podAnnotations: {}
 
  ## Labels to be added to Prometheus server pods
  ##
  podLabels: {}
 
  ## Prometheus AlertManager configuration
  ##
  alertmanagers:
  - kubernetes_sd_configs:
      - role: pod
    # Namespace to be configured
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
    path_prefix: cne-23-1-rc2/alertmanager
    relabel_configs:
    - source_labels: [__meta_kubernetes_namespace]
    # namespace to be configured
      regex: ocudr-ns
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_app]
      regex: prom
      action: keep
    - source_labels: [__meta_kubernetes_pod_label_component]
      regex: alm
      action: keep
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_probe]
      regex: .*
      action: keep
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      regex:
      action: drop
 
  ## Use a StatefulSet if replicaCount needs to be greater than 1 (see below)
  ##
  replicaCount: 1
 
  ## Annotations to be added to deployment
  ##
  deploymentAnnotations: {}
 
  statefulSet:
    ## If true, use a statefulset instead of a deployment for pod management.
    ## This allows to scale replicas to more than 1 pod
    ##
    enabled: false
 
    annotations: {}
    labels: {}
    podManagementPolicy: OrderedReady
 
  resources:
    limits:
      cpu: 2
      memory: 4Gi
    requests:
      cpu: 2
      memory: 4Gi
 
  service:
    enabled: true
    annotations: {}
    labels: {}
    clusterIP: ""
    externalIPs: []
    loadBalancerIP: ""
    loadBalancerSourceRanges: []
    servicePort: 80
    sessionAffinity: None
    type: NodePort
 
    ## If using a statefulSet (statefulSet.enabled=true), configure the
    ## service to connect to a specific replica to have a consistent view
    ## of the data.
    statefulsetReplica:
      enabled: false
      replica: 0
  retention: "7d"
 
pushgateway:
  ## If false, pushgateway will not be installed
  ##
  enabled: false
 
## alertmanager ConfigMap entries
##
alertmanagerFiles:
  alertmanager.yml:
    global: {}
      # slack_api_url: ''
 
    receivers:
      - name: default-receiver
        # slack_configs:
        #  - channel: '@you'
        #    send_resolved: true
 
    route:
      group_wait: 10s
      group_interval: 5m
      receiver: default-receiver
      repeat_interval: 3h
 
## Prometheus server ConfigMap entries for rule files (allow prometheus labels interpolation)
ruleFiles: {}
 
## Prometheus server ConfigMap entries
##
serverFiles:
 
  ## Alerts configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
  alerting_rules.yml: {}
 
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use alerting_rules.yml
  alerts: {}
 
  ## Records configuration
  ## Ref: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/
  recording_rules.yml: {}
  ## DEPRECATED DEFAULT VALUE, unless explicitly naming your files, please use recording_rules.yml
  rules: {}
 
  prometheus.yml:
    rule_files:
      - /etc/config/recording_rules.yml
      - /etc/config/alerting_rules.yml
    ## Below two files are DEPRECATED will be removed from this default values file
      - /etc/config/rules
      - /etc/config/alerts
 
    scrape_configs:
      - job_name: prometheus
        metrics_path: cne-23-1-rc2/prometheus/metrics
        static_configs:
          - targets:
            - localhost:9090
 
extraScrapeConfigs: |
 
  - job_name: 'oracle-cnc-service'
    kubernetes_sd_configs:
      - role: service
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_service_name
 
  - job_name: 'oracle-cnc-pod'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - ocudr-ns
          - dbtier-ns
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
 
  - job_name: 'oracle-cnc-endpoints'
    kubernetes_sd_configs:
      - role: endpoints
        #namespaces:
        #  names:
        #  - ns1
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_oracle_com_cnc]
        regex: true
        action: keep
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
 
  - job_name: 'oracle-cnc-ingress'
    kubernetes_sd_configs:
      - role: ingress
        #namespaces:
        #  names:
        #  - ns1
        #  - ns2
    relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_oracle_com_cnc]
        regex: true
        action: keep
4.3.2.12 Debugging ndbmysqld Pods Restart During cnDBTier Installation or Upgrade

During cnDBTier installation or upgrade, the readiness probe fails as the ndbmysqld pods wait for the data nodes to be up and running. This causes the ndbmysqld pods to restart with the "Reason: Error" and "Exit Code: 1" error. If the data nodes take time to come up for any reason, such as slowness of the cluster, the ndbmysqld pods restart. The ndbmysqld pods stabilize when the data nodes come up.

4.3.2.13 Debugging Error Logging Related Issues
If there are issues related to Error Logging, then perform the following steps:
  • You must set the additionalErrorLogging parameter to ENABLED for each microservice for the Error Logging feature to work. This feature is "DISABLED" by default and can be ENABLED or DISABLED using REST APIs, CNC Console, or by changing the values in the custom-values.yaml file during installation.
  • For logging subscriber information in the logs, you must set the logSubscriberInfo parameter to "ENABLED" for each microservice. The parameter can be ENABLED or DISABLED using REST APIs, CNC Console, or by changing the values in the custom-values.yaml file during installation.

4.3.2.14 Debugging Suppress Notification Related Issues
If there are issues related to Suppress Notification, then perform the following steps:
  • You must set the suppressNotificationEnabled parameter to true in the global section of the custom-values.yaml file for the Suppress Notification feature to work. This feature is enabled by default and can be enabled or disabled using REST APIs, CNC Console, or by changing the values in the custom-values.yaml file during installation.
  • If you observe unexpected notifications, then check whether the feature is enabled in the global configuration using the configuration REST APIs.
  • If the feature is enabled and you observe unexpected notifications for update requests, then compare the User-Agent received in the request header with the User-Agent received in the subscription request.
  • This feature is applicable only to signaling requests. For provisioning requests, the notification generation behavior remains the same as earlier.
  • This feature does not work with subscriptions created in previous release versions. You must create new subscriptions with the feature enabled for the Suppress Notification feature to work.
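
The User-Agent comparison in the third bullet can be reproduced manually once you have the two header values from the logs. A minimal sketch with illustrative values:

```shell
# Hedged sketch: the suppression decision compares the User-Agent of the
# update request with the one stored from the subscription. Values below
# are illustrative.
SUBSCRIPTION_UA="UDM-abc123"
UPDATE_UA="UDM-abc123"
if [ "$SUBSCRIPTION_UA" = "$UPDATE_UA" ]; then
  echo "same consumer: notification suppressed"
else
  echo "different consumer: notification sent"
fi
```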

4.3.2.15 Debugging Diameter S13 Interface Related Issues
If there are issues related to Diameter S13 Interface, then perform the following steps:
  • If the global.s13InterfaceEnable flag is set to true and the Helm installation throws errors, you must enable the following parameters in the ocudr-custom-values.yaml file:
    • global.diamGatewayEnable
    • nudr-diameterproxy.enabled
  • If the NRF client heartbeats do not consider the diameter services when the diameter S13 interface is enabled, you must check the following configuration in the ocudr-custom-values.yaml file.

    Note:

    The NRF client does not register the EIR with NRF if the diameter services are down.
    #eir mode
    eir: &eir
      - '{{ .Release.Name }}-nudr-drservice'
      - '{{ .Release.Name }}-egressgateway'
      - '{{ .Release.Name }}-ingressgateway-sig'
      - '{{ .Release.Name }}-ingressgateway-prov'
      - '{{ .Release.Name }}-nudr-config' #uncomment only if config service enabled
      - '{{ .Release.Name }}-nudr-config-server' #uncomment only if config service enabled
      - '{{ .Release.Name }}-alternate-route' #uncomment if alternate route enabled
      - '{{ .Release.Name }}-nudr-dr-provservice' # uncomment only if drProvisioningEnabled is enabled
      - '{{ .Release.Name }}-nudr-diameterproxy' # uncomment only if s13InterfaceEnable is enabled
      - '{{ .Release.Name }}-nudr-diam-gateway' # uncomment only if s13InterfaceEnable is enabled
  • If the diameter gateway answers the CEA message with DIAMETER_UNKNOWN_PEER, the client peer configuration is incorrect. You must configure the allowedClientNodes section of the diameter gateway service configuration using the REST API for the client to connect to EIR and send an ECR request.
  • If the diameter gateway answers the CEA message with success and other diameter messages are answered with DIAMETER_UNABLE_TO_COMPLY or DIAMETER_MISSING_AVP, the issue could be in the diameter message request.
  • If there are error logs in the diameter gateway microservice stating that the connection is refused with IP and port numbers, the specified configured peer node was not able to accept the CER request from the diameter gateway. The diameter gateway retries multiple times to connect to that peer.
  • If you get the DIAMETER_UNABLE_TO_DELIVER error message, the diameterproxy microservice is down.
  • If the diam-gateway pod goes into the CrashLoopBackOff state, it could be due to an incorrect peer node configuration.
  • Active connections to the existing peer nodes can be verified using the ocudr_diam_conn_network metric.
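
The metric check in the last bullet can be done by filtering a metrics scrape for ocudr_diam_conn_network. Shown against sample scrape output with an illustrative label set; against a live pod, curl the diam-gateway metrics endpoint instead:

```shell
# Hedged sketch: list active diameter peer connections from Prometheus
# scrape output. The label set in the sample is illustrative.
active_conns() {
  grep '^ocudr_diam_conn_network'
}
active_conns <<'EOF'
ocudr_diam_conn_network{peer="mme1"} 1
some_other_metric 42
EOF
```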
4.3.2.16 Debugging TLS Related Issues

This section describes the TLS related issues and their resolution steps. It is recommended to attempt the resolution steps provided in this guide before contacting Oracle Support.

Problem: Handshake is not established between UDRs.

Scenario: When the client version is TLSv1.2 and the server version is TLSv1.3.

Server Error Message

The client supported protocol versions [TLSv1.2] are not accepted by server preferences [TLSv1.3]

Client Error Message

Received fatal alert: protocol_version

Scenario: When the client version is TLSv1.3 and the server version is TLSv1.2.

Server Error Message

The client supported protocol versions [TLSv1.3] are not accepted by server preferences [TLSv1.2]

Client Error Message

Received fatal alert: protocol_version

Solution:

If the error logs have the SSL exception, do the following:

Check the TLS version of both UDRs. If the UDRs support different, single TLS versions (that is, UDR 1 supports only TLS 1.2 and UDR 2 supports only TLS 1.3, or vice versa), the handshake fails. Ensure that the TLS version is the same for both UDRs, or revert to the default configuration for both UDRs. The supported TLS version combinations are:

Table 4-1 TLS Version Used

Client TLS Version | Server TLS Version | TLS Version Used
TLSv1.2, TLSv1.3   | TLSv1.2, TLSv1.3   | TLSv1.3
TLSv1.3            | TLSv1.3            | TLSv1.3
TLSv1.3            | TLSv1.2, TLSv1.3   | TLSv1.3
TLSv1.2, TLSv1.3   | TLSv1.3            | TLSv1.3
TLSv1.2            | TLSv1.2, TLSv1.3   | TLSv1.2
TLSv1.2, TLSv1.3   | TLSv1.2            | TLSv1.2
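
The selection rule behind Table 4-1 is simply the highest TLS version common to both sides; no common version means the handshake fails with protocol_version. A sketch of that logic:

```shell
# Hedged sketch of the version selection in Table 4-1.
# $1 = client versions, $2 = server versions (comma-separated lists).
negotiated_tls() {
  c13=""; s13=""; c12=""; s12=""
  case ",$1," in *",TLSv1.3,"*) c13=1;; esac
  case ",$2," in *",TLSv1.3,"*) s13=1;; esac
  case ",$1," in *",TLSv1.2,"*) c12=1;; esac
  case ",$2," in *",TLSv1.2,"*) s12=1;; esac
  if [ -n "$c13" ] && [ -n "$s13" ]; then echo "TLSv1.3"
  elif [ -n "$c12" ] && [ -n "$s12" ]; then echo "TLSv1.2"
  else echo "handshake fails: protocol_version"
  fi
}
negotiated_tls "TLSv1.2,TLSv1.3" "TLSv1.2"   # prints TLSv1.2
negotiated_tls "TLSv1.2" "TLSv1.3"           # prints handshake fails: protocol_version
```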

Check the cipher suites supported by both UDRs; they should either be the same or have common cipher suites present. If not, revert to the default configuration.

Problem: Pods are not coming up after populating the clientDisabledExtension or serverDisabledExtension Helm parameter.

Solution:

  • Check the value of the clientDisabledExtension or serverDisabledExtension parameters. The following extensions should not be present for these parameters:
    • supported_versions
    • key_share
    • supported_groups
    • signature_algorithms
    • pre_shared_key

If any of the above values is present, remove them or revert to default configuration for the pod to come up.

Problem: Pods are not coming up after populating the clientSignatureSchemes Helm parameter.

Solution:

  • Check the value of the clientSignatureSchemes parameter.
  • The following values should be present for this parameter:
    • rsa_pkcs1_sha512
    • rsa_pkcs1_sha384
    • rsa_pkcs1_sha256
    If any of the above values is not present, add them or revert to default configuration for the pod to come up.
4.3.2.17 Debugging Dual Stack Related Issues
With this feature, cnUDR can be deployed on a dual stack Kubernetes infrastructure. Using the dual stack mechanism, cnUDR establishes and accepts connections among the pods and services in a Kubernetes cluster using IPv4 or IPv6. You can configure the feature by setting the global.deploymentMode parameter, which indicates the deployment mode of the cnUDR, in the global section of the ocudr-custom-values.yaml file. The default value is ClusterPreferred, and the value can be changed in the ocudr-custom-values.yaml file during installation. The Helm configuration is as follows:
  • To use this feature, cnUDR must be deployed on a dual stack Kubernetes infrastructure either in IPv4 preferred CNE or IPv6 preferred CNE.
  • If the global.deploymentMode parameter is set to 'IPv6_IPv4', then, when all the pods are running, the services such as ingressgateway-prov and ingressgateway-sig must have both IPv6 and IPv4 addresses assigned. The default address must be IPv6. The IP family policy must be set to RequireDualStack. The load balancer assigned must have both IPv4 and IPv6 addresses.
  • All internal services must be single stack with only IPv6 addresses and a single-stack IP family policy. All the pods must have both IPv4 and IPv6 addresses.
  • This feature does not work after upgrade since the upgrade path is not identified for this feature. The operators must perform a fresh installation of the NF to enable the Dual Stack functionality.
#Possible values : IPv4, IPv6, IPv4_IPv6, IPv6_IPv4, ClusterPreferred
global:
  deploymentMode: ClusterPreferred
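
After an 'IPv6_IPv4' deployment, you can confirm the IP family policy of the ingress gateway services. The hedged sketch below filters sample service output; against a live cluster, pipe in "kubectl get svc <release>-ingressgateway-sig -n <namespace> -o yaml" instead:

```shell
# Hedged sketch: extract ipFamilyPolicy from service YAML output.
check_policy() {
  grep 'ipFamilyPolicy' | awk '{print $2}'
}
# Sample output for a dual stack service; expect RequireDualStack.
check_policy <<'EOF'
  ipFamilies:
  - IPv6
  - IPv4
  ipFamilyPolicy: RequireDualStack
EOF
```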
4.3.2.18 Debugging Lifecycle Management (LCM) Based Automation Issues
Perform the following steps if there are issues related to Lifecycle Management (LCM) Based Automation feature:
  • Make sure that the autoCreateResources.enabled and autoCreateResources.serviceaccounts.enabled flags are enabled.
  • During upgrade, if a new service account name is provided in the serviceAccountName parameter with the autoCreateResources.enabled and autoCreateResources.serviceaccounts.enabled flags enabled, then a new service account is created. If the service account is not created, check the configuration and the flags again.
  • During upgrade, you must use a different service account name when upgrading from manual to automation. If you use the same service account name, Helm does not allow the upgrade due to ownership issues.
  • To use the OSO alerts automation feature, follow the installation steps of the oso-alr-config Helm chart and provide the alert file that needs to be applied to the namespace. For more information, see Oracle Communications Cloud Native Core, Operations Services Overlay Installation and Upgrade Guide. The alert file can be applied to the namespace during the Helm installation or the upgrade procedure; if it is not provided during installation, you can provide it during the upgrade.
  • If incorrect data is added to the alert file, you can clear the entire data in the alert file by providing the empty alert file (ocslf_alertrules_empty_<version>.yaml). For more information, see the "OSO Alerts Automation" section in Oracle Communications Cloud Native Core, Unified Data Repository User Guide.
  • If you provide the service account name during the upgrade but the feature is disabled, the nudr-pre-upgrade hook fails because it cannot find the service account. If the upgrade fails, the rollback to the previous version is also unsuccessful because of the missing alternate route service account, resulting in an error message indicating that the service account for the alternate route service is not found. To address this issue, manually create the service account after the initial upgrade failure, and then continue with the upgrade. This also ensures a successful rollback.
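If the service account must be created manually after the initial upgrade failure, a minimal manifest such as the following can be applied with kubectl apply -f (the name and namespace are placeholders; use the values that your deployment expects):

```yaml
# Minimal ServiceAccount sketch; name and namespace are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: <alternate-route-service-account-name>
  namespace: <ocudr-namespace>
```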

4.3.3 Debugging Post Installation Related Issues

This section describes how to troubleshoot the post installation related issues.

4.3.3.1 Debugging Helm Test Issues
To debug the Helm Test issues:
  • Run the following command to get the Helm Test pod name.

    kubectl get pods -n <deployment-namespace>

  • Check for the Helm Test pod that is in error state.

    Figure 4-15 Helm Test Pod

    Helm Test Pod
  • Run the following command to check the Helm Test pod:

    kubectl logs <helm_test_pod_name> -n <deployment_namespace>

    In the logs, concentrate on ERROR and WARN level logs. There can be multiple reasons for failure. Some of them are shown below:
    • Figure 4-16 Helm Test in Pending State

      Helm Test in Pending State

      In this case, check for CPU and Memory availability in the Kubernetes cluster.

    • Figure 4-17 Pod Readiness Failed

      Pod Readiness Failed

      In this case, check the correctness of the readiness probe URL in the Helm charts of the particular microservice under the charts folder. In the above case, check the charts of the notify service or, if the URL configured for the readiness probe is correct, check whether the pod is crashing for some other reason.

    • There are a few other cases where the httpGet parameter is not configured for the readiness probe. In this case, the Helm Test is considered a success for that pod. Similarly, if the Pod or PVC list is fetched based on namespace and the labelSelector is empty, the Helm Test is considered a success.

The Helm test logs generate the following error:

Figure 4-18 Helm Test Log Error


Helm Test Log Error

  • Check whether the required permission for the resource of the group is missing in the deployment-rbac.yaml file. The above sample shows that the permissions are missing for the persistent volume claims.
  • Give the appropriate permissions and redeploy.

    Check if the following error appears while running the helm test:

    14:35:57.732 [main] WARN  org.springframework.context.annotation.AnnotationConfigApplicationContext - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'k8SFabricClient': Invocation of init method failed; nested exception is java.lang.ArrayIndexOutOfBoundsException: Index 1 out of bounds for length 1
    { {} }
  • Check the custom values file that is used to create the deployment. The resources should be mentioned in the form of an array under the resources section in the following format: <k8ResourceName>/<maxNFVersion>.
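As an illustration of the expected format (the resource names below are hypothetical examples, not a required list), the resources section could look like:

```yaml
# Hypothetical sketch: each entry follows <k8ResourceName>/<maxNFVersion>.
# The entries shown are examples only; use the resources your deployment needs.
resources:
  - deployments/v1
  - persistentvolumeclaims/v1
```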
4.3.3.2 Debugging Horizontal Pod Autoscaler Issues

There can be scenarios where the Horizontal Pod Autoscaler (HPA) running on the nudr-drservice and nudr-notify-service deployments might not get the CPU metrics successfully from the pods. Run the following command to view the HPA details:

kubectl get hpa

In this scenario, you need to check the following:
  • Check whether the metrics server is running on the Kubernetes cluster. If the server is running and the pod CPU usage is still not accessible, check the metrics-server values.yaml file for the arguments passed, shown as follows:

    Figure 4-19 metrics-server yaml file

    metrics-server yaml file
  • If any changes are required, make them, restart the metrics server pod, and check for correctness. Wait a couple of minutes after the metrics server starts to see the CPU usage update when running the kubectl get hpa command.

    Figure 4-20 Debugging HPA Issues

    Debugging HPA Issues
4.3.3.3 Debugging HTTPS Support Related Issues
UDR supports HTTPS and its validations at the UDR Ingress Gateway. You may encounter issues related to HTTPS when:
  • HTTPS port is not exposed: Run the following command to verify if the HTTPS port is exposed:

    kubectl get svc -n <ocudr-namespace>

    Figure 4-21 HTTPS Port Exposed

    HTTPS Port Exposed

    Note:

    In the above figure, the secure port is 443.
    If the HTTPS port is not exposed, then enable the configuration information highlighted in the following figure under the ingressgateway section of the values.yaml file.

    Figure 4-22 Configuration Info under Ingressgateway

    Configuration Info under Ingressgateway
  • Ingress Gateway Container is stuck in Init State/Failed: The Ingress Gateway container may stop responding due to any one of the following reasons:
    • When config initssl is enabled under ingressgateway section of the values.yaml file.

      Figure 4-23 config initssl

      config initssl
    • If config initssl is enabled, then check whether secrets are created with all required certificates. The following figure shows the commands that you need to run to check whether secrets are present and have all the required data.

      Figure 4-24 Commands to Check Secrets

      Commands to Check Secrets
  • Config-Server Container not responding in Hooks Init State: UDR does not respond in the Hooks Init state when there is a database connection failure.

    Figure 4-25 Config Server Container Status

    Config Server Container Status

    In this case, run the describe pod command (on the above pod). In most cases, it is due to secrets not being found.

    Also, verify the configuration given below to ensure config-server deployment refers to the correct secret values.
    global:
    
      dbCredSecretName: 'ocudr-secrets'
  • Config-Server post upgrade hooks fail with the following error:

    When more than one UDR is installed with the same nfInstanceId and in the same udrDB, the installation itself does not report any issue or error. However, the second UDR has no config-server related tables in the udrConfigDB, so when an upgrade is performed on the second UDR setup, the following error occurs.

    Figure 4-26 Config-Server Post Upgrade Hooks Error

    Config-Server Post Upgrade Hooks error
  • OAuth2 Related Issues:
    If the OAuth secret name and namespace are not specified correctly, or if the public key certificate in the secret is not in the correct format, then the Ingress Gateway crashes.

    Figure 4-27 Ingress Gateway Crashed

    Ingress Gateway Crashed
    Other scenarios are:
    • The secret name in which the public key certificate is stored is incorrect: In this scenario, check the pod logs for the message "cannot retrieve secret from api server".
    • The public key certificate stored in secret is not in proper format: The public key format is {nrfInstanceId}_RS256.crt (6faf1bbc-6e4a-4454-a507-a14ef8e1bc5c_RS256.crt).

      If the public key is not stored in this format, check the pod logs for the message "Malformed entry in NRF PublicKey Secret with key ecdsa.crt". Here, ecdsa.crt is the public key certificate in oauthsecret.

    To resolve these issues, use the public key certificate in the required format, and correct the fields with the proper secret name and namespace.

4.3.3.4 Debugging PodDisruptionBudget Related Issues
A pod can face voluntary or involuntary disruptions at any given time. Voluntary disruptions are initiated either by the application owner or by the cluster administrator. Examples of voluntary disruptions are deleting the deployment or the controller that manages the pod, updating a deployment pod template, or accidentally deleting a pod. Involuntary disruptions are unavoidable and can be caused by one or more of the following:
  • Disappearance of a node from the cluster due to cluster network partition
  • Accidentally deleting a virtual machine instance
  • Eviction of a pod when a node runs out of resources
To handle a voluntary disruption, you can set the PodDisruptionBudget value to determine the number of replicas of the application that must be running at any given time. To configure the PodDisruptionBudget:
  1. Run the following command to check the pods running on different nodes:

    kubectl get pods -o wide -n ocudr

    Figure 4-28 Pods Running on Different Nodes

    Pods Running on Different Nodes
  2. Run the following set of commands to unschedule the node:
    kubectl cordon cne-180-dev2-k8s-node-4
    kubectl cordon cne-180-dev2-k8s-node-7
    kubectl cordon cne-180-dev2-k8s-node-5
    kubectl cordon cne-180-dev2-k8s-node-8
    kubectl cordon cne-180-dev2-k8s-node-1
    After unscheduling the nodes, the state of the nodes changes to 'Ready,SchedulingDisabled' as follows:

    Figure 4-29 After Nodes are Unscheduled

    After Nodes are Unscheduled
  3. Run the following set of commands to drain the nodes:
    kubectl drain cne-180-dev2-k8s-node-1 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-8 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-5 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-4 --ignore-daemonsets --delete-local-data
    kubectl drain cne-180-dev2-k8s-node-7 --ignore-daemonsets --delete-local-data
  4. If you are required to drain the nodes or evict the pods, ensure that the minimum number of pods is in the ready state to serve the application requests. To configure the minimum number of pods, set the minAvailable parameter in the Helm charts of the individual microservices. This ensures that a minimum number of pods remains available and is not evicted. You can check the logs while draining the nodes as follows:

    Figure 4-30 Logs When Trying to Evict Pod

    Logs When Trying to Evict Pod
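The minAvailable setting described above maps to a standard Kubernetes PodDisruptionBudget object. A generic sketch follows; the object name and selector labels are illustrative, and in UDR the value is set through the per-microservice Helm charts rather than a hand-written manifest:

```yaml
# Generic PodDisruptionBudget sketch; name and selector labels are illustrative.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nudr-drservice-pdb
spec:
  minAvailable: 2          # keep at least 2 replicas up during voluntary disruptions
  selector:
    matchLabels:
      app: nudr-drservice  # illustrative label; match your deployment's pod labels
```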
4.3.3.5 Debugging Pod Eviction Issues
During heavy traffic, any UDR or Provisioning Gateway pod can run into the Evicted state. To handle eviction issues, increase the ephemeral storage allocation of the pods under the global section: update the containerLogStorage configuration under the global.ephemeralStorage.limits section to '5000'.

Figure 4-31 Configuring Container Log Storage

Configuring Container Log Storage
After making the above changes, perform a Helm upgrade. If the Ingress Gateway or Egress Gateway pods are running into the Evicted state, update the ephemeralStorageLimit configuration to '5120' and perform a Helm upgrade.

Figure 4-32 Ingress Gateway or Egress Gateway - Evicted State

Ingress Gateway or Egress Gateway - Evicted State
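Assuming the parameter placement described above (the exact nesting may differ by chart version; verify it against Figures 4-31 and 4-32), the updated values could look like:

```yaml
# Sketch only: ephemeral storage limits as described above.
# The exact nesting under global is an assumption; verify in your values file.
global:
  ephemeralStorage:
    limits:
      containerLogStorage: 5000   # MB, for UDR or Provisioning Gateway pods
# For Ingress or Egress Gateway pods in the Evicted state, set
# ephemeralStorageLimit: 5120 in the corresponding gateway section.
```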
4.3.3.6 Debugging Taints or Tolerations Misconfigurations

The following points should be considered when the Node Selector and Taints or Toleration feature is used on UDR or Provisioning Gateway:

  • If any of the pods are going to the Pending state, ensure that the node selector points to the correct worker node name, and that the node has enough space to place the pod.

    Figure 4-33 Global Configuration


    Global Configuration

  • Use the following configuration to set the tolerations for the tainted nodes. Update the global section configurations for the settings to apply to all the services.

    Figure 4-34 Global Configuration


    Global Configuration

  • For Node Selector and Tolerations, configuration at the microservice level takes priority over configuration at the global level.

    Figure 4-35 Toleration and Node Selector


    Toleration and Node Selector
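Standard Kubernetes toleration and node selector settings have the following shape; the taint keys, values, and node labels are placeholders, and their exact placement in the UDR charts should be checked against Figures 4-33 to 4-35:

```yaml
# Generic sketch; taint key/value and node label are placeholders.
global:
  tolerations:
    - key: "<taint-key>"
      operator: "Equal"
      value: "<taint-value>"
      effect: "NoSchedule"
  nodeSelector:
    <node-label-key>: <node-label-value>
```

The same keys at the microservice level override these global settings, per the priority rule above.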

4.3.3.7 Debugging UDR Registration with NRF Failure
UDR registration with NRF may fail due to various reasons. Some of the possible scenarios are as follows:
  • Confirm whether registration was successful from the nrf-client-service pod.
  • Check the ocudr-nudr-nrf-client-nfmanagement logs. If the log contains "UDR is Deregistration", then:
    • Check that all the services mentioned under allorudr/slf (depending on the UDR mode) in the values.yaml file have the same spelling as the service names and are enabled.
    • Once all the services are up, UDR must register with NRF.
  • If you see a log for SERVICE_UNAVAILABLE(503), check whether the primary and secondary NRF configurations (primaryNrfApiRoot/secondaryNrfApiRoot) are correct and whether the NRFs are up and running.
4.3.3.8 Debugging User Agent Header Related Issues
If there are issues related to user agent header, then perform the following steps:
  • Under the ingressgateway section, set the user-agent flag to true.
  • If there are issues in the consumer NFType validations for NFs, check the NF types in the configurations present under the ingressgateway section.

    Figure 4-36 Enabling User Agent Header

    Enabling User Agent Header
  • If userAgentHeaderValidationConfgMode is set to REST, the custom-values.yaml configurations are ignored; the configuration is loaded based on the mode to which userAgentHeaderValidationConfgMode is set.

Note:

From the UDR 24.1.0 release onwards, this feature is supported using the REST API mode. If the feature does not work, ensure that the feature is enabled and configured after UDR is upgraded. For more information, see the Postupgrade Tasks section in Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.

4.3.3.9 Debugging LCI and OCI Header Related Issues

If there are issues related to the LCI and OCI Header feature, perform the following:

Under the ingressgateway-sig section, set the lciHeaderConfig.enabled and ociHeaderConfig.enabled parameters to true.

Note:

  • Configure the header names to be the same as in the default configuration.
  • Wait up to the validity period for the LCI and OCI headers to be reported in the next response.
  • For OCI, check the overloadconfigrange and reduction metrics, based on which OCI is reported.
4.3.3.10 Debugging Conflict Resolution Feature
Perform the following steps if the conflict resolution feature does not work:
  • If the exception tables or UPDATE_TIME column are missing from the UDR subscriber database, perform the following steps:
    • Ensure that the SQL commands from the SQL files are run on the database ndbappsql node.

      Note:

      The following SQL files are available in the Custom_Templates file:
      • ALL_mode_ndb_replication_insert.sql
      • SLF_mode_ndb_replication_insert.sql
      • EIR_mode_ndb_replication_insert.sql
      • ALL_mode_ndb_replication_insert_UPGRADE.sql
      • SLF_mode_ndb_replication_insert_UPGRADE.sql
      • EIR_mode_ndb_replication_insert_UPGRADE.sql
      For more information on how to download the package, see Downloading Unified Data Repository Package section in the Oracle Communications Cloud Native Core, Unified Data Repository Installation, Upgrade, and Fault Recovery Guide.
    • After setting the global.dbConflictResolutionEnabled parameter to true in the ocudr_custom_values.yaml file, if the UPDATE_TIME column is updated as 0, run the REST APIs to fetch the global configurations and check that global.dbConflictResolutionEnabled is set to true. If the parameter is not set to true, perform a PUT operation for the global configuration update to set it to true.
    • If the nudr_dbcr_auditor service is not enabled, enable the global.dbConflictResolutionEnabled parameter and perform a Helm upgrade.
    • If the nudr_dbcr_auditor service is not clearing the exception tables or fixing data conflicts, ensure that the database replication is running.
    • If the nudr_dbcr_auditor service is not clearing exceptions on the IDXTODATA$EX exception tables, check whether the error log "Communication failed to Mate site during audit" is present. If you see this error, verify that the following configuration in the custom values file is correct.
      # Provide MateSite IGW IP List, Comma separated values. Provide fqdn or IP with port
      mateSitesIgwIPList: 'http://ocudr-ingressgateway-prov.myudr2:80,http://ocudr-ingressgateway-prov.myudr3:80'
4.3.3.11 Debugging UDR Error Responses Using App Error Code
From UDR release 24.1.0 onwards, additional information is provided in the ProblemDetails.detail parameter as part of the error responses feature. This is applicable to all UDR mode deployments, which include signaling and provisioning error responses from UDR.

Note:

The responses from the nudr-config service for the REST API configurations remain the same.

A sample ProblemDetails.detail is as follows:

<nfFqdn>: <service-name>: <Readable Error details>: <App-Error-Code>, for example:

slf01.abc.com: Nudr_GroupIDmap: Request header Issue, Unsupported Media Type: OSLF-DRS-HDRVLD-E001

Table 4-2 Parameters of the Details Field of the Payload

Parameter Name Description
nfFqdn

Indicates the NF FQDN. It is obtained from the nfFqdn Helm Chart parameter.

Sample Value: slf01.abc.com

service-name

Indicates the microservice name. It is the originator of the error response. This value is static and cannot be configured.

Sample Value: Nudr_GroupIDmap

Readable Error details

Provides a short description of the error.

Sample Value: Group Not Found

App-Error-Code

Indicates the microservice ID and the error ID in the format <nftype>-<serviceId>-<category>-E<XXX>.

Sample Value: OSLF-DRS-SIG-E302, where,
  • OSLF is the vendor NF
  • DRS is the microservice ID
  • SIG is the category
  • E302 is the app error code
nftype

Indicates the vendor or NF type. This parameter is prefixed with “O”, which indicates Oracle. For example, if the NF type is SLF, the vendor name becomes OSLF. It is obtained from the nfType Helm Chart parameter.

Sample Value: OSLF

serviceId

Indicates the service ID. It is either DRS (nudr-drservice) or DRP (nudr-dr-provservice). This value is set based on the container name.

Category

Indicates the category, which is fetched from the error catalog. Errors are classified into categories based on the serviceId. Following is the list of categories:
  • SIG
  • PROV
  • URIVLD
  • HDRVLD
  • REQVLD
  • DB
  • INTRNL
4.3.3.12 Debugging Provisioning Logs Related Issues
If there are issues related to provisioning log, then perform the following steps:
  • You can enable or disable the provisioning log feature by setting the provLogsEnabled parameter flag to true using REST APIs, CNC Console, or by changing the values in the custom.yaml file. By default, the provisioning logging feature is disabled.
  • You can set the provisioning API names that are supported for provision logging by changing the provLogsApiNames configuration field to the required value. The default value is nudr-dr-prov. The accepted values are as follows:
    • nudr-dr-prov
    • nudr-group-id-map-prov
    • slf-group-prov
    • n5g-eir-prov
  • If the provLogsEnabled flag is set to true, it is recommended to change the value of logStorage to 4000 MB (approximately 4 GB) for the nudr-dr-prov pods to store the provisioning logging files. If the value is not updated, the nudr-dr-prov pods crash when the ephemeral storage is full.
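A sketch of the storage change for the nudr-dr-prov pods follows; the exact parameter path is an assumption, so confirm it in your custom values file:

```yaml
# Sketch only: raise log storage for nudr-dr-prov to about 4 GB.
# The nesting shown is an assumption; verify against your values file.
nudr-dr-prov:
  logStorage: 4000   # MB
```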

4.3.4 Debugging Upgrade or Rollback Failure

When Unified Data Repository (UDR) upgrade or rollback fails, perform the following steps:

  1. Run the following command to check the pre or post upgrade or rollback hook logs:
    kubectl logs <pod name> -n <namespace>
  2. After detecting the cause of failure, do the following:
    • For upgrade failure:
      • If the cause of upgrade failure is database or network connectivity issue, then resolve the issue and rerun the upgrade command.
      • If the cause of failure is not related to database or network connectivity issue and is observed during the preupgrade phase, then do not perform rollback because UDR deployment remains in the source or previous release.
      • If the upgrade failure occurs during the postupgrade phase (for example, a post upgrade hook failure because the target release pod does not move to the ready state), then perform a rollback.
    • For rollback failure: If the cause of rollback failure is database or network connectivity issue, then resolve the issue and rerun the rollback command.
  3. If the issue persists, contact My Oracle Support.

4.4 Service Related Issues

This section describes the most common service related issues and their resolution steps.

4.4.1 Resolving Microservices related Issues through Metrics and ConfigDB

This section describes how to troubleshoot issues related to UDR microservices using metrics.

nudr-drservice

If requests for nudr-drservice fail, then try to find the root cause from metrics using following guidelines:

  • If the count of measurement “udr_schema_operations_failure_total” is increasing, check the content of the incoming request and make sure that the incoming json data blob is proper and as per the specification.
  • If “udr_db_operations_failure_total” measurements are increasing,
    • Make sure that connectivity is proper between microservices and MySQL DB nodes.
    • Make sure that you are not trying to insert duplicate keys.
    • Make sure that DB nodes have enough resources available.

nudr-dr-provservice

If requests for nudr-dr-provservice fails, then try to find the root cause from metrics using following guidelines:

  • If the count “udr_schema_operations_failure_total” measurement is increasing, check the content of incoming request and ensure the incoming JSON data blob is proper and as per the specification.
  • If “udr_db_operations_failure_total” measurements are increasing, then ensure:
    • there is connectivity between microservices and MySQL DB nodes.
    • you are not trying to insert duplicate keys.
    • database nodes have enough resources available.

nudr-nrf-nfmanagement

If requests for nudr-nrf-nfmanagement fail, then try to find the root cause from metrics using following guidelines:

  • Check for current health status of NRF using the nrfclient_nrf_operative_status metric. If it is 0, it is UNHEALTHY or UNAVAILABLE.
  • Check the current NF status using the nrfclient_nrf_operative_status metric, and the NF status with NRF using the nrfclient_nf_status_with_nrf metric.
  • If NF status is 0, then check the appinfo_service_running metric for various services configured in the app-info section depending on the UDR mode.

nudr-nrf-client-service

If requests for nudr-nrf-client-service fail, then try to find the root cause from metrics using following guidelines:
  • Check the current health status of NRF using "nrfclient_nrf_operative_status" metric. If it is 0, then it is UNHEALTHY or UNAVAILABLE.
  • Check the current network function status using "nrfclient_nrf_operative_status" metric and network function status with NRF using "nrfclient_nf_status_with_nrf" metric.
  • If network function status is 0, check "appinfo_service_running" metric for various services configured in the app info section depending on UDR mode.

ocudr-nudr-notify-service

If requests for ocudr-nudr-notify-service fail, then try to find the root cause from metrics using following guidelines:

  • Measurements such as “nudr_notif_notifications_ack_2xx_total”, “nudr_notif_notifications_ack_4xx_total”, and “nudr_notif_notifications_ack_5xx_total” give information about the response codes returned in the notification responses.
  • If count of “nudr_notif_notifications_send_fail_total” measurement is increasing, make sure that notification server mentioned in NOTIFICATION_URI during subscription request, which is expected to receive the notifications, is up and running.
  • The default retry count for failed notifications is two, and it is configurable using the retrycount parameter in the custom values yaml file. Perform the following steps if an alert is raised for exceeding the notifications table limit threshold:
    • Log in to the MySQL database terminal to check the number of records in the NOTIFICATIONS table under the UDR subscriber database (select count(*) from NOTIFICATIONS).
    • Perform the following steps if the notification records count is consistently above 50k:
      • Check if there are more failures on the notification sent from notify service using nudr_notif_notifications_ack_4xx_total and nudr_notif_notifications_ack_5xx_total metrics.
      • Check the reason for the failure and resolve the failure.
      • If the failure is temporary and cannot be avoided, use the notifications configuration REST API or CNC Console to reduce the retrycount to 0 or 1. This ensures that the table size does not grow as fast.

ocudr-nudr-config

If requests for ocudr-nudr-config fail, try to find the root cause from metrics using following guidelines:

  • Measurements like “nudr_config_total_requests_total{Method='GET'}”, “nudr_config_total_requests_total{Method='POST'}”, and “nudr_config_total_requests_total{Method='PUT'}” give information about the total requests pegged for the GET, POST, and PUT methods respectively.
  • If the count of the “nudr_config_total_responses_total{Method='GET/POST/PUT',StatusCode="400/404/405/500"}” measurement is increasing, it means the requests are not being processed, resulting in failures.

If requests for ocudr-nudr-config fail, try to find the root cause from configdb using following guidelines:

  • If you get a BAD REQUEST for the GET API, ensure that all the tables shown below are present in configdb.

    Figure 4-37 Configdb Table


    Configdb table

  • If all the tables are present and you still get a BAD REQUEST for the GET API, verify the configuration item table shown below.

    Figure 4-38 Configuration Item Table


    configuration_item table

  • If you get a BAD REQUEST or NOT FOUND for the Import and Export APIs, verify the import and export data table shown below.

    Figure 4-39 Import and Export Data


    Import and Export Data

ocudr-nudr-bulk-import

Following are some of the known errors that you can address if encountered.
  • If the bulk-import logs show "dr-service is down. Job cannot be executed", then check whether dr-service and Ingress Gateway are in the running state.
  • If the count of nudr_bulk_import_csvfile_records_read_total(Method="DELETE/PUT/POST", Status="Failure") metric is increasing, then it means the CSV file records are not valid. This can be resolved by providing correct keyType, KeyValue, operationType, nfType, and jsonPayload.
  • If the count of nudr_bulk_import_records_processed_total(Method = "POST/PUT/DELETE", StatusCode="201/204", Status="Success") is increasing, then it means the records are being processed by UDR correctly.
  • To find the number of requests processed successfully for PCF, measure the count of the Nudr_bulk_import_PCF_total{StatusCode="204/201", Status="Success"} metric.

For information about bulk import metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

ocudr-nudr-xmltocsv

After copying the ixml file using kubectl cp command, log into xmltocsv container and run the following command to check whether the file is copied or not:

kubectl exec -it <pod name> -c nudr-xmltocsv -n <namespace> bash
cd /home/udruser/xml

If the measurement count of the nudr_xmltocsv_xmlfile_records_read_total(Status="Failure") metric is increasing, the records in the ixml file are not valid. Ensure that the correct ixml file is provided.

If the measurement count of the nudr_xmltocsv_records_processed_total{Method = "POST/PUT/DELETE/PATCH", Status="Success"} metric is increasing, then it denotes that the records are processed successfully.

For information about xmltocsv metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

ocudr-nudr-diameterproxy

If diameterproxy restarts, then make sure the database configurations are correct. For information about ocudr-nudr-diameterproxy metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

diam-gateway

If the Diameter Gateway sends a CEA message with DIAMETER_UNKNOWN_PEER, it means the client peer configuration is not done correctly. Configure the allowedClientNodes section of the Diameter Gateway service using REST API.

If the Diameter Gateway sends a successful CEA message but responds to other Sh messages with DIAMETER_UNABLE_TO_COMPLY or DIAMETER_MISSING_AVP, the problem may lie in the requested Sh message.

If the Diameter Gateway error logs show errors like connection refused with some IP and port, then it means a specified peer node configured is not able to accept the CER request from the Diameter Gateway and Diameter Gateway retries to connect with that peer.

If you are getting the DIAMETER_UNABLE_TO_DELIVER error message, it means diameterproxy is turned off or not running. If the Diameter Gateway goes to the CrashLoopBackOff state, it means that an incorrect peer node is configured.

Use the ocudr_diam_conn_network metric to verify the active connections to the peer nodes.

For information about diam-gateway metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

nudr-migration

If a pod is in the Pending state, it means resources are not available in the CNE. If a pod is in the ImagePullBackOff state, it means the image cannot be fetched from the repository. Run the following command to check the details:

kubectl describe pod <pod-name> -n <namespace>

If the pod is in the running state and data migration has not happened, then:
  • Check the logs and search for ERROR entries.
  • Check whether either the source UDR or the target UDR is down. Verify the logs.
If you are not able to connect to 4G UDR, then:
  • Check logs for DIAMETER_UNABLE_TO_COMPLY in CER/CEA messages.
  • Check whether UDR/UDA messages are received from 4G UDR.
  • Check whether K8S_HOST_IP port is same as an external IP address of Kubernetes node that you gave in affinity. If they are different, then you get DIAMETER_UNABLE_TO_COMPLY in CEA response.

For information about nudr-migration metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

overload-manager

To troubleshoot errors related to overload-manager, consider the following points:

  • In the global section, if the overloadmanager flag is disabled, the overload manager REST APIs of the Ingress Gateway and the perf-info microservice are not loaded.
  • If the overload manager data is not present in the common_configuration table, ensure that the overloadmanager flag is enabled at the global level.
  • The svcName configured in the ocpolicymapping API must be taken from the routesConfig section. If the svcName configured in policymapping differs from the svcName configured in routesConfig, the overload manager does not trigger.
  • To check the load level of a specific metric, check the perf-info logs. The perf-info logs contain the load level of each metric.
  • If the alerts are not raised for the overload manager, ensure that the alerts are properly loaded in Prometheus.
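The svcName consistency check in the third bullet can be automated. A minimal sketch, assuming the compact JSON shapes below (the service name and exact payload structure are illustrative, not taken from a real deployment):

```shell
# Illustrative fragments: svcName as configured in the ocpolicymapping API and
# in the routesConfig section (real payloads contain more fields).
policymapping='{"svcName":"ocudr-ingressgateway"}'
routesconfig='{"svcName":"ocudr-ingressgateway"}'

pm=$(printf '%s' "$policymapping" | sed -n 's/.*"svcName":"\([^"]*\)".*/\1/p')
rc=$(printf '%s' "$routesconfig"  | sed -n 's/.*"svcName":"\([^"]*\)".*/\1/p')

if [ "$pm" = "$rc" ]; then
  echo "svcName matches: overload manager can trigger"
else
  echo "svcName mismatch: overload manager does not trigger"
fi
```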

On-demand migration Range Support

To troubleshoot errors related to on-demand migration range support, consider the following points:
  • By default, on-demand migration works for all key types and key values if there is no change in the configurations. Check the REST configuration of the global section for the key type and key range.
  • If on-demand migration does not trigger after the key type and key range are set through the global configuration API, perform the following step:
    • Check whether the key type and key range mentioned in the configuration API match the key type and key range used for the test. Valid keys are Mobile Station Integrated Services Digital Network (MSISDN) or International Mobile Subscriber Identity (IMSI).
  • If the on-demand migration range support feature is not used, you can set the default key type and key range from the global configuration API as follows:
    "keyType": "msisdn",
    "keyRange": "000000-000000"

For information about on-demand migration metrics, see Oracle Communications Cloud Native Core, Unified Data Repository Users Guide.

4.4.2 Debugging Errors from Egress Gateway

If the traffic is not routed through Egress Gateway, then check the following:
  • Check whether global.egress is enabled.
  • Check whether the Egress pod is running. To check, run the following command:

    kubectl get pods -n <Release.name>

  • To enable the outgoing traffic using HTTPS, set the enableOutgoingHttps parameter as 'true'.
  • Create unique certificates and keys for all Egress and respective Ingress NFs. This is the same as Ingress debugging.
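The first and third bullets map to Helm values. The fragment below is a sketch only: the global.egress flag and the enableOutgoingHttps parameter are named in this section, but their exact nesting can differ between releases, so verify the paths against your own custom values file:

```yaml
global:
  egress:
    enabled: true            # traffic is routed through Egress Gateway only when enabled
egress-gateway:
  enableOutgoingHttps: true  # required for outgoing HTTPS traffic
```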

Debugging Errors When SCP Integration is Enabled

UDR Egress Gateway route configurations are performed to route all the notifications through SCP, and the NRF traffic is sent directly to the NRF host. If the routing does not work, then configure the routes as follows:

Figure 4-40 Routes Config


Routes Config

Note:

The above configuration is present as part of default values.

If you want to send notifications through SCP, configure Egress Gateway as shown in the following image. If setId 0 is used, configure both httpConfigs and httpsConfigs as shown in the image. For a setId having a static host configuration, it is mandatory to configure the httpsConfigs parameter using dummy values as shown in the image, even if it is not used. If it is not configured, the Egress Gateway log shows a NullPointerException.

Figure 4-41 Sending Notification Through SCP

Sending Notification Through SCP

If setId 1 or 2 is used, enable the Alternate Route service and configure proper host details for the Egress Gateway to communicate with the alternate route service. If the configurations are not done as expected, a 425 error is returned, which is the default error configured for virtual FQDN lookup failure. If you see a 503 or other 4xx errors, the actual endpoint or SCP is not reachable.

Figure 4-42 Using setId 1 or 2

Using setId 1 or 2

Figure 4-43 Using setId


Routes Configuration

Figure 4-44 Using setId 1 or 2 (cont..)


Using setId 1 or 2 (cont..)

Figure 4-45 Using setId 1 or 2 (cont..)

Using setId 1 or 2 (cont..)
Retrying to multiple SCPs in case of failure depends on the failure code and the operation performed. If the failure code is not in the configured list, no retry is attempted. The number of retries depends on the retries configuration as follows:

Figure 4-46 SCP Retry

SCP Retry

Also, ensure that scpRerouteEnabled is set to true.

Figure 4-47 scpRerouteEnabled set to true

scpRerouteEnabled set to true
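In Helm-values form, the flag from Figure 4-47 looks like the fragment below. Its placement under the Egress Gateway section is an assumption and may vary by release; confirm it against your custom values file:

```yaml
egress-gateway:
  scpRerouteEnabled: true   # enables rerouting to alternate SCPs on failure
```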
If DNS resolution from the core-dns service does not happen, check whether the following configuration is enabled on the alternate-route service.

Figure 4-48 DNS Srv Configuration

DNS Srv Configuration

4.4.3 Debugging Errors from Ingress Gateway

The possible errors that you may encounter from Ingress Gateway are:
  • Check for 404 Error: If the request fails with a 404 status code and the following ProblemDetails, there may be issues with the routeConfig in the ingressgateway custom values file.
    {"title":"404 NOT_FOUND","status":404,"detail":"udr001.oracle.com: ingressgateway: NOT_FOUND: OUDR-IGWSIG-E183"}

    Check the custom values.yaml file for the essential route configurations. If the essential route configurations are not present, add them.

  • Check for 503 Error: If the request fails with a 503 status code and "SERVICE_UNAVAILABLE" in the ProblemDetails, it means that the nudr-drservice pod is not reachable.
    {"title":"Service Unavailable","status":503,"detail":"udr001.oracle.com: ingressgateway: Service Unavailable: OUDR-IGWSIG-E003","cause":"Encountered unknown host exception at IGW"}
    You can confirm this in the error or exception logs of the ocudr-ingressgateway pod. Check the ocudr-nudr-drservice pod status and fix the issue.
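The two ProblemDetails payloads above can be triaged mechanically by their status field. A small sketch; the sed extraction assumes the compact single-line JSON shown in this section:

```shell
# Example ProblemDetails payload from this section (503 case).
resp='{"title":"Service Unavailable","status":503,"detail":"udr001.oracle.com: ingressgateway: Service Unavailable: OUDR-IGWSIG-E003","cause":"Encountered unknown host exception at IGW"}'

status=$(printf '%s' "$resp" | sed -n 's/.*"status":\([0-9][0-9]*\).*/\1/p')

case "$status" in
  404) echo "Check the route configurations in the ingressgateway custom values file" ;;
  503) echo "Check the ocudr-nudr-drservice pod status and the ingressgateway logs" ;;
  *)   echo "Unhandled status: $status" ;;
esac
```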

4.4.4 Debugging Errors from nudr-config

The Cloud Native Core (CNC) Console GUI uses nudr-config to update or view the configuration items. The debugging error details from nudr-config are as follows:
  • Check for 400 Error: If the following request fails with a 400 status code and "404 Not Found" in the response, it indicates that the logging level information is not present in the database or the microservice is not enabled.

    Figure 4-49 Checking for 400 Error


    Checking for 400 Error

    If common_config_hook is unable to create the configuration item for common services such as ingress-gateway, egress-gateway, or alternate-route, the GET request for logging gives the following response:

    Figure 4-50 Response of Get Request for Logging

    Response of Get Request for Logging

4.4.5 Debugging Notification Issues

If UDR does not generate any notification, check the notify service port configuration in the values.yaml file. These ports must be the same as the ports on which the notify service is running.
nudr-drservice:
...
...
...
    notify:
        port:
            http: 5001
            https: 5002
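A quick consistency check between the configured notify port and the port the notify service actually exposes can be scripted. The values below are placeholders; on a live cluster, read the service port with a command like `kubectl get svc <notify-service> -n <namespace> -o jsonpath='{.spec.ports[*].port}'`:

```shell
# Placeholders: the http notify port from values.yaml and the port actually
# exposed by the notify service in the cluster.
configured_http=5001
service_http=5001

if [ "$configured_http" -eq "$service_http" ]; then
  echo "notify http port matches"
else
  echo "port mismatch: UDR does not generate notifications"
fi
```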