4 Troubleshooting NEF

This chapter provides information to troubleshoot the common errors that can be encountered during the preinstallation, installation, upgrade, and rollback procedures of NEF.

4.1 Generic Checklist

The following sections provide a generic checklist of troubleshooting tips.

Deployment related tips

Perform the following checks after the deployment:
  • Are NEF deployment, pods, and services created?

    Are NEF deployment, pods, and services running and available?

    Run the following command:
    # kubectl -n <namespace> get deployments,pods,svc
    Inspect the output and check the following columns:
    • AVAILABLE of deployment
    • READY, STATUS, and RESTARTS of a pod
    • PORT(S) of service
  • Is the correct image used?

    Are the correct environment variables set in the deployment?

    Run the following command:
    # kubectl -n <namespace> get deployment <deployment-name> -o yaml
    Inspect the output and check the environment variables and the image.
    # kubectl -n nef-svc get deployment ocnef-monitoringevents -o yaml
    apiVersion: extensions/v1beta1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "1"
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"ocnef-monitoringevents","namespace":"nef-svc"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"ocnef-monitoringevents"}},"template":{"metadata":{"labels":{"app":"ocnef-monitoringevents"}},"spec":{"containers":[{"env":[{"name":"MYSQL_HOST","value":"mysql"},{"name":"MYSQL_PORT","value":"3306"},{"name":"MYSQL_DATABASE","value":"nefdb"},{"name":"NEF_SVC_ENDPOINT","value":"ocnef-monitoringevents"}],"image":"cne-repo:5000/ocnef-monitoringevents:latest","imagePullPolicy":"Always","name":"ocnef-monitoringevents","ports":[{"containerPort":8080,"name":"server"}]}]}}}}
      creationTimestamp: 2018-08-27T15:45:59Z
      generation: 1
      name: ocnef-monitoringevents
      namespace: nef-svc
      resourceVersion: "2336498"
      selfLink: /apis/extensions/v1beta1/namespaces/nef-svc/deployments/ocnef-monitoringevents
      uid: 4b82fe89-aa10-11e8-95fd-fa163f20f9e2
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: ocnef-monitoringevents
      strategy:
        rollingUpdate:
          maxSurge: 25%
          maxUnavailable: 25%
        type: RollingUpdate
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: ocnef-monitoringevents
        spec:
          containers:
          - env:
            - name: MYSQL_HOST
              value: mysql
            - name: MYSQL_PORT
              value: "3306"
            - name: MYSQL_DATABASE
              value: nefdb
            - name: NEF_SVC_ENDPOINT
              value: ocnef-monitoringevents
            image: cne-repo:5000/ocnef-monitoringevents:latest
            imagePullPolicy: Always
            name: ocnef-monitoringevents
            ports:
            - containerPort: 8080
              name: server
              protocol: TCP
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: 2018-08-27T15:46:01Z
        lastUpdateTime: 2018-08-27T15:46:01Z
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: 2018-08-27T15:45:59Z
        lastUpdateTime: 2018-08-27T15:46:01Z
        message: ReplicaSet "ocnef-monitoringevents-7898d657d9" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      observedGeneration: 1
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1
  • Check if the microservices can access each other via REST interface.

    Run the following command:

    # kubectl -n <namespace> exec <pod name> -- curl <uri>
    Example:
    # kubectl -n nef-svc exec ocnef-fivegcagent-44f4d8f5d5-6q92i -- curl http://ocnef-monitoringevents:8080/3gpp-monitoring-event/v1/anyAfID1000/subscriptions

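    If the curl check fails, it may help to confirm that the target service has endpoints behind it. The following is a minimal sketch, assuming the nef-svc namespace and the ocnef-monitoringevents service from the earlier example:
    # kubectl -n nef-svc get endpoints ocnef-monitoringevents
    # kubectl -n nef-svc get svc ocnef-monitoringevents -o wide
    An empty ENDPOINTS column usually means that the backing pods are not ready or that the service selector does not match the pod labels.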

Application related tips

Run the following command to check the application logs and look for exceptions:
# kubectl -n <namespace> logs -f <pod name>

You can use '-f' to follow the logs or 'grep' to search for a specific pattern in the log output.

Example:

# kubectl -n nef-svc logs -f $(kubectl -n nef-svc get pods -o name|cut -d'/' -f2|grep nfr)
# kubectl -n nef-svc logs -f $(kubectl -n nef-svc get pods -o name|cut -d'/' -f2|grep nfs)

Note:

These commands are in their simplest form and display the logs only if a single nef<registration> pod and a single nf<subscription> pod are deployed.
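
When more than one pod of a service is running, the logs can also be pulled per deployment rather than per pod. The following is a minimal sketch, assuming the ocnef-monitoringevents deployment and nef-svc namespace used earlier; adjust the grep pattern as needed:

# kubectl -n nef-svc logs deployment/ocnef-monitoringevents | grep -i -e exception -e error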

4.2 Deployment Related Issues

This section describes the most common deployment related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.

4.2.1 Installation

4.2.1.1 Helm Install Failure

This section describes the various scenarios in which helm install might fail. Following are some of the scenarios:

4.2.1.1.1 Incorrect image name in ocnef-custom-values file

Problem

helm install might fail if an incorrect image name is provided in the ocnef-custom-values.yaml file.

Error Code/Error Message

When kubectl get pods -n <ocnef_namespace> is performed, the status of the pods might be ImagePullBackOff or ErrImagePull.

For example:

$ kubectl get pods -n ocnef

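
To see the exact image pull error, describe one of the affected pods and check the Events section at the end of the output. The following is a sketch, assuming the ocnef namespace; replace the pod name with the pod that is in the ImagePullBackOff or ErrImagePull state:

$ kubectl describe pod <pod-name> -n ocnef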

Solution

Perform the following steps to verify and correct the image name:
  1. Check that the ocnef-custom-values.yaml file has the release-specific image names and tags.
    vi ocnef-custom-values-<release-number>
    For ocnef image details, see "Customizing NEF" in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
  2. Edit the ocnef-custom-values file if the release-specific image names or tags must be modified.
  3. Save the file.
  4. Run the following command to delete the deployment:
    helm delete --purge <release_name>
    Sample command:
    helm delete --purge ocnef
  5. To verify the deletion, see the "Verifying Uninstallation" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
  6. Run helm install command. For helm install command, see the "Customizing NEF" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
  7. Run kubectl get pods -n <ocnef_namespace> to verify if all the pods are in Running state.

    For example:

    $ kubectl get pods -n ocnef
    
    NAME                                                   READY   STATUS    RESTARTS   AGE
    nefats-ocats-nef-8bd489d58-jd7ld                1/1     Running   0          47h
    ocats-ocats-nef-67cf948f67-k59cn                1/1     Running   2          6d22h
    ocnef-config-server-75bd4fc7f8-ttgbx            1/1     Running   0          4h41m
    ocnef-expgw-afmgr-67dff6c6fd-tblvq              2/2     Running   0          4h41m
    ocnef-expgw-apimgr-5665864dc4-bq9qj             1/1     Running   0          4h41m
    ocnef-expgw-apirouter-5dc68f4c69-jdh9q          2/2     Running   0          4h41m
    ocnef-expgw-eventmgr-67c5fbdb9c-zg6ll           1/1     Running   0          4h41m
    ocnef-ext-egress-gateway-f569449d4-xd7gs        1/1     Running   0          4h41m
    ocnef-ext-ingress-gateway-69f989878b-2tvdh      1/1     Running   0          4h41m
    ocnef-fivegc-egress-gateway-6f84b8685c-xp292    1/1     Running   0          4h41m
    ocnef-fivegc-ingress-gateway-757566b6d5-bjrqm   1/1     Running   0          4h41m
    ocnef-fivegcagent-667d87696d-pqfd7              1/1     Running   0          4h41m
    ocnef-monitoringevents-87cdb4b67-qfpn2          1/1     Running   0          4h41m
    ocnef-nfdb-5ff78cf4d6-qm2mn                     1/1     Running   0          47h
    ocnef-ocnef-ccfclient-7fd9c5c4bc-jc9tz          1/1     Running   0          4h41m
    ocnef-ocnef-expiry-auditor-6c97cf49f7-r47kb     1/1     Running   0          4h41m
    ocnefsim-ocstub-nef-af-74df6f7b4f-rp54q         1/1     Running   0          46h
    ocnefsim-ocstub-nef-gmlc-5d456ffddb-2xzx8       1/1     Running   0          46h
    ocnefsim-ocstub-nef-nrf-67fbd5bdf6-lqbvp        1/1     Running   0          46h
    ocnefsim-ocstub-nef-udm-9d86d96c7-wjhf5         1/1     Running   0          46h
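
As an additional check, the image referenced by each pod can be listed with a single command, which makes an incorrect name or tag easy to spot. The following is a sketch, assuming the ocnef namespace:

$ kubectl get pods -n ocnef -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
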
4.2.1.1.2 Docker registry is configured incorrectly

Problem

helm install might fail if the docker registry is not configured on all primary and secondary nodes.

Error Code/Error Message

When kubectl get pods -n <ocnef_namespace> is performed, the status of the pods might be ImagePullBackOff or ErrImagePull.

For example:

$ kubectl get pods -n ocnef

Solution

Configure docker registry on all primary and secondary nodes. For more information on configuring the docker registry, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
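
To confirm that the failures are image pull errors and not something else, check the recent events in the namespace. The following is a sketch, assuming the ocnef namespace; the exact event wording can vary with the Kubernetes version:

$ kubectl get events -n ocnef --sort-by=.lastTimestamp | grep -i -e "pull" -e "back-off"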

4.2.1.1.3 Continuous Restart of Pods

Problem

helm install might fail if the MySQL primary and secondary hosts are not configured properly in ocnef-custom-values.yaml.

Error Code/Error Message

When kubectl get pods -n <ocnef_namespace> is performed, the pod restart count increases continuously.

For example:

$ kubectl get pods -n ocnef

Solution

MySQL server(s) may not be configured properly according to the preinstallation steps. For configuring MySQL servers, see the "Configuring Database, Creating Users, and Granting Permissions" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
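
The logs of the previously crashed container instance usually show the database connection error directly. The following is a sketch; replace the pod name with one of the restarting pods:

$ kubectl logs --previous <restarting-pod-name> -n <ocnef_namespace>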

4.2.1.2 Pod Creation Failure

A pod creation can fail due to various reasons. Some of the possible scenarios are as follows:

Verifying Pod Image Correctness

To verify pod image:

  • Check whether any of the pods is in the ImagePullBackOff state.
  • Check if the image names used for all the pods are correct. Verify the image names and versions against the values in the NEF custom-values.yaml file. For more information about the custom values file, see Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
  • After updating the custom-values.yaml file, run the following command for helm upgrade:

    helm upgrade <release> <helm chart> [--version <OCNEF version>] --namespace <ocnef_namespace> -f <ocnef_values.yaml>
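
    For example, a correction limited to image names or tags can be rolled out with an upgrade similar to the following sketch; the release name, chart file, and values file are placeholders, not actual values:

    helm upgrade ocnef ocnef-<release-number>.tgz --namespace ocnef -f ocnef-custom-values-<release-number>.yaml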

Verifying Resource Allocation Failure

To verify any resource allocation failure:

  • Run the following command to check the details of any pod that is in the Pending state.

    kubectl describe pod <nef-drservice pod id> -n <ocnef-namespace>

  • Verify whether any warning about insufficient CPU exists in the describe output of the respective pod. If it exists, it means there are insufficient CPU resources for the pods to start. Address this capacity issue.
  • Run the helm upgrade command shown above after updating the values.yaml file.
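
A quick way to list only the pods stuck in the Pending state and then check them for insufficient-CPU warnings is shown in the following sketch; the namespace and pod name are placeholders:

kubectl get pods -n <ocnef-namespace> --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n <ocnef-namespace> | grep -i insufficient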

Verifying Resource Allocation Issues on Webscale Environment

A Webscale environment has OpenShift Container Platform installed. The following cases can occur:

  • Pods do not scale after you run the installation command, and the helm install command fails with a timeout error. In this case, check for pre-install hook failures. Run the oc get job command to list the jobs. Describe the job for which the pods are not getting scaled and check if there are quota limit exceeded errors for CPU or memory.
  • Any of the microservice pods do not scale after the hooks are completed. In this case, run the oc get rs command to get the list of replica sets created for the NF deployment. Then, describe the replica set for which the pods are not getting scaled and check for resource quota limit exceeded errors for CPU or memory.
  • The helm install command times out after all the microservice pods are scaled as expected with the expected number of replicas. In this case, check for post-install hook failures. Run the oc get job command to get the post-install jobs, describe the job for which the pods are not getting scaled, and check if there are quota limit exceeded errors for CPU or memory.
  • The resource quota is exceeded beyond its limits.
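
On OpenShift, the checks described above typically take the following form; this is a sketch, and the namespace, job, and replica set names are placeholders:

oc get job -n <ocnef-namespace>
oc describe job <job-name> -n <ocnef-namespace>
oc get rs -n <ocnef-namespace>
oc describe rs <replicaset-name> -n <ocnef-namespace>
oc describe resourcequota -n <ocnef-namespace>
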
4.2.1.3 Pod Startup Failure
Follow the guidelines below to debug pod startup failures and liveness check issues:
  • If dr-service, diameter-proxy, and diam-gateway services are stuck in the Init state, then the reason could be that config-server is not yet up. A sample log on these services is as follows:
    "Config Server is Not yet Up, Wait For config server to be up."

    To resolve this, check the reason why the config-server is not up, or disable the config-server if it is not required.

  • If the notify and on-demand migration service is stuck in the Init state, then the reason could be that the dr-service is not yet up. A sample log on these services is as follows:
    "DR Service is Not yet Up, Wait For dr service to be up."

    To resolve this, check for failures on dr-service.
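
To see which initialization step a stuck pod is waiting on, list its init containers and check their logs. The following is a sketch; the pod and init container names are placeholders:

kubectl get pod <stuck-pod-name> -n <ocnef-namespace> -o jsonpath='{.spec.initContainers[*].name}'
kubectl logs <stuck-pod-name> -n <ocnef-namespace> -c <init-container-name>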

4.2.1.4 NRF Registration Failure
The NEF registration with NRF may fail due to various reasons. Some of the possible scenarios are as follows:
  • Confirm whether registration was successful from the nrf-client-service pod.
  • Check the ocnef-nrf-client-nfmanagement logs. If the log contains "OCNEF is Deregistration", then:
    • Check if all the services mentioned under allorudr/slf (depending on the NEF mode) in the custom-values.yaml file have the same spelling as the service names and are enabled.
    • Once all the services are up, NEF registers with NRF.
  • If you see a log for SERVICE_UNAVAILABLE(503), check if the primary and secondary NRF configurations (primaryNrfApiRoot/secondaryNrfApiRoot) are correct and the NRFs are up and running.
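
The registration status can usually be confirmed from the NRF client management logs. The following is a sketch; the deployment name is a placeholder, so check kubectl get deployments for the exact name in your installation:

kubectl logs deployment/<ocnef-nrf-client-nfmanagement-deployment> -n <ocnef_namespace> | grep -i -e "register" -e "503"
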
4.2.1.5 Custom Value File Parse Failure
This section explains the troubleshooting procedure in case of a failure while parsing the ocnef-custom-values.yaml file.

Problem

Not able to parse ocnef-custom-values-x.x.x.yaml while running helm install.

Error Code/Error Message

Error: failed to parse ocnef-custom-values-x.x.x.yaml: error converting YAML to JSON: yaml

Symptom

While creating the ocnef-custom-values-x.x.x.yaml file, if the aforementioned error is received, it means that the file is not created properly. The tree structure may not have been followed, or the file may contain tab characters.
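
Tab characters and basic YAML syntax errors can be spotted before rerunning helm install. The following is a sketch; GNU grep is assumed, and the second command requires Python 3 with the PyYAML module installed:

grep -nP '\t' ocnef-custom-values-x.x.x.yaml
python3 -c "import yaml, sys; yaml.safe_load(open(sys.argv[1]))" ocnef-custom-values-x.x.x.yaml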

Solution

Follow the procedure mentioned below:
  1. Download the latest NEF templates zip file from MOS. For more information, see the "Downloading NEF Package" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
  2. Follow the steps mentioned in the "Installation Tasks" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.

4.2.2 Post Installation

4.2.2.1 Helm Test Error Scenario

Following are the error scenarios that may be identified using helm test.

  1. Run the following command to get the Helm Test pod name:
    kubectl get pods -n <deployment-namespace>
  2. When a helm test is performed, a new helm test pod is created. Check for the Helm Test pod that is in an error state.
  3. Get the logs using the following command:
    kubectl logs <podname> -n <namespace>
    Example:
    kubectl logs <helm_test_pod> -n ocnef

    For further assistance, collect the logs and contact MOS.
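
A typical sequence for running the test and collecting its logs is shown in the following sketch; the release name ocnef is a placeholder, and with Helm 3 the namespace option (-n ocnef) must be added to the helm test command:

helm test ocnef
kubectl get pods -n ocnef | grep -i test
kubectl logs <helm_test_pod> -n ocnef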

4.3 Database Related Issues

This section describes the most common database related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.

4.3.1 MySQL DB Access Failure

Problem

Keyword - wait-for-db

Tags - "config-server" "database" "readiness" "init" "SQLException" "access denied"

Because of database accessibility issues from the NEF services, the pods stay in the Init state.

If some pods do come up, they keep getting the exception: "Cannot connect to database server java.sql.SQLException".

Reasons:

  1. The MySQL host IP address or the MySQL service name (in case of occne-infra) is not given correctly.
  2. Some MySQL nodes may be down.
  3. The username/password given in the secrets is not created in the database, or does not have the proper grants/access to the service databases.
  4. Most likely: the databases are not created with the same names as mentioned in the NEF custom values file used while installing NEF.

Resolution Steps

To resolve this issue, perform the following steps:
  1. Check that the database IP address is correct and pingable from the worker nodes of the Kubernetes cluster. Update the database IP address and service accordingly. If required, you can use a floating IP address as well. If there is a database connectivity issue, update the IP address to the correct one.

    In case of OCCNE infrastructure, instead of specifying an IP address for the MySQL connection, use the FQDN of the mysql-connectivity-service to connect to the database.

  2. Manually log in to MySQL using the same database IP address mentioned in the custom values file. If a MySQL service name is used, describe the service using the following command:
    kubectl describe svc <mysql-servicename> -n <namespace> 
    Then log in to the MySQL database using each of the IP addresses listed in the MySQL service description. If any SQL node is down, it leads to intermittent DB query failures. Make sure that you can log in to MySQL from all the nodes listed in the output of the MySQL service describe command.

    Make sure that all the MySQL nodes are up and running before installing NEF.

  3. Check the existing user list in the database using the SQL query: select user from mysql.user;
    Check if all the users mentioned in the custom values file of the NEF installation are present in the database.

    Note:

    Create the user with the proper password, as mentioned in the secret file of the NEF.
  4. Check the grants of all the users mentioned in the custom values file using the SQL query: show grants for <username>;

    If there is a username/password issue, create the user with the required password and provide grants as described in the Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.

  5. Check that the databases are created with the same names as mentioned in the custom values file for the services.

    Note:

    Create the database as per the custom-value file.
  6. Check if the problematic pods are all getting created on one particular worker node. If yes, the worker node may be the cause of the error. Try draining the problematic worker node and allow the pods to move to another node.
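
The user, grant, and database checks from steps 3 to 5 can be run from any host that can reach the database. The following is a sketch; the host, user, and database names are placeholders and must match your custom values and secret files:

mysql -h <mysql-host-or-fqdn> -u root -p -e "SELECT user, host FROM mysql.user;"
mysql -h <mysql-host-or-fqdn> -u root -p -e "SHOW GRANTS FOR '<nef-db-user>'@'%';"
mysql -h <mysql-host-or-fqdn> -u root -p -e "SHOW DATABASES;"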

4.4 Service Related Issues

This section describes the most common service related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.

4.4.1 Errors from Egress Gateway

If the traffic is not routed through Egress Gateway, then check the following:
  • Check whether the Egress gateway parameters are configured correctly through the NEF custom values.
  • Check whether the Egress Gateway pod is running. To check, run the following command:

    kubectl get pods -n <namespace>

  • To enable outgoing traffic using HTTPS, set the enableOutgoingHttps parameter to true.
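
The Egress Gateway logs usually show why outgoing requests are failing. The following is a sketch, assuming the ocnef-ext-egress-gateway deployment name seen in the earlier pod listings; adjust the name and namespace to match your installation:

kubectl logs deployment/ocnef-ext-egress-gateway -n <namespace> | grep -i -e error -e exception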

4.4.2 Debugging Errors from Ingress Gateway

The possible errors that you may encounter from Ingress Gateway are:

  • Check for 500 Error: If the request fails with a 500 status code without Problem Details information, it means that the flow ended in the ocnef-ingressgateway pod without a route. You can confirm this in the errors or exceptions section of the ocnef-ingressgateway pod logs.
  • Check for 503 Error: If the request fails with a 503 status code with "SERVICE_UNAVAILABLE" in the Problem Details, it means that the ocnef-expgw-apirouter pod is not reachable for some reason.
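
Both cases can be confirmed from the Ingress Gateway pod logs. The following is a sketch, assuming the ocnef-ext-ingress-gateway deployment name seen in the earlier pod listings; adjust the name and namespace to match your installation:

kubectl logs deployment/ocnef-ext-ingress-gateway -n <namespace> | grep -i -e "route" -e "exception" -e "503"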

4.5 Upgrade or Rollback Failure

When Oracle Communications Network Exposure Function (NEF) upgrade or rollback fails, perform the following procedure.

  1. Check the pre or post upgrade logs or rollback hook logs in Kibana as applicable. You can filter upgrade or rollback logs using the following filters:
    • For upgrade: hookName = "pre-upgrade" or hookName = "post-upgrade"
    • For rollback: hookName = "pre-rollback" or hookName = "post-rollback"

    
    {
       "instant":{
          "epochSecond":1669292396,
          "nanoOfSecond":918939800
       },
       "thread":"main",
       "level":"INFO",
       "loggerName":"com.oracle.utils.SqlUtils",
       "message":"Executing the SQL query: ALTER TABLE `ocnef11`.`ocnef_me_subscription` \nADD CONSTRAINT `me_subscription_id_to_owner_site_id`\n FOREIGN KEY (`owner_site_id`)\n REFERENCES `ocnef_site_instance_model` (`site_instance_ref_id`)\n ON DELETE NO ACTION\n ON UPDATE NO ACTION;\n",
       "endOfBatch":false,
       "loggerFqcn":"org.apache.logging.slf4j.Log4jLogger",
       "threadId":1,
       "threadPriority":5,
       "messageTimestamp":"2022-11-24T17:49:56.918+0530",
       "hookName":"pre-upgrade"
    }
  2. Check the pod logs in Kibana to analyze the cause of failure.
  3. After detecting the cause of failure, do the following:
    • For upgrade failure:
      • If the cause of upgrade failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the upgrade command.
      • If the failure occurs during the pre-upgrade phase, do not perform a rollback.
      • If the upgrade failure occurs during the post-upgrade phase, for example, a post-upgrade hook failure because a target release pod does not move to the ready state, then perform a rollback.
    • For rollback failure: If the cause of rollback failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the rollback command.
    • If the issue persists, contact My Oracle Support.
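
The rerun of the upgrade or rollback command referred to above typically takes the following form; this is a sketch, and the release name, chart file, and revision number are placeholders:

helm history ocnef
helm upgrade ocnef <chart-file> --namespace <ocnef_namespace> -f <ocnef_custom_values.yaml>
helm rollback ocnef <revision>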