4 Troubleshooting NEF
This chapter provides information for troubleshooting the common errors that can be encountered during the preinstallation, installation, upgrade, and rollback procedures of NEF.
4.1 Generic Checklist
The following sections provide a generic checklist and troubleshooting tips.
Deployment related tips
- Are the NEF deployment, pods, and services created?
Are the NEF deployment, pods, and services running and available?
Run the following command:
# kubectl -n <namespace> get deployments,pods,svc
Inspect the output and check the following columns:
- AVAILABLE of deployment
- READY, STATUS, and RESTARTS of a pod
- PORT(S) of service
- Is the correct image used?
Are the correct environment variables set in the deployment?
Run the following command:
# kubectl -n <namespace> get deployment <deployment-name> -o yaml
Inspect the output and check the environment and image.
# kubectl -n nef-svc get deployment ocnef-monitoringevents -o yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"ocnef-monitoringevents","namespace":"nef-svc"},"spec":{"replicas":1,"selector":{"matchLabels":{"app":"ocnef-monitoringevents"}},"template":{"metadata":{"labels":{"app":"ocnef-monitoringevents"}},"spec":{"containers":[{"env":[{"name":"MYSQL_HOST","value":"mysql"},{"name":"MYSQL_PORT","value":"3306"},{"name":"MYSQL_DATABASE","value":"nefdb"},{"name":"NEF_SVC_ENDPOINT","value":"ocnef-monitoringevents"}],"image":"cne-repo:5000/ocnef-monitoringevents:latest","imagePullPolicy":"Always","name":"ocnef-monitoringevents","ports":[{"containerPort":8080,"name":"server"}]}]}}}}
  creationTimestamp: 2018-08-27T15:45:59Z
  generation: 1
  name: ocnef-monitoringevents
  namespace: nef-svc
  resourceVersion: "2336498"
  selfLink: /apis/extensions/v1beta1/namespaces/nef-svc/deployments/ocnef-monitoringevents
  uid: 4b82fe89-aa10-11e8-95fd-fa163f20f9e2
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: ocnef-monitoringevents
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: ocnef-monitoringevents
    spec:
      containers:
      - env:
        - name: MYSQL_HOST
          value: mysql
        - name: MYSQL_PORT
          value: "3306"
        - name: MYSQL_DATABASE
          value: nefdb
        - name: NRF_SVC_ENDPOINT
          value: ocnef-monitoringevents
        image: cne-repo:5000/ocnef-monitoringevents:latest
        imagePullPolicy: Always
        name: ocnef-monitoringevents
        ports:
        - containerPort: 8080
          name: server
          protocol: TCP
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: 2018-08-27T15:46:01Z
    lastUpdateTime: 2018-08-27T15:46:01Z
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: 2018-08-27T15:45:59Z
    lastUpdateTime: 2018-08-27T15:46:01Z
    message: ReplicaSet "ocnef-monitoringevents-7898d657d9" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
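If you only want to confirm the image reference without reading the full YAML, a jsonpath query can be used instead (a minimal sketch; the namespace and deployment name are placeholders):
# kubectl -n <namespace> get deployment <deployment-name> -o jsonpath='{.spec.template.spec.containers[*].image}'
The command prints only the container image field from the deployment specification.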
- Check if the microservices can access each other via REST interface.
Run the following command:
# kubectl -n <namespace> exec <pod name> -- curl <uri>
Example:
# kubectl -n nef-svc exec ocnef-fivegcagent-44f4d8f5d5-6q92i -- curl http://ocnef-monitoringevents:8080/3gpp-monitoring-event/v1/anyAfID1000/subscriptions
Application related tips
Check the application logs for errors or exceptions. Run the following command:
# kubectl -n <namespace> logs -f <pod name>
You can use '-f' to follow the logs or 'grep' for a specific pattern in the log output.
Example:
# kubectl -n nef-svc logs -f $(kubectl -n nef-svc get pods -o name|cut -d'/' -f2|grep nfr)
# kubectl -n nef-svc logs -f $(kubectl -n nef-svc get pods -o name|cut -d'/' -f2|grep nfs)
Note:
These commands are in their simple form and display the logs only if there is a single nef<registration> and nf<subscription> pod deployed.
4.2 Deployment Related Issues
This section describes the most common deployment related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.
4.2.1 Installation
4.2.1.1 Helm Install Failure
This section describes the various scenarios in which helm install might fail. Following are some of the scenarios:
4.2.1.1.1 Incorrect image name in ocnef-custom-values file
Problem
helm install might fail if an incorrect image name is provided in the ocnef-custom-values.yaml file.
Error Code/Error Message
When kubectl get pods -n <ocnef_namespace> is performed, the status of the pods might be ImagePullBackOff or ErrImagePull.
For example:
$ kubectl get pods -n ocnef
NAME READY STATUS RESTARTS AGE
nefats-ocats-nef-8bd489d58-jd7ld 1/1 Running 0 47h
ocats-ocats-nef-67cf948f67-k59cn 1/1 Running 2 6d22h
ocnef-config-server-75bd4fc7f8-ttgbx 1/1 Running 0 4h41m
ocnef-expgw-afmgr-67dff6c6fd-tblvq 2/2 Running 0 4h41m
ocnef-expgw-apimgr-5665864dc4-bq9qj 1/1 Running 0 4h41m
ocnef-expgw-apirouter-5dc68f4c69-jdh9q 2/2 Running 0 4h41m
ocnef-expgw-eventmgr-67c5fbdb9c-zg6ll 1/1 Running 0 4h41m
ocnef-ext-egress-gateway-f569449d4-xd7gs 1/1 Running 0 4h41m
ocnef-ext-ingress-gateway-69f989878b-2tvdh 1/1 Running 0 4h41m
ocnef-fivegc-egress-gateway-6f84b8685c-xp292 1/1 Running 0 4h41m
ocnef-fivegc-ingress-gateway-757566b6d5-bjrqm 1/1 Running 0 4h41m
ocnef-fivegcagent-667d87696d-pqfd7 1/1 Running 0 4h41m
ocnef-monitoringevents-87cdb4b67-qfpn2 1/1 Running 0 4h41m
ocnef-nfdb-5ff78cf4d6-qm2mn 1/1 Running 0 47h
ocnef-ocnef-ccfclient-7fd9c5c4bc-jc9tz 1/1 Running 0 4h41m
ocnef-ocnef-expiry-auditor-6c97cf49f7-r47kb 1/1 Running 0 4h41m
ocnefsim-ocstub-nef-af-74df6f7b4f-rp54q 1/1 Running 0 46h
ocnefsim-ocstub-nef-gmlc-5d456ffddb-2xzx8 1/1 Running 0 46h
ocnefsim-ocstub-nef-nrf-67fbd5bdf6-lqbvp 1/1 Running 0 46h
ocnefsim-ocstub-nef-udm-9d86d96c7-wjhf5 1/1 Running 0 46h
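To see the underlying reason for an ImagePullBackOff or ErrImagePull status, describe the affected pod and check the Events section (illustrative; the pod name is a placeholder):
# kubectl -n <ocnef_namespace> describe pod <pod-name>
The Events section typically shows whether the failure is caused by a wrong image name, a missing tag, or a registry that cannot be reached.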
Solution
- Check whether the ocnef-custom-values.yaml file has the release-specific image names and tags.
For details about the ocnef images, see "Customizing NEF" in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
vi ocnef-custom-values-<release-number>
- Edit the ocnef-custom-values file if the release-specific image names and tags must be modified.
- Save the file.
- Run the following command to delete the deployment:
helm delete --purge <release_namespace>
Sample command:
helm delete --purge ocnef
- To verify the deletion, see the "Verifying Uninstallation" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
- Run the helm install command. For the helm install command, see the "Customizing NEF" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
- Run kubectl get pods -n <ocnef_namespace> to verify whether all the pods are in Running state.
For example:
$ kubectl get pods -n ocnef
NAME READY STATUS RESTARTS AGE
nefats-ocats-nef-8bd489d58-jd7ld 1/1 Running 0 47h
ocats-ocats-nef-67cf948f67-k59cn 1/1 Running 2 6d22h
ocnef-config-server-75bd4fc7f8-ttgbx 1/1 Running 0 4h41m
ocnef-expgw-afmgr-67dff6c6fd-tblvq 2/2 Running 0 4h41m
ocnef-expgw-apimgr-5665864dc4-bq9qj 1/1 Running 0 4h41m
ocnef-expgw-apirouter-5dc68f4c69-jdh9q 2/2 Running 0 4h41m
ocnef-expgw-eventmgr-67c5fbdb9c-zg6ll 1/1 Running 0 4h41m
ocnef-ext-egress-gateway-f569449d4-xd7gs 1/1 Running 0 4h41m
ocnef-ext-ingress-gateway-69f989878b-2tvdh 1/1 Running 0 4h41m
ocnef-fivegc-egress-gateway-6f84b8685c-xp292 1/1 Running 0 4h41m
ocnef-fivegc-ingress-gateway-757566b6d5-bjrqm 1/1 Running 0 4h41m
ocnef-fivegcagent-667d87696d-pqfd7 1/1 Running 0 4h41m
ocnef-monitoringevents-87cdb4b67-qfpn2 1/1 Running 0 4h41m
ocnef-nfdb-5ff78cf4d6-qm2mn 1/1 Running 0 47h
ocnef-ocnef-ccfclient-7fd9c5c4bc-jc9tz 1/1 Running 0 4h41m
ocnef-ocnef-expiry-auditor-6c97cf49f7-r47kb 1/1 Running 0 4h41m
ocnefsim-ocstub-nef-af-74df6f7b4f-rp54q 1/1 Running 0 46h
ocnefsim-ocstub-nef-gmlc-5d456ffddb-2xzx8 1/1 Running 0 46h
ocnefsim-ocstub-nef-nrf-67fbd5bdf6-lqbvp 1/1 Running 0 46h
ocnefsim-ocstub-nef-udm-9d86d96c7-wjhf5 1/1 Running 0 46h
4.2.1.1.2 Docker registry is configured incorrectly
Problem
helm install might fail if the docker registry is not configured on all the primary and secondary nodes.
Error Code/Error Message
When kubectl get pods -n <ocnef_namespace> is performed, the status of the pods might be ImagePullBackOff or ErrImagePull.
For example:
$ kubectl get pods -n ocnef
Solution
Configure docker registry on all primary and secondary nodes. For more information on configuring the docker registry, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
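To confirm that a node can actually reach the registry, you can attempt a manual pull from that node (illustrative; the registry, image name, and tag are placeholders and must match your deployment):
# docker pull <registry>:<port>/<image-name>:<tag>
If the pull fails, correct the registry configuration on that node before rerunning helm install.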
4.2.1.1.3 Continuous Restart of Pods
Problem
helm install might fail if the MySQL primary and secondary hosts are not configured properly in ocnef-custom-values.yaml.
Error Code/Error Message
When kubectl get pods -n <ocnef_namespace> is performed, the pods restart count increases continuously.
For example:
$ kubectl get pods -n ocnef
Solution
The MySQL server(s) may not be configured properly according to the preinstallation steps. For configuring MySQL servers, see the "Configuring Database, Creating Users, and Granting Permissions" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
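To confirm that the restarts are caused by database connectivity, you can check the logs of the previous container instance for SQL exceptions (illustrative; the pod name is a placeholder):
# kubectl -n <ocnef_namespace> logs --previous <pod-name> | grep -i "SQLException"
Repeated connection exceptions in these logs indicate that the MySQL host, port, or credentials in ocnef-custom-values.yaml must be corrected.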
4.2.1.2 Pod Creation Failure
Pod creation can fail for various reasons. Some of the possible scenarios are as follows:
Verifying Pod Image Correctness
To verify pod image:
- Check whether any of the pods is in the ImagePullBackOff state.
- Check whether the image names used for all the pods are correct (an illustrative command to list all pod images follows this list). Verify the image names and versions against the values in the NEF custom-values.yaml file. For more information about the custom values file, see Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
- After updating the custom-values.yaml file, run the following command for helm upgrade:
helm upgrade <helm chart> [--version <OCNEF version>] --name <release> --namespace <ocnefnamespace> -f <ocnef_values.yaml>
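A quick way to compare the running images against the custom values is to list the image used by every pod in the namespace (a minimal sketch using standard kubectl jsonpath formatting; the namespace is a placeholder):
# kubectl -n <ocnef_namespace> get pods -o jsonpath="{range .items[*]}{.metadata.name}{'\t'}{.spec.containers[*].image}{'\n'}{end}"
Each line of the output shows a pod name followed by its container image(s).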
Verifying Resource Allocation Failure
To verify any resource allocation failure:
- Run the following command to verify whether any pod is in the pending state.
kubectl describe pod <nef-drservice pod id> -n <ocnef-namespace>
- Verify whether any warning on insufficient CPU exists in the describe output of the respective pod. If it exists, it means there are insufficient CPUs for the pods to start. Address this hardware issue.
- Run the following helm upgrade command after updating the values.yaml file.
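The following illustrative commands can be used to confirm a resource shortage; the namespace and pod name are placeholders:
# kubectl -n <ocnef-namespace> get pods --field-selector=status.phase=Pending
# kubectl -n <ocnef-namespace> describe pod <pending-pod-name>
In the describe output, an event such as "Insufficient cpu" or "Insufficient memory" confirms that the cluster does not have enough resources for the requested pod.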
Verifying Resource Allocation Issues on Webscale Environment
The Webscale environment has OpenShift containers installed. The following cases can occur:
- Pods do not scale after you run the installation command, and the helm install command fails with a timeout error. In this case, check for preinstall hook failures. Run the oc get job command to list the jobs. Describe the job for which the pods are not getting scaled and check whether there are quota limit exceeded errors for CPU or memory.
- Any of the actual microservice pods do not scale after the hooks complete. In this case, run the oc get rs command to get the list of ReplicaSets created for the NF deployment. Then, describe the ReplicaSet for which the pods are not getting scaled and check for resource quota limit exceeded errors for CPU or memory.
- The helm install command times out after all the microservice pods are scaled as expected with the expected number of replicas. In this case, check for post-install hook failures. Run the oc get job command to get the post-install jobs, describe the job for which the pods are not getting scaled, and check whether there are quota limit exceeded errors for CPU or memory.
- The resource quota is exceeded beyond its limits.
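The following illustrative OpenShift commands can help confirm the quota issue in the above cases; the namespace and job name are placeholders:
# oc get resourcequota -n <ocnef_namespace>
# oc describe job <job-name> -n <ocnef_namespace>
An "exceeded quota" message in the describe output confirms that the CPU or memory quota for the project must be increased before retrying the installation.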
4.2.1.3 Pod Startup Failure
- If dr-service, diameter-proxy, and diam-gateway services are stuck
in the Init state, then the reason could be that config-server is not yet up. A
sample log on these services is as
follows:
"Config Server is Not yet Up, Wait For config server to be up."
To resolve this, you must either check for the reason of config-server not being up or if the config-server is not required, then disable it.
- If the notify and on-demand migration service is stuck in the Init
state, then the reason could be the dr-service is not yet up. A sample log on
these services is as
follows:
"DR Service is Not yet Up, Wait For dr service to be up."
To resolve this, check for failures on dr-service.
4.2.1.4 NRF Registration Failure
- Confirm whether registration was successful from the nrf-client-service pod.
- Check the ocnef-nrf-client-nfmanagement logs (see the illustrative command after this list). If the log contains "OCNEF is Deregistration", then:
- Check whether all the services mentioned under allorudr/slf (depending on the NEF mode) in the custom-values.yaml file have the same spelling as the service names and are enabled.
- Once all services are up, NEF must register with NRF.
- If you see a log for SERVICE_UNAVAILABLE(503), check if the primary and secondary NRF configurations (primaryNrfApiRoot/secondaryNrfApiRoot) are correct and they are UP and Running.
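The following is an illustrative way to check the registration status in the NRF client logs; the pod name is a placeholder and the grep pattern is only a suggestion:
# kubectl -n <ocnef_namespace> logs <ocnef-nrf-client-nfmanagement-pod> | grep -i "regist"
The filtered output shows whether the registration attempt succeeded, was rejected, or ended in deregistration.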
4.2.1.5 Custom Value File Parse Failure
This section describes how to troubleshoot a failure to parse the ocnef-custom-values.yaml file.
Problem
Unable to parse ocnef-custom-values-x.x.x.yaml while running helm install.
Error Code/Error Message
Error: failed to parse ocnef-custom-values-x.x.x.yaml: error converting YAML to JSON: yaml
Symptom
If the above error is received while creating the ocnef-custom-values-x.x.x.yaml file, it means that the file is not created properly: the tree structure may not have been followed, or the file may contain tab characters (see the illustrative check after the solution steps).
Solution
- Download the latest NEF templates zip file from MOS. For more information, see the "Downloading NEF Package" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
- Follow the steps mentioned in the "Installation Tasks" section in Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide.
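Before rerunning helm install, the custom values file can be pre-checked for YAML errors. The following is a minimal sketch; the chart directory path is a placeholder, Helm 3 and GNU grep are assumed:
# helm lint <ocnef-chart-directory> -f ocnef-custom-values-x.x.x.yaml
# grep -nP '\t' ocnef-custom-values-x.x.x.yaml
The first command reports parsing problems, and the second lists any lines containing tab characters, which are not allowed in YAML indentation.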
4.2.2 Post Installation
4.2.2.1 Helm Test Error Scenario
Following are the error scenarios that may be identified using helm test.
- Run the following command to get the Helm Test pod name:
kubectl get pods -n <deployment-namespace>
- When a helm test is performed, a new helm test pod is created. Check for the Helm Test pod that is in an error state.
- Get the logs using the following command:
kubectl logs <podname> -n <namespace>
Example:
kubectl logs <helm_test_pod> -n ocnef
For further assistance, collect the logs and contact MOS.
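For reference, once the issue is fixed, the test can be rerun with a command of the following form (Helm 3 syntax is assumed; the release name and namespace are placeholders):
helm test <release_name> -n <ocnef_namespace>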
4.3 Database Related Issues
This section describes the most common database related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.
4.3.1 MySQL DB Access Failure
Problem
Keyword - wait-for-db
Tags - "config-server" "database" "readiness" "init" "SQLException" "access denied"
Because of database accessibility issues from the NEF services, the pods stay in the init state.
Some pods, even if they come up, keep getting the exception: "Cannot connect to database server java.sql.SQLException"
Reasons:
- The MySQL host IP address or the MySQL service name (in case of occne-infra) is not given correctly.
- Some MySQL nodes may be down.
- The username/password given in the secrets is not created in the database, or does not have the proper grants/access to the service databases.
- Most likely: the databases are not created with the same names as mentioned in the NEF custom-value file used while installing NEF.
Resolution Steps
- Check whether the database IP address is correct and pingable from the worker nodes of the Kubernetes cluster. Update the database IP address and service accordingly. If required, you can use a floating IP address as well. If there is a database connectivity issue, update the correct IP address.
In case of OCCNE-infra, instead of mentioning an IP address for the MySQL connection, use the FQDN of the mysql-connectivity-service to connect to the database.
- Manually log in to MySQL using the same database IP address mentioned in the custom-value file. If a MySQL service name is used instead, describe the service using the following command:
kubectl describe svc <mysql-servicename> -n <namespace>
Log in to the MySQL database with each of the IP addresses listed for the MySQL service. If any SQL node is down, it leads to intermittent DB query failures, so make sure that you can log in to MySQL from all the nodes listed in the output of the MySQL service describe command.
Make sure that all the MySQL nodes are up and running before installing NEF.
- Check the existing user list in the database using the following SQL query:
select user from mysql.user;
Check whether all the users mentioned in the custom-value file of the NEF installation are present in the database.
Note:
Create the users with the proper passwords as mentioned in the secret file of the NEF.
- Check the grants of all the users mentioned in the custom-value file using the following SQL query: "show grants for <username>;"
If there is a username/password issue, create the user correctly with the required password and provide the grants as per the Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide (see the illustrative SQL after this list).
- Check whether the databases are created with the same names as mentioned in the custom-value file for the services.
Note:
Create the databases as per the custom-value file.
- Check whether the problematic pods are all getting created on one particular worker node. If yes, that worker node may be the cause of the error. Try draining the problematic worker node and allow the pods to move to another node.
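The following is an illustrative SQL sequence for creating a missing user and granting access; the user name, host, password, and database name are placeholders, and the exact grants must match the Oracle Communications Cloud Native Core, Network Exposure Function Installation, Upgrade, and Fault Recovery Guide:
CREATE USER '<nef-username>'@'%' IDENTIFIED BY '<password>';
GRANT ALL PRIVILEGES ON <nef-database>.* TO '<nef-username>'@'%';
FLUSH PRIVILEGES;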
4.4 Service Related Issues
This section describes the most common service related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue still persists, then contact My Oracle Support.
4.4.1 Errors from Egress Gateway
- Check whether the Egress gateway parameters are configured correctly through the NEF custom values.
- Check whether Egress pod is running from Kubectl. To check, run the following command:
kubectl get pods -n <Release.name>
- To enable outgoing traffic using HTTPS, set the enableOutgoingHttps parameter to true (see the illustrative snippet after this list).
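For reference, the parameter is set in the NEF custom values file; the exact location of the egress gateway section can vary by release, so the following snippet is illustrative only:
egress-gateway:
  enableOutgoingHttps: true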
4.4.2 Debugging Errors from Ingress Gateway
The possible errors that you may encounter from Ingress Gateway are:
- Check for 500 Error: If the request fails with a 500 status code without Problem Details information, it means that the flow ended in the ocnef-ingressgateway pod without a route. You can confirm this in the errors or exceptions section of the ocnef-ingressgateway pod logs (see the illustrative log filter after this list).
- Check for 503 Error: If the request fails with a 503 status code with "SERVICE_UNAVAILABLE" in Problem Details, it means that the ocnef-expgw-apirouter pod is not reachable due to some reason.
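An illustrative way to locate the failure in the gateway logs is to filter them for errors and exceptions; the pod name is a placeholder:
# kubectl -n <ocnef_namespace> logs <ocnef-ingressgateway-pod> | grep -iE "error|exception"
For a 500 error, look for missing-route errors in the ocnef-ingressgateway logs; for a 503 error, check whether the ocnef-expgw-apirouter pod is running and reachable.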
4.5 Upgrade or Rollback Failure
When Oracle Communications Network Exposure Function (NEF) upgrade or rollback fails, perform the following procedure.
- Check the pre-upgrade or post-upgrade hook logs or the rollback hook logs in Kibana, as applicable. If Kibana is not available, see the illustrative kubectl commands after this procedure. You can filter the upgrade or rollback logs using the following filters:
- For upgrade: hookName = "pre-upgrade" or hookName = "post-upgrade"
- For rollback: hookName = "pre-rollback" or hookName = "post-rollback"
{ "instant":{ "epochSecond":1669292396, "nanoOfSecond":918939800 }, "thread":"main", "level":"INFO", "loggerName":"com.oracle.utils.SqlUtils", "message":"Executing the SQL query: ALTER TABLE `ocnef11`.`ocnef_me_subscription` \nADD CONSTRAINT `me_subscription_id_to_owner_site_id`\n FOREIGN KEY (`owner_site_id`)\n REFERENCES `ocnef_site_instance_model` (`site_instance_ref_id`)\n ON DELETE NO ACTION\n ON UPDATE NO ACTION;\n", "endOfBatch":false, "loggerFqcn":"org.apache.logging.slf4j.Log4jLogger", "threadId":1, "threadPriority":5, "messageTimestamp":"2022-11-24T17:49:56.918+0530", "hookName":"pre-upgrade" }
- Check the pod logs in Kibana to analyze the cause of failure.
- After detecting the cause of failure, do the following:
- For upgrade failure:
- If the cause of upgrade failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the upgrade command.
- If the failure occurs during the preupgrade phase, do not perform a rollback.
- If the upgrade failure occurs during the postupgrade phase, for example, post upgrade hook failure due to target release pod not moving to ready state, then perform a rollback.
- For rollback failure: If the cause of rollback failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the rollback command.
- If the issue persists, contact My Oracle Support.
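If Kibana is not reachable, the hook logs can usually be read directly from the hook job pods. The following commands are illustrative, and the job names depend on the release:
# kubectl -n <ocnef_namespace> get jobs
# kubectl -n <ocnef_namespace> logs job/<hook-job-name>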