4 Troubleshooting BSF
This section provides information on how to troubleshoot the common errors that may occur during the installation and upgrade of Oracle Communications Cloud Native Core, Binding Support Function.
4.1 Deployment Issues
This section provides information on how to troubleshoot deployment issues for BSF.
4.1.1 Helm Install Failure
This section describes various scenarios and troubleshooting procedures to follow when the helm install command fails.
- helm install failure: Chart syntax issue
Resolve the chart-specific issues and rerun the helm install command. In this case, no hooks should have begun.
- Most possible reason: Timeout
If any job is stuck in a pending or error state and cannot execute, the helm install command times out, because the default timeout for the helm command is 5 minutes. In this case, follow the steps below to troubleshoot.
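As a sketch of working around the default timeout (Helm 2 syntax is assumed here, matching the helm delete --purge usage elsewhere in this section; the chart path, release name, namespace, and values file below are placeholders, not values from this document). The script only prints the commands so they can be reviewed before running:

```shell
# Sketch only: placeholders for chart, release, and namespace.
CHART="ocbsf-1.5.0.tgz"
RELEASE="ocbsf"
NAMESPACE="ocbsf"

# Helm 2 takes --timeout in seconds; the default is 300 (5 minutes).
TIMEOUT_SECONDS=$((10 * 60))

# Print the install command with a longer timeout rather than running it.
echo "helm install ${CHART} --name ${RELEASE} --namespace ${NAMESPACE}" \
     "--timeout ${TIMEOUT_SECONDS} -f custom-values.yaml"

# Before rerunning, inspect hook jobs that may be stuck in Pending/Error:
echo "kubectl get jobs,pods -n ${NAMESPACE}"
```

A longer timeout only helps when the stuck job eventually completes; if a job is genuinely failing, fix its cause first.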
- helm install command failed in case of a duplicated chart
helm install /home/cloud-user/bsf_1.5.0/sprint3.1/ocbsf-1.5.0-sprint.3.1.tgz --name ocbsf2 --namespace ocbsf2 -f cust-ashish.yaml
Error: release ocbsf2 failed: configmaps "perfinfo-config-ocbsf2" already exists
Here, the configmap perfinfo-config-ocbsf2 exists multiple times while creating the Kubernetes objects after the pre-upgrade hooks. This is a failure scenario. In this case, perform the following troubleshooting steps.
Troubleshooting steps:
- Run the following commands to clean up the databases created by the helm install command:
DROP DATABASE IF EXISTS ocbsf_commonconfig;
DROP DATABASE IF EXISTS ocbsf_config_server;
DROP DATABASE IF EXISTS ocpm_bsf;
DROP DATABASE IF EXISTS ocbsf_release;
DROP DATABASE IF EXISTS ocbsf_audit_service;
- Run the following command to get the Kubernetes objects. This gives a detailed overview of which objects are stuck or in a failed state:
kubectl get all -n <release_namespace>
- Run the following commands to delete all the Kubernetes objects:
kubectl delete all --all -n <release_namespace>
kubectl delete cm --all -n <release_namespace>
- Run the following command to list all the Helm releases:
helm ls --all
If the release is in a failed state, purge it using the following command:
helm delete --purge <release_namespace>
Note:
If the execution of this command takes a long time, run the following command in parallel in another session to clear all the delete jobs:
while true; do kubectl delete jobs --all -n <release_namespace>; sleep 5; done
Once the purge succeeds, press "Ctrl+C" to stop the above script.
- After the database cleanup, rerun the helm install command.
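The cleanup steps above can be collected into a single sketch (assumptions: Helm 2 CLI, and the DROP statements are executed manually against the cnDBTier SQL node; the namespace below is the placeholder release namespace from the example). The script only prints the commands and SQL so they can be reviewed before running:

```shell
#!/bin/sh
# Sketch of the cleanup procedure above; prints commands instead of running them.
NS="ocbsf2"    # placeholder release namespace from the example

# 1. SQL to drop the databases created by the failed helm install
#    (run these against the cnDBTier SQL node).
cat <<'SQL'
DROP DATABASE IF EXISTS ocbsf_commonconfig;
DROP DATABASE IF EXISTS ocbsf_config_server;
DROP DATABASE IF EXISTS ocpm_bsf;
DROP DATABASE IF EXISTS ocbsf_release;
DROP DATABASE IF EXISTS ocbsf_audit_service;
SQL

# 2. Inspect, then delete, the leftover Kubernetes objects.
echo "kubectl get all -n ${NS}"
echo "kubectl delete all --all -n ${NS}"
echo "kubectl delete cm --all -n ${NS}"

# 3. Purge the failed release, then rerun helm install.
echo "helm ls --all"
echo "helm delete --purge ${NS}"
```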
4.1.2 Loss of Session Binding Data
Symptom
When the NDB Cluster goes down (due to a node group being down), it results in loss of session binding data from the pcf_binding table in BSF.
Problem
NOLOGGING is enabled for the pcf_binding table in the ocpm_bsf database, which means the session table is not stored persistently in the persistent volume (PV). Hence, when the NDB Cluster goes down, the session binding data is lost.
Resolution Steps
To resolve this issue, perform the following steps:
- Monitor the alerts and the KPI to determine the status of the BSF cnDBTier and the NF.
- If there is an issue due to cnDBTier cluster down, perform a controlled shutdown of the BSF in the affected site.
- Perform a GRR procedure in the cnDBTier to recover the database and replication.
- Start the BSF again, as described in the controlled shutdown procedure, so that BSF resumes accepting traffic.
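To determine whether the cnDBTier cluster in step 1 is actually down, the NDB management client can be queried; this is a sketch, assuming the management pod is named ndbmgmd-0 and runs in namespace occne-cndbtier (both placeholders; adjust to your cnDBTier deployment). The command is printed rather than executed:

```shell
# Placeholders: adjust the pod name and namespace to your cnDBTier deployment.
DB_NS="occne-cndbtier"
MGMT_POD="ndbmgmd-0"

# 'ndb_mgm -e show' lists the data nodes and shows whether each node group
# is connected or down.
echo "kubectl exec -n ${DB_NS} ${MGMT_POD} -- ndb_mgm -e show"
```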
4.2 Startup Probes
To increase the application's reliability and availability, startup probes are introduced in BSF. Consider a scenario where the configuration is not loaded or is only partially loaded, but the service goes into a ready state. This may result in different pods showing different behavior for the same service. With the introduction of startup probes, the readiness and liveness checks for a pod are not initiated until the configuration is loaded completely and the startup probe is successful. However, if the startup probe fails, the container restarts.
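As an illustration of the mechanism only (not the exact values shipped in the BSF Helm charts, which may differ), a Kubernetes startup probe of this shape holds off the liveness and readiness checks until the probe endpoint returns success:

```yaml
# Illustrative sketch; port, path, and thresholds are assumptions.
startupProbe:
  httpGet:
    path: /actuator/health/startup
    port: 9000
  failureThreshold: 30   # container restarts after 30 consecutive failures
  periodSeconds: 10      # probe every 10 seconds
```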
- Log in to a container and query the startup probe endpoint by running the following commands:
kubectl exec -it <podname> -n <namespace> -- bash
curl -kv http://localhost:<monitoring-port>/<startup-probe-url>
Example:
kubectl exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
The sample output can be as follows:
[cloud-user@bastion-1 ~]$
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 9000 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 9000 failed: Connection refused
* Failed to connect to localhost port 9000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 9000: Connection refused
command terminated with exit code 7
[cloud-user@bastion-1 ~]$ k exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Thu, 21 Apr 2022 11:18:03 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"DOWN"}
[cloud-user@bastion-1 ~]$ k exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 21 Apr 2022 11:18:04 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"UP"}
[cloud-user@bastion-1 ~]$
- To check why the startup probe failed, describe the pod and review the events. Sample describe output:
Warning  Unhealthy  <invalid> (x10 over 2m45s)  kubelet  Startup probe failed: Get "http://10.233.81.231:9000/actuator/health/startup": dial tcp 10.233.81.231:9000: connect: connection refused
The following could be the possible reasons for startup probe failure:
- Network connectivity issue
- Database connection issue due to which the server is not coming up
- Any other exception
- If the reason for startup probe failure is not clear, check the logs to determine if it is due to an issue with config-server connection or any issue with fetching configurations from the config-server.
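The checks in the steps above can be sketched as follows (the pod name, namespace, and port are placeholders taken from the example earlier in this section). The commands are printed rather than executed so they can be adapted first:

```shell
POD="test-bsf-797cf5997-2zlgf"   # placeholder pod name from the example
NS="default"                     # placeholder namespace
PORT=9000                        # monitoring port used by the probe

# Query the startup probe endpoint from inside the container.
echo "kubectl exec -it ${POD} -n ${NS} -- curl -kv http://localhost:${PORT}/actuator/health/startup"

# Check the probe failure events recorded against the pod.
echo "kubectl describe pod ${POD} -n ${NS}"

# Inspect the container logs for config-server or database connection errors.
echo "kubectl logs ${POD} -n ${NS}"
```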
4.3 Upgrade or Rollback Failure
When Oracle Communications Cloud Native Core, Binding Support Function (BSF) upgrade or rollback fails, perform the following procedure.
- Check the pre or post upgrade or rollback hook logs in Kibana, as applicable.
Users can filter upgrade or rollback logs using the following filters:
- For upgrade: lifeCycleEvent=9001 or 9011
- For rollback: lifeCycleEvent=9002
- Check the pod logs in Kibana to analyze the cause of failure.
- After detecting the cause of failure, do the following:
- For upgrade failure:
- If the cause of upgrade failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the upgrade command.
- If the cause of failure is not related to a database or network connectivity issue and is observed during the pre-upgrade phase, do not perform a rollback, because the BSF deployment remains in the source (older) release.
- If the upgrade failure occurs during the post-upgrade phase, for example, a post upgrade hook failure due to the target release pod not moving to the ready state, then perform a rollback.
- For rollback failure: If the cause of rollback failure is database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the rollback command.
- If the issue persists, contact My Oracle Support.
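The rerun commands referenced in the steps above can be sketched as follows (Helm 2 syntax is assumed, matching the helm delete --purge usage elsewhere in this document; the release name, chart file, values file, and revision number are placeholders). The commands are printed for review rather than executed:

```shell
RELEASE="ocbsf"                      # placeholder release name
CHART="ocbsf-target.tgz"             # placeholder target-release chart
REVISION=1   # placeholder: take the source-release revision from 'helm history'

# Rerun the upgrade once the database or network issue is resolved.
echo "helm upgrade ${RELEASE} ${CHART} -f custom-values.yaml"

# Rerun the rollback to the source-release revision.
echo "helm history ${RELEASE}"
echo "helm rollback ${RELEASE} ${REVISION}"
```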