4 Troubleshooting BSF

This section provides information on how to troubleshoot the common errors that may occur during the installation and upgrade of Oracle Communications Cloud Native Core, Binding Support Function.

4.1 Deployment Issues

This section provides information on how to troubleshoot deployment issues for BSF.

4.1.1 Helm Install Failure

If the helm install Command Fails

This section describes various scenarios and troubleshooting procedures if the helm install command fails.

Reasons for helm install failure:
  • Chart syntax issue

    Resolve the chart-specific issues and rerun the helm install command. In this case, no hooks should have started.

  • Timeout (most likely reason)

    If any job is stuck in a pending or error state and cannot run, the helm install command times out after 5 minutes, which is the default timeout for the helm command. In this case, follow the steps below to troubleshoot.

  • helm install command fails in case of a duplicated chart
    helm install /home/cloud-user/bsf_1.5.0/sprint3.1/ocbsf-1.5.0-sprint.3.1.tgz --name ocbsf2 --namespace ocbsf2 -f cust-ashish.yaml
    Error: release ocbsf2 failed: configmaps "perfinfo-config-ocbsf2" already exists
    Here, the configmap perfinfo-config-ocbsf2 already exists when the Kubernetes objects are created after the pre-upgrade hooks, which results in this failure. In this case, perform the following troubleshooting steps.
    Troubleshooting steps:
    1. Run the following commands to clean up the databases created by the helm install command (a sketch of running them from the MySQL client is provided after this procedure):
      
      DROP DATABASE IF EXISTS ocbsf_commonconfig;
      DROP DATABASE IF EXISTS ocbsf_config_server;
      DROP DATABASE IF EXISTS ocpm_bsf;
      DROP DATABASE IF EXISTS ocbsf_release;
      DROP DATABASE IF EXISTS ocbsf_audit_service;
    2. Run the following command to get the Kubernetes objects:
      kubectl get all -n <release_namespace>
      This gives a detailed overview of which objects are stuck or in a failed state.
    3. Run the following commands to delete all the Kubernetes objects:
      kubectl delete all --all -n <release_namespace>
      kubectl delete cm --all -n <release_namespace>
    4. Run the following command to list all the Helm releases:
      helm ls --all
      If the release is in a failed state, purge it using the following command:
      helm delete --purge <release_namespace>

      Note:

      If this command takes a long time to complete, run the following command in parallel in another session to clear all the delete jobs:
      while true; do kubectl delete jobs --all -n <release_namespace>; sleep 5;done
      Monitor the following command:
      helm delete --purge <release_namespace>
      Once it succeeds, press "Ctrl+C" to stop the above script.
    5. After the database cleanup, run the helm install command.
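
The following is a minimal sketch of running the cleanup in steps 1, 3, and 4 as shell commands. It assumes kubectl access to a cnDBTier SQL pod and the Helm syntax used above; the pod name ndbmysqld-0, the namespaces, and the MySQL credentials are placeholders that must be replaced with the values of your deployment.

  # Drop the BSF databases from a cnDBTier SQL pod (pod name, namespace, and credentials are placeholders)
  kubectl exec -it ndbmysqld-0 -n <cndbtier_namespace> -- mysql -u <db_user> -p \
    -e "DROP DATABASE IF EXISTS ocbsf_commonconfig;
        DROP DATABASE IF EXISTS ocbsf_config_server;
        DROP DATABASE IF EXISTS ocpm_bsf;
        DROP DATABASE IF EXISTS ocbsf_release;
        DROP DATABASE IF EXISTS ocbsf_audit_service;"

  # Delete the leftover Kubernetes objects and configmaps in the release namespace
  kubectl delete all --all -n <release_namespace>
  kubectl delete cm --all -n <release_namespace>

  # Check the release state and purge it if it is in a failed state
  helm ls --all
  helm delete --purge <release_namespace>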

4.1.2 Loss of Session Binding Data

Symptom

When the NDB Cluster goes down (due to a node group being down), the session binding data in the pcf_binding table of BSF is lost.

Problem

NOLOGGING is enabled for the pcf_binding table in the ocpm_bsf database, which means the session table is not stored persistently on the Persistent Volume (PV). Hence, when the NDB Cluster goes down, the session binding data is lost.
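
As a minimal sketch, the logging attribute of the table can be checked with the ndb_desc utility from a cnDBTier pod; ndb_desc reports a non-logging table as a temporary table. The pod name ndbmysqld-0, the namespace, and the management connect string are placeholders in this example.

  # A non-logging NDB table is reported as "Temporary table: yes"
  kubectl exec -it ndbmysqld-0 -n <cndbtier_namespace> -- \
    ndb_desc -d ocpm_bsf pcf_binding -c <mgmt_connectstring> | grep -i "Temporary table"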

Resolution Steps

To resolve this issue, perform the following steps:

  1. Monitor the alerts and KPIs to determine the status of the BSF cnDBTier and the NF (see the sketch after this procedure for one way to check the NDB Cluster status).
  2. If there is an issue due to cnDBTier cluster down, perform a controlled shutdown of the BSF in the affected site.
  3. Perform a GRR procedure in the cnDBTier to recover the database and replication.
  4. Start the BSF again using the controlled shutdown procedure so that BSF starts accepting traffic again.
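
The following is a minimal sketch of checking the NDB Cluster and node group status for step 1, assuming kubectl access to the cnDBTier management pod; the pod name ndbmgmd-0 and the namespace are placeholders.

  # Show the status of all NDB data nodes and node groups
  kubectl exec -it ndbmgmd-0 -n <cndbtier_namespace> -- ndb_mgm -e show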

4.2 Startup Probes

To increase the application's reliability and availability, startup probes are introduced in BSF. Consider a scenario where the configuration is not loaded or only partially loaded, but the service goes into a ready state. This may result in different pods showing different behaviour for the same service. With the introduction of the startup probe, the readiness and liveness checks for a pod are not initiated until the configuration is loaded completely and the startup probe is successful. However, if the startup probe fails, the container restarts.
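
As a minimal sketch, the startup probe configured on a BSF container can be inspected with kubectl; the pod name and namespace are placeholders.

  # Print the startup probe definition of each container in the pod
  kubectl get pod <pod_name> -n <namespace> \
    -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.startupProbe}{"\n"}{end}'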

To check the status of the startup probe or investigate the reason for its failure, perform the following steps:
  1. Log in to the container and query the startup probe endpoint by running the following commands:
    kubectl exec -it podname -n namespace -- bash
    curl -kv http://localhost:<monitoring-port>/<startup-probe-url>
    Example:
    kubectl exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
    A sample output is as follows:
    [cloud-user@bastion-1 ~]$ 
    *   Trying ::1...
    * TCP_NODELAY set
    * connect to ::1 port 9000 failed: Connection refused
    *   Trying 127.0.0.1...
    * TCP_NODELAY set
    * connect to 127.0.0.1 port 9000 failed: Connection refused
    * Failed to connect to localhost port 9000: Connection refused
    * Closing connection 0
    curl: (7) Failed to connect to localhost port 9000: Connection refused
    command terminated with exit code 7
    [cloud-user@bastion-1 ~]$ k exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
    *   Trying ::1...
    * TCP_NODELAY set
    * Connected to localhost (::1) port 9000 (#0)
    > GET /actuator/health/startup HTTP/1.1
    > Host: localhost:9000
    > User-Agent: curl/7.61.1
    > Accept: */*
    >
    < HTTP/1.1 503 Service Unavailable
    < Date: Thu, 21 Apr 2022 11:18:03 GMT
    < Content-Type: application/json;charset=utf-8
    < Transfer-Encoding: chunked
    < Server: Jetty(9.4.43.v20210629)
    <
    * Connection #0 to host localhost left intact
    {"status":"DOWN"}[cloud-user@bastion-1 ~]$ k exec -it test-bsf-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
    *   Trying ::1...
    * TCP_NODELAY set
    * Connected to localhost (::1) port 9000 (#0)
    > GET /actuator/health/startup HTTP/1.1
    > Host: localhost:9000
    > User-Agent: curl/7.61.1
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Date: Thu, 21 Apr 2022 11:18:04 GMT
    < Content-Type: application/json;charset=utf-8
    < Transfer-Encoding: chunked
    < Server: Jetty(9.4.43.v20210629)
    <
    * Connection #0 to host localhost left intact
    {"status":"UP"}[cloud-user@bastion-1 ~]$
  2. To check why the startup probe failed, describe the pod and check the events. The following is a sample event from the describe output:
    
     Warning  Unhealthy  <invalid> (x10 over 2m45s)  kubelet            Startup probe failed: Get "http://10.233.81.231:9000/actuator/health/startup": dial tcp 10.233.81.231:9000: connect: connection refused
    The following could be the possible reasons for startup probe failure:
    • Network connectivity issue
    • Database connection issue due to which server is not coming up
    • Any other exception
  3. If the reason for startup probe failure is not clear, check the logs to determine whether it is due to an issue with the config-server connection or with fetching configurations from the config-server (see the sketch below for one way to filter the logs).
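
The following is a minimal sketch of filtering the pod logs for config-server related errors; the pod name, namespace, and search terms are placeholders and do not represent a fixed log format.

  # Scan the pod logs for connection or configuration-fetch errors (search terms are examples)
  kubectl logs <pod_name> -n <namespace> | grep -i -E "config-server|connection refused|exception"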

4.3 Upgrade or Rollback Failure

When Oracle Communications Cloud Native Core, Binding Support Function (BSF) upgrade or rollback fails, perform the following procedure.

  1. Check the pre or post upgrade or rollback hook logs in Kibana as applicable.
    Users can filter upgrade or rollback logs using the following filters:
    • For upgrade: lifeCycleEvent=9001 or 9011
    • For rollback: lifeCycleEvent=9002

  2. Check the pod logs in Kibana to analyze the cause of failure.
  3. After detecting the cause of failure, do the following:
    • For upgrade failure:
      • If the cause of upgrade failure is a database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the upgrade command. The current state of the release can be verified first, as shown in the sketch after this procedure.
      • If the cause of failure is not related to a database or network connectivity issue and is observed during the preupgrade phase, do not perform a rollback because the BSF deployment remains in the source (older) release.
      • If the upgrade failure occurs during the postupgrade phase, for example, a post-upgrade hook failure due to the target release pods not moving to the ready state, then perform a rollback.
    • For rollback failure: If the cause of rollback failure is a database or network connectivity issue, contact your system administrator. When the issue is resolved, rerun the rollback command.
  4. If the issue persists, contact My Oracle Support.
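
Before rerunning the upgrade or rollback command, the current state of the release can be verified, as a minimal sketch, with the following Helm commands; the release name is a placeholder.

  # Review the revision history and current status of the BSF release
  helm history <release_name>
  helm status <release_name>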