4.1 Deployment Related Issues
This section describes the most common deployment-related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue persists, contact Oracle Support.
4.1.1 Helm Install Failure
If the helm install command fails
This section covers the reasons for helm install command failures and the corresponding troubleshooting procedures.
helm install failure scenarios:
- Chart syntax issue: This issue is typically reported within the first few seconds. Fix the chart-specific errors and rerun the helm install command; in this case, no hooks should have started yet.
- Timeout (the most likely reason): If any job is stuck in a pending or error state and is unable to run, the command times out after 5 minutes, which is the default timeout for the helm command. In this case, follow the troubleshooting steps below.
The helm install command fails in case of a duplicate chart:
helm install /home/cloud-user/pcf_1.6.1/sprint3.1/ocpcf-1.6.1-sprint.3.1.tgz --name ocpcf2 --namespace ocpcf2 -f <custom-value-file>
Error: release ocpcf2 failed: configmaps "perfinfo-config-ocpcf2" already exists
Here, the configmap 'perfinfo-config-ocpcf2' already exists, so creating the Kubernetes objects after the pre-upgrade hooks fails. In this case as well, go through the troubleshooting steps below.
Troubleshooting steps:
- Check the describe output and logs of the failed pods and fix them accordingly. Verify what went wrong during the installation of Policy by checking the following points:
For the pods that did not start, run the following command to check the failed pods:
kubectl describe pod <pod-name> -n <release-namespace>
For the pods that started but failed to come into the "READY" state, run the following command to check the failed pods:
kubectl logs <pod-name> -n <release-namespace>
- Run the following command to get the Kubernetes objects. This gives a detailed overview of which objects are stuck or in a failed state:
kubectl get all -n <release_namespace>
- Run the following command to delete all Kubernetes objects:
kubectl delete all --all -n <release_namespace>
- Run the following command to delete all current configmaps:
kubectl delete cm --all -n <release-namespace>
- Run the following statements to clean up the databases created by the helm install command and create the databases again (see the sketch after this procedure for one way to execute them):
DROP DATABASE IF EXISTS occnp_audit_service;
DROP DATABASE IF EXISTS occnp_config_server;
DROP DATABASE IF EXISTS occnp_pcf_am;
DROP DATABASE IF EXISTS occnp_pcf_sm;
DROP DATABASE IF EXISTS occnp_pcrf_core;
DROP DATABASE IF EXISTS occnp_release;
DROP DATABASE IF EXISTS occnp_binding;
DROP DATABASE IF EXISTS occnp_policyds;
DROP DATABASE IF EXISTS occnp_pcf_ue;
DROP DATABASE IF EXISTS occnp_commonconfig;
CREATE DATABASE IF NOT EXISTS occnp_audit_service;
CREATE DATABASE IF NOT EXISTS occnp_config_server;
CREATE DATABASE IF NOT EXISTS occnp_pcf_am;
CREATE DATABASE IF NOT EXISTS occnp_pcf_sm;
CREATE DATABASE IF NOT EXISTS occnp_pcrf_core;
CREATE DATABASE IF NOT EXISTS occnp_release;
CREATE DATABASE IF NOT EXISTS occnp_binding;
CREATE DATABASE IF NOT EXISTS occnp_policyds;
CREATE DATABASE IF NOT EXISTS occnp_pcf_ue;
CREATE DATABASE IF NOT EXISTS occnp_commonconfig;
In addition, clean up the entries in the "mysql.ndb_replication" table by running the following command:
DROP TABLE IF EXISTS mysql.ndb_replication;
- Run the following command to check the release status:
For Helm2: helm ls --all
For Helm3: helm3 ls -n <release-namespace>
If the release is in a failed state, purge the namespace using the following command (Helm2):
helm delete --purge <release_namespace>
Note:
If the command is taking more time, run the following command in another session to clear all the delete jobs:
while true; do kubectl delete jobs --all -n <release_namespace>; sleep 5; done
Once the purge command succeeds, press "Ctrl+C" to stop the above script.
- After the database cleanup and re-creation of the databases, run the helm install command again.
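The following is a minimal sketch of one way to run the database cleanup and re-creation statements shown earlier in this procedure. It assumes access to the MySQL/NDB SQL node with a privileged database user; the host name, user name, and file name are placeholders, not values from this guide.
# Save the DROP/CREATE statements into a file, for example occnp_db_cleanup.sql,
# then run them through the MySQL client (placeholder host and user):
mysql -h <mysql-host> -u <privileged-db-user> -p < occnp_db_cleanup.sql
# Verify that the databases were re-created:
mysql -h <mysql-host> -u <privileged-db-user> -p -e "SHOW DATABASES LIKE 'occnp_%';"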
If the helm install command fails due to the atomic and timeout options
The helm install command fails when the external IP (LoadBalancer) allocation fails for Diameter Gateway, Ingress Gateway, and Configuration Management service, as these services are of type LoadBalancer.
Reason: The primary reason for this problem is limited infrastructure availability, due to which floating IPs may not be available. It may also happen because the system takes more time to assign floating IPs, as a result of which the charts are purged.
Solution: To resolve this issue, either remove the --atomic option from the helm install command or set a higher timeout value.
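For example, a higher timeout can be passed on the command line. The following is an illustrative sketch only, with placeholder release, chart, and namespace names; the default timeout is 5 minutes in both Helm versions.
# Helm3: omit --atomic and raise the timeout so a slow LoadBalancer IP allocation does not purge the charts
helm3 install <release-name> <chart>.tgz --namespace <release-namespace> -f <custom-value-file> --timeout 10m
# Helm2 equivalent (timeout is specified in seconds)
helm install <chart>.tgz --name <release-name> --namespace <release-namespace> -f <custom-value-file> --timeout 600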
4.1.2 Configuration Issue where mysql-username had an Extra Line
Symptom
No suitable driver found for jdbc
Problem
Secret files contain the user ID and password for MySQL. The user ID and password inside the secret file must be base64 encoded. During base64 encoding, if a new line is present in the user ID or password, the new line is also encoded and may cause issues when the values are decoded back.
Resolution Steps
To resolve this issue, perform the following steps:
- Get the secret file created by the customer.
- Fetch the encoded MySQL username and password.
- Go to https://www.base64decode.org/.
- Enter the username and password and click DECODE.
- Verify whether an extra line is present in the username or password. If present, remove the extra line.
- Encode the corrected values again.
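As an alternative to the website, the values can be checked and re-encoded from the command line. This is a minimal sketch with placeholder values; echo -n prevents a trailing newline from being encoded.
# Decode the value taken from the secret and inspect it for a trailing newline character
echo '<encoded-mysql-username>' | base64 -d | od -c | tail -n 2
# Re-encode without a newline (-n suppresses the trailing newline)
echo -n '<mysql-username>' | base64
echo -n '<mysql-password>' | base64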
4.1.3 App Info Worker Time Out
Problem
[CRITICAL] WORKER TIMEOUT
The appinfo process runs an HTTP server (gunicorn) and a few worker processes. A request first reaches the gunicorn master process, and the worker processes then handle it. If a worker does not return within 30 seconds, gunicorn prints the "WORKER TIMEOUT" error and kills the worker. From the log, it appears that the worker processes are stuck somewhere.
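To confirm that the gunicorn master and its workers are actually running inside the appinfo pod, a quick check such as the following can be used. The pod and namespace names are placeholders, and the check assumes the ps utility is available in the container image.
# List the gunicorn master and worker processes inside the appinfo pod
kubectl -n <pcf namespace> exec -it <appinfo pod name> -- ps -ef | grep -i gunicorn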
Resolution Steps
- Change the appinfo deployment to increase the liveness threshold value from 3 to a higher value (see the sketch after this procedure). By doing so, appinfo is not impacted by the readiness check.
- Watch the log of appinfo to check whether the problem still exists.
- If the problem still exists, find out why the worker process is stuck. Run the following command to get into the appinfo pod:
kubectl -n <pcf namespace> exec -it <pod name> -- /bin/bash
- Create a temporary Python file:
cat > xxx_test.py
import pdb
import appinfo
pdb.set_trace()
appinfo.app.run(port=9999)
- Run the following command to run this temporary Python file. It launches a Python debugger; type "continue" to run the app:
python3 xxx_test.py
- Open another terminal and run the following command to get into the appinfo pod:
kubectl -n <pcf namespace> exec -it <pod name> -- /bin/bash
Then, run the following command to check whether this temporary service returns immediately:
curl localhost:9999/v1/readiness
If curl gets stuck, the problem is reproduced. In the Python debugger, press "Ctrl+C" to get a stack trace that indicates where the worker is stuck.
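As referenced in the first step of this procedure, the liveness failure threshold can be raised on the appinfo deployment. This is only a sketch; the deployment name, container index, and threshold value of 10 are assumptions and must be adapted to the actual deployment.
# Raise the liveness probe failureThreshold on the first container of the appinfo deployment (placeholder names)
kubectl -n <pcf namespace> patch deployment <appinfo-deployment-name> --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}]'
# Confirm the new value
kubectl -n <pcf namespace> get deployment <appinfo-deployment-name> \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.failureThreshold}'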
4.1.4 Startup Probes
To increase the application's reliability and availability, startup probes are introduced in Policy. Consider a scenario where the configuration is not loaded, or only partially loaded, but the service still goes into a ready state. This may result in different pods showing different behaviour for the same service. With the introduction of the startup probe, the readiness and liveness checks for a pod are not initiated until the configuration is loaded completely and the startup probe succeeds. However, if the startup probe fails, the container restarts.
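To see how the startup probe is configured for a given pod, the probe definition can be read directly from the pod spec. This is a generic sketch with placeholder names; it assumes the probe is defined on the first container of the pod.
# Print the startupProbe definition of the first container in the pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].startupProbe}'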
- Log in to a container and check the startup probe endpoint by running the following commands:
kubectl exec -it <podname> -n <namespace> -- bash
curl -kv http://localhost:<monitoring-port>/<startup-probe-url>
Example:
kubectl exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
The sample output can be as follows:
[cloud-user@bastion-1 ~]$
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 9000 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 9000 failed: Connection refused
* Failed to connect to localhost port 9000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 9000: Connection refused
command terminated with exit code 7
[cloud-user@bastion-1 ~]$ k exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Thu, 21 Apr 2022 11:18:03 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"DOWN"}
[cloud-user@bastion-1 ~]$ k exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 21 Apr 2022 11:18:04 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"UP"}
[cloud-user@bastion-1 ~]$
- To check why the startup probe failed, check the describe output of the pod:
Describe output:
Warning  Unhealthy  <invalid> (x10 over 2m45s)  kubelet  Startup probe failed: Get "http://10.233.81.231:9000/actuator/health/startup": dial tcp 10.233.81.231:9000: connect: connection refused
The following could be the possible reasons for a startup probe failure:
- Network connectivity issue
- Database connection issue due to which the server does not come up
- Any other exception
- If the reason for the startup probe failure is not clear, check the logs to determine whether it is due to an issue with the config-server connection or with fetching configurations from the config-server (see the sketch below).
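As referenced in the previous step, a quick way to scan the pod logs for config-server related errors is shown below. The search terms are only examples and can be adjusted; the pod and namespace names are placeholders.
# Scan the pod logs for configuration or config-server related errors
kubectl logs <pod-name> -n <namespace> | grep -i -E 'config-server|configuration|exception|refused'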
4.1.5 Monitoring of Diameter Gateway Worker Node Failure
Symptom
When a Diameter Gateway worker node fails, new replicas are not created on a different worker node.
Problem
On the Diameter Gateway, when the worker node is shut down, the pod is set to the "Terminating" state. The Diameter Gateway pods are StatefulSet pods, due to which new pods are not created until the original pod dies, whereas in a similar scenario ReplicaSet pods are spun up on new worker nodes. The stuck pod has to be force killed using the --force option, as shown in the sketch below.
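The following is a minimal sketch of forcing the deletion of a Diameter Gateway pod stuck in the "Terminating" state so that the StatefulSet can reschedule it. The pod and namespace names are placeholders; forcing deletion bypasses graceful shutdown, so use it only when the worker node is already down.
# Force delete the stuck Diameter Gateway pod
kubectl delete pod <diam-gateway-pod-name> -n <release-namespace> --force --grace-period=0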
Resolution
For Diameter Gateway, set terminationGracePeriodSeconds to 0s. This is done by configuring the occnp-custom-values.yaml file:
diam-gateway:
  # Graceful Termination
  gracefulShutdown:
    gracePeriod: 0s
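After updating occnp-custom-values.yaml, the change has to be applied to the running release. A minimal sketch, assuming a Helm3 deployment with placeholder release and chart names:
# Apply the updated graceful-termination setting to the running release
helm3 upgrade <release-name> <occnp-chart>.tgz -n <release-namespace> -f occnp-custom-values.yaml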
Create an alert that is triggered when a node is down. Modify the oid and name as per the customer deployment, if needed.
name: NODE_UNAVAILABLE
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 30s
labels:
  oid: XXXXXX
  severity: critical
annotations:
  description: Kubernetes node {{ $labels.node }} is not in Ready state
  summary: Kubernetes node {{ $labels.node }} is unavailable
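To verify the alert expression before relying on it, node readiness can be checked with kubectl, and the same PromQL expression can be evaluated against Prometheus. The Prometheus host and port are placeholders and depend on the deployment.
# Check node readiness directly
kubectl get nodes
# Evaluate the alert expression against the Prometheus query API
curl -s 'http://<prometheus-host>:<prometheus-port>/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'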