4.1 Deployment Related Issues
This section describes the most common deployment-related issues and their resolution steps. It is recommended to perform the resolution steps provided in this guide. If the issue persists, contact Oracle Support.
4.1.1 Helm Install Failure
If the helm install command fails
This section covers the reasons for helm install command failures and the corresponding troubleshooting procedures.
helm install failure scenarios:
- Chart syntax issue: This issue is typically reported within the first few seconds. Fix the chart-specific errors and rerun the helm install command; in this case, no hooks should have started yet.
- Timeout (the most likely reason): If any job is stuck in a pending or error state and is unable to run, the command times out after 5 minutes, which is the default timeout for the helm command. In this case, follow the troubleshooting steps below.
The helm install command fails in case of a duplicate chart:
helm install /home/cloud-user/pcf_1.6.1/sprint3.1/ocpcf-1.6.1-sprint.3.1.tgz --name ocpcf2 --namespace ocpcf2 -f <custom-value-file>
Error: release ocpcf2 failed: configmaps "perfinfo-config-ocpcf2" already exists
Here, the configmap 'perfinfo-config-ocpcf2' already exists, so creating the Kubernetes objects after the pre-upgrade hooks fails. In this case as well, go through the troubleshooting steps below.
Troubleshooting steps:
- Check the describe output and logs of the failed pods and fix them accordingly. Verify what went wrong during the installation of Policy by checking the following points:
For the pods that did not start, run the following command to check the failed pods:
kubectl describe pod <pod-name> -n <release-namespace>
For the pods that started but failed to come into the "READY" state, run the following command to check the failed pods:
kubectl logs <pod-name> -n <release-namespace>
- Run the following command to get the Kubernetes objects. This gives a detailed overview of which objects are stuck or in a failed state:
kubectl get all -n <release_namespace>
- Run the following command to delete all Kubernetes objects:
kubectl delete all --all -n <release_namespace>
- Run the following command to delete all current configmaps:
kubectl delete cm --all -n <release-namespace>
- Run the following statements to clean up the databases created by the helm install command and create the databases again (see the sketch after this procedure for one way to execute them):
DROP DATABASE IF EXISTS occnp_audit_service;
DROP DATABASE IF EXISTS occnp_config_server;
DROP DATABASE IF EXISTS occnp_pcf_am;
DROP DATABASE IF EXISTS occnp_pcf_sm;
DROP DATABASE IF EXISTS occnp_pcrf_core;
DROP DATABASE IF EXISTS occnp_release;
DROP DATABASE IF EXISTS occnp_binding;
DROP DATABASE IF EXISTS occnp_policyds;
DROP DATABASE IF EXISTS occnp_pcf_ue;
DROP DATABASE IF EXISTS occnp_commonconfig;
CREATE DATABASE IF NOT EXISTS occnp_audit_service;
CREATE DATABASE IF NOT EXISTS occnp_config_server;
CREATE DATABASE IF NOT EXISTS occnp_pcf_am;
CREATE DATABASE IF NOT EXISTS occnp_pcf_sm;
CREATE DATABASE IF NOT EXISTS occnp_pcrf_core;
CREATE DATABASE IF NOT EXISTS occnp_release;
CREATE DATABASE IF NOT EXISTS occnp_binding;
CREATE DATABASE IF NOT EXISTS occnp_policyds;
CREATE DATABASE IF NOT EXISTS occnp_pcf_ue;
CREATE DATABASE IF NOT EXISTS occnp_commonconfig;
In addition, clean up the entries in the "mysql.ndb_replication" table by running the following command:
DROP TABLE IF EXISTS mysql.ndb_replication;
- Run the following command to check the release status:
For Helm2: helm ls --all
For Helm3: helm3 ls -n <release-namespace>
If the release is in a failed state, purge the namespace using the following command (Helm2):
helm delete --purge <release_namespace>
Note:
If the command is taking more time, run the following command in another session to clear all the delete jobs:
while true; do kubectl delete jobs --all -n <release_namespace>; sleep 5; done
Once the purge command succeeds, press "Ctrl+C" to stop the above script.
- After the database cleanup and re-creation of the databases, run the helm install command again.
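The following is a minimal sketch of one way to run the database cleanup and re-creation statements shown earlier in this procedure. It assumes access to the MySQL/NDB SQL node with a privileged database user; the host name, user name, and file name are placeholders, not values from this guide.
# Save the DROP/CREATE statements into a file, for example occnp_db_cleanup.sql,
# then run them through the MySQL client (placeholder host and user):
mysql -h <mysql-host> -u <privileged-db-user> -p < occnp_db_cleanup.sql
# Verify that the databases were re-created:
mysql -h <mysql-host> -u <privileged-db-user> -p -e "SHOW DATABASES LIKE 'occnp_%';"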
If the helm install command fails due to the atomic and timeout options
The helm install command fails when the external IP (LoadBalancer) allocation fails for Diameter Gateway, Ingress Gateway, and Configuration Management service, as these services are of type LoadBalancer.
Reason: The primary reason for this problem is limited infrastructure availability, due to which floating IPs may not be available. It may also happen because the system takes more time to assign floating IPs, as a result of which the charts are purged.
Solution: To resolve this issue, either remove the --atomic option from the helm install command or set a higher timeout value.
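For example, a higher timeout can be passed on the command line. The following is an illustrative sketch only, with placeholder release, chart, and namespace names; the default timeout is 5 minutes in both Helm versions.
# Helm3: omit --atomic and raise the timeout so a slow LoadBalancer IP allocation does not purge the charts
helm3 install <release-name> <chart>.tgz --namespace <release-namespace> -f <custom-value-file> --timeout 10m
# Helm2 equivalent (timeout is specified in seconds)
helm install <chart>.tgz --name <release-name> --namespace <release-namespace> -f <custom-value-file> --timeout 600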
4.1.2 Configuration Issue where mysql-username had an Extra Line
Symptom
No suitable driver found for jdbc
Problem
Secret files contain the user ID and password for MySQL. The user ID and password inside the secret file must be base64 encoded. During base64 encoding, if a new line is present in the user ID or password, the new line is also encoded and may cause issues when the values are decoded back.
Resolution Steps
To resolve this issue, perform the following steps:
- Get the secret file created by the customer.
- Fetch the encoded MySQL username and password.
- Go to https://www.base64decode.org/.
- Enter the username and password and click DECODE.
- Verify whether an extra line is present in the username or password. If present, remove the extra line.
- Encode the corrected values again.
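As an alternative to the website, the values can be checked and re-encoded from the command line. This is a minimal sketch with placeholder values; echo -n prevents a trailing newline from being encoded.
# Decode the value taken from the secret and inspect it for a trailing newline character
echo '<encoded-mysql-username>' | base64 -d | od -c | tail -n 2
# Re-encode without a newline (-n suppresses the trailing newline)
echo -n '<mysql-username>' | base64
echo -n '<mysql-password>' | base64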
4.1.3 App Info Worker Time Out
Problem
[CRITICAL] WORKER TIMEOUT
The appinfo process runs an HTTP server (gunicorn) and a few worker processes. A request first reaches the gunicorn master process, and the worker processes then handle it. If a worker does not return within 30 seconds, gunicorn prints the "WORKER TIMEOUT" error and kills the worker. From the log, it appears that the worker processes are stuck somewhere.
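To confirm that the gunicorn master and its workers are actually running inside the appinfo pod, a quick check such as the following can be used. The pod and namespace names are placeholders, and the check assumes the ps utility is available in the container image.
# List the gunicorn master and worker processes inside the appinfo pod
kubectl -n <pcf namespace> exec -it <appinfo pod name> -- ps -ef | grep -i gunicorn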
Resolution Steps
- Change the appinfo deployment to increase the liveness threshold value from 3 to a higher value (see the sketch after this procedure). By doing so, appinfo is not impacted by the readiness check.
- Watch the log of appinfo to check whether the problem still exists.
- If the problem still exists, find out why the worker process is stuck. Run the following command to get into the appinfo pod:
kubectl -n <pcf namespace> exec -it <pod name> -- /bin/bash
- Create a temporary Python file:
cat > xxx_test.py
import pdb
import appinfo
pdb.set_trace()
appinfo.app.run(port=9999)
- Run the following command to run this temporary Python file. It launches a Python debugger; type "continue" to run the app:
python3 xxx_test.py
- Open another terminal and run the following command to get into the appinfo pod:
kubectl -n <pcf namespace> exec -it <pod name> -- /bin/bash
Then, run the following command to check whether this temporary service returns immediately:
curl localhost:9999/v1/readiness
If curl gets stuck, the problem is reproduced. In the Python debugger, press "Ctrl+C" to get a stack trace that indicates where the worker is stuck.
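As referenced in the first step of this procedure, the liveness failure threshold can be raised on the appinfo deployment. This is only a sketch; the deployment name, container index, and threshold value of 10 are assumptions and must be adapted to the actual deployment.
# Raise the liveness probe failureThreshold on the first container of the appinfo deployment (placeholder names)
kubectl -n <pcf namespace> patch deployment <appinfo-deployment-name> --type=json \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 10}]'
# Confirm the new value
kubectl -n <pcf namespace> get deployment <appinfo-deployment-name> \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.failureThreshold}'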
4.1.4 Startup Probes
To increase the application's reliability and availability, startup probes are introduced in Policy. Consider a scenario where the configuration is not loaded, or only partially loaded, but the service still goes into a ready state. This may result in different pods showing different behaviour for the same service. With the introduction of the startup probe, the readiness and liveness checks for a pod are not initiated until the configuration is loaded completely and the startup probe succeeds. However, if the startup probe fails, the container restarts.
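To see how the startup probe is configured for a given pod, the probe definition can be read directly from the pod spec. This is a generic sketch with placeholder names; it assumes the probe is defined on the first container of the pod.
# Print the startupProbe definition of the first container in the pod
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].startupProbe}'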
- Log in to a container and check the startup probe endpoint by running the following commands:
kubectl exec -it <podname> -n <namespace> -- bash
curl -kv http://localhost:<monitoring-port>/<startup-probe-url>
Example:
kubectl exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
The sample output can be as follows:
[cloud-user@bastion-1 ~]$
*   Trying ::1...
* TCP_NODELAY set
* connect to ::1 port 9000 failed: Connection refused
*   Trying 127.0.0.1...
* TCP_NODELAY set
* connect to 127.0.0.1 port 9000 failed: Connection refused
* Failed to connect to localhost port 9000: Connection refused
* Closing connection 0
curl: (7) Failed to connect to localhost port 9000: Connection refused
command terminated with exit code 7
[cloud-user@bastion-1 ~]$ k exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< Date: Thu, 21 Apr 2022 11:18:03 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"DOWN"}
[cloud-user@bastion-1 ~]$ k exec -it test-pcrf-core-797cf5997-2zlgf -- curl -kv http://localhost:9000/actuator/health/startup
*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
> GET /actuator/health/startup HTTP/1.1
> Host: localhost:9000
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Thu, 21 Apr 2022 11:18:04 GMT
< Content-Type: application/json;charset=utf-8
< Transfer-Encoding: chunked
< Server: Jetty(9.4.43.v20210629)
<
* Connection #0 to host localhost left intact
{"status":"UP"}
[cloud-user@bastion-1 ~]$
- To check why the startup probe failed, check the describe output of the pod:
Describe output:
Warning  Unhealthy  <invalid> (x10 over 2m45s)  kubelet  Startup probe failed: Get "http://10.233.81.231:9000/actuator/health/startup": dial tcp 10.233.81.231:9000: connect: connection refused
The following could be the possible reasons for a startup probe failure:
- Network connectivity issue
- Database connection issue due to which the server does not come up
- Any other exception
- If the reason for the startup probe failure is not clear, check the logs to determine whether it is due to an issue with the config-server connection or with fetching configurations from the config-server (see the sketch below).
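As referenced in the previous step, a quick way to scan the pod logs for config-server related errors is shown below. The search terms are only examples and can be adjusted; the pod and namespace names are placeholders.
# Scan the pod logs for configuration or config-server related errors
kubectl logs <pod-name> -n <namespace> | grep -i -E 'config-server|configuration|exception|refused'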
4.1.5 Monitoring of Diameter Gateway Worker Node Failure
Symptom
When a Diameter Gateway worker node fails, new replicas are not created on a different worker node.
Problem
On the Diameter Gateway, when the worker node is shut down, the pod is set to the "Terminating" state. The Diameter Gateway pods are StatefulSet pods, due to which new pods are not created until the original pod dies, whereas in a similar scenario ReplicaSet pods are spun up on new worker nodes. The stuck pod has to be force killed using the --force option, as shown in the sketch below.
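The following is a minimal sketch of forcing the deletion of a Diameter Gateway pod stuck in the "Terminating" state so that the StatefulSet can reschedule it. The pod and namespace names are placeholders; forcing deletion bypasses graceful shutdown, so use it only when the worker node is already down.
# Force delete the stuck Diameter Gateway pod
kubectl delete pod <diam-gateway-pod-name> -n <release-namespace> --force --grace-period=0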
Resolution
For Diameter Gateway, set terminationGracePeriodSeconds to 0s. This is done by configuring the occnp-custom-values.yaml file:
diam-gateway:
  # Graceful Termination
  gracefulShutdown:
    gracePeriod: 0s
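After updating occnp-custom-values.yaml, the change has to be applied to the running release. A minimal sketch, assuming a Helm3 deployment with placeholder release and chart names:
# Apply the updated graceful-termination setting to the running release
helm3 upgrade <release-name> <occnp-chart>.tgz -n <release-namespace> -f occnp-custom-values.yaml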
Create an alert that is triggered when a node is down. Modify the oid and name as per the customer deployment, if needed.
name: NODE_UNAVAILABLE
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 30s
labels:
  oid: XXXXXX
  severity: critical
annotations:
  description: Kubernetes node {{ $labels.node }} is not in Ready state
  summary: Kubernetes node {{ $labels.node }} is unavailable
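To verify the alert expression before relying on it, node readiness can be checked with kubectl, and the same PromQL expression can be evaluated against Prometheus. The Prometheus host and port are placeholders and depend on the deployment.
# Check node readiness directly
kubectl get nodes
# Evaluate the alert expression against the Prometheus query API
curl -s 'http://<prometheus-host>:<prometheus-port>/api/v1/query' \
  --data-urlencode 'query=kube_node_status_condition{condition="Ready",status="true"} == 0'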