4 Fault Recovery
This chapter describes the fault recovery procedures for various failure scenarios.
Kubernetes Cluster
This section describes the fault recovery procedures for various failure scenarios in a Kubernetes Cluster.
Recovering a Failed Bastion Host
This section describes the procedure to replace a failed Bastion Host.
- You must have login access to a Bastion Host.
Note:
This procedure is applicable for a single Bastion Host failure only.
- Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then it must have successfully become an Active Bastion Host as per the Bastion HA Feature, within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
$ is_active_bastion
Sample output:
IS active-bastion
- Use SSH to log in to a working Bastion Host and run the following commands to deploy a new Bastion Host. Replace <new bastion node> in the following command with the name of the new Bastion Host that is deployed.
$ cd /var/occne/cluster/$OCCNE_CLUSTER/
$ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh
- Update the Operating System (OS) to avoid certificate mismatch.
To perform an OS update, follow the instructions specific to updating the OS
in the Upgrading CNE section.
Note:
- Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
- Ensure that you update only the Operating System and do not set up OCCNE_NEW_VERSION.
- Reinstall the bastion controller to take the new internal IP into the configuration file:
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
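After the bastion controller is redeployed, you can optionally run a quick sanity check from the new Bastion Host. The following is a minimal sketch that uses the is_active_bastion command shown above and a REPO_READY check inferred from the bastion controller logs shown later in this chapter; the internal IP is a placeholder.
# Confirm whether this Bastion Host is currently the active one.
$ is_active_bastion
# Confirm that the new Bastion Host serves the repository readiness flag (an HTTP 200 response is expected once the repository is populated).
$ curl -k -s -o /dev/null -w '%{http_code}\n' https://<new bastion internal IP>/REPO_READY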
- Log in to OpenStack cloud using your credentials.
- From the Compute menu, select Instances, and locate the failed Bastion's instance that you want to replace.
- Click the drop-down list from the Actions column, and select Delete Instance to delete the failed Bastion Host.

- Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then it must have successfully become an Active Bastion Host as per the Bastion HA Feature, within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
$ is_active_bastion
Sample output:
IS active-bastion
- Use SSH to log in to a working Bastion Host and run the following commands to create a new Bastion Host. Replace <new bastion node> in the following command with the name of the new Bastion Host that is deployed.
$ cd /var/occne/cluster/$OCCNE_CLUSTER/
$ source openrc.sh
$ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
$ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh
- Update the Operating System (OS) to avoid certificate mismatch.
To perform an OS update, follow the instructions specific to updating the OS
in the Upgrading CNE section.
Note:
- Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
- Ensure that you update only the Operating System and do not set up OCCNE_NEW_VERSION.
- Reinstall the bastion controller to take the new internal IP into the configuration file:
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
- Log in to VMware cloud using your credentials.
- From the Compute menu, select Virtual Machines, and locate the failed Bastion's VM that you want to replace.
- From the Actions menu, select Power → Power Off.
- From the Actions menu, select Delete to delete the failed Bastion Host.

- Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then it must have successfully become an Active Bastion Host as per the Bastion HA Feature, within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
$ is_active_bastion
Sample output:
IS active-bastion
- Use SSH to log in to a working Bastion Host and run the following commands to deploy a new Bastion Host. Replace <new bastion node> in the following command with the name of the new Bastion Host that is deployed.
$ cd /var/occne/cluster/$OCCNE_CLUSTER/
$ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
$ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh
- Update the Operating System (OS) to avoid certificate mismatch.
To perform an OS update, follow the instructions specific to updating the OS
in the Upgrading CNE section.
Note:
- Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
- Ensure that you update only the Operating System and do not set up OCCNE_NEW_VERSION.
- Reinstall the bastion controller to take the new internal IP into the configuration file:
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
Validating Bastion Host Recovery
Monitor the occne-bastion-controller logs and observe the change in the behavior of the cluster. Run the following command to fetch the occne-bastion-controller logs:
$ kco logs deploy/occne-bastion-controller -f
Sample output:
2024-07-18 06:28:45,003 7f79c6131740 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:45,101 7f79c6131740 MONITOR:INFO: Initializing DB /data/db/ctrlData.db [/modules/monData.py:132]
2024-07-18 06:28:45,102 7f79c6131740 MONITOR:INFO: Creating DB directory /data/db [/modules/monData.py:86]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: Bastion Monitor db created [/modules/monData.py:123]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: DB file /data/db/ctrlData.db initialized [/modules/monData.py:140]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: Starting monitor and app http server [/modules/startControllerProc.py:17]
2024-07-18 06:28:47,203 7f5fa9c69740 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:47,211 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
2024-07-18 06:28:47,303 7f5fa8a7d640 MONITOR:INFO: Detecting Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' to begin monitoring [/modules/monitor.py:98]
2024-07-18 06:28:47,303 7f5fa3fff640 MONITOR:INFO: Detecting Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' to begin monitoring [/modules/monitor.py:98]
[2024-07-18 06:28:47 +0000] [8] [INFO] Starting gunicorn 22.0.0
[2024-07-18 06:28:47 +0000] [8] [INFO] Listening at: http://0.0.0.0:8000 (8)
[2024-07-18 06:28:47 +0000] [8] [INFO] Using worker: gevent
[2024-07-18 06:28:47 +0000] [11] [INFO] Booting worker with pid: 11
2024-07-18 06:28:48,305 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-18 06:28:48,401 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-18 06:28:49,805 7f6907eb7800 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:58,909 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-18 06:28:58,910 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 0 retries [/modules/monitor.py:162]
2024-07-18 06:28:58,915 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-18 06:29:09,511 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-18 06:29:09,512 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-18 06:29:09,514 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 1 retries [/modules/monitor.py:162]
2024-07-18 06:29:20,012 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 2 retries [/modules/monitor.py:162]
2024-07-18 06:29:32,822 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
2024-07-18 06:29:32,902 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
[2024-07-18 14:48:53 +0000] [11] [INFO] Autorestarting worker after current request.
[2024-07-18 14:48:54 +0000] [11] [INFO] Worker exiting (pid: 11)
[2024-07-18 14:48:54 +0000] [22738] [INFO] Booting worker with pid: 22738
2024-07-18 14:48:56,419 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 19:37:48,413 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /yum/OracleLinux/OL9/ol9_appstream/repodata/repomd.xml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8a96fd0>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=0 [/modules/monitor.py:125]
2024-07-18 19:38:03,432 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8250040>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=1 [/modules/monitor.py:125]
2024-07-18 19:38:03,447 7f5fa8a7d640 MONITOR:INFO: Switched Active bastion from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:155]
2024-07-18 19:38:13,459 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:173]
2024-07-18 19:38:18,469 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa82503a0>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=2 [/modules/monitor.py:125]
2024-07-18 19:38:46,598 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8243100>: Failed to establish a new connection: [Errno 113] No route to host')): retries=4 [/modules/monitor.py:125]
2024-07-18 19:39:39,078 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8250400>: Failed to establish a new connection: [Errno 113] No route to host')): retries=8 [/modules/monitor.py:125]
2024-07-18 19:41:24,038 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8243d60>: Failed to establish a new connection: [Errno 113] No route to host')): retries=16 [/modules/monitor.py:125]
2024-07-18 19:44:53,958 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8a8be50>: Failed to establish a new connection: [Errno 113] No route to host')): retries=32 [/modules/monitor.py:125]
2024-07-18 19:51:53,989 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8260040>: Failed to establish a new connection: [Errno 113] No route to host')): retries=64 [/modules/monitor.py:125]
2024-07-18 20:04:26,519 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=128 [/modules/monitor.py:125]
2024-07-18 20:23:53,620 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
[2024-07-18 23:08:53 +0000] [22738] [INFO] Autorestarting worker after current request.
[2024-07-18 23:08:54 +0000] [22738] [INFO] Worker exiting (pid: 22738)
[2024-07-18 23:08:54 +0000] [44780] [INFO] Booting worker with pid: 44780
2024-07-18 23:08:56,406 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-19 00:01:00,707 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
[2024-07-19 07:28:53 +0000] [44780] [INFO] Autorestarting worker after current request.
[2024-07-19 07:28:54 +0000] [44780] [INFO] Worker exiting (pid: 44780)
[2024-07-19 07:28:55 +0000] [67513] [INFO] Booting worker with pid: 67513
2024-07-19 07:28:57,414 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-19 15:48:53 +0000] [67513] [INFO] Autorestarting worker after current request.
[2024-07-19 15:48:54 +0000] [67513] [INFO] Worker exiting (pid: 67513)
[2024-07-19 15:48:54 +0000] [90200] [INFO] Booting worker with pid: 90200
2024-07-19 15:48:59,803 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 00:08:53 +0000] [90200] [INFO] Autorestarting worker after current request.
[2024-07-20 00:08:54 +0000] [90200] [INFO] Worker exiting (pid: 90200)
[2024-07-20 00:08:54 +0000] [112911] [INFO] Booting worker with pid: 112911
2024-07-20 00:08:56,314 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 08:28:53 +0000] [112911] [INFO] Autorestarting worker after current request.
[2024-07-20 08:28:53 +0000] [112911] [INFO] Worker exiting (pid: 112911)
[2024-07-20 08:28:54 +0000] [135612] [INFO] Booting worker with pid: 135612
2024-07-20 08:28:56,309 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 16:48:53 +0000] [135612] [INFO] Autorestarting worker after current request.
[2024-07-20 16:48:54 +0000] [135612] [INFO] Worker exiting (pid: 135612)
[2024-07-20 16:48:54 +0000] [158319] [INFO] Booting worker with pid: 158319
2024-07-20 16:48:56,514 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 01:08:53 +0000] [158319] [INFO] Autorestarting worker after current request.
[2024-07-21 01:08:54 +0000] [158319] [INFO] Worker exiting (pid: 158319)
[2024-07-21 01:08:54 +0000] [181019] [INFO] Booting worker with pid: 181019
2024-07-21 01:08:56,902 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 09:28:53 +0000] [181019] [INFO] Autorestarting worker after current request.
[2024-07-21 09:28:53 +0000] [181019] [INFO] Worker exiting (pid: 181019)
[2024-07-21 09:28:54 +0000] [203715] [INFO] Booting worker with pid: 203715
2024-07-21 09:28:56,216 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 17:48:53 +0000] [203715] [INFO] Autorestarting worker after current request.
[2024-07-21 17:48:54 +0000] [203715] [INFO] Worker exiting (pid: 203715)
[2024-07-21 17:48:54 +0000] [226420] [INFO] Booting worker with pid: 226420
2024-07-21 17:48:56,717 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 02:08:53 +0000] [226420] [INFO] Autorestarting worker after current request.
[2024-07-22 02:08:54 +0000] [226420] [INFO] Worker exiting (pid: 226420)
[2024-07-22 02:08:55 +0000] [249119] [INFO] Booting worker with pid: 249119
2024-07-22 02:08:58,114 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 10:28:53 +0000] [249119] [INFO] Autorestarting worker after current request.
[2024-07-22 10:28:54 +0000] [249119] [INFO] Worker exiting (pid: 249119)
[2024-07-22 10:28:54 +0000] [271794] [INFO] Booting worker with pid: 271794
2024-07-22 10:28:56,918 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 18:48:53 +0000] [271794] [INFO] Autorestarting worker after current request.
[2024-07-22 18:48:54 +0000] [271794] [INFO] Worker exiting (pid: 271794)
[2024-07-22 18:48:54 +0000] [294486] [INFO] Booting worker with pid: 294486
2024-07-22 18:48:56,513 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-22 21:21:19,625 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:173]
2024-07-22 21:22:30,328 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8277340>, 'Connection to 192.168.1.53 timed out. (connect timeout=5)')): retries=0 [/modules/monitor.py:125]
2024-07-22 21:22:45,350 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8204850>, 'Connection to 192.168.1.53 timed out. (connect timeout=5)')): retries=1 [/modules/monitor.py:125]
2024-07-22 21:22:45,370 7f5fa3fff640 MONITOR:INFO: Switched Active bastion from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:155]
2024-07-22 21:22:55,376 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
2024-07-22 21:22:58,566 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa82775b0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=2 [/modules/monitor.py:125]
2024-07-22 21:23:24,805 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8208730>: Failed to establish a new connection: [Errno 113] No route to host')): retries=4 [/modules/monitor.py:125]
2024-07-22 21:24:17,286 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8277160>: Failed to establish a new connection: [Errno 113] No route to host')): retries=8 [/modules/monitor.py:125]
2024-07-22 21:26:02,566 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8250d30>: Failed to establish a new connection: [Errno 113] No route to host')): retries=16 [/modules/monitor.py:125]
2024-07-22 21:29:32,486 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa827bac0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=32 [/modules/monitor.py:125]
2024-07-22 21:36:33,222 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8a96af0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=64 [/modules/monitor.py:125]
2024-07-22 21:50:34,502 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa827bbe0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=128 [/modules/monitor.py:125]
2024-07-22 21:52:03,240 7f5fa8a7d640 MONITOR:INFO: Detecting Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' to begin monitoring [/modules/monitor.py:98]
2024-07-22 21:52:03,705 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-22 21:52:15,322 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-22 21:52:27,117 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-22 21:52:50,901 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=4 [/modules/monitor.py:125]
2024-07-22 21:53:38,012 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=8 [/modules/monitor.py:125]
2024-07-22 21:55:13,814 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=16 [/modules/monitor.py:125]
2024-07-22 21:58:22,008 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=32 [/modules/monitor.py:125]
2024-07-22 21:58:47,310 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
When the pipeline.sh script is run, you will observe a lot of Ansible lines as part of the progress of provisioning and configuring the new node. The following code block provides a sample output of this scenario:
Monday 22 July 2024 21:58:42 +0000 (0:00:00.681) 0:00:11.112 ***********
===============================================================================
repo_backup : Ensure proper ownership of /var/occne directory ------------------------ 3.65s
Gathering Facts ---------------------------------------------------------------------- 1.18s
repo_backup : Copy OCCNE var dir to target bastion, may take 10s of minutes if repository is populated (registry, clusters, admin.conf, etc) --- 1.02s
Gathering Facts ---------------------------------------------------------------------- 0.79s
Install bash completion -------------------------------------------------------------- 0.70s
update_banner : Restart sshd after Banner update ------------------------------------- 0.68s
repo_backup : Ensure proper ownership of /var/www/html/occne directory --------------- 0.64s
repo_backup : Copy OCCNE html dir to target bastion (pxe, downloads, helm, etc) ------ 0.52s
update_banner : Restore the original Banner setting ---------------------------------- 0.37s
repo_backup : Update /var/occne/cluster/occne2-j-jorge-l-lopez/occne.ini repo lines to point to 'this' repo host --- 0.35s
Flag HTTP repo as ready -------------------------------------------------------------- 0.32s
update_banner : Assert that OCCNE banner is absent in sshd_config -------------------- 0.23s
identify : Determine set of occne kvm guest nodes hosted by this node ---------------- 0.11s
identify : set_fact ------------------------------------------------------------------ 0.06s
identify : Flag localhost_self identity ---------------------------------------------- 0.05s
identify : create localhost_kvm_host_group group ------------------------------------- 0.05s
identify : Create localhost_self_group group ----------------------------------------- 0.05s
identify : set occne_user ------------------------------------------------------------ 0.04s
identify : Dump self-identity variables ---------------------------------------------- 0.04s
identify : Determine set of k8s guest nodes hosted by this node ---------------------- 0.03s
WARN[0012] image used by 0cf93ed451350a2e0c1173956b0fc903f9d887a2fef898d052295054cd2cf6b6: image is in use by a container: consider listing external containers and force-removing image
+ sudo podman image rm -fi winterfell:5000/occne/provision:24.2.0
Untagged: winterfell:5000/occne/provision:24.2.0
Deleted: 2b96b90fc39f54920518de792deec48d7907d81219ac5f519ec2c78ba99e5c99
Skipping: PROV REMOVE
-POST Post Processing Finished
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mon Jul 22 05:58:50 PM EDT 2024
artifacts/pipeline.sh completed successfully
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
After monitoring the logs, run the is_active_bastion and get_other_bastions commands to verify the existing Bastion Hosts included in the cluster.
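For example, a verification pass from the active Bastion Host might look like the following sketch. The output of get_other_bastions depends on your cluster; the hostname shown here is illustrative.
# Confirm that this Bastion Host is currently the active one.
$ is_active_bastion
IS active-bastion
# List the other Bastion Hosts known to the cluster and confirm that the recovered host appears.
$ get_other_bastions
occne-example-bastion-2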
Recovering a Failed Kubernetes Controller Node
This section describes the procedure to recover a single failed Kubernetes controller node in vCNE deployments.
Prerequisites
- You must have login access to a Bastion Host.
- You must have login access to the cloud GUI.
Note:
- This procedure is applicable for vCNE (OpenStack and VMware) deployments only. CNE doesn't support recovering a failed Kubernetes controller node in BareMetal deployments.
- This procedure is applicable for replacing a single Kubernetes controller node only.
- Control Node 1 (member of etcd1) requires specific steps to be performed. Be mindful when you are replacing this node.
- If you are using a CNLB based vCNE deployment, then ensure that you are aware of the following pointers:
- A minimum of three control nodes (control plane and etcd hosts) are required to maintain the cluster's high availability and responsiveness. However, the cluster can still operate with an even number of control nodes, though it is not recommended for a long period of time.
- Some maintenance procedures, such as the CNE standard upgrade and cluster update procedures, are not supported after removing a control node from a cluster with an even number of controller nodes. In such cases, you must add a new node before performing the procedures.
- If you are replacing the etcd1 control node, then this procedure makes etcd2 the new etcd1 by default. For example, the following code block shows the controller nodes before replacing the etcd1 control node:
etcd1: occne1-test-k8s-ctrl-1
etcd2: occne1-test-k8s-ctrl-2
etcd3: occne1-test-k8s-ctrl-3
The following code block shows the controller nodes after replacing the etcd1 control node:
etcd1: occne1-test-k8s-ctrl-2
etcd2: occne1-test-k8s-ctrl-3
etcd3: occne1-test-k8s-ctrl-1
Therefore, ensure that all IPs related to the etcd1 control plane IP are replaced with the etcd2 working controller node IPs as indicated in the applicable steps in the procedure.
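Before starting either of the following procedures, it can help to capture the current mapping between etcd member names, controller hostnames, and internal IPs, because the rotation described above changes that mapping. The following is a minimal sketch that reuses commands appearing later in this section; the endpoint IP is a placeholder.
# From a Bastion Host: record the controller node names and internal IPs.
$ kubectl get nodes -o wide | grep ctrl
# From a working controller node (as root, after sourcing /etc/etcd.env): record which member is etcd1, etcd2, and etcd3.
$ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list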
Recovering a Failed Kubernetes Controller Node in an OpenStack Deployment
This section describes the procedure to recover a failed Kubernetes controller node in an OpenStack deployment.
- Use SSH to log in to Bastion Host and remove the failed Kubernetes
controller node by following the procedure described in the Removing a Controller Node in OpenStack Deployment section.
Take a note of the internal IP addresses of all the controller nodes and the etcd member number (etcd1, etcd2 or etcd3) of the failed controller node. Also take note of the IPs and hostnames of the other working Control nodes.
- Use the original terraform file to create a new controller node VM:
Note:
To revert the changes, perform this step only if the failed control node was a member of etcd1.
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
$ mv terraform.tfstate /tmp
$ cp ${OCCNE_CLUSTER}/terraform.tfstate.backup terraform.tfstate
- Depending on the type of Load Balancer used, run one of the following steps to create a new Controller Node Instance within the cloud:
- Run the following Terraform commands if you are using an LBVM based
OpenStack deployment:
Note:
Before running the terraform apply command in the following code block, ensure that no other resources are currently being modified. If you find other resources not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
$ source openrc.sh
$ terraform plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
$ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
- Run the following Tofu commands if you are using a CNLB based OpenStack deployment.
Note:
- Use the tofu plan command to verify that no resources other than the controller node that is currently being replaced are affected.
- Before running the tofu apply command in the following code block, ensure that no other resources are currently being modified. If you find other resources not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
$ source openrc.sh
$ tofu plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
$ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
- Switch the terraform file.
Note:
Perform this step only if the failed control node was a member of etcd1.
$ cd /var/occne/cluster/$OCCNE_CLUSTER
$ python3 scripts/switchTfstate.py
For example:
[cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
Sample output:
Beginning tfstate switch order k8s control nodes
terraform.tfstate.lastversion created as backup
Controller Nodes order before rotation: occne7-test-k8s-ctrl-1 occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3
Controller Nodes order after rotation: occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 occne7-test-k8s-ctrl-1
Success: terraform.tfstate rotated for cluster occne7-test
- Edit the failed kube_control_plane IP with the IP of the working control node.
Note:
- Perform this step only if the failed control node was a member of etcd1.
- If you are using a CNLB based deployment and if etcd1 is replaced, then update the IP with the previous IP of etcd2.
$ kubectl edit cm -n kube-public cluster-info
Sample output:
.
.
server: https://<working control node IP address>:6443
.
.
- Log in to OpenStack GUI using your credentials and note the replaced
node's internal IP address and hostname. In most of the cases, the new IP
address and hostname remains the same as the ones before deletion. The new IP
address and hostname are referred to as
replaced_node_ip and replaced_node_hostname in the remaining procedure.
- Run the following command from the Bastion Host to configure the replaced control node OS:
OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=<replaced_node_hostname> artifacts/pipeline.sh
For example:
OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=occne7-test-k8s-ctrl-1 artifacts/pipeline.sh
- Update the /etc/hosts file in the Bastion Host with the replaced_node_ip and replaced_node_hostname. Ensure that there are two matching entries, that is, <replaced_node_hostname> must match with the <replaced_node_ip>.
$ sudo vi /etc/hosts
Sample output:
192.168.202.232 occne7-test-k8s-ctrl-1.novalocal occne7-test-k8s-ctrl-1
192.168.202.232 lb-apiserver.kubernetes.local
- Use SSH to log in to each controller node in the cluster, except
the controller node that is newly created, and run the following commands as a
root user to update
replaced_node_ip in the kube-apiserver.yaml, kubeadm-config.yaml, and hosts files:
Note:
- Ensure that you don't delete or change any other IPs related to the running control nodes.
- The following commands replace Node1. If you are replacing other nodes, the order of IPs may change. This is also applicable to the later steps in this procedure.
- kube-apiserver.yaml:
$ vi /etc/kubernetes/manifests/kube-apiserver.yaml
Sample output:
- --etcd-servers=https://192.168.202.232:2379,https://192.168.203.194:2379,https://192.168.200.115:2379
- kubeadm-config.yaml:
$ vi /etc/kubernetes/kubeadm-config.yaml
Sample output:
etcd:
  external:
    endpoints:
    - https://<replaced_node_ip>:2379
    - https://192.168.203.194:2379
    - https://192.168.200.115:2379
------------------------------------
certSANs:
- kubernetes
- kubernetes.default
- kubernetes.default.svc
- kubernetes.default.svc.occne7-test
- 10.233.0.1
- localhost
- 127.0.0.1
- occne7-test-k8s-ctrl-1
- occne7-test-k8s-ctrl-2
- occne7-test-k8s-ctrl-3
- lb-apiserver.kubernetes.local
- <replaced_node_ip>
- 192.168.203.194
- 192.168.200.115
- localhost.localdomain
timeoutForControlPlane: 5m0s
- hosts:
$ vi /etc/hosts
Sample output:
<replaced_node_ip> occne7-test-k8s-ctrl-1.novalocal occne7-test-k8s-ctrl-1
- Run the following commands in a Bastion Host to update all
instances of the
<replaced_node_ip>. If the failed controller node was a member of etcd1, then update the controlPlaneEndpoint value with the IP address of the working controller node (that is, from ctrl-1 to ctrl-2):
$ kubectl edit configmap kubeadm-config -n kube-system
Sample output:
apiServer:
  certSANs:
  - kubernetes
  - kubernetes.default
  - kubernetes.default.svc
  - kubernetes.default.svc.occne7-test
  - 10.233.0.1
  - localhost
  - 127.0.0.1
  - occne7-test-k8s-ctrl-1
  - occne7-test-k8s-ctrl-2
  - occne7-test-k8s-ctrl-3
  - lb-apiserver.kubernetes.local
  - <replaced_node_ip>
  - 192.168.203.194
  - 192.168.200.115
  - localhost.localdomain
----------------------------------------------------
# If etcd1 was replaced, please update IP with previous etcd2 IP.
controlPlaneEndpoint: <working_node_ip>:6443 #Only update if was part of etcd1
----------------------------------------------------
etcd:
  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-occne7-test-k8s-ctrl-1.pem
    endpoints:
    - https://<replaced_node_ip>:2379
    - https://192.168.203.194:2379
    - https://192.168.200.115:2379
- Run the
cluster.yml playbook from Bastion Host 1 to add the new controller node into the cluster:
Note:
Replace <occne password> with the CNE password and <central repo> with the name of your central repository.
$ podman run -it --rm --rmi --network host --name DEPLOY_$OCCNE_CLUSTER -v /var/occne/cluster/$OCCNE_CLUSTER:/host -v /var/occne:/var/occne:rw -e OCCNE_vCNE=openstack -e OCCNEINV=/host/hosts -e 'PLAYBOOK=/kubespray/cluster.yml' -e 'OCCNEARGS=--extra-vars={"occne_userpw":"<occne password>"} --extra-vars=occne_hostname=$OCCNE_CLUSTER-bastion-1 -i /host/occne.ini' <central repo>:5000/occne/k8s_install:$OCCNE_VERSION bash
$ set -e
$ /copyHosts.sh ${OCCNEINV}
$ ansible-playbook -i /kubespray/inventory/occne/hosts --become --private-key /host/.ssh/occne_id_rsa /kubespray/cluster.yml ${OCCNEARGS}
$ exit
- Verify if the new controller node is added to the cluster using the following command:
$ kubectl get node
Sample output:
NAME                     STATUS   ROLES                  AGE     VERSION
occne7-test-k8s-ctrl-1   Ready    control-plane,master   30m     v1.22.5
occne7-test-k8s-ctrl-2   Ready    control-plane,master   2d19h   v1.22.5
occne7-test-k8s-ctrl-3   Ready    control-plane,master   2d19h   v1.22.5
occne7-test-k8s-node-1   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-2   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-3   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-4   Ready    <none>                 2d19h   v1.22.5
- Perform the following steps to validate the addition to the etcd cluster using
etcdctl:
- Use SSH to log in to a working control node:
$ ssh <working control node hostname>
For example:
$ ssh occne7-test-k8s-ctrl-2
- Switch to the root user:
$ sudo su
For example:
[cloud-user@occne7-test-k8s-ctrl-2]# sudo su
- Source /etc/etcd.env:
$ source /etc/etcd.env
For example:
[root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
- Run the following command to list the etcd members:
$ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
For example:
[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
Sample output:
52513ddd2aa49770, started, etcd1, https://192.168.202.232:2380, https://192.168.201.158:2379, false
f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
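As an optional cross-check after the member list looks correct, you can also query the health of all three etcd endpoints from the same root shell on the working control node. This is a sketch that reuses the certificate variables sourced from /etc/etcd.env; substitute your actual controller IPs.
# Check that every etcd endpoint answers health probes (the IPs below are the sample values used above).
$ /usr/local/bin/etcdctl --endpoints https://192.168.202.232:2379,https://192.168.203.194:2379,https://192.168.200.115:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE endpoint health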
Recovering a Failed Kubernetes Controller Node in a VMware Deployment
This section describes the procedure to recover a failed Kubernetes controller node in a VMware deployment.
- Use SSH to log in to Bastion Host and remove the failed Kubernetes
controller node by following the procedure described in the Removing a Controller Node in VMware Deployment section.
Note the internal IP addresses of all the controller nodes and the etcd member number (etcd1, etcd2 or etcd3) of the failed controller node. Also take note of the IPs and hostnames of the other working Control nodes.
- Use the original terraform file to create a new controller node VM:
Note:
To revert the switch changes, perform this step only if the failed control node was a member of etcd1.
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
$ mv terraform.tfstate /tmp
$ cp ${OCCNE_CLUSTER}/terraform.tfstate.backup terraform.tfstate
- Depending on the type of Load Balancer used, run one of the following steps to create a new Controller Node Instance within the cloud:
- Run the following Terraform commands if you are using an LBVM based
VMware deployment:
Note:
Before running the terraform apply command in the following code block, ensure that no other resources are currently being modified. If you find other resources not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
$ source openrc.sh
$ terraform plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
$ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
- Run the following Tofu commands if you are using a CNLB based VMware deployment.
Note:
Before running the tofu apply command in the following code block, ensure that no other resources are currently being modified. If you find other resources not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
$ cd /var/occne/cluster/$OCCNE_CLUSTER/
$ tofu plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
$ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
- Switch the terraform file.
Note:
Perform this step only if the failed control node was a member of etcd1.
$ cd /var/occne/cluster/$OCCNE_CLUSTER
$ cp terraform.tfstate terraform.tfstate.original
$ python3 scripts/switchTfstate.py
For example:
[cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
Sample output:
Beginning tfstate switch order k8s control nodes
terraform.tfstate.lastversion created as backup
Controller Nodes order before rotation: occne7-test-k8s-ctrl-1 occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3
Controller Nodes order after rotation: occne7-test-k8s-ctrl-2 occne7-test-k8s-ctrl-3 occne7-test-k8s-ctrl-1
Success: terraform.tfstate rotated for cluster occne7-test
- Edit the current failed kube_control_plane IP with the IP of the working
control node.
Note:
- Perform this step only if the failed control node was a member of etcd1.
- If you are using a CNLB based deployment and if etcd1 is replaced, then update the IP with the previous IP of etcd2.
$ kubectl edit cm -n kube-public cluster-info
Sample output:
.
.
server: https://<working control node IP address>:6443
.
.
- Log in to VMware GUI using your credentials and note the replaced
node's internal IP address and hostname. In most of the cases, the new IP
address and hostname remains the same as the ones before deletion. The new IP
address and hostname are referred to as
replaced_node_ip and replaced_node_hostname in the remaining procedure.
- Run the following command from the Bastion Host to configure the replaced control node OS:
OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=<replaced_node_hostname> artifacts/pipeline.sh
For example:
OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=occne7-test-k8s-ctrl-1 artifacts/pipeline.sh
- Update the /etc/hosts file in the Bastion Host with the replaced_node_ip and replaced_node_hostname. Make sure there are two matching entries.
$ sudo vi /etc/hosts
Sample output:
192.168.202.232 occne7-test-k8s-ctrl-1.novalocal occne7-test-k8s-ctrl-1
192.168.202.232 lb-apiserver.kubernetes.local
- Use SSH to log in to each controller node in the cluster, except
the controller node that is newly created, and run the following commands as a
root user to update the
replaced_node_ip in the kube-apiserver.yaml, kubeadm-config.yaml, and hosts files:
Note:
- Ensure that you don't delete or change any other IPs related to the running control nodes.
- The following commands replace Node1. If you are replacing other nodes, the order of IPs may change. This is also applicable to the later steps in this procedure.
- kube-apiserver.yaml:
$ vi /etc/kubernetes/manifests/kube-apiserver.yaml
Sample output:
- --etcd-servers=https://192.168.202.232:2379,https://192.168.203.194:2379,https://192.168.200.115:2379
- kubeadm-config.yaml:
$ vi /etc/kubernetes/kubeadm-config.yaml
Sample output:
etcd:
  external:
    endpoints:
    - https://<replaced_node_ip>:2379
    - https://192.168.203.194:2379
    - https://192.168.200.115:2379
------------------------------------
certSANs:
- kubernetes
- kubernetes.default
- kubernetes.default.svc
- kubernetes.default.svc.occne7-test
- 10.233.0.1
- localhost
- 127.0.0.1
- occne7-test-k8s-ctrl-1
- occne7-test-k8s-ctrl-2
- occne7-test-k8s-ctrl-3
- lb-apiserver.kubernetes.local
- <replaced_node_ip>
- 192.168.203.194
- 192.168.200.115
- localhost.localdomain
timeoutForControlPlane: 5m0s
- hosts:
$ vi /etc/hosts
Sample output:
<replaced_node_ip> occne7-test-k8s-ctrl-1.novalocal occne7-test-k8s-ctrl-1
- Run the following commands in a Bastion Host to update all
instances of
<replaced_node_ip>. If the failed controller node was a member of etcd1, then update the controlPlaneEndpoint value with the IP address of the working controller node (that is, from ctrl-1 to ctrl-2):
$ kubectl edit configmap kubeadm-config -n kube-system
Sample output:
apiServer:
  certSANs:
  - kubernetes
  - kubernetes.default
  - kubernetes.default.svc
  - kubernetes.default.svc.occne7-test
  - 10.233.0.1
  - localhost
  - 127.0.0.1
  - occne7-test-k8s-ctrl-1
  - occne7-test-k8s-ctrl-2
  - occne7-test-k8s-ctrl-3
  - lb-apiserver.kubernetes.local
  - <replaced_node_ip>
  - 192.168.203.194
  - 192.168.200.115
  - localhost.localdomain
----------------------------------------------------
# If etcd1 was replaced, please update IP with previous etcd2 IP.
controlPlaneEndpoint: <working_node_ip>:6443 #Only update if was part of etcd1
----------------------------------------------------
etcd:
  external:
    caFile: /etc/ssl/etcd/ssl/ca.pem
    certFile: /etc/ssl/etcd/ssl/node-occne7-test-k8s-ctrl-1.pem
    endpoints:
    - https://<replaced_node_ip>:2379
    - https://192.168.203.194:2379
    - https://192.168.200.115:2379
- Run the
cluster.yml playbook from Bastion Host 1 to add the new controller node into the cluster:
Note:
Replace <occne password> with the CNE password and <central repo> with the name of your central repository.
$ podman run -it --rm --rmi --network host --name DEPLOY_$OCCNE_CLUSTER -v /var/occne/cluster/$OCCNE_CLUSTER:/host -v /var/occne:/var/occne:rw -e OCCNE_vCNE=vcd -e OCCNEINV=/host/hosts -e 'PLAYBOOK=/kubespray/cluster.yml' -e 'OCCNEARGS=--extra-vars={"occne_userpw":"<occne password>"} --extra-vars=occne_hostname=$OCCNE_CLUSTER-bastion-1 -i /host/occne.ini' <central repo>:5000/occne/k8s_install:$OCCNE_VERSION bash
$ set -e
$ /copyHosts.sh ${OCCNEINV}
$ ansible-playbook -i /kubespray/inventory/occne/hosts --become --private-key /host/.ssh/occne_id_rsa /kubespray/cluster.yml ${OCCNEARGS}
$ exit
- Verify if the new controller node is added to the cluster using the following command:
$ kubectl get node
Sample output:
NAME                     STATUS   ROLES                  AGE     VERSION
occne7-test-k8s-ctrl-1   Ready    control-plane,master   30m     v1.22.5
occne7-test-k8s-ctrl-2   Ready    control-plane,master   2d19h   v1.22.5
occne7-test-k8s-ctrl-3   Ready    control-plane,master   2d19h   v1.22.5
occne7-test-k8s-node-1   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-2   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-3   Ready    <none>                 2d19h   v1.22.5
occne7-test-k8s-node-4   Ready    <none>                 2d19h   v1.22.5
- Perform the following steps to validate the addition to the etcd
cluster using etcdctl:
- Use SSH to log in to a working control node:
$ ssh <working control node hostname>
For example:
$ ssh occne7-test-k8s-ctrl-2
- Switch to the root user:
$ sudo su
For example:
[cloud-user@occne7-test-k8s-ctrl-2]# sudo su
- Source /etc/etcd.env:
$ source /etc/etcd.env
For example:
[root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
- Run the following command to list the etcd members:
$ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
For example:
[root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
Sample output:
52513ddd2aa49770, started, etcd1, https://192.168.202.232:2380, https://192.168.201.158:2379, false
f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
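Optionally, after the cluster.yml playbook completes, you can confirm that the edits made to the kubeadm-config ConfigMap were retained. This is a minimal sketch, assuming the standard kubeadm-config layout shown in the earlier steps.
# Print the controlPlaneEndpoint and etcd endpoints currently stored in the ConfigMap.
$ kubectl get configmap kubeadm-config -n kube-system -o yaml | grep -E 'controlPlaneEndpoint|:2379'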
Restoring the etcd Database
This section describes the procedure to restore etcd cluster data from the backup.
- A backup copy of the etcd database must be available. For the procedure to create a backup of your etcd database, refer to the "Performing an etcd Data Backup" section of Oracle Communications Cloud Native Core, Cloud Native Environment User Guide.
- At least one Kubernetes controller node must be operational.
- Find Kubernetes controller hostname: Run the following command to
get the names of Kubernetes controller nodes.
$ kubectl get nodes
Sample output:
NAME                           STATUS   ROLES                  AGE    VERSION
occne3-my-cluster-k8s-ctrl-1   Ready    control-plane,master   4d1h   v1.23.7
occne3-my-cluster-k8s-ctrl-2   Ready    control-plane,master   4d1h   v1.23.7
occne3-my-cluster-k8s-ctrl-3   Ready    control-plane,master   4d1h   v1.23.7
occne3-my-cluster-k8s-node-1   Ready    <none>                 4d1h   v1.23.7
occne3-my-cluster-k8s-node-2   Ready    <none>                 4d1h   v1.23.7
occne3-my-cluster-k8s-node-3   Ready    <none>                 4d1h   v1.23.7
occne3-my-cluster-k8s-node-4   Ready    <none>                 4d1h   v1.23.7
You must restore the etcd data on any one of the controller nodes that is in Ready state. From the output, note the name of a controller node that is in Ready state to restore the etcd data.
- Run the etcd-restore script:
- On the Bastion Host, switch to the /var/occne/cluster/${OCCNE_CLUSTER}/artifacts directory:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
- Run the etcd_restore.sh script:
$ ./etcd_restore.sh
On running the script, the system prompts you to enter the following details:
- k8s-ctrl node: Enter the name of the controller node (noted in Step 1) on which you want to restore the etcd data.
- Snapshot: Select the PVC snapshot that you want to restore from the list of PVC snapshots displayed.
Example:$ ./artifacts/etcd_restore.shSample output:Enter the K8s-ctrl hostname to restore etcd backup: occne3-my-cluster-k8s-ctrl-1 occne-etcd-backup pvc exists! occne-etcd-backup pvc is in bound state! Creating occne-etcd-backup pod pod/occne-etcd-backup created waiting for Pod to be in running state waiting for Pod to be in running state waiting for Pod to be in running state waiting for Pod to be in running state waiting for Pod to be in running state occne-etcd-backup pod is in running state! List of snapshots present on the PVC: snapshotdb.2022-11-14 Enter the snapshot from the list which you want to restore: snapshotdb.2022-11-14 This site is for the exclusive use of Oracle and its authorized customers and partners. Use of this site by customers and partners is subject to the Terms of Use and Privacy Policy for this site, as well as your contract with Oracle. Use of this site by Oracle employees is subject to company policies, including the Code of Conduct. Unauthorized access or breach of these terms may result in termination of your authorization to use this site and/or civil and criminal penalties. Restoring etcd data backup This site is for the exclusive use of Oracle and its authorized customers and partners. Use of this site by customers and partners is subject to the Terms of Use and Privacy Policy for this site, as well as your contract with Oracle. Use of this site by Oracle employees is subject to company policies, including the Code of Conduct. Unauthorized access or breach of these terms may result in termination of your authorization to use this site and/or civil and criminal penalties. Deprecated: Use `etcdutl snapshot restore` instead. 2022-11-14T20:22:37Z info snapshot/v3_snapshot.go:248 restoring snapshot {"path": "snapshotdb.2022-11-14", "wal-dir": "default.etcd/member/wal", "data-dir": "default.etcd", "snap-dir": "default.etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/command/snapshot_command.go:129\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/main.go:59\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"} 2022-11-14T20:22:37Z info membership/store.go:141 Trimming membership information from the backend... 
2022-11-14T20:22:37Z info membership/cluster.go:421 added member {"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]} 2022-11-14T20:22:37Z info snapshot/v3_snapshot.go:269 restored snapshot {"path": "snapshotdb.2022-11-14", "wal-dir": "default.etcd/member/wal", "data-dir": "default.etcd", "snap-dir": "default.etcd/member/snap"} Removing etcd-backup-pod pod "occne-etcd-backup" deleted etcd-data-restore is successful!!
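After the script reports that the restore is successful, it is reasonable to confirm that the control plane is serving the restored data. The following is a minimal sketch of such a check using standard kubectl commands.
# Confirm that the API server answers and the controller nodes are still in Ready state.
$ kubectl get nodes
# Spot-check that workloads recorded in the restored snapshot are present.
$ kubectl get pods -A | head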
Recovering a Failed Kubernetes Worker Node
This section provides the manual procedures to replace a failed Kubernetes Worker Node for bare metal, OpenStack, and VMware.
Prerequisites
- Kubernetes worker node must be taken out of service.
- Bare metal server must be repaired and the same bare metal server must be added back into the cluster.
- You must have credentials to access the OpenStack GUI.
- You must have credentials to access VMware GUI or CLI.
Limitations
Some of the steps in these procedures must be run manually.
Recovering a Failed Kubernetes Worker Node in Bare Metal
This section describes the manual procedure to replace a failed Kubernetes Worker Node in a bare metal deployment.
Prerequisites
- Kubernetes worker node must be taken out of service.
- Bare metal server must be repaired and the same bare metal server must be added back into the cluster.
Procedure
- Run the following command to remove Object Storage Daemon (OSD)
from the worker node, before removing the worker node from the Kubernetes
cluster:
Note:
Remove one OSD at a time. Do not remove multiple OSDs at once. Check the cluster status between OSD removals.
Sample rook_toolbox file.
# Note down the osd-id hosted on the worker node which is to be removed
$ kubectl get pods -n rook-ceph -o wide | grep osd | grep <worker-node>
# Scale down the rook-ceph-operator deployment and OSD deployment
$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
$ kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0
# Install the rook-ceph tool box
$ kubectl create -f rook_toolbox.yaml
# Connect to the rook-ceph toolbox using the following command:
$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
# Once connected to the toolbox, check the ceph cluster status using the following commands:
$ ceph status
$ ceph osd status
$ ceph osd tree
# Mark the OSD deployment as out using the following commands and purge the OSD:
$ ceph osd out osd.<ID>
$ ceph osd purge <ID> --yes-i-really-mean-it
# Verify that the OSD is removed from the node and ceph cluster status:
$ ceph status
$ ceph osd status
$ ceph osd tree
# Exit the rook-ceph toolbox
$ exit
# Delete the OSD deployments of the purged OSD
$ kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
# Scale up the rook-ceph-operator deployment using the following command:
$ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
# Remove the rook-ceph tool box deployment
$ kubectl -n rook-ceph delete deploy/rook-ceph-tools
environment variables to allow podman commands to run on the Bastion
Host:
$ export NODE=<workernode-full-name>
$ export CENTRAL_REPO=<central-repo-name>
$ export CENTRAL_REPO_REGISTRY_PORT=<central-repo-port>
Example:
$ export NODE=k8s-6.delta.lab.us.oracle.com
$ export CENTRAL_REPO=winterfell
$ export CENTRAL_REPO_REGISTRY_PORT=5000
- Run one of the following commands to remove the old worker
node:
- If the worker node is reachable from the Bastion
Host:
$ sudo podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--extra-vars="{'node':'${NODE}'}" -e 'PLAYBOOK=/kubespray/remove-node.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION} - If the worker node is not reachable from the Bastion
Host:
$ sudo podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--extra-vars="{'node':'${NODE}','reset_nodes':false}" -e 'PLAYBOOK=/kubespray/remove-node.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION}
A confirmation message is displayed, asking whether to remove the node. Enter "yes" at the prompt. This will take several minutes, mostly spent at the task "Drain node except daemonsets resource" (even if the node is unreachable). - If the worker node is reachable from the Bastion
Host:
- Run the following command to verify that the node was
removed:
$ kubectl get nodesVerify that the target worker node is no longer listed.
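As an additional check (a sketch; it reuses the NODE variable exported earlier in this procedure), querying the node directly returns a NotFound error once the removal is complete:
$ kubectl get node ${NODE}
# Expected once the removal is complete:
# Error from server (NotFound): nodes "k8s-6.delta.lab.us.oracle.com" not found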
Adding Node to a Kubernetes Cluster
This section describes the procedure to add a new node to a Kubernetes cluster.
- Replace the node's settings in
hosts.iniwith the replacement node's settings (probably a MAC address change, if the node is a direct replacement). If you are adding a node, add it to thehosts.iniin all the relevant places (machine inventory section, and proper groups). - Set the environment variables (CENTRAL_REPO, CENTRAL_REPO_REGISTRY_PORT
(if not 5000), and NODE) to run the
podman command on the Bastion Host:export NODE=k8s-6.delta.lab.us.oracle.com export CENTRAL_REPO=winterfell - Install the OS on the new target worker
node:
podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--limit=${NODE},localhost ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/provision:${OCCNE_VERSION} - Run the following command to scale up the Kubernetes cluster with the
new worker
node:
podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e 'INSTALL=scale.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION} - Run the following command to verify the new node is up and running in the
cluster:
kubectl get nodes
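If you prefer not to poll manually, the following minimal sketch waits for the new node (using the NODE variable set earlier) to report the Ready condition; adjust the timeout to suit your environment:
$ kubectl wait --for=condition=Ready node/${NODE} --timeout=600s
# Example output: node/k8s-6.delta.lab.us.oracle.com condition met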
Adding OSDs in a Ceph Cluster
This procedure sets up a ceph-osd daemon, configures it to use one drive, and configures the cluster to distribute data to the Object Storage Daemon (OSD). If your host has multiple drives, you may add an OSD for each drive by repeating this procedure. To add an OSD, create a data directory for it, mount a drive to that directory, add the OSD to the cluster, and then add it to the crush map. When you add the OSD to the crush map, consider the weight you give to the new OSD.
- Connect to the
rook-cephtoolbox using the following command:$ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash - Before adding OSD, make sure that the current OSD tree does not have
any outliers which could become (nearly) full if the new crush map decides to put
even more Placement Groups (PGs) on that
OSD:
$ ceph osd df | sort -k 7 -nUsereweight-by-utilizationto force PGs off the OSD:$ ceph osd test-reweight-by-utilization $ ceph osd reweight-by-utilizationFor optimal viewing, set up atmuxsession and make three panes:- A pane with the "
watch ceph -s" command that displays the status of the Ceph cluster every 2 seconds. - A pane with the "
watch ceph osd tree" command that displays the status of the OSDs in the Ceph cluster every 2 seconds. - A pane to run the actual commands.
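The following is a minimal sketch of the three-pane layout described above, run from the Bastion Host. It assumes tmux is available there and wraps the ceph commands in the rook-ceph toolbox pod, because the ceph CLI itself is only available inside the toolbox:
$ tmux new-session -d -s ceph-watch "watch 'kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s'"
$ tmux split-window -h -t ceph-watch "watch 'kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree'"
$ tmux split-window -v -t ceph-watch
$ tmux attach -t ceph-watch
# The third (empty) pane is where you run the actual commands.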
- In order to deploy an OSD, there must be a storage device that is available on which
the OSD is deployed.
Run the following command to display an inventory of storage devices on all cluster hosts:
$ ceph orch device lsA storage device is considered available, if all of the following conditions are met:- The device must have no partitions.
- The device must not have any LVM state.
- The device must not be mounted.
- The device must not contain a file system.
- The device must not contain a Ceph BlueStore OSD.
- The device must be larger than 5 GB.
- Ceph will not provision an OSD on a device that is not available.
- To verify that the cluster is in a healthy state, connect to the Rook Toolbox, run the ceph status command, and confirm the following:
- All mons must be in quorum.
- A mgr must be in the active state.
- At least one OSD must be in the active state.
- If the health is not HEALTH_OK, the warnings or errors must be investigated.
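The following commands, run inside the rook-ceph toolbox, are a quick sketch for checking these conditions individually (output formats vary slightly between Ceph releases):
$ ceph status                                                 # overall health; expect HEALTH_OK
$ ceph quorum_status -f json-pretty | grep -A5 quorum_names   # all mons should appear in the quorum list
$ ceph mgr stat                                               # shows the active mgr
$ ceph osd stat                                               # at least one OSD should be "up" and "in"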
Recovering a Failed Kubernetes Worker Node in OpenStack
This section describes the manual procedure to replace a failed Kubernetes Worker Node in an OpenStack deployment.
Prerequisites
- You must have credentials to access the OpenStack GUI.
Procedure
- Perform the following steps to identify and remove the failed worker
node:
Note:
Run all the commands as a cloud-user in the/var/occne/cluster/${OCCNE_CLUSTER}folder.- Identify the node that is in a not ready, not reachable, or degraded
state and note the node's IP
address:
kubectl get node -A -o wideSample output:NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME occne3-user-k8s-ctrl-1 Ready control-plane,master 178m v1.23.7 192.168.1.92 192.168.1.92 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-ctrl-2 Ready control-plane,master 178m v1.23.7 192.168.1.117 192.168.1.117 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-ctrl-3 Ready control-plane,master 178m v1.23.7 192.168.1.118 192.168.1.118 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-1 Ready <none> 176m v1.23.7 192.168.1.135 192.168.1.135 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-2 Ready <none> 176m v1.23.7 192.168.1.137 192.168.1.137 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-3 Ready <none> 176m v1.23.7 192.168.1.136 192.168.1.136 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-4 Ready <none> 176m v1.23.7 192.168.1.119 192.168.1.119 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 - Copy and take a backup of the original Terraform tfstate
file:
$ cp terraform.tfstate terraform.tfstate.bkp-orig - On identifying the failed node, drain the node from the
Kubernetes
cluster:
$ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-dataThis command ignores DaemonSet-managed pods and deletes emptyDir data because the failed worker node may still have local storage volumes attached to it.
Note:
If this command runs without an error, move to Step e. Else, perform Step d. - You may encounter errors after running Step c, or the pods may not be detached from the node that is being deleted. For example:
You can ignore these errors and exit the terminal (CTRL+c). After exiting from terminal, run the following command to manually delete the node:I0705 07:40:22.050275 409374 request.go:697] Waited for 1.083354916s due to client-side throttling, not priority and fairness, request: GET:https://lb-apiserver.kubernetes.local:6443/ api/v1/namespaces/occne-infra/pods/occne-lb-controller-server-58f689459b-w95t$ kubectl delete node <node-name> - Wait (about 10 to 15 minutes) until all pods are detached from the deleted node and assigned to other node that are in Ready state. Skip to step g if this step is successful. There can still be Pods that are in Pending state due to insufficient memory. This issue can be resolved when you add new node to the cluster.
- If the previous step fails, perform the following steps to manually
remove the pods that are running in the failed worker node:
- Identify the pods that are not in active state and
delete each of the pods by running the following
command:
Repeat this step until all the pods are removed from the cluster.$ kubectl delete pod --force <pod-name> -n <name-space> - Run the following command to drain the node from the Kubernetes
cluster:
$ kubectl drain occne3-user-k8s-node-2 --force --ignore-daemonsets --delete-emptydir-data
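To see which pods are still bound to the failed worker node before force-deleting them, the following sketch filters pods by node name (the node name is illustrative):
$ kubectl get pods -A -o wide --field-selector spec.nodeName=occne3-user-k8s-node-2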
- Verify if the failed node is removed from the
cluster:
$ kubectl get nodesSample output:NAME STATUS ROLES AGE VERSION occne3-user-k8s-ctrl-1 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-ctrl-2 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-ctrl-3 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-node-1 Ready <none> 2d v1.23.7 occne3-user-k8s-node-3 Ready <none> 2d v1.23.7 occne3-user-k8s-node-4 Ready <none> 2d v1.23.7Verify that the target worker node is no longer listed.
- Log in to the OpenStack GUI, and manually power off (if needed) and delete the node. At this stage, the node must be already removed from the cluster.
- Delete the failed node from OpenStack GUI:
- Log in to the Openstack GUI console by using your credentials.
- From the list of nodes displayed, locate and select the failed worker node.
- From the Actions menu in the last column of the record, select
Delete Instance.

- Reconfirm your action by clicking Delete Instance and wait for the node to be deleted.
- Run terraform apply to recreate and add the node into the Kubernetes cluster:
- Log in to the Bastion Host and switch to the cluster tools directory:
/var/occne/cluster/${OCCNE_CLUSTER}. - Run the following command to log in to the cloud using the openrc.sh
script and provide the required details (username, password, and domain
name):
Example:
$ source openrc.shSample output:Please enter your OpenStack Username for project Team-CNE: user@oracle.com Please enter your OpenStack Password for project Team-CNE as user : ************** Please enter your OpenStack Domain for project Team-CNE: DSEE - Run terraform apply to recreate the
node:
$ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve - Locate the IP address of the newly created node in the
terraform.tfstatefile. If the IP is same as the old node that is removed, move to step f. Else, perform Step e.$ grep -A6 occne6-utpalkant-k-kumar-k8s-node-1 terraform.tfstate | grep ipSample output:"fixed_ip_v4": "192.168.200.78", "fixed_ip_v6": "", "floating_ip": "", - If the IP address of the newly created node is different
from the old node's IP, replace the IP address in the following files
from the active
Bastion:
vi /etc/hosts vi /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini vi /var/occne/cluster/${OCCNE_CLUSTER}/lbvm/lbCtrlData.json - Run the pipeline command to provision the node with the OS:
Example, considering the affected node as worker-node-2:
$ OCCNE_CONTAINERS='(PROV)' OCCNE_DEPS_SKIP=1 OCCNE_ARGS='--limit=occne3-user-k8s-node-2' OCCNE_STAGES=(DEPLOY) pipeline.sh - Run the following command to install and configure
Kubernetes. This adds the node back into the
cluster.
$ OCCNE_CONTAINERS='(K8S)' OCCNE_DEPS_SKIP=1 OCCNE_STAGES=(DEPLOY) pipeline.sh - Verify if the node is added back into the
cluster:
$ kubectl get nodesSample output:NAME STATUS ROLES AGE VERSION occne3-user-k8s-ctrl-1 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-ctrl-2 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-ctrl-3 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-node-1 Ready <none> 2d1h v1.23.7 occne3-user-k8s-node-2 Ready <none> 111m v1.23.7 occne3-user-k8s-node-3 Ready <none> 2d1h v1.23.7 occne3-user-k8s-node-4 Ready <none> 2d1h v1.23.7
Recovering a Failed Kubernetes Worker Node in VMware
This section describes the manual procedure to replace a failed Kubernetes Worker Node in a VMware deployment.
Prerequisites
- You must have credentials to access VMware GUI or CLI.
Procedure
- Perform the following steps to identify and remove the failed worker
node:
Note:
Run all the commands as a cloud-user in the/var/occne/cluster/${OCCNE_CLUSTER}folder.- Identify the node that is in a not ready, not reachable, or degraded
state and note the node's IP
address:
kubectl get node -A -o wideSample output:NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME occne3-user-k8s-ctrl-1 Ready control-plane,master 178m v1.23.7 192.168.1.92 192.168.1.92 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-ctrl-2 Ready control-plane,master 178m v1.23.7 192.168.1.117 192.168.1.117 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-ctrl-3 Ready control-plane,master 178m v1.23.7 192.168.1.118 192.168.1.118 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-1 Ready <none> 176m v1.23.7 192.168.1.135 192.168.1.135 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-2 Ready <none> 176m v1.23.7 192.168.1.137 192.168.1.137 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-3 Ready <none> 176m v1.23.7 192.168.1.136 192.168.1.136 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 occne3-user-k8s-node-4 Ready <none> 176m v1.23.7 192.168.1.119 192.168.1.119 Oracle Linux Server 8.6 5.4.17-2136.309.5.el8uek.x86_64 containerd://1.6.4 - Copy and take a backup of the original Terraform tfstate
file:
# cp terraform.tfstate terraform.tfstate.bkp-orig - On identifying the failed node, drain the node from the
Kubernetes
cluster:
# kubectl drain occne3-user-k8s-node-2 --ignore-daemonsets --delete-emptydir-dataThis command ignores DaemonSet-managed pods and deletes emptyDir data because the failed worker node may still have local storage volumes attached to it.
Note:
If this command runs without an error, move to Step e. Else, perform Step d. - You may encounter errors after running Step c, or the pods may not be detached from the node that is being deleted. For example:
You can ignore these errors and exit the terminal (CTRL+c). After exiting from terminal run the following command to manually delete the node:I0705 07:40:22.050275 409374 request.go:697] Waited for 1.083354916s due to client-side throttling, not priority and fairness, request: GET:https://lb-apiserver.kubernetes.local:6443/ api/v1/namespaces/occne-infra/pods/occne-lb-controller-server-58f689459b-w95t# kubectl delete node <node-name> - Wait (about 10 to 15 minutes) until all pods are detached from the deleted node and assigned to other node that are in Ready state. Skip to step g if this step is successful. There can still be Pods that are in Pending state due to insufficient memory. This issue can be resolved when you add new node to the cluster.
- If the previous step fails, perform the following steps to manually
remove the pods that are running in the failed worker node:
- Identify the pods that are not in online state and delete each
of the pods by running the following
command:
Repeat this step until all the pods are removed from the cluster.# kubectl delete pod --force <pod-name> -n <name-space> - Run the following command to drain the node from the Kubernetes
cluster:
# kubectl drain occne3-user-k8s-node-2 --force --ignore-daemonsets --delete-emptydir-data
- Verify if the failed node is removed from the
cluster:
# kubectl get nodesSample output:NAME STATUS ROLES AGE VERSION occne3-user-k8s-ctrl-1 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-ctrl-2 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-ctrl-3 Ready control-plane,master 2d v1.23.7 occne3-user-k8s-node-1 Ready <none> 2d v1.23.7 occne3-user-k8s-node-3 Ready <none> 2d v1.23.7 occne3-user-k8s-node-4 Ready <none> 2d v1.23.7Verify that the target worker node is no longer listed.
- Log in to the VCD or VMware console. Manually power off (if needed) and delete the failed node from VMware. At this stage, the node must already be removed from the VCD console.
- Recreate and add the node back into the Kubernetes
cluster:
Note:
Run all the commands as a cloud-user in the/var/occne/cluster/${OCCNE_CLUSTER}folder.- Run terraform apply to recreate the
node:
# terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve - Locate the IP address of the newly created node in the
terraform.tfstatefile. If the IP is same as the old node that is removed, move to step d. Else, perform Step c.# grep -A6 occne3-user-k8s-node-2 terraform.tfstate | grep ipSample output:"ip": "192.168.1.137", "ip_allocation_mode": "POOL", - If the IP address of the newly created node is different
from the old node's IP, replace the IP address in the following
files:
- /etc/hosts - /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini - /var/occne/cluster/${OCCNE_CLUSTER}/lbvm/lbCtrlData.json - Run the pipeline command to provision the node with the OS:
Example, considering the affected node as worker-node-2:
# OCCNE_CONTAINERS='(PROV)' OCCNE_DEPS_SKIP=1 OCCNE_ARGS='--limit=occne3-user-k8s-node-2' OCCNE_STAGES=(DEPLOY) pipeline.sh - Run the following command to install and configure
Kubernetes. This adds the node back into the
cluster.
# OCCNE_CONTAINERS='(K8S)' OCCNE_DEPS_SKIP=1 OCCNE_STAGES=(DEPLOY) pipeline.sh - Verify if the node is added back into the
cluster:
# kubectl get nodesSample output:NAME STATUS ROLES AGE VERSION occne3-user-k8s-ctrl-1 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-ctrl-2 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-ctrl-3 Ready control-plane,master 2d1h v1.23.7 occne3-user-k8s-node-1 Ready <none> 2d1h v1.23.7 occne3-user-k8s-node-2 Ready <none> 111m v1.23.7 occne3-user-k8s-node-3 Ready <none> 2d1h v1.23.7 occne3-user-k8s-node-4 Ready <none> 2d1h v1.23.7
Restoring CNE from Backup
This section provides details about restoring a CNE cluster from backup.
Prerequisites
Before restoring CNE from backups, ensure that the following prerequisites are met.
- The CNE cluster must have been backed up successfully. For more information about taking a CNE backup, see the "Creating CNE Cluster Backup" section of Oracle Communications Cloud Native Core, Cloud Native Environment User Guide.
- At least one Kubernetes controller node must be operational.
- As this is a non-destructive restore, all the corrupted or non-functioning resources must be destroyed before initiating the restore process.
- This procedure replaces your current cluster directory with the one saved in your CNE cluster backup. Therefore, before performing a restore, back up any Bastion directory files that you consider sensitive.
- For a BareMetal deployment, the following rook-ceph
storage classes must be created and made available:
- standard
- occne-esdata-s
- occne-esmaster-sc
- occne-metrics-sc
- For a BareMetal deployment, PVCs must be created for all services except the bastion-controller.
- For a vCNE deployment, PVCs must be created for all services except the bastion-controller and lb-controller.
Note:
- Velero backups have a default retention period of 30 days. CNE provides only the non-expired backups for an automated cluster restore.
- Perform the restore procedure from the same Bastion Host from which the backups were taken.
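The following is a minimal sketch for checking these prerequisites from the Bastion Host before starting the restore; it assumes the velero CLI is configured and, for BareMetal, checks the expected storage classes and PVCs:
$ velero backup get                 # only non-expired backups can be restored
$ kubectl get storageclass          # BareMetal: confirm the rook-ceph storage classes exist
$ kubectl get pvc -n occne-infra    # confirm the required PVCs are created and Bound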
Performing a Cluster Restore From Backup
This section describes the procedure to restore a CNE cluster from backup.
Note:
- The backup restore procedure can restore backups of both Bastion and Velero.
- This procedure is used for running a restore for the first time only. If you want to rerun a restore, see Rerunning Cluster Restore.
Dropping All CNE Services:
Perform the following steps to run the velero_drop_services.sh script to drop only the
currently supported services:
- Navigate to the
/var/occne/cluster/${OCCNE_CLUSTER}/directory where thevelero_drop_services.shis located:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/ - Run the
velero_drop_services.shscript:Note:
If you are using this script for the first time, you can run the script by passing--help / -has an argument or run the script without passing any argument to get more information about the script../scripts/restore/velero_drop_services.sh -hSample output:This script helps you drop services to prepare your cluster for a velero restore from backup, it receives a space separated list of arguments for uninstalled different components Usage: provision/provision/roles/bastion_setup/files/scripts/backup/velero_drop_services.sh [space separated arguments] Valid arguments: - bastion-controller - opensearch - fluentd-opensearch - jaeger - snmp-notifier - metrics-server - nginx-promxy - promxy - vcne-egress-controller - istio - cert-manager - kube-system - all: Drop all the above Note: If you place 'all' anywhere in your arguments all will be dropped.
Run the velero_drop_services.sh script to drop a service or a set of services. For example:
- To drop a service or a set of services,
pass the service names as a space separated
list:
./scripts/restore/velero_drop_services.sh jaeger fluentd-opensearch istio
- To drop all the supported services, use
all:./scripts/restore/velero_drop_services.sh all
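After dropping services, you can confirm that the corresponding releases and workloads are gone before starting the restore; a simple sketch:
$ helm list -n occne-infra          # dropped services should no longer be listed
$ kubectl get pods -n occne-infra   # no pods should remain for the dropped services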
Initiating Cluster Restore
- Perform the following steps to initiate a cluster
restore:
- Navigate to the
/var/occne/cluster/${OCCNE_CLUSTER}/directory:$ cd /var/occne/cluster/${OCCNE_CLUSTER}/ - Run the
createClusterRestore.pyscript:- If you do not know the name of
the backup that you are going to restore, run the
following command to choose the backup from the
available list and then run the
restore:
$ ./scripts/restore/createClusterRestore.pySample output:Please choose and type the available backup you want to restore into your OCCNE cluster - occne-example-20250310-203651 - occne-example-20250314-053600 Please type the name of your backup: ... - If you know the name of the back
up that you are going to restore, run the script by
passing the backup
name:
$ ./scripts/restore/createClusterRestore.py $<BACKUP_NAME>where,
<BACKUP_NAME>is the name of the Velero backup previously created.For example, considering the backup name as "occne-example-20250314-053600" the restore script is run as follows:$ ./scripts/restore/createClusterRestore.py occne-example-20250314-053600Sample output:No /var/occne/cluster/occne-example/artifacts/velero.d/restore/cluster_restores_log.json log file, creating new one Initializing cluster restore with backup: occne-example-20250314-053600... No /var/occne/cluster/occne-example/artifacts/velero.d/restore/cluster_restores_log.json log file, creating new one Initializing bastion restore : 'occne-example-20250314-053600' Downloading bastion backup occne-example-20250314-053600 Successfully downloaded bastion backup occne-example-20250314-053600.tar at home directory GENERATED LOG FILE AT: /home/cloud-uservar/occne/cluster/occne-example/logs/velero/restore/downloadBastionBackup-20250314-055014.log GENERATED LOG FILE AT: /var/occne/homecluster/cloud-useroccne-example/logs/velero/restore/createBastionRestore-20250314-055014.log Initializing Velero K8s restore : 'occne-example-20250314-053600' Successfully created velero restore SKIPPING: Step 'update-pvc-annotations' skipped due to not found Pods/PVCs in 'Pending' state No /var/occne/cluster/occne-example/artifacts/restore/cluster_restores_log.json log file, creating new one Successfully created cluster restore GENERATED LOG FILE AT: /home/cloud-user/var/occne/cluster/occne-example/logs/velero/restore/createClusterRestore-20250314-055014.log
- For CNE 23.3.x and older versions, run the following script to add the required annotations to the affected PVCs (to specify the storage provider) and to remove the pods associated with these PVCs so that the pods are re-created:
$ ./restore/updatePVCAnnotations.py
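To check whether any PVCs or pods are still in the Pending state before or after running the script, the following sketch can be used:
$ kubectl get pvc -A | grep -v Bound                                      # PVCs not yet Bound
$ kubectl get pods -n occne-infra --field-selector status.phase=Pending   # pods still waiting to start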
Verifying Restore
- When Velero restore is completed, it may take
several minutes for Kubernetes resources to be fully up and
functional. You must monitor the restore to ensure that all
services are up and running. Run the following command to
get the status of all pods, deployments, and
services:
$ kubectl get all -n occne-infra - Once you verify that all resources are restored,
run a cluster test to verify if every single resource is up
and running.
$ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES=(TEST) pipeline.sh
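In addition to the cluster test, the following sketch checks the Velero restore object itself and waits for the occne-infra deployments to become available (adjust the timeout as needed):
$ velero restore get
$ kubectl -n occne-infra wait --for=condition=Available deployment --all --timeout=15m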
Rerunning Cluster Restore
This section describes the procedure to rerun a restore that is already completed successfully.
- Navigate to the
/var/occne/cluster/$OCCNE_CLUSTER/directory:$ cd /var/occne/cluster/$OCCNE_CLUSTER/ - Open the
cluster_restores_log.jsonfile:vi artifacts/velero.d/cluster_restores_log.jsonSample output:{ "occne-cluster-20230712-220439": { "cluster-restore-state": "COMPLETED", "bastion-restore": { "state": "COMPLETED" }, "velero-restore": { "state": "COMPLETED" } } } - Edit the file to set the value of
"cluster-restore-state"to"RESTART"as shown in the following code block:{ "occne-example-20250314-053600": { "cluster-restore-state": "RESTART", "bastion-restore": { "state": "COMPLETED" }, "velero-restore": { "state": "COMPLETED" }, "update-pvc-annotations": { "state": "SKIPPED" } } } - Perform the following steps to remove the previously created Velero
restore objects:
- Run the following command to delete the Velero restore
object:
$ velero restore delete <BACKUP_NAME>where,
<BACKUP_NAME>is the name of the previously created Velero backup. - Wait until the restore object is deleted and verify the
same using the following
command:
$ velero get restore - Run the Dropping All CNE Services procedure to delete all the services that were created when this procedure was run first.
- Verify that Step c is completed successfully and no resources are left. Skipping this verification can cause the cluster to go into an unrecoverable state.
- Run the CNE restore script without an
interactive
menu:
$ ./scripts/restore/createClusterRestore.py <BACKUP_NAME>where,
<BACKUP_NAME>is the name of the previously created Velero backup.
Rerunning a Failed Cluster Restore
This section describes the procedure to rerun a restore that failed.
CNE provides options to resume cluster restore from the stage it failed. Perform any of the following steps depending on the stage in which your restore failed:
Bastion Host Failure
- Go to the
/var/occne/cluster/${OCCNE_CLUSTER}/directory.$ cd /var/occne/cluster/${OCCNE_CLUSTER}/ - Set the
BACKUP_NAMEvariable with the name of the backup that is going to be restored.$ export BACKUP_NAME="Name-of-the-previously-created-velero-backup"Sample output:
$ export BACKUP_NAME=occne-example-20250314-053600 - To resume the cluster restore from this stage, rerun the restore script without
using the interactive
menu:
$ ./scripts/restore/createClusterRestore.py $BACKUP_NAMEwhere,
<BACKUP_NAME>is the name of the Velero backup previously created.
Kubernetes Velero Restore Failure
- Go to the
/var/occne/cluster/${OCCNE_CLUSTER}/directory.$ cd /var/occne/cluster/${OCCNE_CLUSTER}/ - Set the
BACKUP_NAMEvariable with the name of the backup that is going to be restored.$ export BACKUP_NAME="<Name-of-the-previously-created-velero-backup>"Sample output:
$ export BACKUP_NAME=occne-example-20250314-053600 - Run the following command to delete the Velero restore
object:
$ velero restore delete $BACKUP_NAMEwhere,
<BACKUP_NAME>is the name of the previously created Velero backup.Sample output:
The restore will be fully deleted after all the associated data (restore files in object storage) are removed.Are you sure you want to continue (Y/N)? Y Request to delete restore "occne-example-20250314-053600" submitted successfully. - Verify that restore is deleted successfully and no resources are left. Skipping this verification can cause the cluster to go into an unrecoverable state.
- Run the CNE restore script without an interactive
menu:
$ ./scripts/restore/createClusterRestore.py $<BACKUP_NAME>where,
<BACKUP_NAME>is the name of the previously created Velero backup.
Modifying Annotations and Deleting PV in Kubernetes:
If the restore fails at this point and shows that the pods are waiting for their PVs, use the updatePVCAnnotations.py script to automatically modify the annotations and delete the PVs in Kubernetes.
updatePVCAnnotations.py script is used to:
- add specific annotations to the affected PVCs for specifying storage provider.
- remove affected pods associated with the affected PVCs to force the pods to recreate themselves.
Run the updatePVCAnnotations.py script:$ ./scripts/restore/updatePVCAnnotations.py
Troubleshooting Restore Failures
This section provides the guidelines to troubleshoot restore failures.
Prerequisites
Before using this section to troubleshoot a restore failure, verify the following:- Verify connectivity with S3 object storage.
- Verify if the credentials used while activating Velero are still active.
- Verify if the credentials are granted with read or write permission.
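A quick sketch for these checks using the velero CLI (assuming it is configured on the Bastion Host):
$ velero backup-location get    # PHASE should report "Available" if S3 is reachable and the credentials work
$ velero backup get             # backups should list without authorization errors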
Troubleshooting Failed Bastion Restore
Table 4-1 Troubleshooting Failed Bastion Restore
| Cause | Possible Solution |
|---|---|
Troubleshooting Failed Kubernetes Velero Restore
Table 4-2 Troubleshooting Failed Kubernetes Velero Restore
| Cause | Possible Solution |
|---|---|
| Velero backup objects are not available. | Run the Velero backup list command and verify that the expected backup objects are listed. |
| PVCs are not attached correctly. | Verify that, after a restore, every PVC that was created with the CNE services under occne-infra is still available and is in the Bound status. |
| PVs are not available. | Verify that, before and after a restore, PVs are available for the common services to restore. |
Common Services
This section describes the fault recovery procedures for common services.
Restoring a Failed Load Balancer
This section provides a detailed procedure to restore a Virtualized CNE (vCNE) Load Balancer that fails while the Kubernetes cluster is in service. This procedure can also be used to recreate an LBVM that was manually deleted.
- You must know the reason for the Load Balancer Virtual Machines (LBVM) failure.
- You must know the LBVM name to be replaced and the address pool.
- Ensure that the
cluster.tfvarsfile is available for terraform to recreate the LBVM. - You must run this procedure in the active bastion.
- The following procedure does not attempt to determine the cause of the LBVM failure.
- The role or status of LBVM to be replaced must not be ACTIVE.
- If a LOAD_BALANCER_NO_SERVICE alert is raised and both LBVMs are down, then this procedure must be used to recover one LBVM at a time.
Restoring a Failed LB Controller
This section provides a detailed procedure to restore a failed LB controller using a backup crontab.
A failed LB Controller can be restored using backup and restore processes.
Prerequisites- Ensure that the LB controller is installed.
- Ensure that the metal-lb is installed.
Backup LB controller database
- Create the
backuplbcontroller.shscript to back up the LB controller database.Run the following commands to create the script in the /var/occne/cluster/${OCCNE_CLUSTER} directory:
cd /var/occne/cluster/${OCCNE_CLUSTER} touch backuplbcontroller.sh chmod +x backuplbcontroller.sh vi backuplbcontroller.sh - Add the following lines of code inside the
backuplbcontroller.shscript.Following is a sample content of
backuplbcontroller.shscript:#!/bin/bash timenow=`date +%Y-%m-%d_%H%M%S` occne_lb_pod_name=$(kubectl get pods -n occne-infra -o custom-columns=NAME:.metadata.name | grep occne-lb-controller-server) if [ -z "$occne_lb_pod_name" ] then echo "The occne-lb-controller-server pod could not be found. $timenow" >> lb_backup.log exit 1 fi occne_lb_pod_ready=$(kubectl get pods $occne_lb_pod_name -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' -n occne-infra| grep True) if [ -z "$occne_lb_pod_ready" ] then echo "The occne-lb-controller-server pod is not ready. $timenow" >> lb_backup.log exit 1 fi lb_db_status=$(kubectl exec -it $occne_lb_pod_name -n occne-infra -- /bin/bash -c "cd create_lb/lbCtrl;python3 -c 'import lbCtrlData;print(lbCtrlData.getLbvms())'" | grep ACTIVE) if [ -z "$lb_db_status" ] then echo "None of the LBVMs are ACTIVE. $timenow" >> lb_backup.log exit 1 fi echo "Backing up DB from pod: $occne_lb_pod_name at $timenow." >> lb_backup.log db_backup_res=$(kubectl cp occne-infra/$occne_lb_pod_name:data/sqlite/db /tmp/backuplbcontroller/) backup_last_db=$(cp /tmp/backuplbcontroller/lbCtrlData.db /tmp/backuplbcontroller/lbCtrlData_$timenow.db) echo "$db_backup_res" >> lb_backup.log backups_list=$(ls -t /tmp/backuplbcontroller/ | grep lbCtrlData_) max_backups=10 if [ $(echo "$backups_list" | wc -l) -gt $max_backups ] then oldest_backup=$(echo "$backups_list" | tail -n +$(($max_backups+1))) rm "/tmp/backuplbcontroller/$oldest_backup" fi - Add the script to a cron job.
Run the following command to edit the crontab:
sudo crontab -eAdd the following line to the crontab:
0 */4 * * * /var/occne/cluster/${OCCNE_CLUSTER}/backuplbcontroller.sh - Verify that the changes in the crontab were added.
Run the following command to list the crontab entries and confirm that the new entry is present:
sudo crontab -l
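Before relying on the cron job, you can run the script once by hand and confirm that it produces a backup and a log entry; a sketch using the paths from the script above:
$ cd /var/occne/cluster/${OCCNE_CLUSTER}
$ ./backuplbcontroller.sh
$ tail lb_backup.log
$ ls -lt /tmp/backuplbcontroller/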
Restore LB Controller
Note:
To restore the database, you must have previously created backups using the Backup LB controller database procedure.- In
sudomode, stop the backups from the cron job. Once the LB controller database is restored, add the line again into the cron job.Run the following command to edit the crontab:
sudo crontab -eDelete the next line from the cron job:0 */4 * * * /var/occne/cluster/${OCCNE_CLUSTER}/backuplbcontroller.sh - Run the following command to uninstall
metallb and lb-controller pods. Both the pods must be uninstalled so that BGP peering is recreated. After the uninstall, wait for the pods and PVCs of both metallb and lb-controller to terminate.Run the following command to uninstall the Helm releases:
helm uninstall occne-metallb occne-lb-controller -n occne-infraWait for the resources to be terminated.
- Reinstall the
metallb and lb-controller pods.Run the following deploy command to reinstall the pods:
OCCNE_CONTAINERS=(CFG) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS="--tags=metallb,vcne-lb-controller" /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/pipeline.sh - Restore LB Controller database.
Run the following restore commands:
- Stop the monitor in lb-controller after install until
db file is
updated.
kubectl get deploy -n occne-infra | grep occne-lb-controller | awk '{print $1}' kubectl set env deployment/<lb deploy> UPGRADE_IN_PROGRESS="true" -n occne-infraWait for the pod to be in the running state.
- Load the db file back to the
Container.
sudo cp /tmp/backuplbcontroller/lbCtrlData.db lbCtrlData.db sudo chmod a+rwx lbCtrlData.db kubectl get pods -n occne-infra | grep occne-lb-controller | awk '{print $1}' kubectl cp lbCtrlData.db occne-infra/<lb controller pod>:/data/sqlite/db - Start the monitor
again.
Wait for thekubectl set env deployment/<lb deploy> UPGRADE_IN_PROGRESS="false" -n occne-infralb-controllerpod to terminate and recreate.
- Stop the monitor in lb-controller after install until
db file is
updated.
To restore the backup files in case they get corrupted, it is
recommended that up to 10 copies of the database be saved with a 4-hour
separation. If the latest snapshot encounters issues, repeat the commands provided
in the last step (step 4), specifying the version of the database you wish to
restore. All the older backup files are located in the directory
/tmp/backuplbcontroller/.
Run the following commands to restore the older database backups:
sudo ls /tmp/backuplbcontroller/
sudo cp /tmp/backuplbcontroller/lbCtrlData_<db version to restore>.db lbCtrlData.db