4 Fault Recovery

This chapter describes the fault recovery procedures for various failure scenarios.

Kubernetes Cluster

This section describes the fault recovery procedures for various failure scenarios in a Kubernetes Cluster.

Recovering a Failed Bastion Host

This section describes the procedure to replace a failed Bastion Host.

Prerequisites
  • You must have login access to a Bastion Host.
Procedure

Note:

This procedure is applicable for a single Bastion Host failure only.
Perform one of the following procedures to replace a failed Bastion Host depending on your deployment model:
  1. Replacing a Failed Bastion Host in BareMetal
  2. Replacing a Failed Bastion Host in OpenStack
  3. Replacing a Failed Bastion Host in VMware
Replacing a Failed Bastion Host in BareMetal
  1. Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then as per the Bastion HA feature, it must have become the active Bastion Host within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
    $ is_active_bastion

    Sample output:

    IS active-bastion

  2. Use SSH to log in to a working Bastion Host and run the following commands to deploy a new Bastion Host.

    Replace <new bastion node> in the following command with the name of the new Bastion Host that is being deployed.

    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
    $ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh  
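
    For example, assuming the new Bastion Host is named occne2-test-bastion-2 (a hypothetical name; use the actual name of your redeployed Bastion Host):

    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
    $ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=occne2-test-bastion-2 pipeline.sh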
  3. Update the Operating System (OS) to avoid certificate mismatch. To perform an OS update, follow the instructions specific to updating the OS in the Upgrading CNE section.

    Note:

    • Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
    • Ensure that you update only the Operating System and do not set OCCNE_NEW_VERSION.
  4. Reinstall the Bastion controller so that the new internal IP address is added to the configuration file:
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
Replacing a Failed Bastion Host in OpenStack
  1. Log in to OpenStack cloud using your credentials.
  2. From the Compute menu, select Instances, and locate the failed Bastion's instance that you want to replace.
  3. Click the drop-down list from the Actions column, and select Delete Instance to delete the failed Bastion host:

    Figure: Delete Failed Bastion Host in OpenStack

  4. Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then as per the Bastion HA feature, it must have become the active Bastion Host within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
    $ is_active_bastion

    Sample output:

    IS active-bastion

  5. Use SSH to log in to a working Bastion Host and run the following commands to create a new Bastion Host.

    Replace <new bastion node> in the following command with the name of the new Bastion Host that is being deployed.

    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
    $ source openrc.sh
    $ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
     $ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh
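
    Optionally, you can run a plan before the apply to confirm that only the Bastion Host resources are recreated (this mirrors the verification used in the controller node recovery procedures later in this chapter):

    $ tofu plan -var-file=$OCCNE_CLUSTER/cluster.tfvars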
  6. Update the Operating System (OS) to avoid certificate mismatch. To perform an OS update, follow the instructions specific to updating the OS in the Upgrading CNE section.

    Note:

    • Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
    • Ensure that you update only the Operating System and do not set OCCNE_NEW_VERSION.
  7. Reinstall the Bastion controller so that the new internal IP address is added to the configuration file:
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
Replacing a Failed Bastion Host in VMware
  1. Log in to VMware cloud using your credentials.
  2. From the Compute menu, select Virtual Machines, and locate the failed Bastion's VM that you want to replace.
  3. From the Actions menu, select Power → Power Off.
  4. From the Actions menu, select Delete to delete the failed Bastion Host:

    Figure: Delete Failed Bastion Host in VMware

  5. Use SSH to log in to a working Bastion Host. If the working Bastion Host was a standby Bastion Host, then as per the Bastion HA feature, it must have become the active Bastion Host within 10 seconds. To verify this, run the following command and check if the output is IS active-bastion:
    $ is_active_bastion

    Sample output:

    IS active-bastion

  6. Use SSH to log in to a working Bastion Host and run the following commands to deploy a new Bastion Host.

    Replace <new bastion node> in the following command with the name of the new Bastion Host that is being deployed.

    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
    $ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
    $ OCCNE_CONTAINERS=(PROV) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS=--limit=<new bastion node> pipeline.sh      
  7. Update the Operating System (OS) to avoid certificate mismatch. To perform an OS update, follow the instructions specific to updating the OS in the Upgrading CNE section.

    Note:

    • Before updating the Operating System, ensure that you read through the prerequisites thoroughly.
    • Ensure that you update only the Operating System and do not set OCCNE_NEW_VERSION.
  8. Reinstall the Bastion controller so that the new internal IP address is added to the configuration file:
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(REMOVE)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES="(DEPLOY)" OCCNE_ARGS='--tags=bastion-controller' pipeline.sh

Validating Bastion Host Recovery

After recovering a failed Bastion Host, check the occne-bastion-controller logs to monitor and observe the change in the behavior of the cluster. Run the following command to fetch the occne-bastion-controller logs:
$ kco logs deploy/occne-bastion-controller -f
Sample output:
2024-07-18 06:28:45,003 7f79c6131740 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:45,101 7f79c6131740 MONITOR:INFO: Initializing DB /data/db/ctrlData.db [/modules/monData.py:132]
2024-07-18 06:28:45,102 7f79c6131740 MONITOR:INFO: Creating DB directory /data/db [/modules/monData.py:86]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: Bastion Monitor db created [/modules/monData.py:123]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: DB file /data/db/ctrlData.db initialized [/modules/monData.py:140]
2024-07-18 06:28:45,116 7f79c6131740 MONITOR:INFO: Starting monitor and app http server [/modules/startControllerProc.py:17]
2024-07-18 06:28:47,203 7f5fa9c69740 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:47,211 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
2024-07-18 06:28:47,303 7f5fa8a7d640 MONITOR:INFO: Detecting Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' to begin monitoring [/modules/monitor.py:98]
2024-07-18 06:28:47,303 7f5fa3fff640 MONITOR:INFO: Detecting Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' to begin monitoring [/modules/monitor.py:98]
[2024-07-18 06:28:47 +0000] [8] [INFO] Starting gunicorn 22.0.0
[2024-07-18 06:28:47 +0000] [8] [INFO] Listening at: http://0.0.0.0:8000 (8)
[2024-07-18 06:28:47 +0000] [8] [INFO] Using worker: gevent
[2024-07-18 06:28:47 +0000] [11] [INFO] Booting worker with pid: 11
2024-07-18 06:28:48,305 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-18 06:28:48,401 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-18 06:28:49,805 7f6907eb7800 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 06:28:58,909 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-18 06:28:58,910 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 0 retries [/modules/monitor.py:162]
2024-07-18 06:28:58,915 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-18 06:29:09,511 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.53/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-18 06:29:09,512 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-18 06:29:09,514 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 1 retries [/modules/monitor.py:162]
2024-07-18 06:29:20,012 7f5fa3fff640 MONITOR:INFO: No HEALTHY standby bastions found for switchover: 2 retries [/modules/monitor.py:162]
2024-07-18 06:29:32,822 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
2024-07-18 06:29:32,902 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
[2024-07-18 14:48:53 +0000] [11] [INFO] Autorestarting worker after current request.
[2024-07-18 14:48:54 +0000] [11] [INFO] Worker exiting (pid: 11)
[2024-07-18 14:48:54 +0000] [22738] [INFO] Booting worker with pid: 22738
2024-07-18 14:48:56,419 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-18 19:37:48,413 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /yum/OracleLinux/OL9/ol9_appstream/repodata/repomd.xml (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8a96fd0>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=0 [/modules/monitor.py:125]
2024-07-18 19:38:03,432 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8250040>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=1 [/modules/monitor.py:125]
2024-07-18 19:38:03,447 7f5fa8a7d640 MONITOR:INFO: Switched Active bastion from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:155]
2024-07-18 19:38:13,459 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:173]
2024-07-18 19:38:18,469 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa82503a0>, 'Connection to 192.168.1.57 timed out. (connect timeout=5)')): retries=2 [/modules/monitor.py:125]
2024-07-18 19:38:46,598 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8243100>: Failed to establish a new connection: [Errno 113] No route to host')): retries=4 [/modules/monitor.py:125]
2024-07-18 19:39:39,078 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8250400>: Failed to establish a new connection: [Errno 113] No route to host')): retries=8 [/modules/monitor.py:125]
2024-07-18 19:41:24,038 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8243d60>: Failed to establish a new connection: [Errno 113] No route to host')): retries=16 [/modules/monitor.py:125]
2024-07-18 19:44:53,958 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8a8be50>: Failed to establish a new connection: [Errno 113] No route to host')): retries=32 [/modules/monitor.py:125]
2024-07-18 19:51:53,989 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.57', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8260040>: Failed to establish a new connection: [Errno 113] No route to host')): retries=64 [/modules/monitor.py:125]
2024-07-18 20:04:26,519 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is 'FAILED' with exception=URL https://192.168.1.57/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=128 [/modules/monitor.py:125]
2024-07-18 20:23:53,620 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
[2024-07-18 23:08:53 +0000] [22738] [INFO] Autorestarting worker after current request.
[2024-07-18 23:08:54 +0000] [22738] [INFO] Worker exiting (pid: 22738)
[2024-07-18 23:08:54 +0000] [44780] [INFO] Booting worker with pid: 44780
2024-07-18 23:08:56,406 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-19 00:01:00,707 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
[2024-07-19 07:28:53 +0000] [44780] [INFO] Autorestarting worker after current request.
[2024-07-19 07:28:54 +0000] [44780] [INFO] Worker exiting (pid: 44780)
[2024-07-19 07:28:55 +0000] [67513] [INFO] Booting worker with pid: 67513
2024-07-19 07:28:57,414 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-19 15:48:53 +0000] [67513] [INFO] Autorestarting worker after current request.
[2024-07-19 15:48:54 +0000] [67513] [INFO] Worker exiting (pid: 67513)
[2024-07-19 15:48:54 +0000] [90200] [INFO] Booting worker with pid: 90200
2024-07-19 15:48:59,803 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 00:08:53 +0000] [90200] [INFO] Autorestarting worker after current request.
[2024-07-20 00:08:54 +0000] [90200] [INFO] Worker exiting (pid: 90200)
[2024-07-20 00:08:54 +0000] [112911] [INFO] Booting worker with pid: 112911
2024-07-20 00:08:56,314 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 08:28:53 +0000] [112911] [INFO] Autorestarting worker after current request.
[2024-07-20 08:28:53 +0000] [112911] [INFO] Worker exiting (pid: 112911)
[2024-07-20 08:28:54 +0000] [135612] [INFO] Booting worker with pid: 135612
2024-07-20 08:28:56,309 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-20 16:48:53 +0000] [135612] [INFO] Autorestarting worker after current request.
[2024-07-20 16:48:54 +0000] [135612] [INFO] Worker exiting (pid: 135612)
[2024-07-20 16:48:54 +0000] [158319] [INFO] Booting worker with pid: 158319
2024-07-20 16:48:56,514 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 01:08:53 +0000] [158319] [INFO] Autorestarting worker after current request.
[2024-07-21 01:08:54 +0000] [158319] [INFO] Worker exiting (pid: 158319)
[2024-07-21 01:08:54 +0000] [181019] [INFO] Booting worker with pid: 181019
2024-07-21 01:08:56,902 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 09:28:53 +0000] [181019] [INFO] Autorestarting worker after current request.
[2024-07-21 09:28:53 +0000] [181019] [INFO] Worker exiting (pid: 181019)
[2024-07-21 09:28:54 +0000] [203715] [INFO] Booting worker with pid: 203715
2024-07-21 09:28:56,216 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-21 17:48:53 +0000] [203715] [INFO] Autorestarting worker after current request.
[2024-07-21 17:48:54 +0000] [203715] [INFO] Worker exiting (pid: 203715)
[2024-07-21 17:48:54 +0000] [226420] [INFO] Booting worker with pid: 226420
2024-07-21 17:48:56,717 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 02:08:53 +0000] [226420] [INFO] Autorestarting worker after current request.
[2024-07-22 02:08:54 +0000] [226420] [INFO] Worker exiting (pid: 226420)
[2024-07-22 02:08:55 +0000] [249119] [INFO] Booting worker with pid: 249119
2024-07-22 02:08:58,114 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 10:28:53 +0000] [249119] [INFO] Autorestarting worker after current request.
[2024-07-22 10:28:54 +0000] [249119] [INFO] Worker exiting (pid: 249119)
[2024-07-22 10:28:54 +0000] [271794] [INFO] Booting worker with pid: 271794
2024-07-22 10:28:56,918 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
[2024-07-22 18:48:53 +0000] [271794] [INFO] Autorestarting worker after current request.
[2024-07-22 18:48:54 +0000] [271794] [INFO] Worker exiting (pid: 271794)
[2024-07-22 18:48:54 +0000] [294486] [INFO] Booting worker with pid: 294486
2024-07-22 18:48:56,513 7f6907eb6980 MONITOR:INFO: Logger set to level=20 INFO [/modules/logger.py:39]
2024-07-22 21:21:19,625 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.57' to '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' [/modules/monitor.py:173]
2024-07-22 21:22:30,328 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8277340>, 'Connection to 192.168.1.53 timed out. (connect timeout=5)')): retries=0 [/modules/monitor.py:125]
2024-07-22 21:22:45,350 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f5fa8204850>, 'Connection to 192.168.1.53 timed out. (connect timeout=5)')): retries=1 [/modules/monitor.py:125]
2024-07-22 21:22:45,370 7f5fa3fff640 MONITOR:INFO: Switched Active bastion from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:155]
2024-07-22 21:22:55,376 7f5fa9c69740 MONITOR:INFO: Detected Active changed from '192.168.1.53' to '192.168.1.57' 'occne2-j-jorge-l-lopez-bastion-1' [/modules/monitor.py:173]
2024-07-22 21:22:58,566 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa82775b0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=2 [/modules/monitor.py:125]
2024-07-22 21:23:24,805 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8208730>: Failed to establish a new connection: [Errno 113] No route to host')): retries=4 [/modules/monitor.py:125]
2024-07-22 21:24:17,286 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8277160>: Failed to establish a new connection: [Errno 113] No route to host')): retries=8 [/modules/monitor.py:125]
2024-07-22 21:26:02,566 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8250d30>: Failed to establish a new connection: [Errno 113] No route to host')): retries=16 [/modules/monitor.py:125]
2024-07-22 21:29:32,486 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa827bac0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=32 [/modules/monitor.py:125]
2024-07-22 21:36:33,222 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa8a96af0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=64 [/modules/monitor.py:125]
2024-07-22 21:50:34,502 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.53' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=HTTPSConnectionPool(host='192.168.1.53', port=443): Max retries exceeded with url: /REPO_READY (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5fa827bbe0>: Failed to establish a new connection: [Errno 113] No route to host')): retries=128 [/modules/monitor.py:125]
2024-07-22 21:52:03,240 7f5fa8a7d640 MONITOR:INFO: Detecting Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' to begin monitoring [/modules/monitor.py:98]
2024-07-22 21:52:03,705 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=0 [/modules/monitor.py:125]
2024-07-22 21:52:15,322 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=1 [/modules/monitor.py:125]
2024-07-22 21:52:27,117 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=2 [/modules/monitor.py:125]
2024-07-22 21:52:50,901 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=4 [/modules/monitor.py:125]
2024-07-22 21:53:38,012 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=8 [/modules/monitor.py:125]
2024-07-22 21:55:13,814 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=16 [/modules/monitor.py:125]
2024-07-22 21:58:22,008 7f5fa3fff640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is 'FAILED' with exception=URL https://192.168.1.21/REPO_READY REPO_READY Flag gave unexpected result=404 != 200: retries=32 [/modules/monitor.py:125]
2024-07-22 21:58:47,310 7f5fa8a7d640 MONITOR:INFO: Bastion '192.168.1.21' 'occne2-j-jorge-l-lopez-bastion-2' is reachable: set state to 'HEALTHY' [/modules/monitor.py:115]
When the pipeline.sh script runs, you will observe several Ansible log lines showing the progress of provisioning and configuring the new node. The following code block provides a sample output of this scenario:
Monday 22 July 2024  21:58:42 +0000 (0:00:00.681)       0:00:11.112 ***********
===============================================================================
repo_backup : Ensure proper ownership of /var/occne directory ------------------------ 3.65s
Gathering Facts ---------------------------------------------------------------------- 1.18s
repo_backup : Copy OCCNE var dir to target bastion, may take 10s of minutes if repository is populated (registry, clusters, admin.conf, etc) --- 1.02s
Gathering Facts ---------------------------------------------------------------------- 0.79s
Install bash completion -------------------------------------------------------------- 0.70s
update_banner : Restart sshd after Banner update ------------------------------------- 0.68s
repo_backup : Ensure proper ownership of /var/www/html/occne directory --------------- 0.64s
repo_backup : Copy OCCNE html dir to target bastion (pxe, downloads, helm, etc) ------ 0.52s
update_banner : Restore the original Banner setting ---------------------------------- 0.37s
repo_backup : Update /var/occne/cluster/occne2-j-jorge-l-lopez/occne.ini repo lines to point to 'this' repo host --- 0.35s
Flag HTTP repo as ready -------------------------------------------------------------- 0.32s
update_banner : Assert that OCCNE banner is absent in sshd_config -------------------- 0.23s
identify : Determine set of occne kvm guest nodes hosted by this node ---------------- 0.11s
identify : set_fact ------------------------------------------------------------------ 0.06s
identify : Flag localhost_self identity ---------------------------------------------- 0.05s
identify : create localhost_kvm_host_group group ------------------------------------- 0.05s
identify : Create localhost_self_group group ----------------------------------------- 0.05s
identify : set occne_user ------------------------------------------------------------ 0.04s
identify : Dump self-identity variables ---------------------------------------------- 0.04s
identify : Determine set of k8s guest nodes hosted by this node ---------------------- 0.03s
WARN[0012] image used by 0cf93ed451350a2e0c1173956b0fc903f9d887a2fef898d052295054cd2cf6b6: image is in use by a container: consider listing external containers and force-removing image
+ sudo podman image rm -fi winterfell:5000/occne/provision:24.2.0
Untagged: winterfell:5000/occne/provision:24.2.0
Deleted: 2b96b90fc39f54920518de792deec48d7907d81219ac5f519ec2c78ba99e5c99
Skipping: PROV REMOVE
-POST Post Processing Finished
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Mon Jul 22 05:58:50 PM EDT 2024
artifacts/pipeline.sh completed successfully
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

After monitoring the logs, run the is_active_bastion and get_other_bastions commands to verify the Bastion Hosts that are currently included in the cluster.
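
For example, run the following commands from the active Bastion Host. The is_active_bastion command reports whether the current host is the active Bastion Host (as shown earlier in this procedure), and get_other_bastions lists the other Bastion Hosts in the cluster (its exact output depends on your deployment):
$ is_active_bastion
IS active-bastion
$ get_other_bastions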

Recovering a Failed Kubernetes Controller Node

This section describes the procedure to recover a single failed Kubernetes controller node in vCNE deployments.

Prerequisites

  • You must have login access to a Bastion Host.
  • You must have login access to the cloud GUI.

Note:

  • This procedure is applicable for vCNE (OpenStack and VMware) deployments only. CNE doesn't support recovering a failed Kubernetes controller node in BareMetal deployments.
  • This procedure is applicable for replacing a single Kubernetes controller node only.
  • Control Node 1 (member of etcd1) requires specific steps to be performed. Be mindful when you are replacing this node.
  • If you are using a CNLB based vCNE deployment, then be aware of the following points:
    • A minimum of three control nodes (control plane and etcd hosts) are required to maintain the cluster's high availability and responsiveness. However, the cluster can still operate with an even number of control nodes, though it is not recommended for a long period of time.
    • Some maintenance procedures, such as the CNE standard upgrade and cluster update procedures, are not supported after removing a control node from a cluster with an even number of controller nodes. In such cases, you must add a new node before performing those procedures.
    • If you are replacing the etcd1 control node, then this procedure makes etcd2 the new etcd1 by default.
      For example, the following code block shows the controller nodes before replacing etcd1 control node:
      etcd1: occne1-test-k8s-ctrl-1
      etcd2: occne1-test-k8s-ctrl-2
      etcd3: occne1-test-k8s-ctrl-3
      The following code block shows the controller nodes after replacing etcd1 control node:
      etcd1: occne1-test-k8s-ctrl-2
      etcd2: occne1-test-k8s-ctrl-3
      etcd3: occne1-test-k8s-ctrl-1
      Therefore, ensure that all IPs related to the etcd1 control plane IP are replaced with the working etcd2 controller node IPs, as indicated in the applicable steps in the procedure.
Recovering a Failed Kubernetes Controller Node in an OpenStack Deployment

This section describes the procedure to recover a failed Kubernetes controller node in an OpenStack deployment.

Procedure
  1. Use SSH to log in to the Bastion Host and remove the failed Kubernetes controller node by following the procedure described in the Removing a Controller Node in OpenStack Deployment section.

    Take note of the internal IP addresses of all the controller nodes and the etcd member number (etcd1, etcd2, or etcd3) of the failed controller node. Also take note of the IPs and hostnames of the other working control nodes.
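
    For example, a quick way to capture the internal IP addresses and hostnames of all controller nodes is to run the following command from the Bastion Host (the INTERNAL-IP column typically matches the addresses used by etcd and the kube-apiserver in the later steps of this procedure):

    $ kubectl get nodes -o wide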

  2. Use the original terraform file to create a new controller node VM:

    Note:

    To revert the switch changes, perform this step only if the failed control node was a member of etcd1.
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    $ mv terraform.tfstate /tmp
    $ cp ${OCCNE_CLUSTER}/terraform.tfstate.backup terraform.tfstate
  3. Depending on the type of Load Balancer used, run one of the following steps to create a new Controller Node Instance within the cloud:
    • Run the following Terraform commands if you are using an LBVM based OpenStack deployment:

      Note:

      Before running the terraform apply command in the following code block, ensure that no other resources are currently being modified. If resources that are not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
      $ source openrc.sh
      $ terraform plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
      $ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
    • Run the following Tofu commands if you are using a CNLB based OpenStack deployment:

      Note:

      • Use the tofu plan command to verify that no resources other than the controller node that is currently being replaced are affected.
      • Before running the tofu apply command in the following code block, ensure that no other resources are currently being modified. If resources that are not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}
      $ source openrc.sh
      $ tofu plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
      $ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
      
  4. Switch the order of the controller nodes in the terraform state file.

    Note:

    Perform this step only if the failed control node was a member of etcd1.
    $ cd /var/occne/cluster/$OCCNE_CLUSTER
    $ python3 scripts/switchTfstate.py
    For example:
    [cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
    Sample output:
    Beginning tfstate switch order k8s control nodes
      
            terraform.tfstate.lastversion created as backup
      
    Controller Nodes order before rotation:
    occne7-test-k8s-ctrl-1
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
      
    Controller Nodes order after rotation:
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
    occne7-test-k8s-ctrl-1
      
    Success: terraform.tfstate rotated for cluster occne7-test
  5. Replace the failed kube_control_plane IP with the IP of a working control node.

    Note:

    • Perform this step only if the failed control node was a member of etcd1.
    • If you are using a CNLB based deployment and etcd1 is replaced, then update the IP with the previous IP of etcd2.
    $ kubectl edit cm -n kube-public cluster-info
    Sample output:
    .
    .
    server: https://<working control node IP address>:6443
    .
    .
  6. Log in to the OpenStack GUI using your credentials and note the replaced node's internal IP address and hostname. In most cases, the new IP address and hostname remain the same as the ones before deletion. The new IP address and hostname are referred to as replaced_node_ip and replaced_node_hostname in the remaining procedure.
  7. Run the following command from Bastion Host to configure the replaced control node OS:
    OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=<replaced_node_hostname> artifacts/pipeline.sh
    For example:
    OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=occne7-test-k8s-ctrl-1 artifacts/pipeline.sh
  8. Update the /etc/hosts file on the Bastion Host with the replaced_node_ip and replaced_node_hostname. Ensure that there are two matching entries, that is, <replaced_node_hostname> must be mapped to <replaced_node_ip>.
    $ sudo vi /etc/hosts
    Sample output:
    192.168.202.232  occne7-test-k8s-ctrl-1.novalocal  occne7-test-k8s-ctrl-1
    192.168.202.232 lb-apiserver.kubernetes.local
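
    To confirm that the Bastion Host resolves both entries to the new IP address, you can optionally run:

    $ getent hosts <replaced_node_hostname> lb-apiserver.kubernetes.local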
  9. Use SSH to log in to each controller node in the cluster, except the controller node that is newly created, and run the following commands as a root user to update replaced_node_ip in the kube-apiserver.yaml, kubeadm-config.yaml, and hosts files:

    Note:

    • Ensure that you don't delete or change any other IPs related to the running control nodes.
    • The following commands replace Node1. If you are replacing other nodes, the order of IPs may change. This also applies to the later steps in this procedure.
    • kube-apiserver.yaml:
      vi /etc/kubernetes/manifests/kube-apiserver.yaml
      Sample output:
      - --etcd-servers=https://192.168.202.232:2379,https://192.168.203.194:2379,https://192.168.200.115:2379
    • kubeadm-config.yaml:
      $ vi /etc/kubernetes/kubeadm-config.yaml
      Sample output:
      etcd:
        external:
            endpoints:
            - https://<replaced_node_ip>:2379
            - https://192.168.203.194:2379
            - https://192.168.200.115:2379
       
      ------------------------------------
       
        certSANs:
        - kubernetes
        - kubernetes.default
        - kubernetes.default.svc
        - kubernetes.default.svc.occne7-test
        - 10.233.0.1
        - localhost
        - 127.0.0.1
        - occne7-test-k8s-ctrl-1
        - occne7-test-k8s-ctrl-2
        - occne7-test-k8s-ctrl-3
        - lb-apiserver.kubernetes.local
        - <replaced_node_ip>
        - 192.168.203.194
        - 192.168.200.115
        - localhost.localdomain
        timeoutForControlPlane: 5m0s
    • hosts:
      $ vi /etc/hosts
      Sample output:
         <replaced_node_ip>  occne7-test-k8s-ctrl-1.novalocal  occne7-test-k8s-ctrl-1
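
    After editing these files on each controller node, you can optionally confirm that the new IP address is present in all three files, for example:

    $ sudo grep -n "<replaced_node_ip>" /etc/kubernetes/manifests/kube-apiserver.yaml /etc/kubernetes/kubeadm-config.yaml /etc/hosts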
  10. Run the following command on a Bastion Host to update all instances of <replaced_node_ip>. If the failed controller node was a member of etcd1, then update the controlPlaneEndpoint value with the IP address of the working controller node (that is, from ctrl-1 to ctrl-2):
    $ kubectl edit configmap kubeadm-config -n kube-system
    Sample output:
    apiServer:
          certSANs:
          - kubernetes
          - kubernetes.default
          - kubernetes.default.svc
          - kubernetes.default.svc.occne7-test
          - 10.233.0.1
          - localhost
          - 127.0.0.1
          - occne7-test-k8s-ctrl-1
          - occne7-test-k8s-ctrl-2
          - occne7-test-k8s-ctrl-3
          - lb-apiserver.kubernetes.local
          - <replaced_node_ip>
          - 192.168.203.194
          - 192.168.200.115
          - localhost.localdomain
     
    ----------------------------------------------------
        # If etcd1 was replaced, please update IP with previous etcd2 IP.
        controlPlaneEndpoint: <working_node_ip>:6443 #Only update if was part of etcd1
     
    ----------------------------------------------------
     
        etcd:
          external:
            caFile: /etc/ssl/etcd/ssl/ca.pem
            certFile: /etc/ssl/etcd/ssl/node-occne7-test-k8s-ctrl-1.pem
            endpoints:
            - https://<replaced_node_ip>:2379
            - https://192.168.203.194:2379
            - https://192.168.200.115:2379
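
    To review the edited ConfigMap without reopening the editor, you can optionally run:

    $ kubectl get configmap kubeadm-config -n kube-system -o yaml | grep -E "controlPlaneEndpoint|2379"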
  11. Run the cluster.yml playbook from Bastion Host 1 to add the new controller node into the cluster:

    Note:

    Replace <occne password> with the CNE password and <central repo> with the name of your central repository.
    $ podman run -it --rm --rmi --network host --name DEPLOY_$OCCNE_CLUSTER -v /var/occne/cluster/$OCCNE_CLUSTER:/host -v /var/occne:/var/occne:rw -e OCCNE_vCNE=openstack -e OCCNEINV=/host/hosts -e 'PLAYBOOK=/kubespray/cluster.yml' -e 'OCCNEARGS=--extra-vars={"occne_userpw":"<occne password>"} --extra-vars=occne_hostname=$OCCNE_CLUSTER-bastion-1 -i /host/occne.ini' <central repo>:5000/occne/k8s_install:$OCCNE_VERSION bash
     
    $ set -e
     
    $ /copyHosts.sh ${OCCNEINV}
     
    $ ansible-playbook -i /kubespray/inventory/occne/hosts --become --private-key /host/.ssh/occne_id_rsa /kubespray/cluster.yml ${OCCNEARGS}
     
    $ exit
  12. Verify that the new controller node is added to the cluster using the following command:
    $ kubectl get node
    Sample output:
    NAME                               STATUS   ROLES                  AGE     VERSION
    occne7-test-k8s-ctrl-1   Ready    control-plane,master   30m     v1.22.5
    occne7-test-k8s-ctrl-2   Ready    control-plane,master   2d19h   v1.22.5
    occne7-test-k8s-ctrl-3   Ready    control-plane,master   2d19h   v1.22.5
    occne7-test-k8s-node-1   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-2   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-3   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-4   Ready    <none>                 2d19h   v1.22.5
  13. Perform the following steps to validate the addition to the etcd cluster using etcdctl:
    1. From the Bastion Host, use SSH to log in to a working control node:
      $ ssh <working control node hostname>
      For example:
      $ ssh occne7-test-k8s-ctrl-2
    2. Switch to the root user:
      $ sudo su
      For example:
      [cloud-user@occne7-test-k8s-ctrl-2]# sudo su
    3. Source /etc/etcd.env:
      $ source /etc/etcd.env
      For example:
      [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
    4. Run the following command to list the etcd members:
      $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
      For example:
      [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
      Sample output:
      52513ddd2aa49770, started, etcd1, https://192.168.202.232:2380, https://192.168.201.158:2379, false
      f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
      80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false
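
      Optionally, you can also check the health of each etcd endpoint from the same session:

      $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE endpoint health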
Recovering a Failed Kubernetes Controller Node in a VMware Deployment

This section describes the procedure to recover a failed Kubernetes controller node in a VMware deployment.

Procedure
  1. Use SSH to log in to the Bastion Host and remove the failed Kubernetes controller node by following the procedure described in the Removing a Controller Node in VMware Deployment section.

    Note the internal IP addresses of all the controller nodes and the etcd member number (etcd1, etcd2, or etcd3) of the failed controller node. Also take note of the IPs and hostnames of the other working control nodes.

  2. Use the original terraform file to create a new controller node VM:

    Note:

    To revert the switch changes, perform this step only if the failed control node was a member of etcd1.
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}
    $ mv terraform.tfstate /tmp
    $ cp ${OCCNE_CLUSTER}/terraform.tfstate.backup terraform.tfstate
  3. Depending on the type of Load Balancer used, run one of the following steps to create a new Controller Node Instance within the cloud:
    • Run the following Terraform commands if you are using an LBVM based VMware deployment:

      Note:

      Before running the terraform apply command in the following code block, ensure that no other resources are currently being modified. If resources that are not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
      $ source openrc.sh
      $ terraform plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
      $ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
    • Run the following Tofu commands if you are using a CNLB based VMware deployment:

      Note:

      Before running the tofu apply command in the following code block, ensure that no other resources are currently being modified. If resources that are not related to this procedure are being modified, then abort this procedure and verify the .tfstate file.
      $ cd /var/occne/cluster/$OCCNE_CLUSTER/
      $ tofu plan -var-file=$OCCNE_CLUSTER/cluster.tfvars
      $ tofu apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
      
  4. Switch the order of the controller nodes in the terraform state file.

    Note:

    Perform this step only if the failed control node was a member of etcd1.
    $ cd /var/occne/cluster/$OCCNE_CLUSTER
    $ cp terraform.tfstate terraform.tfstate.original
    $ python3 scripts/switchTfstate.py
    For example:
    [cloud-user@occne7-test-bastion-1]$ python3 scripts/switchTfstate.py
    Sample output:
    Beginning tfstate switch order k8s control nodes
      
            terraform.tfstate.lastversion created as backup
      
    Controller Nodes order before rotation:
    occne7-test-k8s-ctrl-1
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
      
    Controller Nodes order after rotation:
    occne7-test-k8s-ctrl-2
    occne7-test-k8s-ctrl-3
    occne7-test-k8s-ctrl-1
      
    Success: terraform.tfstate rotated for cluster occne7-test
  5. Replace the failed kube_control_plane IP with the IP of a working control node.

    Note:

    • Perform this step only if the failed control node was a member of etcd1.
    • If you are using a CNLB based deployment and etcd1 is replaced, then update the IP with the previous IP of etcd2.
    $ kubectl edit cm -n kube-public cluster-info
    Sample output:
    .
    .
    server: https://<working control node IP address>:6443
    .
    .
  6. Log in to the VMware GUI using your credentials and note the replaced node's internal IP address and hostname. In most cases, the new IP address and hostname remain the same as the ones before deletion. The new IP address and hostname are referred to as replaced_node_ip and replaced_node_hostname in the remaining procedure.
  7. Run the following command from Bastion Host to configure the replaced control node OS:
    OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=<replaced_node_hostname> artifacts/pipeline.sh
    For example:
    OCCNE_CONTAINERS=(PROV) OCCNE_ARGS=--limit=occne7-test-k8s-ctrl-1 artifacts/pipeline.sh
  8. Update the /etc/hosts file on the Bastion Host with the replaced_node_ip and replaced_node_hostname. Ensure that there are two matching entries, that is, <replaced_node_hostname> must be mapped to <replaced_node_ip>.
    $ sudo vi /etc/hosts
    Sample output:
    192.168.202.232  occne7-test-k8s-ctrl-1.novalocal  occne7-test-k8s-ctrl-1
    192.168.202.232 lb-apiserver.kubernetes.local
  9. Use SSH to log in to each controller node in the cluster, except the controller node that is newly created, and run the following commands as a root user to update the replaced_node_ip in the kube-apiserver.yaml, kubeadm-config.yaml, and hosts files:

    Note:

    • Ensure that you don't delete or change any other IPs related to the running control nodes.
    • The following commands replace Node1. If you are replacing other nodes, the order of IPs may change. This also applies to the later steps in this procedure.
    • kube-apiserver.yaml:
      vi /etc/kubernetes/manifests/kube-apiserver.yaml
      Sample output:
      - --etcd-servers=https://192.168.202.232:2379,https://192.168.203.194:2379,https://192.168.200.115:2379
    • kubeadm-config.yaml:
      $ vi /etc/kubernetes/kubeadm-config.yaml
      Sample output:
      etcd:
        external:
            endpoints:
            - https://<replaced_node_ip>:2379
            - https://192.168.203.194:2379
            - https://192.168.200.115:2379
       
      ------------------------------------
       
        certSANs:
        - kubernetes
        - kubernetes.default
        - kubernetes.default.svc
        - kubernetes.default.svc.occne7-test
        - 10.233.0.1
        - localhost
        - 127.0.0.1
        - occne7-test-k8s-ctrl-1
        - occne7-test-k8s-ctrl-2
        - occne7-test-k8s-ctrl-3
        - lb-apiserver.kubernetes.local
        - <replaced_node_ip>
        - 192.168.203.194
        - 192.168.200.115
        - localhost.localdomain
        timeoutForControlPlane: 5m0s
    • hosts:
      $ vi /etc/hosts
      Sample output:
      <replaced_node_ip>  occne7-test-k8s-ctrl-1.novalocal  occne7-test-k8s-ctrl-1
  10. Run the following command on a Bastion Host to update all instances of <replaced_node_ip>. If the failed controller node was a member of etcd1, then update the controlPlaneEndpoint value with the IP address of the working controller node (that is, from ctrl-1 to ctrl-2):

    $ kubectl edit configmap kubeadm-config -n kube-system
    Sample output:
    apiServer:
          certSANs:
          - kubernetes
          - kubernetes.default
          - kubernetes.default.svc
          - kubernetes.default.svc.occne7-test
          - 10.233.0.1
          - localhost
          - 127.0.0.1
          - occne7-test-k8s-ctrl-1
          - occne7-test-k8s-ctrl-2
          - occne7-test-k8s-ctrl-3
          - lb-apiserver.kubernetes.local
          - <replaced_node_ip>
          - 192.168.203.194
          - 192.168.200.115
          - localhost.localdomain
     
    ----------------------------------------------------
        # If etcd1 was replaced, please update IP with previous etcd2 IP.
        controlPlaneEndpoint: <working_node_ip>:6443 #Only update if was part of etcd1
     
    ----------------------------------------------------
     
        etcd:
          external:
            caFile: /etc/ssl/etcd/ssl/ca.pem
            certFile: /etc/ssl/etcd/ssl/node-occne7-test-k8s-ctrl-1.pem
            endpoints:
            - https://<replaced_node_ip>:2379
            - https://192.168.203.194:2379
            - https://192.168.200.115:2379
  11. Run the cluster.yml playbook from Bastion Host 1 to add the new controller node into the cluster:

    Note:

    Replace <occne password> with the CNE password and <central repo> with the name of your central repository.
    $ podman run -it --rm --rmi --network host --name DEPLOY_$OCCNE_CLUSTER -v /var/occne/cluster/$OCCNE_CLUSTER:/host -v /var/occne:/var/occne:rw -e OCCNE_vCNE=vcd -e OCCNEINV=/host/hosts -e 'PLAYBOOK=/kubespray/cluster.yml' -e 'OCCNEARGS=--extra-vars={"occne_userpw":"<occne password>"} --extra-vars=occne_hostname=$OCCNE_CLUSTER-bastion-1 -i /host/occne.ini' <central repo>:5000/occne/k8s_install:$OCCNE_VERSION bash
    
    $ set -e
     
    $ /copyHosts.sh ${OCCNEINV}
     
    $ ansible-playbook -i /kubespray/inventory/occne/hosts --become --private-key /host/.ssh/occne_id_rsa /kubespray/cluster.yml ${OCCNEARGS}
     
    $ exit
  12. Verify that the new controller node is added to the cluster using the following command:
    $ kubectl get node
    Sample output:
    NAME                               STATUS   ROLES                  AGE     VERSION
    occne7-test-k8s-ctrl-1   Ready    control-plane,master   30m     v1.22.5
    occne7-test-k8s-ctrl-2   Ready    control-plane,master   2d19h   v1.22.5
    occne7-test-k8s-ctrl-3   Ready    control-plane,master   2d19h   v1.22.5
    occne7-test-k8s-node-1   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-2   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-3   Ready    <none>                 2d19h   v1.22.5
    occne7-test-k8s-node-4   Ready    <none>                 2d19h   v1.22.5
  13. Perform the following steps to validate the addition to the etcd cluster using etcdctl:
    1. From the Bastion Host, use SSH to log in to a working control node:
      $ ssh <working control node hostname>
      For example:
      $ ssh occne7-test-k8s-ctrl-2
    2. Switch to the root user:
      $ sudo su
      For example:
      [cloud-user@occne7-test-k8s-ctrl-2]# sudo su
    3. Source /etc/etcd.env:
      $ source /etc/etcd.env
      For example:
      [root@occne7-test-k8s-ctrl-2 cloud-user]# source /etc/etcd.env
    4. Run the following command to list the etcd members:
      $ /usr/local/bin/etcdctl --endpoints https://<working control node IP address>:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
      For example:
      [root@occne7-test-k8s-ctrl-2 cloud-user]# /usr/local/bin/etcdctl --endpoints https://192.168.203.194:2379 --cacert=$ETCD_PEER_TRUSTED_CA_FILE --cert=$ETCD_CERT_FILE --key=$ETCD_KEY_FILE member list
      Sample output:
      52513ddd2aa49770, started, etcd1, https://192.168.202.232:2380, https://192.168.201.158:2379, false
      f1200d9975868073, started, etcd2, https://192.168.203.194:2380, https://192.168.203.194:2379, false
      80845fb2b5120458, started, etcd3, https://192.168.200.115:2380, https://192.168.200.115:2379, false

Restoring the etcd Database

This section describes the procedure to restore etcd cluster data from the backup.

Prerequisites
  1. A backup copy of the etcd database must be available. For the procedure to create a backup of your etcd database, refer to the "Performing an etcd Data Backup" section of Oracle Communications Cloud Native Core, Cloud Native Environment User Guide.
  2. At least one Kubernetes controller node must be operational.
Procedure
  1. Find the Kubernetes controller hostnames: Run the following command to get the names of the Kubernetes controller nodes.
    $ kubectl get nodes
    Sample output:
    NAME                            STATUS   ROLES                  AGE    VERSION
    occne3-my-cluster-k8s-ctrl-1   Ready    control-plane,master   4d1h   v1.23.7
    occne3-my-cluster-k8s-ctrl-2   Ready    control-plane,master   4d1h   v1.23.7
    occne3-my-cluster-k8s-ctrl-3   Ready    control-plane,master   4d1h   v1.23.7
    occne3-my-cluster-k8s-node-1   Ready    <none>                 4d1h   v1.23.7
    occne3-my-cluster-k8s-node-2   Ready    <none>                 4d1h   v1.23.7
    occne3-my-cluster-k8s-node-3   Ready    <none>                 4d1h   v1.23.7
    occne3-my-cluster-k8s-node-4   Ready    <none>                 4d1h   v1.23.7

    You must restore the etcd data on any one of the controller nodes that is in the Ready state. From the output, note the name of a controller node in the Ready state on which to restore the etcd data.

  2. Run the etcd-restore script:
    1. On the Bastion Host, switch to the /var/occne/cluster/${OCCNE_CLUSTER}/artifacts directory:
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}/artifacts
    2. Run the etcd_restore.sh script:
      $ ./etcd_restore.sh
      On running the script, the system prompts you to enter the following details:
      • k8s-ctrl node: Enter the name of the controller node (noted in Step 1) on which you want to restore the etcd data.
      • Snapshot: Select the PVC snapshot that you want to restore from the list of PVC snapshots displayed.
      Example:
      $ ./artifacts/etcd_restore.sh
      Sample output:
      Enter the K8s-ctrl hostname to restore etcd backup: occne3-my-cluster-k8s-ctrl-1
       
      occne-etcd-backup pvc exists!
       
      occne-etcd-backup pvc is in bound state!
       
      Creating occne-etcd-backup pod
      pod/occne-etcd-backup created
       
      waiting for Pod to be in running state
       
      waiting for Pod to be in running state
       
      waiting for Pod to be in running state
       
      waiting for Pod to be in running state
       
      waiting for Pod to be in running state
       
      occne-etcd-backup pod is in running state!
       
      List of snapshots present on the PVC:
      snapshotdb.2022-11-14
      Enter the snapshot from the list which you want to restore: snapshotdb.2022-11-14
       
      Restoring etcd data backup
       
      Deprecated: Use `etcdutl snapshot restore` instead.
       
      2022-11-14T20:22:37Z info snapshot/v3_snapshot.go:248 restoring snapshot {"path": "snapshotdb.2022-11-14", "wal-dir": "default.etcd/member/wal", "data-dir": "default.etcd", "snap-dir": "default.etcd/member/snap", "stack": "go.etcd.io/etcd/etcdutl/v3/snapshot.(*v3Manager).Restore\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdutl/snapshot/v3_snapshot.go:254\ngo.etcd.io/etcd/etcdutl/v3/etcdutl.SnapshotRestoreCommandFunc\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdutl/etcdutl/snapshot_command.go:147\ngo.etcd.io/etcd/etcdctl/v3/ctlv3/command.snapshotRestoreCommandFunc\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/command/snapshot_command.go:129\ngithub.com/spf13/cobra.(*Command).execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:856\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:960\ngithub.com/spf13/cobra.(*Command).Execute\n\t/go/pkg/mod/github.com/spf13/cobra@v1.1.3/command.go:897\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.Start\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/ctl.go:107\ngo.etcd.io/etcd/etcdctl/v3/ctlv3.MustStart\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/ctlv3/ctl.go:111\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/etcdctl/main.go:59\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}
      2022-11-14T20:22:37Z info membership/store.go:141 Trimming membership information from the backend...
      2022-11-14T20:22:37Z info membership/cluster.go:421 added member {"cluster-id": "cdf818194e3a8c32", "local-member-id": "0", "added-peer-id": "8e9e05c52164694d", "added-peer-peer-urls": ["http://localhost:2380"]}
      2022-11-14T20:22:37Z info snapshot/v3_snapshot.go:269 restored snapshot {"path": "snapshotdb.2022-11-14", "wal-dir": "default.etcd/member/wal", "data-dir": "default.etcd", "snap-dir": "default.etcd/member/snap"}
       
      Removing etcd-backup-pod
      pod "occne-etcd-backup" deleted
       
      etcd-data-restore is successful!!
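      After the script reports that the restore is successful, you may want to confirm that the control plane is healthy again. The following is a minimal check (not part of the original procedure), assuming kubectl access from the Bastion Host:
      $ kubectl get nodes                  # all controller nodes are expected to report Ready
      $ kubectl get pods -n kube-system    # etcd and kube-apiserver pods are expected to be Running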

Recovering a Failed Kubernetes Worker Node

This section provides the manual procedures to replace a failed Kubernetes Worker Node for bare metal, OpenStack, and VMware.

Prerequisites

  • Kubernetes worker node must be taken out of service.
  • Bare metal server must be repaired and the same bare metal server must be added back into the cluster.
  • You must have credentials to access Openstack GUI.
  • You must have credentials to access VMware GUI or CLI.

Limitations

Some of the steps in these procedures must be run manually.

Recovering a Failed Kubernetes Worker Node in Bare Metal

This section describes the manual procedure to replace a failed Kubernetes Worker Node in a bare metal deployment.

Prerequisites

  • Kubernetes worker node must be taken out of service.
  • Bare metal server must be repaired and the same bare metal server must be added back into the cluster.

Procedure

Removing the Failed Worker Node
  1. Run the following commands to remove the Object Storage Daemon (OSD) from the worker node before removing the worker node from the Kubernetes cluster:

    Note:

    Remove one OSD at a time. Do not remove multiple OSDs at once. Check the cluster status between removing multiple OSDs.

    The following commands reference a sample rook_toolbox.yaml file (the Rook toolbox deployment manifest).

    # Note down the osd-id hosted on the worker node which is to be removed
    $ kubectl get pods -n rook-ceph -o wide |grep osd |grep <worker-node>
     
    # Scale down the rook-ceph-operator deployment and OSD deployment
    $ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
    $ kubectl -n rook-ceph scale deployment rook-ceph-osd-<ID> --replicas=0
      
    # Install the rook-ceph tool box
    $ kubectl create -f rook_toolbox.yaml
      
    # Connect to the rook-ceph toolbox using the following command:
    $ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
      
    # Once connected to the toolbox, check the ceph cluster status using the following commands:
    $ ceph status
    $ ceph osd status
    $ ceph osd tree
      
    # Mark the OSD deployment as out using the following commands and purge the OSD:
    $ ceph osd out osd.<ID>
    $ ceph osd purge <ID> --yes-i-really-mean-it
     
    # Verify that the OSD is removed from the node and ceph cluster status:
    $ ceph status
    $ ceph osd status
    $ ceph osd tree
     
    # Exit the rook-ceph toolbox
    $ exit
      
    # Delete the OSD deployments of the purged OSD
    $ kubectl delete deployment -n rook-ceph rook-ceph-osd-<ID>
      
    # Scale up the rook-ceph-operator deployment using the following command:
    $ kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=1
      
    # Remove the rook-ceph tool box deployment
    $ kubectl -n rook-ceph delete deploy/rook-ceph-tools
  2. Set CENTRAL_REPO, CENTRAL_REPO_REGISTRY_PORT, and NODE environment variables to allow podman commands to run on the Bastion Host:
    $ export NODE=<workernode-full-name>
    $ export CENTRAL_REPO=<central-repo-name>
    $ export CENTRAL_REPO_REGISTRY_PORT=<central-repo-port>
    
    Example:
    $ export NODE=k8s-6.delta.lab.us.oracle.com
    $ export CENTRAL_REPO=winterfell
    $ export CENTRAL_REPO_REGISTRY_PORT=5000
  3. Run one of the following commands to remove the old worker node:
    1. If the worker node is reachable from the Bastion Host:
      $ sudo podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--extra-vars="{'node':'${NODE}'}" -e 'PLAYBOOK=/kubespray/remove-node.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION}
    2. If the worker node is not reachable from the Bastion Host:
      $ sudo podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--extra-vars="{'node':'${NODE}','reset_nodes':false}" -e 'PLAYBOOK=/kubespray/remove-node.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION}

    A confirmation message is displayed, asking whether to remove the node. Enter "yes" at the prompt. This step takes several minutes, most of which is spent on the "Drain node except daemonsets resource" task (even if the node is unreachable).

  4. Run the following command to verify that the node was removed:
    $ kubectl get nodes

    Verify that the target worker node is no longer listed.

Adding Node to a Kubernetes Cluster

This section describes the procedure to add a new node to a Kubernetes cluster.

  1. Replace the node's settings in hosts.ini with the replacement node's settings (most likely only a MAC address change, if the node is a direct replacement). If you are adding a node, add it to hosts.ini in all the relevant places (the machine inventory section and the proper groups). A hypothetical inventory fragment is shown after this procedure.
  2. Set the environment variables (CENTRAL_REPO, CENTRAL_REPO_REGISTRY_PORT (if not 5000), and NODE) required to run the podman commands on the Bastion Host:
    export NODE=k8s-6.delta.lab.us.oracle.com
    export CENTRAL_REPO=winterfell
  3. Install the OS on the new target worker node:
    podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e OCCNEARGS=--limit=${NODE},localhost ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/provision:${OCCNE_VERSION}
  4. Run the following command to scale up the Kubernetes cluster with the new worker node:
    podman run -it --rm --cap-add=NET_ADMIN --network host -v /var/occne/cluster/${OCCNE_CLUSTER}:/host -v /var/occne:/var/occne:rw -e 'INSTALL=scale.yml' ${CENTRAL_REPO}:${CENTRAL_REPO_REGISTRY_PORT:-5000}/occne/k8s_install:${OCCNE_VERSION}
  5. Run the following command to verify the new node is up and running in the cluster:
    kubectl get nodes
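The layout of hosts.ini is cluster-specific. The following is a hypothetical fragment only, illustrating the kind of machine inventory entry and group membership referred to in step 1; the host name is reused from the example above, while the address, the MAC variable name, and the group name are placeholders that must match your existing inventory:
    # Hypothetical hosts.ini fragment -- adapt variable and group names to your own inventory
    k8s-6.delta.lab.us.oracle.com ansible_host=172.16.3.6 mac=AA:BB:CC:DD:EE:FF

    [kube_node]
    k8s-6.delta.lab.us.oracle.com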
Adding OSDs in a Ceph Cluster

This procedure sets up a ceph-osd daemon, configures it to use one drive, and configures the cluster to distribute data to the Object Storage Daemon (OSD). If your host has multiple drives, you may add an OSD for each drive by repeating this procedure. To add an OSD, create a data directory for it, mount a drive to that directory, add the OSD to the cluster, and then add it to the crush map. When you add the OSD to the crush map, consider the weight you give to the new OSD.
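When adding an OSD to the CRUSH map manually, the weight is typically set in proportion to the drive capacity (for example, 1.0 per TiB). The following is a minimal sketch only, run from inside the rook-ceph toolbox and assuming the new OSD ID is 4 and the host bucket is named node-4:
  $ ceph osd crush add osd.4 1.0 host=node-4    # place OSD 4 under host bucket node-4 with weight 1.0
  $ ceph osd tree                               # confirm the placement and weight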

  1. Connect to the rook-ceph toolbox using the following command:
    $ kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
  2. Before adding an OSD, make sure that the current OSD tree does not have any outliers that could become (nearly) full if the new CRUSH map decides to put even more Placement Groups (PGs) on that OSD:
    $ ceph osd df | sort -k 7 -n
    Use reweight-by-utilization to force PGs off the OSD:
    $ ceph osd test-reweight-by-utilization
      
    $ ceph osd reweight-by-utilization
    For optimal viewing, set up a tmux session with three panes (see the tmux sketch at the end of this procedure):
    • A pane running the "watch ceph -s" command to display the status of the Ceph cluster every 2 seconds.
    • A pane running the "watch ceph osd tree" command to display the status of the OSDs in the Ceph cluster every 2 seconds.
    • A pane to run the actual commands.
  3. To deploy an OSD, an available storage device must exist on which the OSD can be deployed.
    Run the following command to display an inventory of storage devices on all cluster hosts:
    $ ceph orch device ls
    A storage device is considered available if all of the following conditions are met:
    • The device must have no partitions.
    • The device must not have any LVM state.
    • The device must not be mounted.
    • The device must not contain a file system.
    • The device must not contain a Ceph BlueStore OSD.
    • The device must be larger than 5 GB.
    Note that Ceph does not provision an OSD on a device that is not available.
  4. To verify that the cluster is in a healthy state, connect to the Rook Toolbox and run the ceph status command. Confirm the following:
    • All mons must be in quorum.
    • A mgr must be in the active state.
    • At least one OSD must be in the active state.
    • If the health is not HEALTH_OK, investigate the warnings or errors.
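The three-pane tmux layout suggested in step 2 can be created in one command. This is only a convenience sketch; because the ceph commands run inside the rook-ceph toolbox, the watch panes exec into the toolbox deployment:
  $ tmux new-session -s ceph \; \
      split-window -h 'watch kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph -s' \; \
      split-window -v 'watch kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree' \; \
      select-pane -t 0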
Recovering a Failed Kubernetes Worker Node in OpenStack

This section describes the manual procedure to replace a failed Kubernetes Worker Node in an OpenStack deployment.

Prerequisites

  • You must have credentials to access Openstack GUI.

Procedure

  1. Perform the following steps to identify and remove the failed worker node:

    Note:

    Run all the commands as a cloud-user in the /var/occne/cluster/${OCCNE_CLUSTER} folder.
    1. Identify the node that is in a not ready, not reachable, or degraded state and note the node's IP address:
      kubectl get node -A -o wide
      Sample output:
      NAME                     STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                  KERNEL-VERSION                    CONTAINER-RUNTIME
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   178m   v1.23.7   192.168.1.92    192.168.1.92    Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   178m   v1.23.7   192.168.1.117   192.168.1.117   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   178m   v1.23.7   192.168.1.118   192.168.1.118   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-1   Ready    <none>                 176m   v1.23.7   192.168.1.135   192.168.1.135   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-2   Ready    <none>                 176m   v1.23.7   192.168.1.137   192.168.1.137   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-3   Ready    <none>                 176m   v1.23.7   192.168.1.136   192.168.1.136   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-4   Ready    <none>                 176m   v1.23.7   192.168.1.119   192.168.1.119   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
    2. Copy and take a backup of the original Terraform tfstate file:
      $ cp terraform.tfstate terraform.tfstate.bkp-orig
    3. On identifying the failed node, drain the node from the Kubernetes cluster:
      $ kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

      This command ignores DaemonSets and deletes any attached emptyDir data, as the failed worker node may have local storage volumes attached to it.

      Note:

      If this command runs without an error, move to Step e. Else, perform Step d.
    4. You may encounter errors after running Step c, or the pods may not detach from the node that is being deleted. For example:
      I0705 07:40:22.050275 409374 request.go:697] Waited for 1.083354916s due to client-side throttling, not priority and fairness, request: GET:https://lb-apiserver.kubernetes.local:6443/
      api/v1/namespaces/occne-infra/pods/occne-lb-controller-server-58f689459b-w95t
      You can ignore these errors and exit the terminal (CTRL+C). After exiting the terminal, run the following command to manually delete the node:
      $ kubectl delete node <node-name>
    5. Wait (about 10 to 15 minutes) until all pods are detached from the deleted node and assigned to other nodes that are in the Ready state. Skip to Step g if this step is successful. Some pods may remain in the Pending state due to insufficient memory; this is resolved when you add the new node to the cluster.
    6. If the previous step fails, perform the following steps to manually remove the pods that are running in the failed worker node:
      1. Identify the pods that are not in an active state and delete each of them by running the following command:
        $ kubectl delete pod --force <pod-name> -n <name-space>
        Repeat this step until all the pods are removed from the cluster.
      2. Run the following command to drain the node from the Kubernetes cluster:
        $ kubectl drain occne3-user-k8s-node-2 --force --ignore-daemonsets --delete-emptydir-data
    7. Verify if the failed node is removed from the cluster:
      $ kubectl get nodes
      Sample output:
      NAME                            STATUS   ROLES                  AGE   VERSION
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-node-1   Ready    <none>                 2d    v1.23.7
      occne3-user-k8s-node-3   Ready    <none>                 2d    v1.23.7
      occne3-user-k8s-node-4   Ready    <none>                 2d    v1.23.7

      Verify that the target worker node is no longer listed.

    8. Log in to the OpenStack GUI, then manually power off (if needed) and delete the node. At this stage, the node must already be removed from the Kubernetes cluster.
  2. Delete the failed node from OpenStack GUI:
    1. Log in to the Openstack GUI console by using your credentials.
    2. From the list of nodes displayed, locate and select the failed worker node.
    3. From the Actions menu in the last column of the record, select Delete Instance.
      Delete Worker Node from OpenStack GUI

    4. Reconfirm your action by clicking Delete Instance and wait for the node to be deleted.
  3. Run terraform apply to recreate and add the node into the Kubernetes cluster:
    1. Log in to the Bastion Host and switch to the cluster tools directory: /var/occne/cluster/${OCCNE_CLUSTER}.
    2. Run the following command to log in to the cloud using the openrc.sh script and provide the required details (username, password, and domain name):

      Example:

      $ source openrc.sh
      Sample output:
      Please enter your OpenStack Username for project Team-CNE: user@oracle.com
      Please enter your OpenStack Password for project Team-CNE as user : **************
      Please enter your OpenStack Domain for project Team-CNE: DSEE
      
    3. Run terraform apply to recreate the node:
      $ terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
    4. Locate the IP address of the newly created node in the terraform.tfstate file. If the IP address is the same as that of the old node that was removed, move to Step f. Otherwise, perform Step e.
      $ grep -A6 occne3-user-k8s-node-2 terraform.tfstate | grep ip
      Sample output:
      "fixed_ip_v4": "192.168.200.78",
                      "fixed_ip_v6": "",
                      "floating_ip": "",
    5. If the IP address of the newly created node is different from the old node's IP, replace the IP address in the following files on the active Bastion (see the scripted sketch at the end of this procedure):
      vi /etc/hosts
      vi /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini
      vi /var/occne/cluster/${OCCNE_CLUSTER}/lbvm/lbCtrlData.json
    6. Run the pipeline command to provision the OS on the node:
      For example, considering the affected node as worker-node-2:
      $ OCCNE_CONTAINERS='(PROV)' OCCNE_DEPS_SKIP=1 OCCNE_ARGS='--limit=occne3-user-k8s-node-2' OCCNE_STAGES=(DEPLOY) pipeline.sh
    7. Run the following command to install and configure Kubernetes. This adds the node back into the cluster.
      $ OCCNE_CONTAINERS='(K8S)' OCCNE_DEPS_SKIP=1 OCCNE_STAGES=(DEPLOY) pipeline.sh
    8. Verify if the node is added back into the cluster:
      $ kubectl get nodes
      Sample output:
      NAME                            STATUS   ROLES                  AGE    VERSION
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-node-1   Ready    <none>                 2d1h   v1.23.7
      occne3-user-k8s-node-2   Ready    <none>                 111m   v1.23.7
      occne3-user-k8s-node-3   Ready    <none>                 2d1h   v1.23.7
      occne3-user-k8s-node-4   Ready    <none>                 2d1h   v1.23.7
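If the recreated node received a different IP address (Step e above), the replacement across the listed files can also be scripted. The following is a sketch only; adapt the placeholder addresses and verify the result before continuing:
      $ export OLD_IP=<old-node-ip> NEW_IP=<new-node-ip>
      $ sudo sed -i "s/${OLD_IP}/${NEW_IP}/g" /etc/hosts
      $ sed -i "s/${OLD_IP}/${NEW_IP}/g" /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini
      $ sed -i "s/${OLD_IP}/${NEW_IP}/g" /var/occne/cluster/${OCCNE_CLUSTER}/lbvm/lbCtrlData.json
      $ grep -n "${NEW_IP}" /etc/hosts /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini    # confirm the replacement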
Recovering a Failed Kubernetes Worker Node in VMware

This section describes the manual procedure to replace a failed Kubernetes Worker Node in a VMware deployment.

Prerequisites

  • You must have credentials to access VMware GUI or CLI.

Procedure

  1. Perform the following steps to identify and remove the failed worker node:

    Note:

    Run all the commands as a cloud-user in the /var/occne/cluster/${OCCNE_CLUSTER} folder.
    1. Identify the node that is in a not ready, not reachable, or degraded state and note the node's IP address:
      kubectl get node -A -o wide
      Sample output:
      NAME                     STATUS   ROLES                  AGE    VERSION   INTERNAL-IP     EXTERNAL-IP     OS-IMAGE                  KERNEL-VERSION                    CONTAINER-RUNTIME
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   178m   v1.23.7   192.168.1.92    192.168.1.92    Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   178m   v1.23.7   192.168.1.117   192.168.1.117   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   178m   v1.23.7   192.168.1.118   192.168.1.118   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-1   Ready    <none>                 176m   v1.23.7   192.168.1.135   192.168.1.135   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-2   Ready    <none>                 176m   v1.23.7   192.168.1.137   192.168.1.137   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-3   Ready    <none>                 176m   v1.23.7   192.168.1.136   192.168.1.136   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      occne3-user-k8s-node-4   Ready    <none>                 176m   v1.23.7   192.168.1.119   192.168.1.119   Oracle Linux Server 8.6   5.4.17-2136.309.5.el8uek.x86_64   containerd://1.6.4
      
    2. Copy and take a backup of the original Terraform tfstate file:
      # cp terraform.tfstate terraform.tfstate.bkp-orig
    3. On identifying the failed node, drain the node from the Kubernetes cluster:
      # kubectl drain occne3-user-k8s-node-2 --ignore-daemonsets --delete-emptydir-data

      This command ignores DaemonSets and deletes any attached emptyDir data, as the failed worker node may have local storage volumes attached to it.

      Note:

      If this command runs without an error, move to Step e. Else, perform Step d.
    4. You may encounter errors after running Step c, or the pods may not detach from the node that is being deleted. For example:
      I0705 07:40:22.050275 409374 request.go:697] Waited for 1.083354916s due to client-side throttling, not priority and fairness, request: GET:https://lb-apiserver.kubernetes.local:6443/
      api/v1/namespaces/occne-infra/pods/occne-lb-controller-server-58f689459b-w95t
      You can ignore these errors and exit the terminal (CTRL+C). After exiting the terminal, run the following command to manually delete the node:
      # kubectl delete node <node-name>
    5. Wait (about 10 to 15 minutes) until all pods are detached from the deleted node and assigned to other nodes that are in the Ready state. Skip to Step g if this step is successful. Some pods may remain in the Pending state due to insufficient memory; this is resolved when you add the new node to the cluster.
    6. If the previous step fails, perform the following steps to manually remove the pods that are running in the failed worker node:
      1. Identify the pods that are not in an online state and delete each of them by running the following command:
        # kubectl delete pod --force <pod-name> -n <name-space>
        Repeat this step until all the pods are removed from the cluster.
      2. Run the following command to drain the node from the Kubernetes cluster:
        # kubectl drain occne3-user-k8s-node-2 --force --ignore-daemonsets --delete-emptydir-data
    7. Verify if the failed node is removed from the cluster:
      # kubectl get nodes
      Sample output:
      NAME                            STATUS   ROLES                  AGE   VERSION
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   2d    v1.23.7
      occne3-user-k8s-node-1   Ready    <none>                 2d    v1.23.7
      occne3-user-k8s-node-3   Ready    <none>                 2d    v1.23.7
      occne3-user-k8s-node-4   Ready    <none>                 2d    v1.23.7

      Verify that the target worker node is no longer listed.

  2. Log in to the VCD or VMware console. Manually power off (if needed) and delete the failed node from VMware. At this stage, the node must already be removed from the VCD console.
  3. Recreate and add the node back into the Kubernetes cluster:

    Note:

    Run all the commands as a cloud-user in the /var/occne/cluster/${OCCNE_CLUSTER} folder.
    1. Run terraform apply to recreate the node:
      # terraform apply -var-file=$OCCNE_CLUSTER/cluster.tfvars -auto-approve
    2. Locate the IP address of the newly created node in the terraform.tfstate file. If the IP address is the same as that of the old node that was removed, move to Step d. Otherwise, perform Step c.
      # grep -A6 occne3-user-k8s-node-2 terraform.tfstate | grep ip
      Sample output:
       "ip": "192.168.1.137",
      "ip_allocation_mode": "POOL",
    3. If the IP address of the newly created node is different from the old node's IP, replace the IP address in the following files:
      - /etc/hosts
      - /var/occne/cluster/${OCCNE_CLUSTER}/hosts.ini
      - /var/occne/cluster/${OCCNE_CLUSTER}/lbvm/lbCtrlData.json
    4. Run the pipeline command to provision the OS on the node:
      For example, considering the affected node as worker-node-2:
      # OCCNE_CONTAINERS='(PROV)' OCCNE_DEPS_SKIP=1 OCCNE_ARGS='--limit=occne3-user-k8s-node-2' OCCNE_STAGES=(DEPLOY) pipeline.sh
    5. Run the following command to install and configure Kubernetes. This adds the node back into the cluster.
      # OCCNE_CONTAINERS='(K8S)' OCCNE_DEPS_SKIP=1 OCCNE_STAGES=(DEPLOY) pipeline.sh
    6. Verify if the node is added back into the cluster:
      # kubectl get nodes
      Sample output:
      NAME                            STATUS   ROLES                  AGE    VERSION
      occne3-user-k8s-ctrl-1   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-ctrl-2   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-ctrl-3   Ready    control-plane,master   2d1h   v1.23.7
      occne3-user-k8s-node-1   Ready    <none>                 2d1h   v1.23.7
      occne3-user-k8s-node-2   Ready    <none>                 111m   v1.23.7
      occne3-user-k8s-node-3   Ready    <none>                 2d1h   v1.23.7
      occne3-user-k8s-node-4   Ready    <none>                 2d1h   v1.23.7

Restoring CNE from Backup

This section provides details about restoring a CNE cluster from backup.

Prerequisites

Before restoring CNE from backups, ensure that the following prerequisites are met.

  • The CNE cluster must have been backed up successfully. For more information about taking a CNE backup, see the "Creating CNE Cluster Backup" section of Oracle Communications Cloud Native Core, Cloud Native Environment User Guide.
  • At least one Kubernetes controller node must be operational.
  • As this is a non-destructive restore, all the corrupted or non-functioning resources must be destroyed before initiating the restore process.
  • This procedure replaces your current cluster directory with the one saved in your CNE cluster backup. Therefore, before performing a restore, back up any Bastion directory files that you consider sensitive.
  • For a BareMetal deployment, the following rook-ceph storage classes must be created and made available:
    • standard
    • occne-esdata-s
    • occne-esmaster-sc
    • occne-metrics-sc
  • For a BareMetal deployment, PVCs must be created for all services except bastion-controller.
  • For a vCNE deployment, PVCs must be created for all services except bastion-controller and lb-controller.

Note:

  • Velero backups have a default retention period of 30 days. CNE provides only the non-expired backups for an automated cluster restore.
  • Perform the restore procedure from the same Bastion Host from which the backups were taken.
Performing a Cluster Restore From Backup

This section describes the procedure to restore a CNE cluster from backup.

Note:

  • The backup restore procedure can restore backups of both Bastion and Velero.
  • This procedure is used for running a restore for the first time only. If you want to rerun a restore, see Rerunning Cluster Restore.

Dropping All CNE Services:

Perform the following steps to run the velero_drop_services.sh script to drop only the currently supported services:
  1. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory where the velero_drop_services.sh is located:
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  2. Run the velero_drop_services.sh script:

    Note:

    If you are using this script for the first time, you can run the script by passing --help / -h as an argument or run the script without passing any argument to get more information about the script.
    ./scripts/restore/velero_drop_services.sh -h
    Sample output:
    This script helps you drop services to prepare your cluster for a
    velero restore from backup, it receives a space separated list of
    arguments for uninstalled different components
     
    Usage:
    provision/provision/roles/bastion_setup/files/scripts/backup/velero_drop_services.sh [space separated arguments]
     
    Valid arguments:
      - bastion-controller
      - opensearch
      - fluentd-opensearch
      - jaeger
      - snmp-notifier
      - metrics-server
      - nginx-promxy
      - promxy
      - vcne-egress-controller
      - istio
      - cert-manager
      - kube-system
      - all:        Drop all the above
     
    Note: If you place 'all' anywhere in your arguments all will be dropped.
You can use the velero_drop_services.sh script to drop a service or set of services. For example:
  • To drop a service or a set of services, pass the service names as a space-separated list:
    ./scripts/restore/velero_drop_services.sh jaeger fluentd-opensearch istio
  • To drop all the supported services, use all:
    ./scripts/restore/velero_drop_services.sh all
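Before initiating the restore, it is worth confirming that the dropped services are actually gone, as leftover resources can interfere with the Velero restore. A minimal check:
  $ helm list -n occne-infra              # dropped releases should no longer be listed
  $ kubectl get pods,pvc -n occne-infra   # no pods or PVCs should remain for the dropped services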

Initiating Cluster Restore

  1. Perform the following steps to initiate a cluster restore:
    1. Navigate to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory:
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
    2. Run the createClusterRestore.py script:
      • If you do not know the name of the backup that you are going to restore, run the following command to choose the backup from the available list and then run the restore:
        $ ./scripts/restore/createClusterRestore.py
        Sample output:
        Please choose and type the available backup you want to restore into your OCCNE cluster
        
        - occne-example-20250310-203651
        - occne-example-20250314-053600
         
        Please type the name of your backup: ...
      • If you know the name of the backup that you are going to restore, run the script by passing the backup name:
        $ ./scripts/restore/createClusterRestore.py $<BACKUP_NAME>

        where, <BACKUP_NAME> is the name of the Velero backup previously created.

        For example, considering the backup name as "occne-example-20250314-053600" the restore script is run as follows:
        $ ./scripts/restore/createClusterRestore.py occne-example-20250314-053600
        Sample output:
        No /var/occne/cluster/occne-example/artifacts/velero.d/restore/cluster_restores_log.json log file, creating new one
        
        Initializing cluster restore with backup: occne-example-20250314-053600...
        
        No /var/occne/cluster/occne-example/artifacts/velero.d/restore/cluster_restores_log.json log file, creating new one
        Initializing bastion restore : 'occne-example-20250314-053600'
        
        Downloading bastion backup occne-example-20250314-053600
        
        Successfully downloaded bastion backup occne-example-20250314-053600.tar at home directory
        
        GENERATED LOG FILE AT: /var/occne/cluster/occne-example/logs/velero/restore/downloadBastionBackup-20250314-055014.log
        
        GENERATED LOG FILE AT: /var/occne/cluster/occne-example/logs/velero/restore/createBastionRestore-20250314-055014.log
        Initializing Velero K8s restore : 'occne-example-20250314-053600'
        Successfully created velero restore
        SKIPPING: Step 'update-pvc-annotations' skipped due to not found Pods/PVCs in 'Pending' state
        No /var/occne/cluster/occne-example/artifacts/restore/cluster_restores_log.json log file, creating new one
        
        Successfully created cluster restore
        
        GENERATED LOG FILE AT: /var/occne/cluster/occne-example/logs/velero/restore/createClusterRestore-20250314-055014.log
    3. For CNE 23.3.x and older versions, run the following script to add the required annotations to the affected PVCs (specifying the storage provider) and to remove the pods associated with these PVCs, forcing them to recreate themselves.
      $ ./restore/updatePVCAnnotations.py 

Verifying Restore

  1. When the Velero restore is completed, it may take several minutes for the Kubernetes resources to be fully up and functional. Monitor the restore to ensure that all services are up and running. Run the following command to get the status of all pods, deployments, and services:
    $ kubectl get all -n occne-infra
  2. Once you verify that all resources are restored, run a cluster test to confirm that every resource is up and running.
    $ OCCNE_CONTAINERS=(CFG) OCCNE_STAGES=(TEST) pipeline.sh
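Pods can cycle through Init and ContainerCreating states for a while after the restore. Instead of rerunning the status command manually, you can watch the namespace until everything settles; for example:
  $ watch kubectl get pods -n occne-infra    # refreshes every 2 seconds; wait until all pods are Running or Completed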
Rerunning Cluster Restore

This section describes the procedure to rerun a restore that is already completed successfully.

  1. Navigate to the /var/occne/cluster/$OCCNE_CLUSTER/ directory:
    $ cd /var/occne/cluster/$OCCNE_CLUSTER/
  2. Open the cluster_restores_log.json file:
    vi artifacts/velero.d/cluster_restores_log.json
    Sample output:
    {
        "occne-cluster-20230712-220439": {
            "cluster-restore-state": "COMPLETED",
            "bastion-restore": {
                "state": "COMPLETED"
            },
            "velero-restore": {
                "state": "COMPLETED"
            }
        }
    }
  3. Edit the file to set the value of "cluster-restore-state" to "RESTART" as shown in the following code block (a non-interactive jq alternative is sketched after this procedure):
    {
        "occne-example-20250314-053600": {
            "cluster-restore-state": "RESTART",
            "bastion-restore": {
                "state": "COMPLETED"
            },
            "velero-restore": {
                "state": "COMPLETED"
            },
            "update-pvc-annotations": {
                "state": "SKIPPED"
            }
        }
    }
  4. Perform the following steps to remove the previously created Velero restore objects:
    1. Run the following command to delete the Velero restore object:
      $ velero restore delete <BACKUP_NAME>

      where, <BACKUP_NAME> is the name of the previously created Velero backup.

    2. Wait until the restore object is deleted and verify the same using the following command:
      $ velero get restore
    3. Run the Dropping All CNE Services procedure to delete all the services that were created when this procedure was run first.
    4. Verify that Step c is completed successfully and no resources are left. Skipping this verification can cause the cluster to go into an unrecoverable state.
    5. Run the CNE restore script without an interactive menu:
      $ ./scripts/restore/createClusterRestore.py <BACKUP_NAME>

      where, <BACKUP_NAME> is the name of the previously created Velero backup.
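As an alternative to editing the file in vi (Step 3 above), the restart state can be set non-interactively; the following is a sketch only, assuming jq is available on the Bastion Host:
  $ BACKUP_NAME=<name-of-the-previously-created-velero-backup>
  $ jq --arg b "$BACKUP_NAME" '.[$b]["cluster-restore-state"] = "RESTART"' \
        artifacts/velero.d/cluster_restores_log.json > /tmp/cluster_restores_log.json \
    && mv /tmp/cluster_restores_log.json artifacts/velero.d/cluster_restores_log.json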

Rerunning a Failed Cluster Restore

This section describes the procedure to rerun a restore that failed.

CNE provides options to resume cluster restore from the stage it failed. Perform any of the following steps depending on the stage in which your restore failed:

Bastion Host Failure

  1. Go to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory.
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  2. Set the BACKUP_NAME variable with the name of the backup that is going to be restored.
    $ export BACKUP_NAME="Name-of-the-previously-created-velero-backup"

    For example:

    $ export BACKUP_NAME=occne-example-20250314-053600
  3. To resume the cluster restore from this stage, rerun the restore script without using the interactive menu:
    $ ./scripts/restore/createClusterRestore.py ${BACKUP_NAME}

    where, BACKUP_NAME is the environment variable set in Step 2, containing the name of the previously created Velero backup.

Kubernetes Velero Restore Failure

  1. Go to the /var/occne/cluster/${OCCNE_CLUSTER}/ directory.
    $ cd /var/occne/cluster/${OCCNE_CLUSTER}/
  2. Set the BACKUP_NAME variable with the name of the backup that is going to be restored.
    $ export BACKUP_NAME="<Name-of-the-previously-created-velero-backup>"

    For example:

    $ export BACKUP_NAME=occne-example-20250314-053600
  3. Run the following command to delete the Velero restore object:
    $ velero restore delete ${BACKUP_NAME}

    where, BACKUP_NAME is the environment variable set in Step 2, containing the name of the previously created Velero backup.

    Sample output:

    Are you sure you want to continue (Y/N)? Y
    Request to delete restore "occne-example-20250314-053600" submitted successfully.
    The restore will be fully deleted after all the associated data (restore files in object storage) are removed.
  4. Verify that restore is deleted successfully and no resources are left. Skipping this verification can cause the cluster to go into an unrecoverable state.
  5. Run the CNE restore script without an interactive menu:
    $ ./scripts/restore/createClusterRestore.py ${BACKUP_NAME}

    where, BACKUP_NAME is the environment variable set in Step 2, containing the name of the previously created Velero backup.

Modifying Annotations and Deleting PV in Kubernetes:

If the restore fails at this point and shows that the pods are waiting for their PVs, use the updatePVCAnnotations.py script to automatically modify the annotations and delete the PVs in Kubernetes.

The updatePVCAnnotations.py script is used to:
  • add specific annotations to the affected PVCs to specify the storage provider.
  • remove the pods associated with the affected PVCs to force them to recreate themselves.
Use the following command to run the updatePVCAnnotations.py script:
$ ./scripts/restore/updatePVCAnnotations.py
Troubleshooting Restore Failures

This section provides the guidelines to troubleshoot restore failures.

Prerequisites

Before using this section to troubleshoot a restore failure, verify the following:
  • Verify connectivity with S3 object storage.
  • Verify if the credentials used while activating Velero are still active.
  • Verify if the credentials are granted with read or write permission.
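One quick way to check the first two items (not part of the original checklist) is to query Velero's backup storage location; if the S3 endpoint or the credentials are bad, the location reports an Unavailable phase:
  $ velero backup-location get    # the PHASE column should report Available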

Troubleshooting Failed Bastion Restore

Table 4-1 Troubleshooting Failed Bastion Restore

Cause:
  • The restore script is run on a Bastion that is different from the one from which the backup was taken.
  • The restore script is not run from the active Bastion.
Possible Solution:
  1. Verify that you are using the same Bastion from which the backup was taken. Run the following command and verify that the Bastion Host name displayed in the output matches the Bastion on which you are currently running the restore procedure:
    $ jq '.["{CLUSTER-BACKUP-NAME}"]["source_bastion"]' /var/occne/cluster/$OCCNE_CLUSTER/artifacts/backup/cluster_backups_log.json
    Sample output:
    "occne-cluster-bastion-1"
  2. Verify that you are currently using the active Bastion:
    $ is_active_bastion
    Sample output:
    IS active-bastion

Troubleshooting Failed Kubernetes Velero Restore

Velero restore can fail due to several reasons. This section lists some of the most frequent causes and possible solutions to fix them:

Table 4-2 Troubleshooting Failed Kubernetes Velero Restore

Cause: Velero backup objects are not available.
Possible Solution: Run the following command and verify the following:
  • The backup object is in "COMPLETED" status.
  • The backup is without errors.
  • The backup is not expired.
$ velero get backup
Sample output:
NAME                            STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
occne-cluster-20230711-051602   Completed         0        0          2023-07-11 05:16:33 +0000 UTC   26d       minio              <none>
Cause: PVCs are not attached correctly.
Possible Solution: Verify that, after a restore, every PVC that was created with the CNE services under the occne-infra namespace is still available and is in Bound status:
$ kubectl get pvc -n occne-infra
Sample output:
NAME                                                                                                     STATUS   ...
bastion-controller-pvc                                                                                   Bound    ...
lb-controller-pvc                                                                                        Bound    ...
occne-opensearch-cluster-data-occne-opensearch-cluster-data-0                                            Bound    ...
occne-opensearch-cluster-data-occne-opensearch-cluster-data-1                                            Bound    ...
occne-opensearch-cluster-data-occne-opensearch-cluster-data-2                                            Bound    ...
occne-opensearch-cluster-master-occne-opensearch-cluster-master-0                                        Bound    ...
occne-opensearch-cluster-master-occne-opensearch-cluster-master-1                                        Bound    ...
occne-opensearch-cluster-master-occne-opensearch-cluster-master-2                                        Bound    ...
prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-0   Bound    ...
prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-1   Bound    ...
Cause: PVs are not available.
Possible Solution: Verify that, before and after a restore, PVs are available for the common services to restore:
$ kubectl get pv
Sample output:
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM
pvc-20d8e323-307c-40a4-86d6-d58278d4e75f   1Gi        RWO            Delete           ...      occne-infra/bastion-controller-pvc
pvc-7318c445-c363-4851-a2a5-be27b600586d   1Gi        RWO            Delete           ...      occne-infra/lb-controller-pvc
pvc-d13c97be-68b0-4252-9f61-12572236e18d   8Gi        RWO            Delete           ...      occne-infra/prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-0   occne-metrics-sc             6d6h
pvc-7e06bb6c-911f-4f2e-b607-8c1a3e08c69c   8Gi        RWO            Delete           ...      occne-infra/prometheus-occne-kube-prom-stack-kube-prometheus-db-prometheus-occne-kube-prom-stack-kube-prometheus-1   occne-metrics-sc             6d6h
pvc-f21c3070-d80c-4dd5-b493-069e5ecccf13   30Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-0
pvc-92468237-c1d1-4449-9fdd-dbdb14f54611   30Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-1
pvc-a16e2dab-8f04-4c22-8911-2fb630053eb3   30Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-master-occne-opensearch-cluster-master-2
pvc-1f1579a9-519f-4fce-b719-a24e59464354   10Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-0
pvc-d9a58939-5523-4d0f-88d7-86c54645ae16   10Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-1
pvc-27a40a24-0a69-4013-8bad-811fcf41175f   10Gi       RWO            Delete           ...      occne-infra/occne-opensearch-cluster-data-occne-opensearch-cluster-data-2

Common Services

This section describes the fault recovery procedures for common services.

Restoring a Failed Load Balancer

This section provides a detailed procedure to restore a Virtualized CNE (vCNE) Load Balancer that fails when the Kubernetes cluster is in service. This procedure can also be used to recreate an LBVM that was manually deleted.

Prerequisites
  • You must know the reason for the Load Balancer Virtual Machines (LBVM) failure.
  • You must know the LBVM name to be replaced and the address pool.
  • Ensure that the cluster.tfvars file is available for terraform to recreate the LBVM.
  • You must run this procedure in the active bastion.
Limitations
  • The following procedure does not attempt to determine the cause of the LBVM failure.
  • The role or status of the LBVM to be replaced must not be ACTIVE.
  • If a LOAD_BALANCER_NO_SERVICE alert is raised and both LBVMs are down, then this procedure must be used to recover one LBVM at a time.
Procedure
  1. For OpenStack deployment, check the active bastion and set up user environment:
    1. Check if the Bastion is the active Bastion:
      $ is_active_bastion 
      Sample output:
      IS active-bastion
    2. On the Bastion Host, change directory to the cluster directory and source the OpenStack environment file:
      $ cd /var/occne/cluster/${OCCNE_CLUSTER}
      $ source openrc.sh

      Note:

      ${OCCNE_CLUSTER} is an environment variable on the Bastion Host and can be used directly in the command. The actual cluster name is inserted into the command.
  2. Identify the LBVM to be replaced:
    1. A LOAD_BALANCER_FAILED alert must have been raised to indicate the need to run this step. An example of the LOAD_BALANCER_FAILED alert description is as follows:
      Load balancer mycluster-oam-lbvm-1 at IP 10.75.X.X on the OAM network has failed. Execute load balancer recovery procedure to restore.
      
    2. Record the failed load balancer name and the network name from the alert description.
  3. Recreate the failed load balancer VM. This action starts the replaceLbvm_<Date>-<Time>.log log file in the /var/occne/cluster/${OCCNE_CLUSTER}/logs directory:
    Run the replaceLbvm.py script:

    Note:

    Run the tail -f /var/occne/cluster/${OCCNE_CLUSTER}/logs/replaceLbvm_<Date>-<Time>.log command on a separate shell (Bastion Host) to track the progress of the recover script implementation.
    $ /var/occne/cluster/${OCCNE_CLUSTER}/scripts/replaceLbvm.py -p <peer_address_pool_name> -n <failed_lbvm_name>
    
    For example:
    $ /var/occne/cluster/${OCCNE_CLUSTER}/scripts/replaceLbvm.py -p oam -n mycluster-oam-lbvm-1
    
    The following code block displays the additional arguments and examples that the -h/--help flag provides:
    $ ./replaceLbvm.py -h
    usage: replaceLbvm.py [-h] -p POOL -n NAME [-db] [-nb] [-rn] [-fc]
     
    Use to replace a LBVM that's in FAILED status by default. This script run terraform destroy and
    terraform apply to recreate the LBVM, updates the cluster with the new LBVM data, run the pipeline.sh
    to provision/configure the new LBVM and run scripts inside the lb-controller.
    Parameters allow user to indicate which LBVM will be replaced, by indicating the LBVM name and network pool.
     
    Parameters:
      Required parameters:
        -p/--pool (The LBVM network pool name)
        -n/--name: upgrade (The LBVM name the will be replaced)
     
      Optional Parameters:
        -db/--debug: Print the class attributes to help debugging.
        -nb/--nobackup: Might not always want to make copies of the files... especially when debugging.
        -rn/--replacenotfailed: Replace a LBVM that is not in FAILED status.
        -fc/--forcecreate: Recreates the LBVM when it was deleted manually.
     
    WARNING: This script should only be run on the Active Bastion Host.
             Openstack Only: Need to run "source openrc.sh" before this script.
     
        Examples:
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -db
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -nb
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -rn
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -fc
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -db -nb -rn
         ./replaceLbvm.py -p oam -n my-cluster-oam-lbvm-2 -db -nb -fc
         ./replaceLbvm.py --pool oam --name my-cluster-oam-lbvm-2 --debug --nobackup --replacenotfailed
    The following are some of the error scenarios that you may encounter. Ensure that you check the log file to analyze the errors.
    1. If the script is unable to retrieve the LBVM IP, the script prints the following error message:
      $ /var/occne/cluster/${OCCNE_CLUSTER}/scripts/replaceLbvm.py.py -p oam -n mycluster-oam-lvbm-1
       
          -----Initializing replace LBVM process-----
       
       
       - Backing up configuration files at /var/occne/cluster/mycluster/backupConfig...
       
       - Getting the LBVM information...
       
      Unable to run processReplaceLbvm - args: Namespace(pool='oam', name='mycluster-oam-lvbm-1', debug=False, nobackup=False, replacenotfailed=False, forcecreate=False)
      Error:    Unable to retrieve the LBVM IP
       - For more information check /var/occne/cluster/mycluster/logs/replaceLbvm_20240325-162511.log
    2. If the LBVM being replaced is not in a FAILED status, the script prints the following error message:
      $ ./replaceLbvm.py -p oam -n mycluster-oam-lbvm-2
       
          -----Initializing replace LBVM process-----
       
       
       - Backing up configuration files at /var/occne/cluster/mycluster/backupConfig...
       
       - Getting the LBVM information...
       
       - The LBVM oam - mycluster-oam-lbvm-2 is in STANDBY status. Please verify the LBVM name and the pool, if you still want to replace this LBVM, set -rn/--replacenotfailed flag on the command line when running this script.
      In this case, you can force the replacement by using the -rn flag:
      $ ./replaceLbvm.py -p oam -n mycluster-oam-lbvm-2 -rn
       
          -----Initializing replace LBVM process-----
       
       
       - Backing up configuration files at /var/occne/cluster/mycluster/backupConfig...
       
       - Getting the LBVM information...
       
       - Destroying the LBVM with terraform destroy...
    3. If this script is run for an LBVM that is ACTIVE, the script prints the following error message:
      [cloud-user@mycluster scripts]$ ./replaceLbvm.py -p oam -n mycluster-oam-lbvm-1
       
          -----Initializing replace LBVM process-----
       
       
       - Backing up configuration files at /var/occne/cluster/mycluster/backupConfig...
       
       - Getting the LBVM information...
       
       - The LBVM oam - mycluster-oam-lbvm-1 is in ACTIVE status, replacing this LBVM will cause problems as loosing connection inside the cluster. Please verify the LBVM name and the pool.

      If the failed LBVM is the ACTIVE LBVM, then the STANDBY LBVM takes control and transfers all the ports to allow external traffic automatically. You can recover the failed LBVM in the meantime.

    If there are no errors, the script prints a success message similar to the following:
    $ ./replaceLbvm.py -p oam -n mycluster-oam-lbvm-2 -rn
     
        -----Initializing replace LBVM process-----
     
     
     - Backing up configuration files at /var/occne/cluster/mycluster/backupConfig...
     
     - Getting the LBVM information...
     
     - Destroying the LBVM with terraform destroy...
       Successfully applied terraform destroy - check /var/occne/cluster/mycluster/logs/replaceLbvm_20240325-164003.log for details
     
     - Recreating the LBVM with terraform apply...
       Successfully applied Openstack/VCD terraform apply - check /var/occne/cluster/mycluster/logs/replaceLbvm_20240325-164003.log for details
     
     - Updating the LBVM IP...
     
     - Running pipeline.sh for provision and configure - can take considerable time to complete...
       Successfully created configmap lb-controller-master-ip.
     
     - Running recoverLbvm.py inside the lb-controller pod to recover the new LBVM...
    Running recoverLbvm.py in lb-controller pod
    Neither LBVM for poolName oam are in a FAILED role - nothing to recover: [{'id': 0, 'poolName': 'oam', 'name': 'mycluster-oam-lbvm-1', 'ipaddr': '192.168.0.1', 'role': 'ACTIVE', 'status': 'UP'}, {'id': 1, 'poolName': 'oam', 'name': 'mycluster-oam-lbvm-2', 'ipaddr': '192.168.0.2', 'role': 'STANDBY', 'status': 'UP'}]
     
     
     - Running updTemplates.py inside the lb-controller pod to update haproxy templates on the new LBVM...
     
    Log file generated at: /var/occne/cluster/mycluster/logs/replaceLbvm_20240325-164003.log
     
    LBVM successfully replaced on the cluster: mycluster

Restoring a Failed LB Controller

This section provides a detailed procedure to restore a failed LB controller using a backup crontab.

A failed LB Controller can be restored using backup and restore processes.

Prerequisites
  • Ensure that the LB controller is installed.
  • Ensure that the metal-lb is installed.

Backup LB controller database

  1. Create the backuplbcontroller.sh script to back up the LB controller database.

    Run the following commands to create the script in the /var/occne/cluster/${OCCNE_CLUSTER} directory:

    cd /var/occne/cluster/${OCCNE_CLUSTER}
    touch backuplbcontroller.sh
    chmod +x backuplbcontroller.sh
    vi backuplbcontroller.sh
  2. Add the following lines of code inside the backuplbcontroller.sh script.

    Following is a sample content of backuplbcontroller.sh script:

    #!/bin/bash
    timenow=`date +%Y-%m-%d_%H%M%S`
     
    occne_lb_pod_name=$(kubectl get pods -n occne-infra -o custom-columns=NAME:.metadata.name | grep occne-lb-controller-server)
    if [ -z "$occne_lb_pod_name" ]
    then
        echo "The occne-lb-controller-server pod could not be found. $timenow" >> lb_backup.log
        exit 1
    fi
     
    occne_lb_pod_ready=$(kubectl get pods $occne_lb_pod_name -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' -n occne-infra| grep True)
    if [ -z "$occne_lb_pod_ready" ]
    then
        echo "The occne-lb-controller-server pod is not ready. $timenow" >> lb_backup.log
        exit 1
    fi
     
    lb_db_status=$(kubectl exec -it $occne_lb_pod_name  -n occne-infra --  /bin/bash -c "cd create_lb/lbCtrl;python3 -c 'import lbCtrlData;print(lbCtrlData.getLbvms())'" | grep ACTIVE)
    if [ -z "$lb_db_status" ]
    then
        echo "None of the LBVMs are ACTIVE. $timenow" >> lb_backup.log
        exit 1
    fi
     
    echo "Backing up DB from pod: $occne_lb_pod_name at $timenow." >> lb_backup.log
    db_backup_res=$(kubectl cp occne-infra/$occne_lb_pod_name:data/sqlite/db /tmp/backuplbcontroller/)
    backup_last_db=$(cp /tmp/backuplbcontroller/lbCtrlData.db /tmp/backuplbcontroller/lbCtrlData_$timenow.db) 
    echo "$db_backup_res" >> lb_backup.log
     
    # Retain at most 10 timestamped backups; remove the oldest when the limit is exceeded
    backups_list=$(ls -t /tmp/backuplbcontroller/ | grep lbCtrlData_)
    max_backups=10
    if [ $(echo "$backups_list" | wc -l) -gt $max_backups ]
    then
      oldest_backup=$(echo "$backups_list" | tail -n +$(($max_backups+1)))
      rm "/tmp/backuplbcontroller/$oldest_backup"
    fi
  3. Add the script to a cron job.

    Run the following command to edit the crontab:

    sudo crontab -e

    Add the following line to the crontab:

    0 */4 * * * /var/occne/cluster/${OCCNE_CLUSTER}/backuplbcontroller.sh
  4. Verify that the changes in the crontab were added.

    Run the following command to list the scheduled cron jobs:

    sudo crontab -l
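
    Before relying on the scheduled job, you can run the script once manually and confirm that a backup appears. The following is a sketch only, assuming the /tmp/backuplbcontroller directory is created first:
    sudo mkdir -p /tmp/backuplbcontroller
    cd /var/occne/cluster/${OCCNE_CLUSTER}
    sudo ./backuplbcontroller.sh
    sudo ls -l /tmp/backuplbcontroller/    # expect lbCtrlData.db and a timestamped copy
    sudo tail lb_backup.log                # check for errors logged by the script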

Restore LB Controller

Note:

To restore the database, you must have already performed the Backup LB controller database procedure.
  1. In sudo mode, stop the backups by removing the entry from the cron job. Once the LB controller database is restored, add the line back into the cron job.

    Run the following command to edit the crontab:

    sudo crontab -e
    Delete the following line from the cron job:
    0 */4 * * * /var/occne/cluster/${OCCNE_CLUSTER}/backuplbcontroller.sh
  2. Uninstall the metallb and lb-controller pods. Both pods must be uninstalled so that BGP peering is recreated. After the uninstall, wait for the pods and PVCs of both metallb and lb-controller to terminate.

    Run the following command to uninstall the Helm releases:

    helm uninstall occne-metallb occne-lb-controller -n occne-infra

    Wait for the resources to be terminated.

  3. Reinstall the metallb and lb-controller pods.

    Run the following deploy command to reinstall the pods:

    OCCNE_CONTAINERS=(CFG) OCCNE_STAGES=(DEPLOY) OCCNE_ARGS="--tags=metallb,vcne-lb-controller" /var/occne/cluster/${OCCNE_CLUSTER}/artifacts/pipeline.sh
  4. Restore LB Controller database.
    Run the following restore commands:
    1. Stop the monitor in lb-controller after install until db file is updated.
      kubectl get deploy -n occne-infra | grep occne-lb-controller | awk '{print $1}'
      kubectl set env deployment/<lb deploy> UPGRADE_IN_PROGRESS="true" -n occne-infra

      Wait for the pod to be in the running state.

    2. Load the db file back to the Container.
      sudo cp /tmp/backuplbcontroller/lbCtrlData.db lbCtrlData.db
      sudo chmod a+rwx lbCtrlData.db
      kubectl get pods -n occne-infra | grep occne-lb-controller | awk '{print $1}'
      kubectl cp lbCtrlData.db occne-infra/<lb controller pod>:/data/sqlite/db
    3. Start the monitor again.
      kubectl set env deployment/<lb deploy> UPGRADE_IN_PROGRESS="false" -n occne-infra
      Wait for the lb-controller pod to terminate and recreate.
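
After Step 4 completes and the lb-controller pod has been recreated, you can confirm that the pods are back and that the restored database lists an ACTIVE LBVM. The following is a minimal check, reusing the query from the backup script (replace <lb controller pod> with the pod name):
      kubectl get pods -n occne-infra | grep -E 'metallb|lb-controller'
      kubectl exec -it <lb controller pod> -n occne-infra -- /bin/bash -c "cd create_lb/lbCtrl;python3 -c 'import lbCtrlData;print(lbCtrlData.getLbvms())'"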

To be able to restore the backup files in case they get corrupted, it is recommended to keep up to 10 copies of the database, saved with a 4-hour separation. If the latest snapshot encounters issues, repeat the commands in the last step (Step 4), specifying the version of the database that you want to restore. All the older backup files are located in the /tmp/backuplbcontroller/ directory.

Run the following commands to restore the older database backups:

sudo ls /tmp/backuplbcontroller/
sudo cp /tmp/backuplbcontroller/lbCtrlData_<db version to restore>.db lbCtrlData.db