End-to-End Compute Node Upgrade
The compute node upgrade is itself an orchestrated workflow, nested within the full rack upgrade workflow. Each compute node goes through its own end-to-end process, and all nodes are handled in the same way, one after the other. So that all compute nodes can be upgraded without administrator intervention, the workflow takes care of compute instance migration and displaced instances autonomously.
If the compute nodes or instances are in a state that prevents a successful end-to-end run, the workflow stops and returns a notification so that the administrator can take corrective measures.
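This control flow can be pictured as a simple sequential loop. The following Python sketch is purely illustrative: the names upgrade_all_compute_nodes, upgrade_node, notify_admin, and UpgradeBlocked are hypothetical and do not correspond to the actual Upgrader implementation.

class UpgradeBlocked(Exception):
    """A compute node or its instances prevent a successful run."""

def upgrade_all_compute_nodes(nodes, upgrade_node, notify_admin):
    """Run the per-node workflow on each node in turn; stop on the first failure."""
    for node in nodes:
        try:
            upgrade_node(node)  # the end-to-end per-node workflow
        except UpgradeBlocked as exc:
            # Stop the whole workflow and return a notification so the
            # administrator can take corrective measures.
            notify_admin(f"Upgrade stopped at {node}: {exc}")
            return False
    return True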
An end-to-end compute node upgrade consists of these steps:
- Check the status of the compute node and instances.
  - If instances are running that cannot be migrated, the workflow is stopped.
  - If the compute node is locked for maintenance, or otherwise not ready for upgrade, the workflow is stopped.
- Set a provisioning lock on the compute node.
  Caution: Depending on the high-availability configuration of the Compute service, automatic instance migrations can prevent the Upgrader from successfully locking a compute node. See Monitoring Displaced Instances.
- Migrate all running compute instances away from the compute node. If compute node evacuation fails, the workflow is stopped and the operation is rolled back (see the sketch after this list).
  Caution: If the workflow fails at this stage, the provisioning lock is not automatically removed. The lock prevents further instance placement changes.
- Deprovision and rediscover the compute node.
- Upgrade the host: reboot the node from the new OS base image and install any additional tools required.
- Provision the compute node so that it can resume normal operation.
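As referenced in the migration step above, the per-node sequence, including the rollback behavior on evacuation failure, can be summarized in a Python sketch. All names here (ops, evacuate_instances, and so on) are hypothetical placeholders, not actual Upgrader functions or CLI commands; the sketch only mirrors the steps listed above.

class UpgradeBlocked(Exception):
    """A compute node or its instances prevent a successful run."""

class EvacuationFailed(Exception):
    """Live migration of the node's instances did not complete."""

def upgrade_node(node, ops):
    """Per-node workflow; ops bundles the hypothetical per-step operations."""
    ops.check_node_and_instances(node)    # raises UpgradeBlocked if not ready
    ops.set_provisioning_lock(node)
    try:
        ops.evacuate_instances(node)      # live-migrate all running instances
    except EvacuationFailed as exc:
        ops.rollback_evacuation(node)
        # The provisioning lock is intentionally left in place (see the
        # caution above); it prevents further instance placement changes.
        raise UpgradeBlocked(f"evacuation of {node} failed: {exc}")
    ops.deprovision_and_rediscover(node)
    ops.reboot_from_new_base_image(node)  # new OS image plus required tools
    ops.provision(node)                   # node resumes normal operation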
Monitoring Displaced Instances
During compute node upgrade or patching, no active compute instances can be present on the node, so it must be evacuated and placed under a provisioning lock. To evacuate a compute node, the Compute service live-migrates its instances to another compute node in the same fault domain. If the fault domain does not have sufficient capacity, high-availability configuration settings might cause instances to be live-migrated to another fault domain and migrated back to their selected fault domain when the required capacity becomes available again.
Compute instances that have been migrated away from their selected fault domain are called displaced instances. Their migrations can interfere with compute node upgrade or patching: when the locks on a given compute node are released, its displaced instances start migrating back, and during that time it might be impossible to lock the next compute node for maintenance.
Before upgrading or patching a compute node, monitor the status of displaced instances. Do not proceed with the next compute node until the list is empty.
- In the Service CLI, use the getDisplacedInstances command (see the polling sketch after this list). In the following example, two instances have been migrated away from fault domain 1.

  PCA-ADMIN> getDisplacedInstances
  Data:
    id                        displayName  compartmentId                faultDomain     faultDomainSelected
    --                        -----------  -------------                -----------     -------------------
    ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-3  FAULT-DOMAIN-1
    ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-2  FAULT-DOMAIN-1
- In the Service Web UI, click the navigation menu, click FD Instances, and then click Displaced Instances.
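The polling loop referenced in the Service CLI item above might look like the following Python sketch. It assumes a hypothetical callable get_displaced_instances that wraps the getDisplacedInstances Service CLI command and returns the current list of displaced instances; the helper and the 60-second interval are illustrative, not part of the product.

import time

def wait_until_no_displaced_instances(get_displaced_instances, poll_seconds=60):
    """Block until the displaced-instance list is empty."""
    while True:
        displaced = get_displaced_instances()
        if not displaced:
            return
        # Instances are still migrating back to their selected fault
        # domain; do not lock the next compute node yet.
        print(f"{len(displaced)} displaced instance(s) remaining")
        time.sleep(poll_seconds)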
For more information, refer to the following sections in the Hardware Administration chapter of the Oracle Private Cloud Appliance Administrator Guide:
- Migrating instances and locking a compute node: see "Performing Compute Node Operations".
- Compute service HA configuration: see "Configuring the Compute Service for High Availability".