Compute Node Maintenance Operations

For maintenance operations, including Private Cloud Appliance software upgrade or patching and hardware repair, compute nodes must be placed in maintenance mode. This requires evacuating the running compute instances and locking the node out of any other system operations.

Evacuating a Compute Node

Some compute node operations can only be performed if the compute node has no running compute instances. Administrators can migrate all running instances away from a compute node, also known as evacuating the compute node.

By default, if enough resources are available, running instances are live migrated to other compute nodes in the same fault domain.

Important

Before you perform a compute node evacuation, check what the behavior will be for any instances that cannot be live migrated to another compute node in the same fault domain.

This topic, and High Availability Configuration for Compute Instances, describe how to check settings and how instances are handled for different settings.

Live migration between different types of compute nodes is not supported. For example, you can't migrate compute instances from an Oracle Server X10 to an Oracle Server X11.

Check whether strict fault domain enforcement is set.

When strict fault domain enforcement is disabled (Strict FD is set to Disabled in the Service Web UI or Strict FD Enabled is false in the Service CLI), instances that cannot be live migrated to another compute node in the same fault domain are live migrated to a different fault domain if enough resources are available in that fault domain.
When strict fault domain enforcement is enabled (Strict FD is set to Enabled in the Service Web UI or Strict FD Enabled is true in the Service CLI), instances that cannot be live migrated to another compute node in the same fault domain do not migrate; those instances remain running on the compute node that you want to evacuate.

Enable or disable strict fault domain enforcement to control what happens to instances that cannot live migrate to other compute nodes in the same fault domain: either they are live migrated to a different fault domain, or they remain running on the compute node after you attempt to evacuate it.

If some instances cannot be live migrated, either because the current fault domain is not able to accommodate them and strict fault domain enforcement is enabled, or because strict fault domain enforcement is disabled but other fault domains also cannot accommodate the instances, then you can re-run the migrate operation with the force option specified. When the force option is specified, the Compute service soft stops any instances that fail to live migrate, allowing the evacuation to proceed.
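The rules above can be summarized as a small decision function. The following Python sketch is only a model of the behavior described in this topic, not part of the product. The function name, parameter names, and `Outcome` values are hypothetical and chosen for illustration.

```python
from enum import Enum


class Outcome(Enum):
    MIGRATED_SAME_FD = "live migrated within the same fault domain"
    MIGRATED_OTHER_FD = "live migrated to a different fault domain (displaced)"
    SOFT_STOPPED = "soft stopped by the Compute service"
    STILL_RUNNING = "still running on the node being evacuated"


def evacuation_outcome(same_fd_has_capacity, other_fd_has_capacity, strict_fd, force):
    """Return the expected outcome for one instance (illustrative model only)."""
    # Default behavior: stay in the same fault domain when resources allow.
    if same_fd_has_capacity:
        return Outcome.MIGRATED_SAME_FD
    # Another fault domain is only considered when strict enforcement is disabled.
    if not strict_fd and other_fd_has_capacity:
        return Outcome.MIGRATED_OTHER_FD
    # The instance could not be live migrated; the force option decides the rest.
    return Outcome.SOFT_STOPPED if force else Outcome.STILL_RUNNING


# Strict FD enabled, no room in the same fault domain, force not specified.
print(evacuation_outcome(False, True, strict_fd=True, force=False).value)
```

Under this model, the force option only matters for instances that fail to live migrate. It never changes the outcome for an instance that can be placed elsewhere.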

Restart stopped instances. If instances were stopped by the Compute service (not manually stopped by an administrator) and you want them to be automatically restored to running when resources become available, check that the Auto Recovery property of the Compute service is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Viewing and Setting Compute Service Configuration and Configuring the Recovery State for a Stopped Instance.

Instances can be stopped by the Compute service if the force option is used when an administrator evacuates a compute node, or in response to an unplanned compute node outage. You can change the Auto Recovery setting at any time before or after resources become available after an administrative maintenance or unplanned outage to restart instances that were stopped by the Compute service. If the instance availability recovery action is set to STOP_INSTANCE, the instance remains stopped even though the Auto Recovery property is enabled. If the instance availability recovery action is later changed to RESTORE_INSTANCE, a subsequent Auto Recovery pass will restart the instance.
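The interaction between Auto Recovery and the instance availability recovery action can be expressed as a single condition. This Python sketch restates the paragraph above under simplifying assumptions. The function and parameter names are hypothetical, and the real service may consider factors that are not described here.

```python
def will_auto_restart(stopped_by_compute_service, auto_recovery_enabled,
                      recovery_action, resources_available):
    """Return True if a stopped instance is expected to be restarted automatically."""
    return bool(
        stopped_by_compute_service            # not manually stopped by an administrator
        and auto_recovery_enabled             # Auto Recovery property of the Compute service
        and recovery_action == "RESTORE_INSTANCE"
        and resources_available               # capacity has become available again
    )


# Auto Recovery is enabled, but the instance is configured to stay stopped.
print(will_auto_restart(True, True, "STOP_INSTANCE", True))
```

Because both settings are evaluated when recovery runs, changing the recovery action to RESTORE_INSTANCE later makes the same instance eligible for a subsequent Auto Recovery pass.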

Return relocated instances. If instances are live migrated to a different fault domain (displaced), and you want them returned to their selected fault domain (the fault domain that is specified in the instance configuration) when resources become available, check that the Auto Resolve property of the Compute service is enabled. See Viewing and Setting Compute Service Configuration and Compute Service Configuration Commands. You can set the Auto Resolve property at any time before or after the compute node evacuation completes to relocate any displaced instances.

Use the following procedures to perform the migrate operation.

Compute Node Evacuation: Before You Begin

Check fault domain and compute node resources. See Monitoring System Capacity. Based on this information, decide whether to do any of the following:
- Terminate instances that are no longer needed.
- Reconfigure some instances to use fewer resources. For example, specify a different shape.
- Reconfigure some instances to specify a different fault domain.
- Stop some instances while you perform the compute node evacuation.
- Shut down non-migratable instances. See next step.
- Specify the force option on the migration operation to soft stop any instances that cannot be live migrated. See the discussion above of instance availability recovery action and Auto Recovery configuration.
While it's possible to specify the force option on the migrateVm operation to soft stop any instances that cannot be live migrated, the best practice is to gracefully shut down non-migratable instances before migration so that any workloads running on those instances are left in a good state.
1. Display the list of non-migratable instances.
  
  Copy the IDs of the running instances, so you can shut them down.
```
PCA-ADMIN> getNonMigratableInstances
Data:
  id                           Display Name  Compute Node Id  Domain State
  --                           ------------  ---------------  ------------
  ocid1.instance.unique_ID     instance202   CN_ID            running
  ocid1.instance.unique_ID     kqh027        CN_ID            shut off
```
2. Shut down the running instances.
  
  See Stopping, Starting, and Resetting an Instance.
Disable provisioning on the compute node.

See Disabling Compute Node Provisioning.
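If you capture the getNonMigratableInstances output shown earlier in this procedure as text, the running instances can be picked out with a short script. This is a minimal Python sketch that assumes the column layout of the sample output and display names without spaces. The function name is hypothetical.

```python
def running_non_migratable(cli_output):
    """Return (instance_id, display_name) pairs whose Domain State is 'running'."""
    running = []
    for line in cli_output.splitlines():
        # Split into at most four fields so a state such as "shut off" stays whole.
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[0].startswith("ocid1.instance"):
            instance_id, name, _node_id, state = fields
            if state.strip() == "running":
                running.append((instance_id, name))
    return running


sample = """Data:
  id                           Display Name  Compute Node Id  Domain State
  --                           ------------  ---------------  ------------
  ocid1.instance.unique_ID     instance202   CN_ID            running
  ocid1.instance.unique_ID     kqh027        CN_ID            shut off
"""
print(running_non_migratable(sample))
```

With the sample output, only instance202 is returned, because kqh027 is already shut off.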

Using the Service Web UI

In the navigation menu, click Rack Units.
In the Rack Units table, find the host name of the compute node that you want to evacuate. Click the Actions menu for that host, and click the Migrate All Vms option.

Alternatively, in the Rack Units table, click the host name of the compute node that you want to evacuate to display the details page for that compute node. Click the Controls menu, and click the Migrate All Vms option.
On the Confirm Migrating VMs dialog, choose whether to force stop any instances that cannot be migrated.

By default, the force stop option is not enabled, and instances that cannot be migrated will still be running on the node after the migrate operation completes. To force stop instances that cannot be migrated, enable the force stop option in the Confirm Migrating VMs dialog.
On the Confirm Migrating VMs dialog, click the Migrate button.

The Compute service live migrates the running instances to other compute nodes if enough resources are available and High Availability settings are configured to allow it. If the Force option was specified, any instances that could not be migrated are soft stopped. If any instances could not be migrated and Force was not specified, those instances remain running in the compute node that you are attempting to evacuate.

Using the Service CLI

Display the list of compute nodes.

Copy the ID of the compute node that you want to evacuate.

PCA-ADMIN> list ComputeNode
Data:
  id                                     name       provisioningState   provisioningType
  --                                     ----       -----------------   ----------------
  3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
  f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
  4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM

Use the migrateVm command to live migrate all running compute instances off the compute node. To soft stop any instances that fail to migrate, set the force option:
```
PCA-ADMIN> migrateVm id=7a0236f4-b00e-461d-93a0-b22673a18d9c force=true
JobId: 6f1e94bc-7d5b-4002-ada9-7d4b504a2599
```
The Compute service live migrates the running instances to other compute nodes if enough resources are available and High Availability settings are configured to allow it. If force=true was specified, any instances that could not be migrated are soft stopped. If any instances could not be migrated and force=true was not specified, those instances remain running in the compute node that you are attempting to evacuate.

Use the job ID to check the status of the migrateVm command.
```
PCA-ADMIN> show Job id=6f1e94bc-7d5b-4002-ada9-7d4b504a2599
[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded
```
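When you check jobs repeatedly, it can help to reduce the show Job output to a simple status. The following Python sketch parses the `key = value` lines of captured output. It assumes the format of the sample above. The function names are hypothetical, and any Run State other than Succeeded is simply flagged for a closer look, because other values are not listed in this topic.

```python
def parse_properties(cli_output):
    """Collect 'Key = value' lines from captured Service CLI output into a dict."""
    props = {}
    for line in cli_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines such as "[...]" that are not key/value pairs
            props[key.strip()] = value.strip()
    return props


def job_status(cli_output):
    """Summarize a job as 'in progress', 'succeeded', or 'needs attention'."""
    props = parse_properties(cli_output)
    if props.get("Done") != "true":
        return "in progress"
    return "succeeded" if props.get("Run State") == "Succeeded" else "needs attention"


sample = """[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded
"""
print(job_status(sample))
```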

Disabling Compute Node Provisioning

Several compute node operations can only be performed if provisioning has been disabled. Follow these instructions to impose and release a provisioning lock.

Using the Service Web UI

In the navigation menu, click Rack Units.
In the Rack Units table, click the host name of the compute node you want to make changes to.

The compute node detail page appears.
In the top-right corner of the page, click Controls and select the Provisioning Lock command.

When the confirmation window appears, click Lock to proceed.

After successful completion, the Compute Node Information tab shows Provisioning Locked = Yes.
To release the provisioning lock, click Controls and select the Provisioning Unlock command.

When the confirmation window appears, click Unlock to proceed.

After successful completion, the Compute Node Information tab shows Provisioning Locked = No.

Using the Service CLI

Display the list of compute nodes.

Copy the ID of the compute node for which you want to disable provisioning operations.

PCA-ADMIN> list ComputeNode
Data:
  id                                     name       provisioningState   provisioningType
  --                                     ----       -----------------   ----------------
  3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
  f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
  4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM

Set a provisioning lock on the compute node.

PCA-ADMIN> provisioningLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
JobId: 6ee78c8a-e227-4d31-a770-9b9c96085f3f

Use the job ID to check the status of your command.

PCA-ADMIN> show Job id=6ee78c8a-e227-4d31-a770-9b9c96085f3f
[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded

When the job has completed, confirm that the compute node is under provisioning lock.

PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
[...]
  Provisioning State = Provisioned
  [...]
  Provisioning Locked = true
  Maintenance Locked = false

All provisioning operations are now disabled until the lock is released.

To release the provisioning lock, use this command:

PCA-ADMIN> provisioningUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
JobId: 523892e8-c2d4-403c-9620-2f3e94015b46

Use the job ID to check the status of your command.

PCA-ADMIN> show Job id=523892e8-c2d4-403c-9620-2f3e94015b46
[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded

When the job has completed, confirm that the provisioning lock has been released.

PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
[...]
  Provisioning State = Provisioned
  [...]
  Provisioning Locked = false
  Maintenance Locked = false

Locking a Compute Node for Maintenance

For maintenance operations, compute nodes must be placed in maintenance mode. Follow these instructions to impose and release a maintenance lock. Before you can lock a compute node for maintenance, you must disable provisioning. Maintenance operations can only be performed if the compute node has no running compute instances.
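The two prerequisites named above can be checked before you request the lock. This Python sketch reads the Provisioning Locked field from captured show ComputeNode output and combines it with a count of running instances that you supply. The function names are hypothetical, and the check covers only the conditions described in this topic.

```python
def lock_flags(show_compute_node_output):
    """Extract the lock fields from captured 'show ComputeNode' output as booleans."""
    flags = {}
    for line in show_compute_node_output.splitlines():
        key, sep, value = line.partition(" = ")
        if sep and key.strip() in ("Provisioning Locked", "Maintenance Locked"):
            flags[key.strip()] = value.strip() == "true"
    return flags


def maintenance_lock_blockers(show_compute_node_output, running_instance_count):
    """List the reasons a maintenance lock should not be attempted yet."""
    blockers = []
    if not lock_flags(show_compute_node_output).get("Provisioning Locked", False):
        blockers.append("provisioning is not disabled")
    if running_instance_count > 0:
        blockers.append(f"{running_instance_count} instance(s) still running")
    return blockers


sample = """[...]
  Provisioning State = Provisioned
  [...]
  Provisioning Locked = false
  Maintenance Locked = false
"""
print(maintenance_lock_blockers(sample, running_instance_count=2))
```

An empty list means that both prerequisites are met. It does not guarantee that the lock succeeds, because in-progress migrations can still cause the node to reject it.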

Caution

Depending on the high-availability configuration of the Compute service, automatic instance migrations can prevent you from successfully locking a compute node. See High Availability Configuration for Compute Instances. This situation is more likely to occur when available compute capacity is limited.

Instance recovery or migration operations after a compute node outage can cause a maintenance lock to fail. Compute nodes involved in instance migrations will reject the maintenance lock until the migrations are complete.
Displaced instances can be migrated back to their original fault domain when a compute node maintenance lock is released. A compute node from which a displaced instance is being migrated back rejects the maintenance lock until the migration is complete.
Migrating an instance typically takes no more than 30 seconds. However, large instances and heavy workloads increase the time required.
If an instance gets stuck in the moving state and its migration fails to complete, its host compute node cannot be locked for maintenance. Contact Oracle for assistance.

Using the Service Web UI

Ensure that provisioning has been disabled on the compute node.
Ensure that the compute node has no active instances. They must be migrated or shut down.
In the navigation menu, click Rack Units.
In the Rack Units table, click the host name of the compute node that requires maintenance.

The compute node detail page appears.
In the top-right corner of the page, click Controls and select the Maintenance Lock command.

When the confirmation window appears, click Lock to proceed.

After successful completion, the Compute Node Information tab shows Maintenance Locked = Yes.
To release the maintenance lock, click Controls and select the Maintenance Unlock command.

When the confirmation window appears, click Unlock to proceed.

After successful completion, the Compute Node Information tab shows Maintenance Locked = No.

Using the Service CLI

Display the list of compute nodes.

Copy the ID of the compute node that requires maintenance.

PCA-ADMIN> list ComputeNode
Data:
  id                                     name       provisioningState   provisioningType
  --                                     ----       -----------------   ----------------
  3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
  f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
  4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM

Ensure that provisioning has been disabled on the compute node.

Lock the compute node for maintenance.

PCA-ADMIN> maintenanceLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
JobId: e46f6603-2af2-4df4-a0db-b15156491f88

Use the job ID to check the status of your command.

PCA-ADMIN> show Job id=e46f6603-2af2-4df4-a0db-b15156491f88
[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded

When the job has completed, confirm that the compute node has been locked for maintenance.

PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
[...]
  Provisioning State = Provisioned
  [...]
  Provisioning Locked = true
  Maintenance Locked = true

The compute node is now ready for maintenance.

To release the maintenance lock, use this command:

PCA-ADMIN> maintenanceUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
JobId: 625af20e-4b49-4201-879f-41d4405314c7

Use the job ID to check the status of your command.

PCA-ADMIN> show Job id=625af20e-4b49-4201-879f-41d4405314c7
[...]
  Done = true
  Name = MODIFY_TYPE
  Run State = Succeeded

When the job has completed, confirm that the maintenance lock has been released.

PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
[...]
  Provisioning State = Provisioned
  [...]
  Provisioning Locked = true
  Maintenance Locked = false
