Performing Compute Node Operations

From the Rack Units list of the Service Web UI, an administrator can execute certain operations on hardware components. These operations are accessed from the Actions menu, the button with three vertical dots on the right-hand side of each table row. In practice, only the View Details and Copy ID operations are available for all component types.

When compute nodes are in the discovery state or coming up, their status is Failed until the hardware process transitions them to Ready to Provision. This process typically takes under five minutes. If the failed state persists, use the Service CLI command list ComputeNode to determine the provisioning state of the compute nodes and take appropriate action.

For compute nodes, several other operations are available, either from the Actions menu or from the compute node detail page. Those operations are described in detail in this section, including the equivalent steps in the Service CLI.

Provisioning a Compute Node

Before a compute node can be used to host your compute instances, it must be provisioned by an administrator. The appliance software detects the compute nodes that are installed in the rack and cabled to the switches, meaning they appear in the Rack Units list as Ready to Provision. You can provision them from the Service Web UI or Service CLI.

Using the Service Web UI

  1. In the navigation menu, click Rack Units.

  2. In the Rack Units table, click the host name of the compute node you want to provision.

    The compute node detail page appears.

  3. In the top-right corner of the page, click Controls and select the Provision command.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node you want to provision.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-20 08:53:56,681 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      29f68a0e-4744-4a92-9545-7c48fa365d0a   pcacn001   Ready to Provision  Unspecified
      7a0236f4-b00e-461d-93a0-b22673a18d9c   pcacn003   Ready to Provision  Unspecified
      dc8ae567-b07f-48e0-89bd-e57069c20010   pcacn002   Ready to Provision  Unspecified
  2. Provision the compute node with this command:

    PCA-ADMIN> provision id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Command: provision id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Status: Success
    Time: 2021-08-20 11:35:40,152 UTC
    JobId: ea93cac4-4430-4663-aafd-d70701593fb2

    Use the job ID to check the status of your provision command.

    PCA-ADMIN> show Job id=ea93cac4-4430-4663-aafd-d70701593fb2
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  3. Repeat the provision command for any other compute nodes you want to provision at this time.

  4. Confirm that the compute nodes have been provisioned.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-20 11:38:29,509 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      29f68a0e-4744-4a92-9545-7c48fa365d0a   pcacn001   Provisioned         KVM
      7a0236f4-b00e-461d-93a0-b22673a18d9c   pcacn003   Provisioned         KVM
      dc8ae567-b07f-48e0-89bd-e57069c20010   pcacn002   Provisioned         KVM
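The provision-then-poll pattern above can be scripted around captured Service CLI output. The following Python sketch parses a show Job transcript to decide whether a job finished successfully; the function name and parsing approach are illustrative assumptions, not part of the appliance software.

```python
def job_succeeded(transcript: str) -> bool:
    """Return True when a captured 'show Job' transcript reports a finished,
    successful job. Parses the 'Done' and 'Run State' fields exactly as they
    appear in the Service CLI output shown above."""
    fields = {}
    for line in transcript.splitlines():
        if "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    return fields.get("Done") == "true" and fields.get("Run State") == "Succeeded"
```

A wrapper script could run show Job in a loop and call this helper until it returns True before moving on to the next compute node.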

Providing Platform Images

Platform images are provided during Private Cloud Appliance installation, and new platform images might be provided during appliance upgrade or patching operations.

During installation, upgrade, and patching, new platform images are placed on the management node in /nfs/shared_storage/oci_compute_images. During patching and upgrade, you can run commands to make these images available to Compute Enclave users. See the patchOCIimages command in "Patching Oracle Cloud Infrastructure Images" in the Oracle Private Cloud Appliance Patching Guide, and the upgradeOCIImages command in "Upgrading Oracle Cloud Infrastructure Images" in the Oracle Private Cloud Appliance Upgrade Guide.

The image import command described in Importing Platform Images also makes the images available to Compute Enclave users. Run the importPlatformImages command if images were not imported during patch or upgrade, or if you need to re-import images. You can also use this command to make custom images available to all Compute Enclave users after you put the image in /nfs/shared_storage/oci_compute_images on the management node.

During upgrade and patching, new versions of an image do not replace existing versions on the management node. If more than three versions of an image are available on the management node, only the newest three versions are shown when images are listed in the Compute Enclave. Older platform images are still available to users by specifying the image OCID.
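The newest-three rule can be illustrated with a short Python sketch that groups image files by base name and keeps only the three most recent versions. The file-name pattern follows the image names shown later in this section; the helper itself is hypothetical and not part of the appliance software.

```python
import re
from collections import defaultdict

def visible_images(filenames, keep=3):
    """Return the image files that would be listed, keeping only the newest
    `keep` versions of each base image. The date portion of the file name
    (for example 2023.09.26) sorts correctly as a plain string."""
    groups = defaultdict(list)
    for name in filenames:
        # Split the base name from the trailing date/build suffix.
        m = re.match(r"(.+?)-(\d{4}\.\d{2}\.\d{2}.*)$", name)
        base, version = (m.group(1), m.group(2)) if m else (name, "")
        groups[base].append((version, name))
    shown = []
    for versions in groups.values():
        versions.sort(reverse=True)  # newest version first
        shown.extend(name for _, name in versions[:keep])
    return sorted(shown)
```

Older versions dropped by this filter remain usable, as the text notes, by specifying the image OCID directly.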

Importing Platform Images

Run the importPlatformImages command to make all images that are in /nfs/shared_storage/oci_compute_images on the management node also available in all compartments in all tenancies in the Compute Enclave.

PCA-ADMIN> importPlatformImages
Command: importPlatformImages
Status: Running
Time: 2022-11-10 17:35:20,345 UTC
JobId: f21b9d86-ccf2-4bd3-bab9-04dc3adb2966

Use the JobId to get more detailed information about the job. In the following example, no new images have been delivered:

PCA-ADMIN> show job id=f21b9d86-ccf2-4bd3-bab9-04dc3adb2966
Command: show job id=f21b9d86-ccf2-4bd3-bab9-04dc3adb2966
Status: Success
Time: 2022-11-10 17:35:36,023 UTC
Data: 
  Id = f21b9d86-ccf2-4bd3-bab9-04dc3adb2966
  Type = Job
  Done = true
  Name = OPERATION
  Progress Message = There are no new platform image files to import
  Run State = Succeeded
  Transcript = 2022-11-10 17:35:20.339 : Created job OPERATION
  Username = admin

Listing Platform Images

Use the listplatformImages command to list all platform images that have been imported from the management node.

PCA-ADMIN> listplatformImages
Data:
  id                        displayName                                lifecycleState
  --                        -----------                                --------------
  ocid1.image.unique_ID_1   uln-pca-Oracle-Linux-7.9-2023.09.26_0...   AVAILABLE
  ocid1.image.unique_ID_2   uln-pca-Oracle-Linux-8-2023.09.26_0.oci    AVAILABLE
  ocid1.image.unique_ID_3   uln-pca-Oracle-Linux-9-2023.09.26_0.oci    AVAILABLE
  ocid1.image.unique_ID_4   uln-pca-Oracle-Linux8-OKE-1.26.6-2024...   AVAILABLE
  ocid1.image.unique_ID_5   uln-pca-Oracle-Linux8-OKE-1.27.7-2024...   AVAILABLE
  ocid1.image.unique_ID_6   uln-pca-Oracle-Linux8-OKE-1.28.3-2024...   AVAILABLE
  ocid1.image.unique_ID_7   uln-pca-Oracle-Solaris-11-2023.10.16_...   AVAILABLE

Compute Enclave users see the same lifecycleState that listplatformImages shows. Shortly after you run importPlatformImages, new images might appear with lifecycleState IMPORTING. When the importPlatformImages job completes, the images show as AVAILABLE in both the listplatformImages output and the Compute Enclave.

If you delete a platform image as shown in Deleting Platform Images, both listplatformImages and the Compute Enclave show the image as DELETING or DELETED.

Deleting Platform Images

Use the following command to delete the specified platform image. The image shows as DELETING and then DELETED in listplatformImages output and in the Compute Enclave, and eventually is no longer listed at all. The image file is not deleted from the management node, however; running the importPlatformImages command re-imports the image, making it available again in all compartments.

PCA-ADMIN> deleteplatformImage imageId=ocid1.image.unique_ID_7
JobId: 401567c3-3662-46bb-89d2-b7ad1541fa2d

PCA-ADMIN> listplatformImages
Data:
  id                        displayName                                lifecycleState
  --                        -----------                                --------------
  ocid1.image.unique_ID_1   uln-pca-Oracle-Linux-7.9-2023.09.26_0...   AVAILABLE
  ocid1.image.unique_ID_2   uln-pca-Oracle-Linux-8-2023.09.26_0.oci    AVAILABLE
[...]
  ocid1.image.unique_ID_7   uln-pca-Oracle-Solaris-11-2023.10.16_...   DELETED

Disabling Compute Node Provisioning

Several compute node operations can be performed only if provisioning has been disabled. This section explains how to impose and release a provisioning lock.

Using the Service Web UI

  1. In the navigation menu, click Rack Units.

  2. In the Rack Units table, click the host name of the compute node you want to make changes to.

    The compute node detail page appears.

  3. In the top-right corner of the page, click Controls and select the Provisioning Lock command.

    When the confirmation window appears, click Lock to proceed.

    After successful completion, the Compute Node Information tab shows Provisioning Locked = Yes.

  4. To release the provisioning lock, click Controls and select the Provisioning Unlock command.

    When the confirmation window appears, click Unlock to proceed.

    After successful completion, the Compute Node Information tab shows Provisioning Locked = No.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node for which you want to disable provisioning operations.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-23 09:25:56,307 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
      f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
      4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM
  2. Set a provisioning lock on the compute node.

    PCA-ADMIN> provisioningLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: provisioningLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:29:46,568 UTC
    JobId: 6ee78c8a-e227-4d31-a770-9b9c96085f3f

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=6ee78c8a-e227-4d31-a770-9b9c96085f3f
    Command: show Job id=6ee78c8a-e227-4d31-a770-9b9c96085f3f
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  3. When the job has completed, confirm that the compute node is under provisioning lock.

    PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    [...]
      Provisioning State = Provisioned
      [...]
      Provisioning Locked = true
      Maintenance Locked = false

    All provisioning operations are now disabled until the lock is released.

  4. To release the provisioning lock, use this command:

    PCA-ADMIN> provisioningUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: provisioningUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:44:58,531 UTC
    JobId: 523892e8-c2d4-403c-9620-2f3e94015b46

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=523892e8-c2d4-403c-9620-2f3e94015b46
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  5. When the job has completed, confirm that the provisioning lock has been released.

    PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    [...]
      Provisioning State = Provisioned
      [...]
      Provisioning Locked = false
      Maintenance Locked = false

Locking a Compute Node for Maintenance

For maintenance operations, compute nodes must be placed in maintenance mode. This section explains how to impose and release a maintenance lock. Before you can lock a compute node for maintenance, you must disable provisioning. Maintenance operations can be performed only if the compute node has no running compute instances.

Caution:

Depending on the high-availability configuration of the Compute service, automatic instance migrations can prevent you from successfully locking a compute node. See Configuring the Compute Service for High Availability. This situation is more likely to occur when available compute capacity is limited.

  • Instance recovery or migration operations after a compute node outage can cause a maintenance lock to fail. Compute nodes involved in instance migrations will reject the maintenance lock until the migrations are complete.

  • Displaced instances could be migrated back to their original fault domain when a compute node maintenance lock is released. A compute node from where a displaced instance is migrated back will reject the maintenance lock until the migration is complete.

  • Migrating an instance typically takes no more than 30 seconds. However, large instances and heavy workloads increase the time required.

  • If an instance gets stuck in the moving state and its migration fails to complete, the host compute node cannot be locked for maintenance. Contact Oracle for assistance.
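The preconditions described in this section can be summarized in a small Python sketch; the function and its parameters are illustrative only, not part of the appliance software.

```python
def can_maintenance_lock(provisioning_locked: bool, running_instances: int,
                         migrations_in_progress: bool) -> bool:
    """Preconditions for a maintenance lock, per the rules above: provisioning
    must already be disabled, the compute node must have no running compute
    instances, and it must not be involved in instance migrations."""
    return (provisioning_locked
            and running_instances == 0
            and not migrations_in_progress)
```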

Using the Service Web UI

  1. Ensure that provisioning has been disabled on the compute node.

    See Disabling Compute Node Provisioning.

  2. Ensure that the compute node has no active instances. They must be migrated or shut down.

    See Migrating Instances from a Compute Node.

  3. In the navigation menu, click Rack Units.

  4. In the Rack Units table, click the host name of the compute node that requires maintenance.

    The compute node detail page appears.

  5. In the top-right corner of the page, click Controls and select the Maintenance Lock command.

    When the confirmation window appears, click Lock to proceed.

    After successful completion, the Compute Node Information tab shows Maintenance Locked = Yes.

  6. To release the maintenance lock, click Controls and select the Maintenance Unlock command.

    When the confirmation window appears, click Unlock to proceed.

    After successful completion, the Compute Node Information tab shows Maintenance Locked = No.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node that requires maintenance.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-23 09:25:56,307 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
      f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
      4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM
  2. Ensure that provisioning has been disabled on the compute node.

    See Disabling Compute Node Provisioning.

  3. Lock the compute node for maintenance.

    PCA-ADMIN> maintenanceLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: maintenanceLock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:56:05,443 UTC
    JobId: e46f6603-2af2-4df4-a0db-b15156491f88

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=e46f6603-2af2-4df4-a0db-b15156491f88
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  4. When the job has completed, confirm that the compute node has been locked for maintenance.

    PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    [...]
      Provisioning State = Provisioned
      [...]
      Provisioning Locked = true
      Maintenance Locked = true

    The compute node is now ready for maintenance.

  5. To release the maintenance lock, use this command:

    PCA-ADMIN> maintenanceUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: maintenanceUnlock id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 10:00:53,902 UTC
    JobId: 625af20e-4b49-4201-879f-41d4405314c7

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=625af20e-4b49-4201-879f-41d4405314c7
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  6. When the job has completed, confirm that the maintenance lock has been released.

    PCA-ADMIN> show ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    [...]
      Provisioning State = Provisioned
      [...]
      Provisioning Locked = true
      Maintenance Locked = false

Migrating Instances from a Compute Node

Some compute node operations, such as some maintenance operations, can only be performed if the compute node has no running compute instances. Administrators can migrate all running instances away from a compute node, also known as evacuating the compute node. If enough resources are available, running instances are live migrated to other compute nodes in the same fault domain.

Important:

Before you perform a compute node evacuation, check what the behavior will be for any instances that cannot be migrated to another compute node in the same fault domain.

See Viewing and Setting Compute Service Configuration to check whether strict fault domain enforcement is set.

When strict fault domain enforcement is disabled (Strict FD is set to Disabled in the Service Web UI or Strict FD Enabled is false in the Service CLI), instances that cannot be migrated to another compute node in the same fault domain are migrated to a different fault domain.

When strict fault domain enforcement is enabled (Strict FD is set to Enabled in the Service Web UI or Strict FD Enabled is true in the Service CLI), instances that cannot be migrated to another compute node in the same fault domain do not migrate; those instances are still running in the compute node that you are trying to evacuate.

Enable or disable strict fault domain enforcement to control whether instances that cannot migrate to other compute nodes in the same fault domain are migrated to a different fault domain or remain running on the compute node after you attempt to evacuate it.

If the current fault domain is not able to accommodate some instances that need to be migrated, and strict fault domain enforcement is enabled, you can re-run the migrate operation with the force option specified. When the force option is specified, the Compute service will soft stop any instances that fail to migrate, allowing the evacuation to proceed.

Restart stopped instances. If instances were stopped by the Compute service (not manually stopped by an administrator) and you want them to be automatically restored to running when resources become available, check that the Auto Recovery property of the Compute service is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Viewing and Setting Compute Service Configuration and Configuring the Recovery State for a Stopped Instance.

Instances can be stopped by the Compute service if the force option is used or if no fault domain can accommodate the instances. You can change the Auto Recovery setting at any time before or after the compute node evacuation completes to restart instances that were stopped by the Compute service. If the instance availability recovery action is set to STOP_INSTANCE, the instance remains stopped even though the Auto Recovery property is enabled. If the instance availability recovery action is later changed to RESTORE_INSTANCE, a subsequent Auto Recovery pass will restart the instance.

Return relocated instances. If instances are migrated to a different fault domain (displaced), and you want them returned to their selected fault domain (the fault domain that is specified in the instance configuration) when resources become available, check that the Auto Resolve property of the Compute service is enabled. See Viewing and Setting Compute Service Configuration and Compute Service Configuration Commands. You can set the Auto Resolve property at any time before or after the compute node evacuation completes to relocate any displaced instances.
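The migration rules above (same fault domain first, strict enforcement, the force option) can be condensed into a simplified Python decision function. This is a sketch of the documented behavior, not appliance code, and the names are illustrative.

```python
def evacuation_outcome(fits_same_fd: bool, fits_other_fd: bool,
                       strict_fd: bool, force: bool) -> str:
    """Outcome for one running instance when its compute node is evacuated,
    per the strict-FD and force rules described above (simplified model)."""
    if fits_same_fd:
        return "migrated within selected fault domain"
    if strict_fd:
        # Strict enforcement: the instance never leaves its fault domain.
        return "soft stopped" if force else "migration fails"
    if fits_other_fd:
        return "migrated to a different fault domain (displaced)"
    return "stopped by the Compute service"  # no fault domain can accommodate it
```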

Use the following procedures to perform the migrate operation.

Compute Node Evacuation: Before You Begin

  1. Check fault domain and compute node resources. See Viewing CPU and Memory Usage By Fault Domain. Based on this information, decide whether to do any of the following:

    • Terminate instances that are no longer needed.

    • Reconfigure some instances to use fewer resources. For example, specify a different shape.

    • Reconfigure some instances to specify a different fault domain.

    • Stop some instances while you perform the compute node evacuation.

    • Specify the force option on the migration operation to soft stop any instances that cannot be migrated. See the discussion above of instance availability recovery action and Auto Recovery configuration.

  2. Disable provisioning on the compute node. See Disabling Compute Node Provisioning.

Using the Service Web UI

  1. In the navigation menu, click Rack Units.

  2. In the Rack Units table, click the host name of the compute node that you want to evacuate.

    The compute node detail page appears.

  3. In the top-right corner of the compute node detail page, click Controls and select the Migrate All Vms command.

    The Compute service migrates the running instances to other compute nodes.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node that you want to evacuate.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-23 09:25:56,307 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
      f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
      4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM
  2. Use the migrateVm command to migrate all running compute instances off the compute node.

    PCA-ADMIN> migrateVm id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: migrateVm id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Running
    Time: 2021-08-20 10:37:05,781 UTC
    JobId: 6f1e94bc-7d5b-4002-ada9-7d4b504a2599

    To soft stop any instances that fail to migrate, set the force option:

    PCA-ADMIN> migrateVm id=cn_id force=true

    The Compute service migrates the running instances to other compute nodes.

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=6f1e94bc-7d5b-4002-ada9-7d4b504a2599
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded

Configuring the Compute Service for High Availability

Migrating Instances from a Compute Node describes how to evacuate a compute node for maintenance. In the case of an unplanned compute node outage, the Compute service attempts to evacuate the compute node or to stop and restart the instances.

The following sections describe how you can set high availability configuration to control how the Compute service handles an unplanned outage.

Using Instance and Compute Service High Availability Configuration

The following sections describe how to use high availability configuration to manage outcomes for different types of compute node outages. Instance availability recovery action is the only high availability configuration that is set for each instance. All other high availability configuration is set on the Compute service and affects all instances.

The selected fault domain is the fault domain that is specified in the instance configuration. A displaced instance is in a fault domain that is not its selected fault domain.

Planned Maintenance Outage

See Migrating Instances from a Compute Node for information about using instance availability recovery action (set on each instance), and the Auto Recovery and Auto Resolve properties of the Compute service when performing a compute node evacuation.

Unplanned Outage Less Than Ten Minutes

After an unplanned outage of less than ten minutes, by default the Compute service attempts to restart instances that were running before the outage. Actual behavior depends on how the instances and the Compute service are configured. The following decision flow describes how you can control this behavior.

Do you want the Compute service to attempt to restart instances that were running prior to the outage? This is the default.

  • Yes. Check that Auto Recovery is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Configuring the Recovery State for a Stopped Instance.

    If some instances can no longer be accommodated in their selected fault domain, Auto Recovery will continue to poll and attempt to restart the instances. See also getForcedStoppedInstances.

    If the instance availability recovery action is set to STOP_INSTANCE, the instance will remain stopped, even if Auto Recovery is enabled.

  • No. Disable Auto Recovery. Instances that had been running prior to the outage will remain stopped.

The instance availability recovery action setting and Auto Recovery setting can be changed at any time, and the changes will be effective at the next polling time.
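The decision flow above reduces to a single condition; the following Python sketch encodes it (the function name is illustrative only).

```python
def restarts_after_short_outage(auto_recovery: bool, recovery_action: str) -> bool:
    """Whether the Compute service attempts to restart an instance after an
    unplanned outage of under ten minutes, per the decision flow above:
    Auto Recovery must be enabled and the instance availability recovery
    action must be RESTORE_INSTANCE."""
    return auto_recovery and recovery_action == "RESTORE_INSTANCE"
```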

Unplanned Outage More Than Ten Minutes

After an unplanned outage of more than ten minutes, by default the Compute service attempts to reboot migrate (cold migrate) instances off the compute node, and instances that cannot be accommodated on other compute nodes in the same fault domain are migrated to other fault domains. Actual behavior depends on how the Compute service is configured. The following decision flow describes how you can control this behavior.

Do you want running instances to be reboot migrated? Reboot migration stops and then restarts each running instance on a given compute node. See also "Compute Instance Availability" in "High Availability" in the Architecture and Design chapter of the Oracle Private Cloud Appliance Concepts Guide.

  • Yes. Check that VM High Availability is enabled.

    If some instances cannot be accommodated on another compute node in the same fault domain, do you want those instances to be reboot migrated to a different fault domain?

    • Yes. Check that Strict FD is disabled. Instances that cannot be accommodated in any fault domain remain stopped by the Compute service.

      After reboot migration, do you want instances that are running in a fault domain that is not their selected fault domain to be automatically live migrated to their selected fault domain when resources become available?

      • Yes. Check that Auto Resolve is enabled. See also getDisplacedInstances.

      • No. Disable Auto Resolve.

    • No. Enable Strict FD. Instances that were running prior to the outage and cannot be migrated to another compute node in the current fault domain remain stopped by the Compute service.

  • No. Disable VM High Availability. Instances that were running prior to the outage are stopped by the Compute service.

Do you want instances that were stopped by the Compute service to be automatically restored to running in their selected fault domain? If yes, check that Auto Recovery is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Configuring the Recovery State for a Stopped Instance.
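The full decision flow for a long outage can be sketched in Python. The function is an illustrative simplification that assumes the instance cannot be accommodated on another compute node in its selected fault domain; it is not appliance code.

```python
def long_outage_outcome(vm_ha: bool, strict_fd: bool, auto_resolve: bool) -> str:
    """Instance outcome after a compute node outage longer than ten minutes,
    per the decision flow above, for an instance that does not fit on any
    other compute node in its selected fault domain."""
    if not vm_ha:
        return "stopped by the Compute service"
    if strict_fd:
        return "remains stopped in its selected fault domain"
    if auto_resolve:
        return ("reboot migrated to another fault domain; "
                "live migrated back when resources allow")
    return "reboot migrated to another fault domain; remains displaced"
```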

Viewing and Setting Compute Service Configuration

For information about how these configuration settings work, see Compute Service Configuration Commands.

Using the Service Web UI

In the navigation menu, click FD Instances and then click Compute Service Detail.

The Compute Service Information page shows the current settings for Auto Recovery, Auto Resolve Displaced Instances, VM High Availability, and Strict FD. All of these settings are enabled by default except for Strict FD, which is disabled by default. By default, fault domain placement is not strictly enforced when the Compute service migrates instances.

Use the Controls menu on the Compute Service Information page to change the values of these configuration settings between Enabled and Disabled.

Using the Service CLI

Use the show computeservice command to show the current Compute service configuration settings. In the following example, the default values are set for the four high availability configuration settings: Auto Recovery Action Enabled, Auto-Resolve Displaced Instances Enabled, VM High Availability Enabled, and Strict FD Enabled. All of these settings are true by default except for Strict FD Enabled, which is false by default.

PCA-ADMIN> show computeservice
Command: show computeservice
Status: Success
Time: 2023-04-17 20:37:42,296 UTC
Data:
 Id = unique_ID
 Type = ComputeService
 total CN cpu usage percent = 23.3
 total CN memory usage percent = 16.2
 Auto Recovery Action Enabled = true
 Auto-Resolve Displaced Instances Enabled = true
 VM High Availability Enabled = true
 Strict FD Enabled = false
 Name = Compute Service
 Work State = Normal

To change these settings, use the commands in the following list. The showcustomcmds computeservice command lists all high availability configuration commands in the Compute service.

PCA-ADMIN> showcustomcmds computeservice
    enableAutoRecoveryAction
    disableAutoRecoveryAction
    enableAutoResolveDisplacedInstances
    disableAutoResolveDisplacedInstances
    enableVmHighAvailability
    disableVmHighAvailability
    enableStrictFD
    disableStrictFD
    getForcedStoppedInstances
    getDisplacedInstances

For example, to disable Auto Recovery Action Enabled, run the disableAutoRecoveryAction command. To enable strict fault domain enforcement, run the enableStrictFD command.

Compute Service Configuration Commands

This section describes the behavior of the high availability configuration settings in the Compute service. The list in this section uses the Service CLI command names. To access the equivalent Service Web UI settings, click the navigation menu and click FD Instances. See Viewing and Setting Compute Service Configuration.

In these descriptions, the selected fault domain is the fault domain that is specified in the instance configuration. A displaced instance is in a fault domain that is not its selected fault domain.

enableAutoRecoveryAction

Enables the automatic restart of instances that were stopped by the Compute service. This is the default. If the instance availability recovery action is set to RESTORE_INSTANCE, this command causes instances that were stopped by the Compute service to be automatically restarted in their selected fault domain when resources are available. See also Configuring the Recovery State for a Stopped Instance and getForcedStoppedInstances.

Instances could have been stopped by the Compute service for the following reasons:

  • As a result of specifying the force option on a migrate all operation.

  • Because no fault domain can accommodate these instances.

  • As a result of a compute node outage.

You can set this Auto Recovery property at any time before or after an outage to restart instances that were stopped by the Compute service. If the instance availability recovery action is set to STOP_INSTANCE, the instance remains stopped even though the Auto Recovery property is enabled. If the instance availability recovery action is later changed to RESTORE_INSTANCE, a subsequent Auto Recovery pass will restart the instance.

disableAutoRecoveryAction

Disables the automatic restart of stopped instances. Instances that were stopped by the Compute service are not automatically restarted when resources are available.

enableAutoResolveDisplacedInstances

Enables the return of running instances to their selected fault domain. This is the default. If instances were moved to a different fault domain (displaced) during compute node evacuation, this command enables those instances to be automatically live migrated to their selected fault domain once sufficient resources are available in that fault domain. See also getDisplacedInstances.

You can set this Auto Resolve property at any time before or after an outage to relocate any displaced instances.

Instances that are stopped are not migrated.

disableAutoResolveDisplacedInstances

Disables the return of instances to their selected fault domain. Instances that were moved to a different fault domain during compute node evacuation remain in the fault domain to which they were moved.
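
An Auto Resolve pass can be sketched as selecting the running, displaced instances whose selected fault domain has room again. This is an illustrative model only; the function and field names are hypothetical, not the appliance implementation.

```python
def auto_resolve_pass(instances, auto_resolve_enabled, fd_has_capacity):
    """Return the names of instances that would be live migrated back
    to their selected fault domain."""
    if not auto_resolve_enabled:
        return []
    return [
        inst["name"]
        for inst in instances
        if inst["state"] == "RUNNING"                     # stopped instances are not migrated
        and inst["fault_domain"] != inst["selected_fd"]   # displaced
        and fd_has_capacity(inst["selected_fd"])
    ]

instances = [
    {"name": "app1", "state": "RUNNING",
     "fault_domain": "FAULT-DOMAIN-2", "selected_fd": "FAULT-DOMAIN-1"},
    {"name": "app2", "state": "STOPPED",
     "fault_domain": "FAULT-DOMAIN-3", "selected_fd": "FAULT-DOMAIN-1"},
]
print(auto_resolve_pass(instances, True, lambda fd: True))  # ['app1']
```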

enableVmHighAvailability

Enables high availability (reboot migration): instances on an unreachable compute node are recovered through reboot migration to another compute node. This is the default.

disableVmHighAvailability

Disables reboot migration. Instances on an unreachable compute node are not automatically recovered on another compute node.

enableStrictFD

Enables strict fault domain enforcement. During compute node evacuation, any instance that cannot be moved to a different compute node in the same fault domain is stopped if the force option was specified. If the force option was not specified, the migrate operation fails.

disableStrictFD

Disables strict fault domain enforcement. This is the default. During compute node evacuation, any instance that cannot be moved to a different compute node in the same fault domain is moved to a different fault domain. This move is temporary if the Auto Resolve property of the Compute service is enabled: when resources become available, the moved instances are live migrated back to their selected fault domain. See also getDisplacedInstances.
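
The evacuation behavior under enableStrictFD and disableStrictFD can be summarized as a decision table. The sketch below is illustrative only; the real placement logic is internal to the Compute service, and the outcome names are hypothetical labels.

```python
def place_during_evacuation(same_fd_capacity, strict_fd, force):
    """Decide what happens to one instance during compute node evacuation.

    same_fd_capacity: another compute node in the same fault domain can host it
    strict_fd:        strict fault domain enforcement is enabled
    force:            the force option was specified on the migrate operation
    """
    if same_fd_capacity:
        return "MIGRATE_WITHIN_FD"
    if strict_fd:
        # Strict enforcement: never cross fault domains.
        return "STOP_INSTANCE" if force else "MIGRATION_FAILS"
    # Lenient default: the instance becomes displaced and may later be
    # returned by Auto Resolve.
    return "MIGRATE_TO_OTHER_FD"

print(place_during_evacuation(False, True, True))    # STOP_INSTANCE
print(place_during_evacuation(False, False, False))  # MIGRATE_TO_OTHER_FD
```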

getForcedStoppedInstances

Lists all instances that were stopped by using the force option on a migrate operation, or that were stopped by the Compute service because no fault domain could accommodate them.

PCA-ADMIN> getForcedStoppedInstances
Command: getForcedStoppedInstances
Status: Success
Time: 2023-04-17 20:53:51,410 UTC
Data:
 id                        displayName  compartmentId
 --                        -----------  -------------
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID

In the Service Web UI, click the navigation menu, click FD Instances, and then click Forced Stopped Instances. Use the Actions menu to copy the OCIDs.

getDisplacedInstances

Lists instances that are currently running in a fault domain that is not their selected fault domain. Instances that are not running are not shown.

In the following example, running instances are being migrated away from fault domain 1. One instance has been placed in fault domain 2 and one has been placed in fault domain 3.

PCA-ADMIN> getDisplacedInstances
Command: getDisplacedInstances
Status: Success
Time: 2023-04-18 23:20:41,484 UTC
Data:
 id                        displayName  compartmentId                faultDomain     faultDomainSelected
 --                        -----------  -------------                -----------     -------------------
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-3  FAULT-DOMAIN-1
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-2  FAULT-DOMAIN-1

In the Service Web UI, click the navigation menu, click FD Instances, and then click Displaced Instances. Use the Actions menu to copy the OCIDs.

Configuring the Recovery State for a Stopped Instance

If the Compute service stopped an instance, you can configure how that stopped instance is treated when resources become available again by setting the instance availability recovery action and the Auto Recovery property of the Compute service.

See the description of the enableAutoRecoveryAction command in Compute Service Configuration Commands for reasons that an instance can be stopped by the Compute service. See also the descriptions of disableAutoRecoveryAction and getForcedStoppedInstances.

During instance launch or in a subsequent instance update, set the instance recovery action in the instance availability configuration.

In the Compute Web UI, see the "Availability configuration" section in the dialog to create or edit an instance or create or edit an instance configuration. To restart instances that were stopped by the Compute service, check the box labeled "Restore instance lifecycle state after infrastructure maintenance". This is the default. To keep stopped instances stopped, uncheck the "Restore instance" box.

In the OCI CLI, use the --availability-config option or the availabilityConfig property in the compute instance launch or update command or the instance configuration create or update command. Set the recoveryAction to RESTORE_INSTANCE or STOP_INSTANCE. The default behavior is RESTORE_INSTANCE.

"availabilityConfig": {"recoveryAction": "STOP_INSTANCE"}

Enabling Strict Fault Domain Enforcement

To enable strict fault domain enforcement, do one of the following:

  • In the Service Web UI, click the navigation menu, click FD Instances, and click Compute Service Detail. On the Compute Service Information page, click the Controls menu, and click Enable Strict FD.

  • In the Service CLI, run the enableStrictFD command.

For more information about the effect of fault domain enforcement, see Compute Service Configuration Commands.

If the current fault domain does not have enough resources to accommodate all instances that need to be migrated, do the following:

  • If you are performing a planned compute node evacuation, specify the force option on the migration operation to stop the instances in their current fault domain.

  • Run the enableAutoRecoveryAction command or select Enable Auto Recovery in the Service Web UI.

  • Ensure that the instance availability recovery action for each instance is set to RESTORE_INSTANCE, which is the default. See Configuring the Recovery State for a Stopped Instance.

See the example in Migrating Instances from a Compute Node.

Starting, Resetting or Stopping a Compute Node

The Service Enclave allows administrators to send start, reboot and shutdown signals to the compute nodes.

Using the Service Web UI

  1. Make sure that the compute node is locked for maintenance.

    See Locking a Compute Node for Maintenance.

  2. In the navigation menu, click Rack Units.

  3. In the Rack Units table, locate the compute node you want to start, reset or stop.

  4. Click the Action menu (three vertical dots) and select the appropriate action: Start, Reset, or Stop.

  5. When the confirmation window appears, click the appropriate action button to proceed.

    A pop-up window appears for a few seconds to confirm that the compute node is starting, stopping, or restarting.

  6. When the compute node is up and running again, release the maintenance and provisioning locks.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node that you want to start, reset or stop.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-23 09:25:56,307 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      3e62bf25-a26c-407e-ab8b-df01a4ad98b6   pcacn002   Provisioned         KVM
      f7b8356b-052f-4911-babb-447e6ab9c78d   pcacn003   Provisioned         KVM
      4e06ebdf-faed-484e-996d-d77af786f123   pcacn001   Provisioned         KVM
  2. Make sure that the compute node is locked for maintenance.

    See Locking a Compute Node for Maintenance.

  3. Start, reset or stop the compute node using the corresponding command:

    PCA-ADMIN> start ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: start ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:26:06,446 UTC
    Data:
      Success
    
    PCA-ADMIN> reset id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: reset id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:27:06,434 UTC
    Data:
      Success
    PCA-ADMIN> stop ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Command: stop ComputeNode id=f7b8356b-052f-4911-babb-447e6ab9c78d
    Status: Success
    Time: 2021-08-23 09:31:38,271 UTC
    Data:
      Success
  4. When the compute node is up and running again, release the maintenance and provisioning locks.

Deprovisioning a Compute Node

If you need to take a compute node out of service, for example to replace a defective one, you must deprovision it first, so that its data is removed cleanly from the system databases.

Using the Service Web UI

  1. In the navigation menu, click Rack Units.

  2. In the Rack Units table, click the host name of the compute node you want to deprovision.

    The compute node detail page appears.

  3. In the top-right corner of the page, click Controls and select the Provisioning Lock command.

    When the confirmation window appears, click Lock to proceed.

    After successful completion, the Compute Node Information tab shows Provisioning Locked = Yes.

  4. Make sure that no more compute instances are running on the compute node.

    Click Controls and select the Migrate All Vms command. The system migrates the instances to other compute nodes.

  5. To deprovision the compute node, click Controls and select the Deprovision command.

    When the confirmation window appears, click Deprovision to proceed.

    After successful completion, the Compute Node Information tab shows Provisioning State = Ready to Provision.

Using the Service CLI

  1. Display the list of compute nodes.

    Copy the ID of the compute node you want to deprovision.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-20 08:53:56,681 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      29f68a0e-4744-4a92-9545-7c48fa365d0a   pcacn001   Provisioned         KVM
      7a0236f4-b00e-461d-93a0-b22673a18d9c   pcacn003   Provisioned         KVM
      dc8ae567-b07f-48e0-89bd-e57069c20010   pcacn002   Provisioned         KVM
  2. Set a provisioning lock on the compute node.

    PCA-ADMIN> provisioningLock id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Command: provisioningLock id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Status: Success
    Time: 2021-08-20 10:30:00,320 UTC
    JobId: ed4a4646-6d73-41f9-9cb0-73ea35e0d766

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=ed4a4646-6d73-41f9-9cb0-73ea35e0d766
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  3. Confirm that the compute node is under provisioning lock.

    PCA-ADMIN> show ComputeNode id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    [...]
      Provisioning Locked = true
  4. Migrate all running compute instances off the compute node you want to deprovision.

    PCA-ADMIN> migrateVm id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Command: migrateVm id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Status: Running
    Time: 2021-08-20 10:37:05,781 UTC
    JobId: 6f1e94bc-7d5b-4002-ada9-7d4b504a2599

    Use the job ID to check the status of your command.

    PCA-ADMIN> show Job id=6f1e94bc-7d5b-4002-ada9-7d4b504a2599
    Command: show Job id=6f1e94bc-7d5b-4002-ada9-7d4b504a2599
    Status: Success
    Time: 2021-08-20 10:39:59,025 UTC
    Data:
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  5. Deprovision the compute node with this command:

    PCA-ADMIN> deprovision id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Command: deprovision id=7a0236f4-b00e-461d-93a0-b22673a18d9c
    Status: Success
    Time: 2021-08-20 11:30:43,793 UTC
    JobId: 9868fdac-ddb6-4260-9ce1-c018cf2ddc8d

    Use the job ID to check the status of your deprovision command.

    PCA-ADMIN> show Job id=9868fdac-ddb6-4260-9ce1-c018cf2ddc8d
    [...]
      Done = true
      Name = MODIFY_TYPE
      Run State = Succeeded
  6. Confirm that the compute node has been deprovisioned.

    PCA-ADMIN> list ComputeNode
    Command: list ComputeNode
    Status: Success
    Time: 2021-08-20 08:53:56,681 UTC
    Data:
      id                                     name       provisioningState   provisioningType
      --                                     ----       -----------------   ----------------
      29f68a0e-4744-4a92-9545-7c48fa365d0a   pcacn001   Provisioned         KVM
      7a0236f4-b00e-461d-93a0-b22673a18d9c   pcacn003   Ready to Provision  Unspecified
      dc8ae567-b07f-48e0-89bd-e57069c20010   pcacn002   Provisioned         KVM