Configuring High Availability in the Compute Service

Administrators can set parameters to control how the Compute service tries to keep instances available in response to planned or unplanned compute node outages. Compute service high availability settings affect all compute instances, and interact with individual instance recovery settings.

When planned maintenance needs to be performed, a compute node is evacuated. If possible, the Compute service live migrates all running instances to other compute nodes in the same fault domain. If this default scenario cannot be performed, the high availability (HA) parameters, at the level of the Compute service and the individual instance, determine further options to migrate, stop, and recover affected compute instances.

In the case of an unplanned compute node outage, the Compute service stops the instances, and if the outage persists, attempts to evacuate the compute node by restarting the affected instances on other compute nodes. This automated form of cold migration is called reboot migration.

Instance and Compute Service High Availability Configuration

The high availability (HA) configuration of the Compute service enables you to manage outcomes for different types of compute node outages. The instance availability recovery action is the only high availability setting that is configured for each instance. All other high availability settings are configured on the Compute service and affect all instances.

The selected fault domain is the fault domain that is specified in the instance configuration. A displaced instance is in a fault domain that is not its selected fault domain.

Planned Maintenance Outage

See Compute Node Maintenance Operations for information about how to evacuate a compute node. If possible, the Compute service live migrates running instances to other compute nodes in the same fault domain. That section also describes how to use the instance availability recovery action (set on each instance) and the Auto Recovery and Auto Resolve properties of the Compute service when performing a compute node evacuation.

Unplanned Outages

The Compute service attempts to stop instances and reboot migrate the instances under the following compute node outage conditions:

Power-down reported by the hardware (HW) status
Inability to reach the compute node data network

A compute node can experience an outage in which the Compute service cannot migrate the instances. For example, if the Compute service cannot reach the compute node at all, then the Compute service cannot stop and reboot migrate the instances.

Unplanned Outage Shorter than Five Minutes

In an unplanned outage, the Compute service stops the affected instances. If the outage lasts less than five minutes, by default the Compute service attempts to restart instances that were running before the outage. Actual behavior depends on how the instances and the Compute service are configured. The following decision flow describes how you can control this behavior.

Do you want the Compute service to attempt to restart instances that were running prior to the outage? This is the default.

Yes. Check that Auto Recovery is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Configuring the Recovery State for a Stopped Instance.

If some instances can no longer be accommodated in their selected fault domain, Auto Recovery will continue to poll and attempt to restart the instances. See also getForcedStoppedInstances.

If the instance availability recovery action is set to STOP_INSTANCE, the instance will remain stopped, even if Auto Recovery is enabled.
No. Disable Auto Recovery. Instances that had been running prior to the outage will remain stopped.

The instance availability recovery action setting and Auto Recovery setting can be changed at any time, and the changes will be effective at the next polling time.
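
The decision flow for a short outage reduces to two inputs: the Auto Recovery property of the Compute service and the recovery action of the individual instance. The following Python sketch only illustrates that logic. The function name is hypothetical and is not part of the Compute service or its CLI, and the sketch assumes that resources are available in the selected fault domain.

```python
VALID_ACTIONS = ("RESTORE_INSTANCE", "STOP_INSTANCE")


def eligible_for_restart(auto_recovery_enabled: bool, recovery_action: str) -> bool:
    """Return True if an instance stopped by the Compute service is
    eligible to be restarted at the next Auto Recovery polling time.

    Hypothetical helper: it models the documented decision flow only.
    """
    if recovery_action not in VALID_ACTIONS:
        raise ValueError(f"unknown recovery action: {recovery_action}")
    # Both conditions must hold: the service-level property is enabled
    # and the instance-level action asks for the instance to be restored.
    return auto_recovery_enabled and recovery_action == "RESTORE_INSTANCE"


if __name__ == "__main__":
    for enabled in (True, False):
        for action in VALID_ACTIONS:
            result = eligible_for_restart(enabled, action)
            print(f"Auto Recovery={enabled}, action={action}: restart={result}")
```

Only the combination of Auto Recovery enabled and RESTORE_INSTANCE yields a restart. Every other combination leaves the instance stopped.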

Unplanned Outage Longer than Five Minutes

In an unplanned outage, the Compute service stops the affected instances. If the outage lasts more than five minutes, by default the Compute service attempts to reboot migrate (cold migrate) instances off the compute node. Instances that cannot be accommodated on other compute nodes in the same fault domain are reboot migrated to other fault domains. Actual behavior depends on how the Compute service is configured. The following decision flow describes how you can control this behavior.

Do you want running instances to be reboot migrated? Reboot migration stops each running instance on the affected compute node and restarts it on another compute node. See also High Availability Configuration for Compute Instances.

Yes. Check that VM High Availability is enabled.

If some instances cannot be accommodated on another compute node in the same fault domain, do you want those instances to be reboot migrated to a different fault domain?
- Yes. Check that Strict FD is disabled. Instances that cannot be accommodated in any fault domain remain stopped by the Compute service.
  
  After reboot migration, do you want instances that are running in a fault domain that is not their selected fault domain to be automatically live migrated to their selected fault domain when resources become available?
  - Yes. Check that Auto Resolve is enabled. See also getDisplacedInstances.
  - No. Disable Auto Resolve.
- No. Enable Strict FD. Instances that were running prior to the outage and cannot be migrated to another compute node in the current fault domain remain stopped by the Compute service.
No. Disable VM High Availability. Instances that were running prior to the outage are stopped by the Compute service.

Do you want instances that were stopped by the Compute service to be automatically restored to running in their selected fault domain? If yes, check that Auto Recovery is enabled and the instance availability recovery action is set to RESTORE_INSTANCE. See Configuring the Recovery State for a Stopped Instance.
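
The decision flow for a long outage can be summarized as a small function. The following Python sketch is illustrative only: the Outcome enum, the function name, and the two capacity flags are hypothetical and are not part of the Compute service. The capacity flags stand in for whatever placement check the Compute service performs internally.

```python
from enum import Enum


class Outcome(Enum):
    MIGRATED_SAME_FD = "reboot migrated within the same fault domain"
    MIGRATED_OTHER_FD = "reboot migrated to a different fault domain (displaced)"
    STOPPED = "remains stopped by the Compute service"


def long_outage_outcome(
    vm_ha_enabled: bool,
    strict_fd_enabled: bool,
    fits_same_fd: bool,
    fits_other_fd: bool,
) -> Outcome:
    """Model the documented outcome for one instance that was running
    on a compute node whose outage lasts longer than five minutes."""
    if not vm_ha_enabled:
        # VM High Availability disabled: no reboot migration at all.
        return Outcome.STOPPED
    if fits_same_fd:
        return Outcome.MIGRATED_SAME_FD
    if strict_fd_enabled:
        # Strict FD forbids leaving the current fault domain.
        return Outcome.STOPPED
    if fits_other_fd:
        return Outcome.MIGRATED_OTHER_FD
    # No fault domain can accommodate the instance.
    return Outcome.STOPPED


if __name__ == "__main__":
    # Default configuration, no room in the same fault domain.
    print(long_outage_outcome(True, False, False, True).value)
```

An instance that ends in the MIGRATED_OTHER_FD state is a displaced instance, which is the case that Auto Resolve addresses. An instance that ends in the STOPPED state is the case that Auto Recovery addresses.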

Viewing and Setting Compute Service Configuration

For information about how these configuration settings work, see Compute Service Configuration Commands.

Using the Service Web UI

On the navigation menu, click FD Instances and then click Compute Service Detail.

The Compute Service Information page shows the current settings for Auto Recovery, Auto Resolve Displaced Instances, VM High Availability, and Strict FD. All of these settings are enabled by default except Strict FD, which is disabled by default. That is, fault domain placement is not strictly enforced by default when the Compute service migrates instances.

Use the Controls menu on the Compute Service Information page to change the values of these configuration settings between Enabled and Disabled.

Using the Service CLI

Use the show computeservice command to show the current Compute service configuration settings. In the following example, the default values are set for the four high availability configuration settings: Auto Recovery Action Enabled, Auto-Resolve Displaced Instances Enabled, VM High Availability Enabled, and Strict FD Enabled. All of these settings are true by default except for Strict FD Enabled, which is false by default.

PCA-ADMIN> show computeservice
Data:
 Id = unique_ID
 Type = ComputeService
 total CN cpu usage percent = 23.3
 total CN memory usage percent = 16.2
 Auto Recovery Action Enabled = true
 Auto-Resolve Displaced Instances Enabled = true
 VM High Availability Enabled = true
 Strict FD Enabled = false
 Name = Compute Service
 Work State = Normal
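
If you capture this output in a script, for example to check the settings before maintenance, the "name = value" lines are straightforward to parse. The following Python sketch assumes the output format shown above and that the captured text is already available as a string. The function name is hypothetical, and how the text is captured from the Service CLI is outside the scope of the sketch.

```python
SAMPLE_OUTPUT = """\
Data:
 Id = unique_ID
 Type = ComputeService
 total CN cpu usage percent = 23.3
 total CN memory usage percent = 16.2
 Auto Recovery Action Enabled = true
 Auto-Resolve Displaced Instances Enabled = true
 VM High Availability Enabled = true
 Strict FD Enabled = false
 Name = Compute Service
 Work State = Normal
"""

HA_SETTINGS = (
    "Auto Recovery Action Enabled",
    "Auto-Resolve Displaced Instances Enabled",
    "VM High Availability Enabled",
    "Strict FD Enabled",
)


def parse_show_output(text: str) -> dict:
    """Parse 'name = value' lines into a dict (hypothetical helper).

    The strings 'true' and 'false' become booleans; other values
    are kept as strings. Lines without ' = ' are ignored.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue  # for example, the "Data:" line
        value = value.strip()
        settings[key.strip()] = {"true": True, "false": False}.get(value, value)
    return settings


if __name__ == "__main__":
    parsed = parse_show_output(SAMPLE_OUTPUT)
    for name in HA_SETTINGS:
        print(f"{name}: {parsed[name]}")
```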

To change these settings, use the commands in the following list. The showcustomcmds computeservice command lists all high availability configuration commands in the Compute service.

PCA-ADMIN> showcustomcmds computeservice
    enableAutoRecoveryAction
    disableAutoRecoveryAction
    enableAutoResolveDisplacedInstances
    disableAutoResolveDisplacedInstances
    enableVmHighAvailability
    disableVmHighAvailability
    enableStrictFD
    disableStrictFD
    getForcedStoppedInstances
    getDisplacedInstances

For example, to disable Auto Recovery, run the disableAutoRecoveryAction command. To enable strict fault domain enforcement, run the enableStrictFD command.
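
Each setting has one enable command and one disable command, so the commands needed to move from the current configuration to a desired one can be derived mechanically. The following Python sketch shows that mapping. The command names and setting labels come from the listings above, while the function and variable names are hypothetical. The sketch only produces the command names and does not run them.

```python
# Setting label (as printed by show computeservice) mapped to its
# (enable command, disable command) pair.
COMMANDS = {
    "Auto Recovery Action Enabled": (
        "enableAutoRecoveryAction", "disableAutoRecoveryAction"),
    "Auto-Resolve Displaced Instances Enabled": (
        "enableAutoResolveDisplacedInstances", "disableAutoResolveDisplacedInstances"),
    "VM High Availability Enabled": (
        "enableVmHighAvailability", "disableVmHighAvailability"),
    "Strict FD Enabled": (
        "enableStrictFD", "disableStrictFD"),
}


def commands_to_apply(current: dict, desired: dict) -> list:
    """Return the Service CLI command names needed to change the
    current settings into the desired settings (hypothetical helper)."""
    commands = []
    for name, want in desired.items():
        if current.get(name) != want:
            enable, disable = COMMANDS[name]
            commands.append(enable if want else disable)
    return commands


if __name__ == "__main__":
    defaults = {
        "Auto Recovery Action Enabled": True,
        "Auto-Resolve Displaced Instances Enabled": True,
        "VM High Availability Enabled": True,
        "Strict FD Enabled": False,
    }
    wanted = {"Strict FD Enabled": True, "Auto Recovery Action Enabled": False}
    for command in commands_to_apply(defaults, wanted):
        print(command)
```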

Compute Service Configuration Commands

The Service CLI commands for Compute service HA configuration are shown in the list that follows. To access the equivalent Service Web UI settings, click the navigation menu and click FD Instances. See Viewing and Setting Compute Service Configuration.

In these descriptions, the selected fault domain is the fault domain that is specified in the instance configuration. A displaced instance is in a fault domain that is not its selected fault domain.

enableAutoRecoveryAction

Enables the automatic restart of instances that were stopped by the Compute service. This is the default. If the instance availability recovery action is set to RESTORE_INSTANCE, this command causes instances that were stopped by the Compute service to be automatically restarted in their selected fault domain when resources are available. See also Configuring the Recovery State for a Stopped Instance and getForcedStoppedInstances.

Instances could have been stopped by the Compute service for the following reasons:

The force option was specified on a migrate all operation, and some instances could not be migrated. See Compute Node Maintenance Operations.
An unplanned compute node outage occurred.

You can set this Auto Recovery property at any time before or after an administrative maintenance outage or an unplanned outage to restart instances that were stopped by the Compute service. If the instance availability recovery action is set to STOP_INSTANCE, the instance remains stopped even though the Auto Recovery property is enabled. If the instance availability recovery action is later changed to RESTORE_INSTANCE, a subsequent Auto Recovery pass will restart the instance.

disableAutoRecoveryAction

Disables the automatic restart of stopped instances. Instances that were stopped by the Compute service are not automatically restarted when resources are available.

enableAutoResolveDisplacedInstances

Enables the return of running instances to their selected fault domain. This is the default. If instances were moved to a different fault domain (displaced) during compute node evacuation, this command enables those instances to be automatically live migrated to their selected fault domain once sufficient resources are available in that fault domain. See also getDisplacedInstances.

You can set this Auto Resolve configuration at any time before or after an outage to relocate any displaced instances. Instances that are stopped are not migrated.

disableAutoResolveDisplacedInstances

Disables the return of instances to their selected fault domain. Instances that were moved to a different fault domain during compute node evacuation remain in the fault domain to which they were moved.

enableVmHighAvailability

Enables high availability, that is, reboot migration of instances off an unreachable compute node. This is the default.

disableVmHighAvailability

Disables reboot migration.

enableStrictFD

Enables strict fault domain enforcement. During compute node evacuation, any instance that cannot be moved to a different compute node in the same fault domain is stopped if the force option was specified. If the force option was not specified, the migrate operation fails.

disableStrictFD

Disables strict fault domain enforcement. This is the default. During compute node evacuation, any instance that cannot be moved to a different compute node in the same fault domain is moved to a different fault domain. This move is temporary if the Auto Resolve property of the Compute service is enabled: when resources become available, the moved instances are live migrated back to their selected fault domain. See also getDisplacedInstances.
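
The interaction between Strict FD and the force option during a compute node evacuation can be expressed as a short function. The following Python sketch is illustrative only. The function name, parameter names, and returned strings are hypothetical, and the sketch assumes that another fault domain has capacity when Strict FD is disabled.

```python
def evacuation_outcome(strict_fd: bool, force: bool, fits_same_fd: bool) -> str:
    """Describe what happens to one instance during compute node
    evacuation, following the enableStrictFD and disableStrictFD
    descriptions (hypothetical helper)."""
    if fits_same_fd:
        return "moved to another compute node in the same fault domain"
    if not strict_fd:
        # Assumption: some other fault domain can accommodate the instance.
        return "moved to a different fault domain"
    if force:
        return "stopped"
    return "migrate operation fails"


if __name__ == "__main__":
    for strict_fd in (False, True):
        for force in (False, True):
            outcome = evacuation_outcome(strict_fd, force, fits_same_fd=False)
            print(f"Strict FD={strict_fd}, force={force}: {outcome}")
```

The force option changes the result only when Strict FD is enabled and the instance does not fit in its current fault domain.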

getForcedStoppedInstances

Lists all instances that were stopped because the force option was used on the migrate operation or that were stopped by the Compute service in response to an unplanned outage.

PCA-ADMIN> getForcedStoppedInstances
Data:
 id                        displayName  compartmentId
 --                        -----------  -------------
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID

In the Service Web UI, click the navigation menu, click FD Instances, and then click Forced Stopped Instances. Use the Actions menu to copy the OCIDs.

getDisplacedInstances

Lists instances that are currently running in a fault domain that is not their selected fault domain. Instances that are not running are not shown.

In the following example, running instances are being migrated away from fault domain 1. One instance has been placed in fault domain 2 and one has been placed in fault domain 3.

PCA-ADMIN> getDisplacedInstances
Data:
 id                        displayName  compartmentId                faultDomain     faultDomainSelected
 --                        -----------  -------------                -----------     -------------------
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-3  FAULT-DOMAIN-1
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-2  FAULT-DOMAIN-1

In the Service Web UI, click the navigation menu, click FD Instances, and then click Displaced Instances. Use the Actions menu to copy the OCIDs.
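
To summarize a longer list of displaced instances, the tabular getDisplacedInstances output can be parsed into records. The following Python sketch assumes the column layout shown in the example above and assumes that no field contains spaces, which might not hold for real display names. The function name is hypothetical.

```python
from collections import Counter

SAMPLE_OUTPUT = """\
Data:
 id                        displayName  compartmentId                faultDomain     faultDomainSelected
 --                        -----------  -------------                -----------     -------------------
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-3  FAULT-DOMAIN-1
 ocid1.instance.unique_ID  inst-name    ocid1.compartment.unique_ID  FAULT-DOMAIN-2  FAULT-DOMAIN-1
"""


def parse_displaced(text: str) -> list:
    """Parse the whitespace-separated table into a list of dicts
    keyed by column name (hypothetical helper)."""
    lines = [ln for ln in text.splitlines() if ln.strip() and ln.strip() != "Data:"]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        fields = line.split()
        if all(set(field) == {"-"} for field in fields):
            continue  # separator row made of dashes
        if len(fields) != len(header):
            raise ValueError(f"unexpected column count: {line!r}")
        rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    rows = parse_displaced(SAMPLE_OUTPUT)
    # Count displaced instances by the fault domain they currently run in.
    for fault_domain, count in sorted(Counter(r["faultDomain"] for r in rows).items()):
        print(f"{fault_domain}: {count}")
```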

Configuring the Recovery State for a Stopped Instance

If the Compute service stopped an instance, you can configure how that stopped instance will be treated when resources are again available by setting the instance availability recovery action and the Auto Recovery property of the Compute service.

See the description of the enableAutoRecoveryAction command in Compute Service Configuration Commands for reasons that an instance can be stopped by the Compute service. See also the descriptions of disableAutoRecoveryAction and getForcedStoppedInstances.

During instance launch or in a subsequent instance update, set the instance recovery action in the instance availability configuration.

In the Compute Web UI, see the "Availability configuration" section in the dialog to create or edit an instance or create or edit an instance configuration. To restart instances that were stopped by the Compute service, check the box labeled "Restore instance lifecycle state after infrastructure maintenance". This is the default. To keep stopped instances stopped, uncheck the "Restore instance" box.

In the OCI CLI, use the --availability-config option or the availabilityConfig property in the compute instance launch or update command or the instance configuration create or update command. Set the recoveryAction to RESTORE_INSTANCE or STOP_INSTANCE. The default behavior is RESTORE_INSTANCE.
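
Because only two recovery action values are accepted, it can help to generate the JSON value programmatically rather than typing it by hand. The following Python sketch builds and validates that value. The function name is hypothetical, and passing the resulting string as the value of the --availability-config option is an assumption based on the description above, so check the OCI CLI help for the exact input format.

```python
import json

VALID_ACTIONS = ("RESTORE_INSTANCE", "STOP_INSTANCE")


def availability_config_json(recovery_action: str = "RESTORE_INSTANCE") -> str:
    """Return the availability configuration as a JSON string
    (hypothetical helper). RESTORE_INSTANCE is the documented default."""
    if recovery_action not in VALID_ACTIONS:
        raise ValueError(
            f"recoveryAction must be one of {VALID_ACTIONS}, got {recovery_action!r}"
        )
    return json.dumps({"recoveryAction": recovery_action})


if __name__ == "__main__":
    print(availability_config_json("STOP_INSTANCE"))
```

In a JSON request body, the same setting appears as a property, for example: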

"availabilityConfig": {"recoveryAction": "STOP_INSTANCE"}

Enabling Strict Fault Domain Enforcement

To enable strict fault domain enforcement, do one of the following:

In the Service Web UI, click the navigation menu, click FD Instances, and click Compute Service Detail. On the Compute Service Information page, click the Controls menu, and click Enable Strict FD.
In the Service CLI, run the enableStrictFD command.

For more information about the effect of fault domain enforcement, see Compute Service Configuration Commands.

If the current fault domain does not have enough resources to accommodate all instances that need to be migrated, do the following:

If you are performing a planned compute node evacuation, specify the force option on the migration operation to stop the instances in their current fault domain.
Run the enableAutoRecoveryAction command or select Enable Auto Recovery in the Service Web UI.
Ensure that the instance availability recovery action for each instance is set to RESTORE_INSTANCE, which is the default. See Configuring the Recovery State for a Stopped Instance.

See the example in Evacuating a Compute Node.