High Availability Configuration for Compute Instances
For compute instances, high availability means automated recovery when the underlying infrastructure fails or a component is brought down for maintenance. The state of the compute nodes, hypervisors, and compute instances is monitored continually.
High availability (HA) of compute instances is configurable. The behavior described in this section is based on standard settings. For information about configurable HA settings – such as reboot migration, fault domain placement, and automatic recovery – see Configuring High Availability in the Compute Service.
By default, the system attempts to live-migrate or restart instances in their selected fault domain, but it can also restart instances in other fault domains if insufficient resources are available in the selected fault domain. The selected fault domain is the one specified in the instance configuration.
- Compute Node Outage
If a compute node goes down because of an unplanned reboot, its instances are restarted when the compute node successfully returns to normal operation. This behavior is configurable. By default, at the next polling interval, the start command is issued again for any instances that should be running but are in a different state. If any instances have crashed and remain in that state, the hypervisor attempts to restart them up to 5 times. Instances that were not running before the compute node became unavailable remain shut down when the compute node is up and running again.
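The default recovery behavior after an unplanned reboot can be sketched roughly as follows. This is an illustrative model only, not the Compute service implementation: the `Instance` class, the `poll_once` function, and the `start` callback are hypothetical names, and the sketch folds the polling re-start and the hypervisor's restart attempts into a single loop for brevity. Only the limit of 5 restart attempts comes from the text above.

```python
from dataclasses import dataclass

MAX_RESTART_ATTEMPTS = 5  # restart limit stated in the text


@dataclass
class Instance:
    """Hypothetical record of one compute instance on the rebooted node."""
    name: str
    desired_state: str  # "running" or "stopped": the state before the outage
    actual_state: str   # "running", "stopped", or "crashed"
    restart_attempts: int = 0


def poll_once(instances, start):
    """One polling pass: re-issue start for instances that should be running.

    `start` is a hypothetical callback that returns True if the instance
    came up and False if it crashed again.
    """
    for inst in instances:
        if inst.desired_state != "running":
            continue  # instances stopped before the outage stay shut down
        if inst.actual_state == "running":
            continue  # nothing to do
        if inst.restart_attempts >= MAX_RESTART_ATTEMPTS:
            continue  # restart limit reached; leave the instance as it is
        inst.restart_attempts += 1
        inst.actual_state = "running" if start(inst) else "crashed"
```

Calling `poll_once` repeatedly models successive polling intervals: an instance that starts cleanly is left alone afterwards, and one that keeps crashing is no longer retried after five attempts.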
If a compute node is lost because of a failure, by default the system attempts to live-migrate running compute instances from the failed compute node to other compute nodes. The actual behavior depends on how you have configured the high availability parameters of the Compute service.
A compute node is considered failing when it has been disconnected from the data network or has been in powered-off state for about 5 minutes. This 5-minute timeout is the threshold for placing the compute node in FAIL state and its agent in EVACUATING state. This condition is required before the reboot migration can start.
- Reboot Migration
Reboot migration implies that all compute instances from the failing compute node are stopped and restarted on another compute node. When migration is complete, the failing compute node's agent indicates that instances have been evacuated. If the compute node eventually reboots successfully, it must go through a cleanup process that removes all stale instance configurations and associated virtual disks. After cleanup, the compute node can host compute instances again.
During the entire reboot migration, the instances remain in the "moving" configuration state. When migration is complete, the instance configuration state changes to "running". Instances that were stopped before the failure are not migrated, because they are not associated with any compute node.
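The 5-minute failure threshold and the reboot migration flow described above can be modeled with a short sketch. It is a simplified illustration under stated assumptions, not the actual service logic: the class and function names, the "OK" labels for healthy states, and the "EVACUATED" agent label are all hypothetical. Only the FAIL and EVACUATING states, the "moving" and "running" configuration states, and the threshold of about 5 minutes come from the text.

```python
from dataclasses import dataclass
from typing import Optional

FAIL_THRESHOLD_SECONDS = 5 * 60  # "about 5 minutes" in the text


@dataclass
class Instance:
    name: str
    state: str                  # "running", "stopped", or "moving"
    node: Optional[str] = None  # stopped instances have no compute node


@dataclass
class ComputeNode:
    name: str
    unreachable_seconds: float = 0.0
    state: str = "OK"        # hypothetical label for a healthy node
    agent_state: str = "OK"  # hypothetical label for a healthy agent


def mark_if_failing(node):
    """Place the node in FAIL and its agent in EVACUATING past the threshold."""
    if node.unreachable_seconds >= FAIL_THRESHOLD_SECONDS:
        node.state = "FAIL"
        node.agent_state = "EVACUATING"
    return node.state == "FAIL"


def reboot_migrate(node, instances, target):
    """Restart the failing node's running instances on `target`.

    Returns the names of the instances that were moved. Nothing happens
    unless the node has reached the FAIL state.
    """
    if not mark_if_failing(node):
        return []
    moved = []
    for inst in instances:
        if inst.node != node.name or inst.state != "running":
            continue  # stopped instances are not tied to a node: skipped
        inst.state = "moving"   # configuration state during the migration
        inst.node = target      # stop here, restart on the other node
        inst.state = "running"  # configuration state after the migration
        moved.append(inst.name)
    node.agent_state = "EVACUATED"  # hypothetical label: evacuation finished
    return moved
```

The cleanup of stale instance configurations and virtual disks that a recovered node must go through is deliberately left out of this sketch.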
- Fault Domain Settings
By default, fault domain preference is not strictly enforced during instance migration: if the selected fault domain has insufficient resources, the Compute service can stop instances and restart them on a compute node in another fault domain. If strict fault domain enforcement is configured in the Compute service, instances that cannot be migrated to another compute node in the selected fault domain must be stopped.
If automatic fault domain resolution is enabled in the Compute service, instances that were migrated to a different fault domain can be migrated back to their selected fault domain.
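The interaction between the selected fault domain, strict enforcement, and automatic resolution can be illustrated with a small decision sketch. The function names, the `strict` and `auto_resolve` parameters, and the idea of expressing capacity as a single number per fault domain are assumptions made for illustration; they are not Compute service settings or APIs. What happens when no fault domain has room is not stated above, so the sketch simply returns `None` in that case.

```python
def choose_fault_domain(selected, free_capacity, needed, strict=False):
    """Pick the fault domain in which to restart an instance.

    `free_capacity` maps fault domain names to free capacity (hypothetical
    units). Returns None when the instance cannot be placed and must stay
    stopped.
    """
    if free_capacity.get(selected, 0) >= needed:
        return selected  # the selected fault domain is always preferred
    if strict:
        return None  # strict enforcement: never leave the selected domain
    for fault_domain, free in free_capacity.items():
        if fault_domain != selected and free >= needed:
            return fault_domain  # fall back to another fault domain
    return None  # assumption: no capacity anywhere means no placement


def should_migrate_back(selected, current, free_capacity, needed, auto_resolve):
    """Decide whether a displaced instance can return to its selected domain."""
    return (
        auto_resolve
        and current != selected
        and free_capacity.get(selected, 0) >= needed
    )
```

For example, with `free_capacity = {"FD1": 0, "FD2": 4}` and an instance that selected `FD1` and needs 2 units, the default call falls back to `FD2`, while `strict=True` yields `None`.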
- Planned Maintenance
For planned maintenance, the administrator must first disable provisioning for the compute node in question and apply a maintenance lock. While the compute node is under a provisioning lock, the administrator can live-migrate all running compute instances to another compute node. Maintenance mode can only be activated when there are no more running instances on the compute node. You can specify the force option to stop any instances that cannot be migrated. While the compute node is in maintenance mode, all compute instance operations on it are disabled, and it cannot be provisioned or deprovisioned.
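The ordering constraints of the planned maintenance procedure can be summarized in a minimal sketch. All names here (`Node`, `enter_maintenance`, `can_migrate`, `MaintenanceError`) are hypothetical, and the maintenance lock is modeled simply as the activation of maintenance mode, which is an assumption. The sketch only captures the rules stated above: lock provisioning first, migrate what can be migrated, optionally force-stop the rest, and activate maintenance mode only when no instances are running.

```python
from dataclasses import dataclass, field


class MaintenanceError(Exception):
    """Raised when maintenance mode cannot be activated."""


@dataclass
class Node:
    name: str
    running: list = field(default_factory=list)  # names of running instances
    provisioning_locked: bool = False
    maintenance_mode: bool = False


def enter_maintenance(node, can_migrate, force=False):
    """Evacuate `node` and activate maintenance mode.

    `can_migrate` is a hypothetical callback that returns True if an
    instance can be live-migrated elsewhere. Returns the names of the
    instances that were force-stopped.
    """
    node.provisioning_locked = True  # step 1: disable provisioning
    stopped, remaining = [], []
    for inst in node.running:
        if can_migrate(inst):
            continue  # live-migrated to another compute node
        if force:
            stopped.append(inst)  # force option: stop what cannot move
        else:
            remaining.append(inst)
    node.running = remaining
    if node.running:
        # Maintenance mode requires that no instances are still running.
        raise MaintenanceError(f"{len(node.running)} instance(s) still running")
    node.maintenance_mode = True
    return stopped
```

Without `force=True`, the call fails as long as at least one instance cannot be migrated, which mirrors the rule that maintenance mode needs an empty compute node.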