Plan High Availability for Compute Instances

Oracle Cloud Infrastructure Compute provides both bare metal and virtual machine (VM) instances, so you can deploy a server of any size you need, from a small VM with a single core to a bare metal server with many cores and a large amount of RAM. These options give you the performance, flexibility, and control to run your most demanding applications and workloads in the cloud.

To plan for high availability of your Compute instances, consider the following key design strategies:
  • Eliminating single points of failure, either by properly leveraging an availability domain's three fault domains or by deploying instances across multiple availability domains.
  • Using floating IP addresses.
  • Ensuring that your design protects both the availability and the integrity of the data on your Compute instances.
This article describes these strategies.

Distribute Instances Across Fault Domains

One of the key principles of designing high availability solutions is to avoid single points of failure. To apply this principle, distribute your instances across multiple fault domains.

In a single-availability-domain deployment, by properly leveraging fault domains, you can increase the availability of applications running on Oracle Cloud Infrastructure. Your application's architecture determines whether you separate or group instances by using fault domains.
  • Scenario 1: Highly Available Application Architecture

    In this scenario, you have a highly available application, for example, two web servers and a clustered database. Group one web server and one database node in one fault domain, and place the other web server and the other database node in a second fault domain. This architecture ensures that the failure of any one fault domain does not cause an outage for your application.

  • Scenario 2: Single Web Server and Database Instance Architecture

    In this scenario, your application architecture is not highly available, for example, you have one web server and one database instance. Place both the web server and the database instance in the same fault domain. Because the application needs both instances to run, splitting them across two fault domains would expose it to a failure in either one. Keeping them together ensures that your application is affected only by the failure of that single fault domain.
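
The two scenarios above can be sketched as a small placement routine. This is only an illustration under stated assumptions: the function names, the instance names, and the fault domain labels are hypothetical, and the code does not call any Oracle Cloud Infrastructure API.

```python
from itertools import cycle

# Labels for the three fault domains of one availability domain (illustrative).
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def spread_across_fault_domains(tiers, fault_domains=FAULT_DOMAINS):
    """Place the instances of each tier round-robin over the fault domains.

    `tiers` maps a tier name (for example "web") to its instance names.
    Every tier starts at the first fault domain, so single-instance tiers
    end up grouped together, as in Scenario 2.
    """
    placement = {}
    for instances in tiers.values():
        for instance, fault_domain in zip(instances, cycle(fault_domains)):
            placement[instance] = fault_domain
    return placement


def survives_any_single_failure(tiers, placement):
    """Return True if every tier keeps an instance when any one fault domain fails."""
    for failed in set(placement.values()):
        for instances in tiers.values():
            if instances and all(placement[i] == failed for i in instances):
                return False
    return True


# Scenario 1: two web servers and a two-node clustered database.
ha_tiers = {"web": ["web-1", "web-2"], "db": ["db-1", "db-2"]}
ha_placement = spread_across_fault_domains(ha_tiers)

# Scenario 2: one web server and one database instance.
single_tiers = {"web": ["web-1"], "db": ["db-1"]}
single_placement = spread_across_fault_domains(single_tiers)
```

With this placement, the Scenario 1 application tolerates the loss of any one fault domain, while the Scenario 2 application depends on exactly one fault domain rather than two.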

Distribute Instances Across Availability Domains

Another approach to high availability is to deploy Compute instances that perform the same tasks across multiple availability domains. This design removes a single point of failure by introducing redundancy.

The following diagram illustrates web server VMs deployed in two availability domains to implement redundancy:

[Illustration dual-ad.png: web server VMs deployed in two availability domains]

The diagram shows multiple availability domains (ADs). For a region that has a single AD, adjust the architecture to distribute your resources across the fault domains within the AD.

Depending on your system or application requirements, you can implement this architectural redundancy in either standby or active mode:
  • In standby mode, when the primary component fails, the standby component takes over. Standby mode is typically used for applications that need to maintain their states.
  • In active mode, no components are designated as primary or standby; all components actively perform the same tasks. When one of the components fails, its tasks are redistributed to the remaining components. Active mode is typically used for stateless applications.
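
The difference between the two modes can be sketched in a few lines. The functions below are hypothetical helpers written for illustration, not part of any Oracle Cloud Infrastructure service; a real deployment would typically rely on a load balancer or cluster manager for this logic.

```python
def route_standby(components, healthy):
    """Standby mode: the first healthy component, in priority order, does all the work.

    `components` is ordered with the primary first, followed by standbys.
    Returns None if no component is healthy.
    """
    for component in components:
        if component in healthy:
            return component
    return None


def route_active(components, healthy, tasks):
    """Active mode: spread tasks round-robin over every healthy component."""
    live = [c for c in components if c in healthy]
    assignment = {c: [] for c in live}
    for index, task in enumerate(tasks):
        if not live:
            break
        assignment[live[index % len(live)]].append(task)
    return assignment
```

In standby mode, only one component serves traffic at a time, which suits applications that must keep their state in one place. In active mode, the failure of a component only changes how the same tasks are shared out.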

Use Floating IP Addresses

Floating IP addresses of Compute instances, either the secondary private IP address or the reserved public IP address, play a key role in high availability architecture design on Oracle Cloud Infrastructure.

A Compute instance can be assigned a secondary private IP address. If the instance fails or becomes unhealthy, you can reassign that secondary private IP address to a standby instance in the same subnet to achieve instance failover.

A reserved public IP address is persistent and exists beyond the lifetime of the Compute instance to which it's currently assigned. In high availability and failover scenarios, you can unassign a reserved public IP address from the primary instance and then reassign it to the standby instance.

You can automate this floating IP address failover by using Linux high availability services, such as Corosync and Pacemaker.
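
The failover decision itself is simple and can be sketched as follows. This is a minimal illustration, not how Corosync or Pacemaker is implemented: `is_healthy` and `reassign_ip` are hypothetical callables that you would supply, for example a health probe and a wrapper around whatever tool you use to move the IP address.

```python
def failover_floating_ip(floating_ip, current_owner, candidates, is_healthy, reassign_ip):
    """Move `floating_ip` to a healthy standby if its current owner is unhealthy.

    `is_healthy(instance)` and `reassign_ip(ip, instance)` are hypothetical
    callbacks supplied by the caller. Returns the instance that owns the
    IP address after the check.
    """
    if is_healthy(current_owner):
        return current_owner  # nothing to do, the primary is still serving
    for candidate in candidates:
        if candidate != current_owner and is_healthy(candidate):
            reassign_ip(floating_ip, candidate)
            return candidate
    raise RuntimeError("no healthy standby instance is available")
```

A cluster manager would run a check like this repeatedly and also handle concerns that the sketch leaves out, such as fencing a failed primary so that two instances never claim the same address.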

Ensure High Availability and Integrity of Your Data

For a high availability architecture, always ensure that your design protects both the availability and the integrity of the data on your Compute instances. To protect data availability, you can either replicate or back up your data to another location.

You can use either synchronous or asynchronous replication to protect your data if your Compute instance fails:
  • The availability domains in a region are interconnected by a high-performance network that supports synchronous replication. If your application needs an instant failover and can’t tolerate data loss, employ synchronous replication. Because of its network performance requirements, synchronous replication is typically used within one region.
  • For applications that need to protect data availability across regions, employ asynchronous replication.
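
The trade-off between the two replication modes can be illustrated with a toy model. The `ReplicatedStore` class below is a hypothetical in-memory sketch written for this explanation; it does not represent any Oracle Cloud Infrastructure replication feature.

```python
class ReplicatedStore:
    """Toy primary/replica pair showing what each replication mode can lose."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self._pending = []  # records written but not yet shipped (asynchronous only)

    def write(self, record):
        """Store a record on the primary and replicate it according to the mode."""
        self.primary.append(record)
        if self.synchronous:
            # The write completes only after the replica also has the record.
            self.replica.append(record)
        else:
            # The write completes immediately; the replica catches up later.
            self._pending.append(record)

    def ship_pending(self):
        """Deliver queued records to the replica (the asynchronous catch-up step)."""
        self.replica.extend(self._pending)
        self._pending.clear()

    def records_lost_on_failover(self):
        """Count records that exist on the primary but not yet on the replica."""
        return len(self.primary) - len(self.replica)


sync_store = ReplicatedStore(synchronous=True)
sync_store.write("order-1")
sync_store.write("order-2")

async_store = ReplicatedStore(synchronous=False)
async_store.write("order-1")
async_store.ship_pending()
async_store.write("order-2")  # not yet shipped when the primary fails
```

In this model, the synchronous store never has a record that its replica lacks, at the cost of waiting for the replica on every write. The asynchronous store can lose whatever has not been shipped yet, which is the price of tolerating the higher latency between regions.
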
Traditional backups are another way to protect your data. For maximum data durability, don’t store your backups in the same availability domain as the original Compute instance. Use Oracle Cloud Infrastructure Object Storage to back up the data of your Compute instance. For Compute instances with local NVMe drives, a protected RAID array is the best way to protect against an NVMe device failure.