Plan High Availability for Compute Instances

Oracle Cloud Infrastructure Compute provides both bare metal and virtual machine (VM) instances which allow you to deploy any size server that you need, from a small VM with a single core to a large VM or bare metal server with many cores and a larger amount of RAM. These options ensure the performance, flexibility, and control to run your most demanding applications and workloads in the cloud.

To plan for high availability of your compute instances, the key design strategies you should consider are:

Eliminating single points of failure by properly leveraging fault domains and availability domains.
Using monitoring, instance pools, and load balancers.
Ensuring that your design protects both the data availability and integrity of your Compute instances.

This article describes these strategies.

Distribute Instances Across Fault Domains

One of the key principles of designing high availability solutions is to avoid single points of failure. To apply this principle, distribute your instances across multiple fault domains.

In a single-availability-domain deployment, by properly leveraging fault domains, you can increase the availability of applications running on Oracle Cloud Infrastructure. Your application's architecture determines whether you separate or group instances by using fault domains.

Scenario 1: Highly Available Application Architecture
In this scenario, you have a highly available application—for example, two web servers and a clustered database. Here, you group one web server and one database node in one fault domain and the other half of each pair in another fault domain. This architecture ensures that a failure of any one fault domain does not result in an outage for your application.
Scenario 2: Single Web Server and Database Instance Architecture
In this scenario, your application architecture is not highly available—for example, you have one web server and one database instance. Here, both the web server and the database instance must be placed in the same fault domain. This architecture ensures that your application is impacted only by the failure of that single fault domain.
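
The two scenarios can be expressed as a small placement rule. The following Python sketch is illustrative only: the function name, the tier and instance names, and the fault domain labels are assumptions made for this example, and it does not call any Oracle Cloud Infrastructure API. It spreads the instances of each tier across fault domains when every tier is redundant, and groups everything in one fault domain otherwise. The source text does not cover a mix of redundant and single-instance tiers, so grouping in that case is also an assumption.

```python
def place_in_fault_domains(tiers, fault_domains):
    """Map each instance name to a fault domain label.

    tiers: dict of tier name -> list of instance names (hypothetical names).
    fault_domains: list of fault domain labels in one availability domain.
    """
    if not fault_domains:
        raise ValueError("at least one fault domain is required")

    placement = {}
    every_tier_redundant = all(len(instances) > 1 for instances in tiers.values())

    for instances in tiers.values():
        for index, instance in enumerate(instances):
            if every_tier_redundant:
                # Scenario 1: the n-th member of each tier shares a fault domain,
                # so each fault domain holds one member of every tier.
                placement[instance] = fault_domains[index % len(fault_domains)]
            else:
                # Scenario 2: group all instances in a single fault domain
                # (assumption: the first one in the list).
                placement[instance] = fault_domains[0]
    return placement


FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]

highly_available = place_in_fault_domains(
    {"web": ["web-1", "web-2"], "db": ["db-1", "db-2"]}, FAULT_DOMAINS
)
single_instance = place_in_fault_domains(
    {"web": ["web-1"], "db": ["db-1"]}, FAULT_DOMAINS
)
```

With the redundant input, web-1 and db-1 land in one fault domain and web-2 and db-2 in another, which matches Scenario 1. With the single-instance input, both instances share one fault domain, which matches Scenario 2.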

Distribute Instances Across Availability Domains

Another approach to high availability is to deploy Compute instances that perform the same tasks across multiple availability domains. This design removes a single point of failure by introducing redundancy across data centers.

In a multi-availability-domain deployment, you separate groups of instances by availability domain. This protects your application from data center level failures, such as power outages, physical infrastructure failures, or planned maintenance events.

When instances are distributed across availability domains or fault domains, a Load Balancer is often used to improve resource usage, facilitate scaling, and ensure high availability. It supports routing incoming requests to various backend sets, or groups of compute instances, balancing the network traffic among them.

The following diagram illustrates web server VMs deployed in two availability domains to implement redundancy, along with a load balancer:

Description of the illustration public-lb.png

Note:

The architecture shows multiple availability domains (ADs). For a region that has a single AD, adjust the architecture to distribute your resources across the fault domains within the AD.
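
The adjustment described in this note can be sketched as a simple selection rule. The Python example below is a minimal illustration, not an Oracle Cloud Infrastructure API call: the function name and the domain labels are assumptions. It rotates instances across availability domains when a region has more than one, and across the fault domains of the single availability domain otherwise. Returning None for the fault domain in the multi-AD case is a simplification of this sketch and means "no fault domain preference".

```python
def spread_targets(availability_domains, fault_domains, count):
    """Return `count` (availability_domain, fault_domain) placement pairs."""
    if not availability_domains:
        raise ValueError("at least one availability domain is required")

    if len(availability_domains) > 1:
        # Multi-AD region: rotate across availability domains.
        return [
            (availability_domains[i % len(availability_domains)], None)
            for i in range(count)
        ]

    if not fault_domains:
        raise ValueError("a single-AD region needs at least one fault domain")

    # Single-AD region: rotate across the fault domains within that AD.
    only_ad = availability_domains[0]
    return [(only_ad, fault_domains[i % len(fault_domains)]) for i in range(count)]


multi_ad = spread_targets(["AD-1", "AD-2"], ["FAULT-DOMAIN-1"], 4)
single_ad = spread_targets(
    ["AD-1"], ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"], 4
)
```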

Depending on your system or application requirements, you can implement this architectural redundancy in either standby or active mode:

In standby mode, when the primary component fails, the standby component takes over. Standby mode is typically used for applications that need to maintain their states.
In active mode, no components are designated as primary or standby; all components are actively participating in performing the same tasks. When one of the components fails, the related tasks are simply distributed to another component. Active mode is typically used for stateless applications.
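
The difference between the two modes can be illustrated with a short routing sketch in Python. The function names, component names, and the idea of passing in a set of healthy components are assumptions made for this example. In practice, health information would typically come from a load balancer or monitoring service rather than from a hand-built set.

```python
def route_standby(components, healthy):
    """Standby mode: all work goes to the first healthy component.

    components: names in priority order, primary first (hypothetical names).
    healthy: set of names that currently pass health checks.
    """
    for name in components:
        if name in healthy:
            return name
    raise RuntimeError("no healthy component available")


def route_active(components, healthy, request_number):
    """Active mode: requests rotate across every healthy component."""
    available = [name for name in components if name in healthy]
    if not available:
        raise RuntimeError("no healthy component available")
    return available[request_number % len(available)]


# Standby mode: the standby takes over only after the primary fails.
before_failure = route_standby(["primary", "standby"], {"primary", "standby"})
after_failure = route_standby(["primary", "standby"], {"standby"})

# Active mode: when "b" fails, its share is redistributed to "a" and "c".
all_up = [route_active(["a", "b", "c"], {"a", "b", "c"}, n) for n in range(3)]
b_down = [route_active(["a", "b", "c"], {"a", "c"}, n) for n in range(3)]
```

In the standby case only one component does work at a time, which makes it easier to keep application state in one place. In the active case any component can serve any request, which is why the mode suits stateless applications.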

Ensure High Availability and Integrity of Your Data

For a high availability architecture, always ensure that your design protects both the data availability and integrity of your Compute instances. To protect the data availability of your Compute instance, you can either replicate or back up your data to another location.

Block Volume Summary
The Oracle Cloud Infrastructure Block Volume service lets you dynamically provision and manage block storage volumes. You can create, attach, connect, and move volumes, as well as change volume performance, as needed, to meet your storage, performance, and application requirements. After you attach and connect a volume to an instance, you can use the volume like a regular hard drive. You can also disconnect a volume and attach it to another instance without the loss of data.
Volume Durability
The Oracle Cloud Infrastructure Block Volume service offers a high level of data durability compared to standard, attached drives. All volumes are automatically replicated for you, helping to protect against data loss. Multiple copies of data are stored redundantly across multiple storage servers with built-in repair mechanisms. As a service level objective, the Block Volume service is designed to provide 99.99 percent annual durability for block volumes and boot volumes. However, we recommend that you make regular backups to protect against the failure of an availability domain.
Volume Replication
The Block Volume service lets you perform ongoing, automatic, asynchronous replication of block volumes and boot volumes to other regions or to other availability domains within the same region. Cross-availability-domain replication within the same region is supported only for regions with more than one availability domain. To determine which regions contain more than one availability domain, see the Availability Domains field in the table listing the regions in About Regions and Availability Domains. This feature supports disaster recovery, migration, and business expansion scenarios, without requiring volume backups. For more information, see "Replicating a Volume", which you can access from the Explore More topic, elsewhere in this playbook.

About Block Volume Backups

The backups feature of the Oracle Cloud Infrastructure Block Volume service lets you take a point-in-time snapshot of the data on a block volume. You can make a backup of a volume when it is attached to an instance or while it is detached. These backups can then be restored to new volumes either immediately after a backup or at a later time that you choose.

Backups are encrypted and stored in OCI Object Storage, and can be restored as new volumes to any availability domain within the same region where they are stored. This capability provides you with a spare copy of a volume and lets you complete disaster recovery within the same region.

You can initiate a backup in either of two ways: by manually starting the backup or by assigning a policy that defines a set backup schedule. For more information on block volume backups, see Overview of Block Volume Backups, which you can access from the Explore More topic elsewhere in this playbook. https://docs.oracle.com/en-us/iaas/Content/Block/Concepts/blockvolumebackups.htm

Use Synchronous or Asynchronous Replication

You can use either synchronous or asynchronous replication to protect your data if your Compute instance fails:

The availability domains in a region are interconnected by a high-performance network that supports synchronous replication. If your application needs an instant failover and can’t tolerate data loss, employ synchronous replication. Because of its network performance requirements, synchronous replication is typically used within one region.
For applications that need the protection of data availability across regions, employ asynchronous replication.
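
The guidance above can be summarized as a decision rule. The following Python sketch is an illustration under stated assumptions: the function and parameter names are hypothetical, and the fallback to asynchronous replication for applications that stay within one region and can tolerate some data loss is a choice made for this example, not a recommendation from the text.

```python
def choose_replication(tolerates_data_loss, crosses_regions):
    """Pick a replication mode from two application requirements."""
    if crosses_regions:
        # Protection across regions: asynchronous replication.
        return "asynchronous"
    if not tolerates_data_loss:
        # Instant failover with no data loss, within one region: synchronous.
        return "synchronous"
    # Assumption of this sketch: default to asynchronous when some loss is acceptable.
    return "asynchronous"


in_region_no_loss = choose_replication(tolerates_data_loss=False, crosses_regions=False)
cross_region = choose_replication(tolerates_data_loss=True, crosses_regions=True)
```

An application that needs both zero data loss and cross-region protection falls outside this simple rule, because synchronous replication is typically limited to one region. Such a case calls for a separate design decision.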

Traditional backups are another way to protect your data. For maximum data durability, don’t store your backups in the same availability domain as their original Compute instance. You should use Oracle Cloud Infrastructure Object Storage to back up the data of your Compute instance. For Compute instances with local NVMe drives, a protected RAID array is the best way to protect against an NVMe device failure.

For more information, see "Protecting Data on NVMe Devices", which you can access from the Explore More topic elsewhere in this playbook.