Plan High Availability for Compute Instances
Oracle Cloud Infrastructure Compute provides both bare metal and virtual machine (VM) instances which allow you to deploy any size server that you need, from a small VM with a single core to a large VM or bare metal server with many cores and a larger amount of RAM. These options ensure the performance, flexibility, and control to run your most demanding applications and workloads in the cloud.
Design high availability for your Compute instances by:
- Eliminating single points of failure by properly leveraging fault domains and availability domains.
- Using monitoring, instance pools, and load balancers.
- Ensuring that your design protects both the availability and the integrity of the data on your Compute instances.
Distribute Instances Across Fault Domains
One of the key principles of designing high availability solutions is to avoid single points of failure. To apply this principle, distribute your instances across multiple fault domains. The right placement depends on your application architecture, as the following scenarios show:
- Scenario 1: Highly Available Application Architecture
In this scenario, you have a highly available application—for example, two web servers and a clustered database. Here, you group one web server and one database node in one fault domain and the other half of each pair in another fault domain. This architecture ensures that a failure of any one fault domain does not result in an outage for your application.
- Scenario 2: Single Web Server and Database Instance Architecture
In this scenario, your application architecture is not highly available—for example, you have one web server and one database instance. Here, both the web server and the database instance must be placed in the same fault domain. This architecture ensures that your application is impacted only by the failure of that single fault domain.
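The two scenarios above can be sketched as a small placement routine. This is only an illustration, not an Oracle Cloud Infrastructure API: the function names `place_by_role` and `roles_lost_if_fd_fails`, the instance names, and the fault domain labels are assumptions made for the example. Instances that share a role are rotated across fault domains, so a redundant pair is split while single instances of different roles land together.

```python
from collections import defaultdict
from itertools import cycle

# Illustrative fault domain labels (assumption); use the names from your tenancy.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def place_by_role(instances, fault_domains=FAULT_DOMAINS):
    """Assign each (name, role) pair a fault domain, rotating per role.

    Instances with the same role are spread round-robin, so no single
    fault domain holds every member of a redundant group.
    """
    next_fd = defaultdict(lambda: cycle(fault_domains))
    return {name: next(next_fd[role]) for name, role in instances}


def roles_lost_if_fd_fails(instances, placement, failed_fd):
    """Return the roles left with no surviving instance if failed_fd goes down."""
    survivors = {role for name, role in instances if placement[name] != failed_fd}
    return {role for _, role in instances} - survivors


# Scenario 1: two web servers and a two-node database cluster.
ha_app = [("web-1", "web"), ("web-2", "web"), ("db-1", "db"), ("db-2", "db")]
ha_placement = place_by_role(ha_app)

# Scenario 2: one web server and one database instance.
single_app = [("web", "web"), ("db", "db")]
single_placement = place_by_role(single_app)
```

In this sketch, `roles_lost_if_fd_fails(ha_app, ha_placement, fd)` is empty for every fault domain, while the single-instance application shares one fault domain and is affected only when that fault domain fails.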
Distribute Instances Across Availability Domains
Another approach to high availability is to deploy Compute instances that perform the same tasks across multiple availability domains. This design removes a single point of failure by introducing redundancy across data centers.
In a multi-availability-domain deployment, you separate groups of instances by availability domain. This protects your application from data center level failures, such as power outages, physical infrastructure failures, or planned maintenance events.
When instances are distributed across availability domains or fault domains, a load balancer is often used to improve resource usage, facilitate scaling, and ensure high availability. The load balancer routes incoming requests to backend sets, which are groups of Compute instances, and balances the network traffic among them.
The following diagram illustrates web server VMs deployed in two availability domains to implement redundancy, along with a load balancer:
Note:
The architecture shows multiple availability domains (ADs). For a region that has a single AD, adjust the architecture to distribute your resources across the fault domains within the AD.
You can implement this redundancy in either standby mode or active mode:
- In standby mode, when the primary component fails, the standby component takes over. Standby mode is typically used for applications that need to maintain their states.
- In active mode, no components are designated as primary or standby; all components are actively participating in performing the same tasks. When one of the components fails, the related tasks are simply distributed to another component. Active mode is typically used for stateless applications.
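The difference between the two modes can be sketched as two routing strategies. This is a minimal illustration and does not represent the Load Balancer service API: `Backend`, `route_standby`, and `ActiveRouter` are hypothetical names, and health is modeled as a simple flag rather than a real health check.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Backend:
    name: str
    availability_domain: str
    healthy: bool = True


def route_standby(backends):
    """Standby mode: use the first healthy backend in priority order.

    backends[0] is treated as the primary; later entries take over only
    when every backend ahead of them is unhealthy.
    """
    for backend in backends:
        if backend.healthy:
            return backend
    raise RuntimeError("no healthy backend available")


class ActiveRouter:
    """Active mode: rotate requests across all healthy backends."""

    def __init__(self, backends):
        self.backends = backends
        self._counter = count()

    def route(self):
        healthy = [b for b in self.backends if b.healthy]
        if not healthy:
            raise RuntimeError("no healthy backend available")
        # Round-robin over whichever backends are currently healthy.
        return healthy[next(self._counter) % len(healthy)]


# Hypothetical web servers, one per availability domain.
backends = [Backend("web-ad1", "AD-1"), Backend("web-ad2", "AD-2")]
router = ActiveRouter(backends)
```

With both backends healthy, `route_standby(backends)` always returns the primary, whereas `router.route()` alternates between the two. Marking a backend as unhealthy moves its share of the traffic to the remaining backend in either mode.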
Ensure High Availability and Integrity of Your Data
For a high availability architecture, always ensure that your design protects both the availability and the integrity of the data on your Compute instances. To protect the data availability of your Compute instance, you can either replicate or back up your data to another location.
- Block Volume Summary
The Oracle Cloud Infrastructure Block Volume service lets you dynamically provision and manage block storage volumes. You can create, attach, connect, and move volumes, as well as change volume performance, as needed, to meet your storage, performance, and application requirements. After you attach and connect a volume to an instance, you can use the volume like a regular hard drive. You can also disconnect a volume and attach it to another instance without the loss of data.
- Volume Durability
The Oracle Cloud Infrastructure Block Volume service offers a high level of data durability compared to standard, attached drives. All volumes are automatically replicated for you, helping to protect against data loss. Multiple copies of data are stored redundantly across multiple storage servers with built-in repair mechanisms. As a service level objective, the Block Volume service is designed to provide 99.99 percent annual durability for block volumes and boot volumes. However, we recommend that you make regular backups to protect against the failure of an availability domain.
- Volume Replication
The Block Volume service lets you perform ongoing, automatic, asynchronous replication of block volumes and boot volumes to other regions or to other availability domains within the same region. Cross-availability-domain replication within the same region is supported only for regions with more than one availability domain. To determine which regions contain more than one availability domain, see the Availability Domains field in the table listing the regions in About Regions and Availability Domains. This feature supports disaster recovery, migration, and business expansion scenarios, without requiring volume backups. For more information, see "Replicating a Volume", which you can access from the Explore More topic elsewhere in this playbook.
About Block Volume Backups
Backups are encrypted and stored in OCI Object Storage, and can be restored as new volumes to any availability domain within the region where they are stored. This capability provides you with a spare copy of a volume and lets you complete disaster recovery within the same region.
You can initiate a backup in either of two ways: by manually starting the backup or by assigning a policy that defines a set backup schedule. For more information on block volume backups, see Overview of Block Volume Backups, which you can access from the Explore More topic elsewhere in this playbook: https://docs.oracle.com/en-us/iaas/Content/Block/Concepts/blockvolumebackups.htm
Use Synchronous or Asynchronous Replication
You can use either synchronous or asynchronous replication to protect your data if your Compute instance fails:
- The availability domains in a region are interconnected by a high-performance network that supports synchronous replication. If your application needs an instant failover and can’t tolerate data loss, employ synchronous replication. Because of its network performance requirements, synchronous replication is typically used within one region.
- For applications that need the protection of data availability across regions, employ asynchronous replication.
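The trade-off between the two replication modes can be illustrated with a toy model. This is a conceptual sketch only, not how the Block Volume service is implemented: `ReplicatedVolume` and its methods are hypothetical names. A synchronous write is acknowledged only after the replica has it, while an asynchronous write is acknowledged immediately and shipped later, so a failure can lose writes that have not yet been shipped.

```python
class ReplicatedVolume:
    """Toy model of a primary volume with a single replica."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.replica = []
        self._pending = []  # writes acknowledged but not yet on the replica

    def write(self, block):
        self.primary.append(block)
        if self.synchronous:
            # The replica stores the block before the write is acknowledged.
            self.replica.append(block)
        else:
            # The write is acknowledged now and shipped to the replica later.
            self._pending.append(block)

    def ship_pending(self):
        """Asynchronous mode: copy queued writes to the replica."""
        self.replica.extend(self._pending)
        self._pending.clear()

    def fail_over(self):
        """Simulate losing the primary; return acknowledged writes that are lost."""
        lost = list(self._pending)
        self._pending.clear()
        self.primary = list(self.replica)
        return lost


sync_volume = ReplicatedVolume(synchronous=True)
sync_volume.write("a")
sync_volume.write("b")
sync_lost = sync_volume.fail_over()

async_volume = ReplicatedVolume(synchronous=False)
async_volume.write("a")
async_volume.ship_pending()
async_volume.write("b")  # not shipped before the failure
async_lost = async_volume.fail_over()
```

In this model, `sync_lost` is empty, while `async_lost` contains the one write that was acknowledged but never reached the replica. The model leaves out the cost of synchronous replication: each write waits for the replica, which is why it depends on a high-performance network.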
Traditional backups are another way to protect your data. For maximum data durability, don’t store your backups in the same availability domain as their original Compute instance. You should use Oracle Cloud Infrastructure Object Storage to back up the data of your Compute instance. For Compute instances with local NVMe drives, a protected RAID array is the best way to protect against an NVMe device failure.
For more information, see "Protecting Data on NVMe Devices", which you can access from the Explore More topic elsewhere in this playbook.