High Availability

High availability (HA) systems are designed to ensure that they have the maximum potential for uptime and accessibility.

Enterprise applications are critical to everyday business operations, so users expect them to be available at all times. Although it's impossible to rule out downtime entirely, you can minimize its impact by designing your applications for HA. To achieve HA, eliminate single points of failure so that the application remains running and available even if individual components fail. Oracle Cloud Infrastructure (OCI) provides HA capabilities and best practices for reliable, resilient cloud topologies that let you build highly available enterprise applications.

Because multi-tier or three-tier architectures are common in traditional on-premises enterprise applications, let's use an example three-tier enterprise application to show how you can use OCI HA capabilities and the reliable and resilient cloud topology best practices to make that application HA. The following diagram shows an example enterprise application in a single region HA configuration.

Example enterprise application in a single region high availability configuration.

This information does not cover connectivity from on-premises to OCI or disaster recovery (DR) aspects of the infrastructure.

HA Concepts

When your infrastructure is configured to provide nearly full-time availability, it's an HA system.

To design a high availability architecture, consider the following key elements:

  • Redundancy: Does each resource have at least one similar resource ready to step in and take over? Notice that in each tier shown in the diagram, the resources always have a primary and a standby, and resources are in different availability domains and fault domains to avoid single points of failure (SPOF).
  • Monitoring: Are primary resources functioning as expected? If not, at what point does the backup resource take over for the primary?
  • Failover: When the criteria are met to trigger a switch from primary to standby, is the standby ready?
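
As a rough illustration of how monitoring and failover fit together, the following Python sketch counts consecutive failed health checks on a primary resource and switches to the standby once a threshold is reached and the standby reports ready. The class name, the threshold of three, and the "primary"/"standby" labels are hypothetical and chosen only for illustration; OCI services implement their own health checks and failover logic.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks health-check results and decides when to fail over."""

    failure_threshold: int = 3   # hypothetical: consecutive failures that trigger failover
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool, standby_ready: bool) -> str:
        """Record one health check and return the resource that should serve traffic."""
        if self.active != "primary":
            return self.active   # already failed over; failback is out of scope here
        if primary_healthy:
            self.consecutive_failures = 0
            return self.active
        self.consecutive_failures += 1
        # Switch only when the trigger criteria are met AND the standby is ready.
        if self.consecutive_failures >= self.failure_threshold and standby_ready:
            self.active = "standby"
        return self.active


monitor = FailoverMonitor(failure_threshold=3)
checks = [True, False, False, False]   # one healthy check, then three failures
history = [monitor.record(ok, standby_ready=True) for ok in checks]
print(history)
```

Requiring several consecutive failures, rather than reacting to the first one, is a common way to avoid failing over on a transient error. The right threshold depends on how much downtime the application can tolerate.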

Achieving HA requires that a system accounts for all of these elements. Although high availability can be achieved at many different levels (including the application level and the cloud infrastructure level), this section focuses on the cloud infrastructure level. For more information, see Learn about architecting a highly available cloud topology.

Choosing an HA Approach

Some applications are more critical than others. Use the following decision tree to decide which OCI HA capabilities to use when deploying multi-tier enterprise applications on OCI.

Decision tree to decide which OCI high availability capabilities to use when deploying multi-tier enterprise applications.

For our example enterprise application, we need HA and to be able to survive an availability domain outage. In addition, we need to be able to survive a regional outage but can handle some downtime if a region is affected. For these reasons we chose an active/passive deployment in multiple regions. The passive deployment aspects are covered in Disaster Recovery.

Measuring HA

High availability is the ability of a system to meet a continuous level of operational performance, or uptime, for a given period of time.

Availability is usually expressed as a percentage of uptime in a year, often described in terms of "nines." The following table shows availability levels and the associated downtime of each level.

| Availability % | Availability (Nines) | Downtime per Year | Downtime per Month | Downtime per Week | Downtime per Day |
|---|---|---|---|---|---|
| 90% | One nine | 36.53 days | 73.05 hours | 16.80 hours | 2.40 hours |
| 99% | Two nines | 3.65 days | 7.31 hours | 1.68 hours | 14.40 minutes |
| 99.9% | Three nines | 8.77 hours | 43.83 minutes | 10.08 minutes | 1.44 minutes |
| 99.99% | Four nines | 52.60 minutes | 4.38 minutes | 1.01 minutes | 8.64 seconds |
| 99.999% | Five nines | 5.26 minutes | 26.30 seconds | 6.05 seconds | 864.00 milliseconds |
| 99.9999% | Six nines | 31.56 seconds | 2.63 seconds | 604.80 milliseconds | 86.40 milliseconds |
| 99.99999% | Seven nines | 3.16 seconds | 262.98 milliseconds | 60.48 milliseconds | 8.64 milliseconds |
| 99.999999% | Eight nines | 315.58 milliseconds | 26.30 milliseconds | 6.05 milliseconds | 864.00 microseconds |
| 99.9999999% | Nine nines | 31.56 milliseconds | 2.63 milliseconds | 604.80 microseconds | 86.40 microseconds |
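
The downtime figures in the table can be reproduced with a short calculation. The following Python sketch assumes a year of 365.25 days and a month of one twelfth of that year, which is consistent with the table values. The function and constant names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Period lengths in seconds. The 365.25-day year is an assumption
# that matches the figures in the table.
PERIODS = {
    "year": 365.25 * SECONDS_PER_DAY,
    "month": 365.25 * SECONDS_PER_DAY / 12,
    "week": 7 * SECONDS_PER_DAY,
    "day": SECONDS_PER_DAY,
}


def allowed_downtime_seconds(availability_percent: float, period: str) -> float:
    """Return the maximum downtime, in seconds, for an availability level over a period."""
    return (1 - availability_percent / 100) * PERIODS[period]


for pct in (99.0, 99.9, 99.99):
    hours_per_year = allowed_downtime_seconds(pct, "year") / 3600
    print(f"{pct}% -> {hours_per_year:.2f} hours of downtime per year")
```

For example, `allowed_downtime_seconds(99.9, "year") / 3600` gives roughly 8.77 hours, the "three nines" entry in the table.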

Each Oracle Cloud Infrastructure service usually has a service-level agreement (SLA) that defines the expected availability of that service. Most cloud solutions require that you use a combination of services to achieve the desired architecture for your cloud deployment. When services are used in combination, the overall system availability is dependent on the availability of each of the subsystems. The overall SLA for a system with multiple components is called a composite SLA.

To calculate the composite SLA of a system or application, consider all the subsystems and how those systems are configured. For example, consider a scenario where an application is dependent on two systems, System A and System B. Each system has a 99.9% availability. The systems are serially dependent, as shown in the following image.

Diagram of an example system with serially dependent subsystems.

If either System A or System B is unavailable, then the whole system is unavailable. For this type of system configuration, you can calculate the composite SLA by multiplying the availability of the two systems: 99.9% × 99.9% ≈ 99.8%. Because of the serial dependency between the two systems, the resulting composite SLA of about 99.8% is lower than the individual SLA of either system.
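
The multiplication generalizes to any number of serially dependent subsystems. The following Python sketch is a minimal illustration; the function name is hypothetical, and it covers only the serial case described here.

```python
from math import prod


def composite_availability(availabilities_percent):
    """Composite availability (in percent) of serially dependent subsystems.

    Every subsystem must be up for the whole system to be up, so the
    individual availabilities are multiplied together.
    """
    return prod(a / 100 for a in availabilities_percent) * 100


# System A and System B from the example, each with 99.9% availability.
two_systems = composite_availability([99.9, 99.9])
print(f"{two_systems:.4f}%")

# Adding a third 99.9% dependency lowers the composite value further.
three_systems = composite_availability([99.9, 99.9, 99.9])
print(f"{three_systems:.4f}%")
```

Each additional serial dependency lowers the composite availability, which is why removing single points of failure matters as much as the SLA of any one service.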

HA Design Considerations

Oracle Cloud Infrastructure provides the building blocks that let you enable HA for your infrastructure.

The example enterprise application uses services within the OCI concepts of regions, availability domains, and fault domains. The use of multiple availability domains, and multiple fault domains within each of those availability domains, increases redundancy and eliminates SPOF. For background information about regions and a list of resources that are available across regions, within a single region, or within a single availability domain, see Regions and Availability Domains.

We recommend that you review the relevant OCI product resilience information and then, based on the OCI platform products you choose, adjust the architecture to close any gaps between the product capabilities and your HA requirements.

Your home region is where Oracle creates your tenancy and this is where your organization's Identity and Access Management (IAM) resources are defined. Depending on your business requirements, you can subscribe to other regions and IAM automatically propagates updates to all regions in your tenancy. For more information, see Managing Regions.

Networking

After creating the network foundation of virtual cloud networks (VCNs) and subnets, to provide high availability you need to use the Load Balancing service to distribute traffic. When a load balancer is deployed, it uses an HA configuration as shown in the example architecture diagram. For more information, see Plan High Availability for Network Resources.

Compute

To eliminate SPOF, create multiple compute instances that are distributed across fault domains in each of the availability domains. Place compute instances behind a load balancer to distribute the traffic and achieve HA as shown in the example architecture. For more information, see Overview of the Compute Service, Best Practices for Your Compute Instances, and Plan High Availability for Compute Instances.
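
One simple way to plan such a distribution is to assign instances round-robin across availability domains first and fault domains second. The following Python sketch only computes a placement plan and calls no OCI API. The instance names and the "AD-1"/"AD-2" labels are placeholders, and the three fault domain labels are used purely for illustration.

```python
def spread_instances(instance_names, availability_domains, fault_domains):
    """Assign each instance an (availability domain, fault domain) slot, round-robin.

    Slots alternate availability domains first, so that even a small number
    of instances lands in more than one availability domain.
    """
    slots = [(ad, fd) for fd in fault_domains for ad in availability_domains]
    return {
        name: slots[i % len(slots)]
        for i, name in enumerate(instance_names)
    }


# Placeholder labels for illustration only.
placement = spread_instances(
    ["app-1", "app-2", "app-3", "app-4"],
    availability_domains=["AD-1", "AD-2"],
    fault_domains=["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"],
)
for name, (ad, fd) in placement.items():
    print(name, ad, fd)
```

With this ordering, no two of the four instances share the same availability domain and fault domain pair, so the loss of a single fault domain or a single availability domain leaves at least part of the tier running.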

Storage

OCI provides a set of storage services (Block Volume, File Storage, and Object Storage) that you can configure to meet the requirements of a high availability architecture.

Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage is a regional service and is available across all availability domains within a region. Data is stored redundantly across multiple storage servers and across multiple availability domains to ensure high availability. Object Storage also includes automatic self-healing and data integrity monitoring to further enhance its durability and availability.

File Storage provides a durable, scalable, and secure enterprise-grade network file system. It uses a resilient architecture that replicates data five times across different fault domains, ensuring high availability and durability. File Storage can scale automatically to accommodate growth of up to 8 exabytes of data. You can use file system snapshots and clones to protect data from accidental deletion and to make copies of data instantly. Snapshot life cycles can be managed automatically by using the policy-based snapshot feature.

Block volumes achieve durability and high availability by storing multiple copies of data redundantly across storage servers with built-in repair mechanisms. Block volumes can be attached to one or many virtual machines (VMs), and they persist beyond the life span of those virtual machines. Automated backups to Object Storage and volume cloning further enhance high availability.

For steps to create storage resources, see Creating a Volume, Creating File Systems, and Managing Buckets. For best practices, see Plan High Availability for Storage.

Database

OCI Oracle databases come in multiple deployment models or flavors. Each model offers an increasing set of HA capabilities.

Regardless of the database system used, we recommend that you refer to Maximum Availability Architecture (MAA), which is a set of best practices developed by Oracle engineers over many years for the integrated use of Oracle high availability, data protection, and disaster recovery technologies.

OCI Base Database Service

OCI Base Database Service lets you have full control over your data while leveraging the capabilities of Oracle Database and OCI. For the list of supported Database editions and the underlying compute shapes on which they can be deployed, see OCI Base Database Service documentation. The HA features mentioned apply to all the database versions or the underlying compute shapes.

The Enterprise Edition Extreme Performance edition allows for a two-node Oracle Real Application Clusters (RAC) database system with nodes spanning different fault domains within the same availability domain. This provides high availability in the following scenarios:

  • Node failure protection
  • Zero downtime software maintenance
  • Elastic changes (CPU, memory, and storage) with no downtime
  • (Almost) Transparent unplanned maintenance

If HA is required across availability domains, you might consider a passive standby RAC-enabled database mirroring the primary RAC database system, with data replicated by way of Oracle Data Guard. Failover to the passive standby could be manual with a small downtime.

Note: OCI Base Database supports a maximum of two RAC nodes. For Oracle Database versions or for more than two RAC nodes, consider OCI Exadata Database on Dedicated Infrastructure (ExaDB-D).

Exadata Database on Dedicated Infrastructure (ExaDB-D)

Exadata provides built-in high availability capabilities. All the existing best practices with your on-premises Exadata are applicable. Concepts described for the OCI Base Database, such as RAC and Data Guard (for standby database), are applicable to Exadata Database on Dedicated Infrastructure (ExaDB-D), with the following additional attributes:

  • Support for more than two RAC nodes, unlike the Base Database system, which is limited to two
  • Exadata scalability, performance, and availability
  • Exadata agility with changing number of VMs, storage, and compute resources
  • Data protection and Exadata QoS for database operations

Exadata has instant failure detection that can detect database node, storage server, and network failures in less than 2 seconds, and resume application and database service uptime and performance.

We recommend the following configurations to ensure continuous availability for your applications.

  • Use Oracle Clusterware-managed database services to connect your application. For Oracle Data Guard environments, use role-based services.
  • Use the recommended connection string with built-in timeouts, retries, and delays, so that incoming connections don't see errors during outages.
  • Configure your connections with Fast Application Notification.
  • Leverage Application Continuity or Transparent Application Continuity to replay in-flight uncommitted transactions transparently after failures.
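
To illustrate what a connection string with built-in timeouts, retries, and delays can look like, the following Python sketch assembles an Oracle Net connect descriptor. CONNECT_TIMEOUT, RETRY_COUNT, RETRY_DELAY, and TRANSPORT_CONNECT_TIMEOUT are Oracle Net descriptor parameters, but the default values, host name, and service name below are placeholders chosen for illustration, and the helper function itself is hypothetical. Check the Oracle documentation for the values recommended for your environment.

```python
def build_connect_descriptor(
    host: str,
    port: int,
    service_name: str,
    connect_timeout: int = 90,           # placeholder value, in seconds
    retry_count: int = 50,               # placeholder number of connection retries
    retry_delay: int = 3,                # placeholder delay between retries, in seconds
    transport_connect_timeout: int = 3,  # placeholder TCP connect timeout, in seconds
) -> str:
    """Build a connect descriptor string that includes timeout and retry settings."""
    return (
        "(DESCRIPTION="
        f"(CONNECT_TIMEOUT={connect_timeout})"
        f"(RETRY_COUNT={retry_count})"
        f"(RETRY_DELAY={retry_delay})"
        f"(TRANSPORT_CONNECT_TIMEOUT={transport_connect_timeout})"
        f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name}))"
        ")"
    )


# Hypothetical host and service names.
descriptor = build_connect_descriptor("db-scan.example.com", 1521, "app_service")
print(descriptor)
```

With retries and delays in the descriptor, a connection attempt made during a brief outage or role transition can be retried by the client instead of failing immediately.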

Autonomous Database

By default, Oracle Autonomous Database (ADB) is highly available, incorporating a multi-node configuration to protect against localized hardware failures.

Each ADB application service resides in at least one Oracle Real Application Clusters (Oracle RAC) instance, with the option to fail over to another available Oracle RAC instance for unplanned outages or planned maintenance activities, enabling zero or near-zero downtime.

Major database upgrades are automated. In addition, downtime for Oracle Autonomous Database Serverless (ADB-S) is minimal.

The monthly uptime service-level agreement (SLA) is 99.95% (a maximum of about 22 minutes of downtime per month).

ADB-S allows for one local standby (across ADs, or within the AD for single-AD regions) and one additional remote standby.

Autonomous Data Guard adds one symmetric standby database with Oracle Data Guard to an Exadata rack locally (across ADs or within ADs for Single AD regions) with an additional one in another region. The primary and standby database systems are configured symmetrically to ensure that performance service levels are maintained after Data Guard role transitions.

The best practices for maintaining application uptimes are described here.

Monitoring

Monitoring enables you to actively and passively monitor your cloud resources for improved availability and consistent service levels. For an example, see End-to-End Monitoring of applications running on Oracle Cloud Infrastructure.

Explore More