Disaster Recovery

A well-architected disaster recovery (DR) plan enables you to recover quickly from disasters and continue to provide services to your users.

DR is the process of preparing for and recovering from a disaster. A disaster can be any event that puts your applications at risk, from network outages to equipment and application failures to natural disasters. It's almost impossible to predict when you will need disaster recovery, just like you can't predict when you'll get in a car accident. If you can't control when a disaster strikes, the next best thing is to be able to control the recovery process.

A well-designed DR plan lets you recover quickly from disasters and provide business continuity. As your organization moves workloads to the cloud, you need to translate your understanding about how to build resilient on-premises systems to the cloud. Oracle Cloud Infrastructure (OCI) provides highly available, secure, and scalable infrastructure and services that enable you to recover your cloud workloads quickly, reliably, and securely.

Because multi-tier architectures, such as three-tier architectures, are common in traditional on-premises enterprise applications, this section uses an example three-tier enterprise application. The example shows how you can make the application more resilient to disasters by using OCI DR capabilities and the best practices for a reliable and resilient cloud topology. The following diagram shows the example enterprise application in a warm standby DR configuration.

Example enterprise application in warm standby disaster recovery configuration.

DR Concepts

The first step in planning for DR involves determining the recovery time objective (RTO) and recovery point objective (RPO).

The RTO is the target time within which a given application must be restored after a disaster occurs. Typically, the more critical the application, the lower the RTO.

The RPO is the maximum period of data loss, measured back in time from the moment a disaster occurs, that an application can tolerate before the loss begins to affect the business.

To build a plan that guarantees the recovery of your applications after a disaster and is cost effective, you must consider both the target time to recover and the tolerance for data loss.

Diagram showing recovery point objective before a disaster, the disaster, then the recovery time objective.
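
As a rough illustration of how the two objectives are evaluated, the following Python sketch compares a hypothetical incident timeline against RPO and RTO targets. The function name, parameter names, and timestamps are assumptions made for this example and are not part of any OCI API.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, disaster_time, service_restored,
                     rpo, rto):
    """Return (data_loss, downtime, rpo_met, rto_met) for one incident.

    data_loss is the time between the last recoverable copy of the data
    and the disaster; downtime is the time from the disaster until the
    application is serving users again.
    """
    data_loss = disaster_time - last_recovery_point
    downtime = service_restored - disaster_time
    return data_loss, downtime, data_loss <= rpo, downtime <= rto


# Hypothetical timeline: last replicated copy at 09:58, outage at 10:00,
# service restored in the standby region at 10:12.
loss, down, rpo_ok, rto_ok = meets_objectives(
    last_recovery_point=datetime(2024, 1, 1, 9, 58),
    disaster_time=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 12),
    rpo=timedelta(minutes=5),
    rto=timedelta(minutes=15),
)
print(f"data loss {loss}, downtime {down}, RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

In this made-up timeline, two minutes of data are lost and the service is down for twelve minutes, so both targets are met.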

For more information, see Best practices for protecting your cloud topology against disasters.

Choosing a DR Approach

Some applications are more critical than others. The DR solution you choose depends on many possible requirements, including availability, data durability, RTO, and RPO.

Evaluate the DR methods in the following table to decide which OCI DR capabilities to use when deploying multi-tier enterprise applications on OCI.

DR Method | RPO | RTO | Cost
Backup and restore | Hours | Hours | $
Pilot light | Minutes | Minutes | $$
Warm standby | Seconds | Minutes | $$$
Active/active | Near zero | Potential zero | $$$$
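
One way to read the table is as a lookup: pick the least expensive method whose typical RPO and RTO are no worse than what the application requires. The Python sketch below encodes that reading. The ordinal scale, the function name, and the simplification of "Near zero" and "Potential zero" to "zero" are assumptions for illustration only, not an Oracle-defined selection procedure.

```python
# Ordinal scale for the qualitative values used in the table above.
SCALE = {"zero": 0, "seconds": 1, "minutes": 2, "hours": 3}

# (method, typical RPO, typical RTO, relative cost from 1 to 4)
# "Near zero" and "Potential zero" are simplified to "zero" here.
DR_METHODS = [
    ("Backup and restore", "hours", "hours", 1),
    ("Pilot light", "minutes", "minutes", 2),
    ("Warm standby", "seconds", "minutes", 3),
    ("Active/active", "zero", "zero", 4),
]


def cheapest_method(required_rpo, required_rto):
    """Return the lowest-cost method whose RPO and RTO meet the requirement."""
    candidates = [
        (cost, name)
        for name, rpo, rto, cost in DR_METHODS
        if SCALE[rpo] <= SCALE[required_rpo] and SCALE[rto] <= SCALE[required_rto]
    ]
    # Active/active satisfies every requirement, so the list is never empty.
    return min(candidates)[1]


print(cheapest_method("seconds", "minutes"))
```

With a required RPO of seconds and a required RTO of minutes, the sketch selects warm standby. A real decision would also weigh availability and data durability requirements, which this lookup ignores.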

Consider both regions and the availability domains within a region for DR and high availability (HA) scenarios. A region is a localized geographic area, and an availability domain is one or more data centers located within a region. If your DR plan requires that DR sites are physically located far apart, using multiple regions can accomplish this goal.

For our example enterprise application, we need to be able to survive a regional outage but can handle some downtime if a region is affected. For these reasons we chose a warm standby deployment in multiple regions.

Manage DR Orchestration with Full Stack DR

Full Stack Disaster Recovery (Full Stack DR) is an OCI native service that provides a simple and consistent interface to orchestrate DR operations for many different systems. Any authorized user in your IT operations can trigger a failover or switchover without needing to understand the underlying recovery processes.

Full Stack DR is Oracle’s first true disaster recovery as a service (DRaaS) solution for OCI, and is more than just a simple orchestration engine. Full Stack DR is a highly scalable, highly extensible DR management service that fully automates the steps needed to test, transition, or recover critical and non-critical business systems between two OCI regions from anywhere around the globe with a single click.

The Problems Enterprises Face with Recovery at Scale

Your company probably has more than just a few mission and business critical applications hosted in your OCI tenancy. To complicate things, every one of these Oracle or non-Oracle applications has a different recovery process with different recovery point and recovery time objectives. In addition, the processes for recovery of each different application stack can be complex, requiring the full attention of your most senior technical specialists to accomplish.

Your IT organization probably has the skills and determination to recover one or two different applications in a day or two in an all-consuming, all-hands-on-deck effort from the company’s most senior IT specialists. But what happens if your IT organization is faced with the prospect of recovering more than just a couple of systems at the same time?

Full Stack DR Makes Recovery at Scale Easy

Full Stack DR is designed to handle DR workflows at scale without involving your most skilled technical experts in the event you need to recover many systems at the same time. Full Stack DR normalizes the way DR operations are executed and monitored using a consistent, simple method through the OCI Console.

Full Stack DR organizes various applications into independent protection groups without changing anything about the way you’ve installed and configured your existing Oracle and non-Oracle applications in OCI. Full Stack DR can recover just one component of an application stack or recover the entire application stack with a single click – you choose what you want to do.

Full Stack DR Validates Readiness of DR Plans

Full Stack DR helps validate that critical business systems are ready for any unexpected disruptions of service through our built-in, fully automated DR readiness checks. Our Precheck feature is automatically added to the list of tasks that Full Stack DR steps through during any DR operation.

The Prechecks are non-disruptive and can be run at any point without disturbing your production systems. We validate the sanity of DR plans by checking that the network, storage, compute, Oracle databases, and any custom scripts you’ve added to a DR plan are where they should be and are ready to go.
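
To make the idea of a non-disruptive readiness check concrete, here is a minimal Python sketch of a precheck runner. It is not the Full Stack DR implementation or API. The class, the function, and the three example checks are hypothetical, and each check is assumed to be read-only.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Precheck:
    name: str
    check: Callable[[], bool]  # read-only probe; True means "ready"


def run_prechecks(prechecks: List[Precheck]) -> Tuple[bool, List[str]]:
    """Run every check and collect the names of the ones that fail.

    All checks run even after a failure, so one pass reports every
    problem. A check that raises an exception counts as a failure.
    """
    failures = []
    for precheck in prechecks:
        try:
            ok = precheck.check()
        except Exception:
            ok = False
        if not ok:
            failures.append(precheck.name)
    return (not failures, failures)


# Hypothetical checks with hard-coded results; real checks would query
# the network, storage, compute, and database resources in the DR plan.
prechecks = [
    Precheck("standby subnet exists", lambda: True),
    Precheck("volume group replica present", lambda: True),
    Precheck("custom failover script present", lambda: False),
]
ready, failed = run_prechecks(prechecks)
print("ready" if ready else f"not ready: {failed}")
```

Running all checks instead of stopping at the first failure is a design choice in this sketch: it lets an operator fix every reported gap before the next readiness run.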

Flexibility to Manage Any Deployment Architecture

Flexibility is a key concept behind the design of Full Stack DR. Different business systems require different recovery solutions. Therefore, Full Stack DR conforms to the way you need to recover each individual business system in a way that matches your technical and business needs. How you choose to install and deploy a business system for disaster recovery is up to you.

Our DRaaS solution can manage recovery differently for each individual business system whether it is deployed for VM failover, pilot light, cold standby, warm standby, hot standby, or active/active. You handle the deployment, and we handle the recovery.

DR Design Considerations

There are many things to consider, depending on the DR method that you implement.

For background information about DR capabilities, see DR Capabilities of Oracle Cloud. In this example, we review the warm standby method and the OCI resources needed to implement warm standby, which include a second region for a cross-region deployment.

Networking

After you create the network foundation of virtual cloud networks (VCNs) and subnets in each region, peer the VCNs across the regions to provide the network connectivity that DR requires.
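
One practical check before peering is that the address ranges of the two VCNs don't overlap, because peered VCNs must not have overlapping CIDR blocks. This text doesn't cover address planning, so treat the following Python sketch as a supplementary illustration. It uses the standard ipaddress module, and the CIDR values are made-up examples.

```python
import ipaddress


def overlapping_pairs(cidrs_a, cidrs_b):
    """Return every (a, b) pair of CIDR blocks that overlap."""
    nets_a = [ipaddress.ip_network(c) for c in cidrs_a]
    nets_b = [ipaddress.ip_network(c) for c in cidrs_b]
    return [(str(a), str(b)) for a in nets_a for b in nets_b if a.overlaps(b)]


# Hypothetical address plans for the primary and standby regions.
primary_vcn = ["10.0.0.0/16"]
standby_vcn = ["10.1.0.0/16"]
print(overlapping_pairs(primary_vcn, standby_vcn))  # empty list: no conflicts
```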

Compute

To run applications on compute instances in two regions, you must make the compute images available in both regions. In the region for DR, deploy a minimal setup to maintain a warm standby. Then, use capacity reservations to reserve the rest of the required capacity to run all the VMs when the DR region becomes primary. For more information, see Overview of the Compute Service and Best Practices for Your Compute Instances.
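
The sizing logic in this paragraph is simple arithmetic: the capacity to reserve is the full production footprint minus what the warm standby already runs. The Python sketch below shows that calculation. The tier names and instance counts are hypothetical, and the function isn't part of any OCI SDK.

```python
def capacity_to_reserve(production, warm_standby):
    """Return the instance count per tier to reserve in the DR region.

    production: instances per tier needed to run the full workload.
    warm_standby: instances per tier already running in the DR region.
    Tiers that the standby already covers in full are omitted.
    """
    return {
        tier: count - warm_standby.get(tier, 0)
        for tier, count in production.items()
        if count > warm_standby.get(tier, 0)
    }


# Hypothetical footprint: the standby runs a minimal slice of each tier.
production = {"web": 6, "app": 8}
warm_standby = {"web": 1, "app": 2}
print(capacity_to_reserve(production, warm_standby))
```

For this example, the sketch reports five web instances and six application instances to reserve.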

Storage

OCI provides a set of storage services, including Block Volume, File Storage, and Object Storage, that offer built-in redundancy and high availability by maintaining multiple copies of data. These storage services also provide native replication that can be configured for cross-region disaster recovery.

Object Storage is an internet-scale, high-performance storage platform that offers reliable and cost-efficient data durability. Object Storage is a regional service and is available across all availability domains within a region. Object Storage replication can be configured across regions for DR purposes.

Block Volume has a fully managed, asynchronous replication feature to aid with disaster recovery. You can replicate volumes and volume groups to another region, with a recovery time objective (RTO) of less than a minute. An automated backup feature is also available to produce crash-consistent backups for volumes and volume groups. These backups can be automatically copied to another region.

Like other storage services in OCI, File Storage has built-in replication features that asynchronously replicate data to another availability domain or region. With the File Storage cloning feature, the data on the target side can be made available almost instantly, which keeps the RTO low. For a complete DR experience, replication also copies snapshots along with the main file system data.

Database

High availability design is meant to ensure application availability in case of IaaS failure events, such as node or network failure. The database DR scenarios deal with prevention of the loss of critical business data due to significant and unavoidable outage to the primary database(s) that often impact an entire region or an availability domain.

We recommend that you refer to Maximum Availability Architecture (MAA), which is a set of best practices and reference architectures developed by Oracle engineers over many years for the integrated use of Oracle high availability, data protection, and disaster recovery technologies.

The key considerations for a DR design are the recovery point objective (RPO), which is the amount of data loss your application can tolerate, and the recovery time objective (RTO), which is the maximum time your application can be unavailable before systems must be back online. Based on these objectives, MAA defines a series of tiers: Bronze, Silver, Aurous, Gold, and Platinum, each with progressively greater resilience, cost, and complexity. These tiers form the basis for the DR reference architectures specified by MAA.

Maximum Availability Architecture (MAA) Tiers | Core Architecture | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) | Oracle Autonomous Database on Serverless Exadata Infrastructure (ADB-S) | Oracle Autonomous Database on Dedicated Exadata Infrastructure (ADB-D and ADB-C@C) | Oracle Base Database Service (VM) | Oracle Exadata Database Service on Dedicated Infrastructure (ExaDB-D) | Oracle Exadata Database Service on Cloud@Customer (ExaDB-C@C)
BRONZE | Single instance with local backup and replicated backup | Last Backup | Hours | Out of the box | Out of the box | Out of the box | Out of the box | Out of the box
SILVER | RAC with local backup and replicated backup | Last Backup | Hours (Zero for planned maintenance) | Out of the box | Out of the box | Out of the box for 2 nodes (Require EE Extreme Performance) | Out of the box | Out of the box
AUROUS | Refreshable PDB | Last Refresh | Minutes | + Autonomous Data Guard | Optional | Optional | Optional | Optional
GOLD | Database with Cross-site Active-Passive replication by way of (Active) Data Guard | Zero | Seconds | Not Applicable | + Data Guard | + Data Guard (Requires EE/EE HP for Standard DG, EE EP for Active DG) | + Data Guard | + Data Guard
PLATINUM | Database with Cross-site Active-Active replication by way of GoldenGate | Zero | Zero | + GoldenGate | + GoldenGate | + GoldenGate | + GoldenGate | + GoldenGate

This DR design and strategy describes the prevention of data loss in the Oracle database. A robust DR strategy must also address configurations for continuous availability of applications.

Key technologies that form the basis of MAA include:

Monitoring

OCI Monitoring lets you actively and passively monitor your cloud resources for improved availability and consistent service levels. Ensure that you're subscribed to the OCI status notifications and check the Service Health Dashboard. For an example, see End-to-End Monitoring of applications running on Oracle Cloud Infrastructure.

Explore More