Disaster Recovery

The goal of disaster recovery (DR) is to provide high availability at the level of an installation site, and to protect critical workloads hosted on Compute Cloud@Customer against outages and data loss.

The current Compute Cloud@Customer controller software provides a Disaster Recovery service with orchestration of DR operations from within the Service Enclave, the infrastructure administration environment. The service is also called native DR because it is built directly into the infrastructure services layer.

Setting up disaster recovery requires that you open a support request (SR) with Oracle to establish a peer connection between systems, and provision an authorization group with the necessary access to all relevant operations in the highly restricted Service Enclave.

To open a support request, see Creating a Support Request. To access support, sign in to the Oracle Cloud Console as described in Sign In to the OCI Console.

Configuring the Disaster Recovery Service

Setting up disaster recovery requires initial setup by Oracle. Further management of the DR configuration and execution of DR plan operations is the responsibility of an authorized infrastructure administrator. The systems participating in the DR setup are fully operational environments on their own, running in different physical locations.

Peer Connection (Oracle)

A mutual peer connection must be established first, so the installations can operate as each other's standby or replica in case an outage occurs at one of the sites. Establishing the connection involves these steps:

  1. Installing dedicated cabling at each site between the system's spine switches and the data center network.

  2. Creating a local endpoint on each participating system. Traffic between peered systems flows through tunnels between endpoints.

  3. Creating the peer connection on each participating system. Configuration parameters of each connected system must be included to complete the peering. When each system has accepted the connection from its peer, a mutual trust relationship is established and peering is complete.

Service Setup (Oracle)

The Disaster Recovery service is configured on top of an active peer connection. On each peered system the service is enabled with a single command that includes the serial number of its peer.

If the Disaster Recovery service needs to be disabled, the DR service setup must be deleted from each peered system.

DR Configurations (authorized administrator)

The Disaster Recovery service makes a clear distinction between resources and operations. DR configurations specify the resources that play a critical role in protecting Compute Cloud@Customer workloads against site-level incidents. A DR configuration contains compute instances under DR protection, and their required site mappings.

Site mappings determine how and where on the standby system the instances should be brought back up in case the primary system experiences an outage. They associate network resources and compartment hierarchies on the primary and standby systems with each other. Each site mapping consists of a source object – subnet or compartment – on the primary system, and a corresponding target object on the standby system.
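
The source-to-target pairing described above can be pictured as a simple lookup table. The following Python sketch is illustrative only: the class, function, and identifier names are hypothetical and do not correspond to Service CLI commands or to the actual data model of the Disaster Recovery service. It assumes each protected instance references one subnet and one compartment on the primary system.

```python
from dataclasses import dataclass
from enum import Enum


class MappingKind(Enum):
    """The two kinds of site mapping named in the text."""
    SUBNET = "subnet"
    COMPARTMENT = "compartment"


@dataclass(frozen=True)
class SiteMapping:
    """Hypothetical model of one site mapping."""
    kind: MappingKind
    source_id: str  # object on the primary system
    target_id: str  # corresponding object on the standby system


def resolve_target(mappings, kind, source_id):
    """Return the standby-side object mapped to a primary-side object, or None."""
    for mapping in mappings:
        if mapping.kind is kind and mapping.source_id == source_id:
            return mapping.target_id
    return None


def unmapped_sources(mappings, instances):
    """List (instance name, kind, source id) for every reference without a mapping.

    `instances` is an iterable of (name, subnet_id, compartment_id) tuples;
    this shape is an assumption made for the sketch.
    """
    missing = []
    for name, subnet_id, compartment_id in instances:
        references = (
            (MappingKind.SUBNET, subnet_id),
            (MappingKind.COMPARTMENT, compartment_id),
        )
        for kind, source_id in references:
            if resolve_target(mappings, kind, source_id) is None:
                missing.append((name, kind.value, source_id))
    return missing
```

A check such as `unmapped_sources` illustrates why site mappings are created before instances are added: an instance whose subnet or compartment has no target on the standby system has nowhere to be brought back up.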

Compute instances are added to the DR configuration after the site mappings. For each instance under DR protection, data and disks are stored in the ZFS storage project associated with the DR configuration, and replicated over the peer connection.

DR Plans (authorized administrator)

DR operations are defined in a DR plan, which outlines the steps to perform during a switchover, failover, or postfailover operation. These operations are performed on the resources described in the DR configuration the DR plan is associated with.

  • When a switchover is performed, there is no outage, so both peered systems are online. The goal is to move all resources covered in the DR configuration from the primary system (A) to the standby system (B). When completed, system B becomes the primary and system A the standby for the resources in question.

  • A failover is performed on the standby system when one of the peered systems goes down. The goal is to recover all resources covered in the DR configuration on the standby system (B), allowing continuation of service. The failover steps are similar to those of a switchover, but none of the operations on the primary system (A) can be performed. The primary system cannot be cleaned up until it comes back online.

  • A postfailover plan is performed after a failover, when the system that experienced an outage comes back online, and the peer connection is restored. The goal is to clean up the DR configuration on the primary system that went down (A), and set it up as the standby for the new primary system (B).
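
The role changes in the three operations above can be summarized as a small state model. This Python sketch is a simplified illustration, not the service's implementation: the `Site` and `Role` types, the `NEEDS_CLEANUP` state, and the function names are assumptions introduced here, and the real DR plans consist of many more steps.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"
    NEEDS_CLEANUP = "needs cleanup"  # hypothetical label for a failed primary


@dataclass
class Site:
    name: str
    role: Role
    online: bool = True


def switchover(primary, standby):
    """Planned role swap: no outage, so both systems must be online."""
    if not (primary.online and standby.online):
        raise RuntimeError("switchover requires both peered systems to be online")
    if primary.role is not Role.PRIMARY or standby.role is not Role.STANDBY:
        raise RuntimeError("switchover expects a primary and a standby")
    primary.role, standby.role = Role.STANDBY, Role.PRIMARY


def failover(primary, standby):
    """Recover on the standby; the failed primary cannot be touched yet."""
    if not standby.online:
        raise RuntimeError("failover must run on an online standby system")
    if primary.role is not Role.PRIMARY or standby.role is not Role.STANDBY:
        raise RuntimeError("failover expects a primary and a standby")
    standby.role = Role.PRIMARY
    primary.role = Role.NEEDS_CLEANUP  # cleanup is deferred to postfailover


def postfailover(failed, new_primary):
    """Clean up the recovered system and make it the standby."""
    if not failed.online:
        raise RuntimeError("postfailover requires the failed system to be back online")
    if failed.role is not Role.NEEDS_CLEANUP or new_primary.role is not Role.PRIMARY:
        raise RuntimeError("postfailover only follows a failover")
    failed.role = Role.STANDBY
```

In this model, a switchover is symmetric and reversible, while a failover leaves system A in an intermediate state until a postfailover completes the role reversal.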

Working with DR Configurations and DR Plans

The implementation of the native Disaster Recovery service is shared with Private Cloud Appliance. Explicit permission must be given to one or more administrator accounts before they can access the DR functionality of the Service Enclave.

From the Service CLI you create and maintain the DR configurations and DR plans. Follow the instructions in the Private Cloud Appliance documentation.

  • Working with Disaster Recovery Configurations

    Learn how to create and manage DR configurations. It's important to keep site mappings up to date and ensure that the list of compute instances under DR protection reflects your current requirements.

  • Working with Disaster Recovery Plans

    Learn how to create and manage DR plans. These define the operations performed on protected resources when a controlled switchover is executed, and when a site-level incident occurs that requires failover. The execution of a DR plan is not automated through failure detection. It must be initiated from the Service CLI on the standby system.

Supported Resources

It is important to understand what is covered under disaster recovery and what is not.

Disaster recovery supports:

  • Compute instances

  • The block volumes associated with these compute instances

The following limitations apply to the Disaster Recovery service:

  • File systems are not supported

  • Object storage is not supported

  • OKE clusters are not supported

  • Application and network load balancers are not supported

  • SR-IOV instances are not supported
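
When reviewing a workload inventory against the lists above, a simple partition by resource type can help identify what falls outside DR protection. The sketch below is an illustration under stated assumptions: the resource-kind labels are hypothetical strings chosen for this example, not identifiers used by the service, and any kind not listed as supported is conservatively treated as not covered.

```python
# Hypothetical kind labels for the resource types the text lists as supported.
# Block volumes are covered only when associated with protected compute
# instances; that association is not modeled in this sketch.
SUPPORTED_KINDS = frozenset({"compute_instance", "block_volume"})


def partition_by_dr_support(resources):
    """Split (name, kind) pairs into covered and not-covered resource names.

    Kinds such as "file_system", "object_storage", "oke_cluster",
    "load_balancer", and "sriov_instance" (all hypothetical labels), as well
    as unknown kinds, end up in the second list.
    """
    covered, not_covered = [], []
    for name, kind in resources:
        if kind in SUPPORTED_KINDS:
            covered.append(name)
        else:
            not_covered.append(name)
    return covered, not_covered
```

Resources in the second list need a different protection strategy; depending on the resource type, storage replication may be an option.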

Storage Replication

In addition to the system-level DR service, Console users can add a data recovery option by enabling storage replication. Replication to a peered system is available for boot and block volumes, volume groups, file systems, and their snapshots. A replica can be used to restore operation when the original resource has become unavailable or corrupted.

For details and instructions, see the documentation sections on Block Volume Storage and File Storage.