Disaster Recovery

The goal of disaster recovery (DR) is to provide high availability at the level of an installation site, and to protect critical workloads hosted on Oracle Private Cloud Appliance against outages and data loss.

With appliance software version 3.0.2-b1261765 and later, a new DR service is provided, with orchestration of DR operations built directly into the Service Enclave.

Note:

First-generation disaster recovery, which relies on a third system running an Oracle Enterprise Manager installation with Oracle Site Guard, remains operational for existing appliances with DR configured. A path is provided to migrate existing configurations to the new DR service.

Setting up disaster recovery is the responsibility of an appliance administrator or Oracle engineer. It involves interconnecting all participating systems and configuring the replication settings on both Private Cloud Appliance systems. Each appliance is a fully operational environment on its own and, at the same time, is configured to be the other's standby or replica.

The administrator determines which workloads and resources are under disaster recovery control by creating and managing DR configurations. The DR configurations define which compute instances and associated block volumes are protected against site-level incidents, and map the relevant source and target compartments and networking resources between the peered systems. A DR configuration can be refreshed to pick up changes that might have occurred to the instances it includes.
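As a rough illustration of what a DR configuration captures, the following Python sketch models it as a plain data structure. The class, field, and method names (DrConfig, compartment_map, refresh, and so on) are hypothetical and are not part of the appliance API; they only mirror the concepts described above.

```python
from dataclasses import dataclass, field


@dataclass
class DrConfig:
    """Hypothetical model of a DR configuration (not the real appliance API)."""

    name: str
    # Compute instances under DR control, each with its attached block volumes.
    instances: dict[str, list[str]] = field(default_factory=dict)
    # Source-to-target mappings between the peered systems.
    compartment_map: dict[str, str] = field(default_factory=dict)
    network_map: dict[str, str] = field(default_factory=dict)

    def refresh(self, current_volumes: dict[str, list[str]]) -> None:
        """Pick up volume changes for instances already in the configuration."""
        for instance_id in self.instances:
            if instance_id in current_volumes:
                self.instances[instance_id] = list(current_volumes[instance_id])

    def unmapped_compartments(self, instance_compartments: dict[str, str]) -> set[str]:
        """Return source compartments of protected instances with no target mapping."""
        return {
            compartment
            for instance_id, compartment in instance_compartments.items()
            if instance_id in self.instances and compartment not in self.compartment_map
        }


# Example values are invented for illustration only.
config = DrConfig(
    name="dr-config-1",
    instances={"inst-a": ["vol-1"]},
    compartment_map={"src-comp": "tgt-comp"},
)
config.refresh({"inst-a": ["vol-1", "vol-2"]})
```

In this sketch, a refresh only updates instances that the configuration already includes, and a check such as unmapped_compartments would flag protected instances whose compartment has no counterpart on the peer system.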

When an incident occurs, failover operations are launched to bring up the instances under disaster recovery control on the standby or replica system. Failover is not granular; it is a site-wide process in which the compute instances on the affected system fail abruptly. After reversing the roles of the primary and replica ZFS Storage Appliances, the DR service launches the affected compute instances of the primary system on the standby system, in accordance with the replication settings and site mappings.

A failover is the result of a disruptive incident being detected at one of the installation sites. However, the DR service also supports switchover, which is a similar process but is triggered manually by an administrator. In a controlled switchover scenario, the running instances are safely stopped on the primary system to avoid data loss or corruption. Switchover is typically used for planned maintenance or testing. When both sites are fully operational again, the peered Private Cloud Appliance systems can be returned to their original configuration by performing a second switchover from replica to primary, also known as a failback in generic DR terms.
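The difference between the two operations can be sketched as an ordered list of steps. The function and step names below are hypothetical and simplified; they only reflect the sequence described in this section, not the actual DR plan contents.

```python
def plan_steps(operation: str) -> list[str]:
    """Return an illustrative step sequence for a DR operation (hypothetical names)."""
    if operation not in ("switchover", "failover"):
        raise ValueError(f"unknown operation: {operation}")

    steps = []
    if operation == "switchover":
        # Controlled operation: instances are stopped safely on the primary first.
        steps.append("stop_instances_on_primary")
    # Both operations reverse the roles of the primary and replica storage ...
    steps.append("reverse_zfs_replication_roles")
    # ... and then launch the protected instances on the standby system.
    steps.append("start_instances_on_standby")
    return steps


failover_steps = plan_steps("failover")
switchover_steps = plan_steps("switchover")
```

The only difference in this sketch is the clean stop that precedes a switchover. In a failover, the instances on the primary are assumed to be down already.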

Native DR

The current implementation of disaster recovery is also called native DR, because the service is built into the Private Cloud Appliance infrastructure services layer. When a mutual peer connection has been established between the appliances, the DR services on the primary and standby racks communicate using REST API calls. Local commands are sent to the appropriate cloud infrastructure service through the platform layer's internal messaging and administration services.

The DR service makes a clear distinction between resources and operations. Resources under DR protection are identified in the DR configuration. DR operations are defined in a DR plan, which outlines the steps to perform during a switchover, failover, or postfailover operation. DR configurations and DR plans are shared between the peered systems, and can be created and maintained from the primary or standby appliance. DR operations can also be executed from either appliance, except for failover, which is always triggered from the standby.
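The rule about where each operation may be started can be expressed as a small check. This is a minimal sketch with hypothetical names (can_execute, the role strings); the DR service enforces its own rules internally.

```python
OPERATIONS = {"switchover", "failover", "postfailover"}
ROLES = {"primary", "standby"}


def can_execute(operation: str, local_role: str) -> bool:
    """Return True if the operation may be started from an appliance in this role."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    if local_role not in ROLES:
        raise ValueError(f"unknown role: {local_role}")
    # Failover is always triggered from the standby; other operations
    # can be executed from either appliance.
    if operation == "failover":
        return local_role == "standby"
    return True
```

For example, can_execute("failover", "primary") evaluates to False, while can_execute("switchover", "primary") evaluates to True.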

For more information and step-by-step configuration instructions, see Native Disaster Recovery in the Oracle Private Cloud Appliance Administrator Guide.

First-Generation DR

Besides the two Private Cloud Appliance systems installed at different sites, this implementation requires a third system running an Oracle Enterprise Manager installation with Oracle Site Guard. In Oracle Site Guard, the administrator must set up both Private Cloud Appliance systems as sites, and configure the failover workflows, known as operation plans. This means that the orchestration of DR operations is external to the appliance software. When an incident is detected in either environment, the role of Oracle Site Guard is to execute the failover workflows, so the impacted compute instances can be recovered by restarting them on the replica system.

Peering in this DR approach is configured between the ZFS Storage Appliances, not at the appliance level. With the creation of a DR configuration, a dedicated ZFS project is set up for replication to the peer system, and the compute instance resources involved are moved from the default storage location to this new ZFS project. The dedicated network connection between the two peer ZFS Storage Appliances – one in each rack – ensures reliable data replication at 5-minute intervals.
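To make the 5-minute replication interval more concrete, the following sketch estimates how far the replica can trail the primary in the worst case. It assumes that each replication cycle starts on schedule and completes within a fixed transfer time; both are simplifying assumptions, and the function name and the 30-second transfer time are invented for illustration.

```python
def worst_case_replica_lag(interval_seconds: float, transfer_seconds: float) -> float:
    """Upper bound, in seconds, on how far the replica can trail the primary.

    Assumes every replication cycle starts on schedule and finishes within
    transfer_seconds. Real behavior depends on load and link quality.
    """
    if interval_seconds <= 0 or transfer_seconds < 0:
        raise ValueError("interval must be positive and transfer time non-negative")
    # Data written just after a cycle starts waits a full interval for the
    # next cycle, and then for that cycle's transfer to complete.
    return interval_seconds + transfer_seconds


REPLICATION_INTERVAL = 5 * 60  # the 5-minute interval stated above
lag = worst_case_replica_lag(REPLICATION_INTERVAL, transfer_seconds=30)  # 30 s is an assumed value
```

Under these assumptions, lag is 330 seconds: data written to the primary within roughly that window before an incident might not yet be present on the replica.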

For more information and step-by-step configuration instructions, see Disaster Recovery in the Oracle Private Cloud Appliance Administrator Guide.