Disaster Recovery

The goal of disaster recovery is to provide high availability at the level of an installation site, and to protect critical workloads hosted on a Private Cloud Appliance against outages and data loss. The implementation requires two Private Cloud Appliance systems installed at different sites, and a third system running an Oracle Enterprise Manager installation with Oracle Site Guard.

The two Private Cloud Appliance systems are both fully operational environments on their own, but are at the same time configured to be each other's replica. A dedicated network connection between the two peer ZFS Storage Appliances – one in each rack – ensures reliable data replication at 5-minute intervals. When an incident is detected in either environment, the role of Oracle Site Guard is to execute the failover workflows, known as operation plans.

Setting up disaster recovery is the responsibility of an appliance administrator or Oracle engineer. It involves interconnecting all participating systems, and configuring the Oracle Site Guard operation plans and the replication settings on both Private Cloud Appliance systems. The administrator determines which workloads and resources are under disaster recovery control by creating and managing DR configurations through the Service CLI on the two appliances.

The DR configurations are the core elements. The administrator adds critical compute instances to a DR configuration, so that they can be protected against site-level incidents. Storage and network connection information is collected and stored for each instance included in the DR configuration. When a DR configuration is created, a dedicated ZFS project is set up for replication to the peer ZFS Storage Appliance, and the compute instance resources involved are moved from the default storage location to this new ZFS project. A DR configuration can be refreshed at any time to pick up changes made to the instances it includes.

Next, site mapping details are added to the DR configuration. All relevant compartments and subnets must be mapped to their counterparts on the replica system. A DR configuration cannot work unless the compartment hierarchy and network configuration exist on both Private Cloud Appliance systems.
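The mapping requirement can be illustrated with a minimal Python sketch. It is not the appliance's data model or the Service CLI: the `Instance` and `DrConfig` classes, their fields, and the `unmapped_resources` function are hypothetical names chosen for illustration. The sketch only shows the idea that every compartment and subnet used by a protected instance needs a counterpart on the replica system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instance:
    # Hypothetical, simplified view of a protected compute instance.
    name: str
    compartment: str
    subnets: tuple[str, ...]


@dataclass
class DrConfig:
    # Hypothetical DR configuration: protected instances plus site mappings
    # from primary-side names to their counterparts on the replica system.
    instances: list[Instance]
    compartment_map: dict[str, str] = field(default_factory=dict)
    subnet_map: dict[str, str] = field(default_factory=dict)


def unmapped_resources(config: DrConfig) -> dict[str, set[str]]:
    """Return compartments and subnets used by instances but not yet mapped."""
    used_compartments = {inst.compartment for inst in config.instances}
    used_subnets = {s for inst in config.instances for s in inst.subnets}
    return {
        "compartments": used_compartments - set(config.compartment_map),
        "subnets": used_subnets - set(config.subnet_map),
    }


if __name__ == "__main__":
    config = DrConfig(
        instances=[Instance("web01", "prod", ("app-subnet", "db-subnet"))],
        compartment_map={"prod": "prod-replica"},
        subnet_map={"app-subnet": "app-subnet-replica"},
    )
    # Anything reported here would still need a mapping.
    print(unmapped_resources(config))
```

In this sketch, a DR configuration would be considered complete only when both returned sets are empty.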

When an incident occurs, failover operations are launched to bring up the instances under disaster recovery control on the replica system. This failover is not granular but a site-wide process that proceeds as follows:

  1. The site-level incident causes running instances to fail abruptly. They cannot be shut down gracefully.

  2. The roles of the primary and replica ZFS Storage Appliance are reversed.

  3. The affected compute instances of the primary system are recovered by starting them on the replica system.

  4. The primary system is cleaned up: stopped instances, frozen DR configurations, and so on are removed.

  5. Reverse DR configurations are set up based on the ZFS project and instance metadata.

A failover is the result of a disruptive incident being detected at one of the installation sites. However, Oracle Site Guard also supports switchover, which is effectively the same process but is triggered manually by an administrator. In a controlled switchover scenario, the first step in the process is to safely stop the running instances on the primary system to avoid data loss or corruption. Switchover is typically used for planned maintenance or testing. After a failover or switchover, once both sites are fully operational again, a failback is performed to return the two Private Cloud Appliance systems to their original configuration.
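The relationship between failover and switchover can be summarized in a short Python sketch. It is purely illustrative and does not represent how Oracle Site Guard defines or executes operation plans: the `Operation` enum, the `plan_steps` function, and the step descriptions are hypothetical, paraphrased from the text above. The point it shows is that the two operations differ only in their first step.

```python
from enum import Enum


class Operation(Enum):
    FAILOVER = "failover"      # triggered by a detected site-level incident
    SWITCHOVER = "switchover"  # triggered manually by an administrator


# Steps shared by both operations, paraphrased from the failover description.
COMMON_STEPS = (
    "reverse the roles of the primary and replica ZFS Storage Appliance",
    "start the affected compute instances on the replica system",
    "clean up the primary system",
    "set up reverse DR configurations",
)


def plan_steps(operation: Operation) -> list[str]:
    """Return the ordered steps for the given operation (illustrative only)."""
    if operation is Operation.SWITCHOVER:
        first = "safely stop the running instances on the primary system"
    else:
        first = "running instances fail abruptly; no graceful shutdown"
    return [first, *COMMON_STEPS]


if __name__ == "__main__":
    for op in Operation:
        print(op.value)
        for number, step in enumerate(plan_steps(op), start=1):
            print(f"  {number}. {step}")
```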