Working with Disaster Recovery Plans
A disaster recovery (DR) plan describes the operations that must be performed on the Private Cloud Appliance resources that are under the protection of the disaster recovery service.
A DR plan is associated with a DR configuration, and is executed by an administrator either when a site-level incident is detected (failover), or when one of the sites must be taken offline (switchover). After a failover, when the affected system is back online, postfailover operations are performed to ensure that both systems are ready to run new DR operations.
These sections explain how to build and execute DR plans:
About DR Operations and Default Plans
The native DR service provides plans with default steps for each type of operation. DR plan steps can be customized. The built-in plans are configured as follows:
- Switchover Plan
When a switchover is performed, there is no outage, so both peered systems are online. The goal is to move all resources covered in the DR configuration from the primary system (A) to the standby system (B). When completed, system B becomes the primary and system A the standby for the resources in question.
The plan starts with prechecks to ensure that both systems meet the requirements to allow compute instances to be stopped on the primary system and started again on the standby system. The prechecks include site mappings as well as other critical elements, such as tags, security lists, or network security groups. The role reversal precheck specifically ensures that the ZFS Storage Appliance in each rack is in the correct state.
When the prechecks are completed without errors, the DR configuration on the primary system (A) is frozen and its compute instances are stopped, so the role reversal can begin. Based on resource metadata exchanged between the peered systems, and replicated data on the standby ZFS Storage Appliance, the target system (B) is prepared to assume the primary role for the instances in the DR configuration. The replication process is reversed, so that it is ready to use the source system (A) as its standby as soon as the switchover is complete.
Using the replicated volumes, the compute instances in the DR configuration are launched on the standby system (B). An identical DR configuration is created on the standby system, with all source and target resources in the site mappings inverted. The metadata of the newly launched instances is stored in this DR configuration. On the original primary system (A), a cleanup is performed: the DR configuration is disabled and its compute instances are terminated.
To complete the switchover, data replication from the new primary system (B) to the standby system (A) is started, the DR plans are moved to the new standby system (A), and the storage project and metadata associated with the original DR configuration are deleted from system A.
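The ordering of the switchover plan can be illustrated with a small sketch. It is not the DR service's implementation or API: the step names are hypothetical paraphrases of the description above, and every step after the prechecks is a stub that always succeeds. Stopping at the first failed step is an assumption of this sketch; the description only states that the remaining steps start once the prechecks complete without errors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True when the step succeeds


@dataclass
class PlanResult:
    completed: List[str] = field(default_factory=list)
    failed: Optional[str] = None  # name of the step that stopped the plan


# Hypothetical step names, in the order the switchover is described.
SWITCHOVER_STEPS = [
    "prechecks",
    "freeze_config_and_stop_instances_on_A",
    "role_reversal",
    "launch_instances_on_B",
    "create_inverted_dr_config_on_B",
    "cleanup_A",
    "start_replication_B_to_A",
    "move_dr_plans_to_A",
    "delete_original_project_and_metadata_on_A",
]


def build_switchover_plan(prechecks_pass: bool) -> List[Step]:
    """Build the plan; only the precheck outcome is configurable here."""
    steps = [Step(SWITCHOVER_STEPS[0], lambda: prechecks_pass)]
    steps += [Step(name, lambda: True) for name in SWITCHOVER_STEPS[1:]]
    return steps


def run_plan(steps: List[Step]) -> PlanResult:
    """Run steps in order and stop at the first one that fails."""
    result = PlanResult()
    for step in steps:
        if not step.action():
            result.failed = step.name
            break
        result.completed.append(step.name)
    return result


if __name__ == "__main__":
    outcome = run_plan(build_switchover_plan(prechecks_pass=False))
    print(outcome.failed, outcome.completed)
```

With failing prechecks, no step that touches the primary system runs, which matches the intent that instances on system A are only stopped once both systems are known to be ready.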
- Failover Plan
A failover is performed on the standby system when one of the peered systems goes down. The goal is to recover all resources covered in the DR configuration on the standby system (B), allowing continuation of service. The failover steps are similar to those of the switchover plan, but none of the operations on the primary system (A) can be performed. The primary system cannot be cleaned up until it comes back online.
The plan starts with prechecks to ensure that the standby system and its ZFS Storage Appliance are in the correct state to bring up the resources covered in the DR configuration. When the prechecks are completed without errors, the role reversal begins.
Using the replicated metadata and resources, the compute instances in the DR configuration are launched on the standby system (B), which assumes the primary role. An identical DR configuration is created on system B, which has become the primary, with inverted site mappings and metadata collected from the newly launched instances. In preparation for the original primary system (A) coming back online, the replication process is reversed, so that it is ready to use system A as the standby.
When the original primary system (A) eventually comes online, the remaining steps to return the DR configuration to a correct working state are performed by executing the postfailover plan.
- Postfailover Plan
A postfailover plan is performed after a failover, when the system that experienced an outage comes back online and the peer connection is restored. The goal is to clean up the DR configuration on the primary system that went down (A), and to set it up as the standby for the new primary system (B).
There are no prechecks in a postfailover plan. System A is back online after an outage and needs to be cleaned up: the DR configuration is disabled and its compute instances are terminated. Data replication from the new primary system (B) to the standby system (A) is started, the DR plans are moved to the new standby system (A), and the storage project and metadata associated with the original DR configuration are deleted from system A.
To move resources that were originally hosted on system A back from system B, the administrator must perform a switchover from B to A for the relevant DR configuration(s).
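How the three plans change the roles of the two systems can be summarized in a short model. This is a minimal sketch under stated assumptions, not the DR service's data model: the class `DrConfigState`, the `cleanup_pending` flag, and the three functions are hypothetical names. Refusing a switchover while cleanup is pending is an assumption of this sketch, based on the statement that postfailover operations make both systems ready to run new DR operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrConfigState:
    primary: str
    standby: str
    # True after a failover, until the postfailover plan has cleaned up
    # the system that went down (hypothetical flag for this sketch).
    cleanup_pending: bool = False


def switchover(state: DrConfigState) -> DrConfigState:
    """Both systems are online: swap roles and clean up in one plan."""
    if state.cleanup_pending:
        raise RuntimeError("run the postfailover plan first")
    return DrConfigState(primary=state.standby, standby=state.primary)


def failover(state: DrConfigState) -> DrConfigState:
    """Primary is down: the standby takes over, cleanup is deferred."""
    return DrConfigState(
        primary=state.standby, standby=state.primary, cleanup_pending=True
    )


def postfailover(state: DrConfigState) -> DrConfigState:
    """Old primary is back online: finish the deferred cleanup."""
    if not state.cleanup_pending:
        raise RuntimeError("postfailover only applies after a failover")
    return DrConfigState(primary=state.primary, standby=state.standby)


if __name__ == "__main__":
    state = DrConfigState(primary="A", standby="B")
    state = failover(state)       # B is primary, A still needs cleanup
    state = postfailover(state)   # A is now a clean standby
    state = switchover(state)     # resources move back to A
    print(state)
```

The sequence in the usage example mirrors the path described above: failover to B, postfailover once A is back online, then a switchover from B to A to return the resources to their original system.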