Configure OCI Full Stack Disaster Recovery

Configure your disaster recovery protection groups and create your switchover and failover plans. The steps depend on the disaster recovery model that you're using.

Define the Primary DR Protection Group

The primary Disaster Recovery (DR) Protection Group contains the components of your system in the primary region that require an action during a switchover or failover.

Perform the following steps to define the primary DR Protection Group:

  1. Log in to the Oracle Cloud Infrastructure Console in the primary region.
  2. Navigate to Migration and Disaster Recovery, then click DR Protection Groups.
  3. Click Create DR Protection Group.
  4. Enter a name for the DR Protection Group.
  5. Select the Compartment, then provide an Oracle Cloud Infrastructure Object Storage bucket for the logs.
  6. Leave the role as Not configured for now.
  7. Click Add members.
    1. Add the primary mid-tier compute instances. For Compute instance type, choose Non-moving instance.
      Resource Type | Instance | Compute Instance Type
      Compute | Mid-tier compute instance 0 of primary region | Non-moving instance
      Compute | Mid-tier compute instance 1 of primary region | Non-moving instance
      Compute | Mid-tier compute instance N of primary region | Non-moving instance
    2. Add the primary database. Select the appropriate resource type (Database or Autonomous Database).
  8. Click Create.
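The membership rules above can be expressed as a quick consistency check. The following Python sketch models the primary DR Protection Group as plain data and flags compute members that are not non-moving instances, or a missing database member. The classes, field names, and example values (`Member`, `ProtectionGroup`, `validate_primary`, `my-compartment`, `dr-logs`) are hypothetical illustrations and are not part of the OCI SDK or API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical data model for illustration only; this is NOT the OCI SDK.
DATABASE_TYPES = ("Database", "Autonomous Database")


@dataclass
class Member:
    resource_type: str                           # "Compute", "Database", or "Autonomous Database"
    name: str
    compute_instance_type: Optional[str] = None  # set only for Compute members


@dataclass
class ProtectionGroup:
    name: str
    compartment: str
    log_bucket: str                              # Object Storage bucket for the DR logs
    role: str = "Not configured"                 # the primary group is created without a role
    members: List[Member] = field(default_factory=list)


def validate_primary(group: ProtectionGroup) -> List[str]:
    """Return the problems found in a primary group definition (empty list if none)."""
    problems = []
    for m in group.members:
        if m.resource_type == "Compute" and m.compute_instance_type != "Non-moving instance":
            problems.append(f"{m.name}: mid-tier compute members must be non-moving instances")
    if not any(m.resource_type in DATABASE_TYPES for m in group.members):
        problems.append("the primary database is not a member")
    return problems


# Example: two mid-tier instances plus the primary database (hypothetical names).
primary = ProtectionGroup(
    name="wls-primary-pg",
    compartment="my-compartment",
    log_bucket="dr-logs",
    members=[Member("Compute", f"midtier-{i}-primary", "Non-moving instance") for i in range(2)]
    + [Member("Database", "primary-db")],
)
print(validate_primary(primary))  # []
```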

Define the Standby DR Protection Group

The standby Disaster Recovery (DR) Protection Group contains the components of your system in the secondary region that require an action during a switchover or failover.

Perform the following steps to define the standby DR Protection Group:

  1. Log in to the Oracle Cloud Infrastructure Console in the standby region.
  2. Navigate to Migration and Disaster Recovery, then click DR Protection Groups.
  3. Click Create DR Protection Group.
  4. Enter a name for the DR Protection Group.
  5. Select the Compartment, then provide an Oracle Cloud Infrastructure Object Storage bucket for the logs.
  6. Set the role to Standby.
    1. Select the primary region in Peer region.
    2. Select the previously created DR Protection Group as Peer DR Protection Group.
  7. Click Add members.
    1. Add the standby mid-tier compute instances. For Compute instance type, choose Non-moving instance.
      Resource Type | Instance | Compute Instance Type
      Compute | Mid-tier compute instance 0 of standby region | Non-moving instance
      Compute | Mid-tier compute instance 1 of standby region | Non-moving instance
      Compute | Mid-tier compute instance N of standby region | Non-moving instance
    2. Add the standby database. Select the appropriate resource type (Database or Autonomous Database).
  8. Click Create.
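The standby group only works if it points back at the primary group. The following Python sketch checks the peering settings from steps 6.1 and 6.2 and, as an additional assumption drawn from the example member lists, that both regions have the same number of mid-tier instances. `GroupConfig` and `check_standby_peering` are hypothetical names, and the region identifiers are examples only.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical model for illustration only; this is NOT the OCI SDK.


@dataclass
class GroupConfig:
    name: str
    region: str
    role: str                    # for example "Not configured", "Primary", or "Standby"
    peer_region: Optional[str]
    peer_group: Optional[str]
    compute_members: List[str]


def check_standby_peering(primary: GroupConfig, standby: GroupConfig) -> List[str]:
    """Return a list of peering problems (empty if the standby group looks correct)."""
    problems = []
    if standby.role != "Standby":
        problems.append("standby group role must be Standby")
    if standby.region == primary.region:
        problems.append("standby group must be in a different region from the primary group")
    if standby.peer_region != primary.region:
        problems.append("Peer region must be the primary region")
    if standby.peer_group != primary.name:
        problems.append("Peer DR Protection Group must be the primary group")
    # Assumption: one standby mid-tier instance per primary mid-tier instance.
    if len(standby.compute_members) != len(primary.compute_members):
        problems.append("primary and standby have different numbers of mid-tier instances")
    return problems


primary = GroupConfig("wls-primary-pg", "us-ashburn-1", "Not configured", None, None,
                      ["midtier-0-primary", "midtier-1-primary"])
standby = GroupConfig("wls-standby-pg", "us-phoenix-1", "Standby", "us-ashburn-1",
                      "wls-primary-pg", ["midtier-0-standby", "midtier-1-standby"])
print(check_standby_peering(primary, standby))  # []
```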

Complete the DR Protection Groups Definition

When you use the DR model based on Block Volumes cross-region replication, configure the replicated block volumes for each compute member in both the primary and the standby DR Protection Groups.

Note:

These steps apply only to the disaster recovery model based on OCI Block Volumes cross-region replication. They do NOT apply to the disaster recovery models that use the “OCI File Storage with rsync” or “Database File System (DBFS)” methods for configuration replication.

  1. Configure the replicated block volumes for each compute member in the primary DR Protection Group.
    1. Edit a compute member, click Advanced options, then click the Block Volumes tab.
      • In Block Volume, select the block volume that is attached to the instance and replicated to the standby region.
      • In Volume attachment reference instance, select the peer compute instance in the standby region.

        This compute instance is used to get the attachment details when switching to this region.

      • In Mount point, enter the mount point where the block volume is mounted.
    2. A compute instance can have more than one replicated block volume. For example, in Oracle WebLogic Server for OCI, you can replicate both wlsociprefix-data-block-N and wlsociprefix-mw-block-N to the secondary region. In that case, add each additional replicated block volume to the compute instance member definition.

      Note:

      Do not add boot volumes. They are not replicated.
    3. Repeat the previous steps for each compute instance member in the primary DR Protection Group.
    The following table shows example Block Volume advanced properties for the compute members of the primary DR Protection Group:
    Compute Member | Block Volume | Volume Attachment Reference Instance | Mount Point
    Mid-tier compute instance 0 of primary region | wlsociprefix-data-block-1 | Mid-tier compute instance 0 of standby region | /u01/data
    Mid-tier compute instance 1 of primary region | wlsociprefix-data-block-2 | Mid-tier compute instance 1 of standby region | /u01/data
    Mid-tier compute instance N of primary region | wlsociprefix-data-block-N | Mid-tier compute instance N of standby region | /u01/data
  2. Configure the replicated block volumes for each compute member in the standby DR Protection Group:
    1. Edit a standby compute member, click Advanced options, then click the Block Volumes tab.
      • In Block Volume, select the block volume in the primary region that will be attached to this compute instance. The list shows the primary region's block volumes directly.
      • In Volume attachment reference instance, select the peer compute instance in the primary region.

        This instance is used to get the attachment details when switching to this region.

      • In Mount point, enter the mount point where the block volume is mounted.
    2. A compute instance can have more than one replicated block volume. For example, in Oracle WebLogic Server for OCI, you can replicate both wlsociprefix-data-block-N and wlsociprefix-mw-block-N to the secondary region. In that case, add each additional replicated block volume to the compute instance member definition.

      Note:

      Do not add boot volumes. They are not replicated.
    3. Repeat the previous steps for each compute instance that is a member of the group.
    The following table shows example Block Volume advanced properties for the compute members of the standby DR Protection Group:
    Compute Member | Block Volume | Volume Attachment Reference Instance | Mount Point
    Mid-tier compute instance 0 of standby region | wlsociprefix-data-block-1 | Mid-tier compute instance 0 of primary region | /u01/data
    Mid-tier compute instance 1 of standby region | wlsociprefix-data-block-2 | Mid-tier compute instance 1 of primary region | /u01/data
    Mid-tier compute instance N of standby region | wlsociprefix-data-block-N | Mid-tier compute instance N of primary region | /u01/data
  3. Edit the primary DR Protection Group and add the replicated volume groups as members.
    1. Click Add a Member.
    2. Select the Volume Group resource type.
    3. Select the volume group that is replicated to the standby region.
    4. Repeat for every volume group in the primary region that is replicated to the standby region.

      Note:

      Perform this step in the primary DR Protection Group only. You don't need to add any volume group to the standby DR Protection Group: OCI Full Stack Disaster Recovery automatically adds the volume groups as members of the standby DR Protection Group when that group becomes primary during a switchover or failover.
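The primary and standby tables above are mirror images: each primary member references its standby peer, each standby member references its primary peer, and both use the same primary-region block volume and mount point. The following Python sketch derives both sets of settings from a list of instance pairs and skips boot volumes. The instance names, the `is_boot` flag, and the helper names are hypothetical; the sketch only illustrates the mapping and does not call any OCI API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical model for illustration only; it does not call any OCI API.


@dataclass(frozen=True)
class AttachedVolume:
    name: str
    mount_point: str
    is_boot: bool = False


@dataclass(frozen=True)
class BlockVolumeSetting:
    member: str               # compute member being edited
    block_volume: str         # always the primary-region block volume name
    reference_instance: str   # "Volume attachment reference instance" (the peer)
    mount_point: str


def block_volume_settings(
    pairs: List[Tuple[str, str]],
    volumes: Dict[str, List[AttachedVolume]],
) -> Tuple[List[BlockVolumeSetting], List[BlockVolumeSetting]]:
    """pairs: (primary instance, standby instance); volumes: primary instance -> attached volumes."""
    primary_settings, standby_settings = [], []
    for prim, stby in pairs:
        for vol in volumes[prim]:
            if vol.is_boot:
                continue  # boot volumes are not replicated, so never add them
            primary_settings.append(BlockVolumeSetting(prim, vol.name, stby, vol.mount_point))
            standby_settings.append(BlockVolumeSetting(stby, vol.name, prim, vol.mount_point))
    return primary_settings, standby_settings


pairs = [(f"midtier-{i}-primary", f"midtier-{i}-standby") for i in range(2)]
volumes = {
    prim: [AttachedVolume(f"{prim}-boot", "/", is_boot=True),
           AttachedVolume(f"wlsociprefix-data-block-{i + 1}", "/u01/data")]
    for i, (prim, _) in enumerate(pairs)
}
primary_settings, standby_settings = block_volume_settings(pairs, volumes)
for s in primary_settings + standby_settings:
    print(s.member, s.block_volume, s.reference_instance, s.mount_point)
```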

About DR Plans

Create disaster recovery (DR) plans for your protection groups. A DR plan in a particular DR Protection Group is valid for switching or failing over to that DR Protection Group.

For Region 1's DR Protection Group, you define the switchover and failover plans from Region 2 to Region 1. For Region 2's DR Protection Group, you define the switchover and failover plans from Region 1 to Region 2.

Note:

You can only create and modify plans in the DR Protection Group that has a standby role.

You can create the following types of plans:
  • Switchover Plan

    Performs a planned transition of services from the primary DR Protection Group to the standby DR Protection Group. Switchover plans are used to perform an orderly transition by shutting down the application stack in the primary region and then bringing it up in the standby region. Therefore, a switchover plan requires that application stack components and other required OCI services be available in both regions. Switchover plans are typically used for the purposes of planned site maintenance, software patching, DR testing, and validation.

  • Failover Plan

    Performs an unplanned transition of services to the standby region. Failover plans usually perform an immediate transition by bringing up the application stack in the standby region without attempting to shut down services in the primary region. Therefore, a failover plan only requires that OCI services be available in the standby region. Failover plans are generally used to perform DR transitions when an outage or disaster affects the primary region.
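The rules in this section (a plan lives in the group that services move to, plans can only be created in the standby group, and the two plan types differ in which regions must be available) can be summarized in a small decision helper. This Python sketch is a deliberate simplification: a real choice between switchover and failover also depends on maintenance windows and on the nature of the outage. All names are hypothetical.

```python
from enum import Enum
from typing import Tuple


class PlanType(Enum):
    SWITCHOVER = "Switchover"
    FAILOVER = "Failover"


def can_create_plan(group_role: str) -> bool:
    # Plans can only be created and modified in the group that has the standby role.
    return group_role == "Standby"


def plan_direction(group_region: str, peer_region: str) -> Tuple[str, str]:
    # A plan defined in a group moves services *to* that group's region: (from, to).
    return (peer_region, group_region)


def choose_plan_type(primary_available: bool, standby_available: bool) -> PlanType:
    """Simplified choice: switchover needs both regions; failover needs only the standby."""
    if not standby_available:
        raise RuntimeError("the standby region is unavailable, so neither plan can run")
    return PlanType.SWITCHOVER if primary_available else PlanType.FAILOVER


print(plan_direction("region2", "region1"))  # ('region1', 'region2')
print(choose_plan_type(True, True).value)    # Switchover
print(choose_plan_type(False, True).value)   # Failover
```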

Create the Switchover Plan

Create the switchover plan in the standby Disaster Recovery (DR) Protection Group.

  1. In the Oracle Cloud Infrastructure Console, navigate to the standby DR Protection Group, click Plans, then click Create Plan.
  2. Provide a name for the plan.
    For example, switchover_to_region2.
  3. Select Switchover for the plan type.

    When the plan is created, it includes the built-in steps: the prechecks, the database switchover step, and, if you use Block Volumes cross-region replication, the steps that manage the replicated block volumes.

    The steps are grouped in plan groups. All of the steps in the same plan group run in parallel.
    A switchover plan for the DR models based on OCI File Storage with rsync or Oracle Database File System (DBFS) config replication includes the following built-in plan groups by default:
    • Built-in prechecks: Performs prechecks for all of the steps in the plan.
    • Switchover databases (Standby): Performs the switchover of the database.
    A switchover plan for the DR model based on OCI Block Volumes cross-region replication includes the following built-in plan groups by default:
    • Built-in prechecks: Performs prechecks for all of the steps in the plan.
    • Detach Block Volumes from compute instances: Unmounts and detaches the block volumes from the primary compute instances.
    • Switchover Volume Groups: Activates the volume group replicas in the standby region. This creates new volume groups and block volumes in the standby region that are copies of the primary block volumes.
    • Switchover databases (Standby): Performs the switchover of the database.
    • Attach Block Volumes from compute instances: Attaches the activated block volumes to the standby compute instances.
    • Reverse Volume Groups’ Replication: Enables cross-region replication for the new volume groups created in the standby region (the new primary), so that they now replicate to the previous primary region.
    • Terminate Volume Group: Terminates the volume groups and block volumes in the previous primary region.
    • Remove Volume Groups from DR Protection Group: Removes the volume group members from the previous primary DR Protection Group definition. The volume groups are now added as members of the new primary DR Protection Group.

    Note:

    The Terminate Volume Group step is disabled by default.

    When the step is disabled, the block volumes and volume groups in the previous primary region are not deleted (they are only detached), and you must delete them manually. When the step is enabled, they are deleted automatically.

    After the initial validation tests, Oracle recommends enabling this step to avoid duplicate block volumes. Otherwise, the block volumes left behind continue to replicate and incur cost even though they are not used.

  4. For the remaining actions, add the following user-defined plan groups and steps for the Oracle WebLogic Server (WLS) instances and the front-end DNS switchover:
    User-Defined Plan Group | Step | Error Mode | Region | Step Type | Target Instance | Script Parameters | Run as User
    WLS stop in remote_region (parallel) | WLS stop node 0 | Stop on Error | Remote region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_stop.sh | oracle
    WLS stop in remote_region (parallel) | WLS stop node 1 | Stop on Error | Remote region | Run local script | Mid-tier compute instance 1 | /opt/scripts/custom_stop.sh | oracle
    WLS stop in remote_region (parallel) | WLS stop node N | Stop on Error | Remote region | Run local script | Mid-tier compute instance N | /opt/scripts/custom_stop.sh | oracle
    WLS Admin Server start in this_region | WLS Admin Server | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_start_aserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node 0 | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_start_mserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node 1 | Stop on Error | This region | Run local script | Mid-tier compute instance 1 | /opt/scripts/custom_start_mserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node N | Stop on Error | This region | Run local script | Mid-tier compute instance N | /opt/scripts/custom_start_mserver.sh | oracle
    Front-end DNS switchover | Front-end DNS switchover | Stop on Error | This region | Run local script / function | Mid-tier compute instance 0 | Path to the DNS script on the host | opc (or the user that runs the DNS script)

    Note:

    The default timeout for each operation is 3600 seconds, which is adequate for most cases. For some operations, such as starting and stopping WLS managed servers, you might need to adjust this value based on the deployed applications and on whether graceful shutdowns must wait for in-flight transactions (according to your Java Transaction API (JTA) settings) and long-running operations to complete. Similarly, the start timeout depends on your Oracle WebLogic Server deployment; in a SOA system, for example, it can vary with the number and type of deployed composites. Because these timeouts directly affect the recovery time objective (RTO), first run each operation manually on your system to measure it, then set a timeout that still meets your RTO. If a timeout occurs, you might need to intervene manually.

    Plan groups run serially, and the steps within a plan group run in parallel. Therefore, place all of the steps that stop the Oracle WebLogic Server instances in the same plan group so that the instances stop in parallel. The start steps, however, are split into two plan groups: one that starts the Administration Server on the first node, and another with N steps that starts the Oracle WebLogic Server managed servers on all hosts in parallel.

  5. Optionally, if you use the DR model based on OCI File Storage with rsync or Oracle Database File System config replication, add the following user-defined steps. These scripts replicate the Oracle WebLogic Server configuration to the standby region before the switchover:
    User-Defined Plan Group | Step | Error Mode | Region | Step Type | Target Instance | Script Parameters | Run as User
    (optional) Config sync in primary (from primary to staging folder) | Run config replica script in primary node 0 | Stop on Error | Remote region | Run local script | Mid-tier compute instance 0 | /u01/scripts/config_replica.sh | oracle
    (optional) Config sync in standby (from staging folder to standby) | Run config replica script in standby node 0 | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /u01/scripts/config_replica.sh | oracle
  6. If you use the DR model based on OCI Block Volumes cross-region replication, add the following user-defined steps. These steps replace the database connect strings in the Oracle WebLogic Server (WLS) configuration so that the connect strings point to the local database:
    User-Defined Plan Group | Step | Error Mode | Region | Step Type | Target Instance | Script Parameters | Run as User
    DB Connect string replacement in WLS (all in parallel) | in WLS node 0 | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /u01/scripts/replacement_script_BVmodel.sh | oracle
    DB Connect string replacement in WLS (all in parallel) | in WLS node 1 | Stop on Error | This region | Run local script | Mid-tier compute instance 1 | /u01/scripts/replacement_script_BVmodel.sh | oracle
    DB Connect string replacement in WLS (all in parallel) | in WLS node N | Stop on Error | This region | Run local script | Mid-tier compute instance N | /u01/scripts/replacement_script_BVmodel.sh | oracle
  7. Reorder the Plan Groups in the plan as follows when using the DR model based on OCI File Storage with rsync or Oracle Database File System config replication:
    Plan Group Position | Plan Group | Plan Group Type
    1 | Built-in prechecks | Built-in step
    2 | (optional) Config Sync in Primary (from primary to staging folder) | User-defined step
    3 | (optional) Config Sync in Standby (from staging folder to standby) | User-defined step
    4 | Oracle WebLogic Server shutdown in remote_region (parallel) | User-defined step
    5 | DNS switchover | User-defined step
    6 | Switchover databases (Standby) | Built-in step
    7 | Oracle WebLogic Server Admin Server start in this_region | User-defined step
    8 | Oracle WebLogic Server managed servers start in this_region (all nodes in parallel) | User-defined step
  8. Reorder the plan groups as follows when using the DR model based on OCI Block Volumes cross-region replication. This order keeps the built-in plan groups in their default relative order:
    Plan Group Position | Plan Group | Plan Group Type
    1 | Built-in prechecks | Built-in step
    2 | Oracle WebLogic Server shutdown in remote_region (parallel) | User-defined step
    3 | Detach Block Volumes from compute instances | Built-in step
    4 | Switchover Volume Groups | Built-in step
    5 | DNS switchover | User-defined step
    6 | Switchover databases (Standby) | Built-in step
    7 | Attach Block Volumes from compute instances | Built-in step
    8 | DB Connect string replacement in Oracle WebLogic Server (all in parallel) | User-defined step
    9 | Oracle WebLogic Server Admin Server start in this_region | User-defined step
    10 | Oracle WebLogic Server managed servers start in this_region (all nodes in parallel) | User-defined step
    11 | Reverse Volume Groups’ Replication | Built-in step
    12 | Terminate Volume Group | Built-in step
    13 | Remove Volume Groups from DR Protection Group | Built-in step

    With this order, downtime starts at plan group 2 and ends when plan group 10 completes.

    To minimize downtime during the switchover, you can use the following order instead:
    Plan Group Position | Plan Group | Plan Group Type
    1 | Built-in prechecks | Built-in step
    2 | Switchover Volume Groups | Built-in step
    3 | Attach Block Volumes from compute instances | Built-in step
    4 | DB Connect string replacement in Oracle WebLogic Server (all in parallel) | User-defined step
    5 | Oracle WebLogic Server shutdown in remote_region (parallel) | User-defined step
    6 | DNS switchover | User-defined step
    7 | Switchover databases (Standby) | Built-in step
    8 | Oracle WebLogic Server Admin Server start in this_region | User-defined step
    9 | Oracle WebLogic Server managed servers start in this_region (all nodes in parallel) | User-defined step
    10 | Detach Block Volumes from compute instances | Built-in step
    11 | Reverse Volume Groups’ Replication | Built-in step
    12 | Terminate Volume Group | Built-in step
    13 | Remove Volume Groups from DR Protection Group | Built-in step
    With this order, downtime starts at plan group 5 and ends when plan group 9 completes.

    Note:

    The step to Terminate Volume Group is disabled by default.

    When the step is disabled, the block volumes and volume groups in the previous primary region are not deleted (they are only detached, and their cross-region replication is disabled), and you must delete them manually. When the step is enabled, they are deleted automatically.

    After the initial validation tests, Oracle recommends enabling this step to avoid duplicate block volumes. Otherwise, the block volumes left behind continue to replicate and incur cost even though they are not used.

  9. Repeat these steps to create the switchback plan in the DR Protection Group for the primary region.

    Note:

    To create the switchback plan in the DR Protection Group for the primary region, you must wait until that group has the standby role. Therefore, schedule a switchover in a planned downtime window, or wait until the next planned switchover, to create the switchback plan in the other DR Protection Group.
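Because plan groups run serially and the steps within a group run in parallel, you can estimate a plan's duration and downtime from manually measured step times: each group takes as long as its slowest step, and the groups add up. The following Python sketch applies that arithmetic and flags steps that would exceed the default 3600-second timeout. The plan is a simplified two-node version of the File Storage or DBFS order without the optional config sync groups, and the step durations are made-up placeholders, not measurements; replace them with the times you measure on your own system.

```python
from typing import Dict, List, Tuple

# Ordered plan: (plan group name, {step name: measured duration in seconds}).
Plan = List[Tuple[str, Dict[str, float]]]
DEFAULT_TIMEOUT = 3600.0  # default per-step timeout, in seconds


def group_time(steps: Dict[str, float]) -> float:
    # Steps in the same plan group run in parallel: the slowest step dominates.
    return max(steps.values(), default=0.0)


def plan_time(plan: Plan) -> float:
    # Plan groups run serially: their durations add up.
    return sum(group_time(steps) for _, steps in plan)


def downtime(plan: Plan, first: int, last: int) -> float:
    """Estimated downtime from plan group position `first` to `last` (1-based, inclusive)."""
    return sum(group_time(steps) for _, steps in plan[first - 1:last])


def over_timeout(plan: Plan, timeout: float = DEFAULT_TIMEOUT) -> List[str]:
    """Steps whose measured duration exceeds the per-step timeout."""
    return [f"{group}/{step}" for group, steps in plan
            for step, seconds in steps.items() if seconds > timeout]


# Placeholder durations (seconds), for illustration only.
plan: Plan = [
    ("Built-in prechecks", {"prechecks": 60}),
    ("WLS stop in remote_region (parallel)", {"node 0": 120, "node 1": 150}),
    ("DNS switchover", {"dns": 30}),
    ("Switchover databases (Standby)", {"database": 300}),
    ("WLS Admin Server start in this_region", {"admin": 180}),
    ("WLS managed servers start in this_region", {"node 0": 240, "node 1": 200}),
]
print(plan_time(plan))       # 960
print(downtime(plan, 2, 6))  # 900
print(over_timeout(plan))    # []
```

Applied to the two Block Volumes orders in step 8, the same arithmetic explains why the alternative order reduces downtime: the window shrinks from positions 2 through 10 to positions 5 through 9, so volume group activation, block volume attachment, and connect string replacement all run before the primary is stopped.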

Create the Failover Plan

Create the failover plan in the standby DR Protection Group.

  1. In the OCI Console, navigate to the standby DR Protection Group, click Plans, then click Create Plan.
  2. Provide a name for the plan.
    For example, failover_to_region2.
  3. Select Failover for the plan type.
    When the plan is created, it includes the built-in steps: the prechecks, the database failover step, and, if you use Block Volumes cross-region replication, the steps related to the replicated block volumes.
    A failover plan for the DR models based on OCI File Storage with rsync or Oracle Database File System config replication includes the following built-in plan groups by default:
    • Built-in prechecks: Performs prechecks for all of the steps in the plan.
    • Failover databases (Standby): Performs the failover of the database.
    A failover plan for the DR model based on OCI Block Volumes cross-region replication includes the following built-in plan groups by default:
    • Built-in prechecks: Performs prechecks for all of the steps in the plan.
    • Failover Volume Groups: Activates the volume group replicas in the standby region. This creates new volume groups and block volumes in the standby region that are copies of the primary block volumes.
    • Failover databases (Standby): Performs the failover of the database.
    • Attach Block Volumes from compute instances: Attaches the block volumes in the standby region to the standby compute instances.

    Note:

    The failover plan does not include any operation in the primary DR group. After a failover, you must manually perform some actions once the primary system is available again. See Resetting DR Configuration After a Failover for details.

  4. For the remaining actions, add the following user-defined plan groups and steps:
    User-Defined Plan Group | Step | Error Mode | Region | Step Type | Target Instance | Script Parameters | Run as User
    WLS stop in remote_region (parallel) | WLS stop node 0 | Continue on error | Remote region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_stop.sh | oracle
    WLS stop in remote_region (parallel) | WLS stop node 1 | Continue on error | Remote region | Run local script | Mid-tier compute instance 1 | /opt/scripts/custom_stop.sh | oracle
    WLS stop in remote_region (parallel) | WLS stop node N | Continue on error | Remote region | Run local script | Mid-tier compute instance N | /opt/scripts/custom_stop.sh | oracle
    WLS Admin Server start in this_region | WLS Admin Server | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_start_aserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node 0 | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /opt/scripts/custom_start_mserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node 1 | Stop on Error | This region | Run local script | Mid-tier compute instance 1 | /opt/scripts/custom_start_mserver.sh | oracle
    WLS managed servers start in this_region (all in parallel) | WLS start node N | Stop on Error | This region | Run local script | Mid-tier compute instance N | /opt/scripts/custom_start_mserver.sh | oracle
    Front-end DNS switchover | Front-end DNS switchover | Stop on Error | This region | Run local script / function | Mid-tier compute instance 0 | Path to the DNS script on the host | opc (or the user that runs the DNS script)

    These steps are the same as those in the equivalent switchover plan, except that the steps that stop Oracle WebLogic Server in the primary region must use the Continue on error mode. In a failover scenario the primary components might be unavailable, and a failed stop step must not block the failover.

    Note:

    The default timeout for each operation is 3600 seconds, which is adequate for most cases. For some operations, such as starting and stopping WLS managed servers, you might need to adjust this value based on the deployed applications and on whether graceful shutdowns must wait for in-flight transactions (according to your Java Transaction API (JTA) settings) and long-running operations to complete. Similarly, the start timeout depends on your Oracle WebLogic Server deployment; in a SOA system, for example, it can vary with the number and type of deployed composites. Because these timeouts directly affect the recovery time objective (RTO), first run each operation manually on your system to measure it, then set a timeout that still meets your RTO. If a timeout occurs, you might need to intervene manually.

    Plan groups run serially, and the steps within a plan group run in parallel. Therefore, place all of the steps that stop the Oracle WebLogic Server instances in the same plan group so that the instances stop in parallel. The start steps, however, are split into two plan groups: one that starts the Administration Server on the first node, and another with N steps that starts the Oracle WebLogic Server managed servers on all nodes in parallel.

  5. If you use the DR model based on OCI Block Volumes cross-region replication, add the following user-defined steps. These steps replace the database connect strings in the Oracle WebLogic Server (WLS) configuration so that the connect strings point to the local database:
    User-Defined Plan Group | Step | Error Mode | Region | Step Type | Target Instance | Script Parameters | Run as User
    DB Connect string replacement in WLS (all in parallel) | in WLS node 0 | Stop on Error | This region | Run local script | Mid-tier compute instance 0 | /u01/scripts/replacement_script_BVmodel.sh | oracle
    DB Connect string replacement in WLS (all in parallel) | in WLS node 1 | Stop on Error | This region | Run local script | Mid-tier compute instance 1 | /u01/scripts/replacement_script_BVmodel.sh | oracle
    DB Connect string replacement in WLS (all in parallel) | in WLS node N | Stop on Error | This region | Run local script | Mid-tier compute instance N | /u01/scripts/replacement_script_BVmodel.sh | oracle
  6. Reorder the Plan Groups in the failover plan as follows when using the DR model based on Oracle Cloud Infrastructure File Storage with rsync or Oracle Database File System config replication:
    Plan Group Position | Plan Group | Plan Group Type
    1 | Built-in prechecks | Built-in step
    2 | Oracle WebLogic Server shutdown in remote_region (parallel) | User-defined step
    3 | DNS switchover | User-defined step
    4 | Failover databases (Standby) | Built-in step
    5 | Oracle WebLogic Server Admin Server start in this_region | User-defined step
    6 | Oracle WebLogic Server managed servers start in this_region (all nodes in parallel) | User-defined step
  7. Reorder the plan groups in the failover plan as follows when using the DR model based on OCI Block Volumes cross-region replication. This order keeps the built-in plan groups in their default relative order:
    Plan Group Position | Plan Group | Plan Group Type
    1 | Built-in prechecks | Built-in step
    2 | Oracle WebLogic Server shutdown in remote_region (parallel) | User-defined step
    3 | Failover Volume Groups | Built-in step
    4 | DNS switchover | User-defined step
    5 | Failover databases (Standby) | Built-in step
    6 | Attach Block Volumes from compute instances | Built-in step
    7 | DB Connect string replacement in WLS in this_region (all nodes in parallel) | User-defined step
    8 | Oracle WebLogic Server Admin Server start in this_region | User-defined step
    9 | Oracle WebLogic Server managed servers start in this_region (all nodes in parallel) | User-defined step
  8. Repeat these steps to create the failover plan in the DR Protection Group for the primary region.

    Note:

    To create the failover plan in the DR Protection Group for the primary region, you must wait until that group has the standby role.
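The error mode determines whether a failed step stops the plan. The following Python sketch simulates a plan run under the semantics described in this section: plan groups run one after another, the steps of a group run in parallel, a failed Continue on error step is logged and skipped, and a failed Stop on Error step halts the plan once its group finishes. It uses `concurrent.futures` from the Python standard library; `Step`, `run_plan`, and the stand-in step functions are hypothetical, and the real service's handling of a halted plan (for example, retrying or skipping a step) is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple

STOP_ON_ERROR = "Stop on Error"
CONTINUE_ON_ERROR = "Continue on error"


@dataclass
class Step:
    name: str
    action: Callable[[], None]  # stand-in for a local script or a built-in operation
    error_mode: str = STOP_ON_ERROR


def run_plan(groups: List[Tuple[str, List[Step]]]) -> List[str]:
    """Run plan groups serially and the steps inside each group in parallel.

    Returns an outcome log. Raises RuntimeError when a Stop on Error step fails.
    """
    log = []
    for group_name, steps in groups:
        # Leaving the with-block waits until every step of the group has finished.
        with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
            futures = [(step, pool.submit(step.action)) for step in steps]
        failed = []
        for step, future in futures:
            error = future.exception()
            if error is None:
                log.append(f"{step.name}: ok")
            elif step.error_mode == CONTINUE_ON_ERROR:
                log.append(f"{step.name}: failed, continuing ({error})")
            else:
                log.append(f"{step.name}: failed ({error})")
                failed.append(step.name)
        if failed:
            raise RuntimeError(f"plan stopped in '{group_name}': {', '.join(failed)}")
    return log


def primary_unreachable() -> None:
    raise ConnectionError("primary region unavailable")


def succeed() -> None:
    pass


failover_plan = [
    ("WLS stop in remote_region (parallel)",
     [Step("WLS stop node 0", primary_unreachable, CONTINUE_ON_ERROR),
      Step("WLS stop node 1", primary_unreachable, CONTINUE_ON_ERROR)]),
    ("Failover databases (Standby)", [Step("Failover database", succeed)]),
    ("WLS Admin Server start in this_region", [Step("WLS Admin Server", succeed)]),
]
for line in run_plan(failover_plan):
    print(line)
```

If the two stop steps used Stop on Error instead, the same run would raise `RuntimeError` before the database failover group starts, which is why the failover plan sets Continue on error for the steps that run in the primary region.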