Automate Switchover and Failover Plans for OCI Kubernetes Engine (Stateful) with OCI Full Stack Disaster Recovery

Introduction

Oracle Cloud Infrastructure Full Stack Disaster Recovery (OCI Full Stack DR) orchestrates the transition of compute, database, and applications between Oracle Cloud Infrastructure (OCI) regions from around the globe with a single click. Customers can automate the steps needed to recover one or more business systems without redesigning or rearchitecting existing infrastructure, databases, or applications and without needing specialized management or conversion servers.

Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a managed Kubernetes service that simplifies the development, deployment, and operation of containerized workloads at scale. OKE enables you to quickly create, manage, and consume Kubernetes clusters that leverage underlying OCI Compute, networking, and storage services.

Deployment Architecture

Architecture Diagram

Objectives

The following tasks will be covered in this tutorial.

Note: In this tutorial, the primary region is Frankfurt and the standby region is Amsterdam.

Prerequisites

Task 1: Create a Dynamic Group and Policies for OKE and OCI Full Stack DR

These policies let the OCI Full Stack DR service access the OCI Object Storage bucket to upload configuration backups. The policy required for OCI Object Storage bucket access from the OKE cluster depends on the cluster type.

  1. Create a dynamic group and policies for a managed node pool.

    • Create a dynamic group named <cluster1_dg>.

      All {instance.compartment.id = '<compartment_ocid>'}
      
    • Create the following policies.

      Allow dynamic-group cluster1_dg to manage object-family in compartment <compartment>
      Allow dynamic-group cluster1_dg to manage cluster-family in compartment <compartment>
      
  2. Create the following policies for a virtual node pool.

    Allow any-user to manage objects in tenancy where all { request.principal.type = 'workload', request.principal.namespace = 'brie', request.principal.service_account = 'brie-reader', request.principal.cluster_id = '<Cluster_OCID>'}
    
    Allow any-user to manage objects in tenancy where all { request.principal.type = 'workload', request.principal.namespace = 'brie', request.principal.service_account = 'brie-creator', request.principal.cluster_id = '<Cluster_OCID>'}
    

    These policies let pods running in the brie namespace with the service account brie-reader or brie-creator read from and write to the OCI Object Storage bucket.

  3. Create a dynamic group and policies for a container instance. These policies let runtime container instances created by the OCI Full Stack DR service access the OKE cluster and the OCI Object Storage bucket.

    • Create a dynamic group named <bastion1_dg>.

      All {resource.type='computecontainerinstance'}
      
    • Create the following policies.

      Allow dynamic-group bastion1_dg to manage object-family in compartment <compartment>
      Allow dynamic-group bastion1_dg to manage cluster-family in compartment <compartment>
      
  4. Create a dynamic group and policy for a jump host.

    If you are using a jump host, this policy lets OCI Full Stack DR access the OKE cluster and the OCI Object Storage buckets. If the jump host and the cluster are in the same compartment, you do not need to create a new dynamic group and policy to provide access to the OCI Object Storage bucket.

    • Create a dynamic group named <bastion1_dg>.

      All {instance.compartment.id = '<compartment_ocid>'}
      
    • Create the following policy.

      Allow dynamic-group bastion1_dg to manage cluster-family in compartment <compartment>
      Allow dynamic-group bastion1_dg to manage cluster in compartment <compartment>
      

Note: If you do not include the identity_domain_name before the dynamic-group, then the policy statement is evaluated as though the group belongs to the default identity domain. For more information, see How Policies Work.
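
The dynamic-group policies in this task differ only in the group name, the compartment, and whether an identity domain prefix is present. The following is a minimal sketch that prints the object-family and cluster-family statements for one dynamic group. The function name render_dg_policies and the example values are hypothetical, and the 'domain'/'group' quoting is assumed from the general OCI policy syntax, so verify the output against How Policies Work before you paste it into the Console.

```shell
# Print the two policy statements used in this task for one dynamic group.
# Arguments: <dynamic_group> <compartment> [identity_domain]
# All values are placeholders supplied by the caller (hypothetical helper).
render_dg_policies() {
  local group="$1" compartment="$2" domain="${3:-}"
  local subject="$group"
  # Assumption: with an identity domain, the subject is written 'domain'/'group'.
  if [ -n "$domain" ]; then
    subject="'${domain}'/'${group}'"
  fi
  local family
  for family in object-family cluster-family; do
    printf 'Allow dynamic-group %s to manage %s in compartment %s\n' \
      "$subject" "$family" "$compartment"
  done
}

# Example calls (values are illustrative only):
#   render_dg_policies cluster1_dg my_compartment
#   render_dg_policies cluster1_dg my_compartment my_domain
```

Without the third argument, the statements match the form shown in the steps above and are evaluated against the default identity domain.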

Task 2: Add Primary OKE Cluster to the Primary DR Protection Groups

  1. In the primary DRPG (DRPG_MUSHOP_FRA), select Members and click Add Member.

  2. Select OKE Cluster as Resource type.

  3. Enter the following required information.

    • OKE Cluster: Enter an OKE cluster.
    • Backup: Enter backup information.
      • Backup bucket: Select bucket.
      • Select Specify the backup schedule.
      • Schedule type: Enter schedule type.
      • Start time: Enter start time in UTC.
      • Interval: Enter interval in days.
      • The maximum number of backups you want to retain (optional): Enter maximum number of backups.
      • Select Image replication:
      • Image replication secret (optional): Select image.
      • Namespace (optional): Enter namespace.
    • Peer OKE cluster: Select peer OKE cluster.
  4. Select I understand that I must refresh and verify all the existing plans and click Add.

Task 3: Add Volume Groups to the Primary DR Protection Groups

  1. In the primary DRPG (DRPG_MUSHOP_FRA), select Members and click Add Member.

  2. Select Volume Group as Resource type.

  3. Enter the following required information.

    • Volume Group: Select volume group.
  4. Select I understand that I must refresh and verify all the existing plans and click Add.

Task 4: Add Standby OKE Cluster to the Standby DR Protection Groups

  1. In the standby DRPG (DRPG_MUSHOP_AMS), select Members and click Add Member.

  2. Select OKE Cluster as Resource type.

  3. Enter the following required information.

    • OKE Cluster: Enter an OKE cluster.
    • Backup: Enter backup information.
      • Backup bucket: Select bucket.
    • Peer OKE cluster: Select peer OKE cluster.
  4. Select I understand that I must refresh and verify all the existing plans and click Add.

Task 5: Create a Start Drill Plan

  1. In the standby DRPG (DRPG_MUSHOP_AMS), select Plans and click Create Plan.

  2. Enter a Name for the plan, select Start Drill as Plan Type and click Create.

    After a few minutes, the plan will show Active state.

  3. Select the plan created to see its content.

Task 6: Execute the Start Drill Plan

  1. Select the plan created in Task 5.

  2. Select Enable prechecks and click Execute plan.

    After a few minutes, all groups will show Success state.

Task 7: Check the Application Running on the Standby OKE Cluster

Connect to the standby OKE cluster and check that the application is running. For the MuShop application, run the following command.

kubectl get all -n mushop
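
If you want to script this check rather than read the output by eye, one option is to count the pods that are not in a healthy state. The following is a minimal sketch, assuming that a pod is healthy when its STATUS column reads Running or Completed. The function name count_unhealthy_pods is hypothetical, and the function only parses text, so it works on any saved or piped output of kubectl get pods with the --no-headers option.

```shell
# Count pods whose STATUS column is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints the count.
count_unhealthy_pods() {
  awk 'NF >= 3 && $3 != "Running" && $3 != "Completed" { n++ }
       END { print n + 0 }'
}

# Usage against the standby cluster (namespace from this tutorial):
#   kubectl get pods -n mushop --no-headers | count_unhealthy_pods
```

A result of 0 suggests that every pod in the namespace has started on the standby cluster. It does not replace checking the services and load balancer endpoints.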

Task 8: Create a Stop Drill Plan

  1. In the standby DRPG (DRPG_MUSHOP_AMS), select Plans and click Create Plan.

  2. Enter a Name for the plan, select Stop Drill as Plan Type and click Create.

    After a few minutes, the plan will show Active state.

Task 9: Execute the Stop Drill Plan

  1. Select the plan created in Task 8.

  2. Select Enable prechecks and click Execute plan.

    After a few minutes, all groups will show Success state.

Task 10: Check Clean-up on Standby OKE Cluster

Connect to the standby OKE cluster and check the list of namespaces using the following command.

kubectl get namespaces
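
The same check can be scripted. The following is a minimal sketch, assuming that a successful clean-up removes the application namespace (mushop in this tutorial) from the standby cluster. That expectation is an assumption and is not stated by the plan output. The function name namespace_absent is hypothetical.

```shell
# Succeed (exit 0) if the given namespace name does not appear in the
# first column of `kubectl get namespaces --no-headers` output on stdin.
namespace_absent() {
  local ns="$1"
  # awk exits 0 when the namespace is found; the leading ! inverts that.
  ! awk -v ns="$ns" '$1 == ns { found = 1 } END { exit !found }'
}

# Usage against the standby cluster:
#   kubectl get namespaces --no-headers | namespace_absent mushop \
#     && echo "mushop namespace removed"
```

The comparison is an exact match on the namespace name, so a namespace such as mushop-utilities does not count as mushop.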

Next Steps

Now that you have created and executed the drill plans, the next step is to create a switchover plan and a failover plan.

There are two best practices that should be incorporated into the normal day-to-day operations to help ensure the readiness of your DR plans.

Think about scheduling weekly prechecks of all DR plans in the standby DR Protection Group. Prechecks can be run at any time and have zero impact on production workloads. This will help ensure integrity of your DR plans, catching missing member resources, missing networks, the inability to find expected scripts called by user-defined steps, and so on.

Another very important way of validating the readiness of your disaster recovery is to schedule periodic DR drills once a month or quarter. DR drills also have zero impact on production workloads, but give you the ability to validate recovery of compute, storage, Oracle databases and backend sets for load balancers in the standby region with the click of a single button. For more information, see:

Acknowledgments

More Learning Resources

Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.

For product documentation, visit Oracle Help Center.