Note:
- This tutorial requires access to Oracle Cloud. To sign up for a free account, see Get started with Oracle Cloud Infrastructure Free Tier.
- It uses example values for Oracle Cloud Infrastructure credentials, tenancy, and compartments. When completing your lab, substitute these values with ones specific to your cloud environment.
Automate Switchover and Failover Plans for OCI Kubernetes Engine (Stateful) with OCI Full Stack Disaster Recovery
Introduction
Oracle Cloud Infrastructure Full Stack Disaster Recovery (OCI Full Stack DR) orchestrates the transition of compute, database, and applications between Oracle Cloud Infrastructure (OCI) regions from around the globe with a single click. Customers can automate the steps needed to recover one or more business systems without redesigning or rearchitecting existing infrastructure, databases, or applications and without needing specialized management or conversion servers.
Oracle Cloud Infrastructure Kubernetes Engine (OKE) is a managed Kubernetes service that simplifies the development, deployment, and operation of containerized workloads at scale. OKE enables you to quickly create, manage, and consume Kubernetes clusters that leverage underlying OCI Compute, networking, and storage services.
Deployment Architecture
Objectives
The following tasks will be covered in this tutorial.
- Task 1: Create a dynamic group and policies for OKE and OCI Full Stack Disaster Recovery
- Task 2: Add primary OKE cluster to the primary DR Protection Groups
- Task 3: Add volume groups to the primary DR Protection Groups
- Task 4: Add standby OKE cluster to the standby DR Protection Groups
- Task 5: Create a start drill plan
- Task 6: Execute the start drill plan
- Task 7: Check the application running on the standby OKE cluster
- Task 8: Create a stop drill plan
- Task 9: Execute the stop drill plan
- Task 10: Check clean-up on standby OKE cluster
Note: In this tutorial, the primary region is Frankfurt and the standby region is Amsterdam.
Prerequisites
-
This tutorial assumes the DR Protection Groups (DRPG) already exist, and you have existing DR plans in both regions.
-
This tutorial assumes the reader has administrator privileges and the required Oracle Cloud Infrastructure Identity and Access Management (OCI IAM) policies for OCI Full Stack DR are already in place. For more information, see Configuring Identity and Access Management (IAM) policies to use Full Stack DR and Policies for Full Stack Disaster Recovery.
-
This tutorial assumes the reader has an OKE cluster deployed on the primary region and a peer cluster on the standby region. For more information, see Creating a Cluster.
-
This tutorial assumes the reader has the disconnected (mocked) MuShop application deployed on the primary OKE cluster. For more information, see Deploy MuShop.
-
The block volumes generated by the OKE cluster have already been added to the volume group (vg_oke_mushop). You must create the volume group with cross-region replication. For more information, see Create Volume Groups.
-
Create an OCI Object Storage bucket in both the primary region and the standby region to store OKE backups. For more information, see Object Storage.
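The bucket prerequisite can also be scripted. The following is a minimal sketch using the OCI CLI, assuming the CLI is installed and configured for your tenancy. The helper names (`run`, `create_backup_buckets`) and the bucket name are hypothetical, and the region identifiers correspond to the Frankfurt and Amsterdam regions used in this tutorial. Set `DRY_RUN=1` to print the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: helper names and the bucket name are illustrative assumptions.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create one backup bucket in each region (primary first, then standby).
# $1 = compartment OCID, $2 = bucket name
create_backup_buckets() {
  for region in eu-frankfurt-1 eu-amsterdam-1; do
    run oci os bucket create \
      --compartment-id "$1" \
      --name "$2" \
      --region "$region" || return 1
  done
}
```

For example, `create_backup_buckets '<compartment_ocid>' oke-backups-mushop` would attempt to create the same bucket name in both regions and stop at the first failure.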
Task 1: Create a Dynamic Group and Policies for OKE and OCI Full Stack DR
These policies let the OCI Full Stack DR service access the OCI Object Storage bucket to upload configuration backups. The policy required for OCI Object Storage bucket access from the OKE cluster depends on the cluster type.
-
Create a dynamic group and policies for managed node pool.
-
Create a dynamic group named <cluster1_dg> with the following matching rule.
All {instance.compartment.id = '<compartment_ocid>'}
-
Create the following policies.
Allow dynamic-group cluster1_dg to manage object-family in compartment <compartment>
Allow dynamic-group cluster1_dg to manage cluster-family in compartment <compartment>
-
-
Create the following policies for virtual node pool.
Allow any-user to manage objects in tenancy where all { request.principal.type = 'workload', request.principal.namespace = 'brie', request.principal.service_account = 'brie-reader', request.principal.cluster_id = '<Cluster_OCID>'}
Allow any-user to manage objects in tenancy where all { request.principal.type = 'workload', request.principal.namespace = 'brie', request.principal.service_account = 'brie-creator', request.principal.cluster_id = '<Cluster_OCID>'}
These policies allow pods running in the brie namespace with the service account brie-reader or brie-creator to read from and write to the OCI Object Storage bucket.
-
Create a dynamic group and policies for container instance. These policies let runtime container instances created by OCI Full Stack DR service access the OKE cluster and OCI Object Storage bucket.
-
Create a dynamic group named <bastion1_dg> with the following matching rule.
All {resource.type='computecontainerinstance'}
-
Create the following policies.
Allow dynamic-group bastion1_dg to manage object-family in compartment <compartment>
Allow dynamic-group bastion1_dg to manage cluster-family in compartment <compartment>
-
-
Create a dynamic group and policy for jump host.
If you are using a jump host, this policy lets OCI Full Stack DR access the OKE cluster and the OCI Object Storage buckets. If the jump host and the cluster are in the same compartment, you can skip the steps that create a new dynamic group and policy for OCI Object Storage bucket access.
-
Create a dynamic group named <bastion1_dg> with the following matching rule.
All {instance.compartment.id = '<compartment_ocid>'}
-
Create the following policy.
Allow dynamic-group bastion1_dg to manage cluster-family in compartment <compartment>
Allow dynamic-group bastion1_dg to manage cluster in compartment <compartment>
-
Note: If you do not include the identity_domain_name before the dynamic-group name, then the policy statement is evaluated as though the group belongs to the default identity domain. For more information, see How Policies Work.
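Because the matching rules and policy statements in this task differ only in a few values, they can be generated rather than typed by hand. The following is a minimal sketch; the function names are hypothetical, and the statement text is copied from the steps above. It only prints text for you to paste into the OCI Console and does not call any OCI service.

```shell
#!/bin/sh
# Sketch only: function names are illustrative; statement text mirrors Task 1.

# Matching rule that selects every instance in a compartment.
# $1 = compartment OCID
instance_rule() {
  printf "All {instance.compartment.id = '%s'}\n" "$1"
}

# Statements for a dynamic group that needs bucket and cluster access.
# $1 = dynamic group name, $2 = compartment name
dynamic_group_policies() {
  for family in object-family cluster-family; do
    printf 'Allow dynamic-group %s to manage %s in compartment %s\n' "$1" "$family" "$2"
  done
}

# Workload identity statements for virtual node pools, one per service account.
# $1 = cluster OCID
workload_policies() {
  for account in brie-reader brie-creator; do
    printf "Allow any-user to manage objects in tenancy where all { request.principal.type = 'workload', request.principal.namespace = 'brie', request.principal.service_account = '%s', request.principal.cluster_id = '%s'}\n" "$account" "$1"
  done
}
```

For example, `dynamic_group_policies cluster1_dg '<compartment>'` prints the two statements for the managed node pool case, and `dynamic_group_policies bastion1_dg '<compartment>'` prints the statements for the container instance case.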
Task 2: Add Primary OKE Cluster to the Primary DR Protection Groups
-
In the primary DRPG (DRPG_MUSHOP_FRA), select Members and click Add Member.
-
Select OKE Cluster as Resource type.
-
Enter the following required information.
- OKE Cluster: Enter an OKE cluster.
- Backup: Enter backup information.
- Backup bucket: Select bucket.
- Select Specify the backup schedule.
- Schedule type: Enter schedule type.
- Start time: Enter start time in UTC.
- Interval: Enter interval in days.
- The maximum number of backups you want to retain (optional): Enter maximum number of backups.
- Select Image replication:
- Image replication secret (optional): Select image.
- Namespace (optional): Enter namespace.
- Peer OKE cluster: Select peer OKE cluster.
-
Select I understand that I must refresh and verify all the existing plans and click Add.
Task 3: Add Volume Groups to the Primary DR Protection Groups
-
In the primary DRPG (DRPG_MUSHOP_FRA), select Members and click Add Member.
-
Select Volume Group as Resource type.
-
Enter the following required information.
- Volume Group: Select volume group.
-
Select I understand that I must refresh and verify all the existing plans and click Add.
Task 4: Add Standby OKE Cluster to the Standby DR Protection Groups
-
In the standby DRPG (DRPG_MUSHOP_AMS), select Members and click Add Member.
-
Select OKE Cluster as Resource type.
-
Enter the following required information.
- OKE Cluster: Enter an OKE cluster.
- Backup: Enter backup information.
- Backup bucket: Select bucket.
- Peer OKE cluster: Select peer OKE cluster.
-
Select I understand that I must refresh and verify all the existing plans and click Add.
Task 5: Create a Start Drill Plan
-
In the standby DRPG (DRPG_MUSHOP_AMS), select Plans and click Create Plan.
-
Enter a Name for the plan, select Start Drill as Plan Type and click Create.
After a few minutes, the plan will show Active state.
-
Select the plan created to see its content.
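The plan can also be created from the command line instead of the OCI Console. The following is a minimal sketch, assuming the OCI CLI is installed and that `oci disaster-recovery dr-plan create` accepts the `--display-name`, `--dr-protection-group-id`, and `--type` parameters as described in the OCI CLI reference; confirm them with `oci disaster-recovery dr-plan create --help` before relying on this. The helper names are hypothetical, and the region is the standby region used in this tutorial.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; verify CLI parameters with --help.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create a drill plan in the standby DR protection group.
# $1 = plan name, $2 = DRPG OCID, $3 = plan type (START_DRILL or STOP_DRILL)
create_drill_plan() {
  case "$3" in
    START_DRILL|STOP_DRILL) ;;
    *) echo "unsupported plan type: $3" >&2; return 1 ;;
  esac
  run oci disaster-recovery dr-plan create \
    --display-name "$1" \
    --dr-protection-group-id "$2" \
    --type "$3" \
    --region eu-amsterdam-1
}
```

For example, `DRY_RUN=1 create_drill_plan mushop-start-drill '<drpg_ocid>' START_DRILL` prints the command without contacting OCI, which is a convenient way to review it first.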
Task 6: Execute the Start Drill Plan
-
Select the plan created in Task 5.
-
Select Enable prechecks and click Execute plan.
After a few minutes, all groups will show Success state.
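The same execution can be started from the command line. The following is a minimal sketch, assuming the OCI CLI is installed and that `oci disaster-recovery dr-plan-execution create` takes `--plan-id` and a JSON `--execution-options` value with the `planExecutionType` and `arePrechecksEnabled` fields, as I understand the OCI CLI and API reference; treat these names as assumptions and verify them with `--help`. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative; verify CLI parameters with --help.

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Build the execution options JSON for a drill.
# $1 = START_DRILL or STOP_DRILL, $2 = true or false (enable prechecks)
drill_options() {
  printf '{"planExecutionType": "%s", "arePrechecksEnabled": %s}\n' "$1" "$2"
}

# Execute a drill plan with prechecks enabled, in the standby region.
# $1 = plan OCID, $2 = START_DRILL or STOP_DRILL
execute_drill_plan() {
  run oci disaster-recovery dr-plan-execution create \
    --plan-id "$1" \
    --execution-options "$(drill_options "$2" true)" \
    --region eu-amsterdam-1
}
```

Passing `true` for prechecks mirrors the Enable prechecks option selected in the OCI Console step above.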
Task 7: Check the Application Running on the Standby OKE Cluster
Connect to the standby OKE cluster and check that the application is running. For the MuShop application, run the following command.
kubectl get all -n mushop
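To turn this visual check into a pass or fail result, the pod list can be filtered by status. The following is a minimal sketch, assuming `kubectl` already points at the standby OKE cluster; the function name is hypothetical. It inspects only the STATUS column, so a pod that is Running but not yet Ready is not reported.

```shell
#!/bin/sh
# Sketch only: the helper name is illustrative.

# Print "<pod> <status>" for each pod whose STATUS is neither Running nor
# Completed. Returns 1 if any such pod exists, 2 if kubectl itself fails.
# $1 = namespace (for example, mushop)
report_unhealthy_pods() {
  pods=$(kubectl get pods -n "$1" --no-headers) || return 2
  printf '%s\n' "$pods" |
    awk 'NF && $3 != "Running" && $3 != "Completed" { print $1 " " $3; bad = 1 }
         END { exit bad }'
}
```

For example, `report_unhealthy_pods mushop && echo "MuShop pods look healthy"` prints the message only when no pod is in another state.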
Task 8: Create a Stop Drill Plan
-
In the standby DRPG (DRPG_MUSHOP_AMS), select Plans and click Create Plan.
-
Enter a Name for the plan, select Stop Drill as Plan Type and click Create.
After a few minutes, the plan will show Active state.
Task 9: Execute the Stop Drill Plan
-
Select the plan created in Task 8.
-
Select Enable prechecks and click Execute plan.
After a few minutes, all groups will show Success state.
Task 10: Check Clean-up on Standby OKE Cluster
Connect to the standby OKE cluster and check the list of namespaces using the following command.
kubectl get namespaces
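The clean-up check can be scripted as well. The following is a minimal sketch, assuming `kubectl` points at the standby OKE cluster and that the stop drill is expected to remove the namespace restored by the drill (here assumed to be `mushop`); the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: the helper name is illustrative.

# Succeeds (returns 0) when the namespace is absent from the current cluster.
# Returns 1 if it still exists, 2 if kubectl itself fails.
# $1 = namespace
namespace_removed() {
  names=$(kubectl get namespaces -o name) || return 2
  ! printf '%s\n' "$names" | grep -qxF "namespace/$1"
}
```

For example, `namespace_removed mushop && echo "clean-up complete"` prints the message only when the namespace is no longer listed.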
Next Steps
After you create and execute the drill plans, the next step is to create a switchover plan and a failover plan.
There are two best practices that should be incorporated into the normal day-to-day operations to help ensure the readiness of your DR plans.
- Regular periodic execution of prechecks.
- Regular periodic execution of DR drills.
Think about scheduling weekly prechecks of all DR plans in the standby DR Protection Group. Prechecks can be run at any time and have zero impact on production workloads. This will help ensure integrity of your DR plans, catching missing member resources, missing networks, the inability to find expected scripts called by user-defined steps, and so on.
Another very important way of validating the readiness of your disaster recovery is to schedule periodic DR drills once a month or quarter. DR drills also have zero impact on production workloads, but give you the ability to validate recovery of compute, storage, Oracle databases and backend sets for load balancers in the standby region with the click of a single button. For more information, see:
Related Links
Acknowledgments
- Author - Raphael Teixeira (Principal member of technical staff for Full Stack DR engineering)
More Learning Resources
Explore other labs on docs.oracle.com/learn or access more free learning content on the Oracle Learning YouTube channel. Additionally, visit education.oracle.com/learning-explorer to become an Oracle Learning Explorer.
For product documentation, visit Oracle Help Center.
G26105-01
February 2025
Copyright ©2025, Oracle and/or its affiliates.