22 Configuring Disaster Recovery

In the past, disaster recovery plans often involved syncing a file system between the main and backup systems. The most effective method for syncing is through disk replication. However, if this option is not available, the next best solution is to use the rsync utility to copy the data from one location to another.


Generic Disaster Recovery Processes

Some of the common disaster recovery processes include using rsync to copy data between sites, using Oracle Data Guard to create a standby database, backing up and restoring the cluster artifacts, synchronizing the persistent volumes, and so on.

The following sections describe the generic disaster recovery processes in detail.

Creating a Container with rsync

You can replicate persistent volume data in a number of ways. These can be divided into two categories:
  • Hardware Replication - Uses disk-based replication or cloud-based file system replication.
  • Software Replication - Uses a software tool to manually replicate a file system. One of the most widely available and efficient tools for software replication is rsync.

This section describes how to create a lightweight container image that includes the rsync command and can be run inside the Kubernetes cluster. The container can then be invoked by a Kubernetes CronJob.

You can run rsync as a cron job on the underlying worker nodes, but doing so creates an external dependency. Alternatively, you can create a pod that runs inside the cluster and performs the rsync operations.

This job ensures that the application disaster recovery processes run as part of the Kubernetes cluster. CronJobs require container images that include the rsync command. The two most common minimal Linux container images, Alpine and BusyBox, do not contain the rsync command by default, but they can be extended to include it.

The following steps describe how to create a container image based on Alpine that includes the rsync utility. These steps require a runtime environment with Podman or Docker and a logged-in DockerHub account to access the repository images.

  1. Download and start the Alpine container using the following command:
    podman run -dit alpine sh

    This command will start an Alpine container.

  2. Obtain the container ID using the command:
    podman ps

    The output of this command will be similar to:

    CONTAINER ID  IMAGE                            COMMAND     CREATED         STATUS             
    7cb3c9de95ea  docker.io/library/alpine:latest  sh          20 minutes ago  Up 20 minutes ago

    The container ID is 7cb3c9de95ea.

  3. Connect to the container using the container ID. For example:
    podman attach 7cb3c9de95ea
  4. Install rsync into the container using the commands:
    apk update
    apk add rsync

    The output will be similar to:

    / # apk update
    fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/main/x86_64/APKINDEX.tar.gz
    fetch https://dl-cdn.alpinelinux.org/alpine/v3.17/community/x86_64/APKINDEX.tar.gz
    v3.17.0-136-gc113cb02a1 [https://dl-cdn.alpinelinux.org/alpine/v3.17/main]
    v3.17.0-137-ge9e378b388 [https://dl-cdn.alpinelinux.org/alpine/v3.17/community]
    OK: 17803 distinct packages available
    
    / # apk add rsync
    (1/5) Installing libacl (2.3.1-r1)
    (2/5) Installing lz4-libs (1.9.4-r1)
    (3/5) Installing popt (1.19-r0)
    (4/5) Installing zstd-libs (1.5.2-r9)
    (5/5) Installing rsync (3.2.7-r0)
    Executing busybox-1.35.0-r29.trigger
    OK: 8 MiB in 20 packages

    Exit the container by pressing Ctrl+P+Q on the keyboard.

  5. Commit the changes to the image using the command:
    podman commit -m "Added rsync" 7cb3c9de95ea alpine-rsync -f docker

    The output will be similar to:

    Getting image source signatures
    Copying blob ded7a220bb05 skipped: already exists
    Copying blob 6459a5a97a89 done
    Copying config 6cf4ad33ac done
    Writing manifest to image destination
    Storing signatures
    6cf4ad33aca33fe62cb0fee9354b0b8f548ab9983427c25c24f77fcb6542e17c
  6. Check that alpine-rsync is now in your local image store by using the command:
    podman images

    The output will be similar to:

    REPOSITORY                                          TAG                            IMAGE ID      CREATED         SIZE
    localhost/alpine-rsync                              latest                         6cf4ad33aca3  14 seconds ago  11.1 MB
    docker.io/library/alpine                            latest                         49176f190c7e  2 weeks ago     7.34 MB
  7. Tag the image with your container registry name using the command:
    podman tag 6cf4ad33aca3  iad.ocir.io/mytenancy/idm/alpine-rsync:latest
  8. Verify that alpine-rsync is now tagged in your local image store by using the command:
    podman images

    The output will be similar to:

    iad.ocir.io/mytenancy/idm/alpine-rsync              latest                         6cf4ad33aca3  28 minutes ago  11.1 MB
    localhost/alpine-rsync                              latest                         6cf4ad33aca3  28 minutes ago  11.1 MB
    docker.io/library/alpine                            latest                         49176f190c7e  2 weeks ago     7.34 MB
  9. Push the image to your container registry using the commands:
    podman login iad.ocir.io/mytenancy -u paasmaa/oracleidentitycloudservice/myuser -p "mypassword"
    podman push iad.ocir.io/mytenancy/idm/alpine-rsync:latest
    The image is now available for use from your container registry.
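
The interactive steps above can also be captured non-interactively in a Containerfile, so that the image can be rebuilt reproducibly. The following is a sketch; the registry path is the example used in this chapter:

```dockerfile
# Containerfile: Alpine extended with the rsync utility
FROM docker.io/library/alpine:latest
RUN apk update && apk add --no-cache rsync
```

Build, tag, and push the image with commands such as podman build -t iad.ocir.io/mytenancy/idm/alpine-rsync:latest -f Containerfile . followed by podman push iad.ocir.io/mytenancy/idm/alpine-rsync:latest.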

Creating a Data Guard Database

While you can use rsync to copy file system data from one site to the other, either outside Kubernetes or, more efficiently, as a Kubernetes CronJob, the database content should be replicated using Oracle Data Guard. In the event of a disaster recovery, after you have deployed your application on the standby site, you should delete the existing database and create a standby database using Oracle Data Guard.

If you are using a non-Oracle Cloud based deployment (or you want to configure it manually), see Learn About Configuring a Standby Database for Disaster Recovery for instructions.

Note:

After creating the Data Guard database, ensure that it has the same initialization parameter values as that of the primary database.

If you are using Oracle Cloud Infrastructure, use the following steps:

  1. Log in to the Oracle Cloud Infrastructure Console for your tenancy.
  2. Ensure that your region is set to the same region where your primary database resides.
  3. Navigate to Oracle Database and click Oracle Base Database (VM, BM).
  4. Select the name of the database system that houses your database. Your database will be listed in the Databases section of the page.
  5. Select the name of your database.
  6. Select the Data Guard Associations submenu.
  7. Click Enable Data Guard.
  8. Enter the following information in the wizard:
    • Display Name : Enter a name for the Data Guard database.
    • Region: Select the region where the standby site is located.
    • Availability Domain: Choose the availability domain to locate your database.
    • Configure Shape: Edit the shape, if needed.
    • License Type: Choose the license type you want to use.
    • Virtual Cloud Network Name: Enter the name of your VCN on the standby site.
    • Client Subnet: Enter the name of the database subnet on the standby site. For example, db-subnet.
    • Enter a host name prefix for the database nodes. For example, regdb (where reg is the name of the region).
    • Enter the type of Data Guard you want to deploy. For example, Active Data Guard (recommended).
  9. Click Next.
  10. On the Database Information screen, enter the following information:
    • Database Image: Ensure that the database image is the same as that of the primary database.
    • Database Password: Enter a sys password for the standby database.
  11. Click Enable Data Guard.
  12. Verify that the Data Guard database has the same initialization parameters as that of the primary database.

    Note:

    Due to a bug in the Data Guard creation process, you may find that some parameters, such as processes, have overriding values for a specific instance. If you encounter this issue, use the following command to remove the overriding values:
    alter system reset processes sid='instance_name';

Note:

If, after running the network analyzer, you see successful communication paths between your regions but the Data Guard association fails with an error similar to the following:

A cross-region Data Guard association cannot be created between databases of DB systems that belong to VCNs that are not peers.

then raise a Service Request with Oracle Support requesting that CrossRegionDataguardNetworkValidation be disabled.

Backing Up and Restoring the Kubernetes Objects

Every Kubernetes cluster is distinct and requires individual maintenance. The artifacts within the cluster determine the application's operation. The artifacts include namespaces, persistent volumes, config maps, and more. These artifacts must be created on the standby cluster before you start the application.

There are two ways of creating the artifacts on the standby system:

  • Perform a separate installation using the same configuration information as that of the primary site but using a throwaway database. After you complete the installation, discard the database and domain configuration and replace it with the primary site's database and domain configuration.
  • Back up the Kubernetes objects in the primary site and restore them to the standby site, pointing the persistent volumes to the standby NFS servers. This approach is not recommended for Oracle Advanced Authentication.

Oracle provides a tool to simplify the process of backing up and restoring the Kubernetes objects. For information about the tool, see Using the Oracle K8s Disaster Protection Scripts.

To use these scripts:

  1. Download the scripts to a suitable location using the command:
    git clone -q https://github.com/oracle-samples/maa
    chmod 700 maa/kubernetes-maa/*.sh
  2. Create a directory to hold the backup. For example:
    mkdir -p /workdir/k8backup
  3. Create a backup of a namespace using the command:
    kubernetes-maa/maak8-get-all-artifacts.sh <Backup Directory> <Namespace>
    For example:
    kubernetes-maa/maak8-get-all-artifacts.sh /workdir/k8backup  oamns

    This process will take a few minutes to complete. It creates a dated directory in the backup location that contains a .gz file, which is the backup archive.

  4. Transfer the .gz file to the standby site using your preferred file transfer mechanism.
  5. Restore the backup site into the standby cluster using the command:
    kubernetes-maa/maak8-push-all-artifacts.sh <BACKUP_FILE> <WORKING DIRECTORY>
    For example:
    kubernetes-maa/maak8-push-all-artifacts.sh /workdir/k8backup/23-06-05-11-03-15/ /workdir/k8restore

Note:

  • If you do not precreate the persistent volumes on the standby site before restoring, the applications will not start. Alternatively, you can create the PVs after restoring, but this will cause pods to remain pending until the volumes are created. It is crucial to ensure that the persistent volumes on the standby site point to the NFS server in that site.
  • It is possible to request that the backup include the persistent volumes. However, this will also back up the mounts to the primary NFS server. Ensure that you do not allow the application to start on the standby site using the primary NFS server because doing so could corrupt your primary deployment.
  • If the application is running on the primary when the backup is being taken, the application will be started on the standby when the restore is being performed. To minimize errors, you should open the Data Guard database as a snapshot standby. However, some errors may still occur due to the running jobs from SOA and OIG, but these can be ignored because they are caused by the inactive status of the site.

At some point, you should switch your Data Guard database to the standby site and verify that the standby services are functional.
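
The switchover itself can be performed with the Data Guard broker command-line interface, DGMGRL. The following is a sketch assuming a broker configuration in which the standby database is named standby_db (a hypothetical name; substitute the database name from your own configuration):

```
dgmgrl sys/<password>@primary_db

DGMGRL> validate database 'standby_db';
DGMGRL> switchover to 'standby_db';
```

Run VALIDATE DATABASE first and confirm that it reports the standby as ready for switchover before issuing the SWITCHOVER command.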

Creating a Kubernetes CronJob to Synchronize the Persistent Volumes

For each solution described in this section, the configuration information stored on the persistent volumes must be synchronized between the primary and standby sites, with the ability to reverse the synchronization in case of a site switchover.

As mentioned earlier, the most efficient approach is to employ file system replication. See Backing Up and Restoring the Kubernetes Objects. Alternatively, a cron job can be created inside the Kubernetes cluster to carry out this task. The process is the same for each product; only the underlying volumes differ. It is advisable to create a separate cron job for each product rather than one job that handles all products.

The following steps describe how to achieve this. The example uses Oracle Access Manager (OAM) for simplicity, but it applies to any product.

Creating a Namespace for the Backup Jobs

Create a separate namespace outside of the product namespace to hold the backup jobs, either per product or for all the backup jobs.

Create the namespace using the command:

kubectl create namespace <namespace>
For example:
kubectl create namespace drns
Creating the Persistent Volumes

Each site requires a persistent volume that points to the NFS filer in the remote location.

Create the following persistent volumes on Site 1 using the provided yaml files:

site1_primary_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: primary-oam-pv
  labels:
    type: primary-oam-pv
spec:
  storageClassName: manual
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /export/IAMPVS/oampv
    server: <site1_nfs_server>

site1_standby_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: standby-oam-pv
  labels:
    type: standby-oam-pv
spec:
  storageClassName: manual
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /export/IAMPVS/oampv
    server: <site2_nfs_server>

Create the following persistent volumes on Site 2 using the provided yaml files:

site2_primary_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: primary-oam-pv
  labels:
    type: primary-oam-pv
spec:
  storageClassName: manual
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /export/IAMPVS/oampv
    server: <site2_nfs_server>

site2_standby_pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: standby-oam-pv
  labels:
    type: standby-oam-pv
spec:
  storageClassName: manual
  capacity:
    storage: 30Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /export/IAMPVS/oampv
    server: <site1_nfs_server>

Create the persistent volumes using the commands:

Site1
kubectl create -f site1_primary_pv.yaml
kubectl create -f site1_standby_pv.yaml
Site2
kubectl create -f site2_primary_pv.yaml
kubectl create -f site2_standby_pv.yaml

Creating the Persistent Volume Claims

After creating the persistent volumes, you can create the persistent volume claims in the DR namespace.

Create the following files (note that the files will be identical on each site):

primary_pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: primary-oampv-pvc
  namespace: <DRNS>
  labels:
    type: primary-oampv-pvc
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteMany
  resources:
     requests:
       storage: 30Gi
  selector:
    matchLabels:
      type: primary-oam-pv
standby_pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: standby-oampv-pvc
  namespace: <DRNS>
  labels:
    type: standby-oampv-pvc
spec:
  storageClassName: manual
  accessModes:
  - ReadWriteMany
  resources:
     requests:
       storage: 30Gi
  selector:
    matchLabels:
      type: standby-oam-pv
Create the persistent volume claims on both the sites using the commands:
kubectl create -f primary_pvc.yaml
kubectl create -f standby_pvc.yaml

Creating a DR ConfigMap

Each site has a ConfigMap object containing the configuration information about the site, including:

  • The role the site is performing. This role determines the direction in which the persistent volume is replicated; replication should always be from primary to standby.
  • The name of the domain, which determines the location of the DOMAIN_HOME on the persistent volume.
  • The primary database scan address and service name. These identify the entries in the replicated configuration that must be changed.
  • The standby database scan address and service name. These are the values used to point the replicated configuration to the database in the standby site.
Description of the ConfigMap object:
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-cm
  namespace: <DRNS>
data:
  ENV_TYPE: <ENV_TYPE>
  DR_TYPE: <DR_TYPE>
  OAM_DOMAIN_NAME: <OAM_DOMAIN_NAME>
  OAM_LOCAL_SCAN: <OAM_LOCAL_SCAN>
  OAM_REMOTE_SCAN: <OAM_REMOTE_SCAN>
  OAM_LOCAL_SERVICE: <OAM_LOCAL_SERVICE>
  OAM_REMOTE_SERVICE: <OAM_REMOTE_SERVICE>
  OIG_DOMAIN_NAME: <OIG_DOMAIN_NAME>
  OIG_LOCAL_SCAN: <OIG_LOCAL_SCAN>
  OIG_REMOTE_SCAN: <OIG_REMOTE_SCAN>
  OIG_LOCAL_SERVICE: <OIG_LOCAL_SERVICE>
  OIG_REMOTE_SERVICE: <OIG_REMOTE_SERVICE>
  OIRI_LOCAL_SCAN: <OIRI_LOCAL_SCAN>
  OIRI_REMOTE_SCAN: <OIRI_REMOTE_SCAN>
  OIRI_LOCAL_SERVICE: <OIRI_LOCAL_SERVICE>
  OIRI_REMOTE_SERVICE: <OIRI_REMOTE_SERVICE>
  OIRI_REMOTE_K8: <OIRI_REMOTE_K8>
  OIRI_REMOTE_K8CONFIG: <OIRI_REMOTE_K8CONFIG>
  OIRI_REMOTE_K8CA: <OIRI_REMOTE_K8CA>
  OIRI_LOCAL_K8: <OIRI_LOCAL_K8>
  OIRI_LOCAL_K8CONFIG: <OIRI_LOCAL_K8CONFIG>
  OIRI_LOCAL_K8CA: <OIRI_LOCAL_K8CA>
  OAA_LOCAL_SCAN: <OAA_LOCAL_SCAN>
  OAA_REMOTE_SCAN: <OAA_REMOTE_SCAN>
  OAA_LOCAL_SERVICE: <OAA_LOCAL_SERVICE>
  OAA_REMOTE_SERVICE: <OAA_REMOTE_SERVICE>

Here is a sample of the ConfigMap object:

site1_dr_cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-cm
  namespace: <DRNS>
data:
  ENV_TYPE: OTHER
  DR_TYPE: PRIMARY
  OAM_LOCAL_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OAM_REMOTE_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OAM_LOCAL_SERVICE: oamsvc.dbsubnet.oke.oraclevcn.com
  OAM_REMOTE_SERVICE: oamsvc.dbsubnet.oke.oraclevcn.com
  OIG_DOMAIN_NAME: governancedomain
  OIG_LOCAL_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OIG_REMOTE_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OIG_LOCAL_SERVICE: oigsvc.dbsubnet.oke.oraclevcn.com
  OIG_REMOTE_SERVICE: oigsvc.dbsubnet.oke.oraclevcn.com
  OIRI_LOCAL_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OIRI_LOCAL_SERVICE: oirisvc.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_SERVICE: oirisvc.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_K8: 10.1.0.10:6443
  OIRI_REMOTE_K8CONFIG: standby_k8config
  OIRI_REMOTE_K8CA: standby_ca.crt
  OIRI_LOCAL_K8: 10.0.0.5:6443
  OIRI_LOCAL_K8CONFIG: primary_k8config
  OIRI_LOCAL_K8CA: primary_ca.crt
  OAA_LOCAL_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OAA_REMOTE_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OAA_LOCAL_SERVICE: oaasvc.dbsubnet.oke.oraclevcn.com
  OAA_REMOTE_SERVICE: oaasvc.dbsubnet.oke.oraclevcn.com

site2_dr_cm.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: dr-cm
  namespace: <DRNS>
data:
  ENV_TYPE: OTHER
  DR_TYPE: STANDBY
  OAM_LOCAL_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OAM_REMOTE_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OAM_LOCAL_SERVICE: oamsvc.dbsubnet.oke.oraclevcn.com
  OAM_REMOTE_SERVICE: oamsvc.dbsubnet.oke.oraclevcn.com
  OIG_DOMAIN_NAME: governancedomain
  OIG_LOCAL_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OIG_REMOTE_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OIG_LOCAL_SERVICE: oigsvc.dbsubnet.oke.oraclevcn.com
  OIG_REMOTE_SERVICE: oigsvc.dbsubnet.oke.oraclevcn.com
  OIRI_LOCAL_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OIRI_LOCAL_SERVICE: oirisvc.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_SERVICE: oirisvc.dbsubnet.oke.oraclevcn.com
  OIRI_REMOTE_K8: 10.0.0.10:6443
  OIRI_REMOTE_K8CONFIG: standby_k8config
  OIRI_REMOTE_K8CA: standby_ca.crt
  OIRI_LOCAL_K8: 10.1.0.5:6443
  OIRI_LOCAL_K8CONFIG: primary_k8config
  OIRI_LOCAL_K8CA: primary_ca.crt
  OAA_LOCAL_SCAN: site2-scan.dbsubnet.oke.oraclevcn.com
  OAA_REMOTE_SCAN: site1-scan.dbsubnet.oke.oraclevcn.com
  OAA_LOCAL_SERVICE: oaasvc.dbsubnet.oke.oraclevcn.com
  OAA_REMOTE_SERVICE: oaasvc.dbsubnet.oke.oraclevcn.com
Create the ConfigMap for Site 1 using the following command:
kubectl create -f site1_dr_cm.yaml
Create the ConfigMap for Site 2 using the following command:
kubectl create -f site2_dr_cm.yaml

Creating a Backup/Restore Job

The steps below describe how to create a job which will run periodically to back up the primary persistent volume using the rsync command and restore it to the standby system.

Creating a Backup/Restore Script

You should create a script to back up the persistent volume and copy the backup to the standby site. The deployment automation scripts contain a sample script to perform this task. The script is called <product>_dr.sh and is located in the SCRIPT_DIR/templates/<product> directory.

For example, for Oracle Access Manager, the disaster recovery script is located at SCRIPT_DIR/templates/oam/oam_dr.sh. Copy this script to the persistent volume using one of the following methods:
  • By copying the script to a running container in the deployment.
  • By creating a simple Alpine container to access the persistent volume.
  • By temporarily mounting the NFS to a host where the scripts are being developed.

For the purpose of this document, the location where the script needs to be copied is referred to as 'dr_scripts'.

The script runs on both sites. If the site is in primary mode, it creates a backup of the persistent volume and sends it to Site 2. If the site is in standby mode, it restores the backup received from Site 1 onto Site 2.

It is highly recommended that the script make a local copy of the persistent volume before performing any backup or restore operations, by using NFS utilities such as snapshots or by using the rsync command, as shown in the following steps:

  1. To keep backups small, exclude files which are not needed.
    EXCLUDE_LIST="--exclude=\".snapshot\" --exclude=\"backups\" --exclude=\"domains/*/servers/*/tmp\" --exclude=\"logs/*\" --exclude=\"dr_scripts\" --exclude=\"network\" --exclude \"domains/*/servers/*/data/nodemanager/*.lck\" --exclude \"domains/*/servers/*/data/nodemanager/*.pid\" --exclude \"domains/*/servers/*/data/nodemanager/*.state\" --exclude \"domains/ConnectorDefaultDirectory\" --exclude=\"backup_running\""
  2. Set up the locations inside the container where the primary PV and standby PV are mounted, and create a local backup of the primary PV:
    PRIMARY_BASE=/u01/primary_oampv
    DR_BASE=/u01/dr_oampv
    rsync -avz $EXCLUDE_LIST $PRIMARY_BASE/ $PRIMARY_BASE/backups/backup_date
  3. Transfer the backup to the standby site using the rsync command:
    rsync -avz $EXCLUDE_LIST $PRIMARY_BASE/backups/backup_date/ $DR_BASE/backups/backup_date
  4. On the standby site, restore the backup using the rsync command:
    rsync -avz $EXCLUDE_LIST $PRIMARY_BASE/backups/backup_date/ $PRIMARY_BASE
  5. Search through the domain and change any occurrences of the primary database scan address and service to that used in the standby site.
    1. Determine all the files that need to be changed using the command:
      cd $PRIMARY_BASE/domains/$OAM_DOMAIN_NAME/config
      grep -rl "$REMOTE_SCAN" . | grep -v backup
    2. Update the relevant files with the correct information from the DR ConfigMap object.
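
Putting these steps together, a minimal sketch of such a backup/restore script is shown below. This is illustrative only; the real <product>_dr.sh shipped with the deployment automation scripts is more complete, and the variable names and backup naming scheme here are assumptions:

```shell
#!/bin/sh
# Sketch of a DR sync script. DR_TYPE is injected from the dr-cm ConfigMap.
# Paths, EXCLUDE_LIST, and the backup naming scheme are illustrative.
PRIMARY_BASE=/u01/primary_oampv
DR_BASE=/u01/dr_oampv
BACKUP_NAME="backup_$(date +%y-%m-%d-%H-%M-%S)"

backup_primary() {
  # Take a local copy first, then ship that copy to the standby mount.
  rsync -avz $EXCLUDE_LIST "$PRIMARY_BASE/" "$PRIMARY_BASE/backups/$BACKUP_NAME"
  rsync -avz $EXCLUDE_LIST "$PRIMARY_BASE/backups/$BACKUP_NAME/" "$DR_BASE/backups/$BACKUP_NAME"
}

restore_standby() {
  # Restore the most recent backup received from the primary site.
  latest=$(ls -1 "$PRIMARY_BASE/backups" | tail -1)
  rsync -avz $EXCLUDE_LIST "$PRIMARY_BASE/backups/$latest/" "$PRIMARY_BASE"
}

main() {
  case "${DR_TYPE:-}" in
    PRIMARY) backup_primary ;;
    STANDBY) restore_standby ;;
    *) echo "DR_TYPE must be PRIMARY or STANDBY" >&2; return 1 ;;
  esac
}
# main   # invoked by the CronJob container
```

On the primary site the script only backs up; on the standby site it only restores, so the same image and script can run on both sides, driven by the DR_TYPE value in each site's ConfigMap.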
Creating a CronJob to Run the Backup/Restore Script Periodically

To automate the backup process, create a CronJob that runs the backup script periodically. A separate one-off job, based on this CronJob, is then created to initialize the backup process. To avoid any unexpected runs, suspend the CronJob immediately after creating it.

The frequency of the job's schedule should be based on how often configuration changes are made. If changes are less frequent, the job can be run less frequently. However, the more often the job runs, the more resources it will require, which can potentially impact system performance. Additionally, the interval between jobs should provide enough time for the job to complete before the next iteration starts. A suitable frequency might be once per day with additional manual synchronizations as needed.

The initial run of the job takes longer than subsequent runs. Oracle recommends that you suspend the CronJob as soon as you create it, and then manually run a job to initialize the DR site. After you initialize the DR site, resume the CronJob.

To create a CronJob in the drns namespace, create a file called oamdr-cron.yaml with the following contents:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: <product>rsyncdr
  namespace: <DRNS>
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
          - name: regcred
          containers:
          - name: alpine-rsync
            image: iad.ocir.io/mytenancy/idm/alpine-rsync:latest
            imagePullPolicy: IfNotPresent
            envFrom:
              - configMapRef:
                  name: dr-cm
            volumeMounts:
              - mountPath: "/u01/primary_oampv"
                name: oampv
              - mountPath: "/u01/dr_oampv"
                name: oampv-dr
            command:
            - /bin/sh
            - -c
            - /u01/primary_oampv/dr_scripts/oam_dr.sh
          volumes:
          - name: oampv
            persistentVolumeClaim:
              claimName: primary-oampv-pvc
          - name: oampv-dr
            persistentVolumeClaim:
              claimName: standby-oampv-pvc
          restartPolicy: OnFailure
For example:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: oamrsyncdr
  namespace: drns
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          imagePullSecrets:
          - name: regcred
          containers:
          - name: alpine-rsync
            image: iad.ocir.io/paasmaa/idm/alpine-rsync:latest
            imagePullPolicy: IfNotPresent
            envFrom:
              - configMapRef:
                  name: dr-cm
            volumeMounts:
              - mountPath: "/u01/primary_oampv"
                name: oampv
              - mountPath: "/u01/dr_oampv"
                name: oampv-dr
            command:
            - /bin/sh
            - -c
            - /u01/primary_oampv/dr_scripts/oam_dr.sh
          volumes:
          - name: oampv
            persistentVolumeClaim:
              claimName: primary-oampv-pvc
          - name: oampv-dr
            persistentVolumeClaim:
              claimName: standby-oampv-pvc
          restartPolicy: OnFailure

This job runs once a day. For a full description of CronJob scheduling in Kubernetes, see the FreeBSD File Formats Manual.

This job uses the custom Alpine container image that was created earlier. See Creating a Container with rsync. To allow for the initialization of the DR site, which takes longer than regular refreshes, suspend the job immediately after creating it. See Suspending/Resuming the CronJob. Creating a separate job to perform the initialization ensures that only one sync job runs at a time while the longer initial copy completes.

Suspending/Resuming the CronJob
To suspend a backup CronJob, run the following command:
kubectl patch cronjobs <product>rsyncdr -p '{"spec" : {"suspend" : true }}' -n <NAMESPACE>
For example:
kubectl patch cronjobs oamrsyncdr -p '{"spec" : {"suspend" : true }}' -n drns

To restart the backup CronJob, run the following command:

kubectl patch cronjobs <product>rsyncdr -p '{"spec" : {"suspend" : false }}' -n <NAMESPACE>
For example:
kubectl patch cronjobs oamrsyncdr -p '{"spec" : {"suspend" : false }}' -n drns
Creating an Initialization Job
Using the CronJob defined earlier (see Creating a CronJob to Run the Backup/Restore Script Periodically), create a one-off job to perform the initial data transfer by using the command:
kubectl create job --from=cronjob.batch/<product>rsyncdr <product>-initialise-dr -n <DRNS>

For example:

kubectl create job --from=cronjob.batch/oamrsyncdr oam-initialise-dr -n drns

Monitor the job until it completes successfully. After the job completes, you can initiate a restore on the second site by using the same commands.
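
Monitoring can be scripted. The following is a small polling helper (a sketch with hypothetical names; it assumes kubectl access to the cluster) that waits until the initialization job reports a Complete condition:

```shell
#!/bin/sh
# Poll a Kubernetes job until its Complete condition becomes True (sketch).
wait_for_job() {
  job=$1; ns=$2
  i=0
  while [ "$i" -lt 60 ]; do
    status=$(kubectl get job "$job" -n "$ns" \
      -o jsonpath='{.status.conditions[?(@.type=="Complete")].status}' 2>/dev/null)
    [ "$status" = "True" ] && return 0
    i=$((i + 1))
    sleep 30
  done
  return 1
}

# Example: wait_for_job oam-initialise-dr drns
```

You can also simply run kubectl get jobs -n drns to check the job status and kubectl logs on the job's pod to inspect its output.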

Configuring Disaster Recovery for Oracle Unified Directory

The disaster recovery for Oracle Unified Directory (OUD) is an active/passive solution. Currently, you cannot use the dsreplication command to set up a remote replica.

The process of setting up the DR site is summarized below:

  • Perform a minimal installation on Site 2 using the standard deployment procedures.
  • Stop the OUD processes on the DR site.
  • Delete the OUD persistent volume data on the DR site.
  • Do one of the following:
    • Enable disk replication for the persistent volume.
    • Enable manual replication using a CronJob which mounts both the local and the remote OUD persistent volume. This job will use the rsync command to synchronize the data. Suspend the job after the replication is complete.
  • Perform an initial data sync.
  • Verify that the DR site comes up successfully.
  • Shut down the DR site.
  • Restart the CronJob to copy the data periodically.

The following sections describe the procedure in detail:

Prerequisites

Ensure that you meet the following prerequisites:
  • You are able to mount the remote file systems in the local pods. This task may require firewall or security list permissions.
  • You are able to perform disk replication between the sites. This task may require firewall or security list permissions.
  • You have access to Alpine or Busybox with rsync installed, if you plan to use manual replication. See Creating a Backup/Restore Job.

Creating an Empty OUD Deployment on Site 2

Create an OUD deployment on Site 2 by following the instructions in Installing and Configuring Oracle Unified Directory with the following exceptions:

  • Use the same schema extension file as Site 1.
  • There is no need for a seeding file.
  • Use a minimal server overrides file. For example:
    image:
      repository: <OUD_REPOSITORY>
      tag: <OUD_VER>
      pullPolicy: IfNotPresent
    
    imagePullSecrets:
      - name: regcred
    
    oudConfig:
      baseDN: <LDAP_SEARCHBASE>
      rootUserDN: <LDAP_ADMIN_USER>
      rootUserPassword: <LDAP_ADMIN_PWD>
      sleepBeforeConfig: 300
    
    persistence:
      type: networkstorage
      networkstorage:
        nfs:
          server: <PVSERVER>
          path: <OUD_SHARE>
    
    configVolume:
      enabled: true
      type: networkstorage
      networkstorage:
        nfs:
          server: <PVSERVER>
          path: <OUD_CONFIG_SHARE>
      mountPath: /u01/oracle/config-input
    
    replicaCount: <OUD_REPLICAS>
    
    ingress:
      enabled: false
      type: nginx
      tlsEnabled: false
    
    elk:
      enabled: false
      imagePullSecrets:
        - name: dockercred
    
    cronJob:
      kubectlImage:
        repository: bitnami/kubectl
        tag: <KUBERNETES_VER>
        pullPolicy: IfNotPresent
    
        imagePullSecrets:
        - name: dockercred
    
    baseOUD:
      envVars:
        - name: schemaConfigFile_1
          value: /u01/oracle/config-input/99-user.ldif
        - name: restartAfterSchemaConfig
          value: "true"
    
    replOUD:
      envVars:
        - name: dsconfig_1
          value: set-global-configuration-prop --set lookthrough-limit:75000

Enabling a Manual Replication

Complete the following procedures if you want to create a CronJob to manually replicate data between the sites.

Creating a Persistent Volume for the Remote Persistent Volume

To create a persistent volume for the remote OUD persistent volume:

  1. Create the ouddr-pv.yaml file with the following contents:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: oudpv-dr
      labels:
        type: edgdr-pv
    spec:
      storageClassName: manual
      capacity:
        storage: 30Gi
      accessModes:
        - ReadWriteMany
      nfs:
        path: <OUD_SHARE>
        server: <PVSERVER>
    For example:
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: oudpv-dr
      labels:
        type: edgdr-pv
    spec:
      storageClassName: manual
      capacity:
        storage: 30Gi
      accessModes:
        - ReadWriteMany
      nfs:
        path: /exports/IAMPVS/oudpv
        server: <REMOTE_PVSERVER>
  2. Create the persistent volume using the command:
    kubectl create -f ouddr-pv.yaml
  3. Verify that the persistent volume has been created using the command:
    kubectl get pv
    The output appears similar to the following:
    NAME                      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                            STORAGECLASS   REASON   AGE
    edg-oud-ds-rs-pv          30Gi       RWX            Delete           Bound    oudns/edg-oud-ds-rs-pvc          manual                  21d
    edg-oud-ds-rs-pv-config   10Gi       RWX            Retain           Bound    oudns/edg-oud-ds-rs-pvc-config   manual                  21d
    oudpv-dr                  30Gi       RWX            Retain           Bound    oudns/ouddr-pvc                  manual                  54m

    Note:

    You will see a persistent volume for the current OUD deployment and the new one created for the remote OUD deployment.

You should create this persistent volume on both Site 1 and Site 2. Ensure that on Site 1, REMOTE_PVSERVER points to the PVSERVER of Site 2, and on Site 2, REMOTE_PVSERVER points to the PVSERVER of Site 1. Creating the PV on both the sites ensures that the configuration is immediately available if the roles of Site 1 and Site 2 need to be reversed.

Creating a Persistent Volume Claim for the Remote Persistent Volume

To create a persistent volume claim for the remote OUD persistent volume:

  1. Create the ouddr-pvc.yaml with the following contents:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ouddr-pvc
      namespace: <OUDNS>
      labels:
        type: ouddr-pvc
    spec:
      storageClassName: manual
      accessModes:
      - ReadWriteMany
      resources:
         requests:
           storage: 30Gi
      selector:
        matchLabels:
          type: edgdr-pv
    For example:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ouddr-pvc
      namespace: oudns
      labels:
        type: ouddr-pvc
    spec:
      storageClassName: manual
      accessModes:
      - ReadWriteMany
      resources:
         requests:
           storage: 30Gi
      selector:
        matchLabels:
          type: edgdr-pv
  2. Create the persistent volume claim using the command:
    kubectl create -f ouddr-pvc.yaml
  3. Verify that the persistent volume claim has been created using the command:
    kubectl get pvc -n <OUDNS>
    The output will appear similar to the following:
    NAME                       STATUS   VOLUME                    CAPACITY   ACCESS MODES   STORAGECLASS   AGE
    edg-oud-ds-rs-pvc          Bound    edg-oud-ds-rs-pv          30Gi       RWX            manual         21d
    edg-oud-ds-rs-pvc-config   Bound    edg-oud-ds-rs-pv-config   10Gi       RWX            manual         21d
    ouddr-pvc                  Bound    oudpv-dr                  30Gi       RWX            manual         63m

    Note:

    You will see a persistent volume claim for the current OUD deployment and the new one created for the remote OUD deployment.

You should create the same persistent volume claim on both Site 1 and Site 2. Creating it on both the sites ensures that the configuration is available immediately if the roles of Site 1 and Site 2 need to be reversed.

Creating a Backup and Restore Script

You must create a script that backs up the OUD instances and copies the backup to your standby site. The script must then be able to restore the backup on the standby site. An example of such a script, oud_dr.sh, is provided with the deployment automation scripts in the SCRIPT_DIR/templates/oud directory. See Creating a Backup/Restore Script.

This script must have the following characteristics:

  • It should be able to determine whether it is running on the primary or the standby site.
  • It should not run if a previous backup/restore job is in progress.
  • When running on the primary site:
    • Create a localized backup of the persistent volume (to a temporary location) by using snapshots (if available) or rsync.
    • Copy the created backup to a temporary location on the standby site using rsync.
  • When running on the standby site, restore the OUD instances in the received backup to the persistent volume.

    Note:

    Restore only the instances, not the backup scripts.
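
The control flow described above can be sketched as the following shell skeleton. This is not the shipped oud_dr.sh; the role file, lock-file name, and directory layout are assumptions for illustration only:

```shell
#!/bin/sh
# Hypothetical skeleton of a backup/restore control script.
DR_ROOT="${DR_ROOT:-$(mktemp -d)}"   # in real use: the persistent volume mount
LOCK_FILE="$DR_ROOT/backup_running"

is_primary() {
  # The site role could be read from a file seeded by a ConfigMap (assumed layout).
  [ "$(cat "$DR_ROOT/dr_role" 2>/dev/null)" = "primary" ]
}

acquire_lock() {
  # Refuse to start if a previous backup/restore job is still in progress.
  if [ -f "$LOCK_FILE" ]; then
    echo "previous backup/restore job still running, exiting" >&2
    return 1
  fi
  touch "$LOCK_FILE"
}

release_lock() {
  rm -f "$LOCK_FILE"
}

# Demonstration of the guard logic:
is_primary || echo "running as standby"   # no role file exists in this demo
acquire_lock && echo "lock acquired"
acquire_lock || echo "lock busy"          # second attempt is refused
release_lock
```

On the primary site, the body between acquire_lock and release_lock would run the rsync backup and copy steps; on the standby site, it would run the restore.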

Copying the Backup Script to the Persistent Volume

The easiest way to ensure that the container image is able to run the backup script is to copy the script to the persistent volume. You should perform this step on both the sites using the following commands:

kubectl exec -n <OUDNS> -ti <OUD_POD_PREFIX>-oud-ds-rs-0 -- mkdir /u01/oracle/user_projects/dr_scripts
kubectl cp <WORKDIR>/oud_dr.sh <OUDNS>/<OUD_POD_PREFIX>-oud-ds-rs-0:/u01/oracle/user_projects/dr_scripts
kubectl exec -n <OUDNS> -ti <OUD_POD_PREFIX>-oud-ds-rs-0 -- chmod 750 /u01/oracle/user_projects/dr_scripts/oud_dr.sh

For example:

kubectl exec -n oudns -ti edg-oud-ds-rs-0 -- mkdir /u01/oracle/user_projects/dr_scripts
kubectl cp /workdir/OUD/oud_dr.sh oudns/edg-oud-ds-rs-0:/u01/oracle/user_projects/dr_scripts
kubectl exec -n oudns -ti edg-oud-ds-rs-0 -- chmod 750 /u01/oracle/user_projects/dr_scripts/oud_dr.sh

Creating a ConfigMap to Determine Site Characteristics

Creating a CronJob

Create a CronJob to synchronize the local and remote persistent volumes. For instructions, see Creating a CronJob to Run the Backup/Restore Script Periodically. After creating, suspend the CronJob immediately. For instructions, see Suspending/Resuming the CronJob.
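
For orientation, a CronJob that invokes the backup script on a schedule might look like the following sketch. The image name, schedule, namespace, claim names, and script path are all assumptions to adapt to your deployment:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rsyncdr
  namespace: drns
spec:
  schedule: "*/30 * * * *"
  concurrencyPolicy: Forbid        # never start a run while one is active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: rsync
              image: alpine-rsync:latest   # image with rsync, built earlier in this chapter
              command: ["/bin/sh", "-c", "/u01/oracle/user_projects/dr_scripts/oud_dr.sh"]
              volumeMounts:
                - name: oudpv
                  mountPath: /u01/oracle/user_projects
                - name: ouddr
                  mountPath: /u01/remote
          volumes:
            - name: oudpv
              persistentVolumeClaim:
                claimName: edg-oud-ds-rs-pvc
            - name: ouddr
              persistentVolumeClaim:
                claimName: ouddr-pvc
```

The concurrencyPolicy setting complements the script's own lock file by preventing Kubernetes from scheduling overlapping runs.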

Shutting Down the OUD Installation on Site 2

Shut down the OUD installation on Site 2 if everything is functioning as expected. Use the following command to shut down:

helm upgrade -n oudns --set replicaCount=0 edg <WORKDIR>/samples/kubernetes/helm/oud-ds-rs --reuse-values

Deleting the Persistent Volume Data on Site 2

To ensure that the data from Site 1 is fully copied across to Site 2, it is important to remove the data that was created as part of the installation. Assuming that the persistent volume is mounted locally, you can remove the data using the following command:

rm -rf /nfs_volumes/oudpv/*oud-ds-rs*

Creating an Initialization Job

Use the CronJob you defined earlier (see Creating a CronJob) to create a one-off job that performs the initial data transfer using the command:

kubectl create job --from=cronjob.batch/rsyncdr initialise-dr -n <DRNS>
For example:
kubectl create job --from=cronjob.batch/rsyncdr initialise-dr -n drns

Note:

Perform this step on Site 1.

Verifying OUD on the DR Site

After the successful initial data load, you can start the OUD servers in Site 2 using the command:

helm upgrade -n <OUDNS> --set replicaCount=<REPLICA_COUNT> <OUD_POD_PREFIX> <WORKDIR>/samples/kubernetes/helm/oud-ds-rs --reuse-values
For example:
helm upgrade -n oudns --set replicaCount=1 edg /workdir/OUD/samples/kubernetes/helm/oud-ds-rs --reuse-values

After starting the servers, verify the availability of data from the primary site by checking the replication status and running the ldapsearch command-line tool.

If everything is functioning as expected, shut down OUD by using the following command:

helm upgrade -n oudns --set replicaCount=0 edg <WORKDIR>/samples/kubernetes/helm/oud-ds-rs --reuse-values

Starting the Automatic Syncing Process

Start the CronJob on Site 1 to periodically synchronize the changes from Site 1 Persistent Volume to Site 2. See Suspending/Resuming the CronJob.

Configuring Disaster Recovery for Oracle Access Manager

The disaster recovery for Oracle Access Manager (OAM) is an active/passive solution.

The following sections describe the procedure in detail:

Prerequisites

Before enabling disaster recovery for Oracle Access Manager, ensure the following:

  • The primary and standby sites for Data Guard communicate with each other.
  • The primary and standby sites for file system replication communicate with each other.
  • The Kubernetes clusters in both the primary and standby sites are able to resolve the names of the NFS servers in all sites.
  • Each Kubernetes cluster is independent.
  • A running Kubernetes cluster is present in both the sites.

Creating the Disaster Recovery Site From the Primary Site

On the primary site:
  1. Create a backup of the persistent volume holding OAM data (oampv). See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes. The file exclusion list for OAM is defined as follows:
    EXCLUDE_LIST="--exclude=\".snapshot\" --exclude=\"backups\" --exclude=\"domains/*/servers/*/tmp\" --exclude=\"logs/*\" --exclude=\"dr_scripts\" --exclude=\"network\" --exclude \"domains/*/servers/*/data/nodemanager/*.lck\" --exclude \"domains/*/servers/*/data/nodemanager/*.pid\" --exclude \"domains/*/servers/*/data/nodemanager/*.state\" --exclude \"domains/ConnectorDefaultDirectory\" --exclude=\"backup_running\""
  2. Back up the Kubernetes objects for restoration on the standby sites. See Backing Up and Restoring the Kubernetes Objects.
  3. Set up Oracle Data Guard between the primary and standby sites. See Creating a Data Guard Database.
  4. Ensure that the database services used on the primary site are replicated on the standby. These services should be active when the database is running in the roles of PRIMARY or SNAPSHOT STANDBY. See Creating Database Services.

Creating the Disaster Recovery Site on the Standby Site

There are two options for creating the disaster recovery site on the standby site:

By Creating a New Disposable Environment on Site 2

  1. Create a disposable deployment in Site 2 using the same values as the primary site. If you have used the EDG automation scripts, you just need to change the values of the worker nodes and the PV server in the response file.
  2. Shut down the Kubernetes pods that are running on Site 2.
  3. Delete the contents of the persistent volume.
  4. Restore the backup of the persistent volume from the backup taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  5. Amend the database connection strings in the restored backup to point to the database on Site 2.
  6. Copy the WebGate objects from the restored backup to the Oracle HTTP Servers on Site 2.
  7. Validate that the disaster recovery site is working.
  8. Delete the throwaway database used to perform the initial installation.

By Performing a Kubernetes Restore on Site 2

  1. Create the persistent volumes for the OAM domain such that they point at the NFS server in the standby site. See Creating the Kubernetes Persistent Volume.
  2. Open the standby database as a snapshot standby by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to snapshot standby
  3. Restore a backup of the persistent volume taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  4. Restore the backup of the Kubernetes objects from the backup taken on Site 1. See Backing Up and Restoring the Kubernetes Objects.
  5. Amend the database connection strings in the restored backup to point to the database on Site 2.
  6. Copy the WebGate objects from the restored backup to the Oracle HTTP Servers on Site 2.
  7. Validate that the disaster recovery site is working.
  8. Reinstate the standby database by issuing the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to physical standby

    Switch over the database and validate that the deployment is fully working.

Configuring Disaster Recovery for Oracle Identity Governance

The disaster recovery for Oracle Identity Governance (OIG) is an active/passive solution.

The following sections describe the procedure in detail:

Prerequisites

Before enabling disaster recovery for Oracle Identity Governance, ensure the following:

  • The primary and standby sites for file system replication communicate with each other.
  • The database systems in the primary and standby sites communicate with each other for database replication.
  • The Kubernetes clusters in both the primary and standby sites are able to resolve the names of the NFS servers in all sites.
  • Each Kubernetes cluster is independent.
  • A running Kubernetes cluster is present in both the sites.

Creating the Disaster Recovery Site From the Primary Site

On the primary site:
  1. Create a backup of the persistent volume holding OIG data (oigpv). See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes. The file exclusion list for OIG is defined as follows:
    EXCLUDE_LIST="--exclude=\".snapshot\" --exclude=\"backups\" --exclude=\"domains/*/servers/*/tmp\" --exclude=\"logs/*\" --exclude=\"dr_scripts\" --exclude=\"network\" --exclude \"domains/*/servers/*/data/nodemanager/*.lck\" --exclude \"domains/*/servers/*/data/nodemanager/*.pid\" --exclude \"domains/*/servers/*/data/nodemanager/*.state\" --exclude \"domains/ConnectorDefaultDirectory\" --exclude=\"backup_running\""
  2. Back up the Kubernetes objects for restoration on the standby sites. See Backing Up and Restoring the Kubernetes Objects.
  3. Set up Oracle Data Guard between the primary and standby sites.

Creating the Disaster Recovery Site on the Standby Site

There are two options for creating the disaster recovery site on the standby site:

By Creating a New Disposable Environment on Site 2

  1. Create a disposable deployment in Site 2 using the same values as the primary site. If you have used the EDG automation scripts, you just need to change the values of the worker nodes and the PV server in the response file.
  2. Shut down the Kubernetes pods that are running on Site 2.
  3. Delete the contents of the persistent volume.
  4. Restore the backup of the persistent volume from the backup taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  5. Amend the database connection strings in the restored backup to point to the database on Site 2.
  6. Validate that the disaster recovery site is working.
  7. Delete the throwaway database used to perform the initial installation.

By Performing a Kubernetes Restore on Site 2

  1. Create the persistent volumes for the OIG domain such that they point at the NFS server in the standby site. See Creating the Kubernetes Persistent Volume.
  2. Open the standby database as a snapshot standby by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to snapshot standby
  3. Restore a backup of the persistent volume taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  4. Restore the backup of the Kubernetes objects from the backup taken on Site 1. See Backing Up and Restoring the Kubernetes Objects.
  5. Amend the database connection strings in the restored backup to point to the database in Site 2.
  6. Validate that the disaster recovery site is working.
  7. Reinstate the standby database by issuing the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to physical standby

    Switch over the database and validate that the deployment is fully working.

Disabling the Job Scheduler for Configuring Oracle Identity Manager

You must disable the Oracle Identity Governance job scheduler on Site 2 as part of configuring Oracle Identity Governance for disaster recovery.

The OIG job scheduler uses the database intensively. To keep inter-site traffic to a minimum, the job scheduler should be disabled on the site where the database is not primary. This step is not required if you plan to shut down the OIG Managed Servers on Site 2. You can disable the job scheduler by adding the following parameter to the server startup arguments:

-Dscheduler.disabled=true

After switching over the database, this parameter should be removed from these servers and added into the server definitions of the now standby site.

Configuring Disaster Recovery for Oracle Identity Role Intelligence

The disaster recovery for Oracle Identity Role Intelligence (OIRI) is an active/passive solution.

The following sections describe the procedure in detail:

Prerequisites

Before enabling disaster recovery for Oracle Identity Role Intelligence, ensure the following:

  • The primary and standby sites for file system replication communicate with each other.
  • The database systems in the primary and standby sites communicate with each other for database replication.
  • The Kubernetes clusters in both the primary and standby sites are able to resolve the names of the NFS servers in all sites.
  • Each Kubernetes cluster is independent.
  • A running Kubernetes cluster is present in both the sites.

Creating the Disaster Recovery Site

OIRI is tightly integrated into the Kubernetes framework. It interacts directly with the cluster to be able to run data-ingestion tasks. When you deploy OIRI, it contains information such as the cluster name in which it is running. When you run OIRI from a different site, then, in addition to updating the database connection information, you also need to update the Kubernetes cluster information. You can perform this task in two ways.

  • When you are starting OIRI on a different site, update the Kubernetes configuration.
  • Store the configuration for each site on the persistent volume. Then, when you restore the persistent volume on the standby site, replace the Kubernetes configuration with the appropriate values using these files.

    It is possible to create these files only after you have created the OIRI Kubernetes objects on the standby site.

    The recommended approach is to create copies of the Kubernetes configuration objects in the persistent volume and label them according to the site, for example, primary_ca.crt and primary_k8config. Create the standby site (see Creating the Disaster Recovery Site on the Standby Site), and then create a new ca.crt and kubeconfig file based on the standby cluster. Copy these files to the persistent volume of the primary site, naming them standby_ca.crt and standby_k8config. This naming convention ensures that when you replicate the persistent volume to the standby site, you have copies of the ca.crt and kubeconfig files for both the primary and standby clusters. After running the PV replication task, replace the ca.crt and kubeconfig files on the persistent volume with those relevant to the site you want to run. For example:

    • When you use the primary site, the ca.crt and config files will use the contents of the primary_ca.crt and primary_k8config files, respectively.
    • When you use the standby site, the ca.crt and config files will use the contents of the standby_ca.crt and standby_k8config files, respectively.
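
The copy-and-activate convention can be rehearsed locally on a scratch directory standing in for the persistent volume mount; the file contents here are placeholders:

```shell
# Hypothetical dry run of switching the site-labelled Kubernetes config files.
K8S_DIR=$(mktemp -d)
echo "primary-cert"       > "$K8S_DIR/primary_ca.crt"
echo "standby-cert"       > "$K8S_DIR/standby_ca.crt"
echo "standby-kubeconfig" > "$K8S_DIR/standby_k8config"

# After replicating the PV to the standby site, activate the standby copies:
cp "$K8S_DIR/standby_ca.crt"   "$K8S_DIR/ca.crt"
cp "$K8S_DIR/standby_k8config" "$K8S_DIR/config"

cat "$K8S_DIR/config"   # prints standby-kubeconfig
```

Because the site-labelled originals are never overwritten, the same two cp commands (with the primary_ prefix) reverse the switch after a switchback.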
Creating the Disaster Recovery Site From the Primary Site

On the primary site:

  1. Create copies of the existing ca.crt and config files located on the workpv (mounted as /app/k8s in the OIRI-CLI container). Call these files primary_ca.crt and primary_config, respectively.
  2. Create a backup of the persistent volumes holding the OIRI data (oiripv, dingpv, and workpv). See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes. The file exclusion list for OIRI is defined as follows:
    EXCLUDE_LIST="--exclude=\".snapshot\" --exclude=\"backups\" --exclude=\"dr_scripts\" --exclude=\"backup_running\""
  3. Back up the Kubernetes objects for restoration on the standby sites. See Backing Up and Restoring the Kubernetes Objects.
  4. Set up Oracle Data Guard between the primary and standby sites. See Creating a Data Guard Database.
  5. Ensure that the database services used on the primary site are replicated on the standby. These services should be active when the database is running in the roles of PRIMARY or SNAPSHOT STANDBY. See Creating Database Services.
Creating the Disaster Recovery Site on the Standby Site

There are two options for creating the disaster recovery site on the standby site:

By Creating a New Disposable Environment on Site 2

  1. Create a disposable deployment in Site 2 using the same values as the primary site. If you have used the EDG automation scripts, you just need to change the values of the worker nodes and the PV server in the response file.
  2. Copy the ca.crt and config files generated on the standby site to the workpv on the primary site. Call these files standby_ca.crt and standby_config, respectively.
  3. Shut down the Kubernetes pods that are running on Site 2.
  4. Delete the contents of the persistent volume.
  5. Restore the backup of the persistent volume from the backup taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  6. Amend the database connection strings in the restored backup to point to the database on Site 2. Update the following files:
    /app/oiri/data/conf/application.yaml
    /app/data/conf/custom-attributes.yaml
    /app/data/conf/data-ingestion-config.yaml
    /app/data/conf/dbconfig.yaml
    /app/data/conf/env.properties
  7. Amend the Kubernetes cluster URL in the restored backup to point to the Kubernetes cluster in the standby site. To find the Kubernetes cluster URL, use the following command:
    grep server: $KUBECONFIG | sed 's/server://;s/ //g'

    Update the following files:

    /app/data/conf/data-ingestion-config.yaml
    /app/data/conf/env.properties
  8. Switch the Kubernetes configuration files to the copies appropriate to the standby site. For example:
    cp /app/k8s/standby_ca.crt /app/k8s/ca.crt
    cp /app/k8s/standby_ca.crt /app/ca.crt
    cp /app/k8s/standby_k8config /app/k8s/config
  9. Validate that the disaster recovery site is working.
  10. Delete the throwaway database used to create the initial installation on Site 2.
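
The URL-extraction command used when amending the cluster address in step 7 can be verified against a scratch kubeconfig; the server address below is made up:

```shell
# Build a minimal kubeconfig fragment and extract the cluster URL from it.
KUBECONFIG=$(mktemp)
cat > "$KUBECONFIG" <<'EOF'
apiVersion: v1
clusters:
- cluster:
    server: https://10.0.0.5:6443
  name: standby
EOF

# Strip the "server:" key and all whitespace, leaving just the URL.
grep server: "$KUBECONFIG" | sed 's/server://;s/ //g'   # prints https://10.0.0.5:6443
```

The resulting URL is the value to substitute into data-ingestion-config.yaml and env.properties on the standby site.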

By Performing a Kubernetes Restore on Site 2

  1. Create the persistent volumes for the OIRI persistent volumes such that they point at the NFS server in the standby site. See Creating the Kubernetes Persistent Volume.
  2. Open the standby database as a snapshot standby by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to snapshot standby
  3. Restore a backup of the persistent volume taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  4. Restore the backup of the Kubernetes objects from the backup taken on Site 1. See Backing Up and Restoring the Kubernetes Objects.
  5. Start the OIRI-CLI container on the standby system. See Starting the Administration CLI.
  6. Create a kubeconfig file and a certificate on the standby site. See Generating the ca.crt Certificate and Creating a Kubernetes Configuration File for OIRI.

    Store the resulting files ca.crt and oiri_config in the OIRI-CLI locations in both the primary and standby sites.

    /app/k8s/standby_ca.crt
    /app/k8s/standby_k8config
  7. Amend the database connection strings in the restored backup to point to the database in Site 2.

    Update the following files:

    /app/oiri/data/conf/application.yaml
    /app/data/conf/custom-attributes.yaml
    /app/data/conf/data-ingestion-config.yaml
    /app/data/conf/dbconfig.yaml
    /app/data/conf/env.properties
  8. Amend the Kubernetes cluster URL in the restored backup to point to the Kubernetes cluster in the standby site. To find the Kubernetes cluster URL, use the following command:
    grep server: $KUBECONFIG | sed 's/server://;s/ //g'

    Update the following files:

    /app/data/conf/data-ingestion-config.yaml
    /app/data/conf/env.properties
  9. Switch the Kubernetes configuration files to the copies appropriate to the standby site. For example:
    cp /app/k8s/standby_ca.crt /app/k8s/ca.crt
    cp /app/k8s/standby_ca.crt /app/ca.crt
    cp /app/k8s/standby_k8config /app/k8s/config
  10. Validate that the disaster recovery site is working.
  11. Reinstate the standby database by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to physical standby

    Switch over the database and validate that the deployment is fully working.

  12. Recopy the persistent volumes from the primary to the standby site on a periodic basis. For example, once a week, or whenever you make configuration changes to the OIRI deployment.

Configuring Disaster Recovery for Oracle Advanced Authentication

The disaster recovery for Oracle Advanced Authentication (OAA) is an active/passive solution.

The following sections describe the procedure in detail:

Prerequisites

Before enabling disaster recovery for Oracle Advanced Authentication, ensure the following:

  • The primary and standby sites for file system replication communicate with each other.
  • The database systems in the primary and standby sites communicate with each other for database replication.
  • The Kubernetes clusters in both the primary and standby sites are able to resolve the names of the NFS servers in all sites.
  • Each Kubernetes cluster is independent.
  • A running Kubernetes cluster is present in both the sites.
  • The OAuth provider must be available in both the sites.

Note:

If you are using OAM as the OAuth provider and the failover scenario is all or nothing (that is, you fail over or switch over OAM and OAA together), then to create the DR site you must enable OAM in the standby site during configuration. Do this by temporarily converting the OAM standby database to a snapshot standby and starting OAM using that database. After configuring OAA, shut down OAM and revert the database to a physical standby.

Each OAM site must use the same SSL certificate.

Creating the Disaster Recovery Site

OAA is tightly integrated into the Kubernetes framework. When you deploy OAA, it contains information such as the cluster name in which it is running. Therefore, you must exclude this information when you replicate the persistent volumes.

Creating the Disaster Recovery Site From the Primary Site

On the primary site:

  1. Create a backup of the persistent volumes holding the OAA data (oaaconfigpv, oaacreadpv, oaavaultpv, and oaalogspv). See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes. The file exclusion list for OAA is defined as follows:
    EXCLUDE_LIST="--exclude=\".snapshot\" --exclude=\"backups\" --exclude=\"dr_scripts\" --exclude=\"backup_running\" --exclude=\"k8sconfig\" --exclude=\"ca.crt\""
  2. Set up Oracle Data Guard between the primary and standby sites.
  3. Ensure that the database services used on the primary site are replicated on the standby. These services should be active when the database is running in the roles of PRIMARY or SNAPSHOT STANDBY. See Creating Database Services.
Creating the Disaster Recovery Site on the Standby Site

On the standby site:

  1. Open the standby database as a snapshot standby by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to snapshot standby
  2. Restore a backup of the persistent volume taken on Site 1. See Creating a Kubernetes CronJob to Synchronize the Persistent Volumes.
  3. Create a namespace for OAA (this should be the same as the namespace of the primary site). See Creating Kubernetes Namespaces.
  4. Create a container registry secret. See Creating a Container Registry Secret.
  5. Create the OAA Management Container. See Starting the Management Container.
  6. Grant the OAA Management Container access to the Kubernetes cluster. See Granting the Management Container Access to the Kubernetes Cluster.
  7. Modify the installOAA.properties file copied from the primary site. The file is located in the /u01/oracle/scripts/settings directory in the OAA Management Container. See Creating the OAA Property File.
    Make the following changes:
    • Change database.host to the scan address of the standby database.
    • Set database.createschema=false.
  8. Recreate the OAA deployment by running the OAA.sh script. See Deploying Oracle Advanced Authentication.
  9. Verify that the OAA pods start successfully.
  10. Reinstate the standby database by using the following Data Guard broker commands:
    dgmgrl sys/Password
    show configuration
    convert database standby_db_name to physical standby

    Switch over the database and validate that the deployment is fully working.

  11. Recopy the persistent volumes from the primary to the standby site on a periodic basis. For example, once a week, or whenever you make configuration changes to the OAA deployment.