7 Fault Recovery

This chapter provides information about fault recovery for OCNADD deployment.

7.1 Overview

This section describes the backup and restore procedures for the Oracle Communications Network Analytics Data Director (OCNADD) deployment. These procedures are used for OCNADD fault recovery. Operators can back up only the OCNADD instance-specific database and the required OCNADD Kafka metadata, and restore them either on the same Kubernetes cluster or on a different one.

The backup and restore procedures are helpful in the following scenarios:

  • OCNADD fault recovery
  • OCNADD cluster migration
  • OCNADD setup replication from production to development or staging
  • OCNADD cluster upgrade to new CNE version or K8s version

The OCNADD backup contains the following data:

  • OCNADD database(s) backup
  • OCNADD Kafka metadata backup including the topics and partitions information

Note:

If the deployed Helm charts and the customized ocnadd-custom-values-23.3.0.yaml file for the current deployment are stored in the customer Helm or artifact repository, then a backup of the Helm charts and the ocnadd-custom-values-23.3.0.yaml file is not required.

OCNADD Backup and Restore

OCNADD Database(s) Backup

The OCNADD database consists of the following:
  • Configuration data: This data is exclusive to the given OCNADD instance. An exclusive logical database is created and used by each OCNADD instance to store its configuration data and operator-driven configuration. Operators can configure the OCNADD instance-specific configurations using the Configuration UI service through the Cloud Native Configuration Console.
  • Alarm configuration data: This data is also exclusive to the given OCNADD instance. An exclusive logical database is created and used by each OCNADD Alarm service instance to store its alarm configuration and alarms.
  • Health monitoring data: This data is also exclusive to the given OCNADD instance. An exclusive logical database is created and used by each OCNADD Health Monitoring service instance to store the health profiles of the other services.

The database backup job uses the mysqldump utility.

Scheduled regular backups help in:

  • Restoring a stable version of the data director databases
  • Minimizing loss of data due to upgrade or rollback failure
  • Minimizing loss of data due to system failure
  • Minimizing loss of data due to data corruption or deletion caused by external input
  • Migrating the database information from one site to another

OCNADD Kafka Metadata Backup

The OCNADD Kafka metadata backup contains the following information:

  • Created topics information
  • Created partitions per topic information

7.1.1 Fault Recovery Impact Areas

The following table describes the impact of the OCNADD fault recovery scenarios:

Table 7-1 OCNADD Fault Recovery Scenarios Impact Information

The entries indicate whether each scenario requires fault recovery or reinstallation of CNE, cnDBTier, and the Data Director.

Scenario 1: Deployment Failure (recovering OCNADD when its deployment is corrupted)
  • CNE: No, cnDBTier: No, Data Director: Yes

Scenario 2: cnDBTier Corruption
  • CNE: No, cnDBTier: Yes, Data Director: No
  • However, the databases must be restored from backup, and a Helm upgrade of the same OCNADD version is required to update the OCNADD configuration, for example, when cnDBTier service information such as cnDB endpoints or DB credentials changes.

Scenario 3: Database Corruption (recovering from a corrupted OCNADD configuration database)
  • CNE: No, cnDBTier: No, Data Director: No
  • However, the databases must be restored from an old backup.

Scenario 4: Site Failure (complete site failure due to infrastructure failure, for example, hardware, CNE, and so on)
  • CNE: Yes, cnDBTier: Yes, Data Director: Yes

Scenario 5: Backup Restore in a Different Cluster (obtaining the OCNADD backup from one deployment site and restoring it to another site)
  • CNE: No, cnDBTier: No, Data Director: No
  • However, the database must be restored.

7.1.2 Prerequisites

Before you run any fault recovery procedure, ensure that the following prerequisites are met:

  • cnDBTier must be in a healthy state and available on the new or newly installed site where the restore is to be performed.
  • Automatic backup must be enabled for OCNADD.
  • Docker images used during the last installation or upgrade must be retained in the external data storage or repository.
  • The ocnadd-custom-values-23.3.0.yaml file used at the time of OCNADD deployment must be retained. If it is not retained, it must be recreated manually, which increases the overall fault recovery time.

Important:

Do not change DB Secret or cnDBTier MySQL FQDN or IP or PORT configurations during backup and restore.

7.2 Backup and Restore Flow

Important:

  • It is recommended to keep the backup in external storage that can be shared between different clusters, so that in the event of a fault the backup is accessible from the other clusters. The backup job should create a PV or PVC from the external storage provided for the backup.
  • If external storage is not made available for the backup, the customer must copy the backups from the associated backup PV in the cluster to the external storage. The security of, and connectivity to, the external storage must be managed by the customer. To copy the backup from the backup PV to the external server, follow Verifying OCNADD Backup.
  • The restore job should have access to the external storage so that the backup from the external storage can be used to restore the OCNADD services. If the external storage is not accessible, the backup must be copied from the external storage to the backup PV in the new cluster. For the procedure, see Verifying OCNADD Backup.

Note:

Only one of the three backup jobs (ocnaddmanualbackup, ocnaddverify, or ocnaddrestore) can run at a time. If an existing backup job is running, delete it before spawning a new job.


kubectl delete job.batch/<ocnadd*> -n <namespace>

where namespace = Namespace of OCNADD deployment
      ocnadd* = Running jobs in the namespace (ocnaddmanualbackup, ocnaddverify or ocnaddrestore)

Example: 
      kubectl delete job.batch/ocnaddverify -n ocnadd-deploy

Backup

  1. The OCNADD backup is managed by the backup job created at the time of installation. The backup job runs as a cron job and takes a daily backup of the following:
    • OCNADD databases for configuration, alarms, and health monitoring
    • OCNADD Kafka metadata, including previously created topics and partitions
  2. The automated backup job spawns a container and takes the backup at the scheduled time. The backup file OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2 is created and stored by the backup container in the PV mounted on the path /work-dir/backup.
  3. An on-demand backup can also be created by creating the backup container. For more information, see Performing OCNADD Manual Backup.
  4. The backup can be stored on external storage.

Restore

  1. The OCNADD restore job must have access to the backups from the backup PV/PVC.
  2. The restore uses the latest backup file available in the backup storage if the BACKUP_FILE argument is not given.
  3. The restore job performs the restore in the following order:
    1. Restore the OCNADD database(s) on the cnDBTier.
    2. Restore the Kafka metadata.
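
The "latest backup" selection in step 2 implies ordering the DD-MM-YYYY_hh-mm-ss file names chronologically; note that a naive lexical sort of the raw names would order by day first. The following sketch re-keys the timestamp so a lexical sort is chronological. It is an illustrative assumption about the approach, not the actual restore.sh logic:

```shell
#!/bin/sh
# Sketch: pick the newest OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2 file.
latest_backup() {
  # Re-key each name as YYYY-MM-DD_hh-mm-ss so a lexical sort is chronological.
  for f in "$@"; do
    ts=${f#OCNADD_Backup_}; ts=${ts%.tar.bz2}
    d=${ts%%_*}; t=${ts#*_}
    day=${d%%-*}; rest=${d#*-}; mon=${rest%%-*}; yr=${rest#*-}
    printf '%s %s\n' "${yr}-${mon}-${day}_${t}" "$f"
  done | sort | tail -n 1 | cut -d' ' -f2
}

latest_backup OCNADD_Backup_10-05-2023_08-00-05.tar.bz2 \
              OCNADD_Backup_02-06-2023_08-00-04.tar.bz2
# prints OCNADD_Backup_02-06-2023_08-00-04.tar.bz2
```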

7.3 OCNADD Backup

The OCNADD backup is of two types:
  • Automated backup
  • Manual backup

Automated Backup

  • This is managed by the automated K8s job configured during the installation of OCNADD. For more information, see Configure OCNADD Backup Cronjob.
  • It is a scheduled job that runs daily at the configured time, collects the OCNADD backup, and creates the backup file OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2.

Manual Backup

  • A manual (on-demand) backup is taken by creating the backup job manually. For the procedure, see Performing OCNADD Manual Backup.

7.4 Performing OCNADD Backup Procedures

7.4.1 Performing OCNADD Manual Backup

Perform the following steps to take the manual backup:

  1. Go to custom-templates folder in the extracted ocnadd-release package and update the ocnadd_manualBackup.yaml file with the following information:
    1. The value of BACKUP_DATABASES can be set to ALL (that is, healthdb_schema, configuration_schema, and alarm_schema), or individual DB names can be passed. By default, the value is 'ALL'.
    2. The value of BACKUP_ARG can be set to ALL, DB, or KAFKA. By default, the value is ALL.
    3. Update other values as follows:
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: ocnaddmanualbackup
        namespace: ocnadd-deploy     #---> update the namespace
      spec:
        template:      
          metadata:
            name: ocnaddmanualbackup
          spec:
            volumes:
            - name: backup-vol
              persistentVolumeClaim:
                  claimName: backup-mysql-pvc
            - name: config-vol   
              configMap:
                name: config-backuprestore-scripts
            serviceAccountName: ocnadd-sa-ocnadd      #---> update the service account name. Format:<serviceAccount>-sa-ocnadd
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              fsGroup: 1000
            containers:
            - name: ocnaddmanualbackup
              image: <repo-path>/ocnaddbackuprestore:2.0.0      #---> update repository path
              volumeMounts:
                - mountPath: "/work-dir"
                  name: backup-vol
                - mountPath: "/config-backuprestore-scripts"
                  name: config-vol
              env:
                - name: HOME
                  value: /home/ocnadd
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: MYSQL_USER
                - name: DB_PASSWORD      
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: MYSQL_PASSWORD      
                - name: BACKUP_DATABASES
                  value: ALL
                - name: BACKUP_ARG
                  value: ALL
              command:
              - /bin/sh
              - -c
              - |
                cp /config-backuprestore-scripts/*.sh /home/ocnadd
                chmod +x /home/ocnadd/*.sh  
                mkdir -p /work-dir/backup
                echo "Executing manual backup script"
                bash /home/ocnadd/backup.sh $BACKUP_DATABASES $BACKUP_ARG
                ls -lh /work-dir/backup
            restartPolicy: Never
      status: {}
  2. Run the following command to create the job:
    kubectl create -f ocnadd_manualBackup.yaml
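
After the job is created, its completion can be tracked with standard kubectl commands. This is a sketch that assumes the namespace used in the example manifest; adjust it to your deployment:

```shell
# Wait for the manual backup job to complete, then review its output.
kubectl wait --for=condition=complete job/ocnaddmanualbackup -n ocnadd-deploy --timeout=600s
kubectl logs -n ocnadd-deploy job/ocnaddmanualbackup
```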

7.4.2 Verifying OCNADD Backup

Caution:

Connectivity to the external storage, either through PV/PVC or a network connection, must be ensured.

To verify the backup, perform the following steps:
  1. Go to the custom-templates folder in the extracted ocnadd-release package and update the ocnadd_verify_backup.yaml file with the following information:
    1. The sleep time is configurable; update it if required (the default value is 10m).
    2. Update other values as follows:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: ocnaddverify
        namespace: ocnadd-deploy          #---> update the namespace
      spec:
        template:      
          metadata:
            name: ocnaddverify
          spec:
            volumes:
            - name: backup-vol
              persistentVolumeClaim:
                  claimName: backup-mysql-pvc
            - name: config-vol   
              configMap:
                name: config-backuprestore-scripts
            serviceAccountName: ocnadd-sa-ocnadd      #---> update the service account name. Format:<serviceAccount>-sa-ocnadd
            securityContext:
              runAsUser: 1000
              runAsGroup: 1000
              fsGroup: 1000
            containers:
            - name: ocnaddverify
              image: <repo-path>/ocnaddbackuprestore:2.0.0      #---> update repository path
              volumeMounts:
                - mountPath: "/work-dir"
                  name: backup-vol
                - mountPath: "/config-backuprestore-scripts"
                  name: config-vol
              env:
                - name: HOME
                  value: /home/ocnadd
                - name: DB_USER
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: MYSQL_USER
                - name: DB_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: db-secret
                      key: MYSQL_PASSWORD
              command:
              - /bin/sh
              - -c
              - |
                cp /config-backuprestore-scripts/*.sh /home/ocnadd
                chmod +x /home/ocnadd/*.sh 
                echo "Checking backup path"
                ls -lh /work-dir/backup
                sleep 10m
            restartPolicy: Never
      status: {}
  2. Run the following command to create the job, which is used to verify the backup generated at the mounted PV:
    kubectl create -f ocnadd_verify_backup.yaml
  3. If the external storage is used as PV/PVC, go to the ocnaddverify container using the following commands:
    1. kubectl exec -it <verify_pod> -n <ocnadd namespace> -- bash
    2. Change the directory to /work-dir/backup and verify the DB backup and Kafka metadata backup files in the latest backup file OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2.
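
The listing in step 3 can be rehearsed locally. The file names below are illustrative assumptions about the archive layout, not its guaranteed contents:

```shell
# Build a mock backup archive with an assumed layout (a DB dump plus Kafka
# metadata), then list its contents the same way you would inside the
# ocnaddverify pod with `tar -tjf`.
mkdir -p backup
printf -- '-- mock dump\n' > backup/configuration_schema.sql
printf '{}\n'              > backup/kafka_metadata.json
tar -cjf OCNADD_Backup_01-01-2024_00-00-00.tar.bz2 backup
tar -tjf OCNADD_Backup_01-01-2024_00-00-00.tar.bz2
```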

7.4.3 Retrieving the OCNADD Backup Files

  1. Run the Verifying OCNADD Backup procedure to spawn the verify_pod.
  2. Go to the running ocnaddverify pod to identify and retrieve the desired backup folder using the following commands:
    1. Run the following command to access the pod:
      kubectl exec -it <ocnaddverify-*> -n <namespace> -- bash

      where namespace is namespace of ocnadd

      ocnaddverify-* is the verify pod in the namespace

    2. Change the directory to /work-dir/backup and identify the backup file "OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2".
    3. Exit the ocnaddverify pod.
  3. Run the following command to copy the OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2 file from the pod to the local bastion server:
    kubectl cp -n <namespace> <verify_pod>:/work-dir/backup/<OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> <OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> 

    For example:

    kubectl cp -n ocnadd ocnaddverify-drwzq:/work-dir/backup/OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2

    where, namespace is the namespace of ocnadd and verify_pod is the verify pod in the namespace.

7.4.4 Copy and Restore the OCNADD backup

  1. Retrieve the OCNADD backup file.
  2. Perform the Verifying OCNADD Backup procedure to spawn the verify_pod.
  3. To copy the backup file from the local bastion server to the running ocnaddverify pod, run the following command:
    kubectl cp <OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> <verify_pod>:/work-dir/backup/<OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> -n <namespace>

    For example:

    kubectl cp OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 ocnaddverify-mrdxn:/work-dir/backup/OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 -n ocnadd
  4. Go to the ocnaddverify pod and verify that the backup has been copied to /work-dir/backup/OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2.
  5. Restore OCNADD using the procedure defined in Creating OCNADD Restore Job.
  6. Restart the ocnaddconfiguration, ocnaddalarm, ocnaddhealthmonitoring, and ocnaddadminsvc pods.
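
Step 6 can be scripted. The deployment names below are assumptions based on the service names; verify them with kubectl get deployments in your namespace:

```shell
# Restart the configuration-dependent services after the restore.
# Deployment names and namespace are assumptions; verify them first.
for d in ocnaddconfiguration ocnaddalarm ocnaddhealthmonitoring ocnaddadminsvc; do
  kubectl rollout restart deployment/$d -n <namespace>
done
```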

7.5 Disaster Recovery Scenarios

This chapter describes the disaster recovery procedures for different recovery scenarios.

7.5.1 Scenario 1: Deployment Failure

This section describes how to recover OCNADD when the OCNADD deployment is corrupted.

For more information, see Restoring OCNADD.

7.5.2 Scenario 2: cnDBTier Corruption

This section describes how to recover from cnDBTier corruption. For more information, see Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide. After the cnDBTier recovery, restore the OCNADD database from the previous backup.

To restore the OCNADD database, perform the Creating OCNADD Restore Job procedure with BACKUP_ARG set to DB.

7.5.3 Scenario 3: Database Corruption

This section describes how to recover from the corrupted OCNADD database.

Perform the following steps to recover the OCNADD configuration database (DB) from the corrupted database:

  1. Retain the working OCNADD backup by following the Retrieving the OCNADD Backup Files procedure.
  2. Drop the existing databases by accessing the MySQL DB.
  3. Perform the Copy and Restore the OCNADD backup procedure to restore the backup.
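
For step 2, the drop can be performed from any MySQL client that can reach the cnDBTier endpoint. The schema names below are the defaults listed in Performing OCNADD Manual Backup; the host and user are placeholders:

```shell
# Placeholders: replace <cndbtier-host> and <db-user> with your cnDBTier
# MySQL endpoint and the user from the db-secret.
mysql -h <cndbtier-host> -u <db-user> -p -e "
  DROP DATABASE IF EXISTS configuration_schema;
  DROP DATABASE IF EXISTS alarm_schema;
  DROP DATABASE IF EXISTS healthdb_schema;"
```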

7.5.4 Scenario 4: Site Failure

This section describes how to perform fault recovery when the OCNADD site fails completely.

Perform the following steps in case of a complete site failure:

  1. Run the Cloud Native Environment (CNE) installation procedure to install a new Kubernetes cluster. For more information, see Oracle Communications Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
  2. Run the cnDBTier installation procedure. For more information, see Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide.
  3. For cnDBTier fault recovery, take a data backup from an older site and restore it to a new site. For more information about cnDBTier backup, see "Create On-demand Database Backup" and to restore the database to a new site, see "Restore DB with Backup" in Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide.
  4. Restore OCNADD. For more information, see Restoring OCNADD.

7.5.5 Scenario 5: Backup Restore in a Different Cluster

This section describes how to obtain the OCNADD backup from one deployment site and restore it to another site.

Perform the following steps:

  1. Retrieving the OCNADD Backup Files
  2. Copy and Restore the OCNADD backup

7.6 Restoring OCNADD

Perform this procedure to restore OCNADD when a fault event has occurred or deployment is corrupted.

Note:

This procedure expects that the OCNADD backup folder is retained.

  1. Get the retained backup file "OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2".
  2. Get the Helm charts that were used in the earlier deployment.
  3. Run the following command to uninstall the corrupted OCNADD deployment:
    helm uninstall <release_name> --namespace <namespace>

    Where, <release_name> is a name used to track this installation instance and <namespace> is the namespace of OCNADD deployment.

    Example:

    helm uninstall ocnadd --namespace ocnadd-ns
  4. Install OCNADD using the Helm charts that were used in the earlier deployment. For the installation procedure, see Installing OCNADD.
  5. To verify whether OCNADD installation is complete, see Verifying OCNADD Installation.
  6. Follow the Copy and Restore the OCNADD backup procedure.

7.7 Creating OCNADD Restore Job

  1. Restore the OCNADD database by executing the following steps:
    1. Go to the custom-templates folder inside the extracted ocnadd-release package and update the ocnadd_restore.yaml file based on the restore requirements:
      1. The value of BACKUP_ARG can be set to DB, KAFKA, or ALL. By default, the value is 'ALL'.
      2. The value of BACKUP_FILE can be set to the name of the backup folder to be restored; if it is not set, the latest backup is used.
      3. Update other values as follows:
        apiVersion: batch/v1
        kind: Job
        metadata:
          name: ocnaddrestore
          namespace: ocnadd-deploy    #---> update the namespace
        spec:
          template:     
            metadata:
              name: ocnaddrestore
            spec:
              volumes:
              - name: backup-vol
                persistentVolumeClaim:
                    claimName: backup-mysql-pvc
              - name: config-vol  
                configMap:
                  name: config-backuprestore-scripts
              serviceAccountName: ocnadd-sa-ocnadd      #---> update the service account name. Format:<serviceAccount>-sa-ocnadd
              securityContext:
                runAsUser: 1000
                runAsGroup: 1000
                fsGroup: 1000
              containers:
              - name: ocnaddrestore
                image: <repo-path>/ocnaddbackuprestore:2.0.0      #---> update repository path
                volumeMounts:
                  - mountPath: "/work-dir"
                    name: backup-vol
                  - mountPath: "/config-backuprestore-scripts"
                    name: config-vol
                env:
                  - name: HOME
                    value: /home/ocnadd
                  - name: DB_USER
                    valueFrom:
                      secretKeyRef:
                        name: db-secret
                        key: MYSQL_USER
                  - name: DB_PASSWORD
                    valueFrom:
                      secretKeyRef:
                        name: db-secret
                        key: MYSQL_PASSWORD
                  - name: BACKUP_ARG
                    value: ALL
                  - name: BACKUP_FILE
                    value:
                command:
                - /bin/sh
                - -c
                - |
                  cp /config-backuprestore-scripts/*.sh /home/ocnadd
                  chmod +x /home/ocnadd/*.sh
                  echo "Executing restore script"
                  ls -lh /work-dir/backup
                  bash /home/ocnadd/restore.sh $BACKUP_ARG $BACKUP_FILE
                  sleep 15m
              restartPolicy: Never
        status: {}
  2. Run the following command to create the restore job:
    kubectl create -f ocnadd_restore.yaml 
  3. Restart the ocnaddconfiguration, ocnaddalarm, and ocnaddhealthmonitoring pods after the restore job is completed.
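
Completion of the restore job can be tracked before restarting the pods. This sketch assumes the namespace from the example manifest; note that the manifest's script sleeps for 15 minutes before the container exits, so allow more than 15 minutes:

```shell
# Wait for the restore job to finish, then inspect its log for errors.
kubectl wait --for=condition=complete job/ocnaddrestore -n ocnadd-deploy --timeout=1200s
kubectl logs -n ocnadd-deploy job/ocnaddrestore | tail -n 20
```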

Note:

If a backup is not available for the specified date, the pod goes into an error state and reports "Backup is not available for the given date: $DATE". In that case, provide the correct backup date and repeat the procedure.

7.8 Configuring Backup and Restore Parameters

To configure backup and restore parameters, configure the parameters listed in the following table:

Table 7-2 Backup and Restore Parameters

Parameter Name          Data Type  Range  Default Value      M/O/C  Description
BACKUP_STORAGE          STRING     -      20Gi               M      Persistent Volume storage to keep the OCNADD backups
MYSQLDB_NAMESPACE       STRING     -      occne-cndbtierone  M      MySQL cluster namespace
BACKUP_CRONEXPRESSION   STRING     -      0 8 * * *          M      Cron expression to schedule the backup cron job
BACKUP_ARG              STRING     -      ALL                M      KAFKA, DB, or ALL backup
BACKUP_FILE             STRING     -      -                  O      Name of the backup folder to be restored
BACKUP_DATABASES        STRING     -      ALL                M      Individual databases, or all databases, to be backed up

where M = Mandatory, O = Optional, and C = Conditional.