7 Fault Recovery
This chapter provides information about fault recovery for OCNADD deployment.
7.1 Overview
This section describes the procedures to back up and restore the Oracle Communications Network Analytics Data Director (OCNADD) deployment. The backup and restore procedures are used for OCNADD fault recovery. OCNADD operators can back up only the OCNADD instance-specific database and the required OCNADD Kafka metadata, and restore them either on the same or a different Kubernetes cluster.
The backup and restore procedures are helpful in the following scenarios:
- OCNADD fault recovery
- OCNADD cluster migration
- OCNADD setup replication from production to development or staging
- OCNADD cluster upgrade to new CNE version or K8s version
The OCNADD backup contains the following data:
- OCNADD database(s) backup
- OCNADD Kafka metadata backup including the topics and partitions information
Note:
If the deployed Helm charts and the customized ocnadd-custom-values-23.4.0.0.1.yaml file for the current deployment are stored in the customer Helm or artifact repository, then a backup of the Helm chart and ocnadd-custom-values-23.4.0.0.1.yaml is not required.

OCNADD Database(s) Backup
- Configuration data: This data is exclusive for the given OCNADD instance. Therefore, an exclusive logical database is created and used by an OCNADD instance to store its configuration data and operator driven configuration. Operators can configure the OCNADD instance specific configurations using the Configuration UI service through the Cloud Native Configuration Console.
- Alarm configuration data: This data is also exclusive to the given OCNADD instance. Therefore, an exclusive logical database is created and used by an OCNADD Alarm service instance to store its alarm configuration and alarms.
- Health monitoring data: This data is also exclusive to the given OCNADD instance. Therefore, an exclusive logical database is created and used by an OCNADD Health monitoring service instance to store the health profile of various other services.
The database backup job uses the mysqldump utility.
Scheduled regular backups help in:
- Restoring a stable version of the data director databases
- Minimizing significant data loss due to upgrade or rollback failure
- Minimizing data loss due to system failure
- Minimizing data loss due to data corruption or deletion caused by external input
- Migrating database information from one site to another
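The mysqldump-based backup flow described above can be sketched as follows. This is a minimal illustration, not the actual backup job: the schema names and the DD-MM-YYYY_hh-mm-ss file-naming convention are taken from this chapter, placeholder files stand in for real dumps, and the mysqldump invocation is shown only as a comment because it requires a live cnDBTier endpoint and credentials.

```shell
#!/bin/sh
# Sketch of how an OCNADD-style backup archive could be assembled.
# Assumption: schema names and the timestamp format come from this
# chapter; placeholder files stand in for real mysqldump output.
set -e
WORKDIR=$(mktemp -d)
for schema in configuration_schema alarm_schema healthdb_schema; do
  # Real job (needs cnDBTier access and credentials):
  #   mysqldump -h <cndbtier-mysql-fqdn> -u <dbuser> -p "$schema" > "$WORKDIR/$schema.sql"
  : > "$WORKDIR/$schema.sql"   # placeholder dump file for this sketch
done
STAMP=$(date +%d-%m-%Y_%H-%M-%S)             # DD-MM-YYYY_hh-mm-ss
ARCHIVE="OCNADD_Backup_${STAMP}.tar.bz2"
tar -cjf "$ARCHIVE" -C "$WORKDIR" .          # bundle all dumps into one archive
echo "created $ARCHIVE"
```

The real backup job additionally collects the Kafka metadata and stores the archive on the PV mounted at /work-dir/backup.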
OCNADD Kafka Metadata Backup
The OCNADD Kafka metadata backup contains the following information:
- Created topics information
- Created partitions per topic information
7.1.1 Fault Recovery Impact Areas
The following table describes the impact of OCNADD fault recovery scenarios:
Table 7-1 OCNADD Fault Recovery Scenarios Impact Information
Scenario | Requires Fault Recovery or Reinstallation of CNE? | Requires Fault Recovery or Reinstallation of cnDBTier? | Requires Fault Recovery or Reinstallation of Data Director? |
---|---|---|---|
Scenario 1: Deployment Failure. Recovering OCNADD when its deployment is corrupted. | No | No | Yes |
Scenario 2: cnDBTier Corruption | No | Yes | No. However, the databases must be restored from backup, and a Helm upgrade of the same OCNADD version is required to update the OCNADD configuration, for example, a change in cnDBTier service information, such as cnDB endpoints, DB credentials, and so on. |
Scenario 3: Database Corruption. Recovering from a corrupted OCNADD configuration database. | No | No | No. However, the databases must be restored from an earlier backup. |
Scenario 4: Site Failure. Complete site failure due to infrastructure failure, for example, hardware, CNE, and so on. | Yes | Yes | Yes |
7.1.2 Prerequisites
Before you run any fault recovery procedure, ensure that the following prerequisites are met:
- cnDBTier must be in a healthy state and available on the new or newly installed site where the restore is to be performed.
- Automatic backup must be enabled for OCNADD.
- Docker images used during the last installation or upgrade must be retained in the external data storage or repository.
- The ocnadd-custom-values-23.4.0.0.1.yaml file used at the time of OCNADD deployment must be retained. If the file is not retained, it must be recreated manually, which increases the overall fault recovery time.
Important:
Do not change the DB Secret or the cnDBTier MySQL FQDN, IP, or PORT configurations during backup and restore.

7.2 Backup and Restore Flow
Important:
- It is recommended to keep the backups in external storage that can be shared between different clusters, so that in the event of a fault the backup is accessible from the other clusters. The backup job should create a PV or PVC from the external storage provided for the backup.
- If external storage is not made available for the backups, the customer must copy the backups from the associated backup PV in the cluster to external storage. The security of, and connectivity to, the external storage must be managed by the customer. To copy the backup from the backup PV to the external server, follow Verifying OCNADD Backup.
- The restore job should have access to the external storage so that the backup from the external storage can be used to restore the OCNADD services. If the external storage is not accessible, copy the backup from the external storage to the backup PV in the new cluster. For information on the procedure, see Verifying OCNADD Backup.
Note:
At a time, only one of the three jobs (ocnaddmanualbackup, ocnaddverify, or ocnaddrestore) can be running. If an existing job is running, delete it before spawning a new one:
kubectl delete job.batch/<ocnadd*> -n <namespace>
where namespace = Namespace of OCNADD deployment
ocnadd* = Running jobs in the namespace (ocnaddmanualbackup, ocnaddverify or ocnaddrestore)
Example:
kubectl delete job.batch/ocnaddverify -n ocnadd-deploy
Backup
- The OCNADD backup is managed using the backup job created at the time of installation. The backup job runs as a cron job and takes a daily backup of the following:
  - OCNADD databases for configuration, alarms, and health monitoring
  - OCNADD Kafka metadata, including previously created topics and partitions
- The automated backup job spawns as a container and takes the backup at the scheduled time. The backup file OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2 is created by the backup container and stored in the PV mounted on the path /work-dir/backup.
- An on-demand backup can also be created by creating the backup container. For more information, see Performing OCNADD Manual Backup.
- The backup can be stored on external storage.
Restore
- The OCNADD restore job must have access to the backups from the backup PV/PVC.
- The restore uses the latest backup file available in the backup storage if the BACKUP_FILE argument is not given.
- The restore job performs the restore in the following order:
- Restore the OCNADD database(s) on the cnDBTier.
- Restore the Kafka metadata.
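The two-step ordering above (databases first, then Kafka metadata) can be sketched as a simple guard in shell. The restore_* functions below are hypothetical placeholders for what the ocnaddrestore job does internally; only the ordering is taken from this chapter.

```shell
#!/bin/sh
# Sketch of the restore ordering: the DB restore must succeed before
# the Kafka metadata restore runs. Both functions are placeholders.
restore_db()    { echo "restoring OCNADD database(s) on cnDBTier"; }
restore_kafka() { echo "restoring Kafka topics and partitions metadata"; }

restore_db && restore_kafka   # Kafka step runs only if the DB step succeeds
```

Chaining with `&&` mirrors the documented behavior: if the database restore fails, the Kafka metadata restore is not attempted.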
7.3 OCNADD Backup
- Automated backup
- Manual backup
Automated Backup
- This is managed by the automated K8s job configured during the installation of OCNADD. For more information, see the Updating the OCNADD Backup Cronjob step.
- It is a scheduled job that runs daily at the configured time to collect the OCNADD backup and create the backup file OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2.
Manual Backup
- This is managed by an on-demand job.
- A new K8s job will be created on executing the Performing OCNADD Manual Backup procedure.
- The job completes after taking the backup. Follow the Verifying OCNADD Backup procedure to verify the generated backup.
7.4 Performing OCNADD Backup Procedures
7.4.1 Performing OCNADD Manual Backup
Perform the following steps to take the manual backup:
- Go to the custom-templates folder in the extracted ocnadd-release package and update the ocnadd_manualBackup.yaml file with the following information:
  - The value of BACKUP_DATABASES can be set to ALL (that is, healthdb_schema, configuration_schema, and alarm_schema), or individual DB names can be passed. By default, the value is ALL.
  - The value of BACKUP_ARG can be set to ALL, DB, or KAFKA. By default, the value is ALL.
  - Update other values as follows:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ocnaddmanualbackup
      namespace: ocnadd-deploy    #---> update the namespace
    ------------
    spec:
      serviceAccountName: ocnadd-deploy-sa-ocnadd    #---> update the service account name. Format: <serviceAccount>-sa-ocnadd
    ------------
      containers:
        - name: ocnaddmanualbackup
          image: <repo-path>/ocdd.repo/ocnaddbackuprestore:2.0.1    #---> update repository path
    ------------
      initContainers:
        - name: ocnaddinitcontainer
          image: <repo-path>/utils.repo/jdk17-openssl:1.0.6    #---> update repository path
          env:
            - name: BACKUP_DATABASES
              value: ALL
            - name: BACKUP_ARG
              value: ALL
- Run the following command to create the job:

  kubectl create -f ocnadd_manualBackup.yaml
7.4.2 Verifying OCNADD Backup
Caution:
Ensure connectivity to the external storage, either through a PV/PVC or through direct network connectivity.
- Go to the custom-templates folder in the extracted ocnadd-release package and update the ocnadd_verify_backup.yaml file with the following information:
  - Sleep time is configurable; update it if required (the default value is 10m).
  - Update other values as follows:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: ocnaddverify
      namespace: ocnadd-deploy    #---> update the namespace
    ------------
    spec:
      serviceAccountName: ocnadd-sa-ocnadd    #---> update the service account name. Format: <serviceAccount>-sa-ocnadd
    ------------
      containers:
        - name: ocnaddverify
          image: <repo-path>/ocdd.repo/ocnaddbackuprestore:2.0.1    #---> update repository path
    ------------
      initContainers:
        - name: ocnaddinitcontainer
          image: <repo-path>/utils.repo/jdk17-openssl:1.0.6    #---> update repository path
- Run the following command to create the job:

  kubectl create -f ocnadd_verify_backup.yaml
- If the external storage is used as PV/PVC, then enter the ocnaddverify-xxxx container using the following command:

  kubectl exec -it <ocnaddverify-xxxx> -n <ocnadd namespace> -- bash

- Change the directory to /work-dir/backup and, inside the latest backup file OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2, verify the DB backup and Kafka metadata backup files.
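The manual check above amounts to picking the newest OCNADD_Backup_*.tar.bz2 in the backup directory and listing its contents. A runnable sketch follows; /work-dir/backup is emulated with a temporary directory so the sketch runs anywhere, and the file names inside the archive are illustrative assumptions, so check your actual backup for its real layout.

```shell
#!/bin/sh
# Sketch of the check performed inside the ocnaddverify pod: find the
# newest backup archive and list its contents. The temp dir stands in
# for /work-dir/backup; inner file names are illustrative assumptions.
set -e
BACKUP_DIR=$(mktemp -d)                      # stands in for /work-dir/backup
: > "$BACKUP_DIR/configuration_schema.sql"   # placeholder DB dump
: > "$BACKUP_DIR/kafka_topics_metadata.txt"  # placeholder Kafka metadata
tar -cjf "$BACKUP_DIR/OCNADD_Backup_01-01-2024_08-00-00.tar.bz2" \
    -C "$BACKUP_DIR" configuration_schema.sql kafka_topics_metadata.txt

# Newest archive first by modification time, take the first entry.
LATEST=$(ls -t "$BACKUP_DIR"/OCNADD_Backup_*.tar.bz2 | head -1)
echo "latest backup: $LATEST"
tar -tjf "$LATEST"    # inspect the DB and Kafka metadata entries here
```

Inside the real pod the same two commands (`ls -t` and `tar -tjf`) run against /work-dir/backup directly.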
7.4.3 Retrieving the OCNADD Backup Files
- Run the Verifying OCNADD Backup procedure to spawn the ocnaddverify-xxxx pod.
- Go to the running ocnaddverify pod to identify and retrieve the desired backup file using the following commands:
  - Run the following command to access the pod:

    kubectl exec -it <ocnaddverify-xxxx> -n <ocnadd-namespace> -- bash

    where,
    <ocnadd-namespace> is the namespace where the ocnadd management group services are running.
    <ocnaddverify-xxxx> is the backup verification pod in the same namespace.
  - Change the directory to /work-dir/backup and identify the backup file "OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2".
  - Exit the ocnaddverify pod.
- Copy the backup file OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2 from the pod to the local bastion server using the following command:

  kubectl cp -n <ocnadd-namespace> <ocnaddverify-xxxx>:/work-dir/backup/<OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> <OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2>

  where,
  <ocnadd-namespace> is the namespace where the ocnadd management group services are running.
  <ocnaddverify-xxxx> is the backup verification pod in the same namespace.

  For example:

  kubectl cp -n ocnadd ocnaddverify-drwzq:/work-dir/backup/OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2
7.4.4 Copying and Restoring the OCNADD backup
- Retrieve the OCNADD backup file.
- Perform the Verifying OCNADD Backup procedure to spawn the ocnaddverify-xxxx pod.
- Copy the backup file from the local bastion server to the running ocnaddverify pod using the following command:

  kubectl cp <OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> <ocnaddverify-xxxx>:/work-dir/backup/<OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2> -n <ocnadd-namespace>

  For example:

  kubectl cp OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 ocnaddverify-mrdxn:/work-dir/backup/OCNADD_BACKUP_10-05-2023_08-00-05.tar.bz2 -n ocnadd

- Go to the ocnaddverify pod and check the path /work-dir/backup/OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2 to verify that the backup has been copied.
- Restore OCNADD using the procedure defined in Creating OCNADD Restore Job.
- Restart the ocnaddconfiguration, ocnaddalarm, ocnaddhealthmonitoring, and ocnaddadminsvc pods.
7.5 Disaster Recovery Scenarios
This chapter describes the disaster recovery procedures for different recovery scenarios.
7.5.1 Scenario 1: Deployment Failure
This section describes how to recover OCNADD when the OCNADD deployment is corrupted.
For more information, see Restoring OCNADD.
7.5.2 Scenario 2: cnDBTier Corruption
This section describes how to recover from cnDBTier corruption. For more information, see Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide. After the cnDBTier recovery, restore the OCNADD database from the previous backup.
To restore the OCNADD database, execute the procedure Creating OCNADD Restore Job by setting BACKUP_ARG to DB.
7.5.3 Scenario 3: Database Corruption
This section describes how to recover from the corrupted OCNADD database.
Perform the following steps to recover the OCNADD configuration database (DB) from the corrupted database:
- Retain the working OCNADD backup by following the Retrieving the OCNADD Backup Files procedure.
- Drop the existing databases by accessing the MySQL DB.
- Perform the Copying and Restoring the OCNADD backup procedure to restore the backup.
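Step 2 above can be sketched as follows. This sketch only generates DROP statements for the three schema names used in this chapter; piping them into a mysql client against cnDBTier is shown as a comment because the host and credentials are assumptions, and it must only be done after a verified backup is retained.

```shell
#!/bin/sh
# Sketch: emit DROP statements for the OCNADD schemas named in this
# chapter. Run these against cnDBTier only after retaining a verified
# backup, since dropping a schema is destructive.
for schema in configuration_schema alarm_schema healthdb_schema; do
  echo "DROP DATABASE IF EXISTS \`$schema\`;"
done
# Hypothetical endpoint/credentials (replace with your own):
#   sh drop_ocnadd_schemas.sh | mysql -h <cndbtier-mysql-fqdn> -u <dbuser> -p
```

Generating the statements separately from executing them gives a chance to review exactly what will be dropped before touching the database.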
7.5.4 Scenario 4: Site Failure
This section describes how to perform fault recovery when the OCNADD site experiences a complete failure.
Perform the following steps in case of a complete site failure:
- Run the Cloud Native Environment (CNE) installation procedure to install a new Kubernetes cluster. For more information, see Oracle Communications Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
- Run the cnDBTier installation procedure. For more information, see Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide.
- For cnDBTier fault recovery, take a data backup from an older site and restore it to a new site. For more information about cnDBTier backup, see "Create On-demand Database Backup" and to restore the database to a new site, see "Restore DB with Backup" in Oracle Communications Cloud Native Core cnDBTier Installation, Upgrade, and Fault Recovery Guide.
- Restore OCNADD. For more information, see Restoring OCNADD.
7.6 Restoring OCNADD
Perform this procedure to restore OCNADD when a fault event has occurred or deployment is corrupted.
Note:
This procedure expects that the OCNADD backup folder is retained.
- Get the retained backup file "OCNADD_BACKUP_DD-MM-YYYY_hh-mm-ss.tar.bz2".
- Get the Helm charts that were used in the earlier deployment.
- Run the following command to uninstall the corrupted OCNADD deployment (Management Group or any Worker Group):

  helm uninstall <release_name> --namespace <namespace>

  where,
  <release_name> is the release name of the OCNADD deployment being uninstalled.
  <namespace> is the namespace of the OCNADD deployment being uninstalled.

  For example, to uninstall the Management Group:

  helm uninstall ocnadd-mgmt --namespace dd-mgmt-group
- Install the Management Group or any Worker Group that was corrupted and uninstalled in the previous step using the Helm charts that were used in the earlier deployment. For the installation procedure, see Installing OCNADD.
- To verify whether OCNADD installation is complete, see Verifying OCNADD Installation.
- Follow the Copying and Restoring the OCNADD backup procedure.
7.7 Creating OCNADD Restore Job
- Restore the OCNADD database by performing the following steps:
  - Go to the custom-templates folder inside the extracted ocnadd-release package and update the ocnadd_restore.yaml file based on the restore requirements:
    - The value of BACKUP_ARG can be set to DB, KAFKA, or ALL. By default, the value is ALL.
    - The value of BACKUP_FILE can be set to the name of the backup folder to be restored; if not mentioned, the latest backup is used.
    - Update other values as follows:

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: ocnaddrestore
        namespace: ocnadd-deploy    #---> update the namespace
      ------------
      spec:
        serviceAccountName: ocnadd-deploy-sa-ocnadd    #---> update the service account name. Format: <serviceAccount>-sa-ocnadd
      ------------
        containers:
          - name: ocnaddrestore
            image: <repo-path>/ocdd.repo/ocnaddbackuprestore:2.0.1    #---> update repository path
      ------------
        initContainers:
          - name: ocnaddinitcontainer
            image: <repo-path>/utils.repo/jdk17-openssl:1.0.6    #---> update repository path
            env:
              - name: BACKUP_ARG
                value: ALL
              - name: BACKUP_FILE
                value: ""    #---> update the backup file name to be restored; if not mentioned, the latest backup is used, for example "OCNADD_Backup_DD-MM-YYYY_hh-mm-ss.tar.bz2"
- Run the following command to run the restore job:

  kubectl create -f ocnadd_restore.yaml

  Note:
  Make sure to delete all the backup, restore, and verify jobs before creating the restore job. Related jobs are ocnaddbackup, ocnaddrestore, ocnaddverify, and ocnaddmanualbackup.
- Wait for the restore job to complete. It usually takes 10 to 15 minutes or more, depending on the size of the backup.
- Restart the ocnaddconfiguration, ocnaddalarm, and ocnaddhealthmonitoring pods once the restore job is completed.
Note:
If the backup is not available for the mentioned date, the pod will be in an error state, notifying "Backup is not available for the given date: $DATE". In such a case, provide the correct backup dates and repeat the procedure.

7.8 Configuring Backup and Restore Parameters
To configure backup and restore parameters, configure the parameters listed in the following table:
Table 7-2 Backup and Restore Parameters
Parameter Name | Data Type | Range | Default Value | Mandatory(M)/Optional(O)/Conditional(C) | Description |
---|---|---|---|---|---|
BACKUP_STORAGE | STRING | - | 20Gi | M | Persistent Volume storage to keep the OCDD backups |
MYSQLDB_NAMESPACE | STRING | - | occne-cndbtierone | M | Mysql Cluster Namespace |
BACKUP_CRONEXPRESSION | STRING | - | 0 8 * * * | M | Cron expression to schedule backup cronjob |
BACKUP_ARG | STRING | - | ALL | M | KAFKA, DB, or ALL backup |
BACKUP_FILE | STRING | - | - | O | Name of the backup folder to be restored |
BACKUP_DATABASES | STRING | - | ALL | M | Individual databases or all databases backup that need to be taken |