7 Fault Recovery

This chapter describes the procedures to perform fault recovery for Oracle Communications Cloud Native Core, Security Edge Protection Proxy (SEPP) deployment.

7.1 Overview

You must back up the databases and restore them on either the same cluster or a different cluster. Run the commands and follow the instructions in this chapter against the SEPP database (MySQL NDB Cluster).

Note:

This section describes recovery procedures to restore SEPP completely or partially.

7.2 Impacted Areas

The following table provides information about the impacted areas during SEPP fault recovery:

Table 7-1 Impacted Areas

Scenario | Requires Fault Recovery or re-install of CNE? | Requires Fault Recovery or re-install of cnDBTier? | Requires Fault Recovery or re-install of SEPP? | Other
Scenario 1: Deployment Failure | No | No | Yes | SEPP DB is not restored. Only helm uninstall/install is done.
Scenario 2: cnDBTier Corruption | No | Yes | No (use helm upgrade if DB configuration is changed) | cnDBTier must be restored from backup, not re-installed. If re-install of cnDBTier is needed, then CNE must also be re-installed.
Scenario 2A: When DBTier failed in Single or Multiple (but not all) Sites | No | Yes | No | NA
Scenario 2B: When DBTier failed in all Sites | No | Yes | No | NA
Scenario 3: Database Corruption | No | No | No | SEPP backup and restore of the configuration database is required on the impacted site. This needs automatic periodic backup.
Scenario 4: Site Failure | Yes | Yes | Yes | NA
Scenario 4A: Single or Multiple Site Failure | Yes | Yes | Yes | NA
Scenario 4B: All Site Failure | Yes | Yes | Yes | NA
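
The table can also be read as a lookup from scenario to the components that need fault recovery or re-install. The following shell sketch encodes the rows above. The function name recovery_plan, the short scenario keys (1, 2, 2A, and so on), and the output format are illustrative assumptions and are not part of any SEPP tooling.

```shell
# Hypothetical helper: map a scenario key from Table 7-1 to the components
# that require fault recovery or re-install. Keys and output format are
# assumptions made for this sketch.
recovery_plan() {
  case "$1" in
    1)        echo "CNE=No cnDBTier=No SEPP=Yes" ;;
    2|2A|2B)  echo "CNE=No cnDBTier=Yes SEPP=No" ;;
    3)        echo "CNE=No cnDBTier=No SEPP=No" ;;
    4|4A|4B)  echo "CNE=Yes cnDBTier=Yes SEPP=Yes" ;;
    *)        echo "unknown scenario: $1" >&2; return 1 ;;
  esac
}
```

For example, recovery_plan 2A prints CNE=No cnDBTier=Yes SEPP=No. The notes in the Other column (such as using helm upgrade when the DB configuration changes) are not captured by this sketch.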

7.3 Prerequisites

Before performing any fault recovery procedure, ensure that the following prerequisites are met:
  1. cnDBTier must be in a healthy state and available on multiple sites along with SEPP. To check the cnDBTier status, perform the following steps:
    1. Run the following command to ensure that all the nodes are connected:
      ndb_mgm> show
    2. Run the following command to check the pod status:
      kubectl get pods -n <namespace>

      If the status of every pod is Running, then cnDBTier is in a healthy state.

    3. Run the following command to check if the replication is up:
      mysql> show slave status\G

      If there is any error, see Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.

    4. Run the following command to check which cnDBTier has ACTIVE replication to take backup:
      select * from replication_info.DBTIER_REPLICATION_CHANNEL_INFO;
  2. Automatic backup must be enabled on cnDBTier. Enabling automatic backup helps in:
    • restoring stable version of the SEPP database.
    • minimizing significant loss of data due to upgrades or roll back failures.
    • minimizing loss of data due to system failure.
    • minimizing loss of data due to data corruption or deletion due to external input.
    • migrating database information from one site to another.
  3. The following files must be available for fault recovery:
    • Custom values file used at the time of network function deployment
    • Helm charts used at the time of network function deployment
    • Secrets and Certificates
    • RBAC resources
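
The status checks in prerequisite 1 can be partly scripted. The sketch below only parses the text that the pod status and replication status commands produce. The function names are hypothetical, and treating any pod status other than Running as unhealthy is an assumption carried over from the rule stated above; pods that legitimately finish (for example, job pods) would need separate handling.

```shell
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every STATUS value
# (third column) is Running.
all_pods_running() {
  awk 'NF { n++; if ($3 != "Running") bad++ }
       END { exit (n > 0 && bad == 0) ? 0 : 1 }'
}

# Reads `show slave status\G` output on stdin.
# Succeeds only if both replication threads report Yes.
replication_up() {
  awk -F': *' '
    { gsub(/^[ \t]+/, "", $1) }            # strip the left padding of \G output
    $1 == "Slave_IO_Running"  { io  = $2 }
    $1 == "Slave_SQL_Running" { sql = $2 }
    END { exit (io == "Yes" && sql == "Yes") ? 0 : 1 }'
}

# Intended use (placeholders as in the steps above):
#   kubectl get pods -n <namespace> --no-headers | all_pods_running
#   mysql -h127.0.0.1 -u <username> -p -e 'show slave status\G' | replication_up
```

Both functions return a zero exit status on success, so they can gate a backup or restore step in a larger script. They do not replace the ndb_mgm node check or the review of the replication channel table.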

7.4 Fault Recovery Scenarios

This section describes the fault recovery procedures for various scenarios.

7.4.1 Scenario 1: Deployment Failure

This scenario describes how to recover SEPP when the deployment is corrupted.

To recover SEPP:

  1. Run the following command to uninstall SEPP:
    helm uninstall <release_name> --namespace <namespace>
     
    Example:
    helm uninstall ocsepp --namespace seppsvc
    For more information about uninstalling SEPP, see the Uninstalling SEPP section.
  2. Install SEPP as described in the Installing SEPP section. Use the backup of custom values file to reinstall SEPP.

Restore SEPP, cnDBTier, and SEPP database (DB) as described in Restoring SEPP and cnDBTier.
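
As a rough sketch of the uninstall and reinstall sequence above, the following function prints the two Helm commands instead of running them, so that they can be reviewed first. The function name, the chart reference, and the values file path are placeholders introduced here. Passing the backed-up custom values file with -f is standard Helm usage, but the exact chart reference and any additional options for your release come from the Installing SEPP section.

```shell
# Hypothetical helper: print the uninstall and reinstall commands for review.
# Arguments: release name, namespace, chart reference, custom values file.
print_redeploy_commands() {
  release=$1 namespace=$2 chart=$3 values=$4
  echo "helm uninstall $release --namespace $namespace"
  echo "helm install $release $chart --namespace $namespace -f $values"
}

# The chart and values file names below are made-up examples.
print_redeploy_commands ocsepp seppsvc ./ocsepp-chart.tgz ./custom_values_backup.yaml
```

Printing the commands rather than executing them keeps the sketch safe to run on any host; remove the echo wrappers only after confirming the release name and namespace.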

7.4.2 Scenario 2: cnDBTier Corruption

This section describes how to recover the database when database corruption has caused cnDBTier to fail in a single site, multiple sites, or all sites.

When a database is corrupted, the databases on the other sites may also become corrupted through data replication, depending on the replication status after the corruption occurs. If the corruption breaks data replication, cnDBTier fails in a single site or in multiple (but not all) sites. If data replication continues successfully, the corruption is replicated to all the cnDBTier sites and cnDBTier fails in all sites.

The following are cnDBTier failure scenarios:

  • If the corrupted database causes replication failure, so that the corruption stays local to a site, follow When cnDBTier failed in Single or Multiple (but not all) Sites.
  • If the corrupted database is replicated to mated sites, follow When cnDBTier failed in all Sites.

    Note:

    This scenario impacts all the NFs using the corrupted cnDBTier. All the NFs sharing the cnDBTier must perform fault recovery because the cnDBTier is corrupted.
7.4.2.1 When cnDBTier failed in Single or Multiple (but not all) Sites

This section describes how to recover the database when data replication is broken due to database corruption and cnDBTier has failed in a single site or in multiple (but not all) sites.

To recover the database:

  1. Uninstall the SEPP Helm chart. For information about uninstalling SEPP, see the Uninstalling SEPP section.
  2. For cnDBTier fault recovery:
    1. Create an on-demand backup from a mated site that has healthy replication with the failed site. For more information about cnDBTier backup, see the "Create On-demand Database Backup" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.
    2. Use the backup data from the mated site for the restore. For more information about cnDBTier restore, see the "Restore Georeplication Failure" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.

      Note:

      The "Restore Georeplication Failure" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide describes a procedure for two sites where one of the clusters has a fatal error. You can perform that procedure for all the sites in a multiple-site setup.
  3. Install the SEPP Helm chart. For more information about installing SEPP, see the Installing SEPP section.
7.4.2.2 When cnDBTier failed in all Sites

This section describes how to recover the database when successful data replication has spread the corruption to all the cnDBTier sites.

To recover the database:

  1. Uninstall the SEPP Helm chart. For more information about uninstalling SEPP, see the Uninstalling SEPP section.
  2. For cnDBTier fault recovery:
    1. Use the on-demand backup file to restore the database from the previous data backup. For more information about cnDBTier restore, see the "Restore Georeplication Failure" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.

      Note:

      The "Restore Georeplication Failure" chapter describes a procedure for two sites where one of the clusters has a fatal error. You can perform that procedure for all the sites in a multiple-site setup.
  3. Install the SEPP Helm chart. For more information about installing SEPP, see the Installing SEPP section.

7.4.3 Scenario 3: Configuration Database Corruption

This scenario describes how to recover SEPP when its configuration database is corrupted.

The configuration database is stored in a site-exclusive database along with its tables. Thus, corruption of the configuration database impacts only a particular site.

To recover the SEPP configuration database, restore the database (DB) backup:
  1. Transfer the <backup_filename>.sql.gz file to the SQL node where you want to restore it.
  2. Log in to MySQL NDB Cluster's SQL node on the new DB cluster and create a new database where the database needs to be restored.
  3. For details on creating database and user and adding permissions, see Configuring Database, Creating Users, and Granting Permissions section.

    Note:

    The database name created in the above step must be the same as the database name used in the following step. The Kubernetes secret must be the same as in the custom_values.yaml file used later for installing SEPP.
  4. Run the following command to restore the backup to the new database:
    gunzip < <backup_filename>.sql.gz | mysql -h127.0.0.1 -u <username> -p <backup-database-name>

    Enter the password when prompted.

    Example:

    gunzip < SEPPdbBackup.sql.gz | mysql -h127.0.0.1 -u dbuser -p seppdb
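
Before piping a backup into mysql, it can help to confirm that the archive exists and is not truncated, for example after transferring it between nodes. The sketch below adds such a check with gzip -t. The function name is hypothetical, and the check is an addition for illustration, not a step from the SEPP procedure.

```shell
# Hypothetical pre-check: succeed only if the backup archive exists and
# passes gzip's built-in integrity test.
backup_is_restorable() {
  [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}

# Intended use (file, user, and database names as in the example above):
#   backup_is_restorable SEPPdbBackup.sql.gz &&
#     gunzip < SEPPdbBackup.sql.gz | mysql -h127.0.0.1 -u dbuser -p seppdb
```

The check verifies only the compressed container. It does not validate the SQL statements inside the backup or confirm that the target database exists.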

7.4.4 Scenario 4: Site Failure

This section describes how to perform fault recovery when one, multiple, or all sites have a software failure. The following are site failure scenarios:

  • Single or Multiple Site Failure
  • All Sites Failure

7.4.4.1 Single or Multiple Site Failure

This scenario applies when one or more sites, but not all sites, have failed and there is a requirement to perform fault recovery. It is assumed that the user has cnDBTier and SEPP installed on multiple sites with automatic data replication and backup enabled.

To recover the failed sites:

  1. Run the Cloud Native Environment (CNE) installation procedure to install a new cluster. For more information, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation Guide.
  2. For cnDBTier fault recovery:
    1. Take an on-demand backup from the mated site that has healthy replication with the failed site or sites. For more information about on-demand backup, see the "Create On-demand Database Backup" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.
    2. Use the backup data from the mated site to restore the database. For more information about database restore, see the "Restore Georeplication Failure" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.
  3. Install the SEPP Helm chart. For more information about installing SEPP, see the Installing SEPP section.
7.4.4.2 All Sites Failure

This scenario applies when all sites have failed, and there is a requirement to perform fault recovery. It is assumed that the user has cnDBTier and SEPP installed on multiple sites with automatic data replication and backup enabled.

To recover all the failed sites:

  1. Run the Cloud Native Environment (CNE) installation procedure to install a new cluster. For more information, see Oracle Communications Cloud Native Core, Cloud Native Environment Installation, Upgrade, and Fault Recovery Guide.
  2. Use the on-demand backup file to restore the database from the previous data backup. For more information about database restore, see the "Restore Georeplication Failure" chapter in the Oracle Communications Cloud Native Core, cnDBTier Installation, Upgrade, and Fault Recovery Guide.

    Note:

    • The auto-data backup file is one that is built from scheduled automatic backup.
    • The "Restore Georeplication Failure" chapter contains a procedure for two sites where one of the clusters has a fatal error. You can perform the same procedure for all the sites in a multiple-site setup.
  3. Install the SEPP Helm chart. For more information about installing SEPP, see the Installing SEPP section.