25 Configuring Disaster Recovery in ECE Cloud Native

Learn how to set up your Oracle Communications Elastic Charging Engine (ECE) cloud native services for disaster recovery.

Topics in this document:

Setting Up Active-Active Disaster Recovery for ECE

Processing Usage Requests on Site Receiving Request

Stopping ECE from Routing to a Failed Site

Adding Fixed Site Back to ECE System

Activating a Secondary Rated Event Formatter Instance

Getting Rated Event Formatter Checkpoint Information

Setting Up Active-Active Disaster Recovery for ECE

Disaster recovery provides continuity of service for your customers and guards against data loss if a system fails. In ECE cloud native, disaster recovery is implemented by configuring two or more active production sites at different geographical locations. If one production site fails, another active production site takes over the traffic from the failed site.

During operation, ECE requests are routed across the production sites based on your load-balancing configuration. All updates that occur in an ECE cluster at one production site are replicated to other production sites through the Coherence cache federation.

For more information about the active-active disaster recovery configuration, see "About the Active-Active System" in BRM System Administrator's Guide.

To configure ECE cloud native for active-active disaster recovery:

  1. In each Kubernetes cluster, expose the required ports on an external IP address by using a Kubernetes LoadBalancer service.

    The ECE Helm chart includes a sample YAML file for the LoadBalancer service (oc-cn-ece-helm-chart/templates/ece-service-external.yaml) that you can configure for your environment.
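
    The following is a minimal, illustrative sketch of such a LoadBalancer service, using the port numbers from the examples in this procedure. The service name and pod selector are placeholders; align them with the sample file in the Helm chart and your actual deployment:

      apiVersion: v1
      kind: Service
      metadata:
        name: ece-server-external     # placeholder name
      spec:
        type: LoadBalancer
        selector:
          app: ece                    # placeholder pod label
        ports:
          - name: coherence
            port: 31015
            targetPort: 31015
          - name: brmfederated
            port: 31016
            targetPort: 31016
          - name: xreffederated
            port: 31017
            targetPort: 31017
          - name: replicatedfederated
            port: 31018
            targetPort: 31018
          - name: offerprofilefederated
            port: 31019
            targetPort: 31019
          - name: jmx
            port: 31022
            targetPort: 31022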

  2. On your primary production site, update the override-values.yaml file with the external IP of the LoadBalancer service, the federation-related parameters, the JMX port for the monitoring agent, the active-active disaster recovery parameters, and so on.

    The following shows example override-values.yaml file settings for a primary production site:

    monitoringAgent:
       monitoringAgentList:
          - name: "monitoringagent1"
            replicas: 1
            jmxport: "31020"
            jmxEnabled: "true"
            jvmJMXOpts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.password.file=../config/jmxremote.password -Dsecure.access.name=admin -Dcom.sun.management.jmxremote.port=31020 -Dcom.sun.management.jmxremote.rmi.port=31020"
            jvmOpts: "-Djava.net.preferIPv4Addresses=true"
            jvmGCOpts: ""
            restartCount: "0"
            nodeSelector: "node1"
          - name: "monitoringagent2"
            replicas: 1
            jmxport: "31021"
            jmxEnabled: "true"
            jvmJMXOpts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.password.file=../config/jmxremote.password -Dsecure.access.name=admin -Dcom.sun.management.jmxremote.port=31021 -Dcom.sun.management.jmxremote.rmi.port=31021"
            jvmOpts: "-Djava.net.preferIPv4Addresses=true"
            jvmGCOpts: ""
            restartCount: "0"
            nodeSelector: "node2"
    charging:
       jmxport: "31022"
       coherencePort: "31015"
    ...
    ...
       clusterName: "BRM"
       isFederation: "true"
       primaryCluster: "true"
       secondaryCluster: "false"
       clusterTopology: "active-active"
       cluster:
          primary:
             clusterName: "BRM"
             eceServiceName: ece-server
             eceServicefqdnOrExternalIp: "0.1.2.3"
          secondary:
             - clusterName: "BRM2"
               eceServiceName: ece-server
               eceServicefqdnOrExternalIp: "0.1.2.3"
       federatedCacheScheme:
          federationPort:
             brmfederated: 31016
             xreffederated: 31017
             replicatedfederated: 31018
             offerProfileFederated: 31019
  3. On your secondary production site, update the override-values.yaml file with the external IP of the LoadBalancer service, the federation-related parameters, the JMX port for the monitoring agent, the active-active disaster recovery parameters, and so on.

    The following shows example settings in an override-values.yaml for a secondary production site:

    monitoringAgent:
       monitoringAgentList:
          - name: "monitoringagent1"
            replicas: 1
            jmxport: "31020"
            jmxEnabled: "true"
            jvmJMXOpts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.password.file=../config/jmxremote.password -Dsecure.access.name=admin -Dcom.sun.management.jmxremote.port=31020 -Dcom.sun.management.jmxremote.rmi.port=31020"
            jvmOpts: "-Djava.net.preferIPv4Addresses=true"
            jvmGCOpts: ""
            restartCount: "0"
            nodeSelector: "node1"
          - name: "monitoringagent2"
            replicas: 1
            jmxport: "31021"
            jmxEnabled: "true"
            jvmJMXOpts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.password.file=../config/jmxremote.password -Dsecure.access.name=admin -Dcom.sun.management.jmxremote.port=31021 -Dcom.sun.management.jmxremote.rmi.port=31021"
            jvmOpts: "-Djava.net.preferIPv4Addresses=true"
            jvmGCOpts: ""
            restartCount: "0"
            nodeSelector: "node2"
    charging:
       jmxport: "31022"
       coherencePort: "31015"
    ...
    ... 
      clusterName: "BRM2"
       isFederation: "true"
       primaryCluster: "false"
       secondaryCluster: "true"
       clusterTopology: "active-active"
       cluster:
          primary:
             clusterName: "BRM"
             eceServiceName: ece-server
             eceServicefqdnOrExternalIp: "0.1.2.3"
          secondary:
             - clusterName: "BRM2"
               eceServiceName: ece-server
               eceServicefqdnOrExternalIp: "0.1.2.3"
       federatedCacheScheme:
          federationPort:
             brmfederated: 31016
             xreffederated: 31017
             replicatedfederated: 31018
             offerProfileFederated: 31019
  4. On your primary and secondary production sites, add the customerGroupConfigurations and siteConfigurations sections to the override-values.yaml file.

    The following shows example settings to add to the override-values.yaml file in your primary and secondary production sites:

    customerGroupConfigurations:
       - name: "customergroup1"
         clusterPreference:
           - priority: "1"
             routingGatewayList: "0.1.2.3:31500"
             name: "BRM"
           - priority: "2"
             routingGatewayList: "0.1.2.3:31500"
             name: "BRM2"
       - name: "customergroup2"
         clusterPreference:
           - priority: "2"
             routingGatewayList: "0.1.2.3:31500"
             name: "BRM"
           - priority: "1"
             routingGatewayList: "0.1.2.3:31500"
             name: "BRM2"
    siteConfigurations:
       - name: "BRM"
         affinitySiteNames: "BRM2"
         monitorAgentJmxConfigurations:
           - name: "monitoringagent1"
             host: "node1"
             jmxPort: "31020"
             disableMonitor: "true"
           - name: "monitoringagent2"
             host: "node2"
             jmxPort: "31021"
             disableMonitor: "true"
       - name: "BRM2"
         affinitySiteNames: "BRM"
         monitorAgentJmxConfigurations:
           - name: "monitoringagent1"
             host: "node1"
             jmxPort: "31020"
             disableMonitor: "true"
           - name: "monitoringagent2"
             host: "node2"
             jmxPort: "31021"
             disableMonitor: "true"
  5. In your override-values.yaml file, configure kafkaConfigurationList with both primary and secondary site Kafka details.

    The following shows example settings to add to the override-values.yaml file in your primary and secondary production sites:

    kafkaConfigurationList:
       -  name: "BRM"
          hostname: "hostname:port"
          topicName: "ECENotifications"
          suspenseTopicName: "ECESuspenseQueue"
          partitions: "200"
          kafkaProducerReconnectionInterval: "120000"
          kafkaProducerReconnectionMax: "36000000"
          kafkaDGWReconnectionInterval: "120000"
          kafkaDGWReconnectionMax: "36000000"
          kafkaBRMReconnectionInterval: "120000"
          kafkaBRMReconnectionMax: "36000000"
          kafkaHTTPReconnectionInterval: "120000"
          kafkaHTTPReconnectionMax: "36000000"
       -  name: "BRM2"
          hostname: "hostname:port"
          topicName: "ECENotifications"
          suspenseTopicName: "ECESuspenseQueue"
          partitions: "200"
          kafkaProducerReconnectionInterval: "120000"
          kafkaProducerReconnectionMax: "36000000"
          kafkaDGWReconnectionInterval: "120000"
          kafkaDGWReconnectionMax: "36000000"
          kafkaBRMReconnectionInterval: "120000"
          kafkaBRMReconnectionMax: "36000000"
          kafkaHTTPReconnectionInterval: "120000"
          kafkaHTTPReconnectionMax: "36000000"
  6. If data persistence is enabled, configure a primary and a secondary Rated Event Formatter instance for each site in the ratedEventFormatter section of the override-values.yaml file on both your primary and secondary production sites.

    The following shows example settings to add to the override-values.yaml file in your primary and secondary production sites:

    ratedEventFormatter:
       ratedEventFormatterList:
          - ratedEventFormatterConfiguration:
               name: "ref_site1_primary"
               partition: "1"
               connectionName: "oracle1"
               siteName: "site1"
               threadPoolSize: "2"
               retainDuration: "0"
               ripeDuration: "30"
               checkPointInterval: "20"
               pluginPath: "ece-ratedeventformatter.jar"
               pluginType: "oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
               pluginName: "brmCdrPluginDC1Primary"
               noSQLBatchSize: "25"
          - ratedEventFormatterConfiguration:
               name: "ref_site1_secondary"
               partition: "1"
               connectionName: "oracle2"
               siteName: "site1"
               primaryInstanceName: "ref_site1_primary"
               threadPoolSize: "2"
               retainDuration: "0"
               ripeDuration: "30"
               checkPointInterval: "20"
               pluginPath: "ece-ratedeventformatter.jar"
               pluginType: "oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
               pluginName: "brmCdrPluginDC1Primary"
               noSQLBatchSize: "25"
          - ratedEventFormatterConfiguration:
               name: "ref_site2_primary"
               partition: "1"
               connectionName: "oracle2"
               siteName: "site2"
               threadPoolSize: "2"
               retainDuration: "0"
               ripeDuration: "30"
               checkPointInterval: "20"
               pluginPath: "ece-ratedeventformatter.jar"
               pluginType: "oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
               pluginName: "brmCdrPluginDC1Primary"
               noSQLBatchSize: "25"
          - ratedEventFormatterConfiguration:
               name: "ref_site2_secondary"
               partition: "1"
               connectionName: "oracle1"
               siteName: "site2"
               primaryInstanceName: "ref_site2_primary"
               threadPoolSize: "2"
               retainDuration: "0"
               ripeDuration: "30"
               checkPointInterval: "20"
               pluginPath: "ece-ratedeventformatter.jar"
               pluginType: "oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
               pluginName: "brmCdrPluginDC1Primary"
               noSQLBatchSize: "25"

    The siteName property determines the site whose rated events the instance processes, which lets you configure secondary instances as backups for remote sites. In this sample, the ref_site1_secondary instance runs at site 2 but processes rated events federated from site 1 if site 1 has an outage.

    For more information about Rated Event Formatter in active-active systems, see "About Rated Event Formatter in a Persistence-Enabled Active-Active System" in BRM System Administrator's Guide.

  7. Depending on whether persistence is enabled in ECE, do one of the following:

    • If persistence is enabled, add the cachePersistenceConfigurations and connectionConfigurations.OraclePersistenceConnectionConfigurations sections to your override-values.yaml file on both primary and secondary production sites.

      The following shows example settings to add to the override-values.yaml file on your primary and secondary sites:

       cachePersistenceConfigurations:
            cachePersistenceConfigurationList:
              -  clusterName: "BRM"
                 persistenceStoreType: "OracleDB"
                 persistenceConnectionName: "oraclePersistence1"
      ...
      ...
              -  clusterName: "BRM2"
                 persistenceStoreType: "OracleDB"
                 persistenceConnectionName: "oraclePersistence2"
      ...
      ...
         connectionConfigurations:
               OraclePersistenceConnectionConfigurations:
                  - clusterName: "BRM"
                    name: "oraclePersistence1"
      ...
      ...
                  - clusterName: "BRM2"
                    name: "oraclePersistence2"
      ...
      ...
    • If persistence is disabled, add the ratedEventPublishers and NoSQLConnectionConfigurations sections to your override-values.yaml file on primary and secondary production sites.

      The following shows example settings to add to the override-values.yaml file on your primary and secondary sites:

         ratedEventPublishers:
           -  clusterName: "BRM"
              noSQLConnectionName: "noSQLConnection1"
              threadPoolSize: "4"
           -  clusterName: "BRM2"
              noSQLConnectionName: "noSQLConnection2"
              threadPoolSize: "4"
         connectionConfigurations:
               NoSQLConnectionConfigurations:
                  - clusterName: "BRM"
                    name: "noSQLConnection1"
      ...
      ...
                  - clusterName: "BRM2"
                    name: "noSQLConnection2"
      ...
      ...
  8. Deploy the ECE Helm chart (oc-cn-ece-helm-chart) on the primary cluster and bring the primary cluster to the Usage Processing state.
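
    For example, assuming the placeholder names used elsewhere in this document, the deployment command has this general form:

      helm install EceReleaseName oc-cn-ece-helm-chart --values OverrideValuesFile -n BrmNameSpace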

  9. Invoke federation from the primary production site to your secondary production sites by connecting to the ecs1 pod through JConsole:

    1. Update the label for the ecs1-0 pod:

      kubectl label -n NameSpace po ecs1-0 ece-jmx=ece-jmx-external
    2. Update the /etc/hosts file on the remote machine with the IP address of the worker node that hosts the ecs1-0 pod:

      IP_OF_WORKER_NODE ecs1-0.ece-server.namespace.svc.cluster.local
    3. Connect to JConsole:

      jconsole ecs1-0.ece-server.namespace.svc.cluster.local:31022

      JConsole starts.

    4. Invoke start() and replicateAll() with the secondary production site name from the coordinator node of each federated cache in JMX. To do so:

      1. Expand the Coherence node, expand Federation, expand BRMFederatedCache, expand Coordinator, and then expand Coordinator. Click start(BRM2) and replicateAll(BRM2), where BRM2 is the secondary production site name.

      2. Expand the Coherence node, expand Federation, expand OfferProfileFederatedCache, expand Coordinator, and then expand Coordinator. Click start(BRM2) and replicateAll(BRM2).

      3. Expand the Coherence node, expand Federation, expand ReplicatedFederatedCache, expand Coordinator, and then expand Coordinator. Click start(BRM2) and replicateAll(BRM2).

      4. Expand the Coherence node, expand Federation, expand XRefFederatedCache, expand Coordinator, and then expand Coordinator. Click start(BRM2) and replicateAll(BRM2).

    5. From the secondary production site, verify that data is being federated from the primary production site to the secondary production sites, and that all pods are running.
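
      For example, you can confirm the pod status on the secondary site with a standard kubectl query, where NameSpace is the namespace of the secondary site's ECE deployment:

        kubectl get pods -n NameSpace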

    After federation completes, your primary and secondary production sites move to the Usage Processing state, and the monitoring agent pods are spawned.

  10. When all pods are ready on each site, scale down and then scale up the monitoring agent pods in each production site. This synchronizes the monitoring agent pods with the other pods in the cluster.

    Note:

    After the monitoring agent is initialized, repeat these steps whenever you scale any pod up or down.

    1. Scale down monitoringagent1 to 0:

      kubectl -n NameSpace scale deploy monitoringagent1 --replicas=0
    2. Wait for monitoringagent1 to stop, and then scale it back up to 1:

      kubectl -n NameSpace scale deploy monitoringagent1 --replicas=1
    3. Scale down monitoringagent2 to 0:

      kubectl -n NameSpace scale deploy monitoringagent2 --replicas=0
    4. Wait for monitoringagent2 to stop, and then scale it back up to 1:

      kubectl -n NameSpace scale deploy monitoringagent2 --replicas=1
  11. Verify from the monitoring agent logs that metrics are being collected.
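
    For example, you can inspect the logs with kubectl, assuming the monitoring agent deployment names used earlier in this procedure:

      kubectl -n NameSpace logs deploy/monitoringagent1
      kubectl -n NameSpace logs deploy/monitoringagent2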

Processing Usage Requests on Site Receiving Request

By default, the ECE active-active disaster recovery mode processes usage requests according to the preferred site assignments in the customerGroupConfigurations list. For example, if subscriber A's preferred primary site is site 1, ECE processes subscriber A's usage requests on site 1. If subscriber A's usage request is received by production site 2, it is sent to production site 1 for processing.

You can configure the ECE active-active mode to process usage requests on the site that receives the request, regardless of the subscriber's preferred site. For example, if a subscriber's usage request is received by production site 1, it is processed on production site 1. Similarly, if the usage request is received by production site 2, it is processed on production site 2.

Note:

This configuration does not apply to usage charging requests for sharing group members. Usage requests for sharing group members are processed on the same site as the sharing group parent.

To configure the ECE active-active mode to process usage requests on the site that receives the request irrespective of the subscriber's preferred site:

  1. In your override-values.yaml file for oc-cn-ece-helm-chart, set the charging.brsConfigurations.brsConfigurationList.brsConfig.skipActiveActivePreferredSiteRouting key to true.
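
    The following sketch shows where this key sits in the override-values.yaml file. It follows the dotted key path above literally; if brsConfigurationList is structured as a list in your chart, set the key on the relevant list entry instead:

      charging:
         brsConfigurations:
            brsConfigurationList:
               brsConfig:
                  skipActiveActivePreferredSiteRouting: "true"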

  2. Run the helm upgrade command to update your ECE Helm release:

    helm upgrade EceReleaseName oc-cn-ece-helm-chart --values OverrideValuesFile -n BrmNameSpace

    where:

    • EceReleaseName is the release name for oc-cn-ece-helm-chart and is used to track the installation instance.

    • OverrideValuesFile is the path to the YAML file that overrides the default configurations in the oc-cn-ece-helm-chart/values.yaml file.

    • BrmNameSpace is the namespace in which to create BRM Kubernetes objects for the BRM Helm chart.
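
    For example, with a hypothetical release named ece, an override file named override-values.yaml, and a namespace named brm:

      helm upgrade ece oc-cn-ece-helm-chart --values override-values.yaml -n brm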

Stopping ECE from Routing to a Failed Site

When an active production site fails, you must notify the monitoring agent about the failed site. This stops ECE from routing requests to the failed production site.

To notify the monitoring agent about a failed production site:

  1. Connect to the monitoring agent through JConsole:

    1. Update /etc/hosts with the worker IP of the monitoringagent1 pod:

      worker_IP ece-monitoringagent-service-1
    2. Connect through JConsole by running this command:

      jconsole ece-monitoringagent-service-1:31020

      JConsole starts.

  2. Expand the ECE Monitoring node.
  3. Expand Agent.
  4. Expand Operations.
  5. Run the failoverSite() operation, passing the name of the failed production site as its parameter.

You can also use the activateSecondaryInstanceFor operation to fail over to a backup Rated Event Formatter as described in "Activating a Secondary Rated Event Formatter Instance." See "Resolving Rated Event Formatter Instance Outages" in BRM System Administrator's Guide for conceptual information about how to resolve Rated Event Formatter outages.

Adding Fixed Site Back to ECE System

Notify the monitoring agent after a failed production site starts functioning again. This allows ECE to route requests to the site again.

To add a fixed site back to the ECE disaster recovery system:

  1. Connect to the monitoring agent through JConsole:

    1. Update /etc/hosts with the worker IP of the monitoringagent1 pod:

      worker_IP ece-monitoringagent-service-1
    2. Connect through JConsole by running this command:

      jconsole ece-monitoringagent-service-1:31020

      JConsole starts.

  2. Expand the ECE Monitoring node.

  3. Expand Agent.

  4. Expand Operations.

  5. Run the recoverSite() operation, passing the name of the recovered production site as its parameter.

Activating a Secondary Rated Event Formatter Instance

If a primary Rated Event Formatter instance is down, you can activate a secondary instance to take over rated event processing.

To activate a secondary Rated Event Formatter instance:

  1. Connect to the ratedeventformatter pod through JConsole by doing the following:

    1. Update the label for the ratedeventformatter pod:

      kubectl label -n NameSpace po ratedeventformatter1-0 ece-jmx=ece-jmx-external

      Note:

      The ece-jmx-service-external service has only one endpoint: the IP address of the ratedeventformatter pod.

    2. Update the /etc/hosts file on the remote machine with the IP address of the worker node that hosts the ratedeventformatter pod:

      IP_OF_WORKER_NODE ratedeventformatter1-0.ece-server.namespace.svc.cluster.local
    3. Connect through JConsole by running this command:

      jconsole ratedeventformatter1-0.ece-server.namespace.svc.cluster.local:31022

      JConsole starts.

  2. Expand the ECE Monitoring node.

  3. Expand RatedEventFormatterMetrics.

  4. Expand Operations.

  5. Run the activateSecondaryInstance operation.

    The secondary Rated Event Formatter instance begins processing rated events.

Getting Rated Event Formatter Checkpoint Information

You can retrieve information about the last Rated Event Formatter checkpoint committed to the database.

To retrieve information about the last Rated Event Formatter checkpoint:

  1. Connect to the ecs1 pod through JConsole. See "Creating a JMX Connection to ECE Using JConsole" for more information.

  2. Expand the ECE Configuration node.

  3. Expand the database connection you want checkpoint information from.

  4. Expand Operations.

  5. Run the queryRatedEventCheckPoint operation.

    Checkpoint information appears for all Rated Event Formatter instances that use the database connection. The information includes the site, schema, and plugin names, and the time of the most recent checkpoint.