84 Configuring ECE for Disaster Recovery

Learn how to configure Oracle Communications Billing and Revenue Management Elastic Charging Engine (ECE) for disaster recovery.

Topics in this document:

  • Introduction
  • Business Continuity with ECE Disaster Recovery
  • About Deployment Modes with Geo-Redundancy
  • About Load Balancing in an Active-Active System
  • About Rated Event Formatter in a Persistence-Enabled Active-Active System
  • About CDR Generator in an Active-Active System
  • Configuring an Active-Active System
  • Including Custom Clients in Your Active-Active Configuration
  • Including Offline Mediation Controller in Your Active-Active Configuration
  • Failing Over to a Backup Site (Active-Active)
  • Switching Back to the Original Production Site (Active-Active)
  • Processing Usage Requests in the Site Received
  • Replicating ECE Cache Data
  • Migrating ECE Notifications

Introduction

Oracle Communications BRM offers a disaster recovery (DR) architecture that ensures business continuity in the event of an unexpected site failure. BRM disaster recovery capabilities provide continuity of service usage for your end customers and minimize data loss if a system failure occurs. BRM supports deployment models designed to meet disaster recovery and business continuity needs.

An Oracle Communications Charging, Billing and Revenue Management deployment includes functional components that manage business functionality end to end, from subscription acquisition to usage charging, billing, and revenue management. The functional components are:
  1. Billing and Revenue Management (BRM) server

  2. Pricing Design Center (PDC)

  3. Elastic Charging Engine (ECE)

  4. Offline Mediation Controller (OCOMC)

Business Continuity with ECE Disaster Recovery

Customers deploying BRM products also need measures for when a disaster strikes. A disaster can be a power outage, hardware burn-out, or a site becoming unavailable due to natural calamities such as floods or earthquakes. For hardware burn-out, or even localized power outages, local redundancy can be provided by deploying the products in highly available mode wherever possible.

Because of their distributed architecture, BRM components can be deployed with local redundancy for high availability.

For deploying with geo-redundancy, Table 84-1 shows the available deployment modes, along with recommended options for each of the functional components.

Table 84-1 Available and Recommended Deployment Modes

Component | Deployment Mode
Billing and Revenue Management (BRM) Server | Active-Hot Standby
Pricing Design Center (PDC) | Active-Hot Standby
Elastic Charging Engine (ECE) | Active-Active
Offline Mediation Controller (OCOMC) | Active-Standby

It is essential to continue processing transactions in the event of a site failure, so choose an adequate deployment mode based on the recovery objectives set by your business. BRM components let you choose a deployment mode, component by component, that fits your business needs.

About Deployment Modes with Geo-Redundancy

You can deploy the BRM Server, PDC, and ECE components with geo-redundancy to improve business continuity.

Deploying BRM Server and PDC with Geo-Redundancy

BRM Server and PDC must be deployed for charging with ECE, and both use Oracle Database to store data. Database replication using Oracle Data Guard is recommended for replicating the data across sites.

Figure 84-1 shows the disaster recovery deployment modes for BRM and PDC.

Figure 84-1 Deployment Modes for BRM and PDC



In this deployment, BRM and PDC run in Active-Hot Standby mode, and the database is continuously replicated in real time from the active site to the standby site using Data Guard or Active Data Guard, ensuring minimal data loss. This mode requires monitoring of the sites and manual intervention when a site failure occurs.

Within a given site, local redundancy for BRM Server ensures continued availability of the system. BRM supports multi-instance configuration of Connection Managers and Data Managers for high availability, so that transactions are processed through available BRM processes connected to the same database instance.

Deploying ECE with Geo-Redundancy

Elastic Charging Engine is deployed in Active-Active mode. Active-Active mode uses Coherence federation-based data replication across sites and requires Coherence Grid Edition. This mode is more beneficial because data is processed on both sites, rather than one site sitting idle in standby mode while its processes are running.

Figure 84-2 shows the Active-Active deployment mode for disaster recovery for ECE.

Figure 84-2 Active-Active Deployment Mode for ECE



Active-Active mode consists of two ECE sites, where all ECE sites are able to actively process charging requests simultaneously. Each ECE site’s cache holds all subscriber data and pricing configuration data, and ECE cache data is asynchronously replicated among all of the ECE sites using Coherence cache federation so that the cache data in all of the ECE sites remains synchronized.

In Active-Active mode, all subscribers belonging to a sharing group are processed in the same ECE site to ensure there is no revenue exposure from concurrent processing on different ECE sites affecting the same shared balance. For subscribers that do not belong to a sharing group, the operator can configure ECE to process charging traffic in one of the following ways:
  1. Local processing mode: Requests for all subscribers are processed in the ECE site where they arrive from the network. In this mode, only requests for sharing group members may be forwarded to another site where the shared balance is managed. This is the recommended mode for better processing rates.

  2. Preferred site processing mode: All requests for a given subscriber are processed in the same ECE site, controlled by an ordered list of preferred ECE sites for each subscriber grouping. In this mode, a charging request arriving at a non-preferred ECE site is forwarded to the preferred ECE site for processing. For example, all members of a sharing group are processed in the preferred site where the shared balance is managed.

All active ECE sites interface with one active BRM and PDC instance as shown. Any updates from BRM will be processed by one of the active ECE sites and will be synchronized to all other active ECE sites via Coherence federation. Rated events created in each active ECE site will be processed in that site and loaded into the BRM database via the configured method.

In Figure 84-2, if ECE site 2 fails, ECE site 1 will automatically be able to handle the entire network’s charging traffic. The operator should manually mark ECE site 2 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 2 until the problem is corrected.

In Figure 84-2, if ECE site 1 fails, ECE site 2 will automatically be able to handle the entire network’s charging traffic; however, in this case manual steps are required to update the configuration to ensure that updates from Active BRM and PDC are processed in ECE site 2 without reliance on ECE site 1. The operator should manually mark ECE site 1 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 1 until the problem is corrected.

For all Active-Active deployments, the operator should ensure that each ECE site is properly sized to handle the expected load in the worst-case failure scenario.

If, in the above configuration, you want to process data on only one site, the other site stays in Hot Standby mode. This deployment mode is called Active-Hot Standby and is shown in Figure 84-3. When both sites are configured for processing, it is usually preferable to process traffic on both sites rather than keep one site idle, so we recommend deploying in Active-Active mode; it provides the best RTO and RPO of all the deployment modes.

Figure 84-3 Active-Hot Standby Deployment Mode for ECE



The configuration for Active-Hot Standby mode is the same as that for an Active-Active system.

If an operator deploys all components as indicated in Table 84-1, the deployment is as shown in Figure 84-4:

Figure 84-4 Site Deployment with Recommended Options



BRM and PDC are always in Active-Hot Standby mode. The database from the active BRM and PDC site is replicated to one or more standby sites. If the active deployment fails, manual intervention is needed to make one of the BRM/PDC standby sites active and to reconfigure the new active BRM/PDC instance to communicate with the existing primary ECE, or with the ECE instance in the same site.

ECE is in Active-Active mode. Pricing updates from PDC and customer updates from BRM will be processed by one of the active ECE sites (ECE site 1 in Figure 84-2) and will be synchronized to the other active ECE site via Coherence federation. Rated events created in an active ECE site will be processed in that site and loaded into the Active BRM database via the configured method.

If there is a failure of an ECE site, then:
  • The remaining active ECE site is automatically able to handle the load redistributed from the core network.

  • The operator should manually mark the failed ECE site as failed as soon as this condition is observed to ensure that no traffic processing is attempted by that site until the problem is corrected.

  • If the failed ECE site was the primary ECE site being used by the active BRM and PDC, manual steps are required to update the configuration to ensure that updates from the active BRM and PDC are processed in the other active ECE site.

  • If the network gateways remain in service at the failed ECE site (that is, a partial ECE site failure), the network gateways may need to be manually turned down to force the network clients to redistribute charging traffic to the remaining active ECE site.

OCOMC is in Active-Standby mode. OCOMC processes offline rating requests, which are generally processed on one site to keep the deployment simple and easily manageable.

About Load Balancing in an Active-Active System

In an active-active system, EM Gateway routes BRM update requests across sites based on the app and site configurations to ensure load balancing.

EM Gateway routes connection requests to Diameter Gateway, RADIUS Gateway, and HTTP Gateway nodes in one of the active sites. If the site does not respond, the request is rerouted to the backup production site.

You can set up load balancing configuration based on your requirements.

About Rated Event Formatter in a Persistence-Enabled Active-Active System

When data persistence is enabled, each site in an active-active system has a primary Rated Event Formatter instance for each schema, and at least one secondary instance for each schema.

As rated events are created, the following happens on each site:

  1. ECE creates rated events and commits them to the Coherence cache. Each rated event created by ECE includes the Coherence cluster name of the site where it was created.
  2. The Coherence federation service replicates the events to the remote sites, as it does for other federated objects.
  3. Coherence caching persists the events to the database in batches. Each schema at each site has its own rated event database table.
  4. The primary Rated Event Formatter instance processes all rated events from the corresponding site-specific database table.
  5. The primary Rated Event Formatter instance commits the formatted events to the cache as a checkpoint. The site name is included in the checkpoint data, along with the schema number, timestamp, and plugin type.
  6. The Coherence federation service replicates the checkpoint to the remote sites, as it does for other federated objects. The remote site ECE servers then purge the events persisted in the checkpoint from the database in batches by schema and by site.
  7. Coherence caching persists the checkpoint to the database to be consumed by Rated Event Loader. Checkpoints are grouped by schema and by site.
  8. The ECE server purges the events related to the persisted checkpoint. Events are purged from the database in batches by schema and by site.

Remote sites that receive federated events and checkpoints similarly persist them to and purge them from the database, in site and schema-specific database tables. In this way, all sites contain the same rated events and checkpoints, no matter where they were generated, and each rated event and checkpoint retains information about the site that generated it. If the Rated Event Formatter instance at any one site is down, a secondary instance at a remote site can immediately begin processing the rated events, preserving the site-specific information as though it were the original site. See "Resolving Rated Event Formatter Instance Outages".

Resolving Rated Event Formatter Instance Outages

If a primary Rated Event Formatter instance is down, take one of the following approaches, depending on whether the outage is planned or unplanned, and considering your operational needs:

  • Planned outage: Primary instance finishes processing: Choose this option for planned outages, when rating stops but the primary Rated Event Formatter instance can keep processing.
    1. After no new rated events are being generated by the site, wait until the local Rated Event Formatter has finished processing all rated events from the site.
    2. In the remote sites, drop or truncate the rated event database table for the rated events federated from the site with the outage. Dropping the table means you must recreate it and its indexes after resolving the outage.
    3. Stop the Rated Event Formatter at the site with the outage.
    4. When the outage is resolved, you can start Rated Event Formatter again to resume processing events.
  • Unplanned outage: Secondary instance takes over processing: Choose this option for unplanned outages, when the primary Rated Event Formatter is also down. After failing over to the backup site as described in "Failing Over to a Backup Site (Active-Active)", perform the following tasks:
    1. Confirm that the last successful Rated Event Formatter checkpoint for the local site matches the one federated to the remote site. You can use the JMX queryRatedEventCheckPoint operation in the ECE configuration MBeans. See "Getting Rated Event Formatter Checkpoint Information".
    2. If needed, start the secondary Rated Event Formatter instance on the remote site.
    3. Activate the secondary Rated Event Formatter instance on the remote site using the JMX activateSecondaryInstance operation in the ECE monitoring MBeans. See "Activating a Secondary Rated Event Formatter Instance".

      The secondary instance takes over processing the federated rated events as though it were the primary instance at the site with the outage. The events and checkpoints are persisted in the database tables for the original site, not the remote site.

    4. Wait until the secondary instance has finished processing all rated events federated from the site with the outage.
    5. At the site with the outage, drop or truncate the rated event database table for local events. Dropping the table means you must recreate it and its indexes after resolving the outage.
    6. Stop the secondary Rated Event Formatter instance.
    7. When the outage is resolved and the site has been recovered as described in "Switching Back to the Original Production Site (Active-Active)", restart the primary Rated Event Formatter again to resume processing events at the local site. If you had the secondary Rated Event Formatter instance running at the remote site before the outage, restart it too.

Getting Rated Event Formatter Checkpoint Information

You can retrieve information about the last Rated Event Formatter checkpoint committed to the database.

To retrieve information about the last Rated Event Formatter checkpoint:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.
  3. Expand the database connection you want checkpoint information from.
  4. Expand Operations.
  5. Run the queryRatedEventCheckPoint operation.

    Checkpoint information appears for all Rated Event Formatter instances using the database connection. Information includes site, schema, and plugin names as well as the time of the most recent checkpoint.

Activating a Secondary Rated Event Formatter Instance

If a primary Rated Event Formatter instance is down, you can activate a secondary instance to take over rated event processing.

To activate a secondary Rated Event Formatter instance:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Monitoring node.
  3. Expand RatedEventFormatterMatrices.
  4. Expand Operations.
  5. Run the activateSecondaryInstance operation.

    The secondary Rated Event Formatter instance begins processing rated events.
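
You can also script these two JConsole procedures over the standard JMX remote API, which can be convenient during an unplanned outage. The following Java sketch is illustrative only and is not part of the documented ECE tooling: the host, port, and MBean ObjectNames are placeholders that you must copy from your own JConsole tree, and it assumes that both operations can be invoked without arguments (verify the operation signatures in JConsole before relying on it).

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ActivateSecondaryFormatter {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the database-connection MBean (copy it from JConsole)
          // args[2]: ObjectName of the RatedEventFormatterMatrices MBean (copy it from JConsole)
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Check the last committed checkpoint before activating the secondary instance.
              Object checkpoint = mbs.invoke(new ObjectName(args[1]),
                      "queryRatedEventCheckPoint", new Object[0], new String[0]);
              System.out.println("Last checkpoint: " + checkpoint);
              // Activate the secondary Rated Event Formatter instance.
              mbs.invoke(new ObjectName(args[2]),
                      "activateSecondaryInstance", new Object[0], new String[0]);
          } finally {
              connector.close();
          }
      }
  }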

About CDR Generator in an Active-Active System

When CDR generation is enabled, each site in an active-active system contains a CDR Generator, and each site can generate unrated CDRs for external systems. When a production site goes down, the CDR Store retains all in-progress CDR sessions, and subsequent 5G usage events are diverted to the CDR Gateway on the other production site.

In an active-active system, you can configure CDR Generator to do the following:

  • Mark partially processed CDRs in the CDR Store as incomplete to prevent downstream mediation systems from processing them. To do so, use the CDR Formatter's and CDR Gateway's enableIncompleteCdrDetection attribute.

  • Mark when CDRs contain duplicate usage updates. To do so, use the CDR Gateway's retransmissionDuplicateDetectionEnabled attribute.

  • Indicate that CDRs were closed for a custom value (in the CDR's causeForRecordClosing field). To enable CDR Generator to add a custom reason why a CDR was closed, use the CDR Formatter's enableStaleSessionCleanupCustomField attribute. To specify the custom value to add, use the CDR Formatter's staleSessionCauseForRecordClosingString attribute.

For information about configuring these attributes, see "Setting Up ECE to Generate CDRs" in ECE Implementing Charging.
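
For orientation, the fragment below sketches how these attributes might look once set. It is illustrative only: the element names are assumptions, the staleSessionCauseForRecordClosingString value is an arbitrary example, and only the attribute names come from this section. The authoritative structure and placement are described in "Setting Up ECE to Generate CDRs" in ECE Implementing Charging.

  <!-- Illustrative sketch only: element names are assumptions; refer to
       ECE Implementing Charging for the actual configuration structure. -->
  <cdrFormatterConfiguration
          enableIncompleteCdrDetection="true"
          enableStaleSessionCleanupCustomField="true"
          staleSessionCauseForRecordClosingString="StaleSessionCleanup"/>
  <cdrGatewayConfiguration
          enableIncompleteCdrDetection="true"
          retransmissionDuplicateDetectionEnabled="true"/>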

Configuring an Active-Active System

To configure an active-active system:
  1. In the primary production site, do the following:
    1. Configure the ECE components (Customer Updater, EM Gateway, and so on).

    2. Add all details about participant sites to the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

      To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file.

      Table 84-2 provides the federation configuration parameter descriptions and default values. An illustrative example of a federation-config section appears after this procedure.

      Table 84-2 Federation Configuration Parameters

      name: The name of the participant site.

        Note: The name of the participant site must match the name of the cluster in the participant site.

      address: The IP address of the participant site.

      port: The port number assigned to the Coherence cluster port of the participant site.

      initial-action: Specifies whether the federation service should be started for replicating data to the participant sites. Valid values are:

      • start: Specifies that the federation service must be started and the data must be automatically replicated to the participant sites.

      • stop: Specifies that the federation service must be stopped and the data must not be automatically replicated to the participant sites.

        Note: Ensure that this parameter is set to stop for all participant sites except the current site. For example, if you are adding the backup or remote production site details in the primary production site, this parameter must be set to stop for all backup or remote production sites.

    3. Go to the ECE_home/config/management directory, where ECE_home is the directory in which ECE is installed.

    4. Configure HTTP Gateway. See "Connecting ECE to a 5G Client" in ECE Implementing Charging for more information.

    5. Open the charging-settings.xml file.

    6. In the CustomerGroupConfiguration section, set the app configuration parameters as shown in the following sample file:

      <customerGroupConfigurations config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfigurations">
          <customerGroupConfigurationList>
              <customerGroupConfiguration
                      config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                      name="customerGroup5">
                  <clusterPreferenceList config-class="java.util.ArrayList">
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S2"
                              priority="1" routingGatewayList="host1:port1"/>
                  </clusterPreferenceList>
              </customerGroupConfiguration>
              <customerGroupConfiguration
                      config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                      name="customerGroup2">
                  <clusterPreferenceList config-class="java.util.ArrayList">
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S2"
                              priority="1" routingGatewayList="host1:port1,host1:port1"/>
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S1"
                              priority="2" routingGatewayList="host2:port2,host2:port2"/>
                  </clusterPreferenceList>
              </customerGroupConfiguration>
          </customerGroupConfigurationList>
      </customerGroupConfigurations>

      Table 84-3 provides the configuration parameters of the CustomerGroupConfiguration section.

      Table 84-3 CustomerGroupConfiguration Parameters

      CustomerGroupConfiguration

      • name: Customers are processed and distributed across the sites of an active-active system based on customerGroup. The customer names configured in customerGroup are updated in the PublicUserIdentity (PUI) cache when you load customer information into ECE through Customer Updater or when you create or update customer information in BRM using EM Gateway.

      • clusterPreferenceList: A list of cluster names, with a priority for each cluster name, used for routing requests during a site failure.

      clusterPreferenceConfiguration

      • name: The name of the cluster.

      • priority: The priority of the preferred cluster assigned in the customerGroup list to process the rating request. Requests are processed in increasing order of this number, starting with the lowest. For example, if you set priority to 1, the cluster associated with that entry processes the request first.

      • routingGatewayList: A comma-separated list of the host names and port numbers of the chargingServer values used for httpGateway.
    7. Configure a primary and secondary Rated Event Formatter instance for each site in the ratedEventFormatter section, as shown in the following sample file:

      <ratedEventFormatterConfigurationList config-class="java.util.ArrayList">
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site1_primary"
                  partition="1"
                  connectionName="oracle1"
                  siteName="site1"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC1Primary" …
                  … />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site1_secondary"
                  partition="1"
                  connectionName="oracle2"
                  siteName="site1"
                  primaryInstanceName="ref_site1_primary"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC1Secondary"
                  … />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site2_primary"
                  partition="1"
                  connectionName="oracle2"
                  siteName="site2"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC2Primary"
                  …  />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site2_secondary"
                  partition="1"
                  connectionName="oracle1"
                  siteName="site2"
                  primaryInstanceName="ref_site2_primary"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC2Secondary"
                  …  />
      </ratedEventFormatterConfigurationList>

      The siteName property determines the site whose rated events the instance processes. This lets you configure secondary instances as backups for remote sites. The sample specifies that the ref_site1_secondary instance runs at site 2 but processes rated events federated from site 1 in case of an outage.

    8. Configure the production sites to process the routing requests.

    9. Open the site-configuration.xml file. Configure all monitorAgent instances from all sites. Each Monitor Agent instance includes the Coherence cluster name, host name or IP address, and JMX port.

      Table 84-4 provides the configuration parameters of Monitor Agent.

      Table 84-4 Monitor Agent Configuration Parameters

      name: The name of the production or remote site where the request should be processed. These names should correspond to the site names defined for the Rated Event Formatter instances.

      host: The IP address of the participant site.

      jmxPort: The JMX port of the production or remote site.

      disableMonitor: Allows a monitorAgent instance to disable the collection of monitoring results when multiple monitorAgent instances are running within a site, preventing redundant monitoring results for the site.

        Note: The default value is false. If you set this value to true, the monitorAgent instance does not collect redundant monitoring results.

      Note:

      The monitorAgent properties should match the properties in the eceTopology.conf file, where a monitorAgent instance is configured to start from a specific production site.

    10. Copy the JMSConfiguration.xml file content of all sites to a single file and enter the following details:
      • Add the <Cluster>clusterName</Cluster> tag for the queue types.
      • Import the wallet for all clusters and specify the wallet path in the <KeyStoreLocation> and <ECEWalletLocation> locations.
    11. In the eceTopology.conf file, enable the JMX port for all ECS server nodes and clients, such as Diameter Gateway, HTTP Gateway, RADIUS Gateway, and EM Gateway. Also, enable the JMX port for each Monitor Agent instance.

    12. Start ECE. See "Starting ECE" for more information.

  2. On the backup or remote site, do the following:
    1. Configure the ECE components (Customer Updater, EM Gateway, and so on).

      Ensure the following:
      • The name of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site is unique.
      • At least two instances of Rated Event Formatter are configured to allow for failover. The data persistence-enabled system requires configuring at least one primary and one secondary instance for each site.
    2. Set the following parameter in the ECE_home/config/ece.properties file to false:
      loadConfigSettings = false
      The application-configuration data is not loaded into memory when you start the charging server nodes.
    3. Add all the details of participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

      To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 84-2 provides the federation configuration parameter descriptions and default values.

    4. Start the Elastic Charging Controller (ECC):
      ./ecc
    5. Start the charging server nodes:
      start server
  3. On the primary production site, run the following commands:

    gridSync start
    gridSync replicate

    The federation service is started and all the existing data is replicated to the backup or remote production sites.

  4. On the backup sites, do the following:
    1. Verify that the same number of entries as in the primary production site are available in the customer, balance, configuration, and pricing caches in the backup or remote production sites by using the query.sh utility.

    2. Verify that the charging server nodes in the backup or remote production sites are in the same state as the charging server nodes in the primary production site.

    3. Configure the following ECE components and the Oracle persistence database connection details by using a JMX editor:
      • Rated Event Formatter
      • Rated Event Publisher
      • Diameter Gateway
      • RADIUS Gateway
      • HTTP Gateway
      Ensure the following:
      • The name of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site is unique.

      • At least two instances of Rated Event Formatter are configured to allow for failover. A data persistence-enabled system requires configuring at least one primary and one secondary instance for each site.

    4. Start the following ECE processes and gateways:

      start brmGateway
      start ratedEventFormatter
      start diameterGateway
      start radiusGateway
      start httpGateway

      The remote production sites are up and running with all required data.

    5. Run the following command:

      gridSync start

      The federation service is started to replicate the data from the backup or remote production sites to the preferred production site.

  5. After starting Rated Event Formatter in the remote production sites, ensure that you copy the CDR files generated by Rated Event Formatter from the remote production sites to the primary production site by using the SFTP utility.
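
The following fragment is a minimal sketch of how the participant entries described in Table 84-2 might appear in the federation-config section of the ECE Coherence override file, using the standard Coherence federation participant elements. The site names, IP addresses, and ports are placeholders, and your ECE-provided override file may already contain additional federation elements (for example, topology definitions) that you should leave in place.

  <federation-config>
      <participants>
          <!-- Current site: the federation service starts here and replicates data to the other participants. -->
          <participant>
              <name>BRM-S1</name>
              <initial-action>start</initial-action>
              <remote-addresses>
                  <socket-address>
                      <address>192.0.2.10</address>
                      <port>7574</port>
                  </socket-address>
              </remote-addresses>
          </participant>
          <!-- Remote or backup site: set initial-action to stop, as noted in Table 84-2. -->
          <participant>
              <name>BRM-S2</name>
              <initial-action>stop</initial-action>
              <remote-addresses>
                  <socket-address>
                      <address>192.0.2.20</address>
                      <port>7574</port>
                  </socket-address>
              </remote-addresses>
          </participant>
      </participants>
  </federation-config>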

Note:

When configuring an Active-Hot Standby system, the preferred site for each of the customer groups should be the same; that is, the preferred site should be the current active site. For example:

customerGroupConfigurations:
      - name: "customergroup1"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"
      - name: "customergroup2"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"

Including Custom Clients in Your Active-Active Configuration

If your system includes a custom client application that calls the ECE API, you need to add the custom client to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from your custom client to a backup site when a site failover occurs. To do so, you configure the custom client as an ECE Monitor Framework-compliant node in the ECE cluster.

To add a custom client to an active-active configuration:
  1. Modify your custom client to use the ECE Monitor Framework:
    1. Add this import statement:

      import oracle.communication.brm.charging.monitor.framework.internal.MonitorFramework;
    2. Add these lines to the program:

      if (MonitorFramework.isJMXEnabledApp) {
         MonitorFramework monitorFramework = (MonitorFramework) context.getBean(MonitorFramework.MONITOR_BEAN_NAME);
         try {
            monitorFramework.initializeMonitor(null); // null parameter for any non-ECE Monitor Agent node
         }
         catch (Exception ex) {
            // Failed to initialize Monitor Framework, check log file
            System.exit(-1);
         }
      }
      ...  // continue as before
  2. When you start your custom client, include these Java system properties:
    1. -Dcom.sun.management.jmxremote.port set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number.

    2. -Dcom.sun.management.jmxremote.rmi.port set to the port number to which the RMI connector will be bound.

    3. -Dtangosol.coherence.member set to the name of the custom client application instance running within the ECE cluster.

    For example:

    java -Dcom.sun.management.jmxremote.port=6666 \
    -Dcom.sun.management.jmxremote.rmi.port=6666 \
    -Dtangosol.coherence.member=customApp1 \
    -jar customApp1.jar
  3. Edit the ECE_home/config/eceTopology.conf file to include a row for each custom client application instance. For each row, enter the following information:
    1. node-name: The name of the JVM process for that node.

    2. role: The role of the JVM process for that node.

    3. host name: The host name of the physical server machine on which the node resides. For a standalone system, enter localhost.

    4. host ip: If your host contains multiple IP addresses, enter the IP address so that Coherence can be pointed to a port.

    5. JMX port: The JMX port of the JVM process for that node. By specifying a JMX port number for one node, you expose MBeans for setting performance-related properties and collecting statistics for all node processes. Enter any free port, such as 9999, for the charging server node to be the JMX-management enabled node.

    6. start CohMgt: Specify whether you want the node to be JMX-management enabled.

    For example:

    #node-name     |role       |host name  (no spaces!) |host ip  |JMX port  |start CohMgt  |JVM Tuning File  
    customApp1     |customApp  |localhost               |         |6666      |false         |

Including Offline Mediation Controller in Your Active-Active Configuration

If your system includes Oracle Communications Offline Mediation Controller, you need to add it to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from Offline Mediation Controller to a backup site when a site failover occurs.

To include Offline Mediation Controller in your active-active configuration:
  1. On each active production site, do the following:
    1. Log in to your ECE driver machine as the rms user.

    2. In your ocecesdk/config/client-charging-context.xml file, add the following line to the beans element:

      <import resource="classpath:/META-INF/spring/monitor.framework-context.xml"/>
  2. On each Offline Mediation Controller machine, do the following:
    1. Log in to your Offline Mediation Controller machine as the rms user.

    2. Add the following lines to your OCOMC_home/bin/nodemgr file:

      -Dcom.sun.management.jmxremote 
      -Dcom.sun.management.jmxremote.ssl=false
      -Dcom.sun.management.jmxremote.rmi.port=rmi_port
      -Dcom.sun.management.jmxremote.port=port
      where:
      • rmi_port is set to the port number to which the RMI connector will be bound.

      • port is set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number.

    3. In the OCOMC_home/bin/UDCEnvironment file, set JMX_ENABLED_STATUS to true and set JMX_PORT to the desired JMX port number:

      #For enabling jmx in an active-active setup 
      JMX_ENABLED_STATUS=true
      JMX_PORT=9992
  3. On each Offline Mediation Controller machine, restart Node Manager by going to the OCOMC_home/bin directory and running this command:

    ./nodemgr

Failing Over to a Backup Site (Active-Active)

To fail over to a backup site in an active-active configuration:
  1. Open a JMX editor, such as JConsole.

  2. Expand the ECE Monitoring node.

  3. Expand Agent.

  4. Expand Operations.

  5. Run the failoverSite operation, specifying the name of the failed site. For a scripted alternative, see the sketch after this procedure.

  6. On the backup site, stop replicating the ECE cache data to the primary production site by running the following command:

    gridSync stop PrimaryProductionClusterName

    where PrimaryProductionClusterName is the name of the cluster in the primary production site.

  7. On the backup site, do the following:
    1. Change the BRM, PDC, and Customer Updater connection details to connect to BRM and PDC on the backup site by using a JMX editor.

      Note:

      If only ECE in the primary production site failed and BRM and PDC in the primary production site are still running, you need not change the BRM and PDC connection details on the backup site.
    2. Start BRM and PDC.
  8. Recover the data in the Oracle NoSQL database data store of the primary production site by performing the following:
    1. Convert the secondary Oracle NoSQL database data store node of the primary production site to the primary Oracle NoSQL database data store node by performing a failover operation in the Oracle NoSQL database data store. For more information, see "Performing a Failover" in Oracle NoSQL Database Administrator's Guide.

      The secondary Oracle NoSQL database data store node of the primary production site is now the primary Oracle NoSQL database data store node of the primary production site.

    2. On the backup site, convert the rated events from the Oracle NoSQL database data store node that you just converted into the primary node into CDR files by starting Rated Event Formatter.
    3. In a backup or remote production site, load the CDR files that you just converted into BRM by using Rated Event (RE) Loader.
    4. Shut down the Oracle NoSQL database data store node that you just converted into the primary node.

      See the "stop" utility in Oracle NoSQL Database Administrator's Guide for more information.

    5. Stop the Rated Event Formatter that you just started.
  9. In a backup or remote production site, start Pricing Updater, Customer Updater, and EM Gateway by running the following commands:

    start pricingUpdater
    start customerUpdater
    start emGateway

    All pricing and customer data is now back in the ECE grid in the backup or remote production site.

  10. Stop and restart BRM Gateway.

  11. Migrate internal BRM notifications from the primary production site to a backup or remote production site. See "Migrating ECE Notifications" for more information.

    Note:

    • If the expiry duration is configured for these notifications, ensure that you migrate the notifications before they expire. For the expiry duration, see the expiry-delay entry for the ServiceContext module in the ECE_home/config/charging-cache-config.xml file.
    • All external notifications from a production site are published to the respective JMS queue. Diameter Gateway retrieves the notifications from the JMS queue and replicates to other sites based on the configuration.
  12. Ensure that the network clients route all requests to the backup or remote production site.

The former backup site or one of the remote production sites is now the new preferred production site. When the original preferred site is functioning again, you run the recoverSite operation, and the site traffic routes back to the preferred site. For more information, see "Switching Back to the Original Production Site (Active-Active)".
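
The failoverSite step above can also be scripted through the standard JMX remote API. The Java sketch below is illustrative only: the host, port, and the Agent MBean ObjectName are placeholders that you must copy from your own JConsole tree, and it assumes that failoverSite accepts the failed site's cluster name as a single String argument, matching step 5 of the procedure (verify the operation signature in JConsole).

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class MarkSiteFailed {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the Agent MBean under ECE Monitoring (copy it from JConsole)
          // args[2]: cluster name of the failed site
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Invoke failoverSite with the failed site's cluster name (assumed String signature).
              mbs.invoke(new ObjectName(args[1]), "failoverSite",
                      new Object[] { args[2] }, new String[] { String.class.getName() });
          } finally {
              connector.close();
          }
      }
  }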

Switching Back to the Original Production Site (Active-Active)

To switch back to the original production site in an active-active system:
  1. Install ECE and other required components in the original primary production site. For more information, see "Installing Elastic Charging Engine" in ECE Installation Guide.

    Note:

    If only ECE in the original primary production site failed and BRM and PDC in the original primary production site are still running, install only ECE and provide the connection details about BRM and PDC in the original primary production site during ECE installation.
  2. On the machine on which the Oracle WebLogic server is installed, verify that the JMS queues have been created for loading pricing data and for sending event notification, and that JMS credentials have been configured correctly.

  3. Set the following parameter in the ECE_home/config/ece.properties file to false:
    loadConfigSettings = false

    The configuration data is not loaded in memory.

  4. Add all details about participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

    To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 84-2 provides the federation configuration parameter descriptions and default values.

  5. Go to the ECE_home/bin directory.

  6. Start ECC:

    ./ecc
  7. Start the charging server nodes:

    start server
  8. Replicate the ECE cache data to the original production site by using the gridSync utility. For more information, see "Replicating ECE Cache Data".

  9. Start the following ECE processes and gateways:

    start brmGateway
    start ratedEventFormatter
    start diameterGateway
    start radiusGateway
    start httpGateway
  10. Verify that the same number of entries as in the new production site are available in the customer, balance, configuration, and pricing caches in the original production site by using the query.sh utility.

  11. Stop Pricing Updater, Customer Updater, and EM Gateway in the new primary production site and then start them in the original primary production site.

  12. Migrate internal BRM notifications from the new primary production site to the original primary production site. For more information, see "Migrating ECE Notifications".

  13. Change the BRM Gateway, Customer Updater, and Pricing Updater connection details to connect to BRM and PDC in the original primary production site by using a JMX editor.

  14. Stop RE Loader in the new primary production site and then start it in the original primary production site.

  15. Stop and restart BRM Gateway in both the new primary production site and the original primary production site.

    The roles of the sites are now reversed to the original roles.

  16. Open a JMX editor, such as JConsole.

  17. Expand the ECE Monitoring node.

  18. Expand Agent.

  19. Expand Operations.

  20. Run the recoverSite operation, specifying the name of the recovered site.

  21. If data persistence is enabled and you failed over your Rated Event Formatter instance at the original site to a secondary instance at a remote site, restart any primary and secondary Rated Event Formatter instances at the original site.

Note:

EM Gateway routes connection requests either to the local ECS or to the HTTP Gateway nodes on the remote sites. If the site does not respond, the request is processed locally on the same site. When a production site goes down, the CDR database retains all in-progress (or incomplete) CDR sessions, and all unrated 5G usage events should be diverted to the remote HTTP Gateway.

When a site is marked as down in an Active-Active setup, the gateways at that site need to be brought down; otherwise, they will continue to service requests, which is not expected.

Processing Usage Requests in the Site Received

To configure the ECE active-active mode to process usage requests in the site that receives the request irrespective of the subscriber's preferred site, perform the following steps:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand charging.brsConfigurations.default.

  4. Expand Attributes.

  5. Set the skipActiveActivePreferredSiteRouting attribute to true.

    Note:

    By default, the skipActiveActivePreferredSiteRouting attribute is set to false.
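
If you prefer to script this change rather than set the attribute in JConsole, the following Java sketch uses the standard JMX remote API. It is illustrative only: the host, port, and MBean ObjectName are placeholders that you must copy from your own JConsole tree, and it assumes the attribute is exposed as a Boolean (check the attribute type in JConsole).

  import javax.management.Attribute;
  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class SkipPreferredSiteRouting {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the charging.brsConfigurations.default MBean (copy it from JConsole)
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Process usage requests in the site where they are received (assumed Boolean attribute).
              mbs.setAttribute(new ObjectName(args[1]),
                      new Attribute("skipActiveActivePreferredSiteRouting", Boolean.TRUE));
          } finally {
              connector.close();
          }
      }
  }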

Replicating ECE Cache Data

In an active-hot standby system, a segmented active-active system, or an active-active system, when you configure or perform disaster recovery, you replicate the ECE cache data to the participant sites by using the gridSync utility.

To replicate the ECE cache data:

  1. Go to the ECE_home/bin directory.

  2. Start ECC:

    ./ecc
  3. Do one of the following:

    • To start replicating data to a specific participant site asynchronously and also replicate all the existing ECE cache data to a specific participant site, run the following commands:

      gridSync start [remoteClusterName]
      gridSync replicate [remoteClusterName]

      where remoteClusterName is the name of the cluster in a participant site.

    • To start replicating data to all the participant sites asynchronously and also replicate all the existing ECE cache data to all the participant sites, run the following commands:

      gridSync start
      gridSync replicate

See "gridSync" for more information on the gridSync utility.

Migrating ECE Notifications

When you fail over to a backup site or switch back to the primary site, you must migrate the notifications to the destination site.

Note:

If you are using Apache Kafka for notification handling, notifications are not migrated to the destination site. Apache Kafka retains the notifications and these notifications appear in the original site or components when they are active.

To migrate ECE notifications:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand systemAdmin.

  4. Expand Operations.

  5. Select triggerFailedClusterServiceContextEventMigration.

  6. In the method's failedClusterName field, enter the name of the failed site's cluster.

  7. Click the triggerFailedClusterServiceContextEventMigration button.

All the internal BRM notifications are migrated to the destination site. In an active-active system, the external notifications are also migrated to the destination site. If you cannot establish the WebLogic cluster subscription due to a site failover, you should restart Diameter Gateway on the destination site. If a site recovers from a failover, you should restart all the Diameter Gateway instances in the cluster.