84 Configuring ECE for Disaster Recovery

Learn how to configure Oracle Communications Billing and Revenue Management Elastic Charging Engine (ECE) for disaster recovery.

Topics in this document:

  • Introduction
  • Business Continuity with ECE Disaster Recovery
  • About Deployment Modes with Geo-Redundancy
  • About Load Balancing in an Active-Active System
  • About Rated Event Formatter in a Persistence-Enabled Active-Active System
  • About CDR Generator in an Active-Active System
  • Configuring an Active-Active System
  • Including Custom Clients in Your Active-Active Configuration
  • Including Offline Mediation Controller in Your Active-Active Configuration
  • Failing Over to a Backup Site (Active-Active)
  • Switching Back to the Original Production Site (Active-Active)
  • Processing Usage Requests in the Site Received
  • Replicating ECE Cache Data
  • Migrating ECE Notifications

Introduction

Oracle Communications BRM offers a disaster recovery (DR) architecture that ensures business continuity in the event of an unexpected site failure. BRM disaster recovery capabilities provide continuity of service usage for your end customers and minimize data loss if a system failure occurs. BRM supports deployment models designed to meet disaster recovery and business continuity needs.

An Oracle Communications Charging, Billing and Revenue Management deployment includes functional components that manage business functionality end to end, from subscription acquisition to usage charging, billing, and revenue management. The functional components are:
  1. Billing and Revenue Management (BRM) server

  2. Pricing Design Center (PDC)

  3. Elastic Charging Engine (ECE)

  4. Offline Mediation Controller (OCOMC)

Business Continuity with ECE Disaster Recovery

Customers deploying BRM products also need measures for when a disaster strikes. A disaster can be a power outage, hardware burn-out, or a site becoming unavailable due to natural calamities such as floods or earthquakes. For hardware burn-out, or even localized power outages, local redundancy can be provided by deploying the products in highly available mode wherever possible.

Because of their distributed architecture, BRM components can be deployed with local redundancy for high availability.

For deploying with geo-redundancy, Table 84-1 shows the available deployment modes, along with recommended options for each of the functional components.

Table 84-1 Available and Recommended Deployment Modes

Component | Deployment Mode
Billing and Revenue Management (BRM) Server | Active-Hot Standby
Pricing Design Center (PDC) | Active-Hot Standby
Elastic Charging Engine (ECE) | Active-Active
Offline Mediation Controller (OCOMC) | Active-Standby

It is essential to continue processing transactions in the event of a site failure, so choose an adequate deployment mode based on the recovery objectives set by your business. BRM components let you choose a deployment mode, component by component, that fits your business needs.

About Deployment Modes with Geo-Redundancy

You can deploy the BRM Server, PDC, and ECE components with geo-redundancy to improve business continuity.

Deploying BRM Server and PDC with Geo-Redundancy

BRM Server and PDC must be deployed for charging with ECE, and both use Oracle Database to store data. Database replication using Oracle Data Guard is recommended for replicating the data across sites.

Figure 84-1 shows the disaster recovery deployment modes for BRM and PDC.

Figure 84-1 Deployment Modes for BRM and PDC



In this deployment, BRM and PDC run in Active-Hot Standby mode, and the database is continuously replicated in real time from the active site to the standby site using Data Guard or Active Data Guard, ensuring minimal data loss. This mode requires monitoring of the sites and manual intervention when a site failure occurs.

Within a given site, local redundancy for BRM Server ensures continued availability of the system. BRM supports multi-instance configuration of Connection Managers and Data Managers for high availability, so that transactions are processed through available BRM processes connected to the same database instance.

Deploying ECE with Geo-Redundancy

Elastic Charging Engine is deployed in Active-Active mode. Active-Active mode uses Coherence federation-based data replication across sites and requires Coherence Grid Edition. This mode is more beneficial because data is processed on both sites, rather than one site sitting idle in standby mode while its processes are running.

Figure 84-2 shows the Active-Active deployment mode for disaster recovery for ECE.

Figure 84-2 Active-Active Deployment Mode for ECE



Active-Active mode consists of two ECE sites, where all ECE sites are able to actively process charging requests simultaneously. Each ECE site’s cache holds all subscriber data and pricing configuration data, and ECE cache data is asynchronously replicated among all of the ECE sites using Coherence cache federation so that the cache data in all of the ECE sites remains synchronized.

In Active-Active mode, all subscribers belonging to a sharing group are processed in the same ECE site to ensure there is no revenue exposure from concurrent processing on different ECE sites affecting the same shared balance. For subscribers that do not belong to a sharing group, the operator can configure ECE to process charging traffic in one of the following ways:
  1. Local processing mode: Requests for all subscribers are processed in the ECE site where they arrive from the network. In this mode, only requests for sharing group members may be forwarded to another site where the shared balance is managed. This is the recommended mode for better processing rates.

  2. Preferred site processing mode: All requests for a given subscriber are processed in the same ECE site, controlled by an ordered list of preferred ECE sites for each subscriber grouping. In this mode, a charging request arriving at a non-preferred ECE site is forwarded to the preferred ECE site for processing. For example, all members of a sharing group are processed in the preferred site where the shared balance is managed.

All active ECE sites interface with one active BRM and PDC instance as shown. Any updates from BRM will be processed by one of the active ECE sites and will be synchronized to all other active ECE sites via Coherence federation. Rated events created in each active ECE site will be processed in that site and loaded into the BRM database via the configured method.

In Figure 84-2, if ECE site 2 fails, ECE site 1 will automatically be able to handle the entire network’s charging traffic. The operator should manually mark ECE site 2 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 2 until the problem is corrected.

In Figure 84-2, if ECE site 1 fails, ECE site 2 will automatically be able to handle the entire network’s charging traffic; however, in this case manual steps are required to update the configuration to ensure that updates from Active BRM and PDC are processed in ECE site 2 without reliance on ECE site 1. The operator should manually mark ECE site 1 as failed as soon as this condition is observed to ensure no traffic processing is attempted by ECE site 1 until the problem is corrected.

For all Active-Active deployments, the operator should ensure that each ECE site is properly sized to handle the expected load in the worst-case failure scenario.

If, in the above configuration, you want to process data on only one site, the other site stays in Hot Standby mode. This deployment mode is called Active-Hot Standby and is shown in Figure 84-3. When both sites are configured for processing, it is usually preferable to process traffic on both sites rather than keep one site idle, so we recommend deploying in Active-Active mode; it provides the best RTO and RPO of all the deployment modes.

Figure 84-3 Active-Hot Standby Deployment Mode for ECE



The configuration for Active-Hot Standby mode is the same as that for an Active-Active system.

If an operator deploys all components as indicated in Table 84-1, the deployment is as shown in Figure 84-4:

Figure 84-4 Site Deployment with Recommended Options



BRM and PDC are always in Active-Hot Standby mode. The database from the active BRM and PDC site is replicated to one or more standby sites. If the active deployment fails, manual intervention is needed to make one of the BRM/PDC standby sites active and to reconfigure the new active BRM/PDC instance to communicate with the existing primary ECE, or with the ECE instance in the same site.

ECE is in Active-Active mode. Pricing updates from PDC and customer updates from BRM will be processed by one of the active ECE sites (ECE site 1 in Figure 84-2) and will be synchronized to the other active ECE site via Coherence federation. Rated events created in an active ECE site will be processed in that site and loaded into the Active BRM database via the configured method.

If there is a failure of an ECE site, then:
  • The remaining active ECE site is automatically able to handle the load redistributed from the core network.

  • The operator should manually mark the failed ECE site as failed as soon as this condition is observed to ensure that no traffic processing is attempted by that site until the problem is corrected.

  • If the failed ECE site was the primary ECE site being used by the active BRM and PDC, manual steps are required to update the configuration to ensure that updates from the active BRM and PDC are processed in the other active ECE site.

  • If the network gateways remain in service at the failed ECE site (that is, a partial ECE site failure), the network gateways may need to be manually turned down to force the network clients to redistribute charging traffic to the remaining active ECE site.

OCOMC is in Active-Standby mode. OCOMC processes offline rating requests, which are generally processed on one site to keep the deployment simple and easily manageable.

About Load Balancing in an Active-Active System

In an active-active system, EM Gateway routes BRM update requests across sites based on the app and site configurations to ensure load balancing.

EM Gateway routes connection requests to Diameter Gateway, RADIUS Gateway, and HTTP Gateway nodes in one of the active sites. If the site does not respond, the request is rerouted to the backup production site.

You can set up load balancing configuration based on your requirements.

About Rated Event Formatter in a Persistence-Enabled Active-Active System

When data persistence is enabled, each site in an active-active system has a primary Rated Event Formatter instance for each schema, and at least one secondary instance for each schema.

As rated events are created, the following happens on each site:

  1. ECE creates rated events and commits them to the Coherence cache. Each rated event created by ECE includes the Coherence cluster name of the site where it was created.
  2. The Coherence federation service replicates the events to the remote sites, as it does for other federated objects.
  3. Coherence caching persists the events to the database in batches. Each schema at each site has its own rated event database table.
  4. The primary Rated Event Formatter instance processes all rated events from the corresponding site-specific database table.
  5. The primary Rated Event Formatter instance commits the formatted events to the cache as a checkpoint. The site name is included in the checkpoint data, along with the schema number, timestamp, and plugin type.
  6. The Coherence federation service replicates the checkpoint to the remote sites, as it does for other federated objects. The remote site ECE servers then purge the events persisted in the checkpoint from the database in batches by schema and by site.
  7. Coherence caching persists the checkpoint to the database to be consumed by Rated Event Loader. Checkpoints are grouped by schema and by site.
  8. The ECE server purges the events related to the persisted checkpoint. Events are purged from the database in batches by schema and by site.

Remote sites that receive federated events and checkpoints similarly persist them to and purge them from the database, in site and schema-specific database tables. In this way, all sites contain the same rated events and checkpoints, no matter where they were generated, and each rated event and checkpoint retains information about the site that generated it. If the Rated Event Formatter instance at any one site is down, a secondary instance at a remote site can immediately begin processing the rated events, preserving the site-specific information as though it were the original site. See "Resolving Rated Event Formatter Instance Outages".

Resolving Rated Event Formatter Instance Outages

If a primary Rated Event Formatter instance is down, take one of the following approaches, depending on whether the outage is planned or unplanned, and considering your operational needs:

  • Planned outage: Primary instance finishes processing: Choose this option for planned outages, when rating stops but the primary Rated Event Formatter instance can keep processing.
    1. After no new rated events are being generated by the site, wait until the local Rated Event Formatter has finished processing all rated events from the site.
    2. In the remote sites, drop or truncate the rated event database table for the rated events federated from the site with the outage. Dropping the table means you must recreate it and its indexes after resolving the outage.
    3. Stop the Rated Event Formatter at the site with the outage.
    4. When the outage is resolved, you can start Rated Event Formatter again to resume processing events.
  • Unplanned outage: Secondary instance takes over processing: Choose this option for unplanned outages, when the primary Rated Event Formatter is also down. After failing over to the backup site as described in "Failing Over to a Backup Site (Active-Active)", perform the following tasks:
    1. Confirm that the last successful Rated Event Formatter checkpoint for the local site matches the one federated to the remote site. You can use the JMX queryRatedEventCheckPoint operation in the ECE configuration MBeans. See "Getting Rated Event Formatter Checkpoint Information".
    2. If needed, start the secondary Rated Event Formatter instance on the remote site.
    3. Activate the secondary Rated Event Formatter instance on the remote site using the JMX activateSecondaryInstance operation in the ECE monitoring MBeans. See "Activating a Secondary Rated Event Formatter Instance".

      The secondary instance takes over processing the federated rated events as though it were the primary instance at the site with the outage. The events and checkpoints are persisted in the database tables for the original site, not the remote site.

    4. Wait until the secondary instance has finished processing all rated events federated from the site with the outage.
    5. At the site with the outage, drop or truncate the rated event database table for local events. Dropping the table means you must recreate it and its indexes after resolving the outage.
    6. Stop the secondary Rated Event Formatter instance.
    7. When the outage is resolved and the site has been recovered as described in "Switching Back to the Original Production Site (Active-Active)", restart the primary Rated Event Formatter again to resume processing events at the local site. If you had the secondary Rated Event Formatter instance running at the remote site before the outage, restart it too.

Getting Rated Event Formatter Checkpoint Information

You can retrieve information about the last Rated Event Formatter checkpoint committed to the database.

To retrieve information about the last Rated Event Formatter checkpoint:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.
  3. Expand the database connection you want checkpoint information from.
  4. Expand Operations.
  5. Run the queryRatedEventCheckPoint operation.

    Checkpoint information appears for all Rated Event Formatter instances using the database connection. Information includes site, schema, and plugin names as well as the time of the most recent checkpoint.

Activating a Secondary Rated Event Formatter Instance

If a primary Rated Event Formatter instance is down, you can activate a secondary instance to take over rated event processing.

To activate a secondary Rated Event Formatter instance:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Monitoring node.
  3. Expand RatedEventFormatterMatrices.
  4. Expand Operations.
  5. Run the activateSecondaryInstance operation.

    The secondary Rated Event Formatter instance begins processing rated events.
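
You can also script these two JConsole procedures over the standard JMX remote API, which can be convenient during an unplanned outage. The following Java sketch is illustrative only and is not part of the documented ECE tooling: the host, port, and MBean ObjectNames are placeholders that you must copy from your own JConsole tree, and it assumes that both operations can be invoked without arguments (verify the operation signatures in JConsole before relying on it).

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class ActivateSecondaryFormatter {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the database-connection MBean (copy it from JConsole)
          // args[2]: ObjectName of the RatedEventFormatterMatrices MBean (copy it from JConsole)
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Check the last committed checkpoint before activating the secondary instance.
              Object checkpoint = mbs.invoke(new ObjectName(args[1]),
                      "queryRatedEventCheckPoint", new Object[0], new String[0]);
              System.out.println("Last checkpoint: " + checkpoint);
              // Activate the secondary Rated Event Formatter instance.
              mbs.invoke(new ObjectName(args[2]),
                      "activateSecondaryInstance", new Object[0], new String[0]);
          } finally {
              connector.close();
          }
      }
  }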

About CDR Generator in an Active-Active System

When CDR generation is enabled, each site in an active-active system contains a CDR Generator, and each site can generate unrated CDRs for external systems. When a production site goes down, the CDR Store retains all in-progress CDR sessions, and subsequent 5G usage events are diverted to the CDR Gateway on the other production site.

In an active-active system, you can configure CDR Generator to do the following:

  • Mark partially processed CDRs in the CDR Store as incomplete to prevent downstream mediation systems from processing them. To do so, use the CDR Formatter's and CDR Gateway's enableIncompleteCdrDetection attribute.

  • Mark when CDRs contain duplicate usage updates. To do so, use the CDR Gateway's retransmissionDuplicateDetectionEnabled attribute.

  • Indicate that CDRs were closed for a custom value (in the CDR's causeForRecordClosing field). To enable CDR Generator to add a custom reason why a CDR was closed, use the CDR Formatter's enableStaleSessionCleanupCustomField attribute. To specify the custom value to add, use the CDR Formatter's staleSessionCauseForRecordClosingString attribute.

For information about configuring these attributes, see "Setting Up ECE to Generate CDRs" in ECE Implementing Charging.
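
For orientation, the fragment below sketches how these attributes might look once set. It is illustrative only: the element names are assumptions, the staleSessionCauseForRecordClosingString value is an arbitrary example, and only the attribute names come from this section. The authoritative structure and placement are described in "Setting Up ECE to Generate CDRs" in ECE Implementing Charging.

  <!-- Illustrative sketch only: element names are assumptions; refer to
       ECE Implementing Charging for the actual configuration structure. -->
  <cdrFormatterConfiguration
          enableIncompleteCdrDetection="true"
          enableStaleSessionCleanupCustomField="true"
          staleSessionCauseForRecordClosingString="StaleSessionCleanup"/>
  <cdrGatewayConfiguration
          enableIncompleteCdrDetection="true"
          retransmissionDuplicateDetectionEnabled="true"/>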

Configuring an Active-Active System

To configure an active-active system:
  1. In the primary production site, do the following:
    1. Configure the ECE components (Customer Updater, EM Gateway, and so on).

    2. Add all details about participant sites to the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

      To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file.

      Table 84-2 provides the federation configuration parameter descriptions and default values. An illustrative example of a federation-config section appears after this procedure.

      Table 84-2 Federation Configuration Parameters

      name: The name of the participant site.

        Note: The name of the participant site must match the name of the cluster in the participant site.

      address: The IP address of the participant site.

      port: The port number assigned to the Coherence cluster port of the participant site.

      initial-action: Specifies whether the federation service should be started for replicating data to the participant sites. Valid values are:

      • start: Specifies that the federation service must be started and the data must be automatically replicated to the participant sites.

      • stop: Specifies that the federation service must be stopped and the data must not be automatically replicated to the participant sites.

        Note: Ensure that this parameter is set to stop for all participant sites except the current site. For example, if you are adding the backup or remote production site details in the primary production site, this parameter must be set to stop for all backup or remote production sites.

    3. Go to the ECE_home/config/management directory, where ECE_home is the directory in which ECE is installed.

    4. Configure HTTP Gateway. See "Connecting ECE to a 5G Client" in ECE Implementing Charging for more information.

    5. Open the charging-settings.xml file.

    6. In the CustomerGroupConfiguration section, set the app configuration parameters as shown in the following sample file:

      <customerGroupConfigurations config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfigurations">
          <customerGroupConfigurationList>
              <customerGroupConfiguration
                      config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                      name="customerGroup5">
                  <clusterPreferenceList config-class="java.util.ArrayList">
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S2"
                              priority="1" routingGatewayList="host1:port1"/>
                  </clusterPreferenceList>
              </customerGroupConfiguration>
              <customerGroupConfiguration
                      config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.CustomerGroupConfiguration"
                      name="customerGroup2">
                  <clusterPreferenceList config-class="java.util.ArrayList">
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S2"
                              priority="1" routingGatewayList="host1:port1,host1:port1"/>
                      <clusterPreferenceConfiguration
                              config-class="oracle.communication.brm.charging.appconfiguration.beans.customergroup.ClusterPreferenceConfiguration"
                              name="BRM-S1"
                              priority="2" routingGatewayList="host2:port2,host2:port2"/>
                  </clusterPreferenceList>
              </customerGroupConfiguration>
          </customerGroupConfigurationList>
      </customerGroupConfigurations>

      Table 84-3 provides the configuration parameters of the CustomerGroupConfiguration section.

      Table 84-3 CustomerGroupConfiguration Parameters

      CustomerGroupConfiguration

      • name: Customers are processed and distributed across the sites of an active-active system based on customerGroup. The customer names configured in customerGroup are updated in the PublicUserIdentity (PUI) cache when you load customer information into ECE through Customer Updater or when you create or update customer information in BRM using EM Gateway.

      • clusterPreferenceList: A list of cluster names, with a priority for each cluster name, used for routing requests during a site failure.

      clusterPreferenceConfiguration

      • name: The name of the cluster.

      • priority: The priority of the preferred cluster assigned in the customerGroup list to process the rating request. Requests are processed in increasing order of this number, starting with the lowest. For example, if you set priority to 1, the cluster associated with that entry processes the request first.

      • routingGatewayList: A comma-separated list of the host names and port numbers of the chargingServer values used for httpGateway.
    7. Configure a primary and secondary Rated Event Formatter instance for each site in the ratedEventFormatter section, as shown in the following sample file:

      <ratedEventFormatterConfigurationList config-class="java.util.ArrayList">
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site1_primary"
                  partition="1"
                  connectionName="oracle1"
                  siteName="site1"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC1Primary" …
                  … />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site1_secondary"
                  partition="1"
                  connectionName="oracle2"
                  siteName="site1"
                  primaryInstanceName="ref_site1_primary"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC1Secondary"
                  … />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site2_primary"
                  partition="1"
                  connectionName="oracle2"
                  siteName="site2"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC2Primary"
                  …  />
          <ratedEventFormatterConfiguration
                  config-class="oracle.communication.brm.charging.appconfiguration.beans.ratedeventformatter.RatedEventFormatterConfiguration"
                  name="ref_site2_secondary"
                  partition="1"
                  connectionName="oracle1"
                  siteName="site2"
                  primaryInstanceName="ref_site2_primary"
                  threadPoolSize="2"
                  retainDuration="0"
                  ripeDuration="30"
                  checkPointInterval="20"
                  maxPersistenceCatchupTime="0"
                  pluginPath="ece-ratedeventformatter.jar"
                  pluginType="oracle.communication.brm.charging.ratedevent.formatterplugin.internal.SampleFormatterPlugInImpl"
                  pluginName="brmCdrPluginDC2Secondary"
                  …  />
      </ratedEventFormatterConfigurationList>

      The siteName property determines the site whose rated events the instance processes. This lets you configure secondary instances as backups for remote sites. The sample specifies that the ref_site1_secondary instance runs at site 2 but processes rated events federated from site 1 in case of an outage.

    8. Configure the production sites to process the routing requests.

    9. Open the site-configuration.xml file. Configure all monitorAgent instances from all sites. Each Monitor Agent instance includes the Coherence cluster name, host name or IP address, and JMX port.

      Table 84-4 provides the configuration parameters of Monitor Agent.

      Table 84-4 Monitor Agent Configuration Parameters

      name: The name of the production or remote site where the request should be processed. These names should correspond to the site names defined for the Rated Event Formatter instances.

      host: The IP address of the participant site.

      jmxPort: The JMX port of the production or remote site.

      disableMonitor: Allows a monitorAgent instance to disable the collection of monitoring results when multiple monitorAgent instances are running within a site, preventing redundant monitoring results for the site.

        Note: The default value is false. If you set this value to true, the monitorAgent instance does not collect redundant monitoring results.

      Note:

      The monitorAgent properties should match the properties in the eceTopology.conf file, where a monitorAgent instance is configured to start from a specific production site.

    10. Copy the JMSConfiguration.xml file content of all sites to a single file and enter the following details:
      • Add the <Cluster>clusterName</Cluster> tag for the queue types.
      • Import the wallet for all clusters and specify the wallet path in the <KeyStoreLocation> and <ECEWalletLocation> locations.
    11. In the eceTopology.conf file, enable the JMX port for all ECS server nodes and clients, such as Diameter Gateway, HTTP Gateway, RADIUS Gateway, and EM Gateway. Also, enable the JMX port for each Monitor Agent instance.

    12. Start ECE. See "Starting ECE" for more information.

  2. On the backup or remote site, do the following:
    1. Configure the ECE components (Customer Updater, EM Gateway, and so on).

      Ensure the following:
      • The name of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site is unique.
      • At least two instances of Rated Event Formatter are configured to allow for failover. The data persistence-enabled system requires configuring at least one primary and one secondary instance for each site.
    2. Set the following parameter in the ECE_home/config/ece.properties file to false:
      loadConfigSettings = false
      The application-configuration data is not loaded into memory when you start the charging server nodes.
    3. Add all the details of participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

      To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 84-2 provides the federation configuration parameter descriptions and default values.

    4. Start the Elastic Charging Controller (ECC):
      ./ecc
    5. Start the charging server nodes:
      start server
  3. On the primary production site, run the following commands:

    gridSync start
    gridSync replicate

    The federation service is started and all the existing data is replicated to the backup or remote production sites.

  4. On the backup sites, do the following:
    1. Verify that the same number of entries as in the primary production site are available in the customer, balance, configuration, and pricing caches in the backup or remote production sites by using the query.sh utility.

    2. Verify that the charging server nodes in the backup or remote production sites are in the same state as the charging server nodes in the primary production site.

    3. Configure the following ECE components and the Oracle persistence database connection details by using a JMX editor:
      • Rated Event Formatter
      • Rated Event Publisher
      • Diameter Gateway
      • RADIUS Gateway
      • HTTP Gateway
      Ensure the following:
      • The name of Diameter Gateway, RADIUS Gateway, HTTP Gateway, Rated Event Formatter, and Rated Event Publisher for each site is unique.

      • At least two instances of Rated Event Formatter are configured to allow for failover. A data persistence-enabled system requires configuring at least one primary and one secondary instance for each site.

    4. Start the following ECE processes and gateways:

      start brmGateway
      start ratedEventFormatter
      start diameterGateway
      start radiusGateway
      start httpGateway

      The remote production sites are up and running with all required data.

    5. Run the following command:

      gridSync start

      The federation service is started to replicate the data from the backup or remote production sites to the preferred production site.

  5. After starting Rated Event Formatter in the remote production sites, ensure that you copy the CDR files generated by Rated Event Formatter from the remote production sites to the primary production site by using the SFTP utility.
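
The following fragment is a minimal sketch of how the participant entries described in Table 84-2 might appear in the federation-config section of the ECE Coherence override file, using the standard Coherence federation participant elements. The site names, IP addresses, and ports are placeholders, and your ECE-provided override file may already contain additional federation elements (for example, topology definitions) that you should leave in place.

  <federation-config>
      <participants>
          <!-- Current site: the federation service starts here and replicates data to the other participants. -->
          <participant>
              <name>BRM-S1</name>
              <initial-action>start</initial-action>
              <remote-addresses>
                  <socket-address>
                      <address>192.0.2.10</address>
                      <port>7574</port>
                  </socket-address>
              </remote-addresses>
          </participant>
          <!-- Remote or backup site: set initial-action to stop, as noted in Table 84-2. -->
          <participant>
              <name>BRM-S2</name>
              <initial-action>stop</initial-action>
              <remote-addresses>
                  <socket-address>
                      <address>192.0.2.20</address>
                      <port>7574</port>
                  </socket-address>
              </remote-addresses>
          </participant>
      </participants>
  </federation-config>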

Note:

When configuring an Active-Hot Standby system, the preferred site for each of the customer groups should be the same; that is, the preferred site should be the current active site. For example:

customerGroupConfigurations:
      - name: "customergroup1"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"
      - name: "customergroup2"
        clusterPreference:
          - priority: "1"
            routingGatewayList: ""
            name: "BRM"
          - priority: "2"
            routingGatewayList: ""
            name: "BRM2"

Including Custom Clients in Your Active-Active Configuration

If your system includes a custom client application that calls the ECE API, you need to add the custom client to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from your custom client to a backup site when a site failover occurs. To do so, you configure the custom client as an ECE Monitor Framework-compliant node in the ECE cluster.

To add a custom client to an active-active configuration:
  1. Modify your custom client to use the ECE Monitor Framework:
    1. Add this import statement:

      import oracle.communication.brm.charging.monitor.framework.internal.MonitorFramework;
    2. Add these lines to the program:

      if (MonitorFramework.isJMXEnabledApp) {
         MonitorFramework monitorFramework = (MonitorFramework) context.getBean(MonitorFramework.MONITOR_BEAN_NAME);
         try {
            monitorFramework.initializeMonitor(null); // null parameter for any non-ECE Monitor Agent node
         }
         catch (Exception ex) {
            // Failed to initialize Monitor Framework, check log file
            System.exit(-1);
         }
      }
      ...  // continue as before
  2. When you start your custom client, include these Java system properties:
    1. -Dcom.sun.management.jmxremote.port set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number.

    2. -Dcom.sun.management.jmxremote.rmi.port set to the port number to which the RMI connector will be bound.

    3. -Dtangosol.coherence.member set to the name of the custom client application instance running within the ECE cluster.

    For example:

    java -Dcom.sun.management.jmxremote.port=6666 \
    -Dcom.sun.management.jmxremote.rmi.port=6666 \
    -Dtangosol.coherence.member=customApp1 \
    -jar customApp1.jar
  3. Edit the ECE_home/config/eceTopology.conf file to include a row for each custom client application instance. For each row, enter the following information:
    1. node-name: The name of the JVM process for that node.

    2. role: The role of the JVM process for that node.

    3. host name: The host name of the physical server machine on which the node resides. For a standalone system, enter localhost.

    4. host ip: If your host contains multiple IP addresses, enter the IP address so that Coherence can be pointed to a port.

    5. JMX port: The JMX port of the JVM process for that node. By specifying a JMX port number for one node, you expose MBeans for setting performance-related properties and collecting statistics for all node processes. Enter any free port, such as 9999, for the charging server node to be the JMX-management enabled node.

    6. start CohMgt: Specify whether you want the node to be JMX-management enabled.

    For example:

    #node-name     |role       |host name  (no spaces!) |host ip  |JMX port  |start CohMgt  |JVM Tuning File  
    customApp1     |customApp  |localhost               |         |6666      |false         |

Including Offline Mediation Controller in Your Active-Active Configuration

If your system includes Oracle Communications Offline Mediation Controller, you need to add it to your active-active disaster recovery configuration. This enables the active-active system architecture to automatically route requests from Offline Mediation Controller to a backup site when a site failover occurs.

To include Offline Mediation Controller in your active-active configuration:
  1. On each active production site, do the following:
    1. Log in to your ECE driver machine as the rms user.

    2. In your ocecesdk/config/client-charging-context.xml file, add the following line to the beans element:

      <import resource="classpath:/META-INF/spring/monitor.framework-context.xml"/>
  2. On each Offline Mediation Controller machine, do the following:
    1. Log in to your Offline Mediation Controller machine as the rms user.

    2. Add the following lines to your OCOMC_home/bin/nodemgr file:

      -Dcom.sun.management.jmxremote 
      -Dcom.sun.management.jmxremote.ssl=false
      -Dcom.sun.management.jmxremote.rmi.port=rmi_port
      -Dcom.sun.management.jmxremote.port=port
      where:
      • rmi_port is set to the port number to which the RMI connector will be bound.

      • port is set to the port number for enabling JMX RMI connections. Ensure that you specify an unused port number.

    3. In the OCOMC_home/bin/UDCEnvironment file, set JMX_ENABLED_STATUS to true and set JMX_PORT to the desired JMX port number:

      #For enabling jmx in an active-active setup 
      JMX_ENABLED_STATUS=true
      JMX_PORT=9992
  3. On each Offline Mediation Controller machine, restart Node Manager by going to the OCOMC_home/bin directory and running this command:

    ./nodemgr

Failing Over to a Backup Site (Active-Active)

To fail over to a backup site in an active-active configuration:
  1. Open a JMX editor, such as JConsole.

  2. Expand the ECE Monitoring node.

  3. Expand Agent.

  4. Expand Operations.

  5. Run the failoverSite operation, specifying the name of the failed site. For a scripted alternative, see the sketch after this procedure.

  6. On the backup site, stop replicating the ECE cache data to the primary production site by running the following command:

    gridSync stop PrimaryProductionClusterName

    where PrimaryProductionClusterName is the name of the cluster in the primary production site.

  7. On the backup site, do the following:
    1. Change the BRM, PDC, and Customer Updater connection details to connect to BRM and PDC on the backup site by using a JMX editor.

      Note:

      If only ECE in the primary production site failed and BRM and PDC in the primary production site are still running, you need not change the BRM and PDC connection details on the backup site.
    2. Start BRM and PDC.
  8. Recover the data in the Oracle NoSQL database data store of the primary production site by performing the following:
    1. Convert the secondary Oracle NoSQL database data store node of the primary production site to the primary Oracle NoSQL database data store node by performing a failover operation in the Oracle NoSQL database data store. For more information, see "Performing a Failover" in Oracle NoSQL Database Administrator's Guide.

      The secondary Oracle NoSQL database data store node of the primary production site is now the primary Oracle NoSQL database data store node of the primary production site.

    2. On the backup site, convert the rated events from the Oracle NoSQL database data store node that you just converted into the primary node into CDR files by starting Rated Event Formatter.
    3. In a backup or remote production site, load the CDR files that you just converted into BRM by using Rated Event (RE) Loader.
    4. Shut down the Oracle NoSQL database data store node that you just converted into the primary node.

      See the "stop" utility in Oracle NoSQL Database Administrator's Guide for more information.

    5. Stop the Rated Event Formatter that you just started.
  9. In a backup or remote production site, start Pricing Updater, Customer Updater, and EM Gateway by running the following commands:

    start pricingUpdater
    start customerUpdater
    start emGateway

    All pricing and customer data is now back in the ECE grid in the backup or remote production site.

  10. Stop and restart BRM Gateway.

  11. Migrate internal BRM notifications from the primary production site to a backup or remote production site. See "Migrating ECE Notifications" for more information.

    Note:

    • If the expiry duration is configured for these notifications, ensure that you migrate the notifications before they expire. For the expiry duration, see the expiry-delay entry for the ServiceContext module in the ECE_home/config/charging-cache-config.xml file.
    • All external notifications from a production site are published to the respective JMS queue. Diameter Gateway retrieves the notifications from the JMS queue and replicates to other sites based on the configuration.
  12. Ensure that the network clients route all requests to the backup or remote production site.

The former backup site or one of the remote production sites is now the new preferred production site. When the original preferred site is functioning again, you run the recoverSite operation, and the site traffic routes back to the preferred site. For more information, see "Switching Back to the Original Production Site (Active-Active)".
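
The failoverSite step above can also be scripted through the standard JMX remote API. The Java sketch below is illustrative only: the host, port, and the Agent MBean ObjectName are placeholders that you must copy from your own JConsole tree, and it assumes that failoverSite accepts the failed site's cluster name as a single String argument, matching step 5 of the procedure (verify the operation signature in JConsole).

  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class MarkSiteFailed {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the Agent MBean under ECE Monitoring (copy it from JConsole)
          // args[2]: cluster name of the failed site
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Invoke failoverSite with the failed site's cluster name (assumed String signature).
              mbs.invoke(new ObjectName(args[1]), "failoverSite",
                      new Object[] { args[2] }, new String[] { String.class.getName() });
          } finally {
              connector.close();
          }
      }
  }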

Switching Back to the Original Production Site (Active-Active)

To switch back to the original production site in an active-active system:
  1. Install ECE and other required components in the original primary production site. For more information, see "Installing Elastic Charging Engine" in ECE Installation Guide.

    Note:

    If only ECE in the original primary production site failed and BRM and PDC in the original primary production site are still running, install only ECE and provide the connection details about BRM and PDC in the original primary production site during ECE installation.
  2. On the machine on which the Oracle WebLogic server is installed, verify that the JMS queues have been created for loading pricing data and for sending event notification, and that JMS credentials have been configured correctly.

  3. Set the following parameter in the ECE_home/config/ece.properties file to false:
    loadConfigSettings = false

    The configuration data is not loaded in memory.

  4. Add all details about participant sites in the federation-config section of the ECE Coherence override file (for example, ECE_home/config/charging-coherence-override-prod.xml).

    To confirm which ECE Coherence override file is used, see the tangosol.coherence.override value in the ECE_home/config/ece.properties file. Table 84-2 provides the federation configuration parameter descriptions and default values.

  5. Go to the ECE_home/bin directory.

  6. Start ECC:

    ./ecc
  7. Start the charging server nodes:

    start server
  8. Replicate the ECE cache data to the original production site by using the gridSync utility. For more information, see "Replicating ECE Cache Data".

  9. Start the following ECE processes and gateways:

    start brmGateway
    start ratedEventFormatter
    start diameterGateway
    start radiusGateway
    start httpGateway
  10. Verify that the same number of entries as in the new production site are available in the customer, balance, configuration, and pricing caches in the original production site by using the query.sh utility.

  11. Stop Pricing Updater, Customer Updater, and EM Gateway in the new primary production site and then start them in the original primary production site.

  12. Migrate internal BRM notifications from the new primary production site to the original primary production site. For more information, see "Migrating ECE Notifications".

  13. Change the BRM Gateway, Customer Updater, and Pricing Updater connection details to connect to BRM and PDC in the original primary production site by using a JMX editor.

  14. Stop RE Loader in the new primary production site and then start it in the original primary production site.

  15. Stop and restart BRM Gateway in both the new primary production site and the original primary production site.

    The roles of the sites are now reversed to the original roles.

  16. Open a JMX editor, such as JConsole.

  17. Expand the ECE Monitoring node.

  18. Expand Agent.

  19. Expand Operations.

  20. Run the recoverSite operation, specifying the name of the recovered site.

  21. If data persistence is enabled and you failed over your Rated Event Formatter instance at the original site to a secondary instance at a remote site, restart any primary and secondary Rated Event Formatter instances at the original site.

Note:

EM Gateway routes connection requests either to the local ECS or to the HTTP Gateway nodes on the remote sites. If the site does not respond, the request is processed locally on the same site. When a production site goes down, the CDR database retains all in-progress (or incomplete) CDR sessions, and all unrated 5G usage events should be diverted to the remote HTTP Gateway.

When a site is marked as down in an Active-Active setup, the gateways at that site need to be brought down; otherwise, they will continue to service requests, which is not expected.

Processing Usage Requests in the Site Received

To configure the ECE active-active mode to process usage requests in the site that receives the request irrespective of the subscriber's preferred site, perform the following steps:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand charging.brsConfigurations.default.

  4. Expand Attributes.

  5. Set the skipActiveActivePreferredSiteRouting attribute to true.

    Note:

    By default, the skipActiveActivePreferredSiteRouting attribute is set to false.
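
If you prefer to script this change rather than set the attribute in JConsole, the following Java sketch uses the standard JMX remote API. It is illustrative only: the host, port, and MBean ObjectName are placeholders that you must copy from your own JConsole tree, and it assumes the attribute is exposed as a Boolean (check the attribute type in JConsole).

  import javax.management.Attribute;
  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class SkipPreferredSiteRouting {
      public static void main(String[] args) throws Exception {
          // args[0]: host:port of a JMX-enabled ECE node, for example "eceHost:9999"
          // args[1]: ObjectName of the charging.brsConfigurations.default MBean (copy it from JConsole)
          JMXServiceURL url = new JMXServiceURL(
                  "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
          JMXConnector connector = JMXConnectorFactory.connect(url);
          try {
              MBeanServerConnection mbs = connector.getConnection();
              // Process usage requests in the site where they are received (assumed Boolean attribute).
              mbs.setAttribute(new ObjectName(args[1]),
                      new Attribute("skipActiveActivePreferredSiteRouting", Boolean.TRUE));
          } finally {
              connector.close();
          }
      }
  }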

Replicating ECE Cache Data

In an active-hot standby system, a segmented active-active system, or an active-active system, when you configure or perform disaster recovery, you replicate the ECE cache data to the participant sites by using the gridSync utility.

To replicate the ECE cache data:

  1. Go to the ECE_home/bin directory.

  2. Start ECC:

    ./ecc
  3. Do one of the following:

    • To start replicating data to a specific participant site asynchronously and also replicate all the existing ECE cache data to a specific participant site, run the following commands:

      gridSync start [remoteClusterName]
      gridSync replicate [remoteClusterName]

      where remoteClusterName is the name of the cluster in a participant site.

    • To start replicating data to all the participant sites asynchronously and also replicate all the existing ECE cache data to all the participant sites, run the following commands:

      gridSync start
      gridSync replicate

See "gridSync" for more information on the gridSync utility.

Migrating ECE Notifications

When you fail over to a backup site or switch back to the primary site, you must migrate the notifications to the destination site.

Note:

If you are using Apache Kafka for notification handling, notifications are not migrated to the destination site. Apache Kafka retains the notifications and these notifications appear in the original site or components when they are active.

To migrate ECE notifications:

  1. Access the ECE configuration MBeans in a JMX editor, such as JConsole. See "Accessing ECE Configuration MBeans".

  2. Expand the ECE Configuration node.

  3. Expand systemAdmin.

  4. Expand Operations.

  5. Select triggerFailedClusterServiceContextEventMigration.

  6. In the method's failedClusterName field, enter the name of the failed site's cluster.

  7. Click the triggerFailedClusterServiceContextEventMigration button.

All the internal BRM notifications are migrated to the destination site. In an active-active system, the external notifications are also migrated to the destination site. If you cannot establish the WebLogic cluster subscription due to a site failover, you should restart Diameter Gateway on the destination site. If a site recovers from a failover, you should restart all the Diameter Gateway instances in the cluster.