6 Configuring Geographically-Redundant Installations

This chapter describes how to replicate call state transactions across multiple, regional Oracle Communications Converged Application Server installations ("sites"):

Using Geographically-Redundant SIP Data Tiers
Requirements and Limitations
Steps for Configuring Geographic Persistence
Using the Configuration Wizard Templates for Geographic Persistence
Manually Configuring Geographic Redundancy
Understanding Geo-Redundant Replication Behavior
Monitoring Replication Across Regional Sites
Troubleshooting Replication

Introducing Geographic Redundancy

Geo-Redundancy ensures uninterrupted transactions and communications for providers, using geographically-separated SIP server deployments.

A primary site processes SIP transactions and communications and, upon reaching a transaction boundary, replicates the state data associated with that transaction to a secondary site. If the primary site fails, calls are routed from the failed primary site to a secondary site for processing. Similarly, upon recovery, the calls are re-routed back to the primary site.

Figure 6-1 Geo-Redundancy

The preceding figure illustrates Geo-Redundancy. The process proceeds as follows:

A call is initiated on a primary Converged Application Server cluster site, and call setup and processing occur normally.
The call is replicated as usual to the site's SIP State Tier, and becomes eligible for replication to a secondary site.
A single replica in the SIP State Tier then places the call state data to be replicated on a JMS queue configured on the replica site.
The call state data is transmitted to one of the available engines using JMS over the WAN.
Engines at the secondary site monitor their local queue for new messages. Upon receiving a message, an engine in the secondary site Converged Application Server cluster persists the call state data and assigns it the site ID value of the primary site.

Table 6-1 Geographic Redundancy flow

Normal Operation	Failover
When a session is initiated on a primary Converged Application Server site, call setup and processing occur normally.	The global load balancer policy is updated to begin routing calls from the primary site to the secondary site.
When a SIP transaction boundary is reached, the call is replicated (in-memory) to the site's data tier, and becomes eligible for replication to a secondary site.	Once the update is complete, the secondary site begins processing requests for the backed-up call state data.
A single replica in the data tier then places the call state data to be replicated on a JMS queue configured on the replica site.	When a request reaches the secondary site, the engine retrieves the data and activates the call state, taking ownership of the call.
Data is transmitted to one of the available engines in a round-robin fashion.	The engine sets the site ID associated with the call to zero (making it appear local).
Engines at the secondary site monitor their local queue for new messages.	The engine activates all dormant timers present in the call state.
Upon receiving a message, an engine on the secondary site persists the call state data and assigns it the site ID value of the primary site.	By default, call states are activated only for individual calls, and only after those calls are requested on the backup site.
The site ID distinguishes replicated call state data on the secondary site from any other call state data actively managed by the secondary site.	Servlets can use the WlssSipApplicationSession.getGeoSiteId() method to examine the site ID associated with a call.
Timers in replicated call state data remain dormant on the secondary site, so that timer processing does not become a bottleneck to performance.	Any non-zero value for the site ID indicates that the Servlet is working with call state data that was replicated from another site.

Situations Best Suited to Use Geo-Redundancy

The following situations are best suited to take advantage of Geo-Redundancy:

Your application uses SIP dialog states that are long-lived (dialog states that typically last 30 seconds or longer, such as SUBSCRIBE dialogs or conferences)
Your application would reasonably be able to reconstruct the session (re-INVITE, expire SUBSCRIBE dialogs to trigger re-subscriptions, and so on) from the state that has been replicated
The link between two Converged Application Server clusters or sites is low-bandwidth (<1 Gb/s in each direction) or has high or variable latency (>5 ms at the 95th percentile)

Situations Not Suited to Use Geo-Redundancy

Geo-Redundancy should not be used in these situations:

A high-capacity link between sites is available
Your application does not reach SIP dialog steady-states that are likely to last longer than the time it would take to re-route all traffic to the secondary site in the event of catastrophic failure (15-30 seconds)
The application session is likely to be terminated by the user before the application could reconstruct the session (most users will disconnect their calls before the session can be re-established from the secondary site)
The volume of session state objects created by the application is greater than the site interconnect can support

Geo-Redundancy Considerations

Consider the following issues when planning for Geo-Redundancy:

Dimension the system for the site link.
Each dialog state is ~25 Kb on the wire (25,600 bits).
A typical B2BUA is two (2) dialogs.
Aim for 25% utilization (or less, depending on the specific equipment and topology of the site) to accommodate “jitter” and sustained latency on the link.

For example, at 25% utilization a 100 Mb/s link can handle approximately 1000 call states per second, and a typical B2BUA (in the default configuration) generates 4 states during the call (two for each dialog). So, a 100 Mb/s link will support a single Converged Application Server cluster dimensioned for a peak arrival rate (call rate) of 250 CPS.
Geo-Redundancy is not transparent to the application. In most cases the application must be designed to use SetPersist() appropriately, and the developer must consider the volume of state that the application will queue for replication between sites.
SetPersist() should be used within the application code to selectively identify dialog states that will be long-lived (longer than ~20-30 seconds is a reasonable threshold).
Given the time it generally takes to route traffic to a secondary site, any application that replicates state more frequently will unnecessarily saturate the JMS queue and site interconnect.
JMS must be tuned to the specific application environment: serialization options, message batching, reliable delivery options, and queue size all vary with the specific application and site characteristics.
By default, Geo-Redundancy replicates all dialog state changes when it is enabled for the container. This default is not recommended for production deployments.
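
The link-dimensioning arithmetic in this section can be reproduced with a short calculation. The following is a minimal sketch, not part of the product: the class and method names are hypothetical, and the inputs (25,600 bits per dialog state, 25% target utilization, 4 states per B2BUA call) are simply the figures quoted above.

```java
// Hypothetical sizing helper; not a Converged Application Server API.
class GeoLinkSizing {

    /** Call states per second that fit in the usable share of the link. */
    static long statesPerSecond(long linkBitsPerSecond, double targetUtilization, long bitsPerState) {
        double usableBitsPerSecond = linkBitsPerSecond * targetUtilization;
        return (long) (usableBitsPerSecond / bitsPerState);
    }

    /** Peak call arrival rate (CPS) when each call produces statesPerCall replicated states. */
    static long maxCallsPerSecond(long linkBitsPerSecond, double targetUtilization,
                                  long bitsPerState, int statesPerCall) {
        return statesPerSecond(linkBitsPerSecond, targetUtilization, bitsPerState) / statesPerCall;
    }

    public static void main(String[] args) {
        long linkBitsPerSecond = 100_000_000L; // 100 Mb/s site link
        double utilization = 0.25;             // 25% target utilization
        long bitsPerState = 25_600L;           // one dialog state on the wire
        int statesPerCall = 4;                 // typical B2BUA: two dialogs, two states each

        System.out.println(statesPerSecond(linkBitsPerSecond, utilization, bitsPerState)
                + " states/s, "
                + maxCallsPerSecond(linkBitsPerSecond, utilization, bitsPerState, statesPerCall)
                + " CPS");
    }
}
```

With these inputs the sketch yields 976 states per second and 244 CPS, which the example above rounds to approximately 1000 and 250. Substituting your own link speed and measured state size gives a first estimate only; actual capacity also depends on JMS tuning and link latency.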

Using Geographically-Redundant SIP Data Tiers

The basic call state replication functionality available in the Converged Application Server SIP data tier provides excellent failover capabilities for a single site installation. However, the active replication performed within the SIP data tier requires high network bandwidth in order to meet the latency performance needs of most production networks. This bandwidth requirement makes a single SIP data tier cluster unsuitable for replicating data over large distances, such as from one regional data center to another.

The Converged Application Server geographic persistence feature enables you to replicate call state transactions across multiple Converged Application Server installations (multiple Administrative domains or "sites"). A geographically-redundant configuration minimizes dropped calls in the event of a catastrophic failure of an entire site, for example due to an extended, regional power outage.

Figure 6-2 Oracle Communications Converged Application Server Geographic Persistence

Converged Application Server deployment using geographic persistence.

Example Domain Configurations

A secondary Converged Application Server domain that persists data from another domain may itself process SIP traffic, or it may exist solely as an active standby domain. In the most common configuration, two sites are configured to replicate each other's call state data, with each site processing its own local SIP traffic. The administrator can then use either domain as the "secondary" site should one of the domains fail.

Figure 6-3 Common Geographically-Redundant Configuration

Example of a common geographically-redundant Converged Application Server configuration.

An alternate configuration utilizes a single domain that persists data from multiple, other sites, acting as the secondary for those sites. Although the secondary site in this configuration can also process its own, local SIP traffic, be aware that the resource requirements of the site may be considerable because of the need to persist active traffic from several other installations.

Figure 6-4 Alternate Geographically-Redundant Configuration

When using geographic persistence, a single replica in the primary site places modified call state data on a distributed JMS queue. By default, data is placed on the queue only at SIP dialog boundaries. (A custom API is provided for application developers who want to replicate data using a finer granularity, as described in "Using Persistence Hints in SIP Applications".) In a secondary site, engine tier servers use a message listener to monitor the distributed queue, receive messages, and write the data to the site's own SIP data tier cluster. If the secondary site uses an RDBMS to store long-lived call states (recommended), then all data writes from the distributed queue go directly to the RDBMS, rather than to the in-memory storage of the SIP data tier.

Requirements and Limitations

The Converged Application Server geographically-redundant persistence feature is most useful for sites that manage long-lived call state data in an RDBMS. Short-lived calls may be lost in the transition to a secondary site, because Converged Application Server may choose to collect data for multiple call states before replicating between sites.

You must have a reliable, site-aware load balancing solution that can partition calls between geographic locations, as well as monitor the health of a given regional site. Converged Application Server provides no automated functionality for detecting the failure of an entire domain, or for failing over to a secondary site. It is the responsibility of the Administrator to determine when a given site has "failed," and to redirect that site's calls to the correct secondary site. Furthermore, the site-aware load balancer must direct all messages for a given callId to a single home site (the "active" site). If, after a failover, the failed site is restored, the load balancer must continue directing calls to the active site and not partition calls between the two sites.

During a failover to a secondary site, some calls may be dropped. This can occur because Converged Application Server generally queues call state data for site replication only at SIP dialog boundaries. Failures that occur before the data is written to the queue result in the loss of that data.

Also, Converged Application Server replicates call state data across sites only when a SIP dialog boundary changes the call state. If a long-running call exists on the primary site before the secondary site is started, and the call state remains unmodified, that call's data is not replicated to the secondary site. Should a failure occur before a long-running call state has been replicated, the call is lost during failover.

When planning for the capacity of a Converged Application Server installation, be aware that, after a failover, a given site must be able to support all of the calls from the failed site as well as from its own geographic location. This means that all sites that are involved in a geographically-redundant configuration will operate at less than maximum capacity until a failover occurs.
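
The capacity rule above reduces to simple arithmetic. The sketch below is illustrative only: the class name, method names, and the call-rate figures are assumptions chosen for the example, not product values.

```java
// Hypothetical planning helper; the figures used in main are examples only.
class FailoverHeadroom {

    /** Load a surviving site must carry after taking over a failed peer. */
    static int loadAfterFailover(int ownPeakCps, int failedSitePeakCps) {
        return ownPeakCps + failedSitePeakCps;
    }

    /** True if the site can absorb the failed site's calls without exceeding its capacity. */
    static boolean canAbsorb(int siteCapacityCps, int ownPeakCps, int failedSitePeakCps) {
        return loadAfterFailover(ownPeakCps, failedSitePeakCps) <= siteCapacityCps;
    }

    public static void main(String[] args) {
        // Two mutually replicating sites, each sized for 500 CPS.
        System.out.println(canAbsorb(500, 250, 250)); // true: each site runs at half capacity
        System.out.println(canAbsorb(500, 300, 250)); // false: 550 CPS exceeds 500 CPS
    }
}
```

For two equally sized sites that back each other up, this means each site should normally run at no more than about half of its maximum capacity.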

Steps for Configuring Geographic Persistence

In order to use the Converged Application Server geographic persistence features, you must perform certain configuration tasks on both the primary "home" site and on the secondary replication site.

Table 6-2 Steps for Configuring Geographic Persistence

Steps for Primary "Home" Site	Steps for Secondary "Replication" Site
1. Install Converged Application Server software and create replicated domain.	1. Install Converged Application Server software and create replicated domain.
2. Enable RDBMS storage for long-lived call states (recommended).	2. Enable RDBMS storage for long-lived call states (recommended).
3. Configure persistence options to: define the unique regional site ID; identify the secondary site's URL; and enable replication hints.	3. Configure JMS Servers and modules required for replicating data.
	4. Configure persistence options to define the unique regional site ID.

Note:

In most production deployments, two sites will perform replication services for each other, so you will generally configure each installation as both a primary and secondary site.

Converged Application Server provides domain templates to automate the configuration of most of the resources described in Table 6-2. See "Using the Configuration Wizard Templates for Geographic Persistence" for information about using the templates.

If you have an existing Converged Application Server domain and want to use geographic persistence, follow the instructions in "Manually Configuring Geographic Redundancy" to create the resources.

Using the Configuration Wizard Templates for Geographic Persistence

Converged Application Server provides two Configuration Wizard templates for using geographic persistence features:

WL_HOME/common/templates/domains/geo1domain.jar configures a primary site having a site ID of 1. The domain replicates data to the engine tier servers created in geo2domain.jar.
WL_HOME/common/templates/domains/geo2domain.jar configures a secondary site that replicates call state data from the domain created with geo1domain.jar. This installation has a site ID of 2.

The server port numbers in both domain templates are unique, so you can test geographic persistence features on a single machine if necessary. Follow the instructions in the sections that follow to install and configure each domain.

Installing and Configuring the Primary Site

Follow these steps to create a new primary domain from the template:

Start the Configuration Wizard application:
```
cd ~/CAS50_home/wlserver_10.3/common/bin
./config.sh
```
where CAS50_home is the directory where you installed the Converged Application Server software.
Accept the default selection, Create a new WebLogic domain, and click Next.
Select Base this domain on an existing template, and click Browse to display the Select a Template dialog.
Select the template named geo1domain.jar, and click OK.
Click Next.
Enter the username and password for the Administrator of the new domain, and click Next.
Select a JDK to use, and click Next.
Select No to keep the settings defined in the source template file, and click Next.
Click Create to create the domain.

The template creates a new domain with two engine tier servers in a cluster, two SIP data tier servers in a cluster, and an Administration Server (AdminServer). The engine tier cluster includes the following resources and configuration:
- A JDBC datasource, wlss.callstate.datasource, required for storing long-lived call state data. If you want to use this functionality, edit the datasource to include your RDBMS connection information as described in "Modify the JDBC Datasource Connection Information".
- A persistence configuration (shown in the SipServer node, Configuration > Persistence tab of the Administration Console) that defines:
  - Default handling of persistence hints for both RDBMS and geographic persistence.
  - A Geo Site ID of 1.
  - A Geo Remote T3 URL of t3://localhost:8011,localhost:8061, which identifies the engine tier servers in the "geo2" domain as the replication site for geographic redundancy.
Click Done to exit the configuration wizard.
Follow the steps under "Installing the Secondary Site" to create the domain that performs the replication.

Installing the Secondary Site

Follow these steps to use a template to create a secondary site for replicating call state data from the "geo1" domain:

Start the Configuration Wizard application:
```
cd ~/CAS50_home/wlserver_10.3/common/bin
./config.sh
```
where CAS50_home is the directory where you installed the Converged Application Server software.
Accept the default selection, Create a new WebLogic domain, and click Next.
Select Base this domain on an existing template, and click Browse to display the Select a Template dialog.
Select the template named geo2domain.jar, and click OK.
Click Next.
Enter the username and password for the Administrator of the new domain, and click Next.
Select a JDK to use, and click Next.
Select No to keep the settings defined in the source template file, and click Next.
Click Create to create the domain.

The template creates a new domain with two engine tier servers in a cluster, two SIP data tier servers in a cluster, and an Administration Server (AdminServer). The engine tier cluster includes the following resources and configuration:
- A JDBC datasource, wlss.callstate.datasource, required for storing long-lived call state data. If you want to use this functionality, edit the datasource to include your RDBMS connection information as described in "Modify the JDBC Datasource Connection Information".
- A persistence configuration (shown in the SipServer node, Configuration > Persistence tab of the Administration Console) that defines:
  - Default handling of persistence hints for both RDBMS and geographic redundancy.
  - A Geo Site ID of 2.
- A JMS system module, SystemModule-Callstate, that includes:
  - ConnectionFactory-Callstate, a connection factory required for backing up call state data from a primary site.
  - DistributedQueue-Callstate, a uniform distributed queue required for backing up call state data from a primary site.
  The JMS system module is targeted to the site's engine tier cluster.
- Two JMS Servers, JMSServer-1 and JMSServer-2, deployed to engine1-site2 and engine2-site2, respectively.
Click Done to exit the configuration wizard.

Manually Configuring Geographic Redundancy

If you have an existing replicated Converged Application Server installation, or pair of installations, you must manually create the JMS and JDBC resources required for enabling geographic redundancy. You must also configure each site to perform replication. The steps to enable geographic redundancy are:

Configure JDBC Resources. Oracle recommends configuring both the primary and secondary sites to store long-lived call state data in an RDBMS.
Configure Persistence Options. Persistence options must be configured on both the primary and secondary sites to enable engine tier hints to write to an RDBMS or to replicate data to a geographically-redundant installation.
Configure JMS Resources. A secondary site must have available JMS Servers and specific JMS module resources in order to replicate call state data from another site.

The sections that follow describe each step in detail.

Configuring JDBC Resources (Primary and Secondary Sites)

Follow the instructions in "Storing Long-Lived Call State Data in an RDBMS" to configure the JDBC resources required for storing long-lived call states in an RDBMS.

Configuring Persistence Options (Primary and Secondary Sites)

Both the primary and secondary sites must configure the correct persistence settings in order to enable replication for geographic redundancy. Follow these steps to configure persistence:

Use your browser to access the URL http://address:port/console where address is the Administration Server's listen address and port is the listen port.
Select the SipServer node in the left pane. The right pane of the console provides two levels of tabbed pages that are used for configuring and monitoring Converged Application Server.
Select Configuration, then select the Persistence tab in the right pane.
Configure the Persistence attributes as follows:
- Default Handling: Select "all" to persist long-lived call state data to an RDBMS and to replicate data to an external site for geographic redundancy (recommended). If your installation does not store call state data in an RDBMS, select "geo" instead of "all."
- Geo Site ID: Enter a unique number from 1 to 9 to distinguish this site from all other configured sites. Note that the site ID of 0 is reserved to indicate call states that are local to the site in question (call states not replicated from another site).
- Geo Remote T3 URL: For primary sites (or for secondary sites that replicate their own data to another site), enter the T3 URL or URLs of the engine tier servers that will replicate this site's call state data. If the secondary engine tier cluster uses a cluster address, you can enter a single T3 URL, such as t3://mycluster:7001. If the secondary engine tier cluster does not use a cluster address, enter the URLs for each individual engine tier server separated by commas, such as t3://engine1-east-coast:7001,t3://engine2-east-coast:7002,t3://engine3-east-coast:7001,t3://engine4-east-coast:7002.
Click Save to save your configuration changes.
Click Activate Changes to apply your changes to the engine tier servers.
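
Before entering values in the console, the Geo Site ID and Geo Remote T3 URL rules described above can be sanity-checked with a small helper. This is a sketch under stated assumptions: the class and method names are hypothetical, and the check that each entry has a host and port is this example's own convention rather than documented server behavior. Entries without a scheme are treated as T3 addresses, matching the shorter form used in the domain templates.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check; not a Converged Application Server API.
class GeoPersistenceCheck {

    /** Site IDs 1 through 9 identify sites; 0 is reserved for call states local to a site. */
    static boolean isValidSiteId(int siteId) {
        return siteId >= 1 && siteId <= 9;
    }

    /** Splits a comma-separated Geo Remote T3 URL value and checks each entry. */
    static List<URI> parseRemoteT3Urls(String value) {
        List<URI> result = new ArrayList<>();
        for (String entry : value.split(",")) {
            String text = entry.trim();
            if (!text.contains("://")) {
                text = "t3://" + text; // assumption: bare host:port entries use T3
            }
            URI uri = URI.create(text); // throws IllegalArgumentException on malformed input
            if (!"t3".equals(uri.getScheme()) || uri.getHost() == null || uri.getPort() < 0) {
                throw new IllegalArgumentException("Not a t3://host:port URL: " + text);
            }
            result.add(uri);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(isValidSiteId(0)); // false: 0 is reserved for local call states
        System.out.println(parseRemoteT3Urls("t3://localhost:8011,localhost:8061").size()); // 2
    }
}
```

A mistyped scheme in one entry of a long URL list is easy to miss by eye; here such an entry is rejected with an IllegalArgumentException.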

Configuring JMS Resources (Secondary Site Only)

Any site that replicates call state data from another site must configure certain required JMS resources. The resources are not required for sites that do not replicate data from another site.

Follow these steps to configure JMS resources:

Use your browser to access the URL http://address:port/console where address is the Administration Server's listen address and port is the listen port.
Select Services, then select Messaging, and then select the JMS Servers tab in the left pane.
Click New in the right pane.
Enter a unique name for the JMS Server or accept the default name. Click Next to continue.
In the Target list, select the name of a single engine tier server node in the installation. Click Finish to create the new Server.
Repeat Steps 3 through 5 to create a dedicated JMS Server for each engine tier server node in your installation.
Select Services, then select Messaging, and then select the JMS Modules node in the left pane.
Click New in the right pane.
Fill in the fields of the Create JMS System Module page as follows:
- Name: Enter a name for the new module, or accept the default name.
- Descriptor File Name: Enter the prefix of a configuration file name in which to store the JMS module configuration (for example, systemmodule-callstate).
Click Next to continue.
Select the name of the engine tier cluster, and choose the option All servers in the cluster.
Click Next to continue.
Select Would you like to add resources to this JMS system module and click Finish to create the module.
Click New to add a new resource to the module.
Select the Connection Factory option and click Next.
Fill in the fields of the Create a new JMS System Module Resource as follows:
- Name: Enter a descriptive name for the resource, such as ConnectionFactory-Callstate.
- JNDI Name: Enter the name wlss.callstate.backup.site.connection.factory.
Click Next to continue.
Click Finish to save the new resource.
Select the name of the connection factory resource you just created.
Select Configuration, then select the Load Balance tab in the right pane.
De-select the Server Affinity Enabled option, and click Save.
Re-select Services, then select Messaging, and then select the JMS Modules node in the left pane.
Select the name of the JMS module you created in the right pane.
Click New to create another JMS resource.
Select the Distributed Queue option and click Next.
Fill in the fields of the Create a new JMS System Module Resource as follows:
- Name: Enter a descriptive name for the resource, such as DistributedQueue-Callstate.
- JNDI Name: Enter the name wlss.callstate.backup.site.queue.
Click Next to continue.
Click Finish to save the new resource.
Click Save to save your configuration changes.
Click Activate Changes to apply your changes to the engine tier servers.

Understanding Geo-Redundant Replication Behavior

This section provides more detail about how multiple sites replicate call state data. Administrators can use this information to better understand the mechanics of geo-redundant replication and to better troubleshoot any problems that may occur in such a configuration. Note, however, that the internal workings of replication across Converged Application Server installations are subject to change in future releases of the product.

Call State Replication Process

When a call is initiated on a primary Converged Application Server site, call setup and processing occurs normally. When a SIP dialog boundary is reached, the call is replicated (in-memory) to the site's SIP data tier, and becomes eligible for replication to a secondary site. Converged Application Server may choose to aggregate multiple call states for replication in order to optimize network usage.

A single replica in the SIP data tier then places the call state data to be replicated on a JMS queue configured on the replica site. Data is transmitted to one of the available engines (specified in the geo-remote-t3-url element in sipserver.xml) in a round-robin fashion. Engines at the secondary site monitor their local queue for new messages.
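
Round-robin distribution across the configured engine URLs can be pictured with the following sketch. It is a simplified model for illustration, not the product's implementation; the class name and the engine addresses are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Simplified model of round-robin engine selection; not the server's actual code.
class RoundRobinEngines {
    private final List<String> engineUrls;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinEngines(List<String> engineUrls) {
        if (engineUrls.isEmpty()) {
            throw new IllegalArgumentException("At least one engine URL is required");
        }
        this.engineUrls = new ArrayList<>(engineUrls);
    }

    /** Returns the engine that should receive the next unit of call state data. */
    String nextEngine() {
        // floorMod keeps the index non-negative even if the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), engineUrls.size());
        return engineUrls.get(index);
    }

    public static void main(String[] args) {
        RoundRobinEngines engines = new RoundRobinEngines(
                Arrays.asList("t3://engine1:7001", "t3://engine2:7002"));
        for (int i = 0; i < 3; i++) {
            System.out.println(engines.nextEngine()); // engine1, engine2, then engine1 again
        }
    }
}
```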

Upon receiving a message, an engine on the secondary site persists the call state data and assigns it the site ID value of the primary site. The site ID distinguishes replicated call state data on the secondary site from any other call state data actively managed by the secondary site. Timers in replicated call state data remain dormant on the secondary site, so that timer processing does not become a bottleneck to performance.

Call State Processing After Failover

To perform a failover, the Administrator must change a global load balancer policy to begin routing calls from the primary, failed site to the secondary site. After this process is completed, the secondary site begins processing requests for the backed-up call state data. When a request is made for data that has been replicated from the failed site, the engine retrieves the data and activates the call state, taking ownership of the call. The activation process involves:

Setting the site ID associated with the call to zero (making it appear local).
Activating all dormant timers present in the call state.
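
The two activation steps above can be modeled in a few lines. This is a conceptual sketch only: the class, its fields, and its methods are hypothetical stand-ins and do not correspond to the server's internal call state classes.

```java
// Conceptual model of a replicated call state; all names are hypothetical.
class ReplicatedCallState {
    static final int LOCAL_SITE_ID = 0; // site ID 0 marks call states local to this site

    private final String callId;
    private int siteId;
    private boolean timersActive;

    ReplicatedCallState(String callId, int originSiteId) {
        this.callId = callId;
        this.siteId = originSiteId;
        // Timers stay dormant while the state is only a backup copy from another site.
        this.timersActive = (originSiteId == LOCAL_SITE_ID);
    }

    /** Any non-zero site ID means the state was replicated from another site. */
    boolean isReplicated() {
        return siteId != LOCAL_SITE_ID;
    }

    boolean timersActive() {
        return timersActive;
    }

    /** Takes ownership of the call: reset the site ID to zero and wake dormant timers. */
    void activate() {
        siteId = LOCAL_SITE_ID;
        timersActive = true;
    }

    public static void main(String[] args) {
        ReplicatedCallState state = new ReplicatedCallState("call-1", 1);
        System.out.println(state.isReplicated() + " " + state.timersActive()); // true false
        state.activate();
        System.out.println(state.isReplicated() + " " + state.timersActive()); // false true
    }
}
```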

By default, call states are activated only for individual calls, and only after those calls are requested on the backup site. SipServerRuntimeMBean includes a method, activateBackup(byte site), that can be used to force a site to take over all call state data that it has replicated from another site. The Administrator can execute this method using a WLST configuration script. Alternatively, an application deployed on the server can detect when a request for replicated site data occurs, and then execute the method. Example 6-1 shows sample code from a JSP that activates a secondary site, changing ownership of all call state data replicated from site 1. Similar code could be used within a deployed Servlet. Note that either a JSP or Servlet must run as a privileged user in order to execute the activateBackup method.

To detect whether a particular call state request refers to replicated data, Servlets can use the WlssSipApplicationSession.getGeoSiteId() method to examine the site ID associated with a call. Any non-zero value for the site ID indicates that the Servlet is working with call state data that was replicated from another site.

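Example 6-1 Activating a Secondary Site Using JMX

<%@ page import="java.util.Set, javax.naming.InitialContext, javax.management.MBeanServer, javax.management.MBeanServerInvocationHandler, javax.management.ObjectInstance, javax.management.ObjectName" %>
<%-- Also import SipServerRuntimeMBean from the Converged Application Server API; see the JavaDoc for its package. --%>
<%
    // Site ID of the primary site whose replicated call states this site takes over.
    byte site = 1;

    // Look up the local runtime MBean server through JNDI.
    InitialContext ctx = new InitialContext();
    MBeanServer server = (MBeanServer) ctx.lookup("java:comp/env/jmx/runtime");
    Set<ObjectInstance> set = server.queryMBeans(new ObjectName("*:*,Type=SipServerRuntime"), null);
    if (set.isEmpty()) {
      throw new IllegalStateException("No SipServerRuntime MBeans found");
    }

    // Create a typed proxy for the first matching MBean. newProxyInstance
    // requires the interface class and a notification-broadcaster flag.
    ObjectInstance oi = set.iterator().next();
    SipServerRuntimeMBean bean = MBeanServerInvocationHandler.newProxyInstance(
        server, oi.getObjectName(), SipServerRuntimeMBean.class, false);

    // Take ownership of all call state data replicated from the given site.
    bean.activateBackup(site);
  %>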

Note that after a failover, the load balancer must route all calls having the same callId to the newly-activated site. Even if the original, failed site is restored to service, the load balancer must not partition calls between the two geographical sites.

Removing Backup Call States

You may also choose to stop replicating call states to a remote site in order to perform maintenance on the remote site or to change the backup site entirely. Replication can be stopped by setting the Default Handling attribute to "none" on the primary site as described in "Configuring Persistence Options (Primary and Secondary Sites)".

After disabling geographic replication on the primary site, you also may want to remove backup call states on the secondary site. SipServerRuntimeMBean includes a method, deleteBackup(byte site), that can be used to force a site to remove all call state data that it has replicated from another site. The Administrator can execute this method using a WLST configuration script or via an application deployed on the secondary site. The steps for executing this method are similar to those for using the activateBackup method, described in "Call State Processing After Failover".

Monitoring Replication Across Regional Sites

The ReplicaRuntimeMBean includes two new methods to retrieve data about geographically-redundant replication:

getBackupStoreOutboundStatistics() provides information about the number of calls queued to a secondary site's JMS queue.
getBackupStoreInboundStatistics() provides information about the call state data that a secondary site replicates from another site.

See the Converged Application Server JavaDoc for more information about ReplicaRuntimeMBean.

Troubleshooting Replication

In addition to using the ReplicaRuntimeMBean methods described in "Monitoring Replication Across Regional Sites", Administrators should monitor any SNMP traps that indicate failed database writes on a secondary site installation.

Administrators must also ensure that all sites participating in geographically-redundant configurations use unique site IDs.
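
A site ID audit across installations can be scripted in a few lines. The sketch below is a hypothetical helper, not a product tool; the site names and IDs are examples, and in practice the IDs would be collected from each domain's Persistence configuration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical audit helper; site names and IDs in main are examples only.
class SiteIdAudit {

    /** Reports every site whose ID is already used by a site listed earlier. */
    static List<String> findConflicts(Map<String, Integer> siteIdsByName) {
        Map<Integer, String> firstOwner = new HashMap<>();
        List<String> conflicts = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : siteIdsByName.entrySet()) {
            String previous = firstOwner.putIfAbsent(entry.getValue(), entry.getKey());
            if (previous != null) {
                conflicts.add(entry.getKey() + " duplicates " + previous);
            }
        }
        return conflicts;
    }

    public static void main(String[] args) {
        Map<String, Integer> sites = new LinkedHashMap<>(); // keeps insertion order
        sites.put("east", 1);
        sites.put("west", 2);
        sites.put("central", 1);
        System.out.println(findConflicts(sites)); // [central duplicates east]
    }
}
```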