10 Configuring Geographically-Redundant Installations

This chapter describes how to replicate call state transactions across multiple, regional Oracle Communications Converged Application Server installations.

About Geographic Redundancy

Geographic redundancy ensures uninterrupted transactions and communications for providers, using geographically-separated SIP server deployments.

A primary site processes SIP transactions and communications and, upon reaching a transaction boundary, replicates the state data associated with the transaction to a secondary site. Upon failure of the primary site, calls are routed from the failed primary site to a secondary site for processing. Similarly, upon recovery, the calls are re-routed back to the primary site.

Figure 10-1 Geo-Redundancy


The preceding figure illustrates geo-redundancy. The process proceeds as follows:

A call is initiated on a primary Converged Application Server cluster site; call setup and processing occur normally.
The call is replicated as usual to the site's Coherence cache and becomes eligible for replication to a secondary site.
A single engine in the Coherence cache then places the call state data to be replicated on a JMS queue configured on the replica site.
The call state data is transmitted to one of the available engines using JMS over the WAN.
Engines at the secondary site monitor their local queue for new messages. Upon receiving a message, an engine in the secondary site Converged Application Server cluster persists the call state data and assigns it the site ID value of the primary site.

Table 10-1 Geographic Redundancy Flow

Normal Operation	Failover
When a session is initiated on a primary Converged Application Server site, call setup and processing occurs normally.	The Administrator updates the global load balancing policy to begin routing calls from the failed primary site to the secondary site.
When a SIP transaction boundary is reached, the call is replicated (in-memory) to the site's Coherence cache, and becomes eligible for replication to a secondary site.	Once complete, the secondary site begins processing requests for the backed-up call state data.
A single engine in the Coherence cache then places the call state data to be replicated on a JMS queue configured on the replica site.	When a request arrives for data replicated from the failed site, the secondary site engine retrieves the data and activates the call state, taking ownership of the call.
Data is transmitted to one of the available engines in a round-robin fashion.	The engine sets the site ID associated with the call to zero (making it appear local).
Engines at the secondary site monitor their local queue for new messages.	The engine activates all dormant timers present in the call state.
Upon receiving a message, an engine on the secondary site persists the call state data and assigns it the site ID value of the primary site.	By default, call states are activated only for individual calls, and only after those calls are requested on the backup site.
The site ID distinguishes replicated call state data on the secondary site from any other call state data actively managed by the secondary site.	Servlets can use the WlssSipApplicationSession.getGeoSiteId() method to examine the site ID associated with a call.
Timers in replicated call state data remain dormant on the secondary site, so that timer processing does not become a bottleneck to performance.	Any non-zero value for the site ID indicates that the Servlet is working with call state data that was replicated from another site.

Situations Best Suited to Use Geo-Redundancy

The following situations are best suited to take advantage of Geo-Redundancy:

Your application uses SIP dialog states that are long-lived (dialog states that typically last 30 seconds or longer, such as SUBSCRIBE dialogs or conferences)
Your application would reasonably be able to reconstruct the session (re-INVITE, expire SUBSCRIBE dialogs to trigger re-subscriptions, and so on) from the state that has been replicated
The link between two Converged Application Server clusters or sites is low-bandwidth (<1 Gb/s in each direction) or has high (or variable) latency (>5ms 95%)

Situations Not Suited to Use Geo-Redundancy

Geo-Redundancy should not be used in these situations:

A high-capacity link between sites is available
Your application does not reach SIP dialog steady-states that are likely to last longer than the time it would take to re-route all traffic to the secondary site in the event of catastrophic failure (15-30 seconds)
The application session is likely to be terminated by the user before the application could re-construct the session (most users will disconnect their calls before the session can be re-established from the secondary site)
The volume of session state objects created by the application is greater than the site interconnect can support

Geo-Redundancy Considerations

Consider the following issues when planning for Geo-Redundancy:

Dimension the system for the site link.
Each dialog state is ~20KB on the wire.
A typical B2BUA is two (2) dialogs.
Aim for 25% utilization (or less, depending on the specific equipment and topology of the site) to accommodate "jitter" and sustained latency on the link.

For example, a 100 Mb/s link can handle approximately 1000 call states per second, and a typical B2BUA (in the default configuration) generates 4 states during the call (two for each dialog). So, a 100 Mb/s link will support a single Converged Application Server cluster dimensioned for a peak arrival rate (call rate) of 250 CPS.
Geo-Redundancy is not transparent to the application; in most cases the application must be designed to use SetPersist() appropriately, and the developer must consider the volume of state that the application will queue for replication between sites.
Given the time it generally takes to route traffic to a secondary site, any application that replicates state more frequently than that will unnecessarily saturate the JMS queue and site interconnect.
JMS must be tuned to the specific application environment: serialization options, message batching, reliable delivery options, and queue size all vary with the specific application and site characteristics.
Geo-Redundancy default behavior is to replicate all dialog state changes when Geo-Redundancy is enabled for the container (this is not recommended for production deployments).
SetPersist() should be used within the application code to selectively identify dialog states that will be long-lived (longer than ~20-30 seconds would be a reasonable threshold).
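
The call-rate step of the dimensioning example above can be checked with a short calculation. The following is a minimal sketch, not a sizing tool: the class and method names are illustrative, the 1000 states-per-second figure is taken as given from the example rather than derived, and the link speed, utilization target, and per-state size passed to statesPerSecond() are inputs you would need to measure for your own deployment.

```java
// Sketch only: LinkDimensioning is a hypothetical helper, not a product API.
final class LinkDimensioning {

    /** Call states per second that fit in the usable share of a site link. */
    static double statesPerSecond(double linkBitsPerSecond,
                                  double targetUtilization,
                                  double stateBytes) {
        double usableBytesPerSecond = linkBitsPerSecond * targetUtilization / 8.0;
        return usableBytesPerSecond / stateBytes;
    }

    /** Peak call arrival rate (CPS) supported by a replication budget. */
    static double maxCallRate(double statesPerSecond, int statesPerCall) {
        return statesPerSecond / statesPerCall;
    }

    public static void main(String[] args) {
        // Figures from the example: about 1000 states/s on the link,
        // and 4 states per B2BUA call (two for each of its two dialogs).
        System.out.println(maxCallRate(1000, 4)); // prints 250.0
    }
}
```

Keeping the two steps separate makes it easy to substitute a measured states-per-second budget for the round figure used in the example.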

Using Geographically-Redundant SIP Engines

The basic call state replication functionality available in the Converged Application Server Coherence cache provides excellent failover capabilities for a single site installation. However, the active replication performed within the Coherence cache requires high network bandwidth in order to meet the latency performance needs of most production networks. This bandwidth requirement makes a single Coherence cache cluster unsuitable for replicating data over large distances, such as from one regional data center to another.

The Converged Application Server geographic persistence feature enables you to replicate call state transactions across multiple Converged Application Server installations (multiple Administrative domains or "sites"). A geographically-redundant configuration minimizes dropped calls in the event of a catastrophic failure of an entire site, for example due to an extended, regional power outage.

Example Domain Configurations

A secondary Converged Application Server domain that persists data from another domain may itself process SIP traffic, or it may exist solely as an active standby domain. In the most common configuration, two sites are configured to replicate each other's call state data, with each site processing its own local SIP traffic. The administrator can then use either domain as the "secondary" site should one of the domains fail.

Figure 10-2 Common Geographically-Redundant Configuration


An alternate configuration utilizes a single domain that persists data from multiple, other sites, acting as the secondary for those sites. Although the secondary site in this configuration can also process its own, local SIP traffic, be aware that the resource requirements of the site may be considerable because of the need to persist active traffic from several other installations.

Figure 10-3 Alternate Geographically-Redundant Configuration


When using geographic persistence, a single engine in the primary site places modified call state data on a distributed JMS queue. By default, data is placed on the queue only at SIP dialog boundaries. (A custom API is provided for application developers who want to replicate data using a finer granularity, as described in "Using Persistence Hints in SIP Applications".) In a secondary site, engines use a message listener to monitor the distributed queue, receive messages, and write the data to the site's own Coherence cache. If the secondary site uses an RDBMS to store long-lived call states (recommended), then the call state data entries are written into the RDBMS and removed from the in-memory call state cache.

Requirements and Limitations

The Converged Application Server geographically-redundant persistence feature is most useful for sites that manage long-lived call state data in an RDBMS. Short-lived calls may be lost in the transition to a secondary site, because Converged Application Server may choose to collect data for multiple call states before replicating between sites.

You must have a reliable, site-aware load balancing solution that can partition calls between geographic locations, as well as monitor the health of a given regional site. Converged Application Server provides no automated functionality for detecting the failure of an entire domain, or for failing over to a secondary site. It is the responsibility of the Administrator to determine when a given site has "failed," and to redirect that site's calls to the correct secondary site. Furthermore, the site-aware load balancer must direct all messages for a given callId to a single home site (the "active" site). If, after a failover, the failed site is restored, the load balancer must continue directing calls to the active site and not partition calls between the two sites.
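
The routing requirement above can be illustrated with a small model. This is a sketch under stated assumptions: SiteAwareRouter is a hypothetical class that stands in for load balancer policy (it is not part of Converged Application Server), site names are arbitrary strings, and failover is triggered explicitly because the product does not detect site failure itself.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of a site-aware load balancing policy (hypothetical).
final class SiteAwareRouter {
    // Every callId is pinned to exactly one active site.
    private final Map<String, String> activeSiteByCallId = new HashMap<>();
    // Administrator-defined redirects: failed site -> secondary site.
    private final Map<String, String> redirects = new HashMap<>();

    /** Returns the single active site for all messages of the given callId. */
    String route(String callId, String regionalSite) {
        return activeSiteByCallId.computeIfAbsent(callId,
                id -> redirects.getOrDefault(regionalSite, regionalSite));
    }

    /** Administrator action: send the failed site's calls to the secondary site. */
    void failover(String failedSite, String secondarySite) {
        redirects.put(failedSite, secondarySite);
        activeSiteByCallId.replaceAll(
                (id, site) -> site.equals(failedSite) ? secondarySite : site);
    }

    public static void main(String[] args) {
        SiteAwareRouter router = new SiteAwareRouter();
        System.out.println(router.route("call-a", "site-1")); // site-1
        router.failover("site-1", "site-2");
        System.out.println(router.route("call-a", "site-1")); // site-2
    }
}
```

Because the redirect stays in place until an administrator changes it, a restored site does not start receiving traffic on its own, so calls are not partitioned between the two sites.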

During a failover to a secondary site, some calls may be dropped. This can occur because Converged Application Server generally queues call state data for site replication only at SIP dialog boundaries. Failures that occur before the data is written to the queue result in the loss of the call state data that had not yet been queued.

Also, Converged Application Server replicates call state data across sites only when a SIP dialog boundary changes the call state. If a long-running call exists on the primary site before the secondary site is started, and the call state remains unmodified, that call's data is not replicated to the secondary site. Should a failure occur before a long-running call state has been replicated, the call is lost during failover.

When planning for the capacity of a Converged Application Server installation, be aware that, after a failover, a given site must be able to support all of the calls from the failed site as well as from its own geographic location. This means that all sites that are involved in a geographically-redundant configuration will operate at less than maximum capacity until a failover occurs.
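
The capacity rule above reduces to a simple check. The sketch below is illustrative only: FailoverHeadroom is a hypothetical helper, capacities are expressed in calls per second, and it assumes that at most one backed-up site fails at a time, which is an assumption of this example rather than a statement from the product documentation.

```java
// Hypothetical capacity check for a site that acts as a secondary.
final class FailoverHeadroom {

    /**
     * True if the site can carry its own peak load plus the peak load of
     * the largest site it backs up (single-failure assumption).
     */
    static boolean canAbsorb(double capacityCps, double ownPeakCps,
                             double[] backedUpPeaksCps) {
        double worstCase = 0.0;
        for (double peak : backedUpPeaksCps) {
            worstCase = Math.max(worstCase, peak);
        }
        return ownPeakCps + worstCase <= capacityCps;
    }

    /** Fraction of capacity used before any failover occurs. */
    static double steadyStateUtilization(double capacityCps, double ownPeakCps) {
        return ownPeakCps / capacityCps;
    }

    public static void main(String[] args) {
        // Two sites backing each other up, each with a 250 CPS peak.
        System.out.println(canAbsorb(500, 250, new double[] {250})); // true
        System.out.println(steadyStateUtilization(500, 250));        // 0.5
    }
}
```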

Steps for Configuring Geographic Persistence

In order to use the Converged Application Server geographic persistence features, you must perform certain configuration tasks on both the primary "home" site and on the secondary replication site.

Table 10-2 Steps for Configuring Geographic Persistence

Steps for Primary "Home" Site:

Install Converged Application Server software and create replicated domain.
Enable RDBMS storage for long-lived call states (recommended).
Configure JMS Servers and modules required for replicating data.
Configure persistence options to: define the unique regional site ID; identify the secondary site's URL; and enable replication hints.
Optionally configure cross domain security settings.

Steps for Secondary "Replication" Site:

Install Converged Application Server software and create replicated domain. For information on best practices, see "Integration and Multi-Domain Best Practices" in Oracle Fusion Middleware Administering JMS Resources for Oracle WebLogic Server.
Enable RDBMS storage for long-lived call states (recommended).
Configure JMS Servers and modules required for replicating data.
Configure persistence options to define the unique regional site ID.
Optionally configure cross domain security settings.

Note:

In most production deployments, two sites will perform replication services for each other, so you will generally configure each installation as both a primary and secondary site.

Follow the instructions in "Configuring Geographic Redundancy" to create the resources.

Configuring Geographic Redundancy

If you have an existing replicated Converged Application Server installation, or pair of installations, you must manually create the JMS and JDBC resources required for enabling geographic redundancy. You must also configure each site to perform replication. The steps to enable geographic redundancy are:

Configure JDBC Resources. Oracle recommends configuring both the primary and secondary sites to store long-lived call state data in an RDBMS.
Configure Persistence Options. Persistence options must be configured on both the primary and secondary sites to enable engine tier hints to write to an RDBMS or to replicate data to a geographically-redundant installation.
Configure JMS Resources. Both the primary and secondary sites must have available JMS Servers and specific JMS module resources in order to replicate call state data between sites.
Optionally, configure cross domain security for both primary and secondary sites.

The sections that follow describe each step in detail.

Configuring JDBC Resources (Primary and Secondary Sites)

Follow the instructions in "Storing Long-Lived Call State Data in an RDBMS" to configure the JDBC resources required for storing long-lived call states in an RDBMS.

Configuring Persistence Options (Primary Site Only)

The primary site must configure the correct persistence settings in order to enable replication for geographic redundancy. Follow these steps to configure persistence:

Use your browser to access the URL http://address:port/console where address is the Administration Server's listen address and port is the listen port.

Note:
The default administration console port for Converged Application Server is 7001.
If your domain is running in Production mode, click Lock & Edit.
Select the SipServer node in the left pane. The right pane of the console provides two levels of tabbed pages that are used for configuring and monitoring Converged Application Server.
Select Configuration, then select the Persistence tab in the right pane.
Configure the Persistence attributes as follows:
- DB Enabled: Check to enable call states to be stored in an RDBMS. For information on configuring RDBMS call state storage, see "Storing Long-Lived Call State Data in an RDBMS".
- Geo Enabled: Check to enable geographic redundancy.
- Default Handling: Select "all" to persist long-lived call state data to an RDBMS and to replicate data to an external site for geographic redundancy (recommended). If your installation does not store call state data in an RDBMS, select "geo" instead of "all."
- Geo Site ID: Enter a unique number from 1 to 9 to distinguish this site from all other configured sites. Note that the site ID of 0 is reserved to indicate call states that are local to the site in question (call states not replicated from another site).
- Geo Remote T3 URL: This setting is deprecated. Leave it blank.
If your domain is running in Production mode, click Activate Changes.

Configuring JMS Resources (Primary Site Only)

Follow these steps to configure JMS resources for the primary site only:

Expand Services, then expand Messaging, and then select the JMS Servers node in the left pane.
Click New in the right pane.
Enter a unique name for the JMS Server or accept the default name. If you have configured a persistent store, select it from the drop-down list adjacent to Persistent Store. Click Next to continue.
In the Target list, select the name of the engine cluster in the installation. Click Finish to create the server.
Select Services in the left pane, expand Messaging and select JMS Modules.
In the JMS Modules table, click New and enter a Name for the new JMS Module, for example geo-redundancy.
Click Next.
Select all the servers in the cluster, and click Next.
Check Would you like to add resources to this JMS system module and click Finish.
In the Summary of Resources table, click New.
Select the Connection Factory resource type and click Next.
Enter a Name for the connection factory, and enter wlss.callstate.backup.site.connection.factory as the JNDI Name, and click Finish.
In the Summary of Resources table, click New.
Select the Foreign Server resource type and click Next.

Note:
ForeignServer-0 must be targeted to all servers in the engine cluster.
In the Summary of Resources table select the foreign server you just created.
In the General tab enter a JNDI Connection URL, for either a single server, for example, t3://site-2-admin:7001, or for a cluster, for example, t3://site-2-engine1:8001,site-2-engine2:8051 and click Save.
In the Destinations tab, click New.
Enter a Name for the foreign destination, and enter wlss.callstate.backup.site.peer.queue for the Local JNDI Name, and wlss.callstate.backup.site.queue for the Remote JNDI Name.
Click OK.
In the Connection Factories tab, click New.
Enter a Name for the foreign connection factory, and enter wlss.callstate.backup.site.peer.connection.factory for the Local JNDI Name, and wlss.callstate.backup.site.connection.factory for the Remote JNDI Name.
Click OK.
Click Save to save your configuration changes.
Click New to create another JMS resource.
Select the Distributed Queue option and click Next.
Fill in the Name field of the Create a new JMS System Module Resource by entering a descriptive name for the resource, such as DistributedQueue-Callstate.
JNDI Name: Enter the name wlss.callstate.backup.site.queue.
Click Next to continue.
Select the Unrestricted value for the Client ID Policy option.
Click Finish to save the new resource.
If your domain is running in Production mode, click Activate Changes.
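
The JNDI names and connection URL used in the steps above can be collected in one place for scripting or review. The following is a minimal sketch: GeoJmsNames and connectionUrl() are hypothetical helpers, while the JNDI names and the URL format are the ones given in the steps.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that gathers the names used in the console steps.
final class GeoJmsNames {
    // Names bound on the secondary site (the Remote JNDI Names).
    static final String CONNECTION_FACTORY =
            "wlss.callstate.backup.site.connection.factory";
    static final String QUEUE = "wlss.callstate.backup.site.queue";

    // Local JNDI Names configured under the primary site's Foreign Server.
    static final String PEER_CONNECTION_FACTORY =
            "wlss.callstate.backup.site.peer.connection.factory";
    static final String PEER_QUEUE = "wlss.callstate.backup.site.peer.queue";

    /** Builds a JNDI Connection URL from one or more host:port entries. */
    static String connectionUrl(List<String> hostPorts) {
        if (hostPorts.isEmpty()) {
            throw new IllegalArgumentException("at least one host:port is required");
        }
        return "t3://" + String.join(",", hostPorts);
    }

    public static void main(String[] args) {
        System.out.println(connectionUrl(
                Arrays.asList("site-2-engine1:8001", "site-2-engine2:8051")));
    }
}
```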

Configuring Persistence Options (Secondary Sites)

The secondary site must configure the correct persistence settings in order to enable replication for geographic redundancy. Follow these steps to configure persistence:

Use your browser to access the URL http://address:port/console where address is the Administration Server's listen address and port is the listen port.

Note:
The default administration console port for Converged Application Server is 7001.
If your domain is running in Production mode, click Lock & Edit.
Select the SipServer node in the left pane. The right pane of the console provides two levels of tabbed pages that are used for configuring and monitoring Converged Application Server.
Select Configuration, then select the Persistence tab in the right pane.
Configure the Persistence attributes as follows:
- DB Enabled: Check to enable call states to be stored in an RDBMS. For information on configuring RDBMS call state storage, see "Storing Long-Lived Call State Data in an RDBMS".
- Geo Enabled: Check to enable geographic redundancy.
- Default Handling: Select "all" to persist long-lived call state data to an RDBMS and to replicate data to an external site for geographic redundancy (recommended). If your installation does not store call state data in an RDBMS, select "geo" instead of "all."
- Geo Site ID: Enter a unique number from 1 to 9 to distinguish this site from all other configured sites. Note that the site ID of 0 is reserved to indicate call states that are local to the site in question (call states not replicated from another site).
- Geo Remote T3 URL: This setting is deprecated. Leave it blank.
If your domain is running in Production mode, click Activate Changes.

Configuring JMS Resources (Secondary Site Only)

Any site that replicates call state data from another site must configure certain required JMS resources. The resources are not required for sites that do not replicate data from another site.

Follow these steps to configure JMS resources:

Use your browser to access the URL http://address:port/console where address is the Administration Server's listen address and port is the listen port.

Note:
The default administration console port for Converged Application Server is 7001.
If your domain is running in Production mode, click Lock & Edit.
Expand Services, then expand Messaging, and then select the JMS Servers node in the left pane.
Click New in the right pane.
Enter a unique name for the JMS Server or accept the default name. If you have configured a persistent store, select it from the drop-down list adjacent to Persistent Store. Click Next to continue.
In the Target list, select the name of the engine cluster in the installation. Click Finish to create the new Server.
Expand Services, then expand Messaging, and then select the JMS Modules node in the left pane.
Click New in the right pane.
Fill in the fields of the Create JMS System Module page as follows:
- Name: Enter a name for the new module, or accept the default name.
- Descriptor File Name: Enter the prefix of a configuration file name in which to store the JMS module configuration (for example, systemmodule-callstate).
- Location In Domain: Enter a location to store the System Module description file relative to your domain's JMS configuration sub directory.
Click Next to continue.
Choose the option All servers in the cluster in the Clusters pane.
Click Next to continue.
Select Would you like to add resources to this JMS system module and click Finish to create the module.
In the Summary of Resources table, click New to add a new resource to the module.
Select the Connection Factory option and click Next.
Fill in the fields of the Create a new JMS System Module Resource as follows:
- Name: Enter a descriptive name for the resource, such as ConnectionFactory-Callstate.
- JNDI Name: Enter the name wlss.callstate.backup.site.connection.factory.
Select the Unrestricted value for the Client ID Policy option.
Click Next to continue.
Click Finish to save the new resource.
Select the name of the connection factory resource you just created in the JMS Modules table.
Select Configuration, then select the Load Balance tab in the right pane.
De-select the Server Affinity Enabled option, and click Save.
Re-expand Services, then expand Messaging, and then select the JMS Modules node in the left pane.
Select the name of the JMS module you created in the right pane.
Click New to create another JMS resource.
Select the Distributed Queue option and click Next.
Fill in the Name field of the Create a new JMS System Module Resource by entering a descriptive name for the resource, such as DistributedQueue-Callstate.
JNDI Name: Enter the name wlss.callstate.backup.site.queue.
Click Next to continue.
Click Finish to save the new resource.
If your domain is running in Production mode, click Activate Changes.

Configuring Cross Domain Security (Both Primary and Secondary Sites)

Oracle recommends, depending upon your requirements, that you enable cross domain security between your geographically redundant sites.

For information on cross domain security concepts and configuration details, refer to the following documents:

"Introduction and Roadmap" in Securing a Production Environment for Oracle WebLogic Server
"Overview of WebLogic Server Security Administration" in Administering Security for Oracle WebLogic Server
"Integration and Multi-Domain Best Practices" in Administering JMS Resources for Oracle WebLogic Server
"Cross Domain Security" in Developing JTA Applications for Oracle WebLogic Server
"Important Information Regarding Cross-Domain Security Support" in Administering Security for Oracle WebLogic Server
"Simplified Access to Foreign JMS Providers" in Developing JMS Applications for Oracle WebLogic Server
"Configuring Foreign Server Resources to Access Third-Party JMS Providers" in Administering JMS Resources for Oracle WebLogic Server

Understanding Geo-Redundant Replication Behavior

This section provides more detail on how multiple sites replicate call state data. Administrators can use this information to better understand the mechanics of geo-redundant replication and to better troubleshoot any problems that may occur in such a configuration. Note, however, that the internal workings of replication across Converged Application Server installations are subject to change in future releases of the product.

Call State Replication Process

When a call is initiated on a primary Converged Application Server site, call setup and processing occurs normally. When a SIP dialog boundary is reached, the call is replicated (in-memory) to the site's Coherence cache, and becomes eligible for replication to a secondary site. Converged Application Server may choose to aggregate multiple call states for replication in order to optimize network usage.

A single engine in the Coherence cache then places the call state data to be replicated on a JMS queue configured on the replica site. Data is transmitted to one of the available engines (referenced in the Foreign Server resource configuration specified for the primary site) in a round-robin fashion. Engines at the secondary site monitor their local queue for new messages.

Upon receiving a message, an engine on the secondary site persists the call state data and assigns it the site ID value of the primary site. The site ID distinguishes replicated call state data on the secondary site from any other call state data actively managed by the secondary site. Timers in replicated call state data remain dormant on the secondary site, so that timer processing does not become a bottleneck to performance.

Call State Processing After Failover

To perform a failover, the Administrator must change a global load balancer policy to begin routing calls from the primary, failed site to the secondary site. After this process is completed, the secondary site begins processing requests for the backed-up call state data. When a request is made for data that has been replicated from the failed site, the engine retrieves the data and activates the call state, taking ownership for the call. The activation process involves:

Setting the site ID associated with the call to zero (making it appear local).
Activating all dormant timers present in the call state.
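
The lifecycle of a replicated call state on the secondary site can be modeled as follows. This is a simplified in-memory sketch, not the product's implementation: ReplicatedCallState and SecondarySiteModel are hypothetical classes, and the geoSiteId field only mirrors the meaning of the site ID described in this chapter (zero for local, non-zero for replicated).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a call state held on the secondary site.
final class ReplicatedCallState {
    int geoSiteId;        // 0 = local; non-zero = replicated from that site
    boolean timersActive; // replicated timers stay dormant until activation

    ReplicatedCallState(int geoSiteId, boolean timersActive) {
        this.geoSiteId = geoSiteId;
        this.timersActive = timersActive;
    }
}

final class SecondarySiteModel {
    private final Map<String, ReplicatedCallState> store = new HashMap<>();

    /** A replication message arrives: persist with the primary's site ID. */
    void onReplicationMessage(String callId, int primarySiteId) {
        store.put(callId, new ReplicatedCallState(primarySiteId, false));
    }

    /** Returns the stored state without activating it. */
    ReplicatedCallState peek(String callId) {
        return store.get(callId);
    }

    /** A request arrives after failover: activate the call state on demand. */
    ReplicatedCallState onRequest(String callId) {
        ReplicatedCallState state = store.get(callId);
        if (state != null && state.geoSiteId != 0) {
            state.geoSiteId = 0;       // make the call appear local
            state.timersActive = true; // activate dormant timers
        }
        return state;
    }

    public static void main(String[] args) {
        SecondarySiteModel site = new SecondarySiteModel();
        site.onReplicationMessage("call-1", 1);
        System.out.println(site.peek("call-1").geoSiteId);      // 1
        System.out.println(site.onRequest("call-1").geoSiteId); // 0
    }
}
```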

By default, call states are activated only for individual calls, and only after those calls are requested on the backup site. SipServerRuntimeMBean includes a method, activateBackup(byte site), that can be used to force a site to take over all call state data that it has replicated from another site. The Administrator can execute this method using a WLST configuration script. Alternatively, an application deployed on the server can detect when a request for replicated site data occurs, and then execute the method. Example 10-1 shows sample code from a JSP that activates a secondary site, changing ownership of all call state data replicated from site 1. Similar code could be used within a deployed Servlet. Note that either a JSP or Servlet must run as a privileged user in order to execute the activateBackup method.

To detect whether a particular request involves replicated call state data, Servlets can use the WlssSipApplicationSession.getGeoSiteId() method to examine the site ID associated with a call. Any non-zero value for the site ID indicates that the Servlet is working with call state data that was replicated from another site.

Example 10-1 Activating a Secondary Site Using JMX

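<%
    // Site ID of the failed site whose replicated call states are taken over.
    byte site = 1;

    // Look up the local runtime MBean server through JNDI.
    InitialContext ctx = new InitialContext();
    MBeanServer server = (MBeanServer) ctx.lookup("java:comp/env/jmx/runtime");

    // Find the SipServerRuntime MBean registered on this server.
    Set<ObjectInstance> set = server.queryMBeans(new ObjectName("*:*,Type=SipServerRuntime"), null);
    if (set.isEmpty()) {
      throw new IllegalStateException("No SipServerRuntime MBeans found");
    }

    // Create a typed proxy for the MBean.
    ObjectInstance oi = set.iterator().next();
    SipServerRuntimeMBean bean = (SipServerRuntimeMBean)
      MBeanServerInvocationHandler.newProxyInstance(server,
        oi.getObjectName());

    // Take ownership of all call state data replicated from the given site.
    bean.activateBackup(site);
%>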

Note that after a failover, the load balancer must route all calls having the same callId to the newly-activated site. Even if the original, failed site is restored to service, the load balancer must not partition calls between the two geographical sites.

Removing Backup Call States

You may also choose to stop replicating call states to a remote site in order to perform maintenance on the remote site or to change the backup site entirely. Replication can be stopped by setting the Default Handling attribute to "none" on the primary site as described in "Configuring Persistence Options (Primary Site Only)".

After disabling geographic replication on the primary site, you also may want to remove backup call states on the secondary site. SipServerRuntimeMBean includes a method, deleteBackup(byte site), that can be used to force a site to remove all call state data that it has replicated from another site. The Administrator can execute this method using a WLST configuration script or via an application deployed on the secondary site. The steps for executing this method are similar to those for using the activateBackup method, described in "Call State Processing After Failover".

Monitoring Replication Across Regional Sites

To monitor replication across regional sites, administrators must examine WebLogic behavior using a combination of WebLogic JMS and Coherence cache statistics.

Troubleshooting Replication

Administrators should monitor any SNMP traps that indicate failed database writes on a secondary site installation.

Administrators must also ensure that all sites participating in geographically-redundant configurations use unique site IDs.
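
A quick consistency check of planned site IDs can catch duplicates before they are entered in the console. The following is a minimal sketch, assuming you keep your own list of domain names and their Geo Site ID values: SiteIdCheck is a hypothetical helper, and the 1 to 9 range and the reserved value 0 come from the Geo Site ID description earlier in this chapter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-deployment check of Geo Site ID assignments.
final class SiteIdCheck {

    /** Returns one message per out-of-range or duplicated site ID. */
    static List<String> problems(Map<String, Integer> siteIdByDomain) {
        List<String> problems = new ArrayList<>();
        Map<Integer, String> firstUser = new HashMap<>();
        for (Map.Entry<String, Integer> entry : siteIdByDomain.entrySet()) {
            int id = entry.getValue();
            if (id < 1 || id > 9) {
                // 0 is reserved for call states that are local to a site.
                problems.add(entry.getKey() + ": site ID " + id + " is outside 1-9");
            }
            String other = firstUser.putIfAbsent(id, entry.getKey());
            if (other != null) {
                problems.add(entry.getKey() + ": site ID " + id
                        + " is already used by " + other);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Integer> sites = new LinkedHashMap<>();
        sites.put("east", 1);
        sites.put("west", 1); // duplicate of east
        problems(sites).forEach(System.out::println);
    }
}
```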