Technical Product Description


WebLogic SIP Server Cluster Architecture

The following sections describe the WebLogic SIP Server cluster architecture:


Overview of the Cluster Architecture

WebLogic SIP Server provides a multi-tier cluster architecture in which a stateless engine tier processes all traffic and distributes all transaction and session state to a data tier. The data tier consists of one or more partitions, labeled “Data Nodes” in Figure 4-1 below. Each partition may contain one or more replicas of all state assigned to it, and its replicas may be distributed across multiple physical servers or server blades. A standard load balancing appliance distributes traffic across the engines in the cluster. The load balancer need not be SIP-aware; there is no requirement that it maintain affinity between engines and SIP dialogs or transactions. However, a SIP-aware load balancer can provide higher performance by maintaining a client’s affinity to a particular engine tier server.

Figure 4-1 Example WebLogic SIP Server Cluster


In some cases it is advantageous to run data tier instances and engine tier instances on the same physical host (as shown in Figure 4-1). This is particularly true when the physical servers or server blades in the cluster are based on Symmetric Multi-Processing (SMP) architectures, as is now common for platforms such as Advanced Telecom Computing Architecture (ATCA). This arrangement is not required, however; it is entirely possible to place each data tier and engine instance on a different physical server or server blade.

There is no architectural limit to the number of engines, partitions, or physical servers within a cluster, and there is no fixed ratio of engines to partitions. When dimensioning the cluster, however, consider factors such as the typical amount of memory required to store the state for a given session and the increased overhead of maintaining more than two replicas within a partition.


WebLogic SIP Server Cluster Linear Scalability

WebLogic SIP Server has demonstrated linear scalability from 2 to 16 hosts (up to 32 CPUs) in laboratory and field tests and in commercial deployments. This characteristic is expected to hold in larger clusters as well, up to the capacity of the cluster interconnect (or of the load balancer) to carry the total traffic volume. Gigabit Ethernet is recommended as the minimum for the cluster interconnect.


WebLogic SIP Server Replication

The WebLogic SIP Server data tier is an in-memory, peer-replicated store. The store also functions as a lock manager, whereby call state access follows a simple “library book” model (a call state can only be checked out by one SIP engine at a time).

The nodes in the data tier are called replicas. To increase the capacity of the data tier, the data is split evenly across a set of partitions. Each partition has a set of 1-8 replicas which maintain a consistent state. (BEA recommends using no more than 3 replicas per partition.) The number of replicas in the partition is the replication factor.

Replicas can join and leave the partition. Any given replica serves in exactly one partition at a time. The total available call state storage capacity of the cluster is determined by the capacity of each partition.
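Because every replica in a partition holds a full copy of that partition's state, adding replicas increases redundancy rather than capacity; capacity grows with the number of partitions. A back-of-the-envelope sizing sketch in Python (all figures are hypothetical, not product recommendations):

```python
# Hypothetical sizing figures, for illustration only.
partitions = 4
replicas_per_partition = 2      # replication factor
per_replica_memory_mb = 4096    # memory available for call state on one replica
avg_call_state_kb = 8

# A partition can hold only what a single replica can hold,
# because every replica stores the same data.
calls_per_partition = per_replica_memory_mb * 1024 // avg_call_state_kb
total_calls = partitions * calls_per_partition

# Total memory consumed across the cluster includes every replica copy.
total_memory_mb = partitions * replicas_per_partition * per_replica_memory_mb
```

With these figures, each partition holds about 524,288 call states, so doubling the replication factor doubles memory consumption without adding any call-state capacity.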

Figure 4-2 WebLogic SIP Server State Replication


The call state store is peer-replicated: clients perform all operations (reads and writes) against every replica in a partition. Peer replication stands in contrast to the more common primary-secondary replication architecture, in which one node acts as the primary, all other nodes act as secondaries, and clients talk directly only to the current primary node. Peer replication is roughly equivalent to synchronous primary-secondary replication with respect to failover characteristics, but peer replication has lower average latency during normal operations. The lower latency is achieved because the system does not have to wait for the synchronous second hop incurred with primary-secondary replication.

Peer replication also provides better failover characteristics than asynchronous primary-secondary systems because there is no change propagation delay.
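The latency argument can be made concrete with a toy model: in peer replication the client writes to all replicas in parallel (one network hop), while synchronous primary-secondary replication adds a second hop from the primary to its secondaries. A sketch in Python with hypothetical round-trip times:

```python
def peer_write_latency(replica_rtts):
    """Client writes to all replicas in parallel: the write completes
    when the slowest single hop completes."""
    return max(replica_rtts)

def primary_secondary_write_latency(client_to_primary, primary_to_secondaries):
    """Client writes to the primary, which synchronously forwards the
    change to its secondaries: a second hop is added to the latency."""
    return client_to_primary + max(primary_to_secondaries)

# With equal per-hop costs, peer replication saves the entire second hop.
peer = peer_write_latency([2, 3])                       # 3 time units
primary = primary_secondary_write_latency(2, [3])       # 5 time units
```

The numbers are arbitrary; the point is only that the peer model's write latency is bounded by one hop rather than two.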

Every replica supports three operations during normal processing: “lock and get call state,” “put and unlock call state,” and “lock and get call states with expired timers.” The typical message processing flow is simple:

  1. Lock and get the call state.
  2. Process the message.
  3. Put and unlock the call state.

Additional management functions deal with bootstrapping, registration, and failure cases.

Partition Views

The current set of replicas in a partition is referred to as the partition view. Each view carries a monotonically increasing ID number. A view change signals that a replica has either joined or left the partition. View changes are delivered to engines when they perform an operation against the data tier.

When faced with a view change, an engine performing a lock/get operation must immediately retry the operation against the new view; each SIP engine schedules a 10 ms interval for this retry. In the case of a view change on a put request, the new view is inspected for added replicas (the case in which the view change derives from a replica joining rather than failing or shutting down). If a replica has been added, it also receives the put request, to ensure consistency.
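The retry rule for lock/get operations can be sketched as follows (illustrative Python; `do_lock_get` and `get_current_view` are hypothetical callbacks standing in for data tier operations):

```python
import time

class ViewChangedError(Exception):
    """Raised when a replica rejects an operation issued against a stale view."""

def lock_get_with_view_retry(do_lock_get, get_current_view, retry_interval=0.010):
    """Retry a lock/get against the newest partition view after a view change.
    The ~10 ms retry interval mirrors the behavior described above."""
    view = get_current_view()
    while True:
        try:
            return do_lock_get(view)
        except ViewChangedError:
            time.sleep(retry_interval)   # engine schedules a short retry
            view = get_current_view()    # retry against the new view
```

In this sketch the loop retries indefinitely; a real client would also bound retries and handle the “partition dead” case.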

Timer Processing

An additional function of the data tier is timer processing. Replicas set timers for call states when engines perform put operations. Engines then poll for and “check out” expired timers for processing. Should an engine fail at this point, the replica detects the failure, and the set of checked-out timers is forcibly checked back in and rescheduled so that another engine can check them out and process them.

As an optimization, if a given call state contains only the timers required for cleaning up the call state, the data tier itself expires those timers. In this special case, the call state is not returned to the engine tier for further processing, because the operation can be completed wholly within the data tier.
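Timer check-out and the forced check-in that follows an engine failure can be modeled as follows (an illustrative Python sketch, not the product implementation):

```python
class TimerStore:
    """Sketch of data-tier timer handling: engines check out expired timers,
    and timers held by a failed engine are forcibly checked back in."""

    def __init__(self):
        self.timers = {}        # timer_id -> fire_time
        self.checked_out = {}   # timer_id -> engine_id currently processing it

    def set_timer(self, timer_id, fire_time):
        self.timers[timer_id] = fire_time

    def checkout_expired(self, engine_id, now):
        """Return expired timers not already checked out, marking them
        as owned by the polling engine."""
        expired = [t for t, fires in self.timers.items()
                   if fires <= now and t not in self.checked_out]
        for t in expired:
            self.checked_out[t] = engine_id
        return expired

    def engine_failed(self, engine_id):
        """Forcibly check the failed engine's timers back in so another
        engine can pick them up on its next poll."""
        for t, owner in list(self.checked_out.items()):
            if owner == engine_id:
                del self.checked_out[t]
```

A real store would also reschedule or delete timers once processing completes; the sketch shows only the check-out/forced-check-in cycle.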

Replica Failure

SIP engine nodes, as clients of the data tier, perform failure detection for replicas and for failed network connections to replicas.

During the course of message processing, an engine communicates with each replica in the current partition view. Normally all operations succeed, but occasionally a failure (a dropped socket or an invocation timeout) is detected. When a failure is detected, the engine sends a “replica died” message to any of the remaining live replicas in the partition. (If no live replica remains, the partition is declared “dead,” and engines cannot process calls that hash to that partition until the partition is restored.) The replica that receives the failed-replica notification proposes a new partition view that excludes the reportedly dead replica. All clients then receive notification of the view change (see Partition Views).
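The failure-reporting flow can be sketched as follows (illustrative Python; in the product, the proposing node is a surviving replica, and clients learn of the new view on their next data tier operation):

```python
class Partition:
    """Sketch of replica-failure handling: a surviving replica proposes a
    new view that excludes the reportedly dead replica."""

    def __init__(self, replicas):
        self.view_id = 1
        self.replicas = list(replicas)

    def report_replica_died(self, dead_replica):
        live = [r for r in self.replicas if r != dead_replica]
        if not live:
            # No live replica remains: the partition is "dead" until restored.
            raise RuntimeError("partition dead: no live replicas remain")
        self.view_id += 1          # view IDs only increase
        self.replicas = live
        return self.view_id, live  # the new view, delivered to all clients
```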

To handle partitioned-network scenarios, in which one client cannot reach a supposedly failed replica but another replica can, the “good” replica takes the reportedly failed replica offline, ensuring safe operation in the face of a network partition.

Engine Failure

The major concerns with engine failure are that a failed engine:

  1. May be in the middle of a lock/get or put/unlock operation when it fails;
  2. May fail to unlock the call states for messages it is currently processing; and
  3. May abandon the set of timers it is currently processing.

Replicas are responsible for detecting engine failure. If an engine fails during a lock/get or put/unlock operation, there is a risk of lock-state inconsistency between the replicas (and, in the put/unlock case only, data inconsistency). To handle this, a replica breaks the lock on a call state if another engine requests it and the current lock owner is deemed dead. This allows processing of that call state to make progress.

Additionally, to deal with possible data inconsistency in scenarios where locks had to be broken, the call state is marked “possibly stale.” When an engine evaluates the responses to a lock/get operation, it chooses the best data available: if any replica reports that it has non-stale data, that data is used. Otherwise, the “possibly stale” data is used (it is actually stale only if the single replica holding the non-stale version died in the intervening period).
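The selection rule for lock/get responses amounts to a few lines of illustrative Python (each response pairs the returned state with its “possibly stale” flag):

```python
def choose_best_call_state(responses):
    """Prefer any replica reporting non-stale data; otherwise fall back to
    the 'possibly stale' copy. `responses` is a non-empty list of
    (state, possibly_stale) pairs, one per replica."""
    for state, possibly_stale in responses:
        if not possibly_stale:
            return state        # a definitely-current copy wins
    # Every copy is marked possibly stale; use one anyway.
    return responses[0][0]
```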

Effects of Failures on Call Flows

Because of the automatic failure recovery built into the replicated store design, failures do not affect call flows unless they are of sufficient duration or magnitude.

Figure 4-3 Replication Example Call Flow 1


In some cases, failure recovery causes “blips” in the system: while engines cope with view changes, message processing temporarily backs up. This is usually not dangerous, but it may cause UAC or UAS retransmissions if the backlog is substantial.

Figure 4-4 Replication Example Call Flow 2


Catastrophic failure of a partition (in which no replica remains) leaves a fraction of the cluster unable to process messages. If there are four partitions and one is lost, 25% of messages will be rejected. The situation resolves once any replica is returned to service for that partition.
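The arithmetic follows from call states hashing across partitions: losing one of N partitions strands roughly 1/N of the traffic. An illustrative sketch (the product's actual hashing scheme is internal; CRC-32 is used here only to make the example deterministic):

```python
import zlib

def partition_for_call(call_id, num_partitions):
    """Map a call ID to a partition. Deterministic stand-in hash for
    illustration; not the product's hashing scheme."""
    return zlib.crc32(call_id.encode()) % num_partitions

# With 4 partitions and one partition dead, calls hashing to the dead
# partition are rejected until a replica is returned to service.
dead_partition = 0
rejected = sum(1 for i in range(1000)
               if partition_for_call(f"call-{i}", 4) == dead_partition)
```

Over many calls, `rejected` approaches a quarter of the total, matching the 25% figure above.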


Diameter Protocol Handling

A WebLogic SIP Server domain may optionally deploy support for the Diameter base protocol and IMS Sh interface provider on engine tier servers, which then act as Diameter Sh client nodes. SIP Servlets deployed on the engines can use the profile service API to initiate requests for user profile data, or to subscribe to and receive notification of profile data changes. The Sh interface is also used to communicate between multiple IMS Application Servers.

One or more server instances may also be configured as Diameter relay agents, which route Diameter messages from the client nodes to a configured Home Subscriber Server (HSS) in the network but do not modify the messages. BEA recommends configuring one or more servers to act as relay agents in a domain. The relays simplify the configuration of Diameter client nodes and reduce the number of network connections to the HSS. Using at least two relays ensures that a route can be established to an HSS even if one relay agent fails.

The relay agents included in WebLogic SIP Server perform only stateless proxying of Diameter messages; messages are not cached or otherwise processed before delivery to the HSS.

Note: To support multiple HSSs, 3GPP defines the Dh interface for looking up the correct HSS. WebLogic SIP Server does not provide a Dh interface application and can be configured with only a single HSS.

Note that relay agent servers do not function as either engine or data tier instances: they should not host applications, store call state data, maintain SIP timers, or use SIP protocol network resources (sip or sips network channels).

WebLogic SIP Server also provides a simple HSS simulator that you can use for testing Sh client applications. You can configure a WebLogic SIP Server instance to function as an HSS simulator by deploying the appropriate application.

Figure 4-5 WebLogic SIP Server Diameter Domain



Deployment of WebLogic SIP Server in Non-Clustered Configurations

WebLogic SIP Server may be deployed in non-clustered configurations when session retention is not required. The SIP signaling throughput of individual WebLogic SIP Server instances is substantially higher in this mode because the computing overhead of the clustering mechanism is eliminated. Non-clustered configurations are appropriate for development environments, or for deployments in which all services are stateless or session retention is not considered important to the user experience (that is, users are not disturbed by the failure of established sessions).

Note that deploying in a non-clustered configuration has no impact on product licensing and does not affect license capacity. It may, however, reduce the amount of hardware required to handle a given SIP signaling traffic load.


“Zero Downtime” Application Upgrades

With WebLogic SIP Server, you can upgrade a deployed SIP application to a newer version without losing existing calls being processed by the application. This type of application upgrade is accomplished by deploying the newer application version alongside the older version. WebLogic SIP Server automatically manages the SIP Servlet mapping so that new requests are directed to the new version. Subsequent messages for older, established dialogs are directed to the older application version until the calls complete. After all of the older dialogs have completed and the earlier version of the application is no longer processing calls, you can safely un-deploy it.
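The routing behavior during an upgrade can be modeled as follows (illustrative Python; class and method names are hypothetical, not the product API):

```python
class AppVersionRouter:
    """Sketch of upgrade routing: new dialogs go to the newest deployed
    application version; messages for established dialogs stick to the
    version that created them until the call completes."""

    def __init__(self, versions):
        self.versions = list(versions)   # deployed versions; last is newest
        self.dialog_version = {}         # dialog_id -> pinned version

    def deploy(self, version):
        self.versions.append(version)    # newer version deployed alongside

    def route(self, dialog_id, is_new_dialog):
        if is_new_dialog:
            self.dialog_version[dialog_id] = self.versions[-1]
        return self.dialog_version[dialog_id]

    def call_completed(self, dialog_id):
        self.dialog_version.pop(dialog_id, None)

    def can_undeploy(self, version):
        """An older version can be undeployed safely once no dialog
        remains pinned to it."""
        return version not in self.dialog_version.values()
```

Rollback falls out of the same model: undeploying the newest version simply makes the previous one the target for new dialogs again.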

WebLogic SIP Server's upgrade feature ensures that no calls are dropped during the upgrade of a production application. The upgrade process also enables you to roll back an upgrade: if, for example, you determine that there is a problem with the newer version of the deployed application, you can simply un-deploy the newer version. WebLogic SIP Server then automatically directs all new requests to the older application version.

Requirements and Restrictions for Upgrading Deployed Applications

To use the application upgrade functionality of WebLogic SIP Server:

WebLogic SIP Server also provides the ability for Administrators to upgrade the SIP Servlet container, JVM, or application on a cluster-wide basis without affecting existing SIP traffic. This is accomplished by creating multiple clusters and having WebLogic SIP Server automatically forward requests during the upgrade process. See Upgrading Software and Converged Applications in the Operations Guide.
