9 Configuring Server Failure Detection

This chapter describes how to configure Oracle Communications WebRTC Session Controller to improve failover performance when a server becomes physically disconnected from the network.

Overview of Failover Detection

To achieve a highly available production system, the WebRTC Session Controller uses the Oracle Coherence distributed cache service to read and write call-state data. The cache service consists of a number of partitions that are spread across the servers running in the cluster. Each partition has a primary copy of call-state storage assigned to one server in the cluster, and a backup copy assigned to another server in the cluster. As a result, the call state required to process a request may reside on a remote server, and possibly on a remote machine.

The WebRTC Session Controller architecture depends on the Coherence cache service to detect when a server has failed or become disconnected. When an engine cannot read or write call-state data because a server is unavailable, the Coherence cache service detects the failure, reassigns the lost server's partitions to another server in the cluster, and ensures that a new backup copy is made available on a different server, if one is running.

Coherence Cluster Overview

The Coherence cache service uses its own cluster communication protocol, known as Tangosol Cluster Management Protocol (TCMP), to invoke remote servers, detect server failure, and achieve high availability. This protocol uses an optimized algorithm to quickly detect that a server has become physically disconnected from the network. This algorithm, and the configuration options available to modify its behavior, are described in detail in the Oracle Coherence documentation, which also provides more information on Coherence and its distributed cache service.

See "Configuring Coherence" and "SIP Coherence Configuration Reference (coherence.xml)" for additional information on configuring Coherence for the WebRTC Session Controller.

Split-Brain Handling

The WebRTC Session Controller relies largely on Oracle Coherence to detect and handle a split-brain condition. A split-brain condition can occur, for example, when connectivity is restored between two or more parts of a cluster that had been isolated from each other. When the WebRTC Session Controller detects such a condition, it attempts to recover by shutting down part of the cluster. The affected servers are then expected to restart and join the surviving cluster as new members.

When Coherence detects a split-brain condition, its behavior is controlled primarily through the options related to death detection in the cluster-related configuration.

Coherence Configuration

You can use the following three mechanisms to modify Coherence configuration options:

  • The default Coherence cluster configuration file

  • The system properties

  • The tangosol-coherence-override.xml file

WARNING:

No servers in the domain can be running when you make changes to the Coherence configuration. Also, the configuration must be the same for all servers in the domain or unexpected behavior can result.

Cluster Configuration File

The default Coherence cluster configuration file, Custom-Default.xml, resides in the following location:

$DOMAIN_HOME/config/coherence/Coherence-Default/

where $DOMAIN_HOME is the root directory for the domain.

Table 9-1 describes the default configuration options that you can specify.

Table 9-1 Coherence Cluster Configuration File Options

Option                    Element Name                              System Property Name                        Default Value
------------------------  ----------------------------------------  ------------------------------------------  -------------
TCP-ring IP-timeout       <tcp-ring-listener><pingtimeout>          tangosol.coherence.ipmonitor.pingtimeout    5
TCP-ring IP-attempts      <tcp-ring-listener><pingattempts>         tangosol.coherence.ipmonitor.pingtattempts  2
Service Guardian Timeout  <service-guardian><timeout-milliseconds>  tangosol.coherence.guard.timeout            305000
Packet Delivery Timeout   <packet-delivery><timeout-milliseconds>   tangosol.coherence.packet.timeout           300000


You can override these default configuration options either by setting the corresponding system properties or by creating an override configuration file named tangosol-coherence-override.xml and adding it to the CLASSPATH on all servers.
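
As a rough illustration, an override file that changes the two timeout options from Table 9-1 might look like the following sketch. The timeout values are hypothetical and are not recommendations. They only keep the same 5-second margin between the two settings that the defaults have. The element nesting and namespace are assumed from the standard Coherence operational configuration schema, so verify them against the documentation for the Coherence version that ships with your installation.

```xml
<?xml version="1.0"?>
<!-- tangosol-coherence-override.xml: illustrative sketch only.
     The timeout values below are hypothetical, not recommendations. -->
<coherence xmlns="http://xmlns.oracle.com/coherence/coherence-operational-config">
  <cluster-config>
    <service-guardian>
      <!-- Service Guardian Timeout (default 305000 in Table 9-1) -->
      <timeout-milliseconds>65000</timeout-milliseconds>
    </service-guardian>
    <packet-publisher>
      <packet-delivery>
        <!-- Packet Delivery Timeout (default 300000 in Table 9-1) -->
        <timeout-milliseconds>60000</timeout-milliseconds>
      </packet-delivery>
    </packet-publisher>
  </cluster-config>
</coherence>
```

As with any Coherence configuration change, the same file would need to be present on every server in the domain, and no server should be running while it is changed.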

See the following Coherence documentation for information on which configuration options you can override and for information on how to use the override configuration option: