7 Configuring Server Failure Detection

This chapter describes how to configure Oracle Communications Converged Application Server to improve failover performance when a server becomes physically disconnected from the network.

Overview of Failover Detection

To achieve a highly available production system, the Converged Application Server uses the Oracle Coherence distributed cache service to retrieve and write call-state data. The cache service consists of a number of partitions spread across the servers running in the cluster. Each partition has a primary copy of its call-state storage assigned to one server in the cluster and a backup copy assigned to another. This means that the call state required to process a request may reside on a remote server, and possibly even on a remote machine.
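
For illustration only, the following fragment sketches how a Coherence distributed cache scheme expresses this primary/backup arrangement. The Converged Application Server defines its call-state cache internally, so the scheme and service names shown here are hypothetical and do not reflect the product's actual configuration.

<!-- Illustrative sketch of a Coherence distributed cache scheme.
     The scheme and service names are hypothetical examples. -->
<distributed-scheme>
  <scheme-name>example-call-state-scheme</scheme-name>
  <service-name>ExampleDistributedCache</service-name>
  <!-- Keep one backup copy of each partition on a different server. -->
  <backup-count>1</backup-count>
  <backing-map-scheme>
    <local-scheme/>
  </backing-map-scheme>
  <autostart>true</autostart>
</distributed-scheme>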

The Converged Application Server architecture depends on the Coherence cache service to detect when a server has failed or become disconnected. When an engine cannot access or write call-state data because a server is unavailable, the Coherence cache service detects the failure, reassigns the lost server's partitions to another server in the cluster, and ensures a new backup copy is made available on a different server, if one is running.

Coherence Cluster Overview

The Coherence cache service uses its own cluster communication protocol, known as Tangosol Cluster Management Protocol (TCMP), to invoke remote servers, detect server failure, and achieve high availability. This protocol uses an optimized algorithm to quickly detect that a server has become physically disconnected from the network. This algorithm, and the configuration options available to modify its behavior, are described in detail in the Oracle Coherence documentation.

See "Configuring Coherence" and "SIP Coherence Configuration Reference (coherence.xml)" for additional information on configuring Coherence for the Converged Application Server.

Split-Brain Handling

The Converged Application Server relies largely on Oracle Coherence to detect and handle a split-brain condition. A split-brain condition can occur, for example, when connectivity is restored between two or more parts of a cluster that had been isolated from each other.

After a split-brain failure creates two or more network partitions, the engines in each partition re-form themselves into a smaller cluster (or possibly a single server waiting to form a new cluster with newly started members).

While the network remains partitioned, each such cluster continues to operate as if the other engines had been shut down. Each cluster promotes its oldest member to senior member, which is responsible for managing the cluster state.

When the network is repaired and the clusters become aware of each other again, the senior members communicate to decide which single cluster should survive. In some situations this can take a few minutes, but it eventually resolves as follows:

  1. If one cluster is larger than all of the others, it will survive and all other engines will be shut down.

  2. If two or more equally large clusters exist that are larger than all the other clusters, the cluster with the oldest senior member will survive and all other engines will be shut down.

When Coherence detects a split-brain condition, its behavior is controlled primarily through the death-detection options in the cluster configuration. For more information, see "Configuring Death Detection" in Developing Applications with Oracle Coherence.
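
As a further, hedged illustration, Coherence's operational configuration also exposes a cluster quorum policy that can influence which members survive. The element below is standard Coherence operational configuration, but the quorum value is a hypothetical example only:

<!-- Illustrative sketch: a cluster quorum policy in the Coherence
     operational override file. The value 3 is hypothetical; size it
     to your own cluster. -->
<coherence>
  <cluster-config>
    <cluster-quorum-policy>
      <!-- Suspect members are not terminated if doing so would
           leave fewer than three cluster members. -->
      <timeout-survivor-quorum>3</timeout-survivor-quorum>
    </cluster-quorum-policy>
  </cluster-config>
</coherence>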

Coherence Configuration

You can use the following three mechanisms to modify Coherence configuration options:

  • The default Coherence cluster configuration file

  • The system properties

  • The tangosol-coherence-override.xml file

WARNING:

No servers in the domain can be running when you make changes to the Coherence configuration. Also, the configuration must be the same for all servers in the domain, or unexpected behavior can result.
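
For example, one way to use the system-property mechanism is to pass the property on the Java command line when starting each server. The startup script and server name shown below are illustrative assumptions; adapt them to your own domain:

# Illustrative only: set a Coherence system property at server startup.
# startManagedWebLogic.sh and the server name engine1 are examples.
JAVA_OPTIONS="${JAVA_OPTIONS} -Dtangosol.coherence.ipmonitor.pingtimeout=5"
export JAVA_OPTIONS
./startManagedWebLogic.sh engine1

Remember that the same value must be set on every server in the domain.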

Cluster Configuration File

The default Coherence cluster configuration file, Custom-Default.xml, resides in the following location:

$DOMAIN_HOME/config/coherence/Coherence-Default/

where $DOMAIN_HOME is the root directory for the domain.

Table 7-1 describes the default configuration options that you can specify.

Table 7-1 Coherence Cluster Configuration File Options

Option                     Element Name                                 System Property Name                        Default Value
TCP-ring IP-timeout        <tcp-ring-listener><ip-timeout>              tangosol.coherence.ipmonitor.pingtimeout    5
TCP-ring IP-attempts       <tcp-ring-listener><ip-attempts>             tangosol.coherence.ipmonitor.pingattempts   2
Service Guardian Timeout   <service-guardian><timeout-milliseconds>     tangosol.coherence.guard.timeout            305000
Packet Delivery Timeout    <packet-delivery><timeout-milliseconds>      tangosol.coherence.packet.timeout           300000


You can override these default configuration options either by setting the corresponding system properties or by creating an override configuration file, called tangosol-coherence-override.xml, and adding it to the system CLASSPATH variable on all servers.
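
For example, a minimal tangosol-coherence-override.xml that overrides all four options in Table 7-1 might look like the following sketch. The values shown simply repeat the Table 7-1 defaults (the TCP-ring IP-timeout is assumed to be in seconds); substitute values appropriate for your deployment:

<?xml version="1.0"?>
<!-- Illustrative sketch of tangosol-coherence-override.xml. The values
     repeat the Table 7-1 defaults; replace them with your own settings. -->
<coherence>
  <cluster-config>
    <tcp-ring-listener>
      <ip-timeout>5s</ip-timeout>  <!-- Table 7-1 lists 5; seconds assumed -->
      <ip-attempts>2</ip-attempts>
    </tcp-ring-listener>
    <packet-publisher>
      <packet-delivery>
        <timeout-milliseconds>300000</timeout-milliseconds>
      </packet-delivery>
    </packet-publisher>
    <service-guardian>
      <timeout-milliseconds>305000</timeout-milliseconds>
    </service-guardian>
  </cluster-config>
</coherence>

After placing this file on the CLASSPATH of every server in the domain, restart all servers so that they load the same configuration.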

See the following Coherence documentation for information on which configuration options you can override and how to use the override configuration file: