Coherence 3.3 User Guide : Death Detection

Death Detection

The Coherence death detection algorithms are based on sustained loss of connectivity between two or more cluster nodes. When a node detects that it has lost connectivity with any other node, it consults the other cluster nodes to determine what action should be taken.

In attempting to consult with others, the node may find that it cannot communicate with any other node, and will assume that it has been disconnected from the cluster. Such a condition could be triggered by physically unplugging a node's network adapter. In such an event the isolated node will restart its clustered services and attempt to rejoin the cluster.

If connectivity with other cluster nodes remains unavailable, the node may (depending on WKA configuration) form a new isolated cluster, or continue searching for the larger cluster. In either case, once connectivity is restored the previously isolated cluster nodes will rejoin the running cluster. As part of rejoining the cluster, the node's former cluster state is discarded, including any cache data it may have held, because the remainder of the cluster has already taken ownership of that data (restoring it from backups).

Without connectivity it is not possible for a node to identify the state of other nodes. This means that, from the point of view of a single node, local network adapter failure and network-wide switch failure look identical, and are thus handled in the same way, as described above. The important difference is that in the case of a switch failure all nodes attempt to rejoin the cluster, which is the equivalent of a full cluster restart, and all prior state and data are dropped.
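The behavior described above can be summarized as a small decision function. The following is a toy model written for illustration only, not Coherence code: the class DeathDetectionModel, the Action enum, and the decide method are all hypothetical names, and the real algorithm also requires the loss of connectivity to be sustained.

```java
import java.util.Set;

// Toy model of the decision described in the text; NOT Coherence code.
// All names (DeathDetectionModel, Action, decide) are hypothetical.
final class DeathDetectionModel {

    enum Action {
        NONE,               // every known peer is still reachable
        CONSULT_PEERS,      // some peers are unreachable: ask the reachable ones
        RESTART_AND_REJOIN  // no peer is reachable: assume this node is isolated
    }

    // knownPeers: the other members this node believes are in the cluster.
    // reachablePeers: the members it can currently communicate with.
    static Action decide(Set<String> knownPeers, Set<String> reachablePeers) {
        if (knownPeers.isEmpty() || reachablePeers.containsAll(knownPeers)) {
            return Action.NONE;
        }
        boolean anyReachable = knownPeers.stream().anyMatch(reachablePeers::contains);
        return anyReachable ? Action.CONSULT_PEERS : Action.RESTART_AND_REJOIN;
    }

    public static void main(String[] args) {
        Set<String> peers = Set.of("node-2", "node-3");
        System.out.println(decide(peers, Set.of("node-2", "node-3")));
        System.out.println(decide(peers, Set.of("node-2")));
        System.out.println(decide(peers, Set.of()));
    }
}
```

In this model a pulled network cable and a failed switch both produce an empty set of reachable peers, so the node reaches the same decision in either case. During a switch failure every node reaches that decision at once, which is why the outcome is equivalent to a full cluster restart.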

Dropping all data is clearly undesirable, so to avoid it during a sustained switch failure you must take additional precautions. Options include:

Extend allowable outage duration: The maximum time a node may be unresponsive before being removed from the cluster is configured via the packet-delivery/timeout-milliseconds configuration element, and defaults to one minute for production configurations. Increasing this value allows the cluster to wait longer for connectivity to return. The downside of increasing this value is that it may also take longer to handle the case where just a single node has lost connectivity.
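The trade-off of a longer timeout can be illustrated with a minimal sketch. This is a simplified model, not Coherence's implementation: the class OutageTimeoutModel and the shouldRemove method are hypothetical, the strict comparison is an assumption, and the only value taken from the text is the one-minute production default.

```java
// Simplified model of the timeout trade-off; NOT Coherence code.
// OutageTimeoutModel and shouldRemove are hypothetical names.
final class OutageTimeoutModel {

    // One minute, the production default stated in the text.
    static final long DEFAULT_TIMEOUT_MILLIS = 60_000L;

    // A node is removed once it has been unresponsive for longer than the
    // timeout (assumed here to be a strict "greater than" comparison).
    static boolean shouldRemove(long unresponsiveMillis, long timeoutMillis) {
        return unresponsiveMillis > timeoutMillis;
    }

    public static void main(String[] args) {
        long switchOutageMillis = 3 * 60_000L;    // example: a three-minute outage
        long extendedTimeoutMillis = 5 * 60_000L; // example: a five-minute timeout

        // With the default, the outage outlasts the timeout.
        System.out.println(shouldRemove(switchOutageMillis, DEFAULT_TIMEOUT_MILLIS));
        // With the extended timeout, the same outage is tolerated.
        System.out.println(shouldRemove(switchOutageMillis, extendedTimeoutMillis));
    }
}
```

The same extended timeout applies to a node that has genuinely failed, so in this example the cluster would wait up to five minutes before removing it.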

Persist data to external storage: By utilizing a Read Write Backing Map, the cluster persists data to external storage and can retrieve it after a cluster restart. So long as write-behind is disabled, no data would be lost in the event of a switch failure. The downside is that synchronously writing through to external storage increases the latency of cache update operations, and the external storage may become a bottleneck.

Delay node restart: The cluster death detection action can be reconfigured to delay the node restart until connectivity is restored. This allows an isolated node to continue running with whatever data it had available at the time of disconnect. Once connectivity is restored, the nodes will detect each other and form a new cluster. In forming a new cluster, all but the most senior node will be required to restart. This results in behavior that is nearly identical to the default behavior, because the majority of the nodes will restart and drop their data. It may be beneficial in cases where replicated caches are in use, as the most senior node's copy of the data will survive the restart. To enable the delayed restart, the tangosol.coherence.departure.threshold system property must be set to a value that is greater than the size of the cluster.

When running on Microsoft Windows it is also necessary to ensure that Windows does not disable the network adapter when it is disconnected. To do this, add the following Windows registry DWORD, setting it to 1:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DisableDHCPMediaSense
Note that despite the name this setting affects static IPs as well.
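As an illustration of the delayed restart option, the sketch below sets the departure threshold programmatically. The property name comes from the text above; the class DepartureThresholdSetup, its methods, and the cluster size of 8 are hypothetical, and the sketch assumes the property is read when clustered services start, so it must be set before that point. Passing the same value with -D on the java command line has the same effect on the system property.

```java
// Sketch only: sets the system property named in the text.
// DepartureThresholdSetup and its methods are hypothetical helpers.
final class DepartureThresholdSetup {

    static final String PROPERTY = "tangosol.coherence.departure.threshold";

    // Smallest value that is strictly greater than the cluster size.
    static int thresholdFor(int expectedClusterSize) {
        if (expectedClusterSize < 1) {
            throw new IllegalArgumentException("cluster size must be at least 1");
        }
        return expectedClusterSize + 1;
    }

    // Must run before clustered services are started (assumption: the
    // property is read at that point).
    static void configure(int expectedClusterSize) {
        System.setProperty(PROPERTY, Integer.toString(thresholdFor(expectedClusterSize)));
    }

    public static void main(String[] args) {
        configure(8); // assumed cluster size, for illustration
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
        // Start clustered services here, after the property has been set.
    }
}
```

If the cluster can grow, the threshold needs to be chosen against the largest size the cluster is expected to reach, since the value must remain greater than the cluster size.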

Add network level fault tolerance: Adding a redundant layer to the cluster's network infrastructure allows individual pieces of networking equipment to fail without disrupting connectivity. This is commonly achieved by utilizing at least two network adapters per machine, and having each adapter connected to a separate switch. This is not a feature of Coherence but rather of the underlying operating system or network driver. The only change to Coherence is that it should be configured to bind to the virtual rather than the physical network adapter. This form of network redundancy goes by different names depending on the operating system; see Linux bonding, Solaris trunking and Windows teaming for further details.