Death Detection

The Coherence death detection algorithms are based on sustained loss of connectivity between two or more cluster nodes. When a node identifies that it has lost connectivity with any other node, it consults the remaining cluster nodes to determine what action should be taken.

In attempting to consult with others, the node may find that it cannot communicate with any other node, and will then assume that it has been disconnected from the cluster. Such a condition could be triggered by physically unplugging a node's network adapter. In such an event the isolated node will restart its clustered services and attempt to rejoin the cluster.

If connectivity with other cluster nodes remains unavailable, the node may (depending on WKA configuration) form a new isolated cluster, or continue searching for the larger cluster. In either case, once connectivity is restored the previously isolated node will rejoin the running cluster. As part of rejoining the cluster, the node's former cluster state is discarded, including any cache data it may have held, because the remainder of the cluster has already taken ownership of that data (restoring it from backups).
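The decision flow described above can be sketched in a few lines of Java. This is only an illustration of the prose, not Coherence's implementation or API: the class, enum, and method names are hypothetical, and the boolean mayFormNewCluster is an assumed stand-in for whatever the WKA configuration permits.

```java
import java.util.Set;

/** Hypothetical sketch of the isolation handling described above; not Coherence code. */
final class IsolationSketch {

    /** What a node does once it notices it has lost contact with some peer. */
    enum Action { CONSULT_PEERS, RESTART_AND_REJOIN }

    /** What happens after the isolated node has restarted its clustered services. */
    enum RejoinOutcome { JOIN_RUNNING_CLUSTER, FORM_ISOLATED_CLUSTER, KEEP_SEARCHING }

    /**
     * If at least one peer still answers, the node consults it; if no peer
     * answers, the node assumes it is the one that was disconnected.
     */
    static Action onConnectivityLoss(Set<String> reachablePeers) {
        return reachablePeers.isEmpty() ? Action.RESTART_AND_REJOIN : Action.CONSULT_PEERS;
    }

    /**
     * mayFormNewCluster is an assumption standing in for the WKA-dependent
     * choice between forming a new isolated cluster and continuing to search.
     */
    static RejoinOutcome afterRestart(boolean clusterReachable, boolean mayFormNewCluster) {
        if (clusterReachable) {
            // Rejoining discards the node's former state, including cache data.
            return RejoinOutcome.JOIN_RUNNING_CLUSTER;
        }
        return mayFormNewCluster ? RejoinOutcome.FORM_ISOLATED_CLUSTER
                                 : RejoinOutcome.KEEP_SEARCHING;
    }

    public static void main(String[] args) {
        System.out.println(onConnectivityLoss(Set.of("node-2")));
        System.out.println(onConnectivityLoss(Set.of()));
        System.out.println(afterRestart(false, true));
    }
}
```

Whichever branch is taken while the node is isolated, the sketch ends in the same place once connectivity returns: the node joins the running cluster with none of its former state.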

Without connectivity, a node cannot identify the state of other nodes. This means that from the point of view of a single node, local network adapter failure and network-wide switch failure look identical, and are thus handled in the same way, as described above. The important difference is that in the case of a switch failure all nodes are attempting to rejoin the cluster, which is the equivalent of a full cluster restart, and all prior state and data are dropped.
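To make the point concrete, the following toy model compares the two failures. It is a sketch under stated assumptions rather than anything taken from Coherence: the node names, the Failure enum, and both methods are hypothetical, and the "network" is just a rule for which pairs of nodes can still reach each other.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy model; it does not use or mirror any Coherence API. */
final class FailureViewSketch {

    /** The two failure modes compared in the text. */
    enum Failure { ADAPTER_OF_FIRST_NODE, SWITCH }

    /** Peers that the given node can still reach under the given failure. */
    static Set<String> reachableFrom(String node, List<String> nodes, Failure failure) {
        String unplugged = nodes.get(0); // the node whose adapter fails
        Set<String> reachable = new TreeSet<>();
        for (String peer : nodes) {
            if (peer.equals(node)) {
                continue;
            }
            // A switch failure cuts every link; an adapter failure cuts only
            // the links that touch the unplugged node.
            boolean linkUp = failure == Failure.ADAPTER_OF_FIRST_NODE
                    && !node.equals(unplugged)
                    && !peer.equals(unplugged);
            if (linkUp) {
                reachable.add(peer);
            }
        }
        return reachable;
    }

    /** Number of nodes that see no peers and would therefore restart. */
    static long isolatedCount(List<String> nodes, Failure failure) {
        return nodes.stream()
                .filter(n -> reachableFrom(n, nodes, failure).isEmpty())
                .count();
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("A", "B", "C");
        for (Failure f : Failure.values()) {
            System.out.println(f + ": A sees " + reachableFrom("A", nodes, f)
                    + ", isolated nodes = " + isolatedCount(nodes, f));
        }
    }
}
```

In this model node A sees an empty peer set in both cases, so it cannot tell the failures apart. The difference is visible only from outside: with three nodes, one node is isolated after the adapter failure, but all three are isolated after the switch failure, which is why no surviving cluster remains to hold the data.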

Dropping all data is not desirable, so to avoid it during a sustained switch failure you must take additional precautions. Options include: