Clustering

Overview

Coherence is built on a fully clustered architecture. Since "clustered" is an overused term in the industry, it is worth stating exactly what it means to say that Coherence is clustered. Coherence is based on a peer-to-peer clustering protocol, using a conference room model, in which servers are capable of:

Speaking to Everyone: When a party enters the conference room, it is able to speak to all other parties in a conference room.
Listening: Each party present in the conference room can hear messages that are intended for everyone, as well as messages that are intended for that particular party. It is also possible that a message is not heard the first time, thus a message may need to be repeated until it is heard by its intended recipients.
Discovery: Parties can only communicate by speaking and listening; there are no other senses. Using only these means, the parties must determine exactly who is in the conference room at any given time, and parties must detect when new parties enter the conference room.
Working Groups and Private Conversations: Although a party can talk to everyone, once a party is introduced to the other parties in the conference room (i.e. once discovery has completed), the party can communicate directly to any set of parties, or directly to an individual party.
Death Detection: Parties in the conference room must quickly detect when parties leave the conference room – or die.

Using the conference room model provides the following benefits:

There is no configuration required to add members to a cluster. Subject to configurable security restrictions, any JVM running Coherence will automatically join the cluster and be able to access the caches and other services provided by the cluster. This includes J2EE application servers, Cache Servers, dedicated cache loader processes, or any other JVM that is running with the Coherence software. When a JVM joins the cluster, it is called a cluster node, or alternatively, a cluster member.
Since all cluster members are known, it is possible to provide redundancy within the cluster, such that the death of any one JVM or server machine does not cause any data to be lost.
Since the death or departure of a cluster member is automatically and quickly detected, failover occurs very rapidly, and more importantly, it occurs transparently, which means that the application does not have to do any extra work to handle failover.
Since all cluster members are known, it is possible to load balance responsibilities across the cluster. Coherence does this automatically with its Distributed Cache Service, for example. Load balancing automatically occurs to respond to new members joining the cluster, or existing members leaving the cluster.
Communication can be very well optimized, since some communication is multi-point in nature (e.g. messages for everyone), and some is between two members.

Two of the terms used here describe processing for failed servers:

Failover: Failover refers to the ability of a server to assume the responsibilities of a failed server. For example, "When the server died, its processes failed over to the backup server."
Failback: Failback is an extension to failover that allows a server to reclaim its responsibilities once it restarts. For example, "When the server came back up, the processes that it was running previously were failed back to it."

All of the Coherence clustered services, including cache services and grid services, provide automatic and transparent failover and failback. While these features are transparent to the application, it should be noted that the application can sign up for events to be notified of all comings and goings in the cluster.