For most Coherence deployments, there is a single cluster in each data center (through TCMP), and federation between each data center (through Coherence*Extend). While TCMP scalability is dependent on many variables, a good rule of thumb is that with solid networking infrastructure, a cluster of 100 server JVMs is readily supported, and a cluster of 1000 server JVMs is possible but requires far more care to achieve.
Oracle Coherence supports both homogeneous server clusters and the federated server model. Any application or server process that is running the Coherence software is called a cluster node. All cluster nodes on the same network automatically cluster. Cluster nodes use a peer-to-peer protocol, which means that any cluster node can talk directly to any other cluster node.
Coherence is logically sub-divided into clusters, services and caches. A Coherence cluster is a group of cluster nodes that share a group address, which allows the cluster nodes to communicate. Generally, a cluster node only joins one cluster, but it is possible for a cluster node to join (be a member of) several different clusters, by using a different group address for each cluster.
Within a cluster, there exists any number of named services. A cluster node can participate in (join) any number of these services; when a cluster node joins a service, it automatically has all of the information from that service available to it; for example, if the service is a replicated cache service, then joining the service includes replicating the data of all the caches in the service. These services are all peer-to-peer, which means that a cluster node typically plays both the client and the server role through the service; furthermore, all of these services failover during cluster node failure without any data loss.
Failback is an extension to failover that allows a server to reclaim its responsibilities when it restarts. For example, "When the server came back up, the processes that it was running previously were failed back to it."
Failover refers to the ability of a server to assume the responsibilities of a failed server. For example, "When the server died, its processes failed over to the backup server."
JCache (also known as JSR-107), is a caching API specification that is currently in progress. While the final version of this API has not been released yet, Oracle and other companies with caching products have been tracking the current status of the API. The API has been largely solidified at this point. Few significant changes are expected going forward.
The .Net and C++ platforms do not have a corresponding multi-vendor standard for data caching.
A load balancer is a hardware device or software program that delegates network requests to several servers, such as in a server farm or server cluster. Load balancers typically can detect server failure and optionally retry operations that were being processed by that server at the time of its failure. Load balancers typically attempt to keep the servers to which they delegate equally busy, hence the use of the term "balancer". Load balancer devices often have a high-availability option that uses a second load balancer, allowing a load balancer devices to die without affecting availability.
A server cluster is composed of multiple servers that are mutually aware of each other. The servers can communicate directly with each other, safely share responsibilities, and are able to assume the responsibilities of failed servers. This simplifies development because there is no longer a requirement use asynchronous messaging (which may require idempotent, or compensating transactions, or both) or synchronous two-phase commits (which may block indefinitely and reduce system availability).
Due to the need for global coordination, clustering scales to a lesser degree than federation, but generally with much stronger data integrity guarantees.
The loosest form of coupling, a server farm uses multiple servers to handle increased load and provide increased availability. It is common for a load-balancer to be used to assign work to the various servers in the server farm, and server farms often share back-end resources, such as database servers, but each server is typically unaware of other servers in the farm, and usually the load-balancer is responsible for failover.
Farms are commonly used for stateless services such as delivering static web content or performing high volumes of compute-intensive operations for high-performance computing (HPC).
A federated server model is similar to a server farm, but allows multiple servers to work with each other even if they were not originally intended to do so. Federation may operate synchronously through technologies such as distributed XA transactions, or asynchronously through messaging solutions such as JMS. Federation may be used to scale out (for example, with federated databases, data can be "partitioned" or "shared" across multiple database instances). Federation may also be used to integrate between heterogeneous systems (for example, sharing data between applications within a web portal).
In the context of scale-out, federated systems are primarily used when data integrity is important but not absolutely crucial, such as for many stateful web applications or transactional compute grids. In the former case, HTTP sessions are generally require only a "best effort" guarantee. In the latter case, there is usually an external system of record that ensures data integrity (by rolling back any invalid transactions that the compute grid submits).
Federation supports enormous scale at the cost of data integrity guarantees (though such guarantees can be reinstated with the elimination of transparent failover/failback and dynamic partitioning, which is the case for most WAN-style deployments).