The following figure illustrates an enhanced broker cluster. An enhanced broker cluster provides both service availability and data availability.
An enhanced broker cluster has the following characteristics:
Data Synchronization
All brokers in an enhanced cluster share a common persistent data store in which destinations, persistent messages, and other state information are stored for each broker. Because all brokers share the same data store, each broker is able to access the state information stored by other brokers in the cluster. When a broker that has been offline rejoins the cluster (or when a new broker is added to the cluster), it is able to access the most current information simply by accessing the shared data store. Similarly, if a broker fails, another broker is able to access and take over the failed broker's information in the shared data store.
To achieve data availability, the shared data store must be a highly available JDBC database. While it is possible to use a shared data store that is not highly available, such a data store would represent a single point of failure for the cluster and would pose a risk that is normally unacceptable for a production message service: all brokers in the cluster would be impacted if the shared data store were to become unavailable.
Failure Detection and Recovery
An enhanced cluster makes use of a distributed heartbeat service by which brokers inform other brokers that they are online and accessible by the cluster connection service. The heartbeat service also updates broker state information in the cluster's shared data store. When no heartbeat packet is detected from a broker for a configurable number of heartbeat intervals, the broker is suspected of failure. The other brokers in the cluster then begin to monitor the suspect broker’s state information in the shared data store to confirm whether the broker is still online. If the suspect broker does not update its state information within a configurable interval, it is considered to have failed. There is a trade-off between the speed and the accuracy of failure detection: configuring the cluster for quick failure detection increases the likelihood that a slow broker will erroneously be considered to have failed.
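The two-phase detection described above can be sketched as a small state machine. This is an illustrative model only, not the Message Queue implementation: the class name, method names, and threshold values are hypothetical, and the sketch counts monitor checks as a stand-in for the configurable monitoring interval. It also assumes that a suspect broker whose state information advances in the shared data store is treated as online again.

```java
/** Illustrative model of heartbeat-then-monitor failure detection; all names are hypothetical. */
final class FailureDetectionSketch {
    enum State { ONLINE, SUSPECT, FAILED }

    private final int heartbeatThreshold; // missed heartbeat intervals before a broker becomes suspect
    private final int monitorThreshold;   // monitor checks without a store update before it is declared failed
    private int missedHeartbeats = 0;
    private int staleMonitorChecks = 0;
    private long lastSeenStoreTimestamp;  // last state timestamp read from the shared data store
    private State state = State.ONLINE;

    FailureDetectionSketch(int heartbeatThreshold, int monitorThreshold, long initialStoreTimestamp) {
        this.heartbeatThreshold = heartbeatThreshold;
        this.monitorThreshold = monitorThreshold;
        this.lastSeenStoreTimestamp = initialStoreTimestamp;
    }

    /** Phase 1: call once per heartbeat interval. */
    void onHeartbeatInterval(boolean packetReceived) {
        if (state == State.FAILED) {
            return; // a failed broker stays failed in this sketch
        }
        if (packetReceived) {
            missedHeartbeats = 0;
            staleMonitorChecks = 0;
            state = State.ONLINE;
        } else if (++missedHeartbeats >= heartbeatThreshold && state == State.ONLINE) {
            state = State.SUSPECT;
        }
    }

    /** Phase 2: call once per monitor check with the suspect broker's timestamp from the shared store. */
    void onMonitorInterval(long storeTimestamp) {
        if (state != State.SUSPECT) {
            return; // the shared store is only consulted for suspect brokers
        }
        if (storeTimestamp > lastSeenStoreTimestamp) {
            // Assumption: a fresh update in the store clears the suspicion.
            lastSeenStoreTimestamp = storeTimestamp;
            missedHeartbeats = 0;
            staleMonitorChecks = 0;
            state = State.ONLINE;
        } else if (++staleMonitorChecks >= monitorThreshold) {
            state = State.FAILED;
        }
    }

    State state() {
        return state;
    }

    public static void main(String[] args) {
        FailureDetectionSketch detector = new FailureDetectionSketch(3, 2, 100L);
        for (int i = 0; i < 3; i++) {
            detector.onHeartbeatInterval(false);
        }
        System.out.println(detector.state()); // prints SUSPECT
        detector.onMonitorInterval(100L);
        detector.onMonitorInterval(100L);
        System.out.println(detector.state()); // prints FAILED
    }
}
```

Lowering either threshold in this model shortens the time to reach FAILED, but it also makes it more likely that a broker that is merely slow is declared failed, which is the trade-off noted above.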
If these failure detection services operating in tandem determine that a broker has failed, then a failover broker is selected from among the remaining online brokers to take over the pending work of the failed broker.
The failover broker attempts to take over the failed broker’s persistent state (pending messages, destinations, durable subscriptions, pending acknowledgments, and open transactions) so as to provide uninterrupted service to the failed broker’s clients. If two or more brokers attempt such a takeover, only the first will succeed (the first acquires a lock on the failed broker’s data in the shared data store, preventing subsequent takeover attempts).
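The "first takeover wins" rule can be illustrated with an atomic put-if-absent operation. The following is a minimal sketch under stated assumptions: an in-memory map stands in for the lock held in the shared data store, and the class, method, and broker names are hypothetical. How the actual lock is recorded in the JDBC database is not specified here.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Illustrative takeover lock; an in-memory map stands in for the shared data store. */
final class TakeoverLockSketch {
    // failed broker ID -> ID of the broker that acquired the takeover lock
    private final ConcurrentMap<String, String> takeoverLocks = new ConcurrentHashMap<>();

    /** Returns true only for the first broker that claims the failed broker's data. */
    boolean tryTakeover(String failedBrokerId, String candidateBrokerId) {
        // putIfAbsent is atomic: exactly one caller observes null and wins the lock.
        return takeoverLocks.putIfAbsent(failedBrokerId, candidateBrokerId) == null;
    }

    /** Returns the broker holding the lock, or null if no takeover has happened. */
    String ownerOf(String failedBrokerId) {
        return takeoverLocks.get(failedBrokerId);
    }

    public static void main(String[] args) {
        TakeoverLockSketch store = new TakeoverLockSketch();
        System.out.println(store.tryTakeover("brokerA", "brokerB")); // prints true
        System.out.println(store.tryTakeover("brokerA", "brokerC")); // prints false
        System.out.println(store.ownerOf("brokerA"));                // prints brokerB
    }
}
```

The point of the design is that the check and the claim happen in one atomic step, so two brokers that detect the failure at the same moment cannot both take over the same data.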
The takeover of a failed broker's state happens very rapidly; however, while the takeover is in progress, the failover broker cannot accept new client connections.
Once takeover is complete and a period for clients to reconnect to the failover broker has elapsed, the failover broker will clean up any transient resources (such as transactions and temporary destinations) belonging to the failed broker.
Client Reconnect
If a broker fails, its clients automatically reconnect to the failover broker, which becomes their new home broker. The reconnect process is a dynamic interplay between the client runtime and the broker cluster: if a client attempts to reconnect to a broker that is not the failover broker, the reconnect is rejected and the client is redirected to the failover broker.
In this scenario, the new home broker (the failover broker) has immediate access to all the client-related state information that was previously held by the failed broker. The failover broker can therefore take over where the failed broker left off. As a result, the failure of a broker in an enhanced cluster will not cause a failure in message delivery. However, during the short time required for takeover to complete, the failover broker cannot accept new client connections, causing a short delay in client reconnects, and a corresponding short delay in message delivery.
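The reject-and-redirect behavior during reconnect can be modeled in a few lines. This is a simplified sketch, not the client runtime's actual protocol: the class, the method names, and the broker IDs are hypothetical, and the sketch assumes that every broker can look up which broker took over for a failed one.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative reconnect model; names and the lookup mechanism are assumptions. */
final class ReconnectSketch {
    // failed broker ID -> failover broker ID that took over its state
    private final Map<String, String> failoverFor = new HashMap<>();

    void recordTakeover(String failedBrokerId, String failoverBrokerId) {
        failoverFor.put(failedBrokerId, failoverBrokerId);
    }

    /**
     * Returns the broker that becomes the client's new home broker.
     * An attempt against any broker other than the failover broker is
     * rejected, and the client is redirected to the failover broker.
     */
    String reconnect(String previousHomeBrokerId, String attemptedBrokerId) {
        String failover = failoverFor.get(previousHomeBrokerId);
        if (failover == null) {
            throw new IllegalStateException("no takeover recorded for " + previousHomeBrokerId);
        }
        if (attemptedBrokerId.equals(failover)) {
            return attemptedBrokerId; // accepted directly
        }
        return failover; // rejected, then redirected
    }

    public static void main(String[] args) {
        ReconnectSketch cluster = new ReconnectSketch();
        cluster.recordTakeover("brokerA", "brokerB");
        System.out.println(cluster.reconnect("brokerA", "brokerC")); // prints brokerB
        System.out.println(cluster.reconnect("brokerA", "brokerB")); // prints brokerB
    }
}
```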
To configure an enhanced cluster, you set cluster configuration properties for each broker in the cluster. These properties are detailed in Enhanced Broker Cluster Properties in the Sun Java System Message Queue 4.3 Administration Guide.