High-Availability Clusters (Sun Java System Message Queue 4.1 Administration Guide)

High-Availability Clusters

In a high-availability cluster, all of the brokers share a common JDBC-based persistent data store holding dynamic state information (destinations, persistent messages, durable subscriptions, open transactions, and so forth) for each broker. In the event of broker failure, this enables another broker to assume ownership of the failed broker’s persistent state and provide uninterrupted service to its clients. Because they share a common JDBC-based data store, all brokers belonging to an HA cluster must have their imq.persist.store property (see Table 14–4) set to jdbc.
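The requirement that every broker in the cluster set imq.persist.store to jdbc can be checked mechanically. The following is a minimal sketch, not a Message Queue tool: the broker names, the sample configuration text, and the helper functions are hypothetical, and only the property name and the jdbc value come from the text above.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def brokers_not_using_jdbc(configs):
    """Return the names of brokers whose imq.persist.store is not 'jdbc'."""
    return sorted(
        name
        for name, text in configs.items()
        if parse_properties(text).get("imq.persist.store") != "jdbc"
    )


# Hypothetical broker names and configuration text, for illustration only.
configs = {
    "brokerA": "imq.persist.store=jdbc\n",
    "brokerB": "# file-based store\nimq.persist.store=file\n",
    "brokerC": "",  # property not set at all
}
misconfigured = brokers_not_using_jdbc(configs)
print(misconfigured)
```

In this sample, brokerB and brokerC would be reported, because one names a different store type and the other does not set the property.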

Brokers within an HA cluster inform each other at regular intervals that they are still in operation by exchanging heartbeat packets (using a special internal connection service, the cluster connection service) and by updating their state information in the cluster’s shared persistent store. When no heartbeat packet is detected from a broker for a specified number of heartbeat intervals, the broker is suspected of failure. The other brokers in the cluster then begin to monitor the suspect broker’s state information in the persistent store to confirm whether the broker has indeed failed. If the suspect broker fails to update its state information within a certain threshold interval, it is considered to have failed. (The duration of these heartbeat and failure-detection intervals can be adjusted by means of broker configuration properties to balance the tradeoff between speed and accuracy of failure detection: shorter intervals result in quicker reaction to broker failure, but increase the likelihood of false suspicions and erroneous failure detection.)
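The two-stage detection described above (missed heartbeats raise suspicion, a stale state record in the shared store confirms failure) can be sketched as a small classification function. This is an illustrative model only: the names DetectionSettings and classify, the field names, and all numeric values are assumptions, and measuring the store threshold as the age of the last state update is a simplification rather than the broker's documented behavior.

```python
from dataclasses import dataclass


@dataclass
class DetectionSettings:
    heartbeat_interval: float  # seconds between heartbeat packets (assumed unit)
    heartbeat_threshold: int   # silent intervals tolerated before suspicion
    store_threshold: float     # max age (seconds) of the state record in the store


def classify(now, last_heartbeat, last_store_update, settings):
    """Classify a peer broker as OPERATING, SUSPECT, or FAILED."""
    silence_limit = settings.heartbeat_interval * settings.heartbeat_threshold
    if now - last_heartbeat < silence_limit:
        return "OPERATING"
    # Heartbeats have stopped: fall back to the state record in the shared store.
    if now - last_store_update <= settings.store_threshold:
        return "SUSPECT"
    return "FAILED"


relaxed = DetectionSettings(heartbeat_interval=2, heartbeat_threshold=3, store_threshold=60)
eager = DetectionSettings(heartbeat_interval=1, heartbeat_threshold=2, store_threshold=60)

# The same 3 seconds of heartbeat silence, judged under two settings.
relaxed_view = classify(now=100, last_heartbeat=97, last_store_update=95, settings=relaxed)
eager_view = classify(now=100, last_heartbeat=97, last_store_update=95, settings=eager)

# Long heartbeat silence plus a stale state record in the store.
stale_view = classify(now=100, last_heartbeat=90, last_store_update=30, settings=relaxed)

print(relaxed_view, eager_view, stale_view)
```

The first two calls show the tradeoff noted above: with the shorter settings the same silence already makes the broker a suspect, which speeds up detection but is more likely to flag a broker that is merely slow.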

When a broker in an HA cluster detects that another broker has failed, it will attempt to take over the failed broker’s persistent state (pending messages, destination definitions, durable subscriptions, pending acknowledgments, and open transactions), in order to provide uninterrupted service to the failed broker’s clients. If two or more brokers attempt such a takeover, only the first will succeed; that broker acquires a lock on the failed broker’s data in the persistent store, preventing subsequent takeover attempts by other brokers from succeeding. After an initial waiting period, the takeover broker will then clean up any transient resources (such as transactions and temporary destinations) belonging to the failed broker; these resources will be unavailable if the client later reconnects.
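The rule that only the first takeover attempt succeeds can be illustrated with a conditional update against a shared database. The sketch below uses an in-memory SQLite database as a stand-in for the JDBC store. The table name, column names, and broker names are hypothetical and do not reflect the actual Message Queue schema or locking implementation.

```python
import sqlite3


def open_store():
    """Create a stand-in shared store holding one failed broker's record."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE broker_state (broker_id TEXT PRIMARY KEY, takeover_by TEXT)"
    )
    conn.execute("INSERT INTO broker_state VALUES ('brokerA', NULL)")
    conn.commit()
    return conn


def try_takeover(conn, failed_broker, claimant):
    """Claim the failed broker's data; succeeds only if nobody holds the lock."""
    cur = conn.execute(
        "UPDATE broker_state SET takeover_by = ? "
        "WHERE broker_id = ? AND takeover_by IS NULL",
        (claimant, failed_broker),
    )
    conn.commit()
    # Exactly one updated row means this claimant acquired the lock.
    return cur.rowcount == 1


store = open_store()
results = {
    name: try_takeover(store, "brokerA", name) for name in ("brokerB", "brokerC")
}
owner = store.execute(
    "SELECT takeover_by FROM broker_state WHERE broker_id = 'brokerA'"
).fetchone()[0]
print(results, owner)
```

Because the update only matches a row whose lock column is still empty, the second claimant changes nothing and learns from the row count that another broker already owns the failed broker's data.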