Automatic Cluster Recovery Process

Nodes can synchronize automatically after a temporary communication outage between cluster nodes (for example, due to network issues). During the outage, a node can temporarily leave the cluster. It then automatically rejoins the cluster and synchronizes its data with the XML databases of the local master cluster node and replication group.

The following automatic recovery processes take effect when a node fails or is shut down:

  • When the member that leaves is running the master database, the Berkeley XML database ensures that the remaining replicas hold an election and elect a new master.
  • The Message-Oriented Middleware (MOM) service retains topic messages for any durable subscribers registered on the host that departed. These messages are retained for 24 hours and then removed.
  • Services that share common task processing continue on the remaining nodes. Submitted tasks, such as save and activate or the poller, are processed by the remaining cluster members.
  • Locks acquired by services on the node that departed the cluster, such as device synchronization locks, are removed after an expiration time. If the original service cannot remove its lock, the lock expires automatically.
  • Any tasks initiated by the node that departed the cluster are resubmitted.
  • Load balancers on the active nodes in the cluster bypass the front-end node that left the cluster. Clients are redirected to a valid running node.
  • The health monitoring service detects when the heartbeat of a node fails and publishes a failure message to all active cluster nodes, indicating that the node has left the cluster.
  • A node that rejoins the cluster attempts to synchronize its plugins to match the collective status of the cluster. This process might retrieve plugin zip files that were uploaded to other cluster nodes while the node was down, and might install, uninstall, or delete plugins to reflect changes made while the node was down.
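
The lock expiration behavior in the list above can be pictured with a small sketch. This is an illustration only, not the product's implementation: the class name ExpiringLockTable, the 30-second expiration time, the lock name "device-sync", and the node names are all assumptions made for the example. It shows a lock that a departed node never released becoming available to another node once its expiration time passes.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringLockTable:
    """Hypothetical in-memory lock table in which every lock has an expiration time."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the behavior can be checked without waiting
        # lock name -> (owner node, expiry timestamp)
        self._locks: Dict[str, Tuple[str, float]] = {}

    def holder(self, name: str) -> Optional[str]:
        """Return the current owner, dropping the lock first if it has expired."""
        entry = self._locks.get(name)
        if entry is None:
            return None
        owner, expires_at = entry
        if self._clock() >= expires_at:
            del self._locks[name]  # the owner never released it, so it expires
            return None
        return owner

    def acquire(self, name: str, node: str) -> bool:
        """Grant the lock if it is free, expired, or already held by this node."""
        current = self.holder(name)
        if current is not None and current != node:
            return False
        self._locks[name] = (node, self._clock() + self._ttl)
        return True

    def release(self, name: str, node: str) -> bool:
        """Normal path: the owning service removes its own lock."""
        if self.holder(name) != node:
            return False
        del self._locks[name]
        return True


if __name__ == "__main__":
    now = [0.0]  # fake clock, advanced by hand
    locks = ExpiringLockTable(ttl_seconds=30.0, clock=lambda: now[0])
    print(locks.acquire("device-sync", "node-a"))  # node-a takes the lock, then departs
    print(locks.acquire("device-sync", "node-b"))  # refused while the lock is still live
    now[0] = 31.0                                   # the expiration time passes
    print(locks.acquire("device-sync", "node-b"))  # granted after automatic expiry
```

Checking expiry lazily, at the moment another node asks for the lock, keeps the sketch free of background timers. A real cluster would also need the lock table itself to be shared and replicated, which this example does not attempt.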