GMS Failure Detection Settings (Sun GlassFish Enterprise Server v2.1.1 High Availability Administration Guide)

Sun GlassFish Enterprise Server v2.1.1 High Availability Administration Guide

GMS Failure Detection Settings

The following settings are used in GMS failure detection:

fd-protocol-max-tries: Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as suspected failure. GMS also tries to make a peer-2-peer connection with the suspected member. If that also fails, the member is marked as suspect failed.
fd-protocol-timeout-in-millis: Indicates the failure detection interval (in milliseconds) between each heartbeat message that would provoke an instance to send out its Alive message. This setting considers the number of milliseconds between missed heartbeats that the max-retry logic would wait for, in the master node, between counting each missed heartbeat. Lowering the value of retries would mean that failure would be suspected after fewer missed heartbeats. Lowering the value of fd-protocol-timeout-in-millis below the default would result in more frequent heartbeat messages being sent out from each member. This could potentially result in more heartbeat messages in the network than a system needs for triggering failure detection protocols. The effect of this varies depending on how quickly the deployment environment needs to have failure detection performed. That is, the (lower) number of retries with a lower heartbeat interval would make it quicker to detect failures. However, lowering the timeout or retry attempts could result in false positives because you could potentially detect a member as failed when, in fact, the member's heartbeat is reflecting the network load from other parts of the server. Conversely, a higher timeout interval results in fewer heartbeats in the system because the time interval between heartbeats is longer. As a result, failure detection would take a longer. In addition, a startup by a failed member during this time results in a new join notification but no failure notification, because failure detection and evaluation were not completed. The lack of a join notification without a preceding failure notification is logged.
ping-protocol: Indicates the amount of time an instance's GMS module will wait during instance startup (on a background thread, so that server startup does not wait for the timeout) for discovering the master member of the group. In GMS, this process is called master node discovery protocol. The instance's GMS module sends out a master node query to the multicast group address. If the instance times out (does not receive a master node response from another member within this time) the master is assumed absent and the instance assumes the master role. The instance sends out a master node announcement to the group, and starts responding to subsequent master node query messages from members. In Enterprise Server, the domain administration server (DAS) joins a cluster as soon as it is created, which means the DAS becomes a master member of the group. This allows cluster members to discover the master quickly, without incurring a timeout. Lowering the ping-protocol timeout would cause a member to timeout more quickly because it will take longer to discover the master node. As a result, there might be multiple masters in the group which could lead to master collision. Master collision could cause resolution protocol to start. The master collision, and resolution protocol, results in multiple masters telling each other who the true master candidate is based on sorted order of memberships (based on their UUIDs). The impact can be extensive in messaging if there are many masters in the group. Therefore, the ping-protocol timeout value should be set to the default or higher.
vs-protocol-timeout-in-millis: Indicates the verify suspect protocol's timeout used by the health monitor. After a member is marked as suspect based on missed heartbeats and a failed peer–2–peer connection check, the verify suspect protocol is activated and waits for the specified timeout to check for any further health state messages received in that time, and to see if a peer-2-peer connection can be made with the suspect member. If not, then the member is marked as failed and a failure notification is sent.

The retries, missed heartbeat intervals, peer-2-peer connection-based failure detection, watchdog-based failure reporting, and the verify suspect protocols are all needed to ensure that failure detection is robust and reliable in Enterprise Server.