Sun GlassFish Enterprise Server v2.1.1 High Availability Administration Guide

Group Management Service

The Group Management Service (GMS) is an infrastructure component that is enabled for the instances in a cluster. When GMS is enabled and a clustered instance fails, the cluster and the Domain Administration Server (DAS) are aware of the failure and can take action. Many features of Enterprise Server depend upon GMS. For example, GMS is used by the IIOP failover, in-memory replication, transaction service, and timer service features.

If server instances in a cluster are located on different machines, ensure that the machines are on the same subnet.

Note –

The GMS feature is not available in the developer profile. In the cluster profile and the enterprise profile, GMS is enabled by default.

GMS is a core service of the Shoal framework. For more information about Shoal, visit the Project Shoal home page.

The following topics are addressed here:

GMS Failure Detection Settings

The following settings are used in GMS failure detection:

fd-protocol-max-tries: Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as a suspected failure. GMS also tries to make a peer-2-peer connection with the suspected member. If that also fails, the member is marked as a suspected failure.
fd-protocol-timeout-in-millis: Indicates the failure detection interval (in milliseconds) between each heartbeat message that would provoke an instance to send out its Alive message. This setting is also the number of milliseconds that the max-retry logic in the master node waits between counting each missed heartbeat. Lowering the number of retries means that failure is suspected after fewer missed heartbeats. Lowering fd-protocol-timeout-in-millis below the default results in more frequent heartbeat messages from each member, potentially more than a system needs for triggering failure detection protocols. The appropriate values depend on how quickly the deployment environment needs failures to be detected: fewer retries combined with a shorter heartbeat interval detect failures more quickly. However, lowering the timeout or the number of retries can produce false positives, because a member could be detected as failed when its heartbeats are merely delayed by network load from other parts of the server. Conversely, a higher timeout results in fewer heartbeats in the system because the interval between heartbeats is longer, so failure detection takes longer. In addition, if a failed member restarts during this time, the result is a new join notification but no failure notification, because failure detection and evaluation were not completed. A join notification without a preceding failure notification is logged.
ping-protocol-timeout-in-millis: Indicates the amount of time that an instance's GMS module waits during instance startup (on a background thread, so that server startup does not wait for the timeout) to discover the master member of the group. In GMS, this process is called the master node discovery protocol. The instance's GMS module sends out a master node query to the multicast group address. If the instance times out (that is, does not receive a master node response from another member within this time), the master is assumed absent and the instance assumes the master role. The instance sends out a master node announcement to the group and starts responding to subsequent master node query messages from members. In Enterprise Server, the domain administration server (DAS) joins a cluster as soon as the cluster is created, which means that the DAS becomes the master member of the group. This allows cluster members to discover the master quickly, without incurring a timeout. Lowering the ping-protocol timeout causes a member to time out sooner, possibly before it has discovered the master node. As a result, there might be multiple masters in the group, which could lead to master collision. Master collision causes the resolution protocol to start, in which the multiple masters tell each other who the true master candidate is, based on the sorted order of memberships (by UUID). The resulting message traffic can be extensive if there are many masters in the group. Therefore, the ping-protocol timeout value should be set to the default or higher.
vs-protocol-timeout-in-millis: Indicates the verify suspect protocol's timeout used by the health monitor. After a member is marked as suspect based on missed heartbeats and a failed peer-2-peer connection check, the verify suspect protocol is activated and waits for the specified timeout to check for any further health state messages received in that time, and to see if a peer-2-peer connection can be made with the suspect member. If not, then the member is marked as failed and a failure notification is sent.

The retries, missed heartbeat intervals, peer-2-peer connection-based failure detection, watchdog-based failure reporting, and the verify suspect protocols are all needed to ensure that failure detection is robust and reliable in Enterprise Server.
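
As a rough illustration of how these settings combine, the time that GMS needs to report a failure can be approximated as fd-protocol-max-tries multiplied by fd-protocol-timeout-in-millis, plus vs-protocol-timeout-in-millis. The following shell sketch computes that estimate. The function name gms_detection_ms is hypothetical, and the formula is an assumption derived from the descriptions above, not a figure documented for GMS. It ignores the time taken by the peer-2-peer connection attempts and by network latency.

```shell
#!/bin/sh
# Rough estimate (an assumption, not a documented GMS formula) of the time
# taken to report a failure: the missed-heartbeat phase followed by the
# verify suspect wait. gms_detection_ms is a hypothetical helper name.
gms_detection_ms() {
    max_tries=$1     # fd-protocol-max-tries
    fd_timeout=$2    # fd-protocol-timeout-in-millis
    vs_timeout=$3    # vs-protocol-timeout-in-millis
    echo $(( max_tries * fd_timeout + vs_timeout ))
}

# Values taken from the sample cluster2-config output in this section:
# 3 tries, 2000 ms between heartbeats, 1500 ms verify suspect timeout.
gms_detection_ms 3 2000 1500    # prints 7500
```

Under the same assumption, raising fd-protocol-max-tries to 4 would lengthen the estimate to 9500 milliseconds, which shows the trade-off between fast detection and tolerance of delayed heartbeats.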

To Enable or Disable GMS for a Cluster

In the tree component, select Clusters.

Click the name of the cluster.

Under General Information, set the Heartbeat Enabled checkbox as required.
- To enable GMS, ensure that the Heartbeat Enabled checkbox is checked.
- To disable GMS, ensure that the Heartbeat Enabled checkbox is unchecked.

If you are enabling GMS and require values other than the defaults, change the default port and IP address for GMS.

Click Save.
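
The same change might also be made from the command line. The following is a sketch only: it assumes that the Heartbeat Enabled checkbox corresponds to an attribute named heartbeat-enabled on the cluster, and it uses cluster2 as a placeholder cluster name. Neither assumption is confirmed by this section, so list the cluster's attributes first and check the actual attribute name before setting anything.

```shell
# Assumption: the cluster is named cluster2 and exposes its attributes
# under the dotted name cluster2.*. List them to confirm the attribute name.
asadmin get "cluster2.*"

# Assumption: if the listing shows cluster2.heartbeat-enabled, setting it
# to false would disable GMS, and setting it to true would enable it.
asadmin set cluster2.heartbeat-enabled=false
```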

Configuring GMS

Configure GMS for your environment by changing the settings that determine how frequently GMS checks for failures. For example, you can change the timeout between failure detection attempts, the number of retries on a suspected failed member, or the timeout when checking for members of a cluster.

The following sample get command lists all the GMS properties associated with a cluster configuration, in this case cluster2-config:

asadmin get cluster2-config.group-management-service.*

cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
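
Because the ping-protocol timeout should be set to the default or higher, output in this format can be checked by a script. The following shell sketch reads "name = value" lines on standard input and warns if ping-protocol-timeout-in-millis is lower than 5000. The helper names gms_value and check_ping_timeout are hypothetical, and the sketch assumes that the value of 5000 shown in the sample output is the default.

```shell
#!/bin/sh
# Hypothetical helper: reads "name = value" lines (the format printed by
# asadmin get) on stdin and prints the value of the GMS setting named in $1.
gms_value() {
    sed -n "s/^.*\.group-management-service\.$1 = //p"
}

# Hypothetical check: returns 1 if the ping-protocol timeout is below 5000,
# which is assumed here to be the default, and 2 if the setting is missing.
check_ping_timeout() {
    ping=$(gms_value ping-protocol-timeout-in-millis)
    if [ -z "$ping" ]; then
        echo "ping-protocol-timeout-in-millis not found"
        return 2
    fi
    if [ "$ping" -lt 5000 ]; then
        echo "WARNING: ping-protocol-timeout-in-millis=$ping is below 5000"
        return 1
    fi
    echo "OK: ping-protocol-timeout-in-millis=$ping"
}

# Demonstration with two lines copied from the sample output.
check_ping_timeout <<'EOF'
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
EOF
```

Against a running domain, the output of the get command could be piped into check_ping_timeout in place of the sample lines.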

To Configure GMS Settings Using Admin Console

In the Admin Console, go to the Enterprise Server node.

Click Configuration -> cluster_name-config -> Group Management Service.

Example 6–1 Changing GMS Settings Using `asadmin get` and `set` commands

Instead of using the Admin Console, you can use the asadmin get and set commands.

asadmin> list cluster2-config.*
cluster2-config.admin-service
cluster2-config.admin-service.das-config
cluster2-config.admin-service.jmx-connector.system
cluster2-config.admin-service.jmx-connector.system.ssl
cluster2-config.availability-service
cluster2-config.availability-service.jms-availability
cluster2-config.availability-service.sip-container-availability
cluster2-config.diagnostic-service
cluster2-config.ejb-container
cluster2-config.ejb-container-availability
cluster2-config.ejb-container.ejb-timer-service
...
...
...
...
cluster2-config.web-container-availability

asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

asadmin> set cluster2-config.group-management-service.fd-protocol-max-tries=4
cluster2-config.group-management-service.fd-protocol-max-tries = 4

asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 4
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
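
The dotted names in this example follow a regular pattern, so a script can build them from a cluster name and a setting name. The following shell sketch shows one way to do this. The helper name gms_attr is hypothetical, and the sketch assumes that the cluster's configuration is named cluster_name-config, as it is for cluster2 in this example.

```shell
#!/bin/sh
# Hypothetical helper: builds the dotted name of a GMS setting, assuming
# that the configuration for cluster $1 is named $1-config.
gms_attr() {
    echo "$1-config.group-management-service.$2"
}

attr=$(gms_attr cluster2 fd-protocol-max-tries)
echo "$attr"

# With a running domain, the change made in the example could then be
# applied from the operating system shell instead of the asadmin prompt:
#   asadmin set "$attr=4"
#   asadmin get "$attr"
```

The asadmin invocations are left as comments because they require a running domain.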

If the cluster was already started when you created the load balancer, you must restart the cluster to start the load balancer.