The Group Management Service (GMS) is an infrastructure component that is enabled for the instances in a cluster. When GMS is enabled and a clustered instance fails, the cluster and the Domain Administration Server are aware of the failure and can take corrective action. Many Enterprise Server features depend on GMS. For example, GMS is used by the IIOP failover, in-memory replication, transaction service, and timer service features.
If server instances in a cluster are located on different machines, ensure that the machines are on the same subnet.
The GMS feature is not available in the developer profile. In the cluster profile and the enterprise profile, GMS is enabled by default.
GMS is a core service of the Shoal framework. For more information about Shoal, visit the Project Shoal home page.
The following settings are used in GMS failure detection:
fd-protocol-max-tries

Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as a suspected failure. GMS also tries to make a peer-to-peer connection with the suspected member. If that connection also fails, the member is marked as suspected failed.
fd-protocol-timeout-in-millis

Indicates the failure detection interval (in milliseconds) between heartbeat messages, that is, how often each instance sends out its Alive message. It is also the number of milliseconds that the max-retry logic on the master node waits between counting each missed heartbeat. Lowering the number of retries means that failure is suspected after fewer missed heartbeats. Lowering fd-protocol-timeout-in-millis below the default results in more frequent heartbeat messages being sent from each member, potentially more heartbeat messages on the network than the system needs to trigger the failure detection protocols. The appropriate values depend on how quickly the deployment environment needs failures detected: fewer retries combined with a lower heartbeat interval detects failures more quickly. However, lowering the timeout or retry attempts can produce false positives, because a member may be reported as failed when, in fact, its heartbeat is merely delayed by network load from other parts of the server. Conversely, a higher timeout interval results in fewer heartbeats in the system because the interval between heartbeats is longer, so failure detection takes longer. In addition, if a failed member restarts during this time, the restart results in a new join notification but no failure notification, because failure detection and evaluation were not completed. A join notification that is not preceded by a failure notification is logged.
ping-protocol-timeout-in-millis

Indicates the amount of time an instance's GMS module waits during instance startup (on a background thread, so that server startup does not block on the timeout) to discover the master member of the group. In GMS, this process is called the master node discovery protocol. The instance's GMS module sends a master node query to the multicast group address. If the instance does not receive a master node response from another member within this time, it assumes that the master is absent and takes on the master role itself. The instance then sends a master node announcement to the group and starts responding to subsequent master node query messages from members. In Enterprise Server, the domain administration server (DAS) joins a cluster as soon as the cluster is created, which means the DAS becomes the master member of the group. This allows cluster members to discover the master quickly, without incurring the timeout. Lowering the ping-protocol timeout can cause a member to time out before it has discovered the master node. As a result, there might be multiple masters in the group, which can lead to a master collision. A master collision starts the resolution protocol, in which the multiple masters tell each other who the true master candidate is, based on the sorted order of memberships (by their UUIDs). If there are many masters in the group, the messaging impact can be extensive. Therefore, set the ping-protocol timeout value to the default or higher.
vs-protocol-timeout-in-millis

Indicates the verify suspect protocol's timeout used by the health monitor. After a member is marked as suspect based on missed heartbeats and a failed peer-to-peer connection check, the verify suspect protocol is activated and waits for the specified timeout, checking for any further health state messages received during that time and whether a peer-to-peer connection can be made with the suspect member. If neither occurs, the member is marked as failed and a failure notification is sent.
The retries, missed heartbeat intervals, peer-to-peer connection-based failure detection, watchdog-based failure reporting, and the verify suspect protocol are all needed to ensure that failure detection is robust and reliable in Enterprise Server.
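As a rough rule of thumb, the worst-case time to report a failure can be estimated from these settings: the missed-heartbeat retries accumulate first, then the verify suspect protocol waits before declaring failure. The sketch below is an approximation, not an exact formula, and assumes the default values shown later in this section:

```shell
# Approximate worst-case failure-detection latency from the GMS defaults.
FD_MAX_TRIES=3        # fd-protocol-max-tries
FD_TIMEOUT_MS=2000    # fd-protocol-timeout-in-millis
VS_TIMEOUT_MS=1500    # vs-protocol-timeout-in-millis

# Missed heartbeats accumulate before suspicion; the verify suspect
# protocol then waits before the member is declared failed.
DETECTION_MS=$(( FD_MAX_TRIES * FD_TIMEOUT_MS + VS_TIMEOUT_MS ))
echo "Approximate worst-case detection time: ${DETECTION_MS} ms"
```

With the defaults this comes to about 7500 ms. Lowering either the retry count or the timeouts shortens detection time at the cost of possible false positives, as described above.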
In the tree component, select Clusters.
Click the name of the cluster.
Under General Information, ensure that the Heartbeat Enabled checkbox is checked or unchecked as required.
If you are enabling GMS and require different values for these defaults, change the default port and IP address for GMS.
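For example, assuming the cluster is named cluster1 and that the GMS heartbeat settings are exposed as the cluster's heartbeat-address and heartbeat-port attributes (an assumption based on a default cluster-profile domain; the multicast address and port values below are hypothetical), the change might be sketched as:

```
asadmin get cluster1.heartbeat-address cluster1.heartbeat-port
asadmin set cluster1.heartbeat-address=228.9.1.2
asadmin set cluster1.heartbeat-port=9090
```

These commands require a running domain administration server; verify the attribute names in your domain with the get command before setting them.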
Configure GMS for your environment by changing the settings that determine how frequently GMS checks for failures. For example, you can change the timeout between failure detection attempts, the number of retries on a suspected failed member, or the timeout when checking for members of a cluster.
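For example, to make failure detection more aggressive for a cluster using the cluster2-config configuration, you might lower the heartbeat interval, retry count, and verify suspect timeout. The values below are hypothetical; see the cautions about false positives earlier in this section:

```
asadmin set cluster2-config.group-management-service.fd-protocol-timeout-in-millis=1500
asadmin set cluster2-config.group-management-service.fd-protocol-max-tries=2
asadmin set cluster2-config.group-management-service.vs-protocol-timeout-in-millis=1000
```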
Sample get command to retrieve all the properties associated with a cluster configuration (cluster-config-name):
asadmin get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
In the Admin Console, go to the Enterprise Server node.
Click Configuration -> cluster_name-config -> Group Management Service.
Instead of using the Admin Console, you can use the asadmin get and set commands.
asadmin> list cluster2-config.*
cluster2-config.admin-service
cluster2-config.admin-service.das-config
cluster2-config.admin-service.jmx-connector.system
cluster2-config.admin-service.jmx-connector.system.ssl
cluster2-config.availability-service
cluster2-config.availability-service.jms-availability
cluster2-config.availability-service.sip-container-availability
cluster2-config.diagnostic-service
cluster2-config.ejb-container
cluster2-config.ejb-container-availability
cluster2-config.ejb-container.ejb-timer-service
...
cluster2-config.web-container-availability
asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
asadmin> set cluster2-config.group-management-service.fd-protocol-max-tries=4
cluster2-config.group-management-service.fd-protocol-max-tries = 4
asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 4
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
If the cluster was already started when you created the load balancer, you must restart the cluster to start the load balancer.
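Assuming the cluster is named cluster1, restarting it can be sketched with the standard asadmin cluster commands:

```
asadmin stop-cluster cluster1
asadmin start-cluster cluster1
```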