Sun GlassFish Enterprise Server 2.1 High Availability Administration Guide

Chapter 6 Setting Up Clusters in Enterprise Server

This chapter describes how to use Enterprise Server clusters. It contains the following sections:

    • Overview of Clusters

    • Group Management Service

    • Working with Clusters

Overview of Clusters

A cluster is a named collection of server instances that share the same applications, resources, and configuration information. You can group server instances on different machines into one logical cluster and administer them as one unit, and you can easily control the lifecycle of a multi-machine cluster through the Domain Administration Server (DAS).

You can distribute an application to all instances in a cluster with a single deployment. Clusters are dynamic: when an instance is added or removed, the changes are handled automatically.

Clusters enable horizontal scalability, load balancing, and failover protection. By definition, all the instances in a cluster have the same resource and application configuration. When a server instance or a machine in a cluster fails, the load balancer detects the failure, redirects traffic from the failed instance to other instances in the cluster, and recovers the user session state. Because the same applications and resources are on all instances in the cluster, an instance can fail over to any other instance in the cluster.

Cluster instances are organized in a ring topology. Each member in the ring sends in-memory state data to the next member in the ring, its replica partner, and receives state data from the previous member. As state data is updated in any member, it is replicated around the ring. When a member fails in the ring topology, the ring is broken. Group Management Service (GMS) can recognize the failure of a member. In that event, the replication framework reshapes the topology of the cluster and notifies members of the changes. When a member learns that its replica partner has disappeared, it selects a new partner from in-service members.

Group Management Service

The Group Management Service (GMS) is an infrastructure component that is enabled for the instances in a cluster. When GMS is enabled, if a clustered instance fails, the cluster and the Domain Administration Server are aware of the failure and can take appropriate action. Many features of Enterprise Server depend on GMS. For example, GMS is used by the IIOP failover, in-memory replication, transaction service, and timer service features.

If server instances in a cluster are located on different machines, ensure that the machines are on the same subnet.


Note –

The GMS feature is not available in the developer profile. In the cluster profile and the enterprise profile, GMS is enabled by default.


GMS is a core service of the Shoal framework. For more information about Shoal, visit the Project Shoal home page.

The following topics are addressed here:

    • GMS Failure Detection Settings

    • To Enable or Disable GMS for a Cluster

    • Configuring GMS

GMS Failure Detection Settings

The following settings are used in GMS failure detection:

fd-protocol-max-tries

Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as a suspected failure. GMS also tries to make a peer-to-peer connection with the suspected member. If that connection also fails, the member is marked as suspected failed.

fd-protocol-timeout-in-millis

Indicates the interval (in milliseconds) between heartbeat messages, that is, how often an instance sends out its Alive message. This is also the time that the max-tries logic on the master node waits between counting missed heartbeats. Lowering the number of retries means that a failure is suspected after fewer missed heartbeats. Lowering fd-protocol-timeout-in-millis below the default causes each member to send heartbeat messages more frequently, which can put more heartbeat traffic on the network than the system needs to trigger its failure detection protocols.

The appropriate values depend on how quickly the deployment environment needs failures to be detected. Fewer retries combined with a shorter heartbeat interval detect failures more quickly, but lowering the timeout or the retry count can also produce false positives, because a member can be detected as failed when its heartbeat is in fact only delayed by network load from other parts of the server. Conversely, a higher timeout interval results in fewer heartbeats in the system, because the interval between heartbeats is longer, and failure detection consequently takes longer. In addition, if a failed member starts up again during this longer window, the startup results in a new join notification but no failure notification, because failure detection and verification were not completed; a join notification that is not preceded by a failure notification is logged.
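
As a rough worked example, using the default values shown later in this chapter and ignoring network and processing latency: with fd-protocol-timeout-in-millis at 2000 and fd-protocol-max-tries at 3, a member is suspected after roughly 3 x 2000 = 6000 milliseconds of consecutive missed heartbeats, and the verify suspect protocol (vs-protocol-timeout-in-millis, default 1500) then adds about another 1.5 seconds before a failure notification is sent, for a total of approximately 7.5 seconds.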

ping-protocol-timeout-in-millis

Indicates the amount of time an instance's GMS module waits during instance startup (on a background thread, so that server startup does not wait for the timeout) to discover the master member of the group. In GMS, this process is called the master node discovery protocol. The instance's GMS module sends a master node query to the multicast group address. If the instance does not receive a master node response from another member within this time, it assumes that the master is absent, takes on the master role itself, sends a master node announcement to the group, and starts responding to subsequent master node query messages from members. In Enterprise Server, the domain administration server (DAS) joins a cluster as soon as the cluster is created, which means that the DAS becomes the master member of the group. This allows cluster members to discover the master quickly, without incurring a timeout. Lowering the ping-protocol timeout could cause a member to time out before it has discovered the master node and to assume the master role itself. As a result, there might be multiple masters in the group, which can lead to a master collision. A master collision causes the resolution protocol to start, in which the multiple masters tell each other who the true master candidate is, based on the sorted order of memberships (by their UUIDs). If there are many masters in the group, the messaging impact can be extensive. Therefore, set the ping-protocol timeout to the default value or higher.

vs-protocol-timeout-in-millis

Indicates the timeout, used by the health monitor, for the verify suspect protocol. After a member is marked as suspected, based on missed heartbeats and a failed peer-to-peer connection check, the verify suspect protocol is activated. It waits for the specified timeout, checking for any further health state messages received in that time and attempting a peer-to-peer connection with the suspected member. If neither check succeeds, the member is marked as failed and a failure notification is sent.

The retries, the missed heartbeat intervals, peer-to-peer connection-based failure detection, watchdog-based failure reporting, and the verify suspect protocol are all needed to ensure that failure detection in Enterprise Server is robust and reliable.

To Enable or Disable GMS for a Cluster

  1. In the tree component, select Clusters.

  2. Click the name of the cluster.

  3. Under General Information, ensure that the Heartbeat Enabled checkbox is checked or unchecked as required.

    • To enable GMS, ensure that the Heartbeat Enabled checkbox is checked.

    • To disable GMS, ensure that the Heartbeat Enabled checkbox is unchecked.

  4. If you are enabling GMS and require values different from the defaults, change the default port and IP address that GMS uses.

  5. Click Save.
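
You can also toggle this setting from the command line with the asadmin set command. The following is a sketch only: cluster1 is an illustrative cluster name, and the dotted name cluster1.heartbeat-enabled is an assumption based on the cluster element's heartbeat-enabled attribute, so verify the exact dotted name in your domain (for example, with asadmin get "cluster1.*") before relying on it.

asadmin get cluster1.heartbeat-enabled
asadmin set cluster1.heartbeat-enabled=false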

Configuring GMS

Configure GMS for your environment by changing the settings that determine how frequently GMS checks for failures. For example, you can change the timeout between failure detection attempts, the number of retries on a suspected failed member, or the timeout when checking for members of a cluster.

The following sample get command retrieves all the properties associated with a named cluster configuration (here, cluster2-config):

asadmin get cluster2-config.group-management-service.*

cluster2-config.group-management-service.fd-protocol-max-tries = 3

cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000

cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000

cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000

cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000

cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

To Configure GMS Settings Using Admin Console

  1. In the Admin Console, go to the Enterprise Server node.

  2. Click Configuration -> cluster_name-config -> Group Management Service.


Example 6–1 Changing GMS Settings Using asadmin get and set Commands

Instead of using the Admin Console, you can use the asadmin get and set commands.


asadmin> list  cluster2-config.*
cluster2-config.admin-service
cluster2-config.admin-service.das-config
cluster2-config.admin-service.jmx-connector.system
cluster2-config.admin-service.jmx-connector.system.ssl
cluster2-config.availability-service
cluster2-config.availability-service.jms-availability
cluster2-config.availability-service.sip-container-availability
cluster2-config.diagnostic-service
cluster2-config.ejb-container
cluster2-config.ejb-container-availability
cluster2-config.ejb-container.ejb-timer-service
...
cluster2-config.web-container-availability

asadmin> get  cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

asadmin> set cluster2-config.group-management-service.fd-protocol-max-tries=4
cluster2-config.group-management-service.fd-protocol-max-tries = 4

asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 4
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

If the cluster is already running when you change GMS settings, restart the cluster for the changes to take effect.


Working with Clusters

To Create a Cluster

  1. In the tree component, select the Clusters node.

  2. On the Clusters page, click New.

    The Create Cluster page appears.

  3. In the Name field, type a name for the cluster.

    The name must:

    • Consist only of uppercase and lowercase letters, numbers, underscores, hyphens, and periods (.)

    • Be unique across all node agent names, server instance names, cluster names, and configuration names

    • Not be domain

  4. In the Configuration field, choose a configuration from the drop-down list.

    • To create a cluster that does not use a shared configuration, choose default-config.

      Leave the radio button labeled “Make a copy of the selected Configuration” selected. The copy of the default configuration will have the name cluster_name-config.

    • To create a cluster that uses a shared configuration, choose the configuration from the drop-down list.

      Select the radio button labeled “Reference the selected Configuration” to create a cluster that uses the specified existing shared configuration.

  5. Optionally, add server instances.

    You can also add server instances after the cluster is created.

    Server instances can reside on different machines. Every server instance must be associated with a node agent that can communicate with the DAS. Before you create server instances for the cluster, first create one or more node agents or node agent placeholders. See To Create a Node Agent Placeholder.

    To create server instances:

    1. In the Server Instances To Be Created area, click Add.

    2. Type a name for the instance in the Instance Name field.

    3. Choose a node agent from the Node Agent drop-down list.

  6. Click OK.

  7. Click OK on the Cluster Created Successfully page that appears.

Equivalent asadmin command

create-cluster
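
For example, the following hedged sketch creates a cluster named cluster1 (an illustrative name); by default, a copy of default-config named cluster1-config is created for the cluster, matching the Admin Console behavior described above:

asadmin create-cluster cluster1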

See Also

For more details on how to administer clusters, server instances, and node agents, see Deploying Node Agents.

To Create Server Instances for a Cluster

Before You Begin

Before you can create server instances for a cluster, you must first create a node agent or node agent placeholder. See To Create a Node Agent Placeholder.

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Instances tab to bring up the Clustered Server Instances page.

  4. Click New to bring up the Create Clustered Server Instance page.

  5. In the Name field, type a name for the server instance.

  6. Choose a node agent from the Node Agents drop-down list.

  7. Click OK.

Equivalent asadmin command

create-instance
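
For example, the following sketch creates a clustered server instance (the names instance1, agent1, and cluster1 are illustrative, and the node agent is assumed to exist already):

asadmin create-instance --cluster cluster1 --nodeagent agent1 instance1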


To Configure a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

    On the General Information page, you can perform these tasks:

    • Click Start Instances to start the clustered server instances.

    • Click Stop Instances to stop the clustered server instances.

    • Click Migrate EJB Timers to migrate the EJB timers from a stopped server instance to another server instance in the cluster.

Equivalent asadmin command

start-cluster, stop-cluster, migrate-timers
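
For example, to start and stop a cluster from the command line (cluster1 is an illustrative name):

asadmin start-cluster cluster1
asadmin stop-cluster cluster1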


To Start, Stop, and Delete Clustered Instances

  1. In the tree component, expand the Clusters node.

  2. Expand the node for the cluster that contains the server instance.

  3. Click the Instances tab to display the Clustered Server Instances page.

    On this page you can:

    • Select the checkbox for one or more instances and click Delete, Start, or Stop to perform that action on the selected server instances.

    • Click the name of the instance to bring up the General Information page.
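
From the command line, the equivalent subcommands for individual instances are start-instance, stop-instance, and delete-instance, as in this sketch (instance1 is an illustrative name):

asadmin start-instance instance1
asadmin stop-instance instance1
asadmin delete-instance instance1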

To Configure Server Instances in a Cluster

  1. In the tree component, expand the Clusters node.

  2. Expand the node for the cluster that contains the server instance.

  3. Select the server instance node.

  4. On the General Information page, you can:

    • Click Start Instance to start the instance.

    • Click Stop Instance to stop a running instance.

    • Click JNDI Browsing to browse the JNDI tree for a running instance.

    • Click View Log Files to open the server log viewer.

    • Click Rotate Log File to rotate the log file for the instance. This action schedules the log file for rotation. The actual rotation takes place the next time an entry is written to the log file.

    • Click Recover Transactions to recover incomplete transactions.

    • Click the Properties tab to modify the port numbers for the instance.

    • Click the Monitor tab to change monitoring properties.


To Configure Applications for a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Applications tab to bring up the Applications page.

    On this page, you can:

    • From the Deploy drop-down list, select a type of application to deploy. On the Deployment page that appears, specify the application.

    • From the Filter drop-down list, select a type of application to display in the list.

    • To edit an application, click the application name.

    • Select the checkbox next to an application and choose Enable or Disable to enable or disable the application for the cluster.
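
From the command line, the equivalent of deploying to the cluster is the deploy subcommand with the cluster as the target. A hedged example (the archive name hello.ear and the cluster name cluster1 are illustrative):

asadmin deploy --target cluster1 hello.ear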


To Configure Resources for a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Resources tab to bring up the Resources page.

    On this page, you can:

    • Create a new resource for the cluster: from the New drop-down list, select a type of resource to create. Make sure to specify the cluster as a target when you create the resource.

    • Enable or Disable a resource globally: select the checkbox next to a resource and click Enable or Disable. This action does not remove the resource.

    • Display only resources of a particular type: from the Filter drop-down list, select a type of resource to display in the list.

    • Edit a resource: click the resource name.
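
From the command line, you create a resource for the cluster by passing the cluster as the target. A hedged sketch using create-jdbc-resource (the pool name DerbyPool, the JNDI name jdbc/myDS, and the cluster name cluster1 are illustrative, and the connection pool is assumed to exist already):

asadmin create-jdbc-resource --connectionpoolid DerbyPool --target cluster1 jdbc/myDS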


To Delete a Cluster

  1. In the tree component, select the Clusters node.

  2. On the Clusters page, select the checkbox next to the name of the cluster.

  3. Click Delete.

Equivalent asadmin command

delete-cluster
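
For example (cluster1 is an illustrative name; note that a cluster generally must be stopped, and emptied of server instances, before it can be deleted):

asadmin stop-cluster cluster1
asadmin delete-cluster cluster1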


To Migrate EJB Timers

If a server instance stops running abnormally or unexpectedly, you might need to move the EJB timers installed on that server instance to a running server instance in the cluster. To do so, perform these steps:

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. On the General Information page, click Migrate EJB Timers.

  4. On the Migrate EJB Timers page:

    1. From the Source drop-down list, choose the stopped server instance from which to migrate the timers.

    2. (Optional) From the Destination drop-down list, choose the running server instance to which to migrate the timers.

      If you leave this field empty, a running server instance will be randomly chosen.

    3. Click OK.

  5. Stop and restart the Destination server instance.

    If the source server instance is running, or if the destination server instance is not running, the Admin Console displays an error message.

Equivalent asadmin command

migrate-timers
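
A hedged example follows. The --target option name and the instance names are assumptions, so verify the exact syntax with asadmin migrate-timers --help; the sketch migrates timers from the stopped instance1 to the running instance2:

asadmin migrate-timers --target instance2 instance1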


To Upgrade Components Without Loss of Service

You can use the load balancer and multiple clusters to upgrade components within the Enterprise Server without any loss of service. A component can, for example, be a JVM, the Enterprise Server, or a web application.

This approach is not possible if:


Caution –

Upgrade all server instances in a cluster together. Otherwise, there is a risk of version mismatch caused by a session failing over from one instance to another where the instances have different versions of components running.


  1. Stop one of the clusters using the Stop Cluster button on the General Information page for the cluster.

  2. Upgrade the component in that cluster.

  3. Start the cluster using the Start Cluster button on the General Information page for the cluster.

  4. Repeat the process with the other clusters, one by one.

    Because sessions in one cluster never fail over to sessions in another cluster, there is no risk of version mismatch caused by a session failing over from a server instance that is running one version of the component to a server instance (in a different cluster) that is running a different version of the component. In this way, a cluster acts as a safe boundary for session failover for the server instances within it.
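
The same rolling procedure can be sketched from the command line. This is a hedged outline only; cluster1 and cluster2 are illustrative names, and the upgrade step itself depends on the component being upgraded:

asadmin stop-cluster cluster1
(upgrade the component on the machines that host cluster1's instances)
asadmin start-cluster cluster1
asadmin stop-cluster cluster2
(upgrade the component on the machines that host cluster2's instances)
asadmin start-cluster cluster2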