Sun GlassFish Communications Server 2.0 High Availability Administration Guide

Chapter 3 Setting Up Clusters in Communications Server

This chapter describes how to use Communications Server clusters. It contains the following sections:

  • Overview of Clusters

  • Group Management Service

  • Working with Clusters

  • Using the Multi-homing Feature With a Cluster

Overview of Clusters

A cluster is a named collection of server instances that share the same applications, resources, and configuration information. You can group server instances on different machines into one logical cluster and administer them as one unit. You can easily control the lifecycle of a multi-machine cluster with the DAS.

Instances can be grouped into clusters. You can distribute an application to all instances in the cluster with a single deployment. Clusters are dynamic. When an instance is added or removed, the changes are handled automatically.

Clusters enable horizontal scalability, load balancing, and failover protection. By definition, all the instances in a cluster have the same resource and application configuration. When a server instance or a machine in a cluster fails, the load balancer detects the failure, redirects traffic from the failed instance to other instances in the cluster, and recovers the user session state. Since the same applications and resources are on all instances in the cluster, an instance can failover to any other instance in the cluster.

Cluster instances are organized in a ring topology. Each member in the ring sends in-memory state data to the next member in the ring, its replica partner, and receives state data from the previous member. As state data is updated in any member, it is replicated around the ring. When a member fails in the ring topology, the ring is broken. Group Management Service (GMS) can recognize the failure of a member. In that event, the replication framework reshapes the topology of the cluster and notifies members of the changes. When a member learns that its replica partner has disappeared, it selects a new partner from in-service members.

Group Management Service

The Group Management Service (GMS) is an infrastructure component that is enabled for the instances in a cluster. When GMS is enabled, if a clustered instance fails, the cluster and the Domain Administration Server are aware of the failure and can take action when failure occurs. Many features of Communications Server depend upon GMS. For example, GMS is used by the IIOP failover, in-memory replication, transaction service, and timer service features.

If server instances in a cluster are located on different machines, ensure that the machines are on the same subnet.


Note –

The GMS feature is not available in the developer profile. In the cluster profile and the enterprise profile, GMS is enabled by default.


GMS is a core service of the Shoal framework. For more information about Shoal, visit the Project Shoal home page.

The following topics are addressed here:

  • GMS Failure Detection Settings

  • To Enable or Disable GMS for a Cluster

  • Configuring GMS

GMS Failure Detection Settings

The following settings are used in GMS failure detection:

fd-protocol-max-tries

Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as a suspected failure. GMS also tries to make a peer-2-peer connection with the suspected member. If that connection also fails, the member is marked as suspected failed.

fd-protocol-timeout-in-millis

Indicates the failure detection interval (in milliseconds) between heartbeat messages, that is, how often an instance sends out its Alive message. It is also the interval, on the master node, that the max-retry logic waits between counting each missed heartbeat. Lowering the number of retries means that a failure is suspected after fewer missed heartbeats. Lowering fd-protocol-timeout-in-millis below the default causes each member to send heartbeat messages more frequently, which can put more heartbeat traffic on the network than the system needs to trigger the failure detection protocols. The effect varies with how quickly the deployment environment needs failures detected: fewer retries combined with a lower heartbeat interval detects failures more quickly, but lowering the timeout or the retry count can also produce false positives, because a member can be reported as failed when its heartbeat is merely delayed by network load from other parts of the server. Conversely, a higher timeout interval results in fewer heartbeats in the system because the interval between heartbeats is longer, so failure detection takes longer. In addition, if a failed member restarts during this longer window, a new join notification is sent without a preceding failure notification, because failure detection and evaluation were not completed; the occurrence of a join notification without a preceding failure notification is logged.

ping-protocol-timeout-in-millis

Indicates the amount of time an instance's GMS module waits during instance startup (on a background thread, so that server startup does not block on the timeout) to discover the master member of the group. In GMS, this process is called the master node discovery protocol. The instance's GMS module sends a master node query to the multicast group address. If the instance does not receive a master node response from another member within this time, it assumes that the master is absent and takes over the master role itself: it sends a master node announcement to the group and starts responding to subsequent master node query messages from other members. In Communications Server, the domain administration server (DAS) joins a cluster as soon as the cluster is created, which means the DAS becomes the master member of the group. This allows cluster members to discover the master quickly, without incurring the timeout. Lowering the ping-protocol timeout causes a member to time out sooner, possibly before it has discovered the existing master node. As a result, there might be multiple masters in the group, which can lead to a master collision. A master collision starts the collision resolution protocol, in which the multiple masters determine the true master candidate based on the sorted order of member UUIDs. The messaging impact of this resolution can be significant if there are many masters in the group. Therefore, set the ping-protocol timeout to the default value or higher.

vs-protocol-timeout-in-millis

Indicates the timeout, used by the health monitor, for the verify suspect protocol. After a member is marked as suspected based on missed heartbeats and a failed peer-2-peer connection check, the verify suspect protocol is activated. It waits for the specified timeout, checking for any further health state messages received in that time and attempting another peer-2-peer connection with the suspected member. If neither confirms that the member is alive, the member is marked as failed and a failure notification is sent.

The retries, missed heartbeat intervals, peer-2-peer connection-based failure detection, watchdog-based failure reporting, and the verify suspect protocols are all needed to ensure that failure detection is robust and reliable in Communications Server.
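
As a rough estimate based on the default values shown later in this chapter (fd-protocol-max-tries = 3, fd-protocol-timeout-in-millis = 2000, and vs-protocol-timeout-in-millis = 1500), confirming a failure takes on the order of 3 x 2000 milliseconds of missed heartbeats plus the 1500-millisecond verify suspect window, or roughly 7.5 seconds. The actual time in a given deployment also depends on network latency and load, so treat this figure only as a starting point for tuning.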

To Enable or Disable GMS for a Cluster

  1. In the tree component, select Clusters.

  2. Click the name of the cluster.

  3. Under General Information, ensure that the Heartbeat Enabled checkbox is checked or unchecked as required.

    • To enable GMS, ensure that the Heartbeat Enabled checkbox is checked.

    • To disable GMS, ensure that the Heartbeat Enabled checkbox is unchecked.

  4. If you are enabling GMS and require values other than the defaults, change the default port and IP address used for GMS. (A command-line sketch follows this procedure.)

  5. Click Save.
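
If you prefer the command line, these settings can usually be changed with the asadmin set command. The following sketch assumes a cluster named cluster1 and the heartbeat-enabled, heartbeat-address, and heartbeat-port attributes of the cluster element; verify the exact dotted names in your domain (for example, with asadmin get cluster1.*) before relying on them. The address and port values shown are placeholders.

asadmin set cluster1.heartbeat-enabled=true
asadmin set cluster1.heartbeat-address=228.9.1.2
asadmin set cluster1.heartbeat-port=9090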

Configuring GMS

Configure GMS for your environment by changing the settings that determine how frequently GMS checks for failures. For example, you can change the timeout between failure detection attempts, the number of retries on a suspected failed member, or the timeout when checking for members of a cluster.

The following sample asadmin get command retrieves all of the GMS properties associated with a cluster configuration (in this example, cluster2-config):

asadmin get cluster2-config.group-management-service.*

cluster2-config.group-management-service.fd-protocol-max-tries = 3

cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000

cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000

cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000

cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000

cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

To Configure GMS Settings Using the Admin Console

  1. In the Admin Console, go to the Communications Server node.

  2. Click Configuration -> cluster_name-config -> Group Management Service.


Example 3–1 Changing GMS Settings Using the asadmin get and set Commands

Instead of using the Admin Console, you can use the asadmin get and set commands.


asadmin> list  cluster2-config.*
cluster2-config.admin-service
cluster2-config.admin-service.das-config
cluster2-config.admin-service.jmx-connector.system
cluster2-config.admin-service.jmx-connector.system.ssl
cluster2-config.availability-service
cluster2-config.availability-service.jms-availability
cluster2-config.availability-service.sip-container-availability
cluster2-config.diagnostic-service
cluster2-config.ejb-container
cluster2-config.ejb-container-availability
cluster2-config.ejb-container.ejb-timer-service
...
...
...
...
cluster2-config.web-container-availability

asadmin> get  cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

asadmin> set cluster2-config.group-management-service.fd-protocol-max-tries=4
 cluster2-config.group-management-service.fd-protocol-max-tries = 4

asadmin> get  cluster2-config.group-management-service.*
 cluster2-config.group-management-service.fd-protocol-max-tries = 4
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500

If the cluster was already started when you created the load balancer, you must restart the cluster to start the load balancer.


Working with Clusters

To Create a Cluster

  1. In the tree component, select the Clusters node.

  2. On the Clusters page, click New.

    The Create Cluster page appears.

  3. In the Name field, type a name for the cluster.

    The name must:

    • Consist only of uppercase and lowercase letters, numbers, underscores, hyphens, and periods (.)

    • Be unique across all node agent names, server instance names, cluster names, and configuration names

    • Not be domain

  4. In the Configuration field, choose a configuration from the drop-down list.

    • To create a cluster that does not use a shared configuration, choose default-config.

      Leave the radio button labeled “Make a copy of the selected Configuration” selected. The copy of the default configuration will have the name cluster_name-config.

    • To create a cluster that uses a shared configuration, choose the configuration from the drop-down list.

      Select the radio button labeled “Reference the selected Configuration” to create a cluster that uses the specified existing shared configuration.

  5. Optionally, add server instances.

    You can also add server instances after the cluster is created.

    Server instances can reside on different machines. Every server instance needs to be associated with a node agent that can communicate with the DAS. Before you create server instances for the cluster, first create one or more node agents or node agent placeholders. See To Create a Node Agent Placeholder.

    To create server instances:

    1. In the Server Instances To Be Created area, click Add.

    2. Type a name for the instance in the Instance Name field.

    3. Choose a node agent from the Node Agent drop-down list.

  6. Click OK.

  7. Click OK on the Cluster Created Successfully page that appears.

Equivalent asadmin command

create-cluster
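
For example, a minimal invocation might look like the following; the administration user, port, and cluster name are placeholders, and additional options are described in the asadmin create-cluster help.

asadmin create-cluster --user admin --port 4848 cluster1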

See Also

For more details on how to administer clusters, server instances, and node agents, see Deploying Node Agents.

To Create Server Instances for a Cluster

Before You Begin

Before you can create server instances for a cluster, you must first create a node agent or node agent placeholder. See To Create a Node Agent Placeholder.

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Instances tab to bring up the Clustered Server Instances page.

  4. Click New to bring up the Create Clustered Server Instance page.

  5. In the Name field, type a name for the server instance.

  6. Choose a node agent from the Node Agents drop-down list.

  7. Click OK.

Equivalent asadmin command

create-instance
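
For example, assuming a node agent named nodeagent1 that is already registered with the DAS, a clustered instance might be created from the command line as follows; the names and port are placeholders.

asadmin create-instance --user admin --port 4848 --nodeagent nodeagent1 --cluster cluster1 instance101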

To Configure a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

    On the General Information page, you can perform these tasks:

    • Click Start Instances to start the clustered server instances.

    • Click Stop Instances to stop the clustered server instances.

    • Click Migrate EJB Timers to migrate the EJB timers from a stopped server instance to another server instance in the cluster.

Equivalent asadmin command

start-cluster, stop-cluster, migrate-timers
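
For example, the following commands start and stop a cluster named cluster1 from the command line; the user and port values are placeholders.

asadmin start-cluster --user admin --port 4848 cluster1
asadmin stop-cluster --user admin --port 4848 cluster1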

To Start, Stop, and Delete Clustered Instances

  1. In the tree component, expand the Clusters node.

  2. Expand the node for the cluster that contains the server instance.

  3. Click the Instances tab to display the Clustered Server Instances page.

    On this page you can:

    • Select the checkbox for an instance and click Delete, Start, or Stop to perform the selected action on all the specified server instances.

    • Click the name of the instance to bring up the General Information page.

To Configure Server Instances in a Cluster

  1. In the tree component, expand the Clusters node.

  2. Expand the node for the cluster that contains the server instance.

  3. Select the server instance node.

  4. On the General Information page, you can:

    • Click Start Instance to start the instance.

    • Click Stop Instance to stop a running instance.

    • Click JNDI Browsing to browse the JNDI tree for a running instance.

    • Click View Log Files to open the server log viewer.

    • Click Rotate Log File to rotate the log file for the instance. This action schedules the log file for rotation. The actual rotation takes place the next time an entry is written to the log file.

    • Click Recover Transactions to recover incomplete transactions.

    • Click the Properties tab to modify the port numbers for the instance.

    • Click the Monitor tab to change monitoring properties.

To Configure Applications for a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Applications tab to bring up the Applications page.

    On this page, you can:

    • From the Deploy drop-down list, select a type of application to deploy. On the Deployment page that appears, specify the application. (An equivalent asadmin deploy sketch follows this list.)

    • From the Filter drop-down list, select a type of application to display in the list.

    • To edit an application, click the application name.

    • Select the checkbox next to an application and choose Enable or Disable to enable or disable the application for the cluster.
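
As a command-line alternative, an application can typically be deployed to the cluster, and enabled or disabled for it, by naming the cluster as the target; the application name and archive below are placeholders.

asadmin deploy --user admin --port 4848 --target cluster1 myapp.war
asadmin enable --user admin --port 4848 --target cluster1 myapp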

To Configure Resources for a Cluster

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. Click the Resources tab to bring up the Resources page.

    On this page, you can:

    • Create a new resource for the cluster: from the New drop-down list, select a type of resource to create. Make sure to specify the cluster as a target when you create the resource. (An equivalent asadmin sketch follows this list.)

    • Enable or Disable a resource globally: select the checkbox next to a resource and click Enable or Disable. This action does not remove the resource.

    • Display only resources of a particular type: from the Filter drop-down list, select a type of resource to display in the list.

    • Edit a resource: click the resource name.
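
As a command-line alternative, a resource can typically be created with the cluster as its target, as in the following sketch; the connection pool and JNDI names are placeholders.

asadmin create-jdbc-resource --user admin --port 4848 --connectionpoolid mypool --target cluster1 jdbc/myDataSource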

To Delete a Cluster

  1. In the tree component, select the Clusters node.

  2. On the Clusters page, select the checkbox next to the name of the cluster.

  3. Click Delete.

Equivalent asadmin command

delete-cluster
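
For example, the following commands stop and then delete a cluster named cluster1 from the command line; the user and port values are placeholders.

asadmin stop-cluster --user admin --port 4848 cluster1
asadmin delete-cluster --user admin --port 4848 cluster1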

To Migrate EJB Timers

If a server instance stops running abnormally or unexpectedly, it can be necessary to move the EJB timers installed on that server instance to a running server instance in the cluster. To do so, perform these steps:

  1. In the tree component, expand the Clusters node.

  2. Select the node for the cluster.

  3. On the General Information page, click Migrate EJB Timers.

  4. On the Migrate EJB Timers page:

    1. From the Source drop-down list, choose the stopped server instance from which to migrate the timers.

    2. (Optional) From the Destination drop-down list, choose the running server instance to which to migrate the timers.

      If you leave this field empty, a running server instance will be randomly chosen.

    3. Click OK.

  5. Stop and restart the destination server instance.

    If the source server instance is running, or if the destination server instance is not running, the Admin Console displays an error message.

Equivalent asadmin command

migrate-timers
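
For example, the following sketch migrates the timers from the stopped instance instance101 and lets a running destination instance be chosen automatically; whether your version of asadmin also accepts an explicit destination option, and what that option is called, can be checked with asadmin migrate-timers --help.

asadmin migrate-timers --user admin --port 4848 instance101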

To Upgrade Components Without Loss of Service

In a clustered environment, a rolling upgrade redeploys an application with minimal loss of service and sessions. A session can be any replicable artifact.

You can use the load balancer and multiple clusters to upgrade components within the Communications Server without any loss of service. A component can, for example, be a JVM, the Communications Server, or a web application.

A rolling upgrade can take place under light to moderate load conditions. The procedure typically takes about 10 to 15 minutes per server instance.

Applications must be compatible across the upgrade. They must work correctly during the transition, when some server instances are running the old version and others are running the new one. The old and new versions must have the same shape of the Serializable classes that form object graphs stored in sessions (for example, the same non-transient instance variables). If the shape of these classes changes, the application developer must ensure that serialization still behaves correctly. If the application is not compatible across the upgrade, the cluster must be stopped for a full redeployment.

The Basic3pcc sample application includes an Ant target, do-rollingupgrade, which performs all the rolling upgrade steps for you. This sample application is included with the Communications Server in the as-install/samples/sipservlet/Basic3pcc directory. The Basic3pcc application and the Ant target are available only with the JAR installer of Communications Server.

The following procedure describes how to upgrade an application running on all instances of a cluster. A consolidated command-line sketch of the per-instance steps follows the procedure.

  1. Run the following commands against the converged load balancer configuration used by the cluster:

    asadmin set domain.converged-lb-configs.clb_config_name.property.load-increase-factor=1

    asadmin set domain.converged-lb-configs.clb_config_name.property.load-factor-increase-period-in-seconds=0

  2. Set the value of the dynamic-reconfig attribute to false in the cluster.

  3. Redeploy a new version of the application.

    Because you have set the dynamic-reconfig attribute to false, the new version of the application will be loaded to the instance only when the instance restarts.

  4. Disable the instance from the converged load balancer by using the following asadmin command:

    asadmin disable-converged-lb-server instance_name

  5. Back up the current session with the following command:

    asadmin backup-session-store instance_name

    By default, the session files are stored at instance-dir/rollingupgrade.

  6. Stop the instance with the following command:

    asadmin stop-instance instance_name

  7. Start the instance.

    asadmin start-instance instance_name

  8. Restore the session.

    asadmin restore-session-store instance_name

  9. Enable the instance to the converged load balancer.

    asadmin enable-converged-lb-server instance_name

  10. Use the following command to get the latest version of the session store, which could have been updated by another instance accessing this session store.

    asadmin reconcile-session-store instance_name

  11. For all other instances in the cluster, repeat Step 4 through Step 10.

  12. Set the value of the dynamic-reconfig attribute to true in the cluster.
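
The one-time redeployment of Step 3 and the per-instance work of Step 4 through Step 10, applied here to instance101, might look like the following consolidated sketch. The commands restate those given in the steps above; the application archive name is a placeholder.

asadmin deploy --user admin --port 4848 --force=true --target cluster1 myapp.war
asadmin disable-converged-lb-server instance101
asadmin backup-session-store instance101
asadmin stop-instance instance101
asadmin start-instance instance101
asadmin restore-session-store instance101
asadmin enable-converged-lb-server instance101
asadmin reconcile-session-store instance101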

Using the Multi-homing Feature With a Cluster

Multi-homing enables Communications Server clusters to be used in an environment that uses multiple Network Interface Cards (NICs). A multi-homed host has multiple network connections, which may or may not be on the same network. One key benefit of multi-homing is the traffic separation described in the following section.

Traffic Separation Using Multi-homing

You can separate the internal traffic (resulting from the converged load balancer, replication, and GMS) from the external traffic. Traffic separation enables you to plan the network better and augment certain parts of it as required.

Consider a simple cluster, cluster1, with three instances, instance101, instance102, and instance103. Each instance runs on a different machine. To separate the traffic, each multi-homed machine should have at least two IP addresses belonging to different networks: the first serves as the external IP address and the second as the internal IP address. The objective is to expose only the external IP addresses to the User Agents, so that all traffic from the User Agents goes through them. The internal IP addresses are used only by the cluster instances for internal communication. The following procedure describes how to set up traffic separation.

To Set Up Traffic Separation

  1. Set the address attribute of the SIP listeners and HTTP listeners to the external address of the multi-homed machine.

    Use the following commands:

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.address=\${EXTERNAL_LISTENER_ADDRESS}

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener1.address=\${EXTERNAL_LISTENER_ADDRESS}

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener2.address=\${EXTERNAL_LISTENER_ADDRESS}

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener2.address=\${EXTERNAL_LISTENER_ADDRESS}

  2. Set the listener type of these listeners as external, so that they listen for traffic from User Agents and not for the converged load balancer proxying.

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.type=external

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener1.type=external

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener2.type=external

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener2.type=external

  3. Create the system properties EXTERNAL_LISTENER_ADDRESS and INTERNAL_LISTENER_ADDRESS.

    • asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target cluster1 EXTERNAL_LISTENER_ADDRESS=0.0.0.0:INTERNAL_LISTENER_ADDRESS=0.0.0.0

    • asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target server EXTERNAL_LISTENER_ADDRESS=0.0.0.0:INTERNAL_LISTENER_ADDRESS=0.0.0.0

  4. Create new listeners for listening to internal traffic.

    • asadmin create-sip-listener --user admin --port 4848 --passwordfile password.txt --target cluster1 --siplisteneraddress 0.0.0.0 --siplistenerport 25060 internal-sip-listener

    • asadmin create-http-listener --user admin --port 4848 --passwordfile password.txt --target cluster1 --listeneraddress 0.0.0.0 --defaultvs server --listenerport 28080 internal-http-listener

  5. Set the address attribute of these new listeners to the internal address.

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.internal-sip-listener.address=\${INTERNAL_LISTENER_ADDRESS}

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.internal-http-listener.address=\${INTERNAL_LISTENER_ADDRESS}

  6. Set the type attribute of these new listeners to internal.

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.internal-sip-listener.type=internal

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.internal-http-listener.type=internal

  7. Configure the IP address of the cluster instances.

    • asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance101 EXTERNAL_LISTENER_ADDRESS=10.12.152.29:INTERNAL_LISTENER_ADDRESS=192.168.2.1

    • asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance102 EXTERNAL_LISTENER_ADDRESS=10.12.152.39:INTERNAL_LISTENER_ADDRESS=192.168.2.3

    • asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance103 EXTERNAL_LISTENER_ADDRESS=10.12.152.49:INTERNAL_LISTENER_ADDRESS=192.168.2.4

  8. Restart the node agent and the cluster.

  9. If you are using a hardware load balancer for spraying the SIP traffic to the individual instances, you need to set the external-sip-address and external-sip-port attributes to point to the hardware load balancer.

    If you are using only one hardware load balancer for all SIP listeners, set the attributes of the SIP container.

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-container.external-sip-address=yourlbaddress

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-container.external-sip-port=yourlbport

    If you are using multiple hardware load balancers, set the attributes of each of the SIP listeners:

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.external-sip-address=yourlbaddress

    • asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.external-sip-port=yourlbport