This chapter describes how to use Communications Server clusters. It contains the following sections:
A cluster is a named collection of server instances that share the same applications, resources, and configuration information. You can group server instances on different machines into one logical cluster and administer them as one unit. You can easily control the lifecycle of a multi-machine cluster with the DAS.
Instances can be grouped into clusters. You can distribute an application to all instances in the cluster with a single deployment. Clusters are dynamic. When an instance is added or removed, the changes are handled automatically.
Clusters enable horizontal scalability, load balancing, and failover protection. By definition, all the instances in a cluster have the same resource and application configuration. When a server instance or a machine in a cluster fails, the load balancer detects the failure, redirects traffic from the failed instance to other instances in the cluster, and recovers the user session state. Since the same applications and resources are on all instances in the cluster, an instance can failover to any other instance in the cluster.
Cluster instances are organized in a ring topology. Each member in the ring sends in-memory state data to the next member in the ring, its replica partner, and receives state data from the previous member. As state data is updated in any member, it is replicated around the ring. When a member fails in the ring topology, the ring is broken. Group Management Service (GMS) can recognize the failure of a member. In that event, the replication framework reshapes the topology of the cluster and notifies members of the changes. When a member learns that its replica partner has disappeared, it selects a new partner from in-service members.
The Group Management Service (GMS) is an infrastructure component that is enabled for the instances in a cluster. When GMS is enabled, if a clustered instance fails, the cluster and the Domain Administration Server are aware of the failure and can take action when failure occurs. Many features of Communications Server depend upon GMS. For example, GMS is used by the IIOP failover, in-memory replication, transaction service, and timer service features.
If server instances in a cluster are located on different machines, ensure that the machines are on the same subnet.
The GMS feature is not available in the developer profile. In the cluster profile and the enterprise profile, GMS is enabled by default.
GMS is a core service of the Shoal framework. For more information about Shoal, visit the Project Shoal home page.
The following topics are addressed here:
The following settings are used in GMS failure detection:
Indicates the maximum number of missed heartbeats that the health monitor counts before marking an instance as a suspected failure. GMS also tries to make a peer-to-peer connection with the suspected member. If that connection also fails, the member is marked as suspect failed.
Indicates the interval (in milliseconds) between the heartbeat messages that each instance sends to announce that it is alive. It is also the interval that the master node waits between counting missed heartbeats for the max-tries logic. Lowering the number of retries means that a failure is suspected after fewer missed heartbeats. Lowering fd-protocol-timeout-in-millis below the default causes each member to send heartbeat messages more frequently, which can put more heartbeat traffic on the network than the system needs for failure detection. The effect varies with how quickly the deployment environment needs failures detected: fewer retries combined with a lower heartbeat interval detects failures more quickly. However, lowering the timeout or the retry count can produce false positives, because a member can be detected as failed when its delayed heartbeat merely reflects network load from other parts of the system. Conversely, a higher timeout results in fewer heartbeats, because the interval between them is longer, so failure detection takes longer. In addition, if a failed member restarts during this time, a new join notification is sent without a preceding failure notification, because failure detection and evaluation were not completed. A join notification without a preceding failure notification is logged.
Indicates the amount of time an instance's GMS module waits during instance startup (on a background thread, so that server startup does not wait for the timeout) to discover the master member of the group. In GMS, this process is called the master node discovery protocol. The instance's GMS module sends a master node query to the multicast group address. If the instance does not receive a master node response from another member within this time, it assumes the master is absent and takes over the master role: it sends a master node announcement to the group and starts responding to subsequent master node query messages from other members. In Communications Server, the domain administration server (DAS) joins a cluster as soon as the cluster is created, which makes the DAS the master member of the group. This allows cluster members to discover the master quickly, without incurring the timeout. Lowering the ping-protocol timeout causes an instance to time out before it has discovered the master node. As a result, there might be multiple masters in the group, which can lead to a master collision. A master collision causes the collision resolution protocol to start, in which the multiple masters tell each other who the true master candidate is, based on the sorted order of memberships (their UUIDs). The messaging impact can be extensive if there are many masters in the group. Therefore, set the ping-protocol timeout to the default value or higher.
Indicates the timeout for the verify suspect protocol used by the health monitor. After a member is marked as suspect based on missed heartbeats and a failed peer-to-peer connection check, the verify suspect protocol is activated. It waits for the specified timeout, checking for any further health state messages received in that time and attempting a peer-to-peer connection with the suspect member. If neither succeeds, the member is marked as failed and a failure notification is sent.
The retries, missed heartbeat intervals, peer-to-peer connection-based failure detection, watchdog-based failure reporting, and verify suspect protocols together ensure that failure detection in Communications Server is robust and reliable.
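These settings interact: the worst-case detection time is roughly the retry count multiplied by the heartbeat timeout, plus the verify suspect timeout. The following sketch shows one way to tune a cluster configuration for faster detection; the configuration name cluster2-config matches the samples later in this section, and the arithmetic in the comments is an approximation, not a guarantee.

```shell
# Approximate worst-case detection time:
#   fd-protocol-max-tries x fd-protocol-timeout-in-millis + vs-protocol-timeout-in-millis
# With the defaults: 3 x 2000 + 1500 = 7500 ms, about 7.5 seconds.
# Reducing the retries to 2 gives roughly 2 x 2000 + 1500 = 5500 ms,
# at the cost of a higher risk of false positives under network load:
asadmin set cluster2-config.group-management-service.fd-protocol-max-tries=2
```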
In the tree component, select Clusters.
Click the name of the cluster.
Under General Information, ensure that the Heartbeat Enabled checkbox is checked or unchecked as required.
If you are enabling GMS and require different values for these defaults, change the default port and IP address for GMS.
Click Save.
Configure GMS for your environment by changing the settings that determine how frequently GMS checks for failures. For example, you can change the timeout between failure detection attempts, the number of retries on a suspected failed member, or the timeout when checking for members of a cluster.
The following sample get command retrieves all the properties associated with a cluster configuration:
asadmin get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
In the Admin Console, go to the Communications Server node.
Click Configuration –> cluster_name-config –> Group Management Service.
Instead of using the Admin Console, you can use the asadmin get and set commands.
asadmin> list cluster2-config.*
cluster2-config.admin-service
cluster2-config.admin-service.das-config
cluster2-config.admin-service.jmx-connector.system
cluster2-config.admin-service.jmx-connector.system.ssl
cluster2-config.availability-service
cluster2-config.availability-service.jms-availability
cluster2-config.availability-service.sip-container-availability
cluster2-config.diagnostic-service
cluster2-config.ejb-container
cluster2-config.ejb-container-availability
cluster2-config.ejb-container.ejb-timer-service
...
cluster2-config.web-container-availability
asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 3
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
asadmin> set cluster2-config.group-management-service.fd-protocol-max-tries=4
cluster2-config.group-management-service.fd-protocol-max-tries = 4
asadmin> get cluster2-config.group-management-service.*
cluster2-config.group-management-service.fd-protocol-max-tries = 4
cluster2-config.group-management-service.fd-protocol-timeout-in-millis = 2000
cluster2-config.group-management-service.merge-protocol-max-interval-in-millis = 10000
cluster2-config.group-management-service.merge-protocol-min-interval-in-millis = 5000
cluster2-config.group-management-service.ping-protocol-timeout-in-millis = 5000
cluster2-config.group-management-service.vs-protocol-timeout-in-millis = 1500
If the cluster was already started when you created the load balancer, you must restart the cluster to start the load balancer.
In the tree component, select the Clusters node.
On the Clusters page, click New.
The Create Cluster page appears.
In the Name field, type a name for the cluster.
The name must:
Consist only of uppercase and lowercase letters, numbers, underscores, hyphens, and periods (.)
Be unique across all node agent names, server instance names, cluster names, and configuration names
Not be the word domain
In the Configuration field, choose a configuration from the drop-down list.
To create a cluster that does not use a shared configuration, choose default-config.
Leave the radio button labeled “Make a copy of the selected Configuration” selected. The copy of the default configuration will have the name cluster_name-config.
To create a cluster that uses a shared configuration, choose the configuration from the drop-down list.
Select the radio button labeled “Reference the selected Configuration” to create a cluster that uses the specified existing shared configuration.
Optionally, add server instances.
You can also add server instances after the cluster is created.
Server instances can reside on different machines. Every server instance must be associated with a node agent that can communicate with the DAS. Before you create server instances for the cluster, first create one or more node agents or node agent placeholders. See To Create a Node Agent Placeholder.
To create server instances:
Click OK.
Click OK on the Cluster Created Successfully page that appears.
create-cluster
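The create-cluster command can be used instead of the Admin Console. A minimal sketch, assuming the DAS is running on the default admin port and the cluster is to be named cluster1:

```shell
# Create a cluster named cluster1. By default, a copy of default-config
# (named cluster1-config) is created for the new cluster.
asadmin create-cluster --user admin --passwordfile password.txt --port 4848 cluster1
```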
For more details on how to administer clusters, server instances, and node agents, see Deploying Node Agents.
Before you can create server instances for a cluster, you must first create a node agent or node agent placeholder. See To Create a Node Agent Placeholder.
In the tree component, expand the Clusters node.
Select the node for the cluster.
Click the Instances tab to bring up the Clustered Server Instances page.
Click New to bring up the Create Clustered Server Instance page.
In the Name field, type a name for the server instance.
Choose a node agent from the Node Agents drop-down list.
Click OK.
create-instance
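The create-instance command is the command-line equivalent. A sketch, assuming a node agent named nodeagent1 already exists and the cluster is named cluster1 (both names are illustrative):

```shell
# Create a clustered server instance managed by nodeagent1.
asadmin create-instance --user admin --passwordfile password.txt --port 4848 \
    --nodeagent nodeagent1 --cluster cluster1 instance101
```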
In the tree component, expand the Clusters node.
Select the node for the cluster.
On the General Information page, you can perform these tasks:
Click Start Instances to start the clustered server instances.
Click Stop Instances to stop the clustered server instances.
Click Migrate EJB Timers to migrate the EJB timers from a stopped server instance to another server instance in the cluster.
start-cluster, stop-cluster, migrate-timers
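These lifecycle operations are also available from the command line. A sketch, assuming a cluster named cluster1:

```shell
# Start every instance in the cluster whose node agent is running.
asadmin start-cluster --user admin --passwordfile password.txt --port 4848 cluster1

# Stop all instances in the cluster.
asadmin stop-cluster --user admin --passwordfile password.txt --port 4848 cluster1
```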
In the tree component, expand the Clusters node.
Expand the node for the cluster that contains the server instance.
Click the Instances tab to display the Clustered Server Instances page.
On this page you can:
Select the checkbox for an instance and click Delete, Start, or Stop to perform the selected action on all the specified server instances.
Click the name of the instance to bring up the General Information page.
In the tree component, expand the Clusters node.
Expand the node for the cluster that contains the server instance.
Select the server instance node.
On the General Information page, you can:
Click Start Instance to start the instance.
Click Stop Instance to stop a running instance.
Click JNDI Browsing to browse the JNDI tree for a running instance.
Click View Log Files to open the server log viewer.
Click Rotate Log File to rotate the log file for the instance. This action schedules the log file for rotation. The actual rotation takes place the next time an entry is written to the log file.
Click Recover Transactions to recover incomplete transactions.
Click the Properties tab to modify the port numbers for the instance.
Click the Monitor tab to change monitoring properties.
In the tree component, expand the Clusters node.
Select the node for the cluster.
Click the Applications tab to bring up the Applications page.
On this page, you can:
From the Deploy drop-down list, select a type of application to deploy. On the Deployment page that appears, specify the application.
From the Filter drop-down list, select a type of application to display in the list.
To edit an application, click the application name.
Select the checkbox next to an application and choose Enable or Disable to enable or disable the application for the cluster.
In the tree component, expand the Clusters node.
Select the node for the cluster.
Click the Resources tab to bring up the Resources page.
On this page, you can:
Create a new resource for the cluster: from the New drop-down list, select a type of resource to create. Make sure to specify the cluster as a target when you create the resource.
Enable or Disable a resource globally: select the checkbox next to a resource and click Enable or Disable. This action does not remove the resource.
Display only resources of a particular type: from the Filter drop-down list, select a type of resource to display in the list.
Edit a resource: click the resource name.
In the tree component, select the Clusters node.
On the Clusters page, select the checkbox next to the name of the cluster.
Click Delete.
delete-cluster
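From the command line, a cluster must contain no server instances before it can be deleted. A sketch, assuming a cluster named cluster1 with one remaining instance, instance101 (both names are illustrative):

```shell
# Stop and delete the remaining instance, then delete the cluster itself.
asadmin stop-instance --user admin --passwordfile password.txt --port 4848 instance101
asadmin delete-instance --user admin --passwordfile password.txt --port 4848 instance101
asadmin delete-cluster --user admin --passwordfile password.txt --port 4848 cluster1
```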
If a server instance stops running abnormally or unexpectedly, it can be necessary to move the EJB timers installed on that server instance to a running server instance in the cluster. To do so, perform these steps:
In the tree component, expand the Clusters node.
Select the node for the cluster.
On the General Information page, click Migrate EJB Timers.
On the Migrate EJB Timers page:
From the Source drop-down list, choose the stopped server instance from which to migrate the timers.
(Optional) From the Destination drop-down list, choose the running server instance to which to migrate the timers.
If you leave this field empty, a running server instance will be randomly chosen.
Click OK.
Stop and restart the Destination server instance.
If the source server instance is running, or if the destination server instance is not running, the Admin Console displays an error message.
migrate-timers
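The migrate-timers command is the command-line equivalent. A hedged sketch, assuming a stopped instance instance101 and a running destination instance102; the exact option names can vary by release, so check the command's help output:

```shell
# Migrate EJB timers from the stopped instance101 to the running instance102.
# (Option names may differ by release; see "asadmin migrate-timers --help".)
asadmin migrate-timers --user admin --passwordfile password.txt --port 4848 \
    --target instance102 instance101
```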
Admin Console online help for configuring settings for the EJB timer service
In a clustered environment, a rolling upgrade redeploys an application with a minimal loss of service and sessions. A session can be any replicable artifact, including:
HttpSession
SingleSignOn
SipApplicationSession
SipSession
ServletTimer
DialogFragment
stateful session bean
You can use the load balancer and multiple clusters to upgrade components within the Communications Server without any loss of service. A component can, for example, be a JVM, the Communications Server, or a web application.
A rolling upgrade can take place under light to moderate load conditions. The procedure is designed to be brief, about 10-15 minutes per server instance.
Applications must be compatible across the upgrade: they must work correctly during the transition, when some server instances are running the old version and others the new one. The old and new versions must have the same shape for the Serializable classes that form the object graphs stored in sessions (for example, the same non-transient instance variables). If the shape of these classes changes, the application developer must ensure that serialization still behaves correctly. If the application is not compatible across the upgrade, the cluster must be stopped for a full redeployment.
The Basic3pcc sample application includes an Ant target, do-rollingupgrade, which performs all the rolling upgrade steps for you. This sample application is included with the Communications Server in the as-install/samples/sipservlet/Basic3pcc directory. The Basic3pcc application and the Ant target are available only with the JAR installer of Communications Server.
The following procedure describes how to upgrade an application running on all instances of a cluster.
Run the following commands on the converged load balancer in the cluster:
asadmin set domain.converged-lb-configs.clb_config_name.property.load-increase-factor=1
asadmin set domain.converged-lb-configs.clb_config_name.property.load-factor-increase-period-in-seconds=0
Set the value of the dynamic-reconfig attribute to false in the cluster.
Redeploy a new version of the application.
Because you have set the dynamic-reconfig attribute to false, the new version of the application will be loaded to the instance only when the instance restarts.
Disable the instance from the converged load balancer by using the following asadmin command:
asadmin disable-converged-lb-server instance_name
Back up the current session with the following command:
asadmin backup-session-store instance_name
By default, the session files are stored at instance-dir/rollingupgrade.
Stop the instance with the following command:
asadmin stop-instance instance_name
Start the instance.
asadmin start-instance instance_name
Restore the session.
asadmin restore-session-store instance_name
Enable the instance to the converged load balancer.
asadmin enable-converged-lb-server instance_name
Use the following command to get the latest version of the session store, which could have been updated by another instance accessing this session store.
asadmin reconcile-session-store instance_name
For all instances in the cluster, repeat steps 3 to 9.
Set the value of the dynamic-reconfig attribute to true in the cluster.
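The per-instance steps above (steps 3 through 9) can be sketched as a shell loop. The instance names and the absence of explicit admin options are assumptions for illustration:

```shell
#!/bin/sh
# Rolling upgrade, one instance at a time: drain the instance from the
# converged load balancer, back up its sessions, restart it (which loads
# the new application version), restore and reconcile the sessions, and
# put the instance back into service.
for inst in instance101 instance102 instance103; do
    asadmin disable-converged-lb-server "$inst"
    asadmin backup-session-store "$inst"
    asadmin stop-instance "$inst"
    asadmin start-instance "$inst"
    asadmin restore-session-store "$inst"
    asadmin enable-converged-lb-server "$inst"
    asadmin reconcile-session-store "$inst"
done
```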
Multi-homing enables Communications Server clusters to be used in an environment that uses multiple Network Interface Cards (NICs). A multi-homed host has multiple network connections, which may or may not be on the same network. Multi-homing provides the following benefits:
Provides redundant network connections within the same subnet. Having multiple NICs ensures that one or more network connections are available for communication.
Supports SIP communication across two or more different subnets. For example, for proxying SIP requests from User Agents in one subnet to User Agents in a second subnet, when the User Agents cannot directly communicate across subnets.
Binds to a specific IPv4 or IPv6 address and receives SIP and HTTP messages from that ip:port on a system that has multiple IP addresses configured. The responses to SIP requests received on a particular interface also go out through that interface.
Allows for configuring more than one external and/or more than one internal SIP listener. If more than one internal listener is configured, the converged load balancer uses them implicitly, in round-robin fashion, for proxying.
Supports separation of external and internal traffic.
You can separate the internal traffic (resulting from the converged load balancer, replication, and GMS) from the external traffic. Traffic separation enables you to plan the network better and augment certain parts of it as required.
Consider a simple cluster, cluster1, with three instances: instance101, instance102, and instance103. Each instance runs on a different machine. To separate the traffic, each multi-homed machine should have at least two IP addresses belonging to different networks: the first serves as the external IP and the second as the internal IP. The objective is to expose the external IP to the User Agents, so that all traffic from the User Agents flows through it. The internal IP is used only by the cluster instances for internal communication. The following procedure describes how to set up traffic separation.
Set the address attribute of the SIP listeners and HTTP listeners to the external address of the multi-homed machine.
Use the following commands:
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.address=\${EXTERNAL_LISTENER_ADDRESS}
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener1.address=\${EXTERNAL_LISTENER_ADDRESS}
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener2.address=\${EXTERNAL_LISTENER_ADDRESS}
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener2.address=\${EXTERNAL_LISTENER_ADDRESS}
Set the listener type of these listeners as external, so that they listen for traffic from User Agents and not for the converged load balancer proxying.
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.type=external
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener1.type=external
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener2.type=external
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.http-listener2.type=external
Create the system properties EXTERNAL_LISTENER_ADDRESS and INTERNAL_LISTENER_ADDRESS.
asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target cluster1 EXTERNAL_LISTENER_ADDRESS=0.0.0.0:INTERNAL_LISTENER_ADDRESS=0.0.0.0
asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target server EXTERNAL_LISTENER_ADDRESS=0.0.0.0:INTERNAL_LISTENER_ADDRESS=0.0.0.0
Create new listeners for listening to internal traffic.
asadmin create-sip-listener --user admin --port 4848 --passwordfile password.txt --target cluster1 --siplisteneraddress 0.0.0.0 --siplistenerport 25060 internal-sip-listener
asadmin create-http-listener --user admin --port 4848 --passwordfile password.txt --target cluster1 --listeneraddress 0.0.0.0 --defaultvs server --listenerport 28080 internal-http-listener
Set the address attribute of these new listeners to the internal address.
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.internal-sip-listener.address=\${INTERNAL_LISTENER_ADDRESS}
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.internal-http-listener.address=\${INTERNAL_LISTENER_ADDRESS}
Set the type attribute of these new listeners to internal.
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.internal-sip-listener.type=internal
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.http-service.http-listener.internal-http-listener.type=internal
Configure the IP address of the cluster instances.
asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance101 EXTERNAL_LISTENER_ADDRESS=10.12.152.29:INTERNAL_LISTENER_ADDRESS=192.168.2.1
asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance102 EXTERNAL_LISTENER_ADDRESS=10.12.152.39:INTERNAL_LISTENER_ADDRESS=192.168.2.3
asadmin create-system-properties --user admin --port 4848 --passwordfile password.txt --target instance103 EXTERNAL_LISTENER_ADDRESS=10.12.152.49:INTERNAL_LISTENER_ADDRESS=192.168.2.4
Restart the node agent and the cluster.
If you are using a hardware load balancer for spraying the SIP traffic to the individual instances, you need to set the external-sip-address and external-sip-port attributes to point to the hardware load balancer.
If you are using only one hardware load balancer for all SIP listeners, set the attributes of the SIP container.
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-container.external-sip-address=yourlbaddress
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-container.external-sip-port=yourlbport
If you are using multiple hardware load balancers, set the attributes of each of the SIP listeners:
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.external-sip-address=yourlbaddress
asadmin set --user admin --port 4848 --passwordfile password.txt cluster1-config.sip-service.sip-listener.sip-listener1.external-sip-port=yourlbport