CHAPTER 11

Failover Groups

Sun Ray servers configured in a failover group (FOG) provide users with a high level of availability when one of those servers becomes unavailable because of a network or system failure. This chapter describes how to configure failover groups.

For a discussion of how to use multiple failover groups for regional hotdesking, see Hotdesking (Mobile Sessions).

This chapter covers these topics:

Overview

Setting Up IP Addressing

Group Manager

Load Balancing

Setting Up a Failover Group

Viewing Administration Status

Recovery Issues and Procedures

Setting Up a Group Signature

Taking Servers Offline

Overview

A failover group consists of two or more Sun Ray servers grouped together to provide highly available and scalable Sun Ray service for a population of Sun Ray DTUs. Releases earlier than 2.0 supported DTUs available to the servers only on a common, dedicated interconnect. Beginning with the 2.0 release, this capability was expanded to allow access across the LAN to either local or remote Sun Ray devices. However, the servers in a failover group must still be able to reach one another, using multicast or broadcast, over at least one shared subnet. Servers in a group authenticate (or “trust”) one another using a common group signature. The group signature is a key used to sign messages sent between servers in the group; it must be configured to be identical on each server.

Failover groups that use more than one version of Sun Ray Server Software will be unable to use all the features provided in the latest releases. On the other hand, the failover group can be a heterogeneous group of Sun servers running various releases of the Solaris operating environment, such as Solaris 9 and Solaris 10.

When a dedicated interconnect is used, all servers in the failover group should have access to, and be accessible by, all the Sun Ray DTUs on a given subnet. The failover environment supports the same interconnect topologies that are supported by a single-server Sun Ray environment; however, switches should be multicast-enabled.

FIGURE 11-1 illustrates a typical Sun Ray failover group. For an example of a redundant failover group, see FIGURE 11-2.

FIGURE 11-1 Simple Failover Group

Sun Ray servers on a public network connected to Sun Ray DTUs via a switch

When a server in a failover group fails for any reason, each Sun Ray DTU connected to that server reconnects to another server in the same failover group. The failover occurs at the user authentication level: the DTU connects to a previously existing session for the user’s token. If there is no existing session, the DTU connects to a server selected by the load-balancing algorithm. This server then presents a login screen to the user, and the user must log in again to create a new session. The state of the session on the failed server is lost.

The principal components needed to implement failover are:

Group Manager

A module that monitors the availability (liveness) of the Sun Ray servers and facilitates redirection when needed.

Multiple, coexisting Dynamic Host Configuration Protocol (DHCP) servers

All DHCP servers configured to assign IP addresses to Sun Ray DTUs have a non-overlapping subset of the available address pool.
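The non-overlap requirement can be checked mechanically before the interfaces are configured. The following is a minimal sketch, not part of Sun Ray Server Software: the `pools` dictionary and the `overlapping_pools` helper are hypothetical names, and the sample ranges are the first three class C ranges from TABLE 11-1.

```python
from ipaddress import IPv4Address
from itertools import combinations

# Hypothetical input: (first address, last address) offered by each DHCP server.
pools = {
    "serverA": ("192.168.128.16", "192.168.128.49"),
    "serverB": ("192.168.128.50", "192.168.128.83"),
    "serverC": ("192.168.128.84", "192.168.128.117"),
}

def overlapping_pools(pools):
    """Return pairs of server names whose address ranges share any address."""
    spans = {
        name: (int(IPv4Address(first)), int(IPv4Address(last)))
        for name, (first, last) in pools.items()
    }
    clashes = []
    for (a, (a_lo, a_hi)), (b, (b_lo, b_hi)) in combinations(spans.items(), 2):
        # Two closed intervals overlap when each one starts before the other ends.
        if a_lo <= b_hi and b_lo <= a_hi:
            clashes.append((a, b))
    return clashes

print(overlapping_pools(pools))  # an empty list means the pools are disjoint
```

An empty result means no two servers can hand out the same address; any pair that is reported should be corrected with utadm before the group goes into service.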

Note - The failover feature cannot work properly if the IP addresses and DHCP configuration data are not set up properly when the interfaces are configured. In particular, if any Sun Ray server’s interconnect IP address is a duplicate of any other server’s interconnect IP address, the Sun Ray Authentication Manager throws “Out of Memory” errors.

FIGURE 11-2 Redundant Failover Group

Each server is cross-connected to two switches, each connected to a sub-net of DTUs

The redundant failover group illustrated in FIGURE 11-2 can provide maximum resources to a few Sun Ray DTUs. The server sr47 is the primary Sun Ray server, and sr48 is the secondary Sun Ray server; other secondary servers (sr49, sr50, ...) are not shown.

Setting Up IP Addressing

The utadm command assists you in setting up a DHCP server. The default DHCP setup configures each interface for 225 hosts and uses private network addresses for the Sun Ray interconnect. For more information on using the utadm command, see the man page for utadm.

Before setting up IP addressing, you must decide upon an addressing scheme. The following examples discuss setting up class C and class B addresses.

Setting Up Server and Client Addresses

The loss of a server usually implies the loss of its DHCP service and its allocation of IP addresses. Therefore, more DHCP addresses must be available from the address pool than there are Sun Ray DTUs. Consider the situation of five servers and 100 DTUs. If one of the servers fails, the remaining DHCP servers must have enough available addresses so that every “orphaned” DTU gets a new working address.

TABLE 11-1 lists configuration settings used to configure five servers for 100 DTUs, accommodating the failure of two servers (class C) or four servers (class B).

TABLE 11-1 Configuring Five Servers for 100 DTUs
	Class C (2 Servers Fail)		Class B (4 Servers Fail)
Servers	Interface Address	DTU Address Range	Interface Address	DTU Address Range
`serverA`	192.168.128.1	192.168.128.16 to 192.168.128.49	192.168.128.1	192.168.128.16 to 192.168.128.116
`serverB`	192.168.128.2	192.168.128.50 to 192.168.128.83	192.168.129.1	192.168.129.16 to 192.168.129.116
`serverC`	192.168.128.3	192.168.128.84 to 192.168.128.117	192.168.130.1	192.168.130.16 to 192.168.130.116
`serverD`	192.168.128.4	192.168.128.118 to 192.168.128.151	192.168.131.1	192.168.131.16 to 192.168.131.116
`serverE`	192.168.128.5	192.168.128.152 to 192.168.128.185	192.168.132.1	192.168.132.16 to 192.168.132.116

The formula for address allocation is: address range (AR) = number of DTUs/(total servers - failed servers), rounded up to the next whole number. For example, in the case of the loss of two servers, each DHCP server must be given a range of 100/(5-2) = 33.3, rounded up to 34 addresses.

Ideally, each server would have an address for each DTU. This would require a class B network. Consider these conditions:

If AR multiplied by the total number of servers is less than or equal to 225, configure for a class C network

If AR multiplied by the total number of servers is greater than 225, configure for a class B network
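The allocation formula and the two conditions above can be worked through in a few lines. This is an illustrative sketch only: the function names are hypothetical, and rounding up is assumed because the worked example in the text gives 34 addresses for 100/(5-2).

```python
import math

CLASS_C_LIMIT = 225  # threshold stated in the conditions above

def address_range(dtus, total_servers, failed_servers):
    """AR = number of DTUs / (total servers - failed servers), rounded up."""
    surviving = total_servers - failed_servers
    if surviving <= 0:
        raise ValueError("at least one server must survive")
    return math.ceil(dtus / surviving)

def network_class(dtus, total_servers, failed_servers):
    """Apply the AR * total-servers test to choose class C or class B."""
    ar = address_range(dtus, total_servers, failed_servers)
    return "C" if ar * total_servers <= CLASS_C_LIMIT else "B"

# The two scenarios from TABLE 11-1: five servers, 100 DTUs.
print(address_range(100, 5, 2), network_class(100, 5, 2))  # 34 C
print(address_range(100, 5, 4), network_class(100, 5, 4))  # 100 B
```

With two failures tolerated, 34 x 5 = 170 addresses fit within the class C limit; with four failures tolerated, 100 x 5 = 500 addresses do not, so a class B scheme is needed.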

Tip - If all available DHCP addresses are allocated, it is possible for a Sun Ray DTU to request an address yet not find one available, perhaps because another unit has been allocated IP addresses by multiple servers. To prevent this condition, give each DHCP server enough addresses to serve all the DTUs in a failover group.

Server Addresses

Server IP addresses assigned for the Sun Ray interconnect should all be unique. Use the utadm tool to assign them.

When the Sun Ray DTU boots, it sends a DHCP broadcast request to all possible servers on the network interface. One (or more) server responds with an IP address allocated from its range of addresses. The DTU accepts the first IP address that it receives and configures itself to send and receive at that address.

The accepted DHCP response also contains information about the IP address and port numbers of the Authentication Managers on the server that sent the response.

The DTU then tries to establish a TCP connection to an Authentication Manager on that server. If it is unable to connect, it uses a protocol similar to DHCP, in which it uses a broadcast message to ask the Authentication Managers to identify themselves. The DTU then tries to connect to the Authentication Managers that respond in the order in which the responses are received.

Note - For the broadcast feature to be enabled, the broadcast address (255.255.255.255) must be the last one in the authentication server list. Any addresses after the broadcast address are ignored. If the local server is not on the list, Sun Ray DTUs cannot attempt to contact it.
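The rule in this Note can be expressed as a small check on a candidate list. This is a sketch under stated assumptions: it assumes the list in question is the authentication server list configured through utadm, it models only the ordering rule described in the Note, and the `effective_auth_servers` helper is a hypothetical name, not part of the product.

```python
BROADCAST = "255.255.255.255"

def effective_auth_servers(auth_list):
    """Return (servers that are actually tried, whether broadcast is enabled).

    Entries after the broadcast address are dropped, as the Note describes.
    """
    if BROADCAST in auth_list:
        cut = auth_list.index(BROADCAST)
        return auth_list[:cut], True
    return list(auth_list), False

servers, broadcast = effective_auth_servers(
    ["192.168.128.1", "192.168.128.2", BROADCAST, "192.168.128.3"]
)
print(servers, broadcast)  # 192.168.128.3 is ignored because it follows the broadcast address
```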

Once a TCP connection to an Authentication Manager has been established, the DTU presents its token. The token is either a pseudo-token representing the individual DTU (its unique Ethernet address) or a smart card. The Session Manager then starts an X window/X server session and binds the token to that session.

The Authentication Manager then sends a query to all the other Authentication Managers on the same subnet and asks for information about existing sessions for the token. The other Authentication Managers respond, indicating whether there is a session for the token and the last time the token was connected to the session.

The requesting Authentication Manager selects the server with the latest connection time and redirects the DTU to that server. If no session is found for the token, the requesting Authentication Manager selects the server with the lightest load and redirects the token to that server. A new session is created for the token.
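The selection rule described in the last two paragraphs can be sketched as follows. This illustrates the decision only and is not the Authentication Manager's implementation: the `Reply` record, its field names, and the single numeric `load` value are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Reply:
    server: str
    load: float                    # hypothetical metric; lower means lighter load
    last_connect: Optional[float]  # time the token last connected, None if no session

def choose_server(replies):
    """Prefer the most recently used session; otherwise pick the lightest load."""
    with_session = [r for r in replies if r.last_connect is not None]
    if with_session:
        return max(with_session, key=lambda r: r.last_connect).server
    return min(replies, key=lambda r: r.load).server

replies = [
    Reply("sr47", load=0.70, last_connect=1000.0),
    Reply("sr48", load=0.20, last_connect=None),
    Reply("sr49", load=0.50, last_connect=2500.0),
]
print(choose_server(replies))  # sr49: it holds the most recently connected session
```

If none of the replies reports a session, the same call falls through to the load comparison and a new session is created on the selected server.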

The Authentication Manager enables both implicit (smart card) and explicit switching. For explicit switching, see Group Manager.

Configuring DHCP

In a large IP network, a DHCP server distributes the IP addresses and other configuration information for interfaces on that network.

Coexistence of the Sun Ray Server With Other DHCP Servers

The Sun Ray DHCP server can coexist with DHCP servers on other subnets, provided you isolate the Sun Ray DHCP server from other DHCP traffic. Verify that all routers on the network are configured not to relay DHCP requests. This is the default behavior for most routers.

Caution - If the IP addresses and DHCP configuration data are not set up correctly when the interfaces are configured, the failover feature cannot work properly. In particular, configuring the Sun Ray server’s interconnect IP address as a duplicate of any other server’s interconnect IP address may cause the Sun Ray Authentication Manager to throw “Out of Memory” errors.

Administering Other Clients

If the Sun Ray server has multiple interfaces, one of which is the Sun Ray interconnect, the Sun Ray DHCP server should be able to manage both the Sun Ray interconnect and the other interfaces without cross-interference.

To Set Up IP Addressing on Multiple Servers, Each with One Sun Ray Interface

1. Log in to the Sun Ray server as superuser and open a shell window. Type:

# /opt/SUNWut/sbin/utadm -a <interface_name>

where <interface_name> is the name of the Sun Ray network interface to be configured; for example, hme[0-9], qfe[0-9], or ge[0-9]. You must be logged on as superuser to run this command. The utadm script configures the interface (for example, hme1) at the subnet (in this example, 128).

The script displays default values, such as the following:

Selected values for interface "hme1" 
    host address:       192.168.128.1
    net mask:           255.255.255.0
    net address:        192.168.128.0
    host name:          serverB-hme1
    net name:           SunRay-hme1
    first unit address: 192.168.128.16
    last unit address:  192.168.128.240
    auth server list:   192.168.128.1
    firmware server:    192.168.128.1
    router:             192.168.128.1

The default values are the same for each server in a failover group. Certain values must be changed to be unique to each server.

2. When you are asked to accept the default values, type n:

Accept as is? ([Y]/N): n

3. Change the second server’s IP address to a unique value, in this case 192.168.128.2:

new host address: [192.168.128.1] 192.168.128.2

4. Accept the default values for netmask, host name, and net name:

new netmask: [255.255.255.0] 
new host name: [serverB-hme1]

5. Change the DTU address ranges for the interconnect to unique values. For example:

Do you want to offer IP addresses for this interface? [Y/N]:
new first Sun Ray address: [192.168.128.16] 192.168.128.50
number of Sun Ray addresses to allocate: [205] 34

6. Accept the default firmware server and router values:

new firmware server: [192.168.128.2] 
new router: [192.168.128.2]

The utadm script asks if you want to specify an authentication server list:

auth server list:     192.168.128.1
To read auth server list from file, enter file name:
Auth server IP address (enter <CR> to end list):
If no server in the auth server list responds, should an auth server be located by broadcasting on the network? ([Y]/N):

These servers are specified by a file containing a space-delimited list of server IP addresses or by manually entering the server IP addresses.

The newly selected values for interface hme1 are displayed:

Selected values for interface "hme1" 
    host address:       192.168.128.2
    net mask:           255.255.255.0
    net address:        192.168.128.0
    host name:          serverB-hme1
    net name:           SunRay-hme1
    first unit address: 192.168.128.50
    last unit address:  192.168.128.83
    auth server list:   192.168.128.1
    firmware server:    192.168.128.2
    router:             192.168.128.2

7. If these are correct, accept the new values:

Accept as is? ([Y]/N): y

8. Stop and restart the server and power cycle the DTUs to download the firmware.
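The last unit address that utadm reports follows from the first address and the number of addresses entered in step 5. The sketch below reproduces that arithmetic with Python's standard `ipaddress` module; the helper name is hypothetical, and the calculation is inferred from the sample values shown in this procedure rather than taken from the utadm source.

```python
from ipaddress import IPv4Address

def last_unit_address(first, count):
    """Last address of a block of `count` addresses starting at `first`."""
    return str(IPv4Address(first) + (count - 1))

# Values from the procedure: first address 192.168.128.50, 34 addresses.
print(last_unit_address("192.168.128.50", 34))   # 192.168.128.83
# Default block: first address 192.168.128.16, 225 hosts.
print(last_unit_address("192.168.128.16", 225))  # 192.168.128.240
```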

TABLE 11-2 lists the options available for the utadm command. For additional information, see the utadm man page.

TABLE 11-2 Available Options
Option	Definition
-c	Create a framework for the Sun Ray interconnect.
-r	Remove all Sun Ray interconnects.
-A <subnetwork>	Configure the subnetwork specified as a Sun Ray subnetwork. This option only configures the DHCP service to allocate IP addresses and/or to provide Sun Ray parameters to Sun Ray clients. It also automatically turns on support for LAN connections from a shared subnetwork.
-a <interface_name>	Add <interface_name> as Sun Ray interconnect.
-D <subnetwork>	Delete the subnetwork specified from the list of configured Sun Ray subnetworks.
-d <interface_name>	Delete <interface_name> as Sun Ray interconnect.
-l	Print the current configuration for all the Sun Ray subnetworks, including remote subnetworks.
-p	Print the current configuration.
-f	Take a server offline.
-n	Bring a server online.
-x	Print the current configuration in a machine-readable format.

Group Manager

Every server has a group manager module that monitors availability and facilitates redirection. It is coupled with the Authentication Manager.

In setting policies, the Authentication Manager uses the selected authentication modules and decides what tokens are valid and which users have access.

Warning - The same policy must exist on every server in the failover group or undesirable results might occur.

The Group Managers create maps of the failover group topology by exchanging keepalive messages among themselves. These keepalive messages are sent to a well-known UDP port (typically 7009) on all of the configured network interfaces. The keepalive message contains enough information for each Sun Ray server to construct a list of servers and the common subnets that each server can access. In addition, the Group Manager remembers the last time that a keepalive message was received from each server on each interface.

The keepalive message contains the following information about the server:

Server’s host name

Server’s primary IP address

Elapsed time since it was booted

IP information for every interface it can reach

Machine information (number and speed of CPUs, configured RAM, and so on)

Load information (CPU and memory utilization, number of sessions, and so on)

Note - The last two items are used to facilitate load distribution. See Load Balancing.

The information maintained by the Group Manager is used primarily for server selection when a token is presented. The server and subnet information is used to determine the servers to which a given DTU can connect. These servers are queried about sessions belonging to the token. Servers whose last keepalive message is older than the timeout are deleted from the list, since either the network connection or the server is probably down.
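The pruning step described above, in which servers whose last keepalive is older than the timeout are dropped, can be sketched like this. The data layout, the function name, and the 60-second timeout are all hypothetical; the chapter does not state the actual timeout or how the Group Manager stores this information.

```python
def live_servers(last_keepalive, now, timeout_s):
    """Return the servers whose most recent keepalive is within the timeout.

    last_keepalive maps a server host name to the time (in seconds) at which
    its latest keepalive message was received.
    """
    return sorted(
        name for name, seen in last_keepalive.items()
        if now - seen <= timeout_s
    )

# Hypothetical receive times, in seconds since some common reference point.
last_keepalive = {"sr47": 990.0, "sr48": 930.0, "sr49": 1000.0}

# With an assumed 60-second timeout at time 1000, sr48 is treated as down.
print(live_servers(last_keepalive, now=1000.0, timeout_s=60.0))  # ['sr47', 'sr49']
```

Only the servers that remain on the list would be queried about sessions for a token or considered as redirection targets.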

Redirection

In addition to automatic redirection at authentication, you can use the utselect or utswitch command for manual redirection.

Note - The utselect GUI is the preferred method to use for server selection. For more information, see the utselect man page.

Group Manager Configuration

The Authentication Manager configuration file, /etc/opt/SUNWut/auth.props, contains properties used by the Group Manager at runtime. The properties are:

gmport

gmKeepAliveInterval

enableGroupManager

enableLoadBalancing

enableMulticast

multicastTTL

gmSignatureFile

gmDebug

Note - These properties have default values that are rarely changed. Only very knowledgeable Sun support personnel should direct customers to change these values to help tune or debug their systems. If any properties are changed, they must be changed for all servers in the failover group, since the auth.props file must be the same on all servers in a failover group.
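Because the auth.props file must match on every server in the group, it can be useful to compare copies before restarting the Authentication Manager. The sketch below assumes the file uses simple `name = value` lines with `#` comments, as the enableLoadBalancing example in this chapter suggests; the helper names and the sample values are hypothetical and do not represent the product defaults.

```python
def parse_props(text):
    """Parse 'name = value' lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props

def differing_keys(props_by_server):
    """Return property names whose values are not identical on all servers."""
    all_keys = set().union(*props_by_server.values())
    return sorted(
        key for key in all_keys
        if len({props.get(key) for props in props_by_server.values()}) > 1
    )

# Hypothetical file contents gathered from two servers.
copies = {
    "sr47": parse_props("enableGroupManager = true\nenableLoadBalancing = true\n"),
    "sr48": parse_props("# edited locally\nenableGroupManager = true\nenableLoadBalancing = false\n"),
}
print(differing_keys(copies))  # ['enableLoadBalancing']
```

Any property that is reported should be brought into agreement on all servers before the group is relied on for failover.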

To Restart the Authentication Manager

Property changes do not take effect until the Authentication Manager is restarted.

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utrestart

The Authentication Manager is restarted.

Load Balancing

At the time of a server failure, the Group Manager on each remaining server attempts to distribute the failed server’s sessions evenly among the remaining servers. The load balancing algorithm takes into account each server’s capacity (number and speed of its CPUs) and load so that larger or less heavily loaded servers host more sessions.

When the Group Manager receives a token from a Sun Ray DTU and finds that no server owns an existing session for that token, it redirects the Sun Ray DTU to whichever server in the group has the lightest load. A Sun Ray DTU may appear to connect twice, once on the server that answered its DHCP request and a second time on a server that was less loaded than the first.
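The chapter does not give the exact load-balancing formula, so the following is only one plausible way to combine capacity and load, offered to make the idea concrete. The `Server` record, the `spare_capacity` score (CPU count times CPU speed times idle fraction), and the sample figures are all assumptions, not the Group Manager's algorithm.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    cpus: int
    cpu_mhz: float
    utilization: float  # 0.0 (idle) to 1.0 (saturated)

def spare_capacity(server):
    """Assumed score: total CPU capacity scaled by the idle fraction."""
    return server.cpus * server.cpu_mhz * (1.0 - server.utilization)

def least_loaded(servers):
    """Pick the server with the most spare capacity under the assumed score."""
    return max(servers, key=spare_capacity).name

group = [
    Server("sr47", cpus=2, cpu_mhz=1000.0, utilization=0.50),
    Server("sr48", cpus=4, cpu_mhz=1000.0, utilization=0.80),
    Server("sr49", cpus=8, cpu_mhz=1000.0, utilization=0.80),
]
print(least_loaded(group))  # sr49: large enough that 20% idle still exceeds the others
```

The example shows why a larger server can receive new sessions even when its utilization is higher than that of a smaller one.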

To Turn Off the Load Balancing Feature

In the auth.props file, set:

enableLoadBalancing = false

Setting Up a Failover Group

A failover group is one in which two or more Sun Ray servers use a common policy and share services. It is composed of a primary server and one or more secondary servers. For such a group, you must configure a Sun Ray Data Store to enable replication of the Sun Ray administration data across the group. Configure the secondary servers so that they serve users directly in addition to serving the Data Store. For best results in groups of four or more servers, configure the primary server so that it serves only the Sun Ray Data Store.

The utconfig command sets up the data store for a single system initially, and enables the Sun Ray servers for failover. The utreplica command then configures the Sun Ray servers as a failover group.

If the Sun Ray server is currently monitored by Sun Management Center, utreplica restarts the agent automatically. Log files for Sun Ray servers contain time-stamped error messages which are difficult to interpret if the time is out of sync. To make troubleshooting easier, all secondary servers should periodically synchronize with their primary server.

Tip - Use rdate <primary-host>, preferably with crontab, to synchronize secondary servers with their primary server.

Primary Server

Layered administration of the group takes place on the primary server. The utreplica command designates a primary server, advises the server of its Administration Primary status, and tells it the host names of all the secondary servers.

Adding or removing secondary servers requires services to be restarted on the primary server. In large failover groups, significant loads may be pushed onto the primary server from various sources. In addition, runaway processes from user applications on the primary can degrade the health of the entire failover group. Failover groups of more than four servers should have a dedicated primary server devoted solely to serving the Sun Ray Data Store, that is, not hosting any Sun Ray sessions.

Tip - Configure the primary server before you configure the secondary servers.

To Specify a Primary Server

As superuser, open a shell window on the primary server and type:

# /opt/SUNWut/sbin/utreplica -p secondary-server1 [secondary-server2 ...]

where secondary-server1 [secondary-server2 ...] is a space-separated list of unique host names of the secondary servers.

To Specify a Dedicated Primary Server

The purpose of a dedicated primary server is to serve the Sun Ray Data Store.

Follow the procedure to specify a primary server, as above; however, do not run utadm on this server.

Secondary Server

The secondary servers in the group store a replicated version of the primary server’s administration data. Use the utreplica command to advise each secondary server of its secondary status and also the host name of the primary server for the group.

To Specify Each Secondary Server

As superuser, open a shell window on the secondary server and type:

# /opt/SUNWut/sbin/utreplica -s primary-server

where primary-server is the hostname of the primary server.

To Add Additional Secondary Servers

To include an additional secondary server in an already configured failover group:

1. On the primary server, rerun utreplica -p -a with a list of secondary servers.

# /opt/SUNWut/sbin/utreplica -p -a secondary-server1, secondary-server2,...

2. Run utreplica -s primary-server on the new secondary server.

# /opt/SUNWut/sbin/utreplica -s primary-server

Removing Replication Configuration

To Remove the Replication Configuration

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utreplica -u

This removes the replication configuration.

Viewing Administration Status

To Show Current Administration Configuration

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utreplica -l

The result indicates whether the server is standalone (dedicated), primary (with the secondary host names), or secondary (with the primary host name).

To View Network (Failover Group) Status

A failover group is a set of Sun Ray servers all running the same release of Sun Ray Server Software and all having access to all the Sun Ray DTUs on the interconnect.

1. From the Servers tab in the Admin GUI, click on a server name to display its Server Details screen.

2. Click View Network Status.

FIGURE 11-3 Network Status Screen

Tabular presentation of network status data replaces failover status icons used previously

The Network Status screen provides information on group membership and network connectivity for trusted servers, that is, those in the same failover group.

Note - Sun Ray server broadcasts do not traverse routers or servers other than Sun Ray servers.

Recovery Issues and Procedures

If one of the servers of a failover group fails, the remaining group members operate from the administration data that existed prior to the failure.

The recovery procedure depends on the severity of the failure and whether a primary or secondary server has failed.

Note - When the primary server fails, you cannot make administrative changes to the system. For replication to work, all changes must be successful on the primary server.

Primary Server Recovery

There are several strategies for recovering the primary server. The following procedure is performed on the server that was the primary, after it has been made fully operational again.

To Rebuild the Primary Server Administration Data Store

Use this procedure to rebuild the primary server data store from a secondary server. This procedure uses the same hostname for the replacement server.

1. On one of the secondary servers, capture the current data store to a file called /tmp/store:

# /opt/SUNWut/srds/lib/utldbmcat \
/var/opt/SUNWut/srds/dbm.ut/id2entry.dbb > /tmp/store

This provides an LDIF format file of the current data store.

2. FTP this file to the /tmp directory on the primary server.

3. Follow the directions in the Sun Ray Server Software 4.0 Installation and Configuration Guide to install Sun Ray Server Software.

4. After running utinstall, configure the server as a primary server for the group. Make sure that you use the same admin password and group signature.

# utconfig
	:
# utreplica -p <secondary-server1> <secondary-server2> ...

5. Shut down the Sun Ray services, including the data store:

# /etc/init.d/utsvc stop
# /etc/init.d/utds stop

6. Restore the data:

# /opt/SUNWut/srds/lib/utldif2ldbm -c -j 10 -i /tmp/store

This populates the primary server and synchronizes its data with the secondary server. The replacement server is now ready for operation as the primary server.

7. Restart Sun Ray services:

# utrestart -c

8. (Optional) Confirm that the data store is repopulated:

# /opt/SUNWut/sbin/utuser -l

9. (Optional) Perform any additional configuration procedures.

To Replace the Primary Server with a Secondary Server

Note - This procedure is also known as promoting a secondary server to primary.

1. Choose a server in the existing failover group to be promoted and configure it as the primary server:

# utreplica -u
# utreplica -p <secondary-server1> <secondary-server2> ...

2. Reconfigure each of the remaining secondary servers in the failover group to use the new primary server:

# utreplica -u
# utreplica -s <new-primary-server>

This resynchronizes the secondary server with the new primary server.

Note - This process may take some time to complete, depending on the size of the data store. Since Sun Ray services will be offline during this procedure, you may want to schedule your secondary servers’ downtime accordingly. Be sure to perform this procedure on each secondary server in the failover group.

Secondary Server Recovery

When a secondary server has failed, administration of the group can continue. A log of updates is maintained and applied automatically to the secondary server when it has recovered. If the secondary server needs to be reinstalled, repeat the steps described in the Sun Ray Server Software 4.0 Installation and Configuration Guide.

Setting Up a Group Signature

The utconfig command asks for a group signature if you choose to configure for failover. The signature, which is stored in the /etc/opt/SUNWut/gmSignature file, must be the same on all servers in the group.

The location can be changed in the gmSignatureFile property of the auth.props file.

To form a fully functional failover group, the signature file must:

be owned by root with only root permissions

contain at least eight characters, of which at least two are letters and at least one is not a letter
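The two requirements above can be checked with a short script before a signature is distributed. This is a hedged sketch: the function names are hypothetical, "only root permissions" is interpreted as no access bits for group or others, and the check is independent of utgroupsig, which remains the supported way to set the signature.

```python
import os
import stat

def signature_ok(signature):
    """At least eight characters, at least two letters, at least one non-letter."""
    letters = sum(ch.isalpha() for ch in signature)
    non_letters = len(signature) - letters
    return len(signature) >= 8 and letters >= 2 and non_letters >= 1

def signature_file_ok(path):
    """Owned by root (uid 0) with no permission bits set for group or others."""
    st = os.stat(path)
    return st.st_uid == 0 and (stat.S_IMODE(st.st_mode) & 0o077) == 0

print(signature_ok("abcdefg1"))  # True: eight characters, seven letters, one digit
print(signature_ok("abcdefgh"))  # False: no non-letter character
print(signature_ok("1234567a"))  # False: only one letter
```

The file check would be run as `signature_file_ok("/etc/opt/SUNWut/gmSignature")` on each server, or against whatever path the gmSignatureFile property names.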

Tip - For slightly better security, use long passwords.

To Change the Group Manager Signature File

1. As superuser of the Sun Ray server, open a shell window and type:

# /opt/SUNWut/sbin/utgroupsig

You are prompted for the signature.

2. Enter it twice identically for acceptance.

3. For each Sun Ray server in the group, repeat the steps, starting at step 1.

Note - It is important to use the utgroupsig command, rather than any other method, to enter the signature. utgroupsig also ensures proper internal replication.

Taking Servers Offline

Being able to take servers offline makes maintenance easier. In an offline state, no new sessions are created. However, old sessions continue to exist and can be reactivated unless Sun Ray Server Software is affected.

To Take a Server Offline

At the command-line interface, type:

# /opt/SUNWut/sbin/utadm -f

To Bring a Server Online

At the command-line interface, type:

# /opt/SUNWut/sbin/utadm -n