CHAPTER 10

Failover Groups

Sun Ray servers configured in a failover group provide users with a high level of availability when one of those servers becomes unavailable because of a network or system failure. This chapter describes how to configure failover groups.

For a discussion of how to use multiple failover groups to support regional hotdesking, see .

This chapter covers these topics:

Failover Group Overview

Setting Up IP Addressing

Group Manager

Load Balancing

Setting Up a Failover Group

Viewing the Administration Status

Viewing Failover Group Status

Recovery Issues and Procedures

Setting Up a Group Signature

Taking Servers Offline

Failover Group Overview

A failover group consists of two or more Sun Ray servers grouped together to provide highly available and scalable Sun Ray service for a population of Sun Ray DTUs. Releases earlier than 2.0 supported DTUs available to the servers only on a common, dedicated interconnect. Beginning with the 2.0 release, this capability was expanded to allow access across the LAN to either local or remote Sun Ray devices. However, the servers in a failover group must still be able to reach one another, using multicast or broadcast, over at least one shared subnet. Servers in a group authenticate (or "trust") one another using a common group signature. The group signature is a key used to sign messages sent between servers in the group; it must be configured to be identical on each server.

Failover groups that use more than one version of Sun Ray Server Software will be unable to use all the features provided in the latest releases. On the other hand, the failover group can be a heterogeneous group of Sun servers.

When a dedicated interconnect is used, all servers in the failover group should have access to, and be accessible by, all the Sun Ray DTUs on a given subnet. The failover environment supports the same interconnect topologies that are supported by a single-server Sun Ray environment. However, switches should be multicast-enabled.

FIGURE 10-1 illustrates a typical Sun Ray failover group. For an example of a redundant failover group, see FIGURE 10-2.

FIGURE 10-1 Simple Failover Group

Several Sun Ray servers on a public network are connected to Sun Ray DTUs through a switch.

When a server in a failover group fails for any reason, each Sun Ray DTU connected to that server reconnects to another server in the same failover group. The failover occurs at the user authentication level; the DTU connects to a previously existing session for the user's token. If there is no existing session, the DTU connects to a server selected by the load-balancing algorithm. This server then presents a login screen to the user, and the user must log in again to create a new session. The state of the session on the failed server is lost.

The principal components needed to implement failover are:

Group Manager--A module that monitors the availability (liveness) of the Sun Ray servers and facilitates redirection when needed.

Multiple, coexisting Dynamic Host Configuration Protocol (DHCP) servers--All DHCP servers configured to assign IP addresses to Sun Ray DTUs have a non-overlapping subset of the available address pool.

Note - The failover feature cannot work properly if the IP addresses and DHCP configuration data are not set up properly when the interfaces are configured. In particular, if the Sun Ray server's interconnect IP address is a duplicate of any other server's interconnect IP address, the Sun Ray Authentication Manager throws "Out of Memory" errors.

The redundant failover group illustrated in FIGURE 10-2 can provide maximum resources to a few Sun Ray DTUs. The server sr47 is the primary Sun Ray server and sr48 is the secondary Sun Ray server; other secondary servers (sr49, sr50, and so on) are not shown.

FIGURE 10-2 Redundant Failover Group

What makes this failover group redundant, as opposed to the one in FIGURE 10-1, is that each Sun Ray server is cross-connected to two switches, each of which connects to a subnet of Sun Ray DTUs.

Setting Up IP Addressing

The utadm command assists you in setting up a DHCP server. The default DHCP setup configures each interface for 225 hosts and uses private network addresses for the Sun Ray interconnect. For more information on using the utadm command, see the man page for utadm.

Before setting up IP addressing, you must decide upon an addressing scheme. The following examples discuss setting up class C and class B addresses.

Setting Up Server and Client Addresses

The loss of a server usually implies the loss of its DHCP service and its allocation of IP addresses. Therefore, more DHCP addresses must be available from the address pool than there are Sun Ray DTUs. Consider the situation of 5 servers and 100 DTUs. If one of the servers fails, the remaining DHCP servers must have enough available addresses so that all "orphaned" DTUs get a new working address.

TABLE 10-1 describes how to configure five servers for 100 DTUs, accommodating the failure of two servers (class C) or four servers (class B).

TABLE 10-1 Configuring Five Servers for 100 DTUs
Servers	Class C (2 Servers Fail) Interface Address	Class C (2 Servers Fail) DTU Address Range	Class B (4 Servers Fail) Interface Address	Class B (4 Servers Fail) DTU Address Range
serverA	192.168.128.1	192.168.128.16 to 192.168.128.49	192.168.128.1	192.168.128.16 to 192.168.128.116
serverB	192.168.128.2	192.168.128.50 to 192.168.128.83	192.168.129.1	192.168.129.16 to 192.168.129.116
serverC	192.168.128.3	192.168.128.84 to 192.168.128.117	192.168.130.1	192.168.130.16 to 192.168.130.116
serverD	192.168.128.4	192.168.128.118 to 192.168.128.151	192.168.131.1	192.168.131.16 to 192.168.131.116
serverE	192.168.128.5	192.168.128.152 to 192.168.128.185	192.168.132.1	192.168.132.16 to 192.168.132.116
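
The class C ranges in TABLE 10-1 can be checked mechanically for the two properties the text relies on: each server's pool is the same size, and no two pools overlap. The following is a minimal sketch using Python's standard ipaddress module; the names CLASS_C_RANGES, range_size, and find_overlaps are illustrative and are not part of Sun Ray Server Software.

```python
import ipaddress
from itertools import combinations

# Class C DTU address ranges copied from TABLE 10-1 (first, last), inclusive.
CLASS_C_RANGES = {
    "serverA": ("192.168.128.16", "192.168.128.49"),
    "serverB": ("192.168.128.50", "192.168.128.83"),
    "serverC": ("192.168.128.84", "192.168.128.117"),
    "serverD": ("192.168.128.118", "192.168.128.151"),
    "serverE": ("192.168.128.152", "192.168.128.185"),
}


def _to_int(address):
    """Convert a dotted-quad IPv4 address to its integer value."""
    return int(ipaddress.IPv4Address(address))


def range_size(first, last):
    """Number of addresses in the inclusive range first..last."""
    return _to_int(last) - _to_int(first) + 1


def find_overlaps(ranges):
    """Return every pair of server names whose address ranges overlap."""
    overlaps = []
    for (a, (a_first, a_last)), (b, (b_first, b_last)) in combinations(ranges.items(), 2):
        if _to_int(a_first) <= _to_int(b_last) and _to_int(b_first) <= _to_int(a_last):
            overlaps.append((a, b))
    return overlaps


if __name__ == "__main__":
    for name, (first, last) in CLASS_C_RANGES.items():
        print(name, range_size(first, last))
    print("overlapping pairs:", find_overlaps(CLASS_C_RANGES))
```

The same check can be run against any planned set of ranges before they are entered with utadm.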

The formula for address allocation is: address range (AR) = number of DTUs/(total servers - failed servers), rounded up to a whole number. For example, in the case of the loss of two servers, each DHCP server must be given a range of 100/(5-2) = 33.3, rounded up to 34 addresses.

Ideally, each server would have an address for each DTU. This would require a class B network. Consider these conditions:

If AR multiplied by the total number of servers is less than or equal to 225, configure for a class C network

If AR multiplied by the total number of servers is greater than 225, configure for a class B network
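
The allocation formula and the two conditions above can be written out as a short calculation. This is a sketch only: the function names are illustrative, the result is rounded up so that every orphaned DTU can get an address, and the limit of 225 is the figure used in the conditions above.

```python
import math

CLASS_C_LIMIT = 225  # threshold used in the conditions above


def address_range(num_dtus, total_servers, failed_servers):
    """AR = number of DTUs / (total servers - failed servers), rounded up."""
    surviving = total_servers - failed_servers
    if surviving <= 0:
        raise ValueError("at least one server must survive")
    return math.ceil(num_dtus / surviving)


def network_class(num_dtus, total_servers, failed_servers):
    """Return 'C' if AR x total servers fits within the limit, else 'B'."""
    ar = address_range(num_dtus, total_servers, failed_servers)
    return "C" if ar * total_servers <= CLASS_C_LIMIT else "B"


if __name__ == "__main__":
    # The two scenarios in TABLE 10-1: 5 servers, 100 DTUs.
    for failed in (2, 4):
        print(failed, address_range(100, 5, failed), network_class(100, 5, failed))
```

For the two scenarios in TABLE 10-1, the calculation yields a range of 34 addresses per server on a class C network when two servers may fail, and 100 addresses per server on a class B network when four may fail.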

Tip - If all available DHCP addresses are allocated, a Sun Ray DTU can request an address and find none available, perhaps because another unit has been allocated IP addresses by multiple servers. To prevent this condition, give each DHCP server enough addresses to serve all the DTUs in the failover group.

Server Addresses

Server IP addresses assigned for the Sun Ray interconnect should all be unique. Use the utadm tool to assign them.

When the Sun Ray DTU boots, it sends a DHCP broadcast request to all possible servers on the network interface. One (or more) server responds with an IP address allocated from its range of addresses. The DTU accepts the first IP address that it receives and configures itself to send and receive at that address.

The accepted DHCP response also contains information about the IP address and port numbers of the Authentication Managers on the server that sent the response.

The DTU then attempts to establish a TCP connection to an Authentication Manager on that server. If it is unable to connect, it uses a protocol similar to DHCP in which it uses a broadcast message to ask the Authentication Managers to identify themselves. The DTU then attempts to connect to the Authentication Managers that responded in the order in which the responses were received.

Note - For the broadcast feature to be enabled, the broadcast address (255.255.255.255) must be the last one in the list. Any addresses after the broadcast address are ignored. If the local server is not in the list, Sun Ray DTUs cannot attempt to contact it.
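
The two rules in this note can be checked before an authentication server list is supplied to utadm. The following sketch assumes the list is available as a whitespace-delimited string of IP addresses; the function names are hypothetical and the check does nothing beyond the two rules stated in the note.

```python
BROADCAST = "255.255.255.255"


def parse_auth_list(text):
    """Split a whitespace-delimited list of server IP addresses."""
    return text.split()


def check_auth_server_list(servers, local_server):
    """Return a list of problems found; an empty list means none."""
    problems = []
    effective = servers
    if BROADCAST in servers:
        position = servers.index(BROADCAST)
        ignored = servers[position + 1:]
        if ignored:
            problems.append(
                "addresses after the broadcast address are ignored: " + ", ".join(ignored)
            )
        # Only entries up to and including the broadcast address are used.
        effective = servers[: position + 1]
    if local_server not in effective:
        problems.append("local server " + local_server + " is not in the list")
    return problems


if __name__ == "__main__":
    servers = parse_auth_list("192.168.128.1 192.168.128.2 255.255.255.255")
    print(check_auth_server_list(servers, "192.168.128.2"))
```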

Once a TCP connection to an Authentication Manager has been established, the DTU presents its token. The token is either a pseudo-token representing the individual DTU (its unique Ethernet address) or a smart card. The Session Manager then starts an X window/X server session and binds the token to that session.

The Authentication Manager then sends a query to all of the other Authentication Managers on the same subnet and asks for information about existing sessions for the token. The other Authentication Managers respond, indicating whether there is a session for the token and the last time the token was connected to the session.

The requesting Authentication Manager selects the server with the latest connection time and redirects the DTU to that server. If no session is found for the token, the requesting Authentication Manager selects the server with the lightest load and redirects the token to that server. A new session is created for the token.
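
The selection rule in the preceding paragraphs can be sketched as follows. This is an illustration of the described behavior, not the Authentication Manager's implementation: the SessionReply type, the loads dictionary, and the use of a single number to represent load are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SessionReply:
    """One server's answer to the query about a token (hypothetical shape)."""
    server: str
    has_session: bool
    last_connected: float  # time of last connection; used only if has_session


def pick_redirect_target(replies, loads):
    """Choose the server a DTU should be redirected to for a token.

    replies: SessionReply objects from the servers that were queried.
    loads:   dict mapping server name to a load figure (lower is lighter).
    """
    with_session = [reply for reply in replies if reply.has_session]
    if with_session:
        # An existing session wins; prefer the most recently connected one.
        return max(with_session, key=lambda reply: reply.last_connected).server
    # No session anywhere: fall back to the most lightly loaded server.
    return min(loads, key=loads.get)


if __name__ == "__main__":
    replies = [
        SessionReply("sr47", True, 100.0),
        SessionReply("sr48", True, 250.0),
        SessionReply("sr49", False, 0.0),
    ]
    print(pick_redirect_target(replies, {"sr47": 0.7, "sr48": 0.9, "sr49": 0.1}))
```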

The Authentication Manager enables both implicit (smart card) and explicit switching. For explicit switching, see Group Manager.

Configuring DHCP

In a large IP network, a DHCP server distributes the IP addresses and other configuration information for interfaces on that network.

Coexistence of the Sun Ray Server With Other DHCP Servers

The Sun Ray DHCP server can coexist with DHCP servers on other subnets, provided you isolate the Sun Ray DHCP server from other DHCP traffic. Verify that all routers on the network are configured not to relay DHCP requests. This is the default behavior for most routers.

Caution - If the IP addresses and DHCP configuration data are not set up correctly when the interfaces are configured, the failover feature cannot work properly. In particular, configuring the Sun Ray server's interconnect IP address as a duplicate of any other server's interconnect IP address may cause the Sun Ray Authentication Manager to throw "Out of Memory" errors.

Administering Other Clients

If the Sun Ray server has multiple interfaces, one of which is the Sun Ray interconnect, the Sun Ray DHCP server should be able to manage both the Sun Ray interconnect and the other interfaces without cross-interference.

To Set Up IP Addressing on Multiple Servers Each With One Sun Ray Interface

1. Log in to the Sun Ray server as superuser and open a shell window. Type:

# /opt/SUNWut/sbin/utadm -a <interface_name>

where <interface_name> is the name of the Sun Ray network interface to be configured; for example, hme[0-9], qfe[0-9], or ge[0-9]. You must be logged on as superuser to run this command. The utadm script configures the interface (for example, hme1) at the subnet (in this example, 128).

The script displays default values, such as the following:

Selected values for interface "hme1"
    host address:       192.168.128.1
    net mask:           255.255.255.0
    net address:        192.168.128.0
    host name:          serverB-hme1
    net name:           SunRay-hme1
    first unit address: 192.168.128.16
    last unit address:  192.168.128.240
    auth server list:   192.168.128.1
    firmware server:    192.168.128.1
    router:             192.168.128.1

The default values are the same for each server in a failover group. Certain values must be changed to be unique to each server.

2. When you are asked to accept the default values, type n:

Accept as is? ([Y]/N): n

3. Change the second server's IP address to a unique value, in this case 192.168.128.2:

new host address: [192.168.128.1] 192.168.128.2

4. Accept the default values for netmask, host name, and net name:

new netmask: [255.255.255.0]

new host name: [serverB-hme1]

5. Change the DTU address ranges for the interconnect to unique values. For example:

Do you want to offer IP addresses for this interface? [Y/N]:

new first Sun Ray address: [192.168.128.16] 192.168.128.50

number of Sun Ray addresses to allocate: [205] 34

6. Accept the default firmware server and router values:

new firmware server: [192.168.128.2]

new router: [192.168.128.2]

The utadm script asks if you want to specify an authentication server list:

auth server list:     192.168.128.1

To read auth server list from file, enter file name:

Auth server IP address (enter <CR> to end list):

If no server in the auth server list responds, should an auth server be located by broadcasting on the network? ([Y]/N):

These servers are specified by a file containing a space-delimited list of server IP addresses or by manually entering the server IP addresses.

The newly selected values for interface hme1 are displayed:

Selected values for interface "hme1"
    host address:       192.168.128.2
    net mask:           255.255.255.0
    net address:        192.168.128.0
    host name:          serverB-hme1
    net name:           SunRay-hme1
    first unit address: 192.168.128.50
    last unit address:  192.168.128.83
    auth server list:   192.168.128.1
    firmware server:    192.168.128.2
    router:             192.168.128.2

7. If these are correct, accept the new values:

Accept as is? ([Y]/N): y

8. Stop and restart the server and power cycle the DTUs to download the firmware.

TABLE 10-2 lists the options available for the utadm command. For additional information, see the utadm man page.

TABLE 10-2 Available Options
Option	Definition
-c	Create a framework for the Sun Ray interconnect.
-r	Remove all Sun Ray interconnects.
-A <subnetwork>	Configure the subnetwork specified as a Sun Ray subnetwork. This option only configures the DHCP service to allocate IP addresses and/or to provide Sun Ray parameters to Sun Ray clients. It also automatically turns on support for LAN connections from a shared subnetwork.
-a <interface_name>	Add <interface_name> as Sun Ray interconnect.
-D <subnetwork>	Delete the subnetwork specified from the list of configured Sun Ray subnetworks.
-d <interface_name>	Delete <interface_name> as Sun Ray interconnect.
-l	Print the current configuration for all the Sun Ray subnetworks, including remote subnetworks.
-p	Print the current configuration.
-f	Take a server offline.
-n	Bring a server online.
-x	Print the current configuration in a machine-readable format.

Group Manager

Every server has a Group Manager module that monitors availability and facilitates redirection. It is coupled with the Authentication Manager.

In setting policies, the Authentication Manager uses the selected authentication modules and decides what tokens are valid and which users have access.

Warning - The same policy must exist on every server in the failover group or undesirable results might occur.

Each Group Manager creates maps of the failover group topology by exchanging keepalive messages with the other Group Managers. These keepalive messages are sent to a well-known UDP port (typically 7009) on all of the configured network interfaces. The keepalive message contains enough information for each Sun Ray server to construct a list of servers and the common subnets that each server can access. In addition, the Group Manager remembers the last time that a keepalive message was received from each server on each interface.

The keepalive message contains the following information about the server:

Server's host name

Server's primary IP address

Elapsed time since it was booted

IP information for every interface it can reach

Machine information (number and speed of CPUs, configured RAM, and so on)

Load information (CPU and memory utilization, number of sessions, and so on)

Note - The last two items are used to facilitate load distribution. See Load Balancing.

The information maintained by the Group Manager is used primarily for server selection when a token is presented. The server and subnet information is used to determine the servers to which a given DTU can connect. These servers are queried about sessions belonging to the token. Servers whose last keepalive message is older than the timeout are deleted from the list, since either the network connection or the server is probably down.
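
The filtering described above, by shared subnet and by keepalive age, can be sketched as follows. The data layout (a dictionary keyed by server and subnet) and the timeout value used in the example are assumptions for illustration; the source does not describe the Group Manager's internal data structures or its actual timeout.

```python
def reachable_servers(keepalives, dtu_subnet, now, timeout):
    """Return the servers a DTU on dtu_subnet could be sent to.

    keepalives: dict mapping (server, subnet) to the time the last
                keepalive from that server arrived on that subnet.
    now, timeout: current time and maximum keepalive age, in seconds.
    """
    return sorted(
        server
        for (server, subnet), last_seen in keepalives.items()
        if subnet == dtu_subnet and now - last_seen <= timeout
    )


if __name__ == "__main__":
    keepalives = {
        ("sr47", "192.168.128.0"): 100.0,  # recent keepalive
        ("sr48", "192.168.128.0"): 40.0,   # stale: server or network probably down
        ("sr49", "192.168.129.0"): 100.0,  # recent, but on a different subnet
    }
    print(reachable_servers(keepalives, "192.168.128.0", now=105.0, timeout=60.0))
```

Only the servers returned by such a filter would then be queried about sessions belonging to the token.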

Redirection

In addition to automatic redirection at authentication, you can use the utselect graphical user interface (GUI) or utswitch command for manual redirection.

Note - The utselect GUI is the preferred method to use for server selection. For more information, see the utselect man page.

Group Manager Configuration

The Authentication Manager configuration file, /etc/opt/SUNWut/auth.props, contains properties used by the Group Manager at runtime. The properties are:

gmport

gmKeepAliveInterval

enableGroupManager

enableLoadBalancing

enableMulticast

multicastTTL

gmSignatureFile

gmDebug

These properties have default values that are rarely changed. Only very knowledgeable Sun support personnel should direct customers to change these values to help tune or debug their systems. If any properties are changed, they must be changed for all servers in the failover group, since the auth.props file must be the same on all servers in a failover group.
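
Because the Group Manager properties must match on every server, it can be useful to compare copies of auth.props collected from two servers. The sketch below assumes the file uses simple "name = value" lines with "#" comments, which is an assumption about the file format; the property values in the example are placeholders, not documented defaults. A byte-for-byte comparison of the whole file is the stricter check.

```python
GROUP_MANAGER_PROPERTIES = (
    "gmport",
    "gmKeepAliveInterval",
    "enableGroupManager",
    "enableLoadBalancing",
    "enableMulticast",
    "multicastTTL",
    "gmSignatureFile",
    "gmDebug",
)


def parse_props(text):
    """Parse 'name = value' lines, skipping blanks and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def group_manager_differences(text_a, text_b):
    """Map each differing Group Manager property to its two values."""
    a, b = parse_props(text_a), parse_props(text_b)
    return {
        name: (a.get(name), b.get(name))
        for name in GROUP_MANAGER_PROPERTIES
        if a.get(name) != b.get(name)
    }


if __name__ == "__main__":
    server_a = "gmport = 7009\nenableLoadBalancing = true\n"
    server_b = "# edited locally\ngmport = 7009\nenableLoadBalancing = false\n"
    print(group_manager_differences(server_a, server_b))
```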

To Restart the Authentication Manager

Property changes do not take effect until the Authentication Manager is restarted.

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utrestart

The Authentication Manager is restarted.

Load Balancing

At the time of a server failure, the Group Manager on each remaining server attempts to distribute the failed server's sessions evenly among the remaining servers. The load balancing algorithm takes into account each server's capacity (number and speed of its CPUs) and load so that larger or less heavily loaded servers host more sessions.

When the Group Manager receives a token from a Sun Ray DTU and finds that no server owns an existing session for that token, it redirects the Sun Ray DTU to the server in the group with the lightest load. It is possible that a Sun Ray DTU appears to connect twice; once on the server that answered its DHCP request and a second time on a server that was less loaded than the first.
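
The effect described above, in which larger or less heavily loaded servers end up hosting more sessions, can be illustrated with a simple greedy model. This is not the Group Manager's actual algorithm, which the source does not specify; the capacity measure (CPU count multiplied by clock speed) and the session-count load measure are assumptions chosen for the example.

```python
def distribute_sessions(servers, orphaned):
    """Assign orphaned sessions one at a time to the least-loaded server.

    servers:  dict mapping name to {"cpus": int, "mhz": int, "sessions": int}.
    orphaned: number of sessions that need a new host.
    Returns a dict mapping each server name to the sessions it received.
    """
    counts = {name: info["sessions"] for name, info in servers.items()}
    assigned = {name: 0 for name in servers}

    def relative_load(name):
        # Sessions per unit of capacity; lower means more headroom.
        capacity = servers[name]["cpus"] * servers[name]["mhz"]
        return counts[name] / capacity

    for _ in range(orphaned):
        target = min(servers, key=relative_load)
        counts[target] += 1
        assigned[target] += 1
    return assigned


if __name__ == "__main__":
    servers = {
        "big": {"cpus": 4, "mhz": 1000, "sessions": 0},
        "small": {"cpus": 1, "mhz": 1000, "sessions": 0},
    }
    print(distribute_sessions(servers, 10))
```

In this model the four-CPU server receives most of the orphaned sessions, which matches the qualitative behavior the text describes.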

To Turn Off the Load Balancing Feature

In the auth.props file set:

enableLoadBalancing = false

Setting Up a Failover Group

A failover group is one in which two or more Sun Ray servers use a common policy and share services. It is composed of a primary server and one or more secondary servers. For such a group, you must configure a Sun Ray Data Store to enable replication of the Sun Ray administration data across the group.

The utconfig command sets up the internal database for a single system initially, and enables the Sun Ray servers for failover. The utreplica command then configures the Sun Ray servers as a failover group.

Log files for Sun Ray servers contain time-stamped error messages which are difficult to interpret if the time is out of sync. To make troubleshooting easier, all secondary servers should periodically synchronize with their primary server.

Tip - Use rdate <primary-host>, preferably with crontab, to synchronize secondary servers with their primary server.

Primary Server

Layered administration of the group takes place on the primary server. The utreplica command designates a primary server, advises the server of its Administration Primary status, and tells it the host names of all the secondary servers.

Tip - Configure the primary server before you configure the secondary servers.

To Specify a Primary Server

As superuser, open a shell window on the primary server and type:

# /opt/SUNWut/sbin/utreplica -p secondary-server1 [secondary-server2 ...]

where secondary-server1 [secondary-server2 ...] is a space-separated list of unique host names of the secondary servers.

Secondary Server

The secondary servers in the group store a replicated version of the primary server's administration data. Use the utreplica command to advise each secondary server of its secondary status and also the host name of the primary server for the group.

To Specify Each Secondary Server

As superuser, open a shell window on the secondary server and type:

# /opt/SUNWut/sbin/utreplica -s primary-server

where primary-server is the host name of the primary server.

To Add Additional Secondary Servers

To include an additional secondary server in an already configured failover group:

1. On the primary server, rerun utreplica -p -a with a space-separated list of secondary servers.

# /opt/SUNWut/sbin/utreplica -p -a secondary-server1 secondary-server2 ...

2. Run utreplica -s primary-server on the new secondary server.

# /opt/SUNWut/sbin/utreplica -s primary-server

Removing Replication Configuration

To Remove the Replication Configuration

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utreplica -u

This removes the replication configuration.

Viewing the Administration Status

To Show Current Administration Configuration

As superuser, open a shell window and type:

# /opt/SUNWut/sbin/utreplica -l

The result indicates whether the server is standalone, primary (with the secondary host names), or secondary (with the primary host name).

Viewing Failover Group Status

A failover group is a set of Sun Ray servers all running the same release of Sun Ray Server Software and all having access to all the Sun Ray DTUs on the interconnect.

To View Failover Group Status

1. From the navigation menu in the Admin GUI, select the arrow to the left of Failover Group to expand the menu.

2. Click the Status link.

The Failover Group Status window is displayed.

The Failover Group Status window describes the health and current state of multiple Sun Ray servers within your failover group. This window also describes the health of any Sun Ray servers that have responded to a Sun Ray broadcast.

The Failover Group Status window provides information on group membership and network connectivity. The servers are listed by name in the first column. Failover Group Status only displays public networks and Sun Ray interconnect fabrics.

In FIGURE 10-3, the information provided is from the point of view of the server in the upper left-hand corner of the table. In this example, the server is ray-146.

FIGURE 10-3 Failover Group Status Table

This figure depicts some of the connectivity icons described in TABLE 10-3.

Note - Sun Ray server broadcasts do not traverse routers or servers other than Sun Ray servers.

Sun Ray Failover Group Status Icons

These icons depict current failover group status:

TABLE 10-3 Failover Group Status Icons
Icons	Description
	Information is displayed from the perspective of the system performing the failover status.
	A failover group is established and functioning properly. The trusted hosts are members of this failover group because they share the same group signature.
	A Sun Ray interconnect fabric is established and functioning properly.
	This Sun Ray interconnect fabric is unreachable from the server performing the failover group status. The host was reachable in the past but is no longer reachable from the point of view of that server. This may indicate a failure in the interconnect fabric between Sun Ray servers if they are supposed to be on the same interconnect.
	The servers are unreachable. This network is unreachable from the server performing the failover group status. This could be an alert situation. Over a public network, conditions could be normal apart from the Sun Ray broadcast information, which cannot traverse routers.
	Servers that appear in the same group use this icon. The signature files, /etc/opt/SUNWut/gmSignature, on those machines are identical. This icon identifies systems as trusted hosts. Failover occurs for any Sun Ray DTUs connected between these systems. The utgroupsig utility is used to set the gmSignature file.

Recovery Issues and Procedures

If one of the servers of a failover group fails, the remaining group members operate from the administration data that existed prior to the failure.

The recovery procedure depends on the severity of the failure and whether a primary or secondary server has failed.

Note - When the primary server fails, you cannot make administrative changes to the system. For replication to work, all changes must be successful on the primary server.

Primary Server Recovery

There are several strategies for recovering the primary server. The following procedure is performed on the server that was the primary, after it has been made fully operational again.

To Rebuild the Primary Server Administration Data Store

Use this procedure to rebuild the primary server administration data store from a secondary server. This procedure uses the same hostname for the replacement server.

1. On one of the secondary servers, capture the current data store to a file called /tmp/store:

# /opt/SUNWut/srds/lib/utldbmcat \
/var/opt/SUNWut/srds/dbm.ut/id2entry.dbb > /tmp/store

This provides an LDIF format file of the current database.

2. FTP this file to the /tmp directory on the primary server.

3. Follow the directions in the Sun Ray Server Software 3.1.1 Installation and Configuration Guide to install Sun Ray Server Software.

4. After running utinstall, configure the server as a primary server for the group. Make sure that you use the same admin password and group signature.

# utconfig

# utreplica -p <secondary-server1> <secondary-server2> ...

5. Shut down the Sun Ray services, including the data store:

# /etc/init.d/utsvc stop

# /etc/init.d/utds stop

6. Restore the data:

# /opt/SUNWut/srds/lib/utldif2ldbm -c -j 10 -i /tmp/store

This populates the primary server and synchronizes its data with the secondary server. The replacement server is now ready for operation as the primary server.

7. Restart Sun Ray services:

# utrestart -c

8. (Optional) Confirm that the data store is repopulated:

# /opt/SUNWut/sbin/utuser -l

9. (Optional) Perform any additional configuration procedures.

To Replace the Primary Server with a Secondary Server

Note - This procedure is also known as promoting a secondary server to primary.

1. Choose a server in the existing failover group to be promoted and configure it as the primary server:

# utreplica -u

# utreplica -p <secondary-server1> <secondary-server2> ...

2. Reconfigure each of the remaining secondary servers in the failover group to use the new primary server:

# utreplica -u

# utreplica -s <new-primary-server>

This resynchronizes the secondary server with the new primary server.

Note - This process may take some time to complete, depending on the size of the data store. Since Sun Ray services will be offline during this procedure, you may want to schedule your secondary servers' downtime accordingly. Be sure to perform this procedure on each secondary server in the failover group.

Secondary Server Recovery

Where a secondary server has failed, administration of the group can continue. A log of updates is maintained and applied automatically to the secondary server when it has recovered. If the secondary server needs to be reinstalled, repeat the steps described in the Sun Ray Server Software 3.1.1 Installation and Configuration Guide.

Setting Up a Group Signature

The utconfig command asks for a group signature if you choose to configure for failover. The signature, which is stored in the /etc/opt/SUNWut/gmSignature file, must be the same on all servers in the group.

The location can be changed in the gmSignatureFile property of the auth.props file.

To form a fully functional failover group, the signature file must:

be owned by root with only root permissions

contain at least eight characters, in which at least two are letters and at least one is not
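
The content rule above can be checked before a candidate signature is entered with utgroupsig. The following sketch covers only the character requirements; it does not check file ownership or permissions, and the function name is illustrative.

```python
def signature_problems(signature):
    """Return the content rules a candidate group signature violates."""
    problems = []
    if len(signature) < 8:
        problems.append("fewer than eight characters")
    if sum(ch.isalpha() for ch in signature) < 2:
        problems.append("fewer than two letters")
    if all(ch.isalpha() for ch in signature):
        problems.append("no non-letter character")
    return problems


if __name__ == "__main__":
    for candidate in ("abc123xyz", "abcdefgh", "a1"):
        print(candidate, signature_problems(candidate))
```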

Tip - For slightly better security, use long passwords.

To Change the Group Manager Signature File

1. As superuser of the Sun Ray server, open a shell window and type:

# /opt/SUNWut/sbin/utgroupsig

You are prompted for the signature.

2. Enter it twice identically for acceptance.

3. For each Sun Ray server in the group, repeat the steps, starting at step 1.

Note - It is important to use the utgroupsig command, rather than any other method, to enter the signature. utgroupsig also ensures that internal database replication occurs properly.

Taking Servers Offline

Being able to take servers offline makes maintenance easier. In an offline state, no new sessions are created. However, old sessions continue to exist and can be reactivated unless Sun Ray Server Software is affected.

To Take a Server Offline

At the command-line interface, type:

# /opt/SUNWut/sbin/utadm -f

To Bring a Server Online

At the command-line interface, type:

# /opt/SUNWut/sbin/utadm -n