This chapter describes the key concepts related to the software components of a Sun Cluster configuration, including cluster administration interfaces, cluster time, the high-availability framework, global devices, disk device groups, the global namespace, cluster file systems, quorum, volume management, data services, and public network management.
This information is directed primarily toward system administrators and application developers who use the Sun Cluster API and SDK. Cluster system administrators can use this information as background for installing, configuring, and administering cluster software. Application developers can use the information to understand the cluster environment in which they will be working.
The following figure shows a high-level view of how the cluster administration concepts map to the cluster architecture.
You can choose from several user interfaces to install, configure, and administer Sun Cluster and Sun Cluster data services. You can accomplish system administration tasks through the documented command-line interface. On top of the command-line interface are some utilities to simplify selected configuration tasks. Sun Cluster also has a module that runs as part of Sun Management Center that provides a GUI to certain cluster tasks. Refer to the introductory chapter in the Sun Cluster 3.0 System Administration Guide for complete descriptions of the administrative interfaces.
The clocks on all nodes in a cluster must be synchronized. Whether you synchronize the cluster nodes with any outside time source is not important to cluster operation. Sun Cluster employs the Network Time Protocol (NTP) to synchronize the clocks between nodes.
In general, a change in the system clock of a fraction of a second causes no problems. However, if you run date(1), rdate(1M), or xntpdate(1M) (interactively, or within cron scripts) on an active cluster, you can force a time change much larger than a fraction of a second to synchronize the system clock to the time source. This forced change might cause problems with file modification timestamps or confuse the NTP service.
When you install the Solaris operating environment on each cluster node, you have an opportunity to change the default time and date setting for the node. In general, you can accept the factory default.
When you install Sun Cluster using scinstall(1M), one step in the process is to configure NTP for the cluster. Sun Cluster supplies a template file, ntp.cluster (see /etc/inet/ntp.cluster on an installed cluster node), that establishes a peer relationship between all cluster nodes, with one node being the "preferred" node. Nodes are identified by their private hostnames and time synchronization occurs across the cluster interconnect. The instructions for how to configure the cluster for NTP are included in the Sun Cluster 3.0 Installation Guide.
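For reference, the peer relationship that ntp.cluster establishes consists of entries similar to the following sketch. This assumes the default private hostnames (clusternodeN-priv); the exact template on an installed node may differ, so check /etc/inet/ntp.cluster itself:

# Sketch of ntp.cluster peer entries (default private hostnames assumed)
peer clusternode1-priv prefer
peer clusternode2-priv
peer clusternode3-priv
peer clusternode4-priv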
Alternatively, you can set up one or more NTP servers outside the cluster and change the ntp.conf file to reflect that configuration.
In normal operation, you should never need to adjust the time on the cluster. However, if the time was set incorrectly when you installed the Solaris operating environment and you want to change it, the procedure for doing so is included in the Sun Cluster 3.0 System Administration Guide.
Sun Cluster makes all components on the "path" between users and data highly available, including network interfaces, the applications themselves, the file system, and the multihost disks. In general, a cluster component is highly available if it survives any single (software or hardware) failure in the system.
The following table shows the kinds of Sun Cluster component failures (both hardware and software) and the kinds of recovery built into the high-availability framework.
Table 3-1 Levels of Sun Cluster Failure Detection and Recovery

| Failed Cluster Resource | Software Recovery | Hardware Recovery |
|---|---|---|
| Data service | HA API, HA framework | N/A |
| Public network adapter | Network Adapter Failover (NAFO) | Multiple public network adapter cards |
| Cluster file system | Primary and secondary replicas | Multihost disks |
| Mirrored multihost disk | Volume management (Solstice DiskSuite and VERITAS Volume Manager) | Hardware RAID-5 (for example, Sun StorEdge A3x00) |
| Global device | Primary and secondary replicas | Multiple paths to the device, cluster transport junctions |
| Private network | HA transport software | Multiple private hardware-independent networks |
| Node | CMM, failfast driver | Multiple nodes |
The Sun Cluster high-availability framework detects a node failure quickly and creates a new equivalent server for the framework resources on a remaining node in the cluster. At no time are all framework resources unavailable. Framework resources unaffected by a crashed node are fully available during recovery. Furthermore, framework resources of the failed node become available as soon as they are recovered. A recovered framework resource does not have to wait for all other framework resources to complete their recovery.
Most highly available framework resources are recovered transparently to the applications (data services) using the resource. The semantics of framework resource access are fully preserved across node failure. The applications simply cannot tell that the framework resource server has been moved to another node. Failure of a single node is completely transparent to programs on remaining nodes using the files, devices, and disk volumes attached to this node, as long as an alternative hardware path exists to the disks from another node. An example is the use of multihost disks that have ports to multiple nodes.
The Cluster Membership Monitor (CMM) is a distributed set of agents, one per cluster member. The agents exchange messages over the cluster interconnect to:
Enforce a consistent membership view on all nodes (quorum)
Drive synchronized reconfiguration in response to membership changes, using registered callbacks
Handle cluster partitioning (split brain, amnesia)
Ensure full connectivity among all cluster members
Unlike in previous Sun Cluster releases, the CMM runs entirely in the kernel.
The main function of the CMM is to establish cluster-wide agreement on the set of nodes that participates in the cluster at any given time. Sun Cluster refers to this constraint as cluster membership.
To determine cluster membership, and ultimately, ensure data integrity, the CMM:
Accounts for a change in cluster membership, such as a node joining or leaving the cluster
Ensures that a "bad" node leaves the cluster
Ensures that a "bad" node stays out of the cluster until it is repaired
Prevents the cluster from partitioning itself into subsets of nodes
See "Quorum and Quorum Devices" for more information on how the cluster protects itself from partitioning into multiple separate clusters.
To ensure that data is kept safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the CMM coordinates a cluster reconfiguration of cluster services (applications) in response to a failure.
The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration.
After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster, where cluster resources might be redistributed based on the new membership of the cluster.
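You can observe the membership that the CMM has established with the scstat(1M) command. A minimal sketch follows; the node names are hypothetical and the output shown is illustrative and trimmed:

# scstat -n
-- Cluster Nodes --
    Node name: phys-node1    Status: Online
    Node name: phys-node2    Status: Online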
The Cluster Configuration Repository (CCR) is a private, cluster-wide database for storing information pertaining to the configuration and state of the cluster. The CCR is a distributed database. Each node maintains a complete copy of the database. The CCR ensures that all nodes have a consistent view of the cluster "world." To avoid corrupting data, each node needs to know the current state of the cluster resources.
The CCR is implemented in the kernel as a highly available service.
The CCR uses a two-phase commit algorithm for updates: An update must complete successfully on all cluster members or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.
Although the CCR is made up of text files, never edit the CCR files manually. Each file contains a checksum record to ensure consistency. Manually updating CCR files can cause a node or the entire cluster to stop functioning.
The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.
Sun Cluster uses global devices to provide cluster-wide, highly available access to any device in a cluster, from any node, without regard to where the device is physically attached. In general, if a node fails while providing access to a global device, Sun Cluster automatically discovers another path to the device and redirects the access to that path. Sun Cluster global devices include disks, CD-ROMs, and tapes. However, disks are the only supported multiported global devices. This means that CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.
The cluster automatically assigns unique IDs to each disk, CD-ROM, and tape device in the cluster. This assignment allows consistent access to each device from any node in the cluster. The global device namespace is held in the /dev/global directory. See "Global Namespace" for more information.
Multiported global devices provide more than one path to a device. In the case of multihost disks, because the disks are part of a disk device group hosted by more than one node, the multihost disks are made highly available.
Sun Cluster manages global devices through a construct known as the device ID (DID) pseudo driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.
The device ID (DID) pseudo driver is an integral part of the global device access feature of the cluster. The DID driver probes all nodes of the cluster and builds a list of unique disk devices, assigning each a unique major and minor number that is consistent on all nodes of the cluster. Access to the global devices is performed utilizing the unique device ID assigned by the DID driver instead of the traditional Solaris device IDs, such as c0t0d0 for a disk.
This approach ensures that any application utilizing the disk devices (such as a volume manager or applications using raw devices) can use a consistent path to access the device. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from node to node, thus changing the Solaris device naming conventions as well. For example, node1 might see a multihost disk as c1t2d0, and node2 might see the same disk completely differently, as c3t2d0. The DID driver would assign a global name, such as d10, that the nodes would use instead, giving each node a consistent mapping to the multihost disk.
You update and administer Device IDs through scdidadm(1M) and scgdevs(1M). See the respective man pages for more information.
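For example, listing the DID mappings shows how the same multihost disk, seen locally as c1t2d0 on one node and c3t2d0 on another, resolves to a single DID name. This is a hedged sketch; the node names, instance numbers, and output layout are illustrative:

# scdidadm -L
...
10   phys-node1:/dev/rdsk/c1t2d0   /dev/did/rdsk/d10
10   phys-node2:/dev/rdsk/c3t2d0   /dev/did/rdsk/d10
...
# scgdevs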
In Sun Cluster, all multihost disks must be under control of the Sun Cluster framework. You first create volume manager disk groups--either Solstice DiskSuite disksets or VERITAS Volume Manager disk groups--on the multihost disks. Then, you register the volume manager disk groups as Sun Cluster disk device groups. A disk device group is a type of global device. In addition, Sun Cluster registers every individual disk as a disk device group.
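As a sketch of the two paths (diskset, disk group, and host names here are hypothetical): creating a Solstice DiskSuite diskset with more than one host creates a corresponding disk device group automatically, while a VxVM disk group is registered explicitly with scconf(1M):

# metaset -s oracle-ds -a -h phys-node1 phys-node2
# scconf -a -D type=vxvm,name=oracle-dg,nodelist=phys-node1:phys-node2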
Disk device groups are independent of resource groups. One node can master a resource group (representing a group of data service processes) while another can master the disk group(s) being accessed by the data services. However, the best practice is to keep the disk device group that stores a particular application's data and the resource group that contains the application's resources (the application daemon) on the same node. Refer to the overview chapter in the Sun Cluster 3.0 Data Services Installation and Configuration Guide for more information about the association between disk device groups and resource groups.
With a disk device group, the volume manager disk group becomes "global" because it provides multipath support to the underlying disks. Each cluster node physically attached to the multihost disks provides a path to the disk device group.
A global device is highly available if it is part of a device group that is hosted by more than one cluster node.
Because a disk enclosure is connected to more than one node, all disk device groups in that enclosure are accessible through an alternate path if the node currently mastering the device group fails. The failure of the node mastering the device group does not affect access to the device group except for the time it takes to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.
The Sun Cluster mechanism that enables global devices is the global namespace. The global namespace includes the /dev/global/ hierarchy as well as the volume manager namespaces. The global namespace reflects both multihost disks and local disks (and any other cluster device, such as CD-ROMs and tapes), and provides multiple failover paths to the multihost disks. Each node physically connected to multihost disks provides a path to the storage for any node in the cluster.
Normally, the volume manager namespaces reside in the /dev/md/diskset/dsk (and rdsk) directories, for Solstice DiskSuite; and in the /dev/vx/dsk/disk-group and /dev/vx/rdsk/disk-group directories, for VxVM. These namespaces consist of directories for each Solstice DiskSuite diskset and each VxVM disk group imported throughout the cluster, respectively. Each of these directories houses a device node for each metadevice or volume in that diskset or disk group.
In Sun Cluster, each of the device nodes in the local volume manager namespace is replaced by a symbolic link to a device node in the /global/.devices/node@nodeID file system, where nodeID is an integer that represents the nodes in the cluster. Sun Cluster continues to present the volume manager devices, as symbolic links, in their standard locations as well. Both the global namespace and standard volume manager namespace are available from any cluster node.
The advantages of the global namespace include:
Each node remains fairly independent, with little change in the device administration model.
Devices can be selectively made global.
Third-party link generators continue to work.
Given a local device name, an easy mapping is provided to obtain its global name.
The following table shows the mappings between the local and global namespaces for a multihost disk, c0t0d0s0.
Table 3-2 Local and Global Namespace Mappings

| Component/Path | Local Node Namespace | Global Namespace |
|---|---|---|
| Solaris logical name | /dev/dsk/c0t0d0s0 | /global/.devices/node@ID/dev/dsk/c0t0d0s0 |
| DID name | /dev/did/dsk/d0s0 | /global/.devices/node@ID/dev/did/dsk/d0s0 |
| Solstice DiskSuite | /dev/md/diskset/dsk/d0 | /global/.devices/node@ID/dev/md/diskset/dsk/d0 |
| VERITAS Volume Manager | /dev/vx/dsk/disk-group/v0 | /global/.devices/node@ID/dev/vx/dsk/disk-group/v0 |
The global namespace is automatically generated on installation and updated with every reconfiguration reboot. You can also generate the global namespace by running the scgdevs(1M) command.
A cluster file system is a proxy between the kernel on one node and the underlying file system and volume manager running on a node that has a physical connection to the disk(s).
Cluster file systems are dependent on global devices (disks, tapes, CD-ROMs) with physical connections to one or more nodes. The global devices can be accessed from any node in the cluster through the same file name (for example, /dev/global/) whether or not that node has a physical connection to the storage device. You can use a global device the same way as a regular device; that is, you can create a file system on it using newfs(1M) or mkfs(1M).
You can mount a file system on a global device globally with mount -g or locally with mount. Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/foo). A cluster file system is mounted on all cluster members; you cannot mount a cluster file system on a subset of cluster members.
In Sun Cluster, all multihost disks are configured as disk device groups, which can be Solstice DiskSuite disksets, VxVM disk groups, or individual disks not under control of a software-based volume manager. Also, local disks are configured as disk device groups: a path leads to each local disk from each node. This setup does not mean the data on a disk is necessarily available from all nodes. The data only becomes available to all nodes if the file systems on the disks are mounted globally as a cluster file system.
A cluster file system made from a local file system has only a single connection to the disk storage. If the node with the physical connection to the disk storage fails, the other nodes no longer have access to that cluster file system. You can also have local file systems on a single node that are not accessible directly from other nodes.
HA data services are set up so that the data for the service is stored on disk device groups in cluster file systems. This setup has several advantages. First, the data is highly available; that is, because the disks are multihosted, if the path from the node that currently is the primary fails, access is switched to another node that has direct access to the same disks. Second, because the data is on a cluster file system, it can be viewed from any cluster node directly--you do not have to log onto the node that currently masters the disk device group to view the data.
The cluster file system is based on the proxy file system (PXFS), which has the following features:
PXFS makes file access locations transparent. A process can open a file located anywhere in the system and processes on all nodes can use the same path name to locate a file.
PXFS uses coherency protocols to preserve the UNIX file access semantics even if the file is accessed concurrently from multiple nodes.
PXFS provides extensive caching and uses zero-copy bulk I/O to move large data objects efficiently.
PXFS provides continuous access to data, even when failures occur. Applications do not detect failures as long as a path to disks is still operational. This guarantee is maintained for raw disk access and all file system operations.
PXFS is independent of underlying file system and volume management software. PXFS makes any supported on-disk file system global.
PXFS is built on top of the existing Solaris file system at the vnode interface. This interface enables PXFS to be implemented without extensive kernel modifications.
PXFS is not a distinct file system type. That is, clients see the underlying file system (for example, UFS).
The cluster file system is independent of the underlying file system and volume manager. Currently, you can build cluster file systems on UFS using either Solstice DiskSuite or VERITAS Volume Manager.
As with normal file systems, you can mount cluster file systems in two ways:
Manually--Use the mount command and the -g option to mount the cluster file system from the command line, for example:
# mount -g /dev/global/dsk/d0s0 /global/oracle/data
Automatically--Create an entry in the /etc/vfstab file with a global mount option to mount the cluster file system at boot. You then create a mount point under the /global directory on all nodes. The directory /global is a recommended location, not a requirement. Here's a sample line for a cluster file system from an /etc/vfstab file:
/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging
While Sun Cluster does not impose a naming policy for cluster file systems, you can ease administration by creating a mount point for all cluster file systems under the same directory, such as /global/disk-device-group. See Sun Cluster 3.0 Installation Guide and Sun Cluster 3.0 System Administration Guide for more information.
The syncdir mount option can be used for cluster file systems. However, performance improves significantly if you do not specify syncdir. If you specify syncdir, writes are guaranteed to be POSIX compliant. If you do not, you get the same behavior seen with UFS file systems. For example, in some cases without syncdir you would not discover an out-of-space condition until you close a file; with syncdir (and POSIX behavior), the out-of-space condition would be discovered during the write operation. The cases in which you could have problems without syncdir are rare, so we recommend that you do not specify it and take the performance benefit.
See "File Systems FAQ" for frequently asked questions about global devices and cluster file systems.
Because cluster nodes share data and resources, the cluster must take steps to maintain data and resource integrity. When a node does not meet the cluster rules for membership, the cluster must disallow the node from participating in the cluster.
In Sun Cluster, the mechanism that determines node participation in the cluster is known as a quorum. Sun Cluster uses a majority voting algorithm to implement quorum. Both cluster nodes and quorum devices, which are disks that are shared between two or more nodes, vote to form quorum. A quorum device can contain user data.
The quorum algorithm operates dynamically: as cluster events trigger its calculations, the results of the calculations can change over the lifetime of a cluster. Quorum protects against two potential cluster problems--split brain and amnesia--both of which can cause inconsistent data to be made available to clients. The following table describes these two problems and how quorum solves them.
Table 3-3 Cluster Quorum, and Split-Brain and Amnesia Problems

| Problem | Description | Quorum's Solution |
|---|---|---|
| Split brain | Occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into sub-clusters, each of which believes that it is the only partition | Allows only the partition (sub-cluster) with a majority of votes to run as the cluster (at most one partition can exist with such a majority) |
| Amnesia | Occurs when the cluster restarts after a shutdown with cluster data older than at the time of the shutdown | Guarantees that when a cluster is booted, it has at least one node that was a member of the most recent cluster membership (and thus has the latest configuration data) |
Both cluster nodes and quorum devices (disks that are shared between two or more nodes) vote to form quorum. By default, cluster nodes acquire a quorum vote count of one when they boot and become cluster members. Nodes can also have a vote count of zero, for example, when the node is being installed, or when an administrator has placed a node into maintenance state.
Quorum devices acquire quorum vote counts based on the number of node connections to the device. When a quorum device is set up, it acquires a maximum vote count of N-1, where N is the number of nodes with nonzero vote counts that have ports to the quorum device. For example, a quorum device connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).
You configure quorum devices during the cluster installation, or later by using the procedures described in the Sun Cluster 3.0 System Administration Guide.
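As a sketch, adding a quorum device after installation and checking the resulting vote counts might look like the following (the DID name d12 is hypothetical; use a shared disk in your own configuration):

# scconf -a -q globaldev=d12
# scstat -q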
A quorum device contributes to the vote count only if at least one of the nodes to which it is currently attached is a cluster member. Also, during cluster boot, a quorum device contributes to the count only if at least one of the nodes to which it is currently attached is booting and was a member of the most recently booted cluster when it was shut down.
Quorum configurations depend on the number of nodes in the cluster:
Two-Node Clusters - Two quorum votes are required for a two-node cluster to form. These two votes can come from the two cluster nodes, or from just one node and a quorum device. Nevertheless, a quorum device must be configured in a two-node cluster to ensure that a single node can continue if the other node fails.
More Than Two-Node Clusters - You should specify a quorum device between every pair of nodes that shares access to a disk storage enclosure. For example, suppose you have a three-node cluster similar to the one shown in Figure 3-3. In this figure, nodeA and nodeB share access to the same disk enclosure and nodeB and nodeC share access to another disk enclosure. There would be a total of five quorum votes, three from the nodes and two from the quorum devices shared between the nodes. A cluster needs a majority of the quorum votes, three, to form.
Specifying a quorum device between every pair of nodes that shares access to a disk storage enclosure is not required or enforced by Sun Cluster. However, it can provide needed quorum votes for the case where an N+1 configuration degenerates into a two-node cluster and then the node with access to both disk enclosures also fails. If you configured quorum devices between all pairs, the remaining node could still operate as a cluster.
See Figure 3-3 for examples of these configurations.
Use the following guidelines when setting up quorum devices:
Establish a quorum device between all nodes that are attached to the same shared disk storage enclosure. Add one disk within the shared enclosure as a quorum device to ensure that if any node fails, the other nodes can maintain quorum and master the disk device groups on the shared enclosure.
You must connect the quorum device to at least two nodes.
A quorum device can be any dual-ported SCSI-2 or SCSI-3 disk. Disks connected to more than two nodes must support SCSI-3 Persistent Group Reservation (PGR) regardless of whether the disk is used as a quorum device. See the chapter on planning in the Sun Cluster 3.0 Installation Guide for more information.
You can use a disk that contains user data as a quorum device.
Configure more than one quorum device between sets of nodes. Use disks from different enclosures, and configure an odd number of quorum devices between each set of nodes. This protects against individual quorum device failures.
A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this happens, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might believe it has sole access and ownership to the multihost disks. Multiple nodes attempting to write to the disks can result in data corruption.
Failure fencing limits node access to multihost disks by physically preventing access to the disks. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, resulting in data integrity.
Disk device services provide failover capability for services that make use of multihost disks. When a cluster member currently serving as the primary (owner) of the disk device group fails or becomes unreachable, a new primary is chosen, enabling access to the disk device group to continue with only minor interruption. During this process, the old primary must give up access to the devices before the new primary can be started. However, when a member drops out of the cluster and becomes unreachable, the cluster cannot inform that node to release the devices for which it was the primary. Thus, you need a means to enable surviving members to take control of and access global devices from failed members.
Sun Cluster uses SCSI disk reservations to implement failure fencing. Using SCSI reservations, failed nodes are "fenced" away from the multihost disks, preventing them from accessing those disks.
SCSI-2 disk reservations support a form of reservations, which either grants access to all nodes attached to the disk (when no reservation is in place) or restricts access to a single node (the node that holds the reservation).
When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure fencing procedure to prevent the other node from accessing shared disks. When this failure fencing occurs, it is normal for the fenced node to panic with a "reservation conflict" message on its console.
The reservation conflict occurs because after a node has been detected to no longer be a cluster member, a SCSI reservation is put on all of the disks that are shared between this node and other nodes. The fenced node might not be aware that it is being fenced and if it tries to access one of the shared disks, it detects the reservation and panics.
Sun Cluster uses volume management software to increase the availability of data by using mirrors and hot spare disks, and to handle disk failures and replacements.
Sun Cluster does not have its own internal volume manager component, but relies on the following volume managers:
Solstice DiskSuite
VERITAS Volume Manager
Volume management software in the cluster provides support for:
Failover handling of node failures
Multipath support from different nodes
Remote transparent access to disk device groups
When setting up a volume manager in Sun Cluster, you configure multihost disks as Sun Cluster disk device groups, a wrapper for a volume manager disk group. The group can be either a Solstice DiskSuite diskset or a VxVM disk group.
You must configure disk groups used for data services for mirroring to make the disks highly available within the cluster.
You can use metadevices or volumes either as raw devices (for example, for a database application) or to hold UFS file systems.
Volume management objects--metadevices and volumes--come under the control of the cluster, thus becoming disk device groups. For example, in Solstice DiskSuite, when you create a diskset in the cluster (by using the metaset(1M) command), a corresponding disk device group of the same name is created. Then, as you create metadevices in that diskset, they become global devices. Thus, a diskset is a collection of disk devices (DID devices) and hosts to which all devices in the set are ported. All disksets in a cluster need to be created with more than one host in the set to achieve HA. A similar situation occurs when you use VERITAS Volume Manager. The details of setting up each volume manager are included in the appendixes of the Sun Cluster 3.0 Installation Guide.
An important consideration when planning your disksets or disk groups is to understand how their associated disk device groups are associated with the application resources (data) within the cluster. Refer to the Sun Cluster 3.0 Installation Guide and the Sun Cluster 3.0 Data Services Installation and Configuration Guide for discussions of these issues.
The term data service is used to describe a third-party application such as Apache Web Server that has been configured to run on a cluster rather than on a single server. A data service includes the application software and Sun Cluster software that starts, stops, and monitors the application.
Sun Cluster supplies data service methods that are used to control and monitor the application within the cluster. These methods run under the control of the Resource Group Manager (RGM), which uses them to start, stop, and monitor the application on the cluster nodes. These methods, along with the cluster framework software and multihost disks, enable applications to become highly available data services. As highly available data services, they can prevent significant application interruptions after any single failure within the cluster. The failure could be to a node, an interface component, or to the application itself.
The RGM also manages resources in the cluster, including instances of an application and network resources (logical hostnames and shared addresses).
Sun Cluster also supplies an API and data service development tools to enable application programmers to develop the data service methods needed to make other applications run as highly available data services with Sun Cluster.
Sun Cluster provides an environment for making applications highly available or scalable. The RGM acts on resources, which are logical components that can be:
Brought online and taken offline (switched)
Managed by the RGM framework
Hosted on a single node (failover mode) or multiple nodes (scalable mode)
The RGM controls data services (applications) as resources, which are managed by resource type implementations. These implementations are either supplied by Sun or created by a developer with a generic data service template, the Data Service Development Library API (DSDL API), or the Sun Cluster Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers called resource groups, which form the basic unit of failover and switchover. The RGM stops and starts resource groups on selected nodes in response to cluster membership changes.
If the node on which the data service is running (the primary node) fails, the service is migrated to another working node without user intervention. Failover services utilize a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured up on one node, and later, automatically configured down on the original node and configured up on another node.
For failover data services, application instances run only on a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured.
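As a sketch of how a failover data service might be configured with scrgadm(1M) and brought online with scswitch(1M): register the resource type, create the failover resource group, add the logical hostname and application resources, and bring the group online. The resource, group, host, and hostname values below are hypothetical:

# scrgadm -a -t SUNW.dns
# scrgadm -a -g dns-rg -h phys-node1,phys-node2
# scrgadm -a -L -g dns-rg -l schost-1
# scrgadm -a -j dns-res -g dns-rg -t SUNW.dns -y Network_resources_used=schost-1
# scswitch -Z -g dns-rg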
The scalable data service has the potential for active instances on multiple nodes. Scalable services utilize a scalable resource group to contain the application resources and a failover resource group to contain the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running at once. The failover resource group that hosts the shared address is online on only one node at a time. All nodes hosting a scalable service use the same shared address to host the service.
Service requests come into the cluster through a single network interface (the global interface or GIF) and are distributed to the nodes based on one of several predefined algorithms set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes. Note that there can be multiple GIFs on different nodes hosting other shared addresses.
For scalable services, application instances run on several nodes simultaneously. If the node that hosts the global interface fails, the global interface fails over to another node. If an application instance running fails, the instance attempts to restart on the same node.
If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, it continues to run on the remaining nodes, possibly causing a degradation of service throughput.
TCP state for each application instance is kept on the node with the instance, not on the GIF node. Therefore, failure of the GIF node does not affect the connection.
Figure 3-4 shows an example of failover and a scalable resource group and the dependencies that exist between them for scalable services. This example shows three resource groups. The failover resource group contains application resources for highly available DNS, and network resources used by both highly available DNS and highly available Apache Web Server. The scalable resource groups contain only application instances of the Apache Web Server. Note that resource group dependencies exist between the scalable and failover resource groups (solid lines) and that all of the Apache application resources are dependent on the network resource schost-2, which is a shared address (dashed lines).
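A configuration like the one in Figure 3-4 might be created along the following lines. This is a hedged sketch with assumed group, resource, and node names; consult the data services documentation for the authoritative procedure:

# scrgadm -a -g sa-rg -h phys-node1,phys-node2,phys-node3
# scrgadm -a -S -g sa-rg -l schost-2
# scrgadm -a -t SUNW.apache
# scrgadm -a -g web-rg -y Maximum_primaries=3 -y Desired_primaries=3 \
      -y RG_dependencies=sa-rg
# scrgadm -a -j apache-res -g web-rg -t SUNW.apache \
      -y Scalable=True -y Network_resources_used=schost-2 -y Port_list=80/tcp
# scswitch -Z -g sa-rg
# scswitch -Z -g web-rg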
The primary goal of cluster networking is to provide scalability for data services. Scalability means that as the load offered to a service increases, a data service can maintain a constant response time in the face of the increased workload as new nodes are added to the cluster and new server instances are run. We call such a service a scalable data service. A good example of a scalable data service is a web service. Typically, a scalable data service is composed of several instances, each of which runs on a different node of the cluster. Together, these instances behave as a single service from the standpoint of a remote client and implement the functionality of the service. We might, for example, have a scalable web service made up of several httpd daemons running on different nodes. Any httpd daemon may serve a client request; which daemon serves the request depends on the load-balancing policy. The reply to the client appears to come from the service, not from the particular daemon that serviced the request, thus preserving the single-service appearance.
A scalable service is composed of:
Networking infrastructure support for scalable services
Load balancing
HA support for networking and data services (using the Resource Group Manager)
The following figure depicts the scalable service architecture.
The nodes that are not hosting the global interface (proxy nodes) have the shared address hosted on their loopback interfaces. Packets coming into the GIF are distributed to other cluster nodes based on configurable load-balancing policies. The possible load-balancing policies are described next.
Load balancing improves performance of the scalable service, both in response time and in throughput.
There are two classes of scalable data services: pure and sticky. A pure service is one where any instance of it can respond to client requests. A sticky service is one where a client sends requests to the same instance. Those requests are not redirected to other instances.
A pure service uses a weighted load-balancing policy. Under this load-balancing policy, client requests are by default uniformly distributed over the server instances in the cluster. For example, in a three-node cluster, let us suppose that each node has the weight of 1. Each node will service 1/3 of the requests from any client on behalf of that service. Weights can be changed at any time by the administrator through the scrgadm(1M) command interface.
A sticky service has two flavors, ordinary sticky and wildcard sticky. Sticky services allow concurrent application-level sessions over multiple TCP connections to share in-state memory (application session state).
Ordinary sticky services permit a client to share state between multiple concurrent TCP connections. The client is said to be "sticky" with respect to that server instance listening on a single port. The client is guaranteed that all of his requests go to the same server instance, provided that instance remains up and accessible and the load balancing policy is not changed while the service is online.
For example, a web browser on the client connects to a shared IP address on port 80 using three different TCP connections, but the connections are exchanging cached session information between them at the service.
The sticky policy generalizes to multiple scalable services that exchange session information behind the scenes at the same instance. When services exchange session information this way, the client is said to be "sticky" with respect to multiple server instances on the same node listening on different ports.
For example, a customer on an e-commerce site fills his shopping cart with items using ordinary HTTP on port 80, but switches to SSL on port 443 to send secure data in order to pay by credit card for the items in the cart.
Wildcard sticky services use dynamically assigned port numbers, but still expect client requests to go to the same node. The client is "sticky wildcard" over ports with respect to the same IP address.
A good example of this policy is passive mode FTP. A client connects to an FTP server on port 21 and is then told by the server to connect back to a listener port in the dynamic port range. All requests for this IP address are forwarded to the node that the server identified to the client through the control information.
Note that for each of these sticky policies the weighted load-balancing policy is in effect by default; thus, a client's initial request is directed to the instance dictated by the load balancer. After the client has established an affinity for the node where the instance is running, future requests are directed to that instance as long as the node is accessible and the load-balancing policy is not changed.
Additional details of the specific load balancing policies are discussed below.
Weighted. The load is distributed among various nodes according to specified weight values. This policy is set using the LB_WEIGHTED value for the Load_balancing_policy resource property; the per-node weights themselves are set through the Load_balancing_weights property. If a weight for a node is not explicitly set, the weight for that node defaults to one.
Note that this policy is not round robin. A round-robin policy would always cause each request from a client to go to a different node: the first request to node 1, the second request to node 2, and so on. The weighted policy guarantees that a certain percentage of the traffic from clients is directed to a particular node. This policy does not address individual requests.
Sticky. In this policy, the set of ports is known at the time the application resources are configured. This policy is set using the LB_STICKY value for the Load_balancing_policy resource property.
Sticky-wildcard. This policy is a superset of the ordinary "sticky" policy. For a scalable service identified by the IP address, ports are assigned by the server (and are not known in advance). The ports might change. This policy is set using the LB_STICKY_WILD value for the Load_balancing_policy resource property.
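As a sketch, the load-balancing policy is chosen when the resource is created, and weights can be adjusted afterward. The resource name and the weight string (weight@node-ID pairs) below are illustrative:

# scrgadm -a -j web-res -g web-rg -t SUNW.apache \
      -y Scalable=True -y Load_balancing_policy=LB_STICKY \
      -y Network_resources_used=schost-2
# scrgadm -c -j web-res -y Load_balancing_weights=3@1,2@2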
Resource groups fail over from one node to another. You can specify that if a resource group fails over and the node on which it was previously running later returns to the cluster, the resource group "fails back" to that original node. This option is set using the Failback resource group property setting.
In certain instances, for example if the original node hosting the resource group is failing and rebooting repeatedly, setting failback might result in reduced availability for the resource group.
Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon(s) are running and that clients are being served. Based on the information returned by probes, predefined actions, such as restarting daemons or causing a failover, can be initiated.
Sun supplies software that enables you to make various applications operate as highly available data services within a cluster. If the application that you want to run as a highly available data service is not one that is currently offered by Sun, you can use an API or the DSDL API to configure the application to run as a highly available data service. There are two flavors of data services: failover and scalable. A set of criteria determines whether your application can use one of these data service flavors. The specific criteria are described in the Sun Cluster documents that describe the APIs you can use for your application.
Here, we present some guidelines to help you understand whether your service can take advantage of the scalable data services architecture. Review the section, "Scalable Data Services" for more general information on scalable services.
New services that satisfy the following guidelines may make use of scalable services. If an existing service doesn't follow these guidelines exactly, portions may need to be rewritten so that the service complies with the guidelines.
A scalable data service has the following characteristics. First, such a service is composed of one or more server instances. Each instance runs on a different node of the cluster. Two or more instances of the same service cannot run on the same node.
Second, if the service provides an external logical data store, then concurrent access to this store from multiple server instances must be synchronized to avoid losing updates or reading data as it's being changed. Note that we say "external" to distinguish the store from in-memory state, and "logical" because the store appears as a single entity, although it may itself be replicated. Furthermore, this logical data store has the property that whenever any server instance updates the store, that update is immediately seen by other instances.
Sun Cluster provides such an external storage through its cluster file system and its global raw partitions. As an example, suppose a service writes new data to an external log file or modifies existing data in place. When multiple instances of this service run, each has access to this external log, and each may simultaneously access this log. Each instance must synchronize its access to this log, or else the instances interfere with each other. The service could use ordinary Solaris file locking via fcntl(2) and lockf(3C) to achieve the desired synchronization.
Another example of such a store is a backend database such as highly available Oracle or Oracle Parallel Server. Note that such a back-end database server provides built-in synchronization using database query or update transactions, and so multiple server instances need not implement their own synchronization.
An example of a service that is not a scalable service in its current incarnation is Sun's IMAP server. The service updates a store, but that store is private and when multiple IMAP instances write to this store, they overwrite each other because the updates are not synchronized. The IMAP server must be rewritten to synchronize concurrent access.
Finally, note that instances may have private data that's disjoint from the data of other instances. In such a case, the service need not concern itself with synchronizing concurrent access because the data is private, and only that instance can manipulate it. In this case, you must be careful not to store this private data under the cluster file system because it has the potential to become globally accessible.
Sun Cluster provides the following to make applications highly available:
Data services supplied as part of Sun Cluster
A data service API
A data service development library API
A "generic" data service
The Sun Cluster 3.0 Data Services Installation and Configuration Guide describes how to install and configure the data services supplied with Sun Cluster. The Sun Cluster 3.0 Data Services Developers' Guide describes how to instrument other applications to be highly available under the Sun Cluster framework.
The Sun Cluster API and Data Service Development Library API enable application programmers to develop fault monitors and scripts that start and stop data services instances. With these tools, an application can be instrumented to be a failover or a scalable data service. In addition, Sun Cluster provides a "generic" data service that can be used to quickly generate an application's required start and stop methods to make it run as a highly available data service.
Data services utilize several types of resources: applications such as Apache Web Server or iPlanet Web Server, and the network addresses (logical hostnames and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.
A resource is an instantiation of a resource type that is defined cluster wide. There are several resource types defined.
Data services are resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle and Sun Cluster HA for Apache is the resource type SUNW.apache.
Network resources are either SUNW.LogicalHostname or SUNW.SharedAddress resource types. These two resource types are pre-registered by the Sun Cluster product.
The SUNW.HAStorage resource type is used to synchronize the startup of resources and disk device groups upon which the resources depend. It ensures that before a data service starts, the paths to cluster file system mount points, global devices, and device group names are available.
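A sketch of how SUNW.HAStorage might be used (the resource, group, and path names are hypothetical): register the type, create a resource naming the paths the data service depends on, and make the application resource depend on it:

# scrgadm -a -t SUNW.HAStorage
# scrgadm -a -j hastorage-res -g oracle-rg -t SUNW.HAStorage \
      -x ServicePaths=/global/oracle -x AffinityOn=True
# scrgadm -c -j oracle-res -y Resource_dependencies=hastorage-res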
RGM-managed resources are placed into groups, called resource groups, so that they can be managed as a unit. A resource group is migrated as a unit if a failover or switchover is initiated on the resource group.
When you bring a resource group containing application resources online, the application is started. The data service start method waits until the application is up and running before exiting successfully. The determination of when the application is up and running is accomplished the same way the data service fault monitor determines that a data service is serving clients. Refer to the Sun Cluster 3.0 Data Services Installation and Configuration Guide for more information on this process.
You can configure property values for resources and resource groups for your Sun Cluster data services. A set of standard properties are common to all data services and a set of extension properties are specific to each data service. Some standard and extension properties are configured with default settings so that you do not have to modify them. Others need to be set as part of the process of creating and configuring resources. The documentation for each data service specifies what properties are used by the resource type, and how they should be configured.
The standard properties are used to configure resource and resource group properties that are usually independent of any particular data service. The set of standard properties is described in an appendix to the Sun Cluster 3.0 Data Services Installation and Configuration Guide.
The extension properties provide information such as the location of application binaries and configuration files. You modify extension properties as you configure your data services. The set of extension properties is described in the individual chapter for the data service in the Sun Cluster 3.0 Data Services Installation and Configuration Guide.
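For example, a standard property (-y) and an extension property (-x) might be changed on an existing resource as follows. The resource name is hypothetical, and the extension property name is a common one for Sun data services; check the resource type's documentation for the properties it actually supports:

# scrgadm -c -j apache-res -y Retry_count=2
# scrgadm -c -j apache-res -x Monitor_retry_count=4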
Clients make data requests to the cluster through the public network. Each cluster node is connected to at least one public network through a public network adapter.
Sun Cluster Public Network Management (PNM) software provides the basic mechanism for monitoring public network adapters and failing over IP addresses from one adapter to another when a fault is detected. Each cluster node has its own PNM configuration, which can be different from that on other cluster nodes.
Public network adapters are organized into Network Adapter Failover groups (NAFO groups). Each NAFO group has one or more public network adapters. While only one adapter can be active at any time for a given NAFO group, additional adapters in the same group serve as backups that are used during adapter failover if the PNM daemon detects a fault on the active adapter. A failover causes the IP addresses associated with the active adapter to be moved to the backup adapter, thereby maintaining public network connectivity for the node. Because the failover happens at the adapter interface level, higher-level connections such as TCP are not affected, except for a brief transient delay during the failover.
Because of the congestion recovery characteristics of TCP, TCP endpoints can suffer further delay after a successful failover as some segments could be lost during the failover, activating the congestion control mechanism in TCP.
NAFO groups provide the building blocks for logical hostname and shared address resources. The scrgadm(1M) command automatically creates NAFO groups for you if necessary. You can also create NAFO groups independently of logical hostname and shared address resources to monitor public network connectivity of cluster nodes. The same NAFO group on a node can host any number of logical hostname or shared address resources. For more information on logical hostname and shared address resources, see the Sun Cluster 3.0 Data Services Installation and Configuration Guide.
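A sketch of creating a NAFO group and checking its status with the PNM commands, assuming two hypothetical adapters qfe0 and qfe1:

# pnmset -c nafo0 -o create qfe0 qfe1
# pnmstat -l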
The design of the NAFO mechanism is meant to detect and mask adapter failures. The design is not intended to recover from an administrator using ifconfig(1M) to remove one of the logical (or shared) IP addresses. The Sun Cluster design views the logical and shared IP addresses as resources managed by the RGM. The correct way for an administrator to add or remove an IP address is to use scrgadm(1M) to modify the resource group containing the resource.
PNM checks the packet counters of an active adapter regularly, assuming that the packet counters of a healthy adapter will change because of normal network traffic through the adapter. If the packet counters do not change for some time, PNM goes into a ping sequence, which forces traffic through the active adapter. PNM checks for any change in the packet counters at the end of each sequence, and declares the adapter faulty if the packet counters remain unchanged after the ping sequence is repeated a couple of times. These events trigger a failover to a backup adapter, as long as one is available.
Both input and output packet counters are monitored by PNM so that when either or both remain unchanged for some time, the ping sequence is initiated.
The ping sequence consists of a ping of the ALL_ROUTER multicast address (224.0.0.2), the ALL_HOST multicast address (224.0.0.1), and the local subnet broadcast address.
Pings are structured in a least-costly-first manner, so that a more costly ping is not run if a less costly one has succeeded. Also, pings are used only as a means to generate traffic on the adapter. Their exit statuses do not contribute to the decision of whether an adapter is functioning or faulty.
Four tunable parameters are in this algorithm: inactive_time, ping_timeout, repeat_test, and slow_network. These parameters provide an adjustable trade-off between speed and correctness of fault detection. Refer to the procedure for changing public network parameters in the Sun Cluster 3.0 System Administration Guide for details on the parameters and how to change them.
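These parameters are kept in the /etc/cluster/pnmparams file on each node, with entries of the form parameter=value, as in the following sketch. The values shown are illustrative examples, not recommendations; see the administration guide for the documented defaults and ranges:

inactive_time=5
ping_timeout=4
repeat_test=3
slow_network=2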
After a fault is detected on a NAFO group's active adapter, if a backup adapter is not available, the group is declared DOWN, while testing of all its backup adapters continues. Otherwise, if a backup adapter is available, a failover occurs to the backup adapter. Logical addresses and their associated flags are "transferred" to the backup adapter while the faulty active adapter is brought down and unplumbed.
When the failover of IP addresses completes successfully, gratuitous ARP broadcasts are sent. The connectivity to remote clients is therefore maintained.