Sun Cluster 3.1 Concepts Guide

Chapter 3 Key Concepts for Administration and Application Development

This chapter describes the key concepts related to the software components of a SunPlex system. The topics covered include:

Cluster Administration and Application Development

This information is directed primarily toward system administrators and application developers using the SunPlex API and SDK. Cluster system administrators can use this information as background to installing, configuring, and administering cluster software. Application developers can use the information to understand the cluster environment in which they will be working.

The following figure shows a high-level view of how the cluster administration concepts map to the cluster architecture.

Figure 3–1 Sun Cluster Software Architecture

Illustration: The following context describes this graphic.

Administrative Interfaces

You can choose how you install, configure, and administer the SunPlex system from several user interfaces. You can accomplish system administration tasks either through the SunPlex Manager graphic user interface (GUI), or through the documented command-line interface. On top of the command-line interface are some utilities, such as scinstall and scsetup, to simplify selected installation and configuration tasks. The SunPlex system also has a module that runs as part of Sun Management Center that provides a GUI to certain cluster tasks. Refer to the introductory chapter in the Sun Cluster 3.1 System Administration Guide for complete descriptions of the administrative interfaces.

Cluster Time

Time between all nodes in a cluster must be synchronized. Whether you synchronize the cluster nodes with any outside time source is not important to cluster operation. The SunPlex system employs the Network Time Protocol (NTP) to synchronize the clocks between nodes.

In general, a change in the system clock of a fraction of a second causes no problems. However, if you run date(1), rdate(1M), or xntpdate(1M) (interactively, or within cron scripts) on an active cluster, you can force a time change much larger than a fraction of a second to synchronize the system clock to the time source. This forced change might cause problems with file modification timestamps or confuse the NTP service.

When you install the Solaris operating environment on each cluster node, you have an opportunity to change the default time and date setting for the node. In general, you can accept the factory default.

When you install Sun Cluster software using scinstall(1M), one step in the process is to configure NTP for the cluster. Sun Cluster software supplies a template file, ntp.cluster (see /etc/inet/ntp.cluster on an installed cluster node), that establishes a peer relationship between all cluster nodes, with one node being the “preferred” node. Nodes are identified by their private host names and time synchronization occurs across the cluster interconnect. The instructions for how to configure the cluster for NTP are included in the Sun Cluster 3.1 Software Installation Guide.

Alternately, you can set up one or more NTP servers outside the cluster and change the ntp.conf file to reflect that configuration.

In normal operation, you should never need to adjust the time on the cluster. However, if the time was set incorrectly when you installed the Solaris operating environment and you want to change it, the procedure for doing so is included in the Sun Cluster 3.1 System Administration Guide.

High-Availability Framework

The SunPlex system makes all components on the “path” between users and data highly available, including network interfaces, the applications themselves, the file system, and the multihost disks. In general, a cluster component is highly available if it survives any single (software or hardware) failure in the system.

The following table shows the kinds of SunPlex component failures (both hardware and software) and the kinds of recovery built into the high-availability framework.

Table 3–1 Levels of SunPlex Failure Detection and Recovery


Failed Cluster Component	Software Recovery	Hardware Recovery
Data service	HA API, HA framework	N/A
Public network adapter	IP Network Multipathing	Multiple public network adapter cards
Cluster file system	Primary and secondary replicas	Multihost disks
Mirrored multihost disk	Volume management (Solaris Volume Manager and VERITAS Volume Manager)	Hardware RAID-5 (for example, Sun StorEdge^TM A3x00)
Global device	Primary and secondary replicas	Multiple paths to the device, cluster transport junctions
Private network	HA transport software	Multiple private hardware-independent networks
Node	CMM, failfast driver	Multiple nodes

Sun Cluster software's high-availability framework detects a node failure quickly and creates a new equivalent server for the framework resources on a remaining node in the cluster. At no time are all framework resources unavailable. Framework resources unaffected by a crashed node are fully available during recovery. Furthermore, framework resources of the failed node become available as soon as they are recovered. A recovered framework resource does not have to wait for all other framework resources to complete their recovery.

Most highly available framework resources are recovered transparently to the applications (data services) using the resource. The semantics of framework resource access are fully preserved across node failure. The applications simply cannot tell that the framework resource server has been moved to another node. Failure of a single node is completely transparent to programs on remaining nodes using the files, devices, and disk volumes attached to this node, as long as an alternative hardware path exists to the disks from another node. An example is the use of multihost disks that have ports to multiple nodes.

Cluster Membership Monitor

The Cluster Membership Monitor (CMM) is a distributed set of agents, one per cluster member. The agents exchange messages over the cluster interconnect to:

Enforce a consistent membership view on all nodes (quorum)
Drive synchronized reconfiguration in response to membership changes, using registered callbacks
Handle cluster partitioning (split brain, amnesia)
Ensure full connectivity among all cluster members

Unlike previous Sun Cluster software releases, CMM runs entirely in the kernel.

Cluster Membership

The main function of the CMM is to establish cluster-wide agreement on the set of nodes that participates in the cluster at any given time. This constraint is called the cluster membership.

To determine cluster membership, and ultimately, ensure data integrity, the CMM:

Accounts for a change in cluster membership, such as a node joining or leaving the cluster
Ensures that a “bad” node leaves the cluster
Ensures that a “bad” node stays out of the cluster until it is repaired
Prevents the cluster from partitioning itself into subsets of nodes

See Quorum and Quorum Devices for more information on how the cluster protects itself from partitioning into multiple separate clusters.

Cluster Membership Monitor Reconfiguration

To ensure that data is kept safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the CMM coordinates a cluster reconfiguration of cluster services (applications) in response to a failure.

The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration.

After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster, where cluster resources might be redistributed based on the new membership of the cluster.

Failfast Mechanism

If the CMM detects a critical problem with a node, it calls upon the cluster framework to forcibly shut down (panic) the node and to remove it from the cluster membership. The mechanism by which this occurs is called failfast. Failfast will cause a node to shut down in two ways.

If a node leaves the cluster and then attempts to start a new cluster without having quorum, it is “fenced” from accessing the shared disks. See Failure Fencing for details on this use of failfast.
If one or more cluster-specific daemons die (clexecd, rpc.pmfd, rgmd, or rpc.ed) the failure is detected by the CMM and the node panics.

When a node panics due to the death of a cluster daemon, a message similar to the following will display on the console for that node.

panic[cpu0]/thread=40e60: Failfast: Aborting because "pmfd" died 35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0

After the panic, the node might reboot and attempt to rejoin the cluster or stay at the OpenBoot^TM PROM (OBP) prompt. The action taken is determined by the setting of the auto-boot? parameter in the OBP.

Cluster Configuration Repository (CCR)

The Cluster Configuration Repository (CCR) is a private, cluster-wide database for storing information pertaining to the configuration and state of the cluster. The CCR is a distributed database. Each node maintains a complete copy of the database. The CCR ensures that all nodes have a consistent view of the cluster “world.” To avoid corrupting data, each node needs to know the current state of the cluster resources.

The CCR uses a two-phase commit algorithm for updates: An update must complete successfully on all cluster members or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.

Caution –

Although the CCR consists of text files, never edit the CCR files manually. Each file contains a checksum record to ensure consistency between nodes. Manually updating CCR files can cause a node or the entire cluster to stop functioning.

The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.

Global Devices

The SunPlex system uses global devices to provide cluster-wide, highly available access to any device in a cluster, from any node, without regard to where the device is physically attached. In general, if a node fails while providing access to a global device, the Sun Cluster software automatically discovers another path to the device and redirects the access to that path. SunPlex global devices include disks, CD-ROMs, and tapes. However, disks are the only supported multiported global devices. This means that CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.

The cluster automatically assigns unique IDs to each disk, CD-ROM, and tape device in the cluster. This assignment allows consistent access to each device from any node in the cluster. The global device namespace is held in the /dev/global directory. See Global Namespace for more information.

Multiported global devices provide more than one path to a device. In the case of multihost disks, because the disks are part of a disk device group hosted by more than one node, the multihost disks are made highly available.

Device ID (DID)

The Sun Cluster software manages global devices through a construct known as the device ID (DID) pseudo driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.

The device ID (DID) pseudo driver is an integral part of the global device access feature of the cluster. The DID driver probes all nodes of the cluster and builds a list of unique disk devices, assigning each a unique major and minor number that is consistent on all nodes of the cluster. Access to the global devices is performed utilizing the unique device ID assigned by the DID driver instead of the traditional Solaris device IDs, such as c0t0d0 for a disk.

This approach ensures that any application accessing disks (such as a volume manager or applications using raw devices) uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from node to node, thus changing the Solaris device naming conventions as well. For example, node1 might see a multihost disk as c1t2d0, and node2 might see the same disk completely differently, as c3t2d0. The DID driver assigns a global name, such as d10, that the nodes would use instead, giving each node a consistent mapping to the multihost disk.

You update and administer Device IDs through scdidadm(1M) and scgdevs(1M). See the respective man pages for more information.

Disk Device Groups

In the SunPlex system, all multihost disks must be under control of the Sun Cluster software. You first create volume manager disk groups, either Solaris Volume Manager disk sets or VERITAS Volume Manager disk groups, on the multihost disks. Then, you register the volume manager disk groups as disk device groups. A disk device group is a type of global device. In addition, the Sun Cluster software automatically creates a rawdisk device group for each disk and tape device in the cluster. However, these cluster device groups remain in an offline state until you access them as global devices.

Registration provides the SunPlex system information about which nodes have a path to what volume manager disk groups. At this point, the volume manager disk groups become globally accessible within the cluster. If more than one node can write to (master) a disk device group, the data stored in that disk device group becomes highly available. The highly available disk device group can be used to house cluster file systems.

Note –

Disk device groups are independent of resource groups. One node can master a resource group (representing a group of data service processes) while another can master the disk group(s) being accessed by the data services. However, the best practice is to keep the disk device group that stores a particular application's data and the resource group that contains the application's resources (the application daemon) on the same node. Refer to the Sun Cluster 3.1 Data Service Planning and Administration Guide for more information about the association between disk device groups and resource groups.

With a disk device group, the volume manager disk group becomes “global” because it provides multipath support to the underlying disks. Each cluster node physically attached to the multihost disks provides a path to the disk device group.

Disk Device Group Failover

Because a disk enclosure is connected to more than one node, all disk device groups in that enclosure are accessible through an alternate path if the node currently mastering the device group fails. The failure of the node mastering the device group does not affect access to the device group except for the time it takes to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.

Figure 3–2 Disk Device Group Failover

Illustration: The preceding context describes the graphic.

Multiported Disk Device Groups

This section describes disk device group properties that enable you to balance performance and availability in a multiported disk configuration. Sun Cluster software provides two properties used to configure a multiported disk configuration: preferenced and numsecondaries. Control the order in which nodes attempt to assume control if failover occurs by using the preferenced property. Use the numsecondaries property to set a desired number of secondary nodes for a device group.

A highly available service is considered down when the primary goes down and when no more eligible secondary nodes can be promoted to primary. If service failover occurs, the nodelist that is set by the preferenced property defines the order in which nodes will attempt to assume primary control or transition from spare to secondary. You can dynamically change the preference of a device service by using the scsetup(1M) utility. The preference that is associated with dependent service providers, for example a global file system, will be that of the device service.

Secondary nodes are check-pointed by the primary node during normal operation. In a multiported disk configuration, checkpointing each secondary node causes cluster performance degradation and memory overhead. Spare node support was implemented to minimize the performance degradation and memory overhead caused by checkpointing. By default, your disk device group will have one primary and one secondary. The remaining available provider nodes will come online in the spare state. If failover occurs, the secondary will become primary and the node highest in priority on the nodelist will become secondary.

The desired number of secondary nodes can be set to any integer between one and the number of operational non-primary provider nodes in the device group.

Note –

If you are using Solaris Volume Manager, you must create the disk device group before you can set the numsecondaries property to a number other than the default.

The default desired number of secondaries for device services is one. The actual number of secondary providers that is maintained by the replica framework is the desired number, unless the number of operational non-primary providers is less than the desired number. You will want to alter the numsecondaries property and double check the nodelist if you are adding or removing nodes from your configuration. Maintaining the nodelist and desired number of secondaries will prevent conflict between the configured number of secondaries and the actual number allowed by the framework. Use the scconf(1M) command for VxVM disk device groups or the metaset(1M) command for Solaris Volume Manager device groups in conjunction with the preferenced and numsecondaries property settings to manage addition and removal of nodes from your configuration. Refer to “Administering Global Devices and Cluster File Systems” in Sun Cluster 3.1 System Administration Guide Sun Cluster 3.1 System Administration Guide for procedural information about changing disk device group properties.

Global Namespace

The Sun Cluster software mechanism that enables global devices is the global namespace. The global namespace includes the /dev/global/ hierarchy as well as the volume manager namespaces. The global namespace reflects both multihost disks and local disks (and any other cluster device, such as CD-ROMs and tapes), and provides multiple failover paths to the multihost disks. Each node physically connected to multihost disks provides a path to the storage for any node in the cluster.

Normally, the volume manager namespaces reside in the /dev/md/diskset/dsk (and rdsk) directories, for Solaris Volume Manager; and in the /dev/vx/dsk/disk-group and /dev/vx/rdsk/disk-group directories, for VxVM. These namespaces consist of directories for each Solaris Volume Manager diskset and each VxVM disk group imported throughout the cluster, respectively. Each of these directories houses a device node for each metadevice or volume in that diskset or disk group.

In the SunPlex system, each of the device nodes in the local volume manager namespace is replaced by a symbolic link to a device node in the /global/.devices/node@nodeID file system, where nodeID is an integer that represents the nodes in the cluster. Sun Cluster software continues to present the volume manager devices, as symbolic links, in their standard locations as well. Both the global namespace and standard volume manager namespace are available from any cluster node.

The advantages of the global namespace include:

Each node remains fairly independent, with little change in the device administration model.
Devices can be selectively made global.
Third-party link generators continue to work.
Given a local device name, an easy mapping is provided to obtain its global name.

Local and Global Namespaces Example

The following table shows the mappings between the local and global namespaces for a multihost disk, c0t0d0s0.

Table 3–2 Local and Global Namespaces Mappings


Component/Path	Local Node Namespace	Global Namespace
Solaris logical name	`/dev/dsk/c0t0d0s0`	`/global/.devices/node@nodeID/dev/dsk/c0t0d0s0`
DID name	`/dev/did/dsk/d0s0`	`/global/.devices/node@nodeID/dev/did/dsk/d0s0`
Solaris Volume Manager	`/dev/md/diskset/dsk/d0`	`/global/.devices/node@nodeID/dev/md/diskset/dsk/d0`
VERITAS Volume Manager	`/dev/vx/dsk/disk-group/v0`	`/global/.devices/node@nodeID/dev/vx/dsk/disk-group/v0`

The global namespace is automatically generated on installation and updated with every reconfiguration reboot. You can also generate the global namespace by running the scgdevs(1M) command.

Cluster File Systems

A cluster file system is a proxy between the kernel on one node and the underlying file system and volume manager running on a node that has a physical connection to the disk(s).

Cluster file systems are dependent on global devices (disks, tapes, CD-ROMs) with physical connections to one or more nodes. The global devices can be accessed from any node in the cluster through the same file name (for example, /dev/global/) whether or not that node has a physical connection to the storage device. You can use a global device the same as a regular device, that is, you can create a file system on it using newfs and/or mkfs.

You can mount a file system on a global device globally with mount -g or locally with mount.

Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/foo).

A cluster file system is mounted on all cluster members. You cannot mount a cluster file system on a subset of cluster members.

A cluster file system is not a distinct file system type. That is, clients see the underlying file system (for example, UFS).

Using Cluster File Systems

In the SunPlex system, all multihost disks are placed into disk device groups, which can be Solaris Volume Manager disksets, VxVM disk groups, or individual disks not under control of a software-based volume manager.

For a cluster file system to be highly available, the underlying disk storage must be connected to more than one node. Therefore, a local file system (a file system that is stored on a node's local disk) that is made into a cluster file system is not highly available.

As with normal file systems, you can mount cluster file systems in two ways:

Manually: Use the mount command and the -g or -o global mount options to mount the cluster file system from the command line, for example:
# mount -g /dev/global/dsk/d0s0 /global/oracle/data
Automatically: Create an entry in the /etc/vfstab file with a global mount option to mount the cluster file system at boot. You then create a mount point under the /global directory on all nodes. The directory /global is a recommended location, not a requirement. Here's a sample line for a cluster file system from an /etc/vfstab file:
/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging

Note –

While Sun Cluster software does not impose a naming policy for cluster file systems, you can ease administration by creating a mount point for all cluster file systems under the same directory, such as /global/disk-device-group. See Sun Cluster 3.1 Software Installation Guide and Sun Cluster 3.1 System Administration Guide for more information.

Cluster File System Features

The cluster file system has the following features:

File access locations are transparent. A process can open a file located anywhere in the system and processes on all nodes can use the same path name to locate a file.

Note –
When the cluster file system reads files, it does not update the access time on those files.
Coherency protocols are used to preserve the UNIX file access semantics even if the file is accessed concurrently from multiple nodes.
Extensive caching is used along with zero-copy bulk I/O movement to move file data efficiently.
The cluster file system provides highly available advisory file locking functionality using the fcntl(2) interfaces. Applications running on multiple cluster nodes can synchronize access to data using advisory file locking on a cluster file system file. File locks are recovered immediately from nodes that leave the cluster, and from applications that fail while holding locks.
Continuous access to data is ensured, even when failures occur. Applications are not affected by failures as long as a path to disks is still operational. This guarantee is maintained for raw disk access and all file system operations.
Cluster file systems are independent from the underlying file system and volume management software. Cluster file systems make any supported on-disk file system global.

HAStoragePlus Resource Type

The HAStoragePlus resource type is designed to make non-global file system configurations such as UFS and VxFS highly available. Use HAStoragePlus to integrate your local file system into the Sun Cluster environment and make the file system highly available. HAStoragePlus provides additional file system capabilities such as checks, mounts, and forced unmounts that enable Sun Cluster to fail over local file systems. In order to fail over, the local file system must reside on global disk groups with affinity switchovers enabled.

See the individual data service chapters or the Sun Cluster 3.1 Data Service Planning and Administration Guide in the Sun Cluster 3.1 Data Service Collection for information on how to use the HAStoragePlus resource type.

HAStoragePlus can also used to synchronize the startup of resources and disk device groups upon which the resources depend. For more information, see Resources, Resource Groups, and Resource Types.

The Syncdir Mount Option

The syncdir mount option can be used for cluster file systems that use UFS as the underlying file system. However, there is a significant performance improvement if you do not specify syncdir. If you specify syncdir, the writes are guaranteed to be POSIX compliant. If you do not, you will have the same behavior that is seen with NFS file systems. For example, under some cases, without syncdir, you would not discover an out of space condition until you close a file. With syncdir (and POSIX behavior), the out of space condition would have been discovered during the write operation. The cases in which you could have problems if you do not specify syncdir are rare, so we recommend that you do not specify it and receive the performance benefit.

VxFS does not have a mount-option equivalent to the syncdir mount option for UFS. VxFS behavior is the same as for UFS when the syncdir mount option is not specified.

See File Systems FAQs for frequently asked questions about global devices and cluster file systems.

Quorum and Quorum Devices

Because cluster nodes share data and resources, it is important that a cluster never splits into separate partitions that are active at the same time. The CMM guarantees that at most one cluster is operational at any time, even if the cluster interconnect is partitioned.

There are two types of problems that arise from cluster partitions: split brain and amnesia. Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into sub-clusters, each of which believes that it is the only partition. This occurs due to communication problems between cluster nodes. Amnesia occurs when the cluster restarts after a shutdown with cluster data older than at the time of the shutdown. This can happen if multiple versions of the framework data are stored on disk and a new incarnation of the cluster is started when the latest version is not available.

Split brain and amnesia can be avoided by giving each node one vote and mandating a majority of votes for an operational cluster. A partition with the majority of votes has a quorum and is allowed to operate. This majority vote mechanism works fine as long as there are more than two nodes in the cluster. In a two-node cluster, a majority is two. If such a cluster becomes partitioned, an external vote is needed for either partition to gain quorum. This external vote is provided by a quorum device. A quorum device can be any disk that is shared between the two nodes. Disks used as quorum devices can contain user data.

The following table describes how Sun Cluster software uses quorum to avoid split brain and amnesia.

Table 3–3 Cluster Quorum, and Split-Brain and Amnesia Problems


Partition Type	Quorum Solution
Split brain	Allows only the partition (sub-cluster) with a majority of votes to run as the cluster (where at most one partition can exist with such a majority); once a node loses the race for quorum, that node panics
Amnesia	Guarantees that when a cluster is booted, it has at least one node that was a member of the most recent cluster membership (and thus has the latest configuration data)

The quorum algorithm operates dynamically: as cluster events trigger its calculations, the results of calculations can change over the lifetime of a cluster.

Quorum Vote Counts

Both cluster nodes and quorum devices vote to form quorum. By default, cluster nodes acquire a quorum vote count of one when they boot and become cluster members. Nodes can also have a vote count of zero, for example, when the node is being installed, or when an administrator has placed a node into maintenance state.

Quorum devices acquire quorum vote counts based on the number of node connections to the device. When a quorum device is set up, it acquires a maximum vote count of N-1 where N is the number of connected votes to the quorum device. For example, a quorum device connected to two nodes with non zero vote counts has a quorum count of one (two minus one).

You configure quorum devices during the cluster installation, or later by using the procedures described in the Sun Cluster 3.1 System Administration Guide.

Note –

A quorum device contributes to the vote count only if at least one of the nodes to which it is currently attached is a cluster member. Also, during cluster boot, a quorum device contributes to the count only if at least one of the nodes to which it is currently attached is booting and was a member of the most recently booted cluster when it was shut down.

Quorum Configurations

Quorum configurations depend on the number of nodes in the cluster:

Two-Node Clusters: Two quorum votes are required for a two-node cluster to form. These two votes can come from the two cluster nodes, or from just one node and a quorum device. Nevertheless, a quorum device must be configured in a two-node cluster to ensure that a single node can continue if the other node fails.
More Than Two-Node Clusters: You should specify a quorum device between every pair of nodes that shares access to a disk storage enclosure. For example, suppose you have a three-node cluster similar to the one shown in the following figure Quorum Device Configuration Examples. In this figure, nodeA and nodeB share access to the same disk enclosure and nodeB and nodeC share access to another disk enclosure. There would be a total of five quorum votes, three from the nodes and two from the quorum devices shared between the nodes. A cluster needs a majority of the quorum votes to form.

Specifying a quorum device between every pair of nodes that shares access to a disk storage enclosure is not required or enforced by Sun Cluster software. However, it can provide needed quorum votes for the case where an N+1 configuration degenerates into a two-node cluster and then the node with access to both disk enclosures also fails. If you configured quorum devices between all pairs, the remaining node could still operate as a cluster.

See the following table for examples of these configurations.

Figure 3–3 Quorum Device Configuration Examples

Quorum Guidelines

Use the following guidelines when setting up quorum devices:

Establish a quorum device between all nodes that are attached to the same shared disk storage enclosure. Add one disk within the shared enclosure as a quorum device to ensure that if any node fails, the other nodes can maintain quorum and master the disk device groups on the shared enclosure.
You must connect the quorum device to at least two nodes.
A quorum device can be any SCSI-2 or SCSI-3 disk used as a dual-ported quorum device. Disks connected to more than two nodes must support SCSI-3 Persistent Group Reservation (PGR) regardless of whether the disk is used as a quorum device. See the chapter on planning in the Sun Cluster 3.1 Software Installation Guide for more information.
You can use a disk that contains user data as a quorum device.

Failure Fencing

A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this happens, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might believe it has sole access and ownership to the multihost disks. Multiple nodes attempting to write to the disks can result in data corruption.

Failure fencing limits node access to multihost disks by physically preventing access to the disks. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, resulting in data integrity.

Disk device services provide failover capability for services that make use of multihost disks. When a cluster member currently serving as the primary (owner) of the disk device group fails or becomes unreachable, a new primary is chosen, enabling access to the disk device group to continue with only minor interruption. During this process, the old primary must give up access to the devices before the new primary can be started. However, when a member drops out of the cluster and becomes unreachable, the cluster cannot inform that node to release the devices for which it was the primary. Thus, you need a means to enable surviving members to take control of and access global devices from failed members.

The SunPlex system uses SCSI disk reservations to implement failure fencing. Using SCSI reservations, failed nodes are “fenced” away from the multihost disks, preventing them from accessing those disks.

SCSI-2 disk reservations support a form of reservations, which either grants access to all nodes attached to the disk (when no reservation is in place) or restricts access to a single node (the node that holds the reservation).

When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure fencing procedure to prevent the other node from accessing shared disks. When this failure fencing occurs, it is normal to have the fenced node panic with a “reservation conflict” message on its console.

The reservation conflict occurs because after a node has been detected to no longer be a cluster member, a SCSI reservation is put on all of the disks that are shared between this node and other nodes. The fenced node might not be aware that it is being fenced and if it tries to access one of the shared disks, it detects the reservation and panics.

Failfast Mechanism for Failure Fencing

The mechanism by which the cluster framework ensures that a failed node cannot reboot and begin writing to shared storage is called failfast.

Nodes that are cluster members continuously enable a specific ioctl, MHIOCENFAILFAST, for the disks to which they have access, including quorum disks. This ioctl is a directive to the disk driver, and gives a node the capability to panic itself if it cannot access the disk due to the disk being reserved by some other node.

The MHIOCENFAILFAST ioctl causes the driver to check the error return from every read and write that a node issues to the disk for the Reservation_Conflict error code. The ioctl periodically, in the background, issues a test operation to the disk to check for Reservation_Conflict. Both the foreground and background control flow paths panic if Reservation_Conflict is returned.

For SCSI-2 disks, reservations are not persistent because they do not survive node reboots. For SCSI-3 disks with Persistent Group Reservation (PGR), reservation information is stored on the disk and persists across node reboots. The failfast mechanism works the same regardless of whether you have SCSI-2 disks or SCSI-3 disks.

If a node loses connectivity to other nodes in the cluster, and it is not part of a partition that can achieve quorum, it is forcibly removed from the cluster by another node. Another node that is part of the partition that can achieve quorum places reservations on the shared disks and when the node that does not have quorum attempts to access the shared disks, it receives a reservation conflict and panics as a result of the failfast mechanism.

After the panic, the node might reboot and attempt to rejoin the cluster or stay at the OpenBoot PROM (OBP) prompt. The action taken is determined by the setting of the auto-boot? parameter in the OBP.

Volume Managers

The SunPlex system uses volume management software to increase the availability of data by using mirrors and hot spare disks, and to handle disk failures and replacements.

The SunPlex system does not have its own internal volume manager component, but relies on the following volume managers:

Solaris Volume Manager
VERITAS Volume Manager

Volume management software in the cluster provides support for:

Failover handling of node failures
Multipath support from different nodes
Remote transparent access to disk device groups

When volume management objects come under the control of the cluster, they become disk device groups. For information about volume managers, refer to your volume manager software documentation.

Note –

An important consideration when planning your disksets or disk groups is to understand how their associated disk device groups are associated with the application resources (data) within the cluster. Refer to the Sun Cluster 3.1 Software Installation Guide and the Sun Cluster 3.1 Data Service Planning and Administration Guide for discussions of these issues.

Data Services

The term data service describes a third-party application such as Oracle or Sun ONE Web Server that has been configured to run on a cluster rather than on a single server. A data service consists of an application, specialized Sun Cluster configuration files, and Sun Cluster management methods that controls the following actions of the application.

start
stop
monitor and take corrective measures

The following figure, Standard Versus Clustered Client/Server Configuration, compares an application that runs on a single application server (the single-server model) to the same application running on a cluster (the clustered-server model). Note that from the user's perspective, there is no difference between the two configurations except that the clustered application might run faster and will be more highly available.

Figure 3–4 Standard Versus Clustered Client/Server Configuration

Illustration: The following context describes the graphic.

In the single-server model, you configure the application to access the server through a particular public network interface (a hostname). The hostname is associated with that physical server.

In the clustered-server model, the public network interface is a logical hostname or a shared address. The term network resources is used to refer to both logical hostnames and shared addresses.

Some data services require you to specify either logical hostnames or shared addresses as the network interfaces. Logical hostnames and shared addresses are not interchangeable. Other data services allow you to specify either logical hostnames or shared addresses. Refer to the installation and configuration for each data service for details on the type of interface you must specify.

A network resource is not associated with a specific physical server, but can migrate between physical servers.

A network resource is initially associated with one node, the primary. If the primary fails, the network resource, and the application resource, fails over to a different cluster node (a secondary). When the network resource fails over, after a short delay, the application resource continues to run on the secondary.

The following figure compares the single-server model with the clustered-server model. Note that in the clustered-server model, a network resource (logical hostname, in this example) can move between two or more of the cluster nodes. The application is configured to use this logical hostname in place of a hostname associated with a particular server.

Figure 3–5 Fixed Hostname Versus Logical Hostname

A shared address is also initially associated with one node. This node is called the Global Interface (GIF) Node . A shared address is used as the single network interface to the cluster. It is known as the global interface.

The difference between the logical hostname model and the scalable service model is that in the latter, each node also has the shared address actively configured up on its loopback interface. This configuration makes it possible to have multiple instances of a data service active on several nodes simultaneously. The term “scalable service” means that you can add more CPU power to the application by adding additional cluster nodes and the performance will scale.

If the GIF node fails, the shared address can be brought up on another node that is also running an instance of the application (thereby making this other node the new GIF node). Or, the shared address can fail over to another cluster node that was not previously running the application.

The figure Fixed Hostname Versus Shared Address compares the single-server configuration with the clustered-scalable service configuration. Note that in the scalable service configuration, the shared address is present on all nodes. Similar to how a logical hostname is used for a failover data service, the application is configured to use this shared address in place of a hostname associated with a particular server.

Figure 3–6 Fixed Hostname Versus Shared Address

Data Service Methods

The Sun Cluster software supplies a set of service management methods. These methods run under the control of the Resource Group Manager (RGM), which uses them to start, stop, and monitor the application on the cluster nodes. These methods, along with the cluster framework software and multihost disks, enable applications to become failover or scalable data services.

The RGM also manages resources in the cluster, including instances of an application and network resources (logical hostnames and shared addresses).

In addition to Sun Cluster software-supplied methods, the SunPlex system also supplies an API and several data service development tools. These tools enable application programmers to develop the data service methods needed to make other applications run as highly available data services with the Sun Cluster software.

Failover Data Services

If the node on which the data service is running (the primary node) fails, the service is migrated to another working node without user intervention. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured up on one node, and later, automatically configured down on the original node and configured up on another node.

For failover data services, application instances run only on a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured.

Scalable Data Services

The scalable data service has the potential for active instances on multiple nodes. Scalable services use two resource groups: a scalable resource group to contain the application resources and a failover resource group to contain the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running at once. The failover resource group that hosts the shared address is online on only one node at a time. All nodes hosting a scalable service use the same shared address to host the service.

Service requests come into the cluster through a single network interface (the global interface) and are distributed to the nodes based on one of several predefined algorithms set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes. Note that there can be multiple global interfaces on different nodes hosting other shared addresses.

For scalable services, application instances run on several nodes simultaneously. If the node that hosts the global interface fails, the global interface fails over to another node. If an application instance running fails, the instance attempts to restart on the same node.

If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, it continues to run on the remaining nodes, possibly causing a degradation of service throughput.

Note –

TCP state for each application instance is kept on the node with the instance, not on the global interface node. Therefore, failure of the global interface node does not affect the connection.

The figure Failover and Scalable Resource Group Example shows an example of failover and a scalable resource group and the dependencies that exist between them for scalable services. This example shows three resource groups. The failover resource group contains application resources for highly available DNS, and network resources used by both highly available DNS and highly available Apache Web Server. The scalable resource groups contain only application instances of the Apache Web Server. Note that resource group dependencies exist between the scalable and failover resource groups (solid lines) and that all of the Apache application resources are dependent on the network resource schost-2, which is a shared address (dashed lines).

Figure 3–7 Failover and Scalable Resource Group Example

Scalable Service Architecture

The primary goal of cluster networking is to provide scalability for data services. Scalability means that as the load offered to a service increases, a data service can maintain a constant response time in the face of this increased workload as new nodes are added to the cluster and new server instances are run. We call such a service a scalable data service. A good example of a scalable data service is a web service. Typically, a scalable data service is composed of several instances, each of which runs on different nodes of the cluster. Together these instances behave as a single service from the standpoint of a remote client of that service and implement the functionality of the service. We might, for example, have a scalable web service made up of several httpd daemons running on different nodes. Any httpd daemon may serve a client request. The daemon that serves the request depends on a load-balancing policy. The reply to the client appears to come from the service, not the particular daemon that serviced the request, thus preserving the single service appearance.

A scalable service is composed of:

Networking infrastructure support for scalable services
Load balancing
Support for networking and data services (using the Resource Group Manager)

The following figure depicts the scalable service architecture.

Figure 3–8 Scalable Service Architecture

Illustration's purpose is to depict the scalable service architecture.

The nodes that are not hosting the global interface (proxy nodes) have the shared address hosted on their loopback interfaces. Packets coming into the global interface are distributed to other cluster nodes based on configurable load-balancing policies. The possible load-balancing policies are described next.

Load-Balancing Policies

Load balancing improves performance of the scalable service, both in response time and in throughput.

There are two classes of scalable data services: pure and sticky. A pure service is one where any instance of it can respond to client requests. A sticky service is one where a client sends requests to the same instance. Those requests are not redirected to other instances.

A pure service uses a weighted load-balancing policy. Under this load-balancing policy, client requests are by default uniformly distributed over the server instances in the cluster. For example, in a three-node cluster, let us suppose that each node has the weight of 1. Each node will service 1/3 of the requests from any client on behalf of that service. Weights can be changed at any time by the administrator through the scrgadm(1M) command interface or through the SunPlex Manager GUI.

A sticky service has two flavors, ordinary sticky and wildcard sticky. Sticky services allow concurrent application-level sessions over multiple TCP connections to share in-state memory (application session state).

Ordinary sticky services permit a client to share state between multiple concurrent TCP connections. The client is said to be “sticky” with respect to that server instance listening on a single port. The client is guaranteed that all of his requests go to the same server instance, provided that instance remains up and accessible and the load balancing policy is not changed while the service is online.

For example, a web browser on the client connects to a shared IP address on port 80 using three different TCP connections, but the connections are exchanging cached session information between them at the service.

A generalization of a sticky policy extends to multiple scalable services exchanging session information behind the scenes at the same instance. When these services exchange session information behind the scenes at the same instance, the client is said to be“sticky” with respect to multiple server instances on the same node listening on different ports.

For example, a customer on an e-commerce site fills his shopping cart with items using ordinary HTTP on port 80, but switches to SSL on port 443 to send secure data in order to pay by credit card for the items in the cart.

Wildcard sticky services use dynamically assigned port numbers, but still expect client requests to go to the same node. The client is “sticky wildcard” over ports with respect to the same IP address.

A good example of this policy is passive mode FTP. A client connects to an FTP server on port 21 and is then informed by the server to connect back to a listener port server in the dynamic port range. All requests for this IP address are forwarded to the same node that the server informed the client through the control information.

Note that for each of these sticky policies the weighted load-balancing policy is in effect by default, thus, a client's initial request is directed to the instance dictated by the load balancer. After the client has established an affinity for the node where the instance is running, then future requests are directed to that instance as long as the node is accessible and the load balancing policy is not changed.

Additional details of the specific load balancing policies are discussed below.

Weighted. The load is distributed among various nodes according to specified weight values. This policy is set using the LB_WEIGHTED value for the Load_balancing_weights property. If a weight for a node is not explicitly set, the weight for that node defaults to one.

The weighted policy redirects a certain percentage of the traffic from clients to a particular node. Given X=weight and A=the total weights of all active nodes, an active node can expect approximately X/A of the total new connections to be directed to the active node, when the total number of connections is large enough. This policy does not address individual requests.

Note that this policy is not round robin. A round-robin policy would always cause each request from a client to go to a different node: the first request to node 1, the second request to node 2, and so on.
Sticky. In this policy, the set of ports is known at the time the application resources are configured. This policy is set using the LB_STICKY value for the Load_balancing_policy resource property.
Sticky-wildcard. This policy is a superset of the ordinary“sticky” policy. For a scalable service identified by the IP address, ports are assigned by the server (and are not known in advance). The ports might change. This policy is set using the LB_STICKY_WILD value for the Load_balancing_policy resource property.

Failback Settings

Resource groups fail over from one node to another. When this occurs, the original secondary becomes the new primary. The failback settings specify the actions that will take place when the original primary comes back online. The options are to have the original primary become the primary again (failback) or to allow the current primary to remain. You specify the option you want using the Failback resource group property setting.

In certain instances, if the original node hosting the resource group is failing and rebooting repeatedly, setting failback might result in reduced availability for the resource group.

Data Services Fault Monitors

Each SunPlex data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon(s) are running and that clients are being served. Based on the information returned by probes, predefined actions such as restarting daemons or causing a failover, can be initiated.

Developing New Data Services

Sun supplies configuration files and management methods templates that enable you to make various applications operate as failover or scalable services within a cluster. If the application that you want to run as a failover or scalable service is not one that is currently offered by Sun, you can use an API or the DSET API to configure it to run as a failover or scalable service.

There is a set of criteria for determining whether an application can become a failover service. The specific criteria is described in the SunPlex documents that describe the APIs you can use for your application.

Here, we present some guidelines to help you understand whether your service can take advantage of the scalable data services architecture. Review the section, Scalable Data Services for more general information on scalable services.

New services that satisfy the following guidelines may make use of scalable services. If an existing service doesn't follow these guidelines exactly, portions may need to be rewritten so that the service complies with the guidelines.

A scalable data service has the following characteristics. First, such a service is composed of one or more server instances. Each instance runs on a different node of the cluster. Two or more instances of the same service cannot run on the same node.

Second, if the service provides an external logical data store, then concurrent access to this store from multiple server instances must be synchronized to avoid losing updates or reading data as it's being changed. Note that we say “external” to distinguish the store from in-memory state, and “logical” because the store appears as a single entity, although it may itself be replicated. Furthermore, this logical data store has the property that whenever any server instance updates the store, that update is immediately seen by other instances.

The SunPlex system provides such an external storage through its cluster file system and its global raw partitions. As an example, suppose a service writes new data to an external log file or modifies existing data in place. When multiple instances of this service run, each has access to this external log, and each may simultaneously access this log. Each instance must synchronize its access to this log, or else the instances interfere with each other. The service could use ordinary Solaris file locking via fcntl(2) and lockf(3C) to achieve the desired synchronization.

Another example of this type of store is a back-end database such as highly available Oracle or Oracle Parallel Server/Real Application Clusters. Note that this type of back-end database server provides built-in synchronization using database query or update transactions, and so multiple server instances need not implement their own synchronization.

An example of a service that is not a scalable service in its current incarnation is Sun's IMAP server. The service updates a store, but that store is private and when multiple IMAP instances write to this store, they overwrite each other because the updates are not synchronized. The IMAP server must be rewritten to synchronize concurrent access.

Finally, note that instances may have private data that's disjoint from the data of other instances. In such a case, the service need not concern itself with synchronizing concurrent access because the data is private, and only that instance can manipulate it. In this case, you must be careful not to store this private data under the cluster file system because it has the potential to become globally accessible.

Data Service API and Data Service Development Library API

The SunPlex system provides the following to make applications highly available:

Data services supplied as part of the SunPlex system
A data service API
A data service development library API
A “generic” data service

The Sun Cluster 3.1 Data Service Collection describes how to install and configure the data services supplied with the SunPlex system. The Sun Cluster 3.1 Data Services Developer's Guide describes how to instrument other applications to be highly available under the Sun Cluster framework.

The Sun Cluster APIs enable application programmers to develop fault monitors and scripts that start and stop data services instances. With these tools, an application can be instrumented to be a failover or a scalable data service. In addition, the SunPlex system provides a “generic” data service that can be used to quickly generate an application's required start and stop methods to make it run as a failover or scalable service.

Using the Cluster Interconnect for Data Service Traffic

A cluster must have multiple network connections between nodes, forming the cluster interconnect. The clustering software uses multiple interconnects both for high availability and to improve performance. For internal traffic (for example, file system data or scalable services data), messages are striped across all available interconnects in a round-robin fashion.

The cluster interconnect is also available to applications, for highly available communication between nodes. For example, a distributed application might have components running on different nodes that need to communicate. By using the cluster interconnect rather than the public transport, these connections can withstand the failure of an individual link.

To use the cluster interconnect for communication between nodes, an application must use the private hostnames configured when the cluster was installed. For example, if the private hostname for node 1 is clusternode1-priv, use that name to communicate over the cluster interconnect to node 1. TCP sockets opened using this name are routed over the cluster interconnect and can be transparently re-routed in the event of network failure.

Note that because the private hostnames can be configured during installation, the cluster interconnect can use any name chosen at that time. The actual name can be obtained from scha_cluster_get(3HA) with the scha_privatelink_hostname_node argument.

For application-level use of the cluster interconnect, a single interconnect is used between each pair of nodes, but separate interconnects are used for different node pairs, if possible. For example, consider an application running on three nodes and communicating over the cluster interconnect. Communication between nodes 1 and 2 might take place on interface hme0, while communication between nodes 1 and 3 might take place on interface qfe1. That is, application communication between any two nodes is limited to a single interconnect, while internal clustering communication is striped over all interconnects.

Note that the application shares the interconnect with internal clustering traffic, so the bandwidth available to the application depends on the bandwidth used for other clustering traffic. In the event of a failure, internal traffic can round-robin over the remaining interconnects, while application connections on a failed interconnect can switch to a working interconnect.

Two types of addresses support the cluster interconnect, and gethostbyname(3N) on a private hostname normally returns two IP addresses. The first address is called the logical pairwise address, and the second address is called the logical pernode address.

A separate logical pairwise address is assigned to each pair of nodes. This small logical network supports failover of connections. Each node is also assigned a fixed pernode address. That is, the logical pairwise addresses for clusternode1-priv are different on each node, while the logical pernode address for clusternode1-priv is the same on each node. A node does not have a pairwise address to itself, however, so gethostbyname(clusternode1-priv) on node 1 returns only the logical pernode address.

Note that applications accepting connections over the cluster interconnect and then verifying the IP address for security reasons must check against all IP addresses returned from gethostbyname, not just the first IP address.

If you need consistent IP addresses in your application at all points, configure the application to bind to the pernode address on both the client and the server side so that all connections can appear to come and go from the pernode address.

Resources, Resource Groups, and Resource Types

Data services utilize several types of resources: applications such as Apache Web Server or Sun ONE Web Server utilize network addresses (logical hostnames and shared addresses) upon which the applications depend. Application and network resources form a basic unit that is managed by the RGM.

Data services are resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle-server and Sun Cluster HA for Apache is the resource type SUNW.apache.

A resource is an instantiation of a resource type that is defined cluster wide. There are several resource types defined.

Network resources are either SUNW.LogicalHostname or SUNW.SharedAddress resource types. These two resource types are pre-registered by the Sun Cluster software.

The SUNW.HAStorage and HAStoragePlus resource types are used to synchronize the startup of resources and disk device groups upon which the resources depend. It ensures that before a data service starts, the paths to cluster file system mount points, global devices, and device group names are available. For more information, see “Synchronizing the Startups Between Resource Groups and Disk Device Groups” in the Sun Cluster 3.1 Data Service Planning and Administration Guide. (The HAStoragePlus resource type became available in Sun Cluster 3.0 5/02 and added another feature, enabling local file systems to be highly available. For more information on this feature, see HAStoragePlus Resource Type.)

RGM-managed resources are placed into groups, called resource groups, so that they can be managed as a unit. A resource group is migrated as a unit if a failover or switchover is initiated on the resource group.

Note –

When you bring a resource group containing application resources online, the application is started. The data service start method waits until the application is up and running before exiting successfully. The determination of when the application is up and running is accomplished the same way the data service fault monitor determines that a data service is serving clients. Refer to the Sun Cluster 3.1 Data Service Planning and Administration Guide for more information on this process.

Resource Group Manager (RGM)

The RGM controls data services (applications) as resources, which are managed by resource type implementations. These implementations are either supplied by Sun or created by a developer with a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers called resource groups. The RGM stops and starts resource groups on selected nodes in response to cluster membership changes.

The RGM acts on resources and resource groups. RGM actions cause resources and resource groups to move between online and offline states. A complete description of the states and settings that can be applied to resources and resource groups is in the section Resource and Resource Group States and Settings. Refer to Resources, Resource Groups, and Resource Types for information about how to launch a resource management project under RGM control.

Resource and Resource Group States and Settings

An administrator applies static settings to resources and resource groups. These settings can only be changed through administrative actions. The RGM moves resource groups between dynamic “states.” These settings and states are described in the following list.

Managed or unmanaged. These are cluster-wide settings that apply only to resource groups. Resource groups are managed by the RGM. The scrgadm(1M) command can be used to cause the RGM to manage or to unmanage a resource group. These settings do not change with a cluster reconfiguration.

When a resource group is first created, it is unmanaged. It must be managed before any resources placed in the group can become active.

In some data services, for example a scalable web server, work must be done prior to starting up network resources and after they are stopped. This work is done by initialization (INIT) and finish (FINI) data service methods. The INIT methods only run if the resource group in which the resources reside is in the managed state.

When a resource group is moved from unmanaged to managed, any registered INIT methods for the group are run on the resources in the group.

When a resource group is moved from managed to unmanaged, any registered FINI methods are called to perform cleanup.

The most common use of INIT and FINI methods are for network resources for scalable services, but they can be used for any initialization or cleanup work that is not done by the application.
Enabled or disabled. These are cluster-wide settings that apply to resources. The scrgadm(1M) command can be used to enable or disable a resource. These settings do not change with a cluster reconfiguration.

The normal setting for a resource is that it is enabled and actively running in the system.

If for some reason, you want to make the resource unavailable on all cluster nodes, you disable the resource. A disabled resource is not available for general use.
Online or offline. These are dynamic states that apply to both resource and resource groups.

These states change as the cluster transitions through cluster reconfiguration steps during switchover or failover. They can also be changed through administrative actions. The scswitch(1M) can be used to change the online or offline state of a resource or resource group.

A failover resource or resource group can only be online on one node at any time. A scalable resource or resource group can be online on some nodes and offline on others. During a switchover or failover, resource groups and the resources within them are taken offline on one node and then brought online on another node.

If a resource group is offline then all of its resources are offline. If a resource group is online, then all of its enabled resources are online.

Resource groups can contain several resources, with dependencies between resources. These dependencies require that the resources be brought online and offline in a particular order. The methods used to bring resources online and offline might take different amounts of time for each resource. Because of resource dependencies and start and stop time differences, resources within a single resource group can have different online and offline states during a cluster reconfiguration.

Resource and Resource Group Properties

You can configure property values for resources and resource groups for your SunPlex data services. Standard properties are common to all data services. Extension properties are specific to each data service. Some standard and extension properties are configured with default settings so that you do not have to modify them. Others need to be set as part of the process of creating and configuring resources. The documentation for each data service specifies which resource properties can be set and how to set them.

The standard properties are used to configure resource and resource group properties that are usually independent of any particular data service. The set of standard properties is described in an appendix to the Sun Cluster 3.1 Data Service Planning and Administration Guide.

The RGM extension properties provide information such as the location of application binaries and configuration files. You modify extension properties as you configure your data services. The set of extension properties is described in the Sun Cluster 3.1 Data Service Planning and Administration Guide.

Data Service Project Configuration

Data services may be configured to launch under a Solaris project name when brought online using the RGM. The configuration associates a resource or resource group managed by the RGM with a Solaris project ID. The mapping from your resource or resource group to a project ID gives you the ability to use sophisticated controls that are available in the Solaris environment to manage workloads and consumption within your cluster.

Note –

You can perform this configuration only if you are running the current release of Sun Cluster software with Solaris 9.

Using the Solaris management functionality in a cluster environment enables you to ensure that your most important applications are given priority when sharing a node with other applications. Applications might share a node if you have consolidated services or because applications have failed over. Use of the management functionality described herein might improve availability of a critical application by preventing other low priority applications from over-consuming system supplies such as CPU time.

Note –

The Solaris documentation of this feature describes CPU time, processes, tasks and similar components as 'resources'. Meanwhile, Sun Cluster documentation uses the term 'resources' to describe entities that are under the control of the RGM. The following section will use the term 'resource' to refer to Sun Cluster entities under the control of the RGM and use the term 'supplies' to refer to CPU time, processes, and tasks.

This section provides a conceptual description of configuring data services to launch processes in a specified Solaris 9 project(4). This section also describes several failover scenarios and suggestions for planning to use the management functionality provided by the Solaris environment. For detailed conceptual and procedural documentation of the management feature, refer to System Administration Guide: Resource Management and Network Services in the Solaris 9 System Administrator Collection.

When configuring resources and resource groups to use Solaris management functionality in a cluster, consider using the following high-level process:

Configure applications as part of the resource.
Configure resources as part of a resource group.
Enable resources in the resource group.
Make the resource group managed.
Create a Solaris project for your resource group.
Configure standard properties to associate the resource group name with the project you created in step 5.
Bring the resource group online.

To configure the standard Resource_project_name or RG_project_name properties to associate the Solaris project ID with the resource or resource group, use the -y option with the scrgadm(1M) command. Set the property values to the resource or resource group. See “scrgadm” in Sun Cluster 3.1 Data Service Planning and Administration Guide for property definitions. Refer to r_properties(5) and rg_properties(5) for property descriptions.

The specified project name must exist in the projects database (/etc/project) and the root user must be configured as a member of the named project. Refer to “Projects and Tasks” in System Administration Guide: Resource Management and Network Servicesin the Solaris 9 System Administrator Collection for conceptual information about the project name database. Refer to project(4) for a description of project file syntax.

When the RGM brings resources or resource groups online, it launches the related processes under the project name.

Note –

Users can associate the resource or resource group with a project at any time. However, the new project name is not effective until the resource or resource group is taken offline and brought back online using the RGM.

Launching resources and resource groups under the project name enables you to configure the following features to manage system supplies across your cluster.

Extended Accounting. Provides a flexible way to record consumption on a task or process basis. Extended accounting enables you to examine historical usage and make assessments of capacity requirements for future workloads.
Controls. Provide a mechanism for constraint on system supplies. Processes, tasks, and projects can be prevented from consuming large amounts of specified system supplies.
Fair Share Scheduling (FSS). Provides the ability to control the allocation of available CPU time among workloads, based on their importance. Workload importance is expressed by the number of shares of CPU time that you assign to each workload. Refer to dispadmin(1M) for a command line description of setting FSS as your default scheduler. See also priocntl(1), ps(1), and FSS(7) for more information.
Pools. Provide the ability to use partitions for interactive applications according to the application's requirements. Pools can be used to partition a server that supports a number of different software applications. The use of pools results in a more predictable response for each application.

Determining Requirements for Project Configuration

Before you configure data services to use the controls provided by Solaris in a Sun Cluster environment, you must decide how you want to control and track resources across switchovers or failovers. Consider identifying dependencies within your cluster before configuring a new project. For example, resources and resource groups depend on disk device groups. Use the nodelist, failback, maximum_primaries and desired_primaries resource group properties, configured with scrgadm(1M) to identify nodelist priorities for your resource group. Refer to “scrgadm” in Sun Cluster 3.1 Data Service Planning and Administration Guide for a brief discussion of the node list dependencies between resource groups and disk device groups. For detailed property descriptions, refer to rg_properties(5).

Use the preferenced and failback properties configured with scrgadm(1M) and scsetup(1M) to determine disk device group nodelist priorities. For procedural information, see “How To Change Disk Device Properties” in “Administering Disk Device Groups” in Sun Cluster 3.1 System Administration Guide. Refer to Failover and Scalability in the SunPlex System for conceptual information about node configuration and the behavior of failover and scalable data services.

If you configure all cluster nodes identically, usage limits are enforced identically on primary and secondary nodes. The configuration parameters of projects need not be identical for all applications in the configuration files on all nodes. All projects associated with the application must at least be accessible by the project database on all potential masters of that application. Suppose that Application 1 is mastered by phys-schost-1 but could potentially be switched over or failed over to phys-schost-2 or phys-schost-3. The project associated with Application 1 must be accessible on all three nodes (phys-schost-1, phys-schost-2, and phys-schost-3).

Note –

Project database information can be a local /etc/project database file or may be stored in the NIS map or the LDAP directory service.

The Solaris environment allows for flexible configuration of usage parameters, and few restrictions are imposed by Sun Cluster. Configuration choices depend on the needs of the site. Consider the general guidelines in the following sections before configuring your systems.

Setting Per-Process Virtual Memory Limits

Set the process.max-address-space control to limit virtual memory on a per-process basis. Refer to rctladm(1M) for detailed information about setting the process.max-address-space value.

When using management controls with Sun Cluster, configure memory limits appropriately to prevent unnecessary failover of applications and a “ping-pong” effect of applications. In general:

Do not set memory limits too low.

When an application reaches its memory limit, it might fail over. This guideline is especially important for database applications, when reaching a virtual memory limit can have unexpected consequences.
Do not set memory limits identically on primary and secondary nodes.

Identical limits can cause a ping-pong effect when an application reaches its memory limit and fails over to a secondary node with an identical memory limit. Set the memory limit slightly higher on the secondary node. The difference in memory limits helps prevent the ping-pong scenario and gives the system administrator a period of time in which to adjust the parameters as necessary.
Do use the resource management memory limits for load-balancing.

For example, you can use memory limits to prevent an errant application from consuming excessive swap space.

Failover Scenarios

You can configure management parameters so that the allocation in the project configuration (/etc/project) works in normal cluster operation and in switchover or failover situations.

The following sections are example scenarios.

The first two sections, “Two-Node Cluster With Two Applications” and “Two-Node Cluster With Three Applications,” show failover scenarios for entire nodes.
The section “Failover of Resource Group Only” illustrates failover operation for an application only.

In a cluster environment, an application is configured as part of a resource and a resource is configured as part of a resource group (RG). When a failure occurs, the resource group along with its associated applications, fails over to another node. In the following examples the resources are not shown explicitly. Assume that each resource has only one application.

Note –

Failover occurs in the preferenced nodelist order that is set in the RGM.

The following examples have these constraints:

Application 1 (App-1) is configured in resource group RG-1.
Application 2 (App-2) is configured in resource group RG-2.
Application 3 (App-3) is configured in resource group RG-3.

Although the numbers of assigned shares remain the same, the percentage of CPU time allocated to each application changes after failover. This percentage depends on the number of applications that are running on the node and the number of shares that are assigned to each active application.

In these scenarios, assume the following configurations.

All applications are configured under a common project.
Each resource has only one application.
The applications are the only active processes on the nodes.
The projects databases are configured the same on each node of the cluster.

Two-Node Cluster With Two Applications

You can configure two applications on a two-node cluster to ensure that each physical host (phys-schost-1, phys-schost-2) acts as the default master for one application. Each physical host acts as the secondary node for the other physical host. All projects associated with Application 1 and Application 2 must be represented in the projects database files on both nodes. When the cluster is running normally, each application is running on its default master, where it is allocated all CPU time by the management facility.

After a failover or switchover occurs, both applications run on a single node where they are allocated shares as specified in the configuration file. For example, this entry in the /etc/project file specifies that Application 1 is allocated 4 shares and Application 2 is allocated 1 share.

Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)
Prj_2:101:project for App-2:root::project.cpu-shares=(privileged,1,none)

The following diagram illustrates the normal and failover operations of this configuration. The number of shares that are assigned does not change. However, the percentage of CPU time available to each application can change, depending on the number of shares assigned to each process demanding CPU time.

Two-Node Cluster With Three Applications

On a two-node cluster with three applications, you can configure one physical host (phys-schost-1) as the default master of one application and the second physical host (phys-schost-2) as the default master for the remaining two applications. Assume the following example projects database file on every node. The projects database file does not change when a failover or switchover occurs.

Prj_1:103:project for App-1:root::project.cpu-shares=(privileged,5,none)
Prj_2:104:project for App_2:root::project.cpu-shares=(privileged,3,none) 
Prj_3:105:project for App_3:root::project.cpu-shares=(privileged,2,none)

When the cluster is running normally, Application 1 is allocated 5 shares on its default master, phys-schost-1. This number is equivalent to 100 percent of CPU time because it is the only application that demands CPU time on that node. Applications 2 and 3 are allocated 3 and 2 shares, respectively, on their default master, phys-schost-2. Application 2 would receive 60 percent of CPU time and Application 3 would receive 40 percent of CPU time during normal operation.

If a failover or switchover occurs and Application 1 is switched over to phys-schost-2, the shares for all three applications remain the same. However, the percentages of CPU resources are reallocated according to the projects database file.

Application 1, with 5 shares, receives 50 percent of CPU.
Application 2, with 3 shares, receives 30 percent of CPU.
Application 3, with 2 shares, receives 20 percent of CPU.

The following diagram illustrates the normal operations and failover operations of this configuration.

Failover of Resource Group Only

In a configuration in which multiple resource groups have the same default master, a resource group (and its associated applications) can fail over or be switched over to a secondary node. Meanwhile, the default master is running in the cluster.

Note –

During failover, the application that fails over is allocated resources as specified in the configuration file on the secondary node. In this example, the projects database files on the primary and secondary nodes have the same configurations.

For example, this sample configuration file specifies that Application 1 is allocated 1 share, Application 2 is allocated 2 shares, and Application 3 is allocated 2 shares.

Prj_1:106:project for App_1:root::project.cpu-shares=(privileged,1,none)
Prj_2:107:project for App_2:root::project.cpu-shares=(privileged,2,none)
Prj_3:108:project for App_3:root::project.cpu-shares=(privileged,2,none)

The following diagram illustrates the normal and failover operations of this configuration, where RG-2, containing Application 2, fails over to phys-schost-2. Note that the number of shares assigned does not change. However, the percentage of CPU time available to each application can change, depending on the number of shares assigned to each application demanding CPU time.

Public Network Adapters and IP Network Multipathing

Clients make data requests to the cluster through the public network. Each cluster node is connected to at least one public network through a pair of public network adapters.

Solaris Internet Protocol (IP) Network Multipathing software on Sun Cluster provides the basic mechanism for monitoring public network adapters and failing over IP addresses from one adapter to another when a fault is detected. Each cluster node has its own IP Network Multipathing configuration, which can be different from that on other cluster nodes.

Public network adapters are organized into IP multipathing groups (multipathing groups). Each multipathing group has one or more public network adapters. Each adapter in a multipathing group can be active, or you can configure standby interfaces that are inactive unless there is a failover. The in.mpathd multipathing daemon uses a test IP address to detect failures and repairs. If a fault is detected on one of the adapters by the multipathing daemon, a failover occurs. All network access fails over from the faulted adapter to another functional adapter in the multipathing group, thereby maintaining public network connectivity for the node. If a standby interface was configured, the daemon chooses the standby interface. Otherwise, in.mpathd chooses the interface with the least number of IP addresses. Because the failover happens at the adapter interface level, higher-level connections such as TCP are not affected, except for a brief transient delay during the failover. When the failover of IP addresses completes successfully, gratuitous ARP broadcasts are sent. The connectivity to remote clients is therefore maintained.

Note –

Because of the congestion recovery characteristics of TCP, TCP endpoints can suffer further delay after a successful failover as some segments could be lost during the failover, activating the congestion control mechanism in TCP.

Multipathing groups provide the building blocks for logical hostname and shared address resources. You can also create multipathing groups independently of logical hostname and shared address resources to monitor public network connectivity of cluster nodes. The same multipathing group on a node can host any number of logical hostname or shared address resources. For more information on logical hostname and shared address resources, see the Sun Cluster 3.1 Data Service Planning and Administration Guide.

Note –

The design of the IP Network Multipathing mechanism is meant to detect and mask adapter failures. The design is not intended to recover from an administrator using ifconfig(1M) to remove one of the logical (or shared) IP addresses. The Sun Cluster software views the logical and shared IP addresses as resources managed by the RGM. The correct way for an administrator to add or remove an IP address is to use scrgadm(1M) to modify the resource group containing the resource.

For more information about the Solaris implementation of IP Network Multipathing, see the appropriate documentation for the Solaris operating environment installed on your cluster.

Operating Environment Release	For Instructions, Go To...
Solaris 8 operating environment	IP Network Multipathing Administration Guide
Solaris 9 operating environment	“IP Network Multipathing Topics” in System Administration Guide: IP Services

Dynamic Reconfiguration Support

Sun Cluster 3.1 support for the dynamic reconfiguration (DR) software feature is being developed in incremental phases. This section describes concepts and considerations for Sun Cluster 3.1 support of the DR feature.

Note that all of the requirements, procedures, and restrictions that are documented for the Solaris DR feature also apply to Sun Cluster DR support (except for the operating environment quiescence operation). Therefore, review the documentation for the Solaris DR feature before using the DR feature with Sun Cluster software. You should review in particular the issues that affect non-network IO devices during a DR detach operation. The Sun Enterprise 10000 Dynamic Reconfiguration User Guide and the Sun Enterprise 10000 Dynamic Reconfiguration Reference Manual (from the Solaris 8 on Sun Hardware or Solaris 9 on Sun Hardware collections) are both available for download from http://docs.sun.com.

Dynamic Reconfiguration General Description

The DR feature allows operations, such as the removal of system hardware, in running systems. The DR processes are designed to ensure continuous system operation with no need to halt the system or interrupt cluster availability.

DR operates at the board level. Therefore, a DR operation affects all of the components on a board. Each board can contain multiple components, including CPUs, memory, and peripheral interfaces for disk drives, tape drives, and network connections.

Removing a board containing active components would result in system errors. Before removing a board, the DR subsystem queries other subsystems, such as Sun Cluster, to determine whether the components on the board are being used. If the DR subsystem finds that a board is in use, the DR remove-board operation is not done. Therefore, it is always safe to issue a DR remove-board operation since the DR subsystem rejects operations on boards containing active components.

The DR add-board operation is always safe also. CPUs and memory on a newly added board are automatically brought into service by the system. However, the system administrator must manually configure the cluster in order to actively use components that are on the newly added board.

Note –

The DR subsystem has several levels. If a lower level reports an error, the upper level also reports an error. However, when the lower level reports the specific error, the upper level will report “Unknown error.” System administrators should ignore the “Unknown error” reported by the upper level.

The following sections describe DR considerations for the different device types.

DR Clustering Considerations for CPU Devices

Sun Cluster software will not reject a DR remove-board operation due to the presence of CPU devices.

When a DR add-board operation succeeds, CPU devices on the added board are automatically incorporated in system operation.

DR Clustering Considerations for Memory

For the purposes of DR, there are two types of memory to consider. These two types differ only in usage. The actual hardware is the same for both types.

The memory used by the operating system is called the kernel memory cage. Sun Cluster software does not support remove-board operations on a board that contains the kernel memory cage and will reject any such operation. When a DR remove-board operation pertains to memory other than the kernel memory cage, Sun Cluster will not reject the operation.

When a DR add-board operation that pertains to memory succeeds, memory on the added board is automatically incorporated in system operation.

DR Clustering Considerations for Disk and Tape Drives

Sun Cluster rejects DR remove-board operations on active drives in the primary node. DR remove-board operations can be performed on non-active drives in the primary node and on any drives in the secondary node. After the DR operation, cluster data access continues as before.

Note –

Sun Cluster rejects DR operations that impact the availability of quorum devices. For considerations about quorum devices and the procedure for performing DR operations on them, see DR Clustering Considerations for Quorum Devices.

See the Sun Cluster 3.1 System Administration Guide for detailed instructions on how to perform these actions.

DR Clustering Considerations for Quorum Devices

If the DR remove-board operation pertains to a board containing an interface to a device configured for quorum, Sun Cluster rejects the operation and identifies the quorum device that would be affected by the operation. You must disable the device as a quorum device before you can perform a DR remove-board operation.

See the Sun Cluster 3.1 System Administration Guide for detailed instructions on how to perform these actions.

DR Clustering Considerations for Cluster Interconnect Interfaces

If the DR remove-board operation pertains to a board containing an active cluster interconnect interface, Sun Cluster rejects the operation and identifies the interface that would be affected by the operation. You must use a Sun Cluster administrative tool to disable the active interface before the DR operation can succeed (also see the caution below).

See the Sun Cluster 3.1 System Administration Guide for detailed instructions on how to perform these actions.

Caution –

Sun Cluster requires that each cluster node has at least one functioning path to every other cluster node. Do not disable a private interconnect interface that supports the last path to any cluster node.

DR Clustering Considerations for Public Network Interfaces

If the DR remove-board operation pertains to a board containing an active public network interface, Sun Cluster rejects the operation and identifies the interface that would be affected by the operation. Before removing a board with an active network interface present, all traffic on that interface must first be switched over to another functional interface in the multipathing group by using the if_mpadm(1M) command.

Caution –

If the remaining network adapter fails while you are performing the DR remove operation on the disabled network adapter, availability is impacted. The remaining adapter has no place to fail over for the duration of the DR operation.

See the Sun Cluster 3.1 System Administration Guide for detailed instructions on how to perform a DR remove operation on a public network interface.