This chapter describes the key concepts related to the software components of a Sun Cluster system. The topics covered include:
This information is directed primarily toward system administrators and toward application developers who use the Sun Cluster API and SDK. Cluster system administrators can use this information in preparation for installing, configuring, and administering cluster software. Application developers can use the information to understand the cluster environment in which they will be working.
You can choose how you install, configure, and administer the Sun Cluster system from several user interfaces. You can accomplish system administration tasks either through the SunPlex Manager graphical user interface (GUI), or through the documented command-line interface. On top of the command-line interface are some utilities, such as scinstall and scsetup, to simplify selected installation and configuration tasks. The Sun Cluster system also has a module that runs as part of Sun Management Center that provides a GUI to particular cluster tasks. This module is available only in SPARC based clusters. Refer to Administration Tools in Sun Cluster System Administration Guide for Solaris OS for complete descriptions of the administrative interfaces.
Time between all nodes in a cluster must be synchronized. Whether you synchronize the cluster nodes with any outside time source is not important to cluster operation. The Sun Cluster system employs the Network Time Protocol (NTP) to synchronize the clocks between nodes.
In general, a change in the system clock of a fraction of a second causes no problems. However, if you run date(1), rdate(1M), or xntpdate(1M) (interactively, or within cron scripts) on an active cluster, you can force a time change much larger than a fraction of a second to synchronize the system clock to the time source. This forced change might cause problems with file modification timestamps or confuse the NTP service.
When you install the Solaris Operating System on each cluster node, you have an opportunity to change the default time and date setting for the node. In general, you can accept the factory default.
When you install Sun Cluster software by using scinstall(1M), one step in the process is to configure NTP for the cluster. Sun Cluster software supplies a template file, ntp.cluster (see /etc/inet/ntp.cluster on an installed cluster node), that establishes a peer relationship between all cluster nodes. One node is designated the “preferred” node. Nodes are identified by their private hostnames and time synchronization occurs across the cluster interconnect. For instructions about how to configure the cluster for NTP, see Chapter 2, Installing and Configuring Sun Cluster Software, in Sun Cluster Software Installation Guide for Solaris OS.
Alternatively, you can set up one or more NTP servers outside the cluster and change the ntp.conf file to reflect that configuration.
In normal operation, you should never need to adjust the time on the cluster. However, if the time was set incorrectly when you installed the Solaris Operating System and you want to change it, the procedure for doing so is included in Chapter 7, Administering the Cluster, in Sun Cluster System Administration Guide for Solaris OS.
The Sun Cluster system makes all components on the “path” between users and data highly available, including network interfaces, the applications themselves, the file system, and the multihost devices. In general, a cluster component is highly available if it survives any single (software or hardware) failure in the system.
Table 3–1 Levels of Sun Cluster Failure Detection and Recovery
Failed Cluster Component
HA API, HA framework
Public network adapter
Internet Protocol (IP) Network Multipathing
Multiple public network adapter cards
Cluster file system
Primary and secondary replicas
Mirrored multihost device
Volume management (Solaris Volume Manager and VERITAS Volume Manager, which is available in SPARC based clusters only)
Hardware RAID-5 (for example, Sun StorEdge A3x00)
Primary and secondary replicas
Multiple paths to the device, cluster transport junctions
HA transport software
Multiple private hardware-independent networks
CMM, failfast driver
Sun Cluster software's high-availability framework detects a node failure quickly and creates a new equivalent server for the framework resources on a remaining node in the cluster. At no time are all framework resources unavailable. Framework resources that are unaffected by a crashed node are fully available during recovery. Furthermore, framework resources of the failed node become available as soon as they are recovered. A recovered framework resource does not have to wait for all other framework resources to complete their recovery.
Most highly available framework resources are recovered transparently to the applications (data services) using the resource. The semantics of framework resource access are fully preserved across node failure. The applications simply cannot detect that the framework resource server has been moved to another node. Failure of a single node is completely transparent to programs on remaining nodes that use the files, devices, and disk volumes attached to this node. This transparency exists if an alternative hardware path exists to the disks from another node. An example is the use of multihost devices that have ports to multiple nodes.
To ensure that data is kept safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the CMM coordinates a cluster reconfiguration of cluster services (applications) in response to a failure.
The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration.
After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster. In a synchronized configuration, cluster resources might be redistributed, based on the new membership of the cluster.
Unlike previous Sun Cluster software releases, CMM runs entirely in the kernel.
See About Failure Fencing for more information about how the cluster protects itself from partitioning into multiple separate clusters.
If the CMM detects a critical problem with a node, it notifies the cluster framework to forcibly shut down (panic) the node and to remove it from the cluster membership. The mechanism by which this occurs is called failfast. Failfast causes a node to shut down in two ways.
If a node leaves the cluster and then attempts to start a new cluster without having quorum, it is “fenced” from accessing the shared disks. See About Failure Fencing for details about this use of failfast.
When the death of a cluster daemon causes a node to panic, a message similar to the following is displayed on the console for that node.
panic[cpu0]/thread=40e60: Failfast: Aborting because "pmfd" died 35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0
After the panic, the node might reboot and attempt to rejoin the cluster. Alternatively, if the cluster is composed of SPARC based systems, the node might remain at the OpenBoot PROM (OBP) prompt. The next action of the node is determined by the setting of the auto-boot? parameter. You can set auto-boot? with eeprom(1M) or at the OpenBoot PROM ok prompt.
The CCR uses a two-phase commit algorithm for updates: An update must be successfully completed on all cluster members or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.
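The all-or-nothing behavior described above can be illustrated with a minimal sketch. This is not the CCR implementation: the Member class, its can_prepare flag, and the two_phase_update function are hypothetical names that model cluster members as in-memory objects.

```python
class Member:
    """Hypothetical in-memory stand-in for one cluster member's copy of the data."""

    def __init__(self, name, can_prepare=True):
        self.name = name
        self.can_prepare = can_prepare  # set to False to simulate a failed member
        self.committed = {}
        self.staged = None

    def prepare(self, update):
        # Phase 1: stage the update without making it visible.
        if not self.can_prepare:
            return False
        self.staged = dict(update)
        return True

    def commit(self):
        # Phase 2: make the staged update visible.
        self.committed.update(self.staged)
        self.staged = None

    def rollback(self):
        # Discard anything that was staged in phase 1.
        self.staged = None


def two_phase_update(members, update):
    """Apply the update on every member, or on none of them."""
    prepared = []
    for member in members:
        if member.prepare(update):
            prepared.append(member)
        else:
            # One member could not prepare, so undo the staged copies.
            for done in prepared:
                done.rollback()
            return False
    for member in members:
        member.commit()
    return True
```

If any member fails to prepare, no member's committed data changes, which is the property the two-phase commit provides.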
Although the CCR consists of text files, never edit the CCR files manually. Each file contains a checksum record to ensure consistency between nodes. Manually updating CCR files can cause a node or the entire cluster to stop functioning.
The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.
The Sun Cluster system uses global devices to provide cluster-wide, highly available access to any device in a cluster, from any node, without regard to where the device is physically attached. In general, if a node fails while providing access to a global device, the Sun Cluster software automatically discovers another path to the device and redirects the access to that path. Sun Cluster global devices include disks, CD-ROMs, and tapes. However, the only multiported global devices that Sun Cluster software supports are disks. Consequently, CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.
The cluster automatically assigns unique IDs to each disk, CD-ROM, and tape device in the cluster. This assignment enables consistent access to each device from any node in the cluster. The global device namespace is held in the /dev/global directory. See Global Namespace for more information.
Multiported global devices provide more than one path to a device. Because multihost disks are part of a disk device group that is hosted by more than one node, the multihost disks are made highly available.
The Sun Cluster software manages global devices through a construct known as the DID pseudo driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.
The DID pseudo driver is an integral part of the global device access feature of the cluster. The DID driver probes all nodes of the cluster, builds a list of unique disk devices, and assigns each device a unique major and minor number that are consistent on all nodes of the cluster. Access to the global devices is performed by using the unique device ID instead of the traditional Solaris device IDs, such as c0t0d0 for a disk.
This approach ensures that any application that accesses disks (such as a volume manager or applications that use raw devices) uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from node to node, thus changing the Solaris device naming conventions as well. For example, Node1 might identify a multihost disk as c1t2d0, and Node2 might identify the same disk completely differently, as c3t2d0. The DID driver assigns a global name, such as d10, that the nodes would use instead, giving each node a consistent mapping to the multihost disk.
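The idea of giving every node the same name for a shared disk can be sketched as follows. This is an illustration only, not how the DID driver works internally: the build_did_map function and the per-disk hardware identifiers (such as "serial-A") are hypothetical, and the numbering of the generated names is arbitrary.

```python
def build_did_map(node_views):
    """Map each (node, local name) pair to one cluster-wide name.

    node_views: {node: {local_name: device_id}}, where device_id is a
    hypothetical hardware identifier that is the same on every node.
    """
    did_by_device = {}
    mapping = {}
    for node in sorted(node_views):
        for local_name, device_id in sorted(node_views[node].items()):
            if device_id not in did_by_device:
                # First time this physical device is seen: allocate the next name.
                did_by_device[device_id] = "d%d" % (len(did_by_device) + 1)
            mapping[(node, local_name)] = did_by_device[device_id]
    return mapping


views = {
    "Node1": {"c1t2d0": "serial-A"},
    "Node2": {"c3t2d0": "serial-A"},
}
did_map = build_did_map(views)
# Both nodes resolve their different local names to the same global name.
same_name = did_map[("Node1", "c1t2d0")] == did_map[("Node2", "c3t2d0")]
```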
You update and administer device IDs through scdidadm(1M) and scgdevs(1M). See those man pages for more information.
In the Sun Cluster system, all multihost devices must be under control of the Sun Cluster software. You first create volume manager disk groups on the multihost disks, either Solaris Volume Manager disk sets or VERITAS Volume Manager disk groups (available only in SPARC based clusters). Then, you register the volume manager disk groups as disk device groups. A disk device group is a type of global device. In addition, the Sun Cluster software automatically creates a raw disk device group for each disk and tape device in the cluster. However, these cluster device groups remain in an offline state until you access them as global devices.
Registration provides the Sun Cluster system information about which nodes have a path to specific volume manager disk groups. At this point, the volume manager disk groups become globally accessible within the cluster. If more than one node can write to (master) a disk device group, the data stored in that disk device group becomes highly available. The highly available disk device group can be used to contain cluster file systems.
Disk device groups are independent of resource groups. One node can master a resource group (representing a group of data service processes) while another can master the disk groups that are being accessed by the data services. However, the best practice is to keep on the same node the disk device group that stores a particular application's data and the resource group that contains the application's resources (the application daemon). Refer to Relationship Between Resource Groups and Disk Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for more information about the association between disk device groups and resource groups.
When a node uses a disk device group, the volume manager disk group becomes “global” because it provides multipath support to the underlying disks. Each cluster node that is physically attached to the multihost disks provides a path to the disk device group.
Because a disk enclosure is connected to more than one node, all disk device groups in that enclosure are accessible through an alternate path if the node currently mastering the device group fails. The failure of the node mastering the device group does not affect access to the device group except for the time it takes to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.
This section describes disk device group properties that enable you to balance performance and availability in a multiported disk configuration. Sun Cluster software provides two properties used to configure a multiported disk configuration: preferenced and numsecondaries. You can control the order in which nodes attempt to assume control if a failover occurs by using the preferenced property. Use the numsecondaries property to set a desired number of secondary nodes for a device group.
A highly available service is considered down when the primary fails and when no eligible secondary nodes can be promoted to primary. If service failover occurs and the preferenced property is true, then the nodes follow the order in the nodelist to select a secondary. The nodelist defines the order in which nodes will attempt to assume primary control or transition from spare to secondary. You can dynamically change the preference of a device service by using the scsetup(1M) utility. The preference that is associated with dependent service providers, for example a global file system, will be identical to the preference of the device service.
Secondary nodes are check-pointed by the primary node during normal operation. In a multiported disk configuration, checkpointing each secondary node causes cluster performance degradation and memory overhead. Spare node support was implemented to minimize the performance degradation and memory overhead that checkpointing caused. By default, your disk device group has one primary and one secondary. The remaining available provider nodes become spares. If failover occurs, the secondary becomes primary and the node highest in priority on the nodelist becomes secondary.
The desired number of secondary nodes can be set to any integer between one and the number of operational nonprimary provider nodes in the device group.
If you are using Solaris Volume Manager, you must create the disk device group before you can set the numsecondaries property to a number other than the default.
The default desired number of secondaries for device services is one. The actual number of secondary providers that is maintained by the replica framework is the desired number, unless the number of operational nonprimary providers is less than the desired number. You must alter the numsecondaries property and double-check the nodelist if you are adding or removing nodes from your configuration. Maintaining the nodelist and desired number of secondaries prevents conflict between the configured number of secondaries and the actual number allowed by the framework.
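The relationship between the desired and actual number of secondaries, and the promotion that follows a failover, can be sketched as below. The sketch assumes that the preferenced property is true, so that the nodelist order decides which node is chosen. The function names assign_roles and fail_over are hypothetical and do not correspond to any Sun Cluster interface.

```python
def assign_roles(nodelist, operational, primary, desired_secondaries=1):
    """Return (secondaries, spares) for a device group.

    nodelist is in preference order, operational is the set of nodes that
    are currently up, and primary is the current primary node.
    """
    candidates = [n for n in nodelist if n in operational and n != primary]
    # The actual number is the desired number unless too few providers are up.
    actual = min(desired_secondaries, len(candidates))
    return candidates[:actual], candidates[actual:]


def fail_over(nodelist, operational, primary, secondaries, desired_secondaries=1):
    """Promote a secondary after the primary fails.

    Returns (new_primary, secondaries, spares), or None when no eligible
    secondary exists and the service is therefore down.
    """
    operational = set(operational) - {primary}
    eligible = [n for n in nodelist if n in secondaries and n in operational]
    if not eligible:
        return None
    new_primary = eligible[0]
    new_secondaries, spares = assign_roles(
        nodelist, operational, new_primary, desired_secondaries
    )
    return new_primary, new_secondaries, spares
```

With four provider nodes and the default of one secondary, the sketch yields one primary, one secondary, and two spares. After the primary fails, the secondary is promoted and the highest-priority spare becomes the new secondary.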
(Solaris Volume Manager) Use the metaset(1M) command for Solaris Volume Manager device groups, in conjunction with the preferenced and numsecondaries property settings, to manage addition and removal of nodes from your configuration.
(Veritas Volume Manager) Use the scconf(1M) command for VxVM disk device groups, in conjunction with the preferenced and numsecondaries property settings, to manage addition and removal of nodes from your configuration.
Refer to Administering Cluster File Systems Overview in Sun Cluster System Administration Guide for Solaris OS for procedural information about changing disk device group properties.
The Sun Cluster software mechanism that enables global devices is the global namespace. The global namespace includes the /dev/global/ hierarchy as well as the volume manager namespaces. The global namespace reflects both multihost disks and local disks (and any other cluster device, such as CD-ROMs and tapes), and provides multiple failover paths to the multihost disks. Each node that is physically connected to multihost disks provides a path to the storage for any node in the cluster.
Normally, for Solaris Volume Manager, the volume manager namespaces are located in the /dev/md/diskset/dsk (and rdsk) directories. For Veritas VxVM, the volume manager namespaces are located in the /dev/vx/dsk/disk-group and /dev/vx/rdsk/disk-group directories. These namespaces consist of directories for each Solaris Volume Manager disk set and each VxVM disk group imported throughout the cluster, respectively. Each of these directories contains a device node for each metadevice or volume in that disk set or disk group.
In the Sun Cluster system, each device node in the local volume manager namespace is replaced by a symbolic link to a device node in the /global/.devices/node@nodeID file system, where nodeID is an integer that represents a node in the cluster. Sun Cluster software continues to present the volume manager devices, as symbolic links, in their standard locations as well. Both the global namespace and standard volume manager namespace are available from any cluster node.
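The relationship between a local device path and its location under /global/.devices/node@nodeID can be sketched as a simple string mapping. The sketch assumes that the local path is appended unchanged beneath the node@nodeID mount point. The function name global_device_path is hypothetical.

```python
def global_device_path(node_id, local_path):
    """Build the path under /global/.devices/node@nodeID for a local device path.

    Assumption: the local /dev path is appended unchanged below the
    node@nodeID mount point.
    """
    if not isinstance(node_id, int):
        raise ValueError("nodeID must be an integer")
    if not local_path.startswith("/dev/"):
        raise ValueError("expected an absolute path under /dev")
    return "/global/.devices/node@%d%s" % (node_id, local_path)


example = global_device_path(1, "/dev/md/oracle/dsk/d1")
```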
The advantages of the global namespace include the following:
Each node remains fairly independent, with little change in the device administration model.
Devices can be selectively made global.
Third-party link generators continue to work.
Given a local device name, an easy mapping is provided to obtain its global name.
The following table shows the mappings between the local and global namespaces for a multihost disk, c0t0d0s0.
Table 3–2 Local and Global Namespace Mappings
Component or Path
Local Node Namespace
Solaris logical name
Solaris Volume Manager
SPARC: VERITAS Volume Manager
The global namespace is automatically generated on installation and updated with every reconfiguration reboot. You can also generate the global namespace by running the scgdevs(1M) command.
The cluster file system has the following features:
File access locations are transparent. A process can open a file that is located anywhere in the system. Processes on all nodes can use the same path name to locate a file.
When the cluster file system reads files, it does not update the access time on those files.
Coherency protocols are used to preserve the UNIX file access semantics even if the file is accessed concurrently from multiple nodes.
Extensive caching is used along with zero-copy bulk I/O movement to move file data efficiently.
The cluster file system provides highly available, advisory file-locking functionality by using the fcntl(2) interfaces. Applications that run on multiple cluster nodes can synchronize access to data by using advisory file locking on a cluster file system. File locks are recovered immediately from nodes that leave the cluster, and from applications that fail while holding locks.
Continuous access to data is ensured, even when failures occur. Applications are not affected by failures if a path to disks is still operational. This guarantee is maintained for raw disk access and all file system operations.
Cluster file systems are independent from the underlying file system and volume management software. Cluster file systems make any supported on-disk file system global.
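The advisory file locking mentioned in the feature list above uses the standard fcntl(2) interfaces, so an application needs no cluster-specific code to use it. The following minimal sketch uses Python's fcntl module, whose lockf() call is built on fcntl(2) record locks. The append_with_lock function is a hypothetical helper, and a temporary file stands in for a file on a cluster file system.

```python
import fcntl
import os
import tempfile


def append_with_lock(path, line):
    """Append one line while holding an exclusive advisory lock on the file."""
    with open(path, "a") as handle:
        # This call blocks until the exclusive lock is granted.
        fcntl.lockf(handle, fcntl.LOCK_EX)
        try:
            handle.write(line + "\n")
            handle.flush()
        finally:
            # Release the lock before the file is closed.
            fcntl.lockf(handle, fcntl.LOCK_UN)


if __name__ == "__main__":
    # A temporary file stands in for a file on a cluster file system.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    append_with_lock(demo_path, "first")
    append_with_lock(demo_path, "second")
    with open(demo_path) as handle:
        print(handle.read(), end="")
    os.remove(demo_path)
```

Because the locks are advisory, they protect the data only if every cooperating process takes the lock before it writes.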
You can mount a file system on a global device globally with mount -g or locally with mount.
Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/foo).
A cluster file system is mounted on all cluster members. You cannot mount a cluster file system on a subset of cluster members.
A cluster file system is not a distinct file system type. Clients verify the underlying file system (for example, UFS).
In the Sun Cluster system, all multihost disks are placed into disk device groups, which can be Solaris Volume Manager disk sets, VxVM disk groups, or individual disks that are not under control of a software-based volume manager.
For a cluster file system to be highly available, the underlying disk storage must be connected to more than one node. Therefore, a local file system (a file system that is stored on a node's local disk) that is made into a cluster file system is not highly available.
You can mount cluster file systems as you would mount file systems:
Manually — Use the mount command and the -g or -o global mount options to mount the cluster file system from the command line, for example:
SPARC: # mount -g /dev/global/dsk/d0s0 /global/oracle/data
Automatically— Create an entry in the /etc/vfstab file with a global mount option to mount the cluster file system at boot. You then create a mount point under the /global directory on all nodes. The directory /global is a recommended location, not a requirement. Here's a sample line for a cluster file system from an /etc/vfstab file:
SPARC: /dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging
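As an aid to reading the sample entry, the following sketch splits an /etc/vfstab line into its seven whitespace-separated fields and checks whether the global mount option is present. The names VfstabEntry, parse_vfstab_line, and is_global_mount are hypothetical helpers, not part of Solaris or Sun Cluster software.

```python
from collections import namedtuple

VfstabEntry = namedtuple(
    "VfstabEntry",
    "device_to_mount device_to_fsck mount_point fs_type fsck_pass mount_at_boot options",
)


def parse_vfstab_line(line):
    """Split one /etc/vfstab line into its seven whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError("expected 7 fields, got %d" % len(fields))
    return VfstabEntry(*fields)


def is_global_mount(entry):
    """Return True when the comma-separated mount options include 'global'."""
    return "global" in entry.options.split(",")


sample = "/dev/md/oracle/dsk/d1 /dev/md/oracle/rdsk/d1 /global/oracle/data ufs 2 yes global,logging"
entry = parse_vfstab_line(sample)
```

For the sample line, entry.mount_point is /global/oracle/data and is_global_mount(entry) returns True.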
While Sun Cluster software does not impose a naming policy for cluster file systems, you can ease administration by creating a mount point for all cluster file systems under the same directory, such as /global/disk-device-group. See Sun Cluster 3.1 9/04 Software Collection for Solaris OS (SPARC Platform Edition) and Sun Cluster System Administration Guide for Solaris OS for more information.
The HAStoragePlus resource type is designed to make non-global file system configurations such as UFS and VxFS highly available. Use HAStoragePlus to integrate your local file system into the Sun Cluster environment and make the file system highly available. HAStoragePlus provides additional file system capabilities such as checks, mounts, and forced unmounts that enable Sun Cluster to fail over local file systems. In order to fail over, the local file system must reside on global disk groups with affinity switchovers enabled.
See Enabling Highly Available Local File Systems in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for information about how to use the HAStoragePlus resource type.
HAStoragePlus can also be used to synchronize the startup of resources and disk device groups on which the resources depend. For more information, see Resources, Resource Groups, and Resource Types.
You can use the syncdir mount option for cluster file systems that use UFS as the underlying file system. However, you experience a significant performance improvement if you do not specify syncdir. If you specify syncdir, the writes are guaranteed to be POSIX compliant. If you do not specify syncdir, you experience the same behavior as in NFS file systems. For example, without syncdir, you might not discover an out-of-space condition until you close a file. With syncdir (and POSIX behavior), the out-of-space condition would have been discovered during the write operation. The cases in which you might have problems if you do not specify syncdir are rare.
If you are using a SPARC based cluster, VxFS does not have a mount option that is equivalent to the syncdir mount option for UFS. VxFS behavior is the same as for UFS when the syncdir mount option is not specified.
See File Systems FAQs for frequently asked questions about global devices and cluster file systems.
The current release of Sun Cluster software supports disk-path monitoring (DPM). This section provides conceptual information about DPM, the DPM daemon, and administration tools that you use to monitor disk paths. Refer to Sun Cluster System Administration Guide for Solaris OS for procedural information about how to monitor, unmonitor, and check the status of disk paths.
DPM is not supported on nodes that run versions that were released prior to Sun Cluster 3.1 10/03 software. Do not use DPM commands while a rolling upgrade is in progress. After all nodes are upgraded, the nodes must be online to use DPM commands.
DPM improves the overall reliability of failover and switchover by monitoring the secondary disk-path availability. Use the scdpm command to verify availability of the disk path that is used by a resource before the resource is switched. Options that are provided with the scdpm command enable you to monitor disk paths to a single node or to all nodes in the cluster. See the scdpm(1M) man page for more information about command-line options.
The DPM components are installed from the SUNWscu package. The SUNWscu package is installed by the standard Sun Cluster installation procedure. See the scinstall(1M) man page for installation interface details. The following table describes the default location for installation of DPM components.
Daemon status file (created at runtime)
A multithreaded DPM daemon runs on each node. The DPM daemon (scdpmd) is started by an rc.d script when a node boots. If a problem occurs, the daemon is managed by pmfd and restarts automatically. The following list describes how scdpmd works on initial startup.
At startup, the status for each disk path is initialized to UNKNOWN.
The DPM daemon gathers disk-path and node name information from the previous status file or from the CCR database. Refer to Cluster Configuration Repository (CCR) for more information about the CCR. After a DPM daemon is started, you can force the daemon to read the list of monitored disks from a specified file name.
The DPM daemon initializes the communication interface to answer requests from components that are external to the daemon, such as the command-line interface.
The DPM daemon pings each disk path in the monitored list every 10 minutes by using scsi_inquiry commands. Each entry is locked to prevent the communication interface from accessing the content of an entry that is being modified.
The DPM daemon notifies the Sun Cluster Event Framework and logs the new status of the path through the UNIX syslogd(1M) mechanism.
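The monitoring cycle in the preceding list can be modeled with a small sketch. This is a toy model, not the scdpmd implementation: the PathMonitor class, the probe and notify callables, and the OK and FAIL status names are hypothetical. The sketch also assumes that a notification is sent only when a path's status changes, and it leaves the 10-minute scheduling to the caller.

```python
UNKNOWN, OK, FAIL = "UNKNOWN", "OK", "FAIL"


class PathMonitor:
    """Toy model of the monitoring cycle; not the scdpmd implementation."""

    def __init__(self, disk_paths, probe, notify):
        # At startup, the status for each monitored path is UNKNOWN.
        self.status = {path: UNKNOWN for path in disk_paths}
        self.probe = probe    # hypothetical callable: path -> bool
        self.notify = notify  # hypothetical callable: (path, new_status) -> None

    def poll_once(self):
        """Probe every path once and report only the paths whose status changed."""
        for path in sorted(self.status):
            new_status = OK if self.probe(path) else FAIL
            if new_status != self.status[path]:
                self.status[path] = new_status
                self.notify(path, new_status)
```

A caller would supply a probe that checks one disk path and a notify function that records the change, then call poll_once() at the chosen interval.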
All errors that are related to the daemon are reported by pmfd(1M). All the functions from the API return 0 on success and -1 for any failure.
The DPM daemon monitors the availability of the logical path that is visible through multipath drivers such as Sun StorEdge Traffic Manager, HDLM, and PowerPath. The individual physical paths that are managed by these drivers are not monitored, because the multipath driver masks individual failures from the DPM daemon.
This section describes two methods for monitoring disk paths in your cluster. The first method is provided by the scdpm command. Use this command to monitor, unmonitor, or display the status of disk paths in your cluster. This command is also useful for printing the list of faulted disks and for monitoring disk paths from a file.
The second method for monitoring disk paths in your cluster is provided by the SunPlex Manager graphical user interface (GUI). SunPlex Manager provides a topological view of the monitored disk paths in your cluster. The view is updated every 10 minutes to provide information about the number of failed pings. Use the information that is provided by the SunPlex Manager GUI in conjunction with the scdpm(1M) command to administer disk paths. Refer to Chapter 10, Administering Sun Cluster With the Graphical User Interfaces, in Sun Cluster System Administration Guide for Solaris OS for information about SunPlex Manager.
The scdpm(1M) command provides DPM administration commands that enable you to perform the following tasks:
Monitoring a new disk path
Unmonitoring a disk path
Rereading the configuration data from the CCR database
Reading the disks to monitor or unmonitor from a specified file
Reporting the status of a disk path or all disk paths in the cluster
Printing all the disk paths that are accessible from a node
Issue the scdpm(1M) command with the disk-path argument from any active node to perform DPM administration tasks on the cluster. The disk-path argument always consists of a node name and a disk name. The node name is not required and defaults to all if no node name is specified. The following table describes naming conventions for the disk path.
Use of the global disk-path name is strongly recommended, because the global disk-path name is consistent throughout the cluster. The UNIX disk-path name is not consistent throughout the cluster. The UNIX disk path for one disk can differ from cluster node to cluster node. The disk path could be c1t0d0 on one node and c2t0d0 on another node. If you use UNIX disk-path names, use the scdidadm -L command to map the UNIX disk-path name to the global disk-path name before issuing DPM commands. See the scdidadm(1M) man page.
Sample Disk-Path Name
Global disk path
Disk path d1 on the schost-1 node
Disk path d1 on all nodes in the cluster
UNIX disk path
Disk path c0t0d0s0 on the schost-1 node
All disk paths on the schost-1 node
All disk paths
All disk paths on all nodes of the cluster
SunPlex Manager enables you to perform the following basic DPM administration tasks:
Monitoring a disk path
Unmonitoring a disk path
Viewing the status of all disk paths in the cluster
Refer to the SunPlex Manager online help for procedural information about how to perform disk-path administration by using SunPlex Manager.
This section contains the following topics:
For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.
Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time because multiple active partitions might cause data corruption. The Cluster Membership Monitor (CMM) and quorum algorithm guarantee that at most one instance of the same cluster is operational at any time, even if the cluster interconnect is partitioned.
For an introduction to quorum and CMM, see Cluster Membership in Sun Cluster Overview for Solaris OS.
Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters. Each partition “believes” that it is the only partition because the nodes in one partition cannot communicate with the nodes in the other partition.
Amnesia occurs when the cluster restarts after a shutdown with cluster configuration data older than at the time of the shutdown. This problem can occur when you start the cluster on a node that was not in the last functioning cluster partition.
Sun Cluster software avoids split brain and amnesia by:
Assigning each node one vote
Mandating a majority of votes for an operational cluster
A partition with the majority of votes gains quorum and is allowed to operate. This majority vote mechanism prevents split brain and amnesia when more than two nodes are configured in a cluster. However, counting node votes alone is not sufficient when only two nodes are configured in a cluster. In a two-node cluster, a majority is two. If such a two-node cluster becomes partitioned, an external vote is needed for either partition to gain quorum. This external vote is provided by a quorum device.
Use the scstat -q command to determine the following information:
Total configured votes
Current present votes
Votes required for quorum
For more information about this command, see scstat(1M).
Both nodes and quorum devices contribute votes to the cluster to form quorum.
A node contributes votes depending on the node's state:
A node has a vote count of one when it boots and becomes a cluster member.
A node has a vote count of zero when the node is being installed.
A node has a vote count of zero when a system administrator places the node into maintenance state.
Quorum devices contribute votes based on the number of node votes that are connected to the device. When you configure a quorum device, Sun Cluster software assigns the quorum device a vote count of N-1, where N is the number of votes connected to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a vote count of one (two minus one).
A quorum device contributes votes if one of the following two conditions is true:
At least one of the nodes to which the quorum device is currently attached is a cluster member.
At least one of the nodes to which the quorum device is currently attached is booting, and that node was a member of the last cluster partition to own the quorum device.
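The vote rules above can be modeled as follows. This is a hedged sketch, not the product's implementation. The function names are hypothetical, and the floor at zero for a device with no connected votes is an assumption.

```python
def quorum_device_votes(connected_node_votes):
    """Vote count of a quorum device: N - 1, where N is the number of
    node votes connected to the device."""
    n = sum(connected_node_votes)
    return max(n - 1, 0)  # assumption: the count never goes negative


def device_contributes(attached_member_present, attached_booting_prior_owner):
    """A quorum device contributes its votes if at least one attached node is a
    cluster member, or at least one attached node is booting and was a member
    of the last partition to own the device."""
    return attached_member_present or attached_booting_prior_owner


# Two attached nodes with one vote each: the device is worth 2 - 1 = 1 vote.
dual_hosted = quorum_device_votes([1, 1])

# One of the two attached nodes is in maintenance state (vote count zero):
# only one vote is connected, so the device is worth 1 - 1 = 0 votes.
one_in_maintenance = quorum_device_votes([1, 0])
```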
You configure quorum devices during the cluster installation, or later by using the procedures that are described in Chapter 5, Administering Quorum, in Sun Cluster System Administration Guide for Solaris OS.
A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When split brain occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might “believe” it has sole access and ownership to the multihost devices. When multiple nodes attempt to write to the disks, data corruption can occur.
Failure fencing limits node access to multihost devices by physically preventing access to the disks. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, resulting in data integrity.
Disk device services provide failover capability for services that use multihost devices. When a cluster member that currently serves as the primary (owner) of the disk device group fails or becomes unreachable, a new primary is chosen. The new primary enables access to the disk device group to continue with only minor interruption. During this process, the old primary must forfeit access to the devices before the new primary can be started. However, when a member drops out of the cluster and becomes unreachable, the cluster cannot inform that node to release the devices for which it was the primary. Thus, you need a means to enable surviving members to take control of and access global devices from failed members.
The Sun Cluster system uses SCSI disk reservations to implement failure fencing. Using SCSI reservations, failed nodes are “fenced” away from the multihost devices, preventing them from accessing those disks.
SCSI-2 disk reservations support a form of reservation that either grants access to all nodes attached to the disk (when no reservation is in place) or restricts access to a single node (the node that holds the reservation).
When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure fencing procedure to prevent the other node from accessing shared disks. When this failure fencing occurs, the fenced node panics with a “reservation conflict” message on its console.
The discovery that a node is no longer a cluster member triggers a SCSI reservation on all the disks that are shared between this node and other nodes. The fenced node might not be “aware” that it is being fenced, and if it tries to access one of the shared disks, it detects the reservation and panics.
The mechanism by which the cluster framework ensures that a failed node cannot reboot and begin writing to shared storage is called failfast.
Nodes that are cluster members continuously enable a specific ioctl, MHIOCENFAILFAST, for the disks to which they have access, including quorum disks. This ioctl is a directive to the disk driver. The ioctl gives a node the capability to panic itself if it cannot access the disk due to the disk being reserved by some other node.
The MHIOCENFAILFAST ioctl causes the driver to check the error return from every read and write that a node issues to the disk for the Reservation_Conflict error code. The ioctl periodically, in the background, issues a test operation to the disk to check for Reservation_Conflict. Both the foreground and background control flow paths panic if Reservation_Conflict is returned.
For SCSI-2 disks, reservations are not persistent—they do not survive node reboots. For SCSI-3 disks with Persistent Group Reservation (PGR), reservation information is stored on the disk and persists across node reboots. The failfast mechanism works the same, whether you have SCSI-2 disks or SCSI-3 disks.
If a node loses connectivity to other nodes in the cluster, and it is not part of a partition that can achieve quorum, it is forcibly removed from the cluster by another node. Another node that is part of the partition that can achieve quorum places reservations on the shared disks. When the node that does not have quorum attempts to access the shared disks, it receives a reservation conflict and panics as a result of the failfast mechanism.
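The interaction between fencing and failfast can be illustrated with a small simulation. This is a toy model of SCSI-2 style single-holder reservations, not driver code. The class and function names (SharedDisk, failfast_io, NodePanic) are hypothetical, and a Python exception stands in for a node panic.

```python
class ReservationConflict(Exception):
    """Stands in for the Reservation_Conflict error returned by the disk."""


class NodePanic(Exception):
    """Stands in for a node panicking itself."""


class SharedDisk:
    def __init__(self):
        # None means no reservation is in place: all attached nodes have access.
        self.reserved_by = None

    def reserve(self, node):
        """A surviving member fences other nodes by reserving the disk."""
        self.reserved_by = node

    def io(self, node):
        """Any read or write fails if another node holds the reservation."""
        if self.reserved_by is not None and self.reserved_by != node:
            raise ReservationConflict(node)
        return "ok"


def failfast_io(disk, node):
    """I/O with failfast enabled: a reservation conflict panics the caller."""
    try:
        return disk.io(node)
    except ReservationConflict:
        raise NodePanic(f"{node}: reservation conflict") from None


disk = SharedDisk()
before_a = failfast_io(disk, "node-a")   # no reservation yet, access allowed
before_b = failfast_io(disk, "node-b")
disk.reserve("node-a")                   # node-a's partition has quorum and fences node-b
after_a = failfast_io(disk, "node-a")    # the reservation holder keeps access
```

After the reservation is placed, a call such as failfast_io(disk, "node-b") raises NodePanic in this model, which mirrors the fenced node panicking on its next disk access.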
After the panic, the node might reboot and attempt to rejoin the cluster or, if the cluster is composed of SPARC based systems, stay at the OpenBoot™ PROM (OBP) prompt. The action that is taken is determined by the setting of the auto-boot? parameter. You can set auto-boot? with eeprom(1M), at the OpenBoot PROM ok prompt in a SPARC based cluster. Alternatively, you can set up this parameter with the SCSI utility that you optionally run after the BIOS boots in an x86 based cluster.
The following list contains facts about quorum configurations:
Quorum devices can contain user data.
In an N+1 configuration where N quorum devices are each connected to one of the 1 through N nodes and the N+1 node, the cluster survives the death of either all 1 through N nodes or any of the N/2 nodes. This availability assumes that the quorum device is functioning correctly.
In an N-node configuration where a single quorum device connects to all nodes, the cluster can survive the death of any of the N-1 nodes. This availability assumes that the quorum device is functioning correctly.
In an N-node configuration where a single quorum device connects to all nodes, the cluster can survive the failure of the quorum device if all cluster nodes are available.
You must adhere to the following requirements. If you ignore these requirements, you might compromise your cluster's availability.
Ensure that Sun Cluster software supports your specific device as a quorum device.
For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.
Sun Cluster software supports two types of quorum devices:
Multihosted shared disks that support SCSI-3 PGR reservations
Dual-hosted shared disks that support SCSI-2 reservations
In a two-node configuration, you must configure at least one quorum device to ensure that a single node can continue if the other node fails. See Figure 3–2.
Use the following information to evaluate the best quorum configuration for your topology:
Do you have a device that is capable of being connected to all nodes of the cluster?
If yes, configure that device as your one quorum device. You do not need to configure another quorum device because your configuration is already optimal.
If you ignore this requirement and add another quorum device, the additional quorum device reduces your cluster's availability.
If no, configure your dual-ported device or devices.
Ensure that the total number of votes contributed by quorum devices is strictly less than the total number of votes contributed by nodes. Otherwise, your nodes cannot form a cluster if all disks are unavailable—even if all nodes are functioning.
In particular environments, you might desire to reduce overall cluster availability to meet your needs. In these situations, you can ignore this best practice. However, not adhering to this best practice decreases overall availability. For example, in the configuration that is outlined in Atypical Quorum Configurations, the cluster is less available because the quorum votes exceed the node votes. The cluster has the property that if access to the shared storage between Node A and Node B is lost, the entire cluster fails.
See Atypical Quorum Configurations for the exception to this best practice.
Specify a quorum device between every pair of nodes that shares access to a storage device. This quorum configuration speeds the failure fencing process. See Quorum in Greater Than Two–Node Configurations.
Quorum devices slightly slow reconfigurations after a node joins or a node dies. Therefore, do not add more quorum devices than are necessary.
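The vote-count best practice in this list can be checked mechanically. The following sketch assumes a simple majority model and uses hypothetical function names and vote counts. It is not a Sun Cluster utility.

```python
def follows_vote_best_practice(node_votes, device_votes):
    """True if quorum devices contribute strictly fewer votes than nodes."""
    return sum(device_votes) < sum(node_votes)


def nodes_alone_can_form_cluster(node_votes, device_votes):
    """True if all nodes together still hold a majority of the configured
    votes when every quorum device is unavailable."""
    total = sum(node_votes) + sum(device_votes)
    return sum(node_votes) >= total // 2 + 1


# Two nodes and one dual-hosted quorum device worth one vote.
typical_ok = follows_vote_best_practice([1, 1], [1])
typical_forms = nodes_alone_can_form_cluster([1, 1], [1])

# Hypothetical atypical layout: three nodes, two devices worth two votes each.
atypical_ok = follows_vote_best_practice([1, 1, 1], [2, 2])
atypical_forms = nodes_alone_can_form_cluster([1, 1, 1], [2, 2])
```

In this model the two checks always agree: whenever device votes equal or exceed node votes, the nodes cannot reach a majority without the devices.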
This section shows examples of quorum configurations that are recommended. For examples of quorum configurations you should avoid, see Bad Quorum Configurations.
Two quorum votes are required for a two-node cluster to form. These two votes can derive from the two cluster nodes, or from just one node and a quorum device.
You can configure a greater than two-node cluster without a quorum device. However, if you do so, you cannot start the cluster without a majority of nodes in the cluster.
Figure 3–3 assumes you are running mission-critical applications (Oracle database, for example) on Node A and Node B. If Node A and Node B are unavailable and cannot access shared data, you might want the entire cluster to be down. Otherwise, this configuration is suboptimal because it does not provide high availability.
For information about the best practice to which this exception relates, see Adhering to Quorum Device Best Practices.
This section shows examples of quorum configurations you should avoid. For examples of recommended quorum configurations, see Recommended Quorum Configurations.
The term data service describes an application, such as Sun Java System Web Server or Oracle, that has been configured to run on a cluster rather than on a single server. A data service consists of an application, specialized Sun Cluster configuration files, and Sun Cluster management methods that control the following actions of the application.
Monitor and take corrective measures
For information about data service types, see Data Services in Sun Cluster Overview for Solaris OS.
Figure 3–4 compares an application that runs on a single application server (the single-server model) to the same application running on a cluster (the clustered-server model). The only difference between the two configurations is that the clustered application might run faster and will be more highly available.
Some data services require you to specify either logical hostnames or shared addresses as the network interfaces. Logical hostnames and shared addresses are not interchangeable. Other data services allow you to specify either logical hostnames or shared addresses. Refer to the installation and configuration for each data service for details about the type of interface you must specify.
A network resource is not associated with a specific physical server. A network resource can migrate between physical servers.
A network resource is initially associated with one node, the primary. If the primary fails, the network resource and the application resource fail over to a different cluster node (a secondary). When the network resource fails over, after a short delay, the application resource continues to run on the secondary.
Figure 3–5 compares the single-server model with the clustered-server model. Note that in the clustered-server model, a network resource (logical hostname, in this example) can move between two or more of the cluster nodes. The application is configured to use this logical hostname in place of a hostname associated with a particular server.
A shared address is also initially associated with one node. This node is called the global interface node. A shared address (known as the global interface) is used as the single network interface to the cluster.
The difference between the logical hostname model and the scalable service model is that in the latter, each node also has the shared address actively configured on its loopback interface. This configuration enables multiple instances of a data service to be active on several nodes simultaneously. The term “scalable service” means that you can add more CPU power to the application by adding additional cluster nodes and the performance will scale.
If the global interface node fails, the shared address can be started on another node that is also running an instance of the application (thereby making this other node the new global interface node). Or, the shared address can fail over to another cluster node that was not previously running the application.
Figure 3–6 compares the single-server configuration with the clustered scalable service configuration. Note that in the scalable service configuration, the shared address is present on all nodes. Similar to how a logical hostname is used for a failover data service, the application is configured to use this shared address in place of a hostname associated with a particular server.
The Sun Cluster software supplies a set of service management methods. These methods run under the control of the Resource Group Manager (RGM), which uses them to start, stop, and monitor the application on the cluster nodes. These methods, along with the cluster framework software and multihost devices, enable applications to become failover or scalable data services.
The RGM also manages resources in the cluster, including instances of an application and network resources (logical hostnames and shared addresses).
In addition to Sun Cluster software-supplied methods, the Sun Cluster system also supplies an API and several data service development tools. These tools enable application developers to develop the data service methods needed to make other applications run as highly available data services with the Sun Cluster software.
If the node on which the data service is running (the primary node) fails, the service is migrated to another working node without user intervention. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured on one node, and later, automatically configured down on the original node and configured on another node.
For failover data services, application instances run only on a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover). The outcome depends on how the data service has been configured.
The scalable data service has the potential for active instances on multiple nodes. Scalable services use the following two resource groups:
A scalable resource group contains the application resources.
A failover resource group contains the network resources (shared addresses) on which the scalable service depends.
The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running at once. The failover resource group that hosts the shared address is online on only one node at a time. All nodes that host a scalable service use the same shared address to host the service.
Service requests enter the cluster through a single network interface (the global interface). These requests are distributed to the nodes, based on one of several predefined algorithms that are set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes. Multiple global interfaces can exist on different nodes that host other shared addresses.
For scalable services, application instances run on several nodes simultaneously. If the node that hosts the global interface fails, the global interface fails over to another node. If an application instance that is running fails, the instance attempts to restart on the same node.
If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, the service continues to run on the remaining nodes, possibly causing a degradation of service throughput.
TCP state for each application instance is kept on the node with the instance, not on the global interface node. Therefore, failure of the global interface node does not affect the connection.
Figure 3–7 shows an example of failover and scalable resource groups and the dependencies that exist between them for scalable services. This example shows three resource groups. The failover resource group contains application resources for highly available DNS, and network resources used by both highly available DNS and highly available Apache Web Server (used in SPARC-based clusters only). The scalable resource groups contain only application instances of the Apache Web Server. Note that resource group dependencies exist between the scalable and failover resource groups (solid lines). Additionally, all the Apache application resources depend on the network resource schost-2, which is a shared address (dashed lines).
Load balancing improves performance of the scalable service, both in response time and in throughput. There are two classes of scalable data services.
A pure service is capable of having any of its instances respond to client requests. A sticky service is capable of having a client send requests to the same instance. Those requests are not redirected to other instances.
A pure service uses a weighted load-balancing policy. Under this load-balancing policy, client requests are by default uniformly distributed over the server instances in the cluster. For example, in a three-node cluster, suppose that each node has the weight of 1. Each node will service 1/3 of the requests from any client on behalf of that service. The administrator can change weights at any time through the scrgadm(1M) command interface or through the SunPlex Manager GUI.
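The effect of weights can be illustrated with a short sketch. The source does not describe how the cluster selects a node for each new connection, so the random weighted selection below is an assumption made for illustration. The function names and node names are hypothetical.

```python
import random


def expected_share(weights, node):
    """Fraction of new connections a node can expect: its weight divided by
    the total weight of all active nodes."""
    return weights[node] / sum(weights.values())


def pick_node(weights, rng=random):
    """Choose a node for one new connection, in proportion to its weight.
    Assumption: selection is random; the real selection method is not
    documented here."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]


# Three nodes with weight 1 each: every node expects one third of the load.
equal = {"node1": 1, "node2": 1, "node3": 1}
share_equal = expected_share(equal, "node1")

# After the administrator raises node1's weight to 3 in a two-node service,
# node1 expects 3/4 of the new connections.
skewed = {"node1": 3, "node2": 1}
share_skewed = expected_share(skewed, "node1")
```

The shares hold only on average over many connections. Any individual request can land on any active node.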
A sticky service has two flavors, ordinary sticky and wildcard sticky. Sticky services enable concurrent application-level sessions over multiple TCP connections to share in-memory state (application session state).
Ordinary sticky services enable a client to share state between multiple concurrent TCP connections. The client is said to be “sticky” toward that server instance listening on a single port. The client is guaranteed that all requests go to the same server instance, provided that instance remains up and accessible and the load-balancing policy is not changed while the service is online.
For example, a web browser on the client connects to a shared IP address on port 80 using three different TCP connections. However, the connections exchange cached session information between them at the service.
A generalization of a sticky policy extends to multiple scalable services that exchange session information in the background and at the same instance. When these services do so, the client is said to be “sticky” toward multiple server instances on the same node listening on different ports.
For example, a customer on an e-commerce site fills the shopping cart with items by using HTTP on port 80. The customer then switches to SSL on port 443 to send secure data to pay by credit card for the items in the cart.
Wildcard sticky services use dynamically assigned port numbers, but still expect client requests to go to the same node. The client is “sticky wildcard” over ports that have the same IP address.
A good example of this policy is passive mode FTP. For example, a client connects to an FTP server on port 21. The server then instructs the client to connect back to a listener port on the server in the dynamic port range. All requests for this IP address are forwarded to the same node, which the server identified to the client through the control information.
For each of these sticky policies, the weighted load-balancing policy is in effect by default. Therefore, a client's initial request is directed to the instance that the load balancer dictates. After the client establishes an affinity for the node where the instance is running, future requests are conditionally directed to that instance. The node must be accessible and the load-balancing policy must not have changed.
Additional details of the specific load-balancing policies are as follows.
Weighted. The load is distributed among various nodes according to specified weight values. This policy is set by using the LB_WEIGHTED value for the Load_balancing_policy resource property. The weights themselves are specified in the Load_balancing_weights property. If a weight for a node is not explicitly set, the weight for that node defaults to one.
The weighted policy redirects a certain percentage of the traffic from clients to a particular node. Given X=weight and A=the total weights of all active nodes, an active node can expect approximately X/A of the total new connections to be directed to the active node. However, the total number of connections must be large enough. This policy does not address individual requests.
Note that this policy is not round robin. A round-robin policy would always cause each request from a client to go to a different node. For example, the first request would go to node 1, the second request would go to node 2, and so on.
Sticky. In this policy, the set of ports is known at the time the application resources are configured. This policy is set by using the LB_STICKY value for the Load_balancing_policy resource property.
Sticky-wildcard. This policy is a superset of the ordinary “sticky” policy. For a scalable service that is identified by the IP address, ports are assigned by the server (and are not known in advance). The ports might change. This policy is set by using the LB_STICKY_WILD value for the Load_balancing_policy resource property.
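Client affinity under the sticky policies can be sketched as a table that maps each client to the node chosen for its first request. This is an illustrative model, not the cluster's load balancer. The class name StickyBalancer is hypothetical, keying affinity by client IP address is an assumption, and port handling (the difference between ordinary and wildcard sticky) is omitted.

```python
import random


class StickyBalancer:
    def __init__(self, weights, seed=None):
        self.weights = dict(weights)   # node name -> weight
        self.affinity = {}             # client IP -> node (assumed key)
        self.rng = random.Random(seed)

    def route(self, client_ip, accessible):
        """Return the node for a request from client_ip.

        A client with an established affinity keeps its node while that node
        is accessible. Otherwise the weighted policy picks a node and the
        affinity is recorded for future requests."""
        node = self.affinity.get(client_ip)
        if node in accessible:
            return node
        candidates = [n for n in self.weights if n in accessible]
        node = self.rng.choices(
            candidates, weights=[self.weights[n] for n in candidates], k=1
        )[0]
        self.affinity[client_ip] = node
        return node


balancer = StickyBalancer({"node1": 1, "node2": 1}, seed=1)
all_nodes = {"node1", "node2"}
first = balancer.route("192.0.2.10", all_nodes)
repeat = [balancer.route("192.0.2.10", all_nodes) for _ in range(20)]
```

In this sketch every entry in repeat equals first. If the chosen node becomes inaccessible, the next request is routed by weight to a remaining node and the affinity moves with it.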
Resource groups fail over from one node to another. When this failover occurs, the original secondary becomes the new primary. The failback settings specify the actions that will occur when the original primary comes back online. The options are to have the original primary become the primary again (failback) or to allow the current primary to remain. You specify the option you want by using the Failback resource group property setting.
If the original node that hosts the resource group fails and reboots repeatedly, setting failback might result in reduced availability for the resource group.
Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon(s) are running and that clients are being served. Based on the information that probes return, predefined actions such as restarting daemons or causing a failover can be initiated.
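A fault monitor's decision loop might look like the following sketch. The restart threshold (max_restarts) and the function name are hypothetical. The actual actions and limits depend on how each data service is configured, and a real monitor would typically count restarts within a time window, which is omitted here.

```python
def run_fault_monitor(probe, max_restarts, rounds):
    """Probe the service up to `rounds` times and return the actions taken.

    probe        -- callable that returns True when the service is healthy
    max_restarts -- hypothetical limit on local restarts before failing over
    """
    actions = []
    restarts = 0
    for _ in range(rounds):
        if probe():
            continue                    # healthy, nothing to do
        if restarts < max_restarts:
            actions.append("restart")   # try to recover on the same node
            restarts += 1
        else:
            actions.append("failover")  # give up locally, move to another node
            break
    return actions


# Simulated probe results: healthy once, then failing repeatedly.
results = iter([True, False, False, False])
actions = run_fault_monitor(lambda: next(results), max_restarts=2, rounds=4)
```

With these simulated probe results the monitor restarts the service twice on the same node and then requests a failover.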
Sun supplies configuration files and management methods templates that enable you to make various applications operate as failover or scalable services within a cluster. If Sun does not offer the application that you want to run as a failover or scalable service, you have an alternative. Use a Sun Cluster API or the DSET API to configure the application to run as a failover or scalable service. However, not all applications can become a scalable service.
A set of criteria determines whether an application can become a scalable service. To determine if your application can become a scalable service, see Analyzing the Application for Suitability in Sun Cluster Data Services Developer’s Guide for Solaris OS. This set of criteria is summarized below.
First, such a service is composed of one or more server instances. Each instance runs on a different node of the cluster. Two or more instances of the same service cannot run on the same node.
Second, if the service provides an external logical data store, you must exercise caution. Concurrent access to this store from multiple server instances must be synchronized to avoid losing updates or reading data as it's being changed. Note the use of “external” to distinguish the store from in-memory state. The term “logical” indicates that the store appears as a single entity, although it might itself be replicated. Furthermore, this logical data store has the property that whenever any server instance updates the store, that update is immediately “seen” by other instances.
The Sun Cluster system provides such an external storage through its cluster file system and its global raw partitions. As an example, suppose a service writes new data to an external log file or modifies existing data in place. When multiple instances of this service run, each instance has access to this external log, and each might simultaneously access this log. Each instance must synchronize its access to this log, or else the instances interfere with each other. The service could use ordinary Solaris file locking through fcntl(2) and lockf(3C) to achieve the desired synchronization.
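The log example above could be synchronized along these lines. The sketch uses Python's fcntl.lockf, which wraps fcntl(2) record locking on UNIX systems. The function name append_record and the log path are hypothetical, and the sketch assumes that the file system honors these locks for all instances.

```python
import fcntl
import os


def append_record(path, record):
    """Append one line to a shared log while holding an exclusive lock."""
    with open(path, "a") as log:
        fcntl.lockf(log, fcntl.LOCK_EX)      # blocks until the lock is granted
        try:
            log.write(record + "\n")
            log.flush()
            os.fsync(log.fileno())           # push the data out before unlocking
        finally:
            fcntl.lockf(log, fcntl.LOCK_UN)  # always release the lock
```

Each service instance would call append_record with the same path, so that only one instance writes to the log at a time.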
Another example of this type of store is a back-end database, such as highly available Oracle Real Application Clusters Guard for SPARC based clusters or Oracle. This type of back-end database server provides built-in synchronization by using database query or update transactions. Therefore, multiple server instances do not need to implement their own synchronization.
The Sun IMAP server is an example of a service that is not a scalable service. The service updates a store, but that store is private. When multiple IMAP instances write to this store, they overwrite each other because the updates are not synchronized. The IMAP server must be rewritten to synchronize concurrent access.
Finally, note that instances can have private data that is disjoint from the data of other instances. In such a case, the service does not need synchronized concurrent access because the data is private, and only that instance can manipulate it. In this case, you must be careful not to store this private data under the cluster file system because this data can become globally accessible.
The Sun Cluster system provides the following to make applications highly available:
Data services that are supplied as part of the Sun Cluster system
A data service API
A development library API for data services
A “generic” data service
The Sun Cluster Data Services Planning and Administration Guide for Solaris OS describes how to install and configure the data services that are supplied with the Sun Cluster system. The Sun Cluster 3.1 9/04 Software Collection for Solaris OS (SPARC Platform Edition) describes how to instrument other applications to be highly available under the Sun Cluster framework.
The Sun Cluster APIs enable application developers to develop fault monitors and scripts that start and stop data service instances. With these tools, an application can be implemented as a failover or a scalable data service. The Sun Cluster system provides a “generic” data service. Use this generic data service to quickly generate an application's required start and stop methods and to implement the data service as a failover or scalable service.
A cluster must have multiple network connections between nodes, forming the cluster interconnect. Sun Cluster software uses multiple interconnects to achieve the following goals:
Ensure high availability
For internal traffic such as file system data or scalable services data, messages are striped across all available interconnects in a round-robin fashion. The cluster interconnect is also available to applications, for highly available communication between nodes. For example, a distributed application might have components running on different nodes that need to communicate. By using the cluster interconnect rather than the public transport, these connections can withstand the failure of an individual link.
To use the cluster interconnect for communication between nodes, an application must use the private hostnames that you configured during the Sun Cluster installation. For example, if the private hostname for node 1 is clusternode1-priv, use that name to communicate with node 1 over the cluster interconnect. TCP sockets that are opened by using this name are routed over the cluster interconnect and can be transparently rerouted if the network fails.
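From the application's point of view, using the interconnect only means connecting to the private hostname instead of the public one. A minimal sketch follows. The function name, the port number in the usage comment, and the timeout are hypothetical.

```python
import socket


def open_interconnect_connection(private_hostname, port, timeout=5.0):
    """Open a TCP connection to a peer by its private hostname, so that the
    traffic is carried over the cluster interconnect."""
    return socket.create_connection((private_hostname, port), timeout=timeout)


# Usage on a cluster node (hypothetical port):
#     conn = open_interconnect_connection("clusternode1-priv", 9000)
#     conn.sendall(b"ping")
#     conn.close()
```

The application code is otherwise ordinary socket code. Rerouting after a link failure is handled below the socket layer.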
Because you can configure the private hostnames during your Sun Cluster installation, the cluster interconnect uses any name you choose at that time. To determine the actual name, use the scha_cluster_get(3HA) command with the scha_privatelink_hostname_node argument.
Both application communication and internal clustering communication are striped over all interconnects. Because applications share the cluster interconnect with internal clustering traffic, the bandwidth available to applications depends on the bandwidth used by other clustering traffic. If a failure occurs, internal traffic and application traffic will stripe over all available interconnects.
Each node is also assigned a fixed pernode address. This pernode address is plumbed on the clprivnet driver. The IP address maps to the private hostname for the node: clusternode1-priv. For information about the Sun Cluster private network driver, see the clprivnet(7) man page.
If your application requires consistent IP addresses at all points, configure the application to bind to the pernode address on both the client and the server. All connections appear then to originate from and return to the pernode address.
Data services utilize several types of resources: applications such as Sun Java System Web Server or Apache Web Server use network addresses (logical hostnames and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.
Data services are resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle-server and Sun Cluster HA for Apache is the resource type SUNW.apache.
A resource is an instantiation of a resource type that is defined cluster wide. Several resource types are defined.
Network resources are either SUNW.LogicalHostname or SUNW.SharedAddress resource types. These two resource types are preregistered by the Sun Cluster software.
The HAStorage and HAStoragePlus resource types are used to synchronize the startup of resources and disk device groups on which the resources depend. These resource types ensure that before a data service starts, the paths to a cluster file system's mount points, global devices, and device group names are available. For more information, see “Synchronizing the Startups Between Resource Groups and Disk Device Groups” in the Data Services Installation and Configuration Guide. The HAStoragePlus resource type became available in Sun Cluster 3.0 5/02 and added another feature, enabling local file systems to be highly available. For more information about this feature, see HAStoragePlus Resource Type.
RGM-managed resources are placed into groups, called resource groups, so that they can be managed as a unit. A resource group is migrated as a unit if a failover or switchover is initiated on the resource group.
When you bring a resource group that contains application resources online, the application is started. The data service start method waits until the application is running before exiting successfully. The determination of when the application is up and running is accomplished the same way the data service fault monitor determines that a data service is serving clients. Refer to the Sun Cluster Data Services Planning and Administration Guide for Solaris OS for more information about this process.
The RGM controls data services (applications) as resources, which are managed by resource type implementations. These implementations are either supplied by Sun or created by a developer with a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers called resource groups. The RGM stops and starts resource groups on selected nodes in response to cluster membership changes.
The RGM acts on resources and resource groups. RGM actions cause resources and resource groups to move between online and offline states. A complete description of the states and settings that can be applied to resources and resource groups is in the section Resource and Resource Group States and Settings.
Refer to Data Service Project Configuration for information about how to launch Solaris projects under RGM control.
An administrator applies static settings to resources and resource groups. These settings can only be changed through administrative actions. The RGM moves resource groups between dynamic “states.” These settings and states are described in the following list.
Managed or unmanaged – These are cluster-wide settings that apply only to resource groups. Resource groups are managed by the RGM. The scrgadm(1M) command can be used to cause the RGM to manage or to unmanage a resource group. These settings do not change with a cluster reconfiguration.
When a resource group is first created, it is unmanaged. A resource group must be managed before any resources placed in the group can become active.
In some data services, for example a scalable web server, work must be done prior to starting network resources and after they are stopped. This work is done by initialization (INIT) and finish (FINI) data service methods. The INIT methods only run if the resource group in which the resources reside is in the managed state.
When a resource group is moved from unmanaged to managed, any registered INIT methods for the group are run on the resources in the group.
When a resource group is moved from managed to unmanaged, any registered FINI methods are called to perform cleanup.
The most common use of INIT and FINI methods is for network resources for scalable services. However, you can use these methods for any initialization or cleanup work that is not performed by the application.
Enabled or disabled – These are cluster-wide settings that apply to resources. The scrgadm(1M) command can be used to enable or disable a resource. These settings do not change with a cluster reconfiguration.
The normal setting for a resource is that it is enabled and actively running in the system.
If you want to make the resource unavailable on all cluster nodes, disable the resource. A disabled resource is not available for general use.
Online or offline – These are dynamic states that apply to both resources and resource groups.
Online and offline states change as the cluster transitions through cluster reconfiguration steps during switchover or failover. You can also change these states through administrative actions. Use the scswitch(1M) command to change the online or offline state of a resource or resource group.
A failover resource or resource group can only be online on one node at any time. A scalable resource or resource group can be online on some nodes and offline on others. During a switchover or failover, resource groups and the resources within them are taken offline on one node and then brought online on another node.
If a resource group is offline, then all its resources are offline. If a resource group is online, then all its enabled resources are online.
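The relationship between these settings and states can be summarized in a small model. The following is a minimal sketch of the steady-state rules described above, not of how the RGM is implemented. The class names, field names, and resource names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    enabled: bool = True  # the normal setting for a resource is enabled


@dataclass
class ResourceGroup:
    name: str
    managed: bool = False  # a newly created resource group is unmanaged
    online: bool = False
    resources: list = field(default_factory=list)

    def bring_online(self):
        # A group must be managed before its resources can become active.
        if not self.managed:
            raise RuntimeError(f"{self.name} is unmanaged")
        self.online = True

    def online_resources(self):
        # Offline group: no resources online.
        # Online group: every enabled resource is online.
        if not self.online:
            return []
        return [r.name for r in self.resources if r.enabled]


# Hypothetical group with one enabled and one disabled resource.
example_rg = ResourceGroup(
    "rg-example",
    managed=True,
    resources=[Resource("web-rs"), Resource("spare-rs", enabled=False)],
)
example_rg.bring_online()
print(example_rg.online_resources())
```

The model covers only the settled state. During a cluster reconfiguration, individual resources in one group can temporarily be in different states.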
Resource groups can contain several resources, with dependencies between resources. These dependencies require that the resources be brought online and offline in a particular order. The methods used to bring resources online and offline might take different amounts of time for each resource. Because of resource dependencies and start and stop time differences, resources within a single resource group can have different online and offline states during a cluster reconfiguration.
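The ordering that dependencies impose can be illustrated with a topological sort. This is a sketch only: the resource names are hypothetical, and the source does not say that the RGM uses this algorithm. The example shows only that a dependent resource starts after, and stops before, the resources it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical resources: each key depends on the resources in its set.
dependencies = {
    "app-rs": {"storage-rs", "hostname-rs"},
    "storage-rs": set(),
    "hostname-rs": set(),
}


def online_order(deps):
    """Order in which every resource follows the resources it depends on."""
    return list(TopologicalSorter(deps).static_order())


def offline_order(deps):
    """Reverse of the online order: dependents are stopped first."""
    return list(reversed(online_order(deps)))


start = online_order(dependencies)
stop = offline_order(dependencies)
print("online: ", start)
print("offline:", stop)
```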
You can configure property values for resources and resource groups for your Sun Cluster data services. Standard properties are common to all data services. Extension properties are specific to each data service. Some standard and extension properties are configured with default settings so that you do not have to modify them. Others need to be set as part of the process of creating and configuring resources. The documentation for each data service specifies which resource properties can be set and how to set them.
The standard properties are used to configure resource and resource group properties that are usually independent of any particular data service. For the set of standard properties, see Appendix A, Standard Properties, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
The RGM extension properties provide information such as the location of application binaries and configuration files. You modify extension properties as you configure your data services. The set of extension properties is described in the individual guide for the data service.
Data services can be configured to launch under a Solaris project name when brought online by using the RGM. The configuration associates a resource or resource group managed by the RGM with a Solaris project ID. The mapping from your resource or resource group to a project ID gives you the ability to use sophisticated controls that are available in the Solaris Operating System to manage workloads and consumption within your cluster.
You can perform this configuration only if you are running the current release of Sun Cluster software with at least Solaris 9.
Using the Solaris management functionality in a Sun Cluster environment enables you to ensure that your most important applications are given priority when sharing a node with other applications. Applications might share a node if you have consolidated services or because applications have failed over. Use of the management functionality described herein might improve availability of a critical application by preventing lower-priority applications from overconsuming system supplies such as CPU time.
The Solaris documentation for this feature describes CPU time, processes, tasks, and similar components as “resources.” Sun Cluster documentation, however, uses the term “resources” to describe entities that are under the control of the RGM. The following section uses the term “resource” to refer to Sun Cluster entities under the control of the RGM, and the term “supplies” to refer to CPU time, processes, and tasks.
This section provides a conceptual description of configuring data services to launch processes in a specified Solaris 9 project(4). This section also describes several failover scenarios and suggestions for planning to use the management functionality provided by the Solaris Operating System.
For detailed conceptual and procedural documentation about the management feature, refer to Chapter 1, Network Service (Overview), in System Administration Guide: Network Services.
When configuring resources and resource groups to use Solaris management functionality in a cluster, use the following high-level process:
1. Configuring applications as part of the resource.
2. Configuring resources as part of a resource group.
3. Enabling resources in the resource group.
4. Making the resource group managed.
5. Creating a Solaris project for your resource group.
6. Configuring standard properties to associate the resource group name with the project you created in step 5.
7. Bringing the resource group online.
To configure the standard Resource_project_name or RG_project_name properties to associate the Solaris project ID with the resource or resource group, use the -y option with the scrgadm(1M) command. Set the property values on the resource or resource group. See Appendix A, Standard Properties, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for property definitions. Refer to r_properties(5) and rg_properties(5) for property descriptions.
The specified project name must exist in the projects database (/etc/project) and the root user must be configured as a member of the named project. Refer to Chapter 2, Projects and Tasks (Overview), in System Administration Guide: Solaris Containers-Resource Management and Solaris Zones for conceptual information about the project name database. Refer to project(4) for a description of project file syntax.
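Both requirements can be checked by reading the project database. The following sketch assumes the colon-separated field layout described in project(4) (name, project ID, comment, user list, group list, attributes). It handles only plain user names and ignores wildcard or negated entries in the user list. The function names are hypothetical, and the sample entry is taken from the examples later in this section.

```python
def parse_project_line(line):
    """Split one project(4) entry into its six colon-separated fields."""
    fields = line.rstrip("\n").split(":", 5)
    if len(fields) != 6:
        raise ValueError(f"malformed project entry: {line!r}")
    name, projid, comment, users, groups, attrs = fields
    return {
        "name": name,
        "projid": int(projid),
        "comment": comment,
        "users": [u for u in users.split(",") if u],
        "groups": [g for g in groups.split(",") if g],
        "attrs": attrs,
    }


def root_is_member(project_text, project_name):
    """True if the project exists and lists root in its user list."""
    for line in project_text.splitlines():
        if not line.strip():
            continue
        entry = parse_project_line(line)
        if entry["name"] == project_name:
            return "root" in entry["users"]
    return False


sample = "Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)\n"
print(root_is_member(sample, "Prj_1"))
```

On a cluster node, the text would normally come from /etc/project, or from the NIS or LDAP source if the project database is stored there.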
When the RGM brings resources or resource groups online, it launches the related processes under the project name.
Users can associate the resource or resource group with a project at any time. However, the new project name is not effective until the resource or resource group is taken offline and brought back online by using the RGM.
Launching resources and resource groups under the project name enables you to configure the following features to manage system supplies across your cluster.
Extended Accounting – Provides a flexible way to record consumption on a task or process basis. Extended accounting enables you to examine historical usage and make assessments of capacity requirements for future workloads.
Controls – Provide a mechanism for constraint on system supplies. Processes, tasks, and projects can be prevented from consuming large amounts of specified system supplies.
Fair Share Scheduling (FSS) – Provides the ability to control the allocation of available CPU time among workloads, based on their importance. Workload importance is expressed by the number of shares of CPU time that you assign to each workload. For more information, refer to the dispadmin(1M), priocntl(1), ps(1), and FSS(7) man pages.
Pools – Provide the ability to use partitions for interactive applications according to the application's requirements. Pools can be used to partition a server that supports a number of different software applications. The use of pools results in a more predictable response for each application.
Before you configure data services to use the controls provided by Solaris in a Sun Cluster environment, you must decide how to control and track resources across switchovers or failovers. Identify dependencies within your cluster before configuring a new project. For example, resources and resource groups depend on disk device groups.
Use the nodelist, failback, maximum_primaries and desired_primaries resource group properties that are configured with scrgadm(1M) to identify nodelist priorities for your resource group.
For a brief discussion of the node list dependencies between resource groups and disk device groups, refer to Relationship Between Resource Groups and Disk Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
For detailed property descriptions, refer to rg_properties(5).
For conceptual information about the preferenced property, see Multiported Disk Device Groups.
For procedural information, see “How To Change Disk Device Properties” in Administering Disk Device Groups in Sun Cluster System Administration Guide for Solaris OS.
For conceptual information about node configuration and the behavior of failover and scalable data services, see Sun Cluster System Hardware and Software Components.
If you configure all cluster nodes identically, usage limits are enforced identically on primary and secondary nodes. The configuration parameters of projects do not need to be identical for all applications in the configuration files on all nodes. All projects that are associated with the application must at least be accessible by the project database on all potential masters of that application. Suppose that Application 1 is mastered by phys-schost-1 but could potentially be switched over or failed over to phys-schost-2 or phys-schost-3. The project that is associated with Application 1 must be accessible on all three nodes (phys-schost-1, phys-schost-2, and phys-schost-3).
Project database information can be a local /etc/project database file or can be stored in the NIS map or the LDAP directory service.
The Solaris Operating System enables flexible configuration of usage parameters, and few restrictions are imposed by Sun Cluster. Configuration choices depend on the needs of the site. Consider the general guidelines in the following sections before configuring your systems.
Set the process.max-address-space control to limit virtual memory on a per-process basis. Refer to rctladm(1M) for detailed information about setting the process.max-address-space value.
When you use management controls with Sun Cluster software, configure memory limits appropriately to prevent unnecessary failover of applications and a “ping-pong” effect of applications. In general, observe the following guidelines.
Do not set memory limits too low.
When an application reaches its memory limit, it might fail over. This guideline is especially important for database applications, for which reaching a virtual memory limit can have unexpected consequences.
Do not set memory limits identically on primary and secondary nodes.
Identical limits can cause a ping-pong effect when an application reaches its memory limit and fails over to a secondary node with an identical memory limit. Set the memory limit slightly higher on the secondary node. The difference in memory limits helps prevent the ping-pong scenario and gives the system administrator a period of time in which to adjust the parameters as necessary.
Do use the resource management memory limits for load-balancing.
For example, you can use memory limits to prevent an errant application from consuming excessive swap space.
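The second guideline can be turned into a simple consistency check. The sketch below assumes that the limit is expressed in each node's project entry as a process.max-address-space attribute of the form (privileged,VALUE,deny). The project name, the byte values, and the function names are hypothetical.

```python
import re

# Assumed attribute form: process.max-address-space=(privileged,<bytes>,deny)
LIMIT = re.compile(r"process\.max-address-space=\(privileged,(\d+),deny\)")


def address_space_limit(project_line):
    """Return the per-process address-space limit in bytes, or None if unset."""
    match = LIMIT.search(project_line)
    return int(match.group(1)) if match else None


def ping_pong_risk(primary_line, secondary_line):
    """True when the secondary limit is not higher than the primary limit."""
    primary = address_space_limit(primary_line)
    secondary = address_space_limit(secondary_line)
    if primary is None or secondary is None:
        return False  # no limit on one side, nothing to compare
    return secondary <= primary


# Hypothetical entries for the same project on two nodes.
on_primary = "Prj_db:200:db:root::process.max-address-space=(privileged,4000000000,deny)"
on_secondary = "Prj_db:200:db:root::process.max-address-space=(privileged,4500000000,deny)"
print(ping_pong_risk(on_primary, on_secondary))
```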
You can configure management parameters so that the allocation in the project configuration (/etc/project) works in normal cluster operation and in switchover or failover situations.
The following sections are example scenarios.
The first two sections, “Two-Node Cluster With Two Applications” and “Two-Node Cluster With Three Applications,” show failover scenarios for entire nodes.
The section “Failover of Resource Group Only” illustrates failover operation for an application only.
In a Sun Cluster environment, you configure an application as part of a resource. You then configure a resource as part of a resource group (RG). When a failure occurs, the resource group, along with its associated applications, fails over to another node. In the following examples the resources are not shown explicitly. Assume that each resource has only one application.
Failover occurs in the preferenced nodelist order that is set in the RGM.
The following examples have these constraints:
Application 1 (App-1) is configured in resource group RG-1.
Application 2 (App-2) is configured in resource group RG-2.
Application 3 (App-3) is configured in resource group RG-3.
Although the numbers of assigned shares remain the same, the percentage of CPU time allocated to each application changes after failover. This percentage depends on the number of applications that are running on the node and the number of shares that are assigned to each active application.
In these scenarios, assume the following configurations.
All applications are configured under a common project.
Each resource has only one application.
The applications are the only active processes on the nodes.
The projects databases are configured the same on each node of the cluster.
You can configure two applications on a two-node cluster to ensure that each physical host (phys-schost-1, phys-schost-2) acts as the default master for one application. Each physical host acts as the secondary node for the other physical host. All projects that are associated with Application 1 and Application 2 must be represented in the projects database files on both nodes. When the cluster is running normally, each application is running on its default master, where it is allocated all CPU time by the management facility.
After a failover or switchover occurs, both applications run on a single node where they are allocated shares as specified in the configuration file. For example, this entry in the /etc/project file specifies that Application 1 is allocated 4 shares and Application 2 is allocated 1 share.
Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)
Prj_2:101:project for App-2:root::project.cpu-shares=(privileged,1,none)
The following diagram illustrates the normal and failover operations of this configuration. The number of shares that are assigned does not change. However, the percentage of CPU time available to each application can change. The percentage depends on the number of shares that are assigned to each process that demands CPU time.
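The percentages for this configuration follow directly from the share counts. The sketch below reads the shares from the two entries shown above and assumes, as the scenario does, that the listed applications are the only processes demanding CPU time on the node. The function names are hypothetical.

```python
import re

PROJECT_FILE = """\
Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)
Prj_2:101:project for App-2:root::project.cpu-shares=(privileged,1,none)
"""

SHARES = re.compile(r"project\.cpu-shares=\(privileged,(\d+),none\)")


def cpu_shares(project_text):
    """Map each project name to its project.cpu-shares value."""
    shares = {}
    for line in project_text.splitlines():
        fields = line.split(":", 5)
        match = SHARES.search(fields[5]) if len(fields) == 6 else None
        if match:
            shares[fields[0]] = int(match.group(1))
    return shares


def cpu_percent(shares, active):
    """Percentage of CPU time for each active project on one node."""
    total = sum(shares[p] for p in active)
    return {p: 100 * shares[p] / total for p in active}


shares = cpu_shares(PROJECT_FILE)
normal = cpu_percent(shares, ["Prj_1"])  # alone on its default master
after_failover = cpu_percent(shares, ["Prj_1", "Prj_2"])
print(normal, after_failover)
```

With 4 shares and 1 share on the same node, the two projects receive 80 percent and 20 percent of CPU time. Each receives 100 percent when it runs alone.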
On a two-node cluster with three applications, you can configure one physical host (phys-schost-1) as the default master of one application. You can configure the second physical host (phys-schost-2) as the default master for the remaining two applications. Assume the following example projects database file on every node. The projects database file does not change when a failover or switchover occurs.
Prj_1:103:project for App-1:root::project.cpu-shares=(privileged,5,none)
Prj_2:104:project for App-2:root::project.cpu-shares=(privileged,3,none)
Prj_3:105:project for App-3:root::project.cpu-shares=(privileged,2,none)
When the cluster is running normally, Application 1 is allocated 5 shares on its default master, phys-schost-1. This number is equivalent to 100 percent of CPU time because it is the only application that demands CPU time on that node. Applications 2 and 3 are allocated 3 and 2 shares, respectively, on their default master, phys-schost-2. Application 2 would receive 60 percent of CPU time and Application 3 would receive 40 percent of CPU time during normal operation.
If a failover or switchover occurs and Application 1 is switched over to phys-schost-2, the shares for all three applications remain the same. However, the percentages of CPU time are reallocated according to the projects database file.
Application 1, with 5 shares, receives 50 percent of CPU.
Application 2, with 3 shares, receives 30 percent of CPU.
Application 3, with 2 shares, receives 20 percent of CPU.
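The figures above can be reproduced by dividing each application's shares by the total shares of the applications that run on the same node. This is a sketch of the arithmetic only. The dictionary layout and function name are hypothetical, and the calculation assumes that every listed application is demanding CPU time.

```python
SHARES = {"App-1": 5, "App-2": 3, "App-3": 2}


def cpu_percent_by_node(placement, shares):
    """For each node, the CPU percentage of every application it hosts."""
    result = {}
    for node, apps in placement.items():
        total = sum(shares[app] for app in apps)
        result[node] = {app: 100 * shares[app] / total for app in apps}
    return result


normal = cpu_percent_by_node(
    {"phys-schost-1": ["App-1"], "phys-schost-2": ["App-2", "App-3"]},
    SHARES,
)
failover = cpu_percent_by_node(
    {"phys-schost-2": ["App-1", "App-2", "App-3"]},
    SHARES,
)
print(normal)
print(failover)
```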
The following diagram illustrates the normal operations and failover operations of this configuration.
In a configuration in which multiple resource groups have the same default master, a resource group (and its associated applications) can fail over or be switched over to a secondary node. Meanwhile, the default master is running in the cluster.
During failover, the application that fails over is allocated resources as specified in the configuration file on the secondary node. In this example, the project database files on the primary and secondary nodes have the same configurations.
For example, this sample configuration file specifies that Application 1 is allocated 1 share, Application 2 is allocated 2 shares, and Application 3 is allocated 2 shares.
Prj_1:106:project for App_1:root::project.cpu-shares=(privileged,1,none)
Prj_2:107:project for App_2:root::project.cpu-shares=(privileged,2,none)
Prj_3:108:project for App_3:root::project.cpu-shares=(privileged,2,none)
The following diagram illustrates the normal and failover operations of this configuration, where RG-2, containing Application 2, fails over to phys-schost-2. Note that the number of shares assigned does not change. However, the percentage of CPU time available to each application can change, depending on the number of shares that are assigned to each application that demands CPU time.
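The effect of moving a single resource group can be worked through numerically. The text gives the share counts but not the initial placement, so the sketch below assumes that RG-1 (App-1) and RG-2 (App-2) share phys-schost-1 as default master and that RG-3 (App-3) runs on phys-schost-2. That placement and the helper names are assumptions.

```python
SHARES = {"App-1": 1, "App-2": 2, "App-3": 2}

# Assumed placement during normal operation (not stated in the text).
NORMAL = {"phys-schost-1": ["App-1", "App-2"], "phys-schost-2": ["App-3"]}


def move(placement, app, target):
    """Return a new placement with one application relocated to target."""
    moved = {node: [a for a in apps if a != app] for node, apps in placement.items()}
    moved.setdefault(target, []).append(app)
    return moved


def cpu_percent(placement, shares):
    """CPU percentage per application on each node, rounded to one decimal."""
    result = {}
    for node, apps in placement.items():
        total = sum(shares[a] for a in apps)
        result[node] = {a: round(100 * shares[a] / total, 1) for a in apps}
    return result


before = cpu_percent(NORMAL, SHARES)
after = cpu_percent(move(NORMAL, "App-2", "phys-schost-2"), SHARES)
print(before)
print(after)
```

Under these assumptions, App-1 rises from about 33 percent to 100 percent of CPU time on phys-schost-1, while App-2 and App-3 each receive 50 percent on phys-schost-2.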
Clients make data requests to the cluster through the public network. Each cluster node is connected to at least one public network through a pair of public network adapters.
Solaris Internet Protocol (IP) Network Multipathing software on Sun Cluster provides the basic mechanism for monitoring public network adapters and failing over IP addresses from one adapter to another when a fault is detected. Each cluster node has its own Internet Protocol (IP) Network Multipathing configuration, which can be different from the configuration on other cluster nodes.
Public network adapters are organized into IP multipathing groups (multipathing groups). Each multipathing group has one or more public network adapters. Each adapter in a multipathing group can be active. Alternatively, you can configure standby interfaces that are inactive unless a failover occurs.
The in.mpathd multipathing daemon uses a test IP address to detect failures and repairs. If the multipathing daemon detects a fault on one of the adapters, a failover occurs. All network access fails over from the faulted adapter to another functional adapter in the multipathing group. Therefore, the daemon maintains public network connectivity for the node. If you configured a standby interface, the daemon chooses the standby interface. Otherwise, the daemon chooses the interface with the fewest IP addresses. Because the failover occurs at the adapter interface level, higher-level connections such as TCP are not affected, except for a brief transient delay during the failover. When the failover of IP addresses completes successfully, ARP broadcasts are sent. Therefore, the daemon maintains connectivity to remote clients.
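The target selection rule described above (a standby interface if one is configured, otherwise the interface with the fewest IP addresses) can be sketched as follows. This illustrates the rule only and is not the in.mpathd implementation. The interface names, addresses, and dictionary layout are hypothetical.

```python
def choose_failover_target(group, failed):
    """Pick the interface that takes over the addresses of the failed one."""
    candidates = [
        i for i in group if i["name"] != failed and i["functional"]
    ]
    if not candidates:
        return None  # no functional adapter left in the group
    standbys = [i for i in candidates if i["standby"]]
    if standbys:
        return standbys[0]["name"]
    # No standby: prefer the interface hosting the fewest IP addresses.
    return min(candidates, key=lambda i: len(i["addresses"]))["name"]


# Hypothetical multipathing group with three active adapters.
group = [
    {"name": "qfe0", "functional": False, "standby": False,
     "addresses": ["192.168.10.11"]},
    {"name": "qfe1", "functional": True, "standby": False,
     "addresses": ["192.168.10.12", "192.168.10.20"]},
    {"name": "qfe2", "functional": True, "standby": False,
     "addresses": ["192.168.10.13"]},
]
print(choose_failover_target(group, "qfe0"))
```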
Because of the congestion recovery characteristics of TCP, TCP endpoints can experience further delay after a successful failover. Some segments might have been lost during the failover, activating the congestion control mechanism in TCP.
Multipathing groups provide the building blocks for logical hostname and shared address resources. You can also create multipathing groups independently of logical hostname and shared address resources to monitor public network connectivity of cluster nodes. The same multipathing group on a node can host any number of logical hostname or shared address resources. For more information about logical hostname and shared address resources, see the Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
The design of the Internet Protocol (IP) Network Multipathing mechanism is meant to detect and mask adapter failures. The design is not intended to recover from an administrator's use of ifconfig(1M) to remove one of the logical (or shared) IP addresses. The Sun Cluster software views the logical and shared IP addresses as resources that are managed by the RGM. The correct way for an administrator to add or remove an IP address is to use scrgadm(1M) to modify the resource group that contains the resource.
For more information about the Solaris implementation of IP Network Multipathing, see the appropriate documentation for the Solaris Operating System that is installed on your cluster.
Operating System Release
Solaris 8 Operating System
Solaris 9 Operating System
Solaris 10 Operating System
Sun Cluster 3.1 8/05 support for the dynamic reconfiguration (DR) software feature is being developed in incremental phases. This section describes concepts and considerations for Sun Cluster 3.1 8/05 support of the DR feature.
All the requirements, procedures, and restrictions that are documented for the Solaris DR feature also apply to Sun Cluster DR support (except for the operating environment quiescence operation). Therefore, review the documentation for the Solaris DR feature before using the DR feature with Sun Cluster software. In particular, review the issues that affect nonnetwork IO devices during a DR detach operation.
The Sun Enterprise 10000 Dynamic Reconfiguration User Guide and the Sun Enterprise 10000 Dynamic Reconfiguration Reference Manual (from the Solaris 8 on Sun Hardware or Solaris 9 on Sun Hardware collections) are both available for download from http://docs.sun.com.
The DR feature enables operations, such as the removal of system hardware, in running systems. The DR processes are designed to ensure continuous system operation with no need to halt the system or interrupt cluster availability.
DR operates at the board level. Therefore, a DR operation affects all the components on a board. Each board can contain multiple components, including CPUs, memory, and peripheral interfaces for disk drives, tape drives, and network connections.
Removing a board that contains active components would result in system errors. Before removing a board, the DR subsystem queries other subsystems, such as Sun Cluster, to determine whether the components on the board are being used. If the DR subsystem finds that a board is in use, the DR remove-board operation is not done. Therefore, it is always safe to issue a DR remove-board operation because the DR subsystem rejects operations on boards that contain active components.
The DR add-board operation is also always safe. CPUs and memory on a newly added board are automatically brought into service by the system. However, the system administrator must manually configure the cluster to actively use components that are on the newly added board.
The DR subsystem has several levels. If a lower level reports an error, the upper level also reports an error. However, when the lower level reports the specific error, the upper level reports Unknown error. You can safely ignore this error.
The following sections describe DR considerations for the different device types.
Sun Cluster software does not reject a DR remove-board operation because of the presence of CPU devices.
When a DR add-board operation succeeds, CPU devices on the added board are automatically incorporated in system operation.
For the purposes of DR, consider two types of memory.
kernel memory cage
non-kernel memory cage
These two types differ only in usage. The actual hardware is the same for both types. Kernel memory cage is the memory that is used by the Solaris Operating System. Sun Cluster software does not support remove-board operations on a board that contains the kernel memory cage and rejects any such operation. When a DR remove-board operation pertains to memory other than the kernel memory cage, Sun Cluster software does not reject the operation. When a DR add-board operation that pertains to memory succeeds, memory on the added board is automatically incorporated in system operation.
Sun Cluster rejects DR remove-board operations on active drives in the primary node. DR remove-board operations can be performed on inactive drives in the primary node and on any drives in the secondary node. After the DR operation, cluster data access continues as before.
Sun Cluster rejects DR operations that impact the availability of quorum devices. For considerations about quorum devices and the procedure for performing DR operations on them, see SPARC: DR Clustering Considerations for Quorum Devices.
See Dynamic Reconfiguration With Quorum Devices in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform these actions.
If the DR remove-board operation pertains to a board that contains an interface to a device configured for quorum, Sun Cluster software rejects the operation. Sun Cluster software also identifies the quorum device that would be affected by the operation. You must disable the device as a quorum device before you can perform a DR remove-board operation.
See Chapter 5, Administering Quorum, in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to administer quorum.
If the DR remove-board operation pertains to a board containing an active cluster interconnect interface, Sun Cluster software rejects the operation. Sun Cluster software also identifies the interface that would be affected by the operation. You must use a Sun Cluster administrative tool to disable the active interface before the DR operation can succeed.
Sun Cluster software requires each cluster node to have at least one functioning path to every other cluster node. Do not disable a private interconnect interface that supports the last path to any cluster node.
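The last-path rule can be expressed as a check to run before disabling an interface. This is a minimal sketch from the point of view of one node. The mapping of peer nodes to local interconnect interfaces, the interface names, and the function name are hypothetical.

```python
def safe_to_disable(paths, interface):
    """True if every peer node stays reachable without the given interface.

    paths maps each peer node to the set of local interconnect
    interfaces that currently provide a functioning path to it.
    """
    return all(len(ifaces - {interface}) >= 1 for ifaces in paths.values())


# Hypothetical view from phys-schost-1.
paths = {
    "phys-schost-2": {"qfe0", "qfe1"},
    "phys-schost-3": {"qfe1"},
}
print(safe_to_disable(paths, "qfe0"))
print(safe_to_disable(paths, "qfe1"))
```

In this example, qfe1 supports the last path to phys-schost-3, so only qfe0 can be disabled.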
See Administering the Cluster Interconnects in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform these actions.
If the DR remove-board operation pertains to a board that contains an active public network interface, Sun Cluster software rejects the operation. Sun Cluster software also identifies the interface that would be affected by the operation. Before you remove a board with an active network interface present, switch over all traffic on that interface to another functional interface in the multipathing group by using the if_mpadm(1M) command.
If the remaining network adapter fails while you are performing the DR remove operation on the disabled network adapter, availability is impacted. The remaining adapter has no place to fail over for the duration of the DR operation.
See Administering the Public Network in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform a DR remove operation on a public network interface.
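The remove-board rules for the different device types can be collected into one decision sketch. The code below only restates the conditions described in the preceding sections. It is not how Sun Cluster or the DR subsystem is implemented, and the class name, field names, and reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Board:
    has_cpus: bool = False  # CPU devices never cause a rejection
    has_kernel_memory_cage: bool = False
    has_active_primary_drives: bool = False
    has_quorum_device_interface: bool = False
    has_active_interconnect: bool = False
    has_active_public_interface: bool = False


def remove_board_rejections(board):
    """Return the reasons a DR remove-board operation would be rejected."""
    checks = [
        (board.has_kernel_memory_cage, "board contains the kernel memory cage"),
        (board.has_active_primary_drives, "active drives in the primary node"),
        (board.has_quorum_device_interface, "interface to a configured quorum device"),
        (board.has_active_interconnect, "active cluster interconnect interface"),
        (board.has_active_public_interface, "active public network interface"),
    ]
    return [reason for present, reason in checks if present]


cpu_only = Board(has_cpus=True)
busy = Board(has_kernel_memory_cage=True, has_active_interconnect=True)
print(remove_board_rejections(cpu_only))
print(remove_board_rejections(busy))
```

An empty list means that none of the conditions described here would block the operation.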