Sun Cluster Concepts Guide for Solaris OS

Chapter 3 Key Concepts for System Administrators and Application Developers

This chapter describes the key concepts that are related to the software components of the Sun Cluster environment. The information in this chapter is directed primarily to system administrators and application developers who use the Sun Cluster API and SDK. Cluster administrators can use this information in preparation for installing, configuring, and administering cluster software. Application developers can use the information to understand the cluster environment in which they work.

This chapter covers the following topics:

Administrative Interfaces

You can choose how you install, configure, and administer the Sun Cluster software from several user interfaces. You can accomplish system administration tasks either through the Sun Cluster Manager graphical user interface (GUI) or through the command-line interface. On top of the command-line interface are some utilities, such as scinstall and clsetup, to simplify selected installation and configuration tasks. The Sun Cluster software also has a module that runs as part of Sun Management Center and provides a GUI for particular cluster tasks. This module is available for use only in SPARC based clusters. Refer to Administration Tools in Sun Cluster System Administration Guide for Solaris OS for complete descriptions of the administrative interfaces.

Cluster Time

Time between all Solaris hosts in a cluster must be synchronized. Whether you synchronize the cluster hosts with any outside time source is not important to cluster operation. The Sun Cluster software employs the Network Time Protocol (NTP) to synchronize the clocks between hosts.

In general, a change in the system clock of a fraction of a second causes no problems. However, if you run date, rdate, or xntpdate (interactively, or within cron scripts) on an active cluster, you can force a time change much larger than a fraction of a second to synchronize the system clock to the time source. This forced change might cause problems with file modification timestamps or confuse the NTP service.

When you install the Solaris Operating System on each cluster host, you have an opportunity to change the default time and date setting for the host. In general, you can accept the factory default.

When you install Sun Cluster software by using the scinstall command, one step in the process is to configure NTP for the cluster. Sun Cluster software supplies a template file, ntp.cluster (see /etc/inet/ntp.cluster on an installed cluster host), that establishes a peer relationship between all cluster hosts. One host is designated the “preferred” host. Hosts are identified by their private host names and time synchronization occurs across the cluster interconnect. For instructions about how to configure the cluster for NTP, see Chapter 2, Installing Software on Global-Cluster Nodes, in Sun Cluster Software Installation Guide for Solaris OS.

Alternatively, you can set up one or more NTP servers outside the cluster and change the ntp.conf file to reflect that configuration.

In normal operation, you should never need to adjust the time on the cluster. However, if the time was set incorrectly when you installed the Solaris Operating System and you want to change it, the procedure for doing so is included in Chapter 9, Administering the Cluster, in Sun Cluster System Administration Guide for Solaris OS.

High-Availability Framework

The Sun Cluster software makes all components on the “path” between users and data highly available, including network interfaces, the applications themselves, the file system, and the multihost devices. In general, a cluster component is highly available if it survives any single (software or hardware) failure in the system.

The following table shows the kinds of Sun Cluster component failures (both hardware and software) and the kinds of recovery that are built into the high-availability framework.

Table 3–1 Levels of Sun Cluster Failure Detection and Recovery

Failed Cluster Component | Software Recovery | Hardware Recovery
Data service | HA API, HA framework | Not applicable
Public network adapter | IP network multipathing | Multiple public network adapter cards
Cluster file system | Primary and secondary replicas | Multihost devices
Mirrored multihost device | Volume management (Solaris Volume Manager and Veritas Volume Manager) | Hardware RAID-5 (for example, Sun StorEdge™ A3x00)
Global device | Primary and secondary replicas | Multiple paths to the device, cluster transport junctions
Private network | HA transport software | Multiple private hardware-independent networks
Host | CMM, failfast driver | Multiple hosts
Zone | HA API, HA framework | Not applicable

Sun Cluster software's high-availability framework detects a node failure quickly and creates a new equivalent server for the framework resources on a remaining node in the cluster. At no time are all framework resources unavailable. Framework resources that are unaffected by a failed node are fully available during recovery. Furthermore, framework resources of the failed node become available as soon as they are recovered. A recovered framework resource does not have to wait for all other framework resources to complete their recovery.

Most highly available framework resources are recovered transparently to the applications (data services) that are using the resource. The semantics of framework resource access are fully preserved across node failure. The applications cannot detect that the framework resource server has been moved to another node. Failure of a single node is completely transparent to programs on the remaining nodes that use the files, devices, and disk volumes attached to the failed node. This transparency exists if an alternative hardware path exists to the disks from another host. An example is the use of multihost devices that have ports to multiple hosts.

Zone Membership

Sun Cluster software also tracks zone membership by detecting when a zone boots up or halts. These changes also trigger a reconfiguration. A reconfiguration can redistribute cluster resources among the nodes in the cluster.

Cluster Membership Monitor

To ensure that data is kept safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the Cluster Membership Monitor (CMM) coordinates a cluster reconfiguration of cluster services (applications) in response to a failure.

The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration.

After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster. In a synchronized configuration, cluster resources might be redistributed, based on the new membership of the cluster.

Failfast Mechanism

The failfast mechanism detects a critical problem on either a global-cluster voting node or global-cluster non-voting node. The action that Sun Cluster takes when failfast detects a problem depends on whether the problem occurs in a voting node or a non-voting node.

If the critical problem is located in a voting node, Sun Cluster forcibly shuts down the node. Sun Cluster then removes the node from cluster membership.

If the critical problem is located in a non-voting node, Sun Cluster reboots that non-voting node.

If a node loses connectivity with other nodes, the node attempts to form a cluster with the nodes with which communication is possible. If that set of nodes does not form a quorum, Sun Cluster software halts the node and “fences” the node from the shared disks, that is, prevents the node from accessing the shared disks.

You can turn off fencing for selected disks or for all disks.


Caution –

If you turn off fencing under the wrong circumstances, your data can be vulnerable to corruption during application failover. Examine this data corruption possibility carefully when you are considering turning off fencing. If your shared storage device does not support the SCSI protocol, such as a Serial Advanced Technology Attachment (SATA) disk, or if you want to allow access to the cluster's storage from hosts outside the cluster, turn off fencing.


If one or more cluster-specific daemons die, Sun Cluster software declares that a critical problem has occurred. Sun Cluster software runs cluster-specific daemons on both voting nodes and non-voting nodes. If a critical problem occurs, Sun Cluster either shuts down and removes the node or reboots the non-voting node where the problem occurred.

When a cluster-specific daemon that runs on a non-voting node fails, a message similar to the following is displayed on the console.


cl_runtime: NOTICE: Failfast: Aborting because "pmfd" died in zone "zone4" (zone id 3)
35 seconds ago.

When a cluster-specific daemon that runs on a voting node fails and the node panics, a message similar to the following is displayed on the console.


panic[cpu1]/thread=2a10007fcc0: Failfast: Aborting because "pmfd" died in zone "global" (zone id 0)
35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0

After the panic, the Solaris host might reboot and the node might attempt to rejoin the cluster. Alternatively, if the cluster is composed of SPARC based systems, the host might remain at the OpenBoot PROM (OBP) prompt. The next action of the host is determined by the setting of the auto-boot? parameter. You can set auto-boot? with the eeprom command from the Solaris shell, or with the setenv command at the OpenBoot PROM ok prompt. See the eeprom(1M) man page.

Cluster Configuration Repository (CCR)

The CCR uses a two-phase commit algorithm for updates: An update must be successfully completed on all cluster members or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.
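The all-or-nothing rule can be illustrated with a minimal Python sketch. This is not Sun Cluster code and does not describe the actual CCR protocol: the Member class, its prepare, commit, and rollback methods, and the two_phase_update function are hypothetical names that are used only to show the idea of a two-phase commit.

```python
class Member:
    """Hypothetical cluster member that holds one copy of a configuration table."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.table = {}      # committed configuration data
        self.staged = None   # update that was staged during phase 1

    def prepare(self, update):
        # Phase 1: stage the update. An unhealthy member votes "no".
        if not self.healthy:
            return False
        self.staged = dict(update)
        return True

    def commit(self):
        # Phase 2: make the staged update permanent.
        self.table.update(self.staged)
        self.staged = None

    def rollback(self):
        # Discard anything that was staged.
        self.staged = None


def two_phase_update(members, update):
    """Apply the update on every member, or on none of them."""
    if all(m.prepare(update) for m in members):
        for m in members:
            m.commit()
        return True
    for m in members:
        m.rollback()
    return False
```

In this model, a single member that cannot stage the update causes every other member to discard its staged copy, so no member ends up with a table that differs from the others.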


Caution –

Although the CCR consists of text files, never edit the CCR files yourself. Each file contains a checksum record to ensure consistency between nodes. Updating CCR files yourself can cause a node or the entire cluster to stop working.


The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.

Global Devices

The Sun Cluster software uses global devices to provide cluster-wide, highly available access to any device in a cluster, from any node, without regard to where the device is physically attached. In general, if a node fails while providing access to a global device, the Sun Cluster software automatically discovers another path to the device. The Sun Cluster software then redirects the access to that path. Sun Cluster global devices include disks, CD-ROMs, and tapes. However, the only multiported global devices that Sun Cluster software supports are disks. Consequently, CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.

The cluster automatically assigns unique IDs to each disk, CD-ROM, and tape device in the cluster. This assignment enables consistent access to each device from any node in the cluster. The global device namespace is held in the /dev/global directory. See Global Namespace for more information.

Multiported global devices provide more than one path to a device. Because multihost disks are part of a device group that is hosted by more than one Solaris host, the multihost disks are made highly available.

Device IDs and DID Pseudo Driver

The Sun Cluster software manages shared devices through a construct known as the DID pseudo driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.

The DID pseudo driver is an integral part of the shared device access feature of the cluster. The DID driver probes all nodes of the cluster, builds a list of unique devices, and assigns each device a unique major and minor number that are consistent on all nodes of the cluster. Access to shared devices is performed by using the normalized DID logical name, instead of the traditional Solaris logical name, such as c0t0d0 for a disk.

This approach ensures that any application that accesses disks (such as a volume manager or applications that use raw devices) uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from Solaris host to Solaris host, thus changing the Solaris device naming conventions as well. For example, Host1 might identify a multihost disk as c1t2d0, and Host2 might identify the same disk completely differently, as c3t2d0. The DID framework assigns a common (normalized) logical name, such as d10, that the hosts use instead, giving each host a consistent mapping to the multihost disk.
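The normalization idea can be sketched in a few lines of Python. The sketch assumes that each physical device exposes some unique identifier (called a serial here) that every attached host can read. The assign_did_names function, the serial strings, and the numbering order are all hypothetical and are not taken from the DID driver.

```python
def assign_did_names(host_views):
    """Give each physical device one DID-style name that all hosts share.

    host_views maps host name to {local_name: device_serial}. The serial
    stands in for whatever unique identifier the real driver uses (assumption).
    Returns ({serial: did_name}, {host: {local_name: did_name}}).
    """
    did_by_serial = {}
    per_host = {}
    for host in sorted(host_views):
        per_host[host] = {}
        for local_name, serial in sorted(host_views[host].items()):
            if serial not in did_by_serial:
                # First time this device is seen: allocate the next name.
                did_by_serial[serial] = f"d{len(did_by_serial) + 1}"
            per_host[host][local_name] = did_by_serial[serial]
    return did_by_serial, per_host


views = {
    "Host1": {"c0t0d0": "LOCAL-A", "c1t2d0": "SHARED-X"},
    "Host2": {"c0t0d0": "LOCAL-B", "c3t2d0": "SHARED-X"},
}
_, mapping = assign_did_names(views)
# Host1 sees the shared disk as c1t2d0 and Host2 sees it as c3t2d0,
# but both local names resolve to the same normalized name.
same_name = mapping["Host1"]["c1t2d0"] == mapping["Host2"]["c3t2d0"]
```

The local disks, which have the same Solaris logical name (c0t0d0) on both hosts, receive different normalized names because they are different physical devices.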

You update and administer device IDs with the cldevice command. See the cldevice(1CL) man page.

Device Groups

In the Sun Cluster software, all multihost devices must be under control of the Sun Cluster software. You first create volume manager disk groups, either Solaris Volume Manager disk sets or Veritas Volume Manager disk groups, on the multihost disks. Then, you register the volume manager disk groups as device groups. A device group is a type of global device. In addition, the Sun Cluster software automatically creates a raw device group for each disk and tape device in the cluster. However, these cluster device groups remain in an offline state until you access them as global devices.

Registration provides the Sun Cluster software information about which Solaris hosts have a path to specific volume manager disk groups. At this point, the volume manager disk groups become globally accessible within the cluster. If more than one host can write to (master) a device group, the data stored in that device group becomes highly available. The highly available device group can be used to contain cluster file systems.


Note –

Device groups are independent of resource groups. One node can master a resource group (representing a group of data service processes). Another node can master the disk groups that are being accessed by the data services. However, the best practice is to keep on the same node the device group that stores a particular application's data and the resource group that contains the application's resources (the application daemon). Refer to Relationship Between Resource Groups and Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for more information about the association between device groups and resource groups.


When a node uses a device group, the volume manager disk group becomes “global” because it provides multipath support to the underlying disks. Each cluster host that is physically attached to the multihost disks provides a path to the device group.

Device Group Failover

Because a disk enclosure is connected to more than one Solaris host, all device groups in that enclosure are accessible through an alternate path if the host currently mastering the device group fails. The failure of the host that is mastering the device group does not affect access to the device group except for the time it takes to perform the recovery and consistency checks. During this time, all requests are blocked (transparently to the application) until the system makes the device group available.

Figure 3–1 Device Group Before and After Failover

Multiported Device Groups

This section describes device group properties that enable you to balance performance and availability in a multiported disk configuration. Sun Cluster software provides two properties that configure a multiported disk configuration: preferenced and numsecondaries. Use the preferenced property to control the order in which nodes attempt to assume control if a failover occurs. Use the numsecondaries property to set the number of secondary nodes that you want for a device group.

A highly available service is considered down when the primary node fails and when no eligible secondary nodes can be promoted to primary nodes. If service failover occurs and the preferenced property is true, then the nodes follow the order in the node list to select a secondary node. The node list defines the order in which nodes attempt to assume primary control or transition from spare to secondary. You can dynamically change the preference of a device service by using the clsetup command. The preference that is associated with dependent service providers, for example a global file system, is identical to the preference of the device service.

Secondary nodes are checkpointed by the primary node during normal operation. In a multiported disk configuration, checkpointing each secondary node causes cluster performance degradation and memory overhead. Spare node support was implemented to minimize the performance degradation and memory overhead that checkpointing caused. By default, your device group has one primary and one secondary. The remaining available provider nodes become spares. If failover occurs, the secondary becomes primary and the spare node that is highest in priority on the node list becomes secondary.

You can set the number of secondary nodes that you want to any integer between one and the number of operational nonprimary provider nodes in the device group.


Note –

If you are using Solaris Volume Manager, you must create the device group before you can set the numsecondaries property to a number other than the default.


The default number of secondaries for device services is 1. The actual number of secondary providers that is maintained by the replica framework is the number that you want, unless the number of operational nonprimary providers is less than the number that you want. You must alter the numsecondaries property and double-check the node list if you are adding or removing nodes from your configuration. Maintaining the node list and number of secondaries prevents conflict between the configured number of secondaries and the actual number that is allowed by the framework.
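The interaction between the node list, the secondaries, and the spares during a failover can be modeled with a short Python sketch. The fail_over function and its arguments are hypothetical. The sketch also assumes that the first secondary in the list is the one that is promoted, which the text does not specify for device groups that have more than one secondary.

```python
def fail_over(primary, secondaries, spares, node_list, numsecondaries=1):
    """Model the failure of the primary node of a preferenced device group.

    node_list is the preference order. Returns the new
    (primary, secondaries, spares) tuple. The failed primary is dropped.
    """
    if not secondaries:
        # No eligible secondary can be promoted: the service is down.
        return None, [], list(spares)
    new_primary = secondaries[0]
    new_secondaries = list(secondaries[1:])
    # Spares are promoted to secondary in node-list order until the
    # wanted number of secondaries is reached or no spares remain.
    remaining = sorted(spares, key=node_list.index)
    while len(new_secondaries) < numsecondaries and remaining:
        new_secondaries.append(remaining.pop(0))
    return new_primary, new_secondaries, remaining


node_list = ["phys-schost-1", "phys-schost-2", "phys-schost-3", "phys-schost-4"]
state = fail_over("phys-schost-1", ["phys-schost-2"],
                  ["phys-schost-4", "phys-schost-3"], node_list)
```

After this call, phys-schost-2 is the primary, phys-schost-3 is the secondary because it precedes phys-schost-4 on the node list, and phys-schost-4 remains a spare. When fewer operational nonprimary nodes remain than numsecondaries asks for, the model keeps as many secondaries as it can, which mirrors the rule about the actual number of secondaries.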

Global Namespace

The Sun Cluster software mechanism that enables global devices is the global namespace. The global namespace includes the /dev/global/ hierarchy as well as the volume manager namespaces. The global namespace reflects both multihost disks and local disks (and any other cluster device, such as CD-ROMs and tapes), and provides multiple failover paths to the multihost disks. Each Solaris host that is physically connected to multihost disks provides a path to the storage for any node in the cluster.

Normally, for Solaris Volume Manager, the volume manager namespaces are located in the /dev/md/diskset/dsk (and rdsk) directories. For Veritas VxVM, the volume manager namespaces are located in the /dev/vx/dsk/disk-group and /dev/vx/rdsk/disk-group directories. These namespaces consist of directories for each Solaris Volume Manager disk set and each VxVM disk group imported throughout the cluster, respectively. Each of these directories contains a device node for each metadevice or volume in that disk set or disk group.

In the Sun Cluster software, each device node in the local volume manager namespace is replaced by a symbolic link to a device node in the /global/.devices/node@nodeID file system. nodeID is an integer that represents a node in the cluster. Sun Cluster software continues to present the volume manager devices, as symbolic links, in their standard locations as well. Both the global namespace and standard volume manager namespace are available from any cluster node.

The advantages of the global namespace include the following:

Local and Global Namespaces Example

The following table shows the mappings between the local and global namespaces for a multihost disk, c0t0d0s0.

Table 3–2 Local and Global Namespace Mappings

Component or Path | Local Host Namespace | Global Namespace
Solaris logical name | /dev/dsk/c0t0d0s0 | /global/.devices/node@nodeID/dev/dsk/c0t0d0s0
DID name | /dev/did/dsk/d0s0 | /global/.devices/node@nodeID/dev/did/dsk/d0s0
Solaris Volume Manager | /dev/md/diskset/dsk/d0 | /global/.devices/node@nodeID/dev/md/diskset/dsk/d0
Veritas Volume Manager | /dev/vx/dsk/disk-group/v0 | /global/.devices/node@nodeID/dev/vx/dsk/disk-group/v0

The global namespace is automatically generated on installation and updated with every reconfiguration reboot. You can also generate the global namespace by using the cldevice command. See the cldevice(1CL) man page.
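The mapping pattern in Table 3–2 is regular enough to express as a small Python helper. The global_path function is a hypothetical illustration of the naming pattern only. It does not create or query any device, and the node ID value in the example is made up.

```python
def global_path(local_path, node_id):
    """Build the /global/.devices path that corresponds to a local device path."""
    if not local_path.startswith("/dev/"):
        raise ValueError(f"not a device path: {local_path!r}")
    # The local path is appended unchanged below the per-node directory.
    return f"/global/.devices/node@{node_id}{local_path}"


did_global = global_path("/dev/did/dsk/d0s0", 1)
```

For node ID 1, the DID name /dev/did/dsk/d0s0 maps to /global/.devices/node@1/dev/did/dsk/d0s0, which matches the DID name row of the table with nodeID replaced by 1.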

Cluster File Systems

The cluster file system has the following features:

You can mount a file system on a global device globally with mount -g or locally with mount.

Programs can access a file in a cluster file system from any node in the cluster through the same file name (for example, /global/foo).

A cluster file system is mounted on all cluster members. You cannot mount a cluster file system on a subset of cluster members.

A cluster file system is not a distinct file system type. Clients verify the underlying file system (for example, UFS).

Using Cluster File Systems

In the Sun Cluster software, all multihost disks are placed into device groups, which can be Solaris Volume Manager disk sets, VxVM disk groups, raw-disk groups, or individual disks that are not under control of a software-based volume manager.

For a cluster file system to be highly available, the underlying disk storage must be connected to more than one Solaris host. Therefore, a local file system (a file system that is stored on a host's local disk) that is made into a cluster file system is not highly available.

You can mount cluster file systems as you would mount file systems:


Note –

While Sun Cluster software does not impose a naming policy for cluster file systems, you can ease administration by creating a mount point for all cluster file systems under the same directory, such as /global/disk-group. See Sun Cluster 3.1 9/04 Software Collection for Solaris OS (SPARC Platform Edition) and Sun Cluster System Administration Guide for Solaris OS for more information.


HAStoragePlus Resource Type

The HAStoragePlus resource type is designed to make local and global file system configurations highly available. You can use the HAStoragePlus resource type to integrate your local or global file system into the Sun Cluster environment and make the file system highly available.

You can use the HAStoragePlus resource type to make a file system available to a global-cluster non-voting node. To enable the HAStoragePlus resource type to do this, you must create a mount point on the global-cluster voting node and in the global-cluster non-voting node. The HAStoragePlus resource type makes the file system available to the global-cluster non-voting node by mounting the file system in the global-cluster voting node. The resource type then performs a loopback mount in the global-cluster non-voting node.

Sun Cluster systems support the following cluster file systems:

Sun Cluster software supports the following as highly available failover local file systems:

The HAStoragePlus resource type provides additional file system capabilities such as checks, mounts, and forced unmounts. These capabilities enable Sun Cluster to fail over local file systems. In order to fail over, the local file system must reside on global disk groups with affinity switchovers enabled.

See Enabling Highly Available Local File Systems in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for information about how to use the HAStoragePlus resource type.

You can also use the HAStoragePlus resource type to synchronize the startup of resources and device groups on which the resources depend. For more information, see Resources, Resource Groups, and Resource Types.

syncdir Mount Option

You can use the syncdir mount option for cluster file systems that use UFS as the underlying file system. However, performance significantly improves if you do not specify syncdir. If you specify syncdir, the writes are guaranteed to be POSIX compliant. If you do not specify syncdir, you experience the same behavior as in NFS file systems. For example, without syncdir, you might not discover an out-of-space condition until you close a file. With syncdir (and POSIX behavior), the out-of-space condition would have been discovered during the write operation. The cases in which you might have problems if you do not specify syncdir are rare.

If you are using a SPARC based cluster, VxFS does not have a mount option that is equivalent to the syncdir mount option for UFS. VxFS behavior is the same as UFS when the syncdir mount option is not specified.

See File Systems FAQs for frequently asked questions about global devices and cluster file systems.

Disk Path Monitoring

The current release of Sun Cluster software supports disk path monitoring (DPM). This section provides conceptual information about DPM, the DPM daemon, and administration tools that you use to monitor disk paths. Refer to Sun Cluster System Administration Guide for Solaris OS for procedural information about how to monitor, unmonitor, and check the status of disk paths.

DPM Overview

DPM improves the overall reliability of failover and switchover by monitoring secondary disk path availability. Use the cldevice command to verify the availability of the disk path that is used by a resource before the resource is switched. Options that are provided with the cldevice command enable you to monitor disk paths to a single Solaris host or to all Solaris hosts in the cluster. See the cldevice(1CL) man page for more information about command-line options.

The following table describes the default location for installation of DPM components.

Component | Location
Daemon | /usr/cluster/lib/sc/scdpmd
Command-line interface | /usr/cluster/bin/cldevice
Daemon status file (created at runtime) | /var/run/cluster/scdpm.status

A multithreaded DPM daemon runs on each host. The DPM daemon (scdpmd) is started by an rc.d script when a host boots. If a problem occurs, the daemon is managed by pmfd and restarts automatically. The following list describes how scdpmd works on initial startup.


Note –

At startup, the status for each disk path is initialized to UNKNOWN.


  1. The DPM daemon gathers disk path and node name information from the previous status file or from the CCR database. See Cluster Configuration Repository (CCR) for more information about the CCR. After a DPM daemon is started, you can force the daemon to read the list of monitored disks from a specified file name.

  2. The DPM daemon initializes the communication interface to respond to requests from components that are external to the daemon, such as the command-line interface.

  3. The DPM daemon pings each disk path in the monitored list every 10 minutes by using scsi_inquiry commands. Each entry is locked to prevent the communication interface from accessing the content of an entry that is being modified.

  4. The DPM daemon notifies the Sun Cluster Event Framework and logs the new status of the path through the UNIX syslogd command. See the syslogd(1M) man page.
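The polling and notification steps in the preceding list can be sketched as a toy Python model. This sketch is not the scdpmd implementation. The PathMonitor class, the probe and notify callables, and the OK and FAIL labels are hypothetical. Only the UNKNOWN initial status, the per-entry locking, and the 10-minute interval come from the text.

```python
import threading

POLL_INTERVAL_SECONDS = 600  # the text states a 10-minute polling interval
UNKNOWN, OK, FAIL = "UNKNOWN", "OK", "FAIL"  # OK and FAIL are illustrative labels


class PathMonitor:
    """Toy model of the monitored-path list and one polling pass."""

    def __init__(self, paths, probe, notify):
        self.probe = probe    # callable(path) -> bool, stands in for the SCSI inquiry
        self.notify = notify  # callable(path, status), stands in for event and syslog output
        # Each entry carries its own lock so that a reader never sees
        # an entry while the entry is being modified.
        self.entries = {
            p: {"status": UNKNOWN, "lock": threading.Lock()} for p in paths
        }

    def poll_once(self):
        """Probe every monitored path and report each status change."""
        for path, entry in self.entries.items():
            new_status = OK if self.probe(path) else FAIL
            with entry["lock"]:
                changed = entry["status"] != new_status
                entry["status"] = new_status
            if changed:
                self.notify(path, new_status)

    def get_status(self, path):
        """Read one entry, as an external status request would."""
        entry = self.entries[path]
        with entry["lock"]:
            return entry["status"]
```

A real daemon would call poll_once on a timer, once per POLL_INTERVAL_SECONDS. The sketch reports a path only when its status differs from the previous pass, which is one reading of "logs the new status of the path".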


Note –

All errors that are related to the daemon are reported by pmfd. All the functions from the API return 0 on success and -1 for any failure.


The DPM daemon monitors the availability of the logical path that is visible through multipath drivers such as Solaris I/O multipathing (MPxIO, formerly named Sun StorEdge Traffic Manager), Sun StorEdge 9900 Dynamic Link Manager, and EMC PowerPath. The individual physical paths that are managed by these drivers are not monitored because the multipath driver masks individual failures from the DPM daemon.

Monitoring Disk Paths

This section describes two methods for monitoring disk paths in your cluster. The first method is provided by the cldevice command. Use this command to monitor, unmonitor, or display the status of disk paths in your cluster. You can also use this command to print a list of faulted disks and to monitor disk paths from a file. See the cldevice(1CL) man page.

The second method for monitoring disk paths in your cluster is provided by the Sun Cluster Manager graphical user interface (GUI). Sun Cluster Manager provides a topological view of the monitored disk paths in your cluster. The view is updated every 10 minutes to provide information about the number of failed pings. Use the information that is provided by the Sun Cluster Manager GUI in conjunction with the cldevice command to administer disk paths. See Chapter 13, Administering Sun Cluster With the Graphical User Interfaces, in Sun Cluster System Administration Guide for Solaris OS for information about Sun Cluster Manager.

Using the cldevice Command to Monitor and Administer Disk Paths

The cldevice command enables you to perform the following tasks:

Issue the cldevice command with the disk path argument from any active node to perform DPM administration tasks on the cluster. The disk path argument consists of a node name and a disk name. The node name is not required. If you do not specify a node name, all nodes are affected by default. The following table describes naming conventions for the disk path.


Note –

Always specify a global disk path name rather than a UNIX disk path name because a global disk path name is consistent throughout a cluster. A UNIX disk path name is not. For example, the disk path name can be c1t0d0 on one node and c2t0d0 on another node. To determine a global disk path name for a device that is connected to a node, use the cldevice list command before issuing DPM commands. See the cldevice(1CL) man page.


Table 3–3 Sample Disk Path Names

Name Type | Sample Disk Path Name | Description
Global disk path | schost-1:/dev/did/dsk/d1 | Disk path d1 on the schost-1 node
Global disk path | all:d1 | Disk path d1 on all nodes in the cluster
UNIX disk path | schost-1:/dev/rdsk/c0t0d0s0 | Disk path c0t0d0s0 on the schost-1 node
UNIX disk path | schost-1:all | All disk paths on the schost-1 node
All disk paths | all:all | All disk paths on all nodes of the cluster
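The node:disk form of the disk path argument can be illustrated with a small Python parser. The parse_disk_path function is hypothetical and is not part of the cldevice command. It only models the rule that the node name is optional and that all nodes are affected when the node name is omitted.

```python
def parse_disk_path(arg):
    """Split a disk path argument into a (node, disk) pair.

    The node name is optional. When it is omitted, "all" is assumed,
    which matches the default that the text describes.
    """
    node, sep, disk = arg.partition(":")
    if not sep:
        # No colon: the whole argument is the disk name.
        node, disk = "all", arg
    if not node or not disk:
        raise ValueError(f"malformed disk path: {arg!r}")
    return node, disk


pairs = [parse_disk_path(a) for a in
         ("schost-1:/dev/did/dsk/d1", "all:d1", "d1", "all:all")]
```

In this model, the arguments all:d1 and d1 produce the same pair, because a missing node name defaults to all nodes.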

Using Sun Cluster Manager to Monitor Disk Paths

Sun Cluster Manager enables you to perform the following basic DPM administration tasks:

The Sun Cluster Manager online help provides procedural information about how to administer disk paths.

Using the clnode set Command to Manage Disk Path Failure

You use the clnode set command to enable and disable the automatic rebooting of a node when all monitored shared-disk paths fail. When you enable the reboot_on_path_failure property, the states of local-disk paths are not considered when determining if a node reboot is necessary. Only monitored shared disks are affected. You can also use Sun Cluster Manager to perform these tasks.
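The reboot decision that this property controls can be written as a short Python predicate. The should_reboot function and the tuple layout are hypothetical. The sketch also assumes that a node with no monitored shared-disk paths is never rebooted, which the text does not state.

```python
def should_reboot(enabled, paths):
    """Decide whether a node reboots because of disk path failures.

    paths is an iterable of (is_shared, is_monitored, is_ok) tuples
    that describe the disk paths of one node.
    """
    if not enabled:
        return False
    # Local-disk paths and unmonitored paths are ignored.
    considered = [ok for shared, monitored, ok in paths if shared and monitored]
    # Reboot only when at least one path is considered and every one has failed.
    return bool(considered) and not any(considered)


node_paths = [
    (True, True, False),   # shared, monitored, failed
    (True, True, False),   # shared, monitored, failed
    (False, True, True),   # local disk, still healthy, not considered
]
reboot = should_reboot(True, node_paths)
```

Here reboot is true even though the local disk path is healthy, because only the monitored shared-disk paths count. If any one of the shared paths were still available, the predicate would return false.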

Quorum and Quorum Devices

This section contains the following topics:


Note –

For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.


Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time because multiple active partitions might cause data corruption. The Cluster Membership Monitor (CMM) and quorum algorithm guarantee that, at most, one instance of the same cluster is operational at any time, even if the cluster interconnect is partitioned.

For an introduction to quorum and CMM, see Cluster Membership in Sun Cluster Overview for Solaris OS.

Two types of problems arise from cluster partitions:

Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters. Each partition “believes” that it is the only partition because the nodes in one partition cannot communicate with the node or nodes in the other partition.

Amnesia occurs when the cluster restarts after a shutdown with cluster configuration data that is older than the data was at the time of the shutdown. This problem can occur when you start the cluster on a node that was not in the last functioning cluster partition.

Sun Cluster software avoids split brain and amnesia by:

A partition with the majority of votes gains quorum and is allowed to operate. This majority vote mechanism prevents split brain and amnesia when more than two nodes are configured in a cluster. However, counting node votes alone is not sufficient when only two hosts are configured in a cluster. In a two-host cluster, a majority is two. If such a two-host cluster becomes partitioned, an external vote is needed for either partition to gain quorum. This external vote is provided by a quorum device.
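The majority rule can be checked with a few lines of arithmetic. The following Python sketch assumes that a majority means more than half of the total configured votes. The function names are hypothetical.

```python
def votes_required(total_votes):
    """Smallest vote count that is a strict majority of total_votes."""
    return total_votes // 2 + 1

def has_quorum(partition_votes, total_votes):
    """True if a partition holding partition_votes can operate."""
    return partition_votes >= votes_required(total_votes)

# Two hosts, one vote each, no quorum device: neither half of a split has quorum.
print(votes_required(2), has_quorum(1, 2))

# Same two hosts plus a quorum device with one vote: a host that also
# holds the quorum device vote reaches the majority of two out of three.
print(votes_required(3), has_quorum(2, 3))
```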

About Quorum Vote Counts

Use the clquorum show command to determine the following information:

See the clquorum(1CL) man page.

Both nodes and quorum devices contribute votes to the cluster to form quorum.

A node contributes votes depending on the node's state:

Quorum devices contribute votes that are based on the number of votes that are connected to the device. When you configure a quorum device, Sun Cluster software assigns the quorum device a vote count of N-1 where N is the number of connected votes to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).

A quorum device contributes votes if one of the following two conditions is true:

You configure quorum devices during the cluster installation, or afterwards, by using the procedures that are described in Chapter 6, Administering Quorum, in Sun Cluster System Administration Guide for Solaris OS.

About Quorum Configurations

The following list contains facts about quorum configurations:

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Adhering to Quorum Device Requirements

Ensure that Sun Cluster software supports your specific device as a quorum device. If you ignore this requirement, you might compromise your cluster's availability.


Note –

For a list of the specific devices that Sun Cluster software supports as quorum devices, contact your Sun service provider.


Sun Cluster software supports the following types of quorum devices:


Note –

You cannot use a replicated device as a quorum device.


In a two–host configuration, you must configure at least one quorum device to ensure that a single host can continue if the other host fails. See Figure 3–2.

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Adhering to Quorum Device Best Practices

Use the following information to evaluate the best quorum configuration for your topology:

For examples of quorum configurations to avoid, see Bad Quorum Configurations. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Recommended Quorum Configurations

This section shows examples of quorum configurations that are recommended. For examples of quorum configurations you should avoid, see Bad Quorum Configurations.

Quorum in Two–Host Configurations

Two quorum votes are required for a two-host cluster to form. These two votes can derive from the two cluster hosts, or from just one host and a quorum device.

Figure 3–2 Two–Host Configuration

Illustration: Shows Host A and Host B with one quorum device that is connected to two hosts.

Quorum in Greater Than Two–Host Configurations

Quorum devices are not required when a cluster includes more than two hosts, as the cluster survives failures of a single host without a quorum device. However, under these conditions, you cannot start the cluster without a majority of hosts in the cluster.

You can add a quorum device to a cluster that includes more than two hosts. A partition can survive as a cluster when that partition has a majority of quorum votes, including the votes of the hosts and the quorum devices. Consequently, when adding a quorum device, consider the possible host and quorum device failures when choosing whether and where to configure quorum devices.

Illustration: Config1: HostA-D. A/B connect to (->) QD1. C/D -> QD2. Config2: HostA-C. A/C -> QD1. B/C -> QD2. Config3: HostA-C -> one QD.

Atypical Quorum Configurations

Figure 3–3 assumes you are running mission-critical applications (Oracle database, for example) on Host A and Host B. If Host A and Host B are unavailable and cannot access shared data, you might want the entire cluster to be down. Otherwise, this configuration is suboptimal because it does not provide high availability.

For information about the best practice to which this exception relates, see Adhering to Quorum Device Best Practices.

Figure 3–3 Atypical Configuration

Illustration: HostA-D. Host A/B connect to QD1-4. HostC connect to QD4. HostD connect to QD4. Total votes = 10. Votes required for quorum = 6.
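The totals in Figure 3–3 can be reproduced from the N-1 rule for quorum devices. The following Python sketch assumes that each host contributes one vote and that the devices are wired as the illustration text describes: QD1 through QD3 to Host A and Host B, and QD4 to all four hosts. The variable and function names are hypothetical.

```python
def quorum_device_votes(connected_node_votes):
    """N-1 rule: N is the number of node votes connected to the device."""
    return max(sum(connected_node_votes) - 1, 0)

# Assumption: every host carries one vote.
node_votes = {"A": 1, "B": 1, "C": 1, "D": 1}
devices = {
    "QD1": ["A", "B"],
    "QD2": ["A", "B"],
    "QD3": ["A", "B"],
    "QD4": ["A", "B", "C", "D"],
}

device_votes = {
    qd: quorum_device_votes([node_votes[h] for h in hosts])
    for qd, hosts in devices.items()
}
total = sum(node_votes.values()) + sum(device_votes.values())
required = total // 2 + 1

# If Host A and Host B are both down, Host C and Host D can hold
# at most their own votes plus the votes of QD4.
survivors = node_votes["C"] + node_votes["D"] + device_votes["QD4"]

print(device_votes, total, required, survivors)
```

Under these assumptions, Host C and Host D together stay below the required count, which matches the statement that the entire cluster goes down when Host A and Host B are unavailable.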

Bad Quorum Configurations

This section shows examples of quorum configurations you should avoid. For examples of recommended quorum configurations, see Recommended Quorum Configurations.

Illustration: Config1: HostA-B. A/B connect to -> QD1/2.
Config2: HostA-D. A/B -> QD1/2. Config3: HostA-C. A/B-> QD1/2 & C -> QD2.

Data Services

The term data service describes an application, such as Sun Java System Web Server or Oracle, that has been configured to run on a cluster rather than on a single server. A data service consists of an application, specialized Sun Cluster configuration files, and Sun Cluster management methods that control the following actions of the application.

For information about data service types, see Data Services in Sun Cluster Overview for Solaris OS.

Figure 3–4 compares an application that runs on a single application server (the single-server model) to the same application running on a cluster (the clustered-server model). The only difference between the two configurations is that the clustered application might run faster and is more highly available.

Figure 3–4 Standard Compared to Clustered Client-Server Configuration

Illustration: The following context describes the graphic.

In the single-server model, you configure the application to access the server through a particular public network interface (a host name). The host name is associated with that physical server.

In the clustered-server model, the public network interface is a logical host name or a shared address. The term network resources is used to refer to both logical host names and shared addresses.

Some data services require you to specify either logical host names or shared addresses as the network interfaces. Logical host names and shared addresses are not always interchangeable. Other data services allow you to specify either logical host names or shared addresses. Refer to the installation and configuration documentation for each data service for details about the type of interface you must specify.

A network resource is not associated with a specific physical server. A network resource can migrate between physical servers.

A network resource is initially associated with one node, the primary. If the primary fails, the network resource and the application resource fail over to a different cluster node (a secondary). When the network resource fails over, after a short delay, the application resource continues to run on the secondary.

Figure 3–5 compares the single-server model with the clustered-server model. Note that in the clustered-server model, a network resource (logical host name, in this example) can move between two or more of the cluster nodes. The application is configured to use this logical host name in place of a host name that is associated with a particular server.

Figure 3–5 Fixed Host Name Compared to Logical Host Name

Illustration: The preceding context describes the graphic.

A shared address is also initially associated with one node. This node is called the global interface node. A shared address (known as the global interface) is used as the single network interface to the cluster.

The difference between the logical host name model and the scalable service model is that in the latter, each node also has the shared address actively configured on its loopback interface. This configuration enables multiple instances of a data service to be active on several nodes simultaneously. The term “scalable service” means that you can add more CPU power to the application by adding additional cluster nodes and the performance scales.

If the global interface node fails, the shared address can be started on another node that is also running an instance of the application (thereby making this other node the new global interface node). Or, the shared address can fail over to another cluster node that was not previously running the application.

Figure 3–6 compares the single-server configuration with the clustered scalable service configuration. Note that in the scalable service configuration, the shared address is present on all nodes. The application is configured to use this shared address in place of a host name that is associated with a particular server. This scheme is similar to how a logical host name is used for a failover data service.

Figure 3–6 Fixed Host Name Compared to Shared Address

Illustration: The preceding context describes the graphic.

Data Service Methods

The Sun Cluster software supplies a set of service management methods. These methods run under the control of the Resource Group Manager (RGM), which uses them to start, stop, and monitor the application on the cluster nodes. These methods, along with the cluster framework software and multihost devices, enable applications to become failover or scalable data services.

The RGM also manages resources in the cluster, including instances of an application and network resources (logical host names and shared addresses).

In addition to Sun Cluster software-supplied methods, the Sun Cluster software also supplies an API and several data service development tools. These tools enable application developers to develop the data service methods that are required to make other applications run as highly available data services with the Sun Cluster software.

Failover Data Services

If the node on which the data service is running (the primary node) fails, the service is migrated to another working node without user intervention. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical host names). Logical host names are IP addresses that can be configured on one node and, at a later time, automatically unconfigured on the original node and configured on another node.

For failover data services, application instances run only on a single node. If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover). The outcome depends on how you have configured the data service.

Scalable Data Services

The scalable data service has the potential for active instances on multiple nodes.

Scalable services use the following two resource groups:

A scalable resource group can be online on multiple nodes simultaneously. As a result, multiple instances of the service can be running at once. All scalable resource groups use load balancing. All nodes that host a scalable service use the same shared address to host the service. The failover resource group that hosts the shared address is online on only one node at a time.

Service requests enter the cluster through a single network interface (the global interface). These requests are distributed to the nodes, based on one of several predefined algorithms that are set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes. Multiple global interfaces can exist on different nodes that host other shared addresses.

For scalable services, application instances run on several nodes simultaneously. If the node that hosts the global interface fails, the global interface fails over to another node. If an application instance that is running fails, the instance attempts to restart on the same node.

If an application instance cannot be restarted on the same node, and another unused node is configured to run the service, the service fails over to the unused node. Otherwise, the service continues to run on the remaining nodes, possibly causing a degradation of service throughput.
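The recovery order for a failed scalable service instance can be summarized as a decision function. The following Python sketch is a simplified model with hypothetical names. Choosing the first entry of unused_nodes is an assumption, not documented behavior.

```python
def handle_instance_failure(failed_node, restart_ok, running_nodes, unused_nodes):
    """Model the outcome when an instance of a scalable service fails.

    Returns a tuple (action, detail):
      ("restart", node)      the instance restarts on the same node
      ("failover", node)     the instance moves to an unused configured node
      ("degraded", [nodes])  the service continues on the remaining nodes
    """
    if restart_ok:
        return ("restart", failed_node)
    if unused_nodes:
        # Assumption: pick the first unused node that is configured for the service.
        return ("failover", unused_nodes[0])
    remaining = [n for n in running_nodes if n != failed_node]
    return ("degraded", remaining)

print(handle_instance_failure("node2", False, ["node1", "node2", "node3"], []))
```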


Note –

TCP state for each application instance is kept on the node with the instance, not on the global interface node. Therefore, failure of the global interface node does not affect the connection.


Figure 3–7 shows an example of failover and scalable resource groups and the dependencies that exist between them for scalable services. This example shows three resource groups. The failover resource group contains application resources for highly available DNS, and network resources used by both highly available DNS and highly available Apache Web Server (used in SPARC-based clusters only). The scalable resource groups contain only application instances of the Apache Web Server. Note that resource group dependencies exist between the scalable and failover resource groups (solid lines). Additionally, all the Apache application resources depend on the network resource schost-2, which is a shared address (dashed lines).

Figure 3–7 SPARC: Failover and Scalable Resource Group Example

Illustration: The preceding context describes the graphic.

Load-Balancing Policies

Load balancing improves performance of the scalable service, both in response time and in throughput. There are two classes of scalable data services.

A pure service is capable of having any of its instances respond to client requests. A sticky service is capable of having a client send requests to the same instance. Those requests are not redirected to other instances.

A pure service uses a weighted load-balancing policy. Under this load-balancing policy, client requests are by default uniformly distributed over the server instances in the cluster. The load is distributed among various nodes according to specified weight values. For example, in a three-node cluster, suppose that each node has the weight of 1. Each node services one third of the requests from any client on behalf of that service. The cluster administrator can change weights at any time with an administrative command or with Sun Cluster Manager.

The weighted load-balancing policy is set by using the LB_WEIGHTED value for the Load_balancing_policy resource property. Per-node weights are specified with the Load_balancing_weights property. If a weight for a node is not explicitly set, the weight for that node is set to 1 by default.

The weighted policy redirects a certain percentage of the traffic from clients to a particular node. Given X=weight and A=the total weights of all active nodes, an active node can expect approximately X/A of the total new connections to be directed to the active node. However, the total number of connections must be large enough. This policy does not address individual requests.
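The X/A relationship can be made concrete with a short simulation. The following Python sketch is not the cluster's load balancer. It assumes that each new connection is assigned independently at random in proportion to the node weights, which is only one possible way to obtain the stated proportions. All names are hypothetical.

```python
import random
from collections import Counter

def expected_shares(weights):
    """X/A for each node: its weight divided by the total of active weights."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

def pick_node(weights, rng):
    """Assumed selection rule: weighted random choice for one new connection."""
    nodes = list(weights)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

def simulate(weights, connections, seed=0):
    """Observed share of connections per node over many connections."""
    rng = random.Random(seed)
    counts = Counter(pick_node(weights, rng) for _ in range(connections))
    return {node: counts[node] / connections for node in weights}

print(expected_shares({"node1": 1, "node2": 1, "node3": 1}))
print(simulate({"node1": 3, "node2": 1}, connections=10000))
```

With only a handful of connections the observed shares can differ widely from X/A, which is why the total number of connections must be large enough.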

Note that the weighted policy is not round robin. A round-robin policy would always cause each request from a client to go to a different node. For example, the first request would go to node 1, the second request would go to node 2, and so on.

A sticky service has two flavors, ordinary sticky and wildcard sticky.

Sticky services enable concurrent application-level sessions over multiple TCP connections to share in-state memory (application session state).

Ordinary sticky services enable a client to share the state between multiple concurrent TCP connections. The client is said to be “sticky” toward that server instance that is listening on a single port.

The client is guaranteed that all requests go to the same server instance, provided that the following conditions are met:

For example, a web browser on the client connects to a shared IP address on port 80 using three different TCP connections. However, the connections exchange cached session information between them at the service.

A generalization of a sticky policy extends to multiple scalable services that exchange session information in the background and at the same instance. When these services exchange session information in the background and at the same instance, the client is said to be “sticky” toward multiple server instances on the same node that is listening on different ports.

For example, a customer on an e-commerce web site fills a shopping cart with items by using HTTP on port 80. The customer then switches to SSL on port 443 to send secure data to pay by credit card for the items in the cart.

In the ordinary sticky policy, the set of ports is known at the time the application resources are configured. This policy is set by using the LB_STICKY value for the Load_balancing_policy resource property.

Wildcard sticky services use dynamically assigned port numbers, but still expect client requests to go to the same node. The client is “sticky wildcard” over ports that have the same IP address.

A good example of this policy is passive mode FTP. For example, a client connects to an FTP server on port 21. The server then instructs the client to connect back to a listener port in the dynamic port range. All requests for this IP address are forwarded to the same node that the server identified to the client in the control information.

The sticky-wildcard policy is a superset of the ordinary sticky policy. For a scalable service that is identified by the IP address, ports are assigned by the server (and are not known in advance). The ports might change. This policy is set by using the LB_STICKY_WILD value for the Load_balancing_policy resource property.

For each one of these sticky policies, the weighted load-balancing policy is in effect by default. Therefore, a client's initial request is directed to the instance that the load balancer dictates. After the client establishes an affinity for the node where the instance is running, future requests are conditionally directed to that instance. The node must be accessible and the load-balancing policy must not have changed.
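The interaction between the weighted policy and client affinity can be sketched as follows. This Python model is illustrative only. The class name, the use of the client IP address as the affinity key, and the random weighted first choice are assumptions.

```python
import random

class StickyBalancer:
    """Toy model: weighted first choice, then affinity to the chosen node."""

    def __init__(self, weights, seed=None):
        self.weights = dict(weights)
        self.affinity = {}                 # client IP -> node (hypothetical key)
        self.rng = random.Random(seed)

    def set_weights(self, weights):
        """A change of load-balancing policy discards existing affinities."""
        self.weights = dict(weights)
        self.affinity.clear()

    def route(self, client_ip, reachable):
        """Return the node that serves this client's request."""
        node = self.affinity.get(client_ip)
        if node is not None and node in reachable:
            return node                    # affinity holds: same instance
        candidates = [n for n in self.weights
                      if n in reachable and self.weights[n] > 0]
        if not candidates:
            raise RuntimeError("no reachable node hosts the service")
        node = self.rng.choices(
            candidates, weights=[self.weights[n] for n in candidates], k=1)[0]
        self.affinity[client_ip] = node
        return node

balancer = StickyBalancer({"node1": 1, "node2": 1}, seed=1)
first = balancer.route("192.0.2.10", {"node1", "node2"})
print(first, balancer.route("192.0.2.10", {"node1", "node2"}))
```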

Failback Settings

Resource groups fail over from one node to another. When this failover occurs, the original secondary becomes the new primary. The failback settings specify the actions that occur when the original primary comes back online. The options are to have the original primary become the primary again (failback) or to allow the current primary to remain. You specify the option you want by using the Failback resource group property setting.

If the original node that hosts the resource group fails and reboots repeatedly, setting failback might result in reduced availability for the resource group.
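The effect of the Failback setting on a node that fails and reboots repeatedly can be shown with a small count of resource group moves. The following Python sketch is a simplified two-node model. The event names and the function are hypothetical, and each move stands for an interruption of the service.

```python
def count_moves(events, failback, primary="node1", secondary="node2"):
    """Count how often the resource group changes nodes.

    events: sequence of "fail" or "rejoin", both referring to the original primary.
    """
    current = primary
    moves = 0
    for event in events:
        if event == "fail" and current == primary:
            current = secondary            # failover to the secondary
            moves += 1
        elif event == "rejoin" and failback and current != primary:
            current = primary              # failback to the original primary
            moves += 1
    return moves

# The original primary fails and reboots three times in a row.
events = ["fail", "rejoin"] * 3
print(count_moves(events, failback=True), count_moves(events, failback=False))
```

In this model, enabling failback causes a move on every failure and every rejoin, whereas without failback the resource group moves once and then stays on the new primary.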

Data Services Fault Monitors

Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information that probes return, predefined actions such as restarting daemons or causing a failover can be initiated.
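The way that probe results lead to predefined actions can be sketched as a simple escalation rule. The following Python function is a hypothetical model, not the logic of any supplied fault monitor. The parameter max_restarts and the rule of restarting first and failing over afterward are assumptions.

```python
def monitor_action(consecutive_failures, max_restarts):
    """Choose an action from the number of consecutive failed probes.

    max_restarts is a hypothetical threshold for this sketch.
    """
    if consecutive_failures == 0:
        return "none"        # probe succeeded, clients are being served
    if consecutive_failures <= max_restarts:
        return "restart"     # try to restart the daemon on the same node
    return "failover"        # give up locally and move to another node

for failures in range(4):
    print(failures, monitor_action(failures, max_restarts=2))
```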

Developing New Data Services

Sun supplies configuration files and management methods templates that enable you to make various applications operate as failover or scalable services within a cluster. If Sun does not offer the application that you want to run as a failover or scalable service, you have an alternative. Use a Sun Cluster API or the DSET API to configure the application to run as a failover or scalable service. However, not all applications can become a scalable service.

Characteristics of Scalable Services

A set of criteria determines whether an application can become a scalable service. To determine if your application can become a scalable service, see Analyzing the Application for Suitability in Sun Cluster Data Services Developer’s Guide for Solaris OS.

This set of criteria is summarized as follows:

Data Service API and Data Service Development Library API

The Sun Cluster software provides the following to make applications highly available:

The Sun Cluster Data Services Planning and Administration Guide for Solaris OS describes how to install and configure the data services that are supplied with the Sun Cluster software. The Sun Cluster 3.1 9/04 Software Collection for Solaris OS (SPARC Platform Edition) describes how to instrument other applications to be highly available under the Sun Cluster framework.

The Sun Cluster APIs enable application developers to develop fault monitors and scripts that start and stop data service instances. With these tools, an application can be implemented as a failover or a scalable data service. The Sun Cluster software provides a “generic” data service. Use this generic data service to quickly generate an application's required start and stop methods and to implement the data service as a failover or scalable service.

Using the Cluster Interconnect for Data Service Traffic

A cluster must usually have multiple network connections between Solaris hosts, forming the cluster interconnect.

Sun Cluster software uses multiple interconnects to achieve the following goals:

For both internal and external traffic such as file system data or scalable services data, messages are striped across all available interconnects. The cluster interconnect is also available to applications, for highly available communication between hosts. For example, a distributed application might have components that are running on different hosts that need to communicate. By using the cluster interconnect rather than the public transport, these connections can withstand the failure of an individual link.

To use the cluster interconnect for communication between hosts, an application must use the private host names that you configured during the Sun Cluster installation. For example, if the private host name for host1 is clusternode1-priv, use this name to communicate with host1 over the cluster interconnect. TCP sockets that are opened by using this name are routed over the cluster interconnect and can be transparently rerouted if a private network adapter fails. Application communication between any two hosts is striped over all interconnects. The traffic for a given TCP connection flows on one interconnect at any point. Different TCP connections are striped across all interconnects. Additionally, UDP traffic is always striped across all interconnects.

An application can optionally use a zone's private host name to communicate over the cluster interconnect between zones. However, you must first set each zone's private host name before the application can begin communicating. Each zone must have its own private host name to communicate. An application that is running in one zone must use the private host name in the same zone to communicate with private host names in other zones. An application in one zone cannot communicate through the private host name in another zone.

Because you can configure the private host names during your Sun Cluster installation, the cluster interconnect uses any name that you choose at that time. To determine the actual name, use the scha_cluster_get command with the scha_privatelink_hostname_node argument. See the scha_cluster_get(1HA) man page.

Each host is also assigned a fixed per-host address. This per-host address is plumbed on the clprivnet driver. The IP address maps to the private host name for the host: clusternode1-priv. See the clprivnet(7) man page.

If your application requires consistent IP addresses at all points, configure the application to bind to the per-host address on both the client and the server. All connections appear then to originate from and return to the per-host address.

Resources, Resource Groups, and Resource Types

Data services use several types of resources: applications such as Sun Java System Web Server or Apache Web Server use network addresses (logical host names and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.

Data services are resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle-server and Sun Cluster HA for Apache is the resource type SUNW.apache.

A resource is an instantiation of a resource type that is defined cluster wide. Several resource types are defined.

Network resources are either SUNW.LogicalHostname or SUNW.SharedAddress resource types. These two resource types are preregistered by the Sun Cluster software.

The HAStoragePlus resource type is used to synchronize the startup of resources and device groups on which the resources depend. This resource type ensures that before a data service starts, the paths to a cluster file system's mount points, global devices, and device group names are available. For more information, see Synchronizing the Startups Between Resource Groups and Device Groups in Sun Cluster Data Services Planning and Administration Guide for Solaris OS. The HAStoragePlus resource type also enables local file systems to be highly available. For more information about this feature, see HAStoragePlus Resource Type.

RGM-managed resources are placed into groups, called resource groups, so that they can be managed as a unit. A resource group is migrated as a unit if a failover or switchover is initiated on the resource group.


Note –

When you bring a resource group that contains application resources online, the application is started. The data service start method waits until the application is running before exiting successfully. The determination of when the application is up and running is accomplished the same way the data service fault monitor determines that a data service is serving clients. Refer to the Sun Cluster Data Services Planning and Administration Guide for Solaris OS for more information about this process.


Resource Group Manager (RGM)

The RGM controls data services (applications) as resources, which are managed by resource type implementations. These implementations are either supplied by Sun or created by a developer with a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers called resource groups. The RGM stops and starts resource groups on selected nodes in response to cluster membership changes.

The RGM acts on resources and resource groups. RGM actions cause resources and resource groups to move between online and offline states. A complete description of the states and settings that can be applied to resources and resource groups is located in Resource and Resource Group States and Settings.

Refer to Data Service Project Configuration for information about how to launch Solaris projects under RGM control.

Resource and Resource Group States and Settings

A system administrator applies static settings to resources and resource groups. You can change these settings only by administrative action. The RGM moves resource groups between dynamic “states.”

These settings and states are as follows:

Resource and Resource Group Properties

You can configure property values for resources and resource groups for your Sun Cluster data services. Standard properties are common to all data services. Extension properties are specific to each data service. Some standard and extension properties are configured with default settings so that you do not have to modify them. Others need to be set as part of the process of creating and configuring resources. The documentation for each data service specifies which resource properties can be set and how to set them.

The standard properties are used to configure resource and resource group properties that are usually independent of any particular data service. For the set of standard properties, see Appendix B, Standard Properties, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

The RGM extension properties provide information such as the location of application binaries and configuration files. You modify extension properties as you configure your data services. The set of extension properties is described in the individual guide for the data service.

Support for Solaris Zones

Solaris zones provide a means of creating virtualized operating system environments within an instance of the Solaris 10 OS. Solaris zones enable one or more applications to run in isolation from other activity on your system. The Solaris zones facility is described in Part II, Zones, in System Administration Guide: Solaris Containers-Resource Management and Solaris Zones.

When you run Sun Cluster software on the Solaris 10 OS, you can create any number of global-cluster non-voting nodes.

You can use Sun Cluster software to manage the availability and scalability of applications that are running on global-cluster non-voting nodes.

Support for Global-Cluster Non-Voting Nodes (Solaris Zones) Directly Through the RGM

On a cluster where the Solaris 10 OS is running, you can configure a resource group to run on a global-cluster voting node or a global-cluster non-voting node. The RGM manages each global-cluster non-voting node as a switchover target. If a global-cluster non-voting node is specified in the node list of a resource group, the RGM brings the resource group online on the specified node.

Figure 3–8 illustrates the failover of resource groups between nodes in a two-host cluster. In this example, identical nodes are configured to simplify the administration of the cluster.

Figure 3–8 Failover of Resource Groups Between Nodes

Diagram showing failover of resource groups between nodes

You can configure a scalable resource group (which uses network load balancing) to run in a global-cluster non-voting node as well.

In Sun Cluster commands, you specify a zone by appending the name of the zone to the name of the host, and separating them with a colon, for example:

phys-schost-1:zoneA
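The host:zone notation is straightforward to split in a script. The following Python sketch uses a hypothetical helper name and returns None for the zone when only a host name is given.

```python
def parse_node_spec(spec):
    """Split 'host' or 'host:zone' into (host, zone); zone is None if absent."""
    host, sep, zone = spec.partition(":")
    if not host or (sep and not zone):
        raise ValueError(f"invalid node specification: {spec!r}")
    return host, (zone or None)

print(parse_node_spec("phys-schost-1:zoneA"))
print(parse_node_spec("phys-schost-1"))
```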

Criteria for Using Support for Solaris Zones Directly Through the RGM

Use support for Solaris zones directly through the RGM if any of the following criteria is met:

Requirements for Using Support for Solaris Zones Directly Through the RGM

If you plan to use support for Solaris zones directly through the RGM for an application, ensure that the following requirements are met:

If you use support for Solaris zones directly through the RGM, ensure that resource groups that are related by an affinity are configured to run on the same Solaris host.

Additional Information About Support for Solaris Zones Directly Through the RGM

For information about how to configure support for Solaris zones directly through the RGM, see the following documentation:

Support for Solaris Zones on Sun Cluster Nodes Through Sun Cluster HA for Solaris Containers

The Sun Cluster HA for Solaris Containers data service manages each zone as a resource that is controlled by the RGM.

Criteria for Using Sun Cluster HA for Solaris Containers

Use the Sun Cluster HA for Solaris Containers data service if any of the following criteria is met:

Requirements for Using Sun Cluster HA for Solaris Containers

If you plan to use the Sun Cluster HA for Solaris Containers data service for an application, ensure that the following requirements are met:

Additional Information About Sun Cluster HA for Solaris Containers

For information about how to use the Sun Cluster HA for Solaris Containers data service, see Sun Cluster Data Service for Solaris Containers Guide for Solaris OS.

Service Management Facility

The Solaris Service Management Facility (SMF) enables you to run and administer applications as highly available and scalable resources. Like the Resource Group Manager (RGM), the SMF provides high availability and scalability, but for the Solaris Operating System.

Sun Cluster provides three proxy resource types that you can use to enable SMF services in a cluster. These resource types, SUNW.Proxy_SMF_failover, SUNW.Proxy_SMF_loadbalanced, and SUNW.Proxy_SMF_multimaster, enable you to run SMF services in a failover, scalable, and multi-master configuration, respectively. The SMF manages the availability of SMF services on a single Solaris host. The SMF uses the callback method execution model to run services.

The SMF also provides a set of administrative interfaces for monitoring and controlling services. These interfaces enable you to integrate your own SMF-controlled services into Sun Cluster. This capability eliminates the need to create new callback methods, rewrite existing callback methods, or update the SMF service manifest. You can include multiple SMF resources in a resource group and you can configure dependencies and affinities between them.

The SMF is responsible for starting, stopping, and restarting these services and managing their dependencies. Sun Cluster is responsible for managing the service in the cluster and for determining the hosts on which these services are to be started.

The SMF runs as a daemon, svc.startd, on each cluster host. The SMF daemon automatically starts and stops resources on selected hosts according to pre-configured policies.

The services that are specified for an SMF proxy resource can be located on a global-cluster voting node or a global-cluster non-voting node. However, all the services that are specified for the same SMF proxy resource must be located on the same node. SMF proxy resources work on any node.

System Resource Usage

System resources include aspects of CPU usage, memory usage, swap usage, and disk and network throughput. Sun Cluster enables you to monitor how much of a specific system resource is being used by an object type. An object type includes a host, node, zone, disk, network interface, or resource group. Sun Cluster also enables you to control the CPU that is available to a resource group.

Monitoring and controlling system resource usage can be part of your resource management policy. The cost and complexity of managing numerous machines encourage the consolidation of several applications on larger hosts. Instead of running each workload on a separate system, with full access to that system's resources, you use resource management to segregate workloads within the system. Resource management enables you to lower the total cost of ownership by running and controlling several applications on a single Solaris system.

Resource management ensures that your applications have the required response times. Resource management can also increase resource use. By categorizing and prioritizing usage, you can effectively use reserve capacity during off-peak periods, often eliminating the need for additional processing power. You can also ensure that resources are not wasted because of load variability.

To use the data that Sun Cluster collects about system resource usage, you must do the following:

By default, system resource monitoring and control are not configured when you install Sun Cluster. For information about configuring these services, see Chapter 10, Configuring Control of CPU Usage, in Sun Cluster System Administration Guide for Solaris OS.

System Resource Monitoring

By monitoring system resource usage, you can do the following:

Data about system resource usage can help you determine the hardware resources that are underused and the applications that use many resources. Based on this data, you can assign applications to nodes that have the necessary resources and choose the node to which to fail over. This consolidation can help you optimize the way that you use your hardware and software resources.

Monitoring all system resources at the same time might be costly in terms of CPU. Choose the system resources that you want to monitor by prioritizing the resources that are most critical for your system.

When you enable monitoring, you choose the telemetry attribute that you want to monitor. A telemetry attribute is an aspect of system resources. Examples of telemetry attributes include the amount of free CPU or the percentage of blocks that are used on a device. If you monitor a telemetry attribute on an object type, Sun Cluster monitors this telemetry attribute on all objects of that type in the cluster. Sun Cluster stores a history of the system resource data that is collected for seven days.

If you consider a particular data value to be critical for a system resource, you can set a threshold for this value. When setting a threshold, you also choose how critical this threshold is by assigning it a severity level. If the threshold is crossed, Sun Cluster changes the severity level of the threshold to the severity level that you choose.

Control of CPU

Each application and service that is running on a cluster has specific CPU needs. Table 3–4 lists the CPU control activities that are available on different versions of the Solaris OS.

Table 3–4 CPU Control

Solaris Version   Zone                             Control
Solaris 9 OS      Not available                    Assign CPU shares
Solaris 10 OS     Global-cluster voting node       Assign CPU shares
Solaris 10 OS     Global-cluster non-voting node   Assign CPU shares
                                                   Assign number of CPUs
                                                   Create dedicated processor sets


Note –

If you want to apply CPU shares, you must specify the Fair Share Scheduler (FSS) as the default scheduler in the cluster.


Controlling the CPU that is assigned to a resource group in a dedicated processor set in a global-cluster non-voting node offers the strictest level of control. If you reserve CPU for a resource group, this CPU is not available to other resource groups.

Viewing System Resource Usage

You can view system resource data and CPU assignments by using the command line or through Sun Cluster Manager. The system resources that you choose to monitor determine the tables and graphs that you can view.

By viewing the output of system resource usage and CPU control, you can do the following:

Sun Cluster does not provide advice about the actions to take, nor does it take action for you based on the data that it collects. You must determine whether the data that you view meets your expectations for a service. You must then take action to remedy any observed performance problems.

Data Service Project Configuration

Data services can be configured to launch under a Solaris project name when brought online by using the RGM. The configuration associates a resource or resource group managed by the RGM with a Solaris project ID. The mapping from your resource or resource group to a project ID gives you the ability to use sophisticated controls that are available in the Solaris OS to manage workloads and consumption within your cluster.


Note –

You can perform this configuration if you are using Sun Cluster on the Solaris 9 OS or on the Solaris 10 OS.


Using the Solaris management functionality in a Sun Cluster environment enables you to ensure that your most important applications are given priority when sharing a node with other applications. Applications might share a node if you have consolidated services or because applications have failed over. Use of the management functionality described herein might improve availability of a critical application by preventing lower-priority applications from overconsuming system supplies such as CPU time.


Note –

The Solaris documentation for this feature describes CPU time, processes, tasks and similar components as “resources”. Meanwhile, Sun Cluster documentation uses the term “resources” to describe entities that are under the control of the RGM. The following section uses the term “resource” to refer to Sun Cluster entities that are under the control of the RGM. The section uses the term “supplies” to refer to CPU time, processes, and tasks.


This section provides a conceptual description of configuring data services to launch processes on a specified Solaris OS UNIX project(4). This section also describes several failover scenarios and suggestions for planning to use the management functionality provided by the Solaris Operating System.

For detailed conceptual and procedural documentation about the management feature, refer to Chapter 1, Network Service (Overview), in System Administration Guide: Network Services.

When configuring resources and resource groups to use Solaris management functionality in a cluster, use the following high-level process:

  1. Configuring applications as part of the resource.

  2. Configuring resources as part of a resource group.

  3. Enabling resources in the resource group.

  4. Making the resource group managed.

  5. Creating a Solaris project for your resource group.

  6. Configuring standard properties to associate the resource group name with the project you created in step 5.

  7. Bringing the resource group online.

To associate the Solaris project ID with a resource or resource group, set the standard Resource_project_name or RG_project_name property by using the -p option with the clresource set or clresourcegroup set command. Set Resource_project_name on the resource and RG_project_name on the resource group. See Appendix B, Standard Properties, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS for property definitions. See the r_properties(5) and rg_properties(5) man pages for descriptions of properties.

The specified project name must exist in the projects database (/etc/project) and the root user must be configured as a member of the named project. Refer to Chapter 2, Projects and Tasks (Overview), in System Administration Guide: Solaris Containers-Resource Management and Solaris Zones for conceptual information about the project name database. Refer to project(4) for a description of project file syntax.
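
The two conditions above can be checked before you set the property. The following Python sketch is an illustration only: the function name root_can_use_project and the sample project Prj_9 with its user webadm are hypothetical. The sketch reads only text in the project(4) file format, so it does not consult NIS or LDAP, and it recognizes only an explicit root entry in the user list, not wildcards or group membership.

```python
def root_can_use_project(project_file_text, project_name):
    """Return True if project_name exists and lists root explicitly as a user."""
    for line in project_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # project(4) fields: name:projid:comment:user-list:group-list:attributes
        fields = line.split(":", 5)
        if len(fields) < 6:
            continue
        name, _projid, _comment, users, _groups, _attrs = fields
        if name == project_name:
            members = [u.strip() for u in users.split(",") if u.strip()]
            return "root" in members
    return False


SAMPLE = """\
Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)
Prj_9:109:hypothetical project without root:webadm::
"""

print(root_can_use_project(SAMPLE, "Prj_1"))    # True
print(root_can_use_project(SAMPLE, "Prj_9"))    # False
print(root_can_use_project(SAMPLE, "Prj_404"))  # False
```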

When the RGM brings resources or resource groups online, it launches the related processes under the project name.


Note –

Users can associate the resource or resource group with a project at any time. However, the new project name is not effective until the resource or resource group is taken offline and brought back online by using the RGM.


Launching resources and resource groups under the project name enables you to configure the following features to manage system supplies across your cluster.

Determining Requirements for Project Configuration

Before you configure data services to use the controls provided by Solaris in a Sun Cluster environment, you must decide how to control and track resources across switchovers or failovers. Identify dependencies within your cluster before configuring a new project. For example, resources and resource groups depend on device groups.

Use the nodelist, failback, maximum_primaries and desired_primaries resource group properties that you configure with the clresourcegroup set command to identify node list priorities for your resource group.

Use the preferenced property and failback property that you configure with the cldevicegroup and clsetup commands to determine device group node list priorities. See the clresourcegroup(1CL), cldevicegroup(1CL), and clsetup(1CL) man pages.

If you configure all cluster nodes identically, usage limits are enforced identically on primary and secondary nodes. The configuration parameters of projects do not need to be identical for all applications in the configuration files on all nodes. All projects that are associated with the application must at least be accessible by the project database on all potential masters of that application. Suppose that Application 1 is mastered by phys-schost-1 but could potentially be switched over or failed over to phys-schost-2 or phys-schost-3. The project that is associated with Application 1 must be accessible on all three nodes (phys-schost-1, phys-schost-2, and phys-schost-3).


Note –

Project database information can be a local /etc/project database file or can be stored in the NIS map or the LDAP directory service.


The Solaris Operating System allows for flexible configuration of usage parameters, and few restrictions are imposed by Sun Cluster. Configuration choices depend on the needs of the site. Consider the general guidelines in the following sections before configuring your systems.

Setting Per-Process Virtual Memory Limits

Set the process.max-address-space control to limit virtual memory on a per-process basis. See the rctladm(1M) man page for information about setting the process.max-address-space value.

When you use management controls with Sun Cluster software, configure memory limits appropriately to prevent unnecessary failover of applications and a “ping-pong” effect of applications. In general, observe the following guidelines.

Failover Scenarios

You can configure management parameters so that the allocation in the project configuration (/etc/project) works in normal cluster operation and in switchover or failover situations.

The following sections are example scenarios.

In a Sun Cluster environment, you configure an application as part of a resource. You then configure a resource as part of a resource group (RG). When a failure occurs, the resource group, along with its associated applications, fails over to another node. In the following examples the resources are not shown explicitly. Assume that each resource has only one application.


Note –

Failover occurs in the order in which nodes are specified in the node list and set in the RGM.
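
The ordering rule in this note can be sketched as follows. This is a simplified illustration, not the RGM's actual selection logic, which also takes other settings into account. The function name next_master and the availability set are assumptions made for the example.

```python
def next_master(nodelist, current, available):
    """Return the first node in nodelist, other than current, that is available."""
    for node in nodelist:
        if node != current and node in available:
            return node
    return None  # no other node in the node list can host the resource group


nodelist = ["phys-schost-1", "phys-schost-2", "phys-schost-3"]
print(next_master(nodelist, "phys-schost-1", {"phys-schost-2", "phys-schost-3"}))
print(next_master(nodelist, "phys-schost-1", {"phys-schost-3"}))
```

The first call selects phys-schost-2 because it precedes phys-schost-3 in the node list. The second call selects phys-schost-3 because phys-schost-2 is not available.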


The following examples have these constraints:

Although the numbers of assigned shares remain the same, the percentage of CPU time that is allocated to each application changes after failover. This percentage depends on the number of applications that are running on the node and the number of shares that are assigned to each active application.

In these scenarios, assume the following configurations.

Two-Host Cluster With Two Applications

You can configure two applications on a two-host cluster to ensure that each physical host (phys-schost-1, phys-schost-2) acts as the default master for one application. Each physical host acts as the secondary node for the other physical host. All projects that are associated with Application 1 and Application 2 must be represented in the projects database files on both nodes. When the cluster is running normally, each application is running on its default master, where it is allocated all CPU time by the management facility.

After a failover or switchover occurs, both applications run on a single node where they are allocated shares as specified in the configuration file. For example, this entry in the /etc/project file specifies that Application 1 is allocated 4 shares and Application 2 is allocated 1 share.

Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)
Prj_2:101:project for App-2:root::project.cpu-shares=(privileged,1,none)
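
To make the arithmetic concrete, the following Python sketch reads the share counts from the two entries above and computes the percentage of CPU time that each project would receive when both applications compete for CPU on the same node. The function name parse_cpu_shares is hypothetical, and the sketch assumes that both applications demand CPU time continuously.

```python
def parse_cpu_shares(entry):
    """Return (project_name, shares) from one /etc/project entry, or (name, None)."""
    name, _projid, _comment, _users, _groups, attrs = entry.split(":", 5)
    for attr in attrs.split(";"):
        key, _, value = attr.partition("=")
        if key == "project.cpu-shares":
            # The value has the form (privilege,shares,action).
            _privilege, shares, _action = value.strip("()").split(",")
            return name, int(shares)
    return name, None


entries = [
    "Prj_1:100:project for App-1:root::project.cpu-shares=(privileged,4,none)",
    "Prj_2:101:project for App-2:root::project.cpu-shares=(privileged,1,none)",
]
shares = dict(parse_cpu_shares(e) for e in entries)
total = sum(shares.values())
for name, value in shares.items():
    print(f"{name}: shares={value}, {100 * value / total:.0f}% of CPU time")
# Prj_1: shares=4, 80% of CPU time
# Prj_2: shares=1, 20% of CPU time
```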

The following diagram illustrates the normal and failover operations of this configuration. The number of shares that are assigned does not change. However, the percentage of CPU time available to each application can change. The percentage depends on the number of shares that are assigned to each process that demands CPU time.

Illustration: The preceding context describes the graphic.

Two-Host Cluster With Three Applications

On a two-host cluster with three applications, you can configure one host (phys-schost-1) as the default master of one application. You can configure the second physical host (phys-schost-2) as the default master for the remaining two applications. Assume the following example projects database file is located on every host. The projects database file does not change when a failover or switchover occurs.

Prj_1:103:project for App_1:root::project.cpu-shares=(privileged,5,none)
Prj_2:104:project for App_2:root::project.cpu-shares=(privileged,3,none)
Prj_3:105:project for App_3:root::project.cpu-shares=(privileged,2,none)

When the cluster is running normally, Application 1 is allocated 5 shares on its default master, phys-schost-1. This number is equivalent to 100 percent of CPU time because it is the only application that demands CPU time on that host. Applications 2 and 3 are allocated 3 and 2 shares, respectively, on their default master, phys-schost-2. Application 2 would receive 60 percent of CPU time and Application 3 would receive 40 percent of CPU time during normal operation.

If a failover or switchover occurs and Application 1 is switched over to phys-schost-2, the shares for all three applications remain the same. However, the percentages of CPU resources are reallocated according to the projects database file.
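
The reallocation follows directly from the share counts. The following Python sketch works through the numbers for this scenario, assuming that every application on a host demands CPU time continuously. The function name cpu_percentages is hypothetical.

```python
def cpu_percentages(shares):
    """Map each application to its percentage of CPU time on one host."""
    total = sum(shares.values())
    return {app: 100 * n / total for app, n in shares.items()}


# Normal operation: Application 1 on phys-schost-1, Applications 2 and 3 on phys-schost-2.
print(cpu_percentages({"App_1": 5}))              # {'App_1': 100.0}
print(cpu_percentages({"App_2": 3, "App_3": 2}))  # {'App_2': 60.0, 'App_3': 40.0}

# After Application 1 moves to phys-schost-2, all three applications share one host.
print(cpu_percentages({"App_1": 5, "App_2": 3, "App_3": 2}))
# {'App_1': 50.0, 'App_2': 30.0, 'App_3': 20.0}
```

After the failover, Application 1 receives 50 percent, Application 2 receives 30 percent, and Application 3 receives 20 percent of CPU time on phys-schost-2, although no share count has changed.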

The following diagram illustrates the normal operations and failover operations of this configuration.

Illustration: The preceding context describes the graphic.

Failover of Resource Group Only

In a configuration in which multiple resource groups have the same default master, a resource group (and its associated applications) can fail over or be switched over to a secondary node. Meanwhile, the default master is running in the cluster.


Note –

During failover, the application that fails over is allocated resources as specified in the configuration file on the secondary host. In this example, the project database files on the primary and secondary hosts have the same configurations.


For example, this sample configuration file specifies that Application 1 is allocated 1 share, Application 2 is allocated 2 shares, and Application 3 is allocated 2 shares.

Prj_1:106:project for App_1:root::project.cpu-shares=(privileged,1,none)
Prj_2:107:project for App_2:root::project.cpu-shares=(privileged,2,none)
Prj_3:108:project for App_3:root::project.cpu-shares=(privileged,2,none)

The following diagram illustrates the normal and failover operations of this configuration, where RG-2, containing Application 2, fails over to phys-schost-2. Note that the number of shares assigned does not change. However, the percentage of CPU time available to each application can change, depending on the number of shares that are assigned to each application that demands CPU time.

Illustration: The preceding context describes the graphic.
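
The following Python sketch models this scenario. The initial placement is an assumption, because the text does not state it: RG-1 and RG-2 are taken to share phys-schost-1 as their default master, and RG-3 is taken to run on phys-schost-2. The function names are hypothetical, and each resource group is assumed to contain one application that demands CPU time continuously.

```python
SHARES = {"RG-1": 1, "RG-2": 2, "RG-3": 2}  # one application per resource group


def percentages(placement):
    """Return {host: {rg: percent}} for the resource groups placed on each host."""
    result = {}
    for host, groups in placement.items():
        total = sum(SHARES[rg] for rg in groups)
        result[host] = {rg: round(100 * SHARES[rg] / total, 1) for rg in groups}
    return result


def fail_over(placement, rg, target):
    """Return a new placement with rg moved to target; share counts are unchanged."""
    moved = {host: [g for g in groups if g != rg] for host, groups in placement.items()}
    moved[target] = moved[target] + [rg]
    return moved


# Assumed initial placement (not stated in the text).
normal = {"phys-schost-1": ["RG-1", "RG-2"], "phys-schost-2": ["RG-3"]}
after = fail_over(normal, "RG-2", "phys-schost-2")

print(percentages(normal))
print(percentages(after))
```

Under these assumptions, RG-1 rises from 33.3 percent to 100 percent of CPU time on phys-schost-1 after RG-2 leaves, while RG-2 and RG-3 each receive 50 percent on phys-schost-2.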

Public Network Adapters and IP Network Multipathing

Clients make data requests to the cluster through the public network. Each Solaris host in the cluster is connected to at least one public network through a pair of public network adapters.

Solaris Internet Protocol (IP) Network Multipathing software on Sun Cluster provides the basic mechanism for monitoring public network adapters and failing over IP addresses from one adapter to another when a fault is detected. Each host has its own IP network multipathing configuration, which can be different from the configuration on other hosts.

Public network adapters are organized into IP multipathing groups (multipathing groups). Each multipathing group has one or more public network adapters. Each adapter in a multipathing group can be active. Alternatively, you can configure standby interfaces that are inactive unless a failover occurs.

The in.mpathd multipathing daemon uses a test IP address to detect failures and repairs. If a fault is detected on one of the adapters by the multipathing daemon, a failover occurs. All network access fails over from the faulted adapter to another functional adapter in the multipathing group. Therefore, the daemon maintains public network connectivity for the host. If you configured a standby interface, the daemon chooses the standby interface. Otherwise, the daemon chooses the interface with the least number of IP addresses. Because the failover occurs at the adapter interface level, higher-level connections such as TCP are not affected, except for a brief transient delay during the failover. When the failover of IP addresses completes successfully, ARP broadcasts are sent. Therefore, the daemon maintains connectivity to remote clients.
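
The selection rule described above can be sketched in Python as follows. This is an illustration of the rule as stated in this section, not the implementation of in.mpathd. The Adapter class, the function name, and the adapter names qfe0 through qfe2 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Adapter:
    name: str
    functional: bool
    standby: bool
    address_count: int


def choose_failover_target(group, failed_name):
    """Pick the adapter that takes over the addresses of the failed adapter."""
    candidates = [a for a in group if a.functional and a.name != failed_name]
    if not candidates:
        return None  # no functional adapter remains in the multipathing group
    standbys = [a for a in candidates if a.standby]
    if standbys:
        return standbys[0]  # a configured standby interface is preferred
    # Otherwise choose the interface that hosts the fewest IP addresses.
    return min(candidates, key=lambda a: a.address_count)


group = [
    Adapter("qfe0", functional=False, standby=False, address_count=3),
    Adapter("qfe1", functional=True, standby=False, address_count=2),
    Adapter("qfe2", functional=True, standby=False, address_count=1),
]
print(choose_failover_target(group, "qfe0").name)  # qfe2
```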


Note –

Because of the congestion recovery characteristics of TCP, TCP endpoints can experience further delay after a successful failover. Some segments might have been lost during the failover, activating the congestion control mechanism in TCP.


Multipathing groups provide the building blocks for logical host name and shared address resources. You can also create multipathing groups independently of logical host name and shared address resources to monitor public network connectivity of cluster hosts. The same multipathing group on a host can host any number of logical host name or shared address resources. For more information about logical host name and shared address resources, see the Sun Cluster Data Services Planning and Administration Guide for Solaris OS.


Note –

The design of the IP network multipathing mechanism is meant to detect and mask adapter failures. The design is not intended to recover from an administrator's use of ifconfig to remove one of the logical (or shared) IP addresses. The Sun Cluster software views the logical and shared IP addresses as resources that are managed by the RGM. The correct way for an administrator to add or to remove an IP address is to use clresource and clresourcegroup to modify the resource group that contains the resource.


For more information about the Solaris implementation of IP Network Multipathing, see the appropriate documentation for the Solaris Operating System that is installed on your cluster.

Operating System              Instructions
Solaris 9 Operating System    Chapter 1, IP Network Multipathing (Overview), in IP Network Multipathing Administration Guide
Solaris 10 Operating System   Part VI, IPMP, in System Administration Guide: IP Services

SPARC: Dynamic Reconfiguration Support

Sun Cluster 3.2 1/09 support for the dynamic reconfiguration (DR) software feature is being developed in incremental phases. This section describes concepts and considerations for Sun Cluster 3.2 1/09 support of the DR feature.

All the requirements, procedures, and restrictions that are documented for the Solaris DR feature also apply to Sun Cluster DR support (except for the operating environment quiescence operation). Therefore, review the documentation for the Solaris DR feature before using the DR feature with Sun Cluster software. In particular, review the issues that affect non-network IO devices during a DR detach operation.

The Sun Enterprise 10000 Dynamic Reconfiguration User Guide and the Sun Enterprise 10000 Dynamic Reconfiguration Reference Manual (from the Solaris 10 on Sun Hardware collection) are both available for download from http://docs.sun.com.

SPARC: Dynamic Reconfiguration General Description

The DR feature enables operations, such as the removal of system hardware, in running systems. The DR processes are designed to ensure continuous system operation with no need to halt the system or interrupt cluster availability.

DR operates at the board level. Therefore, a DR operation affects all the components on a board. Each board can contain multiple components, including CPUs, memory, and peripheral interfaces for disk drives, tape drives, and network connections.

Removing a board that contains active components would result in system errors. Before removing a board, the DR subsystem queries other subsystems, such as Sun Cluster, to determine whether the components on the board are being used. If the DR subsystem finds that a board is in use, the DR remove-board operation is not done. Therefore, it is always safe to issue a DR remove-board operation because the DR subsystem rejects operations on boards that contain active components.

The DR add-board operation is also always safe. CPUs and memory on a newly added board are automatically brought into service by the system. However, the system administrator must manually configure the cluster to actively use components that are on the newly added board.


Note –

The DR subsystem has several levels. If a lower level reports an error, the upper level also reports an error. However, when the lower level reports the specific error, the upper level reports Unknown error. You can safely ignore this error.


The following sections describe DR considerations for the different device types.

SPARC: DR Clustering Considerations for CPU Devices

Sun Cluster software does not reject a DR remove-board operation because of the presence of CPU devices.

When a DR add-board operation succeeds, CPU devices on the added board are automatically incorporated in system operation.

SPARC: DR Clustering Considerations for Memory

For the purposes of DR, consider two types of memory:

These two types differ only in usage. The actual hardware is the same for both types. Kernel memory cage is the memory that is used by the Solaris Operating System. Sun Cluster software does not support remove-board operations on a board that contains the kernel memory cage and rejects any such operation. When a DR remove-board operation pertains to memory other than the kernel memory cage, Sun Cluster software does not reject the operation. When a DR add-board operation that pertains to memory succeeds, memory on the added board is automatically incorporated in system operation.

SPARC: DR Clustering Considerations for Disk and Tape Drives

Sun Cluster rejects dynamic reconfiguration (DR) remove-board operations on active drives on the primary host. You can perform DR remove-board operations on inactive drives on the primary host and on any drives in the secondary host. After the DR operation, cluster data access continues as before.


Note –

Sun Cluster rejects DR operations that impact the availability of quorum devices. For considerations about quorum devices and the procedure for performing DR operations on them, see SPARC: DR Clustering Considerations for Quorum Devices.


See Dynamic Reconfiguration With Quorum Devices in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform these actions.

SPARC: DR Clustering Considerations for Quorum Devices

If the DR remove-board operation pertains to a board that contains an interface to a device configured for quorum, Sun Cluster software rejects the operation. Sun Cluster software also identifies the quorum device that would be affected by the operation. You must disable the device as a quorum device before you can perform a DR remove-board operation.

See Chapter 6, Administering Quorum, in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to administer quorum.

SPARC: DR Clustering Considerations for Cluster Interconnect Interfaces

If the DR remove-board operation pertains to a board containing an active cluster interconnect interface, Sun Cluster software rejects the operation. Sun Cluster software also identifies the interface that would be affected by the operation. You must use a Sun Cluster administrative tool to disable and remove the active interface before the DR operation can succeed.


Caution –

Sun Cluster software requires each cluster node to have at least one functioning path to every other cluster node. Do not disable a private interconnect interface that supports the last path to any Solaris host in the cluster.


See Administering the Cluster Interconnects in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform these actions.

SPARC: DR Clustering Considerations for Public Network Interfaces

If the DR remove-board operation pertains to a board that contains an active public network interface, Sun Cluster software rejects the operation. Sun Cluster software also identifies the interface that would be affected by the operation. Before you remove a board with an active network interface present, switch over all traffic on that interface to another functional interface in the multipathing group by using the if_mpadm command.


Caution –

If the remaining network adapter fails while you are performing the DR remove operation on the disabled network adapter, availability is impacted. The remaining adapter has no place to fail over for the duration of the DR operation.


See Administering the Public Network in Sun Cluster System Administration Guide for Solaris OS for detailed instructions about how to perform a DR remove operation on a public network interface.
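
The DR remove-board considerations for the different device types can be summarized in a short Python sketch. The Board class and the function name are hypothetical and model only the rules stated in these sections. They do not represent a Sun Cluster or Solaris interface.

```python
from dataclasses import dataclass


@dataclass
class Board:
    """Components on a board that matter to the cluster (hypothetical model)."""
    cpus: bool = False
    kernel_memory_cage: bool = False
    other_memory: bool = False
    active_drive_on_primary: bool = False
    quorum_device_interface: bool = False
    active_interconnect_interface: bool = False
    active_public_network_interface: bool = False


def rejection_reasons(board):
    """List why a DR remove-board operation is rejected, per the rules above."""
    checks = [
        (board.kernel_memory_cage, "board contains the kernel memory cage"),
        (board.active_drive_on_primary, "active drive on the primary host"),
        (board.quorum_device_interface, "interface to a configured quorum device"),
        (board.active_interconnect_interface, "active cluster interconnect interface"),
        (board.active_public_network_interface, "active public network interface"),
    ]
    # CPUs and memory outside the kernel memory cage never cause a rejection.
    return [reason for present, reason in checks if present]


print(rejection_reasons(Board(cpus=True, other_memory=True)))  # []
print(rejection_reasons(Board(quorum_device_interface=True)))
# ['interface to a configured quorum device']
```

An empty list means that none of the rules in these sections would cause Sun Cluster software to reject the operation.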