This chapter explains the key concepts related to the hardware and software components of the Sun Cluster system that you need to understand before working with Sun Cluster systems.
A cluster node is a machine that runs both the Solaris software and Sun Cluster software. The Sun Cluster software enables you to have one to sixteen nodes in a cluster, depending on the hardware configuration. Contact your Sun representative for information about the maximum number of nodes that each hardware configuration supports.
Cluster nodes are generally attached to one or more disks. Nodes that are not attached to disks use the cluster file system to access the multihost disks. Nodes in parallel database configurations share concurrent access to some or all disks.
Every node in the cluster is aware when another node joins or leaves the cluster. Also, every node in the cluster is aware of the resources that are running locally as well as the resources that are running on the other cluster nodes.
Nodes in the same cluster should have similar processing, memory, and I/O capability to enable failover to occur without significant degradation in performance. Because of the possibility of failover, each node should have sufficient capacity to meet service level agreements if a node fails.
The cluster interconnect is the physical configuration of devices that are used to transfer cluster-private communications and data service communications between cluster nodes.
Redundant interconnects enable operation to continue over the surviving interconnects while system administrators isolate failures and repair communication. The Sun Cluster software detects, repairs, and automatically re-initiates communication over a repaired interconnect.
For more information, see Cluster-Interconnect Components.
The Cluster Membership Monitor (CMM) is a distributed set of agents that exchange messages over the cluster interconnect to complete the following tasks:
Enforcing a consistent membership view on all nodes (quorum)
Driving synchronized reconfiguration in response to membership changes
Handling cluster partitioning
Ensuring full connectivity among all cluster members by leaving unhealthy nodes out of the cluster until they are repaired
The main function of the CMM is to establish cluster membership, which requires a cluster-wide agreement on the set of nodes that participate in the cluster at any time. The CMM detects major cluster status changes on each node, such as loss of communication between one or more nodes. The CMM relies on the transport kernel module to generate heartbeats across the transport medium to other nodes in the cluster. When the CMM does not detect a heartbeat from a node within a defined time-out period, the CMM considers the node to have failed and the CMM initiates a cluster reconfiguration to renegotiate cluster membership.
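The heartbeat time-out logic described above can be illustrated with a minimal Python sketch. This is not the CMM implementation: the `HeartbeatTracker` class, the `reconfigure` function, the node names, and the 10-second time-out are all hypothetical and serve only to show how a missed heartbeat might lead to a proposed change in membership.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatTracker:
    """Records the last time a heartbeat was seen from each node (hypothetical model)."""
    timeout: float  # seconds of silence after which a node is presumed failed
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return the nodes whose last heartbeat is older than the time-out."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


def reconfigure(members: set, failed: list) -> set:
    """Return the membership proposed after the failed nodes are dropped."""
    return members - set(failed)


tracker = HeartbeatTracker(timeout=10.0)
tracker.record("node-a", now=0.0)
tracker.record("node-b", now=0.0)
tracker.record("node-a", now=8.0)  # node-b stays silent after time 0
failed = tracker.failed_nodes(now=12.0)
print(failed)
print(sorted(reconfigure({"node-a", "node-b"}, failed)))
```

In the real system, the reconfiguration step also involves the quorum rules that are described later in this chapter. The sketch omits those rules.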
To determine cluster membership and to ensure data integrity, the CMM performs the following tasks:
Accounting for a change in cluster membership, such as a node joining or leaving the cluster
Ensuring that an unhealthy node leaves the cluster
Ensuring that an unhealthy node remains inactive until it is repaired
Preventing the cluster from partitioning itself into subsets of nodes
See Data Integrity for more information about how the cluster protects itself from partitioning into multiple separate clusters.
The Cluster Configuration Repository (CCR) is a private, cluster-wide, distributed database for storing information that pertains to the configuration and state of the cluster. To avoid corrupting configuration data, each node must be aware of the current state of the cluster resources. The CCR ensures that all nodes have a consistent view of the cluster. The CCR is updated when error or recovery situations occur or when the general status of the cluster changes.
The CCR structures contain the following types of information:
Cluster and node names
Cluster transport configuration
The names of Solaris Volume Manager disk sets or VERITAS disk groups
A list of nodes that can master each disk group
Operational parameter values for data services
Paths to data service callback methods
DID device configuration
Current cluster status
The Sun Cluster system makes all components on the “path” between users and data highly available by monitoring the applications themselves, the file system, and network interfaces.
The Sun Cluster software detects a node failure quickly and creates an equivalent server for the resources on the failed node. The Sun Cluster software ensures that resources unaffected by the failed node are constantly available during the recovery and that resources of the failed node become available as soon as they are recovered.
Each Sun Cluster data service supplies a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemon or daemons are running and that clients are being served. Based on the information returned by probes, predefined actions, such as restarting daemons or causing a failover, can be initiated.
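The following Python sketch illustrates how probe results might be mapped to predefined actions. It is an illustration only and not the fault monitor that any data service supplies. The function names and the `max_restarts` limit are assumptions introduced for this example.

```python
def choose_action(probe_ok: bool, restarts_so_far: int, max_restarts: int) -> str:
    """Map one probe result to an action (hypothetical policy)."""
    if probe_ok:
        return "none"
    if restarts_so_far < max_restarts:
        return "restart"   # try to recover on the same node first
    return "failover"      # restart budget is used up, so relocate the service


def run_monitor(probe_results, max_restarts=2):
    """Apply the policy to a sequence of probe results and stop after a failover."""
    actions = []
    restarts = 0
    for ok in probe_results:
        action = choose_action(ok, restarts, max_restarts)
        if action == "restart":
            restarts += 1
        actions.append(action)
        if action == "failover":
            break
    return actions


print(run_monitor([True, False, False, False, True]))
```

In this sketch, the first two failed probes trigger restarts and the third failed probe triggers a failover, after which monitoring on the original node stops.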
Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk path. You can use one of two methods for monitoring disk paths. The first method is provided by the cldevice command. This command enables you to monitor, unmonitor, or display the status of disk paths in your cluster. See the cldevice(1CL) man page for more information about command-line options.
The second method for monitoring disk paths in your cluster is provided by the Sun Cluster Manager graphical user interface (GUI). Sun Cluster Manager provides a topological view of the monitored disk paths. The view is updated every 10 minutes to provide information about the number of failed pings.
Each cluster node has its own IP network multipathing configuration, which can differ from the configuration on other cluster nodes. IP network multipathing monitors the following network communication failures:
The transmit and receive path of the network adapter has stopped transmitting packets.
The attachment of the network adapter to the link is down.
The port on the switch does not transmit or receive packets.
The physical interface in a group is not present at system boot.
A quorum device is a shared storage device or quorum server that is shared by two or more nodes and that contributes votes that are used to establish a quorum. The cluster can operate only when a quorum of votes is available. The quorum device is used when a cluster becomes partitioned into separate sets of nodes to establish which set of nodes constitutes the new cluster.
Both cluster nodes and quorum devices vote to form quorum. By default, cluster nodes acquire a quorum vote count of one when they boot and become cluster members. Nodes can have a vote count of zero when the node is being installed, or when an administrator has placed a node into the maintenance state.
Quorum devices acquire quorum vote counts that are based on the number of node connections to the device. When you set up a quorum device, it acquires a maximum vote count of N-1, where N is the number of votes that are connected to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).
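The N-1 rule can be expressed as a short calculation. The Python sketch below assumes that each connected node contributes its current vote count (one by default, zero during installation or in the maintenance state). The function name `quorum_device_votes` is hypothetical.

```python
def quorum_device_votes(connected_node_votes):
    """Return the maximum vote count of a quorum device.

    `connected_node_votes` lists the vote count of each node that is
    connected to the device. The device acquires N - 1 votes, where N is
    the sum of those votes, and never fewer than zero.
    """
    n = sum(connected_node_votes)
    return max(n - 1, 0)


print(quorum_device_votes([1, 1]))      # two nodes, each with one vote
print(quorum_device_votes([1, 1, 1]))   # three nodes, each with one vote
print(quorum_device_votes([1, 0]))      # one node is in the maintenance state
```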
The Sun Cluster system attempts to prevent data corruption and ensure data integrity. Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time. The CMM guarantees that only one cluster is operational at any time.
Two types of problems can arise from cluster partitions: split brain and amnesia. Split brain occurs when the cluster interconnect between nodes is lost and the cluster becomes partitioned into subclusters, and each subcluster believes that it is the only partition. A subcluster that is not aware of the other subclusters could cause a conflict in shared resources such as duplicate network addresses and data corruption.
Amnesia occurs if all the nodes leave the cluster in staggered groups. An example is a two-node cluster with nodes A and B. If node A goes down, the configuration data in the CCR is updated on node B only, and not node A. If node B goes down at a later time, and if node A is rebooted, node A will be running with old contents of the CCR. This state is called amnesia and might lead to running a cluster with stale configuration information.
You can avoid split brain and amnesia by giving each node one vote and mandating a majority of votes for an operational cluster. A partition with the majority of votes has a quorum and is enabled to operate. This majority vote mechanism works well if more than two nodes are in the cluster. In a two-node cluster, a majority is two. If such a cluster becomes partitioned, an external vote enables a partition to gain quorum. This external vote is provided by a quorum device. A quorum device can be any disk that is shared between the two nodes.
Table 2–1 describes how Sun Cluster software uses quorum to avoid split brain and amnesia.
Table 2–1 Cluster Quorum, and Split-Brain and Amnesia Problems
Split brain: Enables only the partition (subcluster) with a majority of votes to run as the cluster (only one partition can exist with such a majority). After a node loses the race for quorum, that node panics.
Amnesia: Guarantees that when a cluster is booted, it has at least one node that was a member of the most recent cluster membership (and thus has the latest configuration data).
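The majority rule can be checked with simple arithmetic. The following Python sketch assumes a two-node cluster in which each node has one vote and a shared quorum device contributes one vote, for a total of three votes. The function name `has_quorum` is hypothetical.

```python
def has_quorum(partition_votes: int, total_votes: int) -> bool:
    """Return True if the partition holds a strict majority of all configured votes."""
    return 2 * partition_votes > total_votes


# Two nodes (one vote each) plus one quorum device (one vote): three votes in total.
total = 3
print(has_quorum(2, total))  # the node that also holds the quorum device vote
print(has_quorum(1, total))  # the other node, alone in its partition

# Without a quorum device, neither half of a partitioned two-node cluster has a majority.
print(has_quorum(1, 2))
```

Because a majority must be strictly more than half of the configured votes, at most one partition can satisfy the check at any time.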
A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this situation occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual or subset clusters. Each subset or partition might “believe” it has sole access and ownership to the multihost disks. Attempts by multiple nodes to write to the disks can result in data corruption.
Failure fencing limits node access to multihost disks by preventing access to the disks. When a node leaves the cluster (it either fails or becomes partitioned), failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, ensuring data integrity.
The Sun Cluster system uses SCSI disk reservations to implement failure fencing. Using SCSI reservations, failed nodes are “fenced” away from the multihost disks, preventing them from accessing those disks.
When a cluster member detects that another node is no longer communicating over the cluster interconnect, it initiates a failure-fencing procedure to prevent the failed node from accessing shared disks. When this failure fencing occurs, the fenced node panics and a “reservation conflict” message is displayed on its console.
The failfast mechanism panics a failed node, but it does not prevent the failed node from rebooting. After the panic, the node might reboot and attempt to rejoin the cluster.
If a node loses connectivity to other nodes in the cluster, and it is not part of a partition that can achieve quorum, it is forcibly removed from the cluster by another node. Any node that is part of the partition that can achieve quorum places reservations on the shared disks. The node that does not have quorum then panics as a result of the failfast mechanism.
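The effect of failure fencing can be modeled with a small Python sketch. The sketch is a simplified model and does not use real SCSI reservation commands. The `SharedDisk` class, the `ReservationConflict` exception, and the node names are hypothetical.

```python
class ReservationConflict(Exception):
    """Raised when a fenced node tries to access a shared disk."""


class SharedDisk:
    """A multihost disk that accepts writes only from current cluster members."""

    def __init__(self, members):
        self.allowed = set(members)
        self.blocks = {}

    def fence(self, node):
        """Remove a departed node from the set of nodes that can access the disk."""
        self.allowed.discard(node)

    def write(self, node, block, data):
        if node not in self.allowed:
            # In the real system, the fenced node panics at this point.
            raise ReservationConflict(f"{node}: reservation conflict")
        self.blocks[block] = data


disk = SharedDisk(["node-a", "node-b"])
disk.write("node-b", 0, "before partition")
disk.fence("node-b")  # node-a, which holds quorum, fences node-b
try:
    disk.write("node-b", 0, "stale write")
    fenced_write_rejected = False
except ReservationConflict:
    fenced_write_rejected = True
print(fenced_write_rejected, disk.blocks[0])
```

The write that the fenced node attempts is rejected, so the data that current members wrote remains intact.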
The global file system makes all files across a cluster equally accessible and visible to all nodes. Similarly, Sun Cluster software makes all devices on a cluster accessible and visible throughout the cluster. That is, the I/O subsystem enables access to any device in the cluster, from any node, without regard to where the device is physically attached. This access is referred to as global device access.
Sun Cluster systems use global devices to provide cluster-wide, highly available access to any device in a cluster, from any node.
Generally, if a node fails while providing access to a global device, the Sun Cluster software switches over to another path to the device and redirects the access to that path. This redirection is easy with global devices because the same name is used for the device regardless of the path. Access to a remote device is performed in the same way as on a local device that uses the same name. Also, the API to access a global device on a cluster is the same as the API that is used to access a device locally.
Sun Cluster global devices include disks, CD-ROMs, and tapes. However, disks are the only multiported global devices that are supported. This limited support means that CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.
The Sun Cluster software manages global devices through a construct that is known as the device ID (DID) driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.
The DID driver is an integral part of the global device access feature of the cluster. The DID driver probes all nodes of the cluster and builds a list of unique disk devices. The DID driver also assigns each device a unique major and minor number that is consistent on all nodes of the cluster. Access to the global devices is through the unique DID assigned by the DID driver instead of the traditional Solaris DIDs.
This approach ensures that any application accessing disks, such as Solaris Volume Manager or Sun Java System Directory Server, uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from node to node. These numbers can change the Solaris device naming conventions as well.
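The idea of one cluster-wide name per device can be sketched in Python as follows. The sketch is not the DID driver. It assumes that each node reports a local path and a unique device serial for every disk that it can see. The function name, the serial values, and the `d1`, `d2` naming in the output are illustrative only.

```python
def assign_dids(reports):
    """Group per-node device reports under one cluster-wide ID per physical device.

    `reports` is an iterable of (node, local_path, device_serial) tuples.
    Devices that share a serial are treated as the same physical device.
    """
    did_by_serial = {}
    paths = {}
    for node, local_path, serial in reports:
        if serial not in did_by_serial:
            did_by_serial[serial] = f"d{len(did_by_serial) + 1}"
        paths.setdefault(did_by_serial[serial], []).append((node, local_path))
    return paths


reports = [
    ("node-a", "c1t0d0", "SERIAL-1"),  # multihost disk as seen from node-a
    ("node-b", "c2t0d0", "SERIAL-1"),  # same disk, different local name on node-b
    ("node-a", "c0t0d0", "SERIAL-2"),  # local disk on node-a
]
for did, device_paths in assign_dids(reports).items():
    print(did, device_paths)
```

Although the multihost disk has a different local name on each node, both paths resolve to the same cluster-wide ID, which is the property that applications rely on.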
The Sun Cluster software also manages local devices. These devices are accessible only on a node that is running a service and has a physical connection to the cluster. Local devices can have a performance benefit over global devices because local devices do not have to replicate state information on multiple nodes simultaneously. The failure of the domain of the device removes access to the device unless the device can be shared by multiple nodes.
Device groups enable volume manager disk groups to become “global” because they provide multipath and multihost support to the underlying disks. Each cluster node that is physically attached to the multihost disks provides a path to the device group.
In the Sun Cluster system, you can control multihost disks that are using Sun Cluster software by registering the multihost disks as device groups. This registration provides the Sun Cluster system with information about which nodes have a path to which volume manager disk groups. The Sun Cluster software creates a raw device group for each disk and tape device in the cluster. These cluster device groups remain in an offline state until you access them as global devices either by mounting a global file system or by accessing a raw database file.
A data service is the combination of software and configuration files that enables an application to run without modification in a Sun Cluster configuration. When running in a Sun Cluster configuration, an application runs as a resource under the control of the Resource Group Manager (RGM). A data service enables you to configure an application such as Sun Java System Web Server or Oracle database to run on a cluster instead of on a single server.
The software of a data service provides implementations of Sun Cluster management methods that perform the following operations on the application:
Starting the application
Stopping the application
Monitoring faults in the application and recovering from these faults
The configuration files of a data service define the properties of the resource that represents the application to the RGM.
The RGM controls the disposition of the failover and scalable data services in the cluster. The RGM is responsible for starting and stopping the data services on selected nodes of the cluster in response to cluster membership changes. The RGM enables data service applications to utilize the cluster framework.
The RGM controls data services as resources. These implementations are either supplied by Sun or created by a developer who uses a generic data service template, the Data Service Development Library API (DSDL API), or the Resource Management API (RMAPI). The cluster administrator creates and manages resources in containers that are called resource groups. RGM and administrator actions cause resources and resource groups to move between online and offline states.
A resource type is a collection of properties that describe an application to the cluster. This collection includes information about how the application is to be started, stopped, and monitored on nodes of the cluster. A resource type also includes application-specific properties that need to be defined in order to use the application in the cluster. Sun Cluster data services have several predefined resource types. For example, Sun Cluster HA for Oracle is the resource type SUNW.oracle-server and Sun Cluster HA for Apache is the resource type SUNW.apache.
A resource is an instance of a resource type that is defined cluster wide. The resource type enables multiple instances of an application to be installed on the cluster. When you initialize a resource, the RGM assigns values to application-specific properties and the resource inherits any properties on the resource type level.
Data services utilize several types of resources. Applications such as Apache Web Server or Sun Java System Web Server utilize network addresses (logical hostnames and shared addresses) on which the applications depend. Application and network resources form a basic unit that is managed by the RGM.
Resources that are managed by the RGM are placed into resource groups so that they can be managed as a unit. A resource group is a set of related or interdependent resources. For example, a resource derived from a SUNW.LogicalHostname resource type might be placed in the same resource group as a resource derived from an Oracle database resource type. A resource group migrates as a unit if a failover or switchover is initiated on the resource group.
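The relationship between resource types, resources, and resource groups can be modeled with a minimal Python sketch. The classes below are a hypothetical model and not the RGM API. The resource names, group name, and node names are invented for illustration, while the resource type names are taken from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    """An instance of a resource type, for example a logical hostname."""
    name: str
    resource_type: str
    node: Optional[str] = None  # node on which the resource is online, if any


@dataclass
class ResourceGroup:
    """A set of interdependent resources that is managed as a unit."""
    name: str
    resources: list = field(default_factory=list)

    def switch_to(self, node: str) -> None:
        """Bring every resource in the group online on the given node."""
        for resource in self.resources:
            resource.node = node


group = ResourceGroup(
    "oracle-rg",
    [
        Resource("oracle-lh", "SUNW.LogicalHostname"),
        Resource("oracle-server", "SUNW.oracle-server"),
    ],
)
group.switch_to("node-a")
group.switch_to("node-b")  # a failover or switchover moves the whole group
print([(r.name, r.node) for r in group.resources])
```

Because the switch is applied at the group level, the logical hostname and the database resource always end up on the same node.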
Data services enable applications to become highly available and scalable services that help prevent significant application interruption after any single failure within the cluster.
When you configure a data service, you must configure the data service as one of the following data service types:
Failover data service
Scalable data service
Parallel data service
Failover is the process by which the cluster automatically relocates an application from a failed primary node to a designated redundant secondary node. Failover applications have the following characteristics:
Capable of running on only one node of the cluster
Dependent on the cluster framework for high availability
If the fault monitor detects an error, it either attempts to restart the instance on the same node, or to start the instance on another node (failover), depending on how the data service has been configured. Failover services use a failover resource group, which is a container for application instance resources and network resources (logical hostnames). Logical hostnames are IP addresses that can be configured up on one node, and later, automatically configured down on the original node and configured up on another node.
Clients might have a brief interruption in service and might need to reconnect after the failover has finished. However, clients are not aware of the change in the physical server that is providing the service.
The scalable data service enables application instances to run on multiple nodes simultaneously. Scalable services use two resource groups. The scalable resource group contains the application resources and the failover resource group contains the network resources (shared addresses) on which the scalable service depends. The scalable resource group can be online on multiple nodes, so multiple instances of the service can be running simultaneously. The failover resource group that hosts the shared address is online on only one node at a time. All nodes that host a scalable service use the same shared address to host the service.
The cluster receives service requests through a single network interface (the global interface). These requests are distributed to the nodes, based on one of several predefined algorithms that are set by the load-balancing policy. The cluster can use the load-balancing policy to balance the service load between several nodes.
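As an illustration of how a load-balancing policy might distribute requests, the following Python sketch uses a simple weighted rotation. The text above does not specify the predefined algorithms, so this rotation is an assumption chosen for clarity, and the function name and weights are hypothetical.

```python
from itertools import cycle, islice


def weighted_schedule(weights):
    """Yield node names in a repeating pattern that is proportional to each weight.

    `weights` maps a node name to a positive integer weight.
    """
    slots = [node for node, weight in weights.items() for _ in range(weight)]
    return cycle(slots)


# node-a is weighted to receive twice as many requests as node-b.
schedule = weighted_schedule({"node-a": 2, "node-b": 1})
assignments = list(islice(schedule, 6))
print(assignments)
```

With these weights, node-a receives four of every six requests and node-b receives two, while clients continue to address the single shared address.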
Sun Cluster systems provide an environment that shares parallel execution of applications across all the nodes of the cluster by using parallel databases. Sun Cluster Support for Oracle Real Application Clusters is a set of packages that, when installed, enables Oracle Real Application Clusters to run on Sun Cluster nodes. This data service also enables Sun Cluster Support for Oracle Real Application Clusters to be managed by using Sun Cluster commands.
A parallel application has been instrumented to run in a cluster environment so that the application can be mastered by two or more nodes simultaneously. In an Oracle Real Application Clusters environment, multiple Oracle instances cooperate to provide access to the same shared database. The Oracle clients can use any of the instances to access the database. Thus, if one or more instances have failed, clients can connect to a surviving instance and continue to access the database.
Sun Cluster enables you to monitor how much of a specific system resource is being used by an object type such as a node, disk, network interface, Sun Cluster resource group, or Solaris zone. Monitoring system resource usage can be part of your resource management policy. Sun Cluster also enables you to control the CPU assigned to a resource group and to control the size of the processor set a resource group runs in.
By monitoring system resource usage through Sun Cluster, you can collect data that reflects how a service that uses specific system resources is performing. You can also discover resource bottlenecks or overload, and so preempt problems and manage workloads more efficiently. Data about system resource usage can help you determine which hardware resources are underutilized and which applications are using a lot of resources. Based on this data, you can assign applications to nodes that have the necessary resources and choose which node to fail over to. This consolidation can help you optimize the way you use your hardware and software resources.
If you consider a certain data value to be critical for a system resource, you can set a threshold for this value. When setting a threshold, you also choose how critical this threshold is by assigning it a severity level. If the threshold is crossed, Sun Cluster changes the severity level of the threshold to the severity level you choose. For more information about configuring data collection and thresholds, see Chapter 9, Configuring Control of CPU Usage, in Sun Cluster System Administration Guide for Solaris OS.
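The threshold behavior can be sketched in Python as follows. The sketch assumes an upper threshold, which is crossed when the measured value rises above the limit. The `Threshold` class, the `"ok"` state, and the severity name `"warning"` are hypothetical and do not reflect actual Sun Cluster property names.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """An upper limit on a monitored value, with the severity chosen for it."""
    limit: float
    severity: str       # severity level chosen when the threshold is set
    state: str = "ok"   # current state; "ok" until the threshold is crossed

    def update(self, value: float) -> str:
        """Set the state to the chosen severity if the value exceeds the limit."""
        self.state = self.severity if value > self.limit else "ok"
        return self.state


cpu_threshold = Threshold(limit=80.0, severity="warning")
print(cpu_threshold.update(65.0))
print(cpu_threshold.update(92.5))
```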
Each application and service running on a cluster has specific CPU needs. Table 2–2 lists the CPU control activities available on different versions of the Solaris Operating System.
Table 2–2 CPU Control
Solaris 9 Operating System: Assign CPU shares
Solaris 10 Operating System: Assign CPU shares
Solaris 10 Operating System: Assign CPU shares, Assign number of CPU, Create dedicated processor sets
The Fair Share Scheduler must be the default scheduler on the cluster if you want to apply CPU shares.
Controlling the CPU assigned to a resource group in a dedicated processor set in a non-global zone offers you the strictest level of control of CPU because if you reserve CPU for a resource group, this CPU is not available to other resource groups. For information about configuring CPU control, see Chapter 9, Configuring Control of CPU Usage, in Sun Cluster System Administration Guide for Solaris OS.
You can visualize system resource data and the CPU attribution in two ways: by using the command line or through the Sun Cluster Manager graphical user interface. The output from the command is a tabular representation of the monitoring data you request. Through the Sun Cluster Manager, you can visualize data in graphical form. The system resources that you choose to monitor determine the data you can visualize.