High-Availability Framework

The Oracle Solaris Cluster software makes all components on the “path” between users and data highly available, including network interfaces, the applications themselves, the file system, and the multihost devices. In general, a cluster component is highly available if it survives any single (software or hardware) failure in the system. Failures that are caused by data corruption within the application itself are excluded.

The following table shows the kinds of Oracle Solaris Cluster component failures (both hardware and software) and the kinds of recovery that are built into the high-availability framework.

Table 2 Levels of Oracle Solaris Cluster Failure Detection and Recovery

Failed Cluster Component	Software Recovery	Hardware Recovery
Data service	HA API, HA framework	Not applicable
Public network adapter	IPMP groups, trunk and DLMP link aggregations, and VNICs that are directly backed by link aggregations	Multiple public network adapter cards
Cluster file system	Primary and secondary replicas	Multihost devices
Mirrored multihost device	Volume management (Solaris Volume Manager)	Hardware RAID-5
Global device	Primary and secondary replicas	Multiple paths to the device, cluster transport junctions
Private network	HA transport software	Multiple private hardware-independent networks
Node	CMM, failfast driver	Multiple nodes
Zone	HA API, HA framework	Not applicable

The Oracle Solaris Cluster high-availability framework detects a node failure quickly and migrates the framework resources to a remaining node in the cluster. At no time are all framework resources unavailable. Framework resources that are unaffected by a failed node are fully available during recovery. Furthermore, framework resources of the failed node become available as soon as they are recovered. A recovered framework resource does not have to wait for all other framework resources to complete their recovery.

Highly available framework resources are recovered transparently to most of the applications (data services) that are using the resource. The semantics of framework resource access are fully preserved across node failure. The applications cannot detect that the framework resource server has been moved to another node. Failure of a single node is completely transparent to programs on the remaining nodes that use the files, devices, and disk volumes attached to the failed node. This transparency exists only if an alternative hardware path to the disks exists from another node. An example is the use of multihost devices that have ports to multiple nodes.

Global Devices

The Oracle Solaris Cluster software uses global devices to provide cluster-wide, highly available access to any device in a cluster from any node. In general, if a node fails while providing access to a global device, the Oracle Solaris Cluster software automatically discovers another path to the device and redirects the access to that path. For more information, see Device IDs and DID Pseudo Driver.

Oracle Solaris Cluster global devices include disks, CD-ROMs, and tapes. However, the only multiported global devices that Oracle Solaris Cluster software supports are disks. Consequently, CD-ROM and tape devices are not currently highly available devices. The local disks on each server are also not multiported, and thus are not highly available devices.

The cluster automatically assigns a unique ID to each disk, CD-ROM, and tape device in the cluster. This assignment enables consistent access to each device from any node in the cluster. The global device namespace is held in the /dev/global directory. See Global Namespace for more information.

Multiported global devices provide more than one path to a device. Because multihost disks are part of a device group that is hosted by more than one cluster node, the multihost disks are made highly available.

Device IDs and DID Pseudo Driver

The Oracle Solaris Cluster software manages shared devices through a construct known as the DID pseudo driver. This driver is used to automatically assign unique IDs to every device in the cluster, including multihost disks, tape drives, and CD-ROMs.

The DID pseudo driver is an integral part of the shared device access feature of the cluster. The DID driver probes all nodes of the cluster, builds a list of unique devices, and assigns each device a unique major and minor number that is consistent on all nodes of the cluster. Access to shared devices is performed by using the normalized DID logical name instead of the traditional Oracle Solaris logical name, such as c0t0d0 for a disk.

This approach ensures that any application that accesses disks (such as a volume manager or applications that use raw devices) uses a consistent path across the cluster. This consistency is especially important for multihost disks, because the local major and minor numbers for each device can vary from node to node, thus changing the Oracle Solaris device naming conventions as well. For example, Host1 might identify a multihost disk as c1t2d0, and Host2 might identify the same disk completely differently, as c3t2d0. The DID framework assigns a common (normalized) logical name, such as d10, that the nodes use instead, giving each node a consistent mapping to the multihost disk.
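The normalization described above can be illustrated with a minimal Python sketch. This is not the DID driver's implementation. The function name build_did_map, the serial-number strings used to recognize the same physical disk, and the resulting dN numbering are all assumptions made for illustration.

```python
from collections import defaultdict


def build_did_map(node_views):
    """Map each unique physical device to one cluster-wide name.

    node_views is a hypothetical structure:
        {node_name: {local_logical_name: device_identity}}
    where device_identity stands in for whatever uniquely identifies
    the physical device (assumed here to be a serial-number string).
    """
    did_by_identity = {}
    paths = defaultdict(dict)
    for node in sorted(node_views):
        for local_name, identity in sorted(node_views[node].items()):
            # The first sighting of a device allocates the next dN name.
            if identity not in did_by_identity:
                did_by_identity[identity] = f"d{len(did_by_identity) + 1}"
            # Every node's local name is recorded under the shared name.
            paths[did_by_identity[identity]][node] = local_name
    return dict(paths)


views = {
    "Host1": {"c1t2d0": "SERIAL-A", "c0t0d0": "SERIAL-H1-LOCAL"},
    "Host2": {"c3t2d0": "SERIAL-A", "c0t0d0": "SERIAL-H2-LOCAL"},
}
did_map = build_did_map(views)
for did_name, node_paths in did_map.items():
    print(did_name, node_paths)
```

In this sketch, the multihost disk that Host1 sees as c1t2d0 and Host2 sees as c3t2d0 ends up under a single normalized name, while each local disk receives its own name even though both hosts call it c0t0d0.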

You update and administer device IDs with the cldevice command. See the cldevice(1CL) man page.

Zone Cluster Membership

Oracle Solaris Cluster software also tracks zone cluster membership by detecting when a zone cluster node boots up or goes down. These changes also trigger a reconfiguration. A reconfiguration can redistribute cluster resources among the nodes in the cluster.

Cluster Membership Monitor

To ensure that data is kept safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the Cluster Membership Monitor (CMM) coordinates a cluster reconfiguration of cluster services (applications) in response to a failure.

The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration. A problem called split brain can occur when the cluster interconnect between cluster nodes is lost and the cluster becomes partitioned into subclusters, each of which believes that it is the only partition. A subcluster that is not aware of the other subclusters could cause conflicts in shared resources, such as duplicate network addresses, and could corrupt data. The quorum subsystem manages the situation to ensure that split brain does not occur, and that one partition survives. For more information, see Quorum and Quorum Devices.

After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster. In a synchronized configuration, cluster resources might be redistributed, based on the new membership of the cluster.

Failfast Mechanism

The failfast mechanism detects a critical problem on a global-cluster node.

When the critical problem is located in a node, Oracle Solaris Cluster forcibly shuts down the node. Oracle Solaris Cluster then removes the node from cluster membership.

If a node loses connectivity with other nodes, the node attempts to form a cluster with the nodes with which communication is possible. If that set of nodes does not form a quorum, Oracle Solaris Cluster software halts the node and “fences” the node from the shared disks, that is, prevents the node from accessing the shared disks. Fencing is a mechanism that is used by the cluster to protect the data integrity of a shared disk during split-brain situations. By default, the scinstall utility in Typical Mode leaves global fencing enabled.
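The decision that a node makes after it loses connectivity can be sketched as follows. This is a simplified illustration, not the cluster's quorum algorithm. It assumes a plain majority rule over configured votes, and the names resolve_partition, votes, and the "qd" quorum device entry are hypothetical.

```python
def resolve_partition(reachable, votes):
    """Decide what a node does after the cluster becomes partitioned.

    reachable: names of vote holders this node can still communicate
               with, including the node itself (assumption).
    votes:     configured vote count for every vote holder (assumption).
    """
    total = sum(votes.values())
    mine = sum(votes[name] for name in reachable)
    # Assumed rule: a partition survives only with a strict majority.
    if 2 * mine > total:
        return "continue"
    # Without quorum the node is halted and fenced from shared disks.
    return "halt_and_fence"


# Hypothetical two-node cluster with one quorum device, one vote each.
votes = {"node1": 1, "node2": 1, "qd": 1}
surviving = resolve_partition({"node1", "qd"}, votes)
isolated = resolve_partition({"node2"}, votes)
print(surviving, isolated)
```

Because at most one partition can hold a strict majority of the votes, two partitions cannot both continue under this rule, which is the property that fencing relies on to protect the shared disks.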

You can turn off fencing for selected disks or for all disks.

Caution - If you turn off fencing under the wrong circumstances, your data can be vulnerable to corruption during application failover. Examine this data corruption possibility carefully when you are considering turning off fencing. If your shared storage device does not support the SCSI protocol, such as a Serial Advanced Technology Attachment (SATA) disk, or if you want to allow access to the cluster's storage from nodes outside the cluster, turn off fencing.

If one or more cluster-specific daemons die, Oracle Solaris Cluster software declares that a critical problem has occurred. When this occurs, Oracle Solaris Cluster shuts down and removes the node where the problem occurred.

When a cluster-specific daemon that runs on a node fails and the node panics, a message similar to the following is displayed on the console.

panic[cpu1]/thread=2a10007fcc0: Failfast: Aborting because "pmfd" died in zone "global" (zone id 0)
35 seconds ago.
409b8 cl_runtime:__0FZsc_syslog_msg_log_no_argsPviTCPCcTB+48 (70f900, 30, 70df54, 407acc, 0)
%l0-7: 1006c80 000000a 000000a 10093bc 406d3c80 7110340 0000000 4001 fbf0
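
If you collect console logs, a small script can extract the daemon and zone from such a message. The following Python sketch assumes that the first line matches the wording of the sample above. The real message is only described as similar to this one, so treat the pattern as an assumption. The name parse_failfast is hypothetical.

```python
import re

# Pattern derived only from the sample console message; the exact
# wording on a real system may differ (assumption).
FAILFAST_RE = re.compile(
    r'Failfast: Aborting because "(?P<daemon>[^"]+)" died in zone '
    r'"(?P<zone>[^"]+)" \(zone id (?P<zone_id>\d+)\)'
)


def parse_failfast(line):
    """Return the daemon, zone, and zone ID from a failfast panic line,
    or None if the line does not match the assumed format."""
    match = FAILFAST_RE.search(line)
    if match is None:
        return None
    return {
        "daemon": match["daemon"],
        "zone": match["zone"],
        "zone_id": int(match["zone_id"]),
    }


sample = (
    'panic[cpu1]/thread=2a10007fcc0: Failfast: Aborting because '
    '"pmfd" died in zone "global" (zone id 0)'
)
info = parse_failfast(sample)
print(info)
```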

Cluster Configuration Repository (CCR)

To maintain an accurate representation of an Oracle Solaris Cluster configuration, clustered systems require configuration control. The Oracle Solaris Cluster software uses a Cluster Configuration Repository (CCR) to store the current cluster configuration information. The CCR uses a two-phase commit algorithm for updates: An update must be successfully completed on all cluster members or the update is rolled back. The CCR uses the cluster interconnect to apply the distributed updates.
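
The all-or-nothing behavior of a two-phase commit can be sketched in a few lines of Python. This is a generic illustration of the technique, not the CCR implementation. The Member class, its healthy flag, and the two_phase_update function are hypothetical names, and the cluster interconnect is replaced by direct method calls.

```python
class Member:
    """Hypothetical cluster member holding a copy of the configuration."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.committed = {}
        self.staged = None

    def prepare(self, update):
        # Phase 1: stage the update and vote on whether it can be applied.
        if not self.healthy:
            return False
        self.staged = dict(update)
        return True

    def commit(self):
        # Phase 2: make the staged update permanent.
        self.committed.update(self.staged)
        self.staged = None

    def rollback(self):
        # Discard the staged update without touching committed data.
        self.staged = None


def two_phase_update(members, update):
    """Apply update on every member, or on none of them."""
    prepared = []
    for member in members:
        if member.prepare(update):
            prepared.append(member)
        else:
            # One refusal aborts the update everywhere.
            for staged_member in prepared:
                staged_member.rollback()
            return False
    for member in members:
        member.commit()
    return True


nodes = [Member("node1"), Member("node2")]
ok = two_phase_update(nodes, {"cluster.name": "demo"})

nodes_bad = [Member("node1"), Member("node2", healthy=False)]
failed = two_phase_update(nodes_bad, {"cluster.name": "demo"})
print(ok, failed)
```

In the second call, node1 stages the update but discards it when node2 cannot prepare, so no member ends up with a configuration that the others lack.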

Caution - Although the CCR consists of text files, never edit the CCR files yourself. Each file contains a checksum record to ensure consistency between nodes. Updating CCR files yourself can cause a node or the entire cluster to stop working.

The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.