Sun Cluster Overview for Solaris OS

Previous: Chapter 2 Key Concepts for Sun Cluster

Chapter 3 Sun Cluster Architecture

Sun Cluster architecture permits a group of systems to be deployed, managed, and viewed as a single, large system.

This chapter contains the following sections:

Sun Cluster Hardware Environment

The following hardware components make up a cluster:

Cluster nodes with local disks (unshared) provide the main computing platform of the cluster.
Multihost storage provides disks that are shared between nodes.
Removable media, such as tapes and CD-ROMs, are configured as global devices.
Cluster interconnect provides a channel for internode communication.
Public network interfaces are the network interfaces that client systems use to access data services on the cluster.

Figure 3–1 illustrates how the hardware components work with each other.

Figure 3–1 Sun Cluster Hardware Components

Illustration: A two-node cluster with public and private networks,
interconnect hardware, local and multihost disks, console, and clients.

Sun Cluster Software Environment

To function as a cluster member, a node must have the following software installed:

Solaris software
Sun Cluster software
Data service application
Volume management (Solaris™ Volume Manager or VERITAS Volume Manager)

An exception is a configuration that uses volume management on the box, that is, in the storage hardware itself. This configuration might not require a software volume manager.

Figure 3–2 shows a high-level view of the software components that work together to create the Sun Cluster software environment.

Figure 3–2 Sun Cluster Software Architecture

Illustration: Sun Cluster software components such as the RGM,
CMM, CCR, volume managers, and the cluster file system.

Cluster Membership Monitor

To ensure that data is safe from corruption, all nodes must reach a consistent agreement on the cluster membership. When necessary, the Cluster Membership Monitor (CMM) coordinates a cluster reconfiguration of cluster services in response to a failure.

The CMM receives information about connectivity to other nodes from the cluster transport layer. The CMM uses the cluster interconnect to exchange state information during a reconfiguration.

After detecting a change in cluster membership, the CMM performs a synchronized configuration of the cluster. In this configuration, cluster resources might be redistributed, based on the new membership of the cluster.

The CMM runs entirely in the kernel.

Cluster Configuration Repository (CCR)

The CCR relies on the CMM to guarantee that a cluster is running only when quorum is established. The CCR is responsible for verifying data consistency across the cluster, performing recovery as necessary, and facilitating updates to the data.

Cluster File Systems

A cluster file system is a proxy between the kernel on one node and the underlying file system and volume manager, which run on a node that has a physical connection to the disk or disks.

Cluster file systems are dependent on global devices (disks, tapes, CD-ROMs). A global device can be accessed from any node in the cluster through the same file name (for example, /dev/global/). That node does not need a physical connection to the storage device. You can use a global device the same way as a regular device. For example, you can create a file system on a global device by using newfs or mkfs.

The cluster file system has the following features:

File access locations are transparent. A process can open a file that is located anywhere in the system. Also, processes on all nodes can use the same path name to locate a file.

Note – When the cluster file system reads files, it does not update the access time on those files.
Coherency protocols are used to preserve the UNIX file access semantics even if the file is accessed concurrently from multiple nodes.
Extensive caching is used with zero-copy bulk I/O movement to move file data efficiently.
The cluster file system provides highly available advisory file-locking functionality by using the fcntl(2) interfaces. Applications that run on multiple cluster nodes can synchronize access to data by using advisory file locking on a cluster file system file. File locks are recovered immediately from nodes that leave the cluster, and from applications that fail while holding locks.
Continuous access to data is ensured, even when failures occur. Applications are not affected by failures if a path to disks is still operational. This guarantee is maintained for raw disk access and all file system operations.
Cluster file systems are independent from the underlying file system and volume management software. Cluster file systems make any supported on-disk file system global.
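
The advisory locking feature above can be illustrated with a minimal Python sketch. It uses the standard fcntl module, whose lockf() call wraps fcntl(2) record locking. The file path is a hypothetical stand-in created in a temporary directory, not a real cluster file system mount, and the function name append_record is an assumption made for this example. Advisory locks only coordinate processes that also request the lock.

```python
import fcntl
import os
import tempfile


def append_record(path, record):
    """Append one line to `path` while holding an exclusive advisory lock."""
    with open(path, "a") as f:
        # lockf() wraps fcntl(2) record locking. LOCK_EX blocks until the
        # lock is granted. Only cooperating processes honor the lock.
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            f.write(record + "\n")
            f.flush()
            os.fsync(f.fileno())
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)


# Hypothetical stand-in for a file on a cluster file system mount.
demo_path = os.path.join(tempfile.mkdtemp(), "shared.log")
append_record(demo_path, "node1: started")
append_record(demo_path, "node2: started")
with open(demo_path) as f:
    lines = f.read().splitlines()
```

On a cluster file system, writers on different nodes could call the same function against the same path name. This sketch runs both calls in one local process only to stay self-contained.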

Scalable Data Services

The primary goal of cluster networking is to provide scalability for data services. Scalability means that as the load offered to a service increases, a data service can maintain a constant response time as new nodes are added to the cluster and new server instances are run. A good example of a scalable data service is a web service.

Typically, a scalable data service is composed of several instances, each of which runs on a different node of the cluster. Together, these instances implement the functionality of the service and behave as a single service to a remote client. For example, in a scalable web service with several httpd daemons that run on different nodes, any daemon can serve a client request. The daemon that serves the request depends on a load-balancing policy. The reply to the client appears to come from the service, not from the particular daemon that serviced the request, thus preserving the single-service appearance.

The following figure depicts the scalable service architecture.

Figure 3–3 Scalable Data Service Architecture

Illustration: A data service request running on multiple nodes.

The nodes that are not hosting the global interface (proxy nodes) have the shared address hosted on their loopback interfaces. Packets that are coming into the global interface are distributed to other cluster nodes, based on configurable load-balancing policies. The possible load-balancing policies are described next.

Load-Balancing Policies

Load balancing improves performance of the scalable service, both in response time and in throughput.

Two classes of scalable data services exist: pure and sticky. A pure service is one where any instance can respond to client requests. A sticky service is one where the cluster balances the load across nodes, but requests from a given client are directed to the same node. Those requests are not redirected to other instances.

A pure service uses a weighted load-balancing policy. Under this load-balancing policy, client requests are by default uniformly distributed over the server instances in the cluster. For example, in a three-node cluster where each node has the weight of 1, each node services one-third of the requests from any client on behalf of that service. Weights can be changed at any time through the scrgadm(1M) command interface or through the SunPlex Manager GUI.
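
The arithmetic behind the weighted policy can be sketched as follows. This is an illustration only, not the Sun Cluster implementation: the function name expected_shares and the node names are assumptions, and the sketch assumes that each node's share of requests is proportional to its weight.

```python
from fractions import Fraction


def expected_shares(weights):
    """Return each node's expected fraction of requests, proportional to weight."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one node needs a positive weight")
    return {node: Fraction(w, total) for node, w in weights.items()}


# Three nodes with weight 1 each: every node is expected to serve one-third.
equal = expected_shares({"node1": 1, "node2": 1, "node3": 1})

# Raising one weight to 3 shifts the expected split to 3/5, 1/5, 1/5.
skewed = expected_shares({"node1": 3, "node2": 1, "node3": 1})
```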

A sticky service has two types: ordinary sticky and wildcard sticky. Sticky services allow concurrent application-level sessions over multiple TCP connections to share in-state memory (application session state).

Ordinary sticky services permit a client to share state between multiple concurrent TCP connections. The client is said to be “sticky” toward the server instance listening on a single port. The client is guaranteed that all requests go to the same server instance, if that instance remains up and accessible and the load balancing policy is not changed while the service is online.

Wildcard sticky services use dynamically assigned port numbers, but still expect client requests to go to the same node. The client is “sticky wildcard” over ports toward the same IP address.
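
The difference between the two sticky types can be sketched as a choice of lookup key. In this hypothetical Python example, an ordinary sticky service keys on the client address and the service port, while a wildcard sticky service keys on the client address alone. The hash-based selection, the policy names, and the function pick_instance are assumptions for illustration. The text does not describe how Sun Cluster actually maps clients to instances.

```python
import hashlib


def pick_instance(instances, client_ip, service_port, policy):
    """Choose a server instance for a request under a hypothetical policy name."""
    if policy == "ordinary_sticky":
        key = f"{client_ip}:{service_port}"  # sticky per client and port
    elif policy == "wildcard_sticky":
        key = client_ip                      # sticky per client, any port
    else:
        raise ValueError(f"unknown policy: {policy}")
    # A stable hash keeps the choice identical across calls as long as the
    # instance list does not change.
    digest = hashlib.sha256(key.encode()).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


nodes = ["node1", "node2", "node3"]
first = pick_instance(nodes, "192.0.2.10", 5000, "wildcard_sticky")
second = pick_instance(nodes, "192.0.2.10", 6001, "wildcard_sticky")
```

In this sketch the mapping holds only while the instance list stays the same, which mirrors the condition above that the instance remains up and the policy is not changed.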

Multihost Disk Storage

Sun Cluster software makes disks highly available by utilizing multihost disk storage, which can be connected to more than one node at a time. Volume management software can be used to arrange these disks into shared storage that is mastered by a cluster node. The disks are then configured to move to another node if a failure occurs. The use of multihosted disks in Sun Cluster systems provides a variety of benefits, including the following:

Global access to file systems
Multiple access paths to file systems and data
Tolerance for single-node failures
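
The failover behavior described above can be sketched as a small selection function. This is a simplified model under stated assumptions: the function name choose_primary, the preference order, and the node names are hypothetical, and the real software involves the volume manager and cluster framework rather than a single function.

```python
def choose_primary(connected_nodes, cluster_members, current_primary=None):
    """Pick the node that masters a set of multihost disks.

    connected_nodes: nodes physically attached to the disks, in preference order.
    cluster_members: nodes that are currently cluster members.
    """
    candidates = [n for n in connected_nodes if n in cluster_members]
    if not candidates:
        return None  # no surviving node has a path to the disks
    if current_primary in candidates:
        return current_primary  # no failure of the primary, so nothing moves
    return candidates[0]  # first surviving node in preference order


attached = ["node1", "node2"]
before = choose_primary(attached, {"node1", "node2"}, current_primary="node1")
after = choose_primary(attached, {"node2"}, current_primary="node1")
```

Because the disks are attached to more than one node, losing node1 still leaves a candidate, which is the tolerance for single-node failures listed above.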

Cluster Interconnect

All nodes must be connected by the cluster interconnect through at least two redundant, physically independent networks, or paths, to avoid a single point of failure. While two interconnects are required for redundancy, up to six can be used to spread traffic, which avoids bottlenecks and improves redundancy and scalability. The Sun Cluster interconnect uses Fast Ethernet, Gigabit Ethernet, InfiniBand, Sun Fire Link, or the Scalable Coherent Interface (SCI, IEEE 1596-1992), enabling high-performance cluster-private communications.

In clustered environments, high-speed, low-latency interconnects and protocols for internode communications are essential. The SCI interconnect in Sun Cluster systems offers improved performance over typical network interface cards (NICs). Sun Cluster uses the Remote Shared Memory (RSM™) interface for internode communication across a Sun Fire Link network. RSM is a Sun messaging interface that is highly efficient for remote memory operations.

The RSM Reliable Datagram Transport (RSMRDT) consists of a driver that is built on top of the RSM API and a library that exports the RSMRDT-API interface. The driver provides enhanced Oracle Parallel Server/Real Application Clusters performance. The driver also enhances load-balancing and high-availability (HA) functions by providing them directly inside the driver, making them available to the clients transparently.

The cluster interconnect consists of the following hardware components:

Adapters – The network interface cards that reside in each cluster node. A network adapter with multiple interfaces could become a single point of failure if the entire adapter fails.
Junctions – The switches that reside outside of the cluster nodes. Junctions perform pass-through and switching functions to enable you to connect more than two nodes. In a two-node cluster, you do not need junctions because the nodes can be directly connected to each other through redundant physical cables. Those redundant cables are connected to redundant adapters on each node. Configurations with more than two nodes require junctions.
Cables – The physical connections that are placed between either two network adapters or an adapter and a junction.

Figure 3–4 shows how the three components are connected.

Figure 3–4 Cluster Interconnect

Illustration: Two nodes connected by a transport adapter, cables,
and a transport junction.
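
The cabling rules for adapters, junctions, and cables can be expressed as a small consistency check. The following Python sketch is a hypothetical model, not a Sun Cluster tool: the endpoint tuples, the function name check_interconnect, and the adapter names are assumptions. It checks that each node has at least two cabled adapters, that clusters of more than two nodes do not use direct adapter-to-adapter cables, and that no cable joins two junctions. It does not detect two interfaces that share one physical adapter card.

```python
def check_interconnect(nodes, cables):
    """Return a list of problems found in a hypothetical interconnect layout.

    Each cable is a pair of endpoints. An endpoint is either
    ("adapter", node_name, adapter_name) or ("junction", junction_name).
    """
    problems = []
    adapters = {node: set() for node in nodes}
    for a, b in cables:
        kinds = (a[0], b[0])
        if kinds == ("junction", "junction"):
            problems.append("a cable joins two junctions")
        if len(nodes) > 2 and kinds == ("adapter", "adapter"):
            problems.append("direct cable used with more than two nodes")
        for end in (a, b):
            if end[0] == "adapter":
                adapters.setdefault(end[1], set()).add(end[2])
    for node in sorted(adapters):
        if len(adapters[node]) < 2:
            problems.append(f"{node} has fewer than two cabled adapters")
    return problems


# Two nodes joined directly by two redundant cables; adapter names are made up.
two_node = [
    (("adapter", "node1", "hme0"), ("adapter", "node2", "hme0")),
    (("adapter", "node1", "hme1"), ("adapter", "node2", "hme1")),
]
ok = check_interconnect(["node1", "node2"], two_node)
```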

IP Network Multipathing Groups

Public network adapters are organized into IP multipathing groups (multipathing groups). Each multipathing group has one or more public network adapters. Each adapter in a multipathing group can be active, or you can configure standby interfaces that are inactive unless a failover occurs.

Multipathing groups provide the foundation for logical hostname and shared address resources. The same multipathing group on a node can host any number of logical hostname or shared address resources. To monitor public network connectivity of cluster nodes, you can create multipathing groups.

For more information about logical hostname and shared address resources, see the Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

Public Network Interfaces

Clients connect to the cluster through the public network interfaces. Each network adapter card can connect to one or more public networks, depending on whether the card has multiple hardware interfaces. You can set up nodes to include multiple public network interface cards that are configured so that multiple cards are active and serve as failover backups for one another. If one of the adapters fails, the Solaris Internet Protocol (IP) network multipathing software on Sun Cluster is called to fail over the defective interface to another adapter in the group.
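
The adapter failover described above can be modeled in a few lines. This Python sketch is only a conceptual illustration under assumptions: the function name fail_over, the adapter names, and the addresses are hypothetical, and the real IP network multipathing software handles failure detection and address migration in the operating system.

```python
def fail_over(assignments, failed):
    """Move every address hosted on `failed` to another adapter in the group.

    `assignments` maps adapter name -> list of IP addresses it hosts.
    Returns a new mapping; the input is left unchanged.
    """
    survivors = [a for a in assignments if a != failed]
    if not survivors:
        raise RuntimeError("no healthy adapter left in the multipathing group")
    result = {a: list(assignments[a]) for a in survivors}
    result[failed] = []
    # This sketch always picks the first surviving adapter as the target.
    target = survivors[0]
    result[target].extend(assignments.get(failed, []))
    return result


# Both adapters are active; names and addresses are made up for the example.
group = {"qfe0": ["192.0.2.10"], "qfe1": ["192.0.2.11"]}
after_failure = fail_over(group, "qfe0")
```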
