A cluster is two or more systems, or nodes, that work together as a single, continuously available system to provide applications, system resources, and data to users. Each node on a cluster is a fully functional standalone system. However, in a clustered environment, the nodes are connected by an interconnect and work together as a single entity to provide increased availability and performance.
Highly available clusters provide nearly continuous access to data and applications by keeping the cluster running through failures that would normally bring down a single server system. No single failure—hardware, software, or network—can cause a cluster to fail. By contrast, fault-tolerant hardware systems provide constant access to data and applications, but at a higher cost because of specialized hardware. Fault-tolerant systems usually have no provision for software failures.
Each Sun Cluster system is a collection of tightly coupled nodes that provide a single administration view of network services and applications. The Sun Cluster system achieves high availability through a combination of the following hardware and software:
Redundant disk systems provide storage. These disk systems are generally mirrored to permit uninterrupted operation if a disk or subsystem fails. Redundant connections to the disk systems ensure that data is not isolated if a server, controller, or cable fails. A high-speed interconnect among nodes provides access to resources. All nodes in the cluster are also connected to a public network, enabling clients on multiple networks to access the cluster.
Redundant hot-swappable components, such as power supplies and cooling systems, improve availability by enabling systems to continue operation after a hardware failure. Hot-swappable components provide the ability to add or remove hardware components in a functioning system without bringing it down.
Sun Cluster software's high-availability framework detects a node failure quickly and migrates the application or service to another node that runs in an identical environment. At no time are all applications unavailable. Applications unaffected by a down node are fully available during recovery. Furthermore, applications of the failed node become available as soon as they are recovered. A recovered application does not have to wait for all other applications to complete their recovery.
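The detect-and-migrate sequence described above can be sketched as follows. This is a minimal illustration, not Sun Cluster code: the `Cluster` class, the node and resource-group names, and the timeout value are all hypothetical.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a node is declared failed (assumed value)

class Cluster:
    """Toy model: detect a silent node and migrate its resource groups."""

    def __init__(self, nodes):
        self.last_heartbeat = {n: 0.0 for n in nodes}
        self.resource_groups = {}          # resource group -> hosting node

    def heartbeat(self, node, now):
        self.last_heartbeat[node] = now

    def place(self, group, node):
        self.resource_groups[group] = node

    def check_and_failover(self, now):
        """Declare silent nodes failed; move only their resource groups."""
        failed = {n for n, t in self.last_heartbeat.items()
                  if now - t > HEARTBEAT_TIMEOUT}
        healthy = [n for n in self.last_heartbeat if n not in failed]
        for group, node in self.resource_groups.items():
            if node in failed and healthy:
                # Groups on healthy nodes are untouched, and each failed
                # group recovers independently of the others.
                self.resource_groups[group] = healthy[0]
        return failed
```

Note that groups hosted on surviving nodes are never touched during recovery, which mirrors the point above: applications unaffected by the failed node stay available throughout.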
An application is highly available if it survives any single software or hardware failure in the system. Failures that are caused by bugs or data corruption within the application itself are excluded. The following apply to highly available applications:
Recovery is transparent to the applications that use a resource.
Resource access is fully preserved across node failure.
Applications cannot detect that the hosting node has changed.
Failure of a single node is completely transparent to programs on remaining nodes that use the files, devices, and disk volumes attached to this node.
Failover and scalable services and parallel applications enable you to make your applications highly available and to improve an application's performance on a cluster.
A failover service provides high availability through redundancy. When a failure occurs, a running application can be configured either to restart on the same node or to move to another node in the cluster, without user intervention.
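The restart-or-move decision can be expressed as a small policy sketch. The `recover` function, the restart budget, and the node names are hypothetical; real failover policy is set through cluster configuration, not application code.

```python
def recover(service, nodes, max_local_restarts=2):
    """Return the node that should host `service` after a failure.

    Restart on the current node until the local restart budget is
    exhausted, then fail over to the next configured node -- in both
    cases without user intervention.
    """
    if service["restarts"] < max_local_restarts:
        service["restarts"] += 1
        return service["node"]          # restart in place
    service["restarts"] = 0             # fresh budget on the new node
    others = [n for n in nodes if n != service["node"]]
    service["node"] = others[0]         # fail over
    return service["node"]
```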
To increase performance, a scalable service leverages the multiple nodes in a cluster to concurrently run an application. In a scalable configuration, each node in the cluster can provide data and process client requests.
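The contrast with failover is that every node serves requests at once. A round-robin sketch of that request distribution follows; the `serve` function and the request and node names are illustrative assumptions, not the actual Sun Cluster load-balancing policy.

```python
from itertools import cycle

def serve(requests, nodes):
    """Spread client requests across every cluster node (round-robin).

    In a scalable configuration all nodes provide data concurrently,
    so adding a node adds capacity rather than sitting idle as a spare.
    """
    rr = cycle(nodes)
    return {req: next(rr) for req in requests}
```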
Parallel databases enable multiple instances of the database server to do the following:
Participate in the cluster
Handle different queries on the same database simultaneously
Provide parallel query capability on large queries
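A parallel query over a large table can be pictured as splitting the scan across instances and merging the partial results. This sketch uses threads to stand in for database-server instances; the `parallel_query` function and its slicing scheme are assumptions for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_query(rows, predicate, n_instances):
    """Split a large scan across server instances and merge the results.

    Each instance scans its own slice of the table concurrently;
    the coordinator concatenates the partial result sets in order.
    """
    chunk = -(-len(rows) // n_instances)  # ceiling division
    slices = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=n_instances) as pool:
        partials = pool.map(lambda s: [r for r in s if predicate(r)], slices)
    return [r for part in partials for r in part]
```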
For more information about failover and scalable services and parallel applications, see Data Service Types.
Clients make data requests to the cluster through the public network. Each cluster node is connected to at least one public network through one or multiple public network adapters.
IP network multipathing enables a server to have multiple network ports connected to the same subnet. The IP network multipathing software provides resilience by detecting the failure or repair of a network adapter; when a failure is detected, the software switches the network address to an alternate adapter in the group. When more than one network adapter is functional, IP network multipathing also increases data throughput by spreading outbound packets across adapters.
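Both behaviors can be sketched together: keep the address on a working adapter, and stripe outbound traffic over every functional adapter. The functions, the probe callback, and the adapter names (`hme0`, `qfe0`) are illustrative assumptions, not the IPMP implementation.

```python
def healthy_adapters(adapters, probe):
    """Adapters in the group whose probe (e.g. test traffic) succeeds."""
    return [a for a in adapters if probe(a)]

def place_address(adapters, probe, current):
    """Keep the network address on a working adapter in the group."""
    up = healthy_adapters(adapters, probe)
    return current if current in up else up[0]   # switch on failure

def spread_outbound(packets, adapters, probe):
    """Stripe outbound packets across all functional adapters."""
    up = healthy_adapters(adapters, probe)
    return {p: up[i % len(up)] for i, p in enumerate(packets)}
```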
Multihost storage makes disks highly available by connecting the disks to multiple nodes. Multiple nodes enable multiple paths to access the data; if one path fails, another is available to take its place.
Multihost disks enable the following cluster processes:
Tolerating single-node failures.
Centralizing application data, application binaries, and configuration files.
Protecting against node failures. If client requests are accessing the data through a node that fails, the requests are switched over to use another node that has a direct connection to the same disks.
Providing access either globally through a primary node that “masters” the disks, or by direct concurrent access through local paths.
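The path-failover behavior in the list above can be sketched as a read that falls through to the next direct connection when the preferred path fails. The `read_block` function and the node and device names are hypothetical.

```python
def read_block(block, paths, path_ok):
    """Read through the first working path to a multihosted disk.

    `paths` lists the node-to-disk connections in preference order,
    with the current primary first. If the primary path fails, the
    read is retried transparently through the next direct connection.
    """
    for path in paths:
        if path_ok(path):
            return (path, f"data@{block}")
    raise IOError("no path to device")
```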
A volume manager enables you to manage large numbers of disks and the data on those disks. Volume managers can increase storage capacity and data availability by offering the following features:
Disk-drive striping and concatenation
Disk-drive hot spares
Disk-failure handling and disk replacements
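Striping, the first feature listed, spreads consecutive blocks of a volume across its member disks. A minimal sketch of that block-to-disk mapping follows; the `stripe_location` function and its parameters are assumptions for illustration, not a volume manager's actual layout code.

```python
def stripe_location(block, n_disks, stripe_unit=1):
    """Map a logical block to (disk index, block offset on that disk).

    With a stripe unit of one block, consecutive logical blocks land
    on consecutive disks, so a large sequential I/O engages every
    spindle in the volume at once.
    """
    unit = block // stripe_unit        # which stripe unit holds this block
    disk = unit % n_disks              # rotate units across member disks
    offset = (unit // n_disks) * stripe_unit + block % stripe_unit
    return disk, offset
```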
Sun Cluster systems support the following volume managers:
Solaris Volume Manager
VERITAS Volume Manager
Sun StorEdge Traffic Manager software is fully integrated into the core I/O framework of the Solaris Operating System, beginning with the Solaris 8 release. Sun StorEdge Traffic Manager software enables you to more effectively represent and manage devices that are accessible through multiple I/O controller interfaces within a single instance of the Solaris operating environment. The Sun StorEdge Traffic Manager architecture enables the following:
Protection against I/O outages due to I/O controller failures
Automatic switches to an alternate controller upon an I/O controller failure
Increased I/O performance by load balancing across multiple I/O channels
Sun Cluster systems support the use of hardware Redundant Array of Independent Disks (RAID) and host-based software RAID. Hardware RAID uses the storage array's or storage system's hardware redundancy to ensure that independent hardware failures do not impact data availability. If you mirror across separate storage arrays, host-based software RAID ensures that independent hardware failures do not impact data availability when an entire storage array is offline. Although you can use hardware RAID and host-based software RAID concurrently, you need only one RAID solution to maintain a high degree of data availability.
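The host-based mirroring case can be sketched as a RAID-1 volume whose submirrors live on separate arrays: writes go to both, and reads are satisfied by whichever array is still online. The `Mirror` class is a hypothetical illustration, not Solaris Volume Manager or VERITAS Volume Manager code.

```python
class Mirror:
    """Host-based software RAID-1 across two storage arrays (sketch)."""

    def __init__(self):
        self.arrays = [{}, {}]        # two independent storage arrays
        self.online = [True, True]

    def write(self, block, data):
        # Every write lands on each online submirror.
        for i, arr in enumerate(self.arrays):
            if self.online[i]:
                arr[block] = data

    def read(self, block):
        # Any surviving submirror can satisfy the read, so taking an
        # entire array offline does not make the data unavailable.
        for i, arr in enumerate(self.arrays):
            if self.online[i] and block in arr:
                return arr[block]
        raise IOError("block unavailable on all submirrors")
```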
Because one of the inherent properties of clustered systems is shared resources, a cluster requires a file system that addresses the need for files to be shared coherently. The Sun Cluster file system enables users or applications to access any file on any node of the cluster by using remote or local standard UNIX APIs. Sun Cluster systems support the following file systems:
UNIX file system (UFS)
Sun StorEdge QFS file system
VERITAS file system (VxFS)
If an application is moved from one node to another, it requires no change to access the same files. Existing applications can fully utilize the cluster file system without modification.
Standard Sun Cluster systems provide high availability and reliability from a single location. If your application must remain available after unpredictable disasters such as an earthquake, flood, or power outage, you can configure your cluster as a campus cluster.
Campus clusters enable you to locate cluster components, such as nodes and shared storage, in separate rooms several kilometers apart, in different facilities around your corporate campus or elsewhere. When a disaster strikes one location, the surviving nodes can take over service for the failed node, so that applications and data remain available to your users.