CHAPTER 2

Foundation Services Concepts

This chapter describes the concepts around which the Foundation Services are built. Ensure that you are familiar with the concepts described in this chapter before installing and using the Netra HA Suite software.

This chapter includes the following sections:

Cluster Model
Reliability, Serviceability, Redundancy, and Availability
Failover and Switchover
Data Management Policy
Service Models
Fault Management Models

Cluster Model

This section describes the cluster environment and the types of nodes in a cluster.

A cluster is a set of interconnected nodes (peer nodes) that collaborate through distributed services to provide highly available services. A cluster is made up of 2 to 64 peer nodes, as follows:

Two server (master-eligible) nodes: the master node and the vice-master node
Up to 62 client (master-ineligible) nodes

Nonpeer nodes outside of the cluster can connect to the server nodes of a cluster.

The following figure shows an example of a cluster configuration with a nonpeer node connected to the cluster.

FIGURE 2-1   Example of a Cluster Configuration With a Non-Peer Node Connected to the Cluster

[Diagram shows the nodes inside and outside a cluster.]


All types of nodes are described in the following sections.

Peer Nodes and Nonpeer Nodes

Nodes that are configured as members of a cluster are called peer nodes. Peer nodes can run Netra HA Suite software and communicate with each other on the same network. Peer nodes can be server nodes or client nodes.

Nodes that are not configured as members of a cluster are called nonpeer nodes. A nonpeer node communicates with one or more peer nodes to access resources or services provided by the cluster. In FIGURE 2-1, the nonpeer node is connected to both of the redundant network links. For information about the options for connecting nonpeer nodes to a cluster, see Chapter 8.

Server (Master-Eligible) Nodes

A cluster must contain two server nodes (master-eligible nodes). A server node is a peer node that can be elected as the master node or the vice-master node.

The master node is the node that coordinates the cluster membership information. The master node generates its view of the cluster configuration and communicates this view to the other peer nodes. Highly available services (for example, the Reliable File Service and the Reliable Boot Service) run only on the master node. If an application is a client/server type of application, the services provided by the application must run on the master node and must use the CMM API to receive notifications about the cluster state. The application must also manage its own availability, whether or not the NSM is being used.
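To make this pattern concrete, the following minimal C sketch shows how the server part of such an application might react to cluster-state notifications. The event names and functions in the sketch are illustrative stand-ins, not the actual CMM API; a real application would receive these notifications through the proprietary or SA Forum-compliant CMM interfaces.

/*
 * Conceptual sketch only: shows how the server part of a client/server
 * application might manage its own availability from cluster-state
 * notifications.  The types and functions below are illustrative stand-ins;
 * a real application would obtain these events through the CMM API
 * (proprietary or SA Forum-compliant), not from the stubs shown here.
 */
#include <stdio.h>

typedef enum {
    EV_BECOME_MASTER,        /* this node was elected master           */
    EV_BECOME_VICEMASTER     /* this node is now the backup            */
} cluster_event_t;

/* React to a cluster-state notification delivered by the membership layer. */
static void
on_cluster_event(cluster_event_t ev)
{
    switch (ev) {
    case EV_BECOME_MASTER:
        printf("primary instance: start serving requests\n");
        break;
    case EV_BECOME_VICEMASTER:
        printf("secondary instance: stay warm, replicate state only\n");
        break;
    }
}

int
main(void)
{
    /* In a real cluster these events would arrive asynchronously from the
     * CMM; here we simply replay a plausible sequence.                   */
    cluster_event_t demo[] = { EV_BECOME_VICEMASTER, EV_BECOME_MASTER };
    for (unsigned i = 0; i < sizeof (demo) / sizeof (demo[0]); i++)
        on_cluster_event(demo[i]);
    return 0;
}

The important point is that the application itself decides what "primary" and "standby" mean for its service; the Foundation Services report the cluster state but do not manage the application's availability for it.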

The vice-master node backs up the master node. The vice-master node has a copy of all of the cluster management information that is on the master node. It can transparently take control of the cluster if required.

Ensure that any tasks you run on the vice-master node either impose a very low load or are designed so that they can be interrupted if the vice-master node needs to become the master node. The vice-master node must always be available to take over the master node's load if the current master node can no longer continue in the master role.

Each server node must be a diskfull node. A diskfull node has at least one disk on which information can be permanently stored. A server node must be configured as master-eligible at the time of installation and configuration. The master node and vice-master node are the only nodes that are configured as diskfull in a Netra HA Suite cluster.

Client (Master-Ineligible) Nodes

A cluster contains two server (master-eligible) nodes: the master node and the vice-master node. All other peer nodes are client nodes. It is highly recommended, but not mandatory, that you run applications only on client nodes, except when the application is a client/server type of application. In that case, the server part of the application must run on the master node so that it can easily be made highly available.

In a Netra HA Suite cluster, client nodes are either diskless nodes or dataless nodes.

A diskless node either does not have a local disk or is configured not to use its local disk. Diskless nodes boot through the cluster network, using the master node as a boot server.

A dataless node has a local disk from which it boots, but it cannot store data that has to be redundant on its disk; this type of data must be stored on server (master-eligible) nodes. An application running on a dataless node accesses redundant data from server (master-eligible) nodes through the cluster network.

For examples of supported cluster configurations, see the Netra High Availability Suite 3.0 1/08 Foundation Services Getting Started Guide.


Reliability, Serviceability, Redundancy, and Availability

This section defines the concepts of reliability, serviceability, redundancy, and availability. These concepts rely on the failover and switchover mechanisms described in Failover and Switchover.

Reliability

Reliability is a measure of continuous system uptime. Netra HA Suite software provides distributed services and highly available services to increase the reliability of your system.

Serviceability

Serviceability is the probability that a service can be restored within a specified period of time following a service failure. The Foundation Services increase the serviceability of applications through the highly available services that they provide and through the fast failover (or switchover) that transfers highly available services from one server node to the other, generally in less than five seconds.

Redundancy

Redundancy increases the availability of a service by providing a backup to take over in the event of failure.

The Foundation Services provide an active/standby model for services running on the server nodes, whether the services are part of the Foundation Services or are provided by an application (specifically, the server part of a client/server type of application). Because the Netra HA Suite software does not currently provide an application framework, applications must themselves manage the redundancy model that they want to apply. The Cluster Membership Manager (CMM) APIs (proprietary or SA Forum-compliant) provide applications with the information they need to implement their own redundancy models.

The master node is backed up by the vice-master node. If the master node fails, services are transparently transferred to the vice-master node. In the Foundation Services, the instance of a service running on the master node is the primary instance. The instance of the service running on the vice-master node is the secondary instance.

Availability

Availability is the probability that a service is available for use at any given time. Availability is a function of system reliability and serviceability, supported by redundancy.
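The relationship between these concepts can be expressed with the standard availability formula (a general formula, not specific to the Netra HA Suite):

Availability = MTBF / (MTBF + MTTR)

where MTBF (mean time between failures) reflects reliability and MTTR (mean time to repair) reflects serviceability. For example, an MTBF of 10,000 hours combined with an MTTR of 1 hour gives an availability of 10,000 / 10,001, or roughly 99.99 percent.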


Failover and Switchover

Failover and switchover are the mechanisms that ensure the high availability of a cluster.

Failover occurs if the master node fails, or if a vital service running on the master node fails. The services on the master node fail over to the vice-master node. The vice-master node has all of the necessary state information to take over from the master node. The vice-master node expects no cooperation or coordination from the failed master node.

Switchover is the planned transfer of services from the master node to the vice-master node. Switchover is orchestrated by the system or by an operator so that a node can be maintained without affecting system performance. Switchover is not linked to node failure. As in the case of a failover, the backup must have all of the necessary state information to take over at the moment of the switchover. Unlike failover, in switchover the master node can help the vice-master node by, for example, flushing caches for shared files.

Only the master node and vice-master node take part in failover and switchover. If a diskless node or dataless node fails, there is no failover, but all nodes that are part of the cluster (peer nodes) are made aware of the node failure through the Cluster Membership Manager (CMM) APIs (proprietary or SA Forum-compliant). If a diskless node or dataless node is the only node running an application, the application fails. If other diskless nodes or dataless nodes are running the application, the application will continue to run on these other nodes.
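As a rough illustration of this behavior, the following C sketch shows how a distributed application running on several client nodes might redistribute work when it is told that a peer node has left the cluster. The node identifiers, data structures, and the on_peer_left() entry point are hypothetical; they stand in for whatever notification handling the application builds on top of the CMM API.

/*
 * Conceptual sketch only: a distributed application reacting to a peer
 * (diskless or dataless) node leaving the cluster.  Node identifiers and
 * the notification entry point are illustrative; real applications learn
 * about membership changes through the CMM API.
 */
#include <stdio.h>
#include <stdbool.h>

#define MAX_PEERS 8

static bool peer_up[MAX_PEERS] = { true, true, true, true };

/* Called when the membership layer reports that a peer node left. */
static void
on_peer_left(int node_id)
{
    if (node_id < 0 || node_id >= MAX_PEERS || !peer_up[node_id])
        return;
    peer_up[node_id] = false;

    /* Reassign the failed node's share of work to the surviving peers.
     * Here the "work" is only reported; a real application would migrate
     * sessions, partitions, or tasks.                                   */
    for (int n = 0; n < MAX_PEERS; n++) {
        if (peer_up[n]) {
            printf("node %d takes over part of node %d's workload\n",
                   n, node_id);
        }
    }
}

int
main(void)
{
    on_peer_left(2);    /* simulate a report that peer node 2 left */
    return 0;
}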


Data Management Policy

Three data management policies are available in the Foundation Services. These policies determine how the cluster behaves when a failed vice-master node reboots in a cluster that has no master node. The policy you choose depends on the availability and data-integrity requirements of your cluster.

The data management policies are as follows:


Integrity: Ensures that the cluster uses the most up-to-date data. The vice-master node does not take the master role, but waits for the old master node to return to the cluster. This is the default data management policy.
Availability: Prioritizes the availability of services running on the cluster over data integrity. The vice-master node takes the master role when there is no master node in the cluster. This policy triggers a full synchronization when the old master node rejoins the cluster as the new vice-master node. This synchronization might result in the loss of any data written to the old master node while the vice-master node was down.
Adaptability: Prioritizes availability only if the master and vice-master disks are synchronized, which increases the level of data integrity. The vice-master node checks the disk synchronization state. If the state indicates that the master and vice-master disks are not synchronized, the vice-master node does not take the master role, but waits for the old master node to return to the cluster. If the state indicates that the disks are synchronized, the vice-master node takes the master role without waiting for the old master node to rejoin the cluster.

Choose the data management policy by setting the value of the Cluster.DataManagementPolicy parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, choose the data management policy by setting the value of DATA_MGT_POLICY in the cluster_definition.conf file. For more information, see the nhfs.conf(4) and cluster_definition.conf(4) man pages.
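For illustration, the relevant configuration lines might look like the following excerpts. The parameter names are the ones given above; the policy value shown here is only an example, and the exact accepted values and syntax are defined in the nhfs.conf(4) and cluster_definition.conf(4) man pages.

# Excerpt from nhfs.conf (illustrative value)
Cluster.DataManagementPolicy=INTEGRITY

# Excerpt from cluster_definition.conf, used when installing with nhinstall
# (illustrative value)
DATA_MGT_POLICY=INTEGRITY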


Service Models

The Foundation Services provide two categories of service: highly available services and distributed services.

Highly available services run on the master node and vice-master node only. The Reliable Boot Service, Reliable File Service, External Address Manager (EAM), and Node State Manager (NSM) are highly available services. If the master node or one of these services on the master node fails, a failover occurs.

Distributed services are services that run on all peer nodes. The distributed services include the Cluster Membership Manager (CMM), the Node Management Agent (NMA), and the Process Monitor Daemon (PMD). If a distributed service fails and cannot be restarted, the node running the service is removed from the cluster. If that node is the master node, a failover occurs.


Fault Management Models

This section describes some of the faults that can occur in a cluster, and how those faults are managed.

Fault Types

When one critical fault occurs, it is called a single fault. A single fault can be the failure of one server node, the failure of a service, or the failure of one of the redundant networks. After a single fault, the cluster continues to operate correctly but it is not highly available until the fault is repaired.

Two critical faults that affect both parts of the redundant system are referred to as a double fault. A double fault can be the simultaneous failure of both server nodes or the simultaneous failure of both redundant network links. Although many double faults can be detected, it might not be possible to recover from all of them. Although rare, double faults can result in cluster failure.

Some faults can result in the election of two master nodes. This error scenario is called split brain. Split brain is usually caused by a communication failure between the server nodes. When communication between the server nodes is restored, the last elected master node remains the master node. The other server node is elected as the vice-master node. Data shared between the two server (master-eligible) nodes must then be resynchronized, with a risk of data loss.

After a cluster outage (when both master-eligible nodes are down), if the cluster is restarted and the newly elected master node was previously the vice-master node, there is a risk that the system will experience amnesia. Amnesia occurs when the newly elected master node assumes that its data is synchronized with the data on the node that was previously the master node, when in fact it is not. When the former master node rejoins the cluster as the new vice-master node, the most recently updated data on that node is lost.

Fault Detection

Fault detection is critical for a cluster running highly available applications. The Foundation Services have the following fault detection mechanisms:

Fault Reporting

Errors that indicate potential failure are reported so that you can understand the sequence of events that have led to the problem. The Foundation Services have the following fault-reporting mechanisms:

Fault Isolation

When a fault occurs in the cluster, the node on which the fault occurred is isolated. The Cluster Membership Manager ensures that the failed node cannot communicate with the other peer nodes.

Fault Recovery

Typical fault recovery scenarios include the following behaviors:

Failed nodes are often repaired by rebooting the system.
Overload errors are often repaired by waiting for an acceptable delay and then rebooting or restarting the failed service.
The Foundation Services are designed so that individual nodes can be shut down and restarted independently, reducing the impact of errors.
After a failover, the master node and vice-master node are synchronized so that the repaired vice-master node can rejoin the cluster in its current state.