Introduction to the Cluster Membership Manager

The Cluster Membership Manager (CMM) is implemented by the nhcmmd daemon. There is a nhcmmd daemon on each peer node.

The nhcmmd daemon on the master node has the current view of the cluster configuration. It communicates its view to the nhcmmd daemons on the other peer nodes. The nhcmmd daemon on the master node determines which nodes are members of the cluster, and assigns roles and attributes to the nodes. It detects the failure of nodes and configures routes for reliable transport.

The nhcmmd daemon on the vice-master node monitors the status of the master node. If the master node fails, the vice-master node is able to take over as the master node.

The nhcmmd daemons on the client nodes do not communicate with one another. Each nhcmmd daemon exports two APIs to do the following:

Notify clients of changes to the cluster
Notify services and applications when the cluster membership or master changes

Notification messages describe the change and the nodeid of the affected node. Clients can use notifications to maintain an accurate view of the peer nodes in the cluster.

For further information about the nhcmmd daemon, see the nhcmmd1M man page.

You can use the CMM API to write applications that manage peer nodes or that register clients to receive notifications. For further information about writing applications that use the CMM API, see the Netra High Availability Suite 3.0 1/08 Foundation Services CMM Programming Guide.

The standard SA Forum CLM API can only be used to retrieve membership information about the cluster nodes, and to receive notifications about membership changes. For more information, see the Netra High Availability Suite 3.0 1/08 Foundation Services SA Forum Programming Guide.

Configuring the Cluster Membership

Cluster membership information is stored in the configuration files cluster_nodes_table and nhfs.conf.

At cluster startup, the cluster membership is configured as follows:

Both of the server nodes retrieve the list of peer nodes and their attributes from the cluster_nodes_table, and configuration information from nhfs.conf. All other peer nodes retrieve configuration information from nhfs.conf.
The nhcmmd daemon on the master node uses the list of nodes and their attributes to generate its view of the cluster configuration. It communicates this view to the nhcmmd daemons on the other peer nodes, including the vice-master node.
Using the master node view of the cluster, the nhcmmd daemon on the vice-master node updates its local cluster_nodes_table.

The nhcmmd daemon on the master node updates its cluster_nodes_table and its view of the cluster configuration when a peer node is added, removed, or disqualified. The nhcmmd daemon on the master node communicates the updated view to the nhcmmd daemons on the other peer nodes. The vice-master node uses this view to update its local cluster_nodes_table. In this way, the master node and vice-master node always have an up-to-date view of the cluster.

Monitoring the Presence of Peer Nodes

Each peer node runs a daemon called nhprobed that periodically sends a heartbeat in the form of an IP packet. Heartbeats are sent through each of the two physical interfaces of each peer node. When a heartbeat is detected through a physical interface, it indicates that the node is reachable and that the physical interface is alive. If a heartbeat is not detected for a period of time exceeding the detection delay, the physical interface is considered to have failed. If both of the node's physical interfaces fail, the node itself is considered to have failed.

Heartbeats from each peer node are sent to a multicast group on the cluster network. Only the master node listens to heartbeats coming from the vice-master node and client nodes. Only the vice-master node listens to heartbeats coming from the master node. For more information, see Interaction Between the nhprobed Daemon and the nhcmmd Daemon. For more information about the nhprobed daemon, see the nhprobed1M man page.

Interaction Between the `nhprobed` Daemon and the `nhcmmd` Daemon

On the server nodes, the nhprobed daemon receives a list of nodes from the nhcmmd daemon. The nhprobed daemon monitors the heartbeats of the nodes on the list. On the master node, the list contains all of the client nodes and the vice-master node. On the vice-master node, the list contains the master node only.

On the server nodes, the nhprobed daemon notifies the nhcmmd daemon when, for any node on its list, any of the following events occur:

One link becomes available, indicating that the node is accessible through the link.

One link becomes unavailable, indicating that the node is not accessible through the link.

The node becomes available, indicating that the first link to the node becomes available.

The node becomes unavailable, indicating that the last available link to the node becomes unavailable.

When a node other than the master node becomes unavailable, the master node eliminates the node from the cluster. The master node uses the TCP abort facility to close communication to the node. When the master node becomes unavailable, a failover is provoked.

Using the Direct Link to Prevent Split Brain Errors

Split brain is an error scenario in which the cluster has two master nodes. A direct communication link between the server nodes prevents the occurrence of split brain when the communication over the cluster network between the master node and vice-master node fails.

As described in Monitoring the Presence of Peer Nodes, the nhprobed daemon on the vice-master node monitors the presence of the master node. If the nhprobed daemon on the vice-master node fails to detect the master node, the master node itself or the communication to the master node has failed. If this happens, the vice-master node uses the direct link to try to contact the master node.

If the vice-master node does not receive a reply from the master node by using the direct link, it is assumed that the master node has failed. The vice-master node becomes the master node.

If the vice-master node receives a reply from the master node by using the direct link, it is assumed that the communication to the master node has failed but the master node is alive. The vice-master node is rebooted.

The Node Management Agent can monitor the following statistics on the direct link:

The number of times that the vice-master node has requested to become the master node.
The state of the direct communication link. The state can be up or down.

For information about how to connect the direct link between the server nodes, see the Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS.

Multicast Transmission of Heartbeats

Probe heartbeats are multicast. Each cluster on a local area network (LAN) is assigned to a different multicast group, and each network interface card (NIC) on a node is assigned to a different multicast group. For example, NICs connected to an hme0 Ethernet network are assigned to one multicast group, and NICs connected to an hme1 Ethernet network are assigned to another multicast group.

A heartbeat sent from one multicast group cannot be detected by another multicast group. Therefore, heartbeats sent from one cluster cannot be detected by another cluster on the same LAN. Similarly, for a cross-switched topology, heartbeats sent from one Ethernet network cannot be detected on another Ethernet network.

Multicast addresses are 32–bit. The lower 28 bits of the multicast address represent the multicast group. The multicast address is broken into the following parts:

Bits 28 to 31 are fixed.

Bits 23 to 27 identify Netra HA Suite. For the Foundation Services, bits 23 to 27 are always set to 10100.

Bits 8 to 22 identify the cluster. The value for a given cluster is specified in the nhfs.conf file by the Node.DomainId parameter.

Bits 0 to 7 identify the NIC. The value for a given NIC is specified in the nhfs.conf file by the Node.NIC0 and Node.NIC1 parameters.

When you are defining multicast groups for applications, follow these recommendations:

Bits 23 to 27 of the multicast address must not have the value 10100. This value is reserved for Netra HA Suite.

Bits 0 to 22 of the multicast address should not have the same value as any of your Netra HA Suite clusters.

When a message is sent, the IP stack uses the lower 23 bits of the multicast address to define the destination media access control (MAC) address. If several multicast addresses have the same value for the lower 23 bits, even if they have different values for the upper five bits, the addresses must be filtered at the IP level. The IP filtering would create a corresponding reduction in performance.

Because of the way that hme and le interfaces filter multicast packets, one in four clusters share the same multicast filter. To reduce the need to filter at the IP level, clusters in the same LAN should have sequential cluster identities.

Masterless Cluster

In normal usage, a cluster should contain a master node and a vice-master node. It can also contain diskless and dataless nodes. If the cluster does not have a master node, there is a risk of data loss because it is the master node that holds the most up-to-date view of the cluster. However, in some cluster usage, you might want to reduce the possible downtime for services running on client nodes. If this is the case, you can permit the diskless and dataless nodes to stay up even when there is no master node in the cluster by enabling the Masterless Cluster feature. By default, this feature is disabled, and diskless and dataless nodes reboot if there is no master node in the cluster for more than a few minutes.

If you enable this feature, you must ensure that the diskless and dataless nodes can handle the situation where they would no longer be able to access the files exported by the master node.

Activate this feature by setting the CMM.Masterloss.Detection parameter in the nhfs.conf file. If you are installing the cluster with nhinstall, enable this feature by setting MASTER_LOSS_DETECTION in the cluster_definition.conf file. For more information, see the nhfs.conf4 and cluster_definition.conf4 man pages.

You can also configure the amount of time the vice-master node will wait before taking over when it detects a stale cluster situation. To do this, define the CMM.Masterloss.Timeout parameter in the nhfs.conf file. If you are installing the cluster using nhinstall, define MASTER_LOSS_TIMEOUT in the cluster_definition.conf file. For more information, refer to the nhfs.conf(4) and cluster_definition.conf(4) man pages.