CHAPTER 3

Determining Cluster Validity

This chapter describes how to verify whether a group of nodes forms a cluster, and whether the cluster is functioning correctly. Before you perform maintenance tasks or change the cluster configuration, verify that the cluster is functioning correctly. When you have completed the maintenance tasks, verify that the cluster is still functioning correctly.

This chapter is divided into the following sections:

• Defining Minimum Criteria for a Cluster Running Highly Available Services

• Verifying Services on Peer Nodes

• Verifying That a Cluster Is Configured Correctly

• Reacting to a Failover


Defining Minimum Criteria for a Cluster Running Highly Available Services

A Netra HA Suite cluster can run the following highly available services: Reliable NFS and the Reliable Boot Service (RBS). For information about highly available services, see the Netra High Availability Suite 3.0 1/08 Foundation Services Overview.

A highly available cluster has the following features:

• A master node and a vice-master node

• An nhcmmd daemon running on each peer node

• A redundant Ethernet network connecting the peer nodes

• A vice-master node that is synchronized with the master node

If your cluster has diskless nodes, the Reliable Boot Service must also be running on the master node and the vice-master node.


Verifying Services on Peer Nodes

When performing administration tasks, regularly verify that your cluster is running correctly by performing the procedures described in this section.

To Verify That the Cluster Has a Master Node and a Vice-Master Node

  1. Log in to a master-eligible node as superuser.

  2. Type the following command:


    # nhcmmstat -c all
    

    The nhcmmstat command displays information in the console window about all of the peer nodes, including the role of each node. The peer nodes must include a master node and a vice-master node. For more information, see the nhcmmstat(1M) man page. A scripted version of this check is sketched after the following list.

    • If there is a master node but no vice-master node, reboot the second master-eligible node as described in To Perform a Clean Reboot of a Linux Node.

      Verify that the second master-eligible node has become the vice-master node:


      # nhcmmstat -c all
      

      If the second master-eligible node does not become the vice-master node, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

    • If there is neither a master node nor a vice-master node, you do not have a highly available cluster. Verify your cluster configuration by examining the nhfs.conf file and the cluster_nodes_table file for configuration errors.

      For more information, see the nhfs.conf(4) and cluster_nodes_table(4) man pages.

    • If there are two master nodes, you have a split-brain scenario. To investigate the cause of the split brain, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
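
The check in Step 2 can be scripted. The following sketch is illustrative only: it assumes that the nhcmmstat -c all output contains the role strings MASTER and VICE-MASTER, which you should confirm against the nhcmmstat(1M) output format on your system.


    #!/bin/sh
    # Sketch: warn if the cluster lacks a master or a vice-master node.
    # ASSUMPTION: nhcmmstat -c all output contains the role strings
    # MASTER and VICE-MASTER; confirm against nhcmmstat(1M) first.

    out=`nhcmmstat -c all` || exit 1

    echo "$out" | grep VICE-MASTER >/dev/null ||
        echo "WARNING: no vice-master node found"
    echo "$out" | grep -v VICE-MASTER | grep MASTER >/dev/null ||
        echo "WARNING: no master node found"
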

To Verify That an nhcmmd Daemon Is Running on Each Peer Node

  1. Log in to a peer node.

  2. Verify that an nhcmmd daemon is running on the node:


    # pgrep -x nhcmmd
    

    • If a process identifier is returned, the daemon is running.

    • If a process identifier is not returned, the daemon is not running.

    To investigate the cause of daemon failure, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

  3. Repeat Step 1 and Step 2 on each peer node.
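
Instead of logging in to each node interactively, you can run the same check from one node over ssh. The following is a minimal sketch: the node names are placeholders for your own peer node names, and the sketch assumes that root ssh access between peer nodes is configured.


    #!/bin/sh
    # Sketch: check for the nhcmmd daemon on every peer node over ssh.
    # The node names are placeholders; replace them with your peer nodes.
    # Assumes root ssh access to each peer node is already configured.

    for node in node1 node2 node3 node4; do
        if ssh "$node" pgrep -x nhcmmd >/dev/null; then
            echo "$node: nhcmmd is running"
        else
            echo "$node: nhcmmd is NOT running"
        fi
    done
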

To Verify That the Cluster Has a Redundant Ethernet Network

  1. Log in to a peer node as superuser.

  2. Verify that the peer nodes are communicating through a network:


    # nhadm check starting
    

    If any peer node is not accessible from any other peer node, the nhadm command displays an error message in the console window.

  3. Search the system log files for the following message:


    [ifcheck] Interface interface-name used for cgtp has failed
    

    The nhcmmd daemon creates this message if the peer nodes are not communicating through a redundant network.

    If the redundant network fails, examine the card, cable, and route table associated with the link. Investigate the system log files for relevant error messages.
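
The log search in Step 3 can also be scripted. In this sketch the log file path is an assumption (/var/adm/messages on the Solaris OS, /var/log/messages on Linux); adjust it to match your syslog configuration.


    #!/bin/sh
    # Sketch: search the system log for the cgtp interface failure message.
    # The log path is an assumption; adjust for your syslog configuration.
    LOG=/var/adm/messages        # use /var/log/messages on Linux

    if grep "used for cgtp has failed" "$LOG" >/dev/null 2>&1; then
        echo "Redundant network failure reported in $LOG:"
        grep "used for cgtp has failed" "$LOG"
    else
        echo "No cgtp interface failures found in $LOG"
    fi
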

To Verify That the Master Node and Vice-Master Node Are Synchronized

This procedure applies only to systems that use IP replication rather than a shared disk.

  1. Log in to the master node as superuser.

  2. Test whether the vice-master node is synchronized with the master node:

    For versions earlier than the Solaris 10 OS:


    # /usr/opt/SUNWesm/sbin/scmadm -S -M


    • If the scmadm command reaches the replicating state, the vice-master node is synchronized with the master node.

    • If the scmadm command does not reach the replicating state, the vice-master node is not synchronized with the master node.

    For the Solaris 10 OS and later:


    # /usr/sbin/dsstat 1


    • If the dsstat command indicates "R" in the "S" column, the vice-master node is synchronized with the master node.

    • If the dsstat command indicates "L" in the "S" column, the vice-master node is not synchronized and no synchronization is currently taking place.

    • If the dsstat command indicates "SY" in the "S" column, the vice-master node is not synchronized and synchronization is currently taking place.

    Note - Refer to the dsstat(1M) man page for more information.

    For the Linux OS:


    # drbdadm cstate all


    • If the drbdadm command indicates "Connected", the vice-master node is synchronized with the master node.

    • If the drbdadm command indicates "StandAlone" or "WFConnection", the vice-master node is not synchronized and no synchronization is currently taking place.

    • If the drbdadm command indicates "SyncSource", the vice-master node is not synchronized and synchronization is currently taking place.

    Note - Refer to the drbdadm(8) man page for more information.

  3. If the master node and the vice-master node are not synchronized, verify whether the RNFS.EnableSync parameter is set to FALSE in the nhfs.conf file.

    If the RNFS.EnableSync parameter is set to FALSE and you want to trigger synchronization:

    1. Trigger synchronization:


      # nhenablesync
      

      For information about nhenablesync, see the nhenablesync(1M) man page.

    2. Repeat Step 2.

    If the RNFS.EnableSync parameter is not set to FALSE but the vice-master node remains unsynchronized, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

    For more information about the scmadm command, see the scmadm(1M) man page. For more information about the RNFS.EnableSync parameter, see the nhfs.conf(4) man page.
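
The per-OS status commands in Step 2 can be wrapped in a single dispatch script. The following is a minimal sketch: it selects a command based on uname output and prints the raw status, leaving interpretation to the criteria listed in Step 2.


    #!/bin/sh
    # Sketch: run the synchronization status command for the current OS.
    # Interpret the output against the criteria listed in Step 2.

    case `uname -s` in
    Linux)
        drbdadm cstate all
        ;;
    SunOS)
        case `uname -r` in
        5.1[01]*) /usr/sbin/dsstat 1 ;;                 # Solaris 10 OS and later
        *)        /usr/opt/SUNWesm/sbin/scmadm -S -M ;; # earlier releases
        esac
        ;;
    esac
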

To Verify That the Reliable Boot Service Is Running

Diskless nodes and the Reliable Boot Service can be used on the Solaris OS, but are not supported on Linux.

  1. Log in to the master node.

  2. Determine whether an in.dhcpd daemon is running on the node:


    # pgrep -x in.dhcpd
    

    • If a process identifier is returned, the daemon is running.

    • If a process identifier is not returned, the daemon is not running.

    To investigate the cause of daemon failure, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
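
As with the nhcmmd check, this test is easy to wrap in a small script, sketched below.


    #!/bin/sh
    # Sketch: report whether the Reliable Boot Service DHCP daemon is up.
    if pid=`pgrep -x in.dhcpd`; then
        echo "in.dhcpd is running (PID $pid)"
    else
        echo "in.dhcpd is NOT running"
    fi
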


Verifying That a Cluster Is Configured Correctly

A cluster must meet the criteria outlined in Defining Minimum Criteria for a Cluster Running Highly Available Services. The following procedures describe how to verify that a cluster is configured correctly.

To Verify That a Cluster Is Configured Correctly

  1. Log in to a peer node as superuser.

  2. Type:


    # nhadm check
    

    The nhadm tool tests whether the Foundation Services and their prerequisite products are installed and configured correctly.

    If the nhadm command encounters an error, it displays a message in the console window. If you receive an error message, perform the following steps:

    1. Identify the problem area, then diagnose and correct the problem.

      For an explanation of the error messages displayed by nhadm, type:


      # nhadm -z text
      

    2. Rerun the nhadm check command, diagnosing and correcting any further errors until all tests pass.

    For more information, see the nhadm(1M) man page.
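
If you are correcting problems iteratively, a small wrapper can rerun the check until all tests pass. This sketch assumes that nhadm check exits with a nonzero status when any test fails; confirm this behavior against the nhadm(1M) man page.


    #!/bin/sh
    # Sketch: rerun nhadm check after each fix until all tests pass.
    # ASSUMPTION: nhadm check exits nonzero when any test fails.

    until nhadm check; do
        echo "nhadm check failed; correct the reported problem,"
        echo "then press Return to run the checks again."
        read dummy
    done
    echo "All nhadm checks passed."
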


Reacting to a Failover

When a master node fails over to the vice-master node, a fault has occurred. Even though your cluster has recovered, the fault that caused the failover could have serious implications for the future performance of your cluster. You must treat a failover seriously. After a failover, perform the following procedure.

To React to a Failover

  1. Log in to the failed master node as superuser.

  2. Examine the system log files for information about the cause of the failover.

    For information about log files, see Chapter 2.

  3. Verify that the failed master node has been elected as the vice-master node:


    # nhcmmstat -c vice
    

    • If there is a vice-master node in the cluster, nhcmmstat prints information to the console window about the vice-master role.

    • If there is no vice-master node, nhcmmstat returns an error code.

      If there is no vice-master node, investigate why the failed master node is not capable of taking the vice-master role. For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

  4. Ensure that you have a valid cluster as described in Defining Minimum Criteria for a Cluster Running Highly Available Services.

  5. Run the nhadm check command to verify that the node is correctly configured.


    # nhadm check
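
Steps 3 through 5 can be combined into a single post-failover check, sketched below. The sketch assumes that both commands exit with a nonzero status on failure; confirm this against the nhcmmstat(1M) and nhadm(1M) man pages.


    #!/bin/sh
    # Sketch: post-failover verification, run on the failed master node.
    # ASSUMPTION: both commands exit nonzero on failure; confirm against
    # the nhcmmstat(1M) and nhadm(1M) man pages.

    if nhcmmstat -c vice; then
        echo "OK: the cluster has a vice-master node."
    else
        echo "ERROR: no vice-master node; see the Troubleshooting Guide."
    fi

    nhadm check || echo "ERROR: cluster configuration checks failed."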