CHAPTER 3
This chapter describes how to verify whether a group of nodes form a cluster, and whether the cluster is functioning correctly. Before you perform maintenance tasks or change the cluster configuration, verify that the cluster is functioning correctly. When you have completed maintenance tasks, verify that the cluster is still functioning correctly.
A Netra HA Suite cluster can run the following highly available services: Reliable NFS and the Reliable Boot Service (RBS). For information about highly available services, see the Netra High Availability Suite 3.0 1/08 Foundation Services Overview.
A highly available cluster has the following features:
A master node and a vice-master node. The master node is the central information point for the cluster. The vice-master node backs up the master node. To verify that there is a master node and a vice-master node in the cluster, see To Verify That the Cluster Has a Master Node and a Vice-Master Node.
An nhcmmd daemon on each peer node. The nhcmmd daemon on the master node manages the membership of the other peer nodes. The nhcmmd daemon on other peer nodes receives cluster information from the nhcmmd daemon on the master node. To verify that there is an nhcmmd daemon on each peer node, perform the procedure described in To Verify That an nhcmmd Daemon Is Running on Each Peer Node.
A redundant network. When the network is redundant, there is no single point of network failure. To verify that the cluster network is redundant, see To Verify That the Cluster Has a Redundant Ethernet Network.
Synchronized master node disk and vice-master node disk. Synchronization ensures that the vice-master node has an up-to-date copy of the information on the master node. To verify that the master node and vice-master node are synchronized, see To Verify That the Master Node and Vice-Master Node Are Synchronized.
If your cluster has diskless nodes, the Reliable Boot Service must be running on the master node and the vice-master node.
When performing administration tasks, regularly verify that your cluster is running correctly by performing the procedures described in this section.
# nhcmmstat -c all
The nhcmmstat command displays information in the console window about all of the peer nodes. The information includes the role of each node. The peer nodes must include a master node and a vice-master node. For more information, see the nhcmmstat(1M) man page.
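As a sketch, the role check can be automated by scanning the command output for the two roles. The output layout below is illustrative, not the exact nhcmmstat format, so adjust the field matching to what your version actually prints:

```shell
# Hypothetical excerpt of "nhcmmstat -c all" output; the field layout is
# an assumption for illustration only.
sample_output='Node 10 netra-10 MASTER
Node 20 netra-20 VICE-MASTER
Node 30 netra-30 IN_CLUSTER'

# A valid highly available cluster needs exactly one node in each role.
masters=$(printf '%s\n' "$sample_output" | awk '$NF == "MASTER"' | wc -l)
vices=$(printf '%s\n' "$sample_output" | awk '$NF == "VICE-MASTER"' | wc -l)

if [ "$masters" -eq 1 ] && [ "$vices" -eq 1 ]; then
    echo "cluster has a master and a vice-master"
else
    echo "WARNING: master count=$masters, vice-master count=$vices"
fi
```

On a live cluster, replace the sample text with the real command output, for example `nhcmmstat -c all | awk ...`.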
If there is a master node but no vice-master node, reboot the second master-eligible node as described in To Perform a Clean Reboot of a Linux Node.
Verify that the second master-eligible node has become the vice-master node:
# nhcmmstat -c all
If the second master-eligible node does not become the vice-master node, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
If there is neither a master node nor a vice-master node, you do not have a highly available cluster. Verify your cluster configuration by examining the nhfs.conf file and the cluster_nodes_table file for configuration errors.
For more information, see the nhfs.conf(4) and cluster_nodes_table(4) man pages.
If there are two master nodes, you have a split brain error scenario. To investigate the cause of split brain, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
|
Verify that the peer nodes are communicating through a network:
# nhadm check starting
If any peer node is not accessible from any other peer node, the nhadm command displays an error message in the console window.
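The reachability test that this step relies on can be approximated with a small loop over the peer addresses. The addresses are placeholders, and the probe command is parameterized so the sketch can be exercised without a live cluster; on a real node you might pass a ping invocation (note that Solaris ping takes a timeout argument rather than `-c`):

```shell
# check_peers prints one line per address, flagging peers the probe cannot reach.
check_peers() {
    probe_cmd=$1; shift
    for addr in "$@"; do
        if $probe_cmd "$addr" >/dev/null 2>&1; then
            echo "$addr reachable"
        else
            echo "$addr UNREACHABLE"
        fi
    done
}

# On a live cluster you might use: check_peers "ping -c 1" 10.250.1.10 ...
# Here a stub probe that accepts only one (hypothetical) address shows the flow.
probe_stub() { [ "$1" = "10.250.1.10" ]; }
result=$(check_peers probe_stub 10.250.1.10 10.250.1.20)
echo "$result"
```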
Search the system log files for the following message:
[ifcheck] Interface interface-name used for cgtp has failed
The nhcmmd daemon creates this message if the peer nodes are not communicating through a redundant network.
If the redundant network fails, examine the card, cable, and route table associated with the link. Investigate the system log files for relevant error messages.
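A quick way to hunt for this message is to grep the system log. The log path varies (/var/adm/messages on the Solaris OS, /var/log/messages on many Linux distributions), and the sample entry below uses a hypothetical hostname and interface name:

```shell
# cgtp_failures prints any log lines recording a CGTP interface failure,
# or a placeholder line when none are found.
cgtp_failures() {
    grep 'used for cgtp has failed' "$1" || echo "no cgtp interface failures logged"
}

# Demonstrate against a sample log fragment; on a live node, point the
# function at the real system log file instead.
sample_log=$(mktemp)
printf '%s\n' 'May  3 12:00:01 netra-10 nhcmmd: [ifcheck] Interface hme1 used for cgtp has failed' > "$sample_log"
found=$(cgtp_failures "$sample_log")
echo "$found"
rm -f "$sample_log"
```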
|
Test whether the vice-master node is synchronized with the master node:
For versions earlier than the Solaris 10 OS:
# /usr/opt/SUNWesm/sbin/scmadm -S -M
If the scmadm command reaches the replicating state, the vice-master node is synchronized with the master node.
If the scmadm command does not reach the replicating state, the vice-master node is not synchronized with the master node.
For the Solaris 10 OS and later:
# /usr/sbin/dsstat 1
If the dsstat command indicates "R" in the "S" column, the vice-master node is synchronized with the master node.
If the dsstat command indicates "L" in the "S" column, the vice-master node is not synchronized and no synchronization is currently taking place.
If the dsstat command indicates "SY" in the "S" column, the vice-master node is not synchronized and synchronization is currently taking place.
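The three dsstat states above can be summarized in a small helper; the mapping simply restates the table in code and reports any other value as unknown:

```shell
# Map a dsstat status-column letter to a human-readable sync state.
# Only the three states described above are handled.
sync_state() {
    case $1 in
        R)  echo "synchronized" ;;
        L)  echo "not synchronized, no sync in progress" ;;
        SY) echo "not synchronized, sync in progress" ;;
        *)  echo "unknown state: $1" ;;
    esac
}

sync_state R
sync_state SY
```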
For Linux:
# drbdadm cstate all
If the drbdadm command indicates "Connected", the vice-master node is synchronized with the master node.
If the drbdadm command indicates "StandAlone" or "WFConnection", the vice-master node is not synchronized and no synchronization is currently taking place.
If the drbdadm command indicates "SyncSource", the vice-master node is not synchronized and synchronization is currently taking place.
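Likewise, the drbdadm connection states can be mapped to a sync status. Only the states named above are handled; anything else is reported as unknown:

```shell
# Map a drbdadm connection state to a human-readable sync state,
# covering only the states described above.
drbd_state() {
    case $1 in
        Connected)               echo "synchronized" ;;
        StandAlone|WFConnection) echo "not synchronized, no sync in progress" ;;
        SyncSource)              echo "not synchronized, sync in progress" ;;
        *)                       echo "unknown state: $1" ;;
    esac
}

drbd_state Connected
drbd_state WFConnection
```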
If the master and vice-master nodes are not synchronized, verify whether the RNFS.EnableSync parameter is set to FALSE in the nhfs.conf file.
If the RNFS.EnableSync parameter is set to FALSE and if you want to trigger synchronization:
# nhenablesync
For information about nhenablesync, see the nhenablesync(1M) man page.
Repeat Step 2.
If the RNFS.EnableSync parameter is not set to FALSE but the vice-master node remains unsynchronized, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
For more information about the scmadm command, see the scmadm(1M) man page. For more information about the RNFS.EnableSync parameter, see the nhfs.conf(4) man page.
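A quick way to inspect the RNFS.EnableSync setting is to grep nhfs.conf. The key=value syntax and the file location assumed below are illustrative and should be confirmed against the nhfs.conf(4) man page:

```shell
# sync_disabled returns success when RNFS.EnableSync is explicitly FALSE.
# The "key=value" line format is an assumption about nhfs.conf syntax.
sync_disabled() {
    grep -q '^RNFS.EnableSync=FALSE' "$1"
}

# Demonstrate on a sample fragment rather than the live configuration file.
cfg=$(mktemp)
printf 'RNFS.EnableSync=FALSE\n' > "$cfg"
if sync_disabled "$cfg"; then disabled=yes; else disabled=no; fi
rm -f "$cfg"
echo "RNFS.EnableSync=FALSE present: $disabled"
```

When the check reports yes and you want synchronization, run nhenablesync as described above.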
Diskless nodes and the Reliable Boot Service can be used on the Solaris OS, but are not supported on Linux.
A cluster must meet the criteria outlined in Defining Minimum Criteria for a Cluster Running Highly Available Services. The following procedures describe how to verify that a cluster is configured correctly.
# nhadm check
The nhadm tool tests whether the Foundation Services and their prerequisite products are installed and configured correctly.
If the nhadm command encounters an error, it displays a message in the console window. If you receive an error message, perform the following steps:
When a master node fails over to the vice-master node, a fault has occurred. Even though your cluster has recovered, the fault that caused the failover could have serious implications for the future performance of your cluster. Treat every failover as a serious event. After a failover, perform the following procedure.
Examine the system log files for information about the cause of the failover.
For information about log files, see Chapter 2.
Verify that the failed master node has been elected as the vice-master node:
# nhcmmstat -c vice
If there is a vice-master node in the cluster, nhcmmstat prints information to the console window about the vice-master role.
If there is no vice-master node, nhcmmstat exits with an error code.
If there is no vice-master node, investigate why the failed master node is not capable of taking the vice-master role. For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
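This check can be scripted by branching on the command's exit status. The status command is parameterized below so the flow can be exercised with a stub where nhcmmstat is unavailable:

```shell
# check_vice reports whether a vice-master node is present, based on the
# exit status of the supplied status command.
# On a live cluster you might call: check_vice "nhcmmstat -c vice"
check_vice() {
    if $1 >/dev/null 2>&1; then
        echo "vice-master present"
    else
        echo "no vice-master: investigate why the failed master cannot take the role"
    fi
}

# Stubs standing in for nhcmmstat so the flow can be demonstrated anywhere.
vice_ok() { true; }
vice_missing() { false; }
check_vice vice_ok
check_vice vice_missing
```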
Ensure that you have a valid cluster as described in Defining Minimum Criteria for a Cluster Running Highly Available Services.
Run the nhadm check command to verify that the node is correctly configured.
# nhadm check
Copyright © 2008, Sun Microsystems, Inc. All rights reserved.