CHAPTER 3

Determining Cluster Validity

This chapter describes how to verify whether a group of nodes forms a cluster, and whether the cluster is functioning correctly. Before you perform maintenance tasks or change the cluster configuration, verify that the cluster is functioning correctly. When you have completed the maintenance tasks, verify that the cluster is still functioning correctly.

This chapter is divided into the following sections:

• Defining Minimum Criteria for a Cluster Running Highly Available Services

• Verifying Services on Peer Nodes

• Verifying That a Cluster Is Configured Correctly

• Reacting to a Failover


Defining Minimum Criteria for a Cluster Running Highly Available Services

A Netra HA Suite cluster can run the following highly available services: Reliable NFS and the Reliable Boot Service (RBS). For information about highly available services, see the Netra High Availability Suite 3.0 1/08 Foundation Services Overview.

A highly available cluster has the following features:

• A master node and a vice-master node

• An nhcmmd daemon running on each peer node

• A redundant Ethernet network connecting the peer nodes

• A vice-master node that is synchronized with the master node

If your cluster has diskless nodes, the Reliable Boot Service must also be running on the master node and the vice-master node.


Verifying Services on Peer Nodes

When performing administration tasks, regularly verify that your cluster is running correctly by performing the procedures described in this section.

To Verify That the Cluster Has a Master Node and a Vice-Master Node

  1. Log in to a master-eligible node as superuser.

  2. Type the following command:


    # nhcmmstat -c all
    

    The nhcmmstat command displays information in the console window about all of the peer nodes, including the role of each node. The peer nodes must include a master node and a vice-master node. For more information, see the nhcmmstat(1M) man page. A scripted version of this check is sketched after the following list.

    • If there is a master node but no vice-master node, reboot the second master-eligible node as described in To Perform a Clean Reboot of a Linux Node.

      Verify that the second master-eligible node has become the vice-master node:


      # nhcmmstat -c all
      

      If the second master-eligible node does not become the vice-master node, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

    • If there is neither a master node nor a vice-master node, you do not have a highly available cluster. Verify your cluster configuration by examining the nhfs.conf file and the cluster_nodes_table file for configuration errors.

      For more information, see the nhfs.conf(4) and cluster_nodes_table(4) man pages.

    • If there are two master nodes, you have a split-brain scenario. To investigate the cause of the split brain, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
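
The check in Step 2 can be scripted. The following sketch is illustrative only: it assumes that the nhcmmstat -c all output contains the role strings MASTER and VICE-MASTER, which you should confirm against the nhcmmstat(1M) output format on your system.


    #!/bin/sh
    # Sketch: warn if the cluster lacks a master or a vice-master node.
    # ASSUMPTION: nhcmmstat -c all output contains the role strings
    # MASTER and VICE-MASTER; confirm against nhcmmstat(1M) first.

    out=`nhcmmstat -c all` || exit 1

    echo "$out" | grep VICE-MASTER >/dev/null ||
        echo "WARNING: no vice-master node found"
    echo "$out" | grep -v VICE-MASTER | grep MASTER >/dev/null ||
        echo "WARNING: no master node found"
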

To Verify That an nhcmmd Daemon Is Running on Each Peer Node

  1. Log in to a peer node.

  2. Verify that an nhcmmd daemon is running on the node:


    # pgrep -x nhcmmd
    

    • If a process identifier is returned, the daemon is running.

    • If a process identifier is not returned, the daemon is not running.

    To investigate the cause of daemon failure, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

  3. Repeat Step 1 and Step 2 on each peer node.
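
Instead of logging in to each node interactively, you can run the same check from one node over ssh. The following is a minimal sketch: the node names are placeholders for your own peer node names, and the sketch assumes that root ssh access between peer nodes is configured.


    #!/bin/sh
    # Sketch: check for the nhcmmd daemon on every peer node over ssh.
    # The node names are placeholders; replace them with your peer nodes.
    # Assumes root ssh access to each peer node is already configured.

    for node in node1 node2 node3 node4; do
        if ssh "$node" pgrep -x nhcmmd >/dev/null; then
            echo "$node: nhcmmd is running"
        else
            echo "$node: nhcmmd is NOT running"
        fi
    done
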

To Verify That the Cluster Has a Redundant Ethernet Network

  1. Log in to a peer node as superuser.

  2. Verify that the peer nodes are communicating through a network:


    # nhadm check starting
    

    If any peer node is not accessible from any other peer node, the nhadm command displays an error message in the console window.

  3. Search the system log files for the following message:


    [ifcheck] Interface interface-name used for cgtp has failed
    

    The nhcmmd daemon creates this message if the peer nodes are not communicating through a redundant network.

    If the redundant network fails, examine the card, cable, and route table associated with the link. Investigate the system log files for relevant error messages.
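
The log search in Step 3 can also be scripted. In this sketch the log file path is an assumption (/var/adm/messages on the Solaris OS, /var/log/messages on Linux); adjust it to match your syslog configuration.


    #!/bin/sh
    # Sketch: search the system log for the cgtp interface failure message.
    # The log path is an assumption; adjust for your syslog configuration.
    LOG=/var/adm/messages        # use /var/log/messages on Linux

    if grep "used for cgtp has failed" "$LOG" >/dev/null 2>&1; then
        echo "Redundant network failure reported in $LOG:"
        grep "used for cgtp has failed" "$LOG"
    else
        echo "No cgtp interface failures found in $LOG"
    fi
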

To Verify That the Master Node and Vice-Master Node Are Synchronized

This procedure applies only to systems that use IP replication rather than a shared disk.

  1. Log in to the master node as superuser.

  2. Test whether the vice-master node is synchronized with the master node:

    For versions earlier than the Solaris 10 OS:


    # /usr/opt/SUNWesm/sbin/scmadm -S -M


    • If the scmadm command reaches the replicating state, the vice-master node is synchronized with the master node.

    • If the scmadm command does not reach the replicating state, the vice-master node is not synchronized with the master node.

    For the Solaris 10 OS and later:


    # /usr/sbin/dsstat 1


    • If the dsstat command indicates "R" in the "S" column, the vice-master node is synchronized with the master node.

    • If the dsstat command indicates "L" in the "S" column, the vice-master node is not synchronized and no synchronization is currently taking place.

    • If the dsstat command indicates "SY" in the "S" column, the vice-master node is not synchronized and synchronization is currently taking place.

    Note - Refer to the dsstat(1M) man page for more information.

    For the Linux OS:


    # drbdadm cstate all


    • If the drbdadm command indicates "Connected", the vice-master node is synchronized with the master node.

    • If the drbdadm command indicates "StandAlone" or "WFConnection", the vice-master node is not synchronized and no synchronization is currently taking place.

    • If the drbdadm command indicates "SyncSource", the vice-master node is not synchronized and synchronization is currently taking place.

    Note - Refer to the drbdadm(8) man page for more information.

  3. If the master node and the vice-master node are not synchronized, verify whether the RNFS.EnableSync parameter is set to FALSE in the nhfs.conf file.

    If the RNFS.EnableSync parameter is set to FALSE and you want to trigger synchronization:

    1. Trigger synchronization:


      # nhenablesync
      

      For information about nhenablesync, see the nhenablesync(1M) man page.

    2. Repeat Step 2.

    If the RNFS.EnableSync parameter is not set to FALSE but the vice-master node remains unsynchronized, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

    For more information about the scmadm command, see the scmadm(1M) man page. For more information about the RNFS.EnableSync parameter, see the nhfs.conf(4) man page.
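
The per-OS status commands in Step 2 can be wrapped in a single dispatch script. The following is a minimal sketch: it selects a command based on uname output and prints the raw status, leaving interpretation to the criteria listed in Step 2.


    #!/bin/sh
    # Sketch: run the synchronization status command for the current OS.
    # Interpret the output against the criteria listed in Step 2.

    case `uname -s` in
    Linux)
        drbdadm cstate all
        ;;
    SunOS)
        case `uname -r` in
        5.1[01]*) /usr/sbin/dsstat 1 ;;                 # Solaris 10 OS and later
        *)        /usr/opt/SUNWesm/sbin/scmadm -S -M ;; # earlier releases
        esac
        ;;
    esac
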

To Verify That the Reliable Boot Service Is Running

Diskless nodes and the Reliable Boot Service can be used on the Solaris OS, but are not supported on Linux.

  1. Log in to the master node.

  2. Determine whether an in.dhcpd daemon is running on the node:


    # pgrep -x in.dhcpd
    

    • If a process identifier is returned, the daemon is running.

    • If a process identifier is not returned, the daemon is not running.

    To investigate the cause of daemon failure, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.
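
As with the nhcmmd check, this test is easy to wrap in a small script, sketched below.


    #!/bin/sh
    # Sketch: report whether the Reliable Boot Service DHCP daemon is up.
    if pid=`pgrep -x in.dhcpd`; then
        echo "in.dhcpd is running (PID $pid)"
    else
        echo "in.dhcpd is NOT running"
    fi
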


Verifying That a Cluster Is Configured Correctly

A cluster must meet the criteria outlined in Defining Minimum Criteria for a Cluster Running Highly Available Services. The following procedures describe how to verify that a cluster is configured correctly.

To Verify That a Cluster Is Configured Correctly

  1. Log in to a peer node as superuser.

  2. Type:


    # nhadm check
    

    The nhadm tool tests whether the Foundation Services and their prerequisite products are installed and configured correctly.

    If the nhadm command encounters an error, it displays a message in the console window. If you receive an error message, perform the following steps:

    1. Identify the problem area, then diagnose and correct the problem.

      For an explanation of the error messages displayed by nhadm, type:


      # nhadm -z text
      

    2. Rerun the nhadm check command, diagnosing and correcting any further errors until all tests pass.

    For more information, see the nhadm(1M) man page.
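
If you are correcting problems iteratively, a small wrapper can rerun the check until all tests pass. This sketch assumes that nhadm check exits with a nonzero status when any test fails; confirm this behavior against the nhadm(1M) man page.


    #!/bin/sh
    # Sketch: rerun nhadm check after each fix until all tests pass.
    # ASSUMPTION: nhadm check exits nonzero when any test fails.

    until nhadm check; do
        echo "nhadm check failed; correct the reported problem,"
        echo "then press Return to run the checks again."
        read dummy
    done
    echo "All nhadm checks passed."
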


Reacting to a Failover

When a master node fails over to the vice-master node, a fault has occurred. Even though your cluster has recovered, the fault that caused the failover could have serious implications for the future performance of your cluster. You must treat a failover seriously. After a failover, perform the following procedure.

To React to a Failover

  1. Log in to the failed master node as superuser.

  2. Examine the system log files for information about the cause of the failover.

    For information about log files, see Chapter 2.

  3. Verify that the failed master node has been elected as the vice-master node:


    # nhcmmstat -c vice
    

    • If there is a vice-master node in the cluster, nhcmmstat prints information to the console window about the vice-master role.

    • If there is no vice-master node, nhcmmstat returns an error code.

      If there is no vice-master node, investigate why the failed master node is not capable of taking the vice-master role. For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

  4. Ensure that you have a valid cluster as described in Defining Minimum Criteria for a Cluster Running Highly Available Services.

  5. Run the nhadm check command to verify that the node is correctly configured.


    # nhadm check
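
Steps 3 through 5 can be combined into a single post-failover check, sketched below. The sketch assumes that both commands exit with a nonzero status on failure; confirm this against the nhcmmstat(1M) and nhadm(1M) man pages.


    #!/bin/sh
    # Sketch: post-failover verification, run on the failed master node.
    # ASSUMPTION: both commands exit nonzero on failure; confirm against
    # the nhcmmstat(1M) and nhadm(1M) man pages.

    if nhcmmstat -c vice; then
        echo "OK: the cluster has a vice-master node."
    else
        echo "ERROR: no vice-master node; see the Troubleshooting Guide."
    fi

    nhadm check || echo "ERROR: cluster configuration checks failed."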