C H A P T E R  3

Recovering From Startup Problems on Master-Eligible Nodes

If you have installed the Netra HA Suite, but are unable to start up the master-eligible nodes, see the following sections:


A Master-Eligible Node Does Not Boot

If a master-eligible node does not boot after installation, the cause could be one of the following problems:

• The Solaris Operating System does not start on the node. In this case, use the error messages and the Solaris documentation set to resolve the problem.

• The Netra HA Suite does not start on the node. In this case, perform the following procedure.

To Investigate Why the Netra HA Suite Does Not Start on a Master-Eligible Node

  1. Stop the continuous reboot cycle if such a cycle is running. For example, on a SPARC node with an OpenBoot PROM (OBP):

    1. Access the console of the failing node.

    2. Type the following command:


      # halt
      ok>
      

      Alternatively, type the following command:


      # Control-]
      telnet> send brk
      Type  'go' to resume
      ok>
      

      The ok prompt is returned.

    3. Boot the node in single-user mode to obtain a superuser prompt:


      ok> boot -s
      #
      

  2. Search the error messages on the console of the failing node for an indication of the problem.

    The error messages should indicate the cause of the error. For a summary of error messages and their possible causes, see Appendix A.

    If the error is a configuration error, the following message is displayed:


    Error in configuration
    

    The text following the message should indicate the type of configuration error. Verify that the configuration of the nhfs.conf file for the node is consistent with the information in the nhfs.conf(4) man page.
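
    For example, you can list the uncommented entries of the file and compare them with the parameter descriptions in the man page. This is only a quick sketch; the Solaris path is shown, and on Linux the file is located under /etc/opt/sun/nhas.

    # grep -v '^#' /etc/opt/SUNWcgha/nhfs.conf | grep -v '^$'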

  3. Confirm that the /etc/opt/SUNWcgha/not_configured file on the Solaris OS, or the /etc/opt/sun/nhas/not_configured file on Linux, does not exist on the failing node.

    • If the file does not exist, go to the next step.

    • If the file exists, delete it and reboot the node.
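
    For example, on a Solaris node the check and cleanup might look like the following sketch (on Linux, substitute /etc/opt/sun/nhas/not_configured):

    # ls /etc/opt/SUNWcgha/not_configured
    /etc/opt/SUNWcgha/not_configured
    # rm /etc/opt/SUNWcgha/not_configured
    # reboot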

  4. If your hardware includes an OpenBoot PROM diag-switch? variable, confirm that it is set to false:

    1. On the console of the failing node, get the ok prompt.

    2. Run:


      ok> printenv diag-switch?
      

      • If the diag-switch is set to false, go to Step 5.

      • If the diag-switch is set to true, set it to false and reboot the node:


        ok> setenv diag-switch? false
        ok> boot
        #
        

  5. If you cannot resolve this problem, contact your customer support center.


A Master Node Is Not Elected at Startup

At startup, the first master-eligible node that you boot should become the master node. The second master-eligible node that you boot should become the vice-master node. If the first master-eligible node does not become the master node, perform the following procedure.

To Investigate Why a Master Node Is Not Elected at Startup

  1. Log in to the first master-eligible node as superuser.

  2. Confirm that the /etc/opt/SUNWcgha/not_configured file on the Solaris OS, or the /etc/opt/sun/nhas/not_configured file on Linux, does not exist.

    • If the file does not exist, go to Step 3.

    • If the file exists, delete it and reboot the node.

  3. Confirm that the target.conf file has the attribute flag set to -.

    For more information, see the target.conf(4) man page.

    This attribute flag indicates that a master-eligible node is qualified to become the master node. The target.conf file contains the node description saved by the nhcmmd daemon on the master node. When a master node exists, or when the cluster is running, do not edit the target.conf file.

    • If the attribute flag is set to -, go to Step 4.

    • If the attribute flag is not set to -, do the following:

      a. Go into single-user mode.

      b. Edit the target.conf file to set the attribute flag to -.

      The node can be set with more than one attribute flag. Make sure that the flag - is the only flag that is set.

      c. Reboot the node.
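
      A minimal sketch of sub-steps a through c on a Solaris node follows. The location /etc/opt/SUNWcgha/target.conf is an assumption used here for illustration; confirm the actual path and the field syntax in the target.conf(4) man page before editing.

      # init s
      # vi /etc/opt/SUNWcgha/target.conf
      # reboot

      In the editor, make - the only value in the attribute field of the entry for this node.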

  4. Confirm that the node has write access to the cluster_nodes_table file.

    For more information, see the cluster_nodes_table(4) man page.

    • If the node has write access to the file, go to Step 5.

    • If the node does not have write access to the file, do the following:

      a. Change the access permissions as described in the chmod(1) man page.

      b. Reboot the node.
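
      For example, assuming the file is kept under /etc/opt/SUNWcgha (confirm the location for your installation in the cluster_nodes_table(4) man page), you might grant write access to the file owner as follows:

      # chmod u+w /etc/opt/SUNWcgha/cluster_nodes_table
      # reboot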

  5. In the cluster_nodes_table file, confirm that the node attribute flag is set to -.

    The - attribute flag indicates that the node is qualified to become the master node.

    • If the attribute flag is set to -, go to Step 6.

    • If the attribute flag is not set to -, do the following:

      a. Go into single-user mode.

      b. Edit the cluster_nodes_table file to set the flag to -.

      The node can have more than one attribute flag set. Confirm that - is the only flag that is set.

      c. Reboot the node.

  6. If you cannot resolve this problem, contact your customer support center.


Two Master Nodes Are Elected at Startup

At startup, when the first master-eligible node becomes the master node, the second master-eligible node should become the vice-master node. If the second master-eligible node cannot detect the master node, it also becomes a master node. The presence of two master nodes is an error scenario called split brain.

A direct link between the master-eligible nodes prevents the occurrence of split brain when the communication between the master node and vice-master node fails. For information about the direct link, see the Netra High Availability Suite 3.0 1/08 Foundation Services Overview.

If your cluster is configured to use a direct link, perform the procedure in To Investigate Split Brain on Clusters With a Direct Link. If your cluster is not configured to use a direct link, perform the procedure in To Investigate Split Brain on Clusters Without a Direct Link.

To Investigate Split Brain on Clusters With a Direct Link

  1. Confirm that the direct link is physically connected to the serial ports of both master-eligible nodes.

  2. Confirm that the nhfs.conf file contains the following parameters:


    Cluster.Direct-Link.Backend=serial
    Cluster.Direct-Link.Heartbeat=20
    Node.Direct-Link.serial.Device=/dev/term/b
    Node.Direct-Link.serial.Speed=115200
    

    The Cluster.Direct-Link.Heartbeat value is expressed in seconds.

    The Node.Direct-Link.serial.Speed can have one of the following values: 38400, 57600, 76800, or 115200.
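
    To check the configuration quickly, list the direct link parameters on each master-eligible node. This sketch assumes the Solaris location of the nhfs.conf file; on Linux the file is located under /etc/opt/sun/nhas.

    # grep Direct-Link /etc/opt/SUNWcgha/nhfs.conf

    The output should contain the four parameters shown above.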

  3. If the direct link is connected and configured correctly, and you still have a split brain error, contact your customer support center.

To Investigate Split Brain on Clusters Without a Direct Link

  1. Access the consoles of the master nodes.

  2. Confirm that you have two master nodes.

    On the console of each master-eligible node, run:


    # nhcmmstat -c all
    

    Each master node should see itself as master, and see the other master as being out of the cluster.

  3. Test the communication between the master nodes.

    On the console of each master-eligible node, run:


    # nhadm check starting
    

    When this command is run on a node, the command pings all of the other nodes in the cluster. If one master-eligible node cannot ping the other master-eligible node, the nodes are not communicating.

    If the Carrier Grade Transport Protocol (CGTP) is installed, the nhadm check command pings both of the network interfaces and the CGTP interface. If CGTP is not installed, the nhadm check command pings one network interface only.
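
    If nhadm check reports a failure, you can narrow down which path is broken by pinging the interfaces of the other master-eligible node directly. The addresses below are placeholders for the addresses assigned to the peer node's two NICs and, if CGTP is installed, its cgtp0 interface:

    # ping <peer-nic0-address>
    # ping <peer-nic1-address>
    # ping <peer-cgtp0-address>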

  4. Evaluate the result obtained in Step 3 by using the following table.


    • Result of nhadm check: Two network interface cards (NICs) fail, or one NIC fails and one NIC passes.
      Possible cause: Incorrect switch configuration or incorrect cabling.
      Action: Reconfigure the hardware as described in the Netra High Availability Suite 3.0 1/08 Foundation Services Getting Started Guide.

    • Result of nhadm check: Two NICs pass but the CGTP interface fails.
      Possible cause: Incorrect Netra HA Suite configuration.
      Action: Examine the nhfs.conf and cluster_nodes_table files.

    • Result of nhadm check: Two NICs and the CGTP interface pass.
      Possible cause: The master-eligible nodes exist in different domains.
      Action: Confirm that the nodes have the same value for the domainid parameter in the nhfs.conf file.
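
    For the last case, you can compare the domain identifier configured on the two master-eligible nodes. The exact parameter name is defined in the nhfs.conf(4) man page, so this sketch uses a case-insensitive search (Solaris path shown):

    # grep -i domainid /etc/opt/SUNWcgha/nhfs.conf

    The value returned must be identical on both nodes.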

  5. Confirm that all of the packages and patches are installed.

    1. Access the consoles of the master-eligible nodes.

    2. Display the installed packages and patches:

      On the Solaris OS:


      # patchadd -p
      # pkginfo
      

      On Linux:


      # rpm -qa
      

    3. Compare the list of installed packages and patches with the lists defined in the Netra High Availability Suite 3.0 1/08 Foundation Services README and patch READMEs.

      • If a required package or patch is not installed on the master-eligible node, install it and reboot both master-eligible nodes.

      • If all of the required packages and patches are installed, go to Step 6.
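
      To make the comparison easier, you can capture the lists to files on each node and then compare the files. The file names below are arbitrary; on Linux, use rpm -qa | sort instead of pkginfo and patchadd -p.

      # pkginfo | sort > /var/tmp/pkg.list
      # patchadd -p | sort > /var/tmp/patch.list

      Copy the files from one master-eligible node to the other, compare them with diff, and check any differences against the READMEs.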

  6. If you cannot resolve this problem, contact your customer support center.


The Vice-Master Node Remains Unsynchronized After Startup

After the startup of the master node and vice-master node, the data on the master node is copied to the vice-master node. In this way, the master node and vice-master node are synchronized. If the master node and vice-master node are not synchronized after startup, perform the following procedure.

To Investigate Why the Vice-Master Node Remains Unsynchronized After Startup

  1. Confirm that your cluster has a valid master node and vice-master node.

    1. Log in to a master-eligible node as superuser.

    2. Run the nhcmmstat command:


      # nhcmmstat -c all
      

      The nhcmmstat tool displays information about the roles of the peer nodes. The peer nodes should include a master node and a vice-master node. For more information about nhcmmstat, see the nhcmmstat(1M) man page.

      • If your cluster has a valid master node and vice-master node, go to Step 2.

      • If your cluster has no master node or vice-master node, you do not have a cluster. Verify your cluster configuration by examining the nhfs.conf and cluster_nodes_table files for configuration errors.

      • If your cluster has a master node but no vice-master node, reboot the master-eligible node that is not master.

        Verify that the second master-eligible node has become the vice-master node:


        # nhcmmstat -c all
        

  2. Confirm that the master node and vice-master node are unsynchronized:

    On Linux, run the following command:


    # /sbin/drbdadm state all
    

    If at least one partition is not in the state Primary/Secondary, the master node and vice-master node are unsynchronized. For more information, see the drbdadm(8) man page.

    For versions of the Solaris OS earlier than version 10:


    # /usr/opt/SUNWesm/sbin/scmadm -S -M
    

    For the Solaris 10 OS and later versions:


    # /usr/sbin/dsstat 1
    

    If the scmadm or dsstat tool does not reach the replicating state, the master node and vice-master node are unsynchronized. For more information, see the nhscmadm(1M) man page.

  3. Determine whether an nhcrfsd daemon is running on each master-eligible node:


    # pgrep -x nhcrfsd
    

    • If a process identifier is returned, the nhcrfsd daemon is running. Go to Step 4.

    • If a process identifier is not returned, the nhcrfsd daemon is not running. Perform the procedure in To Recover From Daemon Failure.

  4. On the master node and vice-master node, verify that the mount point is set correctly.

    The mount point is set by the RNFS.Share property in the /etc/opt/SUNWcgha/nhfs.conf file on the Solaris OS and in the /etc/opt/sun/nhas/nhfs.conf file on Linux. If the mount point is set correctly, the usr, root, and swap parameters in the RNFS.Share property have the following access permissions, respectively: ro, rw, and rw.
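
    For example, you can display the property on both nodes and check the access modes of its usr, root, and swap entries (Solaris path shown; on Linux the file is located under /etc/opt/sun/nhas):

    # grep RNFS.Share /etc/opt/SUNWcgha/nhfs.conf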

  5. For each node, confirm that the IP address of the cgtp0 interface is specified in the /etc/hosts file.
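
    A minimal check, assuming the interface is plumbed as cgtp0, is to read the interface address and then search for that address in /etc/hosts. The address in the second command is a placeholder; on Linux, you can use ip addr show cgtp0 if ifconfig is not available.

    # ifconfig cgtp0
    # grep <cgtp0-address> /etc/hosts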

  6. If you cannot resolve this problem, contact your customer support center.


A Monitored Daemon Fails Causing a Master-Eligible Node to Reboot at Startup

When a monitored daemon fails, the Daemon Monitor triggers a recovery response. The recovery response is often to restart the failed daemon. If the daemon fails to restart correctly, the Daemon Monitor reboots the node. The failure of a monitored daemon is the most common cause of a node reboot.

If the system recovers correctly, the daemon core dump and error message might be the only evidence of the failure. You must take the failure seriously even though the system has recovered.

For a list of recovery responses made by the Daemon Monitor, see the nhpmd(1M) man page.

For information about how to recover from the failure of a monitored daemon, see To Recover From Daemon Failure.

TABLE 3-1 summarizes some causes of daemon failure during the startup of master-eligible nodes.


TABLE 3-1   Causes of Daemon Failure at Startup of Master-Eligible Nodes

nhcrfsd
  • One of the following files on the master node contains errors: /etc/vfstab on the Solaris OS, /etc/fstab on Linux, cluster_nodes_table, or nhfs.conf.
  • The local file system of the failing node is mounted or unmounted incorrectly.
  • The network interface of the failing node is incorrectly configured.

nhcmmd
  • One of the following files on the master node contains errors: cluster_nodes_table or nhfs.conf.
  • The cgtp0 interface of the failing node is incorrectly configured.
  • The cgtp0 interface of the failing node could not be initialized.
  • The failing node cannot connect to the nhprobed daemon.
  • The failing node cannot access the /etc/services file.
  • The failing node cannot write to the cluster_nodes_table file when it is to be elected as master node.

nhprobed
  • The failing node cannot obtain information about the network interfaces.
  • The failing node cannot access the /etc/services file.
  • The failing node cannot create the required threads, sockets, or pipes.

in.dhcpd
  • A datastore location does not exist.
  • A datastore location is not mounted on the failing node.
  • The failing node cannot find the dhcptab file in the datastore.

nhnsmd
  • The nhfs.conf file on the master node contains errors.


The Node Management Agent on a Master-Eligible Node Exits at Startup

The following procedure describes what to do if the Node Management Agent (NMA) exits during the startup of the master-eligible nodes.

To Investigate Why the NMA on a Master-Eligible Node Exits at Startup

  1. Confirm that the port numbers are configured correctly.

    • Determine whether the nma.properties file content is consistent with the information provided in the nma.properties(4) man page.

    • Ensure that none of the ports described in the nma.properties file are already in use.

    For example, if the nma.properties file contains the following line:


    com.sun.nhas.ma.adaptors.snmp.port=8085
    

    you can check that the port is not already in use by typing the following:


    # /bin/netstat -a | grep 8085
    

    If this command returns output, the port is already in use. In that case, change the port value in the nma.properties file to a port that is not in use.
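
    For example, if port 8085 is in use, you might edit the line to point at a free port. The value 8086 below is only an illustration; any unused port is acceptable.

    com.sun.nhas.ma.adaptors.snmp.port=8086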

  2. After the port numbers are configured correctly, start the NMA on this node by running one of the following commands:

    On the Solaris 9 OS:


    # /etc/opt/SUNWcgha/init.d/nma start
    

    On the Solaris 10 OS:


    # /usr/sbin/svcadm restart svc:/system/cgha/nma
    

  3. Examine the system log files for the following messages:

    1. If the log files contain the following message, confirm that the /etc/services file contains an entry for the cmm-api.


      CMM statistics (JNI). Unable to access CMM statistics 
      (can't access cmm-api service port number).
      

      In addition, the /etc/netconfig file should contain the following entries:


      udp6       tpi_clts      v     inet6   udp    /dev/udp6       -
      tcp6       tpi_cots_ord  v     inet6   tcp    /dev/tcp6       -
      udp        tpi_clts      v     inet    udp    /dev/udp        -
      tcp        tpi_cots_ord  v     inet    tcp    /dev/tcp        -
      rawip      tpi_raw       -     inet    -      /dev/rawip      -
      ticlts     tpi_clts      v    loopback -      /dev/ticlts     straddr.so
      ticotsord  tpi_cots_ord  v    loopback -      /dev/ticotsord  straddr.so
      ticots     tpi_cots      v    loopback -      /dev/ticots     straddr.so
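
      You can also check for the cmm-api entry in the /etc/services file by searching for the service name. The port number and protocol assigned to cmm-api are installation specific, so only the name is searched for in this sketch:

      # grep cmm-api /etc/services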
      

    2. If the log files contain the following message, an RPC error occurred while accessing the CMM statistics.


      CMM statistics (JNI) Failed to get stats from CMM :[rpc return code]
      

      Use the RPC return code to diagnose and correct the problem.

    3. If the log file contains a CMM statistics (JNI) failure, such as one of the following messages:


      CMM statistics (JNI) Failed to get stats from CMM : [rpc return code]
      

      or:


      CMM statistics (JNI) Failed to get stats from CMM : [CMM status]
      

      or:


      CMM statistics (JNI) rpc call failed
      

      This means that there is a problem with the connection between the NMA and the CMM. Check the status of the nhcmmd daemon and its processes.

    4. If the log files contain the following message, CGTP is unavailable:


      KSTAT (JNI).  Unable to launch CGTP. CGTP statistics not available.
      

      Confirm that the redundant network is available and that the network configuration is correct by running the following command:


      # nhadm check
      

  4. Restart the NMA on all nodes as follows:

    On the Solaris 9 OS:


    # /etc/opt/SUNWcgha/init.d/nma stop
    

    The NMA will then be relaunched by the PMD.

    On the Solaris 10 OS:


    # /usr/sbin/svcadm restart svc:/system/cgha/nma
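
    You can then confirm that the service is back online by querying the FMRI used above:

    # svcs svc:/system/cgha/nma

    The service should be reported in the online state.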
    

  5. If you have problems relaunching the NMA, contact your customer support center.