CHAPTER 4

Recovering From Startup Problems on Diskless Nodes and Dataless Nodes

If you have started up the master-eligible nodes but are unable to start up the diskless nodes or dataless nodes, see the following sections:


A Diskless Node Does Not Boot at Startup

Diskless nodes boot using the Solaris Dynamic Host Configuration Protocol (DHCP) servers provided by the Reliable Boot Service. To boot diskless nodes, you must have a cluster containing at least one master-eligible node running the Netra HA Suite. While a diskless node is booting, you can use the snoop utility to see the parameters that the DHCP server transmits to the node.
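
For example, the following command, run on the master node while the diskless node boots, displays the DHCP packets exchanged with that node. Here, nic0 and mac-adr-of-dl-node are placeholders for your network interface and the MAC address of the diskless node:


    # snoop -v -d nic0 ether mac-adr-of-dl-node | grep -i dhcp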

This section describes what to do when the Solaris Operating System or Netra HA Suite does not start on a diskless node.

To Investigate Why the Solaris Operating System Does Not Start on a Diskless Node

Use this procedure when the Solaris Operating System does not start on a diskless node.

  1. Confirm that the spanning tree protocol is disabled.

    For Cisco 29x0 switches, do the following:

    1. Telnet to the Ethernet switch.

    2. Type the following command:


      # enable
      Password: <user-password>
      

    3. Type the following command:


      # show run
      

    4. Search the output on the console for the following line:


      no spanning-tree vlan <vlanid>
      

      If the display contains this line, the spanning tree is disabled.

      • If the spanning tree is disabled, go to Step 2.

      • If the spanning tree is not disabled, disable it.

        For information, see the Netra High Availability Suite 3.0 1/08 Foundation Services Getting Started Guide.
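
        If you must disable it manually on a Cisco 29x0 switch, the commands are similar to the following sketch (illustrative only; <vlanid> is a placeholder, and the exact syntax depends on your switch model and IOS version):


          # configure terminal
          # no spanning-tree vlan <vlanid>
          # end
          # copy running-config startup-config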

  2. Confirm that the DHCP configuration is correct.

    1. Access the consoles of the master node and vice-master node.

    2. On each console, confirm that the /etc/inet/dhcpsvc.conf file exists and has the correct attributes:


      # cat /etc/inet/dhcpsvc.conf
      DAEMON_ENABLED=TRUE
      RUN_MODE=server
      RESOURCE=SUNWnhrbs
      PATH=/SUNWcgha/remote/var/dhcp
      CONVER=1
      INTERFACE=nic0,nic1
      

      The RESOURCE parameter must be set to RESOURCE=SUNWnhrbs. By default, this parameter is set to RESOURCE=SUNWfiles.

      The PATH parameter must point to a directory in a replicated file system. By default, the directory is PATH=/SUNWcgha/remote/var/dhcp.

      If the file does not have the correct attributes, do the following:

      • Edit or create the /etc/inet/dhcpsvc.conf file, setting the attributes as stated previously.

      • Stop and restart the DHCP daemon.

        If your cluster is running the Solaris 9 OS, issue the following commands:


        # /etc/rc3.d/HA.S34dhcp stop
        # /etc/rc3.d/HA.S34dhcp start
        

        If your cluster is running the Solaris 10 OS, issue the following command:


        # svcadm restart svc:/network/dhcp-server:default
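
        To confirm that the daemon is running again, a quick check (assuming the standard Solaris DHCP server daemon, in.dhcpd):


        # pgrep -fl in.dhcpd

        On the Solaris 10 OS, you can also verify the service state:


        # svcs svc:/network/dhcp-server:default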
        

    3. If you are installing your cluster manually, confirm that the path has DHCP container files with the following names:

      SUNWnhrbs1_10_x_1_0, SUNWnhrbs1_10_x_2_0, and SUNWnhrbs1_dhcptab, where x is the domain identity.

      You do not need to perform this step if you are installing your cluster using the nhinstall tool.

      If the DHCP container files do not have the specified name, regenerate them, taking care to use the correct values for subnet1 and subnet2. For information about configuring DHCP for a diskless node, see the Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS.

    4. If you are using a static address assignment, confirm that the MAC address or client ID of the diskless node is configured correctly.

      Refer to the DHCP table on the master node.
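
      For example, you can display the DHCP network table with the pntadm utility; network-address is a placeholder for the subnet of the cluster network:


      # pntadm -P network-address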

  3. For SPARC, on the console of the diskless node, confirm that the following OpenBoot PROM parameter is set:


    boot-device net:dhcp,,,,,5 net2:dhcp,,,,,5
    



    Note - For information about performing this task on x64 platforms, refer to the product's hardware documentation.
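
    On SPARC platforms, you can inspect or set the parameter at the ok prompt; a minimal sketch (printenv displays the current value, setenv changes it):


    ok> printenv boot-device
    ok> setenv boot-device net:dhcp,,,,,5 net2:dhcp,,,,,5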



  4. Confirm that the vendor type of the diskless node is recognized by the master node.

    1. Access the console of the master node.

    2. Type the following command:


      # snoop -v -d nic0 ether mac-adr-of-dl-node | grep -i dhcp
      

      or, on the console of the diskless node, type the following commands at the OpenBoot PROM prompt:


      ok> dev /
      ok> .properties
      

      The vendor type of the diskless node is returned as a string.

    3. Search for the same string in the DHCP table on the master node.
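
      For example, assuming the default container path shown earlier, you can search the dhcptab container for the string; vendor-string is a placeholder for the string returned in the previous step:


      # grep vendor-string /SUNWcgha/remote/var/dhcp/SUNWnhrbs1_dhcptab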

  5. On the console of the master node, confirm that the directory /tftpboot is present.

    If this directory is not present, the following error message is written to the system log files:


    Timeout waiting for BOOTP/DHCP reply. Retrying...
    TFTP Error Access violation
    

    If the /tftpboot directory is not present on the vice-master node, the diskless node does not boot after a switchover. To set up the /tftpboot directory on the vice-master node, see the Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS.
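
    A quick check on the consoles of both master-eligible nodes:


    # ls -ld /tftpboot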

  6. On the console of the master node, confirm that the following directory contains a file for each diskless node interface: diskless_file_system/root/diskless_nodeid/etc/

    The files could be named as follows:


    hostname.hme0
    hostname.hme1
    

    If the directory does not contain a file for an interface, the interface cannot be configured.
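
    For example, a sketch to list the interface files, using the placeholder path above:


    # ls diskless_file_system/root/diskless_nodeid/etc/hostname.*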

  7. Examine the access permissions of root, swap, and usr in the nhfs.conf file on the master node.

    If your cluster was installed manually or by the nhinstall tool, confirm that the following access permissions are set:


    share -F nfs -o rw,root=diskless_node_id-nic0:diskless_node_id-nic1:diskless_node_id-cgtp0 /export/root/diskless_node_id
    share -F nfs -o rw,root=diskless_node_id-nic0:diskless_node_id-nic1:diskless_node_id-cgtp0 /export/swap/diskless_node_id
    share -F nfs -o ro /export/exec/Solaris_X_sparc.all/usr
    

    where X is the version of the Solaris OS installed on the cluster.
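
    To compare these entries against the file systems currently exported by the master node, list the active shares:


    # share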

  8. If you cannot resolve this problem, contact your customer support center.

To Investigate Why the Netra HA Suite Does Not Start on a Diskless Node

Use this procedure when the Solaris Operating System has started on the diskless node, but the Netra HA Suite does not start. In this error scenario, the diskless node might be in a continuous reboot cycle.

  1. Stop the continuous reboot cycle if such a cycle is running:

    1. Access the console of the failing node.

    2. Type the following command:


      # halt
      ok>
      

      Alternatively, type the following command:


      # Control-]
      telnet> send brk
      Type  'go' to resume
      ok>
      

      The ok prompt is returned.

    3. Boot in single user mode:


      ok> boot -s
      #
      



      Note - For information about performing this task on x64 platforms, refer to the product's hardware documentation.



  2. Search the messages displayed on the console of the failing node for an indication of the problem.

    The error messages should indicate the cause of the problem. Use the error messages to identify the failing daemon or failing service. For a summary of error messages and their possible causes, see Appendix A.

    If the error is a configuration error, the following message is displayed:


    Error in configuration
    

    The text following the message should indicate the type of configuration error. Verify that the configuration of the nhfs.conf file for the node is consistent with the information in the nhfs.conf(4) man page.
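
    For example, assuming the file is in its default location under /etc/opt/SUNWcgha (an assumption; adjust for your installation), you can review it directly:


    # more /etc/opt/SUNWcgha/nhfs.conf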

  3. Confirm that the /etc/opt/SUNWcgha/not_configured file does not exist on the failing node.

    • If the file does not exist, go to Step 4.

    • If the file exists, delete it and reboot the node.
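
    A minimal sketch of the check and, if necessary, the removal:


    # ls -l /etc/opt/SUNWcgha/not_configured
    # rm /etc/opt/SUNWcgha/not_configured
    # reboot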

  4. Confirm that the cluster_nodes_table file on the master node contains an entry for the failing node.

    • If the file contains an entry for the failing node, go to Step 5.

    • If the file does not contain an entry for the failing node, verify the installation and configuration.
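
    For example, assuming the table is stored under /etc/opt/SUNWcgha (an assumption; the location depends on your installation), search it for the failing node; failing-node is a placeholder for the node's name:


    # grep failing-node /etc/opt/SUNWcgha/cluster_nodes_table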

  5. If you cannot resolve this problem, contact your customer support center.


A Diskless Node Does Not Boot After Failover

If a failover occurs during the boot or reboot of a diskless node, the DHCP files can be corrupted. If this problem occurs, see A Diskless Node Does Not Reboot After Failover.


A Dataless Node Does Not Boot at Startup

Dataless nodes boot from a local disk and run customer applications locally. Dataless nodes access the Netra HA Suite through the cluster network and send data to the master node.

This section describes what to do if the Solaris Operating System or the Netra HA Suite does not start on a dataless node. Use this section only when you have a running cluster that contains a master node and a vice-master node.

If the Solaris Operating System does not start on a dataless node, use the error messages and the Solaris documentation set to resolve the problem. If the Netra HA Suite does not start on a dataless node, perform the following procedure.

To Investigate Why the Netra HA Suite Does Not Start on a Dataless Node

  1. Stop the continuous reboot cycle if such a cycle is running.

    For information, see Step 1 of To Investigate Why the Netra HA Suite Does Not Start on a Diskless Node.

  2. Search the messages on the console of the failing node for an indication of the problem.

    For information, see Step 2 of To Investigate Why the Netra HA Suite Does Not Start on a Diskless Node.

  3. Confirm that the /etc/opt/SUNWcgha/not_configured file does not exist on the failing node.

    For information, see Step 3 of To Investigate Why the Netra HA Suite Does Not Start on a Diskless Node.

  4. Confirm that the cluster_nodes_table file on the master node contains an entry for the failing node.

    For information, see Step 4 of To Investigate Why the Netra HA Suite Does Not Start on a Diskless Node.

  5. If you cannot resolve this problem, contact your customer support center.


A Monitored Daemon Fails Causing a Diskless Node or Dataless Node to Reboot at Startup

When a monitored daemon fails, the Daemon Monitor triggers a recovery response. The recovery response is often to restart the failed daemon. If the daemon fails to restart correctly, the Daemon Monitor reboots the node. The failure of a monitored daemon is the most common cause of a node reboot.

If the system recovers correctly, the daemon core dump and error message might be the only evidence of the failure. You must take the failure seriously even though the system has recovered.
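
If the daemon produced a core file, you can extract a stack trace from it with the Solaris pstack utility; core-file is a placeholder for the path to the dump:


    # pstack core-file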

For a list of recovery responses made by the Daemon Monitor, see the nhpmd(1M) man page.

For information about how to recover from the failure of a monitored daemon, see To Recover From Daemon Failure.

TABLE 4-1 summarizes some causes of daemon failure during the startup of diskless nodes and dataless nodes.


TABLE 4-1   Causes of Daemon Failure on Diskless Nodes and Dataless Nodes at Startup

Failed daemon: nhcmmd. Possible causes at startup:

  • One of the following files on the master node contains errors: cluster_nodes_table or nhfs.conf.

  • The cgtp0 interface of the failing node is configured incorrectly.

  • The cgtp0 interface of the failing node could not be initialized.

  • The failing node cannot connect to the nhprobed daemon.

  • The failing node cannot access the /etc/services file.

  • The failing node exceeded the time-out value.

Failed daemon: nhprobed. Possible causes at startup:

  • The failing node cannot obtain information about the network interfaces.

  • The failing node cannot access the /etc/services file.

  • The failing node cannot create the required threads, sockets, or pipes.