Chapter 5

Recovering From Failover and Switchover Problems

For information about how to recover from problems associated with failover and switchover, see the following sections:


Two Master Nodes Are Elected at Run Time

During run time, one master-eligible node should be the master node, and the other master-eligible node should be the vice-master node. When both master-eligible nodes act as master nodes, you have an error scenario called split brain. For information about split brain and the use of a direct link, see Two Master Nodes Are Elected at Startup.

If a split brain error occurs during run time on a cluster with a direct link, perform the procedure in To Investigate Split Brain on Clusters With a Direct Link. If a split brain error occurs during run time on a cluster without a direct link, perform the procedure in To Investigate Split Brain During Run Time on Clusters Without a Direct Link.

To Investigate Split Brain During Run Time on Clusters Without a Direct Link

  1. Access the consoles of the master nodes.

  2. Confirm that you have two master nodes.

    On the console of each master-eligible node, run:


    # nhcmmstat -c all
    

    Each master node should see itself as master, and see the other master as being out of the cluster.

  3. Test the communication between the master nodes.

    On the console of each master-eligible node, run:


    # nhadm check starting
    

    When this command is run on a node, it pings all of the other nodes in the cluster. If one master node cannot ping the other master node, the nodes are not communicating.

    • If the nodes are able to communicate with each other, go to Step 5.

    • If the nodes are not able to communicate with each other, examine the network interface values of the nodes.

      For information, see “Examining the Cluster Networking Configuration” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

    When the problem is resolved, Reliable NFS should automatically detect the split-brain situation and reboot the master-eligible nodes so that one node becomes the master node and the other becomes the vice-master node.
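The connectivity test in Step 3 can be approximated with standard tools. The following is a minimal sketch, not the `nhadm` implementation: it pings each peer node once and reports reachability. The function name and the hostnames passed to it are assumptions for illustration; on a real cluster you would pass the names or addresses of the other cluster nodes.

```shell
# Sketch only: approximate the reachability check that
# "nhadm check starting" performs, using plain ping.
# Hostnames passed as arguments are illustrative assumptions.
check_peers() {
  for peer in "$@"; do
    # One ping, short timeout, discard output; report the result.
    if ping -c 1 -W 2 "$peer" > /dev/null 2>&1; then
      echo "$peer reachable"
    else
      echo "$peer UNREACHABLE"
    fi
  done
}

# Example invocation with a placeholder host:
check_peers localhost
```

If a peer is reported unreachable here, the same network-interface checks described above apply.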

  4. Determine whether the spanning tree protocol is disabled, as described in Step 1 of To Investigate Why the Solaris Operating System Does Not Start on a Diskless Node.

  5. If you cannot resolve this problem, contact your customer support center.


A Diskless Node Does Not Reboot After Failover

If a failover occurs while a diskless node is booting or rebooting, the DHCP files can become corrupted. If this problem occurs, perform the following procedure.

To Reboot a Diskless Node After Failover

  1. Confirm that the cluster has recovered from the failover.

    For information, see “Reacting to a Failover” in the Netra High Availability Suite 3.0 1/08 Foundation Services Cluster Administration Guide.

  2. Reconfigure the boot policy for the diskless node.

    For information, see “Configuring DHCP for a Diskless Node” in the Netra High Availability Suite 3.0 1/08 Foundation Services Manual Installation Guide for the Solaris OS.

  3. Reload the DHCP table on the master node and the vice-master node:


    # pkill -HUP in.dhcpd
    
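Because `pkill -HUP` silently matches nothing when the daemon is not running, it can help to confirm that `in.dhcpd` is present before signaling it. The following is a small sketch using the standard `pgrep`/`pkill` utilities; the `hup_daemon` helper name is an assumption for illustration, not part of the product.

```shell
# Sketch only: send SIGHUP to a daemon, but report clearly
# if the daemon is not running. "hup_daemon" is a hypothetical
# helper name, not a Foundation Services command.
hup_daemon() {
  if pgrep -x "$1" > /dev/null 2>&1; then
    # -x matches the process name exactly.
    pkill -HUP -x "$1" && echo "sent SIGHUP to $1"
  else
    echo "$1 not running"
  fi
}

# Example invocation for the DHCP daemon:
hup_daemon in.dhcpd
```

Run this on both the master node and the vice-master node, as in Step 3.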


Replication Does Not Resume After Failover or Switchover

If replication does not resume after failover or switchover, examine the replication between the master-eligible nodes, as described in The Vice-Master Node Remains Unsynchronized After Startup.