CHAPTER 6

Starting and Stopping Services, Nodes, and Clusters

This chapter describes how to stop and start the Netra HA Suite software, a node, or a cluster. This chapter contains the following sections:

  •   Stopping and Restarting the Foundation Services
  •   Stopping and Restarting Daemon Monitoring
  •   Shutting Down and Restarting a Node
  •   Shutting Down and Restarting a Cluster
  •   Triggering a Switchover
  •   Recovering an IP-Replicated Cluster


Stopping and Restarting the Foundation Services

Maintenance on a peer node can disrupt communication between that node and the services and applications running on other peer nodes. During maintenance, you must isolate the node from the cluster by starting the node without the Foundation Services. After maintenance, reintegrate the node into the cluster by restarting the Foundation Services.

procedure icon  To Start a Node Without the Foundation Services

  1. Log in as superuser to the node on which you want to stop the Netra HA Suite software.

  2. Create the not_configured file on the node.

    On Solaris OS systems:


    # touch /etc/opt/SUNWcgha/not_configured
    

    On Linux systems:


    # touch /etc/opt/sun/nhas/not_configured

  3. Reboot the node as described in To Perform a Clean Reboot of a Solaris OS Node or To Perform a Clean Reboot of a Linux Node.

    The node restarts without the Foundation Services running. If the node is the master node, this procedure causes a failover.

  4. Verify that the Foundation Services are not running:


    # pgrep -x nhcmmd
    

    If the Foundation Services have been stopped, no process identifier should appear for the nhcmmd daemon.
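
    For example, a quick scripted check (pgrep exits with a nonzero status when no process matches):


    # pgrep -x nhcmmd || echo "Foundation Services are stopped"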

procedure icon  To Stop and Restart the Foundation Services Without Stopping the Solaris OS

Use this procedure to restart the Foundation Services when you do not need to stop the Solaris OS (for example, to apply a new patch).

  1. Go to single-user mode:


    # init S

  2. Return to multi-user mode:


    # init 3

procedure icon  To Stop and Restart the Foundation Services Without Stopping Linux

Use this procedure to restart the Foundation Services when you do not need to stop Linux (for example, to apply a new patch).

  1. Go to single-user mode:


    # telinit 1

  2. Return to multi-user mode:


    # telinit 3

procedure icon  To Restart the Foundation Services

Use this procedure to restart the Foundation Services on a node after performing the procedure in To Start a Node Without the Foundation Services.

  1. Log in as superuser to the node on which you want to restart the Foundation Services.

  2. Check that the not_configured file is not present.

    The file is located at /etc/opt/SUNWcgha/not_configured on Solaris systems, and /etc/opt/sun/nhas/not_configured on Linux systems. If that file is present, delete it.
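
    For example, on a Solaris OS node:


    # rm /etc/opt/SUNWcgha/not_configured


    On a Linux node, remove /etc/opt/sun/nhas/not_configured instead.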

  3. Reboot the node as described in To Perform a Clean Reboot of a Solaris OS Node or in To Perform a Clean Reboot of a Linux Node, depending on the OS your system uses.

  4. Verify the configuration of the node:


    # nhadm check configuration
    

    If the node is configured correctly, the nhadm command does not encounter any errors.

    For information about the nhadm command, see the nhadm(1M) man page.

  5. Verify that the services have started correctly:


    # nhadm check starting
    

    If the Foundation Services have started correctly, the nhadm command does not encounter any errors.


Stopping and Restarting Daemon Monitoring

Sometimes you need to stop Daemon Monitoring to investigate why a monitored daemon has failed. This section describes how to stop and restart Daemon Monitoring.

For information about the causes of daemon failure at startup and runtime, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

procedure icon  To Stop Daemon Monitoring

This procedure stops Daemon Monitoring. On reboot, Daemon Monitoring is not automatically restarted.

  1. Log in as superuser to the node on which you want to stop the monitoring daemon.

  2. Create the special file:

    If the node is running the Solaris OS:


    # touch /etc/opt/SUNWcgha/not_under_pmd_control

    If the node is running Linux:


    # touch /etc/opt/sun/nhas/not_under_pmd_control

  3. Reboot the node as described in To Perform a Clean Reboot of a Solaris OS Node or in To Perform a Clean Reboot of a Linux Node, depending on the OS your system uses.

    The Foundation Services start, and the OS and Netra HA Suite daemons that were monitored are no longer monitored.

procedure icon  To Restart Daemon Monitoring

If Daemon Monitoring was stopped using To Stop Daemon Monitoring, restart Daemon Monitoring as follows:

  1. Log in as superuser to the node on which you want to restart Daemon Monitoring.

  2. Remove the special file.

    If the node is running the Solaris OS:


    # rm /etc/opt/SUNWcgha/not_under_pmd_control

    If the node is running Linux:


    # rm /etc/opt/sun/nhas/not_under_pmd_control

  3. Reboot the node as described in To Perform a Clean Reboot of a Solaris OS Node or in To Perform a Clean Reboot of a Linux Node, depending on the OS your system uses.

    The Foundation Services start and are monitored by the Daemon Monitor.


Shutting Down and Restarting a Node

This section describes how to shut down and restart a node. The consequences of stopping a node depend on the role of the node. If you shut down a master-eligible node, you no longer have a redundant cluster.

General Rules for Shutting Down a Node

To shut down nodes, observe the following procedures:

procedure icon  To Perform a Clean Reboot of a Solaris OS Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use init 6 to reboot a node.
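
    For example:


    # init 6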

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # lockfs -fa
      

    3. Perform an immediate reboot:


      # uadmin 1 1 /*A_REBOOT AD_BOOT*/
      

procedure icon  To Perform a Clean Reboot of a Linux Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use init 6 to reboot a node.
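
    For example:


    # init 6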

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # sync
      

    3. Perform an immediate reboot:


      # reboot -n -f
      

procedure icon  To Perform a Clean Power Off of a Solaris Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use init 5 to power off a node.
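
    For example:


    # init 5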

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # lockfs -fa
      

    3. Perform an immediate power off:


      # uadmin 1 6 /*A_REBOOT AD_POWEROFF*/
      

procedure icon  To Perform a Clean Power Off of a Linux Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use poweroff to power off a node.
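
    For example:


    # poweroff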

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # sync
      

    3. Perform an immediate power off:


      # poweroff -n -f
      

procedure icon  To Perform a Clean Halt of a Solaris Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use init 0 to halt a node.
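
    For example:


    # init 0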

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # lockfs -fa
      

    3. Perform an immediate halt:


      # uadmin 1 0 /*A_REBOOT AD_HALT*/
      

procedure icon  To Perform a Clean Halt of a Linux Node

Determine if the Foundation Services are running.

  1. If the Foundation Services are not running, use halt to halt a node.
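
    For example:


    # halt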

  2. When the Foundation Services are running, use the following procedure:

    1. Stop the user applications.

    2. Flush the file systems:


      # sync
      

    3. Perform an immediate halt:


      # halt -n -f
      

procedure icon  To Abruptly Reboot a Solaris Node

  •   Reboot the node:


    # uadmin 1 1 /*A_REBOOT AD_BOOT*/
    

    The node stops immediately without any further processing and is rebooted.

procedure icon  To Abruptly Reboot a Linux Node

  •   Reboot the node:


    # reboot -n -f
    

    The node stops immediately without any further processing and is rebooted.

procedure icon  To Abruptly Power Off a Solaris Node

  •   Power off the node:


    # uadmin 1 6 /*A_REBOOT AD_POWEROFF*/
    

    The node stops immediately without any further processing.

procedure icon  To Abruptly Power Off a Linux Node

  •   Power off the node:


    # poweroff -n -f
    

    The node stops immediately without any further processing.

procedure icon  To Abruptly Halt a Solaris Node

  •   Halt the node:


    # uadmin 1 0 /*A_REBOOT AD_HALT*/
    

    The node stops immediately without any further processing.

procedure icon  To Abruptly Halt a Linux Node

  •   Halt the node by typing the following command:


    # halt -n -f
    

    The node stops immediately without any further processing.

Shutting Down a Node

This section describes how to shut down a master node, a vice-master node, a diskless node, and a dataless node.

procedure icon  To Shut Down the Master Node

Before shutting down the master node, perform a switchover as described in To Trigger a Switchover With nhcmmstat. The vice-master node becomes the new master node, and the old master node becomes the new vice-master node. Then, shut down the new vice-master node as described in To Shut Down the Vice-Master Node.

To shut down the master node without first performing a switchover, do the following:

  1. Log in to the master node as superuser.

  2. Shut down the master node as described in To Perform a Clean Power Off of a Solaris Node or To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

    The vice-master node becomes the master node. Because there are only two master-eligible nodes in the cluster and one is shut down, your cluster is not highly available. To restore high availability, restart the stopped node.

procedure icon  To Shut Down the Vice-Master Node

  1. Log in to the vice-master node as superuser.

  2. Shut down the vice-master node as described in To Perform a Clean Power Off of a Solaris Node or To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

    Because there are only two master-eligible nodes in the cluster and one is shut down, your cluster is not highly available. To restore high availability, restart the stopped node.

procedure icon  To Shut Down a Diskless Node or Dataless Node

  1. Log in as superuser to the node you want to shut down.

  2. Shut down the node as described in To Perform a Clean Power Off of a Solaris Node or To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

    When a diskless node or dataless node is shut down, there is no impact on the roles of the other peer nodes.

Restarting a Node

This section describes how to restart a node that has been stopped by one of the procedures in Shutting Down a Node.



Note - For x64 platforms, refer to the hardware documentation for information about performing tasks that reference OBP commands and that, therefore, apply only to the SPARC architecture.



procedure icon  To Restart a Node

  1. Restart the node.

    • If the node is powered off, power on the node.

    • If the node is not powered off but is at the open boot prompt, boot the node:


       ok> boot
      

      If the node is in single-user mode, go to multi-user mode using CTRL-D.

    If the node is a peer node, restarting the node reintegrates it into the cluster.

  2. Log in to the restarted node as superuser.

  3. Verify that the node has started correctly:


    # nhadm check
    

    For more information, see the nhadm(1M) man page.


Shutting Down and Restarting a Cluster

This section describes how to shut down and restart a cluster.

procedure icon  To Shut Down a Cluster

  1. Log in to a peer node as superuser.

  2. Identify the role of each peer node:


    # nhcmmstat -c all
    

    Record the role of each node.

  3. Shut down each diskless and dataless node as described in To Perform a Clean Power Off of a Solaris Node or in To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

  4. Verify that the vice-master node is synchronized with the master node (not applicable for shared disk configurations):

    For versions of the Solaris OS earlier than version 10:


    # /usr/opt/SUNWesm/sbin/scmadm -S -M
    

    For the Solaris 10 OS and later:


    # /usr/sbin/dsstat 1
    

    For the Linux OS:


    # drbdadm cstate all
    

    • If the drbdadm command indicates "Connected," the vice-master node is synchronized with the master node.

    • If the vice-master node is not synchronized with the master node, synchronize it:


      # nhcrfsadm -f all
      

  5. Shut down the vice-master node by logging in to the vice-master node and following the steps provided in To Perform a Clean Power Off of a Solaris Node or in To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

  6. Shut down the master node by logging in to the master node and following the steps provided in To Perform a Clean Power Off of a Solaris Node or in To Perform a Clean Power Off of a Linux Node, depending on the OS your system uses.

    For further information about the init command, see the init(1M) man page.

procedure icon  To Restart a Cluster

This procedure describes how to restart a cluster that has been shut down as described in To Shut Down a Cluster.




Caution - To restart a cluster, you boot each peer node. The order in which you boot the nodes is important. Restart the nodes so that they have the same role as they had before the cluster was shut down. If you do not maintain the roles of the nodes, you might lose data on systems using IP replication.



  1. Access the master node’s system console and type the following:


     ok> boot
    



    Note - For x64 platforms, refer to the hardware documentation for information about performing tasks that reference OpenBoot™ PROM (OBP) commands and, therefore, apply only to the SPARC architecture.



  2. When the node has finished booting, verify that the master node is correctly configured:


    # nhadm check configuration
    

  3. Access the vice-master node’s system console and type the following:


    ok> boot 
    

  4. When the node has finished booting, verify that the vice-master node is correctly configured:


    # nhadm check configuration
    

  5. Access the system consoles of each diskless or dataless node and type the following:


    ok> boot 
    

  6. When the nodes have finished booting, verify that each node is correctly configured:


    # nhadm check configuration
    

  7. From any node in the cluster, verify that the cluster has started up successfully:


    # nhadm check starting
    

  8. Confirm that each node has the same role it had before it was shut down.
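
    For example, from any peer node, list the current roles and compare them with the roles recorded in Step 2:


    # nhcmmstat -c all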




    Caution - After an emergency shutdown, the order in which the nodes are rebooted is important if availability or data integrity is a priority on your cluster. The order in which these nodes are restarted depends on the Data Management Policy you selected in your initial cluster configuration. For more information, see the nhfs.conf(4) and cluster_definition.conf(4) man pages.




Triggering a Switchover

Before you perform a switchover, verify that the master and vice-master disks are synchronized, as described in To Verify That the Master Node and Vice-Master Node Are Synchronized. To trigger a switchover, perform the following procedure.

procedure icon  To Trigger a Switchover With nhcmmstat

  1. Log in to the master node as superuser.

  2. Trigger a switchover:


    # nhcmmstat -c so
    

    • If there is a vice-master node qualified to become master, this node is elected as the master node. The old master node becomes the vice-master node.

    • If there is no potential master node, nhcmmstat does not perform the switchover.

  3. Verify the cluster configuration:


    # nhadm check
    

    If the switchover was successful, the current node is the vice-master node.

  4. Verify that the current node is now the vice-master node:


    # nhcmmstat -c vice
    

    For more information, see the nhcmmstat(1M) man page.


Recovering an IP-Replicated Cluster

If the master node and the vice-master node both act as master nodes, this condition is called split brain. For information about how to recover from split brain at startup and at runtime, see the Netra High Availability Suite 3.0 1/08 Foundation Services Troubleshooting Guide.

The following procedure is specific to IP-replicated clusters because split brain is unlikely to occur with a shared disk configuration. For a shared disk configuration, check that the configuration is normal and then reboot.

procedure icon  To Recover a Solaris Cluster After Failure

  1. Stop all of the nodes in the cluster as described in To Perform a Clean Power Off of a Solaris Node.

  2. Boot both of the master-eligible nodes in single-user mode.


     ok> boot -s
    



    Note - For x64 platforms, refer to the hardware documentation for information about performing tasks that reference OBP commands and, therefore, apply only to the SPARC architecture.



  3. Confirm that the master-eligible nodes are configured correctly.

    For each master-eligible node, do the following:

    1. Confirm that the following files exist and are not empty:

      • cluster_nodes_table

      • target.conf
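
      For example, assuming both files are in the default /etc/opt/SUNWcgha directory (adjust the paths if your configuration stores these files elsewhere):


      # ls -l /etc/opt/SUNWcgha/cluster_nodes_table
      # ls -l /etc/opt/SUNWcgha/target.conf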

    2. Reset the replication configuration (answer = Y):

      On the Solaris 9 OS:


      # /opt/SUNWesm/SUNWrdc/sbin/sndradm -d
      Disable Remote Mirror? (Y/N) [N]: Y
      #
      

      On the Solaris 10 OS:


      # /usr/sbin/sndradm -d
      Disable Remote Mirror? (Y/N) [N]: Y
      #
      

    3. Synchronize the file system by using /sbin/sync.
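
      For example:


      # /sbin/sync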

    4. Stop the master-eligible node.

  4. Boot the nodes in the following order:

    1. Boot first the master-eligible node that has the most up-to-date set of data.




      Caution - The node that becomes the vice-master node will have the recent file system data erased.



    2. Confirm that the first master-eligible node has become the master node.

    3. Boot the second master-eligible node.

    4. Confirm that the second master-eligible node has become the vice-master node.

    5. Wait until the master node and vice-master node are synchronized.

      This is a full resynchronization and might take some time.
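
      You can monitor the resynchronization with the same commands used to verify synchronization in To Shut Down a Cluster; for example, on the Solaris 10 OS and later:


      # /usr/sbin/dsstat 1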

    6. Boot the diskless and dataless nodes, if any exist.

      You can boot diskless and dataless nodes in any order.

procedure icon  To Recover a Linux Cluster After Failure

  1. Stop all peer nodes in the cluster as described in To Perform a Clean Power Off of a Linux Node.

  2. Note which node is the master and which is the vice-master, then restart both master-eligible nodes with the Netra HA Suite software disabled. On each master-eligible node:


     # touch /etc/opt/sun/nhas/not_configured
     # reboot -n -f
    

  3. Confirm that the master-eligible nodes are configured correctly.

    For each master-eligible node, do the following:

    1. Confirm that the following files exist and are not empty:

      • cluster_nodes_table

      • target.conf
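
      For example, assuming both files are in the default /etc/opt/sun/nhas directory (adjust the paths if your configuration stores these files elsewhere):


      # ls -l /etc/opt/sun/nhas/cluster_nodes_table
      # ls -l /etc/opt/sun/nhas/target.conf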

    2. Reset the DRBD replication configuration:

      On the vice-master node:


      # drbdadm secondary all
      

      On the master node:


      # drbdadm primary all
      # drbdadm invalidate_remote all
      

      This will trigger a full re-synchronization from the master node to the vice-master node.




      Caution - The vice-master node will have the recent file system data erased.



    3. Wait until the master node and vice-master node are synchronized. This is a full re-synchronization and might take some time.
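
      You can monitor the resynchronization with the DRBD status command used earlier in this chapter; when the connection state returns to Connected, the resynchronization is complete:


      # drbdadm cstate all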

    4. Remove the not_configured file on both the master and vice-master node:


      # rm /etc/opt/sun/nhas/not_configured
      

  4. Boot the nodes in the following order:

    1. Boot the first master-eligible node.

    2. Confirm that the first master-eligible node has become the master node.

    3. Boot the second master-eligible node.

    4. Confirm that the second master-eligible node has become the vice-master node.

    5. Wait until the master node and vice-master node are synchronized.

    6. Boot the diskless and dataless nodes, if any exist.

      You can boot diskless and dataless nodes in any order.