Sun Cluster 3.0 System Administration Guide

Chapter 2 Shutting Down and Booting a Cluster

This chapter provides the procedures for shutting down and booting a cluster and individual cluster nodes.

For a high-level description of the procedures in this chapter, see Table 2-1 and Table 2-2.

2.1 Shutting Down and Booting a Cluster Overview

The Sun Cluster scshutdown(1M) command stops cluster services in an orderly fashion and cleanly shuts down the cluster.


Note -

Use scshutdown instead of the shutdown or halt commands to ensure proper shutdown of the entire cluster. The Solaris shutdown command is used to shut down individual nodes.


The scshutdown command stops a cluster by:

  1. Taking all running resource groups offline

  2. Unmounting all cluster file systems

  3. Shutting down active device services

  4. Running init 0 and bringing all nodes to the ok PROM prompt

You might shut down the entire cluster when moving it from one location to another, or when an application error has caused data corruption.
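
If you need to give logged-in users warning before the cluster goes down, scshutdown also accepts a non-zero grace period and an optional broadcast message (see the scshutdown(1M) man page for the exact syntax on your release). The following command is only a sketch; the grace period and message text are illustrations.


    # scshutdown -g 30 -y "The cluster is shutting down for maintenance in 30 seconds."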


Note -

If necessary, you can boot a node so that it does not participate in cluster membership, that is, in non-cluster mode. This is useful when installing cluster software or when performing certain administrative procedures. See "2.2.4 How to Boot a Cluster Node in Non-Cluster Mode" for more information.



Table 2-1 Task Map: Shutting Down and Booting a Cluster

Task: Stop the cluster
    - Use scshutdown
For instructions, go to "2.1.1 How to Shut Down a Cluster".

Task: Start the cluster by booting all nodes.
    The nodes must have a working connection to the cluster interconnect to attain cluster membership.
For instructions, go to "2.1.2 How to Boot a Cluster".

Task: Reboot the cluster
    - Use scshutdown to shut down the cluster, then, at the ok prompt, boot each node individually with the boot command.
    The nodes must have a working connection to the cluster interconnect to attain cluster membership.
For instructions, go to "2.1.3 How to Reboot a Cluster".

2.1.1 How to Shut Down a Cluster

  1. (Optional). For a cluster running Oracle Parallel Server (OPS), shut down all OPS database instances.

    Refer to the Oracle Parallel Server product documentation for shutdown procedures.

  2. Become superuser on a node in the cluster.

  3. Shut down the cluster immediately by using the scshutdown(1M) command.

    From a single node in the cluster, enter the following command.


    # scshutdown -g 0 -y
    
  4. Verify that all nodes have reached the ok PROM prompt.

  5. If necessary, power off the nodes.

2.1.1.1 Example--Shutting Down a Cluster

The following example shows the console output when stopping normal cluster operation and bringing all nodes down to the ok prompt. The -g 0 option sets the shutdown grace period to zero, and -y provides an automatic yes response to the confirmation question. Shutdown messages also appear on the consoles of the other nodes in the cluster.


# scshutdown -g 0 -y
Sep  2 10:08:46 phys-schost-1 cl_runtime: WARNING: CMM monitoring disabled.
phys-schost-1# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
/etc/rc0.d/K05initrgm: Calling scswitch -S (evacuate)
The system is down.
syncing file systems... done
Program terminated
ok 

2.1.1.2 Where to Go From Here

See "2.1.2 How to Boot a Cluster" to restart a cluster that has been shut down.

2.1.2 How to Boot a Cluster

  1. To start a cluster whose nodes have been shut down and are at the ok PROM prompt, boot each node.

    The order in which the nodes are booted does not matter unless you make configuration changes between shutdowns. In that case, start the nodes so that the one with the most current configuration boots first.


    ok boot
    

    Messages appear on the booted nodes' consoles as cluster components are activated.


    Note -

    Cluster nodes must have a working connection to the cluster interconnect to attain cluster membership.


  2. Verify that the nodes booted without error and are online.

    The scstat(1M) command reports the nodes' status.


    # scstat -n
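
    In the scstat -n output, a node that has joined the cluster is reported as Online. The following is a sketch of what the node-status section might look like for a three-node cluster; the exact layout can vary between releases.


    -- Cluster Nodes --

                        Node name           Status
                        ---------           ------
      Cluster node:     phys-schost-1       Online
      Cluster node:     phys-schost-2       Online
      Cluster node:     phys-schost-3       Online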
    

2.1.2.1 Example--Booting a Cluster

The following example shows the console output when booting node phys-schost-1 into the cluster. Similar messages appear on the consoles of the other nodes in the cluster.


ok boot
Rebooting with command: boot 
...
Hostname: phys-schost-1
Booting as part of a cluster
NOTICE: Node 1 with votecount = 1 added.
NOTICE: Node 2 with votecount = 1 added.
NOTICE: Node 3 with votecount = 1 added.
...
NOTICE: Node 1: attempting to join cluster
...
NOTICE: Node 2 (incarnation # 937690106) has become reachable.
NOTICE: Node 3 (incarnation # 937690290) has become reachable.
NOTICE: cluster has reached quorum.
NOTICE: node 1 is up; new incarnation number = 937846227.
NOTICE: node 2 is up; new incarnation number = 937690106.
NOTICE: node 3 is up; new incarnation number = 937690290.
NOTICE: Cluster members:   1  2  3
...
NOTICE: Node 1: joined cluster
...
The system is coming up.  Please wait.
checking ufs filesystems
...
reservation program successfully exiting
Print services started.
volume management starting.
The system is ready.
phys-schost-1 console login: 

2.1.3 How to Reboot a Cluster

Run the scshutdown(1M) command to shut down the cluster, then boot the cluster with the boot command on each node.

  1. (Optional). For a cluster running Oracle Parallel Server (OPS), shut down all OPS database instances.

    Refer to the Oracle Parallel Server product documentation for shutdown procedures.

  2. Become superuser on a node in the cluster.

  3. Shut down the cluster by using the scshutdown command.

    From a single node in the cluster, enter the following command.


    # scshutdown -g 0 -y 
    

    This shuts down each node to the ok PROM prompt.


    Note -

    Cluster nodes must have a working connection to the cluster interconnect to attain cluster membership.


  4. Boot each node.

    The order in which the nodes are booted does not matter unless you make configuration changes between shutdowns. In that case, start the nodes so that the one with the most current configuration boots first.


    ok boot
    

    Messages appear on the booted nodes' consoles as cluster components are activated.

  5. Verify that the nodes booted without error and are online.

    The scstat command reports the nodes' status.


    # scstat -n
    

2.1.3.1 Example--Rebooting a Cluster

The following example shows the console output when stopping normal cluster operation, bringing all nodes down to the ok prompt, and then restarting the cluster. The -g 0 option sets the grace period to zero, and -y provides an automatic yes response to the confirmation question. Shutdown messages also appear on the consoles of other nodes in the cluster.


# scshutdown -g 0 -y
Sep  2 10:08:46 phys-schost-1 cl_runtime: WARNING: CMM monitoring disabled.
phys-schost-1# 
INIT: New run level: 0
The system is coming down.  Please wait.
...
The system is down.
syncing file systems... done
Program terminated
ok boot
Rebooting with command: boot 
...
Hostname: phys-schost-1
Booting as part of a cluster
...
NOTICE: Node 1: attempting to join cluster
...
NOTICE: Node 2 (incarnation # 937690106) has become reachable.
NOTICE: Node 3 (incarnation # 937690290) has become reachable.
NOTICE: cluster has reached quorum.
...
NOTICE: Cluster members:   1  2  3
...
NOTICE: Node 1: joined cluster
...
The system is coming up.  Please wait.
checking ufs filesystems
...
reservation program successfully exiting
Print services started.
volume management starting.
The system is ready.
phys-schost-1 console login:

2.2 Shutting Down and Booting a Single Cluster Node


Note -

Use the scswitch command in conjunction with the Solaris shutdown command to shut down an individual node. Use the scshutdown command only when shutting down an entire cluster.


Table 2-2 Task Map: Shutting Down and Booting a Cluster Node

Task: Stop a cluster node
    - Use scswitch(1M) and shutdown(1M)
For instructions, go to "2.2.1 How to Shut Down a Cluster Node".

Task: Start a node by booting it.
    The node must have a working connection to the cluster interconnect to attain cluster membership.
For instructions, go to "2.2.2 How to Boot a Cluster Node".

Task: Stop and restart (reboot) a cluster node
    - Use scswitch and shutdown
    The node must have a working connection to the cluster interconnect to attain cluster membership.
For instructions, go to "2.2.3 How to Reboot a Cluster Node".

Task: Boot a node so that it does not participate in cluster membership
    - Use scswitch and shutdown, then boot -x
For instructions, go to "2.2.4 How to Boot a Cluster Node in Non-Cluster Mode".

2.2.1 How to Shut Down a Cluster Node

  1. (Optional). For a cluster node running Oracle Parallel Server (OPS), shut down all OPS database instances.

    Refer to the Oracle Parallel Server product documentation for shutdown procedures.

  2. Become superuser on the cluster node to be shut down.

  3. Shut down the cluster node by using the scswitch and shutdown commands.

    On the node to be shut down, enter the following commands.


    # scswitch -S -h node
    # shutdown -g 0 -y
    
  4. Verify that the cluster node has reached the ok PROM prompt.

  5. If necessary, power off the node.

2.2.1.1 Example--Shutting Down a Cluster Node

The following example shows the console output when shutting down node phys-schost-1. The -g 0 option sets the grace period to zero, and -y provides an automatic yes response to the confirmation question. Shutdown messages for this node appear on the consoles of other nodes in the cluster.


# scswitch -S -h phys-schost-1
# shutdown -g 0 -y
Sep  2 10:08:46 phys-schost-1 cl_runtime: WARNING: CMM monitoring disabled.
phys-schost-1# 
INIT: New run level: 0
The system is coming down.  Please wait.
Notice: rgmd is being stopped.
Notice: rpc.pmfd is being stopped.
Notice: rpc.fed is being stopped.
umount: /global/.devices/node@1 busy
umount: /global/phys-schost-1 busy
The system is down.
syncing file systems... done
Program terminated
ok 

2.2.1.2 Where to Go From Here

See "2.2.2 How to Boot a Cluster Node" to restart a cluster node that has been shut down.

2.2.2 How to Boot a Cluster Node


Note -

Starting a cluster node can be affected by the quorum configuration. In a two-node cluster, you must have a quorum device configured such that the total quorum count for the cluster is three (one for each node and one for the quorum device). In this situation, if the first node is shut down, the second node still has quorum and runs as the sole cluster member. For the first node to rejoin the cluster as a cluster node, the second node must be up and running and the required cluster quorum count (two) must be present.
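
If you are unsure of the current quorum configuration and vote counts, you can check them from an active cluster member before booting the node. The following is a minimal sketch that uses the quorum option of the scstat(1M) command.


    # scstat -q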


  1. To start a cluster node that has been shut down, boot the node.


    ok boot
    

    Messages appear on the booted node's console, and on the member nodes' consoles, as cluster components are activated.


    Note -

    A cluster node must have a working connection to the cluster interconnect to attain cluster membership.


  2. Verify that the node has booted without error, and is online.

    The scstat(1M) command reports the status of a node.


    # scstat -n
    

2.2.2.1 Example--Booting a Cluster Node

The following example shows the console output when booting node phys-schost-1 into the cluster.


ok boot
Rebooting with command: boot 
...
Hostname: phys-schost-1
Booting as part of a cluster
...
NOTICE: Node 1: attempting to join cluster
...
NOTICE: Node 1: joined cluster
...
The system is coming up.  Please wait.
checking ufs filesystems
...
reservation program successfully exiting
Print services started.
volume management starting.
The system is ready.
phys-schost-1 console login:

2.2.3 How to Reboot a Cluster Node

  1. (Optional). For a cluster node running Oracle Parallel Server (OPS), shut down all OPS database instances.

    Refer to the Oracle Parallel Server product documentation for shutdown procedures.

  2. Become superuser on the cluster node to be shut down.

  3. Shut down the cluster node by using the scswitch and shutdown commands.

    Enter these commands on the node to be shut down.


    # scswitch -S -h node
    # shutdown -g 0 -y -i 6
    

    The -i 6 option with the shutdown command causes the node to reboot after it shuts down to the ok PROM prompt.


    Note -

    Cluster nodes must have a working connection to the cluster interconnect to attain cluster membership.


  4. Verify that the node has booted without error, and is online.

    The scstat(1M) command reports the status of a node.


    # scstat -n
    

2.2.3.1 Example--Rebooting a Cluster Node

The following example shows the console output when shutting down and restarting node phys-schost-1. The -g 0 option sets the grace period to zero, and -y provides an automatic yes response to the confirmation question. Shutdown and startup messages for this node appear on the consoles of other nodes in the cluster.


# scswitch -S -h phys-schost-1
# shutdown -g 0 -y -i 6
Sep  2 10:08:46 phys-schost-1 cl_runtime: WARNING: CMM monitoring disabled.
phys-schost-1# 
INIT: New run level: 6
The system is coming down.  Please wait.
System services are now being stopped.
Notice: rgmd is being stopped.
Notice: rpc.pmfd is being stopped.
Notice: rpc.fed is being stopped.
umount: /global/.devices/node@1 busy
umount: /global/phys-schost-1 busy
The system is down.
syncing file systems... done
rebooting...
Resetting ... 
...
Sun Ultra 1 SBus (UltraSPARC 143MHz), No Keyboard
OpenBoot 3.11, 128 MB memory installed, Serial #7982421.
Ethernet address 8:0:20:79:cd:55, Host ID: 8079cd55.
...
Rebooting with command: boot
...
Hostname: phys-schost-1
Booting as part of a cluster
...
NOTICE: Node 1: attempting to join cluster
...
NOTICE: Node 1: joined cluster
...
The system is coming up.  Please wait.
The system is ready.
phys-schost-1 console login: 

2.2.4 How to Boot a Cluster Node in Non-Cluster Mode

You can boot a node so that it does not participate in cluster membership, that is, in non-cluster mode. This is useful when installing the cluster software or when performing certain administrative procedures, such as patching a node.

  1. Become superuser on the cluster node to be started in non-cluster mode.

  2. Shut down the node by using the scswitch and shutdown commands.


    # scswitch -S -h node
    # shutdown -g 0 -y
    
  3. Verify that the node is at the ok PROM prompt.

  4. Boot the node in non-cluster mode by using the boot(1M) command with the -x option.


    ok boot -x
    

    Messages appear on the node's console stating that the node is not part of the cluster.
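
    When the administrative work is finished and you want the node to rejoin the cluster, shut the node down again and boot it without the -x option. The following is a minimal sketch.


    # shutdown -g 0 -y
    ...
    ok boot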

2.2.4.1 Example--Booting a Cluster Node in Non-Cluster Mode

The following example shows the console output when shutting down node phys-schost-1 and then restarting it in non-cluster mode. The -g 0 option sets the grace period to zero, and -y provides an automatic yes response to the confirmation question. Shutdown messages for this node appear on the consoles of other nodes in the cluster.


# scswitch -S -h phys-schost-1
# shutdown -g 0 -y
Sep  2 10:08:46 phys-schost-1 cl_runtime: WARNING: CMM monitoring disabled.
phys-schost-1# 
...
rg_name = schost-sa-1 ...
offline node = phys-schost-2 ...
num of  node = 0 ...
phys-schost-1# 
INIT: New run level: 0
The system is coming down.  Please wait.
System services are now being stopped.
Print services stopped.
syslogd: going down on signal 15
...
The system is down.
syncing file systems... done
WARNING: node 1 is being shut down.
Program terminated
ok boot -x
...
Not booting as part of cluster
...
The system is ready.
phys-schost-1 console login:

2.3 Troubleshooting Cluster and Cluster Node Problems

This section describes solutions to problems that can arise during the day-to-day operation of a cluster and cluster nodes.

2.3.1 How to Repair a Full /var File System

Both Solaris and Sun Cluster software write error messages to the /var/adm/messages file, which over time can fill the /var file system. If a cluster node's /var file system fills up, Sun Cluster might not be able to restart on that node. Additionally, you might not be able to log in to the node.
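
To see how much of the /var file system is currently in use on a node, you can check it with the df(1M) command, for example:


    # df -k /var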

If a node reports a full /var file system and continues to run Sun Cluster services, use this procedure to clear the full file system.

  1. Become superuser on the cluster node with the full /var file system.

  2. Clear the full file system.

    For example, delete nonessential files contained in the file system.
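
    For example, if the /var/adm/messages log itself is what filled the file system, you could save any messages you still need to a location outside /var and then truncate the log in place. The following is only a sketch; adapt the file names to whatever is actually consuming the space.


    # cat /dev/null > /var/adm/messages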