Sun Cluster 2.2 System Administration Guide

Stopping the Cluster and Cluster Nodes

Putting a node in any mode other than multiuser, or halting or rebooting the node, requires stopping the Sun Cluster membership monitor. Then your site's preferred method can be used for further node maintenance.

Stopping the entire cluster requires stopping the membership monitor on every cluster node by running the scadmin stopnode command on each node in turn.

If a logical host is owned by the node when the scadmin stopnode command is run, ownership will be transferred to another node that can master the logical host before the membership monitor is stopped. If the other possible master of the logical host is down, the scadmin stopnode command will shut down the data services in addition to stopping the membership monitor.

After the scadmin stopnode command runs, Sun Cluster will remain stopped, even across system reboots, until the scadmin startnode command is run.

The scadmin stopnode command removes the node from the cluster. In the absence of other simultaneous failures, you may shut down as many nodes as you choose without losing quorum among the remaining nodes. (If quorum is lost, the entire cluster shuts down.)

If you shut down a node for disk maintenance, you must also prepare the disk: for boot disks, use the procedures described in Chapter 10, Administering Sun Cluster Local Disks; for data disks, use the procedures described in your volume manager documentation.

You might have to shut down one or more Sun Cluster nodes to perform hardware maintenance procedures such as adding or removing SBus cards. The following sections describe the procedure for shutting down a single node or the entire cluster.


Note -

In a cluster with more than two nodes and with direct-attached storage, a problem can occur if the last node in the cluster panics or exits the cluster unusually (without performing the stopnode transition). In such a case, all nodes have been removed from the cluster and the cluster no longer exists, but because the last node left the cluster in an unusual manner, it still holds the nodelock. A subsequent invocation of the scadmin startcluster command will fail to acquire the nodelock. To work around this problem, manually clear the nodelock before restarting the cluster, using the procedure "How to Clear a Nodelock Freeze After a Cluster Panic".


How to Stop Sun Cluster on a Cluster Node
  1. If it is not necessary to have the data remain available, place the logical hosts (disk groups) into maintenance mode.


    phys-hahost2# haswitch -m logicalhost
    

    Refer to the haswitch(1M) man page for details.


    Note -

    It is possible to halt a Sun Cluster node by using the halt(1M) command, allowing a failover to restore the logical host services on the backup node. However, the halt(1M) operation might cause the node to panic. The haswitch(1M) command offers a more reliable method of switching ownership of the logical hosts.


  2. Stop Sun Cluster on one node without stopping services running on the other nodes in the cluster.


    phys-hahost1# scadmin stopnode
    


    Note -

    When you stop a node, the following error message might be displayed:


    in.rdiscd[517]: setsockopt (IP_DROP_MEMBERSHIP): Cannot assign requested address


    The error is caused by a timing issue between the in.rdiscd daemon and the IP module. It is harmless and can be ignored safely.


  3. Halt the node.


    phys-hahost1# halt
    

    The node is now ready for maintenance work.
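
    When the maintenance work is complete, boot the node and rejoin it to the cluster. As noted earlier, Sun Cluster remains stopped on the node until the scadmin startnode command is run:


    phys-hahost1# scadmin startnode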

How to Stop Sun Cluster on All Nodes

You might want to shut down all nodes in a Sun Cluster configuration if a hazardous environmental condition exists, such as a cooling failure or a severe lightning storm.

  1. Stop the membership monitor on all nodes by using the scadmin(1M) command.

    Run this command on the console of each node in the cluster. Allow each node to exit the cluster and the remaining nodes to reconfigure completely before you run the command on the next node.


    phys-hahost1# scadmin stopnode
    ...

  2. Halt all nodes using halt(1M).


    phys-hahost1# halt
    ...
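
    When conditions permit, bring the cluster back into service by booting the nodes and restarting Sun Cluster. A brief sketch of the restart sequence, using commands that appear elsewhere in this chapter (run scadmin startcluster on the first node, then scadmin startnode on each remaining node):


    phys-hahost1# scadmin startcluster
    phys-hahost2# scadmin startnode
    ...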

How to Halt a Sun Cluster Node
  1. Shut down any Sun Cluster node by using the halt(1M) command or the uadmin(1M) command.

    If the membership monitor is running when a node is shut down, the node will most likely take a "Failfast timeout" and display the following message:


    panic[cpu9]/thread=0x50f939e0: Failfast timeout - unit 

    You can avoid this by stopping the membership monitor before shutting down the node. Refer to the procedure, "How to Stop Sun Cluster on All Nodes", for additional information.
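
    A brief recap of the safe ordering, using the commands shown in the preceding procedures: stop the membership monitor first, then halt the node.


    phys-hahost1# scadmin stopnode
    phys-hahost1# halt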

How to Clear a Nodelock Freeze After a Cluster Panic

In a cluster with more than two nodes and with direct-attached storage, a problem can occur if the last node in the cluster panics or exits the cluster unusually (without performing the stopnode transition). In such a case, all nodes have been removed from the cluster and the cluster no longer exists. However, because the last node left the cluster in an unusual manner, it still holds the nodelock. A subsequent invocation of the scadmin startcluster command will fail to acquire the nodelock.

To work around this problem, manually clear the nodelock before restarting the cluster. Use the following procedure to manually clear the nodelock and restart the cluster, after the cluster has aborted completely.

  1. As root, display the cluster configuration.


    # scconf clustername -p
    

    Look for this line in the output:


    clustername Locking TC/SSP, port  : A.B.C.D, E
    

    • If E is a positive number, the nodelock is on Terminal Concentrator A.B.C.D and Port E. Proceed to Step 2.

    • If E is -1, the lock is on an SSP. Proceed to Step 3.
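
    For example, with hypothetical values (a cluster named sc-cluster and the Terminal Concentrator address used elsewhere in this chapter), the line might read as follows, indicating that the nodelock is held on the Terminal Concentrator at 192.9.75.51, port 2:


    sc-cluster Locking TC/SSP, port  : 192.9.75.51, 2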

  2. For a nodelock on a Terminal Concentrator (TC), perform the following steps.

    1. Start a telnet connection to Terminal Concentrator tc-name.


      $ telnet tc-name
      Trying 192.9.75.51...
      Connected to tc-name.
      Escape character is `^]'.

      Enter Return to continue.

    2. Specify cli (command line interface).


      Enter Annex port name or number: cli
      

    3. Log in as root.

    4. Run the admin command.


      annex# admin
      

    5. Reset Port E.


      admin : reset E
      

    6. Close the telnet connection.


      annex# hangup
      

    7. Proceed to Step 4.

  3. For a nodelock on a System Service Processor (SSP), perform the following steps.

    1. Connect to the SSP.


      $ telnet ssp-name
      

    2. Log in as user ssp.

    3. Display information about the clustername.lock file by using the following command. (This file is a symbolic link to /proc/csh.pid, where csh.pid is the process ID of the csh process that holds the lock.)


      $ ls -l /var/tmp/clustername.lock
      

    4. Search for the process csh.pid.


      $ ps -ef | grep csh.pid
      

    5. If the csh.pid process exists in the ps -ef output, kill the process by using the following command.


      $ kill -9 csh.pid
      

    6. Delete the clustername.lock file.


      $ rm -f /var/tmp/clustername.lock
      

    7. Log out of the SSP.

  4. Restart the cluster.


    # scadmin startcluster
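
    After the first node has restarted the cluster, rejoin each of the remaining nodes by running the scadmin startnode command on it. As noted earlier in this chapter, a stopped node does not rejoin the cluster until this command is run:


    phys-hahost2# scadmin startnode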
    

Stopping the Membership Monitor While Running RDBMS Instances

Database server instances can run on a node only after you have invoked the startnode option and the node has successfully joined the cluster. All database instances should be shut down before the stopnode option is invoked.


Note -

If you are running Oracle7 Parallel Server, Oracle8 Parallel Server, or Informix XPS, refer to your product documentation for shutdown procedures.


If the stopnode command is executed while the Oracle7 or Oracle8 instance is still running on the node, stopnode will hang and the following message is displayed on the console:


ID[vxclust]: stop: waiting for applications to end

The Oracle7 or Oracle8 instance must be shut down for the stopnode command to terminate successfully.
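
A minimal sketch of the required ordering for an Oracle7 or Oracle8 instance that is administered manually with Server Manager (svrmgrl). The oracle account name and the use of svrmgrl are assumptions; consult your Oracle documentation for the shutdown method that applies to your installation.


phys-hahost1# su - oracle
$ svrmgrl
SVRMGR> connect internal
SVRMGR> shutdown immediate
SVRMGR> exit
$ exit
phys-hahost1# scadmin stopnode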

If the stopnode command is executed while the Informix-Online XPS instance is still running on the node, the database hangs and becomes unusable.