Sun Cluster 2.2 System Administration Guide

Chapter 5 Recovering From Power Loss

This chapter describes different power loss scenarios and the steps you take to return the system to normal operation. It covers recovery from total power loss, recovery from partial power loss, and powering on the system.

Maintaining Sun Cluster configurations includes handling failures such as power loss. A power loss can shut down an entire Sun Cluster configuration, or one or more components within a configuration. Sun Cluster nodes behave differently depending on which components lose power. The following sections describe typical scenarios and expected behavior.

5.1 Recovering From Total Power Loss

In Sun Cluster configurations with a single power source, a power failure takes down all Sun Cluster nodes along with their multihost disk expansion units. When all nodes lose power, the entire configuration fails.

In a total-failure scenario, the cluster hardware might come back up in one of two ways:

A Sun Cluster node reboots before the Terminal Concentrator.

Any errors reported while the node is rebooting are logged in the /var/adm/messages file or in the error log specified in the /etc/syslog.conf file.

A Sun Cluster node reboots before the multihost disk expansion unit.

The associated disks will not be accessible. One or more nodes must be rebooted after the multihost disk expansion unit comes up.

Once the nodes are up, run the hastat(1M) command and use your volume management software to check for errors caused by the power outage.
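The post-recovery check described above can be sketched as a small script. This is a minimal illustration, not a documented Sun Cluster procedure: it assumes a POSIX shell and a POSIX grep, and the keyword list in scan_messages (a hypothetical helper name) is an arbitrary starting point that you would tune for your site. The script runs hastat only if the command is found on the node.

```shell
#!/bin/sh
# Sketch only: the keyword pattern below is an assumption, not a Sun Cluster convention.
MESSAGES="${MESSAGES:-/var/adm/messages}"

# Print the lines of the given log file that mention common failure keywords.
scan_messages() {
    grep -i -E 'error|warning|fail|offline' "$1"
}

# Show cluster status if hastat(1M) is available on this host.
if command -v hastat >/dev/null 2>&1; then
    hastat
fi

# Scan the system log if it exists and is readable.
if [ -r "$MESSAGES" ]; then
    scan_messages "$MESSAGES"
fi
```

If /etc/syslog.conf directs errors to a different log, set MESSAGES to that path before running the script.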

5.2 Recovering From Partial Power Loss

If the Sun Cluster nodes and the multihost disk expansion units have separate power sources, a failure can take down one or more components. Several scenarios can occur. The most likely cases are:

The power to one Sun Cluster node fails, taking down only the node.
The power to one multihost disk expansion unit fails, taking down only the expansion unit.
The power to one Sun Cluster node fails, taking down the node and at least one multihost disk expansion unit.
The power to one Sun Cluster node fails, taking down the node, at least one multihost disk expansion unit, and the Terminal Concentrator.

5.2.1 Failure of One Node

If separate power sources are used on the nodes and the multihost disk expansion units, and you lose power to only one of the nodes, the other node detects the failure and initiates a takeover.

When power is restored to the failed node, the node reboots. You must rejoin the node to the cluster by using the scadmin startnode command. Then perform a manual switchover by using the haswitch(1M) command to restore the default logical host ownership.
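The rejoin and switchover sequence might look like the following sketch. The node name phys-hahost1 and the logical host name hahost1 are hypothetical placeholders, and the run wrapper with its DRY_RUN switch is an illustrative addition that prints each command by default instead of executing it. Check the scadmin(1M) and haswitch(1M) man pages on your system for the exact arguments before running anything.

```shell
#!/bin/sh
# Sketch only: NODE and LHOST are hypothetical names; substitute your own.
NODE="${NODE:-phys-hahost1}"   # physical node that lost power (assumed name)
LHOST="${LHOST:-hahost1}"      # logical host it masters by default (assumed name)
DRY_RUN="${DRY_RUN:-1}"        # 1 = print commands instead of running them

# Run a command, or only print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# On the rebooted node: rejoin the running cluster.
run scadmin startnode

# Then switch the logical host back to its default master.
run haswitch "$NODE" "$LHOST"
```

Set DRY_RUN=0 only after you have confirmed that the printed commands match your configuration.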

5.2.2 Failure of a Multihost Disk Expansion Unit

If you lose power to one of the multihost disk expansion units, your volume management software detects errors on the affected disks and takes action to put them into an error state. Disk mirroring masks this failure from the Sun Cluster fault monitoring. No switchover or takeover occurs.

When power is returned to the multihost disk expansion unit, perform the procedure documented in Chapter 11, Administering SPARCstorage Arrays, or Chapter 12, Administering Sun StorEdge MultiPacks and Sun StorEdge D1000s.
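Before following those procedures, it can help to confirm which components the volume manager has flagged. The sketch below assumes that Solstice DiskSuite is the volume manager and that the diskset is named hahost1; both are assumptions, as is the helper name count_needs_maintenance. It counts the "Needs maintenance" states reported by metastat(1M). Other volume managers report errors differently and need their own checks.

```shell
#!/bin/sh
# Sketch only: assumes Solstice DiskSuite and a hypothetical diskset name.
DISKSET="${DISKSET:-hahost1}"

# Count lines on stdin that report a "Needs maintenance" state.
count_needs_maintenance() {
    grep -c 'Needs maintenance'
}

# Query the diskset only if metastat(1M) is available on this host.
if command -v metastat >/dev/null 2>&1; then
    n=$(metastat -s "$DISKSET" | count_needs_maintenance)
    echo "components needing maintenance in $DISKSET: $n"
fi
```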

5.2.3 Failure of One Server and One Multihost Disk Expansion Unit

If power is lost to one of the Sun Cluster nodes and one multihost disk expansion unit, a secondary node immediately initiates a takeover.

After the power is restored, you must reboot the node, rejoin the node to the configuration by using the scadmin startnode command, and then begin monitoring activity. If manual switchover is configured, use the haswitch(1M) command to manually return ownership of the diskset to the node that had lost power. Refer to "4.3 Switching Over Logical Hosts" for more information.

After the diskset ownership has been returned to the default master, any multihost disks that reported errors must be returned to service. Use the instructions provided in the chapters on your disk expansion unit to return the multihost disks to service.

Note - The node might reboot before the multihost disk expansion unit comes up. In that case, the associated disks are not accessible, and you must reboot the node again after the multihost disk expansion unit comes up.

5.3 Powering On the System

The procedure for applying power to system cabinets, nodes, and boot disks varies with the type of cabinet being used and the manner in which the nodes receive AC power.

For disk arrays that do not receive AC power from an independent power source, AC power is applied when the system cabinet is powered on.

For specific power-on procedures for Sun StorEdge MultiPacks, refer to the Sun StorEdge MultiPack Service Manual.

A Terminal Concentrator that receives AC power from the system cabinet is turned on when power is applied to the cabinet. Otherwise, the Terminal Concentrator must be powered on independently.