Solaris Cluster is designed to achieve automatic system recovery under severe failure scenarios by transferring operational control from one server node to the other. But most failures in a Solaris system can be recovered without a full system switchover.
Failures involving network communication are handled quickly and quietly by Solaris IPMP.
System disk failures are handled silently and automatically by Solaris ZFS.
Failures of any single disk drive in the attached storage array are recovered automatically by the storage array firmware. And where the storage array lacks the ability to recover from a disk failure, Solaris ZFS provides uninterrupted disk I/O through the alternate drive in the mirrored configuration.
If an HBA port to the shared array should fail, Solaris automatically switches to an alternate port. Similarly, if a controller module on the shared array should fail or an interconnecting cable is disconnected, Solaris instantly reverts to the alternate path that connects to the disk resource.
Failure in a library communication path is recovered automatically by dual TCP/IP logic in ACSLS. And operations from a failed library controller card are recovered automatically by ACSLS HA logic associated with library Redundant Electronics (RE).
If any of the multiple running processes in ACSLS should fail, the ACSLS daemon instantly restarts the failed process.
Should the ACSLS daemon itself fail, or should any of the remaining ACSLS services stop running, the Solaris Service Management Facility (SMF) is there to instantly restart the failed service.
All of these scenarios are handled quickly and automatically without the involvement of Solaris Cluster. But if any other severe fault should impact ACSLS operation on the active server node, ACSLS HA instructs Solaris Cluster to switch control over to the alternate node.
Once it is started, ACSLS HA probes the system once every minute, watching for any of the following events to occur:
Loss of communication to an attached library.
Loss of network contact to the ACSLS logical host.
Loss of contact to the RPC listener port for client calls.
Loss of access to the ACSLS file system.
Unrecoverable maintenance state of the acsls SMF service.
Any of these events triggers a Cluster failover. Solaris Cluster also knows to fail over if any fatal system condition occurs on the active server node.
To activate Cluster failover control:
# cd /opt/ACSLSHA/util
# ./acsAgt configure
The utility prompts you for the logical host name. Ensure that the logical host is defined in the /etc/hosts file and that the corresponding IP address maps to the IPMP group defined in the chapter, "Configuring the Solaris System for ACSLS HA". Before running acsAgt configure, use zpool list to confirm that acslspool is mounted to the current server node.
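These two prerequisite checks can be gathered into a small pre-flight script. This is only a sketch, not part of the ACSLS HA distribution; the function name is ours, and it assumes getent resolves the logical host through /etc/hosts.

```shell
#!/bin/sh
# Hypothetical pre-flight check to run before ./acsAgt configure.
check_acsagt_prereqs() {
    lhost=$1
    # The logical host name must resolve (normally via /etc/hosts).
    if ! getent hosts "$lhost" >/dev/null 2>&1; then
        echo "FAIL: logical host '$lhost' does not resolve" >&2
        return 1
    fi
    # acslspool must be imported on the node running acsAgt configure.
    if ! zpool list acslspool >/dev/null 2>&1; then
        echo "FAIL: acslspool is not mounted on this node" >&2
        return 1
    fi
    echo "OK: prerequisites satisfied for acsAgt configure"
}
```

Invoke it as check_acsagt_prereqs with your own logical host name before running ./acsAgt configure.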
This action initiates Cluster control of ACSLS. Solaris Cluster monitors the system, probing once each minute to verify the health of ACSLS specifically and the Solaris system in general. Any condition that is deemed fatal initiates an action on the alternate node.
To check cluster status of the ACSLS resource group:
# clrg status
The display:
Reveals the status of each node.
Identifies which node is the active node.
Reveals whether failover action is suspended.
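If you script around clrg status, the active node can be picked out with a short parser. The column layout assumed below (node name, then Suspended, then Status on each node line) is typical clrg status output, but verify it against your own display before relying on it.

```shell
#!/bin/sh
# Hypothetical helper that reports the active (Online) node of acsls-rg.
# Assumes the usual "clrg status" columns: the Status field is last and
# the node name sits two fields before it.
acsls_active_node() {
    clrg status acsls-rg | awk '
        $NF == "Online" { print $(NF - 2); exit }
    '
}
```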
It is advisable to set a policy in the acsls-storage resource to reboot the active node whenever communication is lost between that node and the shared RAID disk device. This causes the active node to relinquish control when it cannot connect to the disk, allowing Solaris Cluster to pass control to the alternate node. Setting the Failover_mode from SOFT to HARD ensures a reboot of the active node whenever communication with the shared storage device has been lost.
To view the existing Failover_mode, run the following command:
# clrs show -v acsls-storage | grep Failover
The Failover_mode should be set to HARD as follows:
# clrs set -p Failover_mode=HARD acsls-storage
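A small wrapper can make this check-and-set idempotent. This is a hedged sketch built only from the two clrs commands shown above; the function name is ours.

```shell
#!/bin/sh
# Hypothetical helper: ensure Failover_mode on acsls-storage is HARD.
ensure_hard_failover_mode() {
    mode=$(clrs show -v acsls-storage | awk '/Failover_mode/ { print $NF }')
    if [ "$mode" = "HARD" ]; then
        echo "Failover_mode already HARD"
    else
        echo "Failover_mode is '$mode'; setting to HARD"
        clrs set -p Failover_mode=HARD acsls-storage
    fi
}
```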
Once cluster control has been activated, ACSLS can be operated in normal fashion. Start and stop ACSLS using the standard acsss control utility. Under cluster control, you start and stop ACSLS services the same way as on a standalone ACSLS server. Operation is administered with these standard acsss commands:
acsss enable
acsss disable
acsss db
Manually starting or stopping acsss services with these commands in no way causes Solaris Cluster to intervene with failover action. Nor will the use of Solaris SMF commands (such as svcadm) cause Cluster to intervene. Whenever acsss services are aborted or interrupted, it is SMF, not Cluster, that is primarily responsible for restarting these services.
Solaris Cluster only intervenes to restore control on the adjacent node under the following circumstances:
Lost communication with the ACSLS file system.
Lost communication with all redundant public Ethernet ports.
Lost and unrecoverable communication with a specified library.
If you suspect that maintenance activity might trigger an unwanted cluster failover event, suspend cluster control of the acsls resource group.
To suspend Cluster control:
# clrg suspend acsls-rg
While the resource group is suspended, Solaris Cluster makes no attempt to switch control to the adjacent node, no matter what conditions might otherwise trigger such action.
This suspension enables you to make more invasive repairs to the system, even while library production may be in full operation.
If the active node happens to reboot while in suspended mode, it does not mount the acslspool after the reboot, and ACSLS operation is halted. To clear this condition, resume Cluster control.
To resume Cluster control:
# clrg resume acsls-rg
If the shared disk resource is mounted to the current node, then normal operation resumes. But if Solaris Cluster discovers upon activation that the zpool is not mounted, it immediately switches control to the adjacent node. If the adjacent node is not accessible, control switches back to the current node. Cluster attempts to mount the acslspool and start ACSLS services on this node.
The following procedure provides a safe power-down sequence when it is necessary to power down the ACSLS HA system.
Determine the active node in the cluster.
# clrg status
Look for the online node.
Log in as root to the active node and halt Solaris Cluster control of the ACSLS resource group.
# clrg suspend acsls-rg
Switch to user acsss and shut down the acsss services:
# su - acsss
$ acsss shutdown
Log out as acsss and gracefully power down the node.
$ exit
# init 5
Log in to the alternate node and power it down with init 5.
Power down the shared disk array using the physical power switch.
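The first steps of this sequence can be collected into one script to run as root on the active node. This is a sketch only; powering down the alternate node and the shared disk array remain manual steps.

```shell
#!/bin/sh
# Hypothetical wrapper for the controlled power-down of the active node.
# Run as root on the active node; the alternate node and the disk array
# must still be powered down separately.
power_down_active_node() {
    clrg suspend acsls-rg || return 1           # stop failover monitoring
    su - acsss -c "acsss shutdown" || return 1  # stop the acsss services
    echo "acsss services stopped; powering down this node"
    init 5                                      # graceful power-off
}
```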
To restore ACSLS operation on the node that was active before a controlled shutdown:
Power on both nodes locally using the physical power switch or remotely using the Sun Integrated Lights Out Manager.
Power on the shared disk array.
Log in to either node as root.
If you attempt to log in as acsss or to list the $ACS_HOME directory, you find that the shared disk resource is not mounted to either node. To resume cluster monitoring, run the following command:
# clrg resume acsls-rg
With this action, Solaris Cluster mounts the shared disk to the node that was active when the system was brought down. This action should also automatically restart the acsss services and resume normal operations.
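This restore step can be sketched as a resume followed by a status check. The helper below is our own, not part of ACSLS HA, and the 60-second wait is an assumption; tune it to how long your site takes to mount acslspool and restart the acsss services.

```shell
#!/bin/sh
# Hypothetical post-power-up check: resume Cluster control, then confirm
# that the acsss services came back.
resume_and_verify() {
    clrg resume acsls-rg || return 1
    # Allow time for Cluster to mount acslspool and start ACSLS (assumed).
    sleep 60
    su - acsss -c "acsss status"
}
```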
There may be occasions where ACSLS must continue operation from a standalone server environment on one node while the other node is being serviced. This applies to hardware maintenance, an operating system upgrade, or an upgrade to Solaris Cluster.
Use the following procedures to create a standalone ACSLS server.
Reboot the desired node in non-cluster mode.
# reboot -- -x
To boot into non-cluster mode from the Open Boot Prom (OBP) on SPARC servers:
ok: boot -x
On X86 Servers, it is necessary to edit the GRUB boot menu.
Power on the system.
When the GRUB boot menu appears, press e (edit).
From the submenu, using the arrow keys, select kernel /platform/i86pc/multiboot. When this is selected, press e.
In the edit mode, add -x to the multiboot option (kernel /platform/i86pc/multiboot -x) and press Return.
With the multiboot -x option selected, press b to boot with that option.
Once the boot cycle is complete, log in as root and import the ACSLS zpool.
# zpool import acslspool
Use the -f (force) option if necessary when the disk resource remains tied to another node.
# zpool import -f acslspool
Bring up the acsss services:
# su - acsss
$ acsss enable
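The standalone bring-up can be sketched as a single helper. The automatic fallback to -f is our choice, mirroring the manual procedure above; before forcing, be certain the other node has actually released the pool, since forcing an import of a pool still active elsewhere risks data corruption.

```shell
#!/bin/sh
# Hypothetical bring-up helper for a node booted in non-cluster mode (-x).
standalone_acsls_start() {
    if ! zpool import acslspool 2>/dev/null; then
        # Plain import failed; fall back to a forced import. Only safe if
        # the peer node has released the pool.
        echo "plain import failed; forcing import with -f" >&2
        zpool import -f acslspool || return 1
    fi
    su - acsss -c "acsss enable"
}
```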