Solaris Cluster is designed to achieve automatic system recovery under severe failure scenarios by transferring operational control from one server node to the other. But most failures in a Solaris system do not require a full system switch-over to recover.
Failures involving network communication are handled quickly and quietly by Solaris IPMP.
System disk failures are handled silently and automatically by Solaris ZFS.
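As a quick health check of the system disks, you can inspect the mirrored ZFS root pool on either node (the pool name rpool below is an assumption; substitute the root pool name configured on your servers):

# zpool status rpool

A healthy mirror shows both disks ONLINE; after a drive failure the pool reports DEGRADED while ZFS continues serving I/O from the surviving disk.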
Failures of any single disk drive in the attached storage array are recovered automatically by the storage array firmware. And where the storage array cannot recover from a disk failure, Solaris ZFS provides uninterrupted disk I/O from the alternate drive in the mirrored configuration.
If an HBA port to the shared array should fail, Solaris automatically switches to an alternate port. Similarly, if a controller module on the shared array should fail or an interconnecting cable is disconnected, Solaris instantly reverts to the alternate path that connects to the disk resource.
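To confirm that both paths to the shared disk array are available, you can use the Solaris multipathing administration utility (shown here only as an illustrative check):

# mpathadm list lu

Each shared logical unit should report two total paths with both operational; after a port, controller, or cable failure, the operational path count drops while I/O continues on the surviving path.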
Failure in a library communication path is recovered automatically by dual TCP/IP logic in ACSLS. And operations from a failed library controller card are recovered automatically by ACSLS HA logic associated with library Redundant Electronics (RE).
If any of the multiple running processes in ACSLS should fail, the ACSLS daemon immediately restarts the failed process.
Should the ACSLS daemon itself fail, or should any of the remaining ACSLS services stop running, the Solaris Service Management Facility (SMF) immediately restarts the failed service.
All of these scenarios are handled quickly and automatically without the involvement of Solaris Cluster. But if any other severe fault should impact ACSLS operation on the active server node, ACSLS HA instructs Solaris Cluster to switch control over to the alternate node.
Once it is started, ACSLS HA probes the system once every minute, watching for any of the following events to occur:
Loss of communication to an attached library
Loss of network contact to the ACSLS logical host
Loss of contact to the RPC listener port for client calls
Loss of access to the ACSLS file system
Unrecoverable maintenance state of the acsls SMF service
Any of these events triggers a Cluster failover. Solaris Cluster also fails over if any fatal system condition occurs on the active server node.
To activate Cluster failover control:
# cd /opt/ACSLSHA/util
# ./start_acslsha.sh -h <logical hostname> -g <IPMP group> -z acslspool
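For example, on a cluster whose logical hostname is acsls-logical and whose IPMP group is named ipmp0 (both names are illustrative; substitute the values configured at your site):

# cd /opt/ACSLSHA/util
# ./start_acslsha.sh -h acsls-logical -g ipmp0 -z acslspool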
This action initiates Cluster control of ACSLS. Solaris Cluster monitors the system, probing once each minute to verify the health of ACSLS specifically and the Solaris system in general. Any condition that is deemed fatal initiates a failover to the alternate node.
To check cluster status of the ACSLS resource group:
# clrg status
The display will:
Reveal the status of each node.
Identify which node is the active node.
Reveal whether failover action is suspended.
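The output resembles the following (node names and states are illustrative):

=== Cluster Resource Groups ===

Group Name      Node Name      Suspended      Status
----------      ---------      ---------      ------
acsls-rg        node1          No             Online
                node2          No             Offline

In this example, node1 is the active node, node2 is the standby, and the Suspended column shows whether failover action has been suspended.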
Once cluster control has been activated, you can operate ACSLS in the normal fashion. Under cluster control, you start and stop the acsss services with the standard acsss control utility, in the same way as you would on a stand-alone ACSLS server. Operation is administered with these standard acsss commands:
acsss enable
acsss disable
acsss db
Manually starting or stopping acsss services with these commands in no way causes Solaris Cluster to intervene with failover action, nor does the use of Solaris SMF commands (such as svcadm). Whenever acsss services are aborted or interrupted, it is SMF, not Cluster, that is primarily responsible for restarting these services.
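As an illustration of this division of responsibility, you can query the SMF view of the service directly (acsls is the SMF service referenced above):

# svcs -x acsls

If the service has dropped into the maintenance state, svcadm clear acsls returns it to SMF control so that SMF can restart it; only an unrecoverable maintenance state triggers a Cluster failover, as noted above.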
Solaris Cluster intervenes to restore operation on the adjacent node only under the following circumstances:
Lost communication with the ACSLS filesystem
Lost communication with all redundant public Ethernet ports
Lost and unrecoverable communication with a specified library
If you suspect that your maintenance activity might trigger an unwanted cluster failover event, you can suspend cluster control of the acsls resource group.
To suspend Cluster control:
# clrg suspend acsls-rg
While the resource group is suspended, Solaris Cluster makes no attempt to switch control to the adjacent node, no matter what conditions might otherwise trigger such action.
This enables you to make more invasive repairs to the system, even while library production is in full operation.
If the active node happens to reboot while in suspended mode, it will not mount the acslspool after the reboot, and ACSLS operation will be halted. To clear this condition, you should resume Cluster control.
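You can confirm this condition from the rebooted node before resuming (an optional check):

# zpool list acslspool

If the pool is not mounted to this node, the command reports that the pool cannot be opened.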
To resume Cluster control:
# clrg resume acsls-rg
If the shared disk resource is mounted to the current node, then normal operation resumes. But if Solaris Cluster discovers upon activation that the zpool is not mounted, it immediately switches control to the adjacent node. If the adjacent node is not accessible, then control switches back to the current node and Cluster attempts to mount the acslspool and start ACSLS services on this node.
The following procedure provides a safe power-down sequence when it is necessary to power down the ACSLS HA system; a consolidated command summary follows the steps.
Determine the active node in the cluster.
# clrg status
Look for the online node.
Log in as root to the active node and halt Solaris Cluster control of the ACSLS resource group.
# clrg suspend acsls-rg
Switch to user acsss and shut down the acsss services:
# su - acsss
$ acsss shutdown
Log out as acsss and gracefully power down the node.
$ exit
# init 5
Log in to the alternate node and power it down with init 5.
Power down the shared disk array using the physical power switch.
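Taken together, the commands issued on the active node in this sequence are summarized below; then run init 5 on the alternate node and power off the shared disk array as described above.

# clrg suspend acsls-rg
# su - acsss
$ acsss shutdown
$ exit
# init 5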
To restore ACSLS operation on the node that was active before a controlled shutdown, use the following procedure:
Power on both nodes locally using the physical power switch or remotely using the Sun Integrated Lights Out Manager.
Power on the shared disk array.
Log in to either node as root.
If you attempt to log in as acsss or to list the $ACS_HOME directory, you find that the shared disk resource is not mounted to either node. To resume cluster monitoring, run the following command:
# clrg resume acsls-rg
With this action, Solaris Cluster mounts the shared disk to the node that was active when you brought the system down. This action should also automatically restart the acsss services, and normal operation should resume.
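You can confirm the recovery from either node (an optional check):

# clrg status

The acsls-rg resource group should report Online on the node that was active before the shutdown, and the acsss services should return to their running state shortly afterward.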
There may be occasions where ACSLS must continue operation from a standalone server environment on one node while the other node is being serviced. This would apply in situations of hardware maintenance, an operating system upgrade, or an upgrade to Solaris Cluster.
Use the following procedures to create a standalone ACSLS server.
Reboot the desired node in non-cluster mode.
# reboot -- -x
To boot into non-cluster mode from the Open Boot Prom (OBP) on SPARC servers:
ok: boot -x
On x86 servers, it is necessary to edit the GRUB boot menu.
Power on the system.
When the GRUB boot menu appears, press e (edit).
From the submenu, use the arrow keys to select kernel /platform/i86pc/multiboot, then press e.
In edit mode, append -x to the multiboot option so the line reads kernel /platform/i86pc/multiboot -x, and press Return.
With the multiboot -x option selected, press b to boot with that option.
Once the boot cycle is complete, log in as root and import the ACSLS zpool.
# zpool import acslspool
Use the -f (force) option if necessary when the disk resource remains tied to another node.
# zpool import -f acslspool
Bring up the acsss services.
# su - acsss
$ acsss enable
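Once the services are up, you can verify standalone operation (an optional check; the acsss utility also provides a status query):

$ acsss status
$ exit
# zpool status acslspool

These should show the acsss services online and the acslspool imported on this node.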