11 Cluster Logging, Diagnostics, and Testing

This chapter describes the various resources available for testing your ACSLS-HA installation, and for diagnosing issues and troubleshooting problems that may emerge on the system.

Solaris Cluster Logging

Solaris Cluster messages during a failover event are written to the /var/adm/messages file. This file contains messages regarding Cluster functions, along with ACSLS error and informational messages. Only the active node writes cluster messages to the /var/adm/messages file.
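
Because only the active node logs cluster messages, a quick way to review recent cluster activity on that node is a simple search of the messages file (a minimal example; Solaris Cluster entries are typically tagged with a string such as "Cluster", but the pattern may need adjusting for your release):

# grep Cluster /var/adm/messages | tail -20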

Solaris Cluster monitors the health of ACSLS with a probe once every sixty seconds. You can view the log of this probe activity here:

/var/cluster/logs/DS/acsls-rg/acsls-rs/probe_log.txt

In the same directory is a file that logs every start and stop event during a failover sequence.

/var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt
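
To watch probe and failover activity as it happens, you can follow either log on the active node (a minimal example using the paths shown above):

# tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/probe_log.txt
# tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt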

ACSLS Event Log

The ACSLS event log is $ACS_HOME/log/acsss_event.log. This log includes messages regarding start and stop events from the perspective of the ACSLS software. The log reports changes to the operational state of library resources and logs all errors detected by the ACSLS software. The acsss_event.log is managed and archived automatically according to parameters defined in acsss_config, option 2.
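
To follow ACSLS start, stop, and error messages during a test, you can tail this log as user acsss (a minimal example):

$ tail -f $ACS_HOME/log/acsss_event.log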

Cluster Monitoring Utilities

Solaris Cluster utilities are found in the /usr/cluster/bin directory.

  • To view the current state of the ACSLS resource group: clrg list -v

  • To view the current status of the two cluster nodes: clrg status

  • To view the status of the cluster resources: clrs status

  • To get verbose status on the nodes, the quorum devices, and cluster resources: cluster status

  • For a detailed component list in the cluster configuration: cluster show

  • To view the status of each Ethernet node in the resource group: clnode status -m

  • To view resource group status: scstat -g

  • To view device group status: scstat -D

  • To view the health of the heartbeat network links: scstat -W or clintr status

  • To view IPMP status: scstat -i

  • To view node status: scstat -n

  • To view quorum configuration and status: scstat -q or clq status

  • To show detailed cluster resources, including timeout values: clresource show -v
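
For a quick health summary from either node, several of these utilities can be run in sequence, for example (a sketch only; all commands are found in /usr/cluster/bin as noted above):

# clrg status
# clrs status
# clnode status -m
# clq status
# cluster status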

Recovery and Failover Testing

This section discusses the conditions that can be recovered automatically or that trigger a failover, how to monitor recovery and failover activity, and how to test both behaviors.

Recovery Conditions

There are numerous fatal system conditions that can be recovered from without the need for a system failover event. For example, with IPMP, one Ethernet connection in each group may fail for any reason, but communication should resume uninterrupted through the alternate path.

The shared disk array should be connected to the servers with two distinct ports on each server. If one path is interrupted, disk I/O operation should resume without interruption over the alternate path.

ACSLS consists of several software 'services' that are monitored by the Solaris Service Management Facility (SMF). As user acsss, you can list each of the acsss services with the command acsss status. Among these services are the PostgreSQL database, the WebLogic Web application server, and the ACSLS application software. If any given service fails on a Solaris system, SMF should automatically restart that service without the need for a system failover.

The acsls service itself consists of numerous child processes that are monitored by the parent, acsss_daemon. To list the ACSLS sub-processes, use the command psacs (as user acsss). If any of the child processes aborts for any reason, the parent should immediately restart that child and recover normal operation.

Recovery Monitoring

The best location to view the recovery of system resources (such as disk I/O and Ethernet connections) is the system log, /var/adm/messages.

SMF maintains a specific log for each software service that it monitors. This log displays start-up, restart, and shutdown events. To get the full path to the service log, run the command svcs -l service-name. ACSLS services can be listed with the command $ acsss status, and ACSLS subprocesses with the command $ acsss p-status.
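
For example, to find and follow the SMF log for the acsls service, you could combine svcs -l with tail (a sketch; svcs -l prints the log path on its logfile line):

# svcs -l acsls | grep logfile
# tail -f `svcs -l acsls | awk '/^logfile/ {print $2}'`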

To view recovery of any ACSLS sub-process, you can monitor the acsss_event.log ($ACS_HOME/log/acsss_event.log). This log displays all recovery events involving any of the ACSLS sub-processes.

Recovery Tests

Redundant network connections should be restored automatically by the Solaris multipath IP logic (IPMP). Any interrupted data connection to the shared disk array should resume automatically over the redundant data path. Services under the control of the Solaris Service Management Facility should be restarted automatically by SMF.

For tests that involve an actual failover event, you should be aware of the property setting defined in the file $ACS_HOME/acslsha/pingpong_interval. Regardless of the conditions that may trigger a failover event, Solaris Cluster will not initiate failover action if a prior failover event occurred within the specified pingpong_interval. (See "Setting the Failover Pingpong_interval.")

To verify the current Pingpong_interval setting, use the Cluster command:

clrg show -p Pingpong_interval
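
If the interval needs to be adjusted for a test, the corresponding set form of the same Cluster command can be used (a sketch; the value shown is illustrative, and your site may instead manage the setting through the $ACS_HOME/acslsha/pingpong_interval file, as described in "Setting the Failover Pingpong_interval"):

# clrg set -p Pingpong_interval=1200 acsls-rg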

Suggested methods of validating this recovery behavior include the following:

  1. While ACSLS is operational, disconnect one Ethernet connection from each IPMP group on the active node. Monitor the status using: # scstat -i.

    Observe the reaction in /var/adm/messages. ACSLS operation should not be interrupted by this procedure.

  2. While ACSLS is operational, disconnect one fibre or SAS connection from the active server to the shared disk resource.

    Observe the reaction in /var/adm/messages. ACSLS operation should not be interrupted by this procedure.

    Repeat this test with each of the redundant I/O connections.

  3. Bring down ACSLS abruptly by stopping the acsss_daemon.

    Run svcs -l acsls to locate the service log.

    View the tail of this log as you stop the acsss_daemon. You should observe that the service is restarted automatically by SMF. Similar action should be seen if you stop acsls with acsls shutdown.

  4. Using SMF, disable the acsls service.

    This can be done as root with svcadm disable acsls or it can be done as user acsss with acsss disable.

    Because SMF is in charge of this shutdown event, there is no attempt to restart the acsls service. This is the desired behavior. You must restart the acsls service under SMF using $ acsss enable or # svcadm enable acsls.

  5. Bring down the acsdb service.

    As user acsdb, abruptly disable the PostgreSQL database with the following command:

    pg_ctl stop \
         -D $installDir/acsdb/ACSDB1.0/data \
         -m immediate
    

    This action should bring down the database and also cause the acsls processes to come down. Run svcs -l acsdb to locate the acsdb service log.

    View the tail of both the acsdb service log and the acsls service log as you bring down the database. You should observe that when the acsdb service goes down, it also brings down the acsls service. Both services should be restarted automatically by SMF.

  6. While ACSLS is operational, run psacs as user acsss to get a list of sub-processes running under the acsss_daemon.

    Stop any one of these sub-processes. Observe the acsss_event.log to confirm that the sub-process is restarted and a recovery procedure is invoked (see the sketch after this list).
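
For step 6, a minimal sketch of stopping one child process and watching the recovery might look like the following (run as user acsss; the kill -9 approach and the placeholder PID are illustrative, not a prescribed procedure):

$ psacs
$ kill -9 <pid of one sub-process>
$ tail -f $ACS_HOME/log/acsss_event.log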

Failover Conditions

Solaris Cluster software monitors the Solaris system, looking for fatal conditions that would necessitate a system failover event. Among these would be a user-initiated failover (clrg switch), a system reboot of the active node, or any system hang, fatal memory fault, or unrecoverable I/O communication on the active node. Solaris Cluster also monitors HA agents that are designed for specific applications. The ACSLS HA Agent requests a system failover event under any of the following conditions:

  • TCP/IP communication is lost between the active node and the logical host.

  • The $ACS_HOME file system is not mounted.

  • The /export/backup file system is not mounted.

  • Communication is lost to an ACS that is listed in the file $ACS_HOME/acslsha/ha_acs_list.txt, whose desired state is online, and for which a switch lmu is not otherwise possible or successful.
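
A quick manual check of some of these conditions from the active node might look like the following (a sketch; the logical host name is a placeholder, and you may need to substitute the actual path for $ACS_HOME if the variable is not set in root's environment):

# df -k $ACS_HOME /export/backup
# ping <logical host name>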

Failover Monitoring

You can monitor the failover status of the respective nodes at any moment using the command: # clrg status

Or you can monitor failover activity by observing the tail of the start_stop_log:

# tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt

It may be useful to view (tail -f) the /var/adm/messages file on both nodes as you perform diagnostic failover operations.
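
During a diagnostic exercise, a simple loop can provide a periodic status snapshot alongside the log tails (a minimal sketch; the interval is arbitrary):

# while true; do clrg status; sleep 30; done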

Failover Tests

  1. The prescribed method to test Cluster failover is to use the clrg switch command:

    # clrg switch -M -e -n <standby node name> acsls-rg
    

    This action should bring down the ACSLS application and switch operation from the active server to the standby system. The options -M -e instruct the cluster server to enable SMF services on the new node. Observe this sequence of events on each node by viewing the tail of the /var/adm/messages file. You can also tail the start-stop log:

    # tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt
    

    Periodically run the command: # clrg status

  2. A system reboot on the active node should initiate an immediate HA switch to the alternate node.

    This operation should conclude with ACSLS running on the new active node. On the standby node, watch the tail of the /var/adm/messages file as the standby system assumes its new role as the active node. You can also periodically run the command: # clrg status

  3. Using init 5, power down the active server node and verify system failover.

  4. Unplug both data lines between the active server node and the shared disk Storage Array and verify a system switch to the standby node.

  5. Assuming that a given library is listed in the policy file, ha_acs_list.txt, disconnect both Ethernet communication lines between the active server node and that library.

    Verify system failover to the standby node.
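
After any of these tests, you can confirm where the resource group is online and, if desired, switch it back to the original node using the same command shown in step 1 (the node name is a placeholder):

# clrg status
# clrg switch -M -e -n <original node name> acsls-rg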

Additional Tests

If your mirrored boot drives are hot-pluggable, you can disable one of the boot drives and confirm that the system remains fully operational. With one boot drive disabled, reboot the system to verify that the node comes up from the alternate boot drive. Repeat this action for each of the boot drives on each of the two nodes.
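
If the boot drives are mirrored with ZFS (an assumption; your configuration may use a different volume manager), the state of the root pool mirror can be checked before and after disabling a drive with:

# zpool status rpool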

Remove any single power supply from the active node; the system should remain fully operational on the alternate power supply.