11 Cluster Logging, Diagnostics, and Testing

This chapter describes the resources available for testing the ACSLS HA installation and for diagnosing and troubleshooting problems that may emerge on the system.

Monitoring Overall Cluster Operation

The activities that occur during a startup or switchover event are widely distributed across the two nodes. Consequently, the vantage point chosen to observe the overall operation during testing largely determines your ability to see events as they unfold. "The ha_console.sh Utility" describes procedures for setting up a comprehensive view.

A recommended dashboard configuration for observing overall HA behavior during testing includes eight shell windows, four on each node.

  1. Reserve a command shell for root on each node for issuing commands as needed.

  2. Set up a window on each node to display the tail of the system /var/adm/messages file.

    # tail -f /var/adm/messages
    

    Solaris Cluster prints all informational messages to this log file.

  3. Set up another window on each node to display the tail of the acsls-rs resource start_stop log.

    # tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt
    

    All messages posted by the acsls_agt.sh start script are displayed here.

  4. Set up a third window on each node to display the tail of the acsls-rs probe log.

    # tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/probe_log.txt
    

    Once the application has started, Solaris Cluster probes the ACSLS resource once every minute. Each probe returns a numeric code to Solaris Cluster, and the result is written to the file, probe_log.txt. One of five standard return values is posted to this log with each probe:

      0 -  The probe found that ACSLS is healthy and functioning normally.
      1 -  The probe may not have completed due to a functional error.
      2 -  The probe reports that ACSLS is in a transitional state.
      3 -  The ACSLS application has been intentionally placed offline.
    201 -  A condition was detected that requires fail-over action.
    

    Solaris Cluster initiates failover action only in response to code 201. The conditions that prompt such action are listed in the chapter, "ACSLS Cluster Operation". All other return codes from the Cluster probe are considered informational, and no responsive action is taken by Solaris Cluster.

    Sample probes for testing can be issued at any time from the command line using the command acsAgt probe:

    # /opt/ACSLSHA/util/acsAgt probe
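
    To see the result recorded for a manual probe, check the most recent entries in the probe log after the probe returns (the exact entry format may vary):

    # tail /var/cluster/logs/DS/acsls-rg/acsls-rs/probe_log.txt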
    

All of the logs mentioned above reflect the system view as seen by Solaris Cluster. Two additional logs in the $ACS_HOME/log/ directory provide a view from the ACSLS application level: acsss_event.log reports every significant event encountered by ACSLS from the moment the application was started, and any ACSLS startup difficulties encountered by SMF are logged in acsls_start.log.
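
To follow the application-level view alongside the Cluster logs, both files can be tailed in the same way, as user acsss (whose environment defines $ACS_HOME):

$ tail -f $ACS_HOME/log/acsss_event.log
$ tail -f $ACS_HOME/log/acsls_start.log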

Cluster Monitoring Utilities

Solaris Cluster utilities are found in the /usr/cluster/bin directory. The most useful status commands are listed below; a sketch of a script that combines several of them follows the list.

  • To view the current state of the ACSLS resource group: clrg list -v

  • To view the current status of the two cluster nodes: clrg status

  • To view the status of the resource groups: clrs status

  • To get verbose status on the nodes, the quorum devices, and cluster resources: cluster status

  • For a detailed component list in the cluster configuration: cluster show

  • To view the status of the Ethernet interfaces monitored on each cluster node: clnode status -m

  • To view the status of the various acsls-rg resources on each node: scstat -g

  • To view the health of the heartbeat network links: clintr status

  • To view IPMP status: scstat -i

  • To view node status: scstat -n

  • To view quorum configuration and status: scstat -q or clq status

  • To show detailed cluster resources, including timeout values: clresource show -v
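
For convenience during testing, several of the commands above can be combined into a status-snapshot script, as noted at the top of this list. The following is only a sketch; the script name and the grouping of commands are arbitrary and not part of the product:

#!/bin/sh
# ha_status_snapshot.sh - hypothetical helper for a quick HA status view.
# Run as root on either cluster node; the commands are taken from the list above.
PATH=$PATH:/usr/cluster/bin
export PATH

echo "=== Resource group and resource status ==="
clrg status
clrs status

echo "=== Node and quorum status ==="
scstat -n
scstat -q

echo "=== IPMP and cluster interconnect status ==="
scstat -i
clintr status

Running the script on both nodes side by side makes any asymmetry between the nodes easy to spot.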

Recovery and Failover Testing

This section discusses recovery and failover conditions, monitoring, and tests.

Recovery Conditions

Numerous system failure conditions can be recovered from without the need for a system failover event. For example, with IPMP, one Ethernet connection in each group may fail for whatever reason, but communication should resume uninterrupted through the alternate path.

The shared disk array should be connected to the servers with two distinct ports on each server. If one path is interrupted, disk I/O operation should resume without interruption over the alternate path.

ACSLS consists of several software services that are monitored by the Solaris Service Management Facility (SMF). As user acsss, list each of the acsss services with the command acsss status. Among these services are the PostgreSQL database, the WebLogic Web application server, and the ACSLS application software. If any of these services fails on a Solaris system, SMF should automatically restart that service without the need for a system failover.

The acsls service itself consists of numerous child processes that are monitored by the parent, acsss_daemon. To list the ACSLS sub-processes, use the command psacs (as user acsss). If any of the child processes aborts for any reason, the parent should immediately restart that child and recover normal operation.

Recovery Monitoring

The best location to view recovery of system resources (such as disk I/O and Ethernet connections) is the system log, /var/adm/messages.

SMF maintains a specific log for each software service that it monitors. This log displays startup, restart, and shutdown events. To get the full path to a service log, run the command svcs -l service-name. ACSLS services can be listed as user acsss with the command acsss status, and their subprocesses with acsss p-status.
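
For example, the log path for the acsls service appears on the logfile line of the svcs -l output:

# svcs -l acsls | grep logfile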

To view recovery of any ACSLS sub-process, you can monitor the acsss_event.log ($ACS_HOME/log/acsss_event.log). This log displays all recovery events involving any of the ACSLS sub-processes.

Recovery Tests

Redundant network connections should be restarted automatically by the Solaris IPMP logic. Any interrupted data connection to the shared disk array should be restarted automatically by Solaris on the redundant data path. Services under control of Solaris Service Management Facility should be restarted automatically by SMF.

For tests that involve an actual failover event, be aware of the property setting defined in the file $ACS_HOME/acslsha/pingpong_interval. Regardless of the conditions that might otherwise trigger a failover event, Solaris Cluster does not initiate failover action if a prior failover event occurred within the specified pingpong_interval.

To view or to dynamically change the pingpong interval, go to the /opt/ACSLSHA/util directory and run acsAgt pingpong:

# ./acsAgt pingpong
Pingpong_interval
   current value:  1200 seconds.
   desired value: [1200] 300
Pingpong_interval : 300 seconds.

Use any or all of the following techniques to evaluate the resilience of the HA installation:

  1. While ACSLS is operational, disconnect one Ethernet connection from each IPMP group on the active node. Monitor the status using # scstat -i.

    Observe the reaction in /var/adm/messages. ACSLS operation should not be interrupted by this procedure.
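
    Before pulling a cable, it can help to confirm which interfaces belong to each IPMP group. On Solaris 10, group membership appears on the groupname line of the ifconfig output (on Solaris 11, ipmpstat -g offers a similar view); this is only a sketch of one way to check:

    # ifconfig -a | egrep 'flags|groupname'
    # scstat -i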

  2. Ensure that Cluster Failover_mode is set to HARD. While ACSLS is operational, disconnect one fibre or SAS connection from the active server to the shared disk resource.

    Observe the reaction in /var/adm/messages. ACSLS operation should not be interrupted by this procedure.

    Repeat this test with each of the redundant I/O connections.
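
    If the shared array happens to be under Solaris I/O multipathing (MPxIO) control, which is a common but not required configuration, the operational path count for each logical unit can be checked before and after each pull with:

    # mpathadm list lu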

  3. Abruptly terminate ACSLS by killing the acsss_daemon. Use pkill acsss_daemon.

    Run svcs -l acsls to locate the service log.

    View the tail of this log as the acsss_daemon is stopped. Observe that the service is restarted automatically by SMF. Similar action should be seen if you stop acsls with acsls shutdown.
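
    A sketch of the whole sequence from a single root shell, assuming the log path appears on the logfile line of the svcs -l output:

    # LOG=`svcs -l acsls | awk '/^logfile/ {print $2}'`
    # tail -f $LOG &
    # pkill acsss_daemon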

  4. Using SMF, disable the acsls service.

    This can be done as root with svcadm disable acsls or it can be done as user acsss with acsss disable.

    Because SMF is in charge of this shutdown event, there is no attempt to restart the acsls service. This is the desired behavior. The acsls service must be restarted under SMF. As root, use the command svcadm enable acsls, or, as user acsss, use the command acsss enable.
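
    A minimal sketch of the full disable/enable cycle as root, using svcs to confirm each state change:

    # svcadm disable acsls
    # svcs acsls
    # svcadm enable acsls
    # svcs acsls

    The STATE column reported by svcs should show disabled after the first command and return to online shortly after the service is re-enabled.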

  5. Bring down the acsdb service.

    As user acsdb, source the .acsls_env file.

    $ su acsdb
    $ . /var/tmp/acsls/.acsls_env
    

    Now, abruptly stop the PostgreSQL database with the following command:

    pg_ctl stop \
         -D $installDir/acsdb/ACSDB1.0/data \
         -m immediate
    

    This action should bring down the database and also cause the acsls processes to come down. Run svcs -l acsdb to locate the acsdb service log.

    View the tail of both the acsdb service log and the acsls service log as the database is brought down. Observe that when the acsdb service goes down, it also brings down the acsls service. Both services should be restarted automatically by SMF.
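
    One way to watch both service logs at once is to pass both paths to a single tail command, resolving each path from svcs -l (a sketch that assumes the standard logfile line in that output):

    # tail -f `svcs -l acsls | awk '/^logfile/ {print $2}'` \
              `svcs -l acsdb | awk '/^logfile/ {print $2}'`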

  6. While ACSLS is operational, run psacs as user acsss to get a list of subprocesses running under the acsss_daemon.

    Stop any one of these subprocesses. Observe the acsss_event.log to confirm that the subprocess is restarted and a recovery procedure is invoked.
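
    A sketch of this test as user acsss; the process ID comes from the psacs listing, and the event log path assumes the default $ACS_HOME layout:

    $ psacs
    $ tail -f $ACS_HOME/log/acsss_event.log &
    $ kill <pid of one subprocess>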

Failover Conditions

Solaris Cluster software monitors the Solaris system, looking for fatal conditions that would necessitate a system failover event. Among these are a user-initiated failover (acsAgt nodeSwitch or clrg switch -n), a system reboot of the active node, and any system hang, fatal memory fault, or unrecoverable loss of I/O communication on the active node. Solaris Cluster also monitors HA agents that are designed for specific applications. The ACSLS HA Agent requests a system failover event under any of the following conditions:

  • TCP/IP communication is lost between the active node and the logical host.

  • The $ACS_HOME file system is not mounted.

  • The database backup file system ($ACS_HOME/.../backup) is not mounted.

  • Communication is lost to the library corresponding to a specified ACS in the file $ACS_HOME/acslsha/ha_acs_list.txt whose desired state is online and where a switch lmu is not otherwise possible or successful.

Failover Monitoring

From moment to moment, the failover status of the respective nodes can be monitored using the command # clrg status.

Failover activity can also be monitored by observing the tail of the start_stop_log:

# tail -f /var/cluster/logs/DS/acsls-rg/acsls-rs/start_stop_log.txt

It can be useful to view (tail -f) the /var/adm/messages file on both nodes as you perform diagnostic failover operations. See "Monitoring ACSLS Cluster Operation".
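
If the multi-window dashboard described under "Monitoring Overall Cluster Operation" is not already in place, a quick substitute is to stream both system logs to a single administrative workstation. This is only a sketch: node1 and node2 are placeholder hostnames, and it assumes remote root shell access to both nodes.

$ ssh root@node1 tail -f /var/adm/messages &
$ ssh root@node2 tail -f /var/adm/messages &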

Failover Tests

  1. The simple command to initiate a Cluster failover event is acsAgt nodeSwitch.

    # acsAgt nodeSwitch
    

    Or, use the equivalent Cluster command:

    # clrg switch -M -e -n <node name> acsls-rg
    

    This action should bring down the ACSLS application and switch operation from the active server to the standby system. The -M -e options instruct the cluster server to enable SMF services on the new node. See "Monitoring ACSLS Cluster Operation".

  2. A system reboot on the active node should initiate an immediate HA switch to the alternate node.

    This operation should conclude with ACSLS running on the new active node. On the standby node, watch the tail of the /var/adm/messages file as the standby system assumes its new role as the active node. You can also run the command clrg status periodically to track the transition.
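
    One simple way to poll the status from the surviving node during the switch is a shell loop; the ten-second interval here is arbitrary:

    # while true; do /usr/cluster/bin/clrg status; sleep 10; done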

  3. Using init 5, power down the active server node and verify system failover.

  4. Unplug both data lines between the active server node and the shared disk Storage Array and verify a system switch to the standby node.

  5. Assuming that a given library is listed in the policy file, ha_acs_list.txt, disconnect both Ethernet communication lines between the active server node and that library.

    Verify system failover to the standby node.

Additional Tests

If the mirrored boot drives are hot-pluggable, disable one of the boot drives and confirm that the system remains fully operational. With one boot drive disabled, reboot the system to verify that the node comes up from the alternate boot drive. Repeat this action for each of the boot drives on each of the two nodes.

Remove any single power supply from the active node and the system should remain fully operational with the alternate power supply.