Troubleshooting SEM High Availability Problems

Solstice Enterprise Manager 4.1 Troubleshooting Guide

Chapter 7

Troubleshooting SEM High Availability Problems

This chapter provides information on resolving High Availability (HA) administration problems with Solstice EM.

This chapter describes the following topics:

Section 7.1.1 Monitoring the Status of Sun Cluster
Section 7.1.2 Clearing the stop_failed Status
Section 7.1.3 Bringing Resources and Resource Groups Offline or Online
Section 7.1.4 Shutting Down a Sun Cluster

7.1 SEM-HA Administration

7.1.1 Monitoring the Status of Sun Cluster

Problem: Determining the status of Sun Cluster resources.

Solution: The resource state, resource group state, and resource status are all maintained by the Resource Group Manager (RGM) on each node, based only on which methods have been invoked on the resource.

The scstat command displays the current state of the Sun Cluster and its components. You need to run only one instance of scstat, on any one machine in the Sun Cluster configuration.

scstat [-DWgnpv[v]q] [-h node]

The command options allow you to request status information for specific components.

When the stop method of a resource fails, the resource is marked as stop_failed, and the cluster does not allow it to be started until that status is cleared. To resolve this, see Section 7.1.2 Clearing the stop_failed Status.
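As a rough illustration of checking for this condition, the sketch below filters status text for lines that mention stop_failed. The function name find_stop_failed is hypothetical, and nothing is assumed about the layout of scstat output beyond the state name appearing on the line; the match is case-insensitive for that reason.

```shell
#!/bin/sh
# Print every line of status text (read from standard input) that mentions
# stop_failed. This also matches the error_stop_failed resource group state.
# The exact column layout and capitalization of scstat output are NOT
# assumed here, so the match is a plain case-insensitive search.
find_stop_failed() {
    grep -i 'stop_failed'
}

# Typical use on a cluster node (not executed in this sketch):
#   scstat -g | find_stop_failed
```

The function returns a non-zero exit status when no line matches, so it can also be used as a condition in a script.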

The possible resource states are:

online, offline, start_failed, stop_failed, or online_not_monitored.

The possible resource group states are:

unmanaged, online, offline, pending_online, pending_offline, or error_stop_failed.
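A script that reacts to these states might separate the failure states from the normal ones. The following is a minimal sketch, assuming that start_failed, stop_failed, and error_stop_failed are the states that call for operator attention; the function name is_failed_state is hypothetical.

```shell
#!/bin/sh
# Return 0 (true) if the given resource or resource group state is one of
# the failure states listed above, and 1 (false) for any other state.
is_failed_state() {
    # Lowercase the argument so that STOP_FAILED and stop_failed both match.
    state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$state" in
        start_failed|stop_failed|error_stop_failed) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: report a state that needs attention.
if is_failed_state stop_failed; then
    echo "state needs attention"
fi
```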

See Also: Sun Cluster Administration Guide.

7.1.2 Clearing the stop_failed Status

Problem: Clearing the stop_failed error flag of a given resource.

Solution: After the stop method has run successfully on a resource on a given node, the resource's state will be offline on that node. If the stop method exits non-zero or times out, then the state of the resource will be stop_failed.

Use the following scswitch command to clear the stop_failed error flag of the resource on the indicated set of nodes:

scswitch -c -h node[,node,...] -j resource_name -f STOP_FAILED

Clearing the stop_failed state places the resources into the offline state on the given node.
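The clearing step could be wrapped as shown in the sketch below. It only prints the command so that it can be reviewed before it is run on a cluster node. The function name clear_stop_failed_cmd, the node names, and the resource name are hypothetical.

```shell
#!/bin/sh
# Build the command that clears the stop_failed flag for one resource.
#   $1 = comma-separated node list
#   $2 = resource name
# The command is printed, not executed.
clear_stop_failed_cmd() {
    if [ $# -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
        echo "usage: clear_stop_failed_cmd node[,node...] resource_name" >&2
        return 1
    fi
    printf 'scswitch -c -h %s -j %s -f STOP_FAILED\n' "$1" "$2"
}

# Hypothetical node and resource names, for illustration only.
clear_stop_failed_cmd node1,node2 em_resource
```

Printing instead of executing is a deliberate choice here: clearing the flag changes cluster state, so an operator may prefer to confirm the node list first.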

7.1.3 Bringing Resources and Resource Groups Offline or Online

Problem: Bringing resources and resource groups offline or online.

Solution: Use the scswitch command to bring resources, resource groups, or disk device groups offline or online.

Bringing a resource group offline:

scswitch -F -g resource_grp_name

For each resource group specified by the -g option, -F disables all resources and their monitors, moves the resource group into the unmanaged state, and brings the resource group offline on all the default primaries. Without the -g option, scswitch attempts to bring all resource groups offline.

Bringing a resource group online:

scswitch -Z -g resource_grp_name -h hostname

For each resource group specified by the -g option, -Z enables all resources and their monitors, moves the resource group into the managed state, and brings the resource group online on all the default primaries. Without the -g option, scswitch attempts to bring all resource groups online.

Bringing a resource offline:

scswitch -n -j resource_name

Bringing a resource online:

scswitch -e -j resource_name

Note – The resource group to which the resource belongs must be online when you bring the resource online.
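
The four scswitch forms in this section can be collected into one helper, sketched below. It prints the matching command instead of running it. The function name scswitch_cmd and the example names are hypothetical, and the -h hostname argument is left out of this sketch.

```shell
#!/bin/sh
# Print the scswitch command for a target type, an action, and a name.
#   $1 = group | resource
#   $2 = online | offline
#   $3 = resource group name or resource name
scswitch_cmd() {
    if [ $# -ne 3 ] || [ -z "$3" ]; then
        echo "usage: scswitch_cmd group|resource online|offline name" >&2
        return 1
    fi
    case "$1:$2" in
        group:offline)    opts='-F -g' ;;
        group:online)     opts='-Z -g' ;;
        resource:offline) opts='-n -j' ;;
        resource:online)  opts='-e -j' ;;
        *)
            echo "unknown target or action: $1 $2" >&2
            return 1
            ;;
    esac
    printf 'scswitch %s %s\n' "$opts" "$3"
}

# Hypothetical names, for illustration only.
scswitch_cmd group online em_rg
scswitch_cmd resource online em_resource
```

The example calls follow the order implied by the note: the resource group is brought online before the resource.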

7.1.4 Shutting Down a Sun Cluster

Problem: Shutting down the Sun Cluster gracefully.

Solution: The scshutdown command shuts down the entire cluster in an orderly fashion.

scshutdown [-y] [-g grace-period] [message]

Before the shutdown, scshutdown can send a warning message and a final message asking for confirmation. Run the scshutdown command from one node only.

When shutting down a cluster, scshutdown performs the following:

Changes all functioning resource groups on the cluster to an offline state. If any transition fails, scshutdown does not complete, and an error message is displayed.
Unmounts all cluster file systems. If any unmount fails, scshutdown does not complete, and an error message is displayed.
Shuts down all active device services. If any of the transitions fail, scshutdown does not complete, and an error message is displayed.
Runs "/usr/sbin/init 0" on all nodes.

grace-period changes the number of seconds before the shutdown from the default of 60 seconds to the value specified.

message is a string that is sent out after the standard warning message "The system will be shut down in ..." and before scshutdown begins.
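
A call with a non-default grace period and a custom message might be assembled as in the sketch below. It prints the command and does not run it. The function name scshutdown_cmd, the 120-second value, and the message text are hypothetical; the -y option is omitted because its behavior is not described above.

```shell
#!/bin/sh
# Print an scshutdown command line.
#   $1 = grace period in whole seconds
#   $2 = optional message text (must not contain double quotes)
scshutdown_cmd() {
    grace=$1
    msg=$2
    # Reject an empty value or anything that is not made up of digits only.
    case "$grace" in
        ''|*[!0-9]*)
            echo "grace period must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    if [ -n "$msg" ]; then
        printf 'scshutdown -g %s "%s"\n' "$grace" "$msg"
    else
        printf 'scshutdown -g %s\n' "$grace"
    fi
}

# Hypothetical values, for illustration only.
scshutdown_cmd 120 "Cluster maintenance"
```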
