If you encounter a problem with Sun Cluster Support for Oracle Parallel Server/Real Application Clusters, troubleshoot the problem by using the techniques that are described in the following sections.
The status of the SUNW.rac_framework resource indicates the status of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters. The Sun Cluster system administration tool scstat(1M) enables you to obtain the status of this resource.
The following example shows the status of a RAC framework resource group that is faulty.
-- Resource Groups and Resources --

            Group Name         Resources
            ----------         ---------
 Resources: rac-framework-rg   rac_framework rac_udlm rac_cvm

-- Resource Groups --

            Group Name         Node Name  State
            ----------         ---------  -----
     Group: rac-framework-rg   node1      Online faulted
     Group: rac-framework-rg   node2      Online

-- Resources --

            Resource Name      Node Name  State         Status Message
            -------------      ---------  -----         --------------
  Resource: rac_framework      node1      Start failed  Degraded - reconfiguration in progress
  Resource: rac_framework      node2      Online        Online
  Resource: rac_udlm           node1      Offline       Unknown - RAC framework is running
  Resource: rac_udlm           node2      Online        Online
  Resource: rac_cvm            node1      Offline       Unknown - RAC framework is running
  Resource: rac_cvm            node2      Online        Online
This example shows the status of the resources in a RAC framework resource group for the following two-node configuration of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters:
The configuration contains a RAC framework resource group that is named rac-framework-rg.
The rac-framework-rg resource group contains the following resources:
An instance of the SUNW.rac_framework resource type that is named rac_framework
An instance of the SUNW.rac_udlm resource type that is named rac_udlm
An instance of the SUNW.rac_cvm resource type that is named rac_cvm
This example provides the following status information:
A configuration error has prevented the rac_framework resource on cluster node node1 from starting.
The effects of this configuration error on other entities on cluster node node1 are as follows:
The rac-framework-rg resource group is online, but faulted.
The rac_udlm resource and the rac_cvm resource are offline.
The rac-framework-rg resource group and all resources on cluster node node2 are online.
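To spot a faulty resource such as rac_framework on node1 without reading the whole listing, the scstat output can be filtered. The following is a minimal sketch, not a Sun Cluster utility: rac_unhealthy is a hypothetical shell function, and it assumes that scstat -g prints each resource on its own line in the "Resource: name node state status" layout shown in the examples.

```shell
#!/bin/sh
# Hypothetical filter: read scstat -g output on standard input and print
# every resource line whose state and status are not both Online.
# Assumes each resource is reported on one line that starts with
# "Resource:" followed by the resource name, node name, state, and status.
rac_unhealthy() {
    awk '$1 == "Resource:" && !($4 == "Online" && $5 == "Online") {
        sub(/^[ \t]+/, "")   # drop the leading indentation
        print
    }'
}
# Usage on a cluster node: scstat -g | rac_unhealthy
```

Applied to the faulty example above, the filter prints the three node1 resource lines; applied to a configuration in which every resource is online, it prints nothing.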
The following example shows the status of a RAC framework resource group that is operating correctly.
-- Resource Groups and Resources --

            Group Name         Resources
            ----------         ---------
 Resources: rac-framework-rg   rac_framework rac_udlm rac_cvm

-- Resource Groups --

            Group Name         Node Name  State
            ----------         ---------  -----
     Group: rac-framework-rg   node1      Online
     Group: rac-framework-rg   node2      Online

-- Resources --

            Resource Name      Node Name  State         Status Message
            -------------      ---------  -----         --------------
  Resource: rac_framework      node1      Online        Online
  Resource: rac_framework      node2      Online        Online
  Resource: rac_udlm           node1      Online        Online
  Resource: rac_udlm           node2      Online        Online
  Resource: rac_cvm            node1      Online        Online
  Resource: rac_cvm            node2      Online        Online
This example shows the status of the resources in a RAC framework resource group for the following two-node configuration of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters:
The configuration contains a RAC framework resource group that is named rac-framework-rg.
The rac-framework-rg resource group contains the following resources:
An instance of the SUNW.rac_framework resource type that is named rac_framework
An instance of the SUNW.rac_udlm resource type that is named rac_udlm
An instance of the SUNW.rac_cvm resource type that is named rac_cvm
This example indicates that all resources and resource groups in this configuration are online.
The directory /var/cluster/ucmm contains the following sources of diagnostic information:
Core files
Log files that provide the following information:
Details of userland cluster membership monitor (UCMM) reconfigurations
Time-out settings
Events that are logged by the UNIX Distributed Lock Manager (Oracle UDLM)
The system messages file also contains diagnostic information.
If a problem occurs with Sun Cluster Support for Oracle Parallel Server/Real Application Clusters, consult these files to obtain information about the cause of the problem.
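Because this evidence is spread across /var/cluster/ucmm and the system messages file, it can help to gather it in one place. This is a minimal sketch that assumes the system messages file is /var/adm/messages, the Solaris default; collect_rac_diag and its default destination under /var/tmp are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: copy the RAC framework diagnostic sources into one
# directory so that they can be examined together or kept for later analysis.
# Arguments (all optional; the defaults are assumptions for Solaris):
#   $1  UCMM directory        (default /var/cluster/ucmm)
#   $2  system messages file  (default /var/adm/messages)
#   $3  destination directory (default /var/tmp/rac-diag.<timestamp>)
# Prints the destination directory on success.
collect_rac_diag() {
    ucmm_dir=${1:-/var/cluster/ucmm}
    msg_file=${2:-/var/adm/messages}
    dest=${3:-/var/tmp/rac-diag.$(date +%Y%m%d%H%M%S)}

    mkdir -p "$dest" || return 1

    # The UCMM directory holds core files and the reconfiguration logs.
    # -R copies it recursively; -p keeps modification times so that events
    # can still be correlated by time after the copy.
    if [ -d "$ucmm_dir" ]; then
        cp -Rp "$ucmm_dir" "$dest/" || return 1
    else
        echo "warning: $ucmm_dir not found" >&2
    fi

    if [ -f "$msg_file" ]; then
        cp -p "$msg_file" "$dest/" || return 1
    else
        echo "warning: $msg_file not found" >&2
    fi

    echo "$dest"
}
# Usage on a cluster node, as superuser: collect_rac_diag
```

Core files can be large, so check the free space in the destination file system before you copy them.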
The subsections that follow describe problems that can affect Sun Cluster Support for Oracle Parallel Server/Real Application Clusters. Each subsection provides information about the cause of the problem and a solution to the problem.
If a fatal problem occurs during the initialization of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters, the node panics with an error message similar to the following error message:
panic[cpu0]/thread=40037e60: Failfast: Aborting because "ucmmd" died 30 seconds ago
To determine the cause of the problem, examine the system messages file. The most common causes of this problem are as follows:
The license for VERITAS Volume Manager (VxVM) is missing or has expired.
The ORCLudlm package that contains the Oracle UDLM is not installed.
The amount of shared memory is insufficient to enable the Oracle UDLM to start.
The version of the Oracle UDLM is incompatible with the version of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters.
A reconfiguration step has timed out.
To correct the problem, perform the appropriate recovery action for the cause of the problem and reboot the node that panicked.
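Of the causes listed above, a missing ORCLudlm package is the simplest to check from the shell. The following minimal sketch assumes the Solaris pkginfo command; check_udlm_pkg and the PKGINFO override are hypothetical, and the other causes (VxVM license, shared memory, Oracle UDLM version, timeouts) need checks that this sketch does not attempt.

```shell
#!/bin/sh
# Check one common cause of the ucmmd failfast panic: a missing ORCLudlm
# package. pkginfo -q prints nothing and reports only through its exit
# status. PKGINFO is a hypothetical override, used so that the function can
# be tried off-cluster; it defaults to the Solaris pkginfo command.
check_udlm_pkg() {
    if ${PKGINFO:-pkginfo} -q ORCLudlm; then
        echo "ORCLudlm: installed"
    else
        echo "ORCLudlm: not installed; install the Oracle UDLM, then reboot this node"
        return 1
    fi
}
# Usage on the node that panicked: check_udlm_pkg
```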
If any step in the reconfiguration of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters times out, the node on which the timeout occurred panics.
To prevent reconfiguration steps from timing out, tune the timeouts that depend on your cluster configuration. For more information, see Guidelines for Setting Timeouts.
If a reconfiguration step times out, use the scrgadm utility to increase the value of the extension property that specifies the timeout for the step. For more information, see Appendix A, Sun Cluster Support for Oracle Parallel Server/Real Application Clusters Extension Properties.
After you have increased the value of the extension property, reboot the node that panicked.
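The scrgadm change can be scripted as follows. This sketch assumes the standard scrgadm form for changing an extension property (-c -j resource -x property=value); set_rac_timeout and the property name in the usage line are hypothetical, so take the real property name for the step that timed out from Appendix A. With DRY_RUN=1 the function prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper around scrgadm for raising a timeout extension
# property. Set DRY_RUN=1 to print the command instead of running it.
#   $1  resource name, for example rac_framework
#   $2  extension property that holds the timeout (see Appendix A)
#   $3  new timeout value, a whole number
set_rac_timeout() {
    resource=$1
    property=$2
    value=$3
    if [ -z "$resource" ] || [ -z "$property" ]; then
        echo "usage: set_rac_timeout resource property value" >&2
        return 2
    fi
    case $value in
        ''|*[!0-9]*)
            echo "timeout must be a whole number: '$value'" >&2
            return 2 ;;
    esac
    # -c changes a property; -j names the resource; -x sets an extension property.
    set -- scrgadm -c -j "$resource" -x "$property=$value"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}
# Usage (hypothetical property name): set_rac_timeout rac_framework some_step_timeout 600
```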
The UCMM daemon, ucmmd, manages the reconfiguration of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters. When a cluster is booted or rebooted, this daemon is started only after all components of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters are validated. If the validation of a component on a node fails, ucmmd fails to start on that node.
To determine the cause of the problem, examine the following files:
The UCMM reconfiguration log file /var/cluster/ucmm/ucmm_reconf.log
The system messages file
The most common causes of this problem are as follows:
The ORCLudlm package that contains the Oracle UDLM is not installed.
An error occurred during a previous reconfiguration of a component of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters.
A step in a previous reconfiguration of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters timed out, causing the node on which the timeout occurred to panic.
To correct the problem, perform the appropriate recovery action for the cause of the problem and reboot the node on which ucmmd failed to start.
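Before you reboot, you can confirm that ucmmd is in fact not running on the node. The sketch below assumes that ps -e -o comm= lists one command name per line; ucmmd_running is a hypothetical helper, and the process listing is passed on standard input so that the check can also be tried against sample text.

```shell
#!/bin/sh
# Hypothetical check: read a process listing (one command name per line)
# on standard input and report whether ucmmd is among the commands.
# A name matches if its last path component is exactly "ucmmd", so both
# "ucmmd" and a full path ending in /ucmmd are accepted.
ucmmd_running() {
    if awk '{ n = split($1, part, "/"); if (n > 0 && part[n] == "ucmmd") found = 1 }
            END { exit !found }'; then
        echo "ucmmd is running"
    else
        echo "ucmmd is not running; examine /var/cluster/ucmm/ucmm_reconf.log and the system messages file"
        return 1
    fi
}
# Usage on a cluster node: ps -e -o comm= | ucmmd_running
```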
If a SUNW.rac_framework resource fails to start, verify the status of the resource to determine the cause of the failure. For more information, see How to Verify the Status of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters.
The state of a resource that failed to start is shown as Start failed. The associated status message indicates the cause of the failure to start as follows:
Faulted - ucmmd is not running
The ucmmd daemon is not running on the node where the resource resides. For information about how to correct this problem, see Failure of the ucmmd Daemon to Start.
Degraded - reconfiguration in progress
A configuration error occurred in one or more components of Sun Cluster Support for Oracle Parallel Server/Real Application Clusters.
To determine the cause of the configuration error, examine the following files:
The UCMM reconfiguration log file /var/cluster/ucmm/ucmm_reconf.log
The system messages file
For more information about error messages that might indicate the cause of the configuration error, see Sun Cluster Error Messages Guide for Solaris OS.
To correct the problem, correct the configuration error that caused the problem. Then reboot the node on which the erroneous component resides.
Online
Reconfiguration of Oracle Parallel Server/Real Application Clusters was not completed until after the START method of the SUNW.rac_framework resource timed out.
For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.
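The three status messages above map directly to corrective actions. As a sketch, that mapping can be written as a shell case statement; rac_start_failed_action is hypothetical and only restates the guidance in this section.

```shell
#!/bin/sh
# Hypothetical lookup: print the corrective action for the status message of
# a SUNW.rac_framework resource whose state is "Start failed".
rac_start_failed_action() {
    case $1 in
        "Faulted - ucmmd is not running")
            echo "ucmmd is not running on this node: see Failure of the ucmmd Daemon to Start" ;;
        "Degraded - reconfiguration in progress")
            echo "configuration error: examine ucmm_reconf.log and the system messages file, correct the error, then reboot the node" ;;
        Online)
            echo "START method timed out: follow How to Recover From the Timing Out of the START Method" ;;
        *)
            echo "unrecognized status message: $1" >&2
            return 1 ;;
    esac
}
# Usage: rac_start_failed_action "Degraded - reconfiguration in progress"
```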
1. Become superuser.
2. On the node where the START method timed out, take the RAC framework resource group offline.
# scswitch -z -g resource-group -h nodelist
-g resource-group
Specifies the name of the RAC framework resource group. If this resource group was created by using the scsetup utility, the name of the resource group is rac-framework-rg.
-h nodelist
Specifies a comma-separated list of other cluster nodes on which resource-group is online.
3. On all cluster nodes that can run Sun Cluster Support for Oracle Parallel Server/Real Application Clusters, bring the RAC framework resource group online.
# scswitch -Z -g resource-group
-Z
Enables the resource and monitor, moves the resource group to the MANAGED state, and brings the resource group online.
-g resource-group
Specifies that the resource group that you brought offline in Step 2 is to be moved to the MANAGED state and brought online.
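The two scswitch commands in this procedure can be combined into one function. This is a minimal sketch: recover_rac_start_timeout and run_cmd are hypothetical, the function must be run as superuser on a cluster node, and DRY_RUN=1 prints the commands instead of running them.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical wrapper for recovering from a timed-out START method.
#   $1  RAC framework resource group, for example rac-framework-rg
#   $2  comma-separated list of the other nodes where the group is online
recover_rac_start_timeout() {
    rg=$1
    other_nodes=$2
    if [ -z "$rg" ] || [ -z "$other_nodes" ]; then
        echo "usage: recover_rac_start_timeout resource-group node[,node...]" >&2
        return 2
    fi
    # Master the group only on the other nodes; this takes it offline on
    # the node where the START method timed out.
    run_cmd scswitch -z -g "$rg" -h "$other_nodes" || return 1
    # Move the group to the MANAGED state and bring it online on all nodes.
    run_cmd scswitch -Z -g "$rg"
}
```

For the two-node configuration described earlier, with the START method timed out on node1, the call would be recover_rac_start_timeout rac-framework-rg node2.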
If a resource fails to stop, correct this problem as explained in “Clearing the STOP_FAILED Error Flag on Resources” in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.
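The guide cited above clears the flag with scswitch -c and -f STOP_FAILED. The sketch below assumes that command form; clear_stop_failed is hypothetical and covers only the command that clears the flag, so follow the complete procedure in that guide and confirm the syntax in scswitch(1M) for your release.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run_cmd() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Hypothetical wrapper: clear the STOP_FAILED flag of a resource.
#   $1  resource name, for example rac_udlm
#   $2  comma-separated list of nodes on which the resource failed to stop
clear_stop_failed() {
    resource=$1
    nodelist=$2
    if [ -z "$resource" ] || [ -z "$nodelist" ]; then
        echo "usage: clear_stop_failed resource node[,node...]" >&2
        return 2
    fi
    # -c clears an error flag; -f STOP_FAILED names the flag to clear.
    run_cmd scswitch -c -h "$nodelist" -j "$resource" -f STOP_FAILED
}
```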