Sun Cluster Data Service for Oracle Real Application Clusters Guide for Solaris OS

Chapter 3 Troubleshooting Sun Cluster Support for Oracle Real Application Clusters

If you encounter a problem with Sun Cluster Support for Oracle Real Application Clusters, troubleshoot the problem by using the techniques that are described in the following sections.

Verifying the Status of Sun Cluster Support for Oracle Real Application Clusters

The status of the SUNW.rac_framework resource indicates the status of Sun Cluster Support for Oracle Real Application Clusters. The Sun Cluster system administration tool scstat(1M) enables you to obtain the status of this resource.

How to Verify the Status of Sun Cluster Support for Oracle Real Application Clusters

  1. Become superuser.

  2. Type the following command:


    # scstat -g
    

The following examples show the status of the resources for a two-node configuration of Sun Cluster Support for Oracle Real Application Clusters. This configuration uses Solaris Volume Manager for Sun Cluster to store the Oracle Real Application Clusters database.

Each node contains a RAC framework resource group that is named rac-framework-rg. The resource type and resource name of each resource in these resource groups are shown in the following table.

Resource Type        Resource Instance Name
-------------        ----------------------
SUNW.rac_framework   rac_framework
SUNW.rac_udlm        rac_udlm
SUNW.rac_svm         rac_svm

Each node contains a resource group for an Oracle RAC server resource, as shown in the following table. The table also shows the resource type and the name of the resource in each resource group.

Node    Resource Group   Resource Type            Resource Name
----    --------------   -------------            -------------
node1   RAC1-rg          SUNW.oracle_rac_server   RAC1
node2   RAC2-rg          SUNW.oracle_rac_server   RAC2


Example 3–1 Status of a Faulty RAC Framework Resource Group


-- Resource Groups and Resources --

           Group Name        Resources
           ----------        ---------
Resources: rac-framework-rg  rac_framework rac_udlm rac_svm
Resources: RAC1-rg           RAC1
Resources: RAC2-rg           RAC2


-- Resource Groups --

            Group Name        Node Name  State
            ----------        ---------  -----
     Group: rac-framework-rg  node1      Online faulted
     Group: rac-framework-rg  node2      Online

     Group: RAC1-rg           node1      Online

     Group: RAC2-rg           node2      Online


-- Resources --

            Resource Name    Node Name  State     Status Message
            -------------    ---------  -----     --------------
  Resource: rac_framework    node1      Start failed Degraded - reconfiguration in progress
  Resource: rac_framework    node2      Online    Online

  Resource: rac_udlm         node1      Offline   Unknown - RAC framework is running
  Resource: rac_udlm         node2      Online    Online

  Resource: rac_svm          node1      Offline   Unknown - RAC framework is running
  Resource: rac_svm          node2      Online    Online

  Resource: RAC1             node1      Online    Online

  Resource: RAC2             node2      Online    Faulted

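This example shows the status of a RAC framework resource group that is faulty. On node1, the rac-framework-rg resource group is in the Online faulted state, the rac_framework resource is in the Start failed state with the status message Degraded - reconfiguration in progress, and the rac_udlm and rac_svm resources are offline.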



Example 3–2 Status of a Faulty RAC Server Resource Group


-- Resource Groups and Resources --

             Group Name          Resources
             ----------          ---------
  Resources: rac-framework-rg    rac_framework rac_udlm rac_svm
  Resources: RAC1-rg             RAC1    
  Resources: RAC2-rg             RAC2    


-- Resource Groups --

             Group Name          Node Name      State
             ----------          ---------      -----
      Group: rac-framework-rg    node1          Online
      Group: rac-framework-rg    node2          Online

      Group: RAC1-rg             node1          Online

      Group: RAC2-rg             node2          Online faulted


-- Resources --

             Resource Name      Node Name      State     Status Message
             -------------      ---------      -----     --------------
   Resource: rac_framework      node1          Online    Online
   Resource: rac_framework      node2          Online    Online

   Resource: rac_udlm           node1          Online    Online
   Resource: rac_udlm           node2          Online    Online

   Resource: rac_svm            node1          Online    Online
   Resource: rac_svm            node2          Online    Online

   Resource: RAC1               node1          Online    Online

   Resource: RAC2               node2          Online    Faulted - RAC instance not running

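This example shows the status of a RAC server resource group that is faulty. On node2, the RAC2-rg resource group is in the Online faulted state, and the RAC2 resource reports the status message Faulted - RAC instance not running.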



Example 3–3 Status of an Operational Oracle Real Application Clusters Configuration


-- Resource Groups and Resources --

             Group Name          Resources
             ----------          ---------
  Resources: rac-framework-rg    rac_framework rac_udlm rac_svm
  Resources: RAC1-rg             RAC1    
  Resources: RAC2-rg             RAC2    


-- Resource Groups --

             Group Name          Node Name           State
             ----------          ---------           -----
      Group: rac-framework-rg    node1               Online
      Group: rac-framework-rg    node2               Online

      Group: RAC1-rg             node1               Online

      Group: RAC2-rg             node2               Online


-- Resources --

             Resource Name       Node Name           State     Status Message
             -------------       ---------           -----     --------------
   Resource: rac_framework       node1               Online    Online
   Resource: rac_framework       node2               Online    Online

   Resource: rac_udlm            node1               Online    Online
   Resource: rac_udlm            node2               Online    Online

   Resource: rac_svm             node1               Online    Online
   Resource: rac_svm             node2               Online    Online

   Resource: RAC1                node1               Online    Online

   Resource: RAC2                node2               Online    Online

This example shows the status of an Oracle Real Application Clusters configuration that is operating correctly. The example indicates that all resources and resource groups in this configuration are online.
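When the output of scstat -g is long, it can help to show only the entries that are not fully online. The following is a minimal sketch, not part of Sun Cluster: the function name faulty_entries is hypothetical, and the sketch assumes that the output has the column layout shown in the preceding examples.

```shell
#!/bin/sh
# Hypothetical helper (not a Sun Cluster command): reads scstat -g output
# on standard input and prints the Group and Resource lines that are not
# fully online. Assumes the column layout shown in Examples 3-1 to 3-3.
faulty_entries() {
    awk '
        # Group lines: Group: <name> <node> <state>
        # A healthy group has the single-word state Online (4 fields).
        $1 == "Group:"    && ($4 != "Online" || NF > 4)
        # Resource lines: Resource: <name> <node> <state> <status message>
        # Flag the line if either the state or the status is not Online.
        $1 == "Resource:" && ($4 != "Online" || $5 != "Online")
    '
}
```

On a cluster node, the helper would be used as scstat -g | faulty_entries. For the output in Example 3–3, the helper prints nothing.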


Sources of Diagnostic Information

The directory /var/cluster/ucmm contains sources of diagnostic information, including the UCMM reconfiguration log file ucmm_reconf.log.

The directory /var/opt/SUNWscor/oracle_server contains log files for the Oracle RAC server resource.

The system messages file also contains diagnostic information.

If a problem occurs with Sun Cluster Support for Oracle Real Application Clusters, consult these files to obtain information about the cause of the problem.
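To narrow a large log file down to the entries that are most likely to be relevant, a filter such as the following might be used. This is a sketch under assumptions: the function name ucmm_messages is hypothetical, the default path /var/adm/messages is the usual location of the Solaris system messages file but might differ on your system, and the two search strings are taken from the panic message shown later in this chapter.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a log file that mention ucmmd or
# a Failfast panic. The default path is an assumption; pass another file,
# for example /var/cluster/ucmm/ucmm_reconf.log, as the first argument.
ucmm_messages() {
    logfile=${1:-/var/adm/messages}
    if [ ! -r "$logfile" ]; then
        echo "cannot read $logfile" >&2
        return 1
    fi
    # awk is used instead of grep options that might not be available
    # in every grep implementation.
    awk '/ucmmd/ || /Failfast/' "$logfile"
}
```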

Common Problems and Their Solutions

The subsections that follow describe problems that can affect Sun Cluster Support for Oracle Real Application Clusters. Each subsection provides information about the cause of the problem and a solution to the problem.

Node Panic During Initialization of Sun Cluster Support for Oracle Real Application Clusters

If a fatal problem occurs during the initialization of Sun Cluster Support for Oracle Real Application Clusters, the node panics with an error message similar to the following error message:


panic[cpu0]/thread=40037e60: Failfast: Aborting because "ucmmd" died 30 seconds ago

To determine the cause of the problem, examine the system messages file. The most common causes of this problem are as follows:

  • The license for VERITAS Volume Manager (VxVM) is missing or has expired.

  • The ORCLudlm package that contains the Oracle UDLM is not installed.

  • The amount of shared memory is insufficient to enable the Oracle UDLM to start.

  • The version of the Oracle UDLM is incompatible with the version of Sun Cluster Support for Oracle Real Application Clusters.

For instructions to correct the problem, see How to Recover From a Node Panic During Initialization.

A node might also panic during the initialization of Sun Cluster Support for Oracle Real Application Clusters because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.

How to Recover From a Node Panic During Initialization

  1. Boot into maintenance mode the node that panicked.

    For more information, see Sun Cluster System Administration Guide for Solaris OS.

  2. Verify that you have correctly installed your volume manager packages.

    If you are using VxVM, check that you have installed the software and check that the license for the VxVM cluster feature is valid.

  3. Ensure that you have completed all the procedures that precede installing and configuring the Oracle UDLM software.

    The procedures that you must complete are listed in Table 1–1.

  4. Ensure that the Oracle UDLM software is correctly installed and configured.

    For more information, see Installing the Oracle UDLM.

  5. Reboot the node that panicked.

    For more information, see Sun Cluster System Administration Guide for Solaris OS.
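As a partial aid for Step 4, the following sketch checks only whether the ORCLudlm package is installed. It does not check the license, the shared memory, or the version compatibility. The function name udlm_pkg_status is hypothetical. The sketch assumes that the Solaris pkginfo command is available and that its -q option suppresses output and reports the result through the exit status.

```shell
#!/bin/sh
# Hypothetical helper: report whether the ORCLudlm package is installed.
# Assumes the Solaris pkginfo command; pkginfo -q prints nothing and
# exits with status 0 only if the named package is installed.
udlm_pkg_status() {
    if pkginfo -q ORCLudlm 2>/dev/null; then
        echo "ORCLudlm is installed"
    else
        echo "ORCLudlm is not installed"
        return 1
    fi
}
```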

Node Panic Caused by a Timeout

The timing out of any step in the reconfiguration of Sun Cluster Support for Oracle Real Application Clusters causes the node on which the timeout occurred to panic.

To prevent reconfiguration steps from timing out, tune the timeouts that depend on your cluster configuration. For more information, see Guidelines for Setting Timeouts.

If a reconfiguration step times out, use the scrgadm utility to increase the value of the extension property that specifies the timeout for the step. For more information, see Appendix A, Sun Cluster Support for Oracle Real Application Clusters Extension Properties.
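The following sketch composes a scrgadm command that changes an extension property of a resource. It prints the command for review instead of running it. The function name timeout_cmd is hypothetical, and the resource name, property name, and value in the usage line are illustrative assumptions only. Take the actual property name and any restrictions on changing it from Appendix A.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a scrgadm command that sets an
# extension property on a resource.
#   $1  resource name             (for example, rac_svm)
#   $2  extension property name   (take the real name from Appendix A)
#   $3  new timeout in seconds
timeout_cmd() {
    resource=$1
    property=$2
    seconds=$3
    # Reject an empty value or a value that is not a whole number.
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "scrgadm -c -j $resource -x $property=$seconds"
}
```

For example, timeout_cmd rac_svm Svm_step4_timeout 360 prints scrgadm -c -j rac_svm -x Svm_step4_timeout=360. The property name and the value 360 are assumptions for illustration.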

After you have increased the value of the extension property, reboot the node that panicked.

Failure of a Node

Recovering from the failure of a node involves the following tasks:

  1. Booting into maintenance mode the node that panicked

  2. Performing the appropriate recovery action for the cause of the problem

  3. Rebooting the node that panicked

For more information, see Sun Cluster System Administration Guide for Solaris OS.


Note –

In an Oracle Real Application Clusters environment, multiple Oracle instances cooperate to provide access to the same shared database. The Oracle clients can use any of the instances to access the database. Thus, if one or more instances have failed, clients can connect to a surviving instance and continue to access the database.


Failure of the ucmmd Daemon to Start

The UCMM daemon, ucmmd, manages the reconfiguration of Sun Cluster Support for Oracle Real Application Clusters. When a cluster is booted or rebooted, this daemon is started only after all components of Sun Cluster Support for Oracle Real Application Clusters are validated. If the validation of a component on a node fails, the ucmmd daemon fails to start on the node.

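To determine the cause of the problem, examine the UCMM reconfiguration log file /var/cluster/ucmm/ucmm_reconf.log and the system messages file, which are described in Sources of Diagnostic Information.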

The most common causes of this problem are as follows:

To correct the problem, perform the appropriate recovery action for the cause of the problem and reboot the node on which ucmmd failed to start.

Failure of a SUNW.rac_framework Resource to Start

If a SUNW.rac_framework resource fails to start, verify the status of the resource to determine the cause of the failure. For more information, see How to Verify the Status of Sun Cluster Support for Oracle Real Application Clusters.

The state of a resource that failed to start is shown as Start failed. The associated status message indicates the cause of the failure to start as follows:


Faulted - ucmmd is not running

The ucmmd daemon is not running on the node where the resource resides. For information about how to correct this problem, see Failure of the ucmmd Daemon to Start.


Degraded - reconfiguration in progress

A configuration error occurred in one or more components of Sun Cluster Support for Oracle Real Application Clusters.

To determine the cause of the configuration error, examine the following files:

  • The UCMM reconfiguration log file /var/cluster/ucmm/ucmm_reconf.log

  • The system messages file

For more information about error messages that might indicate the cause of the configuration error, see Sun Cluster Error Messages Guide for Solaris OS.

To correct the problem, correct the configuration error that caused the problem. Then reboot the node on which the erroneous component resides.


Online

Reconfiguration of Oracle Real Application Clusters was not completed until after the START method of the SUNW.rac_framework resource timed out.

For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.

How to Recover From the Timing Out of the START Method

  1. Become superuser.

  2. On the node where the START method timed out, take offline the RAC framework resource group.

    To perform this operation, switch the primary nodes of the resource group to the other nodes where this group is online.


    # scswitch -z -g resource-group -h nodelist
    
    -g resource-group

    Specifies the name of the RAC framework resource group. If this resource group was created by using the scsetup utility, the name of the resource group is rac-framework-rg.

    -h nodelist

    Specifies a comma-separated list of other cluster nodes on which resource-group is online. Omit from this list the node where the START method timed out.

  3. On all cluster nodes that can run Sun Cluster Support for Oracle Real Application Clusters, bring the RAC framework resource group online.


    # scswitch -Z -g resource-group
    
    -Z

    Enables the resource and monitor, moves the resource group to the MANAGED state, and brings the resource group online.

    -g resource-group

    Specifies that the resource group that you brought offline in Step 2 is to be moved to the MANAGED state and brought online.
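The two scswitch commands in this procedure can be composed as in the following sketch, which prints the commands for review instead of running them. The function name recover_start_timeout and its arguments are hypothetical. The sketch assumes that you supply a comma-separated list of the nodes on which the resource group is online, and it omits from that list the node where the START method timed out.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the scswitch commands from
# Steps 2 and 3 of the procedure above.
#   $1  RAC framework resource group  (for example, rac-framework-rg)
#   $2  node where the START method timed out
#   $3  comma-separated list of nodes on which the group is online
recover_start_timeout() {
    rg=$1
    failed=$2
    all=$3
    others=
    # Split the node list on commas and drop the node that timed out.
    old_ifs=$IFS
    IFS=,
    for node in $all; do
        if [ "$node" != "$failed" ]; then
            others=${others:+$others,}$node
        fi
    done
    IFS=$old_ifs
    if [ -z "$others" ]; then
        echo "no other node is available for $rg" >&2
        return 1
    fi
    echo "scswitch -z -g $rg -h $others"   # Step 2: offline on the failed node
    echo "scswitch -Z -g $rg"              # Step 3: online on all nodes
}
```

For the two-node configuration in this chapter, recover_start_timeout rac-framework-rg node1 node1,node2 prints scswitch -z -g rac-framework-rg -h node2 followed by scswitch -Z -g rac-framework-rg.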

Failure of a Resource to Stop

If a resource fails to stop, correct this problem as explained in “Clearing the STOP_FAILED Error Flag on Resources” in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.