Sun Cluster Data Service for Oracle RAC Guide for Solaris OS

Chapter 5 Troubleshooting Sun Cluster Support for Oracle RAC

If you encounter a problem with Sun Cluster Support for Oracle RAC, troubleshoot the problem by using the techniques that are described in the following sections.

Verifying the Status of Sun Cluster Support for Oracle RAC

The status of resource groups and resources for Sun Cluster Support for Oracle RAC indicates the status of Oracle RAC in your cluster. Use Sun Cluster maintenance commands to obtain this status information.

Procedure: How to Verify the Status of Sun Cluster Support for Oracle RAC

This procedure provides the long forms of the Sun Cluster maintenance commands. Most commands also have short forms. Except for the forms of the command names, the commands are identical. For a list of the commands and their short forms, see Appendix A, Sun Cluster Object-Oriented Commands, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

  1. Become superuser or assume a role that provides solaris.cluster.read RBAC authorization.

  2. Display status information for the Sun Cluster objects in which you are interested.

    For example:

    • To display status information for all resource groups in your cluster, type the following command:


      # clresourcegroup status +
      
    • To display status information for all resources in a resource group, type the following command:


      # clresource status -g resource-group +
      
      resource-group

      Specifies the resource group that contains the resources whose status information you are displaying.

See Also

For information about options that you can specify to filter the status information that is displayed, see the following man pages:

Examples of the Status of Sun Cluster Support for Oracle RAC

The following examples show the status of resource groups and resources for a configuration of Sun Cluster Support for Oracle RAC on a four-node cluster. Each node is a machine that uses the SPARC® processor.

The cluster in this example is running version 10g R2 of Oracle RAC. The configuration in this example uses the Sun StorageTek™ QFS shared file system on Solaris Volume Manager for Sun Cluster to store Oracle files.

The resource groups and resources for this configuration are shown in the following table.

                                                                          Resource Group Contents
Resource Group        Purpose                                             Resource Type                    Resource Instance Name
--------------        -------                                             -------------                    ----------------------
rac-framework-rg      RAC framework resource group                        SUNW.rac_framework               rac-framework-rs
                                                                          SUNW.rac_udlm                    rac-udlm-rs
                                                                          SUNW.rac_svm                     rac-svm-rs
                                                                          SUNW.crs_framework               crs_framework-rs

scaldg-rg             Resource group for scalable device-group resources  SUNW.ScalDeviceGroup             scaloradg-rs

qfsmds-rg             Resource group for Sun StorageTek QFS               SUNW.qfs                         qfs-db_qfs-OraHome-rs
                      metadata server resources                                                            qfs-db_qfs-OraData-rs

scalmnt-rg            Resource group for scalable file-system             SUNW.ScalMountPoint              scal-db_qfs-OraHome-rs
                      mount-point resources                                                                scal-db_qfs-OraData-rs

rac_server_proxy-rg   RAC database resource group                         SUNW.scalable_rac_server_proxy   rac_server_proxy-rs


Example 5–1 Status of a Faulty RAC Framework Resource Group


# clresourcegroup status +

=== Cluster Resource Groups ===

Group Name             Node Name    Suspended   Status
----------             ---------    ---------   ------
rac-framework-rg       pclus1       No          Online faulted
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

scaldg-rg              pclus1       No          Pending online blocked
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

qfsmds-rg              pclus1       No          Offline
                       pclus2       No          Online
                       pclus3       No          Offline
                       pclus4       No          Offline

scalmnt-rg             pclus1       No          Pending online blocked
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

rac_server_proxy-rg    pclus1       No          Pending online blocked
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

# clresource status -g rac-framework-rg +

=== Cluster Resources ===

Resource Name       Node Name    State          Status Message
-------------       ---------    -----          --------------
rac-framework-rs    pclus1       Start failed   Faulted - Error in previous reconfiguration.
                    pclus2       Online         Online
                    pclus3       Online         Online
                    pclus4       Online         Online

rac-udlm-rs         pclus1       Offline        Offline
                    pclus2       Online         Online
                    pclus3       Online         Online
                    pclus4       Online         Online

rac-svm-rs          pclus1       Offline        Offline
                    pclus2       Online         Online
                    pclus3       Online         Online
                    pclus4       Online         Online

crs_framework-rs    pclus1       Offline        Offline
                    pclus2       Online         Online
                    pclus3       Online         Online
                    pclus4       Online         Online

# clresource status -g scaldg-rg +

=== Cluster Resources ===

Resource Name       Node Name      State        Status Message
-------------       ---------      -----        --------------
scaloradg-rs        pclus1         Offline      Offline
                    pclus2         Online       Online - Diskgroup online
                    pclus3         Online       Online - Diskgroup online
                    pclus4         Online       Online - Diskgroup online

# clresource status -g qfsmds-rg +

=== Cluster Resources ===

Resource Name            Node Name    State     Status Message
-------------            ---------    -----     --------------
qfs-db_qfs-OraHome-rs    pclus1       Offline   Offline
                         pclus2       Online    Online - Service is online.
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

qfs-db_qfs-OraData-rs    pclus1       Offline   Offline
                         pclus2       Online    Online - Service is online.
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

# clresource status -g scalmnt-rg +

=== Cluster Resources ===

Resource Name             Node Name   State     Status Message
-------------             ---------   -----     --------------
scal-db_qfs-OraHome-rs    pclus1      Offline   Offline
                          pclus2      Online    Online
                          pclus3      Online    Online
                          pclus4      Online    Online

scal-db_qfs-OraData-rs    pclus1      Offline   Offline
                          pclus2      Online    Online
                          pclus3      Online    Online
                          pclus4      Online    Online

# clresource status -g rac_server_proxy-rg +

=== Cluster Resources ===

Resource Name           Node Name    State      Status Message
-------------           ---------    -----      --------------
rac_server_proxy-rs     pclus1       Offline    Offline
                        pclus2       Online     Online - Oracle instance UP
                        pclus3       Online     Online - Oracle instance UP
                        pclus4       Online     Online - Oracle instance UP

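This example provides the following status information for a RAC framework resource group that is faulty:

  • On node pclus1, the rac-framework-rs resource is in the Start failed state. Its status message reports an error in a previous reconfiguration.

  • The rac-framework-rg resource group is in the Online faulted state on pclus1, and the other resources in this group are offline on that node.

  • The scaldg-rg, scalmnt-rg, and rac_server_proxy-rg resource groups are in the Pending online blocked state on pclus1, and their resources are offline on that node.

  • On nodes pclus2, pclus3, and pclus4, these resource groups and their resources are online.

  • The qfsmds-rg resource group and its resources are online only on pclus2.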
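In a listing as long as the one in this example, the rows that need attention are easy to miss. The following is a minimal sketch of a filter that prints only the resource group and node pairs whose status is not exactly Online. The function name rg_not_online is hypothetical, and the sketch assumes a POSIX awk and the column layout shown in the examples in this chapter.

```shell
# Sketch: print "group node status" for every row whose status is not
# exactly "Online". Standard input is the text that is printed by
# "clresourcegroup status +". The function name is hypothetical.
rg_not_online() {
    awk '
        # Skip blank lines, banners, rulers, the header, and echoed commands.
        NF == 0 || /^===/ || /^---/ || /^Group Name/ || /^#/ { next }
        {
            if (substr($0, 1, 1) != " ") { group = $1; first = 2 }  # row that names a group
            else                         { first = 1 }              # continuation row
            node = $first
            # The status can be several words, for example "Online faulted".
            status = ""
            for (i = first + 2; i <= NF; i++) {
                if (status != "") status = status " "
                status = status $i
            }
            if (status != "Online") print group, node, status
        }
    '
}
```

A typical invocation would be clresourcegroup status + | rg_not_online. A row that the filter prints is not necessarily a fault. For example, the qfsmds-rg resource group is offline on most nodes in every example in this chapter.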



Example 5–2 Status of a Faulty RAC Database Resource Group


# clresourcegroup status +

=== Cluster Resource Groups ===

Group Name             Node Name    Suspended   Status
----------             ---------    ---------   ------
rac-framework-rg       pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

scaldg-rg              pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

qfsmds-rg              pclus1       No          Online
                       pclus2       No          Offline
                       pclus3       No          Offline
                       pclus4       No          Offline

scalmnt-rg             pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

rac_server_proxy-rg    pclus1       No          Online faulted
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

# clresource status -g rac_server_proxy-rg +

=== Cluster Resources ===

Resource Name           Node Name    State      Status Message
-------------           ---------    -----      --------------
rac_server_proxy-rs     pclus1       Offline    Offline - Oracle instance DOWN
                        pclus2       Online     Online - Oracle instance UP
                        pclus3       Online     Online - Oracle instance UP
                        pclus4       Online     Online - Oracle instance UP

# clresource status -g rac-framework-rg +

=== Cluster Resources ===

Resource Name         Node Name      State      Status Message
-------------         ---------      -----      --------------
rac-framework-rs      pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

rac-udlm-rs           pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

rac-svm-rs            pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

crs_framework-rs      pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

# clresource status -g scaldg-rg +

=== Cluster Resources ===

Resource Name       Node Name       State       Status Message
-------------       ---------       -----       --------------
scaloradg-rs        pclus1          Online      Online - Diskgroup online
                    pclus2          Online      Online - Diskgroup online
                    pclus3          Online      Online - Diskgroup online
                    pclus4          Online      Online - Diskgroup online

# clresource status -g qfsmds-rg +

=== Cluster Resources ===

Resource Name            Node Name    State     Status Message
-------------            ---------    -----     --------------
qfs-db_qfs-OraHome-rs    pclus1       Online    Online - Service is online.
                         pclus2       Offline   Offline
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

qfs-db_qfs-OraData-rs    pclus1       Online    Online - Service is online.
                         pclus2       Offline   Offline
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

# clresource status -g scalmnt-rg +

=== Cluster Resources ===

Resource Name             Node Name    State    Status Message
-------------             ---------    -----    --------------
scal-db_qfs-OraHome-rs    pclus1       Online   Online
                          pclus2       Online   Online
                          pclus3       Online   Online
                          pclus4       Online   Online

scal-db_qfs-OraData-rs    pclus1       Online   Online
                          pclus2       Online   Online
                          pclus3       Online   Online
                          pclus4       Online   Online

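This example provides the following status information for a RAC database resource group that is faulty:

  • The rac_server_proxy-rg resource group is in the Online faulted state on node pclus1.

  • On pclus1, the rac_server_proxy-rs resource is offline, and its status message indicates that the Oracle instance is down. On pclus2, pclus3, and pclus4, the Oracle instance is up.

  • The rac-framework-rg, scaldg-rg, and scalmnt-rg resource groups and their resources are online on all nodes.

  • The qfsmds-rg resource group and its resources are online only on pclus1.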
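The resource-level output can be filtered in a similar way. The following sketch prints the resource, node, state, and status message of every resource instance whose state is not Online. The function name rs_not_online and the | output separator are hypothetical choices. The sketch assumes a POSIX awk and relies on the columns being separated by at least two spaces, as they are in the examples in this chapter.

```shell
# Sketch: print "resource|node|state|message" for every row whose state is
# not "Online". Standard input is the text that is printed by
# "clresource status -g resource-group +". The function name is hypothetical.
rs_not_online() {
    # Columns are separated by two or more spaces. Single spaces occur
    # inside values such as "Start failed".
    awk -F '  +' '
        # Skip short lines, banners, rulers, the header, and echoed commands.
        NF < 3 || /^===/ || /^---/ || /^Resource Name/ || /^#/ { next }
        {
            if ($1 != "") resource = $1   # continuation rows have an empty first field
            if ($3 != "Online") printf "%s|%s|%s|%s\n", resource, $2, $3, $4
        }
    '
}
```

For the rac_server_proxy-rg output in this example, the filter would print the single pclus1 row with its Offline - Oracle instance DOWN message.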



Example 5–3 Status of an Operational Oracle RAC Configuration


# clresourcegroup status +

=== Cluster Resource Groups ===

Group Name             Node Name    Suspended   Status
----------             ---------    ---------   ------
rac-framework-rg       pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

scaldg-rg              pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

qfsmds-rg              pclus1       No          Online
                       pclus2       No          Offline
                       pclus3       No          Offline
                       pclus4       No          Offline

scalmnt-rg             pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

rac_server_proxy-rg    pclus1       No          Online
                       pclus2       No          Online
                       pclus3       No          Online
                       pclus4       No          Online

# clresource status -g rac-framework-rg +

=== Cluster Resources ===

Resource Name         Node Name      State      Status Message
-------------         ---------      -----      --------------
rac-framework-rs      pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

rac-udlm-rs           pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

rac-svm-rs            pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

crs_framework-rs      pclus1         Online     Online
                      pclus2         Online     Online
                      pclus3         Online     Online
                      pclus4         Online     Online

# clresource status -g scaldg-rg +

=== Cluster Resources ===

Resource Name       Node Name       State       Status Message
-------------       ---------       -----       --------------
scaloradg-rs        pclus1          Online      Online - Diskgroup online
                    pclus2          Online      Online - Diskgroup online
                    pclus3          Online      Online - Diskgroup online
                    pclus4          Online      Online - Diskgroup online

# clresource status -g qfsmds-rg +

=== Cluster Resources ===

Resource Name            Node Name    State     Status Message
-------------            ---------    -----     --------------
qfs-db_qfs-OraHome-rs    pclus1       Online    Online - Service is online.
                         pclus2       Offline   Offline
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

qfs-db_qfs-OraData-rs    pclus1       Online    Online - Service is online.
                         pclus2       Offline   Offline
                         pclus3       Offline   Offline
                         pclus4       Offline   Offline

# clresource status -g scalmnt-rg +

=== Cluster Resources ===

Resource Name             Node Name    State    Status Message
-------------             ---------    -----    --------------
scal-db_qfs-OraHome-rs    pclus1       Online   Online
                          pclus2       Online   Online
                          pclus3       Online   Online
                          pclus4       Online   Online

scal-db_qfs-OraData-rs    pclus1       Online   Online
                          pclus2       Online   Online
                          pclus3       Online   Online
                          pclus4       Online   Online

# clresource status -g rac_server_proxy-rg +

=== Cluster Resources ===

Resource Name           Node Name     State     Status Message
-------------           ---------     -----     --------------
rac_server_proxy-rs     pclus1        Online    Online - Oracle instance UP
                        pclus2        Online    Online - Oracle instance UP
                        pclus3        Online    Online - Oracle instance UP
                        pclus4        Online    Online - Oracle instance UP

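This example shows the status of an Oracle RAC configuration that is operating correctly. The example indicates that the status of resource groups and resources in this configuration is as follows:

  • The rac-framework-rg, scaldg-rg, scalmnt-rg, and rac_server_proxy-rg resource groups and their resources are online on all four nodes.

  • The qfsmds-rg resource group and its resources are online only on pclus1 and offline on the other nodes.

  • The status message of the rac_server_proxy-rs resource indicates that the Oracle instance is up on every node.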


Sources of Diagnostic Information

If the state of a scalable device group resource or a file-system mount-point resource changes, the new state is logged through the syslog(3C) function.

The directory /var/cluster/ucmm contains the sources of diagnostic information that are shown in the following table.

Source                                            Location
------                                            --------
Oracle UDLM core files                            /var/cluster/ucmm/dlm_nodename/cores
                                                  If you cannot find the Oracle log files at this location,
                                                  contact Oracle support.

Log file for the current userland cluster         /var/cluster/ucmm/ucmm_reconf.log
membership monitor (UCMM) reconfiguration

Log files for previous UCMM reconfigurations      /var/cluster/ucmm/ucmm_reconf.log.0 (0,1,...)
                                                  This location is dependent on the Oracle UDLM package.

Log files for UNIX Distributed Lock Manager       /var/cluster/ucmm/dlm_nodename/logs
(Oracle UDLM) events                              If you cannot find the Oracle log files at this location,
                                                  contact Oracle support.

The directory /var/opt/SUNWscor/oracle_server/proxyresource contains log files for the resource that represents the Oracle 10g R2 RAC proxy server. Messages for server-side components and client-side components of the proxy server resource are written to separate files:

In these file names and directory names, resource is the name of the resource that represents the Oracle RAC server component.

The directory /var/opt/SUNWscor/oracle_server contains log files for the Oracle 9i RAC server resource. Each file is named /var/opt/SUNWscor/oracle_server/message_log.resource.

The system messages file also contains diagnostic information.

If a problem occurs with Sun Cluster Support for Oracle RAC, consult these files to obtain information about the cause of the problem.
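Because the most recent messages are usually the most relevant, it can help to read these files from the end. The following sketch prints the last lines of a file in reverse order, newest first. The function name newest_first and the default of 20 lines are hypothetical choices, and the sketch assumes a POSIX awk.

```shell
# Sketch: print the last COUNT lines of FILE, most recent line first.
# Usage: newest_first FILE [COUNT]    (COUNT defaults to 20)
# The function name and the default are hypothetical.
newest_first() {
    file=$1
    count=${2:-20}
    awk -v n="$count" '
        { line[NR] = $0 }                    # remember every line
        END {
            start = NR - n + 1               # first line to print
            if (start < 1) start = 1
            for (i = NR; i >= start; i--) print line[i]
        }
    ' "$file"
}
```

For example, newest_first /var/cluster/ucmm/ucmm_reconf.log 50 would show the 50 most recent lines of the log file for the current UCMM reconfiguration. The same function can be applied to the system messages file, which on Solaris is normally /var/adm/messages.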

Common Problems and Their Solutions

The subsections that follow describe problems that can affect Sun Cluster Support for Oracle RAC. Each subsection provides information about the cause of the problem and a solution to the problem.

Node Panic During Initialization of Sun Cluster Support for Oracle RAC

If a fatal problem occurs during the initialization of Sun Cluster Support for Oracle RAC, the node panics with an error message similar to the following error message:


panic[cpu0]/thread=40037e60: Failfast: Aborting because "ucmmd" died 30 seconds ago

Description:

A component that the UCMM controls returned an error to the UCMM during a reconfiguration.

Cause:

The most common causes of this problem are as follows:

  • The license for Veritas Volume Manager (VxVM) is missing or has expired.

  • The ORCLudlm package that contains the Oracle UDLM is not installed.

  • The version of the Oracle UDLM is incompatible with the version of Sun Cluster Support for Oracle RAC.

  • The amount of shared memory is insufficient to enable the Oracle UDLM to start.

A node might also panic during the initialization of Sun Cluster Support for Oracle RAC because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.

Solution:

For instructions to correct the problem, see How to Recover From a Failure of the UCMM or a Related Component.


Note –

When the node is a voting node of the global cluster, the node panic brings down the entire machine. When the node is a zone-cluster node, the node panic brings down only that zone, and other zones remain unaffected.
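To confirm that an earlier panic was of this kind, you can search saved message files for the failfast text that is shown above. The following is a minimal sketch. The function name ucmmd_failfast is hypothetical, and whether the panic string appears in a particular file depends on how the node logs panics, so treat the file that you search as an assumption.

```shell
# Sketch: print every line in the named files that records a failfast
# panic caused by the death of the ucmmd daemon.
# Usage: ucmmd_failfast FILE...
# The function name is hypothetical.
ucmmd_failfast() {
    grep 'Failfast: Aborting because "ucmmd" died' "$@"
}
```

For example, ucmmd_failfast /var/adm/messages searches the usual Solaris system messages file. The function returns a nonzero exit status if no matching line is found.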


Failure of the ucmmd Daemon to Start

The UCMM daemon, ucmmd, manages the reconfiguration of Sun Cluster Support for Oracle RAC. When a cluster is booted or rebooted, this daemon is started only after all components of Sun Cluster Support for Oracle RAC are validated. If the validation of a component on a node fails, the ucmmd daemon fails to start on that node.

The most common causes of this problem are as follows:

For instructions to correct the problem, see How to Recover From a Failure of the UCMM or a Related Component.

Procedure: How to Recover From a Failure of the UCMM or a Related Component

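Perform this task to correct the problems that are described in the following sections:

  • Node Panic During Initialization of Sun Cluster Support for Oracle RAC

  • Failure of the ucmmd Daemon to Start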

This procedure provides the long forms of the Sun Cluster maintenance commands. Most commands also have short forms. Except for the forms of the command names, the commands are identical. For a list of the commands and their short forms, see Appendix A, Sun Cluster Object-Oriented Commands, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

  1. To determine the cause of the problem, examine the log files for UCMM reconfigurations and the system messages file.

    For the location of the log files for UCMM reconfigurations, see Sources of Diagnostic Information.

    When you examine these files, start at the most recent message and work backward until you identify the cause of the problem.

    For more information about error messages that might indicate the cause of reconfiguration errors, see Sun Cluster Error Messages Guide for Solaris OS.

  2. Correct the problem that caused the component to return an error to the UCMM.

    For example:

    • If the license for VxVM is missing or has expired, ensure that VxVM is correctly installed and licensed.

      1. Verify that you have correctly installed your volume manager packages.

      2. If you are using VxVM, check that you have installed the software and check that the license for the VxVM cluster feature is valid.


      Note –

      A zone cluster does not support VxVM.


    • If the ORCLudlm package that contains the Oracle UDLM is not installed, ensure that the package is installed.


      Note –

      Oracle UDLM is required only when it is actually used.


      1. Ensure that you have completed all the procedures that precede installing and configuring the Oracle UDLM software.

        The procedures that you must complete are listed in Table 1–1.

      2. Ensure that the Oracle UDLM software is correctly installed and configured.

        For more information, see SPARC: Installing the Oracle UDLM.

    • If the version of the Oracle UDLM is incompatible with the version of Sun Cluster Support for Oracle RAC, install a compatible version of the package.

      For more information, see SPARC: Installing the Oracle UDLM.

    • If the amount of shared memory is insufficient to enable the Oracle UDLM to start, increase the amount of shared memory.

      For more information, see How to Configure Shared Memory for the Oracle RAC Software in the Global Cluster.

    • If a reconfiguration step has timed out, increase the value of the extension property that specifies the timeout for the step.

      For more information, see Node Panic Caused by a Timeout.

  3. If the solution to the problem requires a reboot, reboot the node where the problem occurred.

    The solution to only certain problems requires a reboot. For example, increasing the amount of shared memory requires a reboot. However, increasing the value of a step timeout does not require a reboot.

    For more information about how to reboot a node, see Shutting Down and Booting a Single Node in a Cluster in Sun Cluster System Administration Guide for Solaris OS.

  4. On the node where the problem occurred, bring online the RAC framework resource group.

    1. Become superuser or assume a role that provides solaris.cluster.admin RBAC authorization.

    2. Type the command to bring online and in a managed state the RAC framework resource group and its resources.


      # clresourcegroup online -emM -n node rac-fmwk-rg
      
      -n node

      Specifies the node name or node identifier (ID) of the node where the problem occurred.

      rac-fmwk-rg

      Specifies the name of the resource group that is to be moved to the MANAGED state and brought online.

Node Panic Caused by a Timeout

The timing out of any step in the reconfiguration of Sun Cluster Support for Oracle RAC causes the node on which the timeout occurred to panic.

To prevent reconfiguration steps from timing out, tune the timeouts that depend on your cluster configuration. For more information, see Guidelines for Setting Timeouts.

If a reconfiguration step times out, use the Sun Cluster maintenance commands to increase the value of the extension property that specifies the timeout for the step. For more information, see Appendix C, Sun Cluster Support for Oracle RAC Extension Properties.

After you have increased the value of the extension property, bring online the RAC framework resource group on the node that panicked.
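Extension properties of a resource are changed with the clresource set command and its -p option. The following sketch only prints the command line for review and does not run it. The function name timeout_cmd is hypothetical, and the property name in the usage note is a placeholder, so take the real name of the timeout property from Appendix C.

```shell
# Sketch: print, but do not run, the command that would set a timeout
# extension property on a resource.
# Usage: timeout_cmd RESOURCE PROPERTY SECONDS
# The function name is hypothetical.
timeout_cmd() {
    resource=$1
    property=$2
    seconds=$3
    # Accept only a nonempty string of digits as the timeout value.
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1
            ;;
    esac
    echo "clresource set -p $property=$seconds $resource"
}
```

For example, timeout_cmd rac-svm-rs step-timeout-property 240 prints a clresource set command for the rac-svm-rs resource, where step-timeout-property stands for the property that controls the step that timed out. Printing the command first gives you the chance to check the resource and property names before you change the configuration.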

Failure of a SUNW.rac_framework Resource to Start

If a SUNW.rac_framework resource fails to start, verify the status of the resource to determine the cause of the failure. For more information, see How to Verify the Status of Sun Cluster Support for Oracle RAC.

The state of a resource that failed to start is shown as Start failed. The associated status message indicates the cause of the failure to start as follows:


Faulted - ucmmd is not running

Description:

The ucmmd daemon is not running on the node where the resource resides.

Solution:

For information about how to correct this problem, see Failure of the ucmmd Daemon to Start.


Degraded - reconfiguration in progress

Description:

The UCMM is undergoing a reconfiguration. This message indicates a problem only if the reconfiguration of the UCMM is not completed and the status of this resource persistently remains degraded.

Cause:

If this message indicates a problem, the cause of the failure is a configuration error in one or more components of Sun Cluster Support for Oracle RAC.

Solution:

The solution to this problem depends on whether the message indicates a problem:


Online

Description:

Reconfiguration of Oracle RAC was not completed until after the START method of the SUNW.rac_framework resource timed out.

Solution:

For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.
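The mapping from status message to next action that is described above can be summarized as a small lookup. The following sketch is for illustration only. The function name rac_framework_hint is hypothetical, and the hints apply only when the state of the resource is Start failed.

```shell
# Sketch: map the status message of a SUNW.rac_framework resource whose
# state is "Start failed" to the section that describes the recovery.
# Usage: rac_framework_hint "STATUS MESSAGE"
# The function name is hypothetical.
rac_framework_hint() {
    case $1 in
        "Faulted - ucmmd is not running")
            echo "See: Failure of the ucmmd Daemon to Start" ;;
        "Degraded - reconfiguration in progress")
            echo "Wait for the UCMM reconfiguration to finish; a persistently degraded status suggests a configuration error" ;;
        "Online")
            echo "See: How to Recover From the Timing Out of the START Method" ;;
        *)
            echo "No hint for this status message" ;;
    esac
}
```

The status message can be taken from the Status Message column of the clresource status output for the RAC framework resource group.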

Procedure: How to Recover From the Timing Out of the START Method

This procedure provides the long forms of the Sun Cluster maintenance commands. Most commands also have short forms. Except for the forms of the command names, the commands are identical. For a list of the commands and their short forms, see Appendix A, Sun Cluster Object-Oriented Commands, in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.

  1. Become superuser or assume a role that provides solaris.cluster.admin RBAC authorization.

  2. On the node where the START method timed out, take offline the RAC framework resource group.

    To perform this operation, switch the primary nodes of the resource group to the other nodes where this group is online.


    # clresourcegroup switch -n nodelist resource-group
    
    -n nodelist

    Specifies a comma-separated list of other cluster nodes on which resource-group is online. Omit from this list the node where the START method timed out.

    resource-group

    Specifies the name of the RAC framework resource group. If this resource group was created by using the clsetup utility, the name of the resource group is rac-framework-rg.

  3. On all cluster nodes that can run Sun Cluster Support for Oracle RAC, bring the RAC framework resource group online.


    # clresourcegroup online resource-group
    
    resource-group

    Specifies that the resource group that you took offline in Step 2 is to be moved to the MANAGED state and brought online.
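The nodelist argument in Step 2 must name every node where the resource group is online except the node where the START method timed out. The following sketch builds such a comma-separated list from a list of node names. The function name nodelist_without is hypothetical.

```shell
# Sketch: print a comma-separated list of the given nodes, leaving out
# the node that is named first.
# Usage: nodelist_without NODE-TO-SKIP NODE...
# The function name is hypothetical.
nodelist_without() {
    skip=$1
    shift
    list=""
    for node in "$@"; do
        [ "$node" = "$skip" ] && continue
        list="${list:+$list,}$node"     # add a comma only between entries
    done
    echo "$list"
}
```

With the node names from the examples in this chapter, nodelist_without pclus1 pclus1 pclus2 pclus3 pclus4 prints pclus2,pclus3,pclus4. That value could then be passed to the -n option of the clresourcegroup switch command in Step 2.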

Failure of a Resource to Stop

If a resource fails to stop, correct this problem as explained in Clearing the STOP_FAILED Error Flag on Resources in Sun Cluster Data Services Planning and Administration Guide for Solaris OS.