Oracle Solaris Cluster Data Service for Oracle Real Application Clusters Guide     Oracle Solaris Cluster

Common Problems and Their Solutions

The subsections that follow describe problems that can affect Support for Oracle RAC. Each subsection provides information about the cause of the problem and a solution to the problem.

Failure of an Oracle RAC Framework Resource Group

This section describes problems that can affect the Oracle RAC framework resource group.

Node Panic During Initialization of Support for Oracle RAC

If a fatal problem occurs during the initialization of Support for Oracle RAC, the node panics with an error message similar to the following:

panic[cpu0]/thread=40037e60: Failfast: Aborting because "ucmmd" died 30 seconds ago

Description: A component that the UCMM controls returned an error to the UCMM during a reconfiguration.

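Cause: The most common causes of this problem are as follows:

  • SPARC: The ORCLudlm package that contains the UDLM is not installed.

  • SPARC: The version of the UDLM is incompatible with the version of Support for Oracle RAC.

  • SPARC: The amount of shared memory is insufficient to enable the UDLM to start.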

A node might also panic during the initialization of Support for Oracle RAC because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.

Solution: For instructions to correct the problem, see How to Recover From a Failure of the ucmmd Daemon or a Related Component.


Note - When the node is a global-cluster voting node, the node panic brings down the entire machine. When the node is a zone-cluster node, the node panic brings down only that zone. Other zones remain unaffected.


Failure of the ucmmd Daemon to Start

The UCMM daemon, ucmmd, manages the reconfiguration of Support for Oracle RAC. When a cluster is booted or rebooted, this daemon is started only after all components of Support for Oracle RAC are validated. If the validation of a component on a node fails, the ucmmd daemon fails to start on the node.

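The most common causes of this problem are as follows:

  • SPARC: The ORCLudlm package that contains the UDLM is not installed.

  • SPARC: The version of the UDLM is incompatible with the version of Support for Oracle RAC.

  • SPARC: The amount of shared memory is insufficient to enable the UDLM to start.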

For instructions to correct the problem, see How to Recover From a Failure of the ucmmd Daemon or a Related Component.

How to Recover From a Failure of the ucmmd Daemon or a Related Component

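Perform this task to correct the problems that are described in the following sections:

  • Node Panic During Initialization of Support for Oracle RAC

  • Failure of the ucmmd Daemon to Start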

  1. To determine the cause of the problem, examine the log files for UCMM reconfigurations and the system messages file.

    For the location of the log files for UCMM reconfigurations, see Sources of Diagnostic Information.

    When you examine these files, start at the most recent message and work backward until you identify the cause of the problem.

    For more information about error messages that might indicate the cause of reconfiguration errors, see Oracle Solaris Cluster Error Messages Guide.

  2. Correct the problem that caused the component to return an error to the UCMM.

    For example:

    • SPARC: If your Oracle release requires UDLM and the ORCLudlm package that contains the UDLM is not installed, ensure that the package is installed.

      Note - UDLM is required only when it is actually used.


      1. Ensure that you have completed all the procedures that precede installing and configuring the UDLM software.

        The procedures that you must complete are listed in Table 1-1.

      2. Ensure that the UDLM software is correctly installed and configured.

        For more information, see SPARC: Installing the UDLM.

    • SPARC: If the version of the UDLM is incompatible with the version of Support for Oracle RAC, install a compatible version of the package.

      For more information, see SPARC: Installing the UDLM.

    • SPARC: If the amount of shared memory is insufficient to enable the UDLM to start, increase the amount of shared memory.

      For more information, see How to Configure Shared Memory for Oracle RAC Software in the Global Cluster.

    • If a reconfiguration step has timed out, increase the value of the extension property that specifies the timeout for the step.

      For more information, see Node Panic Caused by a Timeout.

  3. If the solution to the problem requires a reboot, reboot the node where the problem occurred.

    The solution to only certain problems requires a reboot. For example, increasing the amount of shared memory requires a reboot. However, increasing the value of a step timeout does not require a reboot.

    For more information about how to reboot a node, see Shutting Down and Booting a Single Node in a Cluster in Oracle Solaris Cluster System Administration Guide.

  4. On the node where the problem occurred, take offline and bring online the Oracle RAC framework resource group.

    This step refreshes the resource group with the configuration changes you made.

    1. Become superuser or assume a role that provides solaris.cluster.admin RBAC authorization.
    2. Type the command to take offline the Oracle RAC framework resource group and its resources.
      # clresourcegroup offline -n node rac-fmwk-rg
      -n node

      Specifies the node name or node identifier (ID) of the node where the problem occurred.

      rac-fmwk-rg

      Specifies the name of the resource group that is to be taken offline.

    3. Type the command to bring online and in a managed state the Oracle RAC framework resource group and its resources.
      # clresourcegroup online -emM -n node rac-fmwk-rg
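
Step 1 of this procedure recommends reading the log files from the most recent message backward. The following is a minimal sketch of that approach, not part of the product. The function name scan_newest_first and the default search pattern are assumptions made for illustration, and a POSIX awk is assumed (on Oracle Solaris, typically nawk or /usr/xpg4/bin/awk).

```shell
#!/bin/sh
# Print the lines of a log file that match a pattern, newest (last) line first.
# Hypothetical helper: the name and the default pattern are illustrative only.
scan_newest_first() {
    file=$1
    # The pattern is matched against a lowercased copy of each line,
    # so write it in lowercase.
    pattern=${2:-'error|fail|timed out'}
    awk -v pat="$pattern" '
        { line[NR] = $0 }                      # remember every line by number
        END {
            for (i = NR; i >= 1; i--)          # walk from the last line to the first
                if (tolower(line[i]) ~ pat)
                    print i ": " line[i]       # prefix with the original line number
        }' "$file"
}
```

For example, `scan_newest_first /var/adm/messages` would list matching entries from the system messages file, assuming the file is at its usual Oracle Solaris location. For the UCMM reconfiguration logs, pass the path given in Sources of Diagnostic Information.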

Failure of a Multiple-Owner Volume-Manager Framework Resource Group

This section describes problems that can affect the multiple-owner volume-manager framework resource group.

Node Panic During Initialization of the Multiple-Owner Volume-Manager Framework

If a fatal problem occurs during the initialization of the multiple-owner volume-manager framework, the node panics with an error message similar to the following:

panic[cpu0]/thread=40037e60: Failfast: Aborting because "vucmmd" died 30 seconds ago


Note - When the node is a global-cluster voting node, the node panic brings down the entire machine.

Description: A component that the multiple-owner volume-manager framework controls returned an error to the multiple-owner volume-manager framework during a reconfiguration.

Cause: The most common cause of this problem is that the license for Veritas Volume Manager (VxVM) is missing or has expired.

A node might also panic during the initialization of the multiple-owner volume-manager framework because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.

Solution: For instructions to correct the problem, see How to Recover From a Failure of the vucmmd Daemon or a Related Component.

Failure of the vucmmd Daemon to Start

The multiple-owner volume-manager framework daemon, vucmmd, manages the reconfiguration of the multiple-owner volume-manager framework. When a cluster is booted or rebooted, this daemon is started only after all components of the multiple-owner volume-manager framework are validated. If the validation of a component on a node fails, the vucmmd daemon fails to start on the node.

The most common cause of this problem is that the license for Veritas Volume Manager (VxVM) is missing or has expired.

For instructions to correct the problem, see How to Recover From a Failure of the vucmmd Daemon or a Related Component.

How to Recover From a Failure of the vucmmd Daemon or a Related Component

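Perform this task to correct the problems that are described in the following sections:

  • Node Panic During Initialization of the Multiple-Owner Volume-Manager Framework

  • Failure of the vucmmd Daemon to Start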

  1. To determine the cause of the problem, examine the log files for multiple-owner volume-manager framework reconfigurations and the system messages file.

    For the location of the log files for multiple-owner volume-manager framework reconfigurations, see Sources of Diagnostic Information.

    When you examine these files, start at the most recent message and work backward until you identify the cause of the problem.

    For more information about error messages that might indicate the cause of reconfiguration errors, see Oracle Solaris Cluster Error Messages Guide.

  2. Correct the problem that caused the component to return an error to the multiple-owner volume-manager framework.

    For example:

    • If the license for VxVM is missing or has expired, ensure that VxVM is correctly installed and licensed.
      1. Verify that you have correctly installed your volume manager packages.
      2. If you are using VxVM, check that you have installed the software and check that the license for the VxVM cluster feature is valid.

      Note - A zone cluster does not support VxVM.


    • If a reconfiguration step has timed out, increase the value of the extension property that specifies the timeout for the step.

      For more information, see Node Panic Caused by a Timeout.

  3. If the solution to the problem requires a reboot, reboot the node where the problem occurred.

    The solution to only certain problems requires a reboot. For example, increasing the amount of shared memory requires a reboot. However, increasing the value of a step timeout does not require a reboot.

    For more information about how to reboot a node, see Shutting Down and Booting a Single Node in a Cluster in Oracle Solaris Cluster System Administration Guide.

  4. On the node where the problem occurred, take offline and bring online the multiple-owner volume-manager framework resource group.

    This step refreshes the resource group with the configuration changes you made.

    1. Become superuser or assume a role that provides solaris.cluster.admin RBAC authorization.
    2. Type the command to take offline the multiple-owner volume-manager framework resource group and its resources.
      # clresourcegroup offline -n node vucmm-fmwk-rg
      -n node

      Specifies the node name or node identifier (ID) of the node where the problem occurred.

      vucmm-fmwk-rg

      Specifies the name of the resource group that is to be taken offline.

    3. Type the command to bring online and in a managed state the multiple-owner volume-manager framework resource group and its resources.
      # clresourcegroup online -emM -n node vucmm-fmwk-rg
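
The offline and online cycle in Step 4 can be sketched as a small wrapper. This is an illustration under stated assumptions, not a supplied utility: the function name cycle_framework_rg and the RUN variable are hypothetical, and only the two clresourcegroup commands are taken from the procedure. By default the wrapper prints the commands so that they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print (default) or run the offline/online cycle for a framework resource group.
# Hypothetical wrapper: only the two clresourcegroup commands come from the procedure.
cycle_framework_rg() {
    node=$1   # node name or node ID of the node where the problem occurred
    rg=$2     # for example, vucmm-fmwk-rg
    # With RUN=1 the commands are executed; otherwise they are only printed.
    if [ "${RUN:-0}" = "1" ]; then
        run=
    else
        run=echo
    fi
    # Bring the group online only if taking it offline succeeded.
    $run clresourcegroup offline -n "$node" "$rg" &&
        $run clresourcegroup online -emM -n "$node" "$rg"
}
```

Calling `cycle_framework_rg phys-node-1 vucmm-fmwk-rg` prints the two commands for a node that is here named phys-node-1 as a placeholder. The same pattern applies to the Oracle RAC framework resource group if its name is passed instead.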

SUNW.qfs Registration Fails Because the Registration File Is Not Found

Oracle Solaris Cluster resource-type registration files are located in the /opt/cluster/lib/rgm/rtreg/ or /usr/cluster/lib/rgm/rtreg/ directory. The SUNW.qfs resource-type registration file is located in the /opt/SUNWsamfs/sc/etc/ directory.

If Oracle Solaris Cluster software is already installed when you install Sun QFS software, the necessary mapping to the SUNW.qfs registration file is created automatically. However, if Oracle Solaris Cluster software is not yet installed when you install Sun QFS software, the mapping is not created, even when Oracle Solaris Cluster software is installed later. Attempts to register the SUNW.qfs resource type then fail because Oracle Solaris Cluster software cannot locate the registration file.

To enable Oracle Solaris Cluster software to locate the SUNW.qfs resource type, create a symbolic link to the registration file in the /usr/cluster/lib/rgm/rtreg/ directory:

# cd /usr/cluster/lib/rgm/rtreg
# ln -s /opt/SUNWsamfs/sc/etc/SUNW.qfs SUNW.qfs
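
If the link might already exist, for example on a node that was repaired earlier, a guarded version of the same two commands avoids an error from ln. The following is a minimal sketch. The function name link_qfs_rtreg and its two optional directory arguments are hypothetical and exist only so that the sketch can be tried in a scratch directory. The default paths are the ones named in this section.

```shell
#!/bin/sh
# Create the SUNW.qfs link only if it is missing.
# Hypothetical helper: the defaults are the directories named in this section.
link_qfs_rtreg() {
    src_dir=${1:-/opt/SUNWsamfs/sc/etc}
    rtreg_dir=${2:-/usr/cluster/lib/rgm/rtreg}
    # Refuse to create a dangling link if the registration file is absent.
    if [ ! -f "$src_dir/SUNW.qfs" ]; then
        echo "registration file not found: $src_dir/SUNW.qfs" >&2
        return 1
    fi
    # Leave an existing link or file untouched.
    if [ -L "$rtreg_dir/SUNW.qfs" ] || [ -e "$rtreg_dir/SUNW.qfs" ]; then
        echo "already present: $rtreg_dir/SUNW.qfs"
        return 0
    fi
    ln -s "$src_dir/SUNW.qfs" "$rtreg_dir/SUNW.qfs"
}
```

Run without arguments as superuser, the function acts on the real directories. After the link exists, the registration of the SUNW.qfs resource type can be attempted again.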

Node Panic Caused by a Timeout

The timing out of any step in the reconfiguration of Support for Oracle RAC causes the node on which the timeout occurred to panic.

To prevent reconfiguration steps from timing out, tune the timeouts that depend on your cluster configuration. For more information, see Guidelines for Setting Timeouts.

If a reconfiguration step times out, use the Oracle Solaris Cluster maintenance commands to increase the value of the extension property that specifies the timeout for the step. For more information, see Appendix C, Support for Oracle RAC Extension Properties.

After you have increased the value of the extension property, bring online the Oracle RAC framework resource group on the node that panicked.
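
As an illustration of the property change described above, the following sketch builds a clresource set command for review. The function name timeout_set_cmd is hypothetical, and the property and resource names in the example call are placeholders, not real names. Look up the property for the step that timed out in Appendix C, and note that the sketch assumes the timeout is a whole number of seconds.

```shell
#!/bin/sh
# Build the command that raises a step-timeout extension property.
# The property and resource names passed in are placeholders supplied by the caller.
timeout_set_cmd() {
    prop=$1       # extension property that sets the timeout for the step
    seconds=$2    # new timeout value (assumed to be in seconds)
    resource=$3   # resource that owns the extension property
    # Reject empty or non-numeric values before a command is produced.
    case $seconds in
        ''|*[!0-9]*)
            echo "timeout must be a whole number of seconds" >&2
            return 1 ;;
    esac
    echo "clresource set -p $prop=$seconds $resource"
}
```

For example, `timeout_set_cmd step-timeout-property 300 framework-component-rs` prints a command line that can be checked against Appendix C before it is typed on the cluster node.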

Failure of a SUNW.rac_framework or SUNW.vucmm_framework Resource to Start

If a SUNW.rac_framework or SUNW.vucmm_framework resource fails to start, verify the status of the resource to determine the cause of the failure. For more information, see How to Verify the Status of Support for Oracle RAC.

The state of a resource that failed to start is shown as Start failed. The associated status message indicates the cause of the failure to start.

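This section contains the following information:

  • SUNW.rac_framework Failure-to-Start Status Messages

  • SUNW.vucmm_framework Failure-to-Start Status Messages

  • How to Recover From the Timing Out of the START Method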

SUNW.rac_framework Failure-to-Start Status Messages

The following status messages are associated with the failure of a SUNW.rac_framework resource to start:

Faulted - ucmmd is not running

Description: The ucmmd daemon is not running on the node where the resource resides.

Solution: For information about how to correct this problem, see Failure of the ucmmd Daemon to Start.

Degraded - reconfiguration in progress

Description: The UCMM is undergoing a reconfiguration. This message indicates a problem only if the reconfiguration of the UCMM is not completed and the status of this resource persistently remains degraded.

Cause: If this message indicates a problem, the cause of the failure is a configuration error in one or more components of Support for Oracle RAC.

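Solution: The solution to this problem depends on whether the message indicates a problem:

  • If the message indicates a problem, correct the problem as explained in How to Recover From a Failure of the ucmmd Daemon or a Related Component.

  • If the message does not indicate a problem, no action is required.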

Online

Description: Reconfiguration of Oracle RAC was not completed until after the START method of the SUNW.rac_framework resource timed out.

Solution: For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.

SUNW.vucmm_framework Failure-to-Start Status Messages

The following status messages are associated with the failure of a SUNW.vucmm_framework resource to start:

Faulted - vucmmd is not running

Description: The vucmmd daemon is not running on the node where the resource resides.

Solution: For information about how to correct this problem, see Failure of the vucmmd Daemon to Start.

Degraded - reconfiguration in progress

Description: The multiple-owner volume-manager framework is undergoing a reconfiguration. This message indicates a problem only if the reconfiguration of the multiple-owner volume-manager framework is not completed and the status of this resource persistently remains degraded.

Cause: If this message indicates a problem, the cause of the failure is a configuration error in one or more components of the volume manager reconfiguration framework.

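Solution: The solution to this problem depends on whether the message indicates a problem:

  • If the message indicates a problem, correct the problem as explained in How to Recover From a Failure of the vucmmd Daemon or a Related Component.

  • If the message does not indicate a problem, no action is required.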

Online

Description: Reconfiguration of Oracle RAC was not completed until after the START method of the SUNW.vucmm_framework resource timed out.

Solution: For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.
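
The status messages above can be summarized as a lookup from message to the section that addresses it. The following sketch is illustrative only. The function name triage_status is hypothetical, and the message strings and section titles are copied from this section.

```shell
#!/bin/sh
# Map a failure-to-start status message to the section that addresses it.
# Hypothetical helper: the strings are taken from this section.
triage_status() {
    case $1 in
        "Faulted - ucmmd is not running")
            echo "Failure of the ucmmd Daemon to Start" ;;
        "Faulted - vucmmd is not running")
            echo "Failure of the vucmmd Daemon to Start" ;;
        "Degraded - reconfiguration in progress")
            # A problem exists only if the resource stays degraded.
            echo "No action unless the status persistently remains degraded" ;;
        "Online")
            # Online with a Start failed state means that the START method timed out.
            echo "How to Recover From the Timing Out of the START Method" ;;
        *)
            echo "Unrecognized status message"
            return 1 ;;
    esac
}
```

For example, `triage_status "Online"` prints the title of the START method recovery procedure.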

How to Recover From the Timing Out of the START Method

  1. Become superuser or assume a role that provides solaris.cluster.admin RBAC authorization.
  2. On the node where the START method timed out, take offline the framework resource group that failed to start.

    To perform this operation, switch the primary nodes of the resource group to the other nodes where the group is online.

    # clresourcegroup offline -n nodelist resource-group
    -n nodelist

    Specifies a comma-separated list of other cluster nodes on which resource-group is online. Omit from this list the node where the START method timed out.

    resource-group

    Specifies the name of the framework resource group.

    If your configuration uses both a multiple-owner volume-manager framework resource group and an Oracle RAC framework resource group, first take offline the multiple-owner volume-manager framework resource group. When the multiple-owner volume-manager framework resource group is offline, then take offline the Oracle RAC framework resource group.

    If the Oracle RAC framework resource group was created by using the clsetup utility, the name of the resource group is rac-framework-rg.

  3. On all cluster nodes that can run Support for Oracle RAC, bring online the framework resource group that failed to come online.
    # clresourcegroup online resource-group
    resource-group

    Specifies that the resource group that you took offline in Step 2 is to be moved to the MANAGED state and brought online.
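
When both framework resource groups are configured, the order of the commands matters. The following sketch prints the command sequence for review. The function name start_timeout_plan is hypothetical. The offline order (multiple-owner volume-manager framework first) follows Step 2, whereas bringing the groups online in the same order is an assumption, because the procedure does not state an online order.

```shell
#!/bin/sh
# Print the commands for recovering from a START method timeout.
# Hypothetical helper: pass the node list first, then the resource groups
# in the order in which they are to be taken offline.
start_timeout_plan() {
    nodelist=$1   # comma-separated list as described for -n nodelist in Step 2
    shift
    # Step 2: take each framework resource group offline, in the order given.
    for rg in "$@"; do
        echo "clresourcegroup offline -n $nodelist $rg"
    done
    # Step 3: bring each group online again (same order; an assumption).
    for rg in "$@"; do
        echo "clresourcegroup online $rg"
    done
}
```

For example, `start_timeout_plan node2,node3 vucmm-fmwk-rg rac-fmwk-rg` prints four commands, where node2 and node3 are placeholder node names and the resource group names are the examples used earlier in this chapter.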

Failure of a Resource to Stop

If a resource fails to stop, correct this problem as explained in Clearing the STOP_FAILED Error Flag on Resources in Oracle Solaris Cluster Data Services Planning and Administration Guide.