Oracle Solaris Cluster Data Service for Oracle Real Application Clusters Guide
6. Troubleshooting Support for Oracle RAC
Common Problems and Their Solutions
The subsections that follow describe problems that can affect Support for Oracle RAC. Each subsection provides information about the cause of the problem and a solution to the problem.
Failure of a Multiple-Owner Volume-Manager Framework Resource Group
SUNW.qfs Registration Fails Because the Registration File Is Not Found
Failure of a SUNW.rac_framework or SUNW.vucmm_framework Resource to Start
Failure of a RAC Framework Resource Group
This section describes problems that can affect the RAC framework resource group.
Node Panic During Initialization of Support for Oracle RAC
If a fatal problem occurs during the initialization of Support for Oracle RAC, the node panics with an error message similar to the following:
panic[cpu0]/thread=40037e60: Failfast: Aborting because "ucmmd" died 30 seconds ago
Description: A component that the UCMM controls returned an error to the UCMM during a reconfiguration.
Cause: The most common causes of this problem are as follows:
SPARC: The ORCLudlm package that contains the Oracle UDLM is not installed.
SPARC: The version of the Oracle UDLM is incompatible with the version of Support for Oracle RAC.
SPARC: The amount of shared memory is insufficient to enable the Oracle UDLM to start.
A node might also panic during the initialization of Support for Oracle RAC because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.
Solution: For instructions to correct the problem, see How to Recover From a Failure of the ucmmd Daemon or a Related Component.
Note - When the node is a global-cluster voting node, the node panic brings down the entire machine. When the node is a zone-cluster node, the node panic brings down only that zone, and other zones remain unaffected.
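The daemon named in a Failfast panic message shows which framework failed: ucmmd for the RAC framework, or vucmmd for the multiple-owner volume-manager framework. As a minimal sketch, assuming a POSIX-compatible shell, the following helper extracts that name from any text that contains such a message. The function name failfast_daemon is hypothetical and is not part of Oracle Solaris Cluster.

```shell
# failfast_daemon: read text on standard input and print the daemon name
# quoted in each 'Failfast: Aborting because "<daemon>" died' message.
# Lines that do not contain such a message produce no output.
failfast_daemon() {
    sed -n 's/.*Failfast: Aborting because "\([^"]*\)" died.*/\1/p'
}
```

For example, `failfast_daemon < /var/adm/messages` would list the daemons named in such messages, assuming that the panic text was recorded in the system messages file.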
Failure of the ucmmd Daemon to Start
The UCMM daemon, ucmmd, manages the reconfiguration of Support for Oracle RAC. When a cluster is booted or rebooted, this daemon is started only after all components of Support for Oracle RAC are validated. If the validation of a component on a node fails, the ucmmd daemon fails to start on the node.
The most common causes of this problem are as follows:
SPARC: The ORCLudlm package that contains the Oracle UDLM is not installed.
An error occurred during a previous reconfiguration of a component of Support for Oracle RAC.
A step in a previous reconfiguration of Support for Oracle RAC timed out, causing the node on which the timeout occurred to panic.
For instructions to correct the problem, see How to Recover From a Failure of the ucmmd Daemon or a Related Component.
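How to Recover From a Failure of the ucmmd Daemon or a Related Component
Perform this task to correct the problems that are described in the following sections:
Node Panic During Initialization of Support for Oracle RAC
Failure of the ucmmd Daemon to Start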
For the location of the log files for UCMM reconfigurations, see Sources of Diagnostic Information.
When you examine these files, start at the most recent message and work backward until you identify the cause of the problem.
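Working backward from the most recent message can be sketched in shell as follows. This is only an illustration, assuming a POSIX-compatible shell: the function name newest_first is hypothetical, and the log file to pass to it is whichever file Sources of Diagnostic Information identifies for your configuration.

```shell
# newest_first FILE
# Print the lines of FILE in reverse order, so that the most recent
# message appears first. awk is used instead of a platform-specific
# line-reversal tool to keep the sketch portable.
newest_first() {
    awk '{ lines[NR] = $0 }
         END { for (i = NR; i >= 1; i--) print lines[i] }' "$1"
}
```

The output can be piped to a pager, so that reading from the top of the screen corresponds to moving backward in time through the log.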
For more information about error messages that might indicate the cause of reconfiguration errors, see Oracle Solaris Cluster Error Messages Guide.
For example:
Note - Oracle UDLM is required only when it is actually used.
The procedures that you must complete are listed in Table 1-1.
For more information, see SPARC: Installing the Oracle UDLM.
For more information, see SPARC: Installing the Oracle UDLM.
For more information, see How to Configure Shared Memory for the Oracle RAC Software in the Global Cluster.
For more information, see Node Panic Caused by a Timeout.
Only some of these solutions require a reboot. For example, increasing the amount of shared memory requires a reboot, whereas increasing the value of a step timeout does not.
For more information about how to reboot a node, see Shutting Down and Booting a Single Node in a Cluster in Oracle Solaris Cluster System Administration Guide.
This step refreshes the resource group with the configuration changes you made.
# clresourcegroup offline -n node rac-fmwk-rg
-n node
Specifies the node name or node identifier (ID) of the node where the problem occurred.
rac-fmwk-rg
Specifies the name of the resource group that is to be taken offline.
# clresourcegroup online -emM -n node rac-fmwk-rg
Failure of a Multiple-Owner Volume-Manager Framework Resource Group
This section describes problems that can affect the multiple-owner volume-manager framework resource group.
Node Panic During Initialization of the Multiple-Owner Volume-Manager Framework
Failure of the vucmmd Daemon to Start
How to Recover From a Failure of the vucmmd Daemon or a Related Component
Node Panic During Initialization of the Multiple-Owner Volume-Manager Framework
If a fatal problem occurs during the initialization of the multiple-owner volume-manager framework, the node panics with an error message similar to the following:
panic[cpu0]/thread=40037e60: Failfast: Aborting because "vucmmd" died 30 seconds ago
Description: A component that the multiple-owner volume-manager framework controls returned an error to the multiple-owner volume-manager framework during a reconfiguration.
Cause: The most common cause of this problem is that the license for Veritas Volume Manager (VxVM) is missing or has expired.
A node might also panic during the initialization of the multiple-owner volume-manager framework because a reconfiguration step has timed out. For more information, see Node Panic Caused by a Timeout.
Solution: For instructions to correct the problem, see How to Recover From a Failure of the vucmmd Daemon or a Related Component.
Note - When the node is a global-cluster voting node, the node panic brings down the entire machine.
Failure of the vucmmd Daemon to Start
The multiple-owner volume-manager framework daemon, vucmmd, manages the reconfiguration of the multiple-owner volume-manager framework. When a cluster is booted or rebooted, this daemon is started only after all components of the multiple-owner volume-manager framework are validated. If the validation of a component on a node fails, the vucmmd daemon fails to start on the node.
The most common causes of this problem are as follows:
An error occurred during a previous reconfiguration of a component of the multiple-owner volume-manager framework.
A step in a previous reconfiguration of the multiple-owner volume-manager framework timed out, causing the node on which the timeout occurred to panic.
For instructions to correct the problem, see How to Recover From a Failure of the vucmmd Daemon or a Related Component.
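How to Recover From a Failure of the vucmmd Daemon or a Related Component
Perform this task to correct the problems that are described in the following sections:
Node Panic During Initialization of the Multiple-Owner Volume-Manager Framework
Failure of the vucmmd Daemon to Start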
For the location of the log files for multiple-owner volume-manager framework reconfigurations, see Sources of Diagnostic Information.
When you examine these files, start at the most recent message and work backward until you identify the cause of the problem.
For more information about error messages that might indicate the cause of reconfiguration errors, see Oracle Solaris Cluster Error Messages Guide.
For example:
Note - A zone cluster does not support VxVM.
For more information, see Node Panic Caused by a Timeout.
Only some of these solutions require a reboot. For example, increasing the amount of shared memory requires a reboot, whereas increasing the value of a step timeout does not.
For more information about how to reboot a node, see Shutting Down and Booting a Single Node in a Cluster in Oracle Solaris Cluster System Administration Guide.
This step refreshes the resource group with the configuration changes you made.
# clresourcegroup offline -n node vucmm-fmwk-rg
-n node
Specifies the node name or node identifier (ID) of the node where the problem occurred.
vucmm-fmwk-rg
Specifies the name of the resource group that is to be taken offline.
# clresourcegroup online -emM -n node vucmm-fmwk-rg
SUNW.qfs Registration Fails Because the Registration File Is Not Found
Oracle Solaris Cluster resource-type registration files are located in the /opt/cluster/lib/rgm/rtreg/ or /usr/cluster/lib/rgm/rtreg/ directory. The SUNW.qfs resource-type registration file is located in the /opt/SUNWsamfs/sc/etc/ directory.
If Oracle Solaris Cluster software is already installed when you install Sun QFS software, the necessary mapping to the SUNW.qfs registration file is created automatically. However, if Oracle Solaris Cluster software is not yet installed when you install Sun QFS software, the mapping is not made, even when Oracle Solaris Cluster software is installed later. Attempts to register the SUNW.qfs resource type then fail because the Oracle Solaris Cluster software cannot locate the registration file.
To enable Oracle Solaris Cluster software to locate the SUNW.qfs registration file, create a symbolic link to the file in the /usr/cluster/lib/rgm/rtreg/ directory:
# cd /usr/cluster/lib/rgm/rtreg
# ln -s /opt/SUNWsamfs/sc/etc/SUNW.qfs SUNW.qfs
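The same link can be created defensively, so that the step is safe to repeat. The following is a minimal sketch, assuming a POSIX-compatible shell. The function name link_registration_file is hypothetical, and the two paths are passed as arguments rather than hard-coded.

```shell
# link_registration_file SRC_FILE RTREG_DIR
# Create a symbolic link to SRC_FILE inside RTREG_DIR, unless a file or
# link with the same name already exists there. Fails if SRC_FILE is missing.
link_registration_file() {
    src=$1
    dir=$2
    name=`basename "$src"`
    if [ ! -f "$src" ]; then
        echo "registration file not found: $src" >&2
        return 1
    fi
    if [ -f "$dir/$name" ] || [ -h "$dir/$name" ]; then
        echo "already present: $dir/$name"
        return 0
    fi
    ln -s "$src" "$dir/$name" && echo "linked: $dir/$name -> $src"
}
```

With the locations given in this section, the call would be `link_registration_file /opt/SUNWsamfs/sc/etc/SUNW.qfs /usr/cluster/lib/rgm/rtreg`, run as superuser. Registration of the SUNW.qfs resource type can then be retried.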
Node Panic Caused by a Timeout
The timing out of any step in the reconfiguration of Support for Oracle RAC causes the node on which the timeout occurred to panic.
To prevent reconfiguration steps from timing out, tune the timeouts that depend on your cluster configuration. For more information, see Guidelines for Setting Timeouts.
If a reconfiguration step times out, use the Oracle Solaris Cluster maintenance commands to increase the value of the extension property that specifies the timeout for the step. For more information, see Appendix C, Support for Oracle RAC Extension Properties.
After you have increased the value of the extension property, bring online the RAC framework resource group on the node that panicked.
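The two actions described above, raising the timeout and then bringing the resource group back online, can be sketched as a dry run that only prints the commands for review. This is an illustration under stated assumptions: the function name timeout_commands is hypothetical, the resource, property, and node names in the usage note are placeholders, and the online command reuses the form shown in the recovery procedures in this chapter. Take the real property name from Appendix C.

```shell
# timeout_commands RESOURCE PROPERTY VALUE NODE RESOURCE_GROUP
# Print (do not run) the command that sets an extension property on
# RESOURCE, followed by the command that brings RESOURCE_GROUP online
# on NODE. VALUE must consist of digits only.
timeout_commands() {
    if [ $# -ne 5 ]; then
        echo "usage: timeout_commands resource property value node resource-group" >&2
        return 1
    fi
    case $3 in
        ''|*[!0-9]*)
            echo "timeout value must be a whole number: $3" >&2
            return 1 ;;
    esac
    echo "clresource set -p $2=$3 $1"
    echo "clresourcegroup online -emM -n $4 $5"
}
```

For example, `timeout_commands rac-fmwk-rs Example_timeout 600 node1 rac-fmwk-rg` prints the two commands with those placeholder values filled in. Printing first makes it easy to check the property name and value before running anything as superuser.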
Failure of a SUNW.rac_framework or SUNW.vucmm_framework Resource to Start
If a SUNW.rac_framework or SUNW.vucmm_framework resource fails to start, verify the status of the resource to determine the cause of the failure. For more information, see How to Verify the Status of Support for Oracle RAC.
The state of a resource that failed to start is shown as Start failed. The associated status message indicates the cause of the failure to start.
This section contains the following information:
SUNW.rac_framework Failure-to-Start Status Messages
The following status messages are associated with the failure of a SUNW.rac_framework resource to start:
Faulted - ucmmd is not running
Description: The ucmmd daemon is not running on the node where the resource resides.
Solution: For information about how to correct this problem, see Failure of the ucmmd Daemon to Start.
Degraded - reconfiguration in progress
Description: The UCMM is undergoing a reconfiguration. This message indicates a problem only if the reconfiguration of the UCMM is not completed and the status of this resource persistently remains degraded.
Cause: If this message indicates a problem, the cause of the failure is a configuration error in one or more components of Support for Oracle RAC.
Solution: The solution to this problem depends on whether the message indicates a problem:
If the message indicates a problem, correct the problem as explained in How to Recover From a Failure of the ucmmd Daemon or a Related Component.
If the message does not indicate a problem, no action is required.
Description: Reconfiguration of Oracle RAC was not completed until after the START method of the SUNW.rac_framework resource timed out.
Solution: For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.
SUNW.vucmm_framework Failure-to-Start Status Messages
The following status messages are associated with the failure of a SUNW.vucmm_framework resource to start:
Faulted - vucmmd is not running
Description: The vucmmd daemon is not running on the node where the resource resides.
Solution: For information about how to correct this problem, see Failure of the vucmmd Daemon to Start.
Degraded - reconfiguration in progress
Description: The multiple-owner volume-manager framework is undergoing a reconfiguration. This message indicates a problem only if the reconfiguration of the multiple-owner volume-manager framework is not completed and the status of this resource persistently remains degraded.
Cause: If this message indicates a problem, the cause of the failure is a configuration error in one or more components of the volume manager reconfiguration framework.
Solution: The solution to this problem depends on whether the message indicates a problem:
If the message indicates a problem, correct the problem as explained in How to Recover From a Failure of the vucmmd Daemon or a Related Component.
If the message does not indicate a problem, no action is required.
Description: Reconfiguration of Oracle RAC was not completed until after the START method of the SUNW.vucmm_framework resource timed out.
Solution: For instructions to correct the problem, see How to Recover From the Timing Out of the START Method.
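The status messages quoted above can be treated as a lookup table from message text to the section that describes the fix. The following is a minimal sketch, assuming a POSIX-compatible shell. The function name recovery_section is hypothetical, and only the messages whose text appears in this section are covered.

```shell
# recovery_section STATUS_MESSAGE
# Print the title of the section that explains how to handle
# STATUS_MESSAGE. Returns 1 for any message this sketch does not cover.
recovery_section() {
    case $1 in
        "Faulted - ucmmd is not running")
            echo "Failure of the ucmmd Daemon to Start" ;;
        "Faulted - vucmmd is not running")
            echo "Failure of the vucmmd Daemon to Start" ;;
        "Degraded - reconfiguration in progress")
            # A problem only if the resource persistently remains degraded.
            echo "No action unless the status persists" ;;
        *)
            echo "not covered by this sketch: $1" >&2
            return 1 ;;
    esac
}
```

A call such as `recovery_section "Faulted - vucmmd is not running"` prints the section title to consult next.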
How to Recover From the Timing Out of the START Method
To perform this operation, switch the primary nodes of the resource group to the other nodes where the group is online.
# clresourcegroup offline -n nodelist resource-group
-n nodelist
Specifies a comma-separated list of other cluster nodes on which resource-group is online. Omit from this list the node where the START method timed out.
resource-group
Specifies the name of the framework resource group.
If your configuration uses both a multiple-owner volume-manager framework resource group and a RAC framework resource group, first take offline the multiple-owner volume-manager framework resource group. When the multiple-owner volume-manager framework resource group is offline, then take offline the RAC framework resource group.
If the RAC resource group was created by using the clsetup utility, the name of the resource group is rac-framework-rg.
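The ordering rule above, multiple-owner volume-manager framework resource group first and RAC framework resource group second, can be sketched as a dry run that prints the offline commands in the required order. The function name offline_commands is hypothetical, and the node names and the volume-manager group name in the usage note are placeholders. A POSIX-compatible shell is assumed.

```shell
# offline_commands NODELIST RAC_RG [VUCMM_RG]
# Print (do not run) the offline commands in the required order.
# When VUCMM_RG is given, its command is printed before the command
# for RAC_RG; otherwise only the RAC_RG command is printed.
offline_commands() {
    nodelist=$1
    rac_rg=$2
    vucmm_rg=$3
    if [ -z "$nodelist" ] || [ -z "$rac_rg" ]; then
        echo "usage: offline_commands nodelist rac-rg [vucmm-rg]" >&2
        return 1
    fi
    if [ -n "$vucmm_rg" ]; then
        echo "clresourcegroup offline -n $nodelist $vucmm_rg"
    fi
    echo "clresourcegroup offline -n $nodelist $rac_rg"
}
```

For example, `offline_commands node2,node3 rac-framework-rg vucmm-framework-rg` prints the command for the volume-manager framework group first, followed by the command for the RAC framework group.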
# clresourcegroup online resource-group
Specifies that the resource group that you brought offline in Step 2 is to be moved to the MANAGED state and brought online.
If a resource fails to stop, correct this problem as explained in Clearing the STOP_FAILED Error Flag on Resources in Oracle Solaris Cluster Data Services Planning and Administration Guide.