This chapter provides general information about the troubleshooting tools provided with Oracle Fail Safe Manager.
Note that Oracle Fail Safe provides a centralized message facility. When you perform an action that results in an error, the system locates the message associated with the error and displays it. You can find more information about these messages in the Oracle Fail Safe Error Messages manual.
Oracle Fail Safe provides a family of verify tools that check cluster components and the cluster environment and validate the status of nodes, groups, and resources. If a verify operation finds a discrepancy or a problem, then it takes the appropriate action to fix it.
Figure 6-1 shows the verify commands in the Troubleshooting menu.
Figure 6-1 Troubleshooting Menu and Verify Commands
Table 6-1 describes the verify commands and provides references for more information.
Table 6-1 Verify Commands for Troubleshooting
Tool | Description | Reference |
---|---|---|
Verify Cluster | Validates the Oracle Fail Safe installation, the Oracle product installation (including Oracle homes and product version numbers), cluster network configuration, and cluster resource DLL registration. | |
Verify Group | Validates that the group resources and their dependencies are configured correctly. | |
Verify Standalone Database | Validates the standalone database instance and removes any old configuration information that may remain on another node. | |
You can use the verify commands at any time to validate your cluster, group, or standalone database. If problems are found during verification, then Oracle Fail Safe prompts you to fix them or returns an error message that further describes the problem.
If errors are returned when you run one of the verify commands, then fix the errors and rerun the verify command. Repeat this process until the verify operation runs without errors.
The Verify Cluster operation validates the installation and network configuration of the cluster. You can perform a cluster verification at any time. From the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Cluster.
The first time you connect to a cluster after installing or upgrading the Oracle Fail Safe software, you are prompted to run Verify Cluster. You can run Verify Cluster at any time; however, you must run it whenever the cluster configuration changes. The Verify Cluster operation verifies that:
Each Oracle home name into which Oracle software is installed is the same on all cluster nodes
If, for example, OFS is the Oracle home name for the Oracle Fail Safe software on one cluster node, then OFS must be the Oracle home name on all nodes in the cluster where Oracle Fail Safe is installed. Similarly, if OfsDb is the Oracle home name for the Oracle database software on one cluster node, then OfsDb must be the Oracle home name on all nodes in the cluster where the Oracle database software is installed.
The Oracle Services for MSCS release is identical on all nodes
The resource providers (components) are configured identically on at least two of the nodes that are possible owners for each resource
The Host Name/IP Address mappings resolve consistently across all nodes in the cluster
If there is a problem with inconsistent mapping, then the Verify Cluster command returns errors indicating that the order of network adapters may be incorrect. See Appendix A for details.
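The Oracle home symmetry rule above can be illustrated with a small sketch. This is not how Verify Cluster is implemented; it only shows the rule being checked. The input structure (a mapping from node name to the Oracle home name of each installed product), the node names, and the function name are assumptions made for illustration, and the comparison is a plain exact string match.

```python
from collections import defaultdict


def find_home_name_mismatches(homes_by_node):
    """Return products whose Oracle home name differs between nodes.

    homes_by_node is a hypothetical structure: {node: {product: home_name}}.
    A product is only compared across the nodes where it is installed.
    """
    names_by_product = defaultdict(dict)
    for node, homes in homes_by_node.items():
        for product, home_name in homes.items():
            names_by_product[product][node] = home_name

    # Keep only the products that have more than one distinct home name.
    return {
        product: per_node
        for product, per_node in names_by_product.items()
        if len(set(per_node.values())) > 1
    }


if __name__ == "__main__":
    # Hypothetical two-node cluster; NODE2 uses a different database home name.
    cluster = {
        "NODE1": {"Oracle Fail Safe": "OFS", "Oracle Database": "OfsDb"},
        "NODE2": {"Oracle Fail Safe": "OFS", "Oracle Database": "OraDb"},
    }
    print(find_home_name_mismatches(cluster))
```

An empty result corresponds to the symmetric case that Verify Cluster expects.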
Verify Cluster also registers Oracle resource DLLs with Microsoft Cluster Server (MSCS).
Figure 6-2 shows the output from a typical Verify Cluster operation.
Figure 6-2 Clusterwide Operation Window for Verify Cluster
If you run the Verify Cluster operation and it does not complete successfully, then it may indicate one or more of the following problems:
A problem exists in the configuration of the hardware, network, or the MSCS software.
A problem exists in the symmetry of the Oracle homes and versions.
A problem exists with the Oracle Fail Safe installation (for example, with the symmetry of the resource providers).
If the operation completes successfully, but you are still having problems with Oracle Fail Safe, then the problem lies in the Oracle Fail Safe configuration.
The Verify Group operation does the following to ensure that a group performs correctly:
Checks all resources in a group and confirms that they have been configured correctly on all nodes that are possible owners for the group.
Updates the dependencies among resources in the group.
After prompting you, repairs a group that is misconfigured.
You can run the Verify Group operation at any time. However, you must run it when any of the following occurs:
A group or resource in a group does not come online.
You add a node to the cluster.
To verify a group, select the group from the Oracle Fail Safe Manager tree view. Then, from the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Group.
Alternatively, you can run a Verify Group operation using the FSCMD command VERIFYGROUP (see Chapter 5). FSCMD also provides a VERIFYALLGROUPS command that lets you verify all groups configured by Oracle Fail Safe on a given cluster. You can run the VERIFYGROUP and VERIFYALLGROUPS commands in scripts as batch jobs.
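As a rough sketch of scripting these commands, the following Python helper builds an FSCMD command line and, optionally, runs it. The argument layout (action, then group name, then a /CLUSTER qualifier) is an assumption that should be checked against the FSCMD reference in Chapter 5. The group name Sales_Group is hypothetical, and the cluster name NTCLU-150 is borrowed from the Dump Cluster example later in this chapter.

```python
import subprocess


def build_fscmd_args(action, cluster, group=None):
    """Build an argument list for FSCMD.

    ASSUMPTION: the layout "fscmd ACTION [group] /CLUSTER=name" is not
    confirmed by this chapter; verify it against Chapter 5 before use.
    """
    action = action.upper()
    if action == "VERIFYGROUP":
        if not group:
            raise ValueError("VERIFYGROUP requires a group name")
        return ["fscmd", action, group, f"/CLUSTER={cluster}"]
    if action == "VERIFYALLGROUPS":
        return ["fscmd", action, f"/CLUSTER={cluster}"]
    raise ValueError(f"unsupported action in this sketch: {action}")


def run_fscmd(args):
    """Run the command and return its exit status and captured output.

    How FSCMD reports success or failure is not described here, so the
    caller should inspect both values rather than rely on the status alone.
    """
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Print the command lines only; nothing is executed here.
    print(build_fscmd_args("VERIFYGROUP", "NTCLU-150", "Sales_Group"))
    print(build_fscmd_args("VERIFYALLGROUPS", "NTCLU-150"))
```

Keeping command construction separate from execution lets a batch job log the exact command line before running it.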
You can watch the progress of the Verify Group operation and view the status of the individual resources in the group as Oracle Fail Safe verifies the group.
Figure 6-3 shows the output from a Verify Group operation.
Figure 6-3 Clusterwide Operation Window for Verify Group
You can validate a standalone database at any time by using the Verify Standalone Database operation. To run the Verify Standalone Database command, select the database from the Oracle Fail Safe Manager tree view, and then from the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Standalone Database.
The Verify Standalone Database operation performs validation checks to ensure that the standalone database is configured correctly on the node where it resides and to remove any references to the database that may exist on other cluster nodes. (References to the database may exist on other cluster nodes if the database was once added to a group and then later removed.) This ensures that the database can be made highly available using Oracle Fail Safe.
Oracle recommends that you use the Verify Standalone Database command on a standalone database before you add it to a group. You can also use it whenever you have trouble accessing a standalone database. However, note that Oracle Fail Safe stops and restarts the database during the verify operation.
For example, you may perform a verification:
If a failure occurs when you try to add a database to a group.
If you used an administrator tool other than Oracle Fail Safe Manager to perform an operation on the database and the database now is inaccessible.
If you removed or deinstalled the MSCS software from the cluster nodes without first removing the Oracle Fail Safe software (for example, during a software upgrade). This is described in more detail in the Oracle Fail Safe Installation Guide.
Figure 6-4 shows the Verify Standalone Database dialog box in which you enter valid database information and account information for a standalone database.
Figure 6-4 Verify Standalone Database Dialog Box
To use the Verify Standalone Database dialog box, you must specify:
The service name of the standalone database, in the Service Name field
The instance name of the standalone database, in the Instance Name field
The database name of the standalone database, in the Database Name field
The parameter file disk, path name, and file name for the initialization parameter file for the standalone database, in the Parameter File field
The account that Oracle Fail Safe must use to attach to the database, in the Account area
Oracle Fail Safe uses this information to:
Fix clusterwide problems with Oracle Net
Check that the standalone database is on a cluster disk
Ensure that Oracle Fail Safe can attach to the database
If a standalone database is open and you run a Verify Standalone Database operation, then the operation does not restart the database.
If a standalone database is not open or if the database is stopped, then Oracle Fail Safe asks your permission to stop and restart the database instance. Subsequently, Oracle Fail Safe opens the database for access.
Figure 6-5 shows the output from a typical Verify Standalone Database operation in a Clusterwide Operation window.
Figure 6-5 Clusterwide Operation Window for Verify Standalone Database
If any problems are found during verification, then the Verify Standalone Database operation prompts you before it attempts to fix them. For example, imagine that you try to add a database to a group, but the operation fails because of an Oracle Net problem. You can run the Verify Standalone Database command to fix the network problem and subsequently add the database to a group.
Oracle Fail Safe provides the Dump Cluster command to display Oracle Fail Safe Manager cluster data in a window. Run this command periodically (and save the output) to maintain a record of changes made to the cluster over time, or run it at the request of customer support to provide a snapshot of the cluster environment.
Data displayed when you run the Dump Cluster command includes:
You can optionally save the Dump Cluster data to a file by clicking Save As.
To run the Dump Cluster command, select the cluster from the Oracle Fail Safe Manager tree view, and then from the Oracle Fail Safe Manager menu bar, select Troubleshooting, and then select Dump Cluster.
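If you save Dump Cluster output periodically, comparing two saved files is one way to see what changed in the cluster between snapshots. The following sketch uses Python's standard difflib module for that comparison. The function name and the sample text are assumptions for illustration only; the sample lines are placeholders and do not reproduce the real Dump Cluster format.

```python
import difflib


def changed_lines(old_dump, new_dump):
    """Return only the added ("+") and removed ("-") lines between two dumps."""
    diff = list(
        difflib.unified_diff(
            old_dump.splitlines(),
            new_dump.splitlines(),
            fromfile="old",
            tofile="new",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two entries are the "---"/"+++" file headers; skip them,
    # then drop the "@@" hunk markers.
    return [line for line in diff[2:] if line[:1] in ("+", "-")]


if __name__ == "__main__":
    # Placeholder text, NOT the real Dump Cluster layout.
    before = "Node: A\nState: online"
    after = "Node: A\nState: offline"
    for line in changed_lines(before, after):
        print(line)
```

In practice the two strings would be read from files saved with Save As on different dates.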
Figure 6-6 shows the portion of the Dump Cluster command output that provides information about the NTCLU-150 cluster and some of its resources.
Figure 6-6 Dump Cluster Clusterwide Operation
Oracle Fail Safe provides the fssvr command qualifier, /GETSECURITY, which displays security information about the system where the command is run. Run fssvr /GETSECURITY on each cluster node to help diagnose FS-1075n errors (where n is a value between 0 and 7, inclusive).
The command and its associated output should be similar to the following:
fssvr /getsecurity

Looking up user account information for OracleMSCSServices.
The user account must be a domain user account with local Administrator privileges.
The user account must also have the 'Log on as batch job' privilege.

User account specified for OracleMSCSServices is NEDCDOMAIN\cluadmin
User account specified has local Administrator privileges
User account has the 'Log on as batch job' privilege

Looking up user account information for Cluster Service.
The user account must be a domain user account with local Administrator privileges.
The user account must also have the 'Log on as batch job' privilege.

User account specified for Cluster Service is NEDCDOMAIN\cluadmin
User account specified has local Administrator privileges
User account has the 'Log on as batch job' privilege

Checking to see if DCOM is enabled. DCOM must be enabled.
DCOM is enabled.
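When you collect this output from several nodes, a small script can summarize it for comparison. The sketch below assumes that the text has already been captured (for example, redirected to a file) and that a successful run contains the confirmation lines shown in the sample above. It does not know what the command prints on failure, so it only reports which confirmations are present. The function name and dictionary keys are made up for this example.

```python
import re


def summarize_getsecurity(output):
    """Summarize captured `fssvr /getsecurity` text.

    ASSUMPTION: success is recognized by the confirmation lines that appear
    in the sample output in this chapter; failure wording is not known.
    """
    # Map each service name to the account reported for it.
    accounts = dict(
        re.findall(r"User account specified for (.+?) is (\S+)", output)
    )
    return {
        "accounts": accounts,
        "admin_confirmations": output.count(
            "User account specified has local Administrator privileges"
        ),
        "batch_confirmations": output.count(
            "User account has the 'Log on as batch job' privilege"
        ),
        # Skip the "Checking to see if DCOM is enabled" banner, which
        # contains the same words as the confirmation line.
        "dcom_enabled": re.search(r"(?<!if )DCOM is enabled", output) is not None,
    }


SAMPLE = r"""
User account specified for OracleMSCSServices is NEDCDOMAIN\cluadmin
User account specified has local Administrator privileges
User account has the 'Log on as batch job' privilege
User account specified for Cluster Service is NEDCDOMAIN\cluadmin
User account specified has local Administrator privileges
User account has the 'Log on as batch job' privilege
Checking to see if DCOM is enabled. DCOM must be enabled.
DCOM is enabled.
"""

if __name__ == "__main__":
    print(summarize_getsecurity(SAMPLE))
```

Comparing the summaries from each node makes it easier to spot a node where an account or privilege differs.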
This chapter describes how to use the Oracle Fail Safe Manager family of troubleshooting tools. Additional information is available as follows:
Information about troubleshooting a specific component is available in Chapters 7 through 9, each of which describes how to configure a particular component for high availability.
Information about troubleshooting network configuration problems is available in Appendix A.
Because Oracle Fail Safe is layered upon Microsoft Cluster Server software, you may need to refer to the MSCS documentation to troubleshoot problems with the cluster service, interconnect, and hardware configuration.
If you are unable to start Oracle Fail Safe, then start the Windows Event Viewer and look at the application log. Oracle Services for MSCS usually logs an event identifying the problem.