6 Troubleshooting Tools

This chapter describes the troubleshooting tools provided with Oracle Fail Safe Manager. The following topics are discussed in this chapter:

  • Verify Operations

  • Dump Cluster

  • Verify Security Parameters

  • Finding Additional Troubleshooting Information

Note that Oracle Fail Safe provides a centralized message facility. When you perform an action that results in an error, the system locates the message associated with the error and displays it. You can find more information about these messages in the Oracle Fail Safe Error Messages manual.

6.1 Verify Operations

Oracle Fail Safe provides a family of verify tools that validate the cluster environment and the status of its nodes, groups, and resources. If a verify operation finds a discrepancy or a problem, then it either prompts you before fixing it or returns an error message that describes it.

Figure 6-1 shows the verify commands in the Troubleshooting menu.

Figure 6-1 Troubleshooting Menu and Verify Commands


Table 6-1 describes the verify commands and provides references for more information.

Table 6-1  Verify Commands for Troubleshooting

Tool | Description | Reference
Verify Cluster | Validates the Oracle Fail Safe installation, the Oracle product installation (including Oracle homes and product version numbers), cluster network configuration, and cluster resource DLL registration. | Section 6.1.1
Verify Group | Validates that the group resources and their dependencies are configured correctly. | Section 6.1.2
Verify Standalone Database | Validates the standalone database instance and removes any old configuration information that may remain on another node. | Section 6.1.3


You can use the verify commands at any time to validate your cluster, group, or standalone database. If problems are found during verification, then Oracle Fail Safe prompts you to fix them or returns an error message that further describes the problem.

If one of the verify commands returns errors, then fix the errors and rerun the command. Repeat this process until the verify operation runs without errors.

6.1.1 Verify Cluster

The Verify Cluster operation validates the installation and network configuration of the cluster. You can perform a cluster verification at any time. From the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Cluster.

The first time you connect to a cluster after installing or upgrading the Oracle Fail Safe software, you are prompted to run Verify Cluster. You must also run it whenever the cluster configuration changes. The Verify Cluster operation verifies that:

  • Each Oracle home name into which Oracle software is installed is the same on all cluster nodes

    If, for example, OFS is the Oracle home name for the Oracle Fail Safe software on one cluster node, then OFS must be the Oracle home name on all nodes in the cluster where Oracle Fail Safe is installed. Similarly, if OfsDb is the Oracle home name for the Oracle database software on one cluster node, then it must be the Oracle home name on all nodes in the cluster where the Oracle database software is installed.

  • The Oracle Services for MSCS release is identical on all nodes

  • The resource providers (components) are configured identically on at least two of the nodes that are possible owners for each resource

  • The Host Name/IP Address mappings resolve consistently across all nodes in the cluster

    If there is a problem with inconsistent mapping, then the Verify Cluster command returns errors indicating that the order of network adapters may be incorrect. See Appendix A for details.

Verify Cluster also registers Oracle resource DLLs with Microsoft Cluster Server (MSCS).
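The symmetry checks listed above all compare one value across nodes. The following is a minimal sketch of that idea, not how Verify Cluster is implemented. The function name, the node names, and the inventory dictionary are hypothetical; the inventory is assumed to have been collected by hand.

```python
from collections import defaultdict


def find_asymmetries(node_values):
    """Return {key: {value: [nodes]}} for every key whose value differs between nodes.

    node_values maps a node name to a dict of {key: value}, for example
    {"Oracle Database": "OfsDb"}. A key is compared only on the nodes that have it.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for node, values in sorted(node_values.items()):
        for key, value in values.items():
            seen[key][value].append(node)
    # Keep only the keys that have more than one distinct value.
    return {
        key: dict(by_value)
        for key, by_value in seen.items()
        if len(by_value) > 1
    }


# Hypothetical inventory of Oracle home names on two cluster nodes.
homes = {
    "NODE1": {"Oracle Fail Safe": "OFS", "Oracle Database": "OfsDb"},
    "NODE2": {"Oracle Fail Safe": "OFS", "Oracle Database": "OraDb"},
}
mismatches = find_asymmetries(homes)
for product, by_value in mismatches.items():
    print(product, by_value)
```

The same function could be applied to host name to IP address mappings gathered on each node, because that check is also a comparison of one value per key across nodes.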

Figure 6-2 shows the output from a typical Verify Cluster operation.

Figure 6-2 Clusterwide Operation Window for Verify Cluster


If the Verify Cluster operation does not complete successfully, then one or more of the following problems may exist:

  • A problem exists in the configuration of the hardware, network, or the MSCS software.

  • A problem exists in the symmetry of the Oracle homes and versions.

  • A problem exists with the Oracle Fail Safe installation (for example, with the symmetry of the resource providers).

If the operation completes successfully, but you are still having problems with Oracle Fail Safe, then the problem lies in the Oracle Fail Safe configuration.

6.1.2 Verify Group

The Verify Group operation does the following to ensure that a group performs correctly:

  • Checks all resources in a group and confirms that they have been configured correctly on all nodes that are possible owners for the group.

  • Updates the dependencies among resources in the group.

  • After prompting you, repairs a group that is misconfigured.

You can run the Verify Group operation at any time. However, you must run it when any of the following occurs:

  • A group or resource in a group does not come online.

  • Failover or failback does not perform as you expect.

  • You add a node to the cluster.

To verify a group, select the group from the Oracle Fail Safe Manager tree view, and then from the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Group.

Alternatively, you can run a Verify Group operation using the FSCMD command VERIFYGROUP (see Chapter 5). FSCMD also provides a VERIFYALLGROUPS command that verifies all groups configured by Oracle Fail Safe on a given cluster. You can run the VERIFYGROUP and VERIFYALLGROUPS commands in scripts as batch jobs.
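As an illustration of running these commands from a script, the following sketch builds an FSCMD command line and runs it. It rests on assumptions that are not confirmed in this chapter: that the executable is invoked as fscmd, that the cluster is selected with a /CLUSTER= qualifier, and that a zero exit status indicates success. Check Chapter 5 for the actual syntax. The group name SalesGroup and the helper function names are hypothetical.

```python
import subprocess


def build_fscmd(action, cluster, group=None):
    """Build the argument list for one FSCMD verify command."""
    if action not in ("VERIFYGROUP", "VERIFYALLGROUPS"):
        raise ValueError(f"unsupported action: {action}")
    if action == "VERIFYGROUP" and not group:
        raise ValueError("VERIFYGROUP needs a group name")
    args = ["fscmd", action]
    if action == "VERIFYGROUP":
        args.append(group)
    # ASSUMPTION: the cluster is selected with a /CLUSTER= qualifier (see Chapter 5).
    args.append(f"/CLUSTER={cluster}")
    return args


def run_fscmd(args):
    """Run FSCMD and return True when it exits with status 0."""
    # ASSUMPTION: a zero exit status means the operation succeeded.
    return subprocess.run(args, check=False).returncode == 0


# Only build and print the command here; call run_fscmd(args) on a cluster node.
print(build_fscmd("VERIFYGROUP", "NTCLU-150", group="SalesGroup"))
```

Keeping the command construction separate from its execution lets a batch job log or review the exact command line before running it.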

You can watch the progress of the Verify Group operation and view the status of the individual resources in the group as Oracle Fail Safe verifies the group.

Figure 6-3 shows the output from a Verify Group operation.

Figure 6-3 Clusterwide Operation Window for Verify Group


6.1.3 Verify Standalone Database

You can validate a standalone database at any time by using the Verify Standalone Database operation. To run the Verify Standalone Database command, select the database from the Oracle Fail Safe Manager tree view, and then from the Oracle Fail Safe Manager menu bar, select Troubleshooting, then select Verify Standalone Database.

The Verify Standalone Database operation performs validation checks to ensure that the standalone database is configured correctly on the node where it resides and to remove any references to the database that may exist on other cluster nodes. (References to the database may exist on other cluster nodes if the database was once added to a group and then later removed.) This ensures that the database can be made highly available using Oracle Fail Safe.

Oracle recommends that you use the Verify Standalone Database command on a standalone database before you add it to a group. You can also use it whenever you have trouble accessing a standalone database. However, note that Oracle Fail Safe may stop and restart the database during the verify operation if the database is not open.

For example, you may perform a verification:

  • If a failure occurs when you try to add a database to a group.

  • If you used an administrator tool other than Oracle Fail Safe Manager to perform an operation on the database and the database now is inaccessible.

  • If you removed or deinstalled the MSCS software from the cluster nodes without first removing the Oracle Fail Safe software (for example, during a software upgrade). This is described in more detail in the Oracle Fail Safe Installation Guide.

Figure 6-4 shows the Verify Standalone Database dialog box in which you enter valid database information and account information for a standalone database.

Figure 6-4 Verify Standalone Database Dialog Box


To use the Verify Standalone Database dialog box, you must specify:

  • The service name of the standalone database, in the Service Name field

  • The instance name of the standalone database, in the Instance Name field

  • The database name of the standalone database, in the Database Name field

  • The parameter file disk, path name, and file name for the initialization parameter file for the standalone database, in the Parameter File field

  • The account that Oracle Fail Safe must use to attach to the database, in the Account area

Oracle Fail Safe uses this information to:

  • Fix clusterwide problems with Oracle Net

  • Check that the standalone database is on a cluster disk

  • Ensure that Oracle Fail Safe can attach to the database

If a standalone database is open and you run a Verify Standalone Database operation, then the operation does not restart the database.

If a standalone database is not open or if the database is stopped, then Oracle Fail Safe asks your permission to stop and restart the database instance. Subsequently, Oracle Fail Safe opens the database for access.

Figure 6-5 shows the output from a typical Verify Standalone Database operation in a Clusterwide Operation window.

Figure 6-5 Clusterwide Operation Window for Verify Standalone Database


If any problems are found during verification, then the Verify Standalone Database operation prompts you before it attempts to fix them. For example, imagine that you try to add a database to a group, but the operation fails because of an Oracle Net problem. You can run the Verify Standalone Database command to fix the network problem and subsequently add the database to a group.

6.2 Dump Cluster

Oracle Fail Safe provides the Dump Cluster command to display Oracle Fail Safe Manager cluster data in a window. Run this command periodically (and save the output) to maintain a record of changes made to the cluster over time, or run it at the request of customer support to provide a snapshot of the cluster environment.

Data displayed when you run the Dump Cluster command includes:

  • Information related to the operating system (including the location of the quorum disk)

  • Public and private network information

  • Resources registered with the cluster

  • Group failover and failback policies

You can optionally save the Dump Cluster data to a file by clicking Save As.

To run the Dump Cluster command, select the cluster from the Oracle Fail Safe Manager tree view, and then from the Oracle Fail Safe Manager menu bar, select Troubleshooting, and then select Dump Cluster.
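If you save the Dump Cluster output periodically, comparing two saved files shows what changed in the cluster between them. The following is a minimal sketch using Python's standard difflib module. The excerpts are hypothetical and do not reflect the real layout of Dump Cluster output; in practice you would read the text from the files written with Save As.

```python
import difflib


def diff_dumps(old_text, new_text, old_name="previous", new_name="current"):
    """Return unified-diff lines describing what changed between two saved dumps."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )


# Hypothetical excerpts; real Dump Cluster output is laid out differently.
previous = "Cluster: NTCLU-150\nQuorum disk: Q:\nGroup: Sales\n"
current = "Cluster: NTCLU-150\nQuorum disk: R:\nGroup: Sales\n"
changes = diff_dumps(previous, current)
print("\n".join(changes))
```

Lines prefixed with "-" appear only in the older dump and lines prefixed with "+" appear only in the newer one. An empty result means the two dumps are identical.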

Figure 6-6 shows the portion of the Dump Cluster command output that provides information about the NTCLU-150 cluster and some of its resources.

Figure 6-6 Dump Cluster Clusterwide Operation


6.3 Verify Security Parameters

Oracle Fail Safe provides the /GETSECURITY qualifier to the fssvr command, which displays security information about the system where the command is run. Run fssvr /GETSECURITY on each cluster node to help diagnose FS-1075n errors (where n is a value from 0 through 7).

The command and its output should be similar to the following:

fssvr /getsecurity

Looking up user account information for OracleMSCSServices.
The user account must be a domain user account with local Administrator
privileges.  The user account must also have the 'Log on as batch job'
privilege.

    User account specified for OracleMSCSServices is NEDCDOMAIN\cluadmin 
    User account specified has local Administrator privileges 
    User account has the 'Log on as batch job' privilege 

Looking up user account information for Cluster Service. The user account 
must be a domain user account with local Administrator privileges. The user
account must also have the 'Log on as batch job' privilege.

    User account specified for Cluster Service is NEDCDOMAIN\cluadmin
    User account specified has local Administrator privileges 
    User account has the 'Log on as batch job' privilege 

Checking to see if DCOM is enabled.  DCOM must be enabled.
    DCOM is enabled.
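Because the command must be run on every node, it can help to check the output mechanically. The following sketch looks for the confirmation lines shown in the sample output above. It assumes that other releases word these lines identically, which may not be true, and the function names are hypothetical.

```python
import subprocess

# Phrases copied from the sample output above; other releases may word them differently.
EXPECTED = {
    "local Administrator privileges":
        "User account specified has local Administrator privileges",
    "'Log on as batch job' privilege":
        "User account has the 'Log on as batch job' privilege",
}


def check_security_output(text):
    """Return a list of problems found in fssvr /getsecurity output."""
    lines = [line.strip() for line in text.splitlines()]
    problems = []
    for label, phrase in EXPECTED.items():
        # The sample shows each confirmation twice: once per service account.
        count = sum(1 for line in lines if line == phrase)
        if count < 2:
            problems.append(f"{label}: confirmed {count} of 2 times")
    if "DCOM is enabled." not in lines:
        problems.append("DCOM is not reported as enabled")
    return problems


def get_security_output():
    """Run the command on the local node and return its text output."""
    result = subprocess.run(
        ["fssvr", "/getsecurity"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Confirmation lines from the sample output, used here instead of a live run.
sample = """\
    User account specified has local Administrator privileges
    User account has the 'Log on as batch job' privilege
    User account specified has local Administrator privileges
    User account has the 'Log on as batch job' privilege
    DCOM is enabled.
"""
print(check_security_output(sample))  # prints []
```

On a cluster node, pass the result of get_security_output() to check_security_output() instead of the sample text. A non-empty list points to the requirement to investigate first.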

6.4 Finding Additional Troubleshooting Information

This chapter describes how to use the Oracle Fail Safe Manager family of troubleshooting tools. Additional information is available as follows:

  • Information about troubleshooting a specific component can be found in Chapters 7 through 9, each of which describes how to configure a particular component for high availability.

  • Information about troubleshooting network configuration problems can be found in Appendix A.

  • Because Oracle Fail Safe is layered upon Microsoft Cluster Server software, you may need to refer to the MSCS documentation to troubleshoot problems with the cluster service, interconnect, and hardware configuration.

  • If you are unable to start Oracle Fail Safe, then start the Windows Event Viewer and look at the application log. Oracle Services for MSCS usually logs an event identifying the problem.