7 Troubleshooting Oracle Site Guard

Table 7-1 describes common situations that you might encounter when deploying and managing Oracle Site Guard in disaster recovery topologies, and explains the steps for addressing them.

Table 7-1 Troubleshooting

Scenario Description and Solution

Pre-check Operation

If the pre-check operation fails and displays the following error:

Nmo setuid status NMO not setuid-root (Unix-only)

After installing the Oracle Management Agent, ensure that you run the root.sh script, as described in the section "After You Install" in the Oracle Enterprise Manager Cloud Control Basic Installation Guide.

If the Oracle Management Agent is down, the pre-check operation hangs while trying to execute commands on the remote host.

Ensure that all hosts involved in an operation are active and that all configured scripts are available on the remote hosts in the configured locations. If the Oracle Management Agent cannot be reached, check the log files from the Enterprise Manager Cloud Control console. If you know which hosts are down, you can skip the pre-check operation on those hosts.
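A quick way to see which hosts respond before running a pre-check is to attempt a TCP connection to each host's Management Agent port. The following is a minimal sketch, not part of Oracle Site Guard: the function name, the host names, and the port value are assumptions (3872 is often the agent's default port, but verify the port for your installation). A successful connection shows only that something is listening on the port, not that the agent is healthy.

```python
import socket

def reachable_hosts(hosts, port, timeout=3.0):
    """Return {host: bool} showing whether a TCP connection to host:port succeeds."""
    status = {}
    for host in hosts:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure
            with socket.create_connection((host, port), timeout=timeout):
                status[host] = True
        except OSError:
            status[host] = False
    return status

if __name__ == "__main__":
    # Hypothetical host names; replace with the hosts in your topology.
    for host, up in reachable_hosts(["host1.example.com", "host2.example.com"], 3872).items():
        print(host, "reachable" if up else "NOT reachable")
```

Hosts reported as not reachable are candidates for investigation or for being skipped in the pre-check operation.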

Oracle WebLogic Server

Node Manager may fail due to the following error:

<Sep 13, 2011 8:45:37 PM PDT> <Error> <NodeManager> <BEA-300033> <Could not execute command "getVersion" on the node manager. Reason: "Access to domain 'base_domain' for user 'weblogic' denied".>

This problem may occur if you changed the Node Manager credentials but did not run nmEnroll to ensure that the correct Node Manager user name and password token are supplied to each Managed Server.

Run nmEnroll from WLST, while connected to the Administration Server, using the following syntax:

nmEnroll([domainDir], [nmHome])

For example:

nmEnroll('C:/oracle/user_projects/domains/prod_domain',
'C:/oracle/wlserver_10.3/common/nodemanager')

Note: You must restart Node Manager in order for the changes to take effect.

Managed Server fails to start due to Oracle WebLogic Server Administration Server connection failure in Enterprise Manager Cloud Control.

Oracle Site Guard requires the Administration Server and the Node Manager to start a Managed Server. Ensure that the Administration Server is up and running to start and stop Managed Servers successfully.

Operation Plan

Targets that are part of the system, such as Oracle Database or Oracle Fusion Middleware farm, may not be discovered in the operation plan workflow.

This problem may occur if you added targets to the system after creating the operation plan. Oracle Site Guard includes only those targets that were part of the system when the operation plan was created. If you have added new targets, you must re-create the operation plan.

The Oracle WebLogic Server Managed Server target, which is part of the Oracle WebLogic Server Domain, is not updated or identified by Oracle Site Guard when creating the operation plan workflow.

Ensure that the Managed Servers are up and running before performing Automatic discovery in Enterprise Manager Cloud Control.

OPMN managed system components that are part of the system may not be discovered in the operation plan workflow.

Oracle Site Guard discovers only those OPMN managed system components that are represented in Enterprise Manager Cloud Control. For example, OPMN managed system components such as Oracle HTTP Server and Oracle Web Cache are represented in Enterprise Manager Cloud Control. These components are discovered as part of the Oracle Fusion Middleware farm.

An Oracle RAC Database that is part of the system may not be discovered in the operation plan workflow.

Oracle RAC Database instances are grouped and represented under the RAC database target in Enterprise Manager Cloud Control. When RAC database instances are discovered, the RAC database target is created and all the database instances in the RAC deployment are grouped under it. This issue may occur if individual RAC instance targets are added to the system instead of the RAC database target. Oracle Site Guard cannot identify individual RAC instances.

Switchover or Failover Operations

The Administration Server may fail to start after a switchover or failover operation. The Administration Server output log file reports the following error:

<Jan 19, 2012 3:43:05 AM PST> <Warning> <EmbeddedLDAP> <BEA-171520> <Could not obtain an exclusive lock for directory: ORACLE_BASE/admin/soadomain/aserver/soadomain/servers/AdminServer/data/ldap/ldapfiles. Waiting for 10 seconds and then retrying in case existing WebLogic Server is still shutting down.>

The error appears in the Administration Server log file because of an unsuccessful lock cleanup. To fix this error, delete the EmbeddedLDAP.lok file, which is located in ORACLE_BASE/admin/<domain_name>/aserver/<domain_name>/servers/AdminServer/data/ldap/ldapfiles/.
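The lock file cleanup can be scripted. The following is a minimal sketch that builds the lock file path from the directory layout shown above and removes the file only when asked to. The function name and the dry_run flag are illustrative assumptions, not part of any Oracle tool, and the sketch assumes that the Administration Server is not running when the file is removed.

```python
from pathlib import Path

def remove_stale_ldap_lock(oracle_base, domain_name, dry_run=True):
    """Locate EmbeddedLDAP.lok for the domain; delete it unless dry_run is True.

    Returns the lock file path if it was found, otherwise None.
    """
    lock = (Path(oracle_base) / "admin" / domain_name / "aserver" / domain_name
            / "servers" / "AdminServer" / "data" / "ldap" / "ldapfiles"
            / "EmbeddedLDAP.lok")
    if not lock.is_file():
        return None
    if not dry_run:
        lock.unlink()
    return lock
```

Calling the function first with dry_run=True shows which file would be deleted before anything is changed.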

The Administration Server may fail to start after a switchover or failover operation. The Administration Server output log file reports the following error:

<Sep 16, 2011 2:04:06 PM PDT> <Error> <Store> <BEA-280061> <The persistent store "_WLS_AdminServer" could not be deployed: weblogic.store.PersistentStoreException:

[Store:280105]The persistent file store "_WLS_AdminServer" cannot open file _WLS_ADMINSERVER000000.DAT.>

This error may appear because of locks held on Network File System (NFS) storage. You must clear the NFS locks using the storage vendor's NFS utility. Alternatively, you can clear the locks by copying the .DAT file to a temporary location and then copying it back.
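The copy-out and copy-back workaround might look like the following sketch. The function name and the backup_dir parameter are assumptions made for illustration. Removing the original file before copying it back is also an assumption of this sketch, made so that the file is recreated and not overwritten in place. Run it only while the server that owns the store is stopped.

```python
import shutil
from pathlib import Path

def recreate_store_file(dat_file, backup_dir):
    """Copy a persistent store .DAT file to backup_dir, then copy it back.

    The backup copy is left in backup_dir so the caller can verify it
    and remove it later. Returns the path of the backup copy.
    """
    dat_file = Path(dat_file)
    backup = Path(backup_dir) / dat_file.name
    shutil.copy2(dat_file, backup)
    # Do not touch the original unless the backup looks complete.
    if backup.stat().st_size != dat_file.stat().st_size:
        raise OSError("backup copy is incomplete; original left untouched")
    dat_file.unlink()                # assumption: recreate the file, do not overwrite it
    shutil.copy2(backup, dat_file)
    return backup
```

Keeping the backup copy until the Administration Server starts cleanly gives a way back if the copy-back step is interrupted.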

Some hosts on the new primary system may be unavailable or down while a switchover or failover operation is performed. Oracle Site Guard cannot perform any operation on these hosts.

If the services running on those hosts are not mandatory, and the site can still be functional and active with the services running on the other nodes, you can disable the steps that pertain to the hosts that are down by updating the operation plan. The Oracle Site Guard workflow skips all disabled steps.

If the Oracle RAC Database Instance is down, then the switchover or failover operation fails.

While creating the operation plan, Oracle Site Guard determines the Oracle RAC Database instance on which the switchover or failover operation will be performed. A RAC deployment can have multiple instances, and some of them may be down. Before running the switchover or failover operation, you must ensure that this instance is up and running. You can identify the RAC instance name by running the get_operation_plan_details command.
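As an illustration, the command can be invoked from a script through the Enterprise Manager command-line interface. This sketch assumes that emcli is on the PATH, that a session has already been established with emcli login, and that the plan is selected with the -name option. The plan name and both function names are hypothetical. The output is returned as raw text because its layout is not described here.

```python
import subprocess

def plan_details_command(plan_name):
    """Build the argument list for the get_operation_plan_details command."""
    return ["emcli", "get_operation_plan_details", f"-name={plan_name}"]

def show_plan_details(plan_name):
    """Run the command and return its raw text output (requires emcli)."""
    result = subprocess.run(plan_details_command(plan_name),
                            capture_output=True, text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    # Hypothetical plan name; inspect the output for the RAC instance name.
    print(show_plan_details("switchover_to_standby"))
```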

Database Operations

If the pre-check operation or the database switchover or failover operation fails and displays the following error:

Database Status:
DGM-17016: failed to retrieve status for database "racs"
ORA-16713: the Data Guard broker command timed out

This error may occur because the Oracle Data Guard broker monitor process (DMON) is down in the target database instance. You must restart the database instance and ensure that the DMON process is up and running. You can also check the database log file for DMON process errors. You can use the CommunicationTimeout parameter to select an appropriate timeout value for your environment. For more information, see "CommunicationTimeout" in Oracle Data Guard Broker.
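When a log file contains many messages, it can help to scan it for the error signatures listed in Table 7-1. The following is a minimal sketch of such a lookup. The dictionary, the function name, and the shortened hints are illustrative assumptions that summarize the solutions above. They do not replace the full descriptions in the table.

```python
# Error signatures from Table 7-1 mapped to a short reminder of the solution.
KNOWN_ERRORS = {
    "NMO not setuid-root": "Run root.sh after installing the Oracle Management Agent.",
    "BEA-300033": "Run nmEnroll, then restart Node Manager.",
    "BEA-171520": "Delete the stale EmbeddedLDAP.lok file.",
    "BEA-280061": "Clear the NFS locks on the persistent store .DAT file.",
    "ORA-16713": "Check that the DMON process is running; review CommunicationTimeout.",
}

def triage(log_text):
    """Return (signature, hint) pairs for every known signature found in log_text."""
    return [(sig, hint) for sig, hint in KNOWN_ERRORS.items() if sig in log_text]

if __name__ == "__main__":
    sample = 'DGM-17016: failed to retrieve status for database "racs"\n' \
             "ORA-16713: the Data Guard broker command timed out"
    for signature, hint in triage(sample):
        print(f"{signature}: {hint}")
```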