5 Troubleshooting Oracle Site Guard

This chapter describes common situations that you might encounter when deploying and managing Oracle Site Guard in disaster-recovery topologies. It also includes the steps for addressing them.

This chapter contains the following sections:

Section 5.1, "Operation Plan Failure"
Section 5.2, "Switchover or Failover Operations Failure"
Section 5.3, "Precheck Failure"
Section 5.4, "Oracle WebLogic Server Failure"
Section 5.5, "Database Failure"
Section 5.6, "Storage Failures"

5.1 Operation Plan Failure

This section provides tips for troubleshooting the following operation-plan failure issues:

Targets Not Discovered in Operation Plan Workflow
Oracle WebLogic Server Managed-Server Target Not Identified
Manual Intervention Needed for Hung Operation Step
OPMN Managed System Components Not Discovered In Operation-Plan Workflow
Oracle RAC Database Not Discovered in Operation-Plan Workflow
Failure of Operation Step When Accessed with Sudo Privileges
Error While Creating Operation Plan Indicating Credential Association Not Configured
Inability to Associate Credentials for Targets Added to a Site
Error While Deleting Or Updating Operation Plans
Error Indicating Inability to Create Scalar Value While Creating Operation Plan
Error While Creating Operation Plan Indicating Missing Node Manager Credentials

5.1.1 Targets Not Discovered in Operation Plan Workflow

Issue

Targets like Oracle Database or Oracle Fusion Middleware farm, which are part of the system, might not be discovered in the operation plan workflow.

Description and Solution

This problem may occur if you have added targets to the system after creating the operation plan. Oracle Site Guard only includes those targets that are part of the system during the creation of the operation plan. If you have added new targets, re-create the operation plan.
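The comparison behind this check can be sketched as a simple set difference. The following minimal Python sketch assumes you have recorded the target names that were in the system when the plan was created, and that you can list the current system members. The function name and the sample target names are hypothetical and are not part of any Oracle Site Guard API.

```python
def targets_missing_from_plan(plan_targets, system_targets):
    """Return system targets that the operation plan does not include, sorted by name."""
    return sorted(set(system_targets) - set(plan_targets))


# Hypothetical target names, for illustration only.
plan_targets = ["db1", "soa_farm"]
system_targets = ["db1", "soa_farm", "db2"]

missing = targets_missing_from_plan(plan_targets, system_targets)
if missing:
    print("Re-create the operation plan to include:", ", ".join(missing))
```

If the result is not empty, the plan predates those targets and must be re-created.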

5.1.2 Oracle WebLogic Server Managed-Server Target Not Identified

Issue

The Oracle WebLogic Server managed-server target, which is part of the Oracle WebLogic Server domain, is not updated or identified by Oracle Site Guard when creating the operation plan workflow.

Description and Solution

Ensure that the managed servers are running before you perform an automatic discovery in Enterprise Manager Cloud Control.

5.1.3 Manual Intervention Needed for Hung Operation Step

Issue

When an operation step (for example, database switchover or failover, custom scripts, and so on) hangs, manual intervention is needed.

Description and Solution

Suspend the operation from the Enterprise Manager Cloud Control console. Do not stop the operation.

Manually correct the condition that caused the operation plan to hang. After completing the manual procedures, resume the operation to complete the Oracle Site Guard operation. Do not re-submit the operation.

If Oracle Site Guard determines that the components are already in the desired state, it performs a 'no operation' for the corresponding start, stop, or database switchover steps. The process then ends normally, and the sites are updated with the required roles. If an operation step fails and manual intervention is needed to resolve the issue, you can either retry the failed step or confirm the manual step, and then proceed with the execution of the operation.

Note:

Restart or resume the operation after every manual intervention. Ensure that you complete the operations that you have started.

5.1.4 OPMN Managed System Components Not Discovered In Operation-Plan Workflow

Issue

OPMN Managed System Components, which are part of the system, might not be discovered in the operation-plan workflow.

Description and Solution

Oracle Site Guard discovers only those OPMN managed system components that are represented in Enterprise Manager Cloud Control. For example, OPMN managed system components like Oracle HTTP Server and Oracle Web Cache are represented in Enterprise Manager Cloud Control. These components are discovered as part of the Oracle Fusion Middleware farm.

5.1.5 Oracle RAC Database Not Discovered in Operation-Plan Workflow

Issue

Oracle RAC Database, which is part of the system, may not be discovered in the operation plan workflow.

Description and Solution

Oracle RAC databases are grouped and represented under the RAC database target in Enterprise Manager Cloud Control. When RAC database instances are discovered, the RAC database target is created, and all the database instances in the RAC deployment are grouped below it. This issue may occur if individual RAC instance targets are added to the system instead of the RAC database target. Oracle Site Guard cannot identify individual RAC instances, so add the RAC database target to the system instead.

5.1.6 Failure of Operation Step When Accessed with Sudo Privileges

Issue

A Site Guard operation step fails with the error stageOmsFileEntry (Error) when credentials with sudo privileges are used. You might encounter this issue during the Precheck operation as well.

Description and Solution

When the credentials used by Site Guard are configured to use sudo privileges to run as root, the sudo privilege must be configured as PDP (Privilege Delegation Provider) on all the agents running on the respective hosts of the target.

PDP can be configured from the Enterprise Manager Cloud Control console. To configure PDP, go to Setup > Security > Privilege Delegation.

5.1.7 Error While Creating Operation Plan Indicating Credential Association Not Configured

Issue

While creating an operation plan, you might encounter an error indicating that a target in the site does not have any credentials associated with it, despite having created and associated credentials for that target.

Description and Solution

This issue occurs when there are two targets with identical names in Enterprise Manager, and one of the targets is part of the site. For example, the issue occurs if a database instance target and a database system target are both named db1, and the database instance target is added to your site.

Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.
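To find name collisions before you rediscover targets, you can scan a list of target names and types. The following is a minimal Python sketch that assumes you have already exported the targets as (name, type) pairs. The function name, the input format, and the sample data are hypothetical and do not come from Enterprise Manager.

```python
from collections import defaultdict


def find_duplicate_target_names(targets):
    """Map each target name used by more than one target to its target types.

    `targets` is an iterable of (name, target_type) pairs.
    """
    types_by_name = defaultdict(list)
    for name, target_type in targets:
        types_by_name[name].append(target_type)
    return {name: types for name, types in types_by_name.items() if len(types) > 1}


# Hypothetical inventory mirroring the db1 example above.
targets = [
    ("db1", "database instance"),
    ("db1", "database system"),
    ("soa_farm", "Fusion Middleware farm"),
]
print(find_duplicate_target_names(targets))
```

Any name that appears in the result is shared by more than one target and is a candidate for deletion and rediscovery under a unique name.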

5.1.8 Inability to Associate Credentials for Targets Added to a Site

Issue

While configuring credentials for Oracle Site Guard, you might face issues when you attempt to associate credentials for a target. This occurs because the credential configuration for that target type is not enabled, or because the target does not show up in the list of targets for a specific target type. This error is seen despite adding the target to the site.

Description and Solution

Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.

5.1.9 Error While Deleting Or Updating Operation Plans

Issue

While deleting or updating an operation plan, you might encounter the following error:

Error:User does not have FULL_JOB privileges on execution with guid XXXXXXXXXXXXXXXX

Description and Solution

This might occur when a user does not have the necessary privileges to delete or update the operation plan.

5.1.10 Error Indicating Inability to Create Scalar Value While Creating Operation Plan

Issue

While creating an operation plan, you might encounter an error such as the following:

oracle.sysman.ai.siteguard.model.exception.ConfigurationException: Cannot create scalar value for name [PropertyType = DB_VERSION]. Value argument to the method getScalarValue() is null

Description and Solution

Oracle Site Guard reads and uses the DB_VERSION property maintained by Enterprise Manager for database targets protected by Oracle Data Guard. The DB_VERSION property for the database can display as NULL in Enterprise Manager if a Data Guard switchover or failover occurred outside of Enterprise Manager (for example, if a Data Guard switchover was performed using DGMGRL or using Site Guard).

To correct this issue, use the Enterprise Manager Cloud Control console to log in to the Data Guard Administration page of the database target, and reset the DataGuardStatus property from NULL to true. When you reset the DataGuardStatus property, the other Data Guard related properties are automatically refreshed.

5.1.11 Error While Creating Operation Plan Indicating Missing Node Manager Credentials

Issue

While creating an operation plan, you might encounter an error such as the following:

Credential association for credential type NODEMANAGER is missing for target host_name belonging to system site_name.

Description and Solution

In Enterprise Manager, the Node Manager of a host is not a target type, and therefore, Enterprise Manager does not directly interact with it. Oracle Site Guard, on the other hand, interacts with the Node Managers of hosts for managing disaster recovery operations of Oracle Fusion Middleware components. For this reason, Node Manager credentials must be configured and associated while configuring Oracle Site Guard. Since Enterprise Manager does not recognize Node Manager as a target type, you must create host credentials to be used with the node managers running on host targets, and associate these credentials with Oracle Site Guard using the Oracle Site Guard Credential Configuration page.

5.2 Switchover or Failover Operations Failure

This section provides tips for troubleshooting the following issues that you may encounter during switchover or failover operations:

WebLogic Administration Server Does Not Start After Performing Switchover or Failover Operation
WebLogic Administration Server Fails to Restart After Performing Switchover or Failover Operations
Host Not Available During Switchover or Failover Operations
Switchover or Failover Operations Fail When Oracle RAC Database Instances Are Not Available

5.2.1 WebLogic Administration Server Does Not Start After Performing Switchover or Failover Operation

Issue

The WebLogic Administration Server might not start after a switchover or failover operation. The output log file of the Administration Server reports an error such as the following:

<Jan 19, 2012 3:43:05 AM PST> <Warning> <EmbeddedLDAP> <BEA-171520> <Could not obtain an exclusive lock for directory: ORACLE_BASE/admin/soadomain/aserver/soadomain/servers/AdminServer/data/ldap/ldapfiles. Waiting for 10 seconds and then retrying in case existing WebLogic Server is still shutting down.>

Description and Solution

The error appears in the Administration Server log file because a lock was not cleaned up successfully. To fix this error, delete the EmbeddedLDAP.lock file, which is located in ORACLE_BASE/admin/domain_name/aserver/domain_name/servers/AdminServer/data/ldap/ldapfiles/.
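The cleanup can be scripted. The following minimal Python sketch builds the lock file path from a domain home directory and deletes the file if it exists. The function name and its parameters are hypothetical, and the sketch assumes that the Administration Server is stopped and that no other WebLogic Server process is using the directory.

```python
from pathlib import Path


def remove_embedded_ldap_lock(domain_home, server_name="AdminServer"):
    """Delete EmbeddedLDAP.lock for the given server; return True if a file was removed.

    Run this only while the Administration Server is stopped.
    """
    lock_file = (
        Path(domain_home) / "servers" / server_name
        / "data" / "ldap" / "ldapfiles" / "EmbeddedLDAP.lock"
    )
    if lock_file.is_file():
        lock_file.unlink()
        return True
    return False
```

For the path shown above, domain_home would be ORACLE_BASE/admin/domain_name/aserver/domain_name with the placeholders replaced by the values for your environment.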

5.2.2 WebLogic Administration Server Fails to Restart After Performing Switchover or Failover Operations

Issue

The WebLogic Administration Server might not start after a switchover or failover operation. The Administration Server output log file reports the following error:

<Sep 16, 2011 2:04:06 PM PDT> <Error> <Store> <BEA-280061> <The persistent store "_WLS_AdminServer" could not be deployed: weblogic.store.PersistentStoreException:

[Store:280105]The persistent file store "_WLS_AdminServer" cannot open file _WLS_ADMINSERVER000000.DAT.>

Description and Solution

This error might appear due to the locks from Network File System (NFS) storage. You must clear the NFS locks using the NFS utility of the storage vendor. You may also copy the .DAT file to a temporary location, and copy it back, to clear the locks.
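The copy-out and copy-back workaround can be sketched as follows. This minimal Python example is an illustration under stated assumptions, not a vendor-supported procedure: the function name is hypothetical, the Administration Server is assumed to be stopped, and removing the original before copying back (so that the restored file is a newly created file) is an assumption about why the workaround helps. Whether the locks are actually released depends on the NFS server.

```python
import shutil
import tempfile
from pathlib import Path


def recreate_store_file(dat_file):
    """Copy a store file to a temporary directory, then copy it back as a new file.

    Returns the path of the temporary copy so that it can be kept as a backup
    until the server has started successfully.
    """
    dat_file = Path(dat_file)
    backup_dir = Path(tempfile.mkdtemp(prefix="wls_store_"))
    backup = backup_dir / dat_file.name
    shutil.copy2(dat_file, backup)   # copy out, preserving timestamps and mode
    dat_file.unlink()                # assumption: remove the original so the copy back is a new file
    shutil.copy2(backup, dat_file)   # copy back under the original name
    return backup
```

The temporary copy is deliberately left in place. Delete it only after the Administration Server has restarted and opened the persistent store.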

5.2.3 Host Not Available During Switchover or Failover Operations

Issue

Some hosts on the new primary system might not be available, or might be down, while a switchover or failover operation is performed. In such situations, Oracle Site Guard cannot perform any operation on these hosts.

Description and Solution

If the services running on these hosts are not mandatory, and the site can still be functional and active with the services running on the other nodes, you can disable the steps that pertain to the unavailable hosts by updating the operation plan. Oracle Site Guard skips all disabled steps when it runs the workflow.

5.2.4 Switchover or Failover Operations Fail When Oracle RAC Database Instances Are Not Available

Issue

If all the Oracle RAC Database instances are down, the switchover or failover operation fails.

Description and Solution

While creating an operation plan, Oracle Site Guard determines the Oracle RAC Database instance on which the switchover or failover operation is performed. RAC deployment can have multiple instances, and it is possible that some of the instances are down. Before running the switchover or failover operation, ensure that at least one instance is running. You can identify the name of the RAC instance, which is used by Oracle Site Guard to perform the role reversal operation, by running the get_operation_plan_details command.

5.3 Precheck Failure

This section provides tips for troubleshooting the following Precheck failures:

Failure of Prechecks
Prechecks Hang When Oracle Management Agent Is Not Available

5.3.1 Failure of Prechecks

Issue

Prechecks fail, displaying the following error:

Nmo setuid status NMO not setuid-root (Unix-only)

Description and Solution

After installing the Oracle Management Agent, ensure that you run the root.sh script on the Enterprise Manager Cloud Control host and on all hosts managed by Enterprise Manager, as described in the section "After You Install" in the Oracle Enterprise Manager Cloud Control Basic Installation Guide.

5.3.2 Prechecks Hang When Oracle Management Agent Is Not Available

Issue

If the Oracle Management Agent is down, Prechecks hang while trying to run commands on the remote host.

Description and Solution

Ensure that all hosts involved in an operation are active, and all the configured scripts are available on remote hosts in the configured locations. If the Oracle Management Agent cannot be reached for some reason, then check the log files from the Enterprise Manager Cloud Control console. If you have identified the hosts that are down, skip the Precheck operation on those hosts.

5.4 Oracle WebLogic Server Failure

This section provides troubleshooting tips for the following Oracle WebLogic Server failure issues:

Node Manager Fails to Restart
Managed Server Fails to Start
Oracle Site Guard Does Not Include Oracle WebLogic Server Instances That Are Migrated to a Different Host
Error Displayed While Creating Operation Plan
WebLogic Administration Server Able to Communicate With Node Manager When Site Guard Cannot
Unable to Associate More Than One Node Manager Per Host

5.4.1 Node Manager Fails to Restart

Issue

Node Manager might fail to start due to an error, like the following:

<Sep 13, 2011 8:45:37 PM PDT> <Error> <NodeManager> <BEA-300033> <Could not execute command "getVersion" on the node manager. Reason: "Access to domain 'base_domain' for user 'weblogic' denied".>

Description and Solution

This problem might occur if you have changed the Node Manager credentials and have not then run nmEnroll to ensure that the correct Node Manager user name and password are supplied to each managed server.

To ensure that the correct Node Manager user name and password have been supplied, connect to WLST (using wlst.sh) and execute the nmEnroll command using the following syntax:

nmEnroll(domain_directory, node_manager_home)

For example:

nmEnroll('C:/oracle/user_projects/domains/prod_domain',
'C:/oracle/wlserver_10.3/common/nodemanager')

Note:

Restart Node Manager for the changes to take effect.

5.4.2 Managed Server Fails to Start

Issue

The managed server does not start due to a connection failure of the WLS Administration Server in Enterprise Manager Cloud Control.

Description and Solution

To start the managed server, Oracle Site Guard requires the Administration Server and the Node Manager. To start and stop managed servers successfully, ensure that the Administration Server is running.

5.4.3 Oracle Site Guard Does Not Include Oracle WebLogic Server Instances That Are Migrated to a Different Host

Issue

Oracle Site Guard does not include the WebLogic Server instances that are migrated to a different host in the workflow.

Description and Solution

After you create the operation plan, Oracle Site Guard does not include the WebLogic Server instances involved in the operation plan that are migrated to different hosts, as a result of server migration.

After you complete the server migration, refresh the WebLogic Server farm target from the Enterprise Manager Cloud Control console to pick up the latest target changes in the farm. This step is mandatory for Enterprise Manager to resume its farm monitoring capabilities after any change in the farm, such as a server migration. After the farm target is refreshed, re-create the Oracle Site Guard operation plans to include all of the farm targets in the Oracle Site Guard workflow.

5.4.4 Error Displayed While Creating Operation Plan

Issue

While creating an operation plan, you might see an error, like the following:

oracle.sysman.ai.siteguard.model.common.exception.DAOException:
For hostName:
[2606:b400:800:89:214:4fff:fe46:2d52] credential of type HOSTNORMAL does notexist for  siteName: System1

Description and Solution

If you do not configure the listen address for the WebLogic Server instances running on hosts where multiple IP addresses are configured, WebLogic Server randomly picks an IP address and reports it as the listen address. This IP address might not be a valid one, which can cause problems when creating operation plans. To fix the issue, use the Administration Console to configure WebLogic Server with a resolvable listen address. After configuring Oracle WebLogic Server, restart the server, and rediscover it from Enterprise Manager Cloud Control. For more information about listen address configuration, refer to the Oracle Fusion Middleware Disaster Recovery Guide.
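Before you rediscover the server, you can confirm that the configured listen address resolves. The following minimal Python sketch uses the standard socket module. The function name is hypothetical, and the result reflects name resolution only on the machine where the check runs, so run it on a host that needs to reach the listen address.

```python
import socket


def is_resolvable(address):
    """Return True if the host name or IP address can be resolved on this machine."""
    try:
        socket.getaddrinfo(address, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: the resolver could not resolve the name.
        # UnicodeError: the name is not a valid host name (for example, a label is too long).
        return False
```

A successful lookup shows only that the address resolves. It does not confirm that WebLogic Server is listening on that address.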

5.4.5 WebLogic Administration Server Able to Communicate With Node Manager When Site Guard Cannot

Issue

Oracle Site Guard is unable to access the Node Manager even though the WebLogic Administration Server is able to log in to the Node Manager.

Description and Solution

This issue occurs when the user name used to authenticate with Node Manager is randomly generated by the WebLogic Administration Server.

To correct this, complete the following steps:

Log in to the WebLogic Administration Server console.
Click Domain listed in the left-hand pane.
Click the Security tab, and then click the Advanced link.

The Node Manager user name is displayed. The user name might appear to be a randomly generated string.
Update the Node Manager log-in credentials with the correct information.

5.4.6 Unable to Associate More Than One Node Manager Per Host

Issue

Oracle Site Guard is unable to associate credentials for more than one Node Manager running on the same host.

Description

This is a limitation in the current version of Oracle Site Guard. The current version can only support one set of credentials for all the Node Managers running on a host. Ensure that all the Node Managers on a given host have been configured with an identical set of credentials.

5.5 Database Failure

This section provides tips for troubleshooting the following issues related to database operation failure:

Prechecks for Database Switchover and Database Failover Operations Fail
Databases Protected by Data Guard Included in the Incorrect Operation-Plan Category

5.5.1 Prechecks for Database Switchover and Database Failover Operations Fail

Issue

The Prechecks for database switchover or database failover operations fail, and display the following error:

Database Status:
DGM-17016: failed to retrieve status for database "racs"
ORA-16713: the Data Guard broker command timed out

Description and Solution

This error might occur if the Data Guard Monitor process (DMON) in the target database instance is down.

Note:

The Data Guard Monitor process (DMON) is part of the Oracle Data Guard Broker.

If this error occurs, restart the database instance, and ensure that the DMON process is running. You can also see the database log file for DMON-process errors. Use the CommunicationTimeout parameter to select an appropriate time-out value for the environment. For more information, see "CommunicationTimeout" in Oracle Data Guard Broker.
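When you review Precheck output or database logs, it can help to pull out the Oracle error codes first. The following minimal Python sketch extracts ORA- and DGM- codes from a block of text by using the sample output shown above. The function name and the five-digit pattern are assumptions made for illustration.

```python
import re

# Matches Oracle-style codes such as ORA-16713 or DGM-17016.
ERROR_CODE = re.compile(r"\b(?:ORA|DGM)-\d{5}\b")


def extract_error_codes(precheck_output):
    """Return the ORA- and DGM- error codes found in the text, in order of appearance."""
    return ERROR_CODE.findall(precheck_output)


sample = """Database Status:
DGM-17016: failed to retrieve status for database "racs"
ORA-16713: the Data Guard broker command timed out"""

codes = extract_error_codes(sample)
if "ORA-16713" in codes:
    print("Broker command timed out; check that the DMON process is running.")
```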

5.5.2 Databases Protected by Data Guard Included in the Incorrect Operation-Plan Category

Issue

Oracle Site Guard adds the Oracle Data Guard protected database targets to the Start/Stop category instead of the Switchover/Failover category of the operation plan.

Description and Solution

Oracle Site Guard uses the DataGuardStatus property maintained by Enterprise Manager for database targets to determine whether the database is protected by Data Guard. This determines which operation plan category the database is added to. If the value of this property is NULL, Site Guard assumes that the database is not protected by Data Guard and adds the database target to the Start/Stop category of the operation plan instead of the Switchover/Failover category.

The DataGuardStatus property for the database can display as NULL in Enterprise Manager if the Data Guard switchover or failover occurs outside of Enterprise Manager, for example, if a Data Guard switchover is performed using DGMGRL or using Oracle Site Guard.

To correct this issue, use the Enterprise Manager Cloud Control console to log in to the Data Guard Administration page of the database target. Upon logging in, the Data Guard related properties are automatically refreshed.
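The categorization rule described in this section can be modeled in a few lines. The following Python sketch is only an illustration of the documented behavior and is not Oracle Site Guard code. The function name is hypothetical, NULL is represented as None, and treating every non-NULL value as protected is a simplifying assumption.

```python
def operation_plan_category(data_guard_status):
    """Model the described rule: a NULL (None) status means the database is treated as unprotected."""
    if data_guard_status is None:
        return "Start/Stop"
    return "Switchover/Failover"
```

For example, operation_plan_category(None) returns "Start/Stop", which corresponds to the miscategorization described in this section.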

5.6 Storage Failures

This section provides tips for troubleshooting the following issues related to storage and storage appliances:

Attempt to Log In to ZFS Storage Appliance Might Fail During Execution of Operation Plan
Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Deleting Empty Project on Target Appliance
Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Executing 'confirm reverse'

5.6.1 Attempt to Log In to ZFS Storage Appliance Might Fail During Execution of Operation Plan

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, logging into a ZFS appliance might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Wrong credentials. Make sure that the given credentials are correct and does not contain any special characters.

Description and Solution

This occurs if the password for the ZFS appliance credential contains special characters. Update the appliance password so that it does not contain special characters. Then, update the storage appliance credentials in the Enterprise Manager Credential Management Framework, and retry the operation step.
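You can screen a candidate password before you update the appliance. The following minimal Python sketch reports characters that might be rejected. The source does not define which characters count as special, so this sketch assumes that anything other than an ASCII letter or digit is special. The function name is hypothetical.

```python
def find_special_characters(password):
    """Return the characters in the password that are not ASCII letters or digits."""
    return [ch for ch in password if not (ch.isascii() and ch.isalnum())]
```

An empty list means that the password contains only ASCII letters and digits under this assumption. Avoid writing the password itself to logs when you run such a check.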

5.6.2 Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Deleting Empty Project on Target Appliance

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, storage role reversal operation might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Error: The action could not be completed because the the target (or one of its descendants) has the 'nodestroy' property set. Turn off the property for '1_test' and try again.

Description and Solution

This occurs if the project has the nodestroy property set. This property is called Prevent destruction in the Enterprise Manager Cloud Control interface.

Turn off this property and retry the operation step.

5.6.3 Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Executing 'confirm reverse'

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, storage role reversal operation might fail while executing confirm reverse, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Error: The action could not be completed because the mountpoint of '<project_name>/<share_name>' would conflict with that of '<project_name>/<share_name>' (/export/<project_name>/<share_name>). Change the mountpoint of '<project_name>/<share_name>' and try again.

Description and Solution

This occurs if at least one of the shares inside any of the available packages for a given project is exported as a file system. Make sure that the exported property of all shares inside all packages for the given project is turned off.