6 Troubleshooting Oracle Site Guard

This chapter describes situations that you might encounter when deploying or managing Oracle Site Guard in disaster-recovery topologies, and how to work around common issues.

This chapter includes the following topics:

6.1 Operation Plan Failure

This section provides tips for troubleshooting the following operation-plan failure issues:

6.1.1 Targets Not Discovered in Operation Plan Workflow

Issue

Targets like Oracle Database or Oracle Fusion Middleware farm, which are part of the system, might not be discovered in the operation plan workflow.

Description and Solution

This problem may occur if you added targets to the system after creating the operation plan. Oracle Site Guard includes only those targets that are part of the system when the operation plan is created. If you have added new targets, re-create the operation plan. If you customized the plan, make a note of those customizations before you re-create it, and reapply them to the new plan.
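To confirm this cause, you can compare the targets currently in the system with the targets captured in the plan. The following is a minimal Python sketch of that comparison. It does not call any Oracle API; the two input lists and the target names are hypothetical and are assumed to have been copied by hand from the Enterprise Manager Cloud Control console.

```python
def targets_missing_from_plan(system_targets, plan_targets):
    """Return the system targets that are absent from the operation plan, sorted by name."""
    return sorted(set(system_targets) - set(plan_targets))


if __name__ == "__main__":
    # Hypothetical target names, copied by hand from the console.
    system_targets = ["db1", "soa_farm", "ohs1"]
    plan_targets = ["db1", "soa_farm"]
    for name in targets_missing_from_plan(system_targets, plan_targets):
        print("Not in plan (re-create the plan to include it):", name)
```

Any name that the sketch reports was probably added to the system after the plan was created.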

6.1.2 Oracle WebLogic Server Managed-Server Target Not Identified

Issue

The Oracle WebLogic Server managed-server target, which is part of the Oracle WebLogic Server domain, is not updated or identified by Oracle Site Guard when creating the operation plan workflow.

Description and Solution

Ensure that the managed servers are running before performing an automatic discovery in Enterprise Manager Cloud Control. If the managed servers are already running but are not visible in Enterprise Manager, try refreshing the WebLogic Domain target to discover the managed servers.

6.1.3 Manual Intervention Needed for Hung Operation Step

Issue

When an operation step (for example, database switchover or failover, custom scripts, and so on) hangs, manual intervention is needed.

Description and Solution

Suspend the operation from the Enterprise Manager Cloud Control console. Do not stop the operation.

Manually correct the condition that caused the operation plan to hang. After completing the manual procedures, resume the operation to complete the Oracle Site Guard operation. Do not re-submit the operation.

If Oracle Site Guard determines that the components are already in the desired state, it performs a 'no operation' for all start, stop, or database switchover operations. This ends the process appropriately, and updates the sites with the required roles. If an operation step fails and manual intervention is needed to resolve the issue, you can either retry the failed step or confirm the manual step, and then proceed with the execution of the operation.

Note:

Restart or resume the operation after every manual intervention. Ensure that you complete the operations that you have started.

6.1.4 OPMN Managed System Components Not Discovered In Operation-Plan Workflow

Issue

OPMN Managed System Components, which are part of the system, might not be discovered in the operation-plan workflow.

Description and Solution

Oracle Site Guard discovers only those OPMN managed system components represented in Enterprise Manager Cloud Control. For example, OPMN Managed System Components like Oracle HTTP Server and Oracle Web Cache are represented in Enterprise Manager Cloud Control. These components are discovered as part of the Oracle Fusion Middleware farm.

6.1.5 Oracle RAC Database Not Discovered in Operation-Plan Workflow

Issue

Oracle RAC Database, which is part of the system, may not be discovered in the operation plan workflow.

Description and Solution

Oracle RAC Databases are grouped and represented under the RAC Database target in Enterprise Manager Cloud Control. When RAC database instances are discovered, the RAC database target is created, and all the database instances in the RAC deployment are grouped below the RAC database target. This issue may occur if individual RAC instance targets are added to the system instead of the RAC database target. Oracle Site Guard cannot identify individual RAC instances. To resolve the issue, add the RAC database target, rather than the individual instance targets, to the system.

6.1.6 Failure of Operation Step When Accessed with Sudo Privileges

Issue

A Site Guard operation step fails with the error stageOmsFileEntry (Error) when using credentials with sudo privileges. You might encounter this issue during the Precheck operation as well.

Description and Solution

When the credentials used by Site Guard are configured to use sudo privileges to run as root, the sudo privilege must be configured as PDP (Privilege Delegation Provider) on all the agents running on the respective hosts of the target.

To configure PDP, go to Setup > Security > Privilege Delegation in the Enterprise Manager Cloud Control console.

6.1.7 Error While Creating Operation Plan Indicating Credential Association Not Configured

Issue

While creating an operation plan, you might encounter an error indicating that a target in the site does not have any credentials associated with it, despite having created and associated credentials for that target.

Description and Solution

This issue occurs when two targets in Enterprise Manager have identical names and one of them is part of the site. For example, a database instance target and a database system target are both named db1, and the database instance target is added to your site.

Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.
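To find name collisions before you rediscover targets, you can check a list of target names for duplicates. The following is a minimal Python sketch under stated assumptions: the input is a hand-built list of (target name, target type) pairs, and the type labels are illustrative rather than the internal type names that Enterprise Manager uses.

```python
from collections import defaultdict


def find_duplicate_target_names(targets):
    """Map each target name used by more than one target to the list of its target types.

    `targets` is an iterable of (target_name, target_type) pairs.
    """
    types_by_name = defaultdict(list)
    for name, target_type in targets:
        types_by_name[name].append(target_type)
    return {name: types for name, types in types_by_name.items() if len(types) > 1}


if __name__ == "__main__":
    # Hypothetical inventory; the type labels are illustrative only.
    inventory = [
        ("db1", "database instance"),
        ("db1", "database system"),
        ("soa_farm", "farm"),
    ]
    for name, types in find_duplicate_target_names(inventory).items():
        print(name, "is used by:", ", ".join(types))
```

The same check applies to the credential-association issue described in the next section, because both issues have the same cause.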

6.1.8 Inability to Associate Credentials for Targets Added to a Site

Issue

While configuring credentials for Oracle Site Guard, you might face issues when you attempt to associate credentials for a target. This occurs because the credential configuration for that target type is not enabled, or because the target does not show up in the list of targets for a specific target type. This error is seen despite adding the target to the site.

Description and Solution

This issue occurs when two targets in Enterprise Manager have identical names and one of them is part of the site. For example, a database instance target and a database system target are both named db1, and the database instance target is added to your site.

Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.

6.1.9 Error While Deleting Or Updating Operation Plans

Issue

While deleting or updating an operation plan, you might encounter the following error:

Error:User does not have FULL_JOB privileges on execution with guid XXXXXXXXXXXXXXXX

Description and Solution

This might occur when a user does not have the necessary privileges to delete or update the operation plan.

Log in using the credentials that were used while creating the operation plan, and then delete or update the plan.

6.1.10 Error Indicating Inability to Create Scalar Value While Creating Operation Plan

Issue

While creating an operation plan, you might encounter an error such as the following:

oracle.sysman.ai.siteguard.model.exception.ConfigurationException: Cannot create scalar value for name [PropertyType = DB_VERSION]. Value argument to the method getScalarValue() is null

Description and Solution

Oracle Site Guard reads and uses the DB_VERSION property maintained by Enterprise Manager for database targets protected by Oracle Data Guard. The DB_VERSION property for the database can display as NULL in Enterprise Manager if a Data Guard switchover or failover occurred outside of Enterprise Manager (for example, if a Data Guard switchover was performed with DGMGRL or Site Guard).

To correct this issue, use the Enterprise Manager Cloud Control console to log in to the Data Guard Administration page of the database target, and reset the DataGuardStatus property from NULL to true. Resetting the DataGuardStatus property automatically refreshes the other Data Guard related properties.

6.1.11 Error While Creating Operation Plan Indicating Missing Node Manager Credentials

Note:

This issue and workaround are specific to Site Guard 12.1.0.7.

Issue

While creating an operation plan, you might encounter an error such as the following:

Credential association for credential type NODEMANAGER is missing for target host_name belonging to system site_name.

Description and Solution

In Enterprise Manager, the Node Manager of a host is not a target type, and therefore Enterprise Manager does not interact with it directly. Oracle Site Guard, on the other hand, interacts with the Node Managers of hosts to manage disaster recovery operations for Oracle Fusion Middleware components. For this reason, Node Manager credentials must be configured and associated while configuring Oracle Site Guard. Because Enterprise Manager does not recognize Node Manager as a target type, you must create host credentials to be used with the Node Managers running on host targets, and associate these credentials with Oracle Site Guard using the Oracle Site Guard Credential Configuration page.

6.1.12 Error Indicating Inability to Stage SWLIB Artifacts Due To Insufficient Disk Space on Target Host

Issue

An operation plan may fail with an error similar to the following because of problems with disk space checks on a remote target host:

Value of property oracle.sysman.core.swlib.disableFreeSpaceOnDestCheck:false
ERROR [Wed Jun 03 07:29:31 PDT 2015]: Parameter validation failure. Reason: The space on the destination host 'myhost.com' is not sufficient to stage the entity.

Description and Solution

The short-term solution to this issue is to ensure that the /tmp directory on the remote host has enough disk space available, and then to disable the disk space check for Enterprise Manager jobs using emctl:

emctl set property -name oracle.sysman.core.swlib.disableFreeSpaceOnDestCheck -value true
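Before you disable the check, confirm how much space /tmp actually has on the remote host. The following is a minimal Python sketch that uses only the standard library. It is meant to be run on the remote host itself, and the threshold you compare against is your own choice; this chapter does not state a required amount.

```python
import shutil


def free_megabytes(path):
    """Return the free space, in whole megabytes, on the file system that holds `path`."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def has_enough_space(path, required_mb):
    """Return True if the file system that holds `path` has at least `required_mb` megabytes free."""
    return free_megabytes(path) >= required_mb


# Example call; 1024 MB is an assumed threshold, not an Oracle-documented requirement:
#     has_enough_space("/tmp", 1024)
```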

A more permanent solution is to inspect the Enterprise Manager logs (emoms.log and emoms.trc), determine the root cause of the failure, and fix it. The following example from the emoms.trc log file shows a disk space check that failed on one particular VM host:

2015-06-03 10:53:16,628 [RJob Step 3818744] WARN swlib.storage logp.251 - 
Unable to retrieve disk space details from agent myhost.com:/tmp/JOB_17161DC66E0E5053BA46F40AE165', 
output=[Error occurred during initialization of VM. Could not reserve enough space for object heap

To determine the location of these log files, see section "Locating and Configuring Enterprise Manager Log Files" in the Enterprise Manager Cloud Control Administrator's Guide.
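After you locate the trace file, you can filter it for the warning shown in the example above. The following is a minimal Python sketch. The marker string is taken from the log excerpt in this section; the assumption that every occurrence of the problem uses the same wording is not confirmed by this chapter.

```python
MARKER = "Unable to retrieve disk space details from agent"


def find_disk_space_warnings(lines):
    """Return the lines that report a failed disk space lookup on an agent."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


if __name__ == "__main__":
    # Sample lines modeled on the excerpt above; a real run would iterate over the open trace file.
    sample = [
        "2015-06-03 10:53:16,628 [RJob Step 3818744] WARN swlib.storage logp.251 - \n",
        "Unable to retrieve disk space details from agent myhost.com:/tmp/JOB_17161DC66E0E5053BA46F40AE165', \n",
        "output=[Error occurred during initialization of VM. Could not reserve enough space for object heap\n",
    ]
    for hit in find_disk_space_warnings(sample):
        print(hit)
```

To scan a real file, open it and pass the file object to find_disk_space_warnings, for example with open(path_to_trace_file, errors="replace").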

6.1.13 Operation Plan Fails Because of Inability to Copy WLS Utility Script to Domain Directory

Issue

An operation plan may fail because Site Guard fails to copy the WebLogic Server-related utility script (siteguard_python_util.py) to the WebLogic Server domain directory.

Description and Solution

This problem can occur if you use Privilege Delegation for the credential used to access the target host where the WebLogic Server resides. During WebLogic start/stop operations, Site Guard stages scripts to this host and then copies these scripts to the WebLogic Server domain directory. This copy process can fail if privilege delegation has not been set up correctly.

To avoid this issue, ensure that privilege delegation is correctly configured for the credential. For information about configuring privilege delegation for targets, see the Oracle Enterprise Manager documentation. After this issue is corrected, you must delete the siteguard_python_util.py file from the WebLogic Server domain directory before you retry the failed operation.

6.2 Switchover or Failover Operations Failure

This section provides tips for troubleshooting the following issues that you may encounter during switchover or failover operations:

6.2.1 WebLogic Administration Server Does Not Start After Performing Switchover or Failover Operation

Issue

The WebLogic Administration Server might not start after a switchover or failover operation. The output log file of the Administration Server reports an error, such as the following:

<Jan 19, 2012 3:43:05 AM PST> <Warning> <EmbeddedLDAP> <BEA-171520> <Could not obtain an exclusive lock for directory: ORACLE_BASE/admin/soadomain/aserver/soadomain/servers/AdminServer/data/ldap/ldapfiles. Waiting for 10 seconds and then retrying in case existing WebLogic Server is still shutting down.>

Description and Solution

The error appears in the Administration Server log file because of an unsuccessful lock cleanup. To fix this error, delete the EmbeddedLDAP.lock file, which is located in ORACLE_BASE/admin/domain_name/aserver/domain_name/servers/AdminServer/data/ldap/ldapfiles/.

There may be multiple WebLogic Administration Server lock files that need to be deleted. Repeat the process by attempting to start the WebLogic Administration Server and identifying each stale lock file that must be deleted.
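To reduce the number of start attempts, you can list candidate lock files under the Administration Server directory in one pass. The following is a minimal Python sketch that only lists files and deletes nothing. The suffixes it looks for are assumptions: .lock matches the file name given above, and .lok is a suffix that WebLogic lock files are often seen with. Review each result before you remove it, and do so only while the server is stopped.

```python
from pathlib import Path

# Assumed suffixes; adjust them to match the lock files in your environment.
LOCK_SUFFIXES = (".lock", ".lok")


def is_lock_file(file_name):
    """Return True if the file name ends with one of the assumed lock suffixes."""
    return Path(file_name).suffix in LOCK_SUFFIXES


def find_lock_files(server_dir):
    """Return the lock-file candidates found anywhere under `server_dir`, sorted by path."""
    root = Path(server_dir)
    return sorted(
        path for path in root.rglob("*")
        if path.is_file() and is_lock_file(path.name)
    )
```

Pass the servers/AdminServer directory of the domain to find_lock_files to get the list.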

6.2.2 WebLogic Administration Server Fails to Restart After Performing Switchover or Failover Operations

Issue

The WebLogic Administration Server might not start after a switchover or failover operation. The Administration Server output log file reports the following error:

<Sep 16, 2011 2:04:06 PM PDT> <Error> <Store> <BEA-280061> <The persistent store "_WLS_AdminServer" could not be deployed: weblogic.store.PersistentStoreException:

[Store:280105]The persistent file store "_WLS_AdminServer" cannot open file _WLS_ADMINSERVER000000.DAT.>

Description and Solution

This error might appear because of locks held on the Network File System (NFS) storage. Clear the NFS locks with the NFS utility of the storage vendor. Alternatively, copy the .DAT file to a temporary location, and then copy it back, to clear the locks.

6.2.3 Host Not Available During Switchover or Failover Operations

Issue

Some hosts on the new primary system might be unavailable or down while a switchover or failover operation is performed. In such situations, Oracle Site Guard cannot perform any operation on these hosts.

Description and Solution

If the services running on these hosts are not mandatory, and the site can still be functional and active with the services running on the other nodes, you can disable the steps for the hosts that are down by updating the operation plan. The Oracle Site Guard workflow skips all disabled steps.

6.2.4 Switchover or Failover Operations Fail When Oracle RAC Database Instances Are Not Available

Issue

If all the Oracle RAC Database instances are down, the switchover or failover operation fails.

Description and Solution

While creating an operation plan, Oracle Site Guard determines the Oracle RAC Database instance on which the switchover or failover operation is performed. A RAC deployment can have multiple instances, and some of them might be down. Before running the switchover or failover operation, ensure that at least one instance is running. To identify the RAC instance that Oracle Site Guard uses for the role reversal operation, run the get_operation_plan_details command.

6.3 Precheck or Healthcheck Failure

This section provides tips for troubleshooting the following Precheck or Healthcheck failures:

6.3.1 Failure of Prechecks

Issue

Prechecks fail, displaying the following error:

Nmo setuid status NMO not setuid-root (Unix-only)

Description and Solution

After installing the Oracle Management Agent, ensure that you run the root.sh script on the Enterprise Manager Cloud Control host and on all hosts managed by Enterprise Manager, as described in the section "After You Install" in the Oracle Enterprise Manager Cloud Control Basic Installation Guide.
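The error message indicates that the agent's nmo executable is not owned by root with the setuid bit set. After you run root.sh, you can verify that condition on each host. The following is a minimal Python sketch that uses only the standard library. The location of the nmo executable is not stated in this chapter, so the sketch takes the path as an argument instead of assuming one.

```python
import os
import stat


def is_setuid_root(mode, owner_uid):
    """Return True if `mode` has the setuid bit set and the owner is root (uid 0)."""
    return bool(mode & stat.S_ISUID) and owner_uid == 0


def check_file(path):
    """Apply is_setuid_root to the file at `path` (for example, the agent's nmo executable)."""
    info = os.stat(path)
    return is_setuid_root(info.st_mode, info.st_uid)
```

A result of False from check_file for the nmo executable is consistent with the Precheck error above and suggests that root.sh has not been run on that host.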

6.3.2 Prechecks Hang When Oracle Management Agent Is Not Available

Issue

If the Oracle Management Agent is down, Prechecks hang while trying to run commands on the remote host.

Description and Solution

Ensure that all hosts involved in an operation are active, and all the configured scripts are available on remote hosts in the configured locations. If the Oracle Management Agent cannot be reached for some reason, then check the log files from the Enterprise Manager Cloud Control console. If you have identified the hosts that are down, skip the Precheck operation on those hosts.

6.3.3 Healthchecks Cannot Be Retried or Resumed

Issue

Healthchecks that fail cannot be retried or resumed.

Description and Solution

If a healthcheck fails, it cannot be retried or resumed. Either wait for the next healthcheck or execute a standalone precheck to verify a Site Guard operation plan's validity.

6.4 Oracle WebLogic Server Failure

This section provides troubleshooting tips for the following Oracle WebLogic Server failure issues:

6.4.1 Node Manager Fails to Restart

Issue

Node Manager might fail to start due to an error, like the following:

<Sep 13, 2011 8:45:37 PM PDT> <Error> <NodeManager> <BEA-300033> <Could not execute command "getVersion" on the node manager. Reason: "Access to domain 'base_domain' for user 'weblogic' denied".>

Description and Solution

This problem might occur if you changed the Node Manager credentials but did not run nmEnroll to ensure that the correct Node Manager user name and password are supplied to each managed server.

To ensure that the correct Node Manager user name and password have been supplied, connect to WLST and execute the nmEnroll command using the following syntax:

nmEnroll(domain_directory, node_manager_home)

For example:

nmEnroll('C:/oracle/user_projects/domains/prod_domain',
'C:/oracle/wlserver_10.3/common/nodemanager')

Note:

Restart Node Manager for the changes to take effect.

6.4.2 Node Manager Start or Stop Fails Due to Missing nodemanager.properties File

Issue

Node Manager Start or Stop operations may fail because of a missing nodemanager.properties file.

Description and Solution

Site Guard inspects the nodemanager.properties file to determine various properties of the Node Manager when starting or stopping the Node Managers during disaster recovery operations. If this file is missing, Node Manager start and stop operation steps will fail.

The nodemanager.properties file is created at a predetermined location the first time a Node Manager is started. Ensure that you have manually started all involved Node Managers at least once prior to executing any Site Guard operation plans that affect those Node Managers.
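Before you run an operation plan, you can confirm that the file exists for each Node Manager and that it contains readable properties. The following is a minimal Python sketch. The Node Manager home directories are passed in by the caller because this chapter does not state where the file is created, and the parser handles only simple key=value lines.

```python
from pathlib import Path


def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and lines that start with # or !."""
    properties = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#!":
            continue
        key, separator, value = line.partition("=")
        if separator:
            properties[key.strip()] = value.strip()
    return properties


def missing_properties_files(node_manager_homes):
    """Return the Node Manager home directories that lack a nodemanager.properties file."""
    return [
        home for home in node_manager_homes
        if not (Path(home) / "nodemanager.properties").is_file()
    ]
```

Any directory that missing_properties_files returns belongs to a Node Manager that probably has not been started yet.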

6.4.3 Managed Server Fails to Start

Issue

The managed server does not start due to a connection failure of the WLS Administration Server in Enterprise Manager Cloud Control.

Description and Solution

To start the managed server, Oracle Site Guard requires the Administration Server and the Node Manager. To start and stop managed servers successfully, ensure that the Administration Server is running.

6.4.4 Oracle Site Guard Does Not Include Oracle WebLogic Server Instances That Are Migrated to a Different Host

Issue

Oracle Site Guard does not include the WebLogic Server instances that are migrated to a different host in the workflow.

Description and Solution

If WebLogic Server instances are migrated to different hosts through server migration after you create the operation plan, Oracle Site Guard does not include those instances in the plan.

After you complete the server migration, refresh the WebLogic Server farm target from the Enterprise Manager Cloud Control console to pick up the latest target changes in the farm. This step is mandatory for Enterprise Manager to resume its farm monitoring capabilities after any change in the farm, such as a server migration. After the farm target is refreshed, re-create the Oracle Site Guard operation plans to include all of the farm targets in the Oracle Site Guard workflow. Any customizations made to the operation plans must also be re-created.

6.4.5 Error Displayed While Creating Operation Plan

Issue

While creating an operation plan, you might see an error, like the following:

oracle.sysman.ai.siteguard.model.common.exception.DAOException:
For hostName:
[2606:b400:800:89:214:4fff:fe46:2d52] credential of type HOSTNORMAL does not exist for siteName: System1

Description and Solution

If you do not configure the listen address for WebLogic Server instances running on hosts where multiple IP addresses are configured, WebLogic Server randomly picks an IP address and reports it as the listen address. This IP address might not be valid, which can cause problems when creating operation plans. To fix the issue, use the Administration Console to configure WebLogic Server with a resolvable listen address. After configuring Oracle WebLogic Server, restart the server, and rediscover it from Enterprise Manager Cloud Control. For more information about listen address configuration, refer to the Oracle Fusion Middleware Disaster Recovery Guide.
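To check whether a configured listen address resolves, you can test it with the local resolver. The following is a minimal Python sketch that uses only the standard library. It reports what the machine that runs it can resolve, so running it on the Enterprise Manager host is an assumption about where resolution matters; the chapter does not say so explicitly. The bracket handling covers addresses written like the IPv6 literal in the error above.

```python
import socket


def strip_brackets(address):
    """Remove the square brackets that sometimes wrap an IPv6 literal."""
    if address.startswith("[") and address.endswith("]"):
        return address[1:-1]
    return address


def is_resolvable(address):
    """Return True if the local resolver can resolve `address` (a host name or IP literal)."""
    try:
        socket.getaddrinfo(strip_brackets(address), None)
    except socket.gaierror:
        return False
    return True
```

Note that an IP literal always passes this check even when no host answers on it, so the sketch is most useful for verifying host names.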

6.4.6 WebLogic Administration Server Able to Communicate With Node Manager When Site Guard Cannot

Issue

Oracle Site Guard is unable to access the Node Manager even though the WebLogic administrator is able to log in to the Node Manager.

Description and Solution

This issue occurs when the user name used to authenticate with Node Manager is randomly generated by the WebLogic Administration Server.

To correct this, complete the following steps:

  1. Log in to the WebLogic Administration Server console.

  2. Click Domain listed in the left-hand pane.

  3. Click the Security tab, and then click the Advanced link.

    The Node Manager user name is displayed. The user name might appear to be a randomly generated string.

  4. Update the Node Manager log-in credentials with the correct information.

6.4.7 Unable to Associate More Than One Node Manager Per Host

Issue

Oracle Site Guard is unable to associate different credentials for different Node Managers running on the same host.

Description

This is a limitation in the current version of Oracle Site Guard. The current version can only support one set of credentials for all the Node Managers running on a host. Ensure that all the Node Managers on a given host have been configured with an identical set of credentials.

6.4.8 WebLogic Server Password Updates and Site Guard Credentials

Issue

WebLogic Server start/stop operations in Site Guard operation plans may fail after a WebLogic Server administration password update. This can occur even if the Site Guard credential for the WebLogic Server target has been updated with the new password.

Description and Solution

In order for the updated Site Guard credentials to work with the updated WebLogic Server password, the WebLogic Administration Server must be restarted for the new password to be applicable for the administration functions that Site Guard performs. After each WebLogic Server password change, update the Site Guard credential and restart the WebLogic Administration Server.

6.5 Database Failure

This section provides tips for troubleshooting the following issues related to database operation failure:

6.5.1 Prechecks for Database Switchover and Database Failover Operations Fail

Issue

The Prechecks for database switchover or database failover operations fail, and display the following error:

Database Status:
DGM-17016: failed to retrieve status for database "racs"
ORA-16713: the Data Guard broker command timed out

Description and Solution

This error might occur if the Data Guard Monitor process (DMON) in the target database instance is down.

Note:

The Data Guard Monitor process (DMON) is part of the Oracle Data Guard Broker.

If this error occurs, restart the database instance, and ensure that the DMON process is running. You can also check the database log file for DMON process errors. Use the CommunicationTimeout parameter to set a time-out value that is appropriate for the environment. For more information, see "CommunicationTimeout" in Oracle Data Guard Broker.
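To confirm that the DMON process is running after the restart, you can filter the host's process list. The following is a minimal Python sketch. It assumes that the background process appears with a name that starts with ora_dmon_ followed by the instance name, which is the usual convention on UNIX-like hosts but is not stated in this chapter. The function takes process-list lines as input so that it stays independent of any particular ps invocation.

```python
DMON_PREFIX = "ora_dmon_"  # Assumed naming convention for the Data Guard Monitor process.


def dmon_processes(ps_lines):
    """Return the DMON process names found in the given process-list lines."""
    names = []
    for line in ps_lines:
        for field in line.split():
            # Require characters after the prefix so that a bare search pattern does not match.
            if field.startswith(DMON_PREFIX) and len(field) > len(DMON_PREFIX):
                names.append(field)
                break
    return names


if __name__ == "__main__":
    # Sample lines only; the instance name racs1 is hypothetical.
    sample = ["ora_pmon_racs1", "ora_dmon_racs1", "/usr/sbin/sshd -D"]
    print(dmon_processes(sample))
```

On a database host, the input lines would typically come from a command such as ps -e -o args. An empty result suggests that the DMON process is not running.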

6.5.2 Databases Protected by Data Guard Included in the Incorrect Operation-Plan Category

Issue

Oracle Site Guard adds the Oracle Data Guard protected database targets to the Start/Stop category instead of Switchover/Failover category of the operation plan.

Description and Solution

Oracle Site Guard uses the DataGuardStatus property maintained by Enterprise Manager for database targets to determine whether the database is protected by Data Guard. This determines which operation plan category the database is added to. If the value of this property is NULL then Site Guard assumes that the database is not protected by Data Guard and adds the database target to the Start or Stop category of the operation plan, instead of the Switchover or Failover category.

The DataGuardStatus property for the database can display as NULL in Enterprise Manager if the Data Guard switchover or failover occurs outside of Enterprise Manager, for example, when a Data Guard switchover is performed with DGMGRL or Site Guard.

To correct this issue, use the Enterprise Manager Cloud Control console to log in to the Data Guard Administration page of the database target. Upon logging in, the Data Guard related properties are automatically refreshed.
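The category rule described in this section can be restated as a small function. The following Python sketch illustrates the documented behavior only and is not Oracle Site Guard code. It represents a NULL property as None, and it assumes that any non-NULL value places the database in the Switchover/Failover category, which the text implies but does not state outright.

```python
def plan_category(data_guard_status):
    """Return the operation plan category described above for a database target.

    `data_guard_status` is the DataGuardStatus property value, or None when it is NULL.
    """
    if data_guard_status is None:
        return "Start/Stop"
    return "Switchover/Failover"
```

A protected database that lands in the Start/Stop category therefore points to a NULL DataGuardStatus property rather than to a problem in the plan itself.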

6.5.3 Database Is Not Accessible When Opening a Site for Standby Validation

Issue

After opening a Site Guard site in Standby Validation mode, one or more databases in the site are not accessible even though a database snapshot has been created.

Description and Solution

This can occur if the standby database does not have a snapshot service associated with the database. When configuring the standby site database, ensure that you have specifically created a separate snapshot service for the database so that the database snapshots can be accessed in Standby Validation mode. Refer to Oracle Database documentation for details on configuring services for databases.

6.6 Storage Failures

This section provides tips for troubleshooting the following issues related to storage and storage appliances:

6.6.1 Attempt to Log In to ZFS Storage Appliance Might Fail During Execution of Operation Plan

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, logging into a ZFS appliance might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Wrong credentials. Make sure that the given credentials are correct and does not contain any special characters.

Description and Solution

This occurs if the password for the ZFS appliance credential contains special characters. Update the appliance password so that it does not contain special characters. Then, update the storage appliance credentials in the Enterprise Manager Credential Management Framework, and retry the operation step.
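Before you store a new appliance password, you can check it for characters that might be rejected. The following is a minimal Python sketch. This chapter does not list which characters count as special, so the sketch makes the conservative assumption that anything other than an ASCII letter or digit does.

```python
def special_characters(password):
    """Return the distinct characters in `password` that are not ASCII letters or digits, sorted."""
    return sorted({ch for ch in password if not (ch.isascii() and ch.isalnum())})


def is_safe_password(password):
    """Return True if `password` is non-empty and contains only ASCII letters and digits."""
    return bool(password) and not special_characters(password)
```

Avoid passing real passwords on a command line when you use a check like this, because command-line arguments can be visible to other users on the host.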

6.6.2 Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Deleting Empty Project on Target Appliance

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, storage role reversal operation might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Error: The action could not be completed because the the target (or one of its descendants) has the 'nodestroy' property set. Turn off the property for '1_test' and try again.

Description and Solution

This occurs if the project has the nodestroy property set. This property is called Prevent destruction in the Enterprise Manager Cloud Control interface.

Turn off this property and retry the operation step.

6.6.3 Storage Role Reversal Operation Might Fail During Execution of Operation Plan While Executing 'confirm reverse'

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, storage role reversal operation might fail while executing confirm reverse, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:

Error: The action could not be completed because the mountpoint of '<project_name>/<share_name>' would conflict with that of '<project_name>/<share_name>' (/export/<project_name>/<share_name>). Change the mountpoint of '<project_name>/<share_name>' and try again.

Description and Solution

This occurs if at least one of the shares in any of the available packages for a given project is exported as a file system. Ensure that the exported property is turned off for all shares in all packages for the project.

6.6.4 ZFS Storage Role Reversal Operation Might Fail During Execution of Operation Plan Because of Insufficient Privileges

Issue

During a storage switchover or failover step of an Oracle Site Guard operation, ZFS storage role reversal operation might fail because the credentials used to perform ZFS operations do not have the necessary privileges to perform these ZFS operations.

Description and Solution

Ensure that the credentials used for ZFS operations are assigned the roles/privileges required for performing ZFS storage role reversal. Refer to the ZFS storage configuration section of this guide for additional details.

6.6.5 Remote Replication Targets on Source ZFS Storage May List Multiple Target Appliances With The Same Name During Replication Configuration

Issue

When attempting to set up a replication configuration (action) on the source ZFS storage appliance, you may see multiple instances of the same replication target in the drop-down list. This is a known ZFS issue.

Description and Solution

Only one of these instances of the target appliance works as a valid target appliance. The replication configuration for the other, invalid instances cannot be saved successfully. Try creating a configuration with each instance of the target appliance to determine which configuration succeeds. Note that creating the configurations and determining which instance succeeds are manual tasks that you perform at the storage level.

6.6.6 ZFS Storage Role Reversal May Fail If Storage Scripts Are Configured to Use Physical (Non-Portable) Addresses for Clustered ZFS Appliances

Issue

ZFS storage role reversal scripts may fail with errors like "Replication action not found for given project on <source> appliance" if they are configured with source and target appliance hostnames that are physical. This is especially true in the case of clustered (highly available) ZFS appliances.

Description and Solution

Physical hostnames or IP addresses are not relocated in a storage cluster when services fail over from one storage head to another. If you use these physical addresses in your script configuration, and the storage appliance services relocate to a different head during an HA event, the storage script will be unable to find the replication action ID and its UUID.

Ensure that you use management interfaces (not physical interfaces) when configuring the source and target hostnames or IP addresses for Site Guard ZFS storage scripts.