This chapter describes common situations that you might encounter when deploying and managing Oracle Site Guard in disaster-recovery topologies. It also includes the steps for addressing them.
This section provides tips for troubleshooting the following operation-plan failure issues:
OPMN Managed System Components Not Discovered In Operation-Plan Workflow
Oracle RAC Database Not Discovered in Operation-Plan Workflow
Failure of Operation Step When Accessed with Sudo Privileges
Error While Creating Operation Plan Indicating Credential Association Not Configured
Inability to Associate Credentials for Targets Added to a Site
Error Indicating Inability to Create Scalar Value While Creating Operation Plan
Error While Creating Operation Plan Indicating Missing Node Manager Credentials
Targets like Oracle Database or Oracle Fusion Middleware farm, which are part of the system, might not be discovered in the operation plan workflow.
This problem may occur if you have added targets to the system after creating the operation plan. Oracle Site Guard only includes those targets that are part of the system during the creation of the operation plan. If you have added new targets, re-create the operation plan.
The Oracle WebLogic Server managed-server target, which is part of the Oracle WebLogic Server domain, is not updated or identified by Oracle Site Guard when creating the operation plan workflow.
Ensure that the managed servers are running, before performing an automatic discovery in Enterprise Manager Cloud Control.
When an operation step (for example, database switchover or failover, custom scripts, and so on) hangs, manual intervention is needed.
Suspend the operation from the Enterprise Manager Cloud Control console. Do not stop the operation.
Manually correct the condition that caused the operation plan to hang. After completing the manual procedures, resume the operation to complete the Oracle Site Guard operation. Do not re-submit the operation.
If Oracle Site Guard determines that the components are already in the desired state, it performs a 'no operation' for all the start or stop or database switchover operations. This appropriately ends the process, and updates the sites with the required roles. If an operation step fails, and if manual intervention is needed to resolve the issue, you can either retry the failed step or confirm the manual step, and proceed with the execution of the operation.
Note:
Restart or resume the operation after every manual intervention. Ensure that you complete the operations that you have started.
OPMN Managed System Components, which are part of the system, might not be discovered in the operation-plan workflow.
Oracle Site Guard discovers only those OPMN managed system components that are represented in Enterprise Manager Cloud Control. For example, OPMN managed system components like Oracle HTTP Server and Oracle Web Cache are represented in Enterprise Manager Cloud Control. These components are discovered as part of the Oracle Fusion Middleware farm.
Oracle RAC Database, which is part of the system, may not be discovered in the operation plan workflow.
Oracle RAC Databases are grouped and represented under the RAC database target in Enterprise Manager Cloud Control. When RAC database instances are discovered, the RAC database target is created, and all the database instances in the RAC deployment are grouped below the RAC database target. This issue may occur if individual RAC instance targets are added to the system, instead of the RAC database target. Oracle Site Guard cannot identify individual RAC instances, so add the RAC database target to the system rather than the individual instance targets.
A Site Guard operation step fails with the error stageOmsFileEntry (Error) while using credentials with sudo privileges. You might encounter this issue during the Precheck operation as well.
When the credentials used by Site Guard are configured to use sudo privileges to run as root, the sudo privilege must be configured as PDP (Privilege Delegation Provider) on all the agents running on the respective hosts of the target.
PDP can be configured from Enterprise Manager Cloud Control console. To configure PDP, go to Setup > Security > Privilege Delegation in the Enterprise Manager Cloud Control console.
Issue
While creating an operation plan, you might encounter an error indicating that a target in the site does not have any credentials associated with it, despite having created and associated credentials for that target.
Description and Solution
This issue occurs when there are two targets with identical names in Enterprise Manager, and one of the targets is part of the site. For example, a database instance target and a database system target are both named db1, and the database instance target is added to your site.
Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.
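One way to spot such name collisions before rediscovering targets is to scan a list of target names and types. The following is a minimal sketch only: it assumes you have already exported the targets as (name, type) pairs by some means, and the function name and the target type strings in the usage note are hypothetical placeholders, not an Enterprise Manager API.

```python
from collections import defaultdict


def find_duplicate_target_names(targets):
    """Return {name: [types]} for every target name used by more than one target.

    `targets` is an iterable of (target_name, target_type) pairs. How the
    pairs are obtained is outside this sketch (assumption: an export of
    the Enterprise Manager target list).
    """
    types_by_name = defaultdict(list)
    for name, target_type in targets:
        types_by_name[name].append(target_type)
    # Keep only the names that occur on two or more targets.
    return {name: types for name, types in types_by_name.items() if len(types) > 1}
```

For example, calling the function with the pairs ("db1", "instance"), ("db1", "system"), and ("host1", "host") reports only db1, which is the name that would need to be made unique.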
Issue
While configuring credentials for Oracle Site Guard, you might face issues when you attempt to associate credentials for a target. This occurs because the credential configuration for that target type is not enabled, or because the target does not show up in the list of targets for a specific target type. This error is seen despite adding the target to the site.
Description and Solution
This issue occurs when there are two targets with identical names in Enterprise Manager, and one of the targets is part of the site. For example, a database instance target and a database system target are both named db1, and the database instance target is added to your site.
Delete the targets with identical names, and rediscover them. When you rediscover the targets, ensure that each target name is unique across all of the Enterprise Manager targets.
Issue
While deleting or updating an operation plan, you might encounter the following error:
Error:User does not have FULL_JOB privileges on execution with guid XXXXXXXXXXXXXXXX
Description and Solution
This might occur when a user does not have the necessary privileges to delete or update the operation plan.
Log in using the credentials that were used while creating the operation plan, and then delete or update the plan.
Issue
While creating an operation plan, you might encounter an error such as the following:
oracle.sysman.ai.siteguard.model.exception.ConfigurationException: Cannot create scalar value for name [PropertyType = DB_VERSION]. Value argument to the method getScalarValue() is null
Description and Solution
Oracle Site Guard reads and uses the DB_VERSION property maintained by Enterprise Manager for database targets protected by Oracle Data Guard. The DB_VERSION property for the database can display as NULL in Enterprise Manager if a Data Guard switchover or failover occurred outside of Enterprise Manager (for example, if a Data Guard switchover was performed using DGMGRL or using Site Guard).
To correct this issue, use the Enterprise Manager Cloud Control console to log in to the Data Guard Administration page of the database target, and reset the DataGuardStatus property from NULL to true. When you reset the DataGuardStatus property, the other Data Guard related properties are automatically refreshed.
Issue
While creating an operation plan, you might encounter an error such as the following:
Credential association for credential type NODEMANAGER is missing for target host_name belonging to system site_name.
Description and Solution
In Enterprise Manager, the Node Manager of a host is not a target type, and therefore, Enterprise Manager does not directly interact with it. Oracle Site Guard, on the other hand, interacts with the Node Managers of hosts for managing disaster recovery operations of Oracle Fusion Middleware components. For this reason, Node Manager credentials must be configured and associated while configuring Oracle Site Guard. Since Enterprise Manager does not recognize Node Manager as a target type, you must create host credentials to be used with the node managers running on host targets, and associate these credentials with Oracle Site Guard using the Oracle Site Guard Credential Configuration page.
This section provides tips for troubleshooting the following issues that you may encounter during switchover or failover operations:
WebLogic Administration Server Does Not Start After Performing Switchover or Failover Operation
WebLogic Administration Server Fails to Restart After Performing Switchover or Failover Operations
Switchover or Failover Operations Fail When Oracle RAC Database Instances Are Not Available
The WebLogic Administration Server might not start after a switchover or failover operation. The output log file of the Administration Server reports an error, such as the following:
<Jan 19, 2012 3:43:05 AM PST> <Warning> <EmbeddedLDAP> <BEA-171520> <Could not obtain an exclusive lock for directory: ORACLE_BASE/admin/soadomain/aserver/soadomain/servers/AdminServer/data/ldap/ldapfiles. Waiting for 10 seconds and then retrying in case existing WebLogic Server is still shutting down.>
The error appears in the Administration Server log file due to unsuccessful lock cleanup. To fix this error, delete the EmbeddedLDAP.lock file (located at ORACLE_BASE/admin/domain_name/aserver/domain_name/servers/AdminServer/data/ldap/ldapfiles/).
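The cleanup described above can be scripted. The following is a minimal sketch, not an Oracle-supplied utility: the function names are hypothetical, the path layout is taken from the location given in this section, and it assumes (as a precaution not stated in the original text) that the Administration Server process is fully stopped before the lock file is removed.

```python
from pathlib import Path


def embedded_ldap_lock_path(oracle_base, domain_name):
    """Build the lock-file path from the layout given in this section."""
    return (
        Path(oracle_base) / "admin" / domain_name / "aserver" / domain_name
        / "servers" / "AdminServer" / "data" / "ldap" / "ldapfiles"
        / "EmbeddedLDAP.lock"
    )


def remove_stale_lock(lock_path):
    """Delete the lock file if it exists; return True if a file was removed.

    Assumption: the caller has confirmed that no Administration Server
    process is still running against this domain directory.
    """
    lock_path = Path(lock_path)
    if lock_path.is_file():
        lock_path.unlink()
        return True
    return False
```

Returning a boolean lets a calling script log whether a stale lock was actually found, which helps distinguish this cause from other startup failures.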
The WebLogic Administration Server might not restart after a switchover or failover operation. The Administration Server output log file reports the following error:
<Sep 16, 2011 2:04:06 PM PDT> <Error> <Store> <BEA-280061> <The persistent store "_WLS_AdminServer" could not be deployed: weblogic.store.PersistentStoreException: [Store:280105]The persistent file store "_WLS_AdminServer" cannot open file _WLS_ADMINSERVER000000.DAT.>
This error might appear due to locks held on Network File System (NFS) storage. You must clear the NFS locks using the NFS utility of the storage vendor. You may also copy the .DAT file to a temporary location, and copy it back, to clear the locks.
Some hosts on the new primary system might not be available, or might be down, while performing a switchover or failover operation. In such situations, Oracle Site Guard cannot perform any operation on these hosts.
Description and Solution
If the services running on these hosts are not mandatory, and the site can still be functional and active with the services running on the other nodes, you can disable the steps that pertain to the unavailable hosts by updating the operation plan. The Oracle Site Guard workflow skips all disabled steps.
If all the Oracle RAC Database instances are down, the switchover or failover operation fails.
While creating an operation plan, Oracle Site Guard determines the Oracle RAC Database instance on which the switchover or failover operation is performed. A RAC deployment can have multiple instances, and it is possible that some of the instances are down. Before running the switchover or failover operation, ensure that at least one instance is running. You can identify the name of the RAC instance that Oracle Site Guard uses to perform the role reversal operation by running the get_operation_plan_details command.
This section provides tips for troubleshooting the following Precheck failures:
Prechecks fail, displaying the following error:
Nmo setuid status NMO not setuid-root (Unix-only)
After installing the Oracle Management Agent, ensure that you run the root.sh script from the Enterprise Manager Cloud host and all hosts managed by Enterprise Manager, as described in the section "After You Install" in the Oracle Enterprise Manager Cloud Control Basic Installation Guide.
If the Oracle Management Agent is down, Prechecks hang while trying to run commands on the remote host.
Ensure that all hosts involved in an operation are active, and all the configured scripts are available on remote hosts in the configured locations. If the Oracle Management Agent cannot be reached for some reason, then check the log files from the Enterprise Manager Cloud Control console. If you have identified the hosts that are down, skip the Precheck operation on those hosts.
This section provides troubleshooting tips for the following Oracle WebLogic Server failure issues:
Node Manager might fail to start due to an error, like the following:
<Sep 13, 2011 8:45:37 PM PDT> <Error> <NodeManager> <BEA-300033> <Could not execute command "getVersion" on the node manager. Reason: "Access to domain 'base_domain' for user 'weblogic' denied".>
This problem might occur if you changed the Node Manager credentials but did not run nmEnroll to ensure that the correct Node Manager user name and password are supplied to each managed server.
To ensure that the correct Node Manager user name and password have been supplied, connect to WLST (using wlst.sh) and execute the nmEnroll command using the following syntax:
nmEnroll(domain_directory, node_manager_home)
For example:
nmEnroll('C:/oracle/user_projects/domains/prod_domain', 'C:/oracle/wlserver_10.3/common/nodemanager')
Note:
Restart Node Manager for the changes to take effect.
The managed server does not start due to a connection failure of the WLS Administration Server in Enterprise Manager Cloud Control.
To start the managed server, Oracle Site Guard requires the Administration Server and the Node Manager. To start and stop managed servers successfully, ensure that the Administration Server is running.
Oracle Site Guard does not include the WebLogic Server instances that are migrated to a different host in the workflow.
After you create the operation plan, Oracle Site Guard does not include the WebLogic Server instances involved in the operation plan that are migrated to different hosts, as a result of server migration.
After you complete the server migration, refresh the WebLogic Server farm target from the Enterprise Manager Cloud Control console to pick up the latest target changes in the farm. This step is mandatory for Enterprise Manager to resume its farm monitoring capabilities after any change in the farm, such as a server migration. After the farm target is refreshed, re-create the Oracle Site Guard operation plans to include all of the farm targets in the Oracle Site Guard workflow.
While creating an operation plan, you might see an error, like the following:
oracle.sysman.ai.siteguard.model.common.exception.DAOException: For hostName: [2606:b400:800:89:214:4fff:fe46:2d52] credential of type HOSTNORMAL does notexist for siteName: System1
If you do not configure the listen address for the WebLogic Server instances running on hosts where multiple IP addresses are configured, WebLogic Server picks one of the IP addresses arbitrarily and reports it as the listen address. This IP address might not be a valid one, which can cause problems when creating operation plans. To fix the issue, use the Administration Console to configure WebLogic Server with a resolvable listen address. After configuring Oracle WebLogic Server, restart the server, and rediscover it from Enterprise Manager Cloud Control. For more information about listen address configuration, refer to the Oracle Fusion Middleware Disaster Recovery Guide.
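Before restarting and rediscovering the server, it can help to sanity-check the listen address you intend to configure. The following is a minimal sketch using only the Python standard library; the function name is hypothetical, and treating a raw IP literal as unsuitable is an assumption drawn from the error example above, where the reported host name is an IPv6 address rather than a host name.

```python
import ipaddress
import socket


def is_usable_listen_address(listen_address):
    """Return True if the value is a host name that resolves on this machine."""
    if not listen_address:
        # Unset: WebLogic Server would choose an address on its own.
        return False
    try:
        ipaddress.ip_address(listen_address)
        # Assumption: a raw IP literal is treated as suspect, because the
        # goal is a resolvable host name.
        return False
    except ValueError:
        pass  # Not an IP literal, so go on to name resolution.
    try:
        socket.getaddrinfo(listen_address, None)
    except (socket.gaierror, UnicodeError):
        return False
    return True
```

Run the check on the host where the WebLogic Server instance runs and, ideally, also on the Enterprise Manager host, because name resolution can differ between machines.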
Issue
Oracle Site Guard is unable to access the Node Manager even though the WebLogic administrator is able to log in to the Node Manager.
Description and Solution
This issue occurs when the user name used to authenticate with Node Manager is randomly generated by the WebLogic Administration Server.
To correct this, complete the following steps:
Log in to the WebLogic Administration Server console.
Click Domain in the left-hand pane.
Click the Security tab, and then click the Advanced link.
The Node Manager user name is displayed. The user name might appear to be a randomly generated string.
Update the Node Manager log-in credentials with the correct information.
Issue
Oracle Site Guard is unable to associate credentials for more than one Node Manager running on the same host.
Description
This is a limitation in the current version of Oracle Site Guard. The current version can only support one set of credentials for all the Node Managers running on a host. Ensure that all the Node Managers on a given host have been configured with an identical set of credentials.
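Given this limitation, it can be useful to check an inventory of Node Managers for hosts whose Node Managers do not share one set of credentials. The following is a minimal sketch over hypothetical data: the function name and the (host, user name, password) tuple layout are assumptions, and nothing here queries Node Manager or Enterprise Manager.

```python
from collections import defaultdict


def hosts_with_mixed_credentials(node_managers):
    """Return a sorted list of hosts whose Node Managers use differing credentials.

    `node_managers` is an iterable of (host, username, password) tuples,
    one per Node Manager (assumption: gathered by hand or from your own
    configuration records).
    """
    credentials_by_host = defaultdict(set)
    for host, username, password in node_managers:
        credentials_by_host[host].add((username, password))
    # A host is a problem when more than one distinct credential set is in use.
    return sorted(
        host for host, creds in credentials_by_host.items() if len(creds) > 1
    )
```

Any host that the function returns needs its Node Managers reconfigured to use an identical user name and password before credentials are associated in Oracle Site Guard.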
This section provides tips for troubleshooting the following issues related to database operation failure:
Prechecks for Database Switchover and Database Failover Operations Fail
Databases Protected by Data Guard Included in the Incorrect Operation-Plan Category
The Prechecks for database switchover or database failover operations fail, and display the following error:
Database Status: DGM-17016: failed to retrieve status for database "racs" ORA-16713: the Data Guard broker command timed out
This error might occur if the Data Guard Monitor process (DMON) in the target database instance is down.
Note:
The Data Guard Monitor process (DMON) is part of the Oracle Data Guard Broker.
If this error occurs, restart the database instance, and ensure that the DMON process is running. You can also check the database log file for DMON-process errors. Use the CommunicationTimeout parameter to select an appropriate time-out value for the environment. For more information, see "CommunicationTimeout" in Oracle Data Guard Broker.
Issue
Oracle Site Guard adds the Oracle Data Guard protected database targets to the Start/Stop category instead of Switchover/Failover category of the operation plan.
Description and Solution
Oracle Site Guard uses the DataGuardStatus property maintained by Enterprise Manager for database targets to determine whether the database is protected by Data Guard. This determines which operation plan category the database is added to. If the value of this property is NULL, then Site Guard assumes that the database is not protected by Data Guard and adds the database target to the Start or Stop category of the operation plan, instead of the Switchover or Failover category.
The DataGuardStatus property for the database can display as NULL in Enterprise Manager if the Data Guard switchover or failover occurs outside of Enterprise Manager. For example, a Data Guard switchover is performed using DGMGRL or using Oracle Site Guard.
Using the Enterprise Manager Cloud Control console, log in to the Data Guard Administration page of the database target. When you log in, the Data Guard related properties are automatically refreshed.
This section provides tips for troubleshooting the following issues related to storage and storage appliances:
Issue
During a storage switchover or failover step of an Oracle Site Guard operation, logging into a ZFS appliance might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:
Wrong credentials. Make sure that the given credentials are correct and does not contain any special characters.
Description and Solution
This occurs if the password for the ZFS appliance credential contains special characters. Update the appliance password so that it does not contain special characters. Then, update the storage appliance credentials in the Enterprise Manager Credential Management Framework, and retry the operation step.
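A candidate password can be checked before it is set on the appliance. The following is a minimal sketch: the original text does not define which characters count as special, so this example assumes that anything other than an ASCII letter or digit is special, and the function name is hypothetical.

```python
def has_special_characters(password):
    """Return True if the password contains any character that is not an
    ASCII letter or digit.

    Assumption: "special characters" means every character outside
    A-Z, a-z, and 0-9; the set rejected by the script is not documented here.
    """
    return any(not (ch.isascii() and ch.isalnum()) for ch in password)
```

A password for which the function returns False should satisfy the restriction under this assumption; if the operation step still fails, verify the stored credentials themselves.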
Issue
During a storage switchover or failover step of an Oracle Site Guard operation, the storage role reversal operation might fail, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:
Error: The action could not be completed because the the target (or one of its descendants) has the 'nodestroy' property set. Turn off the property for '1_test' and try again.
Description and Solution
This occurs if the project has the nodestroy property set. This property is called Prevent destruction in the Enterprise Manager Cloud Control interface.
Turn off this property and retry the operation step.
Issue
During a storage switchover or failover step of an Oracle Site Guard operation, the storage role reversal operation might fail while executing confirm reverse, and you might see the following error in the log file generated by the zfs_storage_role_reversal.sh script:
Error: The action could not be completed because the mountpoint of '<project_name>/<share_name>' would conflict with that of '<project_name>/<share_name>' (/export/<project_name>/<share_name>). Change the mountpoint of '<project_name>/<share_name>' and try again.
This occurs if at least one of the shares in any of the available packages for a given project is exported as a file system. Make sure that the exported property of all shares in all packages for the project is turned off.