Oracle® Application Server High Availability Guide
10g (10.1.4.0.1)

Part Number B28186-01

A Troubleshooting High Availability

This appendix describes common problems that you might encounter when deploying and managing Oracle Application Server in high availability configurations, and explains how to solve them.

A.1 Troubleshooting Active-Active Topologies

A.1.1 Registering an Application using ssoreg Fails

Problem

In high availability topologies where OracleAS Single Sign-On and Oracle Delegated Administration Services components are clustered in OracleAS Clusters, you get the following error message when you try to register an application using ssoreg.sh (ssoreg.bat on Windows):

java.io.EOFException
null
java.io.EOFException
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2435)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1245)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:324)
        at oracle.ias.sysmgmt.task.TaskMaster.daemon_exec(Unknown Source)
        at oracle.ias.sysmgmt.task.TaskMaster.remote_operation(Unknown Source)
        at oracle.ias.sysmgmt.cmdline.DcmCmdLine.ssoPropagate(Unknown Source)
        at oracle.ias.sysmgmt.cmdline.DcmCmdLine.execute(Unknown Source)
        at oracle.ias.sysmgmt.cmdline.DcmCmdLine.main(Unknown Source)
--end of dcmctl's output to stderr
Thu May 18 20:02:24 PDT 2006  dcmctl returned exit value 1
Thu May 18 20:02:24 PDT 2006  dcmctl returned unsuccessfully, exitValue 1
Thu May 18 20:02:24 PDT 2006  SSO registration tool failed.  Please check the
error in this log file, correct the problem and re-run the tool.

Solution

When you run ssoreg.sh (ssoreg.bat on Windows) in an environment where OracleAS Single Sign-On instances are in an OracleAS Cluster, you must ensure that the DCM daemon is running on all nodes in the cluster. This is because ssoreg invokes the DCM daemon on the node where you are running the command, and the DCM daemon needs to communicate with the other DCM daemons on the other nodes in the cluster.

To check that the DCM daemon is running, run the following command on all nodes in the cluster:

ORACLE_HOME/opmn/bin/opmnctl status

To start the DCM daemon, run the following command on nodes where the DCM daemon is not already running:

ORACLE_HOME/opmn/bin/opmnctl startproc ias-component=dcm-daemon
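
When the cluster has many nodes, the status check can be scripted. The following is a minimal sketch, not part of the product: it assumes that opmnctl status prints one row per process with pipe-separated columns (ias-component, process-type, pid, status) and that a running process reports the status Alive. The function name dcm_daemon_alive is hypothetical, and you would capture the command output on each node yourself.

```python
def dcm_daemon_alive(opmnctl_status_output):
    """Return True if the dcm-daemon row reports the status 'Alive'.

    Assumes pipe-separated rows: ias-component | process-type | pid | status.
    """
    for line in opmnctl_status_output.splitlines():
        columns = [column.strip() for column in line.split("|")]
        # Header and separator rows never have "dcm-daemon" in the first column.
        if len(columns) >= 4 and columns[0] == "dcm-daemon":
            return columns[-1] == "Alive"
    # No dcm-daemon row at all is treated as "not running".
    return False
```

A node for which the function returns False is a candidate for the startproc command shown above.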

A.1.2 OC4J_SECURITY Instance Fails to Start

Problem

You are running Oracle Internet Directory in an Oracle RAC environment, and the OC4J_SECURITY instance fails to start. You see the following message in the oidctl.log file:

[gsdsiConnect] ORA-1017, ORA-01017: invalid username/password; logon denied.

Solution

Check that the oidpwdlldap1 file is the same on all nodes in the Oracle RAC environment. You may have forgotten to copy the oidpwdlldap1 file to all Oracle RAC nodes after running oidpasswd to change the ODS password. See Section 9.6, "About Changing the ODS Password on an Oracle RAC System" for details.

A.1.3 Logging into OracleAS Single Sign-On Takes a Long Time

Problem

Logging into OracleAS Single Sign-On might take a long time if you are running OracleAS Single Sign-On and Oracle Internet Directory on opposite sides of a firewall (OracleAS Single Sign-On is running outside the firewall and Oracle Internet Directory inside the firewall) and if the firewall is configured to drop idle connections or recycle connections after the configured timeout period has elapsed.

Solution

  1. Set the timeout on OracleAS Single Sign-On connections to a value smaller than the firewall and load balancer timeout values. The OracleAS Single Sign-On server will remove connections that are idle for longer than the specified value.

    You specify this value (in minutes) using the connectionIdleTimeout parameter in the ORACLE_HOME/sso/conf/policy.properties file. For example, the following line sets the timeout value to 20 minutes, so the OracleAS Single Sign-On server removes connections that are idle for longer than 20 minutes.

    connectionIdleTimeout = 20
    
    

    Restart the OC4J server (OC4J_SECURITY) that is running the OracleAS Single Sign-On server for the new value to take effect.

  2. Set the timeout for database connections in the SQLNET.EXPIRE_TIME parameter in the ORACLE_HOME/network/admin/sqlnet.ora file. Set this value, too, to a value smaller than the firewall and load balancer timeout values.

    This parameter specifies how often the database server sends a probe packet to the client (which is the OracleAS Single Sign-On server). This periodic activity by the probe packet enables the OracleAS Single Sign-On server-to-database connections to stay active.

    The value is specified in minutes. In the following example, the database server sends the probe packet every 20 minutes to the client.

    SQLNET.EXPIRE_TIME = 20
    
    

    Restart the database for the new value to take effect.

Explanation: The firewall or load balancer might drop connections to Oracle Internet Directory and the database if the connections are idle for a certain time. When the firewall or load balancer drops a connection, it might not send a TCP close notification to the OracleAS Single Sign-On server. The OracleAS Single Sign-On server is then unaware that the connection is no longer valid and tries to use it to perform Oracle Internet Directory or database operations. When the OracleAS Single Sign-On server does not get a response, it tries the next connection. Eventually it tries all the connections in the pool before making fresh connections to Oracle Internet Directory or to the database.

By setting the timeout on the OracleAS Single Sign-On server and on the database to a value smaller than the timeout on the firewall or load balancer, you ensure that the connections are valid.
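
To verify the two settings against the network timeouts, a short script can read the values and compare them. This is a sketch under stated assumptions: the helper names read_minutes and stays_below are hypothetical, the files are assumed to contain plain "name = value" lines as in the examples above, and the firewall and load balancer timeouts are supplied by you in minutes.

```python
import re


def read_minutes(config_text, name):
    """Return the integer value of an uncommented 'name = value' line, or None."""
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r"\s*=\s*(\d+)\s*$", re.MULTILINE
    )
    match = pattern.search(config_text)
    return int(match.group(1)) if match else None


def stays_below(value_min, *network_timeouts_min):
    """True if the value is set and smaller than every network timeout given."""
    return value_min is not None and value_min < min(network_timeouts_min)
```

For example, pass the contents of policy.properties with the name connectionIdleTimeout, or the contents of sqlnet.ora with the name SQLNET.EXPIRE_TIME, and then compare the result with both the firewall and the load balancer timeout.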

A.1.4 Oracle Internet Directory Does Not Start Up on One of the Nodes

Problem

If the time difference between the nodes in the OracleAS Cluster (Identity Management) is greater than 250 seconds, the OID Monitor will stop Oracle Internet Directory on the node that is behind. For example, if the time on node A is ahead of node B's by more than 250 seconds, then the OID Monitor will stop Oracle Internet Directory processes on node B.

For details on how OID Monitor works, see Section 3.7.2, "OID Monitor Details".

For details on time synchronization, see Section 3.7.2.2, "Time Discrepancy Between Nodes".

Solution

Synchronize the time on all nodes to within 250 seconds of each other.
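
The 250-second rule can be expressed as a small check. The sketch below only models the rule as this appendix describes it; it is not OID Monitor's implementation. The function name nodes_behind is hypothetical, and the node clock readings (in seconds since a common epoch) are assumed to be collected by you at the same moment.

```python
MAX_SKEW_SECONDS = 250  # threshold stated in this appendix


def nodes_behind(node_times, max_skew=MAX_SKEW_SECONDS):
    """Return the names of nodes more than max_skew seconds behind the leader.

    node_times maps a node name to its clock reading in seconds.
    """
    if not node_times:
        return []
    latest = max(node_times.values())
    # A node exactly max_skew behind is still within tolerance.
    return sorted(
        name for name, seconds in node_times.items() if latest - seconds > max_skew
    )
```

With readings of 1000 for node A and 700 for node B, the function reports node B, which matches the example in the Problem description.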

A.1.5 Unable to Connect to Oracle Internet Directory, and Oracle Internet Directory Cannot Be Restarted

Problem

This issue applies only to Windows 2000 platforms. This issue has two symptoms:

Symptom #1: If you have configured your load balancer to monitor the Oracle Internet Directory ports using TCP port monitoring, you might see the "maximum number of connections reached" error in the Oracle Internet Directory log file. This means that clients are unable to connect to Oracle Internet Directory.

Symptom #2: If Oracle Internet Directory terminates, you are not able to restart it. When you try to restart it, you get a message that Oracle Internet Directory is unable to access its ports because the System Idle Process is already using them. Oracle Internet Directory needs exclusive access to its ports.

Solution

This problem is caused by an application (in this case, the load balancer) that performs TCP port monitoring on the Oracle Internet Directory ports. In TCP port monitoring, the application opens and closes connections to the Oracle Internet Directory ports. In Windows 2000, the connection is not closed properly; this is why you reach the maximum number of connections.

The workaround is not to use TCP port monitoring for the Oracle Internet Directory ports. Instead, use LDAP or HTTP port monitoring.

A.1.6 Cluster Configuration Assistant Fails During Installation

Problem

During the installation of distributed Oracle Identity Management topologies, the OracleAS Single Sign-On and Oracle Delegated Administration Services components are installed on their own nodes separate from the other Oracle Identity Management components. The Cluster Configuration Assistant may attempt to cluster the two resulting OracleAS Single Sign-On/Oracle Delegated Administration Services instances together. However, the error message "Instances containing disabled components cannot be added to a cluster" may appear. This message appears because Enterprise Manager cannot cluster instances with disabled components.

Solution

If the Cluster Configuration Assistant fails, you can cluster the instance after installation. To do so, you must use the "dcmctl joincluster" command instead of Application Server Control Console, because Application Server Control Console cannot cluster instances that contain disabled components. In this case, the disabled component is the "home" OC4J instance.

A.1.7 odisrv Process Does Not Fail Over After "opmnctl stopall"

Problem

In OracleAS Cluster (Identity Management) and distributed OracleAS Cluster (Identity Management) topologies, when you run opmnctl stopall to stop all OPMN-managed processes on a node, odisrv is not started automatically on the second node. This is because opmnctl stopall is a normal administrative shutdown, not an actual node failure. In a true node failure, odisrv is started on the remaining node when the death of the original odisrv process is detected.

Solution

If planned maintenance is required for OracleAS Cluster (Identity Management) and distributed OracleAS Cluster (Identity Management) topologies, use the oidctl command to explicitly stop and start odisrv.

On the node where odisrv is running, use the following command to stop it:

ORACLE_HOME/bin/oidctl connect=<dbConnect> server=odisrv instance=1 stop

On the remaining active node, start odisrv using the following command:

ORACLE_HOME/bin/oidctl connect=<dbConnect> server=odisrv instance=1
     flags="host=OIDhost port=OIDport" start

See Section 3.7.2.1, "Normal Shutdown vs. Process Failure" for details.

A.1.8 Oracle Internet Directory Processes Shut Down by OID Monitor

Problem

Oracle Internet Directory processes on one node are shut down by OID Monitor.

Solution

In active-active topologies, OID Monitor checks the time on each node running Oracle Internet Directory processes. If it discovers that the time difference between the nodes is more than 250 seconds, it shuts down the processes on the node that is behind in time.

To fix this, reset the time on the nodes such that the time on all nodes is within 250 seconds of each other. OID Monitor will detect the updated times and start up the Oracle Internet Directory processes.

See Section 3.7.2.2, "Time Discrepancy Between Nodes" for details.

A.1.9 Oracle Internet Directory Connections Being Disconnected by the Load Balancer or Firewall

Problem

The load balancer or firewall terminates connections to Oracle Internet Directory, and further connections from OC4J to Oracle Internet Directory cannot be made.

Solution

To fix this, set the orclLDAPConnTimeout attribute (in the "cn=dsaconfig, cn=configsets, cn=oracle internet directory" entry) to a value smaller than the "idle connection timeout" value configured on the load balancer or firewall. This prevents the load balancer or firewall from terminating connections to Oracle Internet Directory.

The orclLDAPConnTimeout attribute is expressed in minutes.

Note that in this release and also in the 10.1.2.2.0 patch set, the orclLDAPConnTimeout attribute is independent of the orclStatsPeriodicity attribute when Oracle Internet Directory calculates the idle time of a connection.

However, in previous releases (releases 9.0.4.2, 9.0.4.3, 10.1.2.0, 10.1.2.0.2, and 10.1.2.1), Oracle Internet Directory takes into account the values for both attributes when it calculates the idle time. For these releases, you need to set the attributes as follows:

  • Set the orclStatsPeriodicity attribute to a value less than half of the "idle connection timeout" value configured on the load balancer or firewall.

  • Set the orclLDAPConnTimeout attribute to a value less than the "idle connection timeout" value configured on the load balancer or firewall.

The attribute values are expressed in minutes.

The values of the orclStatsFlag and orclMaxTcpIdleConnTime attributes are not used here.

For example, assume that the "idle connection timeout" value on the load balancer or firewall is set at 15. In this case, you can set the orclStatsPeriodicity attribute to 7 (which is less than half of 15) and the orclLDAPConnTimeout attribute to 12 (which is less than 15).

The orclLDAPConnTimeout attribute is in the "cn=dsaconfig, cn=configsets, cn=oracle internet directory" entry, while the other attributes are in the root DSE entry.
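
The rules above reduce to two comparisons, which the following sketch checks. The function name check_oid_timeouts is hypothetical, and all three values are assumed to be in the same unit (minutes). Pass stats_periodicity only for the earlier releases listed above, where both attributes are taken into account.

```python
def check_oid_timeouts(idle_timeout, ldap_conn_timeout, stats_periodicity=None):
    """Return a list of rule violations; an empty list means the values are fine."""
    problems = []
    if ldap_conn_timeout >= idle_timeout:
        problems.append(
            "orclLDAPConnTimeout must be less than the idle connection timeout"
        )
    # "Less than half" is tested as 2 * value < timeout to avoid fractions.
    if stats_periodicity is not None and 2 * stats_periodicity >= idle_timeout:
        problems.append(
            "orclStatsPeriodicity must be less than half of the idle connection timeout"
        )
    return problems
```

With the values from the example (idle timeout 15, orclLDAPConnTimeout 12, orclStatsPeriodicity 7), the function returns an empty list. Raising orclStatsPeriodicity to 8 produces one violation, because 8 is not less than half of 15.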

A.2 Troubleshooting Active-Passive Topologies

A.2.1 Unable to Perform Online Database Backup and Restore in OracleAS Cold Failover Cluster Environment

Problem

You are unable to perform an online recovery of the Infrastructure database because of resource dependencies, and because the cluster administrator tries to bring the database down and then up during the recovery phase of the Backup and Recovery Tool.

Solution 1

To perform a clean recovery, use the following steps:

  1. Bring all resources offline using the cluster administrator (for Windows, use Oracle Fail Safe).

  2. Perform a normal shutdown of the Infrastructure database.

  3. Start only the database service. You can do this from the Windows Service Manager, or you can run the following command:

    net start OracleService<SID>
    
    
  4. Run the Backup and Recovery Tool to perform the recovery of the database.

Solution 2

For Windows, the following steps can be used to perform a recovery:

  1. In Oracle Fail Safe, under "Cluster Resources", select "ASDB (DB Resource)" in the Database tab.

  2. For "Database Polling", select "Disabled" from the drop down list.

  3. Using the Backup and Recovery Tool, perform an online restore of the Infrastructure database.

The database is not accessible for a brief period while the Backup and Recovery Tool stops and starts the database. Once the database starts up, it can be accessed by middle-tier and Infrastructure components.

A.2.2 Cannot Connect to Database for Restoration (Windows)

Problem

When you stop the OracleAS Metadata Repository database using Microsoft Cluster Administrator, Microsoft Cluster Administrator performs the strictest and fastest abort to shut down the database service. After the shutdown, you are unable to connect to the database.

The following steps illustrate the problem:

  1. Access an OracleAS Metadata Repository that is used for testing.

  2. Corrupt a database file (note: do not modify the ts$ table).

  3. Issue a SQL query to ensure that the database is corrupted.

  4. Using Microsoft Cluster Administrator, verify that the database is online.

  5. Using Oracle Fail Safe Manager, disable database polling.

  6. Using Microsoft Cluster Administrator, take the database offline. This also takes OPMN and Application Server Control Console offline as they are dependencies of the database.

  7. Try connecting as sysdba. The connection should fail.

At this time, you are unable to connect to the database to run backup/restore scripts to restore the database to a good version (because you corrupted a database file in step 2 above).

Solution

Use Oracle Fail Safe Manager (instead of Microsoft Cluster Administrator) to shut down the database. To do so:

  1. In the Oracle Fail Safe Manager, right-click the "ASDB" resource (default if not changed), and select "Immediate".

  2. Start the database service using Windows Service Manager.

  3. Connect to the database as sysdba. The connection should be successful.

A.3 Troubleshooting OracleAS Disaster Recovery Topologies

This section describes common problems and solutions in OracleAS Disaster Recovery configurations.

A.3.1 Standby Site Not Synchronized

In the OracleAS Disaster Recovery standby site, you may find that the site's OracleAS Metadata Repository is not synchronized with the OracleAS Metadata Repository in the primary site.

Problem

The OracleAS Disaster Recovery solution requires manual configuration and shipping of data files from the primary site to the standby site. Also, the data files (archived database log files) are not applied automatically in the standby site, that is, OracleAS Disaster Recovery does not use managed recovery in Oracle Data Guard.

Solution

The archive log files have to be applied manually. The steps to perform this task are described in Chapter 11, "OracleAS Disaster Recovery".

A.3.2 Failure to Bring Up Standby Instances After Failover or Switchover

Standby instances are not started after a failover or switchover operation.

Problem

IP addresses are used in instance configuration. OracleAS Disaster Recovery setup does not require identical IP addresses in peer instances between the production and standby site. OracleAS Disaster Recovery synchronization does not reconcile IP address differences between the production and standby sites. Thus, if you use explicit IP address xxx.xx.xxx.xx in your configuration, the standby configuration after synchronization will not work.

Solution

Avoid using explicit IP addresses. For example, in OracleAS Web Cache and Oracle HTTP Server configurations, use ANY or host names instead of IP addresses as listening addresses.
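
To find explicit IP addresses before you synchronize, you can scan the text of each configuration file for IPv4 literals. The following is an illustrative sketch: the function name find_ip_literals is hypothetical, it only recognizes dotted-decimal IPv4 addresses, and it knows nothing about which directives a given component treats as listening addresses, so every hit needs manual review.

```python
import re

# Four dot-separated groups of 1-3 digits that are not part of a longer
# dotted number (for example a five-part version string) or an identifier.
IPV4_CANDIDATE = re.compile(r"(?<![\w.])(\d{1,3}(?:\.\d{1,3}){3})(?![\w]|\.\d)")


def find_ip_literals(config_text):
    """Return (line_number, address) pairs for IPv4 literals in the text."""
    hits = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        for match in IPV4_CANDIDATE.finditer(line):
            candidate = match.group(1)
            # Keep only candidates whose octets are all in the range 0-255.
            if all(int(octet) <= 255 for octet in candidate.split(".")):
                hits.append((number, candidate))
    return hits
```

Each reported line is a place where a host name or ANY could replace the address.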

A.3.3 Switchover Operation Fails At the Step dcmctl resyncInstance -force -script

The OracleAS Disaster Recovery asgctl switchover operation requires that the value of the TMP variable be defined the same in the opmn.xml file on both the primary and standby sites.

Problem

OracleAS Disaster Recovery switchover fails at the step dcmctl resyncInstance -force -script and displays a message that a directory could not be found.

Solution

During a switchover operation, the opmn.xml file is copied from the primary site to the standby site. For this reason, the value of the TMP variable must be defined the same in the opmn.xml file on both primary and standby sites; otherwise, the switchover operation will fail. Make sure the TMP variable is defined identically in the opmn.xml files and resolves to the same directory structure on both sites before attempting to perform an asgctl switchover operation.

For example, the following code snippets for a Windows and UNIX environment show a sample definition of the TMP variable.

Example in Windows Environment: 
------------------------------- 
.
.
.
<ias-instance id="infraprod.iasha28.us.oracle.com"> 
 <environment> 
 <variable id="TMP" value="C:\DOCUME~1\ntregres\LOCALS~1\Temp"/> 
 </environment> 
.
.
.
Example in Unix Environment: 
---------------------------- 
.
.
.
<ias-instance id="infraprod.iasha28.us.oracle.com"> 
 <environment> 
 <variable id="TMP" value="/tmp"/> 
 </environment> 
.
.
.

A workaround to this problem is to change the value of the TMP variable in the opmn.xml file on the primary site, perform a dcmctl update config operation, then perform the asgctl switchover operation. This approach saves you having to reinstall the mid-tiers to make use of an altered TMP variable.
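
Before a switchover, you can compare the TMP definitions in the two opmn.xml files. The sketch below is an assumption-laden illustration: the function names are hypothetical, it expects the TMP variable inside an environment element directly under each ias-instance element (as in the snippets above), and it ignores any XML namespace on the elements. It compares the text of the values only; it does not confirm that the directory exists on either site.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Reduce a namespaced tag such as '{uri}variable' to 'variable'."""
    return tag.rsplit("}", 1)[-1]


def tmp_values(opmn_xml_text):
    """Map each ias-instance id to the TMP value in its environment block."""
    root = ET.fromstring(opmn_xml_text)
    values = {}
    for instance in root.iter():
        if local_name(instance.tag) != "ias-instance":
            continue
        for env in instance:
            if local_name(env.tag) != "environment":
                continue
            for var in env:
                if local_name(var.tag) == "variable" and var.get("id") == "TMP":
                    values[instance.get("id")] = var.get("value")
    return values


def tmp_matches(primary_xml_text, standby_xml_text):
    """True when both files define the same TMP value for every instance."""
    return tmp_values(primary_xml_text) == tmp_values(standby_xml_text)
```

Read each opmn.xml file into a string and pass both strings to tmp_matches. A False result means the TMP definitions differ and the switchover is likely to fail at the resyncInstance step.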

A.3.4 Unable to Start Standalone OracleAS Web Cache Installations at the Standby Site

After a failover or switchover, a standalone OracleAS Web Cache installation cannot be started at the standby site, possibly because it is misconfigured.

Problem

OracleAS Disaster Recovery synchronization does not synchronize standalone OracleAS Web Cache installations.

Solution

Use the standard Oracle Application Server full CD image to install the OracleAS Web Cache component.

A.3.5 Standby Site Middle-tier Installation Uses Wrong Hostname

A middle-tier installation in the standby site uses the wrong hostname even after the machine's physical hostname is changed.

Problem

Besides modifying the physical hostname, you also need to make it the first entry in the /etc/hosts file. If you do not, the installer uses the wrong hostname.

Solution

Put the physical hostname as the first entry in the /etc/hosts file. See Section 11.2.2, "Configuring Hostname Resolution" for more information.

A.3.6 Failure of Farm Verification Operation with Standby Farm

When performing a verify farm with standby farm operation, the operation fails with an error message indicating that the middle-tier machine instance cannot be found and that the standby farm is not symmetrical with the production farm.

Problem

The verify farm with standby farm operation is trying to verify that the production and standby farms are symmetrical to one another, that they are consistent, and conform to the requirements for disaster recovery.

The verify operation fails because it sees the middle-tier instance as mid_tier.<hostname> and not as mid_tier.<physical_hostname>. You might suspect a problem with the _CLUSTER_NETWORK_NAME_ environment variable, which is set during installation. In this case, however, a check of the _CLUSTER_NETWORK_NAME_ setting finds it to be correct. Instead, a check of the /etc/hosts file shows that the entries for the middle tier in question are incorrect: all middle-tier installations take the hostname from the second column of the /etc/hosts file.

For example, assume the following scenario:

  • Two environments are used: examp1 and examp2

  • OracleAS Infrastructure (Oracle Identity Management and OracleAS Metadata Repository) is first installed on examp1 and examp2 as host infra

  • OracleAS middle-tier (OracleAS Portal and OracleAS Wireless) is then installed on examp1 and examp2 as host node1

  • Basically, these are two installations (OracleAS Infrastructure and OracleAS middle-tier) on a single node

  • Updated the latest duf.jar and backup_restore files on all four Oracle homes

  • Started OracleAS Guard (asgctl) on all four Oracle homes (OracleAS Infrastructure and OracleAS middle-tier on two nodes)

  • Performed asgctl operations: connect asg, set primary, dump farm

  • Performed asgctl verify farm with standby farm operation, but it fails because it sees the instance as mid_tier.examp1 and not as mid_tier.node1.us.oracle.com

A check of the /etc/hosts file shows the following entry:

123.45.67.890 examp1 node1.us.oracle.com node1 infra

With this entry, ias.properties and farms show the following, and the verify operation fails:

IASname=midtier_inst.examp1

However, the /etc/hosts file should actually be the following:

123.45.67.890 node1.us.oracle.com node1 infra

With this entry, ias.properties and farms show the following, and the verify operation succeeds:

IASname=midtier_inst.node1.us.oracle.com

Solution

Check and change the second column entry in your /etc/hosts file to match the hostname of the middle-tier node in question as described in the previous explanation.
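
The second-column rule can be checked with a few lines of code. This sketch follows the behavior described in this section and is not taken from the installer; the function names are hypothetical, and midtier_inst is the instance name from the example above.

```python
def installer_hostname(hosts_line):
    """Return the second column of an /etc/hosts line, or None if it has none."""
    # Drop any trailing comment, then split on whitespace.
    fields = hosts_line.split("#", 1)[0].split()
    return fields[1] if len(fields) >= 2 else None


def expected_ias_name(instance_name, hosts_line):
    """Build the IASname value that would result from this /etc/hosts line."""
    hostname = installer_hostname(hosts_line)
    return None if hostname is None else f"{instance_name}.{hostname}"
```

Applied to the two /etc/hosts entries shown earlier, the first yields midtier_inst.examp1 and the corrected entry yields midtier_inst.node1.us.oracle.com.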

A.3.7 Sync Farm Operation Returns Error Message

A sync farm to operation returns the error message: "Cannot Connect to asdb"

Problem

Occasionally, an administrator may forget to set the primary database with the asgctl command-line utility before performing an operation that requires a connection to the asdb database. The following example shows this scenario for a sync farm to operation:

ASGCTL> connect asg hsunnab13 ias_admin/iastest2
Successfully connected to hsunnab13:7890
ASGCTL>  
.
.
.
<Other asgctl operations may follow, such as verify farm, dump farm, 
<and show operation history, and so forth that do not require the connection
<to the asdb database to be established or a time span may elapse of no activity
<and the administrator may miss performing this vital command.
.
.
.
ASGCTL> sync farm to usunnaa11
prodinfra(asr1012): Syncronizing each instance in the farm to standby farm
prodinfra: -->ASG_ORACLE-300: ORA-01031: insufficient privileges
prodinfra: -->ASG_DUF-3700: Failed in SQL*Plus executing SQL statement:  connect null/******@asdb.us.oracle.com as sysdba;.
prodinfra: -->ASG_DUF-3502: Failed to connect to database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3504: Failed to start database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3027: Error while executing Syncronizing each instance in the farm to standby farm at step - init step.

Solution

Perform the asgctl set primary database command. This command sets the connection parameters required to open the asdb database in order to perform the sync farm to operation. Note that the set primary database command must also precede the instantiate farm to command and switchover farm to command if the primary database has not been specified in the current connection session.

A.3.8 On Windows Systems Use of asgctl startup Command May Fail If the PATH Environment Variable Has Exceeded 1024 Characters

On Windows systems, the system PATH environment variable can exceed the 1024-character limit if many OracleAS instances, many third-party software products, or both are installed. In that case, the asgctl startup command may fail, because it starts the OracleAS Guard server outside of OPMN and the system cannot resolve the directory path.

Problem

Occasionally, on Windows systems with many OracleAS instances, third-party software installations, or both, the asgctl startup command, which is run outside of OPMN, may display a popup error stating that it could not find the dynamic link library orawsec9.dll, followed by a DufException. For example:

C:\product\10.1.3\OC4J_1\dsa\bin> asgctl startup
<<Popup Error:>>
The dynamic link library *orawsec9.dll* could not be found.
<<The exception:>>
oracle.duf.DufException
        at oracle.duf.DufOsBase.constructInstance(DufOsBase.java:1331)
        at oracle.duf.DufOsBase.getDufOs(DufOsBase.java:122)
        at oracle.duf.DufHomeMgr.getCurrentHomePath(DufHomeMgr.java:582)
        at oracle.duf.dufclient.DufClient.main(DufClient.java:132)
stado42: -->ASG_SYSTEM-100: oracle.duf.DufException
----------------------------------------------------------------------------- 

However, this dll does exist in the ORACLE_HOME\bin directory.

This error is not seen in OracleAS Guard standalone kit because the file orawsec9.dll exists in the ORACLE_HOME\dsa\bin folder.

Solution

The workaround is either to edit the system PATH variable manually so that it contains the required path information, or to override PATH in the command prompt with only the relevant directories. For example:

C:\> set PATH=C:\product\10.1.3\OracleAS_OC4J_2\bin;C:\product\10.1.3\OracleAS_OHS1\jre\1.4.2\bin\client;C:\product\10.1.3\OracleAS_OHS1\jre\1.4.2\bin;C:\product\10.1.3\OracleAS_OHS1\bin;C:\product\10.1.3\OC4J_1\bin

C:\product\10.1.3\OC4J_1\dsa\bin> asgctl startup
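
Before overriding PATH, it can help to confirm that its length is the cause. The sketch below is illustrative only: the function names are hypothetical, the 1024-character limit is the one stated in this section, and the directory list you pass to build_path is your own choice of the directories that asgctl needs.

```python
PATH_LIMIT = 1024  # character limit stated in this section


def path_too_long(path_value, limit=PATH_LIMIT):
    """True if the PATH string exceeds the character limit."""
    return len(path_value) > limit


def build_path(directories):
    """Join directories with the Windows PATH separator, skipping blanks."""
    return ";".join(d.strip() for d in directories if d.strip())
```

On the affected system, pass os.environ["PATH"] to path_too_long. If it returns True, build a shorter value with build_path and use it in the set PATH command.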

A.4 Need More Help?

If the information in the previous sections is not sufficient, you can find more solutions on Oracle MetaLink, http://metalink.oracle.com. If you do not find a solution for your problem, log a service request.

