|Oracle® Application Server High Availability Guide
10g Release 2 (10.1.2)
This appendix describes common problems that you might encounter when deploying and managing Oracle Application Server in high availability configurations, and explains how to solve them. It contains the following topics:
This section describes common problems and solutions in OracleAS Cold Failover Cluster configurations. It contains the following topics:
OracleAS Web Cache does not fail over in an OracleAS Cold Failover Cluster environment (that is, it does not start up on the standby node). It writes the following error in the log file:
[26/Apr/2005:14:36:08 -0700] [error 13079] [ecid: -] No matching CACHE element found in webcache.xml for current hostname (hostname) and ORACLE_HOME (/path/to/oracle/home)
You need to perform these steps for OracleAS Web Cache to fail over in an OracleAS Cold Failover Cluster environment:
Create a two-node OracleAS Web Cache cluster using Application Server Control Console. For the host name, use the physical hostnames of the nodes in the OracleAS Cold Failover Cluster.
Keep both of these cache entries (the CACHE element in webcache.xml) in sync, except for the host name.
For details on OracleAS Web Cache clusters, see chapter 10, "Configuring Cache Clusters", in the Oracle Application Server Web Cache Administrator's Guide.
Issues with online database backup and restore are noted here. This information pertains to the OracleAS Cold Failover Cluster environment.
You are unable to perform an online recovery of the Infrastructure database because, owing to resource dependencies, the cluster administrator tries to bring the database down and then up while the Backup and Recovery Tool is in its recovery phase.
To perform a clean recovery, use the following steps:
Bring all resources offline using the cluster administrator (for Windows, use Oracle Fail Safe).
Perform a normal shutdown of the Infrastructure database.
Start only the database service using the following command:
net start OracleService
Run the Backup and Recovery Tool to perform the recovery of the database.
For Windows, the following steps can be used to perform a recovery:
In Oracle Fail Safe, under "Cluster Resources", select "ASDB(DB Resource)" in the "Database" tab.
For "Database Polling", select "Disabled" from the drop down list.
Using the Backup and Recovery Tool, perform an online restore of the Infrastructure database.
The database is not accessible for a brief period while the Backup and Recovery Tool stops and starts the database. Once the database starts up, it can be accessed by middle-tier and Infrastructure components.
Unable to connect to the idle OracleAS Metadata Repository database to restore it after it is shut down using Microsoft Cluster Administrator.
When you stop the OracleAS Metadata Repository database using Microsoft Cluster Administrator, Microsoft Cluster Administrator performs the strictest and fastest abort to shut down the database service. After the shutdown, you are unable to connect to the database.
The following steps illustrate the problem:
Access an OracleAS Metadata Repository that is used for testing.
Corrupt a database file (note: do not modify the
Issue a SQL query to ensure that the database is corrupted.
Using Microsoft Cluster Administrator, verify that the database is online.
Using Oracle Fail Safe Manager, disable database polling.
Using Microsoft Cluster Administrator, take the database offline. This also takes OPMN and Application Server Control Console offline because they depend on the database.
Try connecting as sysdba. The connection should fail.
Use the Oracle Fail Safe Manager to shut down the database. To do so:
In the Oracle Fail Safe Manager, right-click the "ASDB" resource (default if not changed), and select "Immediate".
Start the database service using Windows Service Manager.
Connect to the database as sysdba. The connection should be successful.
This section describes common problems and solutions in OracleAS Cluster (Identity Management) configurations. It contains the following topics:
Problems and solutions related to multimaster replication and other Oracle Internet Directory features are documented in the troubleshooting section of Oracle Internet Directory Administrator's Guide.
Logging into OracleAS Single Sign-On might take a long time if you are running OracleAS Single Sign-On and Oracle Internet Directory on opposite sides of a firewall (OracleAS Single Sign-On is running outside the firewall and Oracle Internet Directory inside the firewall) and if the firewall is configured to drop idle connections or recycle connections after the configured timeout period has elapsed.
Set the timeout on OracleAS Single Sign-On connections to a value smaller than the firewall and load balancer timeout values. The OracleAS Single Sign-On server will remove connections that are idle for longer than the specified value.
You specify this value (in minutes) using the connectionIdleTimeout parameter in the /sso/conf/policy.properties file. For example, the following line sets the timeout value to 20 minutes. The OracleAS Single Sign-On server will remove connections that are idle for longer than 20 minutes.
connectionIdleTimeout = 20
Restart the OC4J server (OC4J_SECURITY) that is running the OracleAS Single Sign-On server for the new value to take effect.
Set the timeout for database connections using the SQLNET.EXPIRE_TIME parameter in the /network/admin/sqlnet.ora file. Set this, too, to a value smaller than the firewall and load balancer timeout values.
This parameter specifies how often the database server sends a probe packet to the client (which is the OracleAS Single Sign-On server). This periodic activity by the probe packet enables the OracleAS Single Sign-On server-to-database connections to stay active.
The value is specified in minutes. In the following example, the database server sends the probe packet every 20 minutes to the client.
SQLNET.EXPIRE_TIME = 20
Restart the database for the new value to take effect.
Explanation: The firewall or load balancer might drop connections to Oracle Internet Directory and the database if the connections are idle for a certain time. When the firewall or load balancer drops a connection, it might not send a TCP close notification to the OracleAS Single Sign-On server. The OracleAS Single Sign-On server is then unaware that the connection is no longer valid and tries to use it to perform Oracle Internet Directory or database operations. When the OracleAS Single Sign-On server does not get a response, it tries the next connection. Eventually it tries all the connections in the pool before making fresh connections to Oracle Internet Directory or to the database.
By setting the timeout on the OracleAS Single Sign-On server and on the database to a value smaller than the timeout on the firewall or load balancer, you ensure that the connections are valid.
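The interaction above can be modeled in a few lines. The sketch below is illustrative only — it is not OracleAS code, and the class name, timeout values, and return strings are invented: a firewall silently drops connections idle longer than its timeout, so a pool whose own idle timeout is smaller evicts idle connections first and never hands out a silently dead one.

```python
# Illustrative model (not OracleAS code) of the failure mode described above.
# A firewall silently drops connections idle longer than FIREWALL_TIMEOUT;
# a pool whose own idle timeout is smaller evicts them first, so it never
# hands out a silently dead connection.

FIREWALL_TIMEOUT = 30 * 60   # firewall drops connections idle > 30 minutes


class ConnectionPool:
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout   # client-side timeout, in seconds
        self.connections = []              # (conn_id, last_used_time) pairs

    def add(self, conn_id, last_used):
        self.connections.append((conn_id, last_used))

    def checkout(self, now):
        """Return a usable connection id, or open a fresh one."""
        # Proactively close connections idle past the client-side timeout,
        # which is what connectionIdleTimeout accomplishes for the SSO server.
        self.connections = [(c, t) for c, t in self.connections
                            if now - t <= self.idle_timeout]
        for conn_id, last_used in self.connections:
            if now - last_used > FIREWALL_TIMEOUT:
                continue   # silently dead: the firewall already dropped it
            return conn_id
        return "fresh-connection"


# Client timeout (20 minutes) below the firewall's (30 minutes):
pool = ConnectionPool(idle_timeout=20 * 60)
pool.add("conn-1", last_used=0)
print(pool.checkout(now=40 * 60))   # conn-1 was evicted -> fresh-connection
```

Because the pool's own timeout fires first, the stale connection is closed before the application can try to use it.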
If the time difference between the nodes in the OracleAS Cluster (Identity Management) is greater than 250 seconds, the Oracle Internet Directory Monitor (oidmon) will stop Oracle Internet Directory on the node that is behind. For example, if the time on node A is ahead of node B's by more than 250 seconds, then oidmon will stop the Oracle Internet Directory processes on node B. This is because the oidmon processes on all the nodes update the database every 10 seconds to tell the other nodes that they are running. If a node does not update the database for 250 seconds, the other nodes treat it as a failed node.
Synchronize the time on all nodes to within 250 seconds of each other.
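The 250-second rule can be illustrated with a small model. This is a sketch of the detection logic as described above, not oidmon's actual implementation; the function, constants, and node names are invented:

```python
# Sketch of the failure-detection rule described above (not actual oidmon
# code): every node updates a shared registry about every 10 seconds, and a
# peer is treated as failed once its last update is over 250 seconds old.

HEARTBEAT_INTERVAL = 10    # seconds between registry updates
FAILURE_THRESHOLD = 250    # seconds of silence before a node is "failed"


def failed_nodes(last_updates, now):
    """Return the nodes whose last registry update looks too old."""
    return sorted(node for node, ts in last_updates.items()
                  if now - ts > FAILURE_THRESHOLD)


# Node B updates on schedule, but its clock is 300 seconds behind node A's,
# so its timestamps look stale from node A's point of view.
now = 1000.0
updates = {
    "nodeA": now - HEARTBEAT_INTERVAL,          # fresh
    "nodeB": now - HEARTBEAT_INTERVAL - 300,    # skewed by 300 seconds
}
print(failed_nodes(updates, now))   # ['nodeB'] -- a false failure
```

A skew larger than the threshold is indistinguishable from a genuinely silent node, which is why the clocks must stay within 250 seconds of each other.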
This issue applies only to Windows 2000 platforms. This issue has two symptoms:
Symptom #1: If you have configured your load balancer to monitor the Oracle Internet Directory ports using TCP port monitoring, you might see the "maximum number of connections reached" error in the Oracle Internet Directory log file. This means that clients are unable to connect to Oracle Internet Directory.
Symptom #2: If Oracle Internet Directory terminates, you are not able to restart it. When you try to restart it, you get a message that Oracle Internet Directory is unable to access its ports because the System Idle Process is already using them. Oracle Internet Directory needs exclusive access to its ports.
This problem is caused by an application (in this case, the load balancer) that performs TCP port monitoring on the Oracle Internet Directory ports. In TCP port monitoring, the application opens and closes connections to the Oracle Internet Directory ports. In Windows 2000, the connection is not closed properly; this is why you reach the maximum number of connections.
The workaround is not to use TCP port monitoring for the Oracle Internet Directory ports. Instead, use LDAP or HTTP port monitoring.
Problems encountered during the clustering of components using the Cluster Configuration Assistant are addressed here.
During the installation of distributed Oracle Identity Management configurations, the OracleAS Single Sign-On and Oracle Delegated Administration Services components are installed in two of their own nodes separate from the other Oracle Identity Management components. The Cluster Configuration Assistant may attempt to cluster the two resulting OracleAS Single Sign-On/Oracle Delegated Administration Services instances together. However, the error message "Instances containing disabled components cannot be added to a cluster" may appear. This message appears because Enterprise Manager cannot cluster instances with disabled components.
If the Cluster Configuration Assistant fails, you can cluster the instance after installation. To do so, you must use the dcmctl joincluster command instead of Application Server Control Console, because Application Server Control Console cannot cluster instances that contain disabled components (in this case, the "home" OC4J instance is disabled).
During high availability Infrastructure installation, the Oracle Ultra Search Configuration Assistant cannot connect to an Oracle Internet Directory instance at port 3060 of the virtual hostname provided in the virtual hostname addressing screen.
A common mistake can be made when virtual hostname addressing is used during Infrastructure installation. The load balancer virtual server name is entered, and the load balancer is set up correctly to assume this name. However, the Infrastructure node is not set up correctly to resolve this name. Thus, when the Oracle Ultra Search Configuration Assistant on the Infrastructure node tries to connect to the load balancer virtual server name, the Configuration Assistant cannot find the load balancer.
The solution is to set up name resolution correctly on the Infrastructure machine for the load balancer virtual server name. This procedure is platform dependent; check your operating system manual for the exact steps. In Unix, this usually involves editing the /etc/hosts file and making sure this file is used for name resolution by editing the /etc/nsswitch.conf file. In Windows, this usually involves editing the hosts file in the Windows system directory (typically C:\WINDOWS\system32\drivers\etc\hosts).
Issues with odisrv process failover between nodes are documented here.
In any OracleAS Cluster (Identity Management) solution, when opmnctl stopall is executed to stop all OPMN-managed processes on a node, odisrv is not started automatically on the second node, because opmnctl stopall is a normal administrative shutdown, not an actual node failure. In a true node failure, odisrv is started on the remaining node upon detection of the death of the original odisrv process.
If planned maintenance is required for an OracleAS Cluster (Identity Management), use the oidctl command to explicitly stop and start odisrv.
On the node where odisrv is running, use the following command to stop it:
ORACLE_HOME/bin/oidctl connect=<dbConnect> server=odisrv inst=1 stop
On the remaining active node, start odisrv using the following command:
ORACLE_HOME/bin/oidctl connect=<dbConnect> server=odisrv instance=1 flags="host=OIDhost port=OIDport" start
In a OracleAS Cluster (Identity Management) configuration, the Oracle Internet Directory Monitor (OIDMON) on each node updates the directory database every 10 seconds with metadata. At the same time, it queries the database to verify that all other directory servers are running.
If an OIDMON does not update the database for 250 seconds, the other nodes assume that that node has failed. This failure detection can be triggered erroneously by a node whose system clock differs from the other nodes' clocks by more than 250 seconds. When this happens, OIDMON on one of the other nodes initiates failover operations, which include locally bringing up the processes that were running on the supposedly failed node. The node where these processes are started continues processing the operations that were underway on the failed node.
As an example, assume an OracleAS Cluster (Identity Management) configuration with nodes A and B. The system clock on node B is 300 seconds behind node A's clock. Node B updates its metadata in the directory database, which includes the system clock value. Node A queries the database for active Oracle Internet Directory servers and determines that node B has failed because its reported time value is 300 seconds behind. Node A then initiates failover operations by locally starting all Oracle Internet Directory server processes that were running on node B.
The system clock value on all nodes in the OracleAS Cluster (Identity Management) configuration should be synchronized, for example using Greenwich Mean Time, so that there is a discrepancy of no more than 250 seconds between them.
Refer to the chapters on Rack-Mounted directory server configurations in the Oracle Internet Directory Administrator's Guide.
If a load balancer is deployed in front of Oracle Application Server instances that are clustered together, configuration files of the instances may not have the correct load balancer virtual server name specified.
For a cluster of Oracle Application Server instances front-ended by a load balancer, a redirect back to the cluster may not contain the load balancer virtual server name. Dynamic pages created by a servlet or JSP may also not use the correct load balancer virtual server name. In both cases, the local hostname is most likely used instead.
To correctly specify the load balancer virtual server name, modifications have to be made to the httpd.conf and default-web-site.xml files for each instance.
For each Oracle Application Server instance, perform the following steps:
Perform the following steps for Oracle HTTP Server:
Stop the Oracle HTTP Server using the following command:
opmnctl stopproc ias_component=HTTP_Server
In Oracle HTTP Server's httpd.conf file, change the value of the ServerName directive to the virtual server name of your load balancer. For example, if it is set to "localhost", change it to the virtual server name of your load balancer.
In the same httpd.conf file, change the value of the Port directive to the port number your load balancer is configured with for incoming requests. For example, if the port number specified is 7777, change it to 80 if that is the port configured on your load balancer.
Execute the following command to update the DCM repository with the above changes:
dcmctl updateConfig -ct ohs
Start the Oracle HTTP Server using the following command:
opmnctl startproc ias_component=HTTP_Server
Perform the following steps for OC4J:
Stop the OC4J processes for each OracleAS instance using the following command:
opmnctl stopproc ias_component=OC4J
Edit the default-web-site.xml file so that the web site's front-end host and port name the load balancer: replace "load_balancer_name" with the virtual server name of your load balancer and "port_number" with the port number that is configured for incoming requests on your load balancer (these values are similar to those you entered for the ServerName and Port directives in httpd.conf).
Execute the following command to update the DCM repository with the changes you made in the default-web-site.xml file:
dcmctl updateconfig -ct oc4j
Start the OC4J instances using the following command:
opmnctl startproc ias_component=OC4J
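Taken together, the edits in the steps above amount to two small configuration changes. The fragments below are a sketch with placeholder values: lb.example.com, port 80, and the file paths are assumptions, and the frontend element is an assumption about the intended default-web-site.xml line, so verify it against your own file before relying on it.

```
# ORACLE_HOME/Apache/Apache/conf/httpd.conf
ServerName lb.example.com
Port 80
```

```
<!-- ORACLE_HOME/j2ee/home/config/default-web-site.xml, inside <web-site> -->
<frontend host="lb.example.com" port="80" />
```

After both edits, run the dcmctl updateConfig commands shown in the steps so the DCM repository matches the files on disk.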
This section describes common problems and solutions in OracleAS Disaster Recovery configurations. It contains the following topics:
In the OracleAS Disaster Recovery standby site, you may find that the site's OracleAS Metadata Repository is not synchronized with the OracleAS Metadata Repository in the primary site.
The OracleAS Disaster Recovery solution requires manual configuration and shipping of data files from the primary site to the standby site. Also, the data files (archived database log files) are not applied automatically in the standby site, that is, OracleAS Disaster Recovery does not use managed recovery in Oracle Data Guard.
The archive log files have to be applied manually. The steps to perform this task are found in Chapter 13, "OracleAS Disaster Recovery".
Standby instances are not started after a failover or switchover operation.
IP addresses are used in instance configuration. OracleAS Disaster Recovery setup does not require identical IP addresses in peer instances between the production and standby site. OracleAS Disaster Recovery synchronization does not reconcile IP address differences between the production and standby sites. Thus, if you use explicit IP address xxx.xx.xxx.xx in your configuration, the standby configuration after synchronization will not work.
Avoid using explicit IP addresses. For example, in OracleAS Web Cache and Oracle HTTP Server configurations, use ANY or host names instead of IP addresses as listening addresses.
OracleAS Web Cache cannot be started at the standby site possibly due to misconfigured standalone OracleAS Web Cache after failover or switchover.
OracleAS Disaster Recovery synchronization does not synchronize standalone OracleAS Web Cache installations.
Use the standard Oracle Application Server full CD image to install the OracleAS Web Cache component.
A middle-tier installation in the standby site uses the wrong hostname even after the machine's physical hostname is changed.
Besides modifying the physical hostname, you also need to put it as the first entry in the /etc/hosts file. Failure to do so will cause the installer to use the wrong hostname.
Put the physical hostname as the first entry in the /etc/hosts file. See Section 13.2.2, "Configuring Hostname Resolution" for more information.
When performing a verify farm with standby farm operation, the operation fails with an error message indicating that the middle-tier machine instance cannot be found and that the standby farm is not symmetrical with the production farm.
The verify farm with standby farm operation is trying to verify that the production and standby farms are symmetrical to one another, that they are consistent, and conform to the requirements for disaster recovery.
The verify operation is failing because it sees the middle-tier instance as <hostname> and not as <physical_hostname>. You might suspect that this is a problem with the environment variable _CLUSTER_NETWORK_NAME_, which is set during installation. However, in this case it is not, because a check of the _CLUSTER_NETWORK_NAME_ setting finds the entry to be correct. A check of the contents of the /etc/hosts file, however, indicates that the entries for the middle tier in question are incorrect. That is, all middle-tier installations take the hostname from the second column of the /etc/hosts file.
For example, assume the following scenario:
Two environments are used:
OracleAS Infrastructure (Oracle Identity Management and OracleAS Metadata Repository) is first installed on examp2 as host
OracleAS middle-tier (OracleAS Portal and OracleAS Wireless) is then installed on examp2 as host
Basically, these are two installations (OracleAS Infrastructure and OracleAS middle-tier) on a single node.
Updated the latest backup_restore files on all four Oracle homes.
Started OracleAS Guard (asgctl) on all four Oracle homes (OracleAS Infrastructure and OracleAS middle-tier on two nodes).
Ran the asgctl verify farm with standby farm operation, but it failed because it sees the instance as mid-tier.examp1 and not as the expected <physical_hostname>.
A check of the /etc/hosts file shows the following entry:
184.108.40.2060 examp1 node1.us.oracle.com node1 infra
With this entry, ias.properties and the farm definitions disagree, and the verify operation fails.
The entry in the /etc/hosts file should actually be the following:
220.127.116.110 node1.us.oracle.com node1 infra
With this corrected entry, ias.properties and the farm definitions agree, and the verify operation succeeds.
Check and change the second column entry in your /etc/hosts file to match the hostname of the middle-tier node in question, as described in the previous explanation.
The sync farm to operation returns the error message: "Cannot Connect to asdb".
Occasionally, an administrator may forget to set the primary database using the asgctl command-line utility before performing an operation that requires the asdb database connection to be established. The following example shows this scenario for a sync farm to operation:
ASGCTL> connect asg hsunnab13 ias_admin/iastest2
Successfully connected to hsunnab13:7890
ASGCTL>
.
.
.
<Other asgctl operations may follow, such as verify farm, dump farm,
<and show operation history, and so forth, that do not require the connection
<to the asdb database to be established, or a time span of no activity may elapse
<and the administrator may miss performing this vital command.
.
.
.
ASGCTL> sync farm to usunnaa11
prodinfra(asr1012): Syncronizing each instance in the farm to standby farm
prodinfra: -->ASG_ORACLE-300: ORA-01031: insufficient privileges
prodinfra: -->ASG_DUF-3700: Failed in SQL*Plus executing SQL statement: connect email@example.com as sysdba;.
prodinfra: -->ASG_DUF-3502: Failed to connect to database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3504: Failed to start database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3027: Error while executing Syncronizing each instance in the farm to standby farm at step - init step.
The solution is to issue the asgctl set primary database command. This command sets the connection parameters required to open the asdb database in order to perform the sync farm to operation. Note that the set primary database command must also precede the instantiate farm to and switchover farm to commands if the primary database has not been specified in the current connection session.
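A sketch of the corrected session follows. The host names and credentials are the placeholders used in the example above, and the exact argument syntax of set primary database is an assumption to be checked against the OracleAS Guard command reference:

```
ASGCTL> connect asg hsunnab13 ias_admin/iastest2
Successfully connected to hsunnab13:7890
ASGCTL> set primary database sys/password@asdb
ASGCTL> sync farm to usunnaa11
```

With the primary database set in the session, the sync farm to operation can open the asdb connection instead of failing at the init step.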
This section describes common problems and solutions for middle-tier components in high availability configurations. It contains the following topics:
If you are running OracleAS Cluster (OC4J-EJB) on computers with two NICs (network interface cards) and you are using one NIC for connecting to the network and the second NIC for connecting to the other node in the cluster, multicast messages may not be sent or received correctly. This means that session information does not get replicated between the nodes in the cluster.
Figure A-1 OracleAS Cluster (OC4J-EJB) Running on Computers with Two NICs
You need to start up the OC4J instances with the oc4j.multicast.bindInterface parameter set to the name or IP address of the other NIC on the node.
For example, using the values shown in Figure A-1, you would start up the OC4J instances with these parameters:
On node 1, configure the OC4J instance to start up with this parameter:
On node 2, configure the OC4J instance to start up with this parameter:
You specify this parameter and its value in the "Java Options" field in the "Command Line Options" section in the Server Properties page in the Application Server Control Console (Figure A-2).
Figure A-2 Server Properties Page in Application Server Control Console
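For illustration, if the interconnect NICs were assigned the addresses 10.0.0.1 on node 1 and 10.0.0.2 on node 2 (assumed values, not taken from Figure A-1), the Java Options entries would look like this:

```
# Node 1 "Java Options" entry
-Doc4j.multicast.bindInterface=10.0.0.1

# Node 2 "Java Options" entry
-Doc4j.multicast.bindInterface=10.0.0.2
```

Each node names its own interconnect-facing NIC so that multicast replication traffic uses the link between the two nodes.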
If you have applications that use the "opmn:" prefix in their Context.PROVIDER_URL property, you may experience slow performance when the application creates its initial JNDI context.
The following sample code sets the PROVIDER_URL property to a URL with an opmn: prefix:
Hashtable env = new Hashtable();
env.put(Context.PROVIDER_URL, "opmn:ormi://hostname:port/cmpapp");
// ... set other properties ...
Context context = new InitialContext(env);
If the host specified in PROVIDER_URL is down, the application has to make a network connection to OPMN to locate another host. Going through the network to OPMN takes time.
To avoid making another network connection to OPMN to get another host, set the oracle.j2ee.naming.cache.timeout property so that the values returned from OPMN the first time are cached, and the application can use the values in the cache.
The following sample code sets the oracle.j2ee.naming.cache.timeout property:
Hashtable env = new Hashtable();
env.put(Context.PROVIDER_URL, "opmn:ormi://hostname:port/cmpapp");
// set the cache value
env.put("oracle.j2ee.naming.cache.timeout", "30");
// ... set other properties ...
Context context = new InitialContext(env);
Table A-1 shows valid values for the oracle.j2ee.naming.cache.timeout property.
Table A-1 Values for the oracle.j2ee.naming.cache.timeout Property
Cache only once, without any refreshing.
Number of seconds after which the cache can be refreshed. Note that this is not automatic; the refresh occurs only when you invoke "new InitialContext()".
If the property is not set, the default value is 60.
With the property set, you will still see some delay on the first "new InitialContext()" call, but subsequent calls should be faster because they retrieve data from the cache instead of making a network connection to OPMN.
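The caching behavior can be sketched with a toy model. This is illustrative only, not the Oracle JNDI implementation; the class name, the placeholder host list, and the cache-forever handling of negative values are assumptions:

```python
# Toy model of the lookup caching described above (not Oracle's code).
# The first lookup pays the OPMN round trip; later lookups inside the
# timeout window are served from the cache, and a refresh happens only
# when a lookup occurs after the window has passed.

class NamingCache:
    def __init__(self, timeout):
        self.timeout = timeout    # seconds, modeled after the property
        self.cached_at = None
        self.hosts = None
        self.opmn_calls = 0       # counts simulated network round trips

    def lookup(self, now):
        expired = (self.cached_at is not None
                   and self.timeout >= 0
                   and now - self.cached_at > self.timeout)
        if self.hosts is None or expired:
            self.opmn_calls += 1              # simulated call to OPMN
            self.hosts = ["host1", "host2"]   # placeholder OPMN answer
            self.cached_at = now
        return self.hosts


cache = NamingCache(timeout=30)
cache.lookup(now=0)     # first call: goes to OPMN
cache.lookup(now=10)    # within 30 seconds: served from the cache
cache.lookup(now=50)    # past the timeout: refreshed from OPMN
print(cache.opmn_calls)   # 2
```

Only two of the three lookups reach OPMN, which mirrors the behavior the property enables: the expensive round trip is paid on the first call and again only after the timeout elapses.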
Note that for optimal performance, you should also set
Dedicated.Connection to either
DEFAULT, and set
Backing up and restoring an OracleAS Metadata Repository using the Backup and Recovery Tool from one host to another fails if the ORACLE SID on the new host is different from that of the old host.
The Backup and Recovery Tool does not work with different ORACLE SID values.
The following is an example of the error message that appears when the restoration fails due to an inconsistent ORACLE SID:
Assume two nodes: A and B. The OracleAS Metadata Repository in machine A is backed up using the Backup and Recovery Tool. When attempting to restore it on machine B using the same tool, the following message appears:
Oracle instance started
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00579: the following error occurred at 09/08/2003 16:29:15
RMAN-06003: ORACLE error from target database:
ORA-01103: database name 'M16REP1' in controlfile is not 'M16MR2'
RMAN-06097: text of failing SQL statement: alter database mount
RMAN-06099: error occurred in source file: krmk.pc, line: 4124
Note that "M16REP1" is the ORACLE SID of the database that was backed up.
None at this time. Restoring the OracleAS Metadata Repository to a database with a different ORACLE SID is currently not supported.
Currently, the Oracle Ultra Search web crawler is configured so that it can be run only from one node in a Real Application Clusters database. If that node (or the database) goes down, the web crawler will not start up on an available node. This situation occurs for non-Cluster File System Real Application Clusters.
When Real Application Clusters uses a Cluster File System, the Oracle Ultra Search crawler can be launched from any of the Real Application Clusters nodes; at least one node has to be running.
When a Cluster File System is not used, the Oracle Ultra Search crawler always runs on a specified node. If this node stops operating, you must run the wk0reconfig.sql script to move Oracle Ultra Search to another Real Application Clusters node. Run the script as follows:
> sqlplus wksys/wksys_passwd
SQL> @ORACLE_HOME/ultrasearch/admin/wk0reconfig.sql <instance_name> <connect_url>
<instance_name> is the name of the Real Application Clusters instance that Oracle Ultra Search uses for crawling. This name can be obtained by using the following SQL statement after connecting to the database:
SELECT instance_name FROM v$instance
<connect_url> is the JDBC connection string that guarantees a connection only to the specified instance, such as:
(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)(HOST=<nodename>)(PORT=<listener_port>)))
  (CONNECT_DATA=(SERVICE_NAME=<service_name>)))
Note that when Oracle Ultra Search is switched from one Real Application Clusters node to another, the contents of the cache will be lost. After switching instances, force a re-crawl of the documents to re-populate the cache.
If the information in the previous sections is not sufficient, you can find more solutions on Oracle MetaLink, http://metalink.oracle.com. If you do not find a solution for your problem, log a service request.