Oracle® Application Server 10g High Availability Guide
10g (10.1.2)
Part No. B14003-01

A Troubleshooting High Availability

This appendix describes common problems that you might encounter when deploying and managing Oracle Application Server in high availability configurations, and explains how to solve them. It contains the following topics:

A.1 Problems and Solutions

This section describes common problems and solutions. It contains the following topics:

A.1.1 Cluster Configuration Assistant Fails During Installation

Problems encountered during the clustering of components using the Cluster Configuration Assistant are addressed here.

Problem

During the installation of distributed Oracle Identity Management configurations, the OracleAS Single Sign-On and Oracle Delegated Administration Services components are installed on two nodes of their own, separate from the other Oracle Identity Management components. The Cluster Configuration Assistant may attempt to cluster the two resulting OracleAS Single Sign-On/Oracle Delegated Administration Services instances together, but fail with the error message "Instances containing disabled components cannot be added to a cluster". This message appears because Enterprise Manager cannot cluster instances that contain disabled components.

Solution

If the Cluster Configuration Assistant fails, you can cluster the instance after installation. To do so, you must use the "dcmctl joincluster" command instead of Application Server Control Console, because Application Server Control Console cannot cluster instances that contain disabled components. In this case, the "home" OC4J instance is disabled.

A.1.2 Oracle Ultra Search Configuration Assistant is Unable to Connect to Oracle Internet Directory During High Availability Infrastructure Installation

During high availability Infrastructure installation, the Oracle Ultra Search Configuration Assistant cannot connect to an Oracle Internet Directory instance at port 3060 of the virtual hostname provided in the virtual hostname addressing screen.

Problem

A common mistake can be made when virtual hostname addressing is used during Infrastructure installation. The load balancer virtual server name is entered, and the load balancer is set up correctly to assume this name. However, the Infrastructure node is not set up correctly to resolve this name. Thus, when the Oracle Ultra Search Configuration Assistant on the Infrastructure node tries to connect to the load balancer virtual server name, the Configuration Assistant cannot find the load balancer.

Solution

The solution is to set up name resolution correctly on the Infrastructure machine for the load balancer virtual server name. This procedure is platform-dependent; check your operating system documentation for the exact steps. On UNIX, this usually involves editing the /etc/hosts file and making sure this file is used for name resolution by editing the /etc/nsswitch.conf file. On Windows, this usually involves editing the C:\WINDOWS\system32\drivers\etc\hosts file.
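Before rerunning the configuration assistant, it can help to confirm that the virtual server name now resolves on the Infrastructure node. The following is a minimal sketch of such a check (the hostnames used here are placeholders, not values from this guide):

```python
import socket

def can_resolve(hostname):
    """Return True if the local resolver can map hostname to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Replace "localhost" with your load balancer virtual server name
# (a placeholder here, not a value from this guide).
print(can_resolve("localhost"))  # True on a correctly configured machine
```

If the check returns False for the load balancer virtual server name, fix the hosts file entries described above before retrying the installation.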

A.1.3 Unable to Perform Online Database Backup and Restore in OracleAS Cold Failover Cluster Environment

Issues with online database backup and restore are noted here. This information pertains to the OracleAS Cold Failover Cluster environment.

Problem

Online recovery of the Infrastructure database fails because of resource dependencies: while the Backup and Recovery Tool is in its recovery phase, the cluster administrator tries to bring the database down and then up again.

Solution 1

To perform a clean recovery, use the following steps:

  1. Bring all resources offline using the cluster administrator (for Windows, use Oracle Failsafe).

  2. Perform a normal shutdown of the Infrastructure database.

  3. Start only the database service using the following command:

    net start OracleService<SID>

  4. Run the Backup and Recovery Tool to perform the recovery of the database.

Solution 2

For Windows, the following steps can be used to perform a recovery:

  1. In Oracle Failsafe, under "Cluster Resources", select "ASDB(DB Resource)" in the "Database" tab.

  2. For "Database Polling", select "Disabled" from the drop down list.

  3. Using the Backup and Recovery Tool, perform an online restore of the Infrastructure database.

The database is not accessible for a brief period while the Backup and Recovery Tool stops and starts the database. Once the database starts up, it can be accessed by middle-tier and Infrastructure components.

A.1.4 odisrv Process Does Not Failover

Issues with odisrv process failover between nodes are documented here.

Problem

In any OracleAS Cluster (Identity Management) solution, when opmnctl stopall is executed to stop all OPMN-managed processes on a node, odisrv is not started automatically on the second node, because opmnctl stopall is a normal administrative shutdown, not an actual node failure. In a true node failure, odisrv is started on the remaining node upon death detection of the original odisrv process.

Solution

If planned maintenance is required for an OracleAS Cluster (Identity Management), use the oidctl command to explicitly stop and start odisrv.

On the node where odisrv is running, use the following command to stop it:

oidctl connect=<dbConnect> server=odisrv inst=1 stop


On the remaining active node, start odisrv using the following command:

oidctl connect=<dbConnect> server=odisrv inst=1 flags="..." start

A.1.5 Oracle Ultra Search Web Crawler Does Not Failover

For Real Application Clusters that do not use a Cluster File System, the Oracle Ultra Search web crawler does not failover to an available node.

Problem

Currently, the Oracle Ultra Search web crawler is configured so that it can run only on one node in a Real Application Cluster. If that node (or the database) goes down, the web crawler will not start up on an available node. This situation occurs for non-Cluster File System Real Application Clusters.

Solution

When Real Application Clusters use a Cluster File System, the Oracle Ultra Search crawler can be launched from any of the Real Application Clusters nodes, as long as at least one node is running.

When a Cluster File System is not used, the Oracle Ultra Search crawler always runs on a specified node. If this node stops operating, you must run the wk0reconfig.sql script to move Oracle Ultra Search to another Real Application Clusters node. This script can be run as follows:

> sqlplus wksys/wksys_passwd
SQL> @ORACLE_HOME/ultrasearch/admin/wk0reconfig.sql <instance_name> <connect_url>

where <instance_name> is the name of the Real Application Clusters instance that Oracle Ultra Search uses for crawling. This name can be obtained by using the following SQL statement after connecting to the database:

SELECT instance_name FROM v$instance

<connect_url> is the JDBC connection string that guarantees a connection only to the specified instance, such as:

(DESCRIPTION=
  (ADDRESS_LIST=
    (ADDRESS=(PROTOCOL=TCP)
      (HOST=<nodename>)
      (PORT=<listener_port>)))
  (CONNECT_DATA=(SERVICE_NAME=<service_name>)))
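The descriptor can also be assembled programmatically. The following sketch is purely illustrative; make_connect_url and the example host and service names are not part of any Oracle tool:

```python
def make_connect_url(nodename, listener_port, service_name):
    """Build a JDBC/TNS connect descriptor that pins the connection
    to one specific instance (no load balancing across nodes)."""
    return (
        "(DESCRIPTION="
        "(ADDRESS_LIST="
        "(ADDRESS=(PROTOCOL=TCP)"
        "(HOST={host})"
        "(PORT={port})))"
        "(CONNECT_DATA=(SERVICE_NAME={svc})))"
    ).format(host=nodename, port=listener_port, svc=service_name)

# Example values are placeholders for your node name, listener port,
# and database service name.
print(make_connect_url("node2.example.com", 1521, "asdb.example.com"))
```

The resulting single-line descriptor can then be passed to wk0reconfig.sql as the <connect_url> argument.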

Note that when Oracle Ultra Search is switched from one Real Application Clusters node to another, the contents of the cache will be lost. After switching instances, force a re-crawl of the documents to re-populate the cache.

A.1.6 Unable to Restore OracleAS Metadata Repository to a Different Host

Backing up an OracleAS Metadata Repository on one host and restoring it on another using the Backup and Recovery Tool fails if the ORACLE SID on the new host is different from that of the old host.

Problem

The Backup and Recovery Tool does not work with different ORACLE SID values.

The following is an example of the error message that appears when the restoration fails due to an inconsistent ORACLE SID:

Assume two nodes: A and B. The OracleAS Metadata Repository in machine A is backed up using the Backup and Recovery Tool. When attempting to restore it on machine B using the same tool, the following message appears:

Oracle instance started
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00579: the following error occurred at 09/08/2003 16:29:15
RMAN-06003: ORACLE error from target database: ORA-01103: database name 'M16REP1' in controlfile is not 'M16MR2'
RMAN-06097: text of failing SQL statement: alter database mount
RMAN-06099: error occurred in source file: krmk.pc, line: 4124

Notice that "M16REP1" is the ORACLE SID of the database that was backed up.

Solution

None at this time. Restoring the OracleAS Metadata Repository to a database with a different ORACLE SID is currently not supported.

A.1.7 Cannot Connect to Database for Restoration (Windows)

Unable to connect to the idle OracleAS Metadata Repository database to restore it after it is shut down using Microsoft Cluster Administrator.

Problem

When you stop the OracleAS Metadata Repository database using Microsoft Cluster Administrator, it performs the strictest and fastest form of shutdown (an abort), which stops the database's Windows service. After the shutdown, attempts to connect to the service fail.

The following steps can be used to illustrate the problem:

  1. Access an OracleAS Metadata Repository that is used for testing.

  2. Corrupt a database file (note: do not modify the ts$ table).

  3. Issue a SQL query to ensure that the database is corrupted.

  4. Using Microsoft Cluster Administrator, verify that the database is online.

  5. Using Oracle Fail Safe Manager, disable database polling.

  6. Using Microsoft Cluster Administrator, take the database offline. This also takes OPMN and Application Server Control Console offline as they are dependencies of the database.

  7. Try connecting as sysdba. The connection should fail.

Solution

Use Oracle Fail Safe Manager to shut down the database. To do so:

  1. In Oracle Fail Safe Manager, right-click the "ASDB" resource (the default name, if not changed) and select "immediate".

  2. Start the database service using the Windows Service Manager.

  3. Connect to the database as sysdba. The connection should be successful.

A.1.8 Unpredictable Behavior from Oracle Application Server Cluster (Identity Management) Configuration

Unpredictable behavior from OracleAS Cluster (Identity Management) nodes if system time on all nodes is not synchronized.

Problem

In an OracleAS Cluster (Identity Management) configuration, the Oracle Internet Directory Monitor (OIDMON) on each node updates the directory database every 10 seconds with metadata. At the same time, it queries the database to verify that all other directory servers are running.

If an OIDMON does not update the database for 250 seconds, the other nodes assume that node has failed. A node whose system clock is set more than 250 seconds apart from the other nodes' clocks can trigger this failure detection erroneously. When this happens, OIDMON on one of the other nodes initiates failover operations, which include locally bringing up the processes that were running on the presumed failed node. The node where these processes are started then continues processing the operations that were underway on the failed node.

As an example, assume an OracleAS Cluster (Identity Management) configuration with nodes A and B. The system clock on node B is 300 seconds behind node A's clock. Node B updates its metadata in the directory database, which includes the system clock value. Node A queries the database for active Oracle Internet Directory servers and concludes that node B has failed because its last update appears to be 300 seconds old. Node A then initiates failover operations by locally starting all Oracle Internet Directory server processes that were running on node B.
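The failure-detection behavior described above can be illustrated with a small sketch (hypothetical; the real OIDMON implementation differs): each node records its last database update time, and any node whose recorded update is older than the 250-second threshold is presumed dead.

```python
STALE_AFTER = 250  # seconds without a metadata update before a node is presumed dead

def presumed_failed(last_update_times, now):
    """Return the nodes whose last recorded update is older than the threshold.

    last_update_times: dict mapping node name -> last update timestamp (seconds).
    A node whose clock runs behind by more than the threshold writes timestamps
    that look stale, so it appears failed even though it is healthy -- the
    erroneous detection described above.
    """
    return sorted(
        node for node, t in last_update_times.items()
        if now - t > STALE_AFTER
    )

# Node B's clock is 300 seconds behind, so its update looks 300 seconds old.
updates = {"nodeA": 1000.0, "nodeB": 1000.0 - 300.0}
print(presumed_failed(updates, now=1010.0))  # ['nodeB']
```

With synchronized clocks, node B's update would appear at most a few seconds old and no failover would be triggered.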

Solution

The system clock value on all nodes in the OracleAS Cluster (Identity Management) configuration should be synchronized, for example against Greenwich Mean Time, so that the discrepancy between any two nodes is no more than 250 seconds.

Refer to the chapters on Rack-Mounted directory server configurations in the Oracle Internet Directory Administrator's Guide.

A.1.9 Wrong Name Specified for Load Balancer

If a load balancer is deployed in front of OracleAS instances that are clustered together, configuration files of the instances may not have the correct load balancer virtual server name specified.

Problem

For a cluster of OracleAS instances front-ended by a load balancer, a redirect back to the cluster may not contain the load balancer virtual server name. Dynamic pages created by a servlet or JSP may also not use the correct load balancer virtual server name. In both cases, the local hostname is most likely used instead.

To correctly specify the load balancer virtual server name to be used, modifications have to be made to the httpd.conf and default-web-site.xml files of each instance.

Solution

At each OracleAS instance, perform the following instructions:

  1. Perform the following steps for Oracle HTTP Server:

    1. Stop the Oracle HTTP Server using the following command:

      opmnctl stopproc ias_component=HTTP_Server

    2. In Oracle HTTP Server's httpd.conf file, change the value for the directive ServerName to the virtual server name of your load balancer. For example, if you use "localhost", change it to the virtual server name of your load balancer.

    3. In the same httpd.conf file, change the value of the Port directive to the port number your load balancer is configured with for incoming requests. For example, if the port number specified is 7777, change it to port 80 if that is configured on your load balancer.

    4. Execute the following command to update the DCM repository with the above changes:

      dcmctl updateConfig -ct ohs

    5. Start the Oracle HTTP Server using the following command:

      opmnctl startproc ias_component=HTTP_Server

  2. Perform the following steps for OC4J:

    1. Stop the OC4J processes for each OracleAS instance using the following command:

      opmnctl stopproc ias_component=OC4J

    2. Edit the file default-web-site.xml to include the following line:

      <frontend host="load_balancer_name" port="port_number" />

      Replace "load_balancer_name" with the virtual server name of your load balancer and "port_number" with the port number that is configured for incoming requests in your load balancer (these values are similar to those you entered for httpd.conf above).

    3. Execute the following command to update the DCM repository with the changes you made in the default-web-site.xml file:

      dcmctl updateConfig -ct oc4j

    4. Start the OC4J instances using the following command:

      opmnctl startproc ias_component=OC4J
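After both sets of steps, the relevant fragments of the two files should look like the following (www.example.com and port 80 are placeholder values; substitute your load balancer's virtual server name and port):

```
# httpd.conf (fragment)
ServerName www.example.com
Port 80
```

```
<!-- default-web-site.xml (fragment) -->
<frontend host="www.example.com" port="80" />
```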

A.1.10 OracleAS Disaster Recovery: Standby Site Not Synchronized

In the OracleAS Disaster Recovery standby site, you may find that the site's OracleAS Metadata Repository is not synchronized with the OracleAS Metadata Repository in the primary site.

Problem

The OracleAS Disaster Recovery solution requires manual configuration and shipping of data files from the primary site to the standby site. Also, the data files (archived database log files) are not applied automatically in the standby site, that is, OracleAS Disaster Recovery does not use managed recovery in Oracle Data Guard.

Solution

The archive log files have to be applied manually. The steps to perform this task are found in Chapter 7, "Oracle Application Server Disaster Recovery".

A.1.11 OracleAS Disaster Recovery: Failure to Bring Up Standby Instances After Failover or Switchover

Standby instances are not started after a failover or switchover operation.

Problem

Explicit IP addresses are used in the instance configuration. OracleAS Disaster Recovery setup does not require peer instances at the production and standby sites to have identical IP addresses, and OracleAS Disaster Recovery synchronization does not reconcile IP address differences between the two sites. Thus, if you use an explicit IP address xxx.xx.xxx.xx in your configuration, the standby configuration will not work after synchronization.

Solution

Avoid using explicit IP addresses. For example, in OracleAS Web Cache and Oracle HTTP Server configurations, use ANY or host names instead of IP addresses as listening addresses.
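As a quick audit before synchronization, a sketch like the following (an illustration, not an Oracle tool) can flag explicit IPv4 literals in configuration text:

```python
import re

# Simple IPv4 literal pattern (does not validate octet ranges).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_hardcoded_ips(config_text):
    """Return IPv4 literals found in a configuration snippet."""
    return IPV4.findall(config_text)

# Example directives are placeholders, not values from this guide.
sample = """
Listen 192.168.10.5:7777
ServerName webfarm.example.com
"""
print(find_hardcoded_ips(sample))  # ['192.168.10.5']
```

Any address reported this way is a candidate for replacement with ANY or a host name.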

A.1.12 OracleAS Disaster Recovery: Unable to Start Standalone OracleAS Web Cache Installations at the Standby Site

OracleAS Web Cache cannot be started at the standby site possibly due to misconfigured standalone OracleAS Web Cache after failover or switchover.

Problem

OracleAS Disaster Recovery synchronization does not synchronize standalone OracleAS Web Cache installations.

Solution

Use the standard Oracle Application Server full CD image to install the OracleAS Web Cache component.

A.1.13 OracleAS Disaster Recovery: Standby Site Middle-tier Installation Uses Wrong Hostname

A middle-tier installation in the standby site uses the wrong hostname even after the machine's physical hostname is changed.

Problem

Besides modifying the physical hostname, you also need to make it the first entry in the /etc/hosts file. Failure to do so will cause the installer to use the wrong hostname.

Solution

Put the physical hostname as the first entry in the /etc/hosts file. See Section 7.2.2, "Configuring Hostname Resolution" for more information.

A.1.14 OracleAS Disaster Recovery: Failure of Farm Verification Operation with Standby Farm

When performing a verify farm with standby farm operation, the operation fails with an error message indicating that the middle-tier machine instance cannot be found and that the standby farm is not symmetrical with the production farm.

Problem

The verify farm with standby farm operation verifies that the production and standby farms are symmetrical to one another, that they are consistent, and that they conform to the requirements for disaster recovery.

The verify operation fails because it sees the middle-tier instance as mid_tier.<hostname> and not as mid_tier.<physical_hostname>. You might suspect a problem with the _CLUSTER_NETWORK_NAME_ environment variable, which is set during installation; however, a check of the _CLUSTER_NETWORK_NAME_ setting finds this entry to be correct. A check of the contents of the /etc/hosts file, though, shows that the entries for the middle tier in question are incorrect. This matters because all middle-tier installations take the hostname from the second column of the /etc/hosts file.

For example, assume the following scenario:

  • Two environments are used: examp1 and examp2

  • OracleAS Infrastructure (Oracle Identity Management and OracleAS Metadata Repository) is first installed on examp1 and examp2 as host infra

  • OracleAS middle-tier (OracleAS Portal and OracleAS Wireless) is then installed on examp1 and examp2 as host node1

  • Basically, these are two installations (OracleAS Infrastructure and OracleAS middle-tier) on a single node

  • Updated the latest duf.jar and backup_restore files on all four Oracle homes

  • Started OracleAS Guard (asgctl) on all four Oracle homes (OracleAS Infrastructure and OracleAS middle-tier on two nodes)

  • Performed asgctl operations: connect asg, set primary, dump farm

  • Performed the asgctl verify farm with standby farm operation, which fails because it sees the instance as mid_tier.examp1 and not as mid_tier.node1.us.oracle.com

A check of the /etc/hosts file shows the following entry:

123.45.67.890 examp1 node1.us.oracle.com node1 infra

Then ias.properties and farms shows the following and the verify operation is failing:

IASname=midtier_inst.examp1

However, the /etc/hosts file should actually be the following:

123.45.67.890 node1.us.oracle.com node1 infra

Then ias.properties and farms shows the following and the verify operation succeeds:

IASname=midtier_inst.node1.us.oracle.com

Solution

Check and change the second column entry in your /etc/hosts file to match the hostname of the middle-tier node in question as described in the previous explanation.
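The check described above can be scripted. This sketch (an illustration using the example entries from this section) extracts the second column of an /etc/hosts line and compares it with the expected fully qualified hostname:

```python
def second_column(hosts_line):
    """Return the canonical hostname (second column) of an /etc/hosts entry."""
    fields = hosts_line.split()
    return fields[1] if len(fields) > 1 else None

bad  = "123.45.67.890 examp1 node1.us.oracle.com node1 infra"
good = "123.45.67.890 node1.us.oracle.com node1 infra"

expected = "node1.us.oracle.com"
print(second_column(bad) == expected)   # False: installer would pick up "examp1"
print(second_column(good) == expected)  # True
```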

A.1.15 OracleAS Disaster Recovery: Sync Farm Operation Returns Error Message

A "sync farm to" operation returns the error message: "Cannot Connect to asdb".

Problem

Occasionally, an administrator may forget to set the primary database using the asgctl command-line utility before performing an operation that requires an established connection to the asdb database. The following example shows this scenario for a sync farm to operation:

ASGCTL> connect asg hsunnab13 ias_admin/iastest2
Successfully connected to hsunnab13:7890
ASGCTL>  
.
.
.
(Other asgctl operations may follow, such as verify farm, dump farm,
show operation history, and so forth, that do not require the connection
to the asdb database to be established; or a period of inactivity may elapse
and the administrator may miss performing this vital command.)
.
.
.
ASGCTL> sync farm to usunnaa11
prodinfra(asr1012): Syncronizing each instance in the farm to standby farm
prodinfra: -->ASG_ORACLE-300: ORA-01031: insufficient privileges
prodinfra: -->ASG_DUF-3700: Failed in SQL*Plus executing SQL statement:  connect null/******@asdb.us.oracle.com as sysdba;.
prodinfra: -->ASG_DUF-3502: Failed to connect to database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3504: Failed to start database asdb.us.oracle.com.
prodinfra: -->ASG_DUF-3027: Error while executing Syncronizing each instance in the farm to standby farm at step - init step.

Solution

Perform the asgctl set primary database command. This command sets the connection parameters required to open the asdb database in order to perform the sync farm to operation. Note that the set primary database command must also precede the instantiate farm to and switchover farm to commands if the primary database has not been specified in the current connection session.
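A corrected session issues set primary database before the sync operation. The following is a sketch only: the host names and credentials repeat the example above, and the exact argument form of set primary database shown here is an assumption; consult the asgctl reference for the precise syntax.

```
ASGCTL> connect asg hsunnab13 ias_admin/iastest2
Successfully connected to hsunnab13:7890
ASGCTL> set primary database <sys_username>/<password>@asdb
ASGCTL> sync farm to usunnaa11
```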

A.2 Need More Help?

If the information in the previous section is not sufficient, you can find more solutions on Oracle MetaLink, http://metalink.oracle.com. If you do not find a solution for your problem, log a service request.

