Sun Java System Application Server Standard and Enterprise Edition 7 2004Q2 Update 7 Release Notes

High Availability

This section describes the known high availability issues and associated solutions.

ID 

Summary 

6301842

On Windows, the management agent sometimes cannot deregister the service when running ma -r, and fails with the error message Could not identify program.

Solution

Open a Windows command prompt and run sc stop HADBMgmtAgent, then run sc delete HADBMgmtAgent. If the service was installed and started with ma -i -n servicename, use servicename in place of HADBMgmtAgent in both sc commands.

6293912

The Management Agent should not use special-use interfaces.

Solution

When issuing hadbm create on hosts with multiple interfaces, always specify the IP addresses explicitly, in dotted decimal notation (DDN).

6291562

Reassembly failures on Windows.

On the Windows platform, certain configurations and loads may produce a large number of reassembly failures in the operating system. The problem has been seen with configurations of more than 20 nodes when running several table scans (select *) in parallel. Symptoms include frequent transaction aborts, slow repair and recovery, and frequent timeouts in various parts of the system.

Solution

To fix the problem, raise the relevant value under the Windows registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters above its default of 100. We recommend increasing it to 0x1000 (4096). For more information, see article 811003 from the Microsoft support pages: http://support.microsoft.com/default.aspx?scid=kb;en-us;811003

6275319

Non-root users cannot manage HADB.

Installing with Java Enterprise System (as root) does not permit non-root users to manage HADB. 

Solution

Always log in as root to manage HADB.

6275103

hadbm management agent should give a better error message when a session object has timed out and been deleted at the MA.

Sometimes, a resource contention problem on the server may cause a management client to become disconnected. When it reconnects, a misleading error message may be returned: hadbm:Error 22184: A password is required to connect to the management agent.

Solution

Check if there is a resource problem on the server, take proper action (e.g., add more resources), and retry the operation. 

6273681

Management agents in global and local zones may interfere.

On Solaris 10, stopping a management agent by using the ma-initd script in a global zone stops the management agent in the local zone as well.

Solution

Do not install the management agent in both the global zone and a local zone.

6271063

Install/removal and symlink preservation.

Regarding install/removal of HADB c package (Solaris: SUNWhadbc, Linux: sun-hadb-c) version <m.n.u-p>, the symlink /opt/SUNWhadb/<m> is never touched once it exists. Thus, it is possible that an orphaned symlink will exist.

Solution

Delete the symlink before installing or after uninstalling, unless it is in use.
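Before deleting the symlink, it can help to confirm that it really is orphaned. The following is a minimal sketch, not part of the HADB tooling: the function name is hypothetical, and the path you pass in (for example, the /opt/SUNWhadb/<m> link for your version) is your own input.

```python
import os


def is_orphaned_symlink(path: str) -> bool:
    """Return True if path is a symlink whose target no longer exists."""
    # os.path.islink inspects the link itself, while os.path.exists
    # follows the link and reports on the target.
    return os.path.islink(path) and not os.path.exists(path)
```

A link that still resolves to an installed version returns False and should be left alone while that version is in use.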

6265419

Downgrading from HADB Version 4.4.2.5 to HADB Version 4.4.1.7 causes management agent to fail with different error codes.

When downgrading to a previous HADB version, the management agent may fail with different error codes. 

Solution

It is possible to downgrade the HADB database; however, the management agent cannot be downgraded if changes have been made to the repository objects. After a downgrade, you must use the management agent from the latest HADB version.

6262824

hadbm does not support passwords containing uppercase letters.

Capital letters in passwords are converted to lowercase when the password is stored in HADB.

Solution

Do not use passwords containing uppercase letters. 
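As an illustration only, a candidate password can be screened before it is given to hadbm. This sketch assumes that the conversion described above behaves like Python's str.lower(); both function names are hypothetical.

```python
def stored_form(password: str) -> str:
    """Model the lowercase conversion (assumption: equivalent to str.lower())."""
    return password.lower()


def survives_storage(password: str) -> bool:
    """Return True if the password is unchanged by the lowercase conversion."""
    return password == stored_form(password)
```

Under this model, two passwords that differ only in letter case end up with the same stored form, which is why uppercase letters are best avoided.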

6173886, 6253132

hadbm createdomain may fail.

If running the management agent on a host with multiple network interfaces, the createdomain command may fail if not all network interfaces are on the same subnet:

hadbm:Error 22020: The management agents could not establish a domain, please check that the hosts can communicate with UDP multicast.

The management agents will (if not configured otherwise) use the first interface for UDP multicasts (first as defined by the result from java.net.NetworkInterface.getNetworkInterfaces()).

Solution

The best solution is to tell the management agent which subnet to use by setting ma.server.mainternal.interfaces in the configuration file (for example, ma.server.mainternal.interfaces=10.11.100.0). Alternatively, you can configure the router between the subnets to route multicast packets (the management agent uses multicast address 228.8.8.8).
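To judge whether a host is affected, you can check whether its interface addresses fall on more than one subnet. The sketch below uses Python's standard ipaddress module. It assumes you supply the addresses yourself in CIDR form (address/prefix length), and the function name is hypothetical.

```python
import ipaddress


def distinct_subnets(interfaces):
    """Return the distinct networks covered by CIDR-style interface addresses."""
    nets = {ipaddress.ip_interface(addr).network for addr in interfaces}
    # Sort with an explicit key so that IPv4 and IPv6 networks can be mixed.
    return sorted(
        nets, key=lambda n: (n.version, int(n.network_address), n.prefixlen)
    )
```

If the result contains more than one network, the host spans several subnets. The network_address of the subnet you choose (10.11.100.0 for 10.11.100.5/24) has the same form as the example value shown for ma.server.mainternal.interfaces.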

Before retrying with a new configuration of the management agents, you should clean up the management agent’s repository. Stop all agents in the domain, and delete all files and directories in the repository directory (identified by repository.dr.path in the management agent configuration file). This must be done on all hosts before restarting the agents with a new configuration file.
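The cleanup step can be scripted. The following is a minimal sketch that empties a directory while keeping the directory itself. The function name is hypothetical, and the directory you pass in must be the one named by repository.dr.path in your own configuration file. Run it only after all agents in the domain have been stopped.

```python
import os
import shutil


def clear_directory(repo_dir: str) -> int:
    """Delete every file and subdirectory inside repo_dir; keep repo_dir itself."""
    removed = 0
    for name in os.listdir(repo_dir):
        path = os.path.join(repo_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)   # real subdirectory: remove recursively
        else:
            os.remove(path)       # regular file or symlink
        removed += 1
    return removed
```

The return value is the number of top-level entries removed, which gives a quick check that the directory was not already empty.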

6249685

clu_trans_srv process cannot be interrupted on Linux.

There is a bug in the 64-bit version of Red Hat Enterprise Linux 3.0 that leaves the clu_trans_srv process in an uninterruptible state when performing asynchronous I/O. This means that kill -9 does not work and the operating system must be rebooted.

Solution

Use a 32-bit version of Red Hat Enterprise Linux 3.0.

6230792, 6230415

Starting, stopping or reconfiguring HADB may fail or hang.

On AMD Opteron™ systems running Solaris 10, starting, stopping or reconfiguring HADB using the hadbm command may fail or hang with one of the following errors:

hadbm:Error 22009: The command issued had no progress in the last 300 seconds.

HADB-E-21070: The operation did not complete within the time limit, but has not been cancelled and may complete at a later time.

This may happen if there are inconsistencies while reading/writing to a file (nomandevice) which the clu_noman_srv process uses. This problem can be detected by looking for the following messages in the HADB history files:

n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 does not respond.

n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Have not heard from it in 104.537454 sec

n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 did not start.
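Searching the history files for these messages can be automated. The sketch below is based only on the three sample lines above: it assumes each relevant line starts with n:<node number> and contains the text Child process noman... followed by "does not respond." or "did not start.". Other message layouts may exist and would not be matched. The function name is hypothetical.

```python
import re

# Pattern inferred from the sample history lines; an assumption, not a
# documented log format.
_NOMAN_FAILURE = re.compile(
    r"^n:(\d+)\s+NSUP\b.*Child process noman\d+ \d+ "
    r"(does not respond|did not start)\."
)


def nodes_with_noman_failures(lines):
    """Return the sorted node numbers whose noman child process failed."""
    nodes = set()
    for line in lines:
        match = _NOMAN_FAILURE.match(line.strip())
        if match:
            nodes.add(int(match.group(1)))
    return sorted(nodes)
```

The node numbers returned are the candidates for the nodeno argument when restarting an affected node.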

Solution

To solve the problem, run the following command for the affected node: 

hadbm restartnode --level=clear nodeno dbname

Note that all devices for the node will be reinitialized. You may have to stop the node before reinitializing it. 

None

HADB database creation fails.

Creating a new database may fail with the following error, stating that too few shared memory segments are available: 

HADB-E-21054: System resource is unavailable : HADB-S-05512: Attaching shared memory segment with key "xxxxx" failed, OS status=24 OS error message: Too many open files.

Solution

Verify that shared memory is configured and the configuration is working. In particular, on Solaris 8, inspect the file /etc/system, and check that the value of the variable shmsys:shminfo_shmseg is at least six times the number of nodes per host.
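The /etc/system check can be expressed as a small script. This sketch assumes the variable is set with a line of the form set shmsys:shminfo_shmseg=<decimal value> and that lines beginning with an asterisk are comments. The function names are hypothetical, and the text of the file is passed in as a string.

```python
import re

_SHMSEG = re.compile(r"set\s+shmsys:shminfo_shmseg\s*=\s*(\d+)")


def shmseg_from_etc_system(text):
    """Return the last shmsys:shminfo_shmseg value set in the text, or None."""
    value = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("*"):      # comment line
            continue
        match = _SHMSEG.match(line)
        if match:
            value = int(match.group(1))
    return value


def shmseg_is_sufficient(text, nodes_per_host):
    """Check the rule above: shmseg must be at least 6 x nodes per host."""
    value = shmseg_from_etc_system(text)
    return value is not None and value >= 6 * nodes_per_host
```

For example, a host running four nodes needs a value of at least 24. A file that does not set the variable at all is reported as insufficient by this sketch, so the system default is not taken into account.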

6232140

The management agent terminates with the exception, "IPV6_MULTICAST_IF failed."

The management agent may terminate with the exception IPV6_MULTICAST_IF failed when starting on a host running Solaris 8 that has several network interface cards, some with IPv6 enabled and some with IPv4 enabled. The root cause is described in bug 4418866/4418865.

Solution

  1. Set the environment variable, _JAVA_OPTIONS, as described here:

    $> export _JAVA_OPTIONS="-Djava.net.preferIPv4Stack=true"

  2. Alternatively, use Solaris 9.

6171832, 6172138

Stale sessions are not cleaned up, leading to degraded HADB performance or a full data device.

Solution

To remove stale sessions efficiently, modify the sun-ejb-jar.xml file to set the value of cache-idle-timeout-in-seconds to less than the removal-timeout-in-seconds value.

If cache-idle-timeout-in-seconds is equal to or greater than removal-timeout-in-seconds, old sessions will not be cleaned up in HADB, which is the expected behavior.
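The relationship between the two settings can be checked mechanically. The sketch below assumes that both values appear as child elements of a bean-cache element in sun-ejb-jar.xml. The SAMPLE fragment is a made-up minimal example rather than a complete descriptor, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal fragment, for illustration only.
SAMPLE = """
<sun-ejb-jar>
  <enterprise-beans>
    <ejb>
      <bean-cache>
        <cache-idle-timeout-in-seconds>600</cache-idle-timeout-in-seconds>
        <removal-timeout-in-seconds>1800</removal-timeout-in-seconds>
      </bean-cache>
    </ejb>
  </enterprise-beans>
</sun-ejb-jar>
"""


def misconfigured_timeouts(xml_text):
    """Return (idle, removal) pairs where idle is not less than removal."""
    root = ET.fromstring(xml_text)
    bad = []
    for cache in root.iter("bean-cache"):
        idle = cache.findtext("cache-idle-timeout-in-seconds")
        removal = cache.findtext("removal-timeout-in-seconds")
        if idle is None or removal is None:
            continue  # one of the settings is absent; nothing to compare
        if int(idle) >= int(removal):
            bad.append((int(idle), int(removal)))
    return bad
```

An empty result means every bean-cache that sets both values has the idle timeout below the removal timeout. Beans that omit either setting are skipped, so their defaults are not checked.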

If you continue to face issues with stale sessions even after setting these properties as recommended, contact product support for help. 

6171994

Improper permissions in security.policy file causing startup hang.

Description

hadb-jdbc has improper access permissions in the security.policy file.

Solution

If there is an intermittent hang during startup, add the following suggested permissions in the security.policy file: 

By default, the following is present: 

permission java.net.SocketPermission "*", "connect";

Suggested permissions: 

permission java.net.SocketPermission "*", "connect,accept,listen,resolve";

5042351

New tables created after new nodes are added will not spread on the added nodes.

Description

If a user creates a database instance and then adds nodes to it, any new tables created afterwards are not fragmented onto the nodes added after database creation. Only tables created before hadbm addnodes was run can use the added nodes, because hadbm addnodes refragments them.

This is because create table uses the sysnode node group which is created at the boot time of the database (when hadbm create is executed).

Solution

Run hadbm refragment after new tables have been added, or create the new tables on the all_nodes node group.

6158393

HADB problem with Red Hat AS 3.0 in co-located mode under load.

Description

When HADB runs on Red Hat Linux AS 3.0 co-located with Application Server, transactions may be aborted, which degrades performance. This is caused by excessive swapping performed by the operating system.

Solution

This issue appears to be resolved in Red Hat Linux AS 3.0 Update 4, based on HADB testing against that release.

6214601

Addnodes fails with table not found error since hadbm searches user tables in sysroot schema.

Description

The hadbm refragment command fails with: 

hadbm:Error 22042: Database could not be refragmented. Please retry with hadbm refragment command to refragment the database.. Caused by: HADB-E-11701: *Table singlesignon not found*

Solution

Refragment the Application Server tables manually with the help of clusql:

> clusql <server:port list> <system+dbpassword specified at database create>

SQL: set autocommit on;

SQL: set schema haschema;

SQL: alter table sessionattribute nodegroup all_nodes;

SQL: alter table singlesignon nodegroup all_nodes;

SQL: alter table statefulsessionbean nodegroup all_nodes;

SQL: alter table sessionheader nodegroup all_nodes;

SQL: alter table blobsessions nodegroup all_nodes;

SQL: quit;

6159633

configure-ha-cluster may hang.

Description

When the asadmin configure-ha-cluster command is used to create or configure a highly available cluster on more than one host, the command hangs. There are no exceptions thrown from the HADB Management Agent or the Application Server.

Solution

HADB does not support heterogeneous paths across nodes in a database cluster. Make sure that the HADB server installation directory and configuration directory are the same across all participating hosts. 

Additionally, clear the repository directories before running the command again. 

6197822

hadbm set brings the database instance to a state from which it is difficult to recover.

Description

In this scenario, the hadbm set command fails when attempting to change some database configuration variable; for example, setting DataBufferPoolSize to a larger size fails due to insufficient shared memory on node-0. The hadbm set command then leaves the database with node-0 in stopped state and node-1 in running state. Resetting the pool size back to the original value with the help of hadbm set fails with the message:

22073: The operation requires restart of node 1. Its mirror node is currently not available. Use hadbm status --nodes to see the status of the nodes.

In this case, hadbm startnode 0 also fails. 

Solution

Stop the database, then restore the old values using hadbm set and restart the database.

6200133

Failure in configure-ha-cluster; creating an HADB instance fails.

Description

Attempts to create an HADB cluster fail with the message:

HADB-E-00208: The transaction was aborted.

The booting transaction populating the SQL dictionary tables gets aborted. 

Solution

Run the configure-ha-cluster command again. If you run the hadbm create command and it fails with the previous message, rerun it.

5091349

Heterogeneous install paths are not supported.

It’s not possible to register the same software package with the same name at different locations on different hosts. 

Solution

HADB does not support heterogeneous paths across nodes in a database cluster. Ensure that the HADB server installation directory and configuration directory are the same across all participating hosts.

5091280

hadbm set does not check resource availability (disk and memory space)

Scenario

Increasing device or buffer sizes using hadbm set.

Description

The management system will check resource availability when creating databases or adding nodes, but it will not check if there are sufficient resources available when device or main-memory buffer sizes are changed. 

Solution

Check that there is enough free disk/memory space on all hosts before increasing any of the devicesize or buffersize configuration attributes.
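The disk part of this check can be sketched with Python's standard shutil.disk_usage call. The function name and both parameters are hypothetical: pass the directory that holds the devices and the number of additional bytes you intend to allocate. The sketch covers only the local host and only disk space, so it must be repeated on every host, and memory must be checked separately.

```python
import shutil


def has_room_for(path: str, additional_bytes: int) -> bool:
    """Return True if the filesystem holding path has enough free space."""
    return shutil.disk_usage(path).free >= additional_bytes
```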

4855623

When the host of one of the nodes is down, the hadbm stop command does not exit.

The hadbm stop command may not be able to shut down a database completely if HADB nodes do not receive shutdown messages due to network problems. The typical symptom is that hadbm takes more than 60 seconds to complete. In this situation, hadbm stop/delete will not work. You must specify the nodes that need to be shut down.

Solution

  1. To determine which nodes are still alive, use hadbm status --nodes.

  2. For each of the partially running nodes, run hadbm stopnode -f node_number.

4861337

If an active data node fails while executing hadbm stopdb, hadbm startdb will fail.

hadbm status should return non-operational if the database is unable to start.

Solution

To correct the problem: 

  1. Run hadbm clear --fast

    If this command reports failures of type address in use, log in to each machine in the system and kill all processes whose names start with clu_.

  2. Rerun the command, hadbm clear --fast.

    This will restart the database, causing the loss of all data.

  3. Recreate the session-store.

    For details on creating the session-store, see Sun Java System Application Server Administration Guide.

4958827

Child process transaction does not respond.

When a host machine accommodates more than one HADB node and all nodes place their devices on the same disk, disk I/O becomes the bottleneck. HADB processes wait for asynchronous I/O and therefore do not answer the node supervisor’s heartbeat check, which causes the node supervisor to restart them. Although this problem can occur on any operating system, it has been observed on Red Hat Linux AS 2.1 and 3.

Solution

Use separate disks to place the devices belonging to different HADB nodes residing on the same machine. 

None

HADB Configuration with Double Networks

HADB, configured with double networks on two subnets, works properly on Solaris SPARC. However, due to problems in the operating system or network drivers on some hardware platforms, Solaris x86 and Linux platforms have been observed not to handle double networks properly. This causes the following problems for HADB:

  • On Linux, some of the HADB processes are blocked on message sending. This causes HADB node restarts and network partitioning.

  • On Solaris x86, after a network failure, problems may arise that prevent switching to the other network interface. This does not happen all the time, so it is still better to have two networks than one. These problems are partially solved in Solaris 10.

  • Trunking is not supported.

  • HADB does not support double networks on Windows 2003 (bug id 5103186).