High Availability (Sun Java System Application Server Standard and Enterprise Edition 7 2004Q2 Update 7 Release Notes)

High Availability

This section describes the known high availability issues and associated solutions.

ID	Summary
6301842	Sometimes on Windows, the management agent cannot deregister the service when running `ma -r`, and fails with the error message `Could not identify program`. Solution: Open a Windows command prompt and run `sc stop HADBMgmtAgent`, then run `sc delete HADBMgmtAgent`. If the service was installed and started with `ma -i -n` `servicename`, use `servicename` in place of `HADBMgmtAgent` when running `sc`.
6293912	The Management Agent should not use special-use interfaces. Solution: When issuing `hadbm create` on hosts with multiple interfaces, always specify the IP addresses explicitly, using dotted decimal notation (DDN).
6291562	Reassembly failures on Windows. On the Windows platform, certain configurations and loads can cause a large number of reassembly failures in the operating system. The problem has been seen with configurations of more than 20 nodes when running several table scans (`select *`) in parallel. Symptoms include frequent transaction aborts, repair and recovery that take a long time to complete, and frequent timeouts in various parts of the system. Solution: To fix the problem, the Windows registry variable `HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters` should be set to a value higher than the default value of 100. We recommend increasing it to `0x1000` (`4096`). For more information, see article 811003 from the Microsoft support pages: `http://support.microsoft.com/default.aspx?scid=kb;en-us;811003`
6275319	Non-root users cannot manage HADB. Installing with Java Enterprise System (as root) does not permit non-root users to manage HADB. Solution: Always log in as root to manage HADB.
6275103	The hadbm management agent should give a better error message when a session object has timed out and been deleted at the MA. Sometimes, a resource contention problem on the server may cause a management client to become disconnected. When reconnecting, a misleading error message, `hadbm:Error 22184: A password is required to connect to the management agent`, may be returned. Solution: Check whether there is a resource problem on the server, take proper action (for example, add more resources), and retry the operation.
6273681	Management agents in global and local zones may interfere. On Solaris 10, stopping a management agent by using the `ma-initd` script in a global zone stops the management agent in the local zone as well. Solution: Do not install the management agent in both the global zone and a local zone.
6271063	Install/removal and symlink preservation. When the HADB c package (Solaris: `SUNWhadbc`, Linux: `sun-hadb-c`) version <m.n.u-p> is installed or removed, the symlink `/opt/SUNWhadb/<m>` is never touched once it exists. An orphaned symlink can therefore be left behind. Solution: Delete the symlink before installing or after uninstalling, unless it is in use.
6265419	Downgrading from HADB Version 4.4.2.5 to HADB Version 4.4.1.7 causes the management agent to fail with different error codes. When downgrading to a previous HADB version, the management agent may fail with different error codes. Solution: It is possible to downgrade the HADB database; however, the management agent cannot be downgraded if changes have been made in the repository objects. After a downgrade, you must use the management agent from the latest HADB version.
6262824	hadbm does not support passwords containing uppercase letters. Capital letters in passwords are converted to lowercase when the password is stored in HADB. Solution: Do not use passwords containing uppercase letters.
6173886, 6253132	`hadbm createdomain` may fail. When the management agent runs on a host with multiple network interfaces, the `createdomain` command may fail if not all network interfaces are on the same subnet: `hadbm:Error 22020: The management agents could not establish a domain, please check that the hosts can communicate with UDP multicast.` Unless configured otherwise, the management agents use the first interface for UDP multicasts (first as defined by the result from `java.net.NetworkInterface.getNetworkInterfaces()`). Solution: The best solution is to tell the management agent which subnet to use by setting `ma.server.mainternal.interfaces` in the configuration file (for example, `ma.server.mainternal.interfaces=10.11.100.0`). Alternatively, you can configure the router between the subnets to route multicast packets (the management agent uses multicast address 228.8.8.8). Before retrying with a new configuration of the management agents, clean up the management agent's repository: stop all agents in the domain, and delete all files and directories in the repository directory (identified by `repository.dr.path` in the management agent configuration file). This must be done on all hosts before restarting the agents with a new configuration file.
6249685	The `clu_trans_srv` process cannot be interrupted on Linux. A bug in the 64-bit version of Red Hat Enterprise Linux 3.0 makes the `clu_trans_srv` process end up in an uninterruptible mode when performing asynchronous I/O. This means that `kill -9` does not work and the operating system must be rebooted. Solution: Use a 32-bit version of Red Hat Enterprise Linux 3.0.
6230792, 6230415	Starting, stopping, or reconfiguring HADB may fail or hang. On AMD Opteron™ systems running Solaris 10, starting, stopping, or reconfiguring HADB using the `hadbm` command may fail or hang with one of the following errors: `hadbm:Error 22009: The command issued had no progress in the last 300 seconds.` `HADB-E-21070: The operation did not complete within the time limit, but has not been cancelled and may complete at a later time.` This may happen if there are inconsistencies while reading from or writing to a file (nomandevice) that the `clu_noman_srv` process uses. The problem can be detected by looking for the following messages in the HADB history files: `n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 does not respond.` `n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Have not heard from it in 104.537454 sec` `n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 did not start.` Solution: Run the following command for the affected node: `hadbm restartnode --level=clear` `nodeno` `dbname`. Note that all devices for the node will be reinitialized. You may have to stop the node before reinitializing it.
None	HADB database creation fails. Creating a new database may fail with the following error, which indicates that too few shared memory segments are available: `HADB-E-21054: System resource is unavailable : HADB-S-05512: Attaching shared memory segment with key "xxxxx" failed, OS status=24 OS error message: Too many open files.` Solution: Verify that shared memory is configured and that the configuration is working. In particular, on Solaris 8, inspect the file `/etc/system` and check that the value of the variable `shmsys:shminfo_shmseg` is at least six times the number of nodes per host.
6232140	The management agent terminates with the exception `IPV6_MULTICAST_IF failed`. The management agent may terminate with this exception when starting on a host running Solaris 8 that has several network interface cards, some with IPv6 enabled and some with IPv4 enabled. The root cause is described in bug 4418866/4418865. Solution: Set the environment variable `_JAVA_OPTIONS` as follows: `$> export _JAVA_OPTIONS="-Djava.net.preferIPv4Stack=true"` Alternatively, use Solaris 9.
6171832, 6172138	Stale sessions are not cleaned up, leading to degraded HADB performance or a data device that fills up. Solution: To remove stale sessions efficiently, modify the `sun-ejb-jar.xml` file so that the value of `cache-idle-timeout-in-seconds` is less than the `removal-timeout-in-seconds` value. If `cache-idle-timeout-in-seconds` is equal to or greater than `removal-timeout-in-seconds`, old sessions will not be cleaned up in HADB, which is the expected behavior. If you continue to face issues with stale sessions even after setting these properties as recommended, contact product support for help.
6171994	Improper permissions in the `security.policy` file cause a startup hang. Description: hadb-jdbc has improper access permissions in the `security.policy` file. Solution: If there is an intermittent hang during startup, add the suggested permissions to the `security.policy` file. By default, the following is present: `permission java.net.SocketPermission "*", "connect";` Suggested permissions: `permission java.net.SocketPermission "*", "connect,accept,listen,resolve";`
5042351	New tables created after new nodes are added will not spread to the added nodes. Description: If a user creates a database instance and then adds nodes to it, any tables created afterwards will not be fragmented on the nodes added after database creation. Only the tables created before `addnodes` can use the added nodes, because `hadbm addnodes` refragments them. This is because create table uses the `sysnode` node group, which is created at the boot time of the database (when `hadbm create` is executed). Solution: Run `hadbm refragment` after new tables have been added, or create the new tables on the node group `all_nodes`.
6158393	HADB problem with RedHat AS 3.0 in co-located mode under load. Description: When HADB runs on RedHat Linux AS 3.0 co-located with Application Server, transactions may be aborted, which affects performance. This is caused by excessive swapping performed by the operating system. Solution: This issue appears to have been resolved when HADB was tested against RedHat Linux AS 3.0 Update 4.
6214601	Addnodes fails with table not found error since hadbm searches user tables in sysroot schema. Description: The `hadbm refragment` command fails with: `hadbm:Error 22042: Database could not be refragmented. Please retry with hadbm refragment command to refragment the database.. Caused by: HADB-E-11701: Table singlesignon not found` Solution: Refragment the Application Server tables manually with the help of `clusql`: `> clusql` `server`:`port list` `system+``dbpassword specified at database create` `SQL: set autocommit on;` `SQL: set schema haschema;` `SQL: alter table sessionattribute nodegroup all_nodes;` `SQL: alter table singlesignon nodegroup all_nodes;` `SQL: alter table statefulsessionbean nodegroup all_nodes;` `SQL: alter table sessionheader nodegroup all_nodes;` `SQL: alter table blobsessions nodegroup all_nodes;` `SQL: quit;`
6159633	`configure-ha-cluster` may hang. Description: When the `asadmin configure-ha-cluster` command is used to create or configure a highly available cluster on more than one host, the command hangs. No exceptions are thrown from the HADB Management Agent or the Application Server. Solution: HADB does not support heterogeneous paths across nodes in a database cluster. Make sure that the HADB server installation directory and configuration directory are the same across all participating hosts. Additionally, clear the repository directories before running the command again.
6197822	`hadbm set` brings the database instance to a state from which it is difficult to recover. Description: In this scenario, the `hadbm set` command fails when attempting to change a database configuration variable; for example, setting `DataBufferPoolSize` to a larger size fails due to insufficient shared memory on node-0. The `hadbm set` command then leaves the database with node-0 in stopped state and node-1 in running state. Resetting the pool size to the original value with `hadbm set` fails with the message: `22073: The operation requires restart of node 1. Its mirror node is currently not available. Use hadbm status --nodes to see the status of the nodes.` In this case, `hadbm startnode 0` also fails. Solution: Stop the database, restore the old values using `hadbm set`, and restart the database.
6200133	Failure in `configure-ha-cluster`; creating an HADB instance fails. Description: Attempts to create an HADB cluster fail with the message: `HADB-E-00208: The transaction was aborted.` The booting transaction that populates the SQL dictionary tables is aborted. Solution: Run the `configure-ha-cluster` command again. If you run the `hadbm create` command and it fails with the same message, rerun it.
5091349	Heterogeneous install paths are not supported. It is not possible to register the same software package with the same name at different locations on different hosts. Solution: HADB does not support heterogeneous paths across nodes in a database cluster. Ensure that the HADB server installation directory and configuration directory are the same across all participating hosts.
5091280	`hadbm set` does not check resource availability (disk and memory space). Scenario: Increasing device or buffer sizes using `hadbm set`. Description: The management system checks resource availability when creating databases or adding nodes, but it does not check whether sufficient resources are available when device or main-memory buffer sizes are changed. Solution: Check that there is enough free disk and memory space on all hosts before increasing any of the `devicesize` or `buffersize` configuration attributes.
4855623	When the host of one of the nodes is down, the `hadbm stop` command does not exit. The `hadbm stop` command may not be able to shut down a database completely if HADB nodes do not receive shutdown messages due to network problems. The typical symptom is that `hadbm` takes more than 60 seconds to complete. In this situation, `hadbm stop` and `hadbm delete` will not work, and you must specify the nodes that need to be shut down. Solution: To determine which nodes are still alive, use `hadbm status --nodes`. For each of the partially running nodes, run `hadbm stopnode -f node_number`.
4861337	If an active data node fails while executing `hadbm stopdb`, `hadbm startdb` will fail. `hadbm status` should return `non-operational` if the database is unable to start. Solution: To correct the problem, run `hadbm clear --fast`. If this command reports failures of type `address in use`, log in to each machine in the system and kill all processes whose names start with `clu_`, then rerun `hadbm clear --fast`. This restarts the database and causes the loss of all data. Finally, recreate the session store. For details on creating the session store, see the Sun Java System Application Server Administration Guide.
4958827	Child process transaction does not respond. When a host machine accommodates more than one HADB node and all nodes place their devices on the same disk, disk I/O can become the bottleneck. HADB processes that are waiting for asynchronous I/O do not answer the node supervisor's heartbeat check, which causes the node supervisor to restart them. Although this problem can occur on any operating system, it has been observed on Red Hat Linux AS 2.1 and 3. Solution: Use separate disks for the devices belonging to different HADB nodes residing on the same machine.
None	HADB configuration with double networks. HADB, configured with double networks on two subnets, works properly on Solaris SPARC. However, due to problems in the operating system or network drivers on some hardware platforms, Solaris x86 and Linux platforms do not handle double networks properly. This causes the following problems for HADB: On Linux, some of the HADB processes are blocked on message sending, which causes HADB node restarts and network partitioning. On Solaris x86, after a network failure, problems may arise that prevent switching to the other network interface. This does not happen all the time, so it is still better to have two networks than one. These problems are partially solved in Solaris 10. Trunking is not supported. HADB does not support double networks on Windows 2003 (bug ID 5103186).
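
For issues 6173886 and 6253132, it can help to confirm whether the interfaces of a host really span more than one subnet before editing the management agent configuration. The following is a minimal Python sketch, not part of HADB. The helper names `distinct_subnets` and `ma_interfaces_line`, the sample addresses, and the `/24` prefix lengths are assumptions made for illustration. Only the property name `ma.server.mainternal.interfaces` and the value format come from the entry above.

```python
import ipaddress


def distinct_subnets(interfaces):
    """Return the set of networks covered by 'address/prefix' strings."""
    return {ipaddress.ip_interface(item).network for item in interfaces}


def ma_interfaces_line(network):
    """Format the configuration line using the subnet address, as in the
    example given for issue 6173886."""
    return f"ma.server.mainternal.interfaces={network.network_address}"


# Hypothetical interface addresses for one host (assumed prefix lengths).
host_interfaces = ["10.11.100.5/24", "192.168.1.5/24"]

subnets = distinct_subnets(host_interfaces)
if len(subnets) > 1:
    print("Interfaces span several subnets:", sorted(str(n) for n in subnets))
    # Pick the subnet that all management agents share (assumed here).
    print(ma_interfaces_line(ipaddress.ip_network("10.11.100.0/24")))
```

If the set contains more than one network, the management agent should be told explicitly which subnet to use, as described in the solution for that issue.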
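
For issues 6230792 and 6230415, the history-file messages quoted in that entry can be searched for mechanically. The following Python sketch assumes that history lines follow exactly the layout of the three sample messages shown there. The function name `unresponsive_noman_nodes` and the regular expression are illustrative assumptions, not an HADB tool.

```python
import re

# Matches lines such as:
# n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 does not respond.
NOMAN_PATTERN = re.compile(
    r"^n:(?P<node>\d+) NSUP INF \S+ \S+ p:\d+ "
    r"Child process noman\d+ \d+ (?:does not respond|did not start)\.$"
)


def unresponsive_noman_nodes(lines):
    """Return the sorted node numbers whose noman child process was
    reported as not responding or not starting."""
    nodes = set()
    for line in lines:
        match = NOMAN_PATTERN.match(line.strip())
        if match:
            nodes.add(int(match.group("node")))
    return sorted(nodes)


# The sample messages quoted in the issue description.
sample = [
    "n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 does not respond.",
    "n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Have not heard from it in 104.537454 sec",
    "n:3 NSUP INF 2005-02-11 18:00:33.844 p:731 Child process noman3 733 did not start.",
]
print(unresponsive_noman_nodes(sample))
```

Each node number reported this way is a candidate for the `hadbm restartnode --level=clear` command described in the solution.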
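
For the "HADB database creation fails" entry, the rule that `shmsys:shminfo_shmseg` must be at least six times the number of nodes per host can be checked against the contents of `/etc/system`. The following Python sketch assumes the usual `set module:variable=value` syntax of that file, with comment lines starting with `*`. The function names are hypothetical. A missing setting is treated as insufficient, which is a conservative assumption because the operating system default then applies.

```python
import re

SHMSEG_PATTERN = re.compile(r"^\s*set\s+shmsys:shminfo_shmseg\s*=\s*(\S+)")


def shmseg_value(etc_system_text):
    """Return the last configured shminfo_shmseg value, or None if unset."""
    value = None
    for line in etc_system_text.splitlines():
        if line.lstrip().startswith("*"):  # comment line in /etc/system
            continue
        match = SHMSEG_PATTERN.match(line)
        if match:
            value = int(match.group(1), 0)  # accepts decimal or 0x-prefixed hex
    return value


def shmseg_is_sufficient(etc_system_text, nodes_per_host):
    """Apply the rule from the release note: value >= 6 * nodes per host."""
    value = shmseg_value(etc_system_text)
    return value is not None and value >= 6 * nodes_per_host


# Hypothetical file contents for a host running two HADB nodes.
example = "* shared memory tuning\nset shmsys:shminfo_shmseg=12\n"
print(shmseg_is_sufficient(example, nodes_per_host=2))
```

On a real host, the text would be read from `/etc/system` rather than from the sample string.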
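
For issues 6171832 and 6172138, the relationship between the two timeout properties can be verified before deployment. The following Python sketch looks for any element in a `sun-ejb-jar.xml` document that contains both `cache-idle-timeout-in-seconds` and `removal-timeout-in-seconds` as direct children, and reports the pairs that violate the recommendation. The function name `stale_session_risks`, the enclosing element names, and the values in the sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def stale_session_risks(xml_text):
    """Return (idle, removal) pairs where the idle timeout is not less than
    the removal timeout, that is, where stale sessions may not be cleaned up."""
    root = ET.fromstring(xml_text)
    risks = []
    for parent in root.iter():
        idle = parent.find("cache-idle-timeout-in-seconds")
        removal = parent.find("removal-timeout-in-seconds")
        if idle is None or removal is None:
            continue
        idle_seconds, removal_seconds = int(idle.text), int(removal.text)
        if idle_seconds >= removal_seconds:
            risks.append((idle_seconds, removal_seconds))
    return risks


# Hypothetical descriptor fragment; the enclosing elements are assumed.
descriptor = """
<sun-ejb-jar>
  <enterprise-beans>
    <ejb>
      <bean-cache>
        <cache-idle-timeout-in-seconds>6000</cache-idle-timeout-in-seconds>
        <removal-timeout-in-seconds>5400</removal-timeout-in-seconds>
      </bean-cache>
    </ejb>
  </enterprise-beans>
</sun-ejb-jar>
"""
print(stale_session_risks(descriptor))
```

An empty result means every pair found satisfies the recommended ordering.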