Chapter 5
HADB Issues on Windows
This section covers problems you may encounter when using Sun Java System Application Server 7 2004Q2 Update 1 Enterprise Edition with the High Availability Database (HADB) 4.4 on the Windows platform. HADB 4.4 has a new management architecture and new commands compared to HADB 4.3, which is bundled with the Application Server for UNIX platforms. For details on administering HADB 4.4, see the Sun Java System Application Server 7 2004Q2 Update 1 Administration Guide.
Topics in this chapter include:

- HADB Database Creation Fails
- Database Nodes Cannot Be Reached and the Database Does Not Function
- The Management Agents Could Not Establish a Domain
- Unexpected Node Restarts, Network Partitions, or Reconnects
- hadbm create or hadbm addnodes Command Hangs
- ma (Management Agent Process) Crashes
- Server Responds Slowly After Being Idle
- Requests Are Not Succeeding
- Session Persistence Problems
- HADB Performance Problems
- High Load Problems
- Client Cannot Connect to HADB
- Improving CPU Utilization
- HADB Administration Problems
HADB Database Creation Fails
The error occurs when starting the database. The typical message in this case is:
failed to start database : HADB Database creation failed
To determine the cause of the problem, use the Log Viewer and/or inspect the install_dir/hadb/4/log directory. Some possible errors are:
No Available Memory
Description
Insufficient memory is available to create the database.
Solution 1
Check whether other processes on the Windows system are using up all the memory, and end those processes if possible.
Solution 2
Install more memory in your system.
Review the documentation on shared memory requirements in the Preparing for HADB Setup chapter of the Sun Java System Application Server Installation Guide.
Too Few Semaphores
Description
HADB uses memory-mapped files for shared memory on Windows. This message appears when there is not enough space on the device disk for shared memory.
Solution
Make more space available on the device disk for shared memory.
Database Nodes Cannot Be Reached and the Database Does Not Function
Solution
The IP addresses of the involved hosts should be static. If the addresses are assigned dynamically (DHCP), the lease time should be set to forever (usually 0).
The Management Agents Could Not Establish a Domain
Description
The HADB management system is dependent on UDP Multicast messages on multicast address 228.8.8.8. If these messages cannot get through, the createdomain command fails with the following message:
The management agents could not establish a domain, please check that the hosts can communicate with UDP multicast.
Possible causes include:
- The agents are running on hosts with several network interfaces on different subnets.
- There is a switch on the network that does not forward multicast messages.
- There is a router on the network that does not route multicast messages with the address 228.8.8.8.
- Multicast messages are disabled in the operating system.
Solution 1
If the hosts have several network interfaces on different subnets, the management agent must be configured to use one of the subnets. Set the ma.server.mainternal.interfaces attribute.
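For example, the attribute can be set in the management agent's configuration file. The subnet value shown below is an assumption for illustration; use the subnet of the interface the agent should actually use:

ma.server.mainternal.interfaces=10.10.116.0/24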
Solution 2
Configure the needed network infrastructure to support multicast messages.
Unexpected Node Restarts, Network Partitions, or Reconnects
Description
Unexpected node restarts, network partitions, or reconnects with messages “Network Partition: *** Reconnect detected ***” written in the HADB history files and on the HADB host terminals.
This may happen if multiple nodes identify themselves with the same physical node number.
Solution
Try stopping the database with the hadbm stop command, and look for "rogue" hadb processes on the hosts on which any HADB nodes have been running at any time. If any hadb processes are still running, they belong to rogue nodes.
On the hosts on which rogue nodes are found, check that the management agents are correctly configured, and that the management domain is correctly defined. There may be multiple management domains configured, and each host may possibly be included in more than one domain. Make sure that databases defined in separate domains do not have conflicting definitions, such as database nodes using the same port numbers.
hadbm create or hadbm addnodes Command Hangs
Description
Some hosts in the host list given to hadbm create or addnodes have multiple network interfaces, while others have only one, and the hadbm create/addnodes command hangs.
Solution
For the hosts having multiple network interfaces, specify the dotted IP address of the network interface (for example, 129.241.111.23) to be used by HADB when issuing hadbm create or addnodes. If the host name is used instead of the IP address, the first interface registered on the host will be used, and there is no guarantee that the nodes will be able to communicate.
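For example, a create command that pins each node to a specific interface might look like the following. The database name mydb and the addresses are illustrative:

hadbm create --hosts 129.241.111.23,129.241.111.24 mydb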
ma (Management Agent Process) Crashes
Description
The ma (Management Agent process) crashes for various reasons.
Solution
Display diagnostic information by using hadbm listdomain. Typically, the remedy is to restart the failed agent. If that does not help, restart all agents in turn.
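For example, assuming the default management agent port 1862 and illustrative host names (the exact option syntax may vary; see the hadbm man pages):

hadbm listdomain --agent=host1:1862,host2:1862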
Server Responds Slowly After Being Idle
Description
The server takes a long time to service a request after a long period of idleness, and the server log shows "lost connection" messages of the form:
java.io.IOException:..HA Store: Lost connection to the server.
In such cases, the server needs to recreate the JDBC pool for HADB.
Solution
Change the timeout value. The default HADB connection timeout value is 1800 seconds. If the application server does not send any request over a JDBC connection during this period, HADB closes the connection, and the application server needs to re-establish it. To change the timeout value, use the hadbm set SessionTimeout=seconds command.
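For example, to raise the timeout to 2400 seconds (the value is illustrative; choose a value longer than the longest expected idle period):

hadbm set SessionTimeout=2400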
Requests Are Not Succeeding
The following problems are addressed in this section:

- Is the Load Balancer Timeout Correct?
- Are the System Clocks Synchronized?
- Is the Application Server Communicating With HADB?
Is the Load Balancer Timeout Correct?
Description
When configuring the response-timeout-in-seconds property in the loadbalancer.xml file, you must take into account the maximum timeouts for all the applications that are running. If the response timeout is set to a very low value, numerous in-flight requests will fail because the load balancer will not wait long enough for the Application Server to respond.
Conversely, setting the response timeout to an inordinately large value causes requests to be queued to an instance that has stopped responding, which also results in numerous failed requests.
Solution
Set the response-timeout-in-seconds value to the maximum response time of all the applications.
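For example, if the slowest application needs up to 300 seconds to respond, the entry in loadbalancer.xml might look like this (the value shown is illustrative):

<property name="response-timeout-in-seconds" value="300"/>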
Are the System Clocks Synchronized?
Description
When a session is stored in HADB, it includes some time information, including the last time the session was accessed and the last time it was modified. If the clocks are not synchronized, then when an instance fails and another instance (on another machine) takes over, that instance may conclude that the session has expired when it has not, or worse yet, that the session was last accessed in the future!
Solution
Verify that clocks are synchronized for all systems in the cluster.
Is the Application Server Communicating With HADB?
Description
HADB may be created and running, but if the persistence store has not yet been created, the Application Server will not be able to communicate with the HADB. This situation is accompanied by the following message:
WARNING (7715): ConnectionUtil: getConnectionsFromPool failed using connection URL: connection URL
Solution
Create the session store in the HADB with a command like the following:
asadmin create-session-store --storeurl connection URL --storeuser haadmin --storepassword hapasswd --dbsystempassword super123
Session Persistence Problems
The following problems are addressed in this section:

- The create-session-store Command Failed
- Configuring Instance-Level Session Persistence Did Not Work
- Session Data Seems To Be Corrupted
The create-session-store Command Failed
Description
The asadmin create-session-store command cannot run across firewalls. Therefore, for the create-session-store command to work, the application server instance and the HADB must be on the same side of a firewall.
The create-session-store command communicates with the HADB and not with the application server instance.
Solution
Locate the HADB and the application server instance on the same side of a firewall.
Configuring Instance-Level Session Persistence Did Not Work
The application-level session persistence configuration always takes precedence over instance-level session persistence configuration. Even if you change the instance-level session persistence configuration after an application has been deployed, the settings for the application still override the settings for the application server instance.
Session Data Seems To Be Corrupted
Description
Session data may be corrupted if the system log reports errors during session persistence operations.
If the data has been corrupted, there are three possible solutions for bringing the session store back to a consistent state, as described below.
Solution 1
Use the asadmin clear-session-store command to clear the session store.
Solution 2
If clearing the session store does not work, reinitialize the data space on all the nodes and clear the data in the HADB using the hadbm clear command.
Solution 3
If clearing the HADB does not work, delete and then recreate the database.
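A minimal sketch of Solution 3, assuming a database named mydb and illustrative host names (hadbm prompts for any required passwords):

hadbm delete mydb
hadbm create --hosts host1,host2 mydb

After recreating the database, recreate the session store with asadmin create-session-store.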
HADB Performance Problems
Performance is affected when transactions to HADB are delayed or aborted. This situation is generally caused by a shortage of system resources. Any wait beyond five seconds causes a transaction to abort. A node failure also causes the transactions active on that node at crash time to abort. A double failure (failure of both mirror nodes) makes the HADB unavailable. The causes of the failures can generally be found in the HADB history files.
To isolate the problem, consider the following:
Is There a Shortage of CPU or Memory Resources, or Too Much Swapping?
Description
Nodes restart, or double failures occur, with the message "Process blocked for x sec, max block time is 2.500000 sec." In this message, x is the length of time the process was blocked, which was greater than 2.5 seconds.
The HADB Node Supervisor Process (NSUP/clu_nsup_srv) tracks the time elapsed since the last time it did some monitoring work. If that time duration exceeds a specified maximum (2500ms by default), NSUP concludes that it was blocked too long and restarts the node.
NSUP being blocked for more than 2.5 seconds causes the node to restart. If mirror nodes are placed on the same host, the likelihood of a double failure is high. Simultaneous occurrence of the blocking on the mirror hosts may also lead to double failures.
The situation is especially likely to arise when other processes in the system (for example, in a colocated configuration) compete for CPU or memory, producing extensive swapping and multiple page faults as processes are rescheduled.
NSUP being blocked can also be caused by negative system clock adjustments.
Solution
Ensure that HADB nodes get enough system resources. Also ensure that the time synchronization daemon does not make large clock adjustments (jumps should not exceed 2 seconds).
Is There Disk Contention?
Description
Disk contention can have a negative impact on user data reads and writes to the disk devices, as well as on HADB writing to history files. Severe disk contention may delay or abort user transactions. Delay in history file writing may cause node restarts and, in the worst case, lead to double failures.
Disk contention can be identified by monitoring disk I/O from the OS for the disks used for data devices, log devices, and history files. It can also be identified by the following statement in the history files: "HADB warning: Schedule of async <read,write> operation took ..."
Delays in writing to the history file are recorded in the history file itself, identified by "NSUP BEWARE timestamp Last flush took too long (x msecs)."
This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:
Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.
This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor's request to check the heartbeat of the trans process. The nsup therefore assumes that the trans process has died and restarts it.
This problem is observed especially in RH AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.
Solution
Use one disk per node to hold the devices used by that node. If the node has more than one data device and disk contention is observed, move one data device to another disk. The same applies to the history file.
Is There a Shortage of HADB Data Device Space?
Description
One possible reason for transaction failure is running out of data device space. If this situation occurs, HADB writes warnings to the history file and aborts the transaction that tried to insert or update data.
Typical messages are:
HIGH LOAD: about to run out of device space, ...
HIGH LOAD: about to run out of device space on mirror node, ...
The general rule of thumb is that the data devices must have room for at least four times the volume of the user data. Refer to the Tuning Guide for additional explanation.
Solution 1
Increase the size of the data devices using the following command:
hadbm set TotalDataDevicePerNode=size
This solution requires that there is space available on the physical disks which are used for the HADB data devices on all nodes.
HADBM automatically restarts each node of the database.
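For example, assuming device sizes are specified in MBytes (the value shown is illustrative):

hadbm set TotalDataDevicePerNode=2048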
Solution 2
Stop and clear the HADB, and create a new instance with more nodes, larger data devices, or several data devices per node. Unfortunately, using this solution erases all persistent data. See the Administrator's Guide for more information about this procedure.
See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.
Is There a Shortage of Other HADB Resources?
When an HADB node is started, it allocates several fixed-size resources, such as the data buffer pool, the tuple log, the node-internal log, and locks.
If an HADB node runs out of resources it will delay and/or abort transactions. Resource usage information is shipped between mirror nodes, so that a node can delay or abort an operation which is likely to fail on its mirror node.
Transactions that are delayed repeatedly may time out and return an error message to the client. If they do not time out, the situation will be visible to the client only as decreased performance during the periods in which the system is short on resources.
These problems frequently occur in “High Load” situations. For details, see High Load Problems.
High Load Problems
High load scenarios are typically recognized by HIGH LOAD warnings in the HADB history files and by transactions being delayed or aborted. If a high load problem is suspected, consider the following:
Is the Tuple Log Out Of Space?
All user operations (delete, insert, update) are logged in the tuple log and then executed. The tuple log may fill up if log records are produced faster than they can be processed, typically because of CPU, disk I/O, or network bottlenecks, or because the log buffer is too small.
Solution 1
Check CPU usage, as described in Improving CPU Utilization.
Solution 2
If CPU utilization is not a problem, check the disk I/O. If the disk shows contention, avoid page faults when log records are being processed by increasing the data buffer size with hadbm set DataBufferPoolSize=...
Solution 3
Look for evidence of network contention, and resolve bottlenecks.
Solution 4
Increase the tuple log buffer using hadbm set LogBufferSize=...
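For example, the buffer sizes named in Solutions 2 and 4 can be raised as shown below. The values are in MBytes and are illustrative; confirm appropriate sizes in the Performance and Tuning Guide:

hadbm set DataBufferPoolSize=512
hadbm set LogBufferSize=88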
See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.
Is the Node-Internal Log Full?
Too many node-internal operations are scheduled but not processed due to CPU or disk I/O problems.
Solution 1
Check CPU usage, as described in Improving CPU Utilization.
Solution 2
If CPU utilization is not a problem, and there is sufficient memory, increase the internal log buffer size using the hadbm set InternalLogbufferSize=size command.
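For example (the value is in MBytes and is illustrative):

hadbm set InternalLogbufferSize=48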
Are There Enough Locks?
Some extra symptoms in the HADB history files identify this condition.
Solution 1: Increase the number of locks
Use hadbm set NumberOfLocks= to increase the number of locks.
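For example (the value is illustrative; size it according to the expected number of concurrent row operations):

hadbm set NumberOfLocks=50000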
Solution 2: Improve CPU Utilization
Check CPU usage, as described in Improving CPU Utilization.
Can You Fix the Problem by Doing Some Performance Tuning?
In most situations, reducing load or increasing the availability of resources will improve host performance. Some of the more common steps to take are:
- Run the nodes on hosts with better hardware characteristics (more internal memory, higher processor speed, more processors).
- Add physical disks and use several data devices, not more than one device on each physical disk.
- Add more nodes, on new hosts, and refragment the data to utilize the new nodes.
- Change configuration variables to allocate larger memory segments or internal data structures.
In addition, several HADB resources can be adjusted to relieve "HIGH LOAD" problems, as described in the Performance and Tuning Guide.
Client Cannot Connect to HADB
Description
This problem is accompanied by a message in the history file:
HADB-E-11626: Error in IPC operations, iostat = 28: No space left on device
If HADB started successfully, and you get this message at runtime, it means that the host computer has too few semaphore undo structures.
Solution
Stop the affected HADB node, reconfigure and reboot the affected host, and then restart the HADB node. HADB remains available during this process.
Improving CPU Utilization
Description
Available CPU cycles and I/O capacity can impose severe restrictions on performance. Resolving and preventing such issues is necessary to optimize system performance (in addition to configuring the HADB optimally).
Solutions
If there are additional CPUs on the host that are not exploited, add new nodes to the same host. Otherwise add new machines and add new nodes on them.
If the machine has enough memory, increase the DataBufferPoolSize, and increase other internal buffers that may be putting warnings into the log files. Otherwise, add new machines and add new nodes on them.
For more information on this subject, consult the Performance and Tuning Guide.
HADB Administration Problems
The hadbm command and its many subcommands and options are provided for administering the high-availability database (HADB). The hadbm command is located in the install_dir/SUNWhadb/4/bin directory.
Refer to the chapter on Configuring the High Availability Database in the Sun Java System Application Server Administrator’s Guide for a full explanation of this command. Specifics on the various hadbm subcommands are explained in the hadbm man pages.
The following problems are addressed in this section:

- hadbm Command Fails: The agents could not be reached
- hadbm Command Fails: command not found
- hadbm Command Fails: JAVA_HOME not defined
- create Fails: "path does not exist on a host"
- Database Does Not Start
- clear Command Failed
- create-session-store Failed
- Attaching Shared Memory Segment Fails Due To Insufficient Space
- Cannot Restart the HADB
hadbm Command Fails: The agents could not be reached
Description
The command fails with the error:
The agents <url> could not be reached.
The hosts in the URL could be unreachable either because the hosts are down, because the communication pathway has not been established, because the port number in the URL is wrong, or because the management agents are down.
Solution
Verify that the URL is correct. If the URL is correct, verify that the hosts are up and running and are ready to accept communications; for example:
ping hostname1
ping hostname2
...
hadbm Command Fails: command not found
Description
The hadbm command can be run from the current directory, or you can set the search PATH to access the hadb commands from anywhere, which is much more convenient. The error, “hadbm: Command not found,” indicates that neither of these conditions has been met.
Solution 1
cd to the directory that contains the hadbm command and run it from there:
cd install_dir/SUNWhadb/4/bin/
./hadbm
Solution 2
Use the full path to invoke the hadbm command:
install_dir/SUNWhadb/4/bin/hadbm
Solution 3
You can use the hadbm command from anywhere by setting the PATH variable. Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java System Application Server 7 Installation Guide.
To verify that the PATH settings are correct, run the following commands:
which asadmin
which hadbm
These commands should echo the paths to the utilities.
hadbm Command Fails: JAVA_HOME not defined
Description
The message “Error: JAVA_HOME is not defined correctly” indicates that the JAVA_HOME environment variable has not been set properly.
Solution
If multiple Java versions are installed on the system, ensure that the JAVA_HOME environment variable points to the correct Java version (1.4.1_03 or above for Enterprise Edition).
Instructions for setting the JAVA_HOME variable are contained in the "Preparing for HADB Setup" chapter of the Sun Java System Application Server 7 Installation Guide.
create Fails: “path does not exist on a host”
Description
After issuing the hadbm create command, an error similar to the following appears on the console:
./hadbm create ...
...
hadbm:Error 22022: Specified path does not exist on a host. Please specify a valid path: [ machineName ... ]
This error message indicates that the HADB server component is not installed on the machine on which you are trying to create the HA database.
Solution
Log in to the host and create paths for the HADB devices and HADB history files. Run hadbm create, setting the --devicepath and --historypath options to the paths you created. Also make sure that the user running the management agent on the host has read and write access to these directories.
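For example, assuming the directories below were created beforehand on every host (the paths and database name are illustrative):

hadbm create --devicepath=C:\hadb\devices --historypath=C:\hadb\history mydb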
Database Does Not Start
The create or start command fails with the console error message:
hadbm: Error 22095: Database could not be started...
Consider the following possibilities:
Was there a shared memory get segment failure?
Description
The history files show the error message:
..'systemerr'..HADB-S-01760: Shared memory get segment failed..
Solution 1
Reboot the system.
Solution 2
If the problem persists, the operating system may not have enough shared memory or semaphores configured. Increase them according to the number of nodes on the machine (for details, see the Deployment Guide). Note that the machine must be restarted for these changes to take effect.
Do the History Files Contain Errors?
Description
If the problem still persists, inspect the HADB history files for error messages.
One likely message indicates that another process is using the port that an HADB server is trying to use. This can occur in several situations.
Try again to stop the node with the hadbm command. If that fails, use Windows Task Manager to end the OS process, clu_nsup_srv, for this node. The nsup process should then end all its HADB child processes. If the nsup process does not exist, you have to end all the HADB child processes one by one.
Check the following:
- Shared memory is correct on all machines in the HADB configuration.
- No other HADB databases are running on the machines, or any other processes that could be using the same port numbers.
- All necessary directories exist and have write permissions.
- There is enough space in the directory where the devices will be written.
Solutions
After verifying that none of the above errors have occurred, try the remedies described below, in order.
For more information, refer to the Error Message Reference.
Do You Need a Simple Solution?
As a last resort, try the following possible solutions.
Solution 1
Delete the database with the hadbm delete command, and see if that allows the hadbm create to proceed normally.
Solution 2
Sometimes a system reboot is the necessary last resort. Issue hadbm delete, reboot the machine, and then rerun the hadbm create command.
clear Command Failed
When this command fails, the history files are likely to explain why. See Do the History Files Contain Errors? for instructions on viewing the history files and a list of some common error messages.
create-session-store Failed
The asadmin create-session-store command could fail for one of these reasons:
Invalid user name or password
This error occurs when the --dbsystempassword supplied to the create-session-store command is not the same password as the one given at the time of database creation.
Solution 1
Try the command again with the correct password.
Solution 2
If you cannot remember the dbsystem password, you need to clear the database using hadbm clear and provide a new dbsystem password.
SQLException: No suitable driver
The create-session-store command produces the error: SessionStoreException: java.sql.SQLException: No suitable driver.
Solution 1
This error can occur when asadmin is not able to find hadbjdbc4.jar from the AS_HADB path defined in asenv.conf in the Application Server config directory.
The solution is to change AS_HADB to point to the location of the HADB installation.
Here is a sample AS_HADB entry from an asenv.conf file:
AS_HADB=c:\install_dir\SUNWhadb\4.4.0-8
Solution 2
This error can also occur if you provide the incorrect value for --storeUrl. To solve this problem, obtain the correct URL using hadbm get jdbcURL.
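For example, assuming a database named mydb, the returned URL has the following general form (the host names and port numbers are illustrative):

hadbm get jdbcURL mydb
jdbc:sun:hadb:host1:15205,host2:15205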
Attaching Shared Memory Segment Fails Due To Insufficient Space
Description
The server throws an error message like the following:
Attaching shared memory segment with key xx failed,
OS status=12 OS message: Not enough space.
Solution
Increase shared memory.
Cannot Restart the HADB
Description
HADB restart does not work after a double node failure. Additional recovery actions are needed before HADB can be restarted.
Symptoms of a double node failure include an hadbm start command that does not succeed and an hadbm status command that shows HADB in a non-operational state.
This problem occurs when mirror HADB host machines have failed or been rebooted, typically after a power outage, when a machine is rebooted without first stopping the HADB (in a single-machine installation), or when a pair of mirror machines from both Data Redundancy Units (DRUs) is rebooted.
If mirror host machine pairs are rebooted, or if host failures cause an unplanned reboot of one or more mirror host machine pairs, then the mirror nodes on these machines are not available, and the data is likely to be in an inconsistent state, because a record may have been in the process of being committed when the power failed or the reboot occurred.
Tip
To prevent such problems, be sure to use the procedure described in the HADB chapter of the Administration Guide when rebooting as a part of a planned maintenance.
HADB cannot heal itself automatically in such “double failure” situations because the part of the data that resided on the pair nodes is lost. In such cases, the hadbm start command does not succeed, and the hadbm status command shows that HADB is in a non-operational state.
Explanation
For performance reasons, the HADB does much of its data management in memory. If both DRUs are rebooted, then the HADB does not have a chance to write its data blocks to disk.
For more information on the DRUs and HADB configuration, see "Administering the High Availability Database" in the Administration Guide, and the Deployment Guide.
Solution