Sun Java System Application Server 7 2004Q2 Update 1 Standard and Enterprise Edition Troubleshooting Guide 

Chapter 5
HADB Issues on Windows

This section covers problems you may encounter when using Sun Java™ System Application Server 7 2004Q2 Update 1 Enterprise Edition with the High Availability Database (HADB) 4.4 on the Windows platform. HADB 4.4 has a new management architecture and new commands compared to HADB 4.3, which is bundled with Application Server for UNIX platforms. For details on administering HADB 4.4, see the Sun Java System Application Server 7 2004Q2 Update 1 Administration Guide.

Topics in this chapter include:


HADB Database Creation Fails

The error occurs when starting the database. The typical message in this case is:

failed to start database : HADB Database creation failed

To determine the cause of the problem, use the Log Viewer and/or inspect the install_dir/hadb/4/log directory. Some possible errors are:

No Available Memory

Description

Insufficient memory is available to create the database.

Solution 1

Check whether other processes on Windows are using up all the memory, and end those processes if possible.

Solution 2

Install more memory in your system.

Review the documentation on shared memory requirements in the Preparing for HADB Setup chapter of the Sun Java System Application Server Installation Guide.

Too Few Semaphores

Description

HADB uses memory-mapped files for shared memory on Windows. You will get this message when there is not enough space on the disk device for shared memory.

Solution

Make more space available on the disk device for shared memory.

Database Nodes Cannot Be Reached and the Database Does Not Function

Solution

The IP addresses of the involved hosts should be static. If the addresses are dynamic (DHCP) the lease time should be set to forever (usually 0).

The Management Agents Could Not Establish a Domain

Description

The HADB management system is dependent on UDP Multicast messages on multicast address 228.8.8.8. If these messages cannot get through, the createdomain command fails with the following message:

The management agents could not establish a domain, please check that the hosts can communicate with UDP multicast.

Possible causes include:

Solution 1

If the hosts have several network interfaces on different subnets, the management agent must be configured to use one of the subnets. Set the ma.server.mainternal.interfaces attribute.

Solution 2

Configure the needed network infrastructure to support multicast messages.

Unexpected Node Restarts, Network Partitions, or Reconnects

Description

Nodes restart unexpectedly, or network partitions or reconnects occur, and the message “Network Partition: *** Reconnect detected ***” is written to the HADB history files and on the HADB host terminals.

This may happen if multiple nodes identify themselves with the same physical node number.

Solution

Try stopping the database with the hadbm stop command, and look for “rogue” hadb processes on the hosts on which any HADB nodes have been running at any time. If there still are hadb processes running, these belong to rogue nodes.

On the hosts on which rogue nodes are found, check that the management agents are correctly configured, and that the management domain is correctly defined. There may be multiple management domains configured, and each host may possibly be included in more than one domain. Make sure that databases defined in separate domains do not have conflicting definitions, such as database nodes using the same port numbers.

hadbm create or hadbm addnodes Command Hangs

Description

Some hosts in the host list given to hadbm create or addnodes have multiple network interfaces, while others have only one, and the hadbm create/addnodes command hangs.

Solution

For the hosts having multiple network interfaces, specify the dotted IP address of the network interface (for example, 129.241.111.23) to be used by HADB when issuing hadbm create/addnodes. If the host name is used instead of the IP address, the first interface registered on the host will be used, and there is no guarantee that the nodes will be able to communicate.
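
One way to catch this before running the command is to check which entries in a host list are names rather than dotted IP addresses. The following Python sketch does only that check. The function name and the sample host list are hypothetical, and the sketch does not call hadbm.

```python
import ipaddress


def hosts_given_by_name(hosts):
    """Return the entries that are not dotted IPv4 addresses."""
    by_name = []
    for host in hosts:
        try:
            ipaddress.IPv4Address(host)
        except ValueError:
            # Not a dotted IPv4 address, so it is treated as a host name.
            by_name.append(host)
    return by_name


if __name__ == "__main__":
    # "hadbhost1" is a made-up host name used only for illustration.
    print(hosts_given_by_name(["129.241.111.23", "hadbhost1"]))
```

Any entry reported for a multi-homed host is a candidate for replacement with the dotted IP address of the intended interface.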

ma (Management Agent Process) Crashes

Description

The ma (Management Agent process) crashes for various reasons.

Solution

Display diagnostic information by using hadbm listdomain. Typically, the remedy is to restart the failed agent. If that does not help, restart all agents in turn.


Server Responds Slowly After Being Idle

Description

The server takes a long time to service a request after a long period of idleness, and the server log shows “lost connection” messages of the form:

java.io.IOException:..HA Store: Lost connection to the server.

In such cases, the server needs to recreate the JDBC pool for HADB.

Solution

Change the timeout value. The default HADB connection timeout value is 1800 seconds. If the application server does not send any request over a JDBC connection during this period, HADB closes the connection, and the application server needs to re-establish it. To change the timeout value, use the hadbm set SessionTimeout= command.


Note

Make sure the HADB connection timeout is greater than the JDBC connection pool timeout. If the JDBC connection pool timeout is greater than the HADB connection timeout, the connection will be closed from the HADB side but will remain in the Application Server connection pool. When the application then tries to use the connection, the application server will have to recreate the connection, which incurs significant overhead.
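
The rule in the note above can be expressed as a simple comparison. The following Python sketch is illustrative only: the function name is hypothetical, the 1800-second default comes from this section, and the 300-second pool value is an assumed example rather than a product default.

```python
HADB_DEFAULT_SESSION_TIMEOUT_S = 1800  # default stated in this section


def hadb_outlives_pool(hadb_session_timeout_s, jdbc_pool_idle_timeout_s):
    """Return True when HADB keeps idle connections longer than the pool does."""
    return hadb_session_timeout_s > jdbc_pool_idle_timeout_s


if __name__ == "__main__":
    # 300 seconds is an assumed pool idle timeout, used only for illustration.
    print(hadb_outlives_pool(HADB_DEFAULT_SESSION_TIMEOUT_S, 300))
```

If the check returns False, either raise the HADB timeout or lower the pool timeout so that the pool discards idle connections before HADB closes them.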



Requests Are Not Succeeding

The following problems are addressed in this section:

Is the Load Balancer Timeout Correct?

Description

When configuring the response-timeout-in-seconds property in the loadbalancer.xml file, you must take into account the maximum timeouts for all the applications that are running. If the response timeout is set to a very low value, numerous in-flight requests will fail because the load balancer will not wait long enough for the Application Server to respond to the request.

Conversely, setting the response timeout to an inordinately large value will result in requests being queued to an instance that has stopped responding, resulting in numerous failed requests.

Solution

Set the response-timeout-in-seconds value to the maximum response time of all the applications.
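
As a small worked example, the value can be derived from the worst-case response time of each deployed application. The Python sketch below assumes you have measured those times yourself. The application names and numbers are invented, and the sketch does not read or write loadbalancer.xml.

```python
import math


def response_timeout_seconds(max_response_times_s):
    """Return the largest per-application response time, rounded up to whole seconds."""
    if not max_response_times_s:
        raise ValueError("at least one application response time is required")
    return math.ceil(max(max_response_times_s.values()))


if __name__ == "__main__":
    # Hypothetical applications and their measured worst-case response times.
    measured = {"store": 45, "reports": 120, "login": 4.2}
    print(response_timeout_seconds(measured))
```

Rounding up keeps the result at or above the slowest application's response time.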

Are the System Clocks Synchronized?

Description

When a session is stored in HADB, it includes some time information, including the last time the session was accessed and the last time it was modified. If the clocks are not synchronized, then when an instance fails and another instance takes over (on another machine), that instance may think the session was expired when it was not, or worse yet, that the session was last accessed in the future!


Note

In a non-co-located configuration, it is important to synchronize the clocks on the machines that are hosting HADB nodes. For more information, see the Installation Guide chapter, “Preparing for HADB Setup.”


Solution

Verify that clocks are synchronized for all systems in the cluster.
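
To see why clock skew matters, consider the following Python sketch. It is a simplified model, not the Application Server's actual expiry logic. The function names, the 1800-second inactivity limit, and all timestamps are assumptions chosen for illustration.

```python
def appears_expired(last_accessed_s, now_s, max_inactive_s):
    """A session looks expired when it has been idle longer than the limit."""
    return now_s - last_accessed_s > max_inactive_s


def appears_from_future(last_accessed_s, now_s):
    """A session looks as if it was accessed in the future when the reader's clock is behind."""
    return last_accessed_s > now_s


if __name__ == "__main__":
    last_accessed = 1000   # written by the failed instance, whose clock was correct
    real_now = 1010        # the takeover happens 10 seconds later
    # A takeover instance whose clock runs 2000 seconds ahead sees an expired session.
    print(appears_expired(last_accessed, real_now + 2000, 1800))
    # A takeover instance whose clock runs 60 seconds behind sees a future access time.
    print(appears_from_future(last_accessed, real_now - 60))
```

With synchronized clocks, the same session is only 10 seconds idle and neither condition holds.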

Is the Application Server Communicating With HADB?

Description

HADB may be created and running, but if the persistence store has not yet been created, the Application Server will not be able to communicate with the HADB. This situation is accompanied by the following message:

WARNING (7715): ConnectionUtilgetConnectionsFromPool failed using connection URL: connection URL

Solution

Create the session store in the HADB with a command like the following:

asadmin create-session-store --storeurl connection URL --storeuser haadmin --storepassword hapasswd --dbsystempassword super123


Session Persistence Problems

The following problems are addressed in this section:

The create-session-store Command Failed

Description

The asadmin create-session-store command cannot run across firewalls. Therefore, for the create-session-store command to work, the application server instance and the HADB must be on the same side of a firewall.

The create-session-store command communicates with the HADB and not with the application server instance.

Solution

Locate the HADB and the application server instance on the same side of a firewall.

Configuring Instance-Level Session Persistence Did Not Work

The application-level session persistence configuration always takes precedence over instance-level session persistence configuration. Even if you change the instance-level session persistence configuration after an application has been deployed, the settings for the application still override the settings for the application server instance.

Session Data Seems To Be Corrupted

Description

Session data may be corrupted if the system log reports errors under the following circumstances:

If the data has been corrupted, there are three possible solutions for bringing the session store back to a consistent state, as described below.

Solution 1

Use the asadmin clear-session-store command to clear the session store.

Solution 2

If clearing the session store does not work, reinitialize the data space on all the nodes and clear the data in the HADB using the hadbm clear command.

Solution 3

If clearing the HADB does not work, delete and then recreate the database.


HADB Performance Problems

Performance is affected when the transactions to HADB get delayed or aborted. This situation is generally caused by a shortage of system resources. Any wait beyond five seconds causes the transactions to abort. Any node failures also cause the active transaction on that node at crash time to abort. Any double failures (failure of both mirrors) will make the HADB unavailable. The causes of the failures can generally be found in the HADB history files.

To isolate the problem, consider the following:

Is There a Shortage of CPU or Memory Resources, or Too Much Swapping?

Description

Nodes restart or double failures occur with the message “Process blocked for x sec, max block time is 2.500000 sec.” Here, x is the length of time the process was blocked, which exceeded 2.5 seconds.

The HADB Node Supervisor Process (NSUP/clu_nsup_srv) tracks the time elapsed since the last time it did some monitoring work. If that time duration exceeds a specified maximum (2500ms by default), NSUP concludes that it was blocked too long and restarts the node.

NSUP being blocked for more than 2.5 seconds causes the node to restart. If mirror nodes are placed on the same host, the likelihood of double failure is high. Simultaneous occurrence of the blocking on the mirror hosts may also lead to double failures.

The situation is especially likely to arise when other processes in the system (for example, in a colocated configuration) compete for CPU or memory, which produces extensive swapping and multiple page faults as processes are rescheduled.

NSUP being blocked can also be caused by negative system clock adjustments.

Solution

Ensure that HADB nodes get enough system resources. Ensure also that the time synchronization daemon does not make large clock jumps; adjustments should not exceed 2 seconds.
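
The elapsed-time check described for NSUP can be pictured with the following Python sketch. It shows the general idea only and is not HADB's implementation. The function name is hypothetical, and the 2.5-second limit is the default quoted in this section.

```python
MAX_BLOCK_TIME_S = 2.5  # default maximum block time quoted in this section


def blocked_too_long(last_check_s, now_s, max_block_s=MAX_BLOCK_TIME_S):
    """Return True when more time than allowed has passed since the last check."""
    return (now_s - last_check_s) > max_block_s


if __name__ == "__main__":
    # Timestamps are in seconds and are invented for the example.
    print(blocked_too_long(10.0, 12.0))
    print(blocked_too_long(10.0, 13.0))
```

Because a check of this kind depends on the difference between two clock readings, anything that delays the process or changes the clock between readings affects the result.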

Is There Disk Contention?

Description

Disk contention can have a negative impact on user data reads and writes to the disk devices, as well as on HADB writes to the history files. Severe disk contention may delay or abort user transactions. Delays in history file writing may cause node restarts and, in the worst case, lead to double failures.

Disk contention can be identified by monitoring the disk I/O from the OS for the disks used for data devices, log devices, and history files. It can also be identified by the following statement in the history files: “HADB warning: Schedule of async <read,write> operation took ...”

History file write delays are written to the HADB history files. This can be identified by “NSUP BEWARE timestamp Last flush took too long (x msecs).”

This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:

Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.

This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor’s request checking the heartbeat of the trans process. This causes the nsup to believe that the trans process has died and to restart it.
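
When a history file is long, a short script can pull out the flush warnings and their durations. The Python sketch below matches only the “Last flush took too long (x msecs)” text quoted above. The function name, the sample lines, and their timestamp format are assumptions, and real history files may differ in layout.

```python
import re

# Matches the warning text quoted in this section and captures the duration.
FLUSH_RE = re.compile(r"Last flush took too long \((\d+) msecs\)")


def slow_flushes(lines, threshold_ms=0):
    """Return the flush durations, in milliseconds, at or above the threshold."""
    durations = []
    for line in lines:
        match = FLUSH_RE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            durations.append(int(match.group(1)))
    return durations


if __name__ == "__main__":
    # Invented sample lines; only the warning text itself comes from this guide.
    sample = [
        "NSUP BEWARE 2004-08-01 12:00:00 Last flush took too long (3200 msecs).",
        "n:0 some unrelated line",
        "NSUP BEWARE 2004-08-01 12:05:00 Last flush took too long (11000 msecs).",
    ]
    print(slow_flushes(sample, threshold_ms=10000))
```

Setting the threshold to 10000 isolates delays beyond the ten-second limit mentioned above.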

This problem is observed especially in RH AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.

Solution

Use one disk per node to place the devices used by that node. If the node has more than one data device and disk contention is observed, move one data device to another disk. The same applies to the history file.

Is There a Shortage of HADB Data Device Space?

Description

One possible reason for transaction failure is running out of data device space. If this situation occurs, HADB will write warnings to the history file, and abort the transaction which tried to insert and/or update data.

Typical messages are:

HIGH LOAD: about to run out of device space, ...
HIGH LOAD: about to run out of device space on mirror node, ...

The general rule of thumb is that the data devices must have room for at least four times the volume of the user data. Please refer to the Tuning Guide for additional explanation.
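
As a quick arithmetic check of this rule of thumb, the Python sketch below multiplies the expected user data volume by four. The function name and the 512 MB figure are hypothetical, and the Tuning Guide remains the authority on sizing.

```python
def min_data_device_mb(user_data_mb, factor=4):
    """Return the minimum data device space suggested by the rule of thumb."""
    if user_data_mb < 0:
        raise ValueError("user data volume cannot be negative")
    return user_data_mb * factor


if __name__ == "__main__":
    # 512 MB of user data is an invented example value.
    print(min_data_device_mb(512))
```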

Solution 1

Increase the size of the data devices using the following command:

hadbm set TotalDataDevicePerNode=size

This solution requires that there is space available on the physical disks which are used for the HADB data devices on all nodes.

HADBM automatically restarts each node of the database.

Solution 2

Stop and clear the HADB, and create a new instance with more nodes and/or larger data devices and/or several data devices per node. Unfortunately, using this solution will erase all persistent data. See the Administrator's Guide for more information about this procedure.

See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.

Is There a Shortage of Other HADB Resources?

When an HADB node is started, it will allocate:

If an HADB node runs out of resources it will delay and/or abort transactions. Resource usage information is shipped between mirror nodes, so that a node can delay or abort an operation which is likely to fail on its mirror node.

Transactions that are delayed repeatedly may time out and return an error message to the client. If they do not time out, the situation will be visible to the client only as decreased performance during the periods in which the system is short on resources.

These problems frequently occur in “High Load” situations. For details, see High Load Problems.


High Load Problems

High load scenarios are recognizable by the following symptoms:

If a high load problem is suspected, consider the following:

Is the Tuple Log Out Of Space?

All user operations (delete, insert, update) are logged in the tuple log and executed. The tuple log may fill up because:

Solution 1

Check CPU usage, as described in Improving CPU Utilization.

Solution 2

If CPU utilization is not a problem, check the disk I/O. If the disk shows contention, avoid page faults when log records are being processed by increasing the data buffer size with hadbm set DataBufferPoolSize=...

Solution 3

Look for evidence of network contention, and resolve bottlenecks.

Solution 4

Increase the tuple log buffer using hadbm set LogBufferSize=...

See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.

Is the node-internal Log Full?

Too many node-internal operations are scheduled but not processed due to CPU or disk I/O problems.

Solution 1

Check CPU usage, as described in Solution 2: Improve CPU Utilization.

Solution 2

If CPU utilization is not a problem, and there is sufficient memory, increase the InternalLogbufferSize using the hadbm set InternalLogbufferSize= command.

Are There Enough Locks?

Some extra symptoms that identify this condition are:

Solution 1: Increase the number of locks

Use hadbm set NumberOfLocks= to increase the number of locks.

Solution 2: Improve CPU Utilization

Check CPU usage, as described in Improving CPU Utilization.

Can You Fix the Problem by Doing Some Performance Tuning?

In most situations, reducing load or increasing the availability of resources will improve host performance. Some of the more common steps to take are:

In addition, the following resources can be adjusted to improve “HIGH LOAD” problems, as described in the Performance and Tuning Guide:

Table 5-1  HADB Performance Tuning Properties

Resource                            Property
Size of Database Buffer             hadbm attribute DataBufferPoolSize
Size of Tuple Log Buffer            hadbm attribute LogBufferSize
Size of Node Internal Log Buffer    hadbm attribute InternalLogBufferSize
Number of Database Locks            hadbm attribute NumberOfLocks


Client Cannot Connect to HADB

Description

This problem is accompanied by a message in the history file:

HADB-E-11626: Error in IPC operations, iostat = 28: No space left on device

where:

If HADB started successfully, and you get this message at runtime, it means that the host computer has too few semaphore undo structures.

Solution

Stop the affected HADB node, reconfigure and reboot the affected host, and then restart the HADB node. HADB will be available during the process.


Improving CPU Utilization

Description

Available CPU cycles and I/O capacity can impose severe restrictions on performance. Resolving and preventing such issues is necessary to optimize system performance (in addition to configuring the HADB optimally).

Solutions

If there are additional CPUs on the host that are not exploited, add new nodes to the same host. Otherwise add new machines and add new nodes on them.

If the machine has enough memory, increase the DataBufferPoolSize, and increase other internal buffers that may be putting warnings into the log files. Otherwise, add new machines and add new nodes on them.

For more information on this subject, consult the Performance and Tuning Guide.


HADB Administration Problems

The hadbm command and its many subcommands and options are provided for administering the high-availability database (HADB). The hadbm command is located in the install_dir/SUNWhadb/4/bin directory.

Refer to the chapter on Configuring the High Availability Database in the Sun Java System Application Server Administrator’s Guide for a full explanation of this command. Specifics on the various hadbm subcommands are explained in the hadbm man pages.

The following problems are addressed in this section:

hadbm Command Fails: The agents could not be reached

Description

The command fails with the error:

The agents <url> could not be reached.

The hosts in the URL could be unreachable either because the hosts are down, because the communication pathway has not been established, because the port number in the URL is wrong, or because the management agents are down.

Solution

Verify that the URL is correct. If the URL is correct, verify that the hosts are up and running and are ready to accept communications; for example:

ping hostname1
ping hostname2
...
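
Because ping confirms only that a host is up, it can also help to check that something is listening on the port given in the URL. The Python sketch below attempts a TCP connection. It assumes the management agent accepts TCP connections on that port, and the function name, host, and port in the example are placeholders rather than product defaults.

```python
import socket


def agent_port_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host and port; substitute the values from your agent URL.
    print(agent_port_open("127.0.0.1", 9))
```

A False result for a host that answers ping points toward a wrong port number or a management agent that is not running.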

hadbm Command Fails: command not found

Description

The hadbm command can be run from the current directory, or you can set the search PATH to access the hadb commands from anywhere, which is much more convenient. The error, “hadbm: Command not found,” indicates that neither of these conditions has been met.

Solution 1

cd to the directory that contains the hadbm command and run it from there:

cd install_dir/SUNWhadb/4/bin/
./hadbm

Solution 2

Use the full path to invoke the hadbm command:

install_dir/SUNWhadb/4/bin/hadbm

Solution 3

You can use the hadbm command from anywhere by setting the PATH variable. Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java System Application Server 7 Installation Guide.

To verify that the PATH settings are correct, run the following commands:

which asadmin
which hadbm

These commands should echo the paths to the utilities.
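
Where the which command is not available, the same check can be made with Python's standard library, as in the sketch below. The function name is hypothetical, and shutil.which searches the current PATH much as the shell does.

```python
import shutil


def missing_commands(names):
    """Return the command names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # An empty list means both utilities were found on the PATH.
    print(missing_commands(["asadmin", "hadbm"]))
```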

hadbm Command Fails: JAVA_HOME not defined

Description

The message “Error: JAVA_HOME is not defined correctly” indicates that the JAVA_HOME environment variable has not been set properly.

Solution

If multiple Java versions are installed on the system, ensure that the JAVA_HOME environment variable points to the correct Java version (1.4.1_03 or above for Enterprise Edition).
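
The version requirement can be checked mechanically once you know the version string of the Java installation that JAVA_HOME points to. The Python sketch below compares a string such as 1.4.1_03 against that minimum. The function names are hypothetical, and the parser handles only the simple major.minor.micro_update form.

```python
import re

MINIMUM_VERSION = (1, 4, 1, 3)  # 1.4.1_03, the minimum stated in this section


def parse_java_version(text):
    """Turn a string such as '1.4.1_03' into a tuple of integers."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:_(\d+))?", text.strip())
    if match is None:
        raise ValueError("unrecognized version string: " + text)
    major, minor, micro, update = match.groups()
    return (int(major), int(minor), int(micro), int(update or 0))


def meets_minimum(text):
    """Return True when the version is 1.4.1_03 or above."""
    return parse_java_version(text) >= MINIMUM_VERSION


if __name__ == "__main__":
    print(meets_minimum("1.4.1_02"), meets_minimum("1.4.2"))
```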

Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java System Application Server 7 Installation Guide.

create Fails: “path does not exist on a host”

Description

After issuing the hadbm create command, an error similar to the following appears on the console:

./hadbm create ...
...
hadbm:Error 22022: Specified path does not exist on a host. Please specify a valid path: [ machineName ... ]

This error message indicates that the HADB server component is not installed on the machine on which you are trying to create the HA database.

Solution

Log in to the host and create paths for the HADB devices and HADB history files. Run hadbm create with the --devicepath and --historypath options set to the paths you created. Also make sure that the user running the management agent on the host has read and write access to these directories.
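
Before rerunning hadbm create, the directories can be checked with a short script such as the Python sketch below. It must be run on the host in question as the user that runs the management agent, because os.access reports permissions for the current user. The function name is hypothetical.

```python
import os
import tempfile


def path_problems(paths):
    """Return a dict mapping each unusable path to a short reason."""
    problems = {}
    for path in paths:
        if not os.path.isdir(path):
            problems[path] = "does not exist or is not a directory"
        elif not os.access(path, os.R_OK | os.W_OK):
            problems[path] = "not readable and writable by the current user"
    return problems


if __name__ == "__main__":
    # A temporary directory stands in for real device and history paths.
    device_dir = tempfile.mkdtemp()
    print(path_problems([device_dir, os.path.join(device_dir, "missing")]))
```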


Note

HADB executables cannot be installed on different paths on different hosts.


Database Does Not Start

The create or start command fails with the console error message:

hadbm: Error 22095: Database could not be started...

Consider the following possibilities:

Was There a Shared Memory Get Segment Failure?

Description

The history files show the error message:

..'systemerr'..HADB-S-01760: Shared memory get segment failed..

Solution 1

Reboot the system.

Solution 2

If the problem persists, the operating system may not have enough shared memory or semaphores. Increase them according to the number of nodes in the machine. (For details, see the Deployment Guide). Note that after making these changes, the machine must be restarted to make them available.

Do the History Files Contain Errors?

Description

If the problem still persists, inspect the HADB history files. Some of the more likely error messages to look for are:

Solutions

After verifying that none of the above errors have occurred, try the following remedies, in order:

For more information, refer to the Error Message Reference.

Do You Need a Simple Solution?

As a last resort, try the following possible solutions.

Solution 1

Delete the database with the hadbm delete command, and see if that allows the hadbm create to proceed normally.

Solution 2

Sometimes a system reboot is the necessary last resort. Issue hadbm delete, reboot the machine, and then rerun the hadbm create command.

clear Command Failed

When this command fails, the history files are likely to explain why. See Do the History Files Contain Errors? for instructions on viewing the history files and a list of some common error messages.

create-session-store Failed

The asadmin create-session-store command could fail for one of these reasons:

Invalid user name or password

This error occurs when the --dbsystempassword supplied to the create-session-store command is not the same password as the one given at the time of database creation.

Solution 1

Try the command again with the correct password.

Solution 2

If you cannot remember the dbsystem password, you need to clear the database using hadbm clear and provide a new dbsystem password.

SQLException: No suitable driver

The create-session-store produces the error: SessionStoreException: java.sql.SQLException: No suitable driver.

Solution 1

This error can occur when asadmin is not able to find hadbjdbc4.jar from the AS_HADB path defined in asenv.conf in the Application Server config directory.

The solution is to change AS_HADB to point to the location of the HADB installation.

Here is a sample AS_HADB entry from an asenv.conf file:

AS_HADB=c:\install_dir\SUNWhadb\4.4.0-8
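
To confirm that an AS_HADB value actually leads to the driver, you can read the entry and search the directory it names for hadbjdbc4.jar. The Python sketch below does this under two assumptions: the entry has the plain AS_HADB=value form shown above, and the jar lies somewhere beneath that directory. The function names and the lib subdirectory in the demonstration are hypothetical.

```python
import os
import tempfile


def read_as_hadb(conf_text):
    """Return the value of the AS_HADB entry, or None if there is none."""
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("AS_HADB="):
            return line[len("AS_HADB="):].strip().strip('"')
    return None


def find_jar(root, jar_name="hadbjdbc4.jar"):
    """Search root and its subdirectories; return the jar's path or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if jar_name in filenames:
            return os.path.join(dirpath, jar_name)
    return None


if __name__ == "__main__":
    # Stand-in for a real installation: a temporary tree holding an empty jar.
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "lib"))  # "lib" is an assumed location
    open(os.path.join(root, "lib", "hadbjdbc4.jar"), "w").close()
    print(find_jar(read_as_hadb("AS_HADB=" + root + "\n")) is not None)
```

If find_jar returns None for the configured directory, AS_HADB most likely points to the wrong location.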

Solution 2

This error can also occur if you provide an incorrect value for --storeurl. To solve this problem, obtain the correct URL using hadbm get jdbcURL.

Attaching Shared Memory Segment Fails Due To Insufficient Space

Description

The server throws an error message like the following:

Attaching shared memory segment with key xx failed,
OS status=12 OS message: Not enough space.

Solution

Increase shared memory.

Cannot Restart the HADB

Description

HADB restart does not work after a double node failure. Additional recovery actions are needed before HADB can be restarted.

Symptoms of a double node failure include:

This problem occurs when mirror HADB host machines have failed or been rebooted, typically after a power outage, or when a machine is rebooted without first stopping the HADB (in a single-machine installation), or when a pair of mirror machines from both Data Redundancy Units (DRUs) are rebooted.

If mirror host machine pairs are rebooted, or if host failures cause an unplanned reboot of one or more mirror host machine pairs, then the mirror nodes on these machines are not available, and the data is likely to be in an inconsistent state, because a record may have been in the process of being committed when the power failed, or the reboot occurred.


Tip

To prevent such problems, be sure to use the procedure described in the HADB chapter of the Administration Guide when rebooting as part of planned maintenance.


HADB cannot heal itself automatically in such “double failure” situations because the part of the data that resided on the pair nodes is lost. In such cases, the hadbm start command does not succeed, and the hadbm status command shows that HADB is in a non-operational state.

Explanation

For performance reasons, the HADB does much of its data management in memory. If both DRUs are rebooted, then the HADB does not have a chance to write its data blocks to disk.

For more information on the DRUs and HADB configuration, see “Administering the High Availability Database” in the Administration Guide, and the Deployment Guide.


Tip

If the HADB exhibits strange behavior (for example, consistent timeout problems), and you want to check whether a restart cures the problem, use the hadbm restart command.

When the HADB is restarted in this manner, HADB services remain available. Conversely, if HADB is started and stopped in separate operations using hadbm stop and hadbm start, HADB services are unavailable while HADB is stopped.


Solution

  1. Follow the instructions under “Recovering from Session Data Corruption” in the “Administering the High Availability Database” chapter of the Administration Guide.
  2. Verify that the node states show Starting/Recovering, then reset the database.





Copyright 2004 Sun Microsystems, Inc. All rights reserved.