Chapter 3 HADB Problems

This section covers problems you may encounter when using the Application Server 8.1 EE product with the High Availability Database (HADB) module, HADB Management Client, or the load balancer plugin. These components can be installed separately or with the rest of the Application Server.

HADB Database Creation Fails

The error occurs when using clsetup to start the database. The typical message in this case is:

To determine the cause of the problem, inspect the /var/tmp/clsetup.log file. Some possible errors are:

No Available Memory

Description

Solution 1

This problem can occur when changes are made to /etc/system and the init 6 command is given to reset the system. The following error message occurs in the database log file:

System aborted with message:
'Could not create shared DictCache segment'
...Shared memory get segment failed'

To avoid this problem, do sync;sync as root user and then do reboot instead of init 6.

Solution 2

This error can occur when insufficient swap space has been allocated. Review the documentation on shared memory requirements in the Preparing for HADB Setup chapter of the Sun Java™ System Application Server Installation Guide.

Too Few Semaphores

Description

This can be caused when the number of semaphores is too low. Since the semaphores are provided as a global resource by the operating system, the configuration depends on all processes running on the host, not only the HADB. This can occur either while starting the HADB, or during runtime.

Solution

Configure the semaphore settings by editing the /etc/system file. Instructions and guidelines are contained in the Configuring Shared Memory and Semaphores section of the Preparing for HADB Setup chapter of the Sun Java™ System Application Server Installation Guide.

Problems When Running clsetup As Non-Root

Description

To run the clsetup command as a user other than root, you need to configure HADB administration for non-root.

Solution

Follow the instructions in the Setting Up Administration for Non-Root section in the Sun Java™ System Application Server Installation Guide.

hadbm create Fails With Error “Node-x NSUP timestamp HADB-S-00240: Illegal node number”

Description

The likely cause is that another process is occupying the port that the NSUP process on node x is trying to open.

Solution

Find the host on which the node number x is running (the xth item in the host list). Also check whether an old HADB node or some other process uses this port on this host. If so, stop that process and rerun the hadbm create command.

Database Nodes Cannot Be Reached and the Database Does Not Function

Solution

Check whether dynamic IP addresses (DHCP) are used for hosts used in definedomain or in hadbm commands.

createdomain Hangs If Agents Cannot Reach Each Other With Multicast

Description

If the different machines in the domain are connected to a switch that does not forward multicast messages (to multicast address 228.8.8.8), the definedomain function will never terminate.

Solution

Unexpected Node Restarts, Network Partitions, or Reconnects

Description

Unexpected node restarts, network partitions, or reconnects with messages “Network Partition: *** Reconnect detected ***” written in the HADB history files and on the HADB host terminals.

Solution

Verify that messages from nodes belonging to one database instance are delivered to nodes belonging to the other database instance. If management domains share HADB hosts, ensure that the nodes on the common host do not use the same port number.

hadbm create or hadbm addnodes Command Hangs

Description

Some hosts in the host list given to hadbm create or addnodes have multiple network interfaces, while others have only one, and the hadbm create/addnodes command hangs.

Solution

For the hosts having multiple network interfaces, specify the dotted IP address of the network interface (for example., 129.241.111.23) to be used by hadb when issuing hadbm create/addnodes. If the host name is used instead of IP address, the first interface registered on the host will be used, and there is no guarantee that the nodes will be able to communicate.

ma (Management Agent Process) Crashes

Description

Solution

Display diagnostic information by using hadbm listdomain.Typically, the remedy is to restart the failed agent. If that does not help, restart all agents in turn.

Server Responds Slowly After Being Idle

Description

The server takes a long time to service a request after a long period of idleness, and the sever log shows “lost connection” messages of the form:

Solution

Change the timeout value. The default HADB connection timeout value is 1800 seconds. If the application server does not send any request over a JDBC connection during this period, HADB closes the connection, and the application server needs to re-establish it. To change the timeout value, use the hadbm set SessionTimeout= command.



Note	Make sure the HADB connection timeout is greater than the JDBC connection pool timeout. If the JDBC connection timeout is more than the HADB connection time out, the connection will be closed from the HADB side, but will remain in the appserver connection pool. When the application then tries to use the connection, the application server will have to recreate the connection, which incurs significant overhead.

Requests Are Not Succeeding

Is the Load Balancer Timeout Correct?

Description

When configuring the response-timeout-in-seconds property in the loadbalancer.xml file, you must take into account the maximum timeouts for all the applications that are running. If the response timeout it is set to a very low value, numerous in-flight requests will fail because the load balancer will not wait long enough for the Application Server to respond to the request.

Conversely, setting the response timeout to an inordinately large value will result in requests being queued to an instance that has stopped responding, resulting in numerous failed requests.

Solution

Set the response-timeout-in-seconds value to the maximum response time of all the applications.

Are the System Clocks Synchronized?

Description

When a session is stored in HADB, it includes some time information, including the last time the session was accessed and the last time it was modified. If the clocks are not synchronized, then when an instance fails and another instance takes over (on another machine), that instance may think the session was expired when it was not, or worse yet, that the session was last accessed in the future!



Note	In a non-colocated configuration, it is important to synchronize the clocks on that machines that are hosting HADB nodes. For more information, see the Installation Guide chapter, “Preparing for HADB Setup.”

Solution

Is the Application Server Communicating With HADB?

Description

HADB may be created and running, but if the persistence store has not yet been created, the Application Server will not be able to communicate with the HADB. This situation is accompanied by the following message:

WARNING (7715): ConnectionUtilgetConnectionsFromPool failed using connection URL: connection URL

Solution

asadmin create-session-store --storeurl connection URL --storeuser haadmin --storepassword hapasswd --dbsystempassword super123

Session Persistence Problems

The create-session-store Command Failed

Description

The asadmin create-session-store command cannot run across firewalls. Therefore, for the create-session-store command to work, the application server instance and the HADB must be on the same side of a firewall.

The create-session-store command communicates with the HADB and not with the application server instance.

Solution

Locate the HADB and the application server instance on the same side of a firewall.

Configuring Instance-Level Session Persistence Did Not Work

The application-level session persistence configuration always takes precedence over instance-level session persistence configuration. Even if you change the instance-level session persistence configuration after an application has been deployed, the settings for the application still override the settings for the application server instance.

Session Data Seems To Be Corrupted

Description

Session data may be corrupted if the system log reports errors under the following circumstances:

If the data has been corrupted, there are three possible solutions for bringing the session store back to a consistent state, as described below.

Solution 1

Solution 2

If clearing the session store does not work, reinitialize the data space on all the nodes and clear the data in the HADB using the hadbm clear command.

Solution 3

HADB Performance Problems

Performance is affected when the transactions to HADB get delayed or aborted. This situation is generally caused by a shortage of system resources. Any wait beyond five seconds causes the transactions to abort. Any node failures also cause the active transaction on that node at crash time to abort. Any double failures (failure of both mirrors) will make the HADB unavailable. The causes of the failures can generally be found in the HADB history files.

Is There a Shortage of CPU or Memory Resources, or Too Much Swapping?

Description

Node restarts or double failures due to “Process blocked for x sec, max block time is 2.500000 sec.” In this case, x is the length of time the process was blocked, and it was greater than 2.5 seconds.

The HADB Node Supervisor Process (NSUP/clu_nsup_srv) tracks the time elapsed since the last time it did some monitoring work. If that time duration exceeds a specified maximum (2500ms by default), NSUP concludes that it was blocked too long and restarts the node.

NSUP being blocked for more than 2.5 seconds cause the node to restart. If mirror nodes are placed on the same host, the likelihood of double failure is high. Simultaneous occurrence of the blocking on the mirror hosts may also lead to double failures.

The situation is especially likely to arise when there are other processes—for example, in a colocated configuration— in the system that compete for CPU, or memory which produces extensive swapping and multiple page faults as processes are rescheduled.

Solution

Ensure that HADB nodes get enough system resources. Ensure also that the time synchronization daemon does not make large (not higher than 2 seconds) jumps.

Is There Disk Contention?

Description

A disk contention can have a negative impact on user data read/writes to the disk devices, as well as on HADB writing to history files. Severe disk contention may delay or abort user transactions. Delay in history file writing may cause node restarts and, in the worst case, lead to double failures.

The disk contention can be identified by monitoring the disk I/O from the OS, for the disks used for data, log devices and history files. This can also identified by the following statement in the history files: “HADB warning: Schedule of async <read,write> operation took ...”

History file write delays are written to the HADB history files. This can be identified by “NSUP BEWARE timestamp Last flush took too long (x msecs).”

This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:

Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.

This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor’s request checking the heartbeat of the trans process. This causes the nsup to believe that the trans has died and then restarts it.

This problem is observed especially in RH AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.

Solution

Use one disk per node to place the devices used by that node. If the node has more than one data devices and the disk contention is observed, move one data device to another disk. The same applies to the history file as well.

Is There a Shortage of HADB Data Device Space?

Description

One possible reason for transaction failure is running out of data device space. If this situation occurs, HADB will write warnings to the history file, and abort the transaction which tried to insert and/or update data.

HIGH LOAD: about to run out of device space, ...
HIGH LOAD: about to run out of device space on mirror node, ...

The general rule of thumb is that the data devices must have room for at least four times the volume of the user data. Please refer to the Tuning Guide for additional explanation.

Solution 1

This solution requires that there is space available on the physical disks which are used for the HADB data devices on all nodes.

Solution 2

Stop and clear the HADB, and create a new instance with more nodes and/or larger data devices and/or several data devices per node. Unfortunately, using this solution will erase all persistent data. See the Administrator's Guide for more information about this procedure.

See Bug ID 5097447 in the “Known Problems” section of the Application Server 8.1 Release Notes for more information.

Is There a Shortage of Other HADB Resources?

If an HADB node runs out of resources it will delay and/or abort transactions. Resource usage information is shipped between mirror nodes, so that a node can delay or abort an operation which is likely to fail on its mirror node.

Transactions that are delayed repeatedly may time out and return an error message to the client. If they do not time out, the situation will be visible to the client only as decreased performance during the periods in which the system is short on resources.

These problems frequently occur in “High Load” situations. For details, see High Load Problems.

High Load Problems

Is the Tuple Log Out Of Space?


Note	Frequently, all of these problems can be solved by making more CPU horsepower available.

All user operations (delete, insert, update) are logged in the tuple log and executed. There tuple log may fill up because:

Execution slows due to CPU or disk I/O contention

The mirror node is slow in receiving the log records (“log throw due to...” messages in the history files), which can happen as a result of:

Network contention, so the log records do not reach the mirror node

CPU contention at the mirror node, which keeps it from processing the received log records quickly enough.

Solution 1

Solution 2

If CPU utilization is not a problem, check the disk I/O. If the disk shows contention, avoid page faults when log records are being processed by increasing the data buffer size with hadbm set DataBufferPoolSize=...

Solution 3

Solution 4

See Bug ID 5097447 in the “Known Problems” section of the Application Server 8.1 Release Notes for more information.

Is the node-internal Log Full?

Too many node-internal operations are scheduled but not processed due to CPU or disk I/O problems.

Solution 1

Solution 2

If CPU utilization is not a problem, and there is sufficient memory, increase he InternalLogbufferSize using the hadbm set InternalLogbufferSize= command.

Are There Enough Locks?

Solution 1: Increase the number of locks

Solution 2: Improve CPU Utilization

Can You Fix the Problem by Doing Some Performance Tuning?

In most situations, reducing load or increasing the availability of resources will improve host performance. Some of the more common steps to take are:

Run the nodes on hosts with better hardware characteristics (more internal memory, higher processor speed, more processors).

Add physical disks and use several data devices, not more than one device on each physical disk.

Add more nodes, on new hosts, and refragment the data to utilize the new nodes.

Change configuration variables to allocate larger memory segments or internal data structures.

In addition, the following resources can be adjusted to improve “HIGH LOAD” problems, as described in the Performance and Tuning Guide:

Table 3-1 HADB Performance Tuning Properties
Resource	Property
Size of Database Buffer	hadbm attribute DataBufferPoolSize
Size of Tuple Log Buffer	hadbm attribute LogBufferSize
Size of Node Internal Log Buffer	hadbm attribute InternalLogBufferSize
Number of Database Locks	hadbm attribute NumberOfLocks

Client cannot connect to HADB

Description

11626 is an HADB error code, “Error in IPC operations,” which means that some Inter Process Communication operation failed.

“iostat = 28” means (on Solaris) that the operating system set errno to ENOSPC, which again translates to “No space left on device.”

The most likely explanation is that a semget() call failed (see the UNIX man pages). If HADB started successfully, and you get this message at runtime, it means that the host computer has too few semaphore undo structures. See the “Preparing for HADB Setup” chapter in the Installation Guide for information on how to configure semmnu in /etc/system.

Solution

Stop the affected HADB node, reconfigure and reboot the affected host, restart the HADB node. HADB will be available during the process.

Improving CPU Utilization

Description

Available CPU cycles and I/O capacity can impose severe restrictions on performance. Resolving and preventing such issues is necessary to optimize system performance (in addition to configuring the HADB optimally.)

Solutions

If there are additional CPUs on the host that are not exploited, add new nodes to the same host. Otherwise add new machines and add new nodes on them.

If the machine has enough memory, increase the DataBufferPoolSize, and increase other internal buffers that may be putting warnings into the log files. Otherwise, add new machines and add new nodes on them.

HADB Administration Problems

The hadbm command and its many subcommands and options are provided for administering the high-availability database (HADB). The hadbm command is located in the install_dir/SUNWhadb/4/bin directory.

Refer to the chapter on Configuring the High Availability Database in the Sun Java™ System Application Server Administrator’s Guide for a full explanation of this command. Specifics on the various hadbm subcommands are explained in the hadbm man pages.

hadbm Command Fails: host unreachable

Description

The command fails with the error, “Host unreachable: hostname.” The host could be unreachable either because it is down, or because the communication pathway has not been established. If the remote host is not running or cannot accept connections, attempts to access it will fail.

Solution

Try pinging the host to see if it is up and running, ready to accept communications:

hadbm Command Fails: command not found

Description

The hadbm command can be run from the current directory, or you can set the search PATH to access the hadb commands from anywhere, which is much more convenient. The error, “hadbm: Command not found,” indicates that neither of these conditions has been met.

Solution 1

Solution 2

Solution 3

You can use the hadbm command from anywhere by setting the PATH variable. Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java™ System Application Server 8.1 Installation Guide.

hadbm Command Fails: JAVA_HOME not defined

Description

The message “Error: JAVA_HOME is not defined correctly” indicates that the JAVA_HOME environment variable has not been set properly.

Solution

If multiple Java versions are installed on the system, ensure that the JAVA_HOME environment variable points to the correct Java version (1.4.1_03 or above for Enterprise Edition).

Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java™ System Application Server 8.1 Installation Guide.

create Fails: “path does not exist on a host”

Description

After issuing the hadbm create command, an error similar to the following appears on the console:

./hadbm create ...
...
hadbm:Error 22022: Specified path does not exist on a host. Please specify a valid path: [ machineName ... ]

This error message indicates that the HADB server component is not installed on the machine on which you are trying to create the HA database.

Solution

Install the HADB server in the in the install_dir directory, and run the command again.

Database Does Not Start

Was there a shared memory get segment failure?

Description

Solution 1


Note	HADB executables cannot be installed on different paths on different hosts.

Use sync;sync and reboot instead of init 6. The hadbm create command can fail with this error after making changes to /etc/system and doing a system reset with the init 6 command.

Instead of restarting the machine with init 6, do sync;sync as root and then reboot.

Solution 2

Increase the amount of shared memory. There may not be as much shared memory as the HADB needs. The amount of shared memory required by HADB depends on parameters like DataBufferPoolSize, LogbufferSize, and other parameters. Look into the /etc/system file and set shmsys:shminfo_shmmax to the maximum value possible (the preferred value is 0xffffffff).

Verify that other shared memory settings are configured correctly. After making your changes, issue the hadbm stop command and (for Solaris) reboot the machine. For Linux, rebooting is not necessary.

For more information on the mechanics of configuring shared memory, consult the chapter, “Preparing for HADB Setup” in the Sun Java™ System Application Server 8.1 Installation Guide. For guidelines on choosing the best settings, consult the Performance Tuning Guide.

Solution 3

Verify the settings in the /etc/system file. Even a single mistyped character can create problems.

Solution 4

Use ipcs to see if there are any shared memory segments or semaphores occupied unnecessarily by you or the other users. Use ipcrm to free them and then try starting the database.

Solution 5

If the problem persists, the operating system may not have enough shared memory or semaphores. Increase them according to the number of nodes in the machine. (For details, see the Deployment Guide). Note that after making these changes, the machine must be restarted to make them available.

Do the History Files Contain Errors?

Description

If the problem still persists, inspect the HADB history files. Some of the more likely error messages to look for are:

The system has not been set up with enough shared memory. (Discussed in the previous section.)

This message occurs when another process is using the port that an HADB server is trying to process. It can occur in several situations:

The portBase is used by another process running on this host machine.

Set the PortBase attribute to another value using the following command:

hadbm set portbase=value

An attempt to stop the HADB node for maintenance failed.

Try again to stop the node with the hadbm command. If that fails, kill the OS process clu_nsup_srv for this node without the -9 option. The nsup process should then stop its hadb child process. If the parent process nsup does not exist, kill all the child processes using kill -9.

The HADB node was stopped for maintenance and an inetd process restarted the HADB node before you intended to start it.

Make sure that inetd does not start the HADB node before stopping it.

Check the following:

Shared memory is all correct on all machines in the HADB configuration.

No other HADB databases are running on the machines, or any other processes that could be using the same port numbers.

All necessary directories exist and have write permissions.

There is enough space in directory where devices are going to be written.

Solutions

After verifying that none of the above errors have occurred, try the following remedies, in order:

Do You Need a Simple Solution?

Solution 1

Delete the database with the hadbm delete command, and see if that allows the hadbm create to proceed normally.

Solution 2

Sometimes a system reboot is the necessary last resort. Issue hadbm delete, reboot the machine, and then rerun the hadb create command.

clear Command Failed

When this command fails, the history files are likely to explain why. See Do the History Files Contain Errors? for instructions on viewing the history files and a list of some common error messages.

create-session-store Failed

Invalid user name or password

This error occurs when the --dbsystempassword supplied to the create-session-store command is not the same password as the one given at the time of database creation.

Solution 1

Solution 2

If you cannot remember the dbsystem password, you need to clear the database using hadbm clear and provide a new dbsystem system password.

SQLException: No suitable driver

The create-session-store produces the error: SessionStoreException: java.sql.SQLException: No suitable driver.

Solution 1

This error can occur when asadmin is not able to find hadbjdbc4.jar from the AS_HADB path defined in asenv.conf in the Application Server config directory.

The solution is to change AS_HADB to point to the location of the HADB installation.

Solution 2

This error can also occur if you provide the incorrect value for --storeUrl. To solve this problem, obtain the correct URL using hadbm get jdbcURL

Attaching Shared Memory Segment Fails Due To Insufficient Space

Description

Attaching shared memory segment with key xx failed,
OS status=12 OS message: Not enough space.

Solution

Cannot Restart the HADB

Description

HADB restart does not work after a double node failure. Additional recovery actions are needed before HADB can be restarted.

hadbm status shows that the HADB status is non-functional.

The node status shows that the nodes are in Starting or Recovering state. Even after stopping and then restarting each of the nodes, they remain in the Starting state. Eventually, the node status changes to Stopped.

This problem occurs when mirror HADB host machines have failed or been rebooted, typically after a power outage, or when a machine is rebooted without first stopping the HADB (in a single-machine installation), or when a pair of mirror machines from both Data Redundancy Units (DRUs) are rebooted.

If mirror host machine pairs are rebooted, or if host failures cause an unplanned reboot of one or more mirror host machine pairs, then the mirror nodes on these machines are not available, and the data is likely to be in an inconsistent state, because a record may have been in the process of being committed when the power failed, or the reboot occurred.



Tip	To prevent such problems, be sure to use the procedure described in the HADB chapter of the Administration Guide when rebooting as a part of a planned maintenance.

HADB cannot heal itself automatically in such “double failure” situations because the part of the data that resided on the pair nodes is lost. In such cases, the hadbm start command does not succeed, and the hadbm status command shows that HADB is in a non-operational state.

Explanation

For performance reasons, the HADB does much of its data management in memory. If both DRUs are rebooted, then the HADB does not have a chance to write its data blocks to disk.

For more information on the DRUs and HADB confutation, see “Administering the High Availability Database” in the Administration Guide, and the Deployment Guide.



Tip	If the HADB exhibits strange behavior (for example consistent timeout problems), and you want to check whether a restart cures the problem, use the hadbm restart command. When the HADB is restarted in this manner, HADB services remain available. Conversely, if HADB is started and stopped in separate operations using hadbm stop and hadbm start, HADB services are unavailable while HADB is stopped.

Solution

Follow the instructions under “Recovering from Session Data Corruption” in the “Administering the High Availability Database” chapter of the Administration Guide.

Verify that the node states show Starting/Recovering, then reset the database.

Previous Contents Index Next
Sun Java System Application Server 8.1 2004Q4 Beta Troubleshooting Guide