Chapter 5
HADB Issues on Windows
This section covers problems you may encounter when using Sun Java System Application Server 7 2004Q2 Update 1 Enterprise Edition with the High Availability Database (HADB) 4.4 on the Windows platform. HADB 4.4 has a new management architecture and new commands compared to HADB 4.3, which is bundled with the Application Server for UNIX platforms. For details on administering HADB 4.4, see the Sun Java System Application Server 7 2004Q2 Update 1 Administration Guide.
Topics in this chapter include:

- HADB Database Creation Fails
- Database Nodes Cannot Be Reached and the Database Does Not Function
- The Management Agents Could Not Establish a Domain
- Unexpected Node Restarts, Network Partitions, or Reconnects
- hadbm create or hadbm addnodes Command Hangs
- ma (Management Agent Process) Crashes
- Server Responds Slowly After Being Idle
- Requests Are Not Succeeding
- Session Persistence Problems
- HADB Performance Problems
- High Load Problems
- Client Cannot Connect to HADB
- Improving CPU Utilization
- HADB Administration Problems
HADB Database Creation Fails
The error occurs when starting the database. The typical message in this case is:
failed to start database : HADB Database creation failed
To determine the cause of the problem, use the Log Viewer and/or inspect the install_dir/hadb/4/log directory. Some possible errors are:
No Available Memory
Description
Insufficient memory is available to create the database.
Solution 1
Check whether other processes on the Windows system are using up all the memory, and end those processes if possible.
Solution 2
Install more memory in your system.
Review the documentation on shared memory requirements in the Preparing for HADB Setup chapter of the Sun Java System Application Server Installation Guide.
Too Few Semaphores
Description
HADB uses memory-mapped files for shared memory on Windows. This message appears when there is not enough space on the device disk for shared memory.
Solution
Make more space available on the device disk for shared memory.
Database Nodes Cannot Be Reached and the Database Does Not Function
Solution
The IP addresses of the involved hosts should be static. If the addresses are assigned dynamically (DHCP), the lease time should be set to forever (usually 0).
The Management Agents Could Not Establish a Domain
Description
The HADB management system is dependent on UDP Multicast messages on multicast address 228.8.8.8. If these messages cannot get through, the createdomain command fails with the following message:
The management agents could not establish a domain, please check that the hosts can communicate with UDP multicast.
Possible causes include:
- The agents are running on hosts with several network interfaces on different subnets.
- There is a switch on the network that does not forward multicast messages.
- There is a router on the network that does not route multicast messages with the address 228.8.8.8.
- Multicast messages are disabled in the operating system.
Solution 1
If the hosts have several network interfaces on different subnets, the management agent must be configured to use one of the subnets. Set the ma.server.mainternal.interfaces attribute.
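For example, the attribute can be set in the management agent's configuration file. The subnet value shown below is an assumption for illustration; use the subnet of the interface the agent should actually use:

ma.server.mainternal.interfaces=10.10.116.0/24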
Solution 2
Configure the needed network infrastructure to support multicast messages.
Unexpected Node Restarts, Network Partitions, or Reconnects
Description
Unexpected node restarts, network partitions, or reconnects with messages “Network Partition: *** Reconnect detected ***” written in the HADB history files and on the HADB host terminals.
This may happen if multiple nodes identify themselves with the same physical node number.
Solution
Try stopping the database with the hadbm stop command, and look for "rogue" hadb processes on the hosts on which any HADB nodes have been running at any time. If any hadb processes are still running, they belong to rogue nodes.
On the hosts on which rogue nodes are found, check that the management agents are correctly configured, and that the management domain is correctly defined. There may be multiple management domains configured, and each host may possibly be included in more than one domain. Make sure that databases defined in separate domains do not have conflicting definitions, such as database nodes using the same port numbers.
hadbm create or hadbm addnodes Command Hangs
Description
Some hosts in the host list given to hadbm create or addnodes have multiple network interfaces, while others have only one, and the hadbm create/addnodes command hangs.
Solution
For the hosts having multiple network interfaces, specify the dotted IP address of the network interface (for example, 129.241.111.23) to be used by HADB when issuing hadbm create or addnodes. If the host name is used instead of the IP address, the first interface registered on the host will be used, and there is no guarantee that the nodes will be able to communicate.
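For example, a create command that pins each node to a specific interface might look like the following. The database name mydb and the addresses are illustrative:

hadbm create --hosts 129.241.111.23,129.241.111.24 mydb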
ma (Management Agent Process) Crashes
Description
The ma (Management Agent process) crashes for various reasons.
Solution
Display diagnostic information by using hadbm listdomain. Typically, the remedy is to restart the failed agent. If that does not help, restart all agents in turn.
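For example, assuming the default management agent port 1862 and illustrative host names (the exact option syntax may vary; see the hadbm man pages):

hadbm listdomain --agent=host1:1862,host2:1862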
Server Responds Slowly After Being Idle
Description
The server takes a long time to service a request after a long period of idleness, and the server log shows "lost connection" messages of the form:
java.io.IOException:..HA Store: Lost connection to the server.
In such cases, the server needs to recreate the JDBC pool for HADB.
Solution
Change the timeout value. The default HADB connection timeout value is 1800 seconds. If the application server does not send any request over a JDBC connection during this period, HADB closes the connection, and the application server needs to re-establish it. To change the timeout value, use the hadbm set SessionTimeout=seconds command.
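For example, to raise the timeout to 2400 seconds (the value is illustrative; choose a value longer than the longest expected idle period):

hadbm set SessionTimeout=2400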
Requests Are Not Succeeding
The following problems are addressed in this section:

- Is the Load Balancer Timeout Correct?
- Are the System Clocks Synchronized?
- Is the Application Server Communicating With HADB?
Is the Load Balancer Timeout Correct?
Description
When configuring the response-timeout-in-seconds property in the loadbalancer.xml file, you must take into account the maximum timeouts for all the applications that are running. If the response timeout is set to a very low value, numerous in-flight requests will fail because the load balancer will not wait long enough for the Application Server to respond.
Conversely, setting the response timeout to an inordinately large value causes requests to be queued to an instance that has stopped responding, which also results in numerous failed requests.
Solution
Set the response-timeout-in-seconds value to the maximum response time of all the applications.
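For example, if the slowest application needs up to 300 seconds to respond, the entry in loadbalancer.xml might look like this (the value shown is illustrative):

<property name="response-timeout-in-seconds" value="300"/>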
Are the System Clocks Synchronized?
Description
When a session is stored in HADB, it includes some time information, including the last time the session was accessed and the last time it was modified. If the clocks are not synchronized, then when an instance fails and another instance (on another machine) takes over, that instance may conclude that the session has expired when it has not, or worse yet, that the session was last accessed in the future!
Solution
Verify that clocks are synchronized for all systems in the cluster.
Is the Application Server Communicating With HADB?
Description
HADB may be created and running, but if the persistence store has not yet been created, the Application Server will not be able to communicate with the HADB. This situation is accompanied by the following message:
WARNING (7715): ConnectionUtil: getConnectionsFromPool failed using connection URL: connection URL
Solution
Create the session store in the HADB with a command like the following:
asadmin create-session-store --storeurl connection URL --storeuser haadmin --storepassword hapasswd --dbsystempassword super123
Session Persistence Problems
The following problems are addressed in this section:

- The create-session-store Command Failed
- Configuring Instance-Level Session Persistence Did Not Work
- Session Data Seems To Be Corrupted
The create-session-store Command Failed
Description
The asadmin create-session-store command cannot run across firewalls. Therefore, for the create-session-store command to work, the application server instance and the HADB must be on the same side of a firewall.
The create-session-store command communicates with the HADB and not with the application server instance.
Solution
Locate the HADB and the application server instance on the same side of a firewall.
Configuring Instance-Level Session Persistence Did Not Work
The application-level session persistence configuration always takes precedence over instance-level session persistence configuration. Even if you change the instance-level session persistence configuration after an application has been deployed, the settings for the application still override the settings for the application server instance.
Session Data Seems To Be Corrupted
Description
Session data may be corrupted if the system log reports errors during session persistence operations.
If the data has been corrupted, there are three possible solutions for bringing the session store back to a consistent state, as described below.
Solution 1
Use the asadmin clear-session-store command to clear the session store.
Solution 2
If clearing the session store does not work, reinitialize the data space on all the nodes and clear the data in the HADB using the hadbm clear command.
Solution 3
If clearing the HADB does not work, delete and then recreate the database.
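A minimal sketch of Solution 3, assuming a database named mydb and illustrative host names (hadbm prompts for any required passwords):

hadbm delete mydb
hadbm create --hosts host1,host2 mydb

After recreating the database, recreate the session store with asadmin create-session-store.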
HADB Performance Problems
Performance is affected when transactions to HADB are delayed or aborted. This situation is generally caused by a shortage of system resources. Any wait beyond five seconds causes a transaction to abort. A node failure also causes the transactions active on that node at crash time to abort. A double failure (failure of both mirror nodes) makes the HADB unavailable. The causes of the failures can generally be found in the HADB history files.
To isolate the problem, consider the following:
Is There a Shortage of CPU or Memory Resources, or Too Much Swapping?
Description
Nodes restart, or double failures occur, with the message "Process blocked for x sec, max block time is 2.500000 sec." In this message, x is the length of time the process was blocked, which was greater than 2.5 seconds.
The HADB Node Supervisor Process (NSUP/clu_nsup_srv) tracks the time elapsed since the last time it did some monitoring work. If that time duration exceeds a specified maximum (2500ms by default), NSUP concludes that it was blocked too long and restarts the node.
NSUP being blocked for more than 2.5 seconds causes the node to restart. If mirror nodes are placed on the same host, the likelihood of a double failure is high. Simultaneous occurrence of the blocking on the mirror hosts may also lead to double failures.
The situation is especially likely to arise when other processes in the system (for example, in a colocated configuration) compete for CPU or memory, producing extensive swapping and multiple page faults as processes are rescheduled.
NSUP being blocked can also be caused by negative system clock adjustments.
Solution
Ensure that HADB nodes get enough system resources. Also ensure that the time synchronization daemon does not make large clock adjustments (jumps should not exceed 2 seconds).
Is There Disk Contention?
Description
Disk contention can have a negative impact on user data reads and writes to the disk devices, as well as on HADB writing to history files. Severe disk contention may delay or abort user transactions. Delay in history file writing may cause node restarts and, in the worst case, lead to double failures.
Disk contention can be identified by monitoring disk I/O from the OS for the disks used for data devices, log devices, and history files. It can also be identified by the following statement in the history files: "HADB warning: Schedule of async <read,write> operation took ..."
Delays in writing to the history file are recorded in the history file itself, identified by "NSUP BEWARE timestamp Last flush took too long (x msecs)."
This warning shows that disk I/O took too long. If the delay exceeds ten seconds, the node supervisor restarts the trans process with the error message:
Child process trans0 10938 does not respond.
Child died - restarting nsup.
Psup::stop: stopping all processes.
This message indicates that a trans (clu_trans_srv) process has been too busy doing other things (for example, waiting to write to the history file) to reply to the node supervisor's request to check the heartbeat of the trans process. The nsup therefore assumes that the trans process has died and restarts it.
This problem is observed especially in RH AS 2.1 when multiple HADB nodes are placed on the same host and all the nodes use the same disk to place their devices.
Solution
Use one disk per node to hold the devices used by that node. If the node has more than one data device and disk contention is observed, move one data device to another disk. The same applies to the history file.
Is There a Shortage of HADB Data Device Space?
Description
One possible reason for transaction failure is running out of data device space. If this situation occurs, HADB writes warnings to the history file and aborts the transaction that tried to insert or update data.
Typical messages are:
HIGH LOAD: about to run out of device space, ...
HIGH LOAD: about to run out of device space on mirror node, ...
The general rule of thumb is that the data devices must have room for at least four times the volume of the user data. Refer to the Tuning Guide for additional explanation.
Solution 1
Increase the size of the data devices using the following command:
hadbm set TotalDataDevicePerNode=size
This solution requires that there is space available on the physical disks which are used for the HADB data devices on all nodes.
HADBM automatically restarts each node of the database.
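For example, assuming device sizes are specified in MBytes (the value shown is illustrative):

hadbm set TotalDataDevicePerNode=2048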
Solution 2
Stop and clear the HADB, and create a new instance with more nodes, larger data devices, or several data devices per node. Unfortunately, using this solution erases all persistent data. See the Administrator's Guide for more information about this procedure.
See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.
Is There a Shortage of Other HADB Resources?
When an HADB node is started, it allocates several fixed-size resources, such as the data buffer pool, the tuple log, the node-internal log, and locks.
If an HADB node runs out of resources it will delay and/or abort transactions. Resource usage information is shipped between mirror nodes, so that a node can delay or abort an operation which is likely to fail on its mirror node.
Transactions that are delayed repeatedly may time out and return an error message to the client. If they do not time out, the situation will be visible to the client only as decreased performance during the periods in which the system is short on resources.
These problems frequently occur in “High Load” situations. For details, see High Load Problems.
High Load Problems
High load scenarios are typically recognized by HIGH LOAD warnings in the HADB history files and by transactions being delayed or aborted. If a high load problem is suspected, consider the following:
Is the Tuple Log Out Of Space?
All user operations (delete, insert, update) are logged in the tuple log and then executed. The tuple log may fill up if log records are produced faster than they can be processed, typically because of CPU, disk I/O, or network bottlenecks, or because the log buffer is too small.
Solution 1
Check CPU usage, as described in Improving CPU Utilization.
Solution 2
If CPU utilization is not a problem, check the disk I/O. If the disk shows contention, avoid page faults when log records are being processed by increasing the data buffer size with hadbm set DataBufferPoolSize=...
Solution 3
Look for evidence of network contention, and resolve bottlenecks.
Solution 4
Increase the tuple log buffer using hadbm set LogBufferSize=...
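For example, the buffer sizes named in Solutions 2 and 4 can be raised as shown below. The values are in MBytes and are illustrative; confirm appropriate sizes in the Performance and Tuning Guide:

hadbm set DataBufferPoolSize=512
hadbm set LogBufferSize=88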
See Bug ID 5097447 in the “Known Problems” section of the Application Server 7 Release Notes for more information.
Is the Node-Internal Log Full?
Too many node-internal operations are scheduled but not processed due to CPU or disk I/O problems.
Solution 1
Check CPU usage, as described in Improving CPU Utilization.
Solution 2
If CPU utilization is not a problem, and there is sufficient memory, increase the internal log buffer size using the hadbm set InternalLogbufferSize=size command.
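For example (the value is in MBytes and is illustrative):

hadbm set InternalLogbufferSize=48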
Are There Enough Locks?
Some extra symptoms in the HADB history files identify this condition.
Solution 1: Increase the number of locks
Use hadbm set NumberOfLocks= to increase the number of locks.
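For example (the value is illustrative; size it according to the expected number of concurrent row operations):

hadbm set NumberOfLocks=50000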
Solution 2: Improve CPU Utilization
Check CPU usage, as described in Improving CPU Utilization.
Can You Fix the Problem by Doing Some Performance Tuning?
In most situations, reducing load or increasing the availability of resources will improve host performance. Some of the more common steps to take are:
- Run the nodes on hosts with better hardware characteristics (more internal memory, higher processor speed, more processors).
- Add physical disks and use several data devices, not more than one device on each physical disk.
- Add more nodes, on new hosts, and refragment the data to utilize the new nodes.
- Change configuration variables to allocate larger memory segments or internal data structures.
In addition, several HADB resources can be adjusted to relieve "HIGH LOAD" problems, as described in the Performance and Tuning Guide.
Client Cannot Connect to HADB
Description
This problem is accompanied by a message in the history file:
HADB-E-11626: Error in IPC operations, iostat = 28: No space left on device
If HADB started successfully, and you get this message at runtime, it means that the host computer has too few semaphore undo structures.
Solution
Stop the affected HADB node, reconfigure and reboot the affected host, and then restart the HADB node. HADB remains available during this process.
Improving CPU Utilization
Description
Available CPU cycles and I/O capacity can impose severe restrictions on performance. Resolving and preventing such issues is necessary to optimize system performance (in addition to configuring the HADB optimally).
Solutions
If there are additional CPUs on the host that are not exploited, add new nodes to the same host. Otherwise add new machines and add new nodes on them.
If the machine has enough memory, increase the DataBufferPoolSize, and increase other internal buffers that may be putting warnings into the log files. Otherwise, add new machines and add new nodes on them.
For more information on this subject, consult the Performance and Tuning Guide.
HADB Administration Problems
The hadbm command and its many subcommands and options are provided for administering the high-availability database (HADB). The hadbm command is located in the install_dir/SUNWhadb/4/bin directory.
Refer to the chapter on Configuring the High Availability Database in the Sun Java System Application Server Administrator’s Guide for a full explanation of this command. Specifics on the various hadbm subcommands are explained in the hadbm man pages.
The following problems are addressed in this section:

- hadbm Command Fails: The agents could not be reached
- hadbm Command Fails: command not found
- hadbm Command Fails: JAVA_HOME not defined
- create Fails: "path does not exist on a host"
- Database Does Not Start
- clear Command Failed
- create-session-store Failed
- Attaching Shared Memory Segment Fails Due To Insufficient Space
- Cannot Restart the HADB
hadbm Command Fails: The agents could not be reached
Description
The command fails with the error:
The agents <url> could not be reached.
The hosts in the URL could be unreachable either because the hosts are down, because the communication pathway has not been established, because the port number in the URL is wrong, or because the management agents are down.
Solution
Verify that the URL is correct. If the URL is correct, verify that the hosts are up and running and are ready to accept communications; for example:
ping hostname1
ping hostname2
...
hadbm Command Fails: command not found
Description
The hadbm command can be run from the current directory, or you can set the search PATH to access the hadb commands from anywhere, which is much more convenient. The error, “hadbm: Command not found,” indicates that neither of these conditions has been met.
Solution 1
cd to the directory that contains the hadbm command and run it from there:
cd install_dir/SUNWhadb/4/bin/
./hadbm
Solution 2
Use the full path to invoke the hadbm command:
install_dir/SUNWhadb/4/bin/hadbm
Solution 3
You can use the hadbm command from anywhere by setting the PATH variable. Instructions for setting the PATH variable are contained in the “Preparing for HADB Setup” chapter of the Sun Java System Application Server 7 Installation Guide.
To verify that the PATH settings are correct, run the following commands:
which asadmin
which hadbm
These commands should echo the paths to the utilities.
hadbm Command Fails: JAVA_HOME not defined
Description
The message “Error: JAVA_HOME is not defined correctly” indicates that the JAVA_HOME environment variable has not been set properly.
Solution
If multiple Java versions are installed on the system, ensure that the JAVA_HOME environment variable points to the correct Java version (1.4.1_03 or above for Enterprise Edition).
Instructions for setting the JAVA_HOME variable are contained in the "Preparing for HADB Setup" chapter of the Sun Java System Application Server 7 Installation Guide.
create Fails: “path does not exist on a host”
Description
After issuing the hadbm create command, an error similar to the following appears on the console:
./hadbm create ...
...
hadbm:Error 22022: Specified path does not exist on a host. Please specify a valid path: [ machineName ... ]
This error message indicates that the HADB server component is not installed on the machine on which you are trying to create the HA database.
Solution
Log in to the host and create paths for the HADB devices and HADB history files. Run hadbm create, setting the --devicepath and --historypath options to the paths you created. Also make sure that the user running the management agent on the host has read and write access to these directories.
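For example, assuming the directories below were created beforehand on every host (the paths and database name are illustrative):

hadbm create --devicepath=C:\hadb\devices --historypath=C:\hadb\history mydb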
Database Does Not Start
The create or start command fails with the console error message:
hadbm: Error 22095: Database could not be started...
Consider the following possibilities:
Was there a shared memory get segment failure?
Description
The history files show the error message:
..'systemerr'..HADB-S-01760: Shared memory get segment failed..
Solution 1
Reboot the system.
Solution 2
If the problem persists, the operating system may not have enough shared memory or semaphores configured. Increase them according to the number of nodes on the machine (for details, see the Deployment Guide). Note that the machine must be restarted for these changes to take effect.
Do the History Files Contain Errors?
Description
If the problem still persists, inspect the HADB history files for error messages.
One likely message indicates that another process is using the port that an HADB server is trying to use. This can occur in several situations.
Try again to stop the node with the hadbm command. If that fails, use Windows Task Manager to end the OS process, clu_nsup_srv, for this node. The nsup process should then end all its HADB child processes. If the nsup process does not exist, you have to end all the HADB child processes one by one.
Check the following:
- Shared memory is correct on all machines in the HADB configuration.
- No other HADB databases are running on the machines, or any other processes that could be using the same port numbers.
- All necessary directories exist and have write permissions.
- There is enough space in the directory where the devices will be written.
Solutions
After verifying that none of the above errors have occurred, try the remedies described below, in order.
For more information, refer to the Error Message Reference.
Do You Need a Simple Solution?
As a last resort, try the following possible solutions.
Solution 1
Delete the database with the hadbm delete command, and see if that allows the hadbm create to proceed normally.
Solution 2
Sometimes a system reboot is the necessary last resort. Issue hadbm delete, reboot the machine, and then rerun the hadbm create command.
clear Command Failed
When this command fails, the history files are likely to explain why. See Do the History Files Contain Errors? for instructions on viewing the history files and a list of some common error messages.
create-session-store Failed
The asadmin create-session-store command could fail for one of these reasons:
Invalid user name or password
This error occurs when the --dbsystempassword supplied to the create-session-store command is not the same password as the one given at the time of database creation.
Solution 1
Try the command again with the correct password.
Solution 2
If you cannot remember the dbsystem password, you need to clear the database using hadbm clear and provide a new dbsystem password.
SQLException: No suitable driver
The create-session-store command produces the error: SessionStoreException: java.sql.SQLException: No suitable driver.
Solution 1
This error can occur when asadmin is not able to find hadbjdbc4.jar from the AS_HADB path defined in asenv.conf in the Application Server config directory.
The solution is to change AS_HADB to point to the location of the HADB installation.
Here is a sample AS_HADB entry from an asenv.conf file:
AS_HADB=c:\install_dir\SUNWhadb\4.4.0-8
Solution 2
This error can also occur if you provide the incorrect value for --storeUrl. To solve this problem, obtain the correct URL using hadbm get jdbcURL.
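For example, assuming a database named mydb, the returned URL has the following general form (the host names and port numbers are illustrative):

hadbm get jdbcURL mydb
jdbc:sun:hadb:host1:15205,host2:15205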
Attaching Shared Memory Segment Fails Due To Insufficient Space
Description
The server throws an error message like the following:
Attaching shared memory segment with key xx failed,
OS status=12 OS message: Not enough space.
Solution
Increase shared memory.
Cannot Restart the HADB
Description
HADB restart does not work after a double node failure. Additional recovery actions are needed before HADB can be restarted.
Symptoms of a double node failure include an hadbm start command that does not succeed and an hadbm status command that shows HADB in a non-operational state.
This problem occurs when mirror HADB host machines have failed or been rebooted, typically after a power outage, when a machine is rebooted without first stopping the HADB (in a single-machine installation), or when a pair of mirror machines from both Data Redundancy Units (DRUs) is rebooted.
If mirror host machine pairs are rebooted, or if host failures cause an unplanned reboot of one or more mirror host machine pairs, then the mirror nodes on these machines are not available, and the data is likely to be in an inconsistent state, because a record may have been in the process of being committed when the power failed or the reboot occurred.
Tip
To prevent such problems, be sure to use the procedure described in the HADB chapter of the Administration Guide when rebooting as a part of a planned maintenance.
HADB cannot heal itself automatically in such “double failure” situations because the part of the data that resided on the pair nodes is lost. In such cases, the hadbm start command does not succeed, and the hadbm status command shows that HADB is in a non-operational state.
Explanation
For performance reasons, the HADB does much of its data management in memory. If both DRUs are rebooted, then the HADB does not have a chance to write its data blocks to disk.
For more information on the DRUs and HADB configuration, see "Administering the High Availability Database" in the Administration Guide, and the Deployment Guide.
Solution