
Sun ONE Application Server 7, Enterprise Edition Troubleshooting Guide

Chapter 4
Runtime Problems

This chapter addresses problems that you may encounter while running the Sun™ Open Net Environment (ONE) Application Server 7 product.

The following sections are contained in this chapter:


Runtime Logs

Refer to the logs and information in "Server Logs" for information on using the logs to troubleshoot runtime problems.


Load Balancer / Web Server won’t start

This problem occurs when two instances have the same listener value—for example, if instance foo has listener value bar:80 and instance spam has listener value bar:80.

The error messages that result look like this:

[04/Sep/2003:13:01:08] warning ( 2938): reports: lb.runtime: RNTM2029: DaemonMonitor :http://hostname:81 : could be because of connection saturation

[04/Sep/2003:13:01:08] failure ( 2938): ServerInstance.cpp@265: reports: lb.runtime:RNTM3002 : Failed to add listener multiple times: <instance name>

[04/Sep/2003:13:01:08] failure ( 2938): FailoverGroup.cpp@102: reports:lb.failovermanager: FGRP1002: Instance <instance name> could not be added to theFailoverGroup: cluster1

[04/Sep/2003:13:01:08] failure ( 2938): LBConfigurator.cpp@209: reports:lb.cofigurator: CNFG1007 :ServerInstance <instance name> could not be added onFailoverGroup cluster1

[04/Sep/2003:13:01:08] failure ( 2938): lbplugin.cpp@168: reports: lb.runtime:RNTM3004 : Failed to initialise load balancing subsystem

Solution

Make sure that the listener values for each instance are unique.
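For example, a corrected configuration for the two instances described above might look like this in loadbalancer.xml (a sketch only; the instance names foo and spam and the host bar come from the example above, and the attribute names follow the sun-loadbalancer_1_0.dtd used by the plug-in):

```xml
<cluster name="cluster1">
  <!-- Each instance now has a unique listener value. -->
  <instance name="foo"  enabled="true" listeners="http://bar:80"/>
  <instance name="spam" enabled="true" listeners="http://bar:81"/>
</cluster>
```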


Instance goes unused after restarting

An instance was down, and is now back up, but the access log shows that it is not getting any requests.

This situation occurs when the Load Balancer has not been configured to regularly check the health of instances. When the instance was down, the Load Balancer marked it as unhealthy, and recorded that fact in the load balancer log. But now that the instance is up and running again, the Load Balancer doesn’t know that its health has been restored.

Solution

Configure the Load Balancer to check the health of instances regularly by adding the health checker URL to loadbalancer.xml with a line like this:

<health-checker url="/pathToHealthChecker"
  interval-in-seconds="10" timeout-in-seconds="30" />


Can’t access a web page.

A typical error that displays on the browser is the following:

404 Not Found
The requested URL destination_URL was not found on this server.

This means that the web page you are attempting to access is not available at the location you have specified. The most likely causes are a change in the location, or an error in the URL you have specified.

Solution
  1. Try again.
  2. If you still receive this error, verify that you have entered the location correctly.
  3. If you have entered the URL correctly, verify that the location has not changed or been deleted. You will have to contact the page owner to verify this.


Can’t access an application.

There are a number of possible reasons that you cannot access an application. A typical error message in this case is

Consider the following:

Is the application deployed?

The most likely cause is that the application is not deployed. When an application is deployed to a cluster, an entry for it appears in the web server plug-in’s loadbalancer.xml file. If an application has been successfully deployed to a cluster, the following snippet shows how the loadbalancer.xml file should look:
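A successfully deployed application typically appears as a web-module entry alongside the instance entries (a sketch; the instance name, host, and context root /myapp are hypothetical, and element names follow the sun-loadbalancer_1_0.dtd):

```xml
<cluster name="cluster1">
  <instance name="instance1" enabled="true" listeners="http://host1:80"/>
  <!-- This entry appears when the application has been deployed to the cluster. -->
  <web-module context-root="/myapp" enabled="true"/>
  <health-checker url="/" interval-in-seconds="10" timeout-in-seconds="30"/>
</cluster>
```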

The cladmin command is used to deploy an application to all instances in your cluster. Refer to the Sun ONE Application Server Administrator’s Guide for deployment instructions. Otherwise, refer to the Sun ONE Application Server Developer’s Guide for non-cluster deployment guidelines.

Solution

If needed, redeploy the application.

Is your loadbalancer.xml file correct?

Check the web server log files to verify that the load balancer started. If it hasn’t, there may be errors about the loadbalancer.xml file written to the error log.

Consider the following:

Is the web server running?

Verify that the web server has started.

Has the correct port been specified for the web server?

Determine the correct web server port number and verify that the correct port has been specified. Refer to "Is the Admin Server running at the expected port?" for guidelines on determining the port number.


Server responds slowly after being idle

If the server takes a while to service a request after a long period of idleness, consider the following:

Does the log contain “Lost connection” messages?


If the server log shows error messages of the form,

java.io.IOException:..HA Store: Lost connection to the server..

then the server must re-create the JDBC connection pool for HADB.

Solution: Change the timeout value

The default HADB connection timeout value is 1800 seconds. If the application server does not send any request over a JDBC connection during this period, HADB closes the connection, and the application server needs to re-establish it. To change the timeout value, use the hadbm set SessionTimeout= command.
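For example, to raise the timeout to 2400 seconds (the value here is illustrative, not a recommendation; choose a value larger than your JDBC connection pool's idle timeout, per the note below):

```
hadbm set SessionTimeout=2400
```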

Important Note:
Make sure the HADB connection timeout is greater than the JDBC connection pool timeout. If the JDBC connection pool timeout is larger, the connection is closed on the HADB side but remains in the application server's connection pool. When the application next tries to use that connection, the application server must re-create it, which incurs significant overhead.


My application suddenly went away.

Consider the following:

Is the application you are using being quiesced by the load balancer?

When an application is being quiesced, you may experience a loss of service from the time the application is disabled until it is re-enabled.


Requests are not succeeding.

The following problems are addressed in this section:

Is the load balancer timeout correct?

When configuring the response-timeout-in-seconds property in the loadbalancer.xml file, you must take into account the maximum timeouts for all the applications that are running. If the response timeout is set to a very low value, numerous in-flight requests will fail because the load balancer will not wait long enough for the Application Server to respond to the request.

On the other hand, setting the response timeout to an inordinately large value will result in requests being queued to an instance that has stopped responding, resulting in numerous failed requests.

Solution

Set the response-timeout-in-seconds value to the maximum response time of all the applications.

Are the system clocks synchronized?

When a session is stored in HADB, it includes some time information, including the last time the session was accessed and the last time it was modified. If the clocks are not synchronized, then when an instance fails and another instance takes over (on another machine), that instance may think the session was expired when it was not, or worse yet, that the session was last accessed in the future!

Solution

Verify that clocks are synchronized for all systems in the cluster.

Have you enabled the instances of the cluster?

Even if you start an application server instance and define it to be a part of the cluster, the instance will not receive requests from the load balancer until you enable the instance. Enabling makes the instance an active part of the cluster. The correct sequence of events for activating and deactivating an instance is:

  1. Start the Application Server.
  2. Create an Application Server instance.
  3. Enable the instance.
  4. Disable the instance.
  5. Stop the instance.
  6. Start the instance (only if it has been stopped).
  7. Enable the instance.

Is the AppServer communicating with HADB?

HADB may be created and running, but if the persistence store has not yet been created, the Application Server won’t be able to communicate with the HADB. This situation is accompanied by the following message:

WARNING (7715): ConnectionUtilgetConnectionsFromPool failed using connection URL: <connection URL>

Solution

Create the session store in the HADB with a command like the following:

asadmin create-session-store --storeurl <connection URL>
  --storeuser haadmin --storepassword hapasswd
  --dbsystempassword super123


Server log: app persistence-type = memory

The server.log shows that the J2EE application is using memory persistence instead of High Availability, with a message like this:

Enabling no persistence for web module [Application.war]'s sessions: persistence-type = [memory]

This situation occurs when the application server has not been configured to use HA.

Solution

Enable the availability service with a command like this:

asadmin set --user admin --password netscape --host localhost
--port 4848 serverName.availability-service.availabilityEnabled=true


Dynamic reconfiguration failed.

The load balancer plug-in detects changes to its configuration by examining the time stamp of the loadbalancer.xml file. If a change has been made to the loadbalancer.xml file, the load balancer automatically reconfigures itself. The load balancer ensures that the modified configuration data is compliant with the DTD before overwriting the existing configuration.

If changes to the loadbalancer.xml file are not in the correct format, as specified by the sun-loadbalancer_1_0.dtd file, the reconfiguration fails and a failure notice is printed in the web server's error log files. The load balancer continues to use the old configuration in memory.


Note

If the load balancer encounters a hard disk read error while attempting to reconfigure itself, it uses the configuration that is currently in memory, and a warning message is logged to the web server's error log file.


Solution

Edit the loadbalancer.xml file as needed until it follows the correct format as specified in the DTD file.


Session Persistence Problems

The following problems are addressed in this section:

The create-session-store command failed.

Consider the following:

Are the HADB and the application server instance on different sides of a firewall?

The asadmin create-session-store command cannot run across firewalls. Therefore, for the create-session-store command to work, the application server instance and the HADB must be on the same side of a firewall.

The create-session-store command communicates with the HADB and not with the application server instance.

Solution

Locate the HADB and the application server instance on the same side of a firewall.

Configuring instance-level session persistence didn’t work.

The application-level session persistence configuration always takes precedence over instance-level session persistence configuration. Even if you change the instance-level session persistence configuration after an application has been deployed, the settings for the application still override the settings for the application server instance.

Session data seems to be corrupted.

Session data may be corrupted if the system log reports errors under the following circumstances:

If you determine that the data has been corrupted, there are three possible solutions.

Solution

To bring the session store back to a consistent state, do the following:

  1. Use the asadmin clear-session-store command to clear the session store.
  2. If clearing the session store doesn’t work, reinitialize the data space on all the nodes and clear the data in the HADB using the hadbm clear command.
  3. If clearing the HADB doesn’t work, delete and then recreate the database.


HTTP session failover is not working.

The Sun ONE Application Server 7, Enterprise Edition includes the high-availability database (HADB) for storing session data. The HADB is not a general-purpose database but instead is an HttpSession store.

If HTTP session failover is not working correctly, consider the following:

Are the system clocks synchronized?

For HTTP session failover to work, the clocks of all the computers on which the application server instances in a cluster reside must be synchronized. (For more detail, see "Are the system clocks synchronized?".)

Solution

Verify that clocks are synchronized for all systems in the cluster.

Do all objects bound to a distributed session implement the java.io.Serializable interface?

If an object does not implement the java.io.Serializable interface, it will not be persisted. No errors or warnings are produced, because the lack of persistence may well be the desired behavior. The remaining session objects are successfully persisted and will fail over.


Note

If an object declares java.io.Serializable but cannot actually be serialized (for example, because it holds a reference to a non-serializable object), an exception is written to the log. In this case, the entire user session is not persisted, and a failover will produce an empty session.


Solution

Make sure that every class in the session that is supposed to persist implements java.io.Serializable, as in this example:

public class MyClass implements java.io.Serializable
{
  ..
}

Is session information stored in HttpSession?

Sun ONE Application Server 7, Enterprise Edition does not support session failover if session data is stored in a stateful session bean.

Is your web application distributable?

For a web application to be highly available, it should be distributable. An application is non-distributable if the web-app element of the web.xml file does not contain a distributable subelement.

For additional information, refer to the Sun ONE Application Server Developer’s Guide to Web Applications.

Solution

Verify that the web-app element of the web.xml file contains a distributable subelement.

Is the persistence type set to ha?

The persistence type must be set to ha for session failover to work. When you run the clsetup command, the persistence type is set to ha by default. If you do not use the clsetup command to set up your initial cluster, the persistence type defaults to memory. The memory type offers no session persistence upon failover, while the failover capabilities offered by the file persistence type are intended for use only in development systems where session failover capabilities are not strictly required.

Instructions for setting the persistence type are contained in the Session Persistence chapter of the Sun ONE Application Server Administrator’s Guide.

Solution

Verify that the persistence type is set to ha. If it isn’t, modify either the entire instance or your particular application to use the ha persistence type. For details, see the Session Persistence chapter of the Administration Guide.

An object is cloned instead of shared

When using the modified-attribute persistence scope, if a session fails over or is activated after being passivated, an object that is shared between two attributes comes back as two separate copies of the object instead of as a single shared object.

This occurs because one object cannot be referred to by two separate attributes under the modified-attribute scope. The application server serializes and persists each attribute separately, so the shared object gets serialized twice, once for each attribute. When the attributes are deserialized, they contain two separate objects.
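The cloning effect can be reproduced outside the server with plain Java serialization. This standalone sketch (the class and method names are hypothetical, not part of the product) round-trips one shared object twice, once per "attribute", the way separate per-attribute persistence would:

```java
import java.io.*;

// A session attribute type; it must be Serializable to be persisted.
class SharedCounter implements Serializable {
    int value = 42;
}

public class ModifiedAttributeDemo {
    // Round-trip an object through serialization, as the server does
    // for each modified attribute independently.
    static Object roundTrip(Object o) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SharedCounter shared = new SharedCounter();
        // Two session attributes reference the same object before persistence...
        Object attrA = roundTrip(shared);
        Object attrB = roundTrip(shared);
        // ...but after independent serialization they come back as two copies.
        System.out.println(attrA == attrB);                // prints false
        System.out.println(((SharedCounter) attrA).value); // prints 42
    }
}
```

If the two attributes had been serialized together in a single object graph, Java serialization would have preserved the sharing; it is the per-attribute round-trip that splits the object.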

Has a session store been created?

HTTP session failover will not work until a session store has been created using the asadmin create-session-store command.

Instructions for creating a session store are contained in the Session Persistence chapter of the Sun ONE Application Server Administrator’s Guide

Are all the machines in the cluster homogenous?

All Application Server instances in a cluster must have the same applications deployed to them. For these applications to take part in failover, they must have a consistent session persistence configuration and point to the same session store.

Any new instance that you add to a cluster must have the same version and same patch level as all existing instances in a cluster.

Has high availability been enabled?

HTTP session failover will not work until high availability has been enabled using the availability-enabled attribute.

Solution

Set the availability-enabled attribute using the asadmin command.


Out of Memory and Stack Overflow errors.

When using the Sun ONE Application Server to deploy web applications, out-of-memory and stack-overflow errors occur even though the volume of data is not large. For example, this has been observed on a system with:

Memory: 2048M
CPU: 1 x 900MHz UltraSPARC III

Solution 1

The Java VM runs out of memory because the web applications are demanding in terms of creating Java objects.

This can usually be solved by setting the VM Xms/Xmx and Xss parameters. If you have default values similar to these:

-Xmx256m - for initial app. server install
-Xss512k - default for 1.4.0 server VM

try something like:

-Xmx1024m -Xss1024k

Solution 2

The Java VM runs out of memory because of poor web application design.

For instance, the application creates many Java objects and keeps cross-references to them throughout the application's lifecycle, preventing the garbage collector from doing its job. Each additional user then consumes more and more heap space until the VM runs out.

There is nothing you can do here except change the application. You are most likely dealing with an application memory leak, which can be traced using a profiling tool such as OptimizeIt.

Solution 3

The other thing you can try is to configure your Application Server with JDK 1.4.2 instead of the default 1.4.0_02; JDK 1.4.2 has better memory management and more aggressive garbage collection.


HADB Transaction Failures

Transaction failures are generally caused by a shortage of system resources. The causes of the failures will generally be found in the application server log.

In your efforts to isolate the problem, consider the following:

Is there a shortage of HADB data devices space?

One possible reason for transaction failure is running out of data device space. If this situation occurs, HADB writes warnings to the history file and aborts the transaction that tried to insert or update data.

To find the messages, follow the instructions in "Examine the history files". Typical messages are:

HIGH LOAD: about to run out of device space, ...

HIGH LOAD: about to run out of device space on mirror node, ...

The general rule of thumb is that the data devices must have room for at least four times the volume of the user data. Please refer to the Tuning Guide for an explanation.

Solution 1

Increase the size of the data devices using the following command:

hadbm set TotalDataDevicePerNode=size

(See the Administrator's Guide)

This solution requires that there is space available on the physical disk which is used for the HADB data device.

HADBM will automatically re-start each node of the database.

Once all nodes have re-started, the database will start accepting insert and update requests.

Solution 2

Stop and clear the HADB, and create a new instance with more nodes, larger data devices, or several data devices per node. See the Administrator's Guide.

Unfortunately, using this solution will erase all persistent data.

Is there a shortage of other HADB resources?

When an HADB node is started, it will allocate:

If an HADB node runs out of resources it will delay and/or abort transactions. Resource usage information is shipped between mirror nodes, so that a node can delay or abort an operation which is likely to fail on its mirror node.

Transactions that are delayed repeatedly may time out, and return an error message to the client. If they do not time out, the situation will be visible to the client only as decreased performance in the period where the system is short on resources.

These problems frequently occur in “High Load” situations. For details, see "Frequent “High Load” Warnings".

Have history files grown too large?

As history files grow, they consume more and more disk space. Using the hadbm clearhistory command recovers that space, with the option of saving those files to a specified directory. For more information, consult the Administrator's guide.
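For example, a command of roughly this shape clears the history while archiving it (the --saveto option name and the archive path are assumptions; check the hadbm documentation for the exact syntax in your release):

```
hadbm clearhistory --saveto=/export/hadb-history-archive
```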


Other Transaction Problems

Can’t set TRANSACTION_SERIALIZABLE level for a bean.

Since the Release Notes say that the check-modified-at-commit flag is not implemented for this release, is there an equivalent of the WebLogic <transaction-isolation> element in Sun ONE Application Server?

Solution

First identify which resource is being used in that bean, then add the following attribute to the jdbc-connection-pool in the server.xml file:

transaction-isolation-level="serializable"

This is an optional element and does not exist in the server.xml file by default. You must explicitly add it.

You can make this change either through the Administration interface, or by using the command-line interface to modify the server.xml file and then running the asadmin reconfig command.
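In server.xml, the attribute goes on the jdbc-connection-pool element itself. A sketch (the pool name, datasource class, and URL property are illustrative; use the values for the resource your bean actually references):

```xml
<jdbc-connection-pool name="OraclePool"
    datasource-classname="oracle.jdbc.pool.OracleDataSource"
    transaction-isolation-level="serializable"
    is-isolation-level-guaranteed="true">
  <!-- Connection URL for the illustrative Oracle datasource. -->
  <property name="URL" value="jdbc:oracle:thin:@dbhost:1521:ORCL"/>
</jdbc-connection-pool>
```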


Sporadic failures during high loads.

High load scenarios are recognizable by the following symptoms:

High load scenarios are commonly caused by one of the following problems:

Note:
Frequently, these problems can be solved by making more CPU horsepower available, as described in "Solution 2: Improve CPU Utilization". (For even more information, consult the Tuning Guide.)

See also: "Frequent “High Load” Warnings".

No space left in Tuple Log

The user operations (delete, insert, update) are logged in the tuple log and then executed. The tuple log may fill up because log records are produced faster than they can be processed (indicated by HIGH LOAD messages in the history files), which happens as a result of:

Solution 1

Check CPU usage, as described in "Solution 2: Improve CPU Utilization".

Solution 2

If CPU utilization is not a problem and there is sufficient memory, increase the LogBufferSize using the hadbm set LogBufferSize= command.

Solution 3

Look for evidence of network contention, and resolve bottlenecks.

Node-internal log gets full

Too many node-internal operations are scheduled but not processed due to CPU or disk I/O problems.

Solution 1

Check CPU usage, as described in "Solution 2: Improve CPU Utilization".

Solution 2

If CPU utilization is not a problem and there is sufficient memory, increase the InternalLogBufferSize using the hadbm set InternalLogBufferSize= command.

Not enough locks

Some extra symptoms that identify this condition are:

Solution 1: Increase the number of locks

Use hadbm set NumberOfLocks= to increase the number of locks.

Solution 2: Improve CPU Utilization

Available CPU cycles and I/O capacity can impose severe restrictions on performance. Resolving and preventing such issues is necessary to optimize system performance (in addition to configuring the HADB optimally.)

If there are unused CPUs on the host, add new nodes to the same host. Otherwise, add new machines and add new nodes on them.

If the machine has enough memory, increase the DataBufferPoolSize, and increase any other internal buffers that are putting warnings into the log files, as described in "Sporadic failures during high loads.". Otherwise, add new machines and add new nodes on them.

For much more information on this subject, consult the Performance and Tuning Guide.


Frequent “High Load” Warnings

In some cases HADB will write informational or warning “HIGH LOAD” messages to the history file. If such messages are frequently encountered in the history file, the database administrator should take certain steps to improve the availability of resources. In most situations, reducing load or improving host performance will increase the availability of resources. This may be accomplished by:

See the Administrator’s Guide for information on how to perform these tasks.

The following resources can be adjusted to improve “HIGH LOAD” problems, as described in the Tuning Guide:

Size of Database Buffer:           hadbm attribute DataBufferPoolSize

Size of Tuple Log Buffer:          hadbm attribute LogBufferSize

Size of Node Internal Log Buffer:  hadbm attribute InternalLogBufferSize

Number of Database Locks:          hadbm attribute NumberOfLocks


Client cannot connect to HADB.

This problem is accompanied by a message in the history file:

HADB-E-11626: Error in IPC operations, iostat = 28: No space left on device

The most likely explanation is that a semget() call failed (see the UNIX man pages). If HADB started successfully and you get this message at runtime, it means that the host computer has too few semaphore “undo structures”. See the “Preparing for HADB Setup” chapter in the Installation Guide for how to configure semmnu in /etc/system.

Solution 1

Stop the affected HADB node, reconfigure and reboot the affected host, re-start the HADB node. HADB will be available during the process.

Solution 2

Stop the HADB database, reconfigure and reboot the affected host, re-start the HADB database. HADB will be unavailable during the process.


Connection Queue Problems

The following problems may occur with connection queues:

Full connection queue closes socket

This problem typically occurs in “high load” scenarios. The server.log file shows the following error:

SEVERE (17357): Connection queue full, closing socket

Solution: Increase configuration values

Increase the values of ConnQueueSize and MaxKeepAliveConnections in the magnus.conf file under the config directory of your Sun ONE Web Server, for example:
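Directives in magnus.conf take the simple form of a name followed by a value. The numbers below are illustrative starting points, not tuned recommendations; size them for your load:

```
ConnQueueSize 8192
MaxKeepAliveConnections 512
```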


Connection Pool Problems

The following problems may occur in relation to connection pools:

Single sign-on requires larger connection pool.

When single sign-on (or session persistence) requires connections and the wait time is exceeded, the following error occurs:

Unable to get connection - Wait-Time has expired

The Sun ONE Application Server uses the same connection pool for both HADB session persistence and single sign-on. Single sign-on is enabled by default. If an application requires single sign-on functionality, the connection pool setting must be doubled.


Tip

If your application does not require single sign-on functionality, disabling it can improve performance. To disable single sign-on, change the following settings in the server.xml file:

<virtual-server id="server1" ... >
  ...
  <property name="sso-enabled" value="false"/>
  ...
</virtual-server>


Solution

Double the size of the connection pool.

For example, if an application indicates that 16 is the optimal number of connections to a single HADB node, the number of connections should be doubled to 32 if single sign-on functionality is required. In this case, the JDBC connection pool settings look like this:

<jdbc-connection-pool name="CluJDBC"
  steady-pool-size="32" max-pool-size="32"
  max-wait-time-in-millis="60000" pool-resize-quantity="2"
  idle-timeout-in-seconds="10000"
  is-isolation-level-guaranteed="true"
  is-connection-validation-required="true"
  connection-validation-method="auto-commit"
  fail-all-connections="false"
  datasource-classname="com.sun.hadb.jdbc.ds.HadbDataSource"
  transaction-isolation-level="repeatable-read">

You should also double the loadbalancer.xml file setting for response-timeout-in-seconds, from 60 seconds to 120 seconds:

<property name="response-timeout-in-seconds" value="120"/>

This value must be equal to or greater than the following:

Server: Unable to obtain connection from pool

The application server is having trouble connecting with HADB, as evidenced by a message like the following in the server.log file:

ConnectionUtilgetConnectionsFromPool failed using connection URL: null Unable to obtain connection from pool

Solution

Make sure HADB is running. Make sure that the session store, the JDBC connection pool, and the JNDI name (jdbc/hastore) have been created. Configure session persistence for High Availability with a command like the following:

asadmin configure-session-persistence --user admin
  --password netscape --host localhost --port 4848
  --type ha --frequency web-method --scope session
  --store jdbc/hastore server1

JDBC connection is not available.

Consider the following:

Is the max-pool-size setting adequate?

The server.xml file defines the following default values:

During server start/restart, the ejb-container steady pool for deployed enterprise beans is created. Since the default steady-pool-size is 32, 32 bean instances are created for each deployed bean unless a different value is specified in the sun-ejb-jar.xml file.

The setEntityContext method will be called for each of the beans created. If more than one bean is grabbing JDBC connections in the setEntityContext method from the same JDBC connection pool, the following happens:

If the newly-created beans attempt to grab a JDBC connection from the same pool in their setEntityContext method, an exception is thrown with the following message:

Solution

Increase the max-pool-size of the default jdbc-connection-pool to a higher value, such as 256.
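A sketch of the change in server.xml (the pool name and datasource class shown here are illustrative; edit the jdbc-connection-pool element your beans actually use):

```xml
<jdbc-connection-pool name="default-pool"
    datasource-classname="com.pointbase.jdbc.jdbcDataSource"
    steady-pool-size="32"
    max-pool-size="256"/>
```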

Exception occurs while calling DataSource.getConnection().

This exception occurs when an invalid DataSource class property is registered within the JDBC connection pool. Misspelling is a common cause. For instance, while creating a jdbc-connection-pool for Oracle, one might specify the following:

<property name="OracleURL" value="jdbc:oracle:...."/>

This will result in the following exception since OracleURL is not a property of any Oracle datasource:

Solution

Verify that all jdbc-connection-pool properties in use are valid.

Verify that the datasource classname specified is for the required vendor datasource class.

Exception occurs while executing against MSSQL.

The following exception occurs while executing a statement against the MSSQL server using a non-XA driver:

java.sql.SQLException: [DataDirect] [SQLServer JDBC Driver] Can't start a cloned connection while in manual transaction mode

This happens when the selectMethod property is not set to cursor.

Solution

Ensure the selectMethod property is set correctly during JDBC connection pool registration:

Using the command-line interface, issue the following command:
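One possible form of the command is shown below. This is a sketch: the flag names, the DataDirect datasource class, and the pool name are assumptions, so verify them against the asadmin create-jdbc-connection-pool help for your release:

```
asadmin create-jdbc-connection-pool --user admin --port 4848
  --datasourceclassname com.ddtek.jdbcx.sqlserver.SQLServerDataSource
  --property selectMethod=cursor mssql-pool
```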

The options can also be set using the graphical Administration interface.

IOException: Connection in invalid state

The error log shows the following message:

WEB2001: Web application service failed
java.io.IOException: Error from HA Store:
  Connection is in invalid state for LOB operations.
  <stack trace>

This error occurs when the transaction-isolation-level entry of your HADB JDBC connection pool is set to read-committed or read-uncommitted.

Solution

Change the transaction-isolation-level value of your HADB JDBC Connection pool to repeatable-read and re-start the application server.





Copyright 2003 Sun Microsystems, Inc. All rights reserved.