Using WebLogic Server Clusters
Troubleshooting Common Problems
This chapter provides guidelines on how to prevent cluster problems or troubleshoot them if they do occur.
Before You Start the Cluster
You can do a number of things to help prevent problems before you boot the cluster.
Check for a Cluster License
Your WebLogic Server license must include the clustering feature. If you try to start a cluster without a clustering license, you will see the error message Unable to find a license for clustering.
Check the Server Version Numbers
All Managed Servers in a cluster and the cluster's Administration Server should run the same version of WebLogic Server. The major and minor version numbers (for example, 6.1), service packs, and attached patch levels should be the same across the cluster.
Check the Multicast Address
A problem with the multicast address is one of the most common reasons a cluster does not start or a server fails to join a cluster.
A multicast address is required for each cluster. The multicast address can be an IP address between 224.0.0.0 and 239.255.255.255, or a host name with an IP address within that range.
You can check a cluster's multicast address and port on its Configuration-->Multicast tab in the Administration Console.
For each cluster on a network, the combination of multicast address and port must be unique. If two clusters on a network use the same multicast address, they should use different ports. If the clusters use different multicast addresses, they can use the same port or accept the default port, 7001.
Before booting the cluster, make sure the cluster's multicast address and port are correct and do not conflict with the multicast address and port of any other clusters on the network.
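As a rough illustration of the two checks described above, the following shell sketch tests whether an address falls in the IPv4 multicast range (224.0.0.0 through 239.255.255.255) and reports any address and port pair that is used by more than one cluster. The function names and the three-column input format (cluster name, address, port) are assumptions made for this example; they are not part of WebLogic Server.

```shell
# Succeeds if $1 is a dotted-quad IPv4 address in the multicast range
# 224.0.0.0 through 239.255.255.255. Host names are not resolved here.
is_multicast_addr() {
  case $1 in
    *[!0-9.]*|*..*|.*|*.) return 1 ;;   # only digits and single dots allowed
  esac
  old_ifs=$IFS; IFS=.
  set -- $1                             # split the address into its octets
  IFS=$old_ifs
  [ "$#" -eq 4 ] || return 1
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  [ "$1" -ge 224 ] && [ "$1" -le 239 ]
}

# Reads lines of the (hypothetical) form "cluster-name address port" on
# standard input and prints every address:port pair that appears more than once.
duplicate_multicast_pairs() {
  awk '{ print $2 ":" $3 }' | sort | uniq -d
}
```

For example, `is_multicast_addr 237.0.0.1 && echo ok` prints `ok`, and piping a hand-maintained list of clusters into `duplicate_multicast_pairs` prints nothing when every combination is unique.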
The errors you are most likely to see if the multicast address is bad are:
Unable to create a multicast socket for clustering
Multicast socket send error
Multicast socket receive error
Check the CLASSPATH Value
Make sure the value of CLASSPATH is the same on all Managed Servers in the cluster. CLASSPATH is set by the setEnv script, which you run before you run startManagedWebLogic to start the Managed Servers.
setEnv sets this value for CLASSPATH (as represented on Windows systems):
If you change the value of CLASSPATH on one Managed Server, or change how setEnv sets CLASSPATH, you must change it on all Managed Servers in the cluster.
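One possible way to compare the values is sketched below. It assumes that, after running setEnv on each Managed Server, you saved the value to a file (for example with `echo "$CLASSPATH" > classpath.server1.txt`) and copied those files to one machine. The function name and file names are hypothetical.

```shell
# Succeeds only if every file named on the command line has the same
# content as the first one. Each file is assumed to hold one server's
# CLASSPATH value.
classpaths_match() {
  reference=$1
  shift
  for file in "$@"; do
    if ! cmp -s "$reference" "$file"; then
      echo "CLASSPATH in $file differs from $reference" >&2
      return 1
    fi
  done
}
```

A call such as `classpaths_match classpath.server1.txt classpath.server2.txt classpath.server3.txt` returns a nonzero status and names the first file that differs.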
Check the Thread Count
Each server instance in the cluster has a default execute queue, configured with a fixed number of execute threads. To view the thread count for the default execute queue, choose the Configure Execute Queue command on the Advanced Options portion of the Configuration-->General tab for the server. The default thread count for the default queue is 15, and the minimum value is 5. If the value of Thread Count is below 5, change it to a higher value so that the Managed Server does not hang on startup.
After You Start the Cluster
This section describes the first troubleshooting steps to perform if you have problems starting a cluster.
Check Your Commands
If the cluster fails to start, or a server fails to join the cluster, the first step is to check any commands you have entered, such as startManagedWebLogic or a java interpreter command, for errors and misspellings.
Generate a Log File
Before contacting BEA Technical Support for help with cluster-related problems, collect diagnostic information. The most important information is a log file with multiple thread dumps from a Managed Server. The log file is especially important for diagnosing cluster freezes and deadlocks.
Remember: a log file that contains multiple thread dumps is a prerequisite for diagnosing your problem.
- Remove or back up any log files you currently have. You should create a new log file each time you boot a server, rather than appending to an existing log file.
- Start the server with this command, which turns on verbose garbage collection and redirects both the standard error and standard output to a log file:
% java -ms64m -mx64m
-verbose:gc -classpath $CLASSPATH
Redirecting both standard error and standard output places thread dump information in the proper context with server informational and error messages and provides a more useful log.
- Continue running the cluster until you have reproduced the problem.
- If a server hangs, use kill -3 or <Ctrl>-<Break> to create the necessary thread dumps to diagnose your problem. Make sure to do this several times on each server, spaced about 5-10 seconds apart, to help diagnose deadlocks.
Note: If you are running the JRockit JVM under Linux, see Getting a JRockit Thread Dump Under Linux.
- Compress the log file using a Unix utility:
% tar czf logfile.tar.gz logfile.txt
or zip it using a Windows utility.
- Attach the compressed log file to an e-mail to your BEA Technical Support representative. Do not cut and paste the log file into the body of an e-mail.
- If the compressed log file is too large, you can use the BEA Customer Support FTP site.
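The thread dump step in the procedure above could be scripted on Unix roughly as follows. This is a sketch only: the function name is hypothetical, and the defaults (three dumps, five seconds apart) are an assumption chosen to match the 5-10 second spacing recommended above. On Windows, use <Ctrl>-<Break> in the server's console window instead.

```shell
# Send signal 3 (SIGQUIT) to a JVM several times, pausing between signals.
# Usage: take_thread_dumps PID [COUNT] [INTERVAL_SECONDS]
take_thread_dumps() {
  pid=$1
  count=${2:-3}
  interval=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    # The JVM writes the thread dump to its own output, which the
    # start command is expected to have redirected to the log file.
    kill -3 "$pid" || return 1
    echo "thread dump $i of $count requested for PID $pid"
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `take_thread_dumps 12345 4 10` requests four dumps from process 12345 at ten-second intervals; the dumps themselves appear in that server's log file, not in the output of this function.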
Getting a JRockit Thread Dump Under Linux
If you use the JRockit JVM under Linux, use one of the following methods to generate a thread dump.
- Use the weblogic.admin THREAD_DUMP command. For instructions and limitations, see THREAD_DUMP in WebLogic Server Command Reference.
- If the JVM's management server is enabled (by starting the JVM with the -Xmanagement option), you can generate a thread dump using the JRockit Management Console.
- Use kill -3 PID, where PID is the root of the process tree.
To obtain the root PID, run:
ps -efHl | grep 'java'
using a grep argument that is a string found in the process stack and that matches the server startup command. The first PID reported will be the root process, assuming that the ps command has not been piped to another routine.
Under Linux, each execute thread appears as a separate process in the Linux process stack. To use kill -3 on Linux, the PID you supply must match the PID of the main WebLogic execute thread; otherwise, no thread dump will be produced.
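The lookup described above can be sketched as a small filter. It assumes the long listing layout that `ps -efHl` prints on Linux, in which the PID is the fourth column (after F, S, and UID); the function name is hypothetical, and the match string is whatever identifies your server's startup command.

```shell
# Reads `ps -efHl` output on standard input and prints the PID of the
# first line that contains the given string. With -H the listing is a
# process tree, so the first match is expected to be the root process.
root_pid() {
  grep -F -- "$1" | grep -v grep | awk 'NR == 1 { print $4 }'
}
```

A possible use is `kill -3 "$(ps -efHl | root_pid 'weblogic.Server')"`, assuming `weblogic.Server` appears in the startup command of exactly one server on the machine. Check the PID by eye before sending the signal if several servers run on the same host.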
Check Garbage Collection
If you are experiencing cluster problems, you should also check garbage collection on the Managed Servers. If garbage collection is taking too long, the servers will not be able to send the frequent heartbeat signals that tell the other cluster members they are running and available.
If garbage collection (either first or second generation) is taking 10 or more seconds, you need to tune heap allocation (the -ms and -mx parameters) on your system.
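To spot long collections in a log produced with -verbose:gc, a filter along the following lines may help. It assumes the classic Sun JVM output format, with lines such as `[GC 325407K->83000K(776768K), 0.2300771 secs]`; other JVMs, including JRockit, print garbage collection messages differently, so treat this as a sketch to adapt. The function name is hypothetical.

```shell
# Reads a -verbose:gc log on standard input and prints each collection
# whose reported duration is at least the given number of seconds
# (default 10). Only lines containing "[GC" or "[Full GC" are examined.
slow_gc_lines() {
  threshold=${1:-10}
  awk -v limit="$threshold" '
    /\[(Full )?GC/ {
      # Find the first "<number> secs]" on the line and compare it.
      if (match($0, /[0-9]+\.[0-9]+ secs\]/)) {
        secs = substr($0, RSTART, RLENGTH) + 0
        if (secs >= limit) print
      }
    }'
}
```

For example, `slow_gc_lines 10 < logfile.txt` lists only the collections that took 10 seconds or longer.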
Run utils.MulticastTest
You can verify that multicast is working by running utils.MulticastTest from one of the Managed Servers. See Using the WebLogic Server Java Utilities in WebLogic Server Command Reference.