BEA WebLogic Enterprise 4.2 Developer Center

Troubleshooting Applications

Other chapters of this document discuss many diagnostic tools provided by your WLE or BEA TUXEDO system: commands and log files that help you monitor a running system, identify potential problems while there is still time to prevent them, and detect error conditions once they have occurred. This chapter provides additional information to help you identify and recover from various system errors.

Distinguishing Between Types of Failures

The first step in troubleshooting is to determine the area in which the problem has occurred. In most applications, you must consider six possible sources of trouble:

Application
The WLE or BEA TUXEDO system
Database management software
Network
Operating system
Hardware

To resolve the trouble in most of these areas, you must work with the appropriate administrator. If, for example, you determine that the trouble is being caused by a networking problem, you must work with the network administrator.

Determining the Cause of an Application Failure

To detect the source of an application failure, complete the following steps.

Check any WLE or BEA TUXEDO system warnings and error messages in the user log (ULOG).
Select the messages you think are most likely to reflect the current problem. Note the catalog name and the message number of each of those messages and look them up in the BEA WebLogic Enterprise System Messages or BEA TUXEDO System Message Manual. The document entry provides:
- Details about the error condition flagged by the message
- Recommendations for actions you can take to recover
Check any application warnings and error messages in the ULOG.
Check any warnings and errors generated by application servers and clients. Such messages are usually sent to the standard output and standard error files (named, by default, stdout and stderr, respectively).
- The stdout and stderr files are located in $APPDIR.
- The stdout and stderr files for your clients and servers may have been renamed. (You can rename the stdout and stderr files by specifying -e and -o in the appropriate client and server definitions in your configuration file. For details, see the servopts(5) reference page in the BEA TUXEDO Reference Manual.)
Look for any core dumps in $APPDIR. Use a debugger such as sdb to get a stack trace. If you find core dumps, notify the application developer.
Check your system activity reports (by running the sar(1) command) to determine why your system is not functioning properly. Consider the following possible reasons:
- The system may be running out of memory.
- The kernel might not be tuned correctly.
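Steps 1 and 2 above (finding messages and noting the catalog name and message number) can be partly scripted. The following is a minimal Python sketch, not a BEA-supplied tool. It assumes that each ULOG line carries the catalog name and message number in the form `CATALOG:number:` just before the severity word; the sample ULOG line in this chapter elides that portion, so the pattern is an assumption, and the function name `find_messages` is hypothetical.

```python
import re

# Assumed layout: time.host!process.pid: CATALOG:number: SEVERITY: text
ULOG_LINE = re.compile(
    r"^(?P<time>\d{6})\.(?P<host>[^!]+)!(?P<proc>[^.]+)\.(?P<pid>\d+):\s+"
    r"(?P<catalog>[A-Z_0-9]+):(?P<msgnum>\d+):\s+"
    r"(?P<severity>ERROR|WARN):\s+(?P<text>.*)$"
)

def find_messages(lines):
    """Return (catalog, message number, severity, text) for each matching line."""
    found = []
    for line in lines:
        m = ULOG_LINE.match(line.strip())
        if m:
            found.append((m["catalog"], int(m["msgnum"]), m["severity"], m["text"]))
    return found
```

Lines that do not fit the assumed layout are skipped rather than reported, so the result is a starting point for the manual lookup, not a replacement for reading the log.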

Determining the Cause of a WLE or BEA TUXEDO System Failure

To detect the source of a system failure, complete the following steps:

Check any WLE or BEA TUXEDO system warnings and error messages in the user log (ULOG):
- TPEOS messages indicate errors in the operating system.
- TPESYSTEM messages indicate errors in the WLE or BEA TUXEDO system.
Select the messages you think are most likely to reflect the current problem. Note the catalog name and message number of each of those messages and locate the messages in the BEA WebLogic Enterprise System Messages or the BEA TUXEDO System Message Manual. The message manual provides the following information about each system message:
- Details about the error condition flagged by the message
- Recommendations for actions you can take to recover
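As a rough aid for the first step, the two error names can be mapped to the area they point to. This Python sketch assumes only that the names TPEOS and TPESYSTEM appear literally in the message text; the function name `failure_areas` and the sample strings in the usage note are hypothetical.

```python
import re

# Error names from this section and the area each one indicates
AREAS = {
    "TPEOS": "operating system",
    "TPESYSTEM": "WLE or BEA TUXEDO system",
}

def failure_areas(line):
    """Return the areas indicated by the error names found in one ULOG line."""
    return [area for code, area in AREAS.items()
            if re.search(rf"\b{code}\b", line)]
```

For example, `failure_areas("ERROR: call failed, TPEOS")` returns a list containing only the operating system entry, and a line with neither name returns an empty list.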

Broadcasting Unsolicited Messages (BEA TUXEDO System)

To send an unsolicited message, enter the following command.

broadcast (bcst) [-m machine] [-u usrname] [-c cltname] [text]

By default, the message is sent to all clients. You have the choice, however, of limiting distribution to one of the following recipients:

One machine (-m machine)
One client group (-c client_group)
One user (-u user)

The text cannot exceed 80 characters. The system sends the message in a buffer of type STRING. This means that the client's unsolicited message handling function (specified by tpsetunsol()) must be able to handle a message of this type. The tptypes() function may be useful in this case.
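If broadcast commands are generated by a script, the 80-character limit can be checked before the command is passed to tmadmin. The following Python sketch only builds the command string. The function name `bcst_command` is hypothetical, and enclosing the text in double quotation marks is an assumption about how multiword text should be passed.

```python
MAX_TEXT = 80  # limit stated in this section

def bcst_command(text, machine=None, user=None, client=None):
    """Build a tmadmin bcst command line, rejecting text over 80 characters."""
    if len(text) > MAX_TEXT:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_TEXT}")
    parts = ["bcst"]
    if machine:
        parts += ["-m", machine]
    if user:
        parts += ["-u", user]
    if client:
        parts += ["-c", client]
    parts.append(f'"{text}"')  # quoting is an assumption, see above
    return " ".join(parts)
```

For example, `bcst_command("System going down", machine="SITE2")` produces `bcst -m SITE2 "System going down"`.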

Performing System File Maintenance

This section provides instructions for the following tasks that you may need to perform in the course of maintaining your file system:

Creating a device list
Destroying a device list
Reinitializing a device
Printing the Universal Device List
Printing VTOC information

Creating a Device List

Complete the following steps to create a device list.

Start a tmadmin session.
Enter the following command.
```
crdl [-z devicename] [-b blocks]
```
- The value of devicename [devindx] is the desired device name. (Another way to assign a name to a new device is by setting the FSCONFIG environment variable to the desired device name.)
- The value of blocks is the number of blocks needed. The default value is 1000 blocks.
  
  Note: Because 35 blocks are needed for the administrative overhead associated with a TLOG, be sure to assign a value higher than 35 when you create a TLOG.
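The note about TLOG overhead can be enforced when the crdl command is generated by a script. This is a minimal Python sketch; the function name `crdl_command` and the device path in the usage note are hypothetical.

```python
TLOG_OVERHEAD_BLOCKS = 35  # administrative overhead stated in the note above

def crdl_command(devicename, blocks, for_tlog=False):
    """Build a crdl command, refusing a TLOG device with too few blocks."""
    if for_tlog and blocks <= TLOG_OVERHEAD_BLOCKS:
        raise ValueError(
            f"a TLOG needs more than {TLOG_OVERHEAD_BLOCKS} blocks, got {blocks}"
        )
    return f"crdl -z {devicename} -b {blocks}"
```

For example, `crdl_command("/home/app/TLOG", 200, for_tlog=True)` returns `crdl -z /home/app/TLOG -b 200`, while a request for 35 blocks or fewer raises an error.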

Destroying a Device List

To destroy a device list with index devindx, enter the following command.

dsdl [-z devicename] [-yes] [devindx]

You can specify the device by:
- Entering its name after the -z option (as shown here), or
- Setting the environment variable FSCONFIG to the device name
If you include the -yes option on the command line, you will not be prompted to confirm your intention to destroy the file before the file is actually destroyed.
The value of devindx is the index to the file to be destroyed.

Reinitializing a Device

To reinitialize a device on a device list, enter the following command.

initdl [-z devicename] [-yes] devindx

You can specify the device by:
- Entering its name after the -z option (as shown here), or
- Setting the environment variable FSCONFIG to the device name
If you include the -yes option on the command line, you will not be prompted to confirm your intention to reinitialize the device before it is actually reinitialized.
The value of devindx is the index to the device to be reinitialized.

Printing the Universal Device List (UDL)

To print a UDL, enter the following command.

lidl

To specify the device from which you want to obtain the UDL, you have a choice of two methods:

Specify the following on the lidl command line.
```
-z devicename [devindx]
```
Set the environment variable FSCONFIG to the name of the desired device.

Printing VTOC Information

To get information about all VTOC table entries, enter the following command.

livtoc

To specify the device from which you want to obtain the VTOC, you have a choice of two methods:

Specify the following on the livtoc command line.
```
-z devicename [devindx]
```
Set the environment variable FSCONFIG to the name of the desired device.

Repairing Partitioned Networks

A network partition exists if one or more machines cannot access the master machine. As the application administrator, you are responsible for detecting partitions and recovering from them. This section provides instructions for troubleshooting a partition, identifying its cause, and taking action to recover from it.

A network partition may be caused by the following:

A network failure, which can be one of two types:
- Transient failure, which corrects itself in minutes
- Severe failure, which requires you to take the partitioned machine out of the network
A machine failure on either:
- The master machine
- The nonmaster machine
A BRIDGE failure

The procedure you follow to recover from a partitioned network depends on the cause of the partition. Recovery procedures for these situations are provided in this section.

Detecting Partitioned Networks

There are several ways to detect a network partition:

You can check the user log (ULOG) for messages that may shed light on the origin of the problem.
You can gather information about the network, server, and service by running the tmadmin commands provided for this purpose.

Checking the ULOG

When things go wrong with the network, WLE or BEA TUXEDO system administrative servers start sending messages to the ULOG. If the ULOG is set up over a remote file system, all messages are written to the same log. In such a case you can run the tail(1) command on one file and check the failure messages displayed on the screen.

If, however, the remote file system is using the same network, the remote file system may no longer be available.

Example

151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2
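A script that watches the ULOG can pick out partition messages of the kind shown in the example. The following Python sketch matches only the message text shown above; other partition-related messages may exist and are not covered. The function name `partitioned_machines` is hypothetical.

```python
import re

# Matches the message text from the example ULOG line above
PARTITION = re.compile(r"ERROR: BBL partitioned, machine=(?P<machine>\S+)")

def partitioned_machines(lines):
    """Return machine names reported as partitioned, in first-seen order."""
    machines = []
    for line in lines:
        m = PARTITION.search(line)
        if m and m["machine"] not in machines:
            machines.append(m["machine"])
    return machines
```

Applied to the example line, the function returns a list containing SITE2.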

Gathering Information about the Network, Server, and Service

Listing 23-1 provides an example of a tmadmin session in which information is being collected about a partitioned network, and a server and a service on that network. Three tmadmin commands are run:

pnw (the printnetwork command)
psr (the printserver command)
psc (the printservice command)

Listing 23-1 Example of a tmadmin Session

$ tmadmin
> pnw SITE2
Could not retrieve status from SITE2

> psr -m SITE1
a.out Name     Queue Name   Grp Name   ID   Rq Done  Load Done  Current Service
BBL            30002.00000  SITE1       0      -         -        (- )
DBBL           123456       SITE1       0     121       6050      MASTERBB
simpserv       00001.00001  GROUP1      1      -         -        ( - )
BRIDGE         16900672     SITE1       0      -         -        ( DEAD )

> psc -m SITE1
Service Name   Routine Name  a.out     Grp Name ID  Machine   # Done Status
------------   ------------  --------  -------- --  -------   ------------ 
ADJUNCTADMIN   ADJUNCTADMIN  BBL       SITE1     0   SITE1      - PART
ADJUNCTBB      ADJUNCTBB     BBL       SITE1     0   SITE1      - PART
TOUPPER        TOUPPER       simpserv  GROUP1    1   SITE1      - PART
BRIDGESVCNM    BRIDGESVCNM   BRIDGE    SITE1     1   SITE1      - PART
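When the output of psr and psc is captured to a file, the entries that matter for diagnosing a partition can be filtered out. This Python sketch assumes the column layout shown in Listing 23-1: a dead server shows DEAD in parentheses, and a partitioned service ends with PART. The function names are hypothetical.

```python
import re

DEAD = re.compile(r"\(\s*DEAD\s*\)")  # as in the BRIDGE line of the psr output

def dead_servers(psr_lines):
    """Return the first column of every psr line marked ( DEAD )."""
    return [line.split()[0] for line in psr_lines if DEAD.search(line)]

def partitioned_services(psc_lines):
    """Return the first column of every psc line whose last column is PART."""
    names = []
    for line in psc_lines:
        fields = line.split()
        if fields and fields[-1] == "PART":
            names.append(fields[0])
    return names
```

With the listing above as input, `dead_servers` reports BRIDGE, and `partitioned_services` reports all four services on SITE1.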

Restoring Failed Machines

The procedure you follow to restore a failed machine depends on whether that machine was the master machine.

Restoring a Failed Master Machine

To restore a failed master machine, complete the following procedure.

Make sure that all IPC resources are removed for the BEA TUXEDO processes that died.
Start a tmadmin session on the ACTING MASTER (SITE2):
```
tmadmin
```
Boot the BBL on the MASTER (SITE1) by entering the following command:
```
boot -B SITE1
```
The BBL will not boot if you have not executed pclean on SITE1.
Still in tmadmin, start a DBBL running again on the master site (SITE1) by entering the following:
```
MASTER
```
If you have migrated application servers and data off the failed machine, boot them or migrate them back.

Restoring a Failed Nonmaster Machine

To restore a failed nonmaster machine, complete the following procedure.

On the master machine, start a tmadmin session.
Run pclean, specifying the partitioned machine on the command line.
Fix the machine problem.
Restore the failed machine by booting the Bulletin Board Listener (BBL) for it from the master machine.
If you have migrated application servers and data off the failed machine, boot them or migrate them back.

In Listing 23-2, SITE2, a nonmaster machine, is restored.

Listing 23-2 Example of Restoring a Failed Nonmaster Machine

$ tmadmin
tmadmin - Copyright © 1987-1990 AT&T; 1991-1993 USL. All rights reserved

> pclean SITE2
Cleaning the DBBL.

Pausing 10 seconds waiting for system to stabilize.
3 SITE2 servers removed from bulletin board

> boot -B SITE2
Booting admin processes ...

Exec BBL -A :

on SITE2 -> process id=22923 ... Started.
1 process started.
> q

Replacing System Components (BEA TUXEDO System)

To replace BEA TUXEDO system components, complete the following procedure.

Install the BEA TUXEDO system software that is being replaced.
Shut down those parts of the application that will be affected by the changes:
- The BEA TUXEDO system servers may need to be shut down if libraries are being updated.
- Application clients and servers must be shut down and rebuilt if relevant BEA TUXEDO system header files or static libraries are being replaced. (Application clients and servers do not need to be rebuilt if the BEA TUXEDO system message catalogs, system commands, administrative servers, or shared objects are being replaced.)
If relevant BEA TUXEDO system header files and static libraries have been replaced, rebuild your application clients and servers.
Reboot the parts of the application that you shut down.

Replacing Application Components

To replace components of your application, complete the following procedure.

Install the application software. This software may consist of application clients, application servers, and various administrative files, such as the FML field tables.
Shut down the application servers being replaced.
If necessary, build the new application servers.
Boot the new application servers.

Cleaning Up and Restarting Servers Manually

By default, the WLE or BEA TUXEDO system cleans up resources associated with dead processes (such as queues) and restarts restartable dead servers from the Bulletin Board (BB) at regular intervals during BBL scans. You may, however, request cleaning at other times.

Cleaning Up Resources

To request an immediate cleanup of resources associated with dead processes, complete the following procedure.

Start a tmadmin session.
Enter bbclean machine.

The bbclean command takes one optional argument: the name of the machine to be cleaned.


If You Specify . . .	Then . . .
No machine	The resources on the default machine are cleaned.
A machine	The resources on that machine are cleaned.
DBBL	The resources on the Distinguished Bulletin Board Listener (DBBL) and the Bulletin Boards at all sites are cleaned.

To clean up other resources, complete the following procedure.

Start a tmadmin session.

Enter pclean machine.

Note: You must specify a value for machine; it is a required argument.

If the Specified Machine Is . . .	Then . . .
Not partitioned	`pclean` will invoke `bbclean`.
Partitioned	`pclean` will remove all entries for servers and services from all nonpartitioned Bulletin Boards.

This command is useful for restoring order to a system after partitioning has occurred unexpectedly.

Checking the Order in Which Servers Are Booted (WLE Servers)

If a WLE application fails to boot, open the application's UBBCONFIG file with a text editor and check that the servers in the SERVERS section are listed in the correct boot order. A WLE application will not boot if this order is not followed.

Boot the servers in the following order:

The system event broker, TMSYSEVT.
The TMFFNAME server with the -N option and the -M option, which starts the NameManager service (as a master). This service maintains a mapping of application-supplied names to object references.
The TMFFNAME server with the -N option only, to start a slave NameManager service.
The TMFFNAME server with the -F option, to start the FactoryFinder.
The application servers that are advertising factories.

For a detailed example, see the section "Required Order in Which to Boot Servers (WLE Servers)" in Chapter 3, "Creating a Configuration File."
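The order check can also be expressed as a small script. The following Python sketch takes the server names and CLOPT strings in the order in which they appear in the SERVERS section and reports the first entry that is out of place. It does not parse a UBBCONFIG file; the function names, the server name `simple_server`, and the `-A --` prefix in the sample CLOPT strings are assumptions made for illustration.

```python
def boot_rank(server, clopt=""):
    """Map a server entry to its position in the required boot order."""
    flags = clopt.split()
    if server == "TMSYSEVT":
        return 0  # system event broker
    if server == "TMFFNAME":
        if "-N" in flags and "-M" in flags:
            return 1  # master NameManager
        if "-N" in flags:
            return 2  # slave NameManager
        if "-F" in flags:
            return 3  # FactoryFinder
    return 4  # application servers and anything else

def first_out_of_order(servers):
    """Return the index of the first entry listed too late, or None."""
    highest = -1
    for index, (name, clopt) in enumerate(servers):
        rank = boot_rank(name, clopt)
        if rank < highest:
            return index
        highest = rank
    return None
```

For example, a list in which TMSYSEVT appears after a TMFFNAME entry causes `first_out_of_order` to return the index of the TMSYSEVT entry; a correctly ordered list returns `None`.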

Checking Hostname Format and Capitalization (WLE Servers)

The network address that is specified by programmers in the Bootstrap object constructor or in TOBJADDR must exactly match the network address in the server application's UBBCONFIG file. The format of the address as well as the capitalization must match. If the addresses do not match, the call to the Bootstrap object constructor will fail with a seemingly unrelated error message:

ERROR: Unofficial connection from client at <tcp/ip address>/<port-number>:

For example, if the network address is specified as //TRIXIE:3500 in the ISL command line option string (in the server application's UBBCONFIG file), specifying either //192.12.4.6:3500 or //trixie:3500 in the Bootstrap object constructor or in TOBJADDR will cause the connection attempt to fail.

On UNIX systems, use the uname -n command on the host system to determine the capitalization used. On Windows NT systems, see the host system's Network control panel to determine the capitalization used.
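The matching rule is an exact, case-sensitive string comparison. This Python sketch shows why both variants in the example above fail; the function name `explain_mismatch` is hypothetical.

```python
def explain_mismatch(client_address, ubb_address):
    """Return None if the addresses match exactly, else a short reason."""
    if client_address == ubb_address:
        return None
    if client_address.lower() == ubb_address.lower():
        return "capitalization differs"
    return "format or value differs"
```

With `//TRIXIE:3500` as the UBBCONFIG address, `//trixie:3500` is reported as a capitalization difference and `//192.12.4.6:3500` as a format difference, even though all three may refer to the same host.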

Some Clients Fail to Boot (WLE Servers)

If some Internet Inter-ORB Protocol (IIOP) clients boot but others fail to create a Bootstrap object and return an InvalidDomain message, even though the //host:port address is correctly specified, you may want to perform the following steps on the Windows NT server that is running the WLE application. (For related information, see the section "Checking Hostname Format and Capitalization (WLE Servers)" in this chapter.)

Start regedt32, the Registry Editor.
Go to the HKEY_LOCAL_MACHINE on Local Machine window.

Select:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Afd\Parameters

Add the following values by using the Edit -> Add Value menu option:

DynamicBacklogGrowthDelta: REG_DWORD: 0xa
EnableDynamicBacklog: REG_DWORD: 0x1
MaximumDynamicBacklog: REG_DWORD: 0x3e8
MinimumDynamicBacklog: REG_DWORD: 0x14
Restart the Windows NT system for the changes to take effect.

These values replace the static connection queue (that is, the backlog) of five pending connections with a dynamic connection backlog that has at least 20 entries (minimum 0x14) and at most 1000 entries (maximum 0x3e8), and that grows from the minimum to the maximum in steps of 10 (growth delta 0xa).

These settings apply only to connections that have been received by the system but have not yet been accepted by an IIOP Listener. The minimum value of 20 and the delta of 10 are recommended by Microsoft. The maximum value depends on the machine. However, Microsoft recommends that the maximum value not exceed 5000 on a Windows NT server.
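The hexadecimal registry values translate to the decimal figures quoted above. The following Python sketch converts them and lists the backlog sizes reached when growing in steps, following the description in this section rather than the actual logic of the Windows NT driver. The function name `backlog_sizes` is hypothetical.

```python
# Registry values from this section, written as hexadecimal literals
SETTINGS = {
    "DynamicBacklogGrowthDelta": 0xA,
    "EnableDynamicBacklog": 0x1,
    "MaximumDynamicBacklog": 0x3E8,
    "MinimumDynamicBacklog": 0x14,
}

def backlog_sizes(minimum, maximum, delta):
    """List the backlog sizes from minimum to maximum in steps of delta."""
    return list(range(minimum, maximum + 1, delta))

sizes = backlog_sizes(
    SETTINGS["MinimumDynamicBacklog"],
    SETTINGS["MaximumDynamicBacklog"],
    SETTINGS["DynamicBacklogGrowthDelta"],
)
```

Here `sizes` starts at 20 and ends at 1000, which matches the minimum and maximum described above.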

Aborting or Committing Transactions

This section provides instructions for aborting and committing transactions.

Aborting a Transaction

To abort a transaction, enter the following command.

aborttrans (abort) [-yes] [-g groupname] tranindex

To determine the value of tranindex, run the printtrans command (a tmadmin command).
If groupname is specified, a message is sent to the TMS of that group to mark the transaction as aborted for that group. If a group is not specified, a message is sent instead to the coordinating TMS, requesting an abort of the transaction. To control the abort yourself, you must send abort messages to all groups in the transaction.

This command is useful when the coordinating site is partitioned or when the client terminates before calling a commit or an abort. If the timeout is large, the transaction remains in the transaction table unless it is aborted.
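Because a controlled abort requires one message per group, the commands can be generated from a list of the groups involved. This Python sketch only builds the command strings for entry in a tmadmin session; the function name `abort_commands` and the group names in the usage note are hypothetical, and the transaction index must still be taken from the printtrans output.

```python
def abort_commands(tranindex, groups, skip_prompt=False):
    """Build one aborttrans command per group in the transaction."""
    commands = []
    for group in groups:
        parts = ["aborttrans"]
        if skip_prompt:
            parts.append("-yes")  # suppress the confirmation prompt
        parts += ["-g", group, str(tranindex)]
        commands.append(" ".join(parts))
    return commands
```

For example, `abort_commands(0, ["GROUP1", "GROUP2"])` returns `aborttrans -g GROUP1 0` and `aborttrans -g GROUP2 0`.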

Committing a Transaction

To commit a transaction, enter the following command.

committrans (commit) [-yes] -g groupname tranindex

Both groupname and tranindex are required arguments.
The operation fails if the transaction is not precommitted or has been marked aborted.
This message should be sent to all groups to fully commit the transaction.

Cautions

Be careful about using this command. The only time you should need to run it is when both of the following conditions apply:

The coordinating TMS has gone down before all groups got the commit message.
The coordinating TMS will not be able to recover the transaction for some time.

Also, a client may be blocked on tpcommit(), which will eventually time out. If you are going to perform an administrative commit, be sure to inform this client.

Recovering from Failures When Transactions Are Used

When the application you are administering includes database transactions, you may need to apply an after-image journal (AIJ) to a restored database following a disk corruption failure, or you may need to coordinate the timing of this recovery activity with your site's database administrator (DBA). Typically, the database management software automatically performs transaction rollback when an error occurs. When the disk containing database files has become permanently corrupted, however, you or the DBA may need to step in and perform the rollforward operation.

Assume that a disk containing portions of a database is corrupted at 3:00 P.M. on a Wednesday. For this example, assume that a shadow volume does not exist. To recover, complete the following steps.

Shut down the WLE or BEA TUXEDO application. For instructions, see Chapter 4, "Starting and Shutting Down Applications."
Get the last full backup of the database and restore the file. For example, restore the full backup version of the database from last Sunday at 12:01 A.M.
Apply the incremental backup files, such as the incrementals from Monday and Tuesday. For example, assume that this step restores the database up until 11:00 P.M. on Tuesday.
Apply the AIJ, or transaction journal file, that contains the transactions from 11:15 P.M. on Tuesday up to 2:50 P.M. on Wednesday.
Open the database again.
Restart the WLE or BEA TUXEDO applications.

Refer to the documentation for the resource manager (database product) for specific instructions on the database rollforward process.

Copyright © 1999 BEA Systems, Inc. All Rights Reserved.
Last update: July 05, 1999.