BEA WebLogic Enterprise 4.2 Developer Center |
Other chapters of this document discuss many diagnostic tools provided by your WLE or BEA TUXEDO system: commands and log files that help you monitor a running system, identify potential problems while there is still time to prevent them, and detect error conditions once they have occurred. This chapter provides additional information to help you identify and recover from various system errors.
This chapter discusses the following topics:
Distinguishing Between Types of Failures
The first step in troubleshooting is to determine the area in which the problem has occurred. In most applications, you must consider six possible sources of trouble:
To resolve the trouble in most of these areas, you must work with the appropriate administrator. If, for example, you determine that the trouble is being caused by a networking problem, you must work with the network administrator.
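Determining the Cause of an Application Failure
To detect the source of an application failure, complete the following steps.
1. Check the WLE or BEA TUXEDO system warnings and error messages in the user log (ULOG).
2. Check the application warnings and error messages in the ULOG.
3. Check the warnings and errors that application servers and clients write to standard output and standard error. The stdout and stderr files are located in $APPDIR.
4. Look for core dumps in $APPDIR. Use a debugger such as sdb to get a stack trace. If you find core dumps, notify the application developer.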
To detect the source of a system failure, complete the following steps:
1. Check the WLE or BEA TUXEDO system warnings and error messages in the user log (ULOG):
- TPEOS messages indicate errors in the operating system.
- TPESYSTEM messages indicate errors in the WLE or BEA TUXEDO system.
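The distinction between TPEOS and TPESYSTEM can be applied mechanically when a ULOG is long. The following is a minimal Python sketch, not part of the product: the function names are hypothetical, and it assumes only that the error name appears literally somewhere in the log line.

```python
import re

# Error names from the step above, mapped to the area each one points to.
ERROR_SOURCES = {
    "TPEOS": "operating system",
    "TPESYSTEM": "WLE or BEA TUXEDO system",
}


def classify_ulog_line(line):
    """Return the likely source of the error on one ULOG line, or None."""
    for code, source in ERROR_SOURCES.items():
        # Word boundaries keep a longer identifier from matching by accident.
        if re.search(rf"\b{code}\b", line):
            return source
    return None


def summarize(lines):
    """Count how many lines point at each source of trouble."""
    counts = {source: 0 for source in ERROR_SOURCES.values()}
    for line in lines:
        source = classify_ulog_line(line)
        if source is not None:
            counts[source] += 1
    return counts
```

Calling summarize() on the lines of a ULOG file gives a quick indication of whether to start with the operating system administrator or with the WLE or BEA TUXEDO configuration.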
To send an unsolicited message, enter the following command.
broadcast (bcst) [-m machine] [-u usrname] [-c cltname] [text]
By default, the message is sent to all clients. You have the choice, however, of limiting distribution to one of the following recipients:
- One machine (-m machine)
- One user name (-u usrname)
- One client name (-c cltname)
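When broadcasts are issued from a script, it can help to validate the message before handing it to tmadmin. The following Python sketch builds a bcst command line from the syntax shown above and enforces the 80-character limit on the text that this section describes. The function name is hypothetical, and wrapping the text in double quotation marks is an assumption about how tmadmin expects multiword text to be entered.

```python
MAX_TEXT_LENGTH = 80  # limit on the message text stated in this section


def build_broadcast(text="", machine=None, usrname=None, cltname=None):
    """Return a bcst command line for tmadmin (hypothetical helper)."""
    if len(text) > MAX_TEXT_LENGTH:
        raise ValueError("text exceeds %d characters" % MAX_TEXT_LENGTH)
    if '"' in text:
        # Assumption: text is passed in double quotes, so reject embedded ones.
        raise ValueError("text must not contain double quotation marks")

    parts = ["bcst"]
    if machine:
        parts += ["-m", machine]
    if usrname:
        parts += ["-u", usrname]
    if cltname:
        parts += ["-c", cltname]
    if text:
        parts.append('"%s"' % text)
    return " ".join(parts)
```

For example, build_broadcast("System going down", machine="SITE1") returns the string bcst -m SITE1 "System going down".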
The text may not include more than 80 characters. The system sends the message in a buffer of type STRING. This means that the client's unsolicited message handling function (specified by tpsetunsol(0)) must be able to handle a message of this type. The tptypes() function may be useful in this case.
Performing System File Maintenance
This section provides instructions for the following tasks that you may need to perform in the course of maintaining your file system:
Creating a Device List
Complete the following steps to create a device list.
1. Start a tmadmin session.
2. Enter the following command:
crdl [-z devicename] [-b blocks]
- devicename [devindx] is the desired device name. (Another way to assign a name to a new device is by setting the FSCONFIG environment variable to the desired device name.)
- blocks is the number of blocks needed. The default value is 1000 pages.
To destroy a device list with index devindx, enter the following command.
dsdl [-z devicename] [-yes] [devindx]
- You can specify the device name in either of two ways: with the -z option (as shown here), or by setting the environment variable FSCONFIG to the device name.
- If you include the -yes option on the command line, you will not be prompted to confirm your intention to destroy the file before the file is actually destroyed.
- devindx is the index to the file to be destroyed.
Reinitializing a Device
To reinitialize a device on a device list, enter the following command.
initdl [-z devicename] [-yes] devindx
- You can specify the device name in either of two ways: with the -z option (as shown here), or by setting the environment variable FSCONFIG to the device name.
- If you include the -yes option on the command line, you will not be prompted to confirm your intention to destroy the file before the file is actually destroyed.
- devindx is the index to the file to be destroyed.
Printing the Universal Device List (UDL)
To print a UDL, enter the following command.
lidl
To specify the device from which you want to obtain the UDL, you have a choice of two methods:
- Specify -z devicename [devindx] on the lidl command line.
- Set the environment variable FSCONFIG to the name of the desired device.
Printing VTOC Information
To get information about all VTOC table entries, enter the following command.
livtoc
To specify the device from which you want to obtain the VTOC, you have a choice of two methods:
- Specify -z devicename [devindx] on the livtoc command line.
- Set the environment variable FSCONFIG to the name of the desired device.
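The device list commands in this section all accept the device either through the -z option or through the FSCONFIG environment variable. A wrapper script could resolve the device name along the following lines. This is a sketch only: the function name is hypothetical, and giving the -z value precedence over FSCONFIG is an assumption, since this section does not say which one wins when both are set.

```python
import os


def resolve_device(z_option=None, environ=None):
    """Return the device name from -z if given, otherwise from FSCONFIG."""
    env = os.environ if environ is None else environ
    if z_option:
        # Assumption: an explicit -z value overrides the environment.
        return z_option
    device = env.get("FSCONFIG")
    if not device:
        raise ValueError("no device: pass -z devicename or set FSCONFIG")
    return device
```

Passing the environment as a parameter keeps the function easy to test without modifying the real process environment.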
Repairing Partitioned Networks
A network partition exists if one or more machines cannot access the master machine. As the application administrator, you are responsible for detecting partitions and recovering from them. This section provides instructions for troubleshooting a partition, identifying its cause, and taking action to recover from it.
A network partition may be caused by the following:
- BRIDGE failure
The procedure you follow to recover from a partitioned network depends on the cause of the partition. Recovery procedures for these situations are provided in this section.
Detecting Partitioned Networks
There are several ways to detect a network partition:
- Check the user log (ULOG) for messages that may shed light on the origin of the problem.
Checking the ULOG
When things go wrong with the network, WLE or BEA TUXEDO system administrative servers start sending messages to the ULOG. If the ULOG is set up over a remote file system, all messages are written to the same log. In such a case you can run the tail(1) command on one file and check the failure messages displayed on the screen.
If, however, the remote file system is using the same network, the remote file system may no longer be available.
151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2
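A message of this form can also be picked out of the ULOG by a script. The following Python sketch extracts the machine name from each "BBL partitioned" message. It assumes the message text matches the sample line above exactly; the function name is hypothetical.

```python
import re

# Pattern based on the sample ULOG message shown above.
PARTITION_RE = re.compile(r"ERROR: BBL partitioned, machine=(\S+)")


def partitioned_machines(lines):
    """Return the machines reported as partitioned, in order of first report."""
    found = []
    for line in lines:
        match = PARTITION_RE.search(line)
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```

Applied to the sample line above, the function returns a list containing SITE2.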
Listing 23-1 provides an example of a tmadmin session in which information is being collected about a partitioned network, and a server and a service on that network. Three tmadmin commands are run:
- pnw (the printnetwork command)
- psr (the printserver command)
- psc (the printservice command)
Listing 23-1 Example of a tmadmin Session
$ tmadmin
> pnw SITE2
Could not retrieve status from SITE2
> psr -m SITE1
a.out Name Queue Name Grp Name ID Rq Done Load Done Current Service
BBL 30002.00000 SITE1 0 - - ( - )
DBBL 123456 SITE1 0 121 6050 MASTERBB
simpserv 00001.00001 GROUP1 1 - - ( - )
BRIDGE 16900672 SITE1 0 - - ( DEAD )
> psc -m SITE1
Service Name Routine Name a.out Grp Name ID Machine # Done Status
------------ ------------ -------- -------- -- ------- ------------
ADJUNCTADMIN ADJUNCTADMIN BBL SITE1 0 SITE1 - PART
ADJUNCTBB ADJUNCTBB BBL SITE1 0 SITE1 - PART
TOUPPER TOUPPER simpserv GROUP1 1 SITE1 - PART
BRIDGESVCNM BRIDGESVCNM BRIDGE SITE1 1 SITE1 - PART
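The psc output in Listing 23-1 marks each affected service with a status of PART. If the output of a tmadmin session is captured to a file, the affected services can be listed with a few lines of Python. This sketch assumes that the status is the last whitespace-separated field on each row and that service names contain no spaces; the function name is hypothetical.

```python
def partitioned_services(psc_output):
    """Return the names of services whose Status column reads PART."""
    names = []
    for line in psc_output.splitlines():
        fields = line.split()
        # Data rows have the service name first and the status last.
        if len(fields) >= 2 and fields[-1] == "PART":
            names.append(fields[0])
    return names
```

The header row and the row of dashes are skipped automatically because neither ends in PART.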
Restoring a Network Connection
This section provides instructions for recovering from transient and severe network failures.
Recovering from Transient Network Failures
Because the BRIDGE tries, automatically, to recover from any transient network failures and reconnects, transient network failures are usually not noticed. If, however, you do need to perform a manual recovery from a transient network failure, complete the following procedure.
1. Start a tmadmin(1) session.
2. Enter the reconnect command (rco), specifying the names of nonpartitioned and partitioned machines.
Perform the following steps to recover from severe network failure.
1. Start a tmadmin session.
2. Enter the pclean command, specifying the name of the partitioned machine:
pcl partitioned_machine
The procedure you follow to restore a failed machine depends on whether that machine was the master machine.
To restore a failed master machine, complete the following procedure.
1. Start a tmadmin session on the ACTING MASTER (SITE2):
tmadmin
2. Boot the BBL on the MASTER (SITE1) by entering the following command:
boot -B SITE1
The BBL will not boot if you have not executed pclean on SITE1.
3. In tmadmin, start a DBBL running again on the master site (SITE1) by entering the following:
MASTER
To restore a failed nonmaster machine, complete the following procedure.
1. Start a tmadmin session.
2. Run pclean, specifying the partitioned machine on the command line.
In Listing 23-2, SITE2, a nonmaster machine, is restored.
Listing 23-2 Example of Restoring a Failed Nonmaster Machine
$ tmadmin
tmadmin - Copyright © 1987-1990 AT&T; 1991-1993 USL. All rights reserved
> pclean SITE2
Cleaning the DBBL.
Pausing 10 seconds waiting for system to stabilize.
3 SITE2 servers removed from bulletin board
> boot -B SITE2
Booting admin processes ...
Exec BBL -A :
on SITE2 -> process id=22923 ... Started.
1 process started.
> q
To replace BEA TUXEDO system components, complete the following procedure.
To replace components of your application, complete the following procedure.
By default, the WLE or BEA TUXEDO system cleans up resources associated with dead processes (such as queues) and restarts restartable dead servers from the Bulletin Board (BB) at regular intervals during BBL scans. You may, however, request cleaning at other times.
To request an immediate cleanup of resources associated with dead processes, complete the following procedure.
1. Start a tmadmin session.
2. Enter the bbclean command.
The bbclean command takes one optional argument: the name of the machine to be cleaned.
To clean up other resources, complete the following procedure.
1. Start a tmadmin session.
2. Enter the pclean command, specifying the machine.
Note: You must specify a value for machine; it is a required argument.
If the Specified Machine Is . . . | Then . . . |
---|---|
Not partitioned |
|
Partitioned |
|
This command is useful for restoring order to a system after partitioning has occurred unexpectedly.
If a WLE application fails to boot, open the application's UBBCONFIG file with a text editor and check whether the servers are booted in the correct order in the SERVERS section. The following is the correct order in which to boot the servers on a WLE system. A WLE application will not boot if this order is not adhered to.
Boot the servers in the following order:
- The system EventBroker, TMSYSEVT.
- The TMFFNAME server with the -N option only, to start a slave NameManager service.
- The TMFFNAME server with the -F option, to start the FactoryFinder.
For a detailed example, see the section "Required Order in Which to Boot Servers (WLE Servers)" in Chapter 3, "Creating a Configuration File."
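A boot-order check of this kind can be automated once the SERVERS entries have been read out of the UBBCONFIG file. The Python sketch below covers only the entries named in this section (TMSYSEVT, then TMFFNAME with -N, then TMFFNAME with -F) and ignores every other server. The function names are hypothetical, and representing each entry as a (server name, option string) pair is an assumption; parsing the UBBCONFIG file itself is not shown.

```python
def boot_rank(server, options=""):
    """Return the position of a server in the required order, or None."""
    opts = options.split()
    if server == "TMSYSEVT":
        return 0
    if server == "TMFFNAME" and "-N" in opts:
        return 1  # NameManager
    if server == "TMFFNAME" and "-F" in opts:
        return 2  # FactoryFinder
    return None  # not covered by this sketch


def first_out_of_order(servers):
    """Return the index of the first entry that boots too late, or None."""
    highest = -1
    for index, (server, options) in enumerate(servers):
        rank = boot_rank(server, options)
        if rank is None:
            continue
        if rank < highest:
            return index
        highest = rank
    return None
```

A result of None means the entries covered here appear in an acceptable order; any other result is the position of the first entry to move earlier in the SERVERS section.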
Checking Hostname Format and Capitalization (WLE Servers)
The network address that is specified by programmers in the Bootstrap object constructor or in TOBJADDR must exactly match the network address in the server application's UBBCONFIG file. The format of the address as well as the capitalization must match. If the addresses do not match, the call to the Bootstrap object constructor will fail with a seemingly unrelated error message:
ERROR: Unofficial connection from client at
<tcp/ip address>/<port-number>:
For example, if the network address is specified as //TRIXIE:3500 in the ISL command line option string (in the server application's UBBCONFIG file), specifying either //192.12.4.6:3500 or //trixie:3500 in the Bootstrap object constructor or in TOBJADDR will cause the connection attempt to fail.
On UNIX systems, use the uname -n command on the host system to determine the capitalization used. On Windows NT systems, see the host system's Network control panel to determine the capitalization used.
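Because the comparison described in this section is exact, a mismatch in capitalization is easy to overlook. The following Python sketch compares a client-side address with the address from the UBBCONFIG file and reports which kind of mismatch, if any, is present. The function name and the wording of the result strings are hypothetical; the check mirrors the exact-match rule above and performs no name resolution.

```python
def check_address(client_address, server_address):
    """Compare two //host:port strings the way this section describes."""
    if client_address == server_address:
        return "match"
    if client_address.lower() == server_address.lower():
        # Same characters apart from case, for example TRIXIE versus trixie.
        return "capitalization differs"
    return "address differs"
```

With //TRIXIE:3500 as the server address, //trixie:3500 is reported as a capitalization difference and //192.12.4.6:3500 as a different address, which corresponds to the two failing cases in the example above.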
You may want to perform the following steps on a Windows NT server that is running a WLE application, if the following problem occurs: some Internet Inter-ORB Protocol (IIOP) clients boot, but some clients fail to create a Bootstrap object and return an InvalidDomain message, even though the //host:port address is correctly specified. (For related information, see the section "Checking Hostname Format and Capitalization (WLE Servers)" in this chapter.)
1. Start regedt32, the Registry Editor.
2. Go to the HKEY_LOCAL_MACHINE on Local Machine window.
3. Select the following key:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Afd\Parameters
4. Add the following values:
DynamicBacklogGrowthDelta: REG_DWORD: 0xa
EnableDynamicBacklog: REG_DWORD: 0x1
MaximumDynamicBacklog: REG_DWORD: 0x3e8
MinimumDynamicBacklog: REG_DWORD: 0x14
These values replace the static connection queue (that is, the backlog) of five pending connections with a dynamic connection backlog that has at least 20 entries (minimum 0x14) and at most 1000 entries (maximum 0x3e8), and that grows from the minimum to the maximum in steps of 10 (growth delta 0xa).
These settings apply only to connections that have been received by the system but have not yet been accepted by an IIOP Listener. The minimum value of 20 and the delta of 10 are recommended by Microsoft. The maximum value depends on the machine; however, Microsoft recommends that it not exceed 5000 on a Windows NT server.
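The hexadecimal registry values translate to decimal as follows. This short Python sketch converts them and lists the backlog sizes that the settings allow, assuming (as the description above states) that the backlog grows from the minimum to the maximum in fixed steps of the growth delta. The dictionary and function names are hypothetical.

```python
# Registry values from this section, written as the same hexadecimal literals.
SETTINGS = {
    "MinimumDynamicBacklog": 0x14,
    "MaximumDynamicBacklog": 0x3E8,
    "DynamicBacklogGrowthDelta": 0xA,
}


def backlog_sizes(minimum, maximum, delta):
    """Return every backlog size from minimum to maximum in steps of delta."""
    return list(range(minimum, maximum + 1, delta))


sizes = backlog_sizes(
    SETTINGS["MinimumDynamicBacklog"],
    SETTINGS["MaximumDynamicBacklog"],
    SETTINGS["DynamicBacklogGrowthDelta"],
)
# Smallest size, largest size, and number of growth steps between them.
print(sizes[0], sizes[-1], len(sizes) - 1)  # prints: 20 1000 98
```

With these values the backlog can grow 98 times, by 10 entries each time, before it reaches the maximum of 1000.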
This section provides instructions for aborting and committing transactions.
To abort a transaction, enter the following command.
aborttrans (abort) [-yes] [-g groupname] tranindex
To determine the value of tranindex, run the printtrans command (a tmadmin command).
This command is useful when the coordinating site is partitioned or when the client terminates before calling a commit or an abort. If the timeout is large, the transaction remains in the transaction table unless it is aborted.
Committing a Transaction
To commit a transaction, enter the following command.
committrans (commit) [-yes] [-g groupname] tranindex
groupname and tranindex are required arguments.
Cautions
Be careful about using this command. The only time you should need to run it is when both of the following conditions apply:
Also, a client may be blocked on tpcommit(), which will be timed out. If you are going to perform an administrative commit, be sure to inform this client.
Recovering from Failures When Transactions Are Used
When the application you are administering includes database transactions, you may need to apply an after-image journal (AIJ) to a restored database following a disk corruption failure. Or you may need to coordinate the timing of this recovery activity with your site's database administrator (DBA). Typically, the database management software automatically performs transaction rollback when an error occurs. When the disk containing database files has become permanently corrupt, however, you or the DBA may need to step in and perform the rollforward operation.
Assume that a disk containing portions of a database is corrupted at 3:00 P.M. on a Wednesday. For this example, assume that a shadow volume does not exist.
Refer to the documentation for the resource manager (database product) for specific instructions on the database rollforward process.