Other chapters of this document discuss many diagnostic tools provided by your BEA TUXEDO system: commands and log files that help you monitor a running system, identify potential problems while there is still time to prevent them, and detect error conditions once they have occurred. This chapter provides additional information to help you identify and recover from various system errors.
This chapter discusses the following topics:
The first step in troubleshooting is to determine the area in which the problem has occurred. In most applications, you must consider six possible sources of trouble:
To resolve the trouble in most of these areas, you must work with the appropriate administrator. If, for example, you determine that the trouble is being caused by a networking problem, you must work with the network administrator.
To detect the source of an application failure, complete the following steps.
1. Look at the warnings and error messages in the user log (ULOG).
2. Look at the warnings and errors generated by application clients and servers. These messages are usually written to the standard output and standard error files (named, by default, stdout and stderr, respectively).
   - The stdout and stderr files are located in $APPDIR.
   - The stdout and stderr files for your clients and servers may have been renamed. (You can rename the stdout and stderr files by specifying -o and -e, respectively, in the appropriate client and server definitions in your configuration file. For details, see the servopts(5) reference page in the BEA TUXEDO Reference Manual.)
3. Look for core dumps in $APPDIR. Use a debugger such as sdb to get a stack trace. If you find core dumps, notify the application developer.
4. Check the system activity reports (for example, by running the sar(1) command) to determine why your system is not functioning properly. Consider the following possible reasons:

To detect the source of a system failure, complete the following steps:
1. Look at the messages in the user log (ULOG):
   - TPEOS messages indicate errors in the operating system.
   - TPESYSTEM messages indicate errors in the BEA TUXEDO system.

TMDEBUG, BRDBG, NWDBG, and WSDBG.

To send an unsolicited message, enter the following command.
broadcast (bcst) [-m machine] [-u usrname] [-c cltname] [text]
By default, the message is sent to all clients. You have the choice, however, of limiting distribution to one of the following recipients:
- One machine (-m machine)
- One client group (-c client_group)
- One user (-u user)
The text may not include more than 80 characters. The system sends the message in a buffer of type STRING. This means that the client's unsolicited message handling function (specified by tpsetunsol()) must be able to handle a message of this type. The tptypes() function may be useful in this case.
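The 80-character limit and the recipient options can be checked before a command is typed into tmadmin. The following Python sketch is only an illustration: the helper name build_broadcast and its parameters are hypothetical, and passing the text as one double-quoted argument is an assumption. It composes a command line from the syntax shown above and rejects text that is too long.

```python
MAX_TEXT = 80  # limit stated above for the broadcast text


def build_broadcast(text, machine=None, usrname=None, cltname=None):
    """Compose a tmadmin broadcast command line (hypothetical helper)."""
    if len(text) > MAX_TEXT:
        raise ValueError(f"text is {len(text)} characters; the limit is {MAX_TEXT}")
    parts = ["broadcast"]
    # Option letters follow the syntax line: -m machine, -u usrname, -c cltname.
    for flag, value in (("-m", machine), ("-u", usrname), ("-c", cltname)):
        if value is not None:
            parts += [flag, value]
    # Assumption: the text is passed as a single double-quoted argument.
    parts.append(f'"{text}"')
    return " ".join(parts)


print(build_broadcast("System going down at 18:00", machine="SITE1"))
```

Omitting all three recipient arguments produces a command that, by default, reaches all clients.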
This section provides instructions for the following tasks that you may need to perform in the course of maintaining your file system:
Complete the following steps to create a device list.
1. Start a tmadmin session.
2. Enter the following command:

   crdl [-z devicename] [-b blocks]

   - devicename [devindx] is the desired device name. (Another way to assign a name to a new device is by setting the FSCONFIG environment variable to the desired device name.)
   - blocks is the number of blocks needed. The default value is 1000 pages.

Note: Because 35 blocks are needed for the administrative overhead associated with a TLOG, be sure to assign a value higher than 35 when you create a TLOG.
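The TLOG sizing rule in the note can be enforced before the command is issued. This Python sketch is a hedged illustration only: build_crdl, its parameters, and the device path are hypothetical, and the default of 1000 simply mirrors the default stated above.

```python
TLOG_OVERHEAD_BLOCKS = 35  # administrative overhead stated in the note above


def build_crdl(devicename, blocks=1000, for_tlog=False):
    """Compose a crdl command line (hypothetical helper)."""
    if for_tlog and blocks <= TLOG_OVERHEAD_BLOCKS:
        raise ValueError("a TLOG device needs more than 35 blocks")
    return f"crdl -z {devicename} -b {blocks}"


# The device path below is a made-up example.
print(build_crdl("/home/tuxedo/TLOG", blocks=200, for_tlog=True))
```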
To destroy a device list with index devindx, enter the following command.

dsdl [-z devicename] [-yes] [devindx]

- To specify the device, use the -z option (as shown here), or set FSCONFIG to the device name.
- If you include the -yes option on the command line, you will not be prompted to confirm your intention to destroy the file before the file is actually destroyed.
- devindx is the index to the file to be destroyed.

To reinitialize a device on a device list, enter the following command.
initdl [-z devicename] [-yes] devindx

- To specify the device, use the -z option (as shown here), or set FSCONFIG to the device name.
- If you include the -yes option on the command line, you will not be prompted to confirm your intention before the device is reinitialized.
- devindx is the index to the device to be reinitialized.

To print a universal device list (UDL), enter the following command.
lidl
To specify the device from which you want to obtain the UDL, you have a choice of two methods:
- Specify it on the lidl command line: -z devicename [devindx]
- Set the environment variable FSCONFIG to the name of the desired device.

To get information about all volume table of contents (VTOC) table entries, enter the following command.
livtoc
To specify the device from which you want to obtain the VTOC, you have a choice of two methods:
- Specify it on the livtoc command line: -z devicename [devindx]
- Set the environment variable FSCONFIG to the name of the desired device.

A network partition exists if one or more machines cannot access the master machine. As the application administrator, you are responsible for detecting partitions and recovering from them. This section provides instructions for troubleshooting a partition, identifying its cause, and taking action to recover from it.
A network partition may be caused by the following:
- BRIDGE failure

The procedure you follow to recover from a partitioned network depends on the cause of the partition. Recovery procedures for these situations are provided in this section.
There are several ways to detect a network partition:
- Check the user log (ULOG) for messages that may shed light on the origin of the problem.
- Gather information by running the tmadmin commands provided for this purpose.

When things go wrong with the network, BEA TUXEDO system administrative servers start sending messages to the ULOG. If the ULOG is set up over a remote file system, all messages are written to the same log. In such a case you can run the tail(1) command on one file and check the failure messages displayed on the screen. If, however, the remote file system is using the same network, the remote file system may no longer be available.
151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2
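A message like the one above can be picked out of a ULOG mechanically. The Python sketch below is a minimal illustration: the field layout (time, host, process, process ID) is inferred from this single sample line and is an assumption, and partitioned_machines is a hypothetical helper name.

```python
import re

# Assumed layout, inferred from the sample line:
#   time.host!process.pid: ... BBL partitioned, machine=NAME
PARTITION_RE = re.compile(
    r"^(?P<time>\d{6})\.(?P<host>[^!]+)!(?P<proc>[^.]+)\.(?P<pid>\d+):"
    r".*BBL partitioned, machine=(?P<machine>\S+)"
)


def partitioned_machines(lines):
    """Return machine names reported as partitioned, in order of first appearance."""
    seen = []
    for line in lines:
        match = PARTITION_RE.search(line)
        if match and match.group("machine") not in seen:
            seen.append(match.group("machine"))
    return seen


sample = [
    "151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2",
]
print(partitioned_machines(sample))
```

The same function can be fed the lines of a log file opened for reading, provided the file is still reachable over the network.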
The following listing shows a tmadmin session in which information is being collected about a partitioned network, and a server and a service on that network. Three tmadmin commands are run:

- pnw (the printnetwork command)
- psr (the printserver command)
- psc (the printservice command)

Listing 19-1 Example of a tmadmin Session
    $ tmadmin
    > pnw SITE2
    Could not retrieve status from SITE2

    > psr -m SITE1
    a.out Name  Queue Name   Grp Name  ID  Rq Done  Load Done  Current Service
    BBL         30002.00000  SITE1      0        -          -  ( - )
    DBBL        123456       SITE1      0      121       6050  MASTERBB
    simpserv    00001.00001  GROUP1     1        -          -  ( - )
    BRIDGE      16900672     SITE1      0        -          -  ( DEAD )

    > psc -m SITE1
    Service Name Routine Name a.out    Grp Name ID Machine # Done Status
    ------------ ------------ -------- -------- -- ------- ------------
    ADJUNCTADMIN ADJUNCTADMIN BBL      SITE1     0 SITE1        - PART
    ADJUNCTBB    ADJUNCTBB    BBL      SITE1     0 SITE1        - PART
    TOUPPER      TOUPPER      simpserv GROUP1    1 SITE1        - PART
    BRIDGESVCNM  BRIDGESVCNM  BRIDGE   SITE1     1 SITE1        - PART
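The PART status in the psc output of Listing 19-1 is what marks a service as partitioned. The following Python sketch shows one way to pull those service names out of captured output. It assumes the eight whitespace-separated columns seen in the listing; partitioned_services is a hypothetical helper, not part of tmadmin.

```python
def partitioned_services(psc_output):
    """Return service names whose last column (Status) is PART."""
    names = []
    for line in psc_output.splitlines():
        fields = line.split()
        # Data rows in the listing have eight whitespace-separated fields;
        # the header and the dashed rule do not match this test.
        if len(fields) == 8 and fields[-1] == "PART":
            names.append(fields[0])
    return names


sample = """\
Service Name Routine Name a.out Grp Name ID Machine # Done Status
------------ ------------ -------- -------- -- ------- ------------
ADJUNCTADMIN ADJUNCTADMIN BBL SITE1 0 SITE1 - PART
TOUPPER TOUPPER simpserv GROUP1 1 SITE1 - PART
"""
print(partitioned_services(sample))
```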
This section provides instructions for recovering from transient and severe network failures.
Because the BRIDGE automatically tries to recover from transient network failures and reconnect, transient network failures usually go unnoticed. If, however, you do need to perform a manual recovery from a transient network failure, complete the following procedure.

1. Start a tmadmin(1) session.
2. Run the reconnect command (rco), specifying the names of the nonpartitioned and partitioned machines:

   rco non-partitioned_node1 partitioned_node2
Perform the following steps to recover from severe network failure.
1. Start a tmadmin session.
2. Run the pclean command, specifying the name of the partitioned machine:

   pcl partitioned_machine
The procedure you follow to restore a failed machine depends on whether that machine was the master machine.
To restore a failed master machine, complete the following procedure.
1. Start a tmadmin session on the ACTING MASTER (SITE2):

   tmadmin

2. Boot the BBL on the MASTER (SITE1) by entering the following command:

   boot -B SITE1

   The BBL will not boot if you have not executed pclean on SITE1.

3. In tmadmin, start a DBBL running again on the master site (SITE1) by entering the following:

   MASTER
To restore a failed nonmaster machine, complete the following procedure.
1. Start a tmadmin session.
2. Run pclean, specifying the partitioned machine on the command line.

In Listing 19-2, SITE2, a nonmaster machine, is restored.
Listing 19-2 Example of Restoring a Failed Nonmaster Machine
    $ tmadmin
    tmadmin - Copyright © 1987-1990 AT&T; 1991-1993 USL. All rights reserved

    > pclean SITE2
    Cleaning the DBBL.
    Pausing 10 seconds waiting for system to stabilize.
    3 SITE2 servers removed from bulletin board

    > boot -B SITE2
    Booting admin processes ...
    Exec BBL -A :
            on SITE2 -> process id=22923 ... Started.
    1 process started.
    > q
To replace BEA TUXEDO system components, complete the following procedure.
To replace components of your application, complete the following procedure.
By default, the BEA TUXEDO system cleans up resources associated with dead processes (such as queues) and restarts restartable dead servers from the Bulletin Board (BB) at regular intervals during BBL scans. You may, however, request cleaning at other times.
To request an immediate cleanup of resources associated with dead processes, complete the following procedure.
1. Start a tmadmin session.
2. Enter bbclean machine.

The bbclean command takes one optional argument: the name of the machine to be cleaned.
To clean up other resources, complete the following procedure.
1. Start a tmadmin session.
2. Enter pclean machine.

Note: You must specify a value for machine; it is a required argument.
If the Specified Machine Is . . . | Then . . . |
---|---|
Not partitioned | pclean will invoke bbclean. |
Partitioned | pclean will remove all entries for servers and services from all nonpartitioned Bulletin Boards. |
This command is useful for restoring order to a system after partitioning has occurred unexpectedly.
This section provides instructions for aborting and committing transactions.
To abort a transaction, enter the following command.
aborttrans (abort) [-yes] [-g groupname] tranindex

- To determine the value of tranindex, run the printtrans command (a tmadmin command).
- If groupname is specified, a message is sent to the TMS of that group to mark as "aborted" the transaction for that group. If a group is not specified, a message is sent, instead, to the coordinating TMS, requesting an abort of the transaction. You must send abort messages to all groups in the transaction to control the abort.

This command is useful when the coordinating site is partitioned or when the client terminates before calling a commit or an abort. If the timeout is large, the transaction remains in the transaction table unless it is aborted.
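Because an abort must reach every group in the transaction, it can help to write out one command per group before entering them. The Python sketch below is an illustration under stated assumptions: abort_commands is a hypothetical helper, the group names and the index 0 are made up, and -yes is assumed to suppress the confirmation prompt as it does for dsdl.

```python
def abort_commands(tranindex, groups):
    """Build one aborttrans command per participating group (hypothetical helper)."""
    if not groups:
        # With no group given, the request goes to the coordinating TMS.
        return [f"aborttrans -yes {tranindex}"]
    return [f"aborttrans -yes -g {group} {tranindex}" for group in groups]


# Group names and transaction index are illustrative only; take the real
# index from printtrans.
for command in abort_commands(0, ["GROUP1", "GROUP2"]):
    print(command)
```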
To commit a transaction, enter the following command.
committrans (commit) [-yes] -g groupname tranindex

groupname and tranindex are required arguments.

Be careful about using this command. The only time you should need to run it is when both of the following conditions apply:
Also, a client may be blocked on tpcommit(), which will be timed out. If you are going to perform an administrative commit, be sure to inform this client.
When the application you are administering includes database transactions, you may need to apply an after-image journal (AIJ) to a restored database following a disk corruption failure. Or you may need to coordinate the timing of this recovery activity with your site's database administrator (DBA). Typically, the database management software automatically performs transaction rollback when an error occurs. When the disk containing database files has become permanently corrupt, however, you or the DBA may need to step in and perform the rollforward operation.
Assume that a disk containing portions of a database is corrupted at 3:00 P.M. on a Wednesday. For this example, assume that a shadow volume does not exist.
Refer to the documentation for the resource manager (database product) for specific instructions on the database rollforward process.