Troubleshooting an Oracle Tuxedo Application

This topic includes the following sections:

• Determining Types of Failures

• How to Broadcast an Unsolicited Message

• Maintaining Your System Files

• Recovery Considerations

• Repairing Partitioned Networks

• Restoring Failed Machines

• How to Replace System Components

• How to Replace Application Components

• Cleaning Up and Restarting Servers Manually

• Aborting or Committing Transactions

• How to Recover from Failures When Transactions Are Used

• How to Use the IPC Tool When an Application Fails to Shut Down Properly

• Troubleshooting Multithreaded/Multicontexted Applications

Determining Types of Failures

The first step in troubleshooting is determining problem areas. In most applications you must consider six possible sources of trouble:

• Application

• Oracle Tuxedo system

• Database management software

• Network

• Operating system

• Hardware

Once you have determined the problem area, you must then work with the appropriate administrator to resolve the problem. If, for example, you determine that the trouble is caused by a networking problem, you must work with the network administrator.

How to Determine the Cause of an Application Failure

The following steps will help you detect the source of an application failure.

1. Check any Oracle Tuxedo system warnings and error messages in the user log (ULOG).

2. Select the messages that most likely reflect the current problem. Note the catalog name and number of each message so that you can look it up in System Messages. The manual entry provides:

• Details about the error condition indicated by the message

• Recommendations for recovery actions

3. Check any application warnings and error messages in the ULOG.

4. Check any warnings and errors generated by application servers and clients. Such messages are usually sent to the standard output and standard error files (named, by default, stdout and stderr, respectively).

• The stdout and stderr files are located in the directory defined by the APPDIR variable.

• The stdout and stderr files for your clients and servers may have been renamed. (You can rename the stdout and stderr files by specifying -o and -e, respectively, in the appropriate client and server definitions in your configuration file. For details, see servopts(5) in the File Formats, Data Descriptions, MIBs, and System Processes Reference.)

5. Look for core dumps in the directory defined by the APPDIR variable. If you find any, use a debugger such as dbx to get a stack trace, and notify your application developer.

6. Check your system activity reports (for example, by running the sar(1) command) to determine why your system is not functioning properly. Consider the following possible causes:

• The system may be running out of memory.

• The kernel might not be tuned correctly.
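Steps 1 and 2 can be partly automated. The following Python sketch pulls the catalog name and message number out of warning and error lines so that they can be looked up in System Messages. It assumes a ULOG line layout of time.host!process: CATALOG:number: SEVERITY: text, which this topic does not spell out, and the sample lines (including the message numbers) are made up for illustration.

```python
import re

# Assumed ULOG line layout (not defined in this topic):
#   time.host!process.pid: CATALOG:number: SEVERITY: text
ULOG_LINE = re.compile(
    r"^(?P<time>\d{6})\.(?P<host>[^!]+)!(?P<proc>[^:]+): "
    r"(?P<catalog>\w+):(?P<number>\d+): "
    r"(?P<severity>ERROR|WARN): (?P<text>.*)$"
)


def catalog_refs(lines):
    """Return (catalog, number, severity, text) for each warning or error line."""
    refs = []
    for line in lines:
        match = ULOG_LINE.match(line.strip())
        if match:
            refs.append((match["catalog"], int(match["number"]),
                         match["severity"], match["text"]))
    return refs


# Hypothetical sample lines; the numbers and texts are placeholders.
sample = [
    "151804.gumby!simpserv.28446: LIBTUX_CAT:262: ERROR: example error text",
    "151805.gumby!simpserv.28446: informational line without a catalog",
    "151806.gumby!BBL.28001: CMDTUX_CAT:100: WARN: example warning text",
]

for ref in catalog_refs(sample):
    print(ref)
```

Lines that do not match the assumed layout are skipped silently, so adjust the pattern if your ULOG entries carry additional fields.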

How to Determine the Cause of an Oracle Tuxedo System Failure

The following steps will help you detect the source of a system failure.

1. Check any Oracle Tuxedo system warnings and error messages in the user log (ULOG):

• TPEOS messages indicate errors in the operating system.

• TPESYSTEM messages indicate errors in the Oracle Tuxedo system.

2. Select the messages that most likely reflect the current problem. Note the catalog name and number of each message so that you can look it up in System Messages. The manual entry provides:

• Details about the error condition flagged by the message

• Recommendations for recovery actions

3. Prepare for debugging in the following ways:

• Shut down the suspect server.

• Run tmboot -n -d1 -s server, where server is the name of the suspect server. This does not boot the server; it prints the command line that the Oracle Tuxedo system would use to boot it. Use that command line with a debugger such as dbx.

How to Broadcast an Unsolicited Message

The EventBroker enhances troubleshooting by providing a system-wide summary of events and a mechanism whereby an event triggers notification. The EventBroker provides details about Oracle Tuxedo system events, such as servers dying and networks failing, and about application events, such as an ATM machine running out of money. An Oracle Tuxedo client that receives unsolicited notification of an event can name a service routine to be invoked, or name an application queue in which data should be stored for later processing. An Oracle Tuxedo server that receives unsolicited notification can specify a service request or name an application queue in which to store data.

1. To send an unsolicited message, enter the following command in a tmadmin session:

broadcast (bcst) [-m machine] [-u usrname] [-c cltname] [text]

Note:

By default, the message is sent to all clients.

2. You can limit distribution to one of the following recipients:

• One machine (-m machine)

• One client name (-c cltname)

• One user (-u usrname)

The text may not include more than 80 characters. The system sends the message in a STRING type buffer, which means the client’s unsolicited message handling function (specified by tpsetunsol()) must be able to handle this type of message. The tptypes() function may be useful in this case.
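If broadcast commands are generated by a script, the 80-character limit can be checked before the command reaches tmadmin. The following Python sketch builds a bcst command line from the options shown above. The function name is hypothetical, and wrapping the text in double quotes is an assumption about how tmadmin expects text with embedded blanks.

```python
MAX_TEXT = 80  # character limit stated above


def bcst_command(text, machine=None, user=None, client=None):
    """Build a tmadmin bcst command line, rejecting text that is too long."""
    if len(text) > MAX_TEXT:
        raise ValueError(
            f"text is {len(text)} characters; the limit is {MAX_TEXT}")
    parts = ["bcst"]
    if machine:
        parts += ["-m", machine]
    if user:
        parts += ["-u", user]
    if client:
        parts += ["-c", client]
    # Assumption: text with embedded blanks is wrapped in double quotes.
    parts.append(f'"{text}"')
    return " ".join(parts)


print(bcst_command("System going down at 17:00", machine="SITE2"))
```

Rejecting over-long text up front avoids sending a message that the system would not accept in full.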

See Also

•

“Unsolicited Communication” in Introducing Oracle Tuxedo ATMI

•

“Managing Events Using EventBroker” in Introducing Oracle Tuxedo ATMI

Maintaining Your System Files

Periodically, you may need to perform the following tasks to maintain your file system:

•

Print the Universal Device List

•

Print VTOC information

•

Reinitialize a device

•

Create a device list

•

Destroy a device list

Note:

This file format is used for TUXCONFIG, TLOG, and /Q.

How to Print the Universal Device List (UDL)

To print a UDL, complete the following procedure:

1.

Run tmadmin -c.

2.

Enter the following command:

lidl

3.

To specify the device from which you want to obtain the UDL, you have a choice of two methods:

•

Specify the device on the lidl command line:

-z device_name [devindx]

•

Set the environment variable FSCONFIG to the name of the desired device.

How to Print VTOC Information

To print VTOC information, complete the following procedure.

1.

Run tmadmin -c.

2.

To get information about all VTOC table entries, enter the following command:

livtoc

3.

To specify the device from which you want to obtain the VTOC, you have a choice of two methods:

•

Specify the following on the livtoc command line:

-z device_name [devindx]

•

Set the environment variable FSCONFIG to the name of the desired device.

How to Reinitialize a Device

To reinitialize a device that is included on a device list, complete the following procedure.

1.

Run tmadmin -c.

2.

Enter the following command:

initdl [-z devicename] [-yes] devindx

Note:

The value of devindx is the index of the device to be reinitialized. Reinitializing a device destroys the files stored on it.

3.

You can specify the device by:

•

Entering its name after the -z option (as shown here), or

•

Setting the environment variable FSCONFIG to the device name

4.

If you include the -yes option on the command line, you are not prompted to confirm your intention to destroy the file before the file is actually destroyed.

How to Create a Device List

To create a device list, complete the following procedure.

1.

Run tmadmin -c.

2.

Enter the following command:

crdl [-z devicename] [-b blocks]

•

The value of devicename is the desired device name. (Another way to assign a name to a new device is by setting the FSCONFIG environment variable to the desired device name.)

•

The value of blocks is the number of blocks needed. The default is 1000 blocks.

Note:

Because 35 blocks are needed for the administrative overhead associated with a TLOG, be sure to assign a value higher than 35 when you create a TLOG.
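The block-count rule in the note above is easy to overlook when device lists are created from scripts. The following Python sketch composes a crdl command and rejects a TLOG size that does not exceed the 35-block overhead. The function name, the for_tlog parameter, and the device path are hypothetical.

```python
TLOG_OVERHEAD_BLOCKS = 35  # administrative overhead stated in the note above
DEFAULT_BLOCKS = 1000      # default stated above


def crdl_command(device, blocks=DEFAULT_BLOCKS, for_tlog=False):
    """Build a tmadmin crdl command line after validating the block count."""
    if blocks <= 0:
        raise ValueError("blocks must be a positive number")
    if for_tlog and blocks <= TLOG_OVERHEAD_BLOCKS:
        raise ValueError(
            f"a TLOG needs more than {TLOG_OVERHEAD_BLOCKS} blocks")
    return f"crdl -z {device} -b {blocks}"


# Hypothetical device path, used only for illustration.
print(crdl_command("/home/tux/TLOG", blocks=200, for_tlog=True))
```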

How to Destroy a Device List

To destroy a device list with index devindx, complete the following procedure.

1.

Run tmadmin -c.

2.

Enter the following command:

dsdl [-z devicename] [-yes] [devindx]

Note:

The value of devindx is the index to the file to be destroyed.

3.

You can specify the device by:

•

Entering its name after the -z option (as shown here), or

•

Setting the environment variable FSCONFIG to the device name

4.

If you include the -yes option on the command line, you are not prompted to confirm your intention to destroy the file before the file is actually destroyed.

Recovery Considerations

The Oracle Tuxedo system requires a certain level of environmental stability to provide optimum functionality. Although the Oracle Tuxedo administrative subsystem offers extensive capabilities for recovering from network, machine, and application process failures, it is not invulnerable. You should be aware of the following aspects of how an Oracle Tuxedo system works.

Application clients and servers that use the FASTPATH model of SYSTEM_ACCESS (the default) have direct memory access to the Oracle Tuxedo shared data structures. Using the FASTPATH model helps the Oracle Tuxedo system achieve high performance. The Oracle Tuxedo system uses the interprocess communication (IPC) and file system facilities provided by the operating system.

If an application accidentally uses these facilities to write into the Oracle Tuxedo shared memory or to an Oracle Tuxedo file descriptor, or if it mistakenly uses any other Oracle Tuxedo system resource, data may become corrupted, Oracle Tuxedo functionality may be compromised, or an application may be brought down.

It is inappropriate for a user or administrator to directly terminate application clients, application servers, or Oracle Tuxedo administrative processes because these processes may be executing within a critical section (that is, updating shared information in shared memory). Interrupting a critical section during a memory update could potentially cause inconsistent internal data structures. (This is characteristic not only of the Oracle Tuxedo system, but of any system in which shared data is used.) Error messages in the Oracle Tuxedo userlog that refer to locks or semaphores may indicate that such corruption has occurred.

For maximum application availability, you can take advantage of the Oracle Tuxedo system’s facilities for managing redundancy, such as its multiple server, machine, and domain facilities. Distributing an application’s functionality allows continued operation if a failure occurs in one area.

Repairing Partitioned Networks

This topic provides instructions for troubleshooting a partition, identifying its cause, and taking action to recover from it. A network partition exists if one or more machines cannot access the MASTER machine. As the application administrator, you are responsible for detecting partitions and recovering from them.

A network partition may be caused by any of the following failures:

•

A network failure—either a transient failure, which corrects itself in minutes, or a severe failure, which requires you to take the partitioned machine out of the network

•

A machine failure on either the MASTER machine or the nonmaster machine

•

A BRIDGE failure

The procedure you follow to recover from a partitioned network depends on the cause of the partition.

Detecting a Partitioned Network

You can detect a network partition in one of the following ways:

•

Check the user log (ULOG) for messages that may shed light on the origin of the problem.

•

Gather information about the network, server, and service, by running the tmadmin commands provided for this purpose.

How to Check the ULOG

When problems occur with the network, Oracle Tuxedo system administrative servers start sending messages to the ULOG. If the ULOG is set up over a remote file system, all messages are written to the same log. In this scenario, you can run the tail(1) command on one file and check the failure messages displayed on the screen.

If, however, the remote file system is using the network in which the problem has occurred, the remote file system may no longer be available.

Listing 9‑1 Example of a ULOG Error Message

151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2
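A ULOG can be scanned for this message automatically. The following Python sketch extracts the machine names from lines that contain the "BBL partitioned" text shown in Listing 9‑1. It assumes that other partition messages follow the same wording, which this topic does not confirm.

```python
import re

# Pattern taken from the message text in Listing 9-1.
PARTITION = re.compile(r"ERROR: BBL partitioned, machine=(?P<machine>\S+)")


def partitioned_machines(lines):
    """Return the machines reported as partitioned, in order of first mention."""
    machines = []
    for line in lines:
        match = PARTITION.search(line)
        if match and match["machine"] not in machines:
            machines.append(match["machine"])
    return machines


sample = [
    "151804.gumby!DBBL.28446: ... : ERROR: BBL partitioned, machine=SITE2",
]
print(partitioned_machines(sample))
```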

How to Gather Information About the Network, Server, and Service

The following is an example of a tmadmin session in which information is being collected about a partitioned network, a server, and a service on that network. Three tmadmin commands are run:

•

pnw (the printnetwork command)

•

psr (the printserver command)

•

psc (the printservice command)

Listing 9‑2 Example tmadmin Session

$ tmadmin
> pnw SITE2
Could not retrieve status from SITE2

> psr -m SITE1
a.out Name  Queue Name   Grp Name  ID  Rq Done  Load Done  Current Service
BBL         30002.00000  SITE1      0        -          -  ( - )
DBBL        123456       SITE1      0      121       6050  MASTERBB
simpserv    00001.00001  GROUP1     1        -          -  ( - )
BRIDGE      16900672     SITE1      0        -          -  ( DEAD )

> psc -m SITE1
Service Name  Routine Name  a.out     Grp Name  ID  Machine  # Done  Status
------------  ------------  --------  --------  --  -------  --------------
ADJUNCTADMIN  ADJUNCTADMIN  BBL       SITE1      0  SITE1         -  PART
ADJUNCTBB     ADJUNCTBB     BBL       SITE1      0  SITE1         -  PART
TOUPPER       TOUPPER       simpserv  GROUP1     1  SITE1         -  PART
BRIDGESVCNM   BRIDGESVCNM   BRIDGE    SITE1      1  SITE1         -  PART
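Output like that in Listing 9‑2 can be filtered for the two signals that matter here: servers marked DEAD and services marked PART. The following Python sketch does this by splitting each line on white space. It assumes the status is always the last column, as in the listing, and the function names are hypothetical.

```python
def dead_servers(psr_lines):
    """Return the names of servers whose psr line ends with ( DEAD )."""
    return [line.split()[0] for line in psr_lines
            if line.rstrip().endswith("( DEAD )")]


def partitioned_services(psc_lines):
    """Return the names of services whose last psc column is PART."""
    names = []
    for line in psc_lines:
        fields = line.split()
        if len(fields) >= 2 and fields[-1] == "PART":
            names.append(fields[0])
    return names


# Lines copied from Listing 9-2.
psr_output = [
    "BBL 30002.00000 SITE1 0 - - ( - )",
    "DBBL 123456 SITE1 0 121 6050 MASTERBB",
    "simpserv 00001.00001 GROUP1 1 - - ( - )",
    "BRIDGE 16900672 SITE1 0 - - ( DEAD )",
]
psc_output = [
    "Service Name Routine Name a.out Grp Name ID Machine # Done Status",
    "ADJUNCTADMIN ADJUNCTADMIN BBL SITE1 0 SITE1 - PART",
    "TOUPPER TOUPPER simpserv GROUP1 1 SITE1 - PART",
]

print(dead_servers(psr_output))
print(partitioned_services(psc_output))
```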

Restoring a Network Connection

This topic provides instructions for recovering from transient and severe network failures.

How to Recover from Transient Network Failures

Because the BRIDGE tries, automatically, to recover from any transient network failures and reconnect, transient network failures are usually not noticed. If, however, you need to perform a manual recovery from a transient network failure, complete the following procedure.

1.

On the MASTER machine, start a tmadmin(1) session.

2.

Run the reconnect command (rco), specifying the names of nonpartitioned and partitioned machines:

rco non-partitioned_node1 partitioned_node2

How to Recover from Severe Network Failures

To recover from severe network failure, complete the following procedure.

1.

On the MASTER machine, start a tmadmin session.

2.

Run the pclean command, specifying the name of the partitioned machine:

pcl partitioned_machine

3.

Migrate the application servers or, once the problem has been corrected, reboot the machine.

Restoring Failed Machines

The procedure you follow to restore a failed machine depends on whether that machine was the MASTER machine.

How to Restore a Failed MASTER Machine

To restore a failed MASTER machine, complete the following procedure.

1.

Make sure that all IPC resources for the Oracle Tuxedo processes are removed.

2.

Start a tmadmin session on the ACTING MASTER (SITE2):

tmadmin

3.

Boot the BBL on the MASTER (SITE1) by entering the following command:

boot -B SITE1

(The BBL does not boot if you have not executed pclean on SITE1.)

4.

Still in tmadmin, start a DBBL running again on the MASTER site (SITE1) by entering the following:

MASTER

5.

If you have migrated application servers and data off the failed machine, boot them or migrate them back.

How to Restore a Failed Nonmaster Machine

To restore a failed nonmaster machine, complete the following procedure.

1.

On the MASTER machine, start a tmadmin session.

2.

Run pclean, specifying the partitioned machine on the command line.

3.

Fix the machine problem.

4.

Restore the failed machine by booting the Bulletin Board Liaison (BBL) for the machine from the MASTER machine.

5.

If you have migrated application servers and data from the failed machine, boot them or migrate them back.

In the following listing, SITE2, a nonmaster machine, is restored.

Listing 9‑3 Example of Restoring a Failed Nonmaster Machine

$ tmadmin
tmadmin - Copyright © 1987-1990 AT&T; 1991-1993 USL. All rights reserved

> pclean SITE2
Cleaning the DBBL.

Pausing 10 seconds waiting for system to stabilize.
3 SITE2 servers removed from bulletin board

> boot -B SITE2
Booting admin processes ...

Exec BBL -A :

on SITE2 -> process id=22923 ... Started.
1 process started.
> q

How to Replace System Components

To replace Oracle Tuxedo system components, complete the following procedure.

1.

Install the Oracle Tuxedo system software that is being replaced.

2.

Shut down those parts of the application that will be affected by the changes:

•

The Oracle Tuxedo system servers may need to be shut down if libraries are being updated.

•

Application clients and servers must be shut down and rebuilt if relevant Oracle Tuxedo system header files or static libraries are being replaced. (Application clients and servers do not need to be rebuilt if the Oracle Tuxedo system message catalogs, system commands, administrative servers, or shared objects are being replaced.)

3.

If relevant Oracle Tuxedo system header files and static libraries have been replaced, rebuild your application clients and servers.

4.

Reboot the parts of the application that you shut down.

How to Replace Application Components

To replace components of your application, complete the following procedure.

1.

Install the application software. This software may consist of application clients, application servers, and various administrative files, such as the FML field tables.

2.

Shut down the application servers being replaced.

3.

If necessary, build the new application servers.

4.

Boot the new application servers.

Cleaning Up and Restarting Servers Manually

By default, the Oracle Tuxedo system cleans up resources associated with dead processes (such as queues) and restarts restartable dead servers from the Bulletin Board (BB) at regular intervals during BBL scans. You may, however, request cleaning at other times.

How to Clean Up Resources Associated with Dead Processes

To request an immediate cleanup of resources associated with dead processes, complete the following procedure.

1.

Start a tmadmin session.

2.

Enter bbclean machine.

The bbclean command takes one optional argument: the name of the machine to be cleaned.

If You Specify...	Then...
No machine	The resources on the default machine are cleaned.
A machine	The resources on the specified machine are cleaned.
DBBL	The resources on the Distinguished Bulletin Board Liaison (DBBL) and the bulletin boards at all sites are cleaned.

How to Clean Up Other Resources

To clean up other resources, complete the following procedure.

1.

Start a tmadmin session.

2.

Enter pclean machine.

Note:

You must specify a value for machine; it is a required argument.

If the Specified Machine Is	Then
Not partitioned	pclean will invoke bbclean.
Partitioned	pclean will remove all entries for servers and services from all nonpartitioned bulletin boards.

This command is useful for restoring order to a system after partitioning has occurred unexpectedly.

How to Check the Order in Which Oracle Tuxedo CORBA Servers Are Booted

If an Oracle Tuxedo CORBA application fails to boot, open the application’s UBBCONFIG file with a text editor and check whether the servers are listed in the correct boot order in the SERVERS section. An Oracle Tuxedo CORBA application will not boot unless the servers are booted in the order given below.

Boot the servers in the following order:

1.

The system EventBroker, TMSYSEVT.

2.

The TMFFNAME server with the -N option and the -M option, which starts the NameManager service (as a MASTER). This service maintains a mapping of application-supplied names to object references.

3.

The TMFFNAME server with the -N option only, to start a slave NameManager service.

4.

The TMFFNAME server with the -F option, to start the FactoryFinder.

5.

The application servers that are advertising factories.

For a detailed example, see the section “Required Order in Which to Boot CORBA C++ Servers” in Setting Up an Oracle Tuxedo Application.
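The ordering rule can also be checked mechanically. The following Python sketch ranks each server according to the five steps above and reports the first entry that appears too early or too late. The input is a simplified list of (server name, option string) pairs rather than a parsed UBBCONFIG file; that representation, the option strings, and the server name simple_server are assumptions made for illustration.

```python
def boot_rank(server, clopt):
    """Map a server entry to its position (1 to 5) in the required boot order."""
    options = clopt.split()
    if server == "TMSYSEVT":
        return 1
    if server == "TMFFNAME":
        if "-N" in options and "-M" in options:
            return 2  # master NameManager
        if "-N" in options:
            return 3  # slave NameManager
        if "-F" in options:
            return 4  # FactoryFinder
    return 5          # treated as an application server


def first_out_of_order(servers):
    """Return the index of the first entry that breaks the order, or None."""
    highest = 0
    for index, (server, clopt) in enumerate(servers):
        rank = boot_rank(server, clopt)
        if rank < highest:
            return index
        highest = max(highest, rank)
    return None


# Hypothetical SERVERS entries, reduced to name and option string.
servers = [
    ("TMSYSEVT", ""),
    ("TMFFNAME", "-A -- -N -M"),
    ("TMFFNAME", "-A -- -N"),
    ("TMFFNAME", "-A -- -F"),
    ("simple_server", ""),
]
print(first_out_of_order(servers))
```

A result of None means that no entry was found out of order under these simplified rules; any server the sketch does not recognize is treated as an application server.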

How to Check the Hostname Format and Capitalization of Oracle Tuxedo CORBA Servers

The network address that is specified by programmers in the Bootstrap object constructor or in TOBJADDR must exactly match the network address in the server application’s UBBCONFIG file. The format of the address as well as the capitalization must match. If the addresses do not match, the call to the Bootstrap object constructor will fail with a seemingly unrelated error message:

ERROR: Unofficial connection from client at
<tcp/ip address>/<port-number>:

For example, if the network address is specified as //TRIXIE:3500 in the ISL command-line option string (in the server application’s UBBCONFIG file), specifying either //192.12.4.6:3500 or //trixie:3500 in the Bootstrap object constructor or in TOBJADDR will cause the connection attempt to fail.

On UNIX systems, use the uname -n command on the host system to determine the capitalization used. On Windows systems, see the host system’s Network control panel to determine the capitalization used.
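Because the comparison is exact, a mismatch in capitalization is easy to miss by eye. The following Python sketch compares a client-side address with the address from the ISL command line and reports whether they match, differ only in capitalization, or differ in some other way. The function name is hypothetical; the addresses are the ones used in the example above.

```python
def compare_addresses(client_address, isl_address):
    """Classify how a client address relates to the ISL address."""
    if client_address == isl_address:
        return "match"
    if client_address.lower() == isl_address.lower():
        return "capitalization differs"
    return "format or host differs"


isl_address = "//TRIXIE:3500"
for candidate in ("//TRIXIE:3500", "//trixie:3500", "//192.12.4.6:3500"):
    print(candidate, "->", compare_addresses(candidate, isl_address))
```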

Why Some Oracle Tuxedo CORBA Clients Fail to Boot

Some Internet Inter-ORB Protocol (IIOP) clients may boot while others fail to create a Bootstrap object and return an InvalidDomain message, even though the //host:port address is correctly specified. If this problem occurs on a Windows server that is running an Oracle Tuxedo CORBA application, you may want to perform the following steps. (For related information, see the section “How to Check the Hostname Format and Capitalization of Oracle Tuxedo CORBA Servers.”)

1.

Start regedt32, the Registry Editor.

2.

Go to the HKEY_LOCAL_MACHINE on Local Machine window.

3.

Select:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Afd\Parameters

4.

Add the following values by using the Edit -> Add Value menu option:

DynamicBacklogGrowthDelta: REG_DWORD: 0xa

EnableDynamicBacklog: REG_DWORD: 0x1

MaximumDynamicBacklog: REG_DWORD: 0x3e8

MinimumDynamicBacklog: REG_DWORD: 0x14

5.

Restart the Windows system for the changes to take effect.

These values replace the static connection queue (that is, the backlog) of five pending connections with a dynamic connection backlog that has at least 20 entries (minimum 0x14) and at most 1000 entries (maximum 0x3e8), and that grows from the minimum to the maximum in steps of 10 (growth delta 0xa).

These settings only apply to connections that have been received by the system, but are not accepted by an IIOP Listener. The minimum value of 20 and the delta of 10 are recommended by Microsoft. The maximum value depends on the machine. However, Microsoft recommends that the maximum value not exceed 5000 on a Windows server.
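The hexadecimal values are easier to reason about in decimal. The following Python sketch converts the three registry values and counts how many growth steps separate the minimum from the maximum. It models growth as a simple fixed-step increase, as described above; how Windows actually schedules the growth is not covered here.

```python
minimum = 0x14   # MinimumDynamicBacklog
maximum = 0x3e8  # MaximumDynamicBacklog
delta = 0xa      # DynamicBacklogGrowthDelta


def backlog_sizes(minimum, maximum, delta):
    """Return every backlog size from minimum to maximum in steps of delta."""
    size = minimum
    sizes = [size]
    while size < maximum:
        size = min(size + delta, maximum)
        sizes.append(size)
    return sizes


sizes = backlog_sizes(minimum, maximum, delta)
print(minimum, maximum, delta)  # 20 1000 10
print(len(sizes) - 1)           # 98 growth steps
```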

Aborting or Committing Transactions

This topic provides instructions for aborting and committing transactions.

How to Abort a Transaction

To abort a transaction, complete the following procedure.

1.

Enter the following command:

aborttrans (abort) [-yes] [-g groupname] tranindex

2.

To determine the value of tranindex, run the printtrans command (a tmadmin command).

3.

If groupname is specified, a message is sent to the TMS of that group to mark the transaction as “aborted” for that group. If no group is specified, a message is sent instead to the coordinating TMS, requesting an abort of the transaction. You must send abort messages to all groups in the transaction to control the abort.

This command is useful when the coordinating site is partitioned or when the client terminates before calling a commit or an abort. If the timeout is large, the transaction remains in the transaction table unless it is aborted.

How to Commit a Transaction

To commit a transaction, enter the following command:

committrans (commit) [-yes] [-g groupname] tranindex

Note:

Both groupname and tranindex are required arguments.

The operation fails if the transaction is not precommitted or has been marked aborted. This message should be sent to all groups to fully commit the transaction.

Cautions About Using the committrans Command

Be careful about using the committrans command. The only time you need to run it is when both of the following conditions apply:

•

The coordinating TMS has gone down before all groups got the commit message.

•

The coordinating TMS will not be able to recover the transaction for some time.

Also, a client may be blocked on tpcommit(), which will be timed out. If you are going to perform an administrative commit, be sure to inform this client.

How to Recover from Failures When Transactions Are Used

When the application you are administering includes database transactions, you may need to apply an after-image journal (AIJ) to a restored database following a disk corruption failure. Or you may need to coordinate the timing of this recovery activity with your site’s database administrator (DBA). Typically, the database management software automatically performs transaction rollback when an error occurs. When the disk containing database files has become corrupted permanently, however, you or the DBA may need to step in and perform the rollforward operation.

Assume that a disk containing portions of a database is corrupted at 3:00 P.M. on a Wednesday. For this example, assume that no shadow volume exists (that is, the disk is not mirrored).

1.

Shut down the Oracle Tuxedo application. (For instructions, see “Starting Up and Shutting Down an Application” in Setting Up an Oracle Tuxedo Application.)

2.

Obtain the last full backup of the database and restore the file. For example, restore the full backup version of the database from last Sunday at 12:01 A.M.

3.

Apply the incremental backup files, such as the incrementals from Monday and Tuesday. For example, assume that this step restores the database up until 11:00 P.M. on Tuesday.

4.

Apply the AIJ, or transaction journal file, that contains the transactions from 11:15 P.M. on Tuesday up to 2:50 P.M. on Wednesday.

5.

Open the database again.

6.

Restart the Oracle Tuxedo application.

Refer to the documentation for the resource manager (database product) for specific instructions on the database rollforward process.
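Before starting a rollforward, it can help to confirm that the backups and journals cover the whole period up to the failure. The following Python sketch lists the recovery segments from the example and reports any interval that no segment covers. The calendar dates are hypothetical (only the weekdays and times come from the example), and the function name is an assumption.

```python
from datetime import datetime, timedelta


def coverage_gaps(segments):
    """Return (end, next_start) pairs where consecutive segments leave a gap."""
    gaps = []
    for (_, _, end), (_, start, _) in zip(segments, segments[1:]):
        if start > end:
            gaps.append((end, start))
    return gaps


# Hypothetical dates: Sunday 7, Tuesday 9, and Wednesday 10 January 2024.
segments = [
    ("full backup", datetime(2024, 1, 7, 0, 1), datetime(2024, 1, 7, 0, 1)),
    ("incrementals", datetime(2024, 1, 7, 0, 1), datetime(2024, 1, 9, 23, 0)),
    ("AIJ", datetime(2024, 1, 9, 23, 15), datetime(2024, 1, 10, 14, 50)),
]

for end, start in coverage_gaps(segments):
    print(f"uncovered: {end} to {start} ({start - end})")
```

With the times given in the example, the sketch reports the interval between 11:00 P.M. and 11:15 P.M. on Tuesday. The example does not say whether another journal file covers that interval, so check with your DBA before assuming that it does.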

How to Use the IPC Tool When an Application Fails to Shut Down Properly

Interprocess communication (IPC) resources are operating system resources, such as message queues, shared memory, and semaphores. When an Oracle Tuxedo application shuts down properly with the tmshutdown command, all IPC resources are removed from the system. In some cases, however, an application may fail to shut down properly and stray IPC resources may remain on the system. When this happens, it may not be possible to reboot the application.

One way to address this problem is to remove IPC resources with a script that invokes the system ipcs command and scans for all IPC resources owned by a particular user account. With this method, however, it is difficult to distinguish among different sets of IPC resources: some may belong to the Oracle Tuxedo system, some to a particular Oracle Tuxedo application, and others to applications unrelated to the Oracle Tuxedo system. It is important to be able to distinguish among these sets of resources because unintentional removal of IPC resources can severely damage an application.

The Oracle Tuxedo IPC tool (that is, the tmipcrm command) enables you to remove IPC resources allocated by the Oracle Tuxedo system (that is, for core Oracle Tuxedo and Workstation components only) in an active application.

The command to remove IPC resources, tmipcrm, resides in TUXDIR/bin. This command reads the binary configuration file (TUXCONFIG) and attaches to the bulletin board using the information in this file. tmipcrm works only on the local server machine; it does not clean up IPC resources on remote machines in an Oracle Tuxedo configuration.

To run this command, enter it as follows on the command line:

tmipcrm [-y] [-n] [TUXCONFIG_file]

The IPC tool lists all IPC resources used by the Oracle Tuxedo system and gives you the option of removing them.

Note:

This command will not work unless you have set the TUXCONFIG environment variable correctly or specified the appropriate TUXCONFIG file on the command line.
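The requirement in the note above can be enforced before the command is run. The following Python sketch assembles a tmipcrm argument list and refuses to continue if neither the TUXCONFIG environment variable nor an explicit file is available. The function name and the file path are hypothetical, and the flags are passed through unchanged; see the tmipcrm(1) reference page for what -y and -n do.

```python
import os

ALLOWED_FLAGS = {"-y", "-n"}  # options shown in the syntax line above


def tmipcrm_argv(tuxconfig=None, flags=(), environ=None):
    """Build the tmipcrm argument list, checking that TUXCONFIG is known."""
    environ = os.environ if environ is None else environ
    unknown = set(flags) - ALLOWED_FLAGS
    if unknown:
        raise ValueError(f"unsupported flags: {sorted(unknown)}")
    if tuxconfig is None and not environ.get("TUXCONFIG"):
        raise ValueError("set TUXCONFIG or pass the TUXCONFIG file explicitly")
    argv = ["tmipcrm", *flags]
    if tuxconfig is not None:
        argv.append(tuxconfig)
    return argv


# Hypothetical path, used only for illustration.
print(tmipcrm_argv("/home/tux/tuxconfig", flags=("-n",)))
```

The sketch only builds the argument list; running it (for example, with subprocess.run) is left to the caller.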

Troubleshooting Multithreaded/Multicontexted Applications

Debugging Multithreaded/Multicontexted Applications

Multithreaded applications can be much more difficult to debug than single-threaded applications. As the administrator, you may want to establish a policy governing whether such multithreaded applications should be created.

Limitations of Protected Mode in a Multithreaded Application

When running in protected mode, an application attaches to shared memory only when an ATMI call is being executed. Protected mode is used to guard against problems that arise when Oracle Tuxedo shared memory is accidentally overwritten by stray application pointers.

If your multithreaded application is running in protected mode, some threads may be executing application code while others are attached to the Oracle Tuxedo Bulletin Board’s shared memory within an Oracle Tuxedo function call. Therefore, as long as at least one thread is attached to the bulletin board in an ATMI call, the use of protected mode cannot guard against stray application pointers in threads executing application code, which may overwrite the Oracle Tuxedo shared memory. As a result, the usefulness of protected mode is relatively limited in multithreaded applications.

There is no solution to this limitation. We simply want to warn you that when running a multithreaded application you cannot rely on protected mode as much as you do when running a single-threaded application.
