Skip Headers
Oracle® Database Backup and Recovery User's Guide
11g Release 1 (11.1)

B28270-03
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

22 Troubleshooting RMAN Operations

This chapter describes how to troubleshoot Recovery Manager. This chapter contains these topics:

Interpreting RMAN Message Output

Recovery Manager provides detailed error messages that can aid in troubleshooting problems. Also, the Oracle database server and third-party media vendors generate useful debugging output of their own. The discussion which follows explains how to identify and interpret the different errors you may encounter.

Identifying Types of Message Output

Output that is useful for troubleshooting failed or hung RMAN jobs is located in several different places, as explained in the following table.

Table 22-1 Types of Message Output

Type of Output Produced By Location Description

RMAN messages

RMAN

Completed job information is in V$RMAN_STATUS and RC_RMAN_STATUS. Current job information is in V$RMAN_OUTPUT.

When running RMAN from the command line, you can direct output to the following places:

  • Standard output

  • A log file specified by LOG on the command line or the SPOOL LOG command

  • A file created by redirecting RMAN output (for example, UNIX > operator)

Contains actions relevant to the RMAN job as well as error messages generated by RMAN, the database server, and the media vendor. RMAN error messages have an RMAN-xxxxx prefix. Normal action descriptions do not have a prefix.

Note that you can execute the following PL/SQL to remove all entries from V$RMAN_STATUS:

SYS.DBMS_BACKUP_RESTORE.resetCfileSection(28);

The preceding function removes all job-related entries. No rows will be visible until new backup jobs are shown in V$RMAN_BACKUP_JOB_DETAILS.

alert_SID.log

Oracle Database

The alert subdirectory of the Automatic Diagnostic Repository (ADR) home.

Contains a chronological log of errors, initialization parameter settings, and administration operations. Records values for overwritten control file records.

Oracle trace file

Oracle Database

The trace subdirectory of the ADR home.

Contains detailed output generated by Oracle server processes. This file is created when an ORA-600 or ORA-3113 error message occurs, whenever RMAN cannot allocate a channel, and when the database fails to load the media management library.

sbtio.log

Third-party media management software

The trace subdirectory of the ADR home.

Contains vendor-specific information written by the media management software. This log does not contain Oracle server or RMAN errors.

Media manager log file

Third-party media management software

The filenames for any media manager logs other than sbtio.log are determined by the media management software.

Contains information on the functioning of the media management device.


Recognizing RMAN Error Message Stacks

RMAN reports errors as they occur. If an error is not retrievable, that is, RMAN cannot perform failover to another channel to complete a particular job step, then RMAN also reports a summary of the errors after all job sets complete. This feature is known as deferred error reporting.

One way to determine whether RMAN encountered an error is to examine its return code, as described in "Identifying RMAN Return Codes". A second way is to search the RMAN output for the string RMAN-00569, which is the message number for the error stack banner. All RMAN errors are preceded by this error message. If you do not see an RMAN-00569 message in the output, then there are no errors. The following example shows a syntax error.

Example 22-1 RMAN Syntax Error

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-00558: error encountered while parsing input commands
RMAN-01005: syntax error: found ")": expecting one of: "archivelog, backup, backupset, controlfilecopy, current, database, datafile, datafilecopy, (, plus, ;, tablespace"
RMAN-01007: at line 1 column 18 file: standard input

Identifying Error Codes

Typically, you find the following types of error codes in RMAN message stacks:

  • Errors prefixed with RMAN-

  • Errors prefixed with ORA-

  • Errors preceded by the line Additional information:

    See Also:

    Oracle Database Error Messages for explanations of RMAN and ORA error codes

RMAN Error Message Numbers

Table 22-2 indicates the error ranges for common RMAN error messages, all of which are described in Oracle Database Error Messages.

Table 22-2 RMAN Error Message Ranges

Error Range Cause

0550-0999

Command-line interpreter

1000-1999

Keyword analyzer

2000-2999

Syntax analyzer

3000-3999

Main layer

4000-4999

Services layer

5000-5499

Compilation of RESTORE or RECOVER command

5500-5999

Compilation of DUPLICATE command

6000-6999

General compilation

7000-7999

General execution

8000-8999

PL/SQL programs

9000-9999

Low-level keyword analyzer

10000-10999

Server-side execution

11000-11999

Interphase errors between PL/SQL and RMAN

12000-12999

Recovery catalog packages


ORA-19511: Media Manager Errors

In the event of a media manager error, ORA-19511 is signalled, and the media manager is expected to provide RMAN a descriptive error. RMAN will display the error passed back to it by the media manager. For example, you might see this:

ORA-19511: Error received from media manager layer, error text:
   sbtpvt_open_input: file .* does not exist or cannot be accessed, errno = 2

The message from the media manager should provide you with enough information to let you fix the root problem. If it does not, you should refer to the documentation for your media manager or contact your media management vendor support representative for further information. ORA-19511 errors originate with the media manager, not the Oracle database. The database merely passes the message on from the media manager. The cause can only be addressed by the media management vendor.

Note that if you are still using an SBT 1.1-compliant media management layer, you may see some additional error message text. Output from an SBT 1.1-compliant media management layer is similar to the following:

ORA-19507: failed to retrieve sequential file, handle="c-140148591-20031014-06", parms=""
ORA-27007: failed to open file
Additional information: 7000
Additional information: 2
ORA-19511: Error received from media manager layer, error text:
   SBT error = 7000, errno = 0, sbtopen: backup file not found

The "Additional information" provided uses error codes specific to SBT 1.1. The values displayed correspond to the media manager message numbers and error text listed in Table 22-3. RMAN re-signals the error, as an ORA-19511 Error received from media manager layer error, and a general error message related to the error code returned from the media manager and including the SBT 1.1 error number is then displayed.

The SBT 1.1 error messages are listed here for your reference. Table 22-3 lists media manager message numbers and their corresponding error text. In the error codes, O/S stands for operating system. The errors marked with an asterisk are internal and should not typically be seen during normal operation.

Table 22-3 Media Manager Error Message Ranges

Cause No. Message

sbtopen

7000

7001

7002*

7003

7004

7005

7006

7007

7008

7009

7010

7011

7012*

Backup file not found (only returned for read)

File exists (only returned for write)

Bad mode specified

Invalid block size specified

No tape device found

Device found, but busy; try again later

Tape volume not found

Tape volume is in-use

I/O Error

Can't connect with Media Manager

Permission denied

O/S error for example malloc, fork error

Invalid argument(s) to sbtopen

sbtclose

7020*

7021*

7022

7023

7024*

7025

Invalid file handle or file not open

Invalid flags to sbtclose

I/O error

O/S error

Invalid argument(s) to sbtclose

Can't connect with Media Manager

sbtwrite

7040*

7041

7042

7043

7044*

Invalid file handle or file not open

End of volume reached

I/O error

O/S error

Invalid argument(s) to sbtwrite

sbtread

7060*

7061

7062

7063

7064

7065*

Invalid file handle or file not open

EOF encountered

End of volume reached

I/O error

O/S error

Invalid argument(s) to sbtread

sbtremove

7080

7081

7082

7083

7084

7085

7086*

Backup file not found

Backup file in use

I/O Error

Can't connect with Media Manager

Permission denied

O/S error

Invalid argument(s) to sbtremove

sbtinfo

7090

7091

7092

7093

7094

7095*

Backup file not found

I/O Error

Can't connect with Media Manager

Permission denied

O/S error

Invalid argument(s) to sbtinfo

sbtinit

7110*

7111

Invalid argument(s) to sbtinit

O/S error


Interpreting RMAN Error Stacks

Sometimes you may find it difficult to identify the useful messages in the RMAN error stack. Note the following tips and suggestions:

  • Read the messages from the bottom up, because this is the order in which RMAN issues the messages. The last one or two errors displayed in the stack are often the most informative.

  • When using an SBT 1.1 media management layer and presented with SBT 1.1 style error messages containing the "Additional information:" numeric error codes, look for the ORA-19511 message that follows for the text of error messages passed back to RMAN by the media manager. These should identify the real failure in the media management layer.

  • Look for the RMAN-03002 or RMAN-03009 message (RMAN-03009 is the same as RMAN-03002 but includes the channel ID), immediately following the error banner. These messages indicate which command failed. Syntax errors generate RMAN-00558.

  • Identify the basic type of error according to the error range chart in Table 22-2 and then refer to Oracle Database Error Messages for information on the most important messages.

Interpreting RMAN Errors: Example

You attempt a backup of tablespace users and receive the following message:

Starting backup at 29-AUG-02
using channel ORA_DISK_1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 08/29/2002 15:14:03
RMAN-20202: tablespace not found in the recovery catalog
RMAN-06019: could not translate tablespace name "USESR"

The RMAN-03002 error indicates that the BACKUP command failed. You read the last two messages in the stack first and immediately see the problem: no tablespace user appears in the recovery catalog because you mistyped the name.

Interpreting Server Errors: Example

Assume that you attempt to recover a tablespace and receive the following errors:

RMAN> RECOVER TABLESPACE users;

Starting recover at 29-AUG-01
using channel ORA_DISK_1

starting media recovery
media recovery failed
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of recover command at 08/29/2007 15:18:43
RMAN-11003: failure during parse/execution of SQL statement: alter database recover if needed tablespace USERS
ORA-00283: recovery session canceled due to errors
ORA-01124: cannot recover data file 8 - file is in use or recovery
ORA-01110: data file 8: '/oracle/oradata/trgt/users01.dbf'

As suggested, you start reading from the bottom up. The ORA-01110 message explains there was a problem with the recovery of datafile users01.dbf. The second error indicates that the database cannot recover the datafile because it is in use or already being recovered. The remaining RMAN errors indicate that the recovery session was cancelled due to the server errors. Hence, you conclude that because you were not already recovering this datafile, the problem must be that the datafile is online and you need to take it offline and restore a backup.

Interpreting SBT 2.0 Media Management Errors: Example

Assume that you use a tape drive and see the following output during a backup job:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
ORA-19624: operation failed, retry possible
ORA-19507: failed to retrieve sequential file, handle="/tmp/mydir", parms=""
ORA-27029: skgfrtrv: sbtrestore returned error
ORA-19511: Error received from media manager layer, error text:
  sbtpvt_open_input:file /tmp/mydir does not exist or cannot be accessed, errno=2

The error text displayed following the ORA-19511 error is generated by the media manager and describes the real source of the failure. Refer to the media manager documentation to interpret this error.

Interpreting SBT 1.1 Media Management Errors: Example

Assume that you use a tape drive and see the following output during a backup job:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on c1 channel at 09/04/2007 13:18:19
ORA-19506: failed to create sequential file, name="07d36ecp_1_1", parms=""
ORA-27007: failed to open file
SVR4 Error: 2: No such file or directory
Additional information: 7005
Additional information: 1
ORA-19511: Error received from media manager layer, error text:
   SBT error = 7005, errno = 2, sbtopen: system error

The main information of interest returned by SBT 1.1 media managers is the error code in the "Additional information" line:

Additional information: 7005

Referring to Table 22-3, "Media Manager Error Message Ranges", you discover that error 7005 means that the media management device is busy. So, the media management software is not able to write to the device because it is in use or there is a problem with it.

Note:

The sbtio.log contains information written by the media management software, not the Oracle database. Thus, you must consult your media vendor documentation to interpret the error codes and messages. If no information is written to the sbtio.log, then contact your media manager support to ask whether they are writing error messages in some other location, or whether there are steps you need to take to have the media manager errors appear in sbtio.log.

Identifying RMAN Return Codes

One way to determine whether RMAN encountered an error is to examine its return code or exit status. The RMAN client returns 0 to the shell from which it was invoked if no errors occurred, and a nonzero error value otherwise.

How you access this return code depends upon the environment from which you invoked the RMAN client. For example, if you are running UNIX with the C shell, then, when RMAN completes, the return code is placed in a shell variable called $status. The method of returning exit status is a detail specific to the host operating system rather than the RMAN client.

Using V$ Views for RMAN Troubleshooting

When LIST, REPORT and SHOW do not provide all the information you need for RMAN operations, a number of useful V$ views can provide more details.

Sometimes it is useful to identify exactly what a server session performing a backup and recovery job is doing. The views described in Table 22-4 are useful for obtaining information about RMAN jobs.

Table 22-4 Useful V$ Views for Troubleshooting

View Description

V$PROCESS

Identifies currently active processes.

V$SESSION

Identifies currently active sessions. Use this view to determine which database server sessions correspond to which RMAN allocated channels.

V$SESSION_WAIT

Lists the events or resources for which sessions are waiting.


You can use the preceding views to perform the following tasks:

Monitoring RMAN Interaction with the Media Manager

You can use the event names in the dynamic performance event views to monitor RMAN calls to the media management API. The event names have one-to-one correspondence with SBT functions, as shown in the following examples:

Backup: sbtinit
Backup: ssbtopen
Backup: ssbtread
Backup: ssbtwrite
Backup: ssbtbackup
.
.
.

To obtain the complete list of SBT events, you can use the following query:

SELECT NAME 
FROM   V$EVENT_NAME 
WHERE  NAME LIKE '%sbt%';

Before making a call to any of functions in the media management API, the server adds a row in V$SESSION_WAIT, with the STATE column including the string WAITING. The V$SESSION_WAIT.SECONDS_IN_WAIT column shows the number of seconds that the server has been waiting for this call to return. After an SBT function is returned from the media manager, this row disappears.

A row in V$SESSION_WAIT corresponding to an SBT event name does not indicate a problem, because the server updates these rows at runtime. The rows appear and disappear as calls are made and returned. However, if the SECONDS_IN_WAIT column is high, then the media manager may be hung.

To monitor the SBT events, you can run the following SQL query:

COLUMN EVENT FORMAT a10
COLUMN SECONDS_IN_WAIT FORMAT 999
COLUMN STATE FORMAT a20
COLUMN CLIENT_INFO FORMAT a30

SELECT p.SPID, EVENT, SECONDS_IN_WAIT AS SEC_WAIT, 
       sw.STATE, CLIENT_INFO
FROM   V$SESSION_WAIT sw, V$SESSION s, V$PROCESS p
WHERE  sw.EVENT LIKE 's%bt%'
AND    s.SID=sw.SID
AND    s.PADDR=p.ADDR;

Examine the SQL output to determine which SBT functions are waiting. For example, the following output indicates that RMAN has been waiting for the sbtbackup function to return for ten minutes:

SPID EVENT             SEC_WAIT   STATE                CLIENT_INFO
---- ----------------- ---------- -------------------- ------------------------------
8642 Backup: sbtbackup 600        WAITING              rman channel=ORA_SBT_TAPE_1

Note:

The V$SESSION_WAIT view shows only database events, not media manager events.

See Also:

Oracle Database Reference for descriptions of V$SESSION_WAIT

Correlating Server Sessions with RMAN Channels

To identify which server sessions correspond to which RMAN channels, you can query V$SESSION and V$PROCESS. The SPID column of V$PROCESS identifies the operating system ID number for the process or thread. For example, on UNIX the SPID column shows the process ID, whereas on Windows the SPID column shows the thread ID. You have two basic methods for obtaining this information, depending on whether you have multiple RMAN sessions active concurrently.

Matching Server Sessions with Channels When One RMAN Session Is Active

When only one RMAN session is active, the easiest method for determining the server session ID for an RMAN channel is to execute the following query on the target database while the RMAN job is executing:

COLUMN CLIENT_INFO FORMAT a30
COLUMN SID FORMAT 999
COLUMN SPID FORMAT 9999

SELECT s.SID, p.SPID, s.CLIENT_INFO
FROM   V$PROCESS p, V$SESSION s
WHERE  p.ADDR = s.PADDR
AND    CLIENT_INFO LIKE 'rman%';

The following shows sample output:

 SID SPID         CLIENT_INFO
---- ------------ ------------------------------
  14 8374         rman channel=ORA_SBT_TAPE_1

If you set an ID using the RMAN SET COMMAND ID command instead of using the system-generated default ID, then search for that value in the CLIENT_INFO column instead of 'rman%'.

Matching Server Sessions with Channels in Multiple RMAN Sessions

If more than one RMAN session is active, then it is possible for the V$SESSION.CLIENT_INFO column to yield the same information for a channel in each session. For example:

 SID SPID         CLIENT_INFO
---- ------------ ------------------------------
  14 8374         rman channel=ORA_SBT_TAPE_1
   9 8642         rman channel=ORA_SBT_TAPE_1

In this case, you have the following methods for determining which channel corresponds to which SID value.

Obtaining the Channel ID from the RMAN Output

In this method, you must first obtain the sid values from the RMAN output and then use these values in your SQL query.

To correlate a process with a channel during a backup:

In one of the active sessions, run the RMAN job as normal and examine the output to get the sid for the channel. For example, the output may show:

Starting backup at 21-AUG-01
allocated channel: ORA_SBT_TAPE_1
channel ORA_SBT_TAPE_1: sid=14 devtype=SBT_TAPE

Start a SQL*Plus session and then query the joined V$SESSION and V$PROCESS views while the RMAN job is executing. For example, enter:

COLUMN CLIENT_INFO FORMAT a30
COLUMN SID FORMAT 999
COLUMN SPID FORMAT 9999

SELECT s.SID, p.SPID, s.CLIENT_INFO
FROM   V$PROCESS p, V$SESSION s
WHERE  p.ADDR = s.PADDR
AND    CLIENT_INFO LIKE 'rman%'
/

Use the sid value obtained from the first step to determine which channel corresponds to which server session:

       SID SPID         CLIENT_INFO
---------- ------------ ------------------------------
        14 2036         rman channel=ORA_SBT_TAPE_1
        12 2066         rman channel=ORA_SBT_TAPE_1
Correlating Server Sessions with Channels by Using SET COMMAND ID

In this method, you specify a command ID string in the RMAN backup script. You can then query V$SESSION.CLIENT_INFO for this string.

To correlate a process with a channel during a backup:

In each session, set the COMMAND ID to a different value after allocating the channels and then back up the desired object. For example, enter the following in session 1:

RUN 
{
  ALLOCATE CHANNEL c1 TYPE disk;
  SET COMMAND ID TO 'sess1';
  BACKUP DATABASE;
}

Set the command ID to a string such as sess2 in the job running in session 2:

RUN 
{
  ALLOCATE CHANNEL c1 TYPE sbt;
  SET COMMAND ID TO 'sess2';
  BACKUP DATABASE;
}

Start a SQL*Plus session and then query the joined V$SESSION and V$PROCESS views while the RMAN job is executing. For example, enter:

SELECT SID, SPID, CLIENT_INFO 
FROM   V$PROCESS p, V$SESSION s 
WHERE  p.ADDR = s.PADDR 
AND    CLIENT_INFO LIKE '%id=sess%';

If you run the SET COMMAND ID command in the RMAN job, then the CLIENT_INFO column displays in the following format:

id=command_id,rman channel=channel_id

For example, the following shows sample output:

 SID SPID         CLIENT_INFO
---- ------------ ------------------------------
  11 8358         id=sess1
  15 8638         id=sess2
  14 8374         id=sess1,rman channel=c1
   9 8642         id=sess2,rman channel=c1

The rows that contain the string rman channel show the channel performing the backup. The remaining rows are for the connections to the target database.

See Also:

Oracle Database Backup and Recovery Reference for SET COMMAND ID syntax, and Oracle Database Reference for more information on V$SESSION and V$PROCESS

Testing the Media Management API

On some platforms, Oracle provides a diagnostic tool called sbttest. This utility performs a simple test of the media management software by attempting to communicate with the media manager as the Oracle database server would.

Obtaining the sbttest Utility

On UNIX, the sbttest utility is typically located in $ORACLE_HOME/bin. If for some reason the utility is not included with your platform, then contact Oracle Support to obtain the C version of the program. You can compile this version of the program on all UNIX platforms.

Note that on platforms such as Solaris, you do not have to relink when using sbttest. On other platforms, relinking may be necessary.

Obtaining Online Documentation for the sbttest Utility

For online documentation of sbttest, issue the following on the command line:

% sbttest

The program displays the list of possible arguments for the program:

Error: backup file name must be specified
Usage: sbttest backup_file_name # this is the only required parameter
               <-dbname database_name>
               <-trace trace_file_name>
               <-remove_before>
               <-no_remove_after> 
               <-read_only>
               <-no_regular_backup_restore>
               <-no_proxy_backup>
               <-no_proxy_restore>
               <-file_type n>
               <-copy_number n>
               <-media_pool n>
               <-os_res_size n>
               <-pl_res_size n>
               <-block_size block_size> 
               <-block_count block_count>
               <-proxy_file os_file_name bk_file_name 
                           [os_res_size pl_res_size block_size block_count]>
               <-libname sbt_library_name>

The display also indicates the meaning of each argument. For example, following is the description for two optional parameters:

Optional parameters:
  -dbname  specifies the database name which will be used by SBT
           to identify the backup file. The default is "sbtdb"
  -trace   specifies the name of a file where the Media Management 
           software will write diagnostic messages.

Using the sbttest Utility

Use sbttest to perform a quick test of the media manager.

If sbttest returns 0, then the test ran without error, which means that the media manager is correctly installed and can accept a data stream and return the same data when requested. If sbttest returns a nonzero value, then either the media manager is not installed or it is not configured correctly.

To use sbttest:

Make sure the program is installed and included in the system path by typing sbttest at the command line:

% sbttest

If the program is operational, then you should see a display of the online documentation.

Execute the program, specifying any of the arguments described in the online documentation. For example, enter the following to create test file some_file.f and write the output to sbtio.log:

% sbttest some_file.f -trace sbtio.log

You can also test a backup of an existing datafile. For example, this command tests datafile tbs_33.f of database prod:

% sbttest tbs_33.f -dbname prod

Examine the output. If the program encounters an error, then it provides messages describing the failure. For example, if the database cannot find the library, you see:

libobk.so could not be loaded. Check that it is installed properly, and that
 LD_LIBRARY_PATH environment variable (or its equivalent on your platform)
 includes the directory where this file can be found. Here is some additional
 information on the cause of this error:
ld.so.1: sbttest: fatal: libobk.so: open failed: No such file or directory

Note that in some cases sbttest can work but an RMAN backup does not. The reasons can be the following:

  • The user who starts sbttest is not the owner of the Oracle processes.

  • If the database server is not linked with the media management library or cannot load it dynamically when needed, then RMAN backups to the media manager fail, but sbttest may still work.

  • The sbttest program passes all environment parameters from the shell but RMAN does not.

Terminating an RMAN Command

There are several ways to terminate an RMAN command in the middle of execution:

Terminating the Session with ALTER SYSTEM KILL SESSION

You can identify the Oracle session ID for an RMAN channel by looking in the RMAN log for messages with the format shown in the following example:

channel ch1: sid=15 devtype=SBT_TAPE

The sid and devtype are displayed for each allocated channel. Note that the Oracle sid is different from the operating system process ID. You can kill the session using a SQL ALTER SYSTEM KILL SESSION statement.

ALTER SYSTEM KILL SESSION takes two arguments, the sid printed in the RMAN message and a serial number, both of which can be obtained by querying V$SESSION. For example, run the following statement, where sid_in_rman_output is the number from the RMAN message:

SELECT SERIAL# 
FROM   V$SESSION 
WHERE  SID=sid_in_rman_output;

Then, run the following statement, substituting the sid_in_rman_output and serial number obtained from the query:

ALTER SYSTEM KILL SESSION 'sid_in_rman_output,serial#';

Note that this will not unhang the session if the session is hung in media manager code.

Terminating the Session at the Operating System Level

Finding and killing the processes that are associated with the server sessions is operating system specific. On some platforms the server sessions are not associated with any processes at all. Refer to your operating system specific documentation for more information.

Terminating an RMAN Session That Is Hung in the Media Manager

You may sometimes need to kill an RMAN job that is hung in the media manager. The best way to terminate RMAN when the channel connections are hung in the media manager is to kill the session in the media manager. If this action does not solve the problem, then on some platforms, such as Unix, you may be able to kill the Oracle processes of the connections. (Note that killing the Oracle processes may cause problems from the media manager. See your media manager documentation for details.)

Components of an RMAN Session

The nature of an RMAN session depends on the operating system. In UNIX, an RMAN session has the following processes associated with it:

  • The RMAN client process itself

  • The default channel, the initial connection to the target database

  • One target connection to the target database corresponding to each allocated channel

  • The catalog connection to the recovery catalog database, if you use a recovery catalog

  • An auxiliary connection to an auxiliary instance, during DUPLICATE or TSPITR operations

  • A polling connection to the target database, used for monitoring RMAN command execution on the various allocated channels. By default, RMAN makes one polling connection. RMAN makes additional polling connections if you use different connect strings in the ALLOCATE CHANNEL or CONFIGURE CHANNEL commands. One polling connection exists for each distinct connect string used in the ALLOCATE CHANNEL or CONFIGURE CHANNEL command.

Process Behavior During a Hung Job

RMAN usually hangs because one of the channel connections is waiting in the media manager code for a tape resource. The catalog connection and the default channel appear to hang, because they are waiting for RMAN to tell them what to do. Polling connections seem to be in an infinite loop while polling the RPC under the control of the RMAN process.

If you kill the RMAN process itself, then you also kill the catalog connection, the auxiliary connection, the default channel, and the polling connections. If target and auxiliary connections are not hung in the media manager code, they also terminate. If either the target connection or any of the auxiliary connections are executing in the media management layer, then they will not terminate until the processes are manually killed at the operating system level.

Not all media managers can detect the termination of the Oracle process. Those which cannot may keep resources busy or continue processing. Consult your media manager documentation for details.

Terminating the catalog connection does not cause the RMAN process to terminate because RMAN is not performing catalog operations while the backup or restore is in progress. Removing default channel and polling connections causes the RMAN process to detect that one of the channels has died and then proceed to exit. In this case, the connections to the hung channels remain active as described previously.

Terminating an RMAN Session: Basic Steps

After the hung channels in the media manager code are killed, the RMAN process detects this termination and proceed to exit, removing all connections except target connections that are still operative in the media management layer. The warning about the media manager resources still applies in this case.

To terminate an Oracle process that is hung in the media manager:

Query V$SESSION and V$SESSION_WAIT as described in "Using V$ Views for RMAN Troubleshooting". For example, execute the following query:

COLUMN EVENT FORMAT a10
COLUMN SECONDS_IN_WAIT FORMAT 999
COLUMN STATE FORMAT a20
COLUMN CLIENT_INFO FORMAT a30

SELECT p.SPID, s.EVENT, s.SECONDS_IN_WAIT AS SEC_WAIT, 
       sw.STATE, s.CLIENT_INFO
FROM   V$SESSION_WAIT sw, V$SESSION s, V$PROCESS p
WHERE  sw.EVENT LIKE 'sbt%'
AND    s.SID=sw.SID
AND    s.PADDR=p.ADDR;

Examine the SQL output to determine which SBT functions are waiting. For example, the output may be as follows:

SPID EVENT        SEC_WAIT STATE                CLIENT_INFO
---- ---------- ---------- -------------------- -------------
8642 sbtwrite2         600 WAITING              rman channel=ORA_SBT_TAPE_1
8374 sbtwrite2         600 WAITING              rman channel=ORA_SBT_TAPE_2

Using operating system-level tools appropriate to your platform, kill the hung sessions. For example, on Linux execute a kill -9 command:

% kill -9 8642 8374

Some platforms include a command-line utility called orakill that enables you to kill a specific thread. From a command prompt, run the following command, where sid identifies the database instance to target, and the thread_id is the SPID value from the query in step 1:

orakill sid thread_id

Check that the media manager also clears its processes. If any remain, the next backup or restore operation may hang again, due to the previous hang. In some media managers, the only solution is to shut down and restart the media manager. If the documentation from the media manager does not provide the needed information, contact technical support for the media manager.

See Also:

Your operating system specific documentation for the relevant commands