4 Managing Unscheduled Outages

This chapter describes unscheduled outages and the Oracle operational best practices that can tolerate or manage each outage type and minimize downtime.

This chapter contains these topics:

See Also:

Chapter 5 for information about scheduled outages

4.1 Overview of Unscheduled Outages

This section complements Table 1-1 in Oracle Database High Availability Overview, which describes unscheduled outages that affect primary or secondary site components. This section also describes the recommended methods to repair each outage or to minimize its associated downtime.

Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:

  • Hardware

  • Software

  • Network infrastructure

  • Naming services infrastructure

  • Database

Your monitoring and high availability infrastructure should provide rapid detection and recovery from downtime. Chapter 3, "Monitoring Using Oracle Grid Control" describes detection, while this chapter focuses on reducing downtime.

The best practice recommendations for reducing unscheduled outages on the primary site and the secondary site, estimated recovery times, and recovery steps appear in the following sections:

4.1.1 Managing Unscheduled Outages on the Primary Site

Solutions for unscheduled outages are critical for maximum availability of the system.

Table 4-1 compares the most common Oracle high availability architectures and summarizes the recovery steps for unscheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages".

Table 4-1 Recovery Times and Steps for Unscheduled Outages on the Primary Site

Each outage scope below lists the estimated recovery time and recovery steps for the four architectures: Oracle Database 11g, Oracle Database 11g with RAC and Clusterware, Oracle Database 11g with Data Guard, and Oracle Database 11g MAA.

Site failure

  Oracle Database 11g: Hours to days

    1. Restore site.

    2. Restore from tape backups.

    3. Recover database.

  Oracle Database 11g with RAC and Clusterware: Hours to days

    1. Restore site.

    2. Restore from tape backups.

    3. Recover database.

  Oracle Database 11g with Data Guard: Seconds to 5 minutes (Footnote 1)

    1. Database Failover with a Standby Database

    2. Complete Site Failover

    3. Application Failover

  Oracle Database 11g MAA: Seconds to 5 minutes (Footnote 1)

    1. Database Failover with a Standby Database

    2. Complete Site Failover

    3. Application Failover

Clusterwide failure

  Oracle Database 11g: Not applicable

  Oracle Database 11g with RAC and Clusterware: Hours to days

    1. Restore cluster or restore at least one node.

    2. Optionally restore from tape backups if the data is lost or corrupted.

    3. Recover database.

  Oracle Database 11g with Data Guard: Not applicable

  Oracle Database 11g MAA: Seconds to 5 minutes

    1. Database Failover with a Standby Database

    2. Application Failover

Computer failure (node)

  Oracle Database 11g: Minutes to hours (Footnote 2)

    1. Restart node and restart database.

    2. Reconnect users.

  Oracle Database 11g with RAC and Clusterware: No downtime (Footnote 3)

    Managed automatically by Oracle RAC Recovery for Unscheduled Outages

  Oracle Database 11g with Data Guard: Seconds to 5 minutes (Footnote 2)

    1. Database Failover with a Standby Database

    2. Application Failover

  Oracle Database 11g MAA: No downtime (Footnote 3)

    Managed automatically by Oracle RAC Recovery for Unscheduled Outages

Computer failure (instance)

  Oracle Database 11g: Minutes (Footnote 2)

    1. Restart instance.

    2. Reconnect users.

  Oracle Database 11g with RAC and Clusterware: No downtime (Footnote 3)

    Managed automatically by Oracle RAC Recovery for Unscheduled Outages

  Oracle Database 11g with Data Guard: Minutes (Footnote 2)

    1. Restart instance.

    2. Reconnect users.

    or, Seconds to 5 minutes (Footnote 1):

    1. Database Failover with a Standby Database

    2. Application Failover

  Oracle Database 11g MAA: No downtime (Footnote 3)

    Managed automatically by Oracle RAC Recovery for Unscheduled Outages

Storage failure

  All four architectures: No downtime (Footnote 4)

    ASM Recovery After Disk and Storage Failures

Human error

  All four architectures: Less than 30 minutes (Footnote 5)

    Recovering from Human Error

Hangs or slowdown

  All four architectures: See Oracle Database High Availability Overview solutions for unplanned downtime

    Application Failover


Footnote 1 Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.

Footnote 2 Recovery time consists largely of the time it takes to restart the failed system.

Footnote 3 The database is still available, but the portion of the application connected to the failed system is temporarily affected.

Footnote 4 Storage failures are prevented by using ASM with mirroring and its automatic rebalance capability.

Footnote 5 Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DDL transaction, then it typically takes only seconds to flash back the appropriate transactions, if properly rehearsed. Referential integrity constraints must be considered.

4.1.2 Managing Unscheduled Outages on the Standby Site

In most cases, you can manage outages on the secondary site without affecting availability of the primary database. However, if the Data Guard configuration is in maximum protection mode, then an unscheduled outage of the last surviving standby database running in SYNC mode incurs an outage on the primary database. This behavior is necessary to ensure there is no data loss when failing over to the standby database. After downgrading the data protection mode, you can restart the primary database even without access to the standby databases. Outages on the secondary site might affect the mean time to recover (MTTR) if there are concurrent failures on the primary site.
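For example, if the primary database has shut down in maximum protection mode because its last SYNC standby became unavailable, one way to restart it without the standby is to first downgrade the protection mode. The following is a minimal sketch; run it on the primary and verify the exact syntax for your release:

    SQL> ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

After the standby is repaired and resynchronized, the protection mode can be raised back to maximum protection.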

Table 4-2 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages".

Table 4-2 Recovery Steps for Unscheduled Outages on the Secondary Site

Outage Type Recovery Steps for Single-Instance or Oracle RAC Standby Database

Computer failure (instance)

  1. Restart node and standby instance when they are available.

  2. Restart recovery.

The broker automatically restarts the log apply services.

Note: If there is only one standby database and the configuration is in maximum protection mode, then the primary database shuts down to ensure that there is no data divergence with the standby database.

Note: If this is an Oracle RAC standby database, then there is no effect on primary database availability if you configured the primary database Oracle Net descriptor to use connect-time failover to an available standby instance. If you are using the broker, connect-time failover is configured automatically.

Data corruption

Restoring Fault Tolerance After a Standby Database Data Failure

Primary database opens with RESETLOGS because of Flashback Database operations or point-in-time media recovery

Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs
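The connect-time failover mentioned in the notes for Table 4-2 is configured in the Oracle Net descriptor. The following tnsnames.ora entry is a hypothetical sketch; the host names, port, and service name are placeholders:

    standby_db =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (LOAD_BALANCE = off)
          (FAILOVER = on)
          (ADDRESS = (PROTOCOL = TCP)(HOST = standby1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = standby2)(PORT = 1521))
        )
        (CONNECT_DATA = (SERVICE_NAME = standby_srv))
      )

With FAILOVER = on, Oracle Net tries each address in order at connect time, so the connection succeeds as long as one standby instance is available.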


4.2 Recovering from Unscheduled Outages

This section describes best practices for recovering from various types of unscheduled outages.

This section contains these topics:

4.2.1 Complete Site Failover

With complete site failover, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.

4.2.1.1 When to Use Complete Site Failover

If the standby site meets the prerequisites, then complete site failover is recommended for the following scenarios:

  • Primary site disaster, such as natural disasters or malicious attacks

  • Primary network-connectivity failures

  • Primary site power failures

4.2.1.2 Best Practices for Complete Site Failover

Site failover can be accomplished in minutes by using the following practices:

Data loss is dependent on the Oracle Data Guard configuration and the use of synchronous or asynchronous redo transport.

4.2.1.3 Repair Solution

A wide-area traffic manager on the primary and secondary sites provides the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site, or a specific application on the primary site, is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site automatically.

Figure 4-1 illustrates the possible network routes before site failover:

  1. Client requests enter the client tier of the primary site and travel through the WAN traffic manager.

  2. Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.

  3. Requests are forwarded through the active load balancer to the application servers.

  4. Requests are sent through another firewall and into the database server tier.

  5. The application requests, if required, are routed to an Oracle RAC instance.

  6. Responses are sent back to the application and clients by a similar path.

Figure 4-1 Network Routes Before Site Failover


Figure 4-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow the same path on the secondary site that they followed on the primary site.

Figure 4-2 Network Routes After Site Failover


The following steps describe the effect of a failover or switchover on network traffic:

  1. The administrator has failed over or switched over the primary database to the secondary site. This is automatic if you are using Data Guard fast-start failover.

  2. The administrator starts the middle-tier application servers on the secondary site, if they are not running.

  3. The wide-area traffic manager selection of the secondary site can be automatic for an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover.

    Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:

    1. Change the DNS to point to the secondary site load balancer:

      The master (primary) DNS server is updated with the zone information, and the change is announced with the DNS NOTIFY announcement.

      The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the zone information.

      Note:

      The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information.
    2. Clear affected records from caching DNS servers.

      A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information.

      Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:

      • BIND 9.3.0: The command rndc flushname name flushes individual entries from the cache.

      • BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndc flush.

      • BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.

    3. Refresh local DNS service caching.

      Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.

      • Solaris: nscd

      • Linux: /etc/init.d/nscd restart

      • Microsoft Windows: ipconfig /flushdns

      • Apple Mac OS X: lookupd -flushcache

    4. The secondary site load balancer directs traffic to the secondary site middle-tier application server.

    5. The secondary site is ready to take client requests.

Failover also depends on the client's Web browser. Most browser applications cache the DNS entry for a period. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.

4.2.2 Database Failover with a Standby Database

Failover is the operation of transitioning one standby database to the role of primary database. A failover operation is invoked when an unplanned failure occurs on the primary database and there is no possibility of recovering the primary database in a timely manner.

With Oracle Data Guard, you can automate the failover process using the broker and fast-start failover, or you can perform the failover manually:

  • Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero loss or minimum-loss failover (that you configure using the FastStartFailoverLagLimit property) within seconds of an outage being detected. See Section 2.6.7.2.3, "Fast-Start Failover Best Practices" for configuration best practices.

  • Manual failover allows for a failover process in which decisions are user driven, using any of the methods described in Section 4.2.2.3, "Best Practices for Performing Manual Failover".

A database failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site hosts the primary database. You must reinstate the original primary database as a new standby database to restore fault tolerance of the configuration. See Section 4.3.2, "Restoring a Standby Database After a Failover".

A failover operation typically occurs in under a minute, and with little or no data loss.


4.2.2.1 When To Perform a Data Guard Failover

When a primary database failure cannot be repaired in time to meet your Recovery Time Objective (RTO) using local backups or Flashback technology, you should perform a failover using Oracle Data Guard.

You should perform a failover manually due to an unplanned outage such as:

  • A site disaster, which results in the primary database becoming unavailable

  • Damage resulting from user errors that cannot be repaired in a timely fashion

  • Data failures, which impact the production application

A failover requires that you reinstate the initial primary database as a standby database to restore fault tolerance to your environment. You can quickly reinstate the standby database using Flashback Database provided the original primary database has not been damaged. See Section 4.3.2, "Restoring a Standby Database After a Failover".

4.2.2.2 Best Practices for Implementing Fast-Start Failover

A fast-start failover is completely automated and requires no user intervention.

There are no procedural best practices to consider when performing a fast-start failover. However, it is important to address all of the configuration best practices described in Section 2.6.7.2.3, "Fast-Start Failover Best Practices".
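As a configuration sketch, fast-start failover is enabled through the broker command-line interface once those best practices are in place; the property values below are illustrative placeholders, not recommendations:

    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverLagLimit = 30;
    DGMGRL> ENABLE FAST_START FAILOVER;
    DGMGRL> START OBSERVER;

The observer should run on a host separate from both the primary and the standby so that it can arbitrate a failover when either site becomes unreachable.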

See Also:

The MAA white paper "Data Guard Switchover and Failover Best Practices" at http://www.otn.oracle.com/goto/maa

4.2.2.3 Best Practices for Performing Manual Failover

When performing a manual failover:

  • Follow the configuration best practices outlined in Section 2.6.7.2.4, "Manual Failover Best Practices".

  • Choose one of the following methods:

    • Oracle Enterprise Manager

      See Oracle Data Guard Broker for complete information about how to perform a manual failover using Oracle Enterprise Manager. The procedure is the same for both physical and logical standby databases.

    • Oracle Data Guard broker command-line interface (DGMGRL)

      See Oracle Data Guard Broker for complete information about how to perform a manual failover using DGMGRL. The procedure is the same for both physical and logical standby databases.

    • SQL*Plus statements:
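      The statements differ by standby type. The following is a sketch for a physical standby; verify the exact syntax for your release, and use ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE FINISH APPLY for a logical standby:

          SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
          SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
          SQL> ALTER DATABASE OPEN;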

4.2.3 Oracle RAC Recovery for Unscheduled Outages

This solution is used automatically when there is a node or instance failure. Surviving instances automatically recover the failed instances and potentially aid in the automatic client failover. Recovery times can be bounded by adopting the database and Oracle RAC configuration best practices, and usually amount to instance recovery times of seconds to minutes even in very large, busy systems, with no data loss.

Use the following recovery methods:

4.2.3.1 Automatic Instance Recovery for Failed Instances

Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.

Instance recovery in Oracle RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using service relocation and fast application notification (as described in Section 4.2.3.2, "Automatic Service Relocation").

When one instance performs recovery for another instance, the recovering instance:

  • Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost

  • Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions

When multiple node failures occur, as long as one instance survives, Oracle RAC performs instance recovery for any other instances that fail. If all instances of an Oracle RAC database fail, then on subsequent restart of any one instance a crash recovery occurs and all committed transactions are recovered. If Data Guard is available, you can fail over automatically with Data Guard fast-start failover after all instances are down.

4.2.3.2 Automatic Service Relocation

Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects an Oracle RAC database instance, then Cluster Ready Services (CRS) automatically moves any services on that instance to another available instance, as configured with DBCA or Enterprise Manager. Then, CRS attempts to restart the failed nodes and instances.

CRS recognizes when a failure affects a service and automatically fails over the service, redistributing the clients across the surviving instances supporting the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures using fast application notification (FAN) events occurs at various levels within the Oracle Server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), advanced queueing, or FAN callouts; recording the fault for tracking; event logging; and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service is transparent to applications. Restart and recovery are automatic, including all of the subsystems, such as the listener and the ASM instance, not just the database.

4.2.3.3 Oracle Cluster Registry

Loss of the Oracle Cluster Registry (OCR) file affects the availability of Oracle RAC and Oracle Clusterware. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig tool. Additionally, starting with Oracle Database 10g Release 10.2, Oracle can optionally mirror the OCR so that a single OCR device failure can be tolerated. Ensure the OCR mirror is on a physically separate device and preferably on a separate controller.
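As a sketch, the ocrconfig tool drives both forms of backup and restore; run it as root, and treat the file paths below as placeholders:

    # List the automatically created physical backups:
    ocrconfig -showbackup

    # Restore from a physical backup while Oracle Clusterware is stopped:
    ocrconfig -restore /u01/app/crs/cdata/crs/backup00.ocr

    # Manually export a logical backup, and import it later if needed:
    ocrconfig -export /backup/ocr_export.dmp
    ocrconfig -import /backup/ocr_export.dmp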

See Also:

"Administering Storage in Real Application Clusters" in Oracle Real Application Clusters Administration and Deployment Guide

4.2.4 Application Failover

With proper configuration, applications can receive fast and efficient notification when application services become unavailable. When notified, application connections fail over transparently to surviving instances of an Oracle RAC database or to a standby database that has assumed the primary role following a failover.

In an Oracle RAC configuration, services are essential to achieving fast and transparent application failover. If a service becomes unavailable for a particular instance because of an instance or node failure, the service fails over to an available instance in the cluster, thereby allowing applications to continue processing. Clients are notified of the service relocation through Fast Application Notification (FAN).

In an Oracle Data Guard configuration, you can configure services for client failover across sites. After a site failure in a Data Guard configuration, the new primary database can automatically publish the production service while notifying affected clients (through FAN events) that the services are no longer available on the failed primary database.

For hangs or situations in which the response time is unacceptable, you can configure Oracle Enterprise Manager or a custom application heartbeat to detect application or response time slowdown and react to these situations. For example, you can configure the Enterprise Manager Beacon to monitor and detect application response times. Then, after a certain time threshold expires, Enterprise Manager can call the Oracle Data Guard DBMS_DG.INITIATE_FS_FAILOVER PL/SQL procedure to initiate a database failover immediately followed by an application failover using FAN notifications and service relocation.
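For example, a heartbeat job could request the failover through PL/SQL. This sketch assumes fast-start failover is enabled and the observer is running; the condition string is free text recorded by the broker:

    DECLARE
      status BINARY_INTEGER;
    BEGIN
      status := DBMS_DG.INITIATE_FS_FAILOVER('Application induced failover');
      DBMS_OUTPUT.PUT_LINE('Return status: ' || status);
    END;
    /

A return status of 0 indicates that the failover request was accepted.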

FAN notifications and service relocation enable automatic and fast redirection of clients in the event of any failure or planned maintenance that results in an Oracle RAC or Oracle Data Guard failover.


4.2.5 ASM Recovery After Disk and Storage Failures

The impacts and recommended repairs for various ASM failure types are summarized in Table 4-3.

Table 4-3 Types of ASM Failures and Recommended Repair

Each failure type below lists its description, impact, and recommended repair.

ASM instance failure

  Description: The ASM instance fails.

  Impact: All database instances accessing ASM storage from the same node shut down.

  Recommended Repair: Automatic Oracle RAC Recovery for Unscheduled Outages. If Oracle RAC is not used, use Data Guard failover (see Section 4.2.2.2, "Best Practices for Implementing Fast-Start Failover"). If neither Oracle RAC nor Data Guard is used, fix the underlying problem, and then restart ASM and the database instances.

ASM disk failure

  Description: One or more ASM disks fail, but all disk groups remain online.

  Impact: All data remains accessible. This is possible only with normal or high redundancy disk groups.

  Recommended Repair: ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space in the remaining disk drives to restore the redundancy, or the rebalance may fail with an ORA-15041 error (see Section 2.1.2.2, "Oracle Storage Grid Best Practices for Planned Maintenance").

  Note: External redundancy disk groups should use mirroring in the storage array to protect from disk failure. Disk failures should not be exposed to ASM.

Data area disk group failure

  Description: One or more ASM disks fail, and the data area disk group goes offline.

  Impact: Databases accessing the data area disk group shut down.

  Recommended Repair: Perform a Data Guard failover or local recovery as described in Section 4.2.5.3, "Data Area Disk Group Failure".

Flash recovery area disk group failure

  Description: One or more ASM disks fail, and the flash recovery area disk group goes offline.

  Impact: Databases accessing the flash recovery area disk group shut down.

  Recommended Repair: Perform local recovery or a Data Guard failover as described in Section 4.2.5.4, "Flash Recovery Area Disk Group Failure".


4.2.5.1 ASM Instance Failure

If the ASM instance fails, then database instances accessing ASM storage from the same node shut down. The following list describes failover processing:

  • If the primary database is an Oracle RAC database, then application failover occurs automatically and clients connected to the failed database instance reconnect to the remaining instances. Thus, the service is provided by other instances in the cluster and processing continues. Recovery typically completes in seconds.

  • If the primary database is not an Oracle RAC database, then an ASM instance failure shuts down the entire database.

  • If the configuration uses Oracle Data Guard and fast-start failover is enabled, a database failover is triggered automatically and clients automatically reconnect to the new primary database after the failover completes. The recovery time is the amount of time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then you must recover from this outage by either restarting the ASM and database instances manually, or by performing a manual Data Guard failover.

  • If the configuration includes neither Oracle RAC nor Data Guard, then you must manually restart the ASM instance and database instances. The recovery time depends on how long it takes to perform these tasks.

4.2.5.2 ASM Disk Failure

If an ASM disk fails, then failover processing proceeds as follows:

  • External redundancy

    If an ASM disk group is configured as an external redundancy type, then a failure of a single disk is handled by the storage array and should not be seen by the ASM instance. All ASM and database operations using the disk group continue normally.

    However, if the failure of an external redundancy disk group is seen by the ASM instance, then the ASM instance takes the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then you can restart ASM and the database instances and crash recovery occurs after the disk group is brought back online.

  • Normal or high redundancy

    If an ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by ASM and the databases accessing the disk group are not affected.

    An ASM instance automatically starts an ASM rebalance operation to distribute the data on one or more failed disks to alternative disks in the ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be remirrored. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an ASM disk group.

The failure of multiple disks in multiple failure groups, where a primary extent and all of its mirror copies have been lost, causes the disk group to go offline.

The following recovery methods can be used:

4.2.5.2.1 Using Enterprise Manager to Repair ASM Disk Failure

Figure 4-3 shows Enterprise Manager reporting disk failures. Three alerts appear at 11:19:29. The first alert is an Offline Disk Count. The second and third alerts are Disk Status messages for data area disk DATA.XBBT1D06_DATA and recovery area disk RECO.XBBT1D06_RECO:

2 disks are offline
Disk DATA.XBBT1D06_DATA is offline.
Disk RECO.XBBT1D06_RECO is offline.

Figure 4-3 Enterprise Manager Reports Disk Failures


Figure 4-4 shows Enterprise Manager reporting the status of data area disk group DATA and recovery area disk group RECO. The red arrows under Member Disks indicate that one disk has failed in each disk group. The numbers under Pending Operations indicate that one operation is pending for each disk group.

Figure 4-4 Enterprise Manager Reports ASM Disk Groups Status


Figure 4-5 shows Enterprise Manager reporting a pending REBAL operation on the DATA disk group. The operation is about one-third done, as shown in % Complete, and the Remaining Time is estimated to be 16 minutes.

Figure 4-5 Enterprise Manager Reports Pending REBAL Operation


4.2.5.2.2 Using SQL to Add Replacement Disks Back to the Disk Group

Perform these steps after one or more failed disks have been replaced, and access to the storage has been restored:

  1. Add one or more replacement disks to the failed disk group with the following SQL command:

    ALTER DISKGROUP disk_group 
       ADD FAILGROUP failure_group 
       DISK 'disk1','disk2',...;
    
  2. Check the progress of the operation:

    SELECT * FROM V$ASM_OPERATION;
    

4.2.5.3 Data Area Disk Group Failure

A data area disk group failure should occur only when there have been multiple failures. For example, if the data area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple disk failures in a storage array may be seen by ASM, causing the disk group to go offline. Similarly, multiple disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

When one or more disks fail in a normal or high redundancy disk group, but the ASM disk group is accessible, there is no loss of data and no immediate loss of accessibility. An ASM instance automatically starts an ASM rebalance operation to distribute the data on the one or more failed disks to alternative disks in the ASM disk group. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a second failure. There must be enough disk space on the remaining disks in the disk group for the rebalance to complete successfully.
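One way to check (as a sketch) whether the remaining disks have enough space for the rebalance to restore full redundancy is to query V$ASM_DISKGROUP; a negative USABLE_FILE_MB indicates that redundancy cannot be fully restored after another failure:

    SQL> SELECT name, type, total_mb, free_mb,
                required_mirror_free_mb, usable_file_mb
           FROM v$asm_diskgroup;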

Table 4-4 summarizes the possible solutions for recovering from a data area disk group failure.

Table 4-4 Recovery Options for Data Area Disk Group Failure

Data Guard failover

  Recovery Time Objective (RTO): Five minutes or less

  Recovery Point Objective (RPO): Varies depending on the data protection level chosen

Local recovery

  Recovery Time Objective (RTO): Database restore and recovery time

  Recovery Point Objective (RPO): Zero


If Data Guard is being used and fast-start failover is configured, then an automatic failover occurs when the database shuts down due to the data-area disk group going offline. If fast-start failover is not configured, then perform a manual failover.

If you decide to perform a Data Guard failover, then the recovery time objective (RTO) is expressed in terms of minutes or seconds, depending on the presence of the Data Guard observer process and fast-start failover. However, if a manual failover occurs and not all data is available on the standby site, then data loss might result.
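
When the Data Guard broker is configured, you can initiate a manual failover from the DGMGRL command-line interface. In this sketch, the standby database name sales_stby is illustrative:

    DGMGRL> CONNECT sys@sales_stby
    DGMGRL> FAILOVER TO sales_stby;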

After Data Guard failover has completed and the application is available, you must resolve the data area disk group failure. Continue with the following "Local Recovery Steps" procedure to resolve the ASM disk group failure.

The RTO for local recovery only is based on the time required to:

  1. Repair and replace the failed storage components

  2. Restore and recover the database

Because the loss affects only the data-area disk group, there is no loss of data. All transactions are recorded in the Oracle redo log members that reside in the flash recovery area, so complete media recovery is possible.

If you are not using Data Guard, then perform the following local recovery steps. The time required to perform local recovery depends on how long it takes to restore and recover the database. There is no data loss when performing local recovery.

Local Recovery Steps

Perform these steps after one or more failed disks have been replaced and access to the storage has been restored:

Note:

If you have performed an Oracle Data Guard failover to a new primary database, then you can now use the following procedure to reintroduce the database into the Data Guard environment. Also, see Section 4.3.2, "Restoring a Standby Database After a Failover".
  1. Rebuild the ASM disk group using the new storage location by issuing the following SQL*Plus statement on the ASM instance:

    SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK 'path1','path2',...;
    
  2. Start the instance NOMOUNT by issuing the following RMAN command:

    RMAN> STARTUP FORCE NOMOUNT;
    
  3. Restore the control file from the surviving copy located in the recovery area:

    RMAN> RESTORE CONTROLFILE FROM 'recovery_area_controlfile';
    
  4. Start the instance MOUNT:

    RMAN> STARTUP FORCE MOUNT;
    
  5. Restore the database:

    RMAN> RESTORE DATABASE;
    
  6. Recover the database:

    RMAN> RECOVER DATABASE;
    
  7. If you use block change tracking, then disable and reenable the block change tracking file using SQL*Plus statements:

    SQL> ALTER DATABASE DISABLE BLOCK CHANGE TRACKING;
    SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
    
  8. Open the database:

    SQL> ALTER DATABASE OPEN;
    
  9. Re-create the log file members on the failed ASM disk group:

    SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
    SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
    
  10. Perform an incremental level 0 backup using the following RMAN command:

    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
    

4.2.5.4 Flash Recovery Area Disk Group Failure

When the flash recovery-area disk group fails, the database crashes because the control file member usually resides in the flash recovery area and Oracle requires that all control file members are accessible. The flash recovery area can also contain the flashback logs, redo log members, and all backup files.

Because the failure affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area.

A flash recovery area disk group failure typically occurs only when there have been multiple failures. For example, if the flash recovery-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple disk failures in a storage array may affect ASM and cause the disk group to go offline. Similarly, multiple disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.

Table 4-5 summarizes possible solutions when the flash recovery-area disk group fails.

Table 4-5 Recovery Options for Flash Recovery Area Disk Group Failure

Recovery Option Recovery Time Objective (RTO) Recovery Point Objective (RPO)

Local recovery

Five minutes or less

Zero

Data Guard failover or switchover

Five minutes or less

Zero


If you decide to perform local recovery:

Then you must perform a fast local restart: start the primary database after removing the control file member located in the failed flash recovery area, and point to a new flash recovery area for local archiving.

For a fast local restart, perform the following steps on the primary database:

  1. Change the CONTROL_FILES initialization parameter to specify only the members in the data area:

    ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=spfile;
    
  2. Change local archive destinations and the flash recovery area to the local redundant, scalable destination:

    ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+DATA' SCOPE=spfile;
    
  3. Start the database with the new settings:

    STARTUP MOUNT;
    
  4. If the flashback logs were damaged or lost, it may be necessary to disable and reenable Flashback Database:

    ALTER DATABASE FLASHBACK OFF;
    ALTER DATABASE FLASHBACK ON;
    ALTER DATABASE OPEN;
    

However, this is a temporary fix until you create a flash recovery area to replace the failed storage components. Oracle recommends following the procedure in the "Local Recovery Steps" section.

If you decide to perform a Data Guard role transition:

Then the RTO can be expressed in terms of seconds or minutes, depending on the presence of the Data Guard observer process and fast-start failover.

If the protection level is maximum performance or the standby database is unsynchronized with the primary database, then:

  1. Temporarily start the primary database by removing the control file member and pointing to a temporary flash recovery area (file system) in the SPFILE.

  2. Perform a Data Guard switchover to ensure no data loss.

  3. After the switchover has completed and the application is available, resolve the flash recovery area disk group failure.

  4. Shut down the affected database and continue by using the instructions in the "Local Recovery Steps" section to resolve the ASM disk group failure.
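
For illustration, steps 1 and 2 might be performed as follows. The control file path, the temporary flash recovery area location, and the database name sales_stby are all placeholders:

    SQL> ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=SPFILE;
    SQL> ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='/u01/temp_fra' SCOPE=SPFILE;
    SQL> SHUTDOWN ABORT;
    SQL> STARTUP;

    DGMGRL> SWITCHOVER TO sales_stby;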

Local Recovery Steps

Note:

If you performed an Oracle Data Guard failover to a new primary database, then you cannot use this procedure to reintroduce the original primary database as a standby database. This is because Flashback Database log files that are required as part of reintroducing the database have been lost. You must perform a full reinstatement of the standby database.
  1. Replace or get access to storage that can be used as a flash recovery area.

  2. Rebuild the ASM disk group using the storage location by issuing the following SQL*Plus statement:

    SQL> CREATE DISKGROUP RECO NORMAL REDUNDANCY DISK 'path1','path2',...;
    
  3. Start the instance NOMOUNT using the following RMAN command:

    RMAN> STARTUP FORCE NOMOUNT;
    
  4. Restore the control file from the surviving copy located in the data area:

    RMAN> RESTORE CONTROLFILE FROM 'data_area_controlfile';
    
  5. Start the instance MOUNT:

    RMAN> STARTUP FORCE MOUNT;
    
  6. If you use Flashback Database, then disable it with the following SQL*Plus statement:

    SQL> ALTER DATABASE FLASHBACK OFF;
    
  7. Open the database and allow instance recovery to complete:

    SQL> ALTER DATABASE OPEN;
    
  8. Issue the following statements only if Flashback Database is required:

    SQL> SHUTDOWN IMMEDIATE;
    SQL> STARTUP MOUNT;
    SQL> ALTER DATABASE FLASHBACK ON;
    SQL> ALTER DATABASE OPEN;
    
  9. Re-create the log file members on the failed ASM disk group:

    SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename';
    SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
    
  10. Synchronize the control file and the flash recovery area using the following RMAN commands:

    RMAN> CATALOG RECOVERY AREA;
    RMAN> CROSSCHECK ARCHIVELOG ALL;
    RMAN> CROSSCHECK BACKUPSET;
    RMAN> CROSSCHECK DATAFILECOPY ALL;
    RMAN> LIST EXPIRED BACKUP;
    RMAN> DELETE EXPIRED BACKUP;
    RMAN> LIST EXPIRED COPY;
    RMAN> DELETE EXPIRED COPY;
    

  11. Assuming that data has been lost in some way, perform a backup:

    RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
    

4.2.6 Recovering from Data Corruption (Data Failures)

Recovering from data corruption is an unscheduled outage scenario. Data corruption is usually—but not always—caused by some activity or failure that occurs outside the database, even though the problem might be evident within the database.

Data corruption in data files has two categories:

  • Data file block corruption

    A corrupt data file block can be accessed, but the contents in the block are invalid or inconsistent. The typical cause of data file corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.

    The database usually remains available when corrupt blocks have been detected, but some corrupt blocks might cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.

    A data fault is detected when the user, administrator, RMAN backup, or application recognizes it because it has affected the availability of the application. For example:

    • A single corrupt data block in a user table that cannot be read by the application because of a bad spot on the physical disk

    • A single corrupt data block because of block inconsistencies detected by Oracle. The block is marked corrupted and any application accessing the block receives an ORA-1578 error.

    • A database that automatically shuts down because of the invalid blocks of a data file in the SYSTEM tablespace caused by a failing disk controller

  • Media failure

    This category of data corruption results from a physical hardware problem or user error. The system cannot successfully read or write to a file that is necessary to operate the database.

RMAN block media recovery provides the highest application availability if targeted blocks are not critical to application functionality. Data Guard switchover or failover to standby database provides the fastest predictable RTO.

Other outages that result in database objects becoming unavailable or inconsistent are caused by human error, such as dropping a table or erroneously updating table data. Information about recovering from human error can be found in Section 4.2.7, "Recovering from Human Error".

If the data corruption affects nondata files, then the repair may be slightly different. Table 4-6 provides a matrix of the key nondatabase object corruptions and the recommended repairs.

Table 4-6 Non Database Object Corruption and Recommended Repair

Object or Component Affected Impact Repair

Any control file

Database fails

Data Guard fast-start failover automatically fails over to the standby database.

Redo log member

None

  1. Investigate failure and check system.

  2. Drop and re-create redo log member.

Active redo log group

Database fails

  1. If the database is still running and the lost active redo log is not the current log, then issue the ALTER SYSTEM CHECKPOINT statement.

  2. If checkpoint is possible, then clear the redo log group.

  3. If a checkpoint is not possible, use a solution described in the next table row (Active or current redo log group that is still needed for crash recovery).

See Also: Oracle Database Backup and Recovery User's Guide for more information about recovering from the loss of active logs.

Active or current redo log group that is still needed for crash recovery

Database fails

Use one of the following solutions:

  • Oracle Data Guard failover.

  • Flashback Database—either flashback the database to a consistent time and then issue an OPEN RESETLOGS statement, or flashback to a time before the damaged log.

  • Begin incomplete media recovery, recovering up through the log before the damaged log.

    Ensure that the current name of the lost redo log can be used for a newly created file. If not, then rename the members of the damaged online redo log group to a new location. For example, enter the following:

    ALTER DATABASE RENAME FILE "/disk1/oradata/trgt/redo01.log" TO "/tmp/redo01.log";
    
    ALTER DATABASE RENAME FILE "/disk1/oradata/trgt/redo02.log" TO "/tmp/redo02.log";
    
    

    Then, open the database using the OPEN RESETLOGS option:

    ALTER DATABASE OPEN RESETLOGS;
    

Incomplete media recovery and Flashback Database are equivalent solutions. However, Flashback Database is typically faster.

Archived redo log file (see Footnote 1)

None

  1. Create a database backup.

  2. If the standby database did not already receive redo data through the LGWR SYNC and ASYNC transport, then refresh the standby database by applying an incremental backup, by re-creating the standby database from the primary database, or by using a backup of the primary database.

SPFILE

None

Restore SPFILE from a backup and revise it.


Footnote 1 Assumes loss or corruption of all copies of an archived redo log (if multiple archive destinations were specified).

You can resolve a data corruption outage by using any of following methods:

4.2.6.1 Use Data Recovery Advisor to Automate Failure Diagnosis and Repair

Data Recovery Advisor automatically diagnoses data failures, determines and presents appropriate repair options, and executes repairs at the user's request. By automating data repair, Data Recovery Advisor improves the manageability and reliability of an Oracle database and thus helps reduce the MTTR.

Note:

The initial release of Data Recovery Advisor does not support Oracle RAC. In addition, while you can use Data Recovery Advisor when managing a primary database in a Data Guard configuration, you cannot use Data Recovery Advisor to troubleshoot a physical standby database. Data Recovery Advisor only takes the presence of a standby database into account when recommending repair strategies if you are using Enterprise Manager 10g Grid Control.

Data Recovery Advisor has both a command-line and a GUI interface. The GUI interface is available in Oracle Enterprise Manager Database Control and Grid Control. To navigate to the recovery page in the GUI, you can click Perform Recovery in the Availability tab of the Database Home page. In the RMAN command-line interface, the Data Recovery Advisor commands include LIST FAILURE, ADVISE FAILURE, REPAIR FAILURE, and CHANGE FAILURE.
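
A command-line session typically follows this pattern: diagnose the failures, review the advised repair, optionally preview the repair script, and then run it:

    RMAN> LIST FAILURE;
    RMAN> ADVISE FAILURE;
    RMAN> REPAIR FAILURE PREVIEW;
    RMAN> REPAIR FAILURE;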

See Also:

The chapter about "Diagnosing and Repairing Failures with Data Recovery Advisor" in Oracle Database Backup and Recovery User's Guide

4.2.6.2 Use Data Guard to Recover From Data Corruption and Data Failure

Failover is the operation of transitioning a standby database to the primary database role. A database switchover is a planned transition in which a standby database and the primary database switch roles. Either of these operations can occur in less than five minutes and with no data loss. Use Data Guard switchover or failover for data corruption or data failure when:

  • The database is down or when the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.

  • Recovering locally takes longer than the business service-level agreement or RTO.

4.2.6.3 Use RMAN Block Media Recovery

Block media recovery recovers one block or a set of data blocks marked "media corrupt" in a data file by using the RMAN RECOVER ... BLOCK command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole data files. This results in a lower RTO because only blocks that need recovery are restored, and only the necessary corrupt blocks are recovered. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.

Use block media recovery when:

  • A small number of blocks require media recovery and you know which blocks need recovery. If a significant portion of the data file is corrupt, or if the amount of corruption is unknown, then use a different recovery method.

  • Blocks are marked corrupt (you can verify this with the RMAN VALIDATE CHECK LOGICAL command).

  • The backup file for the corrupted data file is available locally or can be retrieved from a remote location, including from a physical standby database.
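
For example, you can validate the whole database for physically and logically corrupt blocks, and then list any corrupt blocks that were found:

    RMAN> VALIDATE CHECK LOGICAL DATABASE;

    SQL> SELECT FILE#, BLOCK#, BLOCKS, CORRUPTION_TYPE
           FROM V$DATABASE_BLOCK_CORRUPTION;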

Note:

When using block media recovery on databases running a release before release 11.1.0.7, the backup for the corrupted data file must be located on the primary database. If the backup is located on a standby database, you must move the backup to the primary database.

Do not use block media recovery to recover from the following outages:

  • User error or software bugs that cause logical corruption where the data blocks are intact. See Section 4.2.7, "Recovering from Human Error" for additional details for this type of recovery.

  • Changes caused by corrupt redo data. Block media recovery requires that all available redo data be applied to the blocks being recovered. If the redo data is corrupted, the suspicious changes are reintroduced during the recovery process.

For example, to recover a specific corrupt block using RMAN block media recovery:

RMAN> RECOVER DATAFILE 7 BLOCK 3;

When corruption is detected, you can also repair the affected block through Grid Control.

Note:

When using Active Data Guard, block media recovery performs as follows:
  • When performing user-initiated block media recovery at a primary database, it attempts to fetch a good copy of the blocks from the standby before searching backups for the blocks.

  • When performing user-initiated block media recovery at a standby database, it attempts to fetch a good copy of the blocks from the primary before searching backups for the blocks.

4.2.6.4 Use RMAN Data File Media Recovery

Data file media recovery recovers an entire data file or set of data files for a database by using the RMAN RECOVER command. When a large or unknown number of data blocks are marked "media corrupt" and require media recovery, or when an entire file is lost, you must restore and recover the applicable data files.

Use RMAN file media recovery when the following conditions are true:

  • The number of blocks requiring recovery is large. Use the RMAN VALIDATE CHECK LOGICAL command to find out exactly how many blocks are corrupted.

  • Block media recovery is not available (for example, if incomplete recovery is required).
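
For example, assuming data file 7 is damaged and the database is open (the file number is illustrative; this approach does not apply to SYSTEM data files, which require the database to be mounted rather than open):

    SQL> ALTER DATABASE DATAFILE 7 OFFLINE;

    RMAN> RESTORE DATAFILE 7;
    RMAN> RECOVER DATAFILE 7;

    SQL> ALTER DATABASE DATAFILE 7 ONLINE;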

See Also:

"Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery User's Guide and the white paper titled "Using Recovery Manager with Oracle Data Guard in Oracle Database 10g" available at http://www.oracle.com/technology/deploy/availability/pdf/RMAN_DataGuard_10g_wp.pdf

4.2.6.5 Re-Create Objects Manually

Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.

Use manual object re-creation when:

  • You must re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently.

  • You must re-create a look-up table or when the scripts to re-create the table are readily available. Dropping and re-creating the table might be the fastest option.

4.2.7 Recovering from Human Error

Oracle Flashback technology revolutionizes data recovery. Before Flashback technology, it took seconds to damage a database but from hours to days to recover it. With Flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing human errors that require rewinding the database, table, transaction, or row level changes to a previous point in time is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, Flashback technology is exponentially faster than a database restoration.

Flashback technologies are applicable only to repairing the following human errors:

  • Erroneous or malicious update, delete or insert transactions

  • Erroneous or malicious DROP TABLE statements

  • Erroneous or malicious batch job or wide-spread application errors

Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See Section 4.2.6, "Recovering from Data Corruption (Data Failures)" and Section 4.2.2, "Database Failover with a Standby Database" to repair these outages.

Table 4-7 summarizes the Flashback solutions for outages varying in scope from destroying a row (such as through a bad update) to destroying a whole database (such as by deleting all the underlying files at the operating system level).

Table 4-7 Flashback Solutions for Different Outages

Outage Scope Examples of Human Errors Flashback Solutions See Also

Row or transaction

Accidental deletion of row

Erroneous transaction

Flashback Query

Flashback Version Query

Flashback Transaction Query

Flashback Transaction

See Also: "Resolving Row and Transaction Inconsistencies"

Table

Dropped table

Erroneous transactions affecting one table or a set of tables

Flashback Drop

Flashback Table

See Also: "Resolving Table Inconsistencies"

Tablespace or database

Erroneous batch job affecting many tables or an unknown set of tables

Series of database-wide malicious transactions

Enable Flashback Database or use multiple Flashback Table commands

See Also: "Resolving Database-Wide Inconsistencies"

Single tablespace or a subset of tablespaces

Erroneous transactions affecting a small number of tablespaces

RMAN Tablespace Point-in-Time Recovery (TSPITR)

See Also: "Resolving One or More Tablespace Inconsistencies"


Table 4-8 summarizes each Flashback feature.

Table 4-8 Summary of Flashback Features

Flashback Feature Description Changes are propagated to ...

Flashback Query

Flashback Query enables you to view data at an earlier point in time. You can use it to view and reconstruct lost data that was deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors.

Physical and logical standby databases

Flashback Version Query

Flashback Version Query uses undo data stored in the database to view the changes to one or more rows along with all the metadata of the changes.

Physical and logical standby databases

Flashback Transaction Query

Flashback Transaction Query enables you to examine changes to the database at the transaction level. As a result, you can diagnose problems, perform analysis, and audit transactions.

Physical and logical standby databases

Flashback Transaction

Flashback Transaction provides a way to roll back one or more transactions and their dependent transactions, while the database remains online.

Physical and logical standby databases

Flashback Drop

Flashback Drop provides a way to restore accidentally dropped tables.

Physical standby databases

Flashback Table

Flashback Table enables you to quickly recover a table to an earlier point in time without restoring a backup.

Physical and logical standby databases

Flashback Database

Flashback Database enables you to quickly return the database to an earlier point in time by undoing all of the changes that have taken place since that time. This operation is fast because you do not have to restore the backups.

Physical and logical standby databases


Flashback Database uses the Oracle Database flashback logs, while all other Flashback features use Oracle Database's unique undo and multiversion read consistency capabilities. See the configuration best practices for the database in Section 2.2, "Configuring Oracle Database 11g", for configuring Flashback technologies to ensure that the resources these solutions require are available at the time of failure.

See Also:

Oracle Database Administrator's Guide, Oracle Database Backup and Recovery User's Guide, and Oracle Database Concepts for more information about Flashback technology and automatic undo management

In general, the recovery time when using Flashback technologies is equivalent to the time it takes to cause the human error plus the time it takes to detect the human error.

Flashback technologies allow recovery up to the point that the human error occurred.

Use the following recovery methods:

4.2.7.1 Resolving Table Inconsistencies

Dropping or deleting database objects by accident is a common mistake. Users soon realize the mistake, but by then it is too late: historically there was no easy way to recover a dropped table and its indexes, constraints, and triggers. Objects, once dropped, were gone forever. The loss of an important table or other object (such as an index, partition, or cluster) required the DBA to perform point-in-time recovery, which can be time-consuming and can lose recent transactions.

Oracle provides the following statements to help resolve table inconsistencies:

  • Flashback Table statement to restore a table to a previous point in time

  • Flashback Drop statement to recover from an accidental DROP TABLE statement

  • Flashback Transaction statement to roll back one or more transactions and their dependent transactions, while the database remains online

Flashback Table

Flashback Table provides the ability to quickly recover a table or a set of tables to a specified point in time. In many cases, Flashback Table alleviates the more complicated point-in-time recovery operations. For example:

FLASHBACK TABLE orders, order_items 
      TO TIMESTAMP 
      TO_DATE('28-Jun-08 14:00:00','DD-Mon-YY HH24:MI:SS');

This statement rewinds any updates to the ORDERS and ORDER_ITEMS tables that have been done between the current time and a specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
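
Note that flashing back a table to a timestamp or SCN requires row movement to be enabled on each affected table; otherwise, the FLASHBACK TABLE statement fails. For example:

ALTER TABLE orders ENABLE ROW MOVEMENT;
ALTER TABLE order_items ENABLE ROW MOVEMENT;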

Flashback Drop

Flashback Drop provides a safety net when dropping objects in Oracle Database 10g or later releases. When a user drops a table, Oracle places it in a recycle bin. Objects remain in the recycle bin until the user decides to permanently remove them or until space pressure occurs in the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users can view the recycle bin and undrop a dropped table and its dependent objects. For example, the employees table and all its dependent objects would be undropped by the following statement:

FLASHBACK TABLE employees TO BEFORE DROP;
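
Before undropping, you can inspect the recycle bin to confirm that the table is still available. If a new table with the same name has been created since the drop, append a RENAME TO clause; the name employees_old below is illustrative:

SELECT OBJECT_NAME, ORIGINAL_NAME, DROPTIME
  FROM USER_RECYCLEBIN;

FLASHBACK TABLE employees TO BEFORE DROP RENAME TO employees_old;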

Flashback Transaction

Oracle Flashback Transaction increases availability during logical recovery by easily and quickly backing out a specific transaction or set of transactions and their dependent transactions, while the database remains online.

Use the DBMS_FLASHBACK.TRANSACTION_BACKOUT() PL/SQL procedure to roll back a transaction and its dependent transactions. This procedure uses undo data to create and execute the compensating transactions that return the affected data to its pre-transaction state.
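
A minimal PL/SQL sketch follows, assuming the transaction ID was obtained from FLASHBACK_TRANSACTION_QUERY (the XID value here is illustrative). The CASCADE option also backs out dependent transactions:

DECLARE
  xids SYS.XID_ARRAY;
BEGIN
  xids := SYS.XID_ARRAY(HEXTORAW('0009000500000089'));
  DBMS_FLASHBACK.TRANSACTION_BACKOUT(
    numtxns => 1,
    xids    => xids,
    options => DBMS_FLASHBACK.CASCADE);
END;
/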

4.2.7.2 Resolving Row and Transaction Inconsistencies

Resolving row and transaction inconsistencies might require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the compensating SQL statements constructed from undo statements to rectify the problem. This section describes a general approach using a human resources example to resolve row and transaction inconsistencies caused by erroneous or malicious user errors.

Flashback Query

Flashback Query enables an administrator or user to query any data from some earlier point in time. Use this feature to view and reconstruct data that might have been deleted or changed by accident.

Developers can use Flashback Query to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, and freeing database administrators from having to perform this task. Flashback Query is easy to manage because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.

The following partial statement displays rows from the EMPLOYEES table starting from 2:00 p.m. on June 28, 2008.

SELECT * FROM EMPLOYEES 
       AS OF TIMESTAMP 
       TO_DATE('28-Jun-08 14:00','DD-Mon-YY HH24:MI')
WHERE ...

Flashback Version Query

Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:

SELECT * FROM EMPLOYEES
       VERSIONS BETWEEN TIMESTAMP
       TO_DATE('28-Jun-08 14:00','dd-Mon-YY hh24:mi') AND
       TO_DATE('28-Jun-08 15:00','dd-Mon-YY hh24:mi')
WHERE ...

This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. on June 28, 2008. A database administrator can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the database administrator to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.

Flashback Transaction Query

Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:

SELECT UNDO_SQL
FROM FLASHBACK_TRANSACTION_QUERY 
WHERE XID = '000200030000002D';

This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like Flashback Transaction Query, the database administrator and application developer can precisely diagnose and correct logical problems in the database or application.

Consider a human resources (HR) example involving the SCOTT schema. The HR manager reports to the database administrator that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he instructed his staff to reset the salary to the previous level of $1250. This was completed around 9:15 a.m.

The following steps show how to approach the problem.

  1. Assess the problem.

    Fortunately, the HR manager has provided information about the time when the change occurred. You can query the information as it was at 9:00 a.m. using Flashback Query.

        SELECT EMPNO, ENAME, SAL
        FROM EMP
        AS OF TIMESTAMP TO_DATE('28-JUN-08 09:00','dd-Mon-yy hh24:mi')
        WHERE ENAME = 'WARD';
    
             EMPNO ENAME             SAL
        ---------- ---------- ----------
              7521 WARD             1875
    

    You can confirm that you have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, you can now use the employee number for subsequent investigation.

  2. Query previous rows or versions of the data to acquire transaction information.

    Although it is possible to restrict the row version information to a specific date or SCN range, you might want to query all the row information that is available for the employee WARD using Flashback Version Query.

    SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME, VERSIONS_XID
        FROM EMP
        VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE
        WHERE EMPNO = 7521
        ORDER BY NVL(VERSIONS_STARTSCN,1);
    
    EMPNO ENAME  SAL  VERSIONS_STARTTIME     VERSIONS_ENDTIME      VERSIONS_XID
    ----- ------ ---  ---------------------- -------------------- ---------------
     7521 WARD  1250  28-JUN-08 08.48.43 AM  28-JUN-08 08.54.49 AM 0006000800000086
     7521 WARD  1875  28-JUN-08 08.54.49 AM  28-JUN-08 09.10.09 AM 0009000500000089
     7521 WARD  1250  28-JUN-08 09.10.09 AM                        000800050000008B
    

    You can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09.

    Also, you can see that the ID of the erroneous transaction that increased WARD's salary to $1875 was "0009000500000089".

  3. Query the erroneous transaction and the scope of its effect.

    With the transaction information (VERSIONS_XID pseudocolumn), you can now query the database to determine the scope of the transaction, using Flashback Transaction Query.

        SELECT UNDO_SQL
        FROM FLASHBACK_TRANSACTION_QUERY
        WHERE XID = HEXTORAW('0009000500000089');
    
        UNDO_SQL                                                                    
        ----------------------------------------------------------------------------
        update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL';      
        update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ';     
        update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF';      
        update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE';    
        update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB';     
                                                                                    
        6 rows selected.
    

    You can see that WARD's salary was not the only change that occurred in the transaction. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.

  4. Determine if the corrective statements should be executed.

    If the HR manager decides that the corrective changes suggested by the UNDO_SQL column are correct, then the database administrator can execute the statements individually.

  5. Query the FLASHBACK_TRANSACTION_QUERY view for additional transaction information. For example, to determine the user that performed the erroneous update, issue the following query:

    SELECT LOGON_USER FROM FLASHBACK_TRANSACTION_QUERY
    WHERE XID = HEXTORAW('0009000500000089');
    
    LOGON_USER
    ----------------------------
    MSMITH
    

    In this example, the query shows that the user MSMITH was responsible for the erroneous transaction.

4.2.7.3 Resolving Database-Wide Inconsistencies

To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from a backup and then recovered to the point in time just before the error was introduced. With database sizes constantly growing, the restore step alone can dominate the repair time.

Flashback Database provides a much faster way to perform point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct problems caused by logical data corruption or user error. Flashback logs capture old versions of changed blocks; when recovery must be performed, the flashback logs are replayed to restore the database to a point in time before the error, and only the changed blocks are restored. This reduces recovery time from hours to minutes. In addition, it is easy to use: a database can be recovered to 2:05 p.m., for example, by issuing a single statement. Before the database can be flashed back, all instances of the database must be shut down and one instance subsequently mounted. The following is an example of a FLASHBACK DATABASE statement.

FLASHBACK DATABASE TO TIMESTAMP SYSDATE-1;

No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a primary database, a physical standby database, or a logical standby database.

These steps are recommended for using Flashback Database:

  1. Determine the time or the SCN to which to flash back the database.

  2. Verify that there is sufficient flashback log information.

         SELECT OLDEST_FLASHBACK_SCN, 
           TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') 
           FROM V$FLASHBACK_DATABASE_LOG;
    
  3. Flash back the database to a specific time or SCN. (The database must be mounted to perform a Flashback Database.)

    FLASHBACK DATABASE TO SCN scn;
    

    or

    FLASHBACK DATABASE TO TIMESTAMP TO_DATE(date_string, format);
    
  4. Open the database in read-only mode to verify that it is in the correct state.

    ALTER DATABASE OPEN READ ONLY;
    

    If more flashback data is required, then issue another FLASHBACK DATABASE statement. (The database must be mounted to perform a Flashback Database.)

    If you want to move forward in time, then issue a statement similar to the following:

    RECOVER DATABASE UNTIL [TIME date | CHANGE scn];
    
  5. Open the database:

    ALTER DATABASE OPEN RESETLOGS;
    

Other considerations when using Flashback Database are as follows:

  • If there are not sufficient flashback logs to flash back to the target time, then use one of the following alternatives:

    • Use Data Guard: recover the standby database to the target time if it lags behind the primary database, or flash the standby back to the target time if there are sufficient flashback logs on the standby.

    • Restore from backups.

  • After flashing back a database, any dependent database such as a standby database must be flashed back. See Section 4.3, "Restoring Fault Tolerance".

Flashback Database does not automatically fix a dropped tablespace, but it can be used to significantly reduce the downtime. You can flash back the primary database to a point before the tablespace was dropped and then restore a backup of the corresponding data files from the affected tablespace and recover to a time before the tablespace was dropped. See support note 783471.1 for a step-by-step procedure you can use to repair a dropped tablespace.
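The flow of this repair can be sketched as follows. This is illustrative only, with a hypothetical SCN and tablespace name; see support note 783471.1 for the authoritative step-by-step procedure.

```sql
-- Sketch: the USERS tablespace was dropped shortly after SCN 1799999
-- (names and numbers are hypothetical).
SHUTDOWN IMMEDIATE;
STARTUP MOUNT;

-- Rewind the rest of the database to just before the drop.
FLASHBACK DATABASE TO SCN 1799999;

-- Restore the data files of the dropped tablespace from a backup (for
-- example, with RMAN), recover them to the same point, then open:
ALTER DATABASE OPEN RESETLOGS;
```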

4.2.7.4 Resolving One or More Tablespace Inconsistencies

Recovery Manager (RMAN) automatic tablespace point-in-time recovery (TSPITR) enables you to quickly recover one or more tablespaces in a database to an earlier time without affecting the rest of the tablespaces and objects in the database. You can only use TSPITR on tablespaces whose data is completely segregated from the rest of the database. This usually means that TSPITR is something for which you must plan in advance.

RMAN TSPITR is most useful for the following situations:

  • To recover a logical database to a point different from the rest of the physical database, when multiple logical databases exist in separate tablespaces of one physical database. For example, you maintain logical databases in the Orders and Personnel tablespaces. An incorrect batch job or DML statement corrupts the data in only one tablespace.

  • To recover data lost after DDL operations that change the structure of tables. You cannot use Flashback Table to rewind a table to before the point of a structural change such as a truncate table operation.

  • To recover a table after it has been dropped with the PURGE option.

  • To recover from the logical corruption of a table.

You perform TSPITR by using the RMAN RECOVER TABLESPACE command.
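For example, an RMAN TSPITR run might look like the following sketch; the tablespace name, target time, and auxiliary destination are hypothetical. RMAN automatically creates and removes the auxiliary instance it uses for the recovery.

```sql
RMAN> RECOVER TABLESPACE orders
        UNTIL TIME "TO_DATE('28-JUN-08 08:54','dd-Mon-yy hh24:mi')"
        AUXILIARY DESTINATION '/u01/aux';

# The recovered tablespace is left offline; back it up, then bring it online.
RMAN> BACKUP TABLESPACE orders;
RMAN> SQL "ALTER TABLESPACE orders ONLINE";
```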

See Also:

Oracle Database Backup and Recovery User's Guide for detailed information about performing RMAN TSPITR

4.2.8 Recovering Databases in a Distributed Environment

Some applications may update multiple databases and participate in distributed transactions, so global consistency between the participating databases may be expected and crucial to the application. If one database in a distributed database environment requires recovery to an earlier time, it is often necessary to recover all other databases in the configuration to the same point in time when the application requires global data consistency. To achieve coordinated, time-based, distributed database recovery, perform the following steps:

  1. Recover the database that requires the recovery operation using time-based recovery.

    For example, if a database must be recovered because of a media failure, then recover this database first using time-based recovery. Do not recover the other databases at this point.

  2. After you have recovered the database and opened it with the RESETLOGS option, search the alert_SID.log of the database for the RESETLOGS message. Your next step depends on the message that you find in the log file, as described in the following table:

    If the message returned is ... Then ...

    "RESETLOGS after complete recovery through change nnn"

    Recovery is complete. You have applied all the changes in the database and performed complete recovery. Do not recover any of the other databases in the distributed system, because this unnecessarily removes database changes.

    "RESETLOGS after incomplete recovery UNTIL CHANGE nnn"

    You have successfully performed an incomplete recovery. Record the change number from the message and proceed to the next step.

  3. Recover or flash back all other databases in the distributed database system using change-based recovery, specifying the change number (SCN) that you recorded in Step 2.

    Note:

    If a database that is participating in distributed transactions fails, in-doubt distributed transactions may exist in the participating databases. If the failed database recovers completely and communication resumes between the databases, then the in-doubt transactions are automatically resolved by the Oracle recoverer (RECO) process. If you cannot wait until the failed database becomes available, you can also manually commit or roll back in-doubt transactions.
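The change-based recovery in Step 3, and the manual resolution of an in-doubt transaction mentioned in the note, can be sketched as follows. The SCN and local transaction ID shown are hypothetical.

```sql
-- On each other database in the distributed system, recover to the SCN
-- recorded in Step 2:
RECOVER DATABASE UNTIL CHANGE 3042719;
ALTER DATABASE OPEN RESETLOGS;

-- Or, if sufficient flashback logs exist, flash back instead:
-- FLASHBACK DATABASE TO SCN 3042719;

-- To resolve an in-doubt transaction manually, find its local transaction
-- ID in DBA_2PC_PENDING, then force a commit (or use ROLLBACK FORCE):
SELECT LOCAL_TRAN_ID, STATE FROM DBA_2PC_PENDING;
COMMIT FORCE '1.21.17';
```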


4.3 Restoring Fault Tolerance

Whenever a component in a high availability architecture fails, the full protection—or fault tolerance—of the architecture is compromised, and possible single points of failure exist until the component is repaired. Restoring the high availability architecture to full fault tolerance to reestablish full Oracle RAC, Data Guard, or MAA protection requires repairing the failed component. While full fault tolerance might be sacrificed during planned downtime, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. For unplanned downtime, however, the risk of exposure to a single point of failure must be clearly understood.

This section provides the following topics that describe the steps needed to restore database fault tolerance:

4.3.1 Restoring Failed Nodes or Instances in Oracle RAC

Ensuring that application services fail over quickly and automatically in an Oracle RAC cluster—or between primary and secondary sites—is important when planning for both scheduled and unscheduled outages. To ensure that the environment is restored to full fault tolerance after any errors or issues are corrected, it is also important to understand the steps and processes for restoring failed instances or nodes within an Oracle RAC cluster or databases between sites.

Adding a failed node back into the cluster or restarting a failed Oracle RAC instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, you should also consider:

  • When to perform these tasks to incur minimal or no effect on the current running environment

  • Resetting network components (such as a load balancer) that were modified for failover and now must be reset

  • Failing back or rebalancing existing connections

After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the Oracle RAC environment at any time. Processing to complete the reconfiguration of a node may require additional system resources.

Table 4-9 summarizes additional processing that may be required when adding a node.

Table 4-9 Additional Processing When Restarting or Rejoining a Node or Instance

Action Additional Resources

Restarting a node or rejoining a node into a cluster

When using only Oracle Clusterware, there is no impact when a node joins the cluster.

When using vendor clusterware, there may be performance degradation while reconfiguration occurs to add a node back into the cluster. The impact on current applications should be evaluated with a full test workload.

Restarting or rejoining an Oracle RAC instance

When you restart an Oracle RAC instance, there might be some potential performance impact while lock reconfiguration takes place. The impact on current applications is usually minimal, but it should be evaluated with a full test workload.



Use the following recovery methods:

4.3.1.1 Recovering Service Availability

After a failed node has been brought back into the cluster and its instance has been started, Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance. A particular service might or might not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.

For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.
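If you do want the HR service back on its preferred instance after instance B has been restored, you can relocate the service manually with SRVCTL. The following is a sketch; the database name orcl and the instance names are hypothetical.

```shell
# Move the HR service from the available instance (orcl3) back to the
# restored preferred instance (orcl2), then confirm its placement.
srvctl relocate service -d orcl -s HR -i orcl3 -t orcl2
srvctl status service -d orcl -s HR
```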

Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.

4.3.1.2 Considerations for Client Connections After Restoring an Oracle RAC Instance

After an Oracle RAC instance has been restored, additional steps might be required, depending on the current resource usage and system performance, the application configuration, and the network load balancing that has been implemented.

Existing connections (which might have failed over or started as new sessions) on the surviving Oracle RAC instances are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users might or might not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable response times for the workload. If the surviving Oracle RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it might be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.

Connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the connections are automatically load-balanced over time.

An application service can be:

  • Managed with services running on a subset of Oracle RAC instances

  • Nonpartitioned so that all services run equally across all nodes

This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For cases where an application is partitioned or has a combination of partitioning and nonpartitioning, you should consider the response time and availability aspects for each service. If redistribution or failback of connections for a particular service is required, then:

  • You can rebalance workloads automatically using Oracle Universal Connection Pool (UCP). If you are using UCP, then connections are automatically redistributed to the new node.

  • You can rebalance workloads manually with the DBMS_SERVICE.DISCONNECT_SESSION PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running.
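For example, a manual redistribution of the sessions for a given service might look like the following sketch; the service name SALES is hypothetical. Disconnected clients that are configured for failover reconnect and are placed by connection load balancing.

```sql
BEGIN
  -- Disconnect all sessions currently attached to the SALES service
  -- on this instance so that they can reconnect elsewhere.
  DBMS_SERVICE.DISCONNECT_SESSION('SALES');
END;
/
```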

Note:

Oracle Universal Connection Pool (UCP) provides fast and automatic detection of connection failures and removes terminated connections for any Java application using Fast Connection Failover and FAN events.

For load-balancing application services across multiple Oracle RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. This feature does not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there might be limitations in distinguishing separate application services (which is understood by Oracle Net Services) and restoring an instance or a node. For example, when a node or instance is restored and available to start receiving connections, a manual step might be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.
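A client-side tnsnames.ora entry that uses connect-time failover and client load balancing across two node VIPs might look like the following sketch; the host names and service name are hypothetical.

```
SALES =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = ON)
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = node1-vip)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = node2-vip)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVICE_NAME = SALES))
  )
```

When a restored node's VIP comes back online, new connections can reach it through this entry without any client-side reconfiguration.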

Table 4-10 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or are a combination of both. The actual redistribution of existing connections might or might not be required depending on the resource utilization and response times.

Table 4-10 Restoration and Connection Failback

Application Services Failback or Restore Existing Connections Failback or Restore New Connections

Partitioned

Existing sessions are not automatically relocated back to the restored instance. Use DBMS_SERVICE.DISCONNECT_SESSION to manually disconnect sessions and allow them to be reestablished on the restored instance.

Automatically routes to the restored instance by using the Oracle Net Services configuration.

Nonpartitioned

No action is necessary unless the load must be rebalanced, because the newly restored instance carries little load. If the load must be rebalanced, then the same considerations apply as when application services are partitioned.

Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration


Figure 4-6 shows a two-node partitioned Oracle RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.

Figure 4-6 Partitioned Two-Node Oracle RAC Database

Description of Figure 4-6 follows
Description of "Figure 4-6 Partitioned Two-Node Oracle RAC Database"

Figure 4-7 shows what happens when one Oracle RAC instance fails.

Figure 4-7 Oracle RAC Instance Failover in a Partitioned Database

Description of Figure 4-7 follows
Description of "Figure 4-7 Oracle RAC Instance Failover in a Partitioned Database"

If one Oracle RAC instance fails, then the service and existing client connections can be automatically failed over to another Oracle RAC instance. In this example, the HR and Sales services are both supported by the remaining Oracle RAC instance. In addition, you can route new client connections for the Sales service to the instance now supporting this service.

After the failed instance has been repaired and restored to the state shown in Figure 4-6 and the Sales service is relocated to the restored instance, you might need to identify and fail back any failed-over clients and any new clients that had connected to the Sales service on the failed-over instance. Client connections that started after the instance has been restored should automatically connect back to the original instance. Therefore, over time, as older connections disconnect and new sessions connect to the Sales service, the client load migrates back to the restored instance. Whether to rebalance the load immediately after restoration depends on the resource utilization and application response times.

Figure 4-8 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.

Figure 4-8 Nonpartitioned Oracle RAC Instances

Description of Figure 4-8 follows
Description of "Figure 4-8 Nonpartitioned Oracle RAC Instances"

If one Oracle RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available Oracle RAC instances that offer that service.

After the failed instance has been repaired and restored to the state shown in Figure 4-8, some clients might have to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, this is necessary only if a single-instance database is not able to adequately service the requests.

Client connections that started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load evenly balances again across all available Oracle RAC instances. Rebalancing the load immediately after restoration depends on the resource usage and application response times.

4.3.2 Restoring a Standby Database After a Failover

Following unplanned downtime on a primary database that requires a failover, full fault tolerance is compromised until the standby database is reestablished. Full database protection should be restored as soon as possible. The steps for restoring fault tolerance differ slightly between physical and logical standby databases.

Reinstating databases is automated if you are using Data Guard fast-start failover. After a fast-start failover completes, the observer automatically attempts to reinstate the original primary database as a standby database. Reinstatement restores high availability to the broker configuration so that, in the event of a failure of the new primary database, another fast-start failover can occur. The reinstated database can act as the fast-start failover target for the primary database, making a subsequent fast-start failover possible. The standby database is a viable target of a failover when it begins applying redo data received from the new primary database. If you want to prevent automatic reinstatement (for example, to perform diagnostic or repair work after failover has completed), set the FastStartFailoverAutoReinstate configuration property to FALSE.

The FastStartFailoverAutoReinstate configuration property controls whether the observer automatically reinstates the original primary database after a fast-start failover that occurred because the primary database was isolated for longer than the number of seconds specified by the FastStartFailoverThreshold property. In some cases, automatic reinstatement might not be wanted until further diagnostic or recovery work is done.
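For example, automatic reinstatement can be disabled from the DGMGRL command line before diagnostic work begins. This is a sketch and assumes a broker connection with SYSDBA privileges.

```
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverAutoReinstate = FALSE;
DGMGRL> SHOW CONFIGURATION VERBOSE;
```

Setting the property back to TRUE reenables automatic reinstatement when the diagnostic or repair work is complete.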

To reinstate the original primary database, the database must be started and mounted, but it cannot be opened. The broker reinstates the database as a standby database of the same type (physical or logical) as the original standby database.

If the original primary database cannot be reinstated automatically, you can manually reinstate it using either the DGMGRL REINSTATE command or Enterprise Manager. Step-by-step instructions for manual reinstatement are described in Oracle Data Guard Broker.
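A manual reinstatement from DGMGRL might look like the following sketch; the database name boston and the connect identifier are hypothetical. The failed primary must first be started and mounted, but not opened.

```
SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;

DGMGRL> CONNECT sys/password@new_primary
DGMGRL> REINSTATE DATABASE 'boston';
```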

Standby databases do not have to be re-created if you use the Oracle Flashback Database feature. Flashback Database has the following advantages:

  • Saves hours of database restoration time

  • Reduces overall complexity in restoring fault tolerance

  • Reduces the time that the system is vulnerable because the standby database is re-created more quickly

See Also:

The following topics in Oracle Data Guard Concepts and Administration:
  • Flashing Back a Failed Primary Database into a Physical Standby Database

  • Flashing Back a Failed Primary Database into a Logical Standby Database

This section includes the following topics:

4.3.2.1 Reinstating the Original Primary Database After a Fast-Start Failover

Following a fast-start failover, the observer periodically attempts to reconnect to the original primary database. When the observer regains network access to the original primary database, it initiates a request for the broker to automatically reinstate it as a standby database to the new primary. This quickly restores disaster protection and high availability for the configuration.

You can enable fast-start failover from any site, including the observer site, in Enterprise Manager while connected to any database in the broker configuration. The broker simplifies switchovers and failovers by allowing you to invoke them with a single click in Oracle Enterprise Manager, as shown in Figure 4-9.

Figure 4-9 Fast-Start Failover and the Observer Are Successfully Enabled

Data Guard page showing that fast-start failover is enabled.
Description of "Figure 4-9 Fast-Start Failover and the Observer Are Successfully Enabled"

4.3.2.2 Reinstating a Standby Database Using Enterprise Manager After a Failover

You can also use Enterprise Manager to reinstate the original primary as the new standby. Figure 4-10 shows an example of the warning message that appears in Enterprise Manager when a reinstatement is needed.

Figure 4-10 Reinstating the Original Primary Database After a Fast-Start Failover

Description of Figure 4-10 follows
Description of "Figure 4-10 Reinstating the Original Primary Database After a Fast-Start Failover"

4.3.3 Restoring ASM Disk Groups After a Failure

Follow the steps in Section 4.2.5.3, "Data Area Disk Group Failure" or Section 4.2.5.4, "Flash Recovery Area Disk Group Failure".

4.3.4 Restoring Fault Tolerance After Planned Downtime on Secondary Site or Cluster

After performing the planned maintenance on the secondary site, the standby database and log apply services must be restarted, and then the Data Guard redo transport services automatically catch up the standby database with the primary database. You can use Enterprise Manager and the broker to monitor the Data Guard state.
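For example, you can verify from DGMGRL that the standby is catching up, and check the lag on the standby itself with V$DATAGUARD_STATS; the database name shown is hypothetical.

```
DGMGRL> SHOW CONFIGURATION;
DGMGRL> SHOW DATABASE 'standby_db';

SQL> SELECT NAME, VALUE FROM V$DATAGUARD_STATS
     WHERE NAME IN ('transport lag', 'apply lag');
```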

The following steps are required to restore full fault tolerance after planned downtime on a secondary site or clusterwide outage:

Note:

The following steps can be accomplished manually (as described below) or automatically using Enterprise Manager.
  1. Start the standby database

    You might have to restore the standby database from local backups, local tape backups, or the primary site backups if the data in the secondary site has been damaged. Alternatively, re-create the standby database from the primary database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration.

    After the standby database has been reestablished, start the standby database.

    Table 4-11 SQL Statements for Starting Standby Databases

    Type of Standby Database SQL Statement

    Physical

    STARTUP MOUNT;

    Logical

    STARTUP;

    Active Data Guard

    STARTUP;


  2. Start Redo Apply (physical standby) or SQL Apply (logical standby):

    Table 4-12 SQL Statements to Start Redo Apply and SQL Apply

    Type of Standby Database SQL Statement

    Physical (or Active Data Guard)

    RECOVER MANAGED STANDBY DATABASE DISCONNECT;

    Logical

    ALTER DATABASE START LOGICAL STANDBY APPLY;


  3. Verify redo transport services on the primary database

    You might have to reenable the primary database remote archive destination. Query the V$ARCHIVE_DEST_STATUS view first to see the current state of the archive destinations:

    SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL 
          FROM V$ARCHIVE_DEST_STATUS;
    ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE; 
    ALTER SYSTEM ARCHIVE LOG CURRENT;
    

    Verify redo transport services between the primary and standby databases by checking for errors. Query the V$ARCHIVE_DEST and V$ARCHIVE_DEST_STATUS views:

    SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR 
        FROM V$ARCHIVE_DEST; 
    SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';
    
  4. Verify that recovery is progressing on standby database

    • For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the redo from the archived redo log files:

      SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD#;
      SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS
          FROM V$MANAGED_STANDBY;
      
    • For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the redo from the archived redo logs:

      SELECT THREAD#, SEQUENCE# SEQ# 
          FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG 
          WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# 
          ORDER BY NEXT_CHANGE#;
      
  5. Restore primary database protection mode

    If you had to change the protection mode of the primary database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the primary database protection mode back to maximum protection, depending on your business requirements.

    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
    

4.3.5 Restoring Fault Tolerance After a Standby Database Data Failure

Following unplanned downtime on the standby database that requires a full or partial data file restoration (such as a data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible.

To repair data corruption and data failures on a logical standby database, you need a backup of the logical standby database, not a backup of the primary database. Otherwise, you must reinstate or re-create the relevant objects affected by the corruption. To repair data corruption or data failures on a physical standby database, you can use backup-based repair solutions such as data file restoration or block media recovery.
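As a hypothetical illustration, localized block corruption on a physical standby database can often be repaired with RMAN block media recovery; the data file and block numbers below are placeholders, not values taken from this example:

    RMAN> RECOVER DATAFILE 4 BLOCK 23;

For more extensive damage, restore the affected data files from a standby backup and allow Redo Apply to recover them forward.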

If you had to change the protection mode of the primary database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the primary database protection mode back to maximum protection (depending on your business requirements).

ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];

4.3.6 Restoring Fault Tolerance After the Primary Database Was Opened Resetlogs

If the primary database was activated because it was flashed back to correct a logical error, or because it was restored and recovered to a point in time, then the corresponding standby database might require additional maintenance. No additional work is required if the primary database completed recovery without a RESETLOGS operation.

After opening the primary database with the RESETLOGS option, execute the queries shown in Table 4-13.

Table 4-13 Queries to Determine the RESETLOGS SCN and the Current SCN After an OPEN RESETLOGS

Database Query

Primary

SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) FROM V$DATABASE;

Physical standby

SELECT TO_CHAR(CURRENT_SCN) FROM V$DATABASE;

Logical standby

SELECT APPLIED_SCN FROM DBA_LOGSTDBY_PROGRESS;


Table 4-14 shows the actions you take to restore fault tolerance if the standby database is behind the primary database's resetlogs SCN.

Table 4-14 SCN on Standby Database is Behind RESETLOGS SCN on the Primary Database

Database Action

Physical standby

  1. Ensure that the standby database has received an archived redo log file from the primary database.

    See Also: "Verify redo transport services on the primary database"

  2. Restart Redo Apply.

Logical standby

Ensure that the standby database has received an archived redo log file from the primary database.

See Also: "Verify redo transport services on the primary database"
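For example, you might confirm that the standby database has registered recent archived redo log files, and then restart the apply services. These statements are a sketch of common usage, not the only supported method:

    SELECT THREAD#, MAX(SEQUENCE#) FROM V$ARCHIVED_LOG GROUP BY THREAD#;

On a physical standby database, restart Redo Apply with:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;

On a logical standby database, restart SQL Apply with:

    ALTER DATABASE START LOGICAL STANDBY APPLY;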


Table 4-15 shows the actions you take to restore fault tolerance if the standby database is ahead of the primary database's resetlogs SCN.

Table 4-15 SCN on the Standby is Ahead of Resetlogs SCN on the Primary Database

Database Action

Physical standby

  1. Ensure that the standby database has received an archived redo log file from the primary database.

    See Also: "Verify redo transport services on the primary database"

  2. Issue the SHUTDOWN IMMEDIATE statement, if necessary.

  3. Issue the STARTUP MOUNT statement.

  4. Issue the FLASHBACK DATABASE TO SCN flashback_scn statement, where flashback_scn is the SCN returned by the primary database query in Table 4-13 (2 less than the RESETLOGS_CHANGE#):

    FLASHBACK DATABASE TO SCN resetlogs_change#_minus_2;

  5. Restart Redo Apply with or without real-time apply:

    With real-time apply:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;
    

    Without real-time apply:

    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT;
    

Logical standby

  1. Determine the SCN at the primary database.

    On the primary database, use the following query to obtain the value of the system change number (SCN) that is 2 SCNs before the RESETLOGS operation occurred on the primary database:

    SQL> SELECT TO_CHAR(RESETLOGS_CHANGE# - 2) AS FLASHBACK_SCN FROM V$DATABASE;
    
  2. Determine the target SCN for flashback operation at the logical standby:

    SQL> SELECT DBMS_LOGSTDBY.MAP_PRIMARY_SCN(PRIMARY_SCN => FLASHBACK_SCN)
      2>   AS TARGET_SCN FROM DUAL;
    
  3. Flash back the logical standby to the TARGET_SCN returned.

    Issue the following SQL statements to flash back the logical standby database to the specified SCN, and open the logical standby database with the RESETLOGS option:

    SQL> SHUTDOWN;
    SQL> STARTUP MOUNT EXCLUSIVE;
    SQL> FLASHBACK DATABASE TO SCN TARGET_SCN;
    SQL> ALTER DATABASE OPEN RESETLOGS;
    
  4. Start SQL Apply:

    SQL> ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE;
    

4.3.7 Restoring Fault Tolerance After Dual Failures

If a dual failure affecting both the standby and primary databases occurs, then you must re-create the primary database first. Because the sites are identical, the primary database can be created wherever the most recent backup resides.

Table 4-16 summarizes the recovery strategy depending on the type of backups that are available.

Table 4-16 Re-Creating the Primary and Standby Databases

Available Backups Re-Creating the Primary Database

Local backup on primary and standby databases

Restore backup from the primary database. Recover and activate the database as the new primary database.

Local backup only on standby database. Tape backups on standby database.

Restore the local standby backup to the standby database. Recover and activate the database as the new primary database.

Tape backups only

Restore tape backups locally. Recover the database and activate it as the new primary database.
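For the case in which only a standby backup is available, the restore and activation sequence can be sketched as follows; initialization parameters, backup locations, and RMAN channel configuration are site-specific and omitted here:

    RMAN> STARTUP MOUNT;
    RMAN> RESTORE DATABASE;
    RMAN> RECOVER DATABASE;

    SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
    SQL> ALTER DATABASE OPEN;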


See Also:

After the primary database is re-created, follow the steps for creating a standby database that are described in Oracle Data Guard Concepts and Administration.