Oracle® High Availability Architecture and Best Practices
10g Release 1 (10.1)

Part Number B10726-01

9
Recovering from Outages

This chapter describes scheduled and unscheduled outages and the Oracle recovery process and architectural framework that can manage each outage and minimize downtime. This chapter contains the following sections:

  • Recovery Steps for Unscheduled Outages
  • Recovery Steps for Scheduled Outages

Recovery Steps for Unscheduled Outages

Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application.

The monitoring and HA infrastructure should provide rapid detection and recovery from failures. Detection is described in Chapter 8, "Using Oracle Enterprise Manager for Monitoring and Detection", while this chapter focuses on the recovery operations for each outage.

Table 9-1 describes the unscheduled outages that impact the primary or secondary site components.

Table 9-1 Unscheduled Outages  
Each entry names the outage, describes it, and gives examples.

Site failure

The entire site where the current production database resides is unavailable. This includes all tiers of the application.

  • Disaster at the production site such as a fire, flood, or earthquake
  • Power outages. (If there are multiple power grids and backup generators for critical systems, then this should affect only part of the data center.)

Node failure

A node of the RAC cluster is unavailable or fails.

  • A database tier node fails or has to be shut down because of bad memory or bad CPU
  • The database tier node is unreachable
  • Both of the redundant cluster interconnects fail, resulting in another node taking ownership

Instance failure

A database instance is unavailable or fails.

An instance of the RAC database on the data server fails because of a software bug or an operating system or hardware problem

Clusterwide failure

The whole cluster hosting the RAC database is unavailable or fails. This includes failures of nodes in the cluster, as well as failures of any other components that make the cluster, and therefore the Oracle database and instances on the site, unavailable.

  • The last surviving node on the RAC cluster fails and cannot be restarted
  • Both of the redundant cluster interconnects fail
  • Database corruption is severe enough to disallow continuity on the current data server
  • Disk storage fails

Data failure

This failure results in unavailability of parts of the database because of media corruptions, inaccessibility, or inconsistencies.

  • A datafile is accidentally removed or is unavailable
  • Media corruption affects blocks of the database
  • Oracle block corruption is caused by operating system or other node-related problems

User error

This failure results in unavailability of parts of the database and causes transactional or logical data inconsistencies. It is usually caused by the operator or by bugs in the application code.

This is estimated to be the greatest single cause of downtime.

Localized damage (needs surgical repair)

  • User error results in a table being dropped or in rows being deleted from a table

Widespread damage (needs drastic action to avoid downtime)

  • Application errors result in logical corruptions in the database
  • Operator error results in a batch job being run more times than specified

Note: This category focuses on user errors that affect database availability and, in particular, cause transactional or logical data inconsistencies.

The rest of this section provides outage decision trees for unscheduled outages on the primary site and the secondary site. The decision trees appear in the following sections:

  • Recovery Steps for Unscheduled Outages on the Primary Site
  • Recovery Steps for Unscheduled Outages on the Secondary Site

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. These descriptions are found in Chapter 10, "Detailed Recovery Steps".

Some outages require multiple recovery steps. For example, when a site failure occurs, the outage decision matrix states that Data Guard failover must occur before site failover. Some outages are handled automatically without any loss of availability. For example, instance failure is managed automatically by RAC. Multiple recovery options for each outage are listed wherever relevant.
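
For example, in this release a failover to a physical standby database is typically driven from SQL*Plus. The following is a minimal sketch only, assuming all available redo on the standby can be applied; the complete procedure, including verification steps, is described in Chapter 10.

    -- On the standby database: apply all remaining redo, then
    -- convert the standby to the primary role.
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;

    -- Restart so that the database opens in its new primary role.
    SHUTDOWN IMMEDIATE;
    STARTUP;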

Recovery Steps for Unscheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for maximum availability of the system. Only the "Data Guard only" and MAA architectures have a secondary site to protect from site disasters. The estimated recovery times (ERT) are examples derived from customer experiences and actual testing; they do not represent guaranteed recovery times.

Table 9-2 summarizes the recovery steps for unscheduled outages on the primary site.

Table 9-2 Recovery Steps for Unscheduled Outages on the Primary Site  
Each entry lists the reason for the outage, followed by the recovery steps for the "Database Only", "RAC Only", "Data Guard Only", and MAA architectures.

Site failure

"Database Only" and "RAC Only":

ERT: hours to days

  1. Restore site.
  2. Restore from tape backups.
  3. Recover database.

"Data Guard Only" and MAA:

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Node failure

"Database Only":

ERT: minutes to an hour

  1. Restart node and restart database.
  2. Reconnect users.

"RAC Only" and MAA:

ERT: seconds to minutes

Managed automatically by RAC Recovery

"Data Guard Only":

ERT: minutes to an hour

  1. Restart node and restart database.
  2. Reconnect users.

or

  1. Database Failover
  2. Complete or Partial Site Failover

Instance failure

"Database Only" and "Data Guard Only":

ERT: minutes

  1. Restart instance.
  2. Reconnect users.

"RAC Only" and MAA:

ERT: seconds to minutes

Managed automatically by RAC Recovery

Clusterwide failure

"Database Only" and "Data Guard Only": N/A

"RAC Only":

ERT: hours to days

  1. Restore cluster or restore at least one node.
  2. Restore from tape backups.
  3. Recover database.

MAA:

ERT: minutes to an hour

  1. Database Failover
  2. Complete or Partial Site Failover

Data failure

"Database Only" and "RAC Only":

ERT: minutes to an hour

Recovery Solutions for Data Failures

"Data Guard Only" and MAA:

ERT: minutes to an hour

Recovery Solutions for Data Failures

or

  1. Database Failover
  2. Complete or Partial Site Failover

Note: For primary database media failures or media corruptions, database failover may minimize data loss.

User error

All architectures:

ERT: minutes

Recovering from User Error with Flashback Technology
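
As an illustration of the Flashback repair referenced in the last row, the following SQL*Plus sketch reverses two common user errors. The schema and table name (hr.employees) and the 15-minute window are hypothetical, and FLASHBACK TABLE ... TO TIMESTAMP requires that row movement first be enabled on the table.

    -- Rewind a table after rows were wrongly deleted or changed
    -- (assumes ALTER TABLE hr.employees ENABLE ROW MOVEMENT was run).
    FLASHBACK TABLE hr.employees
      TO TIMESTAMP SYSTIMESTAMP - INTERVAL '15' MINUTE;

    -- Restore an accidentally dropped table from the recycle bin.
    FLASHBACK TABLE hr.employees TO BEFORE DROP;

Similarly, a localized data failure on a single datafile can often be repaired with RMAN while the rest of the database stays open; a sketch, assuming the damaged file is datafile 7:

    # RMAN sketch: restore and recover one damaged datafile online.
    SQL 'ALTER DATABASE DATAFILE 7 OFFLINE';
    RESTORE DATAFILE 7;
    RECOVER DATAFILE 7;
    SQL 'ALTER DATABASE DATAFILE 7 ONLINE';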

Recovery Steps for Unscheduled Outages on the Secondary Site

Outages on the secondary site do not directly affect availability because clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. In most cases, outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is part of the configuration, then an unscheduled outage of the last surviving standby database causes downtime on the production database. After downgrading the data protection mode, you can restart the production database.
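
For example, a minimal SQL*Plus sketch of restarting the production database after such a shutdown, assuming you choose to downgrade to maximum performance:

    -- On the production database, after it shut down because the
    -- last surviving maximum-protection standby became unavailable:
    STARTUP MOUNT;
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PERFORMANCE;
    ALTER DATABASE OPEN;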

Table 9-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site.

Table 9-3 Recovery Steps for Unscheduled Outages of the Standby Database on the Secondary Site  
Each entry lists the reason for the outage, followed by the recovery steps for the "Data Guard Only" architecture and MAA.

Standby apply instance failure

"Data Guard Only":

  1. Restart node and standby instance.
  2. Restart recovery.

If there is only one standby database and maximum protection mode is configured, then the production database shuts down to ensure that there is no data divergence with the standby database.

MAA:

ERT: seconds

Apply Instance Failover

There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance (a sample descriptor appears after this table).

Restart the node and instance when they are available.

Standby non-apply instance failure

"Data Guard Only": N/A

MAA:

There is no effect on availability because the apply node or instance continues to receive redo logs and apply them with the recovery process, and the production database continues to communicate with that standby instance.

Restart the node and instance when they are available.

Data failure such as media failure or disk corruption

Both architectures: Restoring Fault Tolerance after a Standby Database Data Failure

Primary database resets logs because of flashback operations or media recovery

Both architectures: Restoring Fault Tolerance After the Production Database Has Opened Resetlogs
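
A sketch of the connect-time failover configuration mentioned in Table 9-3. The net service name, host names, and service name below are hypothetical; the production database references a descriptor like this (for example, in the SERVICE attribute of a LOG_ARCHIVE_DEST_n parameter) so that redo shipping transparently follows an available standby instance:

    standby_db =
      (DESCRIPTION =
        (ADDRESS_LIST =
          (FAILOVER = ON)
          (LOAD_BALANCE = OFF)
          (ADDRESS = (PROTOCOL = TCP)(HOST = standby-node1)(PORT = 1521))
          (ADDRESS = (PROTOCOL = TCP)(HOST = standby-node2)(PORT = 1521)))
        (CONNECT_DATA = (SERVICE_NAME = sales)))

With FAILOVER = ON and LOAD_BALANCE = OFF, Oracle Net tries the addresses in order and connects to the first instance that responds.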

Recovery Steps for Scheduled Outages

Scheduled outages are planned outages. They are required for regular maintenance of the technology infrastructure that supports the application and include tasks such as hardware maintenance, repair, and upgrades; software upgrades and patching; application changes and patching; and changes to improve the performance and manageability of systems. These outages should be scheduled at the times best suited for continual application availability.

Table 9-4 describes the scheduled outages that impact either the primary or secondary site components.

Table 9-4 Scheduled Outages  
Each entry names the outage class, describes it, and gives examples.

Site-wide

The entire site where the current production database resides is unavailable. This is usually known well in advance and can be scheduled.

  • Scheduled power outages
  • Site maintenance
  • Regular planned switchovers to test infrastructure

Hardware maintenance (node impact)

This is scheduled downtime of a database server node for hardware maintenance. The scope of this downtime is restricted to a node of the database cluster.

  • Repair of a failed component such as a memory card or CPU board
  • Addition of memory or CPU to an existing node in the database tier

Hardware maintenance (clusterwide impact)

This is scheduled downtime of the database server cluster for hardware maintenance.

  • Some cases of adding a node to the cluster
  • Upgrade or repair of the cluster interconnect
  • Upgrade to the storage tier that requires downtime on the database tier

System software maintenance (node impact)

This is scheduled downtime of a database server node for system software maintenance. The scope of the downtime is restricted to a node.

  • Upgrade of a software component such as the operating system
  • Changes to the configuration parameters for the operating system

System software maintenance (clusterwide impact)

This is scheduled downtime of the database server cluster for system software maintenance.

  • Upgrade or patching of the cluster software
  • Upgrade of the volume management software

Oracle patch upgrade for the database

Scheduled downtime for an Oracle patch

Patch Oracle software to fix a specific customer issue

Oracle patch set or software upgrade for the database

Scheduled downtime for Oracle patch set or software upgrade

  • Patching Oracle software with a patch set
  • Upgrading Oracle software

Database object reorganization

These are changes to the logical structure or the physical organization of Oracle database objects. The primary reason for these changes is to improve performance or manageability. This is always a planned activity, and the method and timing of the reorganization should be chosen appropriately.

Using Oracle's online reorganization features enables objects to remain available during the reorganization (see the sketch after this table).

  • Moving an object to a different tablespace
  • Converting a table to a partitioned table
  • Renaming or dropping columns of a table
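
As an illustration of online reorganization, the following SQL*Plus sketch redefines a table with the DBMS_REDEFINITION package while the table remains available. The schema, table, and interim table names are hypothetical, and steps such as copying dependent objects are elided for brevity.

    -- Verify that the table can be redefined online.
    EXECUTE DBMS_REDEFINITION.CAN_REDEF_TABLE('HR', 'EMPLOYEES');

    -- After creating an interim table (hr.employees_int) with the
    -- desired new organization, start, synchronize, and finish:
    EXECUTE DBMS_REDEFINITION.START_REDEF_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');
    EXECUTE DBMS_REDEFINITION.SYNC_INTERIM_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');
    EXECUTE DBMS_REDEFINITION.FINISH_REDEF_TABLE('HR', 'EMPLOYEES', 'EMPLOYEES_INT');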

The rest of this section provides outage decision trees for scheduled outages. They appear in the following sections:

  • Recovery Steps for Scheduled Outages on the Primary Site
  • Recovery Steps for Scheduled Outages on the Secondary Site

The high-level recovery steps for each outage are listed with links to the detailed descriptions for each recovery step. The detailed descriptions of the recovery operations are found in Chapter 10, "Detailed Recovery Steps".

This section also includes the following topic:

  • Preparing for Scheduled Secondary Site Maintenance

Recovery Steps for Scheduled Outages on the Primary Site

If the primary site contains the production database and the secondary site contains the standby database, then the outages on the primary site are the ones of most interest. Solutions for these outages are critical for continued availability of the system.

Table 9-5 shows the recovery steps for scheduled outages on the primary site.

Table 9-5 Recovery Steps for Scheduled Outages on the Primary Site  
Each entry lists the scope of and reason for the outage, followed by the recovery steps for the "Database Only", "RAC Only", "Data Guard Only", and MAA architectures.

Site

Site shutdown

"Database Only" and "RAC Only": Downtime for entire duration

"Data Guard Only" and MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Primary database

Hardware maintenance (node impact)

"Database Only": Downtime for entire duration

"RAC Only" and MAA: Managed automatically by RAC Recovery

"Data Guard Only":

  1. Database Switchover
  2. Complete or Partial Site Failover

Primary database

Hardware maintenance (clusterwide impact)

"Database Only" and "RAC Only": Downtime for entire duration

"Data Guard Only" and MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Primary database

System software maintenance (node impact)

"Database Only": Downtime for entire duration

"RAC Only" and MAA: Managed automatically by RAC Recovery

"Data Guard Only":

  1. Database Switchover
  2. Complete or Partial Site Failover

Primary database

System software maintenance (clusterwide impact)

"Database Only" and "RAC Only": Downtime for entire duration

"Data Guard Only" and MAA:

  1. Database Switchover
  2. Complete or Partial Site Failover

Primary database

Oracle patch upgrade for the database

"Database Only" and "Data Guard Only": Downtime for entire duration

"RAC Only" and MAA: RAC Rolling Upgrade

Primary database

Oracle patch set or software upgrade for the database

"Database Only" and "RAC Only": Downtime for entire duration

"Data Guard Only" and MAA: Upgrade with Logical Standby Database

Primary database

Database object reorganization

All architectures: Online Object Reorganization
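
The Database Switchover steps referenced in Table 9-5 are detailed in Chapter 10. As a minimal SQL*Plus sketch for a physical standby configuration, with the usual verification steps elided:

    -- On the production database: convert it to the standby role.
    ALTER DATABASE COMMIT TO SWITCHOVER TO PHYSICAL STANDBY WITH SESSION SHUTDOWN;
    SHUTDOWN IMMEDIATE;
    STARTUP NOMOUNT;
    ALTER DATABASE MOUNT STANDBY DATABASE;

    -- On the standby database: convert it to the primary role.
    ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
    SHUTDOWN IMMEDIATE;
    STARTUP;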

Recovery Steps for Scheduled Outages on the Secondary Site

Outages on the secondary site do not impact availability because clients always access the primary site unless there is a switchover or failover. Outages on the secondary site may affect the MTTR if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no impact on availability. However, if maximum protection mode is configured, then downgrade the protection mode before a scheduled outage on the standby instance or database so that there is no downtime on the production database.

Table 9-6 describes the recovery steps for scheduled outages on the secondary site.

Table 9-6 Recovery Steps for Scheduled Outages on the Secondary Site  
Each entry lists the scope of and reason for the outage, followed by the recovery steps for the "Data Guard Only" architecture and MAA.

Site

Site shutdown

Both architectures:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage"

Standby database

Hardware or software maintenance on the node that is running the managed recovery process (MRP)

Both architectures:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

Standby database

Hardware or software maintenance on a node that is not running the MRP

"Data Guard Only": N/A

MAA:

No impact on availability, because the apply node or instance continues to receive redo logs, which are applied with the managed recovery process.

After the outage: Restart node and instance when available.

Standby database

Hardware or software maintenance (clusterwide impact)

"Data Guard Only": N/A

MAA:

Before the outage: "Preparing for Scheduled Secondary Site Maintenance"

After the outage: "Restoring Fault Tolerance after Secondary Site or Clusterwide Scheduled Outage"

Standby database

Oracle patch and software upgrades

Both architectures: Downtime is needed for the upgrade, but there is no impact on the primary node unless the configuration is in maximum protection mode.

Preparing for Scheduled Secondary Site Maintenance

To achieve continued service during a scheduled secondary site outage, downgrade the protection mode from maximum protection to maximum availability or maximum performance. When you are scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time the standby lags behind the production database, which lengthens the time needed to restore fault tolerance.
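
For example, a minimal SQL*Plus sketch of downgrading the protection mode on the open production database before the maintenance window, and raising it again afterward (raising the mode to maximum protection requires a mounted, not open, database):

    -- Before the outage, on the open production database:
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY;

    -- After the outage, once the standby has caught up:
    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT;
    ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE PROTECTION;
    ALTER DATABASE OPEN;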

Table 9-7 shows how to prepare for scheduled secondary site maintenance.

Table 9-7 Preparing for Scheduled Secondary Site Maintenance  
Each entry lists the production database protection mode and the reason for the outage, followed by the preparation steps for the "Data Guard Only" architecture and MAA.

Maximum protection

Site shutdown, or hardware or software maintenance (clusterwide impact)

Switch the production data protection mode to either maximum availability or maximum performance.

See Also: "Changing the Data Protection Mode"

Maximum protection

Hardware or software maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

Switch the production data protection mode to either maximum availability or maximum performance.

Maximum availability or maximum performance

Site shutdown, or hardware or software maintenance (clusterwide impact)

None; no impact on the production database.

Maximum availability or maximum performance

Hardware or software maintenance on the primary node (the node that is running the recovery process)

Apply Instance Failover (MAA only)

or

None; no impact on the production database.
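
The Apply Instance Failover preparation referenced in Table 9-7 amounts to starting managed recovery on a surviving standby instance. A minimal SQL*Plus sketch, run on that instance:

    -- On a surviving standby instance, restart the managed recovery
    -- process so that this instance becomes the new apply instance:
    ALTER DATABASE RECOVER MANAGED STANDBY DATABASE DISCONNECT FROM SESSION;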