This chapter describes unscheduled and scheduled outages and the Oracle operational best practices that can tolerate or manage each outage type and minimize downtime.
This chapter contains these topics:
This section describes the types of possible outages and the recommended methods to repair or minimize the downtime associated with each outage.
This section contains these topics:
Unscheduled outages are unanticipated failures in any part of the technology infrastructure that supports the application, including the following components:
Hardware
Software
Network infrastructure
Naming services infrastructure
Database
Your monitoring and high-availability infrastructure should provide rapid detection and recovery from downtime. Chapter 3, "Monitoring Using Oracle Grid Control" describes detection, while this chapter focuses on reducing downtime.
Table 4-1 describes unscheduled outages that affect the primary or secondary site components.
The best practice recommendations for reducing unscheduled outages on the primary site and the secondary site, estimated recovery times, and recovery steps appear in the following sections:
If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for these outages are critical for maximum availability of the system. Only the Oracle Database 10g with Data Guard, and the Oracle Database 10g with RAC and Data Guard (MAA) architectures have a secondary site to protect from site disasters.
Table 4-2 summarizes the recovery steps for unscheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages" that starts.
Table 4-2 Recovery Times and Steps for Unscheduled Outages on the Primary Site
Outage Type | Oracle Database 10g | Oracle Database 10g with RAC | Oracle Database 10g with Data Guard | Oracle Database 10g - MAA |
---|---|---|---|---|
Hours to days
|
Hours to days
|
Seconds to 5 minutesFoot 1 |
Seconds to 5 minutesFootref 1 |
|
Not applicable |
Hours to days
|
Not applicable |
Seconds to 5 minutes |
|
Minutes to hoursFoot 2
|
No downtimeFoot 3 Managed automatically by RAC Recovery for Unscheduled Outages |
Seconds to 5 minutesFootref 2 or Minutes to hours2
|
No downtimeFootref 3 Managed automatically by RAC Recovery for Unscheduled Outages |
|
MinutesFootref 2
|
No downtimeFootref 3 Managed automatically by RAC Recovery for Unscheduled Outages |
MinutesFootref 2
or Seconds to 5 minutesFootref 1 |
No downtimeFootref 3 Managed automatically by RAC Recovery for Unscheduled Outages |
|
No downtimeFoot 4 |
No downtimeFootref 4 |
No downtimeFootref 4 |
No downtimeFootref 4 |
|
HARD prevents data corruptionFoot 5 Potentially hours |
HARD prevents data corruption5 Potentially hours |
HARD prevents data corruption5 or Seconds to 5 minutesFootref 1 |
HARD prevents data corruption5 or Seconds to 5 minutesFootref 1 |
|
< 30 minutesFoot 6 |
< 30 minutesFootref 6 |
<30 minutesFootref 6 |
< 30 minutesFootref 6 |
Footnote 1 Recovery time indicated applies to database and existing connection failover. Network connection changes and other site-specific failover activities may lengthen overall recovery time.
Footnote 2 Recovery time consists largely of the time it takes to restore the failed system.
Footnote 3 Database is still available, but portion of application connected to failed system is temporarily affected.
Footnote 4 Storage failures are prevented by using Automatic Storage Management (ASM) with mirroring and its automatic rebalance capability.
Footnote 5 Not all types of data corruption are prevented. For the most recent information about the HARD initiative, refer to Section 2.1.6, "Consider HARD-Compliant Storage".
Footnote 6 Recovery times from human errors depend primarily on detection time. If it takes seconds to detect a malicious DML or DLL transaction, then it typically only requires seconds to flashback the appropriate transactions.
For most cases, outages on the secondary site can be managed with no effect on availability of the primary database located on the primary site. However, if the configuration is in maximum protection mode, then unscheduled outages on the last surviving standby database will cause outages on the production database to ensure no data loss when failing over to the standby database. After downgrading the data protection mode, you can restart the production database even without accessibility to the standby databases. Outages on the secondary site might affect the maximum time to recovery (MTTR) if there are concurrent failures on the primary site.
Table 4-3 summarizes the recovery steps for unscheduled outages of the standby database on the secondary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.2, "Recovering from Unscheduled Outages" that starts.
Table 4-3 Recovery Steps for Unscheduled Outages on the Secondary Site
Outage Type | Oracle Database 10g with Data Guard | Oracle Database 10g - MAA |
---|---|---|
If there is only one standby database and if maximum database protection is configured, then the production database will shut down to ensure that there is no data divergence with the standby database. |
There is no effect on production availability if the production database Oracle Net descriptor is configured to use connect-time failover to an available standby instance. Broker will automatically restart the apply process. Restart node and instance when they are available. |
|
Restoring Fault Tolerance After a Standby Database Data Failure |
Restoring Fault Tolerance After a Standby Database Data Failure |
|
Primary database opens with |
Restoring Fault Tolerance After the Production Database Was Opened Resetlogs |
Restoring Fault Tolerance After the Production Database Was Opened Resetlogs |
Scheduled outages are required for regular maintenance of the technology infrastructure that supports the application, including tasks such as:
Hardware maintenance, repair, and upgrades
Software upgrades and patching
Application changes and patching
Changes to improve performance and manageability of systems
These tasks should be scheduled at times best suited for continual application availability.
Table 4-4 describes scheduled outages that affect either the primary or secondary site.
Outage Scope | Description | Examples |
---|---|---|
Site-wide |
The entire site where the current production database resides is unavailable. Usually known well in advance. |
Scheduled power outages Site maintenance Regular planned switchovers to test infrastructure |
Hardware maintenance (node impact) |
Hardware maintenance on a database server. Restricted to a node of the database cluster. |
Repair of a failed component such as a memory card or CPU board Addition of memory or CPU to an existing node in the database tier |
Hardware maintenance (clusterwide impact) |
Hardware maintenance on a database server cluster |
Some cases of adding a node to the cluster Upgrade or repair of the cluster interconnect Upgrade to the storage tier that requires downtime on the database tier |
System software maintenance (node impact) |
System software maintenance on a database server. The scope of the downtime is restricted to a node. |
Upgrade of a software component such as the operating system Changes to the configuration parameters for the operating system |
System software maintenance (clusterwide impact) |
System software maintenance on a database server cluster |
Upgrade or patching of the cluster software Upgrade of the volume management software |
Oracle patch upgrade for the database |
Scheduled outage for installation of an Oracle patch |
Patch Oracle software to fix a specific customer issue |
Oracle patch set or software upgrade for the database |
Scheduled outage for Oracle patch set or software upgrade |
Patching Oracle software with a patch set Upgrading Oracle software |
Database object reorganization |
Changes to the logical structure or the physical organization of Oracle Database objects, primarily to improve performance or manageability. Using Oracle Database online reorganization features enables objects to be available during the reorganization. |
Moving an object to a different tablespace Converting a table to a partitioned table Renaming or dropping columns of a table |
Storage maintenance |
Maintenance of storage where database files reside |
Converting to ASM Adding or removing storage |
Platform migration |
Changing operating system platform of the primary and standby databases |
Moving to the Linux operating system |
Location migration |
Changing physical location of the primary database |
Moving the primary database from one data center to another. |
The following sections provide best practice recommendations and preparations for reducing scheduled outages on the primary and secondary sites:
If the primary site contains the production database and the secondary site contains the standby database, then outages on the primary site are the most crucial. Solutions for theses outages are critical for continued availability of the system.
Table 4-5 shows the high-level recovery steps for scheduled outages on the primary site. For outages that require multiple recovery steps, the table includes links to the detailed descriptions in Section 4.4, "Eliminating or Reducing Downtime for Scheduled Outages" beginning.
Table 4-5 Recovery Steps for Scheduled Outages on the Primary Site
Outage Scope | Cause | Oracle Database 10g | Oracle Database 10g with RAC | Oracle Database 10g with Data Guard | Oracle Database 10g - MAA |
---|---|---|---|---|---|
Site |
Site shutdown |
Restart database after outage |
Restart database after outage |
||
Primary database |
Hardware maintenance (node impact) |
Restart database after outage |
Managed automatically, see System Maintenance |
Managed automatically, see System Maintenance |
|
Primary database |
Hardware maintenance (clusterwide impact) System software maintenance (clusterwide impact) |
Not applicable |
Restart database after outage |
Not applicable |
|
Primary database |
System software maintenance (node impact) |
Restart database after outage |
Managed automatically by System Maintenance |
Managed automatically by System Maintenance |
|
Primary database |
Real Application Cluster's Cluster Ready Service (CRS) Upgrades |
Not applicable |
In general, CRS upgrades can be done online and do not require downtime. |
Not applicable |
In general, CRS upgrades can be done online and do not require downtime. |
Primary database |
Oracle patch upgrade for the database |
Restart database after outage |
|||
Primary database |
Oracle patch set or software upgrade for the database, including some Automatic Storage Management (ASM) and CRS upgrades |
Restart database after outage |
Restart database after outage |
||
Primary database |
Database object reorganization |
||||
Primary database |
Storage maintenance |
||||
Primary database |
Platform maintenance |
||||
Primary database |
Location maintenance |
Outages on the secondary site do not affect availability because the clients always access the primary site. Outages on the secondary site might affect the RTO if there are concurrent failures on the primary site. Outages on the secondary site can be managed with no effect on availability. If maximum protection database mode is configured, then downgrade the protection mode before scheduled outages on the standby instance or database so that there will be no downtime on the production database.
Table 4-6 describes the recovery steps for scheduled outages on the secondary site.
Table 4-6 Managing Scheduled Outages on the Secondary Site
Outage Type | Cause | Oracle Database 10g with Data Guard | Oracle Database 10g - MAA |
---|---|---|---|
Site |
Site shutdown |
Before the outage: Preparing for Scheduled Outages on the Secondary Site After the outage: Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage |
Before the outage: Preparing for Scheduled Outages on the Secondary Site After the outage: Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage |
Standby database |
Hardware or software maintenance the node that is running the managed recovery process (MRP) |
Before the outage: |
Before the outage: |
Standby database |
Hardware or software maintenance on a node that is not running the MRP |
Not applicable |
No effect because the primary standby node or instance receives redo logs that are applied with the managed recovery process After the outage: Restart node and instance when available. |
Standby database |
Hardware or software maintenance (clusterwide impact) |
Not applicable |
Before the outage: Preparing for Scheduled Outages on the Secondary Site After the outage: Restoring Fault Tolerance After Planned Downtime on Secondary Site or Clusterwide Outage |
Standby database |
Oracle patch and software upgrades |
Downtime needed for upgrade, but there is no effect on primary node unless the configuration is in maximum protection database mode. |
Downtime needed for upgrade, but there is no effect on primary node unless the configuration is in maximum protection database mode. |
To achieve continued service during scheduled outages on a secondary site when in maximum protection mode, downgrade the maximum protection mode to maximum availability or maximum performance temporarily. When scheduling secondary site maintenance, consider that the duration of a site-wide or clusterwide outage adds to the time that the standby database lags behind the production database, which in turn lengthens the time to restore fault tolerance. See Section 2.4.2, "Data Protection Mode" for an overview of the Data Guard protection modes.
This section describes best practices for recovering from various types of unscheduled outages.
This section contains these topics:
With complete site failover, the database, the middle-tier application server, and all user connections fail over to a secondary site that is prepared to handle the production load.
If the standby site meets the prerequisites, then complete site failover is recommended for the following scenarios:
Primary-site disaster, such as natural disasters or malicious attacks
Primary network-connectivity failures
Primary site power failures
Site failover can be expedited in minutes by using the following practices:
Use Data Guard configuration best practices
Use Data Guard fast-start failover to automatically fail over to the standby database, with a recovery time objective (RTO) of less than 30 seconds
Maintain a running middle-tier application server on the secondary site to avoid the startup time
Automate the DNS failover procedure
Data loss is dependent on the Data Guard configuration and the use of synchronous or asynchronous redo shipping.
A wide-area traffic manager on the primary and secondary sites provides the site failover function. The wide-area traffic manager can redirect traffic automatically if the primary site, or a specific application on the primary site, is not accessible. It can also be triggered manually to switch to the secondary site for switchovers. Traffic is directed to the secondary site only when the primary site cannot provide service due to an outage or after a switchover. If the primary site fails, then user traffic is directed to the secondary site automatically.
Figure 4-1 illustrates the possible network routes before site failover:
Client requests enter the client tier of the primary site and travel by the WAN traffic manager.
Client requests are sent through the firewall into the demilitarized zone (DMZ) to the application server tier.
Requests are forwarded through the active load balancer to the application servers.
Requests are sent through another firewall and into the database server tier.
The application requests, if required, are routed to a RAC instance.
Responses are sent back to the application and clients by a similar path.
Figure 4-1 Network Routes Before Site Failover
Figure 4-2 illustrates the network routes after site failover. Client or application requests enter the secondary site at the client tier and follow exactly the same path on the secondary site that they followed on the primary site.
Figure 4-2 Network Routes After Site Failover
The following steps describe the effect on network traffic of a failover or switchover:
The administrator has failed over or switched over the production database to the secondary site. This is automatic if you are using Data Guard fast-start failover.
The administrator starts the middle-tier application servers on the secondary site, if they are not already running.
The wide-area traffic manager selection of the secondary site can be automatic in the case of an entire site failure. The wide-area traffic manager at the secondary site returns the virtual IP address of a load balancer at the secondary site and clients are directed automatically on the subsequent reconnect. In this scenario, the site failover is accomplished by an automatic domain name system (DNS) failover.
Alternatively, a DNS administrator can manually change the wide-area traffic manager selection to the secondary site for the entire site or for specific applications. The following is an example of a manual DNS failover:
Change the DNS to point to the secondary site load balancer:
The master (primary) DNS server is updated with the new zone information, and the change is announced with DNS NOTIFY.
The slave DNS servers are notified of the zone update with a DNS NOTIFY announcement, and the slave DNS servers pull the new zone information.
Note:
The master and slave servers are authoritative name servers. Therefore, they contain trusted DNS information.Clear affected records from caching DNS servers.
A caching DNS server is used primarily for performance and fast response. The caching server obtains information from an authoritative DNS server in response to a host query and then saves (caches) the data locally. On a second or subsequent request for the same data, the caching DNS server responds with its locally stored data (the cache) until the time-to-live (TTL) value of the response expires. At this time, the server refreshes the data from the zone master. If the DNS record is changed on the primary DNS server, then the caching DNS server does not pick up the change for cached records until TTL expires. Flushing the cache forces the caching DNS server to go to an authoritative DNS server again for the updated DNS information.
Flush the cache if the DNS server being used supports such a capability. The following is the flush capability of common DNS BIND versions:
BIND 9.3.0: The command rndc
flushname
name
flushes individual entries from the cache.
BIND 9.2.0 and 9.2.1: The entire cache can be flushed with the command rndc
flush
.
BIND 8 and BIND 9 up to 9.1.3: Restarting the named server clears the cache.
Refresh local DNS service caching.
Some operating systems might cache DNS information locally in the local name service cache. If so, this cache must also be cleared so that DNS updates are recognized quickly.
Solaris: nscd
Linux: /etc/init.d/nscd restart
Microsoft Windows: ipconfig /flushdns
Apple Mac OS X: lookupd -flushcache
The secondary site load balancer directs traffic to the secondary site middle-tier application server.
The secondary site is ready to take client requests.
Failover also depends on the client's Web browser. Most browser applications cache the DNS entry for a period of time. Consequently, sessions in progress during an outage might not fail over until the cache timeout expires. To resume service to such clients, close the browser and restart it.
Failover is the operation of transitioning one of the standby databases to the role of production database. A failover operation can be invoked when an unplanned failure occurs on the production database, and there is no possibility of recovering the production database in a timely manner.
Data Guard enables you to fail over by:
Using Oracle Enterprise Manager, as described in Section 4.2.2.2.1, "Using Enterprise Manager to Perform a Data Guard Failover"
Using the Oracle Data Guard Broker command-line interface (DGMGRL)
Issuing SQL statements, as described in Section 4.2.2.2.2, "Using SQL to Fail Over to a Physical Standby Database"
See Also:
Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database FailoverData Guard failover is a series of steps to transition a standby database into a production database. The standby database essentially assumes the role of production. A Data Guard failover is accompanied by an application failover and, in some cases, preceded by a site failover. After the Data Guard failover, the secondary site contains the production database. The former production database must be reinstated as a new standby database to restore resiliency. The standby database can be quickly re-created by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".
With Data Guard the process of failover can be completely automated using fast-start failover, or the failover operation can be user driven, also referred to as manual failover. Fast-start failover eliminates the uncertainty of a process that requires manual intervention and automatically executes a zero data loss failover within seconds of an outage being detected. A manual failover allows for a failover process where decisions are user driven. Manual failover can be accomplished by using Oracle Enterprise Manager, by issuing commands at the Oracle Data Guard broker command-line interface, or by issuing the SQL statements described in subsequent sections.
A failover operation typically occurs in under a minute, and with little or no data loss. The complete description of a failover can be found in Oracle Data Guard Concepts and Administration.
See Also:
The following white papers for information about optimizing the Failover operation:"Oracle Database 10g Release 2 Best Practices: Data Guard Fast-Start Failover" at http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_FastStartFailoverBestPractices.pdf
"Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover" at http://www.oracle.com/technology/deploy/availability/pdf/MAA_WP_10gR2_SwitchoverFailoverBestPractices.pdf
When a primary database failure cannot be repaired in time to meet your Recovery Time Objective (RTO) using local backups or Flashback Technology, Data Guard should be used.
A manual failover, which is user initiated, should be performed due to an unplanned outage such as:
Site disaster which results in the primary database becoming unavailable
User errors that cannot be repaired in a timely fashion
Data failures, which impact the production application
A failover requires that the initial production database be reinstated as a standby database to restore fault tolerance to your environment. The standby database can be quickly reinstated by using Flashback Database. See Section 4.3.2, "Restoring a Standby Database After a Failover".
A fast-start failover is completely automated and requires no user intervention. A manual failover, being user-driven, can be performed using Enterprise Manager, the Data Guard broker command-line interface, or SQL*Plus commands:
Fast-start failover: There are no procedural best practices to consider when performing a fast-start failover. However, it is very important to address all of the configuration best practices described in Section 2.4.7.2.2, "Fast-Start Failover Best Practices".
Manual failover: When performing a manual failover, follow the best practices described in Section 4.2.2.2.2, "Using SQL to Fail Over to a Physical Standby Database" and the configuration best practices outlined in Section 2.4.7.2.3, "Manual Failover Best Practices":
For manual failovers that involve Real Application Clusters, issue the SHUTDOWN
ABORT
statement on all secondary RAC instances on the standby database prior to performing a failover.
See Also:
Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover athttp://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
This section contains these topics:
The procedure for Data Guard failover is the same for both physical and logical standby databases. The following screen shots illustrate how to perform a failover using Oracle Enterprise Manager:
In Figure 4-3 the Data Guard Overview page shows the ORA-16625 error status that indicates problems accessing the primary database.
Figure 4-3 Data Guard Overview Page Showing ORA-16625 Error
To transition DR_Sales into the primary role, select DR_Sales in the Standby Databases table and click Failover.
Figure 4-4 shows the Failover Confirmation page.
If you determine that a failure occurred on the primary database and there is no possibility of recovering the primary database in a timely manner, you can start the Failover operation. In configurations with both physical and logical standby databases, Oracle recommends using the physical standby database as the failover target because it will allow the logical standby database to continue to function as a logical standby to the new primary database. If the failover is made to the logical standby, any physical standbys in the configuration will need to be re-created from a backup of the new primary database.
The failover operation enables you to choose one of the following two types of failover operations:
Complete
This operation attempts to minimize data loss by applying all available redo on the standby database.
Immediate
No additional data is applied on the standby database; data might be lost. This is the fastest type of failover.
Figure 4-5 shows the progress of the failover operation.
During the failover, the selected standby database (also referred to as the target standby database) transitions into the primary role. If the failover target is a physical standby database, it is restarted. When completed, the Data Guard Overview page reflects the updated configuration, as shown in Figure 4-6.
Figure 4-6 Data Guard Overview Page After a Failover Completes
In the figure, the Data Guard Status column indicates that the original primary database (North_Sales) is disabled and can no longer be managed through Enterprise Manager until it has been re-enabled as a physical standby database.
Follow these steps to fail over to a physical standby database:
If the standby database is a Real Application Clusters database, then issue a SHUTDOWN
ABORT
on all additional standby instances.
Initiate the failover by issuing the following SQL command on the target standby database:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH;
If the network between the primary and standby sites is unavailable, then the standby RFS processes will wait for the network connections to time out through normal TCP timeout processing before shutting down. While the RFS processes are in this TCP timeout processing, the standby database will not be able to fail over unless you include the FORCE
keyword on the RECOVER
MANAGED
STANDBY
DATABASE
FINISH
statement.
Convert the physical standby database to the primary role:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
If the standby database was never opened read-only since the last time it was started, then open the new primary database by issuing the following SQL statement:
ALTER DATABASE OPEN;
If the standby database has been opened read-only, then restart the new primary database before starting Redo Apply.
Follow these steps to fail over to a logical standby database:
If the standby database is a Real Application Clusters database , then issue a SHUTDOWN
ABORT
on all additional standby instances.
Initiate the failover by issuing the following SQL command on the target standby database:
ALTER DATABASE ACTIVATE LOGICAL STANDBY DATABASE FINISH APPLY;
This statement stops the RFS process, applies any remaining redo data, stops SQL Apply, and activates the logical standby database in the primary role. To avoid waiting for the redo in the standby redo log file to be applied prior to performing the failover, omit the FINISH
APPLY
clause on the statement.
Although omitting the FINISH
APPLY
clause will accelerate failover, omitting it will also cause the loss of any unapplied redo data in the standby redo log. To gauge the amount of redo that will be lost, query the V$LOGSTDBY_PROGRESS
view. The LATEST_SCN
column value indicates the last SCN received from the primary database, and the APPLIED_SCN
column value indicates the last SCN applied to the standby database. All SCNs between these two values will be lost.
A database switchover performed by Oracle Data Guard is a planned transition that includes a series of steps to switch roles between a standby database and a production database. Following a successful switchover operation, the standby database assumes the production role and the production database becomes a standby database. In a RAC environment, a switchover requires that only one instance is active for each database, production and standby. At times the term switchback is also used within the scope of database role management. A switchback operation is a subsequent switchover operation to return the roles to their original state.
Data Guard enables you to change these roles dynamically by:
Using Enterprise Manager, as described in Section 4.2.3.2.1, "Using Enterprise Manager to Perform a Data Guard Switchover"
Using the Oracle Data Guard Broker command-line interface (DGMGRL)
Issuing SQL commands, as described in Section 4.2.3.2.2, "Using SQL for Data Guard Switchover to a Physical Standby Database" and Section 4.2.3.2.3, "Using SQL for Data Guard Switchover to a Logical Standby Database"
See Also:
Oracle Data Guard Broker for information about using Enterprise Manager or the Data Guard broker command-line for database switchoverSwitchover is a planned operation. Switchover is the capability to switch database roles between the production and standby databases. Switchover can occur whenever a production database is started, the target standby database is available, and all the archived redo logs are available. Switchovers are typically completed in less than five minutes and, in some cases, are optimized to be less than a minute. Switchovers are useful in the following situations:
Scheduled maintenance such as hardware maintenance or firmware patches on the production host
Resolution of data failures when the production database is still opened
Testing and validating the secondary resources, as a means to test disaster recovery readiness
Switchover is not possible or practical under the following circumstances:
Missing archived redo logs
Point-in-time recovery is required
The production database is not open and cannot be opened
Before performing a switchover the following procedural best practices should be followed in addition to the configuration best practices outlined in Section 2.4.7.1.1, "Switchover Best Practices":
Disconnect all sessions possible by issuing SQL command ALTER
SYSTEM
KILL
SESSION
.
Stop job processing by setting the AQ_TM_PROCESSES
to 0
.
Cancel any specified apply delay by using the NODELAY
keyword to stop and restart log apply services on the standby database:
On a physical standby database:
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL; ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE NODELAY;
On a logical standby database:
ALTER DATABASE STOP LOGICAL STANDBY APPLY; ALTER DATABASE START LOGICAL STANDBY APPLY IMMEDIATE NODELAY;
See Also:
Oracle Database 10g Release 2 Best Practices: Data Guard Switchover and Failover athttp://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
This following sections describe how to perform a switchover:
Using SQL for Data Guard Switchover to a Physical Standby Database
Using SQL for Data Guard Switchover to a Logical Standby Database
The procedure for Data Guard switchover using Enterprise Manager is the same for both physical and logical standby databases:
Select the standby database that you want to become the primary database.
Click Switchover.
Click Yes to continue with the switchover. Click No to cancel.
Figure 4-7 shows the switchover confirmation page.
Figure 4-7 Switchover Operation Confirmation
Figure 4-8 shows the processing page during the switchover.
Figure 4-8 Processing Page During Switchover
Figure 4-9 shows the Data Guard overview page after a successful switchover.
Figure 4-9 New Primary Database After Switchover
If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be performed using with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.
Follow these steps for a switchover to a physical standby database:
If possible, disconnect user sessions and disable or stop application processing.
If the primary and standby databases are RAC, then cleanly shut down all instances except one. To expedite this operation, issue a SHUTDOWN
ABORT
.
Issue the following SQL statement on the primary database to convert it to a standby database:
ALTER DATABASE COMMIT TO SWITCHOVER TO STANDBY WITH SESSION SHUTDOWN;
After the statement in the previous step completes:
Issue the following SQL statement on the old standby database:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
Immediately after issuing the COMMIT
TO
SWITCHOVER
TO
PRIMARY
statement, restart the old primary database as the new standby database and bring it to the mount state.
When the switchover command completes, issue the ALTER
DATABASE
OPEN
statement on the new primary database to bring it to the open state.
Opening the new primary database from the mount state is possible only if the standby database was never opened read-only since the last time the database was started. If the database has been opened read-only, it will need to be restarted.
If the primary and standby databases are RAC, then start all instances.
Restart user sessions and application processing.
If you are not using Oracle Enterprise Manager, then the high-level steps in this section can be executed with SQL*Plus. These steps are described in detail in Oracle Data Guard Concepts and Administration.
When performing a switchover using SQL*Plus commands it is possible for the old standby database that is to become the new primary database to build and transmit the LogMiner dictionary to the current primary database (the new standby database) prior to the actual switchover. This reduces the total time needed to perform the switchover. The following steps describe how to perform this optimized method:
Issue the following SQL statement on the primary database to enable receipt of redo from the current standby database:
ALTER DATABASE PREPARE TO SWITCHOVER TO LOGICAL STANDBY;
On the current logical standby database, build the LogMiner dictionary and transmit this dictionary to the current primary:
ALTER DATABASE PREPARE TO SWITCHOVER TO PRIMARY;
If possible, disconnect user sessions and disable or stop application processing.
If the primary and standby databases are RAC, then cleanly shut down all instances except one. To optimize the shutdown operations, use SHUTDOWN
ABORT
.
When the SWITCHOVER_STATUS
column of the V$DATABASE
view returns TO
LOGICAL
STANDBY
, convert the primary database to a standby by issuing:
ALTER DATABASE COMMIT TO SWITCHOVER TO LOGICAL STANDBY WITH SESSION SHUTDOWN;
Issue the following statement on the old standby database:
ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY;
If the primary and standby databases are configured in a RAC, then start all instances.
This solution is leveraged automatically when there is a node or instance failure. Surviving instances will automatically recover the failed instances and potentially aid in the automatic client failover. Recover times can be bounded by adopting the database and RAC configuration best practices and can usually lead to instance recovery times of seconds to minutes in very large busy systems with no data loss.
The following recovery methods can be used:
Instance failure occurs when software or hardware problems disable an instance. After instance failure, Oracle automatically uses the online redo log file to perform database recovery as described in this section.
Instance recovery in RAC does not include restarting the failed instance or the recovery of applications that were running on the failed instance. Applications that were running continue by using service relocation and fast application notification (as described in Section 4.2.4.2, "Automatic Service Relocation").
When one instance performs recovery for another instance, the recovering instance:
Reads redo log entries generated by the failed instance and uses that information to ensure that committed transactions are recorded in the database. Thus, data from committed transactions is not lost
Rolls back uncommitted transactions that were active at the time of the failure and releases resources used by those transactions
When multiple node failures occur, as long as one instance survives, RAC performs instance recovery for any other instances that fail. If all instances of a RAC database fail, then on subsequent restart of any one instance, crash recovery will occur and all committed transactions will be recovered. If Data Guard is available, you can fail over automatically with Data Guard fast-start failover once all instances are down.
Service reliability is achieved by configuring and failing over among redundant instances. More instances are enabled to provide a service than would otherwise be needed. If a hardware failure occurs and adversely affects a RAC database instance, then CRS automatically moves any services on that instance to another available instance, as configured with DBCA or Enterprise Manager. Then, Cluster Ready Services (CRS) attempt to restart the failed nodes and instances.
CRS recognizes when a failure affects a service and automatically fails over the service and redistributes the clients across the surviving instance supporting the service. In parallel, CRS attempts to restart and integrate the failed instances and dependent resources back into the system. Notification of failures using fast application notification (FAN) events occur at various levels within the Oracle Server architecture. The response can include notifying external parties through Oracle Notification Service (ONS), advanced queueing, or FAN callouts, recording the fault for tracking, event logging, and interrupting applications. Notification occurs from a surviving node when the failed node is out of service. The location and number of nodes serving a service is transparent to applications. Auto restart and recovery are automatic, including all the subsystems, such as the listener and the ASM instance, not just database.
Loss of the Oracle Cluster Registry file affects the availability of RAC and Cluster Ready Services. The OCR file can be restored from a physical backup that is automatically created or from an export file that is manually created by using the ocrconfig
tool. Additionally, starting with Oracle Database 10g Release 10.2, Oracle can optionally mirror the OCR so that a single OCR device failure can be tolerated.
See Also:
"Administering Storage in Real Application Clusters" in Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment GuideWith proper configuration, applications can be configured to receive fast and efficient notification when application services become unavailable. Once notified, application connections occur transparently to surviving resources such as nodes in a RAC database or a standby database in a remote datacenter than has assumed the primary role following a failover.
In a RAC configuration, services are essential to achieving fast and transparent application failover. If a service becomes unavailable for a particular instance, the service will fail over to an available instance in the cluster, thereby allowing applications to continue processing. Clients are notified of the service relocation through Fast Application Notification (FAN).
Services are also essential for client failover across sites in a Data Guard configuration. After a site failure in a Data Guard configuration, the new production database will also be configured to automatically publish the production service while notifying affected clients that the services are no longer available on the failed production database through FAN events.
FAN notifications and service relocation enable automatic and fast redirection of clients in the event of a failures for both RAC and Data Guard environments.
See Also:
Oracle Database 10g Release 2 Best Practices: Client Failover for Highly Available Oracle Databases at http://www.oracle.com/technology/deploy/availability/htdocs/maa.htm
The impacts and recommended repairs for various ASM failure types are summarized in Table 4-7.
Table 4-7 Types of ASM Failures and Recommended Repair
Failure | Description | Impact | Recommended Repair |
---|---|---|---|
ASM instance failure |
ASM instance fails |
All database instances accessing ASM storage from the same node will shut down. |
Automatic RAC Recovery for Unscheduled Outages If RAC is not used, use Data Guard failover (see Section 4.2.2.2, "Best Practices for Implementing Data Guard Failover") If RAC and Data Guard are not used, fix the underlying problem and then restart ASM and the database instances |
ASM disk failure |
One or more ASM disks fail, but all disk groups remain online. |
All data remains accessible. This is possible only with normal or high redundancy disk groups. |
ASM automatically rebalances to the remaining disk drives and reestablishes redundancy. There must be enough free disk space in the remaining disk drives to restore the redundancy or the rebalance may fail with an ORA-15041 Note: External redundancy disk groups should use mirroring in the storage array to protect from disk failure. Disk failures should not be exposed to ASM. |
Data area disk-group failure |
One or more ASM disks fail, and data area disk group goes offline. |
Databases accessing the data area disk group will shut down. |
Perform Data Guard failover or local recovery as described in Section 4.2.6.3, "Data Area Disk Group Failure" |
Flash recovery area disk-group failure |
One or more ASM disks fail, and the flash recovery area disk group goes offline. |
Databases accessing the flash recovery area disk group will shut down. |
Perform local recovery or Data Guard failover as described in Section 4.2.6.4, "Flash Recovery Area Disk Group Failure" |
If the ASM instance fails, database instances accessing ASM storage from the same node will shut down.If the primary database is using RAC, then application failover will occur automatically and clients connected to the database instance will reconnect to remaining instances providing the required service in the cluster and continue processing. The recovery time is typically in seconds.
If the primary database does not use RAC, then an ASM instance failure will shut down the entire database. If Data Guard is being used and Data Guard fast-start failover is configured, a database failover will be triggered automatically and clients will automatically reconnect to the new primary database following the failover. The recovery time is the time it takes to complete an automatic Data Guard fast-start failover operation. If fast-start failover is not configured, then recovering from this outage is a manual process, which can be accomplished by either restarting the ASM and database instances manually, or by performing a Data Guard failover. If neither RAC nor Data Guard is being used, then restart the ASM instance and restart database instances manually. The recovery time depends on the length of time to start the ASM instance, and the length of time to start the database instances and perform crash recovery.
If an ASM disk group is configured as an external redundancy type, then a failure of a single disk should be handled by the storage array and should not be seen by the ASM instance, and all operations of ASM and databases using the disk group will continue normally. However, if the failure of an external redundancy disk group is seen by the ASM instance, then the ASM instance will take the disk group offline immediately, causing Oracle instances accessing the disk group to crash. If the disk failure is temporary, then ASM and the database instances can be restarted, and crash recovery will occur after the disk group is brought back online. If an ASM disk group is configured as a normal or a high-redundancy type, then disk failure is handled transparently by ASM and the databases accessing the disk group are not affected. An ASM instance automatically starts an ASM rebalance operation to distribute the data on one or more failed disks to alternative disks in the ASM disk group. While the rebalance operation is in progress, subsequent disk failures may affect disk group availability if the disk contains data that has yet to be re-mirrored. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a subsequent failure. Multiple disk failures are handled similarly, provided the failures affect only one failure group in an ASM disk group.The failure of multiple disks in multiple failure groups where a primary extent and all of its mirror extents have been lost will cause the disk group to go offline.
See Also:
Section 4.2.6.3, "Data Area Disk Group Failure" and Section 4.2.6.4, "Flash Recovery Area Disk Group Failure" for detailsThis following recovery methods can be used:
Figure 4-10 shows Enterprise Manager reporting disk failures. Three alerts appear at 11:19:29. The first alert is an Offline Disk Count. The second and third alerts are Disk Status messages for data area disk DATA.XBBT1D06_DATA
and recovery area disk RECO.XBBT1D06_RECO
:
2 disks are offline Disk DATA.XBBT1D06_DATA is offline. Disk RECO.XBBT1D06_RECO is offline.
Figure 4-10 Enterprise Manager Reports Disk Failures
Figure 4-11 shows Enterprise Manager reporting the status of data area disk group DATA
and recovery area disk group RECO
. The red arrows under Member Disks indicate that one disk has failed in each disk group. The numbers under Pending Operations indicate that one operation is pending for each disk group.
Figure 4-11 Enterprise Manager Reports ASM Disk Groups Status
Figure 4-12 shows Enterprise Manager reporting a pending REBAL
operation on the DATA
disk group. The operation is about one-third done, as shown in % Complete, and the Remaining Time is estimated to be 16 minutes.
Figure 4-12 Enterprise Manager Reports Pending REBAL Operation
Perform these steps after one or more failed disks have been replaced, and access to the storage has been restored:
Add the one or more replacement disks to the failed disk group with the following SQL command:
ALTER DISKGROUP disk_group ADD FAILGROUP failure_group DISK 'disk1','disk2',...;
Check the progress of the operation:
SELECT * FROM V$ASM_OPERATION;
A data area disk group failure should occur only when there have been multiple failures. For example, if the data-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.
When one or more disks fail in a normal or high redundancy disk group, but the ASM disk group is accessible, there is no loss of data and no immediate loss of accessibility. An ASM instance automatically starts an ASM rebalance operation to distribute the data on the one or more failed disks to alternative disks in the ASM disk group. When the rebalance operation completes successfully, the ASM disk group is no longer at risk in the event of a second failure. There must be enough disk space on the remaining disks in the disk group for the rebalance to complete successfully.
The two possible solutions when the data area disk group fails are summarized in Table 4-8.
Table 4-8 Recovery Options for Data Area Disk Group Failure
If Data Guard is being used and fast-start failover is configured, then an automatic failover will occur when the database shuts down due to the data-area disk group going offline. If fast-start failover is not configured, then perform a manual failover.
If you decide to perform a Data Guard failover, then the recovery time objective (RTO) will be expressed in terms of minutes or perhaps seconds, depending on the presence of the Data Guard observer process and fast-start failover. However, if a manual failover occurs and not all data is on the standby site, then data loss might result.
After Data Guard failover has completed, and the application is available, the data area disk group failure must still be resolved. Continue with the following "Local Recovery Steps" procedure to resolve the ASM disk group failure.
The RTO for local recovery only is based primarily on the time required to:
Repair and replace the failed storage components
Restore and recover the database
Because the loss affects only the data-area disk group, there is no loss of data. All transactions are recorded in the Oracle redo log members that reside in the flash recovery area, so complete media recovery is possible.
If Data Guard is not being used, then perform the following local recovery steps. The time required to perform local recovery depends on how long it takes to restore and recover the database. There is no data loss when performing local recovery.
Perform these steps after one or more failed disks have been replaced and access to the storage has been restored:
Note:
If you have performed an Oracle Data Guard failover to a new primary database, then you can now use the following procedure to reintroduce the database into the Data Guard environment. Also, see Section 4.3.2, "Restoring a Standby Database After a Failover".Rebuild the ASM disk group using the new storage location:
SQL> CREATE DISKGROUP DATA NORMAL REDUNDANCY DISK 'path1','path2',...;
Start the instance NOMOUNT
:
RMAN> STARTUP FORCE NOMOUNT;
Restore the control file from the surviving copy located in the recovery area:
RMAN> RESTORE CONTROLFILE FROM 'recovery_area_controlfile';
Start the instance MOUNT
:
RMAN> STARTUP FORCE MOUNT;
Restore the database:
RMAN> RESTORE DATABASE
Recover the database:
RMAN> RECOVER DATABASE;
If you use block change tracking, then disable and reenable the block change tracking file:
SQL> ALTER DATABASE DISABLE BLOCK CHANGE TRACKING; SQL> ALTER DATABASE ENABLE BLOCK CHANGE TRACKING;
Open the database:
SQL> ALTER DATABASE OPEN;
Re-create the log file members on the failed ASM disk group:
SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename'; SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
Perform a new incremental level 0 backup:
RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
When the flash recovery-area disk group fails, the database crashes because the control file member usually resides in the flash recovery area and Oracle requires that all control file members are accessible. The flash recovery area can also contain the flashback logs, redo log members and all backups.
A flash recovery area disk group failure should occur only when there have been multiple failures. For example, if the flash recovery-area disk group is defined as external redundancy, a single-disk failure should not be exposed to ASM. However, multiple-disk failures in a storage array may be seen by ASM causing the disk group to go offline. Similarly, multiple-disk failures in different failure groups in a normal or high-redundancy disk group may cause the disk group to go offline.
Table 4-9 summarizes the two possible solutions when the flash recovery-area disk group fails.
Table 4-9 Recovery Options for Flash Recovery Area Disk Group Failure
Recovery Option | Recovery Time Objective (RTO) | Recovery Point Objective (RPO) |
---|---|---|
Local Recovery |
Five minutes or less |
Zero |
Data Guard Failover or Switchover |
Five minutes or less |
Zero |
Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. A fast local restart is to startup the primary database after removing the controlfile member located in the failed flash recovery area and point to a new flash recovery area for local archiving (see "Local Restart Steps" discussion later in this section for the step-by-step procedure). However, this is a temporary fix until a new flash recovery area is created to replace the failed storage components. Oracle recommends using the "Local Recovery Steps" discussion later in this section.
If you decide to perform a Data Guard failover, then the RTO will be expressed in terms of minutes or perhaps seconds depending on the presence of the Data Guard observer process and fast-start failover. After Data Guard failover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.
If the protection level is maximum performance or the standby database is unsynchronized with the primary database, then temporarily start up the primary database by removing the controlfile member and pointing to a temporary flash recovery area (file system) in the SPFILE. Issue a Data Guard switchover to ensure no data loss. After Data Guard switchover has completed, and the application is available, the flash recovery area disk group failure must still be resolved. Shut down the affected database and continue with the instructions in the following "Local Recovery Steps" section to resolve the ASM disk group failure.
The RTO for local recovery only is based primarily on the time to repair and replace the failed storage components and then on the time to restore the control-file copy. Because the loss affects only the flash recovery-area disk group, there is no loss of data. No database media recovery is required, because the data files and the online redo log files are still present and available in the data area. As mentioned previously, you can start up the primary database by removing the controlfile member and pointing to a new flash recovery area. However, this is a temporary fix filled with availability and performance risks unless the flash recovery area is configured properly. Therefore, Oracle recommends the "Local Recovery Steps" that follow.
For a fast local restart, perform the following steps on the primary database:
Change the CONTROL_FILES
initialization parameter to refer only to members in the Data Area. For example:
ALTER SYSTEM SET CONTROL_FILES='+DATA/sales/control1.dbf' SCOPE=spfile;
Change local archive destinations and/or the flash recovery area to the local redundant, scalable destination. For example:
ALTER SYSTEM SET DB_RECOVERY_FILE_DEST='+DATA' SCOPE=spfile;
Startup with new settings:
STARTUP MOUNT:
You may need to disable and reenable Flashback Database because the flashback logs were damaged or lost:
ALTER DATABASE FLASHBACK OFF; ALTER DATABASE FLASHBACK ON; ALTER DATABASE OPEN;
Note:
If you performed an Oracle Data Guard failover to a new primary database, then you cannot use this procedure to reintroduce the old primary database as a standby database. This is because Flashback Database log files that are required as part of reintroducing the database have been lost. You must perform a full reinstantiation of the standby database.Replace or get access to new storage to be leveraged as flash recovery area
Rebuild the ASM disk group using the new storage location:
SQL> CREATE DISKGROUP RECO NORMAL REDUNDANCY DISK 'path1','path2',...;
Start the instance NOMOUNT
:
RMAN> STARTUP FORCE NOMOUNT;
Restore the control file from the surviving copy located in the data area:
RMAN> RESTORE CONTROLFILE FROM 'data_area_controlfile';
Start the instance MOUNT
:
RMAN> STARTUP FORCE MOUNT;
If you use Flashback Database, then disable it:
SQL> ALTER DATABASE FLASHBACK OFF;
Open the database and allow instance recovery to complete:
SQL> ALTER DATABASE OPEN;
Issue the following statements only if Flashback Database is required:
SQL> SHUTDOWN IMMEDIATE; SQL> STARTUP MOUNT; SQL> ALTER DATABASE FLASHBACK ON; SQL> ALTER DATABASE OPEN;
Re-create the log file members on the failed ASM disk group:
SQL> ALTER DATABASE DROP LOGFILE MEMBER 'filename'; SQL> ALTER DATABASE ADD LOGFILE MEMBER 'disk_group' TO GROUP group_no;
Synchronize the control file and the flash recovery area:
RMAN> CATALOG RECOVERY AREA; RMAN> CROSSCHECK ARCHIVELOG ALL; RMAN> CROSSCHECK BACKUPSET; RMAN> CROSSCHECK DATAFILECOPY ALL; RMAN> LIST EXPIRED type; RMAN> DELETE EXPIRED type;
Assuming that data has been lost in some way, take a new backup:
RMAN> BACKUP INCREMENTAL LEVEL 0 DATABASE;
Recovering from data corruption is an unscheduled outage scenario. Data corruption is usually—but not always—caused by some activity or failure that occurs outside the database, even though the problem might be evident within the database.
Data corruption in data files has two categories:
Data file block corruption
A corrupt data file block can be accessed, but the contents in the block are invalid or inconsistent. The typical cause of data file corruption is a faulty hardware or software component in the I/O stack, which includes, but is not limited to, the file system, volume manager, device driver, host bus adapter, storage controller, and disk drive.
The database usually remains available when corrupt blocks have been detected, but some corrupt blocks might cause widespread problems, such as corruption in a file header or with a data dictionary object, or corruption in a critical table that renders an application unusable.
A data fault is detected when it is recognized by the user, administrator, RMAN backup, or application because it has affected the availability of the application. For example:
A single corrupt data block in a user table that cannot be read by the application because of a bad spot of the physical disk
A single corrupt data block because of block inconsistencies detected by Oracle. The block will be marked corrupted and any application accessing the block will receive an ORA-1578
error.
A database that automatically shuts down because of the invalid blocks of a data file in the SYSTEM
tablespace caused by a failing disk controller
Media failure
This category of data corruption results from a physical hardware problem or user error. The system cannot successfully read or write to a file that is necessary to operate the database.
In all environments, you can resolve a data corruption outage by one of the following methods:
RMAN block media restoration and recovery
Data Guard switchover or failover to a standby database
RMAN datafile media restoration and recovery
Manually re-create the object
RMAN block media restoration and recovery provides the highest application availability if targeted blocks are not critical to application functionality. Data Guard switchover or failover to standby database provides the fastest predictable RTO.
Other outages that result in database objects becoming unavailable or inconsistent are caused by human error, such as dropping a table or erroneously updating table data. Information about recovering from human error can be found in Section 4.2.8, "Recovering from Human Error".
If the data corruption impacts nondata files, then the repair may be slightly different. Table 4-10 provides a matrix of the key non database object corruption and the recommended repair.
Table 4-10 Non Database Object Corruption and Recommended Repair
Object or Component Affected | Impact | Repair |
---|---|---|
Any control file |
Database fails |
Data Guard fast-start failover will automatically fail over to the standby database |
Redo log member |
None |
|
Active log group that is archived and not needed for crash recovery |
Database fails |
Restart database after dropping affected redo log group |
Active redo log group not archived and not needed for crash recovery |
Database fails |
|
Active or current redo log group that is still needed for crash recovery |
Database fails |
Use one of the following solutions:
|
Archived redo log file |
None |
|
SPFILE |
None |
Restore SPFILE from a backup and revise |
The following recovery methods can be used:
Failover is the operation of transitioning the standby databases as the new production database. A database switchover is a planned transition in which a standby database and a production database switch roles. Either of these operations can occur in less than 5 minutes and with no data loss.Use Data Guard switchover or failover for data corruption or data failure when:
The database is down or when the database is up but the application is unavailable because of data corruption or failure, and the time to restore and recover locally is long or unknown.
Recovering locally will be longer than the business SLA or RTO.
Block media recovery (BMR) recovers one block or a set of data blocks marked "media corrupt" in a data file by using the RMAN BLOCKRECOVER
command. When a small number of data blocks are marked media corrupt and require media recovery, you can selectively restore and recover damaged blocks rather than whole data files. This results in lower RTO because only blocks that need recovery are restored and only necessary corrupt blocks undergo recovery. Block media recovery minimizes redo application time and avoids I/O overhead during recovery. It also enables affected data files to remain online during recovery of the corrupt blocks. The corrupt blocks, however, remain unavailable until they are completely recovered.
Use block media recovery when:
A small number of blocks require media recovery and the blocks that need recovery are known. If a significant portion of the datafile is corrupt, or if the amount of corruption is unknown, then a different recovery method should be used.
Blocks are marked corrupt (verified with the RMAN BACKUP VALIDATE
command) and only when complete recovery is required.
Backup of the data file containing the corrupted blocks is available locally or can be retrieved from a remote location including from a a physical standby database.
Block media recovery cannot be used to recover from the following:
User error or software bugs that cause logical corruption where the data blocks are intact. See Section 4.2.8, "Recovering from Human Error" for additional details for this type of recovery.
Changes caused by corrupt redo data. Block media recovery requires that all available redo data be applied to the blocks being recovered.
For example, to recover a specific corrupt block using RMAN block media recovery:
RMAN> BLOCKRECOVER DATAFILE 7 BLOCK 3;
When the corruption is detected, it would be easy to recover this block through Grid Control.
See Also:
Oracle Database Backup and Recovery Advanced User's Guide and the white paper titled Using Recovery Manager with Oracle Data Guard in Oracle Database 10g available athttp://www.oracle.com/technology/deploy/availability/pdf/RMAN_DataGuard_10g_wp.pdf
Data file media recovery recovers an entire datafile or set of data files for a database by using the RMAN RECOVER
command. When a large or unknown number of data blocks are marked media-corrupt and require media recovery, or when an entire file is lost, the affected data files must be restored and recovered.
Use RMAN file media recovery when the following conditions are true:
The number of blocks requiring recovery is large or unknown
Block media recovery is not available (for example, if incomplete recovery is required, or if only incremental backups are available for the data file requiring recovery)
See Also:
"Advanced User-Managed Recovery Scenarios" in Oracle Database Backup and Recovery Advanced User's Guide and the white paper titled Using Recovery Manager with Oracle Data Guard in Oracle Database 10g available athttp://www.oracle.com/technology/deploy/availability/pdf/RMAN_DataGuard_10g_wp.pdf
Some database objects, such as small look-up tables or indexes, can be recovered quickly by manually re-creating the object instead of doing media recovery.
Use manual object re-creation when:
You must re-create a small index because of media corruption. Creating an index online enables the base object to be used concurrently.
You must re-create a look-up table or when the scripts to re-create the table are readily available. Dropping and re-creating the table might be the fastest option.
Oracle flashback technology revolutionizes data recovery. In the past it took seconds to damage a database but hours to days to recover it. With flashback technology, the time to correct errors can be as short as the time it took to make the error. Fixing human errors that require rewinding the database, table, transaction, or row level changes to a previous point in time is easy and does not require any database or object restoration. Flashback technology provides fine-grained analysis and repair for localized damage such as erroneous row deletion. Flashback technology also enables correction of more widespread damage such as accidentally running the wrong application batch job. Furthermore, flashback technology is exponentially faster than a database restoration.
Flashback technologies are applicable only to repairing the following human errors:
Erroneous or malicious update, delete or insert transactions
Erroneous or malicious DROP TABLE
statements
Erroneous or malicious batch job or wide-spread application errors
Flashback technologies cannot be used for media or data corruption such as block corruption, bad disks, or file deletions. See Section 4.2.7, "Recovering from Data Corruption (Data Failures)" and Section 4.2.2, "Database Failover with a Standby Database" to repair these outages.
Table 4-23 summarizes the flashback solutions for each type of outage.
Table 4-11 Flashback Solutions for Different Outages
Impact of Outage | Examples of Human Errors | Flashback Solutions |
---|---|---|
Row or transaction |
Accidental deletion of row Erroneous transaction |
Flashback Query Flashback Version Query Flashback Transaction Query |
Table See Also:"Resolving Table Inconsistencies" |
Dropped table Erroneous transactions affecting one table or a set of tables |
Flashback Drop Flashback Table |
Tablespace or database |
Erroneous batch job affecting many tables or an unknown set of tables Series of database-wide malicious transactions |
Flashback Database |
Table 4-12 summarizes each flashback feature.
Table 4-12 Summary of Flashback Features
Flashback Database uses the Oracle Database flashback logs, while all other features of flashback technology use the Oracle Database unique undo and multiversion read consistency capabilities. See the configuration best practices for the database—as documented in Section 2.2, "Configuring Oracle Database 10g"—for configuring flashback technologies to ensure that the resources from these solutions are available at a time of failure.
See Also:
Oracle Database Administrator's Guide, Oracle Database Backup and Recovery Basics, and Oracle Database Concepts for more information about flashback technology and automatic undo managementIn general, the recovery time when using flashback technologies is equivalent to the time it takes to cause the human error plus the time it takes to detect the human error.
Flashback technologies allow recovery up to the point that the human error occurred.
The following recovery methods can be used:
Oracle provides a FLASHBACK DROP
statement to recover from an accidental DROP TABLE
statement, and a FLASHBACK TABLE
statement to restore a table to a previous point in the database.
Flashback Table provides the DBA the ability to recover a table, or a set of tables, to a specified point in time quickly and easily. In many cases, Flashback Table alleviates the more complicated point in time recovery operations. For example:
FLASHBACK TABLE orders, order_items TO TIMESTAMP TO_DATE('28-Jun-06 14.00.00','dd-Mon-yy hh24:mi:ss');
This statement rewinds any updates to the ORDERS
and ORDER_ITEMS
tables that have been done between the current time and specified timestamp in the past. Flashback Table performs this operation online and in place, and it maintains referential integrity constraints between the tables.
Dropping or deleting database objects by accident is a common mistake. Users soon realize their mistake, but by then it is too late and there has been no way to easily recover the dropped tables and its indexes, constraints, and triggers. Objects once dropped were dropped forever. Loss of very important tables or other objects (like indexes, partitions or clusters) required DBAs to perform a point-in-time recovery, which can be time-consuming and lead to loss of recent transactions.
Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a recycle bin. Objects in the recycle bin remain there until the user decides to permanently remove them or until space limitations begin to occur on the tablespace containing the table. The recycle bin is a virtual container where all dropped objects reside. Users view the recycle bin and undrop the dropped table and its dependent objects. For example, the employees
table and all its dependent objects would be undropped by the following statement:
FLASHBACK TABLE employees TO BEFORE DROP;
Resolving row and transaction inconsistencies might require a combination of Flashback Query, Flashback Version Query, Flashback Transaction Query, and the compensating SQL statements constructed from undo statements to rectify the problem. This section describes a general approach using a human resources example to resolve row and transaction inconsistencies caused by erroneous or malicious user errors.
Flashback Query, a feature introduced in the Oracle9i Database, enables an administrator or user to query any data at some point in time in the past. This powerful feature can be used to view and reconstruct data that might have been deleted or changed by accident. For example:
SELECT * FROM EMPLOYEES AS OF TIMESTAMP TO_DATE('28-Jun-06 14:00','DD-Mon-YY HH24:MI') WHERE ...
This partial statement displays rows from the EMPLOYEES
table starting from 2 p.m. on June 28, 2006. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators to perform this task. Flashback Query is very simple to manage, because the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.
Flashback Version Query provides a way to view changes made to the database at the row level. It is an extension to SQL and enables the retrieval of all the different versions of a row across a specified time interval. For example:
SELECT * FROM EMPLOYEES VERSIONS BETWEEN TIMESTAMP TO_DATE('28-Jun-06 14:00','dd-Mon-YY hh24:mi') AND TO_DATE('28-Jun-06 15:00','dd-Mon-YY hh24:mi') WHERE ...
This statement displays each version of the row, each entry changed by a different transaction, between 2 and 3 p.m. today. A DBA can use this to pinpoint when and how data is changed and trace it back to the user, application, or transaction. This enables the DBA to track down the source of a logical corruption in the database and correct it. It also enables application developers to debug their code.
Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that enables you to see all changes made by a transaction. For example:
SELECT UNDO_SQL FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = '000200030000002D';
This statement shows all of the changes that resulted from this transaction. In addition, compensating SQL statements are returned and can be used to undo changes made to all rows by this transaction. Using a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.
Consider a human resources (HR) example involving the SCOTT
schema. The HR manager reports to the DBA that there is a potential discrepancy in Ward's salary. Sometime before 9:00 a.m., Ward's salary was increased to $1875. The HR manager is uncertain how this occurred and wishes to know when the employee's salary was increased. In addition, he instructed his staff to reset the salary to the previous level of $1250. This was completed around 9:15 a.m.
The following steps show how to approach the problem.
Assess the problem.
Fortunately, the HR manager has provided information about the time when the change occurred. You can query the information as it was at 9:00 a.m. with Flashback Query.
SELECT EMPNO, ENAME, SAL FROM EMP AS OF TIMESTAMP TO_DATE('28-JUN-06 09:00','dd-Mon-yy hh24:mi') WHERE ENAME = 'WARD'; EMPNO ENAME SAL ---------- ---------- ---------- 7521 WARD 1875
To can confirm that you have the correct employee by the fact that Ward's salary was $1875 at 09:00 a.m. Rather than using Ward's name, you can now use the employee number for subsequent investigation.
Query past rows or versions of the data to acquire transaction information.
Although it is possible to restrict the row version information to a specific date or SCN range, you might want to query all the row information that is available for the employee WARD using Flashback Version Query.
SELECT EMPNO, ENAME, SAL, VERSIONS_STARTTIME, VERSIONS_ENDTIME FROM EMP VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE WHERE EMPNO = 7521 ORDER BY NVL(VERSIONS_STARTSCN,1); EMPNO ENAME SAL VERSIONS_STARTTIME VERSIONS_ENDTIME ----- ---------- ------- ---------------------- ---------------------- 7521 WARD 1250 28-JUN-06 08.48.43 AM 28-JUN-06 08.54.49 AM 7521 WARD 1875 28-JUN-06 08.54.49 AM 28-JUN-06 09.10.09 AM 7521 WARD 1250 28-JUN-06 09.10.09 AM
You can see that WARD's salary was increased from $1250 to $1875 at 08:54:49 the same morning and was subsequently reset to $1250 at approximately 09:10:09.
Also, you can modify the query to determine the transaction information for each of the changes affecting WARD using a similar Flashback Version Query. This time use the VERSIONS_XID
pseudocolumn.
SELECT EMPNO, ENAME, SAL, VERSIONS_XID FROM EMP VERSIONS BETWEEN TIMESTAMP MINVALUE AND MAXVALUE WHERE EMPNO = 7521 ORDER BY NVL(VERSIONS_STARTSCN,1); EMPNO ENAME SAL VERSIONS_XID ---------- ---------- ---------- ---------------- 7521 WARD 1250 0006000800000086 7521 WARD 1875 0009000500000089 7521 WARD 1250 000800050000008B
The ID of the erroneous transaction that increased WARD's salary to $1875 was "0009000500000089".
Query the erroneous transaction and the scope of its effect.
With the transaction information (VERSIONS_XID
pseudocolumn), you can now query the database to determine the scope of the transaction, using Flashback Transaction Query.
SELECT UNDO_SQL FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = HEXTORAW('0009000500000089'); UNDO_SQL ---------------------------------------------------------------------------- update "SCOTT"."EMP" set "SAL" = '950' where ROWID = 'AAACV4AAFAAAAKtAAL'; update "SCOTT"."EMP" set "SAL" = '1500' where ROWID = 'AAACV4AAFAAAAKtAAJ'; update "SCOTT"."EMP" set "SAL" = '2850' where ROWID = 'AAACV4AAFAAAAKtAAF'; update "SCOTT"."EMP" set "SAL" = '1250' where ROWID = 'AAACV4AAFAAAAKtAAE'; update "SCOTT"."EMP" set "SAL" = '1600' where ROWID = 'AAACV4AAFAAAAKtAAB'; 6 rows selected.
You can see that WARD's salary was not the only change that occurred in the transaction. The information that was changed for the other four employees at the same time as WARD can now be passed back to the HR manager for review.
Determine if the corrective statements should be executed.
If the HR manager decides that the corrective changes suggested by the UNDO_SQL
column are correct, then the DBA can execute those statements individually.
Query the FLASHBACK_TRANSACTION_QUERY
view for additional transaction information. For example, to determine the user that performed the erroneous update, issue the following query:
SELECT LOGON_USER FROM FLASHBACK_TRANSACTION_QUERY WHERE XID = HEXTORAW('0009000500000089'); LOGON_USER ---------------------------- MSMITH
In this example, the query shows that the user MSMITH was responsible for the erroneous transaction.
To bring an Oracle database to a previous point in time, the traditional method is point-in-time recovery. However, point-in-time recovery can take hours or even days, because it requires the whole database to be restored from backup and recovered to the point in time just before the error was introduced into the database. With the size of databases constantly growing, it will take hours or even days just to restore the whole database.
Flashback Database is a new strategy for doing point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error. Flashback logs are used to capture old versions of changed blocks. One way to think of it is as a continuous backup or storage snapshot. When recovery must be performed the flashback logs are quickly replayed to restore the database to a point in time before the error and just the changed blocks are restored. It is extremely fast and reduces recovery time from hours to minutes. In addition, it is easy to use. A database can be recovered to 2:05 p.m. by issuing a single statement. Before the database can be recovered, all instances of the database must be shut down and one of the instances subsequently mounted. The following is an example of a FLASHBACK DATABASE
statement.
FLASHBACK DATABASE TO TIMESTAMP SYSDATE-1;
No restoration from tape, no lengthy downtime, and no complicated recovery procedures are required to use it. You can also use Flashback Database and then open the database in read-only mode and examine its contents. If you determine that you flashed back too far or not far enough, then you can reissue the FLASHBACK DATABASE
statement or continue recovery to a later time to find the proper point in time before the database was damaged. Flashback Database works with a production database, a physical standby database, or a logical standby database.
These steps are recommended for using Flashback Database:
Determine the time or the SCN to which to flash back the database.
Verify that there is sufficient flashback log information.
SELECT OLDEST_FLASHBACK_SCN, TO_CHAR(OLDEST_FLASHBACK_TIME, 'mon-dd-yyyy HH:MI:SS') FROM V$FLASHBACK_DATABASE_LOG;
Flash back the database to a specific time or SCN. (The database must be mounted to perform a Flashback Database.)
FLASHBACK DATABASE TO SCN scn;
or
FLASHBACK DATABASE TO TIMESTAMP TO_DATE date;
Open the database in read-only mode to verify that it is in the correct state.
ALTER DATABASE OPEN READ ONLY;
If more flashback data is required, then issue another FLASHBACK DATABASE
statement. (The database must be mounted to perform a Flashback Database.)
If you want to move forward in time, then issue a statement similar to the following:
RECOVER DATABASE UNTIL [TIME | CHANGE] date | scn;
Open the database:
ALTER DATABASE OPEN RESETLOGS;
Other considerations when using Flashback Database are as follows:
If there are not sufficient flashback logs to flash back to the target time, then use one of the following alternatives:
Use Data Guard to recover to the target time if the standby lags behind the primary database or flash back to the target time if there's sufficient flashback logs on the standby.
Restore from backups.
After flashing back a database, any dependent database such as a standby database must be flashed back. See Section 4.3, "Restoring Fault Tolerance".
Flashback Database does not automatically fix a dropped tablespace, but it can be used to dramatically reduce the downtime. You can flash back the production database to a point before the tablespace was dropped and then restore a backup of the corresponding datafiles from the affected tablespace and recover to a time before the tablespace was dropped.
Follow these recommended steps to use Flashback Database to repair a dropped tablespace:
Determine the SCN or time you dropped the tablespace.
Flash back the database to a time before the tablespace was dropped. You can use a statement similar to the following:
FLASHBACK DATABASE TO BEFORE SCN drop_scn;
Restore, rename, and bring datafiles online.
Restore only the datafiles from the affected tablespace from a backup.
Rename the unnamed files to the backup files.
ALTER DATABASE RENAME FILE '.../UNNAMED00005' to 'restored_file';
Bring the datafiles online.
ALTER DATABASE DATAFILE 'name' ONLINE;
Query and recover the database.
SELECT CHECKPOINT_CHANGE# FROM V$DATAFILE_HEADER WHERE FILE#=1;
RECOVER DATABASE UNTIL CHANGE checkpoint_change#;
Open the database.
ALTER DATABASE OPEN RESETLOGS;
Whenever a component in a high-availability architecture fails, then the full protection—or fault tolerance—of the architecture is compromised and possible single points of failure exist until the component is repaired. Restoring the high-availability architecture to full fault tolerance to reestablish full RAC, Data Guard, or MAA protection requires repairing the failed component. While full fault tolerance might be sacrificed during planned downtime, the method of repair is well understood because it is planned, the risk is controlled, and it ideally occurs at times best suited for continued application availability. However, for unplanned downtime, the risk of exposure to a single point of failure must be clearly understood.
This section provides the following topics that describe the steps needed to restore database fault tolerance:
For Oracle Database 10g with RAC
For Oracle Database 10g with Data Guard and Oracle Database 10g with RAC and Data Guard - MAA
Ensuring that application services fail over quickly and automatically in a RAC cluster—or between primary and secondary sites—is important when planning for both scheduled and unscheduled outages. To ensure that the environment is restored to full fault tolerance after any errors or issues are corrected, it is also important to understand the steps and processes for restoring failed instances or nodes within a RAC cluster or databases between sites.
Adding a failed node back into the cluster or restarting a failed RAC instance is easily done after the core problem that caused the specific component to originally fail has been corrected. However, you should also consider:
When to perform these tasks so as to incur minimal or no effect on the current running environment
Resetting network components (such as load balancer) which were modified for failover and now must be reset
Failing back or rebalancing existing connections
After the problem that caused the initial node or instance failure has been corrected, a node or instance can be restarted and added back into the RAC environment at any time. Processing to complete the reconfiguration of a node may require additional system resources.
Table 4-13 summarizes additional processing that may be required when adding a node.
Table 4-13 Additional Processing When Restarting or Rejoining a Node or Instance
Action | Additional Resources |
---|---|
Restarting a node or rejoining a node into a cluster |
When using only Oracle Clusterware, there is no impact when a new node joins the cluster. When using vendor clusterware, there may be performance degradation while reconfiguration occurs to add a node back into the cluster. The impact on current applications should be evaluated with a full test workload. |
Restarting or rejoining a RAC instance |
When you restart a RAC instance, there might be some potential performance impact while lock reconfiguration takes place. The impact on current applications is usually minimal, but it should be evaluated with a full test workload. |
See Also:
Your vendor-specified cluster management documentation for detailed steps on how to start and join a node back into a cluster
Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for more information about restarting a RAC instance
The following recovery methods can be used:
After a failed node has been brought back into the cluster and its instance has been started, Cluster Ready Services (CRS) automatically manages the virtual IP address used for the node and the services supported by that instance automatically. A particular service might or might not be started for the restored instance. The decision by CRS to start a service on the restored instance depends on how the service is configured and whether the proper number of instances are currently providing access for the service. A service is not relocated back to a preferred instance if the service is still being provided by an available instance to which it was moved by CRS when the initial failure occurred. CRS restarts services on the restored instance if the number of instances that are providing access to a service across the cluster is less than the number of preferred instances defined for the service. After CRS restarts a service on a restored instance, CRS notifies registered applications of the service change.
For example, suppose the HR service is defined with instances A and B as preferred and instances C and D as available in case of a failure. If instance B fails and CRS starts up the HR service on C automatically, then when instance B is restarted, the HR service remains at instance C. CRS does not automatically relocate a service back to a preferred instance.
Suppose a different scenario in which the HR service is defined with instances A, B, C, and D as preferred and no instances defined as available, spreading the service across all nodes in the cluster. If instance B fails, then the HR service remains available on the remaining three nodes. CRS automatically starts the HR service on instance B when it rejoins the cluster because it is running on fewer instances than configured. CRS notifies the applications that the HR service is again available on instance B.
After a RAC instance has been restored, additional steps might be required, depending on the current resource utilization and performance of the system, the application configuration, and the network load balancing that has been implemented.
Existing connections (which might have failed over or started as a new session) on the surviving RAC instances, are not automatically redistributed or failed back to an instance that has been restarted. Failing back or redistributing users might or might not be necessary, depending on the current resource utilization and the capability of the surviving instances to adequately handle and provide acceptable response times for the workload. If the surviving RAC instances do not have adequate resources to run a full workload or to provide acceptable response times, then it might be necessary to move (disconnect and reconnect) some existing user connections to the restarted instance.
New connections are started as they are needed, on the least-used node, assuming connection load balancing has been configured. Therefore, the new connections are automatically load-balanced over time.
An application service can be:
Partitioned with services running on a subset of RAC instances
Nonpartitioned so that all services run equally across all nodes
This is valuable for modularizing application and database form and function while still maintaining a consolidated data set. For the cases where an application is partitioned or has a combination of partitioning and nonpartitioning, the response time and availability aspects for each service should be considered. If redistribution or failback of connections for a particular service is required, then you can rebalance workloads manually with the DBMS_SERVICE.DISCONNECT_SESSION
PL/SQL procedure. You can use this procedure to disconnect sessions associated with a service while the service is running.
For load-balancing application services across multiple RAC instances, Oracle Net connect-time failover and connection load balancing are recommended. This feature does not require changes or modifications for failover or restoration. It is also possible to use hardware-based load balancers. However, there might be limitations in distinguishing separate application services (which is understood by Oracle Net Services) and restoring an instance or a node. For example, when a node or instance is restored and available to start receiving new connections, a manual step might be required to include the restored node or instance in the hardware-based load balancer logic, whereas Oracle Net Services does not require manual reconfiguration.
Table 4-14 summarizes the considerations for new and existing connections after an instance has been restored. The considerations differ depending on whether the application services are partitioned, nonpartitioned, or are a combination of both. The actual redistribution of existing connections might or might not be required depending on the resource utilization and response times.
Table 4-14 Restoration and Connection Failback
Application Services | Failback or Restore Existing Connections | Failback or Restore New Connections |
---|---|---|
Partitioned |
Existing sessions are not automatically relocated back to the restored instance. Use |
Automatically routes to the restored instance by using the Oracle Net Services configuration. |
Nonpartitioned |
No action is necessary unless the load must be rebalanced, because restoring the instance means that the load there is low. If the load must be rebalanced, then the same problems are encountered as if application services were partitioned. |
Automatically routes to the restored instance (because its load should be lowest) by using the Oracle Net Services configuration |
Figure 4-13 shows a two-node partitioned RAC database. Each instance services a different portion of the application (HR and Sales). Client processes connect to the appropriate instance based on the service they require.
Figure 4-13 Partitioned Two-Node RAC Database
Figure 4-14 shows what happens when one RAC instance fails.
Figure 4-14 RAC Instance Failover in a Partitioned Database
If one RAC instance fails, then the service and existing client connections can be automatically failed over to another RAC instance. In this example, the HR and Sales services are both supported by the remaining RAC instance. In addition, new client connections for the Sales service can be routed to the instance now supporting this service.
After the failed instance has been repaired and restored to the state shown in Figure 4-13 and the Sales service is relocated to the restored instance failed-over clients and any new clients that had connected to the Sales service on the failed-over instance might have to be identified and failed back. New client connections, which are started after the instance has been restored, should automatically connect back to the original instance. Therefore, over time, as older connections disconnect, and new sessions connect to the Sales service, the client load migrates back to the restored instance. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.
Figure 4-15 shows a nonpartitioned application. Services are evenly distributed across both active instances. Each instance has a mix of client connections for both HR and Sales.
If one RAC instance fails, then CRS moves the services that were running on the failed instance. In addition, new client connections are routed only to the available RAC instances that offer that service.
After the failed instance has been repaired and restored to the state shown in Figure 4-15, some clients might have to be moved back to the restored instance. For nonpartitioned applications, identifying appropriate services is not required for rebalancing the client load among all available instances. Also, this is necessary only if a single instance is not able to adequately service the requests.
New client connections that are started after the instance has been restored should automatically connect back to the restored instance because it has a smaller load. Therefore, over time, as older connections disconnect and new sessions connect to the restored instance, the client load will again evenly balance across all available RAC instances. Rebalancing the load immediately after restoration depends on the resource utilization and application response times.
Following unplanned downtime on a production database that requires a failover, full fault tolerance is compromised until the standby database is reestablished. Full database protection should be restored as soon as possible. The steps for restoring fault tolerance differ slightly between physical and logical standby databases.
Reinstating databases is automated if you are using Data Guard fast-start failover. After a fast-start failover completes, the observer automatically attempts to reinstate the former primary database as a standby database. Reinstatement restores high availability to the broker configuration so that, in the event of a failure of the new primary database, another fast-start failover can occur. The reinstated database can act as the fast-start failover target for the new primary database, making a subsequent fast-start failover possible. The new standby database is a viable target of a failover when it begins applying redo data received from the new primary database.
The broker can reinstate the former primary database as a standby database without the need to reenable the primary database or to manually perform a Flashback Database operation. To reinstate the former primary database, the database must be started and mounted, but it cannot be opened. The broker reinstates the database as a standby database of the same type (physical or logical) as the former standby database.
If the former primary database cannot be reinstated automatically, you can manually reinstate it using either the DGMGRL REINSTATE
command or Enterprise Manager. Step-by-step instructions for manual reinstatement are described in Oracle Data Guard Broker.
Standby databases do not have to be reinstantiated if you use the Oracle Flashback Database feature. Flashback Database has the following advantages:
Saves hours of database restoration time
Reduces overall complexity in restoring fault tolerance
Reduces the time that the system is vulnerable because the standby database is re-created more quickly
See Also:
The following topics in Oracle Data Guard Concepts and Administration:Flashing Back a Failed Primary Database into a Physical Standby Database
Flashing Back a Failed Primary Database into a Logical Standby Database
This section includes the following topics:
Following a fast-start failover, the observer periodically attempts to reconnect to the original primary database. When the observer regains network access to the original primary database, it initiates a request for the Data Guard broker to automatically reinstate it as a standby database to the new primary. This quickly restores disaster protection and high availability for the new primary database.
You can enable fast-start failover from any site, including the observer site, in Enterprise Manager while connected to any database in the broker configuration. The broker simplifies switchovers and failovers by allowing you to invoke them using a single key click in Oracle Enterprise Manager, as shown in Figure 4-16.
Figure 4-16 Fast-Start Failover and the Observer Are Successfully Enabled
Furthermore, you can leverage Enterprise Manager to reinstate the old primary as the new standby. Figure 4-17 shows an example of the warning message that shows in Enterprise Manager when a reinstatement is needed.
Figure 4-17 Reinstating the Former Primary Database After a Fast-Start Failover
Follow the steps in Section 4.2.6.3, "Data Area Disk Group Failure" or Section 4.2.6.4, "Flash Recovery Area Disk Group Failure".
After performing the planned maintenance on the secondary site, the standby database and log apply services must be restarted, and then Data Guard will automatically catch up. You can leverage Enterprise Manager and Data Guard broker to monitor the Data Guard state.
The following steps are required to restore full fault tolerance after planned downtime on a secondary site or clusterwide outage:
Note:
The following steps can be accomplished manually (as described below) or automatically using Enterprise Manager.Start the standby database
You might have to restore the standby database from local backups, local tape backups, or from the primary site backups if the data in the secondary site has been damaged. Re-create the standby database from the new production database by following the steps for creating a standby database in Oracle Data Guard Concepts and Administration.
After the standby database has been reestablished, start the standby database.
Start Redo Apply (physical standby) or SQL Apply (logical standby):
Verify redo transport services on production database
You might have to reenable the production database remote archive destination. Query the V$ARCHIVE_DEST_STATUS
view first to see the current state of the archive destinations:
SELECT DEST_ID, DEST_NAME, STATUS, PROTECTION_MODE, DESTINATION, ERROR, SRL FROM V$ARCHIVE_DEST_STATUS; ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_n=ENABLE; ALTER SYSTEM ARCHIVE LOG CURRENT;
Verify redo transport services between the production and standby databases by checking for errors. Query the V$ARCHIVE_DEST
and V$ARCHIVE_DEST_STATUS
views:
SELECT STATUS, TARGET, LOG_SEQUENCE, TYPE, PROCESS, REGISTER, ERROR FROM V$ARCHIVE_DEST; SELECT * FROM V$ARCHIVE_DEST_STATUS WHERE STATUS!='INACTIVE';
Verify that recovery is progressing on standby database
For a physical standby database, verify that there are no errors from the managed recovery process and that the recovery has applied the redo from the archived redo log files:
SELECT MAX(SEQUENCE#), THREAD# FROM V$LOG_HISTORY GROUP BY THREAD; SELECT PROCESS, STATUS, THREAD#, SEQUENCE#, CLIENT_PROCESS FROM V$MANAGED_STANDBY;
For a logical standby database, verify that there are no errors from the logical standby process and that the recovery has applied the redo from the archived redo logs:
SELECT THREAD#, SEQUENCE# SEQ# FROM DBA_LOGSTDBY_LOG LOG, DBA_LOGSTDBY_PROGRESS PROG WHERE PROG.APPLIED_SCN BETWEEN LOG.FIRST_CHANGE# AND LOG.NEXT_CHANGE# ORDER BY NEXT_CHANGE#;
Restore production database protection mode
If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection depending on your business requirements.
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
Following unplanned downtime on the standby database that requires a full or partial datafile restoration (such as data or media failure), full fault tolerance is compromised until the standby database is brought back into service. Full database protection should be restored as soon as possible.
To repair data corruption and data failures on a logical standby database, you require a backup of the logical standby file and not a backup from the primary database. Otherwise, you need to reinstantiate or re-create the relevant objects that got affected by the corruption.To repair data corruption or data failures on the standby database, you can leverage the following repair solutions:
Use RMAN Block Media Recovery (described in Section 4.2.7.2)
Use RMAN Data File Media Recovery (described in Section 4.2.7.3)
Re-Create Objects Manually for logical standby databases only (described in Section 4.2.7.4)
If you had to change the protection mode of the production database from maximum protection to either maximum availability or maximum performance because of the standby database outage, then change the production database protection mode back to maximum protection (depending on your business requirements).
ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE [PROTECTION | AVAILABILITY];
If the production database is activated because it was flashed back to correct a logical error or because it was restored and recovered to a point in time, then the corresponding standby database might require additional maintenance. No additional work is required if the production database did complete recovery with no resetlogs.
After opening the production database with the RESETLOGS
option, execute the queries shown in Table 4-17.
Table 4-17 Queries to Determine RESETLOGS SCN and Current SCN OPEN RESETLOGS
Database | Query |
---|---|
Production |
|
Physical standby |
|
Logical standby |
|
Table 4-18 shows the actions you take to restore fault tolerance if the standby database is behind the primary database's resetlogs SCN.
Table 4-18 SCN on Standby Database is Behind Resetlogs SCN on the Production Database
Database | Action |
---|---|
Physical standby |
|
Logical standby |
Ensure that the standby database has received a new archived redo log file from the production database. See Also: "Verify redo transport services on production database" |
Table 4-19 shows the actions you take to restore fault tolerance if the standby database is ahead of the primary database's resetlogs SCN.
Table 4-19 SCN on the Standby is Ahead of Resetlogs SCN on the Production Database
Database | Action |
---|---|
Physical standby |
|
Logical standby |
|
If a dual failure affecting both the standby and production databases occurs, then you must re-create the production database first. Because the sites are identical, the production database can be created wherever the most recent backup resides.
Table 4-20 summarizes the recovery strategy depending on the type of backups that are available.
Table 4-20 Re-Creating the Production and Standby Databases
Available Backups | Re-Creating the Production Database |
---|---|
Local backup on production and standby databases |
Restore backup from the production database. Recover and activate the database as the new production database. |
Local backup only on standby database. Tape backups on standby database. |
Restore the local standby backup to the standby database. Recover and activate the database as the new production database. |
Tape backups only |
Restore tape backups locally. Recover the database and activate it as the new production database. |
See Also:
After the production database is re-created, follow the steps for creating a new standby database that are described inOracle Data Guard Concepts and AdministrationThis section describes best practices for eliminating or reducing downtime due to scheduled outages and contains the following topics:
The following procedure should be used when you add storage to the system. The procedures in the following sections assume that you are adding storage to an ASM Disk Group.
If you have an existing Oracle database that stores database files on a file system or on raw devices, you can migrate some or all of these database files to ASM. The DBMS_FILE_TRANSFER
package can be used for this purpose.
See Also:
Oracle Database Backup and Recovery Advanced User's Guide for detailed instructions on migrating a database to ASM using RMAN
Oracle Database PL/SQL Packages and Types Reference for the DBMS_FILE_TRANSFER
package
Disks can be added to and removed from ASM with no downtime. When disks are added or removed, ASM automatically starts a rebalance operation to evenly spread the disk group contents over all drives in the disk group.The best practices for adding or removing storage include:
Investigate methods of adding storage to, and removing storage from, the host operating system with no downtime.
Use a single ALTER DISKGROUP
command when adding or removing multiple disk drives.
For example, if the storage maintenance is to add new drives and remove existing drives, use a single ALTER DISKGROUP
command with the ADD DISK
clause to add the new drives, and the DROP DISK
clause to remove the existing drives. For example:
ALTER DISKGROUP data DROP DISK diska5 ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9;
When dropping disks from a disk group, specify the WAIT
option in the REBALANCE
clause so the ALTER DISKGROUP
statement does not return until the contents of the drives being dropped have been moved to other drives. Once the statement completes, the drives can be safely removed from the system. For example:
ALTER DISKGROUP data DROP DISK diska5 ADD FAILGROUP failgrp1 DISK '/devices/diska9' NAME diska9 REBALANCE WAIT;
When dropping disks in a normal or high redundancy disk group, ensure there is enough free disk space in the disk group to reconstruct full redundancy.
Monitor the progress of rebalance operations using Enterprise Manager or by querying V$ASM_OPERATION
.
For long-running rebalance operations that occur during periods of low database activity, increase the rebalance power limit to reduce the rebalance time.
See Also:
The "Using Automated Storage Management (ASM)" chapter in Oracle Database Administrator's GuideWith RAC, you can apply certain database patches to one node or instance at a time, which enables continual application and database availability. "One-off" patches or interim patches to database software are usually applied to implement known fixes for software problems an installation has encountered or to apply diagnostic patches to gather information regarding a problem. Such patch application is often carried out during a scheduled maintenance outage.
Oracle now provides the capability to do rolling patch upgrades with Real Application Clusters with little or no database downtime. The tool used to achieve this is the opatch
command-line utility.
The advantage of a RAC rolling upgrade is that it enables at least some instances of the RAC installation to be available during the scheduled outage required for patch upgrades. Only the RAC instance that is currently being patched must be brought down. The other instances can continue to remain available. This means that the effect on the application downtime required for such scheduled outages is further minimized. Oracle's opatch
utility enables the user to apply the patch successively to the different instances of the RAC installation.
Rolling upgrade is available only for patches that have been certified by Oracle to be eligible for rolling upgrades. Typically, patches that can be installed in a rolling upgrade include:
Patches that do not affect the contents of the database such as the data dictionary
Patches not related to RAC internode communication
Patches related to client-side tools such as SQL*PLUS, Oracle utilities, development libraries, and Oracle Net
Patches that do not change shared database resources such as datafile headers, control files, and common header definitions of kernel modules
Rolling upgrade of patches is currently available for one-off patches only. It is not available for patch sets.
Rolling patch upgrades are not available for deployments where the Oracle Database software is shared across the different nodes. This is the case where the Oracle home is on Cluster File System (CFS) or on shared volumes provided by file servers or NFS-mounted drives. The feature is only available where each node has its own copy of the Oracle Database software.
Use the following recommended practices for all database patch upgrades:
Always confirm with Oracle Support Services that the patch is valid for your problem and for your deployment environment.
Have a plan for applying the patch as well as a plan for backing out the patch.
Apply the patch to your test environment first and verify that it fixes the problem.
When you plan the elapsed time for applying the patch, include time for starting up and shutting down the other tiers of your technology stack if necessary.
If the patch is not a candidate for RAC rolling upgrade and you can incur the downtime for applying the patch, go to Section 4.4.3, "Database Upgrades" to assess whether or not other solutions are feasible.
The following are additional recommended practices for RAC rolling upgrades.
If multiple instances share an Oracle home, then all of them will be affected by application of a patch. The DBA should verify that this will not cause unintentional side effects. Also, all such instances on a node must be shut down during the patch application. Scheduled outage planning should take this into account. As a best practice, only similar applications should share an Oracle home on a node. This provides greater flexibility for patching.
The Oracle inventory on each node is a repository of the Oracle Database software installed on the node. The inventory is node-specific. It is shared by all Oracle software installed on the node. It is similar across nodes only if all nodes are exactly the same in terms of the Oracle Database software deployed, the deployment configuration, and patch levels. Because the Oracle inventory greatly aids the patch application and patch management process, it is recommended that its integrity be maintained. Oracle inventory should be backed up after each patch installation to any Oracle software on a specific node. This applies to the Oracle inventory on each node of the cluster.
Use the Oracle Universal Installer to install all Oracle database software. This creates the relevant repository entries in the Oracle inventory on each node of the cluster. Also, use the Oracle Universal Installer to add nodes to an existing RAC cluster.
However, if this was not done or is not feasible for some reason, adding information about an existing Oracle database software installation to the Oracle inventory can be done with the attach
option of the opatch
utility. Node information can be also added with this option.
The nature of the rolling patch upgrade enables it to be applied to only some nodes of the RAC cluster. So an instance can be operating with the patch applied, while another instance is operating without the patch. This is not possible for nonrolling patch upgrades. Apply nonrolling patch upgrades to all instances before the RAC deployment is activated. A mixed environment is useful if a patch must be tested before deploying it to all the instances. Applying the patch with the -local
option is the recommended way to do this.
In the interest of keeping all instances of the RAC cluster at the same patch level, it is strongly recommended that after a patch has been validated, it should be applied to all nodes of the RAC installation. When instances of a RAC cluster have similar patch software, services can be migrated among instances without running into the problem a patch might have fixed.
All patches (including those applied by rolling upgrades) should be maintained online and not removed once they have been applied. This is useful if a patch must be rolled back or applied again.
The patches should be stored in a location that is accessible by all nodes of the cluster. Thus all nodes of the cluster are equivalent in their capability to apply or roll back a patch.
Rolling patch upgrades, just like any other patch upgrade, should be done when no other patch upgrade or Oracle installation is being done on the node. Application of multiple patches is a sequential process. The scheduled outage should be planned accordingly.
If multiple patches have to be applied and they must be applied at the same time, and if only some of these patches are eligible for rolling upgrade, then apply all of them in a nonrolling manner. This reduces the overall time required to get through the patching process.
For patches that are not eligible for rolling upgrade, the next best option for RAC deployments is the minimize_downtime
option of the apply
command.
Perform the rolling upgrade when system usage is low. This ensures minimal disruption of service for the end user.
See Also:
Oracle Universal Installer and OPatch User's Guide for more information on the opatch utilityThe following Oracle features are available to perform database upgrades:
Database Upgrade Assistant (DBUA)
Data Guard SQL Apply (logical standby database)
Oracle Streams
Transportable tablespaces
The method you choose to perform database upgrades can vary depending on the following considerations:
Downtime required to complete the upgrade
Setup time and effort required prior to the downtime
Temporary additional resources necessary (for example, disk space or CPU)
Complexity of the steps needed to complete the upgrade
Table 4-21 lists the methods that can be used for platform migrations and database upgrades, and recommends what method to use.
Table 4-21 Platform Migration and Database Upgrade Options
Upgrade Method | Use This Method When... |
---|---|
Recommended method when maintenance window is sufficient |
|
DBUA will not finish within the maintenance window and the database is not a candidate for RAC rolling patch upgrade |
|
Already a Streams implementation or Data Guard SQL Apply rolling upgrade does not support database versions in use |
|
Database is using data types unsupported by Data Guard SQL Apply or Streams |
Database Upgrade Assistant (DBUA) is used to upgrade a database in place from an earlier software version.
When deciding if DBUA is the proper tool to use when performing a database upgrade with minimal downtime, consider the following:
DBUA upgrades the database dictionary and all components (for example: Java, XDB, Streams, and so on) that have been installed while the database is unavailable for normal user activity.
Downtime required for a database upgrade when using DBUA is determined by the time needed to:
Upgrade all database dictionary objects to the new version
Recompile all PL/SQL
Reconnect the clients to the upgraded database
Use DBUA for a database upgrade when the time to perform the upgrade with this method fits within the maintenance window.
See Also:
Oracle Database Upgrade Guide for more information on DBUA and upgrading your Oracle Database softwareData Guard SQL Apply can be used to upgrade a database with minimal downtime by means of a process called rolling upgrade. Data Guard currently supports homogeneous environments where the primary and standby databases are running on the same platform.
Note the following points when deciding if Data Guard SQL Apply is the appropriate method for minimizing downtime during a database upgrade:
The Data Guard SQL Apply infrastructure uses Oracle Streams and therefore inherits Oracle Streams data type restrictions on user-defined types, such as object types, REF values, varrays, and nested tables.
Support for rolling upgrade starts with Oracle Database 10g release 1 (10.1.0.3). The supported versions, for both the source database and the target database, are more restrictive than Oracle Streams.
Downtime required for a database upgrade (rolling upgrade) when using Data Guard SQL Apply is determined by the time needed to:
Perform a Data Guard switchover
Reconnect the clients to the new database
Use Data Guard SQL Apply for rolling database upgrade when DBUA will not complete the upgrade within the maintenance window and the application does not use user-defined types.
See Also:
Oracle Data Guard Concepts and Administration for more information on SQL ApplyOracle Streams can be used to upgrade the database software from one version to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different database versions.
Note the following points when deciding if Oracle Streams is an appropriate method for a database upgrade:
Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables. However, shadow tables can be created on the primary database that do not have the unsupported data types and the shadow tables can be replicated.
The source database must be running Oracle9i release 2 or higher.
The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.
There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.
Downtime required for a database upgrade when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect the clients to the new database.
Consider using Oracle Streams if the application already uses Streams or when clients do not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.
See Also:
Oracle Streams Concepts and Administration for more information on database upgrading using Oracle StreamsTransportable tablespaces can be used to accomplish a database upgrade by transporting all user datafiles into a pre-created, prepared target database.
Note the following points when deciding if transportable tablespaces is the appropriate method for performing a database upgrade:
The SYSTEM
tablespace cannot be moved with transportable tablespaces. The target database SYSTEM
tablespace contents, including user definitions and objects necessary for the application, must be built manually. Use Data Pump to move the contents of the SYSTEM
tablespace.
Downtime required for a database upgrade when using transportable tablespaces is determined by the time needed to:
Place the source database tablespaces in read-only mode
Perform a network import of the transportable metadata
If the target database is on a remote system, then include the time to transfer all datafiles from the source system to the target system
The time it takes to transfer the datafiles can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files, or by using a physical standby database.
Use transportable tablespaces to perform a database upgrade when DBUA will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.
See Also:
Oracle Database Administrator's Guide for more information on transportable tablespacesThe following Oracle features are available to perform platform migrations:
The method you choose to perform these database maintenance tasks depends on the following considerations:
Downtime required to complete the maintenance operations
Setup time and effort required prior to the downtime
Amount of temporary additional resources necessary, such as disk space or CPU
Complexity of the steps needed to complete maintenance operations
The following table summarizes the methods that can be used for platform migrations and database upgrades, and recommends which method to use for each operation.
Table 4-22 Platform and Location Migration Options
Operation | Method | When to Use |
---|---|---|
Platform migration to same endian platform |
Recommended method |
|
Transportable Database will not finish within the maintenance window |
||
Platform migration to different endian platform |
Recommended method |
|
Data Pump will not finish within the maintenance window |
||
Database is using data types unsupported by Oracle Streams |
||
Location Migration |
Recommended method |
Note:
Query theV$TRANSPORTABLE_PLATFORM
view to determine the endian format of all platforms. Query the V$DATABASE
view to determine the platform ID and platform name of the current system.Transportable Database is a new feature in Oracle Database 10g Release 2 (10.2) that is the recommended method for migrating an entire database to another platform that has the same endian format.
Note the following points when deciding if Transportable Database is the appropriate method to use when moving a database to another platform:
Transportable Database supports moving databases between platforms that are of the same endian format
Downtime required for a platform migration when using Transportable Database is determined by the time needed to:
Place the source database in read-only mode
Convert all data files to the new platform format
Transfer all data files from the source system to the target system
You can significantly reduce this time by using a storage infrastructure that can make the data files available to the target system without the need to physically move the files.
Transportable Databases is the recommended method for migrating an entire database to another platform that has the same endian format because it's the simplest approach.
See Also:
Oracle Database Backup and Recovery Advanced User's Guide for more information on cross-platform use of Transportable DatabaseOracle Streams can be used to move a database from one platform to another with minimal downtime. This is because Oracle Streams supports a configuration in which the primary database and its replica are running on different platforms.
Note the following points when deciding if Oracle Streams is an appropriate method for a platform migration:
Oracle Streams does not support user-defined types, such as object types, REF values, varrays, and nested tables.
To perform an upgrade using Oracle Streams, the source database must be running Oracle9i release 2 or higher.
The administrative effort required to set up and maintain the Oracle Streams environment is more than if using Data Guard SQL Apply for a database upgrade.
There might be a performance impact on the source database while the source and target databases run in parallel as changes are propagated to the target database.
Downtime required for a platform migration when using Oracle Streams is determined by the time needed to apply the remaining transactions in the queue and to reconnect clients to the new database.
Consider using Oracle Streams when the application does not use user-defined types and the extra administrative effort is worth the opportunity for a very small outage time.
See Also:
Oracle Streams Concepts and Administration for more information on database upgrading using Oracle StreamsOracle Data Pump technology enables very high-speed movement of data and metadata from one database to another, across different platforms and different database versions.
Note the following points when deciding if Data Pump is an appropriate method for a platform migration:
Oracle Data Pump is available only on Oracle Database 10g Release 1 (10.1) and later releases.
Downtime required for a platform migration when using Data Pump is determined by the time needed to perform a full database network import. A network import uses a database link between the target system and the remote source system to retrieve data and write it directly into the target system, without the use of dump files.
Use Data Pump when moving a database to a platform with different endian format when the network import time is acceptable.
See Also:
Oracle Database Utilities for more information about Oracle Data Pump and the Export and Import utilities
Oracle Database Upgrade Guide for more information about upgrading your Oracle Database software
Transportable tablespaces can be used to accomplish a platform migration by transporting all user datafiles into a pre-created, prepared target database.
Note the following points when deciding if transportable tablespaces is the appropriate method for performing a platform migration:
The SYSTEM
tablespace cannot be moved with transportable tablespaces. the target database SYSTEM
tablespace contents, including user definitions and objects necessary for the clients, must be built manually. Use Data Pump to move the necessary contents of the SYSTEM
tablespace.
Downtime required for a platform migration or database upgrade when using transportable tablespaces is determined by the time needed to:
Place the source database tablespaces in read-only mode
Perform a network import of the transportable metadata
Transfer all datafiles from the source system to the target system
This time can be reduced significantly by using a storage infrastructure that can make the datafiles available to the target system without the need to physically move the files
Convert all datafiles to the new platform format using RMAN
Use transportable tablespaces to migrate to a new platform when Oracle Data Pump will not complete within the maintenance window, and Oracle Streams or Data Guard SQL Apply cannot be used due to data type restrictions.
See Also:
Oracle Database Administrator's Guide for more information on transportable tablespacesData Guard Redo Apply can be used to change the location of a database to a remote site with minimal downtime by setting up a temporary standby database at a remote location and performing a switchover operation.
Downtime required for a location migration when using Data Guard Redo Apply is determined by the time required to perform a switchover operation.
See Also:
Oracle Data Guard Concepts and Administration for more information on Redo Apply and physical standby databasesAn Oracle database upgrade is the process of transforming an existing, prior release of an Oracle Database system into the current release of the Oracle Database system and can be a very lengthy process. An application upgrade may include a database upgrade and any application code and schema changes required. If database upgrade with Data Guard is not applicable and zero to minimum downtime is required for the database or application upgrade, then configure Oracle Streams to perform a database upgrade with little or no downtime. To do so, you use Oracle Streams to configure a single-source replication environment with the following databases:
Source Database: The original database that is being upgraded
Capture Database: The database where a capture process captures changes made to the source database during the upgrade
Destination Database: The copy of the source database where an apply process applies changes made to the source database during the upgrade process. The apply process can apply to the same or different schema and object structure using
Specifically, you can use the following general steps to perform a database upgrade while the database is online:
Create an empty destination database.
Configure an Oracle Streams single-source replication environment where the original database is the source database and a copy of the database is the destination database for the changes made at the source.
Perform the database upgrade on the destination database. During this time the original source database is available online.
Use Oracle Streams to apply the changes made at the source database to the destination database.
When the destination database has caught up with the changes made at the source database, take the source database offline and make the destination database available for applications and users.
If the schema or object structure is different at the destination database, then Streams transformations need to be incorporated to manipulate the change to its new structure.
Figure 4-18 provides an overview of this process.
Figure 4-18 Online Database Upgrade with Oracle Streams
See Also:
Appendix C "Online Database Maintenance with Streams" in Oracle Streams Concepts and AdministrationMany scheduled outages related to the data server involve some reorganization of the database objects. The database object reorganization must be accomplished with continued availability of the database. Oracle's online object reorganization capabilities have been available since Oracle8i. These capabilities enable object reorganization to be performed even while the underlying data is being modified.
Table 4-23 describes a few of the object reorganization capabilities available with Oracle Database 10g.
Table 4-23 Some Object Reorganization Capabilities
Object Type | Example of Object Reorganization Solution | Description of Solution |
---|---|---|
Table |
A PL/SQL package that provides a mechanism to redefine tables online. This is Oracle's recommended best practice. |
|
Index |
Rebuild an index that has previously been marked as unusable. |
|
Tablespace |
Enables an existing tablespace to be renamed without rebuilding the tablespace and its contents. |
In highly available systems, it is occasionally necessary to redefine large tables that are constantly accessed to improve the performance of queries or DML. The Online Reorganization and Redefinition feature in Oracle Database 10g, offers administrators unprecedented flexibility to modify table physical attributes and transform both data and table structure, while allowing users full access to the database. This capability improves data availability, query performance, response time and disk space usage, all of which are important in a mission-critical environment and it can make the application upgrade process easier, safer and faster.
Oracle's recommended practice is to reorganize tables using the DBMS_REDEFINITION
PL/SQL package, because it provides a significant increase in availability compared to traditional methods of redefining tables that require tables to be taken offline. Whether you call DBMS_REDEFINITION
manually at the command line or automatically through Oracle Enterprise Manager, the entire reorganization process occurs while users have full access to the table thus ensuring system availability.
Figure 4-19 shows the Reorganize Objects Wizard in Oracle Enterprise Manager that you can use as an alternative to calling the DBMS_REDEFINITION
package at the SQL*Plus command line. After you answer a few questions in the wizard, it automatically generates the script and performs the reorganization.
Figure 4-19 Database Object Reorganization Using Oracle Enterprise Manager
Using the DBMS_REDEFINITION
approach, an interim table is created that contains all the desired attributes. The reorganization begins by calling the procedure START_REDEF_TABLE
, which is where the column mappings between the current and new version of the table are described. All the dependent objects such as triggers, constraints and indexes are automatically copied to the interim table using the procedure COPY_TABLE_DEPENDENTS
. During the reorganization, any changes made to the original table are added to the interim table by calling the procedure SYNC_INTERIM_TABLE
. The reorganization is complete when the procedure FINISH_REDEF_TABLE
is called and the interim table is renamed as the main table.
A tablespace can be renamed in Oracle Database 10g, similar to the ability to rename a column, table and datafile. Previously, the only way to change a tablespace name was to drop and re-create the tablespace, but this meant that the contents of the tablespace had to be dropped and rebuilt later. With the ability to rename a tablespace online, there is no interruption to the users.
ALTER TABLESPACE USERS RENAME TO new_tablespace_name;
Tablespace altered.
Additionally, consider the following when performing data reorganization:
Concurrent activity on the table during an online operation.
During an online operation, Oracle recommends users minimize activities on the base table. Database activities should impact less than ten percent of the table while online operation is in progress. Also the database administrator can use the Database Resource Manager to minimize the data reorganization impact to users by allocating enough resources to users.
Oracle does not recommend running online operations at peak times or running a batch job that modifies large amount of data during an online data reorganization.
In fact, parallel DML, direct load and import/export cannot be performed during an online operation.
Rebuilding index online vs. dropping an index and then re-creating a new index online.
Rebuilding an index online requires additional disk space for the new index during the operation, whereas dropping an index and then re-creating an index does not require additional disk space.
Coalescing an index online vs. rebuilding an index online.
Online index coalesce is an in-place data reorganization operation, hence does not require additional disk space like index rebuild does. Index rebuild requires temporary disk space equal to the size of the index plus sort space during the operation. Index coalesce does not reduce the height of the B-tree. It only tries to reduce the number of leaf blocks. The coalesce operation does not free up space for users but does improve index scan performance.
If a user needs to move an index to a new tablespace, use online index rebuild.
Local and global indexes.
Oracle Database 10g supports both local and global partitioned indexes with online operations. When tables and indexes are partitioned, this allows administrators to perform maintenance on these objects, one partition at a time, while the other partitions remain online.
See Also:
Oracle Database Administrator's Guide for more information on redefining tables online
The Online Data Reorganization & Redefinition web site on OTN at http://www.oracle.com/technology/deploy/availability/htdocs/online_ops.html
The "Online Reorganization using Oracle Database 10g" white paper that provides additional redefinition solutions online at http://www.oracle.com/technology/deploy/availability/pdf/ha_10gR2_online_reorg_twp.pdf
For a scheduled outage that requires an instance, node, or other component to be isolated, RAC provides the ability to relocate, disable, and enable services. Relocation migrates a service to another instance. Services and instances can be selectively disabled while repair, change, or upgrade is performed on hardware or system software and re-enabled after the maintenance is complete. This ensures that the service or instance is not started during the maintenance outage. The service and instance is disabled at the beginning of the planned outage. It is then enabled at the end of the maintenance outage.
See Also:
Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for information about administering services with Enterprise Manager, DBCA, PL/SQL, and SRVCTLWhen using RAC, Oracle Clusterware daemons start automatically at the time the node is started. When performing maintenance that requires one or more system reboots or requires that all non-operating system processes be shut down, use the crsctl
command to stop and disable the startup of the Oracle Clusterware daemons. Once maintenance is complete, enable and start the Oracle Clusterware daemons with crsctl
commands.
See Also:
Oracle Database Oracle Clusterware and Oracle Real Application Clusters Administration and Deployment Guide for information about using thecrsctl
command