29 Oracle Maximum Availability Architecture and Oracle Autonomous Database

Oracle Maximum Availability Architecture (MAA) is a set of best practices developed by Oracle engineers over many years for the integrated use of Oracle High Availability, data protection, and disaster recovery technologies.

The key goal of Oracle MAA is to meet Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for Oracle databases and applications running on our system and database platforms using Oracle Cloud MAA architectures and solutions.

See Oracle MAA Reference Architectures for an overview of the MAA reference architectures and their associated benefits and potential RTO and RPO targets. Because Autonomous Database Cloud runs on the Exadata Cloud platform, also see Oracle Maximum Availability Architecture in Oracle Exadata Cloud Systems for the HA and data protection benefits that are inherent to Oracle Exadata Cloud.

Oracle MAA applies Chaos Engineering throughout its testing and development life cycles to ensure that end-to-end application and database availability is preserved, or kept at optimal levels, for any fault or maintenance event in Oracle Cloud. Chaos Engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. Specifically, MAA aggressively injects various faults and planned maintenance events to evaluate application and database impact throughout our development, stress, and testing cycles. Best practices, defects, and lessons learned are derived from that experimentation, and that knowledge is put back into practice to evolve and improve our cloud MAA solutions.

Oracle Autonomous Database with Default High Availability Option (MAA Silver)

High availability is suitable for all development, test, and production databases that have high uptime requirements and zero or low data loss tolerance. By default, Autonomous Databases are highly available, incorporating a multi-node configuration to protect against localized software and hardware failures.

Each Autonomous Database application service resides in at least one Oracle Real Application Clusters (Oracle RAC) instance, with the option to fail over to another available Oracle RAC instance for unplanned outages or planned maintenance activities, enabling zero or near-zero downtime.

Autonomous Database automatic backups are stored in Oracle Cloud Infrastructure Object Storage and are replicated to another availability domain if one is available. These backups can be used to restore the database in the event of a disaster. For Autonomous Database with Exadata Cloud at Customer, customers have an option to back up to NFS or Zero Data Loss Recovery Appliance (ZDLRA); however, replication of those backups is the responsibility of the customer.

Major database upgrades are automated. For Autonomous Database Serverless, the downtime is minimal.

The uptime service-level agreement (SLA) per month is 99.95% (a maximum of 22 minutes of downtime per month). To achieve the application uptime SLAs where most months would be zero downtime, see Maintaining Application Uptime below.
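The downtime figure follows from simple arithmetic on the SLA percentage. The sketch below is illustrative only: the function name and the assumed month length (an average month of 365.25 / 12 days) are assumptions made for this example, and the contractual definition of a month may differ.

```python
from fractions import Fraction

# Assumption: an "average" month of 365.25 / 12 days (2,629,800 seconds).
# The text does not state which month length its figures are based on.
AVERAGE_MONTH_SECONDS = Fraction(36525, 100) / 12 * 24 * 60 * 60


def max_monthly_downtime_seconds(
    sla_percent: str, month_seconds: Fraction = AVERAGE_MONTH_SECONDS
) -> Fraction:
    """Return the downtime budget, in seconds, implied by an uptime SLA.

    sla_percent is passed as a string (for example "99.95") so that the
    percentage is represented exactly rather than as a binary float.
    """
    unavailable_fraction = 1 - Fraction(sla_percent) / 100
    return unavailable_fraction * month_seconds


if __name__ == "__main__":
    budget = max_monthly_downtime_seconds("99.95")
    print(f"99.95% -> {float(budget) / 60:.1f} minutes per month")  # 21.9
```

Under the same assumption, a 99.995% SLA works out to about 131.5 seconds per month, in line with the 132-second figure quoted for Autonomous Data Guard later in this chapter. With a 30-day month the 99.95% budget would be 21.6 minutes instead.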

The following table describes the recovery-time objectives and recovery-point objectives (data loss tolerance) for different outages.

Table 29-1 Default High Availability Policy Recovery Time (RTO) and Recovery Point (RPO) Service-level Objectives

Localized events, including:

  • Exadata cluster network topology failures
  • Storage (disk and flash) failures
  • Database instance failures
  • Database server failures
  • Periodic software and hardware maintenance updates

  Database Downtime: Zero
  Service-level Downtime (RTO): Near-zero
  Potential Service-level Data Loss (RPO): Zero

Events that require restoring from backup because the standby database does not exist:

  • Data corruptions
  • Full database failures
  • Complete storage failures
  • Availability domain (AD) failures for multi-AD regions

  Database Downtime: Minutes to hours (without Autonomous Data Guard)
  Service-level Downtime (RTO): Minutes to hours (without Autonomous Data Guard)
  Potential Service-level Data Loss (RPO): 15 minutes for Oracle Autonomous Database on Dedicated Exadata Infrastructure; 1 minute for Autonomous Database Serverless (without Autonomous Data Guard)

Events that require non-rolling software updates or database upgrades

  Database Downtime: Less than 10 minutes for Autonomous Database Serverless; minutes to an hour for Autonomous Database on Dedicated Infrastructure (without Autonomous Data Guard)
  Service-level Downtime (RTO): Less than 10 minutes for Autonomous Database Serverless; minutes to an hour for Autonomous Database on Dedicated Infrastructure (without Autonomous Data Guard)
  Potential Service-level Data Loss (RPO): Zero

In the table above, the amount of downtime for events that require restoring from a backup varies depending on the nature of the failure. In the most optimistic case, physical block corruption is detected and the block is repaired with block media recovery in minutes. In this case, only a small portion of the database is affected, with zero data loss. In a more pessimistic case, the entire database or cluster fails, and the database is then restored and recovered using the latest database backup, including all archived redo logs.

Data loss is limited by the last successful archive log backup, which occurs every 15 minutes for Autonomous Database on Dedicated Infrastructure and every 1 minute for Autonomous Database Serverless. Archived redo logs are backed up to Oracle Cloud Infrastructure Object Storage or File Storage Service for future recovery purposes. Data loss can be seconds or, at worst, minutes: it consists of the redo generated after the last successful archive log backup, including redo in the online redo logs that was not yet archived to external storage.
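The relationship between archive log backup frequency and potential data loss can be sketched as follows. This is a simplified illustration, not an Oracle API: the function and dictionary names are hypothetical, and the model ignores how long a backup takes and any redo that can still be recovered from surviving online redo logs.

```python
from datetime import datetime, timedelta

# Archive log backup frequencies quoted in the text above.
ARCHIVE_BACKUP_INTERVAL = {
    "dedicated": timedelta(minutes=15),
    "serverless": timedelta(minutes=1),
}


def data_loss_window(last_archive_backup: datetime, failure_time: datetime) -> timedelta:
    """Return the span of redo at risk when restoring from backup.

    Redo generated after the last successful archive log backup has not
    reached external storage, so it bounds the potential data loss.
    """
    if failure_time < last_archive_backup:
        raise ValueError("failure_time precedes last_archive_backup")
    return failure_time - last_archive_backup


def exceeds_expected_interval(deployment: str, window: timedelta) -> bool:
    """True if the window is longer than the scheduled backup interval,
    which would suggest that at least one archive log backup was missed."""
    return window > ARCHIVE_BACKUP_INTERVAL[deployment]
```

For example, a failure nine minutes after the last archive log backup on Dedicated Infrastructure puts at most nine minutes of redo at risk, which is within the 15-minute interval.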

Oracle Autonomous Database with Autonomous Data Guard Option (MAA Gold)

Enable Autonomous Data Guard for mission-critical production databases that require better uptime during disasters caused by data corruptions and database or site failures, while still reaping the benefits of the Autonomous Database High Availability Option.

Additionally, the read-only standby database provides expanded application services to offload reporting, queries, and some updates. The read-only standby database is only available with Autonomous Data Guard on Dedicated Infrastructure.

Enabling Autonomous Data Guard adds one symmetric standby database to an Exadata rack that is located in the same availability domain, another availability domain, or in another region. The primary and standby database systems are configured symmetrically to ensure that performance service levels are maintained after Data Guard role transitions. Autonomous Database Serverless supports configuring two standby databases, and Autonomous Database on Dedicated Infrastructure is restricted to a single standby database at this time. For Autonomous Database Serverless, a multiple standby configuration consists of a local standby database in the same region and a cross-region standby database.

Oracle Autonomous Data Guard uses asynchronous redo transport (maximum performance mode) by default to ensure zero application performance impact. The standby database can be placed within the same availability domain, across availability domains, or across regions. MAA recommends placing the standby in a separate availability domain or in a different region for the best fault isolation.

Data Guard zero data loss protection can be achieved by configuring synchronous redo transport (maximum availability mode). However, maximum availability mode with synchronous redo transport is only available with Autonomous Database on Dedicated Infrastructure. In that configuration, the standby database is typically placed in a different availability domain in the same region, or in a different region if the round-trip latency between regions is minimal (< 5 ms), to ensure a negligible impact on application response time and throughput while providing fault isolation.

Furthermore, local and remote virtual cloud network peering provides a secure, high-bandwidth network across availability domains and regions for any traffic between the primary and standby servers.
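The protection mode guidance above can be summarized as a small decision rule. The following is a sketch under stated assumptions, not an Oracle API: the function name and the deployment labels are hypothetical, and the 5 ms guideline, which the text gives for cross-region placement, is applied here to every placement for simplicity.

```python
# Round-trip latency guideline from the text ("< 5 ms").
SYNC_MAX_ROUND_TRIP_MS = 5.0

VALID_DEPLOYMENTS = ("dedicated", "serverless")


def choose_protection_mode(
    deployment: str, zero_data_loss_required: bool, round_trip_ms: float
) -> str:
    """Suggest a Data Guard protection mode for a planned standby.

    Synchronous transport (maximum availability) is considered only when
    zero data loss is required, the deployment is Dedicated Infrastructure,
    and the primary-to-standby round trip is below the latency guideline.
    Everything else falls back to the default asynchronous transport.
    """
    if deployment not in VALID_DEPLOYMENTS:
        raise ValueError(f"unknown deployment: {deployment!r}")
    if (
        zero_data_loss_required
        and deployment == "dedicated"
        and round_trip_ms < SYNC_MAX_ROUND_TRIP_MS
    ):
        return "maximum availability"
    return "maximum performance"
```

A Dedicated Infrastructure standby in a nearby availability domain with a 1.2 ms round trip would be a candidate for maximum availability, while a Serverless database, or a standby 20 ms away, would stay on maximum performance.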

Backups are scheduled automatically on both primary and standby databases, and they are stored in Oracle Cloud Infrastructure Object Storage. Autonomous Database with Exadata Cloud at Customer provides you with an option to back up to NFS or Zero Data Loss Recovery Appliance; however, replication of those backups is the responsibility of the customer. Those backups can be used to restore databases in the event of a double disaster, where both primary and standby databases are lost.

The uptime service-level agreement (SLA) per month is 99.995% (a maximum of 132 seconds of downtime per month). Recovery time objectives (downtime) and recovery point objectives (data loss) are low, as described in the table below. To achieve the application uptime SLAs where most months would be zero downtime, refer to Maintaining Application Uptime.

Automatic Data Guard failover with Autonomous Database Serverless supports a data loss threshold service level which initiates an automatic failover to the standby database if the data loss is below that threshold. Zero data loss failover is not guaranteed for Autonomous Database Serverless but is possible when the primary database fails while the primary system container and infrastructure are still available, allowing the remaining redo to be sent and applied to the standby database. Automatic Data Guard failover with Autonomous Database on Dedicated Infrastructure supports zero data loss or low and configurable data loss threshold service levels.

In all cases, automatic Autonomous Data Guard failover occurs for primary database, cluster, or data center failures when those data loss service levels can be guaranteed. The target standby becomes the new primary database, and all application services are enabled automatically. A manual Data Guard failover option is also provided in the OCI Console. For the manual Data Guard failover option, the calculated downtime for the uptime SLA starts with the time to execute the Data Guard failover operation and ends when the new primary service is enabled.
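The threshold rule described above can be illustrated with a minimal sketch. The function and parameter names are hypothetical and do not correspond to any Oracle interface; the comparison uses "less than or equal" as an assumption so that a threshold of zero can model the zero data loss service level.

```python
def should_auto_failover(
    primary_available: bool,
    estimated_data_loss_s: float,
    data_loss_threshold_s: float,
) -> bool:
    """Decide whether an automatic failover may proceed.

    Failover is considered only when the primary is unavailable and the
    estimated data loss stays within the configured threshold.
    """
    if primary_available:
        return False  # the primary is healthy, so there is nothing to fail over from
    # Assumption: "<=" so that a threshold of 0 permits a zero data loss failover.
    return estimated_data_loss_s <= data_loss_threshold_s
```

With a 30-second threshold, an outage with an estimated 5 seconds of unapplied redo would fail over automatically, while one with an estimated 45 seconds would be left for a manual decision.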

You can choose whether your database failover site is located in the same availability domain, in a different availability domain within the same region, or in a different region, contingent upon application or business requirements and data center availability.

Table 29-2 Autonomous Data Guard Recovery Time (RTO) and Recovery Point (RPO) Service-level Objectives

Localized events, including:

  • Exadata cluster network fabric failures
  • Storage (disk and flash) failures
  • Database instance failures
  • Database server failures
  • Periodic software and hardware maintenance updates

  Service-level Downtime (RTO) [1]: Zero or near-zero
  Potential Service-level Data Loss (RPO): Zero

Events that require failover to the standby database using Autonomous Data Guard, including:

  • Data corruptions (because Data Guard has automatic block repair for physical corruptions [2], a failover operation is required only for logical corruptions or extensive data corruptions)
  • Full database failures
  • Complete storage failures
  • Availability domain or region failures [3]

  Service-level Downtime (RTO) [1]: Few seconds to two minutes [4]
  Potential Service-level Data Loss (RPO):

  • Zero with maximum availability protection mode (uses synchronous redo transport). Most commonly used for intra-region standby databases. This is available for Autonomous Data Guard on Dedicated Infrastructure.
  • Near zero for maximum performance protection mode (uses asynchronous redo transport). Most commonly used for cross-region standby databases. Also used for intra-regional standby databases and to ensure zero application impact. This is applicable for both Autonomous Data Guard on Dedicated Infrastructure and Autonomous Database Serverless. RPO is typically less than 10 seconds. RPO can be impacted by network bandwidth and throughput between primary and standby clusters.

[1] Service-Level Downtime (RTO) excludes detection time that includes multiple heartbeats to ensure the source is indeed inaccessible before initiating an automatic failover.

[2] The Active Data Guard automatic block repair for physical corruptions feature is only available for Autonomous Data Guard on Dedicated Infrastructure.

[3] Regional failure protection is only available if the standby is located across regions.

[4] The back end Autonomous Data Guard role transition timings are much faster than what is indicated by the Cloud Console refresh rates.

Both Autonomous Database on Dedicated Infrastructure and Autonomous Database Serverless have been MAA Gold validated and certified. Autonomous Database on Dedicated Infrastructure was validated with a standby database in the same region, and also with a standby database in a different region, and the above SLAs were met when the standby target was symmetric to the primary. RTO and RPO SLAs were met with redo rates of up to 1000 MB/sec. Autonomous Database Serverless was validated and certified with a standby database in the same region only, and met the above SLAs when the standby target had symmetric resources. RTO and RPO SLAs were met with redo rates up to 300 MB/sec for the entire Container Database (CDB) where the target Autonomous Data Guard pluggable database resides.

Autonomous Database with Autonomous Data Guard Option and Oracle GoldenGate (MAA Platinum)

MAA Platinum with Autonomous Database on Dedicated Infrastructure is configurable. No guaranteed SLAs are provided since the GoldenGate and application failover configuration is manual.

MAA Platinum, or Never-Down Architecture, delivers a near-zero recovery time objective (RTO, or downtime incurred during an outage) and a potentially zero or near-zero recovery point objective (RPO, or data loss potential).

The MAA Platinum with Autonomous Database on Dedicated Infrastructure ensures:

  • RTO = zero or near-zero for all local failures
  • RTO = zero or near-zero for disasters, such as database, cluster, or site failures, achieved by redirecting the application to an Autonomous Database with Autonomous Data Guard or Oracle GoldenGate replica
  • Zero downtime maintenance for software and hardware updates
  • Zero downtime database upgrade or application upgrade by redirecting the application to an upgraded Oracle GoldenGate replica residing in a separate Autonomous Database on Dedicated Infrastructure
  • RPO = zero or near-zero data loss, depending on whether you select the Oracle Data Guard Maximum Availability protection mode (synchronous redo transport) or the Maximum Performance protection mode (asynchronous redo transport) in Autonomous Database with Autonomous Data Guard
  • Fast re-synchronization and zero or near-zero RPO between Oracle GoldenGate source and target databases after a disaster using Cloud MAA GoldenGate Hub and Oracle GoldenGate best practices
  • After any database failure, failover to the standby database occurs automatically using integrated Data Guard Fast-Start Failover (FSFO). Subsequently, re-synchronization between Oracle GoldenGate source and target databases resumes automatically from the new primary after the role transition. For synchronous transport, this leads to eventual zero data loss.

Prerequisites:

  • Autonomous Database on Dedicated Infrastructure must be running Oracle Database software release 19.20 or later for GoldenGate conflict resolution support
  • Autonomous Database with Autonomous Data Guard and automatic failover needs to be configured for fast GoldenGate resynchronization after a disaster
  • GoldenGate setup must be done manually according to Cloud MAA best practices
  • Application failover to an available GoldenGate replica or a new primary database must be configured. Currently, Global Data Services (GDS) cannot be used with an Autonomous Database in this architecture.

Implementing the MAA Platinum Solution

To achieve an MAA Platinum solution, review and leverage the technical briefs and documentation referenced in the following steps.

  1. Review Oracle MAA Platinum Tier for Oracle Exadata to understand MAA Platinum benefits and use cases.
    1. Decide primary database locations based on application needs. The primary database will reside in Autonomous Database on Dedicated Infrastructure.
    2. Decide standby database location based on fault isolation requirements.
    3. Enable Autonomous Data Guard.
    4. Choose Autonomous Data Guard protection mode based on RPO tolerance, and set up automatic failover.
  2. Set up MAA GoldenGate Hub in Oracle Cloud.
    1. Follow the steps in Cloud: Configuring Oracle GoldenGate Hub for MAA Platinum.
    2. Configure Bidirectional Replication and Automatic Conflict Detection and Resolution. See Set Up Bidirectional Replication for Oracle GoldenGate Microservices Architecture or the latest Oracle GoldenGate 21c documentation.
  3. Configure application failover options so that your application can fail over automatically in the case of database, cluster, or site failure.

Maintaining Application Uptime

Ensure that network connectivity to Oracle Cloud Infrastructure is reliable so that you can access your tenancy's Autonomous Database resources.

Follow the guidelines to connect to your Autonomous Database (see Autonomous Database Serverless, or Autonomous Database on Dedicated Exadata Infrastructure). Applications must connect using a predefined service name, with downloaded client credentials that include the proper tnsnames.ora and sqlnet.ora files. You can also change your specific application service’s drain_timeout attribute to fit your requirements.

For more details about enabling continuous application service through planned and unplanned outages, see Configuring Continuous Availability for Applications. Oracle recommends that you test your application readiness by following Validating Application Failover Readiness (Doc ID 2758734.1).

For Oracle Exadata Cloud Infrastructure planned maintenance events that require restarting a database instance, Oracle automatically relocates services and drains sessions to another available Oracle RAC instance before stopping any Oracle RAC instance. For OLTP applications that follow the MAA checklist, draining and relocating services results in zero application downtime.

Some applications, such as long-running batch jobs or reports, may not be able to drain and relocate gracefully, even with a longer drain timeout. For those applications, Oracle recommends that you schedule the planned software maintenance window so that it excludes these types of activities, or stop these activities before the planned maintenance window. For example, you can reschedule a planned maintenance window so that it is outside your batch windows, or stop batch jobs before a planned maintenance window.
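Checking a proposed maintenance window against known batch windows is a simple interval-overlap test. The sketch below is illustrative only: the Window type and function names are hypothetical, and the window times would come from your own scheduling records rather than from any Oracle interface.

```python
from datetime import datetime
from typing import List, NamedTuple


class Window(NamedTuple):
    """A half-open time interval [start, end)."""
    start: datetime
    end: datetime


def overlaps(a: Window, b: Window) -> bool:
    """Two half-open intervals overlap if each starts before the other ends."""
    return a.start < b.end and b.start < a.end


def conflicting_batch_windows(
    maintenance: Window, batch_windows: List[Window]
) -> List[Window]:
    """Return the batch windows that collide with the maintenance window."""
    return [w for w in batch_windows if overlaps(maintenance, w)]


if __name__ == "__main__":
    maintenance = Window(datetime(2024, 6, 1, 2, 0), datetime(2024, 6, 1, 4, 0))
    nightly_batch = Window(datetime(2024, 6, 1, 1, 0), datetime(2024, 6, 1, 3, 0))
    morning_report = Window(datetime(2024, 6, 1, 6, 0), datetime(2024, 6, 1, 7, 0))
    conflicts = conflicting_batch_windows(maintenance, [nightly_batch, morning_report])
    print(f"{len(conflicts)} conflicting batch window(s)")
```

An empty result means the maintenance window can stay where it is. A non-empty result identifies the batch jobs to stop beforehand, or signals that the maintenance window should be rescheduled.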