Data Estate Resilience

As previously stated, protecting against loss of service, or loss of data, is a multi-layered discipline. Some technologies and approaches prevent outages from occurring; others reduce the amount of loss, in both time and data, when an outage does occur.

For over 15 years, Oracle customers have been able to take advantage of a series of recommendations and blueprint architectures that ensure appropriate availability levels are achieved for database workloads. This body of work is known as Oracle’s Maximum Availability Architecture (MAA).

There are many materials and resources dedicated to MAA. You can find them here:

Oracle MAA

MAA Summary

This topic within the Cloud Foundation describes how the principles and designs within MAA apply specifically to databases deployed on OCI.

Before that, it is worthwhile reminding ourselves of what we are trying to protect the data estate from and the different levels of protection that we might need.

Prevention is Better Than Cure

Important:

Before we consider the range of infrastructure-based, protective architectures that can be deployed, we should be confident that we are taking advantage of all of the protective measures built into the Oracle database itself.

Most of these measures are intended to prevent outages from occurring in the first place. It is good practice to ensure that these measures have been actively considered for every Oracle database that you deploy. After all, prevention is better than cure!

The following features and capabilities can all assist with removing or reducing outages and data loss, and we recommend they are considered for all Oracle databases.

Fast Recovery Area: The fast recovery area contains all the necessary files that need to be backed up for the smooth running of your database. They include:

Flashback Operations: The various flashback technologies enable you to rewind (undo) changes to the data and objects in the database without having to use database recovery. This saves time in the event of human or program error when manipulating data and objects. The Flashback technologies include:

Online Operations: Online operations allow you to make structural and definition changes to database objects while still allowing users to access them. They include:

Initialization Parameters: Many initialization parameters are used to add resilience to an Oracle database. Some of these are automated, and some need to be configured manually. How they are set depends on how the database is used and how active the database is. There are three parameters in particular that help with data corruption issues. These are:
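As an illustration only, the sketch below assumes that the three corruption-related parameters are DB_BLOCK_CHECKSUM, DB_BLOCK_CHECKING and DB_LOST_WRITE_PROTECT, which are the ones usually cited in MAA guidance. It generates the corresponding ALTER SYSTEM statements as text. The helper name and the chosen values are assumptions for the example, not recommendations from this page; appropriate values depend on the workload.

```python
# Assumed parameter names and illustrative values (not taken from this page).
CORRUPTION_PARAMETERS = {
    "DB_BLOCK_CHECKSUM": "TYPICAL",
    "DB_BLOCK_CHECKING": "MEDIUM",
    "DB_LOST_WRITE_PROTECT": "TYPICAL",
}

VALID_SCOPES = {"MEMORY", "SPFILE", "BOTH"}


def build_alter_statements(parameters, scope="BOTH"):
    """Return one ALTER SYSTEM statement per parameter, sorted by name.

    Hypothetical helper: it only builds SQL text and does not connect
    to a database.
    """
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    return [
        f"ALTER SYSTEM SET {name} = {value} SCOPE = {scope};"
        for name, value in sorted(parameters.items())
    ]


if __name__ == "__main__":
    for statement in build_alter_statements(CORRUPTION_PARAMETERS):
        print(statement)
```

Generating the statements as text keeps the example free of any database driver; in practice the statements would be reviewed and run by a DBA.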

Types of Downtime

There are two types of downtime: planned and unplanned. We may deal with each of them differently, yet we often leverage the same technologies.

Planned downtime covers outages that occur as a result of;

Unplanned downtime covers outages that result from;

Whichever category a particular downtime event falls into, our objective is to recover the service in an acceptable time frame, preferably without losing any data.

The key metrics we need to consider, then, are:

Recovery Point Objective (RPO) - The maximum amount of data loss (expressed in minutes) that can be tolerated

Recovery Time Objective (RTO) - The maximum amount of time that can elapse before the service is made available again

The combination of these two values is different for each application or service. For practicality's sake, we usually group and categorize applications according to their similar requirements.

For example:

Category  RPO         RTO         Examples
Level 1   15 minutes  4 hours     Low Priority Apps, Development
Level 2   5 minutes   1 hour      Medium Priority Apps, Test
Level 3   0 minutes   15 minutes  High Priority Apps, Production
Level 4   0 minutes   0 minutes   Mission Critical Applications
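To make the grouping concrete, here is a minimal sketch that picks the least demanding category whose RPO and RTO still satisfy an application's tolerances. The thresholds come from the example table; the class and function names are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    rpo_minutes: int
    rto_minutes: int


# Ordered from least to most demanding, using the example table's values.
LEVELS = [
    ServiceLevel("Level 1", rpo_minutes=15, rto_minutes=240),
    ServiceLevel("Level 2", rpo_minutes=5, rto_minutes=60),
    ServiceLevel("Level 3", rpo_minutes=0, rto_minutes=15),
    ServiceLevel("Level 4", rpo_minutes=0, rto_minutes=0),
]


def least_demanding_level(max_data_loss_minutes, max_downtime_minutes):
    """Return the first level whose RPO and RTO fit within the tolerances."""
    if max_data_loss_minutes < 0 or max_downtime_minutes < 0:
        raise ValueError("tolerances must be non-negative")
    for level in LEVELS:
        if (level.rpo_minutes <= max_data_loss_minutes
                and level.rto_minutes <= max_downtime_minutes):
            return level
    # Not reachable for non-negative inputs, because Level 4 is 0/0.
    raise ValueError("no level satisfies the requirement")


if __name__ == "__main__":
    # An app that tolerates 10 minutes of data loss and 2 hours of downtime.
    print(least_demanding_level(10, 120).name)
```

Choosing the least demanding level that still meets the requirement reflects the cost argument that follows: more resilience than needed means more expense and complexity.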

We might ask: why wouldn’t we want all applications and services to benefit from an RPO of zero and an RTO of zero?

The answer is that achieving higher levels of resilience is usually more expensive, requires a more complex architecture, and is operationally more demanding.

Therefore, it makes sense that, for each application, we fully understand what levels of resilience are required. We can then match that requirement to the appropriate architecture.

Levels of Protection

There are many different ways that technologies can be combined to deliver increasing levels of resilience. For Oracle MAA, all of the recommended components and technologies have been developed by Oracle. We believe this is important: with a single provider of resilience-centric technologies, you can rely on the recommended architectures being engineered to work together and being fully tested.

Oracle MAA describes four levels of resilience: Bronze, Silver, Gold and Platinum.

Important:

All implementations of an MAA architecture should incorporate the appropriate outage prevention capabilities outlined in the “Prevention is Better Than Cure” section of this page.

BRONZE

[Figure: Oracle MAA Bronze architecture]

The Oracle MAA Bronze architecture consists of a single instance Oracle database with local, and ideally remote, backups.

The Bronze architecture protects against both planned and unplanned outages. However, the RTO in some circumstances (e.g. complete site failure) could be many hours or even days. Also, with full site failure, in particular, there is a risk of significant data loss depending on the availability and timing of the last successful backup.

For Bronze, the expected RPO and RTO for each outage category are:

Outage Type  Outage Event                     Expected RTO        Expected RPO
Unplanned    Recoverable Instance Failure     Minutes             Zero
             Recoverable Server Failure       Minutes to an Hour  Zero
             Data Corruption, Site Failure    Hours to Days       Since last backup
Planned      Reorganization, Limited Patches  Zero                Zero
             Hardware or O/S Maintenance      Minutes to Hours    Zero
             Most Database Patches            Minutes to Hours    Zero
             Database Upgrades, Patch Sets    Minutes to Hours    Zero
             Platform Migrations              Hours to a Day      Zero
             Application Upgrades             Hours to Days       Zero

The Bronze MAA architecture may be appropriate if an application is considered non-critical or is a development or test environment. In all circumstances, the reduced complexity and cost need to be balanced against increased outage times and significant potential data loss (e.g. in the event of unrecoverable site failure).

SILVER

[Figure: Oracle MAA Silver architecture]

The Oracle MAA Silver architecture builds upon the Bronze architecture but, instead of a single database instance accessing the database files, there are at least two, each running on its own node. These nodes simultaneously access the database in an active/active configuration. This is made possible by Oracle Real Application Clusters (RAC), which creates and tracks changes in a global cache of memory-resident database blocks from each of the participating nodes.

For unplanned outages, the Silver architecture extends the protection offered by the Bronze architecture in the following way:

The presence of multiple database instance nodes means that, should one of the nodes experience an unplanned outage (of any description), the remaining node(s) can continue to service database workloads. In its most basic implementation, user sessions connected to a failed node will experience an interruption while their connection is re-established to the surviving node(s). Sessions already connected to a surviving node will not experience any interruption.

Oracle offers a range of other technologies to ensure that sessions connected to a failed node do not experience interruption. These include;

More information about these technologies can be found in the following white paper;

MAA Checklist for Applications

For planned outages, the Silver architecture extends the Bronze architecture by allowing maintenance to be carried out on one node at a time. This could be resolving a physical hardware issue or patching and upgrading the operating system. Furthermore, the Silver architecture means that database patches themselves can be applied in a rolling fashion: while one database node is being maintained or patched, the other node(s) continue to provide a database service to users.
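The rolling approach can be sketched as a simple loop that takes one node out of service at a time and records which nodes keep serving. This is a toy model with hypothetical names (`rolling_maintenance`, `apply_patch`), not an Oracle tool or API.

```python
def rolling_maintenance(nodes, apply_patch):
    """Patch nodes one at a time and return the resulting schedule.

    Each schedule entry is (node_being_patched, nodes_still_serving).
    `apply_patch` is a caller-supplied function standing in for the
    real maintenance work on a single node.
    """
    if len(nodes) < 2:
        # With a single node there is nothing left to serve users.
        raise ValueError("rolling maintenance needs at least two nodes")
    schedule = []
    for node in nodes:
        serving = [other for other in nodes if other != node]
        apply_patch(node)
        schedule.append((node, serving))
    return schedule


if __name__ == "__main__":
    patched = []
    plan = rolling_maintenance(["node1", "node2", "node3"], patched.append)
    for node, serving in plan:
        print(f"patching {node}; still serving: {', '.join(serving)}")
```

The guard on the node count captures why this is not possible with Bronze: a single instance has no surviving node to carry the workload during maintenance.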

For Silver, the expected RPO and RTO for each outage category are:

Outage Type  Outage Event                    Expected RTO      Expected RPO
Unplanned    Recoverable Instance Failure    Zero              Zero
             Recoverable Server Failure      Seconds           Zero
             Data Corruption, Site Failure   Hours to Days     Since last backup
Planned      Reorganization                  Zero              Zero
             Hardware or O/S Maintenance     Zero              Zero
             Most Database Patches           Zero              Zero
             Database Upgrades & Patch Sets  Minutes to Hours  Zero
             Platform Migrations             Hours to a Day    Zero
             Application Upgrades            Hours to Days     Zero

The Silver MAA architecture is suited to workloads that demand higher levels of availability. These can include critical applications where outages, planned or unplanned, would be detrimental to the business. This is often seen as a requirement for production environments but, because of the architectural sophistication of active/active database technology, it is often prudent to replicate the Silver architecture for non-production environments (dev, test, pre-prod).

GOLD

[Figure: Oracle MAA Gold architecture]

The Oracle MAA Gold architecture extends the Silver architecture with the introduction of data replication technologies. The data replication technology for Oracle databases is Oracle Data Guard. There are two main roles in a Data Guard configuration: the Primary Database and the Standby Database.

These roles are interchangeable between the participants in a Data Guard configuration.

Data Guard replicates the change vectors in the transaction logs of the Primary Database to the Standby Database. The Standby Database can be hundreds or even thousands of miles away from the Primary; the only limiting factor is how much latency your solution can tolerate. Over very long distances, the data can be replicated asynchronously so that replication does not affect performance on the Primary database. However, asynchronous replication leaves a slight chance of data loss (typically a few seconds' worth) if there is an unrecoverable failure on the Primary site.

If this small amount of data loss is not acceptable, then replication can be done synchronously. This guarantees protection against data loss but will limit the distance the data can travel before latency begins to affect the performance on the Primary database.
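The trade-off between distance and replication mode can be approximated with a back-of-the-envelope calculation. The sketch below assumes light travels through optical fibre at roughly 200 km per millisecond and ignores switching, routing and disk-write overheads, so real round-trip times will be higher. The function names and the latency budget are hypothetical.

```python
# Approximate propagation speed in optical fibre (assumption: ~200 km/ms).
FIBRE_KM_PER_MS = 200.0


def estimated_round_trip_ms(distance_km):
    """Lower-bound network round trip between primary and standby."""
    return 2 * distance_km / FIBRE_KM_PER_MS


def choose_transport(distance_km, rpo_seconds, commit_latency_budget_ms):
    """Suggest a replication mode from the RPO and a latency budget.

    Returns "ASYNC" when some data loss is tolerable, "SYNC" when zero
    data loss is required and the estimated round trip fits the budget,
    and "SYNC_OVER_BUDGET" when zero data loss is required but the
    distance is likely to slow down commits on the primary.
    """
    if rpo_seconds > 0:
        return "ASYNC"
    if estimated_round_trip_ms(distance_km) <= commit_latency_budget_ms:
        return "SYNC"
    return "SYNC_OVER_BUDGET"


if __name__ == "__main__":
    for km in (10, 100, 1000):
        rtt = estimated_round_trip_ms(km)
        mode = choose_transport(km, rpo_seconds=0, commit_latency_budget_ms=2.0)
        print(f"{km} km: about {rtt} ms round trip, suggested mode {mode}")
```

A "SYNC_OVER_BUDGET" result signals that the requirement and the topology conflict: either a closer standby is needed for synchronous replication, or a non-zero RPO must be accepted for that site.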

There are four reference architectures associated with MAA Gold;

For more information on each of these architectures, please download:

MAA Reference Architectures

The MAA Gold architecture depicted above is an interpretation of the Multiple Standby Databases reference architecture.

This particular implementation of an MAA Gold architecture provides all of the protection of the Silver architecture plus protection against outages associated with data corruption and site failure (natural disaster, power etc.).

In this solution, the change vectors for the Primary database are simultaneously replicated to a local Standby (within the same data centre) and a remote Standby that is several hundred miles away. The changes are synchronously copied to the local Standby, whereas the same change vectors are asynchronously copied to the remote Standby.

This means that, should the Primary database experience an unprotected or unrecoverable failure, we would switch to the local Standby database. This is fast to do and guarantees that no data is lost. It also means that all users experience the same performance characteristics as they do with the Primary database (this assumes that the Standby is sized the same as the Primary).

If there is a data centre-wide, unrecoverable failure, and the local Standby is also affected, we would only then switch to the remote Standby database.

Because the remote Standby database is geographically distant, it is not affected by the data centre failure and can continue to support the application workload. Due to the asynchronous nature of the remote replication, there is still a possibility of a small amount of data loss.
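The order of preference described above (local Standby first, remote Standby only if the local one is also lost) can be written as a small decision function. This is a sketch with hypothetical names; it models the decision only and does not perform a Data Guard role transition.

```python
from typing import NamedTuple


class FailoverChoice(NamedTuple):
    target: str
    expected_data_loss: str


def choose_failover_target(local_standby_healthy, remote_standby_healthy):
    """Pick the standby to fail over to after the primary is lost."""
    if local_standby_healthy:
        # Synchronously replicated, so no committed data is missing.
        return FailoverChoice("local standby", "none (synchronous replication)")
    if remote_standby_healthy:
        # Asynchronously replicated, so the last few seconds may be missing.
        return FailoverChoice(
            "remote standby",
            "possibly a few seconds (asynchronous replication)",
        )
    raise RuntimeError("no standby available; recovery must use backups")


if __name__ == "__main__":
    # Data-centre-wide failure: the local standby is lost with the primary.
    print(choose_failover_target(False, True))
```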

For planned outages, such as database patch sets and upgrades, the Gold architecture offers further protection. Using Active Data Guard, we can execute highly automated rolling upgrades. This means it is possible to provide access to a working database (the Standby) while the Primary database is being upgraded. Once completed, the roles can reverse, and the Standby can then also be upgraded.

For Gold, the expected RPO and RTO for each outage category are:

Outage Type  Outage Event                    Expected RTO     Expected RPO
Unplanned    Recoverable Instance Failure    Zero             Zero
             Recoverable Server Failure      Seconds          Zero
             Data Corruption, Site Failure   Zero to Seconds  Near-zero if ASYNC, Zero if SYNC
Planned      Reorganization                  Zero             Zero
             Hardware or O/S Maintenance     Zero             Zero
             Most Database Patches           Zero             Zero
             Database Upgrades & Patch Sets  Seconds          Zero
             Platform Migrations             Seconds          Zero
             Application Upgrades            Hours to Days    Zero

The Gold MAA architecture is suited to mission-critical workloads with very little tolerance for either data loss or outages. Due to the higher levels of architectural and operational complexity, this type of protection is often reserved for production environments.

PLATINUM

[Figure: Oracle MAA Platinum architecture]

The Oracle MAA Platinum architecture offers the broadest levels of resilience for an Oracle Database.

This architecture offers all of the protections covered by the Gold architecture and adds protection from outages associated with application upgrades and migrations.

It does this by using Oracle GoldenGate replication to allow both the local site and the remote site to be fully active. The replication can occur in one or both directions allowing all resources to be fully utilized and protected.

For the Platinum architecture, the main difference between Oracle GoldenGate and Oracle Data Guard is that GoldenGate is based upon logical replication of changes. In contrast, Data Guard maintains a physical replica of a primary database.

Because of this difference, GoldenGate has many use-cases beyond MAA. Data Guard is specific to Disaster Recovery protection.

In the Platinum architecture depicted above, each primary database is part of an active/active GoldenGate configuration. Each primary is protected by a local standby using Data Guard with synchronous replication. GoldenGate also replicates each primary database asynchronously.

For Platinum, the expected RPO and RTO for each outage category are:

Outage Type  Outage Event                    Expected RTO  Expected RPO
Unplanned    Recoverable Instance Failure    Zero          Zero
             Recoverable Server Failure      Zero          Zero
             Data Corruption, Site Failure   Zero          Zero to Seconds
Planned      Reorganization                  Zero          Zero
             Hardware or O/S Maintenance     Zero          Zero
             Most Database Patches           Zero          Zero
             Database Upgrades & Patch Sets  Zero          Zero
             Platform Migrations             Zero          Zero
             Application Upgrades            Zero          Zero
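The four RTO tables can be compared side by side by encoding the worst case of each range and looking for the lowest tier that meets a given tolerance. The sketch below covers a subset of the outage events and ranks the qualitative values in the order Zero, Seconds, Minutes, Hours, Days. That ranking and the function name are assumptions made for the illustration; the table entries are order-of-magnitude expectations, not guarantees.

```python
# Ordinal ranking of the qualitative RTO values (assumption for comparison).
RANK = {"Zero": 0, "Seconds": 1, "Minutes": 2, "Hours": 3, "Days": 4}

TIERS = ["Bronze", "Silver", "Gold", "Platinum"]

# Worst-case expected RTO per tier, taken from the upper end of each range.
WORST_CASE_RTO = {
    "Recoverable Instance Failure": ["Minutes", "Zero", "Zero", "Zero"],
    "Recoverable Server Failure": ["Hours", "Seconds", "Seconds", "Zero"],
    "Data Corruption, Site Failure": ["Days", "Days", "Seconds", "Zero"],
    "Database Upgrades": ["Hours", "Hours", "Seconds", "Zero"],
    "Application Upgrades": ["Days", "Days", "Days", "Zero"],
}


def lowest_tier_meeting(event, tolerated_rto):
    """Return the first tier whose worst-case RTO is within the tolerance."""
    limit = RANK[tolerated_rto]
    for tier, worst_case in zip(TIERS, WORST_CASE_RTO[event]):
        if RANK[worst_case] <= limit:
            return tier
    return None


if __name__ == "__main__":
    for event in WORST_CASE_RTO:
        print(event, "->", lowest_tier_meeting(event, "Seconds"))
```

In practice the chosen architecture is the highest tier required across all of the outage events that matter to the application, weighed against the cost and complexity of that tier.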

Although this architecture allows for very high levels of resilience, the use of GoldenGate means that the application needs to be engineered to take full advantage of the underlying topology. For this reason, Platinum is applicable only when the most demanding RPO and RTO targets must be met. For most mission-critical applications, a flavour of the Gold MAA architecture will usually be sufficient.