Data Estate Resilience
As previously stated, protecting against loss of service, or loss of data, is a multi-layered discipline. Some technologies and approaches prevent outages from occurring, and others reduce the amount of loss when an outage does occur. This applies to both time and data.
For over 15 years, Oracle customers have been able to take advantage of a series of recommendations and blueprint architectures that ensure appropriate availability levels are achieved for database workloads. This body of work is known as Oracle’s Maximum Availability Architecture (MAA).
There are many materials and resources dedicated to MAA. You can find them here:
MAA Summary
This topic within the Cloud Foundation aims to describe how the principles and designs within MAA specifically apply to databases deployed onto OCI.
Before that, it is worthwhile reminding ourselves of what we are trying to protect the data estate from and the different levels of protection that we might need.
Prevention is Better Than Cure
Important:
Before we consider the range of infrastructure-based, protective architectures that can be deployed, we should be confident that we are taking advantage of all of the protective measures built into the Oracle database itself.
Most of these measures are intended to prevent outages from occurring in the first place. It is good practice to ensure that these measures have been actively considered for every Oracle database that you deploy. After all, prevention is better than cure!
The following features and capabilities can all assist with removing or reducing outages and data loss, and we recommend they are considered for all Oracle databases.
Fast Recovery Area
The fast recovery area is a central location for the files needed to back up and recover your database. They include:
- Image Copies – Exact byte-for-byte copies of all the files of the target database
- RMAN Backup Set – logical entities produced by the RMAN BACKUP command.
- Data files
- Archive Logs
- Online Redo Logs
- Flashback logs
- Control files
- Control file and spfile backup
Flashback Operations
The various flashback technologies enable you to rewind (undo) changes to the data and objects in the database without having to use database recovery. This saves time in the event of human or program error when manipulating data and objects. The Flashback technologies include:
- Flashback Query
- Flashback Versions
- Flashback Transaction
- Flashback Drop
- Flashback Table
- Flashback Database
Online Operations
Online operations allow you to make structural and definition changes to database objects while still allowing users to access them. They include:
- Table Redefinition
- Partition Move
Initialization Parameters
Many initialization parameters are used to add resilience to an Oracle database. Some of these are automated, and some need to be configured manually. The appropriate settings depend on how the database is used and how active it is. Three parameters in particular help with data corruption issues:
- DB_BLOCK_CHECKSUM: This parameter causes a checksum to be calculated and stored in the header of each database block when it is written. The checksum is verified against the contents of the block when it is read. In this way, it can detect data corruption introduced by the underlying storage layer.
- DB_BLOCK_CHECKING: This parameter checks that the structure of different types of database blocks is consistent and intact. It can identify data corruption that occurs in memory and on disk.
- DB_LOST_WRITE_PROTECT: This parameter detects when a write operation for a block is sent to disk but never actually reaches persistent storage (for whatever reason).
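To make the checksum idea concrete, here is a minimal conceptual sketch in Python. It is not Oracle's implementation: the `Block` class, the choice of CRC-32, and the function names are all assumptions made purely for illustration. It shows a checksum being stored with a block at write time and verified when the block is read back.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical stand-in for a database block: payload plus header checksum."""
    payload: bytes
    checksum: int


def write_block(payload: bytes) -> Block:
    # Compute the checksum at write time and keep it in the block "header".
    return Block(payload=payload, checksum=zlib.crc32(payload))


def read_block(block: Block) -> bytes:
    # Recompute on read; a mismatch means the bytes changed after the write.
    if zlib.crc32(block.payload) != block.checksum:
        raise IOError("block checksum mismatch: possible storage corruption")
    return block.payload


good = write_block(b"customer row data")

# Simulate the storage layer silently changing one byte of the payload.
corrupted = Block(payload=b"customer row dat\x00", checksum=good.checksum)
try:
    read_block(corrupted)
    detected = False
except IOError:
    detected = True
```

The intact block reads back normally, while the altered block raises an error instead of silently returning bad data.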
Types of Downtime
There are two types of downtime: planned and unplanned. We might deal with each of these differently, yet we often leverage the same technologies.
Planned downtime covers outages that occur as a result of:
- Patching
- Upgrading software in the technology stack
- Changes to an application or database schema
- Migrating to a new infrastructure platform or service
Unplanned downtime covers outages that result from:
- Hardware Failure
- Software Failure
- Power Failure
- Natural Disasters
- Human Error
Whichever category a particular downtime event falls into, our objective is to recover the service in an acceptable time frame, preferably without losing any data.
The key metrics we need to consider, then, are:
- Recovery Point Objective (RPO) - The maximum amount of data loss (expressed in minutes) that can be tolerated
- Recovery Time Objective (RTO) - The maximum amount of time that can elapse before the service is made available again
The combination of these two values is different for each application or service. For practicality's sake, we usually group and categorize applications according to their similar requirements.
For example:
Category | RPO | RTO | Examples |
Level 1 | 15 minutes | 4 hours | Low Priority Apps, Development |
Level 2 | 5 minutes | 1 hour | Medium Priority Apps, Test |
Level 3 | 0 minutes | 15 minutes | High Priority Apps, Production |
Level 4 | 0 minutes | 0 minutes | Mission Critical Applications |
We may ask: why wouldn’t we want all applications and services to benefit from an RPO of zero and an RTO of zero?
The answer is that achieving higher levels of resilience is usually more expensive, requires a more complex architecture, and is operationally more demanding.
Therefore, it makes sense that, for each application, we fully understand what levels of resilience are required. We can then match that requirement to the appropriate architecture.
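As an illustration of matching a requirement to a category, the sketch below encodes the example table above and returns the least demanding category that still satisfies a given RPO and RTO. The `Category` type and the `categorize` function are hypothetical names introduced here; the thresholds are simply the example values from the table, not a recommendation.

```python
from typing import NamedTuple


class Category(NamedTuple):
    name: str
    rpo_minutes: int
    rto_minutes: int


# Values copied from the example table, least demanding first.
CATEGORIES = [
    Category("Level 1", 15, 240),
    Category("Level 2", 5, 60),
    Category("Level 3", 0, 15),
    Category("Level 4", 0, 0),
]


def categorize(required_rpo_minutes: int, required_rto_minutes: int) -> str:
    """Return the least demanding category that meets both targets."""
    for category in CATEGORIES:
        if (category.rpo_minutes <= required_rpo_minutes
                and category.rto_minutes <= required_rto_minutes):
            return category.name
    # Only reachable when a negative target is supplied.
    raise ValueError("RPO and RTO must be non-negative")


reporting_app = categorize(required_rpo_minutes=10, required_rto_minutes=120)
payments_app = categorize(required_rpo_minutes=0, required_rto_minutes=0)
```

Searching from the least demanding category upwards reflects the point made above: pay for only as much resilience as the application actually requires.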
Levels of Protection
There are many different ways that technologies can be combined to deliver increasing levels of resilience. For Oracle MAA, all of the components and technologies recommended have been developed by Oracle. We believe this is very important, as having one provider of resilience-centric technologies means you can rely on the recommended architectures being engineered to work together and being fully tested.
Oracle MAA describes four levels of resilience: Bronze, Silver, Gold and Platinum.
Important:
All implementations of an MAA architecture should incorporate the appropriate outage prevention capabilities outlined in the “Prevention is Better Than Cure” section of this page.
BRONZE
The Oracle MAA Bronze architecture consists of a single instance Oracle database with local, and ideally remote, backups.
The Bronze architecture protects against both planned and unplanned outages. However, the RTO in some circumstances (e.g. complete site failure) could be many hours or even days. Also, with full site failure, in particular, there is a risk of significant data loss depending on the availability and timing of the last successful backup.
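The dependence of data loss on backup timing can be sketched with simple date arithmetic. The example below is an assumption-laden illustration: it assumes that, after a full site failure, only the listed backups survive and no later redo is available. The function name `data_loss_window` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable


def data_loss_window(failure_time: datetime,
                     successful_backups: Iterable[datetime]) -> timedelta:
    """Time between the last usable backup and the failure."""
    usable = [b for b in successful_backups if b <= failure_time]
    if not usable:
        raise ValueError("no successful backup exists before the failure")
    return failure_time - max(usable)


# Hypothetical nightly backups at 02:00, with a site failure the next afternoon.
backups = [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 2, 0)]
failure = datetime(2024, 1, 2, 14, 30)
loss = data_loss_window(failure, backups)
```

With these sample values, everything written in the twelve and a half hours since the last backup would be at risk, which is why remote backups and their frequency matter so much at this level.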
For Bronze, the expected RPO and RTO for each outage category are:
Outage Type | Outage Event | Expected RTO | Expected RPO |
Unplanned | Recoverable Instance Failure | Minutes | Zero |
Unplanned | Recoverable Server Failure | Minutes to an Hour | Zero |
Unplanned | Data Corruption, Site Failure | Hours to Days | Since last backup |
Planned | Reorganization, Limited Patches | Zero | Zero |
Planned | Hardware or O/S Maintenance | Minutes to Hours | Zero |
Planned | Most Database Patches | Minutes to Hours | Zero |
Planned | Database Upgrades, Patch Sets | Minutes to Hours | Zero |
Planned | Platform Migrations | Hours to a Day | Zero |
Planned | Application Upgrades | Hours to Days | Zero |
The Bronze MAA architecture may be appropriate if an application is considered non-critical or is a development or test environment. In all circumstances, the reduced complexity and cost need to be balanced against increased outage times and significant potential data loss (e.g. in the event of unrecoverable site failure).
SILVER
The Oracle MAA Silver architecture builds upon the Bronze architecture but, instead of having a single database instance accessing the database files, there are at least two. These nodes simultaneously access the database in an active/active configuration. This is made possible using Oracle Real Application Clusters (RAC). RAC can do this by creating and tracking changes in a global cache of memory-resident database blocks from each of the participating nodes.
For unplanned outages, the Silver architecture extends the protection offered by the Bronze architecture in the following way:
The presence of multiple database instance nodes means that should one of the nodes experience an unplanned outage (of any description), the remaining node(s) can continue to service database workloads. In its most basic implementation, user sessions connected to a failed node will experience an interruption while their connections are re-established to the surviving node(s). Sessions already connected to a surviving node will not experience any interruption.
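The behaviour described above can be modelled in a few lines. This is a conceptual sketch only: it does not use any Oracle API, and the `fail_over_sessions` function, the node names, and the round-robin placement are assumptions chosen for illustration.

```python
from typing import Dict, List


def fail_over_sessions(sessions: Dict[str, List[str]],
                       failed_node: str) -> Dict[str, List[str]]:
    """Move sessions from a failed node onto the surviving nodes, round-robin."""
    survivors = [node for node in sessions if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving node: the database service is down")
    # Sessions already on surviving nodes are left untouched.
    result = {node: list(sessions[node]) for node in survivors}
    # Sessions from the failed node must reconnect to a survivor.
    for i, session in enumerate(sessions.get(failed_node, [])):
        result[survivors[i % len(survivors)]].append(session)
    return result


cluster = {"node1": ["s1", "s2"], "node2": ["s3"], "node3": []}
after = fail_over_sessions(cluster, "node1")
```

Only sessions `s1` and `s2` are disturbed in this model; `s3` stays where it is, which mirrors the point that sessions on surviving nodes see no interruption.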
Oracle offers a range of other technologies to ensure that sessions connected to a failed node do not experience interruption. These include:
- Fast Application Notification (FAN)
- Transparent Application Failover (TAF)
- Application Continuity (AC)
- Transparent Application Continuity (TAC)
More information about these technologies can be found in the following white paper:
MAA Checklist for Applications
For planned outages, the Silver architecture extends the Bronze architecture by allowing maintenance to be carried out on one node at a time. This could be resolving a physical hardware issue or patching and upgrading the operating system. Furthermore, the Silver architecture means that database patches themselves can be applied in a rolling fashion. This means that while one database node is being maintained or patched, the other node(s) can continue to provide a database service to users.
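The rolling approach can be sketched as a simple loop in which exactly one node is out of service at each step. This is an illustrative model rather than a patching tool; `rolling_maintenance` and the node names are hypothetical.

```python
from typing import Iterator, List, Tuple


def rolling_maintenance(nodes: List[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (node being maintained, nodes still serving) for each step."""
    if len(nodes) < 2:
        raise ValueError("rolling maintenance needs at least two nodes")
    for node in nodes:
        serving = [n for n in nodes if n != node]
        yield node, serving


steps = list(rolling_maintenance(["node1", "node2"]))
service_always_available = all(serving for _, serving in steps)
```

With two or more nodes, every step leaves at least one node serving users, which is what allows the expected RTO for most patches to be zero at this level.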
For Silver, the expected RPO and RTO for each outage category are:
Outage Type | Outage Event | Expected RTO | Expected RPO |
Unplanned | Recoverable Instance Failure | Zero | Zero |
Unplanned | Recoverable Server Failure | Seconds | Zero |
Unplanned | Data Corruption, Site Failure | Hours to Days | Since last backup |
Planned | Reorganization | Zero | Zero |
Planned | Hardware or O/S Maintenance | Zero | Zero |
Planned | Most Database Patches | Zero | Zero |
Planned | Database Upgrades & Patch Sets | Minutes to Hours | Zero |
Planned | Platform Migrations | Hours to a Day | Zero |
Planned | Application Upgrades | Hours to Days | Zero |
The Silver MAA architecture is suited to workloads that demand higher levels of availability. These can include critical applications where outages, planned or unplanned, would be detrimental to the business. This is often seen as a requirement for production environments but, because of the architectural sophistication of active/active database technology, it is often prudent to replicate the Silver architecture for non-production environments (dev, test, pre-prod).
GOLD
The Oracle MAA Gold architecture extends the Silver architecture with the introduction of data replication technologies. The data replication technology for Oracle databases is Oracle Data Guard. There are two main roles in a Data Guard configuration:
- The Primary Database (the database you want to protect)
- The Standby Database (the copy of the database providing the protection)
These roles are interchangeable between the participants in a Data Guard configuration.
Data Guard replicates change vectors in the transaction logs of the Primary Database to the Standby Database. The Standby Database can be hundreds or even thousands of miles away from the Primary. The only limiting factor is your solution's tolerance to latency. For very long distances, the data can be replicated asynchronously, without impacting performance on the primary database. However, being asynchronous means that there is still a slight chance of losing a small amount of data (typically, a few seconds' worth) in the event of an unrecoverable failure on the Primary site.
If this small amount of data loss is not acceptable, then replication can be done synchronously. This guarantees protection against data loss but will limit the distance the data can travel before latency begins to affect the performance on the Primary database.
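The trade-off between the two modes can be shown with a toy model. The sketch below is not how Data Guard is implemented; it simply counts committed transactions that have not reached the standby when the primary is lost. The function name and the `async_lag` parameter (the number of transactions not yet shipped at any moment) are assumptions introduced for illustration.

```python
def transactions_lost(total_commits: int,
                      synchronous: bool,
                      async_lag: int = 0) -> int:
    """Commit transactions 1..total_commits on the primary, then lose the primary.

    Returns how many committed transactions are missing on the standby.
    """
    standby_received = 0
    for txn in range(1, total_commits + 1):
        if synchronous:
            # The commit only completes once the standby has the change.
            standby_received = txn
        else:
            # The standby trails the primary by up to async_lag transactions.
            standby_received = max(0, txn - async_lag)
    return total_commits - standby_received


sync_loss = transactions_lost(1000, synchronous=True)
async_loss = transactions_lost(1000, synchronous=False, async_lag=3)
```

In this model, synchronous replication never loses a committed transaction, while asynchronous replication loses at most the transactions still in flight. The cost of the synchronous guarantee, which the model does not capture, is the added commit latency over distance.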
There are four reference architectures associated with MAA Gold:
- Remote Standby
- Multiple Standby Databases
- Standby Reader Farm
- Far Sync Standby
For more information on each of these architectures, please download;
The MAA Gold architecture depicted above is an interpretation of the Multiple Standby Databases reference architecture.
This particular implementation of an MAA Gold architecture provides all of the protection of the Silver architecture plus protection against outages associated with data corruption and site failure (natural disaster, power failure, etc.).
In this solution, the change vectors for the Primary database are simultaneously replicated to a local Standby (within the same data centre) and a remote Standby that is several hundred miles away. The changes are synchronously copied to the local Standby, whereas the same change vectors are asynchronously copied to the remote Standby.
This means that should the Primary database experience an unprotected or unrecoverable failure, we would switch to the local Standby database. This is fast to do and guarantees that no data is lost. It also means that all users experience the same performance characteristics as they do with the Primary database (this assumes that the Standby is sized the same as the Primary).
If there is a data centre-wide, unrecoverable failure, and the local Standby is also affected, we would only then switch to the remote Standby database.
Because the remote Standby database is geographically distant, it is not affected by the data centre failure and can continue to support the application workload. Due to the asynchronous nature of the remote replication, there is still a possibility of a small amount of data loss.
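The order of preference described above can be written down as a small decision function. This is a sketch of the reasoning only, not an automated failover mechanism. The function name and return strings are hypothetical, and the final fallback to backups is an assumption based on Gold building on the Bronze foundation.

```python
from typing import Tuple


def choose_failover_target(local_standby_available: bool,
                           remote_standby_available: bool) -> Tuple[str, str]:
    """Return (target, expected data loss) after an unrecoverable Primary failure."""
    if local_standby_available:
        # Synchronous replication: every committed change is already there.
        return "local standby", "zero"
    if remote_standby_available:
        # Asynchronous replication: the most recent changes may be missing.
        return "remote standby", "possibly a few seconds"
    # Assumed last resort, inherited from the Bronze architecture.
    return "restore from backup", "since last backup"


primary_only_failure = choose_failover_target(True, True)
data_centre_failure = choose_failover_target(False, True)
```

Preferring the local Standby whenever it survives keeps both data loss and the change in user experience to a minimum; the remote Standby is used only when the whole data centre is lost.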
For planned outages, such as database patch sets and upgrades, the Gold architecture can protect further. Using Active Data Guard, we can execute highly automated rolling upgrades. This means it is possible to provide access to a working database (the Standby) while the Primary database is being upgraded. Once completed, the roles can reverse, and the Standby can then also be upgraded.
For Gold, the expected RPO and RTO for each outage category are:
Outage Type | Outage Event | Expected RTO | Expected RPO |
Unplanned | Recoverable Instance Failure | Zero | Zero |
Unplanned | Recoverable Server Failure | Seconds | Zero |
Unplanned | Data Corruption, Site Failure | Zero to Seconds | Near-zero if ASYNC, Zero if SYNC |
Planned | Reorganization | Zero | Zero |
Planned | Hardware or O/S Maintenance | Zero | Zero |
Planned | Most Database Patches | Zero | Zero |
Planned | Database Upgrades & Patch Sets | Seconds | Zero |
Planned | Platform Migrations | Seconds | Zero |
Planned | Application Upgrades | Hours to Days | Zero |
The Gold MAA architecture is suited to mission-critical workloads with very little tolerance for either data loss or outages. Due to the higher levels of architectural and operational complexity, this type of protection is often reserved for production environments.
PLATINUM
The Oracle MAA Platinum architecture offers the broadest levels of resilience for an Oracle Database.
This architecture offers all of the protections covered by the Gold architecture and adds protection from outages associated with application upgrades and migrations.
It does this by using Oracle GoldenGate replication to allow both the local site and the remote site to be fully active. The replication can occur in one or both directions allowing all resources to be fully utilized and protected.
For the Platinum architecture, the main difference between Oracle GoldenGate and Oracle Data Guard is that GoldenGate is based upon logical replication of changes. In contrast, Data Guard maintains a physical replica of a primary database.
Because of this difference, GoldenGate has many use-cases beyond MAA. Data Guard is specific to Disaster Recovery protection.
In the Platinum architecture depicted above, each primary database is part of an active/active GoldenGate configuration. Each Primary is protected by a local standby using Data Guard with synchronous replication. GoldenGate also protects each primary database in an asynchronous manner.
For Platinum, the expected RPO and RTO for each outage category are:
Outage Type | Outage Event | Expected RTO | Expected RPO |
Unplanned | Recoverable Instance Failure | Zero | Zero |
Unplanned | Recoverable Server Failure | Zero | Zero |
Unplanned | Data Corruption, Site Failure | Zero | Zero to Seconds |
Planned | Reorganization | Zero | Zero |
Planned | Hardware or O/S Maintenance | Zero | Zero |
Planned | Most Database Patches | Zero | Zero |
Planned | Database Upgrades & Patch Sets | Zero | Zero |
Planned | Platform Migrations | Zero | Zero |
Planned | Application Upgrades | Zero | Zero |
Although this architecture allows for very high levels of resilience, the use of GoldenGate means that the application needs to be engineered to take full advantage of the underlying topology. For this reason, Platinum is applicable only when the most demanding RPO and RTO targets are required. For most mission-critical applications, a flavour of the Gold MAA architecture will usually be sufficient.