Oracle Maximum Availability Architecture for Oracle Database@AWS

Amazon Web Services (AWS) is a strategic partner of Oracle Multicloud. Oracle Maximum Availability Architecture (MAA) evaluates MAA reference architectures on Oracle Database@AWS, the results of which are shown here.

To learn more about MAA Silver and MAA Gold evaluations and their benefits after certification, see MAA Evaluations on Multicloud Solutions.

Oracle MAA evaluated Oracle's solution in AWS and continues to re-evaluate it periodically to ensure that the solution delivers all expected benefits and to highlight any new MAA benefits and capabilities for Oracle Database@AWS. Certification is granted only after the MAA evaluation meets the requirements.

Oracle MAA has evaluated and endorsed Oracle Database@AWS for:

  • MAA Silver reference architecture for Oracle Exadata Database Service on Dedicated Infrastructure
  • MAA Gold reference architecture for Oracle Exadata Database Service on Dedicated Infrastructure, when the standby database resides on another Exadata infrastructure in the same or separate Availability Zones (AZs) or in different regions. Note that network latency and bandwidth may vary depending on the source and target locations.

MAA Silver Network Topology and Evaluation

The following image shows the MAA Silver reference architecture, with Oracle RAC on Exadata Database Service, and backup options to Oracle Database Autonomous Recovery Service running in OCI or Amazon S3 in AWS.

Figure 36-9 Oracle Database@AWS in an MAA Silver Reference Architecture



Oracle MAA evaluated and endorsed Exadata Database Service on Oracle Database@AWS for the MAA Silver reference architecture, and observed the following results.

  • Application and database uptime met expectations while injecting unplanned local outages, updating database and system software, and performing system and database elastic changes (for example, increasing CPU, storage, and so on). See MAA Evaluations on Multicloud Solutions for the tests evaluated.
  • Oracle Maximum Availability Architecture in Oracle Exadata Cloud Systems describes all of the benefits, including cloud MAA templates, high availability and low brownout, data protection, response time Quality-of-Service, low downtime for unplanned outages, minimum application impact for planned maintenance, and elastic system operations.

Additional MAA Best Practices Specific to AWS

  • Carefully consider application VM placement and high availability. Application latency may be affected by the application VM's proximity to the target database server.

    For the lowest latency, place the application VM in the same Availability Zone (AZ) as the database, and use direct communication (not AWS network virtual appliances, firewalls, and the like). For high availability, evaluate placing multiple application VMs in the same AZ or across multiple AZs, depending on your application response time requirements. See the topic "Application Network Layer and Application Failover" below.

  • Use Autonomous Recovery Service on OCI for the best data protection, backup, and recovery benefits as discussed later.

    The MAA evaluation observed that both Autonomous Recovery Service and Object Storage Service in OCI met expectations. MAA evaluation with S3 in AWS is still underway. See topic "Backup and Restore Observations" below.

  • Allow TCP port 6200 ingress into the Exadata VM Cluster client subnet from the networks where your application VMs reside.

    TCP port 6200 access allows Oracle Notification Services (ONS) to communicate Fast Application Notification (FAN) events, which are required to provide immediate notification of cluster and service events so that applications can respond quickly to changes during planned maintenance and unplanned outages. The recommended way to allow TCP port 6200 ingress is to log in to the Oracle Cloud Console, navigate to VM Cluster Information, edit the Client network security group exa_1521_adjustable_nsg, and add a security rule (a scripted alternative is sketched after this list) where:

    • Direction is Ingress
    • Source Type is CIDR
    • Source CIDR is your application network CIDR
    • Protocol is TCP
    • Destination Port Range is 6200
  • Allow ICMP Echo Request within the Exadata VM Cluster Client subnet.

    Allowing ICMP Echo Request enables the ping(8) command between Exadata virtual machines, which the patchmgr utility requires during virtual machine OS updates. The recommended way to allow ICMP Echo Request is to log in to the Oracle Cloud Console, navigate to VM Cluster Information, edit the Client network security group exa_static_nsg, and add a security rule (also covered in the sketch after this list) where:

    • Direction is Ingress
    • Source Type is NSG
    • Source NSG is exa_static_nsg
    • Protocol is ICMP
    • Type is 8
  • See Security Rules for the Oracle Exadata Database Service on Dedicated Infrastructure.
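
As an illustration only, the two security rules described above can also be added with the OCI Python SDK instead of the Cloud Console. The following is a minimal sketch, assuming the oci package is installed and configured; the NSG OCIDs and the application network CIDR are placeholders that you must replace with the values shown on your VM Cluster Information page.

```python
# Minimal sketch using the OCI Python SDK (pip install oci).
# Placeholders: the two NSG OCIDs and the application network CIDR.
import oci

config = oci.config.from_file()                       # reads ~/.oci/config
network = oci.core.VirtualNetworkClient(config)

ADJUSTABLE_NSG_ID = "ocid1.networksecuritygroup.oc1..replace_exa_1521_adjustable_nsg"
STATIC_NSG_ID = "ocid1.networksecuritygroup.oc1..replace_exa_static_nsg"
APP_CIDR = "10.0.8.0/24"                              # your application network CIDR

# Ingress TCP 6200 from the application network (ONS/FAN),
# added to exa_1521_adjustable_nsg
ons_rule = oci.core.models.AddSecurityRuleDetails(
    direction="INGRESS",
    protocol="6",                                     # TCP
    source=APP_CIDR,
    source_type="CIDR_BLOCK",
    tcp_options=oci.core.models.TcpOptions(
        destination_port_range=oci.core.models.PortRange(min=6200, max=6200)),
    description="ONS/FAN notifications from application VMs")

# Ingress ICMP Echo Request (type 8) within the client subnet,
# added to exa_static_nsg
icmp_rule = oci.core.models.AddSecurityRuleDetails(
    direction="INGRESS",
    protocol="1",                                     # ICMP
    source=STATIC_NSG_ID,
    source_type="NETWORK_SECURITY_GROUP",
    icmp_options=oci.core.models.IcmpOptions(type=8),
    description="ping between Exadata VMs for patchmgr OS updates")

network.add_network_security_group_security_rules(
    ADJUSTABLE_NSG_ID,
    oci.core.models.AddNetworkSecurityGroupSecurityRulesDetails(security_rules=[ons_rule]))
network.add_network_security_group_security_rules(
    STATIC_NSG_ID,
    oci.core.models.AddNetworkSecurityGroupSecurityRulesDetails(security_rules=[icmp_rule]))
```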

Application Network Layer and Application Failover

The proximity of the application tier to the database VM Cluster affects application response time.

For response time-sensitive applications, pay attention to the observed response time results in the table below. See MAA Evaluations on Multicloud Solutions for the tests that were conducted.

Use Case: Application VMs to Exadata VM Cluster (same region)

RTT Latency Observed: Varies based on placement

  • 676 microseconds to 1.2 milliseconds in the same availability zone (AZ)
  • 2.1 milliseconds or higher across AZs

Network Throughput Observed: Varies based on placement and VM size

  • 5 Gbps for single process throughput
  • 25 Gbps for multiple (64) parallel processes throughput

MAA Recommendations:

  1. Ensure that the observed RTT latency meets your application requirements.
  2. Deploy multiple application VMs for high availability, and ensure that their placement continues to meet your application RTT latency requirements.
  3. Test thoroughly with your implementation; variables such as VM size and placement can affect results.
  4. To enable transparent application failover, follow the steps described in Configuring Continuous Availability for Applications (see the connection sketch after this list).
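
As a minimal sketch of the transparent application failover recommendation in item 4, the following example uses the python-oracledb driver with a connect descriptor that retries connection attempts instead of failing immediately. The host, service name, credentials, and timeout/retry values are placeholders; follow Configuring Continuous Availability for Applications for the authoritative settings.

```python
# Minimal sketch using python-oracledb (pip install oracledb).
# Placeholders: SCAN host, service name, credentials; timeout/retry values are illustrative.
import oracledb

dsn = (
    "(DESCRIPTION="
    "(CONNECT_TIMEOUT=90)(RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)"
    "(ADDRESS_LIST=(LOAD_BALANCE=ON)"
    "(ADDRESS=(PROTOCOL=TCP)(HOST=exa-scan.example.com)(PORT=1521)))"
    "(CONNECT_DATA=(SERVICE_NAME=myapp_svc.example.com)))"
)

# A connection pool lets the application ride through instance restarts during
# planned maintenance; the descriptor above retries connection attempts rather
# than failing on the first error.
pool = oracledb.create_pool(user="app_user", password="not_a_real_password",
                            dsn=dsn, min=2, max=8, increment=1)

with pool.acquire() as connection:
    with connection.cursor() as cursor:
        cursor.execute("select sys_context('userenv', 'instance_name') from dual")
        print("Connected to instance:", cursor.fetchone()[0])
```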

Backup and Restore Observations

Oracle Database backup and restore throughput to Oracle's Autonomous Recovery Service or Oracle’s Object Storage Service was within performance expectations, and all expected built-in backup and restore configuration and life cycle best practices were incorporated into cloud automation.

Using RMAN nettest and actual database backup and restore operations, backup and restore performance was within expectations. For example, an Exadata Database Service 2-node VM Cluster (using 16+ OCPUs) and three storage cells may observe a 1.5 TB/hour backup rate and approximately a 2.2 TB/hour restore rate with no other workloads.

Note that the default number of backup and restore RMAN channels varies based on the number of ECPUs per database node. You can increase the number of RMAN channels for higher backup and restore throughput by using dbaascli to change the channelsPerNode value (the maximum is 32 per node for Autonomous Recovery Service, or higher for other target services). An average throughput of 40 MB/second per Oracle RMAN channel was observed. The RMAN channel throughput contributes to your overall backup or restore throughput until another resource limit is reached.

By increasing the number of RMAN channels, you can use the available network and storage bandwidth and achieve as much as a 5 TB/hour backup rate and a 7 TB/hour restore rate with three Exadata storage cells. Restore rates can increase as you add Exadata storage cells, at approximately 2 GB/second per storage cell. Performance varies based on existing workloads and network traffic on the shared infrastructure.
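
As an illustration of how these observed rates combine, the following back-of-the-envelope sketch estimates the throughput ceilings implied by approximately 40 MB/second per RMAN channel and approximately 2 GB/second per storage cell. These are theoretical upper bounds only; the measured rates above were lower because shared network bandwidth and existing workloads become the limiting factors first.

```python
# Back-of-the-envelope estimate from the rates observed in this evaluation
# (~40 MB/s per RMAN channel, ~2 GB/s restore per storage cell). Illustrative only.

def channel_bound_tb_per_hour(db_nodes, channels_per_node, mb_per_sec_per_channel=40):
    """Upper bound on backup/restore throughput from RMAN channel count alone."""
    mb_per_sec = db_nodes * channels_per_node * mb_per_sec_per_channel
    return mb_per_sec * 3600 / (1024 * 1024)          # MB/s -> TB/hour

def cell_bound_tb_per_hour(storage_cells, gb_per_sec_per_cell=2.0):
    """Upper bound on restore throughput from the number of Exadata storage cells."""
    return storage_cells * gb_per_sec_per_cell * 3600 / 1024   # GB/s -> TB/hour

# Example: 2-node VM Cluster, 32 channels per node, 3 storage cells
print(f"channel-bound: {channel_bound_tb_per_hour(2, 32):.1f} TB/hour")   # ~8.8 TB/hour
print(f"cell-bound:    {cell_bound_tb_per_hour(3):.1f} TB/hour")          # ~21.1 TB/hour
```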

The Autonomous Recovery Service provides the following additional benefits:

  • Leverage real-time data protection capabilities to eliminate data loss.
  • Reduce backup processing overhead and time for your production databases with the unique "incremental forever" backup capability. The effective backup rate can increase to 20+ TB/hour depending on the shape and size of the VM Cluster, the database size, and the number of data files.
  • Implement policy-driven backup life-cycle management.
  • Gain additional malware protection.
  • Monitor data protection health status.
  • Retain backups long term.
  • Benefit from built-in high availability in Autonomous Recovery Service deployments.

With all of the above benefits, Oracle MAA always recommends Autonomous Recovery Service. To enable it, see the Autonomous Recovery Service documentation, enable the required security rules, and choose the Autonomous Recovery Service option in the Cloud Console.

MAA Gold Network Topology and Evaluation

In the recommended MAA Gold architecture in AWS:

  • When using Oracle Data Guard, Oracle Exadata VM Clusters are provisioned in two different, independent Exadata Infrastructures in the same region across Availability Zones (AZs), or across Regions using separate networks that do not have overlapping IP CIDR ranges.
  • Backup network subnets assigned to the primary and standby Exadata VM Clusters do not have overlapping IP CIDR ranges.
  • Database backup and restore operations use a high-bandwidth network for Autonomous Recovery Service or Object Storage Service in OCI or Amazon S3 Object Storage.

The following images show the MAA Gold reference architecture, with Oracle RAC on Exadata Database Service, and backups to Oracle Database Autonomous Recovery Service running in OCI and Amazon S3 in AWS. The first image shows the architecture implemented in a single AWS region across two Availability Zones, and the second image shows the architecture deployed across two AWS regions.

Figure 36-10 MAA Gold Architecture in One AWS Region with Two AZs



Figure 36-11 MAA Gold Architecture in Two AWS Regions



The MAA Gold evaluation builds on the MAA Silver evaluation and adds:

  • Network tests between primary and standby Exadata VM Clusters using OCI peered or AWS peered networks to evaluate round-trip latency and bandwidth
  • Oracle Data Guard role transition performance and timings for disaster recovery use cases
  • Oracle database rolling upgrade with Data Guard

Data Guard Considerations and Network Peering

Oracle Data Guard maintains an exact physical copy of the primary database by transmitting (and applying) all data changes (redo) to the standby database across the network, making network throughput, and in some cases, latency, critical to the implementation's success.

Use Data Guard switchover for planned maintenance or disaster recovery tests. If the primary database becomes unavailable, use Data Guard failover to resume service.

Peering Networks Between Primary and Standby

Network peering is the most critical decision to ensure your standby can keep up with the primary database. If the network does not have sufficient bandwidth to support single-process redo throughput, the standby will have a growing transport lag.
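
One way to confirm that the chosen network keeps up is to watch the transport and apply lag reported on the standby. The following is a minimal monitoring sketch, assuming the python-oracledb driver and an account with access to V$DATAGUARD_STATS; the connection details are placeholders.

```python
# Minimal monitoring sketch using python-oracledb (pip install oracledb).
# Placeholders: standby connection details; the account needs access to V$DATAGUARD_STATS.
import oracledb

connection = oracledb.connect(user="dg_monitor", password="not_a_real_password",
                              dsn="standby-scan.example.com/standby_svc.example.com")

with connection.cursor() as cursor:
    cursor.execute("""
        select name, value, time_computed
          from v$dataguard_stats
         where name in ('transport lag', 'apply lag')
    """)
    for name, value, time_computed in cursor:
        # A growing transport lag indicates the network cannot sustain the redo rate.
        print(f"{name:13s} {value} (computed {time_computed})")
```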

The primary and standby Exadata VM Clusters are deployed in separate networks. Oracle Database@AWS Exadata VM Clusters are always deployed using separate Virtual Cloud Networks (VCN) in OCI. These separate VCNs must be connected to allow traffic to pass between them; that is, they must be "peered" before enabling Data Guard with Oracle Cloud Automation. For this reason, the networks must use separate, non-overlapping IP CIDR ranges.

When creating a standby database across Availability Zones (AZs), consider the following:

  • If zero data loss is a business requirement, a lower RTT latency can reduce application impact.
  • To reduce standby transport lags and potential data loss, sufficient network bandwidth is required to handle peak application and database workloads. Evaluate your peak change rate (for example, database redo rate).

When creating a standby database across regions, consider the following:

  • Due to higher RTT latency across regions, most Data Guard configurations use asynchronous redo transport for their cross-region standby database, which essentially eliminates any performance overhead from sending redo to the standby.
  • To reduce standby transport lag and potential data loss, sufficient network bandwidth is required to handle peak application and database workloads. Evaluate your peak change rate (for example, the database redo rate) for each database instance and compare it to the single-process throughput of a tool like iperf, as illustrated in the sketch after this list.
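
As a minimal sketch of that comparison, the following example converts a peak redo rate (for example, from an AWR report) into the network throughput the transport link must sustain and compares it with a measured single-process figure; the values shown are placeholders.

```python
# Minimal sketch; the peak redo rate and measured throughput below are placeholders.

def required_gbps(peak_redo_mb_per_sec: float) -> float:
    """Convert a peak redo generation rate (MB/s) to the Gbps the link must sustain."""
    return peak_redo_mb_per_sec * 8 / 1000

peak_redo_mb_per_sec = 150.0   # e.g., summed "Redo size" per second from AWR, all instances
single_process_gbps = 1.3      # e.g., measured with iperf across the peered networks

needed = required_gbps(peak_redo_mb_per_sec)
verdict = "sufficient" if single_process_gbps >= needed else "expect a growing transport lag"
print(f"Peak redo needs ~{needed:.2f} Gbps; link sustains {single_process_gbps} Gbps -> {verdict}")
```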

Use the recommendations in the table below to help you decide whether to peer with OCI or AWS before setting up Data Guard and your standby database. You can also run your own network tests after peering and before you configure your standby database. Refer to MAA Evaluations on Multicloud Solutions for the tests that were conducted as part of this certification.

Data Guard Use Case: Between Exadata VM Clusters, across AZs, for:

  • Redo Transport
  • Database migration or standby database instantiation

RTT Latency Observed: Varies between different AZs

  • 1700 microseconds observed with OCI peering
  • 2200 microseconds observed with AWS peering

Network Throughput Observed: Varies based on placement and VM size

  • 33 Gbps for single process throughput with OCI peering
  • 90 Gbps for multiple (64) parallel processes throughput with OCI peering
  • 4.9 Gbps for single process throughput with AWS peering
  • 45 Gbps for multiple (64) parallel processes throughput with AWS peering

MAA Recommendations: Choose the network peering option that can support your peak database workload throughput. OCI peering today provides significantly higher throughput.

For OCI peering, see Local VCN Peering using Local Peering Gateways.

For cross-AZ configurations, see Implement Disaster Recovery with Cross-Zone Data Guard on Oracle Database@AWS.

Data Guard Use Case: Between Exadata VM Clusters, across regions, for:

  • Redo Transport
  • Database migration or standby database instantiation

RTT Latency Observed: Varies between different regions (measured between Ashburn, VA, and Boardman, OR)

  • 93 milliseconds observed with OCI peering
  • 65 milliseconds observed with AWS peering

Network Throughput Observed: Varies based on placement and VM size

  • 1.3 Gbps for single process throughput with OCI peering
  • 35 Gbps for multiple (64) parallel processes throughput with OCI peering
  • 0.7 Gbps for single process throughput with AWS peering
  • 20 Gbps for multiple (64) parallel processes throughput with AWS peering

MAA Recommendations: Choose the network peering option that can support your peak database workload throughput. OCI peering today provides significantly higher throughput, higher network bandwidth, and reduced cost (the first 10 TB per month is free).

For OCI peering, see Remote VCN Peering using Dynamic Routing Gateways.

For cross-region configurations, see Implement disaster recovery with cross-regional Active Data Guard on Oracle Database@AWS.

Path to Gold MAA Architecture

To achieve a Gold Maximum Availability Architecture (MAA), Oracle recommends the following principles for your cross-region deployment:

  1. Deploy Infrastructure: In both your primary and standby regions, deploy an Exadata Infrastructure.

  2. Create VM Clusters: On each infrastructure, create an Exadata VM Cluster within the ODB Network.

  3. Instantiate Databases: Deploy your Oracle Real Application Clusters (RAC) database on the Exadata VM Cluster.

  4. Configure Data Replication: Set up Oracle Data Guard to replicate data between the two databases across the regions.

Networking for Data Guard

When Exadata VM Clusters are created in AWS, each resides within its own Oracle Cloud Infrastructure (OCI) Virtual Cloud Network (VCN). To enable Data Guard communication, which is essential for shipping redo logs and performing role transitions, these VCNs must be peered. You have the flexibility to choose either OCI or AWS peering based on your specific needs.

Enabling Data Guard

Once the network is peered using one of the above options, you can enable Data Guard (see Use Oracle Data Guard with Exadata Cloud Infrastructure). All Data Guard and MAA best practices are built in. Data Guard adds comprehensive data protection and provides minimal data loss and downtime for database, cluster, or even regional failures.

Data Guard Role Transitions

Data Guard switchover and failover performance was within expectations compared to a similar setup in OCI. Application downtime when performing a Data Guard switchover or failover can range from 30 seconds to a few minutes.

For guidance on tuning Data Guard role transition timings, or for examples of role transition timings, see Tune and Troubleshoot Oracle Data Guard. Note that Oracle Cloud Console timings do not reflect actual database downtime. The next topic describes how to reduce downtime further with automatic failover.

Enhancing Availability and Reducing RTO for Disasters

To minimize downtime and achieve the low RTO of seconds to minutes, automatic Data Guard failover is required.

The Fast-Start Failover (FSFO) feature automatically initiates a failover and requires you to install a Data Guard observer on a separate VM. For optimal results, place the observer VM in a location separate from the primary and standby databases. If no additional location is available, place the observer at the primary database site, preferably in the application network.

Note:

This is a manual configuration step and is not part of the standard cloud automation.

For detailed setup instructions, see Implement disaster recovery with cross-regional Active Data Guard on Oracle Database@AWS.

For additional details, refer to the Fast-Start Failover and Configure and Deploy Oracle Data Guard documentation.