29 Oracle Maximum Availability Architecture for Oracle Database@Azure

Oracle Maximum Availability Architecture (MAA) in Oracle Exadata Database Service on Dedicated Infrastructure (ExaDB-D) running within Microsoft Azure's data centers ensures inherent high availability, including zero database downtime for software updates and elastic operations.

When augmented with an Oracle Cloud standby database with Oracle Active Data Guard, this cloud MAA architecture achieves comprehensive data protection and disaster recovery. This integrated combination of optimized Exadata hardware, Exadata Cloud software automation, and Oracle MAA best practices enables Oracle Exadata Cloud systems to be the best cloud solution for mission-critical enterprise databases and applications.

At the time of this writing, the Oracle MAA solution team has validated and certified the MAA Silver and Gold service-level reference architectures with Oracle Database@Azure within the same Azure region, when configured with primary and standby databases residing on Oracle Database@Azure in different Availability Zones (AZs).

See Oracle Cloud: Maximum Availability Architecture for a detailed walk-through of Oracle Cloud MAA architectures and features.

Oracle Database@Azure Evaluations by Oracle MAA

Oracle MAA has evaluated and endorsed Oracle Database@Azure for the MAA Silver architecture on ExaDB-D, and for MAA Gold when the standby database resides on another ExaDB-D system in a separate Availability Zone (AZ).

To ensure success and consistency for Oracle customers, the Oracle MAA team conducts ongoing evaluations of MAA reference architectures on Oracle Database@Azure. MAA solutions based on these evaluations protect your database from outages such as instance, node, storage, and network failures, various data corruptions, and Azure AZ failures, while enabling zero database downtime during software updates, elastic configuration changes, and storage or compute additions.

What Does Oracle MAA Evaluate

An MAA evaluation of Oracle Database@Azure consists of:

  • Cloud setup of MAA Silver and MAA Gold architectures in Oracle Database@Azure AZs
  • Application throughput and response time impact analysis while injecting 100+ outages (Oracle MAA chaos evaluation)
  • Backup and restore performance, throughput, and key use cases
  • Oracle Data Guard role transition performance and timings for disaster recovery use cases
  • Application impact on elastic ExaDB-D cluster operations
  • Application impact on software updates to the ExaDB-D targets
  • Data center failure analysis

MAA Silver

MAA Silver on Oracle Database@Azure consists of the following architecture:

  • The ExaDB-D cluster residing in Azure hosts one or more databases
  • High Availability (HA) and redundant application tier spread across multiple AZs
  • Key Management Service and Object Storage Service (for backup and restore) are located on Oracle Cloud Infrastructure (OCI)
  • Pre-configured redundant and HA network topology

MAA Gold

MAA Gold on Oracle Database@Azure consists of the following architecture:

  • ExaDB-D clusters (primary and standby databases) residing in separate Azure Availability Zones (AZ). Note that all primary and standby databases and their data reside in Oracle Database@Azure. If primary and standby databases reside in the same AZ, this MAA Gold architecture still provides inherent HA benefits plus DR failover options for database and cluster failures, but lacks DR protection for a complete AZ failure.
  • HA and redundant application tier spread across multiple AZs
  • Key Management Service and Object Storage Service (for backup and restore) are located on Oracle Cloud Infrastructure (OCI)
  • Pre-configured redundant and HA network topology

Oracle Maximum Availability Architecture Benefits

The following are some of the benefits of implementing Oracle MAA reference architectures for Oracle Database@Azure.

For a comprehensive list of Oracle Maximum Availability Architecture benefits for Oracle Exadata Database Machine systems, see Exadata Database Machine: Maximum Availability Architecture.

Deployment

Oracle Database@Azure running Oracle Exadata Database Service on Dedicated Infrastructure is deployed using Oracle Maximum Availability Architecture best practices, including configuration best practices for storage, network, operating system, Oracle Grid Infrastructure, and Oracle Database. ExaDB-D is optimized to run enterprise Oracle databases with extreme scalability, availability, and elasticity.

Oracle MAA Database Templates

All Oracle Cloud databases created with Oracle Cloud automation use Oracle Maximum Availability Architecture default settings, which are optimized for Oracle Database@Azure. Oracle does not recommend that you use custom scripts to create cloud databases.

Other than adjusting memory and system resource settings, avoid migrating previous database parameter settings, especially undocumented parameters. One beneficial primary database data protection parameter, DB_BLOCK_CHECKING, is not enabled by default because of its potential performance overhead. Any Oracle standby database configured with cloud automation enables DB_BLOCK_CHECKING on the standby automatically to maximize data protection and corruption detection there. MAA recommends evaluating the performance impact on your application and, if that impact is reasonable, enabling this setting on the primary database as well to maximize logical data corruption prevention and detection. In Oracle Database 19c and later, the Data Guard broker maintains the data protection settings according to MAA best practices.
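
For example, you can review and, after performance evaluation, enable DB_BLOCK_CHECKING on the primary database from SQL*Plus. This is a minimal sketch; the MEDIUM level is shown only as an illustration, and the appropriate value (MEDIUM or FULL) depends on your own performance testing:

    -- Review the current corruption-detection setting
    SHOW PARAMETER DB_BLOCK_CHECKING

    -- Enable block checking once the performance impact is acceptable
    ALTER SYSTEM SET DB_BLOCK_CHECKING = MEDIUM SCOPE=BOTH SID='*';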

Backup and Restore Automation

When you configure automatic backup to Oracle Cloud Infrastructure Object Storage, backup copies provide additional protection when multiple availability zones exist in your region. Oracle Recovery Manager (RMAN) validates cloud database backups for any physical corruptions.

Database backups occur daily, with a full backup once per week and incremental backups on all other days. Archived log backups occur frequently to reduce potential data loss in case a full database restore and recovery is required. The archived log backup frequency is 30 minutes by default; with Data Guard, however, the potential data loss is zero or near zero.
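
You can also validate the database and its backups yourself with RMAN. A minimal sketch, run from a database host as a user with SYSDBA or SYSBACKUP privileges; it checks for physical and logical corruption without taking a new backup, and CROSSCHECK flags any catalogued backup pieces that are no longer present:

    $ rman target /
    RMAN> BACKUP VALIDATE CHECK LOGICAL DATABASE ARCHIVELOG ALL;
    RMAN> CROSSCHECK BACKUP;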

Oracle Exadata Database Machine Inherent Benefits

Oracle Exadata Database Machine is the best Oracle Maximum Availability Architecture database platform that Oracle offers. Exadata is engineered with hardware, software, database, availability, extreme-performance, and scalability innovations for all workloads, supporting the most mission-critical enterprise applications.

Specifically, Exadata provides unique high availability, data protection, and quality-of-service capabilities that set Oracle apart from any other platform or cloud vendor. Sizing Exadata cloud systems to meet your application and database system resource needs (for example, sufficient CPU, memory, and I/O resources) is very important to maintain the highest availability, stability, and performance. Proper sizing and resource management are especially important when consolidating many databases on the same cluster. Database consolidation is a very common benefit when leveraging Exadata.

Examples of these benefits include:

  • High availability and low brownout: Fully redundant, fault-tolerant hardware exists in the storage, network, and database servers. Resilient, highly available software, such as Oracle Real Application Clusters (Oracle RAC), Oracle Clusterware, Oracle Database, Oracle Automatic Storage Management (ASM), Oracle Linux, and Oracle Exadata Storage Server, enables applications to maintain application service levels through unplanned outages and planned maintenance events.

    For example, Exadata has instant failure detection that can detect and repair database node, storage server, and network failures in less than two seconds, resuming application and database service uptime and performance. Other platforms can experience 30 seconds, or even minutes, of blackout and extended application brownouts for the same types of failures. Only the Exadata platform offers a wide range of unplanned outage and planned maintenance tests to evaluate end-to-end application and database brownouts and blackouts.

  • Data protection: Exadata provides Oracle Database with physical and logical block corruption prevention, detection, and, in some cases, automatic remediation.

    The Exadata Hardware Assisted Resilient Data (HARD) checks include support for server parameter files, control files, log files, Oracle data files, and Oracle Data Guard broker files when those files are stored in Exadata storage. This intelligent Exadata storage validation stops corrupted data from being written to disk when a HARD check fails, which eliminates a large class of failures that the database industry had previously been unable to prevent.

    Examples of the Exadata HARD checks include:

    • Redo and block checksum
    • Correct log sequence
    • Block type validation
    • Block number validation
    • Oracle data structures, such as block magic number, block size, sequence number, and block header and tail data structures

    Exadata HARD checks are initiated from Exadata storage software (cell services) and work transparently after enabling a database DB_BLOCK_CHECKSUM parameter, which is enabled by default in the cloud. Exadata is the only platform that currently supports the HARD initiative.

    Furthermore, Oracle Exadata Storage Server provides non-intrusive, automatic hard disk scrub and repair, periodically inspecting and repairing hard disks during idle time. If bad sectors are detected on a hard disk, Oracle Exadata Storage Server automatically requests Oracle Automatic Storage Management (ASM) to repair the bad sectors by reading the data from another mirror copy.

    Finally, Exadata and Oracle ASM can detect corruptions as data blocks are read into the buffer cache and automatically repair data corruption with a good copy of the data block on a subsequent database write. This inherent intelligent data protection makes Exadata Database Machine and ExaDB-D the best data protection storage platform for Oracle databases.

    For comprehensive data protection, a Maximum Availability Architecture best practice is to use a standby database on a separate Exadata instance to detect, prevent, and automatically repair corruptions that cannot be addressed by Exadata alone. The standby database also minimizes downtime and data loss for disasters that result from site, cluster, and database failures.

  • Response time quality of service: Only Exadata has end-to-end quality-of-service capabilities to ensure that response time remains low and optimum. Database server I/O latency capping and Exadata storage I/O latency capping ensure that read or write I/O can be redirected to partnered cells when response time exceeds a certain threshold. More importantly, memory and flash are intelligently pre-warmed for various maintenance events and sick component outages (“gray area outages”) to preserve application response time and performance. This end-to-end holistic performance view is a big benefit for Oracle enterprise customers who require consistent application response time and high throughput.

    If storage becomes unreliable (but has not failed) because of poor and unpredictable performance, the disk or flash cache can be confined offline and later returned online if heuristics show that I/O performance is back to acceptable levels. Resource management can help prioritize critical database network or I/O functionality so that your application and database perform at an optimized level.

    For example, database log writes get priority over backup requests on the Exadata network and storage. Furthermore, rapid response time is maintained during storage software updates by ensuring that the partner flash cache is warmed so flash misses are minimized.

  • End-to-end testing and holistic health checks: Because Oracle owns the entire Oracle Exadata Cloud Infrastructure, end-to-end testing and optimizations benefit every Exadata customer around the world, whether hosted on-premises or in the cloud. Validated optimizations and fixes required to run any mission-critical system are uniformly applied after rigorous testing. Health checks are designed to evaluate the entire stack.

    The Exadata health check utility EXACHK is Exadata cloud-aware and highlights any configuration and software alerts that may have occurred because of customer changes. No other cloud platform currently has this kind of end-to-end health check available. Oracle recommends running EXACHK at least once a month, and before and after any software updates, to evaluate any new best practices and alerts (an example invocation follows this list).

  • Higher uptime: The uptime service-level agreement per month is 99.95% (a maximum of approximately 22 minutes of downtime per month), but when you use MAA best practices for continuous service, most months should see zero downtime. With MAA Gold, you can fail over to your standby database for various disaster events such as database, cluster, or data center (AZ) failures, depending on your standby database placement. Note that automatic failover to your target standby with Data Guard Fast-Start Failover requires manual setup (see Configure Fast Start Failover); a DGMGRL sketch follows this list.
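
A typical on-demand EXACHK run is executed as the root user from the directory where the utility is staged; the path shown is only an assumption and varies by deployment:

    # cd /opt/oracle.ahf/exachk    (assumed staging location)
    # ./exachk

Review the generated HTML report and address any FAIL or WARNING findings before and after software updates.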
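
Fast-Start Failover is enabled through the Data Guard broker. A minimal DGMGRL sketch, assuming a broker configuration already exists and using illustrative connect identifiers and threshold values:

    DGMGRL> CONNECT sys@primary_tns
    DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
    DGMGRL> ENABLE FAST_START FAILOVER;
    DGMGRL> SHOW FAST_START FAILOVER;

A Data Guard observer must also be running on a host separate from the primary and standby (for example, started with the DGMGRL START OBSERVER command) before automatic failover can occur.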

Expected Impact During Unplanned Outages

The following table lists various unplanned outage events and the associated potential database downtime, application Recovery Time Objective (RTO), and data loss potential or recovery point objective (RPO).

For Oracle Data Guard architectures (MAA Gold), the database downtime or service level downtime does not include detection time or the time it takes before a customer initiates the Cloud Console Data Guard failover operation.

Outage Event: Localized events, including Exadata cluster network topology failures; storage (disk, flash, and storage cell) failures; database instance failures; and database server failures
  • Database downtime: Zero
  • Service-level downtime (RTO): Near-zero
  • Potential service-level data loss (RPO): Zero

Outage Event: Events that require restoration from backup because a standby database does not exist, including data corruptions, full database failures, complete storage failures, and Availability Zone failures
  • Database downtime: Minutes to hours (without Data Guard)
  • Service-level downtime (RTO): Minutes to hours (without Data Guard)
  • Potential service-level data loss (RPO): Up to 30 minutes (without Data Guard)

Outage Event: Events using Data Guard to fail over, including data corruptions, full database failures, complete storage failures, and Availability Zone failures
  • Database downtime: Seconds to minutes¹; zero downtime for physical corruptions, due to the auto-block repair feature
  • Service-level downtime (RTO): Seconds to minutes¹; the foreground process that detects a physical corruption pauses only while auto-block repair completes
  • Potential service-level data loss (RPO): Zero for Maximum Availability (SYNC redo transport); near zero for Maximum Performance (ASYNC redo transport)

¹For MAA Gold, to protect your database from regional failure, instantiate the standby database in a region different from the primary database. For this MAA evaluation, the standby database was in a different AZ. Also, Data Guard Fast-Start Failover and its Data Guard observers must be set up manually to enable automatic database failover. Application workloads as high as 300 MB/second per Oracle RAC instance were validated, and the standby database was up to date with near-zero lag. Depending on the workload, standby database tuning may be required for extreme workloads (see Tune and Troubleshoot Oracle Data Guard).

Expected Impact During Planned Maintenance

The following tables describe the impact of various planned maintenance events for Oracle Exadata Database Service on Dedicated Infrastructure on Oracle Database@Azure.

Impact of Exadata Cloud Software Updates

The following table lists various software updates and their impact on the associated database and application. This is applicable for Oracle Exadata Database Service on Dedicated Infrastructure on Oracle Database@Azure.

Software Update: Exadata Network Fabric Switches
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero to single-digit seconds of brownout
  • Scheduled by: Oracle, based on customer preferences; customers can reschedule
  • Performed by: Oracle Cloud Operations

Software Update: Exadata Storage Servers
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero to single-digit seconds of brownout. Exadata storage servers are updated in a rolling manner, maintaining redundancy. Oracle Exadata System Software pre-fetches the secondary mirrors of the most frequently accessed OLTP data into the flash cache, maintaining application performance during storage server restarts. Exadata smart flash for database buffers is maintained across a storage server restart. With Exadata 21.2 software, the Persistent Storage Index and Persistent Columnar Cache features enable consistent query performance after a storage server software update.
  • Scheduled by: Oracle, based on customer preferences; customers can reschedule
  • Performed by: Oracle Cloud Operations

Software Update: Exadata Database Host (monthly infrastructure security maintenance)
  • Database impact: Zero downtime with no host or database restart
  • Application impact: Zero downtime
  • Scheduled by: Oracle; customers can reschedule
  • Performed by: Oracle Cloud Operations

Software Update: Exadata Database Host (quarterly infrastructure maintenance)
  • Database impact: Zero downtime with Oracle RAC rolling updates
  • Application impact: Zero downtime. Exadata Database compute resources are reduced until the planned maintenance is completed.
  • Scheduled by: Oracle, based on customer preferences; customers can reschedule
  • Performed by: Oracle Cloud Operations

Software Update: Exadata Database Guest
  • Database impact: Zero downtime with Oracle RAC rolling updates
  • Application impact: Zero downtime. Exadata Database compute resources are reduced until the planned maintenance is completed.
  • Scheduled by: Customer
  • Performed by: Customers, using the Oracle Cloud Console or APIs

Software Update: Oracle Database quarterly update or custom image update
  • Database impact: Zero downtime with Oracle RAC rolling updates
  • Application impact: Zero downtime. Exadata Database compute resources are reduced until the planned maintenance is completed. Special consideration is required during rolling database quarterly updates for applications that use database OJVM (see My Oracle Support Doc ID 2217053.1 for details).
  • Scheduled by: Customer
  • Performed by: Customers, using the Oracle Cloud Console, APIs, or the dbaascli utility. Updates are applied in place with a database home patch, or out of place with a database move (recommended). Works for Data Guard and standby databases (see My Oracle Support Doc ID 2701789.1).

Software Update: Oracle Grid Infrastructure quarterly update or upgrade
  • Database impact: Zero downtime with Oracle RAC rolling updates
  • Application impact: Zero downtime. Exadata Database compute resources are reduced until the planned maintenance is completed.
  • Scheduled by: Customer
  • Performed by: Customers, using the Oracle Cloud Console, APIs, or the dbaascli utility

Software Update: Oracle Database upgrade with downtime
  • Database impact: Minutes to hours of downtime
  • Application impact: Minutes to hours of downtime
  • Scheduled by: Customer
  • Performed by: Customers, using the Oracle Cloud Console, APIs, or the dbaascli utility. Works for Data Guard and standby databases (see My Oracle Support Doc ID 2628228.1).

Software Update: Oracle Database upgrade with near-zero downtime
  • Database impact: Minimal downtime with DBMS_ROLLING, Oracle GoldenGate replication, or pluggable database relocate
  • Application impact: Minimal downtime with DBMS_ROLLING, Oracle GoldenGate replication, or pluggable database relocate
  • Scheduled by: Customer
  • Performed by: Customers, using dbaascli leveraging DBMS_ROLLING (see My Oracle Support Doc ID 2832235.1), or using generic Maximum Availability Architecture best practices

Impact of Exadata Elastic Operations

Exadata cloud systems have many elastic capabilities that can be used to adjust database and application performance needs. By reallocating resources as needed, you can direct system resources to targeted databases and applications and minimize costs.

The following table lists elastic Oracle Exadata Cloud Infrastructure and VM Cluster updates and the impacts associated with those updates on databases and applications. All of these operations can be performed using Oracle Cloud Console or APIs unless specified otherwise.

VM Cluster Change: Scale Up or Down VM Cluster Memory
  • Database impact: Zero downtime with Oracle RAC rolling updates
  • Application impact: Zero to single-digit seconds of brownout

VM Cluster Change: Scale Up or Down VM Cluster CPU
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero downtime. Available CPU resources can impact application performance and throughput.

VM Cluster Change: Scale Up or Down (resize) ASM Storage for Database Usage
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero downtime. Application performance might be minimally impacted.

VM Cluster Change: Scale Up VM Local /u02 File System Size (Exadata X9M and later systems)
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero downtime

VM Cluster Change: Scale Down VM Local /u02 File System Size
  • Database impact: Zero downtime with Oracle RAC rolling updates for scaling down
  • Application impact: Zero to single-digit seconds of brownout

VM Cluster Change: Adding Exadata Storage Cells
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero to single-digit seconds of brownout. Application performance might be minimally impacted.

VM Cluster Change: Adding Exadata Database Servers
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero to single-digit seconds of brownout. Adding Oracle RAC instances and CPU resources may improve application performance and throughput.

VM Cluster Change: Adding or Dropping Database Nodes in a Virtual Machine (VM) Cluster
  • Database impact: Zero downtime with no database restart
  • Application impact: Zero to single-digit seconds of brownout. Application performance and throughput may increase or decrease as Oracle RAC instances and CPU resources are added or dropped.
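
Any of the VM cluster changes above can also be scripted with the OCI CLI instead of the Oracle Cloud Console. A hedged sketch that scales a VM cluster to 16 OCPUs; the resource type and parameters should be verified against the OCI CLI reference for your deployment, and the OCID is a placeholder:

    $ oci db cloud-vm-cluster update \
        --cloud-vm-cluster-id ocid1.cloudvmcluster.oc1..example \
        --cpu-core-count 16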

Planning for the Impact of Exadata Elastic Operations

Because some of these elastic changes may take significant time and impact the resources available to your application, some planning is required.

Note that “scale down” and “drop” changes will decrease available resources. Care must be taken to not reduce resources below the amount required for database and application stability and to meet application performance targets. The following table provides you with the estimated time duration and planning recommendations for these changes.

VM Cluster Change: Scale Up or Down VM Cluster Memory
  • Estimated timing: Time to drain services plus an Oracle RAC rolling restart; typically 15-30 minutes per node, but this may vary depending on application draining settings.
  • Planning recommendations: Understand application draining (see Configuring Continuous Availability for Applications). Before scaling down memory, ensure that database SGAs can still be stored in hugepages and that application performance remains acceptable. To preserve predictable application performance and stability:
    • Monitor and scale up before important high-workload patterns require the memory resources
    • Avoid scaling memory down unless all of the database SGA and PGA memory fits into the new memory size and the system's hugepages accommodate all SGAs

VM Cluster Change: Scale Up or Down VM Cluster CPU
  • Estimated timing: Online operation, typically less than 5 minutes for each VM cluster. Scaling up from a very low value to a very high value (an increase of 10 or more OCPUs) may take 10 minutes.
  • Planning recommendations: To preserve predictable application performance and stability:
    • Monitor and scale up before important high-workload patterns require the CPU resources, or when an OCPU threshold is consistently reached for a tolerated amount of time
    • Only scale down if the load average stays below the threshold for at least 30 minutes, or scale down based on fixed workload schedules (such as 60 OCPUs during business hours, 10 OCPUs during non-business hours, and 100 OCPUs for batch windows)
    • Avoid more than one scale-down request within a 2-hour period

VM Cluster Change: Scale Up or Down (resize) ASM Storage for Database Usage
  • Estimated timing: Typically minutes to hours. The time varies based on utilized database storage capacity and database activity; the higher the percentage of utilized database storage, the longer the resize operation (which includes an ASM rebalance) takes. Oracle ASM rebalance is initiated automatically and storage redundancy is retained; because of the inherent best practice of using a non-intrusive ASM power limit, the application workload impact is minimal.
  • Planning recommendations: Choose a non-peak window so that resize and rebalance operations can be optimized. Because the time may vary significantly, plan for the operation to complete in hours. To estimate the time that an existing resize or rebalance operation on a VM cluster requires, query GV$ASM_OPERATION. For example, you can run the following query every 30 minutes to evaluate how much work (EST_WORK) and how much more time (EST_MINUTES) is potentially required:

    select operation, pass, state, sofar, est_work, est_minutes from gv$asm_operation where operation='REBAL';

    Note that the estimated statistics tend to become more accurate as the rebalance progresses, but they can vary based on the concurrent workload.

VM Cluster Change: Scale Up VM Local /u02 File System Size (Exadata X9M and later systems)
  • Estimated timing: Online operation, typically less than 5 minutes for each VM cluster.
  • Planning recommendations: VM local file system space is allocated on local database host disks, which are shared by all VM guests for all VM clusters provisioned on that database host. Do not scale up /u02 space unnecessarily on one VM cluster, leaving no space to scale up the other VM clusters on the same Exadata Infrastructure, because scaling /u02 back down must be performed in an Oracle RAC rolling manner, which may cause application disruption.

VM Cluster Change: Scale Down VM Local /u02 File System Size
  • Estimated timing: Time to drain services plus an Oracle RAC rolling restart; typically 15-30 minutes per node, but this may vary depending on application draining settings.
  • Planning recommendations: To plan, learn about application draining at Configuring Continuous Availability for Applications.

VM Cluster Change: Adding Exadata Storage Cells
  • Estimated timing: Online operation that creates more available space for administrators to distribute. Typically 3-72 hours per operation, depending on the number of VM clusters, database storage usage, and storage activity; with a very active database and heavy storage activity, this can take up to 72 hours. The operation has two parts: first, storage is added to the Exadata system; second, as a separate operation, the administrator decides which VM clusters' ASM disk groups to expand.
  • Planning recommendations: Plan to add storage when your storage capacity utilization is expected to reach 80% within a month, because this operation may take days to complete. Oracle ASM rebalance is initiated automatically and storage redundancy is retained; because of the inherent best practice of using a non-intrusive ASM power limit, the application workload impact is minimal. Because the duration may vary significantly, plan for the operation to complete in days before the storage becomes available. To estimate the time that an existing resize or rebalance operation requires on each VM cluster, query GV$ASM_OPERATION as shown in the example above.

VM Cluster Change: Adding Exadata Database Servers
  • Estimated timing: Online operation to expand your VM cluster. A single process adds the Database Compute to the Exadata infrastructure and then expands the VM cluster; it takes approximately 1 to 6 hours for each Exadata database server.
  • Planning recommendations: Plan to add Database Compute when your database resource utilization is expected to reach 80% within a month, and plan for the operation to take many hours to a day. Choose a non-peak window so that the add Database Compute operation completes faster. Each Oracle RAC database that is registered with Oracle Clusterware and visible in the Oracle Cloud Console is extended; databases configured outside the Oracle Cloud Console, or without dbaascli, are not extended.

VM Cluster Change: Adding or Dropping Database Nodes in a Virtual Machine (VM) Cluster
  • Estimated timing: Zero database downtime when adding database nodes to the VM cluster, typically taking 3-6 hours depending on the number of databases in the VM cluster. Zero database downtime when dropping database nodes from the VM cluster, typically taking 1-2 hours depending on the number of databases in the VM cluster.
  • Planning recommendations: Understand that the add/drop operation is not instantaneous and may take several hours to complete. The drop operation reduces database compute, OCPU, and memory resources, so application performance can be impacted.

MAA Gold Network Topology and Evaluation

The recommended MAA Gold architecture on Oracle Database@Azure consists of:

  • When using Data Guard, Oracle Exadata infrastructures (ExaDB-D) are provisioned in two different Availability Zones (AZ) using separate VNets that do not have overlapping IP CIDR ranges.
  • Backup network subnets assigned to the primary and standby clusters do not have overlapping IP CIDR ranges.
  • The application tier spans at least two AZs, and the VNet is peered with each VNet of primary and standby VM Clusters.
  • Database backups and restore operations use a high bandwidth network for OCI Object Storage.

Figure 29-1 DR Capability for a Solution in the Same Region



Application Network Layer on Azure

The proximity of the application tier to the database cluster affects application response time.

If you require a very low latency response time (for example, 200-400 microseconds), deploy the application VMs in the same AZ as the database cluster. Latency increases to possibly 1 millisecond or more when application and database servers are configured across VNets or AZs.

Deploy the application tier over at least two AZs for high availability. The deployment process and solution over multiple AZs vary depending on the application's components, Azure services, and resources involved. For example, with Azure Kubernetes Service (AKS), you can deploy the worker nodes in different AZs; the Kubernetes control plane maintains and synchronizes the pods and the workload.
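
For example, with the Azure CLI, an AKS cluster can be created with worker nodes spread across three AZs. A minimal sketch with placeholder resource names:

    $ az aks create \
        --resource-group myResourceGroup \
        --name myAKSCluster \
        --node-count 3 \
        --zones 1 2 3 \
        --generate-ssh-keys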

Database Network Layer

Oracle Data Guard maintains a standby database by transmitting and applying redo data from the primary database. Use Data Guard switchover for planned maintenance or disaster recovery tests. If the primary database becomes unavailable, use Data Guard failover to resume service.

Peering Networks Between Primary and Standby

The primary and standby Exadata Clusters are deployed in separate networks. Oracle Database@Azure Exadata Clusters are always deployed using separate Virtual Cloud Networks (VCNs) in OCI. These separate VCNs must be connected to allow traffic to pass between them, that is, they must be "peered", before enabling Data Guard with Oracle cloud automation. For this reason, the networks must use separate, non-overlapping IP CIDR ranges.

Peering can be done using the OCI network or the Azure network. The recommended option is to peer the OCI VCNs and use the OCI network for redo traffic. OCI VCN peering provides higher single-process network throughput (observed up to 14 Gbit/s) and lower latency between database clusters, and there is no chargeback for this traffic. Peering through the Azure network provides an observed 3 Gbit/s single-process throughput (relevant for database instances with high redo generation rates over 300 MB/s), has approximately 20% higher latency, and incurs a chargeback for cross-VNet traffic.

Recommended OCI VCN Peering for Data Guard

When Exadata Clusters are created in Azure, each cluster resides in a different Virtual Cloud Network (VCN) in OCI. Data Guard redo transport requires connectivity between these VCNs, so before enabling Data Guard in Oracle Database@Azure the VCNs must be peered and their IP address ranges must be allowed access to each other.

Follow these high-level steps to peer the VCNs (an OCI CLI sketch follows the list). More details are available at Configure VCN peering (oracle.com).

  1. Provision a Local Peering Gateway in each VCN.
  2. Establish a peer connection between Local Peering Gateways.
  3. Update the default route table to route traffic between VCNs.
  4. Update VCN Network Security Groups (NSG) to allow connections.
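
The same steps can be scripted with the OCI CLI. A hedged sketch with placeholder OCIDs and display names; the route table and network security group updates (steps 3 and 4) must reference your actual CIDR ranges and are summarized here as comments:

    # Step 1: create a Local Peering Gateway (LPG) in each VCN
    $ oci network local-peering-gateway create --compartment-id <compartment_ocid> \
        --vcn-id <primary_vcn_ocid> --display-name lpg-primary
    $ oci network local-peering-gateway create --compartment-id <compartment_ocid> \
        --vcn-id <standby_vcn_ocid> --display-name lpg-standby

    # Step 2: establish the peer connection between the two LPGs
    $ oci network local-peering-gateway connect \
        --local-peering-gateway-id <primary_lpg_ocid> --peer-id <standby_lpg_ocid>

    # Steps 3 and 4: add a route rule targeting each LPG for the peer CIDR
    # (oci network route-table update) and allow the peer CIDR in each
    # VCN's network security groups (oci network nsg rules add)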

Alternative Option of Azure VNet Peering for Data Guard

To peer the Azure VNets for Data Guard redo traffic, see https://learn.microsoft.com/en-us/azure/virtual-network/virtual-network-peering-overview.

Be aware that when networks are peered through Azure, latency increases by about 20%, and single-process network throughput is limited to approximately 3 Gbit/s (~375 MB/sec). This is relevant because Data Guard redo transport is a single process for each database instance; therefore, if a single instance produces redo at a higher rate, a transport lag may form. There is an additional cost for ingress and egress network traffic in each VNet when networks are peered through Azure.

Enable Data Guard

After the network is peered by one of the above options, you can Enable Data Guard (see Use Oracle Data Guard with Exadata Cloud Infrastructure).

Network Throughput and Latency Evaluation

When comparing throughput and latency between networks, the following methods are recommended.

Data Guard Throughput

It is recommended that iperf be used to measure throughput between endpoints.

Examples:

Server side (as root):

# iperf -s

Client side (as root):

Single process: iperf -c <ip address of VIP>

  • This determines the maximum redo throughput from one Oracle RAC instance to a standby Oracle RAC instance.
  • Single-process network throughput was estimated at 14 Gbit/s with OCI VCN peering.
  • Single-process network throughput was estimated at 3 Gbit/s with Azure VNet peering.

Parallel process: iperf -c <ip address of VIP> -P 32

  • This determines the maximum network bandwidth available for Data Guard instantiation and large redo gap resolution.

Backups

For backups, RMAN nettest was used, and the results met expectations. See My Oracle Support Doc ID 2371860.1 for details about nettest.

Oracle database backup and restore throughput to Oracle's Object Storage Service was within performance expectations. For example, an ExaDB-D two-node cluster (using 16+ OCPUs) with three storage cells may observe a 4 TB/hour backup rate and approximately an 8 TB/hour restore rate with no other workloads running on the cluster. By increasing the number of RMAN channels, you can leverage the available network and storage bandwidth and achieve as much as a 42 TB/hour backup rate and an 8.7 TB/hour restore rate. Performance varies based on existing workloads and network traffic on the shared infrastructure.
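
The RMAN channel count is the main lever for driving higher backup and restore throughput. A minimal sketch; 16 channels is an illustrative value to be tuned against the available network and storage bandwidth, and cloud backup automation may already manage these settings for automatic backups:

    RMAN> CONFIGURE DEVICE TYPE SBT_TAPE PARALLELISM 16;
    RMAN> BACKUP DEVICE TYPE SBT_TAPE DATABASE;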

Latency

The best tool for testing TCP latency between VM endpoints is sockperf. (Latency is not tested for backups.) sockperf is not installed by default and must be installed from an RPM package or a YUM repository.
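
For example, on Oracle Linux, sockperf can typically be installed from the EPEL repository (repository availability is an assumption that varies by image):

    # yum install -y sockperf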

server: sockperf sr -i <IP of VIP> --tcp

client: sockperf pp -i <IP of VIP> --tcp --full-rtt

Sample output (client) between clusters in different AZs:

# sockperf pp -i <IP> --tcp --full-rtt
  sockperf: Summary: Round trip is 1067.225 usec
  sockperf: Total 516 observations; each percentile contains 5.16 observations
  sockperf: ---> <MAX> observation = 1194.612
  sockperf: ---> percentile 99.999 = 1194.612
  sockperf: ---> percentile 99.990 = 1194.612
  sockperf: ---> percentile 99.900 = 1137.864
  sockperf: ---> percentile 99.000 = 1112.276
  sockperf: ---> percentile 90.000 = 1082.640
  sockperf: ---> percentile 75.000 = 1070.377
  sockperf: ---> percentile 50.000 = 1064.075
  sockperf: ---> percentile 25.000 = 1059.195
  sockperf: ---> <MIN> observation = 1047.373

Note:

Results vary based on region and AZ sampled.

The ping command should not be used in Azure because ICMP packets are set to very low priority and do not accurately represent the latency of TCP packets.

Traceroute

Run traceroute between endpoints to ensure that the proper route is being taken.

Observations

  • One ‘hop’ between ExaDB-D clusters when Data Guard uses OCI VCN peering
  • Six ‘hops’ between ExaDB-D clusters when Data Guard uses Azure VNet peering
  • Four ‘hops’ between application VMs and ExaDB-D clusters in the same AZ

Achieving Continuous Availability For Your Applications

As part of Oracle Exadata Database Service on Dedicated Infrastructure on Oracle Database@Azure, all software updates (except for non-rolling database upgrades or non-rolling patches) can be done online or with Oracle RAC rolling updates to achieve continuous database uptime.

Furthermore, any local failures of storage, Exadata network, or Exadata database server are managed automatically, and database uptime is maintained.

To achieve continuous application uptime during Oracle RAC switchover or failover events, follow these application-configuration best practices:

  • Use Oracle Clusterware-managed database services to connect your application. For Oracle Data Guard environments, use role-based services.
  • Use the recommended connection string with built-in timeouts, retries, and delays so that incoming connections do not see errors during outages (a sketch follows this list).
  • Configure your connections with Fast Application Notification.
  • Drain and relocate services. Use the recommended best practices in the table below that support draining, such as testing connections when borrowing or starting batches of work, and returning connections to pools between uses.
  • Leverage Application Continuity or Transparent Application Continuity to replay in-flight uncommitted transactions transparently after failures.
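
A hedged example of such a connection string, following the generic MAA pattern with placeholder SCAN host names and service name; the timeout and retry values shown are common MAA starting points, not mandates:

    myapp =
      (DESCRIPTION =
        (CONNECT_TIMEOUT=90)(RETRY_COUNT=50)(RETRY_DELAY=3)(TRANSPORT_CONNECT_TIMEOUT=3)
        (ADDRESS_LIST =
          (LOAD_BALANCE=on)
          (ADDRESS = (PROTOCOL=TCP)(HOST=primary-scan.example.com)(PORT=1521)))
        (ADDRESS_LIST =
          (LOAD_BALANCE=on)
          (ADDRESS = (PROTOCOL=TCP)(HOST=standby-scan.example.com)(PORT=1521)))
        (CONNECT_DATA = (SERVICE_NAME = myapp_service.example.com)))

Listing both the primary and standby SCAN addresses lets the same alias connect wherever the role-based service is currently running.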

For more details, see Configuring Continuous Availability for Applications. Oracle recommends testing your application readiness by following Validating Application Failover Readiness (My Oracle Support Doc ID 2758734.1).

Depending on the Oracle Exadata Database Service planned maintenance event, Oracle attempts to automatically drain and relocate database services before stopping any Oracle RAC instance. For OLTP applications, draining and relocating services typically work very well and result in zero application downtime.

Some applications, such as long-running batch jobs or reports, may not be able to drain and relocate gracefully within the maximum draining time. For those applications, Oracle recommends scheduling the software planned maintenance window around these types of activities or stopping these activities before the planned maintenance window. For example, you can reschedule a planned maintenance window to run outside your batch windows or stop batch jobs before a planned maintenance window.

Special consideration is required during rolling database quarterly updates for applications that use database OJVM. See My Oracle Support Doc ID 2217053.1 for details.

The following table lists planned maintenance events that perform Oracle RAC instance rolling restart, as well as the relevant service drain timeout variables that may impact your application.

Exadata Cloud Software Update or Elastic Operation: Oracle DBHOME patch apply and database move
  • Drain timeout variables: Cloud software automation stops or relocates database services while honoring the DRAIN_TIMEOUT settings defined by the database service configuration (such as with srvctl).¹ You can override the DRAIN_TIMEOUT defined on services using the drainTimeoutInSeconds option of the dbaascli dbHome patch or dbaascli database move commands. The Oracle Cloud internal maximum draining time supported is 2 hours.

Exadata Cloud Software Update or Elastic Operation: Oracle Grid Infrastructure (GI) patch apply and upgrade
  • Drain timeout variables: Cloud software automation stops or relocates database services while honoring the DRAIN_TIMEOUT settings defined by the database service configuration (such as with srvctl).¹ You can override the DRAIN_TIMEOUT defined on services using the drainTimeoutInSeconds option of the dbaascli grid patch or dbaascli grid upgrade commands. The Oracle Cloud internal maximum draining time supported is 2 hours.

Exadata Cloud Software Update or Elastic Operation: Virtual machine operating system software update (Exadata Database Guest)
  • Drain timeout variables: The Exadata patchmgr/dbnodeupdate software program calls drain orchestration (rhphelper). Drain orchestration has the following drain timeout settings (see My Oracle Support Doc ID 2385790.1 for details):
    • DRAIN_TIMEOUT – If a service does not have DRAIN_TIMEOUT defined, the default value of 180 seconds is used.
    • MAX_DRAIN_TIMEOUT – Overrides any higher DRAIN_TIMEOUT value defined by the database service configuration; the default value is 300 seconds, and there is no maximum value.
    The DRAIN_TIMEOUT settings defined by the database service configuration are honored during service stop/relocate.

Exadata Cloud Software Update or Elastic Operation: Scale down VM local file system size (Exadata X9M and later systems)
  • Drain timeout variables: Exadata X9M and later systems call drain orchestration (rhphelper). Drain orchestration has the following drain timeout settings (see My Oracle Support Doc ID 2385790.1 for details):
    • DRAIN_TIMEOUT – If a service does not have DRAIN_TIMEOUT defined, the default value of 180 seconds is used.
    • MAX_DRAIN_TIMEOUT – Overrides any higher DRAIN_TIMEOUT value defined by the database service configuration; the default value is 300 seconds.
    The DRAIN_TIMEOUT settings defined by the database service configuration are honored during service stop/relocate. The Oracle Cloud internal maximum draining time supported for this operation is 300 seconds.

Exadata Cloud Software Update or Elastic Operation: Scale up or down VM cluster memory (Exadata X9M and later systems)
  • Drain timeout variables: Exadata X9M and later systems call drain orchestration (rhphelper). Drain orchestration has the following drain timeout settings (see My Oracle Support Doc ID 2385790.1 for details):
    • DRAIN_TIMEOUT – If a service does not have DRAIN_TIMEOUT defined, the default value of 180 seconds is used.
    • MAX_DRAIN_TIMEOUT – Overrides any higher DRAIN_TIMEOUT value defined by the database service configuration; the default value is 300 seconds.
    The DRAIN_TIMEOUT settings defined by the database service configuration are honored during service stop/relocate. The Oracle Cloud internal maximum draining time supported for this operation is 900 seconds.

Exadata Cloud Software Update or Elastic Operation: Oracle Exadata Cloud Infrastructure (ExaDB) software update
  • Drain timeout variables: The ExaDB-D database host calls drain orchestration (rhphelper). Drain orchestration has the following drain timeout settings (see My Oracle Support Doc ID 2385790.1 for details):
    • DRAIN_TIMEOUT – If a service does not have DRAIN_TIMEOUT defined, the default value of 180 seconds is used.
    • MAX_DRAIN_TIMEOUT – Overrides any higher DRAIN_TIMEOUT value defined by the database service configuration; the default value is 300 seconds.
    The DRAIN_TIMEOUT settings defined by the database service configuration are honored during service stop/relocate. The Oracle Cloud internal maximum draining time supported for this operation is 500 seconds.
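
Service-level DRAIN_TIMEOUT values themselves are defined with srvctl. A minimal sketch with placeholder database and service names:

    # Set a 5-minute drain timeout and an immediate stop option for a service
    $ srvctl modify service -db mydb -service myapp_service -drain_timeout 300 -stopoption IMMEDIATE

    # Verify the service configuration
    $ srvctl config service -db mydb -service myapp_service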

Enhanced Infrastructure Maintenance Controls:

To achieve draining times longer than the Oracle Cloud internal maximum, leverage the custom action capability of Enhanced Infrastructure Maintenance Controls. Custom actions let you suspend infrastructure maintenance before the next database server update starts, directly stop or relocate the database services running on that database server, and then resume infrastructure maintenance to proceed to the next database server. See Configure Oracle-Managed Infrastructure Maintenance in the Oracle Cloud Infrastructure documentation for details.

¹The minimum software requirements to achieve this service drain capability are Oracle Database release 12.2 or later and the latest cloud DBaaS tooling software.

Oracle MAA Reference Architectures in Oracle Exadata Cloud

Oracle Exadata Database Service on Dedicated Infrastructure on Oracle Database@Azure supports all Oracle MAA reference architectures, providing support for all Oracle databases, regardless of their specific high availability, data protection, and disaster recovery service-level agreements.

See MAA Best Practices for the Oracle Cloud for more information about Oracle MAA in the Oracle Exadata Cloud.