Shared Responsibility Model for Resiliency
Resiliency in the cloud is a shared responsibility between you (the user) and Oracle. For you to build resilient workload architectures in Oracle Cloud Infrastructure (OCI), you must understand your high availability and disaster recovery requirements and responsibilities.
Oracle's Responsibility: “Resiliency of the Cloud”
OCI is responsible for the "resiliency of the cloud." OCI provides a robust, highly available and resilient global cloud infrastructure consisting of data centers, network, physical hardware, and software designed to minimize downtime and ensure that applications remain accessible and functional even in the event of failures. OCI offers end-to-end service level agreements (SLAs) covering performance, availability, and manageability of these services.
OCI is physically hosted in multiple regions. The regions are independent and geographically dispersed within a country, between countries, or among continents. Each region comprises one or more availability domains (ADs) and is referred to as a Single-AD or Multi-AD region, respectively. Each AD is an independent data center, and in Multi-AD regions, the ADs are isolated from one another to reduce the risk that a failure in one affects the others.
The ADs are connected by a secure, low-latency, high-bandwidth network, which lets you build resilient, highly available solutions across multiple ADs (where available). Additionally, every AD contains three fault domains (FDs). Each FD is a grouping of hardware and infrastructure distinct from the other FDs in the same AD. FDs let you distribute resources so that they don't depend on the same physical hardware within a single AD. As a result, hardware failures or maintenance events that affect one FD don't affect the resources in other FDs.
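As a rough illustration of distributing resources across FDs, the following minimal Python sketch assigns instances to the three fault domains of one AD in round-robin order. The helper name, the instance names, and the fault domain labels are assumptions made for this example; the sketch doesn't call any OCI API.

```python
from collections import Counter
from itertools import cycle

# Fault domain labels are assumed here for illustration; each AD has three FDs.
FAULT_DOMAINS = ("FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3")


def spread_across_fault_domains(instance_names, fault_domains=FAULT_DOMAINS):
    """Assign each instance to a fault domain in round-robin order (hypothetical helper)."""
    return dict(zip(instance_names, cycle(fault_domains)))


placement = spread_across_fault_domains(["web-1", "web-2", "web-3", "web-4"])
instances_per_fd = Counter(placement.values())

for name, fault_domain in placement.items():
    print(f"{name} -> {fault_domain}")
```

With this kind of placement, the loss of a single FD affects at most about one third of the instances, provided the instances are interchangeable.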
OCI core infrastructure components, such as Compute, Storage, Networking, Identity, and Database services, have built-in redundancies. You can leverage ADs, FDs, and these services to build highly available applications. However, OCI doesn't automatically replicate, deploy, or fail over application resources and data provisioned in your tenancy to another AD or region in the event of a disaster or a partial or complete regional outage. It's your responsibility to deploy your application resources across ADs and regions.
For example, if an application is deployed on a compute instance (with a block volume) within one AD (for example, AD1), OCI won't automatically provision a new compute instance in a different AD or region in the event of a failure affecting the instance.
Note: Block storage has built-in redundancies.
Your Responsibility: “Resiliency in the Cloud”
To achieve "resiliency in the cloud", you are ultimately responsible for developing a comprehensive business continuity plan, including a high availability (HA) and disaster recovery (DR) strategy, risk assessments, and incident response plans. You are also responsible for deploying your applications and systems across multiple FDs, ADs, and regions for resiliency and fault tolerance using OCI best practices and Maximum Availability Architecture (MAA) frameworks. Each component of the application should be designed for the maximum potential uptime and accessibility. To ensure high availability, single points of failure must be identified and eliminated so that even if components fail, the application remains running and available.
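To see why eliminating single points of failure matters, consider a simplified availability calculation. The sketch below assumes that components fail independently and that each has an availability of 0.99; both are assumptions made for this example, not OCI figures. Components that all must work are multiplied together, while redundant components fail only if every copy fails.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of a group that is up while at least one member is up."""
    return 1 - prod(1 - a for a in availabilities)


# Two tiers (for example, web and database), one instance each: two single points of failure.
single_instance_stack = serial_availability([0.99, 0.99])

# The same two tiers, each running two redundant instances.
redundant_tier = redundant_availability([0.99, 0.99])
redundant_stack = serial_availability([redundant_tier, redundant_tier])

print(f"single instances: {single_instance_stack:.4f}")
print(f"redundant tiers:  {redundant_stack:.4f}")
```

Under these assumptions, the stack with single instances is available roughly 98% of the time, while the stack with redundant tiers reaches roughly 99.98%. Real failures are rarely fully independent, so treat the result as an upper bound.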
In the event of a disaster or full regional outage, whether it involves a Single-AD or Multi-AD region, it's your responsibility to ensure that sufficient OCI resource capacity is available to your tenancy in the failover AD or region before executing a disaster recovery plan.
Resiliency is a Shared Responsibility Between OCI and You
OCI Responsibilities: Resiliency of the Cloud
Components | Description |
---|---|
Region, Availability Domains, Fault Domains | Oracle provisions, manages, monitors, secures, and operates a highly reliable global cloud infrastructure. |
OCI Storage Services | Oracle provisions and operates storage services, providing service high availability and protecting data physically within an availability domain. |
OCI Core Networking Services | Oracle provides high availability for OCI core networking services and connectivity services with global traffic shaping that ensures optimal application connectivity and performance. |
OCI Database Services | Oracle creates and initiates the Database service, conducts hardware maintenance and enhancements, updates storage servers, and oversees service health. |
Your Responsibilities: Resiliency in the Cloud
Components | Description |
---|---|
HA, DR, and failover planning and testing | Plan, configure, test, and run HA, DR, and failover solutions for data and service resiliency to ensure business continuity. |
Operations and Management | You are responsible for operating and monitoring your cloud resources, implementing resilient cloud architecture best practices to minimize service disruptions. |
Workload Architecture | You are responsible for using Enterprise Architecture Best Practices and Maximum Availability Architecture (MAA) frameworks for designing, building, and maintaining reliable, secure, efficient, and cost-effective cloud workloads. |
Resiliency Planning | You are responsible for developing a comprehensive business continuity plan, including HA and DR strategy, risk assessments, and incident response plans. |
How OCI Delivers Cloud Resiliency
The following information describes ways in which OCI delivers cloud resiliency.
OCI Responsibilities for Services
- OCI architecture is built for resiliency, deploying multiple components that can perform the same task.
- OCI monitors the health of OCI services and manages automatic failover in case of service disruption.
- OCI core platform services (servers, storage, networking, core Identity and Access Management (IAM), and telemetry services) are designed and deployed redundantly. OCI continuously monitors their health, and in the event of a failure, automatic failover processes are executed to provide continuity.
- OCI Storage services have built-in resiliency. OCI Block Volume provides persistent, high-performance data storage within an AD. Similarly, OCI Object Storage provides persistent, durable, high-performance data storage within an AD. Additionally, in Multi-AD regions, Object Storage replicates the data across ADs automatically. File Storage maintains replicas across fault domains within an AD.
- Oracle provides highly robust and resilient Database Services within OCI that let you select the most suitable HA and DR strategy for your needs.
- OCI DNS is hosted across multiple geographically distributed data centers, making it highly available. It also provides low latency, a basic level of load balancing, and the resiliency to handle outages or heavy traffic with minimal impact to users.
Your Responsibilities for Achieving Resiliency
The following information describes ways in which you are responsible for achieving resiliency.
Process Recommendations
- Document a high availability plan based on these best practices. Consider that higher availability will result in higher costs and increased complexity.
- Document a disaster recovery plan based on best practices, including Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
- Document resiliency needs at a workload and application level, and plan for redundancy, monitoring, and failovers as necessary.
- Have a failover plan in place for workloads and applications that affect the business, covering service disruptions, planned maintenance, and application-level failover that leverages Oracle Data Guard or Oracle Real Application Clusters (RAC).
- Deploy Full Stack Disaster Recovery for critical workloads.
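One way to make a documented DR plan testable is to record the RPO and RTO for each workload and compare them against the results of a recovery drill. The following Python sketch shows the idea; the class name, function name, and sample values are hypothetical and not part of any OCI service.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Documented objectives for one workload (illustrative structure)."""
    workload: str
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


def drill_gaps(objectives, measured_data_loss, measured_downtime):
    """Return the objectives that a recovery drill failed to meet."""
    gaps = []
    if measured_data_loss > objectives.rpo:
        gaps.append("RPO")
    if measured_downtime > objectives.rto:
        gaps.append("RTO")
    return gaps


orders = RecoveryObjectives("orders", rpo=timedelta(minutes=15), rto=timedelta(hours=1))
gaps = drill_gaps(orders, measured_data_loss=timedelta(minutes=5), measured_downtime=timedelta(hours=2))
print(f"{orders.workload}: unmet objectives = {gaps}")
```

In this example the drill lost five minutes of data, which is within the RPO, but took two hours to recover, which exceeds the RTO, so the plan needs work on recovery time.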
Identity Domains
- Plan for disaster recovery and identity domains.
- Identity domain replication is always enabled for the “default” identity domain. The “default” identity domain always replicates to all regions to which the tenant is subscribed. When an administrator subscribes to another region, the “default” identity domain automatically replicates to that region.
- Additional identity domains are created in the “home region” specified at creation time. They don't replicate to other subscribed regions unless replication is specifically enabled.
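The replication rules above can be summarized in a short Python sketch that works out the regions in which an identity domain is present. The function and its parameters are hypothetical and exist only to illustrate the rules; the sketch assumes that replication can be enabled only for regions the tenancy is subscribed to.

```python
def domain_regions(is_default, home_region, subscribed_regions, replication_regions=()):
    """Return the regions in which an identity domain is present (illustrative model)."""
    if is_default:
        # The default domain always replicates to every subscribed region.
        return set(subscribed_regions)
    # Additional domains stay in their home region unless replication is enabled.
    return {home_region} | (set(replication_regions) & set(subscribed_regions))


subscribed = ["us-ashburn-1", "us-phoenix-1", "eu-frankfurt-1"]

default_domain = domain_regions(True, "us-ashburn-1", subscribed)
extra_domain = domain_regions(False, "us-ashburn-1", subscribed)
extra_domain_replicated = domain_regions(False, "us-ashburn-1", subscribed, ["us-phoenix-1"])

print(sorted(default_domain))
print(sorted(extra_domain))
print(sorted(extra_domain_replicated))
```

An additional identity domain that authenticates users of a DR workload is therefore unavailable in the DR region unless replication to that region was enabled beforehand.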
Networking
- Plan high availability for network resources and leverage the Load Balancer service to distribute traffic.
- Peer the virtual cloud networks (VCNs) in the different regions to facilitate network connectivity.
- OCI provides you with the option to provision a secondary DNS to create redundancy for web-facing applications.
Compute
- Plan high availability for Compute instances, distributing them across FDs in each of the ADs, and placing them behind load balancers.
- Enable backup for a point-in-time snapshot of your volumes.
- Set up cross-region replication of block volumes, boot volumes, and volume groups.
- Make the compute images available in both an active and a DR region. In the region for DR, deploy a minimal setup to maintain a warm standby. Then, use capacity reservations to reserve the rest of the required capacity to run all the VMs when the DR region becomes primary.
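The warm standby recommendation implies a simple calculation: the capacity to reserve in the DR region is the full production requirement minus what the warm standby already runs. The sketch below illustrates this in Python; the function name, shape names, and instance counts are assumptions made for the example.

```python
def capacity_to_reserve(required, warm_standby):
    """Return, per shape, how many instances still need reserved capacity in the DR region."""
    shortfall = {}
    for shape, needed in required.items():
        missing = needed - warm_standby.get(shape, 0)
        if missing > 0:
            shortfall[shape] = missing
    return shortfall


# Instances needed to run the full workload when the DR region becomes primary.
required = {"VM.Standard.E4.Flex": 12, "VM.Standard3.Flex": 4}
# Instances already running in the DR region as a warm standby.
warm_standby = {"VM.Standard.E4.Flex": 2, "VM.Standard3.Flex": 4}

to_reserve = capacity_to_reserve(required, warm_standby)
print(to_reserve)
```

Here only ten instances of the first shape need a capacity reservation, because the second shape is already fully deployed in the standby.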
Storage
- Plan high availability for storage.
- Enable automated backups for Object Storage, and Object Storage replication across regions for DR purposes.
- Enable volume cloning features for block volumes, and leverage the Block Volume service replication feature to ensure redundancy across different ADs (same or different region).
- Enable file system snapshots and clones. The snapshot lifecycle can be managed automatically by using the policy-based snapshot feature. Leverage OCI File Storage asynchronous replication for failover and failback scenarios.
- Configure Block Volume asynchronous replication to replicate volumes and volume groups to another region. Enable the backup feature to produce crash-consistent backups for volumes and volume groups. Enable copies to another region.
- For File Storage, in addition to the built-in asynchronous replication to another availability domain or region, you can use the File Storage cloning feature for an almost instant RTO.
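To illustrate what a retention-based snapshot lifecycle does, the following Python sketch selects the snapshots that have aged past a retention period. It models the idea only and is not the File Storage policy engine; the function name, snapshot names, and seven-day retention period are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def expired_snapshots(snapshots, retention, now):
    """Return names of snapshots older than the retention period, oldest first."""
    cutoff = now - retention
    expired = [name for name, created in snapshots.items() if created < cutoff]
    return sorted(expired, key=lambda name: snapshots[name])


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
snapshots = {
    "daily-0629": datetime(2024, 6, 29, tzinfo=timezone.utc),
    "daily-0620": datetime(2024, 6, 20, tzinfo=timezone.utc),
    "daily-0615": datetime(2024, 6, 15, tzinfo=timezone.utc),
}

to_delete = expired_snapshots(snapshots, retention=timedelta(days=7), now=now)
print(to_delete)
```

When you choose a retention period, keep it at least as long as the oldest point in time you might need to restore to.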
Database
Oracle Database: Plan for database high availability based on the Maximum Availability Architecture (MAA). Consider that stricter (lower) RPO and RTO targets will increase cost and complexity.
- Define the correct DB edition according to the high availability needs.
- Leverage Oracle Data Guard to replicate data between Oracle DB nodes.
- Use Oracle Clusterware-managed database services to connect your application. For Oracle Data Guard environments, use role-based services.
- Use the recommended connection string with built-in timeouts, retries, and delays.
- Configure your connections with Fast Application Notification (FAN).
- Leverage Application Continuity or Transparent Application Continuity to replay in-flight uncommitted transactions transparently after failures.
- Enable replicas for a current version of the data.
- Leverage OCI Services: Recovery Manager (RMAN), Refreshable Pluggable Database (PDB), Oracle Data Guard and Active Data Guard, Autonomous Data Guard, and OCI GoldenGate.
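As an illustration of a connection string with built-in timeouts, retries, and delays, the Python sketch below assembles an Oracle Net connect descriptor with one address list per site. The host names, service name, and parameter values are placeholders assumed for this example; take the actual values from the MAA guidance for your database version.

```python
def build_connect_descriptor(hosts, service_name, port=1521, connect_timeout=90,
                             retry_count=50, retry_delay=3, transport_connect_timeout=3):
    """Build a connect descriptor with timeout and retry settings (illustrative values)."""
    # One ADDRESS_LIST per site, for example the primary and the standby SCAN listener.
    address_lists = "".join(
        f"(ADDRESS_LIST=(LOAD_BALANCE=on)(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port})))"
        for host in hosts
    )
    return (
        "(DESCRIPTION="
        f"(CONNECT_TIMEOUT={connect_timeout})"
        f"(RETRY_COUNT={retry_count})"
        f"(RETRY_DELAY={retry_delay})"
        f"(TRANSPORT_CONNECT_TIMEOUT={transport_connect_timeout})"
        f"{address_lists}"
        f"(CONNECT_DATA=(SERVICE_NAME={service_name})))"
    )


# Hypothetical host and service names.
descriptor = build_connect_descriptor(
    hosts=["primary-scan.example.com", "standby-scan.example.com"],
    service_name="orders_rw.example.com",
)
print(descriptor)
```

Listing both sites lets the client retry against the standby after a role change, provided the service is a role-based service that runs only on the current primary.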
MySQL: OCI provides High Availability Architecture and Disaster Recovery configurations for Oracle MySQL Database Service.
OCI HA DR Decision Tree
Explore More
Documentation
- Best practices framework for Oracle Cloud Infrastructure
- OCI Full Stack Disaster Recovery (FSDR) orchestration and management service
- OCI Disaster Recovery documentation
Solution Playbooks
- Learn about architecting a highly available cloud topology
- Learn about reliable and resilient cloud topology practices
- Design the infrastructure to deploy Oracle Enterprise Performance Management in the cloud (HA Architecture: One Region, Single Availability Domain)
Reference Architectures
- Deploy a highly available web application
- Deploy Oracle REST Data Services with high availability on Oracle Cloud Infrastructure
- Deploy a highly available MySQL InnoDB cluster
- Deploy highly available ASP.Net applications on Oracle Cloud Infrastructure
- Deploy a highly available CockroachDB cluster
- Deploy a highly available bare metal database
- Deploy a highly available Microsoft SQL Server database
- Deploy a highly available Apache Cassandra cluster
- Deploy a highly available, distributed cache using Redis
- Provision a highly available session border controller