Resiliency

Resiliency is the ability of an application or workload to recover quickly from failures and maintain high availability. It's a critical aspect of cloud computing because it keeps applications and workloads accessible and functional when unexpected events occur, which supports business continuity and minimizes the risk of service disruptions. The following information describes why resiliency matters in cloud computing and the resiliency features that Oracle Cloud Infrastructure (OCI) provides.

Recover from Failure

OCI provides a set of tools and services designed to provide a high level of resiliency and availability for applications and workloads. One category of offerings is Platform as a Service (PaaS), which includes several mechanisms for recovering from failures and ensuring high uptime for workloads.

For example, the Autonomous Database service, which is a PaaS service, offers built-in fault tolerance and automatic backup and recovery capabilities. In the event of a failure, the database can automatically switch over to a standby database, minimizing downtime and ensuring the continuity of critical business processes.

OCI provides automated backup and recovery capabilities for compute instances, letting you restore instances to a previous state in case of a failure. This feature offers peace of mind, knowing that critical workloads can be restored to a functional state if there's an unexpected event.

The resiliency and availability features provided by OCI, including PaaS services and automated backup and recovery, help ensure that applications and workloads are always available and performant. This lets you continue operating and providing services to your customers, even in the face of unexpected events such as hardware failures or other disruptions.

High Availability

OCI provides a robust and highly available architecture, specifically designed to minimize downtime and ensure that applications remain accessible and functional even in the face of failures. This architecture is achieved by deploying resources across multiple fault domains (FDs) and availability domains (ADs) within a region. Each fault domain represents different physical hardware within a single availability domain, providing anti-affinity, while each availability domain consists of one or more data centers that are isolated from other availability domains, providing redundancy and fault tolerance. Every availability domain contains three fault domains.

For example, consider a highly available application, such as one with two web servers and a clustered database. In this scenario, the ideal placement for each component is to group one web server and one database node in one fault domain, and the other half of each pair in another fault domain. This placement strategy ensures that a failure of any one fault domain doesn't result in an outage for your application, as the other half of each component pair continues to function.
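The placement described above can be sketched in a few lines of Python. This is an illustration only, not an OCI API: the function names and the `{"web": 2, "db": 2}` input are hypothetical, and the fault domain labels simply follow the FAULT-DOMAIN-n naming that OCI uses.

```python
from collections import defaultdict

FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def place_across_fault_domains(components, fault_domains=FAULT_DOMAINS):
    """Assign each replica of each component to a fault domain, round-robin.

    components maps a component name to its replica count,
    for example {"web": 2, "db": 2}.
    """
    placement = {}
    for name, replicas in components.items():
        for i in range(replicas):
            placement[f"{name}-{i + 1}"] = fault_domains[i % len(fault_domains)]
    return placement


def survives_single_fd_failure(placement):
    """Return True if every component keeps at least one replica
    when any single fault domain is lost."""
    domains_by_component = defaultdict(set)
    for replica, fault_domain in placement.items():
        component = replica.rsplit("-", 1)[0]
        domains_by_component[component].add(fault_domain)
    # A component survives the loss of one fault domain if it spans two or more.
    return all(len(fds) >= 2 for fds in domains_by_component.values())


placement = place_across_fault_domains({"web": 2, "db": 2})
print(placement)
print(survives_single_fd_failure(placement))  # True
```

With two replicas of each component, the first web server and first database node land in one fault domain and the second of each pair lands in another, which matches the placement described above.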

In addition, OCI offers paired regions for disaster recovery, letting you replicate your resources across two regions for additional resilience.

The highly available architecture provided by OCI, including the deployment of resources across multiple fault domains and availability domains, in addition to paired regions for disaster recovery, helps your applications and workloads remain available and functional, even when facing unexpected events. This provides you with the assurance that your services will remain accessible to your customers, helping to maintain customer satisfaction and business continuity. For more information, see Cloud Adoption Framework recommendations and best practices on High Availability (HA).

Disaster Recovery

Disaster recovery (DR) is the process of restoring IT systems and infrastructure after a catastrophic event. Regions are independent of other regions and can be separated by vast distances—across countries or even continents. Generally, you would deploy an application in the region where it is most heavily used, because using nearby resources is faster than using distant resources. However, you can also deploy applications in different regions to mitigate the risk of region-wide catastrophic events and meet varying requirements for legal jurisdictions, tax domains, and other business or social criteria.

OCI provides several disaster recovery options, including hot, warm, and cold standby solutions. Hot standby solutions provide real-time replication of data and are ideal for mission-critical workloads that require near-zero downtime. Warm standby solutions replicate data at intervals and are suitable for workloads that can tolerate some downtime. Cold standby solutions involve manually restoring systems from backups and are suitable for workloads that can tolerate significant downtime.

Also, OCI supports several HA DR models, including active-passive and active-active architectures. Active-passive architectures involve replicating resources in a standby environment that is activated in case of a failure. Active-active architectures involve replicating resources across multiple regions or ADs and distributing traffic across them to minimize downtime.
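The difference between the two models can be sketched as a routing decision. The following Python is a minimal illustration under stated assumptions, not an OCI API: the site dictionaries, the `healthy` flag, and both function names are hypothetical, and a real deployment would rely on health checks and a traffic management or load balancing service rather than application code.

```python
def route_active_passive(sites):
    """Send all traffic to the first healthy site in priority order.

    The standby receives traffic only when every site ahead of it is unhealthy.
    """
    for site in sites:
        if site["healthy"]:
            return {site["name"]: 1.0}
    raise RuntimeError("no healthy site available")


def route_active_active(sites):
    """Split traffic evenly across every healthy site."""
    healthy = [site["name"] for site in sites if site["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy site available")
    share = 1.0 / len(healthy)
    return {name: share for name in healthy}


sites = [
    {"name": "primary", "healthy": True},
    {"name": "standby", "healthy": True},
]
print(route_active_passive(sites))  # {'primary': 1.0}
print(route_active_active(sites))   # {'primary': 0.5, 'standby': 0.5}

# Simulate a primary failure: both models now route everything to the standby.
sites[0]["healthy"] = False
print(route_active_passive(sites))  # {'standby': 1.0}
print(route_active_active(sites))   # {'standby': 1.0}
```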

Maximum Availability Architecture

OCI provides a range of highly effective HA DR models, including active-passive and active-active architectures, to ensure seamless continuity and high availability of critical resources in the event of any failure or maintenance. For example, in an active-passive architecture, the standby environment replicates the resources, and it becomes active only when the primary environment fails. In contrast, an active-active architecture involves replicating resources across multiple regions or ADs to distribute traffic and minimize downtime.

To validate end-to-end application and database availability, OCI employs Chaos Engineering, a discipline that experiments with a system to build confidence in its ability to withstand turbulent conditions in production. Maximum Availability Architectures (MAA) leverage Chaos Engineering throughout the testing and development life cycles, aggressively injecting various faults and planned maintenance events to evaluate their impact on the application and database. Through this experimentation, best practices, defects, and lessons learned are derived and put into practice to evolve and improve OCI's cloud MAA solutions.

Automatic backups of Oracle Autonomous Database (ADB) in OCI are stored in OCI Object Storage and replicated to another availability domain, letting you restore your databases in the event of a disaster. In addition, for Oracle Autonomous Database on Exadata Cloud@Customer (ADB-C@C), you can choose to back up to NFS or Zero Data Loss Recovery Appliance (ZDLRA); however, you're responsible for configuring and managing the replication of those backups.

OCI's advanced HA DR models, Chaos Engineering, and automatic database backups with replication to multiple availability domains provide you with comprehensive protection against potential data loss or system failures, ensuring maximum availability and continuity of critical resources.

Mean Time to Restore

The Mean Time to Restore (MTTR) is a critical metric that measures the average time it takes to restore a service or system after a failure. A prolonged MTTR can cause significant financial and reputational damage to businesses, leading to lost revenue, decreased customer satisfaction, and even regulatory fines.

OCI supports automation tools, such as Terraform and Ansible, and provides services that help reduce MTTR and ensure maximum availability of services. For example, automated backups and recovery processes are available to quickly recover data and applications in the event of an outage or disaster. In addition, real-time replication of data across multiple availability domains enables the rapid restoration of services, minimizing downtime and reducing the impact of failures.

It's essential to continually measure MTTR to understand the time required to restore services under unfavorable conditions. This evaluation is critical in identifying areas for improvement and reducing MTTR over time, ensuring optimal service availability and reducing the risk of damage caused by prolonged downtime.
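As a worked example of the metric, MTTR is the total restore time divided by the number of incidents. The following Python sketch assumes that you record a failure timestamp and a restore timestamp for each incident; the function name and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average the (restored - failed) duration over all incidents.

    incidents is a list of (failed_at, restored_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total = sum((restored - failed for failed, restored in incidents), timedelta())
    return total / len(incidents)


# Made-up incident log for illustration.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 30)),    # 30 minutes
    (datetime(2024, 2, 11, 2, 15), datetime(2024, 2, 11, 3, 15)),   # 60 minutes
    (datetime(2024, 3, 20, 18, 0), datetime(2024, 3, 20, 18, 45)),  # 45 minutes
]
print(mean_time_to_restore(incidents))  # 0:45:00
```

Tracking this value per quarter or per service makes it easier to see whether changes such as automated recovery are actually shortening restore times.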

Continuous Integration and Continuous Deployment

Continuous Integration and Continuous Deployment (CI/CD) are important DevOps practices that help you streamline your software development process, increase productivity, and reduce errors. These practices involve automating the process of building, testing, and deploying software, letting you release code more frequently, with improved quality and consistency.
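The core idea of an automated pipeline, running build, test, and deploy in order and stopping at the first failure, can be sketched as follows. This is a conceptual Python illustration, not how any particular CI/CD tool is configured; `run_pipeline` and the stage callables are hypothetical.

```python
def run_pipeline(stages):
    """Run stages in order and stop at the first one that fails.

    stages is a list of (name, step) pairs, where step is a callable
    that returns True on success. Returns (completed_names, failed_name),
    with failed_name set to None when every stage succeeds.
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


# Placeholder steps; real steps would compile code, run tests, and deploy.
stages = [
    ("build", lambda: True),
    ("test", lambda: False),   # simulate a failing test suite
    ("deploy", lambda: True),
]
print(run_pipeline(stages))  # (['build'], 'test')
```

Because the failing test stage stops the run, the deploy stage never executes, which is how automation keeps untested changes out of production.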

OCI supports these best practices through integration with popular CI/CD tools, such as Jenkins, GitLab, and GitHub. These tools provide an efficient and automated pipeline for software development and deployment, from code changes to testing and deployment. By integrating these tools into OCI, you can take advantage of the cloud's scalability and flexibility to speed up the development process and improve overall productivity.

For example, by leveraging Jenkins on OCI, you can automate build and deployment processes, ensuring code is thoroughly tested and quickly deployed to production environments. Similarly, using GitLab in OCI can enable seamless collaboration between teams, facilitating efficient code sharing and change tracking. In addition, integrating GitHub with OCI provides a platform for you to manage code repositories, enabling version control and facilitating code reviews.

The OCI DevOps service is a fully managed cloud service designed to support CI/CD workflows for developers. With this platform, DevOps engineers can build, test, and deploy software and applications easily in Oracle Cloud, providing an end-to-end solution that streamlines the development process.

The service enables the creation of DevOps build and deployment pipelines that reduce errors associated with change and minimize the time required for building and deploying releases, ultimately improving the overall quality and consistency of code. In addition, the service offers private Git repositories for secure code storage and supports connections to external code repositories, allowing for streamlined collaboration with external teams.

As a fully managed cloud service, OCI's DevOps service provides automated scaling and maintenance, letting you focus on code without worrying about infrastructure management. This ensures that the platform is always up-to-date and available to support the development process, making it an ideal solution when you want to streamline DevOps workflows and achieve faster release cycles with higher-quality code.

DevOps, SecOps, DevSecOps, IaC

DevOps, SecOps, and DevSecOps are critical methodologies that let you meet the demands of modern software development by emphasizing collaboration, automation, and security. Infrastructure as code (IaC) plays an important role in provisioning and configuring infrastructure for automated deployment.

  • DevOps: Essential because it fosters collaboration between development and operations teams, ensuring that software is delivered faster, with better quality, and more reliability. This methodology emphasizes the importance of automation, allowing teams to build, test, and deploy code more efficiently, reducing the time to market. In OCI, you can use DevOps practices by using tools such as Jenkins, GitLab, and GitHub to automate the software development process.

  • SecOps: Important because it integrates security into the development process to ensure that security vulnerabilities are identified and addressed early on, reducing the risk of breaches and ensuring the protection of sensitive data. By emphasizing the importance of security, this methodology ensures that you can build and deploy secure software applications. In OCI, you can use integrations with third-party security tools such as Check Point and Fortinet to provide advanced threat detection and protection.

  • DevSecOps: Combination of DevOps and SecOps, with security integrated into the DevOps process from the beginning. This approach ensures that applications are secure, reliable, and meet compliance requirements. By focusing on security from the start, you can build and deploy secure applications faster, with better quality, and more reliability. In OCI, you can use built-in security features such as security zones to isolate workloads and control network traffic to improve security and resiliency.

  • IaC: Important practice that involves writing code to automate the deployment and management of infrastructure. This methodology ensures the consistency and reliability of infrastructure deployments, reducing the risk of errors and improving resiliency. In OCI, you can use tools such as Terraform and Ansible to automate the provisioning and configuration of infrastructure resources.

Automate Everything

Automation is a crucial aspect of building and maintaining a resilient cloud infrastructure. By automating processes and tasks, errors can be reduced, and efficiency can be increased. Building a culture that prioritizes automation and resiliency is essential for maintaining high availability in the cloud. This can be achieved through the use of tools and services such as Terraform, Ansible, and Jenkins, which provide automation capabilities for infrastructure deployment, configuration, and management.

For example, OCI provides a range of automation tools, including Resource Manager, which allows you to automate the creation, configuration, and deployment of cloud resources using Terraform or Oracle Cloud Infrastructure native APIs. In addition, using automation to perform routine tasks such as backups and updates can significantly reduce the risk of errors and increase the overall resilience of your cloud infrastructure.

Non-Functional Requirements - SLI, SLO, and SLA

Non-functional requirements, such as performance, scalability, and availability, play a crucial role in ensuring that applications and workloads meet business needs. To achieve this, it's important to have metrics in place that measure the performance and availability of services and resources. Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs) help you measure the effectiveness of your cloud infrastructure: an SLI is a measured value, an SLO is the target for that value, and an SLA is the commitment made to customers.

OCI provides a range of tools and services that let you monitor and manage these metrics, including Cloud Monitoring, Logging, and Notifications. Cloud Monitoring lets you collect, analyze, and alert on metrics and logs across OCI resources and services. It provides a unified view of the health and performance of your infrastructure, letting you quickly identify and troubleshoot issues that might impact your SLIs, SLOs, and SLAs. Logging lets you capture and analyze log data from various sources, including OCI services, applications, and infrastructure components. Notifications lets you receive alerts and notifications when predefined conditions are met, letting you take action before issues impact your SLIs, SLOs, and SLAs.

By leveraging these tools and services, you can gain deep visibility into your cloud infrastructure, and proactively monitor and manage SLIs, SLOs, and SLAs. This helps ensure that your applications and workloads are meeting business needs and enables you to quickly respond to any issues that arise, minimizing downtime and improving overall resiliency.

For example, you can use Cloud Monitoring to monitor the response time and availability of a web application hosted in OCI, while using Logging to track errors and diagnose performance issues. Notifications can be used to alert administrators when service disruptions or performance issues occur, letting them take action before the problem becomes severe.
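To make the relationship between an SLI and an SLO concrete, the following Python sketch computes an availability SLI from request counts and the share of the error budget that remains against an SLO. The function names and the sample numbers are hypothetical; in practice the request counts would come from your monitoring data.

```python
def availability_sli(successful_requests, total_requests):
    """Fraction of requests that succeeded in the measurement window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def error_budget_remaining(sli, slo):
    """Share of the error budget still unspent.

    1.0 means no budget used, 0.0 means the SLO is exactly met,
    and a negative value means the SLO has been breached.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1 (exclusive)")
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Made-up window: 1,000,000 requests, 500 of which failed, against a 99.9% SLO.
sli = availability_sli(999_500, 1_000_000)
print(f"SLI: {sli:.4%}")
print(f"Error budget remaining: {error_budget_remaining(sli, 0.999):.0%}")
```

In this example, the 500 failures use about half of the 1,000 failures that a 99.9% objective allows, so roughly half of the error budget remains.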

Fault and Availability Domain

Fault domains and availability domains are important concepts in cloud computing that enhance resiliency and reduce the impact of potential failures. In the event of a fault in a particular area, fault domains can be used to ensure that critical resources are not impacted, reducing the overall impact on the system. Availability domains provide isolation between data centers to provide redundancy and fault tolerance. If a failure occurs in one availability domain, the workload can fail over to a different availability domain so that services remain available.

OCI leverages fault domains and availability domains to provide you with high availability. For example, in OCI, a region contains one or more availability domains. In regions with multiple availability domains, the availability domains are physically isolated from each other and provide independent failure domains. OCI uses fault domains to ensure that instances in a given availability domain are distributed across multiple fault domains, ensuring high availability and protection against failures.

Multi Regions

OCI's paired regions are a crucial component in ensuring resiliency and continuity in the event of a disaster. Paired regions are two geographically separated regions that provide redundancy and fault tolerance. In case of a catastrophic event such as a natural disaster, cyberattack, or human error, the paired regions ensure that critical resources are replicated and available in an alternate region. This reduces the risk of downtime and data loss, providing peace of mind to your business and your customers.

For example, if a business operates in the United States and there's a catastrophic event such as a natural disaster, political unrest, or power outage in one region, the other region can take over and ensure business continuity. If the primary region is US East (Ashburn) and is experiencing an outage, the secondary region US West (Phoenix) can take over and provide the necessary services until the primary region is back online. This approach helps keep the user experience uninterrupted and data available throughout the outage. OCI's active-active or active-passive replication of resources in paired regions keeps data continuously available, making it possible to fail over to the backup region with minimal disruption.

Multi-region deployments let you implement an effective disaster recovery plan that protects data and keeps services available.

Data Guard and GoldenGate

Data Guard is a feature of Oracle Database that provides disaster recovery and high availability for enterprise databases. It allows for the creation of a standby database that can take over if the primary database fails. The standby database is continuously synchronized with the primary database, ensuring that data is always up to date. This provides an additional layer of resiliency for critical systems and applications.

GoldenGate is a data integration and replication tool that enables real-time data integration between different databases. It supports heterogeneous data integration, meaning that it can replicate data between different database vendors and within a single vendor. GoldenGate can also be used for database migration, data warehousing, and business intelligence.

OCI provides different versions of Data Guard and GoldenGate to meet different requirements and use cases. For example, Data Guard Standard Edition provides basic disaster recovery capabilities, while Data Guard Enterprise Edition provides more advanced features such as automatic failover and data protection. GoldenGate Standard Edition provides real-time data replication between databases, while GoldenGate Enterprise Edition includes additional features such as conflict detection and resolution.

By using these technologies in conjunction with OCI, you can improve your system's resilience by ensuring that critical data is always available and up to date, even in the event of a disaster or system failure. For example, a financial services company can use Data Guard to replicate a production database to a standby database in a different region, such as from US East (Ashburn) to US West (Phoenix), so that it can quickly recover from a catastrophic event and continue serving customers without interruption.

Data Replication

Data replication is a critical aspect of resiliency in cloud computing because it ensures that data is available even in the event of a failure. Replication involves creating copies of data and storing them in multiple locations, which can be used to recover from a failure or disaster.

OCI provides several storage options for replicating data. Object Storage is a highly scalable and durable storage service that enables replication of data across regions. By configuring cross-region replication, data is automatically replicated to a different region, providing a high level of resiliency. In the event of a disaster or outage, data can be easily accessed from the replicated location, ensuring business continuity.

File Storage provides highly available and durable file systems that can be accessed by multiple instances simultaneously. By using Replication Policies, files are automatically replicated to a different availability domain, providing fault tolerance and high availability.

Block Volume is a highly available and durable block storage service that provides a replication feature. By configuring Block Volume Replication, data is automatically replicated to another block volume in a different availability domain within the same region. This ensures that data is available even in the event of a failure or outage.

Data replication is crucial for maintaining resiliency in cloud computing, and OCI provides several options to replicate data across regions, availability domains, and instances. By using these options, you can ensure that your data is highly available, durable, and easily recoverable in the event of a failure or disaster.

Calculate Overall Reliability

Reliability is crucial to consider when deploying an application or workload in the cloud. Measuring the probability and impact of failures is essential to keep business operations running smoothly. OCI provides a range of tools and services that help you calculate the overall reliability and cost of your cloud infrastructure. For example, Cloud Advisor helps you identify potential issues with your architecture and provides recommendations for improving reliability, while Cost Estimator helps you estimate the cost of implementing your cloud infrastructure. By using these tools and services, you can ensure that your applications and workloads are deployed in a reliable and cost-effective way.
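A common way to estimate overall reliability is to combine the availability of individual components: components that all must work multiply together (series), while redundant replicas fail only if every replica fails (parallel). The following Python sketch shows the arithmetic. It assumes that failures are independent, and the component availabilities are made-up numbers, not OCI service commitments.

```python
from math import prod

HOURS_PER_YEAR = 8760


def series_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def parallel_availability(availability, replicas):
    """Availability of N redundant replicas, assuming independent failures."""
    return 1 - (1 - availability) ** replicas


def annual_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical workload: load balancer -> 2 web servers -> 2 database nodes.
load_balancer = 0.9999
web_tier = parallel_availability(0.99, replicas=2)   # 0.9999
db_tier = parallel_availability(0.995, replicas=2)   # 0.999975

system = series_availability([load_balancer, web_tier, db_tier])
print(f"System availability: {system:.6f}")                       # 0.999775
print(f"Expected downtime: {annual_downtime_hours(system):.2f} hours per year")  # 1.97
```

Note that the combined figure is lower than the availability of any single tier, which is why each additional dependency in a request path deserves scrutiny.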

Plan for Patching and Upgrade

Keeping applications and infrastructure up to date is important for maintaining optimal security and performance in the cloud. Failing to apply necessary patches and upgrades can leave systems vulnerable to attacks and can cause performance issues that disrupt business operations. OCI provides a range of tools and services to help streamline and automate the patching and upgrading process.

The Patching Automation and Upgrade Advisor services provided by OCI are designed to make it easier to plan and execute the patching and upgrading process. In addition, the OS Management Service (OSMS) lets you automate the patching of Oracle Linux or Windows instances. With the OSMS, you can organize your systems into groups and schedule jobs to apply the latest updates to all systems. This service provides access to a wide range of predefined software sources, providing the full range of Oracle yum repositories to Linux systems. As a result, systems can be kept up to date with the latest patches all the time, improving security and performance.

Business Continuity Plan

A solid business continuity plan is essential for any organization to ensure that it can continue to operate even in the face of disruptive events. These could include natural disasters, power outages, or cyber attacks.

OCI provides a range of tools and services to support this type of planning. For example, the Site-to-Site VPN service lets you create a secure, encrypted connection between your on-premises network and your OCI Virtual Cloud Network (VCN), letting you extend your data center to the cloud. Similarly, the FastConnect service provides a private, high-bandwidth connection between your on-premises infrastructure and your OCI resources, letting you replicate data and run critical applications in the cloud.

Use of Loosely Coupled Architecture

Loosely coupled architecture is a crucial element in building resilient systems because it helps to minimize the impact of failures by reducing dependencies between components. By reducing dependencies, each component can be scaled and evolved independently, making the system more flexible and adaptable to changes. OCI provides tools and services that support this architecture, such as Oracle Functions, which lets developers build and deploy serverless applications that can scale automatically based on workload demand, without being tightly coupled to other components. Another example is Oracle Kubernetes Engine (OKE), which provides a highly scalable and flexible platform for running containerized applications. OKE uses a microservices-based architecture that lets you build and deploy modular, loosely coupled applications that can be easily scaled and managed.
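The benefit of loose coupling can be illustrated with a small publish/subscribe sketch in Python. This is a conceptual, in-process example and not an OCI service: the `EventBus` class and the handler names are hypothetical. The publisher knows nothing about its subscribers, and a failure in one subscriber doesn't prevent the others from running.

```python
class EventBus:
    """Tiny in-process publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = []

    def subscribe(self, handler):
        self._handlers.append(handler)

    def publish(self, event):
        """Deliver the event to every handler.

        Failures are collected and returned instead of being raised,
        so one failing handler cannot block the rest.
        """
        failed = []
        for handler in self._handlers:
            try:
                handler(event)
            except Exception:
                failed.append(handler.__name__)
        return failed


shipped = []


def ship_order(event):
    shipped.append(event["order_id"])


def send_email(event):
    raise ConnectionError("mail service is down")  # simulated outage


bus = EventBus()
bus.subscribe(send_email)
bus.subscribe(ship_order)
print(bus.publish({"order_id": 42}))  # ['send_email']
print(shipped)                        # [42]
```

In a distributed system, a managed queue or streaming service typically plays the role of the bus, but the isolation principle is the same.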

Monitor and Adapt for Unusual Patterns

To ensure the resilience of your system, it's important to monitor and adapt to unusual patterns in resource usage, traffic, and behavior. These patterns can help identify potential issues before they become critical and impact the performance and availability of your system. OCI provides tools and services such as Cloud Guard and Security Zones that provide continuous monitoring and analysis of resource usage, network traffic, and user behavior. Cloud Guard automates the monitoring of your resources and helps detect security threats and misconfigurations in your cloud environment. In addition, Security Zones provide a secure environment for workloads and resources that require higher levels of security.
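As a simple illustration of spotting an unusual pattern in a metric, the following Python sketch flags samples that deviate sharply from the recent average. It uses a basic trailing-window z-score, which is a swapped-in technique for illustration and not a description of how Cloud Guard or any OCI service detects problems. The function name, window size, threshold, and sample data are all assumptions.

```python
from statistics import mean, stdev


def unusual_points(samples, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing-window mean
    by more than `threshold` standard deviations."""
    flagged = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat windows, where the standard deviation is zero.
        if sigma > 0 and abs(samples[i] - mu) > threshold * sigma:
            flagged.append(i)
    return flagged


# Made-up requests-per-second samples with one spike at index 6.
requests_per_second = [100, 102, 98, 101, 99, 100, 400, 101]
print(unusual_points(requests_per_second))  # [6]
```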

Choosing from SaaS, PaaS, and IaaS

Selecting the appropriate cloud service model is crucial because it determines the level of control, flexibility, and management required for your applications and workloads. Cloud service models such as Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS) provide distinct advantages and disadvantages. SaaS provides a fully managed solution that can be easily deployed and requires little maintenance, while PaaS provides a development platform with more control and customization options. IaaS offers complete control over the infrastructure but requires more management and maintenance. Choosing the right service model for your business is essential to maximize performance, reduce costs, and maintain high levels of resilience.

Anticipate Failure

Mitigating the impact of potential failures is crucial for ensuring the resiliency of your cloud infrastructure. OCI provides a variety of tools and services that let you anticipate potential failure points and plan their mitigation. For example, fault domains and availability domains are concepts used in cloud computing to increase resilience and reduce the impact of failures. By grouping resources together and distributing them across different fault domains and availability domains, you can minimize the risk of a single point of failure. In addition, Security Zones in OCI let you isolate workloads and reduce the impact of security incidents or failures.

Cost Versus Reliability

To ensure cost-effectiveness while maintaining reliability, it's essential to balance the cost and performance of your cloud infrastructure. OCI provides various tools and services, such as Cost Estimator and Cost Management, that help you monitor and optimize your cloud spending. The Cost Estimator helps you estimate the costs of your infrastructure deployment and identify potential cost savings. The Cost Management service provides a centralized platform for monitoring and managing your cloud spending across different services and regions. With this service, you can set budgets, track usage, and identify areas where you can reduce costs without impacting the reliability of your infrastructure.
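One way to reason about this balance is to add the expected cost of downtime to the infrastructure cost of each option and compare the totals. The following Python sketch shows the arithmetic. Every figure in it (infrastructure costs, availabilities, and the cost of an hour of downtime) is a made-up assumption for illustration, not OCI pricing or an OCI service commitment.

```python
HOURS_PER_YEAR = 8760


def total_annual_cost(infra_cost, availability, downtime_cost_per_hour):
    """Infrastructure cost plus the expected cost of downtime for one year."""
    expected_downtime_hours = (1 - availability) * HOURS_PER_YEAR
    return infra_cost + expected_downtime_hours * downtime_cost_per_hour


def cheapest_option(options, downtime_cost_per_hour):
    """Return the name of the option with the lowest total annual cost.

    options maps a name to an (infra_cost, availability) pair.
    """
    return min(
        options,
        key=lambda name: total_annual_cost(*options[name], downtime_cost_per_hour),
    )


# Hypothetical architectures and costs.
options = {
    "single-AD": (50_000, 0.999),
    "multi-AD": (80_000, 0.9999),
    "multi-region": (150_000, 0.99999),
}
for name, (cost, availability) in options.items():
    print(name, round(total_annual_cost(cost, availability, 10_000)))

print(cheapest_option(options, downtime_cost_per_hour=10_000))  # multi-AD
```

Under these assumed numbers, the middle option has the lowest total cost. With a much higher cost per hour of downtime, the most resilient option would come out ahead, which is why the downtime estimate matters as much as the infrastructure price.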

Plan for Big Events

Ensuring high availability and avoiding disruptions during big events, such as seasonal peaks in traffic or planned maintenance windows, requires careful planning. OCI provides various tools and services such as Autoscaling and Scheduled Scaling that help you plan and adjust your resources accordingly. Autoscaling automatically adjusts the capacity of your resources based on real-time traffic, ensuring that your application is available to your users. Scheduled Scaling allows you to plan and adjust the resources in advance for predictable traffic patterns, reducing the risk of over-provisioning and unnecessary costs. These tools help you efficiently manage your cloud resources, ensuring high availability and optimal performance during big events.
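The two approaches differ in what triggers a capacity change: a measured metric or the clock. The following Python sketch contrasts them. It is a conceptual illustration, not OCI autoscaling configuration; the function names, CPU thresholds, instance limits, and schedule are hypothetical.

```python
def autoscale(current, cpu_percent, scale_out_at=80, scale_in_at=30,
              minimum=2, maximum=10):
    """Metric-based scaling: add or remove one instance based on CPU load."""
    if cpu_percent > scale_out_at:
        return min(current + 1, maximum)
    if cpu_percent < scale_in_at:
        return max(current - 1, minimum)
    return current


def scheduled_capacity(hour, schedule, default):
    """Schedule-based scaling: return the planned instance count for an hour.

    schedule is a list of (start_hour, end_hour, instances) windows,
    where end_hour is exclusive.
    """
    for start, end, instances in schedule:
        if start <= hour < end:
            return instances
    return default


print(autoscale(current=4, cpu_percent=92))  # 5 (scale out)
print(autoscale(current=4, cpu_percent=12))  # 3 (scale in)
print(autoscale(current=4, cpu_percent=55))  # 4 (no change)

# Plan extra capacity for a known evening peak between 18:00 and 22:00.
peak_schedule = [(18, 22, 8)]
print(scheduled_capacity(19, peak_schedule, default=3))  # 8
print(scheduled_capacity(9, peak_schedule, default=3))   # 3
```

Scheduled capacity is in place before a predictable peak begins, while metric-based scaling reacts to load that you didn't anticipate, so the two are often used together.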