About Reliable and Resilient Cloud Topology Practices
Reliable applications are:
- Resilient and recover gracefully from failures, and they continue to function with minimal downtime and data loss before full recovery.
- Highly available (HA) and run as designed in a healthy state with no significant downtime.
- Protected from Region failure through good disaster recovery (DR) design.
Understanding how these elements work together, and how they affect cost, is essential to building a reliable application. It can help you determine how much downtime is acceptable, the potential cost to your business, and which functions are necessary during a recovery.
Architect for Reliability
When creating a cloud application, use the following to build in reliability.
- Define the requirements.
Define your availability and recovery requirements based on the workloads you are bringing to the cloud and business needs.
- Apply architectural best practices.
Follow proven practices, identify possible failure points in the architecture, and determine how the application will respond to failure.
- Test with simulations and forced failovers.
Simulate faults, trigger forced failovers, and test detection and recovery from these failures.
- Deploy applications consistently.
Release to production using reliable and repeatable processes and automate where possible.
- Monitor application health.
Detect failures, monitor indicators of potential failures, and gauge the health of your applications.
- Manage failures and disasters.
Identify when a failure occurs, and determine how to address it based on established strategies.
Define the Requirements
Define your availability and recovery requirements based on the workloads you are bringing to the cloud and business needs.
Consider the following when identifying your business needs, and match your reliability plan to them:
- Identify workloads and their requirements
A workload is a distinct capability or task that is logically separated from other workloads, in terms of business logic and data storage requirements. Each workload has different requirements for availability, scalability, data consistency, and disaster recovery.
- Determine usage patterns
How an application is used plays a role in requirements. Identify differences in requirements during critical and non-critical periods. For example, an application handling month-end processing can't fail during the month-end but might handle failure at other times. To ensure uptime, additional components or redundancy can be added during critical periods and removed when they no longer add value.
- Identify critical components and paths
Not all components of your system might be as important as others. For example, you might have an optional component that adds incremental value, but that the workload can run without if necessary. Understand where these components are, and conversely, where the critical parts of your workload are. This will help to scope your availability and reliability metrics and to plan your recovery strategies to prioritize the highest-importance components.
- Identify external dependencies and the effect of downstream failure
If your workload depends on external services, a failure in these dependencies may negatively impact the availability of the workload. Implement methods of decoupling the integration to insulate against downstream failures.
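One common way to insulate a workload from transient downstream failures is to retry dependency calls with exponential backoff and jitter. The sketch below illustrates the pattern in plain Python; the function names and retry parameters are illustrative assumptions, not an OCI SDK API.

```python
import random
import time

def call_with_backoff(operation, max_attempts=4, base_delay=0.5):
    """Retry a flaky downstream call with exponential backoff and jitter.

    `operation` is any zero-argument callable. Names and defaults here are
    illustrative only; tune them to the dependency's failure profile.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up: let the caller degrade gracefully
            # Sleep 0.5 s, 1 s, 2 s, ... plus jitter to avoid retry storms.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.1))

# Example: a dependency that fails twice, then recovers.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("downstream unavailable")
    return "ok"

print(call_with_backoff(flaky_service))  # → ok
```

Pairing a retry policy like this with timeouts and a circuit breaker keeps a slow dependency from consuming your workload's own capacity.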
- Determine workload availability requirements
High availability (HA) is normally defined in terms of an uptime target. A 99% HA target, for instance, means that a particular resource can be unavailable for up to 3.65 days in a year. Oracle Cloud Infrastructure (OCI) is architected to provide you with a highly available environment. OCI publishes a Service Level Agreement (SLA) for each of its services that describes Oracle's commitments for uptime and connectivity; review these SLAs to see how they match your requirements. Some OCI services have high levels of HA built in, particularly Oracle-managed services such as Autonomous Database. To ensure that an application architecture meets your business requirements, define target SLAs for each workload, inclusive of external dependencies. Account for the cost and complexity of meeting availability requirements, in addition to application dependencies.
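The arithmetic behind an uptime target is straightforward: the allowable downtime is the period multiplied by the unavailability fraction. A minimal sketch (illustrative only; consult each OCI service's published SLA for actual commitments):

```python
def downtime_budget(availability_pct, period_hours=24 * 365):
    """Maximum allowable downtime, in hours, for a given uptime target
    over a period (one year by default). Illustrative arithmetic only."""
    return period_hours * (1 - availability_pct / 100)

# A 99% target allows 87.6 hours (3.65 days) of downtime per year;
# a 99.95% target allows about 4.38 hours.
print(round(downtime_budget(99), 1))     # → 87.6
print(round(downtime_budget(99.95), 2))  # → 4.38
```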
- Establish your recovery metrics — recovery time objective (RTO) and recovery point objective (RPO)
RTO is the maximum acceptable time an application can be unavailable after a disaster incident.
RPO is the maximum duration of data loss that is acceptable during a disaster.
To derive these values, conduct a risk assessment and make sure you understand the cost and risk of downtime or data loss in your organization.
Incremental backups for storage provide security against data loss via recovery points. The period between each backup limits the maximum amount of data that is lost after restoring from a backup.
For example, consider using one of Oracle's pre-defined backup policies for OCI Block Volumes storage: Bronze, Silver, and Gold. Each backup policy is comprised of schedules with a set incremental backup frequency, such as monthly, weekly, or daily, and a defined retention period.
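The relationship between backup frequency and RPO can be made concrete: with periodic backups, the worst case is a failure just before the next backup runs, losing everything written since the last one. A small sketch of that reasoning (illustrative arithmetic, not an Oracle policy calculator):

```python
def worst_case_rpo_hours(backup_interval_hours):
    """Worst-case data loss with periodic backups: a failure just before
    the next backup loses everything since the last one, so the exposure
    equals one full backup interval."""
    return backup_interval_hours

def schedule_meets_rpo(backup_interval_hours, rpo_target_hours):
    """A backup schedule can satisfy an RPO only if the interval between
    backups does not exceed the target."""
    return worst_case_rpo_hours(backup_interval_hours) <= rpo_target_hours

# Daily incrementals bound worst-case loss to 24 h; weekly to 168 h.
print(schedule_meets_rpo(24, rpo_target_hours=24))      # → True
print(schedule_meets_rpo(24 * 7, rpo_target_hours=24))  # → False
```

For tighter RPOs than backups alone can provide, combine backups with replication.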
- Define a Disaster
Having well-documented disaster recovery plans and requirements is important, but the chaotic nature of such an event can create confusion. Understand what constitutes a disaster to your business, identify key roles that will be needed during a disaster, and establish a well-defined communication plan to initiate a disaster response.
Apply Architectural Best Practices
When designing your architecture, focus on implementing practices that meet your business requirements, identify failure points, and minimize the scope of failures.
Consider the following best practices:
- Determine where failures might occur
Analyze your architecture to identify the types of failures your application might experience, the potential effects of each, and possible recovery strategies.
- Determine the level of redundancy required, based on your HA and DR requirements
The level of redundancy required for each workload depends on your business needs and factors into the overall cost of your application.
- Design for scalability
A cloud application must be able to scale to accommodate changes in usage. Begin with discrete components, and design the application to respond automatically to load changes whenever possible. Keep scaling limits in mind during design so you can expand easily in the future.
- Use load balancing to distribute requests
Load balancing distributes your application's requests to healthy service instances by removing unhealthy instances from rotation. Externalizing state information makes backend scaling transparent to the end user. If state is tracked in the session, stickiness may be needed.
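The behavior described above can be sketched with a toy round-robin balancer that skips backends whose health checks fail. This is a minimal in-process model of the idea, not the OCI Load Balancer API; the class name and addresses are invented for illustration.

```python
import itertools

class RoundRobinBalancer:
    """Minimal round-robin balancer that skips unhealthy backends.

    Because no session state lives on the backends, removing one from
    rotation is transparent to callers. Illustrative sketch only.
    """
    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_unhealthy(self, backend):
        self.healthy.discard(backend)

    def mark_healthy(self, backend):
        self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per call; skip unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")

lb = RoundRobinBalancer(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
lb.mark_unhealthy("10.0.0.2")
print([lb.pick() for _ in range(4)])  # 10.0.0.2 never appears
```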
- Build availability and resilience requirements into your design
Resiliency is the ability of a system to recover from failures and continue to function. Availability is the proportion of time your system is functional and working. Implement high availability best practices, such as avoiding single points of failure and decomposing workloads by service-level objective. Utilize the standard features of your data layer, such as application continuity and asynchronous transactions to ensure both availability and resilience.
- Implement DR
Design your solution to meet the RTO and RPO requirements identified. Ensure that you can bring all components of your DR solution online within your RTO. Protect your data so you can meet your RPO. How you store, back up, and replicate data is critical.
- Back up your data
Even with a fully replicated DR environment, regular backups are still critical. Back up and validate data regularly, and make sure no single user account has access to both production and backup data.
- Choose replication methods for your application data
Your application data is stored in various data stores and might have different availability requirements. Evaluate the replication methods and locations for each type of data store to ensure that they satisfy your HA requirements and RPO.
- Understand the implications of failing over and its effect on disaster readiness
Will you need to instantiate another region for replication to meet your workload's requirements? Will you need to worry about data consistency upon failback?
- Document and test your failover and failback processes
Clearly document instructions to fail over to a new data store, and test them regularly to make sure they are accurate and easy to follow.
- Ensure your data recovery plan meets your RTO
Make sure that your backup and replication strategy provides for data recovery times that meet your RTO as well as your RPO. Account for all types of data your application uses, including reference data, files and databases.
Periodically Test with Simulations and Forced Failovers
Testing for reliability requires measuring how the end-to-end workload performs under failure conditions that only occur intermittently.
- Test for common failure scenarios by triggering actual failures or by simulating them
Use fault injection testing to test common scenarios (including combinations of failures) and recovery time.
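At the unit-test level, fault injection can be as simple as wrapping a call site so that a configurable fraction of calls fails. The toy harness below illustrates the idea; production fault injection would target real infrastructure (instances, networks, dependencies), not in-process calls, and all names here are invented.

```python
import random

def with_fault_injection(func, failure_rate, rng=random.Random(42)):
    """Wrap a callable so a configurable fraction of calls raises an error.

    A toy fault-injection harness for exercising error-handling paths in
    tests. The seeded RNG makes runs reproducible.
    """
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault")
        return func(*args, **kwargs)
    return wrapped

# Inject faults into roughly 30% of calls to a stand-in dependency.
flaky_lookup = with_fault_injection(lambda key: f"value:{key}",
                                    failure_rate=0.3)
failures = 0
for i in range(1000):
    try:
        flaky_lookup(i)
    except TimeoutError:
        failures += 1
print(f"injected {failures} failures out of 1000 calls")
```

Running recovery logic against a wrapper like this verifies that retries, fallbacks, and alerts behave as designed before a real outage exercises them.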
- Identify failures that occur only under load
Test for peak load, using production data or synthetic data that is as close to production data as possible, to see how the application behaves under real-world conditions.
- Run disaster recovery drills
Have a disaster recovery plan in place, and test it periodically to make sure it works.
- Perform failover and failback testing
Ensure that your application's dependent services fail over and fail back in the correct order.
- Run simulation tests
Testing real-life scenarios can highlight issues that need to be addressed. Scenarios should be controllable and non-disruptive to the business. Inform management of simulation testing plans.
- Test health probes
Configure health probes for load balancers and traffic managers to check critical system components. Test them to make sure that they respond appropriately.
- Test monitoring systems
Be sure that monitoring systems are reliably reporting critical information and accurate data to help identify potential failures.
- Include third-party services in test scenarios
Test possible points of failure due to third-party service disruption, in addition to recovery.
- Learn from issues encountered during tests
If testing reveals issues or gaps, then ensure that they are identified and addressed either by adding additional monitoring or adjusting operational processes.
Deploy Applications Consistently
Deployment includes provisioning Oracle Cloud Infrastructure (OCI) services and resources, deploying application code, and applying configuration settings. An update may involve all three tasks or a subset of them.
- Automate your application deployment process
Automate as many processes as possible. Where possible, eliminate manual deployments in production; they may be acceptable in lower environments to promote velocity and flexibility.
- Leverage automation to test your code prior to deployment
Testing for bugs, security vulnerabilities, functionality, performance, and integrations are critical to minimizing problems that end users discover. Testing failures should prevent code from being released into production.
- Design your release process to maximize availability
If your release process requires services to go offline during deployment, your application is unavailable until they come back online. Take advantage of platform staging and production features. If possible, release new deployments to a subset of users so that failures surface early, before the full rollout.
- Have a rollback plan for deployment
Design a rollback process to return to a last known good version and to minimize downtime if a deployment fails. Automation of rollback upon failed deployment can prevent unnecessary downtime.
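Rolling back presupposes that you track which deployed versions were actually healthy. The sketch below models that record-keeping in a few lines of Python; the data shapes and version strings are illustrative assumptions, not a specific deployment tool.

```python
deploy_history = []  # most recent last; each entry records version + health

def deploy(version):
    """Record a deployment; health is unknown until checks run."""
    deploy_history.append({"version": version, "healthy": None})

def mark_health(version, healthy):
    """Record the outcome of post-deployment health checks."""
    for entry in deploy_history:
        if entry["version"] == version:
            entry["healthy"] = healthy

def rollback_target():
    """Return the most recent version that passed health checks --
    the 'last known good' to roll back to. Illustrative only."""
    for entry in reversed(deploy_history):
        if entry["healthy"]:
            return entry["version"]
    return None

deploy("v1.4"); mark_health("v1.4", True)
deploy("v1.5"); mark_health("v1.5", False)  # failed deployment
print(rollback_target())  # → v1.4
```

Automating the lookup and redeploy of the last known good version is what turns a failed release into minutes of downtime rather than hours.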
- Log and audit deployments
If you use staged deployment techniques, more than one version of your application is running in production. Implement a robust logging strategy to capture as much version-specific information as possible.
- Document the application release process
Clearly define and document your release process, and ensure that it's available to the entire operations team.
Monitor Application Health
Implement best practices for monitoring and alerts in your application so you can detect failures and alert an operator to fix them.
- Implement tracing around service calls
Baseline performance data can help provide trend data that can be used to proactively identify performance problems before they affect users.
- Implement health probes
Run them regularly from outside the application to identify degradation of application health and performance. These probes should be more than just static page tests; they should reflect holistic application health.
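A holistic probe aggregates checks on the components the application actually depends on, rather than confirming that a single page loads. A minimal sketch, with component names invented for illustration:

```python
def overall_health(component_checks):
    """Aggregate per-component health checks into one probe result.

    Each check is a zero-argument callable returning True when healthy.
    The probe fails if any component fails, and returns per-component
    detail for diagnosis. Illustrative sketch only.
    """
    results = {name: bool(check()) for name, check in component_checks.items()}
    return all(results.values()), results

ok, detail = overall_health({
    "database": lambda: True,
    "object_storage": lambda: True,
    "payment_gateway": lambda: False,  # simulated dependency outage
})
print(ok)      # → False
print(detail)  # shows which component failed
```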
- Check long-running workflows
Catching issues early can minimize the need to roll back the entire workflow or to execute multiple compensating transactions.
- Maintain system, application and audit logs
Utilize a centralized logging service to store your logs.
- Set up an early warning system
Identify the key performance indicators (KPIs) of an application's health, such as transient exceptions and remote call latency, and set appropriate threshold values for each of them. Send an alert to operations when the threshold value is reached.
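Threshold-based alerting boils down to comparing current KPI samples against agreed limits. The sketch below shows the shape of such a check; the KPI names and threshold values are illustrative assumptions, not Oracle-recommended figures.

```python
THRESHOLDS = {  # illustrative KPI thresholds, not recommended values
    "remote_call_latency_ms": 500,
    "transient_exceptions_per_min": 20,
}

def check_kpis(samples):
    """Compare current KPI samples against thresholds; return alert lines
    for every KPI over its limit (missing samples are skipped)."""
    alerts = []
    for kpi, limit in THRESHOLDS.items():
        value = samples.get(kpi)
        if value is not None and value > limit:
            alerts.append(f"ALERT: {kpi}={value} exceeds threshold {limit}")
    return alerts

print(check_kpis({"remote_call_latency_ms": 850,
                  "transient_exceptions_per_min": 3}))
# → ['ALERT: remote_call_latency_ms=850 exceeds threshold 500']
```

In practice the alert lines would feed a notification service rather than stdout, and thresholds would be tuned from baseline trend data.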
- Train multiple operators to monitor the application and to perform manual recovery steps
Make sure there is always at least one trained operator active.
Manage Failures and Disasters
Create a recovery plan, and make sure that it covers data restoration, network outages, dependent service failures, and region-wide service disruptions. Consider your VMs, storage, databases, and other OCI services in your recovery strategy.
- Plan for incident management
Define clear roles and responsibilities for incident management to keep services running, or restore them as quickly as possible.
- Document and test your disaster recovery plan
Write a disaster recovery plan that reflects the business impact of application failures. Automate the recovery process as much as possible, and document any manual steps. Regularly test your disaster recovery process to validate and improve the plan.
- Understand key roles needed for DR coordination
Make sure that DR efforts are well coordinated and applications are prioritized based on business value.
- Prepare for application failure
Prepare for a range of failures, including faults that are handled automatically, those that result in reduced functionality, and those that cause the application to become unavailable. The application should inform users of temporary issues.
- Recover from data corruption
If a failure happens in a data store, check for data inconsistencies when the store becomes available again, especially if the data was replicated. Restore corrupt data from a backup.
- Recover from a network outage
You might be able to use cached data to run locally with reduced application functionality. If not, consider application downtime or fail over to another region. Store your data in an alternate location until connectivity is restored.
- Recover from a dependent service failure
Determine which functionality is still available and how the application should respond.
- Recover from a region-wide service disruption
Region-wide service disruptions are uncommon, but you should have a strategy to address them, especially for critical applications. You might be able to redeploy the application to another region or redistribute traffic.
- Learn from DR tests and improve processes
Ensure that any issues encountered during DR testing are captured, and plans to remediate those issues are addressed in future tests or failovers.
Learn More
- Oracle Cloud Infrastructure Service Level Agreement (SLA)
- OCI Cloud Adoption Framework: Disaster Recovery
- OCI Block Volumes: Policy-Based Backups
- Learn about architecting a highly available cloud topology
- Configure a standby database for disaster recovery
- Design a pilot-light disaster recovery (DR) topology
- Incorporate Cyber-Resilience Capabilities Into Your OCI Tenancy