Infrastructure Resilience
Infrastructure resilience is principally concerned with ensuring that best practice guidelines are followed for Compute, Storage and Network resources.
Compute Resilience Best Practice
When to Use Fault Domains
If you deploy a solution to a region with only one availability domain, explicitly distribute clustered compute instances across fault domains. This will avoid single points of failure.
In the example shown above, we can see two nodes of a clustered application server deployed to separate fault domains. The fault domain has to be stated explicitly when deploying each instance.
By default, OCI chooses a fault domain for you, balancing availability domain capacity against the aim of delivering anti-affinity.
For more information on fault domains, see Fault Domains.
When to Use Multiple Availability Domains
Some regions have multiple availability domains. If this is the case for your deployment, place compute instances that perform the same function in different availability domains. This offers even greater protection than fault domains alone. Deploying into different availability domains means that your application will be protected not just against hardware failure but also against any data-centre-wide outage.
In the example shown above, we can see that a three-node web cluster has been deployed across availability domains 1, 2 and 3. We do not need to concern ourselves with explicitly referencing fault domains, as no nodes of the same type have been deployed in the same availability domain.
This diagram extends the previous example.
For scalability purposes, we have now deployed a fourth web server node within availability domain 3.
In this case, it would be wise to state the fault domain to which the fourth node should be deployed. We aim to ensure that the fourth node is not inadvertently placed in the same fault domain as an existing node.
If the number of nodes in a cluster, or in a collection of instances that perform the same task, is expected to exceed the number of availability domains, we recommend explicitly stating fault domains as part of the deployment specification.
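The placement rule described above can be sketched in a few lines of Python. This is a minimal illustration, not an OCI API call: the function name `plan_placement`, the node names `web-N`, and the short labels `AD-1` to `AD-3` are assumptions made for the example. It relies only on each availability domain containing three fault domains. Nodes are spread across availability domains first, and a node that shares an availability domain with an earlier node is given the next fault domain.

```python
# Each OCI availability domain contains three fault domains.
FAULT_DOMAINS = ["FAULT-DOMAIN-1", "FAULT-DOMAIN-2", "FAULT-DOMAIN-3"]


def plan_placement(node_count, availability_domains):
    """Spread nodes across availability domains, then across fault domains.

    Returns a list of (node_name, availability_domain, fault_domain) tuples.
    Node names and availability domain labels are hypothetical.
    """
    placements = []
    # Track how many nodes each availability domain already holds.
    nodes_per_ad = {ad: 0 for ad in availability_domains}
    for index in range(node_count):
        # Rotate through the availability domains first.
        ad = availability_domains[index % len(availability_domains)]
        # Within one availability domain, move to the next fault domain.
        fd = FAULT_DOMAINS[nodes_per_ad[ad] % len(FAULT_DOMAINS)]
        nodes_per_ad[ad] += 1
        placements.append((f"web-{index + 1}", ad, fd))
    return placements


if __name__ == "__main__":
    for node in plan_placement(4, ["AD-1", "AD-2", "AD-3"]):
        print(node)
```

With four nodes and three availability domains, the fourth node has to share an availability domain with an earlier node, so the sketch assigns it a different fault domain. The rotation here happens to place it in the first availability domain rather than the third, but the principle is the same: the fault domain is chosen explicitly rather than left to default placement.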
For more information on availability domains, see Availability Domains.
Storage Resilience Best Practice
File Storage
- Use File Storage for a durable, scalable, enterprise-grade network file system. This is ideal for enterprise applications that need shared files (NAS).
- File Storage can provide high-performance and resilient data protection for applications.
- Use Block Volume policy-based backups to perform automatic, scheduled backups and retain them based on a backup policy.
Block Storage
- Block Volume Backups can be restored as new volumes to any Availability Domain within the same region and copied from one region to another.
- Use Block Volume cloning for an immediate, point-in-time, direct disk-to-disk copy of the block volume.
Object Storage
- Object Storage is designed to be highly durable. Multiple copies of the data are stored redundantly across multiple storage servers and Availability Domains. Data integrity is actively monitored using checksums, and corrupt data is automatically detected and repaired from redundant copies. Any loss of data redundancy is actively managed by recreating a copy of the data.
Copying backups to another region at regular intervals makes it easier to rebuild applications and data in the destination region if a region-wide disaster occurs in the source region.
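The checksum-based detection and repair described for Object Storage can be illustrated with a small, self-contained sketch. It does not show how the service is implemented internally. The function names, the use of SHA-256, and the in-memory list of replicas are all assumptions made for the example. The idea is simply that a copy whose checksum no longer matches is replaced from a copy that still does.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return a SHA-256 digest of the data (the algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def heal_replicas(replicas, expected_checksum):
    """Replace corrupt copies in place and return the indexes that were repaired."""
    # Find any copy that still matches the recorded checksum.
    healthy = next(
        (copy for copy in replicas if checksum(copy) == expected_checksum), None
    )
    if healthy is None:
        raise ValueError("no healthy replica left to repair from")
    repaired = []
    for index, copy in enumerate(replicas):
        if checksum(copy) != expected_checksum:
            replicas[index] = healthy  # recreate the copy from a good replica
            repaired.append(index)
    return repaired


if __name__ == "__main__":
    original = b"example object contents"
    copies = [original, b"bit-rotted contents", original]
    print(heal_replicas(copies, checksum(original)))
```

If every copy were corrupt, the sketch raises an error rather than guessing. That is the situation a backup copied to another region is meant to cover.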
Network Resilience Best Practice
- IPSec VPN: use multiple redundant IPSec tunnels and static routes to route traffic. To ensure high availability, set up the VPN connection within the internal network to use either path when needed.
- FastConnect: build fault-tolerant connections including multiple points-of-presence (POPs) per region and multiple FastConnect routers per POP.
- For an additional level of redundancy, set up both IPSec VPN and FastConnect to connect the on-prem data centers to OCI.
For more guidance on connectivity best practice, see Connecting to Oracle Cloud Infrastructure in the Cloud Foundation section.
- Load Balancer: place Load Balancers in regional subnets to enhance the failover architecture on OCI.
- DNS: use DNS to automatically direct traffic to a different endpoint as soon as the service fails to respond.
For more information about the load balancer service, see Load Balancer Service.
To learn more about the Oracle DNS service, see Oracle DNS Service.
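The DNS failover behaviour mentioned above, where traffic is directed to a different endpoint once the service stops responding, can be sketched as a priority list with a health check. This is a conceptual illustration only and does not use the OCI DNS API: the function name `resolve`, the health-check callable, and the example addresses (taken from the reserved documentation ranges) are assumptions.

```python
def resolve(endpoints, is_healthy):
    """Return the first endpoint, in priority order, that passes its health check."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


if __name__ == "__main__":
    # Hypothetical primary and secondary load balancer addresses.
    endpoints = ["192.0.2.10", "198.51.100.10"]
    # Simulated health-check results; a real check would probe the service.
    status = {"192.0.2.10": False, "198.51.100.10": True}
    print(resolve(endpoints, lambda endpoint: status[endpoint]))
```

Because the list is ordered, traffic returns to the primary endpoint as soon as its health check passes again.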
Example High Availability Architectures
There are many ways of deploying highly available architectures in OCI. These can be either active/passive or active/active. A fully redundant approach requires the design to take into account each layer of the application.
The following examples introduce some of the highly resilient architectures that can be implemented in OCI.
Regions With Multiple Availability Domains
The example depicted above shows an application with redundancy at each layer that protects against software failure, hardware failure, and data centre failure.
This application consists of four application servers running the same code and delivering the same functionality.
Each of the four application servers is deployed in a separate fault domain. This immediately provides protection against power or hardware failure for application servers in the same availability domain.
The application servers are distributed over two availability domains, meaning they also enjoy protection from wider data centre outages.
Traffic is routed evenly across the four application servers via the Load Balancer service. The load balancer service can operate across availability domains and has redundancy built-in using a standby load balancer in the second availability domain. The standby load balancer will automatically take over the role of the primary load balancer should a failure occur.
Each of the active application servers reads from and writes to a single database service deployed within availability domain one. This is a single-node Oracle database running inside a virtual machine.
Tip:
There is a range of capabilities inside the Oracle database to mitigate against system and human error and protect against subsequent outages or data loss.
Further protection is provided by replicating database blocks across to a standby database deployed in availability domain two. This replication keeps the standby database up to date with the primary and, in this case, is configured for no data loss to occur. Should the primary database fail, the standby database will take over.
Regions With Single Availability Domains
The above example shows a highly available cross-region architecture. This may be appropriate when you require greater resilience than that provided by multiple availability domains or where you can only deploy to regions with a single availability domain.
Just as in the previous example, four application servers are running the same code and delivering the same functionality. The application servers have been deployed across two regions: Sydney and Melbourne.
Because each of these regions has just one availability domain, the application servers have been explicitly placed in separate fault domains to protect against hardware and power failures.
Each pair of regional application servers has traffic routed to them by their regional load balancer service.
Users are directed to the most appropriate load balancer service by the OCI DNS Service. This could be a load balancing approach, or traffic could be steered according to their geography or IP address. This may help ensure that response times from servers are optimal according to the users’ location.
At any one time, all four application servers read and write to a single database. This is made possible by remote peering between the virtual networks in the two regions, with traffic transiting a dynamic routing gateway in each region. The peering carries traffic in both directions, allowing the application servers based in Melbourne to access the database in Sydney.
Using the same remote peering connection, the primary database in Sydney replicates its data asynchronously to the standby database in Melbourne.
- If the entire Sydney region became unavailable, the database would fail over to Melbourne, and all users and traffic would be routed to the Melbourne region.
- If the database in Sydney became unavailable, the database would fail over to Melbourne. Users and traffic would continue to be routed to their local application servers.
- If one of the application servers became unavailable in Sydney, the service would continue as usual but at reduced capacity.
- If both application servers or the load balancer became unavailable in Sydney, users and traffic would be re-routed to Melbourne. The primary database would remain in Sydney.
Although the narrative above describes Melbourne acting as the failover site for Sydney, this architecture enables the same protection and failover capabilities in either direction.
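The failure scenarios above can be summarised as a small decision function. This is a sketch of the narrative only, not of any OCI service: the `RegionState` class, its field names, and the `plan_failover` function are hypothetical. It separates two questions that the text treats independently: which regions receive user traffic, and which region hosts the primary database.

```python
from dataclasses import dataclass


@dataclass
class RegionState:
    """Health of one region's components (field names are hypothetical)."""
    region_up: bool = True
    load_balancer_up: bool = True
    app_servers_up: int = 2
    database_up: bool = True

    def can_serve_traffic(self) -> bool:
        # Traffic needs the region, its load balancer, and at least one server.
        return self.region_up and self.load_balancer_up and self.app_servers_up > 0

    def can_host_primary(self) -> bool:
        return self.region_up and self.database_up


def plan_failover(primary, standby, states):
    """Return (regions receiving traffic, region hosting the primary database)."""
    traffic = [
        name for name in (primary, standby) if states[name].can_serve_traffic()
    ]
    if states[primary].can_host_primary():
        database = primary  # no database failover needed
    elif states[standby].can_host_primary():
        database = standby  # standby takes over as primary
    else:
        database = None
    return traffic, database


if __name__ == "__main__":
    states = {
        "Sydney": RegionState(app_servers_up=0),
        "Melbourne": RegionState(),
    }
    print(plan_failover("Sydney", "Melbourne", states))
```

Swapping the `primary` and `standby` arguments models the reverse direction, with Sydney acting as the failover site for Melbourne.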