Building Resilient Solutions in OCI
Organizations need services and solutions that are available to users whenever they need them. Ensuring that solutions have the right level of resilience is a critical part of a successful deployment to OCI.
Resilience is often measured at the component level, yet most services and solutions are a collection of components. Therefore, it is important to ensure that the solution is sufficiently resilient within every component and layer.
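Combining component-level figures shows why they are not enough on their own. The following is a minimal sketch, assuming that component failures are independent (real systems only approximate this); the function names and the sample percentages are illustrative and not taken from any OCI tooling.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent redundant instances of one
    component, where a single working instance is enough."""
    return 1 - (1 - availability) ** copies


if __name__ == "__main__":
    # Three layers (for example network, compute, database), each at 99.95%.
    layers = [0.9995, 0.9995, 0.9995]
    print(f"serial chain:    {serial_availability(layers):.4%}")

    # One 99% component deployed twice, either copy able to serve requests.
    print(f"redundant pair:  {redundant_availability(0.99, 2):.4%}")
```

Under these assumptions, three layers at 99.95% each yield roughly 99.85% end to end, which is lower than any single layer, while duplicating a 99% component raises that layer to roughly 99.99%.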
Architecting highly resilient solutions is usually more complex and more costly than building simple, less resilient ones. Because not every solution needs the highest level of protection, it is prudent to stratify solutions according to their business criticality and to design appropriately resilient architectures for each tier.
This topic describes the tools, techniques, and approaches to consider when designing such architectures.
This topic has four areas of focus:
- Infrastructure Resilience
- Data Estate Resilience
- Application Resilience
- Business Resilience
Before moving through these areas, it’s essential to understand some key concepts and terminology.
Resiliency Concepts
High Availability
Computing environments configured to provide nearly full-time availability are known as highly available (HA) systems. Such systems typically have redundant hardware and software that keep the system available despite failures. When a failure occurs, the failover process moves the processing performed by the failed component to a backup component. The more transparent the failover is to users, the higher the availability of the system.
High availability protects against localized failures (hardware failures, software bugs, etc.).
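The failover behavior described above can be illustrated with a small sketch. It is a simplified, hypothetical example: the `primary` and `backup` functions stand in for redundant components, and `ConnectionError` stands in for whatever failure signal a real system would detect.

```python
def call_with_failover(request, components):
    """Try each redundant component in order and return the first success.

    The caller only sees an error if every component fails, which is what
    makes the failover transparent to users.
    """
    errors = []
    for component in components:
        try:
            return component(request)
        except ConnectionError as exc:
            # Record the failure and move processing to the next component.
            errors.append(exc)
    raise RuntimeError(f"all {len(components)} components failed: {errors}")


def primary(request):
    # Hypothetical primary component that is currently unavailable.
    raise ConnectionError("primary is down")


def backup(request):
    # Hypothetical backup component that takes over the work.
    return f"handled {request!r} on backup"


if __name__ == "__main__":
    print(call_with_failover("order-42", [primary, backup]))
```

The caller passes the same request regardless of which component ends up serving it, so the failure of the primary is invisible apart from any added latency.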
Disaster Recovery
Disaster recovery (DR) involves a set of policies, tools, and procedures that enable the recovery or continuation of vital technology infrastructure and systems. A disaster recovery plan should state the key metrics of recovery point objective (RPO) and recovery time objective (RTO). In many cases, an organization may elect to use an outsourced disaster recovery provider to supply a standby site and systems rather than using its own remote facilities.
Disaster recovery protects against site failures, such as the loss of an entire data center, and so supports business continuity.
Terminology
Downtime
- Downtime is a period during which services and/or systems are unavailable
- Planned downtime is downtime that occurs due to scheduled maintenance, upgrades, updates, or patching activities
- Examples: hardware upgrades, server maintenance, database patching, application changes or updates
- Unplanned downtime is downtime that occurs due to an unforeseen event such as a service or system failure or a natural disaster
- Examples: hardware outages, server or storage failures, data corruption, human error, facilities failures, etc.
RPO/RTO
- RPO (Recovery Point Objective) is the maximum amount of data that can be lost before causing detrimental harm to an organization
- Measured in terms of time; for example, five hours' or two days' worth of data loss.
- It determines the frequency of backups and the choice of replication approach
- RTO (Recovery Time Objective) is the targeted duration of time within which a system or service must be restored after a disruption to avoid unacceptable consequences.
- Measured in terms of time; for example, all production systems must be up and running at 80% of pre-disaster capability within 30 minutes of any unplanned outage.
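Because both objectives are expressed as time, a design can be checked against them with simple comparisons. The sketch below is illustrative only: it assumes that the worst-case data loss equals one full backup interval (a failure just before the next backup) and that an estimated recovery time is available from testing. The function names and sample values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval):
    """Worst case: the failure happens just before the next backup runs,
    so up to one full interval of data is lost (simplifying assumption)."""
    return backup_interval


def meets_rpo(backup_interval, rpo):
    """True if the backup or replication frequency satisfies the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery_time, rto):
    """True if the estimated time to restore service fits within the RTO."""
    return estimated_recovery_time <= rto


if __name__ == "__main__":
    rpo = timedelta(hours=5)
    rto = timedelta(minutes=30)

    print("daily backups meet RPO:  ", meets_rpo(timedelta(days=1), rpo))
    print("4-hourly backups meet RPO:", meets_rpo(timedelta(hours=4), rpo))
    print("45-minute restore meets RTO:", meets_rto(timedelta(minutes=45), rto))
```

With a five-hour RPO, daily backups fall short while four-hourly backups suffice, which is how the RPO drives backup frequency. A tighter RPO eventually pushes the design from periodic backups toward continuous replication.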
SLAs
A cloud SLA (cloud service-level agreement) is an agreement between a cloud service provider and a customer that guarantees a minimum level of service.
- OCI offers end-to-end SLAs covering performance, availability, and manageability of services.
- The standard availability SLA for most Oracle Cloud services is 99.95%
- The standard manageability SLA for most cloud services is 99.9%, and the standard performance SLA (disk, network) is 90%
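To make an availability percentage tangible, it can be converted into a downtime budget. The sketch below assumes a 30-day measurement period purely for illustration; the actual measurement period and the rules for what counts as downtime are defined by the provider's SLA documents, not by this calculation.

```python
def allowed_downtime_minutes(availability_percent, days=30):
    """Minutes of downtime permitted by an availability target over a period.

    `days` is an illustrative assumption, not the contractual period.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.95, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows about {minutes:.1f} minutes per 30 days")
```

Over 30 days, 99.95% availability corresponds to about 21.6 minutes of downtime, and 99.9% to about 43.2 minutes.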