Building Resilient Solutions in OCI

Organizations need services and solutions that are available to users whenever they are needed. Designing in the right level of resilience is a critical part of a successful deployment to OCI.

Resilience is often measured at the component level, yet most services and solutions are a collection of components. Therefore, it is important to ensure that the solution is sufficiently resilient within every component and layer.
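To illustrate why component-level figures do not describe the whole solution, the following minimal sketch combines per-component availabilities. It assumes that components fail independently, that a chain is up only when every component in it is up, and that one surviving copy is enough for a redundant set. The layer names and availability figures are hypothetical and are not OCI commitments.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up."""
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of independent redundant copies where one copy suffices."""
    return 1 - (1 - availability) ** copies


# Hypothetical per-layer availabilities, for illustration only.
layers = {"load_balancer": 0.9999, "app_server": 0.999, "database": 0.9995}

solution = serial_availability(layers.values())
print(f"Whole solution: {solution:.4%}")

# Doubling up the weakest layer raises that layer's availability.
print(f"Two app servers: {redundant_availability(0.999, 2):.4%}")
```

Under these assumptions the solution as a whole is less available than its weakest component, which is why every layer needs attention.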

Highly resilient solutions are usually more complex and more costly to architect than simple, less resilient ones. Because not every solution needs the highest level of protection, it is prudent to stratify solutions by business criticality and design an appropriately resilient architecture for each tier.

This topic describes the tools, techniques and approaches to consider when designing for resilience in this way.

This topic covers four areas of focus.

Before moving through these areas, it’s essential to understand some key concepts and terminology.

Resiliency Concepts

High Availability

Computing environments configured to provide nearly full-time availability are known as highly available (HA) systems. Such systems typically have redundant hardware and software that makes the system available despite failures. When failures occur, the failover process moves processing performed by the failed component to the backup component. The more transparent that failover is to users, the higher the availability of the system.

High availability protects against localized failures (hardware failures, software bugs, etc.).
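The failover process described above can be sketched in a few lines. This is a simplified illustration, not an OCI API: the replicas are modeled as plain Python callables, and the names `call_with_failover`, `failed_primary` and `healthy_backup` are hypothetical.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when neither the primary nor any backup could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(errors)


def failed_primary() -> str:
    # Simulates a failed component.
    raise ConnectionError("primary is down")


def healthy_backup() -> str:
    # Simulates the redundant component that takes over.
    return "served by backup"


result = call_with_failover([failed_primary, healthy_backup])
print(result)
```

The caller receives a normal result and never sees the primary's failure, which is what makes the failover transparent in this sketch.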

Disaster Recovery

Disaster recovery (DR) involves a set of policies, tools and procedures that enable the recovery or continuation of vital technology infrastructure and systems. A disaster recovery plan should state its key metrics: the recovery point objective (RPO) and the recovery time objective (RTO). In many cases, an organization may elect to use an outsourced disaster recovery provider to supply a stand-by site and systems rather than using its own remote facilities.

Disaster recovery protects against site failures (for example, the loss of an entire data center), which helps ensure business continuity.
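As a rough illustration of how the two metrics are checked after an incident, the sketch below uses the commonly accepted meanings: RPO is the maximum tolerable window of lost data, and RTO is the maximum tolerable time to restore service. The function name, timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_recovery_point, failure_time, service_restored, rpo, rto):
    """Return (rpo_met, rto_met) for a single incident."""
    # Data written after the last recovery point is lost.
    data_loss_window = failure_time - last_recovery_point
    # Time the service was unavailable.
    recovery_duration = service_restored - failure_time
    return data_loss_window <= rpo, recovery_duration <= rto


# Hypothetical incident: last backup at 02:00, failure at 02:45, restored at 05:15.
rpo_met, rto_met = meets_objectives(
    last_recovery_point=datetime(2024, 1, 1, 2, 0),
    failure_time=datetime(2024, 1, 1, 2, 45),
    service_restored=datetime(2024, 1, 1, 5, 15),
    rpo=timedelta(hours=1),
    rto=timedelta(hours=2),
)
print(f"RPO met: {rpo_met}, RTO met: {rto_met}")
```

In this example 45 minutes of data is lost, which is within the one-hour RPO, but the recovery takes two and a half hours and misses the two-hour RTO.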

Terminology

Downtime

RPO/RTO

SLAs

A cloud SLA (cloud service-level agreement) is an agreement between a cloud service provider and a customer that commits the provider to a minimum level of service.
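An availability percentage in an SLA is easier to reason about once it is converted into a downtime allowance. The sketch below does that conversion. The 99.9% figure and the 30-day period are assumptions chosen for illustration and do not represent the terms of any specific OCI service.

```python
from datetime import timedelta


def allowed_downtime(sla_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted over a period at a given availability percentage."""
    return period * (1 - sla_percent / 100)


# Hypothetical SLA of 99.9% measured over a 30-day month.
budget = allowed_downtime(99.9, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under these assumptions the allowance is roughly 43 minutes per month. Each additional "nine" reduces the allowance by a factor of ten.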