Design a Pilot-Light Disaster Recovery (DR) Topology

If a large-scale outage affects your production applications, you need the ability to restore the workloads quickly. Your business continuity plan should include a DR strategy that meets your recovery point, recovery time, and budget objectives. A pilot-light topology offers a balance between cost and recovery requirements.

The term pilot light refers to a small flame that is always lit in devices such as gas-powered heaters, and can be used to start the devices quickly when required. In the context of DR, a pilot-light environment contains the core components of a given workload, with the latest configuration and critical data, running at a minimal scale at a location that's remote from the primary site. In the event of a disaster at the primary site, you can use the pilot-light components at the remote location to restore a production-scale environment quickly.

Oracle Cloud Infrastructure provides highly available and scalable infrastructure and services that enable you to design a pilot-light DR topology.

Architecture

This architecture shows a multi-tier topology that has redundant resources distributed across two Oracle Cloud Infrastructure regions.

The following diagram illustrates this reference architecture.

Figure: x-region-pilot-light-topology.png

The architecture has the following components:

  • Regions

    An Oracle Cloud Infrastructure region is a localized geographic area that contains one or more data centers, called availability domains. Regions are independent of other regions, and vast distances can separate them (across countries or even continents).

  • Availability domains

    Availability domains are standalone, independent data centers within a region. The physical resources in each availability domain are isolated from the resources in the other availability domains, which provides fault tolerance. Availability domains don’t share infrastructure such as power or cooling, or the internal availability domain network. So, a failure at one availability domain is unlikely to affect the other availability domains in the region.

    The architecture diagram doesn't show availability domains. But in regions that have multiple availability domains, you can distribute the resources in each region across the availability domains, for high availability.

  • Fault domains

    A fault domain is a grouping of hardware and infrastructure within an availability domain. Each availability domain has three fault domains with independent power and hardware. When you distribute resources across multiple fault domains, your applications can tolerate physical server failure, system maintenance, and power failures inside a fault domain.

    The architecture diagram doesn't show fault domains. But to protect against failure within a fault domain, you can distribute the resources in each availability domain across the fault domains.

  • Virtual cloud networks (VCN) and subnets

    A VCN is a customizable, software-defined network that you set up in an Oracle Cloud Infrastructure region. Like traditional data center networks, VCNs give you complete control over your network environment. A VCN can have multiple non-overlapping CIDR blocks that you can change after you create the VCN. You can segment a VCN into subnets, which can be scoped to a region or to an availability domain. Each subnet consists of a contiguous range of addresses that don't overlap with the other subnets in the VCN. You can change the size of a subnet after creation. A subnet can be public or private.

    In this reference architecture, all the resources in each region are attached to a single VCN.

  • Bastion host

    The bastion host is a compute instance that serves as a secure, controlled entry point to the topology from outside the cloud. The bastion host is typically provisioned in a demilitarized zone (DMZ). It enables you to protect sensitive resources by placing them in private networks that can't be accessed directly from outside the cloud. The topology has a single, known entry point that you can monitor and audit regularly. So, you can avoid exposing the more sensitive components of the topology without compromising access to them.

  • Load balancer

    The Oracle Cloud Infrastructure Load Balancing service provides automated traffic distribution from a single entry point to multiple servers in the back end.

  • Internet gateway

    The internet gateway allows traffic between the public subnets in a VCN and the public internet.

  • Compute instances

    The primary region includes two compute instances for the application tier.

    The standby region has a compute instance for mounting the replicated file storage. The other two compute instances in the standby region represent servers that you can create by using replicated boot volumes and block volumes, in the event of a disaster in the primary region.

  • Block volumes

    With block storage volumes, you can create, attach, connect, and move storage volumes, and change volume performance to meet your storage, performance, and application requirements. After you attach and connect a volume to an instance, you can use the volume like a regular hard drive. You can also disconnect a volume and attach it to another instance without losing data.

    The architecture shows the boot volumes and block volumes in the primary region being replicated to the standby region. With this design, in the event of a disaster in the primary region, you can restore the application tier quickly in the standby region, by provisioning compute instances using the replicated boot and block volumes.

  • File storage

    The Oracle Cloud Infrastructure File Storage service provides a durable, scalable, secure, enterprise-grade network file system. You can connect to a File Storage service file system from any bare metal, virtual machine, or container instance in a VCN. You can also access a file system from outside the VCN by using Oracle Cloud Infrastructure FastConnect and IPSec VPN.

    The architecture shows file storage in the primary region being replicated to the standby region using a script.

  • Object storage

    Object storage provides quick access to large amounts of structured and unstructured data of any content type, including database backups, analytic data, and rich content such as images and videos. You can safely and securely store and then retrieve data directly from the internet or from within the cloud platform. You can seamlessly scale storage without experiencing any degradation in performance or service reliability. Use standard storage for "hot" storage that you need to access quickly and frequently. Use archive storage for "cold" storage that you retain for long periods and rarely access.

    The architecture shows object storage in the primary region being replicated to the standby region automatically by using a cross-region replication policy.

  • Application Server

    Application servers use a secondary peer that, like the standby database, takes over processing in the event of a disaster. Application servers use configuration and metadata that are stored in both the database and the file system. Application server clustering provides protection within a single region, but configuration changes and new deployments must be replicated to the secondary location continuously so that the secondary peer stays consistent for disaster recovery.

  • Database

    The architecture includes a database in each region. Oracle Data Guard is used for data replication, and ensures that the standby database is a transactionally consistent copy of the primary database.

    Data Guard automatically maintains synchronization between the databases by transmitting and applying redo data from the primary database to the standby. In the event of a disaster in the primary region, Data Guard fails over automatically to the standby database.

  • Dynamic routing gateway (DRG)

    The DRG is a virtual router that provides a path for private network traffic between a VCN and a network outside the region, such as a VCN in another Oracle Cloud Infrastructure region, an on-premises network, or a network in another cloud provider.

  • NAT gateway

    The NAT gateway enables private resources in a VCN to access hosts on the internet, without exposing those resources to incoming internet connections.

  • Service gateway

    The service gateway provides access from a VCN to other services, such as Oracle Cloud Infrastructure Object Storage. The traffic from the VCN to the Oracle service travels over the Oracle network fabric and never traverses the internet.
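
The File storage component above is replicated to the standby region "using a script", but the form of that script isn't specified. As one possible sketch, assuming the file system is mounted on a compute instance in each region and that the standby instance is reachable over SSH through the private connection between the regions, a small Python wrapper around rsync might look like the following. The mount path `/mnt/fss/app` and the host name `standby-fss-host` are hypothetical placeholders, not values from this architecture.

```python
import shlex
import subprocess
from typing import List


def build_rsync_command(source_dir: str, standby_host: str, dest_dir: str) -> List[str]:
    """Build an rsync command that mirrors source_dir to dest_dir on standby_host."""
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    return [
        "rsync",
        "-a",         # archive mode: preserve permissions, timestamps, symlinks
        "--delete",   # remove files on the standby that were deleted on the primary
        "-e", "ssh",  # use SSH as the transport
        source,
        f"{standby_host}:{dest_dir}",
    ]


def replicate(source_dir: str, standby_host: str, dest_dir: str) -> None:
    """Run one replication pass; raises CalledProcessError if rsync fails."""
    subprocess.run(build_rsync_command(source_dir, standby_host, dest_dir), check=True)


if __name__ == "__main__":
    # Hypothetical mount point and host name, shown for illustration only.
    cmd = build_rsync_command("/mnt/fss/app", "standby-fss-host", "/mnt/fss/app")
    print(shlex.join(cmd))
```

A script like this would typically be scheduled (for example, with cron) at an interval chosen to meet the RPO for the shared file data. Because `--delete` propagates deletions, an accidental deletion on the primary is also replicated, so it doesn't replace backups.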

Recommendations

Use the following recommendations as a starting point to design your pilot-light DR topology. Your requirements might differ from the architecture described here.

  • VCN

    When you create each VCN, determine how many IP addresses your cloud resources in each subnet require. Using the Classless Inter-Domain Routing (CIDR) notation, specify a subnet mask and a network address range that's large enough for the required IP addresses. Use an address range that's within the standard private IP address space.

    Select CIDR blocks that don't overlap with any other network (in Oracle Cloud Infrastructure, your on-premises data center, or another cloud provider) to which you intend to set up private connections.

    After you create a VCN, you can change, add, and remove its CIDR blocks.

    When you design the subnets, consider your traffic flow and security requirements. Attach all the resources within a specific tier or role to the same subnet, which can serve as a security boundary.

    Use regional subnets.

  • Security lists

    To allow cross-region replication of the database and file storage, configure the required security lists. Note that replication of the boot volumes and block volumes doesn't require communication between the hosts to which the volumes are attached.

  • Block volumes backup policy

    Configure a policy to take backups of the block volumes as frequently as necessary to meet your RPO.

  • Application Servers and custom applications running on Oracle Platform as a Service (PaaS)

    PaaS services, such as Oracle SOA Cloud Service and Oracle WebLogic Server for Oracle Cloud Infrastructure, use most of the resources mentioned above internally (compute, block volumes, file storage, networking, database). They require specific disaster recovery strategies that protect all the different layers in a consistent fashion. Oracle provides detailed best practices intended to create maximum availability architectures (MAA) and protect these types of systems against disasters. See Explore More for specific documentation on disaster recovery (DR) for PaaS.
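
The VCN recommendations above (size each subnet for the addresses it needs, stay within private address space, and avoid overlapping CIDR blocks across networks that you intend to connect) can be checked before provisioning. The following is a minimal sketch that uses Python's standard `ipaddress` module. The network names and CIDR values are hypothetical, and the default of three reserved addresses per subnet and the /16 to /30 size limits are assumptions to verify against the current Oracle Cloud Infrastructure networking documentation.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(named_cidrs: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return the pairs of network names whose CIDR blocks overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in named_cidrs.items()}
    return [
        (name_a, name_b)
        for (name_a, net_a), (name_b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


def is_private_range(cidr: str) -> bool:
    """True if the block is in private/reserved space (RFC 1918 and similar ranges)."""
    return ipaddress.ip_network(cidr).is_private


def smallest_prefix_for(host_count: int, reserved: int = 3) -> int:
    """Return the longest IPv4 prefix whose block fits host_count plus reserved addresses.

    reserved=3 and the /16../30 search range are assumptions, not values
    taken from this document.
    """
    needed = host_count + reserved
    for prefix in range(30, 15, -1):  # try /30 first, then progressively larger blocks
        if 2 ** (32 - prefix) >= needed:
            return prefix
    raise ValueError("host_count does not fit in a /16 block")


if __name__ == "__main__":
    # Hypothetical address plan for the two regions and an on-premises network.
    plan = {
        "primary-vcn": "10.0.0.0/16",
        "standby-vcn": "10.1.0.0/16",
        "on-premises": "172.16.0.0/12",
    }
    print("overlaps:", find_overlaps(plan))
    print("all private:", all(is_private_range(c) for c in plan.values()))
    print("prefix for 50 hosts: /%d" % smallest_prefix_for(50))
```

Running a check like this against both regions' VCNs and any on-premises ranges helps catch an overlap before you set up private connections between them.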

Considerations

When implementing your pilot-light DR setup, consider the following factors:

  • Performance

    When planning the RPO and RTO, consider the time required for volume backups to be copied across regions.

  • Availability

    You can use DNS steering management to redirect client traffic to the current production region after a failover.

    If you use compute shapes that provide locally attached NVMe devices, you can back up the data on these devices by using traditional backup solutions that use object storage.

  • Cost

    In the event of a failover from the primary to the standby region, you can provision the required infrastructure quickly by using Terraform scripts. You can resize the database systems after provisioning them, so specify the minimum shape required initially and change to a larger shape after the failover.
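
The Performance consideration above, together with the block volume backup policy recommendation, comes down to simple arithmetic: data written just after one backup is not protected in the standby region until the next backup has been taken and copied across regions. The following sketch works through that calculation under a simplifying assumption, namely that the time to take the backup itself is negligible. The interval, copy time, and target values are hypothetical, and the function names are illustrative only.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, copy_time: timedelta) -> timedelta:
    """Worst-case data loss for backup-based replication.

    If a disaster strikes just before the next backup finishes copying to the
    standby region, the newest usable copy is one full interval plus one copy
    time old.
    """
    return backup_interval + copy_time


def max_backup_interval(rpo_target: timedelta, copy_time: timedelta) -> timedelta:
    """Longest backup interval that still meets rpo_target for a given copy time."""
    if copy_time >= rpo_target:
        raise ValueError("cross-region copy time alone exceeds the RPO target")
    return rpo_target - copy_time


if __name__ == "__main__":
    # Hypothetical figures: backups every 4 hours, 45 minutes to copy across regions.
    interval = timedelta(hours=4)
    copy = timedelta(minutes=45)
    target = timedelta(hours=4)

    exposure = worst_case_rpo(interval, copy)
    print("worst-case RPO:", exposure)
    print("meets target:", exposure <= target)
    print("max interval for target:", max_backup_interval(target, copy))
```

With these example figures, a 4-hour backup interval doesn't meet a 4-hour RPO once the copy time is included; the interval would need to be shortened by at least the copy time.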