Active-Passive Disaster Recovery

The Oracle Linux Virtualization Manager active-passive disaster recovery solution can span two sites. If the primary site becomes unavailable, the Oracle Linux Virtualization Manager environment can be forced to failover to the secondary (backup) site.

Failover is achieved by configuring a secondary site with:

  • An active Oracle Linux Virtualization Manager Engine.

  • A data center and clusters.

  • Networks with the same general connectivity as the primary site.

  • Active hosts capable of running critical virtual machines after failover.

Important:

You must ensure that the secondary environment has enough resources to run the failed over virtual machines, and that both the primary and secondary environments have identical Engine versions, data center and cluster compatibility levels, and PostgreSQL versions.
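
For example, to confirm that the Engine and PostgreSQL versions match, you can query the installed packages on each Engine host (a quick sanity check only; the package names assume a default installation):

# rpm -q ovirt-engine postgresql-server

The data center and cluster compatibility levels can be compared in the Administration Portal of each site.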

Storage domains that contain virtual machine disks and templates in the primary site must be replicated. These replicated storage domains must not be attached to the secondary site.

Failover and failback are executed manually using Ansible playbooks that map entities between the sites and manage the processes. The mapping file instructs the Oracle Linux Virtualization Manager components where to failover or failback to.

Network Considerations

You must ensure that the same general connectivity exists in the primary and secondary sites. If you have multiple networks or multiple data centers then you must use an empty network mapping in the mapping file to ensure that all entities register on the target during failover.

Storage Considerations

The storage domain for Oracle Linux Virtualization Manager can be made of either block devices (iSCSI or FCP) or a file system (NAS/NFS or GlusterFS). Local storage domains are unsupported for disaster recovery.

Your environment must have a primary and secondary storage replica. The primary storage domain’s block devices or shares that contain virtual machine disks or templates must be replicated. The secondary storage must not be attached to any data center and is added to the backup site’s data center during failover.

If you are implementing disaster recovery using a self-hosted engine, ensure that the storage domain used by the self-hosted engine's Engine virtual machine does not contain virtual machine disks because the storage domain will not failover.

You can use any storage solutions that have replication options supported by Oracle Linux 8 and later.

Important:

Metadata for all virtual machines and disks resides on the storage data domain as OVF_STORE disk images. This metadata is used when the storage data domain is moved by failover or failback to another data center in the same or different environment.

By default, the metadata is automatically updated by the Engine at 60-minute intervals, which means that you can potentially lose any changes made during an interval. To avoid such loss, you can manually update the metadata from the Administration Portal by navigating to the storage domain section and clicking Update OVFs. Alternatively, you can modify the Engine parameters to change the update frequency, for example:

# engine-config -s OvfUpdateIntervalInMinutes=30 && systemctl restart ovirt-engine

For more information, see the Storage topics in the Oracle Linux Virtualization Manager: Administration Guide and the Oracle Linux Virtualization Manager: Architecture and Planning Guide.

Creating the Ansible Playbooks

You use Ansible to initiate and manage the active-passive disaster recovery failover and failback through Ansible playbooks that you create. For more information about Ansible playbooks, see the Ansible documentation.

Before you begin creating your Ansible playbooks, review the following prerequisites and limitations:

  • The primary site has a fully functioning Oracle Linux Virtualization Manager environment.

  • A backup environment in the secondary site with the same data center and cluster compatibility level as the primary environment. The backup environment must have:

    • An Oracle Linux Virtualization Manager Engine

    • Active hosts capable of running the virtual machines and connecting to the replicated storage domains

    • A data center with clusters

    • Networks with the same general connectivity as the primary site

  • Replicated storage that contains the virtual machines and templates and is not attached to the secondary site.

  • The ovirt-ansible-collection package must be installed on the highly available Ansible Engine machine to automate the failover and failback.

  • The machine running the Ansible Engine must be able to use SSH to connect to the Engine in both the primary and secondary sites (see the example after this list).
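
For example, on the machine that will run the playbooks you might install the collection and verify SSH connectivity to both Engines (a minimal sketch; the hostnames follow the examples used elsewhere in this document):

# dnf install ovirt-ansible-collection
# ssh root@manager1.mycompany.com exit
# ssh root@manager2.mycompany.com exit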

Note:

We recommend that you create, on the secondary site, the environment properties that exist in the primary site, such as affinity groups, affinity labels, and users. The default behavior of the Ansible playbooks can be configured in the /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/defaults/main.yml file.

You must create the following required Ansible playbooks:

  • Playbook that creates the file to map entities on the primary and secondary sites

  • Failover playbook

  • Failback playbook

The playbooks and associated files that you create must reside in /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery on the Ansible machine that is managing the failover and failback. If you have multiple Ansible machines that can manage it, ensure that you copy the files to all of them.
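
For example, to copy the playbooks and mapping file from the current Ansible machine to a second one (ansible2 is a hypothetical hostname), you might run:

# scp /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/*.yml \
    ansible2:/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/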

After configuring active-passive disaster recovery, you should test and verify the configuration. See Testing the Active-Passive Configuration.

Simplifying Ansible Tasks Using the ovirt-dr Script

You can use the ovirt-dr script, located in /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery, to simplify these Ansible tasks:

  • Generating a var mapping file of the primary and secondary sites for failover and failback

  • Validating the var mapping file

  • Executing failover on a target site

  • Executing failback from a target site to a source site

The following shows the general usage of the ovirt-dr script:

# ./ovirt-dr generate/validate/failover/failback
    [--conf-file=dr.conf]
    [--log-file=ovirt-dr-log_number.log]
    [--log-level=DEBUG/INFO/WARNING/ERROR]

You can optionally make the following customizations:

  • Set parameters for the script’s actions in the configuration file: /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/files/dr.conf.

  • Change the location of the configuration file using the --conf-file option.

  • Set the location of the log file using the --log-file option.

  • Set the level of logging detail using the --log-level option.
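
For example, following the usage shown above, you could validate the mapping file with debug-level logging written to a custom log file (the file name is illustrative):

# ./ovirt-dr validate --log-file=/var/tmp/ovirt-dr-validate.log --log-level=DEBUG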

Generating the Mapping File Using an Ansible Playbook

The Ansible playbook used to generate the mapping file prepopulates the file with the primary site’s entities. You must then manually add the backup site’s entities to the file, such as IP addresses, clusters, affinity groups, affinity labels, external LUN disks, authorization domains, roles, and vNIC profiles.

Important:

Generating the mapping file will fail if you have any virtual machine disks on the self-hosted engine’s storage domain. Also, the generated mapping file will not contain an attribute for this storage domain because it must not be failed over.

To create the mapping file, complete the following steps.

  1. Create an Ansible playbook using a yaml file (such as dr-olvm-setup.yml) to generate the mapping file. For example:

    ---
    - name: Setup oVirt environment
      hosts: localhost
      connection: local
      vars:
         site: https://manager1.mycompany.com/ovirt-engine/api
         username: admin@internal
         password: Mysecret1
         ca: /etc/pki/ovirt-engine/ca.pem
         var_file: disaster_recovery_vars.yml
      roles:
         - disaster_recovery
      collections:
         - ovirt.ovirt

    For extra security you can encrypt your Engine password in a .yml file.
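
    For example, you can use ansible-vault to produce an encrypted value to paste into the vars file in place of the plain-text password (a sketch only; you are prompted for a vault password, which you must later supply with --ask-vault-pass when running the playbook):

    # ansible-vault encrypt_string 'Mysecret1' --name 'password'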

  2. Run the Ansible command to generate the mapping file. The primary site’s configuration will be prepopulated.

    # ansible-playbook dr-olvm-setup.yml --tags "generate_mapping"
  3. Configure the generated mapping .yml file with the backup site’s configuration. For more information, see Mapping File Attributes.

If you have multiple Ansible machines that can perform failover and failback, then copy the mapping file to all relevant machines.

Creating Failover and Failback Playbooks

Before creating the failover and failback playbooks, ensure you have created and configured the mapping file, which must be added to the playbooks.

To create the failover and failback playbooks, complete the following steps.

  1. Optionally, define a password file (such as passwords.yml) to store the Engine passwords of the primary and secondary sites, for example:

    ---
    # This file is in plain text. If you want to
    # encrypt this file, execute the following command:
    #
    # $ ansible-vault encrypt passwords.yml
    #
    # It will ask you for a password, which you must then pass to
    # ansible interactively when executing the playbook.
    #
    # $ ansible-playbook myplaybook.yml --ask-vault-pass
    #
    dr_sites_primary_password: primary_password
    dr_sites_secondary_password: secondary_password

    For extra security you can encrypt the password file. However, you will need to use the --ask-vault-pass parameter when running the playbook.

  2. Create an Ansible playbook using a failover yaml file (such as dr-olvm-failover.yml) to failover the environment, for example:

    ---
    - name: oVirt Failover
      hosts: localhost
      connection: local
      vars:
         dr_target_host: secondary
         dr_source_map: primary
      vars_files:
         - disaster_recovery_vars.yml
      roles:
         - disaster_recovery
      collections:
         - ovirt.ovirt
  3. Create an Ansible playbook using a failback yaml file (such as dr-olvm-failback.yml) to failback the environment, for example:

    ---
    - name: oVirt Failback
      hosts: localhost
      connection: local
      vars:
         dr_target_host: primary
         dr_source_map: secondary
      vars_files:
         - disaster_recovery_vars.yml
      roles:
         - disaster_recovery
      collections:
         - ovirt.ovirt

Executing a Failover

Before executing a failover, ensure you have read and understood the Network Considerations and Storage Considerations. You must also ensure that:

  • the Engine and hosts in the secondary site are running.

  • replicated storage domains are in read/write mode.

  • no replicated storage domains are attached to the secondary site.

  • a machine running the Ansible Engine that can connect via SSH to the Engine in the primary and secondary site, with the required packages and files:

    • The ovirt-ansible-collection package.

    • The mapping file and failover playbook.

Sanlock must release all storage locks from the replicated storage domains before the failover process starts. These locks should be released automatically approximately 80 seconds after the disaster occurs.
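
To check which leases a host currently holds before starting the failover, you can, for example, inspect sanlock directly on that host:

# sanlock client status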

To execute a failover, run the failover playbook on the Engine host using the following command:

# ansible-playbook dr-olvm-failover.yml --tags "fail_over"

When the primary site becomes active, ensure that you clean the environment before failing back. For more information, see Cleaning the Primary Site.

Continue with a failback. See Executing a Failback.

Cleaning the Primary Site

After you failover, you must clean the environment in the primary site before failing back to it. Cleaning the primary site's environment:

  • Reboots all hosts in the primary site.

  • Ensures the secondary site’s storage domains are in read/write mode and the primary site’s storage domains are in read-only mode.

  • Synchronizes the replication from the secondary site’s storage domains to the primary site’s storage domains.

  • Cleans the primary site of all storage domains to be imported. This can be done manually in the Engine. For more information, see Detaching a Storage Domain from a Data Center.

For example, create a cleanup yml file (such as dr_cleanup_primary_site.yml):

---
- name: oVirt Cleanup Primary Site
  hosts: localhost
  connection: local
  vars:
     dr_source_map: primary
  vars_files:
     - disaster_recovery_vars.yml
  roles:
     - disaster_recovery
  collections:
     - ovirt.ovirt
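
To run the cleanup playbook, you can use a command of the following form. This assumes the disaster_recovery role's clean_engine tag; verify the tag name against your installed ovirt-ansible-collection version:

# ansible-playbook dr_cleanup_primary_site.yml --tags "clean_engine"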

After you have cleaned the primary site, you can failback the environment to it. For more information, see Executing a Failback.

Executing a Failback

After failover, you can failback to the primary site when it is active and you have performed the necessary steps to clean the environment. Before executing a failback, ensure that:

  • The primary site's environment is running and has been cleaned. For more information, see Cleaning the Primary Site.

  • The environment in the secondary site is running and has active storage domains.

  • A machine running the Ansible Engine can connect via SSH to the Engine in both the primary and secondary sites, and has the required packages and files:

    • The ovirt-ansible-collection package.

    • The mapping file and required failback playbook.

To execute a failback, complete the following steps.

  1. Run the failback playbook on the Engine host using the following command:

    #  ansible-playbook dr-olvm-failback.yml --tags "fail_back"
  2. Enable replication from the primary storage domains to the secondary storage domains.

Testing the Active-Passive Configuration

After configuring your disaster recovery solution, you must test it using one of the following options:

  1. Test failover while the primary site remains active and without interfering with virtual machines on the primary site’s storage domains. See Discreet Failover Test.

  2. Test failover and failback using specific storage domains attached to the primary site which allows the primary site to remain active. See Discreet Failover and Failback Tests.

  3. Test failover and failback for an unplanned shutdown of the primary site or an impending disaster where you have a grace period to failover to the secondary site. See Full Failover and Failback Tests.

Important:

Ensure that you have completed all the steps to configure your active-passive disaster recovery before running any of these tests.

Discreet Failover Test

The discreet failover test simulates a failover while the primary site and all its storage domains remain active which allows users to continue working in the primary site. To perform this test, you must disable replication between the primary storage domains and the replicated (secondary) storage domains. During this test the primary site is unaware of the failover activities on the secondary site.

This test does not allow you to test the failback functionality.

Important:

Ensure that no production tasks are performed after the failover. For example, ensure that email systems are blocked from sending emails to real users or redirect emails elsewhere. If systems are used to directly manage other systems, prohibit access to the systems or ensure that they access parallel systems in the secondary site.

To perform a discreet failover test, complete the following steps.

  1. Disable storage replication between the primary and replicated storage domains and ensure that all replicated storage domains are in read/write mode.

  2. Run the following command to failover to the secondary site:

    # ansible-playbook playbook --tags "fail_over"
  3. Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the secondary site.

To restore the environment to its active-passive state, complete the following steps.

  1. Detach the storage domains from the secondary site.

  2. Enable storage replication between the primary and secondary storage domains.

Discreet Failover and Failback Tests

The discreet failover and failback tests use testable storage domains that you specifically define for testing failover and failback. These storage domains must be replicated so that the replicated storage can be attached to the secondary site which allows you to test the failover while users continue to work in the primary site.

Note:

You should define the testable storage domains on a separate storage server that can be offline without affecting the primary storage domains used for production in the primary site.

To perform a discreet failover test, complete the following steps.

  1. Stop the test storage domains in the primary site. For example, shut down the server host or block it with a firewall rule.

  2. Disable the storage replication between the testable storage domains and ensure that all replicated storage domains used for the test are in read/write mode.

  3. Place the test primary storage domains into read-only mode.

  4. Run the command to failover to the secondary site:

    # ansible-playbook playbook --tags "fail_over"
  5. Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the secondary site.

To perform a discreet failback test, complete the following steps.

  1. Clean the primary site and remove all inactive storage domains and related virtual machines and templates. For more information, see Cleaning the Primary Site.

  2. Run the command to failback to the primary site:

    # ansible-playbook playbook --tags "fail_back"
  3. Enable replication from the primary storage domains to the secondary storage domains.

  4. Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully in the primary site.

Full Failover and Failback Tests

The full failover and failback tests allow you to simulate a primary site disaster, failover to the secondary site, and failback to the primary site. To simulate a primary site disaster, you can shut down the primary site’s hosts or add firewall rules to block writing to the storage domains.
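
For example, a firewall rule on each primary-site host can block writes to the storage server. The following iptables sketch assumes NFS storage on the default port 2049 and uses the example storage address from the mapping file:

# iptables -A OUTPUT -d 10.64.100.xxx -p tcp --dport 2049 -j REJECT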

To perform a full failover test, complete the following steps.

  1. Disable storage replication between the primary and replicated storage domains and ensure that all replicated storage domains are in read/write mode.

  2. Run the command to failover to the secondary site:

    # ansible-playbook playbook --tags "fail_over"
  3. Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully in the secondary site.

To perform a full failback test, complete the following steps.

  1. Synchronize replication between the secondary site’s storage domains and the primary site’s storage domains. The secondary site’s storage domains must be in read/write mode and the primary site’s storage domains must be in read-only mode.

  2. Clean the primary site and remove all inactive storage domains and related virtual machines and templates. For more information, see Cleaning the Primary Site.

  3. Run the command to failback to the primary site:

    # ansible-playbook playbook --tags "fail_back"
  4. Enable replication from the primary storage domains to the secondary storage domains.

  5. Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the primary site.

Mapping File Attributes

The attributes in the mapping file are used to failover and failback between the two sites in an active-passive disaster recovery solution.

  • Site details

    Attributes that map the Engine details in the primary and secondary site, for example:

    dr_sites_primary_url: https://manager1.mycompany.com/ovirt-engine/api
    dr_sites_primary_username: admin@internal
    dr_sites_primary_ca_file: /etc/pki/ovirt-engine/ca.pem
    
    # scp manager2:/etc/pki/ovirt-engine/ca.pem /var/tmp/secondary-ca.pem
    
    # Please fill in the following properties for the secondary site:
    dr_sites_secondary_url: https://manager2.mycompany.com/ovirt-engine/api
    dr_sites_secondary_username: admin@internal
    dr_sites_secondary_ca_file: /var/tmp/secondary-ca.pem
  • Storage domain details

    Attributes that map the storage domain details between the primary and secondary site, for example:

    dr_import_storages:
    - dr_domain_type: nfs
      dr_primary_name: DATA
      dr_master_domain: True
      dr_wipe_after_delete: False
      dr_backup: False
      dr_critical_space_action_blocker: 5
      dr_warning_low_space: 10
      dr_primary_dc_name: Default
      dr_discard_after_delete: False
      dr_primary_path: /storage/data
      dr_primary_address: 10.64.100.xxx
      # Fill in the empty properties related to the secondary site
      dr_secondary_dc_name: Default
      dr_secondary_path: /storage/data2
      dr_secondary_address: 10.64.90.xxx
      dr_secondary_name: DATA
  • Cluster details

    Attributes that map the cluster names between the primary and secondary site, for example:

    dr_cluster_mappings:
      - primary_name: cluster_prod
        secondary_name: cluster_recovery
      - primary_name: fc_cluster
        secondary_name: recovery_fc_cluster
  • Affinity group details

    Attributes that map the affinity groups that virtual machines belong to, for example:

    dr_affinity_group_mappings:
    - primary_name: affinity_prod
      secondary_name: affinity_recovery
  • Affinity label details

    Attributes that map the affinity labels that virtual machines belong to, for example:

    dr_affinity_label_mappings:
    - primary_name: affinity_label_prod
      secondary_name: affinity_label_recovery
  • Domain authentication, authorization and accounting details

    Attributes that map authorization details between the primary and secondary site, for example:

    dr_domain_mappings:
    - primary_name: internal-authz
      secondary_name: recovery-authz
    - primary_name: external-authz
      secondary_name: recovery2-authz
  • Role details

    Attributes that provide mapping for specific roles, for example:

    dr_role_mappings:
    - primary_name: admin
      secondary_name: newadmin
  • Network details

    Attributes that map the vNIC details between the primary and secondary site, for example:

    dr_network_mappings:
    - primary_network_name: ovirtmgmt
      primary_profile_name: ovirtmgmt
      primary_profile_id: 0000000a-000a-000a-000a-000000000398
      # Fill in the correlated vnic profile properties in the secondary site for profile 'ovirtmgmt'
      secondary_network_name: ovirtmgmt
      secondary_profile_name: ovirtmgmt
      secondary_profile_id:  0000000a-000a-000a-000a-000000000410

    If you have multiple networks or multiple data centers then you must use an empty network mapping in the mapping file to ensure that all entities register on the target during failover, for example:

    dr_network_mappings:
    # No mapping should be here
  • External LUN disk details

    LUN attributes allow virtual machines to be registered with the appropriate external LUN disk after failover and failback, for example:

    dr_lun_mappings:
    - primary_logical_unit_id: 460014069b2be431c0fd46c4bdce29b66
      primary_logical_unit_alias: My_Disk
      primary_wipe_after_delete: False
      primary_shareable: False
      primary_logical_unit_description: 2b66
      primary_storage_type: iscsi
      primary_logical_unit_address: 10.35.xx.xxx
      primary_logical_unit_port: 3260
      primary_logical_unit_portal: 1
      primary_logical_unit_target: iqn.2017-12.com.prod.example:444
      secondary_storage_type: iscsi
      secondary_wipe_after_delete: False
      secondary_shareable: False
      secondary_logical_unit_id: 460014069b2be431c0fd46c4bdce29b66
      secondary_logical_unit_address: 10.35.x.xxx
      secondary_logical_unit_port: 3260
      secondary_logical_unit_portal: 1
      secondary_logical_unit_target: iqn.2017-12.com.recovery.example:444