Active-Passive Disaster Recovery
The Oracle Linux Virtualization Manager active-passive disaster recovery solution can span two sites. If the primary site becomes unavailable, the Oracle Linux Virtualization Manager environment can be forced to fail over to the secondary (backup) site.
Failover is achieved by configuring a secondary site with:
- An active Oracle Linux Virtualization Manager Engine.
- A data center and clusters.
- Networks with the same general connectivity as the primary site.
- Active hosts capable of running critical virtual machines after failover.
Important:
You must ensure that the secondary environment has enough resources to run the failed over virtual machines, and that both the primary and secondary environments have identical Engine versions, data center and cluster compatibility levels, and PostgreSQL versions.
Storage domains that contain virtual machine disks and templates in the primary site must be replicated. These replicated storage domains must not be attached to the secondary site.
Failover and failback are executed manually using Ansible playbooks that map entities between the sites and manage the process. The mapping file instructs the Oracle Linux Virtualization Manager components where to fail over or fail back to.
Network Considerations
You must ensure that the same general connectivity exists in the primary and secondary sites. If you have multiple networks or multiple data centers then you must use an empty network mapping in the mapping file to ensure that all entities register on the target during failover.
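For reference, an empty network mapping in the mapping file looks like the following; see Mapping File Attributes for the full file format:
dr_network_mappings:
# No mapping should be here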
Storage Considerations
The storage domain for Oracle Linux Virtualization Manager can be made of either block devices (iSCSI or FCP) or a file system (NAS/NFS or GlusterFS). Local storage domains are unsupported for disaster recovery.
Your environment must have a primary and secondary storage replica. The primary storage domain’s block devices or shares that contain virtual machine disks or templates must be replicated. The secondary storage must not be attached to any data center and is added to the backup site’s data center during failover.
If you are implementing disaster recovery using a self-hosted engine, ensure that the storage domain used by the self-hosted engine's Engine virtual machine does not contain virtual machine disks, because that storage domain will not fail over.
You can use any storage solutions that have replication options supported by Oracle Linux 8 and later.
Important:
Metadata for all virtual machines and disks resides on the storage data domain as OVF_STORE disk images. This metadata is used when the storage data domain is moved by failover or failback to another data center in the same or different environment.
By default, the Engine automatically updates the metadata every 60 minutes. This means that any changes made since the last update could be lost. To avoid such loss, you can manually update the metadata from the Administration Portal by navigating to the storage domain section and clicking Update OVFs. Or, you can modify the Engine parameters to change the update frequency, for example:
# engine-config -s OvfUpdateIntervalInMinutes=30 && systemctl restart ovirt-engine
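To check the currently configured interval before changing it, you can query the parameter first, for example:
# engine-config -g OvfUpdateIntervalInMinutes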
For more information, see the Storage topics in the Oracle Linux Virtualization Manager: Administration Guide and the Oracle Linux Virtualization Manager: Architecture and Planning Guide.
Creating the Ansible Playbooks
You use Ansible to initiate and manage the active-passive disaster recovery failover and failback through Ansible playbooks that you create. For more information about Ansible playbooks, see the Ansible documentation.
Before you begin creating your Ansible playbooks, review the following prerequisites and limitations:
- The primary site has a fully functioning Oracle Linux Virtualization Manager environment.
- A backup environment exists in the secondary site with the same data center and cluster compatibility level as the primary environment. The backup environment must have:
  - An Oracle Linux Virtualization Manager Engine
  - Active hosts capable of running the virtual machines and connecting to the replicated storage domains
  - A data center with clusters
  - Networks with the same general connectivity as the primary site
- Replicated storage that contains virtual machines and templates is not attached to the secondary site.
- The ovirt-ansible-collection package must be installed on the highly available Ansible Engine machine to automate the failover and failback (see the example after this list).
- The machine running the Ansible Engine must be able to use SSH to connect to the Engine in the primary and secondary site.
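The following is a minimal sketch of preparing the Ansible machine, assuming the ovirt-ansible-collection package is available in an enabled yum repository and that root SSH access to both Engines is used; the Engine host names are the example names used elsewhere in this guide:
# dnf install ovirt-ansible-collection
# ssh root@manager1.mycompany.com exit
# ssh root@manager2.mycompany.com exit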
Note:
We recommend that you create, on the secondary site, the environment properties that exist in the primary site, such as affinity groups, affinity labels, and users. The default behavior of the Ansible playbooks can be configured in the /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/defaults/main.yml file.
You must create the following required Ansible playbooks:
- Playbook that creates the file to map entities on the primary and secondary sites
- Failover playbook
- Failback playbook
The playbooks and associated files that you create must reside in
/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery
on the Ansible machine that is managing the failover and failback. If you have multiple
Ansible machines that can manage it, ensure that you copy the files to all of them.
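For example, a rough sketch of copying the playbooks and mapping file to a second Ansible machine over SSH; the host name ansible2.mycompany.com is only a placeholder for illustration:
# scp /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/*.yml \
  root@ansible2.mycompany.com:/usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/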
After configuring active-passive disaster recovery, you should test and verify the configuration. See Testing the Active-Passive Configuration.
Simplifying Ansible Tasks Using the ovirt-dr Script
You can use the ovirt-dr script, located in /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery, to simplify these Ansible tasks:
- Generating a var mapping file of the primary and secondary sites for failover and failback
- Validating the var mapping file
- Executing failover on a target site
- Executing failback from a target site to a source site
The following is an example of the ovirt-dr script:
# ./ovirt-dr generate/validate/failover/failback [--conf-file=dr.conf] [--log-file=ovirt-dr-log_number.log] [--log-level=DEBUG/INFO/WARNING/ERROR]
You can optionally make the following customizations:
- Set parameters for the script's actions in the configuration file /usr/share/ansible/collections/ansible_collections/ovirt/ovirt/roles/disaster_recovery/files/dr.conf
- Change the location of the configuration file using the --conf-file option
- Set the location of the log file using the --log-file option
- Set the level of logging detail using the --log-level option
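For example, to generate the mapping file using the default configuration file while writing detailed logs to a custom location (the log path here is only illustrative):
# ./ovirt-dr generate --conf-file=files/dr.conf --log-file=/tmp/ovirt-dr.log --log-level=DEBUG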
Generating the Mapping File Using an Ansible Playbook
The Ansible playbook used to generate the mapping file prepopulates the file with the primary site's entities. You then need to manually add the backup site's entities to the file, such as IP addresses, clusters, affinity groups, affinity labels, external LUN disks, authorization domains, roles, and vNIC profiles.
Important:
Generating the mapping file will fail if you have any virtual machine disks on the self-hosted engine’s storage domain. Also, the generated mapping file will not contain an attribute for this storage domain because it must not be failed over.
To create the mapping file, complete the following steps.
- Create an Ansible playbook using a yaml file (such as dr-olvm-setup.yml) to generate the mapping file. For example:

---
- name: Setup oVirt environment
  hosts: localhost
  connection: local
  vars:
    site: https://manager1.mycompany.com/ovirt-engine/api
    username: admin@internal
    password: Mysecret1
    ca: /etc/pki/ovirt-engine/ca.pem
    var_file: disaster_recovery_vars.yml
  roles:
    - disaster_recovery
  collections:
    - ovirt.ovirt
For extra security you can encrypt your Engine password in a .yml file.
- Run the Ansible command to generate the mapping file. The primary site's configuration will be prepopulated.
# ansible-playbook dr-olvm-setup.yml --tags "generate_mapping"
- Configure the generated mapping .yml file with the backup site's configuration. For more information, see Mapping File Attributes.
If you have multiple Ansible machines that can perform failover and failback, then copy the mapping file to all relevant machines.
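After you have added the backup site's values, you can optionally check the mapping file with the validate action of the ovirt-dr script (see Simplifying Ansible Tasks Using the ovirt-dr Script), for example:
# ./ovirt-dr validate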
Creating Failover and Failback Playbooks
Before creating the failover and failback playbooks, ensure you have created and configured the mapping file, which must be added to the playbooks.
To create the failover and failback playbooks, complete the following steps.
- Optionally, define a password file (for example passwords.yml) to store the Engine passwords of the primary and secondary site, for example:

---
# This file is in plain text, if you want to
# encrypt this file, please execute following command:
#
# $ ansible-vault encrypt passwords.yml
#
# It will ask you for a password, which you must then pass to
# ansible interactively when executing the playbook.
#
# $ ansible-playbook myplaybook.yml --ask-vault-pass
#
dr_sites_primary_password: primary_password
dr_sites_secondary_password: secondary_password
For extra security you can encrypt the password file. However, you will need to use the --ask-vault-pass parameter when running the playbook.
- Create an Ansible playbook using a failover yaml file (such as dr-olvm-failover.yml) to fail over the environment, for example:

---
- name: oVirt Failover
  hosts: localhost
  connection: local
  vars:
    dr_target_host: secondary
    dr_source_map: primary
  vars_files:
    - disaster_recovery_vars.yml
  roles:
    - disaster_recovery
  collections:
    - ovirt.ovirt
- Create an Ansible playbook using a failback yaml file (such as dr-olvm-failback.yml) to fail back the environment, for example:

---
- name: oVirt Failback
  hosts: localhost
  connection: local
  vars:
    dr_target_host: primary
    dr_source_map: secondary
  vars_files:
    - disaster_recovery_vars.yml
  roles:
    - disaster_recovery
  collections:
    - ovirt.ovirt
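If you encrypted the password file with ansible-vault, add the --ask-vault-pass parameter when you run either playbook, for example:
# ansible-playbook dr-olvm-failover.yml --tags "fail_over" --ask-vault-pass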
Executing a Failover
Before executing a failover, ensure you have read and understood the Network Considerations and Storage Considerations. You must also ensure that:
- The Engine and hosts in the secondary site are running.
- Replicated storage domains are in read/write mode.
- No replicated storage domains are attached to the secondary site.
- A machine running the Ansible Engine can connect via SSH to the Engine in the primary and secondary site, with the required packages and files:
  - The ovirt-ansible-collection package.
  - The mapping file and failover playbook.
- Sanlock must release all storage locks from the replicated storage domains before the failover process starts. These locks should be released automatically approximately 80 seconds after the disaster occurs.
To execute a failover, run the failover playbook on the Engine host using the following command:
# ansible-playbook dr-olvm-failover.yml --tags "fail_over"
When the primary site becomes active, ensure that you clean the environment before failing back. For more information, see Cleaning the Primary Site.
Continue with a failback. See Executing a Failback.
Cleaning the Primary Site
After you fail over, you must clean the environment in the primary site before failing back to it. Cleaning the primary site's environment involves:
- Rebooting all hosts in the primary site.
- Ensuring the secondary site's storage domains are in read/write mode and the primary site's storage domains are in read-only mode.
- Synchronizing the replication from the secondary site's storage domains to the primary site's storage domains.
- Cleaning the primary site of all storage domains to be imported. This can be done manually in the Engine. For more information, see Detaching a Storage Domain from a Data Center.
For example, create a cleanup yml file (such as dr_cleanup_primary_site.yml):

---
- name: oVirt Cleanup Primary Site
  hosts: localhost
  connection: local
  vars:
    dr_source_map: primary
  vars_files:
    - disaster_recovery_vars.yml
  roles:
    - disaster_recovery
  collections:
    - ovirt.ovirt
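A minimal sketch of running this cleanup playbook; the clean_engine tag is an assumption based on the upstream disaster_recovery role rather than something defined in this guide:
# ansible-playbook dr_cleanup_primary_site.yml --tags "clean_engine"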
Once you have cleaned the primary site, you can fail back the environment to the primary site. For more information, see Executing a Failback.
Executing a Failback
After failover, you can fail back to the primary site when it is active and you have performed the necessary steps to clean the environment. Before failing back, ensure that:
- The primary site's environment is running and has been cleaned. For more information, see Cleaning the Primary Site.
- The environment in the secondary site is running and has active storage domains.
- The machine running the Ansible Engine can connect via SSH to the Engine in the primary and secondary site, with the required packages and files:
  - The ovirt-ansible-collection package.
  - The mapping file and required failback playbook.
To execute a failback, complete the following steps.
- Run the failback playbook on the Engine host using the following command:
# ansible-playbook dr-olvm-failback.yml --tags "fail_back"
- Enable replication from the primary storage domains to the secondary storage domains.
Testing the Active-Passive Configuration
After configuring your disaster recovery solution, you must test it using one of the following options:
- Test failover while the primary site remains active and without interfering with virtual machines on the primary site’s storage domains. See Discreet Failover Test.
- Test failover and failback using specific storage domains attached to the primary site, which allows the primary site to remain active. See Discreet Failover and Failback Tests.
- Test failover and failback for an unplanned shutdown of the primary site or an impending disaster where you have a grace period to fail over to the secondary site. See Full Failover and Failback Tests.
Important:
Ensure that you have completed all the steps to configure your active-passive disaster recovery before running any of these tests.
Discreet Failover Test
The discreet failover test simulates a failover while the primary site and all its storage domains remain active, which allows users to continue working in the primary site. To perform this test, you must disable replication between the primary storage domains and the replicated (secondary) storage domains. During this test the primary site is unaware of the failover activities on the secondary site.
This test does not allow you to test the failback functionality.
Important:
Ensure that no production tasks are performed after the failover. For example, ensure that email systems are blocked from sending emails to real users or redirect emails elsewhere. If systems are used to directly manage other systems, prohibit access to the systems or ensure that they access parallel systems in the secondary site.
To perform a discreet failover test, complete the following steps.
- Disable storage replication between the primary and replicated storage domains and ensure that all replicated storage domains are in read/write mode.
- Run the following command to fail over to the secondary site:
# ansible-playbook playbook --tags "fail_over"
- Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the secondary site.
To restore the environment to its active-passive state, complete the following steps.
- Detach the storage domains from the secondary site.
- Enable storage replication between the primary and secondary storage domains.
Discreet Failover and Failback Tests
The discreet failover and failback tests use testable storage domains that you specifically define for testing failover and failback. These storage domains must be replicated so that the replicated storage can be attached to the secondary site, which allows you to test the failover while users continue to work in the primary site.
Note:
You should define the testable storage domains on a separate storage server that can be offline without affecting the primary storage domains used for production in the primary site.
To perform a discreet failover test, complete the following steps.
- Stop the test storage domains in the primary site. For example, shut down the server host or block it with a firewall rule.
- Disable the storage replication between the testable storage domains and ensure that all replicated storage domains used for the test are in read/write mode.
- Place the test primary storage domains into read-only mode.
- Run the command to fail over to the secondary site:
# ansible-playbook playbook --tags "fail_over"
- Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the secondary site.
To perform a discreet failback test, complete the following steps.
- Clean the primary site and remove all inactive storage domains and related virtual machines and templates. For more information, see Cleaning the Primary Site.
- Run the command to fail back to the primary site:
# ansible-playbook playbook --tags "fail_back"
- Enable replication from the primary storage domains to the secondary storage domains.
- Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully in the primary site.
Full Failover and Failback Tests
The full failover and failback tests allow you to simulate a primary site disaster, fail over to the secondary site, and fail back to the primary site. To simulate a primary site disaster, you can shut down the primary site’s hosts or add firewall rules to block writing to the storage domains.
To perform a full failover test, complete the following steps.
- Disable storage replication between the primary and replicated storage domains and ensure that all replicated storage domains are in read/write mode.
- Run the command to fail over to the secondary site:
# ansible-playbook playbook --tags "fail_over"
- Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully in the secondary site.
To perform a full failback test, complete the following steps.
- Synchronize replication between the secondary site’s storage domains and the primary site’s storage domains. The secondary site’s storage domains must be in read/write mode and the primary site’s storage domains must be in read-only mode.
- Clean the primary site and remove all inactive storage domains and related virtual machines and templates. For more information, see Cleaning the Primary Site.
- Run the command to fail back to the primary site:
# ansible-playbook playbook --tags "fail_back"
- Enable replication from the primary storage domains to the secondary storage domains.
- Verify that all relevant storage domains, virtual machines, and templates are registered and running successfully on the primary site.
Mapping File Attributes
The attributes in the mapping file are used to failover and failback between the two sites in an active-passive disaster recovery solution.
- Site details
Attributes that map the Engine details in the primary and secondary site, for example:
dr_sites_primary_url: https://manager1.mycompany.com/ovirt-engine/api
dr_sites_primary_username: admin@internal
dr_sites_primary_ca_file: /etc/pki/ovirt-engine/ca.pem
# scp manager2:/etc/pki/ovirt-engine/ca.pem /var/tmp/secondary-ca.pem
# Please fill in the following properties for the secondary site:
dr_sites_secondary_url: https://manager2.mycompany.com/ovirt-engine/api
dr_sites_secondary_username: admin@internal
dr_sites_secondary_ca_file: /var/tmp/secondary-ca.pem
- Storage domain details
Attributes that map the storage domain details between the primary and secondary site, for example:
dr_import_storages:
- dr_domain_type: nfs
  dr_primary_name: DATA
  dr_master_domain: True
  dr_wipe_after_delete: False
  dr_backup: False
  dr_critical_space_action_blocker: 5
  dr_warning_low_space: 10
  dr_primary_dc_name: Default
  dr_discard_after_delete: False
  dr_primary_path: /storage/data
  dr_primary_address: 10.64.100.xxx
  # Fill in the empty properties related to the secondary site
  dr_secondary_dc_name: Default
  dr_secondary_path: /storage/data2
  dr_secondary_address: 10.64.90.xxx
  dr_secondary_name: DATA
- Cluster details
Attributes that map the cluster names between the primary and secondary site, for example:
dr_cluster_mappings:
- primary_name: cluster_prod
  secondary_name: cluster_recovery
- primary_name: fc_cluster
  secondary_name: recovery_fc_cluster
- Affinity group details
Attributes that map the affinity groups that virtual machines belong to, for example:
dr_affinity_group_mappings:
- primary_name: affinity_prod
  secondary_name: affinity_recovery
- Affinity label details
Attributes that map the affinity labels that virtual machines belong to, for example:
dr_affinity_label_mappings:
- primary_name: affinity_label_prod
  secondary_name: affinity_label_recovery
- Domain authentication, authorization and accounting details
Attributes that map authorization details between the primary and secondary site, for example:
dr_domain_mappings:
- primary_name: internal-authz
  secondary_name: recovery-authz
- primary_name: external-authz
  secondary_name: recovery2-authz
- Role details
Attributes that provide mapping for specific roles, for example:
dr_role_mappings:
- primary_name: admin
  secondary_name: newadmin
- Network details
Attributes that map the vNIC details between the primary and secondary site, for example:
dr_network_mappings:
- primary_network_name: ovirtmgmt
  primary_profile_name: ovirtmgmt
  primary_profile_id: 0000000a-000a-000a-000a-000000000398
  # Fill in the correlated vnic profile properties in the secondary site for profile 'ovirtmgmt'
  secondary_network_name: ovirtmgmt
  secondary_profile_name: ovirtmgmt
  secondary_profile_id: 0000000a-000a-000a-000a-000000000410
If you have multiple networks or multiple data centers then you must use an empty network mapping in the mapping file to ensure that all entities register on the target during failover, for example:
dr_network_mappings:
# No mapping should be here
- External LUN disk details
LUN attributes allow virtual machines to be registered with the appropriate external LUN disk after failover and failback, for example:
dr_lun_mappings:
- primary_logical_unit_id: 460014069b2be431c0fd46c4bdce29b66
  primary_logical_unit_alias: My_Disk
  primary_wipe_after_delete: False
  primary_shareable: False
  primary_logical_unit_description: 2b66
  primary_storage_type: iscsi
  primary_logical_unit_address: 10.35.xx.xxx
  primary_logical_unit_port: 3260
  primary_logical_unit_portal: 1
  primary_logical_unit_target: iqn.2017-12.com.prod.example:444
  secondary_storage_type: iscsi
  secondary_wipe_after_delete: False
  secondary_shareable: False
  secondary_logical_unit_id: 460014069b2be431c0fd46c4bdce29b66
  secondary_logical_unit_address: 10.35.x.xxx
  secondary_logical_unit_port: 3260
  secondary_logical_unit_portal: 1
  secondary_logical_unit_target: iqn.2017-12.com.recovery.example:444