Validate Failover and Failback Behavior
Once the VMware vSAN stretched cluster is configured, it is essential to validate both failover and failback workflows to ensure business continuity and disaster recovery readiness. This section outlines the steps to simulate a failure at the primary site, test recovery from the secondary site, and then restore services to the primary site.
Simulate a Failover Event
To simulate a failure of the Primary Region:
- Power Off Primary Region Hosts
  - Use the OCI Console to forcefully power off all VMware ESXi hosts in the Primary Region.
- Observe HA Recovery at Secondary Site
  - From the Bastion VM in the Secondary Region, connect to one of the VMware ESXi hosts.
  - Observe that the management and workload VMs automatically power on via VMware vSphere HA.
- Update Network Routing
  - Detach VCN-MGMT-Active from the DRG in the Primary Region.
  - Attach VCN-MGMT-Failover to the DRG in the Secondary Region.
- Modify Route Tables in VCN-MGMT-Failover
  - Update route tables to point traffic destined for the following ranges toward the DRG in the Secondary Region:
    - 10.16.0.0/16 (Primary VCN)
    - 10.17.0.0/16 (Secondary VCN)
    - 172.30.0.0/16 (overlay networks or external resources)
- Verify Connectivity
  - Use Network Analyzer or similar diagnostic tools to validate reachability to vSphere components.
  - Confirm vCenter is operational and displays the Primary Region hosts as unavailable.
  - Validate East-West (intra-site) and North-South (external) connectivity using test VMs.
  - Ensure internet access works as expected via the NAT Gateway in the Secondary Region.
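The route-table change in the steps above can be sanity-checked offline before testing with live traffic. The following is a minimal sketch, assuming the route rules of VCN-MGMT-Failover have been exported as (destination CIDR, target) pairs. The RouteRule type, the uncovered_destinations helper, and the target label "drg-secondary" are hypothetical names introduced for illustration, not OCI API objects. It reports which of the three destination ranges are not covered by a rule pointing at the Secondary Region DRG.

```python
from dataclasses import dataclass
from ipaddress import ip_network

# CIDR blocks named in the failover steps above.
REQUIRED_DESTINATIONS = ["10.16.0.0/16", "10.17.0.0/16", "172.30.0.0/16"]


@dataclass(frozen=True)
class RouteRule:
    """Hypothetical stand-in for one exported route rule."""
    destination: str  # CIDR block, e.g. "10.16.0.0/16"
    target: str       # illustrative label for the next hop


def uncovered_destinations(rules, required, expected_target):
    """Return the required CIDRs that no rule routes to expected_target.

    A required block counts as covered when it lies entirely inside the
    destination of at least one rule whose target is expected_target.
    """
    missing = []
    for cidr in required:
        wanted = ip_network(cidr)
        covered = any(
            rule.target == expected_target
            and wanted.subnet_of(ip_network(rule.destination))
            for rule in rules
        )
        if not covered:
            missing.append(cidr)
    return missing


# Example: the overlay range has not been added yet.
example_rules = [
    RouteRule("10.16.0.0/16", "drg-secondary"),
    RouteRule("10.17.0.0/16", "drg-secondary"),
]
print(uncovered_destinations(example_rules, REQUIRED_DESTINATIONS, "drg-secondary"))
# prints ['172.30.0.0/16']
```

This check only confirms that a matching rule exists. It does not model rule precedence, so a more specific rule with a different target could still override the path; live validation with test VMs remains necessary.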
With optimized routing and configuration, VMs can recover and become operational within 15 minutes of failure detection. Networking updates and confirmation typically complete within an additional 5 minutes.
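To compare a test run against these figures, the timestamps recorded during the exercise can be checked against the two targets. This is a minimal sketch, assuming three timestamps are noted manually during the test; the recovery_report helper and its field names are hypothetical, and it assumes the 5-minute networking window is measured from the moment the VMs are back up.

```python
from datetime import datetime, timedelta

# Targets quoted in the text above.
VM_RECOVERY_TARGET = timedelta(minutes=15)
NETWORK_UPDATE_TARGET = timedelta(minutes=5)


def recovery_report(detected_at, vms_up_at, network_ready_at):
    """Summarize one failover test run against the quoted targets.

    Assumption: the networking window starts when the VMs are operational.
    """
    vm_recovery = vms_up_at - detected_at
    network_update = network_ready_at - vms_up_at
    return {
        "vm_recovery": vm_recovery,
        "network_update": network_update,
        "total": network_ready_at - detected_at,
        "within_targets": (
            vm_recovery <= VM_RECOVERY_TARGET
            and network_update <= NETWORK_UPDATE_TARGET
        ),
    }


# Illustrative timestamps only, not measured results.
detected = datetime(2024, 1, 1, 10, 0)
report = recovery_report(
    detected,
    detected + timedelta(minutes=12),
    detected + timedelta(minutes=16),
)
print(report["total"], report["within_targets"])
```

Taken together, the two targets imply an end-to-end budget of roughly 20 minutes from failure detection to confirmed connectivity.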
Execute a Failback
Once the Primary Region is restored and operational, follow these steps to return services to their original state:
- Restore and Reboot Primary Hosts
  - Power on the previously shut down VMware ESXi hosts.
  - Once online, perform either a full reboot via the OCI Console or manually restart system services using services.sh restart over SSH to ensure stability.
- vMotion VMs Back to Primary Hosts
  - Migrate all workload and management VMs from Secondary Region hosts to Primary Region hosts.
  Note: VMs can temporarily drop off the network due to unadjusted routing at this stage.
- Reconfigure Network Routing
  - Detach VCN-MGMT-Failover from the DRG in the Secondary Region.
  - Reattach VCN-MGMT-Active to the DRG in the Primary Region.
  - No route table changes are required as existing entries remain valid from the earlier configuration.
- Confirm Operational Status
  - Validate VM and service reachability from the Bastion in the Primary Region.
  - Confirm HA, vMotion, and VMware vSAN operations resume as expected.
  - All routes and policies should now reflect the pre-failover state.
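The DRG attachment changes in the failover and failback steps form a round trip, which can be modeled to confirm that the end state matches the starting point. This is a minimal sketch that tracks attachments as plain sets and makes no OCI calls; the "primary" and "secondary" keys and the failover, failback, and attachment_drift helpers are hypothetical names used for illustration. In practice, the two maps would come from attachment lists recorded before the test and after failback.

```python
def _copy(attachments):
    """Copy a {region: set of attached VCN names} map."""
    return {region: set(vcns) for region, vcns in attachments.items()}


def failover(attachments):
    """Apply the failover attachment steps and return the new map."""
    state = _copy(attachments)
    state["primary"].discard("VCN-MGMT-Active")
    state["secondary"].add("VCN-MGMT-Failover")
    return state


def failback(attachments):
    """Apply the failback attachment steps and return the new map."""
    state = _copy(attachments)
    state["secondary"].discard("VCN-MGMT-Failover")
    state["primary"].add("VCN-MGMT-Active")
    return state


def attachment_drift(before, after):
    """Return {region: (missing, unexpected)} for regions that differ."""
    drift = {}
    for region in sorted(set(before) | set(after)):
        expected = set(before.get(region, ()))
        actual = set(after.get(region, ()))
        if expected != actual:
            drift[region] = (sorted(expected - actual), sorted(actual - expected))
    return drift


# Baseline: only VCN-MGMT-Active is attached, in the Primary Region.
baseline = {"primary": {"VCN-MGMT-Active"}, "secondary": set()}
print(attachment_drift(baseline, failback(failover(baseline))))
# prints {}
```

An empty result means the attachments match the pre-failover baseline. The comparison covers DRG attachments only; route tables in VCN-MGMT-Failover intentionally keep the entries added during failover.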
This completes the configuration and validation of a VMware vSAN stretched cluster across OCI Dedicated Regions, including successful simulation of failover and failback scenarios.