Recovering From a Disaster

Disaster tolerance is the ability of a system to restore an application on a secondary cluster when the primary cluster fails. Disaster tolerance is based on data replication and failover. The Geographic Edition software enables disaster tolerance by redundantly deploying the following:

Highly available clusters that are geographically separated
Data replication at either the host or the storage level
Backup, restoration, and data vaulting

Data replication is the process of continuously copying data from the primary cluster to the secondary cluster. Through data replication, the secondary cluster has a recent copy of the data on the primary cluster. The secondary cluster can be geographically separated from the primary cluster.

The Geographic Edition software supports two types of migration of services: a switchover and a takeover.

A switchover is a planned migration of services from the primary cluster to the secondary cluster. During a switchover, the primary cluster is connected to the secondary cluster and coordinates the migration of services with the secondary cluster. This coordination enables the data replication to complete and ensures that services can be transferred from the primary cluster to the secondary cluster without loss or corruption of data.
A takeover is an emergency migration of services from the primary cluster to the secondary cluster. A system administrator can initiate a takeover to recover from a disaster. Unlike a switchover, the primary cluster is not connected to the secondary cluster during a takeover. Therefore, the primary cluster cannot coordinate with the secondary cluster to migrate the services. Because of this lack of coordination, the risk of data loss and data corruption in a takeover is higher than it is with a switchover. The Geographic Edition software uses dedicated recovery procedures during a takeover to minimize data loss and data corruption.
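The difference between the two migration types can be illustrated with a small model. The following Python sketch is not Geographic Edition code. The `Cluster` and `ProtectionPair` classes, the `pending` queue, and the `link_up` flag are hypothetical names introduced only for illustration, and asynchronous replication is assumed. The sketch shows that a switchover drains outstanding replication before the roles change, whereas a takeover promotes the secondary cluster with whatever data has already arrived.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for one cluster and the data it holds."""
    name: str
    data: list = field(default_factory=list)


@dataclass
class ProtectionPair:
    """Hypothetical primary/secondary pair with asynchronous replication."""
    primary: Cluster
    secondary: Cluster
    pending: list = field(default_factory=list)  # writes not yet replicated
    link_up: bool = True                         # can the clusters communicate?

    def write(self, record):
        # Application writes land on the primary and are queued for replication.
        self.primary.data.append(record)
        self.pending.append(record)

    def replicate(self):
        # Copy queued writes to the secondary; possible only while the link is up.
        if not self.link_up:
            return
        self.secondary.data.extend(self.pending)
        self.pending.clear()

    def switchover(self):
        # Planned migration: the clusters coordinate, so replication completes first.
        if not self.link_up:
            raise RuntimeError("switchover requires a reachable primary cluster")
        self.replicate()
        self.primary, self.secondary = self.secondary, self.primary
        return 0  # no writes are lost

    def takeover(self):
        # Emergency migration: the secondary is promoted without coordination.
        lost = len(self.pending)  # writes that never reached the secondary
        self.pending.clear()
        self.primary, self.secondary = self.secondary, self.primary
        return lost


# Usage: one write is replicated, a second is not, then the link fails.
pair = ProtectionPair(Cluster("cluster-paris"), Cluster("cluster-newyork"))
pair.write("order-1")
pair.replicate()
pair.write("order-2")
pair.link_up = False
print(pair.takeover())  # prints 1: "order-2" was never replicated
```

In this model, calling `switchover()` while `link_up` is false raises an error, which mirrors the requirement that the primary cluster be connected during a switchover. Real recovery procedures are considerably more involved than discarding the pending queue.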

These operations intentionally require manual initiation rather than occurring automatically, as failover between cluster nodes does. Business continuity covers all aspects of a company's response to a disaster: not only information technology (IT) but also staff availability and welfare, phones, buildings, and so forth. A good business continuity plan includes all of these things and outlines the actions to be taken. When a disaster occurs, it can be extremely difficult to obtain accurate information about what is happening. Having one part of the infrastructure attempt an automatic recovery while other areas are still trying to work out what is happening can often make matters worse.

General best practice is to have a designated Business Continuity Manager involved in disaster recovery decisions, to review status and decide on appropriate action. Once an action is decided upon, it must then be performed correctly, preferably in an automated, tested way. This is the basis of the Geographic Edition takeover operation. For example, if a brief power outage has crashed systems at one site, switching to a remote site might not be the correct response. If the remote site is in another time zone, where staff are not on duty, such a takeover will require that staff be paged, and potentially all communications services redirected. After the outage is corrected, the process must be reversed. It might, in the circumstances, be much more effective to simply restart the primary site. Having the IT infrastructure take over automatically while the situation is being evaluated will not help recovery.