Cluster Takeover and Failback

Takeover enables service to continue or resume normally when a cluster controller fails or loses power.

A controller automatically attempts takeover of the cluster when it detects that its peer is absent (for example, shut down or rebooting). After takeover, the controller that performed the takeover owns all cluster resources and provides all services.

If both controllers fail or are powered off, then upon simultaneous startup, the appliance software performs an arbitration procedure to determine which controller will continue with takeover.

Takeover can also be performed manually, which can be useful for testing purposes.
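The takeover triggers described above can be summarized as a small decision model. The following Python sketch is illustrative only; the function names and the arbitration tie-breaking rule are assumptions made for explanation, not the appliance's actual logic.

    # Illustrative model only: when a controller attempts takeover.
    # Function names and the arbitration rule are hypothetical placeholders.

    def should_attempt_takeover(peer_visible: bool) -> bool:
        """A controller attempts takeover when it detects that its peer is
        absent (for example, shut down or rebooting)."""
        return not peer_visible

    def arbitrate_simultaneous_startup(controller_ids: list[str]) -> str:
        """If both controllers start up at the same time, the appliance software
        arbitrates which controller continues with takeover. The tie-breaking
        rule used here (lowest identifier wins) is only a placeholder."""
        return min(controller_ids)

    def takeover_decision(peer_visible: bool, manual_request: bool = False) -> bool:
        """Takeover can also be requested manually, for example for testing."""
        return manual_request or should_attempt_takeover(peer_visible)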

The failback operation changes the cluster configuration from OWNER-STRIPPED (active-passive) to CLUSTERED-CLUSTERED (active-active). Failback never occurs automatically.

Failback is typically performed in the following scenario:

If controller-b in a cluster fails or loses power, then controller-a in that cluster takes over the resources that had been assigned to controller-b, and provides all cluster services. After controller-b is repaired and booted, an administrator performs the failback operation to return controller-b to production service.

When controller-b is repaired and booted, that controller:

  • Rejoins the cluster, resynchronizing its view of all resources, their properties, and their ownership.

  • Waits for an administrator to perform a failback operation.

While controller-b is waiting, controller-a continues to provide all services. Controller-a is in the Active (takeover completed) or AKCS_OWNER state, and controller-b is in the Ready (waiting for failback) or AKCS_STRIPPED state.
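As a reading aid, the states mentioned above can be paired with their CLI names as follows. This is an explanatory sketch, not an appliance data structure; the enum and variable names are assumptions.

    # Explanatory sketch only: BUI status labels paired with CLI state names.
    from enum import Enum

    class ControllerState(Enum):
        CLUSTERED = "Active"                        # normal active-active operation
        OWNER = "Active (takeover completed)"       # owns all resources after takeover
        STRIPPED = "Ready (waiting for failback)"   # rejoined, waiting for failback

    # While controller-b waits for failback:
    controller_a_state = ControllerState.OWNER      # AKCS_OWNER in the CLI
    controller_b_state = ControllerState.STRIPPED   # AKCS_STRIPPED in the CLI

    # After a successful failback, both controllers are Active again:
    controller_a_state = controller_b_state = ControllerState.CLUSTERED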

The failback operation returns controller-b to production service. Since the failure of controller-b, controller-a has been providing all services. Failback restores to controller-b the resources that it owned prior to the failure: controller-a exports all resources that are assigned to controller-b, and controller-b imports them. After a successful failback, both controller-a and controller-b are in the Active or CLUSTERED state.

If, during failback, controller-b cannot import a pool because the pool is faulted, controller-b reboots. The failback operation fails, and controller-a continues to provide all services.
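The export/import flow and the faulted-pool failure case can be sketched as follows. The Controller class, the resource dictionaries, and the method names are hypothetical placeholders used for illustration, not appliance interfaces.

    # Illustrative sketch only: resources assigned to controller-b are exported
    # from controller-a and imported by controller-b; a faulted pool aborts the
    # failback. All names here are hypothetical placeholders.

    class Controller:
        def __init__(self, name, resources):
            self.name = name
            self.state = "AKCS_OWNER" if resources else "AKCS_STRIPPED"
            self.resources = list(resources)

    def failback(controller_a, controller_b, resources_assigned_to_b):
        for resource in resources_assigned_to_b:
            if resource.get("faulted"):
                # Controller-b cannot import a faulted pool: it reboots, the
                # failback fails, and controller-a continues to provide all services.
                controller_b.state = "REBOOTING"
                return "FAILBACK_FAILED"
            controller_a.resources.remove(resource)    # exported from controller-a
            controller_b.resources.append(resource)    # imported by controller-b
        controller_a.state = controller_b.state = "AKCS_CLUSTERED"   # active-active
        return "CLUSTERED"

    # Example: a single healthy pool assigned to controller-b is returned to it.
    pool_b = {"name": "pool-b", "faulted": False}
    a = Controller("controller-a", [pool_b])
    b = Controller("controller-b", [])
    print(failback(a, b, [pool_b]))   # -> CLUSTERED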

When scheduling a failback operation, consider the following:

  • Failback is disruptive to clients of the cluster.

  • However, delaying failback is equally or more disruptive if the single active controller fails before failback is performed.

To minimize service downtime, data is not collected, and statistics and datasets are not available, during takeover and failback operations. Requests to suspend or resume statistics are delayed until these operations have completed, after which data collection resumes automatically.
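One way to picture the deferral of statistics requests is the sketch below; the queue, the flag, and the function names are assumptions made for illustration, not appliance internals.

    # Illustrative sketch only: suspend/resume requests are deferred while a
    # takeover or failback is running, then applied once it completes.
    from collections import deque

    deferred_requests = deque()
    cluster_operation_in_progress = True    # e.g. a failback is running

    def request_statistics_change(action):
        if cluster_operation_in_progress:
            deferred_requests.append(action)   # delayed until the operation completes
        else:
            action()

    def on_cluster_operation_complete():
        # Data collection resumes automatically; deferred requests are then applied.
        global cluster_operation_in_progress
        cluster_operation_in_progress = False
        while deferred_requests:
            deferred_requests.popleft()()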
