Clustered controller nodes are in one of a small set of states at any given time: OWNER (the node has imported all of the shared resources), STRIPPED (the node holds none of the shared resources), or CLUSTERED (normal operation, with each node serving its own portion of the shared resources).
Transitions among these states take place as part of two operations: takeover and failback.
Takeover can occur at any time, and is attempted whenever peer failure is detected. It can also be triggered manually using the cluster configuration CLI or BUI, which can be useful for testing purposes. Finally, takeover will occur when a controller boots and detects that its peer is absent. This allows service to resume normally when one controller has failed permanently or when both controllers have temporarily lost power.
Failback never occurs automatically. When a failed controller is repaired and booted, it will rejoin the cluster (resynchronizing its view of all resources, their properties, and their ownership) and then wait for an administrator to perform a failback operation. Until then, the original surviving controller continues to provide all services. This allows for a full investigation of the problem that originally triggered the takeover, validation of a new software revision, or other administrative tasks before the controller returns to production service. Because failback is disruptive to clients, it should be scheduled according to business-specific needs and processes.

There is one exception: suppose that controller A has failed and controller B has taken over. When controller A rejoins the cluster, it becomes eligible to take over again if it detects that controller B is absent or has failed. The principle is that it is always better to provide service than not, even if there has not yet been an opportunity to investigate the original problem. So while failback to a previously failed controller never occurs automatically, that controller may still perform takeover at any time.
After you set up a cluster, the initial state consists of the node that initiated the setup in the OWNER state and the other node in the STRIPPED state. After performing an initial failback operation to hand the STRIPPED node its portion of the shared resources, both nodes are CLUSTERED. If both cluster nodes fail or are powered off, then upon simultaneous startup they will arbitrate and one of them will become the OWNER and the other STRIPPED.
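The state transitions described above can be sketched as a small state machine. This is an illustrative model only, not appliance code: the `Node`, `takeover`, and `failback` names are hypothetical, and a real takeover imports resources rather than merely flipping a state flag.

```python
from enum import Enum

class State(Enum):
    CLUSTERED = "CLUSTERED"  # normal: each node serves its own resources
    OWNER = "OWNER"          # this node has taken over all shared resources
    STRIPPED = "STRIPPED"    # this node holds no shared resources

class Node:
    def __init__(self, name, state):
        self.name, self.state = name, state

def takeover(survivor, failed):
    """Takeover may occur at any time: the survivor imports everything."""
    survivor.state = State.OWNER
    failed.state = State.STRIPPED  # on reboot, the peer rejoins with nothing

def failback(owner, stripped):
    """Failback is administrator-initiated: the OWNER hands the STRIPPED
    node its portion of the shared resources."""
    assert owner.state is State.OWNER and stripped.state is State.STRIPPED
    owner.state = State.CLUSTERED
    stripped.state = State.CLUSTERED

# Initial setup: the node that initiated setup is OWNER, its peer STRIPPED.
a, b = Node("A", State.OWNER), Node("B", State.STRIPPED)
failback(a, b)   # initial failback -> both nodes CLUSTERED
takeover(b, a)   # A fails, B takes over -> B is OWNER, A is STRIPPED
```

Note that `failback` asserts its preconditions while `takeover` has none, mirroring the text: takeover can happen at any time, but failback only makes sense from the OWNER/STRIPPED configuration.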
During failback, all foreign resources (those assigned to the peer) are exported from the surviving controller and then imported by the peer. A pool that cannot be imported because it is faulted will cause the STRIPPED node to reboot as a result of the import failure.
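The failback handoff, including the faulted-pool behavior, can be modeled roughly as follows. All names here (`failback_foreign`, `FaultedPoolError`, the pool names) are hypothetical stand-ins, and the real export/import steps are considerably more involved:

```python
class FaultedPoolError(Exception):
    """Raised when a storage pool cannot be imported (pool is faulted)."""

def failback_foreign(resources, import_on_peer, reboot_peer):
    """For each foreign resource (one assigned to the peer): export it from
    the surviving controller, then import it on the peer. A faulted pool
    fails the import and reboots the STRIPPED peer."""
    handed_back = []
    for res in resources:
        # export step on the OWNER is modeled as a no-op here
        try:
            import_on_peer(res)
            handed_back.append(res)
        except FaultedPoolError:
            reboot_peer()  # import failure triggers reboot of the peer
            break
    return handed_back

# Usage: "pool-C" stands in for a faulted pool; the others are healthy.
events = []

def import_on_peer(res):
    if res == "pool-C":
        raise FaultedPoolError(res)
    events.append(("imported", res))

def reboot_peer():
    events.append("REBOOT")

imported = failback_foreign(["pool-B", "pool-C", "pool-D"],
                            import_on_peer, reboot_peer)
```

In this run only `pool-B` is handed back; the faulted `pool-C` aborts the sequence with a peer reboot, so `pool-D` is never attempted.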
To minimize service downtime, statistics and datasets are unavailable during takeover and failback operations. Data is not collected, and any attempt to suspend or resume statistics is deferred until the operation has completed, at which point data collection resumes automatically.
Related Topics