Repairing a Failed Zone

If a zone fails but quorum is maintained, you can repair the failed zone with new hardware by following the procedure described in Repairing a Failed Zone by Replacing Hardware.

Another option is to convert the failed zone to a secondary zone. In some cases, this approach can improve the high availability characteristics of the store by reducing the quorum requirements.

For example, suppose a store consists of two primary zones: zone 1, with a replication factor of three, and zone 2, with a replication factor of two. Now suppose zone 2 fails. Quorum is maintained because three of the five replicas remain available, but any additional failure would result in a loss of quorum.

Converting zone 2 to a secondary zone would reduce the primary replication factor to three. Quorum would then require only two of the three replicas in zone 1, so each shard could tolerate one additional failure.
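The arithmetic behind this example can be checked with a short sketch. It assumes that quorum is a simple majority of the primary replicas; the function names are hypothetical and are not part of any product API.

```python
def quorum(primary_rf: int) -> int:
    """Smallest number of primary replicas that forms a simple majority."""
    return primary_rf // 2 + 1


def extra_failures_tolerated(primary_rf: int, available: int) -> int:
    """How many more replicas can fail before quorum is lost.

    Returns 0 when quorum is only just met or is already lost.
    """
    return max(available - quorum(primary_rf), 0)


# Before conversion: zone 1 (RF 3) and zone 2 (RF 2) are both primary,
# and zone 2 has failed, leaving 3 of 5 primary replicas.
before = extra_failures_tolerated(primary_rf=5, available=3)

# After conversion: zone 2 is secondary, so only zone 1's 3 replicas
# count toward the primary replication factor.
after = extra_failures_tolerated(primary_rf=3, available=3)

print(f"before conversion: {before} additional failure(s) tolerated")
print(f"after conversion:  {after} additional failure(s) tolerated")
```

Under that assumption, the store tolerates no further failures before the conversion and one further failure per shard after it.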

Before converting, determine whether switching zone types would actually improve availability. If so, decide whether it is worth doing in the current circumstances.

Need for an Admin Node in the Secondary Zone

Having Admins in a secondary zone is very useful for failure recovery. For example, if a store has primary and secondary zones, and all of the primary zones are lost, the administrator can use the repair-admin-quorum and plan failover commands to resume operations by converting the secondary zone to a primary zone. These operations can be performed only if an Admin node is available. For this reason, stores with secondary zones should include Admins in the secondary zones.

The recommendation is to deploy as many Admins in each zone as the zone's replication factor. For example, if you have a primary zone and a secondary zone, each with a replication factor of three, then each zone should be configured with three Admins. If a zone failure occurs and no Admins remain available, the failover procedures cannot be used.
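As an illustration of this recommendation, the following minimal sketch flags zones that have fewer Admins than their replication factor. The Zone record and the zone data are hypothetical, supplied by hand for the example rather than read from a running store.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    zone_type: str            # "primary" or "secondary"
    replication_factor: int
    admins: int               # number of Admins deployed in the zone


def zones_short_of_admins(zones):
    """Return (zone name, missing Admin count) for each zone whose
    Admin count is below its replication factor."""
    return [
        (z.name, z.replication_factor - z.admins)
        for z in zones
        if z.admins < z.replication_factor
    ]


# Hypothetical topology: the secondary zone has only one Admin.
topology = [
    Zone("zone1", "primary", replication_factor=3, admins=3),
    Zone("zone2", "secondary", replication_factor=3, admins=1),
]

for name, missing in zones_short_of_admins(topology):
    print(f"{name}: deploy {missing} more Admin(s)")
```

With the sample data, only zone2 is reported, because it would need two more Admins to match its replication factor of three.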