Preventing Split-Brain Conditions

A common failure mode in clustered systems is known as split-brain. In this condition, each of the clustered controllers behaves as if its peer has failed and attempts takeover. The most common cause is failure of the communication medium shared by the controllers, which in Oracle ZFS Storage Appliance is the set of cluster I/O links. These links have built-in redundancy, so the loss of one link does not by itself trigger takeover. For Oracle ZFS Storage ZS9-2 controllers, a single functioning cluster I/O Ethernet link is enough to avoid triggering takeover. For all other controllers, a single functioning cluster I/O serial link is enough.
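The link redundancy rule described above can be expressed as a short Python sketch. It is an illustration only and does not query an appliance: the function name `takeover_avoided` and the list-of-booleans representation of link state are assumptions made for this example, not part of the appliance software.

```python
from typing import Sequence


def takeover_avoided(link_up: Sequence[bool]) -> bool:
    """Return True if at least one cluster I/O link is functional.

    Hypothetical helper: each entry in link_up is True when the
    corresponding cluster I/O link is up. Per the rule above, a single
    functioning link is enough to avoid triggering takeover.
    """
    if not link_up:
        raise ValueError("at least one cluster I/O link state is required")
    return any(link_up)


# A ZS9-2 controller has two cluster I/O Ethernet links.
zs9_2_links = [True, False]
# All other controllers have three cluster I/O links.
other_links = [False, False, False]

print(takeover_avoided(zs9_2_links))
print(takeover_avoided(other_links))
```

In this sketch, the first case keeps one link up and so avoids takeover, while the second case has lost every link, which is the situation in which split-brain can arise.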

If both controllers nevertheless attempt takeover, the appliance software performs an arbitration procedure to determine which controller should continue with takeover.

The Oracle ZFS Storage Appliance clustering solution is designed to ensure that there is no single point of failure, and to protect both data and availability against failure. Most failures can be prevented by installing the hardware properly and employing cluster setup and management best practices. Ensure the following:

  • All cluster I/O links (two for an Oracle ZFS Storage ZS9-2 controller, three for all other controllers) are connected and functional as shown in Cluster Configuration BUI View and Checking Cluster Link Status (CLI).

  • All storage cabling is connected as shown in the setup documentation that was delivered with your appliances.

    It is particularly important that two paths to each disk shelf are detected, as shown in the following figure, before you place the cluster into production and at all times afterward. The only exception is temporary cabling changes made to support capacity increases or replacement of faulty components. Use alerts to monitor the state of cluster interconnect links and disk shelf paths, and correct any failures promptly. Maintaining proper connectivity protects both availability and data integrity if a hardware or software component fails.


This figure shows a 2 in the Paths column for Disk Shelves.
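The checklist above can be summarized as a small Python sketch of a pre-production check. It is a sketch under stated assumptions: the function `check_cluster_cabling`, the model string `"ZS9-2"`, and the input structures are hypothetical, and the values must be read manually from the BUI or CLI because the sketch does not communicate with an appliance.

```python
from typing import Dict, List, Sequence

# Expected cluster I/O link counts taken from the checklist above:
# two for a ZS9-2 controller, three for all other controllers.
ZS9_2_LINKS = 2
OTHER_LINKS = 3
REQUIRED_SHELF_PATHS = 2


def check_cluster_cabling(
    model: str,
    link_up: Sequence[bool],
    shelf_paths: Dict[str, int],
) -> List[str]:
    """Return a list of problems; an empty list means the checks pass.

    Hypothetical helper. link_up holds one boolean per cluster I/O link,
    and shelf_paths maps a disk shelf name to its detected path count.
    """
    problems: List[str] = []

    expected = ZS9_2_LINKS if model == "ZS9-2" else OTHER_LINKS
    if len(link_up) != expected:
        problems.append(
            f"expected {expected} cluster I/O links, found {len(link_up)}"
        )

    # Every link should be connected and functional, not just one.
    for index, up in enumerate(link_up):
        if not up:
            problems.append(f"cluster I/O link {index} is down")

    # Two paths should be detected to each disk shelf.
    for shelf, paths in sorted(shelf_paths.items()):
        if paths < REQUIRED_SHELF_PATHS:
            problems.append(
                f"disk shelf {shelf} has {paths} path(s), "
                f"expected {REQUIRED_SHELF_PATHS}"
            )

    return problems


if __name__ == "__main__":
    issues = check_cluster_cabling(
        "ZS9-2", [True, True], {"shelf-0": 2, "shelf-1": 1}
    )
    for issue in issues:
        print(issue)
```

The check is deliberately stricter than the takeover rule: a single functioning link avoids takeover, but the best practice is to keep all links and both shelf paths healthy so that one further failure cannot affect availability or data integrity.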

Related Topics

Clustered Controller States