Cluster Interconnect I/O

All inter-controller communication consists of one or more messages transmitted over the redundant cluster I/O links provided by the controller cluster interface card. For more information about cluster interface cards and cluster cabling, see Controller Cluster I/O Ports in Oracle ZFS Storage Appliance Cabling Guide, Release OS8.8.x and Connecting Cluster Cables in Oracle ZFS Storage Appliance Cabling Guide, Release OS8.8.x.

  • Oracle ZFS Storage ZS9-2 controllers employ Ethernet-based clustering using two Ethernet ports in the Oracle Quad Port 10GBASE-T Ethernet Adapter.

  • Oracle ZFS Storage ZS7-2, ZS5-x, ZS4-4, ZS3-x, and Sun ZFS Storage 7x20 controllers employ serial-based clustering using two serial cluster links, plus one Ethernet link. The Ethernet link provides a higher-performance transport for non-heartbeat messages, such as rejoin synchronization, and also carries a backup heartbeat.

Clustered controllers only communicate with each other over the secure private network established by the cluster interconnects, never over network interfaces intended for service or administration. Messages fall into two general categories: regular heartbeats used to detect the failure of a remote controller, and higher-level traffic associated with the resource manager and the cluster management subsystem.

Heartbeats are sent, and expected, on all links, and are transmitted continuously at fixed intervals. They are never acknowledged or retransmitted, because all heartbeats are identical and contain no unique information. Other traffic is acknowledged, verified, and retransmitted as required to maintain a reliable transport for higher-level software.
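The contrast between the two traffic classes can be sketched as follows. This is a hypothetical model, not the appliance's implementation: the Link, ReliableSender, frame, and verify names, the CRC-32 integrity check, and the resend-on-timeout policy are all illustrative assumptions. Heartbeats are fire-and-forget on every link, while higher-level messages are tracked until acknowledged and resent if not.

```python
import zlib
from dataclasses import dataclass, field

# Hypothetical model of the two traffic classes; names, the CRC-32 check,
# and the resend-on-timeout policy are illustrative assumptions.

HEARTBEAT = b"HB"  # every heartbeat is identical and carries no unique data


@dataclass
class Link:
    name: str
    sent: list = field(default_factory=list)

    def transmit(self, message):
        self.sent.append(message)


def send_heartbeat(links):
    """Fire-and-forget on every link: never acknowledged, never retransmitted."""
    for link in links:
        link.transmit(HEARTBEAT)


def frame(seq, payload):
    """Attach a sequence number and a CRC-32 so the receiver can verify the payload."""
    return (seq, payload, zlib.crc32(payload))


def verify(message):
    """Receiver side: return the sequence number to acknowledge, or None if corrupt."""
    seq, payload, crc = message
    return seq if zlib.crc32(payload) == crc else None


@dataclass
class ReliableSender:
    """Higher-level traffic is kept until acknowledged and resent on timeout."""
    link: Link
    next_seq: int = 0
    unacked: dict = field(default_factory=dict)  # seq -> payload

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        self.link.transmit(frame(seq, payload))
        return seq

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def on_timeout(self):
        # Resend every message that has not yet been acknowledged.
        for seq in sorted(self.unacked):
            self.link.transmit(frame(seq, self.unacked[seq]))


# Usage: the first message is acknowledged, so only the second is resent.
sender = ReliableSender(Link("eth0"))
first = sender.send(b"rejoin-sync-1")
second = sender.send(b"rejoin-sync-2")
sender.on_ack(first)
sender.on_timeout()
```

Because a lost heartbeat is simply superseded by the next identical one, retransmitting it would add traffic without adding information; only messages whose content matters pay the cost of sequence tracking.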

For Oracle ZFS Storage ZS9-2 controllers, heartbeat messages are sent at 200ms intervals, and failure to receive any message within 1 second is considered a link failure. For all other controllers, heartbeat messages are sent on all cluster I/O links at 50ms intervals, and failure to receive any message within 200ms (serial links) or 500ms (Ethernet links) is considered a link failure. For all controllers, if all links have failed, the peer is assumed to have failed, and takeover arbitration is performed.
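The per-link timeouts and the all-links-failed rule can be sketched as a small failure detector. The intervals and timeouts are taken from the text; the LinkSpec type, the link_failed and peer_failed functions, and the use of a strict "greater than" comparison are hypothetical choices for illustration only.

```python
from dataclasses import dataclass

# Timeouts and intervals come from the text; LinkSpec, link_failed, and
# peer_failed are hypothetical names for illustration only.


@dataclass(frozen=True)
class LinkSpec:
    kind: str          # "serial" or "ethernet"
    interval_ms: int   # heartbeat transmit interval
    timeout_ms: int    # silence after which the link is declared failed


# Oracle ZFS Storage ZS9-2: two Ethernet cluster ports.
ZS9_2_LINKS = [LinkSpec("ethernet", 200, 1000), LinkSpec("ethernet", 200, 1000)]

# ZS7-2, ZS5-x, ZS4-4, ZS3-x, and 7x20: two serial links plus one Ethernet link.
SERIAL_CLUSTER_LINKS = [
    LinkSpec("serial", 50, 200),
    LinkSpec("serial", 50, 200),
    LinkSpec("ethernet", 50, 500),
]


def link_failed(spec, last_rx_ms, now_ms):
    # Any received message (not only a heartbeat) resets last_rx_ms.
    # Treating "exceeds the timeout" as strictly greater is an assumption.
    return now_ms - last_rx_ms > spec.timeout_ms


def peer_failed(specs, last_rx_ms, now_ms):
    # The peer is presumed failed, and takeover arbitration begins,
    # only when every link has failed.
    return all(link_failed(s, t, now_ms) for s, t in zip(specs, last_rx_ms))


# Both serial links last heard at t=0 ms, the Ethernet link at t=100 ms.
last_rx = [0, 0, 100]
print(peer_failed(SERIAL_CLUSTER_LINKS, last_rx, 350))  # False: Ethernet link still up
print(peer_failed(SERIAL_CLUSTER_LINKS, last_rx, 700))  # True: all links silent
```

The example shows why the backup heartbeat on the Ethernet link matters: losing both serial links alone does not cause the peer to be declared failed.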

If a panic occurs on an Oracle ZFS Storage ZS9-2 controller, no panic notification message is sent; the clustering subsystem can detect that the peer has failed within 1200ms.

If a panic occurs on an Oracle ZFS Storage ZS7-2, ZS5-x, ZS4-4, ZS3-x, or Sun ZFS Storage 7x20 controller, the panicking controller transmits a single notification message over each serial link. The peer controller immediately begins takeover, regardless of the state of any other links. Given these characteristics, the clustering subsystem can normally detect that the peer has failed within:

  • 550ms, if the peer has stopped responding or lost power, or

  • 30ms, if the peer has encountered a fatal software error that triggered an operating system panic.
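The published detection figures for a silently failed peer happen to equal the longest per-link timeout plus one heartbeat interval: 500ms + 50ms = 550ms for the serial-clustered controllers and 1000ms + 200ms = 1200ms for the ZS9-2. The following arithmetic sketch assumes that decomposition (for example, if link timeouts were evaluated only once per heartbeat period, detection could lag the timeout by up to one interval); the documentation states only the totals, and the function name is hypothetical.

```python
# Assumed decomposition, not stated by the documentation: the worst case for a
# silently failed peer is the longest per-link timeout plus one heartbeat interval.


def silent_failure_bound_ms(interval_ms, link_timeouts_ms):
    return max(link_timeouts_ms) + interval_ms


# ZS7-2, ZS5-x, ZS4-4, ZS3-x, 7x20: serial 200 ms, Ethernet 500 ms, 50 ms heartbeats.
legacy = silent_failure_bound_ms(50, [200, 200, 500])
# ZS9-2: two Ethernet links, 1 s timeout, 200 ms heartbeats.
zs9 = silent_failure_bound_ms(200, [1000, 1000])
print(legacy, zs9)  # 550 1200
```

The 30ms panic case does not follow this pattern, because it depends on the explicit panic notification over the serial links rather than on heartbeat timeouts.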

All of the values described in this section are fixed. The appliance does not offer the ability to tune these parameters. These parameters are provided here for informational purposes only and may be changed without notice at any time.

Note:

To avoid data corruption after the physical relocation of a cluster, verify that all cluster cabling is installed correctly in the new location. For more information, see Preventing Split-Brain Conditions.

Related Topics

Clustered Controller States