Controller Failover and Failback

When a Controller fails or is taken offline through Guided Maintenance, the Oracle FS System uses failover and failback to return to a normal state. Failover transfers all of the resources of the offline node to the node that remains online. Failback transfers those resources back to the offline node when that node comes back online.
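The resource movement described above can be modeled as a toy two-node pair. This is a minimal Python sketch under stated assumptions, not the Oracle FS System's implementation: `ControllerNode`, `ControllerPair`, the resource names, and the `verified` flag (standing in for the boot-cycle verification of the node's operational state) are all hypothetical. Each node records the resources it owns under normal conditions, so failback returns that set rather than everything the partner currently holds.

```python
from dataclasses import dataclass, field


@dataclass
class ControllerNode:
    """Hypothetical model of one Controller node; not an Oracle FS API."""
    name: str
    resources: set[str] = field(default_factory=set)       # resources serviced now
    home_resources: set[str] = field(default_factory=set)  # resources owned normally
    online: bool = True

    def __post_init__(self) -> None:
        # By default, a node's home set is whatever it services at creation time.
        if not self.home_resources:
            self.home_resources = set(self.resources)


class ControllerPair:
    """Toy active-active pair: failover moves everything, failback returns the home set."""

    def __init__(self, first: ControllerNode, second: ControllerNode) -> None:
        self.nodes = {first.name: first, second.name: second}

    def partner_of(self, name: str) -> ControllerNode:
        return next(node for node in self.nodes.values() if node.name != name)

    def failover(self, name: str) -> None:
        offline = self.nodes[name]
        survivor = self.partner_of(name)
        offline.online = False
        # The node that remains online takes over all resources of the offline node.
        survivor.resources |= offline.resources
        offline.resources = set()

    def failback(self, name: str, verified: bool) -> None:
        # Hypothetical 'verified' flag stands in for the boot-cycle state check.
        if not verified:
            raise RuntimeError(f"{name} not verified; failback deferred")
        returning_node = self.nodes[name]
        survivor = self.partner_of(name)
        # Move back only the resources the returning node owned before failover.
        moved = survivor.resources & returning_node.home_resources
        survivor.resources -= moved
        returning_node.resources |= moved
        returning_node.online = True


if __name__ == "__main__":
    pair = ControllerPair(ControllerNode("ctrl_a", {"lun1"}), ControllerNode("ctrl_b", {"lun2"}))
    pair.failover("ctrl_a")
    print(sorted(pair.nodes["ctrl_b"].resources))  # ['lun1', 'lun2']
    pair.failback("ctrl_a", verified=True)
    print(sorted(pair.nodes["ctrl_a"].resources))  # ['lun1']
```

Keeping a separate home set is a design choice of this sketch only; the source says that failback returns the "appropriate" services and resources without describing how they are selected.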

Controller Failover

Under normal conditions, each Controller node in a pair uses its resources to actively service the I/O requests that arrive on the data path. The Controller nodes are cross-connected as an active-active pair. This cross-connection enables either node to support the other if one of them fails.

When one Controller node fails, both the Pilot and the partner Controller node detect and confirm the failure. The partner Controller node then takes over all of the I/O requests of the failed Controller node. This takeover includes flushing the data cache to storage and switching Controller data operations to conservative mode.
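The failover sequence above can be sketched as a single function. This is an illustrative Python model, not the Oracle FS System's logic: `PartnerNode`, `fail_over`, the two boolean failure reports, and the dict standing in for storage are hypothetical. The cache being flushed is modeled as the partner's own cache, the order of the three takeover steps is chosen for illustration, and conservative mode is represented only as a flag because its internal effects are not described here.

```python
from dataclasses import dataclass, field


@dataclass
class PartnerNode:
    """Hypothetical surviving Controller node; field names are illustrative only."""
    io_queue: list[str] = field(default_factory=list)
    dirty_cache: dict[str, bytes] = field(default_factory=dict)
    mode: str = "normal"


def confirm_failure(pilot_reports_down: bool, partner_reports_down: bool) -> bool:
    # In this model, both the Pilot and the partner node must confirm the failure.
    return pilot_reports_down and partner_reports_down


def fail_over(partner: PartnerNode, failed_io: list[str], storage: dict[str, bytes],
              pilot_reports_down: bool, partner_reports_down: bool) -> bool:
    if not confirm_failure(pilot_reports_down, partner_reports_down):
        return False  # Failure not confirmed: leave the pair unchanged.
    # Take over the failed node's outstanding I/O requests.
    partner.io_queue.extend(failed_io)
    # Flush cached data to storage (modeled as a dict update), then empty the cache.
    storage.update(partner.dirty_cache)
    partner.dirty_cache.clear()
    # Switch data operations to conservative mode (modeled as a simple flag).
    partner.mode = "conservative"
    return True
```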

A Controller node failover can occur either when a Controller node fails unexpectedly or when you need to replace a field replaceable unit (FRU) or customer replaceable unit (CRU) that requires powering down the Controller node. For components that require the Controller node to be powered down before maintenance, Guided Maintenance initiates the failover process for the Controller node being serviced.

The failover process performs the following tasks:

After the maintenance is complete or the Controller failure has been handled, the Controller node begins a boot cycle. During the boot cycle, the system verifies the operational state of the Controller node and then performs a failback to restore services to the repaired node.

Controller Failback

After a Controller node has failed over to its partner node, the system performs a failback operation once that Controller node is repaired or recovers and boots. During failback, the partner Controller node, which has been handling all of the services and the resource load, transfers the appropriate services and resources back to the repaired Controller node.

When a Controller node comes back online, the node passes through several states. Use the Event Log screen in the Oracle FS System Manager (GUI) to track the status of the Controller node as it returns online.