8.14 Recovering From a Catastrophic Storage Controller Failure

The internal ZFS Storage Appliance uses two storage controllers to provide storage services to the rack components. If one controller goes offline due to an intentional failover, such as a reboot, the other controller takes over all required services within 1-2 minutes. In this situation the virtual machines are typically not affected and no disruptions are to be expected.

However, in the event of a catastrophic failure of a storage controller, meaning a sudden and unanticipated loss, the standby storage controller needs more time to take over all storage services. The shared ocfs2 file system used as the cluster heartbeat device is located on the ZFS Storage Appliance, so if the downtime exceeds the heartbeat limit, the management nodes begin the fencing process. The management nodes eventually shut down after 25-30 minutes. Until then the ovca service, which is the appliance controller software, appears unresponsive or exceptionally slow.

If the Oracle Private Cloud Appliance experiences a catastrophic storage controller failure, follow the procedure in this section to recover and return the appliance to normal operation.

Recovering After a Catastrophic Failure of a ZFS Storage Appliance Controller

If both management nodes have fenced themselves from the cluster and have shut down, verify that all storage and cluster resources have failed over correctly to the second storage controller of the internal ZFS Storage Appliance.

When the storage services are running correctly, power on both management nodes.

When the management nodes have finished booting, log in to the master management node and verify that all required services are running.

[root@ovcamn06r1 ~]# service ovca status
Checking Oracle Fabric Manager: Running
MySQL running (6191)                       [  OK  ]
Oracle VM Manager is not running...
Oracle VM Manager CLI is running...
tinyproxy (pid 7004 7003 7002 7001 7000 6999 6998 6997 6996 6995 6989) is running...
dhcpd (pid  7021) is running...
snmptrapd (pid  7037) is running...
log server (pid 5230) is running...
remaster server (pid 5232) is running...
http server (pid 7040) is running...
taskmonitor server (pid 7044) is running...
xmlrpc server (pid 7042) is running...
nodestate server (pid 7046) is running...
sync server (pid 7048) is running...
monitor server (pid 7050) is running...

If Oracle VM Manager is not running, as is the case in the example above, then the ovmcli, ovmm and ovmm_mysql services must all be restarted in the correct order: stop ovmcli, ovmm and ovmm_mysql, then start them again in the reverse order. Proceed as follows:

[root@ovcamn06r1 ~]# service ovmcli stop
Stopping Oracle VM Manager CLI                        [  OK  ]
[root@ovcamn06r1 ~]# service ovmm stop
Stopping Oracle VM Manager                            [  OK  ]
[root@ovcamn06r1 ~]# service ovmm_mysql stop
Shutting down OVMM MySQL..                            [  OK  ]

[root@ovcamn06r1 ~]# service ovmm_mysql start
Starting OVMM MySQL.                                  [  OK  ]
[root@ovcamn06r1 ~]# service ovmm start
Starting Oracle VM Manager                            [  OK  ]
[root@ovcamn06r1 ~]# service ovmm status
Oracle VM Manager is running...
[root@ovcamn06r1 ~]# service ovmcli start
Starting Oracle VM Manager CLI                        [  OK  ]
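
The stop and start order shown above could be wrapped in a small shell function so that the sequence is not mistyped during recovery. This is only a sketch: the names run, restart_ovmm and DRY_RUN are hypothetical and not part of the appliance software, and ignoring errors from the stop commands is an assumption (a service may already be down). The service commands themselves are the ones used in this procedure.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper for the restart sequence shown in this section.
restart_ovmm() {
    # Stop from the top down: CLI, then manager, then its database.
    # Assumption: a failed stop is ignored, since the service may already be down.
    run service ovmcli stop || true
    run service ovmm stop || true
    run service ovmm_mysql stop || true

    # Start in reverse order and give up at the first failure.
    run service ovmm_mysql start &&
    run service ovmm start &&
    run service ovmm status &&
    run service ovmcli start
}
```

Setting DRY_RUN=1 before calling restart_ovmm only prints the seven commands in order, which allows the sequence to be reviewed before it is run as root on the master management node.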

After Oracle VM Manager has been restarted successfully, verify the status of the services again.

[root@ovcamn06r1 ~]# service ovca status
Checking Oracle Fabric Manager: Running
MySQL running (12356)                       [  OK  ]
Oracle VM Manager is running...
Oracle VM Manager CLI is running...
tinyproxy (pid 7004 7003 7002 7001 7000 6999 6998 6997 6996 6995 6989) is running...
dhcpd (pid  7021) is running...
snmptrapd (pid  7037) is running...
log server (pid 5230) is running...
remaster server (pid 5232) is running...
http server (pid 7040) is running...
taskmonitor server (pid 7044) is running...
xmlrpc server (pid 7042) is running...
nodestate server (pid 7046) is running...
sync server (pid 7048) is running...
monitor server (pid 7050) is running...
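
The status checks in this procedure can also be scripted. The sketch below assumes that a stopped component is always reported with the words "not running", as in the Oracle VM Manager line of the first example. The function name ovca_check is hypothetical. It reads the status text on standard input, prints any component reported as not running, and returns a non-zero exit status in that case.

```shell
# Hypothetical helper: scan `service ovca status` output for stopped components.
# Reads the status text on stdin. Prints the matching lines and returns 1
# if any component is reported as "not running", otherwise returns 0.
ovca_check() {
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # aborting callers that run with "set -e".
    ovca_stopped=$(grep -i 'not running' || true)
    if [ -n "$ovca_stopped" ]; then
        printf '%s\n' "$ovca_stopped"
        return 1
    fi
    return 0
}

# Usage on the master management node:
#   service ovca status | ovca_check && echo "all services running"
```

With the first status output in this section as input, the function would print the line "Oracle VM Manager is not running..." and return 1. With the output shown directly above, it prints nothing and returns 0.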

When all services are reported as running, Oracle PCA is back in normal operating condition.