The internal ZFS Storage Appliance uses two storage controllers to provide storage services to the rack components. If one controller goes offline due to an intentional failover, such as a reboot, the other controller takes over all required services within 1-2 minutes. In this situation the virtual machines are typically not affected and no disruptions are to be expected.
However, in the event of a catastrophic failure of a storage controller, meaning a sudden and unanticipated loss, the standby storage controller needs more time to take over all storage services. The shared ocfs2 file system used as the cluster heartbeat device is located on the ZFS Storage Appliance, so if the downtime exceeds the heartbeat limit, the management nodes begin the fencing process. After 25-30 minutes the management nodes eventually shut down, but during this time the ovca service, which is the appliance controller software, appears unresponsive or exceptionally slow.
If the Oracle Private Cloud Appliance experiences a catastrophic storage controller failure, follow the procedure in this section to recover and return the appliance to normal operation.
Recovering After a Catastrophic Failure of a ZFS Storage Appliance Controller
If both management nodes have fenced themselves from the cluster and have shut down, verify that all storage and cluster resources have failed over correctly to the second storage controller of the internal ZFS Storage Appliance.
When the storage services are running correctly, power on both management nodes.
When the management nodes have finished booting, log in to the master management node and verify that all required services are running.
[root@ovcamn06r1 ~]# service ovca status
Checking Oracle Fabric Manager: Running
MySQL running (6191) [ OK ]
Oracle VM Manager is not running...
Oracle VM Manager CLI is running...
tinyproxy (pid 7004 7003 7002 7001 7000 6999 6998 6997 6996 6995 6989) is running...
dhcpd (pid 7021) is running...
snmptrapd (pid 7037) is running...
log server (pid 5230) is running...
remaster server (pid 5232) is running...
http server (pid 7040) is running...
taskmonitor server (pid 7044) is running...
xmlrpc server (pid 7042) is running...
nodestate server (pid 7046) is running...
sync server (pid 7048) is running...
monitor server (pid 7050) is running...
If Oracle VM Manager is not running, as is the case in the example above, then the ovmcli, ovmm and ovmm_mysql services must all be restarted in the correct order. Proceed as follows:
[root@ovcamn06r1 ~]# service ovmcli stop
Stopping Oracle VM Manager CLI [ OK ]
[root@ovcamn06r1 ~]# service ovmm stop
Stopping Oracle VM Manager [ OK ]
[root@ovcamn06r1 ~]# service ovmm_mysql stop
Shutting down OVMM MySQL.. [ OK ]
[root@ovcamn06r1 ~]# service ovmm_mysql start
Starting OVMM MySQL. [ OK ]
[root@ovcamn06r1 ~]# service ovmm start
Starting Oracle VM Manager [ OK ]
[root@ovcamn06r1 ~]# service ovmm status
Oracle VM Manager is running...
[root@ovcamn06r1 ~]# service ovmcli start
Starting Oracle VM Manager CLI [ OK ]
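The restart sequence shown above could be wrapped in a small shell function so that the order is not mistyped under pressure. This is a minimal sketch, not part of the Oracle PCA software: the function name restart_ovmm_stack is hypothetical, and the script assumes that the init scripts return a non-zero exit status when a start fails or when ovmm is not running.

```shell
#!/bin/sh
# Sketch only: wraps the manual restart sequence in one function.
# restart_ovmm_stack is a hypothetical name, not an Oracle PCA command.

restart_ovmm_stack() {
    # Stop from the top of the stack down: CLI, manager, then its database.
    # A failed stop is reported but not fatal, because a service may
    # already be down after the storage controller failure.
    for svc in ovmcli ovmm ovmm_mysql; do
        service "$svc" stop || echo "warning: could not stop $svc" >&2
    done

    # Start in the reverse order: database first, then the manager.
    service ovmm_mysql start || return 1
    service ovmm start || return 1

    # Assumption: "status" exits non-zero when the manager is not running.
    # Only start the CLI once the manager reports as running.
    service ovmm status || return 1
    service ovmcli start || return 1
}
```

Running restart_ovmm_stack as root on the master management node would issue the same seven commands as the manual listing. The function stops at the first start or status command that fails, so a problem is not hidden by later output.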
After Oracle VM Manager has been restarted successfully, verify the status of the services again.
[root@ovcamn06r1 ~]# service ovca status
Checking Oracle Fabric Manager: Running
MySQL running (12356) [ OK ]
Oracle VM Manager is running...
Oracle VM Manager CLI is running...
tinyproxy (pid 7004 7003 7002 7001 7000 6999 6998 6997 6996 6995 6989) is running...
dhcpd (pid 7021) is running...
snmptrapd (pid 7037) is running...
log server (pid 5230) is running...
remaster server (pid 5232) is running...
http server (pid 7040) is running...
taskmonitor server (pid 7044) is running...
xmlrpc server (pid 7042) is running...
nodestate server (pid 7046) is running...
sync server (pid 7048) is running...
monitor server (pid 7050) is running...
Oracle PCA is back in normal operating condition.
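The status check used in this procedure lends itself to a quick filter that highlights only the components that are down. The following is a minimal sketch under stated assumptions: report_stopped is a hypothetical helper name, and it assumes that a stopped component is reported with the words "is not running", as in the sample output in this section. Any other failure wording would not be caught.

```shell
#!/bin/sh
# Sketch only: report_stopped is a hypothetical helper, not an Oracle PCA tool.
# Reads "service ovca status" output on stdin, prints each line that
# reports a component as "is not running", and returns 1 if any were found.

report_stopped() {
    # grep exits 0 when at least one line matches, which here means
    # that at least one component is down.
    if grep 'is not running'; then
        return 1
    fi
    return 0
}
```

A possible use is `service ovca status | report_stopped`, which prints nothing and returns 0 when every listed component reports as running.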