7.26 Cluster Failure After An Offline Server is Removed From the Cluster While Another Cluster Member is Offline

If more than one Oracle VM Server in a clustered server pool is offline at the same time and one of those servers is removed from the server pool, the cluster remains in a failed state when the other offline servers come back online. This happens because the cluster configuration is no longer synchronized across the members of the cluster.

Whenever a server is added to or removed from a cluster, Oracle VM Manager triggers an operation on each Oracle VM Server in the cluster to update the cluster configuration information. However, any servers that are offline at this time cannot receive the updated configuration, which results in a mismatch between the cluster configuration stored on the offline server and the actual configuration used by the rest of the cluster. As a result, the affected server can no longer participate in the cluster.
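The mismatch can be pictured as a difference between two membership lists. The following is a minimal illustrative sketch, not an Oracle VM tool: the function name stale_members and both input files (one hostname per line) are hypothetical, and Oracle VM does not store its cluster configuration in this form.

```shell
#!/bin/sh
# Illustration only: both input files are hypothetical exports, one
# hostname per line. They are not files that Oracle VM maintains.

# Print the servers that a stale local view still lists but that the
# current pool membership no longer contains.
stale_members() {
    local_view=$1
    pool_view=$2
    tmp_local=$(mktemp) || return 1
    tmp_pool=$(mktemp) || { rm -f "$tmp_local"; return 1; }
    # comm requires sorted input, so sort both lists first.
    sort -u "$local_view" > "$tmp_local"
    sort -u "$pool_view" > "$tmp_pool"
    # comm -23 keeps lines that appear only in the first file.
    comm -23 "$tmp_local" "$tmp_pool"
    rm -f "$tmp_local" "$tmp_pool"
}
```

Any hostname that this function prints corresponds to a server that the stale view still treats as a cluster member, which is the kind of disagreement that keeps a returning server out of the cluster.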

On an x86 platform, this situation is shown in Oracle VM Manager and can be resolved from within Oracle VM Manager. On a SPARC platform, a cluster panic can cause the server to reboot repeatedly when it comes back online.

Workaround (SPARC): If a SPARC server reboots continuously, the cause is that the server is attempting to rejoin the cluster even though it is no longer a member of the cluster. As a result, the server can panic repeatedly with a message similar to the following:

panic[cpu5]/thread=2a102a13c60: 
 **** dlm FENCING this system by PANICing
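To confirm that a reboot loop is caused by this issue, you can search a saved copy of the console or system log for the fencing message. This is a minimal sketch that assumes the log has already been saved to a file; the function name count_fencing_panics and the file path in the usage line are hypothetical.

```shell
#!/bin/sh
# Count the lines in a saved log file that contain the fencing panic text.
# The log file location is supplied by the caller.
count_fencing_panics() {
    log_file=$1
    # grep -c prints the number of matching lines (0 if there are none).
    # The pattern contains no regular-expression metacharacters.
    grep -c 'dlm FENCING this system by PANICing' "$log_file"
}
```

For example, count_fencing_panics /path/to/saved-console.log prints the number of matching lines. A count above zero suggests that the panics come from cluster fencing rather than from an unrelated fault.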

If this issue occurs and a server is repeatedly panicking, you can prevent further panics by stopping the ovs-agent service and deconfiguring the cluster. To do so, connect to the server as root and run the following commands:

# svcadm disable -s ovs-agent
# dlmcconf -S

If you are unable to run these commands because the server is panicking too quickly, then boot the server in single-user mode:

# boot -s

Disable the ovs-agent service:

# svcadm disable ovs-agent

Reboot the server. The server stops panicking while the ovs-agent service is disabled. If you want to re-enable the ovs-agent service, then you must resolve the cluster configuration issue first.
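Before or after the reboot, you may want to confirm that the service is actually disabled. This is a minimal sketch that assumes the standard Solaris svcs command is available and that the short service name ovs-agent resolves in the same way as in the svcadm commands; the function name agent_is_disabled is hypothetical.

```shell
#!/bin/sh
# Return success only if the ovs-agent SMF service reports the state
# "disabled".
agent_is_disabled() {
    # svcs -H suppresses the header line; -o state prints only the
    # state column. tr removes any padding around the value.
    state=$(svcs -H -o state ovs-agent 2>/dev/null | tr -d ' \t')
    [ "$state" = "disabled" ]
}
```

A caller could use it as agent_is_disabled && echo "ovs-agent is disabled". If the function fails, the service is in some other state or could not be queried.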

Once you have resolved the cluster configuration issue, you can acknowledge the server cluster failure event within Oracle VM Manager to resume normal operations.

Workaround (x86): To restore the environment to normal operation, you must first acknowledge the server cluster failure event within Oracle VM Manager. Then remove from the server pool any servers that were offline when you made the configuration change, and add them back to the server pool so that the cluster configuration information is properly refreshed on those servers.

Bugs 22304185, 18776654