Kernel Cage DR Recovery

When you perform a Dynamic Reconfiguration (DR) remove operation on a memory board with kernel cage memory, the affected node becomes unresponsive. Heartbeat monitoring for that node is therefore suspended on all other nodes, and the node's quorum vote count is set to 0. After the DR operation completes, heartbeat monitoring of the affected node is automatically re-enabled and its quorum vote count is reset to 1. If the DR operation does not complete, you might need to recover manually. For general information about DR, see "Dynamic Reconfiguration Support" in the Oracle Solaris Cluster Concepts Guide.

The cluster monitor-heartbeat subcommand, which the recovery procedure uses, is not supported in an exclusive-IP zone cluster. For more information about this command, see the cluster(1CL) man page.

Preparing the Cluster for Kernel Cage DR

When you use a DR operation to remove a system board containing kernel cage memory (memory used by the Oracle Solaris OS), the system must be quiesced in order to allow the memory contents to be copied to another system board. In a clustered system, the tight coupling between cluster nodes means that the quiescing of one node for repair can cause operations on non-quiesced nodes to be delayed until the repair operation is complete and the node is unquiesced. For this reason, using DR to remove a system board containing kernel cage memory from a cluster node requires careful planning and preparation.

Use the following information to reduce the impact of the DR quiesce on the rest of the cluster:

I/O operations for file systems or global device groups with their primary or secondary on the quiesced node will hang until the node is unquiesced. If possible, ensure that the node being repaired is not the primary for any global file systems or device groups.
I/O to SVM multi-owner disksets that include the quiesced node will hang until the node is unquiesced.
Updates to the CCR require communication between all cluster members. Any operations that result in CCR updates should not be performed while the DR operation is ongoing. Configuration changes are the most common cause of CCR updates.
Many cluster commands result in communication among cluster nodes. Refrain from running cluster commands during the DR operation.
Applications and cluster resources on the node being quiesced will be unavailable for the duration of the DR event. The time required to move applications and resources to another node should be weighed against the expected outage time of the DR event.
Scalable applications such as Oracle RAC often have a different membership standard, and have communication and synchronization actions among members. Scalable application instances on the node to be repaired should be brought offline before you initiate the DR operation.
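As a starting point for the planning described above, the following minimal sketch prints a short list of status commands to review before the node is quiesced. It is not an official procedure. The function name predr_checks is hypothetical, and the -n node option on the cldevicegroup and clresourcegroup status subcommands is assumed from their man pages, so check cldevicegroup(1CL) and clresourcegroup(1CL) before relying on it. The function only prints commands and does not change cluster state.

```shell
#!/bin/sh
# Print status commands to review before quiescing a node for DR.
# predr_checks is a hypothetical helper; it prints commands and runs nothing.
# Assumption: the status subcommands accept "-n node" (verify in the man pages).
predr_checks() {
    node=$1
    if [ -z "$node" ]; then
        echo "usage: predr_checks nodename" >&2
        return 2
    fi
    # Device groups that involve the node (look for it as primary or secondary).
    echo "cldevicegroup status -n $node"
    # Resource groups currently hosted on the node.
    echo "clresourcegroup status -n $node"
    # Quorum votes before the DR operation starts.
    echo "clquorum status"
}

# Example: predr_checks pnode3
```

Run the printed commands by hand before the DR operation begins, because running cluster commands during the operation is discouraged.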

How to Recover From an Interrupted Kernel Cage DR Operation

If the DR operation does not complete, perform the following steps to re-enable heartbeat timeout monitoring for that node and to reset the quorum vote count.

If DR does not complete successfully, manually re-enable heartbeat timeout monitoring.
From a single cluster node other than the node where the DR operation was performed, run the following command.
```
# cluster monitor-heartbeat
```
Use this command only in the global zone. Messages are displayed to indicate that monitoring has been enabled.
If the node that was dynamically reconfigured paused during boot, allow it to finish booting and join the cluster membership.
If the node is at the ok prompt, boot it now.

Verify that the node is now part of the cluster membership and check the quorum vote count of the cluster nodes by running the following command on a single node in the cluster.

```
# clquorum status
--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
pnode1          1             1              Online
pnode2          1             1              Online
pnode3          0             0              Online
```
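On a cluster with many nodes, the check above can be scripted. The following is a minimal sketch, assuming the node table has the four-column layout shown in this example (node name, present votes, possible votes, status). The function name zero_vote_nodes is hypothetical. It reads clquorum status output on standard input and prints the name of each node whose present vote count is 0.

```shell
#!/bin/sh
# List nodes whose "Present" vote count is 0 in `clquorum status` output.
# zero_vote_nodes is a hypothetical helper; it only reads standard input.
# Assumption: node rows have four fields: name, present, possible, status.
zero_vote_nodes() {
    awk '
        # Start of the per-node table.
        /^--- Quorum Votes by Node/ { in_nodes = 1; next }
        # Any other section header ends the per-node table.
        /^--- /                     { in_nodes = 0 }
        # Data rows: numeric second field equal to 0.
        in_nodes && NF == 4 && $2 ~ /^[0-9]+$/ && $2 == 0 { print $1 }
    '
}

# Usage on a cluster node:
#   clquorum status | zero_vote_nodes
```

For the sample output shown here, the function would print pnode3.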

If one of the nodes has a vote count of 0, reset its vote count to 1 by running the following command on a single node in the cluster.
```
# clquorum votecount -n nodename 1
```
nodename

The hostname of the node that has a quorum vote count of 0.
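If more than one node reports a vote count of 0, the same command can be repeated for each one. The sketch below assumes only the clquorum votecount syntax shown in this step. The function name reset_votes and the DRY_RUN variable are hypothetical. DRY_RUN defaults to 1, so the commands are printed rather than run until you set DRY_RUN=0.

```shell
#!/bin/sh
# Reset the quorum vote count to 1 for each node named as an argument.
# reset_votes and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN unset or set to 1, commands are printed and nothing is changed.
reset_votes() {
    for node in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "clquorum votecount -n $node 1"
        else
            # Stop at the first failure so it can be investigated.
            clquorum votecount -n "$node" 1 || return 1
        fi
    done
}

# Example (prints the command only): reset_votes pnode3
# Example (runs the command):        DRY_RUN=0 reset_votes pnode3
```

Printing the commands first lets you confirm the node names before any cluster configuration is changed.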

Verify that all nodes now have a quorum vote count of 1.

```
# clquorum status
--- Quorum Votes by Node (current status) ---

Node Name       Present       Possible       Status
---------       -------       --------       ------
pnode1          1             1              Online
pnode2          1             1              Online
pnode3          1             1              Online
```

Oracle Solaris Cluster 4.1 Hardware Administration Manual