This chapter provides information about migrating services for maintenance or as a result of cluster failure. The chapter contains information about the following:
Migrating Services That Use Sun StorEdge Availability Suite 3.2.1 With a Switchover
Forcing a Takeover on Systems That Use Sun StorEdge Availability Suite 3.2.1
Recovering Sun StorEdge Availability Suite 3.2.1 Data After a Takeover
Recovering From a Sun StorEdge Availability Suite 3.2.1 Data Replication Error
This section describes the internal processes that occur when failure is detected on a primary or a secondary cluster.
When the primary cluster for a given protection group fails, the secondary cluster in the partnership detects the failure. The cluster that fails might be a member of more than one partnership, resulting in multiple failure detections.
The following actions occur when the overall state of a protection group changes to the Unknown state:
Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the heartbeat loss is not transient and that the primary cluster has failed. The heartbeat remains in the OK state during this default timeout interval, while the heartbeat mechanism continues to retry the primary cluster. Only the heartbeat plug-ins appear in the Error state.
The timeout interval is set by using the Query_interval property of the heartbeat. If the heartbeat still fails after four times the Query_interval you configured (three retries and one emergency-mode probe), a heartbeat-lost event is generated and logged in the system log. When you use the default interval, the emergency-mode retry behavior might delay heartbeat-loss notification for about nine minutes. Messages are displayed in the graphical user interface (GUI) and in the output of the geoadm status command.
For more information about logging, see Viewing the Sun Cluster Geographic Edition Log Messages in Sun Cluster Geographic Edition System Administration Guide.
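The timing rule for the heartbeat-lost event can be worked through with a short calculation. The following sketch multiplies a Query_interval value by four, as described for the three retries and the emergency-mode probe. The helper name heartbeat_loss_delay and the 60-second sample value are assumptions made for illustration; they are not product commands or product defaults, and the sketch assumes that Query_interval is expressed in seconds.

```shell
#!/bin/sh
# Sketch: estimate the delay before a heartbeat-lost event is generated.
# The event is generated after four times Query_interval
# (three retries plus one emergency-mode probe).
# heartbeat_loss_delay is a hypothetical helper, not a product command.
heartbeat_loss_delay() {
    query_interval=$1            # Query_interval, assumed to be in seconds
    echo $((4 * query_interval))
}

# 60 is a sample value chosen for illustration, not the product default.
delay=$(heartbeat_loss_delay 60)
echo "heartbeat-lost event after about ${delay} seconds"
```

Substitute the Query_interval value that is configured for your heartbeat to estimate the notification delay on your system.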
When a secondary cluster for a given protection group fails, a cluster in the same partnership detects the failure. The cluster that failed might be a member of more than one partnership, resulting in multiple failure detections.
During failure detection, the following actions occur:
Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the secondary cluster has failed.
The cluster notifies the administrator. The system detects all protection groups for which the cluster that failed was acting as secondary. The state of these protection groups becomes Unknown.
You perform a switchover of a Sun StorEdge Availability Suite 3.2.1 protection group when you want to migrate services to the partner cluster in an orderly fashion. A switchover consists of the following:
Application services are unmanaged on the former primary cluster, cluster-paris.
For a reminder of which cluster is cluster-paris, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
The data replication role is reversed and now continues to run from the new primary, cluster-newyork, to the former primary, cluster-paris.
Application services are brought online on the new primary cluster, cluster-newyork.
For a switchover to occur, data replication must be active between the primary cluster and the secondary cluster. Additionally, the data volumes on the two clusters must be in a synchronized state.
Before you switch over a protection group from the primary cluster to the secondary cluster, ensure that the following conditions are met:
Sun Cluster Geographic Edition software is running on both clusters.
The secondary cluster is a member of a partnership.
Both cluster partners can be reached.
The overall state of the protection group is OK.
Log in to a cluster node.
You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC in Sun Cluster Geographic Edition System Administration Guide.
Initiate the switchover.
The application resource groups that are a part of the protection group are stopped and started during the switchover.
# geopg switchover [-f] -m newprimarycluster protectiongroupname
-f
Forces the command to perform the operation without asking you for confirmation
-m newprimarycluster
Specifies the name of the cluster that is to be the primary cluster for the protection group
protectiongroupname
Specifies the name of the protection group
This example performs a switchover to the secondary cluster.
# geopg switchover -f -m cluster-newyork avspg
When you run the geopg switchover command, the software confirms that the volume sets that are associated with the device groups are in the replicating state. Then, the software performs the following actions on the original primary cluster:
Removes affinities and resource dependencies between all the application resource groups in the protection group and the internal resource groups, such as the lightweight resource group
Takes the application resource groups offline and places them in the Unmanaged state
Waits for writes to complete
Unmounts the primary volumes that correspond to the device groups in the protection group
Stops data replication by placing all volume sets in logging mode
Reverses the role of all volume sets
On the original secondary cluster, the command takes the following actions:
Places all volume sets in logging mode
Reverses the role of all volume sets
Starts data replication by issuing update synchronization with the autosynchronization feature active
Runs the script that is defined in the RoleChange_ActionCmd property
Brings all application resource groups online and adds the affinities between the application resource groups and the internal resource groups, such as the lightweight resource group
If the command completes successfully, the secondary cluster, cluster-newyork, becomes the new primary cluster for the protection group. The original primary cluster, cluster-paris, becomes the new secondary cluster. Volume sets associated with a device group of the protection group have their role reversed according to the role of the protection group on the local cluster. The application resource group is online on the new primary cluster. Data replication from the new primary cluster to the new secondary cluster begins.
This command returns an error if any of the previous operations fails. Run the geoadm status command to view the status of each component. For example, the Configuration status of the protection group might be set to Error, depending on the cause of the failure. The protection group might be activated or deactivated.
If the Configuration status of the protection group is set to Error, revalidate the protection group by using the procedures described in How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
If the configuration of the protection group is not the same on each partner cluster, you need to resynchronize the configuration by using the procedures described in How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.
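The switchover and the follow-up status check can be combined in a small wrapper. The following is a minimal sketch, not a supported tool: DRY_RUN, run, and switchover_pg are hypothetical names, and the sketch assumes that geopg exits with a nonzero status when it returns an error. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch: switch over a protection group, then show component status
# if the switchover reports an error.
# DRY_RUN and run() are hypothetical helpers for this sketch only.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

switchover_pg() {
    new_primary=$1
    pg=$2
    # Assumption: geopg exits with a nonzero status when it returns an error.
    # -f skips the confirmation prompt so that the script does not block.
    if ! run geopg switchover -f -m "$new_primary" "$pg"; then
        run geoadm status
        return 1
    fi
}

switchover_pg cluster-newyork avspg
```

Set DRY_RUN=0 only on a cluster node where you hold the Geo Management RBAC rights profile and after you confirm that the switchover preconditions are met.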
You perform a takeover when applications need to be brought online on the secondary cluster regardless of whether the data is completely consistent between the primary volume and the secondary volume. The information in this section assumes that the protection group has been started.
The following steps occur after a takeover is initiated:
If the former primary cluster, cluster-paris, can be reached and the protection group is not locked for notification handling or some other reason, the protection group is deactivated.
For a reminder of which cluster is cluster-paris, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
Data volumes of the former primary cluster, cluster-paris, are taken over by the new primary cluster, cluster-newyork.
This data might not be consistent with the original primary volumes. Data replication from the new primary cluster, cluster-newyork, to the former primary cluster, cluster-paris, is stopped.
The protection group is activated without data replication.
For details about the possible conditions of the primary and secondary cluster before and after takeover, see Appendix C, Takeover Postconditions, in Sun Cluster Geographic Edition System Administration Guide.
The following procedures describe the steps you must perform to force a takeover by a secondary cluster, and how to recover data afterward.
Before you force the secondary cluster to assume the activity of the primary cluster, ensure that the following conditions are met:
Sun Cluster Geographic Edition software is up and running on the cluster.
The cluster is a member of a partnership.
The Configuration status of the protection group is OK on the secondary cluster.
Log in to a node in the secondary cluster.
You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC in Sun Cluster Geographic Edition System Administration Guide.
Initiate the takeover.
# geopg takeover [-f] protectiongroupname
-f
Forces the command to perform the operation without your confirmation
protectiongroupname
Specifies the name of the protection group
This example forces the takeover of avspg by the secondary cluster, cluster-newyork.
phys-newyork-1 is the first node of the secondary cluster. For a reminder of which node is phys-newyork-1, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
phys-newyork-1# geopg takeover -f avspg
For information about the state of the primary and secondary clusters after a takeover, see Appendix C, Takeover Postconditions, in Sun Cluster Geographic Edition System Administration Guide.
When you run the geopg takeover command, the software confirms that the volume sets are in a Replicating or Logging state on the secondary cluster.
If the original primary cluster, cluster-paris, can be reached, the software performs the following actions:
Removes affinities and resource dependencies between all the application resource groups in the protection group and the internal resource group if the protection group was active
Takes the application resource groups offline and places them in an Unmanaged state
Unmounts the primary volumes that correspond to the device groups in the protection group
Stops data replication by placing all volume sets in logging mode
Reverses the role of all volume sets
On the original secondary cluster, cluster-newyork, the software performs the following actions:
Places all volume sets into logging mode
Reverses the role of all volume sets
Runs the script that is specified in the RoleChange_ActionCmd property
If the protection group was active on the original secondary cluster before the takeover, brings all application resource groups online and adds affinities and resource dependencies between the application resource group and the internal resource group
If the command completes successfully, the secondary cluster, cluster-newyork, becomes the new primary cluster for the protection group. Volume sets associated with a device group in the protection group have their role reversed according to the role of the protection group on the local cluster. If the protection group was active on the original secondary cluster before the takeover, the application resource groups are brought online on the new primary cluster. If the original primary cluster can be reached, it becomes the new secondary cluster of the protection group. Replication of all volume sets that are associated with the device groups of the protection group is stopped.
After a successful takeover, data replication is stopped. If you want to continue to suspend replication, specify the -n option when you use the geopg start command. This option prevents the start of data replication from the new primary cluster to the new secondary cluster.
This command returns an error if any of the previous operations fails. Use the geoadm status command to view the status of each component. For example, the Configuration status of the protection group might be set to Error, depending on the cause of the failure. The protection group might be activated or deactivated.
If the Configuration status of the protection group is set to Error, revalidate the protection group by using the procedures described in How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
If the configuration of the protection group is not the same on each partner cluster, you need to resynchronize the configuration by using the procedures described in How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.
After a successful takeover operation, the secondary cluster, cluster-newyork, becomes the primary for the protection group and the services are online on the secondary cluster. After the recovery of the original primary cluster, the services can be brought online again on the original primary by using a process called failback.
Sun Cluster Geographic Edition software supports the following two kinds of failback:
Failback-switchover. During a failback-switchover, applications are brought online again on the original primary cluster, cluster-paris, after the data of the primary cluster has been resynchronized with the data on the secondary cluster, cluster-newyork.
For a reminder of which clusters are cluster-paris and cluster-newyork, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
Failback-takeover. During a failback-takeover, applications are brought online again on the original primary cluster and use the current data on the primary cluster. Any updates that occurred on the secondary cluster are discarded.
If you want to leave the new primary, cluster-newyork, as the primary cluster and the original primary cluster, cluster-paris, as the secondary after the original primary starts again, you can resynchronize and revalidate the protection group configuration without performing a switchover or takeover.
Use this procedure to resynchronize and revalidate data on the original primary cluster, cluster-paris, with the data on the current primary cluster, cluster-newyork.
Before you resynchronize and revalidate the protection group configuration, a takeover has occurred on cluster-newyork. If the original primary cluster, cluster-paris, has been down, confirm that the cluster is booted and that the Sun Cluster Geographic Edition infrastructure is enabled on the cluster. For more information about booting a cluster, see Booting a Cluster in Sun Cluster Geographic Edition System Administration Guide.
The clusters now have the following roles:
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover from cluster-newyork.
Resynchronize the original primary cluster, cluster-paris, with the current primary cluster, cluster-newyork.
The cluster cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally. Resynchronize both the partnership and protection group configurations.
On cluster-paris, deactivate the protection group on the local cluster.
# geopg stop -e Local protectiongroupname
-e Local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
protectiongroupname
Specifies the name of the protection group.
If the protection group is already deactivated, the state of the resource group in the protection group is probably Error. The state is Error because the application resource groups are managed and offline.
Deactivating the protection group results in the application resource groups no longer being managed, clearing the Error state.
On cluster-paris, resynchronize the partnership.
# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once, even if you are resynchronizing multiple protection groups.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
On cluster-paris, resynchronize each protection group.
Because the role of the protection group on cluster-newyork is primary, this step ensures that the role of the protection group on cluster-paris is secondary.
# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about synchronizing protection groups, see Resynchronizing a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, validate the configuration for each protection group.
# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, activate each protection group.
When you activate a protection group, its application resource groups are also brought online.
# geopg start -e Global protectiongroupname
-e Global
Specifies the scope of the command.
By specifying a Global scope, the command operates on both clusters where the protection group is deployed.
protectiongroupname
Specifies the name of the protection group.
Do not use the -n option because the data needs to be synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
Because the protection group has a role of secondary, the data is synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
Confirm that the data is completely synchronized.
First, confirm that the state of the protection group on cluster-newyork is OK.
phys-newyork-1# geoadm status
Refer to the Protection Group section of the output.
Next, confirm that all resources in the replication resource group, AVSprotectiongroupname-rep-rg, report a status of OK.
phys-newyork-1# scstat -g
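The command sequence that this procedure runs on cluster-paris can be collected into one script. The following is a minimal sketch that prints the commands by default rather than running them. DRY_RUN, run, and resync_and_revalidate are hypothetical helper names, and paris-newyork-ps is a placeholder partnership name. The sketch also assumes that each command exits with a nonzero status on failure, so that the sequence stops at the first error.

```shell
#!/bin/sh
# Sketch: resynchronize and revalidate cluster-paris after a takeover,
# leaving cluster-newyork as the primary. Run on a cluster-paris node.
# DRY_RUN and run() are hypothetical helpers for this sketch only.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

resync_and_revalidate() {
    partnership=$1
    pg=$2
    # Deactivate the protection group on the local cluster only.
    run geopg stop -e Local "$pg" || return 1
    # Resynchronize the partnership, then the protection group.
    run geops update "$partnership" || return 1
    run geopg update "$pg" || return 1
    # Validate before activating; a group in an error state cannot start.
    run geopg validate "$pg" || return 1
    # Activate on both clusters. No -n option, so data is synchronized.
    run geopg start -e Global "$pg"
}

# paris-newyork-ps is a placeholder partnership name.
resync_and_revalidate paris-newyork-ps avspg
```

After the sequence completes, confirm synchronization from a cluster-newyork node with geoadm status and scstat -g, as the last step of the procedure describes.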
Use this procedure to restart an application on the original primary cluster, cluster-paris, after the data on the cluster has been resynchronized with the data on the current primary cluster, cluster-newyork.
The failback procedures apply only to clusters in a partnership. You need to perform the following procedure only once per partnership.
Before you perform a failback-switchover, a takeover has occurred on cluster-newyork. If the original primary cluster, cluster-paris, has failed, confirm that the cluster is booted and that the Sun Cluster Geographic Edition infrastructure is enabled on the cluster. For more information about booting a cluster, see Booting a Cluster in Sun Cluster Geographic Edition System Administration Guide.
The clusters now have the following roles:
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover from cluster-newyork.
Resynchronize the original primary cluster, cluster-paris, with the current primary cluster, cluster-newyork.
The cluster cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally. Resynchronize both the partnership and protection group configurations.
On cluster-paris, deactivate the protection group on the local cluster.
phys-paris-1# geopg stop -e Local protectiongroupname
-e Local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
protectiongroupname
Specifies the name of the protection group.
If the protection group is already deactivated, the state of the resource group in the protection group is probably Error. The state is Error because the application resource groups are managed and offline.
Deactivating the protection group results in the application resource groups no longer being managed, clearing the Error state.
On cluster-paris, resynchronize the partnership.
phys-paris-1# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once per partnership, even if you are performing a failback-switchover for multiple protection groups in the partnership.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
On cluster-paris, resynchronize each protection group.
Because the local role of the protection group on cluster-newyork is now primary, this step ensures that the role of the protection group on cluster-paris becomes secondary.
phys-paris-1# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about synchronizing protection groups, see Resynchronizing a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, validate the configuration for each protection group.
A protection group cannot be started when it is in an error state. Ensure that the protection group is not in an error state.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, activate each protection group.
When you activate a protection group, its application resource groups are also brought online.
phys-paris-1# geopg start -e Global protectiongroupname
-e Global
Specifies the scope of the command.
By specifying a Global scope, the command operates on both clusters where the protection group is deployed.
protectiongroupname
Specifies the name of the protection group.
Do not use the -n option when performing a failback-switchover because the data needs to be synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
Because the protection group has a role of secondary, the data is synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
Confirm that the data is completely synchronized.
First, confirm that the state of the protection group on cluster-newyork is OK.
phys-newyork-1# geoadm status
Refer to the Protection Group section of the output.
Next, confirm that all resources in the replication resource group, AVSprotectiongroupname-rep-rg, report a status of OK.
phys-newyork-1# scstat -g
On either cluster, perform a switchover from cluster-newyork to cluster-paris for each protection group.
# geopg switchover [-f] -m cluster-paris protectiongroupname
For more information, see How to Switch Over a Sun StorEdge Availability Suite 3.2.1 Protection Group From Primary to Secondary.
cluster-paris resumes its original role as primary cluster for the protection group.
Ensure that the switchover was performed successfully.
Verify that the protection group is now primary on cluster-paris and secondary on cluster-newyork and that the state for “Data replication” and “Resource groups” is OK on both clusters.
# geoadm status
Check the runtime status of the application resource group and data replication for each Sun StorEdge Availability Suite 3.2.1 protection group.
# scstat -g
Refer to the Status and Status Message fields that are presented for the data replication device group you want to check. For more information about these fields, see Table 2–1.
For more information about the runtime status of data replication, see Checking the Runtime Status of Sun StorEdge Availability Suite 3.2.1 Data Replication.
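When a partnership contains several protection groups, the failback-switchover steps split naturally into two phases: a preparation phase on cluster-paris, in which the partnership is resynchronized only once, and a switch-back phase that runs after the data is confirmed to be synchronized. The following is a minimal sketch of that split. DRY_RUN, run, prepare_failback, and switch_back are hypothetical names, and paris-newyork-ps and avspg2 are placeholder names. The order of the geopg stop loop relative to the other groups is a choice made for this sketch. By default the commands are only printed.

```shell
#!/bin/sh
# Sketch: failback-switchover for several protection groups in one partnership.
# DRY_RUN and run() are hypothetical helpers; by default commands are printed.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Phase 1, on cluster-paris: resynchronize, validate, and activate.
prepare_failback() {
    partnership=$1
    shift
    for pg in "$@"; do
        run geopg stop -e Local "$pg" || return 1
    done
    # The partnership is resynchronized once, however many groups follow.
    run geops update "$partnership" || return 1
    for pg in "$@"; do
        run geopg update "$pg" || return 1
        run geopg validate "$pg" || return 1
        run geopg start -e Global "$pg" || return 1
    done
}

# Phase 2, on either cluster: switch each group back to the original primary.
# Run this phase only after you confirm that the data is completely synchronized.
switch_back() {
    original_primary=$1
    shift
    for pg in "$@"; do
        run geopg switchover -f -m "$original_primary" "$pg" || return 1
    done
}

# paris-newyork-ps and avspg2 are placeholder names.
prepare_failback paris-newyork-ps avspg avspg2
switch_back cluster-paris avspg avspg2
```

In dry-run mode both phases print back to back. On a live system, check geoadm status and scstat -g between the two phases and again after the switchover.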
Use this procedure to restart an application on the original primary cluster, cluster-paris, and use the current data on the original primary cluster. Any updates that occurred on the secondary cluster, cluster-newyork, while it was acting as primary are discarded.
The failback procedures apply only to clusters in a partnership. You need to perform the following procedure only once per partnership.
You can resume using the data on the original primary, cluster-paris, only if you have not replicated data from the new primary, cluster-newyork, to the original primary cluster, cluster-paris, at any point after the takeover operation on cluster-newyork.
If the original primary cluster, cluster-paris, has failed, confirm that the cluster is booted and that the Sun Cluster Geographic Edition infrastructure is enabled on the cluster. For more information about booting a cluster, see Booting a Cluster in Sun Cluster Geographic Edition System Administration Guide.
Before you begin the failback-takeover operation, the clusters have the following roles:
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover from cluster-newyork.
Resynchronize the original primary cluster, cluster-paris, with the original secondary cluster, cluster-newyork.
cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally.
On cluster-paris, resynchronize the partnership.
phys-paris-1# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once per partnership, even if you are performing a failback-takeover for multiple protection groups in the partnership.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
On cluster-paris, resynchronize each protection group.
If the protection group has been activated, deactivate the protection group by using the geopg stop command. For more information about deactivating a protection group, see How to Deactivate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
phys-paris-1# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about synchronizing protection groups, see How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, validate the configuration for each protection group.
Ensure that the protection group is not in an error state. A protection group cannot be started when it is in an error state.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
On cluster-paris, activate each protection group in the secondary role without data replication.
Because the protection group on cluster-paris has a role of secondary, the geopg start command does not restart the application on cluster-paris.
phys-paris-1# geopg start -e local -n protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
-n
Prevents the start of data replication at protection group startup.
You must use the -n option.
protectiongroupname
Specifies the name of the protection group.
For more information, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
Replication from cluster-newyork to cluster-paris is not started because the -n option is used on cluster-paris.
On cluster-paris, initiate a takeover for each protection group.
phys-paris-1# geopg takeover [-f] protectiongroupname
-f
Forces the command to perform the operation without your confirmation
protectiongroupname
Specifies the name of the protection group
For more information about the geopg takeover command, see How to Force Immediate Takeover of Sun StorEdge Availability Suite 3.2.1 Services by a Secondary Cluster.
The protection group on cluster-paris now has the primary role, and the protection group on cluster-newyork has the secondary role.
On cluster-newyork, activate each protection group.
Because the protection group on cluster-newyork has a role of secondary, the geopg start command does not restart the application on cluster-newyork.
phys-newyork-1# geopg start -e local [-n] protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
-n
Prevents the start of data replication at protection group startup.
If you omit this option, the data replication subsystem starts at the same time as the protection group.
protectiongroupname
Specifies the name of the protection group.
For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
Start data replication.
To start data replication, activate the protection group on the primary cluster, cluster-paris.
# geopg start -e local protectiongroupname
For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
Ensure that the takeover was performed successfully.
Verify that the protection group is now primary on cluster-paris and secondary on cluster-newyork and that the state for Data replication and Resource groups is OK on both clusters.
# geoadm status
Check the runtime status of the application resource group and data replication for each Sun StorEdge Availability Suite 3.2.1 protection group.
# scstat -g
Refer to the Status and Status Message fields that are presented for the data replication device group you want to check. For more information about these fields, see Table 2–1.
For more information about the runtime status of data replication, see Checking the Runtime Status of Sun StorEdge Availability Suite 3.2.1 Data Replication.
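Because the failback-takeover procedure alternates between nodes on the two clusters, it can help to lay out which command runs where before you start. The following is a minimal sketch that only prints that plan and executes nothing. The names plan and failback_takeover_plan are hypothetical, paris-newyork-ps is a placeholder partnership name, and the sketch chooses to pass the optional -n option on cluster-newyork so that replication starts only at the final step.

```shell
#!/bin/sh
# Sketch: print the failback-takeover command sequence together with the
# node on which each command is run. Nothing is executed.
# plan() and failback_takeover_plan() are hypothetical helpers.
plan() {
    # Usage: plan NODE COMMAND [ARG...]
    node=$1
    shift
    echo "${node}# $*"
}

failback_takeover_plan() {
    partnership=$1
    pg=$2
    # Resynchronize the partnership once, then the protection group.
    # If the group is active on cluster-paris, deactivate it first.
    plan phys-paris-1 geops update "$partnership"
    plan phys-paris-1 geopg update "$pg"
    plan phys-paris-1 geopg validate "$pg"
    # Activate as secondary without data replication (-n is required here).
    plan phys-paris-1 geopg start -e local -n "$pg"
    # cluster-paris takes the primary role back; newyork updates are discarded.
    plan phys-paris-1 geopg takeover -f "$pg"
    # Activate the new secondary. -n is optional at this step.
    plan phys-newyork-1 geopg start -e local -n "$pg"
    # Start data replication from the primary, cluster-paris.
    plan phys-paris-1 geopg start -e local "$pg"
}

# paris-newyork-ps is a placeholder partnership name.
failback_takeover_plan paris-newyork-ps avspg
```

Follow the printed plan with the geoadm status and scstat -g checks from the last step of the procedure.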
When an error occurs at the data replication level, the error is reflected in the status of the resource in the replication resource group of the relevant device group.
For example, suppose a device group controlled by Sun StorEdge Availability Suite 3.2.1 that is called avsdg changes to a Volume failed state, VF. This state is reflected in the following resource status:
Resource Status = "FAULTED"
Resource status message = "FAULTED : Volume failed"
The Resource State remains Online because the probe is still running correctly.
Because the resource status has changed, the protection group status also changes. In this case, the local Data Replication state, the Protection Group state on the local cluster, and the overall Protection Group state become Error.
To recover from an error state, complete the relevant steps in the following procedure.
Use the procedures in the Sun StorEdge Availability Suite 3.2.1 documentation to determine the causes of the FAULTED state. This state is indicated as VF.
Recover from the faulted state by using the Sun StorEdge Availability Suite 3.2.1 procedures.
If the recovery procedures change the state of the device group, this state is automatically detected by the resource and is reported as a new protection group state.
Revalidate the protection group configuration.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname
Specifies the name of the Sun StorEdge Availability Suite 3.2.1 protection group
Review the status of the protection group configuration.
phys-paris-1# geopg list protectiongroupname
protectiongroupname
Specifies the name of the Sun StorEdge Availability Suite 3.2.1 protection group