This chapter provides information about migrating services for maintenance or as a result of cluster failure. This chapter contains the following sections:
Detecting Cluster Failure on a System That Uses Hitachi TrueCopy Data Replication
Migrating Services That Use Hitachi TrueCopy Data Replication With a Switchover
Forcing a Takeover on a System That Uses Hitachi TrueCopy Data Replication
Recovering Services to a Cluster on a System That Uses Hitachi TrueCopy Replication
Recovering From a Switchover Failure on a System That Uses Hitachi TrueCopy Replication
This section describes the internal processes that occur when failure is detected on a primary or a secondary cluster.
When the primary cluster for a given protection group fails, the secondary cluster in the partnership detects the failure. The cluster that fails might be a member of more than one partnership, resulting in multiple failure detections.
The following actions take place when a primary cluster failure occurs. During a failure, the appropriate protection groups are in the Unknown state.
Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the heartbeat loss is not transient and that the primary cluster has failed. The heartbeat remains in the Online state during this default timeout interval, while the heartbeat mechanism continues to retry the primary cluster.
This query interval is set by using the Query_interval heartbeat property. If the heartbeat still fails after the interval you configured, a heartbeat-lost event is generated and logged in the system log. When you use the default interval, the emergency-mode retry behavior might delay heartbeat-loss notification for about nine minutes. Messages are displayed in the graphical user interface (GUI) and in the output of the geoadm status command.
For more information about logging, see Viewing the Sun Cluster Geographic Edition Log Messages in Sun Cluster Geographic Edition System Administration Guide.
When a secondary cluster for a given protection group fails, a cluster in the same partnership detects the failure. The cluster that failed might be a member of more than one partnership, resulting in multiple failure detections.
During failure detection, the following actions occur:
Heartbeat failure is detected by a partner cluster.
The heartbeat is activated in emergency mode to verify that the secondary cluster has failed.
The cluster notifies the administrator. The system detects all protection groups for which the cluster that failed was acting as secondary. The state of the appropriate protection groups is marked Unknown.
Perform a switchover of a Hitachi TrueCopy protection group when you want to migrate services to the partner cluster in an orderly fashion. A switchover consists of the following:
Application services are offline on the former primary cluster, cluster-paris.
For a reminder of which cluster is cluster-paris, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
The data replication role is reversed and now continues to run from the new primary, cluster-newyork, to the former primary, cluster-paris.
Application services are brought online on the new primary cluster, cluster-newyork.
When a switchover is initiated by using the geopg switchover command, the data replication subsystem runs several validations on both clusters. The switchover is performed only if the validation step succeeds on both clusters.
First, the replication subsystem checks that the Hitachi TrueCopy device group is in a valid aggregate device group state. Then, it checks that the local device group states on the target primary cluster, cluster-newyork, are 23, 33, 43, or 53. The local device group state is returned by the pairvolchk -g device-group-name -ss command. These values correspond to a PVOL_PAIR or SVOL_PAIR state. The Hitachi TrueCopy commands that are run on the new primary cluster, cluster-newyork, are described in the following table.
Table 3–1 Hitachi TrueCopy Switchover Validations on the New Primary Cluster
Aggregate Device Group State | Valid Device Group State on Local Cluster | Hitachi TrueCopy Switchover Commands That Are Run on cluster-newyork |
---|---|---|
SMPL | None | None |
Regular primary | 23, 43 | No command is run, because the Hitachi TrueCopy device group is already in the PVOL_PAIR state. |
Regular secondary | 33, 53 | horctakeover -g dg [-t] The -t option is specified when the fence_level of the Hitachi TrueCopy device group is async. The value is calculated as 80% of the Timeout property of the protection group. For example, if the protection group has a Timeout of 200 seconds, the value of -t used in this command is 80% of 200 seconds, or 160 seconds. |
Takeover primary | None | None |
Takeover secondary | None | None |
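The state check and the timeout rule described for Table 3–1 can be sketched in a few lines of Python. This is only an illustration of the documented rules, not part of the Sun Cluster Geographic Edition software. The function names are hypothetical, and rounding down to whole seconds is an assumption, because the text does not say how fractional values are handled.

```python
# Illustrative sketch only; function names are hypothetical.

# Local device group states (as returned by pairvolchk -ss) that the text
# lists as valid on the target primary cluster: PVOL_PAIR or SVOL_PAIR.
VALID_SWITCHOVER_STATES = {23, 33, 43, 53}


def switchover_state_is_valid(local_state: int) -> bool:
    """Return True if the local device group state permits a switchover."""
    return local_state in VALID_SWITCHOVER_STATES


def horctakeover_timeout(pg_timeout_seconds: int) -> int:
    """Return the -t value: 80% of the protection group Timeout property.

    Assumption: the result is rounded down to whole seconds.
    """
    return pg_timeout_seconds * 80 // 100


print(horctakeover_timeout(200))  # 160, matching the example in the table
```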
After a successful switchover, at the data replication level the roles of the primary and secondary volumes have been switched. The PVOL_PAIR volumes that were in place before the switchover become the SVOL_PAIR volumes. The SVOL_PAIR volumes in place before the switchover become the PVOL_PAIR volumes. Data replication will continue from the new PVOL_PAIR volumes to the new SVOL_PAIR volumes.
The Local-role property of the protection group is also switched regardless of whether the application could be brought online on the new primary cluster as part of the switchover operation. On the cluster on which the protection group had a Local role of Secondary, the Local-role property of the protection group becomes Primary. On the cluster on which the protection group had a Local-role of Primary, the Local-role property of the protection group becomes Secondary.
For a successful switchover, data replication must be active between the primary and the secondary clusters and data volumes on the two clusters must be synchronized.
Before you switch over a protection group from the primary cluster to the secondary cluster, ensure that the following conditions are met:
The Sun Cluster Geographic Edition software is running on both clusters.
The secondary cluster is a member of a partnership.
Both cluster partners can be reached.
The protection group is in the OK state.
If you have configured the Cluster_dgs property, only applications that belong to the protection group can write to the device groups specified in the Cluster_dgs property.
Log in to a cluster node.
You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC in Sun Cluster Geographic Edition System Administration Guide.
Initiate the switchover.
The application resource groups that are a part of the protection group are stopped and started during the switchover.
# geopg switchover [-f] -m newprimarycluster protectiongroupname
-f
Forces the command to perform the operation without asking you for confirmation
-m newprimarycluster
Specifies the name of the cluster that is to be the new primary cluster for the protection group
protectiongroupname
Specifies the name of the protection group
This example performs a switchover to the secondary cluster.
# geopg switchover -f -m cluster-newyork tcpg
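If you script the switchover, the command line can be assembled from its parts. The following Python sketch only builds the argument list for the syntax shown above and does not run anything. The helper name geopg_switchover_argv is hypothetical.

```python
import shlex
from typing import List


def geopg_switchover_argv(new_primary: str, protection_group: str,
                          force: bool = False) -> List[str]:
    """Build the argument list for the documented geopg switchover syntax."""
    argv = ["geopg", "switchover"]
    if force:
        argv.append("-f")  # perform the operation without a confirmation prompt
    argv += ["-m", new_primary, protection_group]
    return argv


print(shlex.join(geopg_switchover_argv("cluster-newyork", "tcpg", force=True)))
# geopg switchover -f -m cluster-newyork tcpg
```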
Perform a takeover when applications need to be brought online on the secondary cluster regardless of whether the data is completely consistent between the primary volume and the secondary volume. The information in this section assumes that the protection group has been started.
The following steps occur after a takeover is initiated:
If the former primary cluster, cluster-paris, can be reached and the protection group is not locked for notification handling or some other reason, the application services are taken offline on the former primary cluster.
For a reminder of which cluster is cluster-paris, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
Data volumes of the former primary cluster, cluster-paris, are taken over by the new primary cluster, cluster-newyork.
This data might not be consistent with the original primary volumes. After the takeover, data replication from the new primary cluster, cluster-newyork, to the former primary cluster, cluster-paris, is stopped.
Application services are brought online on the new primary cluster, cluster-newyork.
For details about the possible conditions of the primary and secondary cluster before and after takeover, see Appendix C, Takeover Postconditions, in Sun Cluster Geographic Edition System Administration Guide.
The following sections describe the steps you must perform to force a takeover by a secondary cluster.
When a takeover is initiated by using the geopg takeover command, the data replication subsystem runs several validations on both clusters. These steps are conducted on the original primary cluster only if the primary cluster can be reached. If validation on the original primary cluster fails, the takeover still occurs.
First, the replication subsystem checks that the Hitachi TrueCopy device group is in a valid aggregate device group state. Then, the replication subsystem checks that the local device group states on the target primary cluster, cluster-newyork, are not 32 or 52. These values correspond to a SVOL_COPY state, for which the horctakeover command fails. The Hitachi TrueCopy commands that are used for the takeover are described in the following table.
Table 3–2 Hitachi TrueCopy Takeover Validations on the New Primary Cluster
Aggregate Device Group State | Valid Device Group State on Local Cluster | Hitachi TrueCopy Takeover Commands That Are Run on cluster-newyork |
---|---|---|
SMPL | All | No command is run. |
Regular primary | All | No command is run. |
Regular secondary | All Regular secondary states except 32 or 52. For a list of Regular secondary states, refer to Table 2–1 and Table 2–2. | horctakeover -S -g dg [-t] The -t option is given when the fence_level of the Hitachi TrueCopy device group is async. The value is calculated as 80% of the Timeout property of the protection group. For example, if the protection group has a Timeout of 200 seconds, the value of -t used in this command is 80% of 200 seconds, or 160 seconds. |
Takeover primary | All | No command is run. |
Takeover secondary | All | pairsplit -R -g dg, followed by pairsplit -S -g dg |
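The command selection in Table 3–2 can be expressed as a small lookup. The Python sketch below is an illustration of the table only, not the product's implementation. The function name, the parameter names, and the default fence level of never are assumptions made for the example, and the -t value is rounded down to whole seconds because the text does not specify rounding.

```python
from typing import List, Optional

# Local states for which the text says the horctakeover command fails.
SVOL_COPY_STATES = {32, 52}
# Aggregate states for which Table 3-2 lists "No command is run."
NO_COMMAND_STATES = {"SMPL", "Regular primary", "Takeover primary"}


def takeover_commands(aggregate_state: str, local_state: int, dg: str,
                      fence_level: str = "never",
                      pg_timeout: Optional[int] = None) -> List[List[str]]:
    """Return the commands that Table 3-2 lists for a takeover (hypothetical helper)."""
    if aggregate_state in NO_COMMAND_STATES:
        return []
    if aggregate_state == "Takeover secondary":
        return [["pairsplit", "-R", "-g", dg], ["pairsplit", "-S", "-g", dg]]
    if aggregate_state != "Regular secondary":
        raise ValueError(f"unknown aggregate state: {aggregate_state!r}")
    if local_state in SVOL_COPY_STATES:
        raise ValueError("horctakeover fails in the SVOL_COPY state (32 or 52)")
    command = ["horctakeover", "-S", "-g", dg]
    if fence_level == "async":
        if pg_timeout is None:
            raise ValueError("pg_timeout is required when fence_level is async")
        # -t is 80% of the protection group Timeout property.
        command += ["-t", str(pg_timeout * 80 // 100)]
    return [command]
```

For example, a Regular secondary device group named devgroup1 with an async fence level and a Timeout of 200 seconds yields horctakeover -S -g devgroup1 -t 160.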
From a replication perspective, after a successful takeover, the Local-role property of the protection group is changed to reflect the new role, regardless of whether the application could be brought online on the new primary cluster as part of the takeover operation. On cluster-newyork, where the protection group had a Local-role of Secondary, the Local-role property of the protection group becomes Primary. On cluster-paris, where the protection group had a Local-role of Primary, the following might occur:
If the cluster can be reached, the Local-role property of the protection group becomes Secondary.
If the cluster cannot be reached, the Local-role property of the protection group remains Primary.
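The role outcome described above can be summarized in a short Python sketch. It is purely illustrative, and the function name is hypothetical. It shows why both clusters can report a Local-role of Primary after a takeover in which cluster-paris could not be reached.

```python
from typing import Dict


def local_roles_after_takeover(paris_reachable: bool) -> Dict[str, str]:
    """Return the Local-role of the protection group on each cluster
    after a successful takeover by cluster-newyork."""
    return {
        "cluster-newyork": "Primary",
        # cluster-paris changes role only if it could be reached.
        "cluster-paris": "Secondary" if paris_reachable else "Primary",
    }


print(local_roles_after_takeover(paris_reachable=False))
```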
If the takeover is successful, the applications are brought online. You do not need to run a separate geopg start command.
After a successful takeover, data replication between the new primary cluster, cluster-newyork, and the old primary cluster, cluster-paris, is stopped. If you want to run a geopg start command, you must use the -n option to prevent replication from resuming.
Before you force the secondary cluster to assume the activity of the primary cluster, ensure that the following conditions are met:
Sun Cluster Geographic Edition software is running on the cluster.
The cluster is a member of a partnership.
The Configuration status of the protection group is OK on the secondary cluster.
Log in to a node in the secondary cluster.
You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC in Sun Cluster Geographic Edition System Administration Guide.
Initiate the takeover.
# geopg takeover [-f] protectiongroupname
-f
Forces the command to perform the operation without your confirmation
protectiongroupname
Specifies the name of the protection group
This example forces the takeover of tcpg by the secondary cluster cluster-newyork.
The phys-newyork-1 node is the first node of the secondary cluster. For a reminder of which node is phys-newyork-1, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
phys-newyork-1# geopg takeover -f tcpg
For information about the state of the primary and secondary clusters after a takeover, see Appendix C, Takeover Postconditions, in Sun Cluster Geographic Edition System Administration Guide.
After a successful takeover operation, the secondary cluster, cluster-newyork, becomes the primary for the protection group and the services are online on the secondary cluster. After the recovery of the original primary cluster, cluster-paris, the services can be brought online again on the original primary by using a process called failback.
Sun Cluster Geographic Edition software supports the following kinds of failback:
Failback-switchover. During a failback-switchover, applications are brought online again on the original primary cluster, cluster-paris, after the data of the original primary cluster was resynchronized with the data on the secondary cluster, cluster-newyork.
For a reminder of which clusters are cluster-paris and cluster-newyork, see Example Sun Cluster Geographic Edition Cluster Configuration in Sun Cluster Geographic Edition System Administration Guide.
Failback-takeover. During a failback-takeover, applications are brought online again on the original primary cluster, cluster-paris, and use the current data on the original primary cluster. Any updates that occurred on the secondary cluster, cluster-newyork, while it was acting as primary are discarded.
To continue using the new primary, cluster-newyork, as the primary cluster and the original primary cluster, cluster-paris, as the secondary after the original primary is running again, resynchronize and revalidate the protection group configuration without performing a switchover or takeover.
This section provides the following information:
How to Resynchronize and Revalidate the Protection Group Configuration
How to Perform a Failback-Switchover on a System That Uses Hitachi TrueCopy Replication
How to Perform a Failback-Takeover on a System That Uses Hitachi TrueCopy Replication
Use this procedure to resynchronize and revalidate data on the original primary cluster, cluster-paris, with the data on the current primary cluster, cluster-newyork.
Before you resynchronize and revalidate the protection group configuration, a takeover has occurred on cluster-newyork. The clusters now have the following roles:
If the original primary cluster, cluster-paris, has been down, confirm that the cluster is booted and that the Sun Cluster Geographic Edition infrastructure is enabled on the cluster. For more information about booting a cluster, see Booting a Cluster in Sun Cluster Geographic Edition System Administration Guide.
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover from cluster-newyork.
Resynchronize the original primary cluster, cluster-paris, with the current primary cluster, cluster-newyork.
cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally. Resynchronize both the partnership and protection group configurations.
On cluster-paris, resynchronize the partnership.
# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once, even if you are resynchronizing multiple protection groups.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
On cluster-paris, resynchronize each protection group.
Because the role of the protection group on cluster-newyork is primary, this step ensures that the role of the protection group on cluster-paris is secondary.
# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about synchronizing protection groups, see Resynchronizing a Hitachi TrueCopy Protection Group.
On cluster-paris, validate the cluster configuration for each protection group.
# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Hitachi TrueCopy Protection Group.
On cluster-paris, activate each protection group.
Because the protection group on cluster-paris has a role of secondary, the geopg start command does not restart the application on cluster-paris.
# geopg start -e local protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
protectiongroupname
Specifies the name of the protection group.
Do not use the -n option because the data needs to be synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
Because the protection group has a role of secondary, the data is synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
For more information about the geopg start command, see How to Activate a Hitachi TrueCopy Protection Group.
Confirm that the data is completely synchronized.
The state of the protection group on cluster-newyork must be OK.
phys-newyork-1# geoadm status
Refer to the Protection Group section of the output.
The protection group has a local state of OK when the Hitachi TrueCopy device groups on cluster-newyork have a state of PVOL_PAIR and the Hitachi TrueCopy device groups on cluster-paris have a state of SVOL_PAIR.
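The order of the commands in this procedure matters: the partnership is resynchronized once, and then each protection group is updated, validated, and activated with local scope and without the -n option. The following Python sketch lists that sequence for review or logging. It does not run the commands, and the function name and the example partnership name are hypothetical.

```python
from typing import List


def resync_command_sequence(partnership: str,
                            protection_groups: List[str]) -> List[List[str]]:
    """List the commands to run on cluster-paris, in the documented order."""
    commands = [["geops", "update", partnership]]  # once per partnership
    commands += [["geopg", "update", pg] for pg in protection_groups]
    commands += [["geopg", "validate", pg] for pg in protection_groups]
    # No -n option here: data must synchronize from cluster-newyork.
    commands += [["geopg", "start", "-e", "local", pg] for pg in protection_groups]
    return commands


# "example-ps" is a placeholder partnership name, not taken from the guide.
for command in resync_command_sequence("example-ps", ["tcpg"]):
    print(" ".join(command))
```

After the sequence completes, run geoadm status on cluster-newyork to confirm that the protection group state is OK.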
Use this procedure to restart an application on the original primary cluster, cluster-paris, after the data on this cluster has been resynchronized with the data on the current primary cluster, cluster-newyork.
The failback procedures apply only to clusters in a partnership. You need to perform the following procedure only once per partnership.
Before you perform a failback-switchover, a takeover has occurred on cluster-newyork. The clusters have the following roles:
If the original primary cluster, cluster-paris, has been down, confirm that the cluster is booted and that the Sun Cluster Geographic Edition infrastructure is enabled on the cluster. For more information about booting a cluster, see Booting a Cluster in Sun Cluster Geographic Edition System Administration Guide.
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover from cluster-newyork.
Resynchronize the original primary cluster, cluster-paris, with the current primary cluster, cluster-newyork.
cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally. Resynchronize both the partnership and protection group configurations.
On cluster-paris, resynchronize the partnership.
phys-paris-1# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once per partnership, even if you are performing a failback-switchover for multiple protection groups in the partnership.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
Determine whether the protection group on the original primary cluster, cluster-paris, is active.
phys-paris-1# geoadm status
If the protection group on the original primary cluster is active, stop it.
phys-paris-1# geopg stop -e local protectiongroupname
Verify that the protection group is stopped.
phys-paris-1# geoadm status
On cluster-paris, resynchronize each protection group.
Because the local role of the protection group on cluster-newyork is now primary, this step ensures that the role of the protection group on cluster-paris becomes secondary.
phys-paris-1# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about synchronizing protection groups, see Resynchronizing a Hitachi TrueCopy Protection Group.
On cluster-paris, validate the cluster configuration for each protection group.
Ensure that the protection group is not in an error state. A protection group cannot be started when it is in an error state.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Hitachi TrueCopy Protection Group.
On cluster-paris, activate each protection group.
Because the protection group on cluster-paris has a role of secondary, the geopg start command does not restart the application on cluster-paris.
phys-paris-1# geopg start -e local protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
protectiongroupname
Specifies the name of the protection group.
Do not use the -n option because the data needs to be synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
Because the protection group has a role of secondary, the data is synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.
For more information about the geopg start command, see How to Activate a Hitachi TrueCopy Protection Group.
Confirm that the data is completely synchronized.
The state of the protection group on cluster-newyork must be OK.
phys-newyork-1# geoadm status
Refer to the Protection Group section of the output.
The protection group has a local state of OK when the Hitachi TrueCopy device groups on cluster-newyork have a state of PVOL_PAIR and the Hitachi TrueCopy device groups on cluster-paris have a state of SVOL_PAIR.
On both partner clusters, ensure that the protection group is activated.
# geoadm status
On either cluster, perform a switchover from cluster-newyork to cluster-paris for each protection group.
# geopg switchover [-f] -m cluster-paris protectiongroupname
For more information, see How to Switch Over a Hitachi TrueCopy Protection Group From Primary to Secondary.
cluster-paris resumes its original role as primary cluster for the protection group.
Ensure that the switchover was performed successfully.
Verify that the protection group is now primary on cluster-paris and secondary on cluster-newyork and that the state for Data replication and Resource groups is OK on both clusters.
# geoadm status
Check the runtime status of the application resource group and data replication for each Hitachi TrueCopy protection group.
# clresourcegroup status -v
# clresource status -v
Refer to the Status and Status Message fields that are presented for the data replication device group you want to check. For more information about these fields, see Table 2–1.
For more information about the runtime status of data replication, see Checking the Runtime Status of Hitachi TrueCopy Data Replication.
Use this procedure to restart an application on the original primary cluster, cluster-paris, and use the current data on the original primary cluster. Any updates that occurred on the secondary cluster, cluster-newyork, while it was acting as primary are discarded.
The failback procedures apply only to clusters in a partnership. You need to perform the following procedure only once per partnership.
You can resume using the data on the original primary, cluster-paris, only if you have not replicated data from the new primary, cluster-newyork, to the original primary cluster, cluster-paris, at any point after the takeover operation on cluster-newyork. To prevent data replication between the new primary and the original primary, you must use the -n option when you run the geopg start command.
Ensure that the clusters have the following roles:
The protection group on cluster-newyork has the primary role.
The protection group on cluster-paris has either the primary role or secondary role, depending on whether cluster-paris could be reached during the takeover.
Resynchronize the original primary cluster, cluster-paris, with the original secondary cluster, cluster-newyork.
cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally.
On cluster-paris, resynchronize the partnership.
phys-paris-1# geops update partnershipname
partnershipname
Specifies the name of the partnership
You need to perform this step only once per partnership, even if you are performing a failback-takeover for multiple protection groups in the partnership.
For more information about synchronizing partnerships, see Resynchronizing a Partnership in Sun Cluster Geographic Edition System Administration Guide.
Determine whether the protection group on the original primary cluster, cluster-paris, is active.
phys-paris-1# geoadm status
If the protection group on the original primary cluster is active, stop it.
phys-paris-1# geopg stop -e local protectiongroupname
Verify that the protection group is stopped.
phys-paris-1# geoadm status
Place the Hitachi TrueCopy device group, devgroup1, in the SMPL state.
Use the pairsplit commands to place the Hitachi TrueCopy device groups that are in the protection group on both cluster-paris and cluster-newyork in the SMPL state. The pairsplit command you use depends on the pair state of the Hitachi TrueCopy device group. The following table gives some examples of the command you need to use on cluster-paris for some typical pair states.
Pair State on cluster-paris | Pair State on cluster-newyork | pairsplit Command Used on cluster-paris |
---|---|---|
PSUS or PSUE | SSWS | pairsplit -R -g dgname, followed by pairsplit -S -g dgname |
SSUS | PSUS | pairsplit -S -g dgname |
For more information about the pairsplit commands, see the Sun StorEdge SE 9900 V Series Command and Control Interface User and Reference Guide.
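The examples in the preceding table can be captured as a lookup keyed on the pair states of both clusters. The Python sketch below covers only the two combinations that the table lists and raises an error for anything else, because other pair states need a command choice that this guide does not describe. The function name is hypothetical.

```python
from typing import List


def pairsplit_commands(paris_state: str, newyork_state: str,
                       dgname: str) -> List[List[str]]:
    """Return the pairsplit commands to run on cluster-paris for the
    pair-state combinations listed in the table."""
    if paris_state in ("PSUS", "PSUE") and newyork_state == "SSWS":
        return [["pairsplit", "-R", "-g", dgname],
                ["pairsplit", "-S", "-g", dgname]]
    if paris_state == "SSUS" and newyork_state == "PSUS":
        return [["pairsplit", "-S", "-g", dgname]]
    raise ValueError(
        f"pair states {paris_state}/{newyork_state} are not covered by the table"
    )


for command in pairsplit_commands("PSUE", "SSWS", "devgroup1"):
    print(" ".join(command))
```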
If the command is successful, the state of devgroup1 is provided in the output of the pairdisplay command:
phys-paris-1# pairdisplay -g devgroup1
Group PairVol(L/R) (Port#,TID,LU),Seq#,LDEV#,P/S,Status,Fence,Seq#,P-LDEV# M
devgroup1 pair1(L) (CL1-A , 0, 1) 12345 1..SMPL ---- ----,----- ---- -
devgroup1 pair1(R) (CL1-C , 0, 20)54321 609..SMPL ---- ----,----- ---- -
devgroup1 pair2(L) (CL1-A , 0, 2) 12345 2..SMPL ---- ----,----- ---- -
devgroup1 pair2(R) (CL1-C , 0,21) 54321 610..SMPL ---- ----,----- ---- -
On cluster-paris, resynchronize each protection group.
phys-paris-1# geopg update protectiongroupname
protectiongroupname
Specifies the name of the protection group
For more information about resynchronizing protection groups, see How to Resynchronize a Protection Group.
On cluster-paris, validate the configuration for each protection group.
Ensure that the protection group is not in an error state. A protection group cannot be started when it is in an error state.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname
Specifies a unique name that identifies a single protection group
For more information, see How to Validate a Hitachi TrueCopy Protection Group.
On cluster-paris, activate each protection group in the secondary role without data replication.
Because the protection group on cluster-paris has a role of secondary, the geopg start command does not restart the application on cluster-paris.
phys-paris-1# geopg start -e local -n protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
-n
Prevents the start of data replication at protection group startup.
You must use the -n option.
protectiongroupname
Specifies the name of the protection group.
For more information, see How to Activate a Hitachi TrueCopy Protection Group.
Replication from cluster-newyork to cluster-paris is not started because the -n option is used on cluster-paris.
On cluster-paris, initiate a takeover for each protection group.
phys-paris-1# geopg takeover [-f] protectiongroupname
-f
Forces the command to perform the operation without your confirmation
protectiongroupname
Specifies the name of the protection group
For more information about the geopg takeover command, see How to Force Immediate Takeover of Hitachi TrueCopy Services by a Secondary Cluster.
The protection group on cluster-paris now has the primary role, and the protection group on cluster-newyork has the role of secondary. The application services are now online on cluster-paris.
On cluster-newyork, activate each protection group.
At the end of step 4, the local state of the protection group on cluster-newyork is Offline. To start monitoring the local state of the protection group, you must activate the protection group on cluster-newyork.
Because the protection group on cluster-newyork has a role of secondary, the geopg start command does not restart the application on cluster-newyork.
phys-newyork-1# geopg start -e local [-n] protectiongroupname
-e local
Specifies the scope of the command.
By specifying a local scope, the command operates on the local cluster only.
-n
Prevents the start of data replication at protection group startup.
If you omit this option, the data replication subsystem starts at the same time as the protection group.
protectiongroupname
Specifies the name of the protection group.
For more information about the geopg start command, see How to Activate a Hitachi TrueCopy Protection Group.
Ensure that the takeover was performed successfully.
Verify that the protection group is now primary on cluster-paris and secondary on cluster-newyork and that the state for “Data replication” and “Resource groups” is OK on both clusters.
# geoadm status
Check the runtime status of the application resource group and data replication for each Hitachi TrueCopy protection group.
# clresourcegroup status -v
# clresource status -v
Refer to the Status and Status Message fields that are presented for the data replication device group you want to check. For more information about these fields, see Table 2–1.
For more information about the runtime status of data replication, see Checking the Runtime Status of Hitachi TrueCopy Data Replication.
When you run the geopg switchover command, the horctakeover command runs at the Hitachi TrueCopy data replication level. If the horctakeover command returns a value of 1, the switchover is successful.
In Hitachi TrueCopy terminology, a switchover is called a swap-takeover. In some cases, the horctakeover command might not be able to perform a swap-takeover. In these cases, a return value other than 1 is returned, which is considered a switchover failure.
In a failure, the horctakeover command usually returns a value of 5, which indicates a SVOL-SSUS-takeover.
One reason the horctakeover command might fail to perform a swap-takeover is that the data replication link, ESCON/FC, is down.
Any result other than a swap-takeover implies that the secondary volumes might not be fully synchronized with the primary volumes. Sun Cluster Geographic Edition software does not start the applications on the new intended primary cluster in a switchover failure scenario.
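The return-value rule above can be sketched as a small classifier. This Python example is illustrative only and uses just the two values that this section names, 1 and 5. Every other value is treated as a switchover failure, as the text describes. The function and constant names are hypothetical.

```python
SWAP_TAKEOVER = 1        # the only value that counts as a successful switchover
SVOL_SSUS_TAKEOVER = 5   # the usual value returned in a failure


def classify_horctakeover(return_value: int) -> str:
    """Describe a horctakeover return value from the switchover's point of view."""
    if return_value == SWAP_TAKEOVER:
        return "switchover succeeded (swap-takeover)"
    if return_value == SVOL_SSUS_TAKEOVER:
        return "switchover failed (SVOL-SSUS-takeover)"
    return "switchover failed (no swap-takeover)"


print(classify_horctakeover(5))
```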
The remainder of this section describes the initial conditions that lead to a switchover failure and how to recover from a switchover failure.
How to Make the Original Primary Cluster Primary for a Hitachi TrueCopy Protection Group
How to Make the Original Secondary Cluster Primary for a Hitachi TrueCopy Protection Group
This section describes a switchover failure scenario. In this scenario, cluster-paris is the original primary cluster and cluster-newyork is the original secondary cluster.
A switchover switches the services from cluster-paris to cluster-newyork as follows:
phys-newyork-1# geopg switchover -f -m cluster-newyork tcpg
While processing the geopg switchover command, the horctakeover command performs an SVOL-SSUS-takeover and returns a value of 5 for the Hitachi TrueCopy device group, devgroup1. As a result, the geopg switchover command returns with the following failure message:
Processing operation.... this may take a while ....
"Switchover" failed for the following reason:
        Switchover failed for Truecopy DG devgroup1
After this failure message has been issued, the two clusters are in the following states:
cluster-paris:
        tcpg role: Secondary
cluster-newyork:
        tcpg role: Secondary

phys-newyork-1# pairdisplay -g devgroup1 -fc
Group PairVol(L/R) (Port#,TID,LU),Seq#,LDEV#.P/S, Status,Fence,%, P-LDEV# M
devgroup1 pair1(L) (CL1-C , 0, 20)12345 609..S-VOL SSWS ASYNC,100 1 -
devgroup1 pair1(R) (CL1-A , 0, 1) 54321 1..P-VOL PSUS ASYNC,100 609 -
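When a device group contains many pairs, it can help to reduce the pairdisplay output to the volume attribute and pair status of each line. The following is a minimal sketch, assuming the "LDEV#..P/S Status" layout of the sample output above. The function name pair_states is hypothetical and is not part of the Hitachi tools.

```shell
# Hypothetical helper: read pairdisplay output on standard input and print
# the volume attribute and pair status found on each pair line, for
# example "S-VOL SSWS". Header lines do not match and are skipped.
# Assumes the "..P-VOL" / "..S-VOL" layout of the sample output above.
pair_states() {
    sed -n 's/.*\.\.\([PS]-VOL\)  *\([A-Z][A-Z]*\).*/\1 \2/p'
}
```

A possible use is to pipe the command output into the function, as in pairdisplay -g devgroup1 -fc | pair_states. For the failure scenario above, the S-VOL line reports SSWS and the P-VOL line reports PSUS.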
This section describes procedures to recover from the failure scenario described in the previous section. These procedures bring the application online on the appropriate cluster.
Place the Hitachi TrueCopy device group, devgroup1, in the SMPL state.
Use the pairsplit commands to place the device groups that are in the protection group on both cluster-paris and cluster-newyork in the SMPL state. For the pair states that are shown in the previous section, run the following pairsplit commands:
phys-newyork-1# pairsplit -R -g devgroup1
phys-newyork-1# pairsplit -S -g devgroup1
Designate one of the clusters Primary for the protection group.
Designate the original primary cluster, cluster-paris, Primary for the protection group if you intend to start the application on the original primary cluster. The application uses the current data on the original primary cluster.
Designate the original secondary cluster, cluster-newyork, Primary for the protection group if you intend to start the application on the original secondary cluster. The application uses the current data on the original secondary cluster.
Because the horctakeover command did not perform a swap-takeover, the data volumes on cluster-newyork might not be synchronized with the data volumes on cluster-paris. If you intend to start the application with the same data that appears on the original primary cluster, you must not make the original secondary cluster Primary.
Deactivate the protection group on the original primary cluster.
phys-paris-1# geopg stop -e Local tcpg
Resynchronize the configuration of the protection group.
This command updates the configuration of the protection group on cluster-paris with the configuration information of the protection group on cluster-newyork.
phys-paris-1# geopg update tcpg
After the geopg update command completes successfully, tcpg has the following role on each cluster:
cluster-paris:
        tcpg role: Primary
cluster-newyork:
        tcpg role: Secondary
Activate the protection group on both clusters in the partnership.
phys-paris-1# geopg start -e Global tcpg
This command starts the application on cluster-paris. Data replication starts from cluster-paris to cluster-newyork.
Resynchronize the configuration of the protection group.
This command updates the configuration of the protection group on cluster-newyork with the configuration information of the protection group on cluster-paris.
phys-newyork-1# geopg update tcpg
After the geopg update command completes successfully, tcpg has the following role on each cluster:
cluster-paris:
        tcpg role: Secondary
cluster-newyork:
        tcpg role: Primary
Activate the protection group on both clusters in the partnership.
phys-newyork-1# geopg start -e Global tcpg
This command starts the application on cluster-newyork. Data replication starts from cluster-newyork to cluster-paris.
This command overwrites the data on cluster-paris.
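The two recovery paths above differ only in the commands that follow the pairsplit step. As a side-by-side summary, the following dry-run sketch prints the commands for the chosen path without running them. The function name recovery_plan is hypothetical. The step that designates a cluster as Primary is omitted because this section does not show a command for it.

```shell
# Hypothetical dry-run helper: print, but do not run, the commands from the
# recovery procedures above. The argument names the cluster that is to
# become Primary for the protection group tcpg.
recovery_plan() {
    case "$1" in
        cluster-paris|cluster-newyork) ;;
        *)
            echo "usage: recovery_plan cluster-paris|cluster-newyork" >&2
            return 1
            ;;
    esac

    # Common first step: place devgroup1 in the SMPL state.
    echo "phys-newyork-1# pairsplit -R -g devgroup1"
    echo "phys-newyork-1# pairsplit -S -g devgroup1"

    if [ "$1" = "cluster-paris" ]; then
        # Original primary cluster becomes Primary again.
        echo "phys-paris-1# geopg stop -e Local tcpg"
        echo "phys-paris-1# geopg update tcpg"
        echo "phys-paris-1# geopg start -e Global tcpg"
    else
        # Original secondary cluster becomes Primary.
        # Starting replication overwrites the data on cluster-paris.
        echo "phys-newyork-1# geopg update tcpg"
        echo "phys-newyork-1# geopg start -e Global tcpg"
    fi
}
```

Printing the plan rather than executing it keeps the sketch safe to try on any host. Each printed command should still be run and verified by hand in the order shown.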
When an error occurs at the data replication level, the error is reflected in the status of the resource in the replication resource group of the relevant device group.
This section provides the following information:
For information about how different Resource status values map to actual replication pair states, see Table 2–6.
You can check the status of the replication resources by using the clresource command as follows:
phys-paris-1# clresource status -v
Running the clresource status command might return the following:
=== Cluster Resources ===

Resource Name          Node Name      State     Status Message
-------------          ---------      -----     --------------
r-tc-tcpg1-devgroup1   phys-paris-2   Offline   Offline
                       phys-paris-1   Online    Faulted - P-VOL:PSUE

hasp4nfs               phys-paris-2   Offline   Offline
                       phys-paris-1   Offline   Offline
The aggregate resource status for all device groups in the protection group is shown in the output of the geoadm status command. For example, the output of the clresource status command in the preceding example indicates that the Hitachi TrueCopy device group, devgroup1, is in the PSUE state on cluster-paris. Table 2–6 indicates that the PSUE state corresponds to a resource status of FAULTED. So, the data replication state of the protection group is also FAULTED. This state is reflected in the output of the geoadm status command, which displays the state of the protection group as Error.
phys-paris-1# geoadm status
Cluster: cluster-paris

Partnership "paris-newyork-ps" : OK
  Partner clusters : cluster-newyork
  Synchronization : OK
  ICRM Connection : OK

  Heartbeat "paris-to-newyork" monitoring "cluster-newyork": OK
    Heartbeat plug-in "ping_plugin" : Inactive
    Heartbeat plug-in "tcp_udp_plugin" : OK

Protection group "tcpg" : Error
  Partnership : paris-newyork-ps
  Synchronization : OK

  Cluster cluster-paris : Error
    Role : Primary
    PG activation state : Activated
    Configuration : OK
    Data replication : Error
    Resource groups : OK

  Cluster cluster-newyork : Error
    Role : Secondary
    PG activation state : Activated
    Configuration : OK
    Data replication : Error
    Resource groups : OK

Pending Operations
  Protection Group : "tcpg"
  Operations : start
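In a long status listing, the entries in the Error state can be easy to miss. The following sketch filters geoadm status output down to the lines whose value is Error. The function name status_errors is hypothetical, and the filter assumes the "name : value" layout of the sample output above.

```shell
# Hypothetical helper: read geoadm status output on standard input and
# print every "name : value" line whose value is Error, with leading
# whitespace removed. Assumes the layout of the sample output above.
status_errors() {
    sed -n 's/^[[:space:]]*\(.*:[[:space:]]*Error\)[[:space:]]*$/\1/p'
}
```

A possible use is geoadm status | status_errors. Applied to the sample output, the filter keeps the protection group line, the two cluster lines, and the two Data replication lines, and drops every line whose value is OK.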
To recover from an error state, you might perform some or all of the steps in the following procedure.
Use the procedures in the Hitachi TrueCopy documentation to determine the causes of the FAULTED state. This state is indicated as PSUE.
Recover from the faulted state by using the Hitachi TrueCopy procedures.
If the recovery procedures change the state of the device group, this state is automatically detected by the resource and is reported as a new protection group state.
Revalidate the protection group configuration.
phys-paris-1# geopg validate protectiongroupname
protectiongroupname    Specifies the name of the Hitachi TrueCopy protection group
Review the status of the protection group configuration.
phys-paris-1# geopg list protectiongroupname
protectiongroupname    Specifies the name of the Hitachi TrueCopy protection group
Review the runtime status of the protection group.
phys-paris-1# geoadm status