Sun Cluster Geographic Edition System Administration Guide

Chapter 8 Migrating Services That Use Sun StorEdge Availability Suite 3.2.1 Data Replication

This chapter provides information about migrating services for maintenance or as a result of cluster failure. The chapter contains information about the following:

Detecting Cluster Failure on a System That Uses Sun StorEdge Availability Suite 3.2.1 Data Replication

This section describes the internal processes that occur when failure is detected on a primary or a secondary cluster.

Detecting Primary Cluster Failure

When the primary cluster for a given protection group fails, the secondary cluster in the partnership detects the failure. The cluster that fails might be a member of more than one partnership, resulting in multiple failure detections.

The following actions occur when a protection group's overall state changes to the Unknown state:

Detecting Secondary Cluster Failure

When a secondary cluster for a given protection group fails, a cluster in the same partnership detects the failure. The cluster that failed might be a member of more than one partnership, resulting in multiple failure detections.

During failure detection, the following actions occur:

Migrating Services That Use Sun StorEdge Availability Suite 3.2.1 With a Switchover

You perform a switchover of a Sun StorEdge Availability Suite 3.2.1 protection group when you want to migrate services to the partner cluster in an orderly fashion. A switchover consists of the following:

Procedure: How to Switch Over a Sun StorEdge Availability Suite 3.2.1 Protection Group From Primary to Secondary

Before You Begin

For a switchover to occur, data replication must be active between the primary cluster and the secondary cluster. Additionally, the data volumes on the two clusters must be in a synchronized state.

Before you switch over a protection group from the primary cluster to the secondary cluster, ensure that the following conditions are met:

Steps
  1. Log in to one of the cluster nodes.

    You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC.

  2. Initiate the switchover.

    The application resource groups that are a part of the protection group are stopped and started during the switchover.


    # geopg switchover [-f] -m new-primary-cluster protection-group-name
    
    -f

    Forces the command to perform the operation without asking you for confirmation

    -m new-primary-cluster

    Specifies the name of the cluster that is to be the primary cluster for the protection group

    protection-group-name

    Specifies the name of the protection group


Example 8–1 Forcing a Switchover From Primary to Secondary

The following example illustrates how to perform a switchover to the secondary cluster:


# geopg switchover -f -m cluster-newyork avspg

Actions Performed by the Sun Cluster Geographic Edition Software During a Switchover

When the geopg switchover command is executed, the software confirms that the volume sets associated with the device groups are in the replicating state. Then, the software performs the following actions on the original primary cluster:

On the original secondary cluster, the command takes the following actions:

If the command executes successfully, the secondary cluster, cluster-newyork, becomes the new primary cluster for the protection group. The original primary cluster, cluster-paris, becomes the new secondary cluster. Volume sets associated with a device group of the protection group have their role reversed according to the role of the protection group on the local cluster. The application resource group is online on the new primary cluster. Data replication from the new primary cluster to the new secondary cluster begins.

This command returns an error if any of the previous operations fails. Execute the geoadm status command to view the status of each component. For example, the Configuration status of the protection group might be set to Error, depending on the cause of the failure. The protection group might be activated or deactivated.

If the Configuration status of the protection group is set to Error, revalidate the protection group by using the procedures described in How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

If the configuration of the protection group is not the same on each partner cluster, you need to resynchronize the configuration by using the procedures described in How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.
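
The error handling described above can be sketched as a small shell wrapper. This is a minimal illustration, not part of the product: the function names and the DRY_RUN variable are hypothetical, and only the geopg switchover and geoadm status commands come from this chapter. By default the sketch only prints the command it would run.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run, and switchover_pg are hypothetical helpers,
# not part of the Sun Cluster Geographic Edition software.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode; otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Switch a protection group over to the named cluster. If the command
# fails, show the status of each component and report the failure.
switchover_pg() {
    new_primary=$1
    pg=$2
    if [ -z "$new_primary" ] || [ -z "$pg" ]; then
        echo "usage: switchover_pg new-primary-cluster protection-group-name" >&2
        return 2
    fi
    if ! run geopg switchover -f -m "$new_primary" "$pg"; then
        run geoadm status
        return 1
    fi
}
```

For example, switchover_pg cluster-newyork avspg prints the command from Example 8–1. Setting DRY_RUN=0 would execute the commands instead, so do that only on a cluster node and with the Geo Management RBAC rights profile.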

Forcing a Takeover on Systems That Use Sun StorEdge Availability Suite 3.2.1

You perform a takeover when applications need to be brought online on the secondary cluster regardless of whether the data is completely consistent between the primary volume and the secondary volume. The following steps occur after takeover is initiated:

For details about the possible conditions of the primary and secondary cluster before and after takeover, see Appendix C, Takeover Postconditions.

The following procedures describe the steps you must perform to force takeover by a secondary cluster, and how to recover data afterward.

Procedure: How to Force Immediate Takeover of Sun StorEdge Availability Suite 3.2.1 Services by a Secondary Cluster

Before You Begin

Before you force the secondary cluster to assume the activity of the primary cluster, ensure that the following conditions are met:

Steps
  1. Log in to a node in the secondary cluster.

    You must be assigned the Geo Management RBAC rights profile to complete this procedure. For more information about RBAC, see Sun Cluster Geographic Edition Software and RBAC.

  2. Initiate the takeover.


    # geopg takeover [-f] protection-group-name
    
    -f

    Forces the command to perform the operation without your confirmation

    protection-group-name

    Specifies the name of the protection group


Example 8–2 Forcing a Takeover by a Secondary Cluster

The following example illustrates how to force the takeover of avspg by the secondary cluster cluster-newyork.

phys-newyork-1 is the first node of the secondary cluster. For a reminder of which node is phys-newyork-1, see Example Sun Cluster Geographic Edition Cluster Configuration.


phys-newyork-1# geopg takeover -f avspg

Actions Performed by the Sun Cluster Geographic Edition Software During a Takeover

When the geopg takeover command executes, the software confirms that the volume sets are in a Replicating or Logging state on the secondary cluster.

If the original primary cluster, cluster-paris, can be reached, the software performs the following actions:

On the original secondary cluster, cluster-newyork, the software performs the following actions:

If the command executes successfully, the secondary cluster, cluster-newyork, becomes the new primary cluster for the protection group. Volume sets associated with a device group in the protection group have their role reversed according to the role of the protection group on the local cluster. If the protection group was active on the original secondary cluster before the takeover, the application resource groups are brought online on the new primary cluster. If the original primary cluster can be reached, it becomes the new secondary cluster of the protection group. Replication of all volume sets associated with the device groups of the protection group is stopped.


Caution –

After a successful takeover, data replication is stopped. If you want to continue to suspend replication, specify the -n option any time you use the geopg start command. This option prevents the start of data replication from the new primary cluster to the new secondary cluster.


This command returns an error if any of the previous operations fails. Execute the geoadm status command to view the status of each component. For example, the Configuration status of the protection group might be set to Error, depending on the cause of the failure. The protection group might be activated or deactivated.

If the Configuration status of the protection group is set to Error, revalidate the protection group by using the procedures described in How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

If the configuration of the protection group is not the same on each partner cluster, you need to resynchronize the configuration by using the procedures described in How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.
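
The caution about the -n option can be illustrated with a small helper that builds the geopg start command line and adds -n unless you explicitly ask to resume replication. This is a sketch under stated assumptions: the function name and its third argument are hypothetical, and the helper only prints a command line for review. It does not run geopg.

```shell
#!/bin/sh
# Sketch only: start_cmd_after_takeover and its "suspend" argument are
# hypothetical. The function prints a command line and runs nothing.
start_cmd_after_takeover() {
    scope=$1             # scope passed to geopg start -e
    pg=$2                # protection group name
    suspend=${3:-yes}    # yes: keep replication stopped by adding -n
    if [ -z "$scope" ] || [ -z "$pg" ]; then
        echo "usage: start_cmd_after_takeover scope protection-group-name [yes|no]" >&2
        return 2
    fi
    if [ "$suspend" = "yes" ]; then
        printf 'geopg start -e %s -n %s\n' "$scope" "$pg"
    else
        printf 'geopg start -e %s %s\n' "$scope" "$pg"
    fi
}
```

Making -n the default means that data replication resumes only when you deliberately ask for it by passing no as the third argument.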

Recovering Sun StorEdge Availability Suite 3.2.1 Data After a Takeover

After a successful takeover operation, the secondary cluster (cluster-newyork) becomes the primary for the protection group and the services are online on the secondary cluster. After the recovery of the original primary cluster, the services can be brought online again on the original primary by using a process called failback.

Sun Cluster Geographic Edition software supports the following two kinds of failback:

Procedure: How to Perform a Failback-Switchover on a System That Uses Sun StorEdge Availability Suite 3.2.1 Replication

Use this procedure to restart an application on the original primary cluster, cluster-paris, after this cluster's data has been resynchronized with the data on the current primary cluster, cluster-newyork.

Before You Begin

Before you perform a failback-switchover, a takeover has occurred on cluster-newyork. The clusters now have the following roles:

Steps
  1. Resynchronize the original primary cluster, cluster-paris, with the current primary cluster, cluster-newyork.

    cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally. Resynchronize both the partnership and protection group configurations.

    1. On cluster-paris, deactivate the protection group on the local cluster.


      # geopg stop -e Local protection-group-name
      
      -e Local

      Specifies the scope of the command

      By specifying a local scope, the command operates on the local cluster only.

      protection-group-name

      Specifies the name of the protection group

      If the protection group is already deactivated, the state of the resource group in the protection group is probably Error. The state is Error because the application resource groups are managed and offline.

      Deactivating the protection group causes the application resource groups to become unmanaged, which clears the Error state.

    2. On cluster-paris, resynchronize the partnership.


      # geops update partnership-name
      
      partnership-name

      Specifies the name of the partnership


      Note –

      You need to perform this step only once, even if you are performing a failback-switchover for multiple protection groups.


      For more information about synchronizing partnerships, see Resynchronizing a Partnership.

    3. On cluster-paris, resynchronize each protection group.

      Because the role of the protection group on cluster-newyork is primary, this step ensures that the role of the protection group on cluster-paris is secondary.


      # geopg update protection-group-name 
      
      protection-group-name

      Specifies the name of the protection group

      For more information about synchronizing protection groups, see Resynchronizing a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  2. On cluster-paris, validate the cluster's configuration for each protection group.


    # geopg validate protection-group-name 
    
    protection-group-name

    Specifies a unique name that identifies a single protection group

    For more information, see How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  3. On cluster-paris, activate each protection group.

    When you activate a protection group, its application resource groups are also brought online.


    # geopg start -e Global protection-group-name
    
    -e Global

    Specifies the scope of the command

    By specifying a Global scope, the command operates on both clusters where the protection group is deployed.

    protection-group-name

    Specifies the name of the protection group


    Note –

    Do not specify the -n option when you perform a failback-switchover, because the data needs to be synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.


    Because the protection group has a role of secondary, the data is synchronized from the current primary, cluster-newyork, to the current secondary, cluster-paris.

    For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  4. Confirm that the data is completely synchronized.

    First, confirm that the state of the protection group on cluster-newyork is OK.


    phys-newyork-1# geoadm status

    Refer to the Protection Group section of the output.

    Next, confirm that all resources in the replication resource group, AVS-protection-group-name-rep-rg, report a status of OK.


    phys-newyork-1# scstat -g
  5. On either cluster, perform a switchover from cluster-newyork to cluster-paris for each protection group.


    # geopg switchover [-f] -m cluster-paris protection-group-name
    

    For more information, see How to Switch Over a Sun StorEdge Availability Suite 3.2.1 Protection Group From Primary to Secondary.

    cluster-paris resumes its original role as primary cluster for the protection group.

  6. Ensure that the switchover was performed successfully by using the geoadm status command on either cluster to verify that the replication resource and the application resource groups and resources are online.

    Also, you must verify that the protection group is now primary on cluster-paris and secondary on cluster-newyork and that “Data replication” and “Resource groups” are listed in OK states for both clusters.


    # geoadm status
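
The order of the failback-switchover steps can be summarized with a short sketch that prints each command next to the cluster on which you run it. Nothing is executed. The function name and argument order are hypothetical, and the optional -f option of geopg switchover is left out so that the command still asks for confirmation.

```shell
#!/bin/sh
# Sketch only: failback_switchover_plan is a hypothetical helper that
# prints the commands from the procedure above. It runs nothing.
failback_switchover_plan() {
    if [ $# -ne 4 ]; then
        echo "usage: failback_switchover_plan partnership pg original-primary current-primary" >&2
        return 2
    fi
    partnership=$1
    pg=$2
    orig_primary=$3    # for example, cluster-paris
    cur_primary=$4     # for example, cluster-newyork

    # Step 1: resynchronize the original primary with the current primary.
    printf '[%s] geopg stop -e Local %s\n' "$orig_primary" "$pg"
    printf '[%s] geops update %s\n' "$orig_primary" "$partnership"
    printf '[%s] geopg update %s\n' "$orig_primary" "$pg"
    # Steps 2 and 3: validate, then activate with replication enabled.
    printf '[%s] geopg validate %s\n' "$orig_primary" "$pg"
    printf '[%s] geopg start -e Global %s\n' "$orig_primary" "$pg"
    # Step 4: confirm by hand that the data is completely synchronized.
    printf '[%s] geoadm status\n' "$cur_primary"
    printf '[%s] scstat -g\n' "$cur_primary"
    # Steps 5 and 6: switch over and verify, from either cluster.
    printf '[either] geopg switchover -m %s %s\n' "$orig_primary" "$pg"
    printf '[either] geoadm status\n'
}
```

A call such as failback_switchover_plan paris-newyork-ps avspg cluster-paris cluster-newyork prints nine lines, where paris-newyork-ps is a made-up partnership name. The synchronization check in step 4 remains a manual decision: do not continue to the switchover until the status output shows OK.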

Procedure: How to Perform a Failback-Takeover on a System That Uses Sun StorEdge Availability Suite 3.2.1 Replication

Use this procedure to restart an application on the original primary cluster, cluster-paris, and use the current data on the original primary cluster. Any updates that occurred on the secondary cluster, cluster-newyork, while it was acting as primary are discarded.


Note –

You can resume using the data on the original primary, cluster-paris, only if you have not replicated data from the new primary, cluster-newyork, to the original primary cluster, cluster-paris, at any point after the takeover operation on cluster-newyork.


Before You Begin

Before you begin the failback-takeover operation, the clusters have the following roles:

Steps
  1. Resynchronize the original primary cluster, cluster-paris, with the original secondary cluster, cluster-newyork.

    cluster-paris forfeits its own configuration and replicates the cluster-newyork configuration locally.

    1. On cluster-paris, resynchronize the partnership.


      # geops update partnership-name
      
      partnership-name

      Specifies the name of the partnership


      Note –

      You need to perform this step only once, even if you are performing a failback-takeover for multiple protection groups.


      For more information about synchronizing partnerships, see Resynchronizing a Partnership.

    2. On cluster-paris, resynchronize each protection group.

      If the protection group has been activated, deactivate the protection group by using the geopg stop command. For more information about deactivating a protection group, see How to Deactivate a Sun StorEdge Availability Suite 3.2.1 Protection Group.


      # geopg update protection-group-name
      
      protection-group-name

      Specifies the name of the protection group

      For more information about synchronizing protection groups, see How to Resynchronize a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  2. On cluster-paris, validate the cluster's configuration for each protection group.


    # geopg validate protection-group-name 
    
    protection-group-name

    Specifies a unique name that identifies a single protection group

    For more information, see How to Validate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  3. On cluster-paris, activate each protection group in the secondary role without data replication.

    Because the protection group on cluster-paris has a role of secondary, the geopg start command does not restart the application on cluster-paris.


    # geopg start -e local -n protection-group-name
    
    -e local

    Specifies the scope of the command

    By specifying a local scope, the command operates on the local cluster only.

    -n

    Prevents the start of data replication at protection group startup


    Note –

    You must use the -n option.


    protection-group-name

    Specifies the name of the protection group

    For more information, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

    Replication from cluster-newyork to cluster-paris is not started, because the -n option is given on cluster-paris.

  4. On cluster-paris, initiate a takeover for each protection group.


    # geopg takeover [-f] protection-group-name
    
    -f

    Forces the command to perform the operation without your confirmation

    protection-group-name

    Specifies the name of the protection group

    For more information about the geopg takeover command, see How to Force Immediate Takeover of Sun StorEdge Availability Suite 3.2.1 Services by a Secondary Cluster.

    The protection group on cluster-paris now has the primary role, and the protection group on cluster-newyork has the secondary role.

  5. On cluster-newyork, activate each protection group.

    Because the protection group on cluster-newyork has a role of secondary, the geopg start command does not restart the application on cluster-newyork.


    # geopg start -e local [-n] protection-group-name
    
    -e local

    Specifies the scope of the command

    By specifying a local scope, the command operates on the local cluster only.

    -n

    Prevents the start of data replication at protection group startup

    If you omit this option, the data replication subsystem starts at the same time as the protection group.

    protection-group-name

    Specifies the name of the protection group

    For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.

  6. Start data replication.

    To start data replication, activate the protection group on the primary cluster, cluster-paris.


    # geopg start -e local protection-group-name
    

    For more information about the geopg start command, see How to Activate a Sun StorEdge Availability Suite 3.2.1 Protection Group.
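
The failback-takeover steps switch between the two clusters, so a sketch that prints each command with the cluster on which you run it might help when reviewing the sequence. Nothing is executed. The function name and argument order are hypothetical, and the optional -f and -n options in steps 4 and 5 are left out. If the protection group is active on the original primary, deactivate it with geopg stop before the geopg update line, as described in step 1.

```shell
#!/bin/sh
# Sketch only: failback_takeover_plan is a hypothetical helper that
# prints the commands from the procedure above. It runs nothing.
failback_takeover_plan() {
    if [ $# -ne 4 ]; then
        echo "usage: failback_takeover_plan partnership pg original-primary original-secondary" >&2
        return 2
    fi
    partnership=$1
    pg=$2
    orig_primary=$3      # for example, cluster-paris
    orig_secondary=$4    # for example, cluster-newyork

    # Step 1: resynchronize the partnership and the protection group.
    printf '[%s] geops update %s\n' "$orig_primary" "$partnership"
    printf '[%s] geopg update %s\n' "$orig_primary" "$pg"
    # Step 2: validate the configuration.
    printf '[%s] geopg validate %s\n' "$orig_primary" "$pg"
    # Step 3: activate as secondary. The -n option is required here.
    printf '[%s] geopg start -e local -n %s\n' "$orig_primary" "$pg"
    # Step 4: take over on the original primary.
    printf '[%s] geopg takeover %s\n' "$orig_primary" "$pg"
    # Step 5: activate the protection group on the other cluster.
    printf '[%s] geopg start -e local %s\n' "$orig_secondary" "$pg"
    # Step 6: start data replication from the restored primary.
    printf '[%s] geopg start -e local %s\n' "$orig_primary" "$pg"
}
```

Printing the plan first makes it easier to confirm that step 3 carries the -n option before you run the takeover, because any updates made on the original secondary are discarded by this procedure.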

Recovering From a Sun StorEdge Availability Suite 3.2.1 Data Replication Error

When an error occurs at the data replication level, the error is reflected in the status of the resource in the replication resource group of the relevant device group.

For example, suppose that a device group called avsdg, which is controlled by Sun StorEdge Availability Suite 3.2.1, changes to the Volume failed state, VF. This state is reflected in the following resource status:


Resource Status = "FAULTED"
Resource status message = "FAULTED : Volume failed"

Note –

The Resource State remains Online because the probe is still running correctly.


Because the resource status has changed, the protection group status also changes. In this case, the local Data Replication state, the Protection Group state on the local cluster, and the overall Protection Group state become Error.

To recover from an error state, complete the relevant steps in the following procedure.

Procedure: How to Recover From a Data Replication Error

Steps
  1. Use the procedures in the Sun StorEdge Availability Suite 3.2.1 documentation to determine the causes of the FAULTED state. This state is indicated as VF.

  2. Recover from the faulted state by using the Sun StorEdge Availability Suite 3.2.1 procedures.

    If the recovery procedures change the state of the device group, this state is automatically detected by the resource and is reported as a new protection group state.

  3. Revalidate the protection group configuration.


    phys-paris-1# geopg validate protection-group-name 
    
    protection-group-name

    Specifies the name of the Sun StorEdge Availability Suite 3.2.1 protection group

  4. Review the status of the protection group configuration.


    phys-paris-1# geopg list protection-group-name 
    
    protection-group-name

    Specifies the name of the Sun StorEdge Availability Suite 3.2.1 protection group
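
Steps 3 and 4 can be combined in a small helper that lists the protection group configuration only when validation succeeds. This is a minimal sketch: the function name is hypothetical, and it assumes that geopg validate exits with a nonzero status on failure, which this chapter does not state.

```shell
#!/bin/sh
# Sketch only: revalidate_pg is a hypothetical helper name.
# Assumption: geopg validate returns a nonzero exit status on failure.
revalidate_pg() {
    pg=$1
    if [ -z "$pg" ]; then
        echo "usage: revalidate_pg protection-group-name" >&2
        return 2
    fi
    # Revalidate first; list the configuration only if validation passed.
    geopg validate "$pg" && geopg list "$pg"
}
```

Run the helper on a cluster node only after the faulted state has been cleared with the Sun StorEdge Availability Suite 3.2.1 procedures, for example revalidate_pg avspg.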