Case Where DIDs Get Reassigned in a Geographic Edition Oracle ZFS Storage Appliance Configuration (24851015)

Problem Summary: Because Oracle ZFS Storage Appliance replicated LUNs cannot be exported while a cluster is secondary for an Oracle ZFS Storage Appliance protection group, the cluster cannot access these LUNs. If the user runs the cldevice clear command on this secondary cluster, the DIDs corresponding to these replicated LUNs are removed. If the user then adds new LUNs to the secondary cluster and runs the cldevice populate command, DIDs that had been assigned to the deleted replicated LUNs might get reassigned to newly added LUNs.

If the cluster later becomes primary and the application that uses this replicated data starts on it, attempts to access the DIDs that had been assigned to the deleted replicated LUNs will not find the expected data, and the application will fail to start.
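You can inspect the DID-to-device mapping at any time with the cldevice list subcommand. The following check is a minimal sketch, assuming the replicated LUN corresponds to device instance d13 as in the scenario below:

    # cldevice list -v d13

The verbose listing shows the full device path behind each DID, which makes a reassignment of the kind described here visible.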

Workaround: To avoid this issue, never run the cldevice clear command on a cluster that is secondary for an Oracle ZFS Storage Appliance protection group.
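Before issuing device administration commands, you can confirm the cluster's current role for its protection groups; a minimal check, using the Geographic Edition status command:

    # geoadm status

The output includes each protection group and the local cluster's role in it, so you can verify that the cluster is not secondary before running cldevice clear.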

If you encounter this problem, you can use the cldevice rename command to resolve the issue. The following scenario illustrates one instance of this problem and the commands to recover from it. The scenario uses the following example components:

  • clusterB – The secondary cluster for the Oracle ZFS Storage Appliance protection group.

  • zfssaBoxB – The current target for the replicated project.

  • DID 13 – The DID on clusterB that corresponds to the replicated LUN in the project that is managed by this protection group.

The following series of actions would create this problem:

  1. Add a new LUN to clusterB.

  2. On one node of clusterB, issue the cldevice clear command.

    DID 13 is removed from the cluster, since the replicated LUN is not exported and cannot be accessed.

  3. On one node of clusterB, issue the cldevice populate command.

    DID 13 is assigned to the new LUN created in Step 1.

  4. Switch over the protection group to clusterB to make clusterB the primary.

    The switchover issues the cldevice populate command. The cluster allocates the next available DID, 14, to the replicated LUN that is now accessible on clusterB.

The application resource is now unable to start, because DID 13 now refers to the new LUN rather than the replicated LUN, so the data is not what the application expects.
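At this point you can observe the misassignment directly by listing both device instances; a sketch of the check for this scenario:

    # cldevice list -v d13 d14

DID 13 now maps to the LUN added in Step 1, while DID 14 maps to the replicated LUN that holds the application data.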

The following recovery steps correct the problem, where DID 15 is the next unassigned DID:

  1. On each node of clusterB, move DID 13 to DID 15.

    # cldevice rename -d 15 13
    # devfsadm
    # cldevice populate

  2. On each node of clusterB, move DID 14 to DID 13.

    # cldevice rename -d 13 14
    # devfsadm
    # cldevice populate

  3. Restart the application resource group.

    # clresourcegroup restart rg-name

The application can now start because it finds the expected data in DID 13.
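To confirm the recovery, you can verify the mapping and the state of the resource group; a minimal check, reusing the example names from this scenario:

    # cldevice list -v d13
    # clresourcegroup status rg-name

The listing should show DID 13 mapped to the replicated LUN's device path, and the resource group should report its resources online.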