Case Where DIDs Get Reassigned in a Geographic Edition Oracle ZFS Storage Appliance Configuration (24851015)
Problem Summary: Because Oracle ZFS Storage Appliance replicated LUNs cannot be exported while a cluster is secondary for an Oracle ZFS Storage Appliance protection group, the cluster cannot access these LUNs. If the user runs the cldevice clear command on this secondary cluster, the DIDs corresponding to these replicated LUNs are removed. If the user then adds new LUNs to the secondary cluster and runs the cldevice populate command, DIDs that had been assigned to the deleted replicated LUNs might get reassigned to the newly added LUNs.
If the cluster later becomes primary, when the application that uses this replicated data starts on the cluster, its attempts to access the DIDs that had been assigned to the deleted replicated LUNs will not find the expected data, and the application will fail to start.
Workaround: To avoid this issue, never run the cldevice clear command on a cluster that is secondary for an Oracle ZFS Storage Appliance protection group.
If you encounter this problem, you can use the cldevice rename command to resolve the issue. The following scenario illustrates one instance of this problem and the commands to recover from it. The scenario uses the following example components:
- clusterB – The secondary cluster for the Oracle ZFS Storage Appliance protection group.
- zfssaBoxB – The current target for the replicated project.
- DID 13 – The DID on clusterB that corresponds to the replicated LUN in the project that is managed by this protection group.
The following series of actions would create this problem:
1. Add a new LUN to clusterB.
2. On one node of clusterB, issue the cldevice clear command. DID 13 is removed from the cluster, since the replicated LUN is not exported and cannot be accessed.
3. On one node of clusterB, issue the cldevice populate command. DID 13 is assigned to the new LUN created in Step 1.
4. Switch over the protection group to clusterB to make clusterB the primary. The switchover issues the cldevice populate command. The cluster allocates the next available DID, 14, to the replicated LUN that is now accessible on clusterB.
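The failure sequence above can be sketched as a command transcript run on clusterB. This is an illustrative sketch, not output from a real system: the protection-group name zfssa-pg is a placeholder assumption, and geopg switchover is the Geographic Edition command that performs the switchover in Step 4.

```shell
# Steps 2-3, on one node of clusterB (the secondary), after a new LUN
# has been added in Step 1:
cldevice clear       # removes DID 13; the replicated LUN is not exported
cldevice populate    # reassigns DID 13 to the newly added LUN

# Step 4: switch the protection group over to clusterB
# (zfssa-pg is a placeholder protection-group name)
geopg switchover -m clusterB zfssa-pg
# The switchover runs cldevice populate, which assigns the next available
# DID (14) to the replicated LUN that is now accessible on clusterB.
```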
The application resource is now unable to start, because the data in DID 13 is not what is expected.
The following recovery steps correct the problem, where DID 15 is the next unassigned DID:
1. On each node of clusterB, move DID 13 to DID 15.
   # cldevice rename -d 15 13; devfsadm; cldevice populate
2. On each node of clusterB, move DID 14 to DID 13.
   # cldevice rename -d 13 14; devfsadm; cldevice populate
3. Restart the application resource group.
   # clresourcegroup restart rg-name
The application can now start because it finds the expected data in DID 13.
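One way to confirm the recovery, sketched below under the assumption that the renames have completed on every node, is to inspect the DID mapping with cldevice list -v, which prints each DID instance alongside its full device path, and cldevice status, which reports instance health:

```shell
# On a node of clusterB, verify that DID 13 now maps to the replicated LUN
# and that the LUN added earlier kept its renamed DID (15 in this example):
cldevice list -v d13 d15

# Confirm that the renamed instances are in a healthy state:
cldevice status d13 d15
```

If the paths shown for d13 match the replicated LUN, the application resource group can safely be restarted.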