Problem Summary: Because Oracle ZFS Storage Appliance replicated LUNs cannot be exported while a cluster is secondary for an Oracle ZFS Storage Appliance protection group, the cluster cannot access these LUNs. If the user runs the cldevice clear command on this secondary cluster, the DIDs that correspond to these replicated LUNs are removed. If the user then adds new LUNs to the secondary cluster and runs the cldevice populate command, DIDs that had been assigned to the deleted replicated LUNs might be reassigned to the newly added LUNs.
If the cluster later becomes primary, when the application that uses this replicated data starts on the cluster, attempts to access the DIDs that had been assigned to the deleted replicated LUNs will not find the expected data, and the application will fail to start.
Workaround: To avoid this issue, never run the cldevice clear command on a cluster that is secondary for an Oracle ZFS Storage Appliance protection group.
If you encounter this problem, you can use the cldevice rename command to resolve the issue. The following scenario illustrates one instance of this problem and the commands to recover from it. The scenario uses the following example components:
clusterB – The secondary cluster for the Oracle ZFS Storage Appliance protection group.
zfssaBoxB – The current target for the replicated project.
DID 13 – The DID on clusterB that corresponds to the replicated LUN in the project that is managed by this protection group.
The following series of actions would create this problem:
Add a new LUN to clusterB.
On one node of clusterB, issue the cldevice clear command.
DID 13 is removed from the cluster, since the replicated LUN is not exported and cannot be accessed.
On one node of clusterB, issue the cldevice populate command.
DID 13 is assigned to the new LUN created in Step 1.
Switch over the protection group to clusterB to make clusterB the primary.
The switchover issues the cldevice populate command. The cluster allocates the next available DID, 14, to the replicated LUN that is now accessible on clusterB.
The application resource is now unable to start, because the data in DID 13 is not what is expected.
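The switchover in the scenario above would typically be performed with the geopg switchover command. The following invocation is illustrative only; zfssa-pg is a hypothetical protection group name, so substitute the name of your own protection group.

# geopg switchover -m clusterB zfssa-pg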
The following recovery steps correct the problem, where DID 15 is the next unassigned DID:
On each node of clusterB, move DID 13 to DID 15.
# cldevice rename -d 15 13
# devfsadm
# cldevice populate
On each node of clusterB, move DID 14 to DID 13.
# cldevice rename -d 13 14
# devfsadm
# cldevice populate
Restart the application resource group.
# clresourcegroup restart rg-name
The application can now start because it finds the expected data in DID 13.
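To confirm that the DIDs now map to the intended devices, you can inspect the device mappings on each node. For example, the following commands list all DID mappings and show the details of DID 13:

# cldevice list -v
# cldevice show d13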
Problem Summary: Attempting to retrieve the Oracle Data Guard protection group configuration fails with an error if HA for Oracle Database has dependencies on other resources.
Workaround: Set the protection group's external_dependencies_allowed property to true.
# geopg set-prop -p external_dependencies_allowed=TRUE protection_group
Problem Summary: A project or mount point that is configured on the target appliance with the same name as one on the source appliance managed by Geographic Edition on the primary cluster results in switchover or takeover failures.
Workaround: Before adding the Oracle ZFS Storage Appliance replicated project to the protection group, ensure that the target appliance does not have a project or mount point with the same name as the source appliance.
Problem Summary: Once a multigroup is created by using the geomg create command on any controller in a site, the multigroup is created automatically on the other clusters in the site, provided that the controller has no site configuration synchronization errors with those clusters. If the site synchronization status between any such cluster and that controller is in ERROR, that cluster does not accept the multigroup creation.
One possible way to attempt to resolve the site synchronization error is to run the geosite update command on that cluster with the controller as an argument. This makes the site's configuration data on the cluster the same as the data that exists on the controller, and thereby replicates the multigroup onto that cluster. However, this replication of the multigroup configuration might fail in some situations, even though the site synchronization status of that cluster then reports OK with respect to the controller.
Workaround: Use the geosite leave command to make that cluster leave the site, and then add it back to the site by using the geosite add-member and geosite join commands.
Problem Summary: If the Geographic Edition setup on a cluster has multiple protection groups and multi-group configurations, the related infrastructure components might take a long time to start. This startup is managed by the geo-failovercontrol resource of the SUNW.scmasa resource type, which has a default start timeout of 600 seconds. If the geo-failovercontrol resource takes more time to start than the default start timeout, the Geographic Edition infrastructure goes offline.
Workaround: Increase the Start_timeout property value of the geo-failovercontrol resource in the geo-infrastructure resource group. If the RG_system property of the geo-infrastructure resource group is TRUE, temporarily change it to FALSE before changing the resource property.
Type the following commands to change the Start_timeout of the resource to 1200 seconds.
$ /usr/cluster/bin/clresourcegroup set -p RG_system=FALSE geo-infrastructure
$ /usr/cluster/bin/clresource set -p Start_timeout=1200 geo-failovercontrol
$ /usr/cluster/bin/clresourcegroup set -p RG_system=TRUE geo-infrastructure
Problem Summary: After a node failure on the secondary partner, the Oracle GoldenGate replication status resource does not start on another node of the secondary partner, because the resource group that the Oracle GoldenGate replication status resource group has an affinity for did not come online. This behavior is correct according to the resource group affinity. However, the protection group's data replication status does not reflect the replication status resource's new state, and the replication status still shows OK.
Workaround: Validate the protection group by running the geopg validate command on the cluster. This command queries the latest replication resource status and updates the protection group's replication status.
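For example, if the protection group were named gg-pg (a hypothetical name used here for illustration), you would run the following command on the cluster:

# geopg validate gg-pg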
Problem Summary: Protection group creation fails if one of the cluster nodes is down or the common agent container is not running on a node, and displays the following error message on the terminal:
Cannot reach management agent on cluster-node : Internal Error :javax.management.RuntimeMBeanException: java.lang.IllegalArgumentException: Unmatched braces in the pattern.
Workaround: Ensure that the common agent container is running on all cluster nodes. If a node is down, boot the node, or remove the node from the cluster and then create the protection group.
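You can check the state of the common agent container on each node with the cacaoadm command, and start it if it is not running:

# /usr/sbin/cacaoadm status
# /usr/sbin/cacaoadm start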
Problem Summary: Protection group creation fails if one of the cluster nodes is down. This situation occurs when the script-based plug-in module tries to check whether all *_script files exist and are executable on all cluster nodes. The check is performed on all nodes because the script-based plug-in module does not have a script-based plug-in name to look up in the configuration file. If one of the cluster nodes is down, an exception is thrown that terminates the protection group creation.
Workaround: Bring up the node or remove the node and create the protection group.
Problem Summary: If you run the geopg takeover command when both the primary and secondary ZFSSA appliances are up, a subsequent switchover to the secondary site fails because of an empty project that exists on the original primary ZFSSA appliance after the protection group is activated.
Workaround: After the protection group is activated, and before attempting to switch over the protection group, remove the empty project on the secondary appliance.
Problem Summary: Geographic Edition incorrectly allows a switchover while replication is in the Idle (export pending) state.
Workaround: Do not use the offline replication feature on projects managed by Geographic Edition.