The following known issues and bugs affect the operation of the Sun Cluster Geographic Edition 3.1 8/05 release.
Problem Summary: You cannot delete a protection group that contains device groups.
Workaround: To delete a protection group that contains device groups by using the GUI, delete the device groups individually first. Then, delete the protection group.
Problem Summary: The java.io.InterruptedIOException error message appears when the common agent container writes to its log file through java.util.logging.ErrorManager.
Workaround: This exception is harmless and can safely be ignored.
Problem Summary: The Sun Cluster Geographic Edition infrastructure might remain offline after a cluster is rebooted.
Workaround:
If the Sun Cluster Geographic Edition infrastructure is offline after a cluster reboot, restart the Sun Cluster Geographic Edition infrastructure by using the geoadm start command.
Problem Summary: The GUI does not support RBAC.
Workaround: Invoke the GUI as root on the local cluster.
Problem Summary: To use the root password to access the SunPlex Manager GUI, the root password must be the same on all nodes of both clusters.
Workaround: Ensure that the root password is the same on every node of both clusters.
Problem Summary: Partner clusters in different domains cannot include the domain name as part of the cluster name.
Workaround: Specify the partner cluster name with the IP address of the logical hostname for the partner cluster in the /etc/hosts file of each node on the local cluster. See also bug 6252467.
Manually updating the /etc/hosts file might result in conflicts with local domain machines of the same name.
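As a sketch of this workaround, each node of the local cluster maps the partner cluster name to the IP address of the partner cluster's logical hostname in /etc/hosts. The cluster name and address shown here are hypothetical; substitute your own values.

```
# /etc/hosts entry on each node of the local cluster
# (hypothetical partner cluster name and address)
192.168.20.15   cluster-newyork
```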
Problem Summary: If a partnership is created on a remote cluster by using a custom heartbeat, then a heartbeat by the same name must exist on the local cluster before it can join the partnership. You cannot create a heartbeat by using the GUI, so the appropriate heartbeat will not be available to choose on the Join Partnership page.
Workaround: Use the CLI to create the custom heartbeat, and then use either CLI or GUI to join the partnership.
Problem Summary: When the sysevent daemon crashes, the cluster status goes to Error and the heartbeat status goes to No Response.
Workaround: Restart the sysevent daemon and restart the Sun Cluster Geographic Edition infrastructure as follows.
Disable the Sun Cluster Geographic Edition software.
phys-paris-1# geoadm stop
On one node of the cluster, enable the Sun Cluster Geographic Edition infrastructure.
phys-paris-1# geoadm start
For more information about the geoadm command, see the geoadm(1M) man page.
Problem Summary: When the geopg start command times out, the following message appears: “Waiting response timeout: 100000.” This message does not clearly state that the operation has timed out. Also, the timeout period is stated in milliseconds instead of seconds.
Workaround: None.
Problem Summary: When the common agent container hangs or is very slow to respond, for example, because of high system loads, the geo-failovercontrol stop method times out. This timeout results in the geo-failovercontrol resource going into the STOP_FAILED state.
Workaround: This problem should be rare because the stop_timeout period is relatively large, 10 minutes. However, if the geo-failovercontrol resource is in the STOP_FAILED state, use the following procedure to recover and enable the Sun Cluster Geographic Edition infrastructure.
Problem Summary: A protection group is activated on the primary cluster with the resource group in an OK state. If the primary cluster is rebooted, when the cluster comes back up the protection group is in a deactivated state and the resource group is in an Error state.
Workaround: During a failback-switchover, before synchronizing the partnership as described in step 1a of the procedure, the protection group must be deactivated:
# geopg stop -e Local protection-group-name

-e Local
Specifies the scope of the command. By specifying a local scope, the command operates on the local cluster only.

protection-group-name
Specifies the name of the protection group.
If the protection group is already deactivated, the state of the resource group in the protection group is probably Error. The state is Error because the application resource groups are managed and offline.
Deactivating the protection group will result in the application resource groups no longer being managed, clearing the Error state.
For the complete procedure, see How to Perform a Failback-Switchover on a System That Uses Sun StorEdge Availability Suite 3.2.1 Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: When application resource groups are added to a protection group, you might see a message that states that the application resource group and lightweight resource group must be in the same protection group. This message indicates that the application resource group must be in the same protection group as the device group that is controlled by the lightweight resource group.
Regardless of the message, do not add the lightweight resource group to the protection group because the lightweight resource group is managed by the Sun Cluster Geographic Edition software.
Workaround: None.
Problem Summary: Pulling the public network from the node that masters the device groups controlled by Sun StorEdge Availability Suite 3.2.1 and the Sun Cluster Geographic Edition infrastructure resource groups and resources results in that node losing the public network and being aborted.
Workaround: None.
Problem Summary: The switchover procedures currently documented in the Hitachi TrueCopy CCI guide are correct; however, when a switchover fails because of an SVOL-SSUS takeover, the dev_group might be left in an unmatched volume status, which causes the pairvolchk and pairsplit commands to fail.
Workaround: To bring the dev_group to matched volume status, bring the pairs within the dev_group to matched volume status. The commands to use depend on the current pair state and on which cluster's volumes you want to make primary (bring the application up on). Refer to the Hitachi TrueCopy CCI guide for the Hitachi TrueCopy command set. Then, complete the procedure in Recovering From a Switchover Failure on a System That Uses Hitachi TrueCopy Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: When a cluster node has two or more network addresses for communication, the IP_address field in the /etc/horcm.conf file must be set to NONE, even if the network addresses belong to the same subnet.
If the IP_address field is not set to NONE, Hitachi TrueCopy commands could respond unpredictably with the timeout error ENORMT, even though the remote horcmd process is alive and responding.
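A sketch of the corresponding HORCM_MON section of /etc/horcm.conf follows; the service name and the poll and timeout values shown here are illustrative only, and the timeout of 3000 matches the default discussed below.

```
HORCM_MON
#ip_address   service   poll(10ms)   timeout(10ms)
NONE          horcm0    1000         3000
```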
Workaround: Update the SUNW.GeoCtlTC resource timeout values if the default Hitachi TrueCopy timeout value has been changed in the /etc/horcm.conf file. The default Hitachi TrueCopy timeout value in /etc/horcm.conf is 3000, in units of 10 ms, which is 30 seconds.
The SUNW.GeoCtlTC resources that are created by the Sun Cluster Geographic Edition environment also have the default timeout set to 3000 (30 seconds).
If the default Hitachi TrueCopy timeout value has been changed in /etc/horcm.conf, the resource timeout values must be updated according to the algorithm discussed below. Do not change the default timeout values for /etc/horcm.conf and the Hitachi TrueCopy resources unless the situation demands otherwise.
The following equations establish an upper limit on the time it takes for a Hitachi TrueCopy command to time out based on various factors:
Units appear in seconds in the following equation.
Set horctimeout to the timeout value configured in /etc/horcm.conf
Set numhosts to the number of hosts on the remote cluster. For pair commands, the horcmd command tries to contact each remote host.
Set numretries to two. numretries specifies the maximum number of tries that the horcmd command should make to contact each remote host.
Set Upper-limit-on-timeout to (horctimeout * numhosts * numretries).
For example, if horctimeout is set to 30, numhosts is set to 2, and numretries is set to 2, then Upper-limit-on-timeout is 120.
Based on the value of Upper-limit-on-timeout, set the following resource timeout values. A minimum of 60 seconds should be added as a buffer to allow for the processing of other commands.
Validate_timeout = Upper-limit-on-timeout + 60
Update_timeout = Upper-limit-on-timeout + 60
Monitor_Check_timeout = Upper-limit-on-timeout + 60
Probe_timeout = Upper-limit-on-timeout + 60
Retry_Interval = (Probe_timeout + Thorough_probe_interval) + 60
The other timeout parameters in the resource should keep their default values.
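The calculation above can be sketched in shell arithmetic. The values of horctimeout, numhosts, and numretries below are taken from the example and should be replaced with the values from your configuration.

```shell
#!/bin/sh
# Values from the example above; replace with your configuration's values.
horctimeout=30    # /etc/horcm.conf timeout of 3000 (units of 10 ms) = 30 seconds
numhosts=2        # number of hosts on the remote cluster
numretries=2      # maximum tries that horcmd makes per remote host

upper=$(( horctimeout * numhosts * numretries ))
echo "Upper-limit-on-timeout = $upper"
echo "Validate_timeout       = $(( upper + 60 ))"
echo "Update_timeout         = $(( upper + 60 ))"
echo "Monitor_Check_timeout  = $(( upper + 60 ))"
echo "Probe_timeout          = $(( upper + 60 ))"
```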
To change the time out values, complete the following steps:
Bring the resource group offline by using the scswitch command.
Update the required timeout properties by using the scrgadm command.
Bring the resource group online by using the scswitch command.
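The three steps above can be sketched as a dry-run script that only prints the commands to run. The resource group name tc-rep-rg, the resource name tc-rep-rs, and the value 120 for Upper-limit-on-timeout are hypothetical placeholders for your configuration's values.

```shell
#!/bin/sh
# Dry-run sketch: print, rather than run, the commands that update the
# timeout properties of a hypothetical TrueCopy replication resource.
RG=tc-rep-rg   # hypothetical resource group name
RS=tc-rep-rs   # hypothetical SUNW.GeoCtlTC resource name
UPPER=120      # Upper-limit-on-timeout computed from the equations above
T=$(( UPPER + 60 ))   # each timeout gets the 60-second buffer

echo "scswitch -F -g $RG"    # bring the resource group offline
for prop in Validate_timeout Update_timeout Monitor_Check_timeout Probe_timeout; do
    echo "scrgadm -c -j $RS -y $prop=$T"   # update each timeout property
done
echo "scswitch -Z -g $RG"    # bring the resource group back online
```

Remove the echo wrappers to run the commands for real on a cluster node.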
Problem Summary: Traversing dependencies consumes a lot of system resources.
Workaround: None.
Problem Summary: Sometimes the geopg switchover command fails and does not state the reason for failure.
Workaround: Follow the procedure in Recovering From a Switchover Failure on a System That Uses Hitachi TrueCopy Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: If creating or adding a device group to a protection group takes longer than the timeout period allowed within the browser, the GUI might not refresh when the operation does complete.
Workaround: You can either navigate to the partnership page in the GUI or use the command geopg list to see the result of the operation.
Problem Summary: The cacaocsc process sometimes hangs when the server-side socket is partially closed or broken. See also bug 6304065.
Workaround: Exit out of the command by using Ctrl+C or the kill command.
Problem Summary: When a cluster encounters a failure during the switchover process, such as the node that masters the infrastructure resource group losing power, an unclear message is returned.
Workaround: None.
Problem Summary: Configuration and state changes of entities on a page displayed in the GUI should cause the page to be refreshed automatically. Sometimes the refresh does not take place.
Workaround: Use the navigation tree to navigate to a different page, then return to the original page. The page is refreshed on reload.
Problem Summary: You must not perform two or more operations that update the Sun StorEdge Availability Suite 3.2.1 configuration database simultaneously in the Sun Cluster environment.
When the Sun Cluster Geographic Edition software is running, you must not perform two or more of the following commands simultaneously on different protection groups with data replicated by Sun StorEdge Availability Suite 3.2.1:
geopg add-device-group
geopg remove-device-group
geopg get
geopg delete
geopg update
geopg validate
geopg start
geopg stop
geopg switchover
geopg takeover
For example, running the geopg start pg1 command and geopg switchover pg2 command simultaneously might corrupt the Sun StorEdge Availability Suite 3.2.1 configuration database.
Sun StorEdge Availability Suite 3.2.1 is not supported on Solaris OS 10. If you are running Solaris OS 10, do not install the Sun Cluster Geographic Edition packages for Sun StorEdge Availability Suite 3.2.1 support.
Workaround: For Sun Cluster configurations consisting of two or more nodes, you must enable the Sun StorEdge Availability Suite 3.2.1 dscfglockd daemon process on all of the nodes of both partner clusters. You do not need to enable this daemon for Sun Cluster configurations consisting of only a single node.
To enable the dscfglockd daemon process, complete the following procedure on all nodes of both partner clusters.
Ensure that the Sun StorEdge Availability Suite 3.2.1 product has been installed as instructed in the Sun StorEdge Availability Suite 3.2.1 product documentation.
Ensure that the Sun StorEdge Availability Suite 3.2.1 product has been patched with the latest patches available on SunSolve at http://sunsolve.sun.com.
Create a copy of the /etc/init.d/scm file.
# cp /etc/init.d/scm /etc/init.d/scm.original
Edit the /etc/init.d/scm file.
Delete the comment character (#) and the comment “(turned off for 3.2)” from the following lines.
# do_stopdscfglockd (turned off for 3.2)
# do_dscfglockd (turned off for 3.2)
Save the edited file.
If you do not need to reboot all the Sun Cluster nodes, then a system administrator with superuser privileges must run the following command on each node.
# /usr/opt/SUNWscm/lib/dscfglockd \
-f /var/opt/SUNWesm/dscfglockd.cf
If you require further assistance, contact your Sun service representative.
Problem Summary: Running the geopg takeover or geopg switchover command on a primary cluster where the protection group has been activated results in the application resource groups in the protection group being taken offline and unmanaged, and then brought online again on the same cluster.
Workaround: None.
Problem Summary: If you bring down a node while running the geops create or geops join command, you will not be able to restart the Sun Cluster Geographic Edition infrastructure.
Workaround: Contact your Sun service representative.
Problem Summary: When the geopg switchover command times out, the protection group role might not match the data replication role. Despite this mismatch, the geoadm status command indicates that the configuration is in the OK state rather than the Error state.
Workaround: Validate the protection group again by using the geopg validate command on both clusters after a switchover or takeover times out.
Problem Summary: When a takeover operation cannot change the role of the original primary cluster, the synchronization status should be ERROR.
Workaround: Resynchronize the protection group by using the geopg update command, and then validate the protection group on original primary cluster by using the geopg validate command.
Problem Summary: The geopg takeover command returns success, but the protection group is left as the primary on both clusters.
Workaround: None.
Problem Summary: The common agent container can hang after it has been running for a prolonged period.
Workaround: None.