The following known issues and bugs affect the operation of the Sun Cluster Geographic Edition 3.1 8/05 release.
Problem Summary: You cannot delete a protection group that contains device groups.
Workaround: To delete a protection group that contains device groups by using the GUI, delete the device groups individually first. Then, delete the protection group.
Problem Summary: The java.io.InterruptedIOException error message appears when the common agent container writes to its log file through java.util.logging.ErrorManager.
Workaround: This exception is harmless and can safely be ignored.
Problem Summary: The Sun Cluster Geographic Edition infrastructure might remain offline after a cluster is rebooted.
Workaround:
If the Sun Cluster Geographic Edition infrastructure is offline after a cluster reboot, restart the Sun Cluster Geographic Edition infrastructure by using the geoadm start command.
Problem Summary: The GUI does not support RBAC.
Workaround: Invoke the GUI as root on the local cluster.
Problem Summary: To use the root password to access the SunPlex Manager GUI, the root password must be the same on all nodes of both clusters.
Workaround: Ensure that the root password is the same on every node of both clusters.
Problem Summary: Partner clusters in different domains cannot include the domain name as part of the cluster name.
Workaround: Specify the partner cluster name with the IP address of the logical hostname for the partner cluster in the /etc/hosts file of each node on the local cluster. See also bug 6252467.
Manually updating the /etc/hosts file might result in conflicts with local domain machines of the same name.
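As a sketch of this workaround, each node of the local cluster maps the partner cluster name to the IP address of the partner cluster's logical hostname in /etc/hosts. The cluster name and address shown here are hypothetical; substitute your own values.

```
# /etc/hosts entry on each node of the local cluster
# (hypothetical partner cluster name and address)
192.168.20.15   cluster-newyork
```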
Problem Summary: If a partnership is created on a remote cluster by using a custom heartbeat, then a heartbeat by the same name must exist on the local cluster before it can join the partnership. You cannot create a heartbeat by using the GUI, so the appropriate heartbeat will not be available to choose on the Join Partnership page.
Workaround: Use the CLI to create the custom heartbeat, and then use either CLI or GUI to join the partnership.
Problem Summary: When the sysevent daemon crashes, the cluster status goes to Error and the heartbeat status goes to No Response.
Workaround: Restart the sysevent daemon and restart the Sun Cluster Geographic Edition infrastructure as follows.
Disable the Sun Cluster Geographic Edition software.
phys-paris-1# geoadm stop
On one node of the cluster, enable the Sun Cluster Geographic Edition infrastructure.
phys-paris-1# geoadm start
For more information about the geoadm command, see the geoadm(1M) man page.
Problem Summary: When the geopg start command times out, the following message appears: “Waiting response timeout: 100000.” This message does not clearly state that the operation has timed out. Also, the timeout period is stated in milliseconds instead of seconds.
Workaround: None.
Problem Summary: When the common agent container hangs or is very slow to respond, for example, because of high system loads, the geo-failovercontrol stop method times out. This timeout results in the geo-failovercontrol resource going into the STOP_FAILED state.
Workaround: This problem should be rare because the stop_timeout period is relatively large, 10 minutes. However, if the geo-failovercontrol resource is in the STOP_FAILED state, use the following procedure to recover and enable the Sun Cluster Geographic Edition infrastructure.
Problem Summary: A protection group is activated on the primary cluster with the resource group in an OK state. If the primary cluster is rebooted, when the cluster comes back up the protection group is in a deactivated state and the resource group is in an Error state.
Workaround: During a failback-switchover, before synchronizing the partnership as described in step 1a of the procedure, the protection group must be deactivated:
# geopg stop -e Local protection-group-name

-e Local
Specifies the scope of the command. By specifying a local scope, the command operates on the local cluster only.

protection-group-name
Specifies the name of the protection group.
If the protection group is already deactivated, the state of the resource group in the protection group is probably Error. The state is Error because the application resource groups are managed and offline.
Deactivating the protection group will result in the application resource groups no longer being managed, clearing the Error state.
For the complete procedure, see How to Perform a Failback-Switchover on a System That Uses Sun StorEdge Availability Suite 3.2.1 Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: When application resource groups are added to a protection group, you might see a message that states that the application resource group and lightweight resource group must be in the same protection group. This message indicates that the application resource group must be in the same protection group as the device group that is controlled by the lightweight resource group.
Regardless of the message, do not add the lightweight resource group to the protection group because the lightweight resource group is managed by the Sun Cluster Geographic Edition software.
Workaround: None.
Problem Summary: Pulling the public network from the node that masters the device groups controlled by Sun StorEdge Availability Suite 3.2.1 and the Sun Cluster Geographic Edition infrastructure resource groups and resources results in that node losing the public network and being aborted.
Workaround: None.
Problem Summary: The switchover procedures currently documented in the Hitachi TrueCopy CCI guide are correct; however, when a switchover fails because of an SVOL-SSUS takeover, the dev_group might be left in an unmatched volume status, which causes the pairvolchk and pairsplit commands to fail.
Workaround: To bring the dev_group to matched volume status, bring the pairs within the dev_group to matched volume status. The commands to use depend on the current pair state and on which cluster's volumes you want to make primary (bring the application up on). Refer to the Hitachi TrueCopy CCI guide for the Hitachi TrueCopy command set. Then, complete the procedure in Recovering From a Switchover Failure on a System That Uses Hitachi TrueCopy Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: When a cluster node has two or more network addresses for communication, the IP_address field in the /etc/horcm.conf file must be set to NONE, even if the network addresses belong to the same subnet.
If the IP_address field is not set to NONE, Hitachi TrueCopy commands could respond unpredictably with the timeout error ENORMT, even though the remote horcmd process is alive and responding.
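A sketch of the corresponding HORCM_MON section of /etc/horcm.conf follows; the service name and the poll and timeout values shown here are illustrative only, and the timeout of 3000 matches the default discussed below.

```
HORCM_MON
#ip_address   service   poll(10ms)   timeout(10ms)
NONE          horcm0    1000         3000
```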
Workaround: Update the SUNW.GeoCtlTC resource timeout values if the default Hitachi TrueCopy timeout value has been changed in the /etc/horcm.conf file. The default Hitachi TrueCopy timeout value in /etc/horcm.conf is 3000, in units of 10 ms, which is 30 seconds.
The SUNW.GeoCtlTC resources that are created by the Sun Cluster Geographic Edition environment also have the default timeout set to 3000 (30 seconds).
If the default Hitachi TrueCopy timeout value has been changed in /etc/horcm.conf, the resource timeout values must be updated according to the algorithm discussed below. Do not change the default timeout values for /etc/horcm.conf and the Hitachi TrueCopy resources unless the situation demands otherwise.
The following equations establish an upper limit on the time it takes for a Hitachi TrueCopy command to time out based on various factors:
Units appear in seconds in the following equation.
Set horctimeout to the timeout value configured in /etc/horcm.conf
Set numhosts to the number of hosts on the remote cluster. For pair commands, the horcmd command tries to contact each remote host.
Set numretries to two. numretries specifies the maximum number of tries that the horcmd command should make to contact each remote host.
Set Upper-limit-on-timeout to (horctimeout * numhosts * numretries).
For example, if horctimeout is set to 30, numhosts is set to 2, and numretries is set to 2, then Upper-limit-on-timeout is 120.
Based on the value of Upper-limit-on-timeout, set the following resource timeout values. A minimum of 60 seconds should be added as a buffer to allow for the processing of other commands.
Validate_timeout = Upper-limit-on-timeout + 60
Update_timeout = Upper-limit-on-timeout + 60
Monitor_Check_timeout = Upper-limit-on-timeout + 60
Probe_timeout = Upper-limit-on-timeout + 60
Retry_Interval = (Probe_timeout + Thorough_probe_interval) + 60
The other timeout parameters in the resource should keep their default values.
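The calculation above can be sketched in shell arithmetic. The values of horctimeout, numhosts, and numretries below are taken from the example and should be replaced with the values from your configuration.

```shell
#!/bin/sh
# Values from the example above; replace with your configuration's values.
horctimeout=30    # /etc/horcm.conf timeout of 3000 (units of 10 ms) = 30 seconds
numhosts=2        # number of hosts on the remote cluster
numretries=2      # maximum tries that horcmd makes per remote host

upper=$(( horctimeout * numhosts * numretries ))
echo "Upper-limit-on-timeout = $upper"
echo "Validate_timeout       = $(( upper + 60 ))"
echo "Update_timeout         = $(( upper + 60 ))"
echo "Monitor_Check_timeout  = $(( upper + 60 ))"
echo "Probe_timeout          = $(( upper + 60 ))"
```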
To change the time out values, complete the following steps:
Bring the resource group offline by using the scswitch command.
Update the required timeout properties by using the scrgadm command.
Bring the resource group online by using the scswitch command.
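The three steps above can be sketched as a dry-run script that only prints the commands to run. The resource group name tc-rep-rg, the resource name tc-rep-rs, and the value 120 for Upper-limit-on-timeout are hypothetical placeholders for your configuration's values.

```shell
#!/bin/sh
# Dry-run sketch: print, rather than run, the commands that update the
# timeout properties of a hypothetical TrueCopy replication resource.
RG=tc-rep-rg   # hypothetical resource group name
RS=tc-rep-rs   # hypothetical SUNW.GeoCtlTC resource name
UPPER=120      # Upper-limit-on-timeout computed from the equations above
T=$(( UPPER + 60 ))   # each timeout gets the 60-second buffer

echo "scswitch -F -g $RG"    # bring the resource group offline
for prop in Validate_timeout Update_timeout Monitor_Check_timeout Probe_timeout; do
    echo "scrgadm -c -j $RS -y $prop=$T"   # update each timeout property
done
echo "scswitch -Z -g $RG"    # bring the resource group back online
```

Remove the echo wrappers to run the commands for real on a cluster node.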
Problem Summary: Traversing dependencies consumes a lot of system resources.
Workaround: None.
Problem Summary: Sometimes the geopg switchover command fails and does not state the reason for failure.
Workaround: Follow the procedure in Recovering From a Switchover Failure on a System That Uses Hitachi TrueCopy Replication in Sun Cluster Geographic Edition System Administration Guide.
Problem Summary: If creating or adding a device group to a protection group takes longer than the timeout period allowed within the browser, the GUI might not refresh when the operation does complete.
Workaround: You can either navigate to the partnership page in the GUI or use the command geopg list to see the result of the operation.
Problem Summary: The cacaocsc process sometimes hangs when the server-side socket is partially closed or broken. See also bug 6304065.
Workaround: Exit out of the command by using Ctrl+C or the kill command.
Problem Summary: When a cluster encounters a failure during the switchover process, such as the node that masters the infrastructure resource group losing power, an unclear message is returned.
Workaround: None.
Problem Summary: Configuration and state changes of entities on a page displayed in the GUI should cause the page to be refreshed automatically. Sometimes the refresh does not take place.
Workaround: Use the navigation tree to navigate to a different page, then return to the original page. The page is refreshed on reload.
Problem Summary: You must not perform two or more operations that update the Sun StorEdge Availability Suite 3.2.1 configuration database simultaneously in the Sun Cluster environment.
When the Sun Cluster Geographic Edition software is running, you must not perform two or more of the following commands simultaneously on different protection groups with data replicated by Sun StorEdge Availability Suite 3.2.1:
geopg add-device-group
geopg remove-device-group
geopg get
geopg delete
geopg update
geopg validate
geopg start
geopg stop
geopg switchover
geopg takeover
For example, running the geopg start pg1 command and geopg switchover pg2 command simultaneously might corrupt the Sun StorEdge Availability Suite 3.2.1 configuration database.
Sun StorEdge Availability Suite 3.2.1 is not supported on Solaris OS 10. If you are running Solaris OS 10, do not install the Sun Cluster Geographic Edition packages for Sun StorEdge Availability Suite 3.2.1 support.
Workaround: For Sun Cluster configurations consisting of two or more nodes, you must enable the Sun StorEdge Availability Suite 3.2.1 dscfglockd daemon process on all of the nodes of both partner clusters. You do not need to enable this daemon for Sun Cluster configurations consisting of only a single node.
To enable the dscfglockd daemon process, complete the following procedure on all nodes of both partner clusters.
Ensure that the Sun StorEdge Availability Suite 3.2.1 product has been installed as instructed in the Sun StorEdge Availability Suite 3.2.1 product documentation.
Ensure that the Sun StorEdge Availability Suite 3.2.1 product has been patched with the latest patches available on SunSolve at http://sunsolve.sun.com.
Create a copy of the /etc/init.d/scm file.
# cp /etc/init.d/scm /etc/init.d/scm.original
Edit the /etc/init.d/scm file.
Delete the comment character (#) and the comment “(turned off for 3.2)” from the following lines.
# do_stopdscfglockd (turned off for 3.2)
# do_dscfglockd (turned off for 3.2)
Save the edited file.
If you do not need to reboot all the Sun Cluster nodes, then a system administrator with superuser privileges must run the following command on each node.
# /usr/opt/SUNWscm/lib/dscfglockd \
-f /var/opt/SUNWesm/dscfglockd.cf
If you require further assistance, contact your Sun service representative.
Problem Summary: Running the geopg takeover or geopg switchover command on a primary cluster where the protection group has been activated results in the application resource groups in the protection group being taken offline and unmanaged, and then brought online again on the same cluster.
Workaround: None.
Problem Summary: If you bring down a node while running the geops create or geops join command, you will not be able to restart the Sun Cluster Geographic Edition infrastructure.
Workaround: Contact your Sun service representative.
Problem Summary: When the geopg switchover command times out, the protection group role might not match the data replication role. Despite this mismatch, the geoadm status command indicates that the configuration is in the OK state rather than the Error state.
Workaround: Validate the protection group again by using the geopg validate command on both clusters after a switchover or takeover times out.
Problem Summary: When a takeover operation cannot change the role of the original primary cluster, the synchronization status should be ERROR.
Workaround: Resynchronize the protection group by using the geopg update command, and then validate the protection group on original primary cluster by using the geopg validate command.
Problem Summary: The geopg takeover command returns success, but the protection group is left as the primary on both clusters.
Workaround: None.
Problem Summary: The common agent container can hang after it has been running for a prolonged period.
Workaround: None.