Sun Cluster 3.1 Release Notes

Known Issues and Bugs

The following known issues and bugs affect the operation of the Sun Cluster 3.1 release. For the most current information, see the online Sun Cluster 3.1 Release Notes Supplement at http://docs.sun.com.

Incorrect Largefile Status (4419214)

Problem Summary: The /etc/mnttab file does not show the most current largefile status of a globally mounted VxFS file system.

Workaround: Use the fsadm command to verify the largefile status of the file system, instead of relying on the /etc/mnttab entry.
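
For example, for a VxFS file system mounted at a hypothetical mount point /global/fs1, the VxFS-specific fsadm command reports the current largefiles setting:

```
# /usr/lib/fs/vxfs/fsadm /global/fs1
```

The command typically prints largefiles or nolargefiles for the mounted file system.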

Global VxFS File System Lists Block Allocations Differently Than Local VxFS (4449437)

Problem Summary: For a given file size, a global VxFS file system appears to allocate more disk blocks than a local VxFS file system does.

Workaround: Unmounting and remounting the file system eliminates the extra disk blocks reported as allocated to the file.

Nodes Unable to Bring up qfe Paths (4526883)

Problem Summary: Sometimes, private interconnect transport paths ending at a qfe adapter fail to come online.

Workaround: Complete the following steps:

  1. Using scstat -W, identify the adapter at fault. The output shows all transport paths that have that adapter as one of the path endpoints in the faulted or waiting state.

  2. Use scsetup to remove from the cluster configuration all the cables connected to that adapter.

  3. Use scsetup again to remove that adapter from the cluster configuration.

  4. Add back the adapter and the cables.

  5. Verify whether the paths appear. If the problem persists, repeat Steps 1 through 4 a few times.

  6. Verify whether the paths appear. If the problem still persists, reboot the node with the at-fault adapter. Before rebooting, make sure that the remaining cluster nodes have enough quorum votes to survive the reboot.
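
The scsetup utility is menu-driven; the operations in Steps 1 through 4 correspond roughly to the following scconf commands (the adapter name qfe0, node name phys-node-1, and switch name switch1 are hypothetical):

```
# scstat -W
# scconf -r -m endpoint=phys-node-1:qfe0
# scconf -r -A name=qfe0,node=phys-node-1
# scconf -a -A trtype=dlpi,name=qfe0,node=phys-node-1
# scconf -a -m endpoint=phys-node-1:qfe0,endpoint=switch1
```
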

File Blocks Not Updated Following Writes to Sparse File Holes (4607142)

Problem Summary: A file's block count is not always consistent across cluster nodes following block-allocating write operations within a sparse file. For a cluster file system layered on UFS (or VxFS 3.4), the block inconsistency across cluster nodes disappears within 30 seconds or so.

Workaround: Perform a file metadata operation that updates the inode (for example, touch(1)) to synchronize the st_blocks value; subsequent metadata operations will then return consistent st_blocks values.

Concurrent Use of forcedirectio and mmap(2) May Cause Panics (4629536)

Problem Summary: Using the forcedirectio mount option and the mmap(2) function concurrently might cause data corruption, system hangs, or panics.

Workaround: Observe the following restriction:

If there is a need to use directio, mount the whole file system with the directio option.
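
For example, a global UFS file system can be mounted with directio enabled for the entire file system (the device and mount point below are hypothetical):

```
# mount -F ufs -o global,logging,forcedirectio /dev/global/dsk/d10s0 /global/data
```
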

Unmounting of a Cluster File System Fails (4656624)

Problem Summary: Unmounting a cluster file system sometimes fails even though the fuser command shows that there are no users on any node.

Workaround: Retry the unmounting after all asynchronous I/O to the underlying file system has been completed.

Rebooting Puts Cluster Nodes in a Non–Working State (4664510)

Problem Summary: After powering off one of the Sun StorEdge T3 Arrays and running scshutdown, rebooting both nodes puts the cluster in a non-working state.

Workaround: If half the replicas are lost, perform the following steps:

  1. Ensure the cluster is in cluster mode.

  2. Forcibly import the diskset.


    # metaset -s set-name -f -C take
    
  3. Delete the broken replicas.


    # metadb -s set-name -fd /dev/did/dsk/dNsX
    
  4. Release the diskset.


    # metaset -s set-name -C release
    

    The file system can now be mounted and used. However, the redundancy in the replicas has not been restored. If the other half of the replicas is lost, there will be no way to restore the mirror to a sane state.

  5. Recreate the databases after the above repair procedure is applied.

Dissociating a Plex from a Disk Group Causes Panic (4657088)

Problem Summary: Dissociating or detaching a plex from a disk group under Sun Cluster may panic the cluster node with the following panic string:

panic[cpu2]/thread=30002901460: BAD TRAP: type=31 rp=2a101b1d200 addr=40 mmu_fsr=0 occurred in module "vxfs" due to a NULL pointer dereference

Workaround: Before dissociating or detaching a plex from a disk group, unmount the corresponding file system.

scvxinstall -i Fails to Install a License Key (4706175)

Problem Summary: The scvxinstall -i command accepts a license key with the -L option. However, the key is ignored and does not get installed.

Workaround: Do not provide a license key with the -i form of scvxinstall; the key is ignored. Instead, install license keys with the interactive form or with the -e option. Before proceeding with root encapsulation, examine the license requirements and provide the needed keys either with the -e option or in the interactive form.

Sun Cluster HA–Siebel Fails to Monitor Siebel Components (4722288)

Problem Summary: The Sun Cluster HA-Siebel agent does not monitor individual Siebel components. If a Siebel component fails, only a warning message is logged in syslog.

Workaround: Restart the Siebel server resource group in which components are offline by using the command scswitch -R -h node -g resource_group.

The remove Script Fails to Unregister SUNW.gds Resource Type (4727699)

Problem Summary: The remove script fails to unregister the SUNW.gds resource type and displays the following message:


Resource type has been un-registered already.

Workaround: After using the remove script, manually unregister SUNW.gds. Alternatively, use the scsetup command or SunPlex Manager.

Create IPMP Group Option Overwrites hostname.int (4731768)

Problem Summary: If the Create IPMP group option in SunPlex Manager is used with an adapter that is already configured with an IP address, the option overwrites the adapter's /etc/hostname.int file.

Workaround: The Create IPMP group option in SunPlex Manager must be used only with adapters that are not already configured. If an adapter is already configured with an IP address, the adapter should be manually configured using Solaris IPMP management tools.

Using the Solaris shutdown Command May Result in Node Panic (4745648)

Problem Summary: Using the Solaris shutdown command or similar commands (for example, uadmin) to bring down a cluster node may result in node panic and display the following message:

CMM: Shutdown timer expired. Halting.

Workaround: Contact your Sun service representative for support. The panic is necessary to provide a guaranteed safe way for another node in the cluster to take over the services that were being hosted by the shutting-down node.

Administrative Command to Add a Quorum Device to the Cluster Fails (4746088)

Problem Summary: If a cluster has the minimum votes required for quorum, an administrative command to add a quorum device to the cluster fails with the following message:

Cluster could lose quorum

Workaround: Contact your Sun service representative for support.

Path Timeouts When Using ce Adapters on the Private Interconnect (4746175)

Problem Summary: Clusters using ce adapters on the private interconnect may notice path timeouts and subsequent node panics if one or more cluster nodes have more than four processors.

Workaround: Set the ce_taskq_disable parameter in the ce driver by adding set ce:ce_taskq_disable=1 to /etc/system file on all cluster nodes and then rebooting the cluster nodes. This ensures that heartbeats (and other packets) are always delivered in the interrupt context, eliminating path timeouts and the subsequent node panics. Quorum considerations should be observed while rebooting cluster nodes.
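
For example, the parameter can be appended to /etc/system and the change verified as follows; the setting takes effect at the next reboot:

```
# echo "set ce:ce_taskq_disable=1" >> /etc/system
# grep ce_taskq_disable /etc/system
```
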

Siebel Gateway Probe May Time Out When a Public Network Fails (4764204)

Problem Summary: Failure of a public network may cause the Siebel gateway probe to time out and eventually cause the Siebel gateway resource to go offline. This may occur if the node on which the Siebel gateway is running has a path beginning with /home that depends on network resources such as NFS and NIS. Without the public network, the Siebel gateway probe hangs while trying to open a file on /home, causing the probe to time out.

Workaround: Complete the following steps on all nodes of the cluster that can host the Siebel gateway.

  1. Ensure that the passwd, group, and project entries in /etc/nsswitch.conf refer only to files and not to nis.

  2. Ensure that there are no NFS or NIS dependencies for any path starting with /home.

    Either mount /home locally, or rename the /home mount point to /export/home or another name that does not start with /home.

  3. In the /etc/auto_master file, comment out the line containing the entry +auto_master. Also comment out any /home entries using auto_home.

  4. In /etc/auto_home, comment out the line containing +auto_home.
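
After Steps 1 through 4, the relevant entries might look like the following sketch (your files may contain additional entries):

```
# /etc/nsswitch.conf
passwd:   files
group:    files
project:  files

# /etc/auto_master
#+auto_master
#/home          auto_home

# /etc/auto_home
#+auto_home
```
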

Flushing Gateway Routes Breaks Per–Node Logical IP Communication (4766076)

Problem Summary: To provide highly available, per-node, logical IP communication over a private interconnect, Sun Cluster software relies on gateway routes on the cluster nodes. Flushing the gateway routes will break the per-node logical IP communication.

Workaround: Reboot the cluster nodes where the routes were inadvertently flushed. To restore the gateway routes, it is sufficient to reboot the cluster nodes one at a time. Per-node logical IP communication will remain broken until the routes have been restored. Quorum considerations must be observed while rebooting cluster nodes.

Unsuccessful Failover Results in Error (4766781)

Problem Summary: An unsuccessful failover/switchover of a file system might leave the file system in an errored state.

Workaround: Unmount and remount the file system.

Enabling TCP Selective Acknowledgments May Cause Data Corruption (4775631)

Problem Summary: Enabling TCP selective acknowledgments on cluster nodes may cause data corruption.

Workaround: No user action is required. To avoid data corruption on the global file system, do not re-enable TCP selective acknowledgments on cluster nodes.
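
On Solaris, the current TCP selective-acknowledgment setting can be inspected with ndd; a value of 0 means selective acknowledgments are disabled (1 is passive, 2 is active):

```
# ndd /dev/tcp tcp_sack_permitted
```
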

scinstall Incorrectly Shows Some Data Services as Unsupported (4776411)

Problem Summary: scinstall incorrectly shows that the Sun Cluster HA for SAP and Sun Cluster HA for SAP liveCache data services are not supported on Solaris 9.

Workaround: Both data services are supported on Solaris 8 and Solaris 9; ignore the unsupported feature list in scinstall.

scdidadm Exits With an Error if /dev/rmt is Missing (4783135)

Problem Summary: The current implementation of scdidadm(1M) relies on the existence of both /dev/rmt and /dev/(r)dsk to successfully execute scdidadm -r. Solaris installs both, regardless of the existence of the actual underlying storage devices. If /dev/rmt is missing, scdidadm exits with the following error:

Cannot walk /dev/rmt during execution of scdidadm -r

Workaround: On any node where /dev/rmt is missing, use mkdir to create the /dev/rmt directory. Then run scgdevs from one node.
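
That is:

```
# mkdir /dev/rmt
# scgdevs
```

Run mkdir on each node where the directory is missing; scgdevs needs to be run from only one cluster node.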

Data Corruption When Node Failure Causes the Cluster File System Primary to Die (4804964)

Problem Summary: Data corruption may occur with Sun Cluster 3.x systems running patches 113454-04, 113073-02, and 113276-02 (or a subset of these patches). The problem occurs only with globally mounted UFS file systems. The corruption results in missing data (that is, zeros appear where data should exist), and the amount of missing data is always a multiple of a disk block. The data loss can occur any time a node failure causes the cluster file system primary to die soon after the cluster file system client completes, or reports that it has just completed, a write operation. The period of vulnerability is limited, and the problem does not occur every time.

Workaround: Use the -o syncdir mount option to force UFS to use synchronous log transactions.
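
For example (the device and mount point below are hypothetical):

```
# mount -F ufs -o global,logging,syncdir /dev/global/dsk/d4s0 /global/web
```

The option can equally be added to the file system's /etc/vfstab entry so that it persists across reboots.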

Node Hangs After Rebooting When Switchover is in Progress (4806621)

Problem Summary: If a device group switchover is in progress when a node joins the cluster, the joining node and the switchover operation may hang. Any attempts to access any device service will also hang. This is more likely to happen on a cluster with more than two nodes, and when the file system mounted on the device group is VxFS.

Workaround: To avoid this situation, do not initiate device group switchovers while a node is joining the cluster. If this situation occurs, then all the cluster nodes must be rebooted to restore access to device groups.

File System Panics When Cluster File System is Full (4808748)

Problem Summary: When a cluster file system is full, the file system might panic with one of the following messages:

assertion failed: cur_data_token & PXFS_WRITE_TOKEN or PXFS_READ_TOKEN

vp->v_pages == NULL

These panics are intended to prevent data corruption when a file system is full.

Workaround: To reduce the likelihood of this problem, use UFS for cluster file systems whenever possible. It is extremely rare for one of these panics to occur when using a cluster file system with UFS; the risk is greater with VxFS.

Cluster Node Hangs While Booting Up (4809076)

Problem Summary: When a device service switchover request (scswitch -z -D device-group -h node) is concurrent with a node reboot and there are global file systems configured on the device service, the global file systems might become unavailable, and subsequent configuration changes involving any device service or global file system may also hang. Additionally, subsequent cluster node joins might hang.

Workaround: Recovery requires a reboot of all the cluster nodes.

Removing a Quorum Device Using scconf -rq Causes Cluster Panic (4811232)

Problem Summary: If you execute the scconf -rq command to remove a quorum device in a vulnerable configuration, all nodes of the cluster panic with the following message:

CMM lost operational quorum

Workaround: To remove a quorum device from a cluster, first check the output of scstat -q. If the quorum device is listed with more than one vote in the Present column, first put the device into maintenance mode by using scconf -cq globaldev=QD,maintstate. After the command completes and scstat -q shows the quorum device with 0 votes present, remove the device by using scconf -rq.
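
The procedure, with a hypothetical quorum device d20:

```
# scstat -q
# scconf -cq globaldev=d20,maintstate
# scstat -q
# scconf -rq globaldev=d20
```
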

Mirrored Volume Fails When Using O_EXCL Flag (4820273)

Problem Summary: If Solstice DiskSuite/Solaris Volume Manager is being used and a mirrored volume is opened with the O_EXCL flag, failover of the device group containing that volume will fail, panicking the new device group primary when the volume is first accessed after the failover.

Workaround: When using Solstice DiskSuite/Solaris Volume Manager, do not open mirrored volumes with the O_EXCL flag.

Cluster Hangs After a Node is Rebooted During Switchover (4823195)

Problem Summary: If a device service failover request is concurrent with a node reboot or a node join, and there are cluster file systems configured on the device service, the cluster file systems might become unavailable and subsequent configuration changes involving any device service or cluster file system may also hang. Additionally, subsequent cluster node joins might hang.

Workaround: Recovery requires a reboot of all the cluster nodes.

Untranslated Text in the French Locale (4840085)

Problem Summary: Some untranslated text appears when using the SunPlex Manager to install Sun Cluster in the French locale.

Workaround: This error does not affect SunPlex Manager's functionality. You can either ignore the untranslated text or set your browser's language to English to avoid the mixed-language display.