Sun Cluster 3.1 10/03 Release Notes

Known Issues and Bugs

The following known issues and bugs affect the operation of the Sun Cluster 3.1 10/03 release. For the most current information, see the online Sun Cluster 3.1 10/03 Release Notes Supplement at http://docs.sun.com.

Incorrect Largefile Status (4419214)

Problem Summary: The /etc/mnttab file does not show the most current largefile status of a globally mounted VxFS file system.

Workaround: Use the fsadm command to verify the largefile status of the file system, rather than relying on the /etc/mnttab entry.
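
For example, a minimal check, assuming the VxFS fsadm is used and /global/vxfs1 stands in for the actual mount point:

# /usr/lib/fs/vxfs/fsadm /global/vxfs1

The command reports largefiles or nolargefiles for the mounted file system.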

Nodes Unable to Bring Up qfe Paths (4526883)

Problem Summary: Sometimes, private interconnect transport paths ending at a qfe adapter fail to come online.

Workaround: Follow the steps shown below:

  1. Using scstat -W, identify the adapter that is at fault. The output shows all transport paths that have that adapter as one of the path endpoints in the faulted or the waiting state.

  2. Use scsetup to remove from the cluster configuration all the cables connected to that adapter.

  3. Use scsetup again to remove that adapter from the cluster configuration.

  4. Add back the adapter and the cables.

  5. Verify whether the paths appear. If the problem persists, repeat Steps 1 through 5 a few times.

  6. Verify whether the paths appear. If the problem still persists, reboot the node that has the at-fault adapter. Before rebooting the node, make sure that the remaining cluster has enough quorum votes to survive the node reboot.
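
The preceding steps correspond roughly to the following command flow; qfe1 is a placeholder adapter name, and the cable and adapter changes in scsetup are made through its interactive cluster interconnect menu:

# scstat -W
# scsetup
# scstat -W

Run scstat -W first to find the paths that include qfe1 as an endpoint in the faulted or waiting state, use scsetup to remove the cables and then the adapter and to add them back, and run scstat -W again to confirm that the paths come online.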

File Blocks Not Updated Following Writes to Sparse File Holes (4607142)

Problem Summary: A file's block count is not always consistent across cluster nodes following block-allocating write operations within a sparse file. For a cluster file system layered on UFS (or VxFS 3.4), the block inconsistency across cluster nodes disappears within 30 seconds or so.

Workaround: Perform a file metadata operation that updates the inode (for example, touch). This synchronizes the st_blocks value so that subsequent metadata operations return a consistent st_blocks value on all nodes.
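
A minimal illustration, assuming /global/fs1/sparsefile stands in for the affected file:

# touch /global/fs1/sparsefile
# ls -s /global/fs1/sparsefile

After the touch, the block count reported by ls -s should agree on every cluster node.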

During a Network Failure, the Data Service Starts and Stops Incorrectly (4644289)

Problem Summary: The Sun Cluster HA for Oracle data service uses the su command to start and stop the database. The network name service might become unavailable when a cluster node's public network fails.

Workaround: In Solaris 9, configure the /etc/nsswitch.conf files as follows so that the data service starts and stops correctly in the event of a network failure:

On each node that can be a primary for the oracle_server or oracle_listener resource, modify /etc/nsswitch.conf to include the following entries for the passwd, group, publickey, and project databases:
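
# Files-only lookups, so that su(1M) does not consult NIS/NIS+.
passwd:    files
group:     files
publickey: files
project:   files

(These files-only entries are a sketch inferred from the requirement that su not consult NIS/NIS+; confirm them against the Sun Cluster HA for Oracle documentation for your configuration.)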

Adding the above entries ensures that the su(1M) command does not refer to the NIS/NIS+ name services.

Unmounting of a Cluster File System Fails (4656624)

Problem Summary: Unmounting a cluster file system sometimes fails, even though the fuser command shows that there are no users on any node.

Workaround: Retry the unmounting after all asynchronous I/O to the underlying file system has completed.
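
For example, with /global/fs1 as a placeholder mount point:

# fuser -c /global/fs1
# umount /global/fs1

If the umount still fails, wait for outstanding asynchronous I/O to drain and retry.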

Sun Cluster HA–Siebel Fails to Monitor Siebel Components (4722288)

Problem Summary: The Sun Cluster HA-Siebel agent does not monitor individual Siebel components. If a Siebel component fails, only a warning message is logged in syslog.

Workaround: Restart the Siebel server resource group in which components are offline by using the command scswitch -R -h node -g resource_group.
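
For example, with placeholder names phys-node1 and siebel-rg:

# scswitch -R -h phys-node1 -g siebel-rg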

Oracle RAC Instances May Become Unavailable on Newly Added Nodes (4723575)

Problem Summary: Installing Sun Cluster support for RAC on a newly added node causes Oracle RAC instances to become unavailable.

Workaround: Adding a node to a cluster that is already running Oracle RAC support, without losing availability of the Oracle RAC database, requires special installation steps. The following example describes going from a three-node cluster to a four-node cluster, with Oracle RAC running on nodes 1, 2, and 3:

  1. Install the Sun Cluster software on the new node (node 4).

    Note: Do not install the RAC support packages at this time.

  2. Reboot the new node into the cluster.

  3. Once the new node has joined the cluster, shut down the Oracle RAC database on one of the nodes where it is already running (node 1, in this example).

  4. Reboot the node where the database was just shut down (node 1).

  5. Once the node (node 1) is back up, start the Oracle database on that node to resume database service.

  6. If a single node is capable of handling the database workload, shut down the database on the remaining nodes (nodes 2 and 3), and reboot these nodes. If more than one node is required to support the database workload, reboot them one at a time as described in Steps 3 through 5.

  7. Once all nodes have been rebooted, it is safe to install the Oracle RAC support packages on the new node.

The remove Script Fails to Unregister SUNW.gds Resource Type (4727699)

Problem Summary: The remove script fails to unregister the SUNW.gds resource type and displays the following message:

Resource type has been un-registered already.

Workaround: After using the remove script, manually unregister the SUNW.gds resource type. Alternatively, use the scsetup command or SunPlex Manager.
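
A sketch of the manual unregistration, which assumes that no resources of the type remain configured:

# scrgadm -r -t SUNW.gds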

Using the Solaris shutdown Command May Result in Node Panic (4745648)

Problem Summary: Using the Solaris shutdown command or similar commands (for example, uadmin) to bring down a cluster node may cause the node to panic and display the following message:

CMM: Shutdown timer expired. Halting.

Workaround: Contact your Sun service representative for support. The panic is necessary to provide a guaranteed safe way for another node in the cluster to take over the services that were being hosted by the shutting-down node.

Path Timeouts When Using ce Adapters on the Private Interconnect (4746175)

Problem Summary: Clusters using ce adapters on the private interconnect may notice path timeouts and subsequent node panics if one or more cluster nodes have more than four processors.

Workaround: Set the ce_taskq_disable parameter in the ce driver by adding set ce:ce_taskq_disable=1 to the /etc/system file on all cluster nodes, and then reboot the cluster nodes. This ensures that heartbeats (and other packets) are always delivered in the interrupt context, eliminating the path timeouts and the subsequent node panics. Observe quorum considerations while rebooting the cluster nodes.
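
One way to make the change, shown as a sketch:

# echo 'set ce:ce_taskq_disable=1' >> /etc/system
# grep ce_taskq_disable /etc/system
set ce:ce_taskq_disable=1

Repeat on each cluster node, then reboot the nodes one at a time so that the cluster retains enough quorum votes.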

scrgadm Prevents IP Addresses of Different Subnets to Reside on one NIC (4751406)

Problem Summary: The scrgadm command prevents the hosting of logical hostnames or shared addresses that belong to a subnet different from the subnet of the IPMP (NAFO) group.

Workaround: Use the following form of the scrgadm command:

scrgadm -a -j <resource> -t <resource_type> -g <resource_group> -x HostnameList=<logical_hostname> -x NetIfList=<nafogroup>@<nodeid>

Note that node names do not appear to work in NetIfList; use node IDs instead.
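
For example, with placeholder values (resource lh-rs of type SUNW.LogicalHostname, resource group lh-rg, logical hostname loghost-1, and NAFO group nafo0 on node ID 1):

# scrgadm -a -j lh-rs -t SUNW.LogicalHostname -g lh-rg -x HostnameList=loghost-1 -x NetIfList=nafo0@1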

Unsuccessful Failover Results in Error (4766781)

Problem Summary: An unsuccessful failover or switchover of a file system might leave the file system in an error state.

Workaround: Unmount and remount the file system.

Node Hangs After Rebooting When Switchover Is in Progress (4806621)

Problem Summary: If a device group switchover is in progress when a node joins the cluster, the joining node and the switchover operation may hang. Any attempt to access a device service will also hang. This is more likely to happen on a cluster with more than two nodes, and when the file system mounted on the device is a VxFS file system.

Workaround: To avoid this situation, do not initiate device group switchovers while a node is joining the cluster. If this situation occurs, then all the cluster nodes must be rebooted to restore access to device groups.

DNS Wizard Fails if an Existing DNS Configuration is not Supplied (4839993)

Problem Summary: SunPlex Manager includes a data service installation wizard that sets up a highly available DNS service on the cluster. If the user does not supply an existing DNS configuration, such as a named.conf file, the wizard attempts to generate a valid DNS configuration by autodetecting the existing network and name service configuration. In some network environments this autodetection does not succeed, causing the wizard to fail without issuing an error message.

Workaround: When prompted, supply the SunPlex Manager DNS data service installation wizard with an existing, valid named.conf file. Otherwise, follow the documented DNS data service procedures to manually configure highly available DNS on the cluster.

Using SunPlex Manager to Install an Oracle Service (4843605)

Problem Summary: SunPlex Manager includes a data service installation wizard that sets up a highly available Oracle service on the cluster by installing and configuring the Oracle binaries and creating the cluster configuration. However, this installation wizard currently does not work, and it produces a variety of errors, depending on the user's software configuration.

Workaround: Manually install and configure the Oracle data service on the cluster, using the procedures provided in the Sun Cluster documentation.

Shutdown or Reboot Sequence Fails (4844784)

Problem Summary: When shutting down or rebooting a node, the node may hang, and the shutdown or reboot sequence may not complete. The system hangs after issuing the following message: Failfast: Halting because all userland daemons have died.

Workaround: Before shutting down or rebooting the node, issue the psradm -f -a command, as shown in the following procedures.

To shut down a node:

  1. # scswitch -S -h <node>

  2. # psradm -f -a

  3. # shutdown -g0 -y -i0

To reboot a node:

  1. # scswitch -S -h <node>

  2. # psradm -f -a

  3. # shutdown -g0 -y -i6


Note –

In some rare instances, the suggested workarounds may fail to resolve this problem.


Rebooting a Node (4862321)

Problem Summary: On large systems running Sun Cluster 3.x, shutdown -g0 -y -i6, the command to reboot a node, can cause the system to go to the OK prompt with the message Failfast: Halting because all userland daemons have died, instead of rebooting.

Workaround: Use one of the following workarounds:

Disable failfasts before shutting down the system. Remember to re-enable failfasts after the node has rebooted:

# /usr/cluster/lib/sc/cmm_ctl -f

Alternatively, increase the failfast_panic_delay timeout before shutting down the system, using the following mdb command:

(echo 'cl_comm`conf+8/W 0t600000'; echo 'cl_comm`conf+c/W 0t600000') | mdb -kw

This sets the timeout to 600000 ms (10 minutes).

Oracle DLM Process Remains Alive During Node Shutdown (4891227)

Problem Summary: The Oracle DLM process does not terminate during shutdown, which prevents /var from being unmounted.

Workaround: Use one of the following two workarounds:

Oracle Listener Probe May Timeout on a Heavily Loaded System (4900140)

Problem Summary: The Oracle listener probe may timeout on a heavily loaded system, causing the Oracle listener to restart.

Workaround: On a heavily loaded system, you can prevent Oracle listener resource probe timeouts by increasing the value of the Thorough_probe_interval property of the resource.

The probe timeout is calculated as follows:

10 seconds if Thorough_probe_interval is less than or equal to 20 seconds

60 seconds if Thorough_probe_interval is greater than 120 seconds

Thorough_probe_interval/2 in all other cases
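
For example, a sketch of raising the interval with scrgadm; the resource name oracle-listener-rs is a placeholder, and per the calculation above, values greater than 120 seconds yield the maximum 60-second probe timeout:

# scrgadm -c -j oracle-listener-rs -y Thorough_probe_interval=180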

RG_system Resource Group Property Update may Result in Node Panic (4902066)

Problem Summary: When set to TRUE, the RG_system resource group property indicates that the resource group and its resources are being used to support the cluster infrastructure, instead of implementing a user data service. If RG_system is TRUE, the RGM prevents the system administrator from inadvertently switching the group or its resources offline, or modifying their properties. In some instances, the node may panic when you try to modify a resource group property after setting the RG_system property to TRUE.

Workaround: Do not edit the value of the RG_system resource group property.
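
To check the current setting before attempting any property change (my-rg is a placeholder resource group name), something along these lines can be used:

# scrgadm -pvv -g my-rg | grep RG_system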

nsswitch.conf Requirements for passwd Make nis Unusable (4904975)

Problem Summary: On each node that can master the liveCache resource, the su command might hang when the public network is down.

Workaround: On each node that can master the liveCache resource, make the following change to /etc/nsswitch.conf so that the su command does not hang when the public network is down:

passwd: files nis [TRYAGAIN=0]

Data Service Installation Wizards for Oracle and Apache do not Support Solaris 9 and Above (4906470)

Problem Summary: The SunPlex Manager data service installation wizards for Apache and Oracle do not support Solaris 9 and above.

Workaround: Manually install Oracle on the cluster by using the procedures in the Sun Cluster documentation. If you are installing Apache on Solaris 9 (or higher), manually add the Solaris Apache packages SUNWapchr and SUNWapchu before running the installation wizard.
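
For example, a sketch that assumes the packages are added from the Solaris 9 installation media mounted at /cdrom/cdrom0:

# pkgadd -d /cdrom/cdrom0/Solaris_9/Product SUNWapchr SUNWapchu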

Installation Fails if Default Domain is not set (4913925)

Problem Summary: When adding nodes to a cluster during installation and configuration, you may see an "RPC authentication" failure. This failure occurs when the nodes are not configured to use NIS/NIS+, particularly if the /etc/defaultdomain file is not present.

Workaround: When a domain name is not set (that is, the /etc/defaultdomain file is missing), set the domain name on all nodes joining the cluster, using the domainname(1M) command before proceeding with the installation. For example, # domainname xxx.
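
A minimal sketch, using sales.example.com as a stand-in for the actual domain name and persisting the setting across reboots:

# domainname sales.example.com
# domainname > /etc/defaultdomain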