Oracle Solaris Cluster 3.3 5/11 Release Notes
The following known issues and bugs affect the operation of the Oracle Solaris Cluster 3.3 5/11 release. Bugs and issues are grouped into the following categories:
Problem Summary: If a failover data service, such as HA for Oracle, is configured with the ScalMountpoint resource to probe and detect NAS storage access failure, and the network interface is lost, such as due to a loss of cable connection, the monitor probe hangs. If the Failover_mode property is set to SOFT, this results in a stop-failed status and the resource does not fail over. The associated error message is similar to the following:
SC[SUNW.ScalMountPoint:3,scalmnt-rg,scal-oradata-11g-rs,/usr/cluster/lib/rgm/rt/scal_mountpoint/scal_mountpoint_probe]: Probing thread for mountpoint /oradata/11g is hanging for timeout period 300 seconds
Workaround: Change the Failover_mode property on the resource to HARD:
# clresource set -p Failover_mode=HARD ora-server-rs
# clresource show -v ora-server-rs | grep Failover_mode
Failover_mode:                                   HARD
Problem Summary: The current implementation requires an RTR file, rather than a symbolic link to the file, to be present in /usr/cluster/lib/rgm/rtreg.
Workaround: Perform the following commands as superuser on one node of the global cluster.
# cp /opt/SUNWscor/oracle_asm/etc/SUNW.scalable_acfs_proxy /usr/cluster/lib/rgm/rtreg/
# clrt register -Z zoneclustername SUNW.scalable_acfs_proxy
# rm /usr/cluster/lib/rgm/rtreg/SUNW.scalable_acfs_proxy
Problem Summary: During a reboot, Oracle's SPARC T3-4 server with four processors fails to connect to the Oracle Solaris Cluster framework. Error messages similar to the following appear:
Sep 20 15:18:53 svc.startd : svc:/system/pools:default: Method or service exit timed out. Killing contract 29.
Sep 20 15:18:53 svc.startd : svc:/system/pools:default: Method "/lib/svc/method/svc-pools start" failed due to signal KILL.
…
Sep 20 15:20:55 solta svc.startd : system/pools:default failed: transitioned to maintenance (see 'svcs -xv' for details)
…
Sep 20 15:22:12 solta INITGCHB: Given up waiting for rgmd.
…
Sep 20 15:23:12 solta Cluster.GCHB_resd: GCHB system error: scha_cluster_open failed with 18
Sep 20 15:23:12 solta : No such process
Workaround: Use the svccfg command to increase the service timeout to 300 seconds. Boot into noncluster mode and perform the following commands:
# svccfg -s svc:/system/pools setprop start/timeout_seconds = 300
# svcadm refresh svc:/system/pools
After you perform these commands, boot into cluster mode.
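Before you reboot into cluster mode, you can confirm that the new timeout took effect by querying the property. This is a sketch; the exact output formatting can differ between Oracle Solaris releases:

```shell
# Verify that the start timeout for the pools service is now 300 seconds.
svcprop -p start/timeout_seconds svc:/system/pools
```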
Problem Summary: When you remove a global-cluster node that is the last node in the global cluster that hosts a zone cluster, the zone cluster is not removed from the cluster configuration.
Workaround: Before you run the clnode remove -F command to delete the global-cluster node, use the clzonecluster command to delete the zone cluster.
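The required ordering can be sketched as follows; the zone-cluster name zc1 and the node name phys-schost-3 are hypothetical placeholders for your configuration:

```shell
# Delete the zone cluster first ...
clzonecluster halt zc1
clzonecluster uninstall zc1
clzonecluster delete zc1

# ... then remove the global-cluster node.
clnode remove -F phys-schost-3
```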
Problem Summary: When a new storage device is added to a cluster and is configured with three or more DID paths, the node on which the cldevice populate command is run might fail to register its PGR key on the device.
Workaround: Run the cldevice populate command on all cluster nodes, or run the cldevice populate command twice from the same node.
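For example, the second approach is simply a repeated invocation from a single node; the first run performs the discovery and the second run ensures the local node's PGR key is registered on the new device:

```shell
# Run twice from the same node so that the node's PGR key
# is registered on the newly added device.
cldevice populate
cldevice populate
```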
Problem Summary: Oracle Solaris Cluster attempts to verify that a storage device fully supports SCSI-3 PGR before allowing the user to set its fencing property to prefer3. This verification might succeed when it should fail.
Workaround: Ensure that a storage device is certified by Oracle Solaris Cluster for use with SCSI-3 PGR before changing the fencing setting to prefer3.
Problem Summary: During cluster configuration on LDoms with hybrid I/O, autodiscovery does not report any paths for the cluster interconnect.
Workaround: When you run the interactive scinstall utility, choose to configure the sponsor node and additional nodes in separate operations, rather than by configuring all nodes in a single operation. When the utility prompts "Do you want to use autodiscovery?", answer "no". You can then select transport adapters from the list that is provided by the scinstall utility.
Problem Summary: If a Hitachi TrueCopy device group's replica pair is in the COPY state, or an EMC SRDF device group's replica pair is split, an attempt to switch the device group over to another node fails. Furthermore, the device group is unable to come back online on the original node until the replica pair has been returned to a paired state.
Workaround: Verify that TrueCopy replicas are not in the COPY state, or that SRDF replicas are not split, before you attempt to switch the associated Oracle Solaris Cluster global-device group to another cluster node.
Problem Summary: You cannot use the clsetup utility to configure a resource to have the load balancing policy LB_STICKY_WILD. The policy is set to LB_WILD instead.
Workaround: After you configure the resource, use the clresource create command to change the load balancing policy to LB_STICKY_WILD.
Problem Summary: Changing a cluster configuration from a three-node cluster to a two-node cluster might result in complete loss of the cluster, if one of the remaining nodes leaves the cluster or is removed from the cluster configuration.
Workaround: Immediately after removing a node from a three-node cluster configuration, run the cldevice clear command on one of the remaining cluster nodes.
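A sketch of the final step, with a hypothetical node name; follow the documented node-removal procedure for the removal itself:

```shell
# After phys-schost-3 has been removed from the configuration,
# immediately run the following on one of the two remaining nodes:
cldevice clear
```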
Problem Summary: If the Solaris Security Toolkit is configured on cluster nodes, the command scstat -i gives an RPC bind failure error. The error message is similar to the following:
scrconf: RPC: Rpcbind failure - RPC: Authentication error
Other Sun Cluster commands that use RPC, such as clsnmpuser, might also fail.
Workaround: Add the cluster private hostnames or the IP addresses associated with the cluster private hostnames to the /etc/hosts.allow file.
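For example, an entry of the following form in /etc/hosts.allow grants access to the private hostnames. The names clusternode1-priv and clusternode2-priv are the default private hostnames on a two-node cluster and might differ on your configuration:

```
# /etc/hosts.allow
ALL: clusternode1-priv clusternode2-priv
```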
Problem Summary: The scdidadm and cldevice commands are unable to verify that replicated SRDF devices that are being combined into a single DID device are, in fact, replicas of each other and belong to the specified replication group.
Workaround: Take care when combining DID devices for use with SRDF. Ensure that the specified DID device instances are replicas of each other and that they belong to the specified replication group.
Problem Summary: For a 16-node cluster, the Oracle Solaris Cluster Manager GUI is not usable.
Workaround: Instead, use the clsetup utility or the Oracle Solaris Cluster maintenance commands.
Problem Summary: If resource groups are created, edited, or deleted immediately after a zone cluster is rebooted, the Resource Group Manager (RGM) gets into an inconsistent state in which further operations on the resource group might fail. In the worst case, the failure might cause nodes of the global cluster to panic and reboot.
This problem can occur after all nodes of the zone cluster are rebooted at once. The problem does not occur if only some of the nodes are rebooted while others remain up. It can also occur when the entire physical cluster is rebooted, if resource-group updates are executed immediately after the zone cluster comes up.
The following commands can cause such errors:
Workaround: To avoid this problem, wait for a minute or so after you reboot a zone cluster, to allow the zone cluster to achieve a stable state, before you execute any of the above commands.
If all nodes of the physical cluster are rebooted, allow an extra minute after you see console messages indicating that all of the zone cluster nodes have joined the cluster, before you execute any of the above commands. The console messages look similar to the following:
May 5 17:30:49 phys-schost-4 cl_runtime: NOTICE: Membership : Node 'zc-host-2' (node id 2) of cluster 'schost' joined.
If only some nodes are rebooted while others remain up, the additional delay is not needed.
Problem Summary: After installation and creation of the resource group and resources for Oracle Solaris Cluster HA for Apache Tomcat, the service cannot start if HA for Apache Tomcat is configured on top of a failover zone.
Workaround: Contact your Oracle support representative to obtain the missing script.
Problem Summary: If you kill the dispatcher of a dialog instance that is running with SAP kernel 7.11, the SAP Web Application Server agent is unable to restart the dialog instance on the same node. After two retries, it fails over, and the start succeeds on the other node. The root cause is that, with SAP kernel 7.11, the cleanipc command requires the LD_LIBRARY_PATH environment variable to be set before cleanipc is executed.
Workaround: Insert the setting of LD_LIBRARY_PATH and the execution of cleanipc into the start script for the webas resource. For example, assuming the SAP SID is FIT and the instance is 03, the code to insert into the start script that is registered for your webas resource in the Webas_Startup_Script property would be the following:
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/sap/FIT/SYS/exe/run
export LD_LIBRARY_PATH
/usr/sap/FIT/SYS/exe/run/cleanipc 03 remove
Problem Summary: When the /etc/vfstab file entry for a cluster file system has a mount-at-boot value of no and the cluster file system is configured in a SUNW.HAStoragePlus resource that belongs to a scalable resource group, the SUNW.HAStoragePlus resource fails to come online. The resource stays in the Starting state until the prenet_start method times out.
Workaround: In the /etc/vfstab file's entry for the cluster file system, set the mount-at-boot value to yes.
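For example, a cluster file system entry with the mount-at-boot field set to yes might look like the following; the device paths and mount point here are placeholders for your configuration:

```
/dev/md/oraset/dsk/d100  /dev/md/oraset/rdsk/d100  /global/oradata  ufs  2  yes  global,logging
```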
Problem Summary: In Siebel 8.1.1, the gateway server has a dependency on the database. If the machine that hosts the database listener is not reachable, the gateway probe causes the resource group to ping-pong until the ping-pong interval is reached.
Workaround: Co-locating the database listener with the gateway mitigates this issue. Alternatively, if the database runs outside of cluster control, ensure that the machine that hosts the database listener is up and running.
Problem Summary: If scalable applications configured to run in different zone clusters bind to INADDR_ANY and use the same port, then scalable services cannot distinguish between the instances of these applications that run in different zone clusters.
Workaround: Do not configure the scalable applications to bind to INADDR_ANY as the local IP address, or bind them to a port that does not conflict with another scalable application.
Problem Summary: When adding or removing a NAS device, running the clnas add or clnas remove command on multiple nodes at the same time might corrupt the NAS configuration file.
Workaround: Run the clnas add or clnas remove command on one node at a time.
Problem Summary: When a native brand non-global zone is added to the node list of a resource group that contains an HAStoragePlus resource with ZFS pools configured, the HAStoragePlus resource might enter the Faulted state. This problem happens only when the physical node that hosts the native zone is part of the resource-group node list.
Workaround: Restart the resource group that contains the faulted HAStoragePlus resource.
# clresourcegroup restart faulted-resourcegroup
Problem Summary: The Generic Data Service (GDS) Stop script cannot force a Stop method failure. If the Stop script exits nonzero, the GDS Stop method tries to kill the resource daemon. If the kill succeeds, the Stop method exits successfully, even though the Stop script failed. As a result, the Stop script cannot programmatically force a Stop method failure.
Workaround: Have the GDS Stop script execute the clresourcegroup quiesce -k rgname command, where rgname is the name of the resource group that contains the GDS resource. The -k option causes the rgmd daemon to kill the GDS Stop method that is currently executing. This moves the GDS resource into the STOP_FAILED state, and the resource group moves to the ERROR_STOP_FAILED state.
The following are limitations of this workaround:
The clresourcegroup quiesce command prevents the node from being rebooted, even if the Failover_mode property of the resource is set to HARD. If the reboot behavior is required, the GDS Stop script can query the Failover_mode property and, if the property is set to HARD, directly reboot the node or non-global zone in which it is executing.
This workaround is best suited for a failover resource group, which stops on only one node at a time. In the case of a multi-mastered resource group, the GDS resource might be stopping on multiple nodes at the same time. In that case, executing the clresourcegroup quiesce -k command kills all of the executing Stop methods on those nodes, not just the one that is executing on the local node.
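Putting both points together, the tail end of a GDS Stop script might look like the following sketch. The resource name my-rs and resource-group name my-rg are placeholders; scha_resource_get is the standard interface for querying resource properties:

```shell
#!/bin/ksh
# Sketch: force a Stop-method failure from a GDS Stop script while
# preserving the reboot behavior of Failover_mode=HARD.

# Query the resource's Failover_mode property.
FOM=$(scha_resource_get -O FAILOVER_MODE -R my-rs -G my-rg)

if [ "$FOM" = "HARD" ]; then
    # quiesce would suppress the reboot, so reboot directly instead.
    reboot
else
    # Kill the currently executing Stop method; the resource moves to
    # STOP_FAILED and the resource group to ERROR_STOP_FAILED.
    clresourcegroup quiesce -k my-rg
fi
```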
Problem Summary: The Oracle Enterprise Manager Ops Center Agent for Oracle Solaris 10 uses JavaDB software for its configuration database. When installing the Oracle Solaris Cluster software by using the installer utility, the JavaDB software package is re-installed, causing an existing Agent configuration database to be deleted.
The following error messages are reported from the Ops Center Agent as a result of the package getting removed:
java.sql.SQLException: Database '/var/opt/sun/xvm/agentdb' not found.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.newEmbedSQLException(Unknown Source)
The Agent is now broken and needs to be unconfigured and then configured again.
Workaround: Manually install on all cluster nodes the following additional JavaDB packages from the Oracle Solaris Cluster media:
Running the installer utility does not remove the existing JavaDB database packages.
Problem Summary: When you use the installer utility in the Simplified Chinese and Traditional Chinese locales to install Oracle Solaris Cluster software, the software that checks the system requirements incorrectly reports that the swap space is 0 Mbytes.
Workaround: Ignore this reported information. In these locales, you can run the following command to determine the correct swap space:
# df -h | grep swap
Problem Summary: When a multi-owner Solaris Volume Manager disk set is configured on the vucmm framework, the cldevicegroup status command always shows the disk set as offline, regardless of the real status of the disk set.
Workaround: Check the status of the multi-owner disk set by using the metastat -s diskset command.
Problem Summary: A scalable resource that depends on a SUNW.SharedAddress resource fails to come online, due to failure of an IPMP group that is on a subnet that is not used by the shared-address resource. Messages similar to the following are seen in the syslog of the cluster nodes:
Mar 22 12:37:51 schost1 SC SUNW.gds:5,Traffic_voip373,Scal_service_voip373,SSM_START: ID 639855 daemon.error IPMP group sc_ipmp1 has status DOWN. Assuming this node cannot respond to client requests.
Workaround: Repair the failed IPMP group and restart the failed scalable resource.
Problem Summary: The problem occurs when the resource type SUNW.LogicalHostname is registered at version 2 (use the clresourcetype list command to display the version). After upgrade, logical-hostname resources can be created for non-global zones with ip-type=exclusive, but network access to the logical hostname, for example by using telnet or rsh, does not work.
Workaround: Perform the following steps:
Delete all resource groups with a node list that contains a non-global zone with ip-type=exclusive that hosts logical-hostname resources.
Upgrade the SUNW.LogicalHostname resource type to at least version 3:
# clresourcetype register SUNW.LogicalHostname:3
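After the resource type is upgraded, the deleted resource groups and their logical-hostname resources can be re-created. A sketch, with hypothetical node, zone, hostname, and resource names:

```shell
# Re-create the resource group with the exclusive-IP zone in its node list.
clresourcegroup create -n phys-schost-1:zoneA,phys-schost-2:zoneA lh-rg

# Re-create the logical-hostname resource at the upgraded type version.
clreslogicalhostname create -g lh-rg -h lh-host lh-rs
```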