Sun HPC ClusterTools 7.1 Software Release Notes
This document describes late-breaking news about the Sun HPC ClusterTools
7.1 (ClusterTools 7.1) software. The information is organized into the following sections:
The major new features of the ClusterTools 7.1 software include:
Open MPI-based message passing. Sun HPC ClusterTools 7.1 is based on Open MPI, an open-source implementation of the MPI-1 and MPI-2 standards. Open MPI has been extended to support Sun Grid Engine, and is delivered through Sun packages.
This section lists some of the differences between Sun MPI and Open MPI. For more information about migrating MPI applications built using Sun HPC ClusterTools 6 software to Sun HPC ClusterTools 7.1, refer to the Sun HPC ClusterTools 7.1 Software Migration Guide.
For more information about the installation process, see the Sun HPC ClusterTools 7.1 Installation Guide.
Some components of the ClusterTools 6 product are not available in ClusterTools 7.1.
The following tools in Sun HPC ClusterTools 6 have no equivalents in Open MPI:
The Sun HPC ClusterTools 7.1 Software Migration Guide discusses some possible alternatives you can use in place of these tools.
These are some of the differences between Sun HPC ClusterTools 6/Sun MPI and Sun HPC ClusterTools 7.1/Open MPI:
Sun HPC ClusterTools 7.1 software works with the following versions of related software:
Solaris 10 3/05 OS, or any subsequent Solaris 10 OS release that supports Sun HPC ClusterTools 7.1 software.
Sun Studio 10, 11, and 12 C, C++, and Fortran compilers
To improve ClusterTools, Sun collects anonymous information about your cluster during installation. If you want to turn this feature off, use the -w option for ctinstall.
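A minimal invocation that disables the data collection might look like the sketch below. The path to ctinstall is an assumption; use the location of the utility on your installation media or in your unpacked distribution.

```shell
# Run the ClusterTools installer with cluster-information collection
# disabled. The directory below is hypothetical; substitute the
# actual location of ctinstall in your distribution.
cd /path/to/clustertools/Install_Utilities/bin
./ctinstall -w    # -w suppresses sending cluster information to Sun
```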
The communication between ctinstall and Sun takes place only if the Sun HPC ClusterTools software installation completes successfully; no information is sent if the installation fails for any reason.
This section highlights some of the outstanding CRs (Change Requests) for the ClusterTools 7.1 software components. A CR might be a defect, or it might be an RFE (Request For Enhancement).
Note - The heading of each CR description includes the CR’s Bugster number in parentheses.
The Open MPI library does not currently support thread-safe operations. If your applications contain thread-safe operations, they might fail.
The udapl BTL fails when run in clusters of heterogeneous nodes.
Workaround: Use a different interconnect between heterogeneous nodes.
Occasionally, programs run on heterogeneous clusters will return truncated messages or incorrect information. Some examples of these issues include:
1. MPI_LONG_DOUBLE datatypes do not work properly.
2. Some of the one-sided APIs do not work properly.
3. Some derived datatypes do not work properly.
If you run an MPI program using the udapl BTL in a local (non-global) zone in the Solaris OS, your program might fail and display the following error message:
Workarounds: Either run the udapl BTL in the Solaris global zone only, or use another interconnect (such as tcp) in the local zone.
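The second workaround can be expressed as an MCA parameter on the mpirun command line. The BTL selection syntax is standard Open MPI usage; the program name and process count are hypothetical.

```shell
# Restrict Open MPI to the tcp, shared-memory (sm), and self BTLs,
# avoiding udapl entirely in the local zone.
# ./a.out stands in for your MPI program.
mpirun -np 4 --mca btl tcp,sm,self ./a.out
```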
If your cluster contains multiple InfiniBand Host Channel Adapter (HCA) cards, you cannot select a particular HCA when running with the udapl BTL.
This condition occurs when the udapl BTL is not available on one node in a cluster. The InfiniBand adapter on that node might be unavailable or misconfigured, or the node might not have an InfiniBand adapter at all.
When you run an Open MPI program using the udapl BTL under such conditions, the program might hang or fail, but no error message is displayed. When a similar operation fails under the tcp BTL, the failure results in an error message.
Workaround: Add the following MCA parameter to your command line to exclude the udapl BTL:
For more information about MCA parameters and how to exclude functions at the command line, refer to the Sun HPC ClusterTools 7.1 Software User’s Guide.
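In Open MPI’s MCA framework, a leading caret excludes the named components rather than selecting them. A sketch of the command line, with a hypothetical program name:

```shell
# The caret (^) tells Open MPI to use every available BTL except udapl.
mpirun -np 4 --mca btl ^udapl ./a.out
```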
If an MPI job exhausts the resources of the CPUs, the program can fail or show segmentation faults. This might happen when nodes are oversubscribed.
Workaround: Avoid oversubscribing the nodes.
If you compile and link a C++ program with the -library=stlport4 flag, the resulting program produces a segmentation fault.
Workaround: Do not link to stlport4.
If one CPU in a cluster fails or goes offline while processor affinity is set, processes assigned to the failed CPU attempt to bind to that processor. When the attempt fails, processor affinity is turned off and the processes continue to run, even though they do not bind to the failed processor.
This issue produces the following error message:
If you start an MPI job using Sun Grid Engine and use the qsub -notify option to send a SIGUSR1 (impending SIGSTOP) or SIGUSR2 (impending SIGKILL) signal to the job, mpirun and orted receive the SIGUSR signal from Sun Grid Engine. They route the signal to the program and then exit before the actual SIGSTOP or SIGKILL is received. This might cause ORTE to hang instead of properly exiting.
When you set up non-global zones in the Solaris OS, the Solaris OS packages propagate from the global zone to the new zones.
However, if you installed Sun HPC ClusterTools software on the system before setting up the zones, SUNWompiat (the Open MPI installer package) is not propagated to the new non-global zone. As a result, the Install_Utilities directory is not available in non-global zones during new zone creation, and the links to /opt/SUNWhpc are not propagated to the local zone.
Workaround: There are two workarounds for this issue.
1. From the command line, use the full path to the Sun HPC ClusterTools executable you want to use. For example, you would type /opt/SUNWhpc/HPC7.1/bin/mpirun instead of /opt/SUNWhpc/bin/mpirun.
2. Reinstall Sun HPC ClusterTools 7.1 software in the non-global zone. This process allows you to activate Sun HPC ClusterTools 7.1 software (thus creating the links to the executables) on non-global zones.
Note - This issue affects both rsh and the Sun Grid Engine program qrsh, because qrsh uses rsh to launch jobs.
If you are using rsh or qrsh as the job launcher on a large cluster with hundreds of nodes, rsh might show the following error messages when launching jobs on the remote nodes:
This indicates that rsh is running out of sockets when launching the job from the head node.
1. If you are using rsh as your job launcher, use ssh instead. Add the following to your command line:
2. If you are using Sun Grid Engine as your job launcher, you can modify the Sun Grid Engine configuration to allow Sun Grid Engine to use ssh instead of rsh to launch tasks on the remote nodes. The following web site describes how to perform this workaround:
http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html
Note that this workaround does not properly track resource usage.
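For workaround 1, Open MPI 1.2 selects its remote-launch agent through the pls_rsh_agent MCA parameter. The sketch below assumes that parameter; the process count and program name are hypothetical.

```shell
# Tell the Open MPI 1.2 rsh launcher to invoke ssh instead of rsh,
# avoiding rsh's socket exhaustion on large clusters.
mpirun -np 256 --mca pls_rsh_agent ssh ./a.out
```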
When using a peer-to-peer connection with the udapl BTL (byte-transfer layer), the udapl BTL allocates a free list of fragments. This free list is used for send and receive operations between the peers. The free list does not have a specified maximum size, so a high amount of communication traffic at one peer might increase the size of the free list until it interferes with the ability of the other peers to communicate.
This issue might appear as a memory resource issue to an Open MPI application. This problem has only been observed on large jobs where the number of uDAPL connections exceeds the default value of btl_udapl_max_eager_rdma_peers.
Workaround: If an Open MPI application running over uDAPL/IB (InfiniBand) reports an out-of-memory error for alloc or for privileged memory, and those two values have already been increased, the following steps might allow the program to run successfully.
1. At the command line, add the following MCA parameter to your mpirun command:
where x is equal to the number of peer uDAPL connections that the Open MPI job will establish.
2. If the setting in Step 1 does not fix the problem, then also set the following MCA parameter with the mpirun command at the command line:
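Step 1 might look like the sketch below, assuming standard MCA parameter syntax; 64 is only a placeholder for x, and the Step 2 parameter is not reproduced in this document, so it is not shown.

```shell
# Size the eager RDMA peer limit to match the number of uDAPL
# connections the job will establish (64 here is a placeholder for x).
mpirun -np 64 --mca btl_udapl_max_eager_rdma_peers 64 ./a.out
```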
When using the Allinea DDT debugger message queue with Sun HPC ClusterTools 7.1 and Open MPI 1.2.4, DDT might cause a segmentation fault (SEGV) within the message queue DLL.
Workaround: None. This is a known issue in Open MPI 1.2.4 and will be fixed in a subsequent version of Open MPI.
Intermittent segmentation faults (SEGVs) can occur when using all-to-one communications (for example, when an MPI_Put operation sends the data from all ranks to rank 0, with rank 0 reporting on the origin of each part of the data). These faults occur in clusters with large numbers of processors.
The TotalView debugger might not be able to determine if an MPI_Comm_spawn operation has occurred, and might not be able to locate the new processes that the operation creates. This is because the current version of the Open MPI message dumping library (ompi/debuggers/ompi_dll.c) does not implement the functions and interfaces for the support of MPI 2 debugging and message dumping.
The Open MPI DLL for the TotalView debugger does not support handling of unexpected messages. Only pending send and receive queues are supported.
The TotalView debugger provides only a 32-bit debugger for Solaris SPARC and a 64-bit debugger for Solaris AMD64. As a result, the following message might appear when TotalView attempts to load the message queue library.
This is a known issue with the TotalView debugger.
Workaround: Before you start TotalView, set the LD_LIBRARY_PATH or LD_LIBRARY_PATH_64 variable to the correct value (shown below) to ensure the correct message queue shared library is loaded.
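For a Bourne-compatible shell, the setting might look like the sketch below. The library directory is hypothetical; substitute the directory in your installation that contains the message queue shared library matching your debugger’s word size.

```shell
# Hypothetical 64-bit library path for a ClusterTools 7.1 installation;
# use LD_LIBRARY_PATH instead for the 32-bit SPARC debugger.
LD_LIBRARY_PATH_64=/opt/SUNWhpc/HPC7.1/lib/sparcv9
export LD_LIBRARY_PATH_64
```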
This section discusses issues that pertain to the Solaris OS. Although Sun HPC ClusterTools 7.1 supports the Solaris 10 3/05 release, updating to a later release (such as Solaris 10 11/06) can help you avoid these issues altogether.
When using the tcp BTL on the Sun Fire X4500 server with a high level of network traffic, running a program might result in data corruption.
Workaround: This issue has been fixed in the Solaris 10 11/06 release. It is strongly suggested that you upgrade your Solaris OS to Solaris 10 11/06 or a compatible release.
A workaround does exist for the Solaris 10 3/05 release, but you might experience performance degradation when using this workaround.
Use a text editor to add the following line to your /etc/system file:
This disables the hardware checksum and allows the program to complete successfully.
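The exact /etc/system line is not reproduced in this document. One documented Solaris 10 tunable for disabling IP hardware checksum offload is shown below as an assumption; verify it against the release notes for your Solaris update before applying it, and note that /etc/system changes take effect only after a reboot.

```
* Assumption: disable IP hardware checksum offload.
set ip:dohwcksum = 0
```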