Sun HPC ClusterTools 7.1 Software Release Notes

This document describes late-breaking news about the Sun HPC ClusterTools™ 7.1 (ClusterTools 7.1) software. The information is organized into the following sections:


Major New Features

The major new features of the ClusterTools 7.1 software include:


Product Migration

This section lists some of the differences between Sun MPI and Open MPI. For more information about migrating MPI applications built using Sun HPC ClusterTools 6 software to Sun HPC ClusterTools 7.1, refer to the Sun HPC ClusterTools 7.1 Software Migration Guide.

For more information about the installation process, see the Sun HPC ClusterTools 7.1 Installation Guide.

Some components of the ClusterTools 6 product are not available in ClusterTools 7.1.

The following tools in Sun HPC ClusterTools 6 have no equivalents in Open MPI:

The Sun HPC ClusterTools 7.1 Software Migration Guide discusses some possible alternatives you can use in place of these tools.

These are some of the differences between Sun HPC ClusterTools 6/Sun MPI and Sun HPC ClusterTools 7.1/Open MPI:


Related Software

Sun HPC ClusterTools 7.1 software works with the following versions of related software:



Note - If you plan to use the uDAPL BTL (Byte Transfer Layer) with your applications, you must install Solaris 10 11/06 OS. Solaris 10 11/06 OS is the first version that supports this functionality. In addition, you must install patch 125792-01 (for SPARC-based systems) or patch 125793-01 (for AMD-based systems), plus any patches that those patches require.



Disabling Installation Notification

To improve ClusterTools, Sun collects anonymous information about your cluster during installation. If you want to turn this feature off, use the -w option for ctinstall.

The communication between ctinstall and Sun works only if the Sun HPC ClusterTools software installation process completes successfully. It does not work if the installation fails for any reason.
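
For example, a minimal invocation that disables the notification might look like the following. This is only a sketch: combine the -w flag with whatever other ctinstall options your installation normally requires, and run the command from the directory that contains ctinstall.

./ctinstall -w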


Outstanding CRs

This section highlights some of the outstanding CRs (Change Requests) for the ClusterTools 7.1 software components. A CR might be a defect, or it might be an RFE (Request For Enhancement).



Note - The heading of each CR description includes the CR’s Bugster number in parentheses.


The MPI Library is Not Thread-Safe (CR 6474910)

The Open MPI library does not currently support thread-safe operation. Applications that make MPI calls from multiple threads concurrently might fail.

Workaround: None.

udapl BTL Fails in Heterogeneous Cluster (CR 6512878)

The udapl BTL fails when run in clusters of heterogeneous nodes.

Workaround: Use a different interconnect between heterogeneous nodes.

Problems With Heterogeneous Support (CR 6538714)

Occasionally, programs run on heterogeneous clusters return truncated messages or incorrect information. Examples of these issues include:

1. MPI_LONG_DOUBLE datatypes do not work properly.

2. Some of the one-sided APIs do not work properly.

3. Some derived datatypes do not work properly.

Workaround: None.

Using udapl BTL on Local Zones Fails for MPI Programs (CR 6480399)

If you run an MPI program using the udapl BTL in a local (non-global) zone in the Solaris OS, your program might fail and display the following error message:


Process 0.1.3 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of
usable components.
 
PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
----------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)

Workarounds: Either run the udapl BTL in the Solaris global zone only, or use another interconnect (such as tcp) in the local zone.
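
As a hedged illustration of the second workaround, the following command restricts the job to the tcp, sm, and self BTLs. The process count (-np 4) and program name (a.out) are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpirun --mca btl self,sm,tcp -np 4 a.out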

udapl BTL in Open MPI: If Multiple HCAs Exist, User Should Be Able to Select One or More (CR 6532415)

If your cluster contains multiple Infiniband Host Channel Adapter (HCA) cards, you cannot select a particular HCA when running with the udapl BTL.

Workaround: None.

udapl BTL in Open MPI should Detect That a udapl Connection is Not Accessible and Not Just Hang (CR 6497612)

This condition happens when the udapl BTL is not available on one node in a cluster. The Infiniband adapter on the node could be unavailable or misconfigured, or there might not be an Infiniband adapter on the node.

When you run an Open MPI program using the udapl BTL under such conditions, the program might hang or fail, but no error message is displayed. When a similar operation fails under the tcp BTL, the failure results in an error message.

Workaround: Add the following MCA parameter to your command line to exclude the udapl BTL:


--mca btl ^udapl 
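
For example, a complete command line might look like the following. The process count (-np 4) and program name (a.out) are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpirun --mca btl ^udapl -np 4 a.out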

For more information about MCA parameters and how to exclude functions at the command line, refer to the Sun HPC ClusterTools 7.1 Software User’s Guide.

MPI Is Not Handling Resource Exhaustion Gracefully (CR 6499679)

If an MPI job exhausts CPU resources, the program can fail or produce segmentation faults. This might happen when nodes are oversubscribed.

Workaround: Avoid oversubscribing the nodes.
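
One way to avoid oversubscription is to list each node in a hostfile with an explicit slots count and keep the total process count (-np) within the total number of slots. In the following sketch, the node names, slot counts, hostfile name (myhosts), and program name (a.out) are placeholders; the first two lines show the contents of myhosts:

node1 slots=4
node2 slots=4

% /opt/SUNWhpc/HPC7.1/bin/mpirun -np 8 --hostfile myhosts a.out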

libmpi_cxx.so Is Incompatible With Applications Linked With stlport (CR 6532412)

If you compile and link a C++ program using the -library=stlport4 flag, the resulting program produces a segmentation fault.

Workaround: Do not link to stlport4.
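
For example, a build that uses the default C++ library instead of stlport4 might look like the following. This sketch assumes the C++ wrapper compiler is installed as /opt/SUNWhpc/HPC7.1/bin/mpiCC; the source and output file names are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpiCC -o myprog myprog.cc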

OMPI Supposed to Bind Processes Even If CPU Is Down With --mca mpi_paffinity_alone Set to 1 (CR 6478417)

If one CPU in a cluster fails or goes offline while processor affinity is set, processes assigned to the failed CPU attempt to bind to that processor. When the attempt fails, processor affinity is turned off and the processes continue to run, even though they do not bind to the failed processor.

This issue produces the following error message:


% /opt/SUNWhpc/HPC7.1/bin/mpirun --mca mpi_paffinity_alone 1 -mca btl self,sm -np 4 a.out
 
output:
Num_procs = 3;
Num_procs = 3;
Num_procs = 3;
Num_procs = 3;
----------------------------------------------------------------
The MCA parameter "mpi_paffinity_alone" was set to a nonzero value,
but Open MPI was unable to bind MPI_COMM_WORLD rank 0 to a processor.
 
Typical causes for this problem include:
 
   - A node was oversubscribed (more processes than processors), in
     which case Open MPI will not bind any processes on that node
   - A startup mechanism was used which did not tell Open MPI which
     processors to bind processes to

Sun Grid Engine qsub -notify (or SIGUSR1/2) Is Not Supported (CR 6535841)

If you start an MPI job using Sun Grid Engine and use the qsub -notify option to send a SIGUSR1 (impending SIGSTOP) or SIGUSR2 (impending SIGKILL) signal to the job, mpirun and orted receive the SIGUSR signal from Sun Grid Engine. They route the signal to the program and then exit before the actual SIGSTOP or SIGKILL is received. This might cause ORTE to hang instead of properly exiting.

Request Script Prevents SUNWompiat From Propagating to Non-global Zone During Zone Creation (CR 6539860)

When you set up non-global zones in the Solaris OS, the Solaris OS packages propagate from the global zone to the new zones.

However, if you installed Sun HPC ClusterTools software on the system before setting up the zones, SUNWompiat (the Open MPI installer package) is not propagated to the new non-global zone. As a result, the Install_Utilities directory is not available in non-global zones during new zone creation, and the links to /opt/SUNWhpc are not propagated to the local zone.

Workaround: There are two workarounds for this issue.

1. From the command line, use the full path to the Sun HPC ClusterTools executable you want to use. For example, you would type /opt/SUNWhpc/HPC7.1/bin/mpirun instead of /opt/SUNWhpc/bin/mpirun (see the example after this list).

2. Reinstall Sun HPC ClusterTools 7.1 software in the non-global zone. This process allows you to activate Sun HPC ClusterTools 7.1 software (thus creating the links to the executables) on non-global zones.
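
As a hedged illustration of the first workaround, a full command line might look like the following. The process count (-np 4) and program name (a.out) are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpirun -np 4 a.out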

rsh Could Run Out of Sockets When Launching Jobs on a Large Cluster (CR 6541735)



Note - This issue affects both rsh and the Sun Grid Engine program qrsh. qrsh uses rsh to launch jobs.


If you are using rsh or qrsh as the job launcher on a large cluster with hundreds of nodes, rsh might show the following error messages when launching jobs on the remote nodes:


rcmd: socket: Cannot assign requested address
rcmd: socket: Cannot assign requested address
rcmd: socket: Cannot assign requested address
[node0:00749] ERROR: A daemon on node m2187 failed to start as expected.
[node0:00749] ERROR: There may be more information available from
[node0:00749] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[node0:00749] ERROR: If the problem persists, please restart the
[node0:00749] ERROR: Grid Engine PE job

This indicates that rsh is running out of sockets when launching the job from the head node.

Workarounds:

1. If you are using rsh as your job launcher, use ssh instead. Add the following to your command line (see the example after this list):


-mca pls_rsh_agent ssh

2. If you are using Sun Grid Engine as your job launcher, you can modify the Sun Grid Engine configuration to allow Sun Grid Engine to use ssh instead of rsh to launch tasks on the remote nodes. The following web site describes how to perform this workaround:

http://gridengine.sunsource.net/howto/qrsh_qlogin_ssh.html

Note that this workaround does not properly track resource usage.
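
As a hedged illustration of the first workaround, a complete command line might look like the following. The process count (-np 256), hostfile name (myhosts), and program name (a.out) are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpirun -np 256 -mca pls_rsh_agent ssh --hostfile myhosts a.out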

udapl BTL Use of Fragment Free Lists Can Potentially Starve a Peer Connection and Prevent Progress (CR 6542966)

When using a peer-to-peer connection with the udapl BTL (byte-transfer layer), the udapl BTL allocates a free list of fragments. This free list is used for send and receive operations between the peers. The free list does not have a specified maximum size, so a high amount of communication traffic at one peer might increase the size of the free list until it interferes with the ability of the other peers to communicate.

This issue might appear as a memory resource issue to an Open MPI application. This problem has only been observed on large jobs where the number of uDAPL connections exceeds the default value of btl_udapl_max_eager_rdma_peers.

Workaround: If an Open MPI application running over uDAPL/IB (Infiniband) reports an out-of-memory error for alloc or for privileged memory, and those two values have already been increased, the following steps might allow the program to run successfully (see the combined example after these steps).

1. At the command line, add the following MCA parameter to your mpirun command:


--mca btl_udapl_max_eager_rdma_peers x

where x is equal to the number of peer uDAPL connections that the Open MPI job will establish.

2. If the setting in Step 1 does not fix the problem, then also set the following MCA parameter with the mpirun command at the command line:


--mca mpi_preconnect_all 1
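
As a hedged illustration that combines both steps, the command line might look like the following, where 32 stands in for x (the number of peer uDAPL connections the job will establish) and the process count (-np 32) and program name (a.out) are placeholders:

% /opt/SUNWhpc/HPC7.1/bin/mpirun --mca btl_udapl_max_eager_rdma_peers 32 --mca mpi_preconnect_all 1 -np 32 a.out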

Showing Message Queues Would cause SEGV on DDT (CR 6615467)

When using the message queue feature of the Allinea DDT debugger with Sun HPC ClusterTools 7.1 and Open MPI 1.2.4, DDT might encounter a segmentation fault (SEGV) within the message queue DLL.

Workaround: None. This is a known issue in Open MPI 1.2.4; it will be fixed in a subsequent version of Open MPI.

All-to-One Communication Patterns Can Cause SEGV on Larger Clusters (CR 6617724)

Intermittent segmentation faults (SEGVs) can occur when using all-to-one communications (for example, when an MPI_Put operation sends the data from all ranks to rank 0, with rank 0 reporting on the origin of each part of the data). These faults occur in clusters with large numbers of processors.

Workaround: None.

TotalView: MPI-2 Support Is Not Implemented (CR 6597772)

The TotalView debugger might not be able to determine if an MPI_Comm_spawn operation has occurred, and might not be able to locate the new processes that the operation creates. This is because the current version of the Open MPI message dumping library (ompi/debuggers/ompi_dll.c) does not implement the functions and interfaces for the support of MPI-2 debugging and message dumping.

Workaround: None.

TotalView: Message Queue for Unexpected Messages is Not Implemented (CR 6597750)

The Open MPI DLL for the TotalView debugger does not support handling of unexpected messages. Only pending send and receive queues are supported.

Workaround: None.

Behavior of TotalView Message Queue Support with 32/64 SPARC/AMD User Executables (CR 6623686)

The TotalView debugger only provides a 32-bit debugger for Solaris SPARC and a 64-bit debugger for Solaris AMD. As a result, the following message might appear when TotalView attempts to load the message queue library.


The image claims that TotalView should use the dynamic library '/opt/SUNWhpc/HPC7.1/lib/sparcv9/openmpi/libompitv.so'
for MPI debugging, but that file is not accessible.
This is probably an MPI installation problem.
TotalView will try to use the default debug library 'libtvmpich.so'

This is a known issue with the TotalView debugger.

Workaround: Before you start TotalView, set the LD_LIBRARY_PATH or LD_LIBRARY_PATH_64 variable to the correct value (shown below) to ensure the correct message queue shared library is loaded.
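
For example, for 64-bit SPARC executables the library directory can be taken from the path in the message above; the corresponding 32-bit location shown here is an assumption based on the usual ClusterTools installation layout. In csh syntax:

% setenv LD_LIBRARY_PATH_64 /opt/SUNWhpc/HPC7.1/lib/sparcv9/openmpi
% setenv LD_LIBRARY_PATH /opt/SUNWhpc/HPC7.1/lib/openmpi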



Note - Using the msq_lib parameter in the TotalView Parallel Configuration file to specify the location of the message queue library has no effect on how TotalView finds the library. Use the workaround described in this section until TotalView resolves this issue.



Solaris OS Issues

This section discusses issues that pertain to the Solaris OS. Although Sun HPC ClusterTools 7.1 supports the Solaris 10 3/05 release, updating to a later release (such as Solaris 10 11/06) can help you avoid these issues altogether.

HCTS Network Test Failed Due to E1000G Issue (CR 6462893)

When using the tcp BTL on the Sun Fire X4500 server with a high level of network traffic, running a program might result in data corruption.

Workaround: This issue has been fixed in the Solaris 10 11/06 release. It is strongly suggested that you upgrade your Solaris OS to Solaris 10 11/06 or a compatible release.

A workaround does exist for the Solaris 10 3/05 release, but you might experience performance degradation when using this workaround.

Use a text editor to add the following line to your /etc/system file:


set ip:dohwcksum=0

This disables the hardware checksum and allows the program to complete successfully. Note that changes to /etc/system take effect at the next system reboot.