Sun HPC ClusterTools 6 Software Release Notes
This document describes late-breaking news about the Sun HPC ClusterTools 6 software. The information is organized into the following sections:
The major new features of the Sun HPC ClusterTools 6 software include:
Several components of the ClusterTools 5 product are not available in ClusterTools 6. This section contains a list of the components that are not available, as well as suggestions for replacing the functionality.
The Prism debugger has been removed from ClusterTools 6 software. The TotalView debugger from Etnus (http://www.etnus.com) supports the debugging of Sun MPI programs on SPARC-based platforms running the Solaris OS. For Sun MPI programs on x64- and SPARC-based platforms, dbx is supported.
The DDT debugger from Allinea (http://www.allinea.com) supports the debugging of Sun MPI programs on AMD x64-based platforms running the Solaris OS. (See Sun MPI Message Queues Not Accurate or Visible from Allinea DDT (CR 6396838) for more information on a known issue with DDT.)
The Sun Scalable Scientific Subroutine Library (Sun S3L) is not supported in ClusterTools 6 software. Many of the S3L functions can be replaced with functions provided in the public domain libraries PETSc and ScaLAPACK.
Sun ClusterTools 6 software is not released under the Sun Community Source License (SCSL).
The Cluster Console Manager (CCM) tools, cconsole, ctelnet, and crlogin, are no longer shipped with ClusterTools 6 software.
The CCM package, SUNWccon, is bundled with the Sun Java Enterprise System (JES) and can be downloaded with JES at the following URL:
http://www.sun.com/software/javaenterprisesystem/index.xml
RSM (Remote Shared Memory) functionality has been removed from ClusterTools 6 software.
The Sun HPC ClusterTools 6 software works with the following versions of related software:
Note - If you have the Solaris 10 3/05 OS release installed, download and install the recommended patches for your platform type, as shown in TABLE 1. These patches are available from SunSolve. Alternatively, you can upgrade to another release of the Solaris 10 OS, such as Solaris 10 1/06.
If you have the Solaris 10 3/05 release installed, you must install the following Solaris patches in order to run Sun HPC ClusterTools. The patch revisions shown in this table reflect the minimum version that supports Sun HPC ClusterTools.
113000-07
Sun HPC ClusterTools 5 software supported Sun Grid Engine (SGE) 5.2. Sun HPC ClusterTools 6 software supports Sun N1 Grid Engine 6 (N1GE6) as its Sun Grid Engine resource manager. This section outlines the differences between the two resource managers.
Sun N1 Grid Engine (N1GE6) introduces new features and attributes that are different from those in SGE 5.2. These differences fall into four major categories:
For more information about N1GE, refer to the Sun N1 Grid Engine 6 Administration Guide (817-5677).
N1GE6 introduces a new parallel environment attribute:
The queue_list attribute in SGE 5.2 has been removed. The values for queue_list now appear as the values for the slots attribute in PE.
The following table illustrates the differences between certain queue attributes in N1GE6 and in SGE 5.2.
Values from SGE 5.2 queue_list attribute (for example,
cre make
In addition, N1GE provides enhanced suspend and resume scripts. Use these scripts to prevent MPI processes from continuing to run when SGE has issued a suspension.
The suspend_method script has the following attributes:
<sge-root>/mpi/SunHPCT5/suspend_sunmpi_ci.sh $job_pid $job_id
The resume_method script has the following attributes:
<sge-root>/mpi/SunHPCT5/resume_sunmpi_ci.sh $job_pid $job_id
where <sge-root> is the path to the location where SGE is installed.
The command to modify a queue has changed between SGE 5.2 and N1GE6. For example, to modify a queue in N1GE6, you would issue a command similar to the following (substituting the name of the queue for queue-name):
Note that the PARALLEL value for qtype has been removed in N1GE6.
The equivalent command in SGE 5.2 would be as follows:
This section highlights some of the outstanding CRs (Change Requests) for the Sun HPC ClusterTools 6 software components. A CR may be a defect, or it may be an RFE (Request For Enhancement).
Note - The heading of each CR description includes the CR's Bugster number in parentheses.
This issue only affects the access of MPI message queues from the Allinea DDT debugger on the x64 platform. Message queue information will not be available. Other debugging functionality is not affected.
If a user who is not root attempts to run an HPC job under a Solaris 10 non-global zone, the job aborts and returns an error message similar to the following:
Job cre.6 on nodename: aborted due to an unexpected error.
This error occurs only on nodes with the Solaris 10 3/05 version installed. It has been fixed in subsequent versions of the Solaris 10 OS. To fix the problem, either download and install patch 119689-06 (or higher) on Solaris 10 3/05, or upgrade to a more recent version of the Solaris 10 OS (such as Solaris 10 1/06).
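The "or higher" condition on a patch revision can be checked mechanically. The following is a minimal sketch, assuming patch IDs of the form BASE-REV as used in these notes. The function name patch_at_least is hypothetical and is not part of Solaris or Sun HPC ClusterTools.

```shell
#!/bin/sh
# Hypothetical helper: succeed when an installed patch ID meets a minimum
# revision of the same patch. IDs have the form BASE-REV, e.g. 119689-06.
patch_at_least() {
    installed=$1
    minimum=$2
    # A different base number is a different patch and never qualifies.
    [ "${installed%-*}" = "${minimum%-*}" ] || return 1
    # Strip leading zeros so the revisions compare as plain decimal numbers.
    inst_rev=$(echo "${installed#*-}" | sed 's/^0*//')
    min_rev=$(echo "${minimum#*-}" | sed 's/^0*//')
    [ "${inst_rev:-0}" -ge "${min_rev:-0}" ]
}

# Example: compare a few sample IDs against the minimum named in this CR.
for id in 119689-05 119689-06 119689-07; do
    if patch_at_least "$id" 119689-06; then
        echo "$id: meets the minimum"
    else
        echo "$id: below the minimum"
    fi
done
```

On a Solaris node, the installed IDs could be taken from the output of showrev -p. Parsing that output is left out here because its exact layout is not described in these notes.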
TotalView, LSF, and Sun HPC ClusterTools cannot be used together when debugging parallel MPI programs. The debugging process in TotalView does not work correctly when LSF is used as the resource manager.
Workaround: Use a different resource manager (such as CRE) when debugging with TotalView.
pbsrun returns failure messages from the PBS task manager tm_init (which returns the message TM_BADINIT) during spawn jobs if the -np option of mprun is not specified.
Workaround: Specify values for -np and -nr in mprun when you want to use the MPI_Comm_spawn() API or the MPI_Comm_spawn_multiple() API with PBS. Make sure that the total of the two values you specify does not exceed the total number of processes allocated in the PBS environment.
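The limit in this workaround is simple arithmetic, so it can be checked before a job is submitted. The following is a minimal sketch of such a check. The function name spawn_budget_ok and the sample numbers are assumptions made for illustration, and the caller is assumed to already know how many processes PBS has allocated.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of ClusterTools or PBS.
# Succeeds when the -np value plus the -nr value does not exceed the
# number of processes allocated in the PBS environment.
spawn_budget_ok() {
    np=$1
    nr=$2
    allocated=$3
    [ $((np + nr)) -le "$allocated" ]
}

# Example values (assumed): 4 for -np, 4 for -nr, 8 processes allocated.
if spawn_budget_ok 4 4 8; then
    echo "np + nr fits the PBS allocation"
else
    echo "np + nr exceeds the PBS allocation" >&2
fi
```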
If a node crashes while an MPI program is running, CRE does not remove the job entry from its database, so mpps continues to show the job indefinitely, often in states such as coring or exiting.
Workaround: To delete these stale jobs from the database, su to root and issue this command:
# mpkill -C
The mpps and mpkill commands do not work properly when running under LSF.
Workaround: Instead of using mpps and mpkill, use the LSF equivalent commands bjobs and bkill.
When the combined physical memory requirements of all the processes in an MPI application exceed the amount of memory available on a node, the mprun command returns an error similar to the following:
Job cre.1 on nodename: received signal KILL.
Workaround: Run the application on a node with sufficient physical memory.
When you use SGE/Sun N1 Grid Engine as the resource manager, mprun can return an error message similar to the following:
mprun: tmrte_auth_verify_user: Key mismatch: Authentication error. Contact system administrator.
This condition occurs when SGE/N1GE6 has been improperly installed. mprun returns the error, but the error itself occurs within the resource manager.
Workaround: Reconfigure Sun N1 Grid Engine 6 with a new unused gid range.
If you have both the OpenPBS and PBS Professional resource managers installed and running on the same system, issuing the mprun -x pbs command causes mprun to always select PBS Professional.
Workaround: Shut down the PBS Professional daemons if you want to use OpenPBS. To shut down the PBS Professional daemons, type the following command:
This CR affects Sun HPC ClusterTools 6 software running on Solaris 10 6/06 OS, or a previous version of the Solaris 10 OS with kernel updates 118833-17 or -18 or 118855-15. The issue causes Sun HPC ClusterTools commands such as mpinfo to hang or to produce RPC-related error messages such as the following:
% mpinfo -N
mpinfo: tmrte_auth_create_context: host: RPC call tmrte_auth_conf_3 timed out after 30 secs. Try using mprun -t to increase timeout factor.
Workaround: Disable TCP fusion by typing the following command at the system prompt:
Next, reboot the system so this command can take effect.
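To decide whether a node falls under this CR, it can help to scan a list of installed patch IDs for the kernel updates named above. The following is a minimal sketch, assuming the IDs arrive one per line on standard input. The function names are hypothetical, and the check covers only the listed kernel updates, not the Solaris 10 6/06 release itself.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a patch ID is one of the kernel updates
# this CR names (118833-17, 118833-18, or 118855-15).
affected_kernel_patch() {
    case "$1" in
        118833-17|118833-18|118855-15) return 0 ;;
        *) return 1 ;;
    esac
}

# Read patch IDs from standard input, one per line, and print only the
# ones that match the affected kernel updates.
list_affected() {
    while read -r id; do
        if affected_kernel_patch "$id"; then
            echo "$id"
        fi
    done
}

# Example with sample IDs (assumed, not taken from a real system).
printf '%s\n' 118833-16 118833-17 119689-06 | list_affected
```

If the function prints any ID, the workaround above applies to that node.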
Copyright © 2006, Sun Microsystems, Inc. All Rights Reserved.