Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Error Conditions and Troubleshooting Tips

The following sections include sample error messages and their interpretations, as well as guidelines for anticipating common problems.

Error Messages

The following error message usually indicates that all the nodes in a CRE partition are marked down, that is, their node daemons are not running.

No nodes in partition satisfy RRS:

This could happen, for example, if CRE was unable to check out its licenses. This message can also indicate an error in the construction of an RRS.

Under certain circumstances, when a user attempts to kill a job, the CRE may log error messages of the following form on the master node:

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If these messages correlate with jobs being killed, they can be safely ignored. One way to check the correlation is to look in the accounting logs for jobs that were signaled at the time the messages were logged.
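As a starting point for that check, the following Bourne shell sketch pulls the relevant lines and their timestamps out of a syslog file. The function name cre_cond_set_errors is hypothetical, the pattern is derived only from the sample message shown above, and the default path assumes CRE messages are going to /var/adm/messages on the master node.

```shell
# Hypothetical helper: print tm.rdb "Cond_set: unable to connect" messages.
# $1 = log file to search (assumed default: /var/adm/messages).
cre_cond_set_errors() {
    logfile=${1:-/var/adm/messages}
    # Match lines such as:
    #   Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ...
    grep 'tm\.rdb\[[0-9]*]: Cond_set: unable to connect' "$logfile"
}
```

Running cre_cond_set_errors with no argument on the master node lists the candidate messages. Their timestamps can then be compared by hand with the accounting log entries for signaled jobs.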

The following error message indicates that no partitions have been set up.

mprun: unique partition: No such object

When there is stale job information in the CRE database, an error message of the following form may occur:

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.

This might happen, for example, when mpps shows running processes that are actually no longer running.

Use the mpkill -C nn command to clear out such stale jobs.

Note -

Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.

Troubleshooting Tips

When running multiprocess jobs (-np equal to 0 or greater than 1) in a cluster with NFS-mounted file systems, you should take steps to limit core dumps to zero. This can be done with the Solaris limit command. See the limit(1) man page for additional information.
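A minimal sketch of setting the limit, assuming the job is launched from the same shell session. The Bourne and Korn shells use the ulimit built-in; the C shell form is shown as a comment. Whether the limit reaches processes started on remote nodes is not stated here, so treat that as something to verify on your cluster.

```shell
# Bourne/Korn shell: set the core file size limit to zero for this shell
# and for every process started from it.
ulimit -c 0

# Display the current core file size limit to confirm the change.
ulimit -c

# C shell equivalent (csh built-in, described in limit(1)):
#   limit coredumpsize 0
```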

The CRE resource database daemon (tm.rdb) does not remove missing interfaces after a client daemon is restarted. Instead, the mpinfo -Nv command will show them marked as down.

The contents of the /var/adm/messages file are local to each node. Any daemon messages will be logged only on the node where that daemon runs. By default, CRE daemon messages are stored in /var/adm/messages along with other messages handled by syslog. Alternatively, CRE messages can be written to a file specified by the mpadmin logfile command.

Use shell I/O redirection instead of mprun -I options whenever possible. Using shell redirection will reduce the likelihood of problems involving standard I/O.

If mprun is signaled too soon after it has been invoked, it exits without stopping the job's processes. If this happens, identify the job whose processes are still running and use mpkill -9 jid to kill it.

The CRE does not pass supplemental group ID information to remote processes. To run a job with the permissions of a particular group, you must use the -G gid option with mprun. You must be a member of the group you specify.

RPC timeouts - CRE RPC timeouts in Sun MPI code are logged to syslog, but the default syslog.conf file causes these messages to be dropped. If you want to see these errors, you should modify your /etc/syslog.conf file so that messages of the priority user.err are not dropped. Note that this does not apply to RPC timeouts occurring in the CRE daemons themselves. By default, these are logged to /var/adm/messages.

Note -
If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by the CRE (not syslog) and will be logged in a file specified by an argument to logfile.

CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably won't be able to communicate with each other. There are two ways to deal with this:

Enable the tm.watchd job killing option (-Yk), which will automatically kill jobs when nodes go off line. This will catch most of these cases, since RPC timeouts usually coincide with tm.watchd marking the node as off line.

Monitor RPC errors from user codes by looking for syslog messages of priority user.err. Then use mpkill to kill the associated job manually.

If a file system is not visible on all nodes, users can encounter a permission denied message when attempting to execute programs from that file system. Watch for errors caused by non-shared file systems like /tmp, which exist locally on each node. This can show up when users attempt to execute a program from /tmp and the program does not exist in the /tmp file system of every node.

If the behavior of your system suggests that you have run out of swap space (for example, as reported by vmstat or by df on /tmp), you may need to increase the limit on the CRE's shmem_minfree attribute.

If you execute mpkill with the -C option (this option is available only to the system administrator), you should look for and remove leftover files on the master node. The file names for large files are of the form:

/tmp/.hpcshm_mmap.jid.*

Smaller files will have file names of the form:

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
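The following Bourne shell sketch lists any leftover files for a given job so that they can be reviewed before removal. The function name list_hpcshm_files and its optional directory argument are hypothetical; only the two file name patterns come from the text above.

```shell
# Hypothetical helper: list leftover Sun MPI shared-memory files for one job.
# $1 = job ID (jid), $2 = directory to search (defaults to /tmp).
list_hpcshm_files() {
    jid=$1
    dir=${2:-/tmp}
    for f in "$dir"/.hpcshm_mmap."$jid".* "$dir"/.hpcshm_acf."$jid".*; do
        # An unmatched pattern is left as literal text, so check that
        # the name really refers to an existing file before printing it.
        [ -f "$f" ] && echo "$f"
    done
    return 0
}
```

For example, list_hpcshm_files 42 prints the files left by job 42 in /tmp, one per line. Once you have confirmed that the job is no longer running, the listed files can be deleted with rm.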