Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Troubleshooting Tips

When running multiprocess jobs (-np equal to 0 or greater than 1) in an cluster with NFS-mounted file systems, you should take steps to limit core dumps to zero. This can be done with the Solaris limit command. See the limit(1) man page for additional information.

The CRE resource database daemon (tm.rdb) does not remove missing interfaces after a client daemon is restarted. Instead, the mpinfo -Nv command will show them marked as down.

The contents of the /var/adm/messages file are local to each node. Any daemon messages will be logged only on the node where that daemon runs. By default, CRE daemon messages are stored in /var/adm/messages along with other messages handled by syslog. Alternatively, CRE messages can be written to a file specified by the mpadmin logfile command.

Use shell I/O redirection instead of mprun -I options whenever possible. Using shell redirection will reduce the likelihood of problems involving standard I/O.

If mprun is signaled too soon after it has been invoked, it exits without stopping the job`s processes. If this happens, find out which processes are use mpkill -9 jid to kill such a job.

The CRE does not pass supplemental group ID information to remote processes. You must use the -G gid option with mprun to run with the group permissions of that group. You must be a member of that group.

RPC timeouts - CRE RPC timeouts in Sun MPI code are logged to syslog, but the default syslog.conf file causes these messages to be dropped. If you want to see these errors, you should modify your /etc/syslog.conf file so that messages of the priority user.err are not dropped. Note that this does not apply to RPC timeouts occurring in the CRE daemons themselves. By default, these are logged to /var/adm/messages.

Note -
If you have set the Cluster-level attribute logfile, all error messages generated by user code will be handled by the CRE (not syslog) and will be logged in a file specified by an argument to logfile.

CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably won't be able to communicate with each other. There are two ways to deal with this:

Enable the tm.watchd job killing option (- Yk), which will automatically kill jobs when nodes go off line. This will catch most of these cases, since RPC timeouts usually coincide with tm.watchd marking the node as off line.

Monitor RPC errors from user codes by looking for syslog messages of priority user.err). Then use mpkill to kill the associated job manually.

If a file system is not visible on all nodes, users can encounter a permission denied message when attempting to execute programs from such a file system. Watch for errors caused by non-shared file system like /tmp, which exist locally on all nodes. This can show up when users attempt to execute programs from /tmp, and the program does not exist in the /tmp file systems of all nodes.

If the behavior of your system suggests that you've run out of swap space (after executing vmstat or df on /tmp), you may need to increase the limit on the CRE's shmem_minfree attribute.

If you execute mpkill with the -C option (this option is available only to the system administrator), you should look for and remove leftover files on the master node. The file names for large files are of the form:

/tmp/.hpcshm_mmap.jid.*

Smaller files will have file names of the form:

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.