Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Error Conditions and Troubleshooting Tips

The following sections include sample error messages and their interpretations, as well as guidelines for anticipating common problems.

Error Messages

No nodes in partition satisfy RRS:

This message can appear, for example, if CRE was unable to check out its licenses. It can also indicate an error in the construction of an RRS.

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If messages of this kind coincide with jobs being killed, they can be safely ignored. One way to check the correlation is to look in the accounting logs for jobs that were signaled at the times the messages were logged.
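The correlation check can be sketched roughly as follows. This is an illustration only, not part of CRE. The layout of the accounting logs is not described here, so the sketch assumes the times at which jobs were signaled have already been extracted from them by hand. The function names, the 60-second window, and the year argument are hypothetical; only the syslog line layout is taken from the sample message.

```python
import re
from datetime import datetime, timedelta

# Matches the layout of the sample syslog message:
# "<Mon> <day> <HH:MM:SS> <host> <program>[<pid>]: <text>"
SYSLOG_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<prog>[^\[\s]+)\[(?P<pid>\d+)\]: (?P<text>.*)$"
)


def parse_refused(line, year):
    """Return the timestamp of a tm.rdb 'Connection refused' line, else None."""
    m = SYSLOG_RE.match(line)
    if not m or m.group("prog") != "tm.rdb":
        return None
    if "Connection refused" not in m.group("text"):
        return None
    # syslog timestamps omit the year, so the caller must supply it.
    return datetime.strptime(f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S")


def uncorrelated(error_times, signal_times, window_seconds=60):
    """Return the error timestamps with no signaled job within the window.

    signal_times is assumed to come from the accounting logs; how those
    times are extracted is outside the scope of this sketch.
    """
    window = timedelta(seconds=window_seconds)
    return [
        t for t in error_times
        if not any(abs(t - s) <= window for s in signal_times)
    ]
```

Timestamps returned by uncorrelated() are the ones worth a closer look; an empty list suggests that every logged error fell near a job being signaled.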

mprun: unique partition: No such object

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.

This might happen, for example, when mpps lists processes as running that are in fact no longer running.

Use the mpkill -C nn command to clear out such stale jobs.


Note - Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.


Troubleshooting Tips

CRE RPC timeouts in user code are generally not recoverable. The job might continue to run, but processes probably won't be able to communicate with each other. There are two ways to deal with this:

/tmp/.hpcshm_mmap.jid.*

/tmp/.hpcshm_acf.jid.*

The Sun MPI shared memory protocol module uses these files for interprocess communication on the same node. These files consume swap space.
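To see how much space such files currently occupy on a node, a small reporting script along the following lines may help. It is a sketch under stated assumptions: jid in the file names is taken to be a placeholder for the job ID, so everything after the fixed prefix is matched with a wildcard, and the function name shm_file_usage is hypothetical. The script only lists files and sizes; it does not remove anything.

```python
import glob
import os

# File name prefixes taken from the text above. The remainder of each name
# (assumed to be the job ID plus any suffix) is matched with a wildcard.
PREFIXES = (".hpcshm_mmap.", ".hpcshm_acf.")


def shm_file_usage(directory="/tmp"):
    """Return ([(path, size_in_bytes), ...], total_bytes) for matching files."""
    found = []
    for prefix in PREFIXES:
        pattern = os.path.join(glob.escape(directory), prefix + "*")
        for path in sorted(glob.glob(pattern)):
            try:
                found.append((path, os.path.getsize(path)))
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
    return found, sum(size for _, size in found)


if __name__ == "__main__":
    files, total = shm_file_usage()
    for path, size in files:
        print(f"{size:>12}  {path}")
    print(f"{total:>12}  total bytes")
```

Run on each node of interest, the script prints one line per file followed by the total, which gives a rough picture of how much space these files account for.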