Sun HPC ClusterTools 3.0 Administrator's Guide: With CRE

Error Messages

No nodes in partition satisfy RRS:

This message can appear, for example, when CRE is unable to check out its licenses. It can also indicate an error in the construction of the RRS (resource requirement specifier).

Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused

If these messages coincide with jobs being killed, they can be safely ignored. One way to check for this correlation is to look in the accounting logs for jobs that were signaled around the time the messages were logged.
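The correlation check can be partly automated. The following is a minimal sketch, not part of CRE: it pulls the timestamps of the Cond_set "Connection refused" messages out of syslog lines shaped like the example above and reports those that have no job signal nearby. The function names refused_times and uncorrelated, the 60-second window, and the year argument (syslog lines carry no year) are all assumptions made for illustration. The accounting log format is not described here, so the signal times are expected to be extracted by hand or by a site-specific script and passed in as datetime values. Month names are parsed with %b, which assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines shaped like the example message above, e.g.
# "Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ..."
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"tm\.rdb\[\d+\]: Cond_set: unable to connect to \S+: "
    r"connect: Connection refused"
)


def refused_times(lines, year):
    """Return a (timestamp, host) pair for each matching syslog line.

    Syslog lines omit the year, so the caller must supply it (assumption).
    """
    hits = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            stamp = datetime.strptime(
                f"{year} {m.group('stamp')}", "%Y %b %d %H:%M:%S"
            )
            hits.append((stamp, m.group("host")))
    return hits


def uncorrelated(hits, signal_times, window_seconds=60):
    """Return the hits that have no job signal within window_seconds.

    signal_times is a list of datetime values taken from the accounting
    logs by whatever means the site uses (hypothetical input).
    """
    window = timedelta(seconds=window_seconds)
    return [
        hit for hit in hits
        if not any(abs(hit[0] - t) <= window for t in signal_times)
    ]
```

An empty result from uncorrelated suggests that every message lines up with a signaled job. Any entries that remain are the ones worth investigating further.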

mprun: unique partition: No such object

Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.

This can happen, for example, when mpps lists processes as running that have in fact already exited.

Use the mpkill -C nn command to clear out such stale jobs.


Note -

Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.