The following error message usually indicates that all the nodes in a CRE partition are marked down; that is, their node daemons are not running.
No nodes in partition satisfy RRS:
This could happen, for example, if CRE was unable to check out its licenses. This message can also indicate an error in the construction of an RRS.
Under certain circumstances, when a user attempts to kill a job, the CRE may log error messages of the following form on the master node:
Aug 27 11:02:30 ops2a tm.rdb[462]: Cond_set: unable to connect to ops2a/45126: connect: Connection refused
If these errors can be correlated with jobs being killed, they can be safely ignored. One way to check the correlation is to look in the accounting logs for jobs that were signaled around the time the errors were logged.
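The correlation check described above could be scripted. The following is a minimal sketch, not a CRE tool: it picks the Cond_set "Connection refused" lines out of a syslog excerpt and reports any that fall outside a time window around known job-kill times. The function names, the 60-second window, and the year argument are assumptions (syslog timestamps carry no year). The kill times must be taken by hand from the accounting logs, because their format is not described here.

```python
import re
from datetime import datetime, timedelta

# Matches syslog lines of the form shown in the tm.rdb example above.
COND_SET_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"tm\.rdb\[\d+\]:\s+Cond_set: unable to connect to .*Connection refused"
)


def parse_cond_set_errors(lines, year):
    """Return the timestamps of Cond_set 'Connection refused' errors.

    `year` is supplied by the caller because syslog omits it (assumption).
    """
    stamps = []
    for line in lines:
        match = COND_SET_RE.match(line)
        if match:
            stamps.append(
                datetime.strptime(f"{year} {match.group('stamp')}",
                                  "%Y %b %d %H:%M:%S")
            )
    return stamps


def uncorrelated_errors(error_times, kill_times, window_seconds=60):
    """Return error timestamps with no job kill within the window.

    `kill_times` is a list of datetimes taken from the accounting logs;
    the 60-second default window is an arbitrary, hypothetical choice.
    """
    window = timedelta(seconds=window_seconds)
    return [
        t for t in error_times
        if not any(abs(t - k) <= window for k in kill_times)
    ]
```

An empty result from uncorrelated_errors suggests that every logged error coincided with a job kill. Any timestamps it returns deserve a closer look before the errors are dismissed.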
mprun: unique partition: No such object
When there is stale job information in the CRE database, an error message of the following form may occur:
Query returned excess results:
a.out: (TMTL UL) TMRTE_Abort: Not yet initialized
The attempt to kill your program failed.
This might happen, for example, when mpps shows running processes that are actually no longer running.
Use the mpkill -C nn command to clear out such stale jobs.
Before removing the job's information from the database, the mpkill -C option verifies that the processes of the job are in fact no longer running.
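When several stale jobs need clearing, the mpkill -C invocations can be generated from a list of job IDs. The sketch below is illustrative only: the helper names and the dry-run default are assumptions, and the job IDs must be the ones that mpps reports on your cluster. It uses only the mpkill -C nn form given above and treats each job ID as an opaque value.

```python
import shutil
import subprocess


def build_clear_command(job_id):
    """Build the argument list for `mpkill -C <job_id>`."""
    return ["mpkill", "-C", str(job_id)]


def clear_stale_jobs(job_ids, dry_run=True):
    """Return the mpkill commands for job_ids, running them if requested.

    With dry_run=True (the default, a hypothetical safety choice), or when
    mpkill is not on PATH, the commands are only returned, not executed.
    """
    commands = [build_clear_command(j) for j in job_ids]
    if dry_run or shutil.which("mpkill") is None:
        return commands
    for cmd in commands:
        # check=False: a failure on one job does not stop the others.
        subprocess.run(cmd, check=False)
    return commands
```

Calling clear_stale_jobs with dry_run=True first makes it possible to review the exact commands before anything is removed from the database.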