You can monitor and control submitted jobs in three ways:
With QMON
From the command line with the qstat, qdel, and qmod commands
By email
The following sections describe each of these methods.
You use the QMON Job Control dialog box to control jobs.
To monitor and control your submitted jobs, in the QMON Main Control window click the Job Control button. The Job Control dialog box appears.
The Job Control dialog box has three tabs: a tab for Running Jobs, a tab for Pending Jobs that are waiting to be dispatched to an appropriate resource, and a tab for recently Finished Jobs.
The Submit button provides a link to the Submit Job dialog box.
The Job Control dialog box enables you to monitor all running, pending, and finished jobs that are known to the system. You can also use this dialog box to manage jobs. You can change a job's priority. You can also suspend, resume, and cancel jobs.
In its default format, the Job Control dialog box displays the following columns for each running and pending job:
JobId
Priority
JobName
Owner
Status
Queue
You can change the default display by customizing the format. See Customizing the Job Control Display for details.
To keep the displayed information up-to-date, QMON uses a polling scheme to retrieve the status of the jobs from sge_qmaster. Click Refresh to force an update of the Job Control display.
You can select jobs with the following mouse and key combinations:
To select multiple noncontiguous jobs, hold down the Control key and click two or more jobs.
To select a contiguous range of jobs, hold down the Shift key, click the first job in the range, and then click the last job in the range.
To toggle between selecting a job and clearing the selection, click the job while holding down the Control key.
You can also use a filter to select the jobs that you want to display. See Filtering the Job List for details.
You can use the buttons at the right of the dialog box to manage selected jobs in the following ways:
Suspend
Resume (unsuspend)
Delete
Hold back
Release
Reprioritize
Reschedule
Modify with qalter
Only the job owner or grid engine managers and operators can suspend and resume jobs, delete jobs, hold back jobs, modify job priority, and modify jobs. See Managers, Operators, and Owners. Only running jobs can be suspended or resumed. Only pending jobs can be rescheduled, held back and modified, in priority as well as in other attributes.
Suspension of a job sends the signal SIGSTOP to the process group of the job with the UNIX kill command. SIGSTOP halts the job, so the job no longer consumes CPU time. Resumption of the job sends the signal SIGCONT, thereby unsuspending the job. See the kill(1) man page for your system for more information on signalling processes.
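The effect of the two signals can be tried by hand on an ordinary process. The following is a minimal sketch, not the grid engine implementation: it uses a background sleep process as a stand-in for a job and signals that single process rather than a whole process group. The function name suspend_resume_demo is hypothetical.

```shell
#!/bin/sh
# Illustration only: suspend and resume a stand-in process by hand.
# The grid engine system signals the job's process group; this sketch
# signals one background process so that it stays self-contained.

suspend_resume_demo() {
    sleep 30 >/dev/null 2>&1 &   # stand-in for a running job
    pid=$!

    kill -STOP "$pid"            # suspend: the process is halted
    echo "sent SIGSTOP to stand-in job"

    kill -CONT "$pid"            # resume: the process continues
    echo "sent SIGCONT to stand-in job"

    kill -TERM "$pid"            # clean up the stand-in process
    wait "$pid" 2>/dev/null || true
    return 0
}

suspend_resume_demo
```

Between the two kill calls the stand-in process is stopped and is not scheduled on a CPU, which is the same state a suspended job is in.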
You can force suspending, resuming, and deleting jobs. In other words, you can register these actions with sge_qmaster without notifying the sge_execd that controls the jobs. Forcing is useful when the corresponding sge_execd is unreachable, for example, due to network problems. Select the Force option for this purpose.
Click Reschedule to reschedule a currently running job.
To put a job on hold, select a pending job and click Hold. The Set Hold dialog box appears.
The Set Hold dialog box enables setting and resetting user, operator, and system holds. User holds can be set or reset by the job owner as well as by grid engine managers and operators. Operator holds can be set or reset by managers and operators. System holds can be set or reset by managers only. As long as any hold is assigned to a job, the job is not eligible for running. You can also set or reset holds by using the qalter, qhold, and qrls commands.
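As a rough command-line counterpart to the Set Hold dialog box, the sketch below sets and then releases a user hold with qhold and qrls. The job ID 236 is only an example value, and DRY_RUN is a hypothetical switch introduced for this sketch: while it is 1 (the default here) the commands are printed rather than executed, so the example can be read and run without a grid engine installation. See the qhold(1) and qrls(1) man pages for the authoritative syntax.

```shell
#!/bin/sh
# Sketch: set and release a user hold from the command line.
# DRY_RUN is a hypothetical switch for this example. When it is 1,
# commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

hold_job()    { run qhold -h u "$1"; }   # -h u selects a user hold
release_job() { run qrls -h u "$1"; }    # release the same hold type

hold_job 236
release_job 236
```

Operators and managers would use -h o or -h s in place of -h u for operator and system holds, subject to the permissions described above.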
The Tasks field on the Set Hold dialog box applies to array jobs. Use this field to put a hold on particular subtasks of an array job. Note the format of the text in the Tasks field. The task ID range specified in this field can be a single number, a simple range of the form n-m, or a range with a step size. The task ID range specified by, for example, 2-10:2 results in the task ID indexes 2, 4, 6, 8, and 10. This range represents a total of five identical tasks, with the environment variable SGE_TASK_ID containing one of the five index numbers. For detailed information about job holds, see the qsub(1) man page.
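To check which task IDs a given range covers, the range notation can be expanded by hand. The following sketch handles the three forms described above (a single number, n-m, and n-m:s). The function name expand_task_range is hypothetical, and the sketch does no input validation.

```shell
#!/bin/sh
# Sketch: expand a task ID range such as 2-10:2 into its task IDs.
# Accepts a single number, a range n-m, or a range with step n-m:s.

expand_task_range() {
    range=$1
    step=1
    case $range in
        *:*) step=${range##*:}; range=${range%%:*} ;;
    esac
    case $range in
        *-*) first=${range%%-*}; last=${range##*-} ;;
        *)   first=$range; last=$range ;;
    esac
    i=$first
    out=
    while [ "$i" -le "$last" ]; do
        out="${out:+$out }$i"
        i=$((i + step))
    done
    echo "$out"
}

expand_task_range 2-10:2    # prints: 2 4 6 8 10
```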
When you click Priority on the Job Control dialog box, the following dialog box appears.
This dialog box enables you to provide the new priority of selected pending or running jobs. The priority ranks a single user's jobs among themselves. Priority tells the scheduler how to choose among a single user's jobs when several jobs are in the system simultaneously.
When you select a pending job and click Qalter, the Submit Job window appears. All the entries of the dialog box are set corresponding to the attributes of the job that were defined when the job was submitted. Entries that cannot be changed are grayed out. The other entries can be edited. The changes are registered with the grid engine system when you click Qalter on the Submit Job dialog box. The Qalter button is a substitute for the Submit button.
The Verify flag on the Submit Job dialog box has a special meaning when the flag is used in the Qalter mode. You can check pending jobs for their consistency, and you can investigate why jobs are not yet scheduled. Select the desired consistency-checking mode for the Verify flag, and then click Qalter. The system displays warnings on inconsistencies, depending on the checking mode you select. See Submitting Advanced Jobs With QMON and the -w option on the qalter(1) man page for more information.
Another method for checking why jobs are still pending is to select a job and click Why? on the Job Control dialog box. Doing so opens the Object Browser dialog box. This dialog box displays a list of reasons that prevented the scheduler from dispatching the job in its most recent pass. An example of a Browser window that displays such a message is shown in the following figure.
The Why? button delivers meaningful output only if the scheduler configuration parameter schedd_job_info is set to true. See the sched_conf(5) man page. The displayed scheduler information relates to the last scheduling interval. The information might not be accurate by the time you investigate why your job was not scheduled.
Click Clear Error to remove an error state from a pending job that failed due to a job-dependent problem. For example, the job might have insufficient permissions to write to the specified job output file.
Error states appear in red text in the pending jobs list. You should remove error states only after you correct the error condition, for example, using qalter. Such error conditions are automatically reported through email if the job requests to send email when the job is aborted. For example, the job might have been submitted with the qsub -m a command.
To customize the default Job Control display, click Customize. The Job Customize dialog box appears. Click the Select Job Fields tab. A sample Select Job Fields tab is shown in the following figure.
Use the Job Customize dialog box to configure the set of information to display.
With the Job Customize dialog box, you can select more entries of the job object to be displayed. You can also filter the jobs that you are interested in. The example in the preceding figure selects the additional fields Projects, Tickets, and Submit Time.
The following figure shows the enhanced look after customization is applied to the Finished Jobs list.
Use the Save button on the Job Customize dialog box to store the customizations in the file .qmon_preferences. This file is located in the user's home directory. By saving your customizations, you redefine the appearance of the Job Control dialog box.
The following example of the filtering facility selects only those jobs owned by aa114085 that are suitable to be run on the architecture solaris64.
The following figure shows the resulting Running Jobs tab of the Job Control dialog box.
The Job Control dialog box that is shown in the previous figure is also an example of how QMON displays array jobs.
You can use the QMON Object Browser to quickly retrieve additional information about jobs without having to customize the Job Control dialog box, as explained in Monitoring and Controlling Jobs With QMON.
You can open the Object Browser to display information about jobs in two ways:
Click the Browser button in the QMON Main Control window, and then click Job in the Browser dialog box.
Move the pointer over a job in the Job Control dialog box.
The following Browser window shows an example of the job information that is displayed:
This section describes how to use the commands qstat, qdel, and qmod to monitor, delete, and modify jobs from the command line.
To monitor jobs, type one of the following commands, guided by information that is detailed in the following sections:
% qstat
% qstat -f
% qstat -ext
qstat with no options provides an overview of submitted jobs only. qstat -f also includes information about the currently configured queues. qstat -ext adds details such as up-to-date job usage and the tickets assigned to a job.
In the first form, a header line indicates the meaning of the columns. The purpose of most of the columns should be self-explanatory. The state column, however, contains single-character codes with the following meanings: r for running, s for suspended, q for queued, and w for waiting. See the qstat(1) man page for a detailed explanation of the qstat output format.
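The state column can combine several codes, for example qw for a job that is queued and waiting. The following sketch spells out a state string by using only the four codes listed above. The function name decode_state is hypothetical, and any other code is reported as unknown rather than guessed. The qstat(1) man page remains the reference for the full set of codes.

```shell
#!/bin/sh
# Sketch: translate a qstat state string into words, one code at a time.
# Only the four codes described in the text are known to this example.

decode_state() {
    codes=$1
    out=
    while [ -n "$codes" ]; do
        rest=${codes#?}          # everything after the first character
        c=${codes%"$rest"}       # the first character itself
        case $c in
            r) word=running ;;
            s) word=suspended ;;
            q) word=queued ;;
            w) word=waiting ;;
            *) word="unknown($c)" ;;
        esac
        out="${out:+$out }$word"
        codes=$rest
    done
    echo "$out"
}

decode_state r     # prints: running
decode_state qw    # prints: queued waiting
```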
The second form is divided into two sections. The first section displays the status of all available queues. The second section, titled PENDING JOBS, shows the status of the sge_qmaster job spool area. The first line of the queue section defines the meaning of the columns with respect to the queues that are listed. The queues are separated by horizontal lines. If jobs run in a queue, the job names appear below the associated queue in the same format as in the qstat command in its first form. The pending jobs in the second output section are also listed as in qstat's first form.
The columns of the queue description provide the following information:
qtype – Queue type. Queue type is either B (batch) or I (interactive).
used/free – Count of used and free job slots in the queue.
states – State of the queue. See the qstat(1) man page for detailed information about queue states.
The qstat(1) man page contains a more detailed description of the qstat output format.
In the third form, the usage and ticket values assigned to a job are shown in the following columns:
cpu/mem/io – Currently accumulated CPU, memory, and I/O usage.
tckts/ovrts/otckt/ftckt/stckt – These values are as follows:
tckts – Total number of tickets assigned to the job
ovrts – Override tickets assigned through qalter -ot
otckt – Tickets assigned through the override policy
ftckt – Tickets assigned through the functional policy
stckt – Tickets assigned through the share-based policy
In addition, the deadline initiation time is displayed in the column deadline, if applicable. The share column shows the current resource share that each job has with respect to the usage generated by all jobs in the cluster. See the qstat(1) man page for further details.
Various additional options to the qstat command enhance the functionality. Use the -r option to display the resource requirements of submitted jobs. You can also restrict the output to a certain user or to a specific queue. You can use the -l option to specify resource requirements in the same way as for the qsub command, as described in Defining Resource Requirements. If you specify resource requirements, qstat displays only the queues that match those requirements, together with the jobs that are running in those queues.
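The following sketch shows what such invocations might look like. The -u and -q options are taken from the qstat(1) man page and are not named in the text above. The user name elaine, the queue name durin.q, and the resource value arch=solaris64 are example values only. DRY_RUN is a hypothetical switch for this sketch that prints each command instead of running it.

```shell
#!/bin/sh
# Sketch: common ways to narrow qstat output.
# DRY_RUN is a hypothetical switch for this example. When it is 1,
# commands are printed instead of executed.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run qstat -r                      # show resource requirements of jobs
run qstat -u elaine               # only jobs of one user
run qstat -q durin.q              # only one queue
run qstat -f -l arch=solaris64    # only queues matching a resource request
```

Set DRY_RUN=0 on a host with access to the grid engine system to execute the commands instead of printing them.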
The qstat command has been enhanced so that the administrator and the user can define files that contain frequently used options. See the sge_qstat(5) man page. A cluster-wide sge_qstat file can be placed at $SGE_ROOT/$SGE_CELL/common/sge_qstat. The user's private file is read from $HOME/.sge_qstat. The file in the home directory has the highest precedence, followed by the cluster-wide file. You can use the command line to override the flags contained in a file.
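As an illustration, the sketch below writes a small defaults file. It assumes the format described in the sge_qstat(5) man page: one or more qstat options per line, with blank lines and lines that start with # ignored. The options chosen and the function name write_qstat_defaults are examples only. The sketch writes to a temporary file so that it does not touch an existing $HOME/.sge_qstat.

```shell
#!/bin/sh
# Sketch: create a qstat defaults file with two example options.
# Assumed format (see sge_qstat(5)): qstat options, one or more per
# line; lines starting with '#' are comments.

write_qstat_defaults() {
    target=$1
    cat > "$target" <<'EOF'
# Example personal qstat defaults
-f
-u elaine
EOF
}

defaults_file=$(mktemp)
write_qstat_defaults "$defaults_file"
cat "$defaults_file"              # inspect the result
rm -f "$defaults_file"
```

To use such a file for real, the same content would go into $HOME/.sge_qstat.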
Example 4–1 and Example 4–2 show sample output from the qstat -f and qstat commands, respectively.
queuename   qtype   used/free   load_avg   arch   states
dq          BIP     0/1         99.99      sun4   au
durin.q     BIP     2/2         0.36       sun4
    231   0   hydra       craig     r    07/13/96 20:27:15   MASTER
    232   0   compile     penny     r    07/13/96 20:30:40   MASTER
dwain.q     BIP     3/3         0.36       sun4
    230   0   blackhole   don       r    07/13/96 20:26:10   MASTER
    233   0   mac         elaine    r    07/13/96 20:30:40   MASTER
    234   0   golf        shannon   r    07/13/96 20:31:44   MASTER
fq          BIP     0/3         0.36       sun4
################################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS -
################################################################################
    236   5   word        elaine    qw   07/13/96 20:32:07
    235   0   andrun      penny     qw   07/13/96 20:31:43
job-ID   prior   name        user      state   submit/start at     queue     function
231      0       hydra       craig     r       07/13/96 20:27:15   durin.q   MASTER
232      0       compile     penny     r       07/13/96 20:30:40   durin.q   MASTER
230      0       blackhole   don       r       07/13/96 20:26:10   dwain.q   MASTER
233      0       mac         elaine    r       07/13/96 20:30:40   dwain.q   MASTER
234      0       golf        shannon   r       07/13/96 20:31:44   dwain.q   MASTER
236      5       word        elaine    qw      07/13/96 20:32:07
235      0       andrun      penny     qw      07/13/96 20:31:43
To control jobs from the command line, type one of the following commands with the appropriate arguments.
% qdel arguments
% qmod arguments
Use the qdel command to cancel jobs, regardless of whether the jobs are running or are spooled. Use the qmod command to suspend and resume (unsuspend) jobs already running.
For both commands, you need to know the job identification number, which is displayed in response to a successful qsub command. If you forget the number, you can retrieve it with qstat. See Monitoring Jobs With qstat.
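In scripts it can be convenient to capture the job identification number at submission time rather than look it up later. The sketch below extracts the number from a confirmation line. The exact wording of that line (here: Your job 236 ("word") has been submitted) is an assumption that can differ between versions, so check the output of your own qsub before relying on the pattern. The function name job_id_from_qsub is hypothetical, and confirmation lines for array jobs are not handled.

```shell
#!/bin/sh
# Sketch: pull the job ID out of a qsub confirmation line.
# Assumed message format: Your job <id> ("<name>") has been submitted

job_id_from_qsub() {
    # Read the confirmation text on stdin and print only the numeric ID.
    sed -n 's/^[Yy]our job \([0-9][0-9]*\) .*$/\1/p'
}

echo 'Your job 236 ("word") has been submitted' | job_id_from_qsub   # prints: 236
```

In real use, the output of qsub would be piped into the function in place of the echo command.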
The following list provides several examples of the qdel and qmod commands:
% qdel job-id
% qdel -f job-id1, job-id2
% qmod -s job-id
% qmod -us -f job-id1, job-id2
% qmod -s job-id.task-id-range
In order to delete, suspend, or resume a job, you must be the owner of the job or a grid engine manager or operator. See Managers, Operators, and Owners.
You can use the -f (force) option with both commands to register a job status change at sge_qmaster without contacting sge_execd. You might want to use the force option in cases where sge_execd is unreachable, for example, due to network problems. The -f option is intended for use only by the administrator. In the case of qdel, however, users can force deletion of their own jobs if the flag ENABLE_FORCED_QDEL in the cluster configuration qmaster_params entry is set. See the sge_conf(5) man page for more information.
From the command line, type the following command with appropriate arguments.
% qsub arguments
The qsub -m command requests email to be sent to the user who submitted a job or to the email addresses specified by the -M flag if certain events occur. See the qsub(1) man page for a description of the flags. An argument to the -m option specifies the events. The following arguments are available:
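b – Send email at the beginning of the job.
e – Send email at the end of the job.
a – Send email when the job is rescheduled or aborted (for example, by using the qdel command).
s – Send email when the job is suspended.
n – Do not send email. n is the default.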
Use a string made up of one or more of the letter arguments to specify several of these options with a single -m option. For example, -m be sends email at the beginning and at the end of a job.
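A combined argument string can be checked letter by letter before it is passed to qsub. The following sketch describes each letter in a string such as be. The letters b, e, a, s, and n follow the qsub(1) man page. The function name describe_mail_events is hypothetical and the descriptions are paraphrased.

```shell
#!/bin/sh
# Sketch: explain each letter of a qsub -m argument string.
# Letters follow qsub(1); descriptions are paraphrased for this example.

describe_mail_events() {
    events=$1
    while [ -n "$events" ]; do
        rest=${events#?}          # everything after the first letter
        c=${events%"$rest"}       # the first letter itself
        case $c in
            b) echo "b: beginning of job" ;;
            e) echo "e: end of job" ;;
            a) echo "a: job rescheduled or aborted" ;;
            s) echo "s: job suspended" ;;
            n) echo "n: no email" ;;
            *) echo "$c: not a valid -m argument" >&2; return 1 ;;
        esac
        events=$rest
    done
}

describe_mail_events be
```

The call above prints one line for b and one line for e, matching the -m be example in the text.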
You can also use the Submit Job dialog box to configure these mail events. See Submitting Advanced Jobs With QMON.