This chapter provides background information about submitting jobs, as well as instructions for how to submit jobs for processing. The chapter begins with an example of how to run a simple job. The chapter then continues with instructions for how to run more complex jobs.
Instructions for accomplishing the following tasks are included in this chapter.
Use the information and instructions in this section to become familiar with basic procedures involved in submitting jobs.
If you installed the N1 Grid Engine 6.1 software under an unprivileged user account, you must log in as that user to be able to run jobs. See Installation Accounts in Sun N1 Grid Engine 6.1 Installation Guide for details.
Before you run any grid engine system command, you must first set your executable search path and other environment conditions properly.
From the command line, type one of the following commands.
If you are using csh or tcsh as your command interpreter, type the following:
% source sge-root/cell/common/settings.csh
sge-root specifies the location of the root directory of the grid engine system. This directory was specified at the beginning of the installation procedure.
If you are using sh, ksh, or bash as your command interpreter, type the following:
# . sge-root/cell/common/settings.sh
You can add these commands to your .login, .cshrc, or .profile file, whichever is appropriate. By adding these commands, you guarantee proper settings for all interactive sessions you start later.
Submit a simple job script to your cluster by typing the following command:
% qsub simple.sh
The command assumes that simple.sh is the name of the script file, and that the file is located in your current working directory.
You can find the following job in the file sge-root/examples/jobs/simple.sh.
#!/bin/sh
#
#
# (c) 2004 Sun Microsystems, Inc. Use is subject to license terms.
#
# This is a simple example of a SGE batch script
#
# request Bourne shell as shell for job
#$ -S /bin/sh
#
# print date and time
date
# Sleep for 20 seconds
sleep 20
# print date and time again
date
If the job submits successfully, the qsub command responds with a message similar to the following example:
your job 1 ("simple.sh") has been submitted
Type the following command to retrieve status information about your job.
% qstat
You should receive a status report that provides information about all jobs currently known to the grid engine system. For each job, the status report lists the following items:
Job ID, which is the unique number that is included in the submit confirmation
Name of the job script
Owner of the job
State indicator; for example, r means running
Submit or start time
Name of the queue in which the job runs
If qstat produces no output, no jobs are actually known to the system. For example, your job might already have finished.
You can check the output of finished jobs in their stdout and stderr redirection files. By default, these files are generated in the job owner's home directory on the host that ran the job. The names of the files are composed of the job script file name with a .o extension for the stdout file and a .e extension for the stderr file, followed by the unique job ID. The stdout and stderr files of your job can thus be found under the names simple.sh.o1 and simple.sh.e1, respectively, if your job was the first ever executed in a newly installed grid engine system.
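As a sketch, the default redirection file names can be assembled from the script name and the job ID; the job ID 1 below is hypothetical, and you would use the ID that qsub reported at submit time:

```shell
#!/bin/sh
# Default redirection-file naming: <script>.o<job-id> and <script>.e<job-id>.
# The job ID here is hypothetical; use the ID reported by qsub.
job_script="simple.sh"
job_id=1
stdout_file="${job_script}.o${job_id}"
stderr_file="${job_script}.e${job_id}"
echo "stdout file: ${stdout_file}"
echo "stderr file: ${stderr_file}"
```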
A more convenient way to submit and control jobs and to get an overview of the grid engine system is the graphical user interface QMON. Among other facilities, QMON provides a Submit Job dialog box and a Job Control dialog box for the tasks of submitting and monitoring jobs.
Type the following command to start the QMON GUI:
% qmon
During startup, a message window appears, and then the QMON Main Control window appears.
Click the Job Control button, and then click the Submit Jobs button.
The button names, such as Job Control, are displayed when you rest the mouse pointer over the buttons.
The Submit Job and the Job Control dialog boxes appear, as shown in the following figures.
In the Submit Job dialog box, click the icon at the right of the Job Script field.
The Select a File dialog box appears.
Select your script file.
For example, select the file simple.sh that was used in the command line example.
Click OK to close the Select a File dialog box.
On the Submit Job dialog box, click Submit.
After a few seconds you should be able to monitor your job on the Job Control dialog box. You first see your job on the Pending Jobs tab. The job quickly moves to the Running Jobs tab once the job starts running.
The following sections describe how to submit more complex jobs through the grid engine system.
Shell scripts, also called batch jobs, are sequences of command-line instructions that are assembled in a file. Script files are made executable by the chmod command. When a script is invoked, a command interpreter is started, and each instruction is interpreted as if the instruction were typed manually by the user who is running the script. Typical command interpreters are csh, tcsh, sh, and ksh. You can invoke arbitrary commands, applications, and other shell scripts from within a shell script.
The command interpreter can be invoked as login shell. To do so, the name of the command interpreter must be contained in the login_shells list of the grid engine system configuration that is in effect for the particular host and queue that is running the job.
The grid engine system configuration might be different for the various hosts and queues configured in your cluster. You can display the effective configurations with the -sconf and -sq options of the qconf command. For detailed information, see the qconf(1) man page.
If the command interpreter is invoked as a login shell, the environment of your job is the same as if you had logged in and run the script. When using csh, for example, .login and .cshrc are executed in addition to the system default startup resource files, such as /etc/login, whereas only .cshrc is executed if csh is not invoked as a login shell. For a description of the difference between being invoked and not being invoked as a login shell, see the man page of your command interpreter.
Example 3–1 is a simple shell script. The script first compiles the application flow from its Fortran77 source and then runs the application.
#!/bin/csh
# This is a sample script file for compiling and
# running a sample FORTRAN program under N1 Grid Engine 6
cd TEST
# Now we need to compile the program "flow.f" and
# name the executable "flow".
f77 flow.f -o flow
# Once it is compiled, we can run the program.
flow
Your local system user's guide provides detailed information about building and customizing shell scripts. You might also want to look at the sh, ksh, csh, or tcsh man page. The following sections emphasize special things that you should consider when you prepare batch scripts for the grid engine system.
In general, you can submit to the grid engine system all shell scripts that you can run from your command prompt by hand. Such shell scripts must not require a terminal connection, and the scripts must not need interactive user intervention. The exceptions are the standard error and standard output devices, which are automatically redirected. Therefore, Example 3–1 is ready to be submitted to the grid engine system and the script will perform the desired action.
Some extensions to regular shell scripts influence the behavior of scripts that run under grid engine system control. The following sections describe these extensions.
At submit time, you can specify the command interpreter to use to process the job script file as shown in Figure 3–5. However, if nothing is specified, the configuration variable shell_start_mode determines how the command interpreter is selected:
If shell_start_mode is set to unix_behavior, the first line of the script file specifies the command interpreter. The first line of the script file must begin with #!. If the first line does not begin with #!, the Bourne Shell sh is used by default.
For all other settings of shell_start_mode, the default command interpreter is determined by the shell parameter for the queue where the job starts. See Displaying Queues and Queue Properties and the queue_conf(5) man page.
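Under unix_behavior, the interpreter therefore comes from the script's first line. A minimal sketch of that selection rule:

```shell
#!/bin/sh
# With shell_start_mode set to unix_behavior, the #! line above selects
# /bin/sh as the command interpreter for this job script. Without a
# leading #!, the Bourne shell sh would be used by default.
first_line='#!/bin/sh'
interpreter=${first_line#"#!"}
echo "interpreter named in the first line: ${interpreter}"
```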
Since batch jobs do not have a terminal connection, their standard output and their standard error output must be redirected into files. The grid engine system enables the user to define the location of the files to which the output is redirected. Defaults are used if no output files are specified.
The standard location for the files is the current working directory where the jobs run. The default standard output file name is job-name.ojob-id, and the default standard error output file name is job-name.ejob-id. The job-name can be built from the script file name or defined by the user. See, for example, the -N option in the submit(1) man page. job-id is a unique identifier that is assigned to the job by the grid engine system.
For array job tasks, the task identifier is added to these file names, separated by a dot. The resulting standard redirection paths are job-name.ojob-id.task-id and job-name.ejob-id.task-id. For more information, see Submitting Array Jobs.
If the standard locations are not suitable, the user can specify output redirections with QMON, as shown in Figure 3–6, or with the -e and -o options to the qsub command. Standard output and standard error output can be merged into one file. The redirections can be specified on a per-execution-host basis, in which case the location of the output redirection file depends on the host on which the job is executed. To build custom but unique redirection file paths, use dummy environment variables together with the qsub -e and -o options. A list of these variables follows.
When the job runs, these variables are expanded into the actual values, and the redirection path is built with these values.
See the qsub(1) man page for further details.
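As a sketch of this expansion, with hypothetical values standing in for the job name and job ID that the grid engine system substitutes at run time (see the qsub(1) man page for the variable names actually accepted by -o and -e):

```shell
#!/bin/sh
# Hypothetical values; at run time the grid engine system substitutes
# the real job name and job ID into the redirection path.
JOB_NAME="flow"
JOB_ID="42"
redirection_path="${JOB_NAME}.o${JOB_ID}"
echo "expanded redirection path: ${redirection_path}"
```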
Lines with a leading # sign are treated as comments in shell scripts. However, the grid engine system recognizes special comment lines and uses these lines in a special way. The special comment script line is treated as part of the command line argument list of the qsub command. The qsub options that are supplied within these special comment lines are also interpreted by the QMON Submit Job dialog box. The corresponding parameters are preset when a script file is selected.
By default, the special comment lines are identified by the #$ prefix string. You can redefine the prefix string with the qsub -C command.
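For example, a script submitted with a redefined prefix might look like the following sketch. The #% prefix and the option choices are hypothetical; note that such lines are ordinary comments to the shell itself:

```shell
#!/bin/sh
# Submitted with a redefined prefix: qsub -C "#%" myjob.sh (hypothetical).
# The grid engine system would read the #% lines as embedded qsub options;
# to the shell itself they are plain comments.
#% -S /bin/sh
#% -j y
prefix="#%"
echo "embedded-option prefix: ${prefix}"
```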
This use of special comments is called script embedding of submit arguments. The following example shows a script file that uses script-embedded command-line options.
#!/bin/csh
# Force csh if not Grid Engine default shell
#$ -S /bin/csh
# This is a sample script file for compiling and
# running a sample FORTRAN program under N1 Grid Engine 6
#
# We want Grid Engine to send mail
# when the job begins
# and when it ends.
#$ -M EmailAddress
#$ -m b e
#
# We want to name the file for the standard output
# and standard error.
#$ -o flow.out -j y
#
# Change to the directory where the files are located.
cd TEST
# Now we need to compile the program "flow.f" and
# name the executable "flow".
f77 flow.f -o flow
# Once it is compiled, we can run the program.
flow
When a job runs, several variables are preset into the job's environment.
ARC – The architecture name of the node on which the job is running. The name is compiled into the sge_execd binary.
SGE_ROOT – The root directory of the grid engine system as set for sge_execd before startup, or the default /usr/SGE directory.
SGE_BINARY_PATH – The directory in which the grid engine system binaries are installed.
SGE_JOB_SPOOL_DIR – The directory used by sge_shepherd to store job-related data while the job runs.
SGE_O_HOME – The path to the home directory of the job owner on the host from which the job was submitted.
SGE_O_LOGNAME – The login name of the job owner on the host from which the job was submitted.
SGE_O_MAIL – The content of the MAIL environment variable in the context of the job submission command.
SGE_O_PATH – The content of the PATH environment variable in the context of the job submission command.
SGE_O_SHELL – The content of the SHELL environment variable in the context of the job submission command.
SGE_O_TZ – The content of the TZ environment variable in the context of the job submission command.
SGE_O_WORKDIR – The working directory of the job submission command.
SGE_CKPT_ENV – The checkpointing environment under which a checkpointing job runs. The checkpointing environment is selected with the qsub -ckpt command.
SGE_CKPT_DIR – The path ckpt_dir of the checkpoint interface. Set only for checkpointing jobs. For more information, see the checkpoint(5) man page.
SGE_STDERR_PATH – The path name of the file to which the standard error stream of the job is diverted. This file is commonly used for enhancing the output with error messages from prolog, epilog, parallel environment start and stop scripts, or checkpointing scripts.
SGE_STDOUT_PATH – The path name of the file to which the standard output stream of the job is diverted. This file is commonly used for enhancing the output with messages from prolog, epilog, parallel environment start and stop scripts, or checkpointing scripts.
SGE_TASK_ID – The task identifier in the array job represented by this task.
ENVIRONMENT – Always set to BATCH. This variable indicates that the script is run in batch mode.
HOME – The user's home directory path as taken from the passwd file.
HOSTNAME – The host name of the node on which the job is running.
JOB_ID – A unique identifier assigned by the sge_qmaster daemon when the job was submitted. The job ID is a decimal integer from 1 through 9,999,999.
JOB_NAME – The job name, which is built from the file name provided with the qsub command, a period, and the digits of the job ID. You can override this default with qsub -N.
LOGNAME – The user's login name as taken from the passwd file.
NQUEUES – The number of queues that are allocated for the job. This number is always 1 for serial jobs.
NSLOTS – The number of queue slots in use by a parallel job.
PATH – A default shell search path of: /usr/local/bin:/usr/ucb:/bin:/usr/bin.
PE – The parallel environment under which the job runs. This variable is for parallel jobs only.
PE_HOSTFILE – The path of a file that contains the definition of the virtual parallel machine that is assigned to a parallel job by the grid engine system. This variable is used for parallel jobs only. See the description of the $pe_hostfile parameter in sge_pe for details on the format of this file.
REQUEST – The request name of the job. The name is either the job script file name or is explicitly assigned to the job by the qsub -N command.
RESTARTED – Indicates whether a checkpointing job was restarted. If set to 1, the job was interrupted at least once and has therefore been restarted.
SHELL – The user's login shell as taken from the passwd file.
SHELL is not necessarily the shell that is used for the job.
TMPDIR – The absolute path to the job's temporary working directory.
TMP – The same as TMPDIR. This variable is provided for compatibility with NQS.
TZ – The time zone variable imported from sge_execd, if set.
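A short job script can inspect these preset variables. Outside a running job they are normally unset, hence the fallbacks in this sketch:

```shell
#!/bin/sh
# Inside a running job, the grid engine system presets these variables.
# Outside a job they are normally unset, so fallbacks are printed here.
echo "job id:         ${JOB_ID:-unset}"
echo "job name:       ${JOB_NAME:-unset}"
echo "submit workdir: ${SGE_O_WORKDIR:-unset}"
echo "mode:           ${ENVIRONMENT:-unset}"
```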
Extended jobs and advanced jobs are more complex forms of job submission. Before attempting to submit such jobs, you should understand some important background information about the process. The following sections describe those job processes.
The General tab of the Submit Job dialog box enables you to configure the following parameters for an extended job. The General tab is shown in Figure 3–2.
Prefix – A prefix string that is used for script-embedded submit options. See Active Comments for details.
Job Script – The job script to use. Click the icon at the right of the Job Script field to open a file selection box. The file selection box is shown in Figure 3–4.
Job Tasks – The task ID range for submitting array jobs. See Submitting Array Jobs for details.
Job Name – The name of the job. A default is set after you select a job script.
Job Args – Arguments to the job script.
Priority – A counting box for setting the job's initial priority. This priority ranks a single user's jobs and tells the scheduler how to choose among that user's jobs when several of them are in the system simultaneously.
To enable users to set the priorities of their own jobs, the administrator must enable priorities with the weight_priority parameter of the scheduler configuration. For more information, see Chapter 5, Managing Policies and the Scheduler, in Sun N1 Grid Engine 6.1 Administration Guide.
Job Share – Defines the share of the job's tickets relative to other jobs. The job share influences only the share tree policy and the functional policy.
Start At – The time at which the job is considered eligible for execution. Click the icon at the right of the Start At field to open a dialog box for entering the correctly formatted time:
Project – The project to which the job is subordinated. Click the icon at the right of the Project field to select among the available projects:
Current Working Directory – A flag that indicates whether to execute the job in the current working directory. Use this flag only for identical directory hierarchies between the submit host and the potential execution hosts.
Shell – The command interpreter to use to run the job script. See How a Command Interpreter Is Selected for details. Click the icon at the right of the Shell field to open a dialog box for entering the command interpreter specifications of the job:
Merge Output – A flag indicating whether to merge the job's standard output and standard error output together into the standard output stream.
stdout – The standard output redirection to use. See Output Redirection for details. A default is used if nothing is specified. Click the icon at the right of the stdout field to open a dialog box for entering the output redirection alternatives:
stderr – The standard error output redirection to use, similar to the standard output redirection.
stdin – The standard input file to use, similar to the standard output redirection.
Request Resources – Click this button to define the resource requirement for your job. If resources are requested for a job, the button changes its color.
Restart depends on Queue – Click this button to define whether the job can be restarted after being aborted by a system crash or similar events. This button also controls whether the restart behavior depends on the queue or is demanded by the job.
Notify Job – A flag indicating whether the job is to be notified by SIGUSR1 or by SIGUSR2 signals if the job is about to be suspended or cancelled.
Hold Job – A flag indicating that either a user hold or a job dependency is to be assigned to the job. The job is not eligible for execution as long as any type of hold is assigned to the job. See Monitoring and Controlling Jobs for more details. The Hold Job field enables restricting the hold only to a specific range of tasks of an array job. See Submitting Array Jobs for information about array jobs.
Start Job Immediately – A flag that forces the job to be started immediately if possible, or to be rejected otherwise. Jobs are not queued if this flag is selected.
Job Reservation – A flag specifying that resources should be reserved for this job. See Resource Reservation and Backfilling in Sun N1 Grid Engine 6.1 Administration Guide for details.
The buttons at the right side of the Submit Job dialog box enable you to start various actions:
Submit – Submit the currently specified job.
Edit – Edit the selected script file in an X terminal, using either vi or the editor defined by the EDITOR environment variable.
Clear – Clear all settings in the Submit Job dialog box, including any specified resource requests.
Reload – Reload the specified script file, parse any script-embedded options, parse default settings, and discard intermediate manual changes to these settings. For more information, see Active Comments and Default Request Files. This action is equivalent to a Clear action followed by respecification of the previous script file. The option has an effect only if a script file is already selected.
Save Settings – Save the current settings to a file. Use the file selection box to select the file. The saved files can either be loaded later or be used as default requests. For more information, see Load Settings and Default Request Files.
Load Settings – Load settings previously saved with the Save Settings button. The loaded settings overwrite the current settings. See Save Settings.
Done – Close the Submit Job dialog box.
Figure 3–5 shows the Submit Job dialog box with most of the parameters set.
The parameters of the job configured in the example are:
The job has the script file flow.sh, which must reside in the working directory of QMON.
The job is called Flow.
The script file takes the single argument big.data.
The job starts with priority 3.
The job is eligible for execution not before 4:30.44 PM on the 22nd of April, 2004.
The project definition means that the job is subordinated to project crash.
The job is executed in the submission working directory.
The job uses the tcsh command interpreter.
Standard output and standard error output are merged into the file flow.out, which is created in the current working directory.
To submit the extended job request that is shown in Figure 3–5 from the command line, type the following command:
% qsub -N Flow -p -111 -P devel -a 200404221630.44 -cwd \
    -S /bin/tcsh -o flow.out -j y flow.sh big.data
The Advanced tab of the Submit Job dialog box enables you to define the following additional parameters:
Parallel Environment – A parallel environment interface to use
Environment – A set of environment variables to set for the job before the job runs. Click the icon at the right of the Environment field to open a dialog box that enables you to define the environment variables to export:
Environment variables can be taken from QMON's runtime environment, or you can define your own environment variables.
Context – A list of name/value pairs that can be used to store and communicate job-related information. This information is accessible anywhere from within a cluster. You can modify context variables from the command line with the -ac, -dc, and -sc options to qsub, qrsh, qsh, qlogin, and qalter. You can retrieve context variables with the qstat -j command.
Checkpoint Object – The checkpointing environment to use if checkpointing the job is desirable and suitable. See Using Job Checkpointing for details.
Account – An account string to associate with the job. The account string is added to the accounting record that is kept for the job. The accounting record can be used for later accounting analysis.
Verify Mode – The Verify flag determines the consistency checking mode for your job. To check for consistency of the job request, the grid engine system assumes an empty and unloaded cluster. The system tries to find at least one queue in which the job could run. Possible checking modes are as follows:
Skip – No consistency checking at all.
Warning – Inconsistencies are reported, but the job is still accepted. Warning mode might be desirable if the cluster configuration should change after the job is submitted.
Error – Inconsistencies are reported. The job is rejected if any inconsistencies are encountered.
Just verify – The job is not submitted. An extensive report is generated about the suitability of the job for each host and queue in the cluster.
Mail – The events about which the user is notified by email. The events start, end, abort, and suspend are currently defined for jobs.
Mail To – A list of email addresses to which these notifications are sent. Click the icon at the right of the Mail To field to open a dialog box for defining the mailing list.
Hard Queue List, Soft Queue List – A list of queue names that are requested to be the mandatory selection for the execution of the job. The Hard Queue List and the Soft Queue List are treated identically to a corresponding resource requirement.
Master Queue List – A list of queue names that are eligible as master queue for a parallel job. A parallel job is started in the master queue. All other queues to which the job spawns parallel tasks are called slave queues.
Job Dependencies – A list of IDs of jobs that must finish before the submitted job can be started. The newly created job depends on completion of those jobs.
Deadline – The deadline initiation time for deadline jobs. Deadline initiation defines the point in time at which a deadline job must reach maximum priority to finish before a given deadline. To determine the deadline initiation time, subtract an estimate of the running time, at maximum priority, of a deadline job from its desired deadline time. Click the icon at the right of the Deadline field to open the dialog box that enables you to set the deadline.
Not all users are allowed to submit deadline jobs. Ask your system administrator if you are permitted to submit deadline jobs. Contact the cluster administrator for information about the maximum priority that is given to deadline jobs.
Figure 3–6 shows an example of an advanced job submission.
The job defined in Extended Job Example has the following additional characteristics as compared to the job definition in Submitting Extended Jobs With QMON.
The job requires the use of the parallel environment mpi. The job needs at least 4 parallel processes to be created. The job can use up to 16 processes if the processes are available.
Two environment variables are set and exported for the job.
Two context variables are set.
The account string FLOW is to be added to the job accounting record.
Mail is sent to me@myhost.com as soon as the job starts and when it finishes.
The job should preferably be executed in the queue big_q.
To submit the advanced job request that is shown in Figure 3–6 from the command line, type the following command:
% qsub -N Flow -p -111 -P devel -a 200012240000.00 -cwd \
    -S /bin/tcsh -o flow.out -j y -pe mpi 4-16 \
    -v SHARED_MEM=TRUE,MODEL_SIZE=LARGE \
    -ac JOB_STEP=preprocessing,PORT=1234 \
    -A FLOW -w w -m s,e -q big_q \
    -M me@myhost.com,me@other.address \
    flow.sh big.data
The preceding command shows that advanced job requests can be rather complex and unwieldy, in particular if similar requests need to be submitted frequently. To avoid the cumbersome and error-prone task of entering such commands, users can embed qsub options in the script files, or use default request files. For more information, see Active Comments.
When specified with the yes argument, the -binary yes|no option allows you to use qrsh to submit executable jobs without the script wrapper. See the qsub(1) man page for details.
The cluster administration can set up a default request file for all grid engine system users. Users, on the other hand, can create private default request files located in their home directories. Users can also create application-specific default request files that are located in their working directories.
Default request files contain the qsub options to apply by default to the jobs in one or more lines. The location of the global cluster default request file is sge-root/cell/common/sge_request. The private general default request file is located under $HOME/.sge_request. The application-specific default request files are located under $cwd/.sge_request.
If more than one of these files are available, the files are merged into one default request, with the following order of precedence:
Application-specific default request file
General private default request file
Global default request file
Script embedding and the qsub command line have higher precedence than the default request files. Therefore, script embedding overrides default request file settings. The qsub command line options can override these settings again.
To discard any previous settings, use the qsub -clear command in a default request file, in embedded script commands, or in the qsub command line.
Here is an example of a private default request file:
-A myproject -cwd -M me@myhost.com -m b e -r y -j y -S /bin/ksh
Unless overridden, for all of this user's jobs the following is true:
The account string is myproject
The jobs execute in the current working directory
Mail notification is sent to me@myhost.com at the beginning and at the end of the jobs
The standard output and standard error output are merged
The ksh is used as command interpreter
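The following sketch writes such a file to a temporary location rather than to a real $HOME/.sge_request, to avoid changing any live configuration; it splits the same options over two lines to show that default request files may span multiple lines:

```shell
#!/bin/sh
# Write a sample default request file to a temporary path (not to the
# real $HOME/.sge_request, so no live configuration is touched).
request_file=$(mktemp)
cat > "${request_file}" <<'EOF'
-A myproject -cwd -M me@myhost.com -m b e
-r y -j y -S /bin/ksh
EOF
line_count=$(wc -l < "${request_file}")
line_count=$((line_count))
echo "default request file spans ${line_count} line(s)"
rm -f "${request_file}"
```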
In the examples so far, the submit options do not express any resource requirements for the hosts on which the jobs are to be executed. The grid engine system assumes that such jobs can be run on any host. In practice, however, most jobs require that certain prerequisites be met on the executing host in order for the job to finish successfully. Such prerequisites include enough available memory, required software to be installed, or a certain operating system architecture. Also, the cluster administration usually imposes restrictions on the use of the machines in the cluster. For example, the CPU time that can be consumed by the jobs is often restricted.
The grid engine system provides users with the means to find suitable hosts for their jobs without precise knowledge of the cluster's equipment and its usage policies. Users specify the requirements of their jobs and let the grid engine system manage the task of finding a suitable and lightly loaded host.
You specify resource requirements through requestable attributes, which are described in Requestable Attributes. QMON provides a convenient way to specify the requirements of a job. The Requested Resources dialog box displays only those attributes in the Available Resource list that are currently eligible. Click Request Resources in the Submit Job dialog box to open the Requested Resources dialog box. See Figure 3–7 for an example.
When you double-click an attribute, the attribute is added to the Hard or Soft Resources list of the job. A dialog box opens to guide you in entering a value specification for the attribute in question, except for BOOLEAN attributes, which are set to True. For more information, see How the Grid Engine System Allocates Resources.
Figure 3–7 shows a resource profile for a job that requests a solaris64 host with an available permas license offering at least 750 MBytes of memory. If more than one queue that fulfills this specification is found, any defined soft resource requirements are taken into account. However, if no queue satisfying both the hard and the soft requirements is found, any queue that grants the hard requirements is considered suitable.
The queue_sort_method parameter of the scheduler configuration determines where to start the job only if more than one queue is suitable for a job. See the sched_conf(5) man page for more information.
The attribute permas, an integer, is an administrator extension to the global resource attributes. The attribute arch, a string, is a host resource attribute. The attribute h_vmem, memory, is a queue resource attribute.
An equivalent resource requirement profile can as well be submitted from the qsub command line:
% qsub -l arch=solaris64,h_vmem=750M,permas=1 \
    permas.sh
The implicit -hard switch before the first -l option is omitted here.
The notation 750M for 750 MBytes is an example of the quantity syntax of the grid engine system. For those attributes that request a memory consumption, you can specify integer decimal, floating-point decimal, integer octal, or integer hexadecimal numbers. The following multipliers can be appended to these numbers:
k – Multiplies the value by 1000
K – Multiplies the value by 1024
m – Multiplies the value by 1000 times 1000
M – Multiplies the value by 1024 times 1024
Octal constants are specified by a leading zero and digits ranging from 0 to 7 only. Hexadecimal constants are specified by the prefix 0x and the digits 0 through 9, a through f, and A through F. If no multiplier is appended, the value counts as bytes. If you use a floating-point decimal, the resulting value is truncated to an integer value.
For attributes that impose a time limit, you can specify time values in terms of hours, minutes, and seconds, or any combination of the three. Hours, minutes, and seconds are specified as decimal digits separated by colons. A time of 3:5:11 is translated to 11111 seconds. If hours, minutes, or seconds are zero, you can omit the digits as long as the colons remain. Thus a value of :5: is interpreted as 5 minutes. The form used in the Requested Resources dialog box that is shown in Figure 3–7 is an extension that is valid only within QMON.
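The conversions behind this quantity and time syntax can be checked with ordinary shell arithmetic. The following lines are not grid engine commands; they only verify the multiplier and time translations described above:

```shell
# 750M (capital M) of memory means 750 * 1024 * 1024 bytes
echo $((750 * 1024 * 1024))      # 786432000

# 750m (lowercase m) would mean 750 * 1000 * 1000 bytes
echo $((750 * 1000 * 1000))      # 750000000

# The time 3:5:11 (3 hours, 5 minutes, 11 seconds) in seconds
echo $((3 * 3600 + 5 * 60 + 11)) # 11111
```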
As the previous section shows, it is important to know how grid engine software processes resource requests and allocates resources. Schematically, the resource allocation algorithm of the grid engine software works as follows.
Read in and parse all default request files. See Default Request Files for details.
Process the script file for embedded options. See Active Comments for details.
All script-embedded options are read when the job is submitted, regardless of their position in the script file.
Read and parse all requests from the command line.
As soon as all qsub requests are collected, hard and soft requests are processed separately, the hard requests first. The requests are evaluated, according to the following order of precedence:
From left to right of the script or default request file
From top to bottom of the script or default request file
From left to right of the command line
In other words, you can use the command line to override the embedded flags.
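As an illustration of this order of precedence, consider a small script with an embedded memory request. The script name and the values are hypothetical:

```
#!/bin/sh
# override.sh (hypothetical): the embedded option requests 100 MBytes
#$ -l h_vmem=100M
hostname
```

Submitting this script with `qsub -l h_vmem=750M override.sh` causes the command-line request of 750M to override the embedded request of 100M, because command-line requests are evaluated last.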
The resources requested as hard are allocated. If a request is not valid, the submission is rejected. If one or more requests cannot be met at submit time, the job is spooled and rescheduled to be run at a later time. A request might not be met, for example, if a requested queue is busy. If all hard requests can be met, the requests are allocated and the job can be run.
The resources requested as soft are checked. The job can run even if some or all of these requests cannot be met. If multiple queues that meet the hard requests provide parts of the soft resources list, the grid engine software selects the queues that offer the most soft requests.
The job is started and occupies the allocated resources.
To gain experience of how argument list options and embedded options, or hard and soft requests, influence each other, you can experiment with small test script files that execute UNIX commands such as hostname or date.
Often the most convenient way to build a complex task is to split the task into subtasks. In these cases, some subtasks cannot start until other subtasks have completed. For example, a predecessor task might produce an output file that must be read and processed by a dependent task.
The grid engine system supports interdependent tasks with its job dependency facility. You can configure jobs to depend on the completion of one or more other jobs. The facility is invoked through the qsub -hold_jid option, with which you specify a list of jobs upon which the submitted job depends. The list of jobs can also contain subsets of array jobs. The submitted job is not eligible for execution unless all jobs in the dependency list have finished.
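As a sketch of this facility, the following commands combine -hold_jid with job names assigned through the -N option. The script and job names are illustrative only:

```
% qsub -N prepare prepare.sh
% qsub -N process -hold_jid prepare process.sh
```

The second job remains ineligible for execution until the job named prepare has finished. Numeric job IDs can be given in the dependency list as well.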
Parameterized and repeated execution of the same set of operations that are contained in a job script is an ideal application for the array job facility of the grid engine system. Typical examples of such applications are found in the Digital Content Creation industries for tasks such as rendering. Computation of an animation is split into frames. The same rendering computation can be performed for each frame independently.
The array job facility offers a convenient way to submit, monitor, and control such applications. The grid engine system provides an efficient implementation of array jobs, handling the computations as an array of independent tasks joined into a single job. The tasks of an array job are referenced through an array index number. The indexes for all tasks span an index range for the entire array job. The index range is defined during submission of the array job by a single qsub command.
You can monitor and control an array job. For example, you can suspend, resume, or cancel an array job as a whole or by individual task or subset of tasks. To reference the tasks, the corresponding index numbers are suffixed to the job ID. Tasks are executed very much like regular jobs. Tasks can use the environment variable SGE_TASK_ID to retrieve their own task index number and to access input data sets designated for this task identifier.
Follow the instructions in How To Submit a Simple Job With QMON, additionally taking into account the following information.
Submitting array jobs from QMON works virtually identically to submitting a simple job, as described in How To Submit a Simple Job With QMON. The only difference is that the Job Tasks input window that is shown in Figure 3–5 must contain the task range specification. The task range specification uses syntax that is identical to that of the qsub -t option. See the qsub(1) man page for detailed information about array index syntax.
For information about monitoring and controlling jobs in general, and about array jobs in particular, see Monitoring and Controlling Jobs and Monitoring and Controlling Jobs From the Command Line. See also the man pages for qstat(1), qhold(1), qrls(1), qmod(1), and qdel(1).
Array jobs offer full access to all facilities of the grid engine system that are available for regular jobs. In particular, array jobs can be parallel jobs at the same time. Array jobs also can have interdependencies with other jobs.
Array tasks cannot have interdependencies with other jobs or with other array tasks.
To submit an array job from the command line, type the qsub command with appropriate arguments.
The following is an example of how to submit an array job:
% qsub -l h_cpu=0:45:0 -t 2-10:2 render.sh data.in |
The -t option defines the task index range. In this case, 2-10:2 specifies that 2 is the lowest index number, and 10 is the highest index number. Only every second index, the :2 part of the specification, is used. Thus, the array job is made up of 5 tasks with the task indices 2, 4, 6, 8, and 10. Each task requests a hard CPU time limit of 45 minutes with the -l option. Each task executes the job script render.sh once the task is dispatched and started by the grid engine system. Tasks can use SGE_TASK_ID to find their index number, which they can use to find their input data record in the data file data.in.
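Inside the job script, SGE_TASK_ID can be used to select the task's own input record. The following sketch, with hypothetical file contents, picks line number SGE_TASK_ID out of data.in. Outside the grid engine system, you can try the same logic by setting the variable manually, as the demonstration does here:

```shell
#!/bin/sh
# Sketch of render.sh: each task reads line number SGE_TASK_ID of data.in.
# The grid engine system sets SGE_TASK_ID per task; for a manual try-out,
# fall back to a fixed index:
SGE_TASK_ID=${SGE_TASK_ID:-4}

# Create a small sample input file for this demonstration only
printf 'frame-001\nframe-002\nframe-003\nframe-004\nframe-005\n' > data.in

# Pick this task's input record (one line per task index)
INPUT=$(sed -n "${SGE_TASK_ID}p" data.in)
echo "task ${SGE_TASK_ID} renders ${INPUT}"
rm -f data.in
```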
Submitting interactive jobs instead of batch jobs is useful in situations where a job requires your direct input to influence the job results. Such situations are typical for X Window System applications or for tasks in which your interpretation of immediate results is required to steer further processing.
You can create interactive jobs in three ways:
qlogin – A telnet-like session that is started on a host selected by grid engine software.
qrsh – The equivalent of the standard UNIX rsh facility. A command is run remotely on a host selected by the grid engine system. If no command is specified, an rlogin session is started on the remote host.
qsh – An xterm that is displayed from the machine that is running the job. The display is set according to your specification or to the setting of the DISPLAY environment variable. If the DISPLAY variable is not set, and if no display destination is defined, the grid engine system directs the xterm to the 0.0 screen of the X server on the host from which the job was submitted.
To function correctly, all the facilities need proper configuration of cluster parameters of the grid engine system. The correct xterm execution paths must be defined for qsh. Interactive queues must be available for this type of job. Contact your system administrator to find out if your cluster is prepared for interactive job execution.
The default handling of interactive jobs differs from the handling of batch jobs. Interactive jobs are not queued if they cannot be executed when they are submitted. Instead, the failure to queue immediately indicates that not enough appropriate resources are available to dispatch the interactive job at submit time. In such cases, the user is notified that the cluster is currently too busy.
You can change this default behavior with the -now no option to qsh, qlogin, and qrsh. If you use this option, interactive jobs are queued like batch jobs. When you use the -now yes option, batch jobs that are submitted with qsub can also be handled like interactive jobs. Such batch jobs are either dispatched for running immediately, or they are rejected.
Interactive jobs can be run only in queues of the type INTERACTIVE. See Configuring Queues in Sun N1 Grid Engine 6.1 Administration Guide for details.
The following sections describe how to use the qlogin and qsh facilities. The qrsh command is explained in a broader context in Transparent Remote Execution.
The only type of interactive jobs that you can submit from QMON are jobs that bring up an xterm on a host selected by the grid engine system.
At the right side of the Submit Job dialog box, click the button above the Submit button until the Interactive icon is displayed. Doing so prepares the Submit Job dialog box to submit interactive jobs. See Figure 3–8 and Figure 3–9.
The meaning and the use of the selection options in the dialog box are the same as described for batch jobs in Submitting Batch Jobs. The difference is that several input fields are grayed out because those fields do not apply to interactive jobs.
qsh is very similar to qsub. qsh supports several of the qsub options, as well as the additional option -display to direct the display of the xterm to be invoked. See the qsub(1) man page for details.
To submit an interactive job with qsh, type a command like the following:
% qsh -l arch=solaris64 |
This command starts an xterm on any available Sun Solaris 64–bit operating system host.
Use the qlogin command from any terminal or terminal emulation to start an interactive session under the control of the grid engine system.
To submit an interactive job with qlogin, type a command like the following:
% qlogin -l star-cd=1,h_cpu=6:0:0 |
This command locates a low-loaded host with a Star-CD license available and with at least one queue that can provide a hard CPU time limit of at least six hours.
Depending on the remote login facility that is configured to be used by the grid engine system, you might have to provide your user name, your password, or both, at a login prompt.
The grid engine system provides a set of closely related facilities that support the transparent remote execution of certain computational tasks. The core tool for this functionality is the qrsh command, which is described in Remote Execution With qrsh. Two high-level facilities, qtcsh and qmake, build on top of qrsh. These two commands enable the grid engine system to transparently distribute implicit computational tasks, thereby enhancing the standard UNIX facilities make and csh. qtcsh is described in Transparent Job Distribution With qtcsh. qmake is described in Parallel Makefile Processing With qmake.
qrsh is built around the standard rsh facility. See the information that is provided in sge-root/3rd_party for details on the involvement of rsh. qrsh can be used for various purposes, including the following:
To provide remote execution of interactive applications that use the grid engine system comparable to the standard UNIX facility rsh. rsh is also called remsh on HP-UX systems.
To offer interactive login session capabilities that use the grid engine system, similar to the standard UNIX facility rlogin. qlogin is still required as a grid engine system's representation of the UNIX telnet facility.
To allow for the submission of batch jobs that support terminal I/O (standard output, standard error, and standard input) and terminal control.
To provide a way to submit a standalone program that is not embedded in a shell script.
You can also submit scripts with qrsh by using the -b n option. For more information, see the qrsh man page.
To provide a submission client that remains active while a batch job is pending or running and that goes away only if the job finishes or is cancelled.
To allow for the grid engine system-controlled remote running of job tasks within the framework of the dispersed resources allocated by parallel jobs. See Tight Integration of Parallel Environments and Grid Engine Software in Sun N1 Grid Engine 6.1 Administration Guide.
By virtue of these capabilities, qrsh is the major enabling infrastructure for the implementation of the qtcsh and the qmake facilities. qrsh is also used for the tight integration of the grid engine system with parallel environments such as MPI or PVM.
Type the qrsh command, adding options and arguments according to the following syntax:
% qrsh [options] program|shell-script [arguments] \
   [> stdout] [>&2 stderr] [< stdin] |
qrsh understands almost all options of qsub. qrsh provides the following options:
-now yes|no – -now yes specifies that the job is scheduled immediately. The job is rejected if no appropriate resources are available. -now yes is the default. -now no specifies that the job is queued like a batch job if the job cannot be started at submission time.
-inherit – qrsh does not go through the scheduling process to start a job-task. Instead, qrsh assumes that the job is embedded in a parallel job that already has allocated suitable resources on the designated remote execution host. This form of qrsh is commonly used in qmake and in a tight parallel environment integration. The default is not to inherit external job resources.
-binary yes|no – When specified as -binary no, enables you to use qrsh to submit script jobs.
-noshell – Do not start the command line that is given to qrsh in a user's login shell. Instead, execute the command without the wrapping shell. Use this option to speed up execution, because some overhead, such as shell startup and the sourcing of shell resource files, is avoided.
-nostdin – Suppresses the input stream STDIN. With this option set, qrsh passes the -n option to the rsh command. Suppression of the input stream is especially useful if multiple tasks are executed in parallel using qrsh, for example, in a make process. Which process gets the input is undefined.
-verbose – This option presents output on the scheduling process. -verbose is mainly intended for debugging purposes and is therefore switched off by default.
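The following command lines sketch how some of these options might be combined. The architecture value, command, and script name are illustrative only:

```
% qrsh -now no -l arch=solaris64 hostname
% qrsh -noshell -nostdin /bin/date
% qrsh -verbose -b n my_script.sh
```

The first command queues the request like a batch job if no solaris64 host is immediately available; the second runs date without a wrapping login shell or input stream; the third submits a hypothetical script with scheduling output enabled.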
qtcsh is a fully compatible replacement for the widely known and used UNIX C shell derivative tcsh. qtcsh is built around tcsh. See the information that is provided in sge-root/3rd_party for details on the involvement of tcsh. qtcsh provides a command shell with the extension of transparently distributing execution of designated applications to suitable and lightly loaded hosts that use the grid engine system. The .qtask configuration files define the applications to execute remotely and the requirements that apply to the selection of an execution host.
These applications are transparent to the user and are submitted to the grid engine system through the qrsh facility. qrsh provides standard output, error output, and standard input handling as well as terminal control connection to the remotely executing application. Three noticeable differences between running such an application remotely and running the application on the same host as the shell are:
The remote host might be more powerful, lower-loaded, and have required hardware and software resources installed. Therefore, such a remote host would be much better suited than the local host, which might not allow running the application at all.
A small delay is incurred by the remote startup of the jobs and by their handling through the grid engine system.
Administrators can restrict the use of resources through interactive jobs (qrsh) and thus through qtcsh. If not enough suitable resources are available for an application to be started through qrsh, or if all suitable systems are overloaded, the implicit qrsh submission fails. A corresponding error message is returned, such as Not enough resources ... try later.
In addition to the standard use, qtcsh is a suitable platform for third-party code and tool integration. The single-application execution form of qtcsh is qtcsh -c app-name. The use of this form of qtcsh inside integration environments presents a persistent interface that almost never needs to be changed. All the required application, tool, integration, site, and even user-specific configurations are contained in appropriately defined .qtask files. A further advantage is that this interface can be used in shell scripts of any type, in C programs, and even in Java applications.
The invocation of qtcsh is exactly the same as for tcsh. qtcsh extends tcsh by providing support for the .qtask file and by offering a set of specialized shell built-in modes.
The .qtask file is defined as follows. Each line in the file has the following format:
[!]app-name qrsh-options |
The optional leading exclamation mark (!) defines the precedence between conflicting definitions in a global cluster .qtask file and the personal .qtask file of the qtcsh user. If the exclamation mark is missing in the global cluster file, a conflicting definition in the user file overrides the definition in the global cluster file. If the exclamation mark is in the global cluster file, the corresponding definition cannot be overridden.
app-name specifies the name of the application that, when typed on a command line in a qtcsh, is submitted to the grid engine system for remote execution.
qrsh-options specifies the options to the qrsh facility to use. These options define resource requirements for the application.
The application name must appear in the command line exactly as the application is defined in the .qtask file. If the application name is prefixed with a path name, a local binary is addressed. No remote execution is intended.
csh aliases are expanded before a comparison with the application names is performed. The applications intended for remote execution can also appear anywhere in a qtcsh command line, in particular before or after standard I/O redirections.
Hence, the following examples are valid and meaningful syntax:
# .qtask file
netscape -v DISPLAY=myhost:0
grep -l h=filesurfer |
Given this .qtask file, the following qtcsh command lines:
netscape
~/mybin/netscape
cat very_big_file | grep pattern | sort | uniq |
implicitly result in:
qrsh -v DISPLAY=myhost:0 netscape
~/mybin/netscape
cat very_big_file | qrsh -l h=filesurfer grep pattern | sort | uniq |
qtcsh can operate in different modes, influenced by switches that can be set on or off:
Local or remote execution of commands. Remote is the default.
Immediate or batch remote execution. Immediate is the default.
Verbose or nonverbose output. Nonverbose is the default.
The setting of these modes can be changed using option arguments of qtcsh at start time or with the shell built-in command qrshmode at runtime. See the qtcsh(1) man page for more information.
qmake is a replacement for the standard UNIX make facility. qmake extends make by enabling the distribution of independent make steps across a cluster of suitable machines. qmake is built around the popular GNU-make facility gmake. See the information that is provided in sge-root/3rd_party for details on the involvement of gmake.
To ensure that a distributed make process can run to completion, qmake first allocates the required resources in a way analogous to a parallel job. qmake then manages this set of resources without further interaction with the scheduling. qmake distributes make steps as resources become available, using the qrsh facility with the -inherit option.
qrsh provides standard output, error output, and standard input handling as well as terminal control connection to the remotely executing make step. Therefore, only three noticeable differences exist between executing a make procedure locally and using qmake:
Provided that the individual make steps have a certain duration and that enough independent make steps exist to process, parallelization will speed up the make process significantly.
Make steps that are started remotely incur a small overhead caused by qrsh and by the remote execution.
To take advantage of the make step distribution of qmake, the user must specify as a minimum the degree of parallelization. That is, the user must specify the number of concurrently executable make steps. In addition, the user can specify the resource characteristics required by the make steps, such as available software licenses, machine architecture, memory, or CPU-time requirements.
The most common use of make is the compilation of complex software packages. Compilation might not be the major application for qmake, however. Program files are often quite small as a matter of good programming practice. Therefore, compilation of a single program file, which is a single make step, often takes only a few seconds. Furthermore, compilation usually implies significant file access, for example through nested include files. This file access might not be accelerated when it is done for multiple make steps in parallel, because the file server can become a bottleneck that effectively serializes all the file access. Therefore, the compilation process sometimes cannot be accelerated in a satisfactory manner.
Other potential applications of qmake are more appropriate. An example is the steering of the interdependencies and the workflow of complex analysis tasks through makefiles. Each make step in such environments is typically a simulation or data analysis operation with nonnegligible resource and computation time requirements. A considerable acceleration can be achieved in such cases.
The command-line syntax of qmake looks similar to the syntax of qrsh:
% qmake [-pe pe-name pe-range] [options] \
   -- [gnu-make-options] [target] |
The -inherit option is also supported by qmake, as described later in this section.
Pay special attention to the use of the -pe option and its relation to the gmake -j option. You can use both options to express the amount of parallelism to be achieved. The difference is that gmake provides no way with -j to specify something like a parallel environment to use. Therefore, qmake assumes that a default environment for parallel makes is configured that is called make. Furthermore, gmake's -j allows no specification of a range, but only a single number. qmake interprets the number that is given with -j as a range of 1-n. By contrast, -pe permits the detailed specification of all these parameters. Consequently, the following command-line examples are identical:
% qmake -- -j 10
% qmake -pe make 1-10 -- |
The following command lines cannot be expressed using the -j option:
% qmake -pe make 5-10,16 --
% qmake -pe mpi 1-99999 -- |
Apart from the syntax, qmake supports two modes of invocation: interactively from the command line without the -inherit option, or within a batch job with the -inherit option. These two modes start different sequences of actions:
Interactive – When qmake is invoked on the command line, the make process is implicitly submitted to the grid engine system with qrsh. The process takes the resource requirements that are specified in the qmake command line into account. The grid engine system then selects a master machine for the execution of the parallel job that is associated with the parallel make job. The grid engine system starts the make procedure there. The procedure must start there because the make process can be architecture-dependent. The required architecture is specified in the qmake command line. The qmake process on the master machine then delegates execution of individual make steps to the other hosts that are allocated for the job. The steps are passed to qmake through the parallel environment hosts file.
Batch – In this case, qmake appears inside a batch script with the -inherit option. (If the -inherit option is not present, a new job is spawned, as described in the first case.) With -inherit, qmake makes use of the resources already allocated to the job in which qmake is embedded. qmake uses qrsh -inherit directly to start make steps. When qmake is called in batch mode, the specification of resource requirements, the -pe option, and the -j option are ignored.
Single CPU jobs also must request a parallel environment:
qmake -pe make 1 -- |
If no parallel execution is required, call qmake with gmake command-line syntax without grid engine system options and without --. This qmake command behaves like gmake.
See the qmake(1) man page for further details.
The grid engine software's policy management automatically controls the use of shared resources in the cluster to best achieve the goals of the administration. High priority jobs are dispatched preferentially. Such jobs receive better access to resources. The administration of a cluster can define high-level usage policies. The following policies are available:
Functional – Special treatment is given because of affiliation with a certain user group, project, and so forth.
Share-based – Level of service depends on an assigned share entitlement, the corresponding shares of other users and user groups, the past usage of resources by all users, and the current presence of users in the system.
Urgency – Preferential treatment is given to jobs that have greater urgency. A job's urgency is based on its resource requirements, how long the job must wait, and whether the job is submitted with a deadline requirement.
Override – Manual intervention by the cluster administrator modifies the automated policy implementation.
The grid engine software can be set up to routinely use either a share-based policy, a functional policy, or both. These policies can be combined in any proportion, from giving zero weight to one policy and using only the second policy, to giving both policies equal weight.
Along with the routine policies, jobs can be submitted with an initiation deadline. See the description of the deadline submission parameter under Submitting Advanced Jobs With QMON. Deadline jobs disturb routine scheduling. Administrators can also temporarily override share-based scheduling and functional scheduling. An override can be applied to an individual job, or to all jobs associated with a user, a department, or a project.
In addition to the four policies for mediating among all jobs, the grid engine software sometimes lets users set priorities among their own jobs. A user who submits several jobs can specify, for example, that job 3 is the most important and that jobs 1 and 2 are equally important but less important than job 3.
Priorities for jobs are set by using the QMON Submit Job parameter Priority or by using the qsub -p option. A priority range of -1024 (lowest) to 1023 (highest) can be given. This priority tells the scheduler how to choose among a single user's jobs when several of that user's jobs are in the system simultaneously. The relative importance assigned to a particular job depends on the maximum and minimum priorities that are given to any of that user's jobs, and on the priority value of the specific job.
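For example, a user who considers the third of three jobs the most important might submit the jobs as follows. The script names and priority values are illustrative only:

```
% qsub -p 0 job1.sh
% qsub -p 0 job2.sh
% qsub -p 500 job3.sh
```

Among this user's jobs, job3 ranks above job1 and job2, which rank equally. The values affect only the relative ordering of this user's own jobs.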
The functional policy, the share-based policy, and the override policy are all implemented with tickets. Each ticket policy has a ticket pool from which tickets are allocated to jobs that are entering the multimachine grid engine system. Each routine ticket policy that is in force allocates some tickets to each new job. The ticket policy can reallocate tickets to the executing job at each scheduling interval. The criteria that each ticket policy uses to allocate tickets are explained in this section.
Tickets weight the three policies. For example, if no tickets are allocated to the functional policy, that policy is not used. If an equal number of tickets are assigned to the functional ticket pool and to the share-based ticket pool, both policies have equal weight in determining a job's importance.
Grid engine managers allocate tickets to the routine ticket policies at system configuration. Managers and operators can change ticket allocations at any time. Additional tickets can be injected into the system temporarily to indicate an override. Ticket policies are combined by assignment of tickets: when tickets are allocated to multiple ticket policies, a job gets a portion of its tickets from each ticket policy in force.
The grid engine system grants tickets to jobs that are entering the system to indicate their importance under each ticket policy in force. Each running job can gain tickets, for example, from an override; lose tickets, for example, because the job is getting more than its fair share of resources; or keep the same number of tickets at each scheduling interval. The number of tickets that a job holds represents the resource share that the grid engine system tries to grant that job during each scheduling interval.
You can display the number of tickets a job holds with QMON or using qstat -ext. See Monitoring and Controlling Jobs With QMON. The qstat command also displays the priority value assigned to a job, for example, using qsub -p. See the qstat(1) man page for more details.
The grid engine system does not dispatch jobs that request nonspecific queues if the jobs cannot be started immediately. Such jobs are marked as spooled at the sge_qmaster, which tries to reschedule the jobs from time to time. The jobs are dispatched to the next suitable queue that becomes available.
In contrast to spooled jobs, jobs that are submitted to a certain queue by name go directly to the named queue, regardless of whether the jobs can be started or need to be spooled. Therefore, viewing the queues of the grid engine system as computer science batch queues is valid only for jobs requested by name. Jobs submitted with nonspecific requests use the spooling mechanism of sge_qmaster for queueing, thus using a more abstract and flexible queuing concept.
If a job is scheduled and multiple free queues meet its resource requests, the job is usually dispatched to a suitable queue belonging to the least loaded host. By setting the scheduler configuration entry queue_sort_method to seq_no, the cluster administration can change this load-dependent scheme into a fixed order algorithm. The queue configuration entry seq_no defines a precedence among the queues, assigning the highest priority to the queue with the lowest sequence number.