The following sections describe how to submit more complex jobs through the grid engine system.
A shell script, also called a batch job, is a sequence of command-line instructions that are assembled in a file. Script files are made executable by the chmod command. When a script is invoked, a command interpreter is started, and each instruction is interpreted as if it had been typed manually by the user who is running the script. csh, tcsh, sh, and ksh are typical command interpreters. You can invoke arbitrary commands, applications, and other shell scripts from within a shell script.
The command interpreter can be invoked as a login shell. To do so, the name of the command interpreter must be contained in the login_shells list of the grid engine system configuration that is in effect for the particular host and queue that is running the job.
The grid engine system configuration might be different for the various hosts and queues configured in your cluster. You can display the effective configurations with the -sconf and -sq options of the qconf command. For detailed information, see the qconf(1) man page.
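As a hedged illustration, the following fragment pulls the login_shells entry out of the configuration text that qconf -sconf prints. The helper name login_shells_of is hypothetical. The parsing assumes one attribute per line with the value in the second whitespace-separated field, so a list that is continued across several lines would need extra handling.

```shell
#!/bin/sh
# Hypothetical helper: read configuration text on stdin and print the
# value of the login_shells attribute (assumed to be the second field).
login_shells_of() {
    awk '$1 == "login_shells" { print $2 }'
}

# Only query the cluster when qconf is actually installed on this host.
if command -v qconf >/dev/null 2>&1; then
    qconf -sconf | login_shells_of
fi
```

If the name of your command interpreter appears in the printed list, the interpreter can be started as a login shell on hosts that use this configuration.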
If the command interpreter is invoked as a login shell, the environment of your job is the same as if you had logged in and run the script. With csh, for example, .login and .cshrc are executed in addition to the system default startup resource files, such as /etc/login, whereas only .cshrc is executed if csh is not invoked as a login shell. For a description of the difference between being invoked and not being invoked as a login shell, see the man page of your command interpreter.
Example 3–1 is a simple shell script. The script first compiles the application flow from its Fortran77 source and then runs the application.
    #!/bin/csh
    # This is a sample script file for compiling and
    # running a sample FORTRAN program under N1 Grid Engine 6
    cd TEST
    # Now we need to compile the program "flow.f" and
    # name the executable "flow".
    f77 flow.f -o flow
Your local system user's guide provides detailed information about building and customizing shell scripts. You might also want to look at the sh, ksh, csh, or tcsh man page. The following sections emphasize special things that you should consider when you prepare batch scripts for the grid engine system.
In general, you can submit to the grid engine system all shell scripts that you can run from your command prompt by hand. Such shell scripts must not require a terminal connection, and the scripts must not need interactive user intervention. The exceptions are the standard error and standard output devices, which are automatically redirected. Therefore, Example 3–1 is ready to be submitted to the grid engine system and the script will perform the desired action.
Some extensions to regular shell scripts influence the behavior of scripts that run under grid engine system control. The following sections describe these extensions.
At submit time, you can specify the command interpreter to use to process the job script file as shown in Figure 3–5. However, if nothing is specified, the configuration variable shell_start_mode determines how the command interpreter is selected:
If shell_start_mode is set to unix_behavior, the first line of the script file specifies the command interpreter. The first line of the script file must begin with #!. If the first line does not begin with #!, the Bourne Shell sh is used by default.
For all other settings of shell_start_mode, the default command interpreter is determined by the shell parameter for the queue where the job starts. See Displaying Queues and Queue Properties and the queue_conf(5) man page.
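The selection rules above can be sketched as a small shell function. This is only a model of the rules as this section states them, not the code that the grid engine system runs. The function name pick_interpreter and its argument order are hypothetical.

```shell
#!/bin/sh
# Hypothetical model of the selection rules described above.
#   $1  value of shell_start_mode
#   $2  path to the job script
#   $3  shell parameter of the queue
#   $4  interpreter requested at submit time (may be empty)
pick_interpreter() {
    mode=$1 script=$2 queue_shell=$3 requested=${4:-}

    # An interpreter named at submit time takes precedence.
    if [ -n "$requested" ]; then
        printf '%s\n' "$requested"
        return 0
    fi

    if [ "$mode" = "unix_behavior" ]; then
        first_line=$(head -n 1 "$script")
        case $first_line in
            '#!'*)
                # Drop the "#!" marker, then keep the first word only.
                rest=${first_line#??}
                set -- $rest
                printf '%s\n' "${1:-}"
                ;;
            *)
                # No "#!" line: fall back to the Bourne shell.
                printf '%s\n' sh
                ;;
        esac
    else
        # Every other mode uses the shell parameter of the queue.
        printf '%s\n' "$queue_shell"
    fi
}
```

For example, pick_interpreter unix_behavior flow.sh /bin/ksh "" reads the first line of flow.sh, whereas any other mode in the first argument yields /bin/ksh.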
Since batch jobs do not have a terminal connection, their standard output and their standard error output must be redirected into files. The grid engine system enables the user to define the location of the files to which the output is redirected. Defaults are used if no output files are specified.
The standard location for the files is the current working directory where the jobs run. The default standard output file name is job-name.ojob-id, and the default standard error file name is job-name.ejob-id. The job-name can be built from the script file name, or defined by the user. See, for example, the -N option in the submit(1) man page. job-id is a unique identifier that is assigned to the job by the grid engine system.
For array job tasks, the task identifier is added to these file names, separated by a dot. The resulting standard redirection paths are job-name.ojob-id.task-id and job-name.ejob-id.task-id. For more information, see Submitting Array Jobs.
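The naming scheme can be made concrete with a short sketch. The helper default_output_name is hypothetical and only concatenates its arguments in the documented order. The job name flow.sh, the job ID 1234, and the task ID 7 are made-up sample values.

```shell
#!/bin/sh
# Hypothetical helper that assembles a default redirection file name.
#   $1  job name
#   $2  stream letter: o (standard output) or e (standard error)
#   $3  job ID
#   $4  task ID, for array job tasks only (optional)
default_output_name() {
    name="$1.$2$3"
    if [ -n "${4:-}" ]; then
        name="$name.$4"
    fi
    printf '%s\n' "$name"
}

default_output_name flow.sh o 1234      # prints flow.sh.o1234
default_output_name flow.sh e 1234 7    # prints flow.sh.e1234.7
```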
If the standard locations are not suitable, the user can specify output redirections with QMON, as shown in Figure 3–6, or with the -e and -o options of the qsub command. Standard output and standard error output can be merged into one file. The redirections can be specified on a per-execution-host basis, in which case the location of the output redirection file depends on the host on which the job is executed. To build custom but unique redirection file paths, use dummy environment variables together with the qsub -e and -o options. A list of these variables follows.
HOME – Home directory on the execution machine
USER – User ID of the job owner
JOB_ID – Current job ID
JOB_NAME – Current job name; see the -N option
HOSTNAME – Name of the execution host
TASK_ID – Array job task index number
When the job runs, these variables are expanded into the actual values, and the redirection path is built with these values.
See the qsub(1) man page for further details.
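The following sketch shows one way to pass such a path on the command line. The qsub(1) man page documents pseudo variables such as $HOME, $JOB_NAME, and $JOB_ID for the -o and -e options. The logs directory and the script name flow.sh are assumptions made for this example. The point of interest is the quoting: the variables must reach qsub unexpanded.

```shell
#!/bin/sh
# Single quotes keep the submitting shell from expanding the variables,
# so the literal text reaches qsub and is expanded when the job runs.
# The logs directory and the script name flow.sh are assumptions.
out_path='$HOME/logs/$JOB_NAME.$JOB_ID.out'
err_path='$HOME/logs/$JOB_NAME.$JOB_ID.err'

if command -v qsub >/dev/null 2>&1 && [ -f flow.sh ]; then
    qsub -o "$out_path" -e "$err_path" flow.sh
else
    # Without a grid engine installation, just show the command line.
    printf 'qsub -o %s -e %s flow.sh\n' "$out_path" "$err_path"
fi
```

With double quotes or no quotes around the paths, the submitting shell would substitute its own values, and JOB_ID would typically expand to an empty string.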
Lines with a leading # sign are treated as comments in shell scripts. However, the grid engine system recognizes special comment lines and uses these lines in a special way. The special comment script line is treated as part of the command line argument list of the qsub command. The qsub options that are supplied within these special comment lines are also interpreted by the QMON Submit Job dialog box. The corresponding parameters are preset when a script file is selected.
By default, the special comment lines are identified by the #$ prefix string. You can redefine the prefix string with the qsub -C command.
This use of special comments is called script embedding of submit arguments. The following example shows a script file that uses script-embedded command-line options.
    #!/bin/csh
    # Force csh if not Grid Engine default shell
    #$ -S /bin/csh
    # This is a sample script file for compiling and
    # running a sample FORTRAN program under N1 Grid Engine 6
    # We want Grid Engine to send mail
    # when the job begins
    # and when it ends.
    #$ -M EmailAddress
    #$ -m b,e
    # We want to name the file for the standard output
    # and standard error.
    #$ -o flow.out -j y
    # Change to the directory where the files are located.
    cd TEST
    # Now we need to compile the program "flow.f" and
    # name the executable "flow".
    f77 flow.f -o flow
    # Once it is compiled, we can run the program.
    flow
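To see which submit arguments a script carries, you can list its special comment lines. The following sketch only approximates how such lines are recognized: it treats every line that starts with the prefix string as an embedded option line. The helper name embedded_options is hypothetical, and the optional second argument mimics a prefix redefined with qsub -C.

```shell
#!/bin/sh
# Hypothetical helper: print the options embedded in a script.
#   $1  path to the job script
#   $2  prefix string (optional, defaults to "#$")
embedded_options() {
    prefix='#$'
    if [ -n "${2:-}" ]; then
        prefix=$2
    fi
    awk -v p="$prefix" '
        index($0, p) == 1 {
            opts = substr($0, length(p) + 1)   # text after the prefix
            sub(/^[ \t]+/, "", opts)           # trim leading blanks
            print opts
        }' "$1"
}
```

Ordinary comments and the #! line do not start with the prefix string, so they are skipped.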
When a job runs, several variables are preset into the job's environment.
ARC – The architecture name of the node on which the job is running. The name is compiled into the sge_execd binary.
SGE_ROOT – The root directory of the grid engine system as set for sge_execd before startup, or the default /usr/SGE directory.
SGE_BINARY_PATH – The directory in which the grid engine system binaries are installed.
SGE_JOB_SPOOL_DIR – The directory used by sge_shepherd to store job-related data while the job runs.
SGE_O_HOME – The path to the home directory of the job owner on the host from which the job was submitted.
SGE_O_LOGNAME – The login name of the job owner on the host from which the job was submitted.
SGE_O_MAIL – The content of the MAIL environment variable in the context of the job submission command.
SGE_O_PATH – The content of the PATH environment variable in the context of the job submission command.
SGE_O_SHELL – The content of the SHELL environment variable in the context of the job submission command.
SGE_O_TZ – The content of the TZ environment variable in the context of the job submission command.
SGE_O_WORKDIR – The working directory of the job submission command.
SGE_CKPT_ENV – The checkpointing environment under which a checkpointing job runs. The checkpointing environment is selected with the qsub -ckpt command.
SGE_CKPT_DIR – The path ckpt_dir of the checkpoint interface. Set only for checkpointing jobs. For more information, see the checkpoint(5) man page.
SGE_STDERR_PATH – The path name of the file to which the standard error stream of the job is diverted. This file is commonly used for enhancing the output with error messages from prolog, epilog, parallel environment start and stop scripts, or checkpointing scripts.
SGE_STDOUT_PATH – The path name of the file to which the standard output stream of the job is diverted. This file is commonly used for enhancing the output with messages from prolog, epilog, parallel environment start and stop scripts, or checkpointing scripts.
SGE_TASK_ID – The task identifier in the array job represented by this task.
ENVIRONMENT – Always set to BATCH. This variable indicates that the script is run in batch mode.
HOME – The user's home directory path as taken from the passwd file.
HOSTNAME – The host name of the node on which the job is running.
JOB_ID – A unique identifier assigned by the sge_qmaster daemon when the job was submitted. The job ID is a decimal integer from 1 through 9,999,999.
JOB_NAME – The job name, which is built from the file name provided with the qsub command, a period, and the digits of the job ID. You can override this default with qsub -N.
LOGNAME – The user's login name as taken from the passwd file.
NQUEUES – The number of queues that are allocated for the job. This number is always 1 for serial jobs.
NSLOTS – The number of queue slots in use by a parallel job.
PATH – A default shell search path of: /usr/local/bin:/usr/ucb:/bin:/usr/bin.
PE – The parallel environment under which the job runs. This variable is for parallel jobs only.
PE_HOSTFILE – The path of a file that contains the definition of the virtual parallel machine that is assigned to a parallel job by the grid engine system. This variable is used for parallel jobs only. See the description of the $pe_hostfile parameter in sge_pe for details on the format of this file.
REQUEST – The request name of the job. The name is either the job script file name or is explicitly assigned to the job by the qsub -N command.
RESTARTED – Indicates whether a checkpointing job was restarted. If set to 1, the job was interrupted at least once and has therefore been restarted.
SHELL – The user's login shell as taken from the passwd file.
SHELL is not necessarily the shell that is used for the job.
TMPDIR – The absolute path to the job's temporary working directory.
TMP – The same as TMPDIR. This variable is provided for compatibility with NQS.
TZ – The time zone variable imported from sge_execd, if set.
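A job script can read these variables like any other environment variable. The following sketch uses a few of them. The function names in_batch_mode and job_summary are hypothetical, and the fallback values are placeholders that apply only when the script is run outside the grid engine system.

```shell
#!/bin/sh
# Returns success when the script runs as a batch job, judged by the
# ENVIRONMENT variable described above.
in_batch_mode() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

# Print one line describing the job. The fallback texts after ":-" are
# placeholders chosen for this sketch and are used outside a job.
job_summary() {
    printf 'job %s (%s) on %s using %s slot(s)\n' \
        "${JOB_ID:-unknown}" "${JOB_NAME:-unnamed}" \
        "${HOSTNAME:-unknown-host}" "${NSLOTS:-1}"
}

# Report the job details only when running under batch control.
if in_batch_mode; then
    job_summary
fi
```

Because standard output is redirected for batch jobs, the summary line ends up in the standard output file of the job.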