Sun N1 Grid Engine 6.1 User's Guide

Composing a Checkpointing Job Script

Shell scripts for kernel-level checkpointing are the same as regular shell scripts.

Shell scripts for user-level checkpointing jobs differ from regular batch scripts only in their ability to properly handle the restart process. The environment variable RESTARTED is set for checkpointing jobs that are restarted. Use this variable to skip sections of the job script that need to be executed only during the initial invocation.

Example 4–3 shows a sample transparently checkpointing job script.


Example 4–3 Example of a Checkpointing Job Script


#!/bin/sh
#Force /bin/sh in Grid Engine
#$ -S /bin/sh

# Test if restarted/migrated
if [ $RESTARTED = 0 ]; then
     # 0 = not restarted
     # Parts to be executed only during the first
     # start go in here
     set_up_grid
fi

# Start the checkpointing executable
fem
#End of scriptfile

The job script restarts from the beginning if a user-level checkpointing job is migrated. The user is responsible for directing the program flow of the shell script to the location where the job was interrupted. Doing so skips those lines in the script that must be executed more than once.


Note –

Kernel-level checkpointing jobs are interruptible at any time. The embracing shell script is restarted exactly from the point where the last checkpoint occurred. Therefore, the RESTARTED environment variable is not relevant for kernel-level checkpointing jobs.


Submitting, Monitoring, or Deleting a Checkpointing Job From the Command Line

Type the following command with the appropriate options:


# qsub options arguments

The submission of a checkpointing job works in the same way as for regular batch scripts, except for the qsub -ckpt and qsub -c commands. These commands request a checkpointing mechanism. The commands also define the occasions at which checkpoints must be generated for the job.

The -ckpt option takes one argument, which is the name of the checkpointing environment to use. See Configuring Checkpointing Environments in Sun N1 Grid Engine 6.1 Administration Guide.

The -c option is not required. -c also takes one argument. Use the -c option to override the definitions of the when parameter in the checkpointing environment configuration. See the checkpoint(5) man page for details.

The argument to the -c option can be one of the following one-letter selections, or any combination. The argument can also be a time value.

The monitoring of checkpointing jobs differs from monitoring regular jobs. Checkpointing jobs can migrate from time to time. Checkpointing jobs are therefore not bound to a single queue. However, the unique job identification number and the job name stay the same.

The deletion of checkpointing jobs works in the same way as described in Monitoring and Controlling Jobs From the Command Line.

Submitting a Checkpointing Job With QMON

The submission of checkpointing jobs with QMON is identical to submitting regular batch jobs, with the addition of specifying an appropriate checkpointing environment. As explained in Submitting Advanced Jobs With QMON, the Submit Job dialog box provides a field for the checkpointing environment that is associated with a job. Click the button next to that field to opens the following Selection dialog box.

Dialog box titled Select an Item. Shows available
checkpoint objects and a field for selecting one. Shows OK, Cancel,
and Help buttons.

You can select a suitable checkpoint environment from the list of available checkpoint objects. Ask your system administrator for information about the properties of the checkpointing environments that are installed at your site. For more information, refer to Configuring Checkpointing Environments in Sun N1 Grid Engine 6.1 Administration Guide.