This section explores two different kinds of job checkpointing: user-level checkpointing and kernel-level checkpointing.
Many application programs, especially programs that consume considerable CPU time, use checkpointing and restart mechanisms to increase fault tolerance. Status information and important parts of the processed data are repeatedly written to one or more files at certain stages of the algorithm. If the application is aborted, it can be restarted at a later time from these restart files, resuming in a consistent state comparable to the situation just before the checkpoint. Because the user usually must move the restart files to a proper location, this kind of checkpointing is called user-level checkpointing.
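The idea behind user-level checkpointing can be illustrated with a minimal sketch. The following script is not from the product documentation; the file name state.restart and the step count are assumptions chosen for illustration. The script resumes from the restart file if one exists and rewrites the file atomically after each unit of work, so the file is always in a consistent state.

```shell
#!/bin/sh
# Minimal sketch of user-level checkpointing (illustrative only).
# The file name "state.restart" and the number of steps are assumptions.

STEP=0
# Resume from the restart file left behind by a previous run, if any
if [ -f state.restart ]; then
    STEP=`cat state.restart`
fi

while [ "$STEP" -lt 5 ]; do
    # ... one unit of real work would go here ...
    STEP=$((STEP + 1))
    # Write the restart file atomically: write to a temporary file,
    # then rename, so an abort never leaves a half-written file
    echo "$STEP" > state.restart.tmp
    mv state.restart.tmp state.restart
done
```

If the script is aborted and started again, it continues counting from the value recorded in state.restart instead of repeating the completed steps.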
For application programs that do not have integrated user-level checkpointing, an alternative is to use a checkpointing library. Such libraries are available from some hardware vendors as well as in the public domain; the Condor project of the University of Wisconsin is one example. By relinking an application with such a library, a checkpointing mechanism is installed in the application without requiring source code changes.
Some operating systems provide checkpointing support inside the operating system kernel. In this case, no preparation of the application programs and no relinking of the application are necessary. Kernel-level checkpointing usually applies to single processes as well as to complete process hierarchies. That is, a hierarchy of interdependent processes can be checkpointed and restarted at any time. Usually both a user command and a C library interface are available to initiate a checkpoint.
The grid engine system supports operating system checkpointing if available. See the release notes for the N1 Grid Engine 6 software for information about the currently supported kernel-level checkpointing facilities.
Checkpointing jobs can be interrupted at any time, because their restart capability ensures that only a small amount of the work already done must be repeated. This ability is used to build migration and dynamic load-balancing mechanisms in the grid engine system. If requested, checkpointing jobs are aborted on demand and migrated to other machines in the grid engine system, thus balancing the load in the cluster dynamically. Checkpointing jobs are aborted and migrated for the following reasons:
The executing queue or the job is suspended explicitly by a qmod or a QMON command.
The job or the queue where the job runs is automatically suspended because a suspend threshold for the queue is exceeded, and the checkpoint occasion specification for the job includes the suspension case. For more information, see Configuring Load and Suspend Thresholds in N1 Grid Engine 6 Administration Guide and Submitting, Monitoring, or Deleting a Checkpointing Job From the Command Line.
A migrating job moves back to sge_qmaster. The job is subsequently dispatched to another suitable queue if one is available. In such a case, the qstat output shows R as the status.
Shell scripts for user-level checkpointing jobs differ from regular batch scripts only in their ability to properly handle restarts. The environment variable RESTARTED is set for checkpointing jobs that are restarted. Use this variable to skip sections of the job script that need to be executed only during the initial invocation.
A transparently checkpointing job script might look like Example 4–3.
#!/bin/sh
# Force /bin/sh in Grid Engine
#$ -S /bin/sh

# Test if restarted/migrated
if [ $RESTARTED = 0 ]; then
    # 0 = not restarted
    # Parts to be executed only during the first
    # start go in here
    set_up_grid
fi

# Start the checkpointing executable
fem
# End of script file
The job script restarts from the beginning if a user-level checkpointing job is migrated. The user is responsible for directing the program flow of the shell script to the location where the job was interrupted, thereby skipping those lines in the script that must not be executed more than once.
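One common way to direct the program flow past one-time work is to combine the RESTARTED variable with a marker file, as in the following sketch. The marker file name setup.done and the echoed phases are illustrative assumptions, not part of the grid engine interface.

```shell
#!/bin/sh
#$ -S /bin/sh
# Illustrative sketch: skip one-time setup on a restart.
# The marker file "setup.done" and the phases are assumptions.

# The grid engine system sets RESTARTED=1 for restarted jobs;
# default to 0 so the sketch also runs outside the grid engine system.
: ${RESTARTED:=0}

if [ "$RESTARTED" = 0 ]; then
    # Executed only during the initial invocation
    echo "one-time setup" > setup.done
fi

# Work that is safe to repeat on every start or restart follows here
echo "main phase"
```

On a restart, RESTARTED is 1, the setup branch is skipped, and execution proceeds directly to the repeatable part of the script.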
Kernel-level checkpointing jobs are interruptible at any point of time. The embracing shell script is restarted exactly from the point where the last checkpoint occurred. Therefore the RESTARTED environment variable is not relevant for kernel-level checkpointing jobs.
Type the following command with the appropriate options:
% qsub options arguments
The submission of a checkpointing job works in the same way as for regular batch scripts, except for the qsub -ckpt and -c options. These options request a checkpointing mechanism and define the occasions at which checkpoints must be generated for the job.
The -ckpt option takes one argument, which is the name of the checkpointing environment to use. See Configuring Checkpointing Environments in N1 Grid Engine 6 Administration Guide.
The -c option is not required. It also takes one argument. Use the -c option to override the definitions of the when parameter in the checkpointing environment configuration. See the checkpoint(5) man page for details.
The argument to the -c option can be one of the following one-letter selections, or any combination thereof. The argument can also be a time value.
n – No checkpoint is performed. n has the highest precedence.
s – A checkpoint is generated only if the sge_execd on the job's host is shut down.
m – Generate checkpoint at minimum CPU interval defined in the corresponding queue configuration. See the min_cpu_interval parameter in the queue_conf(5) man page.
x – A checkpoint is generated if the job is suspended.
interval – A checkpoint is generated at the given interval, but not more frequently than defined by min_cpu_interval. The time value must be specified as hh:mm:ss, that is, two-digit hours, minutes, and seconds, separated by colons.
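Putting these options together, submissions might look like the following. The checkpointing environment name myckpt and the script name fem_job.sh are assumptions; a real site would use the names its administrator has configured.

```
% qsub -ckpt myckpt fem_job.sh
% qsub -ckpt myckpt -c sx fem_job.sh
% qsub -ckpt myckpt -c 02:00:00 fem_job.sh
```

The first command uses the when specification of the checkpointing environment unchanged. The second overrides it so that checkpoints are generated on sge_execd shutdown and on suspension. The third requests a checkpoint every two hours, subject to the min_cpu_interval limit of the queue.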
The monitoring of checkpointing jobs differs from monitoring regular jobs. Checkpointing jobs can migrate from time to time. Checkpointing jobs are therefore not bound to a single queue. However, the unique job identification number and the job name stay the same.
The deletion of checkpointing jobs works in the same way as described in Monitoring and Controlling Jobs From the Command Line.
Follow the instructions in Submitting Advanced Jobs With QMON, taking note of the following additional information.
The submission of checkpointing jobs with QMON is identical to submitting regular batch jobs, with the addition of specifying an appropriate checkpointing environment. As explained in Submitting Advanced Jobs With QMON, the Submit Job dialog box provides a field for the checkpointing environment that is associated with a job. Next to the field is a button that opens the following Selection dialog box.
Here you can select a suitable checkpoint environment from the list of available checkpoint objects. Ask your system administrator for information about the properties of the checkpointing environments that are installed at your site. Or see Configuring Checkpointing Environments in N1 Grid Engine 6 Administration Guide.
When a user-level checkpoint or a kernel-level checkpoint that is based on a checkpointing library is written, a complete image of the virtual memory covered by the process or job to be checkpointed must be dumped. Sufficient disk space must be available for this purpose. If the checkpointing environment configuration parameter ckpt_dir is set, the checkpoint information is dumped to a job private location under ckpt_dir. If ckpt_dir is set to NONE, the directory where the checkpointing job started is used. See the checkpoint(5) man page for detailed information about the checkpointing environment configuration.
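For orientation, a checkpointing environment configuration, as displayed for example by qconf -sckpt, takes roughly the following form. The name myckpt, the directory path, and all parameter values here are illustrative assumptions; see the checkpoint(5) man page for the authoritative parameter list and semantics.

```
ckpt_name          myckpt
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           /scratch/checkpoints
signal             none
when               sx
```

In this sketch, ckpt_dir directs checkpoint information to a job-private location under /scratch/checkpoints, and when requests checkpoints on sge_execd shutdown and on suspension.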
Checkpointing files and restart files must be visible on all machines in order to successfully migrate and restart jobs. This file visibility is an additional requirement on the way file systems must be organized; thus NFS or a similar shared file system is required. Ask your cluster administrator whether your site meets this requirement.
If your site does not run NFS, you can transfer the restart files explicitly at the beginning of your shell script, for example with rcp or ftp, in the case of user-level checkpointing jobs.
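Such an explicit transfer might be sketched as follows. The host name fileserver, the directory /var/ckpt, and the restart file name are assumptions for illustration; a real script would use the paths agreed on at the site.

```shell
#!/bin/sh
#$ -S /bin/sh
# Illustrative sketch for sites without NFS: fetch the restart files
# explicitly before resuming. "fileserver", "/var/ckpt", and the file
# name "state.restart" are assumptions.

# Default the variables so the sketch also runs outside the grid
# engine system, where RESTARTED and JOB_ID are not set.
: ${RESTARTED:=0}
: ${JOB_ID:=0}

if [ "$RESTARTED" = 1 ]; then
    # Pull the restart files written by the previous run of this job.
    # scp could be used instead of rcp where rcp is unavailable.
    rcp "fileserver:/var/ckpt/$JOB_ID/state.restart" . || exit 1
fi

# ... start the checkpointing application here ...
```

A matching transfer in the opposite direction, run after each checkpoint is written, would keep the copy on the file server up to date for the next migration.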