N1 Grid Engine 6 User's Guide

File System Requirements for Checkpointing

When a user-level checkpoint, or a kernel-level checkpoint that is based on a checkpointing library, is written, a complete image of the virtual memory used by the process or job being checkpointed must be dumped to disk. Sufficient disk space must be available for this purpose. If the checkpointing environment configuration parameter ckpt_dir is set to a directory, the checkpoint information is dumped to a job-private location under ckpt_dir. If ckpt_dir is set to NONE, the directory in which the checkpointing job was started is used. See the checkpoint(5) man page for detailed information about the checkpointing environment configuration.
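The disk space requirement can be checked roughly before a job is submitted. The following is a minimal sketch, not part of the product: the function name has_space_for_ckpt is hypothetical, and the required size is an estimate that you must supply yourself, for example from the expected memory footprint of the job.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file system that holds DIR has at
# least NEED_KB kilobytes available, and fails otherwise.
has_space_for_ckpt() {
    dir=$1
    need_kb=$2
    # df -P prints one POSIX-format line per file system; -k reports
    # sizes in 1024-byte units. Column 4 of line 2 is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge "$need_kb" ]
}

# Example use (commented out; 2097152 KB = 2 GB is an illustrative figure):
# has_space_for_ckpt "$PWD" 2097152 || echo "not enough space for checkpoint" >&2
```

The check only reflects free space at the moment it runs, so it is a sanity check rather than a guarantee.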


Note –

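If ckpt_dir is set to NONE, submit the checkpointing job with qsub -cwd script, where script is your job script. The job then starts in, and writes its checkpoint information to, the directory from which you submitted it.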
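A submission of this kind might look like the following sketch. The wrapper function submit_ckpt_job and the names my_ckpt_env and job.sh are placeholders for illustration; the QSUB override exists only so that the command can be previewed without submitting anything.

```shell
#!/bin/sh
# Hypothetical wrapper: submit a checkpointing job from the current
# working directory. Set QSUB=echo to print the arguments instead of
# submitting the job.
submit_ckpt_job() {
    ckpt_env=$1   # name of a configured checkpointing environment
    script=$2     # job script to submit
    # -cwd runs the job in the directory from which it was submitted;
    # -ckpt selects the checkpointing environment for the job.
    ${QSUB:-qsub} -cwd -ckpt "$ckpt_env" "$script"
}

# Example use (commented out; both names are placeholders):
# submit_ckpt_job my_ckpt_env job.sh
```

Run the submission from a directory on a file system with enough free space, because that directory receives the checkpoint information.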


Checkpointing files and restart files must be visible on all machines so that jobs can be migrated and restarted successfully. This visibility requirement constrains how file systems are organized: NFS or a similar shared file system is required. Ask your cluster administrator whether your site meets this requirement.

If your site does not run NFS, you can transfer the restart files explicitly at the beginning of your shell script. For user-level checkpointing jobs, for example, you can use rcp or ftp.
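An explicit transfer at the start of a job script could be sketched as follows. The function name fetch_restart_files, the host name, and the paths are all hypothetical, and how the script learns which host holds the restart files is site-specific. The commented example also assumes that the RESTARTED environment variable described in the qsub man page is set to 1 when a job is restarted.

```shell
#!/bin/sh
# Hypothetical helper: copy restart files from the host that wrote them
# into a local directory. RCP can be overridden to use another copy tool
# or, with RCP=echo, to preview the command.
fetch_restart_files() {
    src_host=$1   # host that holds the restart files (placeholder)
    src_dir=$2    # directory on that host (placeholder)
    dest_dir=$3   # local directory for the restarted job
    # -r copies the directory and its contents recursively.
    ${RCP:-rcp} -r "$src_host:$src_dir" "$dest_dir"
}

# At the top of the job script, fetch the files only on a restart.
# Example (commented out; the host and paths are site-specific):
# if [ "${RESTARTED:-0}" = "1" ]; then
#     fetch_restart_files oldhost /var/tmp/ckpt_job "$PWD"
# fi
```

The same structure applies if your site uses ftp or another transfer tool; only the copy command inside the function changes.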