Sun N1 Grid Engine 6.1 User's Guide

File System Requirements for Checkpointing

When a user-level checkpoint or a kernel-level checkpoint that is based on a checkpointing library is written, a complete image of the virtual memory covered by the process or job to be checkpointed must be saved. Sufficient disk space must be available for this purpose. If the checkpointing environment configuration parameter ckpt_dir is set, the checkpoint information is saved to a job private location under ckpt_dir. If ckpt_dir is set to NONE, the directory where the checkpointing job started is used. See the checkpoint(5) man page for detailed information about the checkpointing environment configuration.


Note –

You should start a checkpointing job with the qsub -cwd script if ckpt_dir is set to NONE.


Checkpointing files and restart files must be visible on all machines in order to successfully migrate and restart jobs. Because file visibility is necessary for the way file systems must be organized, NFS or a similar file system is required. Ask your cluster administration if your site meets this requirement.

If your site does not run NFS, you can transfer the restart files explicitly at the beginning of your shell script. For example, you can use rcp or ftp in the case of user-level checkpointing jobs.