Checkpointing is a facility that performs the following tasks:
Freezes the status of a running job or application
Saves this status (the checkpoint) to disk
Restarts the job or application from the checkpoint if the job or application did not finish, for example, because of a system shutdown
If you move a checkpoint from one host to another host, checkpointing can migrate jobs or applications in a cluster without significant loss of resources. Hence, dynamic load balancing can be provided with the help of a checkpointing facility.
The grid engine system supports two levels of checkpointing:
User-level checkpointing.
At this level, providing the checkpoint generation mechanism is entirely the responsibility of the user or the application. Examples of user-level checkpointing include:
The periodic writing of restart files, coded into the application at prominent algorithmic steps, combined with proper processing of these files when the application is restarted.
The use of a checkpoint library that must be linked to the application and that thereby installs a checkpointing mechanism.
A variety of third-party applications provide an integrated checkpoint facility that is based on the writing of restart files. Checkpoint libraries are available from hardware vendors or from the public domain. See, for example, the Condor project of the University of Wisconsin.
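The restart-file approach described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the grid engine system: the function names, the JSON file format, and the stop_after parameter (which simulates an interruption such as a system shutdown) are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a half-written restart file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # rename over the old restart file in one step


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no restart file exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_step": 0, "total": 0}


def run(path, steps, stop_after=None):
    """Sum the integers 0..steps-1, writing a restart file after every step.

    stop_after (hypothetical, for demonstration only) simulates an interruption
    after that many steps have been executed in this invocation.
    """
    state = load_checkpoint(path)
    executed = 0
    while state["next_step"] < steps:
        if stop_after is not None and executed >= stop_after:
            return None  # interrupted before the computation finished
        state["total"] += state["next_step"]
        state["next_step"] += 1
        save_checkpoint(path, state)
        executed += 1
    return state["total"]
```

Calling run() a second time with the same path resumes from the last completed step instead of starting over. Only the two values the computation needs are written, which is why restart files can be much smaller than a dump of the whole address space.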
Kernel-level transparent checkpointing.
This level of checkpointing must be provided by the operating system, or by enhancements to it, and can be applied to any job. Using kernel-level checkpointing requires no source code changes and no relinking of your application.
Kernel-level checkpointing can be applied to complete jobs, that is, the process hierarchy created by a job. By contrast, user-level checkpointing is usually restricted to single programs. Therefore the job in which such programs are embedded needs to properly handle cases where the entire job gets restarted.
Kernel-level checkpointing, as well as checkpointing based on checkpointing libraries, can consume many resources. The complete virtual address space that is in use by the job or application at the time of the checkpoint must be dumped to disk. By contrast, user-level checkpointing based on restart files can restrict the data that is written to the checkpoint to the important information only.
The grid engine system provides a configurable attribute description for each checkpointing method used. Different attribute descriptions reflect the different checkpointing methods and the potential variety of derivatives from these methods on different operating system architectures.
This attribute description is called a checkpointing environment. Default checkpointing environments are provided with the distribution of the grid engine system and can be modified according to the site's needs.
New checkpointing methods can be integrated in principle. However, the integration of new methods can be a challenging task. This integration should be performed only by experienced personnel or by your grid engine system support team.
On the QMON Main Control window, click the Checkpoint Configuration button. The Checkpointing Configuration dialog box appears.
To view previously configured checkpointing environments, select one of the checkpointing environment names listed under Checkpoint Objects. The corresponding configuration is displayed under Configuration.
In the Checkpointing Configuration dialog box, click Add. The Add/Modify Checkpoint Object dialog box appears, along with a template configuration that you can edit.
Fill out the template with the requested information.
Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.
In the Checkpoint Objects list, select the name of the configured checkpointing environment you want to modify, and then click Modify. The Add/Modify Checkpoint Object dialog box appears, along with the current configuration of the selected checkpointing environment.
The Add/Modify Checkpoint Object dialog box enables you to change the following information:
Name
Checkpoint, Migration, Restart, and Clean command strings
Directory where checkpointing files are stored
Occasions when checkpoints must be initiated
Signal to send to job or application when a checkpoint is initiated
See the checkpoint(5) man page for details about these parameters.
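To illustrate the signal parameter, the following Python sketch shows one way an application could react when it is sent a signal to request a checkpoint. It is an assumption-laden example, not grid engine code: the class name, the choice of SIGUSR1, and the JSON state file are hypothetical, and the sketch assumes a UNIX-like system where the handler is installed from the main thread.

```python
import json
import os
import signal


class CheckpointOnSignal:
    """Record checkpoint requests delivered by signal and act on them at a safe point."""

    def __init__(self, path, signum=signal.SIGUSR1):
        # path and signum are illustrative choices, not values mandated by the system
        self.path = path
        self.requested = False
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        # Only set a flag here; the actual write happens outside the handler
        self.requested = True

    def maybe_checkpoint(self, state):
        """Write the state if a checkpoint was requested; return True if written."""
        if not self.requested:
            return False
        with open(self.path, "w") as handle:
            json.dump(state, handle)
        self.requested = False
        return True
```

The application would call maybe_checkpoint() at points in its main loop where its state is consistent. Deferring the write to such a point, instead of writing inside the signal handler, avoids saving a half-updated state.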
In addition, you must define the Interface to use. The Interface is also called the checkpointing method. From the Interface list under Name, select an Interface. See the checkpoint(5) man page for details about the meaning of the different interfaces.
For the checkpointing environments provided with the distribution of the grid engine system, change only the Name parameter and the Checkpointing Directory parameter.
Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.
To delete a configured checkpointing environment, select it, and then click Delete.
To configure the checkpointing environment from the command line, type the qconf command with the appropriate options.
The following options are available:
The -ackpt option (add checkpointing environment) displays an editor containing a checkpointing environment configuration template. The editor is either the default vi editor or an editor corresponding to the EDITOR environment variable. The ckpt-name argument specifies the name of the checkpointing environment and is already entered in the corresponding field of the template. Configure the checkpointing environment by changing the template and saving to disk. See the checkpoint(5) man page for a detailed description of the template entries to be changed.
The -Ackpt option (add checkpointing environment from file) parses the specified file and adds the new checkpointing environment configuration. The file must have the format of the checkpointing environment template.
The -dckpt option (delete checkpointing environment) deletes the specified checkpointing environment.
The -mckpt option (modify checkpointing environment) displays an editor containing the specified checkpointing environment as a configuration template. The editor is either the default vi editor or an editor corresponding to the EDITOR environment variable. Modify the checkpointing environment by changing the template and saving to disk. See the checkpoint(5) man page for a detailed description of the template entries to be changed.
The -Mckpt option (modify checkpointing environment from file) parses the specified file and modifies the existing checkpointing configuration. The file must have the format of the checkpointing environment template.
The -sckpt option (show checkpointing environment) prints the configuration of the specified checkpointing environment to standard output.
The -sckptl option (show checkpointing environment list) displays a list of the names of all checkpointing environments currently configured.
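As an illustration of the file-based options, the following Python sketch writes a checkpointing environment file and passes it to qconf -Ackpt. The field names and default values are recalled from the checkpoint(5) template and should be treated as assumptions; compare them with the output of qconf -sckpt on your installation before relying on them. The helper names render_ckpt_env and register are hypothetical.

```python
import shutil
import subprocess
import tempfile

# Assumed template fields, in the order the checkpoint(5) man page lists them
FIELDS = (
    "ckpt_name", "interface", "ckpt_command", "migr_command",
    "restart_command", "clean_command", "ckpt_dir", "signal", "when",
)

# Assumed placeholder defaults; adjust them to the site's needs
DEFAULTS = {
    "ckpt_name": "template",
    "interface": "userdefined",
    "ckpt_command": "none",
    "migr_command": "none",
    "restart_command": "none",
    "clean_command": "none",
    "ckpt_dir": "/tmp",
    "signal": "none",
    "when": "sx",
}


def render_ckpt_env(**settings):
    """Return the text of a checkpointing environment file, one 'name value' pair per line."""
    unknown = set(settings) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    merged = {**DEFAULTS, **settings}
    return "".join(f"{name:<18} {merged[name]}\n" for name in FIELDS)


def register(text):
    """Add the environment with 'qconf -Ackpt'; return False if qconf is not on the PATH."""
    if shutil.which("qconf") is None:
        return False
    with tempfile.NamedTemporaryFile("w", suffix=".ckpt", delete=False) as handle:
        handle.write(text)
    subprocess.run(["qconf", "-Ackpt", handle.name], check=True)
    return True
```

For example, register(render_ckpt_env(ckpt_name="my_ckpt", ckpt_dir="/scratch/ckpt")) would attempt to add an environment named my_ckpt, where both the name and the directory are made-up values. Keeping the configuration in a file rather than editing it interactively makes the change repeatable and easy to review.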