Sun N1 Grid Engine 6.1 Administration Guide

Configuring Checkpointing Environments

Checkpointing is a facility that performs the following tasks:

  1. Freezes the status of a running job or application

  2. Saves this status (the checkpoint) to disk

  3. Restarts the job or application from the checkpoint if the job or application did not finish, for example, because of a system shutdown

If you move a checkpoint from one host to another host, checkpointing can migrate jobs or applications in a cluster without significant loss of resources. Hence, dynamic load balancing can be provided with the help of a checkpointing facility.

The grid engine system supports two levels of checkpointing: user-level checkpointing and kernel-level checkpointing.

Kernel-level checkpointing can be applied to complete jobs, that is, the process hierarchy created by a job. By contrast, user-level checkpointing is usually restricted to single programs. Therefore the job in which such programs are embedded needs to properly handle cases where the entire job gets restarted.
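One way a job can handle a restart is to check at startup whether it is running for the first time, and to skip one-time setup steps if it is not. The following Python sketch illustrates the idea. It assumes that the grid engine system sets the RESTARTED environment variable to 1 for a restarted job, as described in the qsub(1) man page; verify this against your installation. The step names and the plan_steps function are hypothetical.

```python
import os


def was_restarted(env=None):
    """Return True if the job appears to have been restarted.

    Assumption: the grid engine system sets RESTARTED to "1" on a
    restart and to "0" otherwise. Check your qsub(1) man page.
    """
    env = os.environ if env is None else env
    return env.get("RESTARTED", "0") == "1"


def plan_steps(env=None):
    """Hypothetical job driver: skip one-time setup on a restart."""
    steps = []
    if not was_restarted(env):
        # One-time setup that must not be repeated after a restart.
        steps.append("prepare_input")
    # The embedded program resumes from its own checkpoint.
    steps.append("run_program")
    steps.append("collect_output")
    return steps


if __name__ == "__main__":
    print(plan_steps())
```

On a first start the plan includes the setup step. After a restart only the program run and the output collection remain.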

Kernel-level checkpointing, as well as checkpointing based on checkpointing libraries, can consume many resources. The complete virtual address space that is in use by the job or application at the time of the checkpoint must be dumped to disk. By contrast, user-level checkpointing based on restart files can restrict the data that is written to the checkpoint to the important information only.
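The following Python sketch shows what user-level checkpointing based on restart files might look like inside an application. It is an illustration only, not part of the grid engine system. The file format, the function names, and the stop_after parameter (which simulates an interruption) are assumptions made for this example. Only the loop counter and the running total are saved, rather than the whole address space.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh state if no restart file exists."""
    try:
        with open(path) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def save_state(path, state):
    """Write the restart file atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, steps, checkpoint_every=10, stop_after=None):
    """Sum the integers 0..steps-1, writing a restart file periodically.

    stop_after simulates an interruption after that many iterations;
    the function then returns None without saving further state.
    """
    state = load_state(path)
    done = 0
    while state["next_step"] < steps:
        if stop_after is not None and done >= stop_after:
            return None
        state["total"] += state["next_step"]
        state["next_step"] += 1
        done += 1
        if state["next_step"] % checkpoint_every == 0:
            save_state(path, state)
    save_state(path, state)
    return state["total"]
```

If a run is interrupted, calling run again with the same restart file resumes from the last saved step. Only the work done since that checkpoint is repeated.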

About Checkpointing Environments

The grid engine system provides a configurable attribute description for each checkpointing method used. Different attribute descriptions reflect the different checkpointing methods and the potential variety of derivatives from these methods on different operating system architectures.

This attribute description is called a checkpointing environment. Default checkpointing environments are provided with the distribution of the grid engine system and can be modified according to the site's needs.

New checkpointing methods can be integrated in principle. However, the integration of new methods can be a challenging task. This integration should be performed only by experienced personnel or by your grid engine system support team.

Configuring Checkpointing Environments With QMON

On the QMON Main Control window, click the Checkpoint Configuration button. The Checkpointing Configuration dialog box appears.

Figure: Checkpointing Configuration dialog box, showing the list of Checkpoint Objects and their configurations, and the Add, Modify, Delete, Done, and Help buttons.

Viewing Configured Checkpointing Environments

To view previously configured checkpointing environments, select one of the checkpointing environment names listed under Checkpoint Objects. The corresponding configuration is displayed under Configuration.

Adding a Checkpointing Environment

In the Checkpointing Configuration dialog box, click Add. The Add/Modify Checkpoint Object dialog box appears, along with a template configuration that you can edit.

Figure: Add/Modify Checkpoint Object dialog box, showing the fields in which you can type checkpointing parameters, and the OK and Cancel buttons.

Fill out the template with the requested information.

Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.

Modifying Checkpointing Environments

In the Checkpoint Objects list, select the name of the configured checkpointing environment you want to modify, and then click Modify. The Add/Modify Checkpoint Object dialog box appears, along with the current configuration of the selected checkpointing environment.

The Add/Modify Checkpoint Object dialog box enables you to change the parameters of the selected checkpointing environment. See the checkpoint(5) man page for details about these parameters.

In addition, you must define the Interface to use. The Interface is also called the checkpointing method. From the Interface list under Name, select an Interface. See the checkpoint(5) man page for details about the meaning of the different interfaces.

Note –

For the checkpointing environments provided with the distribution of the grid engine system, change only the Name parameter and the Checkpointing Directory parameter.

Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.

Deleting Checkpointing Environments

To delete a configured checkpointing environment, select it, and then click Delete.

Configuring Checkpointing Environments From the Command Line

To configure the checkpointing environment from the command line, type the qconf command with the appropriate options.
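A checkpointing environment can also be prepared as a definition file and then passed to qconf. The following Python sketch renders such a file and builds the matching command line without running it. It assumes the attribute names from the checkpoint(5) man page and the qconf -Ackpt option for adding an environment from a file; confirm both against the man pages of your installation. The environment name my_ckpt and all values in the example are hypothetical.

```python
# Attribute names as assumed from the checkpoint(5) man page.
CKPT_FIELDS = (
    "ckpt_name", "interface", "ckpt_command", "migr_command",
    "restart_command", "clean_command", "ckpt_dir", "signal", "when",
)


def render_ckpt(settings):
    """Render a definition file body, one 'name value' pair per line."""
    missing = [name for name in CKPT_FIELDS if name not in settings]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return "".join(f"{name:<18} {settings[name]}\n" for name in CKPT_FIELDS)


def qconf_add_from_file(path):
    """Build, but do not run, the command that adds an environment from a file."""
    return ["qconf", "-Ackpt", path]


# Hypothetical example values; adjust them to your site.
example = {
    "ckpt_name": "my_ckpt",
    "interface": "userdefined",
    "ckpt_command": "none",
    "migr_command": "none",
    "restart_command": "none",
    "clean_command": "none",
    "ckpt_dir": "/tmp",
    "signal": "none",
    "when": "sx",
}

if __name__ == "__main__":
    print(render_ckpt(example), end="")
    print(" ".join(qconf_add_from_file("my_ckpt.conf")))
```

Building the argument list separately from running it lets you review the definition before you register it with sge_qmaster.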

The following options are available: