This chapter describes how to manage and administer the following special environments:
Parallel environments
Checkpointing environments
In addition to background information about these environments, this chapter includes detailed instructions for configuring and using them with QMON and from the command line.
A parallel environment (PE) is a software package that enables concurrent computing on parallel platforms in networked environments.
A variety of systems have evolved over the past years into viable technology for distributed and parallel processing on various hardware platforms. The following are two examples of the most common message-passing environments:
PVM – Parallel Virtual Machine, Oak Ridge National Laboratories
MPI – Message Passing Interface, the Message Passing Interface Forum
Public domain as well as hardware vendor-provided implementations exist for both tools.
All these systems have different characteristics and distinct requirements. To handle parallel jobs running on top of such systems, the grid engine system provides a flexible, powerful interface that satisfies various needs.
The grid engine system can run parallel jobs that are based on the following:
Arbitrary message-passing environments such as PVM or MPI. See the PVM User's Guide and the MPI User's Guide for details.
Shared memory parallel programs on multiple slots, either in single queues or distributed across multiple queues and across machines for distributed memory parallel jobs.
Any number of different parallel environment interfaces can be configured concurrently.
Interfaces between parallel environments and the grid engine system can be implemented if suitable startup and stop procedures are provided. The startup procedure and the stop procedure are described in Parallel Environment Startup Procedure and in Termination of the Parallel Environment, respectively.
On the QMON Main Control window, click the Parallel Environment Configuration button. The Parallel Environment Configuration dialog box appears.
Currently configured parallel environments are displayed under PE List.
To display the contents of a parallel environment, select it. The selected parallel environment configuration is displayed under Configuration.
To delete a parallel environment, select it, and then click Delete.
To add a new parallel environment, click Add. To modify a parallel environment, select it, and then click Modify.
When you click Add or Modify, the Add/Modify PE dialog box appears.
If you are adding a new parallel environment, type its name in the Name field. If you are modifying a parallel environment, its name is displayed in the Name field.
In the Slots box, enter the total number of job slots that can be occupied by all parallel environment jobs running concurrently.
User Lists displays the user access lists that are allowed to access the parallel environment. Xuser Lists displays the user access lists that are not allowed to access the parallel environment. See Configuring User Access Lists for more information about user access lists.
Click the icons at the right of each list to modify the content of the lists. The Select Access Lists dialog box appears.
The Start Proc Args and Stop Proc Args fields are optional. Use these fields to enter the precise invocation sequence of the parallel environment startup and stop procedures. See the sections Parallel Environment Startup Procedure and Termination of the Parallel Environment, respectively. If no such procedures are required for a certain parallel environment, you can leave the fields empty.
The first argument is usually the name of the start or stop procedure itself. The remaining parameters are command-line arguments to the procedures.
A variety of special identifiers, which begin with a $ prefix, are available to pass internal runtime information to the procedures. The sge_pe(5) man page contains a list of all available parameters.
The Allocation Rule field defines the number of parallel processes to allocate on each machine that is used by a parallel environment. A positive integer fixes the number of processes for each suitable host. Use the special keyword $pe_slots to cause the full range of processes of a job to be allocated on a single host (SMP). Use the keywords $fill_up and $round_robin to cause unbalanced distributions of processes across hosts. For more details about these allocation rules, see the sge_pe(5) man page.
The Urgency Slots field specifies the method the grid engine system uses to assess the number of slots that pending jobs with a slot range get. The assumed slot allocation is meaningful when determining the resource-request-based priority contribution for numeric resources. You can specify an integer value for the number of slots. Specify min to use the slot range minimum. Specify max to use the slot range maximum. Specify avg to use the average of all numbers occurring within the job's parallel environment range request.
The Control Slaves check box specifies whether the grid engine system generates parallel tasks or whether the corresponding parallel environment creates its own process. The grid engine system uses sge_execd and sge_shepherd to generate parallel tasks. Full control over slave tasks by the grid engine system is preferable, because the system provides the correct accounting and resource control. However, this functionality is available only for parallel environment interfaces especially customized for the grid engine system. See Tight Integration of Parallel Environments and Grid Engine Software for more details.
The Job Is First Task check box is meaningful only if Control Slaves is selected. If you select Job Is First Task, the job script or one of its child processes acts as one of the parallel tasks of the parallel application. For PVM, you usually want the job script to be part of the parallel application, for example. If you clear the Job Is First Task check box, the job script initiates the parallel application but does not participate. For MPI, you usually do not want the job script to be part of the parallel application, for example, when you use mpirun.
Click OK to save your changes and close the dialog box. Click Cancel to close the dialog box without saving changes.
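The effect of the allocation rules described for the Allocation Rule field can be illustrated with a small simulation. The following is a rough sketch, not grid engine code. The function name allocate and the host names and free slot counts are hypothetical. The real scheduler also takes load, queue order, and resource requests into account, so treat the output as an approximation of the behavior documented in sge_pe(5).

```shell
#!/bin/sh
# Sketch: simulate how many processes each host receives under an
# allocation rule. Input on stdin: one "hostname free_slots" pair per line.
# Usage: allocate RULE NSLOTS
# RULE is fill_up, round_robin, pe_slots, or a positive integer
# (a fixed number of processes on each host that is used).
allocate() {
    awk -v rule="$1" -v need="$2" '
        { host[NR] = $1; free[NR] = $2; got[NR] = 0 }
        END {
            n = NR
            if (rule == "fill_up") {
                # Take every free slot on a host before moving to the next.
                for (i = 1; i <= n && need > 0; i++) {
                    take = (free[i] < need) ? free[i] : need
                    got[i] = take; need -= take
                }
            } else if (rule == "round_robin") {
                # Take one slot per host per pass until the request is met.
                progress = 1
                while (need > 0 && progress) {
                    progress = 0
                    for (i = 1; i <= n && need > 0; i++)
                        if (got[i] < free[i]) { got[i]++; need--; progress = 1 }
                }
            } else if (rule == "pe_slots") {
                # All processes must fit on a single host.
                for (i = 1; i <= n; i++)
                    if (free[i] >= need) { got[i] = need; need = 0; break }
            } else {
                # Fixed number of processes on each host that is used.
                per = rule + 0
                for (i = 1; i <= n && need > 0; i++)
                    if (free[i] >= per && need >= per) { got[i] = per; need -= per }
            }
            if (need > 0) { print "unsatisfiable"; exit 1 }
            for (i = 1; i <= n; i++) if (got[i] > 0) print host[i], got[i]
        }'
}
```

For example, with three hypothetical hosts a, b, and c that have 4, 2, and 2 free slots, the command printf 'a 4\nb 2\nc 2\n' | allocate fill_up 5 places 4 processes on a and 1 on b, whereas allocate round_robin 5 places 2 on a, 2 on b, and 1 on c.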
On the QMON Main Control window, click the Parallel Environment Configuration button. The Parallel Environment Configuration dialog box appears. See Configuring Parallel Environments With QMON for more information.
The following example defines a parallel job to be submitted. The job requests that the parallel environment interface mpi (message passing interface) be used with 4 to 16 processes, with 16 preferred.
To select a parallel environment from a list of available parallel environments, click the button at the right of the Parallel Environment field. A selection dialog box appears.
You can add a range for the number of parallel tasks initiated by the job after the parallel environment name in the Parallel Environment field.
The qsub command corresponding to the parallel job specification described previously is as follows:
% qsub -N Flow -p -111 -P devel -a 200012240000.00 -cwd \
    -S /bin/tcsh -o flow.out -j y -pe mpi 4-16 \
    -v SHARED_MEM=TRUE,MODEL_SIZE=LARGE \
    -ac JOB_STEP=preprocessing,PORT=1234 \
    -A FLOW -w w -r y -m s,e -q big_q \
    -M me@myhost.com,me@other.address \
    flow.sh big.data
This example shows how to use the qsub -pe command to formulate an equivalent request. The qsub(1) man page provides more details about the -pe option.
Select a suitable parallel environment interface for a parallel job, keeping the following considerations in mind:
Parallel environment interfaces can use different message-passing systems or no message systems.
Parallel environment interfaces can allocate processes on single or multiple hosts.
Access to the parallel environment can be denied to certain users.
Only a specific set of queues can be used by a parallel environment interface.
Only a certain number of queue slots can be occupied by a parallel environment interface at any point of time.
Ask the grid engine system administrator for the available parallel environment interfaces best suited for your types of parallel jobs.
You can specify resource requirements along with your parallel environment request. Specifying resource requirements further restricts the set of eligible queues for the parallel environment interface to those queues that fit the requirement. See Defining Resource Requirements in Sun N1 Grid Engine 6.1 User’s Guide.
For example, assume that you run the following command:
% qsub -pe mpi 1,2,4,8 -l nastran,arch=osf nastran.par
The queues that are suitable for this job are queues that are associated with the parallel environment interface mpi by the parallel environment configuration. Suitable queues must also satisfy the resource requirements given with the qsub -l option.
The parallel environment interface facility is highly configurable. In particular, the administrator can configure the parallel environment startup and stop procedures to support site-specific needs. See the sge_pe(5) man page for details. Use the qsub -v and qsub -V commands to pass information from the user who submits the job to the startup and stop procedures. These two options export environment variables. If you are unsure, ask the administrator whether you are required to export certain environment variables.
Type the qconf command with appropriate options:
qconf options
The following options are available:
The -ap pe-name option (add parallel environment) displays an editor containing a parallel environment configuration template. The editor is either the default vi editor or an editor defined by the EDITOR environment variable. pe-name specifies the name of the parallel environment. The name is already provided in the corresponding field of the template. Configure the parallel environment by changing the template and saving to disk. See the sge_pe(5) man page for a detailed description of the template entries to change.
The -Ap option (add parallel environment from file) parses the specified file filename and adds the new parallel environment configuration.
The file must have the format of the parallel environment configuration template.
The -dp option (delete parallel environment) deletes the specified parallel environment.
The -mp option (modify parallel environment) displays an editor containing the specified parallel environment as a configuration template. The editor is either the default vi editor or an editor defined by the EDITOR environment variable. Modify the parallel environment by changing the template and saving to disk. See the sge_pe(5) man page for a detailed description of the template entries to change.
The -Mp option (modify parallel environment from file) parses the specified file filename and modifies the existing parallel environment configuration.
The file must have the format of the parallel environment configuration template.
The -sp option (show parallel environment) prints the configuration of the specified parallel environment to standard output.
The -spl option (show parallel environment list) lists the names of all currently configured parallel environments.
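A file in the parallel environment template format, as expected by the -Ap and -Mp options, might look like the one that the following sketch writes. The function name write_pe_config, the environment name mpi_example, the slot count, and the file name are hypothetical. The attribute names reflect the template as described in sge_pe(5); compare them with the output of qconf -sp for an existing parallel environment at your site before relying on them.

```shell
#!/bin/sh
# Sketch: write a parallel environment configuration file in the
# template format, as input for "qconf -Ap" or "qconf -Mp".
# Usage: write_pe_config PE_NAME SLOTS OUTPUT_FILE
write_pe_config() {
    cat > "$3" <<EOF
pe_name            $1
slots              $2
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    \$fill_up
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
EOF
}

# Example (hypothetical names), to be run by an administrator:
#   write_pe_config mpi_example 32 ./mpi_example.pe
#   qconf -Ap ./mpi_example.pe
```

Keeping such files under version control makes it possible to review changes to a parallel environment before they are applied with qconf -Mp.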
To run parallel jobs, you must also associate a queue with the PE. Use the queue_conf(5) attribute pe_list to identify the suitable PEs. Then, to link the PE and queues, use either the QMON utility or the following form of the qconf command:
# qconf -mq <queue_name>
The grid engine system starts the parallel environment by using the exec system call to invoke a startup procedure. The name of the startup executable and the parameters passed to this executable are configurable from within the grid engine system.
An example for such a startup procedure for the PVM environment is contained in the distribution tree of the grid engine system. The startup procedure is made up of a shell script and a C program that is invoked by the shell script. The shell script uses the C program to start up PVM cleanly. All other required operations are handled by the shell script.
The shell script is located under sge-root/pvm/startpvm.sh. The C program file is located under sge-root/pvm/src/start_pvm.c.
The startup procedure could have been a single C program. The use of a shell script enables easier customization of the sample startup procedure.
The example script startpvm.sh requires the following three arguments:
The path of a host file generated by grid engine software, containing the names of the hosts from which PVM is to be started
The host on which the startpvm.sh procedure is invoked
The path of the PVM root directory, usually contained in the PVM_ROOT environment variable
These parameters can be passed to the startup script as described in Configuring Parallel Environments With QMON. The parameters are among the parameters provided to parallel environment startup and stop scripts by the grid engine system during runtime. The required host file, as an example, is generated by the grid engine system. The name of the file can be passed to the startup procedure in the parallel environment configuration by the special parameter name $pe_hostfile. A description of all available parameters is provided in the sge_pe(5) man page.
The host file has the following format:
Each line of the file refers to a queue on which parallel processes are to run.
The first entry of each line specifies the host name of the queue.
The second entry specifies the number of parallel processes to run in this queue.
The third entry denotes the queue.
The fourth entry denotes a processor range to use in case of a multiprocessor machine.
This file format is generated by the grid engine system. The file format is fixed. Parallel environments that need a different file format must translate it within the startup procedure. See the startpvm.sh file. PVM is an example of a parallel environment that needs a different file format.
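The translation of the host file into another format can be sketched as follows. This example assumes a target format with one host name per parallel process, as used by the machine files of some MPI launchers. It is not the format that PVM expects and it is not taken from startpvm.sh. The function name hostfile_to_machines is hypothetical.

```shell
#!/bin/sh
# Sketch: convert a grid engine host file (the file named by $pe_hostfile)
# into a list with one host name per parallel process.
# Columns read: 1 = host name, 2 = number of parallel processes.
# Columns 3 (queue) and 4 (processor range) are ignored here.
# Usage: hostfile_to_machines HOSTFILE
hostfile_to_machines() {
    awk 'NF >= 2 { for (i = 0; i < $2; i++) print $1 }' "$1"
}
```

Given a host file with the hypothetical lines "node1 2 big_q UNDEFINED" and "node2 1 big_q UNDEFINED", the function prints node1 twice and node2 once, each on its own line.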
When the grid engine system starts the parallel environment startup procedure, the startup procedure launches the parallel environment. The startup procedure should exit with a zero exit status. If the exit status of the startup procedure is not zero, grid engine software reports an error and does not start the parallel job.
You should test any startup procedures first from the command line, without using the grid engine system. Doing so helps you find errors that can be hard to trace once the procedure is integrated into the grid engine system framework.
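The exit status convention can be tried out with a minimal skeleton such as the following. It is a sketch only: the function name start_pe, its two arguments, and the machines file that it leaves behind are hypothetical and are not part of the startpvm.sh sample. A real startup procedure would launch the parallel environment at the marked point.

```shell
#!/bin/sh
# Sketch of a startup procedure skeleton.
# Usage: start_pe HOSTFILE WORKDIR
#   HOSTFILE = path of the host file generated by the grid engine system
#   WORKDIR  = directory in which to leave a copy of the host list
# Returns 0 on success and a nonzero status on any failure.
start_pe() {
    if [ $# -ne 2 ]; then
        echo "usage: start_pe <hostfile> <workdir>" >&2
        return 2
    fi
    hostfile=$1
    workdir=$2
    if [ ! -r "$hostfile" ]; then
        echo "start_pe: cannot read host file $hostfile" >&2
        return 1
    fi
    if [ ! -d "$workdir" ] || [ ! -w "$workdir" ]; then
        echo "start_pe: cannot write to $workdir" >&2
        return 1
    fi
    # A real procedure would start the parallel environment here.
    # This sketch only keeps a copy of the host list for the job.
    cp "$hostfile" "$workdir/machines" || return 1
    return 0
}
```

To test the skeleton without the grid engine system, create a small host file by hand, call start_pe with that file and a scratch directory, and inspect the exit status with echo $?.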
When a parallel job finishes or is aborted, for example, by qdel, a procedure to halt the parallel environment is called. The definition and semantics of this procedure are similar to the procedures described for the startup program. The stop procedure can also be defined in a parallel environment configuration. See, for example, Configuring Parallel Environments With QMON.
The purpose of the stop procedure is to shut down the parallel environment and to reap all associated processes.
If the stop procedure fails to clean up parallel environment processes, the grid engine system might have no information about processes that are running under parallel environment control. Therefore the grid engine system cannot clean up these processes. The grid engine software, of course, cleans up the processes directly associated with the job script that the system has launched.
The distribution tree of the grid engine system also contains an example of a stop procedure for the PVM parallel environment. This example resides under sge-root/pvm/stoppvm.sh. It takes the following two arguments:
The path to the host file generated by the grid engine system
The name of the host on which the stop procedure is started
Similar to the startup procedure, the stop procedure is expected to return a zero exit status on success and a nonzero exit status on failure.
You should test any stop procedures first from the command line, without using the grid engine system. Doing so helps you find errors that can be hard to trace once the procedure is integrated into the grid engine system framework.
Configuring Parallel Environments With QMON mentions that using sge_execd and sge_shepherd to create parallel tasks offers benefits over parallel environments that create their own parallel tasks. The UNIX operating system allows reliable resource control only for the creator of a process hierarchy. Features such as correct accounting, resource limits, and process control for parallel applications, can be enforced only by the creator of all parallel tasks.
Most parallel environments do not implement these features. Therefore parallel environments do not provide a sufficient interface for the integration with a resource management system like the grid engine system. To overcome this problem, the grid engine system provides an advanced parallel environment interface for tight integration with parallel environments. This parallel environment interface transfers the responsibility for creating tasks from the parallel environment to the grid engine software.
The distribution of the grid engine system contains two examples of such a tight integration, one for the PVM public domain version, and one for the MPICH MPI implementation from Argonne National Laboratories. The examples are contained in the directories sge-root/pvm and sge-root/mpi, respectively. The directories also contain README files that describe the usage and any current restrictions. Refer to those README files for more details.
For comparison, the sge-root/mpi/sunhpc/loose-integration directory contains a loose integration sample with Sun HPC ClusterTools™ software, and the sge-root/mpi directory contains a loosely integrated variant of the interfaces.
Implementing a tight integration with a parallel environment is an advanced task that can require expert knowledge of the parallel environment and the grid engine system parallel environment interface. You might want to contact your Sun support representative for assistance.
Checkpointing is a facility that does the following tasks:
Freezes the status of a running job or application
Saves this status (the checkpoint) to disk
Restarts the job or application from the checkpoint if the job or application has otherwise not finished, for example, due to a system shutdown
If you move a checkpoint from one host to another host, checkpointing can migrate jobs or applications in a cluster without significant loss of resources. Hence, dynamic load balancing can be provided with the help of a checkpointing facility.
The grid engine system supports two levels of checkpointing:
User-level checkpointing.
At this level, providing the checkpoint generation mechanism is entirely the responsibility of the user or the application. Examples of user-level checkpointing include:
The periodic writing of restart files that are encoded in the application at prominent algorithmic steps, combined with proper processing of these files when the application is restarted.
The use of a checkpoint library that must be linked to the application and that thereby installs a checkpointing mechanism.
A variety of third-party applications provide an integrated checkpoint facility that is based on the writing of restart files. Checkpoint libraries are available from hardware vendors or from the public domain. Refer to the Condor project of the University of Wisconsin, for example.
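User-level checkpointing that is based on restart files can be sketched in a job script as follows. This is an illustration under simple assumptions, not a facility of the grid engine system: the function name run_steps, the restart file layout (a single step number), and the optional argument that simulates an interruption are all hypothetical.

```shell
#!/bin/sh
# Sketch of user-level checkpointing with a restart file.
# Usage: run_steps TOTAL RESTART_FILE [STOP_AFTER]
# Performs steps 1..TOTAL and records the last finished step in
# RESTART_FILE. STOP_AFTER (optional) simulates an interruption after
# that many steps in the current run and makes the function return 1.
run_steps() {
    total=$1
    restart_file=$2
    stop_after=${3:-0}
    step=0
    # Resume from the recorded step if a restart file exists.
    if [ -s "$restart_file" ]; then
        step=$(cat "$restart_file")
    fi
    done_now=0
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # The real computation for this step would go here.
        # Write to a temporary file and rename it, so that an
        # interruption never leaves a half-written restart file.
        echo "$step" > "$restart_file.tmp" &&
            mv "$restart_file.tmp" "$restart_file"
        done_now=$((done_now + 1))
        if [ "$stop_after" -gt 0 ] && [ "$done_now" -ge "$stop_after" ]; then
            return 1
        fi
    done
    # The work is complete, so the restart file is no longer needed.
    rm -f "$restart_file"
    echo "finished at step $step"
}
```

If a run of 5 steps is interrupted after step 2, the restart file contains 2, and the next call with the same restart file continues with step 3 rather than starting over.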
Kernel-level transparent checkpointing.
This level of checkpointing must be provided by the operating system, or by enhancements to it, and can be applied to any job. You do not need to change source code or relink your application to use kernel-level checkpointing.
Kernel-level checkpointing can be applied to complete jobs, that is, the process hierarchy created by a job. By contrast, user-level checkpointing is usually restricted to single programs. Therefore the job in which such programs are embedded needs to properly handle cases where the entire job gets restarted.
Kernel-level checkpointing, as well as checkpointing based on checkpointing libraries, can consume many resources. The complete virtual address space that is in use by the job or application at the time of the checkpoint must be dumped to disk. By contrast, user-level checkpointing based on restart files can restrict the data that is written to the checkpoint to the important information only.
The grid engine system provides a configurable attribute description for each checkpointing method used. Different attribute descriptions reflect the different checkpointing methods and the potential variety of derivatives from these methods on different operating system architectures.
This attribute description is called a checkpointing environment. Default checkpointing environments are provided with the distribution of the grid engine system and can be modified according to the site's needs.
New checkpointing methods can be integrated in principle. However, the integration of new methods can be a challenging task. This integration should be performed only by experienced personnel or by your grid engine system support team.
On the QMON Main Control window, click the Checkpoint Configuration button. The Checkpointing Configuration dialog box appears.
To view previously configured checkpointing environments, select one of the checkpointing environment names listed under Checkpoint Objects. The corresponding configuration is displayed under Configuration.
In the Checkpointing Configuration dialog box, click Add. The Add/Modify Checkpoint Object dialog box appears, along with a template configuration that you can edit.
Fill out the template with the requested information.
Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.
In the Checkpoint Objects list, select the name of the configured checkpointing environment you want to modify, and then click Modify. The Add/Modify Checkpoint Object dialog box appears, along with the current configuration of the selected checkpointing environment.
The Add/Modify Checkpoint Object dialog box enables you to change the following information:
Name
Checkpoint, Migration, Restart, and Clean command strings
Directory where checkpointing files are stored
Occasions when checkpoints must be initiated
Signal to send to job or application when a checkpoint is initiated
See the checkpoint(5) man page for details about these parameters.
In addition, you must define the Interface to use. The Interface is also called the checkpointing method. From the Interface list under Name, select an Interface. See the checkpoint(5) man page for details about the meaning of the different interfaces.
For the checkpointing environments provided with the distribution of the grid engine system, change only the Name parameter and the Checkpointing Directory parameter.
Click OK to register your changes with sge_qmaster. Click Cancel to close the dialog box without saving changes.
To delete a configured checkpointing environment, select it, and then click Delete.
To configure the checkpointing environment from the command line, type the qconf command with the appropriate options.
The following options are available:
The -ackpt ckpt-name option (add checkpointing environment) displays an editor containing a checkpointing environment configuration template. The editor is either the default vi editor or an editor corresponding to the EDITOR environment variable. The parameter ckpt-name specifies the name of the checkpointing environment. The name is already provided in the corresponding field of the template. Configure the checkpointing environment by changing the template and saving to disk. See the checkpoint(5) man page for a detailed description of the template entries to be changed.
The -Ackpt option (add checkpointing environment from file) parses the specified file and adds the new checkpointing environment configuration.
The file must have the format of the checkpointing environment template.
The -dckpt option (delete checkpointing environment) deletes the specified checkpointing environment.
The -mckpt option (modify checkpointing environment) displays an editor containing the specified checkpointing environment as a configuration template. The editor is either the default vi editor or an editor corresponding to the EDITOR environment variable. Modify the checkpointing environment by changing the template and saving to disk. See the checkpoint(5) man page for a detailed description of the template entries to be changed.
The -Mckpt option (modify checkpointing environment from file) parses the specified file and modifies the existing checkpointing configuration.
The file must have the format of the checkpointing environment template.
The -sckpt option (show checkpointing environment) prints the configuration of the specified checkpointing environment to standard output.
The -sckptl option (show checkpointing environment list) displays a list of the names of all checkpointing environments currently configured.
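A file in the checkpointing environment template format, as expected by the -Ackpt and -Mckpt options, might look like the one that the following sketch writes. The function name write_ckpt_config, the environment name ckpt_example, the directory, and the file name are hypothetical. The attribute names and values reflect the template as described in checkpoint(5); compare them with the output of qconf -sckpt for an existing checkpointing environment before relying on them.

```shell
#!/bin/sh
# Sketch: write a checkpointing environment configuration file in the
# template format, as input for "qconf -Ackpt" or "qconf -Mckpt".
# Usage: write_ckpt_config CKPT_NAME CKPT_DIR OUTPUT_FILE
write_ckpt_config() {
    cat > "$3" <<EOF
ckpt_name          $1
interface          userdefined
ckpt_command       none
migr_command       none
restart_command    none
clean_command      none
ckpt_dir           $2
signal             none
when               sx
EOF
}

# Example (hypothetical names), to be run by an administrator:
#   write_ckpt_config ckpt_example /var/tmp/ckpt ./ckpt_example.conf
#   qconf -Ackpt ./ckpt_example.conf
```

The entries correspond to the fields of the Add/Modify Checkpoint Object dialog box: the interface, the four command strings, the checkpointing directory, the signal, and the occasions (when) on which checkpoints are initiated.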