This section describes how the grid engine system schedules jobs for execution. The section describes different types of scheduling strategies and explains how to configure the scheduler.
The grid engine system includes the following job-scheduling activities:
Predispatching decisions. Activities such as eliminating queues because they are full or overloaded, spooling jobs that are currently not eligible for execution, and reserving resources for higher-priority jobs
Dispatching. Deciding a job's importance with respect to other pending jobs and running jobs, sensing the load on all machines in the cluster, and sending the job to a queue on a machine selected according to configured selection criteria
Postdispatch monitoring. Adjusting a job's relative importance as it gets resources and as other jobs with their own relative importance enter or leave the system
The grid engine software schedules jobs across a heterogeneous cluster of computers, based on the following criteria:
The cluster's current load
The jobs' relative importance
The hosts' relative performance
The jobs' resource requirements, for example, CPU, memory, and I/O bandwidth
Decisions about scheduling are based on the strategy for the site and on the instantaneous load characteristics of each computer in the cluster. A site's scheduling strategy is expressed through the grid engine system's configuration parameters. Load characteristics are ascertained by collecting performance data as the system runs.
The administrator can set up strategies with respect to the following scheduling tasks:
Dynamic resource management. The grid engine system dynamically controls and adjusts the resource entitlements that are allocated to running jobs. In other words, the system modifies their CPU share.
Queue sorting. The software ranks the queues in the cluster according to the order in which the queues should be filled up.
Job sorting. Job sorting determines the order in which the grid engine system attempts to schedule jobs.
Resource reservation and backfilling. Resource reservation reserves resources for jobs, blocking their use by jobs of lower priority. Backfilling enables lower-priority jobs to use blocked resources when using those resources does not interfere with the reservation.
The grid engine software uses a weighted combination of the following three ticket-based policies to implement automated job scheduling strategies:
Share-based
Functional (sometimes called Priority)
Override
You can set up the grid engine system to routinely use either a share-based policy, a functional policy, or both. You can combine these policies in any combination. For example, you could give zero weight to one policy and use only the second policy. Or you could give both policies equal weight.
Along with routine policies, administrators can also override share-based and functional scheduling temporarily or, for certain purposes such as express queues, permanently. You can apply an override to one job or to all jobs associated with a user, a department, a project, or a job class (that is, a queue).
In addition to the three policies for mediating among all jobs, the grid engine system sometimes lets users set priorities among the jobs they own. For example, a user might say that jobs one and two are equally important, but that job three is more important than either job one or job two. Users can set their own job priorities if the combination of policies includes the share-based policy, the functional policy, or both. Also, functional tickets must be granted to jobs.
The share-based, functional, and override scheduling policies are implemented with tickets. Each policy has a pool of tickets. A policy allocates tickets to jobs as the jobs enter the multimachine grid engine system. Each routine policy that is in force allocates some tickets to each new job. The policy might also reallocate tickets to running jobs at each scheduling interval.
Tickets weight the three policies. For example, if no tickets are allocated to the functional policy, that policy is not used. If the functional ticket pool and the share-based ticket pool have an equal number of tickets, both policies have equal weight in determining a job's importance.
Tickets are allocated to the routine policies at system configuration by grid engine system managers. Managers and operators can change ticket allocations at any time with immediate effect. Additional tickets are injected into the system temporarily to indicate an override. Policies are combined by assignment of tickets. When tickets are allocated to multiple policies, a job gets a portion of each policy's tickets, which indicates the job's importance in each policy in force.
The grid engine system grants tickets to jobs that are entering the system to indicate their importance under each policy in force. At each scheduling interval, each running job can gain tickets, lose tickets, or keep the same number of tickets. For example, a job might gain tickets from an override. A job might lose tickets because it is getting more than its fair share of resources. The number of tickets that a job holds represent the resource share that the grid engine system tries to grant that job during each scheduling interval.
You configure a site's dynamic resource management strategy during installation. First, you allocate tickets to the share-based policy and to the functional policy. You then define the share tree and the functional shares. The share-based ticket allocation and the functional ticket allocation can change automatically at any time. The administrator manually assigns or removes tickets.
The following means are provided to determine the order in which the grid engine system attempts to fill up queues:
Load reporting. Administrators can select which load parameters are used to compare the load status of hosts and their queue instances. The wide range of standard load parameters that are available, and an interface for extending this set with site-specific load sensors, are described in Load Parameters.
Load scaling. Load reports from different hosts can be normalized to reflect a comparable situation. See Configuring Execution Hosts With QMON.
Load adjustment. The grid engine software can be configured to automatically correct the last reported load as jobs are dispatched to hosts. The corrected load represents the expected increase in the load situation caused by recently started jobs. This artificial increase of load can be automatically reduced as the load impact of these jobs takes effect.
Sequence number. Queues can be sorted following a strict sequence.
Before the grid engine system starts to dispatch jobs, the jobs are brought into priority order, highest priority first. The system then attempts to find suitable resources for the jobs in priority sequence.
Without any administrator influence, the order is first-in-first-out (FIFO). The administrator has the following means to control the job order:
Ticket-based job priority. Jobs are always treated according to their relative importance as defined by the number of tickets that the jobs have. Pending jobs are sorted in ticket order. Any change that the administrator applies to the ticket policy also changes the sorting order.
Urgency-based job priority. Jobs can have an urgency value that determines their relative importance. Pending jobs are sorted according to their urgency value. Any change applied to the urgency policy also changes the sorting order.
POSIX priority. You can use the –p option to the qsub command to implement site-specific priority policies. The –p option specifies a range of priorities from –1023 to 1024. The higher the number, the higher the priority. The default priority for jobs is zero.
Maximum number of user or user group jobs. You can restrict the maximum number of jobs that a user or a UNIX user group can run concurrently. This restriction influences the sorting order of the pending job list, because the jobs of users who have not exceeded their limit are given preference.
For each priority type, a weighting factor can be specified. This weighting factor determines the degree to which each type of priority affects overall job priority. To make it easier to control the range of values for each priority type, normalized values are used instead of the raw ticket values, urgency values, and POSIX priority values.
The following formula expresses how a job's priority values are determined:
job_priority = weight_urgency * normalized_urgency_value + weight_ticket * normalized_ticket_value + weight_priority * normalized_POSIX_priority_value |
You can use the qstat command to monitor job priorities:
Use qstat –pri to monitor job priorities overall, including POSIX priority.
Use qstat –ext to monitor job priorities based on the ticket policy.
Use qstat –urg to monitor job priorities based on the urgency policy.
Use qstat –prito diagnose job priority issues when urgency policy, ticket based policies and -p <priority> are used concurrently
Use qstat –explainto diagnose various queue instance based error conditions.
The urgency policy defines an urgency value for each job. The urgency value is derived from the sum of three contributions:
Resource requirement contribution
Waiting time contribution
Deadline contribution
The resource requirement contribution is derived from the sum of all hard resource requests, one addend for each request.
If the resource request is of the type numeric, the resource request addend is the product of the following three elements:
The resource's urgency value as defined in the complex. For more information, see Configuring Complex Resource Attributes With QMON.
The assumed slot allocation of the job.
The per slot request specified by the qsub –l command.
If the resource request is of the type string, the resource request addend is the resource's urgency value as defined in the complex.
The waiting time contribution is the product of the job's waiting time, in seconds, and the waiting-weight value specified in the Policy Configuration dialog box.
The deadline contribution is zero for jobs without a deadline. For jobs with a deadline, the deadline contribution is the weight-deadline value, which is defined in the Policy Configuration dialog box, divided by the free time, in seconds, until the deadline initiation time.
For information about configuring the urgency policy, see Configuring the Urgency Policy.
Resource reservation enables you to reserve system resources for specified pending jobs. When you reserve resources for a job, those resources are blocked from being used by jobs with lower priority.
Jobs can reserve resources depending on criteria such as resource requirements, job priority, waiting time, resource sharing entitlements, and so forth. The scheduler enforces reservations in such a way that jobs with the highest priority get the earliest possible resource assignment. This avoids such well-known problems as “job starvation”.
You can use resource reservation to guarantee that resources are dedicated to jobs in job-priority order.
Consider the following example. Job A is a large pending job, possibly parallel, that requires a large amount of a particular resource. A stream of smaller jobs B(i) require a smaller amount of the same resource. Without resource reservation, a resource assignment for job A cannot be guaranteed, assuming that the stream of B(i) jobs does not stop. The resource cannot be guaranteed even though job A has a higher priority than the B(i) jobs.
With resource reservation, job A gets a reservation that blocks the lower priority jobs B(i). Resources are guaranteed to be available for job A as soon as possible.
Backfilling enables a lower-priority job to use resources that are blocked due to a resource reservation. Backfilling work only if there is a runnable job whose prospective run time is small enough to allow the blocked resource to be used without interfering with the original reservation.
In the example described earlier, a job C, of very short duration, could use backfilling to start before job A.
Because resource reservation causes the scheduler to look ahead, using resource reservation affects system performance. In a small cluster, the effect on performance is negligible when there are only a few pending jobs. In larger clusters, however, and in clusters with many pending jobs, the effect on performance might be significant.
To offset this potential performance degradation, you can limit the overall number of resource reservations that can be made during a scheduling interval. You can limit resource reservation in two ways:
To limit the absolute number of reservations that can be made during a scheduling interval, set the Maximum Reservation parameter on the Scheduler Configuration dialog box. For example, if you set Maximum Reservation to 20, no more than 20 reservations can be made within an interval.
To limit reservation scheduling to only those jobs that are important, use the –R y option of the qsub command. In the example described earlier, there is no need to schedule B(i) job reservations just for the sake of guaranteeing the resource reservation for job A. Job A is the only job that you need to submit with the –R y option.
You can configure the scheduler to monitor how it is influenced by resource reservation. When you monitor the scheduler, information about each scheduling run is recorded in the file sge-root/cell/common/schedule.
The following example shows what schedule monitoring does. Assume that the following sequence of jobs is submitted to a cluster where the global license consumable resource is limited to 5 licenses:
qsub -N L4_RR -R y -l h_rt=30,license=4 -p 100 $SGE_ROOT/examples/jobs/sleeper.sh 20 qsub -N L5_RR -R y -l h_rt-30,license=5 $SGE_ROOT/examples/jobs/sleeper.sh 20 qsub -N L1_RR -R y -l h_rt=31,license=1 $SGE_ROOT/examples/jobs/sleeper.sh 20 |
Assume that the default priority settings in the scheduler configuration are being used:
weight_priority 1.000000 weight_urgency 0.100000 weight_ticket 0.010000 |
The –p 100 priority of job L4_RR supersedes the license-based urgency, which results in the following prioritization:
job-ID prior name --------------------- 3127 1.08000 L4_RR 3128 0.10500 L5_RR 3129 0.00500 L1_RR |
In this case, traces of these jobs can be found in the schedule file for 6 schedule intervals:
:::::::: 3127:1:STARTING:1077903416:30:G:global:license:4.000000 3127:1:STARTING:1077903416:30:Q:all.q@carc:slots:1.000000 3128:1:RESERVING:1077903446:30:G:global:license:5.000000 3128:1:RESERVING:1077903446:30:Q:all.q@bilbur:slots:1.000000 3129:1:RESERVING:1077903476:31:G:global:license:1.000000 3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000 :::::::: 3127:1:RUNNING:1077903416:30:G:global:license:4.000000 3127:1:RUNNING:1077903416:30:Q:all.q@carc:slots:1.000000 3128:1:RESERVING:1077903446:30:G:global:license:5.000000 3128:1:RESERVING:1077903446:30:Q:all.q@es-ergb01-01:slots:1.000000 3129:1:RESERVING:1077903476:31:G:global:license:1.000000 3129:1:RESERVING:1077903476:31:Q:all.q@es-ergb01-01:slots:1.000000 :::::::: 3128:1:STARTING:1077903448:30:G:global:license:5.000000 3128:1:STARTING:1077903448:30:Q:all.q@carc:slots:1.000000 3129:1:RESERVING:1077903478:31:G:global:license:1.000000 3129:1:RESERVING:1077903478:31:Q:all.q@bilbur:slots:1.000000 :::::::: 3128:1:RUNNING:1077903448:30:G:global:license:5.000000 3128:1:RUNNING:1077903448:30:Q:all.q@carc:slots:1.000000 3129:1:RESERVING:1077903478:31:G:global:license:1.000000 3129:1:RESERVING:1077903478:31:Q:all.q@es-ergb01-01:slots:1.000000 :::::::: 3129:1:STARTING:1077903480:31:G:global:license:1.000000 3129:1:STARTING:1077903480:31:Q:all.q@carc:slots:1.000000 :::::::: 3129:1:RUNNING:1077903480:31:G:global:license:1.000000 3129:1:RUNNING:1077903480:31:Q:all.q@carc:slots:1.000000 |
Each section shows, for a schedule interval, all resource usage that was taken into account. RUNNING entries show usage of jobs that were already running at the start of the interval. STARTING entries show the immediate uses that were decided within the interval. RESERVING entries show uses that are planned for the future, that is, reservations.
The format of the schedule file is as follows:
The job ID
The array task ID, or 1 in the case of nonarray jobs
Can be RUNNING, SUSPENDED, MIGRATING, STARTING, RESERVING
Start time in seconds after 1.1.1070
Assumed job duration in seconds
Can be P (for parallel environment), G (for global), H (for host), or Q (for queue)
The name of the parallel environment, host, or queue
The name of the consumable resource
The resource usage incurred by the job
The line :::::::: marks the beginning of a new schedule interval.
The schedule file is not truncated. Be sure to turn monitoring off if you do not have an automated procedure that is set up to truncate the file.
The Scheduler schedules work in intervals. Between scheduling actions, the grid engine system keeps information about significant events such as the following:
Job submission
Job completion
Job cancellation
An update of the cluster configuration
Registration of a new machine in the cluster
When scheduling occurs, the scheduler first does the following:
Takes into account all significant events
Sorts jobs and queues according to the administrator's specifications
Takes into account all the jobs' resource requirements
Reserves resources for jobs in a forward-looking schedule
Then the grid engine system does the following tasks, as needed:
Dispatches new jobs
Suspends running jobs
Increases or decreases the resources allocated to running jobs
Maintains the status quo
If share-based scheduling is used, the calculation takes into account the usage that has already occurred for that user or project.
If scheduling is not at least in part share-based, the calculation ranks all the jobs running and waiting to run. The calculation then takes the most important job until the resources in the cluster (CPU, memory, and I/O bandwidth) are used as fully as possible.
If the reasons why a job does not get started are unclear to you, run the qalter -w v command for the job. The grid engine software assumes an empty cluster and checks whether any queue that is suitable for the job is available.
Further information can be obtained by running the qstat -j job-id command. This command prints a summary of the job's request profile. The summary also includes the reasons why the job was not scheduled in the last scheduling run. Running the qstat -j command without a job ID summarizes the reasons for all jobs not being scheduled in the last scheduling interval.
Collection of job scheduling information must be switched on in the scheduler configuration sched_conf(5). Refer to the schedd_job_info parameter description in the sched_conf(5) man page, or to Changing the Scheduler Configuration With QMON.
To retrieve even more detail about the decisions of the scheduler sge_schedd, use the -tsm option of the qconf command. This command forces sge_schedd to write trace output to the file.
Refer to Configuring Policy-Based Resource Management With QMON for details on the scheduling administration of resource-sharing policies of the grid engine system. The following sections focus on administering the scheduler configuration sched_conf and related issues.
The default scheduling is a first-in-first-out policy. In other words, the first job that is submitted is the first job the scheduler examines in order to dispatch it to a queue. If the first job in the list of pending jobs finds a queue that is suitable and available, that job is started first. A job ranked behind the first job can be started first only if the first job fails to find a suitable free resource.
The default strategy is to select queue instances on the least-loaded host, provided that the queues deliver suitable service for the job's resource requirements. If several suitable queues share the same load, the queue to be selected is unpredictable.
You can modify the job scheduling and queue selection strategy in various ways:
Changing the scheduling algorithm
Scaling system load
Selecting queue by sequence number
Selecting queue by share
Restricting the number of jobs per user or per group
The following sections explore these alternatives in detail.
The scheduler configuration parameter algorithm provides a selection for the scheduling algorithm in use. See the sched_conf(5) man page for further information. Currently, default is the only allowed setting.
To select the queue to run a job, the grid engine system uses the system load information on the machines that host queue instances. This queue selection scheme builds up a load-balanced situation, thus guaranteeing better use of the available resources in a cluster.
However, the system load may not always tell the truth. For example, if a multi-CPU machine is compared to a single CPU system, the multiprocessor system usually reports higher load figures, because it probably runs more processes. The system load is a measurement strongly influenced by the number of processes trying to get CPU access. But multi-CPU systems are capable of satisfying a much higher load than single-CPU machines. This problem is addressed by processor-number-adjusted sets of load values that are reported by default by sge_execd. Use these load parameters instead of the raw load values to avoid the problem described earlier. See Load Parameters and the sge-root/doc/load_parameters.asc file for details.
Another example of potentially improper interpretation of load values is when systems have marked differences in their performance potential or in their price performance ratio. In both cases, equal load values do not mean that arbitrary hosts can be selected to run a job. In this situation, the administrator should define load scaling factors for the relevant execution hosts and load parameters. See Configuring Execution Hosts With QMON, and related sections.
The scaled load parameters are also compared against the load threshold lists load-thresholds and migr-load-thresholds. See the queue_conf(5) man page for details.
Another problem associated with load parameters is the need for an application-dependent and site-dependent interpretation of the values and their relative importance. The CPU load might be dominant for a certain type of application that is common at a particular site. By contrast, the memory load might be more important for another site and for the application profile to which the site's compute cluster is dedicated. To address this problem, the grid engine system enables the administrator to specify a load formula in the scheduler configuration file sched_conf. See the sched_conf(5) man page for more details. Site-specific information on resource usage and capacity planning can be taken into account by using site-defined load parameters and consumable resources in the load formula. See the sections Adding Site-Specific Load Parameters) and Consumable Resources.
Finally, the time dependency of load parameters must be taken into account. The load that is imposed by the jobs that are running on a system varies in time. Often the load, for example, the CPU load, requires some amount of time to be reported in the appropriate quantity by the operating system. If a job recently started, the reported load might not provide an accurate representation of the load that the job has imposed on that host. The reported load adapts to the real load over time. But the period of time in which the reported load is too low might lead to an oversubscription of that host. The grid engine system enables the administrator to specify load adjustment factors that are used in the scheduler to compensate for this problem. See the sched_conf(5) man page for detailed information on how to set these load adjustment factors.
Load adjustments are used to virtually increase the measured load after a job is dispatched. In the case of oversubscribed machines, this helps to align with load thresholds. If you do not need load adjustments, you should turn them off. Load adjustments impose additional work on the scheduler in connection with sorting hosts and load thresholds verification.
To disable load adjustments, on the Load Adjustment tab of the Scheduler Configuration dialog box, set the Decay Time to zero, and delete all load adjustment values in the table. See Changing the Scheduler Configuration With QMON.
Another way to change the default scheme for queue selection is to set the global cluster configuration parameter queue_sort_method to seq_no instead of to the default load. In this case, the system load is no longer used as the primary method to select queues. Instead, the sequence numbers that are assigned to the queues by the queue configuration parameter seq_no define a fixed order for queue selection. The queues must be suitable for the considered job, and they must be available. See the queue_conf(5) and sched_conf(5) man pages for more details.
This queue selection policy is useful if the machines that offer batch services at your site are ranked in a monotonous price per job order. For example, a job running on machine A costs 1 unit of money. The same job costs 10 units on machine B. And on machine C the job costs 100 units. Thus the preferred scheduling policy is to first fill up host A and then to use host B. Host C is used only if no alternative remains.
If you have changed the method of queue selection to seq_no, and the considered queues all share the same sequence number, queues are selected by the default load.
The goal of this method is to place jobs so as to attempt to meet the targeted share of global system resources for each job. This method takes into account the resource capability represented by each host in relation to all the system resources. This method tries to balance the percentage of tickets for each host (that is, the sum of tickets for all jobs running on a host) with the percentage of the resource capability that particular host represents for the system. See Configuring Execution Hosts With QMON for instructions on how to define the capacity of a host.
The host's load, although of secondary importance, is also taken into account in the sorting. Choose this sorting method for a site that uses the share-tree policy.
The administrator can assign an upper limit to the number of jobs that any user or any UNIX group can run at any time. In order to enforce this feature, do one of the following:
Set maxujobs or maxgjobs, or both, as described in the sched_conf(5) man page.
On the General Parameters tab of the Scheduler Configuration dialog box, use the Max Jobs/User field to set the maximum number of jobs a user or user group can run concurrently.
On the QMON Main Control window, click the Scheduler Configuration button.
The Scheduler Configuration dialog box appears. The dialog box has two tabs:
General Parameters tab
Load Adjustment tab
To change general scheduling parameters, click the General Parameters tab. The General Parameters tab looks like the following figure.
Use the General Parameters tab to set the following parameters:
Algorithm. The scheduling algorithm. See Changing the Scheduling Algorithm.
Schedule Interval. The regular time interval between scheduler runs.
Reprioritize Interval. The regular time interval to reprioritize jobs on the execution hosts, based on the current ticket amount for running jobs. To turn reprioritizing off, set this parameter to zero.
Max Jobs/User. The maximum number of jobs that are allowed to run concurrently per user and per UNIX group. See Restricting the Number of Jobs per User or Group.
Sort by. The queue sorting scheme, either sorting by load or sorting by sequence number. See Selecting Queue by Sequence Number.
Job Scheduling Information. Whether job scheduling information is accessible through qstat -j, or whether this information should be collected only for a range of job IDs. You should turn on general collection of job scheduling information only temporarily, in case an extremely high number of jobs are pending.
Scheduler monitoring can help you find out the reason why certain jobs are not dispatched. However, providing this information for all jobs at all times can consume resources. Such information is usually not needed.
Load Formula. The load formula to use to sort hosts and queues.
Flush Submit Seconds. The number of seconds that the scheduler waits after a job is submitted before the scheduler is triggered. To disable the flush after a job is submitted, set this parameter to zero.
Flush Finish Seconds. The number of seconds that the scheduler waits after a job has finished before the scheduler is triggered. To disable the flush after a job has finished, set this parameter to zero.
Maximum Reservation. The maximum number of resource reservations that can be scheduled within a scheduling interval. See Resource Reservation and Backfilling.
Params. Use this setting to specify additional parameters to pass to the scheduler. Params can be PROFILE or MONITOR. If you specify PROFILE, the scheduler logs profiling information that summarizes each scheduling run. If you specify MONITOR, the scheduler records information for each scheduling run in the file sge-root/cell/common/schedule.
By default, the grid engine system schedules job runs in a fixed schedule interval. You can use the Flush Submit Seconds and Flush Finish Seconds parameters to configure immediate scheduling. For more information, see Immediate Scheduling.
To change load adjustment parameters, click the Load Adjustment tab. The Load Adjustment tab looks like the following figure:
Use the Load Adjustment tab to set the following parameters:
Decay Time. The decay time for the load adjustment.
A table of load adjustment values listing all load and consumable attributes for which an adjustment value is currently defined.
To add load values to the list, click the Load or the Value column heading. A selection list appears with all resource attributes that are attached to the hosts.
The Attribute Selection dialog box is shown in Figure 1–2. To add a resource attribute to the Load column of the Consumable/Fixed Attributes table, select one of the attributes, and then click OK.
To modify an existing value, double-click the Value field.
To delete a resource attribute, select it, and then press Control-D or click mouse button 3. A dialog box asks you to confirm the deletion.
See Scaling System Load for background information. See the sched_conf(5) man page for more details about the scheduler configuration.