Sun N1 Grid Engine 6.1 Administration Guide

Fine-Tuning Your Grid Environment

The grid engine system is a full-function, general-purpose distributed resource management tool. The scheduler component of the system supports a wide range of compute farm scenarios. To get the maximum performance from your compute environment, review which features are enabled, and then determine which of them you really need to solve your load management problem. Disabling the features that you do not need can improve the throughput of your cluster.

Scheduler Monitoring

Scheduler monitoring can help you to find out why certain jobs are not dispatched. However, providing this information for all jobs at all times can consume resources. You usually do not need this much information.

To disable scheduler monitoring, set schedd_job_info to false in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
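As a minimal sketch, the relevant line of the scheduler configuration would look like the following after the change. The configuration can also be edited from the command line with qconf -msconf and displayed with qconf -ssconf. Only the affected line is shown, and the column spacing is illustrative.

```
schedd_job_info                   false
```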

Finished Jobs

In the case of array jobs, the finished job list in qmaster can become quite large. By switching the finished job list off, you save memory and speed up the qstat process, because qstat also fetches the finished jobs list.

To turn off the finished job list function, set finished_jobs to zero in the cluster configuration. See Adding and Modifying Global and Host Configurations With QMON, and the sge_conf(5) man page.
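As a minimal sketch, the relevant line of the global cluster configuration would look like the following after the change. The configuration can also be edited from the command line with qconf -mconf global and displayed with qconf -sconf. Only the affected line is shown, and the column spacing is illustrative.

```
finished_jobs                0
```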

Job Validation

Forced validation at job submission time can be a valuable procedure to prevent nondispatchable jobs from forever remaining in a pending state. However, job validation can also be a time-consuming task. Job validation can be especially time-consuming in heterogeneous environments with different execution nodes and consumable resources, and in which all users have their own job profiles. In homogeneous environments with only a few different jobs, a general job validation usually can be omitted.

To disable job validation, add the qsub option -w n to the cluster-wide default requests. See Submitting Advanced Jobs With QMON in Sun N1 Grid Engine 6.1 User’s Guide, and the sge_request(5) man page.
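As a minimal sketch, the cluster-wide default request file could contain the following lines. The file location, sge-root/cell/common/sge_request, is taken from the sge_request(5) man page and should be checked against your installation. Any other default options that your site already uses stay in the file alongside this entry.

```
# Cluster-wide default requests (location assumed: <sge-root>/<cell>/common/sge_request)
# Skip job validation at submission time
-w n
```

A user who still wants validation for a particular job can request it on the qsub command line, for example with -w e, which overrides the default request.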

Load Thresholds and Suspend Thresholds

Load thresholds are needed if you deliberately oversubscribe your machines and you need to prevent excessive system load. Suspend thresholds are also used to prevent overloading the system.

Another case where you want to prevent the overloading of a node is when the execution node is still open for interactive load. Interactive load is not under the control of the grid engine system.

A compute farm might be more single-purpose. For example, each CPU at a compute node might be represented by only one queue slot, and no interactive load might be expected at these nodes. In such cases, you can omit load_thresholds.

To disable both thresholds, set load_thresholds to none and suspend_thresholds to none. See Configuring Load and Suspend Thresholds, and the queue_conf(5) man page.
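As a minimal sketch, the two relevant lines of a queue configuration would look like the following after the change. The queue configuration can be edited from the command line with qconf -mq followed by the queue name and displayed with qconf -sq. The queue name all.q mentioned in the usage note is only an example. Only the affected lines are shown.

```
load_thresholds       NONE
suspend_thresholds    NONE
```

For example, qconf -mq all.q opens the configuration of a queue named all.q in an editor, where these two lines can be changed.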

Load Adjustments

Load adjustments are used to increase the measured load after a job is dispatched. This mechanism prevents the oversubscription of machines that is caused by the delay between job dispatching and the corresponding load impact. Load adjustments impose additional work on the scheduler, because the scheduler must take them into account when it sorts hosts and verifies load thresholds. You can switch off load adjustments if you do not need them.

To disable load adjustments, set job_load_adjustments to none and load_adjustment_decay_time to zero in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.
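As a minimal sketch, the two relevant lines of the scheduler configuration would look like the following after the change. The configuration can be edited from the command line with qconf -msconf. The zero decay time is written here in the hours:minutes:seconds form that sched_conf(5) uses for time values, which is an assumption about how your installation displays it.

```
job_load_adjustments              none
load_adjustment_decay_time        0:0:0
```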

Immediate Scheduling

By default, the grid engine system starts scheduling runs at a fixed schedule interval. The advantage of a fixed interval is that it limits the CPU time that the qmaster and the scheduler consume. The disadvantage is that a fixed interval throttles the scheduler and so artificially limits throughput. Many compute farms have machines that are dedicated to the qmaster and the scheduler, and in such setups there is no reason to throttle the scheduler. See schedule_interval in sched_conf(5).

You can configure immediate scheduling by using the flush_submit_sec and flush_finish_sec parameters of the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

If immediate scheduling is activated, the throughput of a compute farm is limited only by the power of the machine that is hosting sge_qmaster and the scheduler.
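As a minimal sketch, the two relevant lines of the scheduler configuration might look like the following when immediate scheduling is enabled. The value 1 is only an illustrative choice, assuming that each parameter gives the number of seconds the scheduler waits after a job is submitted or finishes before a scheduling run is triggered, and that a value of 0 leaves the trigger off. Check sched_conf(5) for the exact semantics in your release.

```
flush_submit_sec                  1
flush_finish_sec                  1
```

The fixed schedule_interval remains in effect alongside these settings, so regular scheduling runs still occur when no jobs are submitted or finished.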

Urgency Policy and Resource Reservation

The urgency policy enables you to customize job priority schemes that are resource-dependent. Such job priority schemes include the following:

Implementing both objectives is especially valuable if you are using resource reservation.