Sun N1 Grid Engine 6.1 Administration Guide

Chapter 9 Fine Tuning, Error Messages, and Troubleshooting

This chapter describes some ways to fine-tune your grid engine system environment. The chapter also describes the error messaging procedures and offers tips on how to resolve various common problems.

Fine-Tuning Your Grid Environment

The grid engine system is a full-function, general-purpose distributed resource management tool. The scheduler component of the system supports a wide range of different compute farm scenarios. To get the maximum performance from your compute environment, you should review the features that are enabled. You should then determine which features you really need to solve your load management problem. Disabling some of these features can improve the throughput of your cluster.

Scheduler Monitoring

Scheduler monitoring can help you to find out why certain jobs are not dispatched. However, providing this information for all jobs at all times can consume resources. You usually do not need this much information.

To disable scheduler monitoring, set schedd_job_info to false in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

Finished Jobs

In the case of array jobs, the finished job list in qmaster can become quite large. By switching the finished job list off, you save memory and speed up the qstat process, because qstat also fetches the finished jobs list.

To turn off the finished job list function, set finished_jobs to zero in the cluster configuration. See Adding and Modifying Global and Host Configurations With QMON, and the sge_conf(5) man page.

Job Validation

Forced validation at job submission time can be a valuable procedure to prevent nondispatchable jobs from remaining in a pending state forever. However, job validation can also be a time-consuming task. Job validation can be especially time-consuming in heterogeneous environments with different execution nodes and consumable resources, and in which all users have their own job profiles. In homogeneous environments with only a few different jobs, general job validation can usually be omitted.

To disable job verification, add the qsub option -w n to the cluster-wide default requests. See Submitting Advanced Jobs With QMON in Sun N1 Grid Engine 6.1 User’s Guide, and the sge_request(5) man page.
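
The following sketch shows one way to add the -w n option to a default requests file without adding it twice. It assumes the file uses the layout described in sge_request(5), with qsub options on plain lines and comments introduced by #. The function name add_no_verify is hypothetical and is not part of the grid engine distribution.

```shell
#!/bin/sh
# Append "-w n" to a default requests file unless some -w option
# is already present on a non-comment line.
# Usage: add_no_verify FILE
add_no_verify() {
    request_file=$1
    # Create the file if it does not exist yet.
    [ -f "$request_file" ] || : > "$request_file"
    # Drop comment lines, then look for an existing "-w <mode>" option.
    if grep -v '^[[:space:]]*#' "$request_file" |
        grep -Eq '(^|[[:space:]])-w[[:space:]]+[a-z]'; then
        echo "unchanged"
    else
        echo "-w n" >> "$request_file"
        echo "added"
    fi
}
```

For the cluster-wide defaults, you might call the function with the path that sge_request(5) gives for the global file, which is assumed here to be $SGE_ROOT/$SGE_CELL/common/sge_request. If the function reports unchanged, the file already sets a -w option that you should review by hand.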

Load Thresholds and Suspend Thresholds

Load thresholds are needed if you deliberately oversubscribe your machines and you need to prevent excessive system load. Suspend thresholds are also used to prevent overloading the system.

Another case where you want to prevent the overloading of a node is when the execution node is still open for interactive load. Interactive load is not under the control of the grid engine system.

A compute farm might be more single-purpose. For example, each CPU at a compute node might be represented by only one queue slot, and no interactive load might be expected at these nodes. In such cases, you can omit load_thresholds.

To disable both thresholds, set load_thresholds to none and suspend_thresholds to none. See Configuring Load and Suspend Thresholds, and the queue_conf(5) man page.

Load Adjustments

Load adjustments are used to increase the measured load after a job is dispatched. This mechanism prevents oversubscription of machines that is caused by the delay between job dispatching and the corresponding load impact. You can switch off load adjustments if you do not need them. Load adjustments impose additional work on the scheduler for sorting hosts and verifying load thresholds.

To disable load adjustments, set job_load_adjustments to none and load_adjustment_decay_time to zero in the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

Immediate Scheduling

By default, the grid engine system starts scheduling runs at a fixed schedule interval. An advantage of fixed intervals is that they limit the CPU time consumption of the qmaster and the scheduler. A disadvantage is that fixed intervals throttle the scheduler and artificially limit throughput. Many compute farms have machines specifically dedicated to qmaster and the scheduler, and such setups provide no reason to throttle the scheduler. See schedule_interval in sched_conf(5).

You can configure immediate scheduling by using the flush_submit_sec and flush_finish_sec parameters of the scheduler configuration. See Changing the Scheduler Configuration With QMON, and the sched_conf(5) man page.

If immediate scheduling is activated, the throughput of a compute farm is limited only by the power of the machine that is hosting sge_qmaster and the scheduler.
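
The scheduler parameters named in this section and in the Scheduler Monitoring and Load Adjustments sections can be reviewed together. The following sketch reads scheduler configuration text in the "name value" layout that qconf -ssconf is assumed to print, and reports the tuning-related parameters. The function name sched_tuning_report is hypothetical, and the sketch only reads the configuration. It does not change it.

```shell
#!/bin/sh
# Report the scheduler parameters discussed in this section.
# Reads "name value" lines on standard input.
sched_tuning_report() {
    awk '
        # Features that this section describes how to disable.
        $1 == "schedd_job_info" || $1 == "job_load_adjustments" {
            want = ($1 == "schedd_job_info") ? "false" : "none"
            note = ($2 == want) ? "already disabled" : "enabled"
            print $1 ": " $2 " (" note ")"
        }
        # Interval and flush settings are printed without interpretation.
        $1 == "load_adjustment_decay_time" || $1 == "schedule_interval" ||
        $1 == "flush_submit_sec" || $1 == "flush_finish_sec" {
            print $1 ": " $2
        }
    '
}
```

On a running cluster you might use the function as qconf -ssconf | sched_tuning_report. See sched_conf(5) for the meaning of each value before you change it.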

Urgency Policy and Resource Reservation

The urgency policy enables you to customize job priority schemes that are resource-dependent. Such job priority schemes include the following:

Implementing both objectives is especially valuable if you are using resource reservation.

Using DTrace for Performance Tuning

Troubleshooting in a distributed system that spans potentially thousands of active components can challenge even the most experienced system administrator. In practice, Grid Engine administrators have no explicit mechanism for identifying and reproducing issues that lead to degraded performance in their production environments. In the Solaris 10 environment, you can use the DTrace utility to monitor the on-site performance of the Grid Engine master component. DTrace is a comprehensive framework for tracing dynamic events in Solaris 10 environments. For general information about DTrace, see http://www.sun.com/bigadmin/content/dtrace/ and the dtrace man page. For detailed information about using DTrace with N1 Grid Engine 6.1 software, view the $SGE_ROOT/dtrace/README_dtrace.txt file.

Tuning Performance from the Command Line through DTrace

If you can use Solaris 10 DTrace, you can use the $SGE_ROOT/dtrace/monitor.sh script to monitor a Grid Engine master and look for any bottlenecks. The monitor.sh script supports the following options:

-interval value

Specifies the statistics interval to use. The default is 15sec. A larger interval results in coarser statistics, while a smaller value provides more refined results. Most useful values range from 1sec to 24hours.

-cell cell-name

Required if $SGE_CELL is not “default.”

-spooling

Displays qmaster spooling probes in addition to statistics. This option enables you to view more specific information about a presumed spooling bottleneck.

-requests

Shows incoming qmaster request probes. This option enables you to view more specific information to evaluate instances in which someone is flooding your qmaster.


Note –

Any critical, error, or warning messages appear in monitor.sh output.


Analyzing Bottlenecks on the Grid Engine Master

To provide effective performance tuning, you must understand the bottlenecks of distributed systems. The $SGE_ROOT/dtrace/monitor.sh script measures throughput-relevant data of the running Grid Engine master and compiles this data into a few indices that are printed in a single-line view per interval. This view shows four main categories of information:

For more information, see the example below.

Sample DTrace Output for Bottleneck Analysis

The following monitoring output sample illustrates a case where a Grid Engine master bottleneck can be detected. The example shows the following information:


Note –

The specific columns displayed on your system might differ from the example.


In this example, performance degraded between 17:40:32 and 17:41:05.

CPU     ID      FUNCTION:NAME
  0      1             :BEGIN                 Time |   #wrt  wrt/ms |#rep #gdi #ack|   #dsp  dsp/ms    #sad|   #snd    #rcv|  #in++   #in--  #out++  #out--|  #lck0  #ulck0   #lck1  #ulck1
  0  36909         :tick-3sec 2006 Nov 24 17:39:23 |      43       3|   0    8    4|      3     691     121|      4       4|     11      11      15      15|     68      68     289     288
  0  36909         :tick-3sec 2006 Nov 24 17:39:26 |      83      16|   0   10    3|      3     699     122|      3       3|     14      13      17      17|     90      90     681     681
  0  36909         :tick-3sec 2006 Nov 24 17:39:29 |     117      24|   0    9    4|      4    1092     198|      4       4|     13      13      17      17|     71      71     591     591
  0  36909         :tick-3sec 2006 Nov 24 17:39:32 |      19       4|   0    9    3|      3     591     147|      3       3|     12      12      15      15|     44      43     249     249
  0  36909         :tick-3sec 2006 Nov 24 17:39:35 |     144      28|   0    9    4|      4    1012     173|      4       4|     13      13      17      17|     61      62    1246    1247
  0  36909         :tick-3sec 2006 Nov 24 17:39:38 |      46       5|   0    8    3|      3     705     122|      3       3|     11      11      14      14|     67      67     293     293
  0  36909         :tick-3sec 2006 Nov 24 17:39:41 |     154      31|   0    9    3|      4     894     198|      3       3|     13      13      16      16|     73      72     968     969
  0  36909         :tick-3sec 2006 Nov 24 17:39:44 |      46       5|   0   10    4|      4     971     162|      4       4|     13      13      17      17|     71      72     304     304
  0  36909         :tick-3sec 2006 Nov 24 17:39:47 |     154      29|   0    8    3|      3     739     158|      3       3|     11      11      14      14|     67      67     990     990
  0  36909         :tick-3sec 2006 Nov 24 17:39:50 |      46       5|   0   10    4|      4     815     162|      4       4|     14      14      18      18|     76      76     692     693
  0  36909         :tick-3sec 2006 Nov 24 17:39:53 |      74      15|   0    8    3|      3     746     136|      3       3|     12      12      15      15|     54      53     571     571
  0  36909         :tick-3sec 2006 Nov 24 17:39:56 |     116      20|   0   11    4|      4     992     184|      4       4|     14      14      18      18|     80      81     669     669
  0  36909         :tick-3sec 2006 Nov 24 17:39:59 |      87      18|   0   11    4|      4     851     176|      5       4|     15      15      21      21|     77      76     670     670
  0  36909          :tick-3sec 2006 Nov 24 17:40:02 |     109      20|   0   12    5|      4     930     184|      4       5|     17      17      20      20|     77      78     624     624
   0  36909         :tick-3sec 2006 Nov 24 17:40:05 |      88      15|   0    9    3|      4     995     176|      3       3|     12      12      15      15|     71      71    1026    1026
  0  36909          :tick-3sec 2006 Nov 24 17:40:08 |     112      20|   0   12    4|      4     927     184|      5       4|     16      16      22      22|     81      81     652     652
  0  36909          :tick-3sec 2006 Nov 24 17:40:11 |      32       6|   0    7    4|      3     618     121|      3       4|     11      11      13      13|     54      53     336     336
  0  36909          :tick-3sec 2006 Nov 24 17:40:14 |     145      30|   0   11    4|      4     988     199|      4       4|     15      15      19      19|     64      65     827     827
  0  36909          :tick-3sec 2006 Nov 24 17:40:17 |      43       3|   0    7    3|      3     618     121|      3       3|     10      10      13      13|     64      64     286     286
  0  36909          :tick-3sec 2006 Nov 24 17:40:20 |     157      31|   0   11    4|      4     977     199|      4       4|     15      15      19      19|     80      80    1406    1408
  0  36909          :tick-3sec 2006 Nov 24 17:40:23 |      43       4|   0    7    3|      3     701     121|      3       3|     10      10      13      13|     64      64     285     285
  0  36909          :tick-3sec 2006 Nov 24 17:40:26 |      73      18|   0   11    4|      4     948     171|      4       4|     15      15      19      19|     77      77     700     700
  0  36909          :tick-3sec 2006 Nov 24 17:40:29 |     127      31|   0   10    4|      4     968     189|      4       4|     14      14      18      18|     74      74     584     584
  0  36909          :tick-3sec 2006 Nov 24 17:40:32 |      10       3|   0    6    0|      1     203      41|      0       0|     58       8      62      62|     23      22     106     106
  0  36909          :tick-3sec 2006 Nov 24 17:40:35 |      19       5|   0    5    0|      0       0       0|      0       0|      8       5      13      13|     30      30     200     200
  0  36909          :tick-3sec 2006 Nov 24 17:40:38 |      16       5|   0    5    1|      0       0       0|      0       0|      5       6      10      10|     27      26     558     559
  0  36909          :tick-3sec 2006 Nov 24 17:40:41 |       1       0|   0    4    0|      0       0       0|      0       0|      7       4      11      11|      9       9      34      34
  0  36909          :tick-3sec 2006 Nov 24 17:40:44 |       0       0|   0    4    0|      0       0       0|      0       0|      7       4      11      11|      8       8      28      28
  0  36909          :tick-3sec 2006 Nov 24 17:40:47 |       0       0|   0    6    0|      1     744      81|      1       1|     10       6      15      15|     14      14      33      33
  0  36909          :tick-3sec 2006 Nov 24 17:40:50 |       1       0|   0    5    1|      0       0       0|      0       0|      8       6      14      14|     11      11      49      49
  0  36909          :tick-3sec 2006 Nov 24 17:40:53 |       0       0|   0    4    0|      0       0       0|      0       0|      9       4      12      12|      6       7      28      28
  0  36909          :tick-3sec 2006 Nov 24 17:40:56 |       0       0|   0    5    0|      0       0       0|      0       0|      8       5      13      13|     12      12     420     420
  0  36909          :tick-3sec 2006 Nov 24 17:40:59 |       0       0|   0    4    0|      0       0       0|      0       0|      8       4      12      12|      9       8      30      30
  0  36909          :tick-3sec 2006 Nov 24 17:41:02 |       0       0|   0    4    1|      0       0       0|      0       0|     12       5      16      16|      7       8      25      25
  0  36909          :tick-3sec 2006 Nov 24 17:41:05 |     165      41|   0   48   60|      0       0       0|      1       1|     23     106      71      71|     96      97    1236    1236
  0  36909          :tick-3sec 2006 Nov 24 17:41:08 |     178      28|   0   15   53|      4     965     206|      4       4|     68      68      75      75|    130     130    1336    1336
  0  36909          :tick-3sec 2006 Nov 24 17:41:11 |     106      23|   0   27   35|      4     855     166|      4       4|     82      82      91      91|    115     114    1040    1040
  0  36909          :tick-3sec 2006 Nov 24 17:41:14 |     198      37|   0   41   70|      4    1189     196|      4       4|    185     185     185     185|    134     135    1327    1327
  0  36909          :tick-3sec 2006 Nov 24 17:41:17 |      16       5|   0    9    5|      4     940     161|      3       3|     17      17      20      20|     43      42     234     234
  0  36909          :tick-3sec 2006 Nov 24 17:41:20 |     162      35|   0   13    8|      4     958     200|      4       4|     23      23      28      28|     80      81    1018    1018
  0  36909          :tick-3sec 2006 Nov 24 17:41:23 |      44       6|   0    6    3|      2     544      81|      3       3|      8       8      11      11|     63      63     747     747
  0  36909          :tick-3sec 2006 Nov 24 17:41:26 |     150      34|   0   13    6|      4     921     199|      4       4|     21      21      25      25|     73      72     923     923
  0  36909          :tick-3sec 2006 Nov 24 17:41:29 |      43       3|   0    5    2|      2     506      81|      2       2|      7       7       9       9|     57      57     260     260
  0  36909          :tick-3sec 2006 Nov 24 17:41:32 |     157      37|   0    9    3|      4     978     199|      3       3|     13      13      16      16|     73      72     970     970
  0  36909          :tick-3sec 2006 Nov 24 17:41:35 |      43       3|   0    7    3|      2     512      85|      3       3|      9       9      12      12|     61      62     274     274
  0  36909          :tick-3sec 2006 Nov 24 17:41:38 |     127      29|   0    8    3|      4     994     185|      3       3|     11      11      14      14|     68      68    1265    1265
  0  36909          :tick-3sec 2006 Nov 24 17:41:41 |      66      11|   0   10    4|      4     973     171|      4       4|     14      14      18      18|     67      67     354     354
  0  36909          :tick-3sec 2006 Nov 24 17:41:44 |      48      10|   0    8    3|      3     785     128|      3       3|     11      11      14      14|     52      51     399     399
  0  36909          :tick-3sec 2006 Nov 24 17:41:47 |     142      31|   0   12    4|      4     913     192|      5       4|     17      17      23      23|     89      90     830     830
  0  36909          :tick-3sec 2006 Nov 24 17:41:50 |      64      13|   0   11    5|      4     853     168|      4       5|     15      15      18      18|     75      75     542     542
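
As a rough aid for reading output like the preceding sample, the following sketch lists the intervals in which the #dsp column is zero. It assumes the column layout of the sample, in which the #dsp value is the first number in the fourth |-delimited field and the time of day is the last word before the first | character. The function name stalled_intervals is hypothetical and is not part of the monitor.sh distribution.

```shell
#!/bin/sh
# Print the time of each monitor.sh interval whose #dsp value is 0.
# Reads monitor.sh output on standard input.
stalled_intervals() {
    awk -F'|' '
        /:tick-/ {
            n = split($1, head, " ")   # last word of field 1 is the time
            split($4, dsp, " ")        # dsp[1] is the #dsp column
            if (dsp[1] == 0) print head[n]
        }
    '
}
```

For example, if the monitor output was saved to a file that is here assumed to be named monitor.out, the command stalled_intervals < monitor.out prints one time stamp for each such interval. Header lines are skipped because they do not contain a tick probe name.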

How the Grid Engine Software Retrieves Error Reports

The grid engine software reports errors and warnings by logging messages into certain files or by sending email, or both. The log files include message files and job STDERR output.

As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file. The default file name and location are used, or you can specify the filename and the location with certain options of the qsub command. See the grid engine system man pages for detailed information.

Separate messages files exist for sge_qmaster, sge_schedd, and each sge_execd. The files have the same file name: messages. The sge_qmaster messages file resides in the master spool directory. The sge_schedd messages file resides in the scheduler spool directory. The messages files of the execution daemons reside in the spool directories of the execution daemons. See Spool Directories Under the Root Directory in Sun N1 Grid Engine 6.1 Installation Guide for more information about the spool directories.

Each message takes up a single line in the files. Each message is subdivided into five components separated by the vertical bar sign (|).

    The components of a message are as follows:

  1. The first component is a time stamp for the message.

  2. The second component specifies the daemon that generates the message.

  3. The third component is the name of the host where the daemon runs.

  4. The fourth component is the message type. The message type is one of the following:

    • N for notice – for informational purposes

    • I for info – for informational purposes

    • W for warning

    • E for error – an error condition has been detected

    • C for critical – can lead to a program abort

    Use the loglevel parameter in the cluster configuration to specify on a global basis or a local basis what message types you want to log.

  5. The fifth component is the message text.


    Note –

    If an error log file is not accessible for some reason, the grid engine system tries to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on the corresponding host.
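
Because each message is a single line with five |-separated components, a messages file is easy to filter with standard tools. The following sketch prints only the messages of the requested types. The function name filter_messages and the output layout are hypothetical, and the sketch assumes that any additional | characters belong to the message text.

```shell
#!/bin/sh
# Print messages whose type (fourth component) is one of TYPES.
# Usage: filter_messages TYPES < messages-file    (for example, TYPES=EC)
filter_messages() {
    awk -F'|' -v types="$1" '
        NF >= 5 && length($4) == 1 && index(types, $4) > 0 {
            text = $5
            # Re-join the message text if it contained "|" itself.
            for (i = 6; i <= NF; i++) text = text "|" $i
            # Layout: time stamp, type, daemon@host, message text.
            print $1 " [" $4 "] " $2 "@" $3 ": " text
        }
    '
}
```

For example, filter_messages EC applied to the sge_qmaster messages file shows only error and critical messages.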


In some circumstances, the grid engine system notifies users, administrators, or both, about error events by email. The email messages sent by the grid engine system do not contain a message body. The message text is fully contained in the mail subject field.

Consequences of Different Error or Exit Codes

The following table lists the consequences of different job-related error codes or exit codes. These codes are valid for every type of job.

Table 9–1 Job-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Job script       0                     Success
Job script       99                    Requeue
Job script       Rest                  Success: exit code in accounting file
prolog/epilog    0                     Success
prolog/epilog    99                    Requeue
prolog/epilog    Rest                  Queue error state, job requeued
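
The rules in Table 9–1 can be expressed as a short lookup. The following sketch is for illustration only. The function names are hypothetical, and the sketch assumes that an exit code of 0 is the success case.

```shell
#!/bin/sh
# Consequence of a job script exit code, following Table 9-1.
job_script_consequence() {
    case "$1" in
        0)  echo "success" ;;
        99) echo "requeue" ;;
        *)  echo "success, exit code $1 recorded in accounting file" ;;
    esac
}

# Consequence of a prolog or epilog exit code, following Table 9-1.
prolog_epilog_consequence() {
    case "$1" in
        0)  echo "success" ;;
        99) echo "requeue" ;;
        *)  echo "queue error state, job requeued" ;;
    esac
}
```

A job script that detects a transient problem can therefore end with exit 99 to ask for a requeue. Any other nonzero exit code from a job script is only recorded in the accounting file.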

The following table lists the consequences of error codes or exit codes of jobs related to parallel environment (PE) configuration.

Table 9–2 Parallel-Environment-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
pe_start         0                     Success
pe_start         Rest                  Queue set to error state, job requeued
pe_stop          0                     Success
pe_stop          Rest                  Queue set to error state, job not requeued

The following table lists the consequences of error codes or exit codes of jobs related to queue configuration. These codes are valid only if corresponding methods were overwritten.

Table 9–3 Queue-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Job starter      0                     Success
Job starter      Rest                  Success, no other special meaning
Suspend          0                     Success
Suspend          Rest                  Success, no other special meaning
Resume           0                     Success
Resume           Rest                  Success, no other special meaning
Terminate        0                     Success
Terminate        Rest                  Success, no other special meaning

The following table lists the consequences of error or exit codes of jobs related to checkpointing.

Table 9–4 Checkpointing-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Checkpoint       0                     Success
Checkpoint       Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful.
Migrate          0                     Success
Migrate          Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur.
Restart          0                     Success
Restart          Rest                  Success, no other special meaning
Clean            0                     Success
Clean            Rest                  Success, no other special meaning

Running Grid Engine System Programs in Debug Mode

For some severe error conditions, the error-logging mechanism might not yield sufficient information to identify the problems. Therefore, the grid engine system offers the ability to run almost all ancillary programs and the daemons in debug mode. Different debug levels vary in the extent and depth of information that is provided. The debug levels range from zero through 10, with 10 being the level delivering the most detailed information and zero turning off debugging.

To set a debug level, an extension to your .cshrc or .profile resource files is provided with the distribution of the grid engine system. For csh or tcsh users, the file sge-root/util/dl.csh is included. For sh or ksh users, the corresponding file is named sge-root/util/dl.sh. The files must be sourced into your standard resource file. As a csh or tcsh user, include the following line in your .cshrc file:


source sge-root/util/dl.csh

As an sh or ksh user, include the following line in your .profile file:


. sge-root/util/dl.sh

As soon as you log out and log in again, you can use the following command to set a debug level:


% dl level

If level is greater than 0, starting a grid engine system command forces the command to write trace output to STDOUT. The trace output can contain warning messages, status messages, and error messages, as well as the names of the program modules that are called internally. The messages also include line number information, which is helpful for error reporting, depending on the debug level you specify.


Note –

To watch a debug trace, you should use a window with a large scroll-line buffer. For example, you might use a scroll-line buffer of 1000 lines.



Note –

If your window is an xterm, you might want to use the xterm logging mechanism to examine the trace output later on.


If you run one of the grid engine system daemons in debug mode, the daemon keeps its terminal connection to write the trace output. You can abort the terminal connection by typing the interrupt character of the terminal emulation you use. For example, you might use Control-C.

To switch off debug mode, set the debug level back to 0.

Setting the dbwriter Debug Level

The sgedbwriter script starts the dbwriter program. The script is located in sge_root/dbwriter/bin/sgedbwriter. The sgedbwriter script reads the dbwriter configuration file, dbwriter.conf. This configuration file is located in sge_root/cell/common/dbwriter.conf. This configuration file sets the debug level of dbwriter. For example:


#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO

You can use the -debug option of the dbwriter command to change the number of messages that dbwriter produces. In general, you should use the default debug level, which is info. If you use a more verbose debug level, you substantially increase the amount of data output by dbwriter.

You can specify the following debug levels:

warning

Displays only severe errors and warnings.

info

Adds a number of informational messages. info is the default debug level.

config

Gives additional information that is related to dbwriter configuration, for example, about the processing of rules.

fine

Produces more information. If you choose this debug level, all SQL statements run by dbwriter are output.

finer

For debugging.

finest

For debugging.

all

Displays information for all levels. For debugging.
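
Before you restart dbwriter with a changed configuration, you might want to confirm that the configured level is one of the valid values. The following sketch reads DBWRITER_DEBUG from a file in the layout of the dbwriter.conf excerpt shown earlier in this section and checks it against the values listed in that excerpt. The function name dbwriter_debug_level is hypothetical and is not part of the dbwriter distribution.

```shell
#!/bin/sh
# Print the DBWRITER_DEBUG value from a dbwriter.conf-style file.
# Returns 1 if the value is missing or is not a listed level.
# Usage: dbwriter_debug_level FILE
dbwriter_debug_level() {
    conf_file=$1
    # Use the last uncommented DBWRITER_DEBUG=... assignment in the file.
    level=$(sed -n 's/^[[:space:]]*DBWRITER_DEBUG=//p' "$conf_file" | tail -n 1)
    case "$level" in
        WARNING|INFO|CONFIG|FINE|FINER|FINEST|ALL)
            echo "$level" ;;
        *)
            echo "invalid or missing DBWRITER_DEBUG: '$level'" >&2
            return 1 ;;
    esac
}
```

The check is case-sensitive and expects the uppercase spelling used in the configuration file excerpt.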

Diagnosing Problems

The grid engine system offers several reporting methods to help you diagnose problems. The following sections outline their uses.

Pending Jobs Not Being Dispatched

Sometimes a pending job is obviously capable of being run, but the job does not get dispatched. To diagnose the reason, the grid engine system offers a pair of utilities and options: qstat -j job-id and qalter -w v job-id.

The qalter -w v job-id command lists the reasons why a job is not dispatchable in principle. For this purpose, a dry scheduling run is performed. All consumable resources, as well as all slots, are considered to be fully available for this job. Similarly, all load values are ignored because these values vary.

Job or Queue Reported in Error State E

Job or queue errors are indicated by an uppercase E in the qstat output.

A job enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the job.

A queue enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the queue.

The grid engine system offers a set of possibilities for users and administrators to gather diagnosis information in case of job execution errors. Both the queue and the job error states result from a failed job execution. Therefore the diagnosis possibilities are applicable to both types of error states.

Troubleshooting Common Problems

This section provides information to help you diagnose and respond to the cause of common problems.