Sun N1 Grid Engine 6.1 User's Guide

Chapter 7 Error Messages and Troubleshooting

This chapter describes the error messaging procedures of the grid engine system and offers tips on how to resolve various common problems.

How the Software Retrieves Error Reports

The grid engine software reports errors and warnings by logging messages into certain files or by sending email, or both. The log files include message files and job STDERR output.

As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file. The default file name and location are used, or you can specify the filename and the location with certain options of the qsub command. See the grid engine system man pages for detailed information.

Separate message files exist for sge_qmaster, sge_schedd, and each sge_execd. All of these files have the same name: messages. The sge_qmaster message file resides in the master spool directory. The sge_schedd message file resides in the scheduler spool directory. The execution daemons' message files reside in the spool directories of the execution daemons. See Spool Directories Under the Root Directory in Sun N1 Grid Engine 6.1 Installation Guide for more information about the spool directories.

Each message takes up a single line in the files. Each message is subdivided into five components separated by the vertical bar sign (|).

    The components of a message are as follows:

  1. The first component is a time stamp for the message.

  2. The second component specifies the grid engine system daemon that generates the message.

  3. The third component is the name of the host where the daemon runs.

  4. The fourth is a message type. The message type is one of the following:

    • N for notice – for informational purposes

    • I for info – for informational purposes

    • W for warning

    • E for error – an error condition has been detected

    • C for critical – can lead to a program abort

    Use the loglevel parameter in the cluster configuration to specify on a global basis or a local basis what message types you want to log.

  5. The fifth component is the message text.
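The five-component layout described above can be illustrated with a short Python sketch. The sample line, the host name, and the names LogMessage and parse_message_line are made up for illustration and are not taken from a real messages file. The sketch assumes only what this section states: one message per line, with five components separated by vertical bars.

```python
from typing import NamedTuple

# Message type letters as listed in this section.
MESSAGE_TYPES = {
    "N": "notice",
    "I": "info",
    "W": "warning",
    "E": "error",
    "C": "critical",
}


class LogMessage(NamedTuple):
    timestamp: str
    daemon: str
    host: str
    type_code: str
    text: str


def parse_message_line(line: str) -> LogMessage:
    """Split one line of a messages file into its five components."""
    # Split on the first four bars only, so a bar inside the text survives.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 components, got {len(parts)}")
    if parts[3] not in MESSAGE_TYPES:
        raise ValueError(f"unknown message type: {parts[3]!r}")
    return LogMessage(*parts)


# Hypothetical sample line, not copied from a real log.
sample = "06/15/2007 10:12:03|qmaster|host1|E|example text with a | in it"
msg = parse_message_line(sample)
print(MESSAGE_TYPES[msg.type_code], "from", msg.daemon, "on", msg.host)
```

Limiting the split to four separators is a design choice: it keeps the message text intact even if the text itself contains a vertical bar.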


    Note –

    If an error log file is not accessible for some reason, the grid engine system tries to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on the corresponding host.


In some circumstances, the grid engine system notifies users, administrators, or both, about error events by email. The email messages sent by the grid engine system do not contain a message body. The message text is fully contained in the mail subject field.

Consequences of Different Error or Exit Codes

The following table lists the consequences of different job-related error codes or exit codes. These codes are valid for every type of job.

Table 7–1 Job-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Job script       0                     Success
                 99                    Requeue
                 Rest                  Success: exit code in accounting file
prolog/epilog    0                     Success
                 99                    Requeue
                 Rest                  Queue error state, job requeued
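As a rough illustration of Table 7–1, the following Python sketch maps an exit code to the listed consequence. The function name job_exit_consequence is hypothetical, and the sketch assumes that the Success row corresponds to exit code 0 and that "Rest" means every other code.

```python
def job_exit_consequence(method: str, exit_code: int) -> str:
    """Return the consequence that Table 7-1 lists for an exit code."""
    if method not in ("job script", "prolog", "epilog"):
        raise ValueError(f"method not covered by Table 7-1: {method!r}")
    if exit_code == 0:
        # Assumption: the Success row stands for exit code 0.
        return "Success"
    if exit_code == 99:
        return "Requeue"
    # "Rest": every other code. The outcome depends on the method.
    if method == "job script":
        return "Success: exit code in accounting file"
    return "Queue error state, job requeued"


for method, code in [("job script", 99), ("job script", 3), ("prolog", 3)]:
    print(method, code, "->", job_exit_consequence(method, code))
```

The sketch makes one point from the table easy to see: a nonzero exit code other than 99 is harmless for a job script, but puts the queue into the error state when it comes from a prolog or epilog.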

The following table lists the consequences of error codes or exit codes of jobs related to parallel environment (PE) configuration.

Table 7–2 Parallel-Environment-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
pe_start         0                     Success
                 Rest                  Queue set to error state, job requeued
pe_stop          0                     Success
                 Rest                  Queue set to error state, job not requeued

The following table lists the consequences of error codes or exit codes of jobs related to queue configuration. These codes are valid only if corresponding methods were overwritten.

Table 7–3 Queue-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Job starter      0                     Success
                 Rest                  Success, no other special meaning
Suspend          0                     Success
                 Rest                  Success, no other special meaning
Resume           0                     Success
                 Rest                  Success, no other special meaning
Terminate        0                     Success
                 Rest                  Success, no other special meaning

The following table lists the consequences of error or exit codes of jobs related to checkpointing.

Table 7–4 Checkpointing-Related Error or Exit Codes

Script/Method    Exit or Error Code    Consequence
Checkpoint       0                     Success
                 Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful.
Migrate          0                     Success
                 Rest                  Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur.
Restart          0                     Success
                 Rest                  Success, no other special meaning
Clean            0                     Success
                 Rest                  Success, no other special meaning

For jobs that run successfully, the qacct -j command output shows a value of 0 in the failed field, and the output shows the exit status of the job in the exit_status field. However, the shepherd might not be able to run a job successfully. For example, the epilog script might fail, or the shepherd might not be able to start the job. In such cases, the failed field displays one of the code values listed in the following table.

Table 7–5 qacct -j failed Field Codes

Code   Description                      acctvalid   Meaning for Job
0      No failure                       t           Job ran, exited normally
1      Presumably before job            f           Job could not be started
3      Before writing config            f           Job could not be started
4      Before writing PID               f           Job could not be started
5      On reading config file           f           Job could not be started
6      Setting processor set            f           Job could not be started
7      Before prolog                    f           Job could not be started
8      In prolog                        f           Job could not be started
9      Before pestart                   f           Job could not be started
10     In pestart                       f           Job could not be started
11     Before job                       f           Job could not be started
12     Before pestop                    t           Job ran, failed before calling PE stop procedure
13     In pestop                        t           Job ran, PE stop procedure failed
14     Before epilog                    t           Job ran, failed before calling epilog script
15     In epilog                        t           Job ran, failed in epilog script
16     Releasing processor set          t           Job ran, processor set could not be released
24     Migrating (checkpointing jobs)   t           Job ran, job will be migrated
25     Rescheduling                     t           Job ran, job will be rescheduled
26     Opening output file              f           Job could not be started, stderr/stdout file could not be opened
27     Searching requested shell        f           Job could not be started, shell not found
28     Changing to working directory    f           Job could not be started, error changing to start directory
100    Assumedly after job              t           Job ran, job killed by a signal

The Code column lists the value of the failed field. The Description column lists the text that appears in the qacct -j output. If acctvalid is set to t, the job accounting values are valid. If acctvalid is set to f, the resource usage values of the accounting record are not valid. The Meaning for Job column indicates whether the job ran or not.
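The following Python sketch shows one way to read the failed and exit_status fields from qacct -j output and to decide whether the job ran. It assumes that the output consists of one "name value" pair per line and that a nonzero failed value can be followed by descriptive text. The sample text and the function names are hypothetical. The set of "job ran" codes is taken from the Meaning for Job column of Table 7–5.

```python
# Codes whose "Meaning for Job" in Table 7-5 starts with "Job ran".
RAN_CODES = {0, 12, 13, 14, 15, 16, 24, 25, 100}


def parse_qacct_fields(output: str) -> dict:
    """Collect 'name value' pairs; lines without a value are skipped."""
    fields = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            fields[parts[0]] = parts[1].strip()
    return fields


def summarize(output: str) -> tuple:
    """Return (failed code, exit status, whether the job ran)."""
    fields = parse_qacct_fields(output)
    # Keep only the leading number; trailing description text is ignored.
    failed = int(fields["failed"].split()[0])
    exit_status = int(fields["exit_status"].split()[0])
    return failed, exit_status, failed in RAN_CODES


# Hypothetical excerpt, assumed layout of qacct -j output.
sample = "failed       26\nexit_status  0\n"
print(summarize(sample))
```

Any code outside RAN_CODES is treated as "job did not run", which matches the remaining rows of the table but also covers codes that the table does not list.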

Running Grid Engine System Programs in Debug Mode

For some severe error conditions, the error-logging mechanism might not yield sufficient information to identify the problems. Therefore, the grid engine system offers the ability to run almost all ancillary programs and the daemons in debug mode. Different debug levels vary in the extent and depth of information that is provided. The debug levels range from zero through 10, with 10 being the level delivering the most detailed information and zero turning off debugging.

To set a debug level, an extension to your .cshrc or .profile resource files is provided with the distribution of the grid engine system. For csh or tcsh users, the file sge-root/util/dl.csh is included. For sh or ksh users, the corresponding file is named sge-root/util/dl.sh. The files must be sourced into your standard resource file. As a csh or tcsh user, include the following line in your .cshrc file:


source sge-root/util/dl.csh

As an sh or ksh user, include the following line in your .profile file:


. sge-root/util/dl.sh

As soon as you log out and log in again, you can use the following command to set a debug level:


% dl level

If level is greater than 0, starting a grid engine system command forces the command to write trace output to STDOUT. The trace output can contain warning messages, status messages, and error messages, as well as the names of the program modules that are called internally. Depending on the debug level you specify, the messages also include line number information, which is helpful for error reporting.


Note –

To watch a debug trace, you should use a window with a large scroll-line buffer. For example, you might use a scroll-line buffer of 1000 lines.



Note –

If your window is an xterm, you might want to use the xterm logging mechanism to examine the trace output later on.


If you run one of the grid engine system daemons in debug mode, the daemon keeps its terminal connection to write the trace output. You can abort the terminal connection by typing the interrupt character of the terminal emulation you use. For example, you might use Control-C.

To switch off debug mode, set the debug level back to 0.

Setting the dbwriter Debug Level

The sgedbwriter script starts the dbwriter program. The script is located in sge_root/dbwriter/bin/sgedbwriter. The sgedbwriter script reads the dbwriter configuration file, dbwriter.conf. This configuration file is located in sge_root/cell/common/dbwriter.conf. This configuration file sets the debug level of dbwriter. For example:


#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO

You can use the -debug option of the dbwriter command to change the number of messages that dbwriter produces. In general, you should use the default debug level, which is info. If you use a more verbose debug level, you substantially increase the amount of data output by dbwriter.

You can specify the following debug levels:

warning

Displays only severe errors and warnings.

info

Adds a number of informational messages. info is the default debug level.

config

Gives additional information that is related to dbwriter configuration, for example, about the processing of rules.

fine

Produces more information. If you choose this debug level, all SQL statements run by dbwriter are output.

finer

For debugging.

finest

For debugging.

all

Displays information for all levels. For debugging.
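As a minimal sketch, the following Python function reads the DBWRITER_DEBUG setting from the text of a dbwriter.conf file and checks it against the valid values listed in the configuration excerpt. The function name read_debug_level is hypothetical. Accepting lowercase values and falling back to INFO when the setting is absent are assumptions of this sketch, not documented dbwriter behavior.

```python
VALID_LEVELS = ("WARNING", "INFO", "CONFIG", "FINE", "FINER", "FINEST", "ALL")


def read_debug_level(conf_text: str, default: str = "INFO") -> str:
    """Return the DBWRITER_DEBUG value found in conf_text, validated."""
    level = default
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "DBWRITER_DEBUG":
            level = value.strip().upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"invalid debug level: {level!r}")
    return level


conf = """#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO
"""
print(read_debug_level(conf))
```

A check like this can catch a mistyped level before dbwriter is restarted.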

Diagnosing Problems

The grid engine system offers several reporting methods to help you diagnose problems. The following sections outline their uses.

Pending Jobs Not Being Dispatched

Sometimes a pending job is obviously capable of being run, but the job does not get dispatched. To diagnose the reason, the grid engine system offers a pair of utilities and options, qstat -j job-id and qalter -w v job-id.

The qalter -w v job-id command lists the reasons why a job is not dispatchable in principle. For this purpose, a dry scheduling run is performed. All consumable resources, as well as all slots, are considered to be fully available for this job. Similarly, all load values are ignored because these values vary.
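The two diagnostic commands named above can be wrapped in a small Python helper, sketched here under the assumption that qstat and qalter are on the PATH of the submitting user. The function names are hypothetical. The helper only collects the raw text output and does not interpret it.

```python
import shutil
import subprocess


def diagnostic_commands(job_id) -> list:
    """Build the two diagnostic command lines for a pending job."""
    job = str(job_id)
    return [["qstat", "-j", job], ["qalter", "-w", "v", job]]


def run_diagnostics(job_id) -> dict:
    """Run each command and return its combined output, keyed by command."""
    outputs = {}
    for argv in diagnostic_commands(job_id):
        key = " ".join(argv)
        if shutil.which(argv[0]) is None:
            outputs[key] = None  # command not installed on this host
            continue
        result = subprocess.run(argv, capture_output=True, text=True)
        outputs[key] = result.stdout + result.stderr
    return outputs


# 42 is a made-up job ID; only the command lines are printed here.
for argv in diagnostic_commands(42):
    print(" ".join(argv))
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems if the job ID comes from user input.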

Job or Queue Reported in Error State E

Job or queue errors are indicated by an uppercase E in the qstat output.

A job enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the job.

A queue enters the error state when the grid engine system tries to run a job but fails for a reason that is specific to the queue.

The grid engine system offers a set of possibilities for users and administrators to gather diagnosis information in case of job execution errors. Both the queue and the job error states result from a failed job execution. Therefore the diagnosis possibilities are applicable to both types of error states.

Troubleshooting Common Problems

This section provides information to help you diagnose and respond to the cause of common problems.

Typical Accounting and Reporting Console Errors

Problem:

The installation of the Sun Web Console Version 2.0.3 fails with the following error message:


# ./inst_reporting
...
Register the N1 SGE reporting module in the webconsole

    Registering com.sun.grid.arco_6u3.

Starting Sun(TM) Web Console Version 2.0.3...
Ambiguous output redirect.

Solution:

This Sun Web Console version can only be installed by the user noaccess, who must have /bin/sh as the login shell. The user must be added with the following command:


# useradd -u 60002 -g 60002 -d /tmp -s /bin/sh -c "No Access User" noaccess

Problem:

The table/view dropdown menu of a simple query definition does not contain any entry, but the tables are defined in the database.

Solution:

The problem normally occurs if Oracle is used as the database. During the installation of the reporting module the wrong database schema name has been specified. For Oracle, the database schema name is equal to the name of the database user which is used by dbwriter (the default name is arco_write). For Postgres, the database schema name should be public.

Problem:

Connection refused.

Solution:

The smcwebserver might be down. Start or restart the smcwebserver.

Problem:

The list of queries or the list of results is empty.

Solution:

The cause can be any of the following:

Problem:

The list of available database tables is empty.

Solution:

The cause can be any of the following:

Problem:

The list of selectable fields is empty.

Solution:

No table is selected. Select a table from the list.

Problem:

The list of filters is empty.

Solution:

No fields are selected. Define at least one field.

Problem:

The sort list is empty.

Solution:

No fields are selected. Define at least one field.

Problem:

A defined filter is not used.

Solution:

The filter may be inactive. Modify the unused filter and make it active.

Problem:

The late binding in the advanced query is ignored, but the execution runs into an error.

Solution:

The late binding macro has a syntactical error. The correct syntax for the late binding macro in the advanced query is as follows:


latebinding{attribute;operator}
latebinding{attribute;operator;defaultvalue}
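
The two macro forms shown above can be checked with a regular expression before a query is saved. The following Python sketch is only an illustration. The pattern is an assumption derived from the two forms, the function name and the sample attribute, operator, and default value are hypothetical, and the sketch assumes that none of the parts contains a brace.

```python
import re

# attribute and operator: no semicolons or braces.
# defaultvalue (optional): anything except braces.
LATEBINDING = re.compile(r"latebinding\{([^;{}]+);([^;{}]+)(?:;([^{}]*))?\}")


def parse_latebinding(macro: str):
    """Return (attribute, operator, defaultvalue) or None if malformed."""
    match = LATEBINDING.fullmatch(macro.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Hypothetical macros for illustration.
for macro in ("latebinding{owner;like}",
              "latebinding{owner;like;a*}",
              "latebinding{owner}"):
    print(macro, "->", parse_latebinding(macro))
```

A macro that returns None here does not match either form and should be corrected before the query is run.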

Problem:

The breadcrumb is used to move back, but the login screen is shown.

Solution:

The session timed out. Log in again, or increase the session timeout in the app.xml file.

Problem:

The view configuration is defined, but the default configuration is shown.

Solution:

The defined view configuration is not set to be visible. Open the view configuration and define the view configuration to be used.

Problem:

The view configuration is defined, but the last configuration is shown.

Solution:

The defined view configuration is not set to be visible. Open the view configuration and define the view configuration to be used.

Problem:

The execution of a query takes a very long time.

Solution:

The results coming from the database are very large. Set a limit for the results, or extend the filter conditions.