The grid engine software reports errors and warnings by logging messages into certain files or by sending email, or both. The log files include message files and job STDERR output.
As soon as a job is started, the standard error (STDERR) output of the job script is redirected to a file. Either the default file name and location are used, or you can specify the file name and location with certain options of the qsub command. See the grid engine system man pages for detailed information.
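For example, the -e and -o options of qsub set explicit paths for the job's STDERR and STDOUT files, and -j y merges STDERR into the STDOUT file. The script name and paths below are illustrative:

```
# Write the job's STDERR and STDOUT to explicit files instead of the
# defaults (<jobname>.e<jobid> and <jobname>.o<jobid>).
qsub -e /var/tmp/myjob.err -o /var/tmp/myjob.out myjob.sh

# Or merge STDERR into the STDOUT file.
qsub -j y -o /var/tmp/myjob.log myjob.sh
```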
Separate message files exist for sge_qmaster, sge_schedd, and the sge_execd daemons. All three files have the same file name: messages. The sge_qmaster log file resides in the master spool directory, the sge_schedd message file resides in the scheduler spool directory, and the execution daemons' log files reside in the spool directories of the execution daemons. See Spool Directories Under the Root Directory in Sun N1 Grid Engine 6.1 Installation Guide for more information about the spool directories.
Each message takes up a single line in the files. Each message is subdivided into five components separated by the vertical bar sign (|).
The components of a message are as follows:
The first component is a time stamp for the message.
The second component specifies the daemon that generates the message.
The third component is the name of the host where the daemon runs.
The fourth is a message type. The message type is one of the following:
N for notice – for informational purposes
I for info – for informational purposes
W for warning
E for error – an error condition has been detected
C for critical – can lead to a program abort
Use the loglevel parameter in the cluster configuration to specify, globally or locally, which message types you want to log.
The fifth component is the message text.
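As an illustration, a line such as the following (the content is made up, but the layout is the five-field format described above) can be split on the vertical bar with awk:

```shell
# A sample messages line: timestamp|daemon|host|type|text
line='02/13/2008 15:21:43|qmaster|myhost|W|job 123.1 failed'

# Print the message type (field 4) and the message text (field 5).
printf '%s\n' "$line" | awk -F'|' '{ print $4, $5 }'
```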
If an error log file is not accessible for some reason, the grid engine system tries to log the error message to the files /tmp/sge_qmaster_messages, /tmp/sge_schedd_messages, or /tmp/sge_execd_messages on the corresponding host.
In some circumstances, the grid engine system notifies users, administrators, or both, about error events by email. The email messages sent by the grid engine system do not contain a message body. The message text is fully contained in the mail subject field.
The following table lists the consequences of different job-related error codes or exit codes. These codes are valid for every type of job.
Table 9–1 Job-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
|---|---|---|
| Job script | 0 | Success |
| | 99 | Requeue |
| | Rest | Success: exit code in accounting file |
| prolog/epilog | 0 | Success |
| | 99 | Requeue |
| | Rest | Queue error state, job requeued |
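For example, a job script can force a requeue by exiting with code 99 when it detects a transient failure. The sketch below runs the script body in a subshell so the exit status it would return can be inspected; the failure check is hypothetical, not part of the distribution:

```shell
# A job script can force a requeue by exiting 99. The body is run in a
# subshell here so we can inspect the exit status it would return.
job_body() {
    # Hypothetical transient-failure check (illustration only).
    scratch_ok=false
    if [ "$scratch_ok" != "true" ]; then
        # Exit code 99 tells the grid engine system to requeue the job.
        return 99
    fi
    return 0
}

( job_body )
status=$?
echo "job exit status: $status"
```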
The following table lists the consequences of error codes or exit codes of jobs related to parallel environment (PE) configuration.
Table 9–2 Parallel-Environment-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
|---|---|---|
| pe_start | 0 | Success |
| | Rest | Queue set to error state, job requeued |
| pe_stop | 0 | Success |
| | Rest | Queue set to error state, job not requeued |
The following table lists the consequences of error codes or exit codes of jobs related to queue configuration. These codes are valid only if the corresponding methods were overridden.
Table 9–3 Queue-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
|---|---|---|
| Job starter | 0 | Success |
| | Rest | Success, no other special meaning |
| Suspend | 0 | Success |
| | Rest | Success, no other special meaning |
| Resume | 0 | Success |
| | Rest | Success, no other special meaning |
| Terminate | 0 | Success |
| | Rest | Success, no other special meaning |
The following table lists the consequences of error or exit codes of jobs related to checkpointing.
Table 9–4 Checkpointing-Related Error or Exit Codes
| Script/Method | Exit or Error Code | Consequence |
|---|---|---|
| Checkpoint | 0 | Success |
| | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. |
| Migrate | 0 | Success |
| | Rest | Success. For kernel checkpoint, however, this means that the checkpoint was not successful. Migration will occur. |
| Restart | 0 | Success |
| | Rest | Success, no other special meaning |
| Clean | 0 | Success |
| | Rest | Success, no other special meaning |
For some severe error conditions, the error-logging mechanism might not yield sufficient information to identify the problems. Therefore, the grid engine system offers the ability to run almost all ancillary programs and the daemons in debug mode. Different debug levels vary in the extent and depth of information that is provided. The debug levels range from zero through 10, with 10 being the level delivering the most detailed information and zero turning off debugging.
To set a debug level, an extension to your .cshrc or .profile resource files is provided with the distribution of the grid engine system. For csh or tcsh users, the file sge-root/util/dl.csh is included. For sh or ksh users, the corresponding file is named sge-root/util/dl.sh. The files must be sourced into your standard resource file. As a csh or tcsh user, include the following line in your .cshrc file:
```
source sge-root/util/dl.csh
```
As an sh or ksh user, include the following line in your .profile file:
```
. sge-root/util/dl.sh
```
After you log out and log in again, you can set a debug level with the following command:
```
% dl level
```
If level is greater than 0, any grid engine system command you start writes trace output to STDOUT. The trace output can contain warning messages, status messages, and error messages, as well as the names of the program modules that are called internally. Depending on the debug level you specify, the messages also include line number information, which is helpful for error reporting.
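Under the hood, dl works by exporting debug-related environment variables that grid engine commands and daemons inspect at startup. The following is a simplified sh sketch of that behavior; the variable name SGE_DEBUG_LEVEL matches the distributed dl.sh, but the per-layer value layout is an approximation (the real dl maps its argument to specific per-layer values):

```shell
# Simplified sketch of the dl function: export a per-layer debug level
# that grid engine commands read at startup. Every layer gets the same
# level here, which is an approximation of the distributed dl.sh.
dl() {
    level=$1
    if [ "$level" -eq 0 ]; then
        # Level 0 turns debugging off.
        unset SGE_DEBUG_LEVEL
    else
        SGE_DEBUG_LEVEL="$level $level $level $level $level $level $level $level"
        export SGE_DEBUG_LEVEL
    fi
}

dl 2
echo "SGE_DEBUG_LEVEL=$SGE_DEBUG_LEVEL"
```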
To watch a debug trace, you should use a window with a large scroll-line buffer. For example, you might use a scroll-line buffer of 1000 lines.
If your window is an xterm, you might want to use the xterm logging mechanism to examine the trace output later on.
If you run one of the grid engine system daemons in debug mode, the daemons keep their terminal connection to write the trace output. You can abort the terminal connections by typing the interrupt character of the terminal emulation you use. For example, you might use Control-C.
To switch off debug mode, set the debug level back to 0.
The sgedbwriter script, located in sge_root/dbwriter/bin/sgedbwriter, starts the dbwriter program. The script reads the dbwriter configuration file, dbwriter.conf, which is located in sge_root/cell/common/dbwriter.conf and which sets the debug level of dbwriter. For example:
```
#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO
```
You can use the -debug option of the dbwriter command to change the number of messages that dbwriter produces. In general, you should use the default debug level, info. More verbose debug levels substantially increase the amount of data that dbwriter outputs.
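To change the level persistently instead, you can edit the DBWRITER_DEBUG line in dbwriter.conf. The sketch below works on a scratch copy of the file so the live configuration is untouched; the file content is the example shown above:

```shell
# Create a scratch copy of dbwriter.conf (illustrative content).
cat > /tmp/dbwriter.conf <<'EOF'
#
# Debug level
# Valid values: WARNING, INFO, CONFIG, FINE, FINER, FINEST, ALL
#
DBWRITER_DEBUG=INFO
EOF

# Switch the level from INFO to FINE so SQL statements are logged.
sed 's/^DBWRITER_DEBUG=INFO$/DBWRITER_DEBUG=FINE/' /tmp/dbwriter.conf \
    > /tmp/dbwriter.conf.new

grep '^DBWRITER_DEBUG=' /tmp/dbwriter.conf.new
```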
You can specify the following debug levels:
warning – Displays only severe errors and warnings.
info – Adds a number of informational messages. info is the default debug level.
config – Gives additional information that is related to dbwriter configuration, for example, about the processing of rules.
fine – Produces more information. If you choose this debug level, all SQL statements run by dbwriter are output.
finer – For debugging.
finest – For debugging.
all – Displays information for all levels. For debugging.