Sun N1 Grid Engine 6.1 Administration Guide

Gathering Accounting and Reporting Statistics

The grid engine system provides two kinds of reporting and accounting facilities:

Report Statistics (ARCo)

You can use the optional Accounting and Reporting Console (ARCo) to generate live accounting and reporting data from the grid engine system and store the data in the reporting database, which is a standard SQL database. ARCo supports the following SQL database systems:

ARCo also provides a web-based tool for generating information queries on the reporting database and for retrieving the results in tabular or graphical form. ARCo enables you to store queries for later use, to run predefined queries, and to run queries in batch mode. For more information about how to use ARCo, see Chapter 5, Accounting and Reporting, in Sun N1 Grid Engine 6.1 User’s Guide. For information about how to install ARCo, see Chapter 8, Installing the Accounting and Reporting Console, in Sun N1 Grid Engine 6.1 Installation Guide.

Raw reporting data is generated by sge_qmaster. This raw data is stored in a reporting file. The dbwriter program reads the raw data in the reporting file and writes it to the SQL reporting database, where it can be accessed by ARCo.

About the dbwriter Program

The dbwriter program performs the following tasks:

When dbwriter starts up, it calculates derived values. dbwriter also deletes outdated records at startup. If dbwriter runs in continuous mode, dbwriter continues to calculate derived values and to delete outdated records at hourly intervals, or at whatever interval you specify.

You can specify in an XML file the values that you want to calculate and the records that you want to delete. Use the -calculation option of the dbwriter command to specify the path to this XML file.

For detailed information about calculating derived values, see Calculating Derived Values With dbwriter.

For detailed information about deleting outdated records, see Deleting Outdated Records With dbwriter.

Enabling the Reporting File

The reporting file contains the following types of data:

When the grid engine system is first installed, the reporting file is disabled. To use ARCo, you must enable the reporting file for the cluster. Once enabled, the reporting file will be generated by sge_qmaster. By default, the reporting file is located in sge-root/cell/common. You can change the default with the -reporting option of the dbwriter command.

For information about configuring the generation of the reporting file, see the reporting_params parameter of the sge_conf(5) man page, and the report_variables parameter of the host_conf(5) man page.

To enable the reporting file with QMON, on the Main Control window click the Cluster Configuration button, select the global host, and then click Modify.

On the Cluster Settings dialog box, click the Advanced Settings tab.

In the Reporting Parameters field, set the following parameters:

To enable the reporting file from the command line, use the qconf –mconf command to set the reporting_params attributes, as described in the preceding paragraph.

Once the reporting file is enabled, the dbwriter can read raw data from the reporting file and write it to the reporting database.

For more information about configuring the reporting file, see the reporting(5) man page. For complete details about installing and setting up ARCo, see Chapter 8, Installing the Accounting and Reporting Console, in Sun N1 Grid Engine 6.1 Installation Guide.

Calculating Derived Values With dbwriter

The rules for calculating derived values are specified in a derived tag, which is a sub tag of the DbWriterConfig tag. The following table lists the attributes of the derived tag:

Attribute

Description

object

The object for which data is aggregated. The object is one of the following:

  • host

  • queue

  • project

  • department

  • user

  • group

interval

The time range specifying how often to calculate the derived values. The time range is one of the following:

  • hour

  • day

  • month

  • year

variable

The name of the variable to hold the calculated data.

The following table lists the subelements of the derived tag:

sql

The SQL statement that calculates the derived values. The statement must produce the following columns:

  • time_start – Together with time_end, specifies the time period for the calculated value

  • time_end

  • value – The calculated derived value

The SQL statement can contain the following placeholders. dbwriter replaces the placeholders for each query, based on a rule:

  • __time_start__ – Start time for the query. dbwriter searches for the last previously calculated derived value from this rule, and uses this timestamp as the start time for the next query.

  • __time_end__ – End time for the query. This timestamp specifies the end of the last passed time range. For example, if the time range is day, and if derived values are calculated at 00:30, 00:00 is taken as time_end.

  • __key_0__, __key_1__, . . . , __key_n__ – Components of the primary key for the specified object type. For example, the sge_hosts table has the primary h_hostname. If a rule is processed for the host object type, one query is executed per entry in the sge_hosts table, the __key_0__ placeholder in the SQL statement is replaced by the hostname.

    The sge_queue table has a composed primary key that is made up of q_qname and q_hostname.

auto

dbwriter generates the SQL statement for the calculation of derived values.

The autogenerated SQL statement looks like the following template:


SELECT time_start, time_end, <function>(<value_field>) as value
FROM ( SELECT TRUNC( <timestart_field>, <interval> ) as time_start
              TRUNC( <timestart_field>, <interval> ) +
                    INTERVAL '1' <interval> as time_end,
              <value_field>
       FROM <object value table>
       WHERE <primary key field 0> = __key_0__
         AND <primary key field 1> = __key_1__
         AND . . .
         AND <parent key field> =
             (SELECT <parent key field> FROM <parent table>
              WHERE <parent filter> )
         AND <timestart_field> <= {ts __time_start__ }
         AND <timeend_field>   >  {ts __time_end__ }
GROUP BY time_start, time_end

The SQL template parameters are as follows:

Parameter

Description

<function>

Aggregate function for calculating the derived value. Comes from the function attribute of the auto tag in the XML file.

<value_field>

Depends on the object of the derived value.

<timestart_field>

Depends on the object of the derived value.

<timeend_field>

Depends on the object of the derived value.

<interval>

Comes from the interval attribute of the derived tag

<object value table>

Name of the database table where the values are stored. Depends on the object (host => host_values, user => user_values, . . .)

<primary key field n>

Primary key that is necessary to join the value table to the parent table. Depends on the object.

<parent key field>

Name of the field that holds the ID of the parent. Depends on the object.

<parent table>

Name of the parent database table. Depends on the object (host => host, user => user)

<parent filter>

Filter for the parent table. A derived value for each entry of the parent table is calculated, for example, u_user = 'user1').

Here is an example of an autogenerated SQL statement:


<derive object="host" interval="day" variable="d_load">
   <auto function="AVG" variable="h_load" />
</derive>

SELECT time_start, time_end, AGE(hv_dvalue)
FROM ( SELECT TRUNC( hv_time_start, 'day' ) as time_start,
              TRUNC( hv_time_start, 'day' ) +
                             INTERVAL '1' day as time_end,
              hv_dvalue
       FROM sge_host_values
       WHERE hv_variable = 'h_load' AND
             hv_parent =
               (SELECT h_id FROM sge_host
                WHERE h_hostname = 'foo.bar') AND
             hv_time_start <= {ts '2004-05-21 00:00:00.0'} AND
             hv_time_end    > {ts '2004-05-17 00:00:00.0'} )
GROUP BY time_start, time_end

Deleting Outdated Records With dbwriter

To delete outdated records in the reporting database, you must specify a deletion rule in the delete tag. The following table lists the attributes of the delete tag:

Attribute

Description

scope

The type of data to delete. Valid entries are the following:

  • job

  • job_log

  • share_log

  • host_values

  • queue_values

  • project_values

  • department_values

  • user_values

  • group_values

time_range

The unit of time_amount:

time_amount

Number of units (time_range) during which a record is to be kept.

The following table lists a subelement of the delete tag:

sub_scope

For certain scopes, a subscope can be configured. The subscope specifies an additional condition for deletion. A subscope can be configured for all *_values scopes and for the share_log scope.

If a subscope is configured for a *_values rule, it contains a list of variables to delete, separated by spaces.

If a subscope is specified for the share_log, it contains a list of share-tree nodes to delete, separated by spaces.

If subscope are used, you should always have a fallback rule without subscope, which will delete all objects that are not explicitly named by the subscope.

Here is an example of a delete tag:


<?xml version="1.0" encoding="UTF-8"?>
<DbWriterConfig>
  <!-- keep host values for 2 years -->
  <delete scope="host_values" time_range="year" time_amount="2"/>

  <!-- keep queue values one month -->
  <delete scope="queue_values" time_range="month" time_amount="1">
    <sub_scope>slots</sub_scope>
    <sub_scope>state</sub_scope>
  </delete>
</DbWriterConfig>

Accounting and Usage Statistics (qacct)

You can use the qacct command to generate alphanumeric accounting statistics. If you specify no options, qacct displays the aggregate usage on all machines of the cluster, as generated by all jobs that have finished and that are contained in the cluster accounting file sge-root/cell/common/accounting. In this case, qacct reports three times, in seconds:

Several options are available for reporting accounting information about queues, users, and the like. In particular, you can use the qacct -l command to request information about all jobs that have finished and that match a resource requirement specification.

Use the qacct -j [job-id | job-name] command to get direct access to the complete resource usage information stored by the grid engine system. This information includes the information that is provided by the getrusage system call.

The -j option reports the resource usage entry for the jobs with job-id or with job-name. If no argument is given, all jobs contained in the referenced accounting file are displayed. If a job ID is specified, and if more than one entry is displayed, one of the following is true:

See the qacct(1) man page for more information.