Monitoring the Crawling Process

Monitor the crawling process in the Oracle SES Administration GUI by using a combination of the following:

  • Check the crawl progress and crawl status on the Home - Schedules page. (Click Refresh Status.)

  • Monitor your crawler statistics on the Home - Schedules - Crawler Progress Summary page and the Home - Statistics page.

  • Monitor the log file for the current schedule.

Crawler Statistics

The following crawler statistics are shown on the Home - Schedules - Crawler Progress Summary page. Some of them are also shown in the log file, under "Crawling results".

  • Documents to Fetch: Number of URLs in the queue waiting to be crawled. The log file uses the phrase "Documents to Process".

  • Documents Fetched: Number of documents retrieved by the crawler.

  • Document Fetch Failures: Number of documents whose contents cannot be retrieved by the crawler. This could be due to an inability to connect to the Web site, slow server response time causing time-outs, or authorization requirements. Problems encountered after a document has been fetched successfully, such as a document that is too big or a duplicate document that is ignored, are not counted as fetch failures.

  • Documents Rejected: Number of URL links encountered but not considered for crawling. The rejection could be due to boundary rules, the robots exclusion rule, the MIME type inclusion rule, the crawling depth limit, or the URL rewriter discard directive.

  • Documents Discovered: Total number of documents discovered so far. This is roughly equal to (documents to fetch) + (documents fetched) + (document fetch failures) + (documents rejected); see the arithmetic sketch after this list.

  • Documents Indexed: Number of documents that have been indexed or are pending indexing.

  • Documents Non-Indexable: Number of documents that cannot be indexed; for example, a file source directory or a document with robots NOINDEX metatag.

  • Document Conversion Failures: Number of document filtering errors. This is counted whenever a document cannot be converted to HTML format.
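Because the counters are related by simple arithmetic, the approximation for Documents Discovered is easy to verify against the Crawler Progress Summary page. The following Java sketch uses made-up counter values and a hypothetical class name; it is an illustration only, not Oracle SES code.

    // Illustrative sketch with made-up counter values: verifies the
    // approximation for Documents Discovered described in the list above.
    public class CrawlCounters {
        public static void main(String[] args) {
            int toFetch = 120;       // Documents to Fetch (still queued)
            int fetched = 800;       // Documents Fetched
            int fetchFailures = 30;  // Document Fetch Failures
            int rejected = 50;       // Documents Rejected

            // Documents Discovered is roughly the sum of the four counters.
            int discovered = toFetch + fetched + fetchFailures + rejected;
            System.out.println("Documents Discovered: about " + discovered); // 1000
        }
    }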

Crawler Log File

The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, run time, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file.

On the Global Settings - Crawler Configuration page, you can select either to log everything or to log only summary information. You can also select the crawler log file directory and the language the crawler uses to generate the log file.

Note:

On UNIX-based systems, ensure that the directory permission is set to 700 if you change the log file directory. Only the user who installed the Oracle software should have access to this directory.

A new log file is created when you restart the crawler. The location of the crawler log file can be found on the Home - Schedules - Crawler Progress Summary page. The crawler maintains the past seven versions of its log file, but only the most recent log file is shown in the Oracle SES Administration GUI. You can view the other log files in the file system.

The log file name follows the convention ids.MMDDhhmm.log, where ids is a system-generated ID that uniquely identifies the source, MM is the month, DD is the day, hh is the launching hour in 24-hour format, and mm is the minutes.

For example, if a schedule for a source identified as i3ds23 starts at 10:00 PM on July 8, then the log file name is i3ds23.07082200.log. Each successive schedule has a unique log file name. After a source has seven log files, the oldest log file is overwritten.
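The convention is straightforward to reproduce, for example in a script that locates the latest log file for a source. The Java sketch below is an illustration only; the class name, method name, and the year in the date are assumptions, not part of Oracle SES.

    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;

    // Illustrative sketch: builds a log file name that follows the
    // ids.MMDDhhmm.log convention described above.
    public class CrawlerLogName {

        // MMddHHmm = month, day, then launch hour and minutes in
        // 24-hour format, matching the MMDDhhmm portion of the name.
        private static final DateTimeFormatter STAMP =
                DateTimeFormatter.ofPattern("MMddHHmm");

        static String logFileName(String sourceId, LocalDateTime launch) {
            return sourceId + "." + launch.format(STAMP) + ".log";
        }

        public static void main(String[] args) {
            // The example from the text: source i3ds23, launched at
            // 10:00 PM on July 8 (year assumed for the sketch).
            LocalDateTime launch = LocalDateTime.of(2024, 7, 8, 22, 0);
            System.out.println(logFileName("i3ds23", launch)); // i3ds23.07082200.log
        }
    }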

Each logging message in the log file is one line containing the following six tab-delimited columns, in order (a parsing sketch follows the list):

  1. Timestamp

  2. Message level

  3. Crawler thread name

  4. Component name. It is typically the name of the executing Java class.

  5. Module name. It can be an internal Java class method name.

  6. Message
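As a simple illustration of the format, a monitoring tool could split each line on the tab character. Everything in the sketch below (the class name and the reconstructed sample line) is hypothetical; real log lines separate the columns with actual tabs, and the component and module columns may be empty, as in the INFO example later in this section.

    // Illustrative sketch: splits one crawler log line into its six
    // tab-delimited columns, in the order given in the list above.
    public class CrawlerLogLine {

        public static void main(String[] args) {
            // Hypothetical sample with empty component and module columns.
            String line = "23:10:39:330\tINFO\tcrawler_2\t\t\t"
                    + "Processing file://localhost/net/stawg02/";

            // The -1 limit preserves empty columns, so the message
            // always lands in the last field.
            String[] cols = line.split("\t", -1);

            System.out.println("timestamp : " + cols[0]);
            System.out.println("level     : " + cols[1]);
            System.out.println("thread    : " + cols[2]);
            System.out.println("component : " + cols[3]);
            System.out.println("module    : " + cols[4]);
            System.out.println("message   : " + cols[5]);
        }
    }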

Crawler Configuration File

The crawler configuration file is ORACLE_HOME/search/data/config/crawler.dat. Most crawler configuration tasks are controlled in the Oracle SES Administration GUI, but certain features (like title fallback, character set detection, and indexing the title of multimedia files) are controlled only by the crawler.dat file.

Note:

The crawler.dat file is not backed up with Oracle SES backup and recovery. If you edit this file, be sure to back it up manually.

Crawling Zip Files Containing Non-UTF8 File Names

The Java library used to process zip files (java.util.zip) supports only UTF8 file names for zip entries. The content of zip entries with non-UTF8 file names is not indexed.

To crawl zip files containing non-UTF8 file names, change the ZIPFILE_PACKAGE parameter in crawler.dat from JDK to APACHE. The Apache library org.apache.tools.zip does not read the zip content in the same order as the JDK library, so the content displayed in the user interface could look different. Zip file titles may also differ, because the crawler uses the first file in the archive as the fallback title. Also, with the Apache library, the source default character set value is used to read the zip entry file names.
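The following sketch shows the kind of difference involved: the Apache library can be given an explicit character set for entry names, whereas java.util.zip assumes UTF8. The class name and the Shift_JIS character set are assumptions for illustration; the crawler itself uses the source default character set. The sketch requires the Ant library (org.apache.tools.zip) on the classpath.

    import java.io.File;
    import java.util.Enumeration;
    import org.apache.tools.zip.ZipEntry;
    import org.apache.tools.zip.ZipFile;

    // Illustrative sketch: lists zip entry names using the Apache
    // library with an explicit entry-name character set, which is how
    // non-UTF8 file names can be decoded correctly.
    public class NonUtf8ZipNames {

        public static void main(String[] args) throws Exception {
            // Shift_JIS is an arbitrary example character set.
            ZipFile zip = new ZipFile(new File(args[0]), "Shift_JIS");
            try {
                Enumeration<?> entries = zip.getEntries();
                while (entries.hasMoreElements()) {
                    ZipEntry entry = (ZipEntry) entries.nextElement();
                    System.out.println(entry.getName());
                }
            } finally {
                zip.close();
            }
        }
    }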

Setting the Logging Level

Specify the crawler logging level with the Doracle.search.logLevel parameter. The defined levels are DEBUG (2), INFO (4), WARN (6), ERROR (8), and FATAL (10). The default value is 4, which means that messages of level 4 and higher are logged; DEBUG (level 2) messages are therefore not logged by default.

For example, the following "info" message is logged at 23:10:39:330. It comes from the thread named crawler_2, and the message is Processing file://localhost/net/stawg02/. The component and module names are not specified.

23:10:39:330 INFO    crawler_2      Processing file://localhost/net/stawg02/
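As a rough illustration of how a numeric threshold like this suppresses lower-level messages, consider the sketch below. The class, the method, and the way the property is read are assumptions for illustration, not Oracle SES internals; only the level names, numbers, and the default of 4 come from the text above.

    // Illustrative sketch: filters messages against a numeric level
    // threshold, using the level numbers given in the text.
    public class LogLevelFilter {

        static final int DEBUG = 2, INFO = 4, WARN = 6, ERROR = 8, FATAL = 10;

        // Reads the threshold as a JVM system property, defaulting to
        // 4 (INFO) as the text describes. (Assumed mechanism.)
        static final int THRESHOLD =
                Integer.getInteger("oracle.search.logLevel", INFO);

        static void log(int level, String message) {
            // Messages at or above the threshold are logged, so at the
            // default of 4, DEBUG (2) messages are suppressed.
            if (level >= THRESHOLD) {
                System.out.println(message);
            }
        }

        public static void main(String[] args) {
            log(DEBUG, "not printed at the default level");
            log(INFO, "printed at the default level");
        }
    }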

The crawler uses a set of codes to indicate the crawling result for each crawled URL. Besides the standard HTTP status codes, it uses its own codes for situations not related to HTTP.