crawlerSettings

This object configures the global crawler settings that are used by default for new data sources. You can also configure the crawler settings for individual sources, as described in source.

The Oracle SES crawler is a Java process activated by a schedule. When activated, the crawler spawns a configurable number of processor threads that fetch information from various sources and index the documents. This index is used for searching sources.

Object Type

Universal

State Properties

None

Supported Operations

export
update

Administration GUI Page

Global Settings - Crawler Configuration

XML Description

The <search:crawlerSettings> element describes the crawler settings:

<search:crawlerSettings>
   <search:numThreads>
   <search:numProcessors>
   <search:crawlDepth>
      <search:limit>
   <search:languageDetection>
   <search:defaultLanguage>
   <search:crawlTimeout>
   <search:maxDocumentSize>
   <search:charSetDetection>
   <search:defaultCharset>
   <search:preserveDocumentCache>
   <search:servicePipeline>
      <search:pipelineName>
   <search:verboseLogging>
   <search:logLanguage>
   <search:badTitles>
      <search:badTitle>

Element Descriptions

<search:crawlerSettings>

Contains all of the elements for configuring the crawler.

<search:numThreads>

Contains the number of threads the crawler starts to crawl sources.

<search:numProcessors>

Contains the number of CPUs (or cores in a multi-core processor) on the computer where the crawler runs. This setting determines the optimal number of processes used for document conversion. A document conversion process converts formatted documents into HTML documents for indexing.

<search:crawlDepth>

Controls whether crawling is limited to the number of nested links set by <search:limit>.

Attribute	Value
`haslimit`	Set to `true` to restrict crawling to the depth limit, or set to `false` otherwise. Required.

<search:limit>

Contains the number of nested links the crawler follows. Crawling depth starts at 0: at depth 0, the crawler fetches only the starting URL. With a crawling depth of 1, the crawler also fetches any document that is linked from the starting URL, and so forth.
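
As a minimal sketch of the depth setting, the following fragment restricts crawling to the starting URL only. The wrapper elements are copied from the example at the end of this section. The other crawler settings are omitted for brevity, and this section does not state whether an update accepts a partial document, so treat the fragment as something to merge into a full configuration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<search:config productVersion="11.2.1.0.0" xmlns:search="http://xmlns.oracle.com/search">
   <search:crawlerSettings>
      <!-- Other crawler settings omitted for brevity. -->
      <!-- haslimit="true" turns on the depth restriction. -->
      <search:crawlDepth haslimit="true">
         <!-- Depth 0: fetch only the starting URL. -->
         <search:limit>0</search:limit>
      </search:crawlDepth>
   </search:crawlerSettings>
</search:config>
```

Changing the limit to 1 would also fetch documents linked directly from the starting URL.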

<search:languageDetection>

Controls whether the crawler attempts to detect the language of documents that do not specify the language in their metadata.

Language detection involves these steps:

1. The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column of a table source.
2. If the crawler cannot determine the language, then the language recognizer attempts to determine a language. The language recognizer operates on the Latin-1 alphabet and any language with a deterministic Unicode range of characters, such as Chinese, Japanese, and Korean.
3. If the language recognizer cannot identify the language, then the default language is used.

Attribute	Value
`enabled`	Set to `true` to attempt to detect a language, or set to `false` to use the default language. Required.

<search:defaultLanguage>

Contains the code for the default language. The default language is used when language detection is disabled or when the crawler and language detector cannot determine the document language. See Table 2-3, "Crawlable Languages".
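
The interaction between language detection and the default language can be sketched as follows. This fragment disables detection, so the default language (en, as in the example at the end of this section) applies to the crawled documents. The wrapper elements are copied from that example, and the remaining crawler settings are omitted for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<search:config productVersion="11.2.1.0.0" xmlns:search="http://xmlns.oracle.com/search">
   <search:crawlerSettings>
      <!-- Other crawler settings omitted for brevity. -->
      <!-- Detection is off, so the default language below is used. -->
      <search:languageDetection enabled="false"/>
      <search:defaultLanguage>en</search:defaultLanguage>
   </search:crawlerSettings>
</search:config>
```

With enabled="true", the default language would instead serve only as the fallback when neither the crawler nor the language recognizer can determine the language.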

<search:crawlTimeout>

Contains the number of seconds allowed for the crawler to access a document.

<search:maxDocumentSize>

Contains the maximum document size in megabytes. Larger documents are not crawled.

<search:charSetDetection>

Controls automatic character set detection. Set the enabled attribute to true to enable detection, or to false to disable it, as in the example at the end of this section. The default value is true.

<search:defaultCharset>

Contains the default character set. The crawler uses this character set for indexing documents when the character set cannot be determined. See Table 2-4, "Crawlable Character Sets".

<search:preserveDocumentCache>

Controls whether the cache is saved after indexing.

Attribute	Value
`enabled`	Set to `true` to preserve the cache, or set to `false` to discard it. Required.

<search:servicePipeline>

Controls use of a document service pipeline. A document service pipeline is used for search result clustering. If your installation does not use result clustering for any source, then disable the pipeline.

Attribute	Value
`enabled`	Set to `true` to enable the pipeline, or set to `false` to disable it. Required.

<search:pipelineName>

Contains the name of the document service pipeline used when the pipeline is enabled.
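
For an installation that does not use result clustering, a fragment along these lines would disable the pipeline. It is a sketch only: the pipeline name is retained from the example at the end of this section because this section does not state whether <search:pipelineName> may be omitted when the pipeline is disabled. The remaining crawler settings are omitted for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<search:config productVersion="11.2.1.0.0" xmlns:search="http://xmlns.oracle.com/search">
   <search:crawlerSettings>
      <!-- Other crawler settings omitted for brevity. -->
      <!-- enabled="false" disables the document service pipeline. -->
      <search:servicePipeline enabled="false">
         <!-- Name kept from the full example; assumed to be ignored while disabled. -->
         <search:pipelineName>Default pipeline</search:pipelineName>
      </search:servicePipeline>
   </search:crawlerSettings>
</search:config>
```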

<search:verboseLogging>

Controls the level of detail in logging messages.

Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to record detailed activity.

The crawler maintains the last seven versions of its log file. The format of the log file name is iXdsY.MMDDhhmm.log, where X is a system-generated ID, Y is the source ID, MM is the month, DD is the day of the month, hh is the launching hour in 24-hour format, and mm is the minutes. For example, if a schedule for source 23 is launched at 10:00 p.m. on July 8, then the log file name is i3ds23.07082200.log, where 3 is the system-generated ID. Each successive schedule launching has a unique log file name. When the total number of log files for a source reaches seven, the oldest log file is deleted.

Attribute	Value
`enabled`	Set to `true` to record all information, or set to `false` to record only summary information. Required.

<search:logLanguage>

Contains the language code for messages written to the log file. See Table 2-3, "Crawlable Languages".

<search:badTitles>

Contains one or more <search:badTitle> elements. This parameter can be set at the global level.

<search:badTitle>

Contains an exact character string for a document title that the crawler omits from the index. These bad titles are defined by default:

PowerPoint Presentation
Slide 1
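
A bad title list with one additional entry might look like the following sketch. The title "Untitled Document" is a hypothetical value chosen for illustration and is not one of the defaults. The two default titles are repeated because this section does not state whether a supplied list replaces the defaults or adds to them. The wrapper elements are copied from the example below, and the remaining crawler settings are omitted for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<search:config productVersion="11.2.1.0.0" xmlns:search="http://xmlns.oracle.com/search">
   <search:crawlerSettings>
      <!-- Other crawler settings omitted for brevity. -->
      <search:badTitles>
         <!-- Default bad titles. -->
         <search:badTitle>PowerPoint Presentation</search:badTitle>
         <search:badTitle>Slide 1</search:badTitle>
         <!-- Hypothetical additional title; must match the document title exactly. -->
         <search:badTitle>Untitled Document</search:badTitle>
      </search:badTitles>
   </search:crawlerSettings>
</search:config>
```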

Example

This XML document configures the crawler:

<?xml version="1.0" encoding="UTF-8"?>
<search:config productVersion="11.2.1.0.0" xmlns:search="http://xmlns.oracle.com/search">
   <search:crawlerSettings>
      <search:numThreads>5</search:numThreads>
      <search:numProcessors>3</search:numProcessors>
      <search:crawlDepth haslimit="true">
         <search:limit>2</search:limit>
      </search:crawlDepth>
      <search:languageDetection enabled="true"/>
      <search:defaultLanguage>en</search:defaultLanguage>
      <search:crawlTimeout>30</search:crawlTimeout>
      <search:maxDocumentSize>10</search:maxDocumentSize>
      <search:charSetDetection enabled="true"/>
      <search:defaultCharSet>8859_1</search:defaultCharSet>
      <search:cacheDirectory>$OH/data/cache/</search:cacheDirectory>
      <search:preserveDocumentCache enabled="true"/>
      <search:servicePipeline enabled="true">
         <search:pipelineName>Default pipeline</search:pipelineName>
      </search:servicePipeline>
      <search:verboseLogging enabled="true"/>
      <search:logDirectory>$OH/log/crawler/</search:logDirectory>
      <search:logLanguage>en-US</search:logLanguage>
      <search:badTitles>
         <search:badTitle>PowerPoint Presentation</search:badTitle>
         <search:badTitle>Slide 1</search:badTitle>
      </search:badTitles>
   </search:crawlerSettings>
</search:config>