The fetcher is the Web Crawler component that fetches pages from Web sites. You can configure the behavior of the fetcher by setting the properties listed in the table in the default.xml file.
The fetcher compares the value of the fetcher.delay.max property to the value of the Crawl-Delay parameter in the site's robots.txt file.

If the fetcher.delay.max value is greater than the Crawl-Delay value, the fetcher waits the amount of time specified by Crawl-Delay.

If the fetcher.delay.max value is less than the Crawl-Delay value, the fetcher does not crawl the site. It also generates this error message: The delay specified in robots.txt is greater than the max delay. Therefore the crawler will not fully crawl this site. All pending work from this host has been removed.

If the fetcher.delay.max value is set to -1, the fetcher waits the amount of time specified by the Crawl-Delay value.

Note that this behavior occurs only if the http.robots.ignore property is set to false (which is the default).
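As a sketch, the two properties involved in this comparison might be set in default.xml as follows. The name/value property layout shown here is an assumption (it follows a common Hadoop-style configuration format), and the values are illustrative examples, not the shipped defaults:

```xml
<!-- Illustrative fragment; property names are from this topic,
     the surrounding XML layout and the values are examples only. -->
<property>
  <!-- Longest Crawl-Delay (in seconds) the fetcher will honor.
       Set to -1 to always wait the full Crawl-Delay value. -->
  <name>fetcher.delay.max</name>
  <value>30</value>
</property>
<property>
  <!-- Must be false (the default) for the Crawl-Delay
       comparison described above to take effect. -->
  <name>http.robots.ignore</name>
  <value>false</value>
</property>
```

With these example values, a site whose robots.txt specifies Crawl-Delay: 10 would be crawled with a 10-second delay, while a site specifying Crawl-Delay: 60 would not be crawled at all.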
This topic describes overrides for the fetcher property values in the default.xml file.

The site.xml file in the workspace/conf/web-crawler/non-polite-crawl directory contains overrides to the fetcher's default property values. The site.xml file in the workspace/conf/web-crawler/polite-crawl directory overrides the fetcher.delay value, setting it to 1.0. Otherwise, both files use the default values for the fetcher properties.
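For example, the polite-crawl override of fetcher.delay described above might look like this in site.xml. Again, the name/value property layout is an assumed Hadoop-style format; only the property name and the value 1.0 come from this topic:

```xml
<!-- Hypothetical workspace/conf/web-crawler/polite-crawl/site.xml fragment -->
<property>
  <!-- Delay (in seconds) between successive fetches from the same host. -->
  <name>fetcher.delay</name>
  <value>1.0</value>
</property>
```

Any property not overridden in a site.xml file falls back to the value defined in default.xml.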