The fetcher is the Web Crawler component that fetches pages from Web sites. You can configure the behavior of the fetcher by setting the properties listed in the table in the default.xml file.
The fetcher compares the value of the fetcher.delay.max property to the value of the Crawl-Delay parameter in the site's robots.txt file.

If the fetcher.delay.max value is greater than the Crawl-Delay value, the fetcher waits the amount of time specified by Crawl-Delay.

If the fetcher.delay.max value is less than the Crawl-Delay value, the fetcher does not crawl the site. It also generates this error message: The delay specified in robots.txt is greater than the max delay. Therefore the crawler will not fully crawl this site. All pending work from this host has been removed.

If the fetcher.delay.max value is set to -1, the fetcher waits the amount of time specified by the Crawl-Delay value.

Note that this behavior occurs only if the http.robots.ignore property is set to false (which is the default).
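As a sketch, the two properties involved in this comparison might be set in default.xml as follows. The name/value property layout shown here is an assumption (it follows a common Hadoop-style configuration format), and the values are illustrative examples, not the shipped defaults:

```xml
<!-- Illustrative fragment; property names are from this topic,
     the surrounding XML layout and the values are examples only. -->
<property>
  <!-- Longest Crawl-Delay (in seconds) the fetcher will honor.
       Set to -1 to always wait the full Crawl-Delay value. -->
  <name>fetcher.delay.max</name>
  <value>30</value>
</property>
<property>
  <!-- Must be false (the default) for the Crawl-Delay
       comparison described above to take effect. -->
  <name>http.robots.ignore</name>
  <value>false</value>
</property>
```

With these example values, a site whose robots.txt specifies Crawl-Delay: 10 would be crawled with a 10-second delay, while a site specifying Crawl-Delay: 60 would not be crawled at all.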
This topic describes overrides for the fetcher property values in the default.xml file.

The site.xml file in the workspace/conf/web-crawler/non-polite-crawl directory contains overrides to the fetcher's default property values. The site.xml file in the workspace/conf/web-crawler/polite-crawl directory overrides the fetcher.delay value, setting it to 1.0. Otherwise, both files use the default values for the fetcher properties.
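For example, the polite-crawl override of fetcher.delay described above might look like this in site.xml. Again, the name/value property layout is an assumed Hadoop-style format; only the property name and the value 1.0 come from this topic:

```xml
<!-- Hypothetical workspace/conf/web-crawler/polite-crawl/site.xml fragment -->
<property>
  <!-- Delay (in seconds) between successive fetches from the same host. -->
  <name>fetcher.delay</name>
  <value>1.0</value>
</property>
```

Any property not overridden in a site.xml file falls back to the value defined in default.xml.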