The fetcher is the Web Crawler component that actually fetches pages from Web sites. You can set the fetcher properties in the default.xml file.
You can configure the fetcher's behavior by setting the properties listed in the table.
The fetcher compares the value of the fetcher.delay.max property to the value of the Crawl-Delay parameter in the robots.txt file.
If the fetcher.delay.max value is greater than the Crawl-Delay value, the fetcher will obey the amount of time specified by Crawl-Delay.
If the fetcher.delay.max value is less than the Crawl-Delay value, the fetcher will not crawl the site. It will also generate this error message: The delay specified in robots.txt is greater than the max delay. Therefore the crawler will not fully crawl this site. All pending work from this host has been removed.
If the fetcher.delay.max value is set to -1, the fetcher will wait the amount of time specified by the Crawl-Delay value.
Note that the above behavior occurs only if the http.robots.ignore property is set to false (the default).
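For reference, here is a minimal sketch of how these two properties might appear in a configuration file such as default.xml. The <property>/<name>/<value> layout and the example value of 30 are assumptions for illustration; only the property names and the -1 and false values come from this topic.

    <!-- Sketch only: the element layout and the value 30 are assumed. -->
    <property>
      <name>fetcher.delay.max</name>
      <!-- Largest Crawl-Delay the fetcher will honor; -1 means always
           wait the amount of time specified by Crawl-Delay. -->
      <value>30</value>
    </property>
    <property>
      <name>http.robots.ignore</name>
      <!-- Must be false (the default) for the Crawl-Delay comparison to apply. -->
      <value>false</value>
    </property>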
This topic describes overrides for the fetcher property values in the default.xml file.
The site.xml file in the workspace/conf/web-crawler/non-polite-crawl directory contains overrides to the fetcher's default property values.
The site.xml file in the workspace/conf/web-crawler/polite-crawl directory overrides the fetcher.delay value, setting it to 1.0.
Otherwise, both files use the default values for the fetcher properties.
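As a sketch of what the polite-crawl override might look like (again assuming the <property>-style layout, and assuming fetcher.delay is expressed in seconds):

    <!-- workspace/conf/web-crawler/polite-crawl/site.xml (sketch; layout assumed) -->
    <property>
      <name>fetcher.delay</name>
      <!-- Time to wait between successive fetches from the same site;
           1.0 is the override value cited above. -->
      <value>1.0</value>
    </property>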

