The site.xml file

The site.xml file provides override property values for the global configuration file.

The default.xml file contains the global default configuration for the Endeca Web Crawler and should rarely need to be changed. Only one copy of this file is shipped with the product, and it is located in the workspace/conf/web-crawler/default directory.

The site.xml file is where you make changes that override the default settings on a per-crawl basis. The properties that you can add to the site.xml file are the same as those in the default.xml file. A site.xml file is included in the workspace/conf/web-crawler/polite-crawl and workspace/conf/web-crawler/non-polite-crawl directories, but not in the workspace/conf/web-crawler/default directory.

Strategy for using the site.xml file

The strategy for using these two configuration files is to have only one directory that contains the default.xml file, but not a site.xml file. This directory is the default configuration directory.

You then create a separate directory for each crawl-specific configuration. Each of these per-crawl directories contains a site.xml file customized for that crawl, but no default.xml file.
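For example, using the directory names that appear in this document, the resulting layout looks like this:

    workspace/conf/web-crawler/
        default/             contains default.xml (no site.xml)
        polite-crawl/        contains site.xml (no default.xml)
        non-polite-crawl/    contains site.xml (no default.xml)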

When you run a crawl, you point to that crawl's configuration directory with the -c command-line option. However, the Web Crawler is hard-coded to first read the configuration files in the workspace/conf/web-crawler/default directory and then those in the per-crawl directory (which can override the default settings). For this reason, do not change the name or location of the workspace/conf/web-crawler/default directory or of the default.xml file.
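The read order described above means that any property present in both files takes its value from site.xml. The following sketch illustrates that override semantics; it is not the Web Crawler's actual code, and the Hadoop-style property/name/value XML layout used here is an assumption about the file format:

```python
# Minimal sketch of "default.xml first, then site.xml overrides" semantics.
# The <configuration>/<property> element names are assumed, not documented here.
import xml.etree.ElementTree as ET

def parse_properties(xml_text):
    """Return a dict mapping property name to value from a configuration file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}

default_xml = """
<configuration>
  <property><name>fetcher.delay</name><value>2.0</value></property>
  <property><name>http.robots.ignore</name><value>false</value></property>
</configuration>
"""

site_xml = """
<configuration>
  <property><name>fetcher.delay</name><value>1.0</value></property>
</configuration>
"""

# default.xml is read first; site.xml is read second, so its values win.
effective = {**parse_properties(default_xml), **parse_properties(site_xml)}
print(effective["fetcher.delay"])       # overridden by site.xml
print(effective["http.robots.ignore"])  # inherited from default.xml
```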

Differences between the site.xml and default.xml files

The following table lists the differences between the site.xml files in the non-polite-crawl and the polite-crawl directories, as well as the differences between those files and the global default.xml file.

Config property            default.xml         polite site.xml           non-polite site.xml
-------------------------  ------------------  ------------------------  --------------------------
http.robots.ignore         false               false                     true
fetcher.delay              2.0                 1.0                       0.0
fetcher.threads.total      100                 not used                  52
fetcher.threads.per-host   1                   1                         52
output.file.directory      workspace           polite-crawl-workspace    non-polite-crawl-workspace
output.file.name           webcrawler-output   polite-crawl              non-polite-crawl
output.file.is-xml         false               true                      true
output.file.is-compressed  true                false                     false
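As an illustration, a site.xml fragment that sets two of the polite-crawl overrides from the table might look like the following. The property names and values come from the table above; the surrounding <configuration>/<property> element structure is an assumption about the file format, so check your shipped polite-crawl/site.xml for the exact syntax:

    <configuration>
      <property>
        <name>fetcher.delay</name>
        <value>1.0</value>
      </property>
      <property>
        <name>output.file.is-xml</name>
        <value>true</value>
      </property>
    </configuration>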