The strategy for using these two configuration files is to have only one directory that contains the default.xml file but not a site.xml file. This directory is the default configuration directory.
You then create a separate directory for each crawl-specific configuration. Each of these per-crawl directories does not contain the default.xml file, but does contain a site.xml file that is customized for a given crawl configuration.
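The layout rule above can be checked mechanically. The following Python sketch is illustrative only and is not part of the Web Crawler: the function name check_layout is hypothetical, and the sketch assumes that all configuration directories are subdirectories of one root directory and that the default directory is named default.

```python
from pathlib import Path


def check_layout(conf_root: Path, default_name: str = "default") -> list[str]:
    """Return a list of layout problems; an empty list means the layout is consistent.

    Assumption (not from the product): every configuration directory is a
    direct subdirectory of conf_root.
    """
    problems = []
    if not (conf_root / default_name).is_dir():
        problems.append(f"{default_name}: directory is missing")

    for directory in sorted(p for p in conf_root.iterdir() if p.is_dir()):
        has_default = (directory / "default.xml").is_file()
        has_site = (directory / "site.xml").is_file()
        if directory.name == default_name:
            # The default directory holds default.xml only.
            if not has_default:
                problems.append(f"{directory.name}: missing default.xml")
            if has_site:
                problems.append(f"{directory.name}: should not contain site.xml")
        else:
            # Every per-crawl directory holds site.xml only.
            if has_default:
                problems.append(f"{directory.name}: should not contain default.xml")
            if not has_site:
                problems.append(f"{directory.name}: missing site.xml")
    return problems


if __name__ == "__main__":
    # Report any problems found under the configuration root used in this guide.
    for problem in check_layout(Path("workspace/conf/web-crawler")):
        print(problem)
```

An empty result means that only the default directory contains default.xml and that every other directory contains only site.xml.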
When you run a crawl, you point to that crawl's configuration directory by using the -c command-line option. However, the Web Crawler is hard-coded to read the configuration files in the workspace/conf/web-crawler/default directory first and then those in the per-crawl directory (whose files can override the defaults). For this reason, it is important that you do not change the name or location of either the workspace/conf/web-crawler/default directory or the default.xml file.
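The read order described above (defaults first, then per-crawl overrides) can be modeled with a short Python sketch. This is not the Web Crawler's own code. It assumes, purely for illustration, that both files hold property elements with name and value children under a single root element; the property names in the usage example are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_properties(xml_file: Path) -> dict[str, str]:
    """Read name/value pairs from a configuration file.

    Assumed file shape (an illustration, not a documented schema):
    <configuration><property><name>..</name><value>..</value></property></configuration>
    """
    properties = {}
    for prop in ET.parse(xml_file).getroot().iter("property"):
        name = prop.findtext("name")
        if name is not None:
            properties[name.strip()] = (prop.findtext("value") or "").strip()
    return properties


def effective_config(default_dir: Path, crawl_dir: Path) -> dict[str, str]:
    """Merge settings in the order described: default.xml first, then site.xml."""
    merged = read_properties(default_dir / "default.xml")
    site_file = crawl_dir / "site.xml"
    if site_file.is_file():
        # Values read later replace values with the same name read earlier.
        merged.update(read_properties(site_file))
    return merged
```

For example, if default.xml set a hypothetical example.depth property to 2 and a crawl's site.xml set it to 5, the merged result would hold 5 for that crawl, and every property that site.xml does not mention would keep its default value.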