The strategy for using these two configuration files is to keep exactly one directory that contains the default.xml file but no site.xml file. This directory is the default configuration directory.

You then create a separate directory for each crawl-specific configuration. A per-crawl directory does not contain the default.xml file; it contains only a site.xml file that is customized for that crawl configuration.
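The layout described above can be checked with a short script. The following Python sketch is illustrative only and is not part of the Web Crawler. The function names and the per-crawl directory names (crawl-a, crawl-b) are hypothetical, and it assumes that the per-crawl directories sit next to the default directory, which this page does not require.

```python
from pathlib import Path
import tempfile


def check_layout(default_dir, crawl_dirs):
    """Return a list of layout problems; an empty list means the layout matches."""
    problems = []
    default_dir = Path(default_dir)
    # The default directory holds default.xml and must not hold site.xml.
    if not (default_dir / "default.xml").is_file():
        problems.append(f"{default_dir}: missing default.xml")
    if (default_dir / "site.xml").exists():
        problems.append(f"{default_dir}: should not contain site.xml")
    # Each per-crawl directory holds site.xml and must not hold default.xml.
    for crawl_dir in map(Path, crawl_dirs):
        if not (crawl_dir / "site.xml").is_file():
            problems.append(f"{crawl_dir}: missing site.xml")
        if (crawl_dir / "default.xml").exists():
            problems.append(f"{crawl_dir}: should not contain default.xml")
    return problems


def build_example_layout(root):
    """Create a throwaway example layout under root (directory names are hypothetical)."""
    root = Path(root)
    default_dir = root / "default"
    default_dir.mkdir(parents=True)
    (default_dir / "default.xml").write_text("<configuration/>")
    crawl_dirs = []
    for name in ("crawl-a", "crawl-b"):
        crawl_dir = root / name
        crawl_dir.mkdir()
        (crawl_dir / "site.xml").write_text("<configuration/>")
        crawl_dirs.append(crawl_dir)
    return default_dir, crawl_dirs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        default_dir, crawl_dirs = build_example_layout(tmp)
        print(check_layout(default_dir, crawl_dirs))
```

The check only inspects file names. It does not validate the contents of either XML file.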

When you run a crawl, you point to that crawl's configuration directory by using the -c command-line option. However, the Web Crawler is hard-coded to first read the configuration files in the workspace/conf/web-crawler/default directory and then those in the per-crawl directory (which can override the default files). For this reason, it is important that you do not change the name or location of either the workspace/conf/web-crawler/default directory or the default.xml file.
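The read order described above (defaults first, then the per-crawl files) can be pictured as a layered merge in which the later file wins. The following Python sketch is a model of that idea, not the Web Crawler's actual loading code. It assumes that both files hold a list of property elements with name and value children, which this page does not confirm, and the property names shown are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents; the element layout and property names are assumptions.
DEFAULT_XML = """
<configuration>
  <property><name>example.max.depth</name><value>2</value></property>
  <property><name>example.user.agent</name><value>default-agent</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>example.max.depth</name><value>5</value></property>
</configuration>
"""


def read_properties(xml_text):
    """Parse name/value pairs from one configuration document into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_config(default_xml, site_xml):
    """Load the defaults first, then let the per-crawl values override them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


if __name__ == "__main__":
    for name, value in sorted(effective_config(DEFAULT_XML, SITE_XML).items()):
        print(f"{name} = {value}")
```

In this model, a property that appears only in default.xml keeps its default value, and a property that also appears in site.xml takes the per-crawl value. That is why a per-crawl directory needs only the settings that differ from the defaults.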

