After you install the CAS, the configuration files are in the following locations:
The workspace/conf/web-crawler/default directory contains all of the above files, except for the site.xml file. This directory is the global configuration directory, and you should not change its name or remove the default.xml file. Note that the settings in most of its files can be overridden by the versions in the crawl-specific configuration directories.
The workspace/conf/web-crawler/polite-crawl directory contains only the site.xml and crawl-urlfilter.txt files.
The workspace/conf/web-crawler/non-polite-crawl directory also contains only the site.xml and crawl-urlfilter.txt files. This site.xml contains more aggressive settings, such as no fetcher delay (versus a 1-second delay in the polite version) and a maximum of 52 threads (versus 1 in the polite version).
You can use a text editor to edit the files.
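To check which values a crawl would actually pick up after an edit, it can help to overlay a crawl-specific site.xml on the global default.xml. The following is a minimal sketch, not part of the product. It assumes that both files use the Hadoop-style layout of property elements with name and value children, and that a value in site.xml replaces the value of the same name in default.xml. The helper names and the property names in the usage note are hypothetical, so verify them against your own files.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_properties(xml_source):
    """Return {name: value} for every <property> element in the XML.

    Assumption: each setting is stored as
    <property><name>...</name><value>...</value></property>.
    """
    root = ET.fromstring(xml_source)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_settings(default_xml, site_xml):
    """Overlay crawl-specific values on top of the global defaults."""
    merged = parse_properties(default_xml)
    merged.update(parse_properties(site_xml))  # site.xml wins on conflicts
    return merged


def load_effective_settings(conf_root, crawl_name):
    """Read default/default.xml and <crawl_name>/site.xml under conf_root."""
    conf_root = Path(conf_root)
    default_xml = (conf_root / "default" / "default.xml").read_bytes()
    site_xml = (conf_root / crawl_name / "site.xml").read_bytes()
    return effective_settings(default_xml, site_xml)
```

For example, calling load_effective_settings("workspace/conf/web-crawler", "non-polite-crawl") returns a dictionary in which any property defined in the non-polite site.xml (such as a hypothetical example.fetcher.delay) replaces the value of the same name from default.xml, while all other defaults are kept.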