The Oracle Commerce Web Crawler uses the following set of configuration files:

Configuration Filename

Purpose

default.xml

The global configuration file. It should contain reasonable default values for the properties that apply to all of your crawls. Individual settings in this file can be overridden in the site.xml file. Do not remove or rename this file, because its name and location are hard-coded in the Web Crawler software.

site.xml

The per-crawl override file. Settings in this file take precedence over those in the default.xml file, so use it to adjust properties for an individual crawl.

crawl-urlfilter.txt

Contains a list of include and exclude regular expressions for URLs. These expressions determine which URLs the crawler is allowed to visit. Note that the filters can also be applied to seeds if the urlfilter.filter-seeds configuration property is set to true.

regex-normalize.xml

Contains a list of URL normalizations, which specify substitutions to apply to URLs. Each normalization consists of a regular expression and a replacement expression. Note that the seeds can also be normalized if the urlnormalizer.normalize-seeds configuration property is set to true.

mime-types.xml

Contains a list of MIME types known to the system. It is used to look up the MIME type for a specific file extension.

parse-plugins.xml

Maps MIME types to parsers (for example, "text/html" to the HTML parser).

form-credentials.xml

The credentials file for form-based authentication.

log4j.properties

The log4j configuration file, which controls logging for individual components.
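
The relationship between default.xml and site.xml described above can be pictured as a simple merge in which per-crawl values replace global defaults. The following is a minimal sketch, not the Web Crawler's own loading code. The `<configuration>`/`<property>` layout and the `example.*` property names are assumptions made for illustration and may not match the actual files.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents; the <configuration>/<property> layout and the
# property names are assumptions made for this illustration.
DEFAULT_XML = """
<configuration>
  <property><name>example.fetch.delay</name><value>1.0</value></property>
  <property><name>example.max.depth</name><value>2</value></property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property><name>example.max.depth</name><value>5</value></property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict of name -> value for each <property> element."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_settings(default_text, site_text):
    """Start from the defaults, then let per-crawl values replace them."""
    settings = read_properties(default_text)
    settings.update(read_properties(site_text))
    return settings


settings = effective_settings(DEFAULT_XML, SITE_XML)
```

In this sketch, `example.max.depth` takes the value from the site.xml text, while `example.fetch.delay` keeps the default because site.xml does not mention it.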

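To illustrate how the include and exclude expressions in crawl-urlfilter.txt could decide which URLs are visited, here is a small sketch. It assumes that each rule line starts with `+` (include) or `-` (exclude), that lines starting with `#` are comments, and that the first matching rule wins. These conventions and the sample rules are assumptions, so check the file shipped with your installation for the exact syntax.

```python
import re

# Hypothetical rules. The "+" (include) / "-" (exclude) prefixes, the "#"
# comments, and first-match-wins evaluation are assumptions for this sketch.
RULES_TEXT = r"""
# skip common binary and image extensions
-\.(gif|jpg|png|zip)$
# stay inside one hypothetical host
+^https?://www\.example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (is_include, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def is_allowed(url, rules):
    """Apply the rules in order; the first one that matches decides."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no rule matched: treat the URL as excluded


RULES = parse_rules(RULES_TEXT)
```

With these sample rules, `is_allowed("http://www.example.com/index.html", RULES)` is true, while a `.png` URL on the same host or any URL on another host is rejected.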

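The URL normalizations held in regex-normalize.xml pair a regular expression with a replacement expression. The sketch below shows that idea with Python's `re` module rather than the crawler's own XML syntax. The three rules are hypothetical examples, and applying them in list order is an assumption.

```python
import re

# Hypothetical normalizations, each a (pattern, replacement) pair.
# These particular rules are illustrative only.
NORMALIZATIONS = [
    # drop a session-id query parameter, keeping the leading "?" or "&"
    (re.compile(r"([?&])sessionid=[^&]*&?"), r"\1"),
    # remove a trailing "?" or "&" left behind by the previous rule
    (re.compile(r"[?&]$"), ""),
    # collapse a default document name to its directory
    (re.compile(r"/index\.html$"), "/"),
]


def normalize(url):
    """Apply every substitution in order and return the rewritten URL."""
    for pattern, replacement in NORMALIZATIONS:
        url = pattern.sub(replacement, url)
    return url
```

Normalizing URLs this way lets two addresses that differ only in a session identifier be treated as the same page, which helps avoid fetching duplicates.
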