The Oracle Commerce Web Crawler uses the following set of configuration files:
| Configuration Filename | Purpose |
|---|---|
| | The global configuration file, which should contain properties for all of your crawls with reasonable default values. Specific settings in this file can be overridden by the |
| | A per-crawl property overrides file. The settings in this file override those in the |
| | Contains a list of include and exclude regular expressions for URLs. These expressions determine which URLs the crawler is allowed to visit. Note that the filters can also be applied to seeds if the |
| | Contains a list of URL normalizations, which allow you to specify substitutions to be done on URLs. Each normalization is expressed as a regular expression and a replacement expression. Note that the seeds can also be normalized if the |
| | Contains a list of MIME types known to the system. It is used to look up the MIME type for a specific file extension. |
| | Maps MIME types to parsers (for example, "text/html" to the HTML parser). |
| | The credentials file for form-based authentication. |
| | The log4j configuration file, which is used to specify logging on certain components. |
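
To make the roles of the URL filter and URL normalization files more concrete, the following Python sketch models the two ideas described in the table: include/exclude regular expressions that decide whether a URL may be visited, and regular-expression/replacement pairs that rewrite a URL. It is only an illustration of the concepts, not the crawler's implementation or its file syntax. The rule lists, the function names `normalize` and `is_allowed`, the "first matching rule wins" behavior, and the "reject when nothing matches" default are all assumptions made for this example.

```python
import re

# Hypothetical rules for illustration only; this is NOT the crawler's file format.
# Each filter is (include, pattern). Assumption: the first matching rule decides.
URL_FILTERS = [
    (False, re.compile(r"\.(gif|jpg|png|css|js)$", re.IGNORECASE)),  # exclude static assets
    (True, re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")),   # include one site
]

# Each normalization is (pattern, replacement), applied in order.
URL_NORMALIZATIONS = [
    (re.compile(r"#.*$"), ""),            # drop the fragment part of the URL
    (re.compile(r"(?<!:)/{2,}"), "/"),    # collapse repeated slashes after the scheme
]


def normalize(url):
    """Apply every regex/replacement pair to the URL, in order."""
    for pattern, replacement in URL_NORMALIZATIONS:
        url = pattern.sub(replacement, url)
    return url


def is_allowed(url):
    """Return True if the first matching filter is an include rule."""
    for include, pattern in URL_FILTERS:
        if pattern.search(url):
            return include
    return False  # assumed default: reject URLs that match no rule


if __name__ == "__main__":
    seed = "http://www.example.com//docs/page.html#top"
    cleaned = normalize(seed)
    print(cleaned, is_allowed(cleaned))
```

In this sketch the URL is normalized before it is filtered, so the filters always see the cleaned-up form. Whether the crawler applies the two steps in that order is not stated above, so treat the ordering as part of the illustration.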