You can set the URL normalization properties in the default.xml file.
URL normalization (also called URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The purpose of URL normalization is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.
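As a rough illustration of the idea, the sketch below shows how two syntactically different URLs can be reduced to the same canonical form. It is not the Web Crawler's implementation. The `normalize` function and the particular rules it applies (lowercasing the scheme and host, dropping a default port and the fragment) are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the target resource.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url: str) -> str:
    """Illustrative normalization only; user info in the URL is ignored."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it does not identify a separate resource.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


a = normalize("HTTP://Example.COM:80/index.html#top")
b = normalize("http://example.com/index.html")
print(a == b)  # both reduce to http://example.com/index.html
```

Once both URLs are in canonical form, a simple string comparison is enough to decide that they point to the same resource.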
The Web Crawler performs URL normalization in order to avoid crawling the same resource more than once. By using the properties listed in the table, you can configure how the Web Crawler normalizes URLs.
| Property Name | Property Value |
|---|---|
| | Space-delimited list of URL normalization class names. Specifies the order in which the URL normalizers will be run. If any normalizer is not activated, it will be silently skipped. If other normalizers not on the list are activated, they will run in random order after the listed normalizers run. |
| | File name (default is |
| | Integer value (default is |
| | Boolean value (default is |
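The ordering rule for the space-delimited list can be sketched as follows. This is a minimal model of the described behavior, not the Web Crawler's code. The function names, the normalizer names (`strip_fragment`, `lower`, `not_activated`), and the dictionary of activated normalizers are all hypothetical.

```python
from typing import Callable, Dict, List

Normalizer = Callable[[str], str]


def resolve_run_order(order: str, activated: Dict[str, Normalizer]) -> List[str]:
    """Listed normalizers run first; activated but unlisted ones run afterwards."""
    # Names in the list that are not activated are silently skipped.
    listed = [name for name in order.split() if name in activated]
    # Unlisted normalizers follow. No particular order is guaranteed for them;
    # this sketch simply uses the dictionary's insertion order.
    unlisted = [name for name in activated if name not in listed]
    return listed + unlisted


def run_normalizers(url: str, order: str, activated: Dict[str, Normalizer]) -> str:
    """Apply each normalizer in the resolved order, feeding each result to the next."""
    for name in resolve_run_order(order, activated):
        url = activated[name](url)
    return url


# Hypothetical normalizers used only for this example.
activated = {
    "lower": str.lower,
    "strip_fragment": lambda u: u.split("#", 1)[0],
}
order = "strip_fragment not_activated"

print(resolve_run_order(order, activated))
print(run_normalizers("HTTP://Example.com/A#x", order, activated))
```

Here `not_activated` is dropped without an error, `strip_fragment` runs first because it is listed, and `lower` runs afterwards because it is activated but unlisted.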
The Oracle Commerce Web Crawler has three URL normalizers: BasicURLNormalizer, PassURLNormalizer, and RegexURLNormalizer.
The BasicURLNormalizer performs the following transformations:
Note that these transformations are actually performed by the regex-normalize.xml file.
The PassURLNormalizer performs no transformations. It is included because a given scope sometimes requires at least one normalizer to be defined even though no transformations are needed.
The RegexURLNormalizer allows users to specify regex substitutions to apply to any or all of the URLs that are encountered. This is useful for transformations such as stripping session IDs from URLs. This class uses the file specified in the urlnormalizer.regex.file property.
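To make the session ID case concrete, the following sketch applies an ordered list of pattern and substitution pairs to a URL. It only illustrates the technique. The rules, the parameter names (`jsessionid`, `sid`, `sessionid`), and the `regex_normalize` function are assumptions for the example and do not reproduce the contents or format of the file named by urlnormalizer.regex.file.

```python
import re
from typing import List, Tuple

# Hypothetical (pattern, substitution) rules, applied in the order listed.
RULES: List[Tuple[str, str]] = [
    # Remove a ";jsessionid=..." path parameter.
    (r";jsessionid=[^?#]*", ""),
    # Remove a "sid" or "sessionid" query parameter, keeping the delimiter before it.
    (r"([?&])(?:sid|sessionid)=[^&#]*&?", r"\1"),
    # Clean up a dangling "?" or "&" left at the end of the URL.
    (r"[?&]$", ""),
]


def regex_normalize(url: str) -> str:
    """Apply every substitution rule to the URL, one after another."""
    for pattern, substitution in RULES:
        url = re.sub(pattern, substitution, url, flags=re.IGNORECASE)
    return url


print(regex_normalize("http://example.com/cart;jsessionid=ABC123?item=1"))
print(regex_normalize("http://example.com/list?page=2&sid=42"))
```

With rules like these, URLs that differ only in their session ID collapse to a single URL, so the same page is not treated as a new resource for every session.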
You can apply normalization to the seed list with the urlnormalizer.normalize-seeds property.
By default, the seeds are read in as-is. In some cases, however, you may want to have URL normalization applied to the seeds (for example, if the seeds are extracted from a database instead of manually entered in the seed list by the user).