URL normalization properties

You can set the URL normalization properties in the default.xml file.

URL normalization (also called URL canonicalization) is the process of modifying and standardizing URLs in a consistent manner. Its purpose is to transform a URL into a normalized (canonical) form, so that two syntactically different URLs can be recognized as equivalent.
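
As a rough illustration of the idea, the following Python sketch reduces two syntactically different URLs to the same canonical form. It is not the Web Crawler's own implementation: the function name normalize_url and the specific rules it applies (lowercasing the scheme and host, dropping the default port, dropping the fragment, using "/" for an empty path) are assumptions chosen for the example.

```python
from urllib.parse import urlsplit, urlunsplit

# Default ports that can be dropped without changing the resource (assumed rule set).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Hypothetical normalizer: returns a canonical form of an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # hostname is already lowercased by urlsplit

    # Keep the port only when it differs from the scheme's default.
    # Note: any user:password@ prefix is dropped in this simplified sketch.
    netloc = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        netloc = f"{host}:{parts.port}"

    path = parts.path or "/"  # an empty path is treated as "/"

    # The fragment (#...) is never sent to the server, so it is discarded.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


first = normalize_url("HTTP://Example.COM:80/index.html#top")
second = normalize_url("http://example.com/index.html")
print(first == second)  # both URLs normalize to the same string
```

Once both URLs map to the same string, a simple set lookup is enough to tell that the resource has already been seen.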

The Web Crawler performs URL normalization to avoid crawling the same resource more than once. By using the properties listed in the following table, you can configure how the Web Crawler normalizes URLs.
Property Name | Property Value
urlnormalizer.order | Space-delimited list of URL normalization class names. Specifies the order in which the URL normalizers will be run. If any normalizer is not activated, it will be silently skipped. If other normalizers not on the list are activated, they will run in random order after the listed normalizers run.
urlnormalizer.regex.file | File name (default is regex-normalize.xml). Name of the configuration file used by the RegexUrlNormalizer class. Note that the file must be in the configuration directory.
urlnormalizer.loop.count | Integer value (default is 1). Specifies how many times to loop through normalizers, to ensure that all transformations are performed.
urlnormalizer.normalize-seeds | Boolean value (default is false). Specifies whether to normalize the seeds.
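
The interaction between urlnormalizer.order and urlnormalizer.loop.count can be sketched in Python as follows. This is a simplified model based only on the descriptions in the table, not the Web Crawler's actual code: run_normalizers, drop_fragment, and resolve_parent are hypothetical names, and the two normalizers are toy rules invented for the example.

```python
import re
from typing import Callable, Dict, List

Normalizer = Callable[[str], str]


def drop_fragment(url: str) -> str:
    """Toy normalizer: remove everything from '#' onward."""
    return url.split("#", 1)[0]


def resolve_parent(url: str) -> str:
    """Toy normalizer: collapse 'segment/../' pairs found in a single scan."""
    return re.sub(r"/[^/]+/\.\./", "/", url)


def run_normalizers(url: str, active: Dict[str, Normalizer],
                    order: List[str], loop_count: int = 1) -> str:
    """Apply the active normalizers to url, loop_count times (assumed model)."""
    # Listed names that are not activated are silently skipped.
    listed = [active[name] for name in order if name in active]
    # Activated normalizers missing from the list run after the listed ones;
    # their relative order is not guaranteed in the real system.
    unlisted = [fn for name, fn in active.items() if name not in order]
    for _ in range(loop_count):
        for normalize in listed + unlisted:
            url = normalize(url)
    return url


active = {"drop_fragment": drop_fragment, "resolve_parent": resolve_parent}
order = ["resolve_parent", "not_activated"]  # second name is skipped

once = run_normalizers("http://h/a/b/../../c#x", active, order, loop_count=1)
twice = run_normalizers("http://h/a/b/../../c#x", active, order, loop_count=2)
print(once)   # one pass leaves a '/a/../' pair behind
print(twice)  # a second pass removes it
```

In this model a single pass can leave work undone, because one transformation may expose a pattern that an earlier step would have handled. Raising the loop count gives the normalizers another chance to reach a stable result, at the cost of extra processing per URL.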