Oracle Commerce Guided Search - URL normalization properties

URL normalization properties

You can set the URL normalization properties in the default.xml file.

URL normalization (also called URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The purpose of URL normalization is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

The Web Crawler performs URL normalization in order to avoid crawling the same resource more than once. By using the properties listed in the table, you can configure how the Web Crawler normalizes URLs.
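The following Java fragment is a minimal sketch of why normalization prevents duplicate crawling. It is not the Web Crawler's implementation: the class `SeenUrlsSketch` and its methods are hypothetical, and the stand-in normalization only trims white space and drops the `#` reference.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; not a Web Crawler class.
class SeenUrlsSketch {
    // Canonical forms of every URL accepted so far.
    private final Set<String> seen = new HashSet<>();

    // Stand-in normalization (an assumption for this sketch):
    // trim white space and remove the "#reference" part, if any.
    static String canonical(String url) {
        String trimmed = url.trim();
        int hash = trimmed.indexOf('#');
        return hash >= 0 ? trimmed.substring(0, hash) : trimmed;
    }

    // Returns true only the first time a given canonical URL is offered.
    boolean shouldCrawl(String url) {
        return seen.add(canonical(url));
    }
}
```

With this sketch, `http://xyz.com/about.html` and `http://xyz.com/about.html#history` map to the same canonical string, so the second one is not crawled again.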

Property Name	Property Value
`urlnormalizer.order`	Space-delimited list of URL normalization class names. Specifies the order in which the URL normalizers will be run. If any normalizer is not activated, it will be silently skipped. If other normalizers not on the list are activated, they will run in random order after the listed normalizers run.
`urlnormalizer.regex.file`	File name (default is `regex-normalize.xml`). Name of the configuration file used by the `RegexURLNormalizer` class. Note that the file must be in the configuration directory.
`urlnormalizer.loop.count`	Integer value (default is `1`). Specifies how many times to loop through normalizers, to ensure that all transformations are performed.
`urlnormalizer.normalize-seeds`	Boolean value (default is `false`). Specifies whether to normalize the seeds.
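To illustrate how an ordered list of normalizers and a loop count interact, here is a hedged Java sketch. The class `NormalizerChainSketch` and the two toy normalizers are hypothetical and are not the Web Crawler's classes. The sketch only shows the general idea behind `urlnormalizer.order` and `urlnormalizer.loop.count`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

// Illustrative only: these names are hypothetical, not Web Crawler classes.
class NormalizerChainSketch {

    // Toy normalizer: removes leading and trailing white space.
    static final UnaryOperator<String> TRIM = String::trim;

    // Toy normalizer: removes only the first "segment/../" pair it finds,
    // so nested pairs need more than one pass.
    static final UnaryOperator<String> REMOVE_ONE_PARENT_REF =
            url -> url.replaceFirst("/[^/]+/\\.\\./", "/");

    // Runs the normalizers in list order and repeats the whole pass loopCount times.
    static String normalize(String url, List<UnaryOperator<String>> ordered, int loopCount) {
        String current = url;
        for (int pass = 0; pass < loopCount; pass++) {
            for (UnaryOperator<String> normalizer : ordered) {
                current = normalizer.apply(current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> order = Arrays.asList(TRIM, REMOVE_ONE_PARENT_REF);
        String url = " http://xyz.com/a/b/../../c.html ";
        System.out.println(normalize(url, order, 1)); // http://xyz.com/a/../c.html
        System.out.println(normalize(url, order, 2)); // http://xyz.com/c.html
    }
}
```

In this sketch a single pass leaves one `../` segment behind, while a second pass removes it, which is the kind of case a loop count greater than `1` is meant to cover.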

Types of URL normalizers

The Oracle Commerce Web Crawler has three URL normalizers:

BasicURLNormalizer
PassURLNormalizer
RegexURLNormalizer

The BasicURLNormalizer performs the following transformations:

Removes leading and trailing white spaces in the URL.
Lowercases the protocol (e.g., HTTP is changed to http).
Lowercases the host name.
Normalizes the port (e.g., http://xyz.com:80/index.html is changed to http://xyz.com/index.html).
Normalizes null paths (e.g., http://xyz.com is changed to http://xyz.com/index.html).
Removes references (e.g., http://xyz.com/about.html#history is changed to http://xyz.com/about.html).
Removes unnecessary paths, in particular the ../ paths.

Note that these transformations are actually performed by the rules defined in the regex-normalize.xml file.
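The transformations listed above can be approximated in a few lines of Java. This is a sketch under stated assumptions, not the BasicURLNormalizer source: it uses the standard `java.net.URI` class instead of regex rules, the class name `BasicNormalizationSketch` is hypothetical, and it assumes an absolute http or https URL with a host name.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical illustration of the listed transformations; not the crawler's code.
class BasicNormalizationSketch {

    static String normalize(String rawUrl) {
        try {
            // Remove surrounding white space, then resolve "./" and "../" segments.
            URI uri = new URI(rawUrl.trim()).normalize();

            // Lowercase the protocol and the host name.
            String scheme = uri.getScheme().toLowerCase(Locale.ROOT);
            String host = uri.getHost().toLowerCase(Locale.ROOT);

            // Drop the port when it is the default for the protocol.
            int port = uri.getPort();
            boolean isDefaultPort = (scheme.equals("http") && port == 80)
                    || (scheme.equals("https") && port == 443);
            if (isDefaultPort) {
                port = -1; // -1 means "no explicit port" for java.net.URI
            }

            // Replace a null (empty) path, following the example in the list above.
            String path = uri.getPath();
            if (path == null || path.isEmpty()) {
                path = "/index.html";
            }

            // A null fragment argument removes the "#reference" part.
            return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Not a valid URL: " + rawUrl, e);
        }
    }
}
```

For example, `normalize("  HTTP://XYZ.com:80/index.html ")` and `normalize("http://xyz.com")` both return `http://xyz.com/index.html` in this sketch.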

The PassURLNormalizer performs no transformations. It is included for cases in which a given scope requires at least one normalizer to be defined but no transformations are needed.

The RegexURLNormalizer allows users to specify regex substitutions to apply to the URLs that are encountered. This is useful for transformations such as stripping session IDs from URLs. This class uses the file specified in the urlnormalizer.regex.file property.
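As an illustration of the kind of substitution involved, the Java sketch below strips a session ID from a URL with `java.util.regex`. The pattern, the `;jsessionid=` parameter name, and the class `RegexNormalizationSketch` are assumptions chosen for the example. They are not taken from the shipped regex-normalize.xml file.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of a regex substitution; not the crawler's code.
class RegexNormalizationSketch {

    // Assumed rule: match a ";jsessionid=..." path parameter (case-insensitive),
    // up to but not including the query string or reference.
    private static final Pattern SESSION_ID = Pattern.compile("(?i);jsessionid=[^?#]*");

    // Substitutes the empty string for every match of the rule.
    static String stripSessionId(String url) {
        return SESSION_ID.matcher(url).replaceAll("");
    }
}
```

With this rule, `http://xyz.com/cart.jsp;jsessionid=ABC123?item=7` becomes `http://xyz.com/cart.jsp?item=7`, so visits made under different sessions resolve to the same URL.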

Default order for the URL normalizers

The default classes for the urlnormalizer.order property are:

org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer

Normalizing the seed list

You can apply normalization to the seed list with the urlnormalizer.normalize-seeds property.

By default, the seeds are read in as-is. In some cases, however, you may want to have URL normalization applied to the seeds (for example, if the seeds are extracted from a database instead of manually entered in the seed list by the user).

To normalize the seed list:

In a text editor, open the default.xml file.
Set the urlnormalizer.normalize-seeds property to true.
Save and close the file.

Related links

URL filter properties
