The regex-normalize.xml file provides substitutions for normalizing URLs.
The regex-normalize.xml file is the configuration file for the RegexUrlNormalizer class.
The file lets you specify regular expressions, each paired with a substitution string, that are applied to URLs to normalize them. The file provides a set of sample rules.
For example, if you are crawling a site with URLs that contain spaces, you should add the following regular expression to force URL encoding:
<regex> <pattern> </pattern> <substitution>%20</substitution> </regex>
Note that the expression uses one space character as the value for the pattern. The expression means that when a space character is found in the URL, the space should be encoded as %20 (hex). For example, if the URL contains a document named Price List.html, it will be encoded to Price%20List.html so that it can be processed correctly.
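The effect of such a rule can be illustrated with a short Python sketch. This is not the RegexUrlNormalizer implementation; it only mimics the idea of applying each pattern/substitution pair to a URL in turn. The names RULES and normalize, and the example.com URL, are hypothetical and chosen for illustration.

```python
import re

# Hypothetical rule list: each tuple mirrors one <pattern>/<substitution>
# pair from regex-normalize.xml. Here, a single space becomes %20.
RULES = [(" ", "%20")]


def normalize(url, rules=RULES):
    """Apply each (pattern, substitution) rule to the URL, in order."""
    for pattern, substitution in rules:
        url = re.sub(pattern, substitution, url)
    return url


print(normalize("http://www.example.com/docs/Price List.html"))
# prints http://www.example.com/docs/Price%20List.html
```

Every occurrence of the pattern is replaced, so a URL with several spaces has each one encoded.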
When modifying the file, keep the following in mind:
Note that the name of this file is specified to the Web Crawler via the urlnormalizer.regex.file property in the default.xml configuration file.