The regex-normalize.xml file

The regex-normalize.xml file provides substitutions for normalizing URLs.

The regex-normalize.xml file is the configuration file for the RegexUrlNormalizer class. In this file, you specify regular expressions that are applied as substitutions when URLs are normalized. The file includes a set of sample rules.

For example, if you are crawling a site with URLs that contain spaces, you should add the following regular expression to force URL encoding:
<regex>
    <pattern> </pattern>
    <substitution>%20</substitution>
</regex>
Note that the value of the pattern is a single space character. The rule means that each space character found in a URL is replaced with %20, the percent-encoded (hexadecimal) form of a space. For example, if the URL contains a document named Price List.html, that part of the URL becomes Price%20List.html so that it can be processed correctly.
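To show where such a rule sits, here is a minimal sketch of a complete file that contains only the space rule. The regex-normalize root element is an assumption, taken from the Apache Nutch file format of the same name. Check the sample file in your installation for the actual wrapper element before copying this.

```xml
<?xml version="1.0"?>
<!-- Assumption: the root element is named regex-normalize, as in Apache Nutch. -->
<regex-normalize>
    <!-- Replace each space character in a URL with %20. -->
    <regex>
        <pattern> </pattern>
        <substitution>%20</substitution>
    </regex>
</regex-normalize>
```

With this rule in place, a URL ending in Price List.html would be normalized to end in Price%20List.html.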
When modifying the file, keep the following in mind:

Note that the name of this file is specified to the Web Crawler via the urlnormalizer.regex.file property in the default.xml configuration file.
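As an illustration, the property might look like the following in default.xml. This is a sketch that assumes default.xml uses name/value property elements and that the file keeps the name regex-normalize.xml. Adjust the value if your normalization file has a different name.

```xml
<!-- Assumption: default.xml defines settings as property elements with name and value children. -->
<property>
    <name>urlnormalizer.regex.file</name>
    <!-- Assumed value: the file name used throughout this topic. -->
    <value>regex-normalize.xml</value>
</property>
```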