The regex-normalize.xml file provides substitutions for normalizing URLs.
The regex-normalize.xml file is the configuration file for the RegexUrlNormalizer class. It lets you specify regular expressions, each paired with the substitution to apply to matching text during URL normalization. The file provides a set of sample rules, for example:
<regex>
  <pattern> </pattern>
  <substitution>%20</substitution>
</regex>

Note that the value of the pattern is a single space character. The rule means that whenever a space character is found in a URL, it is encoded as %20 (the hexadecimal code for a space). For example, if the URL contains a document named Price List.html, the name is encoded as Price%20List.html so that the URL can be processed correctly.
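To show how such a rule might sit alongside others, the following is a minimal sketch of a complete file. It assumes the rules are wrapped in a regex-normalize root element and that an empty substitution deletes the matched text; check both against the sample file shipped with your installation. The second rule, which strips a jsessionid path parameter, is hypothetical and included only for illustration.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the root element name is an assumption. -->
<regex-normalize>

  <!-- Rule from the text above: encode each space as %20. -->
  <regex>
    <pattern> </pattern>
    <substitution>%20</substitution>
  </regex>

  <!-- Hypothetical rule: remove a ;jsessionid=... path parameter,
       matching up to the next ? or # (or the end of the URL). -->
  <regex>
    <pattern>;jsessionid=[^?#]*</pattern>
    <substitution></substitution>
  </regex>

</regex-normalize>
```

Under these assumptions, a URL such as http://example.com/Price List.html;jsessionid=ABC123?id=7 would be normalized to http://example.com/Price%20List.html?id=7.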
Note that the name of this file is specified to the Web Crawler via the urlnormalizer.regex.file property in the default.xml configuration file.
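As a rough illustration, the property entry might look like the fragment below. This is a sketch that assumes default.xml uses the usual name/value property layout; the value shown is simply the file name discussed in this topic, so confirm the actual entry in your own default.xml.

```xml
<!-- Sketch only: assumes a name/value property layout in default.xml. -->
<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
</property>
```

If you keep your normalization rules in a differently named file, this value is the place to point the Web Crawler at it.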