Parser properties

You can set the parser properties in the default.xml file.

The Web Crawler contains two HTML scanners that parse HTML documents: NekoHTML and TagSoup. By using the properties listed in the table, you can configure which HTML parser to use, as well as other parsing behavior.
Property Name
    Property Value

parse.plugin.file
    File name (default is parse-plugins.xml). Specifies the configuration file that defines the associations between content-types and parsers.

parser.character.encoding.default
    ISO code or other encoding representation (default is windows-1252). Specifies the character encoding to use when no other information is available.

parser.html.impl
    neko or tagsoup (default is neko). Specifies which HTML parser implementation to use: neko uses NekoHTML and tagsoup uses TagSoup.

parser.html.form.use_action
    Boolean value (default is false). If true, the HTML parser will collect URLs from Form action attributes.
    Note: This may lead to undesirable behavior, such as submitting empty forms during the next fetch cycle.
    If false, form action attributes will be ignored.
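
As an illustration, the four properties above might be set in default.xml as in the following fragment. This is only a sketch: it assumes that default.xml uses a configuration root element containing property entries with name and value child elements, which this section does not confirm, so check the existing entries in your own file for the exact layout. The values are examples (here switching the HTML parser to TagSoup), not recommendations.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the configuration/property/name/value layout is an assumption;
     verify it against the entries already present in your default.xml. -->
<configuration>
  <property>
    <name>parse.plugin.file</name>
    <!-- Documented default: the content-type to parser mapping file. -->
    <value>parse-plugins.xml</value>
  </property>
  <property>
    <name>parser.character.encoding.default</name>
    <!-- Documented default: used when no other encoding information is available. -->
    <value>windows-1252</value>
  </property>
  <property>
    <name>parser.html.impl</name>
    <!-- Overrides the default (neko) to select the TagSoup parser. -->
    <value>tagsoup</value>
  </property>
  <property>
    <name>parser.html.form.use_action</name>
    <!-- Documented default: form action URLs are not collected. -->
    <value>false</value>
  </property>
</configuration>
```

Leaving parser.html.form.use_action at false avoids the empty form submissions described in the note above.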

If the Web Crawler configuration includes the DOM for the Web page in the output Endeca records, the HTML parsers handle invalid XML characters as follows:
  • The NekoHTML parser removes invalid XML characters in the ranges 0x00-0x1F and 0x7F-0x9F from the DOM.
  • The TagSoup parser strips nothing from the DOM, because TagSoup can efficiently handle invalid XML characters.

Note that the NekoHTML parser is the default HTML parser.