Parser properties

You can set the parser properties in the default.xml file.

The Web Crawler includes two HTML parsers: NekoHTML and TagSoup. The properties listed in the table let you choose which HTML parser to use and configure other parsing behavior.

Property Name	Property Value
`parse.plugin.file`	File name (default is `parse-plugins.xml`). Specifies the configuration file that defines the associations between content-types and parsers.
`parser.character.encoding.default`	ISO code or other encoding representation (default is `windows-1252`). Specifies the character encoding to use when no other information is available.
`parser.html.impl`	`neko` or `tagsoup` (default is `neko`). Specifies which HTML parser implementation to use: `neko` uses NekoHTML and `tagsoup` uses TagSoup.
`parser.html.form.use_action`	Boolean value (default is `false`). If `true`, the HTML parser collects URLs from form `action` attributes. Note: This may lead to undesirable behavior, such as submitting empty forms during the next fetch cycle. If `false`, form `action` attributes are ignored.
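
To check which parser settings are in effect, the properties in the table can be read from the configuration file. The following is a minimal sketch, assuming that `default.xml` uses a Hadoop/Nutch-style layout (`<configuration>` containing `<property>` elements with `<name>` and `<value>` children). That layout, the `SAMPLE_XML` content, and the `read_parser_properties` helper are illustrative assumptions, not part of the product. The fallback values are the defaults listed in the table.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of a default.xml file; the element layout is an
# assumption (Hadoop/Nutch-style), so check it against your own file.
SAMPLE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>parse.plugin.file</name>
    <value>parse-plugins.xml</value>
  </property>
  <property>
    <name>parser.html.impl</name>
    <value>tagsoup</value>
  </property>
</configuration>
"""

# Defaults taken from the property table above.
PARSER_DEFAULTS = {
    "parse.plugin.file": "parse-plugins.xml",
    "parser.character.encoding.default": "windows-1252",
    "parser.html.impl": "neko",
    "parser.html.form.use_action": "false",
}


def read_parser_properties(xml_text: str) -> dict:
    """Return the parser properties, using table defaults where unset."""
    props = dict(PARSER_DEFAULTS)
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in props:
            props[name] = prop.findtext("value", default="").strip()
    return props


if __name__ == "__main__":
    for name, value in read_parser_properties(SAMPLE_XML).items():
        print(f"{name} = {value}")
```

In this sample only `parser.html.impl` differs from its default, so the crawler would use TagSoup while the encoding and form-action settings keep their default values.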

If the Web Crawler configuration includes the DOM for the Web page in the output Endeca records, the HTML parsers handle invalid XML characters as follows:

The NekoHTML parser removes the invalid XML characters in the range 0x00-0x1F and 0x7F-0x9F from the DOM.
The TagSoup parser strips nothing from the DOM, because TagSoup can efficiently handle invalid XML characters.
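
The effect of the NekoHTML behavior on DOM text can be illustrated with a short sketch. This is not the parser's actual implementation. It only mimics the described result, and it assumes that tab, line feed, and carriage return are kept because XML 1.0 allows them. The `strip_invalid_xml_chars` name is hypothetical.

```python
import re

# Characters in 0x00-0x1F and 0x7F-0x9F, excluding tab (0x09),
# line feed (0x0A), and carriage return (0x0D). Keeping those three
# is an assumption based on the XML 1.0 character rules.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def strip_invalid_xml_chars(text: str) -> str:
    """Remove control characters that are not valid in an XML document."""
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    raw = "Price:\x00 10\x1f USD\x7f\x9f\n"
    print(repr(strip_invalid_xml_chars(raw)))
```

With TagSoup selected, no equivalent filtering is applied, so the DOM text would pass through unchanged.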

Note that the NekoHTML parser is the default HTML parser.