You can set the parser properties in the default.xml file.
The Web Crawler includes two HTML scanners that parse HTML documents: NekoHTML and TagSoup.
The properties listed in the table below let you configure which HTML parser to use and adjust other parsing behavior.
| Property Name | Property Value |
|---|---|
| parse.plugin.file | File name (default is parse-plugins.xml). Specifies the configuration file that defines the associations between content-types and parsers. |
| parser.character.encoding.default | ISO code or other encoding representation (default is windows-1252). Specifies the character encoding to use when no other information is available. |
| parser.html.impl | neko or tagsoup (default is neko). Specifies which HTML parser implementation to use: neko uses NekoHTML and tagsoup uses TagSoup. |
| parser.html.form.use_action | Boolean value (default is false). If true, the HTML parser collects URLs from form action attributes. Note: This may lead to undesirable behavior, such as submitting empty forms during the next fetch cycle. If false, form action attributes are ignored. |
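As a sketch, assuming default.xml follows the Hadoop-style property format used by Nutch-based crawlers, switching to the TagSoup parser and changing the fallback encoding might look like this (the values shown are examples, not recommended settings):

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Use the TagSoup HTML parser instead of the default NekoHTML. -->
  <property>
    <name>parser.html.impl</name>
    <value>tagsoup</value>
  </property>
  <!-- Fall back to UTF-8 when a page declares no character encoding. -->
  <property>
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
</configuration>
```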
If the Web Crawler configuration includes the DOM for the Web page in the output Endeca records, the HTML parsers handle invalid XML characters as follows:
- The NekoHTML parser removes the invalid XML characters in the ranges 0x00-0x1F and 0x7F-0x9F from the DOM.
- The TagSoup parser strips nothing from the DOM, because TagSoup can handle invalid XML characters itself.

Note that the NekoHTML parser is the default HTML parser.
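The stripping behavior described above for NekoHTML can be approximated with a regular expression. This is an illustrative sketch, not the crawler's actual implementation; it keeps tab (0x09), line feed (0x0A), and carriage return (0x0D), since those control characters are legal in XML 1.0 even though they fall inside the 0x00-0x1F range:

```python
import re

# Control characters forbidden by XML 1.0: the 0x00-0x1F range minus
# tab/LF/CR, plus 0x7F-0x9F (mirrors the NekoHTML behavior described
# above; hypothetical helper, not part of the Web Crawler API).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]")

def strip_invalid_xml_chars(text: str) -> str:
    """Remove control characters that are not valid in an XML 1.0 DOM."""
    return INVALID_XML_CHARS.sub("", text)
```

For example, `strip_invalid_xml_chars("ab\x01c\x9fd")` drops the two control characters while leaving ordinary text, tabs, and newlines untouched.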