You can set the parser properties in the default.xml file.
The Web Crawler contains two HTML scanners that parse HTML documents: NekoHTML and TagSoup.
By using the properties listed in the table, you can configure which HTML parser to use, as well as other parsing behavior.
| Property Name | Property Value |
|---|---|
| parse.plugin.file | File name (default is parse-plugins.xml). Specifies the configuration file that defines the associations between content types and parsers. |
| parser.character.encoding.default | ISO code or other encoding representation (default is windows-1252). Specifies the character encoding to use when no other information is available. |
| parser.html.impl | neko or tagsoup (default is neko). Specifies which HTML parser implementation to use: neko uses NekoHTML and tagsoup uses TagSoup. |
| parser.html.form.use_action | Boolean value (default is false). If true, the HTML parser collects URLs from form action attributes. Note: This may lead to undesirable behavior, such as submitting empty forms during the next fetch cycle. If false, form action attributes are ignored. |
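For example, to switch to the TagSoup parser and set a fallback character encoding, the corresponding entries in default.xml might look like the following sketch (the exact surrounding file structure is assumed here; a name/value property layout is used for illustration):

```xml
<!-- Sketch of entries in default.xml; the enclosing file structure is assumed. -->
<property>
  <name>parser.html.impl</name>
  <!-- Use TagSoup instead of the default NekoHTML parser -->
  <value>tagsoup</value>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <!-- Fallback encoding when the page declares none -->
  <value>windows-1252</value>
</property>
```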
If the Web Crawler configuration includes the DOM for the Web page in the output Endeca records, the HTML parsers handle invalid XML characters as follows:

- The NekoHTML parser removes invalid XML characters in the ranges 0x00-0x1F and 0x7F-0x9F from the DOM.
- The TagSoup parser strips nothing from the DOM, because TagSoup can handle invalid XML characters itself.
Note that the NekoHTML parser is the default HTML parser.