If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property of the record.
- In a text editor, open default.xml.
- Set the output.dom.include property to true. With this setting, you can extract information from the XHTML using XSLT or any other XML processing system.
- Note that the Endeca.Document.Text property also contains the extracted text, but with the XML header and the HTML tags removed. Therefore, if you do not need the XHTML version of the content, set the output.dom.include property to false.
- Save and close the file.
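
As an illustration of the "any other XML processing system" option mentioned above, the following is a minimal Python sketch that pulls the page title and link targets out of an XHTML string. It assumes that you have already read the value of the Endeca.Document.XHTML property into a string, and that the XHTML declares the standard XHTML namespace. Both are assumptions rather than documented behavior, and the function name and sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the stored XHTML uses the standard XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_title_and_links(xhtml: str):
    """Return (title, [href, ...]) from a well-formed XHTML string."""
    # Encode to bytes so the parser honors the encoding in the XML header.
    root = ET.fromstring(xhtml.encode("utf-8"))
    title = root.findtext("x:head/x:title", default="", namespaces=XHTML_NS)
    links = [a.get("href") for a in root.iterfind(".//x:a[@href]", XHTML_NS)]
    return title.strip(), links


# Hypothetical stand-in for the value of Endeca.Document.XHTML.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page</title></head>
  <body>
    <p>See the <a href="/docs/index.html">documentation</a> and the
       <a href="http://www.example.com/">example site</a>.</p>
  </body>
</html>"""

if __name__ == "__main__":
    title, links = extract_title_and_links(SAMPLE)
    print(title)
    print(links)
```

If the stored XHTML does not declare a namespace, the prefixed paths above match nothing and the `x:` prefixes would need to be dropped. When only the plain text is needed, the Endeca.Document.Text property avoids this parsing step entirely.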