Gathering XHTML information

If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property in the record.

  1. In a text editor, open default.xml.
  2. Set the output.dom.include property to true.

    You can now extract information from the XHTML using XSLT or any other XML processing system.

    Note: The Endeca.Document.Text property also contains the text extracted from the document, but with the XML header and the HTML tags removed. Therefore, if you do not need the XHTML version of the content, set the output.dom.include property to false.

  3. Save and close the file.
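
As an illustration of processing the stored XHTML with an XML processing system other than XSLT, the following minimal Python sketch pulls the page title and link targets out of an XHTML string using only the standard library. The SAMPLE_XHTML value and the extract_title_and_links function are hypothetical and are not part of the Web Crawler; the sketch assumes that the value of the Endeca.Document.XHTML property has already been read from the record into a string, and that the markup is well-formed XML. Whether the crawler's output declares the XHTML namespace is not stated here, so the code matches elements by local name and works either way.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the value of the Endeca.Document.XHTML property.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example page</title></head>
  <body>
    <p>See the <a href="http://www.example.com/docs">docs</a> and
       the <a href="/faq.html">FAQ</a>.</p>
  </body>
</html>"""


def local_name(tag):
    """Return the tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def extract_title_and_links(xhtml):
    """Return (title, [href, ...]) from a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    title = None
    links = []
    # root.iter() walks every element in document order.
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip any non-element nodes
        name = local_name(element.tag)
        if name == "title" and title is None and element.text:
            title = element.text.strip()
        elif name == "a" and element.get("href"):
            links.append(element.get("href"))
    return title, links


if __name__ == "__main__":
    page_title, page_links = extract_title_and_links(SAMPLE_XHTML)
    print(page_title)
    for href in page_links:
        print(href)
```

Because the tags are preserved in the XHTML, structural details such as link targets remain available; they are not available from the tag-free text in the Endeca.Document.Text property.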