You can set output properties in the default.xml file. Output can be written either to an output file (the default) or to a Record Store instance.
The properties in the table below allow you to specify the attributes of a crawl output file, such as its name, location, and output type. The default name of the output file is endecaOut, and it is a compressed binary file by default.
Note
By default, the Web Crawler writes output to a file on disk. If desired, you can configure the Web Crawler to write output to a Record Store instance instead. Oracle recommends writing to a Record Store instance.
If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property in the record.
Set the output.dom.include property to true. You can then extract information from the XHTML using XSLT or any other XML processing system.
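As an illustration of the XSLT route, the stylesheet below is a minimal sketch that prints a page's title followed by one link target per line. It assumes that the value of Endeca.Document.XHTML is well-formed XHTML in the standard XHTML namespace and that you have passed it to an XSLT 1.0 processor of your choice. The `x` prefix is an arbitrary name chosen for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the stored XHTML uses the standard XHTML namespace. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Emit the page title, then one line per link target. -->
  <xsl:template match="/">
    <xsl:value-of select="normalize-space(//x:title)"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:for-each select="//x:a[@href]">
      <xsl:value-of select="@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

If the stored XHTML turns out not to be namespaced, the same templates should work with the `x:` prefixes removed.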
Note that the Endeca.Document.Text property also contains the extracted text, but with the XML header and the HTML tags removed. Therefore, if you do not need the XHTML version of the content, set the output.dom.include property to false.
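A minimal sketch of such an entry follows. It assumes that output.dom.include is declared with the same name/value `<property>` entry format that these configuration files use for other properties. Whether you place it in default.xml or in a crawl-specific site.xml is up to you.

```xml
<!-- Sketch only: turns off the XHTML copy stored in Endeca.Document.XHTML. -->
<property>
  <name>output.dom.include</name>
  <value>false</value>
</property>
```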
The output.records.properties.excludes property allows you to specify a list of record properties that you want excluded from the records.
The list of excluded property names is space-delimited.
Note
Wildcards are not supported for the property names.
Example 3. Example of excluding record properties
For example, assume you want to exclude both Outlink properties from the output. You would add this entry to the site.xml configuration file:
    <property>
      <name>output.records.properties.excludes</name>
      <value>Endeca.Document.Outlink Endeca.Document.OutlinkCount</value>
    </property>
On the next crawl, the Endeca.Document.Outlink and the Endeca.Document.OutlinkCount properties will not appear in the output.
Note
You can add the exclusion list to the default.xml file, but the site.xml file is recommended because you can then specify different property exclusions for different crawl configurations.
For the output.file.binary.file-size-max property, if the output has to be written to more than one file, the names of the new files follow a pattern similar to this example:
    endecaOut-sgmt000.bin
    endecaOut-sgmt001.bin
    endecaOut-sgmt002.bin
That is, if the output.file.name value is set to endecaOut, then the suffix -sgmt000 is used for the first file and the number is incremented for each subsequent file.
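The two properties involved could be set together as in the sketch below. The size value is a placeholder rather than a documented default, and treating it as a number of bytes is an assumption. Check the property table for the actual unit and default before using it.

```xml
<!-- Sketch only: the size value is a placeholder, and its unit (bytes) is an assumption. -->
<property>
  <name>output.file.name</name>
  <value>endecaOut</value>
</property>
<property>
  <name>output.file.binary.file-size-max</name>
  <value>1073741824</value>
</property>
```

With these settings, output that exceeds the size limit would be split into files named on the endecaOut-sgmt000 pattern described above.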
The site.xml files in the workspace/conf/web-crawler/polite-crawl and workspace/conf/web-crawler/non-polite-crawl directories contain these output file overrides.
| config property | default.xml | polite site.xml | non-polite site.xml |
|---|---|---|---|
| | | | |
| | | | |
| | | | |
| | | | |

