You can set output properties in the default.xml file. You can configure the Web Crawler to write output either to an output file (the default) or to a Record Store instance.
The properties in the table below allow you to specify the attributes of a crawl output file, such as its name, location, and output type. The default name of the output file is endecaOut, and it is a compressed binary file by default.
Note
By default, the Web Crawler writes output to a file on disk. If desired, you can configure the Web Crawler to write output to a Record Store instance. Oracle recommends this approach.
If the output.dom.include property is set to true, the Web Crawler normalizes the content of HTML documents into XHTML and stores it in the Endeca.Document.XHTML property in the record.
Set the output.dom.include property to true. You can then extract information from the XHTML using XSLT or any other XML processing system.
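As a rough illustration of the XSLT route, the stylesheet below is a minimal sketch that prints the page title and every link target found in an XHTML document. It assumes the stored content uses the standard XHTML namespace (http://www.w3.org/1999/xhtml), and it does not cover how you retrieve the Endeca.Document.XHTML value from a record, which depends on your pipeline.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes the input is the XHTML text taken from the
     Endeca.Document.XHTML property and that it is in the XHTML namespace. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- Emit plain text rather than XML. -->
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- First line: the document title with whitespace collapsed. -->
    <xsl:value-of select="normalize-space(//xhtml:title)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- One line per anchor that carries an href attribute. -->
    <xsl:for-each select="//xhtml:a[@href]">
      <xsl:value-of select="@href"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Any XSLT 1.0 processor should be able to apply a stylesheet of this shape. If the stored XHTML turns out not to be namespaced, the xhtml: prefixes in the select expressions would need to be dropped.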
Note that the Endeca.Document.Text property also contains the extracted text, but with the XML header and the HTML tags removed. Therefore, if you do not need the XHTML version of the content, set the output.dom.include property to false.
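To make the toggle concrete, an entry along the following lines would turn XHTML storage on. This is a sketch that reuses the property/name/value layout of the exclusion example later in this section. Whether you place it in default.xml or in a site.xml override is a deployment choice that this section does not prescribe.

```xml
<!-- Sketch: enables storage of normalized XHTML in Endeca.Document.XHTML.
     Change the value to false if only Endeca.Document.Text is needed. -->
<property>
  <name>output.dom.include</name>
  <value>true</value>
</property>
```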
The output.records.properties.excludes property allows you to specify a list of record properties that you want excluded from the records. The list of excluded property names is space delimited.
Note
Wildcards are not supported for the property names.
Example 3. Example of excluding record properties
For example, assume you want to exclude both Outlink properties from the output. You would add this entry to the site.xml configuration file:

    <property>
      <name>output.records.properties.excludes</name>
      <value>Endeca.Document.Outlink Endeca.Document.OutlinkCount</value>
    </property>

On the next crawl, the Endeca.Document.Outlink and Endeca.Document.OutlinkCount properties will not appear in the output.
Note
You can add the exclusion list to the default.xml file, but the site.xml file is recommended because you can then specify different property exclusions for different crawl configurations.
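As a sketch of what a per-configuration exclusion might look like, a second crawl configuration could carry its own site.xml entry that drops only one of the two Outlink properties. Which configuration receives which list is purely an assumption for illustration.

```xml
<!-- Sketch: a site.xml entry for a different crawl configuration.
     Only Endeca.Document.OutlinkCount is excluded here, so
     Endeca.Document.Outlink would still be written to the records. -->
<property>
  <name>output.records.properties.excludes</name>
  <value>Endeca.Document.OutlinkCount</value>
</property>
```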
For the output.file.binary.file-size-max property, if the output has to be written to more than one file, the names of the new files follow a pattern similar to this example:

    endecaOut-sgmt000.bin
    endecaOut-sgmt001.bin
    endecaOut-sgmt002.bin

That is, if the output.file.name value is set to endecaOut, then the suffix -sgmt000 is used for the first file and the number is incremented for each subsequent file.
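The naming rule can be illustrated with a different base name. In the sketch below, productCrawl is a hypothetical value chosen only for this example. By analogy with the endecaOut pattern, segmented binary output would be expected to appear as productCrawl-sgmt000.bin, productCrawl-sgmt001.bin, and so on.

```xml
<!-- Sketch: overrides the base name of the crawl output file.
     "productCrawl" is a hypothetical name, not a shipped default. -->
<property>
  <name>output.file.name</name>
  <value>productCrawl</value>
</property>
```

No value is shown for output.file.binary.file-size-max because this section does not state its unit or its default.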
The site.xml files in the workspace/conf/web-crawler/polite-crawl and workspace/conf/web-crawler/non-polite-crawl directories contain these output file overrides.
config property | default.xml | polite site.xml | non-polite site.xml |
---|---|---|---|
 | | | |
 | | | |
 | | | |
 | | | |