You can set output properties in the default.xml file. Output can be written either to an output file (the default) or to a Record Store instance.

The properties in the table below allow you to specify the attributes of a crawl output file, such as its name, location, and output type. The default name of the output file is endecaOut and it is a compressed binary file by default.

Property Name

Description

output.file.directory

Directory name (default is workspace). Specifies the directory for the output file. The name is case-sensitive and is relative to the directory from which you run the crawl. You can specify a multi-level path. This setting can be overridden with the -w command-line flag.

output.file.name

File name (default is webcrawler-output). Specifies the filename of the output file. The name is case-sensitive.

output.file.is-xml

Boolean value (default is false). Specifies whether the output type is XML (true) or binary (false). XML is useful if you want to visually inspect the Endeca records after crawling.

output.file.is-compressed

Boolean value (default is true). Specifies whether to compress the Endeca records in a .gz file. Setting this property to true is useful when storing and transferring large files.

output.file.binary.file-size-max

Integer value (default is -1). Sets the maximum file size for binary output files. Output is written to a new file once the maximum size is reached. If the value is set to -1, no limits are imposed on the file size.

output.dom.include

Boolean value (default is false). Specifies whether to include the DOM for the Web page in the output Endeca records.

output.records.properties.excludes

Space-delimited list of output record properties (default is empty). Specifies the properties to exclude from the records. The names are case-insensitive. Wildcard names are not supported.

log.interval

Integer value in seconds (default is 60). Outputs crawl metrics information to the log every time this number of seconds has elapsed, per depth.

log.interval.summary

Integer value in seconds (default is 300). Outputs detailed crawl progress information (organized by host) every time this number of seconds has elapsed.
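
As an illustration of how the properties above fit together, the following fragment sketches a file-output configuration. It assumes that default.xml uses a Hadoop-style layout, with a configuration root element containing property entries that each hold a name and a value; check the default.xml shipped with your installation for the exact structure. The directory and file names (crawl-output, mysite-crawl) and the interval values are hypothetical and chosen only for the example.

```xml
<?xml version="1.0"?>
<!-- Sketch only: element layout is an assumption; verify against your own default.xml. -->
<configuration>

  <!-- Write output to ./crawl-output (hypothetical), relative to where the crawl is run.
       The -w command-line flag overrides this setting. -->
  <property>
    <name>output.file.directory</name>
    <value>crawl-output</value>
  </property>

  <!-- Hypothetical, case-sensitive name for the output file. -->
  <property>
    <name>output.file.name</name>
    <value>mysite-crawl</value>
  </property>

  <!-- Produce XML rather than binary output so the records can be inspected by eye. -->
  <property>
    <name>output.file.is-xml</name>
    <value>true</value>
  </property>

  <!-- Skip .gz compression while inspecting records; the default is true. -->
  <property>
    <name>output.file.is-compressed</name>
    <value>false</value>
  </property>

  <!-- Leave the page DOM out of the records (this is the default). -->
  <property>
    <name>output.dom.include</name>
    <value>false</value>
  </property>

  <!-- Log crawl metrics every 30 seconds instead of the default 60. -->
  <property>
    <name>log.interval</name>
    <value>30</value>
  </property>

  <!-- Log the detailed per-host progress summary every 120 seconds instead of the default 300. -->
  <property>
    <name>log.interval.summary</name>
    <value>120</value>
  </property>

</configuration>
```

A combination like this suits a small test crawl, where readable output matters more than file size. For large production crawls, the defaults (binary output, compression enabled) are likely the better starting point.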

