Output properties

You set output properties in the default.xml file. You can direct crawl output either to an output file (the default) or to a Record Store instance.

The properties in the table below specify the attributes of a crawl output file, such as its name, location, and output type. By default, the output file is named webcrawler-output and is a compressed binary file.

Note: By default, the Web Crawler writes output to a file on disk. You can instead configure the Web Crawler to write output to a Record Store instance, which is the approach Oracle recommends.

| Property Name | Property Value |
| --- | --- |
| output.file.directory | Directory name (default is workspace). Specifies the directory for the output file. The name is case-sensitive and is relative to where you run the crawl from. You can specify a multi-level path. Note that this setting can be overridden with the -w command-line flag. |
| output.file.name | File name (default is webcrawler-output). Specifies the filename of the output file. The name is case-sensitive. |
| output.file.is-xml | Boolean value (default is false). Specifies whether the output type is XML (true) or binary (false). XML is useful if you want to visually inspect the Endeca records after crawling. |
| output.file.is-compressed | Boolean value (default is true). Specifies whether to compress the Endeca records in a .gz file. Setting this property to true is useful when storing and transferring large files. |
| output.file.binary.file-size-max | Integer value (default is -1). Sets the maximum file size for binary output files. Output is written to a new file once the maximum size is reached. If the value is set to -1, no limits are imposed on the file size. |
| output.dom.include | Boolean value (default is false). Specifies whether to include the DOM for the Web page in the output Endeca records. |
| output.records.properties.excludes | Space-delimited list of output record properties (default is empty). Specifies the properties that should be excluded from the records. The names can be specified in a case-insensitive format. Note that wildcard names are not supported. |
| log.interval | Integer value in seconds (default is 60). Outputs crawl metrics information to the log every time this number of seconds has elapsed, per depth. |
| log.interval.summary | Integer value in seconds (default is 300). Outputs detailed crawl progress information (organized by host) every time this number of seconds has elapsed. |
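
As an illustration of how several of these properties might be combined, the fragment below sketches overrides that write uncompressed XML records to a custom directory, which can be convenient when inspecting records by hand. It is a sketch under assumptions, not a copy of the shipped file: it assumes default.xml uses name/value property entries inside a configuration root element, and the directory, file name, and excluded property names are hypothetical placeholders.

```xml
<?xml version="1.0"?>
<!-- Sketch only. Assumes default.xml holds <property> entries with <name>
     and <value> children inside a <configuration> root element. -->
<configuration>
  <!-- Hypothetical multi-level path, relative to where the crawl is run. -->
  <property>
    <name>output.file.directory</name>
    <value>crawl-output/run1</value>
  </property>
  <!-- Hypothetical file name; the name is case-sensitive. -->
  <property>
    <name>output.file.name</name>
    <value>my-crawl</value>
  </property>
  <!-- Write XML instead of binary so the records can be read directly. -->
  <property>
    <name>output.file.is-xml</name>
    <value>true</value>
  </property>
  <!-- Skip .gz compression for easier inspection. -->
  <property>
    <name>output.file.is-compressed</name>
    <value>false</value>
  </property>
  <!-- Space-delimited list; both property names are hypothetical placeholders.
       Wildcard names are not supported. -->
  <property>
    <name>output.records.properties.excludes</name>
    <value>Example.Property.One Example.Property.Two</value>
  </property>
  <!-- Log crawl metrics every 120 seconds instead of the default 60. -->
  <property>
    <name>log.interval</name>
    <value>120</value>
  </property>
</configuration>
```

Properties omitted from the fragment keep the defaults listed in the table. Because the -w command-line flag can override output.file.directory, the directory shown here applies only when that flag is not used.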