The prefix for the name of a crawl output file is set by the outputPrefix property (in the API) or key (in the configuration file). If you do not specify an output prefix, a default name of CrawlerOutput is used.
The full name of the output file also depends on two other configuration settings:
The outputXml property. This specifies whether the output format is XML (with a file extension of .xml) or Binary (with a file extension of .bin).
The outputCompressed property. This determines whether the output file is compressed. If compression is enabled, a .gz file extension is added to the .xml or .bin extension. No extension is added if compression is not enabled.
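The extension rules for these two settings can be sketched in a few lines of Python. This is only an illustration of the naming rule, not crawler code: the function name output_extension and its parameter names are hypothetical, chosen to mirror the outputXml and outputCompressed settings.

```python
def output_extension(output_xml: bool, output_compressed: bool) -> str:
    """Return the file extension implied by the two settings.

    Hypothetical helper: the parameters mirror the outputXml and
    outputCompressed settings; this is not part of the crawler API.
    """
    # XML output uses .xml, Binary output uses .bin.
    extension = ".xml" if output_xml else ".bin"
    # Compression appends .gz after the format extension.
    if output_compressed:
        extension += ".gz"
    return extension


print(output_extension(output_xml=False, output_compressed=True))  # .bin.gz
print(output_extension(output_xml=True, output_compressed=False))  # .xml
```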
In addition to the output prefix described above, a second prefix is automatically added to the filename to distinguish which type of crawl was run:
The maximum size of a binary output file is 512 megabytes. If the maximum size is reached and more records need to be output, the crawler rolls the output into another output file. To distinguish rollover files, the -sgmt000 prefix is added to the first file, -sgmt001 is added to the second file, and so on, as shown in this example:
CrawlerOutput-FULL-sgmt000.bin.gz
CrawlerOutput-FULL-sgmt001.bin.gz
The maximum size of binary output files is not configurable. Unlike binary output, XML output is always written to a single file, regardless of its size.
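As a rough sketch of how the rollover names are composed, the following Python reproduces the example file names. The function segment_file_names and its parameters are hypothetical. The FULL crawl-type token and the three-digit, zero-padded counter are inferred from the example file names rather than from a stated rule, and the sketch covers binary output only, since XML output is written to a single file.

```python
def segment_file_names(prefix: str = "CrawlerOutput",
                       crawl_type: str = "FULL",
                       segments: int = 2,
                       compressed: bool = True) -> list[str]:
    """Build the names of rolled-over binary output files.

    Hypothetical helper for illustration; the name pattern is inferred
    from the example file names, not taken from the crawler itself.
    """
    # Binary output uses .bin, with .gz appended when compressed.
    extension = ".bin" + (".gz" if compressed else "")
    # Segments are numbered from 000, zero-padded to three digits.
    return [
        f"{prefix}-{crawl_type}-sgmt{index:03d}{extension}"
        for index in range(segments)
    ]


for name in segment_file_names():
    print(name)
# CrawlerOutput-FULL-sgmt000.bin.gz
# CrawlerOutput-FULL-sgmt001.bin.gz
```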