The prefix for the name of a crawl output file is set by the outputPrefix property (in the API) or the outputPrefix key (in the configuration file). If you do not specify an output prefix, the default prefix CrawlerOutput is used.

The full name of the output file also depends on two other configuration settings:

In addition to the output prefix described above, a second prefix is automatically added to the filename to distinguish which type of crawl was run:

The maximum size of a binary output file is 512 megabytes. If the maximum size is reached and more records need to be output, the crawler rolls the output over into another output file. To distinguish rollover files, a segment identifier is added to each filename after the crawl type: -sgmt000 for the first file, -sgmt001 for the second file, and so on, as shown in this example:

CrawlerOutput-FULL-sgmt000.bin.gz
CrawlerOutput-FULL-sgmt001.bin.gz
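
The naming pattern in the example above can be sketched in a few lines of Python. This is only an illustration of how the parts fit together, not the crawler's own code: the function name output_file_name and its parameters are hypothetical, and the FULL crawl type and the .bin.gz extension are taken from the example filenames rather than from a complete list of possible values.

```python
DEFAULT_PREFIX = "CrawlerOutput"  # default named in this section


def output_file_name(segment, crawl_type="FULL", prefix=None):
    """Build a binary output filename like those in the example above.

    Hypothetical helper: the crawl type and the ".bin.gz" extension are
    assumptions copied from the example filenames.
    """
    if segment < 0:
        raise ValueError("segment index must be non-negative")
    # Fall back to the default prefix when none is configured.
    effective_prefix = prefix or DEFAULT_PREFIX
    # The segment index is zero-padded to three digits: sgmt000, sgmt001, ...
    return f"{effective_prefix}-{crawl_type}-sgmt{segment:03d}.bin.gz"


if __name__ == "__main__":
    for index in range(2):
        print(output_file_name(index))
```

Calling output_file_name(0) and output_file_name(1) with no prefix reproduces the two filenames in the example; passing prefix="MyCrawl" (an arbitrary illustrative value) replaces only the leading part of the name.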

The maximum size of binary output files is not configurable. Unlike binary output, XML output is always written to a single file, regardless of its size.
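
Because binary output can be split across several rollover files, a program that reads the output may need to collect the segments and process them in order. The following is a minimal sketch of one way to do that, assuming the filenames follow the pattern shown in the example (prefix, crawl type, -sgmtNNN, then .bin.gz). The regular expression and the function name ordered_segments are hypothetical and are not part of the crawler's API.

```python
import re

# Assumed pattern, inferred from the example filenames in this section:
#   <prefix>-<crawl type>-sgmt<digits>.bin.gz
SEGMENT_PATTERN = re.compile(
    r"^(?P<prefix>.+)-(?P<crawl_type>[A-Za-z]+)-sgmt(?P<segment>\d+)\.bin\.gz$"
)


def ordered_segments(file_names):
    """Return the names that look like rollover files, sorted by
    prefix, crawl type, and numeric segment index."""
    matched = []
    for name in file_names:
        match = SEGMENT_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        key = (
            match.group("prefix"),
            match.group("crawl_type"),
            int(match.group("segment")),
        )
        matched.append((key, name))
    matched.sort(key=lambda item: item[0])
    return [name for _, name in matched]


if __name__ == "__main__":
    listing = [
        "CrawlerOutput-FULL-sgmt001.bin.gz",
        "notes.txt",
        "CrawlerOutput-FULL-sgmt000.bin.gz",
    ]
    for name in ordered_segments(listing):
        print(name)
```

Sorting on the numeric segment index rather than on the raw filename keeps the order correct even if an index ever has more digits than the three shown in the example.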

