The first time a crawl is run in a given workspace directory, the output file is named as described in the previous section.
For example, if you run a full crawl, the output filename might be endecaOut-sgmt000.bin.gz.
If you then run a second crawl (full or resumable), the Web Crawler works as follows:
1. A directory named archive is created under the output directory.
2. The original endecaOut-sgmt000.bin.gz file is moved to the archive directory and is renamed by adding a timestamp to the name; for example: endecaOut-20091015173554-sgmt000.bin.gz
3. The output file from the second run is named endecaOut-sgmt000.bin.gz and is stored in the output directory.
For every subsequent crawl using the same workspace directory, steps 2 and 3 are repeated.
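The archiving steps above can be sketched in Python as follows. This is only an illustration of the described behavior, not the Web Crawler's own code: the function name archive_previous_output and its parameters are hypothetical, and the sketch assumes the timestamp is inserted after the first hyphen in the filename, as in the example above. The source does not say which moment the timestamp records, so the sketch simply takes it as an argument.

```python
import shutil
from datetime import datetime
from pathlib import Path


def archive_previous_output(output_dir, name="endecaOut-sgmt000.bin.gz", when=None):
    """Move an existing output file into output_dir/archive, adding a timestamp.

    Hypothetical helper: returns the archived path, or None if no earlier
    output file exists (the first crawl in a workspace).
    """
    output_dir = Path(output_dir)
    current = output_dir / name
    if not current.exists():
        return None  # nothing to archive yet

    # Step 1: create the archive directory under the output directory.
    archive_dir = output_dir / "archive"
    archive_dir.mkdir(exist_ok=True)

    # Step 2: move the old file and add a YYYYMMDDHHmmSS timestamp to its name.
    stamp = (when or datetime.now()).strftime("%Y%m%d%H%M%S")
    prefix, _, rest = name.partition("-")  # "endecaOut", "-", "sgmt000.bin.gz"
    archived = archive_dir / f"{prefix}-{stamp}-{rest}"
    shutil.move(str(current), str(archived))

    # Step 3 (writing the new endecaOut-sgmt000.bin.gz) is left to the crawl itself.
    return archived
```

Calling the helper once before each new crawl reproduces the pattern described: the output directory always holds the latest file under the plain name, and the archive directory accumulates one timestamped file per earlier run.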
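The timestamp format used for renaming is:
YYYYMMDDHHmmSS
where:
YYYY is the four-digit year.
MM is the two-digit month.
DD is the two-digit day of the month.
HH is the two-digit hour (24-hour clock, as in the example above).
mm is the two-digit minute.
SS is the two-digit second.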
Note that the timestamp format is hard-coded and cannot be reconfigured.
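Because the format is fixed, a script that processes archived files can rely on it. The following is a minimal sketch, assuming the archived filename has the shape shown in the earlier example (timestamp between the first and second hyphens); the helper name stamp_from_archive_name is hypothetical.

```python
from datetime import datetime

# strftime/strptime equivalent of YYYYMMDDHHmmSS
ARCHIVE_STAMP = "%Y%m%d%H%M%S"


def stamp_from_archive_name(filename):
    """Extract the timestamp from a name like endecaOut-20091015173554-sgmt000.bin.gz."""
    stamp = filename.split("-")[1]  # the 14-digit field after the first hyphen
    return datetime.strptime(stamp, ARCHIVE_STAMP)


print(stamp_from_archive_name("endecaOut-20091015173554-sgmt000.bin.gz"))
# prints 2009-10-15 17:35:54
```

Since the fields run from most to least significant and are zero-padded, sorting archived filenames alphabetically also sorts them chronologically.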