You run a resumable crawl from the command line.

You can resume a crawl if you use the same workspace directory as the previous crawl and a valid crawl history database exists in the state/web directory. The resumed crawl processes any URL in the database that has a status of pending, and it also generates new URLs to crawl.

Keep in mind that the value of the -d (depth) flag must be greater than the depth of the previous crawl; otherwise, no new records are retrieved (unless the previous crawl did not reach its full depth). Also, you cannot change the seed. You can, however, change the configuration of the resumed crawl.

To run a resumable crawl:
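
The exact command depends on your installation. As a minimal sketch, assuming a web-crawler launch script and -s (seed), -c (configuration directory), and -w (workspace directory) flags in addition to the -d flag described above, a resumed crawl might be started like this:

   # First crawl to depth 2 (script name, flags other than -d, and paths are assumptions)
   ./bin/web-crawler -c conf/web/polite-crawl -s conf/web/mysite.lst -d 2 -w workspace

   # Resumed crawl: same seed and workspace, greater -d value
   ./bin/web-crawler -c conf/web/polite-crawl -s conf/web/mysite.lst -d 3 -w workspace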

The crawl is finished when the Web Crawler logs the Crawler complete message. The output file is written to the output subdirectory of the workspace directory, and the output file from the previous crawl is renamed and moved to the output/archive subdirectory.
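
For illustration only (the file names below are hypothetical), the workspace might look like this after a resumed crawl:

   workspace/
     output/
       mysite.xml.bin             output file from the resumed crawl
       archive/
         mysite-20100615.xml.bin  renamed output file from the previous crawl
     state/
       web/                       crawl history database used to resume the crawl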

Below is an example of a resumed crawl using the default polite configuration. For ease of reading, the timestamps and module names are truncated. As with full crawls, the complete output will include the crawl metrics and crawl host progress summaries.


