This topic provides an overview of resumable crawls.
A resumable crawl (also called a restartable crawl) is a crawl that reuses the seed URLs of a previous full or resumed crawl, but with a greater depth level, a different set of configuration settings, or both.
You use the -r (or --resume) command-line flag to resume a crawl. Resumable crawls use the previously-created crawl history database in the workspace directory, because the database provides the seed and a list of URLs that have already been crawled.
Resumable crawls do not recrawl URLs that have a status of complete in the history database.
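For example, a crawl can be resumed with a command such as the following (a sketch only: it assumes the Web Crawler is launched through its web-crawler startup script, so substitute the script name and working directory used in your installation):

   web-crawler -r -d 2

No seed is specified, because the resumed crawl takes its seed, and the list of already-crawled URLs, from the history database in the workspace directory. The rules later in this topic describe the constraints on the other flags.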
Among the possible use-case scenarios for resumable crawls are the following:
You have successfully run a crawl (for example, a test crawl using a depth of 0). Now you want to run the same crawl again (i.e., same seeds and same configuration), but this time with a greater depth. However, because you have the output from the first crawl, you do not want to recrawl those pages, but instead want to start from where the first crawl finished.
You have successfully run a crawl, and now want to run the same crawl (i.e., same seeds) but with a different configuration. Again, you do not want to recrawl any previously-crawled pages and want to keep the output from the first crawl.
The rules for resumed crawls are the following:
A previous crawl must have been successfully run. That is, the previous crawl must have generated a history (state) database that will be used as a starting point for the resumed crawl. Note that crawls that were stopped (e.g., via a Control-C in the command window) are considered successful crawls if the crawl was gracefully shut down (that is, the history database is up-to-date).
The same seed must be used. That is, you cannot use the -s flag to specify a different seed for the resumed crawl (the flag is ignored if you use it). Instead, the Web Crawler uses the seed from the history database. Because the history database also contains the list of URLs that were crawled, the resumed crawl will not recrawl those URLs.
The same workspace directory must be used. You cannot use the -w flag to specify a different workspace directory, because the resumed crawl must use the same history database as the previous crawl (and must also update that database with the newly-crawled information).
You must use the -d flag to specify a greater crawl depth than the previous crawl. If you specify a crawl depth that is less than or equal to that of the previous crawl, no records are generated. (However, if you specify the same depth as the previous crawl and the previous crawl did not finish that depth, then records are generated.) The same rule applies to the maximum number of requests to be made (via the -l flag).
The -c flag can be used to provide a different configuration for the resumed crawl. The new configuration is used for the uncrawled pages, but does not affect pages that have already been crawled. Because you can change the configuration, you can specify a new output file name.
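For example, the following invocation resumes a crawl at a greater depth and with a new configuration (again a sketch: the script name and the configuration directory path are placeholders to adapt to your installation):

   web-crawler -r -d 3 -c ../new-config

The -s and -w flags are omitted because the seed comes from the history database and the crawl must run against the same workspace directory as the original crawl. The configuration in ../new-config (including any new output file name it defines) applies only to the pages that have not yet been crawled.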