You run a resumable crawl from the command line. You can run a resumable crawl if you use the same workspace directory as the previous crawl and if a valid history database exists in the state/web directory. The resumed crawl processes any URL in the database that has a status of pending and also generates new URLs to crawl.
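For example, before resuming you can confirm that the history database directory exists under the workspace (the ..\workspace location shown here is only an example; use the same workspace directory as the previous crawl):
dir ..\workspace\state\web
If the directory does not exist, the previous crawl cannot be resumed.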
Keep in mind that the value of the -d flag should be greater than that of the previous crawl, or else no new records are retrieved (unless the previous crawl did not finish the depth). Also, you cannot change the seed. You can, however, change the configuration of the resumed crawl.
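For example, if the previous crawl was run with -d 2, resume it with a greater depth value so that new records are retrieved:
.\bin\web-crawler -r -d 3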
Note
If you are using the default configuration, Web crawls must be run from the Web Crawler root directory (i.e., in a Windows installation, the \CAS\version directory). To run crawls from other directories, you must change the plugin.folders configuration property so that it uses an absolute path (to the lib\plugins directory) instead of a relative path.
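For example, assuming the crawler's XML-based configuration format, the plugin.folders property would be changed from its relative default to an absolute path similar to the following (the installation path shown is only illustrative; adjust it to your system):
<property>
  <name>plugin.folders</name>
  <value>C:\Endeca\CAS\version\lib\plugins</value>
</property>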
To run a resumable crawl:
1. Navigate to the Web Crawler root directory. For example, in a default installation on Windows, this is \CAS\version. Note that you can run the startup script from an external directory if you have set an absolute path in the plugin.folders configuration property.
2. Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with the -r and -d flags. Use the -w flag if you need to specify the location of the workspace directory (a UNIX example that uses -w follows this procedure). For example:
.\bin\web-crawler -r -d 3
If the crawl begins successfully, the first INFO message reads: Resuming an old crawl. Seed URLs are ignored.
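On UNIX, an equivalent invocation that also specifies the workspace location with the -w flag might look like the following (the workspace path is only an example):
./bin/web-crawler.sh -r -d 3 -w /opt/endeca/workspace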
The crawl is finished when you see the Crawl complete message from the Web Crawler. The output file is written to the output subdirectory in the workspace directory, while the previous output file is renamed and moved to the output\archive subdirectory.
Below is an example of a resumed crawl using the default polite configuration. For ease of reading, the timestamps and module names are truncated. As with full crawls, the complete output will include the crawl metrics and crawl host progress summaries.
Example 7. Example of running a resumed crawl
C:\Endeca\3.1.1\CAS>.\bin\web-crawler -d 1 -c ..\workspace\conf\web-crawler\polite-crawl -r
Resuming an old crawl. Seed URLs are ignored.
Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
Using executor settings: numThreads = 100, maxThreadsPerHost=1
Resuming the crawl.
Starting crawler shut down, waiting for running threads to complete
Finished level: host: endeca.com, depth: 1, max depth reached
Progress: Level: Cumulative crawl summary (level)
host-summary: endeca.com to depth 2
host          depth   completed   total   blocks
endeca.com    0       0           0       0
endeca.com    1       36          36      1
endeca.com    2       0           141     1
endeca.com    all     36          177     2
host-summary: total crawled: 36 completed. 177 total.
Shutting down CrawlDb
Progress: Host: Cumulative crawl summary (host)
Host: endeca.com: 35 fetched. 0.4 mB. 35 records. 0 redirected. 0 retried. 1 gone. 377 filtered.
Progress: Perf: All (cumulative) 40.0s. 0.9 Pages/s. 9.6 kB/s. 35 fetched. 0.4 mB. 0 redirected. 0 retried. 1 gone. 377 filtered.
Crawl complete.