You run a resumable crawl from the command line. You can run a resumable crawl if you use the same workspace directory as the previous crawl and if a valid history database exists in the state/web directory. The resumed crawl processes any URL in the database that has a status of pending and also generates new URLs to crawl.
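For example, before resuming you can confirm that the history database directory exists under the workspace (the ..\workspace location shown here is only an example; use the same workspace directory as the previous crawl):
dir ..\workspace\state\web
If the directory does not exist, the previous crawl cannot be resumed.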
Keep in mind that the value of the -d flag should be greater than that of the previous crawl, or else no new records are retrieved (unless the previous crawl did not finish the depth). Also, you cannot change the seed. You can, however, change the configuration of the resumed crawl.
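For example, if the previous crawl was run with -d 2, resume it with a greater depth value so that new records are retrieved:
.\bin\web-crawler -r -d 3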
Note
If you are using the default configuration, Web crawls must be run from the Web Crawler root directory (i.e., in a Windows installation, the \CAS\version directory). To run crawls from other directories, you must change the plugin.folders configuration property so that it uses an absolute path (to the lib\plugins directory) instead of a relative path.
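For example, assuming the crawler's XML-based configuration format, the plugin.folders property would be changed from its relative default to an absolute path similar to the following (the installation path shown is only illustrative; adjust it to your system):
<property>
  <name>plugin.folders</name>
  <value>C:\Endeca\CAS\version\lib\plugins</value>
</property>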
To run a resumable crawl:
1. Navigate to the Web Crawler root directory. For example, in a default installation on Windows, this is \CAS\version. Note that you can run the startup script from an external directory if you have set an absolute path in the plugin.folders configuration property.
2. Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with the -r and -d flags. Use the -w flag if you need to specify the location of the workspace directory (a UNIX example that uses -w follows this procedure). For example:
.\bin\web-crawler -r -d 3
If the crawl begins successfully, the first INFO message reads: Resuming an old crawl. Seed URLs are ignored.
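On UNIX, an equivalent invocation that also specifies the workspace location with the -w flag might look like the following (the workspace path is only an example):
./bin/web-crawler.sh -r -d 3 -w /opt/endeca/workspace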
The crawl is finished when you see the Crawl complete message from the Web Crawler. The output file is written to the output subdirectory in the workspace directory, while the previous output file is renamed and moved to the output\archive subdirectory.
Below is an example of a resumed crawl using the default polite configuration. For ease of reading, the timestamps and module names are truncated. As with full crawls, the complete output will include the crawl metrics and crawl host progress summaries.
Example 7. Example of running a resumed crawl
C:\Endeca\3.1.1\CAS>.\bin\web-crawler -d 1 -c ..\workspace\conf\web-crawler\polite-crawl -r
Resuming an old crawl. Seed URLs are ignored.
Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
Using executor settings: numThreads = 100, maxThreadsPerHost=1
Resuming the crawl.
Starting crawler shut down, waiting for running threads to complete
Finished level: host: endeca.com, depth: 1, max depth reached
Progress: Level: Cumulative crawl summary (level)
host-summary: endeca.com to depth 2
host          depth   completed   total   blocks
endeca.com    0       0           0       0
endeca.com    1       36          36      1
endeca.com    2       0           141     1
endeca.com    all     36          177     2
host-summary: total crawled: 36 completed. 177 total.
Shutting down CrawlDb
Progress: Host: Cumulative crawl summary (host)
Host: endeca.com: 35 fetched. 0.4 mB. 35 records. 0 redirected. 0 retried. 1 gone. 377 filtered.
Progress: Perf: All (cumulative) 40.0s. 0.9 Pages/s. 9.6 kB/s. 35 fetched. 0.4 mB. 0 redirected. 0 retried. 1 gone. 377 filtered.
Crawl complete.