You run full crawls from the command line. A full crawl means that the crawler processes all the URLs in the seed (except for URLs that are excluded by filters). By default, a crawl history database is created in the workspace/state/web directory.
You can run multiple simultaneous crawls on the same machine. When running multiple crawls, each crawl must have its own workspace directory. All of the crawls can use the same configuration, or each can use a crawl-specific configuration.
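For example, the following two commands, run from separate shells, share one configuration but keep crawl-specific workspaces via the -w flag (described in the steps below). The workspace paths and seed file names here are hypothetical:

   .\bin\web-crawler -c conf\web\myconfig -d 2 -s siteA.lst -w ..\workspaces\crawlA
   .\bin\web-crawler -c conf\web\myconfig -d 2 -s siteB.lst -w ..\workspaces\crawlB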
Note
If you are using the default configuration, you must run Web crawls from the Web Crawler root directory (i.e., the CAS\version directory). To run crawls from other directories, you must change the plugin.folders configuration property so that it uses an absolute path (to the lib\plugins directory) instead of a relative path.
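As a sketch of that change, assuming the property lives in a Nutch-style XML configuration file and an install root of C:\Endeca\CAS\3.1.1 (the file format and install path are assumptions; only the plugin.folders property name comes from this section):

   <property>
     <name>plugin.folders</name>
     <!-- Absolute path to the plugins directory; the default is the relative path lib/plugins -->
     <value>C:\Endeca\CAS\3.1.1\lib\plugins</value>
   </property>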
To run a full crawl:

1. Navigate to the Web Crawler root directory. (Note that you can run the startup script from an external directory if you have set an absolute path in the plugin.folders configuration property.)
2. Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with at least the -d and -s flags. You can use the optional flags to customize the crawl, such as the -w flag to specify the workspace directory. For example:

   .\bin\web-crawler -c conf\web\myconfig -d 2 -s mysites.lst
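The -s flag accepts either a single seed URL (as in Example 6 below) or a seed file such as mysites.lst. As a sketch, a seed file might list one URL per line; that format is an illustrative assumption, not a specification from this section:

   http://www.example.com/
   http://www.example.com/products/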
If the crawl begins successfully, you see INFO progress messages. The crawl is finished when you see the Crawl complete message from the Web Crawler.
The output file is written to the output subdirectory in the workspace directory.
Note that by default, the console receives all messages. You can create a crawl log by either redirecting the output to a log file (such as >crawl.log) or specifying a file appender in the log4j.properties logging configuration file.
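As a minimal sketch of the file appender approach, using log4j 1.x properties syntax (the appender name FILE and the conversion pattern are illustrative assumptions):

   # Add FILE to the existing rootLogger line, for example:
   # log4j.rootLogger=INFO, stdout, FILE
   log4j.appender.FILE=org.apache.log4j.FileAppender
   log4j.appender.FILE.File=crawl.log
   log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
   # Pattern approximating the console messages shown in Example 6 below
   log4j.appender.FILE.layout.ConversionPattern=%p %d %r %c [%t] %m%n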
Below is an example of a full crawl using the default polite configuration. The complete output includes crawl summaries with such page information as how many pages were fetched, redirected, retried, gone (i.e., not available because of 404 errors or other reasons), and filtered.
Example 6. Example of running a full crawl
C:\Endeca\CAS\3.1.1>.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
INFO 2009-07-27 09:38:47,528 0 com.endeca.itl.web.Main [main] Adding seed: http://www.endeca.com
INFO 2009-07-27 09:38:47,544 16 com.endeca.itl.web.Main [main] Seed URLs: [http://www.endeca.com]
INFO 2009-07-27 09:38:49,606 2078 com.endeca.itl.web.db.CrawlDbFactory [main] Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
INFO 2009-07-27 09:38:49,606 2078 com.endeca.itl.web.Crawler [main] Using executor settings: numThreads = 100, maxThreadsPerHost=1
INFO 2009-07-27 09:38:50,841 3313 com.endeca.itl.web.Crawler [main] Fetching seed URLs.
INFO 2009-07-27 09:38:51,622 4094 com.endeca.itl.web.Crawler [main] Seeds complete.
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Starting crawler shut down, waiting for running threads to complete
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Progress: Level: Cumulative crawl summary (level)
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] host-summary: www.endeca.com to depth 1
host            depth  completed  total  blocks
www.endeca.com  0      1          1      1
www.endeca.com  1      0          38     1
www.endeca.com  all    1          39     2
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] host-summary: total crawled: 1 completed. 39 total.
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Shutting down CrawlDb
INFO 2009-07-27 09:38:51,700 4172 com.endeca.itl.web.Crawler [main] Progress: Host: Cumulative crawl summary (host)
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Host: www.endeca.com: 1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Progress: Perf: All (cumulative) 2.0s. 0.5 Pages/s. 4.8 kB/s. 1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Crawl complete.