You run full crawls from the command line.

A full crawl means that the crawler processes all the URLs in the seed (except for URLs that are excluded by filters). By default, a crawl history database is created in the workspace/state/web directory.

You can run multiple, simultaneous crawls on the same machine. When running multiple crawls, each crawl must have its own workspace directory. All the crawls can use the same configuration, or they can use a crawl-specific configuration.
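For example, two simultaneous crawls that share one configuration but use separate workspaces might be started from two command prompts as sketched below. This sketch assumes the workspace directory is passed with a -w option and uses placeholder seed URLs; verify the exact option name in your release's usage output.

C:\Endeca\CAS\3.1.1>.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -w ..\crawl1-workspace -s http://www.example.com
C:\Endeca\CAS\3.1.1>.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -w ..\crawl2-workspace -s http://www.example.org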

To run a full crawl, run the web-crawler script with the -c flag (the path of the crawl configuration directory), the -d flag (the maximum crawl depth), and the -s flag (the seed URL), as in Example 6 below.

The crawl is finished when the Web Crawler logs the Crawl complete message. The output file is written to the output subdirectory of the workspace directory.
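Taken together, the workspace directory contains roughly the layout sketched below. This is an illustrative sketch based on the paths mentioned in this section; the exact layout may vary by release.

workspace\
  conf\web-crawler\polite-crawl\   crawl configuration (as passed to -c)
  state\web\                       crawl history database
  output\                          crawl output file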

Note that by default, all messages are sent to the console. You can create a crawl log either by redirecting the output to a file (for example, >crawl.log) or by specifying a file appender in the log4j.properties logging configuration file.
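For illustration, a file appender in log4j 1.x syntax might look like the fragment below. The appender name, file name, and conversion pattern are assumptions, with the pattern chosen to approximate the console format shown in Example 6; adjust the rootLogger line to reference your own appender.

log4j.rootLogger=INFO, FILE
log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=crawl.log
# Pattern approximating the console output: level, timestamp, elapsed ms, class, [thread], message
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%-5p %d %r %c [%t] %m%n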

Below is an example of a full crawl using the default polite configuration. For ease of reading, the timestamps and module names are truncated. The complete output includes cumulative crawl summaries per level, per host, and for overall performance, with such page information as how many pages were fetched, redirected, retried, gone (i.e., not available because of 404 errors or other reasons), and filtered.

Example 6. Example of running a full crawl

C:\Endeca\CAS\3.1.1>.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
INFO    2009-07-27 09:38:47,528 0       com.endeca.itl.web.Main [main]  Adding seed: http://www.endeca.com
INFO    2009-07-27 09:38:47,544 16      com.endeca.itl.web.Main [main]  Seed URLs: [http://www.endeca.com]
INFO    2009-07-27 09:38:49,606 2078    com.endeca.itl.web.db.CrawlDbFactory [main]  Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
INFO    2009-07-27 09:38:49,606 2078    com.endeca.itl.web.Crawler      [main] Using executor settings: numThreads = 100, maxThreadsPerHost=1
INFO    2009-07-27 09:38:50,841 3313    com.endeca.itl.web.Crawler      [main] Fetching seed URLs.
INFO    2009-07-27 09:38:51,622 4094    com.endeca.itl.web.Crawler      [main] Seeds complete.
INFO    2009-07-27 09:38:51,653 4125    com.endeca.itl.web.Crawler      [main] Starting crawler shut down, waiting for running threads to complete
INFO    2009-07-27 09:38:51,653 4125    com.endeca.itl.web.Crawler      [main] Progress: Level: Cumulative crawl summary (level)
INFO    2009-07-27 09:38:51,653 4125    com.endeca.itl.web.Crawler      [main] host-summary: www.endeca.com to depth 1
host    depth   completed       total   blocks
www.endeca.com  0       1       1       1
www.endeca.com  1       0       38      1
www.endeca.com  all     1       39      2

INFO    2009-07-27 09:38:51,653 4125    com.endeca.itl.web.Crawler      [main] host-summary: total crawled: 1 completed. 39 total.
INFO    2009-07-27 09:38:51,653 4125    com.endeca.itl.web.Crawler      [main] Shutting down CrawlDb
INFO    2009-07-27 09:38:51,700 4172    com.endeca.itl.web.Crawler      [main] Progress: Host: Cumulative crawl summary (host)
INFO    2009-07-27 09:38:51,715 4187    com.endeca.itl.web.Crawler      [main]
Host: www.endeca.com:  1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO    2009-07-27 09:38:51,715 4187    com.endeca.itl.web.Crawler      [main] Progress: Perf: All (cumulative) 2.0s. 0.5 Pages/s. 4.8 kB/s. 1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO    2009-07-27 09:38:51,715 4187    com.endeca.itl.web.Crawler      [main] Crawl complete.


