You run full crawls from the command line. A full crawl means that the crawler processes all the URLs in the seed (except for URLs that are excluded by filters). By default, a crawl history database is created in the workspace/state/web directory.
You can run multiple simultaneous crawls on the same machine. When running multiple crawls, each crawl must have its own workspace directory. All of the crawls can use the same configuration, or each can use a crawl-specific configuration.
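For example, the following two commands, run from separate shells, share one configuration but keep crawl-specific workspaces via the -w flag (described in the steps below). The workspace paths and seed file names here are hypothetical:

   .\bin\web-crawler -c conf\web\myconfig -d 2 -s siteA.lst -w ..\workspaces\crawlA
   .\bin\web-crawler -c conf\web\myconfig -d 2 -s siteB.lst -w ..\workspaces\crawlB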
Note
If you are using the default configuration, you must run Web crawls from the Web Crawler root directory (i.e., the CAS\version directory). To run crawls from other directories, you must change the plugin.folders configuration property so that it uses an absolute path (to the lib\plugins directory) instead of a relative path.
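As a sketch of that change, assuming the property lives in a Nutch-style XML configuration file and an install root of C:\Endeca\CAS\3.1.1 (the file format and install path are assumptions; only the plugin.folders property name comes from this section):

   <property>
     <name>plugin.folders</name>
     <!-- Absolute path to the plugins directory; the default is the relative path lib/plugins -->
     <value>C:\Endeca\CAS\3.1.1\lib\plugins</value>
   </property>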
To run a full crawl:

1. Navigate to the Web Crawler root directory. (Note that you can run the startup script from an external directory if you have set an absolute path in the plugin.folders configuration property.)
2. Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with at least the -d and -s flags. You can use the optional flags to customize the crawl, such as the -w flag to specify the workspace directory. For example:

   .\bin\web-crawler -c conf\web\myconfig -d 2 -s mysites.lst
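The -s flag accepts either a single seed URL (as in Example 6 below) or a seed file such as mysites.lst. As a sketch, a seed file might list one URL per line; that format is an illustrative assumption, not a specification from this section:

   http://www.example.com/
   http://www.example.com/products/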
If the crawl begins successfully, you see INFO progress messages. The crawl is finished when you see the Crawl complete message from the Web Crawler.
The output file is written to the output subdirectory in the workspace directory.
Note that by default, the console receives all messages. You can create a crawl log by either redirecting the output to a log file (such as >crawl.log) or specifying a file appender in the log4j.properties logging configuration file.
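As a minimal sketch of the file appender approach, using log4j 1.x properties syntax (the appender name FILE and the conversion pattern are illustrative assumptions):

   # Add FILE to the existing rootLogger line, for example:
   # log4j.rootLogger=INFO, stdout, FILE
   log4j.appender.FILE=org.apache.log4j.FileAppender
   log4j.appender.FILE.File=crawl.log
   log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
   # Pattern approximating the console messages shown in Example 6 below
   log4j.appender.FILE.layout.ConversionPattern=%p %d %r %c [%t] %m%n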
Below is an example of a full crawl using the default polite configuration. The complete output includes crawl summaries with such page information as how many pages were fetched, redirected, retried, gone (i.e., not available because of 404 errors or other reasons), and filtered.
Example 6. Example of running a full crawl
C:\Endeca\CAS\3.1.1>.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
INFO 2009-07-27 09:38:47,528 0 com.endeca.itl.web.Main [main] Adding seed: http://www.endeca.com
INFO 2009-07-27 09:38:47,544 16 com.endeca.itl.web.Main [main] Seed URLs: [http://www.endeca.com]
INFO 2009-07-27 09:38:49,606 2078 com.endeca.itl.web.db.CrawlDbFactory [main] Initialized crawldb: com.endeca.itl.web.db.BufferedDerbyCrawlDb
INFO 2009-07-27 09:38:49,606 2078 com.endeca.itl.web.Crawler [main] Using executor settings: numThreads = 100, maxThreadsPerHost=1
INFO 2009-07-27 09:38:50,841 3313 com.endeca.itl.web.Crawler [main] Fetching seed URLs.
INFO 2009-07-27 09:38:51,622 4094 com.endeca.itl.web.Crawler [main] Seeds complete.
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Starting crawler shut down, waiting for running threads to complete
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Progress: Level: Cumulative crawl summary (level)
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] host-summary: www.endeca.com to depth 1
host            depth  completed  total  blocks
www.endeca.com  0      1          1      1
www.endeca.com  1      0          38     1
www.endeca.com  all    1          39     2
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] host-summary: total crawled: 1 completed. 39 total.
INFO 2009-07-27 09:38:51,653 4125 com.endeca.itl.web.Crawler [main] Shutting down CrawlDb
INFO 2009-07-27 09:38:51,700 4172 com.endeca.itl.web.Crawler [main] Progress: Host: Cumulative crawl summary (host)
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Host: www.endeca.com: 1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Progress: Perf: All (cumulative) 2.0s. 0.5 Pages/s. 4.8 kB/s. 1 fetched. 0.0 mB. 1 records. 0 redirected. 0 retried. 0 gone. 19 filtered.
INFO 2009-07-27 09:38:51,715 4187 com.endeca.itl.web.Crawler [main] Crawl complete.