You can examine the configuration and operation of the Web Crawler by running the sample Web crawl, which is located in the CAS\workspace\conf\web-crawler\polite-crawl directory.

The sample configuration crawls the Web site at http://www.endeca.com, using a preconfigured seed file (endeca.lst) in the conf\web-crawler\default directory.

The sample crawl is configured to output the records as uncompressed XML. The XML format allows you to easily read the output file (with a text editor or the more command) to confirm that the crawl collected records. The site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.

To run the sample crawl:

When the crawl finishes, the Web Crawler displays the message "Crawl complete". The output file, polite-crawl.xml, is in the CAS\version\polite-crawl-workspace\output directory.
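
As an alternative to opening the output in a text editor, the following Python sketch tallies how often each element name occurs in an XML file. It is not part of the Web Crawler and makes no assumption about the element names the crawler writes; the function name count_elements is hypothetical, chosen for this illustration only.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_path):
    """Return a Counter mapping each element tag to its number of occurrences.

    Hypothetical helper for illustration; it works on any well-formed XML
    file and does not depend on the crawler's record schema.
    """
    counts = Counter()
    # iterparse streams the file, so a large output file is not built
    # into one complete in-memory tree before counting starts.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        # Drop the element's text, attributes, and children once counted.
        elem.clear()
    return counts
```

Calling count_elements with the path to polite-crawl.xml and printing the result shows which elements the file contains and how many of each. A nonzero count for the record-level element suggests that the crawl collected records, although which tag that is has to be read from the file itself.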

