You can examine the configuration and operation of the Web Crawler by running the sample Web crawl whose configuration is located in the CAS\workspace\conf\web-crawler\polite-crawl directory.
The sample configuration crawls the Endeca Web site (http://www.endeca.com) with a preconfigured seed file (endeca.lst) in the conf\web-crawler\default directory.
The Endeca sample crawl is configured to output the records as uncompressed XML. The XML format allows you to easily read the output file (with a text editor or the more command) to confirm that the crawl collected records. The site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.
To run the Endeca sample crawl:

.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com

If the crawl begins successfully, you see the INFO progress messages.
When the crawl finishes, the Web Crawler displays the message "Crawl complete." The output file, polite-crawl.xml, is in the CAS\version\polite-crawl-workspace\output directory.
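
Besides opening the output file in a text editor, you can confirm that the crawl collected records with a short script. The following Python sketch is not part of the Web Crawler. It makes no assumption about the element names in the output and simply tallies every element tag it finds in the XML file. The function name tally_elements and the default path are hypothetical, so adjust the path to match your installation.

```python
import sys
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def tally_elements(xml_path):
    """Count how often each element tag occurs in an XML file."""
    counts = Counter()
    # iterparse streams the file, so a large crawl output need not fit in memory.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # drop the element's children and text once it is counted
    return counts


if __name__ == "__main__":
    # Assumed default location, relative to the directory the crawl was run from;
    # pass a different path as the first argument if your output is elsewhere.
    default = Path("polite-crawl-workspace") / "output" / "polite-crawl.xml"
    output = Path(sys.argv[1]) if len(sys.argv) > 1 else default
    if output.exists():
        for tag, count in tally_elements(output).most_common():
            print(f"{tag}: {count}")
    else:
        print(f"No output file found at {output}")
```

If the tally shows only a single root element, or the script reports that the file is missing, the crawl probably did not write any records.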