Running the Endeca sample Web crawl

You can examine the configuration and operation of the Web Crawler by running a sample Web crawl located in the CAS\workspace\conf\web-crawler\polite-crawl directory.

The sample configuration crawls the Endeca Web site (http://www.endeca.com) with a preconfigured seed file (endeca.lst) in the conf\web-crawler\default directory.

The Endeca sample crawl is configured to output the records as uncompressed XML. The XML format makes the output file easy to read (with a text editor or the more command), so you can confirm that the crawl collected records. The sample configuration's site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.

To run the Endeca sample crawl:

  1. Open a command prompt.
  2. Navigate to the CAS root directory. For example, in a default installation on Windows, this is C:\Endeca\CAS\version.
  3. Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with the following flags. Be sure to specify 0 (zero) for the -d flag to crawl only the root of the site, as shown in this example on a Windows machine (enter the command on a single line):
    .\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
    If the crawl begins successfully, you see the INFO progress messages.
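
If you want to launch the same crawl from a script on either platform, the command in step 3 can be assembled programmatically. The following is a minimal sketch, assuming Python 3 is available; the function names build_crawl_command and run_sample_crawl are hypothetical helpers and are not part of CAS. The flags and paths are taken from the example above, and cas_root is assumed to be the absolute path of the CAS root directory from step 2.

```python
import ntpath
import os
import posixpath
import subprocess


def build_crawl_command(cas_root, windows=None):
    """Return the argument list for the sample crawl shown in step 3.

    cas_root is the absolute CAS root directory (for example, the
    C:\\Endeca\\CAS\\version directory of a default Windows installation).
    """
    if windows is None:
        windows = os.name == "nt"
    # Pick the path flavor explicitly so the result matches the target platform.
    path = ntpath if windows else posixpath
    script_name = "web-crawler.bat" if windows else "web-crawler.sh"
    script = path.join(cas_root, "bin", script_name)
    # The sample configuration sits in the workspace directory beside the CAS root.
    config_dir = path.join(
        cas_root, "..", "workspace", "conf", "web-crawler", "polite-crawl"
    )
    # -d 0 limits the crawl to the root of the site; -s supplies the seed URL.
    return [script, "-c", config_dir, "-d", "0", "-s", "http://www.endeca.com"]


def run_sample_crawl(cas_root):
    """Run the crawl from the CAS root and raise if the script exits with an error."""
    return subprocess.run(build_crawl_command(cas_root), cwd=cas_root, check=True)
```

Calling build_crawl_command(cas_root) returns the argument list without running anything, which is useful for checking the paths before you call run_sample_crawl. On UNIX, this sketch assumes that web-crawler.sh has execute permission.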

When the crawl finishes, the Web Crawler displays the message: Crawl complete. The output file, named polite-crawl.xml, is written to the CAS\version\polite-crawl-workspace\output directory.
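
Besides opening polite-crawl.xml in a text editor, you can get a quick summary of what the crawl wrote by counting the XML elements in the file. The following is a minimal sketch, assuming Python 3; count_elements is a hypothetical helper, not a CAS utility. It makes no assumption about the element names in the record format and reports whatever tags it finds.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_path):
    """Return a Counter mapping each element tag in the file to its count."""
    counts = Counter()
    # iterparse reads the file incrementally instead of loading it all at once.
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        # Discard the element's content once it has been counted.
        elem.clear()
    return counts
```

For example, count_elements(r"polite-crawl-workspace\output\polite-crawl.xml").most_common() lists the tags from most to least frequent. A file that contains only a root element and no child elements suggests that the crawl collected no records.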