You can examine the configuration and operation of the Web Crawler by running the sample Web crawl located in the CAS\workspace\conf\web-crawler\polite-crawl directory.
The sample configuration crawls the Web site http://www.endeca.com with a preconfigured seed file (endeca.lst) in the conf\web-crawler\default directory.
The sample crawl is configured to output the records as uncompressed XML. The XML format allows you to easily read the output file (with a text editor or the more command) to confirm that the crawl collected records. The site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.
Navigate to the CAS root directory.
For example, in a default installation on Windows, this is C:\Endeca\CAS\version.
Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with the following flags. Be sure to specify 0 (zero) for the -d flag to crawl only the root of the site, as shown in this example on a Windows machine:
.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
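The command above can also be assembled programmatically, which may help when the same sample crawl is launched from both Windows and UNIX hosts. The following is a minimal sketch, not part of the product: the function name build_crawl_command is hypothetical, and the UNIX form assumes that the directory layout mirrors the Windows example with forward slashes. Only the -c, -d, and -s flags shown above are used.

```python
import ntpath
import posixpath


def build_crawl_command(windows):
    """Return the sample-crawl argument list, relative to the CAS root directory."""
    # ntpath/posixpath build paths with the target platform's separator,
    # regardless of the platform this script itself runs on.
    path = ntpath if windows else posixpath
    script = "web-crawler.bat" if windows else "web-crawler.sh"
    return [
        path.join(".", "bin", script),
        # -c: configuration directory of the sample crawl
        "-c", path.join("..", "workspace", "conf", "web-crawler", "polite-crawl"),
        # -d 0: crawl only the root of the site
        "-d", "0",
        # -s: seed URL
        "-s", "http://www.endeca.com",
    ]


if __name__ == "__main__":
    print(" ".join(build_crawl_command(windows=True)))
    print(" ".join(build_crawl_command(windows=False)))
```

The returned list can be passed to subprocess.run with the CAS root directory as the working directory. The sketch spells out the .bat extension, whereas the Windows example above omits it.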
If the crawl begins successfully, you see the INFO progress messages.
When finished, the Web Crawler displays: Crawl complete.
The output file, named polite-crawl.xml, is in the CAS\version\polite-crawl-workspace\output directory.
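Besides opening the output file in a text editor, you can confirm that the crawl collected records with a short script. The following is a minimal sketch under one stated assumption: each record in polite-crawl.xml is an element named RECORD. That element name is not confirmed by this section, so check it against your own output file and pass a different record_tag if needed. The function name count_records is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_records(xml_source, record_tag="RECORD"):
    """Count elements named record_tag in a crawl output file or file-like object."""
    count = 0
    # iterparse streams the document, so a large output file
    # does not have to be loaded into memory all at once.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            count += 1
            elem.clear()  # drop the record's children once it has been counted
    return count
```

For example, calling count_records with the path to polite-crawl.xml in the output directory returns the number of matching elements. A result of 0 suggests either that the crawl collected nothing or that the assumed element name does not match the file.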

