You can examine the configuration and operation of the Web Crawler by running the sample Web crawl located in the CAS\workspace\conf\web-crawler\polite-crawl directory.
The sample configuration crawls the http://www.endeca.com Web site using a preconfigured seed file (endeca.lst) located in the conf\web-crawler\default directory.
The sample crawl is configured to output the records as uncompressed XML. The XML format lets you read the output file easily (with a text editor or the more command) to confirm that the crawl collected records. The site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.
Navigate to the CAS root directory.
For example, in a default installation on Windows, this is C:\Endeca\CAS\version, where version is the version of your CAS installation.
Run the web-crawler.bat (for Windows) or web-crawler.sh (for UNIX) script with the following flags. Be sure to specify 0 (zero) for the -d flag to crawl only the root of the site, as shown in this example on a Windows machine:
.\bin\web-crawler -c ..\workspace\conf\web-crawler\polite-crawl -d 0 -s http://www.endeca.com
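The same invocation can be scripted. The following is a minimal Python sketch, not part of the product: the function names build_crawl_command and run_sample_crawl are hypothetical, and the UNIX paths assume that the UNIX installation mirrors the Windows layout shown in the example. Only the -c, -d, and -s flags from the example are used.

```python
import os
import subprocess


def build_crawl_command(windows, seed="http://www.endeca.com", depth=0):
    """Return the argument list for the sample polite crawl.

    All paths are relative to the CAS root directory, as in the
    Windows example in the text.
    """
    if windows:
        script = r".\bin\web-crawler.bat"
        config = r"..\workspace\conf\web-crawler\polite-crawl"
    else:
        # Assumption: the UNIX directory layout mirrors the Windows one.
        script = "./bin/web-crawler.sh"
        config = "../workspace/conf/web-crawler/polite-crawl"
    # -d 0 limits the crawl to the root of the site.
    return [script, "-c", config, "-d", str(depth), "-s", seed]


def run_sample_crawl(cas_root):
    """Run the sample crawl with cas_root as the working directory."""
    command = build_crawl_command(windows=(os.name == "nt"))
    # check=True raises CalledProcessError if the script exits nonzero.
    subprocess.run(command, cwd=cas_root, check=True)
```

To use it, call run_sample_crawl with the path to your own CAS root directory. The crawler's progress messages appear on the console as they would when the script is run by hand.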
If the crawl begins successfully, you see the INFO progress messages.
When finished, the Web Crawler displays: Crawl complete. The output file, named polite-crawl.xml, is in the CAS\version\polite-crawl-workspace\output directory.
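As an alternative to opening the file in a text editor, a short script can confirm that the output file exists, is well-formed XML, and is not empty. This is a rough Python sketch: the function name summarize_output is hypothetical, and because the internal layout of polite-crawl.xml is not described here, the sketch assumes only that each record is a direct child of the root element.

```python
import os
import xml.etree.ElementTree as ET


def summarize_output(path):
    """Return (size_in_bytes, child_count) for an XML output file.

    child_count is the number of direct children of the root element.
    Assumption: each crawled record is a direct child of the root.
    """
    size = os.path.getsize(path)
    # ET.parse raises ParseError if the file is not well-formed XML.
    root = ET.parse(path).getroot()
    return size, len(root)
```

Pass the full path to polite-crawl.xml in your own installation. A child count of zero suggests that the crawl did not collect any records, or that the assumption about the file layout does not hold.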