Running a sample Web crawl of oracle.com

You can examine the configuration and operation of the Web Crawler by running a sample Web crawl. The sample is located in the <install path>\IAS\workspace\conf\web-crawler\polite-crawl directory.

The sample crawls http://www.oracle.com using the pre-configured seed file (endeca.lst) located in the <install path>\IAS\workspace\conf\web-crawler\default directory.

The sample crawl is configured to write the records as uncompressed XML, which makes the output file easy to read when you want to confirm that the crawl collected records. The sample's site.xml file also specifies polite-crawl-workspace as the name of the workspace directory.

To run the sample crawl:

  1. Open a command prompt window.
  2. Change to the <install path>\IAS\<version>\bin directory.
  3. Run the web-crawler script with the -d flag set to 0 to crawl only the root of the site.
    Here is a Windows example (enter the command on a single line):
    web-crawler -c C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\polite-crawl
    -d 0 -s C:\Oracle\Endeca\IAS\workspace\conf\web-crawler\default\endeca.lst
    If the crawl starts successfully, INFO progress messages appear.
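
The command in step 3 combines three pieces: the configuration directory (-c), the crawl depth (-d), and the seed file (-s). As a minimal sketch, the following Python helper assembles that same command line from an install path. The function name build_crawl_command is hypothetical and is not part of the Web Crawler; the sketch only builds and prints the command, it does not run the crawler.

```python
import ntpath  # Windows path rules, regardless of the OS running this sketch


def build_crawl_command(install_path, depth=0):
    """Return the web-crawler argument list for the polite-crawl sample.

    Hypothetical helper: it only mirrors the -c, -d and -s flags used in
    the Windows example and makes no other assumptions about the script.
    """
    conf_root = ntpath.join(install_path, "IAS", "workspace", "conf", "web-crawler")
    return [
        "web-crawler",
        "-c", ntpath.join(conf_root, "polite-crawl"),  # sample configuration directory
        "-d", str(depth),  # 0 = crawl only the root of the site
        "-s", ntpath.join(conf_root, "default", "endeca.lst"),  # seed file
    ]


if __name__ == "__main__":
    # C:\Oracle\Endeca is the install path used in the Windows example.
    print(" ".join(build_crawl_command(r"C:\Oracle\Endeca")))
```

Printing the joined list gives a single-line command that can be pasted into the command prompt opened in step 1.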

When the crawl finishes, the Web Crawler displays the message "Crawl complete." The output file, polite-crawl.xml, is in the <install path>\IAS\<version>\bin\polite-crawl-workspace\output directory.
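
Because the output is uncompressed XML, one quick way to confirm that the crawl collected records is to tally the element names in polite-crawl.xml. The Python sketch below does this without assuming anything about the record schema: it simply counts how often each element name occurs. The function name count_elements is hypothetical, and the script is an illustration, not a tool shipped with the Web Crawler.

```python
import sys
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(xml_path):
    """Stream through an XML file and tally how often each element name occurs."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # release the element's contents to keep memory use low
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to polite-crawl.xml as the first argument.
    for tag, n in count_elements(sys.argv[1]).most_common():
        print(f"{n:8d}  {tag}")
```

If the crawl collected records, the tally shows a non-zero count for the element that represents a record in the output file. An output file that contains only the root element suggests that no records were written.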