Web Crawler Overview

The Web Crawler is installed by default as part of the CAS installation. The Endeca Web Crawler gathers source data by crawling HTTP and HTTPS Web sites and writes the data in a format that is ready for Forge processing (XML or binary).
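The actual on-disk schema of the crawler's output is determined by its output configuration; the following is only a rough sketch of writing a crawled page as a simplified Endeca-style XML record. The PROP/PVAL element names and the property names follow the general Endeca record convention but are assumptions for this illustration:

```python
# Sketch: write one crawled page as a simplified Endeca-style XML record.
# Element and property names here are illustrative assumptions; the real
# schema is defined by the Web Crawler's output configuration.
import xml.etree.ElementTree as ET

def page_to_record_xml(url, mime_type, text):
    records = ET.Element("RECORDS")
    record = ET.SubElement(records, "RECORD")
    for name, value in [
        ("Endeca.Web.URL", url),                  # source URL of the page
        ("Endeca.Document.MimeType", mime_type),  # content type of the page
        ("Endeca.Document.Text", text),           # extracted document text
    ]:
        prop = ET.SubElement(record, "PROP", NAME=name)
        ET.SubElement(prop, "PVAL").text = value
    return ET.tostring(records, encoding="unicode")

xml_out = page_to_record_xml("https://example.com/", "text/html", "Example page body")
print(xml_out)
```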

In addition to retrieving and converting the source documents, the Web Crawler tags the resulting Endeca records with metadata properties derived from the source documents.

After the Web Crawler writes the Endeca records, you can configure an Endeca record adapter (in Developer Studio) to read the records into your Endeca pipeline, where Forge processes them and where you can add or modify record properties. The property mapper in the pipeline can then map these property values to Endeca dimensions or properties. For details, see "Creating a Pipeline to read Endeca records" in the Endeca CAS Developer's Guide.
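Conceptually, the property mapper applies a source-property-to-target mapping of the following kind. This is a minimal sketch only: in practice the mapping is configured in Developer Studio rather than written as code, and the target names used here are hypothetical:

```python
# Sketch: map source record properties to Endeca dimension/property names.
# In the real pipeline this mapping is configured in Developer Studio;
# the target names ("page.url", "doc.type") are hypothetical.
PROPERTY_MAP = {
    "Endeca.Web.URL": "page.url",           # mapped to an Endeca property
    "Endeca.Document.MimeType": "doc.type", # mapped to an Endeca dimension
}

def map_properties(record):
    """Return a new record keeping only mapped properties, renamed."""
    return {PROPERTY_MAP[k]: v for k, v in record.items() if k in PROPERTY_MAP}

mapped = map_properties({
    "Endeca.Web.URL": "https://example.com/",
    "Endeca.Document.MimeType": "text/html",
    "Endeca.Document.Text": "body text",  # dropped: no mapping defined for it
})
print(mapped)
```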

You can then build an Endeca application to access the records and allow your application users to search and navigate the document contents contained in the records.

The Endeca Web Crawler is intended for large-scale crawling and is designed with a highly modular architecture that allows developers to create their own plug-ins.

Note that the current version of the Endeca Web Crawler does not support incremental crawls or crawling FTP sites.
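The kind of pluggable design described above can be sketched generically as a registry of interchangeable components. This is an illustration of the pattern only, not the Web Crawler's actual plug-in API:

```python
# Sketch: a minimal plug-in registry illustrating a modular, pluggable
# design. Generic illustration only; not the Web Crawler's actual API.
class ParserPlugin:
    """Base class: turn raw fetched bytes into extracted text."""
    content_type = None

    def parse(self, data: bytes) -> str:
        raise NotImplementedError

class HtmlParser(ParserPlugin):
    content_type = "text/html"

    def parse(self, data: bytes) -> str:
        return data.decode("utf-8")  # a real parser would strip markup

# Register plug-ins by the content type they handle.
PLUGINS = {p.content_type: p() for p in (HtmlParser,)}

def extract_text(content_type: str, data: bytes) -> str:
    return PLUGINS[content_type].parse(data)

print(extract_text("text/html", b"hello"))
```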

SSL Support

You can configure the Endeca Web Crawler to read from and write to an SSL-enabled Record Store instance. For details, see the "SSL Configuration" chapter of the Endeca CAS Developer's Guide.
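The keystore locations, ports, and other SSL settings are covered in that chapter. As a generic illustration only, a client talking to an SSL-enabled service typically builds a TLS context along these lines (the optional file paths are placeholders, not CAS configuration):

```python
# Sketch: build a TLS client context with optional mutual authentication,
# the general kind of setup an SSL-enabled client requires. Generic
# illustration; real CAS SSL configuration uses its own keystore files.
import ssl

def make_client_context(ca_file=None, cert_file=None, key_file=None):
    # Verify the server's certificate against the given (or system) CAs.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    if cert_file:
        # Present a client certificate for mutual (two-way) SSL.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx

ctx = make_client_context()  # server-auth-only context with system CAs
print(ctx.verify_mode == ssl.CERT_REQUIRED)
```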