Web Crawler overview

The Endeca Web Crawler is installed by default as part of the IAS installation. The Web Crawler gathers source data by crawling HTTP and HTTPS Web sites and writes the data in a format that is accessible to Endeca Information Discovery Integrator (either XML or a Record Store instance).

After the Web Crawler writes the Endeca records, you can configure an Endeca Record Store Reader component (in Integrator ETL) to read the records from a Record Store instance into an Integrator ETL graph. This is the recommended integration model.

Although you can process XML records in an Integrator ETL graph, this model requires more configuration to create XML mappings using the XMLExtract component. XML output is typically used as a convenient format to examine the records after a Web crawl.
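For example, after a crawl with XML output enabled, a quick sanity check is to load the output file and list the records and property names it contains. The following is a minimal sketch, assuming the output is a single XML document with one element per record and one child element per property; the file name, element names ("RECORD", "PROP"), and attribute names shown here are illustrative placeholders, not the actual output schema documented in the Web Crawler Guide.

    import java.io.File;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /**
     * Lists the records found in a Web Crawler XML output file.
     * The element and attribute names used here are placeholders;
     * substitute the names used by your crawl's actual output.
     */
    public class CrawlOutputInspector {
        public static void main(String[] args) throws Exception {
            File output = new File(args.length > 0 ? args[0] : "crawl-output.xml");
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(output);

            NodeList records = doc.getElementsByTagName("RECORD");
            System.out.println("Records found: " + records.getLength());

            // Print the property names of the first record to see what the crawl produced.
            if (records.getLength() > 0) {
                Element first = (Element) records.item(0);
                NodeList props = first.getElementsByTagName("PROP");
                for (int i = 0; i < props.getLength(); i++) {
                    Element prop = (Element) props.item(i);
                    System.out.println("  " + prop.getAttribute("NAME"));
                }
            }
        }
    }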

In addition to crawling and converting the source documents, the Web Crawler tags the resulting Endeca records with metadata properties derived from those documents.
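One common use of this metadata is to break a crawl down by the content type the crawler recorded for each source document, which quickly exposes pages that failed to convert. The sketch below reuses the parsed Document from the previous example; the property name "Endeca.Document.MimeType" is an assumption used for illustration only, so check the Web Crawler Guide for the property names your version actually emits.

    import java.util.Map;
    import java.util.TreeMap;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    /**
     * Counts records per MIME type in a parsed Web Crawler XML output document.
     * "Endeca.Document.MimeType" is an assumed property name for illustration.
     */
    public class MimeTypeBreakdown {
        static Map<String, Integer> countByMimeType(Document doc) {
            Map<String, Integer> counts = new TreeMap<>();
            NodeList records = doc.getElementsByTagName("RECORD");
            for (int r = 0; r < records.getLength(); r++) {
                NodeList props = ((Element) records.item(r)).getElementsByTagName("PROP");
                for (int p = 0; p < props.getLength(); p++) {
                    Element prop = (Element) props.item(p);
                    if ("Endeca.Document.MimeType".equals(prop.getAttribute("NAME"))) {
                        String mime = prop.getTextContent().trim();
                        counts.merge(mime, 1, Integer::sum);
                    }
                }
            }
            return counts;
        }
    }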

The Endeca Web Crawler performs full crawls, in which it fetches all of the URLs reachable from the crawl configuration. Note that the current version of the Endeca Web Crawler does not support incremental crawls or crawling FTP sites.

Plugin Support

The Endeca Web Crawler is intended for large-scale crawling and is designed with a highly modular architecture that allows developers to create their own plugins. Plugins provide a means to extract additional content, such as HTML meta tags, from Web pages.
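The actual plugin extension points are described in the Endeca Web Crawler Guide; the sketch below only illustrates the general shape of a parse-filter-style plugin that copies selected HTML meta tags onto the output record. The interface name, method signature, and property naming convention here are hypothetical placeholders defined solely so the example compiles, not the real Web Crawler plugin API.

    import java.util.Map;

    /**
     * Hypothetical parse-filter plugin interface, declared here only so the
     * sketch is self-contained; the real Web Crawler plugin API differs.
     */
    interface ParseFilter {
        /** Called once per fetched page, after the built-in HTML parser has run. */
        void filter(Map<String, String> htmlMetaTags, Map<String, String> outputProperties);
    }

    /**
     * Example plugin: copies selected HTML meta tags (description, keywords)
     * onto the record as additional properties.
     */
    public class MetaTagFilter implements ParseFilter {
        @Override
        public void filter(Map<String, String> htmlMetaTags, Map<String, String> outputProperties) {
            for (Map.Entry<String, String> tag : htmlMetaTags.entrySet()) {
                String name = tag.getKey().toLowerCase();
                if (name.equals("description") || name.equals("keywords")) {
                    // The "Meta." prefix is an illustrative naming convention only.
                    outputProperties.put("Meta." + name, tag.getValue());
                }
            }
        }
    }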

SSL Support

You can configure the Endeca Web Crawler to read from and write to an SSL-enabled Record Store instance. For details, see the "Configuring SSL in the Integrator Acquisition System" chapter of the Security Guide for Integrator.