The Oracle Commerce Web Crawler is installed by default as part of the CAS installation. It gathers source data by crawling HTTP and HTTPS Web sites and writes the data in a format that is ready for Forge processing (XML or binary).
In addition to retrieving and converting the source documents, the Web Crawler tags the resulting Endeca records with metadata properties derived from those documents.
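The exact set of properties depends on your crawl configuration. When the output format is XML, the records typically follow the standard Endeca record file structure (RECORD elements containing named PROP/PVAL pairs). The fragment below is a minimal sketch of one crawled record; the property names shown are illustrative and may differ from those in your output:

<RECORDS>
  <RECORD>
    <!-- Property names below are illustrative; your crawl output may use different names. -->
    <PROP NAME="Endeca.Title">
      <PVAL>Getting Started with CAS</PVAL>
    </PROP>
    <PROP NAME="Endeca.Web.URL">
      <PVAL>http://www.example.com/docs/getting-started.html</PVAL>
    </PROP>
    <PROP NAME="Endeca.Document.Text">
      <PVAL>The Web Crawler gathers source data by crawling HTTP and HTTPS Web sites...</PVAL>
    </PROP>
  </RECORD>
</RECORDS>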
After the Web Crawler writes the Endeca records, you can configure an Endeca record adapter (in Developer Studio) to read the records into your Oracle Commerce pipeline. As Forge processes the records, you can add or modify record properties, and the property mapper in the pipeline can then map these property values to MDEX dimensions or properties. For details, see "Creating a Pipeline to read Endeca records" in the Oracle Commerce CAS Developer's Guide.
You can then build an Oracle Commerce application that accesses the records and allows your users to search and navigate the document contents they contain.
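How your application retrieves the records depends on which Oracle Commerce APIs you use. As a minimal sketch, the following Java fragment assumes the classic Presentation API (com.endeca.navigation) and queries a running MDEX Engine for crawled records; the host, port, search key ("All"), and property names are assumptions and should be replaced with values from your own deployment and property mapper configuration.

import com.endeca.navigation.ENEQueryResults;
import com.endeca.navigation.ERec;
import com.endeca.navigation.HttpENEConnection;
import com.endeca.navigation.Navigation;
import com.endeca.navigation.PropertyMap;
import com.endeca.navigation.UrlENEQuery;

public class CrawledRecordSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical MDEX Engine host and port; use the values for your deployment.
        HttpENEConnection connection = new HttpENEConnection("localhost", "15000");

        // N=0 requests the root navigation state; Ntk/Ntt add a record search.
        // The search key "All" is an assumption -- use a key defined in your index.
        UrlENEQuery query = new UrlENEQuery("N=0&Ntk=All&Ntt=crawler", "UTF-8");
        ENEQueryResults results = connection.query(query);

        if (results.containsNavigation()) {
            Navigation navigation = results.getNavigation();
            for (Object o : navigation.getERecs()) {
                ERec rec = (ERec) o;
                PropertyMap properties = rec.getProperties();
                // Property names depend on how your property mapper maps the
                // crawler-generated metadata; these two are illustrative only.
                System.out.println(properties.get("Endeca.Title")
                        + " -> " + properties.get("Endeca.Web.URL"));
            }
        }
    }
}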
The Oracle Commerce Web Crawler is intended for large-scale crawling and is designed with a highly modular architecture that allows developers to create their own plugins. The Web Crawler supports both full and resumable crawls.
Note that the current version of the Oracle Commerce Web Crawler does not support incremental crawls or crawling FTP sites.