Developer Studio exposes Endeca Crawler functionality using the following components, which form the core of an Endeca Crawler pipeline:
- A spider component — Crawls documents starting at the root URLs you specify. In the spider component, you configure the root URLs at which a crawl begins, URL filters that determine which documents are crawled, and other settings that control how the crawl proceeds, such as timeout values and proxy server settings. The spider crawls the URLs and manages a URL queue that feeds the record adapter.
- A record adapter configured to read documents — Receives URLs from the spider and creates an Endeca record for each document located at a URL. Each record contains a number of properties, one of which is the record’s identifying URL. A downstream record manipulator uses the record identifier to retrieve the document and extract its data.
Unlike basic pipelines (which use a record adapter to read source data in a variety of formats), an Endeca Crawler pipeline uses a record adapter to read the URLs provided by the spider. In a basic pipeline, the format type of the record adapter matches the source data: for example, delimited, XML, fixed-width, or ODBC. In an Endeca Crawler pipeline, the format type of the record adapter must be set to Document.
- A record manipulator incorporating expressions to handle documents — Contains several Data Foundry expressions that support crawling and document processing tasks. At a minimum, a record manipulator contains one expression to retrieve the document located at the record's identifying URL and a second expression to extract the document’s content and convert it to text. In addition, you can include optional expressions to identify the language of a document, remove temporary properties after processing is complete, or perform a variety of other processing tasks. A sketch of how these components fit together in a pipeline definition follows this list.
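To make the division of labor concrete, the fragment below is a minimal sketch of how these three components might be wired together in a crawler pipeline definition. It is illustrative only: the element names, attribute names, and expression names shown here (for example, SPIDER, RECORD_ADAPTER, RETRIEVE_URL, and CONVERTTOTEXT) are assumptions made for the purpose of the sketch, not the exact schema that Developer Studio generates, so rely on Developer Studio and the pipeline file it produces for the authoritative syntax.

```xml
<!-- Illustrative sketch only: element, attribute, and expression names are
     assumptions, not the exact schema that Developer Studio generates. -->

<!-- Spider: root URLs, URL filters, and crawl settings; manages the URL queue
     that feeds the record adapter. -->
<SPIDER NAME="WebCrawler">
  <ROOT_URL URL="http://www.example.com/docs/"/>
  <URL_FILTER TYPE="INCLUDE" VALUE="http://www.example.com/*"/>
  <TIMEOUT SECONDS="30"/>
  <PROXY HOST="proxy.example.com" PORT="8080"/>
</SPIDER>

<!-- Record adapter: reads the URLs supplied by the spider and creates one
     Endeca record per document; the format type is set to Document. -->
<RECORD_ADAPTER NAME="LoadDocuments" FORMAT="DOCUMENT" URL_SOURCE="WebCrawler"/>

<!-- Record manipulator: at minimum, retrieves each document at the record's
     identifying URL and converts its content to text; the remaining
     expressions are optional. -->
<RECORD_MANIPULATOR NAME="ProcessDocuments" RECORD_SOURCE="LoadDocuments">
  <EXPRESSION NAME="RETRIEVE_URL"/>    <!-- fetch the document at the record's URL -->
  <EXPRESSION NAME="CONVERTTOTEXT"/>   <!-- extract and convert content to text -->
  <EXPRESSION NAME="ID_LANGUAGE"/>     <!-- optional: identify the document language -->
  <EXPRESSION NAME="REMOVE_PROPS"/>    <!-- optional: remove temporary properties -->
</RECORD_MANIPULATOR>
```

In practice you create and edit these components in Developer Studio rather than by hand; the sketch simply mirrors the responsibilities described in the list above.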