Spider document processing

The spider component is the core of an Endeca Crawler pipeline. Working in conjunction with a record adapter and a record manipulator, the spider forms a document-processing loop whose function is to get documents into a pipeline.

Within this loop, the spider's primary functions are to crawl URLs, filter URLs, send URLs to the record adapter, and manage the URL queue until all source documents are processed.
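The queue-and-filter behavior described above can be pictured with a minimal Python sketch. It is only an illustration, not Endeca code: the UrlQueue class, its method names, and the example.com URLs are hypothetical, and skipping URLs that were already queued is an assumption about how the queue is managed.

```python
from collections import deque
from typing import Callable, Optional


class UrlQueue:
    """Hypothetical stand-in for the spider's URL queue (not an Endeca API)."""

    def __init__(self, url_filter: Callable[[str], bool]) -> None:
        self._filter = url_filter   # decides which URLs may be crawled
        self._pending = deque()     # URLs waiting to be crawled, in FIFO order
        self._seen = set()          # assumption: each URL is queued at most once

    def enqueue(self, url: str) -> bool:
        """Queue a URL if it passes the filter and has not been seen before."""
        if url in self._seen or not self._filter(url):
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next_url(self) -> Optional[str]:
        """Return the next URL to send to the record adapter, or None when done."""
        return self._pending.popleft() if self._pending else None


# Only URLs under the (hypothetical) root site pass the filter.
queue = UrlQueue(lambda url: url.startswith("http://example.com/"))
queue.enqueue("http://example.com/")
queue.enqueue("http://example.com/")        # duplicate, ignored
queue.enqueue("http://other.example.org/")  # rejected by the filter
```

In this sketch the crawl is finished when next_url() returns None, which corresponds to the point where all source documents have been processed.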

In the Spider editor, you can indicate the URLs to crawl, create URL filters to determine which documents to crawl, and specify timeout, proxy, and other configuration information that controls how the crawl proceeds.

Once configured and run, the spider processes documents in a crawler pipeline in a loop, as described in the steps below. These steps focus only on the spider's document processing loop, not the larger URL and record processing loop:
  1. For the first loop of source document processing, the spider crawls the root URL indicated on the Root URLs tab of the Spider editor.
  2. Based on the root URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.
  3. The newly created record then flows down to the record manipulator, where the following takes place:
    1. The document associated with the URL is fetched (using the RETRIEVE_URL expression) and stored in Endeca.Document.Body.
    2. Content (searchable text) is extracted from Endeca.Document.Body (using the CONVERTTOTEXT or PARSE_DOC expression) and stored in Endeca.Document.Text.
    3. Any URLs in Endeca.Document.Body are extracted for additional crawling and are stored in Endeca.Relation.References by default.
  4. The record based on the root URL moves downstream to the spider, where additional URLs (those extracted from the root document and stored in Endeca.Relation.References) are queued for crawling.
  5. The spider crawls URLs from the record as indicated in the Endeca.Relation.References properties. This is the next loop of source document processing.
  6. Based on the queued URL that the spider crawls, the record adapter creates a record containing the URL, indicated by Endeca.Identifier, and a limited set of metadata properties.
  7. Steps 3 through 6 repeat until the spider processes all URLs and the record adapter creates corresponding records.
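
The steps above can be simulated with a short Python sketch. It is a simplified model, not Endeca code: the function names, the in-memory PAGES dictionary, and the example.com URLs are hypothetical, the regular expressions merely stand in for the RETRIEVE_URL and text-extraction expressions, and skipping already-seen URLs is an assumption. Only the property names (Endeca.Identifier, Endeca.Document.Body, Endeca.Document.Text, Endeca.Relation.References) come from the steps themselves.

```python
import re
from collections import deque

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> '
                           '<a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": "No links here.",
}

HREF = re.compile(r'href="([^"]+)"')  # naive link extraction
TAG = re.compile(r"<[^>]+>")          # naive markup stripping


def record_adapter(url):
    """Steps 2 and 6: create a record that identifies the URL."""
    return {"Endeca.Identifier": url}


def record_manipulator(record):
    """Step 3: fetch the document, extract its text, and collect its links."""
    body = PAGES.get(record["Endeca.Identifier"], "")  # stands in for RETRIEVE_URL
    record["Endeca.Document.Body"] = body
    record["Endeca.Document.Text"] = TAG.sub("", body).strip()
    record["Endeca.Relation.References"] = HREF.findall(body)
    return record


def crawl(root_url):
    """Run the loop until the queue is empty and return the processed records."""
    queue = deque([root_url])  # step 1: start from the root URL
    seen = {root_url}          # assumption: each URL is crawled only once
    records = []
    while queue:               # step 7: repeat until no URLs remain
        record = record_manipulator(record_adapter(queue.popleft()))
        records.append(record)
        # Steps 4 and 5: queue newly discovered URLs for the next loop.
        for url in record["Endeca.Relation.References"]:
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return records


records = crawl("http://example.com/")
print([r["Endeca.Identifier"] for r in records])
```

With the three sample pages, the sketch produces one record per URL, in the order the URLs were queued. A real pipeline splits this work across separate components connected by record flow; the single crawl function here only makes the order of operations easy to follow.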