Adding a spider

A spider component crawls documents rather than loading records from a file, adding document crawling capabilities to your pipeline.

The spider handles URL management and content manipulation. Using spiders, you can configure a pipeline that crawls the network it is directed at and processes the documents it finds there using the Endeca Information Transformation Layer (ITL).

Note: The Endeca Crawler, which uses the spider component, has been deprecated and support for it will be removed in a future version. It is recommended that you use the Endeca Web Crawler for Web crawls and the Endeca CAS Server for file system crawls.

To be able to crawl, a spider component depends on at least one other pipeline component. Upstream, there must be a record adapter with a Format of "document" that references the spider as its URL source. The two components work together to crawl a content repository (such as a Web server, file server, or FTP server). Other pipeline components, such as a record manipulator, can be placed between the record adapter and the spider.
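
Developer Studio generates the underlying pipeline configuration for you, but the following sketch illustrates how the components relate. The element and attribute names here are hypothetical and only show the data flow; they are not the actual Pipeline.epx markup:

  <!-- Illustrative sketch only; element and attribute names are hypothetical -->
  <PIPELINE>
    <!-- Record adapter with a Format of "document"; names the spider as its URL source -->
    <RECORD_ADAPTER NAME="LoadDocuments" FORMAT="document" URL_SOURCE="MySpider"/>

    <!-- Optional component between the record adapter and the spider -->
    <RECORD_MANIPULATOR NAME="CleanDocuments" RECORD_SOURCE="LoadDocuments"/>

    <!-- Spider; its record source is the last component before it in the chain -->
    <SPIDER NAME="MySpider" RECORD_SOURCE="CleanDocuments">
      <ROOT_URL>http://www.example.com/docs/</ROOT_URL>
    </SPIDER>
  </PIPELINE>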

To add a spider component:

  1. In the Pipeline Diagram editor, click New, and then choose Spider. The Spider editor appears.
  2. In the Name text box, type a unique name for this spider.
  3. In the General tab, do the following:
    1. (Optional) In the Maximum hops text box, type an integer greater than zero to limit the depth of the crawl. If this field is not set, the crawl depth is unlimited.
    2. (Optional) In the Maximum depth text box, type an integer greater than zero to limit the depth of the URL path. If this field is not set, the URL depth is unlimited.
    3. If you are using a robots.txt file, in the Agent name text box, type the name of the spider as it will be referred to in the User-agent field of the robots.txt file (see the example after this procedure).
    4. If the spider is performing a differential crawl, in the Differential crawl URL text box, type the location to store the spider's state (as a state.tmp file) between Forge executions.
    5. If you do not want the spider to conform to the robots.txt standard, check Ignore robots.
    6. If you do not want the spider to accept cookies, check Disable cookies.
  4. In the Root URLs tab, add one or more URLs from which a crawl can be launched.
  5. (Optional) In the URL Configuration tab, add enqueue URLs and URL filters.
  6. In the Sources tab, choose a record source from the list.
  7. (Optional) In the Comment tab, add a comment for the component.
  8. Click OK.
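
The Agent name and Ignore robots settings in the General tab work together: unless Ignore robots is checked, the spider honors the rules that a crawled Web server publishes in its robots.txt file under the User-agent value matching the Agent name. For example, if the Agent name is MySpider, a robots.txt file such as the following (the agent name and paths are illustrative) keeps the spider out of /private/ while blocking all other crawlers entirely:

  # robots.txt on the crawled Web server
  User-agent: MySpider
  Disallow: /private/

  # All other crawlers
  User-agent: *
  Disallow: /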

Implementing this feature requires additional work outside of Developer Studio. Please refer to the Endeca Forge Guide for details.