A spider component adds document crawling capabilities to your pipeline: rather than loading records from a file, it crawls a content repository for the documents to process.

The spider handles URL management and content manipulation. Using spiders, you can configure a pipeline that crawls the network it is directed at and processes the documents it finds there using the Endeca Information Transformation Layer (ITL).

To be able to crawl, a spider component depends on at least one other pipeline component. Upstream, there must be a record adapter with a Format of type "document," and that record adapter must reference the spider as its URL source. The two components work together to crawl a content repository (such as a Web server, file server, or FTP server). Other pipeline components can sit between the record adapter and the spider; for example, you can place a record manipulator between them. A sketch of this arrangement follows.
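As a rough sketch of how these pieces fit together, the pipeline definition (an XML file, typically Pipeline.epx) might pair a document-format record adapter with a spider along the following lines. The element and attribute names shown here are illustrative assumptions rather than the authoritative schema; refer to the Endeca Forge Guide for the actual configuration reference.

    <!-- Illustrative sketch only: element and attribute names are assumptions, -->
    <!-- not the authoritative pipeline schema.                                 -->

    <!-- Record adapter that reads documents and takes its URLs from the spider -->
    <RECORD_ADAPTER NAME="LoadDocuments" FORMAT="document">
        <URL_SOURCE NAME="MySpider"/>   <!-- reference to the spider as its URL source -->
    </RECORD_ADAPTER>

    <!-- Spider that manages the crawl and supplies URLs back to the adapter -->
    <SPIDER NAME="MySpider">
        <ROOT_URL URL="http://fileserver.example.com/docs/"/>  <!-- at least one root URL -->
    </SPIDER>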

To add a spider component:

Implementing this feature requires additional work outside of Developer Studio. Please refer to the Endeca Forge Guide for details.

In the Spider editor, you specify a unique name for this spider.

The Spider editor contains the following tabs:

Root URLs

The Root URLs tab is where you manage root URLs, which specify the locations from which the spider starts crawling. At least one root URL must be specified for each spider.
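For example, a spider that crawls a departmental Web server and a shared file system might be given root URLs such as the following (the addresses are placeholders, not values from this project):

    http://intranet.example.com/reports/
    file:///data/shared/documents/

Each root URL is simply a starting point; the spider begins crawling from each one.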

Sources

Required. A choice of record server components in the project.

Comment

Optional. Provides a way to associate comments with a pipeline component.

