A spider component crawls documents rather than loading records from a file, thereby adding document crawling capabilities to your pipeline.
The spider handles URL management and content manipulation. Using spiders, you can configure a pipeline that crawls the network to which it is directed and processes the documents it finds there using the Endeca Information Transformation Layer (ITL).
Note
The Endeca Crawler, which uses the spider component, has been deprecated and support for it will be removed in a future version. It is recommended that you use the Endeca Web Crawler for Web crawls and the Endeca CAS Server for file system crawls.
To be able to crawl, a spider component depends on at least one other pipeline component. Upstream, there must be a record adapter with a Format of type "document." This record adapter must reference the spider as its URL source. The two components work together to crawl a content repository (such as a Web server, file server, or FTP server). Other pipeline components can sit between the record adapter and the spider; for example, you can place a record manipulator between them.
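The division of labor can be pictured with a short sketch. This is plain, purely illustrative Python, not Endeca pipeline syntax or any Forge API, and all function names are hypothetical: the spider role corresponds to the URL queue and filtering, the record adapter role to the document fetch, and a record manipulator could perform the link extraction.

```python
import re
import urllib.parse
import urllib.request
from collections import deque

def fetch_document(url):
    """Record adapter role: retrieve one document for each URL handed out by the spider."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def extract_links(html, base_url):
    """Record manipulator role: pull candidate links out of a fetched document."""
    return [urllib.parse.urljoin(base_url, href)
            for href in re.findall(r'href="([^"]+)"', html)]

def crawl(root_urls, url_filter, max_hops):
    """Spider role: manage the URL queue, filter new links, and bound the crawl."""
    queue = deque((url, 0) for url in root_urls)
    visited = set()
    while queue:
        url, hops = queue.popleft()
        if url in visited or hops > max_hops:
            continue
        visited.add(url)
        doc = fetch_document(url)
        for link in extract_links(doc, url):
            if url_filter(link):
                queue.append((link, hops + 1))
        yield url, doc  # each document flows downstream for ITL processing
```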
To add a spider component:
In the Pipeline Diagram editor, click New, and then choose Spider.
The Spider editor appears.
In the General tab, do the following:
(Optional) In the Maximum hops text box, type an integer greater than zero to limit the depth of the crawl. If this field is not set, the crawl depth is unlimited.
(Optional) In the Maximum depth text box, type an integer greater than zero to limit the depth of the URL path. If this field is not set, the URL depth is unlimited.
If you are using a robots.txt file, in the Agent name text box, type the name of the spider as it will be referred to in the User-agent field of a robots.txt file (an example robots.txt file appears after this procedure).
If the spider is performing a differential crawl, in the Differential crawl URL text box, type the location to store the spider's state (as a state.tmp file) between Forge executions.
If you do not want the spider to conform to the robots.txt standard, check Ignore robots.
If you do not want the spider to accept cookies, check Disable cookies.
In the Root URLs tab, add one or more URLs from which a crawl can be launched.
(Optional) In the URL Configuration tab, add enqueue URLs and URL filters.
(Optional) In the Comment tab, add a comment for the component.
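As an example of the Agent name setting above: if the agent name were a hypothetical value such as MySpider, a robots.txt file on a crawled server could address that spider by name using standard robots.txt syntax (the agent name and paths below are examples only).

```
User-agent: MySpider
Disallow: /private/
Disallow: /tmp/

User-agent: *
Disallow: /
```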
Implementing this feature requires additional work outside of Developer Studio. Please refer to the Endeca Forge Guide for details.
The Spider editor contains the unique name of this spider, along with the following tabs:
General

Field | Description
---|---
Maximum hops | Optional. Limits the depth of the crawl to a maximum number of hops from the root URL (see below). The value must be an integer greater than zero. If no value is provided, the maximum number of hops is unlimited.
Maximum depth | Optional. Limits the depth of the crawl to a maximum depth of URL path. The value must be an integer greater than zero. If no value is provided, the maximum path depth is unlimited.
Agent name | Identifies the name of the spider as it will be referred to in the User-agent field of a robots.txt file.
Differential crawl URL | Optional. When configuring a spider to perform a differential crawl, the differential crawl URL specifies a file location to store the spider's state between Forge executions. This file is read in at the beginning of a differential crawl to enqueue URLs from previous crawls and may be updated during the crawl.
Ignore robots | When checked, the spider does not adhere to the robots.txt standard.
Disable cookies | When checked, the spider refuses cookies sent from a host server during a crawl.
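Maximum hops and Maximum depth bound different things: hops count link traversals from a root URL, while depth looks only at the URL's path. The exact counting rules that Forge applies are described in the Endeca Forge Guide; the following plain-Python sketch (illustrative only, names hypothetical) shows the conceptual difference between the two limits.

```python
from urllib.parse import urlparse

def within_hop_limit(hops_from_root, max_hops):
    # "Maximum hops": how many links were followed to reach this document.
    return hops_from_root <= max_hops

def within_depth_limit(url, max_depth):
    # "Maximum depth": how many path segments the URL itself contains,
    # regardless of how the crawler reached it.
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) <= max_depth

# A page linked directly from a root URL can pass a small hop limit even if
# its path is deep, and vice versa:
print(within_hop_limit(1, 2))                                         # True
print(within_depth_limit("http://example.com/a/b/c/d/page.html", 3))  # False
```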
Root URLs
The Root URLs tab is where you manage root URLs, which specify the location from which the spider starts crawling. There must be at least one root URL specified for each spider.
URL Configuration

Field | Description
---|---
Enqueue URLs | Optional. Takes each URL link from a specified property and adds it to the queue for further filtering.
URL filters | Optional. Provides pattern matching capabilities to control document filtering.
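The filter types and syntax that the spider accepts are documented in the Endeca Forge Guide. Conceptually, URL filters act as patterns that admit or reject each candidate URL before it is fetched; the following plain-Python regular-expression sketch (illustrative only, not Endeca filter syntax, with hypothetical patterns) shows the idea.

```python
import re

# Hypothetical patterns: crawl only HTML pages on www.example.com, but skip /archive/.
include = [re.compile(r"^https?://www\.example\.com/.*\.html$")]
exclude = [re.compile(r"/archive/")]

def passes_filters(url):
    return (any(p.search(url) for p in include)
            and not any(p.search(url) for p in exclude))

print(passes_filters("http://www.example.com/docs/intro.html"))   # True
print(passes_filters("http://www.example.com/archive/old.html"))  # False
```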
Timeout

Option | Description
---|---
Maximum time spent fetching a URL | When checked, type the time in seconds.
Maximum time to wait for a connection to be made | When checked, type the time in seconds.
Abort fetch if transfer rate falls below | When checked, type the bytes per second and the number of seconds.
Proxy

Option | Description
---|---
Proxy mode list | Choose whether to use no proxy servers, a single proxy server, or separate proxy servers for HTTP and HTTPS requests.
HTTP proxy server | The hostname and port number for the HTTP proxy server (if one is being used).
HTTPS proxy server | The hostname and port number for the HTTPS proxy server (if one is being used).
Bypass URLs | When clicked, opens the Bypass URLs editor, where you specify the list of URLs that should be fetched directly, bypassing any proxy servers.
Sources
Required. A choice of record server components in the project.
Comment
Optional. Provides a way to associate comments with a pipeline component.