The Spider editor contains a unique name for this spider.
The Spider editor also contains the following tabs:
| Field | Description |
|---|---|
| Maximum hops | Optional. Limits the depth of the crawl to a maximum number of hops from the root URL (see below). The value must be an integer greater than zero. If no value is provided, the maximum number of hops is unlimited. |
| Maximum depth | Optional. Limits the crawl to a maximum URL path depth. The value must be an integer greater than zero. If no value is provided, the maximum path depth is unlimited. |
| Agent name | The name by which the spider is referred to in the User-agent field of a robots.txt file. Required if you are following the robots.txt standard. |
| Differential crawl URL | Optional. When you configure a spider to perform a differential crawl, the differential crawl URL specifies the file location where the spider's state is stored between Forge executions. This file is read at the beginning of a differential crawl to enqueue URLs from previous crawls, and it may be updated during the crawl. |
| Ignore robots | When checked, the spider does not adhere to the robots.txt standard, which tells the spider which files it can crawl. By default, the spider follows the standard and looks for robots.txt on the server. |
| Disable cookies | When checked, the spider refuses cookies sent from a host server during a crawl. |
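
The difference between the Maximum hops and Maximum depth limits can be illustrated with a minimal Python sketch. It is not the spider's implementation: the `path_depth` definition (number of non-empty path segments), the `crawl` function, and the in-memory `links` graph are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urlparse


def path_depth(url):
    """Count non-empty path segments (an assumed definition of URL path depth)."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def crawl(root, links, max_hops=None, max_depth=None):
    """Breadth-first walk over a hypothetical link graph.

    `links` maps a URL to the URLs it links to. A limit of None means unlimited.
    """
    seen = {root}
    queue = deque([(root, 0)])  # (url, hops from the root URL)
    visited = []
    while queue:
        url, hops = queue.popleft()
        visited.append(url)
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            # Hop limit: number of links followed from the root URL.
            if max_hops is not None and hops + 1 > max_hops:
                continue
            # Depth limit: number of segments in the URL path.
            if max_depth is not None and path_depth(nxt) > max_depth:
                continue
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return visited
```

In this sketch a page such as `http://example.com/a/b/c` is one hop from the root if the root links to it directly, yet it has a path depth of three, so the two limits can exclude different sets of URLs.
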
Root URLs
The Root URLs tab is where you manage root URLs, which specify the location from which the spider starts crawling. There must be at least one root URL specified for each spider.
| Field | Description |
|---|---|
| Enqueue URLs | Optional. Takes each URL link from a specified property and adds them to the queue for further filtering. |
| URL filters | Optional. Provides pattern matching capabilities to control document filtering. |
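
As a rough illustration of how enqueuing and filtering fit together, the following Python sketch reads candidate URLs from a record property and keeps only those that pass a set of patterns. The property name `links`, the include/exclude split, and the shell-style wildcard syntax are assumptions; the spider's actual filter syntax is not described here.

```python
from fnmatch import fnmatchcase


def enqueue_urls(record, property_name, include, exclude=()):
    """Return the URLs in record[property_name] that pass the filters.

    A URL passes when it matches at least one include pattern and
    no exclude pattern (an assumed policy, for illustration only).
    """
    candidates = record.get(property_name, [])
    return [
        url
        for url in candidates
        if any(fnmatchcase(url, pattern) for pattern in include)
        and not any(fnmatchcase(url, pattern) for pattern in exclude)
    ]


# Hypothetical record whose "links" property holds the URLs found on a page.
record = {
    "links": [
        "http://example.com/docs/a.html",
        "http://example.com/img/logo.png",
        "http://other.example/x.html",
    ]
}
queued = enqueue_urls(record, "links", ["http://example.com/*"], ["*.png"])
```
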
| Option | Description |
|---|---|
| Maximum time spent fetching a URL | When checked, type the time in seconds. |
| Maximum time to wait for a connection to be made | When checked, type the time in seconds. |
| Abort fetch if transfer rate falls below | When checked, type the bytes per second and the number of seconds. |
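
The three timeout options can be read as three independent abort conditions. The Python sketch below models them under stated assumptions: the `TimeoutSettings` class, the `abort_reason` function, and the interpretation that the transfer rate must stay below the threshold for the whole number of seconds are hypothetical and are not taken from the spider itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TimeoutSettings:
    max_fetch_seconds: Optional[float] = None     # maximum time spent fetching a URL
    max_connect_seconds: Optional[float] = None   # maximum time to wait for a connection
    min_bytes_per_second: Optional[float] = None  # minimum acceptable transfer rate
    low_rate_seconds: Optional[int] = None        # how long the rate may stay below it


def abort_reason(settings, connected, elapsed, rate_samples):
    """Return why a fetch should be aborted, or None to keep going.

    `rate_samples` holds the bytes received in each past second, oldest first.
    An unset (None) option is treated as unchecked.
    """
    if not connected:
        limit = settings.max_connect_seconds
        if limit is not None and elapsed > limit:
            return "connect timeout"
        return None
    if settings.max_fetch_seconds is not None and elapsed > settings.max_fetch_seconds:
        return "fetch timeout"
    if settings.min_bytes_per_second is not None and settings.low_rate_seconds:
        window = settings.low_rate_seconds
        recent = rate_samples[-window:]
        # Abort only if every sample in the full window is below the threshold.
        if len(recent) == window and all(
            sample < settings.min_bytes_per_second for sample in recent
        ):
            return "transfer rate too low"
    return None
```

With a threshold of 100 bytes per second for 3 seconds, a single slow second does not abort the fetch in this model; three consecutive slow seconds do.
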
| Option | Description |
|---|---|
| Proxy mode list | Choose whether to use no proxy servers, a single proxy server, or separate proxy servers for HTTP and HTTPS requests. |
| HTTP proxy server | The hostname and port number for the HTTP proxy server (if one is being used). |
| HTTPS proxy server | The hostname and port number for the HTTPS proxy server (if one is being used). |
| Bypass URLs | When clicked, opens the Bypass URLs editor, where you specify the list of URLs that should be fetched directly, bypassing any proxy servers. |
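
The way these proxy options interact can be sketched as a small decision function. This Python example is illustrative only: the mode names (`none`, `single`, `separate`), the prefix-based bypass match, and the choice to return the HTTP proxy entry in single-proxy mode are assumptions rather than documented behavior.

```python
from urllib.parse import urlparse


def choose_proxy(url, mode, http_proxy=None, https_proxy=None, bypass_urls=()):
    """Return the 'host:port' proxy to use for `url`, or None to fetch directly."""
    # Bypass URLs are fetched directly, whatever the proxy mode is.
    if mode == "none" or any(url.startswith(prefix) for prefix in bypass_urls):
        return None
    if mode == "single":
        # Assumption: a single proxy serves both HTTP and HTTPS requests.
        return http_proxy
    if mode == "separate":
        return https_proxy if urlparse(url).scheme == "https" else http_proxy
    raise ValueError(f"unknown proxy mode: {mode!r}")
```

For example, `choose_proxy("https://example.com/a", "separate", "proxy.example:8080", "secure.example:8443")` selects the HTTPS proxy, while any URL that starts with an entry in `bypass_urls` is fetched without a proxy.
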
Sources
Required. A choice of record server components in the project.
Comment
Optional. Provides a way to associate comments with a pipeline component.