Spider editor

In the Spider editor, you assign the spider a unique name and configure its behavior.

The Spider editor contains the following tabs:

General

Field Description

Maximum hops

Optional. Limits the crawl to a maximum number of hops from a root URL (see the Root URLs tab below). The value must be an integer greater than zero. If no value is provided, the number of hops is unlimited.

Maximum depth

Optional. Limits the crawl to URLs whose path depth does not exceed the specified maximum. The value must be an integer greater than zero. If no value is provided, the path depth is unlimited.

Agent name

Specifies the name by which the spider is identified in the User-agent field of a robots.txt file. Required if the spider follows the robots.txt standard.

Differential crawl URL

Optional. Specifies the location of the file that stores the spider's state between Forge executions when the spider performs a differential crawl. The file is read at the beginning of a differential crawl to enqueue URLs from previous crawls, and it may be updated during the crawl.

Ignore robots

When checked, the spider does not adhere to the robots.txt standard, which tells the spider which files it can crawl. By default, the spider follows the standard and looks for robots.txt on the server.

Disable cookies

When checked, the spider refuses cookies sent from a host server during a crawl.
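
To illustrate how the Maximum hops, Maximum depth, Agent name, and Ignore robots settings could interact, the following Python sketch walks a small in-memory link graph. It is not Forge's implementation. The function names, the sample URLs, the agent name my-spider, and the definition of path depth as the number of non-empty path segments are all assumptions made for this example.

```python
from collections import deque
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def path_depth(url):
    """Count non-empty path segments (an assumed definition of path depth)."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def crawl(roots, links, max_hops=None, max_depth=None, robots=None, agent_name="*"):
    """Breadth-first walk over `links`, a dict mapping a URL to the URLs it links to.

    max_hops / max_depth of None mean "unlimited".
    robots of None models the Ignore robots box being checked.
    Returns the URLs that would be fetched, in visiting order.
    """
    seen = set()
    queue = deque((root, 0) for root in roots)
    visited = []
    while queue:
        url, hops = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        if max_depth is not None and path_depth(url) > max_depth:
            continue  # URL path is deeper than allowed
        if robots is not None and not robots.can_fetch(agent_name, url):
            continue  # robots.txt disallows this URL for our agent name
        visited.append(url)
        if max_hops is not None and hops >= max_hops:
            continue  # hop limit reached: fetch, but do not follow links
        for target in links.get(url, []):
            queue.append((target, hops + 1))
    return visited


# Hypothetical site structure and robots.txt rules, for illustration only.
LINKS = {
    "http://example.com/": ["http://example.com/a/", "http://example.com/private/x.html"],
    "http://example.com/a/": ["http://example.com/a/b/"],
    "http://example.com/a/b/": ["http://example.com/a/b/c/"],
}
ROBOTS = RobotFileParser()
ROBOTS.parse(["User-agent: my-spider", "Disallow: /private/"])

if __name__ == "__main__":
    print(crawl(["http://example.com/"], LINKS, max_hops=1))
    print(crawl(["http://example.com/"], LINKS, robots=ROBOTS, agent_name="my-spider"))
```

In this sketch, a robots.txt rule applies only when its User-agent line matches the agent name passed in, which is why the Agent name field matters whenever robots.txt is honored.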

Root URLs

The Root URLs tab is where you manage root URLs, which specify the location from which the spider starts crawling. There must be at least one root URL specified for each spider.

URL Configuration

The URL Configuration tab is where you manage enqueue URLs and URL filters.

Field Description

Enqueue URLs

Optional. Takes the URLs from a specified property and adds them to the queue for further filtering.

URL filters

Optional. Define patterns that are matched against URLs to control which documents are filtered.
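
The relationship between enqueue URLs and URL filters can be sketched in Python as follows. This is a simplified model under stated assumptions, not the behavior of the product: the record is a plain dictionary, the property name Links is hypothetical, filters are regular expressions, and a URL is kept when it matches no exclude pattern and at least one include pattern (or when no include patterns are given).

```python
import re


def passes_filters(url, include=(), exclude=()):
    """Return True if the URL survives the (assumed) include/exclude filters."""
    if any(re.search(pattern, url) for pattern in exclude):
        return False
    return not include or any(re.search(pattern, url) for pattern in include)


def enqueue_urls(record, property_name, queue, include=(), exclude=()):
    """Append the filtered URLs found in record[property_name] to the queue."""
    for url in record.get(property_name, []):
        if passes_filters(url, include, exclude):
            queue.append(url)
    return queue


# Hypothetical record with a property that holds extracted links.
RECORD = {
    "Links": [
        "http://example.com/docs/a.html",
        "http://example.com/img/logo.png",
        "http://other.org/docs/b.html",
    ]
}

if __name__ == "__main__":
    kept = enqueue_urls(
        RECORD, "Links", [],
        include=[r"^http://example\.com/"],
        exclude=[r"\.png$"],
    )
    print(kept)
```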

Timeout

The Timeout tab is where you manage connection and fetch timing. It contains the following fields:

Option Description

Maximum time spent fetching a URL

When checked, type the maximum fetch time in seconds.

Maximum time to wait for a connection to be made

When checked, type the maximum connection wait time in seconds.

Abort fetch if transfer rate falls below

When checked, type the minimum transfer rate in bytes per second and the duration in seconds.
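
The timeout options can be modeled with a short Python sketch. It rests on assumptions that the text above does not confirm: an unchecked option is represented as None, and the transfer-rate option is read as "abort once the rate has stayed below the given bytes per second for the given number of consecutive seconds". The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TimeoutSettings:
    # None models an unchecked option, meaning no limit is applied.
    max_fetch_seconds: Optional[float] = None
    max_connect_seconds: Optional[float] = None  # not exercised in this sketch
    min_bytes_per_second: Optional[int] = None
    low_rate_seconds: Optional[int] = None


def should_abort(bytes_per_second: Sequence[int], settings: TimeoutSettings) -> bool:
    """Decide whether a fetch should be aborted.

    `bytes_per_second` holds the bytes received in each consecutive
    one-second interval of the fetch so far.
    """
    # Total fetch time exceeded.
    if settings.max_fetch_seconds is not None and len(bytes_per_second) > settings.max_fetch_seconds:
        return True
    # Transfer-rate check is active only when both values are set.
    if settings.min_bytes_per_second is None or settings.low_rate_seconds is None:
        return False
    slow_run = 0  # consecutive seconds spent below the minimum rate
    for received in bytes_per_second:
        slow_run = slow_run + 1 if received < settings.min_bytes_per_second else 0
        if slow_run >= settings.low_rate_seconds:
            return True
    return False


if __name__ == "__main__":
    settings = TimeoutSettings(min_bytes_per_second=100, low_rate_seconds=3)
    print(should_abort([500, 50, 40, 30], settings))
```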

Proxy

The Proxy tab is where you establish the use of proxy servers. It contains the following fields:

Option Description

Proxy mode list

Choose whether to use no proxy servers, a single proxy server, or separate proxy servers for HTTP and HTTPS requests.

HTTP proxy server

The hostname and port number for the HTTP proxy server (if one is being used).

HTTPS proxy server

The hostname and port number for the HTTPS proxy server (if one is being used).

Bypass URLs

When clicked, opens the Bypass URLs editor, where you specify the list of URLs that should be fetched directly, bypassing any proxy servers.
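
The proxy options above amount to a small decision per request, sketched here in Python. The mode names none, single, and separate are hypothetical labels for the three choices in the proxy mode list. The sketch also assumes that a single proxy is taken from the HTTP proxy server field and that a bypass entry matches any URL that starts with it; neither detail is stated in this document.

```python
from urllib.parse import urlparse


def choose_proxy(url, mode, http_proxy=None, https_proxy=None, bypass=()):
    """Return the "host:port" proxy to use for `url`, or None for a direct fetch."""
    # Bypass URLs are fetched directly regardless of the proxy mode.
    if mode == "none" or any(url.startswith(prefix) for prefix in bypass):
        return None
    if mode == "single":
        return http_proxy  # assumption: the single proxy lives in the HTTP field
    if mode == "separate":
        return https_proxy if urlparse(url).scheme == "https" else http_proxy
    raise ValueError(f"unknown proxy mode: {mode!r}")


if __name__ == "__main__":
    print(choose_proxy(
        "https://example.com/docs/",
        "separate",
        http_proxy="proxy.example.com:8080",
        https_proxy="secure-proxy.example.com:8443",
        bypass=["http://intranet.example.com/"],
    ))
```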

Sources

Required. Select from the record server components in the project.

Comment

Optional. Provides a way to associate comments with a pipeline component.