You can implement crawl scoping to control which URLs are crawled.

A crawl scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl.

Crawl scoping is applied before all other filters, including the regular expressions in the crawl-urlfilter.txt file and custom plugins. Because of this ordering, a URL that passes the crawl scope filter may still be filtered out by the crawl-urlfilter.txt file. However, a URL that is excluded by the crawl scope filter cannot be added back by the crawl-urlfilter.txt file.
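The filter ordering can be illustrated with a minimal sketch. It is not the Web Crawler's actual code: the function should_fetch, the scope check, and the no_pdf filter are hypothetical names, and the later filters merely stand in for crawl-urlfilter.txt and custom plugins.

```python
import re
from urllib.parse import urlsplit


def should_fetch(url, in_scope, later_filters):
    """Apply the crawl scope check first, then every later filter."""
    if not in_scope(url):
        # Excluded by scope: later filters are never consulted,
        # so they cannot add the URL back.
        return False
    # In scope, but any later filter may still reject the URL.
    return all(accepts(url) for accepts in later_filters)


def scope(url):
    # Hypothetical scope: only the host www.xyz.com.
    return urlsplit(url).hostname == "www.xyz.com"


def no_pdf(url):
    # Hypothetical later filter: reject links that end in .pdf.
    return re.search(r"\.pdf$", url) is None
```

With these definitions, http://www.xyz.com/doc.pdf is in scope but is rejected by no_pdf, while a URL on any other host is rejected regardless of what the later filters would accept.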

The crawl scope properties are listed in the following table.

The Web Crawler implements a basic crawl scoping scheme to accommodate crawls of multiple seeds. The crawler can scope a crawl to only visit URLs from the same host or from the same domain as a seed.
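As a rough illustration of same-host versus same-domain scoping, the following sketch compares a candidate URL against a list of seeds. The function names are hypothetical, the mode strings are used only as labels, and the domain is approximated as the last two terms of the host name, which ignores the two-term TLD handling described later in this topic.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host name of a URL ('' if none)."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain_of(url):
    # Simplifying assumption: the domain is the last two terms of the host.
    return ".".join(host_of(url).split(".")[-2:])


def in_scope(url, seeds, mode):
    """Check a URL against the seeds under the given scoping mode."""
    if mode == "SAME_HOST":
        return host_of(url) in {host_of(s) for s in seeds}
    if mode == "SAME_DOMAIN":
        return naive_domain_of(url) in {naive_domain_of(s) for s in seeds}
    raise ValueError("unsupported mode in this sketch: " + mode)
```

For a single seed of http://www.xyz.com, the URL http://news.xyz.com/a is out of scope under same-host scoping but in scope under same-domain scoping.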

Crawl scoping is implemented via these properties:

The setting of the crawlscope.mode property determines the crawl scoping mode (that is, how URLs are allowed to be visited). The property sets one of these modes:

The boolean setting of the crawlscope.on-redirected-seed property affects how redirections are handled when they result from visiting a seed. The property determines whether crawl scope filtering is applied to the redirected seed or to the original seed:

Note that this redirect filtering property applies only to the SAME_HOST and SAME_DOMAIN crawl scope modes.

As an example of how these properties work, suppose the seed is set to http://xyz.com and a redirect is made to http://xyz.go.com. If the crawl is using SAME_HOST mode and has the crawlscope.on-redirected-seed property set to true, then all URLs that are linked from the redirected page are filtered against http://xyz.go.com. If the property is set to false, then those URLs are filtered against the original seed, http://xyz.com.
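The redirect example can be sketched as follows. This is only an illustration under stated assumptions: scope_reference and same_host are hypothetical helpers, and the boolean argument stands in for the value of the crawlscope.on-redirected-seed property.

```python
from urllib.parse import urlsplit


def scope_reference(seed_url, redirected_url, on_redirected_seed):
    """Return the URL that outgoing links are filtered against."""
    return redirected_url if on_redirected_seed else seed_url


def same_host(url, reference):
    """Same-host check: both URLs must have the same host name."""
    return urlsplit(url).hostname == urlsplit(reference).hostname


seed = "http://xyz.com"
redirect = "http://xyz.go.com"
link = "http://xyz.go.com/page.html"  # hypothetical link on the redirected page

# Property set to true: links are compared with the redirected seed.
kept_when_true = same_host(link, scope_reference(seed, redirect, True))

# Property set to false: links are compared with the original seed.
kept_when_false = same_host(link, scope_reference(seed, redirect, False))
```

In this sketch the link on xyz.go.com is kept when the property is true and dropped when it is false, because the host xyz.go.com does not match the original seed host xyz.com.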

The two crawlscope.top-level-domains properties are used for parsing domain names.

Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan).

However, some domain names use a two-term TLD, which complicates the retrieval of top-level domain names from URLs.

For example:

As the example shows, it is difficult to generalize whether to take the last term or the last two terms as the TLD name for the domain name. Taking only the last term as the TLD works for xyz.com but not for xyz.co.uk, where it would incorrectly yield co.uk as the domain name. Therefore, the crawler must take this into account when parsing a URL for a domain name.

The two crawlscope.top-level-domains properties are used for determining which TLDs to use in the domain name:

The Web Crawler uses the property values as follows when retrieving domain names from URLs:

For example, assume that you will be crawling http://www.xyz.co.uk and therefore want a domain name of xyz.co.uk. First you would add co.uk to the crawlscope.top-level-domains.additional list. The procedure for returning the domain name is as follows:

If after step 4 no match is found in the additional list, the last two terms that were checked are returned as the domain name (co.uk in this example). In addition, a DEBUG-level message similar to this example is logged:

Failed to get the domain name for url: url
using result as the default domain name

where url is the original URL from which the domain name is to be extracted and result is a domain name consisting of the final two terms to be checked (such as co.uk). If you see this message, add the two terms to the additional list and retry the crawl.
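The lookup and its fallback can be approximated with the sketch below. It is an assumption-laden illustration, not the crawler's procedure: the sets generic_tlds and additional_tlds are hypothetical stand-ins for the two crawlscope.top-level-domains lists, their contents are sample values, and the order of checks is a guess that is consistent with the xyz.co.uk example.

```python
import logging
from urllib.parse import urlsplit

log = logging.getLogger("domain-sketch")

# Hypothetical sample contents for the two TLD lists.
GENERIC_TLDS = {"com", "org", "net"}
ADDITIONAL_TLDS = {"co.uk"}


def domain_name(url, generic_tlds=GENERIC_TLDS, additional_tlds=ADDITIONAL_TLDS):
    """Return the domain name for a URL, honoring two-term TLDs."""
    terms = (urlsplit(url).hostname or "").lower().split(".")

    # One-term TLD (such as com): domain is the TLD plus the preceding term.
    if len(terms) >= 2 and terms[-1] in generic_tlds:
        return ".".join(terms[-2:])

    # Two-term TLD (such as co.uk): domain is the TLD plus the preceding term.
    if len(terms) >= 3 and ".".join(terms[-2:]) in additional_tlds:
        return ".".join(terms[-3:])

    # No match: fall back to the last two terms and log at DEBUG level.
    result = ".".join(terms[-2:])
    log.debug("Failed to get the domain name for url: %s "
              "using %s as the default domain name", url, result)
    return result
```

With co.uk in the additional list, http://www.xyz.co.uk yields xyz.co.uk. Calling the function with an empty additional list yields co.uk and logs the DEBUG message, which mirrors the situation in which the two terms should be added to the additional list.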

