You can implement crawl scoping to control which URLs are crawled.
A crawl scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl.
Crawl scoping is applied before all other filters, including the regular expressions in the crawl-urlfilter.txt file and custom plugins. Because of this ordering, a URL that passes the crawl scope filter may still be filtered out by the crawl-urlfilter.txt file. However, a URL that is excluded by the crawl scope filter cannot be added back by the crawl-urlfilter.txt file.
The crawl scope properties are listed in the following table.
Property Name | Property Value |
---|---|
crawlscope.mode | One of ANY, SAME_DOMAIN, or SAME_HOST (the default). |
crawlscope.on-redirected-seed | Boolean value. |
crawlscope.top-level-domains.generic | Space-delimited list of top-level domain names. Contains a list of generic top-level domain names. Do not modify this list because it may affect how domain names are retrieved. |
crawlscope.top-level-domains.additional | Space-delimited list of top-level domain names (default is empty). Specifies additional top-level domain names that are pertinent to your crawls. |
The Web Crawler implements a basic crawl scoping scheme to accommodate crawls of multiple seeds. The crawler can scope a crawl to only visit URLs from the same host or from the same domain as a seed.
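Crawl scoping is implemented via the crawlscope.mode, crawlscope.on-redirected-seed, and crawlscope.top-level-domains properties, which are described below.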
The setting of the crawlscope.mode property determines the crawl scoping mode (that is, how URLs are allowed to be visited). The property sets one of these modes:
ANY indicates that any URL is allowed to be visited. This mode turns off crawl scoping because there is no restriction on which URLs can be visited.
SAME_DOMAIN indicates that a URL is allowed to be visited only if it comes from the same domain as the seed URL. The crawler attempts to determine the domain name by examining the host.
SAME_HOST (the default) indicates that a URL is allowed to be visited only if it comes from the same host as the seed URL.
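The three modes can be illustrated with a minimal Python sketch. This is not the crawler's actual code: the function names (host_of, naive_domain, in_scope) are hypothetical, and naive_domain is a deliberately simplified stand-in that treats the last two terms of the host as the domain, which is only correct for one-term TLDs such as com.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL (empty string if none)."""
    return (urlsplit(url).hostname or "").lower()


def naive_domain(host):
    """Simplified assumption: the last two terms of the host form the domain."""
    return ".".join(host.split(".")[-2:])


def in_scope(url, seed, mode="SAME_HOST", domain_of=naive_domain):
    """Decide whether url may be visited for a crawl that started from seed."""
    if mode == "ANY":
        # Scoping is effectively off: every URL is allowed.
        return True
    if mode == "SAME_HOST":
        return host_of(url) == host_of(seed)
    if mode == "SAME_DOMAIN":
        return domain_of(host_of(url)) == domain_of(host_of(seed))
    raise ValueError(f"unknown crawl scope mode: {mode}")
```

With the seed http://www.xyz.com, the URL http://news.xyz.com/a would be rejected under SAME_HOST but accepted under SAME_DOMAIN, because the hosts differ while the domain (xyz.com) is the same.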
The Boolean setting of the crawlscope.on-redirected-seed property affects how redirections are handled when they result from visiting a seed. The property determines whether crawl scope filtering is applied to the redirected seed or to the original seed:
true applies crawl scope filtering to the redirected seed.
false applies crawl scope filtering to the original seed.
Note that this redirect filtering property applies only to the SAME_HOST and SAME_DOMAIN crawl scope modes.
As an example of how these properties work, suppose the seed is set to http://xyz.com and a redirect is made to http://xyz.go.com. If the crawl is using SAME_HOST mode and has the crawlscope.on-redirected-seed property set to true, then all URLs that are linked from the redirected page are filtered against http://xyz.go.com. If the property is set to false, then those URLs are filtered against http://xyz.com.
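The redirect example above can be sketched in Python as follows. The function name scope_host and its parameters are hypothetical and serve only to show which URL becomes the reference for SAME_HOST filtering; this is not the crawler's implementation.

```python
from urllib.parse import urlsplit


def scope_host(seed, redirected_to, on_redirected_seed):
    """Return the host that linked URLs are compared against in SAME_HOST mode.

    seed               -- the original seed URL
    redirected_to      -- the URL the seed redirected to, or None if no redirect
    on_redirected_seed -- value of the crawlscope.on-redirected-seed property
    """
    # Use the redirect target only when the property is true and a redirect happened.
    reference = redirected_to if (on_redirected_seed and redirected_to) else seed
    return urlsplit(reference).hostname
```

Calling scope_host("http://xyz.com", "http://xyz.go.com", True) yields xyz.go.com, while passing False yields xyz.com, matching the two cases in the example.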
The two crawlscope.top-level-domains properties are used for parsing domain names.
Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan).
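However, some domain names use a two-term TLD, which complicates the retrieval of top-level domain names from URLs.
For example, http://www.xyz.com uses the one-term TLD com, while http://www.xyz.co.uk uses the two-term TLD co.uk.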
As the example shows, it is often difficult to generalize whether to take the last term or the last two terms as the TLD name for the domain name. If you take only the last term as the TLD, then it would work for xyz.com but not for xyz.co.uk (because it would incorrectly result in co.uk as the domain name). Therefore, the crawler must take this into account when parsing a URL for a domain name.
The two crawlscope.top-level-domains properties are used for determining which TLDs to use in the domain name:
The crawlscope.top-level-domains.generic property contains a space-delimited list of generic TLD names, such as com, gov, or org.
The crawlscope.top-level-domains.additional property contains a space-delimited list of additional TLD names that may be encountered in a crawl. These are typically two-term TLDs, such as co.uk or ma.us. However, you should also add country codes as necessary (for example, add ca if you are crawling the www.xyz.ca site). You should add TLDs to this list that are not generic TLDs but that you want to crawl.
The Web Crawler uses the property values as follows when retrieving domain names from URLs:
1. The crawler first looks at the last term of the host name. If it is a TLD in the crawlscope.top-level-domains.generic list (such as com), then the crawler takes the last two terms (xyz and com) as the domain name. This results in a domain name of xyz.com for the http://www.xyz.com sample URL.
2. If the last term is not one of the generic TLDs, then the crawler takes the entire host name and checks it against the crawlscope.top-level-domains.additional list. If there is no match, it truncates the first term from the host name and checks the remainder against the list, repeating until a match is found or there are no more terms to truncate.
3. If no terms matched on the additional list, the crawler returns the last two terms as the domain name and logs a message.
For example, assume that you will be crawling http://www.xyz.co.uk and therefore want a domain name of xyz.co.uk. First you would add co.uk to the crawlscope.top-level-domains.additional list. The procedure for returning the domain name is as follows:
1. The generic TLD list is checked for the uk term, but it is not found.
2. www.xyz.co.uk is checked against the crawlscope.top-level-domains.additional list, but no match is found.
3. xyz.co.uk is checked against the additional TLD list, but no match is found.
4. co.uk is checked against the additional TLD list, and a match is finally found. A domain name of xyz.co.uk is returned.
If after step 4 no match is found in the additional list, the last two terms that were checked are returned as the domain name (co.uk in this example). In addition, a DEBUG-level message similar to this example is logged:
Failed to get the domain name for url: url using result as the default domain name
where url is the original URL from which the domain name is to be extracted and result is a domain name consisting of the final two terms to be checked (such as co.uk). If you see this message, add the two terms to the additional list and retry the crawl.
The crawlscope.top-level-domains.generic property contains these TLD names in the default.xml configuration file that is shipped with the product:
As mentioned in the property table above, you should not modify this list because it may affect how domain names are determined.