Crawl scoping properties

You implement crawl scoping in the default.xml file to control which URLs are crawled.

A crawl scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl.

Crawl scoping is applied before all other filters, including the regular expressions in the crawl-urlfilter.txt file and custom plugins. This order of URL filtering means that even if a URL makes it through the crawl scope filter, it may still be filtered out by the crawl-urlfilter.txt file. However, a URL that is excluded by the crawl scope filter cannot be added by the crawl-urlfilter.txt file.
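The filtering order described above can be sketched in a few lines of Python. This is only an illustration, not the crawler's actual code: the function names, the (sign, pattern) rule representation, and the "first matching rule wins, no match rejects" behavior of the regular-expression filter are all assumptions made for the example, and the scope check is simplified to a host comparison against the seed.

```python
import re
from urllib.parse import urlsplit


def in_scope(url: str, seed: str) -> bool:
    """Hypothetical scope check: the URL must share the seed's host."""
    return (urlsplit(url).hostname or "") == (urlsplit(seed).hostname or "")


def passes_regex_filter(url: str, rules) -> bool:
    """Assumed rule semantics: the first matching rule decides.

    Each rule is a (sign, pattern) pair; '+' accepts and '-' rejects.
    A URL that matches no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


def should_fetch(url: str, seed: str, rules) -> bool:
    # Crawl scoping runs first. A URL rejected here never reaches the
    # regular-expression filter, so no '+' rule can add it back.
    if not in_scope(url, seed):
        return False
    # A URL that is within scope can still be rejected by a later filter.
    return passes_regex_filter(url, rules)


# Example rules: reject common image files, accept everything else.
RULES = [("-", r"\.(gif|jpg|png)$"), ("+", r".")]
SEED = "http://www.example.com/"
```

With these example rules, http://www.example.com/logo.gif is within scope but is still rejected by the image rule, while http://other.example.org/page.html is rejected by the scope check even though the catch-all "+" rule would have accepted it.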

The crawl scope properties are listed in the following table.
Property Name                              Property Value
crawlscope.mode                            ANY, SAME_DOMAIN, or SAME_HOST (default is SAME_HOST). Specifies the mode for crawl scoping.
crawlscope.on-redirected-seed              Boolean value (default is true). Specifies whether to filter a URL based on its seed or its redirected seed.
crawlscope.top-level-domains.generic       Space-delimited list of top-level domain names. Contains the generic top-level domain names. Do not modify this list, because doing so may affect how domain names are retrieved.
crawlscope.top-level-domains.additional    Space-delimited list of top-level domain names (default is empty). Specifies additional top-level domain names that are pertinent to your crawls.
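To make the interaction between crawlscope.mode and the top-level domain lists more concrete, here is a rough Python sketch. It assumes that the mode names mean what they suggest (ANY accepts every URL, SAME_HOST compares full host names, SAME_DOMAIN compares domain names) and that a domain name is the label immediately before a known top-level domain plus that top-level domain. The helper names, the fallback to the last two labels, and the three-entry TLD set are hypothetical and are not taken from the product.

```python
from urllib.parse import urlsplit

# Illustrative subset only; this is NOT the list shipped in default.xml.
GENERIC_TLDS = {"com", "org", "net"}


def domain_of(host: str, tlds) -> str:
    """Return the label before the longest known TLD suffix, plus the suffix.

    Assumed fallback: when no suffix is known, use the last two labels.
    """
    labels = host.lower().split(".")
    for i in range(len(labels)):  # the longest suffix is tried first
        if ".".join(labels[i:]) in tlds:
            return ".".join(labels[max(i - 1, 0):])
    return ".".join(labels[-2:])


def in_scope(url: str, seed: str, mode: str, tlds) -> bool:
    """Hypothetical scope test for the three crawlscope.mode values."""
    url_host = urlsplit(url).hostname or ""
    seed_host = urlsplit(seed).hostname or ""
    if mode == "ANY":
        return True
    if mode == "SAME_HOST":
        return url_host == seed_host
    if mode == "SAME_DOMAIN":
        return domain_of(url_host, tlds) == domain_of(seed_host, tlds)
    raise ValueError(f"unknown crawl scope mode: {mode}")
```

Under these assumptions, a.example.com and www.example.com are different hosts but share the domain example.com, so a link between them is in scope for SAME_DOMAIN and out of scope for SAME_HOST. The sketch also suggests why additional top-level domains can matter: unless "co.uk" is added to the set, domain_of("www.example.co.uk", GENERIC_TLDS) falls back to "co.uk", which would make every co.uk site look like the same domain.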