Oracle Commerce Guided Search - Crawl scoping properties

Crawl scoping properties

You can implement crawl scoping to control which URLs are crawled.

A crawl scope defines the conditions under which a URL is considered within the scope of a crawl. A URL is within the crawl scope if it should be fetched for that crawl.

Crawl scoping is applied before all other filters, including the regular expressions in the crawl-urlfilter.txt file and custom plugins. Because of this ordering, a URL that passes the crawl scope filter may still be filtered out by the crawl-urlfilter.txt file. However, a URL that is excluded by the crawl scope filter cannot be added back by the crawl-urlfilter.txt file.
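The filter ordering described above can be sketched in Python as follows. This is only an illustration of the ordering, not the Web Crawler's implementation: the function names, the same-host scope check, and the list of exclusion patterns are all assumptions made for the example, and the patterns do not reproduce the actual crawl-urlfilter.txt syntax.

```python
import re
from urllib.parse import urlsplit


def in_scope(url, seed_host):
    """Hypothetical crawl scope check: the URL must be on the seed's host."""
    return urlsplit(url).hostname == seed_host


def passes_url_filter(url, exclude_patterns):
    """Hypothetical stand-in for the later regular-expression filtering."""
    return not any(re.search(pattern, url) for pattern in exclude_patterns)


def should_fetch(url, seed_host, exclude_patterns):
    # The scope check runs first. A URL rejected here never reaches the
    # later filters, so they cannot add it back.
    if not in_scope(url, seed_host):
        return False
    # A URL that is in scope can still be rejected by a later filter.
    return passes_url_filter(url, exclude_patterns)
```

In this sketch, an out-of-scope URL is rejected even when the pattern list is empty, while an in-scope URL is fetched only if no exclusion pattern matches it.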

The crawl scope properties are listed in the following table.

Property Name	Property Value
`crawlscope.mode`	`ANY`, `SAME_DOMAIN`, or `SAME_HOST` (default is `SAME_HOST`). Specifies the mode for crawl scoping.
`crawlscope.on-redirected-seed`	Boolean value (default is `true`). Specifies whether to filter a URL based on its seed or its redirected seed.
`crawlscope.top-level-domains.generic`	Space-delimited list of top-level domain names. Contains the generic top-level domain names. Do not modify this list, because doing so may affect how domain names are retrieved.
`crawlscope.top-level-domains.additional`	Space-delimited list of top-level domain names (default is empty). Specifies additional top-level domain names that are pertinent to your crawls.

About configuring crawl scoping

The Web Crawler implements a basic crawl scoping scheme to accommodate crawls of multiple seeds. The crawler can scope a crawl to only visit URLs from the same host or from the same domain as a seed.

Crawl scoping is implemented via these properties:

crawlscope.mode
crawlscope.on-redirected-seed
crawlscope.top-level-domains.generic
crawlscope.top-level-domains.additional

The setting of the crawlscope.mode property determines the crawl scoping mode (that is, how URLs are allowed to be visited). The property sets one of these modes:

ANY indicates that any URL is allowed to be visited. This mode turns off crawl scoping because there is no restriction on which URLs can be visited.
SAME_DOMAIN indicates that a URL is allowed to be visited only if it comes from the same domain as the seed URL. The crawler attempts to determine the domain name by examining the host name.
SAME_HOST (the default) indicates that a URL is allowed to be visited only if it comes from the same host as the seed URL.
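As a rough illustration of the three modes, the following Python sketch compares a candidate URL against a seed URL. The function names are hypothetical, and the `naive_domain` helper simply takes the last two terms of the host name, which is an assumption made to keep the example short; it is not how the crawler derives domain names for two-term TLDs.

```python
from urllib.parse import urlsplit


def naive_domain(host):
    # Assumption for this sketch only: the domain is the last two terms
    # of the host name. This is wrong for two-term TLDs such as co.uk.
    return ".".join(host.split(".")[-2:])


def is_in_scope(url, seed_url, mode="SAME_HOST", domain_of=naive_domain):
    """Return True if url is within the crawl scope defined by seed_url."""
    if mode == "ANY":
        # No restriction: crawl scoping is effectively turned off.
        return True
    url_host = urlsplit(url).hostname or ""
    seed_host = urlsplit(seed_url).hostname or ""
    if mode == "SAME_HOST":
        return url_host == seed_host
    if mode == "SAME_DOMAIN":
        return domain_of(url_host) == domain_of(seed_host)
    raise ValueError("unknown crawl scope mode: " + mode)
```

With a seed of http://www.xyz.com, a link to http://news.xyz.com/a is out of scope under SAME_HOST in this sketch but in scope under SAME_DOMAIN.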

The boolean setting of the crawlscope.on-redirected-seed property affects how redirections are handled when they result from visiting a seed. The property determines whether crawl scope filtering is applied to the redirected seed or to the original seed:

true (the default) specifies that SAME_HOST/SAME_DOMAIN analysis will be performed on the redirected seed rather than the original seed.
false specifies that SAME_HOST/SAME_DOMAIN filtering will be applied to the original seed.

Note that this redirect filtering property applies only to the SAME_HOST and SAME_DOMAIN crawl scope modes.

As an example of how these properties work, suppose the seed is set to http://xyz.com and a redirect is made to http://xyz.go.com. If the crawl is using SAME_HOST mode and has the crawlscope.on-redirected-seed property set to true, then all URLs that are linked from the seed are filtered against http://xyz.go.com. If the property is set to false, then all URLs that are linked from the seed are filtered against http://xyz.com.
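The choice between the original and the redirected seed can be pictured with a small Python sketch. The function name and parameters are hypothetical and serve only to illustrate the behavior described above.

```python
def scope_reference(seed_url, redirected_url, on_redirected_seed=True):
    """Pick the URL that linked URLs are filtered against.

    redirected_url is None when visiting the seed caused no redirect.
    """
    if redirected_url is not None and on_redirected_seed:
        # Filter against the redirect target (the default behavior).
        return redirected_url
    # Filter against the seed as originally configured.
    return seed_url
```

For the example above, calling the function with http://xyz.com and http://xyz.go.com yields http://xyz.go.com when the flag is true and http://xyz.com when it is false.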

The two crawlscope.top-level-domains properties are used for parsing domain names.

How domain names are retrieved from URLs

Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan).

However, some domain names use a two-term TLD, which complicates the retrieval of top-level domain names from URLs.

http://www.xyz.com has a one-term TLD of com with a domain name of xyz.com.
http://www.xyz.co.uk has a two-term TLD of co.uk with a domain name of xyz.co.uk.

As the example shows, it is often difficult to generalize whether to take the last term or the last two terms as the TLD name for the domain name. If you take only the last term as the TLD, then it would work for xyz.com but not for xyz.co.uk (because it would incorrectly result in co.uk as the domain name). Therefore, the crawler must take this into account when parsing a URL for a domain name.

The two crawlscope.top-level-domains properties are used for determining which TLDs to use in the domain name:

The crawlscope.top-level-domains.generic property contains a space-delimited list of generic TLD names, such as com, gov, or org.
The crawlscope.top-level-domains.additional property contains a space-delimited list of additional TLD names that may be encountered in a crawl. These are typically two-term TLDs, such as co.uk or ma.us. However, you should also add country codes as necessary (for example, add ca if you are crawling the www.xyz.ca site). In general, add to this list any TLD that is not a generic TLD but that appears in the sites you want to crawl.

The Web Crawler uses the property values as follows when retrieving domain names from URLs:

1. The crawler first looks at the last term of the host name. If it is a TLD in the crawlscope.top-level-domains.generic list (such as com), then the crawler takes the last two terms (xyz and com) as the domain name. This results in a domain name of xyz.com for the http://www.xyz.com sample URL.
2. If the last term is not one of the generic TLDs, then the crawler takes the entire host name and checks it against the crawlscope.top-level-domains.additional list. If there is no match, the crawler truncates the first term from the host name and checks the remainder against the list. It repeats this step until a match is found or there are no more terms to be truncated from the host name.
3. If nothing matches the additional list, the crawler returns the last two terms as the domain name and logs a message.
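The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the crawler's own code: the function name `get_domain`, the logger name, and the use of Python sets for the two TLD lists are all hypothetical. When a match is found in the additional list, the sketch returns the matched TLD plus the term that precedes it.

```python
import logging
from urllib.parse import urlsplit

# Hypothetical logger name, used only by this sketch.
log = logging.getLogger("crawlscope_sketch")


def get_domain(url, generic_tlds, additional_tlds):
    """Derive a domain name from url using the two TLD lists."""
    host = urlsplit(url).hostname or ""
    terms = host.split(".")

    # Step 1: if the last term is a generic TLD, use the last two terms.
    if terms[-1] in generic_tlds:
        return ".".join(terms[-2:])

    # Step 2: check the whole host name against the additional list,
    # then keep dropping the first term and checking again.
    for i in range(len(terms)):
        candidate = ".".join(terms[i:])
        if candidate in additional_tlds:
            # The domain is the matched TLD plus the term before it, if any.
            return ".".join(terms[max(i - 1, 0):])

    # Step 3: nothing matched, so fall back to the last two terms.
    result = ".".join(terms[-2:])
    log.debug("Failed to get the domain name for url: %s; using %s "
              "as the default domain name", url, result)
    return result


# The property values are space-delimited, so split() turns them into terms.
generic = set("com gov org".split())
additional = set("co.uk ca".split())
```

With these sample lists, `get_domain("http://www.xyz.co.uk", generic, additional)` returns `xyz.co.uk`. With an empty additional list, the same call falls back to `co.uk`.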

For example, assume that you will be crawling http://www.xyz.co.uk and therefore want a domain name of xyz.co.uk. First you would add co.uk to the crawlscope.top-level-domains.additional list. The procedure for returning the domain name is as follows:

1. The generic TLD list is checked for the uk term, but it is not found.
2. www.xyz.co.uk is checked against the crawlscope.top-level-domains.additional list, but no match is found.
3. xyz.co.uk is checked against the additional TLD list, but no match is found.
4. co.uk is checked against the additional TLD list, and a match is finally found. A domain name of xyz.co.uk is returned.

If after step 4 no match is found in the additional list, the last two terms that were checked are returned as the domain name (co.uk in this example). In addition, a DEBUG-level message similar to this example is logged:

Failed to get the domain name for url: url
using result as the default domain name

where url is the original URL from which the domain name is to be extracted and result is a domain name consisting of the final two terms to be checked (such as co.uk). If you see this message, add the two terms to the additional list and retry the crawl.
