Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan).
However, some domain names use a two-term TLD, which complicates the retrieval of top-level domain names from URLs.
For example:
- http://www.xyz.com has a one-term TLD of com, with a domain name of xyz.com.
- http://www.xyz.co.uk has a two-term TLD of co.uk, with a domain name of xyz.co.uk.
As the example shows, no single rule (always take the last term, or always take the last two terms) identifies the TLD of every domain name. If you treat only the last term as the TLD, the result is correct for xyz.com but not for xyz.co.uk, which would incorrectly yield co.uk as the domain name. The crawler must therefore take this into account when parsing a URL for a domain name.
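The problem can be reproduced with a minimal sketch of the naive rule. The function name naive_domain_name is hypothetical and is not part of the Web Crawler; it simply keeps the last two terms of the host name, which is what treating the last term as the TLD amounts to.

```python
from urllib.parse import urlsplit


def naive_domain_name(url):
    """Return the last two dot-separated terms of the URL's host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(naive_domain_name("http://www.xyz.com"))    # xyz.com
print(naive_domain_name("http://www.xyz.co.uk"))  # co.uk, not the wanted xyz.co.uk
```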
The two crawlscope.top-level-domains properties determine which TLDs are used in the domain name:
- The crawlscope.top-level-domains.generic property contains a space-delimited list of generic TLD names, such as com, gov, or org.
- The crawlscope.top-level-domains.additional property contains a space-delimited list of additional TLD names that may be encountered in a crawl. These are typically two-term TLDs, such as co.uk or ma.us. However, you should also add country codes as necessary (for example, add ca if you are crawling the www.xyz.ca site). In general, add to this list any TLD that is not a generic TLD but belongs to a site you want to crawl.
The Web Crawler uses the property values as follows when retrieving domain names from URLs:
- The crawler first looks at the last term of the host name. If it is a TLD in the crawlscope.top-level-domains.generic list (such as com), the crawler takes the last two terms (xyz and com) as the domain name. This results in a domain name of xyz.com for the http://www.xyz.com sample URL.
- If the last term is not one of the generic TLDs, the crawler checks the entire host name against the crawlscope.top-level-domains.additional list. If there is no match, it truncates the first term from the host name and checks the remainder against the list. It repeats this step until a match is found or there are no more terms to truncate. When a match is found, the domain name is the matched TLD preceded by the term immediately before it.
- If no terms match the additional list, the crawler returns the last two terms as the domain name and logs a message (at DEBUG level).
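The lookup order described above can be sketched as follows. This is an illustration under stated assumptions, not the Web Crawler's actual implementation: the names parse_tld_list and domain_name, the logger name, and the handling of a host name that matches the additional list in its entirety are all hypothetical.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("crawlscope_sketch")  # hypothetical logger name


def parse_tld_list(value):
    """Split a space-delimited property value into a set of lowercase TLDs."""
    return {tld.lower() for tld in value.split()}


def domain_name(url, generic, additional):
    """Derive a domain name from a URL using the two TLD lists."""
    host = (urlsplit(url).hostname or "").lower()
    terms = host.split(".")
    last_two = ".".join(terms[-2:])

    # Step 1: a generic TLD in last position means "take the last two terms".
    if terms[-1] in generic:
        return last_two

    # Step 2: check the whole host name, then drop the first term and retry.
    for i in range(len(terms)):
        candidate = ".".join(terms[i:])
        if candidate in additional:
            # Assumption: the domain is the matched TLD plus the term before
            # it; if the whole host matched, there is no earlier term to add.
            return ".".join(terms[max(i - 1, 0):])

    # Step 3: nothing matched, so fall back to the last two terms and log.
    logger.debug(
        "Failed to get the domain name for url: %s using %s as the default domain name",
        url, last_two,
    )
    return last_two


generic = parse_tld_list("com gov org")
additional = parse_tld_list("co.uk ma.us ca")
print(domain_name("http://www.xyz.com", generic, additional))
print(domain_name("http://www.xyz.co.uk", generic, additional))
```

Holding each list as a set keeps every membership check constant-time, so the cost of a lookup grows only with the number of terms in the host name.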
For example, assume that you will be crawling http://www.xyz.co.uk and therefore want a domain name of xyz.co.uk. First you would add co.uk to the crawlscope.top-level-domains.additional list. The procedure for returning the domain name is as follows:
1. The generic TLD list is checked for the uk term, but it is not found.
2. www.xyz.co.uk is checked against the crawlscope.top-level-domains.additional list, but no match is found.
3. xyz.co.uk is checked against the additional TLD list, but no match is found.
4. co.uk is checked against the additional TLD list, and a match is finally found. A domain name of xyz.co.uk is returned.
If no match is found in the additional list after step 4 (that is, co.uk has not been added to the list), the last two terms of the host name are returned as the domain name (co.uk in this example). In addition, a DEBUG-level message similar to this example is logged:
Failed to get the domain name for url: url using result as the default domain name
where url is the original URL from which the domain name is to be extracted and result is a domain name consisting of the last two terms of the host name (such as co.uk). If you see this message, add the two terms to the additional list and retry the crawl.
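The effect of adding co.uk and retrying can be traced with a small sketch of the checks made against the additional list. The helper trace_checks is hypothetical and assumes the truncation order described earlier. It records each candidate and whether it matched, stopping at the first match.

```python
def trace_checks(host, additional):
    """Return (candidate, matched) pairs in the order the candidates are checked."""
    terms = host.lower().split(".")
    trace = []
    for i in range(len(terms)):
        candidate = ".".join(terms[i:])
        matched = candidate in additional
        trace.append((candidate, matched))
        if matched:
            break  # stop at the first match, as the crawler is described to do
    return trace


# Before co.uk is added: every candidate fails, so the fallback would apply.
for candidate, matched in trace_checks("www.xyz.co.uk", set()):
    print(candidate, "match" if matched else "no match")

# After co.uk is added: the third candidate checked is a match.
for candidate, matched in trace_checks("www.xyz.co.uk", {"co.uk"}):
    print(candidate, "match" if matched else "no match")
```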