How domain names are retrieved from URLs

Every domain name ends in a top-level domain (TLD) name. The TLDs are either generic names (such as com) or country codes (such as jp for Japan).

However, some domain names use a two-term TLD, which complicates the retrieval of domain names from URLs.

For example:

http://www.xyz.com has a one-term TLD of com with a domain name of xyz.com.
http://www.xyz.co.uk has a two-term TLD of co.uk with a domain name of xyz.co.uk.

As the example shows, no single rule (take the last term, or take the last two terms) identifies the TLD for every host name. If you treat only the last term as the TLD, the result is correct for xyz.com but not for xyz.co.uk, which would incorrectly yield co.uk as the domain name. The crawler must therefore take two-term TLDs into account when parsing a URL for a domain name.
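The problem can be reproduced with a minimal Python sketch. The function name last_two_terms is hypothetical and is not part of the Web Crawler; it only illustrates the naive "one-term TLD" rule applied to both sample URLs.

```python
from urllib.parse import urlsplit


def last_two_terms(url):
    """Naive rule: treat the last term as the TLD and keep the last two terms."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


print(last_two_terms("http://www.xyz.com"))    # xyz.com
print(last_two_terms("http://www.xyz.co.uk"))  # co.uk, not the intended xyz.co.uk
```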

Two crawlscope.top-level-domains properties determine which TLDs the crawler recognizes when it builds the domain name:

The crawlscope.top-level-domains.generic property contains a space-delimited list of generic TLD names, such as com, gov, or org.
The crawlscope.top-level-domains.additional property contains a space-delimited list of additional TLD names that may be encountered in a crawl. These are typically two-term TLDs, such as co.uk or ma.us. However, you should also add country codes as necessary (for example, add ca if you are crawling the www.xyz.ca site). In general, add to this list any TLD that is not a generic TLD but belongs to a site that you want to crawl.
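As a rough illustration of the space-delimited format, the following Python sketch turns the two property values into sets for lookup. The properties dictionary, its sample values, and the tld_set helper are assumptions made for this example; they do not show how the crawler itself loads its configuration.

```python
# Hypothetical property values; the real lists come from the crawler configuration.
properties = {
    "crawlscope.top-level-domains.generic": "com gov org",
    "crawlscope.top-level-domains.additional": "co.uk ma.us ca",
}


def tld_set(props, name):
    """Split a space-delimited property value into a set of lowercase TLD names."""
    # str.split() with no argument splits on any run of whitespace.
    return {term.lower() for term in props.get(name, "").split()}


generic = tld_set(properties, "crawlscope.top-level-domains.generic")
additional = tld_set(properties, "crawlscope.top-level-domains.additional")
```

Note that one-term country codes (ca) and two-term TLDs (co.uk) sit side by side in the additional set.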

The Web Crawler uses the property values as follows when retrieving domain names from URLs:

The crawler first looks at the last term of the host name. If it is a TLD in the crawlscope.top-level-domains.generic list (such as com), then the crawler takes the last two terms (xyz and com) as the domain name. This results in a domain name of xyz.com for the http://www.xyz.com sample URL.
If the last term is not one of the generic TLDs, the crawler checks the entire host name against the crawlscope.top-level-domains.additional list. If there is no match, it removes the first term from the host name and checks the remaining terms against the list, repeating until a match is found or no terms remain to be removed. When a match is found, the domain name is the matched TLD preceded by the term immediately before it (for example, xyz.co.uk for a match on co.uk).
If nothing matches the additional list, the crawler returns the last two terms as the domain name and logs a message.
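The lookup described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the crawler's actual code: the function name domain_name, the logger name, and the sample TLD sets are hypothetical, and the handling of a host name that itself matches the additional list (returned unchanged) is a guess.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("crawlscope_sketch")  # hypothetical logger name

# Hypothetical stand-ins for the two property values.
GENERIC_TLDS = {"com", "gov", "org"}
ADDITIONAL_TLDS = {"co.uk", "ma.us", "ca"}


def domain_name(url, generic=GENERIC_TLDS, additional=ADDITIONAL_TLDS):
    host = urlsplit(url).hostname or ""
    terms = host.split(".")

    # Generic TLD: the domain name is the last two terms.
    if terms[-1] in generic:
        return ".".join(terms[-2:])

    # Otherwise check the whole host name, then drop the first term and retry.
    for i in range(len(terms)):
        candidate = ".".join(terms[i:])
        if candidate in additional:
            # Keep the term just before the matched TLD (assumption: if the
            # whole host name matched, return it unchanged).
            return ".".join(terms[max(i - 1, 0):])

    # No match: fall back to the last two terms and log a message.
    result = ".".join(terms[-2:])
    logger.debug("Failed to get the domain name for url: %s "
                 "using %s as the default domain name", url, result)
    return result
```

With these sample sets, http://www.xyz.com resolves to xyz.com, http://www.xyz.co.uk to xyz.co.uk, and a host such as www.xyz.co.jp, whose TLD is in neither list, falls back to co.jp.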

For example, assume that you will be crawling http://www.xyz.co.uk and therefore want a domain name of xyz.co.uk. First you would add co.uk to the crawlscope.top-level-domains.additional list. The procedure for returning the domain name is as follows:

1. The generic TLD list is checked for the uk term, but it is not found.
2. www.xyz.co.uk is checked against the crawlscope.top-level-domains.additional list, but no match is found.
3. xyz.co.uk is checked against the additional TLD list, but no match is found.
4. co.uk is checked against the additional TLD list, and a match is finally found. A domain name of xyz.co.uk is returned.

If step 4 had not found a match in the additional list, the last two terms that were checked (co.uk in this example) would be returned as the domain name. In addition, a DEBUG-level message similar to the following would be logged:

Failed to get the domain name for url: url
using result as the default domain name

where url is the original URL from which the domain name is to be extracted and result is a domain name consisting of the final two terms to be checked (such as co.uk). If you see this message, add the two terms to the additional list and retry the crawl.