About obeying the robots.txt file

You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages.

The http.robots.ignore property controls this behavior. By default, the property is set to false in default.xml, so the crawler obeys both robots.txt and META ROBOTS tags. However, site.xml in the conf/web-crawler/non-polite-crawl directory overrides the property and sets it to true.

For example, if the property is set to false and an HTML page contains this META ROBOTS tag:

<html>
<head>
<title>Sample Page</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>

then the NOINDEX directive causes the crawler to not index the content of the page (that is, no text or title is extracted), while the NOFOLLOW directive prevents outlinks from being extracted from the page. In addition, a message is logged for each directive that is obeyed:

The HTML meta tags for robots contains "noindex", no text and title are extracted for: URL

The HTML meta tags for robots contains "nofollow", no outlinks are extracted for: URL
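The META ROBOTS handling described above can be illustrated with a minimal Python sketch. This is not the Web Crawler's implementation; the class RobotsMetaParser, the function robots_meta_decision, and the ignore_robots parameter (a stand-in for http.robots.ignore) are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_meta_decision(html, ignore_robots=False):
    """Return (index_page, follow_links) for one HTML page.

    ignore_robots is a hypothetical stand-in for the http.robots.ignore property.
    """
    if ignore_robots:
        # META ROBOTS tags are not consulted at all.
        return True, True
    parser = RobotsMetaParser()
    parser.feed(html)
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


page = """<html>
<head>
<title>Sample Page</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
"""

print(robots_meta_decision(page))                      # property set to false
print(robots_meta_decision(page, ignore_robots=True))  # property set to true
```

With the sample page, the first call reports that the page should be neither indexed nor followed, and the second call reports that both are allowed because the tags are ignored.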

If the property is set to true, then the robots.txt file is ignored, as well as any META ROBOTS tags in HTML pages (for example, outlinks are extracted even if the META ROBOTS tag is set to NOFOLLOW).
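The effect of the property on robots.txt rules can be sketched in the same way with Python's standard urllib.robotparser module. This is only an illustration under stated assumptions, not the Web Crawler's own logic: the function may_fetch, the ignore_robots parameter (standing in for http.robots.ignore), the example.com URLs, and the sample rules are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(url, robots_lines, user_agent="*", ignore_robots=False):
    """Decide whether a URL may be fetched under the given robots.txt lines.

    ignore_robots is a hypothetical stand-in for the http.robots.ignore property.
    """
    if ignore_robots:
        # robots.txt is not consulted at all.
        return True
    rules = RobotFileParser()
    rules.parse(robots_lines)  # parse rules supplied as a list of lines
    return rules.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for this sketch.
robots_txt = [
    "User-agent: *",
    "Disallow: /private/",
]

blocked = "http://example.com/private/report.html"
allowed = "http://example.com/public/index.html"

print(may_fetch(blocked, robots_txt))                      # rules obeyed
print(may_fetch(allowed, robots_txt))                      # rules obeyed
print(may_fetch(blocked, robots_txt, ignore_robots=True))  # rules ignored
```

When the rules are obeyed, only the URL outside /private/ is fetchable; when they are ignored, the disallowed URL is fetchable as well.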