You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages.
By default, the http.robots.ignore property is set to false in default.xml. However, site.xml in the conf/web-crawler/non-polite-crawl directory contains an override for the http.robots.ignore property, which is set to true in that file.
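The override behavior described above can be sketched as follows. This is a minimal illustration, not the crawler's own loader. The `<configuration>`/`<property>`/`<name>`/`<value>` layout of the two files is an assumption, and the helper names `read_properties`, `effective_properties`, and `robots_ignored` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed file contents: the <configuration>/<property>/<name>/<value> layout
# is an illustration only and is not copied from the product's files.
DEFAULT_XML = """
<configuration>
  <property>
    <name>http.robots.ignore</name>
    <value>false</value>
  </property>
</configuration>
"""

SITE_XML = """
<configuration>
  <property>
    <name>http.robots.ignore</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a dict mapping property names to their string values."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def effective_properties(default_xml, site_xml):
    """Merge the two files; values from site.xml win over default.xml."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


def robots_ignored(props):
    """Interpret http.robots.ignore as a boolean (absent means false)."""
    return props.get("http.robots.ignore", "false").lower() == "true"


if __name__ == "__main__":
    print(robots_ignored(read_properties(DEFAULT_XML)))
    print(robots_ignored(effective_properties(DEFAULT_XML, SITE_XML)))
```

With only the default file the property resolves to false. Once the non-polite site.xml is layered on top, it resolves to true.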
If the property is set to false and a page contains META tags such as the following:

    <html>
    <head>
    <title>Sample Page</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
    </head>

then the presence of the NOINDEX value causes the crawler not to index the content of the page (that is, no text or title is extracted), while the NOFOLLOW value prevents outlinks from being extracted from the page. In addition, a message is logged for each META tag that is obeyed:
The HTML meta tags for robots contains "noindex", no text and title are extracted for: URL
The HTML meta tags for robots contains "nofollow", no outlinks are extracted for: URL
If the property is set to true, the crawler ignores the robots.txt file and any META ROBOTS tags in HTML pages (for example, outlinks are extracted even if the META ROBOTS tag is set to NOFOLLOW).
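The two cases can be summarized in a short sketch of the decision logic. It is an illustration built on Python's standard `html.parser` module, not the crawler's actual implementation. The names `RobotsMetaParser` and `crawl_decision` are hypothetical, and the `ignore_robots` argument stands in for the http.robots.ignore property.

```python
from html.parser import HTMLParser

SAMPLE_PAGE = (
    '<html> <head> <title>Sample Page</title> '
    '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"> </head>'
)


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <META NAME="ROBOTS"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def crawl_decision(html, ignore_robots):
    """Return (index_page, follow_links) for a page.

    ignore_robots mirrors http.robots.ignore: when True, META ROBOTS tags
    have no effect, so the page is indexed and its outlinks are followed.
    """
    if ignore_robots:
        return True, True
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return ("noindex" not in parser.directives,
            "nofollow" not in parser.directives)


if __name__ == "__main__":
    print(crawl_decision(SAMPLE_PAGE, ignore_robots=False))
    print(crawl_decision(SAMPLE_PAGE, ignore_robots=True))
```

For the sample page, the polite setting yields neither indexing nor link extraction, while the ignore setting yields both.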