You can set the HTTP properties in the default.xml file. The default.xml configuration file allows you to set the HTTP transport properties for the Web Crawler.
The http.cookies property sets the cookies used by the HTTPClient. The cookies must be in this format:
DOMAIN1~~~NAME1~~~VALUE1~~~PATH1~~~MAXAGE1~~~SECURE1|||DOMAIN2~~~...
where:
MAXAGE is the number of seconds for which the cookie is valid (expected to be a non-negative number; -1 signifies that the cookie never expires).
SECURE is either true (the cookie can only be sent over secure connections, that is, HTTPS servers) or false (the cookie is considered safe to be sent in the clear over unsecured channels).
Note that the triple-tilde delimiter (~~~) must be used to separate the values.
A sample cookie specification is:
172.30.112.218~~~MYCOOKIE~~~ABRACADABRA=MAGIC~~~/junglegym/mycookie.jsp~~~-1~~~false
Note that the example cookie never expires and can be sent over unsecured channels.
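To include this cookie in crawls, you could add it to the http.cookies property in default.xml. The entry below is only a sketch, assuming the <property> element format used by the crawler configuration files; the domain, name, value, and path are the illustrative ones from the sample above:
<property>
  <name>http.cookies</name>
  <!-- Illustrative cookie; separate multiple cookies with the ||| delimiter. -->
  <value>172.30.112.218~~~MYCOOKIE~~~ABRACADABRA=MAGIC~~~/junglegym/mycookie.jsp~~~-1~~~false</value>
</property>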
You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages.
Note
By default, the http.robots.ignore property is set to false in default.xml. However, site.xml in the conf/web-crawler/non-polite-crawl directory contains an override for the http.robots.ignore property, which is set to true in that file.
For example, if the property is set to false and an HTML page has these META tags:
<html>
  <head>
    <title>Sample Page</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  </head>
then the presence of the NOINDEX tag causes the crawler to not index the content of the page (i.e., no text or title is extracted), while the NOFOLLOW tag prevents outlinks from being extracted from the page. In addition, a message is logged for each META tag that is obeyed:
The HTML meta tags for robots contains "noindex", no text and title are extracted for: URL
The HTML meta tags for robots contains "nofollow", no outlinks are extracted for: URL
If the property is set to true, then the robots.txt file is ignored, as well as any META ROBOTS tags in HTML pages (for example, outlinks are extracted even if the META ROBOTS tag is set to NOFOLLOW).
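As a sketch of what that override might look like, a site.xml entry for the non-polite configuration could resemble the following (again assuming the <property> element format; check the file shipped in conf/web-crawler/non-polite-crawl for the exact contents):
<property>
  <name>http.robots.ignore</name>
  <!-- true = ignore robots.txt and META ROBOTS directives during the crawl -->
  <value>true</value>
</property>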
If your crawls are downloading files with a lot of content (for example, large PDF or SWF files), you may see WARN messages about pages being skipped because the content limit was exceeded. To solve this problem, you should increase the download content limit to a setting that allows all content to be downloaded.
Any content longer than the size limit is not downloaded (i.e., the page is skipped).
To set the download content limit:
Set the value of the http.content.limit property as the length limit, in bytes, for downloaded content.
Note
If the content limit is set to a negative number or 0, no limit is imposed on the content. However, this setting is not recommended because the Web Crawler may encounter very large files that slow down the crawl.
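For example, to allow content up to 10 MB you could set the property along these lines (a sketch assuming the <property> element format; 10485760 bytes is an illustrative value, so pick a limit that covers your largest files):
<property>
  <name>http.content.limit</name>
  <!-- Maximum size, in bytes, of downloaded content; larger pages are skipped. -->
  <value>10485760</value>
</property>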
Example 1. Example of setting the download content limit
In this example, the size of the content is larger than the setting of the http.content.limit property:
WARN com.endeca.itl.web.UrlProcessor Content limit exceeded for http://xyz.com/pdf/B2B_info.pdf. Page is skipped.