You can set the HTTP properties in the default.xml file. The default.xml configuration file allows you to set the HTTP transport properties for the Web Crawler.
The http.cookies property sets the cookies used by the HTTPClient. The cookies must be in this format:
DOMAIN1~~~NAME1~~~VALUE1~~~PATH1~~~MAXAGE1~~~SECURE1|||DOMAIN2~~~...
where:
MAXAGE is the number of seconds for which the cookie is valid (expected to be a non-negative number; -1 signifies that the cookie never expires).
SECURE is either true (the cookie can only be sent over secure connections, that is, HTTPS servers) or false (the cookie is considered safe to be sent in the clear over unsecured channels).
Note that the triple-tilde delimiter (~~~) must be used to separate the values.
A sample cookie specification is:
172.30.112.218~~~MYCOOKIE~~~ABRACADABRA=MAGIC~~~/junglegym/mycookie.jsp~~~-1~~~false
Note that the example cookie never expires and can be sent over unsecured channels.
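To include this cookie in crawls, you could add it to the http.cookies property in default.xml. The entry below is only a sketch, assuming the <property> element format used by the crawler configuration files; the domain, name, value, and path are the illustrative ones from the sample above:
<property>
  <name>http.cookies</name>
  <!-- Illustrative cookie; separate multiple cookies with the ||| delimiter. -->
  <value>172.30.112.218~~~MYCOOKIE~~~ABRACADABRA=MAGIC~~~/junglegym/mycookie.jsp~~~-1~~~false</value>
</property>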
You can set the Web Crawler to either ignore or obey the robots.txt exclusion standard, as well as any META ROBOTS tags in HTML pages.
Note
By default, the http.robots.ignore property is set to false in default.xml. However, site.xml in the conf/web-crawler/non-polite-crawl directory contains an override for the http.robots.ignore property, which is set to true in that file.
For example, if the property is set to false and an HTML page has these META tags:
<html>
  <head>
    <title>Sample Page</title>
    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  </head>
then the presence of the NOINDEX tag causes the crawler to not index the content of the page (i.e., no text or title is extracted), while the NOFOLLOW tag prevents outlinks from being extracted from the page. In addition, a message is logged for each META tag that is obeyed:
The HTML meta tags for robots contains "noindex", no text and title are extracted for: URL
The HTML meta tags for robots contains "nofollow", no outlinks are extracted for: URL
If the property is set to true, then the robots.txt file is ignored, as well as any META ROBOTS tags in HTML pages (for example, outlinks are extracted even if the META ROBOTS tag is set to NOFOLLOW).
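As a sketch of what that override might look like, a site.xml entry for the non-polite configuration could resemble the following (again assuming the <property> element format; check the file shipped in conf/web-crawler/non-polite-crawl for the exact contents):
<property>
  <name>http.robots.ignore</name>
  <!-- true = ignore robots.txt and META ROBOTS directives during the crawl -->
  <value>true</value>
</property>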
If your crawls are downloading files with a lot of content (for example, large PDF or SWF files), you may see WARN messages about pages being skipped because the content limit was exceeded. To solve this problem, you should increase the download content limit to a setting that allows all content to be downloaded.
Any content longer than the size limit is not downloaded (i.e., the page is skipped).
To set the download content limit:
Set the value of the http.content.limit property as the length limit, in bytes, for downloaded content.
Note
If the content limit is set to a negative number or 0, no limit is imposed on the content. However, this setting is not recommended because the Web Crawler may encounter very large files that slow down the crawl.
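For example, to allow content up to 10 MB you could set the property along these lines (a sketch assuming the <property> element format; 10485760 bytes is an illustrative value, so pick a limit that covers your largest files):
<property>
  <name>http.content.limit</name>
  <!-- Maximum size, in bytes, of downloaded content; larger pages are skipped. -->
  <value>10485760</value>
</property>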
Example 1. Example of setting the download content limit
In this example, the size of the content is larger than the setting of the http.content.limit property:
WARN com.endeca.itl.web.UrlProcessor Content limit exceeded for http://xyz.com/pdf/B2B_info.pdf. Page is skipped.