You can set the HTTP properties in the default.xml file.

The default.xml configuration file allows you to set the HTTP transport properties for the Web Crawler.

Property Name

Property Value

http.agent.name

String that contains the name of the user agent originating the request (default is endeca webcrawler). This value is used for the HTTP User-Agent request header. Required.

http.robots.ignore

Boolean value (default is false). Determines whether the crawler ignores robots.txt files. If set to true, the crawler does not honor the rules in a site's robots.txt file.

http.robots.agents

Comma-delimited list of agent strings, in decreasing order of precedence (default is endeca webcrawler,*). The agent strings are checked against the User-Agent field in the robots.txt file. As a recommended practice, put the value of http.agent.name first in the list and keep the asterisk (*) at the end.

http.robots.403.allow

Boolean value (default is false). Some servers return HTTP status 403 (Forbidden) if robots.txt does not exist. If set to false, such sites are treated as forbidden; if set to true, such sites can be crawled.

http.agent.description

String value (default is empty). Provides descriptive text about the crawler. The text is used in the User-Agent header, appearing in parentheses after the agent name.

http.agent.url

String value (default is empty). Specifies the URL that appears in the User-Agent header, in parentheses after the agent name. By convention, the URL points to a page that explains the purpose and behavior of this crawler.

http.agent.email

String value (default is empty). Specifies the email address that appears in the HTTP From request header and the User-Agent header. A good practice is to mangle this address (for example, "info at example dot com") so that it is harder for spammers to harvest.

http.agent.version

String value (default is WebCrawler). Specifies the version string of the crawler. The version is used in the User-Agent header.

http.timeout

Integer value (default is 10000). Specifies the default network timeout in milliseconds.

http.content.limit

Integer value (default is 1048576). Sets the length limit in bytes for downloaded content. If the value is a positive integer, content longer than the limit is not downloaded (the page is skipped). If the value is a negative integer, no limit is set on the content length. Oracle Commerce does not recommend setting this value to 0 because that value limits the crawl to producing 0-byte content.

http.redirect.max

Integer value (default is 5). Sets the maximum number of redirects the fetcher follows when trying to fetch a page. If set to 0 or a negative integer, the fetcher does not immediately follow redirected URLs, but instead records them for later fetching.

http.useHttp11

Boolean value (default is false). If true, the crawler uses HTTP 1.1; if false, it uses HTTP 1.0.

http.cookies

String value (default is empty). Specifies the cookies to be used by the HTTPClient.
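
As an illustration of how a few of the properties above might be set, the following is a minimal sketch of a default.xml fragment. The configuration/property/name/value element layout is an assumption and should be verified against the default.xml file shipped with your installation. The agent name "example crawler" is a hypothetical value; the numeric values are the defaults listed above.

```xml
<?xml version="1.0"?>
<!-- Sketch only: the element layout below is assumed, not confirmed by this
     page. Compare it with the default.xml in your installation. -->
<configuration>

  <!-- Required. Sent as the HTTP User-Agent request header.
       "example crawler" is a hypothetical name. -->
  <property>
    <name>http.agent.name</name>
    <value>example crawler</value>
  </property>

  <!-- Agent strings matched against robots.txt, in decreasing order of
       precedence: the http.agent.name value first, the asterisk last. -->
  <property>
    <name>http.robots.agents</name>
    <value>example crawler,*</value>
  </property>

  <!-- Network timeout in milliseconds (documented default). -->
  <property>
    <name>http.timeout</name>
    <value>10000</value>
  </property>

  <!-- Skip pages whose content exceeds 1048576 bytes (documented default).
       A negative value removes the limit. -->
  <property>
    <name>http.content.limit</name>
    <value>1048576</value>
  </property>

  <!-- Follow at most 5 redirects per fetch (documented default). -->
  <property>
    <name>http.redirect.max</name>
    <value>5</value>
  </property>

</configuration>
```

In this sketch, http.agent.name and the first entry of http.robots.agents are kept identical, which follows the recommendation given for http.robots.agents.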

