HTTP Properties

You set the HTTP transport properties in the default.xml file.

http.agent.name
    Required. String that contains the name of the user agent originating the request (default is endeca webcrawler). This value is used for the HTTP User-Agent request header.

http.robots.ignore
    Specifies whether the crawler ignores robots.txt.

http.robots.agents
    Comma-delimited list of agent strings, in decreasing order of precedence (default is endeca webcrawler,*). The agent strings are checked against the User-Agent field in the robots.txt file. It is recommended that you put the value of http.agent.name first in the list and keep the asterisk (*) at the end.

http.robots.403.allow
    Boolean value (default is true). Some servers return HTTP status 403 (Forbidden) if robots.txt does not exist. Setting this value to false treats such sites as forbidden; setting it to true allows them to be crawled.

http.agent.description
    String value (default is empty). Provides descriptive text about the crawler. The text is used in the User-Agent header, appearing in parentheses after the agent name.

http.agent.url
    String value (default is empty). Specifies the URL that appears in the User-Agent header, in parentheses after the agent name. Convention dictates that the URL point to a page explaining the purpose and behavior of the crawler.

http.agent.email
    String value (default is empty). Specifies the email address that appears in the HTTP From request header and the User-Agent header. A good practice is to obfuscate the address (for example, "info at example dot com") to deter harvesting by spammers.

http.agent.version
    String value (default is WebCrawler). Specifies the version of the crawler. The version is used in the User-Agent header.

http.timeout
    Integer value (default is 10000). Specifies the default network timeout in milliseconds.

http.content.limit
    Integer value (default is 1048576). Sets the length limit, in bytes, for downloaded content. If the value is a positive integer, content longer than the limit is not downloaded and the page is skipped. If the value is negative, no limit is placed on the content length. Oracle does not recommend setting this value to 0, because that limits the crawl to producing 0-byte content.

http.redirect.max
    Integer value (default is 5). Sets the maximum number of redirects the fetcher follows when trying to fetch a page. If set to 0 or a negative value, the fetcher does not immediately follow redirected URLs, but instead records them for later fetching.

http.useHttp11
    Boolean value (default is false). If true, the crawler uses HTTP/1.1; if false, it uses HTTP/1.0.

http.cookies
    String value (default is empty). Specifies the cookies to be used by the HTTPClient.
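For illustration, the properties above can be set in default.xml as name/value pairs. This sketch assumes the Nutch-style <property> element layout commonly used for crawler configuration files; verify the element names against your own default.xml, and note that the values shown are examples only:

```xml
<!-- Sample HTTP transport settings (element layout assumed; values are examples) -->
<property>
  <name>http.agent.name</name>
  <value>endeca webcrawler</value>
  <!-- The agent name, version, description, URL, and email properties
       together shape the outgoing User-Agent request header. -->
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>  <!-- default network timeout, in milliseconds -->
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>     <!-- a negative value removes the download size limit -->
</property>
```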