You set the HTTP transport properties in the default.xml file.
Property Name | Property Value |
---|---|
http.agent.name | Required. String that contains the name of the user agent originating the request (default is endeca webcrawler). This value is used for the HTTP User-Agent request header. |
http.robots.ignore | Boolean value. Specifies whether the crawler ignores the restrictions in a site's robots.txt file. |
http.robots.agents | Comma-delimited list of agent strings, in decreasing order of precedence (default is endeca webcrawler,*). The agent strings are checked against the User-Agent field in the robots.txt file. It is recommended that you put the value of http.agent.name as the first agent name and keep the asterisk (*) at the end of the list. |
http.robots.403.allow | Boolean value (default is true). Some servers return HTTP status 403 (Forbidden) if robots.txt does not exist. Setting this value to false means that such sites are treated as forbidden, while setting it to true means that the site can be crawled. |
http.agent.description | String value (default is empty). Provides descriptive text about the crawler. The text is used in the User-Agent header, appearing in parentheses after the agent name. |
http.agent.url | String value (default is empty). Specifies the URL that appears in the User-Agent header, in parentheses after the agent name. Convention dictates that the URL point to a page explaining the purpose and behavior of this crawler. |
http.agent.email | String value (default is empty). Specifies the email address that appears in the HTTP From request header and the User-Agent header. A good practice is to mangle this address (e.g., "info at example dot com") to deter address harvesting by spammers. |
http.agent.version | String value (default is WebCrawler). Specifies the crawler version string. The version is used in the User-Agent header. |
http.timeout | Integer value (default is 10000). Specifies the default network timeout in milliseconds. |
http.content.limit | Integer value (default is 1048576). Sets the length limit, in bytes, for downloaded content. If the value is a positive integer, content longer than the limit is not downloaded and the page is skipped. If the value is a negative integer, no limit is placed on the content length. Oracle does not recommend setting this value to 0, because that limits the crawl to producing 0-byte content. |
http.redirect.max | Integer value (default is 5). Sets the maximum number of redirects the fetcher follows when trying to fetch a page. If set to 0 or a negative value, the fetcher does not immediately follow redirected URLs, but instead records them for later fetching. |
http.useHttp11 | Boolean value (default is false). If true, the crawler uses HTTP/1.1; if false, it uses HTTP/1.0. |
http.cookies | String value (default is empty). Specifies the cookies to be used by the HTTPClient. |
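As an illustration, the sketch below shows how a few of these properties might be overridden in default.xml. It assumes the Nutch-style property element format used by the crawler's configuration files; the values shown are examples, not recommendations.

```xml
<!-- Example overrides for default.xml (Nutch-style property format assumed). -->
<property>
  <name>http.agent.name</name>
  <value>endeca webcrawler</value>
</property>
<property>
  <name>http.timeout</name>
  <!-- Raise the network timeout from 10 to 30 seconds. -->
  <value>30000</value>
</property>
<property>
  <name>http.content.limit</name>
  <!-- A negative value disables the download size limit. -->
  <value>-1</value>
</property>
```

When http.agent.description, http.agent.url, and http.agent.email are also set, the resulting User-Agent header typically takes the form agent-name/version (description; url; email).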