Setting the download content limit

If your crawls download files with a lot of content (for example, large PDF or SWF files), you may see WARN messages reporting that pages were skipped because the content limit was exceeded. To solve this problem, increase the download content limit to a value large enough for the largest content you need to download.

Any content longer than the size limit is not downloaded (i.e., the page is skipped).

To set the download content limit:

  1. In a text editor, open default.xml.
  2. Set the value of the http.content.limit property to the maximum length, in bytes, of downloaded content.
    Note: If the content limit is set to a negative number or 0, no limit is imposed on the content. However, this setting is not recommended because the Web Crawler may encounter very large files that slow down the crawl.
  3. Save and close the file.
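
As a rough illustration of step 2, the edited entry in default.xml might look like the sketch below. It assumes the file uses the common name/value property layout; the value of 10485760 bytes (10 MB) is only an illustrative choice, not a recommended or default setting, and the comment is added here for explanation.

```xml
<!-- Hypothetical example: allow up to 10 MB (10485760 bytes) per document.
     Pick a value that covers the largest files you expect to crawl. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
```

A finite limit like this keeps the safeguard against very large files while still admitting most large PDF or SWF documents. Any file larger than the chosen value is still skipped and reported with a WARN message.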

Example of setting the download content limit

In this example, the size of the content is larger than the setting of the http.content.limit property:
WARN com.endeca.itl.web.UrlProcessor
Content limit exceeded for http://xyz.com/pdf/B2B_info.pdf. Page is skipped.