You set the document conversion properties in the default.xml file.
The Endeca Web Crawler uses the IAS Document Conversion Module to extract text from any document that is not HTML, SGML, XML, text, or JavaScript. You can configure the behavior of this module with the properties listed in the following table.
| Property Name | Property Value |
|---|---|
| doc-conversion.attempts.max | Integer value (default is 2). Specifies the maximum number of times that the module attempts to convert a document. |
| doc-conversion.timeout | Integer value (default is 60000). Specifies the time-out value in milliseconds for converting a document. |
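As a rough sketch, the two properties might appear in default.xml as shown below. This assumes that the file uses the Hadoop-style `<configuration>`/`<property>` layout with `<name>` and `<value>` elements; check the layout against your own copy of the file. The values 3 and 120000 are illustrative only, not recommendations.

```xml
<configuration>
  <!-- Assumed layout: one <property> element per setting. -->
  <!-- Illustrative value: allow up to 3 conversion attempts (default is 2). -->
  <property>
    <name>doc-conversion.attempts.max</name>
    <value>3</value>
  </property>
  <!-- Illustrative value: 120000 ms (2 minutes) per conversion (default is 60000). -->
  <property>
    <name>doc-conversion.timeout</name>
    <value>120000</value>
  </property>
</configuration>
```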
Note that the IAS Document Conversion Module respects the no-copy option of a PDF. That is, if a PDF was created with the no-copy option set (which prohibits the copying or extraction of text within the PDF), the IAS Document Conversion Module does not extract text from that PDF. To extract the text, you must re-create the PDF without setting the no-copy option.
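If a document exceeds the configured content limit, the crawler skips it and logs a warning similar to the following:
WARN com.endeca.eidi.web.UrlProcessor Content limit exceeded for http://xyz.com/pdf/B2B_info.pdf. Page will be skipped.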
This issue often occurs with large PDF files. If you regularly see these messages, increase the setting for the http.content.limit property.
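For illustration, a larger limit might be set as follows. This sketch assumes the same Hadoop-style `<property>` layout as the rest of the crawler configuration and assumes that the value is expressed in bytes; the figure 10485760 (10 MB) is a hypothetical example, so pick a value that fits the largest documents you expect to crawl.

```xml
<configuration>
  <!-- Assumed unit: bytes. 10485760 (10 MB) is an illustrative value only. -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>
</configuration>
```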