You can set the document conversion properties in the default.xml file.
The Oracle Commerce Web Crawler uses the CAS Document Conversion Module to perform text extraction on any document that is not one of these file types: HTML, text-based, or JavaScript. By using the properties listed in the table, you can configure the behavior of this module.
Note that the CAS Document Conversion Module respects the no-copy option of a PDF. That is, if a PDF publishing application has a no-copy option (which prohibits the copying or extraction of text within the PDF), the CAS Document Conversion Module does not extract text from that PDF. To extract the text, you must re-create the PDF without setting the no-copy option.
Property Name | Property Value |
---|---|
| Integer value (default is |
| Integer value (default is |
Keep in mind that the http.content.limit property limits the maximum size of the content that can be downloaded. If the limit is set to an integer greater than 0, any content larger than the limit is not downloaded, and you will see a WARN message similar to this example:
WARN com.endeca.itl.web.UrlProcessor Content limit exceeded for http://xyz.com/pdf/B2B_info.pdf. Page will be skipped.
This problem often occurs with large PDF files. If you see these messages repeatedly, increase the setting for the http.content.limit property.
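
To judge whether the limit needs raising, it can help to count how often each URL was skipped. The following is a minimal sketch, not part of the product: the function name `skipped_urls` is hypothetical, and the pattern assumes that log lines follow the layout of the WARN example above, which may differ in your installation.

```python
import re
from collections import Counter

# Pattern modeled on the WARN example in this topic (assumption: your
# crawler log uses the same "Content limit exceeded for <url>." layout).
LIMIT_WARNING = re.compile(
    r"WARN\s+\S+\s+Content limit exceeded for (?P<url>\S+?)\.\s+Page will be skipped\."
)


def skipped_urls(log_lines):
    """Count how often each URL was skipped because of the content limit."""
    counts = Counter()
    for line in log_lines:
        match = LIMIT_WARNING.search(line)
        if match:
            counts[match.group("url")] += 1
    return counts


# The first line is the example from this topic; the second is a made-up
# unrelated line that the pattern should ignore.
sample = [
    "WARN com.endeca.itl.web.UrlProcessor Content limit exceeded for "
    "http://xyz.com/pdf/B2B_info.pdf. Page will be skipped.",
    "an unrelated log line",
]

for url, count in skipped_urls(sample).items():
    print(url, count)
```

To run it against a real log, pass an open file object instead of `sample`. URLs that appear many times are the likeliest candidates for a higher http.content.limit setting.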