You can change how the CAS Document Conversion Module handles the fallback format, file identification, and hidden-text extraction. To override the default document conversion behavior, specify the options as JVM property names and values. Note that you cannot set these options from the CAS Console.
The options are:
stellent.fallbackFormat
determines the fallback format, that is, what extraction format will be used if the CAS Document Conversion Module cannot identify the format of a file. The two valid settings are ascii8 (unrecognized file types are treated as plain-text files, even if they are not plain text) and none (unrecognized file types are considered to be unsupported types and therefore are not converted). Use the none setting if you are more concerned with preventing many binary and unrecognized files from being incorrectly identified as text. If there are documents that are not being properly extracted (especially text files containing multi-byte character encodings), it may be useful to try the ascii8 option.
stellent.fileId
determines the file identification behavior. The two valid settings are normal (standard file identification behavior occurs) and extended (an extended test is run on all files that are not identified). The extended setting may result in slower crawls than with the normal setting, but it improves the accuracy of file identification.
stellent.extractHiddenText
indicates whether to convert hidden text stored in a content item. Hidden text may include text produced by optical character recognition (OCR) software in addition to other types of hidden text. Specifying true for stellent.extractHiddenText converts any hidden text stored in the content item. Specifying false does not convert hidden text.
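As a rough illustration of passing these options as JVM properties, the sketch below uses the standard Java -Dname=value syntax for system properties. The variable name CAS_JVM_ARGS is hypothetical and is used here only to assemble the argument string. Where JVM arguments are actually configured depends on your CAS installation, so treat this as an example of the syntax, not of the configuration location. The values shown are one possible combination of the valid settings.

```shell
# CAS_JVM_ARGS is a hypothetical variable name, used only for illustration.
# Each option is passed to the JVM as -D<property name>=<value>.

# Treat unrecognized file types as unsupported (do not convert them).
CAS_JVM_ARGS="-Dstellent.fallbackFormat=none"

# Run the extended identification test on files that are not identified.
CAS_JVM_ARGS="$CAS_JVM_ARGS -Dstellent.fileId=extended"

# Convert hidden text (for example, OCR text) stored in content items.
CAS_JVM_ARGS="$CAS_JVM_ARGS -Dstellent.extractHiddenText=true"

# Print the assembled arguments so they can be checked before use.
echo "$CAS_JVM_ARGS"
```

Property names and values must be spelled exactly as listed above, for example ascii8 or none for stellent.fallbackFormat.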