Oracle® Outside In Search Export Release 8.3.5
Options are parameters affecting the behavior of an export or transformation. This chapter presents both the C/C++ and SOAP options relevant to the Search Export product.
While default values are provided, users are encouraged to set all options for a number of reasons. In some cases, the default values were chosen to provide backwards compatibility. In other cases, the default values were chosen arbitrarily from a range of possibilities.
These options are available to the developer when using the export engine.
Options are set using the DASetOption call. It is recommended that developers familiarize themselves with all of the options available.
Options may be Local, in which case they only affect the handle for which they are set, or Global, in which case they automatically affect all handles associated with the hDoc and must be set before the call to DAOpenDocument.
This section discusses character mapping options.
This option is used in cases where Outside In cannot determine the character set used to encode the text of an input file. When all other means of determining the file's character set are exhausted, Outside In will assume that an input document is encoded in the character set specified by this option. This is most often used when reading plain-text files, but may also be used when reading HTML or PDF files. The possible character sets are listed in charsets.h.
When "extended test for text" is enabled (see "SCCOPT_FIFLAGS"), this option will still apply to plain-text input files that are not identified as EBCDIC or Unicode.
This option supersedes the SCCOPT_FALLBACKFORMAT option for selecting the character set assumed for plain-text files. For backwards compatibility, the deprecated character-set-related values are still supported for SCCOPT_FALLBACKFORMAT, though internally such values are translated into equivalent values for SCCOPT_DEFAULTINPUTCHARSET. As a result, if an application sets both options, the last value set for either option is the one that takes effect.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTDWORD
Default
ANSI1252 on Windows and Latin-1 on UNIX.
Data
The character set identifiers are listed in charsets.h.
This option selects the character used when a character is not a valid Unicode character, or does not conform to the XML specification for valid characters. This option takes the Unicode value for the replacement character. If you are using the PageML output format, this option is only valid if the SCCEX_PAGEML_TEXTOUT flag is set in SCCOPT_XML_PAGEML_FLAGS.
Handle Types
VTHDOC
Scope
Local
Data Type
VTWORD
Data
The Unicode value for the character to use.
Default
0xfffd
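The replacement rule described above can be illustrated with a minimal, standalone sketch. It is not Outside In code: the function names are hypothetical, the validity test is the "Char" production of the XML 1.0 specification applied to single 16-bit units, and isolated surrogate halves are simply treated as invalid, which is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a single 16-bit code unit is permitted by the XML 1.0 "Char"
   production. Surrogate halves (0xD800-0xDFFF) are rejected here, which is a
   simplification made for this sketch. */
int is_xml_char(uint16_t c)
{
    return c == 0x0009 || c == 0x000A || c == 0x000D ||
           (c >= 0x0020 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD);
}

/* Overwrites every invalid unit in buf with the chosen replacement character
   and returns how many units were replaced. */
size_t replace_unmappable(uint16_t *buf, size_t len, uint16_t replacement)
{
    size_t replaced = 0;
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++) {
        if (!is_xml_char(buf[i])) {
            buf[i] = replacement;
            replaced++;
        }
    }
    return replaced;
}
```

Passing 0xFFFD as the replacement mirrors the documented default; any other Unicode value could be passed in the same way.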
This section discusses output options.
This option is only valid on 32-bit Linux (Red Hat and Suse) and Solaris Sparc platforms.
This option is only valid when PageML is the output format.
When this option is set to TRUE, the technology will attempt to use its internal graphics code to render fonts and graphics. When set to FALSE, the technology will render images using the operating system's native graphics subsystem (X11 on UNIX/Linux platforms). Note that this option only works when at least one of the appropriate output solutions is present. For example, if the UNIX $DISPLAY variable does not point to a valid X Server, but the OSGD and/or WV_GD modules required for the Outside In output solution exist, Outside In will default to the Outside In rendering code. The option will fail if neither of these output solutions is present.
It is important for the system to be able to locate usable fonts when this option is set to TRUE. Only TrueType fonts (*.ttf or *.ttc files) are currently supported. To ensure that the system can find them, make sure that the environment variable GDFONTPATH includes one or more paths to these files. If the variable GDFONTPATH can't be found, the current directory is used. If fonts are called for and cannot be found, Search Export will exit with an error. Also note that when copying Windows fonts to a UNIX system, the font extension for the files (*.ttf or *.ttc) must be lowercase, or they will not be detected during the search for available fonts. Oracle does not provide fonts with any Outside In product.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTBOOL
Data
One of the following values:
TRUE: Use the technology's internal graphics rendering code to produce bitmap output files whenever possible.
FALSE: Use the operating system's native graphics subsystem.
Default
FALSE
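Because font files with uppercase extensions are not detected, it may help to check file names before copying Windows fonts into a GDFONTPATH directory. The following is a minimal sketch of such a check; the function name is hypothetical and the code is not part of the Outside In API.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the file name ends in a lowercase ".ttf" or ".ttc" extension
   and has at least one character before it, otherwise 0. The comparison is
   deliberately case-sensitive. */
int has_lowercase_truetype_ext(const char *name)
{
    size_t len;
    const char *ext;

    assert(name != NULL);
    len = strlen(name);
    if (len <= 4)
        return 0;
    ext = name + len - 4;
    return strcmp(ext, ".ttf") == 0 || strcmp(ext, ".ttc") == 0;
}
```

A file such as ARIAL.TTF fails this check and would need to be renamed to arial.ttf (or at least ARIAL.ttf) before it could be found.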
This section discusses input handling options.
Adobe's Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself. This option enables the XMP feature, which passes the XMP metadata straight through without interpreting it. This option is independent of the other two "metadata" options. This option will be ignored if the SCCOPT_PARSEXMPMETADATA option is enabled.
SCCEX_IND_SUPPRESSPROPERTIES will not affect XMP, so if you turn XMP on, but also set SuppressProperties, you will still get the XMP.
SCCEX_METADATAONLY will not guarantee that XMP is produced.
Handle Types
VTHDOC
Scope
Local (was Global prior to release 8.2.2)
Data Type
VTBOOL
Data
TRUE: This setting enables XMP extraction.
FALSE: This setting disables XMP extraction.
Default
FALSE
This option controls how files are handled when their specific application type cannot be determined. This normally affects all plain-text files, because plain-text files are generally identified by process of elimination: when a file isn't identified as having been created by a known application, it is treated as a plain-text file.
This option must be set for an hDoc before any subhandle has been created for that hDoc.
A number of values that were formerly allowed for this option have been deprecated. Specifically, the values that selected specific plain-text character sets are no longer to be used. Instead, applications should use the SCCOPT_DEFAULTINPUTCHARSET option for such functionality.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTDWORD
Data
The high VTWORD of this value is reserved and should be set to 0, and the low VTWORD must have one of the following values:
FI_TEXT: Unidentified file types will be treated as text files.
FI_NONE: Outside In will not attempt to process files whose type cannot be identified. This will include text files. When this option is selected, an attempt to process a file of unidentified type will cause Outside In to return an error value of DAERR_FILTERNOTAVAIL (or SCCERR_NOFILTER).
Default
FI_TEXT
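The layout of this 32-bit value (reserved high word set to 0, file identifier in the low word) can be sketched as follows. The helper names are hypothetical and EXAMPLE_FI_TEXT is a placeholder, not the real value of FI_TEXT; an application would use the constants from the product headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder identifier for illustration only; NOT the real FI_TEXT value. */
#define EXAMPLE_FI_TEXT 0x1234u

/* Builds the option value: the identifier goes in the low 16 bits and the
   reserved high 16 bits are left at 0. */
uint32_t make_fallback_value(uint16_t fi_id)
{
    return (uint32_t)fi_id;
}

/* Returns 1 if the reserved high word is 0, as the option requires. */
int is_valid_fallback_value(uint32_t value)
{
    return (value >> 16) == 0;
}

/* Extracts the identifier stored in the low word. */
uint16_t fallback_low_word(uint32_t value)
{
    return (uint16_t)(value & 0xFFFFu);
}
```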
This option affects how an input file's internal format (application type) is identified when the file is first opened by the Outside In technology. When the extended test flag is in effect, and an input file is identified as being either 7-bit ASCII, EBCDIC, or Unicode, the file's contents will be interpreted as such by the export process.
The extended test is optional because it requires extra processing and cannot guarantee complete accuracy (which would require the inspection of every single byte in a file to eliminate false positives).
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTDWORD
Data
One of the following values:
SCCUT_FI_NORMAL: This is the default value. When this is set, standard file identification behavior occurs.
SCCUT_FI_EXTENDEDTEST: If set, the File Identification code will run an extended test on all files that are not identified.
Default
SCCUT_FI_NORMAL
This option allows the developer to set flags that enable options that span multiple export products.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
SCCOPT_FLAGS_ALLISODATETIMES: When this flag is set, all Date and Time values are converted to the ISO 8601 standard. This conversion can only be performed using dates that are stored as numeric data within the original file.
Default
0: All flags turned off
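As an illustration of what an ISO 8601 date and time looks like when built from numeric fields, the sketch below formats a UTC timestamp with the standard C strftime function. The function name is hypothetical, and the exact ISO 8601 variant that Search Export emits is not stated in this chapter, so the "YYYY-MM-DDThh:mm:ssZ" form used here is an assumption.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Formats numeric date and time fields as "YYYY-MM-DDThh:mm:ssZ".
   Returns 1 on success, or 0 if the output buffer is too small. */
int format_iso8601_utc(int year, int month, int day,
                       int hour, int minute, int second,
                       char *out, size_t out_size)
{
    struct tm t;

    assert(out != NULL);
    memset(&t, 0, sizeof t);
    t.tm_year = year - 1900;   /* struct tm counts years from 1900 */
    t.tm_mon  = month - 1;     /* struct tm counts months from 0 */
    t.tm_mday = day;
    t.tm_hour = hour;
    t.tm_min  = minute;
    t.tm_sec  = second;
    return strftime(out, out_size, "%Y-%m-%dT%H:%M:%SZ", &t) != 0;
}
```

A date stored only as text in the source file has no numeric fields to feed into such a conversion, which is consistent with the restriction noted for this flag.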
This option can disable the password verification of files where the contents can be processed without validation of the password. If this option is not set, the filter should prompt for a password if it handles password-protected files.
As of Release 8.3.5, only the PST Filter supports this option.
Scope
Global
Data Type
VTBOOL
Data
TRUE: Ignore validation of the password
FALSE: Prompt for the password
Default
FALSE
This option allows the developer to specify the location of a Lotus Notes or Domino installation for use by the NSF filter. A valid Lotus installation directory must contain the file nnotes.dll.
Note:
The NSF filter is currently only supported on Win32.
Handle Types
NULL
Scope
Global
Data Type
VTLPBYTE
Data
A path to the Lotus Notes directory.
Default
If this option isn't set, then OIT will first attempt to load the Lotus library according to the operating system's PATH environment variable, and then attempt to find and load the Lotus library as indicated in HKEY_CLASSES_ROOT\Notes.Link.
Adobe's Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself. This option enables parsing of the XMP data into normal OIT document properties. Enabling this option may cause the loss of some regular data in premium graphics filters (such as Postscript), but won't affect most formats (such as PDF).
Handle Types
VTHDOC
Scope
Local
Data Type
VTBOOL
Data
TRUE: This setting enables parsing XMP.
FALSE: This setting disables parsing XMP.
Default
FALSE
This option allows the user to define an offset to GMT that will be applied during date formatting, allowing date values to be displayed in a selectable time zone. This option affects the formatting of numbers that have been defined as date values (e.g., most dates in spreadsheet cells). This option will not affect dates that are stored as text.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTLONG
Data
Integer parameter from -96 to 96, representing 15-minute offsets from GMT. To query the operating system for the time zone set on the machine, specify SCC_TIMEZONE_USENATIVE.
Default
0: GMT time
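Since the value is expressed in 15-minute units, an offset given in minutes has to be converted and range-checked before it is passed to the option. The following sketch shows one way to do that; the function and macro names are hypothetical, and only the -96 to 96 range comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_TZ_UNITS_MIN (-96)   /* -24 hours in 15-minute units */
#define EXAMPLE_TZ_UNITS_MAX 96      /* +24 hours in 15-minute units */

/* Converts an offset from GMT in minutes into 15-minute units.
   Returns 1 and stores the result on success; returns 0 if the offset is
   not a multiple of 15 minutes or falls outside the documented range. */
int timezone_units_from_minutes(int offset_minutes, long *units_out)
{
    long units;

    assert(units_out != NULL);
    if (offset_minutes % 15 != 0)
        return 0;
    units = offset_minutes / 15;
    if (units < EXAMPLE_TZ_UNITS_MIN || units > EXAMPLE_TZ_UNITS_MAX)
        return 0;
    *units_out = units;
    return 1;
}
```

For example, an offset of GMT-5:00 is -300 minutes, or -20 units, and GMT+5:45 is 345 minutes, or 23 units.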
This section discusses compression options.
This option can disable access to any files using Lempel-Ziv-Welch (LZW) compression, such as .GIF files, .ZIP files or self-extracting archive (.EXE) files containing "shrunk" files. When LZW access is disabled, attempts to read such files will fail and return the error SCCERR_UNSUPPORTEDCOMPRESSION.
The following file types are affected when LZW access is disabled:
GIF files
TIF files using LZW compression
PDF files that use internal LZW compression
TAZ and TAR archives containing files that are identified as FI_UNIXCOMP
ZIP and self-extracting archive (.EXE) files containing "shrunk" files
Postscript files using LZW compression
Although this option can disable access to files in ZIP or EXE archives stored using LZW compression, any files in such archives that were stored using any other form of compression will still be accessible.
Handle Types
VTHDOC, VTHEXPORT
Scope
Local
Data Type
VTDWORD
Data
SCCVW_FILTER_LZW_ENABLED: LZW compressed files will be read normally.
SCCVW_FILTER_LZW_DISABLED: LZW compressed files will not be read.
Default
SCCVW_FILTER_LZW_ENABLED
This section discusses XML options.
Outside In has an internal flag that is used to optimize several of the input filters for searching. One of the side effects of this optimization is that many embedded bitmaps aren't output by the filter. SCCOPT_ENABLEALLSUBOBJECTS can override this internal optimization.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One of the following values:
SCCVW_FILTER_ENABLEALLSUBOBJECTS: Override the optimizations.
SCCVW_FILTER_NORMALSUBOBJECTS: Allow the optimizations.
Default
SCCVW_FILTER_NORMALSUBOBJECTS
This option determines whether the output generated by Search Export references a SearchML or PageML schema, references a DTD, or contains no such reference. This option is not valid when SearchText or SearchHTML is the output format.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One of the following values:
SCCEX_XML_XDM_DTD: Document Type Definition (DTD)
SCCEX_XML_XDM_XSD: XML Schema Definition (XSD)
SCCEX_XML_XDM_NONE: No XML definition reference
Default
SCCEX_XML_XDM_NONE
This option allows the developer to set a particular file as the XML definition reference.
If the SCCOPT_XML_DEF_METHOD option is set to SCCEX_XML_XDM_XSD or SCCEX_XML_XDM_DTD, the value of this option will be used to reference the schema or DTD, respectively.
Handle Types
VTHDOC
Scope
Local
Data Type
Size (in bytes) of the data being passed, including a terminating NULL.
Data
The size of an array that holds WORD-sized characters terminated with a WORD-sized NULL (a UCS-2 string). The size passed is the total number of bytes that this UCS-2 string comprises. It includes in its size the bytes occupied by the terminating NULL.
Default
None
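The size rule for the reference string (total bytes, including the two-byte terminator) can be sketched as below. The function name is hypothetical, and uint16_t stands in for the product's WORD-sized character type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the number of bytes occupied by a NULL-terminated UCS-2 string,
   counting the terminating 16-bit NULL as well. */
size_t ucs2_byte_size(const uint16_t *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return (n + 1) * sizeof(uint16_t);
}
```

For the five-character string "a.xsd" this yields 12 bytes: ten for the characters and two for the terminator.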
This option specifies a two-byte Unicode character that will be used to replace null characters if null path separators are being used. This option defaults to '/' and is valid for the SearchML 3.x, SearchHTML and SearchText output formats.
Handle Types
VTHDOC
Scope
Local
Data Type
VTWORD
Data
A two-byte Unicode character that will be used to replace null characters if null path separators are being used.
Default
0x002f = "/"
This option allows the developer to set flags that enable options unique to the PageML schema.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One or more of the following values bitwise OR-ed together. Note that these flags are valid ONLY for the PageML output format:
SCCEX_PAGEML_TEXTOUT: Include text in PageML's output.
SCCEX_XML_NO_XML_DECLARATION: Exclude the XML declaration in PageML's output.
Default
0: All flags turned off.
This option is Windows-specific. It is used to set which device context to use to render the pages.
It specifies, as a byte string, the name of the printer whose metrics should be used to calculate pagination information. If unspecified, the default printer will be used. The screen metrics of the system will be used if a printer is not specified and a default printer does not exist. As pagination is affected by the metrics of the device context and installed fonts, PageML XML output can vary between different systems and configurations.
Handle Types
VTHDOC
Scope
Local
Data Type
VTLPVOID
Data
A null-terminated single-byte string for the name of the printer which is the device context that should be used to render pages.
Default
NULL: PageML uses the Windows default printer.
This option allows the developer to track character attributes contained in the input document and choose which are output to tags in the XML document produced.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One or more of the following values bitwise OR-ed together. Note that not all flags are valid for all Search Export output formats.
SCCEX_XML_SEARCHML_ALLCAPS: Valid for the SearchML 2.0 and SearchML 3.x output formats only.
SCCEX_XML_SEARCHML_BOLD: Valid for the SearchML 2.0, SearchML 3.x and SearchHTML output formats only.
SCCEX_XML_SEARCHML_DUNDERLINE: Valid for the SearchML 2.0, SearchML 3.x and SearchHTML output formats only.
SCCEX_XML_SEARCHML_HIDDEN: Not valid for the PageML output format.
SCCEX_XML_SEARCHML_ITALIC: Valid for the SearchML 2.0, SearchML 3.x and SearchHTML output formats only.
SCCEX_XML_SEARCHML_OCE: When this flag is set, an attribute named oce is added either to <p> or <r> elements as appropriate. (This flag does not affect <unmapped> elements, which will always have an oce attribute.) The value of the attribute is a hex representation of the character set. The value is defined by the core technology (SO_ANSIUNKNOWN, for instance). Possible values for this attribute appear in the vtchars.h header file. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
SCCEX_XML_SEARCHML_OUTLINE: Valid for the SearchML 2.0 and SearchML 3.x output formats only.
SCCEX_XML_SEARCHML_REVISIONADD: Valid for the SearchML 3.x output format only. When set, causes added text to be output and appropriately marked.
SCCEX_XML_SEARCHML_REVISIONDELETE: Valid for the SearchML 3.x output format only. When set, causes deleted text to be output and appropriately marked.
SCCEX_XML_SEARCHML_SMALLCAPS: Valid for the SearchML 2.0 and SearchML 3.x output formats only.
SCCEX_XML_SEARCHML_STRIKEOUT: Valid for the SearchML 2.0 and SearchML 3.x output formats only.
SCCEX_XML_SEARCHML_UNDERLINE: Valid for the SearchML 2.0, SearchML 3.x and SearchHTML output formats only.
Default
0: All flags turned off.
This option allows the developer to set flags that enable options unique to the following SearchML formats: SearchML 2.0, SearchML 3.x, SearchHTML and SearchText.
This option is not valid for the PageML output format, although there is a similar PageML-specific option (SCCOPT_XML_PAGEML_FLAGS) that includes similar flags.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One or more of the following values bitwise OR-ed together. Note that not all flags are valid for all Search Export output formats:
SCCEX_ANNOTATIONS: When set, revised or annotated text will be designated as such. An "annotation" is a note or comment that goes along with a document, but is not really part of the document itself. Examples would be comments, footnotes, slidenotes, etc. Valid only for the SearchML 3.x output formats.
SCCEX_XML_ENABLEERRORINFO: When this flag is set, SearchML will output an <error> element if an error occurs while processing the main document or any sub-documents. The <error> element has one required attribute, code, which will be a hex value of the error code. The contents of the element will be a string with the description of the error returned from DAGetErrorString. Valid only for the SearchML 3.1 and later output formats.
SCCEX_IND_GENERATED: Includes data not originally stored as text in the input document. This can be important content the user would see when viewing the document in the original application (time and owner information in archives, numbers in spreadsheets/databases, etc.).
SCCEX_IND_GENERATESYSTEMMETADATA: When this flag is set, system metadata will be generated. This text is "generated" and part of the document properties, so it will be affected by SCCEX_IND_GENERATED and SCCEX_IND_SUPPRESSPROPERTIES. This information is gathered through system calls and may adversely affect performance. Valid only for the SearchML 3.x output formats.
SCCEX_IND_SS_CELLINFO: When this flag is set, SearchML will output a <cell> element that will encapsulate data from each non-empty cell in a spreadsheet. (NOTE: Numeric cells are considered empty unless SCCEX_IND_GENERATED is enabled.) The <cell> element will have a required attribute start which will give the location of the cell. It will also have an optional attribute end which will be used to indicate a merged cell. Both the start and end attributes will be in the form RowColumn where the Row will be a letter and Column will be a number (for example <cell start="A1">). Valid only for the SearchML 3.x output formats.
SCCEX_IND_SUPPRESSPROPERTIES: Document properties are not produced. Not valid for the PageML output format.
SCCEX_METADATAONLY: Produce only metadata. Not valid for the SearchML 2.0 output format.
SCCEX_PRODUCEURLS: Produce URL information when it is available. Valid only for the SearchHTML and SearchML 3.x output formats.
SCCEX_XML_EMBEDDINGS: Include embeddings.
SCCEX_XML_NO_XML_DECLARATION: Exclude the XML declaration. Valid only for the SearchML 2.0 and SearchML 3.x output formats.
SCCEX_XML_PRODUCEOBJECTINFO: When this flag is set, information for use with IOTYPE_OBJECT will be included in the <document> element. The information will correspond to the fields in the SCCDAOBJECT structure. Valid only for the SearchML 3.x output format.
SCCEX_XML_PSTYLENAMES: Include paragraph style name references as an attribute of paragraph tags. Valid only for the SearchML 2.0 and SearchML 3.x output formats.
SCCEX_XML_SUPPRESSARCHIVESUBDOCS: Subdocuments in archives are not processed.
SCCEX_XML_SUPPRESSATTACHMENTS: Attachments are not processed.
Default
0: All flags turned off.
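A consumer of output produced with SCCEX_IND_SS_CELLINFO has to split the start and end attributes of each <cell> element into their letter and number parts. The sketch below shows one way to parse a value such as "A1". The function name is hypothetical, and the handling of multi-letter references (base-26, so "AA" maps to 27) is an assumption, since the description above only shows a single-letter example.

```c
#include <assert.h>
#include <stddef.h>

/* Parses a reference such as "A1" or "C12" into a 1-based letter index and
   the numeric part. Returns 1 on success, 0 if the text is malformed.
   Multi-letter handling is an assumption made for this sketch. */
int parse_cell_ref(const char *ref, unsigned *letter_index, unsigned *number)
{
    unsigned letters = 0, digits = 0;
    const char *p = ref;

    assert(ref != NULL && letter_index != NULL && number != NULL);
    if (!(*p >= 'A' && *p <= 'Z'))
        return 0;                       /* must start with a letter */
    while (*p >= 'A' && *p <= 'Z') {
        if (letters > 1000u)
            return 0;                   /* reject absurdly long letter runs */
        letters = letters * 26u + (unsigned)(*p - 'A' + 1);
        p++;
    }
    if (!(*p >= '0' && *p <= '9'))
        return 0;                       /* a number must follow the letters */
    while (*p >= '0' && *p <= '9') {
        if (digits > 10000000u)
            return 0;                   /* reject values that could overflow */
        digits = digits * 10u + (unsigned)(*p - '0');
        p++;
    }
    if (*p != '\0')
        return 0;                       /* trailing characters are not allowed */
    *letter_index = letters;
    *number = digits;
    return 1;
}
```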
The value of this option is a Boolean that, if set to TRUE, will include offset information in the SearchML output according to the schema. If the option is set to FALSE, no offset information is produced.
Handle Types
VTHDOC, VTHEXPORT
Scope
Local
Data Type
VTBOOL
Default
FALSE
This option allows the developer to track paragraph attributes contained in the input document and, optionally, include them in the XML output. All lengths are measured in twips. The values that appear in the SearchML output are the values that apply to the first content encountered in a given paragraph. For example, if the character height changes after the initial content in a paragraph, that change will be ignored. Left and first line indents are measured relative to the left page margin. The right indent is measured relative to the right page margin.
This option only affects SearchML output. The option is not valid for the SearchHTML, SearchText and PageML output flavors.
Handle Types
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One or more of the following values bitwise OR-ed together:
SCCEX_XML_SEARCHML_SPACING
SCCEX_XML_SEARCHML_HEIGHT
SCCEX_XML_SEARCHML_LEFTINDENT
SCCEX_XML_SEARCHML_RIGHTINDENT
SCCEX_XML_SEARCHML_FIRSTINDENT
Default
0: All flags turned off.
This option allows for the production of unmapped text (the original code points from the input document). A new <unmapped> element will be produced to enclose this text. The <unmapped> element will contain base64-encoded text. It will also contain two attributes. "OCE" will contain a hex value representing the character set. "font" will contain a string value of the original font name. This is necessary for non-standard encodings such as Wingdings or Webdings. This option is only valid in the SearchML 3.2 (and higher) schema.
Handle Type
VTHDOC
Scope
Local
Data Type
VTDWORD
Data
One of the following values:
SCCEX_XML_JUST_UNMAPPEDTEXT: Output just the unmapped text
SCCEX_XML_NO_UNMAPPEDTEXT: Don't output any unmapped text.
SCCEX_XML_BOTH_UNMAPPEDTEXT: Output both the original and the unmapped text.
Default
SCCEX_XML_NO_UNMAPPEDTEXT
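An application that reads <unmapped> elements needs to decode their base64 content to recover the original code points. The following is a minimal sketch of a decoder for standard, padded base64 (the RFC 4648 alphabet). The function names are hypothetical and the code is not part of the Outside In API; the decoded bytes would still have to be interpreted according to the character set named in the element's attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Maps one base64 character to its 6-bit value, or -1 if it is not part of
   the standard alphabet. */
int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Decodes padded base64 text into out. Returns the number of bytes written,
   or -1 if the input is malformed or out_cap is too small. */
long b64_decode(const char *in, unsigned char *out, size_t out_cap)
{
    size_t len, i, o = 0;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    if (len % 4 != 0)
        return -1;
    for (i = 0; i < len; i += 4) {
        int v[4], k, pad = 0;
        unsigned long triple;
        size_t nbytes;

        for (k = 0; k < 4; k++) {
            char c = in[i + k];
            if (c == '=') {
                /* padding may only fill the last one or two positions
                   of the final group */
                if (i + 4 != len || k < 2)
                    return -1;
                v[k] = 0;
                pad++;
            } else {
                if (pad > 0)
                    return -1;          /* data after padding */
                v[k] = b64_value(c);
                if (v[k] < 0)
                    return -1;
            }
        }
        triple = ((unsigned long)v[0] << 18) | ((unsigned long)v[1] << 12) |
                 ((unsigned long)v[2] << 6)  |  (unsigned long)v[3];
        nbytes = 3 - (size_t)pad;
        if (o + nbytes > out_cap)
            return -1;
        out[o++] = (unsigned char)((triple >> 16) & 0xFF);
        if (nbytes > 1) out[o++] = (unsigned char)((triple >> 8) & 0xFF);
        if (nbytes > 2) out[o++] = (unsigned char)(triple & 0xFF);
    }
    return (long)o;
}
```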
This section discusses file system options.
This set of three options allows the user to adjust buffer sizes to tailor memory usage to the machine's ability. The numbers specified in these options are in kilobytes. These are advanced options that casual users of Search Export may ignore.
Handle Type
NULL, VTHDOC
Scope
Global
Data Type
SCCBUFFEROPTIONS Structure
Data
A buffer options structure
typedef struct SCCBUFFEROPTIONStag
{
   VTDWORD dwReadBufferSize;   /* size of the I/O Read buffer in KB */
   VTDWORD dwMMapBufferSize;   /* maximum size for the I/O Memory Map buffer in KB */
   VTDWORD dwTempBufferSize;   /* maximum size for the memory-mapped temp files in KB */
   VTDWORD dwFlags;            /* use flags */
} SCCBUFFEROPTIONS, *PSCCBUFFEROPTIONS;
Parameters
dwReadBufferSize: Used to define the amount of data, in kilobytes, that will be read from disk into memory at any given time. Once the buffer has data, further file reads will proceed within the buffer until the end of the buffer is reached, at which point the buffer will again be filled from the disk. This can lead to performance improvements in many file formats, regardless of the size of the document.
dwMMapBufferSize: Used to define the maximum size that a document can be and still use a memory-mapped I/O model. In this situation, the entire file is read from disk into memory and all further I/O is performed on the data in memory. This can lead to significantly improved performance, but note that either the entire file can be read into memory, or it cannot. If both of these buffers are set, then if the file is smaller than the dwMMapBufferSize, the entire file will be read into memory; if not, it will be read in blocks defined by the dwReadBufferSize.
dwTempBufferSize: The maximum size that a temporary file can occupy in memory before being written to disk as a physical file. Storing temporary files in memory can boost performance on archives, files that have embedded objects or attachments. If set to 0, all temporary files will be written to disk.
dwFlags: Use flags, with the following values:
SCCBUFOPT_SET_READBUFSIZE: 1
SCCBUFOPT_SET_MMAPBUFSIZE: 2
SCCBUFOPT_SET_TEMPBUFSIZE: 4
To set any of the three buffer sizes, set the corresponding flag in dwFlags when calling DASetOption.
Default
The default settings for these options are:
#define SCCBUFOPT_DEFAULT_READBUFSIZE 2: A 2KB read buffer.
#define SCCBUFOPT_DEFAULT_MMAPBUFSIZE 8192: An 8MB memory-map size.
#define SCCBUFOPT_DEFAULT_TEMPBUFSIZE 2048: A 2MB temp-file limit.
Minimum and maximum sizes for each are:
SCCBUFOPT_MIN_READBUFSIZE 1: Read one Kbyte at a time.
SCCBUFOPT_MIN_MMAPBUFSIZE 0: Don't use memory-mapped input.
SCCBUFOPT_MIN_TEMPBUFSIZE 0: Don't use memory temp files
SCCBUFOPT_MAX_READBUFSIZE 0x003fffff, SCCBUFOPT_MAX_MMAPBUFSIZE 0x003fffff, SCCBUFOPT_MAX_TEMPBUFSIZE 0x003fffff: These maximums correspond to the largest file size possible under the 4GB DWORD limit.
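The limits and flags above can be combined as in the following standalone sketch. It uses a local copy of the structure layout and the numeric values quoted in this section under EXAMPLE_ names, so that it compiles without the product headers; the helper functions are hypothetical, and a real application would use SCCBUFFEROPTIONS and the SCCBUFOPT_ constants directly.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the values quoted in this section. */
#define EXAMPLE_SET_READBUFSIZE 1u
#define EXAMPLE_SET_MMAPBUFSIZE 2u
#define EXAMPLE_SET_TEMPBUFSIZE 4u
#define EXAMPLE_MIN_READBUFSIZE 1u
#define EXAMPLE_MAX_BUFSIZE     0x003fffffu   /* same maximum for all three */

typedef struct {
    uint32_t dwReadBufferSize;   /* KB */
    uint32_t dwMMapBufferSize;   /* KB */
    uint32_t dwTempBufferSize;   /* KB */
    uint32_t dwFlags;
} ExampleBufferOptions;

/* Clamps a size in KB into the inclusive range [lo, hi]. */
uint32_t clamp_kb(uint32_t value, uint32_t lo, uint32_t hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* Fills all three sizes, clamped to the documented limits, and sets the
   three flags so that every size is applied. */
void fill_buffer_options(ExampleBufferOptions *opts,
                         uint32_t read_kb, uint32_t mmap_kb, uint32_t temp_kb)
{
    assert(opts != 0);
    opts->dwReadBufferSize = clamp_kb(read_kb, EXAMPLE_MIN_READBUFSIZE, EXAMPLE_MAX_BUFSIZE);
    opts->dwMMapBufferSize = clamp_kb(mmap_kb, 0u, EXAMPLE_MAX_BUFSIZE);
    opts->dwTempBufferSize = clamp_kb(temp_kb, 0u, EXAMPLE_MAX_BUFSIZE);
    opts->dwFlags = EXAMPLE_SET_READBUFSIZE | EXAMPLE_SET_MMAPBUFSIZE |
                    EXAMPLE_SET_TEMPBUFSIZE;
}

/* Applies the rule from the text: a file smaller than dwMMapBufferSize is
   read entirely into memory; otherwise it is read in blocks. */
int uses_memory_map(const ExampleBufferOptions *opts, uint64_t file_size_bytes)
{
    return file_size_bytes < (uint64_t)opts->dwMMapBufferSize * 1024u;
}
```

With the default 8192 KB memory-map size, a 1 MB file would be read entirely into memory, while a file of 8 MB or more would be read in blocks of dwReadBufferSize.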
From time to time, the technology needs to create one or more temporary files. This option sets the directory to be used for those files.
It is recommended that this option be set as part of a system to clean up temporary files left behind in the event of abnormal program termination. By using this option with code to delete files older than a predefined time limit, the OEM can help to ensure that the number of temporary files does not grow without limit.
Note:
This option will be ignored if SCCOPT_REDIRECTTEMPFILE is set.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
SCCUTTEMPDIRSPEC structure
This structure is used in the SCCOPT_TEMPDIR option.
SCCUTTEMPDIRSPEC is a C data structure defined in sccvw.h as follows:
typedef struct SCCUTTEMPDIRSPEC
{
   VTDWORD dwSize;
   VTDWORD dwSpecType;
   VTBYTE  szTempDirName[SCCUT_FILENAMEMAX];
} SCCUTTEMPDIRSPEC, * LPSCCUTTEMPDIRSPEC;
There is currently a limitation. dwSpecType describes the contents of szTempDirName. Together, dwSpecType and szTempDirName describe the location of the temporary directory. The only dwSpecType values supported at this time are:
IOTYPE_ANSIPATH: Windows only. szTempDirName points to a NULL-terminated full path name using the ANSI character set and FAT 8.3 (Win16) or NTFS (Win32 and Win64) file name conventions.
IOTYPE_UNIXPATH: X Windows on UNIX platforms only. szTempDirName points to a NULL-terminated full path name using the system default character set and UNIX path conventions.
Specifically not supported at this time are IOTYPE_UNICODEPATH and IOTYPE_REDIRECT.
Parameters
dwSize: Set to sizeof(SCCUTTEMPDIRSPEC).
dwSpecType: IOTYPE_ANSIPATH or IOTYPE_UNIXPATH
szTempDirName: The path to the directory to use for the temporary files. Note that if all SCCUT_FILENAMEMAX bytes in the buffer are filled, there will not be space left for file names.
Default
The system default directory for temporary files. On UNIX systems, this is the value of environment variable $TMP. On Windows systems, it is the value of environment variable %TMP%.
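The following sketch shows how the structure might be filled while leaving room in the buffer for the file names that are appended to the directory. It uses a local copy of the layout so that it compiles on its own. EXAMPLE_FILENAMEMAX, EXAMPLE_NAME_HEADROOM and the EXAMPLE_IOTYPE_ values are placeholders chosen for illustration, not the real values of SCCUT_FILENAMEMAX or the IOTYPE_ constants, and the helper function is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Placeholder values for illustration only. */
#define EXAMPLE_FILENAMEMAX   256
#define EXAMPLE_NAME_HEADROOM 32      /* bytes kept free for file names */
#define EXAMPLE_IOTYPE_ANSIPATH 1u
#define EXAMPLE_IOTYPE_UNIXPATH 2u

typedef struct {
    uint32_t      dwSize;
    uint32_t      dwSpecType;
    unsigned char szTempDirName[EXAMPLE_FILENAMEMAX];
} ExampleTempDirSpec;

/* Fills the spec for the given directory. Returns 1 on success, or 0 if the
   path is empty or so long that no room would remain for file names. */
int fill_temp_dir_spec(ExampleTempDirSpec *spec, uint32_t spec_type,
                       const char *dir)
{
    size_t len;

    assert(spec != NULL && dir != NULL);
    len = strlen(dir);
    if (len == 0 || len + EXAMPLE_NAME_HEADROOM >= EXAMPLE_FILENAMEMAX)
        return 0;
    memset(spec, 0, sizeof *spec);
    spec->dwSize = (uint32_t)sizeof *spec;       /* as required for dwSize */
    spec->dwSpecType = spec_type;
    memcpy(spec->szTempDirName, dir, len + 1);   /* copy with terminator */
    return 1;
}
```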
This option determines the maximum amount of memory that the chunker may use to store the document's data, from 4 MB to 1 GB. The more memory the chunker has available to it, the less often it needs to re-read data from the document.
Handle Types
NULL, VTHDOC
Scope
Global
Data Type
VTDWORD
Parameters
SCCDOCUMENTMEMORYMODE_SMALLEST 1 - 4MB
SCCDOCUMENTMEMORYMODE_SMALL 2 - 16MB
SCCDOCUMENTMEMORYMODE_MEDIUM 3 - 64MB
SCCDOCUMENTMEMORYMODE_LARGE 4 - 256MB
SCCDOCUMENTMEMORYMODE_LARGEST 5 - 1 GB
Default
SCCDOCUMENTMEMORYMODE_SMALL 2 - 16MB
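One way to choose among these modes is to pick the largest one whose limit fits within a memory budget decided by the application. The sketch below encodes the table above locally; the function names and the budget-based selection policy are hypothetical and are not part of the product.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the memory limit in MB for a mode value of 1 to 5, following the
   table in this section, or 0 for any other value. */
unsigned mode_limit_mb(int mode)
{
    static const unsigned limits[] = { 4u, 16u, 64u, 256u, 1024u };

    if (mode < 1 || mode > 5)
        return 0;
    return limits[mode - 1];
}

/* Returns the largest mode whose limit does not exceed budget_mb, or 0 if
   even the smallest mode (4MB) does not fit. */
int largest_mode_within(unsigned budget_mb)
{
    int mode, best = 0;

    for (mode = 1; mode <= 5; mode++) {
        if (mode_limit_mb(mode) <= budget_mb)
            best = mode;
    }
    return best;
}
```

With a budget of 100 MB, this selects mode 3 (the 64MB setting); the returned number corresponds to the numeric value listed beside each SCCDOCUMENTMEMORYMODE_ name.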
This option is set when the developer wants to use redirected IO to completely take over responsibility for the low level IO calls of the temp file.
Handle Types
NULL, VTHDOC
Scope
Global (not persistent)
Data Type
VTLPVOID: pCallbackFunc
Function pointer of the redirect IO callback.
Redirect call back function:
typedef VTDWORD (* REDIRECTTEMPFILECALLBACKPROC) (HIOFILE *phFile, VTVOID *pSpec, VTDWORD dwFileFlags);
There is another option to handle the temp directory, SCCOPT_TEMPDIR. Only one of these two can be set by the developer. The SCCOPT_TEMPDIR option will be ignored if SCCOPT_REDIRECTTEMPFILE is set. These files may be safely deleted when the Close function is called.
These options are available to the developer when using the export engine through the Transformation Server API.
This chapter details the Web Services implementation of options in Transformation Server. However, there are links to API-specific information for the C and JAVA client interfaces to the technology within each of the following sections.
An option is defined by an identifier and an associated value. The identifier (hOptions) indicates what particular option is being specified. The option value data must be in a form that conforms to the set of supported data types.
Note that it is not necessarily an error to specify options that are not understood by the export engine, but some transformation engines may require that certain options be specified.
This section discusses character mapping options.
This option is used in cases where Outside In cannot determine the character set used to encode the text of an input file. When all other means of determining the file's character set are exhausted, Outside In will assume that an input document is encoded in the character set specified by this option. This is most often used when reading plain-text files, but may also be used when reading HTML or PDF files.
When the "extended test for text" is enabled (see extendedTestForText), this option will still apply to plain-text input files that are not identified as EBCDIC or Unicode.
This option supersedes the fallbackFormat option for selecting the character set assumed for plain-text files. For backwards compatibility, the deprecated character-set-related values are still supported for fallbackFormat, though internally such values are translated into equivalent values for defaultInputCharset. As a result, if an application sets both options, the last value set for either option is the one that takes effect.
Data Type
DefaultInputCharSet
Data
The SOAP representation of the character set to use, from the values in defaultInputCharSetEnum.
This option selects the character used when a character is not a valid Unicode character, or does not conform to the XML specification for valid characters. This option takes the Unicode value for the replacement character. If you are using the PageML output format, this option is only valid if the textOutOn option is set.
Data Type
xsd:unsignedShort
Data
The Unicode value for the character to use.
Default
0xfffd
Links
C Client Implementation: XSD_unsignedShort
JAVA Client Implementation: UnsignedShort
This section discusses output options.
This option is only valid on the Linux (Red Hat and Suse) and Solaris Sparc platforms.
This option is only valid when PageML is the output format.
When this option is set to true, the technology will attempt to use its internal graphics code to render fonts and graphics. When set to false, the technology will render images using the operating system's native graphics subsystem (X11 on UNIX/Linux platforms). Note that this option only works when at least one of the appropriate output solutions is present. For example, if the UNIX $DISPLAY variable does not point to a valid X Server, but the OSGD and/or WV_GD modules required for the Outside In output solution exist, Outside In will default to the Outside In rendering code. The option will fail if neither of these output solutions is present.
It is important for the system to be able to locate usable fonts when this option is set to true. Only TrueType fonts (*.ttf or *.ttc files) are currently supported. To ensure that the system can find them, make sure that the environment variable GDFONTPATH includes one or more paths to these files. If the variable GDFONTPATH can't be found, the current directory is used. If fonts are called for and cannot be found, Search Export will exit with an error. Also note that when copying Windows fonts to a UNIX system, the font extension for the files (*.ttf or *.ttc) must be lowercase, or they will not be detected during the search for available fonts. Oracle does not provide fonts with any Outside In product.
If preferOITRendering is set in a particular instance of tsagent, it cannot be changed in that agent until the agent is terminated.
Data Type
xsd:boolean
Data
One of the following values:
true: Use the technology's internal graphics rendering code to produce bitmap output files whenever possible.
false: Use the operating system's native graphics subsystem.
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
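Because a missing or miscased font file causes Search Export to exit with an error, it can help to check the font directories before enabling this option. The following is a rough diagnostic sketch, not part of the product. It assumes that GDFONTPATH holds a list of directories separated by the platform path separator (a colon on UNIX), and the function name is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Optional, Tuple

FONT_SUFFIXES = (".ttf", ".ttc")


def scan_font_dirs(gdfontpath: Optional[str], cwd: str = ".") -> Tuple[List[Path], List[Path]]:
    """Return (usable, skipped) font files.

    usable:  files whose extension is lowercase .ttf or .ttc
    skipped: files with the same extension in another case (e.g. .TTF),
             which the documentation says will not be detected
    """
    # Assumption: entries are separated by os.pathsep; fall back to the
    # current directory when the variable is unset, as the text describes.
    dirs = [d for d in gdfontpath.split(os.pathsep) if d] if gdfontpath else [cwd]
    usable: List[Path] = []
    skipped: List[Path] = []
    for directory in dirs:
        path = Path(directory)
        if not path.is_dir():
            continue
        for entry in sorted(path.iterdir()):
            if not entry.is_file():
                continue
            if entry.suffix in FONT_SUFFIXES:
                usable.append(entry)
            elif entry.suffix.lower() in FONT_SUFFIXES:
                skipped.append(entry)
    return usable, skipped
```

A typical call would be scan_font_dirs(os.environ.get("GDFONTPATH")). Any file reported in the second list probably needs its extension renamed to lowercase.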
This section discusses input handling options.
This option controls how files are handled when their specific application type cannot be determined. This normally affects all plain-text files, because plain-text files are generally identified by process of elimination: when a file isn't identified as having been created by a known application, it is treated as a plain-text file.
A number of values that were formerly allowed for this option have been deprecated. Specifically, the values that selected specific plain-text character sets are no longer to be used. Instead, applications should use the defaultInputCharset option for such functionality.
Data Type
FallbackFormatEnum
Data
One of the following values:
fallbackToText: Unidentified file types will be treated as text files.
noFallbackFormat: Outside In will not attempt to process files whose type cannot be identified. This will include text files. When this option is selected, an attempt to process a file of unidentified type will cause Outside In to return an error value of SCCERR_UNSUPPORTEDFORMAT.
Default
ASCII-8
Links
C Client Implementation: OIT_FallbackFormatEnum
JAVA Client Implementation: FallbackFormatEnum
This option affects how an input file's internal format (application type) is identified when the file is first opened by the Outside In technology. When the extended test flag is in effect, and an input file is identified as being either 7-bit ASCII, EBCDIC, or Unicode, the file's contents will be interpreted as such by the export process.
The extended test is optional because it requires extra processing and cannot guarantee complete accuracy (which would require the inspection of every single byte in a file to eliminate false positives).
Data Type
xsd:boolean
Data
One of the following values:
false: This is the default value. When this is set, standard file identification behavior occurs.
true: If set, the File Identification code will run an extended test on all files that are not identified.
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This option can disable the password verification of files where the contents can be processed without validation of the password. If this option is not set, the filter should prompt for a password if it handles password-protected files.
As of Release 8.3.5, only the PST Filter supports this option.
Data Type
xsd:boolean
Data
true: Ignore validation of the password
false: Prompt for the password
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Adobe's Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself. This option enables parsing of the XMP data into normal OIT document properties. Enabling this option may cause the loss of some regular data in premium graphics filters (such as Postscript), but won't affect most formats (such as PDF).
Data Type
xsd:boolean
Data
true: This setting enables parsing XMP.
false: This setting disables parsing XMP.
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This option allows the user to define an offset to GMT that will be applied during date formatting, allowing date values to be displayed in a selectable time zone. This option affects the formatting of numbers that have been defined as date values (e.g., most dates in spreadsheet cells). This option will not affect dates that are stored as text.
Data Type
xsd:int
Data
Integer parameter from -96 to 96, representing 15-minute offsets from GMT. To query the operating system for the time zone set on the machine, specify the numeric value of 61440 (0xF000 in hexadecimal).
Default
0: GMT time
Links
C Client Implementation: XSD_int
JAVA Client Implementation: Integer
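Since the value is expressed in 15-minute units rather than hours, a small conversion helper can prevent off-by-a-factor mistakes. The sketch below is illustrative only; the function and constant names are hypothetical, while the range and the 0xF000 value come from the description above.

```python
# Special value that asks the operating system for the machine's time zone.
USE_SYSTEM_TIME_ZONE = 0xF000  # 61440


def gmt_offset_units(offset_minutes: int) -> int:
    """Convert a GMT offset in minutes to the 15-minute units this option expects."""
    units, remainder = divmod(offset_minutes, 15)
    if remainder != 0:
        raise ValueError("offset must be a multiple of 15 minutes")
    if not -96 <= units <= 96:
        raise ValueError("offset must be between -96 and 96 units")
    return units
```

Under this conversion, an offset of five and a half hours ahead of GMT (330 minutes) corresponds to 22, and eight hours behind GMT (-480 minutes) corresponds to -32.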
Adobe's Extensible Metadata Platform (XMP) is a labeling technology that allows you to embed data about a file, known as metadata, into the file itself. This option enables the XMP feature, which passes the XMP metadata straight through without interpreting it.
Data Type
xsd:boolean
Data
true
false
Default
false
This section discusses compression options.
This option can disable access to any files using Lempel-Ziv-Welch (LZW) compression, such as .GIF files, .ZIP files or self-extracting archive (.EXE) files containing "shrunk" files. Attempts to read such files when this option is set to false will fail and return the error SCCERR_UNSUPPORTEDCOMPRESSION.
The following is a list of file types affected when this option is disabled:
GIF files
TIF files using LZW compression
PDF files that use internal LZW compression
TAZ and TAR archives containing files that are identified as FI_UNIXCOMP
ZIP and self-extracting archive (.EXE) files containing "shrunk" files
Postscript files using LZW compression
Although this option can disable access to files in ZIP or EXE archives stored using LZW compression, any files in such archives that were stored using any other form of compression will still be accessible.
Data Type
xsd:boolean
Data
true: LZW compressed files will be read and written normally.
false: LZW compressed files will not be read or written.
Default
true
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This section pertains to XML options.
When set, causes capitalized text to be output and appropriately marked. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes bold text to be output and appropriately marked. Not valid for the SearchText and PageML output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, SearchML will output a <cell> element that will encapsulate data from each non-empty cell in a spreadsheet. (Note: Numeric cells are considered empty unless FI DOCS NO HP BUILDING(3.7) is enabled.) The <cell> element will have a required attribute start which will give the location of the cell. It will also have an optional attribute end which will be used to indicate a merged cell. Both the start and end attributes will be in the form RowColumn where the Row will be a letter and Column will be a number (for example, <cell start="A1">). Valid only for the SearchML 3.x output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Includes data not originally stored as text in the input document. This can be important content the user would see when viewing the document in the original application (time and owner information in archives, numbers in spreadsheets/databases, etc.). Valid for all output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, document properties are included in the output. Default value is false. Not valid for the PageML output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes double-underlined text to be included in the output and appropriately marked. Not valid for the SearchText and PageML output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Include embeddings. Not valid for the PageML output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When this flag is set, SearchML will output an <error> element if an error occurs while processing the main document or any sub-documents. The <error> element has one required attribute, code, which will be a hex value of the error code. The contents of the element will be a string with the description of the error returned from DAGetErrorString. Valid only for the SearchML 3.x output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Include hidden text in the output. Not valid for the PageML output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Include italic text in the output. Not valid for the SearchText and PageML output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Produce only metadata. Not valid for the SearchML 2.0 and PageML output formats.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When this option is set, an attribute named oce is added either to <p> or <r> elements as appropriate. The value of the attribute is a hex representation of the character set. The value is defined by our core technology, SO_ANSIUNKNOWN for instance. Possible values for this attribute appear in the vtchars.h header file. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Include outlined text in the output. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Produce URL information when it is available. Valid for the SearchML 3.x and SearchHTML output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes added text to be output and appropriately marked. Valid for the SearchML 3.x output format only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes deleted text to be output and appropriately marked. Valid for the SearchML 3.x output format only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, revised or annotated text will be designated as such. Valid only for the SearchML 3.x output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes text in small caps to be output and appropriately marked. Valid for the SearchML output format only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes strikeout text to be output and appropriately marked. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
When set, causes underlined text to be output and appropriately marked. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This option determines whether Search Export references a SearchML or PageML schema, a DTD, or no definition at all when generating output. This option is not valid when SearchText or SearchHTML is the output format.
Data Type
XMLDefinitionMethodEnum
Data
One of the following values:
dtd: Document Type Definition (DTD)
xsd: XML Schema Definition (XSD)
noDefinition: No XML definition reference
Default
noDefinition
Links
C Client Implementation: OIT_XmlDefinitionMethodEnum
JAVA Client Implementation: XmlDefinitionMethodEnum
This option allows the developer to set a particular file as the XML definition reference.
If the xmlDefinitionMethod option is set to xsd or dtd, the value of this option will be used to reference the schema or DTD, respectively.
Data Type
xsd:string
Data
A UTF-8 encoded string specifying the location of an xsd or dtd file. If using the C API, this string must be a null-terminated array of single-byte characters.
Default
None
Links
C Client Implementation: XSD_string
JAVA Client Implementation: String
This option specifies a two-byte Unicode character that will be used to replace null characters if null path separators are being used. This option defaults to '/' and is valid for the SearchML 3.x, SearchHTML and SearchText output formats.
Data Type
xsd:unsignedShort
Data
A two-byte Unicode character that will be used to replace null characters if null path separators are being used.
Default
0x002f = "/"
Links
C Client Implementation: XSD_unsignedShort
JAVA Client Implementation: UnsignedShort
This option is Windows-specific. It is used to set which device context to use to render the pages.
It specifies, as a byte string, the name of the printer whose metrics should be used to calculate pagination information. If unspecified, the default printer will be used. The screen metrics of the system will be used if a printer is not specified and a default printer does not exist. As pagination is affected by the metrics of the device context and installed fonts, PageML XML output can vary between different systems and configurations.
Data Type
xsd:string
Data
A null-terminated single-byte string for the name of the printer which is the device context that should be used to render pages.
Default
null: PageML uses the Windows default printer.
Links
C Client Implementation: XSD_string
JAVA Client Implementation: String
Include paragraph style name references as an attribute of paragraph tags. Valid for the SearchML 2.0 and SearchML 3.x output formats only.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
The value of this option is a Boolean that, if set to true, will include offset information in the SearchML output according to the schema. If the option is set to false, no offset information is produced.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This option allows the developer to track paragraph attributes contained in the input document and, optionally, include them in the XML output. All lengths are measured in twips. The values that appear in the SearchML output are the values that apply to the first content encountered in a given paragraph. For example, if the character height changes after the initial content in a paragraph, that change will be ignored. Left and first line indents are measured relative to the left page margin. The right indent is measured relative to the right page margin.
Data Type
ParagraphAttributes
Data
The paragraphAttributes option is a complexType data structure composed of Boolean variables, which may be switched on or off in any combination. The variables are:
spacing
height
leftIndent
rightIndent
firstIndent
Default
0: All flags set to false.
Links
C Client Implementation: OIT_ParagraphAttributes
JAVA Client Implementation: ParagraphAttributes
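Because all paragraph lengths are reported in twips, consumers of the SearchML output usually need to convert them to more familiar units. The following minimal sketch relies on the standard definition of a twip (one twentieth of a point, or 1/1440 of an inch); the function names are hypothetical.

```python
TWIPS_PER_POINT = 20
TWIPS_PER_INCH = 1440


def twips_to_points(twips: int) -> float:
    """Convert a length in twips to typographic points."""
    return twips / TWIPS_PER_POINT


def twips_to_inches(twips: int) -> float:
    """Convert a length in twips to inches."""
    return twips / TWIPS_PER_INCH
```

For example, a left indent of 720 twips is half an inch, and a character height of 240 twips is 12 points.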
This option allows for the production of unmapped text (the original code points from the input document). A new <unmapped> element will be produced to enclose this text. The <unmapped> element will contain base64-encoded text. It will also contain two attributes. "OCE" will contain a hex value representing the character set. "font" will contain a string value of the original font name. This is necessary for non-standard encodings such as Wingdings or Webdings. This option is only valid in the SearchML 3.2 (and higher) schema.
Data Type
SearchMLUnmappedTextEnum
Data
One of the following values:
justUnmappedText: Output just the unmapped text
noUnmappedText: Don't output any unmapped text.
bothUnmappedText: Output both the original and the unmapped text.
Default
noUnmappedText
Links
C Client Implementation: OIT_SearchMLUnmappedTextEnum
JAVA Client Implementation: SearchMLUnmappedTextEnum
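A consumer of the SearchML output has to base64-decode the element content to recover the original code points. The sketch below shows one way this might look. It assumes a standalone <unmapped> fragment with no XML namespace and attribute names spelled exactly as quoted above; the real output should be checked against the SearchML schema. The function name is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def decode_unmapped(fragment: str) -> Tuple[Optional[str], Optional[str], bytes]:
    """Return (character set hex string, font name, raw bytes) for one <unmapped> element.

    The bytes are returned undecoded because their meaning depends on the
    character set and font, which may be a symbol encoding.
    """
    element = ET.fromstring(fragment)
    raw = base64.b64decode(element.text or "")
    return element.get("OCE"), element.get("font"), raw
```

The raw bytes are deliberately left as bytes: for symbol fonts there may be no meaningful Unicode mapping, which is the reason the option exists.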
Subdocuments in archives are not processed. Not valid for the PageML output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
Attachments are not processed. Not valid for the PageML output format.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This option is valid only for the PageML output format.
When set to true, include text in the PageML output.
Data Type
xsd:boolean
Default
false
Links
C Client Implementation: XSD_boolean
JAVA Client Implementation: Boolean
This section applies to file system options.
This option supplies information to OIT when information is required to open an input file. This information may be the password of the file or a support file location.
Further information about how Transformation Server implements this option will be forthcoming.
Used to define the number of bytes that will be read from disk into memory at any given time. Once the buffer has data, further file reads will proceed within the buffer until the end of the buffer is reached, at which point the buffer will again be filled from the disk. This can lead to performance improvements in many file formats, regardless of the size of the document.
Data Type
xsd:unsignedInt
Data
The size of the buffer in kilobytes.
Default
2
Links
C Client Implementation: XSD_unsignedInt
JAVA Client Implementation: UnsignedInt
Used to define the maximum size a document can be and still use a memory-mapped I/O model. In this situation, the entire file is read from disk into memory and all further I/O is performed on the data in memory. This can lead to significantly improved performance, but note that either the entire file can be read into memory, or it cannot. If both of these buffers are set, then if the file is smaller than the dwMMapBufferSize, the entire file will be read into memory; if not, it will be read in blocks defined by the dwReadBufferSize.
Data Type
xsd:unsignedInt
Data
The size of the buffer in kilobytes.
Default
8192
Links
C Client Implementation: XSD_unsignedInt
JAVA Client Implementation: UnsignedInt
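The interaction between the read buffer and the memory-mapped buffer can be summarized in a short sketch. It only models the decision described above and is not the product's implementation. The function name is hypothetical, the defaults mirror the documented defaults, and one kilobyte is assumed to be 1024 bytes.

```python
from typing import Tuple


def choose_io_strategy(
    file_size_bytes: int,
    read_buffer_kb: int = 2,
    mmap_buffer_kb: int = 8192,
) -> Tuple[str, int]:
    """Return the I/O model and the number of bytes loaded per disk read."""
    # Assumption: buffer sizes are given in kilobytes of 1024 bytes each.
    if file_size_bytes < mmap_buffer_kb * 1024:
        # Small enough: the whole file is read into memory once.
        return ("memory", file_size_bytes)
    # Otherwise the file is read in blocks of the read buffer size.
    return ("buffered", read_buffer_kb * 1024)
```

With the default values, a 1 MB file would be held entirely in memory, while a 100 MB file would be read 2 KB at a time. Raising the read buffer size is therefore most relevant for files that exceed the memory-mapped limit.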
The maximum size that a temporary file can occupy in memory before being written to disk as a physical file. Storing temporary files in memory can boost performance on archives and on files that have embedded objects or attachments. If set to 0, all temporary files will be written to disk.
Data Type
xsd:unsignedInt
Data
The size of the buffer in kilobytes.
Default
2048
Links
C Client Implementation: XSD_unsignedInt
JAVA Client Implementation: UnsignedInt