Oracle® Outside In XML Export Developer's Guide
Part Number E12888-02
XML Export allows developers to implement sophisticated text extraction from standard business documents. With the current version of XML Export, an application can access documents through a single C API. Included with XML Export is the powerful Flexiondoc schema.
There may be references to other Outside In Technology SDKs within this manual. To obtain complete documentation for any other Outside In product, see:
and click on Outside In Technology.
The updated list of supported formats is linked from the page
http://www.outsideinsdk.com/. Look for the data sheet with the latest supported formats.
New File IDs have been added.
The FI defines, values, and LO strings are:
FI_FLASH9, 1727, "Macromedia Flash 9"
FI_FLASH10, 1728, "Macromedia Flash 10"
FI_WIN_EXPLORERCMD, 2400, "Microsoft Windows Explorer Command File"
FI_7Z, 1826, "7z Archive File"
FI_TRILLIAN_TEXT, 1341, "Trillian Text Log File"
FI_TRILLIAN_XML, 1342, "Trillian XML Log File"
FI_LIVEMESSENGER, 1343, "Microsoft Live Messenger Log File"
FI_AOLMESSENGER, 1344, "AOL Messenger Log File"
FI_WINDOWSHELP, 2402, "Windows Help File"
FI_WIN_COMPILEDHELP, 2403, "Windows Compiled Help File"
FI_WIN_SHORTCUT, 2401, "Windows shortcut"
FI_TRUETYPEFONT, 2404, "TrueType Font File"
FI_TRUETYPECOLLECTION, 2405, "TrueType Font Collection File"
FI_TRUETYPEFONT_MAC, 2406, "TrueType (MAC) Font File"
FI_OUTLOOK_MSG_MAIL, 1143, "MS Outlook Mail File"
FI_OUTLOOK_OFT_MAIL, 1311, "Outlook Mail Form Template"
FI_OUTLOOK_MSG_APPT, 1345, "MS Outlook Appointment File"
FI_OUTLOOK_OFT_APPT, 1346, "Outlook Appointment Form Template"
FI_OUTLOOK_MSG_JOURNAL, 1347, "MS Outlook Journal File"
FI_OUTLOOK_OFT_JOURNAL, 1348, "Outlook Journal Form Template"
FI_OUTLOOK_MSG_CONTACT, 1349, "MS Outlook Contact File"
FI_OUTLOOK_OFT_CONTACT, 1350, "Outlook Contact Form Template"
FI_OUTLOOK_MSG_NOTE, 1351, "MS Outlook Note File"
FI_OUTLOOK_OFT_NOTE, 1352, "Outlook Note Form Template"
FI_OUTLOOK_MSG_TASK, 1353, "MS Outlook Task File"
FI_OUTLOOK_OFT_TASK, 1354, "Outlook Task Form Template"
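As an illustration of how an application might use these IDs, the following is a minimal sketch of a lookup table built from a subset of the list above. The struct and the helper names (fi_entry, fi_describe, fi_id_for_define) are hypothetical and are not part of the SDK; in a real build the FI defines come from the SDK headers rather than being retyped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record type: one row of the list above. */
struct fi_entry {
    unsigned    id;           /* numeric FI value */
    const char *define_name;  /* name of the FI define */
    const char *description;  /* LO string */
};

/* A subset of the new IDs, copied from the list above.
 * Assumption: retyped here only so the sketch compiles standalone. */
static const struct fi_entry fi_new_ids[] = {
    { 1727, "FI_FLASH9",           "Macromedia Flash 9" },
    { 1728, "FI_FLASH10",          "Macromedia Flash 10" },
    { 1826, "FI_7Z",               "7z Archive File" },
    { 1341, "FI_TRILLIAN_TEXT",    "Trillian Text Log File" },
    { 2401, "FI_WIN_SHORTCUT",     "Windows shortcut" },
    { 2403, "FI_WIN_COMPILEDHELP", "Windows Compiled Help File" },
    { 1143, "FI_OUTLOOK_MSG_MAIL", "MS Outlook Mail File" },
    { 1345, "FI_OUTLOOK_MSG_APPT", "MS Outlook Appointment File" },
};

#define FI_NEW_COUNT (sizeof fi_new_ids / sizeof fi_new_ids[0])

/* Return the LO string for an ID, or NULL if it is not in this table. */
static const char *fi_describe(unsigned id)
{
    size_t i;
    for (i = 0; i < FI_NEW_COUNT; i++) {
        if (fi_new_ids[i].id == id)
            return fi_new_ids[i].description;
    }
    return NULL;
}

/* Return the numeric value for a define name, or 0 if it is not in this
 * table. 0 is a safe sentinel here only because no row above uses it. */
static unsigned fi_id_for_define(const char *define_name)
{
    size_t i;
    assert(define_name != NULL);
    for (i = 0; i < FI_NEW_COUNT; i++) {
        if (strcmp(fi_new_ids[i].define_name, define_name) == 0)
            return fi_new_ids[i].id;
    }
    return 0;
}
```

A linear scan is adequate for a table of this size; a larger table could be sorted by ID and searched with bsearch.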
For backward compatibility, the previous FIs are defined to the MAIL IDs:
#define FI_OUTLOOK_MSG FI_OUTLOOK_MSG_MAIL
#define FI_OUTLOOK_OFT FI_OUTLOOK_OFT_MAIL
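The two aliases above keep older source compiling, but they resolve only to the mail IDs. The sketch below shows what that means for a comparison written against the old name. It assumes, as an inference from the list rather than a statement in this guide, that an application may want to treat all six MSG item types alike. The helper names are hypothetical, and the numeric values are retyped from the list only so the example stands alone.

```c
#include <assert.h>

/* Values from the list of new file IDs; normally supplied by the SDK headers. */
#define FI_OUTLOOK_MSG_MAIL     1143
#define FI_OUTLOOK_MSG_APPT     1345
#define FI_OUTLOOK_MSG_JOURNAL  1347
#define FI_OUTLOOK_MSG_CONTACT  1349
#define FI_OUTLOOK_MSG_NOTE     1351
#define FI_OUTLOOK_MSG_TASK     1353

/* Backward-compatibility alias, as given in the guide. */
#define FI_OUTLOOK_MSG FI_OUTLOOK_MSG_MAIL

/* Sanity check: the legacy name still resolves to the mail ID. */
static void check_legacy_alias(void)
{
    assert(FI_OUTLOOK_MSG == FI_OUTLOOK_MSG_MAIL);
}

/* A comparison written against the old name: matches mail items only. */
static int matches_legacy_msg(unsigned id)
{
    return id == FI_OUTLOOK_MSG;
}

/* Hypothetical helper: true for any of the six MSG item types. */
static int is_outlook_msg_item(unsigned id)
{
    switch (id) {
    case FI_OUTLOOK_MSG_MAIL:
    case FI_OUTLOOK_MSG_APPT:
    case FI_OUTLOOK_MSG_JOURNAL:
    case FI_OUTLOOK_MSG_CONTACT:
    case FI_OUTLOOK_MSG_NOTE:
    case FI_OUTLOOK_MSG_TASK:
        return 1;
    default:
        return 0;
    }
}
```

With these definitions, an appointment ID (1345) does not match the legacy comparison but is accepted by the broader helper.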
FlexionDoc can now process sub-documents, outputting their content into the main XML document (see SCCOPT_CCFLEX_FORMATOPTIONS).
OIT's internal error processing has been updated and propagation of error codes throughout OIT has been improved. In many cases the error codes reported by OIT will now more accurately reflect the actual cause of the error. DAERR is now functionally the same as SCCERR and OIT API functions that return DAERR may return any of the SCCERR values defined in sccerr.h.
Attachments in PDF Files are now supported.
The option SCCOPT_TEMPDIR now supports IOTYPE_UNICODEPATH on Windows.
The NSF filter is now supported on Linux x86-32 and Solaris Sparc 32. See NSF Support in the Unix Implementation chapter for details about Unix environment variables.
The SCCOPT_PDF_FILTER_REORDER_BIDI (SOAP equivalent: reorderBIDI) option controls whether or not the PDF filter will attempt to reorder bidirectional text runs so that the output is in the standard logical order used by the Unicode 2.0 and later specifications.
XML Export can normalize all of a document's content to the Flexiondoc schema, provided in the form of a DTD and an XML schema.
Note: All XML Export output formats are UTF-8 encoded Unicode text.
The Flexiondoc schema is designed to provide extremely dense, rich XML versions of input documents, enabling powerful applications such as document assembly, portals and content management systems.
Here are some of the schema's primary features:
Translation of documents to XML, with all characters translated to Unicode
A common interface to more than 500 file formats
Access to document properties
Support for word processor, spreadsheet, graphic, and archive formats
Support for embeddings
Special tags are created for hyperlinks, bookmarks, and sub-documents
The basic architecture of Outside In technologies is the same across all supported platforms:
|Input Filter||The input filters form the base of the architecture. Each one reads a specific file format or set of related formats and sends the data to OIT through a standard set of function calls. There are more than 150 of these filters that read more than 500 distinct file formats. Filters are loaded on demand by the data access module.|
|Export Filter||Architecturally similar to input filters, export filters know how to write out a specific format based on information coming from the chunker module. The export filters generate XML.|
|Export||The Export module implements the export API and understands how to load and run individual export filters.|
|Data Access||The Data Access module implements a generic API for access to files. It understands how to identify and load the correct filter for all the supported file formats. The module delivers to the developer a generic handle to the requested file, which can then be used to run more specialized processes, such as the Export process.|
|Schema||Schemas provide a means for defining the structure, content and semantics of XML documents. XML Export ships with the Flexiondoc schema. Schemas can be presented in the form of a DTD (Document Type Definition) or XML Schema (schema). The Flexiondoc schema is provided in both forms.|
The following terms are used in this documentation.
|Developer||Someone integrating this technology into another technology or application. Most likely this is you, the reader.|
|Source File||The file the developer wishes to export.|
|Output File||The file being written: FlexionDoc, XML, GIF, JPEG, and PNG.|
|Data Access Module||The core of Outside In Data Access, in the SCCDA library.|
|Data Access Submodule (also referred to as "Submodule")||This refers to any of the Outside In Data Access modules, including SCCEX (Export), but excluding SCCDA (Data Access).|
|Document Handle (also referred to as "
||A Document Handle is created when a file is opened using Data Access (see Chapter 4, "Data Access Common Functions"). Each Document Handle may have any number of Subhandles.|
|Subhandle (also referred to as "
||Any of the handles created by a Submodule's
Each Outside In product has an sdk directory, under which there is a subdirectory for each platform on which the product ships (for example, xx/sdk/xx_win-x86-32_sdk). Under each of these directories are the following three subdirectories:
docs - Contains both a PDF and HTML version of the product manual.
redist - Contains only the files that the customer is allowed to redistribute. These include all the compiled modules, filter support files, .xsd and .dtd files, cmmap000.bin, and third-party libraries, like freetype.
sdk - Contains the other subdirectories that used to be at the root level of an SDK: common, lib (Windows only), resource, samplefiles, and samplecode (previously samples). In addition, one new subdirectory, demo, has been added. It holds all of the compiled sample apps and other files that are needed to demo the products. These are files that the customer should not redistribute (.cfg files, exportmaps, etc.).
In the root platform directory (for example, xx/sdk/xx_win-x86-32_sdk), there are two files:
README - Explains the contents of the sdk, and that makedemo must be run in order to use the sample applications.
makedemo (either .bat or .sh – platform-based) - This script either copies (on Windows) or symlinks (on Unix) the contents of …/redist into …/sdk/demo, so that sample applications can then be run out of the demo directory.
If you load more than one OIT SDK, you must copy files from the secondary installations into the top-level OIT SDK directory as follows:
docs – copy all subdirectories named “[product name]guide” into this directory.
redist – copy all binaries into this directory.
sdk – this directory has several subdirectories: common, demo, lib, resource, samplecode, samplefiles. In each case, copy all of the files from the secondary installation into the top-level OIT SDK subdirectory of the same name. If the top-level OIT SDK directory lacks any directories found in the directory being copied from, just copy those directories over.
Here's a step-by-step overview of how to export a source file to XML.
Call DAThreadInit if running in multiple threads (optional). On the Windows, Solaris Sparc, and Linux x86-32 platforms, DAThreadInit should be called before DAInit to initialize threading if it is being used. On all other platforms, or if threading is not being used, DAInit should be the first call. See "DAThreadInit" for more information about DAThreadInit.
Call DAInit to initialize the Data Access technology. This function needs to be called only once per application.
Set any options that require a NULL handle type (optional). Certain options need to be set before the desired source file is opened. These options are identified by requiring a NULL handle type. They include, but aren't limited to:
Open the Source File. DAOpenDocument is called to create a document handle that uniquely identifies the source file. This handle may be used in subsequent calls to the EXOpenExport function or the open function of any other Data Access Submodule, and will be used to close the file when access is complete. This allows the file to be accessed from multiple Data Access Submodules without reopening.
Set the Options. If you require option values other than the default settings, call DASetOption to set options. Note that options listed in the Options Guide as having "Handle Types" that accept VTHEXPORT may be set any time before EXRunExport is called. See "DASetOption" for more information on options and how to set them.
Open a Handle to XML Export. Using the document handle, EXOpenExport is called to obtain an export handle that identifies the file to the specific export product. This handle will be used in all subsequent calls to the specific export functions. The dwOutputId parameter of this function is used to specify that the output file type should be set to FI_XML_FLEXIONDOC_LATEST.
Export the File. EXRunExport is called to generate the output file(s) from the source file.
Close the Handle to XML Export. EXCloseExport is called to terminate the export process for the file. After this function is called, the export handle will no longer be valid, but the document handle may still be used.
Close the Source File. DACloseDocument is called to close the source file. After calling this function, the document handle will no longer be valid.
Close XML Export. DADeInit is called to de-initialize the Data Access technology.
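The ordering of the steps above, and in particular the reverse-order cleanup, can be sketched as follows. This is not SDK code. Every call is replaced by a stand-in that only records the function name, so the example compiles without the Outside In headers and does not reproduce the real signatures or parameters. The optional DAThreadInit and DASetOption steps are omitted. The decision to skip DADeInit when DAInit itself fails is an assumption of this sketch, not something this guide states.

```c
#include <assert.h>
#include <string.h>

/* Steps that can fail in this sketch, in the order the guide lists them. */
enum step { STEP_DAINIT, STEP_DAOPENDOCUMENT, STEP_EXOPENEXPORT, STEP_EXRUNEXPORT };

static char call_log[256];   /* space-separated names of the calls made */
static int  fail_at = -1;    /* hypothetical hook: step to fail, or -1 for none */

/* Append a call name to the log. */
static void record(const char *name)
{
    assert(strlen(call_log) + strlen(name) + 2 <= sizeof call_log);
    if (call_log[0] != '\0')
        strcat(call_log, " ");
    strcat(call_log, name);
}

/* Stand-in for a fallible SDK call: logs the name, returns 0 on success.
 * NOT the real signature of any Outside In function. */
static int stub_call(const char *name, int step)
{
    record(name);
    return step == fail_at ? -1 : 0;
}

/* Walk the export sequence, releasing in reverse order whatever was acquired. */
static int export_sketch(int fail_step)
{
    int rc;

    call_log[0] = '\0';
    fail_at = fail_step;

    rc = stub_call("DAInit", STEP_DAINIT);
    if (rc != 0)
        return rc;                      /* nothing to undo (assumption) */

    rc = stub_call("DAOpenDocument", STEP_DAOPENDOCUMENT);
    if (rc != 0)
        goto deinit;                    /* no document handle to close */

    rc = stub_call("EXOpenExport", STEP_EXOPENEXPORT);
    if (rc != 0)
        goto close_doc;                 /* no export handle to close */

    rc = stub_call("EXRunExport", STEP_EXRUNEXPORT);

    record("EXCloseExport");            /* export handle invalid after this */
close_doc:
    record("DACloseDocument");          /* document handle invalid after this */
deinit:
    record("DADeInit");
    return rc;                          /* result of the last fallible step */
}
```

Calling export_sketch(-1) logs all seven calls in order. Calling export_sketch(STEP_EXOPENEXPORT) skips EXRunExport and EXCloseExport but still closes the document and de-initializes Data Access.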
The following notice must be included in the documentation, help system, or About box of any software that uses any of Oracle's executable code:
Outside In XML Export © 1991, 2011 Oracle.
The following notice must be included in the documentation of any software that uses Oracle's TIF6 filter (this filter reads TIFF and JPEG formats):
The software is based in part on the work of the Independent JPEG Group.