Oracle® Outside In Content Access Developer's Guide Release 8.4.0 Part Number E12846-03
Content Access is part of Oracle's family of OEM products known as Outside In Technology, a powerful document extraction, conversion and viewing technology that can access the information in more than 500 file formats. Content Access is a server-grade technology that provides developers with normalized access to content stored in documents across multiple platforms.
There may be references to other Outside In Technology SDKs within this manual. To obtain complete documentation for any other Outside In product, see:
http://www.oracle.com/technetwork/indexes/documentation/index.html#middleware
and click on Outside In Technology.
The updated list of supported formats is linked from the page http://www.outsideinsdk.com/. Look for the data sheet with the latest supported formats.
In Chapter 4, "Data Access Common Functions," both DASaveTreeRecord and DASaveRandomTreeRecord have a new dwSpecType. IOTYPE_REDIRECT specifies that redirected I/O will be used to save the file. DASaveInputObject now has the same dwSpecType options as the other two functions.
A new content description has been added for SCCCA_SLIDE: Presentation Slide. This element will appear at the start of each slide in a presentation file.
A new option, SCCOPT_SYSTEMFLAGS, allows you to control miscellaneous interactions between the developer and the Outside In Technology.
NSF support has been added for the Win x86-64 platform. Please see Section 2.1.1, "NSF Support" for more details.
A new function has been added: DAInitEx. It replaces DAInit and DAThreadInit and adds an option not to load or save the options file.
The following new sample applications have been added:
batch_process_ca demonstrates running Content Access in a separate process on multiple input files.
extract_archive demonstrates using the DATree API to extract all nodes in an archive.
extract_object demonstrates using Content Access to parse an input file and then using the DAObject API to extract all embedded objects.
memoryio demonstrates an in-memory solution for Content Access.
parsepst demonstrates parsing a PST file.
Support has been added for AutoCAD 2011 and 2012 files, using the OpenDesign Alliance's Teigha 3.05.00 libraries.
Support has been added for Hangul 2010 documents.
Scalable Vector Graphics (SVG) files are now identified and processed by the XML filter.
Support has been added to extract and render MSGs and EMLs to which a digital signature has been applied.
A new option, SCCOPT_HTML_COND_COMMENT_MODE, allows you to control which special comments targeted for particular versions of browsers or other products that are found in the HTML will be included in the output.
PDF files created by Acrobat 10 are now validated and processed.
Support has been added for the extraction of table data in a Microsoft Jet 3.x- or 4.x-based file. This means that for database files created in Access 95, 97, 2000, 2002, 2003, 2007, and 2010, the TABLES data can be extracted.
Support has been added for text extraction from Microsoft OneNote 2007 and 2010 files.
Support has been added for Outlook 2010 PST and OST files, including support for High Encryption in all versions of Outlook PST and OST files.
Support has been added for two types of Office 2003 files: WordProcessingML (Word 2003), text only; and SpreadSheetML (Excel 2003), text only. The XML version of the binary format will be processed, skipping embedded objects and tagging properties.
Support has been added for IBM SmartSuite 9.8 files: Lotus WordPro, Lotus 1-2-3, and Lotus Freelance.
Support has been added for Apple iWork 09 files for Mac OSX: Pages 09 PDF Preview & Text, Numbers 09 PDF Preview & Text, and Keynote 09 PDF Preview & Text.
Support has been added for WordPerfect X5 files: Word Processor, Quattro Pro, and Presentations.
Support has been added for Adobe Creative Suite 5 files: Photoshop CS5, Illustrator CS5, and InDesign CS5.
Support has been added for Solaris x86 64 version 10.
Support has been added for the IBM AIX PPC (64-bit) OS version 7.1.
Support has been added for SUSE Linux Enterprise Server 9 on Itanium 64; support for Enterprise Server 8 has been dropped.
Support has been added for SUSE Enterprise Linux 11.
Certification on HP-UX PA-RISC 11 (32-bit) has been discontinued.
Support has been added for Red Hat Enterprise Linux (RHEL) 6 on x86 64-bit.
Certification on Windows 2000 has been discontinued.
Some generic embedded objects that previously did not have names now do. These names appear in a new content type, SCCCA_OBJECTNAME: Object Name, as well as in the DAGetObjectInfo call. In addition, some objects have an alternate text string. These strings appear in a new content type, SCCCA_OBJECTALTSTRING: Alternate String, as well as in the DAGetObjectInfo call.
Two new values have been added to the dwInfoID parameter of the DAGetObjectInfo function. DAOBJECT_ALTSTRING_A retrieves the alternate string describing the object, in 8-bit characters. DAOBJECT_ALTSTRING_W retrieves the alternate string describing the object in 16-bit Unicode characters.
The PDF filter has been updated to enable support for PDFs using AES 256-bit encryption.
Note:
Not all formats that use passwords are supported. Only Microsoft Office binary (97-2003), Microsoft Office 2007, Lotus NSF, PDF (with RC4 encryption), and Zip (with AES 128- and 256-bit, or ZipCrypto) are currently supported.
Outside In Content Access provides a simple interface to extract text and metadata from business documents. This technology is particularly useful for document indexing applications. The product consists of two modules: Content Access and Text Access. Benefits include:
The ability to extract text from documents with automatic translation into a particular character set, such as Unicode or ANSI.
Access to numerous additional properties of documents that store information such as author, keywords, typist, version notes, carbon copy, checked by, subject, character and paragraph attributes, and so forth.
A common interface to the content of diverse file formats including word processing, spreadsheet, database, email, vector, and presentation formats.
The Text Access module's specific functions have tight integration with Outside In Technology, such that text generated by the text access functions is highlighted in the Viewer.
Text Access and Content Access generate the same raw text. However, the following points are important.
rawtext and Text Access will extract some text as unmappable characters because they cannot be annotated. This includes text that is not visible (for example, document properties, hidden text, and so on).
rawtext and Text Access only operate on the top-most layer of the file, and will not extract text from embedded documents. Thus, not all visible text will be extractable via rawtext or Text Access.
Content Access can be used to extract hidden text, such as document properties, and text from embedded documents.
Other Outside In products, such as Search Export and XML Export, also offer powerful text extraction and tagging abilities.
The basic architecture of Content Access is the same across all supported platforms:
Filter/Module | Description |
---|---|
Input Filter | The input filters form the base of the architecture. Each one reads a specific file format or set of related formats and sends the data to the chunker module through a standard set of function calls. There are more than 150 of these filters that read more than 500 distinct file formats. Filters are loaded on demand by the data access module. |
Chunker | The Chunker module is responsible for caching a certain amount of data from the filter and returning this data to the Content Access module. |
Content Access | The Content Access module reads data from the chunker and repackages it in a way that is convenient for the developer. This repackaging process includes mapping characters to a particular character set and converting some data (such as paragraph and cell breaks) into representative characters. CA outputs non-visible text, provides a wealth of style information, provides the information needed for the consumer to process sub-documents, and optionally produces non-textual information such as numbers in spreadsheets. |
Text Access | The Text Access module is similar to the Content Access module, although it is restricted to text. For more information, see Chapter 5, "Text Access Functions." |
Data Access | The Data Access module implements a generic API for access to files. It understands how to identify and load the correct filter for all the supported file formats. The module delivers to the developer a generic handle to the requested file, which can then be used to run more specialized processes. The Data Access module is responsible for providing a document for the Content Access module. Data Access conserves resources by creating only one file handle and one chunker handle for each file, even if it is opened in multiple Content Access instances. It also provides a unified platform for several modules in addition to Content Access, including Text Access and Remote Filter Access. |
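
The resource sharing described for the Data Access module can be pictured with a small reference-counting model. This is only a sketch under assumptions: the source does not say how Data Access shares handles internally, and `SharedDoc`, `Instance`, `attach`, and `detach` are hypothetical names, not SDK identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, NOT SDK code: one shared file resource per document,
   handed out to any number of submodule instances. */
typedef struct {
    int file_opens;   /* how many times the underlying file was opened */
    int users;        /* submodule instances currently attached        */
} SharedDoc;

typedef struct {
    SharedDoc *doc;   /* every instance points back at its document */
} Instance;

/* Attach an instance; the file is opened only for the first user. */
static Instance attach(SharedDoc *doc)
{
    Instance inst;
    assert(doc != NULL);
    if (doc->users == 0)
        doc->file_opens++;
    doc->users++;
    inst.doc = doc;
    return inst;
}

/* Detach an instance; returns the number of users still attached. */
static int detach(Instance *inst)
{
    int remaining;
    assert(inst != NULL && inst->doc != NULL && inst->doc->users > 0);
    remaining = --inst->doc->users;
    inst->doc = NULL;
    return remaining;
}
```

In this model, attaching a second instance to the same document leaves `file_opens` at 1, which is the behavior the table describes for a file opened in multiple Content Access instances.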
The following table provides definitions of some common terms.
Term | Definition |
---|---|
Developer | Someone integrating this technology into another technology or application. Most likely this is you, the reader. |
Source File | The file the developer wishes to extract content from. |
Data Access Module | The core of Outside In Data Access, in the SCCDA library. |
Data Access Submodule (also referred to as "Submodule") | This refers to any of the Outside In Data Access modules, including SCCCA (Content Access) and SCCTA (Text Access), but excluding SCCDA (Data Access). |
Document Handle (also referred to as " | A Document Handle is created when a file is opened using Data Access (see Chapter 4, "Data Access Common Functions"). Each Document Handle may have any number of Subhandles. |
Content Handle (also referred to as " | The handle created by a call to CAOpenContent or TAOpenText. Every Content Handle has a Document Handle associated with it. The DASetOption and DAGetOption functions in the Data Access Module may be called with any Content Handle or Document Handle. The DARetrieveDocHandle function returns the Document Handle associated with any Content Handle. |
Each Outside In product has an sdk directory, under which there is a subdirectory for each platform on which the product ships (for example, ca/sdk/ca_win-x86-32_sdk). Under each of these directories are the following three subdirectories:
docs: Contains both a PDF and HTML version of the product manual.
redist: Contains only the files that the customer is allowed to redistribute. These include all the compiled modules, filter support files, .xsd and .dtd files, cmmap000.bin, and third-party libraries, like freetype.
sdk: Contains the other subdirectories that used to be at the root level of an sdk: common, lib (Windows only), resource, samplefiles, and samplecode (previously samples). In addition, one new subdirectory, demo, has been added; it holds all of the compiled sample apps and other files that are needed to demo the products. These are files that the customer should not redistribute (.cfg files, exportmaps, and so forth).
In the root platform directory (for example, ca/sdk/ca_win-x86-32_sdk), there are two files:
README: Explains the contents of the sdk, and that makedemo must be run in order to use the sample applications.
makedemo (either .bat or .sh, depending on the platform): This script either copies (on Windows) or symlinks (on UNIX) the contents of …/redist into …/sdk/demo, so that sample applications can then be run out of the demo directory.
Here's a step-by-step overview of how to obtain information from a source file using Content Access.
Call DAInitEx to initialize the Data Access technology. This function needs to be called only once per application. If using threading, then pass in the correct ThreadOption.
Set "Null" options: Certain options need to be set before the desired source file is opened. These options are identified by requiring a NULL handle type. They include, but aren't limited to:
SCCOPT_FALLBACKFORMAT
SCCOPT_FIFLAGS
SCCOPT_TEMPDIR
Open the Source File: DAOpenDocument is called to create a document handle that uniquely identifies the source file. This handle may be used in subsequent calls to the CAOpenContent function or the open function of any other Data Access Submodule, and will be used to close the file when access is complete. This allows the file to be accessed from multiple Data Access Submodules without reopening.
Set other Options: Once the source document has been opened, set any other desired options. Most options will be set at this time and are identified by requiring a VTHDOC handle type.
Open a Handle to Content Access: Using the document handle, CAOpenContent is called to obtain a content handle that identifies the file to the Content Access module. This handle will be used in all subsequent calls to the Content Access functions.
Retrieve the first Information from the File: Call CAReadFirst to read the first piece of information from the file. Note: this step may be repeated to reread the file.
Retrieve other Information from the File: Repeatedly call CAReadNext, which will iteratively read through and process the file.
Process sub-documents (Optional): When you encounter a sub-document, you may process that sub-document by repeating steps 4-11. Sub-documents are identified by either the SCCCA_OBJECT type or the SCCCA_LINKEDOBJECT subtype of the SCCCA_BEGINTAG type. Note: the document handle and content handle will be different for the parent and sub-document.
Close the Content Access Handle: Call CACloseContent to terminate the content access for the file. After this function is called, the content handle will no longer be valid, but the document handle may still be used.
Close the Source File: DACloseDocument is called to close the source file. After calling this function, the document handle will no longer be valid.
De-initialize DA: DADeInit is called to de-initialize the Data Access technology.
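
The Content Access steps above follow an open, read, close pattern. The sketch below mimics that ordering with mock functions so that it compiles without the SDK. All `mock_*` names, the `MockDoc` and `MockContent` types, and `extract_all` are hypothetical stand-ins; the real calls (DAInitEx, DAOpenDocument, CAOpenContent, CAReadFirst, CAReadNext, CACloseContent, DACloseDocument, DADeInit) have their own signatures, which this sketch does not reproduce. Initialization, de-initialization, and option setting are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the Outside In SDK types or functions. */
typedef struct { const char *const *items; size_t count; int is_open; } MockDoc;
typedef struct { MockDoc *doc; size_t next; int is_open; } MockContent;

static int mock_open_document(MockDoc *doc, const char *const *items, size_t count)
{
    doc->items = items;
    doc->count = count;
    doc->is_open = 1;
    return 0;
}

static int mock_open_content(MockDoc *doc, MockContent *c)
{
    if (!doc->is_open)
        return -1;               /* a content handle needs an open document */
    c->doc = doc;
    c->next = 0;
    c->is_open = 1;
    return 0;
}

/* Return the next item, or NULL when the document is exhausted. */
static const char *mock_read_next(MockContent *c)
{
    if (!c->is_open || c->next >= c->doc->count)
        return NULL;
    return c->doc->items[c->next++];
}

/* Rewind to the start and return the first item; may be called again to reread. */
static const char *mock_read_first(MockContent *c)
{
    c->next = 0;
    return mock_read_next(c);
}

static void mock_close_content(MockContent *c) { c->is_open = 0; }
static void mock_close_document(MockDoc *d)    { d->is_open = 0; }

/* Walk the document in the order the steps describe and join the items
   with '|' into out. Returns the number of items copied, or -1 on error. */
static int extract_all(const char *const *items, size_t count,
                       char *out, size_t out_size)
{
    MockDoc doc;
    MockContent content;
    const char *item;
    int n = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    if (mock_open_document(&doc, items, count) != 0)   /* open the source file */
        return -1;
    if (mock_open_content(&doc, &content) != 0) {      /* open the content handle */
        mock_close_document(&doc);
        return -1;
    }
    for (item = mock_read_first(&content); item != NULL;
         item = mock_read_next(&content)) {
        size_t used = strlen(out);
        size_t need = strlen(item) + (n > 0 ? 1 : 0);
        if (used + need + 1 > out_size)
            break;                                     /* stop before overflowing out */
        if (n > 0)
            strcat(out, "|");
        strcat(out, item);
        n++;
    }
    mock_close_content(&content);   /* close the content handle first */
    mock_close_document(&doc);      /* then close the document */
    return n;
}
```

The close order mirrors the steps: the content handle is released before the document handle, because the document handle stays valid after the content handle is closed but not the other way around.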
Here's a step-by-step overview of how to obtain information from a source file using Text Access.
Initialize Threading (Optional): Call DAThreadInit if running in multiple threads.
Initialize DA: Call DAInit to initialize the Data Access technology. This function needs to be called only once per application.
Set "Null" options: Certain options need to be set before the desired source file is opened. These options are identified by requiring a NULL handle type. They include, but aren't limited to:
SCCOPT_FALLBACKFORMAT
SCCOPT_FIFLAGS
SCCOPT_TEMPDIR
Open the Source File: DAOpenDocument is called to create a document handle that uniquely identifies the source file. This handle may be used in subsequent calls to the TAOpenText function or the open function of any other Data Access Submodule, and will be used to close the file when access is complete. This allows the file to be accessed from multiple Data Access Submodules without reopening.
Set other Options: Once the source document has been opened, set any other desired options. Most options will be set at this time and are identified by requiring a VTHDOC handle type.
Open a Handle to Text Access: Using the document handle, TAOpenText is called to obtain a content handle that identifies the file to the Text Access module. This handle will be used in all subsequent calls to the Text Access functions.
Retrieve the first Information from the File: Call TAReadFirst to read the first piece of information from the file. Note: this step may be repeated to reread the file.
Retrieve other Information from the File: Repeatedly call TAReadNext, which will iteratively read through and process the file.
Close the Text Access Handle: Call TACloseText to terminate the text access for the file. After this function is called, the text handle will no longer be valid, but the document handle may still be used.
Close the Source File: DACloseDocument is called to close the source file. After calling this function, the document handle will no longer be valid.
De-initialize DA: DADeInit is called to de-initialize the Data Access technology.
The following notice must be included in the documentation, help system, or About box of any software that uses any of Oracle's executable code:
Outside In Content Access © 1991, 2012 Oracle.
The following notice must be included in the documentation of any software that uses Oracle's TIF6 filter (this filter reads TIFF and JPEG formats):
The software is based in part on the work of the Independent JPEG Group.