There may be references to other Outside In Technology SDKs within this manual. To obtain complete documentation for any other Outside In product, see http://www.oracle.com/technetwork/indexes/documentation/index.html#middleware and click on Outside In Technology.
The updated list of supported formats is linked from the page http://www.outsideinsdk.com/. Look for the data sheet with the latest supported formats.
The following new formats are supported:
Microsoft Word 2016
Microsoft Excel 2016
Microsoft PowerPoint 2016
MS Outlook 2011 for Mac (OLM and EML)
Corel WordPerfect X7
Corel Quattro Pro X7
Corel Presentations X7
Corel Draw X7
iWork KeyNote (text only)
AutoCAD 2015
The following new options are introduced:
A new option, SCCOPT_PDF_FILTER_MAX_EMBEDDED_OBJECTS, is added that allows you to limit the number of embedded objects produced in PDF files.
A new option, SCCOPT_PDF_FILTER_MAX_VECTOR_PATHS, is added that allows you to limit the number of vector paths produced in PDF files.
A new option, SCCOPT_PDF_FILTER_WORD_DELIM_FRACTION, is added. This allows you to control the spacing threshold in PDF input documents.
Support for the following general accuracy and fidelity features is provided:
MS Word table styles supported
MS Office Chart data label styles extended
Font selection algorithm improvements implemented
Outlook MSG “best body” algorithm implemented
PPTX Master slide Transparency provided
Four Color (CMYK) progressive JPEG supported
Processing of very large spreadsheets containing large areas of white space is optimized for improved performance
The following Operating System support is provided:
Windows 10
SLES 12
Outside In Content Access provides a simple interface to extract text and metadata from business documents. This technology is particularly useful for document indexing applications. The product consists of two modules: Content Access and Text Access. Benefits include:
The ability to extract text from documents with automatic translation into a particular character set, such as Unicode or ANSI.
Access to numerous additional properties of documents that store information such as author, keywords, typist, version notes, carbon copy, checked by, subject, character and paragraph attributes, and so forth.
A common interface to the content of diverse file formats including word processing, spreadsheet, database, email, vector, and presentation formats.
The Text Access module's specific functions have tight integration with Outside In Technology, such that text generated by the text access functions is highlighted in the Viewer.
Text Access and Content Access generate the same raw text. However, note the following points:
rawtext and Text Access will extract some text as unmappable characters because they cannot be annotated. This includes text that is not visible (for example, document properties and hidden text).
rawtext and Text Access only operate on the top-most layer of the file, and will not extract text from embedded documents. Thus, not all visible text will be extractable via rawtext or Text Access.
Content Access can be used to extract hidden text, such as document properties, and text from embedded documents.
It should be noted that other Outside In products offer powerful text extraction and tagging abilities, such as Search Export and XML Export.
The basic architecture of Content Access is the same across all supported platforms:
Filter/Module | Description |
---|---|
Input Filter | The input filters form the base of the architecture. Each one reads a specific file format or set of related formats and sends the data to the chunker module through a standard set of function calls. There are more than 150 of these filters that read more than 600 distinct file formats. Filters are loaded on demand by the data access module. |
Chunker | The Chunker module is responsible for caching a certain amount of data from the filter and returning this data to the Content Access module. |
Content Access | The Content Access module reads data from the chunker and repackages it in a way that is convenient for the developer. This repackaging process includes mapping characters to a particular character set and converting some data (such as paragraph and cell breaks) into representative characters. CA outputs non-visible text, provides a wealth of style information, provides the information needed for the consumer to process sub-documents, and optionally produces non-textual information such as numbers in spreadsheets. |
Text Access | The Text Access module is similar to the Content Access module, although it is restricted to text. For more information, see Text Access Functions. |
Data Access | The Data Access module implements a generic API for access to files. It understands how to identify and load the correct filter for all the supported file formats. The module delivers to the developer a generic handle to the requested file, which can then be used to run more specialized processes. The Data Access module is responsible for providing a document for the Content Access module. Data Access conserves resources by creating only one file handle and one chunker handle for each file, even if it is opened in multiple Content Access instances. It also provides a unified platform for several modules in addition to Content Access, including Text Access and Remote Filter Access. |
The following table provides definitions of some common terms.
Term | Definition |
---|---|
Developer | Someone integrating this technology into another technology or application. Most likely this is you, the reader. |
Source File | The file the developer wishes to extract content from. |
Data Access Module | The core of Outside In Data Access, in the SCCDA library. |
Data Access Submodule (also referred to as "Submodule") | This refers to any of the Outside In Data Access modules, including SCCCA (Content Access) and SCCTA (Text Access), but excluding SCCDA (Data Access). |
Document Handle (also referred to as " | A Document Handle is created when a file is opened using Data Access (see Data Access Common Functions). Each Document Handle may have any number of Subhandles. |
Content Handle (also referred to as " | The handle created by a call to CAOpenContent or TAOpenText. Every Content Handle has a Document Handle associated with it. The DASetOption and DAGetOption functions in the Data Access Module may be called with any Content Handle or Document Handle. The DARetrieveDocHandle function returns the Document Handle associated with any Content Handle. |
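The relationship between Document Handles and Content Handles described in the table above can be pictured with a small toy model. This is only an illustrative sketch in Python, not the Outside In C API: the classes and the functions open_document, open_content, and retrieve_doc_handle are hypothetical stand-ins for opening a file through Data Access, calling CAOpenContent or TAOpenText, and calling DARetrieveDocHandle. It also assumes that Content Handles are the Subhandles of their Document Handle.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class DocumentHandle:
    """Toy stand-in for a Document Handle: one per opened source file."""
    path: str
    subhandles: list = field(default_factory=list)


@dataclass(eq=False)
class ContentHandle:
    """Toy stand-in for a Content Handle; always tied to one Document Handle."""
    document: DocumentHandle
    kind: str  # "content" (as from CAOpenContent) or "text" (as from TAOpenText)


def open_document(path: str) -> DocumentHandle:
    # Hypothetical: models opening a source file through Data Access.
    return DocumentHandle(path)


def open_content(doc: DocumentHandle, kind: str = "content") -> ContentHandle:
    # Hypothetical: models CAOpenContent / TAOpenText on an open document.
    # Assumption: each new Content Handle is recorded as a Subhandle of the
    # document, and all of them share that single document.
    handle = ContentHandle(doc, kind)
    doc.subhandles.append(handle)
    return handle


def retrieve_doc_handle(content: ContentHandle) -> DocumentHandle:
    # Hypothetical: models DARetrieveDocHandle for a Content Handle.
    return content.document
```

In this model, opening the same document twice through open_content yields two Content Handles that both resolve to the same Document Handle, which mirrors the statement that every Content Handle has a Document Handle associated with it.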
Each Outside In product has an sdk directory, under which there is a subdirectory for each platform on which the product ships (for example, ca/sdk/ca_win-x86-32_sdk). Under each of these directories are the following subdirectories:
redist: Contains only the files that the customer is allowed to redistribute. These include all the compiled modules, filter support files, .xsd and .dtd files, cmmap000.bin, and third-party libraries, like freetype.
sdk: Contains the other subdirectories that used to be at the root level of an sdk: common, lib (Windows only), resource, samplefiles, and samplecode (previously samples). In addition, one new subdirectory, demo, has been added. It holds all of the compiled sample apps and other files that are needed to demo the products. These are files that the customer should not redistribute (.cfg files, exportmaps, and so forth).
In the root platform directory (for example, ca/sdk/ca_win-x86-32_sdk), there are two files:
README: Explains the contents of the sdk, and that makedemo must be run in order to use the sample applications.
makedemo (either .bat or .sh, depending on the platform): This script either copies (on Windows) or symlinks (on UNIX) the contents of …/redist into …/sdk/demo, so that the sample applications can then be run out of the demo directory.
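To make the effect of makedemo concrete, the following Python sketch reproduces the behavior described above: it places every entry of redist into sdk/demo, either by copying or by symlinking. It is not the shipped script. The function name make_demo, its parameters, and the choice to skip entries that already exist in demo are assumptions made for illustration.

```python
import os
import shutil
from pathlib import Path


def make_demo(platform_root: Path, use_symlinks: bool) -> list:
    """Populate <platform_root>/sdk/demo from <platform_root>/redist.

    Returns the names of the entries that were placed into demo.
    """
    redist = platform_root / "redist"
    demo = platform_root / "sdk" / "demo"
    demo.mkdir(parents=True, exist_ok=True)

    placed = []
    for entry in sorted(redist.iterdir()):
        target = demo / entry.name
        if target.exists() or target.is_symlink():
            # Assumption: leave anything already present in demo untouched.
            continue
        if use_symlinks:
            # UNIX-style behavior: link back to the redistributable file.
            os.symlink(entry.resolve(), target, target_is_directory=entry.is_dir())
        elif entry.is_dir():
            # Windows-style behavior: copy a whole subdirectory.
            shutil.copytree(entry, target)
        else:
            # Windows-style behavior: copy a single file.
            shutil.copy2(entry, target)
        placed.append(entry.name)
    return placed
```

A call such as make_demo(Path("ca/sdk/ca_win-x86-32_sdk"), use_symlinks=False) would mimic the Windows behavior, and passing use_symlinks=True would mimic the UNIX behavior.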
Here's a step-by-step overview of how to obtain information from a source file using Content Access.