Oracle® Outside In Clean Content Developer's Guide

Release 8.5

F11001-05


Table of contents

Introduction

Definitions

Features

File Formats

Support

New features

SDK layout

Architecture

Java

C/C++

.NET

Using the API

Initialization

Basic Use

Request

Response

Document IO

Targets

Analysis and Scrubbing

Extraction

Embedding Recursion

Embedding Export

Embedding Replacement

PowerPoint Disassembly/Assembly

Threading

Exception Handling

Install and Coding Guidelines

Java

C/C++

.NET

Technical Notes

Introduction

The Outside In Clean Content SDK provides all the components, documentation, samples and other resources required by third party developers to integrate Oracle's document analysis, scrubbing, extraction and export technology into their own applications.

Definitions

The following definitions are used throughout this documentation and the Clean Content API.

document
This term is used broadly and generically in the documentation and API to refer to any file such as a word processing document, a spreadsheet, a presentation, a PDF, etc.

target
Some feature or piece of information in a document that can be identified and in many cases removed (see scrub below). Most targets relate to well known security risks in popular file formats although some, like identifying a document as being encrypted, are more general.

analyze
To determine if a given target exists in the document

scrub
To remove a given target from a document

extract
To provide the developer with text, structure and other information in the document

export
To copy objects, images and other artifacts embedded in a document to standalone files

disassembly
To take a document with multiple parts (slides in PowerPoint is the only current example) and split it into multiple standalone documents, one per part

assembly
To take several documents and merge them into a single document

Features

The Clean Content API exposes the following major features;

File formats

Clean Content supports the following primary file formats. Many other formats (such as Windows Metafile) that are commonly associated with these primary formats are also supported. Note that hundreds of additional file formats are supported when using the optional integration with Outside In Search Export.

Adobe Acrobat (PDF) includes all versions
Support: Analyze, Extract, Export
Extensions: pdf

Adobe Forms Data Format
Support: Analyze, Extract, Export
Extensions: fdf

Compact Font Format
Support: Analyze, Scrub, Extract
Extensions: cff

Microsoft Docfile includes formats such as Microsoft Visio, Microsoft Project, etc.
Support: Analyze (properties only), Scrub (properties only), Extract (properties only)

Microsoft Excel 2007 and above
Support: Analyze, Scrub, Extract, Export
Extensions: xlsx xlsb xlsm xltx xltm xlam

Microsoft Excel 2007 and above binary
Support: Analyze (limited), Scrub (properties and macros only), Extract
Extensions: xlsb

Microsoft Excel 2010 binary
Support: Analyze (limited), Scrub (properties and macros only), Extract
Extensions: xlsb

Microsoft Excel 2013/2016 binary
Support: Analyze (limited), Scrub (properties and macros only), Extract
Extensions: xlsb

Microsoft Excel 97 thru 2003
Support: Analyze, Scrub, Extract, Export
Extensions: xls

Microsoft PowerPoint 2007 and above
Support: Analyze, Scrub, Extract, Export, Assembly, Disassembly
Extensions: pptx pptm potx potm ppsx ppsm ppam

Microsoft PowerPoint 97 thru 2003
Support: Analyze, Scrub, Extract, Export, Assembly, Disassembly
Extensions: ppt pps pot ppa

Microsoft Word 2007 and above
Support: Analyze, Scrub, Extract, Export
Extensions: docx docm dotx dotm

Microsoft Word 97 thru 2003
Support: Analyze, Scrub, Extract, Export
Extensions: doc dot
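
As a reading aid for the list above, the following sketch shows one way an application might record which operations are listed for a file extension before deciding what to request. It is not part of the Clean Content API: the FormatSupport class, its Operation enum and the supports method are hypothetical names invented for this example, and the table is copied by hand from the list above (xlsb is omitted because it appears under more than one entry). A file extension is only a hint; the SourceFormat result described later in this guide reports the format Clean Content actually identified.

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the Clean Content SDK.
final class FormatSupport {

    enum Operation { ANALYZE, SCRUB, EXTRACT, EXPORT, ASSEMBLY, DISASSEMBLY }

    private static final Map<String, Set<Operation>> BY_EXTENSION = new HashMap<>();

    private static void register(Set<Operation> operations, String... extensions) {
        for (String extension : extensions) {
            BY_EXTENSION.put(extension, operations);
        }
    }

    static {
        // PDF and FDF: Analyze, Extract, Export (no Scrub in the list above).
        register(EnumSet.of(Operation.ANALYZE, Operation.EXTRACT, Operation.EXPORT), "pdf", "fdf");
        // Compact Font Format: Analyze, Scrub, Extract (no Export).
        register(EnumSet.of(Operation.ANALYZE, Operation.SCRUB, Operation.EXTRACT), "cff");
        // Word and Excel: Analyze, Scrub, Extract, Export.
        register(EnumSet.of(Operation.ANALYZE, Operation.SCRUB, Operation.EXTRACT, Operation.EXPORT),
                "doc", "dot", "docx", "docm", "dotx", "dotm",
                "xls", "xlsx", "xlsm", "xltx", "xltm", "xlam");
        // PowerPoint additionally lists Assembly and Disassembly.
        register(EnumSet.allOf(Operation.class),
                "ppt", "pps", "pot", "ppa", "pptx", "pptm", "potx", "potm", "ppsx", "ppsm", "ppam");
    }

    /** Returns true if the list above names the operation for this file's extension. */
    static boolean supports(String fileName, Operation operation) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Set<Operation> operations = BY_EXTENSION.get(extension);
        return operations != null && operations.contains(operation);
    }
}
```

For example, FormatSupport.supports("report.PDF", FormatSupport.Operation.SCRUB) is false because the PDF entry lists only Analyze, Extract and Export.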

Support

For pre-sales support please contact your Oracle sales representative.

Oracle customers have access to electronic support through My Oracle Support.  For information, visit http://www.oracle.com/support/contact.html or visit https://www.oracle.com/corporate/accessibility/learning-support.html#support-tab if you are hearing impaired.

New Features

·         A flag-based OOXML (OfficeXML) feature has been introduced which enables you to do the following in your XML files:

o    NOTE: For detailed information, see SecureOptions.OfficeXMLFeatures.

o    Identify and/or remove all CDATA constructs.

o    Identify and/or remove all XML comments within the XML.

o    Identify and/or remove all XML processing instructions within the XML.

o    Identify and/or remove external entity references within the XML.

o    Remove leading and trailing whitespace within the XML.

o    Identify uncommon or unexpected XML namespaces in XML files. These namespaces can now be blacklisted using the Blacklist a namespace option. In the demo application, it can be found under Set Scrub Option -> Additional Option.

o    Use canonicalization of XML. Refer to SecureOptions.OfficeXMLCanonicalization. For more information, see the Javadoc in the Clean Content SDK.

o    Create a log file corresponding to each file being processed for removal of XML CDATA, XML Comments, XML Processing Instructions and XML External Entity within the XML.

o    Scrub unknown namespaces within the XML.

o    Rename XML namespace prefixes.

o    Whitelist known namespace prefixes.

o    Identify and scrub unused namespaces.

o    Remove bounding whitespace within text elements.

o    KNOWN ISSUES: The canonicalizer does not currently canonicalize all XML files in MS Office files. It canonicalizes Content_Types.xml and all rel files for all MS Office files. It also canonicalizes document.xml for .docx files, workbook.xml for .xlsx files, and presentation.xml for .pptx or .ppsx files. All other associated XML files, such as docProps/app.xml, core.xml, and fontTable.xml, will be canonicalized in a future release.

·         Scrubbing macros from Excel files.

·         A new option, SecureOption.ValidateEmbeddedContent, is now available to validate embedded images in MS Office files. Setting this option to true allows extraction to report OfficeXMLPartDisclosureRisks if any exist in a file. Such masquerading files are treated as rogue elements. Rogue parts are scrubbed automatically whether or not this option is enabled, because rogue parts serve no known valid purpose.

·         Ability to unhide comment fields.

·         Scrubbing of color obfuscated text from PDF files.

·         Extract font details from Microsoft Excel, PowerPoint, and Word files.
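
To make the XML constructs named in the list above concrete, the following sketch counts comments, processing instructions and CDATA sections in an XML string using only the JDK's built-in DOM parser. It illustrates what these constructs are; it is not Clean Content's implementation and does not use SecureOptions.OfficeXMLFeatures. The XmlConstructScan class and its scan method are hypothetical names invented for this example. The parser is configured to reject DOCTYPE declarations, which is one common way to keep external entity references from being resolved.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration, not part of the Clean Content SDK.
final class XmlConstructScan {

    /** Number of each construct found in the parsed XML. */
    static final class Counts {
        int comments;
        int processingInstructions;
        int cdataSections;
    }

    static Counts scan(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Refuse DOCTYPE declarations so external entities are never resolved.
        factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        factory.setNamespaceAware(true);
        // Keep CDATA sections as separate nodes instead of merging them into text.
        factory.setCoalescing(false);

        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        Counts counts = new Counts();
        walk(document, counts);
        return counts;
    }

    // Depth-first walk over the DOM tree, including nodes outside the root element.
    private static void walk(Node node, Counts counts) {
        switch (node.getNodeType()) {
            case Node.COMMENT_NODE:
                counts.comments++;
                break;
            case Node.PROCESSING_INSTRUCTION_NODE:
                counts.processingInstructions++;
                break;
            case Node.CDATA_SECTION_NODE:
                counts.cdataSections++;
                break;
            default:
                break;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }
}
```

Note that the XML declaration at the top of a file is not reported as a processing instruction by the DOM, so only instructions such as `<?target data?>` are counted.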

SDK Layout

The SDK's directory structure provides easy access to all the components, samples, documentation and other files needed to integrate the Clean Content SDK into your application.

CleanContentSDK

The root directory of the SDK

CleanContentSDKDemoWin32.exe
CleanContentSDKDemoWin64.exe
CleanContentSDKDemoLinux32.sh
CleanContentSDKDemoLinux64.sh

OS specific launchers for the Clean Content SDK demo application. This Java application is designed to demonstrate the full potential of the Clean Content API and allow developers to explore the analysis, scrubbing, extraction and export behavior of this SDK in a full featured GUI environment.

CleanContentSDKDemoGeneric.sh

A generic launcher for the Clean Content SDK demo application on Unix style operating systems. It will use the JAVA_HOME environment variable followed by locate bin/java to find an appropriate Java Runtime Environment to use. Requires the bash shell.

index.html

The Clean Content Developer's Guide

app

Directory containing components and documentation for the CleanContentSDKDemo application

c

Directory containing libraries, include files, samples and other files required to use the Clean Content C/C++ API.

include

Directory containing include files required to use the C/C++ API. Most importantly it contains secureapi.h which is the only file your code needs to include.

lib

Directory containing native code libraries needed to use the C/C++ API plus the test and sample app executables.

Windows

Directory containing DLLs and LIBs needed to use the C/C++ API on Microsoft Windows plus the test and sample application EXEs. This directory will include one or more sub-directories that correspond to the processor architecture for which Clean Content is available. Currently x86 and x64 architectures are available.

Linux

Directory containing library archives and shared objects needed to use the C/C++ API on Linux plus the test and sample applications. This directory will include one or more sub-directories that correspond to the processor architecture for which Clean Content is available. Currently x86 and x64 architectures are available.

api

Directory containing the full source code to the C/C++ API.

apitest

Directory containing a cross-platform, pure C, test application designed to exercise the C API.

sanitytest

Directory containing a cross-platform, C++, test application that uses the files in CleanContentSDK/samplefiles/targets to verify that basic document analysis is working correctly.

csample

Directory containing a cross-platform, pure C, sample application.

cppsample

Directory containing a cross-platform, C++, sample application.

dumptext

Directory containing a cross-platform, C++ sample application that shows how to retrieve the text out of a document using an element handler.

docs

Directory containing documentation for the SDK

cdoc

Directory containing C/C++ API documentation

javadoc

Directory containing Java API documentation

dotnetdoc

Directory containing .NET API documentation

technotes

Directory containing technical notes

java

Directory containing components, samples and other files required to use the Clean Content Java API

lib

Directory containing CleanContent.jar that should be shipped with your application. See Install and Coding Guidelines.

sample

Directory containing Java API sample applications. Sample directories include batch files and shell scripts to build and run each application.

AnalyzeDirectorySample

Directory containing a command line sample application that analyzes all the documents in a given directory.

dotnet

Directory containing components, samples and other files required to use the Clean Content .NET API

lib

Directory containing CleanContentNET.dll that should be shipped with your application plus the test app executables. Shipping this DLL alone is not enough; please see the .NET Install and Coding Guidelines.

apitest

Directory containing a .NET test application designed to exercise the basics of the .NET API.

sanitytest

Directory containing a .NET test application that uses the files in CleanContentSDK/samplefiles/targets to verify that basic document analysis is working correctly.

jres

Directory containing four Java Runtime Environments (Win32, Win64, Linux32 and Linux64) needed to run the CleanContentSDKDemo application and the Java sample applications on the supported operating systems. These JREs may also be distributed with the developer's application. Oracle chooses to ship these JREs along with its SDK instead of requiring developers to "install Java" before using the demo app.

samplefiles

Directory containing documents that can be used to test Clean Content's behavior and your application. See the readme.txt file in this directory for detailed information.

samples

Directory containing the original set of Clean Content sample documents.

targets

Directory containing documents that exercise all the targets Clean Content can identify for the various supported file formats. Oracle uses these documents internally as one part of Clean Content's automated QA process.

exception

Directory containing a series of Microsoft Word documents built specifically to trigger Clean Content to generate certain exceptions, including null pointer exception and out of memory exception. The document names indicate the Java exception they generate. These documents were developed to help customers build QA processes that include exception testing. Note that these documents do not exercise flaws in Clean Content; rather, certain bytes have been modified so that Clean Content's Microsoft Word transform, which tests those bytes, triggers these specific exceptions on purpose.

Architecture

Java

The core of the SDK is a set of Java classes that perform the actual analysis, scrubbing, extraction and export. These classes are delivered as CleanContent.jar. If your application is written in Java or has direct access to Java classes (a web site using Java Server Pages for example) the jar can be used directly through the Clean Content Java API.

What Java runtime to use?

If you are already using Java then you probably have an existing Java Runtime Environment (JRE) that you use and/or require. If you plan on using Clean Content's C/C++ or .NET interfaces then this might be the first time you've been exposed to the JRE choices available to you. Clean Content requires a Java Standard Edition 6 compatible JRE and ships with four versions of Oracle's JRE 6 (in the jres subdirectory of the SDK).

C/C++

Clean Content's C/C++ API is built on top of the Clean Content Java API allowing your C or C++ application to run Clean Content "in process" for maximum performance while getting all the stability and safety features of Java. This is accomplished by providing a native code library (CleanContentAPI.dll or libCleanContentAPI.so for example) that does all the work of loading the Java VM into your process and interfacing with Clean Content's Java core. This is done without requiring that you or your customer "install Java" on the target system. The Java components (CleanContent.jar and the JRE subdirectory) may be local to your application with no impact on the rest of the system. In this instance, Java is simply a number of extra DLLs or SOs that are being dynamically loaded into your process.

This architecture was selected in order to meet the requirement of high performance in-process parsing while still protecting your process from the problems often caused by the limitless variations of complex, malformed, hacked and truncated documents. The C and C++ APIs provide the interfaces that meet your application's needs while the Java VM provides a stable and well tested platform that protects your application from wild pointers and buffer overflows that plague parsers written in native code. Running these documents inside a VM protects your applications while avoiding the complexity and performance problems of "out of process" solutions.

.NET

Clean Content's .NET API is built on top of the C API using .NET's interop services. As with the C/C++ API the .NET API runs Clean Content "in process" (the Common Language Runtime and the Java Virtual Machine can coexist in the same process) for maximum performance and ease of integration all without requiring you or your customer to "install Java". Please review the C/C++ section above for further details.

Using the API

Initialization

The Java API requires no per-process or per-thread initialization and your code may immediately begin creating SecureRequest objects. The C/C++ and .NET APIs however require per-process and per-thread initialization in order to interface correctly with the underlying Java VM. In these environments the following guidelines must be followed for initialization...

Basic Use

A developer's primary interaction with this API is through a SecureRequest object, or a SecureRequest handle in the case of the C API. From here on this document uses object/class/method semantics; C API users should be aware that a SecureRequest handle is equivalent to a SecureRequest object. This class consists mostly of methods that allow the developer to get and set a collection of typesafe options found in the SecureOptions object. In addition, SecureRequest contains methods for executing the request and for getting the results. This follows Clean Content's basic design philosophy for long-term APIs, which favors extensible, typesafe, self-describing options over more concrete methods attached directly to the SecureRequest.

The basic execution flow of a simple application that needs to process multiple documents is as follows.

  1. Call Startup (C, C++ and .NET only)
  2. Create a new SecureRequest object
  3. Use that object's various setOption methods with the options in SecureOptions to define which parts of each document should be analyzed or scrubbed and/or to set extraction, export and other processing parameters.
  4. Set the SourceDocument option to define the next document to be processed. If all documents are done go to step 8.
  5. Call the SecureRequest's execute method.
  6. Use the SecureRequest's getResponse method to determine the outcome of the operation.
  7. Return to step 4.
  8. Call Shutdown (C, C++ and .NET only)

Below are code samples for complete Java, C++ and C# programs that show how to analyze a single document for targets. Notice that the .NET API requires explicit Close methods for SecureRequest and SecureResponse objects. For more details see the .NET Install and Coding Guidelines.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.options.Option;
import net.bitform.api.options.AnalyzeOption;

import java.io.File;
import java.io.IOException;

public class Main {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Only analysis will occur and no output file
        // will be created regardless of other settings
        request.setOption(SecureOptions.JustAnalyze, true);

        // Set the document to be analyzed
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Check for success
            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document
                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());
                System.out.println("The file contains the following targets...");

                // Print a list of targets present in the document
                Option[] options = SecureOptions.getInstance().getAllOptions();

                for (int j = 0; j < options.length; j++) {
                    if (options[j] instanceof ScrubOption) {
                        if (response.getResult((ScrubOption) options[j]) == ScrubOption.Reaction.EXISTS)
                            System.out.println(options[j].getName());
                    } else if (options[j] instanceof AnalyzeOption) {
                        if (response.getResult((AnalyzeOption) options[j]) == AnalyzeOption.Reaction.EXISTS)
                            System.out.println(options[j].getName());
                    }
                }

            } else {

                // Processing failed
                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Only analysis will occur and no output file
    // will be created regardless of other settings
    request->SetOption(BFSecureOptions::JustAnalyze, true);

    // Set the document to be analyzed
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success
    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document
      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;
      wcout << L"The file contains the following targets..." << endl;

      // Print scrub targets that exist in the document
      int scrubCount;
      const ScrubOptions * so = BFSecureOptions::GetAllScrubOptions(&scrubCount);

      for (int i = 0; i < scrubCount; i++) {
        ScrubOptionReactions result = response->GetScrubResult(so[i]);
        if (result == ScrubOption_Reaction_Exists) {
          wchar_t name[1024];
          BFGetOptionName(so[i],name,1024,NULL);
          wcout << name << endl;
        }
      }

      // Print analyze targets that exist in the document
      int analyzeCount;
      const AnalyzeOptions * ao = BFSecureOptions::GetAllAnalyzeOptions(&analyzeCount);

      for (int i = 0; i < analyzeCount; i++) {
        ScrubOptionReactions result = response->GetAnalyzeResult(ao[i]);
        if (result == ScrubOption_Reaction_Exists) {
          wchar_t name[1024];
          BFGetOptionName(ao[i],name,1024,NULL);
          wcout << name << endl;
        }
      }

    } else {
      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Only analysis will occur and no output file
            // will be created regardless of other settings
            request.SetOption(SecureOptions.JustAnalyze, true);

            // Set the document to be analyzed
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                    Console.WriteLine("The file contains the following targets...");

                    // Print a list of targets present in the document
                    Option[] options = SecureOptions.AllOptions;

                    foreach (Option option in options)
                    {
                        if (option is ScrubOption)
                        {
                            if (response.GetResult((ScrubOption)option) == ScrubOption.Reaction.EXISTS)
                                Console.WriteLine(option.Name);
                        }
                        else if (option is AnalyzeOption)
                        {
                            if (response.GetResult((AnalyzeOption)option) == AnalyzeOption.Reaction.EXISTS)
                                Console.WriteLine(option.Name);
                        }
                    }
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

Request

The SecureRequest object represents a reusable request to perform actions on a document. A single SecureRequest is created and reused to process as many documents as necessary within a single thread (see Threading for details). SecureRequest objects act as a container for options that describe how the source document should be processed and the developer may use them as such. For example, if a developer needed to process documents in three different ways depending on the situation they might create three SecureRequest objects, load each with the proper options (using setOption) then use the appropriate one for each document.

XML Persistence

The SecureRequest object includes readXML and writeXML methods that allow its state (the options that have been set using setOption) to be written to and read from an XML file. While the XML is fairly self explanatory, the schema is not fixed and is currently not documented so developers should resist the urge to generate XML in this schema themselves.

Response

After a call to a SecureRequest object's execute method, the developer should retrieve a SecureResponse object using the getResponse method and then query this object for the results of the processing using its getResult methods. Like the SecureRequest object's setOption method, the SecureResponse object's getResult method takes options contained in the SecureOptions class. Options that are valid to provide to getResult include the following.

ProcessingStatus
Provides the result of processing the document. Returns one of the following:

SourceFormat
The file format of the source document.

LoggedError
True if an error was logged, false if not.

LoggedWarning
True if a warning was logged, false if not.

ScrubbedFormat
The file format of the scrubbed document. If null is returned, the file format is the same as the SourceFormat (see above). If a file format is returned, the format was changed. Currently this only occurs when macros are scrubbed from an Office 2007/2010 document that contains macros. In these cases the extension of the scrubbed document must be changed or Office 2007/2010 will not open the scrubbed document! The new extension can be retrieved from the file format using the getExtension method. For example, if a Word document with macros (.docm) is scrubbed and macros are removed, this option will return FileFormat.WORD2007 while the SourceFormat option (see above) will be FileFormat.WORD2007MACROS.

DecryptionStatus
Provides information about decryption of the processed document. Returns one of the following:

WasProcessed deprecated
True if the document was successfully processed, false if not.

WasIdentified deprecated
True if the format of the document could be determined, false if not.

WasSupported deprecated
True if the document's file format is supported. For example, we may be able to identify some document types (like RTF, WordPerfect, etc.) but do not currently support processing them.

WasException deprecated
True if an exception was thrown during processing. Even though the developer's code will catch the exception, the SecureResponse can still be retrieved and will reflect the fact an exception was thrown.

WasTimeout deprecated
True if processing was interrupted because it took longer than the value in the RequestTimeout option, false if not.

In addition to the options above, any target (see Targets below) may be passed to getResult in order to determine if that target exists in the source document and if it was removed.
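
The ScrubbedFormat description above says the scrubbed document's extension must be changed when the format changes. One way to compute the new file name is to replace the old extension rather than append to it, as in the minimal sketch below. It uses only the JDK; the ScrubbedName class and withExtension method are hypothetical names for this example, and it assumes the extension is supplied without a leading dot (as the string returned by getExtension is used in the Java sample in this section).

```java
// Hypothetical helper, not part of the Clean Content SDK.
final class ScrubbedName {

    /**
     * Returns fileName with its last extension replaced by newExtension.
     * If fileName has no extension, the new one is appended.
     */
    static String withExtension(String fileName, String newExtension) {
        int dot = fileName.lastIndexOf('.');
        // A dot at index 0 marks a name such as ".profile", not an extension.
        String base = (dot > 0) ? fileName.substring(0, dot) : fileName;
        return base + "." + newExtension;
    }
}
```

With this helper, a scrubbed copy of test.docm whose ScrubbedFormat extension is docx would be renamed to test.docx.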

Below are code samples for complete Java, C++ and C# programs that deal with all possible results in SecureResponse.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.options.EnumOptionValue;
import net.bitform.api.FileFormat;
import net.bitform.api.SharedOptions;

import java.io.File;
import java.io.IOException;

public class Response {

   
public static void main(String[] args) {

       
// Create a request

       
SecureRequest request = new SecureRequest();

       
// Set the default scrubbing behavior to NONE

       
request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

       
// Set just Macros and Code to be scrubbed

       
request.setOption(SecureOptions.MacrosAndCode,ScrubOption.Action.SCRUB);

       
// Set the document to be scrubbed.
        // In this case it's a Word 2007 document containing macros

       
File sourceDocument = new File("c:/temp/test.docm");
        request.setOption
(SecureOptions.SourceDocument, sourceDocument);

       
// Set the scrubbed document

       
File scrubbedDocument = new File("c:/temp/out/",sourceDocument.getName());
        request.setOption(SecureOptions.ScrubbedDocument, scrubbedDocument);

        IOException requestException = null;

        try {
            // Execute the request
            request.execute();
        } catch (IOException ex) {
            // Save the exception
            requestException = ex;
        }

        // Get the response object
        // Note that the request is still valid (and can be reused) after an exception

        SecureResponse response = request.getResponse();

        // Do complete result check

        EnumOptionValue status = response.getResult(SecureOptions.ProcessingStatus);

        if (status == SecureOptions.ProcessingStatusOption.Processed) {

            FileFormat sourceFormat = response.getResult(SecureOptions.SourceFormat);

            System.out.println("The document "+sourceDocument.getName()+
                    " was identified as "+sourceFormat.getName()+" and was processed correctly.");

            FileFormat scrubbedFormat = response.getResult(SecureOptions.ScrubbedFormat);

            if (scrubbedFormat != null) {

                // The file format (and therefore the file extension) has changed so we
                // need to rename the scrubbed document. This code just renames the scrubbed file
                // by tacking on the new extension.

                File newScrubbedDocument = new File(scrubbedDocument.getParentFile(),
                        scrubbedDocument.getName()+"."+scrubbedFormat.getExtension());
                if (newScrubbedDocument.exists()) newScrubbedDocument.delete();
                scrubbedDocument.renameTo(newScrubbedDocument);
            }

        } else if (status == SecureOptions.ProcessingStatusOption.NotIdentified) {

            System.out.println("The document "+sourceDocument.getName()+" could not be identified.");

        } else if (status == SecureOptions.ProcessingStatusOption.NotSupported) {

            FileFormat sourceFormat = response.getResult(SecureOptions.SourceFormat);
            System.out.println("The document "+sourceDocument.getName()+
                    " was identified as "+sourceFormat.getName()+" but that format is not supported.");

        } else if (status == SecureOptions.ProcessingStatusOption.CausedException) {

            System.out.println("The document "+sourceDocument.getName()+" caused an exception.");
            if (requestException != null) requestException.printStackTrace();

        } else if (status == SecureOptions.ProcessingStatusOption.Timeout) {

            FileFormat sourceFormat = response.getResult(SecureOptions.SourceFormat);
            System.out.println("The document "+sourceDocument.getName()+
                    " was identified as "+sourceFormat.getName()+" but processing timed out.");

        } else {

            System.out.println("Invalid ProcessingStatus! This will never happen.");

        }

        if (response.getResult(SecureOptions.LoggedWarning)) {
            System.out.println("Warnings were logged.");
        }

        if (response.getResult(SecureOptions.LoggedError)) {
            System.out.println("Errors were logged.");
        }
    }
}
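
The rename step in the Java sample above ignores the boolean returned by File.renameTo, so a failed rename goes unnoticed. As a hedged alternative, the following sketch uses only the standard java.nio.file API to append the new extension and replace any existing file, throwing an IOException on failure. The class name ScrubbedRename and the method appendExtension are hypothetical names chosen for illustration; they are not part of the Clean Content API.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of the Clean Content API.
final class ScrubbedRename {

    // Renames "name" to "name.<extension>" in the same directory,
    // replacing an existing file of that name, and returns the new path.
    static Path appendExtension(Path scrubbed, String extension) throws IOException {
        Path target = scrubbed.resolveSibling(scrubbed.getFileName() + "." + extension);
        // Files.move throws an IOException if the rename cannot be performed
        return Files.move(scrubbed, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In the sample, the extension argument would come from scrubbedFormat.getExtension() and the path from scrubbedDocument.toPath().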

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  // Initialize the Clean Content API
  BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

  // Create a request
  BFSecureRequest * request = new BFSecureRequest();

  // Set the default scrubbing behavior to NONE
  request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_None);

  // Set just Macros and Code to be scrubbed
  request->SetOption(BFSecureOptions::MacrosAndCode,ScrubOption_Action_Scrub);

  // Set the document to be scrubbed
  // In this case it's a Word 2007 document containing macros
  std::wstring sourceDocument(L"c:/temp/test.docm");
  request->SetOption(BFSecureOptions::SourceDocument, sourceDocument);

  // Set the scrubbed document
  std::wstring scrubbedDocument(L"c:/temp/out/test.docm");
  request->SetOption(BFSecureOptions::ScrubbedDocument, scrubbedDocument);

  // Execute the request
  BFTransformException requestException;

  try {
    request->Execute();
  } catch (BFTransformException & ex) {

    // Note that we just collect the exception information
    // here. Exceptions do not put the request in an invalid
    // state so 'normal' retrieval of the response may continue.
    // The response will show that the request caused an
    // exception.

    requestException = ex;
  }

  // Get the response object
  BFSecureResponse * response = request->GetSecureResponse();

  // Get the status
  int status = response->GetEnumResult(BFSecureOptions::ProcessingStatus);

  switch(status) {

    case SecureOptions_ProcessingStatus_Processed:
      {
        FileFormats sourceFormat = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
        std::wstring sourceFormatName;
        BFSecureRequest::GetFileFormatName(sourceFormat, sourceFormatName);

        wcout << L"The document " << sourceDocument <<
          " was identified as " << sourceFormatName <<
          " and was processed correctly." << endl;

        FileFormats scrubbedFormat = response->GetFileFormatResult(BFSecureOptions::ScrubbedFormat);

        if (scrubbedFormat != NULL) {

          // The file format (and therefore the file extension) has changed so we
          // need to rename the scrubbed document. This code just renames the scrubbed
          // file by tacking on the new extension.
          //
          // In this particular case the scrubbed .docm file must be renamed .docx
          // or it will not open in Microsoft Office.

          std::wstring scrubbedFormatExtension;
          BFSecureRequest::GetFileFormatExtension(scrubbedFormat, scrubbedFormatExtension);

          std::wstring newScrubbedDocument(scrubbedDocument);
          newScrubbedDocument.append(L".");
          newScrubbedDocument.append(scrubbedFormatExtension);

          _wremove(newScrubbedDocument.c_str());
          _wrename(scrubbedDocument.c_str(),newScrubbedDocument.c_str());
        }
      }

      break;

    case SecureOptions_ProcessingStatus_NotIdentified:
      wcout << L"The document " << sourceDocument <<
        " could not be identified." << endl;
      break;

    case SecureOptions_ProcessingStatus_NotSupported:
      {
        FileFormats sourceFormat = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
        std::wstring sourceFormatName;
        BFSecureRequest::GetFileFormatName(sourceFormat, sourceFormatName);

        wcout << L"The document " << sourceDocument <<
          " was identified as " << sourceFormatName <<
          " but that format is not supported." << endl;
      }

      break;

    case SecureOptions_ProcessingStatus_CausedException:
      {
        wcout << L"The document " << sourceDocument <<
          " caused an exception." << endl;

        wcout << requestException.wwhat() << endl;
        wcout << requestException.wextended() << endl;

        BFTransformException * cause = requestException.getCause();

        while (cause != NULL) {
          wcout << cause->wwhat() << endl;
          wcout << cause->wextended() << endl;
          cause = cause->getCause();
        }
      }

      break;

    case SecureOptions_ProcessingStatus_Timeout:
      wcout << L"The document " << sourceDocument <<
        " timed out." << endl;
      break;
  }

  BFSecureRequest::Shutdown();

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class Response
    {
        static void Main(string[] args)
        {
            // Initialize API

            SecureHelper.Startup(true);

            // Create a request

            SecureRequest request = new SecureRequest();

            // Set the default scrubbing behavior to NONE

            request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

            // Set just Macros and Code to be scrubbed

            request.SetOption(SecureOptions.MacrosAndCode, ScrubOption.Action.SCRUB);

            // Set the document to be scrubbed
            // In this case it's a Word 2007 document containing macros

            FileInfo sourceDocument = new FileInfo("c:/temp/test.docm");

            request.SetOption(SecureOptions.SourceDocument, sourceDocument);

            // Set the scrubbed document

            FileInfo scrubbedDocument = new FileInfo("c:/temp/out/" + sourceDocument.Name);
            request.SetOption(SecureOptions.ScrubbedDocument, scrubbedDocument);

            // Execute the request

            TransformException requestException = null;

            try
            {
                request.Execute();
            }
            catch (TransformException e)
            {
                requestException = e;
            }

            // Get the response object

            SecureResponse response = request.GetResponse();

            // Get status

            FileFormat sourceFormat = response.GetResult(SecureOptions.SourceFormat);
            int status = response.GetResult(SecureOptions.ProcessingStatus);

            switch (status)
            {
                case SecureOptions.ProcessingStatusOption.Processed:

                    Console.WriteLine("The document " + sourceDocument.Name +
                        " was identified as '" + sourceFormat.Name +
                        "' and was processed correctly.");

                    FileFormat scrubbedFormat = response.GetResult(SecureOptions.ScrubbedFormat);

                    if (scrubbedFormat != null)
                    {
                        // The file format (and therefore the file extension) has
                        // changed so we need to rename the scrubbed document.
                        // This code just renames the scrubbed file by appending
                        // the new extension.
                        //
                        // In this particular case the scrubbed .docm file must be
                        // renamed .docx or it will not open in Microsoft Office.

                        FileInfo newScrubbedDocument = new FileInfo(scrubbedDocument.FullName +
                            "." + scrubbedFormat.Extension);

                        if (newScrubbedDocument.Exists) newScrubbedDocument.Delete();

                        scrubbedDocument.MoveTo(newScrubbedDocument.FullName);
                    }

                    int decryptionStatus = response.GetResult(SecureOptions.DecryptionStatus);

                    switch (decryptionStatus)
                    {
                        case SecureOptions.DecryptionStatusOption.NotEncrypted:
                            // Standard case
                            break;
                        case SecureOptions.DecryptionStatusOption.DecryptedWithDefaultPassword:
                            Console.WriteLine("The document is encrypted and was " +
                                "decrypted with the default password");
                            break;
                        case SecureOptions.DecryptionStatusOption.DecryptedWithPasswordList:
                            // This won't happen here because the code above does not
                            // provide a password list.
                            Console.WriteLine("The document is encrypted and was " +
                                "decrypted with one of the passwords provided");
                            break;
                        case SecureOptions.DecryptionStatusOption.DecryptionFailed:
                            Console.WriteLine("The document is encrypted and " +
                                "could not be decrypted with either the default " +
                                "or provided passwords");
                            break;
                        case SecureOptions.DecryptionStatusOption.DecryptionNotSupported:
                            Console.WriteLine("The document is encrypted and " +
                                "the encryption format is not supported");
                            break;
                    }

                    break;

                case SecureOptions.ProcessingStatusOption.Timeout:

                    Console.WriteLine("The document " + sourceDocument.Name +
                        " was identified as " + sourceFormat.Name +
                        " but processing timed out.");
                    break;

                case SecureOptions.ProcessingStatusOption.NotSupported:

                    Console.WriteLine("The document " + sourceDocument.Name +
                        " was identified as " + sourceFormat.Name +
                        " but that format is not supported.");
                    break;

                case SecureOptions.ProcessingStatusOption.NotIdentified:

                    Console.WriteLine("The document " + sourceDocument.Name +
                        " could not be identified.");
                    break;

                case SecureOptions.ProcessingStatusOption.CausedException:

                    Console.WriteLine("The document " + sourceDocument.Name +
                        " caused an exception.");
                    Console.WriteLine(requestException.ToString());
                    break;
            }

            // Close the response

            response.Close();

            // Close the request

            request.Close();

            // Uninitialize API

            SecureHelper.Shutdown();
        }
    }
}

Document IO

The SourceDocument, ScrubbedDocument, ResultDocument, ResultTransform, ExportDocument and ExportReplacementDocument options all require the developer to provide a stream of bytes. For the ScrubbedDocument, ResultDocument and ExportDocument options the stream must be writable. The developer can provide the stream in several ways.

Normal file

If the file is on local or remote storage, a path name is the easiest way to provide a document to the API. In Java a File object is provided, in C/C++ a path name is provided and in .NET a FileInfo object is provided.

Java InputStream

Even though an InputStream is a valid type for the SourceDocument option, the execute method will throw an exception unless the InputStream is an instance of FileInputStream. The same is true for using an OutputStream with the ResultDocument and ScrubbedDocument options. The reason lies in the nature of the file formats being processed: the parser must seek throughout the document in order to parse it correctly. Because an InputStream is not seekable, Clean Content would have to buffer the entire document in memory in order to work correctly. Doing such a memory-intensive operation "behind the back" of the developer was not considered acceptable. Developers who need to process InputStream objects using Clean Content should read them into a ByteBuffer and pass the ByteBuffer to the SourceDocument option.

Any kind of InputStream may be provided to the ResultTransform option.
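
As a minimal sketch of the buffering step recommended above, the following helper reads an arbitrary InputStream to its end and wraps the bytes in a ByteBuffer. It uses only the standard Java library; the class name StreamBuffering and the method readFully are hypothetical names chosen for illustration, and the call that hands the buffer to the request appears only as a comment because it requires the SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical helper, not part of the Clean Content API.
final class StreamBuffering {

    // Reads the stream to its end and returns the bytes wrapped in a
    // ByteBuffer whose position is 0 and whose limit is the byte count.
    static ByteBuffer readFully(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            collected.write(chunk, 0, read);
        }
        return ByteBuffer.wrap(collected.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});
        ByteBuffer document = readFully(in);
        System.out.println("Buffered " + document.remaining() + " bytes");
        // With the SDK on the classpath the buffer would then be passed on:
        // request.setOption(SecureOptions.SourceDocument, document);
    }
}
```

The whole document is held in memory, which is exactly the cost the API declines to incur implicitly, so this approach suits documents of bounded size.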

In memory

In some instances the developer has a document already in memory or needs a document written to memory. "On the wire" email attachment processing is a good example of this. In these cases the document can be passed directly to the API without the need to persist it to storage. To accomplish this in Java a ByteBuffer is provided, in C/C++ a pointer to memory is provided and in .NET a MemoryStream is provided.

ISSUE: In the case of output documents (ScrubbedDocument, ResultDocument and ExportDocument) a developer using the C/C++ interface has no way of knowing how much of the provided memory block was filled with output. A long-term solution is in the works, but for now C/C++ developers can work around the issue by using a channel (see the Channel section below). The following sample code shows the workaround.

C++

#include <iostream>
#include <cstdio>
#include <cstring>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Scrub everything
    request->SetOption(BFSecureOptions::JustAnalyze,FALSE);
    request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_Scrub);

    // Define a channel that writes to an expandable memory buffer

    class MyChannel: public BFChannel {

      private:
        char * buf;          // backing memory
        long bufincrement;   // growth step in bytes
        long bufsize;        // allocated size of buf
        long filesize;       // number of bytes actually written

      public:

      MyChannel(long inc) {
        buf = new char[inc];
        bufincrement = inc;
        bufsize = inc;
        filesize = 0;
      }

      long Read(void * buffer, BFINT32 count, BFINT64 position) {

        cout << "Read " << count << " bytes at " << position << endl;

        if (position >= filesize) {
          return 0;
        }

        if (position+count > filesize) {
          count = filesize-position;
        }

        memcpy(buffer,&(buf[position]),count);
        return count;
      }

      void Write(void * buffer, BFINT32 count, BFINT64 position) {

        cout << "Write " << count << " bytes at " << position << endl;

        if (position+count > filesize) filesize = position+count;

        if (filesize > bufsize) {
          // Enlarge buffer
          long newbufsize = bufsize + bufincrement;
          while (filesize > newbufsize) newbufsize += bufincrement;
          char * newbuf = new char[newbufsize];
          memcpy(newbuf,buf,bufsize);
          // buf was allocated with new[], so it must be released with delete[]
          delete[] buf;
          buf = newbuf;
          bufsize = newbufsize;
          cout << "Buffer enlarged to " << bufsize << " bytes" << endl;
        }

        memcpy(&(buf[position]),buffer,count);
      }

      BFINT64 Size() {
        return filesize;
      }

      long Supports() {
        return BFCHANNELCANWRITE | BFCHANNELCANREAD;
      }

      void Close() {
        // Write out the buffer to a file
        FILE * out = _wfopen(L"c:/temp/test.channel.doc",L"wb");
        fwrite(buf,1,filesize,out);
        fclose(out);
      }

      void Truncate(BFINT64 size) {
        filesize = size;
      }
    };

    // Set the document to be scrubbed
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Create a channel for the scrubbed document with a starting buffer size and increment of 20k bytes
    MyChannel mychannel = MyChannel(1024*20);

    // Set the scrubbed document
    request->SetOption(BFSecureOptions::ScrubbedDocument, &mychannel);

    // Add some properties to check that increasing the size of the ScrubbedDocument works

    for (int i = 0; i < 1000; i++) {
      wchar_t name[128];
      wchar_t value[128];
      wsprintf(name,L"CustomProperty%i",i);
      wsprintf(value,L"This is the value of custom property %i",i);
      SecureOptions_StringProperty prop;
      BFNewStringProperty(name,name,&prop,NULL);
      request->SetOption(prop.action,SecureOptions_Properties_Action_AddOrReplace);
      request->SetOption(prop.newValue,value);
    }

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success

    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document

      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

    } else {
      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

Channel

Sometimes a file exists in a non-traditional storage medium that cannot be referenced by an operating system path. A file saved in a database BLOB is an example of this. In this case, the application can provide its own "channel" to the document by implementing a few simple functions like Read, Size, Close, etc. To accomplish this in Java a SimpleChannel is provided, in C/C++ a pointer to a list of functions is provided and in .NET a Stream is provided.
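
To illustrate what implementing those few functions involves, here is a hedged Java sketch of a channel backed by a growable byte array. The DocumentChannel interface is hypothetical and exists only so the example compiles on its own; it mirrors the operations named above and in the C++ sample (read, write, size, truncate, close) and is not the signature of the SDK's SimpleChannel, which should be taken from the API reference.

```java
import java.util.Arrays;

// Hypothetical interface for illustration only; NOT the SDK's SimpleChannel.
interface DocumentChannel {
    int read(byte[] buffer, int count, long position);
    void write(byte[] buffer, int count, long position);
    long size();
    void truncate(long size);
    void close();
}

final class MemoryChannel implements DocumentChannel {
    private byte[] data;   // backing storage, may be larger than the document
    private int length;    // number of valid document bytes

    MemoryChannel(int initialCapacity) {
        data = new byte[Math.max(1, initialCapacity)];
    }

    // Copies up to count bytes starting at position; returns 0 at or past the end.
    public int read(byte[] buffer, int count, long position) {
        if (position >= length) return 0;
        int n = (int) Math.min(count, length - position);
        System.arraycopy(data, (int) position, buffer, 0, n);
        return n;
    }

    // Writes count bytes at position, growing the storage when needed.
    public void write(byte[] buffer, int count, long position) {
        int start = Math.toIntExact(position);
        int end = Math.toIntExact(position + count);
        if (end > data.length) {
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        if (start > length) {
            // Zero any gap so stale bytes left by a truncate never reappear
            Arrays.fill(data, length, start, (byte) 0);
        }
        System.arraycopy(buffer, 0, data, start, count);
        if (end > length) length = end;
    }

    public long size() { return length; }

    // Shrinks the document; larger sizes are ignored in this sketch.
    public void truncate(long size) {
        if (size < length) length = (int) Math.max(0, size);
    }

    public void close() { /* nothing to release for an in-memory channel */ }

    // Returns a copy of exactly the bytes written so far.
    byte[] toByteArray() { return Arrays.copyOf(data, length); }
}
```

Tracking the valid length separately from the capacity is what lets the caller learn how many bytes of output were produced, which is the gap the C/C++ memory-block interface leaves open.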

Targets

Clean Content's main focus is on the discovery (analysis) and removal (scrubbing) of various parts of documents (targets) that represent security or disclosure risks. The possible targets for analysis and scrubbing make up the bulk of the options in SecureOptions. Developers should carefully review these targets to clearly understand the implications of scrubbing them.

Any given target may be set to one of the following values.

Default
Use the value of the special DefaultScrubBehavior target

None
Don't perform any action on the target. Setting a target to this value does not guarantee that the target will not be analyzed and reported on, only that such an analysis is not necessary. None acts just like Analyze for most targets except those that take significant additional processing to analyze.

Analyze
Report the existence of the target but don't attempt to scrub or otherwise remove it

Scrub
Report the existence of the target and remove it
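
The way a Default setting falls back to DefaultScrubBehavior can be modelled in a few lines. The following is a self-contained sketch of that lookup rule only, not SDK code: the TargetSettings class, its enums and its method names are hypothetical, and the initial default of NONE is an arbitrary choice for the sketch rather than a statement about the SDK's own default.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical model of the fallback rule; not part of the Clean Content API.
final class TargetSettings {

    enum Action { DEFAULT, NONE, ANALYZE, SCRUB }

    // A small subset of targets, for illustration
    enum Target { COMMENTS, TRACKED_CHANGES, MACROS_AND_CODE }

    // Arbitrary starting value for this sketch
    private Action defaultBehavior = Action.NONE;
    private final Map<Target, Action> explicit = new EnumMap<>(Target.class);

    void setDefaultBehavior(Action action) {
        if (action == Action.DEFAULT) {
            throw new IllegalArgumentException("the default behavior cannot itself be DEFAULT");
        }
        defaultBehavior = action;
    }

    void set(Target target, Action action) {
        explicit.put(target, action);
    }

    // A target that was never set, or was set to DEFAULT, follows the default behavior.
    Action effective(Target target) {
        Action action = explicit.getOrDefault(target, Action.DEFAULT);
        return action == Action.DEFAULT ? defaultBehavior : action;
    }
}
```

Under this model, setting the default to NONE and then setting one target to SCRUB, as the sample code in this guide does, leaves every other target at NONE.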

XML Bounded spaces
Alternative Text
Apps For Office
XML Comment
XML Processing Instruction
XML CDATA
XML Unknown Namespace
XML External Entity
XML Rename Namespace Prefix
XML Unused Namespaces
Audio and Video Paths
Author History
Clipped Text
Color Obfuscated Text
Comments
Content Properties
Custom Properties
Custom XML
Database Queries
Default scrub behavior
Document Variables
Embedded Objects
Encryption
Excel Data Model
Extreme Cells
Extreme Indenting
Extreme Objects
Fast Save Data
Headers and Footers
Hidden Cells
Hidden Slides
Hidden Text
Hybrid Excel 95 97 Book Stream
Invalid XML
Unknown XML
Linked Objects
Macros and Code
Meeting Minutes
Office GUID Property
Office XML Rogue Parts
Office XML Unexpected Parts
Office XML Unanalyzed Parts
Office XML Alternate Content Parts
Outlook Properties
Overlapped Objects
Overlapped Text
PDF Actions
PDF Alternate Images
PDF Deprecated Postscript Objects
PDF Alternate Presentations
PDF Private Application Data
PDF Web Capture Information
PDF Legal Attestation
PDF Digital Signatures
PDF Thumbnail Images
PDF Annotations
Presentation Notes
Printer Information
Routing Slip
Scenarios
Sensitive Content Links
Sensitive Hyperlinks
Sensitive INCLUDE Fields
Size Obfuscated Text
Smart Tags
Statistic Properties
StructuredDocumentTags
Summary Properties
Template Name
Tracked Changes
Uninitialized Docfile Data
User Names
Versions
Weak Protections
XMP Metadata Streams
GPS location information

Analysis and Scrubbing

Analysis and scrubbing of documents is achieved through use of the following options.

SourceDocument
The file to be analyzed or scrubbed

ScrubbedDocument
File that will contain a scrubbed version of the SourceDocument after scrubbing

ScrubInPlace - Removed in 2009.1
Ignore the ScrubbedDocument option and scrub the SourceDocument directly

JustAnalyze
Ignore all target settings and just analyze the document without changing it or writing a scrubbed version

DefaultScrubBehavior
A "special" target that sets default behavior for any target option not explicitly set

Response

For every target there is a result (reaction) describing whether the target was found and, if so, whether it was scrubbed. The methods in the SecureResponse object reuse the same targets.

Sample code

The following sample code scrubs and reports on just the Comments and Tracked Changes targets but leaves all other targets alone.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;

import java.io.File;
import java.io.IOException;

public class Scrub {

    public static void main(String[] args) {

        // Create a request

        SecureRequest request = new SecureRequest();

        // Set the default scrubbing behavior to NONE

        request.setOption(SecureOptions.DefaultScrubBehavior,ScrubOption.Action.NONE);

        // Set Comments and Tracked Changes to be scrubbed

        request.setOption(SecureOptions.Comments,ScrubOption.Action.SCRUB);
        request.setOption(SecureOptions.TrackedChanges,ScrubOption.Action.SCRUB);

        // Set the document to be scrubbed

        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        // Set the scrubbed document

        request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.scrubbed.doc"));

        try {

            // Execute the request

            request.execute();

            // Get the response object

            SecureResponse response = request.getResponse();

            // Check for success

            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document

                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

                // Print results of scrubbing Comments and Tracked Changes

                if (response.getResult(SecureOptions.Comments) == ScrubOption.Reaction.DOESNOTEXIST) {
                    System.out.println("The document did not contain Comments");
                } else if (response.getResult(SecureOptions.Comments) == ScrubOption.Reaction.SCRUBBED) {
                    System.out.println("Comments were removed from the document");
                }

                if (response.getResult(SecureOptions.TrackedChanges) == ScrubOption.Reaction.DOESNOTEXIST) {
                    System.out.println("The document did not contain Tracked Changes");
                } else if (response.getResult(SecureOptions.TrackedChanges) == ScrubOption.Reaction.SCRUBBED) {
                    System.out.println("Tracked Changes were removed from the document");
                }

            } else {

                // Processing failed

                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred

            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Set the default scrubbing behavior to NONE
    request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_None);

    // Set Comments and Tracked Changes to be scrubbed
    request->SetOption(BFSecureOptions::Comments,ScrubOption_Action_Scrub);
    request->SetOption(BFSecureOptions::TrackedChanges,ScrubOption_Action_Scrub);

    // Set the document to be scrubbed
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Set the scrubbed document
    request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.scrubbed.doc");

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success

    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document

      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

      // Print results of scrubbing Comments and Tracked Changes

      if (response->GetScrubResult(BFSecureOptions::Comments) == ScrubOption_Reaction_DoesNotExist) {
        wcout << L"The document does not contain Comments" << endl;
      } else if (response->GetScrubResult(BFSecureOptions::Comments) == ScrubOption_Reaction_Scrubbed) {
        wcout << L"Comments were removed from the document" << endl;
      }

      if (response->GetScrubResult(BFSecureOptions::TrackedChanges) == ScrubOption_Reaction_DoesNotExist) {
        wcout << L"The document does not contain Tracked Changes" << endl;
      } else if (response->GetScrubResult(BFSecureOptions::TrackedChanges) == ScrubOption_Reaction_Scrubbed) {
        wcout << L"Tracked Changes were removed from the document" << endl;
      }
    } else {
      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class Scrub
    {
        static void Main(string[] args)
        {
            // Initialize API

            SecureHelper.Startup(true);

            // Create a request

            SecureRequest request = new SecureRequest();

            // Set the default scrubbing behavior to NONE

            request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

            // Set Comments and Tracked Changes to be scrubbed

            request.SetOption(SecureOptions.Comments, ScrubOption.Action.SCRUB);
            request.SetOption(SecureOptions.TrackedChanges, ScrubOption.Action.SCRUB);

            // Set the document to be scrubbed

            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Set the scrubbed document

            request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.scrubbed.doc"));

            try
            {

                // Execute the request

                request.Execute();

                // Get the response object

                SecureResponse response = request.GetResponse();

                // Check for success

                if (response.GetResult(SecureOptions.WasProcessed))
                {

                    // Print information about the document

                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);

                    // Print results of scrubbing Comments and Tracked Changes

                    if (response.GetResult(SecureOptions.Comments) == ScrubOption.Reaction.DOESNOTEXIST) {
                        Console.WriteLine("The document did not contain Comments");
                    } else if (response.GetResult(SecureOptions.Comments) == ScrubOption.Reaction.SCRUBBED) {
                        Console.WriteLine("Comments were removed from the document");
                    }

                    if (response.GetResult(SecureOptions.TrackedChanges) == ScrubOption.Reaction.DOESNOTEXIST) {
                        Console.WriteLine("The document did not contain Tracked Changes");
                    } else if (response.GetResult(SecureOptions.TrackedChanges) == ScrubOption.Reaction.SCRUBBED) {
                        Console.WriteLine("Tracked Changes were removed from the document");
                    }
                }
                else
                {

                    // Processing failed

                    Console.WriteLine("Document processing failed");
                }

                // Close the response

                response.Close();
            }
            catch (TransformException e)
            {

                // An exception occurred

                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());

            }

            // Close the request

            request.Close();

            // Uninitialize API

            SecureHelper.Shutdown();
        }
    }
}

Hyperlink testing using regular expressions

The behavior of the SensitiveHyperlinks target may be modified by using a regular expression to extend the definition of "sensitive". The following sample code shows the extended API calls necessary to identify or scrub hyperlinks based on regular expression matching. Regular expression testing is in addition to the standard test for sensitivity.

 

Java

request.setOption(SecureOptions.SensitiveHyperlinks,ScrubOption.Action.SCRUB);

String[] regexs = new String[] {".*yahoo.*",".*msn.*"};

request.setOption(SecureOptions.SensitiveHyperlinksRegex,regexs);

request.execute();

 

 

C++

request->SetOption(BFSecureOptions::SensitiveHyperlinks,ScrubOption_Action_Scrub);

std::wstring regexs[2] = {L".*yahoo.*",L".*msn.*"};

request->SetOption(BFSecureOptions::SensitiveHyperlinksRegex,regexs,2);

request->Execute();

 

 

C#

request.SetOption(SecureOptions.SensitiveHyperlinks,ScrubOption.Action.SCRUB);

string[] regexs = new string[] {".*yahoo.*",".*msn.*"};

request.SetOption(SecureOptions.SensitiveHyperlinksRegex,regexs);

request.Execute();
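
The effect of expressions such as ".*yahoo.*" can be checked before they are passed to the API. The following is a minimal sketch only: it assumes that the SDK's regular expression syntax is compatible with java.util.regex and that an expression must match the entire hyperlink, neither of which is confirmed by this guide. The class and method names are hypothetical and are not part of the Clean Content API.

```java
import java.util.List;
import java.util.regex.Pattern;

public class HyperlinkRegexSketch {

    // Returns true if the URL matches any of the supplied expressions.
    // Assumption: whole-string matching, as done by Matcher.matches().
    static boolean matchesAny(String url, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The same expressions used in the samples above
        List<Pattern> patterns = List.of(
                Pattern.compile(".*yahoo.*"),
                Pattern.compile(".*msn.*"));

        String[] urls = {
                "http://www.yahoo.com/news",
                "http://www.msn.com",
                "http://www.example.com"
        };

        for (String url : urls) {
            System.out.println(url + " -> " + matchesAny(url, patterns));
        }
    }
}
```

Under these assumptions the leading and trailing ".*" are what allow the expression to match a term anywhere in the hyperlink.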

 

Modification of properties

As a special extension to scrubbing, Clean Content can also add, modify and remove document properties from Microsoft Office documents. The following code sample shows how to replace the Author property (or add one if no Author property exists), replace the Company property (only if a Company property already exists), remove the Title property and add a new custom property called State.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;

import java.io.File;
import java.io.IOException;

public class Properties {

   
public static void main(String[] args) {

       
// Create a request

       
SecureRequest request = new SecureRequest();

       
// Set the default scrubbing behavior to NONE

       
request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

       
// Default is to leave all properties alone

       
request.setOption(SecureOptions.Properties.defaultAction, SecureOptions.Properties.Action.None);

       
// Add or replace Author with "Larry"

       
request.setOption(SecureOptions.Properties.Author.action, SecureOptions.Properties.Action.AddOrReplace);
request.setOption(SecureOptions.Properties.Author.newValue, "Larry");

       
// Replace Company, if it already exists in the document, with "Oracle"

       
request.setOption(SecureOptions.Properties.Company.action, SecureOptions.Properties.Action.Replace);
request.setOption(SecureOptions.Properties.Company.newValue, "Oracle");

       
// Remove the Title property

       
request.setOption(SecureOptions.Properties.Title.action, SecureOptions.Properties.Action.Scrub);

       
// Create a new custom property and add it to the document

       
SecureOptions.Properties.StringProperty stateprop = SecureOptions.Properties.newStringProperty("State","The state in which the document was created");
request.setOption(stateprop.action, SecureOptions.Properties.Action.AddOrReplace);
request.setOption(stateprop.newValue, "California");

       
// Set the document to modify

       
request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

       
// Set the modified document

       
request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.properties.doc"));

       
try {

           
// Execute the request

           
request.execute();

           
// Get the response object

           
SecureResponse response = request.getResponse();

           
// Check for success

           
if (response.getResult(SecureOptions.WasProcessed)) {

               
// Print information about the document

               
System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

           
} else {

               
// Processing failed

               
System.out.println("Document processing failed");
           
}
        }
catch (IOException e) {

           
// An exception occurred

System.out.println("Document caused an exception");
e.printStackTrace();
       
}
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
 
try {
   
// Initialize the Clean Content API
   
BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

   
// Create a request
   
BFSecureRequest * request = new BFSecureRequest();

   
// Set the default scrubbing behavior to NONE
   
request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_None);

   
// Default is to leave all properties alone
   
request->SetOption(SecureOptions_Properties_DefaultAction_action,SecureOptions_Properties_Action_None);

   
// Add or replace Author with "Larry"
   
request->SetOption(SecureOptions_Properties_Author_action,SecureOptions_Properties_Action_AddOrReplace);
request->SetOption(SecureOptions_Properties_Author_newValue,L"Larry");

   
// Replace Company, if it already exists in the document, with "Oracle"
   
request->SetOption(SecureOptions_Properties_Company_action,SecureOptions_Properties_Action_Replace);
request->SetOption(SecureOptions_Properties_Company_newValue,L"Oracle");

   
// Remove the Title property
   
request->SetOption(SecureOptions_Properties_Title_action,SecureOptions_Properties_Action_Scrub);

   
// Create a new custom property and add it to the document

   
SecureOptions_StringProperty stateprop;
BFNewStringProperty(L"State",L"The state in which the document was created",&stateprop,NULL);
request->SetOption(stateprop.action,SecureOptions_Properties_Action_AddOrReplace);
request->SetOption(stateprop.newValue,L"California");

   
// Set the document to be scrubbed
   
request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

   
// Set the scrubbed document
   
request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.properties.doc");

   
// Execute the request
   
request->Execute();

   
// Get the response object
   
BFSecureResponse * response = request->GetSecureResponse();

   
// Check for success

   
if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

     
// Print information about the document

     
FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
std::wstring formatname;
BFSecureRequest::GetFileFormatName(format, formatname);

wcout << L"The file has a format of " << formatname << endl;
   
} else {
     
// Processing failed
     
wcout << L"Document processing failed" << endl;
   
}

   
BFSecureRequest::Shutdown();

 
} catch (BFTransformException & ex) {

   
wcout << ex.wwhat() << endl;
wcout << ex.wextended() << endl;

BFTransformException * cause = ex.getCause();

   
while (cause != NULL) {
     
wcout << cause->wwhat() << endl;
wcout << cause->wextended() << endl;
cause = cause->getCause();
   
}
  }

 
return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
   
class Properties
   
{
       
static void Main(string[] args)
        {
           
// Initialize API

           
SecureHelper.Startup(true);

           
// Create a request

           
SecureRequest request = new SecureRequest();

           
// Set the default scrubbing behavior to NONE

           
request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

           
// Default is to leave all properties alone

           
request.SetOption(SecureOptions.Properties.defaultAction, SecureOptions.Properties.Action.None);

           
// Add or replace Author with "Larry"

           
request.SetOption(SecureOptions.Properties.Author.action, SecureOptions.Properties.Action.AddOrReplace);
request.SetOption(SecureOptions.Properties.Author.newValue, "Larry");

           
// Replace Company, if it already exists in the document, with "Oracle"

           
request.SetOption(SecureOptions.Properties.Company.action, SecureOptions.Properties.Action.Replace);
request.SetOption(SecureOptions.Properties.Company.newValue, "Oracle");

           
// Remove the Title property

           
request.SetOption(SecureOptions.Properties.Title.action, SecureOptions.Properties.Action.Scrub);

           
// Create a new custom property and add it to the document

           
SecureOptions.Properties.StringProperty stateprop = SecureOptions.Properties.newStringProperty("State", "The state in which the document was created");
request.SetOption(stateprop.action, SecureOptions.Properties.Action.AddOrReplace);
request.SetOption(stateprop.newValue, "California");

           
// Set the document to modify

           
request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

           
// Set the modified document

           
request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.properties.doc"));

           
try
           
{

               
// Execute the request

               
request.Execute();

               
// Get the response object

               
SecureResponse response = request.GetResponse();

               
// Check for success

               
if (response.GetResult(SecureOptions.WasProcessed))
                {

                   
// Print information about the document

                   
Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);

               
}
               
else
               
{

                   
// Processing failed

                   
Console.WriteLine("Document processing failed");
               
}

               
// Close the response

               
response.Close();
           
}
           
catch (TransformException e)
            {

               
// An exception occurred

Console.WriteLine("Document caused an exception");
Console.WriteLine(e.ToString());

           
}

           
// Close the request

           
request.Close();

           
// Uninitialize API

           
SecureHelper.Shutdown();
       
}

    }
}
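
The difference between the property actions used in this section can be modeled with an ordinary map. The following is a sketch of the semantics described above, not SDK code: the Action enum and the apply method are hypothetical stand-ins, and the starting properties (Title and Subject) are invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PropertyActionSketch {

    // Hypothetical stand-ins for the None, AddOrReplace, Replace and Scrub actions
    enum Action { NONE, ADD_OR_REPLACE, REPLACE, SCRUB }

    // Applies one action to one named property in a simple in-memory model
    static void apply(Map<String, String> props, String name, Action action, String newValue) {
        switch (action) {
            case ADD_OR_REPLACE:
                // Set the value whether or not the property already exists
                props.put(name, newValue);
                break;
            case REPLACE:
                // Change the value only if the property already exists
                if (props.containsKey(name)) {
                    props.put(name, newValue);
                }
                break;
            case SCRUB:
                // Remove the property
                props.remove(name);
                break;
            default:
                // NONE: leave the property alone
                break;
        }
    }

    // Runs the four modifications from the sample against an invented document
    static Map<String, String> demo() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("Title", "Draft");
        props.put("Subject", "Budget");

        apply(props, "Author", Action.ADD_OR_REPLACE, "Larry");
        apply(props, "Company", Action.REPLACE, "Oracle");
        apply(props, "Title", Action.SCRUB, null);
        apply(props, "State", Action.ADD_OR_REPLACE, "California");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model Company is not added because the invented document had no Company property to replace, which is the distinction between Replace and AddOrReplace.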

Modification of Microsoft Word Fields

As a special extension to scrubbing, Clean Content can also modify and remove Fields in Microsoft Word documents. The following sample code shows the extended API calls necessary to scrub all Fields from a Microsoft Word document except for Date Fields. In addition, all Author fields are scrubbed and their contents replaced by the string "Larry".

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;

import java.io.File;
import java.io.IOException;

public class Fields {

   
public static void main(String[] args) {

       
// Create a request

       
SecureRequest request = new SecureRequest();

       
// Set the default scrubbing behavior to NONE

       
request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

       
// Default is to scrub all fields

       
request.setOption(SecureOptions.Fields.defaultAction, SecureOptions.Fields.Action.Scrub);

       
// Don't scrub Date fields

       
request.setOption(SecureOptions.Fields.Date.action, SecureOptions.Fields.Action.None);

       
// Scrub the Author field and replace the text with "Larry"

       
request.setOption(SecureOptions.Fields.Author.action, SecureOptions.Fields.Action.ScrubAndReplace);
request.setOption(SecureOptions.Fields.Author.newValue, "Larry");

       
// Set the document to modify

       
request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

       
// Set the modified document

       
request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.fields.doc"));

       
try {

           
// Execute the request

           
request.execute();

           
// Get the response object

           
SecureResponse response = request.getResponse();

           
// Check for success

           
if (response.getResult(SecureOptions.WasProcessed)) {

               
// Print information about the document

               
System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

           
} else {

               
// Processing failed

               
System.out.println("Document processing failed");
           
}
        }
catch (IOException e) {

           
// An exception occurred

System.out.println("Document caused an exception");
e.printStackTrace();
       
}
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
 
try {
   
// Initialize the Clean Content API
   
BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

   
// Create a request
   
BFSecureRequest * request = new BFSecureRequest();

   
// Set the default scrubbing behavior to NONE
   
request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_None);

   
// Default is to scrub all fields
   
request->SetOption(SecureOptions_Fields_DefaultAction_action,SecureOptions_Fields_Action_Scrub);

   
// Don't scrub Date fields
   
request->SetOption(SecureOptions_Fields_Date_action,SecureOptions_Fields_Action_None);

   
// Scrub the Author field and replace the text with "Larry"
   
request->SetOption(SecureOptions_Fields_Author_action,SecureOptions_Fields_Action_ScrubAndReplace);
request->SetOption(SecureOptions_Fields_Author_newValue,L"Larry");

   
// Set the document to be scrubbed
   
request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

   
// Set the scrubbed document
   
request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.fields.doc");

   
// Execute the request
   
request->Execute();

   
// Get the response object
   
BFSecureResponse * response = request->GetSecureResponse();

   
// Check for success

   
if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

     
// Print information about the document

     
FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
std::wstring formatname;
BFSecureRequest::GetFileFormatName(format, formatname);

wcout << L"The file has a format of " << formatname << endl;
   
} else {
     
// Processing failed
     
wcout << L"Document processing failed" << endl;
   
}

   
BFSecureRequest::Shutdown();

 
} catch (BFTransformException & ex) {

   
wcout << ex.wwhat() << endl;
wcout << ex.wextended() << endl;

BFTransformException * cause = ex.getCause();

   
while (cause != NULL) {
     
wcout << cause->wwhat() << endl;
wcout << cause->wextended() << endl;
cause = cause->getCause();
   
}
  }

 
return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;


namespace Main
{
   
class Fields
   
{
       
static void Main(string[] args)
        {

           
// Initialize API

           
SecureHelper.Startup(true);

           
// Create a request

           
SecureRequest request = new SecureRequest();

           
// Set the default scrubbing behavior to NONE

           
request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

           
// Default is to scrub all fields

           
request.SetOption(SecureOptions.Fields.defaultAction, SecureOptions.Fields.Action.Scrub);

           
// Don't scrub Date fields

           
request.SetOption(SecureOptions.Fields.Date.action, SecureOptions.Fields.Action.None);

           
// Scrub the Author field and replace the text with "Larry"

           
request.SetOption(SecureOptions.Fields.Author.action, SecureOptions.Fields.Action.ScrubAndReplace);
request.SetOption(SecureOptions.Fields.Author.newValue, "Larry");

           
// Set the document to modify

           
request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

           
// Set the modified document

           
request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.fields.doc"));

           
try
           
{

               
// Execute the request

               
request.Execute();

               
// Get the response object

               
SecureResponse response = request.GetResponse();

               
// Check for success

               
if (response.GetResult(SecureOptions.WasProcessed))
                {

                   
// Print information about the document

                   
Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);

               
}
               
else
               
{

                   
// Processing failed

                   
Console.WriteLine("Document processing failed");
               
}

               
// Close the response

               
response.Close();
           
}
           
catch (TransformException e)
            {

               
// An exception occurred

Console.WriteLine("Document caused an exception");
Console.WriteLine(e.ToString());

           
}

           
// Close the request

           
request.Close();

           
// Uninitialize API

           
SecureHelper.Shutdown();
       
}


    }
}
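
The Fields samples in this section combine a default action with per-field exceptions. The following sketch models that resolution with a map lookup. It is an illustration only: the Action enum, the resolve method and the "Page" field name are hypothetical, and the guide does not describe how the SDK resolves these settings internally.

```java
import java.util.Map;

public class FieldActionSketch {

    // Hypothetical stand-ins for the None, Scrub and ScrubAndReplace actions
    enum Action { NONE, SCRUB, SCRUB_AND_REPLACE }

    // A field type with its own setting uses it; every other type uses the default
    static Action resolve(String fieldType, Action defaultAction, Map<String, Action> overrides) {
        return overrides.getOrDefault(fieldType, defaultAction);
    }

    public static void main(String[] args) {
        // Mirrors the sample: scrub everything, keep Date, scrub and replace Author
        Map<String, Action> overrides = Map.of(
                "Date", Action.NONE,
                "Author", Action.SCRUB_AND_REPLACE);

        for (String type : new String[] {"Date", "Author", "Page"}) {
            System.out.println(type + " -> " + resolve(type, Action.SCRUB, overrides));
        }
    }
}
```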

Header/Footer removal and modification using regular expressions

As a special extension to scrubbing, Clean Content can also conditionally remove headers and footers, remove only their text, or replace text within them using the HeadersFootersSearch, HeadersFootersBehavior and HeadersFootersReplace options. These options are only valid when the HeadersFooters scrub target is set to Scrub. If these options are empty, all headers and footers are scrubbed completely.

The following code sets the HeadersFooters options so that any header or footer containing the text "abc" is left alone; any header or footer containing the text "123" is scrubbed for text, but other items such as fields, images and page numbers are left alone; any header or footer containing the text "Joe" is left alone except that "Joe" is replaced by "Jim"; and all other headers and footers are scrubbed completely.

Java

import net.bitform.api.options.EnumOptionValue;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureResponse;

import java.io.File;
import java.io.IOException;

public class Headers {

   
public static void main(String[] args) {

       
// Create a request

       
SecureRequest request = new SecureRequest();

       
// Set the default scrubbing behavior to NONE

       
request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

       
// Scrub headers and footers

       
request.setOption(SecureOptions.HeadersFooters, ScrubOption.Action.SCRUB);

       
// List of regular expressions to match in headers and footers

       
String[] search = new String[]{
               
".*abc.*",
               
".*123.*",
               
"(.*)Joe(.*)"
       
};

       
// List of behaviors to take on a match condition

       
EnumOptionValue[] behavior = new EnumOptionValue[]{
               
SecureOptions.HeadersFootersBehaviorOption.Leave,
                SecureOptions.HeadersFootersBehaviorOption.ScrubText,
                SecureOptions.HeadersFootersBehaviorOption.Replace
       
};

       
// List of replacement text items

       
String[] replace = new String[]{
               
null,
                null,
               
"$1Jim$2"
       
};

       
// Set the lists

       
request.setOption(SecureOptions.HeadersFootersSearch, search);
request.setOption(SecureOptions.HeadersFootersBehavior, behavior);
request.setOption(SecureOptions.HeadersFootersReplace, replace);

       
// Set the document to modify

       
request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

       
// Set the modified document

       
request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.headers.doc"));

       
try {

           
// Execute the request

           
request.execute();

           
// Get the response object

           
SecureResponse response = request.getResponse();

           
// Check for success

           
if (response.getResult(SecureOptions.WasProcessed)) {

               
// Print information about the document

               
System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

           
} else {

               
// Processing failed

               
System.out.println("Document processing failed");
           
}
        }
catch (IOException e) {

           
// An exception occurred

System.out.println("Document caused an exception");
e.printStackTrace();
       
}
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
 
try {
   
// Initialize the Clean Content API
   
BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

   
// Create a request
   
BFSecureRequest * request = new BFSecureRequest();

   
// Set the default scrubbing behavior to NONE
   
request->SetOption(BFSecureOptions::DefaultScrubBehavior,ScrubOption_Action_None);

   
// Scrub headers and footers
   
request->SetOption(BFSecureOptions::HeadersFooters,ScrubOption_Action_Scrub);

   
// Set search terms

   
std::wstring search[] = {
     
L".*abc.*",
L".*123.*",
L"(.*)Joe(.*)"
   
};

request->SetOption(BFSecureOptions::HeadersFootersSearch, search, 3);

   
// Set behaviors

   
int behavior[] = {
     
SecureOptions_HeadersFootersBehavior_Leave,
      SecureOptions_HeadersFootersBehavior_ScrubText,
      SecureOptions_HeadersFootersBehavior_Replace
   
};

request->SetOption(BFSecureOptions::HeadersFootersBehavior, behavior, 3);

   
// Set replacement text

   
std::wstring replace[] = {
     
L"",
L"",
L"$1Jim$2"
   
};

request->SetOption(BFSecureOptions::HeadersFootersReplace, replace, 3);

   
// Set the document to be scrubbed
   
request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

   
// Set the scrubbed document
   
request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.headers.doc");

   
// Execute the request
   
request->Execute();

   
// Get the response object
   
BFSecureResponse * response = request->GetSecureResponse();

   
// Check for success

   
if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

     
// Print information about the document

     
FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
std::wstring formatname;
BFSecureRequest::GetFileFormatName(format, formatname);

wcout << L"The file has a format of " << formatname << endl;
   
} else {
     
// Processing failed
     
wcout << L"Document processing failed" << endl;
   
}

   
BFSecureRequest::Shutdown();

 
} catch (BFTransformException & ex) {

   
wcout << ex.wwhat() << endl;
wcout << ex.wextended() << endl;

BFTransformException * cause = ex.getCause();

   
while (cause != NULL) {
     
wcout << cause->wwhat() << endl;
wcout << cause->wextended() << endl;
cause = cause->getCause();
   
}
  }

 
return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
   
class Headers
   
{
       
static void Main(string[] args)
        {

           
// Initialize API

           
SecureHelper.Startup(true);

           
// Create a request

           
SecureRequest request = new SecureRequest();

           
// Set the default scrubbing behavior to NONE

           
request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

           
// Scrub headers and footers

           
request.SetOption(SecureOptions.HeadersFooters, ScrubOption.Action.SCRUB);

           
// List of regular expressions to match in headers and footers

           
string[] search = new string[]{
               
".*abc.*",
               
".*123.*",
               
"(.*)Joe(.*)"
               
};

           
// List of behaviors to take on a match condition

           
int[] behavior = new int[]{
               
SecureOptions.HeadersFootersBehaviorOption.Leave,
                SecureOptions.HeadersFootersBehaviorOption.ScrubText,
                SecureOptions.HeadersFootersBehaviorOption.Replace
               
};

           
// List of replacement text items

           
string[] replace = new string[]{
               
null,
                null,
               
"$1Jim$2"
               
};

           
// Set the lists

           
request.SetOption(SecureOptions.HeadersFootersSearch, search);
request.SetOption(SecureOptions.HeadersFootersBehavior, behavior);
request.SetOption(SecureOptions.HeadersFootersReplace, replace);

           
// Set the document to modify

           
request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

           
// Set the modified document

           
request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.headers.doc"));

           
try
           
{

               
// Execute the request

               
request.Execute();

               
// Get the response object

               
SecureResponse response = request.GetResponse();

               
// Check for success

               
if (response.GetResult(SecureOptions.WasProcessed))
                {

                   
// Print information about the document

                   
Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);

               
}
               
else
               
{

                   
// Processing failed

                   
Console.WriteLine("Document processing failed");
               
}

               
// Close the response

               
response.Close();
           
}
           
catch (TransformException e)
            {

               
// An exception occurred

Console.WriteLine("Document caused an exception");
Console.WriteLine(e.ToString());

           
}

           
// Close the request

           
request.Close();

           
// Uninitialize API

           
SecureHelper.Shutdown();
       
}

    }
}
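
The three parallel lists used in this section (search expressions, behaviors and replacement text) can be tried out with java.util.regex before a document is processed. The following is a sketch under stated assumptions: that expressions are tested in list order with the first match winning, that an expression must match the whole header or footer text, and that "$1" and "$2" refer to capture groups as they do in Java. None of these is confirmed by this guide, and the Behavior enum and apply methods are hypothetical names. Only text is modeled; fields, images and page numbers are not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderFooterRulesSketch {

    // Hypothetical stand-ins for the Leave, ScrubText and Replace behaviors
    enum Behavior { LEAVE, SCRUB_TEXT, REPLACE }

    // The same three rules used in the samples above
    static final String[] SEARCH = { ".*abc.*", ".*123.*", "(.*)Joe(.*)" };
    static final Behavior[] BEHAVIOR = { Behavior.LEAVE, Behavior.SCRUB_TEXT, Behavior.REPLACE };
    static final String[] REPLACEMENT = { null, null, "$1Jim$2" };

    // Returns the text after the first matching rule is applied, or null
    // when no rule matches and the header or footer is removed entirely.
    static String apply(String text, String[] search, Behavior[] behavior, String[] replace) {
        for (int i = 0; i < search.length; i++) {
            Matcher m = Pattern.compile(search[i]).matcher(text);
            if (!m.matches()) {
                continue; // assumption: whole-text match required
            }
            switch (behavior[i]) {
                case LEAVE:
                    return text;                     // left alone
                case SCRUB_TEXT:
                    return "";                       // text removed
                case REPLACE:
                    return m.replaceAll(replace[i]); // groups substituted
            }
        }
        return null;
    }

    static String apply(String text) {
        return apply(text, SEARCH, BEHAVIOR, REPLACEMENT);
    }

    public static void main(String[] args) {
        String[] samples = { "Section abc", "Invoice 123", "Prepared by Joe Smith", "Confidential" };
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + apply(s));
        }
    }
}
```

The capture groups in "(.*)Joe(.*)" are what preserve the text on either side of "Joe" when the replacement "$1Jim$2" is applied.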

Extraction

In addition to analysis and scrubbing, Clean Content can extract the text, property and structural information from documents. The OutputType option tells the API if and how this data should be delivered. Possible values for this option include:

NoOutput
Disables text extraction (this is the default)

ToText
Outputs just the text to a simple text file. The ResultDocument option defines where the text will be written. The ToTextEncoding option controls the encoding of the text. If ToTextEncoding is set to UTF16, the text output is in Unicode UTF-16, the byte order is the platform's native order, the line separator is the platform's native line separator and the first character is always the Unicode Byte Order Mark (BOM). If ToTextEncoding is set to UTF8, the text output is in Unicode UTF-8 and the line separator is the platform's native line separator.

ToXML
Outputs complete text, property and structure information to an XML file. The ResultDocument option defines where the XML will be written. In addition, the TransformResult (a boolean) and ResultTransform (a document) options allow an XSLT process to be applied to the XML before it reaches the ResultDocument.

ToHandler
Outputs complete text, property and structure information to a developer provided element handler (much like a SAX content handler). The ElementHandler option defines where the data will be written. This is by far the fastest way to receive the extracted data.
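
Because ToText output in UTF16 mode starts with a Byte Order Mark and uses the platform's native byte order, a reader should let the BOM select the byte order rather than assume one. The following sketch uses only standard Java classes and does not call the Clean Content API. The sample method is a hypothetical helper that builds bytes in the layout described for UTF16 above (BOM first, then the text) so that the example is self-contained.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class Utf16TextSketch {

    // Decodes UTF-16 bytes. Java's UTF-16 charset reads the leading BOM to
    // choose the byte order and does not include the BOM in the result.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Hypothetical helper: builds BOM + text in the given byte order
    static byte[] sample(String text, ByteOrder order) {
        ByteBuffer buf = ByteBuffer.allocate(2 + text.length() * 2).order(order);
        buf.putChar('\uFEFF'); // Byte Order Mark
        for (char c : text.toCharArray()) {
            buf.putChar(c);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        // The same decode call handles output written on either kind of platform
        System.out.println(decode(sample("Hello", ByteOrder.LITTLE_ENDIAN)));
        System.out.println(decode(sample("Hello", ByteOrder.BIG_ENDIAN)));
    }
}
```

For a file on disk, Files.readString(path, StandardCharsets.UTF_16) (Java 11 and later) applies the same BOM handling.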

Schema

In the ToXML and ToHandler cases the data will conform to the XML Schema http://www.bitform.net/xml/schema/elements.xsd. This schema is available in the docs directory of this SDK.
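
An XML file produced with ToXML can be consumed with any standard event-driven parser. The following sketch uses the JDK's SAX parser to collect character data from an XML string. It does not use the Clean Content API, and the element names in the sample input (doc and p) are placeholders invented for illustration; they are not taken from elements.xsd, which should be consulted for the real element names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ExtractedXmlSketch {

    // Parses the XML and returns all character data in document order
    static String collectText(String xml) throws Exception {
        StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one run of text
                text.append(ch, start, length);
            }
        };

        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder input; real output follows elements.xsd
        String xml = "<doc><p>Hello </p><p>World</p></doc>";
        System.out.println(collectText(xml));
    }
}
```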

Sample code - XML output

The following code shows extraction to an XML file by setting OutputType to ToXML.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.options.EnumOptionValue;

import java.io.File;
import java.io.IOException;

public class ToXml {

   
public static void main(String[] args) {

       
// Create a request

       
SecureRequest request = new SecureRequest();

       
// Don't scrub

       
request.setOption(SecureOptions.JustAnalyze, true);

       
// Set the document to extract data from

       
request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

       
// Setup for XML output

       
request.setOption(SecureOptions.OutputType,SecureOptions.OutputTypeOption.ToXML);

       
// Set the XML output document

       
request.setOption(SecureOptions.ResultDocument, new File("c:/temp/test.doc.xml"));

       
try {

           
// Execute the request

           
request.execute();

           
// Get the response object

           
SecureResponse response = request.getResponse();

           
// Check for success

           
if (response.getResult(SecureOptions.WasProcessed)) {

               
// Print information about the document

               
System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

           
} else {

               
// Processing failed

               
System.out.println("Document processing failed");
           
}
        }
catch (IOException e) {

           
// An exception occurred

System.out.println("Document caused an exception");
e.printStackTrace();
       
}
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
 
try {
   
// Initialize the Clean Content API
   
BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

   
// Create a request
   
BFSecureRequest * request = new BFSecureRequest();

   
// Don't scrub
   
request->SetOption(BFSecureOptions::JustAnalyze,TRUE);

   
// Set the document to extract data from
   
request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

   
// Setup for XML output
   
request->SetOption(BFSecureOptions::OutputType,SecureOptions_OutputType_ToXML);

   
// Set the XML output document
   
request->SetOption(BFSecureOptions::ResultDocument,L"c:/temp/test.doc.xml");

   
// Execute the request
   
request->Execute();

   
// Get the response object
   
BFSecureResponse * response = request->GetSecureResponse();

   
// Check for success

   
if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

     
// Print information about the document

     
FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
std::wstring formatname;
BFSecureRequest::GetFileFormatName(format, formatname);

wcout << L"The file has a format of " << formatname << endl;

   
} else {
     
     
// Processing failed
     
wcout << L"Document processing failed" << endl;
   
}

   
BFSecureRequest::Shutdown();

 
} catch (BFTransformException & ex) {

   
wcout << ex.wwhat() << endl;
wcout << ex.wextended() << endl;

BFTransformException * cause = ex.getCause();

   
while (cause != NULL) {
     
wcout << cause->wwhat() << endl;
      wcout << cause->wextended
() << endl;
      cause = cause->getCause
();
   
}
  }

 
return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class ToXml
    {
        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Don't scrub
            request.SetOption(SecureOptions.JustAnalyze, true);

            // Set the document to extract data from
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Set up for XML output
            request.SetOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToXML);

            // Set the XML output document
            request.SetOption(SecureOptions.ResultDocument, new FileInfo("c:/temp/test.doc.xml"));

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

Sample code - Element handler

This code shows extraction to a developer-provided element handler by setting OutputType to ToHandler.

Java

import net.bitform.api.elements.*;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureResponse;

import java.io.File;
import java.io.IOException;
import java.nio.CharBuffer;

public class ToHandler {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Don't scrub
        request.setOption(SecureOptions.JustAnalyze, true);

        // Set the document to extract data from
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        // Set up for element handler output
        request.setOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToHandler);

        // Simple element handler class
        class MyHandler extends BaseElementHandler {

            /* Override just a few elements */

            public void startContent(ContentElement element) throws IOException {
                System.out.println("Format of content is " + element.format.getName());
            }

            public void endContent(Element element) throws IOException {
                System.out.println("Content ends");
            }

            public void startStringProperty(StringPropertyElement element) throws IOException {
                System.out.println("String property " + element.name + " has a value of " + element.value);
            }

            public void text(CharBuffer buffer) throws IOException {
                System.out.println(buffer.toString());
            }

            public void startDateProperty(DatePropertyElement element) throws IOException {
                System.out.println(element.value);
            }
        }

        // Set the handler
        request.setOption(SecureOptions.ElementHandler, new MyHandler());

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Check for success
            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document
                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

            } else {

                // Processing failed
                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Don't scrub
    request->SetOption(BFSecureOptions::JustAnalyze, TRUE);

    // Set the document to extract data from
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Simple element handler class
    class MyHandler : public BFBaseElementHandler {

      /* Override just a few elements */

      void StartContent(BFContentElement * element) {
        std::wstring formatName;
        BFSecureRequest::GetFileFormatName((FileFormats)element->format, formatName);
        wcout << L"Format of content is " << formatName << endl;
      }

      void EndContent(BFElement * element) {
        wcout << L"Content ends" << endl;
      }

      void StartStringProperty(BFStringPropertyElement * element) {
        wcout << L"String property " << element->name << L" has value " << element->value << endl;
      }

      void Text(void * buffer, BFINT32 count) {
        wchar_t * chars = (wchar_t *)buffer;
        chars[count] = 0x00;

#ifdef BFWIN
        // The following code gets around a problem with Windows console
        // output of Unicode characters over 255.
        // In the real world you (the developer) would be doing something
        // more interesting with the Unicode text.
        for (int i = 0; i < count; i++) if (chars[i] > 255) chars[i] = '.';
        // End Windows fix
#endif

        wcout << chars << endl;
      }
    };

    MyHandler myElementHandler = MyHandler();

    // Set up for element handler output
    request->SetOption(BFSecureOptions::OutputType, SecureOptions_OutputType_ToHandler);

    // Set the element handler
    request->SetOption(BFSecureOptions::ElementHandler, &myElementHandler);

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success
    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document
      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

    } else {

      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    // Print the exception and then walk its chain of causes
    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Runtime.InteropServices;
using System.Diagnostics;
using CleanContent;

namespace Main
{
    class ToHandler
    {
        // Simple element handler class
        class MyHandler : BaseElementHandler
        {
            /* Override just a few elements */

            public override void StartContent(IntPtr handler, ref ElementHandler.ContentElement element)
            {
                Console.WriteLine("Format of content is " + element.format.Description);
            }

            public override void EndContent(IntPtr handler, ref ElementHandler.Element element)
            {
                Console.WriteLine("Content ends");
            }

            public override void StartStringProperty(IntPtr handler, ref ElementHandler.StringPropertyElement element)
            {
                Console.WriteLine("String property " + element.name + " has a value of " + element.value);
            }

            public override void StartDateProperty(IntPtr handler, ref ElementHandler.DatePropertyElement element)
            {
                Console.WriteLine("Date property " + element.name + " has a value of " + element.value.ToLongDateString());
            }

            public override void Text(IntPtr handler, char[] text, int length)
            {
                Console.WriteLine(text, 0, length);
            }
        }

        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Don't scrub
            request.SetOption(SecureOptions.JustAnalyze, true);

            // Set the document to extract data from
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Set up for element handler output
            request.SetOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToHandler);

            // Set the handler
            MyHandler mh = new MyHandler();
            request.SetOption(SecureOptions.ElementHandler, mh);

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

PowerPoint Fingerprinting

When extracting data from PowerPoint documents, the developer may also choose to receive a fingerprint (an MD5 hash of the relevant data) for the content and/or appearance of each slide by setting the GenerateSlideContentFingerprint and GenerateSlideAppearanceFingerprint options. Fingerprint values are delivered through the startFingerprint method of your ElementHandler or through the fingerprint element in the XML output. For more details, see the technical note on PowerPoint fingerprinting.
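To make the idea of a fingerprint concrete, the following sketch computes an MD5 hash over a piece of slide text with the standard java.security.MessageDigest class. It is only an illustration of what an MD5 fingerprint is. Which slide data Clean Content actually hashes is covered in the technical note, and the names SlideFingerprintSketch and md5Hex are hypothetical and not part of the Clean Content API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the Clean Content API
final class SlideFingerprintSketch {

    // Returns the MD5 digest of the given text as 32 lowercase hex characters
    static String md5Hex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Only reached if the runtime has no MD5 provider
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed slide text, for illustration only
        String slideA = "Quarterly results";
        String slideB = "Quarterly results";
        String slideC = "Quarterly results (draft)";

        // Identical data gives identical fingerprints; changed data does not
        System.out.println(md5Hex(slideA).equals(md5Hex(slideB)));
        System.out.println(md5Hex(slideA).equals(md5Hex(slideC)));
    }
}
```

Comparing fingerprints in this way is how an application could detect that a slide's content or appearance is unchanged between two versions of a presentation without comparing the slides themselves.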

Recursion into embeddings

During analysis, scrubbing and extraction, Clean Content may encounter embedded objects. For example, an Excel spreadsheet may be embedded in a Word document. Clean Content allows embedded objects of certain types to be recursively processed for analysis, scrubbing and extraction. For example, when scrubbing a Word document it is possible to set these options so that all embedded Word, Excel and PowerPoint documents are also scrubbed (not removed). The EmbeddingRecurseList option lists the file formats that should be recurred into, and the EmbeddingRecurseDepth option defines the maximum depth of the recursion.

There are two important things to note about recursion. First, recursion into a particular embedded object overrides the EmbeddedObjects scrub target. That is, even if the EmbeddedObjects target is set to SCRUB, embedded objects that are recurred into are not removed entirely (the normal behavior of the EmbeddedObjects target) but are scrubbed with the same options as the main document. Second, all options that apply to the main document, including extraction, also apply to embedded objects that are recurred into. This allows text, property and structure information to be extracted from embedded objects to any depth required.
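The interaction between the recursion options and the EmbeddedObjects target can be sketched as a small decision function. This is a simplified model for illustration, not Clean Content code: formats are plain strings rather than FileFormat values, the Outcome enum and the decide method are hypothetical, and it assumes that a depth of 1 means a direct child embedding of the source document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the recursion rules, not part of the Clean Content API
final class RecursionRuleSketch {

    enum Outcome { PROCESSED_WITH_MAIN_OPTIONS, REMOVED, LEFT_INTACT }

    // depth is assumed to be 1 for a direct child embedding of the source document
    static Outcome decide(String format, int depth, Set<String> recurseList,
                          int recurseDepth, boolean embeddedObjectsScrub) {
        // Recursion wins over the EmbeddedObjects target
        if (depth <= recurseDepth && recurseList.contains(format)) {
            return Outcome.PROCESSED_WITH_MAIN_OPTIONS;
        }
        // Otherwise the EmbeddedObjects target decides the embedding's fate
        return embeddedObjectsScrub ? Outcome.REMOVED : Outcome.LEFT_INTACT;
    }

    public static void main(String[] args) {
        Set<String> list = new HashSet<>(Arrays.asList("WORD", "EXCEL", "POWERPOINT"));
        System.out.println(decide("EXCEL", 1, list, 1, true));
        System.out.println(decide("EXCEL", 2, list, 1, true));
        System.out.println(decide("PDF", 1, list, 1, true));
    }
}
```

With a recursion depth of 1 and EmbeddedObjects set to SCRUB, a first-level Excel embedding is processed with the main document's options, while a second-level Excel embedding or an embedding of a format outside the list is removed.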

Sample code

The following code shows how to recur into all first-level Word, Excel and PowerPoint documents without recurring any deeper. All embedded objects that are not Word, Excel or PowerPoint documents, or that are below the first level, will be completely removed, leaving only their cached image. Word, Excel and PowerPoint embeddings at the first level (that is, direct child embeddings of the source document) will be scrubbed of Comments but otherwise left intact.

Note that if extraction were enabled (which it isn't in this sample code) the output would include text, structure and other data from first level Word, Excel and PowerPoint embeddings.

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.FileFormat;

import java.io.File;
import java.io.IOException;

public class Recur {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Set the default scrubbing behavior to NONE
        request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

        // Set Embedded Objects and Comments to be scrubbed
        request.setOption(SecureOptions.EmbeddedObjects, ScrubOption.Action.SCRUB);
        request.setOption(SecureOptions.Comments, ScrubOption.Action.SCRUB);

        // Recur into Word, Excel and PowerPoint embeddings.
        // Embedded Objects and Comments in these embedding types will also be scrubbed
        request.setOption(SecureOptions.EmbeddingRecurseList, new FileFormat[] {FileFormat.WORD8, FileFormat.EXCEL8, FileFormat.POWERPOINT8});

        // Recur into only the first level of embeddings
        request.setOption(SecureOptions.EmbeddingRecurseDepth, 1);

        // Set the document to be scrubbed
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        // Set the scrubbed document
        request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.recur.doc"));

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Check for success
            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document
                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

            } else {

                // Processing failed
                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Set the default scrubbing behavior to NONE
    request->SetOption(BFSecureOptions::DefaultScrubBehavior, ScrubOption_Action_None);

    // Set Embedded Objects and Comments to be scrubbed
    request->SetOption(BFSecureOptions::EmbeddedObjects, ScrubOption_Action_Scrub);
    request->SetOption(BFSecureOptions::Comments, ScrubOption_Action_Scrub);

    // Recur into one level of Word, Excel and PowerPoint embeddings.
    // Embedded Objects and Comments in these embedding types will also be scrubbed.
    // All other embedding types will be removed completely
    enum FileFormats formats[] = {BFFileFormat::WORD8, BFFileFormat::EXCEL8, BFFileFormat::POWERPOINT8};
    request->SetOption(BFSecureOptions::EmbeddingRecurseList, formats, 3);
    request->SetOption(BFSecureOptions::EmbeddingRecurseDepth, 1);

    // Set the document to be scrubbed
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Set the scrubbed document
    request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.recur.doc");

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success
    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document
      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

    } else {

      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    // Print the exception and then walk its chain of causes
    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class Recur
    {
        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Set the default scrubbing behavior to NONE
            request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

            // Set Embedded Objects and Comments to be scrubbed
            request.SetOption(SecureOptions.EmbeddedObjects, ScrubOption.Action.SCRUB);
            request.SetOption(SecureOptions.Comments, ScrubOption.Action.SCRUB);

            // Recur into Word, Excel and PowerPoint embeddings.
            // Embedded Objects and Comments in these embedding types will also be scrubbed
            request.SetOption(SecureOptions.EmbeddingRecurseList, new FileFormat[] { FileFormat.WORD8, FileFormat.EXCEL8, FileFormat.POWERPOINT8 });

            // Recur into only the first level of embeddings
            request.SetOption(SecureOptions.EmbeddingRecurseDepth, 1);

            // Set the document to be scrubbed
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Set the scrubbed document
            request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.recur.doc"));

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

Embedding Export

Clean Content allows embedded objects and images of certain types to be exported to standalone files for further processing or display. The EmbeddingExportList option lists the file formats that should be exported, the EmbeddingExportDirectory option sets the default directory where exported embeddings and images are placed (it defaults to the current directory), and the EmbeddingExportBaseFileName option sets the default file name prefix for exported files. In addition, the developer may track or modify the locations of exported embeddings and images using the ExportDocument option during the startEmbeddedContent method in an element handler.

Sample code

The following code shows how to export all Excel, Windows Metafile, Windows Enhanced Metafile, JPEG and PNG embeddings in a document. Files like test.doc.em1.xls, test.doc.em2.wmf, test.doc.em3.png, test.doc.em4.jpg will be placed in the c:\temp directory along with the XML extracted from the document. The XML will reference the exported image files.
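The example file names above follow a visible pattern: the base file name, a running index, and an extension for the exported format. The sketch below reproduces that pattern purely to show how the names relate to EmbeddingExportBaseFileName. The actual naming logic is internal to Clean Content, so treat the exportName helper and the assumption that the index starts at 1 as hypothetical.

```java
// Hypothetical illustration of the export file name pattern, not Clean Content code
final class ExportNameSketch {

    // Builds a name such as "test.doc.em1.xls" from its three parts
    static String exportName(String baseFileName, int index, String extension) {
        return baseFileName + index + "." + extension;
    }

    public static void main(String[] args) {
        String[] extensions = {"xls", "wmf", "png", "jpg"};
        for (int i = 0; i < extensions.length; i++) {
            // Index assumed to start at 1, matching the example names in the text
            System.out.println(exportName("test.doc.em", i + 1, extensions[i]));
        }
    }
}
```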

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.FileFormat;

import java.io.File;
import java.io.IOException;

public class Export {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Don't scrub
        request.setOption(SecureOptions.JustAnalyze, true);

        // Export Excel, Windows Metafile, Windows Enhanced Metafile, JPEG and PNG embeddings
        // to c:\temp using names starting with 'test.doc.em'
        request.setOption(SecureOptions.EmbeddingExportList, new FileFormat[] {FileFormat.EXCEL8, FileFormat.WMF, FileFormat.EMF, FileFormat.JPEG, FileFormat.PNG});
        request.setOption(SecureOptions.EmbeddingExportDirectory, new File("c:/temp"));
        request.setOption(SecureOptions.EmbeddingExportBaseFileName, "test.doc.em");

        // Set the source document
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        // Set up for XML output
        request.setOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToXML);

        // Set the XML output document
        request.setOption(SecureOptions.ResultDocument, new File("c:/temp/test.doc.xml"));

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Check for success
            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document
                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

            } else {

                // Processing failed
                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Don't scrub
    request->SetOption(BFSecureOptions::JustAnalyze, TRUE);

    // Export Excel, Windows Metafile, Windows Enhanced Metafile, JPEG and PNG embeddings
    // to c:\temp using names starting with 'test.doc.em'
    enum FileFormats formats[] = {BFFileFormat::EXCEL8, BFFileFormat::WMF, BFFileFormat::EMF, BFFileFormat::JPEG, BFFileFormat::PNG};
    request->SetOption(BFSecureOptions::EmbeddingExportList, formats, 5);
    request->SetOption(BFSecureOptions::EmbeddingExportDirectory, L"c:/temp");
    request->SetOption(BFSecureOptions::EmbeddingExportBaseFileName, L"test.doc.em");

    // Set the source document
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Set up for XML output
    request->SetOption(BFSecureOptions::OutputType, SecureOptions_OutputType_ToXML);

    // Set the XML output document
    request->SetOption(BFSecureOptions::ResultDocument, L"c:/temp/test.doc.xml");

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success
    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document
      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

    } else {

      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    // Print the exception and then walk its chain of causes
    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using CleanContent;

namespace Main
{
    class Export
    {
        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Don't scrub
            request.SetOption(SecureOptions.JustAnalyze, true);

            // Export Excel, Windows Metafile, Windows Enhanced Metafile, JPEG and PNG embeddings
            // to c:\temp using names starting with 'test.doc.em'
            request.SetOption(SecureOptions.EmbeddingExportList, new FileFormat[] { FileFormat.EXCEL8, FileFormat.WMF, FileFormat.EMF, FileFormat.JPEG, FileFormat.PNG });
            request.SetOption(SecureOptions.EmbeddingExportDirectory, new DirectoryInfo("c:/temp"));
            request.SetOption(SecureOptions.EmbeddingExportBaseFileName, "test.doc.em");

            // Set the document to extract data from
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Set up for XML output
            request.SetOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToXML);

            // Set the XML output document
            request.SetOption(SecureOptions.ResultDocument, new FileInfo("c:/temp/test.doc.xml"));

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

Embedding Replacement

Along with exporting embeddings and images, Clean Content allows a developer using an element handler to replace embedded objects and images with ones of their choosing, within certain strict limitations. Replacement is achieved by using the following options during the startEmbeddedContent and processEmbeddedContent methods of a developer-provided element handler.

ExportDocument
Describes the location where this embedded object or image is being saved and allows the location to be overridden on an embedding-by-embedding basis. See Export options above.

ExportPossibleReplacementFormats
Describes the possible formats that this embedded object or image can be replaced with

ExportMaximumReplacementSize
Describes the maximum number of bytes that can be provided to replace this embedded object or image. If the value of this option is 0 (zero) then any size replacement is allowed.

ExportReplacementFormat
Set by the developer to describe the format of the bytes provided to replace this embedded object or image

ExportReplacementDocument
Set by the developer to describe the file that contains the bytes provided to replace this embedded object or image
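The two limits described by ExportPossibleReplacementFormats and ExportMaximumReplacementSize amount to a simple eligibility check before a replacement is offered. The sketch below models that check with plain Java types. It assumes formats are represented as strings, and the canReplace helper is hypothetical rather than part of the Clean Content API.

```java
import java.util.Arrays;

// Hypothetical model of the replacement limits, not part of the Clean Content API
final class ReplacementCheckSketch {

    // maxReplacementSize of 0 means a replacement of any size is allowed
    static boolean canReplace(long maxReplacementSize, String[] possibleFormats,
                              String replacementFormat, long replacementSize) {
        boolean fits = maxReplacementSize == 0 || replacementSize <= maxReplacementSize;
        boolean formatAllowed = Arrays.asList(possibleFormats).contains(replacementFormat);
        return fits && formatAllowed;
    }

    public static void main(String[] args) {
        String[] possible = {"PNG", "JPEG"};
        System.out.println(canReplace(0, possible, "PNG", 50_000));
        System.out.println(canReplace(10_000, possible, "PNG", 50_000));
        System.out.println(canReplace(0, possible, "WMF", 1_000));
    }
}
```

In an element handler, a check along these lines would come before setting ExportReplacementFormat and ExportReplacementDocument.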

Sample code

The following code replaces every Windows Metafile with a single PNG where possible. While this is somewhat useless behavior it demonstrates the basic code structure.

Java Hide codeShow code

import net.bitform.api.FileFormat;
import net.bitform.api.elements.BaseElementHandler;
import net.bitform.api.elements.EmbeddedContentElement;
import net.bitform.api.options.FileOptionValue;
import net.bitform.api.options.ScrubOption;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureResponse;

import java.io.File;
import java.io.IOException;

public class Replace {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Need to be scrubbing since we need a ScrubbedDocument to hold replacements
        // but don't really want to scrub anything else
        request.setOption(SecureOptions.JustAnalyze, false);
        request.setOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

        // Export Windows Metafiles
        request.setOption(SecureOptions.EmbeddingExportList, new FileFormat[]{FileFormat.WMF, FileFormat.EMF});
        request.setOption(SecureOptions.EmbeddingExportDirectory, new File("c:/temp"));
        request.setOption(SecureOptions.EmbeddingExportBaseFileName, "metafile");

        // Set the source document
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.doc"));

        // Element handler to replace metafiles with PNGs
        class MyHandler extends BaseElementHandler {

            // The start of embedded content
            // This sample just prints out the file path but a real world application
            // might want to process the embedding, possibly to generate a replacement
            // in another format.
            public void startEmbeddedContent(EmbeddedContentElement element) throws IOException {
                if (element.exportOptions != null) {
                    FileOptionValue file = element.exportOptions.getOption(SecureOptions.ExportDocument);
                    if (file.isFile()) {
                        System.out.println("The exported embedding is in " + file.getFile().getAbsolutePath());
                    }
                }
            }

            // This method gives the developer the opportunity to replace the embedding
            public void processEmbeddedContent(EmbeddedContentElement element) throws IOException {

                // If this image can be replaced
                if (element.isReplaceable) {

                    // Replace with a small, fixed PNG
                    File replacementFile = new File("c:/temp/small.png");

                    long maxFileSize = element.exportOptions.getOption(SecureOptions.ExportMaximumReplacementSize);

                    // If the PNG will fit in the space available
                    // or there is no limit (maxFileSize == 0)
                    if (maxFileSize == 0 || maxFileSize >= replacementFile.length()) {

                        FileFormat[] formats = element.exportOptions.getOption(SecureOptions.ExportPossibleReplacementFormats);

                        for (int i = 0; i < formats.length; i++) {

                            // If PNG is one of the possible replacement formats, replace the image
                            if (formats[i] == FileFormat.PNG) {
                                element.exportOptions.setOption(SecureOptions.ExportReplace, true);
                                element.exportOptions.setOption(SecureOptions.ExportReplacementFormat, FileFormat.PNG);
                                element.exportOptions.setOption(SecureOptions.ExportReplacementDocument, replacementFile);
                                break;
                            }
                        }
                    }
                }
            }
        }

        // Setup for output to my element handler
        request.setOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToHandler);
        request.setOption(SecureOptions.ElementHandler, new MyHandler());

        // Set scrubbed document
        request.setOption(SecureOptions.ScrubbedDocument, new File("c:/temp/test.replace.doc"));

        try {
            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Check for success
            if (response.getResult(SecureOptions.WasProcessed)) {

                // Print information about the document
                System.out.println("The file has a format of " + response.getResult(SecureOptions.SourceFormat).getName());

            } else {

                // Processing failed
                System.out.println("Document processing failed");
            }
        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

C++

#include <iostream>
#include <tchar.h>
#include <malloc.h>
#include <sys/types.h>
#include <sys/stat.h>

using namespace std;

#include "secureapi.h"

#ifdef BFWIN
#include <windows.h>
#endif

int main(int argc, _TCHAR* argv[])
{
  try {
    // Initialize the Clean Content API
    BFSecureRequest::Startup(BFSTARTUPFEATURE_DEBUG);

    // Create a request
    BFSecureRequest * request = new BFSecureRequest();

    // Need to be scrubbing since we need a ScrubbedDocument to hold
    // replacements but don't really want to scrub anything else
    request->SetOption(BFSecureOptions::JustAnalyze, FALSE);
    request->SetOption(BFSecureOptions::DefaultScrubBehavior, ScrubOption_Action_None);

    // Export Windows Metafiles
    enum FileFormats formats[] = {BFFileFormat::WMF, BFFileFormat::EMF};
    request->SetOption(BFSecureOptions::EmbeddingExportList, formats, 2);

    // Set the document to be scrubbed
    request->SetOption(BFSecureOptions::SourceDocument, L"c:/temp/test.doc");

    // Set the scrubbed document
    request->SetOption(BFSecureOptions::ScrubbedDocument, L"c:/temp/test.replace.doc");

    // Element handler to replace metafiles with PNGs
    class MyHandler : public BFBaseElementHandler {

      // The start of embedded content
      // This sample just prints out the file path but a real world application
      // might want to process the embedding, possibly to generate a replacement
      // in another format.
      void StartEmbeddedContent(BFEmbeddedContentElement * element) {

        // Use the exportOptions handle to create a BFOptionSet.
        // The handle could also be used directly with the appropriate C options functions
        BFOptionSet exportOptions(element->exportOptions);

        // Show the file name where the embedding will be exported
        if (exportOptions.IsValid()) {
          std::wstring fileName;
          exportOptions.GetOption(BFSecureOptions::ExportDocument, fileName);
          wcout << "The exported embedding is in " << fileName << endl;
        }
      }

      // This method gives the developer the opportunity to replace the embedding
      void ProcessEmbeddedContent(BFEmbeddedContentElement * element) {

        // Use the exportOptions handle to create a BFOptionSet.
        // The handle could also be used directly with the appropriate C options functions
        BFOptionSet exportOptions(element->exportOptions);

        // If this image can be replaced
        if (element->isReplaceable == BFTRUE) {

          // Replace with small fixed PNG
          wstring replacementFile(L"c:\\temp\\small.png");

          BFINT64 maxFileSize = exportOptions.GetOption(BFSecureOptions::ExportMaximumReplacementSize);

          // If the PNG will fit in the space available
          // or there is no limit (maxFileSize == 0)
          struct _stat buf;
          _wstat(replacementFile.c_str(), &buf);

          if (maxFileSize == 0 || maxFileSize >= buf.st_size) {

            enum FileFormats formats[20];
            int formatCount = 20;

            exportOptions.GetOption(BFSecureOptions::ExportPossibleReplacementFormats, formats, &formatCount);

            // If PNG is one of the possible replacement formats, replace the image
            for (int i = 0; i < formatCount; i++) {
              if (formats[i] == BFFileFormat::PNG) {
                exportOptions.SetOption(BFSecureOptions::ExportReplace, true);
                exportOptions.SetOption(BFSecureOptions::ExportReplacementFormat, BFFileFormat::PNG);
                exportOptions.SetOption(BFSecureOptions::ExportReplacementDocument, replacementFile);
                break;
              }
            }
          }
        }
      }
    };

    MyHandler myElementHandler = MyHandler();

    // Setup for element handler output
    request->SetOption(BFSecureOptions::OutputType, SecureOptions_OutputType_ToHandler);

    // Set the element handler
    request->SetOption(BFSecureOptions::ElementHandler, &myElementHandler);

    // Execute the request
    request->Execute();

    // Get the response object
    BFSecureResponse * response = request->GetSecureResponse();

    // Check for success
    if (response->GetBooleanResult(BFSecureOptions::WasProcessed)) {

      // Print information about the document
      FileFormats format = response->GetFileFormatResult(BFSecureOptions::SourceFormat);
      std::wstring formatname;
      BFSecureRequest::GetFileFormatName(format, formatname);

      wcout << L"The file has a format of " << formatname << endl;

    } else {

      // Processing failed
      wcout << L"Document processing failed" << endl;
    }

    BFSecureRequest::Shutdown();

  } catch (BFTransformException & ex) {

    wcout << ex.wwhat() << endl;
    wcout << ex.wextended() << endl;

    BFTransformException * cause = ex.getCause();

    while (cause != NULL) {
      wcout << cause->wwhat() << endl;
      wcout << cause->wextended() << endl;
      cause = cause->getCause();
    }
  }

  return 0;
}

C#

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;
using System.Runtime.InteropServices;
using System.Diagnostics;
using CleanContent;

namespace Main
{
    class Replace
    {
        // Element handler class to replace images with small PNG
        class MyHandler : BaseElementHandler
        {
            // The start of embedded content
            // This sample just prints out the file path but a real world application
            // might want to process the embedding, possibly to generate a replacement
            // in another format.
            public override void StartEmbeddedContent(IntPtr handler, ref EmbeddedContentElement element)
            {
                if (element.exportOptions != null)
                {
                    FileInfo file = element.exportOptions.GetOption(SecureOptions.ExportDocument);
                    Console.WriteLine("The exported embedding is of type " + element.format.Name + " and will be exported to the file " + file.FullName);
                }
            }

            // This method gives the developer the opportunity to replace the embedding
            public override void ProcessEmbeddedContent(IntPtr handler, ref EmbeddedContentElement element)
            {
                // If this image can be replaced
                if (element.isReplaceable)
                {
                    // Replace with a small, fixed PNG
                    FileInfo replacementFile = new FileInfo("c:/temp/small.png");

                    long maxFileSize = element.exportOptions.GetOption(SecureOptions.ExportMaximumReplacementSize);

                    // If the PNG will fit in the space available
                    // or there is no limit (maxFileSize == 0)
                    if (maxFileSize == 0 || maxFileSize >= replacementFile.Length)
                    {
                        FileFormat[] formats = element.exportOptions.GetOption(SecureOptions.ExportPossibleReplacementFormats);

                        for (int i = 0; i < formats.Length; i++)
                        {
                            // If PNG is one of the possible replacement formats, replace the image
                            if (formats[i] == FileFormat.PNG)
                            {
                                element.exportOptions.SetOption(SecureOptions.ExportReplace, true);
                                element.exportOptions.SetOption(SecureOptions.ExportReplacementFormat, FileFormat.PNG);
                                element.exportOptions.SetOption(SecureOptions.ExportReplacementDocument, replacementFile);
                                break;
                            }
                        }
                    }
                }
            }
        }

        static void Main(string[] args)
        {
            // Initialize API
            SecureHelper.Startup(true);

            // Create a request
            SecureRequest request = new SecureRequest();

            // Need to be scrubbing since we need a ScrubbedDocument to hold replacements
            // but don't really want to scrub anything else
            request.SetOption(SecureOptions.JustAnalyze, false);
            request.SetOption(SecureOptions.DefaultScrubBehavior, ScrubOption.Action.NONE);

            // Export Windows Metafiles
            request.SetOption(SecureOptions.EmbeddingExportList, new FileFormat[] { FileFormat.WMF, FileFormat.EMF });
            request.SetOption(SecureOptions.EmbeddingExportDirectory, new DirectoryInfo("c:/temp"));
            request.SetOption(SecureOptions.EmbeddingExportBaseFileName, "metafile");

            // Set source document
            request.SetOption(SecureOptions.SourceDocument, new FileInfo("c:/temp/test.doc"));

            // Setup for output to my element handler
            request.SetOption(SecureOptions.OutputType, SecureOptions.OutputTypeOption.ToHandler);
            request.SetOption(SecureOptions.ElementHandler, new MyHandler());

            // Set scrubbed document
            request.SetOption(SecureOptions.ScrubbedDocument, new FileInfo("c:/temp/test.replace.doc"));

            try
            {
                // Execute the request
                request.Execute();

                // Get the response object
                SecureResponse response = request.GetResponse();

                // Check for success
                if (response.GetResult(SecureOptions.WasProcessed))
                {
                    // Print information about the document
                    Console.WriteLine("The file has a format of " + response.GetResult(SecureOptions.SourceFormat).Name);
                }
                else
                {
                    // Processing failed
                    Console.WriteLine("Document processing failed");
                }

                // Close the response
                response.Close();
            }
            catch (TransformException e)
            {
                // An exception occurred
                Console.WriteLine("Document caused an exception");
                Console.WriteLine(e.ToString());
            }

            // Close the request
            request.Close();

            // Uninitialize API
            SecureHelper.Shutdown();
        }
    }
}

PowerPoint Disassembly/Assembly

Preliminary & subject to change

As a special extension to Clean Content, PowerPoint files may be broken into individual slides, each in its own standalone PowerPoint file (disassembly), and a PowerPoint file may be created from a collection of other PowerPoint files (assembly).

Disassembly reuses many of the options from Embedding Export (see above). The EmbeddingExportDirectory option provides the default directory where disassembled slides should be placed and EmbeddingExportBaseFileName provides the default file name prefix to use for exported slides. In addition, the developer may track or modify the locations of exported slides using the ExportDocument option during the startExportDocument method in an element handler. Disassembly is triggered by setting the JustDisassemble option to true.

Assembly is triggered by setting the JustAssemble option to true. It generates a new PowerPoint created from the files provided in the AssembleFileList option. As with disassembly the resulting PowerPoint is placed in EmbeddingExportDirectory using the name prefix EmbeddingExportBaseFileName and may be overridden using the startExportDocument method in an element handler.
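Because disassembled slides are written to EmbeddingExportDirectory with names that begin with EmbeddingExportBaseFileName, an application that does not install an element handler could locate the results afterwards by scanning that directory for the prefix. The following is a minimal sketch of that idea using only the Java standard library. The class ExportedSlides and its find method are hypothetical helpers, not part of the Clean Content API, and the sketch assumes nothing about the portion of each file name that follows the prefix.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper, not part of the Clean Content API.
final class ExportedSlides {

    // Returns the regular files in exportDirectory whose names start with
    // baseFileName, sorted by path. Returns an empty array if the directory
    // does not exist or cannot be read.
    static File[] find(File exportDirectory, String baseFileName) {
        File[] matches = exportDirectory.listFiles(
                (dir, name) -> name.startsWith(baseFileName) && new File(dir, name).isFile());
        if (matches == null) {
            return new File[0];
        }
        Arrays.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        // Directory and prefix taken from the disassembly sample in this guide
        for (File slide : find(new File("c:/temp/out"), "test.ppt.slide")) {
            System.out.println(slide.getName());
        }
    }
}
```

Tracking files through startExportDocument, as the samples in this section do, remains the more precise approach because it reports exactly the files written by the current request.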

Disassembly Sample

Java

import net.bitform.api.FileFormat;
import net.bitform.api.elements.BaseElementHandler;
import net.bitform.api.elements.ExportDocumentElement;
import net.bitform.api.options.FileOptionValue;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureResponse;

import java.io.File;
import java.io.IOException;

public class Disassemble {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Don't scrub
        request.setOption(SecureOptions.JustAnalyze, true);

        // Set to disassemble
        request.setOption(SecureOptions.JustDisassemble, true);

        // Disassemble to c:\temp\out using names starting with 'test.ppt.slide'
        request.setOption(SecureOptions.EmbeddingExportDirectory, new File("c:/temp/out"));
        request.setOption(SecureOptions.EmbeddingExportBaseFileName, "test.ppt.slide");

        // Set the document to disassemble
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.ppt"));

        // Set a handler that just prints out the names of the files as they get exported
        request.setOption(SecureOptions.ElementHandler, new BaseElementHandler() {

            public void startExportDocument(ExportDocumentElement element) throws IOException {
                FileOptionValue file = element.exportOptions.getOption(SecureOptions.ExportDocument);
                if (file.isFile()) {
                    System.out.println(file.getFile().getName());
                }
            }
        });

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Make sure the SourceDocument was PowerPoint since that's all
            // Clean Content currently supports
            FileFormat format = response.getResult(SecureOptions.SourceFormat);

            if (format.is(FileFormat.POWERPOINT8)) {

                // Check for success
                if (response.getResult(SecureOptions.WasProcessed)) {

                    // Print information about the document
                    System.out.println("The file was disassembled");

                } else {

                    // Processing failed
                    System.out.println("Document processing failed");
                }

            } else {
                System.out.println("Files of the format " + format.getName() + " cannot be disassembled");
            }

        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

Assembly Sample

Java

import net.bitform.api.secure.SecureRequest;
import net.bitform.api.secure.SecureOptions;
import net.bitform.api.secure.SecureResponse;
import net.bitform.api.FileFormat;
import net.bitform.api.elements.BaseElementHandler;
import net.bitform.api.elements.ExportDocumentElement;
import net.bitform.api.options.FileOptionValue;

import java.io.File;
import java.io.IOException;

public class Assemble {

    public static void main(String[] args) {

        // Create a request
        SecureRequest request = new SecureRequest();

        // Don't scrub
        request.setOption(SecureOptions.JustAnalyze, true);

        // Set to assemble
        request.setOption(SecureOptions.JustAssemble, true);

        // Assemble three PowerPoint files
        File[] files = {
                new File("c:/temp/test1.ppt"),
                new File("c:/temp/test2.ppt"),
                new File("c:/temp/test3.ppt")
        };

        request.setOption(SecureOptions.AssembleFileList, files);

        // Assemble to c:\temp\out using name starting with 'result'
        request.setOption(SecureOptions.EmbeddingExportDirectory, new File("c:/temp/out"));
        request.setOption(SecureOptions.EmbeddingExportBaseFileName, "result");

        // Set the document to use as a template for masters, etc.
        request.setOption(SecureOptions.SourceDocument, new File("c:/temp/test.ppt"));

        // Set a handler that just prints out the name of the file as it gets exported
        request.setOption(SecureOptions.ElementHandler, new BaseElementHandler() {

            public void startExportDocument(ExportDocumentElement element) throws IOException {
                FileOptionValue file = element.exportOptions.getOption(SecureOptions.ExportDocument);
                if (file.isFile()) {
                    System.out.println(file.getFile().getName());
                }
            }
        });

        try {

            // Execute the request
            request.execute();

            // Get the response object
            SecureResponse response = request.getResponse();

            // Make sure the SourceDocument was PowerPoint since that's all
            // Clean Content currently supports
            FileFormat format = response.getResult(SecureOptions.SourceFormat);

            if (format.is(FileFormat.POWERPOINT8)) {

                // Check for success
                if (response.getResult(SecureOptions.WasProcessed)) {

                    // Print information about the document
                    System.out.println("The files were assembled");

                } else {

                    // Processing failed
                    System.out.println("Document processing failed");
                }

            } else {
                System.out.println("Files of the format " + format.getName() + " cannot be assembled");
            }

        } catch (IOException e) {

            // An exception occurred
            System.out.println("Document caused an exception");
            e.printStackTrace();
        }
    }
}

Threading

In deciding how to introduce this API into your code, one major factor to consider is how your application uses or will use threads to process documents. While a complete discussion of this topic is outside the scope of this document, the following guidelines may provide some direction.

Exception Handling

Clean Content is expected to handle any source document no matter how complex, malformed, hacked or truncated. Processing of such documents is an inherently garbage in/garbage out situation and developers running large numbers of documents (100,000 or more) can expect to see a wide array of exceptions occurring during the SecureRequest execute method. As of version 2007.1 all checked and many unchecked exceptions are caught internally by Clean Content and wrapped in TransformException which is a subclass of IOException. This means that as of version 2007.1 developers need only trap IOException during the execute method. Developers may then call the TransformException getCause method to get more detailed information on the underlying exception.
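Since the wrapped exception is an IOException whose underlying problem is reachable through getCause, a catch block can walk the cause chain to log every level. The sketch below shows that pattern with only standard Java classes. CauseChain and describe are hypothetical names introduced for illustration, and a plain IOException stands in for the TransformException that execute would throw.

```java
import java.io.IOException;

// Hypothetical helper, not part of the Clean Content API.
final class CauseChain {

    // Builds one line per exception in the chain, outermost first.
    // The depth limit is a guard against unexpectedly cyclic cause chains.
    static String describe(Throwable thrown) {
        StringBuilder text = new StringBuilder();
        int depth = 0;
        for (Throwable current = thrown; current != null && depth < 32; current = current.getCause()) {
            if (depth > 0) {
                text.append('\n');
            }
            text.append(current.getClass().getName()).append(": ").append(current.getMessage());
            depth++;
        }
        return text.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for request.execute(): an IOException wrapping an unchecked exception
            throw new IOException("transform failed", new ArrayIndexOutOfBoundsException("index 7"));
        } catch (IOException e) {
            System.out.println(describe(e));
        }
    }
}
```

In application code the try block would contain the execute call, and the catch clause for IOException would stay exactly as shown.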

The unchecked exceptions trapped and wrapped during the execute method include...

The developer is assured failure atomicity and may continue to use the SecureRequest which threw the exception.

To facilitate testing of exceptional conditions, the Clean Content SDK includes a number of specifically modified Microsoft Word documents that trigger Clean Content to generate certain exceptions. These documents are in the SDK's samplefiles/exception directory. The document names indicate the exceptions they generate. Note that these documents DO NOT exercise "bugs" in Clean Content. They have been modified to contain specific data in an innocuous location that the Word transform detects and responds to by purposely throwing the given exception.

Install and Coding Guidelines

Java

Compilation and distribution

Your application must compile and ship with CleanContent.jar from the java/lib directory of the SDK. As with all jar files this one must be included in the classpath of your Java application.

C/C++

General

Including secureapi.h

To use the Clean Content C/C++ API you must include the file secureapi.h from the SDK's c/include directory in your C or C++ source code. It defines all the C API entry points, structures, etc. If included in a C++ source file, secureapi.h also defines the classes in the C++ API. Note that the CleanContentAPI library (CleanContentAPI.dll, CleanContentAPI.so, CleanContentAPI.a, etc.) does not include or export the C++ API classes (only the C API functions). The classes are declared and defined entirely in secureapi.h using the header-only model. This avoids name mangling and other C++ compiler/linker interoperability issues.

Include files are located in the SDK at c/include.

Getting the right pieces is critical!

Over 50% of the issues Oracle sees from customers using the C/C++ API relate to getting all the pieces of the technology in the right location so the platform specific C/C++ library can find them during the BFStartup() function. Please read the Installation and Distribution section for your platform carefully and thoroughly!

Windows

Compiling and Linking

Library

The library CleanContentAPI.lib must be linked with your application and CleanContentAPI.dll must be delivered (see below) with your application. CleanContentAPI.dll has no dependencies (such as MFC or ATL) other than standard Win32 libraries.

The Windows library is located in the SDK at c/lib/windows/x86 for Win32 and c/lib/windows/x64 for Win64.

Installation and Distribution

Your Windows distribution must include these three components.

CleanContentAPI.dll

This DLL must be linked with your application and be available to your application at run time. Any standard method of making this DLL available to your EXE will work including placing it in the same directory as your EXE or putting its location on the PATH.

Oracle strongly advises against placing this DLL in the WINDOWS or SYSTEM directory unless you have complete control of the environment (if your product is a hardware appliance for example). It is possible that another vendor's product that also uses Clean Content (perhaps a different version) will be installed on the same system so keeping this DLL as isolated as possible is in everyone's best interest.

CleanContent.jar

This is the Clean Content Java code that does all the real work. You have two options for locating this file. The best option is to place it in the same directory as CleanContentAPI.dll where it will be found automatically. If that structure does not work for your application you may set the BITFORM_JARPATH environment variable to the directory containing this file. This environment variable may be set in a batch file that starts your application or using Windows' SetEnvironmentVariable() function within your code before calling BFStartup(). We strongly recommend against setting this environment variable globally to avoid conflicts with other vendors' software. Do not use the runtime library _putenv() routine to set BITFORM_JARPATH.

Java Runtime Environment (JRE)

This is a set of DLLs and other files that make up the Java Virtual Machine. There are several options available as to where you get the JRE and where it gets placed in your application's directory structure. Note that this SDK ships with 4 JREs (Win32, Win64, Linux32 and Linux64) but any given platform only needs one JRE.

Where to get it...

Option 1
This SDK includes Windows Java 1.8 JREs in the jres/Windows/x86/jre and jres/Windows/x64/jre directories. Either of these directories (including all files and subdirectories) may be shipped with your application.

Option 2
You may use any other Java 1.8 or above JRE. For example, if your application already ships with a JRE you can reuse it.

Where to put it...

Option 1
The simplest option is to place the JRE in a jre subdirectory of the directory where CleanContentAPI.dll exists.

Option 2
Place the JRE anywhere you like and set the BITFORM_JREPATH environment variable to the jre directory. This environment variable may be set in a batch file that starts your application or using Windows' SetEnvironmentVariable() function within your code before calling BFStartup(). We strongly recommend against setting this environment variable globally to avoid conflicts with other vendors' software. Do not use the runtime library _putenv() routine to set BITFORM_JREPATH.

Example 1

The simplest possible distribution looks like this:

Windows Distribution Example 1

In this case CleanContentAPI.dll will discover the JAR file and the JRE directory; no environment variables need to be set.

Example 2

Let's say your distribution looked like the one below and you know there is a compatible Java Runtime Environment at c:\components\jre.

Windows Distribution Example 2

Your source code might include the following lines:

SetEnvironmentVariableA("BITFORM_JREPATH","c:\\components\\jre");
BFStartup(BFSTARTUPFEATURE_DEBUG);

Linux

Compiling and Linking

Static Library

To use the Clean Content static library simply link with CleanContentAPI.a in either the c/lib/linux/x86/lib or c/lib/linux/x64/lib directories of the SDK depending on your platform.

Shared Library

To use the Clean Content shared library, the library CleanContentAPI must be linked with your application. This library is a shared object located in either the c/lib/linux/x86/lib or c/lib/linux/x64/lib directory of the SDK. The contents of this directory follow Linux library naming standards as follows:

libCleanContentAPI.so.1.0.0
This is the real shared library

libCleanContentAPI.so.1
This is a symbolic link to libCleanContentAPI.so.1.0.0 and represents its "soname". This is the name that executables linked to this library will look for.

libCleanContentAPI.so
This is also a symbolic link to libCleanContentAPI.so.1.0.0 and represents its "link name". This is the name that the linker will look for when your application is linked using -lCleanContentAPI.

Installation and Distribution

A Linux distribution must include two and possibly three components.

libCleanContentAPI.so.1.0.0 (shared library linking only)

This is a shared library that must be linked with your application and must be available to your application at runtime along with a symbolic link of the name libCleanContentAPI.so.1. Like any other shared library your application must be able to find libCleanContentAPI.so.1 at runtime. This can be accomplished in any of the Linux standard ways including the following:

CleanContent.jar

This is the Clean Content Java code that does all the real work. You have two options for locating this file. The best option is to place it in the same directory as your application where it will be found automatically using /proc/self/exe. If that structure does not work for your application you may set the BITFORM_JARPATH environment variable to the directory containing this file (not the file itself). This environment variable may be set in a script file that starts your application or using the _putenv() function within your code before calling BFStartup(). We strongly recommend against setting this environment variable globally to avoid conflicts with other vendors' software.

Java Runtime Environment (JRE)

This is a set of SOs and other files that make up the Java Virtual Machine. There are several options available as to where you get the JRE and where it gets placed in your application's directory structure. Note that this SDK ships with 4 JREs (Win32, Win64, Linux32 and Linux64) but any given platform only needs one JRE.

Where to get it...

Option 1
This SDK includes Linux Java 1.8 JREs in the jres/Linux/x86/jre and jres/Linux/x64/jre directories. Either of these directories (including all files and subdirectories) may be shipped with your application.

Option 2
You may use any other Java 1.8 or above JRE. For example, if your application already ships with a JRE you can reuse it.

Where to put it...

Option 1
The simplest option is to place (or link) the JRE in a jre subdirectory of your application's directory as found using /proc/self/exe.

Option 2
Place the JRE anywhere you like and set the BITFORM_JREPATH environment variable to the jre directory. This environment variable may be set in a script file that starts your application or using the _putenv() function within your code before calling BFStartup(). Oracle strongly recommends against setting this environment variable globally to avoid conflicts with other vendors' software.
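Where an application is started by another program rather than by a script, the same per-process scoping can be achieved by placing BITFORM_JREPATH in the child's environment only. The sketch below illustrates this with Java's standard ProcessBuilder. The class Launcher, its prepare method and the paths in the usage note are hypothetical and only demonstrate how to avoid a global setting; they are not part of the SDK.

```java
import java.io.File;
import java.util.Map;

// Hypothetical launcher, not part of the Clean Content SDK.
final class Launcher {

    // Prepares (but does not start) a process for the given executable with
    // BITFORM_JREPATH set in that process's environment only. The environment
    // of the launching process and of the rest of the system is left untouched.
    static ProcessBuilder prepare(String executable, File jreDirectory) {
        ProcessBuilder builder = new ProcessBuilder(executable);
        Map<String, String> environment = builder.environment();
        environment.put("BITFORM_JREPATH", jreDirectory.getAbsolutePath());
        builder.inheritIO(); // let the child share this process's console
        return builder;
    }
}
```

A caller would then run something like Launcher.prepare("/opt/yourapp/yourapp", new File("/opt/yourapp/runtime/jre")).start(), where both paths are placeholders for the real install locations.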

Distribution Examples

Example 1

The simplest possible distribution looks like this:

Linux Distribution Example 1

In this case the Clean Content code in libCleanContentAPI.so that is linked to yourapp will discover the JAR file and the JRE directory.

Example 2

Let's say your distribution looks like the one below and you know there is a compatible Java Runtime Environment at /usr/java/j2re1.5.0.

Linux Distribution Example 2

Your source code might include the following lines:

setenv("BITFORM_JREPATH", "/usr/java/j2re1.5.0", 1); /* POSIX; _putenv() exists only on Windows */
BFStartup(BFSTARTUPFEATURE_DEBUG);

Compiling the C/C++ API on other platforms

The SDK's c directory contains a standard autoconf/automake/libtool based build process that should work on any reasonable Unix-like OS that provides both a Java 1.8 or above Java Development Kit (JDK) and a recent GNU compiler tool chain. The steps to rebuild the C/C++ API library are as follows:

  1. Find or get a Java 1.8 compatible JDK for your OS.
    Note: This must be a JDK, which allows one to develop Java applications, not just a JRE, which only allows one to run them.
  2. Set the JAVA_HOME environment variable to the root of the JDK. For example:
    export JAVA_HOME=/usr/lib/jvm/java-1.8.0
  3. Change to the c directory of the Clean Content SDK
  4. Run ./configure
  5. If configure completes successfully then run make all install
  6. Results will be placed in a sub-directory under c/lib named using uname

.NET

Close methods

Because of the way the Clean Content architecture interacts with .NET object finalization, the .NET API requires the developer to call the explicit Close methods on SecureRequest and SecureResponse objects. Failing to call Close on these object types results in memory leaks.

Getting the right pieces is critical!

Over 50% of the issues Oracle sees from customers using the .NET API relate to getting all the pieces of the technology in the right location so the .NET assembly can find them during the SecureHelper.Startup method. Please read the Installation and Distribution section carefully and thoroughly!

Compiling and Linking

Assembly

The Clean Content .NET API is provided as a single DLL called CleanContentNET.dll. No installation into the GAC is provided or required.

Installation and Distribution

A .NET distribution includes four components.

CleanContentNET.dll

This .NET assembly must be referenced by your application and available to your application at run time under the rules of the .NET Framework. The simplest way to make this happen is to place it in the same folder as your application.

CleanContentAPI.dll

This is the Clean Content C API DLL on which the .NET API relies. It can be found in the c/lib/windows/x86 or c/lib/windows/x64 directory of the SDK. It must be placed in a location where it can be found by CleanContentNET.dll following the rules of the .NET DllImport attribute. The simplest and best way to make this happen is to place it in the same folder as CleanContentNET.dll.

CleanContent.jar

This is the Clean Content Java code that does all the real work. It can be found in the java/lib directory of the SDK. You have two options for locating this file. The best option is to place it in the same directory as CleanContentAPI.dll, where it will be found automatically. If that structure does not work for your application, you may set the BITFORM_JARPATH environment variable to the directory containing this file. This environment variable may be set in a batch file that starts your application or using .NET's System.Environment.SetEnvironmentVariable method within your code before calling SecureHelper.Startup. We strongly recommend against setting this environment variable globally to avoid conflicts with other vendors' software.

Java Runtime Environment (JRE)

This is a set of DLLs and other files that make up the Java Virtual Machine. There are several options available as to where you get the JRE and where it gets placed in your application's directory structure.

Where to get it...

Option 1
This SDK includes 32-bit and 64-bit Java 1.8 JREs in the jres\Windows\x86\jre and jres\Windows\x64\jre directories. One of these directories (including all files and subdirectories) may be shipped with your application.

Option 2
You may use any other Java 1.8 or later version of the Oracle (formerly Sun) JRE. For example, if your application already ships with a JRE you can reuse it.

Where to put it...

Option 1
The simplest option is to place the JRE in a jre subdirectory of the directory where CleanContentAPI.dll exists.

Option 2
Place the JRE anywhere you like and set the environment variable BITFORM_JREPATH to the jre directory. This environment variable may be set in a batch file that starts your application or using .NET's System.Environment.SetEnvironmentVariable method within your code before calling SecureHelper.Startup. We strongly recommend against setting this environment variable globally to avoid conflicts with other vendors' software.

Distribution Examples

Example 1

The simplest possible distribution looks like this:

DotNet Distribution Example 1

In this case your application will find CleanContentNET.dll, CleanContentNET.dll will find CleanContentAPI.dll, and CleanContentAPI.dll will find the jar and the JRE. No environment variables need to be set.

Example 2

Let's say your distribution looks like the one below and you know there is a compatible Java Runtime Environment at c:\components\jre.

DotNet Distribution Example 2

Your source code might include the following lines:

System.Environment.SetEnvironmentVariable("BITFORM_JREPATH","c:\\components\\jre");
SecureHelper.Startup(true);

Technical Notes

The following technical notes are available with extended technical information on specific formats or features...

Encrypted Document Support

This document outlines the support provided by Clean Content on documents that have been encrypted.

Microsoft Office 2007 XML Document Properties

This document explains how and when Microsoft Office 2007 XML document properties are used by Microsoft, usable by third parties, and processed by Clean Content.

Microsoft Office Open XML Support

This document describes Clean Content's support of Microsoft's Office Open XML file format and the associated ECMA 376 and ISO 29500 standards.

PDF Extraction, Analysis & Scrubbing Support

This document details the complexities of Clean Content support for Adobe's Portable Document Format.

PowerPoint Disassembly and Assembly

This document describes the PowerPoint assembly and disassembly features provided by Clean Content.

Microsoft PowerPoint Fingerprinting

This document describes the PowerPoint fingerprinting feature provided by the Clean Content analysis process.