Root element for all extracted content. A set of loosely coupled documents that are related only in that they exist in the same container such as a ZIP file, TAR file, etc. Meta information about the process that generated the extracted content. Base name of the file being processed if it exists in the normal file system. Complete path to the file being processed if it exists in the normal file system. Date and time the file was processed. User that processed the file. Operating system on which the file was processed. Java VM on which the file was processed. A self-contained document, spreadsheet, presentation, image or drawing. General type of this content. Specific file format of this content. Containing element for a set of items of a certain type that may also be referenced from another location. A distinct area of embedded content where the data is in another application's format. Examples are graphics, OLE objects, embedded files, and XML metadata streams. General type of the content. A text description or other textual information about the embedded content. File format of this content. Key that may be used to reference this content using the ContentRef element. If set to true this content is replaceable. See the Extract options that begin with the word Export for details. A link to an external piece of content. The contents of this element, if any, will be a cached version of the linked content stored locally in the document. Provides the location of the linked content as a path or URL. Indicates whether the link is considered sensitive. A key that may be used to reference this content using the ContentRef element. A container for a piece of content that may be referenced from another place in the document. A key that may be used to reference this content using the ContentRef element. Indicates that the referenced content is used at this location in the document. Contains an attribute named 'reference' that provides a reference to embedded, linked, or sub content. 
The referenced content includes an attribute named 'key' that matches this 'ref' and also contains a matching 'type' attribute. A reference to an EmbeddedContent or LinkedContent element with a matching type attribute and key attribute. A container for content that has been added or deleted during a specific document editing session. The name of the author that created this revision. The date this content was revised. Identifies an exported document. File format of the slide when exported. Contains the contents of a frame. A property containing text that is likely to be document content. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. Page size and margin information in twips. Width of the physical page. Height of the physical page. Right margin. Left margin. Top margin. Bottom margin. Gutter width. A property with a text attribute that is not likely to be document content. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property with a true or false value. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. 
A qualified name to distinguish properties whose names collide. A property with an integer value. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property with a floating point value. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property with a date/time value. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property with a duration value. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. This value may be localized if the local name is provided in the document. The value of the property. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property used to identify binary data properties. A numeric value that can be used to identify the property type as an alternative to the name. A text description of property type. 
This value may be localized if the local name is provided in the document. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property that describes the locale of the creating application. This property is informational only and is not guaranteed to exist. This property may provide a useful hint about the locale associated with the origin of this document. A numeric identifier of the locale associated with this property collection. The value represents a locale id as defined by the Microsoft Win32 SDK. A text description of the locale associated with properties in this document. This property is informational only and is not guaranteed to exist. It may A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. This property provides the code page associated with non-Unicode text properties located in this document. This property is informational only. Note that non-Unicode text properties are internally converted to Unicode before being presented during extraction. This property, when available, may provide a useful hint about the locale associated with the origin of this document but is not guaranteed to be present. A numeric identifier of the code page associated with non-Unicode text properties located in this document. The value represents either a Windows code page or Macintosh Script based on the originating operating system. A text description of the code page associated with non-Unicode text properties. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. A property that groups a list of values provided as string elements. This type of property is commonly used to describe resources used within the document (Fonts, Template, ... 
) or to provide a simple categorization of the content (Sheet names, Slide titles, ...). A text description of property type. A generated attribute that indicates the class of this property within the Secure Target options. A qualified name to distinguish properties whose names collide. Contains text as an attribute. The text is not likely to be document content. The text value associated with this String element. Contains text. The text is likely to be document content. Contains integer information about the element in which it is contained. The numeric value associated with this Integer element. Contains information about the date of the element in which it is contained. The date value associated with this Date element. Contains information that is true or false for the element in which it is contained. The boolean value associated with this Boolean element. The container for a single worksheet within a spreadsheet. The container for a page of content. This element is used to mark PDF pages. Containing element for the content in the body of the document. That is, all text not in a sub-container like a footer, footnote or comment. Beginning of a hyperlink to another document, website, internal location, etc. The destination document is described using a contentref or a linkedcontent child element. A string child element of type HyperlinkAnchor is used to describe target locations within the current or destination document. A string child element of type CellRef is used to describe the source location of hyperlinks within worksheets. Id that will match a following HyperlinkEnd element. End of a hyperlink to another document, website, etc. Hidden text. Added text. The name of the author that added this content. The date this content was added. Deleted text. The name of the author that deleted this content. The date this content was deleted. Contains content found in footnotes, endnotes, speaker notes, comments, and meeting minutes. 
A key that may be used to reference this note using the NoteRef element. Contains content found in annotations that have been added to the document. A reference to a Note element with a matching key attribute. Contains content that has been obfuscated from view in the authoring application. The type attribute identifies the form of obfuscation that has been found. This element is generated during analysis depending on the values of the applicable scrub targets. Contains content found in the header or footer of a document. This element exists because some formats do not differentiate header and footer text structurally. A key that may be used to reference this header or footer. A reference to a HeaderFooter element with a matching key attribute. Contains content that belongs to a single paragraph. Contains content that is visually separated from other content and typically formatted in a positioned rectangular region. Contains content found in a template, master or other such construct. Contains the content of a chart. Contains the content of a presentation slide. A key that may be used when referencing a slide from another context. For PowerPoint documents this key matches the internal slide identifier used within PowerPoint and may therefore be used when accessing slides using Office Automation. Contains text that is defined as the slide title. Contains text that is defined as the body of the slide. Contains an embeddedcontent element that describes a thumbnail image of the page in which it is found. Contains content that belongs to a single section of the document. Contains a series of survey questions. Contains one or more text elements that may include the question text and the question help text. Contains a series of rows. Contains a series of cells. Contains the content of a cell. Contains column definitions and a series of rows. Container for a set of database column definitions. Contains the name of a database table. 
Contains database column type and optional name. Name of the column. Contains database column information. Contains a series of fields. Contains column definitions and a series of archive stream metadata. Container for a set of archive column definitions. Contains the name of an archive table. Contains archive column type and optional name. Name of the column. Contains a series of metadata fields. A database or archive field containing text. A database or archive field containing a numeric value. The numeric value of the field. A database or archive field containing a date value. The date value of the field. A database or archive field containing a boolean value equal to either true or false. The boolean value of the field. Contains the name of a spreadsheet worksheet. A spreadsheet cell containing text. The row number of this text cell. The column number of this text cell. A spreadsheet cell containing a numeric value. The row number of this data cell. The column number of this data cell. The numeric value of the cell. A spreadsheet cell containing a date value. The row number of this date cell. The column number of this date cell. The numeric value of the cell. A spreadsheet cell containing a duration value. The row number of this duration cell. The column number of this duration cell. The numeric value of the cell. A spreadsheet cell containing a boolean value equal to either true or false. The row number of this text cell. The column number of this text cell. The boolean value of the cell. Contains an added cell revision. The 0 based sheet id number where this revision occurred. Contains a deleted cell revision. The 0 based sheet id number where this revision occurred. Contains information about a row within a spreadsheet. A hidden row is indicated by the presence of this element with a child boolean element that indicates the row is hidden. The row number is provided as a 0 based value. The row associated with this RowInfo. 
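The 0 based row and column numbers described above can be turned into the A1-style references that spreadsheet users see. The following is a minimal sketch, not part of the schema: the function names `column_letters` and `a1_reference` are hypothetical, and it assumes that cell row and column attributes are 0 based in the same way as the RowInfo row number.

```python
def column_letters(col: int) -> str:
    """Convert a 0 based column number to spreadsheet letters (0 -> A, 26 -> AA)."""
    if col < 0:
        raise ValueError("column number must be 0 or greater")
    letters = ""
    n = col + 1  # switch to a 1 based count for the bijective base-26 conversion
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def a1_reference(row: int, col: int) -> str:
    """Build an A1-style reference from 0 based row and column numbers (assumed convention)."""
    if row < 0:
        raise ValueError("row number must be 0 or greater")
    return f"{column_letters(col)}{row + 1}"


if __name__ == "__main__":
    print(a1_reference(0, 0))
    print(a1_reference(9, 27))
```

If the extracted cell attributes turn out to be 1 based for a given format, the `+ 1` adjustments would need to be removed.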
Contains information about a range of columns within a spreadsheet. A hidden range of columns is indicated by the presence of this element with a child booleanInfo element to indicate the range of columns is hidden. The first and last column numbers are 0 based values. The first column in the range of columns associated with this ColInfo. The last column in the range of columns associated with this ColInfo. Contains the name, author, and comment associated with a data scenario defined in Excel. This element surrounds Adobe Acrobat text operations and provides the character based highlight position associated with each text character found in the element. This element is generated for every text operation only when the associated option named Generate Acrobat Highlight Positions is set to true. This position information can be used to generate an Adobe highlight file to highlight terms when displaying a PDF file in Acrobat as defined in the Adobe technical note titled HighlightFileFormat.pdf. Only text inside this element can be highlighted by Acrobat. Note that the Acrobat highlighting feature has numerous anomalies that may cause the resulting highlight to either not be shown or to bleed into other text. The highlight positions of the set of characters tagged by this element. Contains the contents of a line of PDF text. Since the PDF format does not formally define line boundaries, line detection is based on an inference algorithm that detects horizontal and vertical shifts indicative of line breaks. Includes a type and value attribute. This element is generated by Clean Content during analysis of the content when the applicable fingerprinting options are enabled. The type of fingerprint is provided by the type attribute and may be either SlideContent, SlideAppearance, or GraphicData. The value attribute provides the fingerprint as a 128 bit MD5 hash. The fingerprint for SlideContent is generated based on the text and images found on the slide. 
This allows the fingerprint to be consistent regardless of modifications due to positions, colors, shapes, masters, and other slide attributes. The SlideAppearance fingerprint is an extension of the SlideContent fingerprint that includes consideration for the applicable slide master, slide background, and the position and select formatting of slide content, including shapes. Numerous presentation features are excluded from the fingerprint calculation in order to improve the consistency of the fingerprint across different versions of PowerPoint. The MD5 hash associated with this type of fingerprint. Contains database query information. Contains author history information. Contains information about macros and code in the document. Contains information about printers used by the document. Contains information about routing slips. Contains information about weak protections. Contains information about earlier versions of the document. Contains information about obsolete content left in the document. This element identifies a range of spreadsheet cells that are located an extreme distance away from other cells. The extreme cell ranges will be reported for extreme cell areas that contain cell content or an inserted object. This element is only generated if the ExtremeCells scrub target is enabled and will occur near the end of each sheet after analyzing all cell ranges in the sheet. The definition of an extreme cell range can be controlled by the options that define the extreme cell horizontal and vertical gap allowance. The first row in the range of cells associated with this extreme cell range. The first column in the range of cells associated with this extreme cell range. The last row in the range of cells associated with this extreme cell range. The last column in the range of cells associated with this extreme cell range. This element identifies geographic location information that is stored in the file, usually stored as GPS coordinates. 
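To make the idea of an extreme cell range and its horizontal and vertical gap allowances more concrete, here is a rough sketch of one way such ranges could be detected. It is an illustration only and is not the algorithm used by Clean Content: the helper names `split_on_gaps` and `extreme_ranges`, the parameter names, and the rule that the group nearest the sheet origin counts as the normal area are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]             # (row, column), both 0 based
Range = Tuple[int, int, int, int]  # (first_row, first_col, last_row, last_col)


def split_on_gaps(indices: Iterable[int], allowance: int) -> List[List[int]]:
    """Group sorted indices, starting a new group when a jump exceeds the allowance."""
    groups: List[List[int]] = []
    for i in sorted(set(indices)):
        if groups and i - groups[-1][-1] <= allowance:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def extreme_ranges(cells: Iterable[Cell], row_gap: int, col_gap: int) -> List[Range]:
    """Return bounding ranges of cells that lie beyond a gap from the first group."""
    occupied = set(cells)
    if not occupied:
        return []
    row_groups = split_on_gaps((r for r, _ in occupied), row_gap)
    col_groups = split_on_gaps((c for _, c in occupied), col_gap)
    row_group_of = {r: gi for gi, g in enumerate(row_groups) for r in g}
    col_group_of = {c: gi for gi, g in enumerate(col_groups) for c in g}

    boxes: Dict[Tuple[int, int], Range] = {}
    for r, c in occupied:
        key = (row_group_of[r], col_group_of[c])
        if key == (0, 0):
            continue  # assumed "normal" area nearest the sheet origin
        fr, fc, lr, lc = boxes.get(key, (r, c, r, c))
        boxes[key] = (min(fr, r), min(fc, c), max(lr, r), max(lc, c))
    return [boxes[k] for k in sorted(boxes)]


if __name__ == "__main__":
    sheet = {(0, 0), (1, 1), (2, 0), (5000, 3), (5001, 3)}
    print(extreme_ranges(sheet, row_gap=100, col_gap=100))
```

Each returned tuple mirrors the four attributes described above: first row, first column, last row, and last column of the extreme cell range.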
The result of identifying or scrubbing a single scrub/analyze target. Describes a software fault caused by a malformed, truncated or corrupted file. An element that provides the title text associated with an outline item in the document outline. An OutlineItem may also contain child OutlineItem elements, creating the outline hierarchy. The title text associated with this outline item entry. A container for describing the article threads found in PDF documents. The title text associated with this article thread. A container element that tags the content of a form field. Form fields may contain various string and text elements as well as child formfield elements that define the hierarchy of the form. Detailed information about an Office part that may represent some level of data disclosure risk. The name of the office part in the part collection. The content type associated with this part if defined. The uncompressed size of this part. If set to true this content is scrubbable either automatically or by setting the applicable scrub target to SCRUB. The name of the most recently processed .rels file that includes a reference to this part. Note that a part may be referenced from multiple .rels files but only one example is captured for informational purposes. The relationship type used when referencing this part from the most recently processed .rels file defined by @mrRelsName. The relationship id used when referencing this part from the most recently processed .rels file defined by @mrRelsName. Detailed information about the source of content captured from the web during the creation of this document. Describes a single author in the collection of authors that have commented on the document at some point in the document life cycle. The name of the commenting author. The initials of the commenting author. An identifier for the authentication provider to which the userid applies. This may be Active Directory, Windows Live, or no provider. 
An identifier for this author unique to the provider service that provides user authentication. This may be an Active Directory Id, Windows Live Id, or simple user name. An optional string that provides contact information for this commenting author. Describes a single line in the exception's stack trace. Bounded whitespaces can be used to indent text. Note: ScrubOption OfficeXMLFeatures must be set to scrub bounded spaces. Text that is used as an alternative to displaying a graphic image in constrained viewing environments. Apps for Office allow for integration of 3rd party applications into the Office applications. XML comments are used to provide semantic information to the human reader. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML comments. XML processing instructions can be used to pass information to applications. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML processing instructions. XML CDATA refers to character data. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML CDATA. An XML namespace in the document that is not part of the whitelisted namespace list. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML UnknownNamespace. XML external entities are references to external files. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML external entities. XML namespace prefixes are used to avoid name conflicts in XML. Note: ScrubOption OfficeXMLFeatures must be set to rename namespace prefixes. XML namespaces are used to avoid name conflicts in XML. Note: ScrubOption OfficeXMLFeatures must be set to extract and scrub XML unused namespaces. Embedded audio and video objects that reference their data through a local or network share path. Hidden author history in Microsoft Word document. Invisible author history contains paths. Invisible author history contains network share names. Some characters are hidden because they fall outside the current clipping path. 
Some characters are visually obscured due to the font color matching the background color. Author or reviewer comments in the document. Document properties categorized as content properties. Document properties categorized as custom properties. Any custom XML data. Database connection and query information. The default scrub behavior. Programmatic variables that can be stored in PowerPoint documents. Data from other applications embedded in the document. The document is encrypted. Indicates the Excel workbook contains a relational data source and corresponding connection information to other data sources. Indicates the document contains one or more ranges of spreadsheet cells that are located an extreme distance from other cell ranges. Certain indenting, margin and other settings result in text that does not display or print. Indicates the document contains one or more objects that are positioned an extreme distance outside the standard viewing area. Text or other data that was 'deleted' but still exists in the file. Headers and footers. Hidden spreadsheet columns, rows, or worksheets. Slides that have been hidden from presentation and printing. Text that has been hidden by the author. A redundant storage of Excel workbooks created for backwards compatibility with Excel 95. Found XML elements that are invalid against the schema. Found XML elements in unknown namespaces. Links to files from other applications. Macros and other executable code. Meeting minutes entered using the PowerPoint Meeting Minder feature. A document property that provides a globally unique identifier (GUID) of the document and originating computer. This document contains parts that are not referenced or required by the document and that represent a significant unintentional disclosure risk if not scrubbed or further analyzed. This document contains parts that are not processed by the Clean Content analysis process. 
This document contains parts that are understood but not analyzed by the Clean Content analysis process. This document contains parts that represent some level of disclosure risk if not scrubbed or further analyzed. Document properties added to Office document email attachments by Microsoft Outlook. Indicates the document contains one or more objects that have been overlapped by another object. Some characters are hidden because they have been overlapped by a rectangular shape or image. PDF supports a set of interactive features called actions that range from jumping to a particular destination in the document to submitting the data of an interactive form to a server. Individual targets are defined for each specific type of action. This target covers the entire set of actions as a single target. The GoTo action causes the Viewer software to change the current view of the document to a specific location within the document. The GoToR (Go to remote location) action causes the Viewer software to change the current view to a specific location in another PDF file. The GoToE (Go to embedded file) action causes the Viewer software to change the current view to a specific location in another PDF file that is embedded in this or another PDF file. The Launch action launches an application or opens or prints a document. The Thread action causes the Viewer software to change the current view of the document to a specific location in an article thread within the document. The URI action causes the Viewer software to resolve and open a resource described by a Uniform Resource Identifier. The Sound action causes the Viewer software to play a sound object. The Movie action causes the Viewer software to play a movie object that is stored as an external file. The Hide action causes the Viewer software to change the visibility of annotations and form fields. 
The Named action causes the Viewer software to change the current view of the document to a specific named location in the current document. The Set OCG State action sets the state of one or more optional content groups. The Rendition action controls the playback of multimedia content. The GoTo3D View action controls the view of a 3D annotation. The Rich Media action identifies a rich media annotation and specifies a command to be sent to that annotation handler. Rich media PDF constructs support playing a SWF file to provide enhanced rich media. The command defined in this action can either be an ActionScript or JavaScript function name. The JavaScript action causes JavaScript code to be executed by the JavaScript interpreter supported by the PDF Viewer. The Submit Form action transmits the names and values of selected form fields to a specified URL. The Reset Form action resets a selected set of interactive form fields. The Import Data action imports Forms Data Format (FDF), XFDF, or XML into the interactive form fields of the PDF document. The Transition action is used in a sequence of actions to define transition appearances during the sequence. Any action that is not in the list of supported actions is treated as an Unknown action. Alternate versions of an image that may be used by readers. PostScript objects embedded inside PDF documents. Alternate Presentations can be used to view a PDF document in an alternative way more consistent with a presentation rendition. Private data stored in PDF documents by applications using the PDF Page-Piece dictionary construct. Indicates that the document contains an embedded search index provided to make text searches faster within Adobe Acrobat. Indicates that the document contains private application data other than an embedded search index. Data stored in PDF documents used to import content from external Web pages. Information that specifies the existence of content that may result in unexpected rendering of a document. 
Digital signatures are used to authenticate the identity of the author and the contents of the document. Thumbnail images are small images that provide a representation of either a PDF page or an externally referenced file. PDF supports a set of interactive features called annotations that allow numerous types of content to be associated with a page location or provide user interaction. This target covers the entire set of annotations as a single target. Notes associated with a slide presentation. Printer information in the document. Printer information that includes network share names. Email routing information. Scenarios are an Excel feature that allows for multiple data models. Sensitive paths or URIs to external content that is to be included in this file. Hyperlinks containing either fully qualified local paths or network share names. INCLUDETEXT and INCLUDEPICTURE fields containing either fully qualified local paths or network share names. Some character sizes are outside a certain normal range. Tags applied to text that matches a defined pattern, allowing specific actions to be executed based on the category of the smart tag. Document properties categorized as statistics properties. Word's Structured Document Tags. Document properties categorized as summary properties. If a template other than Normal.dot is used, the document will contain a full path to the template file. Tracked changes in the document. Uninitialized data segments found in the Docfile format leveraged by Office 2003 and below and many other formats. The names of users associated with the document. Version information in Word documents. Weak or easily breakable protections and passwords. XMP Metadata streams are leveraged to store metadata properties using the Extensible Metadata Platform standard. GPS location information.