27 Mining Unstructured Text

Explains how to use Oracle Data Mining to mine unstructured text.

27.1 About Unstructured Text

Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered. It is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification. It is stored in columns that have a character data type, such as VARCHAR2 or CHAR.

Unstructured text data is neither numerical nor categorical. Unstructured text includes items such as web pages, document libraries, Power Point presentations, product specifications, emails, comment fields in reports, and call center notes. It has been said that unstructured text accounts for more than three quarters of all enterprise data. Extracting meaningful information from unstructured text can be critical to the success of a business.

27.2 About Text Mining and Oracle Text

Understand what is text mining and oracle text.

Text mining is the process of applying data mining techniques to text terms, also called text features or tokens. Text terms are words or groups of words that have been extracted from text documents and assigned numeric weights. Text terms are the fundamental unit of text that can be manipulated and analyzed.

Oracle Text is a Database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by you to Oracle Text and uses the results in the model creation process.

27.3 Data Preparation for Text Features

The model details view for text features is DM$VXmodel_name.

The text feature view DM$VXmodel_name describes the extracted text features if there are text attributes present. The view has the following schema:

Name                                Type
 --------------            ---------------------
 PARTITION_NAME                     VARCHAR2(128)
 COLUMN_NAME                        VARCHAR2(128)
 TOKEN                              VARCHAR2(4000)
 DOCUMENT_FREQUENCY                 NUMBER

Table 27-1 Text Feature View for Extracted Text Features

Column Name Description

PARTITION_NAME

A partition in a partitioned model to retrieve details

COLUMN_NAME

Name of the identifier column

TOKEN

Text token which is usually a word or stemmed word

DOCUMENT_FREQUENCY

A measure of token frequency in the entire training set

27.4 Creating a Model that Includes Text Mining

Learn how to create a model that includes text mining.

Oracle Data Mining supports unstructured text within columns of VARCHAR2, CHAR, CLOB, BLOB, and BFILE, as described in the following table:

Table 27-2 Column Data Types That May Contain Unstructured Text

Data Type Description

BFILE and BLOB

Oracle Data Mining interprets BLOB and BFILE as text only if you identify the columns as text when you create the model. If you do not identify the columns as text, then CREATE_MODEL returns an error.

CLOB

Oracle Data Mining interprets CLOB as text.

CHAR

Oracle Data Mining interprets CHAR as categorical by default. You can identify columns of CHAR as text when you create the model.

VARCHAR2

Oracle Data Mining interprets VARCHAR2 with data length > 4000 as text.

Oracle Data Mining interprets VARCHAR2 with data length <= 4000 as categorical by default. You can identify these columns as text when you create the model.

Note:

Text is not supported in nested columns or as a target in supervised data mining.

The settings described in the following table control the term extraction process for text attributes in a model. Instructions for specifying model settings are in "Specifying Model Settings".

Table 27-3 Model Settings for Text

Setting Name Data Type Setting Value Description

ODMS_TEXT_POLICY_NAME

VARCHAR2(4000)

Name of an Oracle Text policy object created with CTX_DDL.CREATE_POLICY

Affects how individual tokens are extracted from unstructured text. See "Creating a Text Policy".

ODMS_TEXT_MAX_FEATURES

INTEGER

1 <= value <= 100000

Maximum number of features to use from the document set (across all documents of each text column) passed to CREATE_MODEL.

Default is 3000.

A model can include one or more text attributes. A model with text attributes can also include categorical and numerical attributes.

To create a model that includes text attributes:

  1. Create an Oracle Text policy object..

  2. Specify the model configuration settings that are described in "Table 27-3".

  3. Specify which columns must be treated as text and, optionally, provide text transformation instructions for individual attributes.

  4. Pass the model settings and text transformation instructions to DBMS_DATA_MINING.CREATE_MODEL.

    Note:

    All algorithms except O-Cluster can support columns of unstructured text.

    The use of unstructured text is not recommended for association rules (Apriori).

27.5 Creating a Text Policy

An Oracle Text policy specifies how text content must be interpreted. You can provide a text policy to govern a model, an attribute, or both the model and individual attributes.

If a model-specific policy is present and one or more attributes have their own policies, Oracle Data Mining uses the attribute policies for the specified attributes and the model-specific policy for the other attributes.

The CTX_DDL.CREATE_POLICY procedure creates a text policy.

CTX_DDL.CREATE_POLICY(
          policy_name    IN VARCHAR2,
          				filter         IN VARCHAR2 DEFAULT NULL,
          				section_group  IN VARCHAR2 DEFAULT NULL,
          				lexer          IN VARCHAR2 DEFAULT NULL,
          				stoplist       IN VARCHAR2 DEFAULT NULL,
          				wordlist       IN VARCHAR2 DEFAULT NULL);

The parameters of CTX_DDL.CREATE_POLICY are described in the following table.

Table 27-4 CTX_DDL.CREATE_POLICY Procedure Parameters

Parameter Name Description

policy_name

Name of the new policy object. Oracle Text policies and text indexes share the same namespace.

filter

Specifies how the documents must be converted to plain text for indexing. Examples are: CHARSET_FILTER for character sets and NULL_FILTER for plain text, HTML and XML.

For filter values, see "Filter Types" in Oracle Text Reference.

section_group

Identifies sections within the documents. For example, HTML_SECTION_GROUP defines sections in HTML documents.

For section_group values, see "Section Group Types" in Oracle Text Reference.

Note: You can specify any section group that is supported by CONTEXT indexes.

lexer

Identifies the language that is being indexed. For example, BASIC_LEXER is the lexer for extracting terms from text in languages that use white space delimited words (such as English and most western European languages).

For lexer values, see "Lexer Types" in Oracle Text Reference.

stoplist

Specifies words and themes to exclude from term extraction. For example, the word "the" is typically in the stoplist for English language documents.

The system-supplied stoplist is used by default.

See "Stoplists" in Oracle Text Reference.

wordlist

Specifies how stems and fuzzy queries must be expanded. A stem defines a root form of a word so that different grammatical forms have a single representation. A fuzzy query includes common misspellings in the representation of a word.

See "BASIC_WORDLIST" in Oracle Text Reference.

Related Topics

27.6 Configuring a Text Attribute

Learn how to identify a column as a text attribute and provide transformation instructions for any text attribute.

As shown in Table 27-2, you can identify columns of CHAR,shorter VARCHAR2 (<=4000), BFILE, and BLOB as text attributes. If CHAR and shorter VARCHAR2 columns are not explicitly identified as unstructured text, then CREATE_MODEL processes them as categorical attributes. If BFILE and BLOB columns are not explicitly identified as unstructured text, then CREATE_MODEL returns an error.

To identify a column as a text attribute, supply the keyword TEXT in an Attribute specification. The attribute specification is a field (attribute_spec) in a transformation record (transform_rec). Transformation records are components of transformation lists (xform_list) that can be passed to CREATE_MODEL.

Note:

An attribute specification can also include information that is not related to text. Instructions for constructing an attribute specification are in "Embedding Transformations in a Model".

You can provide transformation instructions for any text attribute by qualifying the TEXT keyword in the attribute specification with the subsettings described in the following table.

Table 27-5 Attribute-Specific Text Transformation Instructions

Subsetting Name Description Example

POLICY_NAME

Name of an Oracle Text policy object created with CTX_DDL.CREATE_POLICY

(POLICY_NAME:my_policy)

TOKEN_TYPE

The following values are supported:

  • NORMAL (the default)
  • STEM
  • THEME

See "Token Types in an Attribute Specification"

(TOKEN_TYPE:THEME)

MAX_FEATURES

Maximum number of features to use from the attribute.

(MAX_FEATURES:3000)

Note:

The TEXT keyword is only required for CLOB and longer VARCHAR2 (>4000) when you specify transformation instructions. The TEXT keyword is always required for CHAR, shorter VARCHAR2, BFILE, and BLOB — whether or not you specify transformation instructions.

Tip:

You can view attribute specifications in the data dictionary view ALL_MINING_MODEL_ATTRIBUTES, as shown in Oracle Database Reference.

Token Types in an Attribute Specification

When stems or themes are specified as the token type, the lexer preference for the text policy must support these types of tokens.

The following example adds themes and English stems to BASIC_LEXER.

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_stems', 'ENGLISH');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_themes', 'YES');
END;

Example 27-1 A Sample Attribute Specification for Text

This expression specifies that text transformation for the attribute must use the text policy named my_policy. The token type is THEME, and the maximum number of features is 3000.

"TEXT(POLICY_NAME:my_policy)(TOKEN_TYPE:THEME)(MAX_FEATURES:3000)"