Skip Headers
Oracle® Data Mining User's Guide
12c Release 1 (12.1)

E17693-15
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

7 Mining Unstructured Text

This chapter explains how to use Oracle Data Mining to mine unstructured text. This chapter includes the following topics:

About Unstructured Text

Data mining algorithms act on data that is numerical or categorical. Numerical data is ordered. It is stored in columns that have a numeric data type, such as NUMBER or FLOAT. Categorical data is identified by category or classification. It is stored in columns that have a character data type, such as VARCHAR2 or CHAR.

Unstructured text data is neither numerical nor categorical. Unstructured text includes items such as web pages, document libraries, Power Point presentations, product specifications, emails, comment fields in reports, and call center notes. It has been said that unstructured text accounts for more than three quarters of all enterprise data. Extracting meaningful information from unstructured text can be critical to the success of a business.

About Text Mining and Oracle Text

Text mining is the process of applying data mining techniques to text terms, also called text features or tokens. Text terms are words or groups of words that have been extracted from text documents and assigned numeric weights. Text terms are the fundamental unit of text that can be manipulated and analyzed.

Oracle Text is a Database technology that provides term extraction, word and theme searching, and other utilities for querying text. When columns of text are present in the training data, Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform the text for mining. Oracle Data Mining passes configuration information supplied by the user to Oracle Text and uses the results in the model creation process.

Creating a Model that Includes Text Mining

Oracle Data Mining supports unstructured text within columns of VARCHAR2, CHAR, CLOB, BLOB, and BFILE, as described in Table 7-1.

Table 7-1 Column Data Types That May Contain Unstructured Text

Data Type Description

BFILE and BLOB

Oracle Data Mining interprets BLOB and BFILE as text only if you identify the columns as text when you create the model. If you do not identify the columns as text, then CREATE_MODEL returns an error.

CLOB

Oracle Data Mining interprets CLOB as text.

CHAR

Oracle Data Mining interprets CHAR as categorical by default. You can identify columns of CHAR as text when you create the model.

VARCHAR2

Oracle Data Mining interprets VARCHAR2 with data length > 4000 as text.

Oracle Data Mining interprets VARCHAR2 with data length <= 4000 as categorical by default. You can identify these columns as text when you create the model.


Note:

Text is not supported in nested columns or as a target in supervised data mining.

The settings described in Table 7-2 control the term extraction process for text attributes in a model. Instructions for specifying model settings are in "Specifying Model Settings".

Table 7-2 Model Settings for Text

Setting Name Data Type Setting Value Description

ODMS_TEXT_POLICY_NAME

VARCHAR2(4000)

Name of an Oracle Text policy object created with CTX_DDL.CREATE_POLICY

Affects how individual tokens are extracted from unstructured text. See "Creating a Text Policy".

ODMS_TEXT_MAX_FEATURES

INTEGER

1 <= value <= 100000

Maximum number of features to use from the document set (across all documents of each text column) passed to CREATE_MODEL.

Default is 3000.


A model can include one or more text attributes. A model with text attributes can also include categorical and numerical attributes.

To create a model that includes text attributes: 

  1. Create an Oracle Text policy object, as described in "Creating a Text Policy".

  2. Specify the model configuration settings that are described in Table 7-2.

  3. Specify which columns should be treated as text and, optionally, provide text transformation instructions for individual attributes. See "Configuring a Text Attribute".

  4. Pass the model settings and text transformation instructions to DBMS_DATA_MINING.CREATE_MODEL. See "Embedding Transformations in a Model".

    Note:

    All algorithms except O-Cluster can support columns of unstructured text.

    The use of unstructured text is not recommended for association rules (Apriori).

Creating a Text Policy

An Oracle Text policy specifies how text content should be interpreted. You can provide a text policy to govern a model, an attribute, or both the model and individual attributes. If a model-specific policy is present and one or more attributes have their own policies, Oracle Data Mining uses the attribute policies for the specified attributes and the model-specific policy for the other attributes.

The CTX_DDL.CREATE_POLICY procedure creates a text policy.

CTX_DDL.CREATE_POLICY(
          policy_name    IN VARCHAR2,
                                        filter         IN VARCHAR2 DEFAULT NULL,
                                        section_group  IN VARCHAR2 DEFAULT NULL,
                                        lexer          IN VARCHAR2 DEFAULT NULL,
                                        stoplist       IN VARCHAR2 DEFAULT NULL,
                                        wordlist       IN VARCHAR2 DEFAULT NULL);

The parameters of CTX_DDL.CREATE_POLICY are described in Table 7-3.

Table 7-3 CTX_DDL.CREATE_POLICY Procedure Parameters

Parameter Name Description

policy_name

Name of the new policy object. Oracle Text policies and text indexes share the same namespace.

filter

Specifies how the documents should be converted to plain text for indexing. Examples are: CHARSET_FILTER for character sets and NULL_FILTER for plain text, HTML and XML.

For filter values, see "Filter Types" in Oracle Text Reference.

section_group

Identifies sections within the documents. For example, HTML_SECTION_GROUP defines sections in HTML documents.

For section_group values, see "Section Group Types" in Oracle Text Reference.

Note: You can specify any section group that is supported by CONTEXT indexes.

lexer

Identifies the language that is being indexed. For example, BASIC_LEXER is the lexer for extracting terms from text in languages that use white space delimited words (such as English and most western European languages).

For lexer values, see "Lexer Types" in Oracle Text Reference.

stoplist

Specifies words and themes to exclude from term extraction. For example, the word "the" would typically be in the stoplist for English language documents.

The system-supplied stoplist is used by default.

See "Stoplists" in Oracle Text Reference.

wordlist

Specifies how stems and fuzzy queries should be expanded. A stem defines a root form of a word so that different grammatical forms have a single representation. A fuzzy query includes common misspellings in the representation of a word.

See BASIC_WORDLIST in Oracle Text Reference.


See Also:

CTX_DDL.CREATE_POLICY in Oracle Text Reference

Configuring a Text Attribute

As shown in Table 7-1, you can identify columns of CHAR,shorter VARCHAR2 (<=4000), BFILE, and BLOB as text attributes. If CHAR and shorter VARCHAR2 columns are not explicitly identified as unstructured text, then CREATE_MODEL processes them as categorical attributes. If BFILE and BLOB columns are not explicitly identified as unstructured text, then CREATE_MODEL returns an error.

To identify a column as a text attribute, supply the keyword TEXT in an attribute specification. The attribute specification is a field (attribute_spec) in a transformation record (transform_rec). Transformation records are components of transformation lists (xform_list) that can be passed to CREATE_MODEL.

Note:

An attribute specification can also include information that is not related to text. Instructions for constructing an attribute specification are in "Embedding Transformations in a Model" in Chapter 4.

You can provide transformation instructions for any text attribute by qualifying the TEXT keyword in the attribute specification with the subsettings described in Table 7-4.

Table 7-4 Attribute-Specific Text Transformation Instructions

Subsetting Name Description Example

POLICY_NAME

Name of an Oracle Text policy object created with CTX_DDL.CREATE_POLICY

(POLICY_NAME:my_policy)

TOKEN_TYPE

The following values are supported:


NORMAL (the default)
STEM
THEME

See "Token Types in an Attribute Specification"

(TOKEN_TYPE:THEME)

MAX_FEATURES

Maximum number of features to use from the attribute.

(MAX_FEATURES:3000)


Note:

The TEXT keyword is only required for CLOB and longer VARCHAR2 (>4000) when you specify transformation instructions. The TEXT keyword is always required for CHAR, shorter VARCHAR2, BFILE, and BLOB — whether or not you specify transformation instructions.

Example 7-1 A Sample Attribute Specification for Text

This expression specifies that text transformation for the attribute should use the text policy named my_policy. The token type is THEME, and the maximum number of features is 3000.

”TEXT(POLICY_NAME:my_policy)(TOKEN_TYPE:THEME)(MAX_FEATURES:3000)”

Tip:

You can view attribute specifications in the data dictionary view ALL_MINING_MODEL_ATTRIBUTES, as shown in Oracle Database Reference.

Token Types in an Attribute Specification

When stems or themes are specified as the token type, the lexer preference for the text policy must support these types of tokens.

The following example adds themes and English stems to BASIC_LEXER.

BEGIN
  CTX_DDL.CREATE_PREFERENCE('my_lexer', 'BASIC_LEXER');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_stems', 'ENGLISH');
  CTX_DDL.SET_ATTRIBUTE('my_lexer', 'index_themes', 'YES');
END;