11 Text Nodes

Text nodes are available in the Text section of the Components pane. Oracle Text Knowledge Base is required for text processing.

To install Oracle Text Knowledge Base, you must install the Oracle Database Examples. For directions on how to install the examples, see Oracle Database Examples Installation Guide.

If you are connected to Oracle Database 12c or later, then you can use Automatic Data Preparation (ADP) to prepare text data, using the Text tab to specify data usage.

Oracle Data Miner supports the following Text nodes:

Oracle Text Concepts

Oracle Text concepts include the terms Theme, Stopword, Stoplist, and Stoptheme.

  • Theme: A theme is a topic associated with a given document. A document can have many themes. A theme does not have to appear in a document. For example, a document containing the words San Francisco may have California as one of its themes.

  • Stopword: A stopword is a word that is not indexed during text transformations. A stopword is usually a low information word. In English a, the, this, or with are usually stopwords.

  • Stoplist: A stoplist is a list of stopwords. Oracle Text supplies a stoplist for every language. By default during indexing, the system uses the Oracle Text default stoplist for your language. You can edit the default stoplist or create a new one.

    Note:

    In Oracle Data Miner, stoplists are shared across all transformations and are not owned by a specific transformation.

  • Stoptheme: A stoptheme is a theme to be skipped over during indexing. Stopthemes are specified by adding them to stoplists.

Oracle Text uses stopwords and stopthemes to indicate text that can be safely ignored during text mining.

The Oracle Text Lexer breaks source text into tokens or themes—usually words—in accordance with a specified language. To extract tokens, the Lexer uses parameters as defined by a lexer preference. These parameters include:

  • Definitions for the characters that separate tokens. For example, whitespace.

  • Conditions to convert text to all uppercase or not.

  • Text analysis to create theme tokens. This is done when theme indexing is enabled.
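
The effect of separators, case conversion, and a stoplist can be illustrated with a minimal Python sketch. This is not the Oracle Text Lexer and does not call any Oracle API. The function name lex and its parameters are hypothetical, chosen only to mirror the lexer parameters described above.

```python
import re

def lex(text, separators=r"\s+", to_upper=True, stoplist=frozenset()):
    """Split text into tokens and drop stopwords (illustrative only)."""
    # Compare stopwords case-insensitively so the stoplist works
    # whether or not the text is converted to uppercase.
    stopwords = {word.lower() for word in stoplist}
    if to_upper:
        text = text.upper()
    # Split on the separator pattern and discard empty strings.
    tokens = [tok for tok in re.split(separators, text) if tok]
    return [tok for tok in tokens if tok.lower() not in stopwords]

# Stopwords taken from the English examples given earlier in this section.
tokens = lex("The cat sat with the dog", stoplist={"a", "the", "this", "with"})
print(tokens)  # ['CAT', 'SAT', 'DOG']
```

Theme extraction is not shown, because it depends on the Oracle Text Knowledge Base rather than on simple string splitting.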

Text Mining in Oracle Data Mining

Text must undergo a transformation process before it can be mined.

After the data has been properly transformed, a case table can be used for building, testing, or scoring data mining models. The case table must be a relational table. It cannot be created as a view.

A source table for Oracle Data Mining can include one or more columns of text. A text column cannot be used as a target.

The following Oracle Data Mining algorithms support text:

  • Anomaly Detection (one-class Support Vector Machine)

  • Classification algorithms: Naive Bayes, Generalized Linear Models, and Support Vector Machine

    Decision Tree when you connect to Oracle Database 12c or later

  • Clustering algorithms: k-Means and Expectation Maximization

  • Feature Extraction algorithms: Nonnegative Matrix Factorization, Singular Value Decomposition, and Principal Components Analysis

  • Regression algorithms: Generalized Linear Models and Support Vector Machine

Note:

These algorithms do not support text:

  • O-Cluster

  • Decision Tree when you connect to Oracle Database 11g

  • Association (Apriori)

Any text attributes are automatically filtered out for model builds when you use O-Cluster or Decision Tree connected to Oracle Database 11g.

Data Preparation for Text

Data preparation for text depends on the version of Oracle Database that you connect to.

Text Processing in Oracle Data Mining 12c Release 1 (12.1) and Later

In Oracle Data Mining 12c Release 1 (12.1) and later, if unstructured text data is present, then text processing includes text transformation before text mining.

Oracle Data Mining includes significant enhancements in text processing that simplify the data mining process (model build, deployment, and scoring) when unstructured text data is present in the input. Some points about unstructured text and text transformation:

  • Unstructured text includes data items such as web pages, document libraries, Microsoft PowerPoint presentations, product specifications, email messages, comment fields in reports, and call center notes.

    • CLOB columns and long VARCHAR2 columns are automatically interpreted as unstructured text by Oracle Data Mining.

    • Columns of short VARCHAR2, CHAR, BLOB, and BFILE can be specified as unstructured text.

  • To transform unstructured text for mining, Oracle Data Mining uses Oracle Text utilities and term weighting strategies.

  • Text terms are extracted and given numeric values in a text index.

  • Text transformation process is configurable for models and individual attributes. You can specify data preparation for text nodes when you define a model node.

After text transformation, the text can be mined with a data mining algorithm.

Note:

If you connect to Oracle Database 12c Release 1 or later, then it is not always necessary to use the Text nodes (Apply Text, Build Text, and Text Reference).

Text Processing in Oracle Data Mining 11g Release 2 (11.2) and Earlier

In Oracle Data Mining 11g Release 2 (11.2) and earlier, text must undergo Feature Extraction and text preparation before it can be mined.

These processes are as follows:

  • Extraction or Feature Extraction: This is a special preprocessing step, where the text is broken down into units (terms) that can be mined. Text terms can be keywords or other document-derived features.

  • Text preparation: Text preparation uses a Build Text node to transform text columns. Build Text does not support HTML or XML documents. It also does not support any binary data types.

Oracle Data Miner uses the facilities of Oracle Text to preprocess text columns.

Note:

You must preprocess text using the Text nodes (Apply Text, Build Text, and Text Reference).

Apply Text Node

The Apply Text node enables you to apply existing text transformations from either a Build Text node or from a Text Node to new data.

This ensures that the apply data is transformed in the same way that the build data was transformed.

Apply Text can run in parallel.

Default Behavior for the Apply Text Node

Apply Text node applies existing text transformations from either a Build Text or Text Node to new data.

This ensures that the apply data is transformed in the same way that the build data was transformed.
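
The reason for reusing the build transformation can be shown with a small Python sketch. It is an analogy only, not how Oracle Data Miner is implemented. The functions build_text and apply_text are hypothetical, and the whitespace tokenization and the max_terms parameter are assumptions made for illustration.

```python
from collections import Counter

def build_text(documents, max_terms=3000):
    """Learn a vocabulary (the 'transformation') from the build data."""
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    # Keep the most frequent terms as the vocabulary.
    return [term for term, _ in counts.most_common(max_terms)]

def apply_text(documents, vocabulary):
    """Transform new data using only the vocabulary learned at build time."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Terms that were not seen in the build data are ignored.
        rows.append({term: counts[term] for term in vocabulary if counts[term]})
    return rows

vocabulary = build_text(["late delivery", "delivery was fast"])
new_rows = apply_text(["fast delivery by new courier"], vocabulary)
print(new_rows)  # [{'delivery': 1, 'fast': 1}]
```

Because the apply step uses the vocabulary from the build step, the apply data ends up with the same set of possible attributes as the build data, which is what a model built on that data expects.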

Note:

All models in the node must have the same case ID.

Create an Apply Text Node

You create an Apply Text node to apply existing text transformations from either a Build Text node or from a Text Node to new data.

Before creating the Apply Text node, first create a workflow. Then create a Data Source node.
To create an Apply Text node:
  1. In the Components pane, go to Workflow Editor. If the Components pane is not visible, then in the SQL Developer menu bar, go to View and click Components. Alternately, press Ctrl+Shift+P to dock the Components pane.
  2. In the Workflow Editor, expand Text and click Apply Text.
  3. Drag and drop the node from the Components pane to the Workflow pane.
    The node is added to the workflow. The GUI shows that the node has no data associated with it. Therefore, it cannot be run.
  4. Move to the node that provides data for Apply Text. Right-click and select Connect. Drag the line to the Apply Text node and click again.

    Note:

    The Apply data must be compatible with the Build data.

  5. Move to the Build Text node or Text node that indicates how the text columns were prepared. For example, go to the Build Text node for the model to be applied. Link the Build Text node to the Apply node.
  6. To view or modify the transformation details, right-click the Apply Text node and select Edit. This opens the Edit Apply Text Node dialog box.
  7. The node is ready to run. Right-click the node and select Run.

Edit Apply Text Node

The Edit Apply Text Node dialog box enables you to view the text transformations performed on the Build data.

The Apply data must be prepared in the same way as the Build data.

To view the text transformations in the Edit Apply Text Node dialog box:

  1. Right-click the Apply Text node and click Edit. Alternately, you can double-click the Apply Text node.
  2. The Edit Apply Text Node has two panes:
    • In the top pane, you can perform the following tasks:

      • Case ID: Specify a case ID in this field. This is optional.

      • View Attributes: Select All or Text and Transformed from the drop-down list. For each attribute, the following are displayed:

        • Type: The data type of the attribute. The type of an attribute that has a text transform applied is DM_NESTED_NUMERICALS.

        • Source: The source column for a transformed column.

        • Transform: The type of text transform, either Token or Theme.

        • Output: Indicates whether the attribute is passed on to subsequent nodes. By default, all attributes are passed on.

      • View Stoplist: To view the stoplist, select a transformed column and click View Stoplist. The Stoplist Editor starts. You can view the items in the stoplist.

      • View and Edit text transform: To view the definition of a text transformation, select a transformed attribute and click edit. This opens the Add/Edit Text Transform dialog box.

      • Exclude attributes: To exclude attributes, select the option and in the Output column of the grid, click pass. The icon changes to ignore. The excluded attribute is not passed on to subsequent operations. You may want to exclude the non-transformed version of a text column.

      • Include attributes: To include an attribute, click the ignore icon again. The icon changes to include, indicating that it is included.

    • In the lower pane, you can view the text transformation after running the node.

  3. Click OK.

View the Text Transform (Apply)

You can view the effects of the text transformation defined in the Build Text node or the Text node in the View Text Transform window.

To access the View Text Transform window:

  1. Open the Edit Apply Text Node dialog box by double-clicking the Apply Text node, or by right-clicking the node and selecting Edit.
  2. In the upper pane, select the name of the transformed attribute.
  3. In the lower pane, you can perform the following tasks:
    • Click Tokens or Themes. A grid displays all of the tokens or themes in the document and the frequency of each token or theme. Use the search field to search for tokens or themes by name (the default) or by frequency.

    • Click Output to view the tokens or themes for a sample of the attributes. Output Sample lists a sample of the attributes by case ID, if you specified one, or by row ID if you did not. You can search the IDs. Click an ID to display the original text from the untransformed attribute, along with the tokens or themes identified in it and the frequency of each.

  4. Click OK.

Apply Text Node Properties

In the Properties pane, you can examine and change the characteristics or properties of a node.

To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternately, right-click the node and click Go to Properties.

The Apply Text Properties has the following sections:

Transforms

In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.

Cache

The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.

You can perform the following tasks:

  • Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.

    • Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.

Sample

The data is sampled to support data analysis.

The default is to use a sample. The Sample tab has the following selections:

  • Use All Data: By default, Use All Data is deselected.

  • Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.
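
The relationship between these settings can be sketched in Python. This is a hypothetical illustration of the options described above, not the sampling method that Oracle Data Miner uses internally. The function sample_rows, its parameter names, and the fixed seed are assumptions.

```python
import random

def sample_rows(rows, use_all_data=False, number_of_rows=2000, percent=None, seed=0):
    """Return all rows, a fixed number of rows, or a percentage of rows."""
    rows = list(rows)
    if use_all_data:
        return rows
    if percent is not None:
        # Percent mode: the sample size scales with the size of the data.
        size = round(len(rows) * percent / 100)
    else:
        # Number of Rows mode: a fixed sample size.
        size = number_of_rows
    # A sample cannot be larger than the data itself.
    size = min(size, len(rows))
    return random.Random(seed).sample(rows, size)

data = range(10000)
print(len(sample_rows(data)))              # 2000
print(len(sample_rows(data, percent=60)))  # 6000
```

With a source of 10,000 rows, the default setting yields 2000 rows, while switching to Percent at its default of 60 yields 6000 rows.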

Details

The Details section displays the node name and comments about the node.

You can change the name of the node and edit the comments from this tab. The new node name and comments must satisfy the node name and node comments requirements.

Apply Text Node Context Menu

The context menu options depend on the type of the node. It provides the shortcut to perform various tasks and view information related to the node.

To view the Apply Text node context menu, right-click the node. The following options are available in the context menu:

Build Text

Build Text node prepares a data source that has one or more Text columns.

You can use the data to build models.

Build Text can run in parallel.

Default Behavior of the Build Text Node

The Build Text node enables you to define a text transformation for each text column.

You can use the transformed columns to build models using any algorithm that supports text.

Note:

O-Cluster does not support text, and Decision Tree supports text only when you connect to Oracle Database 12c or later.

A Build Text node builds one model using the NMF algorithm by default. The transformed column or columns are passed to subsequent nodes and the non-transformed columns are not passed on.

All models in the node have the same case ID.

Create Build Text Node

You create a Build Text node to prepare a data source that has one or more Text columns.

Before creating a Build Text node, create a workflow. Then identify or create a Data Source node.
To create a Build Text node:
  1. In the Components pane, go to Workflow Editor. If the Components pane is not visible, then in the SQL Developer menu bar, go to View and click Components. Alternately, press Ctrl+Shift+P to dock the Components pane.
  2. In the Workflow Editor, expand Text and click Build Text.
  3. Drag and drop the node from the Components pane to the Workflow pane.
    The node is added to the workflow. The GUI shows that the node has no data associated with it. Therefore, it cannot be run.
  4. Move to the node that provides data for the Build Text node. Right-click and select Connect. Drag the line to the Build Text node and click again.
  5. Either accept the default settings or edit the text details. To edit the transformation details, right-click the node and select Edit. The Edit Build Text Node dialog box opens.
  6. The node is ready to run. Right-click the node and select Run.

Edit Build Text Node

The Edit Build Text Node dialog box enables you to define transformations for text columns. The transformed text columns can be used in data mining.

To open the Edit Build Text Node dialog box:

  1. Right-click the Build Text node and select Edit. Alternately, double-click the node. The Edit Build Text Node dialog box opens.
  2. The Edit Build Text Node dialog box has two panes.
    • In the top pane, you can perform the following tasks:

      • Specify the case ID (optional).

      • Open the Stoplist Editor.

      • View attributes: Select All or Text and Transformed from the drop-down list. For each attribute, the following are displayed:

        • Type: The data type of the attribute. The type of an attribute that has a text transform applied is DM_NESTED_NUMERICALS.

        • Source: The source column for a transformed column.

        • Transform: The type of text transform, either Token or Theme.

        • Output: Indicates whether the attribute is passed on to subsequent nodes. By default, all attributes are passed on.

      • Define transformation: To define a transformation, select a text attribute and click add. Define the text transformation in the Add/Edit Text Transform dialog box. Repeat the step for each text attribute.

      • Edit Transformation: Select the transformed attribute and click edit.

      • Delete Transformation: Select the transformed attribute and click delete.

      • Exclude attribute: To exclude an attribute, select it and in the Output column of the grid, click pass. The icon changes to ignore. The excluded attribute is not passed on to subsequent operations. You may want to exclude the non-transformed version of a text column.

      • Include attribute: To include an attribute, click the ignore icon again. The icon changes to include, indicating that it is included.

    • In the lower pane, you can view the text transformation after the node is run.

  3. Click OK.

View the Text Transform

In the View Text Transform dialog box, you can view the output sample of the tokens or themes for a sample of attributes.

To view the effects of a text transformation:

  1. Open the Edit Build Text Node dialog box by double-clicking the Build Text node, or by right-clicking the node and selecting Edit.
  2. In the upper pane, select the name of the transformed attribute.
  3. In the lower pane:
    • Click Tokens or Themes. A grid displays all the tokens or themes in the document and the frequency of each token or theme. Use the search field to search for tokens or themes by name (the default) or by frequency.

    • To add a token or theme to the stoplist, select it and click Add to Stoplist.

    • Click Output to view the tokens or themes for a sample of the attributes. Output Sample lists a sample of the attributes listed by case ID, if you specified one, or by row ID if you did not specify a case ID. You can search the IDs.

    • Click an ID. The original text from the non-transformed attribute is displayed along with the tokens or themes identified in it, along with the frequency of each token or theme.

  4. Click OK.

Add/Edit Text Transform

You can add and edit text related transformation settings in the Add/Edit Text Transform dialog box.

The Add/Edit Text Transform dialog box can be opened from the Edit Build Text Node dialog box. To add a text transform, click add. To edit an existing one, click edit. The fields and their default values are as follows:

  • Source Column: This is the name of the column to be transformed.

  • Transform Type: This is either Token (the default) or Theme.

  • Output Column: This is the name of the new column. The default name is the source column name with either TOK (for Token) or THM (for Theme) appended, depending on the transformation type. To specify the output column name, deselect Automatic and enter a name in the Output Column field.

In the Settings section, specify characteristics of the text and the transform:

  • Language: Select any one of the following options:

    • Single Language: By default, a single language is specified. English is the default language. You can select a different language.

    • Multiple Language: Select this option to specify multiple languages. For example, to specify Single Byte languages, such as Arabic, Turkish, Thai, and European languages, select them from the Single Byte list. To specify Multibyte languages, such as Chinese (simplified or traditional), Japanese, or Korean, select them from the Multibyte list.

  • Stoplist: Oracle Text provides default stoplists for several single languages. If there is a default stoplist, then it is selected. For several languages, the default is no stoplist. You can select any stoplist that was previously created for this attribute from the drop-down list. You can perform the following tasks:

    • Edit a Stoplist: To edit a stoplist, click edit. The Stoplist Editor opens.

    • Add a Stoplist: To add a stoplist, click add. The Stoplist Editor opens.

  • Token: If you select Token, then the defaults are:

    • Maximum number per document: 50 (default)

    • Maximum number across all documents: 3000 (default)

    You can change these values. The tokens per document and across all documents cutoffs are for rankings, not for an absolute count of tokens. You could have more than 3000 tokens across all documents if there were ties.

  • Theme: If you select Theme, then the defaults are:

    • Maximum number per document: 50 (default)

    • Maximum number across all documents: 3000 (default)

    You can change these values. The themes per document and across all documents cutoffs are for rankings, not for an absolute count of themes. You could have more than 3000 themes across all documents if there were ties.

    Theme includes a Theme Type specification. The default is Single. You can select Full.

  • Frequency: The default is Term Frequency. You can select Term Frequency IDF.

    Note:

    Frequency is a sticky setting. If you change it, then the changed value becomes the default.

    Term Frequency uses the term frequency in the document itself. It does not take collection information into account.

    Term Frequency IDF is the traditional TF-IDF. It takes into account information from the document (Term Frequency) and collection-level information (IDF plus the terms to use if a maximum overall number of terms for the collection is set).

    TF-IDF (Term Frequency–Inverse Document Frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection.
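
The two frequency settings and the ranking cutoffs can be illustrated with a short Python sketch. The formula used here, the term count multiplied by log(N / document frequency), is one common textbook variant of TF-IDF and is an assumption; the exact weighting that Oracle Data Mining applies is not stated in this section. The function names tf_idf and top_ranked are hypothetical.

```python
import math
from collections import Counter

def tf_idf(corpus_tokens):
    """Weight each term by its count in the document times log(N / df)."""
    n_docs = len(corpus_tokens)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    weights = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)  # plain Term Frequency for this document
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

def top_ranked(scores, max_rank):
    """Keep every term tied at or above the max_rank-th highest score."""
    if len(scores) <= max_rank:
        return dict(scores)
    cutoff = sorted(scores.values(), reverse=True)[max_rank - 1]
    return {term: score for term, score in scores.items() if score >= cutoff}

docs = [["oracle", "text", "text"], ["oracle", "miner"], ["oracle", "index"]]
weights = tf_idf(docs)
print(weights[0])

# A rank cutoff of 2 keeps three terms here because "b" and "c" are tied.
kept = top_ranked({"a": 3, "b": 2, "c": 2, "d": 1}, 2)
print(sorted(kept))  # ['a', 'b', 'c']
```

In this example, the term "oracle" appears in every document, so its TF-IDF weight is zero, whereas plain Term Frequency would still count it. The second call shows why a ranking cutoff can return more terms than the stated maximum when scores are tied.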

Stoplist Editor

In the Stoplist Editor, you can either edit an existing stoplist, or you can create a new stoplist. Stoplists are shared among all workflows.

You can edit any stoplist in this dialog box, not just the ones associated with transformations defined in this node.

To access the Stoplist Editor, open the Edit Build Text Node dialog box by double-clicking a Build Text node. To view, edit, and create a stoplist:

  1. Click Edit Stoplist.
  2. The Stoplist Editor opens. All stoplists for all transformations are listed.
  3. To add a stoplist, click add. The New Stoplist Editor wizard opens.
  4. To modify an existing stoplist, select the stoplist from the Custom Stoplists list.

    The items in the stoplist are listed in the bottom pane.

    • To delete an item from the stoplist, select the item and click delete.

    • To add stopwords or stopthemes to the selected list, click add. The Add Stopwords/Stopthemes dialog box opens.

  5. To delete a stoplist, select it in the Custom Stoplists list and click delete.

New Stoplist Editor

By using the New Stoplist Editor wizard, you can create new stoplists, edit stoplists, and combine stoplists.

You can perform the following tasks:

  • Create new stoplists. To create a stoplist, click add. The New Stoplist Wizard starts. The wizard has two steps:

    • Stoplist Definition

    • Review

  • Remove words from an existing stoplist.

  • Combine several stoplists to create a new one. For example, if the document is in both French and English, then you can combine the French and English stoplists.

  • Create an empty stoplist to which you must add all stopwords and stopthemes.
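
Conceptually, extending several stoplists amounts to taking their union and then adding or removing individual entries. The following Python sketch illustrates that idea only. It does not use Oracle Text, the function extend_stoplists is hypothetical, and the sample words are illustrative rather than the contents of the Oracle Text default stoplists.

```python
def extend_stoplists(*stoplists, add=(), remove=()):
    """Combine existing stoplists, then add and remove individual entries."""
    combined = set().union(*stoplists)  # no arguments gives an empty stoplist
    combined.update(add)
    combined.difference_update(remove)
    return combined

# Illustrative entries only.
english = {"a", "the", "this", "with"}
french = {"le", "la", "les"}

combined = extend_stoplists(english, french, add={"via"}, remove={"this"})
print(sorted(combined))  # ['a', 'la', 'le', 'les', 'the', 'via', 'with']

empty = extend_stoplists()
print(empty)  # set()
```

Calling the function with no stoplists corresponds to the empty stoplist case, where every stopword and stoptheme must be added explicitly.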

Stoplist Definition

Follow these steps to define a stoplist:

  1. Either accept the provided name or enter a different name.
  2. The default selection Extends following stoplist(s) enables you to create a new stoplist by combining and modifying existing stoplists.

    Select one or more stoplists to extend. If you select several stoplists, then they are combined.

  3. To create a completely new stoplist, select Empty and select a language for the stoplist. The default is English.
  4. Click Next.

Review

To add or remove stopwords and stopthemes:

  1. To add items to the stoplist, click add. The Add Stopwords/Stopthemes dialog box opens.
  2. To remove an item from the stoplist, select it and click delete.
  3. Click Finish.

Add Stopwords/Stopthemes

In the Add Stopwords/Stopthemes dialog box, you can add stopwords and stopthemes to a stoplist.

To add Stopwords or Stopthemes:

  1. Enter the stopwords, separated by commas.
  2. Click OK when you have finished.

Build Text Node Properties

In the Properties pane, you can examine and change the characteristics or properties of a node.

To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternately, right-click the node and click Go to Properties.

The Build Text node Properties pane has the following sections:

Transforms

In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.

Sample

The data is sampled to support data analysis.

The default is to use a sample. The Sample tab has the following selections:

  • Use All Data: By default, Use All Data is deselected.

  • Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.

Cache

The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.

You can perform the following tasks:

  • Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.

    • Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.

Details

The Details section displays the node name and comments about the node.

You can change the name of the node and edit the comments in this section. The new node name and comments must meet the requirements.

Build Text Node Context Menu

The context menu options depend on the type of the node. It provides the shortcut to perform various tasks and view information related to the node.

Right-click a Build Text node. The following options are available in the context menu:

Text Reference

A Text Reference node enables you to reference text transformations defined in a Build Text node in the current workflow or in a different workflow.

For example, if you have one workflow that builds a Text model (that is, a workflow that includes a Build Text node) and you want to create a separate workflow that applies the model created in the first workflow, then you can use a Text Reference to provide the text transformation information required by Apply Text.

Create a Text Reference Node

You create a Text Reference node to reference text transformations that are defined in a Build Text node in the current workflow or in a different workflow.

Before creating a Text Reference node, create a workflow. Then, identify or create a data source.
To create a Text Reference node:
  1. In the Components pane, go to Workflow Editor. If the Components pane is not visible, then in the SQL Developer menu bar, go to View and click Components. Alternately, press Ctrl+Shift+P to dock the Components pane.
  2. In the Workflow Editor, expand Text and click Text Reference.
  3. Drag and drop the node from the Components pane to the Workflow pane.
    The node is added to the workflow. The GUI shows that the node has no data associated with it. Therefore, it cannot be run.
  4. Go to the Edit Text Reference Node dialog box to select a Build Text node to reference.
  5. The node is ready to be used. Connect it to an Apply Text node. The Text Reference node is used instead of a Build Text node.

Edit Text Reference Node

The Edit Text Reference Node dialog box enables you to select a Build Text node so that you can use its transformations in the current location in the current workflow.

To open the Edit Text Reference Node:

  1. Right-click the node and select Edit. Alternately, double-click the node. The Edit Text Reference Node dialog box has two panes.
  2. In the upper pane, click Select. The Select Build Text Node dialog box opens.
  3. After you select a Build Text node, you can view tokens or themes for any transformed nodes. Select a transformed node in the upper pane.
  4. In the bottom pane, the tokens and themes and their frequencies are displayed. You can search by token or theme (the default) or by frequency.
  5. Click OK.

Select Build Text Node

In the Select Build Text Node dialog box, you can select a Build Text node that is either in the current workflow (the default) or in all workflows.

Show specifies the list of Build Text nodes to select from.

  1. In the Show field, select either All Workflows or Current Workflows (default).
  2. In the Search field, you can search for Build Text nodes by project (the default), workflow, or node.
  3. Select a Build Text node from the Available Nodes grid. For each Build Text node the grid shows project, workflow, and status.
  4. Click OK.

Note:

You cannot select a Text node that is not complete.

Text Reference Node Properties

In the Properties pane, you can examine and change the characteristics or properties of a node.

To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternately, right-click the node and click Go to Properties.

The Text Reference node Properties pane has the following sections:

Transforms

The Transforms dialog box for the Text Reference node displays the transformation-related information selected in the Edit Text Reference Node dialog box.

You can select a different Build Text node from the Properties pane.

Details

The Details section displays the node name and comments about the node.

You can change the name of the node and edit the comments in this section. The new node name and comments must meet the requirements.

Text Reference Node Context Menu

The context menu options depend on the type of the node. It provides the shortcut to perform various tasks and view information related to the node.

Right-click a Text Reference node. The following selections are displayed:
