11 Text Nodes
Text nodes are available in the Text section of the Components pane. Oracle Text Knowledge Base is required for text processing.
To install Oracle Text Knowledge Base, you must install the Oracle Database Examples. For directions on how to install the examples, see Oracle Database Examples Installation Guide.
If you are connected to Oracle Database 12c or later, then you can use Automatic Data Preparation (ADP) to prepare text data. Use the Text tab to specify data usage.
Oracle Data Miner supports the following Text nodes:
- Oracle Text Concepts: Oracle Text concepts include the terms Theme, Stopword, Stoplist, and Stoptheme.
- Text Mining in Oracle Machine Learning: Text must undergo a transformation process before it can be mined.
- Apply Text Node: The Apply Text node enables you to apply existing text transformations from either a Build Text node or from a Text Node to new data.
- Build Text: The Build Text node prepares a data source that has one or more Text columns.
- Text Reference: A Text Reference node enables you to reference text transformations defined in a Build Text node in the current workflow or in a different workflow.
11.1 Oracle Text Concepts
Oracle Text concepts include the terms Theme, Stopword, Stoplist, and Stoptheme.
- Theme: A theme is a topic associated with a given document. A document can have many themes. A theme does not have to appear in a document. For example, a document containing the words "San Francisco" may have "California" as one of its themes.
- Stopword: A stopword is a word that is not indexed during text transformations. A stopword is usually a low-information word. In English, "a", "the", "this", and "with" are usually stopwords.
- Stoplist: A stoplist is a list of stopwords. Oracle Text supplies a stoplist for every language. By default during indexing, the system uses the Oracle Text default stoplist for your language. You can edit the default stoplist or create a new one.
  Note: In Oracle Data Miner, stoplists are shared across all transformations and are not owned by a specific transformation.
- Stoptheme: A stoptheme is a theme to be skipped over during indexing. Stopthemes are specified by adding them to stoplists.
Oracle Text uses stopwords and stopthemes to indicate text that can be safely ignored during text mining.
The Oracle Text Lexer breaks source text into tokens or themes (usually words) in accordance with a specified language. To extract tokens, the Lexer uses parameters defined by a lexer preference. These parameters include:
- Definitions for the characters that separate tokens, for example, whitespace.
- Whether to convert text to all uppercase.
- Text analysis to create theme tokens. This is done when theme indexing is enabled.
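The token-related lexer parameters and the role of a stoplist can be illustrated with a small Python sketch. This is not Oracle Text code: the function name `tokenize` and its parameters are hypothetical and only mimic separator characters, uppercase conversion, and stopword filtering. Theme tokens are not covered, because they depend on the Oracle Text Knowledge Base.

```python
import re

def tokenize(text, separators=" \t\n", to_upper=True, stoplist=frozenset()):
    """Split text into tokens, loosely mimicking a lexer preference.

    separators: characters that separate tokens (hypothetical parameter).
    to_upper:   whether tokens are converted to all uppercase.
    stoplist:   stopwords to skip; compared case-insensitively.
    """
    # Build a character class that matches runs of separator characters.
    pattern = "[" + re.escape(separators) + "]+"
    stop = {word.upper() for word in stoplist}
    tokens = []
    for raw in re.split(pattern, text):
        if not raw or raw.upper() in stop:
            continue  # skip empty pieces and stopwords
        tokens.append(raw.upper() if to_upper else raw)
    return tokens

tokens = tokenize("the bridge in San Francisco", stoplist={"the", "in"})
print(tokens)
```

With the illustrative stoplist above, the stopwords are dropped and the remaining words are returned in uppercase.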
11.2 Text Mining in Oracle Machine Learning
Text must undergo a transformation process before it can be mined.
After the data has been properly transformed, a case table can be used for building, testing, or scoring machine learning models. The case table must be a relational table. It cannot be created as a view.
A source table for Oracle Machine Learning can include one or more columns of text. A text column cannot be used as a target.
The following Oracle Machine Learning algorithms support text:
- Anomaly Detection (one-class Support Vector Machine)
- Classification algorithms: Naive Bayes, Generalized Linear Models, and Support Vector Machine. Decision Tree also supports text when you connect to Oracle Database 12c or later.
- Clustering algorithms: k-Means and Expectation Maximization
- Feature Extraction algorithms: Nonnegative Matrix Factorization, Singular Value Decomposition, and Principal Components Analysis
- Regression algorithms: Generalized Linear Models and Support Vector Machine
Note:
These algorithms do not support text:
- O-Cluster
- Decision Tree when you connect to Oracle Database 11g
- Association (Apriori)
Any text attributes are automatically filtered out for model builds when you use O-Cluster, or Decision Tree while connected to Oracle Database 11g.
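The support rules above can be summarized as a lookup table. The following Python sketch only transcribes the lists in this section; the names `TEXT_SUPPORT` and `supports_text` are hypothetical, and the database release is reduced to a major version number (11 for 11g, 12 for 12c) as a simplifying assumption.

```python
# Text support per algorithm, transcribed from the lists in this section.
# True/False = always/never supports text; an integer is the minimum
# Oracle Database major version (12 stands for 12c) that supports text.
TEXT_SUPPORT = {
    "One-Class Support Vector Machine": True,
    "Naive Bayes": True,
    "Generalized Linear Models": True,
    "Support Vector Machine": True,
    "Decision Tree": 12,
    "k-Means": True,
    "Expectation Maximization": True,
    "Nonnegative Matrix Factorization": True,
    "Singular Value Decomposition": True,
    "Principal Components Analysis": True,
    "O-Cluster": False,
    "Apriori": False,
}

def supports_text(algorithm, db_major_version):
    """Return True if the algorithm accepts text for the given database version."""
    rule = TEXT_SUPPORT[algorithm]
    if isinstance(rule, bool):  # check bool first: bool is a subclass of int
        return rule
    return db_major_version >= rule
```

For example, `supports_text("Decision Tree", 11)` is false while `supports_text("Decision Tree", 12)` is true, which matches the two Decision Tree entries above.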
- Data Preparation for Text: Data preparation for text depends on the version of Oracle Database that you connect to.
11.2.1 Data Preparation for Text
Data preparation for text depends on the version of Oracle Database that you connect to.
- Text Processing in Oracle Machine Learning 12c Release 1 (12.1) and Later: In Oracle Machine Learning 12c Release 1 (12.1) and later, if unstructured text data is present, then text processing includes text transformation before text analysis.
- Text Processing in Oracle Machine Learning 11g Release 2 (11.2) and Earlier: In Oracle Machine Learning 11g Release 2 (11.2) and earlier, text must undergo Feature Extraction and text preparation before text analysis is done.
11.2.1.1 Text Processing in Oracle Machine Learning 12c Release 1 (12.1) and Later
In Oracle Machine Learning 12c Release 1 (12.1) and later, if unstructured text data is present, then text processing includes text transformation before text analysis.
Oracle Machine Learning includes significant enhancements in text processing that simplify the machine learning process (model build, deployment, and scoring) when unstructured text data is present in the input. Some points about unstructured text and text transformation:
- Unstructured text includes data items such as web pages, document libraries, Microsoft PowerPoint presentations, product specifications, email messages, comment fields in reports, and call center notes.
- CLOB columns and long VARCHAR2 columns are automatically interpreted as unstructured text by Oracle Machine Learning.
- Columns of short VARCHAR2, CHAR, BLOB, and BFILE can be specified as unstructured text.
- To transform unstructured text for machine learning, Oracle Machine Learning uses Oracle Text utilities and term weighting strategies.
- Text terms are extracted and given numeric values in a text index.
- The text transformation process is configurable for models and individual attributes. You can specify data preparation for text nodes when you define a model node.
After text transformation, the text can be mined with a machine learning algorithm.
Note:
If you connect to Oracle Database 12c Release 1 or later, then it is not always necessary to use the Text nodes (Apply Text, Build Text, and Text Reference).
11.2.1.2 Text Processing in Oracle Machine Learning 11g Release 2 (11.2) and Earlier
In Oracle Machine Learning 11g Release 2 (11.2) and earlier, text must undergo Feature Extraction and text preparation before text analysis is done.
These processes are:
- Extraction or Feature Extraction: This is a special preprocessing step, where the text is broken down into units (terms) that can be analyzed. Text terms can be keywords or other document-derived features.
- Text preparation: Text preparation uses a Build Text node to transform text columns. Build Text does not support HTML or XML documents. It also does not support any binary data types.
Oracle Data Miner uses the facilities of Oracle Text to preprocess text columns.
Note:
You must preprocess text using the Text nodes: Apply Text, Build Text, and Text Reference.
11.3 Apply Text Node
The Apply Text node enables you to apply existing text transformations from either a Build Text node or from a Text Node to new data.
This ensures that the apply data is transformed in the same way that the build data was transformed.
Apply Text can run in parallel.
- Default Behavior for the Apply Text Node: The Apply Text node applies existing text transformations from either a Build Text node or a Text node to new data.
- Create an Apply Text Node: You create an Apply Text node to apply existing text transformations from either a Build Text node or from a Text Node to new data.
- Edit Apply Text Node: The Edit Apply Text Node dialog box enables you to view the text transformations performed on the Build data.
- Apply Text Node Properties: In the Properties pane, you can examine and change the characteristics or properties of a node.
- Apply Text Node Context Menu: The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
11.3.1 Default Behavior for the Apply Text Node
The Apply Text node applies existing text transformations from either a Build Text node or a Text node to new data.
This ensures that the apply data is transformed in the same way that the build data was transformed.
Note:
All models in the node must have the same case ID.
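Why the apply data must be transformed in the same way as the build data can be seen in a simplified Python sketch. It is a conceptual stand-in, not Oracle Data Miner code: `build_transform` and `apply_transform` are hypothetical names, and a plain word-count vocabulary stands in for the real text transformation.

```python
from collections import Counter

def build_transform(build_docs, max_terms=3000):
    """Learn a vocabulary from the build data (stand-in for a Build Text node)."""
    counts = Counter(tok for doc in build_docs for tok in doc.lower().split())
    # Most frequent terms first; alphabetical order breaks ties deterministically.
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return ordered[:max_terms]

def apply_transform(vocabulary, docs):
    """Reuse the stored vocabulary on new data (stand-in for an Apply Text node)."""
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # One column per build-time term; unseen words are ignored.
        rows.append([counts[term] for term in vocabulary])
    return rows

vocabulary = build_transform(["red apple", "green apple pie"])
apply_rows = apply_transform(vocabulary, ["apple pie with cream"])
```

Because the apply step reuses the vocabulary learned from the build data, the new rows have exactly the columns that a model built on that data expects. Words that were never seen at build time, such as "cream" here, do not create new columns.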
11.3.2 Create an Apply Text Node
You create an Apply Text node to apply existing text transformations from either a Build Text node or from a Text Node to new data.
11.3.3 Edit Apply Text Node
The Edit Apply Text Node dialog box enables you to view the text transformations performed on the Build data.
The Apply data must be prepared in the same way as the Build data.
To open the Edit Apply Text Node dialog box, right-click the Apply Text node and select Edit, or just double-click the node.
- View the Text Transform (Apply): You can view the effects of the text transformation defined in the Build Text node or the Text node in the View Text Transform window.
11.3.3.1 View the Text Transform (Apply)
You can view the effects of the text transformation defined in the Build Text node or the Text node in the View Text Transform window.
To access the View Text Transform window:
11.3.4 Apply Text Node Properties
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Apply Text node Properties pane has the following sections:
- Transforms: In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
- Cache: The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
- Sample: The data is sampled to support data analysis.
- Details: The Details section displays the node name and comments about the node.
11.3.4.1 Transforms
In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
11.3.4.2 Cache
The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
You can perform the following tasks:
- Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.
- Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.
11.3.4.3 Sample
The data is sampled to support data analysis.
The default is to use a sample. The Sample tab has the following selections:
- Use All Data: By default, Use All Data is deselected.
- Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.
11.3.4.4 Details
The Details section displays the node name and comments about the node.
You can change the name of the node and edit the comments from this tab. The new node name and comments must satisfy the node name and node comments requirements.
11.3.5 Apply Text Node Context Menu
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
To view the Apply Text node context menu, right-click the node. The following options are available in the context menu:
- Edit. Opens the Edit Apply Text Node dialog box.
- Performance Settings. Opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
- Show Runtime Errors. Displayed if there are errors.
- Show Validation Errors. Displayed if there are validation errors.
11.4 Build Text
The Build Text node prepares a data source that has one or more text columns.
You can use the data to build models.
Build Text can run in parallel.
- Default Behavior of the Build Text Node: The Build Text node enables you to define a text transformation for each text column.
- Create Build Text Node: You create a Build Text node to prepare a data source that has one or more Text columns.
- Edit Build Text Node: The Edit Build Text Node dialog box enables you to define transformations for text columns. The transformed text columns can be used in machine learning.
- Build Text Node Properties: In the Properties pane, you can examine and change the characteristics or properties of a node.
- Build Text Node Context Menu: The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
11.4.1 Default Behavior of the Build Text Node
The Build Text node enables you to define a text transformation for each text column.
You can use the transformed columns to build models using any algorithm that supports text.
Note:
O-Cluster does not support text. Decision Tree supports text only when you connect to Oracle Database 12c or later.
A Build Text node builds one model using the NMF algorithm by default. The transformed column or columns are passed to subsequent nodes and the non-transformed columns are not passed on.
All models in the node have the same case ID.
11.4.2 Create Build Text Node
You create a Build Text node to prepare a data source that has one or more Text columns.
11.4.3 Edit Build Text Node
The Edit Build Text Node dialog box enables you to define transformations for text columns. The transformed text columns can be used in machine learning.
To open the Edit Build Text Node dialog box:
- View the Text Transform: In the View Text Transform dialog box, you can view the output sample of the tokens or themes for a sample of attributes.
- Add/Edit Text Transform: You can add and edit text-related transformation settings in the Add/Edit Text Transform dialog box.
- Stoplist Editor: In the Stoplist Editor, you can either edit an existing stoplist, or you can create a new stoplist. Stoplists are shared among all workflows.
11.4.3.1 View the Text Transform
In the View Text Transform dialog box, you can view the output sample of the tokens or themes for a sample of attributes.
To view the effects of a text transformation:
11.4.3.2 Add/Edit Text Transform
You can add and edit text-related transformation settings in the Add/Edit Text Transform dialog box.
The Add/Edit Text Transform dialog box can be opened from the Edit Build Text Node dialog box. To add or edit a text transform, click the corresponding icon there. The dialog box contains the following fields, listed here with their default values:
- Source Column: This is the name of the column to be transformed.
- Transform Type: This is either Token (the default) or Theme.
- Output Column: This is the name of the new column. The default name is the source column name with either TOK (for Token) or THM (for Theme) appended, depending on the transformation type. To specify the output column name, deselect Automatic and enter a name in the Output Column field.
In the Settings section, specify characteristics of the text and the transform:
- Language: Select one of the following options:
  - Single Language: By default, a single language is specified. English is the default language. You can select a different language.
  - Multiple Language: Select this option to specify multiple languages. For example, to specify single-byte languages, such as Arabic, Turkish, Thai, and European languages, select them from the Single Byte list. To specify multibyte languages, such as Chinese (simplified or traditional), Japanese, or Korean, select them from the Multibyte languages list.
- Stoplist: Oracle Text provides default stoplists for several single languages. If there is a default stoplist, then it is selected. For several languages, the default is no stoplist. You can select any stoplist that was previously created for this attribute from the drop-down list. You can perform the following tasks:
  - Edit a Stoplist: To edit a stoplist, click the corresponding icon. The Stoplist Editor opens.
  - Add a Stoplist: To add a stoplist, click the corresponding icon. The Stoplist Editor opens.
- Token: If you select Token, then the defaults are:
  - Maximum number per document: 50 (default)
  - Maximum number across all documents: 3000 (default)
  You can change these values. The per-document and across-all-documents cutoffs apply to rankings, not to an absolute count of tokens. You could have more than 3000 tokens across all documents if there were ties.
- Theme: If you select Theme, then the defaults are:
  - Maximum number per document: 50 (default)
  - Maximum number across all documents: 3000 (default)
  You can change these values. The per-document and across-all-documents cutoffs apply to rankings, not to an absolute count of themes. You could have more than 3000 themes across all documents if there were ties.
  Theme includes a Theme Type specification. The default is Single. You can select Full.
- Frequency: The default is Term Frequency. You can select Term Frequency IDF.
  Note: Frequency is a sticky setting. If you change it, then the changed value becomes the default.
  Term Frequency uses the term frequency in the document itself. It does not take collection information into account.
  Term Frequency IDF is the traditional TF-IDF. It takes into account information from the document (Term Frequency) and collection-level information (IDF plus the terms to use if a maximum overall number of terms for the collection is set).
  TF-IDF (Term Frequency–Inverse Document Frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection.
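The difference between the two Frequency settings, and the effect of ties on a ranking cutoff, can be sketched in Python. This is a textbook formulation, tf(t, d) * log(N / df(t)), and is only an assumption about the general idea; the exact weighting used by Oracle Text and Oracle Data Miner may differ. All function names are hypothetical.

```python
import math
from collections import Counter

def term_frequency(doc_tokens):
    """Term Frequency: count of each term within a single document."""
    return Counter(doc_tokens)

def tf_idf(docs_tokens):
    """Textbook TF-IDF: tf(t, d) * log(N / df(t)), with N the number of documents."""
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs_tokens for term in set(doc))
    weights = []
    for doc in docs_tokens:
        tf = term_frequency(doc)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

def top_ranked(scores, max_terms):
    """Keep terms ranked within max_terms (assumed >= 1); ties at the cutoff are all kept."""
    if len(scores) <= max_terms:
        return set(scores)
    cutoff = sorted(scores.values(), reverse=True)[max_terms - 1]
    return {term for term, score in scores.items() if score >= cutoff}

docs = [["oracle", "text", "text"], ["oracle", "miner"], ["oracle", "stoplist"]]
weights = tf_idf(docs)
kept = top_ranked({"a": 3, "b": 2, "c": 2, "d": 1}, max_terms=2)
```

In this sketch, "oracle" appears in every document, so its TF-IDF weight is zero even though its term frequency is not. The call to `top_ranked` keeps three terms for a limit of two because "b" and "c" tie at the cutoff rank, which is the same reason a limit of 3000 can yield more than 3000 tokens or themes.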
11.4.3.3 Stoplist Editor
In the Stoplist Editor, you can either edit an existing stoplist, or you can create a new stoplist. Stoplists are shared among all workflows.
You can edit any stoplist in this dialog box, not just the ones associated with transformations defined in this node.
To access the Stoplist Editor, open the Edit Build Text Node dialog box by double-clicking a Build Text node. To view, edit, and create a stoplist:
- New Stoplist Editor: By using the New Stoplist Editor wizard, you can create new stoplists, edit stoplists, and combine stoplists.
- Add Stopwords/Stopthemes: In the Add Stopwords/Stopthemes dialog box, you can add stopwords and stopthemes to a stoplist.
11.4.3.3.1 New Stoplist Editor
By using the New Stoplist Editor wizard, you can create new stoplists, edit stoplists, and combine stoplists.
You can perform the following tasks:
- Create new stoplists. To create a stoplist, click the corresponding icon. The New Stoplist Wizard starts. The wizard has two steps:
  - Stoplist Definition
  - Review
- Remove words from an existing stoplist.
- Combine several stoplists to create a new one. For example, if the document is in both French and English, then you can combine the French and English stoplists.
- Create an empty stoplist to which you must add all stopwords and stopthemes.
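Combining stoplists amounts to taking the union of their entries. The following Python sketch illustrates the idea only; `combine_stoplists` is a hypothetical helper, and the two short lists are made-up examples, not the Oracle Text default stoplists.

```python
def combine_stoplists(*stoplists):
    """Return the union of several stoplists, sorted and without duplicates."""
    combined = set()
    for stoplist in stoplists:
        combined.update(word.lower() for word in stoplist)
    return sorted(combined)

# Tiny illustrative lists, not the Oracle Text default stoplists.
english = ["the", "a", "with"]
french = ["le", "la", "a"]
combined = combine_stoplists(english, french)
```

An entry that appears in both source lists, such as "a" here, appears only once in the combined stoplist.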
11.4.3.3.1.1 Stoplist Definition
Follow these steps to define a stoplist:
11.4.3.3.1.2 Review
To add or remove stopwords and stopthemes:
- To add items to the stoplist, click the corresponding icon. The Add Stopwords/Stopthemes dialog box opens.
- To remove items from the stoplist, select them and click the corresponding icon.
- Click Finish.
11.4.3.3.2 Add Stopwords/Stopthemes
In the Add Stopwords/Stopthemes dialog box, you can add stopwords and stopthemes to a stoplist.
To add Stopwords or Stopthemes:
- Enter the stopwords, separated by commas.
- Click OK when you have finished.
11.4.4 Build Text Node Properties
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Build Text node Properties pane has the following sections:
- Transforms: In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
- Sample: The data is sampled to support data analysis.
- Cache: The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
- Details: The Details section displays the node name and comments about the node.
11.4.4.1 Transforms
In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
11.4.4.2 Sample
The data is sampled to support data analysis.
The default is to use a sample. The Sample tab has the following selections:
- Use All Data: By default, Use All Data is deselected.
- Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.
11.4.4.3 Cache
The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
You can perform the following tasks:
- Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.
- Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.
11.4.4.4 Details
The Details section displays the node name and comments about the node.
You can change the name of the node and edit the comments in this section. The new node name and comments must meet the node name and node comments requirements.
11.4.5 Build Text Node Context Menu
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
To view the Build Text node context menu, right-click the node. The following options are available in the context menu:
- Edit. Opens the Edit Build Text Node dialog box.
- Performance Settings. Opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
- Show Runtime Errors. Displayed if there are errors.
- Show Validation Errors. Displayed if there are validation errors.
11.5 Text Reference
A Text Reference node enables you to reference text transformations defined in a Build Text node in the current workflow or in a different workflow.
For example, if you have one workflow that builds a Text model (that is, a workflow that includes a Build Text node) and you want to create a separate workflow that applies the model created in the first workflow, then you can use a Text Reference to provide the text transformation information required by Apply Text.
- Create a Text Reference Node: You create a Text Reference node to reference text transformations that are defined in a Build Text node in the current workflow or in a different workflow.
- Edit Text Reference Node: The Edit Text Reference Node dialog box enables you to select a Build Text node so that you can use its transformations in the current location in the current workflow.
- Text Reference Node Properties: In the Properties pane, you can examine and change the characteristics or properties of a node.
- Text Reference Node Context Menu: The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
11.5.1 Create a Text Reference Node
You create a Text Reference node to reference text transformations that are defined in a Build Text node in the current workflow or in a different workflow.
11.5.2 Edit Text Reference Node
The Edit Text Reference Node dialog box enables you to select a Build Text node so that you can use its transformations in the current location in the current workflow.
To open and use the Edit Text Reference Node dialog box:
- Right-click the node and select Edit. Alternatively, double-click the node. The Edit Text Reference Node dialog box has two panes.
- In the upper pane, click Select. The Select Build Text Node dialog box opens.
- After you select a Build Text node, you can view tokens or themes for any transformed nodes. Select a transformed node in the upper pane.
- In the bottom pane, the tokens and themes and their frequencies are displayed. You can search by token or theme (the default) or by frequency.
- Click OK.
- Select Build Text Node: In the Select Build Text Node dialog box, you can select a Build Text node that is either in the current workflow (the default) or in all workflows.
11.5.2.1 Select Build Text Node
In the Select Build Text Node dialog box, you can select a Build Text node that is either in the current workflow (the default) or in all workflows.
Show specifies the list of Build Text nodes to select from.
- In the Show field, select either All Workflows or Current Workflows (default).
- In the Search field, you can search for Build Text nodes by project (the default), workflow, or node.
- Select a Build Text node from the Available Nodes grid. For each Build Text node the grid shows project, workflow, and status.
- Click OK.
Note:
You cannot select a Text node that is not complete.
11.5.3 Text Reference Node Properties
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Text Reference node Properties pane has the following sections:
- Transforms: The Transforms dialog box for the Text Reference node displays the transformation-related information selected in the Edit Text Reference Node dialog box.
- Details: The Details section displays the node name and comments about the node.
11.5.3.1 Transforms
The Transforms dialog box for the Text Reference node displays the transformation-related information selected in the Edit Text Reference Node dialog box.
You can select a different Build Text node from the Properties pane.
11.5.3.2 Details
The Details section displays the node name and comments about the node.
You can change the name of the node and edit the comments in this section. The new node name and comments must meet the node name and node comments requirements.
11.5.4 Text Reference Node Context Menu
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
To view the Text Reference node context menu, right-click the node. The following options are available in the context menu:
- Edit. Opens the Edit Text Reference Node dialog box.
- Performance Settings. Opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
- Show Runtime Errors. Displayed if there are errors.
- Show Validation Errors. Displayed if there are validation errors.