Text nodes are available in the Text section of the Components pane. Oracle Text Knowledge Base is required for text processing.
To install Oracle Text Knowledge Base, you must install the Oracle Database Examples. For directions on how to install the examples, see Oracle Database Examples Installation Guide.
If you are connected to Oracle Database 12c or later, then you can use Automatic Data Preparation (ADP) to prepare text data, using the Text tab to specify data usage.
Oracle Data Miner supports the following Text nodes:
Oracle text concepts include the terms Theme, Stopword, Stoplist, and Stoptheme.
Theme: A theme is a topic associated with a given document. A document can have many themes. A theme does not have to appear in a document. For example, a document containing the words San Francisco may have California as one of its themes.
Stopword: A stopword is a word that is not indexed during text transformations. A stopword is usually a low information word. In English, a, the, this, or with are usually stopwords.
Stoplist: A stoplist is a list of stopwords. Oracle Text supplies a stoplist for every language. By default during indexing, the system uses the Oracle Text default stoplist for your language. You can edit the default stoplist or create a new one.
Note:
In Oracle Data Miner, stoplists are shared across all transformations and are not owned by a specific transformation.
Stoptheme: A stoptheme is a theme to be skipped over during indexing. Stopthemes are specified by adding them to stoplists.
Oracle Text uses stopwords and stopthemes to indicate text that can be safely ignored during text mining.
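The effect of a stoplist can be illustrated with a minimal Python sketch. This is not Oracle Text code; the function name `remove_stopwords` and the `STOPLIST` contents are hypothetical, with the words taken from the English examples above.

```python
# Hypothetical stoplist built from the English examples in this section.
STOPLIST = {"a", "the", "this", "with"}

def remove_stopwords(words, stoplist):
    """Return the words that are not in the stoplist (case-insensitive)."""
    return [w for w in words if w.lower() not in stoplist]

words = "The lexer skips a stopword with this rule".split()
kept = remove_stopwords(words, STOPLIST)
print(kept)
```

Only the words that carry information remain in `kept`; the stopwords are dropped before any further processing.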
The Oracle Text Lexer breaks source text into tokens or themes—usually words—in accordance with a specified language. To extract tokens, the Lexer uses parameters as defined by a lexer preference. These parameters include:
Definitions for the characters that separate tokens. For example, whitespace.
Conditions to convert text to all uppercase or not.
Text analysis to create theme tokens. This is done when theme indexing is enabled.
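The first two lexer parameters above (separator characters and uppercase conversion) can be sketched in Python as follows. This is an illustrative stand-in, not the Oracle Text Lexer; the function `tokenize` and its parameters `separators` and `to_upper` are assumptions made for this example, and theme analysis is not modeled.

```python
import re

def tokenize(text, separators=" \t\n", to_upper=True):
    """Split text on any separator character, optionally uppercasing tokens."""
    # Build a character class from the separator characters.
    pattern = "[" + re.escape(separators) + "]+"
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.upper() for t in tokens] if to_upper else tokens

print(tokenize("San Francisco\tis in  California"))
```

Changing `separators` or `to_upper` changes the tokens produced from the same source text, which is why the same settings should be used consistently for a given column.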
Text must undergo a transformation process before it can be mined.
After the data has been properly transformed, a case table can be used for building, testing, or scoring data mining models. The case table must be a relational table. It cannot be created as a view.
A Source table for Oracle Data Mining can include one or more columns of text. A text column cannot be used as a target.
The following Oracle Data Mining algorithms support text:
Anomaly Detection (one-class Support Vector Machine)
Classification algorithms: Naive Bayes, Generalized Linear Models, and Support Vector Machine
Decision Tree when you connect to Oracle Database 12c or later
Clustering algorithms: k-Means and Expectation Maximization
Feature Extraction algorithms: Nonnegative Matrix Factorization, Singular Value Decomposition, and Principal Components Analysis
Regression algorithms: Generalized Linear Models and Support Vector Machine
Note:
These algorithms do not support text:
O-Cluster
Decision Tree when you connect to Oracle Database 11g
Association (Apriori)
Any text attributes are automatically filtered out for model builds when you use O-Cluster or Decision Tree connected to Oracle Database 11g.
Data preparation for text depends on the version of Oracle Database that you connect to.
In Oracle Data Mining 12c Release 1 (12.1) and later, if unstructured text data is present, then text processing includes text transformation before text mining.
Oracle Data Mining includes significant enhancements in text processing that simplify the data mining process (model build, deployment, and scoring) when unstructured text data is present in the input. Some points about unstructured text and text transformation:
Unstructured text includes data items such as web pages, document libraries, Microsoft PowerPoint presentations, product specifications, email messages, comment fields in reports, and call center notes.
CLOB columns and long VARCHAR2 columns are automatically interpreted as unstructured text by Oracle Data Mining.
Columns of short VARCHAR2, CHAR, BLOB, and BFILE can be specified as unstructured text.
To transform unstructured text for mining, Oracle Data Mining uses Oracle Text utilities and term weighting strategies.
Text terms are extracted and given numeric values in a text index.
The text transformation process is configurable for models and individual attributes. You can specify data preparation for text nodes when you define a model node.
After text transformation, the text can be mined with a data mining algorithm.
Note:
If you connect to Oracle Database 12c Release 1 or later, then it is not always necessary to use the Text nodes (Apply Text, Build Text, and Text Reference).
In Oracle Data Mining 11g Release 2 (11.2) and earlier, text must undergo Feature Extraction and Text preparation processes before it can be mined.
The processes are:
Extraction or Feature Extraction: This is a special preprocessing step, where the text is broken down into units (terms) that can be mined. Text terms can be keywords or other document-derived features.
Text preparation: Text preparation uses a Build Text node to transform text columns. Build Text does not support HTML or XML documents. It also does not support any binary data types.
Oracle Data Miner uses the facilities of Oracle Text to preprocess text columns.
Note:
You must preprocess text using the Text nodes (Apply Text, Build Text, and Text Reference).
The Apply Text node enables you to apply existing text transformations from either a Build Text node or from a Text Node to new data.
This ensures that the apply data is transformed in the same way that the build data was transformed.
Apply Text can run in parallel.
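The idea of reusing a build-time transformation on new data can be sketched in Python. This is a conceptual illustration only, not how the Apply Text node is implemented; the functions `build_transform` and `apply_transform` and the simple term-count representation are assumptions made for this example.

```python
def build_transform(build_docs, stoplist):
    """Derive a vocabulary (term -> column index) from the build data."""
    terms = sorted({w for doc in build_docs for w in doc.lower().split()
                    if w not in stoplist})
    return {term: i for i, term in enumerate(terms)}

def apply_transform(doc, vocabulary):
    """Count only the terms that the build step already knows about."""
    counts = [0] * len(vocabulary)
    for w in doc.lower().split():
        if w in vocabulary:
            counts[vocabulary[w]] += 1
    return counts

# The vocabulary is fixed at build time and reused unchanged at apply time.
vocabulary = build_transform(["the cat sat", "the dog sat down"], {"the"})
apply_row = apply_transform("the cat chased the dog", vocabulary)
print(vocabulary, apply_row)
```

Terms that did not occur in the build data (here, "chased") are ignored at apply time, so the apply data has exactly the same columns as the build data.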
Note:
All models in the node must have the same case ID.
You create an Apply Text node to apply existing text transformations from either a Build Text node or from a Text Node to new data.
The Edit Apply Text Node dialog box enables you to view the text transformations performed on the Build data.
The Apply data must be prepared in the same way as the Build data.
To open the Edit Apply Text Node dialog box, right-click the Apply Text node and select Edit, or double-click the node.
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Apply Text node Properties pane has the following sections:
In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
You can perform the following tasks:
Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.
Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.
The data is sampled to support data analysis.
The default is to use a sample. The Sample tab has the following selections:
Use All Data: By default, Use All Data is deselected.
Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.
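The two sampling size options can be sketched in Python. This is an illustration under stated assumptions: the function `sample_rows`, its `mode` and `value` parameters, and the use of simple random sampling are hypothetical, because this section does not describe how Oracle Data Miner selects the sampled rows.

```python
import random

def sample_rows(rows, mode="rows", value=2000, seed=0):
    """Sample by a fixed number of rows or by a percentage of the input."""
    if mode == "rows":
        k = min(value, len(rows))            # never more rows than exist
    elif mode == "percent":
        k = round(len(rows) * value / 100)   # for example, 60 percent
    else:
        raise ValueError("mode must be 'rows' or 'percent'")
    # Assumption: simple random sampling without replacement.
    return random.Random(seed).sample(rows, k)

data = list(range(5000))
by_rows = sample_rows(data)                               # Number of Rows, 2000
by_percent = sample_rows(data, mode="percent", value=60)  # Percent, 60
print(len(by_rows), len(by_percent))
```

With 5000 input rows, the Number of Rows setting yields 2000 rows and the Percent setting yields 3000 rows.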
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
To view the Apply Text node context menu, right-click the node. The following options are available in the context menu:
Edit. Opens the Edit Apply Text Node dialog box.
Performance Settings. This opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
Show Runtime Errors. Displayed if there are errors.
Show Validation Errors. Displayed if there are validation errors.
Build Text node prepares a data source that has one or more Text columns.
You can use the data to build models.
Build Text can run in parallel.
The Build Text node enables you to define a text transformation for each text column.
You can use the transformed columns to build models using any algorithm that supports text.
Note:
O-Cluster does not support text. Decision Tree supports text only when you connect to Oracle Database 12c or later.
A Build Text node builds one model using the NMF algorithm by default. The transformed column or columns are passed to subsequent nodes and the non-transformed columns are not passed on.
All models in the node have the same case ID.
You create a Build Text node to prepare a data source that has one or more Text columns.
The Edit Build Text Node dialog box enables you to define transformations for text columns. The transformed text columns can be used in data mining.
To open the Edit Build Text Node dialog box:
In the View Text Transform dialog box, you can view the output sample of the tokens or themes for a sample of attributes.
To view the effects of a text transformation:
You can add and edit text related transformation settings in the Add/Edit Text Transform dialog box.
The Add/Edit Text Transform dialog box can be opened from the Edit Build Text Node dialog box. To open or edit a text transform, click . The default values for the transformation are as follows:
Source Column: This is the name of the column to be transformed.
Transform Type: This is either Token (the default) or Theme.
Output Column: This is the name of the new column. The default name is the source column name with either TOK (for Token) or THM (for Theme) appended, depending on the transformation type. To specify the output column name, deselect Automatic and enter a name in the Output Column field.
In the Settings section, specify characteristics of the text and the transform:
Language: Select any one of the following options:
Single Language: By default, a single language is specified. English is the default language. You can select a different language.
Multiple Language: Select this option to specify multiple languages. For example, to specify Single Byte languages, such as Arabic, Turkish, Thai, and European languages, select them from the Single Byte list. To specify Multibyte languages, such as Chinese (simplified or traditional), Japanese, or Korean, select them from the Multibyte list.
Stoplist: Oracle Text provides default stoplists for several single languages. If there is a default stoplist, then it is selected. For several languages, the default is no stoplist. You can select any stoplist that was previously created for this attribute from the drop-down list. You can perform the following tasks:
Edit a Stoplist: To edit a stoplist, click . The Stoplist Editor opens.
Add a Stoplist: To add a stoplist, click . The Stoplist Editor opens.
Token: If you select Token, then the defaults are:
Maximum number per document: 50 (default)
Maximum number across all documents: 3000 (default)
You can change these values. The tokens per document and across all documents cutoffs are for rankings, not for an absolute count of tokens. You could have more than 3000 tokens across all documents if there were ties.
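A ranking cutoff that admits ties, as described above, can be sketched in Python. The function `top_terms_with_ties` is hypothetical, and ranking by raw count is an assumption; this section does not state which score Oracle Data Miner uses for the ranking.

```python
from collections import Counter

def top_terms_with_ties(counts, max_terms):
    """Keep every term that scores at least as high as the max_terms-th ranked term."""
    if max_terms <= 0 or not counts:
        return []
    # Rank by descending count; break ties alphabetically for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if len(ranked) <= max_terms:
        return [term for term, _ in ranked]
    cutoff = ranked[max_terms - 1][1]
    return [term for term, count in ranked if count >= cutoff]

counts = Counter("a a a b b c c d".split())
selected = top_terms_with_ties(counts, 2)
print(selected)
```

Here the cutoff is 2, but three terms are selected because "b" and "c" are tied at the second rank. The same effect explains how more than 3000 tokens can be retained across all documents.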
Theme: If you select Theme, then the defaults are:
Maximum number per document: 50 (default)
Maximum number across all documents: 3000 (default)
You can change these values. The themes per document and across all documents cutoffs are for rankings, not for an absolute count of themes. You could have more than 3000 themes across all documents if there were ties.
Theme includes a Theme Type specification. The default is Single. You can select Full.
Frequency: The default is Term Frequency. You can select Term Frequency IDF.
Note:
Frequency is a sticky setting. If you change it, then the changed value becomes the default.
Term Frequency uses the term frequency in the document itself. It does not take collection information into account.
Term Frequency IDF is the traditional TF-IDF. It takes into account information from the document (Term Frequency) and collection-level information (IDF plus the terms to use if a maximum overall number of terms for the collection is set).
TF-IDF (Term Frequency–Inverse Document Frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection.
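The weighting described above can be sketched in Python using the common textbook form, weight = tf * log(N / df), where tf is the count of the term in the document, N is the number of documents, and df is the number of documents that contain the term. The exact formula that Oracle Text applies is not given in this section, so the function `tf_idf` below is an assumption for illustration only.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = ["oracle text index", "oracle data miner", "text text mining"]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

In the first document, "index" receives a higher weight than "oracle" because "oracle" also appears in another document. A term that appears in every document of the collection receives a weight of zero under this form.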
In the Stoplist Editor, you can either edit an existing stoplist, or you can create a new stoplist. Stoplists are shared among all workflows.
You can edit any stoplist in this dialog box, not just the ones associated with transformations defined in this node.
To access the stoplist editor, open the Edit Build Text Node by double-clicking a Build Text node. To view, edit, and create a stoplist:
By using the New Stoplist Editor wizard, you can create new stoplists, edit stoplists, and combine stoplists.
You can perform the following tasks:
Create new stoplists. To create a stoplist, click . The New Stoplist Wizard starts. The wizard has two steps:
Stoplist Definition
Review
Remove words from an existing stoplist.
Combine several stoplists to create a new one. For example, if the document is in both French and English, then you can combine the French and English stoplists.
Create an empty stoplist to which you must add all stopwords and stopthemes.
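Combining stoplists amounts to taking the union of their entries. The following Python sketch illustrates the idea; the function `combine_stoplists` is hypothetical, the English words come from the stopword examples in this document, and the French words are illustrative entries rather than the contents of the Oracle Text default French stoplist.

```python
def combine_stoplists(*stoplists):
    """Return the union of several stoplists as a sorted list without duplicates."""
    combined = set()
    for stoplist in stoplists:
        combined.update(word.lower() for word in stoplist)
    return sorted(combined)

english = ["a", "the", "this", "with"]  # English examples from this document
french = ["le", "la", "les", "et"]      # hypothetical French entries
combined = combine_stoplists(english, french)
print(combined)
```

A word that appears in more than one source stoplist is stored only once in the combined stoplist.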
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Build Text node Properties pane has the following sections:
In the Transforms tab, you can view and edit transformations defined in the Edit Build Text Node dialog box.
The data is sampled to support data analysis.
The default is to use a sample. The Sample tab has the following selections:
Use All Data: By default, Use All Data is deselected.
Sampling Size: The default is Number of Rows with a value of 2000. You can change sampling size to Percent. The default is 60 percent.
The Cache section provides the option to generate a cache for output data. You can change this default using the transform preference.
You can perform the following tasks:
Generate Cache of Output Data to Optimize Viewing of Results: Select this option to generate a cache. The default setting is to not generate a cache.
Sampling Size: You can select caching or override the default settings. The default sampling size is Number of Rows, with a default value of 2000.
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
Right-click a Build Text node. The following options are available in the context menu:
Edit. Opens the Edit Build Text Node dialog box.
Performance Settings. This opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
Show Runtime Errors. Displayed if there are errors.
Show Validation Errors. Displayed if there are validation errors.
A Text Reference node enables you to reference text transformations defined in a Build Text node in the current workflow or in a different workflow.
For example, if you have one workflow that builds a Text model (that is, a workflow that includes a Build Text node) and you want to create a separate workflow that applies the model created in the first workflow, then you can use a Text Reference to provide the text transformation information required by Apply Text.
You create a Text Reference node to reference text transformations that are defined in a Build Text node in the current workflow or in a different workflow.
The Edit Text Reference Node dialog box enables you to select a Build Text node so that you can use its transformations in the current location in the current workflow.
To open the Edit Text Reference Node:
In the Select Build Text Node dialog box, you can select a Build Text node that is either in the current workflow (the default) or in all workflows.
Show specifies the list of Build Text nodes to select from.
Note:
You cannot select a Text node that is not complete.
In the Properties pane, you can examine and change the characteristics or properties of a node.
To view the properties of a node, click the node and click Properties. If the Properties pane is closed, then go to View and click Properties. Alternatively, right-click the node and click Go to Properties.
The Text Reference node Properties pane has the following sections:
The Transforms dialog box for the Text Reference node displays the transformation-related information selected in the Edit Text Reference Node dialog box.
You can select a different Build Text node from the Properties pane.
The context menu options depend on the type of the node. The menu provides shortcuts to perform various tasks and view information related to the node.
Right-click a Text Reference node. The following selections are displayed:
Edit. Opens the Edit Text Reference Node dialog box.
Performance Settings. This opens the Edit Selected Node Settings dialog box, where you can set Parallel Settings and In-Memory settings for the node.
Show Runtime Errors. Displayed if there are errors.
Show Validation Errors. Displayed if there are validation errors.