3 Oracle Data Mining Basics
Understand the basic concepts of Oracle Data Mining.
3.1 Mining Techniques
Introduces the concept of data mining techniques.
A basic understanding of data mining techniques and algorithms is required for using Oracle Data Mining.
Each data mining technique specifies a class of problems that can be modeled and solved. Data mining techniques fall generally into two categories: supervised and unsupervised. Notions of supervised and unsupervised learning are derived from the science of machine learning, which has been called a sub-area of artificial intelligence.
Artificial intelligence refers to the implementation and study of systems that exhibit autonomous intelligence or behavior of their own. Machine learning deals with techniques that enable devices to learn from their own performance and modify their own functioning. Data mining applies machine learning concepts to data.
3.1.1 Supervised Data Mining
Supervised learning is also known as directed learning. The learning process is directed by a previously known dependent attribute or target. Directed data mining attempts to explain the behavior of the target as a function of a set of independent attributes or predictors.
Supervised learning generally results in predictive models. This is in contrast to unsupervised learning where the goal is pattern detection.
The building of a supervised model involves training, a process whereby the software analyzes many cases where the target value is already known. In the training process, the model "learns" the logic for making the prediction. For example, a model that seeks to identify the customers who are likely to respond to a promotion must be trained by analyzing the characteristics of many customers who are known to have responded or not responded to a promotion in the past.
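As a minimal sketch of what training can look like, the following PL/SQL builds a classification model with DBMS_DATA_MINING.CREATE_MODEL. It assumes a build view named mining_data_build_v with a cust_id case identifier and an affinity_card target, as in the examples later in this chapter. The model name and the settings table name are hypothetical and chosen only for illustration.

```sql
-- Settings table (hypothetical name) that tells CREATE_MODEL which algorithm to use.
CREATE TABLE promo_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Select Support Vector Machine as the classification algorithm.
  INSERT INTO promo_model_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.algo_name,
          dbms_data_mining.algo_support_vector_machines);
  COMMIT;
END;
/

BEGIN
  -- Train on cases whose target value (affinity_card) is already known.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'PROMO_RESPONSE_MODEL',   -- hypothetical model name
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'promo_model_settings');
END;
/
```

The target column supplies the known outcomes; every other column in the build data, apart from the case identifier, is treated as a predictor.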
3.1.1.1 Supervised Learning: Testing
The process of applying the model to test data helps to determine whether the model, built on one chosen sample, is generalizable to other data. In other words, the model scores test data whose target values are already known, and the predictions are compared with those known values.
In particular, it helps to avoid the phenomenon of overfitting, which can occur when the logic of the model fits the build data too well and therefore has little predictive power.
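One way to carry out such a test is sketched below. It assumes a held-out view named mining_data_test_v (an assumption; this chapter does not name a test view) that has the same columns as the build data, including the known affinity_card values, and it reuses the svmc_sh_clas_sample model from the examples in this chapter.

```sql
-- Confusion matrix: actual versus predicted target values on the test data.
SELECT affinity_card                           AS actual_value,
       PREDICTION(svmc_sh_clas_sample USING *) AS predicted_value,
       COUNT(*)                                AS cases
FROM   mining_data_test_v
GROUP  BY affinity_card, PREDICTION(svmc_sh_clas_sample USING *)
ORDER  BY actual_value, predicted_value;

-- Overall accuracy: the fraction of test cases predicted correctly.
SELECT ROUND(AVG(CASE
                   WHEN PREDICTION(svmc_sh_clas_sample USING *) = affinity_card
                   THEN 1 ELSE 0
                 END), 4) AS accuracy
FROM   mining_data_test_v;
```

Running the accuracy query against both the build data and the test data gives a rough overfitting check: a markedly higher figure on the build data suggests that the model does not generalize well.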
3.1.1.2 Supervised Learning: Scoring
Apply data, also called scoring data, is the actual population to which a model is applied. For example, you might build a model that identifies the characteristics of customers who frequently buy a certain product. To obtain a list of customers who shop at a certain store and are likely to buy a related product, you might apply the model to the customer data for that store. In this case, the store customer data is the scoring data.
Most supervised learning can be applied to a population of interest. The principal supervised mining techniques, Classification and Regression, can both be used for scoring.
Oracle Data Mining does not support the scoring operation for Attribute Importance, another supervised technique. Models of this type are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An attribute importance model returns and ranks the attributes that are most important in predicting a target value.
Oracle Data Mining supports the supervised data mining techniques described in the following table:
Table 3-1 Oracle Data Mining Supervised Techniques
Technique | Description | Sample Problem |
---|---|---|
Attribute Importance | Identifies the attributes that are most important in predicting a target attribute | Given customer response to an affinity card program, find the most significant predictors |
Classification | Assigns items to discrete classes and predicts the class to which an item belongs | Given demographic data about a set of customers, predict customer response to an affinity card program |
Regression | Approximates and forecasts continuous values | Given demographic and purchasing data about a set of customers, predict customers' age |
3.1.2 Unsupervised Data Mining
Overview of unsupervised data mining.
Unsupervised learning is non-directed. There is no distinction between dependent and independent attributes. There is no previously-known result to guide the algorithm in building the model.
Unsupervised learning can be used for descriptive purposes. It can also be used to make predictions.
3.1.2.1 Unsupervised Learning: Scoring
Introduces unsupervised learning, supported scoring operations, and unsupervised Oracle Data Mining techniques.
Although unsupervised data mining does not specify a target, most unsupervised learning can be applied to a population of interest. For example, clustering models use descriptive data mining techniques, but they can be applied to classify cases according to their cluster assignments. Anomaly detection, although unsupervised, is typically used to predict whether a data point is typical among a set of cases.
Oracle Data Mining supports the scoring operation for Clustering and Feature Extraction, both unsupervised mining techniques. Oracle Data Mining does not support the scoring operation for Association Rules, another unsupervised technique. Association models are built on a population of interest to obtain information about that population; they cannot be applied to separate data. An association model returns rules that explain how items or events are associated with each other. The association rules are returned with statistics that can be used to rank them according to their probability.
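As an illustration of scoring with an unsupervised model, the query below assigns each case to its most likely cluster. It is a sketch that assumes a clustering model named km_sh_clus_sample already exists (a hypothetical name) and that mining_data_apply_v holds the cases to score.

```sql
-- Assign each customer to the most probable cluster and report that probability.
SELECT cust_id,
       CLUSTER_ID(km_sh_clus_sample USING *)          AS assigned_cluster,
       CLUSTER_PROBABILITY(km_sh_clus_sample USING *) AS cluster_prob
FROM   mining_data_apply_v
ORDER  BY cust_id;
```

No target column is involved: the cluster assignment itself serves as the classification of each case.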
Oracle Data Mining supports the unsupervised techniques described in the following table:
Table 3-2 Oracle Data Mining Unsupervised Techniques
3.2 Algorithms
An algorithm is a mathematical procedure for solving a specific kind of problem. For some techniques, you can choose among several algorithms.
Each algorithm produces a specific type of model, with different characteristics. Some data mining problems can best be solved by using more than one algorithm in combination. For example, you might first use a feature extraction model to create an optimized set of predictors, then a classification model to make a prediction on the results.
3.2.1 Oracle Data Mining Supervised Algorithms
Oracle Data Mining supports the supervised data mining algorithms described in the following table. The algorithm abbreviations are used throughout this manual.
Table 3-3 Oracle Data Mining Algorithms for Supervised Techniques
3.2.2 Oracle Data Mining Unsupervised Algorithms
Learn about unsupervised algorithms that Oracle Data Mining supports.
Oracle Data Mining supports the unsupervised data mining algorithms described in the following table. The algorithm abbreviations are used throughout this manual.
Table 3-4 Oracle Data Mining Algorithms for Unsupervised Techniques
3.3 Data Preparation
Preparing the data is a valuable step in solving data mining problems.
The quality of a model depends to a large extent on the quality of the data used to build (train) it. Much of the time spent in any given data mining project is devoted to data preparation. The data must be carefully inspected, cleansed, and transformed, and algorithm-appropriate data preparation methods must be applied.
The process of data preparation is further complicated by the fact that any data to which a model is applied, whether for testing or for scoring, must undergo the same transformations as the data used to train the model.
3.3.1 Oracle Data Mining for SQL Simplifies Data Preparation
Learn about various features of Oracle Data Mining for SQL for data preparation.
Oracle Data Mining offers several features that significantly simplify the process of data preparation:
- Embedded data preparation: The transformations used in training the model are embedded in the model and automatically run whenever the model is applied to new data. If you specify transformations for the model, you only have to specify them once.
- Automatic Data Preparation (ADP): Oracle Data Mining for SQL supports an automated data preparation mode. When ADP is active, Oracle Data Mining for SQL automatically performs the data transformations required by the algorithm. The transformation instructions are embedded in the model along with any user-specified transformation instructions.
- Automatic management of missing values and sparse data: Oracle Data Mining for SQL uses consistent methodology across data mining algorithms to handle sparsity and missing values.
- Transparency: Oracle Data Mining for SQL provides model details, which are a view of the attributes that are internal to the model. This insight into the inner details of the model is possible because of reverse transformations, which map the transformed attribute values to a form that can be interpreted by a user. Where possible, attribute values are reversed to the original column values. Reverse transformations are also applied to the target of a supervised model, thus the results of scoring are in the same units as the units of the original target.
- Tools for custom data preparation: Oracle Data Mining for SQL provides many common transformation routines in the DBMS_DATA_MINING_TRANSFORM PL/SQL package. You can use these routines, or develop your own routines in SQL, or both. The SQL language is well suited for implementing transformations in the database. You can use custom transformation instructions along with ADP or instead of ADP.
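The ADP item above can be sketched as a single model setting. The following assumes a hypothetical settings table named adp_demo_settings; the table would then be passed to DBMS_DATA_MINING.CREATE_MODEL through its settings_table_name parameter.

```sql
-- Settings table (hypothetical name) for a model build.
CREATE TABLE adp_demo_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Explicitly request Automatic Data Preparation for the model.
  INSERT INTO adp_demo_settings (setting_name, setting_value)
  VALUES (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/
```

Because the resulting transformations are embedded in the model, scoring queries can read the untransformed columns directly.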
3.3.2 Case Data
Learn the importance of case data in data mining.
Most data mining algorithms act on single-record case data, where the information for each case is stored in a separate row. The data attributes for the cases are stored in the columns.
When the data is organized in transactions, the data for one case (one transaction) is stored in many rows. An example of transactional data is market basket data. With the single exception of Association Rules, which can operate on native transactional data, Oracle Data Mining for SQL algorithms require single-record case organization.
3.3.2.1 Nested Data
Learn how nested columns are treated in Oracle Data Mining for SQL.
Oracle Data Mining supports attributes in nested columns. A transactional table can be cast as a nested column and included in a table of single-record case data. Similarly, star schemas can be cast as nested columns. With nested data transformations, Oracle Data Mining for SQL can effectively mine data originating from multiple sources and configurations.
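The cast described above might look like the following sketch. It assumes a hypothetical transactional table sales_trans_cust with one row per customer and product (columns cust_id and prod_name), and collapses it to one row per customer with a nested column of type DM_NESTED_NUMERICALS.

```sql
-- One row per customer; each purchased product becomes a nested
-- attribute named after the product, with the value 1.
CREATE OR REPLACE VIEW sales_trans_cust_nested AS
SELECT cust_id,
       CAST(COLLECT(DM_NESTED_NUMERICAL(prod_name, 1))
            AS DM_NESTED_NUMERICALS) AS custprods
FROM   sales_trans_cust
GROUP  BY cust_id;
```

The view can then be joined to other single-record case data on cust_id, so that the nested purchases are mined alongside ordinary columns.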
3.3.3 Text Data
Prepare and transform unstructured text data for data mining.
Oracle Data Mining for SQL interprets CLOB columns and long VARCHAR2 columns automatically as unstructured text. Additionally, you can specify columns of short VARCHAR2, CHAR, BLOB, and BFILE as unstructured text. Unstructured text includes data items such as web pages, document libraries, PowerPoint presentations, product specifications, emails, comment fields in reports, and call center notes.
Oracle Data Mining uses Oracle Text utilities and term weighting strategies to transform unstructured text for analysis. In text transformation, text terms are extracted and given numeric values in a text index. The text transformation process is configurable for the model and for individual attributes. Once transformed, the text can be mined with an Oracle Data Mining algorithm.
3.4 In-Database Scoring
Scoring is the application of a data mining algorithm to new data. In Oracle Data Mining for SQL, the scoring engine and the data both reside within the database.
In traditional data mining, models are built using specialized software on a remote system and deployed to another system for scoring. This is a cumbersome, error-prone process open to security violations and difficulties in data synchronization.
With Oracle Data Mining, scoring is simple and secure. The scoring engine and the data both reside within the database. Scoring is an extension to the SQL language, so the results of data mining can easily be incorporated into applications and reporting systems.
3.4.1 Parallel Execution and Ease of Administration
All Oracle Data Mining for SQL scoring routines support parallel execution for scoring large data sets.
In-database scoring provides performance advantages. All Oracle Data Mining for SQL scoring routines support parallel execution, which significantly reduces the time required for executing complex queries and scoring large data sets.
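As a rough illustration, a scoring query can request parallel execution with a statement-level hint. The degree of parallelism (4) is an arbitrary assumption, and whether the statement actually runs in parallel depends on how the database is configured.

```sql
-- Count the customers predicted to use an affinity card,
-- requesting a degree of parallelism of 4 for the statement.
SELECT /*+ PARALLEL(4) */
       COUNT(*) AS likely_card_users
FROM   mining_data_apply_v
WHERE  PREDICTION(svmc_sh_clas_sample USING *) = 1;
```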
In-database data mining minimizes the IT effort needed to support Oracle Data Mining initiatives. Using standard database techniques, models can easily be refreshed (re-created) on more recent data and redeployed. The deployment is immediate since the scoring query remains the same; only the underlying model is replaced in the database.
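A refresh of this kind can be sketched as dropping the model and re-creating it under the same name. All names below are hypothetical: the example assumes a model called PROMO_RESPONSE_MODEL, an existing settings table promo_model_settings, and a view mining_data_build_recent_v that exposes the more recent build data.

```sql
BEGIN
  -- Remove the existing model.
  DBMS_DATA_MINING.DROP_MODEL('PROMO_RESPONSE_MODEL');

  -- Re-create it under the same name on more recent data.
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'PROMO_RESPONSE_MODEL',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_recent_v',  -- hypothetical view
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'promo_model_settings');       -- assumed to exist
END;
/
```

Scoring queries refer to the model only by name, so they continue to work unchanged once the new model is in place.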
3.4.2 SQL Functions for Model Apply and Dynamic Scoring
In Oracle Data Mining, scoring is performed by SQL language functions. Learn the different ways to invoke these functions for scoring.
The functions perform prediction, clustering, and feature extraction. The functions can be invoked in two different ways: by applying a mining model object (Example 3-1), or by executing an analytic clause that computes the mining analysis dynamically and applies it to the data (Example 3-2). Dynamic scoring, which eliminates the need for a model, can supplement, or even replace, the more traditional data mining methodology described in "The Data Mining Process".
In Example 3-1, the PREDICTION_PROBABILITY function applies the model svmc_sh_clas_sample, created in Example 2-1, to score the data in mining_data_apply_v. The query returns the ten customers in Italy who are most likely to use an affinity card.
In Example 3-2, the functions PREDICTION and PREDICTION_PROBABILITY use the analytic syntax (the OVER () clause) to dynamically score the data in mining_data_build_v. The query returns the customers who currently do not have an affinity card, along with the probability that they are likely to use one.
Example 3-1 Applying a Mining Model to Score Data
    SELECT cust_id FROM
      (SELECT cust_id,
              rank() over (order by PREDICTION_PROBABILITY(svmc_sh_clas_sample, 1
                           USING *) DESC, cust_id) rnk
       FROM mining_data_apply_v
       WHERE country_name = 'Italy')
    WHERE rnk <= 10
    ORDER BY rnk;

       CUST_ID
    ----------
        101445
        100179
        100662
        100733
        100554
        100081
        100344
        100324
        100185
        101345
Example 3-2 Executing an Analytic Function to Score Data
    SELECT cust_id, pred_prob FROM
      (SELECT cust_id, affinity_card,
              PREDICTION(FOR TO_CHAR(affinity_card) USING *) OVER () pred_card,
              PREDICTION_PROBABILITY(FOR TO_CHAR(affinity_card),1 USING *) OVER () pred_prob
       FROM mining_data_build_v)
    WHERE affinity_card = 0
    AND pred_card = 1
    ORDER BY pred_prob DESC;

       CUST_ID PRED_PROB
    ---------- ---------
        102434       .96
        102365       .96
        102330       .96
        101733       .95
        102615       .94
        102686       .94
        102749       .93
        .
        .
        .
        101656       .51
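Dynamic scoring is not limited to prediction. As a sketch, the analytic form of CLUSTER_ID can segment the data without a pre-built model; the choice of four clusters and the column alias are assumptions made for illustration.

```sql
-- Divide the customers into four segments on the fly, without a stored model.
SELECT cust_id,
       CLUSTER_ID(INTO 4 USING *) OVER () AS customer_segment
FROM   mining_data_apply_v
ORDER  BY cust_id;
```

Here the INTO clause takes the place of the FOR clause used in Example 3-2, because clustering has no target to predict.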