Oracle Data Mining Concepts
10g Release 1 (10.1)

Part Number B10698-01

3 Predictive Data Mining Models

This chapter describes the predictive models, that is, the supervised learning functions. These functions predict a target value. The Oracle Data Mining Java interface supports the following predictive functions and associated algorithms:

Function                              Algorithm
Classification (Section 3.1)          Naive Bayes (Section 3.1.3)
                                      Adaptive Bayes Network (Section 3.1.4)
                                      Support Vector Machine (Section 3.1.6)
Regression (Section 3.2)              Support Vector Machine (Section 3.2.1)
Attribute Importance (Section 3.3)    Minimum Description Length (Section 3.3.1)

This chapter also describes ODM Model Seeker (Section 3.4), which builds several Naive Bayes and Adaptive Bayes Network models and selects the best one.

3.1 Classification

In a classification problem, you typically have historical data (labeled examples) and unlabeled examples. Each labeled example consists of multiple predictor attributes and one target attribute (dependent variable). The value of the target attribute is a class label. The unlabeled examples consist of the predictor attributes only. The goal of classification is to construct a model using the historical data that accurately predicts the label (class) of the unlabeled examples.

A classification task begins with build data (also known as training data) for which the target values (or class assignments) are known. Different classification algorithms use different techniques to find relations between the values of the predictor attributes and the values of the target attribute in the build data. These relations are summarized in a model, which can then be applied to new cases with unknown target values to predict those values. A classification model can also be applied to data with known target values, to compare the predictions to the known answers; such data is known as test data or evaluation data. This technique, called testing a model, measures the model's predictive accuracy. The application of a classification model to new data is called applying the model, and the data is called apply data or scoring data. Applying a model to data is often called scoring the data.

Classification is used in customer segmentation, business modeling, credit analysis, and many other applications. For example, a credit card company may wish to predict which customers will default on their payments. Each customer corresponds to a case; data for each case might consist of a number of attributes that describe the customer's spending habits, income, demographic attributes, etc. These are the predictor attributes. The target attribute indicates whether or not the customer has defaulted; that is, there are two possible classes, corresponding to having defaulted or not. The build data is used to build a model that you then use to predict, for new cases, whether these new customers are likely to default.

3.1.1 Costs

In a classification problem, it may be important to specify the costs involved in making an incorrect decision. Doing so can be useful when the costs of different misclassifications vary significantly.

For example, suppose the problem is to predict whether a user will respond to a promotional mailing. The target has two categories: YES (the customer responds) and NO (the customer does not respond). Suppose a positive response to the promotion generates $500 and that it costs $5 to do the mailing. If the model predicts YES and the actual value is YES, the cost of misclassification is $0. If the model predicts YES and the actual value is NO, the cost of misclassification is $5. If the model predicts NO and the actual value is YES, the cost of misclassification is $500. If the model predicts NO and the actual value is NO, the cost is $0.

The row indexes of a cost matrix correspond to actual values; the column indexes correspond to predicted values. For any pair of actual/predicted indexes, the value indicates the cost of misclassification.

Classification algorithms apply the cost matrix to the predicted probabilities during scoring to estimate the least expensive prediction. If a cost matrix is specified for apply, the output of the scoring run is prediction and cost, rather than prediction and probability.
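
To illustrate the mechanics (a Java sketch, not the ODM API; the predicted probabilities are hypothetical and the costs follow the mailing example above), the expected cost of each candidate prediction is the probability-weighted sum of misclassification costs, and the prediction with the smallest expected cost is returned:

    // Choosing the least expensive prediction from predicted probabilities
    // and a cost matrix. costs[actual][predicted] follows the convention
    // above: rows are actual values, columns are predicted values.
    public class CostMatrixExample {
        public static void main(String[] args) {
            // Target classes: index 0 = YES (responds), 1 = NO (does not respond)
            double[][] costs = {
                { 0.0, 500.0 },  // actual YES: predicting NO forfeits the $500 response
                { 5.0,   0.0 }   // actual NO:  predicting YES wastes the $5 mailing
            };
            double[] probs = { 0.10, 0.90 };  // predicted probabilities for one case

            int best = -1;
            double bestCost = Double.POSITIVE_INFINITY;
            for (int predicted = 0; predicted < probs.length; predicted++) {
                double expected = 0.0;
                for (int actual = 0; actual < probs.length; actual++) {
                    expected += probs[actual] * costs[actual][predicted];
                }
                if (expected < bestCost) { bestCost = expected; best = predicted; }
            }
            // Even at only a 10% chance of YES, predicting YES costs
            // 0.90 * $5 = $4.50, while predicting NO costs 0.10 * $500 = $50,
            // so YES is the least expensive prediction.
            System.out.println("Least expensive prediction: class " + best
                    + " (expected cost $" + bestCost + ")");
        }
    }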

3.1.2 Priors

In building a classification model, priors can be useful when the training data does not accurately reflect the real underlying population. A priors vector informs the model of the true distribution of target classes in the underlying population. The model build then adjusts its predicted probabilities (for Adaptive Bayes Network and Naive Bayes) or its relative complexity factor (for Support Vector Machine).

3.1.3 Naive Bayes Algorithm

The Naive Bayes algorithm (NB) can be used for both binary and multiclass classification problems to answer questions such as "Which customers will switch to a competitor? Which transaction patterns suggest fraud? Which prospects will respond to an advertising campaign?" For example, suppose a bank wants to promote its mortgage offering to its current customers and that, to reduce promotion costs, it wants to target the most likely prospects. The bank has historical data for its customers, including income, number of household members, money-market holdings, and information on whether a customer has recently obtained a mortgage through the bank. Using NB, the bank can predict how likely a customer is to respond positively to a mortgage offering. With this information, the bank can reduce its promotion costs by restricting the promotion to the most likely candidates.

NB affords fast model building and scoring for relatively low volumes of data.

NB makes predictions using Bayes' Theorem, which derives the probability of a prediction from the underlying evidence. Bayes' Theorem states:

P(A | B) = (P(B | A) P(A))/P(B)

That is, the probability of event A occurring given that event B has occurred is equal to the probability of event B occurring given that event A has occurred, multiplied by the probability of event A occurring and divided by the probability of event B occurring.

NB assumes that each attribute is conditionally independent of the others: given a particular value of the target, the distribution of each predictor is independent of the other predictors.

In practice, this assumption, even when violated, does not degrade the model's predictive accuracy significantly, and makes the difference between a fast, computationally feasible algorithm and an intractable one.
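
To make this concrete, the following minimal Java sketch (not the ODM API; the attributes, classes, and probability values are invented for illustration) scores one case under the conditional independence assumption, where P(class | x) is proportional to P(class) times the product over attributes of P(attribute value | class):

    // Scoring one case with Naive Bayes under conditional independence.
    public class NaiveBayesSketch {
        public static void main(String[] args) {
            // Hypothetical probabilities for target = YES / NO.
            double[] prior = { 0.2, 0.8 };       // P(YES), P(NO)
            double[][] likelihood = {            // P(attribute value | class)
                { 0.7, 0.3 },  // P(income=high | YES),     P(income=high | NO)
                { 0.6, 0.4 }   // P(has_mortgage=no | YES), P(has_mortgage=no | NO)
            };

            double[] score = { prior[0], prior[1] };
            for (double[] attr : likelihood) {   // multiply in each attribute
                score[0] *= attr[0];
                score[1] *= attr[1];
            }
            double norm = score[0] + score[1];   // divide by P(evidence)
            System.out.printf("P(YES|x)=%.3f  P(NO|x)=%.3f%n",
                    score[0] / norm, score[1] / norm);
        }
    }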

Naive Bayes lets you use cross-validation to test model accuracy on the same data that was used to build the model, rather than building the model on one portion of the data and testing it on another. Not having to hold aside a portion of the data for testing is especially useful when the amount of build data is relatively small.

"Leave-one-out cross-validation" is a special case of cross-validation in which one record is left out of the build data when building a model. The number of models built equals the number of records (omitting a different build record for each model), which makes this procedure computationally expensive. With Naive Bayes models, however, the approach can be modified such that all build records are used for building a single model. Then, the model is repeatedly modified to quickly remove the effects of one build record, incrementally "unbuilding" the model for that record, as though that record had been omitted when building the model in the first place. The accuracy of the prediction for each build record can then be assessed against the model that would have been built from all the build records except that one, without having had to actually build a separate model for each build record.

To use Naive Bayes cross-validation, the user executes a CrossValidate task object, specifying that a Naive Bayes model is to be tested. The execution of the cross-validate task creates a ClassificationTestResult associated with classification test metrics.

See Table 3-1, below, for a comparison of the main characteristics of ABN and NB.

3.1.4 Adaptive Bayes Network Algorithm

Adaptive Bayes Network (ABN) is an Oracle proprietary algorithm that provides a fast, scalable, non-parametric means of extracting predictive information from data with respect to a target attribute. (Non-parametric statistical techniques avoid assuming that the population is characterized by a family of simple distributional models, such as standard linear regression, where different members of the family are differentiated by a small set of parameters.)

ABN, in single feature build mode, can describe the model in the form of human-understandable rules. The rules produced by ABN are one of its main advantages over Naive Bayes. The business user, marketing professional, or business analyst can understand the basis of the model's predictions and can therefore be comfortable acting on them and explaining them to others.

In addition to explanatory rules, ABN provides performance and scalability through a collection of user parameters that control the trade-off between accuracy and build time.

ABN predicts binary as well as multiclass targets. Binary targets are those that take on only two values, for example, buy and not buy. Multiclass targets have more than two values, for example, products purchased (product A or product B or product C). Multiclass target values are not assumed to exist in an ordered relation to each other, for example, hair brush is not assumed to be greater or less than comb.

ABN can use costs and priors for both building and scoring (see Section 3.1.1 and Section 3.1.2).

3.1.4.1 ABN Model Types

An ABN model is an adaptive conditional independence model that uses the minimum description length principle to construct and prune an array of conditionally independent Network Features. Each Network Feature consists of one or more Conditional Probability Expressions. The collection of Network Features forms a product model that provides estimates of the target class probabilities. There can be one or more Network Features. The number and depth of the Network Features in the model determine the model mode. There are three model modes for ABN:

- Pruned Naive Bayes: an array of single-attribute Network Features
- Single Feature Build: a single multi-attribute Network Feature (a simplified decision tree)
- Boosted: a collection of multi-attribute Network Features forming a product model

Users can select the ABN model type. Rules are available only for Single Feature Build; see Section 3.1.4.2 for information about rules.

Each Network Feature consists of one or more attributes included in a Conditional Probability Expression. An array of single attribute Network Features is an MDL-pruned Naive Bayes model. A single multi-attribute Network Feature model is equivalent to a simplified C4.5 decision tree; such a model is simplified in the sense that numerical attributes are binned and treated as categorical. Furthermore, a single predictor is used to split all nodes at a given tree depth. The splits are k-way, where k is the number of unique (binned) values of the splitting predictor. Finally, a collection of multi-attribute Network Features forms a product model (boosted mode). All three types provide estimates of the target class probabilities.

3.1.4.2 ABN Rules

Rules can be extracted from an Adaptive Bayes Network model as Compound Predicates. Rules form a human-interpretable depiction of the model and include statistics indicating the number of relevant training data instances that support each rule. A record apply instance specifies a pathway in a network feature, taking the form of a compound predicate.

For example, suppose the feature consists of two training attributes: Age {20-40, 40-60, 60-80} and Income {<=50K, >50K}. A record instance consisting of a person age 25 and income $42K is expressed as

IF AGE IN (20-40) and INCOME IN (<=50K)

Suppose that the associated target (for example, response to a promotion) probabilities are {0.8 (no), 0.2 (yes)}. Then we have a detailed rule of the form

IF AGE IN (20-40) and INCOME IN (<=50K) => Prob = {0.8, 0.2}

In addition to the probability distribution, there are the associated training data counts, e.g. {400, 100}. Suppose there is a cost matrix specifying that it is 6 times more costly to incorrectly predict a no than to incorrectly predict a yes. Then the cost of predicting yes for this instance is 0.8 * 1 = 0.8 (because the model is wrong in this prediction 80% of the time) and the cost of predicting no is 0.2 * 6 = 1.2. Thus, the minimum cost (best) prediction is yes.

Without the cost matrix, the decision is reversed: implicitly, all errors are equal, and we have 0.8 * 1 = 0.8 for yes and 0.2 * 1 = 0.2 for no, so the minimum cost prediction is no.

The order of the predicates in the generated rules implies relative importance.

When you apply an ABN model for which rules were generated, with a single feature, you get the same result that you would get if you wrote an external program that applied the rules.

3.1.4.3 ABN Build Parameters

To control the execution time of a build, ABN provides the following user-settable parameters:

3.1.4.4 ABN Model States

When you specify MaxBuildTime for a boosted mode ABN model, the ABN build terminates in one of the following states:

The algorithm outputs its current model state and statistics that provide an estimate of how long it would take for the model to build (and prune) a feature.

3.1.5 Comparison of NB and ABN Models

Table 3-1 compares the main characteristics of Adaptive Bayes Network and Naive Bayes.

Table 3-1 Comparison of Naive Bayes and Adaptive Bayes Network Algorithms

Feature               Naive Bayes                                                 Adaptive Bayes Network
Number of cases       Any size                                                    Any size
Number of attributes  Best if less than 200                                       Any number (built-in feature selection)
Speed                 Faster                                                      Not as fast
Accuracy              As accurate or less accurate than Adaptive Bayes Network    As accurate or more accurate than Naive Bayes
Attribute types       Numerical (binned) and categorical                          Numerical (binned) and categorical
Automatic binning     Yes                                                         Yes
Target attribute      Binary and multiclass                                       Binary and multiclass
Rules                 No                                                          Single Feature Build mode only

3.1.6 Support Vector Machine

Support Vector Machine (SVM) is a classification and regression prediction tool that uses machine learning theory to maximize predictive accuracy while automatically avoiding over-fit to the data.

Neural networks (NN) and radial basis functions (RBFs), both popular data mining techniques, can each be viewed as a special case of SVMs.

SVMs perform well with real-world applications such as classifying text, recognizing hand-written characters, classifying images, as well as bioinformatics and biosequence analysis. Their introduction in the early 1990s led to an explosion of applications and deepening theoretical analysis that established SVM along with neural networks as one of the standard tools for machine learning and data mining.

There is no upper limit on the number of attributes and target cardinality for SVMs.

The SVM kernel functions supported at this release are linear and Gaussian.

3.1.6.1 Data Preparation and Settings Choice for Support Vector Machines

You can influence both Support Vector Machine (SVM) model quality (accuracy) and performance (build time) through two basic mechanisms: data preparation and model settings. Poor settings choices or inappropriate data preparation can cause significant performance degradation; poor settings choices can also lead to inaccurate models. By default, ODM has built-in mechanisms that attempt to choose settings appropriate to the problem at hand, and it provides data normalization and explosion of categorical fields for data preparation. For some domains you may need to override the defaults or do your own data preparation. For detailed information about data preparation for SVM models, see the Oracle Data Mining Application Developer's Guide.

SVM uses z-score or min-max normalization. With z-score normalization, the transformed data for each attribute has a mean of 0 and a standard deviation of 1; values can extend beyond the range -1 to +1, and there is no special treatment for sparse data.
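
If you prepare the data yourself, z-score normalization can be computed directly. The following Java sketch (illustrative code, not an ODM transformation call; the attribute values are invented) normalizes one numeric attribute to mean 0 and standard deviation 1:

    // Z-score normalization of one numeric attribute. Values more than one
    // standard deviation from the mean fall outside the range -1 to +1.
    public class ZScoreSketch {
        public static void main(String[] args) {
            double[] income = { 24000, 42000, 55000, 61000, 150000 };

            double mean = 0.0;
            for (double v : income) mean += v;
            mean /= income.length;

            double variance = 0.0;
            for (double v : income) variance += (v - mean) * (v - mean);
            double stddev = Math.sqrt(variance / income.length);

            for (double v : income) {
                double z = (v - mean) / stddev;   // transformed value
                System.out.printf("%.0f -> %.3f%n", v, z);
            }
        }
    }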

3.2 Regression

Regression creates predictive models. The difference between regression and classification is that regression deals with numerical/continuous target attributes, whereas classification deals with discrete/categorical target attributes. In other words, if the target attribute contains continuous (floating-point) values, a regression technique is required. If the target attribute contains categorical (string or discrete integer) values, a classification technique is called for.

The most common form of regression is linear regression, in which the line that best fits the data is calculated, that is, the line that minimizes the sum of the squared vertical distances of the points from the line.

This line becomes a predictive model when the value of the dependent variable is not known; its value is predicted by the point on the line that corresponds to the values of the independent variables for that record.
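
As an illustration (a hand-rolled Java sketch, not an ODM call; the sample data is invented), the least-squares fit for one predictor has the closed form slope = cov(x, y) / var(x) and intercept = mean(y) - slope * mean(x):

    // Simple least-squares line fit: y = slope * x + intercept.
    public class LeastSquaresSketch {
        public static void main(String[] args) {
            double[] x = { 1, 2, 3, 4, 5 };             // independent variable
            double[] y = { 2.1, 3.9, 6.2, 7.8, 10.1 };  // dependent (target) variable

            double meanX = 0, meanY = 0;
            for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
            meanX /= x.length;
            meanY /= y.length;

            double sxy = 0, sxx = 0;
            for (int i = 0; i < x.length; i++) {
                sxy += (x[i] - meanX) * (y[i] - meanY);
                sxx += (x[i] - meanX) * (x[i] - meanX);
            }
            double slope = sxy / sxx;
            double intercept = meanY - slope * meanX;

            // The fitted line is the predictive model: score a new case.
            double xNew = 6.0;
            System.out.println("Predicted y at x=6: " + (slope * xNew + intercept));
        }
    }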

3.2.1 SVM Algorithm for Regression

Support Vector Machine (SVM) builds both classification and regression models. SVM is described in the section on classification; see Section 3.1.6, "Support Vector Machine".

3.3 Attribute Importance

Attribute Importance (AI) provides an automated solution for improving the speed and possibly the accuracy of classification models built on data tables with a large number of attributes.

Attribute Importance ranks the predictive attributes by eliminating redundant, irrelevant, or uninformative attributes and identifying those predictor attributes that may have the most influence in making predictions. ODM examines data and constructs classification models that can be used to make predictions about subsequent data. The time required to build these models increases with the number of predictors. Attribute Importance helps a user identify a proper subset of these attributes that are most relevant to predicting the target. Model building can proceed using the selected attributes (predictor attributes) only.

Using fewer attributes decreases model building time, although sometimes at a cost in predictive accuracy. Using too many attributes (especially those that are "noise") can affect the model and degrade its performance and accuracy. By extracting as much information as possible from a given data table using the smallest number of attributes, a user can save significant computing time and often build better models.

Attribute Importance permits the user to specify a number or percentage of attributes to use; alternatively the user can specify a cutoff point. After an Attribute Importance model is built, the user can select the subset of attributes based on the ranking or the predictive value.

Attribute Importance can be applied to data tables with a very large set of attributes. However, the DBA may have to tune the database in various ways to ensure that Attribute Importance build executes efficiently. For example, it is important to ensure that there is adequate swap space and table space.

3.3.1 Minimum Description Length

Minimum Description Length (MDL) is an information theoretic model selection principle. MDL assumes that the simplest, most compact representation of data is the best and most probable explanation of it.

With MDL, the model selection problem is treated as a communication problem. There is a sender, a receiver, and data to be transmitted. For classification models, the data to be transmitted is a model and the sequence of target class values in the training data. Typically, each model under consideration comes from a list of potential candidate models. The sizes of the lists vary. The models compute probabilities for the target class of each training data set row. The model's predicted probability is used to encode the training data set target values. From Shannon's noiseless coding theorem, it is known that the most compact encoding uses -log2(p_i) bits for the i-th training data set row, where p_i is the model's predicted probability of that row's target value, summed over all rows in the data set.

The sender and receiver agree on lists of potential candidate models for each model under consideration. A model in a list is represented by its position in the list (for example, model #26). The position is expressed as a binary number. Thus, for a list of length m, the number of bits in the binary number is log2(m). The sender and receiver each get a copy of the list. The sender transmits the model. Once the model is known, the sender knows how to encode the targets and the receiver knows how to decode the targets. The sender then transmits the encoded targets. The total description length of a model is then:

log2(m) - Sum_i(log2(p_i))

Note that the better the model is at estimating the target probabilities, the shorter the description length. However, the improvement in target probability estimation can come at the expense of a longer list of potential candidates. Thus the description length is a penalized likelihood of the model.

The attribute importance problem can be put in the MDL framework by considering each attribute as a simple predictive model of the target class. Each value of the predictor (indexed by i) has associated with it a set of n_i training examples and a probability distribution p_ij over the m target class values (indexed by j). From n_i training examples there are at most n_i distinguishable values of p_ij in the data. From combinatorics, it is known that the number of distributions of m distinct objects in a total of n objects is C(n - m, m). Hence the size of the list for a predictor is

Sum_i log2(C(n_i - m, m))

The total description length for a predictor is then:

Sum_i log2(C(n_i - m, m)) - Sum_ij(log2(p_ij))

The predictor rank is the position in the list of associated description lengths, smallest first.
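
The following Java sketch is one interpretation of the formulas above (the target-encoding term is weighted by row counts, per Shannon's theorem as described earlier; the predictor names and counts are invented): each predictor is treated as a simple model of the target, its description length is computed, and predictors are ranked smallest description length first.

    public class MdlRankSketch {
        // log2 of the binomial coefficient C(n, k), computed in log space.
        static double log2Choose(int n, int k) {
            if (k < 0 || k > n) return 0.0;              // degenerate guard
            double bits = 0.0;
            for (int t = 1; t <= k; t++)
                bits += Math.log((double) (n - k + t) / t) / Math.log(2);
            return bits;
        }

        // counts[i][j] = training rows with predictor value i and target class j
        static double descriptionLength(int[][] counts) {
            int m = counts[0].length;                    // number of target classes
            double bits = 0.0;
            for (int[] row : counts) {
                int ni = 0;
                for (int c : row) ni += c;
                bits += log2Choose(ni - m, m);           // transmit the model
                for (int c : row)                        // encode the targets
                    if (c > 0)
                        bits -= c * Math.log((double) c / ni) / Math.log(2);
            }
            return bits;
        }

        public static void main(String[] args) {
            // Two hypothetical predictors of a binary target (m = 2):
            // "income" separates the classes well; "hair_color" is mostly noise.
            int[][] income    = { { 90, 10 }, { 15, 85 } };
            int[][] hairColor = { { 52, 48 }, { 49, 51 } };
            System.out.println("income     DL = " + descriptionLength(income));
            System.out.println("hair_color DL = " + descriptionLength(hairColor));
            // The smaller description length ranks first (more important).
        }
    }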

3.4 ODM Model Seeker (Java Interface Only)

ODM Model Seeker allows a user to build multiple ABN and NB models and select the "best" model based on a predetermined criterion.


Note:

Model Seeker is deprecated in Oracle Data Mining 10g Release 1 (10.1). It will not be available in subsequent releases of ODM; the functionality will be supported in other ways.




With Model Seeker, the user can compactly specify parameters for an execution that will asynchronously build and test multiple classification models. These models may be built using different algorithms. For each algorithm, Model Seeker systematically varies the algorithm parameters to create a series of parameter values that are used to create corresponding models. Model Seeker then evaluates these models and selects a "best" model using a predetermined criterion.

These features were developed to support applications that want to create a model that can be directly used in a production application environment with a minimum of user interaction. This assumes that the application developer has the expertise, or advice from someone with the expertise, to set the various algorithm parameters in a manner that is suitable to the particular business problem and data sources on which the application is to operate.

A secondary purpose is to present to the user or application a summary of information about each model built and tested by a Model Seeker execution so that the user or application can independently find the parameters that correspond to an alternative "best" model using a different criterion. This feature is meant to provide support for an application for use by an analyst who will experiment with alternative parameter settings to discover the appropriate settings to use in a production environment.

Model Seeker's criterion for the "best" model is the one with the largest value for a calculated weighted relative accuracy. The weight used in this calculation is the relative importance of the positive target category to the other categories treated as a single negative category. If the weight is set to 1.0, the positive target category relative accuracy has the same weight as the relative accuracy of all the other categories combined.

The following formula is used to calculate the figure of merit (FOM) for the "best" model, where FOM is the weighted sum of the positive target relative accuracy and the total negative relative accuracy:

     FOM = (W * (number of correct positives)) / ((W + 1) * (number of actual positives))
           + (number of correct negatives) / ((W + 1) * (number of actual negatives))

where W is the user-specified weight, a value that must be > 0.0. A recommended way for a user to choose a value for the weight is as follows. First, estimate the cost to the user of predictions that are all correct except for a random fixed percentage, say 5%, of positive predictions being false positives. Second, estimate the cost to the user of predictions that are all correct except for the same random fixed percentage, 5%, of negative predictions being false negatives. Use a value for weight equal to the ratio of the second estimate to the first estimate. A weight of 1.0 means that a given percentage of false negatives has the same cost as a given percentage of false positives.
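
For concreteness, a minimal Java sketch of the FOM calculation (the test-result counts and the weight are hypothetical):

    // Figure of merit for one model's test results, per the formula above.
    public class FomSketch {
        static double fom(double w, int correctPos, int actualPos,
                          int correctNeg, int actualNeg) {
            return (w * correctPos) / ((w + 1) * actualPos)
                 + (double) correctNeg / ((w + 1) * actualNeg);
        }

        public static void main(String[] args) {
            // Hypothetical results: 80 of 100 positives and 900 of 1000
            // negatives predicted correctly.
            double w = 1.0;   // positives weighted equally with negatives
            System.out.println("FOM = " + fom(w, 80, 100, 900, 1000));  // 0.85
        }
    }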