7 In-Database Predictive Models in Oracle R Enterprise

The Oracle Advanced Analytics option consists of both Oracle Data Mining and Oracle R Enterprise. Oracle R Enterprise provides a familiar R interface for predictive analytics and data mining functions available in Oracle Data Mining. This is exposed through the OREdm package within Oracle R Enterprise.

Data mining uses sophisticated mathematical algorithms to segment data and evaluate the probability of future events. Oracle Data Mining can mine tables, views, star schemas, transactional data, and unstructured data.

For more information about Oracle Data Mining and the algorithms that it supports, see Oracle Data Mining Concepts 11g Release 2 (11.2) at http://www.oracle.com/technetwork/database/options/advanced-analytics/odm/index.html.

See OREdm Models for a complete list of supported algorithms and brief descriptions of the algorithms.

Note:

The CRAN package RODM also supports many Oracle Data Mining algorithms. RODM is different from OREdm.

The OREdm interface is designed to provide a standard R interface for corresponding predictive analytics and data mining functions.

This section provides an overview of the algorithms supported by OREdm. For detailed information about a specific model, see the R help associated with the specific OREdm function.

In order to build a model, you must have build (training) data that satisfies OREdm Requirements.

Oracle Data Mining models are somewhat different from OREdm models; see OREdm Models and Oracle Data Mining Models.

For a list of the models available at this release and brief overview information, see OREdm Models.

Examples of using OREdm to build models are included in the descriptions of each function. For example, Attribute Importance Example shows how to build an AI model.

OREdm Requirements

OREdm requires that the data used to train (build) models exists in a single table or view that contains columns of the following types only: VARCHAR2, CHAR, NUMBER, and FLOAT.

All privileges required by Oracle Data Mining are automatically granted during Oracle R Enterprise installation.

Oracle Data Mining must be enabled for the database that you connect to.

OREdm Models and Oracle Data Mining Models

Within OREdm, Oracle Data Mining models are given generated names. As long as the OREdm R model object exists, these model names can be used to access Oracle Data Mining models through other interfaces, including:

  • Oracle Data Miner

  • Any SQL interface, such as SQL*Plus or SQL Developer

    In particular, the models can be used with the Oracle Data Mining SQL Prediction functions.

Oracle Data Miner can be useful in a number of ways:

  • Get a list of available models

  • Use Model viewers to inspect model details

  • Score appropriately transformed data

Note:

Any transformations performed in the R space will not be carried over into Oracle Data Miner or SQL scoring.

Similarly, SQL can be used to get a list of models, inspect model details, and score appropriately transformed data with these models.

Models created using OREdm are transient objects; they usually are not persisted past the R session that created them. Oracle Data Mining models created using Data Miner or SQL, on the other hand, exist until they are explicitly dropped.

Model objects can be saved or persisted, as described in Persist and Manage R Objects in the Database. This allows OREdm-generated model objects to exist across R sessions and keeps the corresponding Oracle Data Mining model object in place.

While the OREdm model exists, you can export and import it; you can then use it independently of the existence of the Oracle R Enterprise R object.
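For example, a minimal sketch using the datastore functions described in Persist and Manage R Objects in the Database (the datastore name is illustrative, and dt.mod stands for a model object such as the one built in the Decision Tree Example):

  # Save the model object to a named datastore so it survives the session
  ore.save(dt.mod, name = "my_models", overwrite = TRUE)
  # Later, in a new R session (after connecting with ore.connect):
  ore.load("my_models")   # restores dt.mod; its Oracle Data Mining model remains usable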

OREdm Models

OREdm supports these Oracle Data Mining models:

  • Attribute Importance (Minimum Description Length)

  • Decision Tree

  • Generalized Linear Models

  • k-Means

  • Naive Bayes

  • Support Vector Machine

Oracle Data Mining and open-source R use different terminology; see Data Mining Terminology.

Note that there are several Overloaded Functions that perform common actions such as predict (score), summary, and print.

Data Mining Terminology

Oracle Data Mining and the Oracle R Enterprise OREdm package that creates statistical models use somewhat different terminology. These are the most important differences:

  • Oracle R Enterprise fits models, whereas Oracle Data Mining builds or trains models.

  • Oracle R Enterprise predicts using new data, whereas Oracle Data Mining scores new data, or applies a model to new data.

  • Oracle R Enterprise uses formula, as described in Formula, in the API calls; Oracle Data Mining does not support formula.

Formula

R model definitions require a formula that expresses relationships between variables. The formula class is included in the R stats package. For more information, see the R help associated with ?formula. A formula provides a symbolic description of the model to be fitted.

The stats::formula specification has the form response ~ terms, where:

  • response is the numeric or character response vector.

  • terms is a series of terms, that is, the column names to include in the model. Multiple terms are specified using + between column names.

Use response ~ . if all columns in data should be used for model building.

Functions can be applied to response and terms to realize transformations.

To exclude columns, use - before the name of each column to exclude.

The examples of model builds in this document and in the R help all contain sample formulas. There is no equivalent of formula in the Oracle Data Mining API.
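A few illustrative formulas (the column names are taken from the iris data used in the Attribute Importance Example; the transformation example is hypothetical):

  Species ~ Petal.Length + Petal.Width   # two named predictor columns
  Species ~ .                            # use all other columns as predictors
  Species ~ . - Sepal.Width              # use all columns except Sepal.Width
  log(Sepal.Length) ~ .                  # apply a transformation to the response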

Overloaded Functions

predict(), summary(), and print() are defined for all OREdm algorithms, as illustrated, for example, in GLM Examples.

summary() returns detailed information about the model created, such as details of the generated decision tree.
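For instance, a minimal sketch of the shared generics (assuming the MTCARS ore.frame created in the Decision Tree Example; any OREdm model works the same way):

  nb.mod <- ore.odmNB(gear ~ ., MTCARS)      # build any OREdm model
  print(nb.mod)                              # selected model components
  summary(nb.mod)                            # detailed model information
  nb.res <- predict(nb.mod, MTCARS, "gear")  # score (predict) using the model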

Attribute Importance

Oracle Data Mining uses the Minimum Description Length algorithm to calculate attribute importance. Attribute importance ranks attributes according to their significance in predicting a target.

Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).

MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build Oracle Data Mining attribute importance models.

Attribute Importance models built using Oracle Data Mining cannot be applied to new data.

ore.odmAI produces a ranking of attributes and their importance values.

Note:

OREdm AI models differ from Oracle Data Mining AI models in these ways: a model object is not retained, and an R model object is not returned. Only the importance ranking created by the model is returned.

For details about parameters, see the R help associated with ore.odmAI.

For an example, see Attribute Importance Example.

Attribute Importance Example

This example creates a table by pushing the data frame iris to the table IRIS and then builds an attribute importance model:

  IRIS <- ore.push(iris)
  ore.odmAI(Species ~ ., IRIS)  # rank the predictors of the target Species

Decision Tree

The Decision Tree algorithm is based on conditional probabilities. Decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and easily used within a database to identify a set of records.

Decision Tree models are classification models.

A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.

During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. ore.odmDT offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
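As a pure-R illustration of the two homogeneity metrics (illustration only, not the OREdm API; ore.odmDT computes these internally):

  gini    <- function(p) 1 - sum(p^2)                         # p: class proportions in a node
  entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
  p <- c(0.7, 0.3)
  gini(p)      # 0.42
  entropy(p)   # approximately 0.881; lower values indicate purer nodes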

OREdm includes these functions for Decision Tree (DT):

  • ore.odmDT creates (builds) a DT model.

  • predict predicts classifications on new data using the DT model.

  • summary provides a summary of the DT model. The summary includes node details that describe the tree that the model generates, and a symbolic description of the model. Returns an instance of summary.ore.odmDT.

  • print.ore.odmDT prints select components of the ore.odmDT model.

For details about parameters, see the R help associated with ore.odmDT.

For an example, see Decision Tree Example.

Decision Tree Example

This example creates an input table, builds a model, makes predictions, and generates a confusion matrix.

# Create MTCARS, the input data
  m <- mtcars
  m$gear <- as.factor(m$gear)
  m$cyl  <- as.factor(m$cyl)
  m$vs   <- as.factor(m$vs)
  m$ID   <- 1:nrow(m)
  MTCARS <- ore.push(m)
  row.names(MTCARS) <- MTCARS$ID
# Build the model
  dt.mod  <- ore.odmDT(gear ~ ., MTCARS)
  summary(dt.mod)
# Make predictions and generate a confusion matrix
  dt.res  <- predict(dt.mod, MTCARS, "gear")
  with(dt.res, table(gear, PREDICTION))  # generate confusion matrix

Generalized Linear Models

Generalized Linear Models (GLM) include and extend the class of linear models (linear regression). Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.

Oracle Data Mining's GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models. The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, Oracle Data Mining GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is well suited to high-dimensional data. (This approach is similar to the one described in Komarek's 2004 paper.) The choice of algorithm is handled internally and is transparent to the user.

GLM can be used to create classification or regression models as follows:

  • Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

    For an example, see GLM Examples.

  • Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.

    For an example, see GLM Examples.

ore.odmGLM allows you to build two different types of models. Some arguments apply to classification models only, and some to regression models only.

OREdm provides these functions for Generalized Linear Models (GLM):

  • ore.odmGLM creates (builds) a GLM model.

  • residuals returns an ore.frame containing three types of residuals: deviance, Pearson, and response.

  • fitted returns fitted.values, an ore.vector containing the fitted values. Related components of the fitted model include:

    • rank: The numeric rank of the fitted model

    • type: The type of model fit

  • predict.ore.odmGLM predicts new data using the GLM model.

  • confint is a logical indicator of whether to produce confidence intervals for the predicted values.

  • deviance is minus twice the maximized log-likelihood, up to a constant.

  • coef.ore.odmGLM retrieves the coefficients of a GLM model.

  • extractAIC.ore.odmGLM extracts Akaike's An Information Criterion (AIC) from the global details of the GLM model.

  • logLik extracts Log-Likelihood for an OREdm GLM model.

  • nobs extracts the number of observations from a model fit. nobs is used in computing BIC.

    BIC is defined as AIC(object, ..., k = log(nobs(object))).

  • summary creates a summary of the GLM model, including fit details, and also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmGLM.

  • print prints selected components of the GLM model.

For details about parameters and methods, see the R help associated with ore.odmGLM.

GLM Examples

These examples build several models using GLM. The input tables are R data sets pushed to the database.

  • Linear regression using the longley data set:

    LONGLEY <- ore.push(longley)
    longfit1 <- ore.odmGLM(Employed ~ ., data = LONGLEY)
    summary(longfit1)
    
  • Ridge regression using the longley data set:

    longfit2 <- ore.odmGLM(Employed ~ ., data = LONGLEY, ridge = TRUE,
                           ridge.vif = TRUE)
    summary(longfit2)
    
  • Logistic regression (classification) using the infert data set:

    INFERT <- ore.push(infert)
    infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                         data = INFERT, type = "logistic")
    infit1
    
  • Changing the reference value to 1 for infit1:

    infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                           data = INFERT, type = "logistic", reference = 1)
    infit2
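
As a follow-on sketch, the likelihood-based accessors described earlier can be applied to longfit1 (assuming the default AIC generic can use the logLik method defined for these models):

    extractAIC(longfit1)                     # AIC from the model's global details
    logLik(longfit1)                         # maximized log-likelihood
    nobs(longfit1)                           # number of observations
    AIC(longfit1, k = log(nobs(longfit1)))   # BIC, per the definition given above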
    

k-Means

The k-Means (KM) algorithm is a distance-based clustering algorithm that partitions data into a specified number of clusters. The Oracle Data Mining implementation is an enhanced version of the classic algorithm, with these features:

  • Several distance functions: Euclidean, Cosine, and Fast Cosine distance functions. The default is Euclidean.

  • For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numerical attributes.

OREdm includes these functions for k-Means (KM) models:

  • ore.odmKMeans creates (builds) a KM model.

  • predict predicts new data using the KM model.

  • rules.ore.odmKMeans extracts rules generated by the KM model.

  • clusterhists.ore.odmKMeans generates a data.frame with histogram data for each cluster and variable combination in the model. Numerical variables are binned.

  • histograms.ore.odmKMeans produces lattice-based histograms from a clustering model.

  • summary returns a summary of the KM model, including rules. Also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmKMeans.

  • print prints selected components of the KM model.

For details about parameters, see the R help associated with ore.odmKMeans.

For an example, see k-Means Example.

k-Means Example

This example creates the table X, builds a cluster model, plots the clusters via histogram(), and makes predictions:

# Create input table X
  x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
  colnames(x) <- c("x", "y")
  X <- ore.push(data.frame(x))
# Build the clustering model and inspect it
  km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
  km.mod1
  summary(km.mod1)
  rules(km.mod1)
  clusterhists(km.mod1)
  histogram(km.mod1)
# Score the data; plot the clusters and their centers
  km.res1 <- predict(km.mod1, X, type = "class", supplemental.cols = c("x", "y"))
  head(km.res1, 3)
  km.res1.local <- ore.pull(km.res1)
  plot(data.frame(x = km.res1.local$x, y = km.res1.local$y),
       col = km.res1.local$CLUSTER_ID)
  points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex = 2)
# Additional prediction variants
  head(predict(km.mod1, X))
  head(predict(km.mod1, X, type = c("class", "raw")), 3)
  head(predict(km.mod1, X, type = c("class", "raw"), supplemental.cols = c("x", "y")), 3)
  head(predict(km.mod1, X, type = "class"), 3)
  head(predict(km.mod1, X, type = "class", supplemental.cols = c("x", "y")), 3)
  head(predict(km.mod1, X, type = "raw"), 3)
  head(predict(km.mod1, X, type = "raw", supplemental.cols = c("x", "y")), 3)

Naive Bayes

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.

Naive Bayes assumes that each predictor is conditionally independent of the others given the target. (This "naive" independence assumption is what makes the conditional probability computation tractable.)
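As a pure-R illustration of the underlying idea (illustration only, not the OREdm API), conditional probabilities can be estimated from a frequency table:

  # P(gear | cyl) estimated from frequencies in the local mtcars data
  tab <- table(cyl = mtcars$cyl, gear = mtcars$gear)
  prop.table(tab, margin = 1)   # each row holds P(gear = g | cyl = c)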

OREdm includes these functions for Naive Bayes (NB) models:

  • ore.odmNB creates (builds) an NB model.

  • predict scores new data using the NB model.

  • summary provides a summary of the NB model. Also returns formula, a symbolic description of the model. Returns an instance of summary.ore.odmNB.

  • print prints select components of the NB model.

For details about parameters, see the R help associated with ore.odmNB.

For an example, see Naive Bayes Example.

Naive Bayes Example

This example creates MTCARS, builds a Naive Bayes model, and then uses the model to make predictions:

# Create MTCARS
  m <- mtcars
  m$gear <- as.factor(m$gear)
  m$cyl  <- as.factor(m$cyl)
  m$vs   <- as.factor(m$vs)
  m$ID   <- 1:nrow(m)
  MTCARS <- ore.push(m)
  row.names(MTCARS) <- MTCARS$ID
# Build model
  nb.mod  <- ore.odmNB(gear ~ ., MTCARS)
  summary(nb.mod)
# Make predictions
  nb.res  <- predict(nb.mod, MTCARS, "gear")
  with(nb.res, table(gear, PREDICTION))  # generate confusion matrix

Support Vector Machine

Support Vector Machine (SVM) is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties; regularization refers to controlling model complexity so that the model generalizes well to new data.

SVM models have a functional form similar to that of neural networks and radial basis functions, both popular data mining techniques.

SVM can be used to solve the following problems:

  • Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane is one that separates a set of objects having different class memberships. SVM finds the vectors ("support vectors") that define the separators giving the widest separation of classes.

    SVM classification supports both binary and multiclass targets.

    For an example, see SVM Classification.

  • Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.

    SVM regression tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors (see the loss sketch after this list).

    For an example, see SVM Regression.

  • Anomaly Detection: Anomaly detection identifies cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.

    Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.

    For an example, see SVM Anomaly Detection.
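As a pure-R sketch of the epsilon-insensitive loss mentioned for regression (illustration only; ore.odmSVM handles this internally):

  eps_loss <- function(y, yhat, epsilon = 0.1) {
    pmax(0, abs(y - yhat) - epsilon)   # zero for predictions inside the tube
  }
  eps_loss(y = c(1.0, 1.05, 2.0), yhat = c(1.0, 1.0, 1.5))
  # [1] 0.0 0.0 0.4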

The ore.odmSVM function builds each of these three different types of models. Some arguments apply to classification models only, some to regression models only, and some to anomaly detection models only.

OREdm provides these functions for SVM models:

  • ore.odmSVM creates (builds) an SVM model.

  • predict predicts (scores) new data using the SVM model.

  • coef retrieves the coefficients of an SVM model.

    SVM has two kernels, linear and Gaussian; only the linear kernel generates coefficients.

  • summary creates a summary of the SVM model and also returns formula, a symbolic description of the model. Returns an object of type summary.ore.odmSVM.

  • print prints selected components of the SVM model.

For details about parameters, see the R help associated with ore.odmSVM.

Support Vector Machine Examples

These examples build three types of models: classification, regression, and anomaly detection.

SVM Classification

This example creates MTCARS in the database from the R mtcars data set, builds a classification model, makes predictions, and finally generates a confusion matrix.

  m <- mtcars
  m$gear <- as.factor(m$gear)
  m$cyl  <- as.factor(m$cyl)
  m$vs   <- as.factor(m$vs)
  m$ID   <- 1:nrow(m)
  MTCARS <- ore.push(m)
 
  svm.mod <- ore.odmSVM(gear ~ . - ID, MTCARS, "classification")
  summary(svm.mod)
  coef(svm.mod)
  svm.res <- predict(svm.mod, MTCARS, "gear")
  with(svm.res, table(gear, PREDICTION))  # generate confusion matrix
 
SVM Regression

This example creates a data frame, pushes it to a table, and then builds a regression model; note that ore.odmSVM specifies a linear kernel:

  x <- seq(0.1, 5, by = 0.02)
  y <- log(x) + rnorm(x, sd = 0.2)
  dat <- ore.push(data.frame(x = x, y = y))
 
# Build model with linear kernel
  svm.mod <- ore.odmSVM(y ~ x, dat, "regression", kernel.function = "linear")
  summary(svm.mod)
  coef(svm.mod)
  svm.res <- predict(svm.mod, dat, supplemental.cols = "x")
  head(svm.res, 6)

SVM Anomaly Detection

This example uses MTCARS created in the classification example and builds an anomaly detection model:

  svm.mod <- ore.odmSVM(~ . - ID, MTCARS, "anomaly.detection")
  summary(svm.mod)
  svm.res <- predict(svm.mod, MTCARS, "ID")
  head(svm.res)
  table(svm.res$PREDICTION)