4 Build Models in Oracle Machine Learning for R
OML4R provides functions for building regression models, neural network models, and models based on Oracle Machine Learning for SQL algorithms.
This chapter has the following topics:
- Build Oracle Machine Learning for R Models
The OML4R packageOREmodels
contains functions with which you can create advanced analytical data models usingore.frame
objects. - Build Oracle Machine Learning for SQL Models
Use the functions in theOREdm
package of Oracle Machine Learning for R to build Oracle Machine Learning for SQL models in R. - Cross-Validate Models
Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models through repeated sampling from the available data.
4.1 Build Oracle Machine Learning for R Models
The OML4R package OREmodels
contains functions with which you can create advanced analytical data models using ore.frame
objects.
- About OREmodels Functions
TheOREmodels
package contains functions with which you can build machine learning models usingore.frame
objects. - About the longley Data Set for Examples
Most of the linear regression andore.neural
examples use the longley data set, which is provided by R. - Build Linear Regression Models
Theore.lm
andore.stepwise
functions perform least squares regression and stepwise least squares regression, respectively, on data represented in anore.frame
object. - Build a Generalized Linear Model
Theore.glm
functions fits generalized linear models on data in anore.frame
object.. - Build a Neural Network Model
Neural network models can be used to capture intricate nonlinear relationships between inputs and outputs or to find patterns in data. - Build a Random Forest Model
Theore.randomForest
function provides an ensemble learning technique for classification of data in anore.frame
object.
Parent topic: Build Models in Oracle Machine Learning for R
4.1.1 About OREmodels Functions
The OREmodels
package contains functions with which you can build machine learning models using ore.frame
objects.
The OREmodels
functions are the following:
Table 4-1 Functions in the OREmodels Package
Function | Description |
---|---|
|
Fits and uses a Generalized Linear Model model on data in an |
|
Fits a linear regression model on data in an |
|
Fits a Neural Network model on data in an |
ore.randomForest |
Creates a Random Forest classification model in parallel on data in an ore.frame .
|
|
Fits a stepwise linear regression model on data in an |
Note:
In R terminology, the phrase "fits a model" is often synonymous with "builds a model". In this document and in the online help for Oracle Machine Learning for R functions, the phrases are used interchangeably.
The ore.glm
, ore.lm
, and ore.stepwise
functions have the following advantages:
-
The algorithms provide accurate solutions using out-of-core QR factorization. QR factorization decomposes a matrix into an orthogonal matrix and a triangular matrix.
QR is an algorithm of choice for difficult rank-deficient models.
-
You can process data that does not fit into memory, that is, out-of-core data. QR factors a matrix into two matrices, one of which fits into memory while the other is stored on disk.
The
ore.glm
,ore.lm
andore.stepwise
functions can solve data sets with more than one billion rows. -
The
ore.stepwise
function allows fast implementations of forward, backward, and stepwise model selection techniques.
The ore.neural
function has the following advantages:
-
It is a highly scalable implementation of neural networks, able to build a model on even billion row data sets in a matter of minutes. The
ore.neural
function can be run in two modes: in-memory for small to medium data sets and distributed (out-of-core) for large inputs. -
You can specify the activation functions on neurons on a per-layer basis;
ore.neural
supports many different activation functions. -
You can specify a neural network topology consisting of any number of hidden layers, including none.
Parent topic: Build Oracle Machine Learning for R Models
4.1.2 About the longley Data Set for Examples
Most of the linear regression and ore.neural
examples use the longley data set, which is provided by R.
The longley data set is a small macroeconomic data set that provides a well-known example for collinear regression and consists of seven economic variables observed yearly over 16 years.
Example 4-1 Displaying Values from the longley Data Set
This example pushes the longley data set to a temporary database table that has the proxy ore.frame
object longley_of
displays the first six rows of longley_of
.
longley_of <- ore.push(longley) head(longley_of)
Listing for This Example
R> longley_of <- ore.push(longley) R> dim(longley_of)[1] 16 7 R> head(longley_of) GNP.deflator GNP Unemployed Armed.Forces Population Year Employed 1947 83.0 234.289 235.6 159.0 107.608 1947 60.323 1948 88.5 259.426 232.5 145.6 108.632 1948 61.122 1949 88.2 258.054 368.2 161.6 109.773 1949 60.171 1950 89.5 284.599 335.1 165.0 110.929 1950 61.187 1951 96.2 328.975 209.9 309.9 112.075 1951 63.221 1952 98.1 346.999 193.2 359.4 113.270 1952 63.639
Parent topic: Build Oracle Machine Learning for R Models
4.1.3 Build Linear Regression Models
The ore.lm
and ore.stepwise
functions perform least squares regression and stepwise least squares regression, respectively, on data represented in an ore.frame
object.
A model fit is generated using embedded R map/reduce operations where the map operation creates either QR decompositions or matrix cross-products depending on the number of coefficients being estimated. The underlying model matrices are created using either a model.matrix
or sparse.model.matrix
object depending on the sparsity of the model. Once the coefficients for the model have been estimated another pass of the data is made to estimate the model-level statistics.
When forward, backward, or stepwise selection is performed, the XtX and Xty matrices are subsetted to generate the F-test p-values based upon coefficient estimates that were generated using a Choleski decomposition of the XtX subset matrix.
If there are collinear terms in the model, functions ore.lm
and ore.stepwise
do not estimate the coefficient values for a collinear set of terms. For ore.stepwise
, a collinear set of terms is excluded throughout the procedure.
For more information on ore.lm
and ore.stepwise
, invoke help(ore.lm)
.
Example 4-2 Using ore.lm
This example pushes the longley
data set to a temporary database table that has the proxy ore.frame
object longley_of
. The example builds a linear regression model using ore.lm
.
longley_of <- ore.push(longley) # Fit full model oreFit1 <- ore.lm(Employed ~ ., data = longley_of) class(oreFit1) summary(oreFit1)
Listing for This Example
R> longley_of <- ore.push(longley) R> # Fit full model R> oreFit1 <- ore.lm(Employed ~ ., data = longley_of) R> class(oreFit1) [1] "ore.lm" "ore.model" "lm" R> summary(oreFit1) Call: ore.lm(formula = Employed ~ ., data = longley_of) Residuals: Min 1Q Median 3Q Max -0.41011 -0.15767 -0.02816 0.10155 0.45539 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 ** GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141 GNP -3.582e-02 3.349e-02 -1.070 0.312681 Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 ** Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 *** Population -5.110e-02 2.261e-01 -0.226 0.826212 Year 1.829e+00 4.555e-01 4.016 0.003037 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.3049 on 9 degrees of freedom Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925 F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
Example 4-3 Using the ore.stepwise Function
This example pushes the longley
data set to a temporary database table that has the proxy ore.frame
object longley_of
. The example builds linear regression models using the ore.stepwise
function.
longley_of <- ore.push(longley) # Two stepwise alternatives oreStep1 <- ore.stepwise(Employed ~ .^2, data = longley_of, add.p = 0.1, drop.p = 0.1) oreStep2 <- step(ore.lm(Employed ~ 1, data = longley_of), scope = terms(Employed ~ .^2, data = longley_of))
Listing for This Example
R> longley_of <- ore.push(longley) R> # Two stepwise alternatives R> oreStep1 <- + ore.stepwise(Employed ~ .^2, data = longley_of, add.p = 0.1, drop.p = 0.1) R> oreStep2 <- + step(ore.lm(Employed ~ 1, data = longley_of), + scope = terms(Employed ~ .^2, data = longley_of)) Start: AIC=41.17 Employed ~ 1 Df Sum of Sq RSS AIC + GNP 1 178.973 6.036 -11.597 + Year 1 174.552 10.457 -2.806 + GNP.deflator 1 174.397 10.611 -2.571 + Population 1 170.643 14.366 2.276 + Unemployed 1 46.716 138.293 38.509 + Armed.Forces 1 38.691 146.318 39.411 <none> 185.009 41.165 Step: AIC=-11.6 Employed ~ GNP Df Sum of Sq RSS AIC + Unemployed 1 2.457 3.579 -17.960 + Population 1 2.162 3.874 -16.691 + Year 1 1.125 4.911 -12.898 <none> 6.036 -11.597 + GNP.deflator 1 0.212 5.824 -10.169 + Armed.Forces 1 0.077 5.959 -9.802 - GNP 1 178.973 185.009 41.165 ... The rest of the output is not shown.
Parent topic: Build Oracle Machine Learning for R Models
4.1.4 Build a Generalized Linear Model
The ore.glm
functions fits generalized linear models on data in an ore.frame
object..
The function uses a Fisher scoring iteratively reweighted least squares (IRLS) algorithm. Instead of the traditional step of halving to prevent the selection of less optimal coefficient estimates, ore.glm
uses a line search to select new coefficient estimates at each iteration, starting from the current coefficient estimates and moving through the Fisher scoring suggested estimates using the formula (1 - alpha) * old + alpha * suggested
where alpha in [0, 2]
. When the interp
control argument is TRUE
, the deviance is approximated by a cubic spline interpolation. When it is FALSE
, the deviance is calculated using a follow-up data scan.
Each iteration consists of two or three embedded R execution map/reduce operations: an IRLS operation, an initial line search operation, and, if interp = FALSE
, an optional follow-up line search operation. As with ore.lm
, the IRLS map operation creates QR decompositions when update = "qr"
or cross-products when update = "crossprod"
of the model.matrix
, or sparse.model.matrix
if argument sparse = TRUE
, and the IRLS reduce operation block updates those QR decompositions or cross-product matrices. After the algorithm has either converged or reached the maximum number of iterations, a final embedded R map/reduce operation is used to generate the complete set of model-level statistics.
The ore.glm
function returns an ore.glm
object.
For information on the ore.glm
function arguments, call help(ore.glm)
.
Settings for a Generalized Linear Models
The following table lists settings that apply to Generalized Linear models.
Table 4-2 Generalized Linear Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
The confidence level for coefficient confidence intervals. The default confidence level is |
|
|
Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data. |
|
|
Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled. |
|
|
Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data. |
|
|
Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled. |
|
|
When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory. |
GLMS_PRUNE_MODEL |
|
Prune enable or disable for features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data. |
|
target_value |
The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class. |
GLMS_RIDGE_REGRESSION |
|
Enable or disable Ridge Regression. Ridge applies to both regression and Classification mining functions. When ridge is enabled, prediction bounds are not produced by the Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised. |
|
|
The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm. |
|
|
Enable or disable row diagnostics. |
|
The range is ( |
Convergence Tolerance setting of the GLM algorithm The default value is system-determined. |
|
Positive integer |
Maximum number of iterations for the GLM algorithm. The default value is system-determined. |
|
0 or Positive integer
|
Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is |
|
|
This setting allows the user to choose the GLM solver. The solver cannot be selected if |
|
|
This setting allows the user to use sparse solver if it is available. The default value is |
Example 4-4 Using the ore.glm Function
This example loads the rpart
package and then pushes the kyphosis
data set to a temporary database table that has the proxy ore.frame
object KYPHOSIS
. The example builds a Generalized Linear Model using the ore.glm
function and one using the glm
function and calls the summary
function on the models.
# Load the rpart library to get the kyphosis and solder data sets. library(rpart) # Logistic regression KYPHOSIS <- ore.push(kyphosis) kyphFit1 <- ore.glm(Kyphosis ~ ., data = KYPHOSIS, family = binomial()) kyphFit2 <- glm(Kyphosis ~ ., data = kyphosis, family = binomial()) summary(kyphFit1) summary(kyphFit2)Listing for Example 4-4
R> # Load the rpart library to get the kyphosis and solder data sets. R> library(rpart) R> # Logistic regression R> KYPHOSIS <- ore.push(kyphosis) R> kyphFit1 <- ore.glm(Kyphosis ~ ., data = KYPHOSIS, family = binomial()) R> kyphFit2 <- glm(Kyphosis ~ ., data = kyphosis, family = binomial()) R> summary(kyphFit1) Call: ore.glm(formula = Kyphosis ~ ., data = KYPHOSIS, family = binomial()) Deviance Residuals: Min 1Q Median 3Q Max -2.3124 -0.5484 -0.3632 -0.1659 2.1613 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.036934 1.449622 -1.405 0.15998 Age 0.010930 0.006447 1.696 0.08997 . Number 0.410601 0.224870 1.826 0.06786 . Start -0.206510 0.067700 -3.050 0.00229 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 83.234 on 80 degrees of freedom Residual deviance: 61.380 on 77 degrees of freedom AIC: 69.38 Number of Fisher Scoring iterations: 4 R> summary(kyphFit2) Call: glm(formula = Kyphosis ~ ., family = binomial(), data = kyphosis) Deviance Residuals: Min 1Q Median 3Q Max -2.3124 -0.5484 -0.3632 -0.1659 2.1613 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.036934 1.449575 -1.405 0.15996 Age 0.010930 0.006446 1.696 0.08996 . Number 0.410601 0.224861 1.826 0.06785 . Start -0.206510 0.067699 -3.050 0.00229 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 83.234 on 80 degrees of freedom Residual deviance: 61.380 on 77 degrees of freedom AIC: 69.38 Number of Fisher Scoring iterations: 5 # Poisson regression R> SOLDER <- ore.push(solder) R> solFit1 <- ore.glm(skips ~ ., data = SOLDER, family = poisson()) R> solFit2 <- glm(skips ~ ., data = solder, family = poisson()) R> summary(solFit1) Call: ore.glm(formula = skips ~ ., data = SOLDER, family = poisson()) Deviance Residuals: Min 1Q Median 3Q Max -3.4105 -1.0897 -0.4408 0.6406 3.7927 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.25506 0.10069 -12.465 < 2e-16 *** OpeningM 0.25851 0.06656 3.884 0.000103 *** OpeningS 1.89349 0.05363 35.305 < 2e-16 *** SolderThin 1.09973 0.03864 28.465 < 2e-16 *** MaskA3 0.42819 0.07547 5.674 1.40e-08 *** MaskB3 1.20225 0.06697 17.953 < 2e-16 *** MaskB6 1.86648 0.06310 29.580 < 2e-16 *** PadTypeD6 -0.36865 0.07138 -5.164 2.41e-07 *** PadTypeD7 -0.09844 0.06620 -1.487 0.137001 PadTypeL4 0.26236 0.06071 4.321 1.55e-05 *** PadTypeL6 -0.66845 0.07841 -8.525 < 2e-16 *** PadTypeL7 -0.49021 0.07406 -6.619 3.61e-11 *** PadTypeL8 -0.27115 0.06939 -3.907 9.33e-05 *** PadTypeL9 -0.63645 0.07759 -8.203 2.35e-16 *** PadTypeW4 -0.11000 0.06640 -1.657 0.097591 . PadTypeW9 -1.43759 0.10419 -13.798 < 2e-16 *** Panel 0.11818 0.02056 5.749 8.97e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 6855.7 on 719 degrees of freedom Residual deviance: 1165.4 on 703 degrees of freedom AIC: 2781.6 Number of Fisher Scoring iterations: 4
Parent topic: Build Oracle Machine Learning for R Models
4.1.5 Build a Neural Network Model
Neural network models can be used to capture intricate nonlinear relationships between inputs and outputs or to find patterns in data.
Settings for a Neural Network Model
The following table lists the settings that apply to Neural Network models.
Table 4-3 Neural Network Model Settings
Setting Name | Setting Value | Description |
---|---|---|
NNET_HIDDEN_LAYERS |
Non-negative integer |
Defines the topology by number of hidden layers. The default value is |
|
A list of positive integers |
Defines the topology by number of nodes per layer. Different layers can have different number of nodes. The value should be non-negative integers and comma separated. For example, '10, 20, 5'. The setting values must be consistent with |
|
A list of the following strings:
|
Defines the activation function for the hidden layers. For example, '''NNET_ACTIVATIONS_BIPOLAR_SIG '', ''NNET_ACTIVATIONS_TANH '''.
Different layers can have different activation functions. The default value is The number of activation functions must be consistent with Note: All quotes are single and two single quotes are used to escape a single quote in SQL statements. |
|
|
The setting specifies the lower bound of the region where weights are randomly initialized.
The value of |
|
|
This setting specifies the upper bound of the region where weights are initialized. It should be set in pairs with The default value is |
|
Positive integer |
This setting specifies the maximum number of iterations in the Neural Network algorithm. The default value is |
|
|
Defines the convergence tolerance setting of the Neural Network algorithm. The default value is |
NNET_REGULARIZER |
|
Regularization setting for Neural Network algorithm. If the total number of training rows is greater than 50000, the default is If the total number of training rows is less than or equal to 50000, the default is |
|
|
Define the held ratio for the held-aside method. The default value is |
|
The value must be a positive integer. |
With The default value is |
NNET_REG_LAMBDA |
TO_CHAR(numeric_expr >=0) |
Defines the L2 regularization parameter lambda. This can not be set together with NNET_REGULARIZER_HELDASIDE .
The default value is |
The ore.neural
function builds a feed-forward neural network for regression on ore.frame
data. It supports multiple hidden layers with a specifiable number of nodes. Each layer can have one of several activation functions.
The output layer is a single numeric or binary categorical target. The output layer can have any of the activation functions. It has the linear activation function by default.
The output of ore.neural
is an object of type ore.neural
.
For information about the arguments to the ore.neural
function, invoke help(ore.neural)
.
Modeling with the ore.neural
function is well-suited for noisy and complex data such as sensor data. Problems that such data might have are the following:
-
Potentially many (numeric) predictors, for example, pixel values
-
The target may be discrete-valued, real-valued, or a vector of such values
-
Training data may contain errors – robust to noise
-
Fast scoring
-
Model transparency is not required; models difficult to interpret
Typical steps in neural network modeling are the following:
- Specifying the architecture
- Preparing the data
- Building the model
- Specifying the stopping criteria: iterations, error on a validation set within tolerance
- Viewing statistical results from model
- Improving the model
Example 4-5 Building a Neural Network Model
This example builds a Neural Network model with default values, including a hidden size of 1. The example pushes a subset of the longley
data set to an ore.frame
object in database memory as the object trainData
. The example then pushes a different subset of longley
to the database as the object testData
. The example builds the model with trainData
and then predicts results using testData
.
trainData <- ore.push(longley[1:11, ]) testData <- ore.push(longley[12:16, ]) fit <- ore.neural('Employed ~ GNP + Population + Year', data = trainData) ans <- predict(fit, newdata = testData) ans
Listing for This Example
R> trainData <- ore.push(longley[1:11, ]) R> testData <- ore.push(longley[12:16, ]) R> fit <- ore.neural('Employed ~ GNP + Population + Year', data = trainData) R> ans <- predict(fit, newdata = testData) R> ans pred_Employed 1 67.97452 2 69.50893 3 70.28098 4 70.86127 5 72.31066 Warning message: ORE object has no unique key - using random order
Example 4-6 Using ore.neural and Specifying Activations
This example pushes the iris
data set to a temporary database table that has the proxy ore.frame
object IRIS
. The example builds a Neural Network model using the ore.neural
function and specifies a different activation function for each layer.
IRIS <- ore.push(iris) fit <- ore.neural(Petal.Length ~ Petal.Width + Sepal.Length, data = IRIS, hiddenSizes = c(20, 5), activations = c("bSigmoid", "tanh", "linear")) ans <- predict(fit, newdata = IRIS, supplemental.cols = c("Petal.Length")) options(ore.warn.order = FALSE) head(ans, 3) summary(ans)
Listing for This Example
R> IRIS <- ore.push(iris) R> fit <- ore.neural(Petal.Length ~ Petal.Width + Sepal.Length, + data = IRIS, + hiddenSizes = c(20, 5), + activations = c("bSigmoid", "tanh", "linear")) R> R> ans <- predict(fit, newdata = IRIS, + supplemental.cols = c("Petal.Length")) R> options(ore.warn.order = FALSE) R> head(ans, 3) Petal.Length pred_Petal.Length 1 1.4 1.416466 2 1.4 1.363385 3 1.3 1.310709 R> summary(ans) Petal.Length pred_Petal.Length Min. :1.000 Min. :1.080 1st Qu.:1.600 1st Qu.:1.568 Median :4.350 Median :4.346 Mean :3.758 Mean :3.742 3rd Qu.:5.100 3rd Qu.:5.224 Max. :6.900 Max. :6.300
Parent topic: Build Oracle Machine Learning for R Models
4.1.6 Build a Random Forest Model
The ore.randomForest
function provides an ensemble learning technique for classification of data in an ore.frame
object.
Function ore.randomForest
builds a Random Forest model by growing trees in parallel on the database server. It constructs many decision trees and outputs the class that is the mode of the classes of the individual trees. The function avoids overfitting, which is a common problem for decision trees.
The Random Forest algorithm, developed by Leo Breiman and Adele Cutler, combines the ideas of bagging and the random selection of variables, which results in a collection of decision trees with controlled variance. The Random Forest algorithm provides high accuracy, but performance and scalability can be issues for large data sets.
Function ore.randomForest
executes in parallel for model building and scoring. Parallel execution can occur whether you are using the randomForest
package in Oracle R Distribution (ORD) or the open source randomForest
package 4.6-10. Using ore.randomForest
and ORD can require less memory than using ore.randomForest
with the open source alternative. If you use the open source randomForest
package, Oracle Machine Learning for R issues a warning.
Function ore.randomForest
uses the global option ore.parallel
to determine the degree of parallelism to employ. The function returns an ore.randomForest
object.
An invocation of the scoring method predict
on an ore.randomForest
object also runs in parallel on the database server. The cache.model
argument specifies whether to cache the entire Random Forest model in memory during prediction. If sufficient memory is available, use the default cache.model
value of TRUE
for better performance.
The grabTree
method returns an ore.frame
object that contains information on the specified tree. Each row of the ore.frame
represents one node of the tree.
Note:
Function ore.randomForest
loads a copy of the training data for each embedded R session executing in parallel. For large datasets, this can exceed the amount of available memory. Oracle recommends that you adjust the number of parallel processes and the amount of available memory accordingly. The global option ore.parallel
specifies the number of parallel processes. For information on controlling the amount of memory used by embedded R execution processes, see Controlling Memory Used by Embedded R in Oracle Machine Learning
for R Installation and Administration Guide.
Settings for a Random Forest Model
The following table lists the settings that apply to Random Forest models.
Table 4-4 Random Forest Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value |
|
|
Number of trees in the forest Default is |
|
|
Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data. |
Example 4-7 Using ore.randomForest
# Using the iris dataset
IRIS <- ore.push(iris)
mod <- ore.randomForest(Species~., IRIS)
tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
table(ans$Species, ans$prediction)
# Using the infert dataset
INFERT <- ore.push(infert)
formula <- case ~ age + parity + education + spontaneous + induced
rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
tree <- grabTree(rfMod, k = 500)
rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
confusion.matrix <- with(rfPred, table(case, prediction))
confusion.matrix
Listing for This Example
R> # Using the iris dataset
R> IRIS <- ore.push(iris)
R> mod <- ore.randomForest(Species~., IRIS)
R> tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
R> ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
R> table(ans$Species, ans$prediction)
setosa versicolor virginica
setosa 50 0 0
versicolor 0 50 0
virginica 0 0 50
# Using the infert dataset
R> INFERT <- ore.push(infert)
R> formula <- case ~ age + parity + education + spontaneous + induced
R>
R> rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
R> tree <- grabTree(rfMod, k = 500)
R>
R> rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
R>
R> confusion.matrix <- with(rfPred, table(case, prediction))
R> confusion.matrix
prediction
case 0 1
0 154 11
1 27 56
Parent topic: Build Oracle Machine Learning for R Models
4.2 Build Oracle Machine Learning for SQL Models
Use the functions in the OREdm
package of Oracle Machine Learning for R to build Oracle Machine Learning for SQL models in R.
These functions are described in the following topics:
- About Building OML4SQL Models using OML4R
Oracle Machine Learning for SQL functions can process tables, views, star schemas, transactional data, and unstructured data. - Association Rules
Theore.odmAssocRules
function implements the Apriori algorithm to find frequent itemsets and generate an association model. - Build an Attribute Importance Model
Attribute importance ranks attributes according to their significance in predicting a target. - Decision Tree
Theore.odmDT
function uses the OML4SQL Decision Tree algorithm, which is based on conditional probabilities. - Expectation Maximization
Theore.odmEM
function creates a model that uses the OML4SQL Expectation Maximization (EM) algorithm. - Explicit Semantic Analysis
Theore.odmESA
function creates a model that uses the OML4SQL Explicit Semantic Analysis (ESA) algorithm. - Build an Extensible R Algorithm Model
Theore.odmRAlg
function creates an Extensible R algorithm model using OML4SQL. - Generalized Linear Models
Theore.odmGLM
function builds a Generalized Linear Model (GLM) model, which includes and extends the class of linear models (linear regression). - k-Means
Theore.odmKM
function uses the OML4SQL k-Means (KM) algorithm, a distance-based clustering algorithm that partitions data into a specified number of clusters. - Naive Bayes
Theore.odmNB
function builds an OML4SQL Naive Bayes model. - Non-Negative Matrix Factorization
Theore.odmNMF
function builds an OML4SQL Non-Negative Matrix Factorization (NMF) model for feature extraction. - Orthogonal Partitioning Cluster
Theore.odmOC
function builds an OML4SQL model using the Orthogonal Partitioning Cluster (O-Cluster) algorithm. - Singular Value Decomposition
Theore.odmSVD
function creates a model that uses the OML4SQL Singular Value Decomposition (SVD) algorithm. - Support Vector Machine
Theore.odmSVM
function builds an OML4R Support Vector Machine (SVM) model. - Build a Partitioned Model
A partitioned model is an ensemble model that consists of multiple sub-models, one for each partition of the data. - Build a Text Processing Model
A text processing model usesctx.settings
arguments to specify Oracle Text attribute settings.
Parent topic: Build Models in Oracle Machine Learning for R
4.2.1 About Building OML4SQL Models using OML4R
Oracle Machine Learning for SQL functions can process tables, views, star schemas, transactional data, and unstructured data.
These OREdm
package functions provide R interfaces that use arguments that conform to typical R usage for corresponding predictive analytics and OML4SQL functions.
This section has the following topics:
- OML4SQL Models Supported by OML4R
- About In-Database Model Names and Renaming with OML4R Functions
In eachOREdm
R model object, the slotname
is the name of the underlying OML4SQL model generated by theOREdm
function. - Specify Model Settings
Functions in theOREdm
package have an argument that specifies settings for an Oracle Machine Learning for SQL model and some have an argument for setting text processing parameters.
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.1.1 OML4SQL Models Supported by OML4R
The functions in the OREdm
package provide access to the Oracle Machine Learning for SQL in-database machine learning functionality of Oracle Database. You use these functions to build OML4SQL models in the database.
The following table lists the OML4R functions that build OML4SQL models and the corresponding OML4SQL algorithms and functions.
Table 4-5 Oracle Machine Learning for R Model Functions
OML4R Function | OML4SQL Algorithm | OML4SQL Function |
---|---|---|
|
Minimum Description Length |
Attribute importance for classification or regression |
|
Apriori |
Association rules |
|
Decision Tree |
Classification |
(12.2 feature) |
Expectation Maximization |
Clustering |
(12.2 feature) |
Explicit Semantic Analysis |
Feature extraction |
|
Generalized Linear Models |
Classification and regression |
|
k-Means |
Clustering |
|
Naive Bayes |
Classification |
|
Non-Negative Matrix Factorization |
Feature extraction |
|
Orthogonal Partitioning Cluster (O-Cluster) |
Clustering |
(12.2 feature) |
Extensible R Algorithm |
Association rules, attribute importance, classification, clustering, feature extraction, and regression |
(12.2 feature) |
Singular Value Decomposition |
Feature extraction |
|
Support Vector Machines |
Classification and regression |
Parent topic: About Building OML4SQL Models using OML4R
4.2.1.2 About In-Database Model Names and Renaming with OML4R Functions
In each OREdm
R model object, the slot name
is the name of the underlying OML4SQL model generated by the OREdm
function.
By default, models built using OREdm
functions are transient objects; they do not persist past the R session in which they were built unless they are explicitly saved in an OML4R datastore or specified with an explicit name when created. OML4SQL models built using Data Miner or SQL, on the other hand, exist until they are explicitly dropped.
R proxy objects can be saved or persisted. Saving a model object generated by an OREdm
function allows it to exist across R sessions and keeps the corresponding in-database machine learning model object in place. While the OREdm
model exists, you can export and import the in-database model object and use it independent of the OML4R object.
You can use the MODEL_NAME
parameter in odm.settings
to explicitly name an in-database model object created in the database. The named in-database model object persists in the database just like those created using Oracle Data Miner or SQL.
Example 4-8 Using MODEL_NAME
parameter to explicitly name in-database model proxy object
This example demonstrates building a Random Forest Model and naming the model using explicit settings. The example uses the MODEL_NAME
parameter in odm.settings
to explicitly name an in-database model object created in the database.
ore.exec("BEGIN DBMS_DATA_MINING.DROP_MODEL(model_name=> 'RF_CLASSIFICATION_MODEL'); EXCEPTION WHEN others THEN null; END;") settings = list(RFOR_MTRY = 3, RFOR_NUM_TREES = 100, RFOR_SAMPLING_RATIO =0.5, model_name="RF_CLASSIFICATION_MODEL") MOD2 <- ore.odmRF(AFFINITY_CARD~., DEMO_DF.train, odm.settings= settings) RES2 <- predict(MOD2, DEMO_DF.test, type= c("class","raw"), norm.votes=TRUE, cache.model=TRUE, supplemental.cols=c("CUST_ID", "AFFINITY_CARD", "BOOKKEEPING_APPLICATION", "BULK_PACK_DISKETTES", "EDUCATION", "FLAT_PANEL_MONITOR", "HOME_THEATER_PACKAGE", "HOUSEHOLD_SIZE", "OCCUPATION", "OS_DOC_SET_KANJI", "PRINTER_SUPPLIES", "YRS_RESIDENCE", "Y_BOX_GAMES"))
In the above code, the model named RF_CLASSIFICATION_MODEL
should be dropped if it already exists because it will throw an exception while we try to build a model with the same name. The In-database Random Forest Classification model named RF_CLASSIFCIATION_MODEL
is built using the specified settings and the model is referenced by the variable MOD2
. The predictions are made using the random forest model MOD2
and the AFFINITY_CARD
as the predictor variable for the test data set DEMO DF test
and the result is stored in the local session in variable RES2
, and the model persists in the database.
While the R model exists or if you have assigned the model a user-specified name, use the OML4SQL model name to access the OML4SQL model through other interfaces, including:
-
Any SQL interface, such as SQL*Plus or SQL Developer
-
Oracle Data Miner
In particular, the model can be used with the OML4SQL prediction functions.
With Oracle Data Miner you can do the following:
-
Get a list of available models
-
Use model views to inspect model details
-
Score appropriately transformed data
Note:
Any explicit user-performed transformations in the R space are not carried over into SQL scoring or Oracle Data Miner. Users can also get a list of models using SQL for inspecting model details or for scoring appropriately transformed data.Parent topic: About Building OML4SQL Models using OML4R
4.2.1.3 Specify Model Settings
Functions in the OREdm
package have an argument that specifies settings for an Oracle Machine Learning for SQL model and some have an argument for setting text processing parameters.
General Parameter Settings
With the odm.settings
argument to an OREdm
function, you can specify a list of OML4SQL parameter settings. Each list element's name and value refer to the parameter setting name and value, respectively. The setting value must be numeric or string. Refer to Specify Model Settings in Oracle Machine Learning
for SQL User’s Guide for each algorithm's valid settings.
The settings
function returns a data.frame
that lists each OML4SQL parameter setting name and value pair used to build the model.
Text Processing Attribute Settings
Some OREdm
functions have a ctx.settings
argument that specifies text processing attribute settings with which you can specify Oracle Text attribute-specific settings. With the odm.settings
argument, you can specify the Oracle text policy, the minimal number of documents in which each token occurs, and the maximum number of distinct features for text processing. With the ctx.settings
argument, you specify the columns that should be treated as text and the type of text transformation to apply.
The ctx.settings
argument applies to the following functions:
-
ore.odmESA
, Explicit Semantic Analysis -
ore.odmGLM
, Generalized Linear Models -
ore.odmKMeans
, k-Means -
ore.odmNMF
, Non-Negative Matrix Factorization -
ore.odmSVD
, Singular Value Decomposition -
ore.odmSVM
, Support Vector Machine
Note:
To create an Oracle Text policy, the user must have theCTXSYS.CTX_DDL
privilege.
See Also:
Create a Model that Includes Machine Learning Operations on Text in Oracle Machine Learning for SQL User’s Guide for valid text attribute values.Parent topic: About Building OML4SQL Models using OML4R
4.2.2 Association Rules
The ore.odmAssocRules
function implements the Apriori algorithm to find frequent itemsets and generate an association model.
The function finds the co-occurrence of items in large volumes of transactional data such as in market basket analysis. An association rule identifies a pattern in the data in which the appearance of a set of items in a transactional record implies another set of items. The groups of items used to form rules must pass a minimum threshold according to how frequently they occur (the support of the rule) and how often the consequent follows the antecedent (the confidence of the rule). Association models generate all rules that have support and confidence greater than user-specified thresholds. The Apriori algorithm is efficient, and scales well with respect to the number of transactions, number of items, and number of itemsets and rules produced.
The formula
specification has the form ~ terms
, where terms
is a series of column names to include in the analysis. Multiple column names are specified using +
between column names. Use ~ .
if all columns in the data should be used for model building. To exclude columns, use -
before each column name to exclude. Functions can be applied to the items in terms
to realize transformations.
The ore.odmAssocRules
function accepts data in the following forms:
-
Transactional data
-
Multi-record case data using item id and item value
-
Relational data
For examples of specifying the forms of data and for information on the arguments of the function, call help(ore.odmAssocRules)
.
The function rules
returns an object of class ore.rules
, which specifies a set of association rules. You can pull an ore.rules
object into memory in a local R session by using ore.pull
. The local in-memory object is of class rules
defined in the arules
package. See help(ore.rules)
.
The function itemsets
returns an object of class ore.itemsets
, which specifies a set of itemsets. You can pull an ore.itemsets
object into memory in a local R session by using ore.pull
. The local in-memory object is of class itemsets
defined in the arules
package. See help(ore.itemsets)
.
Settings for an Association Rules Model
The following table lists the settings that apply to Association Rules models.
Table 4-6 Association Rules Model Settings
Setting Name | Setting Value | Description |
---|---|---|
ASSO_ABS_ERROR |
|
Specifies the absolute error for the association rules sampling. A smaller value of The default value is |
|
|
Specifies the columns to aggregate. It is a comma separated list of strings containing the names of the columns for aggregation. The number of columns in the list must be <= 10. You can set An item value is not mandatory. The default value is For each item, you may supply several columns to aggregate. However, doing so requires more memory to buffer the extra data and also affects performance because of the larger input data set and increased operations. |
|
|
Sets Including Rules for the antecedent: it is a comma separated list of strings, at least one of which must appear in the antecedent part of each reported association rule. The default value is |
|
|
Sets Excluding Rules for the antecedent: it is a comma separated list of strings, none of which can appear in the antecedent part of each reported association rule. The default value is |
|
|
Specifies the confidence level for an association rules sample. A larger value of |
|
|
Sets Including Rules for the consequent: it is a comma separated list of strings, at least one of which must appear in the consequent part of each reported association rule. The default value is |
|
|
Sets Excluding Rules for the consequent: it is a comma separated list of strings, none of which can appear in the consequent part of a reported association rule. You can use the excluding rule to reduce the data that must be stored, but you may be required to build extra models for executing different Including or Excluding Rules. The default value is |
ASSO_EX_RULES |
|
Sets Excluding Rules applied for each association rule: it is a comma separated list of strings that cannot appear in an association rule. No rule can contain any item in the list. The default value is |
|
|
Sets Including Rules applied for each association rule: it is a comma separated list of strings, at least one of which must appear in each reported association rule, either as antecedent or as consequent The default value |
ASSO_MAX_RULE_LENGTH |
TO_CHAR( 2<= numeric_expr <=20) |
Maximum rule length for association rules. The default value is |
ASSO_MIN_CONFIDENCE |
TO_CHAR( 0<= numeric_expr <=1) |
Minimum confidence for association rules. The default value is |
ASSO_MIN_REV_CONFIDENCE |
TO_CHAR( 0<= numeric_expr <=1) |
Sets the Minimum Reverse Confidence that each rule should satisfy. The Reverse Confidence of a rule is defined as the number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs. The value is real number between 0 and 1. The default value is |
|
TO_CHAR( 0<= numeric_expr <=1) |
Minimum support for association rules. The default value is |
|
TO_CHAR( 0<= numeric_expr <=1) |
Minimum absolute support that each rule must satisfy. The value must be an integer. The default value is |
|
||
|
column_name |
The name of a column that contains the items in a transaction. When you specify this setting, the algorithm expects the data to be presented in native transactional format, consisting of two columns:
|
ODMS_ITEM_VALUE_COLUMN_ NAME |
column_name |
The name of a column that contains a value associated with each item in a transaction. Use this setting only when you have specified a value for If you also use
If The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples). |
Example 4-9 Using the ore.odmAssocRules Function
This example builds an association model on a transactional data set. The packages arules
and arulesViz
are required to pull the resulting rules and itemsets into the client R session memory and be visualized. The graph of the rules appears in the figure following the example.
# Load the arules and arulesViz packages. library(arules) library(arulesViz) # Create some transactional data. id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3) item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e") # Push the data to the database as an ore.frame object. transdata_of <- ore.push(data.frame(ID = id, ITEM = item)) # Build a model with specifications. ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID", item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6, max.rule.length = 3) # Generate itemsets and rules of the model. itemsets <- itemsets(ar.mod1) rules <- rules(ar.mod1) # Convert the rules to the rules object in arules package. rules.arules <- ore.pull(rules) inspect(rules.arules) # Convert itemsets to the itemsets object in arules package. itemsets.arules <- ore.pull(itemsets) inspect(itemsets.arules) # Plot the rules graph. plot(rules.arules, method = "graph", interactive = TRUE)
Listing for This Example
R> # Load the arules and arulesViz packages. R> library(arules) R> library(arulesViz) R> # Create some transactional data. R> id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3) R> item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e") R> # Push the data to the database as an ore.frame object. R> transdata_of <- ore.push(data.frame(ID = id, ITEM = item)) R> # Build a model with specifications. R> ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID", + item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6, + max.rule.length = 3) R> # Generate itemsets and rules of the model. R> itemsets <- itemsets(ar.mod1) R> rules <- rules(ar.mod1) R> # Convert the rules to the rules object in arules package. R> rules.arules <- ore.pull(rules) R> inspect(rules.arules) lhs rhs support confidence lift 1 {b} => {e} 1.0000000 1.0000000 1 2 {e} => {b} 1.0000000 1.0000000 1 3 {c} => {e} 0.6666667 1.0000000 1 4 {d, e} => {b} 0.6666667 1.0000000 1 5 {c, e} => {b} 0.6666667 1.0000000 1 6 {b, d} => {e} 0.6666667 1.0000000 1 7 {b, c} => {e} 0.6666667 1.0000000 1 8 {d} => {b} 0.6666667 1.0000000 1 9 {d} => {e} 0.6666667 1.0000000 1 10 {c} => {b} 0.6666667 1.0000000 1 11 {b} => {d} 0.6666667 0.6666667 1 12 {b} => {c} 0.6666667 0.6666667 1 13 {e} => {d} 0.6666667 0.6666667 1 14 {e} => {c} 0.6666667 0.6666667 1 15 {b, e} => {d} 0.6666667 0.6666667 1 16 {b, e} => {c} 0.6666667 0.6666667 1 R> # Convert itemsets to the itemsets object in arules package. R> itemsets.arules <- ore.pull(itemsets) R> inspect(itemsets.arules) items support 1 {b} 1.0000000 2 {e} 1.0000000 3 {b, e} 1.0000000 4 {c} 0.6666667 5 {d} 0.6666667 6 {b, c} 0.6666667 7 {b, d} 0.6666667 8 {c, e} 0.6666667 9 {d, e} 0.6666667 10 {b, c, e} 0.6666667 11 {b, d, e} 0.6666667 R> # Plot the rules graph. R> plot(rules.arules, method = "graph", interactive = TRUE)
Figure 4-1 A Visual Demonstration of the Association Rules
Description of "Figure 4-1 A Visual Demonstration of the Association Rules"
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.3 Build an Attribute Importance Model
Attribute importance ranks attributes according to their significance in predicting a target.
The ore.odmAI
function uses the OML4SQL Minimum Description Length algorithm to calculate attribute importance. Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).
MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build OML4SQL attribute importance models.
Attribute importance models built using OML4SQL cannot be applied to new data.
The ore.odmAI
function produces a ranking of attributes and their importance values.
Note:
OREdm
attribute importance models differ from OML4SQL attribute importance models in these ways: a model object is not retained, and an R model object is not returned. Only the importance ranking created by the model is returned.
For information on the ore.odmAI
function arguments, invoke help(ore.odmAI)
.
Example 4-10 Using the ore.odmAI Function
This example pushes the data.frame
iris
to the database as the ore.frame
iris_of
. The example then builds an attribute importance model.
iris_of <- ore.push(iris) ore.odmAI(Species ~ ., iris_of)
Listing for This Example
R> iris_of <- ore.push(iris) R> ore.odmAI(Species ~ ., iris_of) Call: ore.odmAI(formula = Species ~ ., data = iris_of) Importance: importance rank Petal.Width 1.1701851 1 Petal.Length 1.1494402 2 Sepal.Length 0.5248815 3 Sepal.Width 0.2504077 4
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.4 Decision Tree
The ore.odmDT
function uses the OML4SQL Decision Tree algorithm, which is based on conditional probabilities.
Decision Tree models are classification models. Decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and be used within a database to identify a set of records.
A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.
During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The ore.odmDT
function offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
For information on the ore.odmDT
function arguments, call help(ore.odmDT)
.
Settings for a Decision Tree Model
The following table lists settings that apply to Decision Tree models.
Table 4-7 Decision Tree Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split values are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is by a metric. Decision trees can use either Gini ( |
|
For Decision Tree:
For Random Forest:
|
Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). For Decision Tree, the default is For Random Forest, the default is |
|
|
The minimum number of training rows in a node expressed as a percentage of the rows in the training data.
Default is |
|
|
The minimum number of rows required to consider splitting a node expressed as a percentage of the training rows.
Default is |
|
|
The minimum number of rows in a node. Default is |
|
|
Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value. Default is |
Example 4-11 Using the ore.odmDT Function
This example creates an input ore.frame
, builds a model, makes predictions, and generates a confusion matrix.
m <- mtcars m$gear <- as.factor(m$gear) m$cyl <- as.factor(m$cyl) m$vs <- as.factor(m$vs) m$ID <- 1:nrow(m) mtcars_of <- ore.push(m) row.names(mtcars_of) <- mtcars_of # Build the model. dt.mod <- ore.odmDT(gear ~ ., mtcars_of) summary(dt.mod) # Make predictions and generate a confusion matrix. dt.res <- predict (dt.mod, mtcars_of, "gear") with(dt.res, table(gear, PREDICTION))
Listing for This Example
R> m <- mtcars R> m$gear <- as.factor(m$gear) R> m$cyl <- as.factor(m$cyl) R> m$vs <- as.factor(m$vs) R> m$ID <- 1:nrow(m) R> mtcars_of <- ore.push(m) R> row.names(mtcars_of) <- mtcars_of R> # Build the model. R> dt.mod <- ore.odmDT(gear ~ ., mtcars_of) R> summary(dt.mod) Call: ore.odmDT(formula = gear ~ ., data = mtcars_of) n = 32 Nodes: parent node.id row.count prediction split 1 NA 0 32 3 <NA> 2 0 1 16 4 (disp <= 196.299999999999995) 3 0 2 16 3 (disp > 196.299999999999995) surrogate full.splits 1 <NA> <NA> 2 (cyl in ("4" "6" )) (disp <= 196.299999999999995) 3 (cyl in ("8" )) (disp > 196.299999999999995) Settings: value prep.auto on impurity.metric impurity.gini term.max.depth 7 term.minpct.node 0.05 term.minpct.split 0.1 term.minrec.node 10 term.minrec.split 20 R> # Make predictions and generate a confusion matrix. R> dt.res <- predict (dt.mod, mtcars_of, "gear") R> with(dt.res, table(gear, PREDICTION)) PREDICTION gear 3 4 3 14 1 4 0 12 5 2 3
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.5 Expectation Maximization
The ore.odmEM
function creates a model that uses the OML4SQL Expectation Maximization (EM) algorithm.
EM is a density estimation algorithm that performs probabilistic clustering. In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.
For information on the ore.odmEM
function arguments, call help(ore.odmEM)
.
Settings for an Expectation Maximization Model
The following table lists settings that apply to Expectation Maximization Models.
Table 4-8 Expectation Maximization Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Whether or not to include uncorrelated attributes in the model. When Note: This setting applies only to attributes that are not nested.Default is system-determined. |
|
|
Maximum number of correlated attributes to include in the model. Note: This setting applies only to attributes that are not nested (2D).The default value is |
|
|
The distribution for modeling numeric attributes. Applies to the input table or view as a whole and does not allow per-attribute specifications.
The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numeric attributes are modeled using the same type of distribution. When the distribution is systemdetermined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data. The default value is |
|
|
Number of equi-width bins that will be used for gathering cluster statistics for numeric columns.
Default is |
|
|
Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns. Default is |
|
|
Specifies the number of quantile bins that will be used for modeling numeric columns with multivalued Bernoulli distributions. Default is system-determined. |
|
|
Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions. Default is system-determined. |
Example 4-12 Using the ore.odmEM Function
## Synthetic 2-dimensional data set
set.seed(7654)
x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
X <- ore.push (data.frame(ID=1:100,x))
rownames(X) <- X$ID
em.mod <- NULL
em.mod <- ore.odmEM(~., X, num.centers = 2L)
summary(em.mod)
rules(em.mod)
clusterhists(em.mod)
histogram(em.mod)
em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
head(em.res)
em.res.local <- ore.pull(em.res)
plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)
head(predict(em.mod,X))
head(predict(em.mod,X,type=c("class","raw")))
head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))
Listing for This Example
R> ## Synthetic 2-dimensional data set
R>
R> set.seed(7654)
R>
R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
+ matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R>
R> X <- ore.push (data.frame(ID=1:100,x))
R> rownames(X) <- X$ID
R>
R> em.mod <- NULL
R> em.mod <- ore.odmEM(~., X, num.centers = 2L)
R>
R> summary(em.mod)
Call:
ore.odmEM(formula = ~., data = X, num.centers = 2L)
Settings:
value
clus.num.clusters 2
cluster.components cluster.comp.enable
cluster.statistics clus.stats.enable
cluster.thresh 2
linkage.function linkage.single
loglike.improvement .001
max.num.attr.2d 50
min.pct.attr.support .1
model.search model.search.disable
num.components 20
num.distribution num.distr.system
num.equiwidth.bins 11
num.iterations 100
num.projections 50
random.seed 0
remove.components remove.comps.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto ON
Centers:
MEAN.ID MEAN.x MEAN.y
2 25.5 4.03 3.96
3 75.5 1.93 1.99
R> rules(em.mod)
cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var lhs.var.support lhs.var.conf predicate
1 1 100 1.0 100 1.00 ID 100 0.0000 ID <= 100
2 1 100 1.0 100 1.00 ID 100 0.0000 ID >= 1
3 1 100 1.0 100 1.00 x 100 0.2500 x <= 4.6298
4 1 100 1.0 100 1.00 x 100 0.2500 x >= 1.3987
5 1 100 1.0 100 1.00 y 100 0.3000 y <= 4.5846
6 1 100 1.0 100 1.00 y 100 0.3000 y >= 1.3546
7 2 50 0.5 50 1.00 ID 50 0.0937 ID <= 50.5
8 2 50 0.5 50 1.00 ID 50 0.0937 ID >= 1
9 2 50 0.5 50 1.00 x 50 0.0937 x <= 4.6298
10 2 50 0.5 50 1.00 x 50 0.0937 x > 3.3374
11 2 50 0.5 50 1.00 y 50 0.0937 y <= 4.5846
12 2 50 0.5 50 1.00 y 50 0.0937 y > 2.9696
13 3 50 0.5 50 0.98 ID 49 0.0937 ID <= 100
14 3 50 0.5 50 0.98 ID 49 0.0937 ID > 50.5
15 3 50 0.5 49 0.98 x 49 0.0937 x <= 2.368
16 3 50 0.5 49 0.98 x 49 0.0937 x >= 1.3987
17 3 50 0.5 49 0.98 y 49 0.0937 y <= 2.6466
18 3 50 0.5 49 0.98 y 49 0.0937 y >= 1.3546
R> clusterhists(em.mod)
cluster.id variable bin.id lower.bound upper.bound label count
1 1 ID 1 1.00 10.90 1:10.9 10
2 1 ID 2 10.90 20.80 10.9:20.8 10
3 1 ID 3 20.80 30.70 20.8:30.7 10
4 1 ID 4 30.70 40.60 30.7:40.6 10
5 1 ID 5 40.60 50.50 40.6:50.5 10
6 1 ID 6 50.50 60.40 50.5:60.4 10
7 1 ID 7 60.40 70.30 60.4:70.3 10
8 1 ID 8 70.30 80.20 70.3:80.2 10
9 1 ID 9 80.20 90.10 80.2:90.1 10
10 1 ID 10 90.10 100.00 90.1:100 10
11 1 ID 11 NA NA : 0
12 1 x 1 1.40 1.72 1.399:1.722 11
13 1 x 2 1.72 2.04 1.722:2.045 22
14 1 x 3 2.04 2.37 2.045:2.368 16
15 1 x 4 2.37 2.69 2.368:2.691 1
16 1 x 5 2.69 3.01 2.691:3.014 0
17 1 x 6 3.01 3.34 3.014:3.337 0
18 1 x 7 3.34 3.66 3.337:3.66 4
19 1 x 8 3.66 3.98 3.66:3.984 18
20 1 x 9 3.98 4.31 3.984:4.307 22
21 1 x 10 4.31 4.63 4.307:4.63 6
22 1 x 11 NA NA : 0
23 1 y 1 1.35 1.68 1.355:1.678 7
24 1 y 2 1.68 2.00 1.678:2.001 18
25 1 y 3 2.00 2.32 2.001:2.324 18
26 1 y 4 2.32 2.65 2.324:2.647 6
27 1 y 5 2.65 2.97 2.647:2.97 1
28 1 y 6 2.97 3.29 2.97:3.293 4
29 1 y 7 3.29 3.62 3.293:3.616 3
30 1 y 8 3.62 3.94 3.616:3.939 16
31 1 y 9 3.94 4.26 3.939:4.262 16
32 1 y 10 4.26 4.58 4.262:4.585 11
33 1 y 11 NA NA : 0
34 2 ID 1 1.00 10.90 1:10.9 10
35 2 ID 2 10.90 20.80 10.9:20.8 10
36 2 ID 3 20.80 30.70 20.8:30.7 10
37 2 ID 4 30.70 40.60 30.7:40.6 10
38 2 ID 5 40.60 50.50 40.6:50.5 10
39 2 ID 6 50.50 60.40 50.5:60.4 0
40 2 ID 7 60.40 70.30 60.4:70.3 0
41 2 ID 8 70.30 80.20 70.3:80.2 0
42 2 ID 9 80.20 90.10 80.2:90.1 0
43 2 ID 10 90.10 100.00 90.1:100 0
44 2 ID 11 NA NA : 0
45 2 x 1 1.40 1.72 1.399:1.722 0
46 2 x 2 1.72 2.04 1.722:2.045 0
47 2 x 3 2.04 2.37 2.045:2.368 0
48 2 x 4 2.37 2.69 2.368:2.691 0
49 2 x 5 2.69 3.01 2.691:3.014 0
50 2 x 6 3.01 3.34 3.014:3.337 0
51 2 x 7 3.34 3.66 3.337:3.66 4
52 2 x 8 3.66 3.98 3.66:3.984 18
53 2 x 9 3.98 4.31 3.984:4.307 22
54 2 x 10 4.31 4.63 4.307:4.63 6
55 2 x 11 NA NA : 0
56 2 y 1 1.35 1.68 1.355:1.678 0
57 2 y 2 1.68 2.00 1.678:2.001 0
58 2 y 3 2.00 2.32 2.001:2.324 0
59 2 y 4 2.32 2.65 2.324:2.647 0
60 2 y 5 2.65 2.97 2.647:2.97 0
61 2 y 6 2.97 3.29 2.97:3.293 4
62 2 y 7 3.29 3.62 3.293:3.616 3
63 2 y 8 3.62 3.94 3.616:3.939 16
64 2 y 9 3.94 4.26 3.939:4.262 16
65 2 y 10 4.26 4.58 4.262:4.585 11
66 2 y 11 NA NA : 0
67 3 ID 1 1.00 10.90 1:10.9 0
68 3 ID 2 10.90 20.80 10.9:20.8 0
69 3 ID 3 20.80 30.70 20.8:30.7 0
70 3 ID 4 30.70 40.60 30.7:40.6 0
71 3 ID 5 40.60 50.50 40.6:50.5 0
72 3 ID 6 50.50 60.40 50.5:60.4 10
73 3 ID 7 60.40 70.30 60.4:70.3 10
74 3 ID 8 70.30 80.20 70.3:80.2 10
75 3 ID 9 80.20 90.10 80.2:90.1 10
76 3 ID 10 90.10 100.00 90.1:100 10
77 3 ID 11 NA NA : 0
78 3 x 1 1.40 1.72 1.399:1.722 11
79 3 x 2 1.72 2.04 1.722:2.045 22
80 3 x 3 2.04 2.37 2.045:2.368 16
81 3 x 4 2.37 2.69 2.368:2.691 1
82 3 x 5 2.69 3.01 2.691:3.014 0
83 3 x 6 3.01 3.34 3.014:3.337 0
84 3 x 7 3.34 3.66 3.337:3.66 0
85 3 x 8 3.66 3.98 3.66:3.984 0
86 3 x 9 3.98 4.31 3.984:4.307 0
87 3 x 10 4.31 4.63 4.307:4.63 0
88 3 x 11 NA NA : 0
89 3 y 1 1.35 1.68 1.355:1.678 7
90 3 y 2 1.68 2.00 1.678:2.001 18
91 3 y 3 2.00 2.32 2.001:2.324 18
92 3 y 4 2.32 2.65 2.324:2.647 6
93 3 y 5 2.65 2.97 2.647:2.97 1
94 3 y 6 2.97 3.29 2.97:3.293 0
95 3 y 7 3.29 3.62 3.293:3.616 0
96 3 y 8 3.62 3.94 3.616:3.939 0
97 3 y 9 3.94 4.26 3.939:4.262 0
98 3 y 10 4.26 4.58 4.262:4.585 0
99 3 y 11 NA NA : 0
R> histogram(em.mod)
R>
R> em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
R> head(em.res)
x y CLUSTER_ID
1 4.15 3.63 2
2 3.88 4.13 2
3 3.72 4.10 2
4 3.78 4.14 2
5 4.22 4.35 2
6 4.07 3.62 2
R> em.res.local <- ore.pull(em.res)
R> plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
R> points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)
R>
R> head(predict(em.mod,X))
'2' '3' CLUSTER_ID
1 1 1.14e-54 2
2 1 1.63e-55 2
3 1 1.10e-51 2
4 1 1.53e-52 2
5 1 9.02e-62 2
6 1 3.20e-49 2
R> head(predict(em.mod,X,type=c("class","raw")))
'2' '3' CLUSTER_ID
1 1 1.14e-54 2
2 1 1.63e-55 2
3 1 1.10e-51 2
4 1 1.53e-52 2
5 1 9.02e-62 2
6 1 3.20e-49 2
R> head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
'2' '3' x y CLUSTER_ID
1 1 1.14e-54 4.15 3.63 2
2 1 1.63e-55 3.88 4.13 2
3 1 1.10e-51 3.72 4.10 2
4 1 1.53e-52 3.78 4.14 2
5 1 9.02e-62 4.22 4.35 2
6 1 3.20e-49 4.07 3.62 2
R> head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))
x y '2' '3'
1 4.15 3.63 1 1.14e-54
2 3.88 4.13 1 1.63e-55
3 3.72 4.10 1 1.10e-51
4 3.78 4.14 1 1.53e-52
5 4.22 4.35 1 9.02e-62
6 4.07 3.62 1 3.20e-49
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.6 Explicit Semantic Analysis
The ore.odmESA
function creates a model that uses the OML4SQL Explicit Semantic Analysis (ESA) algorithm.
ESA is an unsupervised algorithm used by OML4SQL for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base.
Explicit knowledge often exists in text form. Multiple knowledge bases are available as collections of text documents. These knowledge bases can be generic, for example, Wikipedia, or domain-specific. Data preparation transforms the text into vectors that capture attribute-concept associations.
For information on the ore.odmESA
function arguments, call help(ore.odmESA)
.
Settings for an Explicit Semantic Analysis Model
The following table lists settings that apply to Explicit Semantic Analysis models.
Table 4-9 Explicit Semantic Analysis Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
Non-negative number |
This setting thresholds a small value for attribute weights in the transformed build data. The default is |
|
Text input Non-text input is |
This setting determines the minimum number of non-zero entries that need to be present in an input row. The default is 100 for text input and 0 for non-text input. |
|
A positive integer |
This setting controls the maximum number of features per attribute. The default is 1000 .
|
Example 4-13 Using the ore.odmESA Function
title <- c('Aids in Africa: Planning for a long war',
'Mars rover maneuvers for rim shot',
'Mars express confirms presence of water at Mars south pole',
'NASA announces major Mars rover finding',
'Drug access, Asia threat in focus at AIDS summit',
'NASA Mars Odyssey THEMIS image: typical crater',
'Road blocks for Aids')
# TEXT contents in character column
df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
ESA_TEXT <- ore.push(df)
# TEXT contents in clob column
attr(df$TITLE, "ora.type") <- "clob"
ESA_TEXT_CLOB <- ore.push(df)
# Create text policy (CTXSYS.CTX_DDL privilege is required)
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
# Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# ESA algorithm settings in odm.settings
esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
odm.settings = list(case_id_column_name = "CUST_ID",
ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
ODMS_TEXT_MIN_DOCUMENTS = 1,
ODMS_TEXT_MAX_FEATURES = 3,
ESAS_MIN_ITEMS = 1,
ESAS_VALUE_THRESHOLD = 0.0001,
ESAS_TOPN_FEATURES = 3))
class(esa.mod)
summary(esa.mod)
settings(esa.mod)
features(esa.mod)
predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")
# Use ctx.settings to specify a character column as TEXT and
# the same settings as above as well as TOKEN_TYPE
esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
ctx.settings = list(TITLE =
"TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
summary(esa.mod2)
settings(esa.mod2)
features(esa.mod2)
predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")
ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")
Listing for This Example
R> title <- c('Aids in Africa: Planning for a long war',
+ 'Mars rover maneuvers for rim shot',
+ 'Mars express confirms presence of water at Mars south pole',
+ 'NASA announces major Mars rover finding',
+ 'Drug access, Asia threat in focus at AIDS summit',
+ 'NASA Mars Odyssey THEMIS image: typical crater',
+ 'Road blocks for Aids')
R>
R> # TEXT contents in character column
R> df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
R> ESA_TEXT <- ore.push(df)
R>
R> # TEXT contents in clob column
R> attr(df$TITLE, "ora.type") <- "clob"
R> ESA_TEXT_CLOB <- ore.push(df)
R>
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required)
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R>
R> # Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # ESA algorithm settings in odm.settings
R> esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
+ odm.settings = list(case_id_column_name = "CUST_ID",
+ ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+ ODMS_TEXT_MIN_DOCUMENTS = 1,
+ ODMS_TEXT_MAX_FEATURES = 3,
+ ESAS_MIN_ITEMS = 1,
+ ESAS_VALUE_THRESHOLD = 0.0001,
+ ESAS_TOPN_FEATURES = 3))
R> class(esa.mod)
[1] "ore.odmESA" "ore.model"
R> summary(esa.mod)
Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT_CLOB, odm.settings = list(case_id_column_name = "CUST_ID",
ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", ODMS_TEXT_MIN_DOCUMENTS = 1,
ODMS_TEXT_MAX_FEATURES = 3, ESAS_MIN_ITEMS = 1, ESAS_VALUE_THRESHOLD = 1e-04,
ESAS_TOPN_FEATURES = 3))
Settings:
value
min.items 1
topn.features 3
value.threshold 1e-04
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
odms.text.max.features 3
odms.text.min.documents 1
odms.text.policy.name ESA_TXTPOL
prep.auto ON
Features:
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1 1 TITLE.AIDS <NA> 1.0000000
2 2 TITLE.MARS <NA> 0.4078615
3 2 TITLE.ROVER <NA> 0.9130438
4 3 TITLE.MARS <NA> 1.0000000
5 4 TITLE.NASA <NA> 0.6742695
6 4 TITLE.ROVER <NA> 0.6742695
7 5 TITLE.AIDS <NA> 1.0000000
8 6 TITLE.MARS <NA> 0.4078615
9 6 TITLE.NASA <NA> 0.9130438
10 7 TITLE.AIDS <NA> 1.0000000
R> settings(esa.mod)
SETTING_NAME SETTING_VALUE SETTING_TYPE
1 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS INPUT
2 ESAS_MIN_ITEMS 1 INPUT
3 ESAS_TOPN_FEATURES 3 INPUT
4 ESAS_VALUE_THRESHOLD 1e-04 INPUT
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO DEFAULT
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE DEFAULT
7 ODMS_TEXT_MAX_FEATURES 3 INPUT
8 ODMS_TEXT_MIN_DOCUMENTS 1 INPUT
9 ODMS_TEXT_POLICY_NAME ESA_TXTPOL INPUT
10 PREP_AUTO ON INPUT
R> features(esa.mod)
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1 1 TITLE.AIDS <NA> 1.0000000
2 2 TITLE.MARS <NA> 0.4078615
3 2 TITLE.ROVER <NA> 0.9130438
4 3 TITLE.MARS <NA> 1.0000000
5 4 TITLE.NASA <NA> 0.6742695
6 4 TITLE.ROVER <NA> 0.6742695
7 5 TITLE.AIDS <NA> 1.0000000
8 6 TITLE.MARS <NA> 0.4078615
9 6 TITLE.NASA <NA> 0.9130438
10 7 TITLE.AIDS <NA> 1.0000000
R> predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")
TITLE FEATURE_ID
1 Aids in Africa: Planning for a long war 1
2 Mars rover maneuvers for rim shot 2
3 Mars express confirms presence of water at Mars south pole 3
4 NASA announces major Mars rover finding 4
5 Drug access, Asia threat in focus at AIDS summit 1
6 NASA Mars Odyssey THEMIS image: typical crater 6
7 Road blocks for Aids 1
R>
R> # Use ctx.settings to specify a character column as TEXT and
R> # the same settings as above as well as TOKEN_TYPE
R> esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
+ odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
+ ctx.settings = list(TITLE =
+ "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
R> summary(esa.mod2)
Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT, odm.settings = list(case_id_column_name = "CUST_ID",
ESAS_MIN_ITEMS = 1), ctx.settings = list(TITLE = "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
Settings:
value
min.items 1
topn.features 1000
value.threshold .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
odms.text.max.features 300000
odms.text.min.documents 3
prep.auto ON
Features:
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1 1 TITLE.AIDS <NA> 1.0000000
2 2 TITLE.MARS <NA> 0.4078615
3 2 TITLE.ROVER <NA> 0.9130438
4 3 TITLE.MARS <NA> 1.0000000
5 4 TITLE.MARS <NA> 0.3011997
6 4 TITLE.NASA <NA> 0.6742695
7 4 TITLE.ROVER <NA> 0.6742695
8 5 TITLE.AIDS <NA> 1.0000000
9 6 TITLE.MARS <NA> 0.4078615
10 6 TITLE.NASA <NA> 0.9130438
11 7 TITLE.AIDS <NA> 1.0000000
R> settings(esa.mod2)
SETTING_NAME SETTING_VALUE SETTING_TYPE
1 ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS INPUT
2 ESAS_MIN_ITEMS 1 INPUT
3 ESAS_TOPN_FEATURES 1000 DEFAULT
4 ESAS_VALUE_THRESHOLD .00000001 DEFAULT
5 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO DEFAULT
6 ODMS_SAMPLING ODMS_SAMPLING_DISABLE DEFAULT
7 ODMS_TEXT_MAX_FEATURES 300000 DEFAULT
8 ODMS_TEXT_MIN_DOCUMENTS 3 DEFAULT
9 PREP_AUTO ON INPUT
R> features(esa.mod2)
FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1 1 TITLE.AIDS <NA> 1.0000000
2 2 TITLE.MARS <NA> 0.4078615
3 2 TITLE.ROVER <NA> 0.9130438
4 3 TITLE.MARS <NA> 1.0000000
5 4 TITLE.MARS <NA> 0.3011997
6 4 TITLE.NASA <NA> 0.6742695
7 4 TITLE.ROVER <NA> 0.6742695
8 5 TITLE.AIDS <NA> 1.0000000
9 6 TITLE.MARS <NA> 0.4078615
10 6 TITLE.NASA <NA> 0.9130438
11 7 TITLE.AIDS <NA> 1.0000000
R> predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")
TITLE FEATURE_ID
1 Aids in Africa: Planning for a long war 1
2 Mars rover maneuvers for rim shot 2
3 Mars express confirms presence of water at Mars south pole 3
4 NASA announces major Mars rover finding 4
5 Drug access, Asia threat in focus at AIDS summit 1
6 NASA Mars Odyssey THEMIS image: typical crater 6
7 Road blocks for Aids 1
R>
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.7 Build an Extensible R Algorithm Model
The ore.odmRAlg
function creates an Extensible R algorithm model using OML4SQL.
The Extensible R algorithm builds, scores, and views an R model using registered R scripts. It supports classification, regression, clustering, feature extraction, attribute importance, and association machine learning functions.
For information on the ore.odmRAlg
function arguments and for an example of using the function, call help(ore.odmRAlg)
.
Settings for an Extensible R Algorithm Model
The following table lists settings that apply to Extensible R Algorithm models.
Table 4-10 Extensible R Algorithm Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Specifies the name of an existing registered R script for R algorithm mining model build function. The R script defines an R function for the first input argument for training data and returns an R model object. For Clustering and Feature Extraction mining function model build, the R attributes dm$nclus and dm$nfeat must be set on the R model to indicate the number of clusters and features respectively.
The |
|
SELECT value param_name, ...FROM DUAL |
Specifies a list of numeric and string scalar for optional input parameters of the model build function. |
|
|
Specifies the name of an existing registered R script to score data. The script returns a |
|
|
Specifies the name of an existing registered R script for R algorithm that computes the weight (contribution) for each attribute in scoring. The script returns a |
RALG_DETAILS_FUNCTION |
R_DETAILS_FUNCTION_SCRIPT_NAME |
Specifies the name of an existing registered R script for R algorithm that produces the model information. This setting is required to generate a model view. |
RALG_DETAILS_FORMAT |
SELECT type_value column_name,FROM DUAL |
Specifies the SELECT query for the list of numeric and string scalars for the output column type and the column name of the generated model view. This setting is required to generate a model view.
|
Example 4-14 Using the ore.odmRAlg Function
library(OREembed)
digits <- getOption("digits")
options(digits = 5L)
IRIS <- ore.push(iris)
# Regression with glm
ore.scriptCreate("glm_build",
function(data, form, family)
glm(formula = form, data = data, family = family))
ore.scriptCreate("glm_score",
function(mod, data)
{ res <- predict(mod, newdata = data);
data.frame(res) })
ore.scriptCreate("glm_detail", function(mod)
data.frame(name=names(mod$coefficients),
coef=mod$coefficients))
ore.scriptList(name = "glm_build")
ore.scriptList(name = "glm_score")
ore.scriptList(name = "glm_detail")
ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
formula = c(form="Sepal.Length ~ ."),
build.function = "glm_build",
build.parameter = list(family="gaussian"),
score.function = "glm_score",
detail.function = "glm_detail",
detail.value = data.frame(name="a", coef=1))
summary(ralg.glm)
predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")
ore.scriptDrop(name = "glm_build")
ore.scriptDrop(name = "glm_score")
ore.scriptDrop(name = "glm_detail")
# Classification with nnet
ore.scriptCreate("nnet_build",
function(dat, form, sz){
require(nnet);
set.seed(1234);
nnet(formula = formula(form), data = dat,
size = sz, linout = TRUE, trace = FALSE);
},
overwrite = TRUE)
ore.scriptCreate("nnet_detail", function(mod)
data.frame(conn = mod$conn, wts = mod$wts),
overwrite = TRUE)
ore.scriptCreate("nnet_score",
function(mod, data) {
require(nnet);
res <- data.frame(predict(mod, newdata = data));
names(res) <- sort(mod$lev); res
})
ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
formula = c(form="Species ~ ."),
build.function = "nnet_build",
build.parameter = list(sz=2),
score.function = "nnet_score",
detail.function = "nnet_detail",
detail.value = data.frame(conn=1, wts =1))
summary(ralg.nnet)
predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")
ore.scriptDrop(name = "nnet_build")
ore.scriptDrop(name = "nnet_score")
ore.scriptDrop(name = "nnet_detail")
# Feature extraction with pca
# Feature extraction with pca
ore.scriptCreate("pca_build",
function(dat){
mod <- prcomp(dat, retx = FALSE)
attr(mod, "dm$nfeat") <- ncol(mod$rotation)
mod},
overwrite = TRUE)
ore.scriptCreate("pca_score",
function(mod, data) {
res <- predict(mod, data)
as.data.frame(res)},
overwrite=TRUE)
ore.scriptCreate("pca_detail",
function(mod) {
rotation_t <- t(mod$rotation)
data.frame(id = seq_along(rownames(rotation_t)),
rotation_t)},
overwrite = TRUE)
X <- IRIS[, -5L]
ralg.pca <- ore.odmRAlg(X,
mining.function = "feature_extraction",
formula = NULL,
build.function = "pca_build",
score.function = "pca_score",
detail.function = "pca_detail",
detail.value = data.frame(Feature.ID=1,
ore.pull(head(X,1L))))
summary(ralg.pca)
head(cbind(X, Pred = predict(ralg.pca, newdata = X)))
ore.scriptDrop(name = "pca_build")
ore.scriptDrop(name = "pca_score")
ore.scriptDrop(name = "pca_detail")
options(digits = digits)
Listing for This Example
R> library(OREembed)
R>
R> digits <- getOption("digits")
R> options(digits = 5L)
R>
R> IRIS <- ore.push(iris)
R>
R> # Regression with glm
R> ore.scriptCreate("glm_build",
+ function(data, form, family)
+ glm(formula = form, data = data, family = family))
R>
R> ore.scriptCreate("glm_score",
+ function(mod, data)
+ { res <- predict(mod, newdata = data);
+ data.frame(res) })
R>
R> ore.scriptCreate("glm_detail", function(mod)
+ data.frame(name=names(mod$coefficients),
+ coef=mod$coefficients))
R>
R> ore.scriptList(name = "glm_build")
NAME SCRIPT
1 glm_build function (data, form, family) \nglm(formula = form, data = data, family = family)
R> ore.scriptList(name = "glm_score")
NAME SCRIPT
1 glm_score function (mod, data) \n{\n res <- predict(mod, newdata = data)\n data.frame(res)\n}
R> ore.scriptList(name = "glm_detail")
NAME SCRIPT
1 glm_detail function (mod) \ndata.frame(name = names(mod$coefficients), coef = mod$coefficients)
R>
R> ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
+ formula = c(form="Sepal.Length ~ ."),
+ build.function = "glm_build",
+ build.parameter = list(family="gaussian"),
+ score.function = "glm_score",
+ detail.function = "glm_detail",
+ detail.value = data.frame(name="a", coef=1))
R>
R> summary(ralg.glm)
Call:
ore.odmRAlg(data = IRIS, mining.function = "regression", formula = c(form = "Sepal.Length ~ ."),
build.function = "glm_build", build.parameter = list(family = "gaussian"),
score.function = "glm_score", detail.function = "glm_detail",
detail.value = data.frame(name = "a", coef = 1))
Settings:
value
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto OFF
build.function OML_USER.glm_build
build.parameter select 'Sepal.Length ~ .' "form", 'gaussian' "family" from dual
details.format select cast('a' as varchar2(4000)) "name", 1 "coef" from dual
details.function OML_USER.glm_detail
score.function OML_USER.glm_score
name coef
1 (Intercept) 2.17127
2 Petal.Length 0.82924
3 Petal.Width -0.31516
4 Sepal.Width 0.49589
5 Speciesversicolor -0.72356
6 Speciesvirginica -1.02350
R> predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")
Sepal.Length PREDICTION
1 5.1 5.0048
2 4.9 4.7568
3 4.7 4.7731
4 4.6 4.8894
5 5.0 5.0544
6 5.4 5.3889
R>
R> ore.scriptDrop(name = "glm_build")
R> ore.scriptDrop(name = "glm_score")
R> ore.scriptDrop(name = "glm_detail")
R>
R> # Classification with nnet
R> ore.scriptCreate("nnet_build",
+ function(dat, form, sz){
+ require(nnet);
+ set.seed(1234);
+ nnet(formula = formula(form), data = dat,
+ size = sz, linout = TRUE, trace = FALSE);
+ },
+ overwrite = TRUE)
R>
R> ore.scriptCreate("nnet_detail", function(mod)
+ data.frame(conn = mod$conn, wts = mod$wts),
+ overwrite = TRUE)
R>
R> ore.scriptCreate("nnet_score",
+ function(mod, data) {
+ require(nnet);
+ res <- data.frame(predict(mod, newdata = data));
+ names(res) <- sort(mod$lev); res
+ })
R>
R> ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
+ formula = c(form="Species ~ ."),
+ build.function = "nnet_build",
+ build.parameter = list(sz=2),
+ score.function = "nnet_score",
+ detail.function = "nnet_detail",
+ detail.value = data.frame(conn=1, wts =1))
R>
R> summary(ralg.nnet)
Call:
ore.odmRAlg(data = IRIS, mining.function = "classification",
formula = c(form = "Species ~ ."), build.function = "nnet_build",
build.parameter = list(sz = 2), score.function = "nnet_score",
detail.function = "nnet_detail", detail.value = data.frame(conn = 1,
wts = 1))
Settings:
value
clas.weights.balanced OFF
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto OFF
build.function OML_USER.nnet_build
build.parameter select 'Species ~ .' "form", 2 "sz" from dual
details.format select 1 "conn", 1 "wts" from dual
details.function OML_USER.nnet_detail
score.function OML_USER.nnet_score
conn wts
1 0 1.46775
2 1 -12.88542
3 2 -4.38886
4 3 9.98648
5 4 16.57056
6 0 0.97809
7 1 -0.51626
8 2 -0.94815
9 3 0.13692
10 4 0.35104
11 0 37.22475
12 5 -66.49123
13 6 70.81160
14 0 -4.50893
15 5 7.01611
16 6 20.88774
17 0 -32.15127
18 5 58.92088
19 6 -91.96989
R> predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")
Species PREDICTION PROBABILITY
1 setosa setosa 0.99999
2 setosa setosa 0.99998
3 setosa setosa 0.99999
4 setosa setosa 0.99998
5 setosa setosa 1.00000
6 setosa setosa 0.99999
R>
R> ore.scriptDrop(name = "nnet_build")
R> ore.scriptDrop(name = "nnet_score")
R> ore.scriptDrop(name = "nnet_detail")
R>
R> ore.scriptCreate("pca_build",
+ function(dat){
+ mod <- prcomp(dat, retx = FALSE)
+ attr(mod, "dm$nfeat") <- ncol(mod$rotation)
+ mod},
+ overwrite = TRUE)
R>
R> ore.scriptCreate("pca_score",
+ function(mod, data) {
+ res <- predict(mod, data)
+ as.data.frame(res)},
+ overwrite=TRUE)
R>
R> ore.scriptCreate("pca_detail",
+ function(mod) {
+ rotation_t <- t(mod$rotation)
+ data.frame(id = seq_along(rownames(rotation_t)),
+ rotation_t)},
+ overwrite = TRUE)
R>
R> X <- IRIS[, -5L]
R> ralg.pca <- ore.odmRAlg(X,
+ mining.function = "feature_extraction",
+ formula = NULL,
+ build.function = "pca_build",
+ score.function = "pca_score",
+ detail.function = "pca_detail",
+ detail.value = data.frame(Feature.ID=1,
+ ore.pull(head(X,1L))))
R>
R> summary(ralg.pca)
Call:
ore.odmRAlg(data = X, mining.function = "feature_extraction",
formula = NULL, build.function = "pca_build", score.function = "pca_score",
detail.function = "pca_detail", detail.value = data.frame(Feature.ID = 1,
ore.pull(head(X, 1L))))
Settings:
value
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto OFF
build.function OML_USER.pca_build
details.format select 1 "Feature.ID", 5.1 "Sepal.Length", 3.5 "Sepal.Width", 1.4 "Petal.Length", 0.2 "Petal.Width" from dual
details.function OML_USER.pca_detail
score.function OML_USER.pca_score
Feature.ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1 1 0.856671 0.358289 0.36139 -0.084523
2 2 -0.173373 -0.075481 0.65659 0.730161
3 3 0.076236 0.545831 -0.58203 0.597911
4 4 0.479839 -0.753657 -0.31549 0.319723
R> head(cbind(X, Pred = predict(ralg.pca, newdata = X)))
Sepal.Length Sepal.Width Petal.Length Petal.Width FEATURE_ID
1 5.1 3.5 1.4 0.2 2
2 4.9 3.0 1.4 0.2 4
3 4.7 3.2 1.3 0.2 3
4 4.6 3.1 1.5 0.2 4
5 5.0 3.6 1.4 0.2 2
6 5.4 3.9 1.7 0.4 2
R>
R> ore.scriptDrop(name = "pca_build")
R> ore.scriptDrop(name = "pca_score")
R> ore.scriptDrop(name = "pca_detail")
R>
R> options(digits = digits)
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.8 Generalized Linear Models
The ore.odmGLM
function builds a Generalized Linear Model (GLM) model, which includes and extends the class of linear models (linear regression).
Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have same variance across classes.
The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.
The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.
In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.
GLM can be used to build classification or regression models as follows:
-
Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.
-
Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.
The ore.odmGLM
function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.
For information on the ore.odmGLM
function arguments, invoke help(ore.odmGLM)
.
The following examples build several models using GLM. The input ore.frame
objects are R data sets pushed to the database.
Settings for a Generalized Linear Models
The following table lists settings that apply to Generalized Linear models.
Table 4-11 Generalized Linear Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
The confidence level for coefficient confidence intervals. The default confidence level is |
|
|
Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data. |
|
|
Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled. |
|
|
Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data. |
|
|
Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled. |
|
|
When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory. |
GLMS_PRUNE_MODEL |
|
Prune enable or disable for features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data. |
|
target_value |
The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class. |
GLMS_RIDGE_REGRESSION |
|
Enable or disable Ridge Regression. Ridge applies to both regression and Classification mining functions. When ridge is enabled, prediction bounds are not produced by the Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised. |
|
|
The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm. |
|
|
Enable or disable row diagnostics. |
|
The range is ( |
Convergence Tolerance setting of the GLM algorithm The default value is system-determined. |
|
Positive integer |
Maximum number of iterations for the GLM algorithm. The default value is system-determined. |
|
0 or Positive integer
|
Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is |
|
|
This setting allows the user to choose the GLM solver. The solver cannot be selected if |
|
|
This setting allows the user to use sparse solver if it is available. The default value is |
Example 4-15 Building a Linear Regression Model
This example builds a linear regression model using the longley
data set.
longley_of <- ore.push(longley) longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of) summary(longfit1)
Listing for This Example
R> longley_of <- ore.push(longley) R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of) R> summary(longfit1) Call: ore.odmGLM(formula = Employed ~ ., data = longely_of) Residuals: Min 1Q Median 3Q Max -0.41011 -0.15767 -0.02816 0.10155 0.45539 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.482e+03 8.904e+02 -3.911 0.003560 ** GNP.deflator 1.506e-02 8.492e-02 0.177 0.863141 GNP -3.582e-02 3.349e-02 -1.070 0.312681 Unemployed -2.020e-02 4.884e-03 -4.136 0.002535 ** Armed.Forces -1.033e-02 2.143e-03 -4.822 0.000944 *** Population -5.110e-02 2.261e-01 -0.226 0.826212 Year 1.829e+00 4.555e-01 4.016 0.003037 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.3049 on 9 degrees of freedom Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925 F-statistic: 330.3 on 6 and 9 DF, p-value: 4.984e-10
Example 4-16 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model
This example uses the longley_of
ore.frame
from the previous example. This example invokes the ore.odmGLM
function and specifies using ridge estimation for the coefficients.
longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE, ridge.vif = TRUE) summary(longfit2)
Listing for This Example
R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE, + ridge.vif = TRUE) R> summary(longfit2) Call: ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE, ridge.vif = TRUE) Residuals: Min 1Q Median 3Q Max -0.4100 -0.1579 -0.0271 0.1017 0.4575 Coefficients: Estimate VIF (Intercept) -3.466e+03 0.000 GNP.deflator 1.479e-02 0.077 GNP -3.535e-02 0.012 Unemployed -2.013e-02 0.000 Armed.Forces -1.031e-02 0.000 Population -5.262e-02 0.548 Year 1.821e+00 2.212 Residual standard error: 0.3049 on 9 degrees of freedom Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925 F-statistic: 330.2 on 6 and 9 DF, p-value: 4.986e-10
Example 4-17 Building a Logistic Regression GLM
This example builds a logistic regression (classification) model. It uses the infert
data set. The example invokes the ore.odmGLM
function and specifies logistic
as the type
argument, which builds a binomial GLM.
infert_of <- ore.push(infert) infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, data = infert_of, type = "logistic") infit1
Listing for This Example
R> infert_of <- ore.push(infert) R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, + data = infert_of, type = "logistic") R> infit1 Response: case == "1" Call: ore.odmGLM(formula = case ~ age + parity + education + spontaneous + induced, data = infert_of, type = "logistic") Coefficients: (Intercept) age parity education0-5yrs education12+ yrs spontaneous induced -2.19348 0.03958 -0.82828 1.04424 -0.35896 2.04590 1.28876 Degrees of Freedom: 247 Total (i.e. Null); 241 Residual Null Deviance: 316.2 Residual Deviance: 257.8 AIC: 271.8
Example 4-18 Specifying a Reference Value in Building a Logistic Regression GLM
This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of
ore.frame
from Example 4-17.
infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, data = infert_of, type = "logistic", reference = 1) infit2
Listing for This Example
infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced, data = infert_of, type = "logistic", reference = 1) infit2 Response: case == "0" Call: ore.odmGLM(formula = case ~ age + parity + education + spontaneous + induced, data = infert_of, type = "logistic", reference = 1) Coefficients: (Intercept) age parity education0-5yrs education12+ yrs spontaneous induced 2.19348 -0.03958 0.82828 -1.04424 0.35896 -2.04590 -1.28876 Degrees of Freedom: 247 Total (i.e. Null); 241 Residual Null Deviance: 316.2 Residual Deviance: 257.8 AIC: 271.8
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.9 k-Means
The ore.odmKM
function uses the OML4SQL k-Means (KM) algorithm, a distance-based clustering algorithm that partitions data into a specified number of clusters.
The algorithm has the following features:
-
Several distance functions: Euclidean, Cosine, and Fast Cosine distance functions. The default is Euclidean.
-
For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numeric attributes.
For information on the ore.odmKM
function arguments, call help(ore.odmKM)
.
Settings for a k-Means Models
The following table lists settings that apply to k-Means models.
Table 4-12 k-Means Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times. The default Convergence Tolerance is |
|
|
Distance function for k-Means. The default distance function is |
|
|
Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in The default number of iterations is |
|
|
Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster. If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules. The default minimum support is |
|
|
Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin. The default number of histogram bins is |
|
|
Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time. When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster. The default split criterion is the |
KMNS_RANDOM_SEED |
Non-negative integer |
This setting controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value. The default is |
|
|
This setting determines the level of cluster detail that are computed during the build.
|
Example 4-19 Using the ore.odmKM Function
This example demonstrates the use of the ore.odmKMeans
function. The example creates two matrices that have 100 rows and two columns. The values in the rows are random variates. It binds the matrices into the matrix
x
, then coerces x
to a data.frame
and pushes it to the database as x_of
, an ore.frame
object. The example next calls the ore.odmKMeans
function to build the KM model, km.mod1
. It then calls the summary
and histogram
functions on the model. Figure 4-2 shows the graphic displayed by the histogram
function.
Finally, the example makes a prediction using the model, pulls the result to local memory, and plots the results.Figure 4-3 shows the graphic displayed by the points
function.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") x_of <- ore.push (data.frame(x)) km.mod1 <- NULL km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2) summary(km.mod1) histogram(km.mod1) # Make a prediction. km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y")) head(km.res1, 3) # Pull the results to the local memory and plot them. km.res1.local <- ore.pull(km.res1) plot(data.frame(x=km.res1.local$x, y=km.res1.local$y), col=km.res1.local$CLUSTER_ID) points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2) head(predict(km.mod1, x_of, type=c("class","raw"), supplemental.cols=c("x","y")), 3)Listing for This Example
R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), + matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) R> colnames(x) <- c("x", "y") R> x_of <- ore.push (data.frame(x)) R> km.mod1 <- NULL R> km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2) R> summary(km.mod1) Call: ore.odmKMeans(formula = ~., data = x_of, num.centers = 2) Settings: value clus.num.clusters 2 block.growth 2 conv.tolerance 0.01 distance euclidean iterations 3 min.pct.attr.support 0.1 num.bins 10 split.criterion variance prep.auto on Centers: x y 2 0.99772307 0.93368684 3 -0.02721078 -0.05099784 R> histogram(km.mod1) R> # Make a prediction. R> km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y")) R> head(km.res1, 3) x y CLUSTER_ID 1 -0.03038444 0.4395409 3 2 0.17724606 -0.5342975 3 3 -0.17565761 0.2832132 3 # Pull the results to the local memory and plot them. R> km.res1.local <- ore.pull(km.res1) R> plot(data.frame(x=km.res1.local$x, y=km.res1.local$y), + col=km.res1.local$CLUSTER_ID) R> points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2) R> head(predict(km.mod1, x_of, type=c("class","raw"), supplemental.cols=c("x","y")), 3) '2' '3' x y CLUSTER_ID 1 8.610341e-03 0.9913897 -0.03038444 0.4395409 3 2 8.017890e-06 0.9999920 0.17724606 -0.5342975 3 3 5.494263e-04 0.9994506 -0.17565761 0.2832132 3
Figure 4-2 shows the graphic displayed by the invocation of the histogram
function in Example 4-19.
Figure 4-2 Cluster Histograms for the km.mod1 Model
Description of "Figure 4-2 Cluster Histograms for the km.mod1 Model"
Figure 4-3 shows the graphic displayed by the invocation of the points
function in Example 4-19.
Figure 4-3 Results of the points Function for the km.mod1 Model
Description of "Figure 4-3 Results of the points Function for the km.mod1 Model"
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.10 Naive Bayes
The ore.odmNB
function builds an OML4SQL Naive Bayes model.
The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.
Naive Bayes assumes that each predictor is conditionally independent of the others. (Bayes' Theorem requires that the predictors be independent.)
For information on the ore.odmNB
function arguments, call help(ore.odmNB)
.
Settings for a Naive Bayes Models
The following table lists settings that apply to Naive Bayes models.
Table 4-13 Naive Bayes Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Value of pairwise threshold for NB algorithm Default is |
|
|
Value of singleton threshold for NB algorithm Default value is |
Example 4-20 Using the ore.odmNB Function
This example creates an input ore.frame
, builds a Naive Bayes model, makes predictions, and generates a confusion matrix.
m <- mtcars m$gear <- as.factor(m$gear) m$cyl <- as.factor(m$cyl) m$vs <- as.factor(m$vs) m$ID <- 1:nrow(m) mtcars_of <- ore.push(m) row.names(mtcars_of) <- mtcars_of # Build the model. nb.mod <- ore.odmNB(gear ~ ., mtcars_of) summary(nb.mod) # Make predictions and generate a confusion matrix. nb.res <- predict (nb.mod, mtcars_of, "gear") with(nb.res, table(gear, PREDICTION))
Listing for This Example
R> m <- mtcars R> m$gear <- as.factor(m$gear) R> m$cyl <- as.factor(m$cyl) R> m$vs <- as.factor(m$vs) R> m$ID <- 1:nrow(m) R> mtcars_of <- ore.push(m) R> row.names(mtcars_of) <- mtcars_of R> # Build the model. R> nb.mod <- ore.odmNB(gear ~ ., mtcars_of) R> summary(nb.mod) Call: ore.odmNB(formula = gear ~ ., data = mtcars_of) Settings: value prep.auto on Apriori: 3 4 5 0.46875 0.37500 0.15625 Tables: $ID ( ; 26.5), [26.5; 26.5] (26.5; ) 3 1.00000000 4 0.91666667 0.08333333 5 1.00000000 $am 0 1 3 1.0000000 4 0.3333333 0.6666667 5 1.0000000 $cyl '4', '6' '8' 3 0.2 0.8 4 1.0 5 0.6 0.4 $disp ( ; 196.299999999999995), [196.299999999999995; 196.299999999999995] 3 0.06666667 4 1.00000000 5 0.60000000 (196.299999999999995; ) 3 0.93333333 4 5 0.40000000 $drat ( ; 3.385), [3.385; 3.385] (3.385; ) 3 0.8666667 0.1333333 4 1.0000000 5 1.0000000 $hp ( ; 136.5), [136.5; 136.5] (136.5; ) 3 0.2 0.8 4 1.0 5 0.4 0.6 $vs 0 1 3 0.8000000 0.2000000 4 0.1666667 0.8333333 5 0.8000000 0.2000000 $wt ( ; 3.2024999999999999), [3.2024999999999999; 3.2024999999999999] 3 0.06666667 4 0.83333333 5 0.80000000 (3.2024999999999999; ) 3 0.93333333 4 0.16666667 5 0.20000000 Levels: [1] "3" "4" "5" R> # Make predictions and generate a confusion matrix. R> nb.res <- predict (nb.mod, mtcars_of, "gear") R> with(nb.res, table(gear, PREDICTION)) PREDICTION gear 3 4 5 3 14 1 0 4 0 12 0 5 0 1 4
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.11 Non-Negative Matrix Factorization
The ore.odmNMF
function builds an OML4SQL Non-Negative Matrix Factorization (NMF) model for feature extraction.
Each feature extracted by NMF is a linear combination of the original attribution set. Each feature has a set of non-negative coefficients, which are a measure of the weight of each attribute on the feature. If the argument allow.negative.scores
is TRUE
, then negative coefficients are allowed.
For information on the ore.odmNMF
function arguments, call help(ore.odmNMF)
.
Settings for a Non-Negative Matrix Factorization Models
The following table lists settings that apply to Non-Negative Matrix Factorization models.
Table 4-14 Non-Negative Matrix Factorization Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Convergence tolerance for NMF algorithm Default is |
|
|
Whether negative numbers should be allowed in scoring results. When set to Default is |
|
|
Number of iterations for NMF algorithm Default is |
|
|
Random seed for NMF algorithm. Default is |
Example 4-21 Using the ore.odmNMF Function
This example creates an NMF model on a training data set and scores on a test data set.
training.set <- ore.push(npk[1:18, c("N","P","K")]) scoring.set <- ore.push(npk[19:24, c("N","P","K")]) nmf.mod <- ore.odmNMF(~., training.set, num.features = 3) features(nmf.mod) summary(nmf.mod) predict(nmf.mod, scoring.set)
Listing for This Example
R> training.set <- ore.push(npk[1:18, c("N","P","K")]) R> scoring.set <- ore.push(npk[19:24, c("N","P","K")]) R> nmf.mod <- ore.odmNMF(~., training.set, num.features = 3) R> features(nmf.mod) FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT 1 1 K 0 3.723468e-01 2 1 K 1 1.761670e-01 3 1 N 0 7.469067e-01 4 1 N 1 1.085058e-02 5 1 P 0 5.730082e-01 6 1 P 1 2.797865e-02 7 2 K 0 4.107375e-01 8 2 K 1 2.193757e-01 9 2 N 0 8.065393e-03 10 2 N 1 8.569538e-01 11 2 P 0 4.005661e-01 12 2 P 1 4.124996e-02 13 3 K 0 1.918852e-01 14 3 K 1 3.311137e-01 15 3 N 0 1.547561e-01 16 3 N 1 1.283887e-01 17 3 P 0 9.791965e-06 18 3 P 1 9.113922e-01 R> summary(nmf.mod) Call: ore.odmNMF(formula = ~., data = training.set, num.features = 3) Settings: value feat.num.features 3 nmfs.conv.tolerance .05 nmfs.nonnegative.scoring nmfs.nonneg.scoring.enable nmfs.num.iterations 50 nmfs.random.seed -1 prep.auto on Features: FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT 1 1 K 0 3.723468e-01 2 1 K 1 1.761670e-01 3 1 N 0 7.469067e-01 4 1 N 1 1.085058e-02 5 1 P 0 5.730082e-01 6 1 P 1 2.797865e-02 7 2 K 0 4.107375e-01 8 2 K 1 2.193757e-01 9 2 N 0 8.065393e-03 10 2 N 1 8.569538e-01 11 2 P 0 4.005661e-01 12 2 P 1 4.124996e-02 13 3 K 0 1.918852e-01 14 3 K 1 3.311137e-01 15 3 N 0 1.547561e-01 16 3 N 1 1.283887e-01 17 3 P 0 9.791965e-06 18 3 P 1 9.113922e-01 R> predict(nmf.mod, scoring.set) '1' '2' '3' FEATURE_ID 19 0.1972489 1.2400782 0.03280919 2 20 0.7298919 0.0000000 1.29438165 3 21 0.1972489 1.2400782 0.03280919 2 22 0.0000000 1.0231268 0.98567623 2 23 0.7298919 0.0000000 1.29438165 3 24 1.5703239 0.1523159 0.00000000 1
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.12 Orthogonal Partitioning Cluster
The ore.odmOC
function builds an OML4SQL model using the Orthogonal Partitioning Cluster (O-Cluster) algorithm.
The O-Cluster algorithm builds a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.
The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. The sensitivity
argument defines a baseline density level. Only areas that have a peak density above this baseline level can be identified as clusters.
The k-Means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-Means tessellates it into n clusters (where n is specified by the user). O-Cluster separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.
The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring by the predict
function for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numeric attributes and multinomial distributions for categorical attributes.
If you choose to prepare the data for an O-Cluster model, keep the following points in mind:
-
The O-Cluster algorithm does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It only reads another batch if it believes, based on statistical tests, that there may still exist clusters that it has not yet uncovered.
-
Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.
-
Binary attributes should be declared as categorical. O-Cluster maps categorical data to numeric values.
-
The use of OML4SQL equi-width binning transformation with automated estimation of the required number of bins is highly recommended.
-
The presence of outliers can significantly impact clustering algorithms. Use a clipping transformation before binning or normalizing. Outliers with equi-width binning can prevent O-Cluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.
The specification of the formula
argument has the form ~ terms
where terms
are the column names to include in the model. Multiple terms
items are specified using +
between column names. Use ~ .
if all columns in data
should be used for model building. To exclude columns, use -
before each column name to exclude.
For information on the ore.odmOC
function arguments, call help(ore.odmOC)
.
Settings for an Orthogonal Partitioning Cluster Models
The following table lists settings that apply to Orthogonal Partitioning Cluster models.
Table 4-15 Orthogonal Partitioning Cluster Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density. Default is |
Example 4-22 Using the ore.odmOC Function
This example creates an O-Cluster model on a synthetic data set. The figure following the example shows the histogram of the resulting clusters.
x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") x_of <- ore.push (data.frame(ID=1:100,x)) rownames(x_of) <- x_of$ID oc.mod <- ore.odmOC(~., x_of, num.centers=2) summary(oc.mod) histogram(oc.mod) predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y"))
Listing for This Example
R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2), + matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2)) R> colnames(x) <- c("x", "y") R> x_of <- ore.push (data.frame(ID=1:100,x)) R> rownames(x_of) <- x_of$ID R> oc.mod <- ore.odmOC(~., x_of, num.centers=2) R> summary(oc.mod) Call: ore.odmOC(formula = ~., data = x_of, num.centers = 2) Settings: value clus.num.clusters 2 max.buffer 50000 sensitivity 0.5 prep.auto on Clusters: CLUSTER_ID ROW_CNT PARENT_CLUSTER_ID TREE_LEVEL DISPERSION IS_LEAF 1 1 100 NA 1 NA FALSE 2 2 56 1 2 NA TRUE 3 3 43 1 2 NA TRUE Centers: MEAN.x MEAN.y 2 1.85444 1.941195 3 4.04511 4.111740 R> histogram(oc.mod) R> predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y")) '2' '3' x y CLUSTER_ID 1 3.616386e-08 9.999999e-01 3.825303 3.935346 3 2 3.253662e-01 6.746338e-01 3.454143 4.193395 3 3 3.616386e-08 9.999999e-01 4.049120 4.172898 3 # ... Intervening rows not shown. 98 1.000000e+00 1.275712e-12 2.011463 1.991468 2 99 1.000000e+00 1.275712e-12 1.727580 1.898839 2 100 1.000000e+00 1.275712e-12 2.092737 2.212688 2
Figure 4-4 Output of the histogram Function for the ore.odmOC Model
Description of "Figure 4-4 Output of the histogram Function for the ore.odmOC Model"
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.13 Singular Value Decomposition
The ore.odmSVD
function creates a model that uses the OML4SQL Singular Value Decomposition (SVD) algorithm.
Singular Value Decomposition (SVD) is a feature extraction algorithm. SVD is orthogonal linear transformations that capture the underlying variance of the data by decomposing a rectangular matrix into three matrixes: 'U', 'D', and 'V'. Matrix 'D' is a diagonal matrix and its singular values reflect the amount of data variance captured by the bases.
Settings for a Singular Value Decomposition Models
The following table lists settings that apply to Singular Value Decomposition models.
Table 4-16 Singular Value Decomposition Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
2500 |
The maximum number of features supported by SVD. |
Example 4-23 Using the ore.odmSVD Function
IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
svd.mod <- ore.odmSVD(~. -Id, IRIS)
summary(svd.mod)
d(svd.mod)
v(svd.mod)
head(predict(svd.mod, IRIS, supplemental.cols = "Id"))
svd.pmod <- ore.odmSVD(~. -Id, IRIS,
odm.settings = list(odms_partition_columns = "Species"))
summary(svd.pmod)
d(svd.pmod)
v(svd.pmod)
head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))
Listing for This Example
R> IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
R>
R> svd.mod <- ore.odmSVD(~. -Id, IRIS)
R> summary(svd.mod)
Call:
ore.odmSVD(formula = ~. - Id, data = IRIS)
Settings:
value
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto ON
scoring.mode scoring.svd
u.matrix.output u.matrix.disable
d:
FEATURE_ID VALUE
1 1 96.2182677
2 2 19.0780817
3 3 7.2270380
4 4 3.1502152
5 5 1.8849634
6 6 1.1474731
7 7 0.5814097
v:
ATTRIBUTE_NAME ATTRIBUTE_VALUE '1' '2' '3' '4' '5' '6' '7'
1 Petal.Length <NA> 0.51162932 0.65943465 -0.004420703 0.05479795 -0.51969015 0.17392232 -0.005674672
2 Petal.Width <NA> 0.16745698 0.32071102 0.146484369 0.46553390 0.72685033 0.31962337 -0.021274748
3 Sepal.Length <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847 0.31969417 -0.09379235 -0.067308615
4 Sepal.Width <NA> 0.37906736 -0.50824062 0.142810811 0.69139828 -0.25849391 -0.17606099 -0.041908520
5 Species setosa 0.03170407 -0.32247642 0.184499940 -0.12245506 -0.14348647 0.76017824 0.497502783
6 Species versicolor 0.04288799 0.04054823 -0.780684855 0.19827972 0.07363250 -0.12354271 0.571881302
7 Species virginica 0.05018593 0.16796988 0.551546107 -0.07177990 0.08109974 -0.48442099 0.647048040
Warning message:
In u.ore.odmSVD(object) : U matrix is not calculated.
R> d(svd.mod)
FEATURE_ID VALUE
1 1 96.2182677
2 2 19.0780817
3 3 7.2270380
4 4 3.1502152
5 5 1.8849634
6 6 1.1474731
7 7 0.5814097
Warning message:
ORE object has no unique key - using random order
R> v(svd.mod)
ATTRIBUTE_NAME ATTRIBUTE_VALUE '1' '2' '3' '4' '5' '6' '7'
1 Petal.Length <NA> 0.51162932 0.65943465 -0.004420703 0.05479795 -0.51969015 0.17392232 -0.005674672
2 Petal.Width <NA> 0.16745698 0.32071102 0.146484369 0.46553390 0.72685033 0.31962337 -0.021274748
3 Sepal.Length <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847 0.31969417 -0.09379235 -0.067308615
4 Sepal.Width <NA> 0.37906736 -0.50824062 0.142810811 0.69139828 -0.25849391 -0.17606099 -0.041908520
5 Species setosa 0.03170407 -0.32247642 0.184499940 -0.12245506 -0.14348647 0.76017824 0.497502783
6 Species versicolor 0.04288799 0.04054823 -0.780684855 0.19827972 0.07363250 -0.12354271 0.571881302
7 Species virginica 0.05018593 0.16796988 0.551546107 -0.07177990 0.08109974 -0.48442099 0.647048040
Warning message:
ORE object has no unique key - using random order
R> head(predict(svd.mod, IRIS, supplemental.cols = "Id"))
Id '1' '2' '3' '4' '5' '6' '7' FEATURE_ID
1 1 0.06161595 -0.1291839 0.02586865 -0.01449182 1.536727e-05 -0.023495349 -0.007998605 2
2 2 0.05808905 -0.1130876 0.01881265 -0.09294788 3.466226e-02 0.069569113 0.051195429 2
3 3 0.05678818 -0.1190959 0.02565027 -0.01950986 8.851560e-04 0.040073030 0.060908867 2
4 4 0.05667915 -0.1081308 0.02496402 -0.02233741 -5.750222e-02 0.093904181 0.077741713 2
5 5 0.06123138 -0.1304597 0.02925687 0.02309694 -3.065834e-02 -0.030664898 -0.003629897 2
6 6 0.06747071 -0.1302726 0.03340671 0.06114966 -9.547838e-03 -0.008210224 -0.081807741 2
R>
R> svd.pmod <- ore.odmSVD(~. -Id, IRIS,
+ odm.settings = list(odms_partition_columns = "Species"))
R> summary(svd.pmod)
$setosa
Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))
Settings:
value
odms.max.partitions 1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns "Species"
odms.sampling odms.sampling.disable
prep.auto ON
scoring.mode scoring.svd
u.matrix.output u.matrix.disable
d:
FEATURE_ID VALUE
1 1 44.2872290
2 2 1.5719162
3 3 1.1458732
4 4 0.6836692
v:
ATTRIBUTE_NAME ATTRIBUTE_VALUE '1' '2' '3' '4'
1 Petal.Length <NA> 0.2334487 0.46456598 0.8317440 -0.19463332
2 Petal.Width <NA> 0.0395488 0.04182015 0.1946750 0.97917752
3 Sepal.Length <NA> 0.8010073 0.40303704 -0.4410167 0.03811461
4 Sepal.Width <NA> 0.5498408 -0.78739486 0.2753323 -0.04331888
$versicolor
Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))
Settings:
value
odms.max.partitions 1000
odms.missing.value.treatment odms.missing.value.auto
R> # xyz
R> d(svd.pmod)
PARTITION_NAME FEATURE_ID VALUE
1 setosa 1 44.2872290
2 setosa 2 1.5719162
3 setosa 3 1.1458732
4 setosa 4 0.6836692
5 versicolor 1 56.2523412
6 versicolor 2 1.9106625
7 versicolor 3 1.7015929
8 versicolor 4 0.6986103
9 virginica 1 66.2734064
10 virginica 2 2.4318639
11 virginica 3 1.6007740
12 virginica 4 1.2958261
Warning message:
ORE object has no unique key - using random order
R> v(svd.pmod)
PARTITION_NAME ATTRIBUTE_NAME ATTRIBUTE_VALUE '1' '2' '3' '4'
1 setosa Petal.Length <NA> 0.2334487 0.46456598 0.83174398 -0.19463332
2 setosa Petal.Width <NA> 0.0395488 0.04182015 0.19467497 0.97917752
3 setosa Sepal.Length <NA> 0.8010073 0.40303704 -0.44101672 0.03811461
4 setosa Sepal.Width <NA> 0.5498408 -0.78739486 0.27533228 -0.04331888
5 versicolor Petal.Length <NA> 0.5380908 0.49576111 -0.60174021 -0.32029352
6 versicolor Petal.Width <NA> 0.1676394 0.36693207 -0.03448373 0.91436795
7 versicolor Sepal.Length <NA> 0.7486029 -0.64738491 0.06943054 0.12516311
8 versicolor Sepal.Width <NA> 0.3492119 0.44774385 0.79492074 -0.21372297
9 virginica Petal.Length <NA> 0.5948985 -0.26368708 0.65157671 -0.38988802
10 virginica Petal.Width <NA> 0.2164036 0.59106806 0.42921836 0.64774968
11 virginica Sepal.Length <NA> 0.7058813 -0.27846153 -0.53436210 0.37235450
12 virginica Sepal.Width <NA> 0.3177999 0.70962445 -0.32507927 -0.53829342
Warning message:
ORE object has no unique key - using random order
R> head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))
Id '1' '2' '3' '4' FEATURE_ID
1 1 0.1432539 -0.026487881 -0.071688339 -0.04956008 1
2 2 0.1334289 0.172689424 -0.114854368 -0.02902893 2
3 3 0.1317675 -0.008327214 -0.062409295 -0.02438248 1
4 4 0.1297716 0.075232572 0.097222019 -0.08055912 1
5 5 0.1426868 -0.102219140 -0.009172782 -0.06147133 1
6 6 0.1554060 -0.055950655 0.160698708 0.14286095 3
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.14 Support Vector Machine
The ore.odmSVM
function builds an OML4R Support Vector Machine (SVM) model.
SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization refers to the generalization of the model to new data.
SVM models have similar functional form to neural networks and radial basis functions, both popular machine learning techniques.
SVM can be used to solve the following problems:
-
Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane is one that separates between a set of objects having different class memberships. SVM finds the vectors (“support vectors") that define the separators that give the widest separation of classes.
SVM classification supports both binary and multiclass targets.
-
Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.
SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.
-
Anomaly Detection: Anomaly detection identifies identify cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.
Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.
The ore.odmSVM
function builds each of these three different types of models. Some arguments apply to classification models only, some to regression models only, and some to anomaly detection models only.
For information on the ore.odmSVM
function arguments, call help(ore.odmSVM)
.
Settings for a Support Vector Machine Models
The following table lists settings that apply to Support Vector Machine models.
Table 4-17 Support Vector Machine Model Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor. Value of complexity factor for SVM algorithm (both Classification and Regression). Default value estimated from the data by the algorithm. |
|
|
Convergence tolerance for SVM algorithm. Default is |
|
|
Regularization setting for regression, similar to complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. Value of epsilon factor for SVM regression. Default is |
|
|
Kernel for Support Vector Machine. Linear or Gaussian. The default value is |
|
|
The desired rate of outliers in the training data. Valid for One-Class SVM models only (Anomaly Detection). Default is |
SVMS_STD_DEV |
|
Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases. Value of standard deviation for SVM algorithm. This is applicable only for Gaussian kernel. Default value estimated from the data by the algorithm. |
|
Positive integer |
This setting sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver. |
SVMS_NUM_PIVOTS |
Range [ |
This setting sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is |
|
Positive integer |
This setting applies to SVM models with linear kernel. This setting sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is |
|
|
This setting controls the type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size. |
SVMS_SOLVER |
|
This setting allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined. |
Example 4-24 Using the ore.odmSVM Function and Generating a Confusion Matrix
This example demonstrates the use of SVM classification. The example creates mtcars
in the database from the R mtcars
data set, builds a classification model, makes predictions, and finally generates a confusion matrix.
m <- mtcars m$gear <- as.factor(m$gear) m$cyl <- as.factor(m$cyl) m$vs <- as.factor(m$vs) m$ID <- 1:nrow(m) mtcars_of <- ore.push(m) svm.mod <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification") summary(svm.mod) svm.res <- predict (svm.mod, mtcars_of,"gear") with(svm.res, table(gear, PREDICTION)) # generate confusion matrix
Listing for This Example
R> m <- mtcars R> m$gear <- as.factor(m$gear) R> m$cyl <- as.factor(m$cyl) R> m$vs <- as.factor(m$vs) R> m$ID <- 1:nrow(m) R> mtcars_of <- ore.push(m) R> R> svm.mod <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification") R> summary(svm.mod) Call: ore.odmSVM(formula = gear ~ . - ID, data = mtcars_of, type = "classification") Settings: value prep.auto on active.learning al.enable complexity.factor 0.385498 conv.tolerance 1e-04 kernel.cache.size 50000000 kernel.function gaussian std.dev 1.072341 Coefficients: [1] No coefficients with gaussian kernel R> svm.res <- predict (svm.mod, mtcars_of,"gear") R> with(svm.res, table(gear, PREDICTION)) # generate confusion matrix PREDICTION gear 3 4 3 12 3 4 0 12 5 2 3
Example 4-25 Using the ore.odmSVM Function and Building a Regression Model
This example demonstrates SVM regression. The example creates a data frame, pushes it to a table, and then builds a regression model; note that ore.odmSVM
specifies a linear kernel.
x <- seq(0.1, 5, by = 0.02) y <- log(x) + rnorm(x, sd = 0.2) dat <-ore.push(data.frame(x=x, y=y)) # Build model with linear kernel svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear") summary(svm.mod) coef(svm.mod) svm.res <- predict(svm.mod,dat, supplemental.cols="x") head(svm.res,6)
Listing for This Example
R> x <- seq(0.1, 5, by = 0.02) R> y <- log(x) + rnorm(x, sd = 0.2) R> dat <-ore.push(data.frame(x=x, y=y)) R> R> # Build model with linear kernel R> svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear") R> summary(svm.mod) Call: ore.odmSVM(formula = y ~ x, data = dat, type = "regression", kernel.function = "linear") Settings: value prep.auto on active.learning al.enable complexity.factor 0.620553 conv.tolerance 1e-04 epsilon 0.098558 kernel.function linear Residuals: Min. 1st Qu. Median Mean 3rd Qu. Max. -0.79130 -0.28210 -0.05592 -0.01420 0.21460 1.58400 Coefficients: variable value estimate 1 x 0.6637951 2 (Intercept) 0.3802170 R> coef(svm.mod) variable value estimate 1 x 0.6637951 2 (Intercept) 0.3802170 R> svm.res <- predict(svm.mod,dat, supplemental.cols="x") R> head(svm.res,6) x PREDICTION 1 0.10 -0.7384312 2 0.12 -0.7271410 3 0.14 -0.7158507 4 0.16 -0.7045604 5 0.18 -0.6932702 6 0.20 -0.6819799
Example 4-26 Using the ore.odmSVM Function and Building an Anomaly Detection Model
This example demonstrates SVN anomaly detection. It uses mtcars_of
created in the classification example and builds an anomaly detection model.
svm.mod <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection") summary(svm.mod) svm.res <- predict (svm.mod, mtcars_of, "ID") head(svm.res) table(svm.res$PREDICTION)
Listing for This Example
R> svm.mod <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection") R> summary(svm.mod) Call: ore.odmSVM(formula = ~. - ID, data = mtcars_of, type = "anomaly.detection") Settings: value prep.auto on active.learning al.enable conv.tolerance 1e-04 kernel.cache.size 50000000 kernel.function gaussian outlier.rate .1 std.dev 0.719126 Coefficients: [1] No coefficients with gaussian kernel R> svm.res <- predict (svm.mod, mtcars_of, "ID") R> head(svm.res) '0' '1' ID PREDICTION Mazda RX4 0.4999405 0.5000595 1 1 Mazda RX4 Wag 0.4999794 0.5000206 2 1 Datsun 710 0.4999618 0.5000382 3 1 Hornet 4 Drive 0.4999819 0.5000181 4 1 Hornet Sportabout 0.4949872 0.5050128 5 1 Valiant 0.4999415 0.5000585 6 1 R> table(svm.res$PREDICTION) 0 1 5 27
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.15 Build a Partitioned Model
A partitioned model is an ensemble model that consists of multiple sub-models, one for each partition of the data.
A partitioned model may achieve better accuracy through multiple targeted models that are managed and used as one. A partitioned model can simplify scoring by allowing you to reference the top-level model only. The proper sub-model is chosen by the system based on the values of the partitioned column or columns for each row of data to be scored.
To create a partitioned OML4SQL model, use the odm.setting
argument with ODMS_PARTITION_COLUMNS
as the name and with the names of the columns by which to partition the input data as the value. The OREdm
function returns a model with a sub-model for each partition. The partitions are based on the unique values found in the columns.
The partitions
function returns an ore.frame
that lists each partition of the specified model object and the associated partition column values of the model. Partition names are system-determined. The function returns NULL
for a non-partitioned model.
Example 4-27 Create a Partitioned Model
This example creates a partitioned Support Vector Machine classification model. It uses the Wine Quality data set from the University of California, Irvine Machine Learning Repository.
# Download the wine data set and create the data table.
white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white.wine <- read.csv(white.url, header = TRUE, sep = ";")
white.wine$color <- "white"
red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
red.wine <- read.csv(red.url, header = TRUE, sep = ";")
red.wine$color <- "red"
dat <- rbind(white.wine, red.wine)
# Drop the WINE table if it exists.
ore.drop(table="WINE")
ore.create(dat, table="WINE")
# Assign row names to enable row indexing for train and test samples.
row.names(WINE) <- WINE$color
# Enable reproducible results.
set.seed(seed=6218945)
n.rows <- nrow(WINE)
# Train and test sampling.
random.sample <- sample(1:n.rows, ceiling(n.rows/2))
# Sample in-database using row indexing.
WINE.train <- WINE[random.sample,]
WINE.test <- WINE[setdiff(1:n.rows,random.sample),]
# Build a Support Vector Machine classification model
# on the training data set, using both red and white wine.
mod.svm <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
"classification", kernel.function="linear")
# Predict wine quality on the test data set.
pred.svm <- predict (mod.svm, WINE.test,"quality")
# View the probability of each class and prediction.
head(pred.svm,3)
# Generate a confusion matrix. Note that 3 and 8 are not predicted.
with(pred.svm, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))
# Build a partitioned SVM model based on wine color.
# Specify the partitioning column with the odm.settings argument.
mod.svm2 <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
"classification", kernel.function="linear",
odm.settings=list(odms_partition_columns = "color"))
# Predict wine quality on the test data set.
pred.svm2 <- predict (mod.svm2, WINE.test, "quality")
# View the probability of each class and prediction.
head(pred.svm2,3)
# Generate a confusion matrix. Note that 3 and 4 are not predicted.
with(pred.svm2, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))
partitions(mod.svm2)
summary(mod.svm2["red"])
Listing for This Example
> # Download the wine data set and create the data table.
> white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
> white.wine <- read.csv(white.url, header = TRUE, sep = ";")
> white.wine$color <- "white"
>
> red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
> red.wine <- read.csv(red.url, header = TRUE, sep = ";")
> red.wine$color <- "red"
>
> dat <- rbind(white.wine, red.wine)
>
> # Drop the WINE table if it exists.
> ore.drop(table="WINE")
Warning message:
Table WINE does not exist.
> ore.create(dat, table="WINE")
>
> # Assign row names to enable row indexing for train and test samples.
> row.names(WINE) <- WINE$color
>
> # Enable reproducible results.
> set.seed(seed=6218945)
>
> n.rows <- nrow(WINE)
>
> # Train and test sampling.
> random.sample <- sample(1:n.rows, ceiling(n.rows/2))
>
> # Sample in-database using row indexing.
> WINE.train <- WINE[random.sample,]
> WINE.test <- WINE[setdiff(1:n.rows,random.sample),]
>
> # Build a Support Vector Machine classification model
> # on the training data set, using both red and white wine.
> mod.svm <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
+ "classification",kernel.function="linear")
>
> # Predict wine quality on the test data set.
> pred.svm <- predict (mod.svm, WINE.test,"quality")
>
> # View the probability of each class and prediction.
> head(pred.svm,3)
'3' '4' '5' '6' '7' '8' '9'
red 0.04957242 0.1345280 0.27779399 0.1345281 0.1345280 0.1345275 0.1345220
red.1 0.04301663 0.1228311 0.34283345 0.1228313 0.1228311 0.1228307 0.1228257
red.2 0.04473419 0.1713883 0.09832961 0.1713891 0.1713890 0.1713886 0.1713812
quality PREDICTION
red 4 5
red.1 5 5
red.2 7 6
>
> # Generate a confusion matrix. Note that 3 and 4 are not predicted.
> with(pred.svm, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
Predicted
Actual 3 4 5 6 7 8 9
3 0 0 11 5 0 0 0
4 0 1 85 16 2 0 0
5 2 1 927 152 4 0 1
6 2 1 779 555 63 1 9
7 2 0 121 316 81 0 3
8 0 0 18 66 21 0 0
9 0 0 0 2 1 0 0
>
> partitions(mod.svm2)
PARTITION_NAME color
1 red red
2 white white
> summary(mod.svm2["red"])
$red
Call:
ore.odmSVM(formula = quality ~ . - pH - fixed.acidity, data = WINE.train,
type = "classification", kernel.function = "linear", odm.settings = list(odms_partition_columns = "color"))
Settings:
value
clas.weights.balanced OFF
odms.details odms.enable
odms.max.partitions 1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns "color"
odms.sampling odms.sampling.disable
prep.auto ON
active.learning al.enable
conv.tolerance 1e-04
kernel.function linear
Coefficients:
PARTITION_NAME class variable value estimate
1 red 3 (Intercept) -1.347392e+01
2 red 3 alcohol 7.245737e-01
3 red 3 chlorides 1.761946e+00
4 red 3 citric.acid -3.276716e+00
5 red 3 density 2.449906e+00
6 red 3 free.sulfur.dioxide -6.035430e-01
7 red 3 residual.sugar 9.097631e-01
8 red 3 sulphates 1.240524e-04
9 red 3 total.sulfur.dioxide -2.467554e+00
10 red 3 volatile.acidity 1.300470e+00
11 red 4 (Intercept) -1.000002e+00
12 red 4 alcohol -7.920188e-07
13 red 4 chlorides -2.589198e-08
14 red 4 citric.acid 9.340296e-08
15 red 4 density -5.418190e-07
16 red 4 free.sulfur.dioxide -6.981268e-08
17 red 4 residual.sugar 3.389558e-07
18 red 4 sulphates 1.417324e-07
19 red 4 total.sulfur.dioxide -3.113900e-07
20 red 4 volatile.acidity 4.928625e-07
21 red 5 (Intercept) -3.151406e-01
22 red 5 alcohol -9.692192e-01
23 red 5 chlorides 3.690034e-02
24 red 5 citric.acid 2.258823e-01
25 red 5 density -1.770474e-01
26 red 5 free.sulfur.dioxide -1.289540e-01
27 red 5 residual.sugar 7.521771e-04
28 red 5 sulphates -3.596548e-01
29 red 5 total.sulfur.dioxide 5.688280e-01
30 red 5 volatile.acidity 3.005168e-01
31 red 6 (Intercept) -9.999994e-01
32 red 6 alcohol 8.807703e-07
33 red 6 chlorides 6.871310e-08
34 red 6 citric.acid -4.525750e-07
35 red 6 density 5.786923e-07
36 red 6 free.sulfur.dioxide 3.856018e-07
37 red 6 residual.sugar -4.281695e-07
38 red 6 sulphates 1.036468e-07
39 red 6 total.sulfur.dioxide -4.287512e-07
40 red 6 volatile.acidity -4.426151e-07
41 red 7 (Intercept) -1.000000e+00
42 red 7 alcohol 1.761665e-07
43 red 7 chlorides -3.583316e-08
44 red 7 citric.acid -4.837739e-08
45 red 7 density 2.169500e-08
46 red 7 free.sulfur.dioxide 4.800717e-08
47 red 7 residual.sugar 1.909498e-08
48 red 7 sulphates 1.062205e-07
49 red 7 total.sulfur.dioxide -2.339108e-07
50 red 7 volatile.acidity -1.539326e-07
51 red 8 (Intercept) -1.000000e+00
52 red 8 alcohol 7.089889e-08
53 red 8 chlorides -8.566726e-09
54 red 8 citric.acid 2.769301e-08
55 red 8 density -3.852321e-08
56 red 8 free.sulfur.dioxide -1.302056e-08
57 red 8 residual.sugar 4.847947e-09
58 red 8 sulphates 1.276461e-08
59 red 8 total.sulfur.dioxide -5.484427e-08
60 red 8 volatile.acidity 2.959182e-08
Parent topic: Build Oracle Machine Learning for SQL Models
4.2.16 Build a Text Processing Model
A text processing model uses ctx.settings
arguments to specify Oracle Text attribute settings.
Example 4-28 Building a Text Processing Model
This example builds an ore.odmKMeans
model that processes text. It uses the odm.settings
and ctx.settings
arguments. The figure following the example shows the output of the histogram(km.mod1)
function.
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
X <- ore.push (data.frame(x))
km.mod1 <- NULL
km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
km.mod1
summary(km.mod1)
rules(km.mod1)
clusterhists(km.mod1)
histogram(km.mod1)
km.res1 <- predict(km.mod1,X,type="class",supplemental.cols=c("x","y"))
head(km.res1,3)
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x = km.res1.local$x,
y = km.res1.local$y),
col = km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)
head(predict(km.mod1,X))
head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)
# Text processing with ore.odmKMeans.
title <- c('Aids in Africa: Planning for a long war',
'Mars rover maneuvers for rim shot',
'Mars express confirms presence of water at Mars south pole',
'NASA announces major Mars rover finding',
'Drug access, Asia threat in focus at AIDS summit',
'NASA Mars Odyssey THEMIS image: typical crater',
'Road blocks for Aids')
response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')
# Text contents in a character column.
KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
RESPONSE = response, TITLE = title))
# Create a text policy (CTXSYS.CTX_DDL privilege is required).
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
# Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# text column attributes.
km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
ODMS_TEXT_MIN_DOCUMENTS = 1,
ODMS_TEXT_MAX_FEATURES = 3,
kmns_distance = "dbms_data_mining.kmns_cosine",
kmns_details = "kmns_details_all"),
ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))
summary(km.mod)
settings(km.mod)
print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)
ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")
Listing for This Example
R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+ matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R>
R> X <- ore.push (data.frame(x))
R> km.mod1 <- NULL
R> km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
R> km.mod1
Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)
Settings:
value
clus.num.clusters 2
block.growth 2
conv.tolerance 0.01
details details.all
distance euclidean
iterations 3
min.pct.attr.support 0.1
num.bins 10
random.seed 0
split.criterion variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto ON
R> summary(km.mod1)
Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)
Settings:
value
clus.num.clusters 2
block.growth 2
conv.tolerance 0.01
details details.all
distance euclidean
iterations 3
min.pct.attr.support 0.1
num.bins 10
random.seed 0
split.criterion variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
prep.auto ON
Centers:
x y
2 -0.07638266 0.04449368
3 0.98493306 1.00864399
R> rules(km.mod1)
cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var lhs.var.support lhs.var.conf predicate
1 1 100 1.0 92 0.86 x 86 0.2222222 x <= 1.2209
2 1 100 1.0 92 0.86 x 86 0.2222222 x >= -.6188
3 1 100 1.0 86 0.86 y 86 0.4444444 y <= 1.1653
4 1 100 1.0 86 0.86 y 86 0.4444444 y > -.3053
5 2 50 0.5 48 0.96 x 48 0.0870793 x <= .4324
6 2 50 0.5 48 0.96 x 48 0.0870793 x >= -.6188
7 2 50 0.5 48 0.96 y 48 0.0893300 y <= .5771
8 2 50 0.5 48 0.96 y 48 0.0893300 y > -.5995
9 3 50 0.5 49 0.98 x 49 0.0852841 x <= 1.7465
10 3 50 0.5 49 0.98 x 49 0.0852841 x > .4324
11 3 50 0.5 50 0.98 y 49 0.0838225 y <= 1.7536
12 3 50 0.5 50 0.98 y 49 0.0838225 y > .2829
R> clusterhists(km.mod1)
cluster.id variable bin.id lower.bound upper.bound label count
1 1 x 1 -0.61884662 -0.35602715 -.6188466:-.3560272 6
2 1 x 2 -0.35602715 -0.09320769 -.3560272:-.0932077 17
3 1 x 3 -0.09320769 0.16961178 -.0932077:.1696118 15
4 1 x 4 0.16961178 0.43243125 .1696118:.4324312 11
5 1 x 5 0.43243125 0.69525071 .4324312:.6952507 8
6 1 x 6 0.69525071 0.95807018 .6952507:.9580702 17
7 1 x 7 0.95807018 1.22088965 .9580702:1.2208896 18
8 1 x 8 1.22088965 1.48370911 1.2208896:1.4837091 4
9 1 x 9 1.48370911 1.74652858 1.4837091:1.7465286 4
10 1 y 1 -0.89359597 -0.59946141 -.893596:-.5994614 2
11 1 y 2 -0.59946141 -0.30532685 -.5994614:-.3053269 4
12 1 y 3 -0.30532685 -0.01119230 -.3053269:-.0111923 11
13 1 y 4 -0.01119230 0.28294226 -.0111923:.2829423 24
14 1 y 5 0.28294226 0.57707682 .2829423:.5770768 13
15 1 y 6 0.57707682 0.87121138 .5770768:.8712114 12
16 1 y 7 0.87121138 1.16534593 .8712114:1.1653459 26
17 1 y 8 1.16534593 1.45948049 1.1653459:1.4594805 5
18 1 y 9 1.45948049 1.75361505 1.4594805:1.753615 3
19 2 x 1 -0.61884662 -0.35602715 -.6188466:-.3560272 6
20 2 x 2 -0.35602715 -0.09320769 -.3560272:-.0932077 17
21 2 x 3 -0.09320769 0.16961178 -.0932077:.1696118 15
22 2 x 4 0.16961178 0.43243125 .1696118:.4324312 10
23 2 x 5 0.43243125 0.69525071 .4324312:.6952507 2
24 2 x 6 0.69525071 0.95807018 .6952507:.9580702 0
25 2 x 7 0.95807018 1.22088965 .9580702:1.2208896 0
26 2 x 8 1.22088965 1.48370911 1.2208896:1.4837091 0
27 2 x 9 1.48370911 1.74652858 1.4837091:1.7465286 0
28 2 y 1 -0.89359597 -0.59946141 -.893596:-.5994614 2
29 2 y 2 -0.59946141 -0.30532685 -.5994614:-.3053269 4
30 2 y 3 -0.30532685 -0.01119230 -.3053269:-.0111923 11
31 2 y 4 -0.01119230 0.28294226 -.0111923:.2829423 24
32 2 y 5 0.28294226 0.57707682 .2829423:.5770768 9
33 2 y 6 0.57707682 0.87121138 .5770768:.8712114 0
34 2 y 7 0.87121138 1.16534593 .8712114:1.1653459 0
35 2 y 8 1.16534593 1.45948049 1.1653459:1.4594805 0
36 2 y 9 1.45948049 1.75361505 1.4594805:1.753615 0
37 3 x 1 -0.61884662 -0.35602715 -.6188466:-.3560272 0
38 3 x 2 -0.35602715 -0.09320769 -.3560272:-.0932077 0
39 3 x 3 -0.09320769 0.16961178 -.0932077:.1696118 0
40 3 x 4 0.16961178 0.43243125 .1696118:.4324312 1
41 3 x 5 0.43243125 0.69525071 .4324312:.6952507 6
42 3 x 6 0.69525071 0.95807018 .6952507:.9580702 17
43 3 x 7 0.95807018 1.22088965 .9580702:1.2208896 18
44 3 x 8 1.22088965 1.48370911 1.2208896:1.4837091 4
45 3 x 9 1.48370911 1.74652858 1.4837091:1.7465286 4
46 3 y 1 -0.89359597 -0.59946141 -.893596:-.5994614 0
47 3 y 2 -0.59946141 -0.30532685 -.5994614:-.3053269 0
48 3 y 3 -0.30532685 -0.01119230 -.3053269:-.0111923 0
49 3 y 4 -0.01119230 0.28294226 -.0111923:.2829423 0
50 3 y 5 0.28294226 0.57707682 .2829423:.5770768 4
51 3 y 6 0.57707682 0.87121138 .5770768:.8712114 12
52 3 y 7 0.87121138 1.16534593 .8712114:1.1653459 26
53 3 y 8 1.16534593 1.45948049 1.1653459:1.4594805 5
54 3 y 9 1.45948049 1.75361505 1.4594805:1.753615 3
R> histogram(km.mod1)
R>
R> km.res1 <- predict(km.mod1, X, type="class", supplemental.cols = c("x","y"))
R> head(km.res1, 3)
x y CLUSTER_ID
1 -0.43646407 0.26201831 2
2 -0.02797831 0.07319952 2
3 0.11998373 -0.08638716 2
R> km.res1.local <- ore.pull(km.res1)
R> plot(data.frame(x = km.res1.local$x,
+ y = km.res1.local$y),
+ col = km.res1.local$CLUSTER_ID)
R> points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex = 2)
R>
R> head(predict(km.mod1, X))
'2' '3' CLUSTER_ID
1 0.9992236 0.0007763706 2
2 0.9971310 0.0028690375 2
3 0.9974216 0.0025783939 2
4 0.9997335 0.0002665114 2
5 0.9917773 0.0082226599 2
6 0.9771667 0.0228333398 2
R> head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
'2' '3' x y CLUSTER_ID
1 0.9992236 0.0007763706 -0.43646407 0.26201831 2
2 0.9971310 0.0028690375 -0.02797831 0.07319952 2
3 0.9974216 0.0025783939 0.11998373 -0.08638716 2
R> head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)
x y '2' '3'
1 -0.43646407 0.26201831 0.9992236 0.0007763706
2 -0.02797831 0.07319952 0.9971310 0.0028690375
3 0.11998373 -0.08638716 0.9974216 0.0025783939R>
R>
R> # Text processing with ore.odmKMeans.
R> title <- c('Aids in Africa: Planning for a long war',
+ 'Mars rover maneuvers for rim shot',
+ 'Mars express confirms presence of water at Mars south pole',
+ 'NASA announces major Mars rover finding',
+ 'Drug access, Asia threat in focus at AIDS summit',
+ 'NASA Mars Odyssey THEMIS image: typical crater',
+ 'Road blocks for Aids')
R> response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')
R>
R> # Text contents in a character column.
R> KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
+ RESPONSE = response, TITLE = title))
R>
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required).
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R>
R> # Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # text column attributes.
R> km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
+ odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+ ODMS_TEXT_MIN_DOCUMENTS = 1,
+ ODMS_TEXT_MAX_FEATURES = 3,
+ kmns_distance = "dbms_data_mining.kmns_cosine",
+ kmns_details = "kmns_details_all"),
+ ctx.settings = list(TITLE="TEXT(TOKEN_TYPE:STEM)"))
R> summary(km.mod)
Call:
ore.odmKMeans(formula = ~TITLE, data = KM_TEXT, num.centers = 2L,
odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
ODMS_TEXT_MIN_DOCUMENTS = 1, ODMS_TEXT_MAX_FEATURES = 3,
kmns_distance = "dbms_data_mining.kmns_cosine",
kmns_details = "kmns_details_all"),
ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))
Settings:
value
clus.num.clusters 2
block.growth 2
conv.tolerance 0.01
details details.all
distance cosine
iterations 3
min.pct.attr.support 0.1
num.bins 10
random.seed 0
split.criterion variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling odms.sampling.disable
odms.text.max.features 3
odms.text.min.documents 1
odms.text.policy.name ESA_TXTPOL
prep.auto ON
Centers:
TITLE.MARS TITLE.NASA TITLE.ROVER TITLE.AIDS
2 0.5292307 0.7936566 0.7936566 NA
3 NA NA NA 1
R> settings(km.mod)
SETTING_NAME SETTING_VALUE SETTING_TYPE
1 ALGO_NAME ALGO_KMEANS INPUT
2 CLUS_NUM_CLUSTERS 2 INPUT
3 KMNS_BLOCK_GROWTH 2 INPUT
4 KMNS_CONV_TOLERANCE 0.01 INPUT
5 KMNS_DETAILS KMNS_DETAILS_ALL INPUT
6 KMNS_DISTANCE KMNS_COSINE INPUT
7 KMNS_ITERATIONS 3 INPUT
8 KMNS_MIN_PCT_ATTR_SUPPORT 0.1 INPUT
9 KMNS_NUM_BINS 10 INPUT
10 KMNS_RANDOM_SEED 0 DEFAULT
11 KMNS_SPLIT_CRITERION KMNS_VARIANCE INPUT
12 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO DEFAULT
13 ODMS_SAMPLING ODMS_SAMPLING_DISABLE DEFAULT
14 ODMS_TEXT_MAX_FEATURES 3 INPUT
15 ODMS_TEXT_MIN_DOCUMENTS 1 INPUT
16 ODMS_TEXT_POLICY_NAME ESA_TXTPOL INPUT
17 PREP_AUTO ON INPUT
R> print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)
'2' '3' RESPONSE CLUSTER_ID
1 0.0213 0.9787 Aids 3
2 0.9463 0.0537 Mars 2
3 0.9325 0.0675 Mars 2
4 0.9691 0.0309 Mars 2
5 0.0213 0.9787 Aids 3
6 0.9463 0.0537 Mars 2
7 0.0213 0.9787 Aids 3
R>
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")
Parent topic: Build Oracle Machine Learning for SQL Models
4.3 Cross-Validate Models
Cross-validation is a model improvement technique that avoids the limitations of a single train-and-test experiment by building and testing multiple models through repeated sampling from the available data.
Predictive models are usually built on given data and verified on held-aside or unseen data. The purpose of cross-validation is to offer better insight into how well the model would generalize to new data and to avoid over-fitting and deriving wrong conclusions from misleading peculiarities of the seen data.
The ore.CV
utility R function uses Oracle Machine Learning for R for performing cross-validation of regression and classification models.
For a select set of algorithms and cases, the function ore.CV
performs cross-validation for models that were generated by OML4R regression and classification functions using in-database data.
The ore.CV
function works with models generated by the following OML4R functions:
-
ore.lm
-
ore.stepwise
-
ore.glm
-
ore.neural
-
ore.odmDT
-
ore.odmGLM
-
ore.odmNB
-
ore.odmSVM
You can also use ore.CV
to cross-validate models generated with some R regression functions through OML4R embedded R execution. Those R functions are the following:
-
lm
-
glm
-
svm
To download the function ore.CV
, and for more information on and examples of using ore.CV
, see the blog post:
Parent topic: Build Models in Oracle Machine Learning for R