7 OML4R Classes That Provide Access to In-Database Machine Learning Algorithms

OML4R has classes that provide access to in-database Oracle Machine Learning algorithms. Using in-database Oracle Machine Learning algorithms eliminates data movement and leverages the database for data preparation.

These functions are described in the following topics:

7.1 About Building In-Database Models using OML4R

The OML4R machine learning interface is built on top of OML4SQL, leveraging the same in-database algorithms, with the ability to use the same algorithm hyperparameters.

These OREdm package functions provide R interfaces whose arguments conform to typical R usage for the corresponding predictive analytics and OML4SQL functions.

This section has the following topics:

7.1.1 In-Database Models Supported by OML4R

The functions in the OREdm package provide access to the in-database machine learning functionality of Oracle Database. You use these functions to build machine learning models in the database.

The following table lists the OML4R functions that build in-database models and the corresponding in-database algorithms and functions.

Table 7-1 Oracle Machine Learning for R Model Functions

OML4R Function Name Algorithm Machine Learning Technique (Mining Function)

ore.odmAI

Minimum Description Length

Attribute importance for classification or regression

ore.odmAssocRules

Apriori

Association rules

ore.odmDT

Decision Tree

Classification

ore.odmEM

Expectation Maximization

Clustering

ore.odmESA

Explicit Semantic Analysis

Feature extraction

ore.odmGLM

Generalized Linear Models

Classification and regression

ore.odmKMeans

k-Means

Clustering

ore.odmNB

Naive Bayes

Classification

ore.odmNMF

Non-Negative Matrix Factorization

Feature extraction

ore.odmOC

Orthogonal Partitioning Cluster (O-Cluster)

Clustering

ore.odmRAlg

Extensible R Algorithm

Association rules, attribute importance, classification, clustering, feature extraction, and regression

ore.odmSVD

Singular Value Decomposition

Feature extraction

ore.odmSVM

Support Vector Machines

Classification and regression

ore.odmNN

Neural Network

Classification and regression

ore.odmRF

Random Forest

Classification

ore.odmXGB

XGBoost

Classification and regression

Note:

ore.odmXGB is available only in Oracle Database 21c and later.

ore.odmESM

Exponential Smoothing Method

Regression

7.1.2 About In-Database Model Names and Renaming with OML4R Functions

In each OREdm R model object, the name slot contains the name of the underlying OML4SQL model generated by the OREdm function.

By default, models built using OREdm functions are transient objects; they do not persist past the R session in which they were built unless they are explicitly saved in an OML4R datastore or given an explicit name when created. OML4SQL models built using Oracle Data Miner or SQL, on the other hand, exist until they are explicitly dropped.

R proxy objects can be saved or persisted. Saving a model object generated by an OREdm function allows it to exist across R sessions and keeps the corresponding in-database machine learning model object in place. While the OREdm model exists, you can export and import the in-database model object and use it independently of the OML4R object.
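
For example, the following sketch saves a model proxy to a datastore and restores it in a later session; the object name mod and the datastore name my_models are hypothetical:

# Save the proxy object; the underlying in-database model stays in place.
ore.save(mod, name = "my_models", overwrite = TRUE)
ore.datastore()                  # list available datastores
# In a later R session, after connecting with ore.connect():
ore.load(name = "my_models")     # restores mod into the workspace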

You can use the MODEL_NAME parameter in odm.settings to explicitly name an in-database model object created in the database. The named in-database model object persists in the database just like those created using Oracle Data Miner or SQL.

Example 7-1 Using the MODEL_NAME Parameter to Explicitly Name an In-Database Model

This example builds a Random Forest model and explicitly names it. The example uses the MODEL_NAME parameter in odm.settings to name the in-database model object created in the database.

ore.exec("BEGIN DBMS_DATA_MINING.DROP_MODEL(model_name=> 'RF_CLASSIFICATION_MODEL'); EXCEPTION WHEN others THEN null; END;")
settings = list(RFOR_MTRY = 3,
           RFOR_NUM_TREES = 100,
           RFOR_SAMPLING_RATIO =0.5,
           model_name="RF_CLASSIFICATION_MODEL")
MOD2 <- ore.odmRF(AFFINITY_CARD~., DEMO_DF.train, odm.settings= settings)
RES2 <- predict(MOD2, DEMO_DF.test, type= c("class","raw"), norm.votes=TRUE, cache.model=TRUE, 
         supplemental.cols=c("CUST_ID", "AFFINITY_CARD", "BOOKKEEPING_APPLICATION", "BULK_PACK_DISKETTES", 
                      "EDUCATION", "FLAT_PANEL_MONITOR", "HOME_THEATER_PACKAGE", "HOUSEHOLD_SIZE",                            
                      "OCCUPATION", "OS_DOC_SET_KANJI", "PRINTER_SUPPLIES", "YRS_RESIDENCE", "Y_BOX_GAMES"))

In the preceding code, the model named RF_CLASSIFICATION_MODEL is dropped if it already exists, because attempting to build a model with an existing name raises an exception. The in-database Random Forest classification model named RF_CLASSIFICATION_MODEL is built using the specified settings and is referenced by the variable MOD2. Predictions are made with MOD2, using AFFINITY_CARD as the target, on the test data set DEMO_DF.test; the result is stored in the local session variable RES2, and the model persists in the database.

While the R model exists, or if you have assigned the model a user-specified name, you can use the OML4SQL model name to access the OML4SQL model through other interfaces, including:

  • Any SQL interface, such as SQL*Plus or SQL Developer

  • Oracle Data Miner

In particular, the model can be used with the OML4SQL prediction functions.
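
For example, the following sketch scores data from SQL with the in-database model name through an OML4R query proxy; it assumes the RF_CLASSIFICATION_MODEL built in Example 7-1 exists, and the table name DEMO_TEST is hypothetical:

# Create an ore.frame proxy over a SQL query that scores with the model.
ore.sync(query = c(RF_SQL_SCORES =
  "SELECT CUST_ID,
          PREDICTION(RF_CLASSIFICATION_MODEL USING *) pred,
          PREDICTION_PROBABILITY(RF_CLASSIFICATION_MODEL USING *) prob
   FROM   DEMO_TEST"))
head(ore.get("RF_SQL_SCORES"))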

With Oracle Data Miner you can do the following:

  • Get a list of available models

  • Use model views to inspect model details

  • Score appropriately transformed data

Note:

Any explicit user-performed transformations in the R space are not carried over into SQL scoring or Oracle Data Miner. You can also use SQL to get a list of models, inspect model details, or score appropriately transformed data.

7.1.3 Specify Model Settings

Functions in the OREdm package have an argument that specifies settings for an in-database model; some also have an argument for specifying text processing settings.

General Parameter Settings

With the odm.settings argument to an OREdm function, you can specify a list of OML4SQL parameter settings. Each list element's name and value refer to the parameter setting name and value, respectively. The setting value must be numeric or string. Refer to Specify Model Settings in Oracle Machine Learning for SQL User’s Guide for each algorithm's valid settings.

The settings function returns a data.frame that lists each OML4SQL parameter setting name and value pair used to build the model.

Text Processing Attribute Settings

Some OREdm functions have a ctx.settings argument for specifying Oracle Text attribute-specific settings. With the odm.settings argument, you can specify the Oracle Text policy, the minimum number of documents in which each token must occur, and the maximum number of distinct features for text processing. With the ctx.settings argument, you specify the columns to treat as text and the type of text transformation to apply.

The ctx.settings argument applies to the following functions:

  • ore.odmESA, Explicit Semantic Analysis

  • ore.odmGLM, Generalized Linear Models

  • ore.odmKMeans, k-Means

  • ore.odmNMF, Non-Negative Matrix Factorization

  • ore.odmSVD, Singular Value Decomposition

  • ore.odmSVM, Support Vector Machine

Note:

To create an Oracle Text policy, the user must have the CTXSYS.CTX_DDL privilege.

See Also:

Create a Model that Includes Machine Learning Operations on Text in Oracle Machine Learning for SQL User’s Guide for valid text attribute values.

7.2 About Model Settings

You can specify settings that affect the characteristics of a model.

Some settings are general, some are specific to an Oracle Machine Learning function, and some are specific to an algorithm.

All settings have default values. If you want to override one or more of the settings for a model, then specify those settings in the odm.settings argument when you build the model.

The argument is an R list. Each list element's name and value refer to a machine learning algorithm parameter setting name and value, respectively. The setting value must be numeric or a string.

Example 7-2 Specifying Model Settings

This example creates an Expectation Maximization (EM) model, overriding two of the algorithm's default settings.


settings = list(
  EMCS_NUM_ITERATIONS = 20,
  EMCS_RANDOM_SEED = 7)

EM.MOD <- ore.odmEM(~.-CUST_ID, CUST_DF, num.centers = 3, odm.settings = settings)
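
To verify the settings actually used to build the model, including defaulted values, call the settings function on the model proxy:

# List the setting name/value pairs recorded with the EM model.
settings(EM.MOD)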

7.3 Shared Settings

These settings are common to multiple Oracle Machine Learning for R machine learning classes.

The following table lists the settings that are shared by all Oracle Machine Learning for R models. A sketch showing how to pass these settings follows the table.

Table 7-2 Shared Model Settings

Setting Name Setting Value Description

ODMS_DETAILS

ODMS_ENABLE

ODMS_DISABLE

Helps to control model size in the database. Model details can consume significant disk space, especially for partitioned models. The default value is ODMS_ENABLE.

If the setting value is ODMS_ENABLE, then model detail tables and views are created along with the model. You can query the model details using SQL.

If the value is ODMS_DISABLE, then model detail tables and views are not created.

The space reduction depends on the algorithm; model size reduction can be on the order of 10x.

ODMS_MAX_PARTITIONS

1 < value <= 1000000

Controls the maximum number of partitions allowed for a partitioned model. The default is 1000.

ODMS_MISSING_VALUE_TREATMENT

ODMS_MISSING_VALUE_AUTO

ODMS_MISSING_VALUE_MEAN_MODE

ODMS_MISSING_VALUE_DELETE_ROW

Indicates how to treat missing values in the training data. This setting does not affect the scoring data. The default value is ODMS_MISSING_VALUE_AUTO.

ODMS_MISSING_VALUE_MEAN_MODE replaces missing values with the mean (numeric attributes) or the mode (categorical attributes) both at build time and apply time where appropriate. ODMS_MISSING_VALUE_AUTO performs different strategies for different algorithms.

When ODMS_MISSING_VALUE_TREATMENT is set to ODMS_MISSING_VALUE_DELETE_ROW, the rows in the training data that contain missing values are deleted. However, if you want to replicate this missing value treatment in the scoring data, then you must perform the transformation explicitly.

The value ODMS_MISSING_VALUE_DELETE_ROW is applicable to all algorithms.

ODMS_PARTITION_BUILD_TYPE

ODMS_PARTITION_BUILD_INTRA

ODMS_PARTITION_BUILD_INTER

ODMS_PARTITION_BUILD_HYBRID

Controls the parallel building of partitioned models.

ODMS_PARTITION_BUILD_INTRA builds each partition in parallel using all processes/threads.

ODMS_PARTITION_BUILD_INTER builds each partition entirely in a single process/thread, but multiple partitions may be built at the same time because multiple processes/threads are active.

ODMS_PARTITION_BUILD_HYBRID combines the other two types and is recommended for most situations to adapt to dynamic environments. This is the default value.

ODMS_PARTITION_COLUMNS

Comma separated list of machine learning attributes

Requests the building of a partitioned model. The setting value is a comma-separated list of the machine learning attributes to be used to determine the in-list partition key values. These attributes are taken from the input columns unless an XFORM_LIST parameter is passed to the model, in which case the attributes are taken from those produced by the transformations.

ODMS_TABLESPACE_NAME

tablespace_name

Specifies the tablespace in which to store the model.

If you explicitly set this to the name of a tablespace (for which you have sufficient quota), then the model content is stored in that tablespace. If you do not provide this setting, then the model content is stored in your default tablespace.

ODMS_SAMPLE_SIZE

0 < value

Determines how many rows to sample (approximately). You can use this setting only if ODMS_SAMPLING is enabled. The default value is system determined.

ODMS_SAMPLING

ODMS_SAMPLING_ENABLE

ODMS_SAMPLING_DISABLE

Allows the user to request sampling of the build data. The default is ODMS_SAMPLING_DISABLE.

ODMS_TEXT_MAX_FEATURES

1 <= value

The maximum number of distinct features, across all text attributes, to use from a document set passed to the model. The default is 3000. An ore.odmESA model has a default value of 300000.

ODMS_TEXT_MIN_DOCUMENTS

Non-negative value

This text processing setting controls how many documents a token needs to appear in to be used as a feature.

The default is 1. An ore.odmESA model has a default value of 3.

ODMS_TEXT_POLICY_NAME

The name of an Oracle Text POLICY created using CTX_DDL.CREATE_POLICY.

Affects how individual tokens are extracted from unstructured text.

For details about CTX_DDL.CREATE_POLICY, see Oracle Text Reference.

PREP_AUTO

PREP_AUTO_ON

PREP_AUTO_OFF

This data preparation setting enables fully automated data preparation.

The default is PREP_AUTO_ON.

PREP_SCALE_2DNUM

PREP_SCALE_STDDEV

PREP_SCALE_RANGE

This data preparation setting enables scaling data preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect. The following are the possible values:

PREP_SCALE_STDDEV: A request to divide the column values by the standard deviation of the column and is often provided together with PREP_SHIFT_MEAN to yield z-score normalization.

PREP_SCALE_RANGE: A request to divide the column values by the range of values and is often provided together with PREP_SHIFT_MIN to yield a range of [0,1].

PREP_SCALE_NNUM

PREP_SCALE_MAXABS

This data preparation setting enables scaling data preparation for nested numeric columns. PREP_AUTO must be OFF for this setting to take effect. If specified, then the valid value for this setting is PREP_SCALE_MAXABS, which yields data in the range of [-1,1].

PREP_SHIFT_2DNUM

PREP_SHIFT_MEAN

PREP_SHIFT_MIN

This data preparation setting enables data centering preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect. The following are the possible values:

PREP_SHIFT_MEAN: Results in subtracting the average of the column from each value.

PREP_SHIFT_MIN: Results in subtracting the minimum of the column from each value.
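
The following sketch shows how shared settings are passed through the odm.settings argument; the ore.frame of is hypothetical, ore.odmKMeans stands in for any OREdm function, and the setting names come from Table 7-2:

# Disable model detail persistence and sample the build data.
shared.settings <- list(ODMS_DETAILS     = "ODMS_DISABLE",
                        ODMS_SAMPLING    = "ODMS_SAMPLING_ENABLE",
                        ODMS_SAMPLE_SIZE = 1000)
km.mod <- ore.odmKMeans(~., of, num.centers = 3,
                        odm.settings = shared.settings)
settings(km.mod)   # confirm the settings recorded with the model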

7.4 Association Rules

The ore.odmAssocRules function implements the Apriori algorithm to find frequent itemsets and generate an association model.

The function finds the co-occurrence of items in large volumes of transactional data such as in market basket analysis. An association rule identifies a pattern in the data in which the appearance of a set of items in a transactional record implies another set of items. The groups of items used to form rules must pass a minimum threshold according to how frequently they occur (the support of the rule) and how often the consequent follows the antecedent (the confidence of the rule). Association models generate all rules that have support and confidence greater than user-specified thresholds. The Apriori algorithm is efficient, and scales well with respect to the number of transactions, number of items, and number of itemsets and rules produced.

The formula specification has the form ~ terms, where terms is a series of column names to include in the analysis. Specify multiple column names by separating them with +. Use ~ . if all columns in the data should be used for model building. To exclude columns, precede each column name with -. Functions can be applied to the items in terms to effect transformations.
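
The following comment sketch illustrates these formula forms, using hypothetical columns A, B, and C:

# ~ A + B       include only columns A and B
# ~ .           include all columns
# ~ . - C       include all columns except C
# ~ log(A) + B  apply a function to an item as a transformation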

The ore.odmAssocRules function accepts data in the following forms:

  • Transactional data

  • Multi-record case data using item id and item value

  • Relational data

For examples of specifying the forms of data and for information on the arguments of the function, call help(ore.odmAssocRules).

The function rules returns an object of class ore.rules, which specifies a set of association rules. You can pull an ore.rules object into memory in a local R session by using ore.pull. The local in-memory object is of class rules defined in the arules package. See help(ore.rules).

The function itemsets returns an object of class ore.itemsets, which specifies a set of itemsets. You can pull an ore.itemsets object into memory in a local R session by using ore.pull. The local in-memory object is of class itemsets defined in the arules package. See help(ore.itemsets).

Settings for an Association Rules Model

The following table lists the settings that apply to Association Rules models.

Table 7-3 Association Rules Model Settings

Setting Name Setting Value Description
ASSO_ABS_ERROR

0 < ASSO_ABS_ERROR <= MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE)

Specifies the absolute error for the association rules sampling.

A smaller value of ASSO_ABS_ERROR obtains a larger sample size that gives accurate results but takes longer to compute. Set a reasonable value for ASSO_ABS_ERROR, such as the default value, to avoid too large a sample size.

The default value is 0.5 * MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE).

ASSO_AGGREGATES

NULL

Specifies the columns to aggregate. It is a comma separated list of strings containing the names of the columns for aggregation. The number of columns in the list must be <= 10.

You can set ASSO_AGGREGATES if you have specified a column name with ODMS_ITEM_ID_COLUMN_NAME. The data table must have valid column names such as ITEM_ID and CASE_ID which are derived from ODMS_ITEM_ID_COLUMN_NAME.

An item value is not mandatory.

The default value is NULL.

For each item, you may supply several columns to aggregate. However, doing so requires more memory to buffer the extra data and also affects performance because of the larger input data set and increased operations.

ASSO_ANT_IN_RULES

NULL

Sets Including Rules for the antecedent: it is a comma separated list of strings, at least one of which must appear in the antecedent part of each reported association rule.

The default value is NULL.

ASSO_ANT_EX_RULES

NULL

Sets Excluding Rules for the antecedent: it is a comma separated list of strings, none of which can appear in the antecedent part of each reported association rule.

The default value is NULL.

ASSO_CONF_LEVEL

0 <= ASSO_CONF_LEVEL <= 1

Specifies the confidence level for an association rules sample.

A larger value of ASSO_CONF_LEVEL obtains a larger sample size. Any value between 0.9 and 1 is suitable. The default value is 0.95.

ASSO_CONS_IN_RULES

NULL

Sets Including Rules for the consequent: it is a comma separated list of strings, at least one of which must appear in the consequent part of each reported association rule.

The default value is NULL.

ASSO_CONS_EX_RULES

NULL

Sets Excluding Rules for the consequent: it is a comma separated list of strings, none of which can appear in the consequent part of a reported association rule.

You can use the excluding rule to reduce the data that must be stored, but you may be required to build extra models for executing different Including or Excluding Rules.

The default value is NULL.

ASSO_EX_RULES

NULL

Sets Excluding Rules applied for each association rule: it is a comma separated list of strings that cannot appear in an association rule. No rule can contain any item in the list.

The default value is NULL.

ASSO_IN_RULES

NULL

Sets Including Rules applied for each association rule: it is a comma separated list of strings, at least one of which must appear in each reported association rule, either as antecedent or as consequent.

The default value is NULL, which specifies that filtering is not applied.

ASSO_MAX_RULE_LENGTH

TO_CHAR(2 <= numeric_expr <= 20)

Maximum rule length for association rules.

The default value is 4.

ASSO_MIN_CONFIDENCE

TO_CHAR(0 <= numeric_expr <= 1)

Minimum confidence for association rules.

The default value is 0.1.

ASSO_MIN_REV_CONFIDENCE

TO_CHAR(0 <= numeric_expr <= 1)

Sets the Minimum Reverse Confidence that each rule should satisfy.

The Reverse Confidence of a rule is defined as the number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs.

The value is a real number between 0 and 1.

The default value is 0.

ASSO_MIN_SUPPORT

TO_CHAR(0 <= numeric_expr <= 1)

Minimum support for association rules.

The default value is 0.1.

ASSO_MIN_SUPPORT_INT

A positive integer

Minimum absolute support that each rule must satisfy. The value must be an integer.

The default value is 1.

ODMS_ITEM_ID_COLUMN_NAME

column_name

The name of a column that contains the items in a transaction. When you specify this setting, the algorithm expects the data to be presented in native transactional format, consisting of two columns:

  • Case ID, either categorical or numeric
  • Item ID, either categorical or numeric
ODMS_ITEM_VALUE_COLUMN_NAME

column_name

The name of a column that contains a value associated with each item in a transaction. Use this setting only when you have specified a value for ODMS_ITEM_ID_COLUMN_NAME, indicating that the data is presented in native transactional format.

If you also use ASSO_AGGREGATES, then the build data must include the following three columns and the columns specified in the AGGREGATES setting.

  • Case ID, either categorical or numeric
  • Item ID, either categorical or numeric, specified by ODMS_ITEM_ID_COLUMN_NAME
  • Item value, either categorical or numeric, specified by ODMS_ITEM_VALUE_COLUMN_NAME

If ASSO_AGGREGATES, Case ID, and Item ID columns are present, then the Item Value column may or may not appear.

The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples).

Example 7-3 Using the ore.odmAssocRules Function

This example builds an association model on a transactional data set. The arules and arulesViz packages are required to pull the resulting rules and itemsets into the local R session and to visualize them. The graph of the rules appears in the figure following the example.

# Load the arules and arulesViz packages.
library(arules)
library(arulesViz)
# Create some transactional data.
id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e")
# Push the data to the database as an ore.frame object.
transdata_of <- ore.push(data.frame(ID = id, ITEM = item))
# Build a model with specifications.
ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID",
             item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6,
             max.rule.length = 3)
# Generate itemsets and rules of the model.
itemsets <- itemsets(ar.mod1)
rules <- rules(ar.mod1)
# Convert the rules to the rules object in arules package.
rules.arules <- ore.pull(rules)
inspect(rules.arules)          
# Convert itemsets to the itemsets object in arules package.
itemsets.arules <- ore.pull(itemsets)
inspect(itemsets.arules)
# Plot the rules graph.
plot(rules.arules, method = "graph", interactive = TRUE)

Listing for This Example

R> # Load the arules and arulesViz packages.
R> library(arules)
R> library(arulesViz)
R> # Create some transactional data.
R> id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
R> item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e")
R> # Push the data to the database as an ore.frame object.
R> transdata_of <- ore.push(data.frame(ID = id, ITEM = item))
R> # Build a model with specifications.
R> ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID",
+             item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6,
+             max.rule.length = 3)
R> # Generate itemsets and rules of the model.
R> itemsets <- itemsets(ar.mod1)
R> rules <- rules(ar.mod1)
R> # Convert the rules to the rules object in arules package.
R> rules.arules <- ore.pull(rules)
R> inspect(rules.arules)          
   lhs    rhs   support confidence lift
1  {b} => {e} 1.0000000  1.0000000    1
2  {e} => {b} 1.0000000  1.0000000    1
3  {c} => {e} 0.6666667  1.0000000    1
4  {d,                                 
    e} => {b} 0.6666667  1.0000000    1
5  {c,                                 
    e} => {b} 0.6666667  1.0000000    1
6  {b,                                 
    d} => {e} 0.6666667  1.0000000    1
7  {b,                                 
    c} => {e} 0.6666667  1.0000000    1
8  {d} => {b} 0.6666667  1.0000000    1
9  {d} => {e} 0.6666667  1.0000000    1
10 {c} => {b} 0.6666667  1.0000000    1
11 {b} => {d} 0.6666667  0.6666667    1
12 {b} => {c} 0.6666667  0.6666667    1
13 {e} => {d} 0.6666667  0.6666667    1
14 {e} => {c} 0.6666667  0.6666667    1
15 {b,                                 
    e} => {d} 0.6666667  0.6666667    1
16 {b,                                 
    e} => {c} 0.6666667  0.6666667    1
R> # Convert itemsets to the itemsets object in arules package.
R> itemsets.arules <- ore.pull(itemsets)
R> inspect(itemsets.arules)
   items   support
1  {b}   1.0000000
2  {e}   1.0000000
3  {b,            
    e}   1.0000000
4  {c}   0.6666667
5  {d}   0.6666667
6  {b,            
    c}   0.6666667
7  {b,            
    d}   0.6666667
8  {c,            
    e}   0.6666667
9  {d,            
    e}   0.6666667
10 {b,            
    c,            
    e}   0.6666667
11 {b,            
    d,            
    e}   0.6666667

R> # Plot the rules graph.
R> plot(rules.arules, method = "graph", interactive = TRUE)

Figure 7-1 A Visual Demonstration of the Association Rules


7.5 Attribute Importance Model

The ore.odmAI attribute importance function ranks attributes according to their significance in predicting a target.

The ore.odmAI function uses the OML4SQL Minimum Description Length algorithm to calculate attribute importance. Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).

MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build OML4SQL attribute importance models.

Attribute importance models built using OML4SQL cannot be applied to new data.

The ore.odmAI function produces a ranking of attributes and their importance values.

Note:

OREdm attribute importance models differ from OML4SQL attribute importance models in these ways: a model object is not retained, and an R model object is not returned. Only the importance ranking created by the model is returned.

For information on the ore.odmAI function arguments, invoke help(ore.odmAI).

Example 7-4 Using the ore.odmAI Function

This example pushes the data.frame iris to the database as the ore.frame iris_of. The example then builds an attribute importance model.

iris_of <- ore.push(iris)
ore.odmAI(Species ~ ., iris_of)

Listing for This Example

R> iris_of <- ore.push(iris)
R> ore.odmAI(Species ~ ., iris_of)
 
Call:
ore.odmAI(formula = Species ~ ., data = iris_of)
 
Importance: 
             importance rank
Petal.Width   1.1701851    1
Petal.Length  1.1494402    2
Sepal.Length  0.5248815    3
Sepal.Width   0.2504077    4

7.6 Decision Tree

The ore.odmDT function uses the in-database Decision Tree algorithm, which is based on conditional probabilities.

Decision Tree models are classification models. Decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and be used within a database to identify a set of records.

A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.

During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The ore.odmDT function offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.

For information on the ore.odmDT function arguments, call help(ore.odmDT).

Settings for a Decision Tree Model

The following table lists settings that apply to Decision Tree models. A sketch showing how to override the impurity metric follows the table.

Table 7-4 Decision Tree Model Settings

Setting Name Setting Value Description

TREE_IMPURITY_METRIC

TREE_IMPURITY_ENTROPY

TREE_IMPURITY_GINI

Tree impurity metric for Decision Tree.

Tree algorithms seek the best test question for splitting data at each node. The best splitter and split values are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured by a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. By default, the algorithm uses TREE_IMPURITY_GINI.

TREE_TERM_MAX_DEPTH

For Decision Tree:

2 <= a number <= 20

For Random Forest:

2 <= a number <= 100

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

For Decision Tree, the default is 7.

For Random Forest, the default is 16.

TREE_TERM_MINPCT_NODE

0 <= a number <= 10

The minimum number of training rows in a node expressed as a percentage of the rows in the training data.

Default is 0.05, indicating 0.05%.

TREE_TERM_MINPCT_SPLIT

0 < a number <= 20

The minimum number of rows required to consider splitting a node expressed as a percentage of the training rows.

Default is 0.1, indicating 0.1%.

TREE_TERM_MINREC_NODE

a number >= 0

The minimum number of rows in a node.

Default is 10.

TREE_TERM_MINREC_SPLIT

a number > 1

Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value.

Default is 20.
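
For example, one way to request the entropy metric instead of the default gini is to pass TREE_IMPURITY_METRIC through odm.settings; this is a sketch, assuming the mtcars_of ore.frame built in Example 7-5 below:

# Build a Decision Tree that splits on entropy, with a deeper tree.
dt.settings <- list(TREE_IMPURITY_METRIC = "TREE_IMPURITY_ENTROPY",
                    TREE_TERM_MAX_DEPTH  = 10)
dt.mod.ent <- ore.odmDT(gear ~ ., mtcars_of, odm.settings = dt.settings)
settings(dt.mod.ent)   # verify the impurity metric and depth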

Example 7-5 Using the ore.odmDT Function

This example creates an input ore.frame, builds a model, makes predictions, and generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
row.names(mtcars_of) <- mtcars_of$ID
# Build the model.
dt.mod  <- ore.odmDT(gear ~ ., mtcars_of)
summary(dt.mod)
# Make predictions and generate a confusion matrix.
dt.res  <- predict (dt.mod, mtcars_of, "gear")
with(dt.res, table(gear, PREDICTION)) 

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R> row.names(mtcars_of) <- mtcars_of$ID
R> # Build the model.
R> dt.mod  <- ore.odmDT(gear ~ ., mtcars_of)
R> summary(dt.mod)
 
Call:
ore.odmDT(formula = gear ~ ., data = mtcars_of)
 
  n =  32 
 
Nodes:
  parent node.id row.count prediction                         split
1     NA       0        32          3                          <NA>
2      0       1        16          4 (disp <= 196.299999999999995)
3      0       2        16          3  (disp > 196.299999999999995)
            surrogate                   full.splits
1                <NA>                          <NA>
2 (cyl in ("4" "6" )) (disp <= 196.299999999999995)
3     (cyl in ("8" ))  (disp > 196.299999999999995)
 
Settings: 
                          value
prep.auto                    on
impurity.metric   impurity.gini
term.max.depth                7
term.minpct.node           0.05
term.minpct.split           0.1
term.minrec.node             10
term.minrec.split            20
R> # Make predictions and generate a confusion matrix.
R> dt.res  <- predict (dt.mod, mtcars_of, "gear")
R> with(dt.res, table(gear, PREDICTION)) 
    PREDICTION
gear  3  4
   3 14  1
   4  0 12
   5  2  3

7.7 Expectation Maximization

The ore.odmEM function creates a model that uses the OML4SQL Expectation Maximization (EM) algorithm.

EM is a density estimation algorithm that performs probabilistic clustering. In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.

For information on the ore.odmEM function arguments, call help(ore.odmEM).

Settings for an Expectation Maximization Model

The following table lists settings that apply to Expectation Maximization Models.

Table 7-5 Expectation Maximization Model Settings

Setting Name Setting Value Description

EMCS_ATTRIBUTE_FILTER

EMCS_ATTR_FILTER_ENABLE

EMCS_ATTR_FILTER_DISABLE

Whether or not to include uncorrelated attributes in the model. When EMCS_ATTRIBUTE_FILTER is enabled, uncorrelated attributes are not included.

Note:

This setting applies only to attributes that are not nested.

Default is system-determined.

EMCS_MAX_NUM_ATTR_2D

TO_CHAR(numeric_expr >= 1)

Maximum number of correlated attributes to include in the model.

Note:

This setting applies only to attributes that are not nested (2D).

The default value is 50.

EMCS_NUM_DISTRIBUTION

EMCS_NUM_DISTR_BERNOULLI

EMCS_NUM_DISTR_GAUSSIAN

EMCS_NUM_DISTR_SYSTEM

The distribution for modeling numeric attributes. Applies to the input table or view as a whole and does not allow per-attribute specifications.

The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numeric attributes are modeled using the same type of distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

The default value is EMCS_NUM_DISTR_SYSTEM.

EMCS_NUM_EQUIWIDTH_BINS

TO_CHAR(1 < numeric_expr <= 255)

Number of equi-width bins that will be used for gathering cluster statistics for numeric columns.

Default is 11.

EMCS_NUM_PROJECTIONS

TO_CHAR(numeric_expr >= 1)

Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns.

Default is 50.

EMCS_NUM_QUANTILE_BINS

TO_CHAR(1 < numeric_expr <= 255)

Specifies the number of quantile bins that will be used for modeling numeric columns with multivalued Bernoulli distributions.

Default is system-determined.

EMCS_NUM_TOPN_BINS

TO_CHAR(1 < numeric_expr <= 255)

Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

Default is system-determined.

Example 7-6 Using the ore.odmEM Function

## Synthetic 2-dimensional data set
set.seed(7654)

x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

X <- ore.push (data.frame(ID=1:100,x))
rownames(X) <- X$ID

em.mod <- NULL
# Build an EM model with two clusters.
em.mod <- ore.odmEM(~., X, num.centers = 2L)

# Inspect the model: settings, rules, and per-cluster histograms.
summary(em.mod)
rules(em.mod)
clusterhists(em.mod)
histogram(em.mod)

# Score the data, pull the results locally, and plot the clusters.
em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
head(em.res)
em.res.local <- ore.pull(em.res)
plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)

# Compare prediction output types.
head(predict(em.mod,X))
head(predict(em.mod,X,type=c("class","raw")))
head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))

Listing for This Example

R> ## Synthetic 2-dimensional data set
R> 
R> set.seed(7654)
R>
R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R>
R> X <- ore.push (data.frame(ID=1:100,x))
R> rownames(X) <- X$ID
R> 
R> em.mod <- NULL
R> em.mod <- ore.odmEM(~., X, num.centers = 2L)
R> 
R> summary(em.mod)

Call:
ore.odmEM(formula = ~., data = X, num.centers = 2L)

Settings: 
                                               value
clus.num.clusters                                  2
cluster.components               cluster.comp.enable
cluster.statistics                 clus.stats.enable
cluster.thresh                                     2
linkage.function                      linkage.single
loglike.improvement                             .001
max.num.attr.2d                                   50
min.pct.attr.support                              .1
model.search                    model.search.disable
num.components                                    20
num.distribution                    num.distr.system
num.equiwidth.bins                                11
num.iterations                                   100
num.projections                                   50
random.seed                                        0
remove.components                remove.comps.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

Centers: 
  MEAN.ID MEAN.x MEAN.y
2    25.5   4.03   3.96
3    75.5   1.93   1.99

R> rules(em.mod)
   cluster.id rhs.support rhs.conf lhs.support lhs.conf lhs.var lhs.var.support lhs.var.conf   predicate
1           1         100      1.0         100     1.00      ID             100       0.0000   ID <= 100
2           1         100      1.0         100     1.00      ID             100       0.0000     ID >= 1
3           1         100      1.0         100     1.00       x             100       0.2500 x <= 4.6298
4           1         100      1.0         100     1.00       x             100       0.2500 x >= 1.3987
5           1         100      1.0         100     1.00       y             100       0.3000 y <= 4.5846
6           1         100      1.0         100     1.00       y             100       0.3000 y >= 1.3546
7           2          50      0.5          50     1.00      ID              50       0.0937  ID <= 50.5
8           2          50      0.5          50     1.00      ID              50       0.0937     ID >= 1
9           2          50      0.5          50     1.00       x              50       0.0937 x <= 4.6298
10          2          50      0.5          50     1.00       x              50       0.0937  x > 3.3374
11          2          50      0.5          50     1.00       y              50       0.0937 y <= 4.5846
12          2          50      0.5          50     1.00       y              50       0.0937  y > 2.9696
13          3          50      0.5          50     0.98      ID              49       0.0937   ID <= 100
14          3          50      0.5          50     0.98      ID              49       0.0937   ID > 50.5
15          3          50      0.5          49     0.98       x              49       0.0937  x <= 2.368
16          3          50      0.5          49     0.98       x              49       0.0937 x >= 1.3987
17          3          50      0.5          49     0.98       y              49       0.0937 y <= 2.6466
18          3          50      0.5          49     0.98       y              49       0.0937 y >= 1.3546
R> clusterhists(em.mod)
   cluster.id variable bin.id lower.bound upper.bound       label count
1           1       ID      1        1.00       10.90      1:10.9    10
2           1       ID      2       10.90       20.80   10.9:20.8    10
3           1       ID      3       20.80       30.70   20.8:30.7    10
4           1       ID      4       30.70       40.60   30.7:40.6    10
5           1       ID      5       40.60       50.50   40.6:50.5    10
6           1       ID      6       50.50       60.40   50.5:60.4    10
7           1       ID      7       60.40       70.30   60.4:70.3    10
8           1       ID      8       70.30       80.20   70.3:80.2    10
9           1       ID      9       80.20       90.10   80.2:90.1    10
10          1       ID     10       90.10      100.00    90.1:100    10
11          1       ID     11          NA          NA           :     0
12          1        x      1        1.40        1.72 1.399:1.722    11
13          1        x      2        1.72        2.04 1.722:2.045    22
14          1        x      3        2.04        2.37 2.045:2.368    16
15          1        x      4        2.37        2.69 2.368:2.691     1
16          1        x      5        2.69        3.01 2.691:3.014     0
17          1        x      6        3.01        3.34 3.014:3.337     0
18          1        x      7        3.34        3.66  3.337:3.66     4
19          1        x      8        3.66        3.98  3.66:3.984    18
20          1        x      9        3.98        4.31 3.984:4.307    22
21          1        x     10        4.31        4.63  4.307:4.63     6
22          1        x     11          NA          NA           :     0
23          1        y      1        1.35        1.68 1.355:1.678     7
24          1        y      2        1.68        2.00 1.678:2.001    18
25          1        y      3        2.00        2.32 2.001:2.324    18
26          1        y      4        2.32        2.65 2.324:2.647     6
27          1        y      5        2.65        2.97  2.647:2.97     1
28          1        y      6        2.97        3.29  2.97:3.293     4
29          1        y      7        3.29        3.62 3.293:3.616     3
30          1        y      8        3.62        3.94 3.616:3.939    16
31          1        y      9        3.94        4.26 3.939:4.262    16
32          1        y     10        4.26        4.58 4.262:4.585    11
33          1        y     11          NA          NA           :     0
34          2       ID      1        1.00       10.90      1:10.9    10
35          2       ID      2       10.90       20.80   10.9:20.8    10
36          2       ID      3       20.80       30.70   20.8:30.7    10
37          2       ID      4       30.70       40.60   30.7:40.6    10
38          2       ID      5       40.60       50.50   40.6:50.5    10
39          2       ID      6       50.50       60.40   50.5:60.4     0
40          2       ID      7       60.40       70.30   60.4:70.3     0
41          2       ID      8       70.30       80.20   70.3:80.2     0
42          2       ID      9       80.20       90.10   80.2:90.1     0
43          2       ID     10       90.10      100.00    90.1:100     0
44          2       ID     11          NA          NA           :     0
45          2        x      1        1.40        1.72 1.399:1.722     0
46          2        x      2        1.72        2.04 1.722:2.045     0
47          2        x      3        2.04        2.37 2.045:2.368     0
48          2        x      4        2.37        2.69 2.368:2.691     0
49          2        x      5        2.69        3.01 2.691:3.014     0
50          2        x      6        3.01        3.34 3.014:3.337     0
51          2        x      7        3.34        3.66  3.337:3.66     4
52          2        x      8        3.66        3.98  3.66:3.984    18
53          2        x      9        3.98        4.31 3.984:4.307    22
54          2        x     10        4.31        4.63  4.307:4.63     6
55          2        x     11          NA          NA           :     0
56          2        y      1        1.35        1.68 1.355:1.678     0
57          2        y      2        1.68        2.00 1.678:2.001     0
58          2        y      3        2.00        2.32 2.001:2.324     0
59          2        y      4        2.32        2.65 2.324:2.647     0
60          2        y      5        2.65        2.97  2.647:2.97     0
61          2        y      6        2.97        3.29  2.97:3.293     4
62          2        y      7        3.29        3.62 3.293:3.616     3
63          2        y      8        3.62        3.94 3.616:3.939    16
64          2        y      9        3.94        4.26 3.939:4.262    16
65          2        y     10        4.26        4.58 4.262:4.585    11
66          2        y     11          NA          NA           :     0
67          3       ID      1        1.00       10.90      1:10.9     0
68          3       ID      2       10.90       20.80   10.9:20.8     0
69          3       ID      3       20.80       30.70   20.8:30.7     0
70          3       ID      4       30.70       40.60   30.7:40.6     0
71          3       ID      5       40.60       50.50   40.6:50.5     0
72          3       ID      6       50.50       60.40   50.5:60.4    10
73          3       ID      7       60.40       70.30   60.4:70.3    10
74          3       ID      8       70.30       80.20   70.3:80.2    10
75          3       ID      9       80.20       90.10   80.2:90.1    10
76          3       ID     10       90.10      100.00    90.1:100    10
77          3       ID     11          NA          NA           :     0
78          3        x      1        1.40        1.72 1.399:1.722    11
79          3        x      2        1.72        2.04 1.722:2.045    22
80          3        x      3        2.04        2.37 2.045:2.368    16
81          3        x      4        2.37        2.69 2.368:2.691     1
82          3        x      5        2.69        3.01 2.691:3.014     0
83          3        x      6        3.01        3.34 3.014:3.337     0
84          3        x      7        3.34        3.66  3.337:3.66     0
85          3        x      8        3.66        3.98  3.66:3.984     0
86          3        x      9        3.98        4.31 3.984:4.307     0
87          3        x     10        4.31        4.63  4.307:4.63     0
88          3        x     11          NA          NA           :     0
89          3        y      1        1.35        1.68 1.355:1.678     7
90          3        y      2        1.68        2.00 1.678:2.001    18
91          3        y      3        2.00        2.32 2.001:2.324    18
92          3        y      4        2.32        2.65 2.324:2.647     6
93          3        y      5        2.65        2.97  2.647:2.97     1
94          3        y      6        2.97        3.29  2.97:3.293     0
95          3        y      7        3.29        3.62 3.293:3.616     0
96          3        y      8        3.62        3.94 3.616:3.939     0
97          3        y      9        3.94        4.26 3.939:4.262     0
98          3        y     10        4.26        4.58 4.262:4.585     0
99          3        y     11          NA          NA           :     0
R> histogram(em.mod)
R>
R> em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
R> head(em.res)
     x    y CLUSTER_ID
1 4.15 3.63          2
2 3.88 4.13          2
3 3.72 4.10          2
4 3.78 4.14          2
5 4.22 4.35          2
6 4.07 3.62          2
R> em.res.local <- ore.pull(em.res)
R> plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
R> points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)
R>
R> head(predict(em.mod,X))
  '2'      '3' CLUSTER_ID
1   1 1.14e-54          2
2   1 1.63e-55          2
3   1 1.10e-51          2
4   1 1.53e-52          2
5   1 9.02e-62          2
6   1 3.20e-49          2
R> head(predict(em.mod,X,type=c("class","raw")))
  '2'      '3' CLUSTER_ID
1   1 1.14e-54          2
2   1 1.63e-55          2
3   1 1.10e-51          2
4   1 1.53e-52          2
5   1 9.02e-62          2
6   1 3.20e-49          2
R> head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
  '2'      '3'    x    y CLUSTER_ID
1   1 1.14e-54 4.15 3.63          2
2   1 1.63e-55 3.88 4.13          2
3   1 1.10e-51 3.72 4.10          2
4   1 1.53e-52 3.78 4.14          2
5   1 9.02e-62 4.22 4.35          2
6   1 3.20e-49 4.07 3.62          2
R> head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))
     x    y '2'      '3'
1 4.15 3.63   1 1.14e-54
2 3.88 4.13   1 1.63e-55
3 3.72 4.10   1 1.10e-51
4 3.78 4.14   1 1.53e-52
5 4.22 4.35   1 9.02e-62
6 4.07 3.62   1 3.20e-49

7.8 Explicit Semantic Analysis

The ore.odmESA function creates a model that uses the OML4SQL Explicit Semantic Analysis (ESA) algorithm.

ESA is an unsupervised algorithm used by OML4SQL for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base.

Explicit knowledge often exists in text form. Multiple knowledge bases are available as collections of text documents. These knowledge bases can be generic, for example, Wikipedia, or domain-specific. Data preparation transforms the text into vectors that capture attribute-concept associations.

For information on the ore.odmESA function arguments, call help(ore.odmESA).

Settings for an Explicit Semantic Analysis Model

The following table lists settings that apply to Explicit Semantic Analysis models.

Table 7-6 Explicit Semantic Analysis Model Settings

Setting Name Setting Value Description

ESAS_VALUE_THRESHOLD

Non-negative number

This setting thresholds a small value for attribute weights in the transformed build data. The default is 1e-8.

ESAS_MIN_ITEMS

Text input: 100

Non-text input: 0

This setting determines the minimum number of non-zero entries that need to be present in an input row. The default is 100 for text input and 0 for non-text input.

ESAS_TOPN_FEATURES

A positive integer

This setting controls the maximum number of features per attribute. The default is 1000.

Example 7-7 Using the ore.odmESA Function

title <- c('Aids in Africa: Planning for a long war',
           'Mars rover maneuvers for rim shot',
           'Mars express confirms presence of water at Mars south pole',
           'NASA announces major Mars rover finding',
           'Drug access, Asia threat in focus at AIDS summit',
           'NASA Mars Odyssey THEMIS image: typical crater',
           'Road blocks for Aids')

# TEXT contents in character column
df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
ESA_TEXT <- ore.push(df)

# TEXT contents in clob column
attr(df$TITLE, "ora.type") <- "clob"
ESA_TEXT_CLOB <- ore.push(df)

# Create text policy (CTXSYS.CTX_DDL privilege is required)
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")

# Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# ESA algorithm settings in odm.settings
esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
 odm.settings = list(case_id_column_name = "CUST_ID",
                     ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
                     ODMS_TEXT_MIN_DOCUMENTS = 1,
                     ODMS_TEXT_MAX_FEATURES = 3,
                     ESAS_MIN_ITEMS = 1,
                     ESAS_VALUE_THRESHOLD = 0.0001,
                     ESAS_TOPN_FEATURES = 3))
class(esa.mod)
summary(esa.mod)
settings(esa.mod)
features(esa.mod)
predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")

# Use ctx.settings to specify a character column as TEXT and 
# the same settings as above as well as TOKEN_TYPE 
esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
  odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
  ctx.settings = list(TITLE = 
    "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
summary(esa.mod2)
settings(esa.mod2)
features(esa.mod2)
predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")

ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Listing for This Example

R> title <- c('Aids in Africa: Planning for a long war',
+             'Mars rover maneuvers for rim shot',
+             'Mars express confirms presence of water at Mars south pole',
+             'NASA announces major Mars rover finding',
+             'Drug access, Asia threat in focus at AIDS summit',
+             'NASA Mars Odyssey THEMIS image: typical crater',
+             'Road blocks for Aids')
R>
R> # TEXT contents in character column
R> df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
R> ESA_TEXT <- ore.push(df)
R> 
R> # TEXT contents in clob column
R> attr(df$TITLE, "ora.type") <- "clob"
R> ESA_TEXT_CLOB <- ore.push(df)
R> 
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required)
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R> 
R> # Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # ESA algorithm settings in odm.settings
R> esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
+  odm.settings = list(case_id_column_name = "CUST_ID",
+                      ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+                      ODMS_TEXT_MIN_DOCUMENTS = 1,
+                      ODMS_TEXT_MAX_FEATURES = 3,
+                      ESAS_MIN_ITEMS = 1,
+                      ESAS_VALUE_THRESHOLD = 0.0001,
+                      ESAS_TOPN_FEATURES = 3))
R> class(esa.mod)
[1] "ore.odmESA" "ore.model" 
R> summary(esa.mod)

Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT_CLOB, odm.settings = list(case_id_column_name = "CUST_ID", 
    ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", ODMS_TEXT_MIN_DOCUMENTS = 1, 
    ODMS_TEXT_MAX_FEATURES = 3, ESAS_MIN_ITEMS = 1, ESAS_VALUE_THRESHOLD = 1e-04, 
    ESAS_TOPN_FEATURES = 3))

Settings: 
                                               value
min.items                                          1
topn.features                                      3
value.threshold                                1e-04
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                             3
odms.text.min.documents                            1
odms.text.policy.name                     ESA_TXTPOL
prep.auto                                         ON

Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.NASA            <NA>   0.6742695
6           4    TITLE.ROVER            <NA>   0.6742695
7           5     TITLE.AIDS            <NA>   1.0000000
8           6     TITLE.MARS            <NA>   0.4078615
9           6     TITLE.NASA            <NA>   0.9130438
10          7     TITLE.AIDS            <NA>   1.0000000
R> settings(esa.mod)
                   SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2                ESAS_MIN_ITEMS                             1        INPUT
3            ESAS_TOPN_FEATURES                             3        INPUT
4          ESAS_VALUE_THRESHOLD                         1e-04        INPUT
5  ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                 ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7        ODMS_TEXT_MAX_FEATURES                             3        INPUT
8       ODMS_TEXT_MIN_DOCUMENTS                             1        INPUT
9         ODMS_TEXT_POLICY_NAME                    ESA_TXTPOL        INPUT
10                    PREP_AUTO                            ON        INPUT
R> features(esa.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.NASA            <NA>   0.6742695
6           4    TITLE.ROVER            <NA>   0.6742695
7           5     TITLE.AIDS            <NA>   1.0000000
8           6     TITLE.MARS            <NA>   0.4078615
9           6     TITLE.NASA            <NA>   0.9130438
10          7     TITLE.AIDS            <NA>   1.0000000
R> predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")
                                                       TITLE FEATURE_ID
1                    Aids in Africa: Planning for a long war          1
2                          Mars rover maneuvers for rim shot          2
3 Mars express confirms presence of water at Mars south pole          3
4                    NASA announces major Mars rover finding          4
5           Drug access, Asia threat in focus at AIDS summit          1
6             NASA Mars Odyssey THEMIS image: typical crater          6
7                                       Road blocks for Aids          1
R>
R> # Use ctx.settings to specify a character column as TEXT and 
R> # the same settings as above as well as TOKEN_TYPE 
R> esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
+    odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
+    ctx.settings = list(TITLE = 
+      "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
R> summary(esa.mod2)

Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT, odm.settings = list(case_id_column_name = "CUST_ID", 
    ESAS_MIN_ITEMS = 1), ctx.settings = list(TITLE = "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))

Settings: 
                                               value
min.items                                          1
topn.features                                   1000
value.threshold                            .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                        300000
odms.text.min.documents                            3
prep.auto                                         ON

Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.MARS            <NA>   0.3011997
6           4     TITLE.NASA            <NA>   0.6742695
7           4    TITLE.ROVER            <NA>   0.6742695
8           5     TITLE.AIDS            <NA>   1.0000000
9           6     TITLE.MARS            <NA>   0.4078615
10          6     TITLE.NASA            <NA>   0.9130438
11          7     TITLE.AIDS            <NA>   1.0000000
R> settings(esa.mod2)
                  SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                    ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2               ESAS_MIN_ITEMS                             1        INPUT
3           ESAS_TOPN_FEATURES                          1000      DEFAULT
4         ESAS_VALUE_THRESHOLD                     .00000001      DEFAULT
5 ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7       ODMS_TEXT_MAX_FEATURES                        300000      DEFAULT
8      ODMS_TEXT_MIN_DOCUMENTS                             3      DEFAULT
9                    PREP_AUTO                            ON        INPUT
R> features(esa.mod2)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.MARS            <NA>   0.3011997
6           4     TITLE.NASA            <NA>   0.6742695
7           4    TITLE.ROVER            <NA>   0.6742695
8           5     TITLE.AIDS            <NA>   1.0000000
9           6     TITLE.MARS            <NA>   0.4078615
10          6     TITLE.NASA            <NA>   0.9130438
11          7     TITLE.AIDS            <NA>   1.0000000
R> predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")
                                                       TITLE FEATURE_ID
1                    Aids in Africa: Planning for a long war          1
2                          Mars rover maneuvers for rim shot          2
3 Mars express confirms presence of water at Mars south pole          3
4                    NASA announces major Mars rover finding          4
5           Drug access, Asia threat in focus at AIDS summit          1
6             NASA Mars Odyssey THEMIS image: typical crater          6
7                                       Road blocks for Aids          1
R> 
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

7.9 Extensible R Algorithm Model

The ore.odmRAlg function creates an Extensible R algorithm model using OML4SQL.

The Extensible R algorithm builds, scores, and views an R model using registered R scripts. It supports classification, regression, clustering, feature extraction, attribute importance, and association machine learning functions.

For information on the ore.odmRAlg function arguments and for an example of using the function, call help(ore.odmRAlg).

Settings for an Extensible R Algorithm Model

The following table lists settings that apply to Extensible R Algorithm models.

Table 7-7 Extensible R Algorithm Model Settings

Setting Name Setting Value Description

RALG_BUILD_FUNCTION

R_BUILD_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm mining model build function. The R script defines an R function whose first input argument is the training data and that returns an R model object. For a Clustering or Feature Extraction mining function model build, the R attributes dm$nclus and dm$nfeat must be set on the R model to indicate the number of clusters and features, respectively.

The RALG_BUILD_FUNCTION must be set along with ALGO_EXTENSIBLE_LANG in the model_setting_table.

RALG_BUILD_PARAMETER

SELECT value param_name, ... FROM DUAL

Specifies a list of numeric and string scalar values for optional input parameters of the model build function.

RALG_SCORE_FUNCTION

R_SCORE_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script to score data. The script returns a data.frame containing the corresponding prediction results. The setting is used to score data for mining functions such as Regression, Classification, Clustering, and Feature Extraction. This setting does not apply to Association and Attribute Importance functions.

RALG_WEIGHT_FUNCTION

R_WEIGHT_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that computes the weight (contribution) of each attribute in scoring. The script returns a data.frame containing the contributing weight for each attribute in a row. This setting is needed for the PREDICTION_DETAILS SQL function.

RALG_DETAILS_FUNCTION

R_DETAILS_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that produces the model information. This setting is required to generate a model view.

RALG_DETAILS_FORMAT

SELECT type_value column_name, ... FROM DUAL

Specifies the SELECT query for the list of numeric and string scalars for the output column type and the column name of the generated model view. This setting is required to generate a model view.

Example 7-8 Using the ore.odmRAlg Function

library(OREembed)

digits <- getOption("digits")
options(digits = 5L)

IRIS <- ore.push(iris)

# Regression with glm
ore.scriptCreate("glm_build", 
                 function(data, form, family) 
                 glm(formula = form, data = data, family = family)) 

ore.scriptCreate("glm_score", 
                  function(mod, data) 
                    { res <- predict(mod, newdata = data); 
                      data.frame(res) })

ore.scriptCreate("glm_detail", function(mod) 
                 data.frame(name=names(mod$coefficients), 
                              coef=mod$coefficients))

ore.scriptList(name = "glm_build")
ore.scriptList(name = "glm_score")
ore.scriptList(name = "glm_detail")

ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
                        formula = c(form="Sepal.Length ~ ."),
                        build.function = "glm_build", 
                        build.parameter = list(family="gaussian"),
                        score.function = "glm_score",
                        detail.function = "glm_detail", 
                        detail.value = data.frame(name="a", coef=1))
summary(ralg.glm)
predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")

ore.scriptDrop(name = "glm_build")
ore.scriptDrop(name = "glm_score")
ore.scriptDrop(name = "glm_detail")

# Classification with nnet
ore.scriptCreate("nnet_build", 
                 function(dat, form, sz){
                   require(nnet); 
                   set.seed(1234);
                   nnet(formula = formula(form), data = dat, 
                        size = sz, linout = TRUE, trace = FALSE);
                  }, 
                  overwrite = TRUE)

ore.scriptCreate("nnet_detail", function(mod)
                 data.frame(conn = mod$conn, wts = mod$wts), 
                 overwrite = TRUE)

ore.scriptCreate("nnet_score", 
                 function(mod, data) {
                   require(nnet); 
                   res <- data.frame(predict(mod, newdata = data)); 
                   names(res) <- sort(mod$lev); res
                 })

ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
                         formula = c(form="Species ~ ."),
                         build.function = "nnet_build", 
                         build.parameter = list(sz=2),
                         score.function = "nnet_score",
                         detail.function = "nnet_detail",
                         detail.value = data.frame(conn=1, wts =1))

summary(ralg.nnet)
predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")

ore.scriptDrop(name = "nnet_build")
ore.scriptDrop(name = "nnet_score")
ore.scriptDrop(name = "nnet_detail")

# Feature extraction with pca
ore.scriptCreate("pca_build", 
                 function(dat){
                   mod <- prcomp(dat, retx = FALSE)
                   attr(mod, "dm$nfeat") <- ncol(mod$rotation)
                   mod}, 
                 overwrite = TRUE)

ore.scriptCreate("pca_score", 
                 function(mod, data) {
                   res <- predict(mod, data)
                   as.data.frame(res)}, 
                 overwrite=TRUE)

ore.scriptCreate("pca_detail", 
                 function(mod) {
                   rotation_t <- t(mod$rotation)
                   data.frame(id = seq_along(rownames(rotation_t)), 
                                             rotation_t)},
                 overwrite = TRUE)

X <- IRIS[, -5L]
ralg.pca <- ore.odmRAlg(X, 
                        mining.function = "feature_extraction",
                        formula = NULL,
                        build.function = "pca_build",
                        score.function = "pca_score",
                        detail.function = "pca_detail",
                        detail.value = data.frame(Feature.ID=1, 
                                                  ore.pull(head(X,1L))))

summary(ralg.pca)
head(cbind(X, Pred = predict(ralg.pca, newdata = X)))

ore.scriptDrop(name = "pca_build")
ore.scriptDrop(name = "pca_score")
ore.scriptDrop(name = "pca_detail")

options(digits = digits)

Listing for This Example

R> library(OREembed)
R> 
R> digits <- getOption("digits")
R> options(digits = 5L)
R> 
R> IRIS <- ore.push(iris)
R> 
R> # Regression with glm
R> ore.scriptCreate("glm_build", 
+                   function(data, form, family) 
+                   glm(formula = form, data = data, family = family))
R> 
R> ore.scriptCreate("glm_score", 
+                    function(mod, data)
+                      { res <- predict(mod, newdata = data); 
+                        data.frame(res) })
R> 
R> ore.scriptCreate("glm_detail", function(mod) 
+                   data.frame(name=names(mod$coefficients), 
+                                     coef=mod$coefficients))
R>
R> ore.scriptList(name = "glm_build")
       NAME                                                                            SCRIPT
1 glm_build function (data, form, family) \nglm(formula = form, data = data, family = family)

R> ore.scriptList(name = "glm_score")
       NAME                                                                                    SCRIPT
1 glm_score function (mod, data) \n{\n    res <- predict(mod, newdata = data)\n    data.frame(res)\n}
R> ore.scriptList(name = "glm_detail")
        NAME                                                                               SCRIPT
1 glm_detail function (mod) \ndata.frame(name = names(mod$coefficients), coef = mod$coefficients)
R> 
R> ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
+                         formula = c(form="Sepal.Length ~ ."),
+                         build.function = "glm_build", 
+                         build.parameter = list(family="gaussian"),
+                         score.function = "glm_score",
+                         detail.function = "glm_detail", 
+                         detail.value = data.frame(name="a", coef=1))
R> 
R> summary(ralg.glm)

Call:
ore.odmRAlg(data = IRIS, mining.function = "regression", formula = c(form = "Sepal.Length ~ ."), 
    build.function = "glm_build", build.parameter = list(family = "gaussian"), 
    score.function = "glm_score", detail.function = "glm_detail", 
    detail.value = data.frame(name = "a", coef = 1))

Settings: 
                                                                                       value
odms.missing.value.treatment                                         odms.missing.value.auto
odms.sampling                                                          odms.sampling.disable
prep.auto                                                                                OFF
build.function                                                            OML_USER.glm_build
build.parameter              select 'Sepal.Length ~ .' "form", 'gaussian' "family" from dual
details.format                 select cast('a' as varchar2(4000)) "name", 1 "coef" from dual
details.function                                                         OML_USER.glm_detail
score.function                                                            OML_USER.glm_score

               name     coef
1       (Intercept)  2.17127
2      Petal.Length  0.82924
3       Petal.Width -0.31516
4       Sepal.Width  0.49589
5 Speciesversicolor -0.72356
6  Speciesvirginica -1.02350
R> predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")
  Sepal.Length PREDICTION
1          5.1     5.0048
2          4.9     4.7568
3          4.7     4.7731
4          4.6     4.8894
5          5.0     5.0544
6          5.4     5.3889
R> 
R> ore.scriptDrop(name = "glm_build")
R> ore.scriptDrop(name = "glm_score")
R> ore.scriptDrop(name = "glm_detail")
R> 
R> # Classification with nnet
R> ore.scriptCreate("nnet_build", 
+                   function(dat, form, sz){
+                     require(nnet); 
+                     set.seed(1234);
+                     nnet(formula = formula(form), data = dat, 
+                            size = sz, linout = TRUE, trace = FALSE); 
+                   }, 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("nnet_detail", function(mod)
+                   data.frame(conn = mod$conn, wts = mod$wts), 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("nnet_score", 
+                   function(mod, data) {
+                     require(nnet); 
+                     res <- data.frame(predict(mod, newdata = data)); 
+                     names(res) <- sort(mod$lev); res
+                   })
R> 
R> ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
+                           formula = c(form="Species ~ ."),
+                           build.function = "nnet_build", 
+                           build.parameter = list(sz=2),
+                           score.function = "nnet_score",
+                           detail.function = "nnet_detail",
+                           detail.value = data.frame(conn=1, wts =1))
R> 
R> summary(ralg.nnet)

Call:
ore.odmRAlg(data = IRIS, mining.function = "classification", 
    formula = c(form = "Species ~ ."), build.function = "nnet_build", 
    build.parameter = list(sz = 2), score.function = "nnet_score", 
    detail.function = "nnet_detail", detail.value = data.frame(conn = 1, 
        wts = 1))

Settings: 
                                                                     value
clas.weights.balanced                                                  OFF
odms.missing.value.treatment                       odms.missing.value.auto
odms.sampling                                        odms.sampling.disable
prep.auto                                                              OFF
build.function                                         OML_USER.nnet_build
build.parameter              select 'Species ~ .' "form", 2 "sz" from dual
details.format                          select 1 "conn", 1 "wts" from dual
details.function                                      OML_USER.nnet_detail
score.function                                         OML_USER.nnet_score

   conn       wts
1     0   1.46775
2     1 -12.88542
3     2  -4.38886
4     3   9.98648
5     4  16.57056
6     0   0.97809
7     1  -0.51626
8     2  -0.94815
9     3   0.13692
10    4   0.35104
11    0  37.22475
12    5 -66.49123
13    6  70.81160
14    0  -4.50893
15    5   7.01611
16    6  20.88774
17    0 -32.15127
18    5  58.92088
19    6 -91.96989
R> predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")
  Species PREDICTION PROBABILITY
1  setosa     setosa     0.99999
2  setosa     setosa     0.99998
3  setosa     setosa     0.99999
4  setosa     setosa     0.99998
5  setosa     setosa     1.00000
6  setosa     setosa     0.99999
R> 
R> ore.scriptDrop(name = "nnet_build")
R> ore.scriptDrop(name = "nnet_score")
R> ore.scriptDrop(name = "nnet_detail")
R> 
R> ore.scriptCreate("pca_build", 
+                   function(dat){
+                     mod <- prcomp(dat, retx = FALSE)
+                     attr(mod, "dm$nfeat") <- ncol(mod$rotation)
+                     mod}, 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("pca_score", 
+                   function(mod, data) {
+                     res <- predict(mod, data)
+                     as.data.frame(res)}, 
+                   overwrite=TRUE)
R> 
R> ore.scriptCreate("pca_detail", 
+                   function(mod) {
+                     rotation_t <- t(mod$rotation)
+                     data.frame(id = seq_along(rownames(rotation_t)), 
+                                               rotation_t)},
+                   overwrite = TRUE)
R> 
R> X <- IRIS[, -5L]
R> ralg.pca <- ore.odmRAlg(X, 
+                         mining.function = "feature_extraction",
+                         formula = NULL,
+                         build.function = "pca_build",
+                         score.function = "pca_score",
+                         detail.function = "pca_detail",
+                         detail.value = data.frame(Feature.ID=1, 
+                                                   ore.pull(head(X,1L))))
R> 
R> summary(ralg.pca)

Call:
ore.odmRAlg(data = X, mining.function = "feature_extraction", 
    formula = NULL, build.function = "pca_build", score.function = "pca_score", 
    detail.function = "pca_detail", detail.value = data.frame(Feature.ID = 1, 
        ore.pull(head(X, 1L))))

Settings: 
                                                                  value
odms.missing.value.treatment                    odms.missing.value.auto
odms.sampling                                     odms.sampling.disable
prep.auto                                                           OFF
build.function                                       OML_USER.pca_build
details.format    select 1 "Feature.ID", 5.1 "Sepal.Length", 3.5 "Sepal.Width", 1.4 "Petal.Length", 0.2 "Petal.Width" from dual
details.function                                    OML_USER.pca_detail
score.function                                       OML_USER.pca_score

  Feature.ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1          1     0.856671    0.358289      0.36139   -0.084523
2          2    -0.173373   -0.075481      0.65659    0.730161
3          3     0.076236    0.545831     -0.58203    0.597911
4          4     0.479839   -0.753657     -0.31549    0.319723
R> head(cbind(X, Pred = predict(ralg.pca, newdata = X)))
  Sepal.Length Sepal.Width Petal.Length Petal.Width FEATURE_ID
1          5.1         3.5          1.4         0.2          2
2          4.9         3.0          1.4         0.2          4
3          4.7         3.2          1.3         0.2          3
4          4.6         3.1          1.5         0.2          4
5          5.0         3.6          1.4         0.2          2
6          5.4         3.9          1.7         0.4          2
R> 
R> ore.scriptDrop(name = "pca_build")
R> ore.scriptDrop(name = "pca_score")
R> ore.scriptDrop(name = "pca_detail")
R> 
R> options(digits = digits)

7.10 Generalized Linear Models

The ore.odmGLM function builds a Generalized Linear Model (GLM), which includes and extends the class of linear models (linear regression).

Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.

The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.

GLM can be used to build classification or regression models as follows:

  • Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

  • Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.

The ore.odmGLM function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.

For information on the ore.odmGLM function arguments, invoke help(ore.odmGLM).

The following examples build several models using GLM. The input ore.frame objects are R data sets pushed to the database.

Settings for Generalized Linear Models

The following table lists settings that apply to Generalized Linear models.

Table 7-8 Generalized Linear Model Settings

Setting Name Setting Value Description

GLMS_CONF_LEVEL

TO_CHAR(0< numeric_expr <1)

The confidence level for coefficient confidence intervals.

The default confidence level is 0.95.

GLMS_FTR_GEN_METHOD

GLMS_FTR_GEN_QUADRATIC

GLMS_FTR_GEN_CUBIC

Whether feature generation is quadratic or cubic.

When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.

GLMS_FTR_GENERATION

GLMS_FTR_GENERATION_ENABLE

GLMS_FTR_GENERATION_DISABLE

Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

Note:

Feature generation can only be enabled when feature selection is also enabled.

GLMS_FTR_SEL_CRIT

GLMS_FTR_SEL_AIC

GLMS_FTR_SEL_SBIC

GLMS_FTR_SEL_RIC

GLMS_FTR_SEL_ALPHA_INV

Feature selection penalty criterion for adding a feature to the model.

When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

GLMS_FTR_SELECTION

GLMS_FTR_SELECTION_ENABLE

GLMS_FTR_SELECTION_DISABLE

Whether or not feature selection is enabled for GLM.

By default, feature selection is not enabled.

GLMS_MAX_FEATURES

TO_CHAR(0 < numeric_expr <= 2000)

When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model.

By default, the algorithm limits the number of features to ensure sufficient memory.

GLMS_PRUNE_MODEL

GLMS_PRUNE_MODEL_ENABLE

GLMS_PRUNE_MODEL_DISABLE

Enables or disables pruning of features in the final model. Pruning is based on t-test statistics for linear regression, or Wald test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data.

When feature selection is enabled, the algorithm automatically performs pruning based on the data.

GLMS_REFERENCE_CLASS_NAME

target_value

The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

GLMS_RIDGE_REGRESSION

GLMS_RIDGE_REG_ENABLE

GLMS_RIDGE_REG_DISABLE

Enables or disables Ridge Regression. Ridge applies to both the regression and classification mining functions.

When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function.

Note:

Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised.

GLMS_RIDGE_VALUE

TO_CHAR (numeric_expr > 0)

The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression.

If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.

GLMS_ROW_DIAGNOSTICS

GLMS_ROW_DIAG_ENABLE

GLMS_ROW_DIAG_DISABLE (default).

Enable or disable row diagnostics.

GLMS_CONV_TOLERANCE

The range is (0, 1) non-inclusive.

Convergence Tolerance setting of the GLM algorithm

The default value is system-determined.

GLMS_NUM_ITERATIONS

Positive integer

Maximum number of iterations for the GLM algorithm. The default value is system-determined.

GLMS_BATCH_ROWS

0 or Positive integer

Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data-driven batch size estimate.

The default is 2000.

GLMS_SOLVER

GLMS_SOLVER_SGD (Stochastic Gradient Descent)

GLMS_SOLVER_CHOL (Cholesky)

GLMS_SOLVER_QR

GLMS_SOLVER_LBFGS_ADMM

This setting allows the user to choose the GLM solver. The solver cannot be selected if the GLMS_FTR_SELECTION setting is enabled. The default value is system-determined.

GLMS_SPARSE_SOLVER

GLMS_SPARSE_SOLVER_ENABLE

GLMS_SPARSE_SOLVER_DISABLE (default).

This setting allows the user to use a sparse solver if it is available. The default value is GLMS_SPARSE_SOLVER_DISABLE.
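
These settings can be supplied through the odm.settings argument, following the same pattern used elsewhere in this chapter (for example, with ore.odmESA and ore.odmSVD). The following is a minimal sketch, assuming that ore.odmGLM accepts the odm.settings argument in your OML4R version; the setting values shown are illustrative, not recommendations.

# A hedged sketch: enabling feature selection and choosing a penalty
# criterion through odm.settings (illustrative values).
longley_of <- ore.push(longley)
glm.fs <- ore.odmGLM(Employed ~ ., data = longley_of,
                     odm.settings = list(GLMS_FTR_SELECTION = "GLMS_FTR_SELECTION_ENABLE",
                                         GLMS_FTR_SEL_CRIT  = "GLMS_FTR_SEL_AIC"))
summary(glm.fs)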

Example 7-9 Building a Linear Regression Model

This example builds a linear regression model using the longley data set.

longley_of <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
summary(longfit1)

Listing for This Example

R> longley_of <- ore.push(longley)
R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
R> summary(longfit1)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10

Example 7-10 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model

This example uses the longley_of ore.frame from the previous example. It invokes the ore.odmGLM function and specifies ridge estimation for the coefficients.

longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
                       ridge.vif = TRUE)
summary(longfit2)

Listing for This Example

R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
+                         ridge.vif = TRUE)
R> summary(longfit2)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE, 
    ridge.vif = TRUE)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.4100 -0.1579 -0.0271  0.1017  0.4575 
 
Coefficients:
               Estimate   VIF
(Intercept)  -3.466e+03 0.000
GNP.deflator  1.479e-02 0.077
GNP          -3.535e-02 0.012
Unemployed   -2.013e-02 0.000
Armed.Forces -1.031e-02 0.000
Population   -5.262e-02 0.548
Year          1.821e+00 2.212
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.2 on 6 and 9 DF,  p-value: 4.986e-10

Example 7-11 Building a Logistic Regression GLM

This example builds a logistic regression (classification) model. It uses the infert data set. The example invokes the ore.odmGLM function and specifies logistic as the type argument, which builds a binomial GLM.

infert_of <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic")
infit1

Listing for This Example

R> infert_of <- ore.push(infert)
R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic")
R> infit1
 
Response:
case == "1"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic")
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
        -2.19348           0.03958          -0.82828           1.04424          -0.35896           2.04590           1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

Example 7-12 Specifying a Reference Value in Building a Logistic Regression GLM

This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of ore.frame from Example 7-11.

infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic", reference = 1)
infit2

Listing for This Example

R> infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic", reference = 1)
R> infit2

Response:
case == "0"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic", reference = 1)
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
         2.19348          -0.03958           0.82828          -1.04424           0.35896          -2.04590          -1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

7.11 k-Means

The ore.odmKMeans function uses the OML4SQL k-Means (KM) algorithm, a distance-based clustering algorithm that partitions data into a specified number of clusters.

The algorithm has the following features:

  • Several distance functions: Euclidean, Cosine, and Fast Cosine. The default is Euclidean.

  • For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numeric attributes.

For information on the ore.odmKMeans function arguments, call help(ore.odmKMeans).

Settings for k-Means Models

The following table lists settings that apply to k-Means models.

Table 7-9 k-Means Model Settings

Setting Name Setting Value Description

KMNS_CONV_TOLERANCE

TO_CHAR(0<numeric_expr<1)

Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in KMNS_ITERATIONS, is reached.

Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times.

The default Convergence Tolerance is 0.001.

KMNS_DISTANCE

KMNS_COSINE

KMNS_EUCLIDEAN

Distance function for k-Means.

The default distance function is KMNS_EUCLIDEAN.

KMNS_ITERATIONS

TO_CHAR(positive_numeric_expr)

Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in KMNS_CONV_TOLERANCE, is satisfied.

The default number of iterations is 20.

KMNS_MIN_PCT_ATTR_SUPPORT

TO_CHAR(0<=numeric_expr<=1)

Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.

If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules.

The default minimum support is 0.1.

KMNS_NUM_BINS

TO_CHAR(numeric_expr>0)

Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

The default number of histogram bins is 11.

KMNS_SPLIT_CRITERION

KMNS_SIZE

KMNS_VARIANCE

Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time.

When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster.

The default split criterion is KMNS_VARIANCE.

KMNS_RANDOM_SEED

Non-negative integer

This setting controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value.

The default is 0.

KMNS_DETAILS

KMNS_DETAILS_NONE

KMNS_DETAILS_HIERARCHY

KMNS_DETAILS_ALL

This setting determines the level of cluster details that are computed during the build.

KMNS_DETAILS_NONE: No cluster details are computed. Only the scoring information is persisted.

KMNS_DETAILS_HIERARCHY: Cluster hierarchy and cluster record counts are computed. This is the default value.

KMNS_DETAILS_ALL: Cluster hierarchy, record counts, and descriptive statistics (means, variances, modes, histograms, and rules) are computed.
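
As with the other OREdm functions, these settings can be passed through the odm.settings argument. The following is a minimal sketch, assuming that ore.odmKMeans accepts the odm.settings argument in your OML4R version; the synthetic data and setting values are illustrative.

# A hedged sketch: overriding the distance function, iteration limit,
# and random seed (illustrative values, synthetic data).
xy_of <- ore.push(data.frame(x = rnorm(100), y = rnorm(100)))
km.cos <- ore.odmKMeans(~., xy_of, num.centers = 2,
                        odm.settings = list(KMNS_DISTANCE    = "KMNS_COSINE",
                                            KMNS_ITERATIONS  = 20,
                                            KMNS_RANDOM_SEED = 7))
summary(km.cos)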

Example 7-13 Using the ore.odmKMeans Function

This example demonstrates the use of the ore.odmKMeans function. The example creates two matrices that have 100 rows and two columns. The values in the rows are random variates. It binds the matrices into the matrix x, then coerces x to a data.frame and pushes it to the database as x_of, an ore.frame object. The example next calls the ore.odmKMeans function to build the KM model, km.mod1. It then calls the summary and histogram functions on the model. Figure 7-2 shows the graphic displayed by the histogram function.

Finally, the example makes a prediction using the model, pulls the result to local memory, and plots the results. Figure 7-3 shows the graphic displayed by the points function.

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_of <- ore.push (data.frame(x))
km.mod1 <- NULL
km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2)
summary(km.mod1)
histogram(km.mod1)
# Make a prediction.
km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y"))
head(km.res1, 3)
# Pull the results to the local memory and plot them.
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x=km.res1.local$x, y=km.res1.local$y),
                col=km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)
head(predict(km.mod1, x_of, type=c("class","raw"),
             supplemental.cols=c("x","y")), 3)

Listing for This Example

R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> x_of <- ore.push (data.frame(x))
R> km.mod1 <- NULL
R> km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2)
R> summary(km.mod1)
 
Call:
ore.odmKMeans(formula = ~., data = x_of, num.centers = 2)
 
Settings: 
                         value
clus.num.clusters            2
block.growth                 2
conv.tolerance            0.01
distance             euclidean
iterations                   3
min.pct.attr.support       0.1
num.bins                    10
split.criterion       variance
prep.auto                   on
 
Centers: 
            x           y
2  0.99772307  0.93368684
3 -0.02721078 -0.05099784
R> histogram(km.mod1)
R> # Make a prediction.
R> km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y"))
R> head(km.res1, 3)
            x          y CLUSTER_ID
1 -0.03038444  0.4395409          3
2  0.17724606 -0.5342975          3
3 -0.17565761  0.2832132          3
R> # Pull the results to the local memory and plot them.
R> km.res1.local <- ore.pull(km.res1)
R> plot(data.frame(x=km.res1.local$x, y=km.res1.local$y),
+                  col=km.res1.local$CLUSTER_ID)
R> points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)
R> head(predict(km.mod1, x_of, type=c("class","raw"),
+               supplemental.cols=c("x","y")), 3)
           '2'       '3'           x          y CLUSTER_ID
1 8.610341e-03 0.9913897 -0.03038444  0.4395409          3
2 8.017890e-06 0.9999920  0.17724606 -0.5342975          3
3 5.494263e-04 0.9994506 -0.17565761  0.2832132          3

Figure 7-2 shows the graphic displayed by the invocation of the histogram function in Example 7-13.

Figure 7-2 Cluster Histograms for the km.mod1 Model

Description of Figure 7-2 follows
Description of "Figure 7-2 Cluster Histograms for the km.mod1 Model"

Figure 7-3 shows the graphic displayed by the invocation of the points function in Example 7-13.

Figure 7-3 Results of the points Function for the km.mod1 Model

Description of Figure 7-3 follows
Description of "Figure 7-3 Results of the points Function for the km.mod1 Model"

7.12 Naive Bayes

The ore.odmNB function builds an OML4SQL Naive Bayes model.

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.

Naive Bayes assumes that each predictor is conditionally independent of the others given the target value; this independence assumption is what makes the computation of the conditional probabilities tractable.

For information on the ore.odmNB function arguments, call help(ore.odmNB).

Settings for Naive Bayes Models

The following table lists settings that apply to Naive Bayes models.

Table 7-10 Naive Bayes Model Settings

Setting Name Setting Value Description

NABS_PAIRWISE_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of the pairwise threshold for the NB algorithm.

Default is 0.

NABS_SINGLETON_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of the singleton threshold for the NB algorithm.

Default is 0.
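
Both thresholds can be supplied through the odm.settings argument of ore.odmNB. The following is a minimal sketch, assuming that odm.settings is supported in your OML4R version; the threshold values are illustrative, not recommendations.

# A hedged sketch: raising the pairwise and singleton thresholds
# (illustrative values).
m <- mtcars
m$gear <- as.factor(m$gear)
mtcars_nb <- ore.push(m)
nb.smooth <- ore.odmNB(gear ~ ., mtcars_nb,
                       odm.settings = list(NABS_PAIRWISE_THRESHOLD  = 0.01,
                                           NABS_SINGLETON_THRESHOLD = 0.01))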

Example 7-14 Using the ore.odmNB Function

This example creates an input ore.frame, builds a Naive Bayes model, makes predictions, and generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
row.names(mtcars_of) <- mtcars_of$ID
# Build the model.
nb.mod  <- ore.odmNB(gear ~ ., mtcars_of)
summary(nb.mod)
# Make predictions and generate a confusion matrix.
nb.res  <- predict (nb.mod, mtcars_of, "gear")
with(nb.res, table(gear, PREDICTION))

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R> row.names(mtcars_of) <- mtcars_of$ID
R> # Build the model.
R> nb.mod  <- ore.odmNB(gear ~ ., mtcars_of)
R> summary(nb.mod)
 
Call:
ore.odmNB(formula = gear ~ ., data = mtcars_of)
 
Settings: 
          value
prep.auto    on
 
Apriori: 
      3       4       5 
0.46875 0.37500 0.15625 
Tables: 
$ID
  ( ; 26.5), [26.5; 26.5]  (26.5;  )
3              1.00000000           
4              0.91666667 0.08333333
5                         1.00000000
 
$am
          0         1
3 1.0000000          
4 0.3333333 0.6666667
5           1.0000000
 
$cyl
  '4', '6' '8'
3      0.2 0.8
4      1.0    
5      0.6 0.4
 
$disp
  ( ; 196.299999999999995), [196.299999999999995; 196.299999999999995]
3                                                           0.06666667
4                                                           1.00000000
5                                                           0.60000000
  (196.299999999999995;  )
3               0.93333333
4                         
5               0.40000000
 
$drat
  ( ; 3.385), [3.385; 3.385] (3.385;  )
3                  0.8666667  0.1333333
4                             1.0000000
5                             1.0000000
$hp
  ( ; 136.5), [136.5; 136.5] (136.5;  )
3                        0.2        0.8
4                        1.0           
5                        0.4        0.6
 
$vs
          0         1
3 0.8000000 0.2000000
4 0.1666667 0.8333333
5 0.8000000 0.2000000
 
$wt
  ( ; 3.2024999999999999), [3.2024999999999999; 3.2024999999999999]
3                                                        0.06666667
4                                                        0.83333333
5                                                        0.80000000
  (3.2024999999999999;  )
3              0.93333333
4              0.16666667
5              0.20000000
 
 
Levels: 
[1] "3" "4" "5"

R> # Make predictions and generate a confusion matrix.
R> nb.res  <- predict (nb.mod, mtcars_of, "gear")
R> with(nb.res, table(gear, PREDICTION))
    PREDICTION
gear  3  4  5
   3 14  1  0
   4  0 12  0
   5  0  1  4

7.13 Non-Negative Matrix Factorization

The ore.odmNMF function builds an OML4SQL Non-Negative Matrix Factorization (NMF) model for feature extraction.

Each feature extracted by NMF is a linear combination of the original attribute set. Each feature has a set of non-negative coefficients, which are a measure of the weight of each attribute on the feature. If the argument allow.negative.scores is TRUE, then negative coefficients are allowed.

For information on the ore.odmNMF function arguments, call help(ore.odmNMF).

Settings for Non-Negative Matrix Factorization Models

The following table lists settings that apply to Non-Negative Matrix Factorization models.

Table 7-11 Non-Negative Matrix Factorization Model Settings

Setting Name Setting Value Description

NMFS_CONV_TOLERANCE

TO_CHAR(0< numeric_expr <=0.5)

Convergence tolerance for NMF algorithm

Default is 0.05

NMFS_NONNEGATIVE_SCORING

NMFS_NONNEG_SCORING_ENABLE

NMFS_NONNEG_SCORING_DISABLE

Whether negative numbers should be allowed in scoring results. When set to NMFS_NONNEG_SCORING_ENABLE, negative feature values will be replaced with zeros. When set to NMFS_NONNEG_SCORING_DISABLE, negative feature values will be allowed.

Default is NMFS_NONNEG_SCORING_ENABLE

NMFS_NUM_ITERATIONS

TO_CHAR(1 <= numeric_expr <=500)

Number of iterations for NMF algorithm

Default is 50

NMFS_RANDOM_SEED

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.
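
These settings can be supplied through the odm.settings argument of ore.odmNMF, and the allow.negative.scores argument described earlier corresponds to the NMFS_NONNEGATIVE_SCORING setting. The following is a minimal sketch, assuming odm.settings is supported in your OML4R version; the values are illustrative.

# A hedged sketch: fixing the random seed and iteration count, and
# permitting negative scores (illustrative values).
training.set <- ore.push(npk[1:18, c("N","P","K")])
nmf.seeded <- ore.odmNMF(~., training.set, num.features = 3,
                         allow.negative.scores = TRUE,
                         odm.settings = list(NMFS_RANDOM_SEED    = 1,
                                             NMFS_NUM_ITERATIONS = 100))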

Example 7-15 Using the ore.odmNMF Function

This example creates an NMF model on a training data set and scores on a test data set.

training.set <- ore.push(npk[1:18, c("N","P","K")])
scoring.set <- ore.push(npk[19:24, c("N","P","K")])
nmf.mod <- ore.odmNMF(~., training.set, num.features = 3)
features(nmf.mod)
summary(nmf.mod)
predict(nmf.mod, scoring.set)

Listing for This Example

R> training.set <- ore.push(npk[1:18, c("N","P","K")])
R> scoring.set <- ore.push(npk[19:24, c("N","P","K")])
R> nmf.mod <- ore.odmNMF(~., training.set, num.features = 3)
R> features(nmf.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE  COEFFICIENT
1           1              K               0 3.723468e-01
2           1              K               1 1.761670e-01
3           1              N               0 7.469067e-01
4           1              N               1 1.085058e-02
5           1              P               0 5.730082e-01
6           1              P               1 2.797865e-02
7           2              K               0 4.107375e-01
8           2              K               1 2.193757e-01
9           2              N               0 8.065393e-03
10          2              N               1 8.569538e-01
11          2              P               0 4.005661e-01
12          2              P               1 4.124996e-02
13          3              K               0 1.918852e-01
14          3              K               1 3.311137e-01
15          3              N               0 1.547561e-01
16          3              N               1 1.283887e-01
17          3              P               0 9.791965e-06
18          3              P               1 9.113922e-01
R> summary(nmf.mod)
 
Call:
ore.odmNMF(formula = ~., data = training.set, num.features = 3)
 
Settings: 
                                              value
feat.num.features                                 3
nmfs.conv.tolerance                             .05
nmfs.nonnegative.scoring nmfs.nonneg.scoring.enable
nmfs.num.iterations                              50
nmfs.random.seed                                 -1
prep.auto                                        on
 
Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE  COEFFICIENT
1           1              K               0 3.723468e-01
2           1              K               1 1.761670e-01
3           1              N               0 7.469067e-01
4           1              N               1 1.085058e-02
5           1              P               0 5.730082e-01
6           1              P               1 2.797865e-02
7           2              K               0 4.107375e-01
8           2              K               1 2.193757e-01
9           2              N               0 8.065393e-03
10          2              N               1 8.569538e-01
11          2              P               0 4.005661e-01
12          2              P               1 4.124996e-02
13          3              K               0 1.918852e-01
14          3              K               1 3.311137e-01
15          3              N               0 1.547561e-01
16          3              N               1 1.283887e-01
17          3              P               0 9.791965e-06
18          3              P               1 9.113922e-01
R> predict(nmf.mod, scoring.set)
         '1'       '2'        '3' FEATURE_ID
19 0.1972489 1.2400782 0.03280919          2
20 0.7298919 0.0000000 1.29438165          3
21 0.1972489 1.2400782 0.03280919          2
22 0.0000000 1.0231268 0.98567623          2
23 0.7298919 0.0000000 1.29438165          3
24 1.5703239 0.1523159 0.00000000          1

7.14 Orthogonal Partitioning Cluster

The ore.odmOC function builds an OML4SQL model using the Orthogonal Partitioning Cluster (O-Cluster) algorithm.

The O-Cluster algorithm builds a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.

The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. The sensitivity argument defines a baseline density level. Only areas that have a peak density above this baseline level can be identified as clusters.

The k-Means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-Means tessellates it into n clusters (where n is specified by the user). O-Cluster separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring by the predict function for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numeric attributes and multinomial distributions for categorical attributes.

If you choose to prepare the data for an O-Cluster model, keep the following points in mind:

  • The O-Cluster algorithm does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It only reads another batch if it believes, based on statistical tests, that there may still exist clusters that it has not yet uncovered.

  • Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.

  • Binary attributes should be declared as categorical. O-Cluster maps categorical data to numeric values.

  • The use of OML4SQL equi-width binning transformation with automated estimation of the required number of bins is highly recommended.

  • The presence of outliers can significantly impact clustering algorithms. Use a clipping transformation before binning or normalizing. Outliers with equi-width binning can prevent O-Cluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.

The specification of the formula argument has the form ~ terms, where terms are the column names to include in the model. Multiple terms are specified using + between column names. Use ~ . if all columns in data should be used for model building. To exclude columns, precede each excluded column name with - (see the sketch that follows).
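
For example, the following sketch shows the three common forms; the data set and column names are illustrative assumptions.

# Illustrative formula specifications (synthetic data; column names assumed).
df_of <- ore.push(data.frame(ID = 1:100, x = rnorm(100), y = rnorm(100)))
oc.all  <- ore.odmOC(~ .,      df_of, num.centers = 2)  # all columns
oc.xy   <- ore.odmOC(~ x + y,  df_of, num.centers = 2)  # only x and y
oc.noid <- ore.odmOC(~ . - ID, df_of, num.centers = 2)  # all columns except ID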

For information on the ore.odmOC function arguments, call help(ore.odmOC).

Settings for Orthogonal Partitioning Cluster Models

The following table lists settings that apply to Orthogonal Partitioning Cluster models.

Table 7-12 Orthogonal Partitioning Cluster Model Settings

Setting Name Setting Value Description

OCLT_SENSITIVITY

TO_CHAR(0 <=numeric_expr <=1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5

Example 7-16 Using the ore.odmOC Function

This example creates an O-Cluster model on a synthetic data set. The figure following the example shows the histogram of the resulting clusters.

x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_of <- ore.push (data.frame(ID=1:100,x))
rownames(x_of) <- x_of$ID
oc.mod <- ore.odmOC(~., x_of, num.centers=2)
summary(oc.mod)
histogram(oc.mod)
predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y"))

Listing for This Example

R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> x_of <- ore.push (data.frame(ID=1:100,x))
R> rownames(x_of) <- x_of$ID
R> oc.mod <- ore.odmOC(~., x_of, num.centers=2)
R> summary(oc.mod)
 
Call:
ore.odmOC(formula = ~., data = x_of, num.centers = 2)
 
Settings: 
                  value
clus.num.clusters     2
max.buffer        50000
sensitivity         0.5
prep.auto            on
 
Clusters: 
  CLUSTER_ID ROW_CNT PARENT_CLUSTER_ID TREE_LEVEL DISPERSION IS_LEAF
1          1     100                NA          1         NA   FALSE
2          2      56                 1          2         NA    TRUE
3          3      43                 1          2         NA    TRUE
 
Centers: 
   MEAN.x   MEAN.y
2 1.85444 1.941195
3 4.04511 4.111740
R> histogram(oc.mod)
R> predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y"))
             '2'          '3'        x        y CLUSTER_ID
1   3.616386e-08 9.999999e-01 3.825303 3.935346          3
2   3.253662e-01 6.746338e-01 3.454143 4.193395          3
3   3.616386e-08 9.999999e-01 4.049120 4.172898          3
# ... Intervening rows not shown.
98  1.000000e+00 1.275712e-12 2.011463 1.991468          2
99  1.000000e+00 1.275712e-12 1.727580 1.898839          2
100 1.000000e+00 1.275712e-12 2.092737 2.212688          2

Figure 7-4 Output of the histogram Function for the ore.odmOC Model

Description of Figure 7-4 follows
Description of "Figure 7-4 Output of the histogram Function for the ore.odmOC Model"

7.15 Singular Value Decomposition

The ore.odmSVD function creates a model that uses the OML4SQL Singular Value Decomposition (SVD) algorithm.

Singular Value Decomposition (SVD) is a feature extraction algorithm. SVD is an orthogonal linear transformation that captures the underlying variance of the data by decomposing a rectangular matrix into three matrices: 'U', 'D', and 'V'. Matrix 'D' is a diagonal matrix, and its singular values reflect the amount of data variance captured by the bases.

Settings for Singular Value Decomposition Models

The following table lists settings that apply to Singular Value Decomposition models.

Table 7-13 Singular Value Decomposition Model Settings

Setting Name Setting Value Description

SVDS_MAX_NUM_FEATURES

2500

The maximum number of features supported by SVD.
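
By default, the U matrix is not computed; the listing in Example 7-17 shows u.matrix.output as u.matrix.disable, along with a warning that the U matrix is not calculated. The following is a hedged sketch that requests the U matrix through odm.settings; the setting name SVDS_U_MATRIX_OUTPUT and the value SVDS_U_MATRIX_ENABLE are assumed from the corresponding OML4SQL setting.

# A hedged sketch: requesting computation of the U matrix
# (SVDS_U_MATRIX_OUTPUT and SVDS_U_MATRIX_ENABLE are assumed names).
IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
svd.umod <- ore.odmSVD(~. -Id, IRIS,
                       odm.settings = list(SVDS_U_MATRIX_OUTPUT = "SVDS_U_MATRIX_ENABLE"))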

Example 7-17 Using the ore.odmSVD Function

IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))

svd.mod <- ore.odmSVD(~. -Id, IRIS)
summary(svd.mod)
d(svd.mod)
v(svd.mod)
head(predict(svd.mod, IRIS, supplemental.cols = "Id"))

svd.pmod <- ore.odmSVD(~. -Id, IRIS, 
                       odm.settings = list(odms_partition_columns = "Species"))
summary(svd.pmod)
d(svd.pmod)
v(svd.pmod)
head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))

Listing for This Example

R> IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
R> 
R> svd.mod <- ore.odmSVD(~. -Id, IRIS)
R> summary(svd.mod)
Call:
ore.odmSVD(formula = ~. - Id, data = IRIS)

Settings: 
                                               value
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
scoring.mode                             scoring.svd
u.matrix.output                     u.matrix.disable

d: 
  FEATURE_ID      VALUE
1          1 96.2182677
2          2 19.0780817
3          3  7.2270380
4          4  3.1502152
5          5  1.8849634
6          6  1.1474731
7          7  0.5814097
v: 
  ATTRIBUTE_NAME ATTRIBUTE_VALUE        '1'         '2'          '3'         '4'         '5'         '6'          '7'
1   Petal.Length            <NA> 0.51162932  0.65943465 -0.004420703  0.05479795 -0.51969015  0.17392232 -0.005674672
2    Petal.Width            <NA> 0.16745698  0.32071102  0.146484369  0.46553390  0.72685033  0.31962337 -0.021274748
3   Sepal.Length            <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847  0.31969417 -0.09379235 -0.067308615
4    Sepal.Width            <NA> 0.37906736 -0.50824062  0.142810811  0.69139828 -0.25849391 -0.17606099 -0.041908520
5        Species          setosa 0.03170407 -0.32247642  0.184499940 -0.12245506 -0.14348647  0.76017824  0.497502783
6        Species      versicolor 0.04288799  0.04054823 -0.780684855  0.19827972  0.07363250 -0.12354271  0.571881302
7        Species       virginica 0.05018593  0.16796988  0.551546107 -0.07177990  0.08109974 -0.48442099  0.647048040
Warning message:
In u.ore.odmSVD(object) : U matrix is not calculated.
R> d(svd.mod)
  FEATURE_ID      VALUE
1          1 96.2182677
2          2 19.0780817
3          3  7.2270380
4          4  3.1502152
5          5  1.8849634
6          6  1.1474731
7          7  0.5814097
Warning message:
ORE object has no unique key - using random order 
R> v(svd.mod)
  ATTRIBUTE_NAME ATTRIBUTE_VALUE        '1'         '2'          '3'         '4'         '5'         '6'          '7'
1   Petal.Length            <NA> 0.51162932  0.65943465 -0.004420703  0.05479795 -0.51969015  0.17392232 -0.005674672
2    Petal.Width            <NA> 0.16745698  0.32071102  0.146484369  0.46553390  0.72685033  0.31962337 -0.021274748
3   Sepal.Length            <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847  0.31969417 -0.09379235 -0.067308615
4    Sepal.Width            <NA> 0.37906736 -0.50824062  0.142810811  0.69139828 -0.25849391 -0.17606099 -0.041908520
5        Species          setosa 0.03170407 -0.32247642  0.184499940 -0.12245506 -0.14348647  0.76017824  0.497502783
6        Species      versicolor 0.04288799  0.04054823 -0.780684855  0.19827972  0.07363250 -0.12354271  0.571881302
7        Species       virginica 0.05018593  0.16796988  0.551546107 -0.07177990  0.08109974 -0.48442099  0.647048040
Warning message:
ORE object has no unique key - using random order 
R> head(predict(svd.mod, IRIS, supplemental.cols = "Id"))
  Id        '1'        '2'        '3'         '4'           '5'          '6'          '7' FEATURE_ID
1  1 0.06161595 -0.1291839 0.02586865 -0.01449182  1.536727e-05 -0.023495349 -0.007998605          2
2  2 0.05808905 -0.1130876 0.01881265 -0.09294788  3.466226e-02  0.069569113  0.051195429          2
3  3 0.05678818 -0.1190959 0.02565027 -0.01950986  8.851560e-04  0.040073030  0.060908867          2
4  4 0.05667915 -0.1081308 0.02496402 -0.02233741 -5.750222e-02  0.093904181  0.077741713          2
5  5 0.06123138 -0.1304597 0.02925687  0.02309694 -3.065834e-02 -0.030664898 -0.003629897          2
6  6 0.06747071 -0.1302726 0.03340671  0.06114966 -9.547838e-03 -0.008210224 -0.081807741          2
R> 
R> svd.pmod <- ore.odmSVD(~. -Id, IRIS, 
+                         odm.settings = list(odms_partition_columns = "Species"))
R> summary(svd.pmod)
$setosa

Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))

Settings: 
                                               value
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns                     "Species"
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
scoring.mode                             scoring.svd
u.matrix.output                     u.matrix.disable

d: 
  FEATURE_ID      VALUE
1          1 44.2872290
2          2  1.5719162
3          3  1.1458732
4          4  0.6836692
v: 
  ATTRIBUTE_NAME ATTRIBUTE_VALUE       '1'         '2'        '3'         '4'
1   Petal.Length            <NA> 0.2334487  0.46456598  0.8317440 -0.19463332
2    Petal.Width            <NA> 0.0395488  0.04182015  0.1946750  0.97917752
3   Sepal.Length            <NA> 0.8010073  0.40303704 -0.4410167  0.03811461
4    Sepal.Width            <NA> 0.5498408 -0.78739486  0.2753323 -0.04331888

$versicolor

Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))

Settings: 
                                               value
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
R> # ... Output for the remaining partitions not shown.
R> d(svd.pmod)
   PARTITION_NAME FEATURE_ID      VALUE
1          setosa          1 44.2872290
2          setosa          2  1.5719162
3          setosa          3  1.1458732
4          setosa          4  0.6836692
5      versicolor          1 56.2523412
6      versicolor          2  1.9106625
7      versicolor          3  1.7015929
8      versicolor          4  0.6986103
9       virginica          1 66.2734064
10      virginica          2  2.4318639
11      virginica          3  1.6007740
12      virginica          4  1.2958261
Warning message:
ORE object has no unique key - using random order 
R> v(svd.pmod)
   PARTITION_NAME ATTRIBUTE_NAME ATTRIBUTE_VALUE       '1'         '2'         '3'         '4'
1          setosa   Petal.Length            <NA> 0.2334487  0.46456598  0.83174398 -0.19463332
2          setosa    Petal.Width            <NA> 0.0395488  0.04182015  0.19467497  0.97917752
3          setosa   Sepal.Length            <NA> 0.8010073  0.40303704 -0.44101672  0.03811461
4          setosa    Sepal.Width            <NA> 0.5498408 -0.78739486  0.27533228 -0.04331888
5      versicolor   Petal.Length            <NA> 0.5380908  0.49576111 -0.60174021 -0.32029352
6      versicolor    Petal.Width            <NA> 0.1676394  0.36693207 -0.03448373  0.91436795
7      versicolor   Sepal.Length            <NA> 0.7486029 -0.64738491  0.06943054  0.12516311
8      versicolor    Sepal.Width            <NA> 0.3492119  0.44774385  0.79492074 -0.21372297
9       virginica   Petal.Length            <NA> 0.5948985 -0.26368708  0.65157671 -0.38988802
10      virginica    Petal.Width            <NA> 0.2164036  0.59106806  0.42921836  0.64774968
11      virginica   Sepal.Length            <NA> 0.7058813 -0.27846153 -0.53436210  0.37235450
12      virginica    Sepal.Width            <NA> 0.3177999  0.70962445 -0.32507927 -0.53829342
Warning message:
ORE object has no unique key - using random order 
R> head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))
  Id       '1'          '2'          '3'         '4' FEATURE_ID
1  1 0.1432539 -0.026487881 -0.071688339 -0.04956008          1
2  2 0.1334289  0.172689424 -0.114854368 -0.02902893          2
3  3 0.1317675 -0.008327214 -0.062409295 -0.02438248          1
4  4 0.1297716  0.075232572  0.097222019 -0.08055912          1
5  5 0.1426868 -0.102219140 -0.009172782 -0.06147133          1
6  6 0.1554060 -0.055950655  0.160698708  0.14286095          3

7.16 Support Vector Machine

The ore.odmSVM function builds an OML4R Support Vector Machine (SVM) model.

SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization refers to the model's ability to generalize to new data.

SVM models have a functional form similar to neural networks and radial basis functions, both popular machine learning techniques.

SVM can be used to solve the following problems:

  • Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane is one that separates objects that have different class memberships. SVM finds the vectors ("support vectors") that define the separators that give the widest separation of classes.

    SVM classification supports both binary and multiclass targets.

  • Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.

    SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.

  • Anomaly Detection: Anomaly detection identifies cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.

    Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.

The ore.odmSVM function builds each of these three different types of models. Some arguments apply to classification models only, some to regression models only, and some to anomaly detection models only.

For information on the ore.odmSVM function arguments, call help(ore.odmSVM).

Settings for Support Vector Machine Models

The following table lists settings that apply to Support Vector Machine models.

Table 7-14 Support Vector Machine Model Settings

Setting Name Setting Value Description

SVMS_COMPLEXITY_FACTOR

TO_CHAR(numeric_expr>0)

Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor.

Value of complexity factor for SVM algorithm (both Classification and Regression).

Default value estimated from the data by the algorithm.

SVMS_CONV_TOLERANCE

TO_CHAR(numeric_expr>0)

Convergence tolerance for SVM algorithm.

Default is 0.0001.

SVMS_EPSILON

TO_CHAR(numeric_expr >0)

Regularization setting for regression, similar to complexity factor. Epsilon specifies the allowable residuals, or noise, in the data.

Value of epsilon factor for SVM regression.

Default is 0.1.

SVMS_KERNEL_FUNCTION

SVMS_GAUSSIAN

SVMS_LINEAR

Kernel for Support Vector Machine. Linear or Gaussian.

The default value is SVMS_LINEAR.

SVMS_OUTLIER_RATE

TO_CHAR(0 < numeric_expr < 1)

The desired rate of outliers in the training data. Valid for One-Class SVM models only (Anomaly Detection).

Default is 0.01.

SVMS_STD_DEV

TO_CHAR(numeric_expr>0)

Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases.

Value of standard deviation for SVM algorithm.

This is applicable only for Gaussian kernel.

Default value estimated from the data by the algorithm.

SVMS_NUM_ITERATIONS

Positive integer

This setting sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver.

SVMS_NUM_PIVOTS

Range [1; 10000]

This setting sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is 200.

SVMS_BATCH_ROWS

Positive integer

This setting applies to SVM models with a linear kernel and sets the batch size for the SGD solver. An input of 0 triggers a data-driven batch size estimate. The default is 20000.

SVMS_REGULARIZER

SVMS_REGULARIZER_L1

SVMS_REGULARIZER_L2

This setting controls the type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size.

SVMS_SOLVER

SVMS_SOLVER_SGD (Sub-Gradient Descent)

SVMS_SOLVER_IPM (Interior Point Method)

This setting allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined.
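
The settings in Table 7-14 can be passed through the odm.settings argument, in the same way that odms_partition_columns and kmns_distance are passed in other examples in this chapter. The following is a minimal sketch, assuming the mtcars_of proxy frame built in Example 7-18 below; the lowercase setting names mirror the SVMS_* names in the table.

# Hedged sketch: select the Gaussian kernel and loosen the convergence
# tolerance through odm.settings (mtcars_of as in Example 7-18).
svm.gau <- ore.odmSVM(gear ~ . - ID, mtcars_of, "classification",
                      odm.settings = list(svms_kernel_function = "SVMS_GAUSSIAN",
                                          svms_conv_tolerance  = 0.001))
summary(svm.gau)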

Example 7-18 Using the ore.odmSVM Function and Generating a Confusion Matrix

This example demonstrates the use of SVM classification. The example creates mtcars in the database from the R mtcars data set, builds a classification model, makes predictions, and finally generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
svm.mod  <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification")
summary(svm.mod)
svm.res  <- predict (svm.mod, mtcars_of,"gear")
with(svm.res, table(gear, PREDICTION))  # generate confusion matrix

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R>  
R> svm.mod  <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification")
R> summary(svm.mod)
Call:
ore.odmSVM(formula = gear ~ . - ID, data = mtcars_of, type = "classification")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
complexity.factor  0.385498
conv.tolerance        1e-04
kernel.cache.size  50000000
kernel.function    gaussian
std.dev            1.072341
 
Coefficients: 
[1] No coefficients with gaussian kernel
R> svm.res  <- predict (svm.mod, mtcars_of,"gear")
R> with(svm.res, table(gear, PREDICTION))  # generate confusion matrix
    PREDICTION
gear  3  4
   3 12  3
   4  0 12
   5  2  3

Example 7-19 Using the ore.odmSVM Function and Building a Regression Model

This example demonstrates SVM regression. The example creates a data frame, pushes it to a table, and then builds a regression model; note that ore.odmSVM specifies a linear kernel.

x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)
dat <-ore.push(data.frame(x=x, y=y))
 
# Build model with linear kernel
svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear")
summary(svm.mod)
coef(svm.mod)
svm.res <- predict(svm.mod,dat, supplemental.cols="x")
head(svm.res,6)

Listing for This Example

R> x <- seq(0.1, 5, by = 0.02)
R> y <- log(x) + rnorm(x, sd = 0.2)
R> dat <-ore.push(data.frame(x=x, y=y))
R>  
R> # Build model with linear kernel
R> svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear")
R> summary(svm.mod)
 
Call:
ore.odmSVM(formula = y ~ x, data = dat, type = "regression", 
    kernel.function = "linear")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
complexity.factor  0.620553
conv.tolerance        1e-04
epsilon            0.098558
kernel.function      linear
 
Residuals: 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.79130 -0.28210 -0.05592 -0.01420  0.21460  1.58400 
 
Coefficients: 
     variable value  estimate
1           x       0.6637951
2 (Intercept)       0.3802170
 
R> coef(svm.mod)
     variable value  estimate
1           x       0.6637951
2 (Intercept)       0.3802170
R> svm.res <- predict(svm.mod,dat, supplemental.cols="x")
R> head(svm.res,6)
     x PREDICTION
1 0.10 -0.7384312
2 0.12 -0.7271410
3 0.14 -0.7158507
4 0.16 -0.7045604
5 0.18 -0.6932702
6 0.20 -0.6819799

Example 7-20 Using the ore.odmSVM Function and Building an Anomaly Detection Model

This example demonstrates SVM anomaly detection. It uses mtcars_of created in the classification example and builds an anomaly detection model.

svm.mod  <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection")
summary(svm.mod)
svm.res  <- predict (svm.mod, mtcars_of, "ID")
head(svm.res)
table(svm.res$PREDICTION)

Listing for This Example

R> svm.mod  <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection")
R> summary(svm.mod)
 
Call:
ore.odmSVM(formula = ~. - ID, data = mtcars_of, type = "anomaly.detection")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
conv.tolerance        1e-04
kernel.cache.size  50000000
kernel.function    gaussian
outlier.rate             .1
std.dev            0.719126
 
Coefficients: 
[1] No coefficients with gaussian kernel
 
R> svm.res  <- predict (svm.mod, mtcars_of, "ID")
R> head(svm.res)
                        '0'       '1' ID PREDICTION
Mazda RX4         0.4999405 0.5000595  1          1
Mazda RX4 Wag     0.4999794 0.5000206  2          1
Datsun 710        0.4999618 0.5000382  3          1
Hornet 4 Drive    0.4999819 0.5000181  4          1
Hornet Sportabout 0.4949872 0.5050128  5          1
Valiant           0.4999415 0.5000585  6          1
R> table(svm.res$PREDICTION)
 
 0  1 
 5 27 

7.17 Partitioned Model

A partitioned model is an ensemble model that consists of multiple sub-models, one for each partition of the data.

A partitioned model may achieve better accuracy through multiple targeted models that are managed and used as one. A partitioned model can simplify scoring by allowing you to reference the top-level model only. The proper sub-model is chosen by the system based on the values of the partitioned column or columns for each row of data to be scored.

To create a partitioned OML4SQL model, use the odm.settings argument with ODMS_PARTITION_COLUMNS as the name and with the names of the columns by which to partition the input data as the value. The OREdm function returns a model with a sub-model for each partition. The partitions are based on the unique values found in the columns.

The partitions function returns an ore.frame that lists each partition of the specified model object and the associated partition column values of the model. Partition names are system-determined. The function returns NULL for a non-partitioned model.
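
For example, using the SVD models built in Example 7-17 (svd.pmod is partitioned by Species; svd.mod is not):

partitions(svd.pmod)   # one row per partition with the partition column values
partitions(svd.mod)    # NULL, because svd.mod is not partitioned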

Example 7-21 Create a Partitioned Model

This example creates a partitioned Support Vector Machine classification model. It uses the Wine Quality data set from the University of California, Irvine Machine Learning Repository.

# Download the wine data set and create the data table.
white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white.wine <- read.csv(white.url, header = TRUE, sep = ";")
white.wine$color <- "white"

red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
red.wine <- read.csv(red.url, header = TRUE, sep = ";")
red.wine$color <- "red"

dat <- rbind(white.wine, red.wine)

# Drop the WINE table if it exists.
ore.drop(table="WINE")
ore.create(dat, table="WINE")

# Assign row names to enable row indexing for train and test samples.
row.names(WINE) <- WINE$color

# Enable reproducible results.
set.seed(seed=6218945)

n.rows        <- nrow(WINE)

# Train and test sampling.
random.sample <- sample(1:n.rows, ceiling(n.rows/2))

# Sample in-database using row indexing.
WINE.train    <- WINE[random.sample,]
WINE.test     <- WINE[setdiff(1:n.rows,random.sample),]

# Build a Support Vector Machine classification model 
# on the training data set, using both red and white wine.
mod.svm   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
                        "classification", kernel.function="linear")

# Predict wine quality on the test data set.
pred.svm  <- predict (mod.svm, WINE.test,"quality")

# View the probability of each class and prediction.
head(pred.svm,3)

# Generate a confusion matrix. Note that 3 and 8 are not predicted.
with(pred.svm, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))

# Build a partitioned SVM model based on wine color.
# Specify the partitioning column with the odm.settings argument.
mod.svm2   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
                         "classification", kernel.function="linear",
                         odm.settings=list(odms_partition_columns = "color"))

# Predict wine quality on the test data set.
pred.svm2  <- predict (mod.svm2, WINE.test, "quality")

# View the probability of each class and prediction.
head(pred.svm2,3)

# Generate a confusion matrix. Note that 3 and 4 are not predicted.
with(pred.svm2, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))

partitions(mod.svm2)
summary(mod.svm2["red"])

Listing for This Example

> # Download the wine data set and create the data table.
> white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
> white.wine <- read.csv(white.url, header = TRUE, sep = ";")
> white.wine$color <- "white"
> 
> red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
> red.wine <- read.csv(red.url, header = TRUE, sep = ";")
> red.wine$color <- "red"
> 
> dat <- rbind(white.wine, red.wine)
> 
> # Drop the WINE table if it exists.
> ore.drop(table="WINE")
Warning message:
Table WINE does not exist. 
> ore.create(dat, table="WINE")
> 
> # Assign row names to enable row indexing for train and test samples.
> row.names(WINE) <- WINE$color
> 
> # Enable reproducible results.
> set.seed(seed=6218945)                  
>
> n.rows        <- nrow(WINE)
>
> # Train and test sampling.
> random.sample <- sample(1:n.rows, ceiling(n.rows/2))
>
> # Sample in-database using row indexing.
> WINE.train    <- WINE[random.sample,]
> WINE.test     <- WINE[setdiff(1:n.rows,random.sample),]    
> 
> # Build a Support Vector Machine classification model
> # on the training data set, using both red and white wine.
> mod.svm   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
+                         "classification",kernel.function="linear")
>
> # Predict wine quality on the test data set.
> pred.svm  <- predict (mod.svm, WINE.test,"quality")
>
> # View the probability of each class and prediction.
> head(pred.svm,3)
            '3'       '4'        '5'       '6'       '7'       '8'       '9'
red   0.04957242 0.1345280 0.27779399 0.1345281 0.1345280 0.1345275 0.1345220
red.1 0.04301663 0.1228311 0.34283345 0.1228313 0.1228311 0.1228307 0.1228257
red.2 0.04473419 0.1713883 0.09832961 0.1713891 0.1713890 0.1713886 0.1713812
      quality PREDICTION
red         4          5
red.1       5          5
red.2       7          6
>
> # Generate a confusion matrix. Note that 3 and 4 are not predicted.
> with(pred.svm, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   3   4   5   6   7   8   9
     3   0   0  11   5   0   0   0
     4   0   1  85  16   2   0   0
     5   2   1 927 152   4   0   1
     6   2   1 779 555  63   1   9
     7   2   0 121 316  81   0   3
     8   0   0  18  66  21   0   0
     9   0   0   0   2   1   0   0
>
> partitions(mod.svm2)
  PARTITION_NAME color
1            red   red
2          white white
> summary(mod.svm2["red"])
$red

Call:
ore.odmSVM(formula = quality ~ . - pH - fixed.acidity, data = WINE.train, 
    type = "classification", kernel.function = "linear", odm.settings = list(odms_partition_columns = "color"))

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns                       "color"
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
active.learning                            al.enable
conv.tolerance                                 1e-04
kernel.function                               linear

Coefficients: 
   PARTITION_NAME class             variable value      estimate
1             red     3          (Intercept)       -1.347392e+01
2             red     3              alcohol        7.245737e-01
3             red     3            chlorides        1.761946e+00
4             red     3          citric.acid       -3.276716e+00
5             red     3              density        2.449906e+00
6             red     3  free.sulfur.dioxide       -6.035430e-01
7             red     3       residual.sugar        9.097631e-01
8             red     3            sulphates        1.240524e-04
9             red     3 total.sulfur.dioxide       -2.467554e+00
10            red     3     volatile.acidity        1.300470e+00
11            red     4          (Intercept)       -1.000002e+00
12            red     4              alcohol       -7.920188e-07
13            red     4            chlorides       -2.589198e-08
14            red     4          citric.acid        9.340296e-08
15            red     4              density       -5.418190e-07
16            red     4  free.sulfur.dioxide       -6.981268e-08
17            red     4       residual.sugar        3.389558e-07
18            red     4            sulphates        1.417324e-07
19            red     4 total.sulfur.dioxide       -3.113900e-07
20            red     4     volatile.acidity        4.928625e-07
21            red     5          (Intercept)       -3.151406e-01
22            red     5              alcohol       -9.692192e-01
23            red     5            chlorides        3.690034e-02
24            red     5          citric.acid        2.258823e-01
25            red     5              density       -1.770474e-01
26            red     5  free.sulfur.dioxide       -1.289540e-01
27            red     5       residual.sugar        7.521771e-04
28            red     5            sulphates       -3.596548e-01
29            red     5 total.sulfur.dioxide        5.688280e-01
30            red     5     volatile.acidity        3.005168e-01
31            red     6          (Intercept)       -9.999994e-01
32            red     6              alcohol        8.807703e-07
33            red     6            chlorides        6.871310e-08
34            red     6          citric.acid       -4.525750e-07
35            red     6              density        5.786923e-07
36            red     6  free.sulfur.dioxide        3.856018e-07
37            red     6       residual.sugar       -4.281695e-07
38            red     6            sulphates        1.036468e-07
39            red     6 total.sulfur.dioxide       -4.287512e-07
40            red     6     volatile.acidity       -4.426151e-07
41            red     7          (Intercept)       -1.000000e+00
42            red     7              alcohol        1.761665e-07
43            red     7            chlorides       -3.583316e-08
44            red     7          citric.acid       -4.837739e-08
45            red     7              density        2.169500e-08
46            red     7  free.sulfur.dioxide        4.800717e-08
47            red     7       residual.sugar        1.909498e-08
48            red     7            sulphates        1.062205e-07
49            red     7 total.sulfur.dioxide       -2.339108e-07
50            red     7     volatile.acidity       -1.539326e-07
51            red     8          (Intercept)       -1.000000e+00
52            red     8              alcohol        7.089889e-08
53            red     8            chlorides       -8.566726e-09
54            red     8          citric.acid        2.769301e-08
55            red     8              density       -3.852321e-08
56            red     8  free.sulfur.dioxide       -1.302056e-08
57            red     8       residual.sugar        4.847947e-09
58            red     8            sulphates        1.276461e-08
59            red     8 total.sulfur.dioxide       -5.484427e-08
60            red     8     volatile.acidity        2.959182e-08
 

7.18 Text Processing Model

A text processing model uses ctx.settings arguments to specify Oracle Text attribute settings.

Example 7-22 Building a Text Processing Model

This example builds an ore.odmKMeans model that processes text. It uses the odm.settings and ctx.settings arguments. The figure following the example shows the output of the histogram(km.mod1) function.

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

X <- ore.push (data.frame(x))
km.mod1 <- NULL
km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
km.mod1
summary(km.mod1)
rules(km.mod1)
clusterhists(km.mod1)
histogram(km.mod1)

km.res1 <- predict(km.mod1,X,type="class",supplemental.cols=c("x","y"))
head(km.res1,3)
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x = km.res1.local$x, 
	              y = km.res1.local$y), 
	              col = km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)

head(predict(km.mod1,X))
head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)

# Text processing with ore.odmKMeans.
title <- c('Aids in Africa: Planning for a long war',
	         'Mars rover maneuvers for rim shot',
	         'Mars express confirms presence of water at Mars south pole',
	         'NASA announces major Mars rover finding',
	         'Drug access, Asia threat in focus at AIDS summit',
	         'NASA Mars Odyssey THEMIS image: typical crater',
	         'Road blocks for Aids')
response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')

# Text contents in a character column.
KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
			          RESPONSE = response, TITLE = title))

# Create a text policy (CTXSYS.CTX_DDL privilege is required).
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")

# Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# text column attributes.
km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
   odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
                       ODMS_TEXT_MIN_DOCUMENTS = 1,
                       ODMS_TEXT_MAX_FEATURES = 3,
                       kmns_distance = "dbms_data_mining.kmns_cosine",
                       kmns_details = "kmns_details_all"),
   ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))
summary(km.mod)
settings(km.mod)
print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)

ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Listing for This Example

R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> 
R> X <- ore.push (data.frame(x))
R> km.mod1 <- NULL
R> km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
R> km.mod1

Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                   euclidean
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

R> summary(km.mod1)

Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                   euclidean
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

Centers: 
            x          y
2 -0.07638266 0.04449368
3  0.98493306 1.00864399

R> rules(km.mod1)
   cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var lhs.var.support lhs.var.conf   predicate
1           1         100      1.0          92     0.86       x              86    0.2222222 x <= 1.2209
2           1         100      1.0          92     0.86       x              86    0.2222222 x >= -.6188
3           1         100      1.0          86     0.86       y              86    0.4444444 y <= 1.1653
4           1         100      1.0          86     0.86       y              86    0.4444444  y > -.3053
5           2          50      0.5          48     0.96       x              48    0.0870793  x <= .4324
6           2          50      0.5          48     0.96       x              48    0.0870793 x >= -.6188
7           2          50      0.5          48     0.96       y              48    0.0893300  y <= .5771
8           2          50      0.5          48     0.96       y              48    0.0893300  y > -.5995
9           3          50      0.5          49     0.98       x              49    0.0852841 x <= 1.7465
10          3          50      0.5          49     0.98       x              49    0.0852841   x > .4324
11          3          50      0.5          50     0.98       y              49    0.0838225 y <= 1.7536
12          3          50      0.5          50     0.98       y              49    0.0838225   y > .2829

R> clusterhists(km.mod1)
   cluster.id variable bin.id lower.bound upper.bound               label count
1           1        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     6
2           1        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077    17
3           1        x      3 -0.09320769  0.16961178  -.0932077:.1696118    15
4           1        x      4  0.16961178  0.43243125   .1696118:.4324312    11
5           1        x      5  0.43243125  0.69525071   .4324312:.6952507     8
6           1        x      6  0.69525071  0.95807018   .6952507:.9580702    17
7           1        x      7  0.95807018  1.22088965  .9580702:1.2208896    18
8           1        x      8  1.22088965  1.48370911 1.2208896:1.4837091     4
9           1        x      9  1.48370911  1.74652858 1.4837091:1.7465286     4
10          1        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     2
11          1        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     4
12          1        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923    11
13          1        y      4 -0.01119230  0.28294226  -.0111923:.2829423    24
14          1        y      5  0.28294226  0.57707682   .2829423:.5770768    13
15          1        y      6  0.57707682  0.87121138   .5770768:.8712114    12
16          1        y      7  0.87121138  1.16534593  .8712114:1.1653459    26
17          1        y      8  1.16534593  1.45948049 1.1653459:1.4594805     5
18          1        y      9  1.45948049  1.75361505  1.4594805:1.753615     3
19          2        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     6
20          2        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077    17
21          2        x      3 -0.09320769  0.16961178  -.0932077:.1696118    15
22          2        x      4  0.16961178  0.43243125   .1696118:.4324312    10
23          2        x      5  0.43243125  0.69525071   .4324312:.6952507     2
24          2        x      6  0.69525071  0.95807018   .6952507:.9580702     0
25          2        x      7  0.95807018  1.22088965  .9580702:1.2208896     0
26          2        x      8  1.22088965  1.48370911 1.2208896:1.4837091     0
27          2        x      9  1.48370911  1.74652858 1.4837091:1.7465286     0
28          2        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     2
29          2        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     4
30          2        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923    11
31          2        y      4 -0.01119230  0.28294226  -.0111923:.2829423    24
32          2        y      5  0.28294226  0.57707682   .2829423:.5770768     9
33          2        y      6  0.57707682  0.87121138   .5770768:.8712114     0
34          2        y      7  0.87121138  1.16534593  .8712114:1.1653459     0
35          2        y      8  1.16534593  1.45948049 1.1653459:1.4594805     0
36          2        y      9  1.45948049  1.75361505  1.4594805:1.753615     0
37          3        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     0
38          3        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077     0
39          3        x      3 -0.09320769  0.16961178  -.0932077:.1696118     0
40          3        x      4  0.16961178  0.43243125   .1696118:.4324312     1
41          3        x      5  0.43243125  0.69525071   .4324312:.6952507     6
42          3        x      6  0.69525071  0.95807018   .6952507:.9580702    17
43          3        x      7  0.95807018  1.22088965  .9580702:1.2208896    18
44          3        x      8  1.22088965  1.48370911 1.2208896:1.4837091     4
45          3        x      9  1.48370911  1.74652858 1.4837091:1.7465286     4
46          3        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     0
47          3        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     0
48          3        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923     0
49          3        y      4 -0.01119230  0.28294226  -.0111923:.2829423     0
50          3        y      5  0.28294226  0.57707682   .2829423:.5770768     4
51          3        y      6  0.57707682  0.87121138   .5770768:.8712114    12
52          3        y      7  0.87121138  1.16534593  .8712114:1.1653459    26
53          3        y      8  1.16534593  1.45948049 1.1653459:1.4594805     5
54          3        y      9  1.45948049  1.75361505  1.4594805:1.753615     3
R> histogram(km.mod1)
R> 
R> km.res1 <- predict(km.mod1, X, type="class", supplemental.cols = c("x","y"))
R> head(km.res1, 3)
            x           y CLUSTER_ID
1 -0.43646407  0.26201831          2
2 -0.02797831  0.07319952          2
3  0.11998373 -0.08638716          2
R> km.res1.local <- ore.pull(km.res1)
R> plot(data.frame(x = km.res1.local$x,
+                  y = km.res1.local$y), 
+                  col = km.res1.local$CLUSTER_ID)
R>  points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex = 2)
R> 
R>  head(predict(km.mod1, X))
        '2'          '3' CLUSTER_ID
1 0.9992236 0.0007763706          2
2 0.9971310 0.0028690375          2
3 0.9974216 0.0025783939          2
4 0.9997335 0.0002665114          2
5 0.9917773 0.0082226599          2
6 0.9771667 0.0228333398          2
R> head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
        '2'          '3'           x           y CLUSTER_ID
1 0.9992236 0.0007763706 -0.43646407  0.26201831          2
2 0.9971310 0.0028690375 -0.02797831  0.07319952          2
3 0.9974216 0.0025783939  0.11998373 -0.08638716          2
R> head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)
            x           y       '2'          '3'
1 -0.43646407  0.26201831 0.9992236 0.0007763706
2 -0.02797831  0.07319952 0.9971310 0.0028690375
3  0.11998373 -0.08638716 0.9974216 0.0025783939
R> 
R>
R> # Text processing with ore.odmKMeans.
R> title <- c('Aids in Africa: Planning for a long war',
+             'Mars rover maneuvers for rim shot',
+             'Mars express confirms presence of water at Mars south pole',
+             'NASA announces major Mars rover finding',                     
+             'Drug access, Asia threat in focus at AIDS summit',
+             'NASA Mars Odyssey THEMIS image: typical crater',
+             'Road blocks for Aids')
R>  response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')
R> 
R> # Text contents in a character column.
R> KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
+                                 RESPONSE = response, TITLE = title))
R> 
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required).
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R> 
R> # Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # text column attributes.
R> km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
+    odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+                        ODMS_TEXT_MIN_DOCUMENTS = 1,
+                        ODMS_TEXT_MAX_FEATURES = 3,
+                        kmns_distance = "dbms_data_mining.kmns_cosine",
+                        kmns_details = "kmns_details_all"),
+    ctx.settings = list(TITLE="TEXT(TOKEN_TYPE:STEM)"))
R> summary(km.mod)

Call:
ore.odmKMeans(formula = ~TITLE, data = KM_TEXT, num.centers = 2L, 
    odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", 
        ODMS_TEXT_MIN_DOCUMENTS = 1, ODMS_TEXT_MAX_FEATURES = 3, 
        kmns_distance = "dbms_data_mining.kmns_cosine", 
        kmns_details = "kmns_details_all"), 
    ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                      cosine
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                             3
odms.text.min.documents                            1
odms.text.policy.name                     ESA_TXTPOL
prep.auto                                         ON

Centers: 
  TITLE.MARS TITLE.NASA TITLE.ROVER TITLE.AIDS
2  0.5292307  0.7936566   0.7936566         NA
3         NA         NA          NA          1
R> settings(km.mod)
                   SETTING_NAME           SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME             ALGO_KMEANS        INPUT
2             CLUS_NUM_CLUSTERS                       2        INPUT
3             KMNS_BLOCK_GROWTH                       2        INPUT
4           KMNS_CONV_TOLERANCE                    0.01        INPUT
5                  KMNS_DETAILS        KMNS_DETAILS_ALL        INPUT
6                 KMNS_DISTANCE             KMNS_COSINE        INPUT
7               KMNS_ITERATIONS                       3        INPUT
8     KMNS_MIN_PCT_ATTR_SUPPORT                     0.1        INPUT
9                 KMNS_NUM_BINS                      10        INPUT
10             KMNS_RANDOM_SEED                       0      DEFAULT
11         KMNS_SPLIT_CRITERION           KMNS_VARIANCE        INPUT
12 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO      DEFAULT
13                ODMS_SAMPLING   ODMS_SAMPLING_DISABLE      DEFAULT
14       ODMS_TEXT_MAX_FEATURES                       3        INPUT
15      ODMS_TEXT_MIN_DOCUMENTS                       1        INPUT
16        ODMS_TEXT_POLICY_NAME              ESA_TXTPOL        INPUT
17                    PREP_AUTO                      ON        INPUT
R> print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)
     '2'    '3' RESPONSE CLUSTER_ID
1 0.0213 0.9787     Aids          3
2 0.9463 0.0537     Mars          2
3 0.9325 0.0675     Mars          2
4 0.9691 0.0309     Mars          2
5 0.0213 0.9787     Aids          3
6 0.9463 0.0537     Mars          2
7 0.0213 0.9787     Aids          3
R> 
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Figure 7-5 Cluster Histogram for km.mod1


7.19 Neural Network Model

The ore.odmNN class creates a Neural Network (NN) model for classification and regression. Neural Network models can capture intricate nonlinear relationships between inputs and outputs and can find patterns in data.

The ore.odmNN class methods build a feed-forward neural network on OML4R proxy data frames. It supports multiple hidden layers, each with a specified number of nodes.

Each layer can have one of the following activation functions.

  • NNET_ACTIVATIONS_ARCTAN
  • NNET_ACTIVATIONS_BIPOLAR_SIG
  • NNET_ACTIVATIONS_LINEAR
  • NNET_ACTIVATIONS_LOG_SIG
  • NNET_ACTIVATIONS_RELU
  • NNET_ACTIVATIONS_TANH

The output layer is a single numeric or binary categorical target. The output layer can have any of the activation functions. It has the linear activation function by default.

Modeling with the ore.odmNN class is well suited for noisy and complex data such as sensor data. Such problems typically have the following characteristics:

  • Potentially many (numeric) predictors, for example, pixel values

  • The target may be discrete-valued, real-valued, or a vector of such values

  • Training data may contain errors; the model must be robust to noise

  • Fast scoring

  • Model transparency is not required; the models are difficult to interpret

Typical steps in Neural Network modeling are the following:

  1. Specifying the architecture

  2. Preparing the data

  3. Building the model

  4. Specifying the stopping criteria: iterations, error on a validation set within tolerance

  5. Viewing statistical results from the model

  6. Improving the model

Settings for a Neural Network Model

The following table lists settings for NN models.

Table 7-15 Neural Network Model Settings

Setting Name Setting Value Description
NNET_HIDDEN_LAYERS

Non-negative integer

Defines the topology by number of hidden layers.

The default value is 1.

NNET_NODES_PER_LAYER

A list of positive integers

Defines the topology by number of nodes per layer. Different layers can have different numbers of nodes.

The values must be non-negative integers, comma separated; for example, '10, 20, 5'. The setting values must be consistent with NNET_HIDDEN_LAYERS. The default number of nodes per layer is the number of attributes or 50 (if the number of attributes > 50).

NNET_ACTIVATIONS

A list of the following strings:

  • "NNET_ACTIVATIONS_LOG_SIG"
  • "NNET_ACTIVATIONS_LINEAR"
  • "NNET_ACTIVATIONS_TANH"
  • "NNET_ACTIVATIONS_ARCTAN"
  • "NNET_ACTIVATIONS_BIPOLAR_SIG"
Defines the activation function for the hidden layers. For example, '''NNET_ACTIVATIONS_BIPOLAR_SIG'', ''NNET_ACTIVATIONS_TANH'''.

Different layers can have different activation functions.

The default value is "NNET_ACTIVATIONS_LOG_SIG".

The number of activation functions must be consistent with NNET_HIDDEN_LAYERS and NNET_NODES_PER_LAYER.

Note:

All quotes are single quotes; two consecutive single quotes escape a single quote in SQL statements.

NNET_WEIGHT_LOWER_BOUND

A real number

The setting specifies the lower bound of the region where weights are randomly initialized. NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND must be set together. Setting one and not setting the other raises an error. NNET_WEIGHT_LOWER_BOUND must not be greater than NNET_WEIGHT_UPPER_BOUND. The default value is -sqrt(6/(l_nodes+r_nodes)). The value of l_nodes for:

  • input layer dense attributes is (1+number of dense attributes)
  • input layer sparse attributes is number of sparse attributes
  • each hidden layer is (1+number of nodes in that hidden layer)

The value of r_nodes is the number of nodes in the layer that the weight is connecting to.

NNET_WEIGHT_UPPER_BOUND

A real number

This setting specifies the upper bound of the region where weights are initialized. It should be set in pairs with NNET_WEIGHT_LOWER_BOUND and its value must not be smaller than the value of NNET_WEIGHT_LOWER_BOUND. If not specified, the values of NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND are system determined.

The default value is sqrt(6/(l_nodes+r_nodes)). See NNET_WEIGHT_LOWER_BOUND.

NNET_ITERATIONS

Positive integer

This setting specifies the maximum number of iterations in the Neural Network algorithm.

The default value is 200.

NNET_TOLERANCE

TO_CHAR(0<numeric_expr<1)

Defines the convergence tolerance setting of the Neural Network algorithm.

The default value is 0.000001.

NNET_REGULARIZER

NNET_REGULARIZER_NONE

NNET_REGULARIZER_L2

NNET_REGULARIZER_HELDASIDE

Regularization setting for Neural Network algorithm. If the total number of training rows is greater than 50000, the default is NNET_REGULARIZER_HELDASIDE.

If the total number of training rows is less than or equal to 50000, the default is NNET_REGULARIZER_NONE.

NNET_HELDASIDE_RATIO

0 <= numeric_expr <= 1

Defines the held-aside ratio for the held-aside method.

The default value is 0.25.

NNET_HELDASIDE_MAX_FAIL

The value must be a positive integer.

With NNET_REGULARIZER_HELDASIDE, the training process is stopped early if the network performance on the validation data fails to improve or remains the same for NNET_HELDASIDE_MAX_FAIL epochs in a row.

The default value is 6.

NNET_REG_LAMBDA

TO_CHAR(numeric_expr >= 0)

Defines the L2 regularization parameter lambda. It cannot be set together with NNET_REGULARIZER_HELDASIDE.

The default value is 1.
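
As with the other OREdm functions, these settings are passed through the odm.settings argument. The following is a minimal sketch, assuming the DAT proxy frame built in Example 7-23 below.

# Hedged sketch: force held-aside regularization with a custom validation
# ratio and early-stopping patience (DAT as in Example 7-23).
mod.reg <- ore.odmNN(y ~ x, DAT, "regression",
                     odm.settings = list(nnet_regularizer = "NNET_REGULARIZER_HELDASIDE",
                                         nnet_heldaside_ratio = 0.3,
                                         nnet_heldaside_max_fail = 4))
summary(mod.reg)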

Example 7-23 Building a Neural Network Model

This example creates an NN model and uses some of the methods of the ore.odmNN class.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

set.seed(7654)
x <- seq(0.1, 5, by = 0.02)
weights <- round(rnorm(length(x),10,3))
y <- log(x) + rnorm(x, sd = 0.2)

# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(data.frame(x=x, y=y, weights=weights))

# Create an NN regression model object. Fit the NN model according to the data and setting parameters.

mod.nn <- ore.odmNN(y~x, DAT,"regression",
                          odm.settings = list(nnet_hidden_layers = 1))
weight(mod.nn)
summary(mod.nn)

# Use the model to make predictions on the input data.

pred.nn <- predict(mod.nn, DAT, "y")
head(pred.nn, 10)

Listing for This Example

Table 7-16 A data.frame: 4 x 6

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE WEIGHT
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
0 0 0 x NA -1.0663866
0 NA 0 NA NA -7.4897304
1 0 0 NA NA -1068.0117188
1 NA 0 NA NA 0.9961451
Call:
ore.odmNN(formula = y ~ x, data = DAT, type = "regression", 
    odm.settings = list(nnet_hidden_layers = 1))

Settings: 
                                                  value
lbfgs.gradient.tolerance                     .000000001
lbfgs.history.depth                                  20
lbfgs.scale.hessian          LBFGS_SCALE_HESSIAN_ENABLE
activations                  'NNET_ACTIVATIONS_LOG_SIG'
hidden.layers                                         1
iterations                                          200
tolerance                                       .000001
odms.details                                odms.enable
odms.missing.value.treatment    odms.missing.value.auto
odms.random.seed                                      0
odms.sampling                     odms.sampling.disable
prep.auto                                            ON

Number of Layers: 
[1] 2

Nodes per Layer: 
[1] 1

Weight: 
  LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE        WEIGHT
1     0        0      0              x                    -1.0663866
2     0       NA      0                                   -7.4897304
3     1        0      0                                -1068.0117188
4     1       NA      0                                     0.9961451

Table 7-17 A data.frame: 10 x 2

y PREDICTION
<dbl> <dbl>
-2.376195 -1.648826
-1.906485 -1.601597
-2.027240 -1.555065
-1.541951 -1.509221
-1.654645 -1.464055
-1.742211 -1.419556
-1.320646 -1.375714
-1.357442 -1.332520
-1.442755 -1.289965
-1.192586 -1.248039

Example 7-24 ore.odmNN Classification

This example creates an NN classification model and uses some of the methods of the ore.odmNN class.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)

# Create a temporary OML4R proxy object for the MTCARS table.

MTCARS <- ore.push(m)
row.names(MTCARS) <- MTCARS$ID

# Create an NN classification model object. Fit the NN model according to the data and setting parameters.

mod.nn  <- ore.odmNN(gear ~ ., MTCARS,"classification",
                         odm.settings = list(nnet_hidden_layers = 2,
                                             nnet_activations = c("'NNET_ACTIVATIONS_LOG_SIG'", "'NNET_ACTIVATIONS_TANH'"),
                                             nnet_nodes_per_layer = c(5, 2)))
head(weight(mod.nn), 10)

# Use the model to make predictions on the input data.

pred.nn <- predict(mod.nn, MTCARS, "gear")

# Generate a confusion matrix.
with(pred.nn, table(gear, PREDICTION))

Listing for This Example

Table 7-18 A data.frame: 10 x 6

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE WEIGHT
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
0 0 0 ID NA 12.424586
0 0 1 ID NA -9.953163
0 0 2 ID NA -7.516252
0 0 3 ID NA -1.100170
0 0 4 ID NA -15.955383
0 1 0 am NA 21.585514
0 1 1 am NA -3.228476
0 1 2 am NA -22.794853
0 1 3 am NA 15.349457
0 1 4 am NA -19.099138
    PREDICTION
gear  3  4
   3 15  0
   4  0 12
   5  0  5

7.20 Random Forest Model

The ore.odmRF class creates a Random Forest (RF) model that provides an ensemble learning technique for classification.

By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.

Settings for a Random Forest Model

The following table lists settings that apply to Random Forest models.

Table 7-19 Random Forest Model Settings

Setting Name Setting Value Description

RFOR_MTRY

a number >= 0

Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns.

RFOR_NUM_TREES

1 <= a number <= 65535

Number of trees in the forest.

Default is 20.

RFOR_SAMPLING_RATIO

0 < a fraction <= 1

Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is 0.5, that is, half of the rows in the training data.
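
The RFOR_* settings can be combined with the decision tree settings through the odm.settings argument. The following is a minimal sketch, assuming the IRIS proxy frame built in Example 7-25 below.

# Hedged sketch: grow a larger forest on a smaller per-tree sample
# (IRIS as in Example 7-25).
mod.rf2 <- ore.odmRF(Species ~ ., IRIS,
                     odm.settings = list(rfor_num_trees = 50,
                                         rfor_sampling_ratio = 0.632,
                                         rfor_mtry = 2))
summary(mod.rf2)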

Example 7-25 Using the ore.odmRF Function

This example pushes the data frame iris to a temporary database table IRIS and creates a Random Forest model.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Create a temporary OML4R proxy object IRIS.

IRIS <- ore.push(iris)

# Create an RF model object. Fit the RF model according to the data and setting parameters.

mod.rf <- ore.odmRF(Species ~ ., IRIS, 
                        odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                        tree_term_max_depth = 5,
                        tree_term_minrec_split = 5,
                        tree_term_minpct_split = 2,
                        tree_term_minrec_node = 5,
                        tree_term_minpct_node = 0.05))
                        
# Show the model summary and attribute importance.

summary(mod.rf)
importance(mod.rf)

# Use the model to make predictions on the input data.

pred.rf <- predict(mod.rf, IRIS, supplemental.cols="Species")

# Generate a confusion matrix.

with(pred.rf, table(Species, PREDICTION))

Listing for This Example

Call:
ore.odmRF(formula = Species ~ ., data = IRIS, 
    odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY", 
        tree_term_max_depth = 5, tree_term_minrec_split = 5, 
        tree_term_minpct_split = 2, tree_term_minrec_node = 5, 
        tree_term_minpct_node = 0.05))
Settings:
                                                 value 
      clas.max.sup.bins                          32
      clas.weights.balanced                      OFF
      odms.details                               odms.enable
      odms.missing.value.treatment   odms.missing.value.auto 
      odms.random.seed                                     0 
      odms.sampling                    odms.sampling.disable 
      prep.auto                                           ON
      rfor.num.trees                                      20
      rfor.sampling.ratio                                 .5
      impurity.metric                       impurity.entropy 
      term.max.depth                                       5
      term.minpct.node                                  0.05 
      term.minpct.split                                    2 
      term.minrec.node                                     5
      term.minrec.split                                    5
Importance:
    ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE 
1   Petal.Length             <NA>              0.60890776 
2   Petal.Width              <NA>              0.53412466
3   Sepal.Length             <NA>              0.23343292
4   Sepal.Width              <NA>              0.06182114

Table 7-20 A data.frame: 4 x 3

ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
<chr> <chr> <dbl>
Petal.Length NA 0.60890776
Petal.Width NA 0.53412466
Sepal.Length NA 0.23343292
Sepal.Width NA 0.06182114

            PREDICTION
Species      setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          2        48

7.21 Exponential Smoothing Model

The ore.odmESM function uses the Exponential Smoothing Method (ESM) algorithm to create a time series model for forecasting.

Exponential Smoothing Methods have been widely used in forecasting for over half a century. They have applications at the strategic, tactical, and operational levels. For example, at a strategic level, forecasting is used for projecting return on investment, growth, and the effect of innovations. At a tactical level, forecasting is used for projecting costs, inventory requirements, and customer satisfaction. At an operational level, forecasting is used for setting targets and predicting quality and conformance with standards.

In its simplest form, Exponential Smoothing is a moving average method with a single parameter that models an exponentially decreasing effect of past levels on future values. With a variety of extensions, Exponential Smoothing covers a broader class of models than other well-known approaches, such as the Box-Jenkins auto-regressive integrated moving average (ARIMA) approach. Oracle Data Mining implements Exponential Smoothing using a state-of-the-art state space method that incorporates a single source of error (SSOE) assumption, which provides theoretical and performance advantages.
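
The following sketch shows what a call might look like, assuming that ore.odmESM follows the formula, data, and odm.settings pattern of the other OREdm functions; the SALES proxy frame and its MONTH and AMOUNT columns are hypothetical.

# Speculative sketch: Holt-Winters forecasting of a hypothetical monthly
# series. SALES is assumed to have a datetime column MONTH (the case id)
# and a numeric column AMOUNT.
esm.mod <- ore.odmESM(AMOUNT ~ MONTH, SALES,
                      odm.settings = list(exsm_model = "EXSM_HW",
                                          exsm_seasonality = 12,
                                          exsm_interval = "EXSM_INTERVAL_MONTH",
                                          exsm_accumulate = "EXSM_ACCU_TOTAL"))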

Settings for an ESM model

The following table lists settings that apply to ESM models. A combined sketch of the datetime-related settings follows the table.

Table 7-21 ESM Model Settings

Setting Name Setting Value Description
EXSM_MODEL

It can take a value in the set {EXSM_SIMPLE, EXSM_SIMPLE_MULT, EXSM_HOLT, EXSM_HOLT_DMP, EXSM_MUL_TRND, EXSM_MULTRD_DMP, EXSM_SEAS_ADD, EXSM_SEAS_MUL, EXSM_HW, EXSM_HW_DMP, EXSM_HW_ADDSEA, EXSM_DHW_ADDSEA, EXSM_HWMT, EXSM_HWMT_DMP}.

This setting specifies the model.

EXSM_SIMPLE: Simple exponential smoothing model is applied.

EXSM_SIMPLE_MULT: Simple exponential smoothing model with multiplicative error is applied.

EXSM_HOLT: Holt linear exponential smoothing model is applied.

EXSM_HOLT_DMP: Holt linear exponential smoothing model with damped trend is applied.

EXSM_MUL_TRND: Exponential smoothing model with multiplicative trend is applied.

EXSM_MULTRD_DMP: Exponential smoothing model with multiplicative damped trend is applied.

EXSM_SEAS_ADD: Exponential smoothing with additive seasonality, but no trend, is applied.

EXSM_SEAS_MUL: Exponential smoothing with multiplicative seasonality, but no trend, is applied.

EXSM_HW: Holt-Winters triple exponential smoothing model with additive trend and multiplicative seasonality is applied.

EXSM_HW_DMP: Holt-Winters exponential smoothing model with damped additive trend and multiplicative seasonality is applied.

EXSM_HW_ADDSEA: Holt-Winters additive exponential smoothing model with additive trend and additive seasonality is applied.

EXSM_DHW_ADDSEA: Holt-Winters additive exponential smoothing model with damped additive trend and additive seasonality is applied.

EXSM_HWMT: Holt-Winters multiplicative exponential smoothing model with multiplicative trend and multiplicative seasonality is applied.

EXSM_HWMT_DMP: Holt-Winters multiplicative exponential smoothing model with damped multiplicative trend and multiplicative seasonality is applied.

The default value is EXSM_SIMPLE.

EXSM_SEASONALITY

positive integer > 1

This setting specifies a positive integer value as the length of the seasonal cycle. The value specified must be larger than 1. For example, a setting value of 4 means that every group of four observations forms a seasonal cycle.

This setting applies, and must be provided, only for models with seasonality; otherwise the model throws an error.

When EXSM_INTERVAL is not set, this setting applies to the original input time series. When EXSM_INTERVAL is set, this setting applies to the accumulated time series.

EXSM_INTERVAL

It can take a value in the set {EXSM_INTERVAL_YEAR, EXSM_INTERVAL_QTR, EXSM_INTERVAL_MONTH, EXSM_INTERVAL_WEEK, EXSM_INTERVAL_DAY, EXSM_INTERVAL_HOUR, EXSM_INTERVAL_MIN, EXSM_INTERVAL_SEC}.

This setting applies, and must be provided, only when the time column (the case_id column) has a datetime type. It specifies the spacing interval of the accumulated, equally spaced time series.

The model throws an error if the time column of the input table has a datetime type and the EXSM_INTERVAL setting is not provided.

The model throws an error if the time column of the input table has an Oracle NUMBER type and the EXSM_INTERVAL setting is provided.

EXSM_ACCUMULATE

It can take a value in the set {EXSM_ACCU_TOTAL, EXSM_ACCU_STD, EXSM_ACCU_MAX, EXSM_ACCU_MIN, EXSM_ACCU_AVG, EXSM_ACCU_MEDIAN, EXSM_ACCU_COUNT}.

This setting applies, and must be provided, only when the time column has a datetime type. It specifies how the values of the accumulated time series are generated from the input time series.

EXSM_SETMISSING

A number, or an option taking a value in the set {EXSM_MISS_MIN, EXSM_MISS_MAX, EXSM_MISS_AVG, EXSM_MISS_MEDIAN, EXSM_MISS_LAST, EXSM_MISS_FIRST, EXSM_MISS_PREV, EXSM_MISS_NEXT, EXSM_MISS_AUTO}.

This setting specifies how to handle missing values, which may come from the input data and/or the accumulation process of the time series. You can specify either a number or an option. If a number is specified, all missing values are set to that number.

EXSM_MISS_MIN: Replaces missing value with minimum of the accumulated time series.

EXSM_MISS_MAX: Replaces missing value with maximum of the accumulated time series.

EXSM_MISS_AVG: Replaces missing value with average of the accumulated time series.

EXSM_MISS_MEDIAN: Replaces missing value with median of the accumulated time series.

EXSM_MISS_LAST: Replaces missing value with last non-missing value of the accumulated time series.

EXSM_MISS_FIRST: Replaces missing value with first non-missing value of the accumulated time series.

EXSM_MISS_PREV: Replaces a missing value with the previous non-missing value of the accumulated time series.

EXSM_MISS_NEXT: Replaces a missing value with the next non-missing value of the accumulated time series.

EXSM_MISS_AUTO: The ESM model treats the input data as an irregular (non-uniformly spaced) time series.

If this setting is not provided, EXSM_MISS_AUTO is the default value. In such a case, the model treats the input time series as irregular time series, viewing missing values as gaps.

EXSM_PREDICTION_STEP

It must be set to a number between 1 and 30, inclusive.

This setting specifies how many steps ahead the predictions are to be made.

If it is not set, the default value is 1: the model gives one-step-ahead prediction. A value greater than 30 results in an error.

EXSM_CONFIDENCE_LEVEL

It must be a number between 0 and 1, exclusive.

This setting specifies the desired confidence level for prediction.

The lower and upper bounds of the specified confidence interval are reported. If this setting is not specified, the default confidence level is 95%.

EXSM_OPT_CRITERION

It takes a value in the set {EXSM_OPT_CRIT_LIK, EXSM_OPT_CRIT_MSE, EXSM_OPT_CRIT_AMSE, EXSM_OPT_CRIT_SIG, EXSM_OPT_CRIT_MAE}.

This setting specifies the desired optimization criterion. The optimization criterion is useful as a diagnostic for comparing models' fit to the same data.

EXSM_OPT_CRIT_LIK: Minus twice the log-likelihood of a model.

EXSM_OPT_CRIT_MSE: Mean square error of a model.

EXSM_OPT_CRIT_AMSE: Average mean square error over user-specified time window.

EXSM_OPT_CRIT_SIG: Model's standard deviation of residuals.

EXSM_OPT_CRIT_MAE: Mean absolute error of a model.

The default value is EXSM_OPT_CRIT_LIK.

EXSM_NMSE

positive integer

This setting specifies the length of the window used in computing the average mean square error (AMSE) metric.
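The datetime-related settings (EXSM_INTERVAL, EXSM_ACCUMULATE, and EXSM_SETMISSING) work together with EXSM_MODEL and EXSM_SEASONALITY, and are easiest to see in combination. The following is a sketch only, not part of the examples in this chapter: the proxy table SALES and its datetime column DAY and numeric column AMOUNT are hypothetical names, and the setting values follow Table 7-21.

# Sketch: a seasonal Holt-Winters ESM model on a hypothetical datetime-keyed
# series. SALES, DAY, and AMOUNT are placeholder names. Because DAY is a
# datetime column, EXSM_INTERVAL and EXSM_ACCUMULATE must be provided; here
# daily rows are accumulated into monthly totals with a 12-month seasonal cycle.
esm.seasonal <- ore.odmESM(AMOUNT ~ ., SALES,
    odm.settings = list(case_id_column_name  = "DAY",
                        exsm_model           = "EXSM_HW",
                        exsm_interval        = "EXSM_INTERVAL_MONTH",
                        exsm_accumulate      = "EXSM_ACCU_TOTAL",
                        exsm_seasonality     = 12,
                        exsm_setmissing      = "EXSM_MISS_AUTO",
                        exsm_prediction_step = 4))
summary(esm.seasonal)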

Example 7-26 Using the ore.odmESM Function

This example creates a synthetic data frame, pushes it to a temporary database table through the proxy object DAT, and builds an Exponential Smoothing model.

# Turn off row ordering warnings.

options(ore.warn.order=FALSE)

# Data setup

set.seed(7654)
N <- 100
dat <- data.frame(ID=1:N, VAL=runif(N))

# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(dat)

# Create an ESM regression model object. Fit the ESM model according to the data and setting parameters.

esm.mod  <- ore.odmESM(VAL ~ ., DAT,
    odm.settings = list(case_id_column_name = "ID",
                        exsm_prediction_step = 4))
    
esm.mod
summary(esm.mod)

Listing for This Example

Call: ore.odmESM(formula = VAL ~ ., data = DAT, odm.settings = list(case_id_column_name = "ID", exsm_prediction_step = 4))
Settings:                                                 
                                                               value 
         confidence.level                                        .95 
         model                                                simple
         nmse                                                      3
         optimization.crit                              opt.crit.lik
         prediction.step                                           4
         setmissing                                        miss.auto
         odms.details                                    odms.enable
         odms.missing.value.treatment        odms.missing.value.auto
         odms.sampling                         odms.sampling.disable
         prep.auto                                                ON
Call: ore.odmESM(formula = VAL ~ ., data = DAT, odm.settings = list(case_id_column_name = "ID", exsm_prediction_step = 4))
Settings:
                                                      value
       confidence.level                                 .95 
       model                                         simple
       nmse                                               3
       optimization.crit                       opt.crit.lik
       prediction.step                                    4
       setmissing                                 miss.auto
       odms.details                             odms.enable
       odms.missing.value.treatment odms.missing.value.auto
       odms.sampling                  odms.sampling.disable
       prep.auto                                         ON 
Predictions:
      CASE_ID      VALUE PREDICTION        LOWER   UPPER
  1         1 0.68847989  0.5414108         NA      NA
  2         2 0.63346191  0.5414255         NA      NA 
  3         3 0.34073466  0.5414347         NA      NA
  4         4 0.41106593  0.5414146         NA      NA 
  5         5 0.17601063  0.5414016         NA      NA
  6         6 0.82879446  0.5413650         NA      NA
  7         7 0.23504359  0.5413938         NA      NA
  8         8 0.14222260  0.5413631         NA      NA 
  9         9 0.76561760  0.5413232         NA      NA 
  10       10 0.90813842  0.5413457         NA      NA 
  11       11 0.59706210  0.5413823         NA      NA 
  12       12 0.44463468  0.5413879         NA      NA 
  13       13 0.95294541  0.5413782         NA      NA 
  14       14 0.58209937  0.5414194         NA      NA
  15       15 0.62295773  0.5414235         NA      NA
  16       16 0.59711650  0.5414316         NA      NA
  17       17 0.41131782  0.5414372         NA      NA
  18       18 0.79952871  0.5414242         NA      NA
  19       19 0.12635680  0.5414500         NA      NA
  20       20 0.04773946  0.5414085         NA      NA

7.22 XGBoost Model

The ore.odmXGB function builds a model using XGBoost, a scalable gradient tree boosting system that supports both classification and regression. It makes the open source gradient boosting framework available in the database: it prepares the training data, calls the in-database XGBoost implementation, builds and persists a model, and applies the model for prediction.

Note:

The ore.odmXGB algorithm is available in Oracle Database 21c and later.

You can use ore.odmXGB as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems such as ad click-through rate prediction, hazard risk prediction, web text classification, and so on.

The ore.odmXGB algorithm takes three types of parameters: general parameters, booster parameters, and task parameters. You set the parameters through the model settings table. The algorithm supports most of the settings of the open source project.

Through ore.odmXGB, OML4R supports a number of different classification and regression specifications, ranking models, and survival models. Binary and multi-class models are supported under the classification machine learning technique while regression, ranking, count, and survival are supported under the regression machine learning technique.
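In the listing for Example 7-28, the objective setting (multi:softprob) shows how these specifications surface. The following is a hedged sketch of explicitly selecting one: the proxy frame DAT2 and its count-valued response column CNT are hypothetical names, the objective value count:poisson comes from the open source project, and its acceptance by the in-database algorithm is an assumption here.

# Sketch: requesting a count-regression build. DAT2 and CNT are hypothetical
# names; "count:poisson" is the open source XGBoost objective, and treating
# it as accepted in-database is an assumption.
xgb.cnt <- ore.odmXGB(CNT ~ ., DAT2, "regression",
    odm.settings = list(objective = "count:poisson"))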

ore.odmXGB also supports partitioned models and internalizes the data preparation.

Settings for an XGBoost model

The following table lists settings that apply to XGBoost models. A sketch showing how to pass these settings follows the table.

Table 7-22 XGBoost Model Settings

Setting Name Setting Value Description
booster

A string that is one of the following:

  • dart
  • gblinear
  • gbtree

The booster to use:

  • dart
  • gblinear
  • gbtree

The dart and gbtree boosters use tree-based models whereas gblinear uses linear functions.

The default value is gbtree.

num_round

A non-negative integer.

The number of rounds for boosting.

The default value is 10.
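Both settings in Table 7-22 are passed through the odm.settings argument, like the settings of the other OREdm functions. A minimal sketch follows, reusing the DAT proxy frame built in Example 7-27 below; treating additional open source parameters such as max_depth and eta as accepted here is an assumption, since coverage may vary by release.

# Sketch: overriding XGBoost settings through odm.settings. DAT is the proxy
# frame from Example 7-27; max_depth and eta are open source XGBoost booster
# parameters, and their acceptance in-database is an assumption.
xgb.tuned <- ore.odmXGB(y ~ x, DAT, "regression",
    odm.settings = list(booster   = "gbtree",
                        num_round = 50,
                        max_depth = 4,
                        eta       = 0.1))
summary(xgb.tuned)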

Example 7-27 Using the ore.odmXGB Regression Function

This example creates a synthetic data set, pushes it to a temporary database table through the proxy object DAT, and builds an XGBoost regression model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)


# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(data.frame(x=x, y=y))

# Create an XGBoost regression model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
head(xgb.res, 6)

Listing for This Example

>   x <- seq(0.1, 5, by = 0.02)
>   y <- log(x) + rnorm(x, sd = 0.2)
>   DAT <- ore.push(data.frame(x=x, y=y))

>   xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = y ~ x, data = DAT, type = "regression")

Settings: 
                                               value
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1
>   xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
>   head(xgb.res,6)
     x PREDICTION
1 0.10  -1.957506
2 0.12  -1.957506
3 0.14  -1.957506
4 0.16  -1.484602
5 0.18  -1.559072
6 0.20  -1.559072

Example 7-28 Using the ore.odmXGB Classification Function

This example pushes the data frame mtcars to a temporary database table through the proxy object MTCARS and builds an XGBoost classification model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)

# Create a temporary OML4R proxy object MTCARS.

MTCARS <- ore.push(m)

# Create an XGBoost classification model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(gear ~ .-ID, MTCARS, "classification")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")

# Generate a confusion matrix.
with(xgb.res, table(gear, PREDICTION))

Listing for This Example

>   m <- mtcars
>   m$gear <- as.factor(m$gear)
>   m$cyl  <- as.factor(m$cyl)
>   m$vs   <- as.factor(m$vs)
>   m$ID   <- 1:nrow(m)
>   MTCARS <- ore.push(m)

>   xgb.mod  <- ore.odmXGB(gear ~ .-ID, MTCARS,"classification")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = gear ~ . - ID, data = MTCARS, type = "classification")

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10
objective                             multi:softprob

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696
>   xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")
>   with(xgb.res, table(gear,PREDICTION))  
    PREDICTION
gear  3  4  5
   3 15  0  0
   4  0 12  0
   5  0  0  5
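
If a single accuracy figure is more convenient than the confusion matrix, the following is a short follow-on sketch using the same xgb.res proxy frame; ore.pull materializes the small result set in local R memory first.

# Sketch: overall classification accuracy from the prediction result. The
# gear and PREDICTION columns come from the predict() call above; ore.pull
# brings the proxy frame into a local data.frame.
res.local <- ore.pull(xgb.res)
mean(as.character(res.local$gear) == as.character(res.local$PREDICTION))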