4.8 Specify Model Settings
Understand how to configure machine learning models at build time.
Numerous configuration settings are available for configuring machine learning models at build time. To specify settings, create a settings table with the columns shown in the following table and pass the table to CREATE_MODEL
.
You can use CREATE_MODEL2
procedure where you can directly pass the model settings to a variable that can be used in the procedure. The variable can be declared with DBMS_DATA_MINING.SETTING_LIST
procedure.
Table 4-78 Settings Table Required Columns
Column Name | Data Type |
---|---|
|
|
|
|
Example 4-4 creates a settings table for a Support Vector Machine (SVM) classification model. Since SVM is not the default classifier, the ALGO_NAME
setting is used to specify the algorithm. Setting the SVMS_KERNEL_FUNCTION
to SVMS_LINEAR
causes the model to be built with a linear kernel. If you do not specify the kernel function, the algorithm chooses the kernel based on the number of attributes in the data.
Example 4-5 creates a model with the model settings that are stored in a variable from SETTING_LIST
.
Some settings apply generally to the model, others are specific to an algorithm. Model settings are referenced in Table 4-79 and Table 4-80.
Table 4-79 General Model Settings
Settings | Description |
---|---|
Machine learning function settings |
|
Algorithm names |
|
Global model characteristics |
|
Automatic Data Preparation |
Table 4-80 Algorithm-Specific Model Settings
Algorithm | Description |
---|---|
CUR Matrix Decomposition |
DBMS_DATA_MINING —Algorithm Settings: CUR Matrix Decomposition |
Decision Tree |
|
Expectation Maximization |
DBMS_DATA_MINING —Algorithm Settings: Expectation Maximization |
Explicit Semantic Analysis |
DBMS_DATA_MINING —Algorithm Settings: Explicit Semantic Analysis |
Exponential Smoothing |
DBMS_DATA_MINING —Algorithm Settings: Exponential Smoothing Models |
Generalized Linear Model |
DBMS_DATA_MINING —Algorithm Settings: Generalized Linear Models |
k-Means |
|
Multivariate State Estimation Technique - Sequential Probability Ratio Test |
|
Naive Bayes |
|
Neural Network |
|
Non-Negative Matrix Factorization |
DBMS_DATA_MINING —Algorithm Settings: Non-Negative Matrix Factorization |
O-Cluster |
|
Random Forest |
|
Singular Value Decomposition |
DBMS_DATA_MINING —Algorithm Settings: Singular Value Decomposition |
Support Vector Machine |
DBMS_DATA_MINING —Algorithm Settings: Support Vector Machine |
XGBoost |
Note:
Some XGBoost objectives apply only to classification function models and other objectives apply only to regression function models. If you specify an incompatible objective
value, an error is raised. In the DBMS_DATA_MINING.CREATE_MODEL
procedure, if you specify DBMS_DATA_MINING.CLASSIFICATION
as the function, then the only objective values that you can use are the binary
and multi
values. The one exception is binary: logitraw
, which produces a continuous value and applies only to a regression model. If you specify DBMS_DATA_MINING.REGRESSION
as the function, then you can specify binary: logitraw
or any of the count
, rank
, reg
, and survival
values as the objective.
The values for the XGBoost objective setting are listed in the Settings for Learning Tasks table in DBMS_DATA_MINING — Algorithm Settings: XGBoost.
Example 4-4 Creating a Settings Table and Creating an SVM Classification Model Using CREATE.MODEL procedure
CREATE TABLE svmc_sh_sample_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(4000)); BEGIN INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines); INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_linear); COMMIT; END; / -- Create the model using the specified settings BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'svm_model', mining_function => dbms_data_mining.classification, data_table_name => 'mining_data_build_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', settings_table_name => 'svmc_sh_sample_settings'); END;
Example 4-5 Specify Model Settings for a GLM Regression Model Using CREATE_MODEL2 procedure
DECLARE v_setlist DBMS_DATA_MINING.SETTING_LIST; BEGIN v_setlist('PREP_AUTO') := 'ON'; v_setlist('ALGO_NAME') := 'ALGO_GENERALIZED_LINEAR_MODEL'; v_setlist('GLMS_DIAGNOSTICS_TABLE_NAME') := 'GLMR_DIAG'; v_setlist('GLMS_FTR_SELECTION') := 'GLMS_FTR_SELECTION_ENABLE'; v_setlist('GLMS_FTR_GENERATION') := 'GLMS_FTR_GENERATION_ENABLE'; DBMS_DATA_MINING.CREATE_MODEL2( MODEL_NAME => 'GLM_REGR', MINING_FUNCTION => 'REGRESSION', DATA_QUERY => 'select * from TRAINING_DATA', SET_LIST => v_setlist, CASE_ID_COLUMN_NAME => 'HID', TARGET_COLUMN_NAME => 'MEDV'); END;
Related Topics
4.8.1 Specify Costs
Specify a cost matrix table to build a Decision Tree model.
The CLAS_COST_TABLE_NAME
setting specifies the name of a cost matrix table to be used in building a Decision Tree model. A cost matrix biases a classification model to minimize costly misclassifications. The cost matrix table must have the columns shown in the following table:
Table 4-81 Cost Matrix Table Required Columns
Column Name | Data Type |
---|---|
|
valid target data type |
|
valid target data type |
|
|
Decision Tree is the only algorithm that supports a cost matrix at build time. However, you can create a cost matrix and associate it with any classification model for scoring.
If you want to use costs for scoring, create a table with the columns shown in Table 4-81, and use the DBMS_DATA_MINING.ADD_COST_MATRIX
procedure to add the cost matrix table to the model. You can also specify a cost matrix inline when invoking a PREDICTION
function. Table 3-1 has details for valid target data types.
Related Topics
4.8.2 Specify Prior Probabilities
Prior probabilities can be used to offset differences in distribution between the build data and the actual population.
The CLAS_PRIORS_TABLE_NAME
setting specifies the name of a table of prior probabilities to be used in building a Naive Bayes model. The priors table must have the columns shown in the following table.
Table 4-82 Priors Table Required Columns
Column Name | Data Type |
---|---|
|
valid target data type |
|
|
Related Topics
4.8.3 Specify Class Weights
Specify class weights table settings in logistic regression or Support Vector Machine (SVM) classification to favor higher weighted classes.
The CLAS_WEIGHTS_TABLE_NAME
setting specifies the name of a table of class weights to be used to bias a logistic regression (Generalized Linear Model classification) or SVM classification model to favor higher weighted classes. The weights table must have the columns shown in the following table.
Table 4-83 Class Weights Table Required Columns
Column Name | Data Type |
---|---|
|
Valid target data type |
|
|
Related Topics
4.8.4 Specify Oracle Machine Learning Model Settings for an R Model
This topic applies only to Oracle on-premises.
ALGO_EXTENSIBLE_LANG
setting in the model settings table. You can create a model by combining in the settings table generic settings that do not require an algorithm, such as ODMS_PARTITION_COLUMNS
and ODMS_SAMPLING
. You can also specify the following settings, which are exclusive to an R machine learning model.
Related Topics
4.8.4.1 ALGO_EXTENSIBLE_LANG
Use the ALGO_EXTENSIBLE_LANG
setting to specify the language for the Oracle Machine Learning for SQL extensible algorithm framework.
Currently, R
is the only valid value for the ALGO_EXTENSIBLE_LANG
setting. When you set the value for ALGO_EXTENSIBLE_LANG
to R, the machine learning models are built using the R language. You can use the following settings in the settings table to specify the characteristics of the R model.
Related Topics
4.8.4.2 RALG_BUILD_FUNCTION
Use the RALG_BUILD_FUNCTION
setting to specify the name of an existing registered R script for building an Oracle Machine Learning for SQL model using the R language.
RALG_BUILD_FUNCTION
and ALGO_EXTENSIBLE_LANG
settings in the model settings table. The R script defines an R function that has as the first input argument an R data.frame
object for training data. The function returns an Oracle Machine Learning model object. The first data argument is mandatory. The RALG_BUILD_FUNCTION
can accept additional model build parameters.
Note:
The valid inputs for input parameters are numeric and string scalar data types.Example 4-6 Example of RALG_BUILD_FUNCTION
This example shows how to specify the name of the R script MY_LM_BUILD_SCRIPT
that is used to build the model.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_build_function,'MY_LM_BUILD_SCRIPT');
End;
/
The R script MY_LM_BUILD_SCRIPT
defines an R function that builds the LM model. You must register the script MY_LM_BUILD_SCRIPT
in the Oracle Machine Learning for R script repository which uses the existing OML4R security restrictions. You can use the OML4R sys.rqScriptCreate
procedure to register the script. OML4R requires the RQADMIN
role to register R scripts.
For example:
Begin
sys.rqScriptCreate('MY_LM_BUILD_SCRIPT', 'function(data, formula, model.frame) {lm(formula = formula, data=data, model = as.logical(model.frame)}');
End;
/
For Clustering and Feature Extraction machine learning function model builds, the R attributes dm$nclus
and dm$nfeat
must be set on the return R model to indicate the number of clusters and features respectively.
The R script MY_KM_BUILD_SCRIPT
defines an R function that builds the k-Means model for clustering. The R attribute dm$nclus
is set with the number of clusters for the returned clustering model.
'function(dat) {dat.scaled <- scale(dat)
set.seed(6543); mod <- list()
fit <- kmeans(dat.scaled, centers = 3L)
mod[[1L]] <- fit
mod[[2L]] <- attr(dat.scaled, "scaled:center")
mod[[3L]] <- attr(dat.scaled, "scaled:scale")
attr(mod, "dm$nclus") <- nrow(fit$centers)
mod}'
The R script MY_PCA_BUILD_SCRIPT
defines an R function that builds the PCA model. The R attribute dm$nfeat
is set with the number of features for the returned feature extraction model.
'function(dat) {
mod <- prcomp(dat, retx = FALSE)
attr(mod, "dm$nfeat") <- ncol(mod$rotation)
mod}'
Related Topics
4.8.4.2.1 RALG_BUILD_PARAMETER
The RALG_BUILD_FUNCTION
input parameter specifies a list of numeric and string scalar values in SQL SELECT
query statement format.
Example 4-7 Example of RALG_BUILD_PARAMETER
The RALG_BUILD_FUNCTION
input parameters must be a list of numeric and string scalar values. The input parameters are optional.
'SELECT value parameter name ...FROM dual'
'formula'
and a numeric value of zero for input argument 'model.frame'
using the RALG_BUILD_PARAMETER
. These input arguments must match with the function signature of the R script used in the RALG_BUILD_FUNCTION
parameter. Begin
insert into model_setting_table values
(dbms_data_mining.ralg_build_parameter, 'select ''AGE ~ .'' as "formula", 0 as "model.frame" from dual');
End;
/
Related Topics
4.8.4.3 RALG_DETAILS_FUNCTION
The RALG_DETAILS_FUNCTION
specifies the R model metadata that is returned in the R data.frame
.
Use the RALG_DETAILS_FUNCTION
to specify an existing registered R script that generates model information. The script defines an R function that contains the first input argument for the R model object. The output of the R function must be a data.frame
. The columns of the data.frame
are defined by the RALG_DETAILS_FORMAT
setting, and may contain only numeric or string scalar types.
Example 4-8 Example of RALG_DETAILS_FUNCTION
MY_LM_DETAILS_SCRIPT
in the model settings table. This script defines the R function that is used to provide the model information. Begin
insert into model_setting_table values
(dbms_data_mining.ralg_details_function, 'MY_LM_DETAILS_SCRIPT');
End;
/
MY_LM_DETAILS_SCRIPT
is registered as: 'function(mod) data.frame(name=names(mod$coefficients),
coef=mod$coefficients)'
Related Topics
4.8.4.4 RALG_DETAILS_FORMAT
Use the RALG_DETAILS_FORMAT
setting to specify the names and column types in the model view.
The value of the setting is a string that contains a SELECT
statement to specify a list of numeric and string scalar data types for the name and type of the model view columns.
When the RALG_DETAILS_FORMAT
and RALG_DETAILS_FUNCTION
settings are both specified, a model view by the name DM$VD
<model_name>
is created along with an R model in the current schema. The first column of the model view is PARTITION_NAME
. It has the value NULL
for non-partitioned models. The other columns of the model view are defined by RALG_DETAILS_FORMAT
setting.
Example 4-9 Example of RALG_DETAILS_FORMAT
varchar2
column attr_name
and the number column coef_value
after the first column partition_name.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_details_format, 'select cast(''a'' as varchar2(20)) as attr_name, 0 as coef_value from dual');
End;
/
Related Topics
4.8.4.5 RALG_SCORE_FUNCTION
Use the RALG_SCORE_FUNCTION
setting to specify an existing registered R script for R algorithm machine learning model to use for scoring data.
The specified R script defines an R function. The first input argument defines the model object. The second input argument defines the R data.frame
that is used for scoring data.
Example 4-10 Example of RALG_SCORE_FUNCTION
data.frame.
The function argument object
is the LM model. The argument newdata
is a data.frame
containing the data to score.function(object, newdata) {res <- predict.lm(object, newdata = newdata, se.fit = TRUE); data.frame(fit=res$fit, se=res$se.fit, df=summary(object)$df[1L])}
The output of the R function must be a data.frame
. Each row represents the prediction for the corresponding scoring data from the input data.frame
. The columns of the data.frame
are specific to machine learning functions, such as:
Regression: A single numeric column for the predicted target value, with two optional columns containing the standard error of the model fit, and the degrees of freedom number. The optional columns are needed for the SQL function PREDICTION_BOUNDS
to work.
Example 4-11 Example of RALG_SCORE_FUNCTION for Regression
This example shows how to specify the name of the R script MY_LM_PREDICT_SCRIPT
that is used to score the model in the model settings table model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_LM_PREDICT_SCRIPT');
End;
/
MY_LM_PREDICT_SCRIPT
is registered as: function(object, newdata) {data.frame(pre = predict(object, newdata = newdata))}
Classification: Each column represents the predicted probability of one target class. The column name is the target class name.
Example 4-12 Example of RALG_SCORE_FUNCTION for Classification
This example shows how to specify the name of the R script MY_LOGITGLM_PREDICT_SCRIPT
that is used to score the logit Classification model in the model settings table model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_LOGITGLM_PREDICT_SCRIPT');
End;
/
MY_LOGITGLM_PREDICT_SCRIPT
is registered as follows. It is a logit Classification with two target classes, "0" and "1".'function(object, newdata) {
pred <- predict(object, newdata = newdata, type="response");
res <- data.frame(1-pred, pred);
names(res) <- c("0", "1");
res}'
Clustering: Each column represents the predicted probability of one cluster. The columns are arranged in order of cluster ID. Each cluster is assigned a cluster ID, and they are consecutive values starting from 1. To support CLUSTER_DISTANCE
in the R model, the output of R score function returns an extra column containing the value of the distance to each cluster in order of cluster ID after the columns for the predicted probability.
Example 4-13 Example of RALG_SCORE_FUNCTION for Clustering
This example shows how to specify the name of the R script MY_CLUSTER_PREDICT_SCRIPT
that is used to score the model in the model settings table model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_CLUSTER_PREDICT_SCRIPT');
End;
/
MY_CLUSTER_PREDICT_SCRIPT
is registered as:'function(object, dat){
mod <- object[[1L]]; ce <- object[[2L]]; sc <- object[[3L]];
newdata = scale(dat, center = ce, scale = sc);
centers <- mod$centers;
ss <- sapply(as.data.frame(t(centers)),
function(v) rowSums(scale(newdata, center=v, scale=FALSE)^2));
if (!is.matrix(ss)) ss <- matrix(ss, ncol=length(ss));
disp <- -1 / (2* mod$tot.withinss/length(mod$cluster));
distr <- exp(disp*ss);
prob <- distr / rowSums(distr);
as.data.frame(cbind(prob, sqrt(ss)))}'
Feature Extraction: Each column represents the coefficient value of one feature. The columns are arranged in order of feature ID. Each feature is assigned a feature ID, which are consecutive values starting from 1.
Example 4-14 Example of RALG_SCORE_FUNCTION for Feature Extraction
This example shows how to specify the name of the R script MY_FEATURE_EXTRACTION_SCRIPT
that is used to score the model in the model settings table model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_FEATURE_EXTRACTION_SCRIPT');
End;
/
MY_FEATURE_EXTRACTION_SCRIPT
is registered as: 'function(object, dat) { as.data.frame(predict(object, dat)) }'
The function fetches the centers of the features from the R model, and computes the feature coefficient based on the distance of the score data to the corresponding feature center.
Related Topics
4.8.4.6 RALG_WEIGHT_FUNCTION
Use the RALG_WEIGHT_FUNCTION
setting to specify the name of an existing registered R script that computes the weight or contribution for each attribute in scoring. The specified R script is used in the SQL function PREDICTION_DETAILS
to evaluate attribute contribution.
The specified R script defines an R function containing the first input argument for a model object, and the second input argument of an R data.frame
for scoring data. When the machine learning function is Classification, Clustering, or Feature Extraction, the target class name, cluster ID, or feature ID is passed by the third input argument to compute the weight for that particular class, cluster, or feature. The script returns a data.frame
containing the contributing weight for each attribute in a row. Each row corresponds to that input scoring data.frame
.
Example 4-15 Example of RALG_WEIGHT_FUNCTION
MY_PREDICT_WEIGHT_SCRIPT
that computes the weight or contribution of R model attributes in the model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_weight_function, 'MY_PREDICT_WEIGHT_SCRIPT');
End;
/
MY_PREDICT_WEIGHT_SCRIPT
for Regression is registered as:'function(mod, data) { coef(mod)[-1L]*data }'
MY_PREDICT_WEIGHT_SCRIPT
for logit Classification is registered as:'function(mod, dat, clas) {
v <- predict(mod, newdata=dat, type = "response");
v0 <- data.frame(v, 1-v); names(v0) <- c("0", "1");
res <- data.frame(lapply(seq_along(dat),
function(x, dat) {
if(is.numeric(dat[[x]])) dat[,x] <- as.numeric(0)
else dat[,x] <- as.factor(NA);
vv <- predict(mod, newdata = dat, type = "response");
vv = data.frame(vv, 1-vv); names(vv) <- c("0", "1");
v0[[clas]] / vv[[clas]]}, dat = dat));
names(res) <- names(dat);
res}'
Related Topics
4.8.4.7 Registered R Scripts
The RALG_*_FUNCTION
settings must specify R scripts that exist in the Oracle Machine Learning for R script repository.
You can register the R scripts using the OML4R SQL procedure sys.rqScriptCreate
. To register a scripts, you must have the RQADMIN
role.
The RALG_*_FUNCTION
settings include the following functions:
-
RALG_BUILD_FUNCTION
-
RALG_DETAILS_FUNCTION
-
RALG_SCORE_FUNCTION
-
RALG_WEIGHT_FUNCTION
Note:
The R scripts must exist in the OML4R script repository for an R model to function.After an R model is built, the name of the specified R script become a model setting. These R script must exist in the OML4R script repository for an R model to remain functional.
You can manage the R memory that is used to build, score, and view the R models through OML4R as well.
4.8.4.8 Algorithm Metadata Registration
Algorithm metadata registration allows for a uniform and consistent approach of registering new algorithm functions and their settings.
User have the ability to add new algorithms through the REGISTER_ALGORITHM
procedure registration process. The new algorithms can appear as available within Oracle Machine Learning for SQL for their appropriate machine learning functions. Based on the registration metadata, the settings page is dynamically rendered. Algorithm metadata registration extends the machine learning model capability of OML4SQL.