4.8 Specify Model Settings

Understand how to configure machine learning models at build time.

Numerous configuration settings are available for configuring machine learning models at build time. To specify settings, create a settings table with the columns shown in the following table and pass the table to CREATE_MODEL.

You can use CREATE_MODEL2 procedure where you can directly pass the model settings to a variable that can be used in the procedure. The variable can be declared with DBMS_DATA_MINING.SETTING_LIST procedure.

Table 4-78 Settings Table Required Columns

Column Name Data Type

setting_name

VARCHAR2(30)

setting_value

VARCHAR2(4000)

Example 4-4 creates a settings table for a Support Vector Machine (SVM) classification model. Since SVM is not the default classifier, the ALGO_NAME setting is used to specify the algorithm. Setting the SVMS_KERNEL_FUNCTION to SVMS_LINEAR causes the model to be built with a linear kernel. If you do not specify the kernel function, the algorithm chooses the kernel based on the number of attributes in the data.

Example 4-5 creates a model with the model settings that are stored in a variable from SETTING_LIST.

Some settings apply generally to the model, others are specific to an algorithm. Model settings are referenced in Table 4-79 and Table 4-80.

Table 4-79 General Model Settings

Settings Description

Machine learning function settings

Machine Learning Technique Settings

Algorithm names

Algorithm Names

Global model characteristics

Global Settings

Automatic Data Preparation

Automatic Data Preparation

Note:

Some XGBoost objectives apply only to classification function models and other objectives apply only to regression function models. If you specify an incompatible objective value, an error is raised. In the DBMS_DATA_MINING.CREATE_MODEL procedure, if you specify DBMS_DATA_MINING.CLASSIFICATION as the function, then the only objective values that you can use are the binary and multi values. The one exception is binary: logitraw, which produces a continuous value and applies only to a regression model. If you specify DBMS_DATA_MINING.REGRESSION as the function, then you can specify binary: logitraw or any of the count, rank, reg, and survival values as the objective.

The values for the XGBoost objective setting are listed in the Settings for Learning Tasks table in DBMS_DATA_MINING — Algorithm Settings: XGBoost.

Example 4-4 Creating a Settings Table and Creating an SVM Classification Model Using CREATE.MODEL procedure

CREATE TABLE svmc_sh_sample_settings (
  setting_name VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN 
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_linear);
  COMMIT;
END;
/
-- Create the model using the specified settings 
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'svm_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'svmc_sh_sample_settings');
END;

Example 4-5 Specify Model Settings for a GLM Regression Model Using CREATE_MODEL2 procedure

DECLARE
    v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
    v_setlist('PREP_AUTO') := 'ON';
    v_setlist('ALGO_NAME') := 'ALGO_GENERALIZED_LINEAR_MODEL';
    v_setlist('GLMS_DIAGNOSTICS_TABLE_NAME') := 'GLMR_DIAG';
    v_setlist('GLMS_FTR_SELECTION') := 'GLMS_FTR_SELECTION_ENABLE';
    v_setlist('GLMS_FTR_GENERATION') := 'GLMS_FTR_GENERATION_ENABLE';
 
    DBMS_DATA_MINING.CREATE_MODEL2(
        MODEL_NAME          => 'GLM_REGR',
        MINING_FUNCTION     => 'REGRESSION',
        DATA_QUERY          => 'select * from TRAINING_DATA',
        SET_LIST            => v_setlist,
        CASE_ID_COLUMN_NAME => 'HID',
	TARGET_COLUMN_NAME  => 'MEDV');
END;

4.8.1 Specify Costs

Specify a cost matrix table to build a Decision Tree model.

The CLAS_COST_TABLE_NAME setting specifies the name of a cost matrix table to be used in building a Decision Tree model. A cost matrix biases a classification model to minimize costly misclassifications. The cost matrix table must have the columns shown in the following table:

Table 4-81 Cost Matrix Table Required Columns

Column Name Data Type

actual_target_value

valid target data type

predicted_target_value

valid target data type

cost

NUMBER

Decision Tree is the only algorithm that supports a cost matrix at build time. However, you can create a cost matrix and associate it with any classification model for scoring.

If you want to use costs for scoring, create a table with the columns shown in Table 4-81, and use the DBMS_DATA_MINING.ADD_COST_MATRIX procedure to add the cost matrix table to the model. You can also specify a cost matrix inline when invoking a PREDICTION function. Table 3-1 has details for valid target data types.

4.8.2 Specify Prior Probabilities

Prior probabilities can be used to offset differences in distribution between the build data and the actual population.

The CLAS_PRIORS_TABLE_NAME setting specifies the name of a table of prior probabilities to be used in building a Naive Bayes model. The priors table must have the columns shown in the following table.

Table 4-82 Priors Table Required Columns

Column Name Data Type

target_value

valid target data type

prior_probability

NUMBER

4.8.3 Specify Class Weights

Specify class weights table settings in logistic regression or Support Vector Machine (SVM) classification to favor higher weighted classes.

The CLAS_WEIGHTS_TABLE_NAME setting specifies the name of a table of class weights to be used to bias a logistic regression (Generalized Linear Model classification) or SVM classification model to favor higher weighted classes. The weights table must have the columns shown in the following table.

Table 4-83 Class Weights Table Required Columns

Column Name Data Type

target_value

Valid target data type

class_weight

NUMBER

4.8.4 Specify Oracle Machine Learning Model Settings for an R Model

Only in Oracle on-premises This topic applies only to Oracle on-premises.

The machine learning model settings for an R language model determine the characteristics of the model and are specified in the model settings table.
You can build a machine learning model in the R language by specifying R as the value of the ALGO_EXTENSIBLE_LANG setting in the model settings table. You can create a model by combining in the settings table generic settings that do not require an algorithm, such as ODMS_PARTITION_COLUMNS and ODMS_SAMPLING. You can also specify the following settings, which are exclusive to an R machine learning model.

Related Topics

4.8.4.1 ALGO_EXTENSIBLE_LANG

Use the ALGO_EXTENSIBLE_LANG setting to specify the language for the Oracle Machine Learning for SQL extensible algorithm framework.

Currently, R is the only valid value for the ALGO_EXTENSIBLE_LANG setting. When you set the value for ALGO_EXTENSIBLE_LANG to R, the machine learning models are built using the R language. You can use the following settings in the settings table to specify the characteristics of the R model.

Related Topics

4.8.4.2 RALG_BUILD_FUNCTION

Use the RALG_BUILD_FUNCTION setting to specify the name of an existing registered R script for building an Oracle Machine Learning for SQL model using the R language.

You must specify both the RALG_BUILD_FUNCTION and ALGO_EXTENSIBLE_LANG settings in the model settings table. The R script defines an R function that has as the first input argument an R data.frame object for training data. The function returns an Oracle Machine Learning model object. The first data argument is mandatory. The RALG_BUILD_FUNCTION can accept additional model build parameters.

Note:

The valid inputs for input parameters are numeric and string scalar data types.

Example 4-6 Example of RALG_BUILD_FUNCTION

This example shows how to specify the name of the R script MY_LM_BUILD_SCRIPT that is used to build the model.

Begin
insert into model_setting_table values
(dbms_data_mining.ralg_build_function,'MY_LM_BUILD_SCRIPT');
End;
/ 

The R script MY_LM_BUILD_SCRIPT defines an R function that builds the LM model. You must register the script MY_LM_BUILD_SCRIPT in the Oracle Machine Learning for R script repository which uses the existing OML4R security restrictions. You can use the OML4R sys.rqScriptCreate procedure to register the script. OML4R requires the RQADMIN role to register R scripts.

For example:

Begin
sys.rqScriptCreate('MY_LM_BUILD_SCRIPT', 'function(data, formula, model.frame) {lm(formula = formula, data=data, model = as.logical(model.frame)}');
End;
/

For Clustering and Feature Extraction machine learning function model builds, the R attributes dm$nclus and dm$nfeat must be set on the return R model to indicate the number of clusters and features respectively.

The R script MY_KM_BUILD_SCRIPT defines an R function that builds the k-Means model for clustering. The R attribute dm$nclus is set with the number of clusters for the returned clustering model.

'function(dat) {dat.scaled <- scale(dat)
     set.seed(6543); mod <- list()
     fit <- kmeans(dat.scaled, centers = 3L)
     mod[[1L]] <- fit
     mod[[2L]] <- attr(dat.scaled, "scaled:center")
     mod[[3L]] <- attr(dat.scaled, "scaled:scale")
     attr(mod, "dm$nclus") <- nrow(fit$centers)
     mod}'

The R script MY_PCA_BUILD_SCRIPT defines an R function that builds the PCA model. The R attribute dm$nfeat is set with the number of features for the returned feature extraction model.

'function(dat) {
     mod <- prcomp(dat, retx = FALSE)
     attr(mod, "dm$nfeat") <- ncol(mod$rotation)
     mod}'
4.8.4.2.1 RALG_BUILD_PARAMETER

The RALG_BUILD_FUNCTION input parameter specifies a list of numeric and string scalar values in SQL SELECT query statement format.

Example 4-7 Example of RALG_BUILD_PARAMETER

The RALG_BUILD_FUNCTION input parameters must be a list of numeric and string scalar values. The input parameters are optional.

The syntax of the parameter is:
'SELECT value parameter name ...FROM dual'
This example shows how to specify a formula for the input argument 'formula' and a numeric value of zero for input argument 'model.frame' using the RALG_BUILD_PARAMETER. These input arguments must match with the function signature of the R script used in the RALG_BUILD_FUNCTION parameter.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_build_parameter, 'select ''AGE ~ .'' as "formula", 0 as "model.frame" from dual');
End;
/

Related Topics

4.8.4.3 RALG_DETAILS_FUNCTION

The RALG_DETAILS_FUNCTION specifies the R model metadata that is returned in the R data.frame.

Use the RALG_DETAILS_FUNCTION to specify an existing registered R script that generates model information. The script defines an R function that contains the first input argument for the R model object. The output of the R function must be a data.frame. The columns of the data.frame are defined by the RALG_DETAILS_FORMAT setting, and may contain only numeric or string scalar types.

Example 4-8 Example of RALG_DETAILS_FUNCTION

This example shows how to specify the name of the R script MY_LM_DETAILS_SCRIPT in the model settings table. This script defines the R function that is used to provide the model information.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_details_function, 'MY_LM_DETAILS_SCRIPT');
End;
/
In the Oracle Machine Learning for R script repository, the script MY_LM_DETAILS_SCRIPT is registered as:
 'function(mod) data.frame(name=names(mod$coefficients),
    coef=mod$coefficients)'

4.8.4.4 RALG_DETAILS_FORMAT

Use the RALG_DETAILS_FORMAT setting to specify the names and column types in the model view.

The value of the setting is a string that contains a SELECT statement to specify a list of numeric and string scalar data types for the name and type of the model view columns.

When the RALG_DETAILS_FORMAT and RALG_DETAILS_FUNCTION settings are both specified, a model view by the name DM$VD <model_name> is created along with an R model in the current schema. The first column of the model view is PARTITION_NAME. It has the value NULL for non-partitioned models. The other columns of the model view are defined by RALG_DETAILS_FORMAT setting.

Example 4-9 Example of RALG_DETAILS_FORMAT

This example shows how to specify the name and type of the columns for the generated model view. The model view contains the varchar2 column attr_name and the number column coef_value after the first column partition_name.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_details_format, 'select cast(''a'' as varchar2(20)) as attr_name, 0 as coef_value from dual');
End;
/

Related Topics

4.8.4.5 RALG_SCORE_FUNCTION

Use the RALG_SCORE_FUNCTION setting to specify an existing registered R script for R algorithm machine learning model to use for scoring data.

The specified R script defines an R function. The first input argument defines the model object. The second input argument defines the R data.frame that is used for scoring data.

Example 4-10 Example of RALG_SCORE_FUNCTION

This example shows how the R function takes the Linear Model model and scores the data in the data.frame. The function argument object is the LM model. The argument newdata is a data.frame containing the data to score.
function(object, newdata) {res <- predict.lm(object, newdata = newdata, se.fit = TRUE); data.frame(fit=res$fit, se=res$se.fit, df=summary(object)$df[1L])}

The output of the R function must be a data.frame. Each row represents the prediction for the corresponding scoring data from the input data.frame. The columns of the data.frame are specific to machine learning functions, such as:

Regression: A single numeric column for the predicted target value, with two optional columns containing the standard error of the model fit, and the degrees of freedom number. The optional columns are needed for the SQL function PREDICTION_BOUNDS to work.

Example 4-11 Example of RALG_SCORE_FUNCTION for Regression

This example shows how to specify the name of the R script MY_LM_PREDICT_SCRIPT that is used to score the model in the model settings table model_setting_table.

Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_LM_PREDICT_SCRIPT');
End;
/
In the Oracle Machine Learning for R script repository, the script MY_LM_PREDICT_SCRIPT is registered as:
function(object, newdata) {data.frame(pre = predict(object, newdata = newdata))}

Classification: Each column represents the predicted probability of one target class. The column name is the target class name.

Example 4-12 Example of RALG_SCORE_FUNCTION for Classification

This example shows how to specify the name of the R script MY_LOGITGLM_PREDICT_SCRIPT that is used to score the logit Classification model in the model settings table model_setting_table.

Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_LOGITGLM_PREDICT_SCRIPT');
End;
/
In the OML4R script repository, MY_LOGITGLM_PREDICT_SCRIPT is registered as follows. It is a logit Classification with two target classes, "0" and "1".
'function(object, newdata) {
   pred <- predict(object, newdata = newdata, type="response");
   res <- data.frame(1-pred, pred);
   names(res) <- c("0", "1");
   res}'

Clustering: Each column represents the predicted probability of one cluster. The columns are arranged in order of cluster ID. Each cluster is assigned a cluster ID, and they are consecutive values starting from 1. To support CLUSTER_DISTANCE in the R model, the output of R score function returns an extra column containing the value of the distance to each cluster in order of cluster ID after the columns for the predicted probability.

Example 4-13 Example of RALG_SCORE_FUNCTION for Clustering

This example shows how to specify the name of the R script MY_CLUSTER_PREDICT_SCRIPT that is used to score the model in the model settings table model_setting_table.

Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_CLUSTER_PREDICT_SCRIPT');
End;
/
In the OML4R script repository, the script MY_CLUSTER_PREDICT_SCRIPT is registered as:
'function(object, dat){
     mod <- object[[1L]]; ce <- object[[2L]]; sc <- object[[3L]];
     newdata = scale(dat, center = ce, scale = sc);
     centers <- mod$centers;
     ss <- sapply(as.data.frame(t(centers)),
     function(v) rowSums(scale(newdata, center=v, scale=FALSE)^2));
     if (!is.matrix(ss)) ss <- matrix(ss, ncol=length(ss));
     disp <- -1 / (2* mod$tot.withinss/length(mod$cluster));
     distr <- exp(disp*ss);
     prob <- distr / rowSums(distr);
     as.data.frame(cbind(prob, sqrt(ss)))}'
The function fetches the centers of the clusters from the R model, and computes the probabilities for belonging to each cluster based on the distance of the score data to the corresponding cluster center.

Feature Extraction: Each column represents the coefficient value of one feature. The columns are arranged in order of feature ID. Each feature is assigned a feature ID, which are consecutive values starting from 1.

Example 4-14 Example of RALG_SCORE_FUNCTION for Feature Extraction

This example shows how to specify the name of the R script MY_FEATURE_EXTRACTION_SCRIPT that is used to score the model in the model settings table model_setting_table.

Begin
insert into model_setting_table values
(dbms_data_mining.ralg_score_function, 'MY_FEATURE_EXTRACTION_SCRIPT');
End;
/
In the OML4R script repository, the script MY_FEATURE_EXTRACTION_SCRIPT is registered as:
 'function(object, dat) { as.data.frame(predict(object, dat)) }'

The function fetches the centers of the features from the R model, and computes the feature coefficient based on the distance of the score data to the corresponding feature center.

Related Topics

4.8.4.6 RALG_WEIGHT_FUNCTION

Use the RALG_WEIGHT_FUNCTION setting to specify the name of an existing registered R script that computes the weight or contribution for each attribute in scoring. The specified R script is used in the SQL function PREDICTION_DETAILS to evaluate attribute contribution.

The specified R script defines an R function containing the first input argument for a model object, and the second input argument of an R data.frame for scoring data. When the machine learning function is Classification, Clustering, or Feature Extraction, the target class name, cluster ID, or feature ID is passed by the third input argument to compute the weight for that particular class, cluster, or feature. The script returns a data.frame containing the contributing weight for each attribute in a row. Each row corresponds to that input scoring data.frame.

Example 4-15 Example of RALG_WEIGHT_FUNCTION

This example specifies the name of the R script MY_PREDICT_WEIGHT_SCRIPT that computes the weight or contribution of R model attributes in the model_setting_table.
Begin
insert into model_setting_table values
(dbms_data_mining.ralg_weight_function, 'MY_PREDICT_WEIGHT_SCRIPT');
End;
/
In the Oracle Machine Learning for R script repository, the script MY_PREDICT_WEIGHT_SCRIPT for Regression is registered as:
'function(mod, data) { coef(mod)[-1L]*data }'
In the OML4R script repository, the script MY_PREDICT_WEIGHT_SCRIPT for logit Classification is registered as:
'function(mod, dat, clas) {
   v <- predict(mod, newdata=dat, type = "response");
   v0 <- data.frame(v, 1-v); names(v0) <- c("0", "1");
   res <- data.frame(lapply(seq_along(dat),
   function(x, dat) {
   if(is.numeric(dat[[x]])) dat[,x] <- as.numeric(0)
   else dat[,x] <- as.factor(NA);
   vv <- predict(mod, newdata = dat, type = "response");
   vv = data.frame(vv, 1-vv); names(vv) <- c("0", "1");
   v0[[clas]] / vv[[clas]]}, dat = dat));
   names(res) <- names(dat);
   res}'

Related Topics

4.8.4.7 Registered R Scripts

The RALG_*_FUNCTION settings must specify R scripts that exist in the Oracle Machine Learning for R script repository.

You can register the R scripts using the OML4R SQL procedure sys.rqScriptCreate. To register a scripts, you must have the RQADMIN role.

The RALG_*_FUNCTION settings include the following functions:

  • RALG_BUILD_FUNCTION

  • RALG_DETAILS_FUNCTION

  • RALG_SCORE_FUNCTION

  • RALG_WEIGHT_FUNCTION

Note:

The R scripts must exist in the OML4R script repository for an R model to function.

After an R model is built, the name of the specified R script become a model setting. These R script must exist in the OML4R script repository for an R model to remain functional.

You can manage the R memory that is used to build, score, and view the R models through OML4R as well.

4.8.4.8 Algorithm Metadata Registration

Algorithm metadata registration allows for a uniform and consistent approach of registering new algorithm functions and their settings.

User have the ability to add new algorithms through the REGISTER_ALGORITHM procedure registration process. The new algorithms can appear as available within Oracle Machine Learning for SQL for their appropriate machine learning functions. Based on the registration metadata, the settings page is dynamically rendered. Algorithm metadata registration extends the machine learning model capability of OML4SQL.