52 DBMS_DATA_MINING

The DBMS_DATA_MINING package is the application programming interface for creating, evaluating, and querying Oracle Machine Learning for SQL models.

In Oracle Database Release 21c, Oracle Data Mining has been rebranded to Oracle Machine Learning for SQL (OML4SQL). The PL/SQL package name, however, has not changed and remains DBMS_DATA_MINING.

This chapter contains the following topics:

52.1 DBMS_DATA_MINING Overview

Oracle Machine Learning for SQL supports both supervised and unsupervised machine learning. Supervised machine learning predicts a target value based on historical data. Unsupervised machine learning discovers natural groupings and does not use a target. You can use Oracle Machine Learning for SQL procedures on structured data and unstructured text.

Supervised machine learning techniques include:

  • Classification

  • Regression

  • Feature Selection (Attribute Importance)

  • Time Series

Unsupervised machine learning techniques include:

  • Clustering

  • Association

  • Feature Extraction

  • Anomaly Detection

The steps you use to build and apply a machine learning model depend on the machine learning technique and the algorithm being used. The algorithms supported by Oracle Machine Learning for SQL are listed in the following table.

Table 52-1 Oracle Machine Learning for SQL Algorithms

Algorithm | Abbreviation | Function
Apriori | AR | Association
CUR Matrix Decomposition | CUR | Attribute importance
Decision Tree | DT | Classification
Expectation Maximization | EM | Clustering
Explicit Semantic Analysis | ESA | Feature extraction, classification
Exponential Smoothing | ESM | Time series
Generalized Linear Models | GLM | Classification, regression
k-Means | KM | Clustering
Minimum Description Length | MDL | Attribute importance
Multivariate State Estimation Technique - Sequential Probability Ratio Test | MSET-SPRT | Anomaly detection, classification
Naive Bayes | NB | Classification
Neural Network | NN | Classification, regression
Non-Negative Matrix Factorization | NMF | Feature extraction
Orthogonal Partitioning Clustering | O-Cluster | Clustering
Random Forest | RF | Classification
Singular Value Decomposition and Principal Component Analysis | SVD and PCA | Feature extraction
Support Vector Machine | SVM | Classification, regression, anomaly detection
XGBoost | XGBoost | Classification, regression

Oracle Machine Learning for SQL supports more than one algorithm for the classification, regression, clustering, and feature extraction machine learning techniques. Each of these machine learning techniques has a default algorithm, as shown in the following table.

Table 52-2 Oracle Machine Learning for SQL Default Algorithms

Mining Function | Default Algorithm
Classification | Naive Bayes
Clustering | k-Means
Feature Extraction | Non-Negative Matrix Factorization
Feature Selection | Minimum Description Length
Regression | Support Vector Machine
Time Series | Exponential Smoothing
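
For example, the following minimal sketch builds a classification model with the default algorithm (Naive Bayes) by passing an empty settings list to the CREATE_MODEL2 procedure. The CUSTOMER_TRAIN table and its CUST_ID and AFFINITY_CARD columns are illustrative and assumed to exist:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;   -- left empty: all defaults apply
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'NB_DEFAULT_MODEL',
    mining_function     => 'CLASSIFICATION',   -- default algorithm is Naive Bayes
    data_query          => 'SELECT * FROM customer_train',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card');
END;
/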

52.2 DBMS_DATA_MINING Security Model

The DBMS_DATA_MINING package is owned by user SYS and is installed as part of database installation. Execution privilege on the package is granted to public. The routines in the package are run with invokers' rights (run with the privileges of the current user).

The DBMS_DATA_MINING package exposes APIs that are leveraged by Oracle Machine Learning for SQL. Users who wish to create machine learning models in their own schema require the CREATE MINING MODEL system privilege. Users who wish to create machine learning models in other schemas require the CREATE ANY MINING MODEL system privilege.

Users have full control over managing models that exist within their own schema. Additional system privileges necessary for managing machine learning models in other schemas include ALTER ANY MINING MODEL, DROP ANY MINING MODEL, SELECT ANY MINING MODEL, COMMENT ANY MINING MODEL, and AUDIT ANY.

Individual object privileges on machine learning models, ALTER MINING MODEL and SELECT MINING MODEL, can be used to selectively grant privileges on a model to a different user.

See Also:

Oracle Machine Learning for SQL User's Guide for more information about the security features of Oracle Machine Learning for SQL

52.3 DBMS_DATA_MINING — Machine Learning Functions

A machine learning function refers to the methods for solving a given class of machine learning problems.

The machine learning function must be specified when a model is created. You specify a machine learning function with the mining_function parameter of the CREATE_MODEL Procedure or the CREATE_MODEL2 Procedure.

Table 52-3 Machine Learning Functions

Value Description

ASSOCIATION

Association is a descriptive machine learning function. An association model identifies relationships and the probability of their occurrence within a data set.

Association models use the Apriori algorithm.

ATTRIBUTE_IMPORTANCE

Attribute importance is a predictive machine learning function, also known as feature selection. An attribute importance model identifies the relative importance of an attribute in predicting a given outcome.

Attribute importance models can use Minimum Description Length (MDL) or CUR Matrix Decomposition. MDL is the default.

CLASSIFICATION

Classification is a predictive machine learning function. A classification model uses historical data to predict a categorical target.

Classification models can use: Decision Tree, Explicit Semantic Analysis, Generalized Linear Model (logistic regression), Multivariate State Estimation Technique - Sequential Probability Ratio Test, Naive Bayes, Neural Network, Random Forest, Support Vector Machine (SVM), or XGBoost. The default is Naive Bayes.

The classification function can also be used for anomaly detection. For anomaly detection, you can use the Multivariate State Estimation Technique - Sequential Probability Ratio Test algorithm or the SVM algorithm with a null target (One-Class SVM).

CLUSTERING

Clustering is a descriptive machine learning function. A clustering model identifies natural groupings within a data set.

Clustering models can use k-Means, O-Cluster, or Expectation Maximization. The default is k-Means.

FEATURE_EXTRACTION

Feature extraction is a descriptive machine learning function. A feature extraction model creates an optimized data set on which to base a model.

Feature extraction models can use Explicit Semantic Analysis, Non-Negative Matrix Factorization, Singular Value Decomposition, or Principal Component Analysis. Non-Negative Matrix Factorization is the default.

REGRESSION

Regression is a predictive machine learning function. A regression model uses historical data to predict a numerical target.

Regression models can use Generalized Linear Model (linear regression), Neural Network, Support Vector Machine, or XGBoost. The default is Support Vector Machine.

TIME_SERIES

Time series is a predictive machine learning function. A time series model forecasts the future values of a time-ordered series of historical numeric data over a user-specified time window. Time series models use the Exponential Smoothing algorithm.

See Also:

Oracle Machine Learning for SQL Concepts for more information about mining functions
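
As noted in Table 52-3, the classification function with a null target performs anomaly detection. The following sketch builds a One-Class SVM anomaly detection model; the EXPENSE_CLAIMS table and CLAIM_ID column are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME') := 'ALGO_SUPPORT_VECTOR_MACHINES';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'ANOMALY_SVM_MODEL',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM expense_claims',
    set_list            => v_setlist,
    case_id_column_name => 'claim_id',
    target_column_name  => NULL);   -- null target requests One-Class SVM anomaly detection
END;
/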

52.4 DBMS_DATA_MINING — Model Settings

Oracle Machine Learning for SQL uses settings to specify the algorithm and other characteristics of a model. Some settings are general, some are specific to a machine learning function, and some are specific to an algorithm.

All settings have default values. If you want to override one or more of the settings for a model, then you must create a settings table. The settings table must have the column names and data types shown in the following table.

Table 52-4 Required Columns in the Model Settings Table

Column Name | Data Type
SETTING_NAME | VARCHAR2(30)
SETTING_VALUE | VARCHAR2(4000)

The information you provide in the settings table is used by the model at build time. The name of the settings table is an optional argument to the CREATE_MODEL Procedure. You can also provide these settings through the CREATE_MODEL2 Procedure.

The settings used by a model can be found by querying the data dictionary view ALL_MINING_MODEL_SETTINGS. This view displays the model settings used by the machine learning models to which you have access. All of the default and user-specified setting values are included in the view.
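
For example, the following sketch creates a settings table, overrides the default classification algorithm with Generalized Linear Model, passes the table name to the CREATE_MODEL procedure, and then checks the resulting settings in ALL_MINING_MODEL_SETTINGS. The table and column names other than SETTING_NAME and SETTING_VALUE are illustrative:

CREATE TABLE glm_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Override the default classification algorithm (Naive Bayes) with GLM.
  INSERT INTO glm_model_settings VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_generalized_linear_model);
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'GLM_DEMO_MODEL',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'customer_train',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'glm_model_settings');
END;
/

-- Default and user-specified settings used by the model
SELECT setting_name, setting_value
FROM   all_mining_model_settings
WHERE  model_name = 'GLM_DEMO_MODEL'
ORDER  BY setting_name;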

52.4.1 DBMS_DATA_MINING — Algorithm Names

The ALGO_NAME setting specifies the model algorithm.

The values for the ALGO_NAME setting are listed in the following table.

Table 52-5 Algorithm Names

ALGO_NAME Value | Description | Machine Learning Function
ALGO_AI_MDL | Minimum Description Length | Attribute importance
ALGO_APRIORI_ASSOCIATION_RULES | Apriori | Association rules
ALGO_CUR_DECOMPOSITION | CUR Matrix Decomposition | Attribute importance
ALGO_DECISION_TREE | Decision Tree | Classification
ALGO_EXPECTATION_MAXIMIZATION | Expectation Maximization | Clustering
ALGO_EXPLICIT_SEMANTIC_ANALYS | Explicit Semantic Analysis | Feature extraction, classification
ALGO_EXPONENTIAL_SMOOTHING | Exponential Smoothing | Time series
ALGO_EXTENSIBLE_LANG | Language used for extensible algorithm | All mining functions supported
ALGO_GENERALIZED_LINEAR_MODEL | Generalized Linear Model | Classification, regression; also feature selection and generation
ALGO_KMEANS | Enhanced k-Means | Clustering
ALGO_MSET_SPRT | Multivariate State Estimation Technique - Sequential Probability Ratio Test | Classification
ALGO_NAIVE_BAYES | Naive Bayes | Classification
ALGO_NEURAL_NETWORK | Neural Network | Classification, regression
ALGO_NONNEGATIVE_MATRIX_FACTOR | Non-Negative Matrix Factorization | Feature extraction
ALGO_O_CLUSTER | O-Cluster | Clustering
ALGO_RANDOM_FOREST | Random Forest | Classification
ALGO_SINGULAR_VALUE_DECOMP | Singular Value Decomposition | Feature extraction
ALGO_SUPPORT_VECTOR_MACHINES | Support Vector Machine | Classification and regression
ALGO_XGBOOST | XGBoost | Classification and regression

See Also:

Oracle Machine Learning for SQL Concepts for information about algorithms
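
For example, the following sketch sets ALGO_NAME in a SETTING_LIST to override the classification default with Random Forest. The training table and columns are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME') := 'ALGO_RANDOM_FOREST';   -- string value, in place of the default Naive Bayes
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'RF_DEMO_MODEL',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM customer_train',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card');
END;
/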

52.4.2 DBMS_DATA_MINING — Automatic Data Preparation

Oracle Machine Learning for SQL supports fully Automatic Data Preparation (ADP), user-directed general data preparation, and user-specified embedded data preparation. The PREP_* settings enable the user to request fully automated or user-directed general data preparation. By default, fully Automatic Data Preparation (PREP_AUTO_ON) is enabled.

When you enable ADP, the model uses heuristics to transform the build data according to the requirements of the algorithm. Instead of fully ADP, the user can request that the data be shifted and/or scaled with the PREP_SCALE* and PREP_SHIFT* settings. The transformation instructions are stored with the model and reused whenever the model is applied. The model settings can be viewed in USER_MINING_MODEL_SETTINGS.

You can choose to supplement Automatic Data Preparation by specifying additional transformations in the xform_list parameter when you build the model. See "CREATE_MODEL Procedure" and "CREATE_MODEL2 Procedure".

If you do not use ADP and do not specify transformations in the xform_list parameter to CREATE_MODEL, you must implement your own transformations separately in the build, test, and scoring data. You must take special care to implement the exact same transformations in each data set.

If you do not use ADP, but you do specify transformations in the xform_list parameter to CREATE_MODEL, OML4SQL embeds the transformation definitions in the model and prepares the test and scoring data to match the build data.

The values for the PREP_* settings are described in the following table.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.PREP_AUTO_ON. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'ON'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-6 PREP_* Setting

Setting Name Constant Value String Value Equivalent Description

PREP_AUTO

PREP_AUTO_ON

ON

This setting enables fully automated data preparation.

The default is PREP_AUTO_ON.

PREP_AUTO_OFF

OFF

Disables fully automated data preparation.

PREP_SCALE_2DNUM

PREP_SCALE_STDDEV

PREP_SCALE_STDDEV

This setting enables scaling data preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect.

PREP_SCALE_STDDEV: A request to divide the column values by the standard deviation of the column and is often provided together with PREP_SHIFT_MEAN to yield z-score normalization.

PREP_SCALE_RANGE

PREP_SCALE_RANGE

A request to divide the column values by the range of values and is often provided together with PREP_SHIFT_MIN to yield a range of [0,1].

PREP_SCALE_NNUM

PREP_SCALE_MAXABS

PREP_SCALE_MAXABS

This setting enables scaling data preparation for nested numeric columns. PREP_AUTO must be OFF for this setting to take effect. If specified, then the valid value for this setting is PREP_SCALE_MAXABS, which yields data in the range of [-1,1].

PREP_SHIFT_2DNUM

PREP_SHIFT_MEAN

PREP_SHIFT_MEAN

This setting enables centering data preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect.

PREP_SHIFT_MEAN: Results in subtracting the average of the column from each value.

PREP_SHIFT_MIN

PREP_SHIFT_MIN

Results in subtracting the minimum of the column from each value.

See Also:

Oracle® Machine Learning for SQL for information about data transformations
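
For example, the following sketch disables ADP and requests z-score normalization of two-dimensional numeric columns (mean shift plus standard-deviation scaling) for an SVM regression model. The HOUSE_PRICES_TRAIN table and its columns are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('PREP_AUTO')        := 'OFF';                -- user-directed data preparation
  v_setlist('PREP_SHIFT_2DNUM') := 'PREP_SHIFT_MEAN';    -- subtract the column mean
  v_setlist('PREP_SCALE_2DNUM') := 'PREP_SCALE_STDDEV';  -- divide by the standard deviation
  v_setlist('ALGO_NAME')        := 'ALGO_SUPPORT_VECTOR_MACHINES';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'SVM_ZSCORE_MODEL',
    mining_function     => 'REGRESSION',
    data_query          => 'SELECT * FROM house_prices_train',
    set_list            => v_setlist,
    case_id_column_name => 'house_id',
    target_column_name  => 'sale_price');
END;
/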

52.4.3 DBMS_DATA_MINING — Machine Learning Function Settings

The settings described in this table apply to a machine learning function.

Table 52-7 Machine Learning Function Settings

Machine Learning Function Setting Name Setting Value Description

Association

ASSO_MAX_RULE_LENGTH

TO_CHAR( 2<= numeric_expr <=20)

Maximum rule length for association rules.

Default is 4.

Association

ASSO_MIN_CONFIDENCE

TO_CHAR( 0<= numeric_expr <=1)

Minimum confidence for association rules.

Default is 0.1.

Association

ASSO_MIN_SUPPORT

TO_CHAR( 0<= numeric_expr <=1)

Minimum support for association rules

Default is 0.1.

Association

ASSO_MIN_SUPPORT_INT

a positive integer

Minimum absolute support that each rule must satisfy. The value must be an integer.

The default is 1.

Association

ASSO_MIN_REV_CONFIDENCE

TO_CHAR( 0<= numeric_expr <=1)

Sets the Minimum Reverse Confidence that each rule should satisfy.

The Reverse Confidence of a rule is defined as the number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs.

The value is a real number between 0 and 1.

The default is 0.

Association

ASSO_IN_RULES

NULL

Sets Including Rules applied for each association rule: it specifies a list of items, at least one of which must appear in each reported association rule, either as antecedent or as consequent. It is a comma-separated string containing the list of included items.

If this setting is not provided, the filter is not applied.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_in_rules, '''a'',''b''');

Association

ASSO_EX_RULES

NULL

Sets Excluding Rules applied for each association rule: it specifies a list of items, none of which can appear in a reported association rule. It is a comma-separated string containing the list of excluded items. No rule can contain any item in the list.

The default is NULL.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_ex_rules, '''a'',''b''');

Association

ASSO_ANT_IN_RULES

NULL

Sets Including Rules for the antecedent: it specifies a list of items, at least one of which must appear in the antecedent part of each reported association rule. It is a comma-separated string containing the list of included items. The antecedent part of each rule must contain at least one item in the list.

The default is NULL.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_ant_in_rules, '''a'',''b''');

Association

ASSO_ANT_EX_RULES

NULL

Sets Excluding Rules for the antecedent: it specifies a list of items, none of which can appear in the antecedent part of a reported association rule. It is a comma-separated string containing the list of excluded items. No rule can contain any item in the list in its antecedent part.

The default is NULL.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_ant_ex_rules, '''a'',''b''');

Association

ASSO_CONS_IN_RULES

NULL

Sets Including Rules for the consequent: it specifies a list of items, one of which must be the consequent of each reported association rule. It is a comma-separated string containing the list of included items. The consequent of each rule must be an item in the list.

The default is NULL.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_cons_in_rules, '''a'',''b''');

Association

ASSO_CONS_EX_RULES

NULL

Sets Excluding Rules for the consequent: it specifies a list of items, none of which can appear as the consequent of a reported association rule. It is a comma-separated string containing the list of excluded items. No rule can have any item in the list as its consequent.

Excluding Rules can be used to reduce the data that must be stored, but the user may be required to build an extra model in order to apply different Including or Excluding Rules.

The default is NULL.

For example,
INSERT INTO sett_tab (setting_name, setting_value) VALUES        
      (dbms_data_mining.asso_cons_ex_rules, '''a'',''b''');

Association

ASSO_AGGREGATES

NULL

Specifies the columns to be aggregated. It is a comma separated string containing the names of the columns for aggregation. The number of columns in the list must be <= 10.

You can set ASSO_AGGREGATES if ODMS_ITEM_ID_COLUMN_NAME is set indicating transactional input data. See DBMS_DATA_MINING - Global Settings. The data table must have valid column names such as ITEM_ID and CASE_ID which are derived from ODMS_ITEM_ID_COLUMN_NAME and case_id_column_name respectively.

ITEM_VALUE is not a mandatory value.

The default is NULL.

For each item, the user may supply several columns to aggregate. Doing so requires more memory to buffer the extra data, and performance can be affected by the larger input data set and the additional operations.

Association

ASSO_ABS_ERROR

0 < ASSO_ABS_ERROR <= MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE)

Specifies the absolute error for the association rules sampling.

A smaller value of ASSO_ABS_ERROR results in a larger sample size, which gives more accurate results but takes longer to compute. Set a reasonable value for ASSO_ABS_ERROR, such as its default value, to avoid an overly large sample size. The default value is 0.5 * MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE).

Association

ASSO_CONF_LEVEL

0 <= ASSO_CONF_LEVEL <= 1

Specifies the confidence level for an association rules sample.

A larger value of ASSO_CONF_LEVEL results in a larger sample size. Any value between 0.9 and 1 is suitable. The default value is 0.95.

Classification

CLAS_COST_TABLE_NAME

table_name

(Decision tree only) Name of a table that stores a cost matrix to be used by the algorithm in building the model. The cost matrix specifies the costs associated with misclassifications.

Only decision tree models can use a cost matrix at build time. All classification algorithms can use a cost matrix at apply time.

The cost matrix table is user-created. See "ADD_COST_MATRIX Procedure" for the column requirements.

See Oracle Machine Learning for SQL Concepts for information about costs.

Classification

CLAS_PRIORS_TABLE_NAME

table_name

(Naive Bayes) Name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data.

The priors table is user-created. See Oracle Machine Learning for SQL User’s Guide for the column requirements. See Oracle Machine Learning for SQL Concepts for additional information about priors.

Classification

CLAS_WEIGHTS_TABLE_NAME

table_name

(GLM and SVM only) Name of a table that stores weighting information for individual target values in SVM classification and GLM logistic regression models. The weights are used by the algorithm to bias the model in favor of higher weighted classes.

The class weights table is user-created. See Oracle Machine Learning for SQL User’s Guide for the column requirements. See Oracle Machine Learning for SQL Concepts for additional information about class weights.

Classification

CLAS_WEIGHTS_BALANCED

ON

OFF

This setting indicates that the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is OFF.

Classification

CLAS_MAX_SUP_BINS

For Decision Tree:

2 <= a number <=2147483647

For Random Forest:

2 <= a number <=254

This parameter specifies the maximum number of bins for each attribute.

The default value is 32.

See DBMS_DATA_MINING — Automatic Data Preparation.

Clustering

CLUS_NUM_CLUSTERS

TO_CHAR( numeric_expr >=1)

The maximum number of leaf clusters generated by a clustering algorithm. The algorithm may return fewer clusters, depending on the data.

Enhanced k-Means usually produces the exact number of clusters specified by CLUS_NUM_CLUSTERS, unless there are fewer distinct data points.

Expectation maximization (EM) may return fewer clusters than the number specified by CLUS_NUM_CLUSTERS depending on the data. The number of clusters returned by EM cannot be greater than the number of components, which is governed by algorithm-specific settings. (See Expectation Maximization Settings for Learning table) Depending on these settings, there may be fewer clusters than components. If component clustering is disabled, the number of clusters equals the number of components.

For EM, the default value of CLUS_NUM_CLUSTERS is system-determined. For k-Means and O-Cluster, the default is 10.

Feature extraction

FEAT_NUM_FEATURES

TO_CHAR( numeric_expr >=1)

The number of features to be extracted by a feature extraction model.

The default is estimated from the data by the algorithm. If the matrix rank is smaller than this number, fewer features will be returned.

For CUR Matrix Decomposition, the FEAT_NUM_FEATURES value is the same as the CURS_SVD_RANK value.

See Also:

Oracle Machine Learning for SQL Concepts for information about machine learning functions
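
For example, the following sketch uses the CLUS_NUM_CLUSTERS function setting to request at most eight leaf clusters from the default clustering algorithm (k-Means). The CUSTOMER_PROFILES table and CUST_ID column are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('CLUS_NUM_CLUSTERS') := '8';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'KM_SEGMENT_MODEL',
    mining_function     => 'CLUSTERING',   -- default algorithm is k-Means
    data_query          => 'SELECT * FROM customer_profiles',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id');
END;
/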

52.4.4 DBMS_DATA_MINING — Global Settings

The configuration settings in this table are applicable to any type of model, but are currently only implemented for specific algorithms.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.ODMS_MISSING_VALUE_MEAN_MODE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'ODMS_MISSING_VALUE_MEAN_MODE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-8 Global Settings

Setting Name Constant Value String Value Equivalent Description

ODMS_ITEM_ID_COLUMN_NAME

column_name

column_name

(Association rules only) Name of a column that contains the items in a transaction. When this setting is specified, the algorithm expects the data to be presented in a native transactional format, consisting of two columns:

  • Case ID, either categorical or numeric

  • Item ID, either categorical or numeric

A typical example of transactional data is market basket data, wherein a case represents a basket that may contain many items. Each item is stored in a separate row, and many rows may be needed to represent a case. The case ID values do not uniquely identify each row. Transactional data is also called multi-record case data.

Association rules function is normally used with transactional data, but it can also be applied to single-record case data (similar to other algorithms).

For more information about single-record and multi-record case data, see Oracle Machine Learning for SQL User's Guide.

ODMS_ITEM_VALUE_COLUMN_NAME

column_name

column_name

(Association rules only) Name of a column that contains a value associated with each item in a transaction. This setting is only used when a value has been specified for ODMS_ITEM_ID_COLUMN_NAME indicating that the data is presented in native transactional format.

If ASSO_AGGREGATES is used, then the build data must include the following three columns and the columns specified in the AGGREGATES setting.

  • Case ID, either categorical or numeric

  • Item ID, either categorical or numeric, specified by ODMS_ITEM_ID_COLUMN_NAME

  • Item value, either categorical or numeric, specified by ODMS_ITEM_VALUE_COLUMN_NAME

If ASSO_AGGREGATES, Case ID, and Item ID column are present, then the Item Value column may or may not appear.

The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples).

For details on ASSO_AGGREGATES, see DBMS_DATA_MINING - Machine Learning Function Settings.

ODMS_MISSING_VALUE_TREATMENT

ODMS_MISSING_VALUE_MEAN_MODE

ODMS_MISSING_VALUE_MEAN_MODE

Indicates how to treat missing values in the training data. This setting does not affect the scoring data. The default value is ODMS_MISSING_VALUE_AUTO.

ODMS_MISSING_VALUE_MEAN_MODE replaces missing values with the mean (numeric attributes) or the mode (categorical attributes) both at build time and apply time where appropriate.

When ODMS_MISSING_VALUE_TREATMENT is set to ODMS_MISSING_VALUE_DELETE_ROW, the rows in the training data that contain missing values are deleted. However, if you want to replicate this missing value treatment in the scoring data, then you must perform the transformation explicitly.

ODMS_MISSING_VALUE_DELETE_ROW

ODMS_MISSING_VALUE_DELETE_ROW

The value ODMS_MISSING_VALUE_DELETE_ROW applies to all algorithms.

ODMS_MISSING_VALUE_AUTO

ODMS_MISSING_VALUE_AUTO

ODMS_MISSING_VALUE_AUTO performs different strategies for different algorithms.

ODMS_ROW_WEIGHT_COLUMN_NAME

column_name

column_name

(GLM only) Name of a column in the training data that contains a weighting factor for the rows. The column data type must be NUMBER.

Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. Row weights can also be used to emphasize certain rows during model construction. For example, to bias the model towards rows that are more recent and away from potentially obsolete data.

ODMS_TEXT_POLICY_NAME

The name of an Oracle Text POLICY created using CTX_DDL.CREATE_POLICY.

The name of an Oracle Text POLICY created using CTX_DDL.CREATE_POLICY.

Affects how individual tokens are extracted from unstructured text.

For details about CTX_DDL.CREATE_POLICY, see Oracle Text Reference.

ODMS_TEXT_MAX_FEATURES

1 <= value

1 <= value

The maximum number of distinct features, across all text attributes, to use from a document set passed to CREATE_MODEL. The default is 3000. For ESA, the default is 300000.

ODMS_TEXT_MIN_DOCUMENTS

Non-negative value

Non-negative value

This is a text processing setting that controls in how many documents a token must appear in order to be used as a feature.

The default is 1. ESA has a default of 3.

ODMS_PARTITION_COLUMNS

Comma separated list of machine learning attributes

Comma separated list of machine learning attributes

This setting indicates a request to build a partitioned model. The setting value is a comma-separated list of the machine learning attributes to be used to determine the in-list partition key values. These machine learning attributes are taken from the input columns unless an XFORM_LIST parameter is passed to CREATE_MODEL. If the XFORM_LIST parameter is passed to CREATE_MODEL, then the machine learning attributes are taken from the attributes produced by these transformations.

ODMS_MAX_PARTITIONS

1< value <= 1000000

1< value <= 1000000

This setting indicates the maximum number of partitions allowed for the model. The default is 1000.

ODMS_SAMPLING

ODMS_SAMPLING_ENABLE

ODMS_SAMPLING_ENABLE

This setting allows the user to request a sampling of the build data. The default is ODMS_SAMPLING_DISABLE.

ODMS_SAMPLING_DISABLE

ODMS_SAMPLING_DISABLE

Disables sampling of the build data.

ODMS_SAMPLE_SIZE

0 < Value

0 < Value

This setting determines how many rows will be sampled (approximately). It can be set only if ODMS_SAMPLING is enabled. The default value is system-determined.

ODMS_PARTITION_BUILD_TYPE

ODMS_PARTITION_BUILD_INTRA

ODMS_PARTITION_BUILD_INTRA

This setting controls the parallel build of partitioned models.

ODMS_PARTITION_BUILD_INTRA — Each partition is built in parallel using all slaves.

The default mode is ODMS_PARTITION_BUILD_HYBRID.

ODMS_PARTITION_BUILD_INTER

ODMS_PARTITION_BUILD_INTER

Each partition is built entirely in a single slave, but multiple partitions may be built at the same time since multiple slaves are active.

ODMS_PARTITION_BUILD_HYBRID

ODMS_PARTITION_BUILD_HYBRID

It is a combination of the other two types and is recommended for most situations to adapt to dynamic environments.

ODMS_TABLESPACE_NAME

tablespace_name

tablespace_name

This setting controls the storage specifications.

If you explicitly set this to the name of a tablespace (for which you have sufficient quota), then the resulting model content is created in the specified tablespace. If you do not provide this setting, then the resulting model content is created in your default tablespace.

ODMS_RANDOM_SEED

Non-negative integer

Non-negative integer

The hash function with a random number seed generates a random number with uniform distribution. Users can control the random number seed with this setting. The default is 0.

This setting is used by Random Forest, Neural Network, and CUR Matrix Decomposition.

ODMS_DETAILS

ODMS_ENABLE

ODMS_ENABLE

This setting controls whether model detail tables and views are created; disabling it reduces the space that is used while creating a model, especially a partitioned model. The default value is ODMS_ENABLE.

When the setting is ODMS_ENABLE, model tables and views are created along with the model, and you can query the model with SQL.

The reduction in space obtained by disabling this setting depends on the model. A reduction on the order of 10x can be achieved.

ODMS_DISABLE

ODMS_DISABLE

When the setting is ODMS_DISABLE, model views are not created and tables relevant to model details are not created either.

See Also:

Oracle Machine Learning for SQL Concepts for information about GLM

Oracle Machine Learning for SQL Concepts for information about association rules

Oracle Machine Learning for SQL User’s Guide for information about machine learning unstructured text
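
For example, the following sketch builds an association rules model on transactional (market basket) data by naming the item column with ODMS_ITEM_ID_COLUMN_NAME and tightening the minimum support and confidence. The SALES_TRANSACTIONS table and its BASKET_ID and ITEM_ID columns are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ODMS_ITEM_ID_COLUMN_NAME') := 'ITEM_ID';
  v_setlist('ASSO_MIN_SUPPORT')         := '0.05';
  v_setlist('ASSO_MIN_CONFIDENCE')      := '0.2';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'MARKET_BASKET_AR',
    mining_function     => 'ASSOCIATION',
    data_query          => 'SELECT basket_id, item_id FROM sales_transactions',
    set_list            => v_setlist,
    case_id_column_name => 'basket_id');
END;
/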

52.5 DBMS_DATA_MINING — Algorithm Specific Model Settings

Oracle Machine Learning for SQL uses algorithm specific settings to define the characteristics of a model.

All settings have default values. If you want to override one or more of the settings for a model, then you must specify those settings.

The information you provide in the settings table is used by the model at build time. The name of the settings table is an optional argument to the CREATE_MODEL Procedure. You can also provide these settings through the CREATE_MODEL2 Procedure.

The settings used by a model can be found by querying the data dictionary view ALL_MINING_MODEL_SETTINGS. This view displays the model settings used by the machine learning models to which you have access. All of the default and user-specified setting values are included in the view.

52.5.1 DBMS_DATA_MINING — Algorithm Settings: ALGO_EXTENSIBLE_LANG

The settings listed in the following table configure the behavior of the machine learning model with an extensible algorithm. The model is built in the R language.

The RALG_*_FUNCTION settings specify the R scripts that are used to build, score, and view an R model; these scripts must be registered in the Oracle Machine Learning for R script repository. The R scripts are registered through Oracle Machine Learning for R with special privileges. When ALGO_EXTENSIBLE_LANG is set to R in the MINING_MODEL_SETTING table, the machine learning model is built in the R language. After the R model is built, the names of the R scripts are recorded in the MINING_MODEL_SETTING table in the SYS schema. The scripts must exist in the script repository for the R model to function. The amount of R memory used to build, score, and view the R model through these R scripts can be controlled by Oracle Machine Learning for R.

All algorithm-independent DBMS_DATA_MINING subprograms can operate on an R model for machine learning functions such as association, attribute importance, classification, clustering, feature extraction, and regression.

The supported DBMS_DATA_MINING subprograms include, but are not limited to, the following:

  • ADD_COST_MATRIX Procedure

  • COMPUTE_CONFUSION_MATRIX Procedure

  • COMPUTE_LIFT Procedure

  • COMPUTE_ROC Procedure

  • CREATE_MODEL Procedure

  • DROP_MODEL Procedure

  • EXPORT_MODEL Procedure

  • GET_MODEL_COST_MATRIX Function

  • IMPORT_MODEL Procedure

  • REMOVE_COST_MATRIX Procedure

  • RENAME_MODEL Procedure

Table 52-9 ALGO_EXTENSIBLE_LANG Settings

Setting Name Setting Value Description

RALG_BUILD_FUNCTION

R_BUILD_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm machine learning model build function. The R script defines an R function whose first input argument is the training data, and it returns an R model object. For clustering and feature extraction model builds, the R attributes dm$nclus and dm$nfeat must be set on the R model to indicate the number of clusters and features, respectively. The RALG_BUILD_FUNCTION must be set along with ALGO_EXTENSIBLE_LANG in the model_setting_table.

RALG_BUILD_PARAMETER

SELECT value param_name, ... FROM DUAL

Specifies a list of numeric and string scalars for the optional input parameters of the model build function.

RALG_SCORE_FUNCTION

R_SCORE_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script to score data. The script returns a data.frame containing the corresponding prediction results. The setting is used to score data for machine learning functions such as regression, classification, clustering, and feature extraction. This setting does not apply to the association and the attribute importance functions.

RALG_WEIGHT_FUNCTION

R_WEIGHT_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that computes the weight (contribution) for each attribute in scoring. The script returns a data.frame containing the contributing weight for each attribute in a row. This function setting is needed for the PREDICTION_DETAILS SQL function.

RALG_DETAILS_FUNCTION

R_DETAILS_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that produces the model information. This setting is required to generate a model view.

RALG_DETAILS_FORMAT

SELECT type_value column_name, ... FROM DUAL

Specifies the SELECT query for the list of numeric and string scalars for the output column type and the column name of the generated model view. This setting is required to generate a model view.
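
For example, the following sketch builds an R-based regression model. It assumes that R scripts named my_r_build_script and my_r_score_script have already been registered in the Oracle Machine Learning for R script repository; those script names, the table, and the columns are placeholders:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_EXTENSIBLE_LANG') := 'R';
  v_setlist('RALG_BUILD_FUNCTION')  := 'my_r_build_script';   -- registered R build script (placeholder name)
  v_setlist('RALG_SCORE_FUNCTION')  := 'my_r_score_script';   -- registered R scoring script (placeholder name)
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'R_REGRESSION_DEMO',
    mining_function     => 'REGRESSION',
    data_query          => 'SELECT * FROM house_prices_train',
    set_list            => v_setlist,
    case_id_column_name => 'house_id',
    target_column_name  => 'sale_price');
END;
/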

52.5.2 DBMS_DATA_MINING — Algorithm Settings: CUR Matrix Decomposition

The following settings affect the behavior of the CUR Matrix Decomposition algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.CURS_ROW_IMP_DISABLE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'CURS_ROW_IMP_DISABLE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-10 CUR Matrix Decomposition Settings

Setting Name Constant Value String Value Equivalent Description

CURS_APPROX_ATTR_NUM

The value must be a positive integer

The value must be a positive integer

Defines the approximate number of attributes to be selected.

The default value is the number of attributes.

CURS_ROW_IMPORTANCE

CURS_ROW_IMP_ENABLE

CURS_ROW_IMP_ENABLE

Defines the flag indicating whether or not to perform row selection.

Enables row selection.

The default value is CURS_ROW_IMP_DISABLE.

CURS_ROW_IMP_DISABLE

CURS_ROW_IMP_DISABLE

Disables row selection.

CURS_APPROX_ROW_NUM

The value must be a positive integer

The value must be a positive integer

Defines the approximate number of rows to be selected. This parameter is only used when users decide to perform row selection (CURS_ROW_IMP_ENABLE).

The default value is the total number of rows.

CURS_SVD_RANK

The value must be a positive integer

The value must be a positive integer

Defines the rank parameter used in the column/row leverage score calculation.

If users do not provide an input value, the value is determined by the system.
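
For example, the following sketch builds a CUR Matrix Decomposition attribute importance model that selects approximately ten attributes and also enables row selection. The CUSTOMER_PROFILES table and CUST_ID column are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME')            := 'ALGO_CUR_DECOMPOSITION';
  v_setlist('CURS_APPROX_ATTR_NUM') := '10';
  v_setlist('CURS_ROW_IMPORTANCE')  := 'CURS_ROW_IMP_ENABLE';
  v_setlist('CURS_APPROX_ROW_NUM')  := '500';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'CUR_AI_MODEL',
    mining_function     => 'ATTRIBUTE_IMPORTANCE',
    data_query          => 'SELECT * FROM customer_profiles',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id');
END;
/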

52.5.3 DBMS_DATA_MINING — Algorithm Settings: Decision Tree

These settings configure the behavior of the Decision Tree algorithm. Note that the Decision Tree settings are also used to configure the behavior of Random Forest as it constructs each individual decision tree.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.TREE_IMPURITY_ENTROPY. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'TREE_IMPURITY_ENTROPY'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-11 Decision Tree Settings

Setting Name Constant Value String Value Equivalent Description

TREE_IMPURITY_METRIC

TREE_IMPURITY_ENTROPY

TREE_IMPURITY_ENTROPY

Tree impurity metric for Decision Tree.

Tree algorithms seek the best test question for splitting data at each node. The best splitter and split values are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured by a metric. By default, the algorithm uses TREE_IMPURITY_GINI.

TREE_IMPURITY_GINI

TREE_IMPURITY_GINI

Decision trees can use either Gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric.

TREE_TERM_MAX_DEPTH

For Decision Tree:

2<= a number <=20

For Random Forest:

2<= a number <=100

For Decision Tree:

2<= a number <=20

For Random Forest:

2<= a number <=100

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

For Decision Tree, the default is 7.

For Random Forest, the default is 16.

TREE_TERM_MINPCT_NODE

0<= a number<=10

0<= a number<=10

The minimum number of training rows in a node expressed as a percentage of the rows in the training data.

Default is 0.05, indicating 0.05%.

TREE_TERM_MINPCT_SPLIT

0 < a number <=20

0 < a number <=20

The minimum number of rows required to consider splitting a node expressed as a percentage of the training rows.

Default is 0.1, indicating 0.1%.

TREE_TERM_MINREC_NODE

a number>=0

a number>=0

The minimum number of rows in a node.

Default is 10.

TREE_TERM_MINREC_SPLIT

a number > 1

a number > 1

Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value.

Default is 20.

See Also:

Oracle Machine Learning for SQL Concepts for information about Decision Tree
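
For example, the following sketch builds a Decision Tree model that uses the entropy impurity metric and tightens the split criteria. The training table and columns are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME')              := 'ALGO_DECISION_TREE';
  v_setlist('TREE_IMPURITY_METRIC')   := 'TREE_IMPURITY_ENTROPY';
  v_setlist('TREE_TERM_MAX_DEPTH')    := '10';   -- valid range for Decision Tree: 2 to 20
  v_setlist('TREE_TERM_MINREC_SPLIT') := '40';   -- require at least 40 records before a split
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'DT_TUNED_MODEL',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM customer_train',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card');
END;
/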

52.5.4 DBMS_DATA_MINING — Algorithm Settings: Expectation Maximization

These algorithm settings configure the behavior of the Expectation Maximization algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.EMCS_ATTR_FILTER_ENABLE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'EMCS_ATTR_FILTER_ENABLE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

See Also:

Oracle Machine Learning for SQL Concepts for information about Expectation Maximization

Table 52-12 Expectation Maximization Settings for Data Preparation and Analysis

Setting Name Constant Value String Value Equivalent Description

EMCS_ATTRIBUTE_FILTER

EMCS_ATTR_FILTER_ENABLE

EMCS_ATTR_FILTER_ENABLE

Whether or not to include uncorrelated attributes in the model. When EMCS_ATTRIBUTE_FILTER is enabled, uncorrelated attributes are not included.

Note:

This setting applies only to attributes that are not nested.

Default is system-determined.

EMCS_ATTR_FILTER_DISABLE

EMCS_ATTR_FILTER_DISABLE

Includes uncorrelated attributes in the model.

EMCS_MAX_NUM_ATTR_2D

TO_CHAR( numeric_expr >=1)

TO_CHAR( numeric_expr >=1)

Maximum number of correlated attributes to include in the model.

Note: This setting applies only to attributes that are not nested (2D).

Default is 50.

EMCS_NUM_DISTRIBUTION

EMCS_NUM_DISTR_BERNOULLI

EMCS_NUM_DISTR_BERNOULLI

The distribution for modeling numeric attributes. Applies to the input table or view as a whole and does not allow per-attribute specifications.

The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numeric attributes are modeled using the same type of distribution.

Default is EMCS_NUM_DISTR_SYSTEM.

EMCS_NUM_DISTR_GAUSSIAN

EMCS_NUM_DISTR_GAUSSIAN

Models all numeric attributes using Gaussian distribution.

EMCS_NUM_DISTR_SYSTEM

EMCS_NUM_DISTR_SYSTEM

When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

EMCS_NUM_EQUIWIDTH_BINS

TO_CHAR( 1 <numeric_expr <=255)

TO_CHAR( 1 <numeric_expr <=255)

Number of equi-width bins that will be used for gathering cluster statistics for numeric columns.

Default is 11.

EMCS_NUM_PROJECTIONS

TO_CHAR( numeric_expr >=1)

TO_CHAR( numeric_expr >=1)

Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns.

Default is 50.

EMCS_NUM_QUANTILE_BINS

TO_CHAR( 1 <numeric_expr <=255)

TO_CHAR( 1 <numeric_expr <=255)

Specifies the number of quantile bins that will be used for modeling numeric columns with multivalued Bernoulli distributions.

Default is system-determined.

EMCS_NUM_TOPN_BINS

TO_CHAR( 1 <numeric_expr <=255)

TO_CHAR( 1 <numeric_expr <=255)

Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

Default is system-determined.

Table 52-13 Expectation Maximization Settings for Learning

Setting Name Constant Value String Value Equivalent Description

EMCS_CONVERGENCE_CRITERION

EMCS_CONV_CRIT_HELDASIDE

EMCS_CONV_CRIT_HELDASIDE

The convergence criterion for EM. The convergence criterion may be based on a held-aside data set, or it may be Bayesian Information Criterion.

EMCS_CONV_CRIT_HELDASIDE: Uses a held-aside data set for convergence criterion.

Default is system determined.

EMCS_CONV_CRIT_BIC

EMCS_CONV_CRIT_BIC

Uses the Bayesian Information Criterion (BIC) for convergence.

EMCS_LOGLIKE_IMPROVEMENT

TO_CHAR( 0 < numeric_expr < 1)

TO_CHAR( 0 < numeric_expr < 1)

When the convergence criterion is based on a held-aside data set (EMCS_CONVERGENCE_CRITERION = EMCS_CONV_CRIT_HELDASIDE), this setting specifies the percentage improvement in the value of the log likelihood function that is required for adding a new component to the model.

Default value is 0.001.

EMCS_NUM_COMPONENTS

TO_CHAR( numeric_expr >=1)

TO_CHAR( numeric_expr >=1)

Maximum number of components in the model. If model search is enabled, the algorithm automatically determines the number of components based on improvements in the likelihood function or based on regularization, up to the specified maximum.

The number of components must be greater than or equal to the number of clusters.

Default is 20.

EMCS_NUM_ITERATIONS

TO_CHAR( numeric_expr >=1)

TO_CHAR( numeric_expr >=1)

Specifies the maximum number of iterations in the EM algorithm.

Default is 100.

EMCS_MODEL_SEARCH

EMCS_MODEL_SEARCH_ENABLE

EMCS_MODEL_SEARCH_ENABLE

This setting enables model search in EM where different model sizes are explored and a best size is selected.

The default is EMCS_MODEL_SEARCH_DISABLE.

EMCS_MODEL_SEARCH_DISABLE (default).

EMCS_MODEL_SEARCH_DISABLE (default).

Deactivates model search.

EMCS_REMOVE_COMPONENTS

EMCS_REMOVE_COMPS_ENABLE (default)

EMCS_REMOVE_COMPS_ENABLE (default)

This setting allows the EM algorithm to remove a small component from the solution.

The default is EMCS_REMOVE_COMPS_ENABLE.

EMCS_REMOVE_COMPS_DISABLE

EMCS_REMOVE_COMPS_DISABLE

Prevents the EM algorithm from removing small components.

EMCS_RANDOM_SEED

Non-negative integer

Non-negative integer

This setting controls the seed of the random generator used in EM. The default is 0.

Table 52-14 Expectation Maximization Settings for Component Clustering

Setting Name Constant Value String Value Equivalent Description

EMCS_CLUSTER_COMPONENTS

EMCS_CLUSTER_COMP_ENABLE

EMCS_CLUSTER_COMP_ENABLE

Enables or disables the grouping of EM components into high-level clusters. When disabled, the components themselves are treated as clusters.

When component clustering is enabled, model scoring through the SQL CLUSTER function will produce assignments to the higher level clusters.

Default is EMCS_CLUSTER_COMP_ENABLE.

EMCS_CLUSTER_COMP_DISABLE

EMCS_CLUSTER_COMP_DISABLE

When clustering is disabled, the CLUSTER function will produce assignments to the original components.

EMCS_CLUSTER_THRESH

TO_CHAR( numeric_expr >=1)

TO_CHAR( numeric_expr >=1)

Dissimilarity threshold that controls the clustering of EM components. When the dissimilarity measure is less than the threshold, the components are combined into a single cluster.

A lower threshold may produce more clusters that are more compact. A higher threshold may produce fewer clusters that are more spread out.

Default is 2.

EMCS_LINKAGE_FUNCTION

EMCS_LINKAGE_SINGLE

EMCS_LINKAGE_SINGLE

Allows the specification of a linkage function for the agglomerative clustering step.

EMCS_LINKAGE_SINGLE uses the nearest distance within the branch. The clusters tend to be larger and have arbitrary shapes.

Default is EMCS_LINKAGE_SINGLE.

EMCS_LINKAGE_AVERAGE

EMCS_LINKAGE_AVERAGE

EMCS_LINKAGE_AVERAGE uses the average distance within the branch. There is less chaining effect and the clusters are more compact.

EMCS_LINKAGE_COMPLETE

EMCS_LINKAGE_COMPLETE

EMCS_LINKAGE_COMPLETE uses the maximum distance within the branch. The clusters are smaller and require strong component overlap.

Table 52-15 Expectation Maximization Settings for Cluster Statistics

Setting Name Constant Value String Value Equivalents Description

EMCS_CLUSTER_STATISTICS

EMCS_CLUS_STATS_ENABLE

EMCS_CLUS_STATS_ENABLE

Enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules).

Default is EMCS_CLUS_STATS_ENABLE.

EMCS_CLUS_STATS_DISABLE

EMCS_CLUS_STATS_DISABLE

When statistics are disabled, model size is reduced, and GET_MODEL_DETAILS_EM only returns taxonomy (hierarchy) and cluster counts.

EMCS_MIN_PCT_ATTR_SUPPORT

TO_CHAR( 0 < numeric_expr < 1)

TO_CHAR( 0 < numeric_expr < 1)

Minimum support required for including an attribute in the cluster rule. The support is the percentage of the data rows assigned to a cluster that must have non-null values for the attribute.

Default is 0.1.
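
For example, the following sketch builds an Expectation Maximization clustering model that requests at most five clusters from at most twenty components and fixes the random seed for repeatability. The CUSTOMER_PROFILES table and CUST_ID column are illustrative:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME')           := 'ALGO_EXPECTATION_MAXIMIZATION';
  v_setlist('CLUS_NUM_CLUSTERS')   := '5';
  v_setlist('EMCS_NUM_COMPONENTS') := '20';   -- must be >= the number of clusters
  v_setlist('EMCS_RANDOM_SEED')    := '7';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'EM_SEGMENT_MODEL',
    mining_function     => 'CLUSTERING',
    data_query          => 'SELECT * FROM customer_profiles',
    set_list            => v_setlist,
    case_id_column_name => 'cust_id');
END;
/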

52.5.5 DBMS_DATA_MINING — Algorithm Settings: Explicit Semantic Analysis

Explicit Semantic Analysis (ESA) is a useful technique for extracting meaningful and interpretable features.

The settings listed in the following table configure the ESA values.

Table 52-16 Explicit Semantic Analysis Settings

Setting Name Constant Value String Value Equivalent Description

ESAS_VALUE_THRESHOLD

Non-negative number

Non-negative number

This setting thresholds a small value for attribute weights in the transformed build data. The default is 1e-8.

ESAS_MIN_ITEMS

Text input 100

Non-text input is 0

Text input 100

Non-text input is 0

This setting determines the minimum number of non-zero entries that need to be present in an input row. The default is 100 for text input and 0 for non-text input.

ESAS_TOPN_FEATURES

A positive integer

A positive integer

This setting controls the maximum number of features per attribute. The default is 1000.

See Also:

Oracle Machine Learning for SQL Concepts for information about ESA.
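
For example, the following sketch builds an ESA feature extraction model over a text column. It assumes that an Oracle Text policy named my_txt_policy has already been created with CTX_DDL.CREATE_POLICY and that SUPPORT_TICKETS.DOC_TEXT is a CLOB column; all of these names are placeholders:

DECLARE
  v_setlist DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlist('ALGO_NAME')              := 'ALGO_EXPLICIT_SEMANTIC_ANALYS';
  v_setlist('ODMS_TEXT_POLICY_NAME')  := 'my_txt_policy';   -- placeholder Oracle Text policy
  v_setlist('ESAS_TOPN_FEATURES')     := '500';
  v_setlist('ODMS_TEXT_MAX_FEATURES') := '100000';
  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'ESA_TEXT_MODEL',
    mining_function     => 'FEATURE_EXTRACTION',
    data_query          => 'SELECT doc_id, doc_text FROM support_tickets',
    set_list            => v_setlist,
    case_id_column_name => 'doc_id');
END;
/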

52.5.6 DBMS_DATA_MINING — Algorithm Settings: Exponential Smoothing

These settings configure the behavior of the Exponential Smoothing (ESM) algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.EXSM_SIMPLE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'EXSM_SIMPLE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.
The settings listed in the following table configure Exponential Smoothing values.

Table 52-17 Exponential Smoothing Settings

Setting Name Constant Value String Value Equivalent Description
EXSM_MODEL

EXSM_SIMPLE

EXSM_SIMPLE

This setting specifies the model.

EXSM_SIMPLE: Forecasts data as a weighted moving average, with the influence of past observations declining exponentially with the length of time since the observation occurred. Errors in estimation are assumed to be normally distributed, with constant mean and variance. It is appropriate for data with no clear trend or seasonal pattern.

The default value is EXSM_SIMPLE.

EXSM_SIMPLE_MULT EXSM_SIMPLE_MULT_ERR Forecasts data as a weighted moving average, with the influence of past observations declining exponentially with the length of time since the observation occurred. Errors in estimation are assumed to be proportional to the level of the prior estimate.
EXSM_HOLT EXSM_HOLT Applies Holt's linear exponential smoothing method, designed to forecast data with an underlying linear trend.
EXSM_HOLT_DMP EXSM_HOLT_DAMPED Applies Holt's linear exponential smoothing with a damping factor to progressively reduce the strength of the trend over time.
EXSM_MUL_TRND EXSM_MULT_TREND Applies an exponential smoothing framework with a multiplicative trend component, effectively capturing data where trends are not linear but grow or decay over time.
EXSM_MULTRD_DMP EXSM_MULT_TREND_DAMPED Applies an exponential smoothing algorithm with a multiplicative trend that diminishes over time, providing a conservative approach to trend estimation.
EXSM_SEAS_ADD EXSM_SEASON_ADD Applies an exponential smoothing with an additive seasonal component, isolating and accounting for seasonal variations without incorporating a trend.
EXSM_SEAS_MUL EXSM_SEASON_MUL Executes exponential smoothing with a multiplicative seasonal component, capturing seasonal effects that increase or decrease in proportion to the level of the series.
EXSM_HW EXSM_WINTERS Applies the Holt-Winters method with additive trends and multiplicative seasonality, offering a robust model for data with both linear trend and proportional seasonal variation.
EXSM_HW_DMP EXSM_WINTERS_DAMPED Applies the Holt-Winters method with a damped trend and multiplicative seasonality, moderating the linear trend over time while still capturing proportional seasonal changes.
EXSM_HW_ADDSEA EXSM_ADDWINTERS Applies the Holt-Winters additive model to simultaneously smooth data with linear trends and additive seasonal effects.
EXSM_DHW_ADDSEA EXSM_ADDWINTERS_DAMPED Applies the Holt-Winters additive approach with a damping mechanism, reducing the impact of the trend and seasonal components over time.
EXSM_HWMT EXSM_WINTERS_MULT_TREND Applies the Holt-Winters model with both trend and seasonality components being multiplicative, suited for series where the seasonal variations and trends are both increasing or decreasing proportional to level.
EXSM_HWMT_DMP EXSM_WINTERS_MUL_TREND_DMP Applies the Holt-Winters model with a damped multiplicative trend, effectively moderating the exponential increase or decrease of both trend and seasonal components over time.
EXSM_SEASONALITY positive integer > 1 positive integer > 1

This setting specifies a positive integer value as the length of seasonal cycle.

The value it takes must be larger than 1. For example, setting value 4 means that every group of four observations forms a seasonal cycle.

This setting is only applicable to, and must be provided for, models with seasonality; otherwise the model throws an error.

When EXSM_INTERVAL is not set, this setting applies to the original input time series. When EXSM_INTERVAL is set, this setting applies to the accumulated time series.

EXSM_INTERVAL

EXSM_INTERVAL_YEAR

EXSM_INTERVAL_YEAR

This setting only applies and must be provided when the time column (case_id column) has datetime type. It specifies the spacing interval of the accumulated equally spaced time series.

The model throws an error if the time column of input table is of datetime type and setting EXSM_INTERVAL is not provided.

EXSM_INTERVAL_YEAR: This option sets the spacing interval of the accumulated time series to one year. When selected, the data is aggregated or summarized on a yearly basis.

EXSM_INTERVAL_QTR EXSM_INTERVAL_QTR This option sets the spacing interval to a quarter, aggregating the data for every three months.
EXSM_INTERVAL_MONTH EXSM_INTERVAL_MONTH This option adjusts the spacing interval to one month. The accumulated time series represent aggregated or summarized data for each month.
EXSM_INTERVAL_WEEK EXSM_INTERVAL_WEEK With this option data is aggregated or summarized on a weekly basis, setting the spacing interval to one week.
EXSM_INTERVAL_DAY EXSM_INTERVAL_DAY This option adjusts the spacing interval to one day. It's suitable for scenarios where daily aggregated insights are required.
EXSM_INTERVAL_HOUR EXSM_INTERVAL_HOUR For more granular insights, this option sets the spacing interval to one hour. It's especially useful when analyzing data that changes significantly within a day.
EXSM_INTERVAL_MIN EXSM_INTERVAL_MINUTE With this option the spacing is set to one minute. This provides a very detailed view of data, suitable for applications like high-frequency trading or real-time monitoring systems.
EXSM_INTERVAL_SEC EXSM_INTERVAL_SECOND For the most granular detail, this option sets the spacing interval to one second. It is tailored for scenarios requiring real-time or near-real-time analysis.
EXSM_INITVL_OPTIMIZE

EXSM_INITVL_OPTIMIZE_ENABLE

EXSM_INITVL_OPTIMIZE_ENABLE

The setting EXSM_INITVL_OPTIMIZE determines whether initial values are optimized during model build.

The default value is EXSM_INITVL_OPTIMIZE_ENABLE.

EXSM_INITVL_OPTIMIZE_DISABLE

EXSM_INITVL_OPTIMIZE_DISABLE

Note:

EXSM_INITVL_OPTIMIZE can only be set to EXSM_INITVL_OPTIMIZE_DISABLE if the user has set EXSM_MODEL to EXSM_HW or EXSM_HW_ADDSEA. If EXSM_MODEL is set to another model type or is not specified, you get an error 40213 (conflicting settings) and the model is not built.
EXSM_ACCUMULATE

EXSM_ACCU_TOTAL

EXSM_ACCU_TOTAL

This setting only applies and must be provided when the time column has datetime type. It specifies how to generate the value of the accumulated time series from the input time series.

EXSM_ACCU_TOTAL: This option calculates the total sum of the time series values within a specified interval. When selected, it will aggregate the data by summing up all the individual values in the datetime range.

The default value is EXSM_ACCU_TOTAL.

EXSM_ACCU_STD

EXSM_ACCU_STD

This option computes the standard deviation of the time series values within a specified interval. It helps you understand the amount of variation or dispersion in your data.

EXSM_ACCU_MAX

EXSM_ACCU_MAX

By selecting this option, the maximum value of the time series within a specified interval will be determined. It helps in identifying the peak value in the given range.

EXSM_ACCU_MIN

EXSM_ACCU_MIN

This option focuses on determining the minimum value of the time series within a specified interval. It is useful for identifying the lowest value in the time series for the given datetime range.

EXSM_ACCU_AVG

EXSM_ACCU_AVG

This specifies the average value of your time series within a specified interval. It calculates the mean value of all data points in the specified range.

EXSM_ACCU_MEDIAN

EXSM_ACCU_MEDIAN

This option provides the median of the time series values within the given interval. The median gives a central value, which can be especially useful if your data contains outliers.

EXSM_ACCU_COUNT

EXSM_ACCU_COUNT

This option counts the number of time series values within the specified interval. It is helpful if you want to know how many data points are present in a certain datetime range.

EXSM_SETMISSING

Specify an option:

EXSM_MISS_MIN

EXSM_MISS_MIN

This setting specifies how to handle missing values, which may come from input data and/or the accumulation process of time series. You can specify either a number or an option. If a number is specified, all the missing values are set to that number.

EXSM_MISS_MIN: Replaces missing value with minimum of the accumulated time series.

EXSM_MISS_MAX

EXSM_MISS_MAX

Replaces missing value with maximum of the accumulated time series.

EXSM_MISS_AVG

EXSM_MISS_AVG

Replaces missing value with average of the accumulated time series.

EXSM_MISS_MEDIAN

EXSM_MISS_MEDIAN

Replaces missing value with median of the accumulated time series.

EXSM_MISS_LAST

EXSM_MISS_LAST

Replaces missing value with last non-missing value of the accumulated time series.

EXSM_MISS_FIRST

EXSM_MISS_FIRST

Replaces missing value with first non-missing value of the accumulated time series.

EXSM_MISS_PREV

EXSM_MISS_PREV

Replaces missing value with the previous non-missing value of the accumulated time series.

EXSM_MISS_NEXT

EXSM_MISS_NEXT

Replaces missing value with the next non-missing value of the accumulated time series.

EXSM_MISS_AUTO

EXSM_MISS_AUTO

EXSM model treats the input data as an irregular (non-uniformly spaced) time series.

If this setting is not provided, EXSM_MISS_AUTO is the default value. In that case, the model treats the input time series as an irregular time series, viewing missing values as gaps.

EXSM_PREDICTION_STEP It must be set to a number between 1 and 30. It must be set to a number between 1 and 30.

This setting specifies how many steps ahead the predictions are to be made.

If it is not set, the default value is 1: the model gives one-step-ahead prediction. A value greater than 30 results in an error.

EXSM_CONFIDENCE_LEVEL It must be a number between 0 and 1, exclusive. It must be a number between 0 and 1, exclusive.

This setting specifies the desired confidence level for prediction.

The lower and upper bounds of the specified confidence interval are reported. If this setting is not specified, the default confidence level is 95%.

EXSM_OPT_CRITERION

EXSM_OPT_CRIT_LIK

EXSM_OPT_CRIT_LIK

This setting specifies the desired optimization criterion. The optimization criterion is useful as a diagnostic for comparing models' fit to the same data.

EXSM_OPT_CRIT_LIK: This represents twice the negative log-likelihood of the given model.

The default value is EXSM_OPT_CRIT_LIK.

EXSM_OPT_CRIT_MSE

EXSM_OPT_CRIT_MSE

This provides the mean squared error pertaining to the model.

EXSM_OPT_CRIT_AMSE

EXSM_OPT_CRIT_AMSE

This denotes the average of the mean squared error over a time window as specified by the user.

EXSM_OPT_CRIT_SIG

EXSM_OPT_CRIT_SIG

This metric captures the standard deviation of the residuals of the model.

EXSM_OPT_CRIT_MAE

EXSM_OPT_CRIT_MAE

This metric conveys the average absolute error associated with the model. It measures the size of the error.

EXSM_NMSE positive integer positive integer This setting specifies the length of the window used in computing the error metric Average Mean Square Error (AMSE).

See Also:

Oracle Machine Learning for SQL Concepts for information about ESM.

https://github.com/oracle-samples/oracle-db-examples/tree/main/machine-learning/sql browse to the release folder and click the oml4sql-time-series-exponential-smoothing.sql example.
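
The following PL/SQL sketch shows how several of these ESM settings might be combined when building a time series model with the package's CREATE_MODEL2 procedure. The table name SALES_DATA, its columns TIME_ID and AMOUNT_SOLD, and the model name are hypothetical placeholders; the setting names and values are taken from the table above.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  -- Exponential Smoothing: quarterly accumulation, seasonal cycle of 4,
  -- Holt-Winters model, and a 4-step-ahead forecast
  v_setlst('ALGO_NAME')            := 'ALGO_EXPONENTIAL_SMOOTHING';
  v_setlst('EXSM_INTERVAL')        := 'EXSM_INTERVAL_QTR';
  v_setlst('EXSM_SEASONALITY')     := '4';
  v_setlst('EXSM_MODEL')           := 'EXSM_WINTERS';
  v_setlst('EXSM_PREDICTION_STEP') := '4';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'ESM_SALES_FORECAST',
    mining_function     => 'TIME_SERIES',
    data_query          => 'SELECT time_id, amount_sold FROM sales_data',
    set_list            => v_setlst,
    case_id_column_name => 'TIME_ID',
    target_column_name  => 'AMOUNT_SOLD');
END;
/

Because EXSM_INTERVAL is set in this sketch, EXSM_SEASONALITY applies to the accumulated (quarterly) series, as described above.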

52.5.7 DBMS_DATA_MINING — Algorithm Settings: Generalized Linear Model

The settings listed in the following table configure the behavior of the Generalized Linear Model algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.GLMS_FTR_GEN_QUADRATIC. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'GLMS_FTR_GEN_QUADRATIC'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.
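
As a quick illustration of the two equivalent forms, the following sketch populates a settings table (using the usual SETTING_NAME and SETTING_VALUE columns) once with the package constant and once with the string literal. Only one of the two forms would be used for a given setting; the table name GLM_SETTINGS is hypothetical.

-- Hypothetical settings table using the standard two-column layout
CREATE TABLE glm_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Form 1: the package constant from the Constant Value column
  INSERT INTO glm_settings (setting_name, setting_value)
  VALUES ('GLMS_FTR_GEN_METHOD', DBMS_DATA_MINING.GLMS_FTR_GEN_QUADRATIC);
END;
/

-- Form 2: the equivalent string value, without the DBMS_DATA_MINING prefix
INSERT INTO glm_settings (setting_name, setting_value)
VALUES ('GLMS_FTR_GEN_METHOD', 'GLMS_FTR_GEN_QUADRATIC');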

Table 52-18 DBMS_DATA_MINING GLM Settings

Setting Name Constant Value String Value Equivalent Description

GLMS_CONF_LEVEL

TO_CHAR(0< numeric_expr <1)

TO_CHAR(0< numeric_expr <1)

The confidence level for coefficient confidence intervals.

The default confidence level is 0.95.

GLMS_FTR_GEN_METHOD

GLMS_FTR_GEN_QUADRATIC

GLMS_FTR_GEN_QUADRATIC

Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.

GLMS_FTR_GEN_QUADRATIC: Generates features using a quadratic method.

GLMS_FTR_GEN_CUBIC

GLMS_FTR_GEN_CUBIC

Generates features using a cubic method.

GLMS_FTR_GENERATION

GLMS_FTR_GENERATION_ENABLE

GLMS_FTR_GENERATION_ENABLE

Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

Note: Feature generation can only be enabled when feature selection is also enabled.

GLMS_FTR_GENERATION_DISABLE

GLMS_FTR_GENERATION_DISABLE

Disables feature generation for GLM (default).

GLMS_FTR_SEL_CRIT

GLMS_FTR_SEL_AIC

GLMS_FTR_SEL_AIC

Feature selection penalty criterion for adding a feature to the model.

When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

GLMS_FTR_SEL_AIC: Uses Akaike Information Criterion for feature selection.

GLMS_FTR_SEL_SBIC

GLMS_FTR_SEL_SBIC

Uses Schwarz Bayesian Information Criterion for feature selection.

GLMS_FTR_SEL_RIC

GLMS_FTR_SEL_RIC

Uses Risk Inflation Criterion for feature selection.

GLMS_FTR_SEL_ALPHA_INV

GLMS_FTR_SEL_ALPHA_INV

Uses Alpha Inverse Criterion for feature selection.

GLMS_FTR_SELECTION

GLMS_FTR_SELECTION_ENABLE

GLMS_FTR_SELECTION_ENABLE

Whether or not feature selection is enabled for GLM.

By default, feature selection is not enabled.

GLMS_FTR_SELECTION_DISABLE

GLMS_FTR_SELECTION_DISABLE

Disables feature selection.

GLMS_MAX_FEATURES

TO_CHAR(0 < numeric_expr <= 2000)

TO_CHAR(0 < numeric_expr <= 2000)

When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model.

By default, the algorithm limits the number of features to ensure sufficient memory.

GLMS_PRUNE_MODEL

GLMS_PRUNE_MODEL_ENABLE

GLMS_PRUNE_MODEL_ENABLE

Enables or disables pruning of features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data.

When feature selection is enabled, the algorithm automatically performs pruning.

GLMS_PRUNE_MODEL_DISABLE

GLMS_PRUNE_MODEL_DISABLE

Disables pruning of features.

GLMS_REFERENCE_CLASS_NAME

target_value

target_value

The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the other class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

GLMS_RIDGE_REGRESSION

GLMS_RIDGE_REG_ENABLE

GLMS_RIDGE_REG_ENABLE

Enable or disable ridge regression. Ridge applies to both regression and classification machine learning techniques.

When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function.

Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If ridge regression and feature selection are both explicitly enabled, then an exception is raised.

GLMS_RIDGE_REG_DISABLE

GLMS_RIDGE_REG_DISABLE

Disables ridge regression.

GLMS_RIDGE_VALUE

TO_CHAR (numeric_expr > 0)

TO_CHAR (numeric_expr > 0)

The value of the ridge parameter. This setting is only used when the algorithm is configured to use ridge regression.

If ridge regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.

GLMS_ROW_DIAGNOSTICS

GLMS_ROW_DIAG_ENABLE

GLMS_ROW_DIAG_ENABLE

Enables or disables row diagnostics.

GLMS_ROW_DIAG_ENABLE: Enables row diagnostics.

GLMS_ROW_DIAG_DISABLE

GLMS_ROW_DIAG_DISABLE

Disables row diagnostics (default).

GLMS_CONV_TOLERANCE

The range is (0, 1) non-inclusive.

The range is (0, 1) non-inclusive.

Convergence Tolerance setting of the GLM algorithm

The default value is system-determined.

GLMS_NUM_ITERATIONS

Positive integer

Positive integer

Maximum number of iterations for the GLM algorithm. The default value is system-determined.

GLMS_BATCH_ROWS

0 or Positive integer

0 or Positive integer

Number of rows in a batch used by the SGD solver. An input of 0 triggers a data-driven batch size estimate.

The default is 2000.

GLMS_SOLVER

GLMS_SOLVER_SGD (Stochastic Gradient Descent)

GLMS_SOLVER_SGD (Stochastic Gradient Descent)

This setting allows the user to choose the GLM solver. The solver cannot be selected if GLMS_FTR_SELECTION setting is enabled.

GLMS_SOLVER_SGD: Optimizes generalized linear models by iteratively updating parameters using a subset of the data to minimize errors.

The default value is system determined.

See Also:

GLM Solvers

GLMS_SOLVER_CHOL (Cholesky)

GLMS_SOLVER_CHOL (Cholesky)

Solves generalized linear models using the Cholesky decomposition method, which provides a stable and efficient solution by factoring the coefficient matrix into a lower triangular matrix and its conjugate transpose.

GLMS_SOLVER_QR

GLMS_SOLVER_QR

Utilizes the QR decomposition technique to solve generalized linear models, ensuring numerical stability and accuracy by decomposing the problem into an orthogonal matrix Q and an upper triangular matrix R.

GLMS_SOLVER_LBFGS_ADMM

GLMS_SOLVER_LBFGS_ADMM

Combines L-BFGS, an approximation of the Broyden-Fletcher-Goldfarb-Shanno optimization algorithm, with ADMM for solving large-scale generalized linear model problems efficiently.

GLMS_SPARSE_SOLVER

GLMS_SPARSE_SOLVER_ENABLE

GLMS_SPARSE_SOLVER_ENABLE

This setting allows the user to use the sparse solver if it is available. The default value is GLMS_SPARSE_SOLVER_DISABLE.

GLMS_SPARSE_SOLVER_DISABLE (default).

GLMS_SPARSE_SOLVER_DISABLE (default).

Disables sparse solver.

See Also:

Oracle Machine Learning for SQL Concepts for information about GLM.
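
For example, the note above that feature generation requires feature selection translates into a settings list such as the following minimal sketch. The table HOUSING_TRAIN, its columns HOUSE_ID and PRICE, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')           := 'ALGO_GENERALIZED_LINEAR_MODEL';
  v_setlst('PREP_AUTO')           := 'ON';
  -- Feature generation is only valid when feature selection is also enabled
  v_setlst('GLMS_FTR_SELECTION')  := 'GLMS_FTR_SELECTION_ENABLE';
  v_setlst('GLMS_FTR_GENERATION') := 'GLMS_FTR_GENERATION_ENABLE';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'GLM_REG_DEMO',
    mining_function     => 'REGRESSION',
    data_query          => 'SELECT * FROM housing_train',
    set_list            => v_setlst,
    case_id_column_name => 'HOUSE_ID',
    target_column_name  => 'PRICE');
END;
/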

52.5.8 DBMS_DATA_MINING — Algorithm Settings: k-Means

The settings listed in the following table configure the behavior of the k-Means algorithm.


The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.KMNS_COSINE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'KMNS_COSINE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-19 k-Means Settings

Setting Name Constant Value String Value Equivalent Description

KMNS_CONV_TOLERANCE

TO_CHAR(0<numeric_expr<1)

TO_CHAR(0<numeric_expr<1)

Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in KMNS_ITERATIONS, is reached.

Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times.

The default Convergence Tolerance is 0.001.

KMNS_DISTANCE

KMNS_COSINE

KMNS_COSINE

Distance function for k-Means.

KMNS_COSINE: Uses cosine distance for k-Means.

The default distance function is KMNS_EUCLIDEAN.

KMNS_EUCLIDEAN

KMNS_EUCLIDEAN

Uses Euclidean distance for k-Means.

KMNS_ITERATIONS

TO_CHAR(positive_numeric_expr)

TO_CHAR(positive_numeric_expr)

Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in KMNS_CONV_TOLERANCE, is satisfied.

The default number of iterations is 20.

KMNS_MIN_PCT_ATTR_SUPPORT

TO_CHAR(0<=numeric_expr<=1)

TO_CHAR(0<=numeric_expr<=1)

Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.

If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules.

The default minimum support is 0.1.

KMNS_NUM_BINS

TO_CHAR(numeric_expr>0)

TO_CHAR(numeric_expr>0)

Number of bins in the attribute histogram produced by k-means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

The default number of histogram bins is 11.

KMNS_SPLIT_CRITERION

KMNS_SIZE

KMNS_SIZE

Split criterion for k-means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time.

When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located.

The default split criterion is KMNS_VARIANCE.

KMNS_VARIANCE

KMNS_VARIANCE

When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster.

KMNS_RANDOM_SEED

Non-negative integer

Non-negative integer

This setting controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value.

The default is 0.

KMNS_DETAILS

KMNS_DETAILS_NONE

KMNS_DETAILS_NONE

This setting determines the level of cluster detail that is computed during the build.

KMNS_DETAILS_NONE: No cluster details are computed. Only the scoring information is persisted.

KMNS_DETAILS_HIERARCHY

KMNS_DETAILS_HIERARCHY

Cluster hierarchy and cluster record counts are computed. This is the default value.

KMNS_DETAILS_ALL

KMNS_DETAILS_ALL

Cluster hierarchy, record counts, descriptive statistics (means, variances, modes, histograms, and rules) are computed.

See Also:

Oracle Machine Learning for SQL Concepts for information about k-Means
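
A minimal sketch of a clustering build that overrides several of these defaults (cosine distance, more iterations, full cluster details) follows. The table CUSTOMERS, its column CUST_ID, and the model name are hypothetical; CLUS_NUM_CLUSTERS is a global clustering setting rather than a k-Means-specific one.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')         := 'ALGO_KMEANS';
  v_setlst('PREP_AUTO')         := 'ON';
  v_setlst('CLUS_NUM_CLUSTERS') := '8';
  v_setlst('KMNS_DISTANCE')     := 'KMNS_COSINE';
  v_setlst('KMNS_ITERATIONS')   := '30';
  v_setlst('KMNS_DETAILS')      := 'KMNS_DETAILS_ALL';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'KM_SEGMENTS',
    mining_function     => 'CLUSTERING',
    data_query          => 'SELECT * FROM customers',
    set_list            => v_setlst,
    case_id_column_name => 'CUST_ID');
END;
/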

52.5.9 DBMS_DATA_MINING - Algorithm Settings: Multivariate State Estimation Technique - Sequential Probability Ratio Test

Settings that configure the training calibration behavior of the Multivariate State Estimation Technique - Sequential Probability Ratio Test algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.MSET_ADB_HEIGHT. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'MSET_ADB_HEIGHT'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-20 MSET-SPRT Settings

Setting Name Setting Value String Value Equivalent Description

MSET_ADB_HEIGHT

A positive double

A positive double

Estimates the band within which signal values normally oscillate.

The default value is 0.05.

MSET_ALERT_COUNT

A positive integer

A positive integer

The number of signals among the last n signals (the alert window) that must have passed the threshold in order to raise an alert. The alert count must be less than or equal to the alert window.

The default value is 5.

MSET_ALERT_WINDOW

A positive integer greater than or equal to MSET_ALERT_COUNT

A positive integer greater than or equal to MSET_ALERT_COUNT

The number of signals to consider in the SPRT hypothesis consolidation logic.

The default value is 5.

MSET_ALPHA_PROB

A positive double between 0 and 1

A positive double between 0 and 1

False Alarm Probability FAP (false positive).

The default is 0.01.

MSET_BETA_PROB

A positive double between 0 and 1

A positive double between 0 and 1

Missed Alarm Probability MAP (false negative).

The default is 0.10.

MSET_HELDASIDE

A positive integer

A positive integer

The approximate number of data rows used for MSET model calibration.

You can use ODMS_RANDOM_SEED to change the held-aside sample.

The default value is 10000.

MSET_MEMORY_VECTORS

A positive integer

A positive integer

The default value is data driven.

MSET_PROJECTION_THRESHOLD

A positive integer >0, <=10000

A positive integer >0, <=10000

Specifies whether to use random projections. When the number of sensors exceeds the setting value, random projections are used. To turn off random projections, set the threshold to a value that is equal to or greater than the number of sensors.

The default value is 500.

MSET_STD_TOLERANCE

A positive integer

A positive integer

The tolerance in standard deviations used in the SPRT calculation.

The default value is 3.
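
The following sketch illustrates how the alerting settings above might be combined. Because Table 52-1 lists MSET-SPRT under anomaly detection and classification, the model is built here as a classification model with a NULL target; the table SENSOR_READINGS, its column READING_TIME, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')         := 'ALGO_MSET_SPRT';
  v_setlst('PREP_AUTO')         := 'ON';
  -- Raise an alert when 3 of the last 5 signals (the alert window) pass the threshold
  v_setlst('MSET_ALERT_COUNT')  := '3';
  v_setlst('MSET_ALERT_WINDOW') := '5';
  v_setlst('MSET_ALPHA_PROB')   := '0.01';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'MSET_SENSOR_MODEL',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM sensor_readings',
    set_list            => v_setlst,
    case_id_column_name => 'READING_TIME',
    target_column_name  => NULL);
END;
/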

52.5.10 DBMS_DATA_MINING — Algorithm Settings: Naive Bayes

The settings listed in the following table configure the behavior of the Naive Bayes algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.NABS_PAIRWISE_THRESHOLD. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'NABS_PAIRWISE_THRESHOLD'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-21 Naive Bayes Settings

Setting Name Setting Value String Value Equivalent Description

NABS_PAIRWISE_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

TO_CHAR(0<= numeric_expr <=1)

Value of pairwise threshold for NB algorithm

Default is 0.

NABS_SINGLETON_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

TO_CHAR(0<= numeric_expr <=1)

Value of singleton threshold for NB algorithm

Default value is 0.

See Also:

Oracle Machine Learning for SQL Concepts for information about Naive Bayes

52.5.11 DBMS_DATA_MINING — Algorithm Settings: Neural Network

The settings listed in the following table configure the behavior of the Neural Network algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.NNET_SOLVER_ADAM. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'NNET_SOLVER_ADAM'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-22 DBMS_DATA_MINING Neural Network Settings

Setting Name Constant Value String Value Equivalents Description

NNET_SOLVER

One of the following strings:

NNET_SOLVER_ADAM

NNET_SOLVER_ADAM

Specifies the method of optimization.

The default value is system determined.

NNET_SOLVER_ADAM: Uses the Adam optimization method.

NNET_SOLVER_LBFGS

NNET_SOLVER_LBFGS

Uses the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) optimization method.

NNET_ACTIVATIONS

One or more of the following strings:

NNET_ACTIVATIONS_ARCTAN

NNET_ACTIVATIONS_ARCTAN

Specifies the activation functions for the hidden layers. You can specify a single activation function, which is then applied to each hidden layer, or you can specify an activation function for each layer individually. Different layers can have different activation functions.

To apply a different activation function to one or more of the layers, you must specify an activation function for each layer. The number of activation functions you specify must be consistent with the NNET_HIDDEN_LAYERS and NNET_NODES_PER_LAYER values.

For example, if you have three hidden layers, you could specify the use of the same activation function for all three layers with the following settings value:

('NNET_ACTIVATIONS', 'NNET_ACTIVATIONS_TANH')

The following settings value specifies a different activation function for each layer:

('NNET_ACTIVATIONS', '''NNET_ACTIVATIONS_TANH'', ''NNET_ACTIVATIONS_LOG_SIG'', ''NNET_ACTIVATIONS_ARCTAN''')

Note:

You specify the different activation functions as strings within a single string. All quotes are single and two single quotes are used to escape a single quote in SQL statements and PL/SQL blocks.

NNET_ACTIVATIONS_ARCTAN: Uses the arctangent activation function.

The default value is NNET_ACTIVATIONS_LOG_SIG.

NNET_ACTIVATIONS_BIPOLAR_SIG

NNET_ACTIVATIONS_BIPOLAR_SIG

Uses the bipolar sigmoid activation function.

NNET_ACTIVATIONS_LINEAR

NNET_ACTIVATIONS_LINEAR

Uses the linear activation function.

NNET_ACTIVATIONS_LOG_SIG

NNET_ACTIVATIONS_LOG_SIG

Uses the logistic sigmoid activation function.

NNET_ACTIVATIONS_RELU

NNET_ACTIVATIONS_RELU

Uses the rectified linear unit activation function.

NNET_ACTIVATIONS_TANH

NNET_ACTIVATIONS_TANH

Uses the hyperbolic tangent activation function.

NNET_HELDASIDE_MAX_FAIL

A positive integer

A positive integer

With NNET_REGULARIZER_HELDASIDE, the training process is stopped early if the network performance on the validation data fails to improve or remains the same for NNET_HELDASIDE_MAX_FAIL epochs in a row.

The default value is 6.

NNET_HELDASIDE_RATIO

0 <= numeric_expr <=1

0 <= numeric_expr <=1

Defines the ratio of data that is held aside for the held-aside method.

The default value is 0.25.

NNET_HIDDEN_LAYERS

A positive integer

A positive integer

Defines the topology by the number of hidden layers.

The default value is 1.

NNET_ITERATIONS

A positive integer

A positive integer

Specifies the maximum number of iterations in the Neural Network algorithm.

For the L-BFGS solver (NNET_SOLVER_LBFGS), the default value is 200.

For the Adam solver (NNET_SOLVER_ADAM), the default value is 10000.

NNET_NODES_PER_LAYER

A positive integer or a list of positive integers

A positive integer or a list of positive integers

Defines the topology by the number of nodes per layer. Different layers can have different numbers of nodes.

To specify the same number of nodes for each layer, you can provide a single value, which is then applied to each layer.

To specify a different number of nodes for one or more layers, provide a list of comma-separated positive integers, one for each layer. For example, '10, 20, 5' for three layers. The setting values must be consistent with the NNET_HIDDEN_LAYERS value.

The default number of nodes per layer is the number of attributes or 50 (if the number of attributes > 50).

NNET_REG_LAMBDA

TO_CHAR(numeric_expr >=0)

TO_CHAR(numeric_expr >=0)

Defines the L2 regularization parameter lambda. This setting cannot be used together with NNET_REGULARIZER_HELDASIDE.

The default value is 1.

NNET_REGULARIZER

One of the following strings:

NNET_REGULARIZER_HELDASIDE

NNET_REGULARIZER_HELDASIDE

Regularization setting for Neural Network algorithm.

NNET_REGULARIZER_HELDASIDE: Uses a held-aside method for regularization. If the total number of training rows is greater than 50000, the default is NNET_REGULARIZER_HELDASIDE.

NNET_REGULARIZER_L2

NNET_REGULARIZER_L2

Applies L2 regularization, which penalizes the sum of squared weights.

NNET_REGULARIZER_NONE

NNET_REGULARIZER_NONE

Disables regularization. If the total number of training rows is less than or equal to 50000, the default is NNET_REGULARIZER_NONE.

NNET_TOLERANCE

TO_CHAR(0< numeric_expr <1)

TO_CHAR(0< numeric_expr <1)

Defines the convergence tolerance setting of the Neural Network algorithm.

The default value is 0.000001.

NNET_WEIGHT_LOWER_BOUND

A real number

A real number

The setting specifies the lower bound of the region where weights are randomly initialized. NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND must be set together. Setting one and not setting the other raises an error. NNET_WEIGHT_LOWER_BOUND must not be greater than NNET_WEIGHT_UPPER_BOUND. The default value is –sqrt(6/(l_nodes+r_nodes)). The value of l_nodes for:
  • input layer dense attributes is (1+number of dense attributes)

  • input layer sparse attributes is number of sparse attributes

  • each hidden layer is (1+number of nodes in that hidden layer)

The value of r_nodes is the number of nodes in the layer that the weight is connecting to.

NNET_WEIGHT_UPPER_BOUND

A real number

A real number

This setting specifies the upper bound of the region where weights are initialized. It should be set in pairs with NNET_WEIGHT_LOWER_BOUND and its value must not be smaller than the value of NNET_WEIGHT_LOWER_BOUND. If not specified, the values of NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND are system determined.

The default value is sqrt(6/(l_nodes+r_nodes)). See NNET_WEIGHT_LOWER_BOUND.

See Also:

Oracle Machine Learning for SQL Concepts for information about Neural Network.
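
The quoting convention described for NNET_ACTIVATIONS above is easiest to see in PL/SQL. The following minimal sketch builds a three-hidden-layer network with a different activation function per layer; the table LOAN_TRAIN, its columns LOAN_ID and DEFAULTED, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')            := 'ALGO_NEURAL_NETWORK';
  v_setlst('PREP_AUTO')            := 'ON';
  v_setlst('NNET_HIDDEN_LAYERS')   := '3';
  v_setlst('NNET_NODES_PER_LAYER') := '10, 20, 5';
  -- One activation function per hidden layer; the inner quotes are doubled
  -- because the whole list is itself a PL/SQL string literal
  v_setlst('NNET_ACTIVATIONS')     :=
    '''NNET_ACTIVATIONS_TANH'', ''NNET_ACTIVATIONS_LOG_SIG'', ''NNET_ACTIVATIONS_ARCTAN''';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'NNET_CLAS_DEMO',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM loan_train',
    set_list            => v_setlst,
    case_id_column_name => 'LOAN_ID',
    target_column_name  => 'DEFAULTED');
END;
/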

52.5.12 DBMS_DATA_MINING — Algorithm Settings: Non-Negative Matrix Factorization

The settings listed in the following table configure the behavior of the Non-negative Matrix Factorization algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.NMFS_NONNEG_SCORING_ENABLE. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'NMFS_NONNEG_SCORING_ENABLE'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

You can query the data dictionary view *_MINING_MODEL_SETTINGS (using the ALL, USER, or DBA prefix) to find the setting values for a model. See Oracle Database Reference for information about *_MINING_MODEL_SETTINGS.
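
For example, a query such as the following returns the settings with which a model was built, including defaults filled in by the algorithm. The model name MY_NMF_MODEL is hypothetical.

SELECT setting_name, setting_value, setting_type
FROM   user_mining_model_settings
WHERE  model_name = 'MY_NMF_MODEL'
ORDER  BY setting_name;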

Table 52-23 NMF Settings

Setting Name Constant Value String Value Equivalent Description

NMFS_CONV_TOLERANCE

TO_CHAR(0< numeric_expr <=0.5)

TO_CHAR(0< numeric_expr <=0.5)

Convergence tolerance for NMF algorithm

Default is 0.05

NMFS_NONNEGATIVE_SCORING

NMFS_NONNEG_SCORING_ENABLE

NMFS_NONNEG_SCORING_ENABLE

Whether negative numbers should be allowed in scoring results.

When set to NMFS_NONNEG_SCORING_ENABLE, negative feature values will be replaced with zeros.

Default is NMFS_NONNEG_SCORING_ENABLE

NMFS_NONNEG_SCORING_DISABLE

NMFS_NONNEG_SCORING_DISABLE

When set to NMFS_NONNEG_SCORING_DISABLE, negative feature values will be allowed.

NMFS_NUM_ITERATIONS

TO_CHAR(1 <= numeric_expr <=500)

TO_CHAR(1 <= numeric_expr <=500)

Number of iterations for NMF algorithm

Default is 50

NMFS_RANDOM_SEED

TO_CHAR(numeric_expr)

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.

See Also:

Oracle Machine Learning for SQL Concepts for information about NMF

52.5.13 DBMS_DATA_MINING — Algorithm Settings: O-Cluster

The settings in the table configure the behavior of the O-Cluster algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.OCLT_SENSITIVITY. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'OCLT_SENSITIVITY'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-24 O-Cluster Settings

Setting Name Constant Value String Value Equivalent Description

OCLT_SENSITIVITY

TO_CHAR(0 <=numeric_expr <=1)

TO_CHAR(0 <=numeric_expr <=1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5.

See Also:

Oracle Machine Learning for SQL Concepts for information about O-Cluster

52.5.14 DBMS_DATA_MINING — Algorithm Settings: Random Forest

These settings configure the behavior of the Random Forest algorithm. Random Forest makes use of the Decision Tree settings to configure the construction of individual trees.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.RFOR_MTRY. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'RFOR_MTRY'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-25 Random Forest Settings

Setting Name Constant Value String Value Equivalent Description

RFOR_MTRY

a number >= 0

a number >= 0

Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns.

RFOR_NUM_TREES

1<= a number <=65535

1<= a number <=65535

Number of trees in the forest

Default is 20.

RFOR_SAMPLING_RATIO

0< a fraction<=1

0< a fraction<=1

Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data.

See Also:

Oracle Machine Learning for SQL Concepts for information about Random Forest

52.5.15 DBMS_DATA_MINING — Algorithm Constants and Settings: Singular Value Decomposition

The following constant affects the behavior of the Singular Value Decomposition algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.SVDS_MAX_NUM_FEATURES. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'SVDS_MAX_NUM_FEATURES'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-26 Singular Value Decomposition Constant

Constant Name Constant Value Description

SVDS_MAX_NUM_FEATURES

2500

The maximum number of features supported by SVD.

The following settings configure the behavior of the Singular Value Decomposition algorithm.

Table 52-27 Singular Value Decomposition Settings

Setting Name Constant Value String Value Equivalent Description

SVDS_U_MATRIX_OUTPUT

SVDS_U_MATRIX_ENABLE

SVDS_U_MATRIX_ENABLE

Specifies whether to persist the U Matrix produced by SVD.

SVDS_U_MATRIX_ENABLE: The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix is persisted only when SVDS_U_MATRIX_OUTPUT is enabled.

When SVDS_U_MATRIX_OUTPUT is enabled, the build data must include a case ID. An exception is raised if no case ID is present.

Default is SVDS_U_MATRIX_DISABLE.

SVDS_U_MATRIX_DISABLE

SVDS_U_MATRIX_DISABLE

Does not persist the U Matrix.

SVDS_SCORING_MODE

SVDS_SCORING_SVD

SVDS_SCORING_SVD

Whether to use SVD or PCA scoring for the model.

When the build data is scored with SVD, the projections will be the same as the U matrix.

Default is SVDS_SCORING_SVD.

SVDS_SCORING_PCA

SVDS_SCORING_PCA

When the build data is scored with PCA, the projections will be the product of the U and S matrices.

SVDS_SOLVER

SVDS_SOLVER_TSSVD

SVDS_SOLVER_TSSVD

This setting indicates the solver to be used for computing SVD of the data. In the case of PCA, the solver setting indicates the type of SVD solver used to compute the PCA for the data. When this setting is not specified the solver type selection is data driven. If the number of attributes is greater than 3240, then the default wide solver is used. Otherwise, the default narrow solver is selected.

The following are the group of solvers:

  • Narrow data solvers: for matrices with up to 11500 attributes (TSEIGEN) or up to 8100 attributes (TSSVD).

  • Wide data solvers: for matrices up to 1 million attributes.

For narrow data solvers:

  • Tall-Skinny SVD using QR computation, TSSVD (SVDS_SOLVER_TSSVD).

  • Tall-Skinny SVD using eigenvalue computation, TSEIGEN (SVDS_SOLVER_TSEIGEN). This is the default solver for narrow data.

For wide data solvers:

  • Stochastic SVD using QR computation, SSVD (SVDS_SOLVER_SSVD). This is the default solver for wide data.

  • Stochastic SVD using eigenvalue computation, STEIGEN (SVDS_SOLVER_STEIGEN).

SVDS_SOLVER_TSEIGEN

SVDS_SOLVER_TSEIGEN

Tall-Skinny SVD using eigenvalue computation for matrices with up to 11500 attributes. This is the default solver for narrow data.

SVDS_SOLVER_SSVD

SVDS_SOLVER_SSVD

Stochastic SVD using QR computation for matrices with up to 1 million attributes. This is the default solver for wide data.

SVDS_SOLVER_STEIGEN

SVDS_SOLVER_STEIGEN

Stochastic SVD using eigenvalue computations for matrices with up to 1 million attributes.

SVDS_TOLERANCE

Range [0, 1]

Range [0, 1]

This setting is used to prune features. It defines the minimum value that the eigenvalue of a feature, as a share of the first eigenvalue, must have in order for the feature not to be pruned. The default value is data driven.

SVDS_RANDOM_SEED

Range [0 - 4,294,967,296]

Range [0 - 4,294,967,296]

The random seed value is used for initializing the sampling matrix used by the Stochastic SVD solver. The default is 0. The SVD Solver must be set to SSVD or STEIGEN.

SVDS_OVER_SAMPLING

Range [1, 5000].

Range [1, 5000].

This setting configures the number of columns in the sampling matrix used by the Stochastic SVD solver. The number of columns in this matrix is equal to the requested number of features plus the oversampling setting. The SVD solver must be set to SSVD or STEIGEN.

SVDS_POWER_ITERATIONS

Range [0, 20].

Range [0, 20].

The power iteration setting improves the accuracy of the SSVD solver. The default is 2. The SVD Solver must be set to SSVD or STEIGEN.
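
The sketch below requests PCA scoring and persists the U matrix; because SVDS_U_MATRIX_OUTPUT is enabled, a case ID column must be supplied in the build data. The table SENSOR_FEATURES, its column SAMPLE_ID, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')            := 'ALGO_SINGULAR_VALUE_DECOMP';
  v_setlst('PREP_AUTO')            := 'ON';
  v_setlst('SVDS_SCORING_MODE')    := 'SVDS_SCORING_PCA';
  -- Persisting the U matrix requires a case ID in the build data
  v_setlst('SVDS_U_MATRIX_OUTPUT') := 'SVDS_U_MATRIX_ENABLE';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'PCA_DEMO',
    mining_function     => 'FEATURE_EXTRACTION',
    data_query          => 'SELECT * FROM sensor_features',
    set_list            => v_setlst,
    case_id_column_name => 'SAMPLE_ID');
END;
/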

52.5.16 DBMS_DATA_MINING — Algorithm Settings: Support Vector Machine

The settings listed in the following table configure the behavior of the Support Vector Machine algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.SVMS_GAUSSIAN. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'SVMS_GAUSSIAN'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.

Table 52-28 SVM Settings

Setting Name Constant Value String Value Equivalent Description

SVMS_COMPLEXITY_FACTOR

TO_CHAR(numeric_expr >0)

TO_CHAR(numeric_expr >0)

Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor.

Value of complexity factor for SVM algorithm (both classification and regression).

Default value estimated from the data by the algorithm.

SVMS_CONV_TOLERANCE

TO_CHAR(numeric_expr >0)

TO_CHAR(numeric_expr >0)

Convergence tolerance for SVM algorithm.

Default is 0.0001.

SVMS_EPSILON

TO_CHAR(numeric_expr >0)

TO_CHAR(numeric_expr >0)

Regularization setting for regression, similar to complexity factor. Epsilon specifies the allowable residuals, or noise, in the data.

Value of epsilon factor for SVM regression.

Default is 0.1.

SVMS_KERNEL_FUNCTION

SVMS_GAUSSIAN

SVMS_GAUSSIAN

Kernel for Support Vector Machine. Linear or Gaussian.

SVMS_GAUSSIAN: Uses the Gaussian kernel for SVM.

The default value is SVMS_LINEAR.

SVMS_LINEAR

SVMS_LINEAR

Uses the Linear kernel for SVM. This is the default option.

SVMS_OUTLIER_RATE

TO_CHAR(0< numeric_expr <1)

TO_CHAR(0< numeric_expr <1)

The desired rate of outliers in the training data. Valid for One-Class SVM models only (anomaly detection).

Default is 0.01.

SVMS_STD_DEV

TO_CHAR(numeric_expr >0)

TO_CHAR(numeric_expr >0)

Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases.

Value of standard deviation for SVM algorithm.

This is applicable only for Gaussian kernel.

Default value estimated from the data by the algorithm.

SVMS_NUM_ITERATIONS

Positive integer

Positive integer

This setting sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver.

SVMS_NUM_PIVOTS

Range [1; 10000]

Range [1; 10000]

This setting sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is 200.

SVMS_BATCH_ROWS

Positive integer

Positive integer

This setting applies to SVM models with linear kernel. This setting sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is 20000.

SVMS_REGULARIZER

SVMS_REGULARIZER_L1

SVMS_REGULARIZER_L1

This setting controls the type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size.

SVMS_REGULARIZER_L1: Uses L1 regularization.

SVMS_REGULARIZER_L2

SVMS_REGULARIZER_L2

Uses L2 regularization.

SVMS_SOLVER

SVMS_SOLVER_SGD (Sub-Gradient Descent)

SVMS_SOLVER_SGD (Sub-Gradient Descent)

Allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined.

SVMS_SOLVER_SGD: Uses Sub-Gradient Descent solver.

SVMS_SOLVER_IPM (Interior Point Method)

SVMS_SOLVER_IPM (Interior Point Method)

Uses Interior Point Method solver.

See Also:

Oracle Machine Learning for SQL Concepts for information about SVM
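
As an illustration, a One-Class SVM model for anomaly detection (a classification model with no target, per Table 52-1) could combine the Gaussian kernel with a custom outlier rate as follows. The table TRANSACTIONS, its column TXN_ID, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')            := 'ALGO_SUPPORT_VECTOR_MACHINES';
  v_setlst('PREP_AUTO')            := 'ON';
  v_setlst('SVMS_KERNEL_FUNCTION') := 'SVMS_GAUSSIAN';
  v_setlst('SVMS_OUTLIER_RATE')    := '0.05';   -- expect about 5% outliers

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'SVM_ANOMALY_DEMO',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM transactions',
    set_list            => v_setlst,
    case_id_column_name => 'TXN_ID',
    target_column_name  => NULL);   -- no target: One-Class (anomaly detection) model
END;
/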

52.5.17 DBMS_DATA_MINING — Algorithm Settings: XGBoost

Settings that configure the behavior of the XGBoost gradient boosting algorithm.

The Constant Value column specifies constants using the prefix DBMS_DATA_MINING. For example, DBMS_DATA_MINING.xgboost_booster. Alternatively, you can specify the corresponding string value from the String Value Equivalent column without the DBMS_DATA_MINING prefix, in single quotes. For example, 'booster'.

Note:

The distinction between Constant Value and String Value Equivalent for this algorithm is applicable to Oracle Database 19c and Oracle Database 21c.
The XGBoost settings are case sensitive. Enter the settings as they appear in the settings table. These settings match the XGBoost settings available in open source. OML4SQL XGBoost is based on the 1.1 version of XGBoost.
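
Because the XGBoost settings follow their open source names, they are typically entered as lowercase strings, as in the following minimal sketch. The table CHURN_TRAIN, its columns CUST_ID and CHURNED, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME') := 'ALGO_XGBOOST';
  v_setlst('PREP_AUTO') := 'ON';
  -- XGBoost setting names are case sensitive; enter them exactly as documented
  v_setlst('booster')   := 'gbtree';
  v_setlst('num_round') := '100';
  v_setlst('max_depth') := '6';
  v_setlst('eta')       := '0.3';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'XGB_CLAS_DEMO',
    mining_function     => 'CLASSIFICATION',
    data_query          => 'SELECT * FROM churn_train',
    set_list            => v_setlst,
    case_id_column_name => 'CUST_ID',
    target_column_name  => 'CHURNED');
END;
/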

Table 52-29 General Settings

Constant Name String Value Equivalent Setting Value Description

xgboost_booster

booster

A string that is one of the following:

dart

gblinear

gbtree

The booster to use:

  • dart
  • gblinear
  • gbtree

The dart and gbtree boosters use tree-based models whereas gblinear uses linear functions.

The default value is gbtree.

xgboost_num_round

num_round

A non-negative integer.

The number of rounds for boosting.

The default value is 10.

Table 52-30 Settings for Tree Boosting

Constant Name String Value Equivalent Setting Value Description

xgboost_alpha

alpha

A non-negative number

L1 regularization term on weights. Increasing this value makes the model more conservative.

The default value is 0.

xgboost_colsample_bylevel

colsample_bylevel

A number in the range (0, 1]

Subsample ratio of columns for each split, in each level. Subsampling occurs each time a new split is made. This parameter has no effect when tree_method is set to hist.

The default value is 1.

xgboost_colsample_bynode

colsample_bynode

A number in the range (0, 1]

The subsample ratio of columns for each node (split). Subsampling occurs once every time a new split is evaluated. Columns are subsampled from the set of columns chosen for the current level.

The default value is 1.

xgboost_colsample_bytree

colsample_bytree

A number in the range (0, 1]

Subsample ratio of columns when constructing each tree. Subsampling occurs once in every boosting iteration.

The default value is 1.

xgboost_eta

eta

A number in the range [0, 1]

Step-size shrinkage used in the update step to prevent overfitting. After each boosting step, eta shrinks the feature weights to make the boosting process more conservative.

The default value is 0.3.

xgboost_gamma

gamma

A number in the range [0, ∞]

Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma value is, the more conservative the algorithm is.

The default value is 0.

xgboost_grow_policy

grow_policy

A string; one of the following:

  • depthwise
  • lossguide

Controls the way new nodes are added to the tree:

  • depthwise splits at nodes closest to the root
  • lossguide splits at nodes with the highest loss change

Valid only if tree_method is set to hist.

The default value is depthwise.

xgboost_lambda

lambda

A non-negative number

L2 regularization term on weights.

The default value is 1.

xgboost_max_bin

max_bin

A non-negative integer

Maximum number of discrete bins to bucket continuous features. Increasing this number improves the optimality of splits at the cost of higher computation time.

This parameter is valid only when tree_method is set to hist.

The default value is 256.

xgboost_max_delta_step

max_delta_step

A number in the range [0, ∞]

Maximum delta step allowed for each leaf output.

Setting this to a positive value can help make the update step more conservative. Usually this parameter is not needed, but it might help in logistic regression when the class is extremely imbalanced. Setting it to a value from 1 to 10 might help control the update.

The default value is 0, which means there is no constraint.

xgboost_max_depth

max_depth

An integer in the range [0, ∞]

Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.

Setting this to 0 indicates no limit.

Note:

You must set a max_depth limit when the grow_policy setting is depthwise.

The default value is 6.

xgboost_max_leaves

max_leaves

A non-negative number

Maximum number of nodes to add.

Use this setting only when grow_policy is set to lossguide.

The default value is 0.

xgboost_min_child_weight

min_child_weight

A number in the range [0, ∞]

Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, then the building process stops partitioning. In a linear regression task, this corresponds to the minimum number of instances needed in each node. The larger min_child_weight is, the more conservative the algorithm is.

The default value is 1.

xgboost_num_parallel_tree

num_parallel_tree

A non-negative integer

Number of parallel trees constructed during each iteration. Use this option to support a boosted random forest.

The default value is 1.

xgboost_scale_pos_weight

scale_pos_weight

A non-negative number

Controls the balance of positive and negative weights, which is useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases).

The default value is 1.

xgboost_sketch_eps

sketch_eps

A number in the range (0, 1]

Increases enumeration accuracy. Valid only for the approximate greedy tree method.

Compared to directly selecting the number of bins, this setting comes with a theoretical guarantee on sketch accuracy. You usually do not need to change this setting, but you might consider setting a lower number for more accurate enumeration.

The default value is 0.03.

xgboost_subsample

subsample

A number in the range (0, 1]

Subsample ratio of the training instances. A setting of 0.5 means that XGBoost randomly samples half of the training data prior to growing trees, which prevents overfitting. Subsampling occurs once in every boosting iteration.

The default value is 1.

xgboost_tree_method

tree_method

A string that is one of the following:

  • approx
  • auto
  • exact
  • hist

Tree construction algorithm used in XGBoost:

  • approx: Approximate greedy algorithm using sketching and histogram.
  • auto: Use a heuristic to choose the faster algorithm:
    • For a small to medium sized data set, uses the exact greedy algorithm.
    • For a very large data set, uses the approximate greedy algorithm.
  • exact: Exact greedy algorithm.
  • hist: Fast histogram optimized approximate greedy algorithm; uses some performance improvements such as bins caching.

The default value is auto.

xgboost_updater

updater

A comma-separated string; one or more of the following:

  • grow_colmaker
  • grow_histmaker
  • grow_local_histmaker
  • grow_skmaker
  • grow_quantile_histmaker
  • prune
  • sync

Defines the sequence of tree updaters to run, which provides a modular way to construct and to modify the trees. This is an advanced parameter that is usually set automatically, depending on some other parameters. However, you can also explicitly specify a setting.

The setting values are:

  • grow_colmaker: Non-distributed column-based construction of trees.
  • grow_histmaker: Distributed tree construction with row-based data splitting based on a global proposal of histogram counting.
  • grow_local_histmaker: Based on local histogram counting.
  • grow_skmaker: Uses the approximate sketching algorithm.
  • grow_quantile_histmaker: Grow tree using quantized histogram.
  • prune: Prunes the splits where loss < min_split_loss (or gamma).
  • sync: Synchronizes trees in all distributed nodes.

Table 52-31 Settings for the Dart Booster

Constant Name String Value Equivalent Setting Value Description

xgboost_one_drop

one_drop

A number that is 0 or 1

When set to 1, at least one tree is always dropped during the dropout. When set to 0, this guarantee does not apply.

The default value is 0.

xgboost_normalize_type

normalize_type

A string; either:

  • forest
  • tree

Type of normalization algorithm:

  • forest: New trees have the same weight as the sum of the dropped trees (forest):
    • The weight of new trees is 1 / (1 + learning_rate)
    • Dropped trees are scaled by a factor of 1 / (1 + learning_rate)
  • tree: New trees have the same weight as dropped trees:
    • The weight of new trees is 1 / (k + learning_rate)
    • Dropped trees are scaled by a factor of k / (k + learning_rate)

The default value is tree.

xgboost_rate_drop

rate_drop

A number in the range [0.0, 1.0]

Dropout rate (a fraction of the previous trees to drop during the dropout).

The default value is 0.0.

xgboost_sample_type

sample_type

A string; either:

  • uniform
  • weighted

Type of sampling algorithm:

  • uniform: Dropped trees are selected uniformly
  • weighted: Dropped trees are selected in proportion to weight

The default value is uniform.

xgboost_skip_drop

skip_drop

A number in the range [0.0, 1.0]

Probability of skipping the dropout procedure during a boosting iteration. If a dropout is skipped, new trees are added in the same manner as gbtree.

A non-zero skip_drop has higher priority than rate_drop or one_drop.

The default value is 0.0.

Table 52-32 Settings for the Linear Booster

Constant Name String Value Equivalent Setting Value Description

xgboost_alpha

alpha

A non-negative number

L1 regularization term on weights, normalized to the number of training examples. Increasing this value makes the model more conservative.

The default value is 0.

xgboost_feature_selector

feature_selector

A string that is one of the following:

  • cyclic
  • greedy
  • random
  • shuffle
  • thrifty

Feature selection and ordering method:

  • cyclic: Deterministic selection by cycling through the features one at a time.
  • greedy: Selects the coordinate with the greatest gradient magnitude. This method:
    • Has O(num_feature^2) complexity
    • Is fully deterministic
    • Allows restricting the selection to the top_k features per group with the largest magnitude of univariate weight change, by setting the top_k parameter; doing so reduces the complexity to O(num_feature*top_k).
  • random: A random (with replacement) coordinate selector.
  • shuffle: Similar to cyclic but with random feature shuffling prior to each update.
  • thrifty: Thrifty, approximately-greedy feature selector. Prior to cyclic updates, reorders features in descending magnitude of their univariate weight changes. This operation is multithreaded and is a linear complexity approximation of the quadratic greedy selection. Restricts the selection per group to the top_k features with the largest magnitude of univariate weight change.

The default value is cyclic.

xgboost_lambda

lambda

A non-negative number

L2 regularization term on weights, normalized to the number of training examples. Increasing this value makes the model more conservative.

The default value is 0.

xgboost_top_k

top_k

A non-negative integer

Number of top features to select for the greedy or thrifty feature selector. The value of 0 uses all of the features.

The default value is 0.

xgboost_updater

updater

A string that is one of the following:

  • coord_descent
  • shotgun

Algorithm to fit the linear model:

  • coord_descent: Ordinary coordinate descent algorithm; multithreaded but still produces a deterministic solution.
  • shotgun: Parallel coordinate descent algorithm based on the shotgun algorithm; uses "hogwild" parallelism and therefore produces a nondeterministic solution on each run.

The default value is shotgun.

Table 52-33 Settings for Tweedie Regression

Constant Name String Value Equivalent Setting Value Description

xgboost_tweedie_variance_power

tweedie_variance_power

A number in the range (1, 2)

Controls the variance of the Tweedie distribution var(y) ~ E(y)^tweedie_variance_power.

A setting closer to 1 shifts towards a Poisson distribution.

A setting closer to 2 shifts towards a gamma distribution.

The default value is 1.5.

Some XGBoost objectives apply only to classification function models and other objectives apply only to regression function models. If you specify an incompatible objective value, an error is raised. In the DBMS_DATA_MINING.CREATE_MODEL procedure, if you specify DBMS_DATA_MINING.CLASSIFICATION as the function, then the only objective values that you can use are the binary and multi values. The one exception is binary:logitraw, which produces a continuous value and applies only to a regression model. If you specify DBMS_DATA_MINING.REGRESSION as the function, then you can specify binary:logitraw or any of the count, rank, reg, and survival values as the objective.
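
For instance, a regression build might pair a regression-compatible objective with an explicit evaluation metric, as in the following minimal sketch. The table HOUSE_PRICES, its columns HOUSE_ID and PRICE, and the model name are hypothetical.

DECLARE
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  v_setlst('ALGO_NAME')   := 'ALGO_XGBOOST';
  v_setlst('PREP_AUTO')   := 'ON';
  -- reg:squarederror is valid only when the mining function is REGRESSION
  v_setlst('objective')   := 'reg:squarederror';
  v_setlst('eval_metric') := 'rmse';
  v_setlst('num_round')   := '100';

  DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'XGB_REG_DEMO',
    mining_function     => 'REGRESSION',
    data_query          => 'SELECT * FROM house_prices',
    set_list            => v_setlst,
    case_id_column_name => 'HOUSE_ID',
    target_column_name  => 'PRICE');
END;
/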

Table 52-34 Settings for Learning Tasks

Constant Name String Value Equivalent Setting Value Description

xgboost_objective

objective

For a classification model, a string that is one of the following:

  • binary:hinge
  • binary:logistic
  • multi:softmax
  • multi:softprob

For a regression model, a string that is one of the following:

  • binary:logitraw
  • count:poisson
  • rank:map
  • rank:ndcg
  • rank:pairwise
  • reg:gamma
  • reg:logistic
  • reg:tweedie
  • survival:cox
  • reg:squarederror
  • reg:squaredlogerror

Settings for a Classification model:

  • binary:hinge: Hinge loss for binary classification. This setting makes predictions of 0 or 1, rather than producing probabilities.
  • binary:logistic: Logistic regression for binary classification. The output is the probability.
  • multi:softmax: Performs multiclass classification using the softmax objective; you must also set num_class(number_of_classes).
  • multi:softprob: Same as softmax, except the output is a vector of ndata * nclass, which can be further reshaped to an ndata * nclass matrix. The result contains the predicted probability of each data point belonging to each class.

The default objective value for classification is multi:softprob.

Settings for a Regression model:

  • binary:logitraw: Logistic regression for binary classification; the output is the score before logistic transformation.
  • count:poisson: Poisson regression for count data; the output is the mean of the Poisson distribution. The max_delta_step value is set to 0.7 by default in Poisson regression to safeguard optimization.
  • rank:map: Using LambdaMART, performs list-wise ranking in which the Mean Average Precision (MAP) is maximized.
  • rank:ndcg: Using LambdaMART, performs list-wise ranking in which the Normalized Discounted Cumulative Gain (NDCG) is maximized.
  • rank:pairwise: Performs ranking by minimizing the pairwise loss.
  • reg:gamma: Gamma regression with log-link; the output is the mean of the gamma distribution. This setting might be useful for any outcome that might be gamma-distributed, such as modeling insurance claims severity.
  • reg:logistic: Logistic regression.
  • reg:tweedie: Tweedie regression with log-link. This setting might be useful for any outcome that might be Tweedie-distributed, such as modeling total loss in insurance.
  • survival:cox: Cox regression for right-censored survival time data (negative values are considered right-censored). Predictions are returned on the hazard ratio scale (that is, as HR = exp(marginal_prediction) in the proportional hazard function h(t) = h0(t) * HR).
  • reg:squarederror: Regression with squared loss.
  • reg:squaredlogerror: Regression with squared log loss. All input labels must be greater than -1.

The default objective value for regression is reg:squarederror.

xgboost_base_score

base_score

A number

Initial prediction score of all instances, global bias.

For a sufficient number of iterations, changing this value does not have much effect.

The default value is 0.5.

xgboost_eval_metric

eval_metric

A comma-separated string; one or more of the following:

  • auc
  • aucpr
  • cox-nloglik
  • error
  • error@t
  • gamma-deviance
  • gamma-nloglik
  • logloss
  • mae
  • map
  • map@n
  • merror
  • mlogloss
  • ndcg
  • ndcg@n
  • poisson-nloglik
  • rmse
  • tweedie-nloglik@rho
  • ndcg-
  • map-
  • rmsle

Evaluation metrics for validation data. You can specify one or more of these evaluation metrics:

  • auc: Area under the curve.
  • aucpr: Area under the PR curve.
  • cox-nloglik: Negative partial log-likelihood for Cox proportional hazards regression.
  • error: Binary classification error rate, calculated as the number of wrong cases divided by the number of all cases. For the predictions, the evaluation regards the instances with a prediction value larger than 0.5 as positive instances, and the others as negative instances.
  • error@t: You can specify a binary classification threshold value other than 0.5 by specifying a numerical value t; for example, error@0.8.
  • gamma-deviance: Residual deviance for gamma regression.
  • gamma-nloglik: Negative log-likelihood for gamma regression.
  • logloss: Negative log-likelihood.
  • mae: Mean absolute error.
  • map: Mean average precision.
  • map@n: Assigns the integer n as the cut-off value for the top positions in the lists for evaluation.
  • merror: Multiclass classification error rate calculated as the number of wrong cases divided by the number of all cases; the objective must be multi:softprob or multi:softmax.
  • mlogloss: Multiclass logloss; the objective must be multi:softprob or multi:softmax.
  • ndcg: Normalized Discounted Cumulative Gain.
  • ndcg@n: Assigns the integer n as the cut-off value for the top positions in the lists for evaluation.
  • poisson-nloglik: Negative log-likelihood for Poisson regression.
  • rmse: Root Mean Square Error.
  • tweedie-nloglik@rho: Negative log-likelihood for Tweedie regression (at a specified value rho of the tweedie_variance_power parameter); rho must be a number in the range (1, 2); for example, tweedie-nloglik@1.8.
  • ndcg- and map-: In XGBoost, NDCG and MAP evaluate the score of a list with no positive samples as 1. Appending “-” to the metric name causes XGBoost to evaluate such lists as 0, which keeps the metric consistent under some conditions.
  • rmsle: Root mean square log error. This is the default metric for the reg:squaredlogerror objective. It reduces the effect of outliers in the data set; however, because the log function is used, rmsle might output nan when a prediction value is less than -1.

A default metric is assigned according to the objective:

  • error for classification
  • mean average precision for ranking
  • rmse for regression

xgboost_seed

seed

A non-negative integer

Random number seed.

The default value is 0.
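
The following minimal sketch shows one way to supply XGBoost settings through a settings table when building a classification model. The table, data, and model names (xgb_settings, mining_data_build_v, cust_id, affinity_card, xgb_clas_model) are hypothetical; the XGBoost algorithm is assumed to be selected with the ALGO_NAME value ALGO_XGBOOST, the settings used are the ones documented above (xgboost_eval_metric, xgboost_seed), and the objective is left at its classification default (multi:softprob).

CREATE TABLE xgb_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Select the XGBoost algorithm and override two of its settings.
  INSERT INTO xgb_settings VALUES (dbms_data_mining.algo_name, 'ALGO_XGBOOST');
  INSERT INTO xgb_settings VALUES ('xgboost_eval_metric', 'auc,logloss');
  INSERT INTO xgb_settings VALUES ('xgboost_seed', '7');
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'xgb_clas_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'xgb_settings');
END;
/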

See Also:

The GitHub repository at https://github.com/oracle/oracle-db-examples/tree/master/machine-learning/sql/20c for an example of XGBoost.

52.6 DBMS_DATA_MINING — Solver Settings

Oracle Machine Learning for SQL algorithms can use different solvers. Solver settings can be provided at build time in the settings table.

52.6.1 DBMS_DATA_MINING - Solver Settings: Adam

These settings configure the behavior of the Adaptive Moment Estimation (Adam) solver.

Neural Network models use these settings.

Table 52-35 DBMS_DATA_MINING Adam Settings

Setting Name Setting Value Description

ADAM_ALPHA

A non-negative double precision floating point number in the interval (0; 1]

The learning rate for Adam.

The default value is 0.001.

ADAM_BATCH_ROWS

A positive integer

The number of rows per batch.

The default value is 10000.

ADAM_BETA1

A positive double precision floating point number in the interval [0; 1)

The exponential decay rate for the 1st moment estimates.

The default value is 0.9.

ADAM_BETA2

A positive double precision floating point number in the interval [0; 1)

The exponential decay rate for the 2nd moment estimates.

The default value is 0.99.

ADAM_GRADIENT_TOLERANCE

A positive double precision floating point number

The gradient infinity norm tolerance for Adam.

The default value is 1E-9.
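
The following minimal sketch shows how Adam settings can be passed in a settings table when building a Neural Network model. The table, data, and model names are hypothetical; whether an additional Neural Network setting is required to select Adam as the solver depends on the algorithm settings for your release, and data-preparation and architecture settings are omitted. The same settings-table pattern applies to the ADMM and L-BFGS settings described in the following sections.

CREATE TABLE nn_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN
  -- Select the Neural Network algorithm and override two Adam solver settings.
  INSERT INTO nn_settings VALUES (dbms_data_mining.algo_name, 'ALGO_NEURAL_NETWORK');
  INSERT INTO nn_settings VALUES ('ADAM_ALPHA', '0.002');      -- learning rate
  INSERT INTO nn_settings VALUES ('ADAM_BATCH_ROWS', '5000');  -- rows per batch
  COMMIT;

  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'nn_clas_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'nn_settings');
END;
/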

52.6.2 DBMS_DATA_MINING — Solver Settings: ADMM

The settings listed in the following table configure the behavior of Alternating Direction Method of Multipliers (ADMM). The Generalized Linear Model (GLM) algorithm uses these settings.

Table 52-36 DBMS_DATA_MINING ADMM Settings

Settings Name Setting Value Description

ADMM_CONSENSUS

A positive number

The ADMM consensus parameter. The value must be a positive number. The default value is 0.1.

ADMM_ITERATIONS

A positive integer

The number of ADMM iterations. The value must be a positive integer. The default value is 50.

ADMM_TOLERANCE

A positive number

The tolerance parameter. The value must be a positive number. The default value is 0.0001.

See Also:

Oracle Machine Learning for SQL Concepts for information about neural networks

52.6.3 DBMS_DATA_MINING — Solver Settings: LBFGS

The settings listed in the following table configure the behavior of L-BFGS. Neural Network and Generalized Linear Model (GLM) use these settings.

Table 52-37 DBMS_DATA_MINING L-BFGS Settings

Setting Name Setting Value Description

LBFGS_GRADIENT_TOLERANCE

TO_CHAR (numeric_expr >0)

Defines the gradient infinity norm tolerance for L-BFGS. The default value is 1E-9.

LBFGS_HISTORY_DEPTH

The value must be a positive integer.

Defines the number of historical copies kept in the L-BFGS solver.

The default value is 20.

LBFGS_SCALE_HESSIAN

LBFGS_SCALE_HESSIAN_ENABLE

LBFGS_SCALE_HESSIAN_DISABLE

Defines whether to scale the Hessian in L-BFGS.

The default value is LBFGS_SCALE_HESSIAN_ENABLE.

See Also:

Oracle Machine Learning for SQL Concepts for information about neural networks

52.7 DBMS_DATA_MINING Datatypes

The DBMS_DATA_MINING package defines object data types for processing transactional data. The package also defines a type for user-specified transformations. These types are called DM_NESTED_n, where n identifies the Oracle data type of the nested attributes.

The Oracle Machine Learning for SQL object data types are described in the following table:

Table 52-38 DBMS_DATA_MINING Summary of Data Types

Datatype Description

DM_NESTED_BINARY_DOUBLE

The name and value of a numerical attribute of type BINARY_DOUBLE.

DM_NESTED_BINARY_DOUBLES

A collection of DM_NESTED_BINARY_DOUBLE.

DM_NESTED_BINARY_FLOAT

The name and value of a numerical attribute of type BINARY_FLOAT.

DM_NESTED_BINARY_FLOATS

A collection of DM_NESTED_BINARY_FLOAT.

DM_NESTED_CATEGORICAL

The name and value of a categorical attribute of type CHAR, VARCHAR, or VARCHAR2.

DM_NESTED_CATEGORICALS

A collection of DM_NESTED_CATEGORICAL.

DM_NESTED_NUMERICAL

The name and value of a numerical attribute of type NUMBER or FLOAT.

DM_NESTED_NUMERICALS

A collection of DM_NESTED_NUMERICAL.

ORA_MINING_VARCHAR2_NT

A table of VARCHAR2(4000).

TRANSFORM_LIST

A list of user-specified transformations for a model. Accepted as a parameter by the CREATE_MODEL Procedure.

This collection type is defined in the DBMS_DATA_MINING_TRANSFORM package.

For more information about processing nested data, see Oracle Machine Learning for SQL User’s Guide.
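
As an illustration of how the nested types are used, the following minimal sketch pivots transactional data into a nested column of type DM_NESTED_NUMERICALS that can serve as a model attribute. The source table and column names (sales_trans, cust_id, prod_name, amount_sold) are hypothetical.

CREATE OR REPLACE VIEW sales_trans_nested_v AS
SELECT cust_id,
       -- Each (prod_name, amount_sold) pair becomes one DM_NESTED_NUMERICAL
       -- attribute; COLLECT gathers them into a DM_NESTED_NUMERICALS column.
       CAST(COLLECT(DM_NESTED_NUMERICAL(prod_name, amount_sold))
            AS DM_NESTED_NUMERICALS) AS purchases
  FROM sales_trans
 GROUP BY cust_id;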

Note:

Starting from Oracle Database 12c Release 2, *GET_MODEL_DETAILS are deprecated and are replaced with Model Detail Views. See Oracle Machine Learning for SQL User’s Guide.

52.7.1 Deprecated Types

This topic contains tables listing deprecated types.

The DBMS_DATA_MINING package defines object datatypes for storing information about model attributes. Most of these types are returned by the table functions GET_n, where n identifies the type of information to return. These functions take a model name as input and return the requested information as a collection of rows.

For a list of the GET functions, see "Summary of DBMS_DATA_MINING Subprograms".

All the table functions use pipelining, which causes each row of output to be materialized as it is read from model storage, without waiting for the generation of the complete table object. For more information on pipelined, parallel table functions, consult the Oracle Database PL/SQL Language Reference.

Table 52-39 DBMS_DATA_MINING Summary of Deprecated Datatypes

Datatype Description

DM_CENTROID

The centroid of a cluster.

DM_CENTROIDS

A collection of DM_CENTROID. A member of DM_CLUSTER.

DM_CHILD

A child node of a cluster.

DM_CHILDREN

A collection of DM_CHILD. A member of DM_CLUSTER.

DM_CLUSTER

A cluster. A cluster includes DM_PREDICATES, DM_CHILDREN, DM_CENTROIDS, and DM_HISTOGRAMS. It also includes a DM_RULE.

See also, Table 52-41.

DM_CLUSTERS

A collection of DM_CLUSTER. Returned by GET_MODEL_DETAILS_KM Function, GET_MODEL_DETAILS_OC Function, and GET_MODEL_DETAILS_EM Function.

See also, Table 52-41.

DM_CONDITIONAL

The conditional probability of an attribute in a Naive Bayes model.

DM_CONDITIONALS

A collection of DM_CONDITIONAL. Returned by GET_MODEL_DETAILS_NB Function.

DM_COST_ELEMENT

The actual and predicted values in a cost matrix.

DM_COST_MATRIX

A collection of DM_COST_ELEMENT. Returned by GET_MODEL_COST_MATRIX Function.

DM_EM_COMPONENT

A component of an Expectation Maximization model.

DM_EM_COMPONENT_SET

A collection of DM_EM_COMPONENT. Returned by GET_MODEL_DETAILS_EM_COMP Function.

DM_EM_PROJECTION

A projection of an Expectation Maximization model.

DM_EM_PROJECTION_SET

A collection of DM_EM_PROJECTION. Returned by GET_MODEL_DETAILS_EM_PROJ Function.

DM_GLM_COEFF

The coefficient and associated statistics of an attribute in a Generalized Linear Model.

DM_GLM_COEFF_SET

A collection of DM_GLM_COEFF. Returned by GET_MODEL_DETAILS_GLM Function.

DM_HISTOGRAM_BIN

A histogram associated with a cluster.

DM_HISTOGRAMS

A collection of DM_HISTOGRAM_BIN. A member of DM_CLUSTER.

See also, Table 52-41.

DM_ITEM

An item in an association rule.

DM_ITEMS

A collection of DM_ITEM.

DM_ITEMSET

A collection of DM_ITEMS.

DM_ITEMSETS

A collection of DM_ITEMSET. Returned by GET_FREQUENT_ITEMSETS Function.

DM_MODEL_GLOBAL_DETAIL

High-level statistics about a model.

DM_MODEL_GLOBAL_DETAILS

A collection of DM_MODEL_GLOBAL_DETAIL. Returned by GET_MODEL_DETAILS_GLOBAL Function.

DM_NB_DETAIL

Information about an attribute in a Naive Bayes model.

DM_NB_DETAILS

A collection of DM_NB_DETAIL. Returned by GET_MODEL_DETAILS_NB Function.

DM_NMF_ATTRIBUTE

An attribute in a feature of a Non-Negative Matrix Factorization model.

DM_NMF_ATTRIBUTE_SET

A collection of DM_NMF_ATTRIBUTE. A member of DM_NMF_FEATURE.

DM_NMF_FEATURE

A feature in a Non-Negative Matrix Factorization model.

DM_NMF_FEATURE_SET

A collection of DM_NMF_FEATURE. Returned by GET_MODEL_DETAILS_NMF Function.

DM_PREDICATE

Antecedent and consequent in a rule.

DM_PREDICATES

A collection of DM_PREDICATE. A member of DM_RULE and DM_CLUSTER. Predicates are returned by GET_ASSOCIATION_RULES Function, GET_MODEL_DETAILS_EM Function, GET_MODEL_DETAILS_KM Function, and GET_MODEL_DETAILS_OC Function.

See also, Table 52-41.

DM_RANKED_ATTRIBUTE

An attribute ranked by its importance in an Attribute Importance model.

DM_RANKED_ATTRIBUTES

A collection of DM_RANKED_ATTRIBUTE. Returned by GET_MODEL_DETAILS_AI Function.

DM_RULE

A rule that defines a conditional relationship.

The rule can be one of the association rules returned by GET_ASSOCIATION_RULES Function, or it can be a rule associated with a cluster in the collection of clusters returned by GET_MODEL_DETAILS_KM Function and GET_MODEL_DETAILS_OC Function.

See also, Table 52-41.

DM_RULES

A collection of DM_RULE. Returned by GET_ASSOCIATION_RULES Function.

See also, Table 52-41.

DM_SVD_MATRIX

A factorized matrix S, V, or U returned by a Singular Value Decomposition model.

DM_SVD_MATRIX_SET

A collection of DM_SVD_MATRIX. Returned by GET_MODEL_DETAILS_SVD Function.

DM_SVM_ATTRIBUTE

The name, value, and coefficient of an attribute in a Support Vector Machine model.

DM_SVM_ATTRIBUTE_SET

A collection of DM_SVM_ATTRIBUTE. Returned by GET_MODEL_DETAILS_SVM Function. Also a member of DM_SVM_LINEAR_COEFF.

DM_SVM_LINEAR_COEFF

The linear coefficient of each attribute in a Support Vector Machine model.

DM_SVM_LINEAR_COEFF_SET

A collection of DM_SVM_LINEAR_COEFF. Returned by GET_MODEL_DETAILS_SVM Function for an SVM model built using the linear kernel.

DM_TRANSFORM

The transformation and reverse transformation expressions for an attribute.

DM_TRANSFORMS

A collection of DM_TRANSFORM. Returned by GET_MODEL_TRANSFORMATIONS Function.

Return Values for Clustering Algorithms

The following table describes the DM_CLUSTER return value columns, nested table columns, and rows.

Table 52-40 DM_CLUSTER Return Values for Clustering Algorithms

Return Value Description

DM_CLUSTERS

A set of rows of type DM_CLUSTER. The rows have the following columns:

(id               NUMBER,
 cluster_id       VARCHAR2(4000),
 record_count     NUMBER,
 parent           NUMBER,
 tree_level       NUMBER,
 dispersion       NUMBER,
 split_predicate  DM_PREDICATES,
 child            DM_CHILDREN,
 centroid         DM_CENTROIDS,
 histogram        DM_HISTOGRAMS,
 rule             DM_RULE)
DM_PREDICATE

The antecedent and consequent columns each return nested tables of type DM_PREDICATES. The rows, of type DM_PREDICATE, have the following columns:

          (attribute_name           VARCHAR2(4000),
           attribute_subname        VARCHAR2(4000),
           conditional_operator     CHAR(2)/*=,<>,<,>,<=,>=*/,
           attribute_num_value      NUMBER,
           attribute_str_value      VARCHAR2(4000),
           attribute_support        NUMBER,
           attribute_confidence     NUMBER)

DM_CLUSTER Fields

The following table describes DM_CLUSTER fields.

Table 52-41 DM_CLUSTER Fields

Column Name Description

id

Cluster identifier

cluster_id

The ID of a cluster in the model

record_count

Specifies the number of records

parent

Parent ID

tree_level

Specifies the number of splits from the root

dispersion

A measure used to quantify whether a set of observed occurrences is dispersed compared to a standard statistical model.

split_predicate

The split_predicate column of DM_CLUSTER returns a nested table of type DM_PREDICATES. Each row, of type DM_PREDICATE, has the following columns:

     (attribute_name           VARCHAR2(4000),
      attribute_subname        VARCHAR2(4000),
      conditional_operator     CHAR(2) /*=,<>,<,>,<=,>=*/,
      attribute_num_value      NUMBER,
      attribute_str_value      VARCHAR2(4000),
      attribute_support        NUMBER,
      attribute_confidence     NUMBER)

Note: The Expectation Maximization algorithm uses all the fields except dispersion and split_predicate.

child

The child column of DM_CLUSTER returns a nested table of type DM_CHILDREN. The rows, of type DM_CHILD, have a single column of type NUMBER, which contains the identifiers of each child.

centroid

The centroid column of DM_CLUSTER returns a nested table of type DM_CENTROIDS. The rows, of type DM_CENTROID, have the following columns:

      (attribute_name    VARCHAR2(4000),
       attribute_subname  VARCHAR2(4000),
       mean               NUMBER,
       mode_value         VARCHAR2(4000),
       variance           NUMBER)

histogram

The histogram column of DM_CLUSTER returns a nested table of type DM_HISTOGRAMS. The rows, of type DM_HISTOGRAM_BIN, have the following columns:

    (attribute_name    VARCHAR2(4000),
     attribute_subname  VARCHAR2(4000),
     bin_id             NUMBER,
     lower_bound        NUMBER,
     upper_bound        NUMBER,
     label              VARCHAR2(4000),
     count              NUMBER)

rule

The rule column of DM_CLUSTER returns a single row of type DM_RULE. The columns are:

     (rule_id            INTEGER,
      antecedent         DM_PREDICATES,
      consequent         DM_PREDICATES,
      rule_support       NUMBER,
      rule_confidence    NUMBER,
      rule_lift          NUMBER,
      antecedent_support NUMBER,
      consequent_support NUMBER,
      number_of_items    INTEGER)

Usage Notes

  • The table function pipes out rows of type DM_CLUSTER (see the query sketch after this list). For information on Oracle Machine Learning for SQL data types and piped output from table functions, see "Data Types".

  • For descriptions of predicates (DM_PREDICATE) and rules (DM_RULE), see GET_ASSOCIATION_RULES Function.
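
The following minimal sketch queries the pipelined output of the deprecated GET_MODEL_DETAILS_KM function for a hypothetical k-Means model named km_clus_model; the model detail views are the recommended replacement.

SELECT T.id, T.record_count, T.parent, T.tree_level
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM('km_clus_model')) T
 ORDER BY T.id;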

52.8 Summary of DBMS_DATA_MINING Subprograms

This table summarizes the subprograms included in the DBMS_DATA_MINING package.

The GET_* interfaces are replaced by model views. Oracle recommends that users leverage model detail views instead. For more information, refer to Model Detail Views in Oracle Machine Learning for SQL User’s Guide and Static Data Dictionary Views: ALL_ALL_TABLES to ALL_OUTLINES in Oracle Database Reference.

Table 52-42 DBMS_DATA_MINING Package Subprograms

Subprogram Purpose

ADD_COST_MATRIX Procedure

Adds a cost matrix to a classification model

ADD_PARTITION Procedure

Adds single or multiple partitions in an existing partition model

ALTER_REVERSE_EXPRESSION Procedure

Changes the reverse transformation expression to an expression that you specify

APPLY Procedure

Applies a model to a data set (scores the data)

COMPUTE_CONFUSION_MATRIX Procedure

Computes the confusion matrix for a classification model

COMPUTE_CONFUSION_MATRIX_PART Procedure

Computes the evaluation matrix for partitioned models

COMPUTE_LIFT Procedure

Computes lift for a classification model

COMPUTE_LIFT_PART Procedure

Computes lift for partitioned models

COMPUTE_ROC Procedure

Computes Receiver Operating Characteristic (ROC) for a classification model

COMPUTE_ROC_PART Procedure

Computes Receiver Operating Characteristic (ROC) for a partitioned model

CREATE_MODEL Procedure

Creates a model

CREATE_MODEL2 Procedure

Creates a model without extra persistent stages

Create Model Using Registration Information

Fetches setting information from JSON object

DROP_ALGORITHM Procedure

Drops the registered algorithm information.

DROP_PARTITION Procedure

Drops a single partition

DROP_MODEL Procedure

Drops a model

EXPORT_MODEL Procedure

Exports a model to a dump file

EXPORT_SERMODEL Procedure

Exports a model in a serialized format

FETCH_JSON_SCHEMA Procedure

Fetches and reads JSON schema from all_mining_algorithms view

GET_MODEL_COST_MATRIX Function

Returns the cost matrix for a model

IMPORT_MODEL Procedure

Imports a model into a user schema

IMPORT_SERMODEL Procedure

Imports a serialized model back into the database

JSON Schema for R Extensible Algorithm

Displays flexibility in creating a JSON schema for an R extensible algorithm

REGISTER_ALGORITHM Procedure

Registers a new algorithm

RANK_APPLY Procedure

Ranks the predictions from the APPLY results for a classification model

REMOVE_COST_MATRIX Procedure

Removes a cost matrix from a model

RENAME_MODEL Procedure

Renames a model

Deprecated GET_MODEL_DETAILS

Starting from Oracle Database 12c Release 2, the following GET_MODEL_DETAILS functions are deprecated:

Table 52-43 Deprecated GET_MODEL_DETAILS Functions

Subprogram Purpose

GET_ASSOCIATION_RULES Function

Returns the rules from an association model

GET_FREQUENT_ITEMSETS Function

Returns the frequent itemsets for an association model

GET_MODEL_DETAILS_AI Function

Returns details about an attribute importance model

GET_MODEL_DETAILS_EM Function

Returns details about an Expectation Maximization model

GET_MODEL_DETAILS_EM_COMP Function

Returns details about the parameters of an Expectation Maximization model

GET_MODEL_DETAILS_EM_PROJ Function

Returns details about the projections of an Expectation Maximization model

GET_MODEL_DETAILS_GLM Function

Returns details about a Generalized Linear Model

GET_MODEL_DETAILS_GLOBAL Function

Returns high-level statistics about a model

GET_MODEL_DETAILS_KM Function

Returns details about a k-Means model

GET_MODEL_DETAILS_NB Function

Returns details about a Naive Bayes model

GET_MODEL_DETAILS_NMF Function

Returns details about a Non-Negative Matrix Factorization model

GET_MODEL_DETAILS_OC Function

Returns details about an O-Cluster model

GET_MODEL_SETTINGS Function

Returns the settings used to build the given model

This function is replaced with USER/ALL/DBA_MINING_MODEL_SETTINGS

GET_MODEL_SIGNATURE Function

Returns the list of columns from the build input table

This function is replaced with USER/ALL/DBA_MINING_MODEL_ATTRIBUTES

GET_MODEL_DETAILS_SVD Function

Returns details about a Singular Value Decomposition model

GET_MODEL_DETAILS_SVM Function

Returns details about a Support Vector Machine model with a linear kernel

GET_MODEL_TRANSFORMATIONS Function

Returns the transformations embedded in a model

This function is replaced with USER/ALL/DBA_MINING_MODEL_XFORMS

GET_MODEL_DETAILS_XML Function

Returns details about a Decision Tree model

GET_TRANSFORM_LIST Procedure

Converts between two different transformation specification formats

52.8.1 ADD_COST_MATRIX Procedure

The ADD_COST_MATRIX procedure associates a cost matrix table with a classification model. The cost matrix biases the model by assigning costs or benefits to specific model outcomes.

The cost matrix is stored with the model and taken into account when the model is scored.

You can also specify a cost matrix inline when you invoke an Oracle Machine Learning for SQL function for scoring. To view the default scoring matrix for a model, query the DM$VC prefixed model view; refer to Model Detail View for Classification Algorithm. To remove the default scoring matrix from a model, use the REMOVE_COST_MATRIX procedure. See REMOVE_COST_MATRIX Procedure.

Syntax

DBMS_DATA_MINING.ADD_COST_MATRIX (
       model_name                IN VARCHAR2,
       cost_matrix_table_name    IN VARCHAR2,
       cost_matrix_schema_name   IN VARCHAR2 DEFAULT NULL,
       partition_name            IN VARCHAR2 DEFAULT NULL);

Parameters

Table 52-44 ADD_COST_MATRIX Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is assumed.

cost_matrix_table_name

Name of the cost matrix table (described in Table 52-45).

cost_matrix_schema_name

Schema of the cost matrix table. If no schema is specified, then the current schema is used.

partition_name

Name of the partition in a partitioned model

Usage Notes

  1. If the model is not in your schema, then ADD_COST_MATRIX requires the ALTER ANY MINING MODEL system privilege or the ALTER object privilege for the machine learning model.

  2. The cost matrix table must have the columns shown in Table 52-45.

    Table 52-45 Required Columns in a Cost Matrix Table

    Column Name Data Type

    ACTUAL_TARGET_VALUE

    Valid target data type

    PREDICTED_TARGET_VALUE

    Valid target data type

    COST

    NUMBER, FLOAT, BINARY_DOUBLE, or BINARY_FLOAT

    See Also:

    Oracle Machine Learning for SQL User’s Guide for valid target data types

  3. The types of the actual and predicted target values must be the same as the type of the model target. For example, if the target of the model is BINARY_DOUBLE, then the actual and predicted values must be BINARY_DOUBLE. If the actual and predicted values are CHAR or VARCHAR, then ADD_COST_MATRIX treats them as VARCHAR2 internally.

    If the types do not match, or if the actual or predicted value is not a valid target value, then the ADD_COST_MATRIX procedure raises an error.

    Note:

    If a reverse transformation is associated with the target, then the actual and predicted values must be consistent with the target after the reverse transformation has been applied.

    See “Reverse Transformations and Model Transparency” under the “About Transformation Lists” section in DBMS_DATA_MINING_TRANSFORM Operational Notes for more information.

  4. Since a benefit can be viewed as a negative cost, you can specify a benefit for a given outcome by providing a negative number in the costs column of the cost matrix table.

  5. All classification algorithms can use a cost matrix for scoring. The Decision Tree algorithm can also use a cost matrix at build time. If you want to build a Decision Tree model with a cost matrix, specify the cost matrix table name in the CLAS_COST_TABLE_NAME setting in the settings table for the model. See Table 52-7.

    The cost matrix used to create a Decision Tree model becomes the default scoring matrix for the model. If you want to specify different costs for scoring, use the REMOVE_COST_MATRIX procedure to remove the cost matrix and the ADD_COST_MATRIX procedure to add a new one.

  6. Scoring on a partitioned model is partition-specific. Scoring cost matrices can be added to or removed from an individual partition in a partitioned model. If PARTITION_NAME is NOT NULL, then the model must be a partitioned model. The COST_MATRIX is added to that partition of the partitioned model.

    If the PARTITION_NAME is NULL, but the model is a partitioned model, then the COST_MATRIX table is added to every partition in the model.
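
    For example, the following minimal sketch (hypothetical model and partition names) adds a cost matrix table to a single partition of a partitioned classification model:

    BEGIN
      DBMS_DATA_MINING.ADD_COST_MATRIX (
         model_name              => 'part_clas_model',
         cost_matrix_table_name  => 'costs_nb',
         partition_name          => 'P1');
    END;
    /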

Example

This example creates a cost matrix table called COSTS_NB and adds it to a Naive Bayes model called NB_SH_CLAS_SAMPLE. The model has a binary target: 1 means that the customer responds to a promotion; 0 means that the customer does not respond. The cost matrix assigns a cost of .25 to misclassifications of customers who do not respond and a cost of .75 to misclassifications of customers who do respond. This means that it is three times more costly to misclassify responders than it is to misclassify non-responders.

CREATE TABLE costs_nb (
  actual_target_value           NUMBER,
  predicted_target_value        NUMBER,
  cost                          NUMBER);
INSERT INTO costs_nb values (0, 0, 0);
INSERT INTO costs_nb values (0, 1, .25);
INSERT INTO costs_nb values (1, 0, .75);
INSERT INTO costs_nb values (1, 1, 0);
COMMIT;
 
EXEC dbms_data_mining.add_cost_matrix('nb_sh_clas_sample', 'costs_nb');
 
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
   FROM mining_data_apply_v
   WHERE PREDICTION(nb_sh_clas_sample COST MODEL
      USING cust_marital_status, education, household_size) = 1
   GROUP BY cust_gender
   ORDER BY cust_gender;
   
C        CNT    AVG_AGE
- ---------- ----------
F         72         39
M        555         44

52.8.2 ADD_PARTITION Procedure

The ADD_PARTITION procedure adds a single partition or multiple partitions to an existing partitioned model.

The ADD_PARTITION procedure derives build settings and user-defined expressions from the existing model. The target column must exist in the input data query when adding partitions to a supervised model.

Syntax

DBMS_DATA_MINING.ADD_PARTITION (
      model_name                IN VARCHAR2,
      data_query                IN CLOB,
      add_options               IN VARCHAR2 DEFAULT ERROR);

Parameters

Table 52-46 ADD_PARTITION Procedure Parameters

Parameter Description
model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used.

data_query

An arbitrary SQL statement that provides data to the model build. The user must have privilege to evaluate this query.

add_options

Allows users to control the conditional behavior of ADD for cases where rows in the input dataset conflict with existing partitions in the model. The following are the possible values:

  • REPLACE: Replaces the existing partition for which the conflicting keys are found.

  • ERROR: Terminates the ADD operation without adding any partitions.

  • IGNORE: Eliminates the rows having the conflicting keys.

Note:

For better performance, Oracle recommends using DROP_PARTITION followed by ADD_PARTITION instead of using the REPLACE option.
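
The following minimal sketch adds the rows returned by a query as partitions of a hypothetical partitioned model my_part_model, replacing any existing partitions whose partition keys conflict with the incoming data:

BEGIN
  DBMS_DATA_MINING.ADD_PARTITION (
     model_name  => 'my_part_model',
     data_query  => 'SELECT * FROM new_region_data',
     add_options => 'REPLACE');
END;
/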

52.8.3 ALTER_REVERSE_EXPRESSION Procedure

This procedure replaces a reverse transformation expression with an expression that you specify. If the attribute does not have a reverse expression, the procedure creates one from the specified expression.

You can also use this procedure to customize the output of clustering, feature extraction, and anomaly detection models.

Syntax

DBMS_DATA_MINING.ALTER_REVERSE_EXPRESSION (
         model_name             VARCHAR2,
         expression             CLOB,
         attribute_name         VARCHAR2 DEFAULT NULL,
         attribute_subname      VARCHAR2 DEFAULT NULL);

Parameters

Table 52-47 ALTER_REVERSE_EXPRESSION Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

expression

An expression to replace the reverse transformation associated with the attribute.

attribute_name

Name of the attribute. Specify NULL if you wish to apply the expression to a cluster, feature, or One-Class SVM prediction.

attribute_subname

Name of the nested attribute if attribute_name is a nested column, otherwise NULL.

Usage Notes

  1. For purposes of model transparency, Oracle Machine Learning for SQL provides reverse transformations for transformations that are embedded in a model. Reverse transformations are applied to the attributes returned in model detail views and to the scored target of predictive models.

  2. If you alter the reverse transformation for the target of a model that has a cost matrix, you must specify a transformation expression that has the same type as the actual and predicted values in the cost matrix. Also, the reverse transformation that you specify must result in values that are present in the cost matrix.

    See Also:

    "ADD_COST_MATRIX Procedure" and Oracle Machine Learning for SQL Concepts for information about cost matrixes.

  3. To prevent reverse transformation of an attribute, you can specify NULL for expression.

  4. The reverse transformation expression can contain a reference to a PL/SQL function that returns a valid Oracle data type. For example, you could define a function like the following for a categorical attribute named blood_pressure that has values 'Low', 'Medium' and 'High'.

    CREATE OR REPLACE FUNCTION numx(c CHAR) RETURN NUMBER IS
      BEGIN
        CASE c WHEN 'Low'    THEN RETURN 1;
               WHEN 'Medium' THEN RETURN 2;
               WHEN 'High'   THEN RETURN 3;
               ELSE RETURN NULL;
        END CASE;
      END numx;
    /
    

    Then you could invoke ALTER_REVERSE_EXPRESSION for blood_pressure as follows.

    EXEC dbms_data_mining.alter_reverse_expression(
                 '<model_name>', 'NUMX(blood_pressure)', 'blood_pressure');
    
  5. You can use ALTER_REVERSE_EXPRESSION to label clusters produced by clustering models and features produced by feature extraction.

    You can use ALTER_REVERSE_EXPRESSION to replace the zeros and ones returned by anomaly-detection models. By default, anomaly-detection models label anomalous records with 0 and all other records with 1.

    See Also:

    Oracle Machine Learning for SQL Concepts for information about anomaly detection

Examples

  1. In this example, the target (affinity_card) of the model CLASS_MODEL is manipulated internally as yes or no instead of 1 or 0 but returned as 1s and 0s when scored. The ALTER_REVERSE_EXPRESSION procedure causes the target values to be returned as TRUE or FALSE.

    DECLARE
            v_xlst dbms_data_mining_transform.TRANSFORM_LIST;
      BEGIN
        dbms_data_mining_transform.SET_TRANSFORM(v_xlst,
              'affinity_card', NULL,
              'decode(affinity_card, 1, ''yes'', ''no'')',
              'decode(affinity_card, ''yes'', 1, 0)');
        dbms_data_mining.CREATE_MODEL(
          model_name             => 'CLASS_MODEL',
          mining_function        => dbms_data_mining.classification,
          data_table_name        => 'mining_data_build',
          case_id_column_name    => 'cust_id',
          target_column_name     => 'affinity_card',
          settings_table_name    => NULL,
          data_schema_name       => 'oml_user',
          settings_schema_name   => NULL,
          xform_list             => v_xlst );
      END;
    /
    SELECT cust_income_level, occupation,
               PREDICTION(CLASS_MODEL USING *) predict_response
          FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M'
          ORDER BY cust_income_level;
     
    CUST_INCOME_LEVEL              OCCUPATION                PREDICT_RESPONSE
    ------------------------------ --------------------- --------------------
    A: Below 30,000                Transp.                                  1
    E: 90,000 - 109,999            Transp.                                  1
    E: 90,000 - 109,999            Sales                                    1
    G: 130,000 - 149,999           Handler                                  0
    G: 130,000 - 149,999           Crafts                                   0
    H: 150,000 - 169,999           Prof.                                    1
    J: 190,000 - 249,999           Prof.                                    1
    J: 190,000 - 249,999           Sales                                    1
     
    BEGIN
      dbms_data_mining.ALTER_REVERSE_EXPRESSION (
         model_name      => 'CLASS_MODEL',
         expression      => 'decode(affinity_card, ''yes'', ''TRUE'', ''FALSE'')',
         attribute_name  => 'affinity_card');
    END;
    /
    column predict_response on
    column predict_response format a20
    SELECT cust_income_level, occupation,
                 PREDICTION(CLASS_MODEL USING *) predict_response
          FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M'
          ORDER BY cust_income_level;
     
    CUST_INCOME_LEVEL              OCCUPATION            PREDICT_RESPONSE
    ------------------------------ --------------------- --------------------
    A: Below 30,000                Transp.               TRUE
    E: 90,000 - 109,999            Transp.               TRUE
    E: 90,000 - 109,999            Sales                 TRUE
    G: 130,000 - 149,999           Handler               FALSE
    G: 130,000 - 149,999           Crafts                FALSE
    H: 150,000 - 169,999           Prof.                 TRUE
    J: 190,000 - 249,999           Prof.                 TRUE
    J: 190,000 - 249,999           Sales                 TRUE
    
  2. This example specifies labels for the clusters that result from the sh_clus model. The labels consist of the word "Cluster" and the internal numeric identifier for the cluster.

    BEGIN
      dbms_data_mining.ALTER_REVERSE_EXPRESSION( 'sh_clus', '''Cluster ''||value');
    END;
    /
     
    SELECT cust_id, cluster_id(sh_clus using *) cluster_id
       FROM sh_aprep_num
           WHERE cust_id < 100011
           ORDER by cust_id;
     
    CUST_ID CLUSTER_ID
    ------- ------------------------------------------------
     100001 Cluster 18
     100002 Cluster 14
     100003 Cluster 14
     100004 Cluster 18
     100005 Cluster 19
     100006 Cluster 7
     100007 Cluster 18
     100008 Cluster 14
     100009 Cluster 8
     100010 Cluster 8

52.8.4 APPLY Procedure

The APPLY procedure applies a machine learning model to the data of interest, and generates the results in a table. The APPLY procedure is also referred to as scoring.

For predictive machine learning functions, the APPLY procedure generates predictions in a target column. For descriptive machine learning functions such as Clustering, the APPLY process assigns each case to a cluster with a probability.

In Oracle Machine Learning for SQL, the APPLY procedure is not applicable to Association models and Attribute Importance models.

Note:

Scoring can also be performed directly in SQL using the Oracle Machine Learning for SQL functions.

Syntax

DBMS_DATA_MINING.APPLY (
      model_name           IN VARCHAR2,
      data_table_name      IN VARCHAR2,
      case_id_column_name  IN VARCHAR2,
      result_table_name    IN VARCHAR2,
      data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameters

Table 52-48 APPLY Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used.

data_table_name

Name of table or view containing the data to be scored

case_id_column_name

Name of the case identifier column

result_table_name

Name of the table in which to store apply results

data_schema_name

Name of the schema containing the data to be scored

Usage Notes

  1. The data provided for APPLY must undergo the same preprocessing as the data used to create and test the model. When you use Automatic Data Preparation, the preprocessing required by the algorithm is handled for you by the model: both at build time and apply time. (See "Automatic Data Preparation".)

  2. APPLY creates a table in the user's schema to hold the results. The columns are algorithm-specific.

    The columns in the results table are listed in Table 52-49 through Table 52-53. The case ID column name in the results table will match the case ID column name provided by you. The type of the incoming case ID column is also preserved in APPLY output.

    Note:

    Make sure that the case ID column does not have the same name as one of the columns that will be created by APPLY. For example, when applying a Classification model, the case ID in the scoring data must not be PREDICTION or PROBABILITY (See Table 52-49).

  3. The data type for the PREDICTION, CLUSTER_ID, and FEATURE_ID output columns is influenced by any reverse expression that is embedded in the model by the user. If the user does not provide a reverse expression that alters the scored value type, then the types will conform to the descriptions in the following tables. See "ALTER_REVERSE_EXPRESSION Procedure".

  4. If the model is partitioned, the result_table_name can contain results from different partitions depending on the data from the input data table. An additional column called PARTITION_NAME is added to the result table indicating the partition name that is associated with each row.

    For a non-partitioned model, the behavior does not change.

Classification

The results table for Classification has the columns described in Table 52-49. If the target of the model is categorical, the PREDICTION column will have a VARCHAR2 data type. If the target has a binary type, the PREDICTION column will have the binary type of the target.

Table 52-49 APPLY Results Table for Classification

Column Name Data type

Case ID column name

Type of the case ID

PREDICTION

Type of the target

PROBABILITY

BINARY_DOUBLE

Anomaly Detection

The results table for Anomaly Detection has the columns described in Table 52-50.

Table 52-50 APPLY Results Table for Anomaly Detection

Column Name Data Type

Case ID column name

Type of the case ID

PREDICTION

NUMBER

PROBABILITY

BINARY_DOUBLE

Regression

The results table for Regression has the columns described in Table 52-51.

Table 52-51 APPLY Results Table for Regression

Column Name Data Type

Case ID column name

Type of the case ID

PREDICTION

Type of the target

Clustering

Clustering is an unsupervised machine learning function, and hence there are no targets. The results of an APPLY procedure contain simply the cluster identifier corresponding to a case, and the associated probability. The results table has the columns described in Table 52-52.

Table 52-52 APPLY Results Table for Clustering

Column Name Data Type

Case ID column name

Type of the case ID

CLUSTER_ID

NUMBER

PROBABILITY

BINARY_DOUBLE

Feature Extraction

Feature Extraction is also an unsupervised machine learning function, hence there are no targets. The results of an APPLY procedure will contain simply the feature identifier corresponding to a case, and the associated match quality. The results table has the columns described in Table 52-53.

Table 52-53 APPLY Results Table for Feature Extraction

Column Name Data Type

Case ID column name

Type of the case ID

FEATURE_ID

NUMBER

MATCH_QUALITY

BINARY_DOUBLE

Examples

This example applies the GLM Regression model GLMR_SH_REGR_SAMPLE to the data in the MINING_DATA_APPLY_V view. The APPLY results are written to the table REGRESSION_APPLY_RESULT.

SQL> BEGIN
       DBMS_DATA_MINING.APPLY (
       model_name     => 'glmr_sh_regr_sample',
       data_table_name     => 'mining_data_apply_v',
       case_id_column_name => 'cust_id',
       result_table_name   => 'regression_apply_result');
    END;
    /
 
SQL> SELECT * FROM regression_apply_result WHERE cust_id >  101485;
 
   CUST_ID PREDICTION
---------- ----------
    101486 22.8048824
    101487 25.0261101
    101488 48.6146619
    101489   51.82595
    101490 22.6220714
    101491 61.3856816
    101492 24.1400748
    101493  58.034631
    101494 45.7253149
    101495 26.9763318
    101496 48.1433425
    101497 32.0573434
    101498 49.8965531
    101499  56.270656
    101500 21.1153047

52.8.5 COMPUTE_CONFUSION_MATRIX Procedure

This procedure computes a confusion matrix, stores it in a table in the user's schema, and returns the model accuracy.

A confusion matrix is a test metric for classification models. It compares the predictions generated by the model with the actual target values in a set of test data. The confusion matrix lists the number of times each class was correctly predicted and the number of times it was predicted to be one of the other classes.

COMPUTE_CONFUSION_MATRIX accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.

See Also:

Oracle Machine Learning for SQL Concepts for more details about confusion matrixes and other test metrics for classification

"COMPUTE_LIFT Procedure"

"COMPUTE_ROC Procedure"

Syntax

DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
      accuracy                     OUT NUMBER,
      apply_result_table_name      IN  VARCHAR2,
      target_table_name            IN  VARCHAR2,
      case_id_column_name          IN  VARCHAR2,
      target_column_name           IN  VARCHAR2,
      confusion_matrix_table_name  IN  VARCHAR2,
      score_column_name            IN  VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN  VARCHAR2 DEFAULT 'PROBABILITY',
      cost_matrix_table_name       IN  VARCHAR2 DEFAULT NULL,
      apply_result_schema_name     IN  VARCHAR2 DEFAULT NULL,
      target_schema_name           IN  VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name      IN  VARCHAR2 DEFAULT NULL,
      score_criterion_type         IN  VARCHAR2 DEFAULT 'PROBABILITY');

Parameters

Table 52-54 COMPUTE_CONFUSION_MATRIX Procedure Parameters

Parameter Description

accuracy

Output parameter containing the overall percentage accuracy of the predictions.

apply_result_table_name

Table containing the predictions.

target_table_name

Table containing the known target values from the test data.

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

confusion_matrix_table_name

Table containing the confusion matrix. The table will be created by the procedure in the user's schema.

The columns in the confusion matrix table are described in the Usage Notes.

score_column_name

Column containing the predictions in the apply results table.

The default column name is PREDICTION, which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions.

By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted.

The score_criterion_type parameter indicates whether probabilities or costs will be used for scoring.

The default column name is 'PROBABILITY', which is the default name created by the APPLY procedure (See "APPLY Procedure").

See the Usage Notes for additional information.

cost_matrix_table_name

(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the score_criterion_type parameter is set to 'COST', the costs in this table will be used as the scoring criteria.

The columns in a cost matrix table are described in the Usage Notes.

apply_result_schema_name

Schema of the apply results table.

If null, the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, the user's schema is assumed.

cost_matrix_schema_name

Schema of the cost matrix table, if one is provided.

If null, the user's schema is assumed.

score_criterion_type

Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the score_criterion_column_name parameter.

The default value of score_criterion_type is 'PROBABILITY'. To use costs as the scoring criterion, specify 'COST'.

If score_criterion_type is set to 'COST' but no cost matrix is provided and if there is a scoring cost matrix associated with the model, then the associated costs are used for scoring.

See the Usage Notes and the Examples.

Usage Notes

  • The predictive information you pass to COMPUTE_CONFUSION_MATRIX may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the confusion matrix.

  • Instead of passing a cost matrix to COMPUTE_CONFUSION_MATRIX, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQL PREDICTION_COST function to populate the score criterion column.

  • The predictions that you pass to COMPUTE_CONFUSION_MATRIX are in a table or view specified in apply_result_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • A cost matrix must have the columns described in Table 52-55.

    Table 52-55 Columns in a Cost Matrix

    Column Name Data Type

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    cost

    BINARY_DOUBLE

    See Also:

    Oracle Machine Learning for SQL User’s Guide for valid target data types

    Oracle Machine Learning for SQL Concepts for more information about cost matrixes

  • The confusion matrix created by COMPUTE_CONFUSION_MATRIX has the columns described in Table 52-56.

    Table 52-56 Columns in a Confusion Matrix

    Column Name Data Type

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target is the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    value

    BINARY_DOUBLE

    See Also:

    Oracle Machine Learning for SQL Concepts for more information about confusion matrixes

Examples

These examples use the Naive Bayes model nb_sh_clas_sample.

Compute a Confusion Matrix Based on Probabilities

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
       SELECT cust_id,
              PREDICTION(nb_sh_clas_sample USING *) prediction,
              PREDICTION_PROBABILITY(nb_sh_clas_sample USING *) probability
       FROM mining_data_test_v;

Using probabilities as the scoring criterion, you can compute the confusion matrix as follows.

DECLARE
   v_accuracy    NUMBER;
      BEGIN
        DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
                   accuracy                     => v_accuracy,
                   apply_result_table_name      => 'nb_apply_results',
                   target_table_name            => 'mining_data_test_v',
                   case_id_column_name          => 'cust_id',
                   target_column_name           => 'affinity_card',
                   confusion_matrix_table_name  => 'nb_confusion_matrix',
                   score_column_name            => 'PREDICTION',
                   score_criterion_column_name  => 'PROBABILITY',
                   cost_matrix_table_name       =>  null,
                   apply_result_schema_name     =>  null,
                   target_schema_name           =>  null,
                   cost_matrix_schema_name      =>  null,
                   score_criterion_type         => 'PROBABILITY');
        DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4));
      END;
      /

The confusion matrix and model accuracy are shown as follows.

 **** MODEL ACCURACY ****: .7847

SQL>SELECT * from nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         60
                  0                      0        891
                  1                      1        286
                  0                      1        263

Compute a Confusion Matrix Based on a Cost Matrix Table

The confusion matrix in the previous example shows a high rate of false positives. For 263 cases, the model predicted 1 when the actual value was 0. You could use a cost matrix to minimize this type of error.

The cost matrix table nb_cost_matrix specifies that a false positive is 3 times more costly than a false negative.

SQL> SELECT * from nb_cost_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE       COST
------------------- ---------------------- ----------
                  0                      0          0
                  0                      1        .75
                  1                      0        .25
                  1                      1          0

This statement shows how to generate the predictions using APPLY.

BEGIN
    DBMS_DATA_MINING.APPLY(
          model_name          => 'nb_sh_clas_sample',
          data_table_name     => 'mining_data_test_v',
          case_id_column_name => 'cust_id',
          result_table_name   => 'nb_apply_results');
 END;
/

This statement computes the confusion matrix using the cost matrix table. The score criterion column is named 'PROBABILITY', which is the name generated by APPLY.

DECLARE
  v_accuracy    NUMBER;
     BEGIN
       DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
                accuracy                     => v_accuracy,
                apply_result_table_name      => 'nb_apply_results',
                target_table_name            => 'mining_data_test_v',
                case_id_column_name          => 'cust_id',
                target_column_name           => 'affinity_card',
                confusion_matrix_table_name  => 'nb_confusion_matrix',
                score_column_name            => 'PREDICTION',
                score_criterion_column_name  => 'PROBABILITY',
                cost_matrix_table_name       => 'nb_cost_matrix',
                apply_result_schema_name     => null,
                target_schema_name           => null,
                cost_matrix_schema_name      => null,
                score_criterion_type         => 'COST');
       DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4));
    END;
    /

The resulting confusion matrix shows a decrease in false positives (212 instead of 263).

**** MODEL ACCURACY ****: .798

SQL> SELECT * FROM nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

Compute a Confusion Matrix Based on Embedded Costs

You can use the ADD_COST_MATRIX procedure to embed a cost matrix in a model. The embedded costs can be used instead of probabilities for scoring. This statement adds the previously-defined cost matrix to the model.

BEGIN
  DBMS_DATA_MINING.ADD_COST_MATRIX ('nb_sh_clas_sample', 'nb_cost_matrix');
END;
/

The following statement applies the model to the test data using the embedded costs and stores the results in a table.

CREATE TABLE nb_apply_results AS
         SELECT cust_id,
              PREDICTION(nb_sh_clas_sample COST MODEL USING *) prediction,
              PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost
          FROM mining_data_test_v;

You can compute the confusion matrix using the embedded costs.

DECLARE
   v_accuracy         NUMBER;
   BEGIN
       DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
            accuracy                     => v_accuracy,
            apply_result_table_name      => 'nb_apply_results',
            target_table_name            => 'mining_data_test_v',
            case_id_column_name          => 'cust_id',
            target_column_name           => 'affinity_card',
            confusion_matrix_table_name  => 'nb_confusion_matrix',
            score_column_name            => 'PREDICTION',
            score_criterion_column_name  => 'COST',
            cost_matrix_table_name       => null,
            apply_result_schema_name     => null,
            target_schema_name           => null,
            cost_matrix_schema_name      => null,
            score_criterion_type         => 'COST');
   END;
   /

The results are:

**** MODEL ACCURACY ****: .798

SQL> SELECT * FROM nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

52.8.6 COMPUTE_CONFUSION_MATRIX_PART Procedure

The COMPUTE_CONFUSION_MATRIX_PART procedure computes a confusion matrix, stores it in a table in the user's schema, and returns the model accuracy.

COMPUTE_CONFUSION_MATRIX_PART supports computing evaluation metrics per partition for partitioned models. For non-partitioned models, refer to COMPUTE_CONFUSION_MATRIX Procedure.

A confusion matrix is a test metric for classification models. It compares the predictions generated by the model with the actual target values in a set of test data. The confusion matrix lists the number of times each class was correctly predicted and the number of times it was predicted to be one of the other classes.

COMPUTE_CONFUSION_MATRIX_PART accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.

See Also:

Oracle Machine Learning for SQL Concepts for more details about confusion matrixes and other test metrics for classification

"COMPUTE_LIFT_PART Procedure"

"COMPUTE_ROC_PART Procedure"

Syntax

DBMS_DATA_MINING.compute_confusion_matrix_part(
      accuracy                    OUT DM_NESTED_NUMERICALS,
      apply_result_table_name     IN  VARCHAR2,
      target_table_name           IN  VARCHAR2,
      case_id_column_name         IN  VARCHAR2,
      target_column_name          IN  VARCHAR2,
      confusion_matrix_table_name IN  VARCHAR2,
      score_column_name           IN  VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name IN  VARCHAR2 DEFAULT 'PROBABILITY',
      score_partition_column_name IN  VARCHAR2 DEFAULT 'PARTITION_NAME',
      cost_matrix_table_name      IN  VARCHAR2 DEFAULT NULL,
      apply_result_schema_name    IN  VARCHAR2 DEFAULT NULL,
      target_schema_name          IN  VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name     IN  VARCHAR2 DEFAULT NULL,
      score_criterion_type        IN  VARCHAR2 DEFAULT NULL);

Parameters

Table 52-57 COMPUTE_CONFUSION_MATRIX_PART Procedure Parameters

Parameter Description

accuracy

Output parameter containing the overall percentage accuracy of the predictions

The output argument is changed from NUMBER to DM_NESTED_NUMERICALS.

apply_result_table_name

Table containing the predictions

target_table_name

Table containing the known target values from the test data

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

confusion_matrix_table_name

Table containing the confusion matrix. The table will be created by the procedure in the user's schema.

The columns in the confusion matrix table are described in the Usage Notes.

score_column_name

Column containing the predictions in the apply results table.

The default column name is PREDICTION, which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions.

By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, then the class with the lowest cost is predicted.

The score_criterion_type parameter indicates whether probabilities or costs will be used for scoring.

The default column name is PROBABILITY, which is the default name created by the APPLY procedure (See "APPLY Procedure").

See the Usage Notes for additional information.

score_partition_column_name

(Optional) The column that contains the name of the partition. This column slices the input test results so that evaluation matrices are computed independently for each partition.

cost_matrix_table_name

(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the score_criterion_type parameter is set to COST, the costs in this table are used as the scoring criterion.

The columns in a cost matrix table are described in the Usage Notes.

apply_result_schema_name

Schema of the apply results table.

If null, then the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, then the user's schema is assumed.

cost_matrix_schema_name

Schema of the cost matrix table, if one is provided.

If null, then the user's schema is assumed.

score_criterion_type

Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the score_criterion_column_name parameter.

The default value of score_criterion_type is PROBABILITY. To use costs as the scoring criterion, specify COST.

If score_criterion_type is set to COST but no cost matrix is provided and if there is a scoring cost matrix associated with the model, then the associated costs are used for scoring.

See the Usage Notes and the Examples.

Usage Notes

  • The predictive information you pass to COMPUTE_CONFUSION_MATRIX_PART may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the confusion matrix.

  • Instead of passing a cost matrix to COMPUTE_CONFUSION_MATRIX_PART, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQL PREDICTION_COST function to populate the score criterion column.

  • The predictions that you pass to COMPUTE_CONFUSION_MATRIX_PART are in a table or view specified in apply_result_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • A cost matrix must have the columns described in Table 52-58.

    Table 52-58 Columns in a Cost Matrix

    Column Name Data Type

    actual_target_value

    Type of the target column in the test data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    cost

    BINARY_DOUBLE

    See Also:

    Oracle Machine Learning for SQL User’s Guide for valid target data types

    Oracle Machine Learning for SQL Concepts for more information about cost matrixes

  • The confusion matrix created by COMPUTE_CONFUSION_MATRIX_PART has the columns described in Table 52-59.

    Table 52-59 Columns in a Confusion Matrix Part

    Column Name Data Type

    actual_target_value

    Type of the target column in the test data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target is the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    value

    BINARY_DOUBLE

    See Also:

    Oracle Machine Learning for SQL Concepts for more information about confusion matrixes

Examples

These examples use the Naive Bayes model nb_sh_clas_sample.

Compute a Confusion Matrix Based on Probabilities

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
       SELECT cust_id,
              PREDICTION(nb_sh_clas_sample USING *) prediction,
              PREDICTION_PROBABILITY(nb_sh_clas_sample USING *) probability
       FROM mining_data_test_v;

Using probabilities as the scoring criterion, you can compute the confusion matrix as follows.

DECLARE
   v_accuracy   DM_NESTED_NUMERICALS;
BEGIN
   DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART (
              accuracy                     => v_accuracy,
              apply_result_table_name      => 'nb_apply_results',
              target_table_name            => 'mining_data_test_v',
              case_id_column_name          => 'cust_id',
              target_column_name           => 'affinity_card',
              confusion_matrix_table_name  => 'nb_confusion_matrix',
              score_column_name            => 'PREDICTION',
              score_criterion_column_name  => 'PROBABILITY',
              score_partition_column_name  => 'PARTITION_NAME',
              cost_matrix_table_name       => null,
              apply_result_schema_name     => null,
              target_schema_name           => null,
              cost_matrix_schema_name      => null,
              score_criterion_type         => 'PROBABILITY');
   -- Accuracy is returned per partition in a DM_NESTED_NUMERICALS collection.
   FOR i IN 1..v_accuracy.COUNT LOOP
      DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy(i).value,4));
   END LOOP;
END;
/

The confusion matrix and model accuracy are shown as follows.

 **** MODEL ACCURACY ****: .7847

SELECT * FROM NB_CONFUSION_MATRIX;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         60
                  0                      0        891
                  1                      1        286
                  0                      1        263

Compute a Confusion Matrix Based on a Cost Matrix Table

The confusion matrix in the previous example shows a high rate of false positives. For 263 cases, the model predicted 1 when the actual value was 0. You could use a cost matrix to minimize this type of error.

The cost matrix table nb_cost_matrix specifies that a false positive is 3 times more costly than a false negative.

 SELECT * from NB_COST_MATRIX;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE       COST
------------------- ---------------------- ----------
                  0                      0          0
                  0                      1        .75
                  1                      0        .25
                  1                      1          0
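If you need to build such a cost matrix table yourself, the following is a minimal sketch. The column names and the BINARY_DOUBLE cost type follow Table 52-58; the NUMBER type for the target columns is an assumption matching the numeric affinity_card target used in these examples, and the values reproduce the output shown above.

-- Hypothetical creation of the cost matrix table used in this example.
CREATE TABLE nb_cost_matrix (
  actual_target_value    NUMBER,
  predicted_target_value NUMBER,
  cost                   BINARY_DOUBLE);

INSERT INTO nb_cost_matrix VALUES (0, 0, 0);
INSERT INTO nb_cost_matrix VALUES (0, 1, 0.75);  -- false positive: 3 times more costly
INSERT INTO nb_cost_matrix VALUES (1, 0, 0.25);  -- false negative
INSERT INTO nb_cost_matrix VALUES (1, 1, 0);
COMMIT;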

This statement shows how to generate the predictions using APPLY.

BEGIN
    DBMS_DATA_MINING.APPLY(
          model_name          => 'nb_sh_clas_sample',
          data_table_name     => 'mining_data_test_v',
          case_id_column_name => 'cust_id',
          result_table_name   => 'nb_apply_results');
 END;
/

This statement computes the confusion matrix using the cost matrix table. The score criterion column is named 'PROBABILITY', which is the name generated by APPLY.

DECLARE
   v_accuracy   DM_NESTED_NUMERICALS;
BEGIN
   DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART (
             accuracy                     => v_accuracy,
             apply_result_table_name      => 'nb_apply_results',
             target_table_name            => 'mining_data_test_v',
             case_id_column_name          => 'cust_id',
             target_column_name           => 'affinity_card',
             confusion_matrix_table_name  => 'nb_confusion_matrix',
             score_column_name            => 'PREDICTION',
             score_criterion_column_name  => 'PROBABILITY',
             score_partition_column_name  => 'PARTITION_NAME',
             cost_matrix_table_name       => 'nb_cost_matrix',
             apply_result_schema_name     => null,
             target_schema_name           => null,
             cost_matrix_schema_name      => null,
             score_criterion_type         => 'COST');
   FOR i IN 1..v_accuracy.COUNT LOOP
      DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy(i).value,4));
   END LOOP;
END;
/

The resulting confusion matrix shows a decrease in false positives (212 instead of 263).

**** MODEL ACCURACY ****: .798

 SELECT * FROM NB_CONFUSION_MATRIX;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

Compute a Confusion Matrix Based on Embedded Costs

You can use the ADD_COST_MATRIX procedure to embed a cost matrix in a model. The embedded costs can be used instead of probabilities for scoring. This statement adds the previously defined cost matrix to the model.

BEGIN
  DBMS_DATA_MINING.ADD_COST_MATRIX ('nb_sh_clas_sample', 'nb_cost_matrix');
END;
/

The following statement applies the model to the test data using the embedded costs and stores the results in a table.

CREATE TABLE nb_apply_results AS
         SELECT cust_id,
              PREDICTION(nb_sh_clas_sample COST MODEL USING *) prediction,
              PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost
          FROM mining_data_test_v;

You can compute the confusion matrix using the embedded costs.

DECLARE
   v_accuracy   DM_NESTED_NUMERICALS;
BEGIN
   DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART (
            accuracy                     => v_accuracy,
            apply_result_table_name      => 'nb_apply_results',
            target_table_name            => 'mining_data_test_v',
            case_id_column_name          => 'cust_id',
            target_column_name           => 'affinity_card',
            confusion_matrix_table_name  => 'nb_confusion_matrix',
            score_column_name            => 'PREDICTION',
            score_criterion_column_name  => 'COST',
            score_partition_column_name  => 'PARTITION_NAME',
            cost_matrix_table_name       => null,
            apply_result_schema_name     => null,
            target_schema_name           => null,
            cost_matrix_schema_name      => null,
            score_criterion_type         => 'COST');
   FOR i IN 1..v_accuracy.COUNT LOOP
      DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy(i).value,4));
   END LOOP;
END;
/

The results are:

**** MODEL ACCURACY ****: .798

 SELECT * FROM NB_CONFUSION_MATRIX;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

52.8.7 COMPUTE_LIFT Procedure

This procedure computes lift and stores the results in a table in the user's schema.

Lift is a test metric for binary classification models. To compute lift, one of the target values must be designated as the positive class. COMPUTE_LIFT compares the predictions generated by the model with the actual target values in a set of test data. Lift measures the degree to which the model's predictions of the positive class are an improvement over random chance.

Lift is computed on scoring results that have been ranked by probability (or cost) and divided into quantiles. Each quantile includes the scores for the same number of cases.

COMPUTE_LIFT calculates quantile-based and cumulative statistics. The number of quantiles and the positive class are user-specified. Additionally, COMPUTE_LIFT accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs associated with the predictions

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.
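To make the ranking-and-binning idea described above concrete, the following query is a minimal sketch that computes per-quantile and cumulative lift by hand (it is not the procedure's internal implementation). It assumes the nb_apply_results table built with PREDICTION_SET and the mining_data_test_v view from the example later in this section, with 1 as the positive class.

WITH scored AS (
  -- One probability per case: keep each case's probability of the positive class.
  SELECT t.affinity_card AS target,
         NTILE(10) OVER (ORDER BY a.probability DESC) AS quantile_number
  FROM   nb_apply_results a
  JOIN   mining_data_test_v t ON a.cust_id = t.cust_id
  WHERE  a.prediction = 1
),
by_quantile AS (
  SELECT quantile_number,
         COUNT(*)                                    AS quantile_total_count,
         SUM(CASE WHEN target = 1 THEN 1 ELSE 0 END) AS quantile_target_count
  FROM   scored
  GROUP BY quantile_number
)
SELECT quantile_number,
       -- lift within the quantile: quantile target density / overall target density
       (quantile_target_count / quantile_total_count) /
       (SUM(quantile_target_count) OVER () / SUM(quantile_total_count) OVER ())
         AS lift_quantile,
       -- cumulative lift over quantiles 1..n
       (SUM(quantile_target_count) OVER (ORDER BY quantile_number) /
        SUM(quantile_total_count)  OVER (ORDER BY quantile_number)) /
       (SUM(quantile_target_count) OVER () / SUM(quantile_total_count) OVER ())
         AS lift_cumulative
FROM   by_quantile
ORDER BY quantile_number;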

See Also:

Oracle Machine Learning for SQL Concepts for more details about lift and test metrics for classification

"COMPUTE_CONFUSION_MATRIX Procedure"

"COMPUTE_ROC Procedure"

Syntax

DBMS_DATA_MINING.COMPUTE_LIFT (
      apply_result_table_name      IN VARCHAR2,
      target_table_name            IN VARCHAR2,
      case_id_column_name          IN VARCHAR2,
      target_column_name           IN VARCHAR2,
      lift_table_name              IN VARCHAR2,
      positive_target_value        IN VARCHAR2,
      score_column_name            IN VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN VARCHAR2 DEFAULT 'PROBABILITY',
      num_quantiles                IN NUMBER DEFAULT 10,
      cost_matrix_table_name       IN VARCHAR2 DEFAULT NULL,
      apply_result_schema_name     IN VARCHAR2 DEFAULT NULL,
      target_schema_name           IN VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name      IN VARCHAR2 DEFAULT NULL,
      score_criterion_type         IN VARCHAR2 DEFAULT 'PROBABILITY');

Parameters

Table 52-60 COMPUTE_LIFT Procedure Parameters

Parameter Description

apply_result_table_name

Table containing the predictions.

target_table_name

Table containing the known target values from the test data.

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

lift_table_name

Table containing the lift statistics. The table will be created by the procedure in the user's schema.

The columns in the lift table are described in the Usage Notes.

positive_target_value

The positive class. This should be the class of interest, for which you want to calculate lift.

If the target column is a NUMBER, you can use the TO_CHAR() operator to provide the value as a string.

score_column_name

Column containing the predictions in the apply results table.

The default column name is 'PREDICTION', which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions.

By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted.

The score_criterion_type parameter indicates whether probabilities or costs will be used for scoring.

The default column name is 'PROBABILITY', which is the default name created by the APPLY procedure (See "APPLY Procedure").

See the Usage Notes for additional information.

num_quantiles

Number of quantiles to be used in calculating lift. The default is 10.

cost_matrix_table_name

(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the score_criterion_type parameter is set to 'COST', the costs will be used as the scoring criteria.

The columns in a cost matrix table are described in the Usage Notes.

apply_result_schema_name

Schema of the apply results table.

If null, the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, the user's schema is assumed.

cost_matrix_schema_name

Schema of the cost matrix table, if one is provided.

If null, the user's schema is assumed.

score_criterion_type

Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the score_criterion_column_name parameter.

The default value of score_criterion_type is 'PROBABILITY'. To use costs as the scoring criterion, specify 'COST'.

If score_criterion_type is set to 'COST' but no cost matrix is provided and if there is a scoring cost matrix associated with the model, then the associated costs are used for scoring.

See the Usage Notes and the Examples.

Usage Notes

  • The predictive information you pass to COMPUTE_LIFT may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the lift.

  • Instead of passing a cost matrix to COMPUTE_LIFT, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQL PREDICTION_COST function to populate the score criterion column; a brief sketch appears at the end of these usage notes.

  • The predictions that you pass to COMPUTE_LIFT are in a table or view specified in apply_result_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • A cost matrix must have the columns described in Table 52-61.

    Table 52-61 Columns in a Cost Matrix

    Column Name Data Type

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    cost

    NUMBER

    See Also:

    Oracle Machine Learning for SQL Concepts for more information about cost matrixes

  • The table created by COMPUTE_LIFT has the columns described in Table 52-62.

    Table 52-62 Columns in a Lift Table

    Column Name Data Type

    quantile_number

    NUMBER

    probability_threshold

    NUMBER

    gain_cumulative

    NUMBER

    quantile_total_count

    NUMBER

    quantile_target_count

    NUMBER

    percent_records_cumulative

    NUMBER

    lift_cumulative

    NUMBER

    target_density_cumulative

    NUMBER

    targets_cumulative

    NUMBER

    non_targets_cumulative

    NUMBER

    lift_quantile

    NUMBER

    target_density

    NUMBER

    See Also:

    Oracle Machine Learning for SQL Concepts for details about the information in the lift table

  • When a cost matrix is passed to COMPUTE_LIFT, the cost threshold is returned in the probability_threshold column of the lift table.
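As noted in the usage notes above, you can rely on a scoring cost matrix instead of passing a cost matrix table. The following is a minimal sketch of populating the score criterion column with costs, assuming a cost matrix has already been embedded in nb_sh_clas_sample with ADD_COST_MATRIX; it mirrors the pattern used in the COMPUTE_CONFUSION_MATRIX_PART examples.

-- Score with the embedded cost matrix; the COST column then serves as the
-- score criterion (score_criterion_type => 'COST').
CREATE TABLE nb_apply_results AS
  SELECT cust_id,
         PREDICTION(nb_sh_clas_sample COST MODEL USING *)      prediction,
         PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost
  FROM   mining_data_test_v;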

Examples

This example uses the Naive Bayes model nb_sh_clas_sample.

The example illustrates lift based on probabilities. For examples that show computation based on costs, see "COMPUTE_CONFUSION_MATRIX Procedure".

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
    SELECT cust_id, t.prediction, t.probability
    FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;

Using probabilities as the scoring criterion, you can compute lift as follows.

BEGIN
     DBMS_DATA_MINING.COMPUTE_LIFT (
        apply_result_table_name      => 'nb_apply_results',
        target_table_name            => 'mining_data_test_v',
        case_id_column_name          => 'cust_id',
        target_column_name           => 'affinity_card',
        lift_table_name              => 'nb_lift',
        positive_target_value        =>  to_char(1),
        score_column_name            => 'PREDICTION',
        score_criterion_column_name  => 'PROBABILITY',
        num_quantiles                =>  10,
        cost_matrix_table_name       =>  null,
        apply_result_schema_name     =>  null,
        target_schema_name           =>  null,
        cost_matrix_schema_name      =>  null,
        score_criterion_type         =>  'PROBABILITY');
END;
/

This query displays some of the statistics from the resulting lift table.

SQL> SELECT quantile_number, probability_threshold, gain_cumulative,
           quantile_total_count
           FROM nb_lift;

QUANTILE_NUMBER PROBABILITY_THRESHOLD GAIN_CUMULATIVE QUANTILE_TOTAL_COUNT 
--------------- --------------------- --------------- --------------------  
              1            .989335775       .15034965                   55 
              2            .980534911       .26048951                   55  
              3            .968506098      .374125874                   55  
              4            .958975196      .493006993                   55 
              5            .946705997      .587412587                   55  
              6            .927454174       .66958042                   55  
              7            .904403627      .748251748                   55  
              8            .836482525      .839160839                   55  
             10            .500184953               1                   54  

52.8.8 COMPUTE_LIFT_PART Procedure

The COMPUTE_LIFT_PART procedure computes lift and stores the results in a table in the user's schema. This procedure supports the computation of per-partition evaluation metrics for partitioned models.

Lift is a test metric for binary classification models. To compute lift, one of the target values must be designated as the positive class. COMPUTE_LIFT_PART compares the predictions generated by the model with the actual target values in a set of test data. Lift measures the degree to which the model's predictions of the positive class are an improvement over random chance.

Lift is computed on scoring results that have been ranked by probability (or cost) and divided into quantiles. Each quantile includes the scores for the same number of cases.

COMPUTE_LIFT_PART calculates quantile-based and cumulative statistics. The number of quantiles and the positive class are user-specified. Additionally, COMPUTE_LIFT_PART accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs associated with the predictions

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.

Syntax

DBMS_DATA_MINING.COMPUTE_LIFT_PART (
      apply_result_table_name     IN VARCHAR2,
      target_table_name           IN VARCHAR2,
      case_id_column_name         IN VARCHAR2,
      target_column_name          IN VARCHAR2,
      lift_table_name             IN VARCHAR2,
      positive_target_value       IN VARCHAR2,
      score_column_name           IN VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY',
      score_partition_column_name IN VARCHAR2 DEFAULT 'PARTITION_NAME',
      num_quantiles               IN NUMBER   DEFAULT 10,
      cost_matrix_table_name      IN VARCHAR2 DEFAULT NULL,
      apply_result_schema_name    IN VARCHAR2 DEFAULT NULL,
      target_schema_name          IN VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name     IN VARCHAR2 DEFAULT NULL,
      score_criterion_type        IN VARCHAR2 DEFAULT NULL); 

Parameters

Table 52-63 COMPUTE_LIFT_PART Procedure Parameters

Parameter Description

apply_result_table_name

Table containing the predictions

target_table_name

Table containing the known target values from the test data

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.