43 DBMS_DATA_MINING
The DBMS_DATA_MINING
package is the application programming interface for creating, evaluating, and querying data mining models.
This chapter contains the following topics:
43.1 DBMS_DATA_MINING Overview
Oracle Data Mining supports both supervised and unsupervised data mining. Supervised data mining predicts a target value based on historical data. Unsupervised data mining discovers natural groupings and does not use a target. You can use Oracle Data Mining to mine structured data and unstructured text.
Supervised data mining functions include:
-
Classification
-
Regression
-
Feature Selection (Attribute Importance)
Unsupervised data mining functions include:
-
Clustering
-
Association
-
Feature Extraction
-
Anomaly Detection
The steps you use to build and apply a mining model depend on the data mining function and the algorithm being used. The algorithms supported by Oracle Data Mining are listed in Table 43-1.
Table 43-1 Oracle Data Mining Algorithms
Algorithm | Abbreviation | Function |
---|---|---|
Apriori |
AR |
|
CUR Matrix Decomposition |
CUR |
|
Decision Tree |
DT |
Classification |
Expectation Maximization |
EM |
Clustering |
Explicit Semantic Analysis |
ESA |
|
Exponential Smoothing |
ESM |
|
Generalized Linear Model |
GLM |
|
k-Means |
KM |
|
Minimum Descriptor Length |
MDL |
|
Naive Bayes |
NB |
|
Neural Networks |
NN |
|
Non-Negative Matrix Factorization |
NMF |
|
Orthogonal Partitioning Clustering |
O-Cluster |
|
Random Forest |
RF |
|
Singular Value Decomposition and Principal Component Analysis |
SVD and PCA |
Feature Extraction |
Support Vector Machine |
SVM |
Oracle Data Mining supports more than one algorithm for the classification, regression, clustering, and feature extraction mining functions. Each of these mining functions has a default algorithm, as shown in Table 43-2.
Table 43-2 Oracle Data Mining Default Algorithms
Mining Function | Default Algorithm |
---|---|
Classification |
Naive Bayes |
Clustering |
k-Means |
Feature Extraction |
Non-Negative Matrix Factorization |
Feature Selection |
Minimum Descriptor Length |
Regression |
Support Vector Machine |
43.2 DBMS_DATA_MINING Security Model
The DBMS_DATA_MINING
package is owned by user SYS
and is installed as part of database installation. Execution privilege on the package is granted to public. The routines in the package are run with invokers' rights (run with the privileges of the current user).
The DBMS_DATA_MINING
package exposes APIs that are leveraged by the Oracle Data Mining component of the Advanced Analytics Option. Users who wish to create mining models in their own schema require the CREATE MINING MODEL
system privilege. Users who wish to create mining models in other schemas require the CREATE ANY MINING MODEL
system privilege.
Users have full control over managing models that exist within their own schema. Additional system privileges necessary for managing data mining models in other schemas include ALTER ANY MINING MODEL
, DROP ANY MINING MODEL
, SELECT ANY MINING MODEL
, COMMENT ANY MINING MODEL
, and AUDIT ANY
.
Individual object privileges on mining models, ALTER MINING MODEL
and SELET MINING MODEL
, can be used to selectively grant privileges on a model to a different user.
See Also:
Oracle Data Mining User's Guide for more information about the security features of Oracle Data Mining
43.3 DBMS_DATA_MINING — Mining Functions
A data mining function refers to the methods for solving a given class of data mining problems.
The mining function must be specified when a model is created. (See CREATE_MODEL Procedure.)
Table 43-3 Mining Functions
Value | Description |
---|---|
|
Association is a descriptive mining function. An association model identifies relationships and the probability of their occurrence within a data set. Association models use the Apriori algorithm. |
|
Attribute importance is a predictive mining function, also known as feature selection. An attribute importance model identifies the relative importance of an attribute in predicting a given outcome. Attribute importance models can use Minimum Description Length, or CUR Matrix Decomposition. Minimum Description Length is the default. |
|
Classification is a predictive mining function. A classification model uses historical data to predict a categorical target. Classification models can use: Naive Bayes, Decision Tree, Logistic Regression, or Support Vector Machine. The default is Naive Bayes. The classification function can also be used for anomaly detection. In this case, the SVM algorithm with a null target is used (One-Class SVM). |
|
Clustering is a descriptive mining function. A clustering model identifies natural groupings within a data set. Clustering models can use k-Means, O-Cluster, or Expectation Maximization. The default is k-Means. |
|
Feature Extraction is a descriptive mining function. A feature extraction model creates an optimized data set on which to base a model. Feature extraction models can use Explicit Semantic Analysis, Non-Negative Matrix Factorization, Singular Value Decomposition, or Principal Component Analysis. Non-Negative Matrix Factorization is the default. |
|
Regression is a predictive mining function. A regression model uses historical data to predict a numerical target. Regression models can use Support Vector Machine or Linear Regression. The default is Support Vector Machine. |
|
Time series is a predictive mining function. A time series model forecasts the future values of a time-ordered series of historical numeric data over a user-specified time window. Time series models use the Exponential Smoothing algorithm. |
See Also:
Oracle Data Mining Concepts for more information about mining functions
43.4 DBMS_DATA_MINING — Model Settings
Oracle Data Mining uses settings to specify the algorithm and other characteristics of a model. Some settings are general, some are specific to a mining function, and some are specific to an algorithm.
All settings have default values. If you want to override one or more of the settings for a model, you must create a settings table. The settings table must have the column names and datatypes shown in the following table.
Table 43-4 Required Columns in the Model Settings Table
Column Name | Datatype |
---|---|
|
|
|
|
The information you provide in the settings table is used by the model at build time. The name of the settings table is an optional argument to the CREATE_MODEL Procedure.
You can find the settings used by a model by querying the data dictionary view ALL_MINING_MODEL_SETTINGS
. This view lists the model settings used by the mining models to which you have access. All the setting values are included in the view, whether default or user-specified.
See Also:
-
ALL_MINING_MODEL_SETTINGS
in Oracle Database Reference -
Oracle Data Mining User's Guide for information about specifying model settings
43.4.1 DBMS_DATA_MINING — Algorithm Names
The ALGO_NAME
setting specifies the model algorithm.
The values for the ALGO_NAME
setting are listed in the following table.
Table 43-5 Algorithm Names
ALGO_NAME Value | Description | Mining Function |
---|---|---|
|
Minimum Description Length |
Attribute Importance |
|
Apriori |
Association Rules |
|
CUR Decomposition |
Attribute Importance |
|
Decision Tree |
Classification |
|
Expectation Maximization |
Clustering |
|
Explicit Semantic Analysis |
Feature Extraction Classification |
|
Exponential Smoothing |
Time Series |
|
Language used for extensible algorithm |
All mining functions supported |
|
Generalized Linear Model |
Classification, Regression; also Feature Selection and Generation |
|
Enhanced k_Means |
Clustering |
|
Naive Bayes |
Classification |
|
Neural Network |
Classification |
|
Non-Negative Matrix Factorization |
Feature Extraction |
|
O-Cluster |
Clustering |
|
Random Forest |
Classification |
|
Singular Value Decomposition |
Feature Extraction |
|
Support Vector Machine |
Classification and Regression |
See Also:
Oracle Data Mining Concepts for information about algorithms
43.4.2 DBMS_DATA_MINING — Automatic Data Preparation
Oracle Data Mining supports fully Automatic Data Preparation (ADP), user-directed general data preparation, and user-specified embedded data preparation. The PREP_*
settings enable the user to request fully automated or user-directed general data preparation. By default, fully Automatic Data Preparation (PREP_AUTO_ON
) is enabled.
When you enable Automatic Data Preparation, the model uses heuristics to transform the build data according to the requirements of the algorithm. Instead of fully Automatic Data Preparation, the user can request that the data be shifted and/or scaled with the PREP_SCALE*
and PREP_SHIFT*
settings. The transformation instructions are stored with the model and reused whenever the model is applied. Refer to Model Detail Views, Oracle Data Mining User’s Guide.
You can choose to supplement Automatic Data Preparations by specifying additional transformations in the xform_list
parameter when you build the model. (See "CREATE_MODEL Procedure".)
If you do not use Automatic Data Preparation and do not specify transformations in the xform_list
parameter to CREATE_MODEL
, you must implement your own transformations separately in the build, test, and scoring data. You must take special care to implement the exact same transformations in each data set.
If you do not use Automatic Data Preparation, but you do specify transformations in the xform_list
parameter to CREATE_MODEL
, Oracle Data Mining embeds the transformation definitions in the model and prepares the test and scoring data to match the build data.
The values for the PREP_*
setting are described in the following table.
Table 43-6 PREP_* Setting
Setting Name | Setting Value | Description |
---|---|---|
|
|
This setting enables fully automated data preparation.
The default is |
|
|
This setting enables scaling data preparation for two-dimensional numeric columns.
|
|
PREP_SCALE_MAXABS |
This setting enables scaling data preparation for nested numeric columns. |
|
|
This setting enables centering data preparation for two-dimensional numeric columns.
PREP_AUTO must be OFF for this setting to take effect. The following are the possible values:
|
See Also:
Oracle Data Mining User's Guide for information about data transformations
43.4.3 DBMS_DATA_MINING — Mining Function Settings
The settings described in this table apply to a mining function.
Table 43-7 Mining Function Settings
Mining Function | Setting Name | Setting Value | Description |
---|---|---|---|
Association |
|
|
Maximum rule length for Association Rules. Default is |
Association |
|
|
Minimum confidence for Association Rules. Default is |
Association |
|
|
Minimum support for Association Rules Default is |
Association |
|
|
Minimum absolute support that each rule must satisfy. The value must be an integer. Default is |
Association |
|
|
Sets the Minimum Reverse Confidence that each rule should satisfy. The Reverse Confidence of a rule is defined as the number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs. The value is real number between 0 and 1. The default is |
Association |
|
|
Sets Including Rules applied for each association rule: it specifies the list of items that at least one of them must appear in each reported association rule, either as antecedent or as consequent. It is a comma separated string containing the list of including items. If not set, the default behavior is, the filtering is not applied. |
Association |
|
|
Sets Excluding Rules applied for each association rule: it specifies the list of items that none of them can appear in each reported Association Rules. It is a comma separated string containing the list of excluding items. No rule can contain any item in the list. The default is |
Association |
|
|
Sets Including Rules for the antecedent: it specifies the list of items that at least one of them must appear in the antecedent part of each reported association rule. It is a comma separated string containing the list of including items. The antecedent part of each rule must contain at least one item in the list. The default is |
Association |
|
|
Sets Excluding Rules for the antecedent: it specifies the list of items that none of them can appear in the antecedent part of each reported association rule. It is a comma separated string containing the list of excluding items. No rule can contain any item in the list in its antecedent part. The default is |
Association |
|
|
Sets Including Rules for the consequent: it specifies the list of items that at least one of them must appear in the consequent part of each reported association rule. It is a comma separated string containing the list of including items. The consequent of each rule must be an item in the list. The default is |
Association |
|
|
Sets Excluding Rules for the consequent: it specifies the list of items that none of them can appear in the consequent part of each reported association rule. It is a comma separated string containing the list of excluding items. No rule can have any item in the list as its consequent. The excluding rule can be used to reduce the data that must be stored, but the user may be required to build extra model for executing different including or Excluding Rules. The default is |
Association |
|
|
Specifies the columns to be aggregated. It is a comma separated string containing the names of the columns for aggregation. Number of columns in the list must be <= 10. You can set
The default is For each item, the user may supply several columns to aggregate. It requires more memory to buffer the extra data. Also, the performance impact can be seen because of the larger input data set and more operation. |
Association |
|
0 <ASSO_ABS_ERROR ≤MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE). |
Specifies the absolute error for the Association Rules sampling. A smaller value of |
Association |
|
0 ≤ ASSO_CONF_LEVEL ≤ 1 |
Specifies the confidence level for an Association Rules sample. A larger value of |
Classification |
|
table_name |
(Decision Tree only) Name of a table that stores a cost matrix to be used by the algorithm in building the model. The cost matrix specifies the costs associated with misclassifications. Only Decision Tree models can use a cost matrix at build time. All classification algorithms can use a cost matrix at apply time. The cost matrix table is user-created. See "ADD_COST_MATRIX Procedure" for the column requirements. See Oracle Data Mining Concepts for information about costs. |
Classification |
|
table_name |
(Naive Bayes) Name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data. The priors table is user-created. See Oracle Data Mining User's Guide for the column requirements. See Oracle Data Mining Concepts for additional information about priors. |
Classification |
|
table_name |
(GLM and SVM only) Name of a table that stores weighting information for individual target values in SVM classification and GLM logistic regression models. The weights are used by the algorithm to bias the model in favor of higher weighted classes. The class weights table is user-created. See Oracle Data Mining User's Guide for the column requirements. See Oracle Data Mining Concepts for additional information about class weights. |
Classification |
|
|
This setting indicates that the algorithm must create a model that balances the target distribution. This setting is most relevant in the presence of rare targets, as balancing the distribution may enable better average accuracy (average of per-class accuracy) instead of overall accuracy (which favors the dominant class). The default value is |
Clustering |
|
|
Maximum number of leaf clusters generated by a clustering algorithm. The algorithm may return fewer clusters, depending on the data. Enhanced k-Means usually produces the exact number of clusters specified by Expectation Maximization (EM) may return fewer clusters than the number specified by For EM, the default value of |
Feature Extraction |
|
|
Number of features to be extracted by a feature extraction model. The default is estimated from the data by the algorithm. If the matrix rank is smaller than this number, fewer features will be returned. For CUR Matrix Decomposition, the |
See Also:
Oracle Data Mining Concepts for information about mining functions
43.4.4 DBMS_DATA_MINING — Global Settings
The configuration settings in this table are applicable to any type of model, but are currently only implemented for specific algorithms.
Table 43-8 Global Settings
Setting Name | Setting Value | Description |
---|---|---|
|
column_name |
(Association rules only) Name of a column that contains the items in a transaction. When this setting is specified, the algorithm expects the data to be presented in a native transactional format, consisting of two columns:
A typical example of transactional data is market basket data, wherein a case represents a basket that may contain many items. Each item is stored in a separate row, and many rows may be needed to represent a case. The case ID values do not uniquely identify each row. Transactional data is also called multi-record case data. Association rules function is normally used with transactional data, but it can also be applied to single-record case data (similar to other algorithms). For more information about single-record and multi-record case data, see Oracle SQL Developer Data Modeler User's Guide. |
|
column_name |
(Association rules only) Name of a column that contains a value associated with each item in a transaction. This setting is only used when a value has been specified for If
ASSO_AGGREGATES , Case ID, and Item ID column are present, then the Item Value column may or may not appear.
The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples). For details onASSO_AGGREGATES , see DBMS_DATA_MINING - Mining Function Settings.
|
|
|
Indicates how to treat missing values in the training data. This setting does not affect the scoring data. The default value is
When The value |
|
column_name |
(GLM only) Name of a column in the training data that contains a weighting factor for the rows. The column data type must be NUMBER. Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. Row weights can also be used to emphasize certain rows during model construction. For example, to bias the model towards rows that are more recent and away from potentially obsolete data. |
|
The name of an Oracle Text POLICY created using |
Affects how individual tokens are extracted from unstructured text. For details about |
|
1 <= value |
The maximum number of distinct features, across all text attributes, to use from a document set passed to |
|
Non-negative value |
This is a text processing setting the controls how in how many documents a token needs to appear to be used as a feature. The default is |
|
Comma separated list of machine learning attributes |
This setting indicates a request to build a partitioned model. The setting value is a comma-separated list of the machine learning attributes to be used to determine the in-list partition key values. These machine learning attributes are taken from the input columns unless an |
|
1< value <= 1000000 |
This setting indicates the maximum number of partitions allowed for the model. The default is |
|
|
This setting allows the user to request a sampling of the build data. The default is |
|
0 < Value |
This setting determines how many rows will be sampled (approximately). It can be set only if |
|
|
This setting controls the parallel build of partitioned models.
The default mode is |
|
tablespace_name |
This setting controls the storage specifications. If you explicitly sets this to the name of a tablespace (for which you have sufficient quota), then the specified tablespace storage creates the resulting model content. If you do not provide this setting, then the default tablespace of the user creates the resulting model content. |
|
The value must be a non-negative integer |
The hash function with a random number seed generates a random number with uniform distribution. Users can control the random number seed by this setting. The default is This setting is used by Random Forest, Neural Network, and CUR Matrix Decomposition. |
|
|
This setting reduces the space that is used while creating a model, especially a partitioned model. The default value is When the setting is The reduction in space depends on the model. Reduction on the order of 10x can be achieved. |
See Also:
Oracle Data Mining Concepts for information about GLM
Oracle Data Mining Concepts for information about association rules
Oracle Data Mining User’s Guide for information about machine learning unstructured text
43.4.5 DBMS_DATA_MINING — Algorithm Settings: ALGO_EXTENSIBLE_LANG
The settings listed in the following table configure the behavior of the mining model with an Extensible algorithm. The mining model is built in R language.
The RALG_*_FUNCTION
specifies the R script that is used to build, score, and view R model and must be registered in ORACLR Script repository. The R scripts are registered through Oracle Enterprise R product with special privilege. When ALGO_EXTENSIBLE_LANG
is set to R in the MINING_MODEL_SETTING
table, the mining model is built in the R language. After the R model is built, the names of the R scripts are recorded in MINING_MODEL_SETTING
table in the SYS
schema. The scripts must exist in the repository for the R model to function. The amount of R memory used to build, score, and view the R model through these R scripts can be controlled by Oracle Enterprise R.
All algorithm-independent DBMS_DATA_MINING
subprograms can operate on the R model for mining functions such as Association, Attribute Importance, Classification, Clustering, Feature Extraction, and Regression.
The supported DBMS_DATA_MINING
subprograms include, but are not limited, to the following:
-
ADD_COST_MATRIX Procedure
-
COMPUTE_CONFUSION_MATRIX Procedure
-
COMPUTE_LIFT Procedure
-
COMPUTE_ROC Procedure
-
CREATE_MODEL Procedure
-
DROP_MODEL Procedure
-
EXPORT_MODEL Procedure
-
GET_MODEL_COST_MATRIX Function
-
IMPORT_MODEL Procedure
-
REMOVE_COST_MATRIX Procedure
-
RENAME_MODEL Procedure
Table 43-9 ALGO_EXTENSIBLE_LANG Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Specifies the name of an existing registered R script for R algorithm mining model build function. The R script defines an R function for the first input argument for training data and returns an R model object. For Clustering and Feature Extraction mining function model build, the R attributes dm$nclus and dm$nfeat must be set on the R model to indicate the number of clusters and features respectively. The |
|
SELECT value param_name, ...FROM DUAL
|
Specifies a list of numeric and string scalar for optional input parameters of the model build function. |
|
|
Specifies the name of an existing registered R script to score data. The script returns a |
|
|
Specifies the name of an existing registered R script for R algorithm that computes the weight (contribution) for each attribute in scoring. The script returns a |
|
|
Specifies the name of an existing registered R script for R algorithm that produces the model information. This setting is required to generate a model view. |
|
SELECT type_value column_name, ... FROM DUAL |
Specifies the |
See Also:
43.4.6 DBMS_DATA_MINING — Algorithm Settings: CUR Matrix Decomposition
The following settings affects the behavior of the CUR Matrix Decomposition algorithm.
The following settings configure the behavior of the CUR Matrix Decomposition algorithm.
Table 43-10 CUR Matrix Decomposition Settings
Setting Name | Setting Value | Description |
---|---|---|
|
The value must be a positive integer |
Defines the approximate number of attributes to be selected. The default value is the number of attributes. |
|
|
Defines the flag indicating whether or not to perform row selection. The default value is |
|
The value must be a positive integer |
Defines the approximate number of rows to be selected. This parameter is only used when users decide to perform row selection ( The default value is the total number of rows. |
|
The value must be a positive integer |
Defines the rank parameter used in the column/row leverage score calculation. If users do not provide an input value, the value is determined by the system. |
See Also:
43.4.7 DBMS_DATA_MINING — Algorithm Settings: Decision Tree
These settings configure the behavior of the Decision Tree algorithm. Note that the Decision Tree settings are also used to configure the behavior of Random Forest as it constructs each individual Decision Tree.
Table 43-11 Decision Tree Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini ( |
|
For Decision Tree:
For Random Forest:
|
Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). For Decision Tree the default is For Random Forest the default is |
|
|
The minimum number of training rows in a node expressed as a percentage of the rows in the training data. Default is |
|
|
Minimum number of rows required to consider splitting a node expressed as a percentage of the training rows. Default is |
|
|
Minimum number of rows in a node. Default is |
|
|
Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value. Default is |
|
For Decision Tree:
For Random Forest:
|
This parameter specifies the maximum number of bins for each attribute. Default value is |
See Also:
Oracle Data Mining Concepts for information about Decision Tree
43.4.8 DBMS_DATA_MINING — Algorithm Settings: Expectation Maximization
These algorithm settings configure the behavior of the Expectation Maximization algorithm.
See Also:
Oracle Data Mining Concepts for information about Expectation Maximization
Table 43-12 Expectation Maximization Settings for Data Preparation and Analysis
Setting Name | Setting Value | Description |
---|---|---|
|
|
Whether or not to include uncorrelated attributes in the model. When Note: This setting applies only to attributes that are not nested. Default is system-determined. |
|
|
Maximum number of correlated attributes to include in the model. Note: This setting applies only to attributes that are not nested (2D). Default is |
|
|
The distribution for modeling numeric attributes. Applies to the input table or view as a whole and does not allow per-attribute specifications. The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numeric attributes are modeled using the same type of distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data. Default is |
|
|
Number of equi-width bins that will be used for gathering cluster statistics for numeric columns. Default is |
|
|
Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns. Default is |
|
|
Specifies the number of quantile bins that will be used for modeling numeric columns with multivalued Bernoulli distributions. Default is system-determined. |
|
|
Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions. Default is system-determined. |
Table 43-13 Expectation Maximization Settings for Learning
Setting Name | Setting Value | Description |
---|---|---|
|
|
The convergence criterion for EM. The convergence criterion may be based on a held-aside data set, or it may be Bayesian Information Criterion. Default is system determined. |
|
|
When the convergence criterion is based on a held-aside data set ( Default value is |
|
|
Maximum number of components in the model. If model search is enabled, the algorithm automatically determines the number of components based on improvements in the likelihood function or based on regularization, up to the specified maximum. The number of components must be greater than or equal to the number of clusters. Default is 20. |
|
|
Specifies the maximum number of iterations in the EM algorithm. Default is |
|
|
This setting enables model search in EM where different model sizes are explored and a best size is selected. The default is |
|
|
This setting allows the EM algorithm to remove a small component from the solution. The default is |
|
Non-negative integer |
This setting controls the seed of the random generator used in EM. The default is |
Table 43-14 Expectation Maximization Settings for Component Clustering
Setting Name | Setting Value | Description |
---|---|---|
|
|
Enables or disables the grouping of EM components into high-level clusters. When disabled, the components themselves are treated as clusters. When component clustering is enabled, model scoring through the SQL Default is |
|
|
Dissimilarity threshold that controls the clustering of EM components. When the dissimilarity measure is less than the threshold, the components are combined into a single cluster. A lower threshold may produce more clusters that are more compact. A higher threshold may produce fewer clusters that are more spread out. Default is |
|
|
Allows the specification of a linkage function for the agglomerative clustering step.
Default is |
Table 43-15 Expectation Maximization Settings for Cluster Statistics
Setting Name | Setting Value | Description |
---|---|---|
|
|
Enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules). When statistics are disabled, model size is reduced, and Default is |
|
|
Minimum support required for including an attribute in the cluster rule. The support is the percentage of the data rows assigned to a cluster that must have non-null values for the attribute. Default is |
43.4.9 DBMS_DATA_MINING — Algorithm Settings: Explicit Semantic Analysis
Explicit Semantic Analysis (ESA) is a useful technique for extracting meaningful and interpretable features.
Table 43-16 Explicit Semantic Analysis Settings
Setting Name | Setting Value | Description |
---|---|---|
|
Non-negative number |
This setting thresholds a small value for attribute weights in the transformed build data. The default is |
|
Text input Non-text input is |
This setting determines the minimum number of non-zero entries that need to be present in an input row. The default is 100 for text input and 0 for non-text input. |
|
A positive integer |
This setting controls the maximum number of features per attribute. The default is |
See Also:
Oracle Data Mining Concepts for information about Explicit Semantic Analysis.
43.4.10 DBMS_DATA_MINING — Algorithm Settings: Exponential Smoothing Models
Exponential Smoothing Models (ESM) is a useful technique for extracting meaningful and interpretable features.
Table 43-17 Exponential Smoothing Models Settings
Setting Name | Setting Value | Description |
---|---|---|
|
It can take value in set {EXSM_SIMPLE, EXSM_SIMPLE_MULT, EXSM_HOLT, EXSM_HOLT_DMP, EXSM_MUL_TRND, EXSM_MULTRD_DMP, EXSM_SEAS_ADD, EXSM_SEAS_MUL, EXSM_HW, EXSM_HW_DMP, EXSM_HW_ADDSEA, EXSM_DHW_ADDSEA, EXSM_HWMT, EXSM_HWMT_DMP} |
This setting specifies the model.
The default value is |
|
|
This setting specifies a positive integer value as the length of seasonal cycle. The value it takes must be larger than This setting is only applicable and must be provided for models with seasonality, otherwise the model throws an error. When |
|
It can take value in set {EXSM_INTERVAL_YEAR, EXSM_INTERVAL_QTR, EXSM_INTERVAL_MONTH,EXSM_INTERVAL_WEEK, EXSM_INTERVAL_DAY, EXSM_INTERVAL_HOUR, EXSM_INTERVAL_MIN,EXSM_INTERVAL_SEC} |
This setting only applies and must be provided when the time column ( If the time column of input table is of datetime type and setting If the time column of input table is of oracle number type and setting |
|
It can take value in set {EXSM_ACCU_TOTAL, EXSM_ACCU_STD, EXSM_ACCU_MAX, EXSM_ACCU_MIN, EXSM_ACCU_AVG, EXSM_ACCU_MEDIAN, EXSM_ACCU_COUNT}. |
This setting only applies and must be provided when the time column has datetime type. It specifies how to generate the value of the accumulated time series from the input time series. |
|
It can also specify an option taking value in set {EXSM_MISS_MIN, EXSM_MISS_MAX, EXSM_MISS_AVG, EXSM_MISS_MEDIAN, EXSM_MISS_LAST, EXSM_MISS_FIRST, EXSM_MISS_PREV, EXSM_MISS_NEXT, EXSM_MISS_AUTO}. |
This setting specifies how to handle missing values, which may come from input data and/or the accumulation process of input time series. It can specify either a number or an option. If a number is specified, all the missing values are set to that number.
If this setting is not provided, |
|
It must be set to a number between 1-30. |
This setting is used to specify how many steps ahead the predictions are to be made. If it is not set, the default value is |
|
It must be a number between 0 and 1, exclusive. |
This setting is used to specify the desired confidence level for prediction. The lower and upper bounds of the specified confidence interval is reported. If not specified, the default confidence level is |
|
It takes value in set {EXSM_OPT_CRIT_LIKE, XSM_OPT_CRIT_MSE, EXSM_OPT_CRIT_AMSE, EXSM_OPT_CRIT_SIG, EXSM_OPT_CRIT_MAE}. |
This setting is used to specify the desired optimization criterion. |
|
positive integer |
This setting specifies the length of the window used in computing the error metric average mean square error (AMSE). |
See Also:
Oracle Data Mining Concepts for information about Exponential Smoothing Models.
43.4.11 DBMS_DATA_MINING — Algorithm Settings: Generalized Linear Models
The settings listed in the following table configure the behavior of Generalized Linear Models
Table 43-18 DBMS_DATA_MINING GLM Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
The confidence level for coefficient confidence intervals. The default confidence level is |
|
|
Whether feature generation is quadratic or cubic. When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data. |
|
|
Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled. Note: Feature generation can only be enabled when feature selection is also enabled. |
|
|
Feature selection penalty criterion for adding a feature to the model. When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data. |
|
|
Whether or not feature selection is enabled for GLM. By default, feature selection is not enabled. |
|
|
When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model. By default, the algorithm limits the number of features to ensure sufficient memory. |
|
|
Prune enable or disable for features in the final model. Pruning is based on T-Test statistics for linear regression, or Wald Test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data. When feature selection is enabled, the algorithm automatically performs pruning based on the data. |
|
target_value |
The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the other class. By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class. |
|
|
Enable or disable Ridge Regression. Ridge applies to both regression and Classification mining functions. When ridge is enabled, prediction bounds are not produced by the Note: Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised. |
|
|
The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression. If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm. |
GLMS_ROW_DIAGNOSTICS
|
|
Enable or disable row diagnostics. |
|
The range is ( |
Convergence Tolerance setting of the GLM algorithm The default value is system-determined. |
|
Positive integer |
Maximum number of iterations for the GLM algorithm. The default value is system-determined. |
|
|
Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is |
|
|
This setting allows the user to choose the GLM solver. The solver cannot be selected if |
|
|
This setting allows the user to use sparse solver if it is available. The default value is |
Related Topics
See Also:
Oracle Data Mining Concepts for information about GLM.
43.4.12 DBMS_DATA_MINING — Algorithm Settings: k-Means
The settings listed in the following table configure the behavior of the k-Means algorithm.
Table 43-19 k-Means Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times. The default Convergence Tolerance is |
|
|
Distance function for k-Means. The default distance function is |
|
|
Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in The default number of iterations is |
|
|
Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster. If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules. The default minimum support is |
|
|
Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin. The default number of histogram bins is |
|
|
Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time. When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster. The default split criterion is the |
KMNS_RANDOM_SEED |
Non-negative integer |
This setting controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value. The default is |
|
|
This setting determines the level of cluster detail that are computed during the build.
|
See Also:
Oracle Data Mining Concepts for information about k-Means
43.4.13 DBMS_DATA_MINING — Algorithm Settings: Naive Bayes
The settings listed in the following table configure the behavior of the Naive Bayes Algorithm.
Table 43-20 Naive Bayes Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Value of pairwise threshold for NB algorithm Default is |
|
|
Value of singleton threshold for NB algorithm Default value is |
See Also:
Oracle Data Mining Concepts for information about Naive Bayes
43.4.14 DBMS_DATA_MINING — Algorithm Settings: Neural Network
The settings listed in the following table configure the behavior of Neural Network.
Table 43-21 DBMS_DATA_MINING Neural Network Settings
Setting Name | Setting Value | Description |
---|---|---|
|
Non-negative integer |
Defines the topology by number of hidden layers. The default value is |
|
A list of positive integers |
Defines the topology by number of nodes per layer. Different layers can have different number of nodes. The value should be non-negative integers and comma separated. For example, '10, 20, 5'. The setting values must be consistent with |
|
A list of the following strings:
|
Defines the activation function for the hidden layers. For example, ''' Different layers can have different activation functions. The default value is '' The number of activation functions must be consistent with Note: All quotes are single and two single quotes are used to escape a single quote in SQL statements. |
|
|
The setting specifies the lower bound of the region where weights are randomly initialized.
NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND must be set together. Setting one and not setting the other raises an error. NNET_WEIGHT_LOWER_BOUND must not be greater than NNET_WEIGHT_UPPER_BOUND . The default value is –sqrt(6/(l_nodes+r_nodes)) . The value of l_nodes for:
The value of |
|
|
This setting specifies the upper bound of the region where weights are initialized. It should be set in pairs with The default value is |
|
Positive integer |
This setting specifies the maximum number of iterations in the Neural Network algorithm. The default value is |
|
|
Defines the convergence tolerance setting of the Neural Network algorithm. The default value is |
|
|
Regularization setting for Neural Network algorithm. If the total number of training rows is greater than 50000, the default is |
|
|
Define the held ratio for the held-aside method. The default value is |
NNET_HELDASIDE_MAX_FAIL
|
The value must be a positive integer. |
With The default value is |
|
|
Defines the L2 regularization parameter lambda. This can not be set together with The default value is |
Related Topics
See Also:
Oracle Data Mining Concepts for information about Neural Network.
43.4.15 DBMS_DATA_MINING — Algorithm Settings: Non-Negative Matrix Factorization
The settings listed in the following table configure the behavior of the Non-Negative Matrix Factorization algorithm.
You can query the data dictionary view *_MINING_MODEL_SETTINGS
(using the ALL
, USER
, or DBA
prefix) to find the setting values for a model. See Oracle Database Reference for information about *_MINING_MODEL_SETTINGS
.
Table 43-22 NMF Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Convergence tolerance for NMF algorithm Default is |
|
|
Whether negative numbers should be allowed in scoring results. When set to Default is |
|
|
Number of iterations for NMF algorithm Default is |
|
|
Random seed for NMF algorithm. Default is |
See Also:
Oracle Data Mining Concepts for information about NMF
43.4.16 DBMS_DATA_MINING — Algorithm Settings: O-Cluster
The settings in the table configure the behavior of the O-Cluster algorithm.
Table 43-23 O-CLuster Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density. Default is |
See Also:
Oracle Data Mining Concepts for information about O-Cluster
43.4.17 DBMS_DATA_MINING — Algorithm Settings: Random Forest
These settings configure the behavior of the Random Forest algorithm. Random forest makes use of the Decision Tree settings to configure the construction of individual trees.
Table 43-24 Random Forest Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value |
|
|
Number of trees in the forest Default is |
|
|
Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data. |
Related Topics
See Also:
Oracle Data Mining Concepts for information about Random Forest
43.4.18 DBMS_DATA_MINING — Algorithm Constants and Settings: Singular Value Decomposition
The following constant affects the behavior of the Singular Value Decomposition algorithm.
Table 43-25 Singular Value Decomposition Constant
Constant Name | Constant Value | Description |
---|---|---|
|
2500 |
The maximum number of features supported by SVD. |
The following settings configure the behavior of the Singular Value Decomposition algorithm.
Table 43-26 Singular Value Decomposition Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Indicates whether or not to persist the U Matrix produced by SVD. The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix is persisted only when When Default is |
|
|
Whether to use SVD or PCA scoring for the model. When the build data is scored with SVD, the projections will be the same as the U matrix. When the build data is scored with PCA, the projections will be the product of the U and S matrices. Default is |
|
|
This setting indicates the solver to be used for computing SVD of the data. In the case of PCA, the solver setting indicates the type of SVD solver used to compute the PCA for the data. When this setting is not specified the solver type selection is data driven. If the number of attributes is greater than 3240, then the default wide solver is used. Otherwise, the default narrow solver is selected. The following are the group of solvers:
For narrow data solvers:
For wide data solvers:
|
|
Range [ |
This setting is used to prune features. Define the minimum value the eigenvalue of a feature as a share of the first eigenvalue to not to prune. Default value is data driven. |
|
Range [ |
The random seed value is used for initializing the sampling matrix used by the Stochastic SVD solver. The default is |
|
Range [ |
This setting is configures the number of columns in the sampling matrix used by the Stochastic SVD solver. The number of columns in this matrix is equal to the requested number of features plus the oversampling setting. The SVD Solver must be set to |
|
Range [ |
The power iteration setting improves the accuracy of the SSVD solver. The default is |
See Also:
43.4.19 DBMS_DATA_MINING — Algorithm Settings: Support Vector Machine
The settings listed in the following table configure the behavior of the Support Vector Machine algorithm.
Table 43-27 SVM Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor. Value of complexity factor for SVM algorithm (both Classification and Regression). Default value estimated from the data by the algorithm. |
|
|
Convergence tolerance for SVM algorithm. Default is |
|
|
Regularization setting for regression, similar to complexity factor. Epsilon specifies the allowable residuals, or noise, in the data. Value of epsilon factor for SVM regression. Default is |
|
|
Kernel for Support Vector Machine. Linear or Gaussian. The default value isSVMS_LINEAR .
|
|
|
The desired rate of outliers in the training data. Valid for One-Class SVM models only (Anomaly Detection). Default is |
|
|
Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases. Value of standard deviation for SVM algorithm. This is applicable only for Gaussian kernel. Default value estimated from the data by the algorithm. |
|
Positive integer |
This setting sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver. |
|
Range [ |
This setting sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is |
|
Positive integer |
This setting applies to SVM models with linear kernel. This setting sets the size of the batch for the SGD solver. An input of 0 triggers a data driven batch size estimate. The default is |
|
|
This setting controls the type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size. |
|
|
This setting allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined. |
See Also:
Oracle Data Mining Concepts for information about SVM
43.5 DBMS_DATA_MINING — Solver Settings
Oracle Data Mining algorithms can use different solvers. Solver settings can be provided at build time in the setting table.
43.5.1 DBMS_DATA_MINING — Solver Settings: ADMM
The settings listed in the following table configure the behavior of Alternating Direction Method of Multipliers (ADMM). Generalized Linear Models (GLM) use these settings.
Table 43-28 DBMS_DATA_MINING ADMM Settings
Related Topics
See Also:
Oracle Data Mining Concepts for information about Neural Network
43.5.2 DBMS_DATA_MINING — Solver Settings: LBFGS
The settings listed in the following table configure the behavior of L-BFGS. Neural Network and Generalized Linear Models (GLM) use these settings.
Table 43-29 DBMS_DATA_MINING L-BFGS Settings
Setting Name | Setting Value | Description |
---|---|---|
|
|
Defines gradient infinity norm tolerance for L-BFGS. Default value is |
|
The value must be a positive integer. |
Defines the number of historical copies kept in L-BFGS solver. The default value is |
|
|
Defines whether to scale Hessian in L-BFGS or not. Default value is |
See Also:
Oracle Data Mining Concepts for information about Neural Network
43.6 DBMS_DATA_MINING Datatypes
The DBMS_DATA_MINING
package defines object datatypes for mining transactional data. The package also defines a type for user-specified transformations. These types are called DM_NESTED_
n
, where n
identifies the Oracle datatype of the nested attributes.
The Data Mining object datatypes are described in the following table:
Table 43-30 DBMS_DATA_MINING Summary of Datatypes
Datatype | Description |
---|---|
|
The name and value of a numerical attribute of type |
|
A collection of |
|
The name and value of a numerical attribute of type |
|
A collection of |
|
The name and value of a categorical attribute of type |
|
A collection of |
|
The name and value of a numerical attribute of type |
|
A collection of |
|
A table of |
|
A list of user-specified transformations for a model. Accepted as a parameter by the CREATE_MODEL Procedure. This collection type is defined in the DBMS_DATA_MINING_TRANSFORM package. |
For more information about mining nested data, see Oracle Data Mining User's Guide.
Note:
Starting from Oracle Database 12c Release 2,*GET_MODEL_DETAILS
are deprecated and are replaced with Model Detail Views. See Oracle Data Mining User’s Guide.
43.6.1 Deprecated Types
This topic contains tables listing deprecated types.
The DBMS_DATA_MINING
package defines object datatypes for storing information about model attributes. Most of these types are returned by the table functions GET
_n
, where n
identifies the type of information to return. These functions take a model name as input and return the requested information as a collection of rows.
For a list of the GET
functions, see "Summary of DBMS_DATA_MINING Subprograms".
All the table functions use pipelining, which causes each row of output to be materialized as it is read from model storage, without waiting for the generation of the complete table object. For more information on pipelined, parallel table functions, consult the Oracle Database PL/SQL Language Reference.
Table 43-31 DBMS_DATA_MINING Summary of Deprecated Datatypes
Datatype | Description |
---|---|
|
The centroid of a cluster. |
|
A collection of |
|
A child node of a cluster. |
|
A collection of |
|
A cluster. A cluster includes See also, Table 43-33. |
|
A collection of See also, Table 43-33. |
|
The conditional probability of an attribute in a Naive Bayes model. |
|
A collection of |
|
The actual and predicted values in a cost matrix. |
|
A collection of |
|
A component of an Expectation Maximization model. |
|
A collection of |
|
A projection of an Expectation Maximization model. |
|
A collection of |
|
The coefficient and associated statistics of an attribute in a Generalized Linear Model. |
|
A collection of |
|
A histogram associated with a cluster. |
|
A collection of See also, Table 43-33. |
|
An item in an association rule. |
|
A collection of |
|
A collection of |
|
A collection of |
|
High-level statistics about a model. |
|
A collection of |
|
Information about an attribute in a Naive Bayes model. |
|
A collection of |
|
An attribute in a feature of a Non-Negative Matrix Factorization model. |
|
A collection of |
|
A feature in a Non-Negative Matrix Factorization model. |
|
A collection of |
|
Antecedent and consequent in a rule. |
|
A collection of See also, Table 43-33. |
|
An attribute ranked by its importance in an Attribute Importance model. |
|
A collection of |
|
A rule that defines a conditional relationship. The rule can be one of the association rules returned by GET_ASSOCIATION_RULES Function, or it can be a rule associated with a cluster in the collection of clusters returned by GET_MODEL_DETAILS_KM Function and GET_MODEL_DETAILS_OC Function. See also, Table 43-33. |
|
A collection of See also, Table 43-33. |
|
A factorized matrix S, V, or U returned by a Singular Value Decomposition model. |
|
A collection of |
|
The name, value, and coefficient of an attribute in a Support Vector Machine model. |
|
A collection of |
|
The linear coefficient of each attribute in a Support Vector Machine model. |
|
A collection of |
|
The transformation and reverse transformation expressions for an attribute. |
|
A collection of |
Return Values for Clustering Algorithms
The table contains description of DM_CLUSTER
return value columns, nested table columns, and rows.
Table 43-32 DM_CLUSTER Return Values for Clustering Algorithms
Return Value | Description |
---|---|
|
A set of rows of type (id NUMBER, cluster_id VARCHAR2(4000), record_count NUMBER, parent NUMBER, tree_level NUMBER, dispersion NUMBER, split_predicate DM_PREDICATES, child DM_CHILDREN, centroid DM_CENTROIDS, histogram DM_HISTOGRAMS, rule DM_RULE) |
DM_PREDICATE |
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), conditional_operator CHAR(2)/*=,<>,<,>,<=,>=*/, attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), attribute_support NUMBER, attribute_confidence NUMBER) |
DM_CLUSTER Fields
The following table describes DM_CLUSTER
fields.
Table 43-33 DM_CLUSTER Fields
Column Name | Description |
---|---|
|
Cluster identifier |
|
The ID of a cluster in the model |
|
Specifies the number of records |
|
Parent ID |
|
Specifies the number of splits from the root |
|
A measure used to quantify whether a set of observed occurrences are dispersed compared to a standard statistical model. |
|
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), conditional_operator CHAR(2) /*=,<>,<,>,<=,>=*/, attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), attribute_support NUMBER, attribute_confidence NUMBER) Note: The Expectation Maximization algorithm uses all the fields except |
|
The |
|
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), mean NUMBER, mode_value VARCHAR2(4000), variance NUMBER) |
|
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), bin_id NUMBER, lower_bound NUMBER, upper_bound NUMBER, label VARCHAR2(4000), count NUMBER) |
|
The (rule_id INTEGER, antecedent DM_PREDICATES, consequent DM_PREDICATES, rule_support NUMBER, rule_confidence NUMBER, rule_lift NUMBER, antecedent_support NUMBER, consequent_support NUMBER, number_of_items INTEGER) |
Usage Notes
-
The table function pipes out rows of type
DM_CLUSTER
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes". -
For descriptions of predicates (
DM_PREDICATE
) and rules (DM_RULE
), see GET_ASSOCIATION_RULES Function.
43.7 Summary of DBMS_DATA_MINING Subprograms
This table summarizes the subprograms included in the DBMS_DATA_MINING
package.
The GET_*
interfaces are replaced by model views. Oracle recommends that users leverage model detail views instead. For more information, refer to “Model Detail Views” in Oracle Data Mining User’s Guide and "Static Data Dictionary Views: ALL_ALL_TABLES
to ALL_OUTLINES
" in Oracle Database Reference.
Table 43-34 DBMS_DATA_MINING Package Subprograms
Subprogram | Purpose |
---|---|
Adds a cost matrix to a classification model |
|
ADD_PARTITION Procedure |
Adds single or multiple partitions in an existing partition model |
Changes the reverse transformation expression to an expression that you specify |
|
Applies a model to a data set (scores the data) |
|
Computes the confusion matrix for a classification model |
|
COMPUTE_CONFUSION_MATRIX_PART Procedure |
Computes the evaluation matrix for partitioned models |
Computes lift for a classification model |
|
COMPUTE_LIFT_PART Procedure |
Computers lift for partitioned models |
Computes Receiver Operating Characteristic (ROC) for a classification model |
|
COMPUTE_ROC_PART Procedure |
Computes Receiver Operating Characteristic (ROC) for a partitioned model |
Creates a model |
|
CREATE_MODEL2 Procedure |
Creates a model without extra persistent stages |
Create Model Using Registration Information |
Fetches setting information from JSON object |
DROP_ALGORITHM Procedure |
Drops the registered algorithm information. |
DROP_PARTITION Procedure |
Drops a single partition |
Drops a model |
|
Exports a model to a dump file |
|
Exports a model in a serialized format |
|
Fetches and reads JSON schema from |
|
Returns the cost matrix for a model |
|
Imports a model into a user schema |
|
Imports a serialized model back into the database |
|
Displays flexibility in creating JSON schema for R Extensible |
|
Registers a new algorithm |
|
Ranks the predictions from the |
|
Removes a cost matrix from a model |
|
Renames a model |
Deprecated GET_MODEL_DETAILS
Starting from Oracle Database 12c Release 2, the following GET_MODEL_DETAILS
are deprecated:
Table 43-35 Deprecated GET_MODEL_DETAILS
Functions
Subprogram | Purpose |
---|---|
Returns the rules from an association model |
|
Returns the frequent itemsets for an association model |
|
Returns details about an Attribute Importance model |
|
Returns details about an Expectation Maximization model |
|
Returns details about the parameters of an Expectation Maximization model |
|
Returns details about the projects of an Expectation Maximization model |
|
Returns details about a Generalized Linear Model |
|
Returns high-level statistics about a model |
|
Returns details about a k-Means model |
|
Returns details about a Naive Bayes model |
|
Returns details about a Non-Negative Matrix Factorization model |
|
Returns details about an O-Cluster model |
|
Returns the settings used to build the given model This function is replaced with |
|
Returns the list of columns from the build input table This function is replaced with |
|
Returns details about a Singular Value Decomposition model |
|
Returns details about a Support Vector Machine model with a linear kernel |
|
Returns the transformations embedded in a model This function is replaced with |
|
Returns details about a Decision Tree model |
|
Converts between two different transformation specification formats |
Related Topics
43.7.1 ADD_COST_MATRIX Procedure
The ADD_COST_MATRIX
procedure associates a cost matrix table with a Classification model. The cost matrix biases the model by assigning costs or benefits to specific model outcomes.
The cost matrix is stored with the model and taken into account when the model is scored.
You can also specify a cost matrix inline when you invoke a Data Mining SQL function for scoring. To view the scoring matrix for a model, query the DM$VC
prefixed model view. Refer to Model Detail View for Classification Algorithm.
To obtain the default scoring matrix for a model, query the DM$VC
prefixed model view. To remove the default scoring matrix from a model, use the REMOVE_COST_MATRIX
procedure. See "GET_MODEL_COST_MATRIX Function" and "REMOVE_COST_MATRIX Procedure".
See Also:
-
"Biasing a Classification Model" in Oracle Data Mining Concepts for more information about costs
-
Oracle Database SQL Language Reference for syntax of inline cost matrix
-
Oracle Data Mining User’s Guide
Syntax
DBMS_DATA_MINING.ADD_COST_MATRIX ( model_name IN VARCHAR2, cost_matrix_table_name IN VARCHAR2, cost_matrix_schema_name IN VARCHAR2 DEFAULT NULL); partition_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-36 ADD_COST_MATRIX Procedure Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is assumed. |
|
Name of the cost matrix table (described in Table 43-37). |
|
Schema of the cost matrix table. If no schema is specified, then the current schema is used. |
|
Name of the partition in a partitioned model |
Usage Notes
-
If the model is not in your schema, then
ADD_COST_MATRIX
requires theALTER ANY MINING MODEL
system privilege or theALTER
object privilege for the mining model. -
The cost matrix table must have the columns shown in Table 43-37.
Table 43-37 Required Columns in a Cost Matrix Table
Column Name Datatype ACTUAL_TARGET_VALUE
Valid target data type
PREDICTED_TARGET_VALUE
Valid target data type
COST
NUMBER,
FLOAT
,BINARY_DOUBLE
, orBINARY_FLOAT
See Also:
Oracle Data Mining User's Guide for valid target datatypes
-
The types of the actual and predicted target values must be the same as the type of the model target. For example, if the target of the model is
BINARY_DOUBLE
, then the actual and predicted values must beBINARY_DOUBLE
. If the actual and predicted values areCHAR
orVARCHAR
, thenADD_COST_MATRIX
treats them asVARCHAR2
internally.If the types do not match, or if the actual or predicted value is not a valid target value, then the
ADD_COST_MATRIX
procedure raises an error.Note:
If a reverse transformation is associated with the target, then the actual and predicted values must be consistent with the target after the reverse transformation has been applied.
See “Reverse Transformations and Model Transparency” under the “About Transformation Lists” section in DBMS_DATA_MINING_TRANSFORM Operational Notes for more information.
-
Since a benefit can be viewed as a negative cost, you can specify a benefit for a given outcome by providing a negative number in the
costs
column of the cost matrix table. -
All Classification algorithms can use a cost matrix for scoring. The Decision Tree algorithm can also use a cost matrix at build time. If you want to build a Decision Tree model with a cost matrix, specify the cost matrix table name in the
CLAS_COST_TABLE_NAME
setting in the settings table for the model. See Table 43-7.The cost matrix used to create a Decision Tree model becomes the default scoring matrix for the model. If you want to specify different costs for scoring, use the
REMOVE_COST_MATRIX
procedure to remove the cost matrix and theADD_COST_MATRIX
procedure to add a new one. -
Scoring on a partitioned model is partition-specific. Scoring cost matrices can be added to or removed from an individual partition in a partitioned model. If
PARTITION_NAME
is NOT NULL, then the model must be a partitioned model. TheCOST_MATRIX
is added to that partition of the partitioned model.If the
PARTITION_NAME
is NULL, but the model is a partitioned model, then theCOST_MATRIX
table is added to every partition in the model.
Example
This example creates a cost matrix table called COSTS_NB
and adds it to a Naive Bayes model called NB_SH_CLAS_SAMPLE
. The model has a binary target: 1 means that the customer responds to a promotion; 0 means that the customer does not respond. The cost matrix assigns a cost of .25 to misclassifications of customers who do not respond and a cost of .75 to misclassifications of customers who do respond. This means that it is three times more costly to misclassify responders than it is to misclassify non-responders.
CREATE TABLE costs_nb ( actual_target_value NUMBER, predicted_target_value NUMBER, cost NUMBER); INSERT INTO costs_nb values (0, 0, 0); INSERT INTO costs_nb values (0, 1, .25); INSERT INTO costs_nb values (1, 0, .75); INSERT INTO costs_nb values (1, 1, 0); COMMIT; EXEC dbms_data_mining.add_cost_matrix('nb_sh_clas_sample', 'costs_nb'); SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age FROM mining_data_apply_v WHERE PREDICTION(nb_sh_clas_sample COST MODEL USING cust_marital_status, education, household_size) = 1 GROUP BY cust_gender ORDER BY cust_gender; C CNT AVG_AGE - ---------- ---------- F 72 39 M 555 44
43.7.2 ADD_PARTITION Procedure
ADD_PARTITION
procedure supports a single or multiple partition addition to an existing partitioned model.
The ADD_PARTITION
procedure derives build settings and user-defined expressions from the existing model. The target column must exist in the input data query when adding partitions to a supervised model.
Syntax
DBMS_DATA_MINING.ADD_PARTITION (
model_name IN VARCHAR2,
data_query IN CLOB,
add_options IN VARCHAR2 DEFAULT ERROR);
Parameters
Table 43-38 ADD_PARTITION Procedure Parameters
Parameter | Description |
---|---|
model_name |
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
data_query |
An arbitrary SQL statement that provides data to the model build. The user must have privilege to evaluate this query. |
add_options |
Allows users to control the conditional behavior of ADD for cases where rows in the input dataset conflict with existing partitions in the model. The following are the possible values:
Note: For better performance, Oracle recommends usingDROP_PARTITION followed by the ADD_PARTITION instead of using the REPLACE option.
|
43.7.3 ALTER_REVERSE_EXPRESSION Procedure
This procedure replaces a reverse transformation expression with an expression that you specify. If the attribute does not have a reverse expression, the procedure creates one from the specified expression.
You can also use this procedure to customize the output of clustering, feature extraction, and anomaly detection models.
Syntax
DBMS_DATA_MINING.ALTER_REVERSE_EXPRESSION ( model_name VARCHAR2, expression CLOB, attribute_name VARCHAR2 DEFAULT NULL, attribute_subname VARCHAR2 DEFAULT NULL);
Parameters
Table 43-39 ALTER_REVERSE_EXPRESSION Procedure Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used. |
|
An expression to replace the reverse transformation associated with the attribute. |
|
Name of the attribute. Specify |
|
Name of the nested attribute if |
Usage Notes
-
For purposes of model transparency, Oracle Data Mining provides reverse transformations for transformations that are embedded in a model. Reverse transformations are applied to the attributes returned in model details (
GET_MODEL_DETAILS_
* functions) and to the scored target of predictive models.See Also:
“About Transformation Lists” under DBMS_DATA_MINING_TRANSFORM Operational Notes
-
If you alter the reverse transformation for the target of a model that has a cost matrix, you must specify a transformation expression that has the same type as the actual and predicted values in the cost matrix. Also, the reverse transformation that you specify must result in values that are present in the cost matrix.
See Also:
"ADD_COST_MATRIX Procedure" and Oracle Data Mining Concepts for information about cost matrixes.
-
To prevent reverse transformation of an attribute, you can specify
NULL
forexpression
. -
The reverse transformation expression can contain a reference to a PL/SQL function that returns a valid Oracle datatype. For example, you could define a function like the following for a categorical attribute named
blood_pressure
that has values 'Low', 'Medium' and 'High'.CREATE OR REPLACE FUNCTION numx(c char) RETURN NUMBER IS BEGIN CASE c WHEN ''Low'' THEN RETURN 1; WHEN ''Medium'' THEN RETURN 2; WHEN ''High'' THEN RETURN 3; ELSE RETURN null; END CASE; END numx;
Then you could invoke
ALTER_REVERSE_EXPRESION
forblood_pressure
as follows.EXEC dbms_data_mining.alter_reverse_expression( '<model_name>', 'NUMX(blood_pressure)', 'blood_pressure');
-
You can use
ALTER_REVERSE_EXPRESSION
to label clusters produced by clustering models and features produced by feature extraction.You can use
ALTER_REVERSE_EXPRESSION
to replace the zeros and ones returned by anomaly-detection models. By default, anomaly-detection models label anomalous records with 0 and all other records with 1.See Also:
Oracle Data Mining Concepts for information about anomaly detection
Examples
-
In this example, the target (
affinity_card
) of the modelCLASS_MODEL
is manipulated internally asyes
orno
instead of1
or0
but returned as1
s and0
s when scored. TheALTER_REVERSE_EXPRESSION
procedure causes the target values to be returned asTRUE
orFALSE
.The data sets
MINING_DATA_BUILD
andMINING_DATA_TEST
are included with the Oracle Data Mining sample programs. See Oracle Data Mining User's Guide for information about the sample programs.DECLARE v_xlst dbms_data_mining_transform.TRANSFORM_LIST; BEGIN dbms_data_mining_transform.SET_TRANSFORM(v_xlst, 'affinity_card', NULL, 'decode(affinity_card, 1, ''yes'', ''no'')', 'decode(affinity_card, ''yes'', 1, 0)'); dbms_data_mining.CREATE_MODEL( model_name => 'CLASS_MODEL', mining_function => dbms_data_mining.classification, data_table_name => 'mining_data_build', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', settings_table_name => NULL, data_schema_name => 'dmuser', settings_schema_name => NULL, xform_list => v_xlst ); END; / SELECT cust_income_level, occupation, PREDICTION(CLASS_MODEL USING *) predict_response FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M' ORDER BY cust_income_level; CUST_INCOME_LEVEL OCCUPATION PREDICT_RESPONSE ------------------------------ --------------------- -------------------- A: Below 30,000 Transp. 1 E: 90,000 - 109,999 Transp. 1 E: 90,000 - 109,999 Sales 1 G: 130,000 - 149,999 Handler 0 G: 130,000 - 149,999 Crafts 0 H: 150,000 - 169,999 Prof. 1 J: 190,000 - 249,999 Prof. 1 J: 190,000 - 249,999 Sales 1 BEGIN dbms_data_mining.ALTER_REVERSE_EXPRESSION ( model_name => 'CLASS_MODEL', expression => 'decode(affinity_card, ''yes'', ''TRUE'', ''FALSE'')', attribute_name => 'affinity_card'); END; / column predict_response on column predict_response format a20 SELECT cust_income_level, occupation, PREDICTION(CLASS_MODEL USING *) predict_response FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M' ORDER BY cust_income_level; CUST_INCOME_LEVEL OCCUPATION PREDICT_RESPONSE ------------------------------ --------------------- -------------------- A: Below 30,000 Transp. TRUE E: 90,000 - 109,999 Transp. TRUE E: 90,000 - 109,999 Sales TRUE G: 130,000 - 149,999 Handler FALSE G: 130,000 - 149,999 Crafts FALSE H: 150,000 - 169,999 Prof. TRUE J: 190,000 - 249,999 Prof. TRUE J: 190,000 - 249,999 Sales TRUE
-
This example specifies labels for the clusters that result from the
sh_clus
model. The labels consist of the word "Cluster" and the internal numeric identifier for the cluster.BEGIN dbms_data_mining.ALTER_REVERSE_EXPRESSION( 'sh_clus', '''Cluster ''||value'); END; / SELECT cust_id, cluster_id(sh_clus using *) cluster_id FROM sh_aprep_num WHERE cust_id < 100011 ORDER by cust_id; CUST_ID CLUSTER_ID ------- ------------------------------------------------ 100001 Cluster 18 100002 Cluster 14 100003 Cluster 14 100004 Cluster 18 100005 Cluster 19 100006 Cluster 7 100007 Cluster 18 100008 Cluster 14 100009 Cluster 8 100010 Cluster 8
43.7.4 APPLY Procedure
The APPLY
procedure applies a mining model to the data of interest, and generates the results in a table. The APPLY
procedure is also referred to as scoring.
For predictive mining functions, the APPLY
procedure generates predictions in a target column. For descriptive mining functions such as Clustering, the APPLY
process assigns each case to a cluster with a probability.
In Oracle Data Mining, the APPLY
procedure is not applicable to Association models and Attribute Importance models.
Note:
Scoring can also be performed directly in SQL using the Data Mining functions. See
-
"Data Mining Functions" in Oracle Database SQL Language Reference
-
"Scoring and Deployment" in Oracle Data Mining User's Guide
Syntax
DBMS_DATA_MINING.APPLY ( model_name IN VARCHAR2, data_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, result_table_name IN VARCHAR2, data_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-40 APPLY Procedure Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Name of table or view containing the data to be scored |
|
Name of the case identifier column |
|
Name of the table in which to store apply results |
|
Name of the schema containing the data to be scored |
Usage Notes
-
The data provided for
APPLY
must undergo the same preprocessing as the data used to create and test the model. When you use Automatic Data Preparation, the preprocessing required by the algorithm is handled for you by the model: both at build time and apply time. (See "Automatic Data Preparation".) -
APPLY
creates a table in the user's schema to hold the results. The columns are algorithm-specific.The columns in the results table are listed in Table 43-41 through Table 43-45. The case ID column name in the results table will match the case ID column name provided by you. The type of the incoming case ID column is also preserved in
APPLY
output.Note:
Make sure that the case ID column does not have the same name as one of the columns that will be created by
APPLY
. For example, when applying a Classification model, the case ID in the scoring data must not bePREDICTION
orPROBABILITY
(See Table 43-41). -
The datatype for the
PREDICTION
,CLUSTER_ID
, andFEATURE_ID
output columns is influenced by any reverse expression that is embedded in the model by the user. If the user does not provide a reverse expression that alters the scored value type, then the types will conform to the descriptions in the following tables. See "ALTER_REVERSE_EXPRESSION Procedure". -
If the model is partitioned, the
result_table_name
can contain results from different partitions depending on the data from the input data table. An additional column calledPARTITION_NAME
is added to the result table indicating the partition name that is associated with each row.For a non-partitioned model, the behavior does not change.
Classification
The results table for Classification has the columns described in Table 43-41. If the target of the model is categorical, the PREDICTION
column will have a VARCHAR2
datatype. If the target has a binary type, the PREDICTION
column will have the binary type of the target.
Table 43-41 APPLY Results Table for Classification
Column Name | Datatype |
---|---|
|
Type of the case ID |
|
Type of the target |
|
|
Anomaly Detection
The results table for Anomaly Detection has the columns described in Table 43-42.
Table 43-42 APPLY Results Table for Anomaly Detection
Column Name | Datatype |
---|---|
|
Type of the case ID |
|
|
|
|
Regression
The results table for Regression has the columns described in APPLY Procedure.
Table 43-43 APPLY Results Table for Regression
Column Name | Datatype |
---|---|
|
Type of the case ID |
|
Type of the target |
Clustering
Clustering is an unsupervised mining function, and hence there are no targets. The results of an APPLY
procedure will contain simply the cluster identifier corresponding to a case, and the associated probability. The results table has the columns described in Table 43-44.
Table 43-44 APPLY Results Table for Clustering
Column Name | Datatype |
---|---|
|
Type of the case ID |
|
|
|
|
Feature Extraction
Feature Extraction is also an unsupervised mining function, and hence there are no targets. The results of an APPLY
procedure will contain simply the feature identifier corresponding to a case, and the associated match quality. The results table has the columns described in Table 43-45.
Table 43-45 APPLY Results Table for Feature Extraction
Column Name | Datatype |
---|---|
|
Type of the case ID |
|
|
|
|
Examples
This example applies the GLM Regression model GLMR_SH_REGR_SAMPLE
to the data in the MINING_DATA_APPLY_V
view. The APPLY
results are output of the table REGRESSION_APPLY_RESULT
.
SQL> BEGIN DBMS_DATA_MINING.APPLY ( model_name => 'glmr_sh_regr_sample', data_table_name => 'mining_data_apply_v', case_id_column_name => 'cust_id', result_table_name => 'regression_apply_result'); END; / SQL> SELECT * FROM regression_apply_result WHERE cust_id > 101485; CUST_ID PREDICTION ---------- ---------- 101486 22.8048824 101487 25.0261101 101488 48.6146619 101489 51.82595 101490 22.6220714 101491 61.3856816 101492 24.1400748 101493 58.034631 101494 45.7253149 101495 26.9763318 101496 48.1433425 101497 32.0573434 101498 49.8965531 101499 56.270656 101500 21.1153047
43.7.5 COMPUTE_CONFUSION_MATRIX Procedure
This procedure computes a confusion matrix, stores it in a table in the user's schema, and returns the model accuracy.
A confusion matrix is a test metric for classification models. It compares the predictions generated by the model with the actual target values in a set of test data. The confusion matrix lists the number of times each class was correctly predicted and the number of times it was predicted to be one of the other classes.
COMPUTE_CONFUSION_MATRIX
accepts three input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing either probabilities or costs
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
-
(Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.
See Also:
Oracle Data Mining Concepts for more details about confusion matrixes and other test metrics for classification
Syntax
DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX ( accuracy OUT NUMBER, apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, confusion_matrix_table_name IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', cost_matrix_table_name IN VARCHAR2 DEFAULT NULL, apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL, cost_matrix_schema_name IN VARCHAR2 DEFAULT NULL, score_criterion_type IN VARCHAR2 DEFAULT 'PROBABILITY');
Parameters
Table 43-46 COMPUTE_CONFUSION_MATRIX Procedure Parameters
Parameter | Description |
---|---|
|
Output parameter containing the overall percentage accuracy of the predictions. |
|
Table containing the predictions. |
|
Table containing the known target values from the test data. |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the confusion matrix. The table will be created by the procedure in the user's schema. The columns in the confusion matrix table are described in the Usage Notes. |
|
Column containing the predictions in the apply results table. The default column name is |
|
Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions. By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted. The The default column name is ' See the Usage Notes for additional information. |
|
(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the The columns in a cost matrix table are described in the Usage Notes. |
|
Schema of the apply results table. If null, the user's schema is assumed. |
|
Schema of the table containing the known targets. If null, the user's schema is assumed. |
|
Schema of the cost matrix table, if one is provided. If null, the user's schema is assumed. |
|
Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the The default value of If See the Usage Notes and the Examples. |
Usage Notes
-
The predictive information you pass to
COMPUTE_CONFUSION_MATRIX
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the confusion matrix. -
Instead of passing a cost matrix to
COMPUTE_CONFUSION_MATRIX
, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQLPREDICTION_COST
function to populate the score criterion column. -
The predictions that you pass to
COMPUTE_CONFUSION_MATRIX
are in a table or view specified inapply_result_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
A cost matrix must have the columns described in Table 43-47.
Table 43-47 Columns in a Cost Matrix
Column Name Datatype actual_target_value
Type of the target column in the build data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.
cost
BINARY_DOUBLE
See Also:
Oracle Data Mining User's Guide for valid target datatypes
Oracle Data Mining Concepts for more information about cost matrixes
-
The confusion matrix created by
COMPUTE_CONFUSION_MATRIX
has the columns described in Table 43-48.Table 43-48 Columns in a Confusion Matrix
Column Name Datatype actual_target_value
Type of the target column in the build data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target is the same as the type of the actual target unless the predicted target has an associated reverse transformation.
value
BINARY_DOUBLE
See Also:
Oracle Data Mining Concepts for more information about confusion matrixes
Examples
These examples use the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
Compute a Confusion Matrix Based on Probabilities
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, PREDICTION(nb_sh_clas_sample USING *) prediction, PREDICTION_PROBABILITY(nb_sh_clas_sample USING *) probability FROM mining_data_test_v;
Using probabilities as the scoring criterion, you can compute the confusion matrix as follows.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'PROBABILITY' cost_matrix_table_name => null, apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'PROBABILITY'); DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4)); END; /
The confusion matrix and model accuracy are shown as follows.
**** MODEL ACCURACY ****: .7847 SQL>SELECT * from nb_confusion_matrix; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 60 0 0 891 1 1 286 0 1 263
Compute a Confusion Matrix Based on a Cost Matrix Table
The confusion matrix in the previous example shows a high rate of false positives. For 263 cases, the model predicted 1 when the actual value was 0. You could use a cost matrix to minimize this type of error.
The cost matrix table nb_cost_matrix
specifies that a false positive is 3 times more costly than a false negative.
SQL> SELECT * from nb_cost_matrix; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE COST ------------------- ---------------------- ---------- 0 0 0 0 1 .75 1 0 .25 1 1 0
This statement shows how to generate the predictions using APPLY
.
BEGIN DBMS_DATA_MINING.APPLY( model_name => 'nb_sh_clas_sample', data_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', result_table_name => 'nb_apply_results'); END; /
This statement computes the confusion matrix using the cost matrix table. The score criterion column is named 'PROBABILITY
', which is the name generated by APPLY
.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'PROBABILITY', cost_matrix_table_name => 'nb_cost_matrix', apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'COST'); DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4)); END; /
The resulting confusion matrix shows a decrease in false positives (212 instead of 263).
**** MODEL ACCURACY ****: .798 SQL> SELECT * FROM nb_confusion_matrix; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 91 0 0 942 1 1 255 0 1 212
Compute a Confusion Matrix Based on Embedded Costs
You can use the ADD_COST_MATRIX
procedure to embed a cost matrix in a model. The embedded costs can be used instead of probabilities for scoring. This statement adds the previously-defined cost matrix to the model.
BEGIN DBMS_DATA_MINING.ADD_COST_MATRIX ('nb_sh_clas_sample', 'nb_cost_matrix');END;/
The following statement applies the model to the test data using the embedded costs and stores the results in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, PREDICTION(nb_sh_clas_sample COST MODEL USING *) prediction, PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost FROM mining_data_test_v;
You can compute the confusion matrix using the embedded costs.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'COST', cost_matrix_table_name => null, apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'COST'); END; /
The results are:
**** MODEL ACCURACY ****: .798 SQL> SELECT * FROM nb_confusion_matrix; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 91 0 0 942 1 1 255 0 1 212
43.7.6 COMPUTE_CONFUSION_MATRIX_PART Procedure
The COMPUTE_CONFUSION_MATRIX_PART
procedure computes a confusion matrix, stores it in a table in the user's schema, and returns the model accuracy.
COMPUTE_CONFUSION_MATRIX_PART
provides support to computation of evaluation metrics per-partition for partitioned models. For non-partitioned models, refer to COMPUTE_CONFUSION_MATRIX Procedure.
A confusion matrix is a test metric for Classification models. It compares the predictions generated by the model with the actual target values in a set of test data. The confusion matrix lists the number of times each class was correctly predicted and the number of times it was predicted to be one of the other classes.
COMPUTE_CONFUSION_MATRIX_PART
accepts three input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing either probabilities or costs
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
-
(Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.
See Also:
Oracle Data Mining Concepts for more details about confusion matrixes and other test metrics for classification
Syntax
DBMS_DATA_MINING.compute_confusion_matrix_part( accuracy OUT DM_NESTED_NUMERICALS, apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, confusion_matrix_table_name IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', score_partition_column_name IN VARCHAR2 DEFAULT 'PARTITION_NAME', cost_matrix_table_name IN VARCHAR2 DEFAULT NULL, apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL, cost_matrix_schema_name IN VARCHAR2 DEFAULT NULL, score_criterion_type IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-49 COMPUTE_CONFUSION_MATRIX_PART Procedure Parameters
Parameter | Description |
---|---|
|
Output parameter containing the overall percentage accuracy of the predictions The output argument is changed from |
|
Table containing the predictions |
|
Table containing the known target values from the test data |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the confusion matrix. The table will be created by the procedure in the user's schema. The columns in the confusion matrix table are described in the Usage Notes. |
|
Column containing the predictions in the apply results table. The default column name is |
|
Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions. By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, then the class with the lowest cost is predicted. The The default column name is See the Usage Notes for additional information. |
|
(Optional) Parameter indicating the column which contains the name of the partition. This column slices the input test results such that each partition has independent evaluation matrices computed. |
|
(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the The columns in a cost matrix table are described in the Usage Notes. |
|
Schema of the apply results table. If null, then the user's schema is assumed. |
|
Schema of the table containing the known targets. If null, then the user's schema is assumed. |
|
Schema of the cost matrix table, if one is provided. If null, then the user's schema is assumed. |
|
Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the The default value of If See the Usage Notes and the Examples. |
Usage Notes
-
The predictive information you pass to
COMPUTE_CONFUSION_MATRIX_PART
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the confusion matrix. -
Instead of passing a cost matrix to
COMPUTE_CONFUSION_MATRIX_PART
, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQLPREDICTION_COST
function to populate the score criterion column. -
The predictions that you pass to
COMPUTE_CONFUSION_MATRIX_PART
are in a table or view specified inapply_result_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
A cost matrix must have the columns described in Table 43-47.
Table 43-50 Columns in a Cost Matrix
Column Name Datatype actual_target_value
Type of the target column in the test data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.
cost
BINARY_DOUBLE
See Also:
Oracle Data Mining User's Guide for valid target datatypes
Oracle Data Mining Concepts for more information about cost matrixes
-
The confusion matrix created by
COMPUTE_CONFUSION_MATRIX_PART
has the columns described in Table 43-48.Table 43-51 Columns in a Confusion Matrix Part
Column Name Datatype actual_target_value
Type of the target column in the test data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target is the same as the type of the actual target unless the predicted target has an associated reverse transformation.
value
BINARY_DOUBLE
See Also:
Oracle Data Mining Concepts for more information about confusion matrixes
Examples
These examples use the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
Compute a Confusion Matrix Based on Probabilities
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, PREDICTION(nb_sh_clas_sample USING *) prediction, PREDICTION_PROBABILITY(nb_sh_clas_sample USING *) probability FROM mining_data_test_v;
Using probabilities as the scoring criterion, you can compute the confusion matrix as follows.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'PROBABILITY' score_partition_column_name => 'PARTITION_NAME' cost_matrix_table_name => null, apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'PROBABILITY'); DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4)); END; /
The confusion matrix and model accuracy are shown as follows.
**** MODEL ACCURACY ****: .7847 SELECT * FROM NB_CONFUSION_MATRIX; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 60 0 0 891 1 1 286 0 1 263
Compute a Confusion Matrix Based on a Cost Matrix Table
The confusion matrix in the previous example shows a high rate of false positives. For 263 cases, the model predicted 1 when the actual value was 0. You could use a cost matrix to minimize this type of error.
The cost matrix table nb_cost_matrix
specifies that a false positive is 3 times more costly than a false negative.
SELECT * from NB_COST_MATRIX; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE COST ------------------- ---------------------- ---------- 0 0 0 0 1 .75 1 0 .25 1 1 0
This statement shows how to generate the predictions using APPLY
.
BEGIN DBMS_DATA_MINING.APPLY( model_name => 'nb_sh_clas_sample', data_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', result_table_name => 'nb_apply_results'); END; /
This statement computes the confusion matrix using the cost matrix table. The score criterion column is named 'PROBABILITY
', which is the name generated by APPLY
.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'PROBABILITY', score_partition_column_name => 'PARTITION_NAME' cost_matrix_table_name => 'nb_cost_matrix', apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'COST'); DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4)); END; /
The resulting confusion matrix shows a decrease in false positives (212 instead of 263).
**** MODEL ACCURACY ****: .798 SELECT * FROM NB_CONFUSION_MATRIX; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 91 0 0 942 1 1 255 0 1 212
Compute a Confusion Matrix Based on Embedded Costs
You can use the ADD_COST_MATRIX
procedure to embed a cost matrix in a model. The embedded costs can be used instead of probabilities for scoring. This statement adds the previously-defined cost matrix to the model.
BEGIN DBMS_DATA_MINING.ADD_COST_MATRIX ('nb_sh_clas_sample', 'nb_cost_matrix'); END;/
The following statement applies the model to the test data using the embedded costs and stores the results in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, PREDICTION(nb_sh_clas_sample COST MODEL USING *) prediction, PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost FROM mining_data_test_v;
You can compute the confusion matrix using the embedded costs.
DECLARE v_accuracy NUMBER; BEGIN DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX_PART ( accuracy => v_accuracy, apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', confusion_matrix_table_name => 'nb_confusion_matrix', score_column_name => 'PREDICTION', score_criterion_column_name => 'COST', score_partition_column_name => 'PARTITION_NAME' cost_matrix_table_name => null, apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'COST'); END; /
The results are:
**** MODEL ACCURACY ****: .798 SELECT * FROM NB_CONFUSION_MATRIX; ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE VALUE ------------------- ---------------------- ---------- 1 0 91 0 0 942 1 1 255 0 1 212
43.7.7 COMPUTE_LIFT Procedure
This procedure computes lift and stores the results in a table in the user's schema.
Lift is a test metric for binary classification models. To compute lift, one of the target values must be designated as the positive class. COMPUTE_LIFT
compares the predictions generated by the model with the actual target values in a set of test data. Lift measures the degree to which the model's predictions of the positive class are an improvement over random chance.
Lift is computed on scoring results that have been ranked by probability (or cost) and divided into quantiles. Each quantile includes the scores for the same number of cases.
COMPUTE_LIFT
calculates quantile-based and cumulative statistics. The number of quantiles and the positive class are user-specified. Additionally, COMPUTE_LIFT
accepts three input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing either probabilities or costs associated with the predictions
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
-
(Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.
See Also:
Oracle Data Mining Concepts for more details about lift and test metrics for classification
Syntax
DBMS_DATA_MINING.COMPUTE_LIFT ( apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, lift_table_name IN VARCHAR2, positive_target_value IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', num_quantiles IN NUMBER DEFAULT 10, cost_matrix_table_name IN VARCHAR2 DEFAULT NULL, apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL, cost_matrix_schema_name IN VARCHAR2 DEFAULT NULL score_criterion_type IN VARCHAR2 DEFAULT 'PROBABILITY');
Parameters
Table 43-52 COMPUTE_LIFT Procedure Parameters
Parameter | Description |
---|---|
|
Table containing the predictions. |
|
Table containing the known target values from the test data. |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the lift statistics. The table will be created by the procedure in the user's schema. The columns in the lift table are described in the Usage Notes. |
|
The positive class. This should be the class of interest, for which you want to calculate lift. If the target column is a |
|
Column containing the predictions in the apply results table. The default column name is ' |
|
Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions. By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted. The The default column name is ' See the Usage Notes for additional information. |
|
Number of quantiles to be used in calculating lift. The default is 10. |
|
(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the The columns in a cost matrix table are described in the Usage Notes. |
|
Schema of the apply results table. If null, the user's schema is assumed. |
|
Schema of the table containing the known targets. If null, the user's schema is assumed. |
|
Schema of the cost matrix table, if one is provided. If null, the user's schema is assumed. |
|
Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the The default value of If See the Usage Notes and the Examples. |
Usage Notes
-
The predictive information you pass to
COMPUTE_LIFT
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the lift. -
Instead of passing a cost matrix to
COMPUTE_LIFT
, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQLPREDICTION_COST
function to populate the score criterion column. -
The predictions that you pass to
COMPUTE_LIFT
are in a table or view specified inapply_results_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
A cost matrix must have the columns described in Table 43-53.
Table 43-53 Columns in a Cost Matrix
Column Name Datatype actual_target_value
Type of the target column in the build data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.
cost
NUMBER
See Also:
Oracle Data Mining Concepts for more information about cost matrixes
-
The table created by
COMPUTE_LIFT
has the columns described in Table 43-54Table 43-54 Columns in a Lift Table
Column Name Datatype quantile_number
NUMBER
probability_threshold
NUMBER
gain_cumulative
NUMBER
quantile_total_count
NUMBER
quantile_target_count
NUMBER
percent_records_cumulative
NUMBER
lift_cumulative
NUMBER
target_density_cumulative
NUMBER
targets_cumulative
NUMBER
non_targets_cumulative
NUMBER
lift_quantile
NUMBER
target_density
NUMBER
See Also:
Oracle Data Mining Concepts for details about the information in the lift table
-
When a cost matrix is passed to
COMPUTE_LIFT
, the cost threshold is returned in theprobability_threshold
column of the lift table.
Examples
This example uses the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
The example illustrates lift based on probabilities. For examples that show computation based on costs, see "COMPUTE_CONFUSION_MATRIX Procedure".
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, t.prediction, t.probability FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;
Using probabilities as the scoring criterion, you can compute lift as follows.
BEGIN DBMS_DATA_MINING.COMPUTE_LIFT ( apply_result_table_name => 'nb_apply_results', target_table_name => 'mining_data_test_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', lift_table_name => 'nb_lift', positive_target_value => to_char(1), score_column_name => 'PREDICTION', score_criterion_column_name => 'PROBABILITY', num_quantiles => 10, cost_matrix_table_name => null, apply_result_schema_name => null, target_schema_name => null, cost_matrix_schema_name => null, score_criterion_type => 'PROBABILITY'); END; /
This query displays some of the statistics from the resulting lift table.
SQL>SELECT quantile_number, probability_threshold, gain_cumulative, quantile_total_count FROM nb_lift; QUANTILE_NUMBER PROBABILITY_THRESHOLD GAIN_CUMULATIVE QUANTILE_TOTAL_COUNT --------------- --------------------- --------------- -------------------- 1 .989335775 .15034965 55 2 .980534911 .26048951 55 3 .968506098 .374125874 55 4 .958975196 .493006993 55 5 .946705997 .587412587 55 6 .927454174 .66958042 55 7 .904403627 .748251748 55 8 .836482525 .839160839 55 10 .500184953 1 54
43.7.8 COMPUTE_LIFT_PART Procedure
The COMPUTE_LIFT_PART
procedure computes Lift and stores the results in a table in the user's schema. This procedure provides support to the computation of evaluation metrics per-partition for partitioned models.
Lift is a test metric for binary Classification models. To compute Lift, one of the target values must be designated as the positive class. COMPUTE_LIFT_PART
compares the predictions generated by the model with the actual target values in a set of test data. Lift measures the degree to which the model's predictions of the positive class are an improvement over random chance.
Lift is computed on scoring results that have been ranked by probability (or cost) and divided into quantiles. Each quantile includes the scores for the same number of cases.
COMPUTE_LIFT_PART
calculates quantile-based and cumulative statistics. The number of quantiles and the positive class are user-specified. Additionally, COMPUTE_LIFT_PART
accepts three input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing either probabilities or costs associated with the predictions
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
-
(Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.
See Also:
Oracle Data Mining Concepts for more details about Lift and test metrics for classification
"COMPUTE_CONFUSION_MATRIX Procedure"
Syntax
DBMS_DATA_MINING.COMPUTE_LIFT_PART ( apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, lift_table_name IN VARCHAR2, positive_target_value IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', score_partition_column_name IN VARCHAR2 DEFAULT 'PARTITION_NAME', num_quantiles IN NUMBER DEFAULT 10, cost_matrix_table_name IN VARCHAR2 DEFAULT NULL, apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL, cost_matrix_schema_name IN VARCHAR2 DEFAULT NULL, score_criterion_type IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-55 COMPUTE_LIFT_PART Procedure Parameters
Parameter | Description |
---|---|
|
Table containing the predictions |
|
Table containing the known target values from the test data |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the Lift statistics. The table will be created by the procedure in the user's schema. The columns in the Lift table are described in the Usage Notes. |
|
The positive class. This should be the class of interest, for which you want to calculate Lift. If the target column is a |
|
Column containing the predictions in the apply results table. The default column name is |
|
Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions. By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, then the class with the lowest cost is predicted. The The default column name is See the Usage Notes for additional information. |
|
Optional parameter indicating the column containing the name of the partition. This column slices the input test results such that each partition has independent evaluation matrices computed. |
|
Number of quantiles to be used in calculating Lift. The default is 10. |
|
(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the The columns in a cost matrix table are described in the Usage Notes. |
|
Schema of the apply results table If null, then the user's schema is assumed. |
|
Schema of the table containing the known targets If null, then the user's schema is assumed. |
|
Schema of the cost matrix table, if one is provided If null, then the user's schema is assumed. |
|
Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the The default value of If See the Usage Notes and the Examples. |
Usage Notes
-
The predictive information you pass to
COMPUTE_LIFT_PART
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the Lift. -
Instead of passing a cost matrix to
COMPUTE_LIFT_PART
, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQLPREDICTION_COST
function to populate the score criterion column. -
The predictions that you pass to
COMPUTE_LIFT_PART
are in a table or view specified inapply_results_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
A cost matrix must have the columns described in Table 43-53.
Table 43-56 Columns in a Cost Matrix
Column Name Datatype actual_target_value
Type of the target column in the test data
predicted_target_value
Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.
cost
NUMBER
See Also:
Oracle Data Mining Concepts for more information about cost matrixes
-
The table created by
COMPUTE_LIFT_PART
has the columns described in Table 43-54Table 43-57 Columns in a COMPUTE_LIFT_PART Table
Column Name Datatype quantile_number
NUMBER
probability_threshold
NUMBER
gain_cumulative
NUMBER
quantile_total_count
NUMBER
quantile_target_count
NUMBER
percent_records_cumulative
NUMBER
lift_cumulative
NUMBER
target_density_cumulative
NUMBER
targets_cumulative
NUMBER
non_targets_cumulative
NUMBER
lift_quantile
NUMBER
target_density
NUMBER
See Also:
Oracle Data Mining Concepts for details about the information in the Lift table
-
When a cost matrix is passed to
COMPUTE_LIFT_PART
, the cost threshold is returned in theprobability_threshold
column of the Lift table.
Examples
This example uses the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
The example illustrates Lift based on probabilities. For examples that show computation based on costs, see "COMPUTE_CONFUSION_MATRIX Procedure".
For a partitioned model example, see "COMPUTE_CONFUSION_MATRIX_PART Procedure".
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, t.prediction, t.probability FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;
Using probabilities as the scoring criterion, you can compute Lift as follows.
BEGIN
DBMS_DATA_MINING.COMPUTE_LIFT_PART (
apply_result_table_name => 'nb_apply_results',
target_table_name => 'mining_data_test_v',
case_id_column_name => 'cust_id',
target_column_name => 'affinity_card',
lift_table_name => 'nb_lift',
positive_target_value => to_char(1),
score_column_name => 'PREDICTION',
score_criterion_column_name => 'PROBABILITY',
score_partition_column_name => 'PARTITITON_NAME',
num_quantiles => 10,
cost_matrix_table_name => null,
apply_result_schema_name => null,
target_schema_name => null,
cost_matrix_schema_name => null,
score_criterion_type => 'PROBABILITY');
END;
/
This query displays some of the statistics from the resulting Lift table.
SELECT quantile_number, probability_threshold, gain_cumulative, quantile_total_count FROM nb_lift; QUANTILE_NUMBER PROBABILITY_THRESHOLD GAIN_CUMULATIVE QUANTILE_TOTAL_COUNT --------------- --------------------- --------------- -------------------- 1 .989335775 .15034965 55 2 .980534911 .26048951 55 3 .968506098 .374125874 55 4 .958975196 .493006993 55 5 .946705997 .587412587 55 6 .927454174 .66958042 55 7 .904403627 .748251748 55 8 .836482525 .839160839 55 10 .500184953 1 54
43.7.9 COMPUTE_ROC Procedure
This procedure computes the receiver operating characteristic (ROC), stores the results in a table in the user's schema, and returns a measure of the model accuracy.
ROC is a test metric for binary classification models. To compute ROC, one of the target values must be designated as the positive class. COMPUTE_ROC
compares the predictions generated by the model with the actual target values in a set of test data.
ROC measures the impact of changes in the probability threshold. The probability threshold is the decision point used by the model for predictions. In binary classification, the default probability threshold is 0.5. The value predicted for each case is the one with a probability greater than 50%.
ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. A false positive is a positive prediction for a case that is negative in the test data. A true positive is a positive prediction for a case that is positive in the test data.
COMPUTE_ROC
accepts two input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing probabilities
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
See Also:
Oracle Data Mining Concepts for more details about ROC and test metrics for classification
Syntax
DBMS_DATA_MINING.COMPUTE_ROC ( roc_area_under_curve OUT NUMBER, apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, roc_table_name IN VARCHAR2, positive_target_value IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-58 COMPUTE_ROC Procedure Parameters
Parameter | Description |
---|---|
|
Output parameter containing the area under the ROC curve (AUC). The AUC measures the likelihood that an actual positive will be predicted as positive. The greater the AUC, the greater the flexibility of the model in accommodating trade-offs between positive and negative class predictions. AUC can be especially important when one target class is rarer or more important to identify than another. |
|
Table containing the predictions. |
|
Table containing the known target values from the test data. |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the ROC output. The table will be created by the procedure in the user's schema. The columns in the ROC table are described in the Usage Notes. |
|
The positive class. This should be the class of interest, for which you want to calculate ROC. If the target column is a |
|
Column containing the predictions in the apply results table. The default column name is ' |
|
Column containing the scoring criterion in the apply results table. Contains the probabilities that determine the predictions. The default column name is ' |
|
Schema of the apply results table. If null, the user's schema is assumed. |
|
Schema of the table containing the known targets. If null, the user's schema is assumed. |
Usage Notes
-
The predictive information you pass to
COMPUTE_ROC
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the receiver operating characteristic. -
The predictions that you pass to
COMPUTE_ROC
are in a table or view specified inapply_results_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
The table created by
COMPUTE_ROC
has the columns shown in Table 43-59.Table 43-59 COMPUTE_ROC Output
Column Datatype probability
BINARY_DOUBLE
true_positives
NUMBER
false_negatives
NUMBER
false_positives
NUMBER
true_negatives
NUMBER
true_positive_fraction
NUMBER
false_positive_fraction
NUMBER
See Also:
Oracle Data Mining Concepts for details about the output of
COMPUTE_ROC
-
ROC is typically used to determine the most desirable probability threshold. This can be done by examining the true positive fraction and the false positive fraction. The true positive fraction is the percentage of all positive cases in the test data that were correctly predicted as positive. The false positive fraction is the percentage of all negative cases in the test data that were incorrectly predicted as positive.
Given a probability threshold, the following statement returns the positive predictions in an apply result table ordered by probability.
SELECT case_id_column_name FROM apply_result_table_name WHERE
probability
>probability_threshold
ORDER BYprobability
DESC; -
There are two approaches to identifying the most desirable probability threshold. Which approach you use depends on whether or not you know the relative cost of positive versus negative class prediction errors.
If the costs are known, you can apply the relative costs to the ROC table to compute the minimum cost probability threshold. Suppose the relative cost ratio is: Positive Class Error Cost / Negative Class Error Cost = 20. Then execute a query like this.
WITH
cost
AS ( SELECTprobability_threshold
, 20 *false_negatives
+false_positives
cost
FROMROC_table
GROUP BYprobability_threshold
),minCost
AS ( SELECT min(cost
)minCost
FROMcost
) SELECT max(probability_threshold
)probability_threshold FROMcost
,minCost
WHEREcost
=minCost
;If relative costs are not well known, you can simply scan the values in the ROC table (in sorted order) and make a determination about which of the displayed trade-offs (misclassified positives versus misclassified negatives) is most desirable.
SELECT * FROM
ROC_table
ORDER BYprobability_threshold
;
Examples
This example uses the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, t.prediction, t.probability FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;
Using the predictions and the target values from the test data, you can compute ROC as follows.
DECLARE
v_area_under_curve NUMBER;
BEGIN
DBMS_DATA_MINING.COMPUTE_ROC (
roc_area_under_curve => v_area_under_curve,
apply_result_table_name => 'nb_apply_results',
target_table_name => 'mining_data_test_v',
case_id_column_name => 'cust_id',
target_column_name => 'mining_data_test_v',
roc_table_name => 'nb_roc',
positive_target_value => '1',
score_column_name => 'PREDICTION',
score_criterion_column_name => 'PROBABILITY');
DBMS_OUTPUT.PUT_LINE('**** AREA UNDER ROC CURVE ****: ' ||
ROUND(v_area_under_curve,4));
END;
/
The resulting AUC and a selection of columns from the ROC table are shown as follows.
**** AREA UNDER ROC CURVE ****: .8212 SELECT PROBABILITY, TRUE_POSITIVE_FRACTION, FALSE_POSITIVE_FRACTION FROM NB_ROC; PROBABILITY TRUE_POSITIVE_FRACTION FALSE_POSITIVE_FRACTION ----------- ---------------------- ----------------------- .00000 1 1 .50018 .826589595 .227902946 .53851 .823699422 .221837088 .54991 .820809249 .217504333 .55628 .815028902 .215771231 .55628 .817919075 .215771231 .57563 .800578035 .214904679 .57563 .812138728 .214904679 . . . . . . . . .
43.7.10 COMPUTE_ROC_PART Procedure
The COMPUTE_ROC_PART
procedure computes Receiver Operating Characteristic (ROC), stores the results in a table in the user's schema, and returns a measure of the model accuracy. This procedure provides support to computation of evaluation metrics per-partition for partitioned models.
ROC is a test metric for binary classification models. To compute ROC, one of the target values must be designated as the positive class. COMPUTE_ROC_PART
compares the predictions generated by the model with the actual target values in a set of test data.
ROC measures the impact of changes in the probability threshold. The probability threshold is the decision point used by the model for predictions. In binary classification, the default probability threshold is 0.5
. The value predicted for each case is the one with a probability greater than 50%.
ROC can be plotted as a curve on an x-y axis. The false positive rate is placed on the x-axis. The true positive rate is placed on the y-axis. A false positive is a positive prediction for a case that is negative in the test data. A true positive is a positive prediction for a case that is positive in the test data.
COMPUTE_ROC_PART
accepts two input streams:
-
The predictions generated on the test data. The information is passed in three columns:
-
Case ID column
-
Prediction column
-
Scoring criterion column containing probabilities
-
-
The known target values in the test data. The information is passed in two columns:
-
Case ID column
-
Target column containing the known target values
-
See Also:
Oracle Data Mining Concepts for more details about ROC and test metrics for Classification
Syntax
DBMS_DATA_MINING.compute_roc_part( roc_area_under_curve OUT DM_NESTED_NUMERICALS, apply_result_table_name IN VARCHAR2, target_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2, roc_table_name IN VARCHAR2, positive_target_value IN VARCHAR2, score_column_name IN VARCHAR2 DEFAULT 'PREDICTION', score_criterion_column_name IN VARCHAR2 DEFAULT 'PROBABILITY', score_partition_column_name IN VARCHAR2 DEFAULT 'PARTITION_NAME', apply_result_schema_name IN VARCHAR2 DEFAULT NULL, target_schema_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-60 COMPUTE_ROC_PART Procedure Parameters
Parameter | Description |
---|---|
|
Output parameter containing the area under the ROC curve (AUC). The AUC measures the likelihood that an actual positive will be predicted as positive. The greater the AUC, the greater the flexibility of the model in accommodating trade-offs between positive and negative class predictions. AUC can be especially important when one target class is rarer or more important to identify than another. The output argument is changed from |
|
Table containing the predictions. |
|
Table containing the known target values from the test data. |
|
Case ID column in the apply results table. Must match the case identifier in the targets table. |
|
Target column in the targets table. Contains the known target values from the test data. |
|
Table containing the ROC output. The table will be created by the procedure in the user's schema. The columns in the ROC table are described in the Usage Notes. |
|
The positive class. This should be the class of interest, for which you want to calculate ROC. If the target column is a |
|
Column containing the predictions in the apply results table. The default column name is |
|
Column containing the scoring criterion in the apply results table. Contains the probabilities that determine the predictions. The default column name is |
|
Optional parameter indicating the column which contains the name of the partition. This column slices the input test results such that each partition has independent evaluation matrices computed. |
|
Schema of the apply results table. If null, then the user's schema is assumed. |
|
Schema of the table containing the known targets. If null, then the user's schema is assumed. |
Usage Notes
-
The predictive information you pass to
COMPUTE_ROC_PART
may be generated using SQLPREDICTION
functions, theDBMS_DATA_MINING.APPLY
procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the receiver operating characteristic. -
The predictions that you pass to
COMPUTE_ROC_PART
are in a table or view specified inapply_results_table_name
.CREATE TABLE
apply_result_table_name
AS (case_id_column_name
VARCHAR2, score_column_name VARCHAR2,score_criterion_column_name
VARCHAR2); -
The
COMPUTE_ROC_PART
table has the following columns:Table 43-61 COMPUTE_ROC_PART Output
Column Datatype probability
BINARY_DOUBLE
true_positives
NUMBER
false_negatives
NUMBER
false_positives
NUMBER
true_negatives
NUMBER
true_positive_fraction
NUMBER
false_positive_fraction
NUMBER
See Also:
Oracle Data Mining Concepts for details about the output of
COMPUTE_ROC_PART
-
ROC is typically used to determine the most desirable probability threshold. This can be done by examining the true positive fraction and the false positive fraction. The true positive fraction is the percentage of all positive cases in the test data that were correctly predicted as positive. The false positive fraction is the percentage of all negative cases in the test data that were incorrectly predicted as positive.
Given a probability threshold, the following statement returns the positive predictions in an apply result table ordered by probability.
SELECT case_id_column_name FROM apply_result_table_name WHERE
probability
>probability_threshold
ORDER BYprobability
DESC; -
There are two approaches to identify the most desirable probability threshold. The approach you use depends on whether you know the relative cost of positive versus negative class prediction errors.
If the costs are known, then you can apply the relative costs to the ROC table to compute the minimum cost probability threshold. Suppose the relative cost ratio is: Positive Class Error Cost / Negative Class Error Cost = 20. Then execute a query as follows:
WITH
cost
AS ( SELECTprobability_threshold
, 20 *false_negatives
+false_positives
cost
FROMROC_table
GROUP BYprobability_threshold
),minCost
AS ( SELECT min(cost
)minCost
FROMcost
) SELECT max(probability_threshold
)probability_threshold FROMcost
,minCost
WHEREcost
=minCost
;If relative costs are not well known, then you can simply scan the values in the ROC table (in sorted order) and make a determination about which of the displayed trade-offs (misclassified positives versus misclassified negatives) is most desirable.
SELECT * FROM
ROC_table
ORDER BYprobability_threshold
;
Examples
This example uses the Naive Bayes model nb_sh_clas_sample
, which is created by one of the Oracle Data Mining sample programs.
The following statement applies the model to the test data and stores the predictions and probabilities in a table.
CREATE TABLE nb_apply_results AS SELECT cust_id, t.prediction, t.probability FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;
Using the predictions and the target values from the test data, you can compute ROC as follows.
DECLARE
v_area_under_curve NUMBER;
BEGIN
DBMS_DATA_MINING.COMPUTE_ROC_PART (
roc_area_under_curve => v_area_under_curve,
apply_result_table_name => 'nb_apply_results',
target_table_name => 'mining_data_test_v',
case_id_column_name => 'cust_id',
target_column_name => 'affinity_card',
roc_table_name => 'nb_roc',
positive_target_value => '1',
score_column_name => 'PREDICTION',
score_criterion_column_name => 'PROBABILITY');
score_partition_column_name => 'PARTITION_NAME'
DBMS_OUTPUT.PUT_LINE('**** AREA UNDER ROC CURVE ****: ' ||
ROUND(v_area_under_curve,4));
END;
/
The resulting AUC and a selection of columns from the ROC table are shown as follows.
**** AREA UNDER ROC CURVE ****: .8212 SELECT PROBABILITY, TRUE_POSITIVE_FRACTION, FALSE_POSITIVE_FRACTION FROM NB_ROC; PROBABILITY TRUE_POSITIVE_FRACTION FALSE_POSITIVE_FRACTION ----------- ---------------------- ----------------------- .00000 1 1 .50018 .826589595 .227902946 .53851 .823699422 .221837088 .54991 .820809249 .217504333 .55628 .815028902 .215771231 .55628 .817919075 .215771231 .57563 .800578035 .214904679 .57563 .812138728 .214904679 . . . . . . . . .
43.7.11 CREATE_MODEL Procedure
This procedure creates a mining model with a given mining function.
Syntax
DBMS_DATA_MINING.CREATE_MODEL ( model_name IN VARCHAR2, mining_function IN VARCHAR2, data_table_name IN VARCHAR2, case_id_column_name IN VARCHAR2, target_column_name IN VARCHAR2 DEFAULT NULL, settings_table_name IN VARCHAR2 DEFAULT NULL, data_schema_name IN VARCHAR2 DEFAULT NULL, settings_schema_name IN VARCHAR2 DEFAULT NULL, xform_list IN TRANSFORM_LIST DEFAULT NULL);
Parameters
Table 43-62 CREATE_MODEL Procedure Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. See the Usage Notes for model naming restrictions. |
|
The mining function. Values are listed in Table 43-3. |
|
Table or view containing the build data |
|
Case identifier column in the build data. |
|
For supervised models, the target column in the build data. |
|
Table containing build settings for the model. |
|
Schema hosting the build data. If |
|
Schema hosting the settings table. If |
|
A list of transformations to be used in addition to or instead of automatic transformations, depending on the value of the The datatype of TYPE TRANFORM_REC IS RECORD ( attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), expression EXPRESSION_REC, reverse_expression EXPRESSION_REC, attribute_spec VARCHAR2(4000)); The The See Table 44-1for details about the |
Usage Notes
-
You can use the
attribute_spec
field of thexform_list
argument to identify an attribute as unstructured text or to disable Automatic Data Preparation for the attribute. Theattribute_spec
can have the following values:-
TEXT
: Indicates that the attribute contains unstructured text. TheTEXT
value may optionally be followed byPOLICY_NAME
,TOKEN_TYPE
,MAX_FEATURES
, andMIN_DOCUMENTS
parameters.TOKEN_TYPE
has the following possible values:NORMAL
,STEM
,THEME
,SYNONYM
,BIGRAM
,STEM_BIGRAM
.SYNONYM
may be optionally followed by a thesaurus name in square brackets.MAX_FEATURES
specifies the maximum number of tokens extracted from the text.MIN_DOCUMENTS
specifies the minimal number of documents in which every selected token shall occur. (For information about creating a text policy, seeCTX_DDL.CREATE_POLICY
in Oracle Text Reference).Oracle Data Mining can process columns of
VARCHAR2
/CHAR
,CLOB
,BLOB
, andBFILE
as text. If the column isVARCHAR2
orCHAR
and you do not specifyTEXT
, Oracle Data Mining will process the column as categorical data. If the column isCLOB
, then Oracle Data Mining will process it as text by default (You do not need to specify it asTEXT
. However, you do need to provide an Oracle Text Policy in the settings). If the column isBLOB
orBFILE
, you must specify it asTEXT
, otherwiseCREATE_MODEL
will return an error.If you specify
TEXT
for a nested column or for an attribute in a nested column,CREATE_MODEL
will return an error. -
NOPREP
: Disables ADP for the attribute. When ADP isOFF
, theNOPREP
value is ignored.You can specify
NOPREP
for a nested column, but not for an attribute in a nested column. If you specifyNOPREP
for an attribute in a nested column when ADP is on,CREATE_MODEL
will return an error.
-
-
You can obtain information about a model by querying the Data Dictionary views.
ALL/USER/DBA_MINING_MODELS ALL/USER/DBA_MINING_MODEL_ATTRIBUTES ALL/USER/DBA_MINING_MODEL_SETTINGS ALL/USER/DBA_MINING_MODEL_VIEWS ALL/USER/DBA_MINING_MODEL_PARTITIONS ALL/USER/DBA_MINING_MODEL_XFORMS
You can obtain information about model attributes by querying the model details through model views. Refer to Oracle Data Mining User’s Guide.
-
The naming rules for models are more restrictive than the naming rules for most database schema objects. A model name must satisfy the following additional requirements:
-
It must be 123 or fewer characters long.
-
It must be a nonquoted identifier. Oracle requires that nonquoted identifiers contain only alphanumeric characters, the underscore (_), dollar sign ($), and pound sign (#); the initial character must be alphabetic. Oracle strongly discourages the use of the dollar sign and pound sign in nonquoted literals.
Naming requirements for schema objects are fully documented in Oracle Database SQL Language Reference.
-
-
To build a partitioned model, you must provide additional settings.
The setting for partitioning columns are as follows:
INSERT INTO settings_table VALUES (‘ODMS_PARTITION_COLUMNS’, ‘GENDER, AGE’);
To set user-defined partition number for a model, the setting is as follows:
INSERT INTO settings_table VALUES ('ODMS_MAX_PARTITIONS’, '10’);
The default value for maximum number of partitions is
1000
. - By passing an
xform_list
toCREATE_MODEL
, you can specify a list of transformations to be performed on the input data. If thePREP_AUTO
setting isON
, the transformations are used in addition to the automatic transformations. If thePREP_AUTO
setting isOFF
, the specified transformations are the only ones implemented by the model. In both cases, transformation definitions are embedded in the model and executed automatically whenever the model is applied. See "Automatic Data Preparation". Other transforms that can be specified withxform_list
includeFORCE_IN
. Refer to Oracle Data Mining User’s Guide.
Examples
The first example builds a Classification model using the Support Vector Machine algorithm.
-- Create the settings table CREATE TABLE svm_model_settings ( setting_name VARCHAR2(30), setting_value VARCHAR2(30)); -- Populate the settings table -- Specify SVM. By default, Naive Bayes is used for classification. -- Specify ADP. By default, ADP is not used. BEGIN INSERT INTO svm_model_settings (setting_name, setting_value) VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines); INSERT INTO svm_model_settings (setting_name, setting_value) VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on); COMMIT; END; / -- Create the model using the specified settings BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'svm_model', mining_function => dbms_data_mining.classification, data_table_name => 'mining_data_build_v', case_id_column_name => 'cust_id', target_column_name => 'affinity_card', settings_table_name => 'svm_model_settings'); END; /
You can display the model settings with the following query:
SELECT * FROM user_mining_model_settings WHERE model_name IN 'SVM_MODEL'; MODEL_NAME SETTING_NAME SETTING_VALUE SETTING ------------- ---------------------- ----------------------------- ------- SVM_MODEL ALGO_NAME ALGO_SUPPORT_VECTOR_MACHINES INPUT SVM_MODEL SVMS_STD_DEV 3.004524 DEFAULT SVM_MODEL PREP_AUTO ON INPUT SVM_MODEL SVMS_COMPLEXITY_FACTOR 1.887389 DEFAULT SVM_MODEL SVMS_KERNEL_FUNCTION SVMS_LINEAR DEFAULT SVM_MODEL SVMS_CONV_TOLERANCE .001 DEFAULT
The following is an example of querying a model view instead of the older GEL_MODEL_DETAILS_SVM
routine.
SELECT target_value, attribute_name, attribute_value, coefficient FROM DM$VLSVM_MODEL;
The second example creates an Anomaly Detection model. Anomaly Detection uses SVM Classification without a target. This example uses the same settings table created for the SVM Classification model in the first example.
BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'anomaly_detect_model', mining_function => dbms_data_mining.classification, data_table_name => 'mining_data_build_v', case_id_column_name => 'cust_id', target_column_name => null, settings_table_name => 'svm_model_settings'); END; /
This query shows that the models created in these examples are the only ones in your schema.
SELECT model_name, mining_function, algorithm FROM user_mining_models; MODEL_NAME MINING_FUNCTION ALGORITHM ---------------------- -------------------- ------------------------------ SVM_MODEL CLASSIFICATION SUPPORT_VECTOR_MACHINES ANOMALY_DETECT_MODEL CLASSIFICATION SUPPORT_VECTOR_MACHINES
This query shows that only the SVM Classification model has a target.
SELECT model_name, attribute_name, attribute_type, target FROM user_mining_model_attributes WHERE target = 'YES'; MODEL_NAME ATTRIBUTE_NAME ATTRIBUTE_TYPE TARGET ------------------ --------------- ----------------- ------ SVM_MODEL AFFINITY_CARD CATEGORICAL YES
43.7.12 CREATE_MODEL2 Procedure
The CREATE_MODEL2
procedure is an alternate procedure to the CREATE_MODEL
procedure, which enables creating a model without extra persistence stages. In the CREATE_MODEL
procedure, the input is a table or a view and if such an object is not already present, the user must create it. By using the CREATE_MODEL2
procedure, the user does not need to create such transient database objects.
Syntax
DBMS_DATA_MINING.CREATE_MODEL2 ( model_name IN VARCHAR2, mining_function IN VARCHAR2, data_query IN CLOB, set_list IN SETTING_LIST, case_id_column_name IN VARCHAR2 DEFAULT NULL, target_column_name IN VARCHAR2 DEFAULT NULL, xform_list IN TRANSFORM_LIST DEFAULT NULL);
Parameters
Table 43-63 CREATE_MODEL2 Procedure Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [ See the Usage Notes, CREATE_MODEL Procedure for model naming restrictions. |
|
The mining function. Values are listed in DBMS_DATA_MINING — Mining Function Settings. |
|
A query which provides training data for building the model. |
|
Specifies the SETTING_LIST is a table of CLOB index by VARCHAR2(30) ; Where the index is the setting name and the CLOB is the setting value for that name.
|
|
Case identifier column in the build data. |
|
For supervised models, the target column in the build data. |
|
Refer to CREATE_MODEL Procedure. |
Usage Notes
Refer to CREATE_MODEL Procedure for Usage Notes.
Examples
The following example uses the Support Vector Machine algorithm.
declare
v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
v_setlst(dbms_data_mining.algo_name) := dbms_data_mining.algo_support_vector_machines;
v_setlst(dbms_data_mining.prep_auto) := dbms_data_mining.prep_auto_on;
DBMS_DATA_MINING.CREATE_MODEL2(
model_name => 'svm_model',
mining_function => dbms_data_mining.classification,
data_query => 'select * from mining_data_build_v',
data_table_name => 'mining_data_build_v',
case_id_column_name=> 'cust_id',
target_column_name => 'affinity_card',
set_list => v_setlst,
case_id_column_name=> 'cust_id',
target_column_name => 'affinity_card');
END;
/
43.7.13 Create Model Using Registration Information
Create model function fetches the setting information from JSON object.
Usage Notes
If an algorithm is registered, user can create model using the registered algorithm name. Since all R scripts and default setting values are already registered, providing the value through the setting table is not necessary. This makes the use of this algorithm easier.
Examples
The first example builds a Classification model using the GLM algorithm.
CREATE TABLE GLM_RDEMO_SETTINGS_CL ( setting_name VARCHAR2(30), setting_value VARCHAR2(4000)); BEGIN INSERT INTO GLM_RDEMO_SETTINGS_CL VALUES ('ALGO_EXTENSIBLE_LANG', 'R'); INSERT INTO GLM_RDEMO_SETTINGS_CL VALUES (dbms_data_mining.ralg_registration_algo_name, 't1'); INSERT INTO GLM_RDEMO_SETTINGS_CL VALUES (dbms_data_mining.odms_formula, 'AGE + EDUCATION + HOUSEHOLD_SIZE + OCCUPATION'); INSERT INTO GLM_RDEMO_SETTINGS_CL VALUES ('RALG_PARAMETER_FAMILY', 'binomial(logit)' ); END; / BEGIN DBMS_DATA_MINING.CREATE_MODEL( model_name => 'GLM_RDEMO_CLASSIFICATION', mining_function => dbms_data_mining.classification, data_table_name => 'mining_data_build_v', case_id_column_name => 'CUST_ID', target_column_name => 'AFFINITY_CARD', settings_table_name => 'GLM_RDEMO_SETTINGS_CL'); END; /
43.7.14 DROP_ALGORITHM Procedure
This function is used to drop the registered algorithm information.
Syntax
DBMS_DATA_MINING.DROP_ALGORITHM (algorithm_name IN VARCHAR2(30), cascade IN BOOLEAN default FALSE)
Parameters
Table 43-64 DROP_ALGORITHM Procedure Parameters
Parameter | Description |
---|---|
|
Name of the algorithm. |
|
If the cascade option is |
Usage Note
-
To drop a mining model, you must be the owner or you must have the
RQADMIN
privilege. See Oracle Data Mining User's Guide for information about privileges for data mining. -
Make sure a model is not built on the algorithm, then drop the algorithm from the system table.
-
If you try to drop an algorithm with a model built on it, then an error is displayed.
43.7.15 DROP_PARTITION Procedure
The DROP_PARTITION
procedure drops a single partition that is specified in the parameter partition_name
.
Syntax
DBMS_DATA_MINING.DROP_PARTITION (
model_name IN VARCHAR2,
partition_name IN VARCHAR2);
Parameters
Table 43-65 DROP_PARTITION Procedure Parameters
Parameters | Description |
---|---|
|
Name of the mining model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Name of the partition that must be dropped. |
43.7.16 DROP_MODEL Procedure
This procedure deletes the specified mining model.
Syntax
DBMS_DATA_MINING.DROP_MODEL (model_name IN VARCHAR2, force IN BOOLEAN DEFAULT FALSE);
Parameters
Table 43-66 DROP_MODEL Procedure Parameters
Parameter | Description |
---|---|
|
Name of the mining model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used. |
|
Forces the mining model to be dropped even if it is invalid. A mining model may be invalid if a serious system error interrupted the model build process. |
Usage Note
To drop a mining model, you must be the owner or you must have the DROP ANY MINING MODEL
privilege. See Oracle Data Mining User's Guide for information about privileges for data mining.
Example
You can use the following command to delete a valid mining model named nb_sh_clas_sample
that exists in your schema.
BEGIN DBMS_DATA_MINING.DROP_MODEL(model_name => 'nb_sh_clas_sample'); END; /
43.7.17 EXPORT_MODEL Procedure
This procedure exports the specified data mining models to a dump file set.
To import the models from the dump file set, use the IMPORT_MODEL Procedure. EXPORT_MODEL
and IMPORT_MODEL
use Oracle Data Pump technology.
When Oracle Data Pump is used to export/import an entire schema or database, the mining models in the schema or database are included. However, EXPORT_MODEL
and IMPORT_MODEL
are the only utilities that support the export/import of individual models.
See Also:
Oracle Database Utilities for information about Oracle Data Pump
Oracle Data Mining User's Guide for more information about exporting and importing mining models
Syntax
DBMS_DATA_MINING.EXPORT_MODEL ( filename IN VARCHAR2, directory IN VARCHAR2, model_filter IN VARCHAR2 DEFAULT NULL, filesize IN VARCHAR2 DEFAULT NULL, operation IN VARCHAR2 DEFAULT NULL, remote_link IN VARCHAR2 DEFAULT NULL, jobname IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-67 EXPORT_MODEL Procedure Parameters
Parameter | Description |
---|---|
|
Name of the dump file set to which the models should be exported. The name must be unique within the schema. The dump file set can contain one or more files. The number of files in a dump file set is determined by the size of the models being exported (both metadata and data) and a specified or estimated maximum file size. You can specify the file size in the When the export operation completes successfully, the name of the dump file set is automatically expanded to |
|
Name of a pre-defined directory object that specifies where the dump file set should be created. The exporting user must have read/write privileges on the directory object and on the file system directory that it identifies. See Oracle Database SQL Language Reference for information about directory objects. |
|
Optional parameter that specifies which model or models to export. If you do not specify a value for You can export individual models by name and groups of models based on mining function or algorithm. For instance, you could export all regression models or all Naive Bayes models. Examples are provided in Table 43-68. |
|
Optional parameter that specifies the maximum size of a file in the dump file set. The size may be specified in bytes, kilobytes (K), megabytes (M), or gigabytes (G). The default size is 50 MB. If the size of the models to export is larger than |
|
Optional parameter that specifies whether or not to estimate the size of the files in the dump set. By default the size is not estimated and the value of the You can specify either of the following values for
|
|
Optional parameter that specifies the name of a database link to a remote system. The default value is |
|
Optional parameter that specifies the name of the export job. By default, the name has the form If you specify a job name, it must be unique within the schema. The maximum length of the job name is 30 characters. A log file for the export job, named |
Usage Notes
The model_filter
parameter specifies which models to export. You can list the models by name, or you can specify all models that have the same mining function or algorithm. You can query the USER_MINING_MODELS
view to list the models in your schema.
SQL> describe user_mining_models Name Null? Type ----------------------------------------- -------- ---------------------------- MODEL_NAME NOT NULL VARCHAR2(30) MINING_FUNCTION VARCHAR2(30) ALGORITHM VARCHAR2(30) CREATION_DATE NOT NULL DATE BUILD_DURATION NUMBER MODEL_SIZE NUMBER COMMENTS VARCHAR2(4000)
Examples of model filters are provided in Table 43-68.
Table 43-68 Sample Values for the Model Filter Parameter
Sample Value | Meaning |
---|---|
|
Export the model named |
|
Export the model named |
|
Export the models named |
|
Export all Naive Bayes models. See Table 43-5 for a list of algorithm names. |
|
Export all classification models. See Table 43-3 for a list of mining functions. |
Examples
-
The following statement exports all the models in the
DMUSER3
schema to a dump file set calledmodels_out
in the directory$ORACLE_HOME/rdbms/log
. This directory is mapped to a directory object calledDATA_PUMP_DIR
. TheDMUSER3
user has read/write access to the directory and to the directory object.SQL>execute dbms_data_mining.export_model ('models_out', 'DATA_PUMP_DIR');
You can exit SQL*Plus and list the resulting dump file and log file.
SQL>EXIT >cd $ORACLE_HOME/rdbms/log >ls >DMUSER3_exp_1027.log models_out01.dmp
-
The following example uses the same directory object and is executed by the same user.This example exports the models called
NMF_SH_SAMPLE
andSVMR_SH_REGR_SAMPLE
to a different dump file set in the same directory.SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ( 'models2_out', 'DATA_PUMP_DIR', 'name in (''NMF_SH_SAMPLE'', ''SVMR_SH_REGR_SAMPLE'')'); SQL>EXIT >cd $ORACLE_HOME/rdbms/log >ls >DMUSER3_exp_1027.log models_out01.dmp DMUSER3_exp_924.log models2_out01.dmp
-
The following examples show how to export models with specific algorithm and mining function names.
SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL('algo.dmp','DM_DUMP', 'ALGORITHM_NAME IN (''O_CLUSTER'',''GENERALIZED_LINEAR_MODEL'', ''SUPPORT_VECTOR_MACHINES'',''NAIVE_BAYES'')'); SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL('func.dmp', 'DM_DUMP', 'FUNCTION_NAME IN (CLASSIFICATION,CLUSTERING,FEATURE_EXTRACTION)');
43.7.18 EXPORT_SERMODEL Procedure
This procedure exports the model in a serialized format so that they can be moved to another platform for scoring.
When exporting a model in serialized format, the user must pass in an empty BLOB
locator and specify the model name to be exported. If the model is partitioned, the user can optionally select an individual partition to export, otherwise all partitions are exported. The returned BLOB
contains the content that can be deployed.
Syntax
DBMS_DATA_MINING.EXPORT_SERMODEL ( model_data IN OUT NOCOPY BLOB, model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL);
Parameters
Table 43-69 EXPORT_SERMODEL Procedure Parameters
Parameter | Description |
---|---|
|
Provides serialized model data. |
|
Name of the mining model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Name of the partition that must be exported. |
Examples
The following statement exports all the models in a serialized format.
DECLARE
v_blob blob;
BEGIN
dbms_lob.createtemporary(v_blob, FALSE);
dbms_data_mining.export_sermodel(v_blob, 'MY_MODEL');
-- save v_blob somewhere (e.g., bfile, etc.)
dbms_lob.freetemporary(v_blob);
END;
/
See Also:
Oracle Data Mining User's Guide for more information about exporting and importing mining models
43.7.19 FETCH_JSON_SCHEMA Procedure
User can fetch and read JSON schema from the ALL_MINING_ALGORITHMS
view. This function returns the pre-registered JSON schema for R extensible algorithms.
Syntax
DBMS_DATA_MINING.FETCH_JSON_SCHEMA RETURN CLOB;
Parameters
Table 43-70 FETCH_JSON_SCHEMA Procedure Parameters
Parameter | Description |
---|---|
|
This function returns the pre-registered JSON schema for R extensibility. The default value is |
Usage Note
If a user wants to register a new algorithm using the algorithm registration function, they must fetch and follow the pre-registered JSON schema using this function, when they create the required JSON object metadata, and then pass it to the registration function.
43.7.20 GET_ASSOCIATION_RULES Function
The GET_ASSOCIATION_RULES
function returns the rules produced by an Association model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide
You can specify filtering criteria to GET_ASSOCIATION_RULES
to return a subset of the rules. Filtering criteria can improve the performance of the table function. If the number of rules is large, then the greatest performance improvement will result from specifying the topn
parameter.
Syntax
DBMS_DATA_MINING.get_association_rules( model_name IN VARCHAR2, topn IN NUMBER DEFAULT NULL, rule_id IN INTEGER DEFAULT NULL, min_confidence IN NUMBER DEFAULT NULL, min_support IN NUMBER DEFAULT NULL, max_rule_length IN INTEGER DEFAULT NULL, min_rule_length IN INTEGER DEFAULT NULL, sort_order IN ORA_MINING_VARCHAR2_NT DEFAULT NULL, antecedent_items IN DM_ITEMS DEFAULT NULL, consequent_items IN DM_ITEMS DEFAULT NULL, min_lift IN NUMBER DEFAULT NULL, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_Rules PIPELINED;
Parameters
Table 43-71 GET_ASSOCIATION_RULES Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. This is the only required parameter of |
|
Returns the n top rules ordered by confidence and then support, both descending. If you specify a sort order, then the top n rules are derived after the sort is performed. If |
|
Identifier of the rule to return. If you specify a value for |
|
Returns the rules with confidence greater than or equal to this number. |
|
Returns the rules with support greater than or equal to this number. |
|
Returns the rules with a length less than or equal to this number. Rule length refers to the number of items in the rule (See If |
|
Returns the rules with a length greater than or equal to this number. See If |
|
Sorts the rules by the values in one or more of the returned columns. Specify one or more column names, each followed by For example, to sort the result set in descending order first by the
If you specify By default, the results are sorted by Confidence in descending order, then by Support in descending order. |
|
Returns the rules with these items in the antecedent. |
|
Returns the rules with this item in the consequent. |
|
Returns the rules with lift greater than or equal to this number. |
|
Specifies a partition in a partitioned model. |
Return Values
The object type returned by GET_ASSOCIATION_RULES
is described in Table 43-72. For descriptions of each field, see the Usage Notes.
Table 43-72 GET_ASSOCIATION RULES Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (rule_id INTEGER, antecedent DM_PREDICATES, consequent DM_PREDICATES, rule_support NUMBER, rule_confidence NUMBER, rule_lift NUMBER, antecedent_support NUMBER, consequent_support NUMBER, number_of_items INTEGER ) |
|
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), conditional_operator CHAR(2)/*=,<>,<,>,<=,>=*/, attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000), attribute_support NUMBER, attribute_confidence NUMBER) |
Usage Notes
-
This table function pipes out rows of type
DM_RULES
. For information on Data Mining data types and piped output from table functions, see "Datatypes". -
The columns returned by
GET_ASSOCIATION_RULES
are described as follows:Column in DM_RULES Description rule_id
Unique identifier of the rule
antecedent
The independent condition in the rule. When this condition exists, the dependent condition in the consequent also exists.
The condition is a combination of attribute values called a predicate (
DM_PREDICATE
). The predicate specifies a condition for each attribute. The condition may specify equality (=), inequality (<>), greater than (>), less than (<), greater than or equal to (>=), or less than or equal to (<=) a given value.Support
andConfidence
for each attribute condition in the antecedent is returned in the predicate. Support is the number of transactions that satisfy the antecedent. Confidence is the likelihood that a transaction will satisfy the antecedent.Note: The occurence of the attribute as a
DM_PREDICATE
indicates the presence of the item in the transaction. The actual value forattribute_num_value
orattribute_str_value
is meaningless. For example, the following predicate indicates that 'Mouse Pad' is present in the transaction even though the attribute value isNULL
.DM_PREDICATE('PROD_NAME', 'Mouse Pad', '= ', NULL, NULL, NULL, NULL))
consequent
The dependent condition in the rule. This condition exists when the antecedent exists.
The consequent, like the antecedent, is a predicate (
DM_PREDICATE
).Support and confidence for each attribute condition in the consequent is returned in the predicate. Support is the number of transactions that satisfy the consequent. Confidence is the likelihood that a transaction will satisfy the consequent.
rule_support
The number of transactions that satisfy the rule.
rule_confidence
The likelihood of a transaction satisfying the rule.
rule_lift
The degree of improvement in the prediction over random chance when the rule is satisfied.
antecedent_support
The ratio of the number of transactions that satisfy the antecedent to the total number of transactions.
consequent_support
The ratio of the number of transactions that satisfy the consequent to the total number of transactions.
number_of_items
The total number of attributes referenced in the antecedent and consequent of the rule.
Examples
The following example demonstrates an Association model build followed by several invocations of the GET_ASSOCIATION_RULES
table function:
-- prepare a settings table to override default settings CREATE TABLE market_settings AS SELECT * FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS) WHERE setting_name LIKE 'ASSO_%'; BEGIN -- update the value of the minimum confidence UPDATE market_settings SET setting_value = TO_CHAR(0.081) WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence; -- build an AR model DBMS_DATA_MINING.CREATE_MODEL( model_name => 'market_model', function => DBMS_DATA_MINING.ASSOCIATION, data_table_name => 'market_build', case_id_column_name => 'item_id', target_column_name => NULL, settings_table_name => 'market_settings'); END; / -- View the (unformatted) rules SELECT rule_id, antecedent, consequent, rule_support, rule_confidence FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model'));
In the previous example, you view all rules. To view just the top 20 rules, use the following statement.
-- View the top 20 (unformatted) rules SELECT rule_id, antecedent, consequent, rule_support, rule_confidence FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model', 20));
The following query uses the Association model AR_SH_SAMPLE
, which is created from one of the Oracle Data Mining sample programs:
SELECT * FROM TABLE ( DBMS_DATA_MINING.GET_ASSOCIATION_RULES ( 'AR_SH_SAMPLE', 10, NULL, 0.5, 0.01, 2, 1, ORA_MINING_VARCHAR2_NT ( 'NUMBER_OF_ITEMS DESC', 'RULE_CONFIDENCE DESC', 'RULE_SUPPORT DESC'), DM_ITEMS(DM_ITEM('CUSTPRODS', 'Mouse Pad', 1, NULL), DM_ITEM('CUSTPRODS', 'Standard Mouse', 1, NULL)), DM_ITEMS(DM_ITEM('CUSTPRODS', 'Extension Cable', 1, NULL))));
The query returns three rules, shown as follows:
13 DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Mouse Pad', '= ', 1, NULL, NULL, NULL), DM_PREDICATE('CUSTPRODS', 'Standard Mouse', '= ', 1, NULL, NULL, NULL)) DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL)) .15532 .84393 2.7075 .18404 .3117 2 11 DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Standard Mouse', '= ', 1, NULL, NULL, NULL)) DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL)) .18085 .56291 1.8059 .32128 .3117 1 9 DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Mouse Pad', '= ', 1, NULL, NULL, NULL)) DM_PREDICATES( DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL)) .17766 .55116 1.7682 .32234 .3117 1
See Also:
Table 43-72 for the DM_RULE
column data types.
Oracle Data Mining User's Guide for information about the sample programs.
Oracle Data Mining User’s Guide for Model Detail Views.
43.7.21 GET_FREQUENT_ITEMSETS Function
The GET_FREQUENT_ITEMSETS
function returns a set of rows that represent the frequent itemsets from an Association model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
For a detailed description of frequent itemsets, consult Oracle Data Mining Concepts.
Syntax
DBMS_DATA_MINING.get_frequent_itemsets( model_name IN VARCHAR2, topn IN NUMBER DEFAULT NULL, max_itemset_length IN NUMBER DEFAULT NULL, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_ItemSets PIPELINED;
Parameters
Table 43-73 GET_FREQUENT_ITEMSETS Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
When not |
|
Maximum length of an item set. |
|
Specifies a partition in a partitioned model. Note: Thepartition_name columns applies only when the model is partitioned.
|
Return Values
Table 43-74 GET_FREQUENT_ITEMSETS Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (partition_name VARCHAR2(128) itemsets_id NUMBER, items DM_ITEMS, support NUMBER, number_of_items NUMBER) Note: Thepartition_name columns applies only when the model is partitioned.
The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), attribute_num_value NUMBER, attribute_str_value VARCHAR2(4000)) |
Usage Notes
This table function pipes out rows of type DM_ITEMSETS
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".
Examples
The following example demonstrates an Association model build followed by an invocation of GET_FREQUENT_ITEMSETS
table function from Oracle SQL.
-- prepare a settings table to override default settings CREATE TABLE market_settings AS
SELECT *
FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS) WHERE setting_name LIKE 'ASSO_%'; BEGIN -- update the value of the minimum confidence UPDATE market_settings SET setting_value = TO_CHAR(0.081) WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence; /* build a AR model */ DBMS_DATA_MINING.CREATE_MODEL( model_name => 'market_model', function => DBMS_DATA_MINING.ASSOCIATION, data_table_name => 'market_build', case_id_column_name => 'item_id', target_column_name => NULL, settings_table_name => 'market_settings'); END; / -- View the (unformatted) Itemsets from SQL*Plus SELECT itemset_id, items, support, number_of_items FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model'));
In the example above, you view all itemsets. To view just the top 20 itemsets, use the following statement:
-- View the top 20 (unformatted) Itemsets from SQL*Plus SELECT itemset_id, items, support, number_of_items FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model', 20));
See Also:
Oracle Data Mining User’s Guide43.7.22 GET_MODEL_COST_MATRIX Function
The GET_*
interfaces are replaced by model views, and Oracle recommends that users leverage the views instead. The GET_MODEL_COST_MATRIX function is replaced by the DM$VC
prefixed view, Scoring Cost Matrix. The cost matrix used when building a Decision Tree is made available by the DM$VM
prefixed view, Decision Tree Build Cost Matrix.
Refer to Model Detail View for Classification Algorithm.
The GET_MODEL_COST_MATRIX
function returns the rows of a cost matrix associated with the specified model.
By default, this function returns the scoring cost matrix that was added to the model with the ADD_COST_MATRIX
procedure. If you wish to obtain the cost matrix used to create a model, specify cost_matrix_type_create
as the matrix_type
. See Table 43-75.
See also ADD_COST_MATRIX Procedure.
Syntax
DBMS_DATA_MINING.GET_MODEL_COST_MATRIX ( model_name IN VARCHAR2, matrix_type IN VARCHAR2 DEFAULT cost_matrix_type_score) partition_name IN VARCHAR2 DEFAULT NULL); RETURN DM_COST_MATRIX PIPELINED;
Parameters
Table 43-75 GET_MODEL_COST_MATRIX Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
The type of cost matrix.
|
|
Name of the partition in a partitioned model |
Return Values
Table 43-76 GET_MODEL_COST_MATRIX Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type actual VARCHAR2(4000), NUMBER, predicted VARCHAR2(4000), cost NUMBER) |
Usage Notes
Only Decision Tree models can be built with a cost matrix. If you want to build a Decision Tree model with a cost matrix, specify the cost matrix table name in the CLAS_COST_TABLE_NAME
setting in the settings table for the model. See Table 43-7.
The cost matrix used to create a Decision Tree model becomes the default scoring matrix for the model. If you want to specify different costs for scoring, you can use the REMOVE_COST_MATRIX
procedure to remove the cost matrix and the ADD_COST_MATRIX
procedure to add a new one.
The GET_MODEL_COST_MATRIX
may return either the build or scoring cost matrix defined for a model or model partition.
If you do not specify a partitioned model name, then an error is displayed.
Example
This example returns the scoring cost matrix associated with the Naive Bayes model NB_SH_CLAS_SAMPLE
.
column actual format a10 column predicted format a10 SELECT * FROM TABLE(dbms_data_mining.get_model_cost_matrix('nb_sh_clas_sample')) ORDER BY predicted, actual; ACTUAL PREDICTED COST ---------- ---------- ----- 0 0 .00 1 0 .75 0 1 .25 1 1 .00
43.7.23 GET_MODEL_DETAILS_AI Function
The GET_MODEL_DETAILS_AI
function returns a set of rows that provide the details of an Attribute Importance model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Syntax
DBMS_DATA_MINING.get_model_details_ai( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN dm_ranked_attributes pipelined;
Parameters
Table 43-77 GET_MODEL_DETAILS_AI Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model. |
Return Values
Table 43-78 GET_MODEL_DETAILS_AI Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (attribute_name VARCHAR2(4000, attribute_subname VARCHAR2(4000), importance_value NUMBER, rank NUMBER(38)) |
Examples
The following example returns model details for the Attribute Importance model AI_SH_sample
, which was created by the sample program dmaidemo.sql
. For information about the sample programs, see Oracle Data Mining User's Guide.
SELECT attribute_name, importance_value, rank FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('AI_SH_sample')) ORDER BY RANK; ATTRIBUTE_NAME IMPORTANCE_VALUE RANK ---------------------------------------- ---------------- ---------- HOUSEHOLD_SIZE .151685183 1 CUST_MARITAL_STATUS .145294546 2 YRS_RESIDENCE .07838928 3 AGE .075027496 4 Y_BOX_GAMES .063039952 5 EDUCATION .059605314 6 HOME_THEATER_PACKAGE .056458722 7 OCCUPATION .054652937 8 CUST_GENDER .035264741 9 BOOKKEEPING_APPLICATION .019204751 10 PRINTER_SUPPLIES 0 11 OS_DOC_SET_KANJI -.00050013 12 FLAT_PANEL_MONITOR -.00509564 13 BULK_PACK_DISKETTES -.00540822 14 COUNTRY_NAME -.01201116 15 CUST_INCOME_LEVEL -.03951311 16
43.7.24 GET_MODEL_DETAILS_EM Function
The GET_MODEL_DETAILS_EM
function returns a set of rows that provide statistics about the clusters produced by an Expectation Maximization model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
By default, the EM algorithm groups components into high-level clusters, and GET_MODEL_DETAILS_EM
returns only the high-level clusters with their hierarchies. Alternatively, you can configure EM model to disable the grouping of components into high-level clusters. In this case, GET_MODEL_DETAILS_EM
returns the components themselves as clusters with their hierarchies. See Table 43-12.
Syntax
DBMS_DATA_MINING.get_model_details_em( model_name VARCHAR2, cluster_id NUMBER DEFAULT NULL, attribute VARCHAR2 DEFAULT NULL, centroid NUMBER DEFAULT 1, histogram NUMBER DEFAULT 1, rules NUMBER DEFAULT 2, attribute_subname VARCHAR2 DEFAULT NULL, topn_attributes NUMBER DEFAULT NULL, partition_name IN VARCHAR2 DEFAULT NULL) RETURN dm_clusters PIPELINED;
Parameters
Table 43-79 GET_MODEL_DETAILS_EM Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
The ID of a cluster in the model. When a valid cluster ID is specified, only the details of this cluster are returned. Otherwise, the details for all clusters are returned. |
|
The name of an attribute. When a valid attribute name is specified, only the details of this attribute are returned. Otherwise, the details for all attributes are returned |
|
This parameter accepts the following values:
|
|
This parameter accepts the following values:
|
|
This parameter accepts the following values:
|
|
The name of a nested attribute. The full name of a nested attribute has the form:
where |
|
Restricts the number of attributes returned in the centroid, histogram, and rules objects. Only the If the number of attributes included in the rules is less than If both the |
|
Specifies a partition in a partitioned model. |
Usage Notes
-
For information on Data Mining datatypes and Return Values for Clustering Algorithms piped output from table functions, see "Datatypes".
-
GET_MODEL_DETAILS
functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model. -
When cluster statistics are disabled (
EMCS_CLUSTER_STATISTICS
is set toEMCS_CLUS_STATS_DISABLE
),GET_MODEL_DETAILS_EM
does not return centroids, histograms, or rules. Only taxonomy (hierarchy) and cluster counts are returned. -
When the
partition_name
isNULL
for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Related Topics
43.7.25 GET_MODEL_DETAILS_EM_COMP Function
he GET_MODEL_DETAILS_EM_COMP
table function returns a set of rows that provide details about the parameters of an Expectation Maximization model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Syntax
DBMS_DATA_MINING.get_model_details_em_comp( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_EM_COMPONENT_SET PIPELINED;
Parameters
Table 43-80 GET_MODEL_DETAILS_EM_COMP Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used. |
|
Specifies a partition in a partitioned model to retrieve details for. |
Return Values
Table 43-81 GET_MODEL_DETAILS_EM_COMP Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (info_type VARCHAR2(30), component_id NUMBER, cluster_id NUMBER, attribute_name VARCHAR2(4000), covariate_name VARCHAR2(4000), attribute_value VARCHAR2(4000), value NUMBER ) |
Usage Notes
-
This table function pipes out rows of type
DM_EM_COMPONENT
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".The columns in each row returned by
GET_MODEL_DETAILS_EM_COMP
are described as follows:Column in DM_EM_COMPONENT Description info_type
The type of information in the row. The following information types are supported:
-
cluster
-
prior
-
mean
-
covariance
-
frequency
component_id
Unique identifier of a component
cluster_id
Unique identifier of the high-level leaf cluster for each component
attribute_name
Name of an original attribute or a derived feature ID. The derived feature ID is used in models built on data with nested columns. The derived feature definitions can be obtained from the GET_MODEL_DETAILS_EM_PROJ Function.
covariate_name
Name of an original attribute or a derived feature ID used in variance/covariance definition
attribute_value
Categorical value or bin interval for binned numerical attributes
value
Encodes different information depending on the value of
info_type
, as follows:-
cluster
— The value field isNULL
-
prior
— The value field returns the component prior -
mean
— The value field returns the mean of the attribute specified inattribute_name
-
covariance
— The value field returns the covariance of the attributes specified inattribute_name
andcovariate_name
. Using the same attribute inattribute_name
andcovariate_name
, returns the variance. -
frequency
— The value field returns the multivalued Bernoulli frequency parameter for the attribute/value combination specified byattribute_name
andattribute_value
See Usage Note 2 for details.
-
-
The following table shows which fields are used for each
info_type
. The blank cells representNULL
s.info_type component_id cluster_id attribute_name covariate_name attribute_value value cluster
X
X
prior
X
X
X
mean
X
X
X
X
covariance
X
X
X
X
X
frequency
X
X
X
X
X
-
GET_MODEL_DETAILS
functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model. -
When the value is
NULL
for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Related Topics
43.7.26 GET_MODEL_DETAILS_EM_PROJ Function
The GET_MODEL_DETAILS_EM_PROJ
function returns a set of rows that provide statistics about the projections produced by an Expectation Maximization model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Syntax
DBMS_DATA_MINING.get_model_details_em_proj( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_EM_PROJECTION_SET PIPELINED;
Parameters
Table 43-82 GET_MODEL_DETAILS_EM_PROJ Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model |
Return Values
Table 43-83 GET_MODEL_DETAILS_EM_PROJ Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (feature_name VARCHAR2(4000), attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), attribute_value VARCHAR2(4000), coefficient NUMBER ) See Usage Notes for details. |
Usage Notes
-
This table function pipes out rows of type
DM_EM_PROJECTION
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".The columns in each row returned by
GET_MODEL_DETAILS_EM_PROJ
are described as follows:Column in DM_EM_PROJECTION Description feature_name
Name of a derived feature. The feature maps to the attribute_name returned by the GET_MODEL_DETAILS_EM Function.
attribute_name
Name of a column in the build data
attribute_subname
Subname in a nested column
attribute_value
Categorical value
coefficient
Projection coefficient. The representation is sparse; only the non-zero coefficients are returned.
-
GET_MODEL_DETAILS
functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.The coefficients are related to the transformed, not the original, attributes. When returned directly with the model details, the coefficients may not provide meaningful information.
-
When the value is
NULL
for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Related Topics
43.7.27 GET_MODEL_DETAILS_GLM Function
The GET_MODEL_DETAILS_GLM
function returns the coefficient statistics for a Generalized Linear Model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
The same set of statistics is returned for both linear and Logistic Regression, but statistics that do not apply to the mining function are returned as NULL
. For more details, see the Usage Notes.
Syntax
DBMS_DATA_MINING.get_model_details_glm( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_GLM_Coeff_Set PIPELINED;
Parameters
Table 43-84 GET_MODEL_DETAILS_GLM Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model |
Return Values
Table 43-85 GET_MODEL_DETAILS_GLM Return Values
Return Value | Description |
---|---|
|
A set of rows of type (class VARCHAR2(4000), attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), attribute_value VARCHAR2(4000), feature_expression VARCHAR2(4000), coefficient NUMBER, std_error NUMBER, test_statistic NUMBER, p_value NUMBER, VIF NUMBER, std_coefficient NUMBER, lower_coeff_limit NUMBER, upper_coeff_limit NUMBER, exp_coefficient BINARY_DOUBLE, exp_lower_coeff_limit BINARY_DOUBLE, exp_upper_coeff_limit BINARY_DOUBLE) |
GET_MODEL_DETAILS_GLM
returns a row of statistics for each attribute and one extra row for the intercept, which is identified by a null value in the attribute name. Each row has the DM_GLM_COEFF
datatype. The statistics are described in Table 43-86.
Table 43-86 DM_GLM_COEFF Datatype Description
Column | Description |
---|---|
|
The non-reference target class for Logistic Regression. The model is built to predict the probability of this class. The other class (the reference class) is specified in the model setting For Linear Regression, |
|
The attribute name when there is no subname, or first part of the attribute name when there is a subname. The value of For the intercept, |
|
The name of an attribute in a nested table. The full name of a nested attribute has the form:
where If the attribute is not nested, then |
|
The value of the attribute (categorical attribute only). For numeric attributes, |
|
The feature name constructed by the algorithm when feature generation is enabled and higher-order features are found. If feature selection is not enabled, then the feature name is simply the fully-qualified attribute name ( For categorical attributes, the algorithm constructs a feature name that has the following form:
For numeric attributes, the algorithm constructs a name for the higher-order feature by taking the product of the resulting values: ( where |
|
The linear coefficient estimate. |
|
Standard error of the coefficient estimate. |
|
For Linear Regression, the t-value of the coefficient estimate. For Logistic Regression, the Wald chi-square value of the coefficient estimate. |
|
Probability of the |
|
Variance Inflation Factor. The value is zero for the intercept. For Logistic Regression, |
|
Standardized estimate of the coefficient. |
|
Lower confidence bound of the coefficient. |
|
Upper confidence bound of the coefficient. |
|
Exponentiated coefficient for Logistic Regression. For Linear Regression, |
|
Exponentiated coefficient for lower confidence bound of the coefficient for Logistic Regression. For Linear Regression, |
|
Exponentiated coefficient for upper confidence bound of the coefficient for Logistic Regression. For Linear Regression, |
Usage Notes
Not all statistics are necessarily returned for each coefficient. Statistics will be null if:
-
They do not apply to the mining function. For example,
exp_coefficient
does not apply to Linear Regression. -
They cannot be computed from a theoretical standpoint. For information on ridge regression, see Table 43-18.
-
They cannot be computed because of limitations in system resources.
-
Their values would be infinity.
-
When the value is NULL for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Examples
The following example returns some of the model details for the GLM Regression model GLMR_SH_Regr_sample
, which was created by the sample program dmglrdem.sql
. For information about the sample programs, see Oracle Data Mining User's Guide.
SET line 120 SET pages 99 column attribute_name format a30 column attribute_subname format a20 column attribute_value format a20 col coefficient format 990.9999 col std_error format 990.9999 SQL> SELECT * FROM (SELECT attribute_name, attribute_value, coefficient, std_error FROM DM$VDGLMR_SH_REGR_SAMPLE order by 1,2) WHERE rownum < 11; ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT STD_ERROR ------------------------------ -------------------- ----------- --------- AFFINITY_CARD -0.5797 0.5283 BOOKKEEPING_APPLICATION -0.4689 3.8872 BULK_PACK_DISKETTES -0.9819 2.5430 COUNTRY_NAME Argentina -1.2020 1.1876 COUNTRY_NAME Australia -0.0071 5.1146 COUNTRY_NAME Brazil 5.2931 1.9233 COUNTRY_NAME Canada 4.0191 2.4108 COUNTRY_NAME China 0.8706 3.5889 COUNTRY_NAME Denmark -2.9822 3.1803 COUNTRY_NAME France -1.1044 7.1811
Related Topics
43.7.28 GET_MODEL_DETAILS_GLOBAL Function
The GET_MODEL_DETAILS_GLOBAL
function returns statistics about the model as a whole. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Global details are available for Generalized Linear Models, Association Rules, Singular Value Decomposition, and Expectation Maximization. There are new Global model views which show global information for all algorithms. Oracle recommends that users leverage the views instead. Refer to Model Details View Global.
Syntax
DBMS_DATA_MINING.get_model_details_global( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_model_global_details PIPELINED;
Parameters
Table 43-87 GET_MODEL_DETAILS_GLOBAL Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model. |
Return Values
Table 43-88 GET_MODEL_DETAILS_GLOBAL Function Return Values
Return Value | Description |
---|---|
|
A collection of rows of type (global_detail_name VARCHAR2(30), global_detail_value NUMBER) |
Examples
The following example returns the global model details for the GLM Regression model GLMR_SH_Regr_sample
, which was created by the sample program dmglrdem.sql
. For information about the sample programs, see Oracle Data Mining User's Guide.
SELECT * FROM TABLE(dbms_data_mining.get_model_details_global( 'GLMR_SH_Regr_sample')) ORDER BY global_detail_name; GLOBAL_DETAIL_NAME GLOBAL_DETAIL_VALUE ------------------------------ ------------------- ADJUSTED_R_SQUARE .731412557 AIC 5931.814 COEFF_VAR 18.1711243 CORRECTED_TOTAL_DF 1499 CORRECTED_TOT_SS 278740.504 DEPENDENT_MEAN 38.892 ERROR_DF 1433 ERROR_MEAN_SQUARE 49.9440956 ERROR_SUM_SQUARES 71569.8891 F_VALUE 62.8492452 GMSEP 52.280819 HOCKING_SP .034877162 J_P 52.1749319 MODEL_CONVERGED 1 MODEL_DF 66 MODEL_F_P_VALUE 0 MODEL_MEAN_SQUARE 3138.94871 MODEL_SUM_SQUARES 207170.615 NUM_PARAMS 67 NUM_ROWS 1500 ROOT_MEAN_SQ 7.06711367 R_SQ .743238288 SBIC 6287.79977 VALID_COVARIANCE_MATRIX 1
Related Topics
43.7.29 GET_MODEL_DETAILS_KM Function
The GET_MODEL_DETAILS_KM
function returns a set of rows that provide the details of a k-Means clustering model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
You can provide input to GET_MODEL_DETAILS_KM
to request specific information about the model, thus improving the performance of the query. If you do not specify filtering parameters, then GET_MODEL_DETAILS_KM
returns all the information about the model.
Syntax
DBMS_DATA_MINING.get_model_details_km( model_name VARCHAR2, cluster_id NUMBER DEFAULT NULL, attribute VARCHAR2 DEFAULT NULL, centroid NUMBER DEFAULT 1, histogram NUMBER DEFAULT 1, rules NUMBER DEFAULT 2, attribute_subname VARCHAR2 DEFAULT NULL, topn_attributes NUMBER DEFAULT NULL, partition_name VARCHAR2 DEFAULT NULL) RETURN dm_clusters PIPELINED;
Parameters
Table 43-89 GET_MODEL_DETAILS_KM Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
The ID of a cluster in the model. When a valid cluster ID is specified, only the details of this cluster are returned. Otherwise the details for all clusters are returned. |
|
The name of an attribute. When a valid attribute name is specified, only the details of this attribute are returned. Otherwise, the details for all attributes are returned |
|
This parameter accepts the following values:
|
|
This parameter accepts the following values:
|
|
This parameter accepts the following values:
|
|
The name of a nested attribute. The full name of a nested attribute has the form:
where If the attribute is not nested, |
|
Restricts the number of attributes returned in the centroid, histogram, and rules objects. Only the If the number of attributes included in the rules is less than If both the |
|
Specifies a partition in a partitioned model. |
Usage Notes
-
The table function pipes out rows of type
DM_CLUSTERS
. For information on Data Mining datatypes and Return Value for Clustering Algorithms piped output from table functions, see "Datatypes". -
When the value is NULL for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Examples
The following example returns model details for the k-Means clustering model KM_SH_Clus_sample
, which was created by the sample program dmkmdemo.sql
. For information about the sample programs, see Oracle Data Mining User's Guide.
SELECT T.id clu_id, T.record_count rec_cnt, T.parent parent, T.tree_level tree_level, T.dispersion dispersion FROM (SELECT * FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM( 'KM_SH_Clus_sample')) ORDER BY id) T WHERE ROWNUM < 6; CLU_ID REC_CNT PARENT TREE_LEVEL DISPERSION ---------- ---------- ---------- ---------- ---------- 1 1500 1 5.9152211 2 638 1 2 3.98458982 3 862 1 2 5.83732097 4 376 3 3 5.05192137 5 486 3 3 5.42901522
Related Topics
43.7.30 GET_MODEL_DETAILS_NB Function
The GET_MODEL_DETAILS_NB
function returns a set of rows that provide the details of a Naive Bayes model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Syntax
DBMS_DATA_MINING.get_model_details_nb( model_name IN VARCHAR2, partition_name IN VARCHAR2 DEFAULT NULL) RETURN DM_NB_Details PIPELINED;
Parameters
Table 43-90 GET_MODEL_DETAILS_NB Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model |
Return Values
Table 43-91 GET_MODEL_DETAILS_NB Function Return Values
Return Value | Description |
---|---|
|
A set of rows of type (target_attribute_name VARCHAR2(30), target_attribute_str_value VARCHAR2(4000), target_attribute_num_value NUMBER, prior_probability NUMBER, conditionals DM_CONDITIONALS) The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), attribute_str_value VARCHAR2(4000), attribute_num_value NUMBER, conditional_probability NUMBER) |
Usage Notes
-
The table function pipes out rows of type
DM_NB_DETAILS
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes". -
When the value is
NULL
for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Examples
The following query is from the sample program dmnbdemo.sql
. It returns model details about the model NB_SH_Clas_sample
. For information about the sample programs, see Oracle Data Mining User's Guide.
The query creates labels from the bin boundary tables that were used to bin the training data. It replaces the attribute values with the labels. For numeric bins, the labels are (
lower_boundary
,upper_boundary
]
; for categorical bins, the label matches the value it represents. (This method of categorical label representation will only work for cases where one value corresponds to one bin.) The target was not binned.
WITH bin_label_view AS ( SELECT col, bin, (DECODE(bin,'1','[','(') || lv || ',' || val || ']') label FROM (SELECT col, bin, LAST_VALUE(val) OVER ( PARTITION BY col ORDER BY val ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) lv, val FROM nb_sh_sample_num) UNION ALL SELECT col, bin, val label FROM nb_sh_sample_cat ), model_details AS ( SELECT T.target_attribute_name tname, NVL(TO_CHAR(T.target_attribute_num_value,T.target_attribute_str_value)) tval, C.attribute_name pname, NVL(L.label, NVL(C.attribute_str_value, C.attribute_num_value)) pval, T.prior_probability priorp, C.conditional_probability condp FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('NB_SH_Clas_sample')) T, TABLE(T.conditionals) C, bin_label_view L WHERE C.attribute_name = L.col (+) AND (NVL(C.attribute_str_value,C.attribute_num_value) = L.bin(+)) ORDER BY 1,2,3,4,5,6 ) SELECT tname, tval, pname, pval, priorp, condp FROM model_details WHERE ROWNUM < 11; TNAME TVAL PNAME PVAL PRIORP CONDP -------------- ---- ------------------------- ------------- ------- ------- AFFINITY_CARD 0 AGE (24,30] .6500 .1714 AFFINITY_CARD 0 AGE (30,35] .6500 .1509 AFFINITY_CARD 0 AGE (35,40] .6500 .1125 AFFINITY_CARD 0 AGE (40,46] .6500 .1134 AFFINITY_CARD 0 AGE (46,53] .6500 .1071 AFFINITY_CARD 0 AGE (53,90] .6500 .1312 AFFINITY_CARD 0 AGE [17,24] .6500 .2134 AFFINITY_CARD 0 BOOKKEEPING_APPLICATION 0 .6500 .1500 AFFINITY_CARD 0 BOOKKEEPING_APPLICATION 1 .6500 .8500 AFFINITY_CARD 0 BULK_PACK_DISKETTES 0 .6500 .3670
Related Topics
43.7.31 GET_MODEL_DETAILS_NMF Function
The GET_MODEL_DETAILS_NMF
function returns a set of rows that provide the details of a Non-Negative Matrix Factorization model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
Syntax
DBMS_DATA_MINING.get_model_details_nmf( model_name IN VARCHAR2, partition_name VARCHAR2 DEFAULT NULL) RETURN DM_NMF_Feature_Set PIPELINED;
Parameters
Table 43-92 GET_MODEL_DETAILS_NMF Function Parameters
Parameter | Description |
---|---|
|
Name of the model in the form [schema_name.]model_name. If you do not specify a schema, then your own schema is used. |
|
Specifies a partition in a partitioned model |
Return Values
Table 43-93 GET_MODEL_DETAILS_NMF Function Return Values
Return Value | Description |
---|---|
|
A set of rows of (feature_id NUMBER, mapped_feature_id VARCHAR2(4000), attribute_set DM_NMF_ATTRIBUTE_SET) The (attribute_name VARCHAR2(4000), attribute_subname VARCHAR2(4000), attribute_value VARCHAR2(4000), coefficient NUMBER) |
Usage Notes
-
The table function pipes out rows of type
DM_NMF_FEATURE_SET
. For information on Data Mining datatypes and piped output from table functions, see "Datatypes". -
When the value is NULL for a partitioned model, an exception is thrown. When the value is not null, it must contain the desired partition name.
Examples
The following example returns model details for the feature extraction model NMF_SH_Sample
, which was created by the sample program dmnmdemo.sql
. For information about the sample programs, see Oracle Data Mining User's Guide.
SELECT * FROM ( SELECT F.feature_id, A.attribute_name, A.attribute_value, A.coefficient FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF('NMF_SH_Sample')) F, TABLE(F.attribute_set) A ORDER BY feature_id,attribute_name,attribute_value ) WHERE ROWNUM < 11; FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT --------- ----------------------- ---------------- ------------------- 1 AFFINITY_CARD .051208078859308 1 AGE .0390513260041573 1 BOOKKEEPING_APPLICATION .0512734004239326 1 BULK_PACK_DISKETTES .232471260895683 1 COUNTRY_NAME Argentina .00766817464479959 1 COUNTRY_NAME Australia .000157637881096675 1 COUNTRY_NAME Brazil .0031409632415604 1 COUNTRY_NAME Canada .00144213099311427 1 COUNTRY_NAME China .000102279310968754 1 COUNTRY_NAME Denmark .000242424084307513
Related Topics
43.7.32 GET_MODEL_DETAILS_OC Function
The GET_MODEL_DETAILS_OC
function returns a set of rows that provide the details of an O-Cluster clustering model. The rows are an enumeration of the Clustering patterns generated during the creation of the model. Starting from Oracle Database 12c Release 2, this function is deprecated. See "Model Detail Views" in Oracle Data Mining User’s Guide.
You can provide input to GET_MODEL_DETAILS_OC
to request specific information about the model, thus improving the performance of the query. If you do not specify filtering parameters, then GET_MODEL_DETAILS_OC
returns all the information about the model.
Syntax
DBMS_DATA_MINING.get_model_details_oc( model_name VARCHAR2, cluster_id NUMBER DEFAULT NULL, attribute VARCHAR2 DEFAULT NULL, centroid NUMBER DEFAULT 1, histogram NUMBER DEFAULT 1, rules NUMBER DEFAULT 2, topn_attributes NUMBER DEFAULT NULL, partition_name VARCHAR2 DEFAULT NULL) RETURN dm_clusters PIPELINED;
Parameters
Table 43-94 GET_MODEL_DETAILS_OC Function Parameters
Parameter | Description |
---|---|
|