45 DBMS_DATA_MINING

The DBMS_DATA_MINING package is the application programming interface for creating, evaluating, and querying data mining models.

This chapter contains the following topics:

Using DBMS_DATA_MINING

This section contains topics that relate to using the DBMS_DATA_MINING package.

Overview

Oracle Data Mining supports both supervised and unsupervised data mining. Supervised data mining predicts a target value based on historical data. Unsupervised data mining discovers natural groupings and does not use a target. You can use Oracle Data Mining to mine structured data and unstructured text.

Supervised data mining functions include:

  • Classification

  • Regression

  • Feature Selection (Attribute Importance)

Unsupervised data mining functions include:

  • Clustering

  • Association

  • Feature Extraction

  • Anomaly Detection

The steps you use to build and apply a mining model depend on the data mining function and the algorithm being used. The algorithms supported by Oracle Data Mining are listed in Table 45-1.

Table 45-1 Oracle Data Mining Algorithms

Algorithm Abbreviation Function

Apriori

AP

Association

Decision Tree

DT

Classification

Expectation Maximization

EM

Clustering

Generalized Linear Model

GLM

Classification, Regression

k-Means

KM

Clustering

Minimum Descriptor Length

MDL

Attribute Importance)

Naive Bayes

NB

Classification

Non-Negative Matrix Factorization

NMF

Feature Extraction

Orthogonal Partitioning Clustering

O-Cluster

Clustering

Singular Value Decomposition and Principal Component Analysis

SVD and PCA

Feature Extraction

Support Vector Machine

SVM

Classification, Regression, Anomaly Detection


Oracle Data Mining supports more than one algorithm for the classification, regression, clustering, and feature extraction mining functions. Each of these mining functions has a default algorithm, as shown in Table 45-2.

Table 45-2 Oracle Data Mining Default Algorithms

Mining Function Default Algorithm

Classification

Naive Bayes

Clustering

k-Means

Feature Extraction

Non-Negative Matrix Factorization

Feature Selection

Minimum Descriptor Length

Regression

Support Vector Machine


Security Model

The DBMS_DATA_MINING package is owned by user SYS and is installed as part of database installation. Execution privilege on the package is granted to public. The routines in the package are run with invokers' rights (run with the privileges of the current user).

The DBMS_DATA_MINING package exposes APIs that are leveraged by the Oracle Data Mining component of the Advanced Analytics Option. Users who wish to create mining models in their own schema require the CREATE MINING MODEL system privilege (as well as the CREATE TABLE and CREATE VIEW system privilege). Users who wish to create mining models in other schemas require the CREATE ANY MINING MODEL system privilege (as well as the corresponding table and view creation privileges).

Users have full control over managing models that exist within their own schema. Additional system privileges necessary for managing data mining models in other schemas include ALTER ANY MINING MODEL, DROP ANY MINING MODEL, SELECT ANY MINING MODEL, COMMENT ANY MINING MODEL, and AUDIT ANY.

Individual object privileges on mining models, ALTER MINING MODEL and SELET MINING MODEL, can be used to selectively grant privileges on a model to a different user.

See Also:

Oracle Data Mining User's Guide for more information about the security features of Oracle Data Mining

Mining Functions

A data mining function refers to the methods for solving a given class of data mining problems. The mining function must be specified when a model is created. (See CREATE_MODEL Procedure.)

Table 45-3 Mining Functions

Value Description

ASSOCIATION

Association is a descriptive mining function. An association model identifies relationships and the probability of their occurrence within a data set.

Association models use the Apriori algorithm.

ATTRIBUTE_IMPORTANCE

Attribute importance is a predictive mining function, also known as feature selection. An attribute importance model identifies the relative importance of an attribute in predicting a given outcome.

Attribute importance models use Minimum Description Length.

CLASSIFICATION

Classification is a predictive mining function. A classification model uses historical data to predict a categorical target.

Classification models can use: Naive Bayes, Decision Tree, Logistic Regression, or Support Vector Machine. The default is Naive Bayes.

The classification function can also be used for anomaly detection. In this case, the SVM algorithm with a null target is used (One-Class SVM).

CLUSTERING

Clustering is a descriptive mining function. A clustering model identifies natural groupings within a data set.

Clustering models can use k-Means, O-Cluster, or Expectation Maximization. The default is k-Means.

FEATURE_EXTRACTION

Feature Extraction is a descriptive mining function. A feature extraction model creates an optimized data set on which to base a model.

Feature extraction models can use Non-Negative Matrix Factorization, Singular Value Decomposition, or Principal Component Analysis. Non-Negative Matrix Factorization is the default.

REGRESSION

Regression is a predictive mining function. A regression model uses historical data to predict a numerical target.

Regression models can use Support Vector Machine or Linear Regression. The default is Support Vector Machine.


See Also:

Oracle Data Mining Concepts for more information about mining functions

Model Settings

Oracle Data Mining uses settings to specify the algorithm and other characteristics of a model. Some settings are general, some are specific to a mining function, and some are specific to an algorithm.

All settings have default values. If you want to override one or more of the settings for a model, you must create a settings table. The settings table must have the column names and datatypes shown in Table 45-4.

Table 45-4 Required Columns in the Model Settings Table

Column Name Datatype

SETTING_NAME

VARCHAR2(30)

SETTING_VALUE

VARCHAR2(4000)


The information you provide in the settings table is used by the model at build time. The name of the settings table is an optional argument to the CREATE_MODEL Procedure.

You can find the settings used by a model by querying the data dictionary view ALL_MINING_MODEL_SETTINGS. This view lists the model settings used by the mining models to which you have access. All the setting values are included in the view, whether default or user-specified.

See Also:

Algorithm Names

The ALGO_NAME setting specifies the model algorithm. The values for the ALGO_NAME setting are listed in Table 45-5.

Table 45-5 Algorithm Names

ALGO_NAME Value Description Mining Function

ALGO_AI_MDL

Minimum Description Length

Attribute Importance

ALGO_APRIORI_ASSOCIATION_RULES

Apriori

Association Rules

ALGO_DECISION_TREE

Decision Tree

Classification

ALGO_EXPECTATION_MAXIMIZATION

Expectation Maximization

Clustering

ALGO_GENERALIZED_LINEAR_MODEL

Generalized Linear Model

Classification, Regression; also Feature Selection and Extraction

ALGO_KMEANS

Enhanced k_Means

Clustering

ALGO_NAIVE_BAYES

Naive Bayes

Classification

ALGO_NONNEGATIVE_MATRIX_FACTOR

Non-Negative Matrix Factorization

Feature Extraction

ALGO_O_CLUSTER

O-Cluster

Clustering

ALGO_SINGULAR_VALUE_DECOMP

Singular Value Decomposition

Feature Extraction

ALGO_SUPPORT_VECTOR_MACHINES

Support Vector Machine

Classification and Regression


See Also:

Oracle Data Mining Concepts for information about algorithms

Automatic Data Preparation

The PREP_AUTO setting indicates whether or not the model will use Automatic Data Preparation (ADP). By default ADP is disabled.

When you enable ADP, the model uses heuristics to transform the build data according to the requirements of the algorithm. The transformation instructions are stored with the model and reused whenever the model is applied. You can view the transformation instructions in the model details (DBMS_DATA_MINING.GET_MODEL_DETAILS_* functions).

You can choose to supplement automatic data preparations by specifying additional transformations in the xform_list parameter when you build the model. (See "CREATE_MODEL Procedure".)

If you do not use ADP (default) and do not specify transformations in the xform_list parameter to CREATE_MODEL (also the default), you must implement your own transformations separately in the build, test, and scoring data. You must take special care to implement the exact same transformations in each data set.

If you do not use ADP, but you do specify transformations in the xform_list parameter to CREATE_MODEL, Oracle Data Mining embeds the transformation definitions in the model and prepares the test and scoring data to match the build data.

The values for the PREP_AUTO setting are described in Table 45-6.

Table 45-6 PREP_AUTO Setting

PREP_AUTO Value Description

PREP_AUTO_OFF

Disable Automatic Data Preparation (default).

PREP_AUTO_ON

Enable Automatic Data Preparation.


See Also:

Oracle Data Mining User's Guide for information about data transformations

Mining Function Settings

The settings described in Table 45-7 apply to a mining function.

Table 45-7 Mining Function Settings

Mining Function Setting Name Setting Value Description

Association

ASSO_MAX_RULE_LENGTH

TO_CHAR( 2<= numeric_expr <=20)

Maximum rule length for association rules.

Default is 4.

Association

ASSO_MIN_CONFIDENCE

TO_CHAR( 0<= numeric_expr <=1)

Minimum confidence for association rules.

Default is 0.1.

Association

ASSO_MIN_SUPPORT

TO_CHAR( 0<= numeric_expr <=1)

Minimum support for association rules.

Default is 0.1.

Classification

CLAS_COST_TABLE_NAME

table_name

(Decision Tree only) Name of a table that stores a cost matrix to be used by the algorithm in building the model. The cost matrix specifies the costs associated with misclassifications.

Only Decision Tree models can use a cost matrix at build time. All classification algorithms can use a cost matrix at apply time.

The cost matrix table is user-created. See "ADD_COST_MATRIX Procedure" for the column requirements.

See Oracle Data Mining Concepts for information about costs.

Classification

CLAS_PRIORS_TABLE_NAME

table_name

(Naive Bayes) Name of a table that stores prior probabilities to offset differences in distribution between the build data and the scoring data.

The priors table is user-created. See Oracle Data Mining User's Guide for the column requirements. See Oracle Data Mining Concepts for additional information about priors.

Classification

CLAS_WEIGHTS_TABLE_NAME

table_name

(GLM and SVM only) Name of a table that stores weighting information for individual target values in SVM classification and GLM logistic regression models. The weights are used by the algorithm to bias the model in favor of higher weighted classes.

The class weights table is user-created. See Oracle Data Mining User's Guide for the column requirements. See Oracle Data Mining Concepts for additional information about class weights.

Clustering

CLUS_NUM_CLUSTERS

TO_CHAR( numeric_expr >=1)

Maximum number of leaf clusters generated by a clustering algorithm. The algorithm may return fewer clusters, depending on the data.

Enhanced k-Means usually produces the exact number of clusters specified by CLUS_NUM_CLUSTERS, unless there are fewer distinct data points.

Expectation Maximization (EM) may return fewer clusters than the number specified by CLUS_NUM_CLUSTERS depending on the data. The number of clusters returned by EM cannot be greater than the number of components, which is governed by algorithm-specific settings. (See Table 45-10, "Expectation Maximization Settings for Data Preparation and Analysis") Depending on these settings, there may be fewer clusters than components. If component clustering is disabled, the number of clusters equals the number of components.

For EM, the default value of CLUS_NUM_CLUSTERS is system-determined. For k-Means and O-CLuster, the default is 10.

Feature Extraction

FEAT_NUM_FEATURES

TO_CHAR( numeric_expr >=1)

Number of features to be extracted by a feature extraction model.

The default is estimated from the data by the algorithm.


See Also:

Oracle Data Mining Concepts for information about mining functions

Global Settings

The configuration settings in Table 45-8 are applicable to any type of model, but are currently only implemented for specific algorithms.

Table 45-8 Global Settings

Setting Name Setting Value Description

ODMS_APPROXIMATE_COMPUTATION

ODMS_APPR_COMP_ENABLE

ODMS_APPR_COMP_DISABLE

(EM, GLM, and SVD) Whether the algorithm should use approximate computations to improve performance.

For EM, approximation is appropriate for large models with many components and for data sets with many columns. The approximate computation uses localized parameter optimization that restricts learning to parameters that are likely to have the most significant impact on the model.

For SVD, approximation is often appropriate for data sets with many columns. An approximate low-rank decomposition provides good solutions at a reasonable computational cost. If you disable ODMS_APPROXIMATE_COMPUTATION for SVD, approximation is dependent on the characteristics of the data. For data sets with more than 2500 attributes (the maximum number of features allowed) only approximate decomposition is possible. If approximate computation is disabled for a data set with more than 2500 attributes, an exception is raised.

For GLM, approximation is appropriate for data sets that have many rows and are densely populated (not sparse).

Default is system determined.

ODMS_ITEM_ID_COLUMN_NAME

column_name

(Association Rules only) Name of a column that contains the items in a transaction. When this setting is specified, the algorithm expects the data to be presented in native transactional format, consisting of two columns:

  • Case ID, either categorical or numerical

  • Item ID, either categorical or numerical, specified by ODMS_ITEM_ID_COLUMN_NAME

A typical example of transactional data is market basket data, wherein a case represents a basket that may contain many items. Each item is stored in a separate row, and many rows may be needed to represent a case. The case ID values do not uniquely identify each row. Transactional data is also called multi-record case data.

Association Rules is normally used with transactional data, but it can also be applied to single-record case data (similar to other algorithms).

For more information about single-record and multi-record case data, see Oracle Data Mining User's Guide.

ODMS_ITEM_VALUE_COLUMN_NAME

column_name

(Association Rules only) Name of a column that contains a value associated with each item in a transaction. This setting is only used when a value has been specified for ODMS_ITEM_ID_COLUMN_NAME indicating that the data is presented in native transactional format.

When ODMS_ITEM_VALUE_COLUMN_NAME is specified, the algorithm expects the build data to consist of three columns:

  • Case ID, either categorical or numerical

  • Item ID, either categorical or numerical, specified by ODMS_ITEM_ID_COLUMN_NAME

  • Item value, either categorical or numerical, specified by ODMS_ITEM_VALUE_COLUMN_NAME

The item value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples).

ODMS_MISSING_VALUE_TREATMENT

ODMS_MISSING_VALUE_MEAN_MODE

ODMS_MISSING_VALUE_DELETE_ROW

(GLM only) How to treat missing values in the training data. This setting does not affect the scoring data. The default value is ODMS_MISSING_VALUE_MEAN_MODE, which causes missing numeric values in the training data to be replaced with the mean and missing categorical values in the training data to be replaced with the mode.

Oracle Data Mining replaces missing values with the mean (numeric attributes) or the mode (categorical attributes) both at build time and apply time. You can set ODMS_MISSING_VALUE_TREATMENT to ODMS_MISSING_VALUE_DELETE_ROW to override this behavior in the training data. When ODMS_MISSING_VALUE_TREATMENT is set to ODMS_MISSING_VALUE_DELETE_ROW, the rows in the training data that contain missing values are deleted. However, if you want to replicate this missing value treatment in the scoring data, you must perform the transformation explicitly.

The value ODMS_MISSING_VALUE_DELETE_ROW is only valid for tables without nested columns. If this value is used with nested data, an exception is raised.

ODMS_ROW_WEIGHT_COLUMN_NAME

column_name

(GLM only) Name of a column in the training data that contains a weighting factor for the rows. The column datatype must be NUMBER.

Row weights can be used as a compact representation of repeated rows, as in the design of experiments where a specific configuration is repeated several times. Row weights can also be used to emphasize certain rows during model construction. For example, to bias the model towards rows that are more recent and away from potentially obsolete data.

ODMS_TEXT_POLICY_NAME

The name of an Oracle Text POLICY created using CTX_DDL.CREATE_POLICY.

Affects how individual tokens are extracted from unstructured text.

For details about CTX_DDL.CREATE_POLICY, see Oracle Text Reference.

ODMS_TEXT_MAX_FEATURES

1 <= value <= 100000

Maximum number of distinct features, across all text attributes, to use from a document set passed to CREATE_MODEL. The default is 3000.


See Also:

Oracle Data Mining Concepts for information about GLM

Oracle Data Mining Concepts for information about Association Rules

Oracle Data Mining User's Guide for information about mining unstructured text

Algorithm Settings: Decision Tree

These settings configure the behavior of the Decision Tree algorithm.

Table 45-9 Decision Tree Settings

Setting Name Setting Value Description

TREE_IMPURITY_METRIC

TREE_IMPURITY_ENTROPY

TREE_IMPURITY_GINI

Tree impurity metric for Decision Tree.

Tree algorithms seek the best test question for splitting data at each node. The best splitter and split value are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured in accordance with a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. By default, the algorithm uses gini.

TREE_TERM_MAX_DEPTH

TO_CHAR( 2<= numeric_expr <=20)

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

Default is 7.

TREE_TERM_MINPCT_MODE

TO_CHAR( 0<= numeric_expr <=10)

No child shall have fewer records than this number, which is expressed as a percentage of the training rows.

Default is 0.05, indicating 0.05%.

TREE_TERM_MINPCT_SPLIT

TO_CHAR( 0 <= numeric_expr <=20)

Criteria for splits: minimum number of records in a parent node expressed as a percent of the total number of records used to train the model. No split is attempted if number of records is below this value.

Default is 0.1, indicating 0.1%.

TREE_TERM_MINREC_NODE

TO_CHAR( numeric_expr >=0)

No child shall have fewer records than this number.

Default is 10.

TREE_TERM_MINREC_SPLIT

TO_CHAR( numeric_expr >=0)

Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if number of records is below this value.

Default is 20.


See Also:

Oracle Data Mining Concepts for information about Decision Tree

Algorithm Settings: Expectation Maximization

The settings described in the following tables configure the behavior of the Expectation Maximization algorithm.

See Also:

Oracle Data Mining Concepts for information about Expectation Maximization

Table 45-10 Expectation Maximization Settings for Data Preparation and Analysis

Setting Name Setting Value Description

EMCS_ATTRIBUTE_FILTER

EMCS_ATTR_FILTER_ENABLE

EMCS_ATTR_FILTER_DISABLE

Whether or not to include uncorrelated attributes in the model. When EMCS_ATTRIBUTE_FILTER is enabled, uncorrelated attributes are not included.

Note: This setting applies only to attributes that are not nested.

Default is system-determined.

EMCS_MAX_NUM_ATTR_2D

TO_CHAR( numeric_expr >=1)

Maximum number of correlated attributes to include in the model.

Note: This setting applies only to attributes that are not nested (2D).

Default is 50.

EMCS_NUM_DISTRIBUTION

EMCS_NUM_DISTR_BERNOULLI

EMCS_NUM_DISTR_GAUSSIAN

EMCS_NUM_DISTR_SYSTEM

The distribution for modeling numeric attributes. Applies to the input table/view as a whole and does not allow per-attribute specifications.

The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numerical attributes are modeled using the same type of distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

Default is EMCS_NUM_DISTR_SYSTEM.

EMCS_NUM_EQUIWIDTH_BINS

TO_CHAR( 1 <numeric_expr <=255)

Number of equi-width bins that will be used for gathering cluster statistics for numerical columns.

Default is 11.

EMCS_NUM_PROJECTIONS

TO_CHAR( numeric_expr >=1)

Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns.

Default is 50.

EMCS_NUM_QUANTILE_BINS

TO_CHAR( 1 <numeric_expr <=255)

Specifies the number of quantile bins that will be used for modeling numerical columns with multivalued Bernoulli distributions.

Default is system-determined.

EMCS_NUM_TOPN_BINS

TO_CHAR( 1 <numeric_expr <=255)

Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

Default is system-determined.


Table 45-11 Expectation Maximization Settings for Learning

Setting Name Setting Value Description

EMCS_CONVERGENCE_CRITERION

EMCS_CONV_CRIT_HELDASIDE

EMCS_CONV_CRIT_BIC

The convergence criterion for EM. The convergence criterion may be based on a held-aside data set, or it may be Bayesian Information Criterion.

Default is system determined.

EMCS_LOGLIKE_IMPROVEMENT

TO_CHAR( 0 < numeric_expr < 1)

When the convergence criterion is based on a held-aside data set (EMCS_CONVERGENCE_CRITERION = EMCS_CONV_CRIT_HELDASIDE), this setting specifies the percentage improvement in the value of the log likelihood function that is required for adding a new component to the model.

Default value is 0.001.

EMCS_NUM_COMPONENTS

TO_CHAR( numeric_expr >=1)

Maximum number of components in the model. The algorithm automatically determines the number of components, based on improvements in the likelihood function or based on regularization, up to the specified maximum.

The number of components must be greater than or equal to the number of clusters.

Default is 20.

EMCS_NUM_ITERATIONS

TO_CHAR( numeric_expr >=1)

Specifies the maximum number of iterations in the EM algorithm.

Default is 100.


Table 45-12 Expectation Maximization Settings for Component Clustering

Setting Name Setting Value Description

EMCS_CLUSTER_COMPONENTS

EMCS__CLUSTER_COMP_ENABLE

EMCS_CLUSTER_COMP_DISABLE

Enables or disables the grouping of EM components into high-level clusters. When disabled, the components themselves are treated as clusters.

When component clustering is enabled, model scoring through the SQL CLUSTER function will produce assignments to the higher level clusters. When clustering is disabled, the CLUSTER function will produce assignments to the original components.

Default is EMCS_CLUSTER_COMP_ENABLE.

EMCS_CLUSTER_THRESH

TO_CHAR( numeric_expr >=1)

Dissimilarity threshold that controls the clustering of EM components. When the dissimilarity measure is less than the threshold, the components are combined into a single cluster.

A lower threshold may produce more clusters that are more compact. A higher threshold may produce fewer clusters that are more spread out.

Default is 2.

EMCS_LINKAGE_FUNCTION

EMCS_LINKAGE_SINGLE

EMCS_LINKAGE_AVERAGE

EMCS_LINKAGE_COMPLETE

Allows the specification of a linkage function for the agglomerative clustering step.

EMCS_LINKAGE_SINGLE uses the nearest distance within the branch. The clusters tend to be larger and have arbitrary shapes.

EMCS_LINKAGE_AVERAGE uses the average distance within the branch. There is less chaining effect and the clusters are more compact.

EMCS_LINKAGE_COMPLETE uses the maximum distance within the branch. The clusters are smaller and require strong component overlap.

Default is EMCS_LINKAGE_SINGLE.


Table 45-13 Expectation Maximization Settings for Cluster Statistics

Setting Name Setting Value Description

EMCS_CLUSTER_STATISTICS

EMCS_CLUS_STATS_ENABLE

EMCS_CLUS_STATS_DISABLE

Enables or disables the gathering of descriptive statistics for clusters (centroids, histograms, and rules). When statistics are disabled, model size is reduced, and GET_MODEL_DETAILS_EM only returns taxonomy (hierarchy) and cluster counts.

Default is EMCS_CLUS_STATS_ENABLE.

EMCS_MIN_PCT_ATTR_SUPPORT

TO_CHAR( 0 < numeric_expr < 1)

Minimum support required for including an attribute in the cluster rule. The support is the percentage of the data rows assigned to a cluster that must have non-null values for the attribute.

Default is 0.1


Algorithm Settings: Generalized Linear Models

These settings configure the behavior of Generalized Linear Models.

Table 45-14 GLM Settings

Setting Name Setting Value Description

GLMS_CONF_LEVEL

TO_CHAR(0< numeric_expr <1)

The confidence level for coefficient confidence intervals.

The default confidence level is 0.95.

GLMS_DIAGNOSTICS_TABLE_NAME

table_name

The name of a table to contain row-level diagnostic information for GLM. The table is created during model build. The specified table name must not be the name of an existing table.

If you use a diagnostics table, you must specify a case ID for the model. (See the CREATE_MODEL Procedure.) If you specify a diagnostics table name but do not provide a case ID for the model, an exception is raised.

For information about GLM diagnostics, see Oracle Data Mining Concepts.

GLMS_FTR_ACCEPTANCE

GLMS_FTR_ACCEPTANCE_STRICT

GLMS_FTR_ACCEPTANCE_RELAXED

Whether to use a strict or relaxed approach to feature acceptance. A strict approach guards against spurious features but may miss some true features, especially in small data samples. A relaxed approach is unlikely to miss true features, but might include some spurious features.

When feature selection or generation is enabled, the algorithm automatically chooses a strict or relaxed approach to feature acceptance based on the data.

GLMS_FTR_GEN_METHOD

GLMS_FTR_GEN_QUADRATIC

GLMS_FTR_GEN_CUBIC

Whether feature generation is quadratic or cubic.

When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.

GLMS_FTR_GENERATION

GLMS_FTR_GENERATION_ENABLE

GLMS_FTR_GENERATION_DISABLE

Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

Note: Feature generation can only be enabled when feature selection is also enabled.

GLMS_FTR_IDENTIFICATION

GLMS_FTR_IDENT_QUICK

GLMS_FTR_IDENT_COMPLETE

Whether or not GLM uses internal sampling for identifying model features. With sampling, GLM operates on a reduced set of rows and thus can identify the model features more quickly. When GLM operates on all the rows in the data set, the identification of features can be a more lengthy process.

By default, GLM chooses whether or not to sample and determines the sample size based on characteristics of the training data, including the number of rows and attributes.

GLMS_FTR_SEL_CRIT

GLMS_FTR_SEL_AIC

GLMS_FTR_SEL_SBIC

GLMS_FTR_SEL_RIC

GLMS_FTR_SEL_ALPHA_INV

Feature selection penalty criterion for adding a feature to the model.

When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

GLMS_FTR_SELECTION

GLMS_FTR_SELECTION_ENABLE

GLMS_FTR_SELECTION_DISABLE

Whether or not feature selection is enabled for GLM.

By default, feature selection is not enabled.

GLMS_MAX_FEATURES

TO_CHAR(0 < numeric_expr <= 2000)

When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model.

By default, the algorithm limits the number of features to ensure sufficient memory.

GLMS_PRUNE_MODEL

GLMS_PRUNE_MODEL_ENABLE

GLMS_PRUNE_MODEL_DISABLE

Whether or not to prune the features in the final model. Pruning is based on t-statistics for linear regression or wald statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data.

When feature selection is enabled, the algorithm automatically determines whether or not to perform pruning based on the data.

GLMS_REFERENCE_CLASS_NAME

target_value

The target value to be used as the reference class in a binary logistic regression model. Probabilities will be produced for the other class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

GLMS_RIDGE_REGRESSION

GLMS_RIDGE_REG_ENABLE

GLMS_RIDGE_REG_DISABLE

Whether or not to use ridge regression. Ridge applies to both regression and classification mining functions.

When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function.

By default, the algorithm determines whether or not to use ridge.

Note: Ridge may only be enabled when feature selection is not specified or has been explicitly disabled. If ridge regression and feature selection are both explicitly enabled, an exception is raised.

GLMS_RIDGE_VALUE

TO_CHAR(numeric_expr > 0)

The value of the ridge parameter. This setting is only used when the algorithm is configured to use ridge regression.

If ridge regression is enabled internally by the algorithm, the ridge parameter is determined by the algorithm.

GLMS_SELECT_BLOCK

GLMS_SELECT_BLOCK_ENABLE

GLMS_SELECT_BLOCK_DISABLE

Whether to add the entire group of values for a categorical attribute to the model all at once or one at a time. A categorical attribute is represented as a collection of binary (exploded) attributes. When GLMS_SELECT_BLOCK is enabled, the values are added all at once.

When feature selection is enabled, the algorithm automatically determines whether to add the exploded categorical values all at once or one at a time.

GLMS_VIF_FOR_RIDGE

GLMS_VIF_RIDGE_ENABLE

GLMS_VIF_RIDGE_DISABLE

(Linear regression only) Whether or not to produce Variance Inflation Factor (VIF) statistics when ridge is being used.

VIF statistics can only be produced when ridge regression is explicitly enabled (GLMS_RIDGE_REGRESSION = GLMS_RIDGE_REG_ENABLE). When ridge is explicitly enabled and you request VIF, the algorithm produces VIF if there is sufficient memory.

By default, VIF is not produced when ridge is enabled.


See Also:

Oracle Data Mining Concepts for information about GLM

Algorithm Settings: k-Means

These settings configure the behavior of the k-Means algorithm.

Table 45-15 k-Means Settings

Setting Name Setting Value Description

KMNS_BLOCK_GROWTH

TO_CHAR(1<numeric_expr<=5)

Growth factor for memory allocated to hold cluster data.

The default block growth factor is 2

KMNS_CONV_TOLERANCE

TO_CHAR(0<numeric_expr<=0.5)

Minimum convergence tolerance for k-Means. The algorithm iterates until the minimum convergence tolerance is satisfied or until the maximum number of iterations, specified in KMNS_ITERATIONS, is reached.

Decreasing the convergence tolerance produces a more accurate solution but may result in longer run times.

The default convergence tolerance is 0.01.

KMNS_DISTANCE

KMNS_COSINE

KMNS_EUCLIDEAN

KMNS_FAST_COSINE

Distance function for k-Means.

The default distance function is euclidean.

KMNS_ITERATIONS

TO_CHAR(positive_numeric_expr)

Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum convergence tolerance, specified in KMNS_CONV_TOLERANCE, is satisfied.

The default number of iterations is 3.

KMNS_MIN_PCT_ATTR_SUPPORT

TO_CHAR(0<=numeric_expr<=1)

Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.

If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules.

The default minimum support is 0.1.

KMNS_NUM_BINS

TO_CHAR(numeric_expr>0)

Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

The default number of histogram bins is 10.

KMNS_SPLIT_CRITERION

KMNS_SIZE

KMNS_VARIANCE

Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time.

When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster.

The default split criterion is the variance.


See Also:

Oracle Data Mining Concepts for information about k-Means

Algorithm Settings: Naive Bayes

These settings configure the behavior of the Naive Bayes Algorithm.

Table 45-16 Naive Bayes Settings

Setting Name Setting Value Description

NABS_PAIRWISE_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of pairwise threshold for NB algorithm

Default is 0.01.

NABS_SINGLETON_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of singleton threshold for NB algorithm

Default value is 0.01


See Also:

Oracle Data Mining Concepts for information about Naive Bayes

Algorithm Settings: Non-Negative Matrix Factorization

These settings configure the behavior of the Non-Negative Matrix Factorization algorithm.

You can query the data dictionary view *_MINING_MODEL_SETTINGS (using the ALL, USER, or DBA prefix) to find the setting values for a model. See Oracle Database Reference for information about *_MINING_MODEL_SETTINGS.

Table 45-17 NMF Settings

Setting Name Setting Value Description

NMFS_CONV_TOLERANCE

TO_CHAR(0< numeric_expr <=0.5)

Convergence tolerance for NMF algorithm

Default is 0.05

NMFS_NONNEGATIVE_SCORING

NMFS_NONNEG_SCORING_ENABLE

NMFS_NONNEG_SCORING_DISABLE

Whether negative numbers should be allowed in scoring results. When set to NMFS_NONNEG_SCORING_ENABLE, negative feature values will be replaced with zeros. When set to NMFS_NONNEG_SCORING_DISABLE, negative feature values will be allowed.

Default is NMFS_NONNEG_SCORING_ENABLE

NMFS_NUM_ITERATIONS

TO_CHAR(1 <= numeric_expr <=500)

Number of iterations for NMF algorithm

Default is 50

NMFS_RANDOM_SEED

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.


See Also:

Oracle Data Mining Concepts for information about NMF

Algorithm Settings: O-Cluster

These settings configure the behavior of the O-Cluster algorithm.

Table 45-18 O-CLuster Settings

Setting Name Setting Value Description

OCLT_MAX_BUFFER

TO_CHAR(numeric_expr >0)

Buffer size for O-Cluster.

Default is 50,000.

OCLT_SENSITIVITY

TO_CHAR(0 <=numeric_expr <=1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5.


See Also:

Oracle Data Mining Concepts for information about O-Cluster

Algorithm Constants and Settings: Singular Value Decomposition

The following constant affects the behavior of the Singular Value Decomposition algorithm.

Table 45-19 Singular Value Decomposition Constant

Constant Name Constant Value Description

SVDS_MAX_NUM_FEATURES

2500

The maximum number of features supported by SVD.


These settings configure the behavior of the Singular Value Decomposition algorithm.

Table 45-20 Singular Value Decomposition Settings

Setting Name Setting Value Description

SVDS_U_MATRIX_OUTPUT

SVDS_U_MATRIX_ENABLE

SVDS_U_MATRIX_DISABLE

Whether or not to persist the U matrix produced by SVD.

The U matrix in SVD has as many rows as the number of rows in the build data. To avoid creating a large model, the U matrix is persisted only when SVDS_U_MATRIX_OUTPUT is enabled.

When SVDS_U_MATRIX_OUTPUT is enabled, the build data must include a case ID. If no case ID is present and the U matrix is requested, an exception is raised.

Default is SVDS_U_MATRIX_DISABLE.

SVDS_SCORING_MODE

SVDS_SCORING_SVD

SVDS_SCORING_PCA

Whether to use SVD or PCA scoring for the model.

When the build data is scored with SVD, the projections will be the same as the U matrix. When the build data is scored with PCA, the projections will be the product of the U and S matrices.

Default is SVDS_SCORING_SVD.


Algorithm Settings: Support Vector Machine

These settings configure the behavior of the Support Vector Machine algorithm.

Table 45-21 SVM Settings

Setting Name Setting Value Description

SVMS_ACTIVE_LEARNING

SVMS_AL_DISABLE

SVMS_AL_ENABLE

Whether active learning is enabled or disabled. By default, active learning is enabled.

When active learning is enabled, the SVM algorithm uses active learning to build a reduced size model. When active learning is disabled, the SVM algorithm builds a standard model.

SVMS_COMPLEXITY_FACTOR

TO_CHAR(numeric_expr >0)

Value of complexity factor for SVM algorithm (both classification and regression).

Default value estimated from the data by the algorithm.

SVMS_CONV_TOLERANCE

TO_CHAR(numeric_expr >0)

Convergence tolerance for SVM algorithm.

Default is 0.001.

SVMS_EPSILON

TO_CHAR(numeric_expr >0)

Value of epsilon factor for SVM regression.

Default value estimated from the data by the algorithm.

SVMS_KERNEL_CACHE_SIZE

TO_CHAR(numeric_expr >0)

Value of kernel cache size for SVM algorithm. Applies to Gaussian kernel only.

Default is 50000000 bytes.

SVMS_KERNEL_FUNCTION

svm_gaussian

svms_linear

Kernel for Support Vector Machine. The default is determined by the algorithm based on the number of attributes in the training data. When there are many attributes, the algorithm uses a linear kernel, otherwise it uses a nonlinear (Gaussian) kernel.

The number of attributes does not correspond to the number of columns in the training data. The algorithm explodes categorical attributes to binary, numeric attributes. In addition, Oracle Data Mining handles each row in a nested column as a separate attribute. SVM takes these factors into account when choosing the kernel function.

SVMS_OUTLIER_RATE

TO_CHAR(0< numeric_expr <1)

The desired rate of outliers in the training data. Valid for One-Class SVM models only (anomaly detection).

Default is.1.

SVMS_STD_DEV

TO_CHAR(numeric_expr >0)

Value of standard deviation for SVM algorithm.

This is applicable only for Gaussian kernel.

Default value estimated from the data by the algorithm.


See Also:

Oracle Data Mining Concepts for information about SVM

Datatypes

The DBMS_DATA_MINING package defines object datatypes for storing information about model attributes. Most of these types are returned by the table functions GET_n, where n identifies the type of information to return. These functions take a model name as input and return the requested information as a collection of rows. For a list of the GET functions, see "Summary of DBMS_DATA_MINING Subprograms".

The DBMS_DATA_MINING package also defines object datatypes for mining transactional data. These types are called DM_NESTED_n, where n identifies the Oracle datatype of the nested attributes. For more information about mining nested data, see Oracle Data Mining User's Guide.

All the table functions use pipelining, which causes each row of output to be materialized as it is read from model storage, without waiting for the generation of the complete table object. For more information on pipelined, parallel table functions, consult the Oracle Database PL/SQL Language Reference.

The Data Mining object datatypes are described in Table 45-22.

Table 45-22 DBMS_DATA_MINING Summary of Datatypes

Datatype Description

DM_CENTROID

The centroid of a cluster.

DM_CENTROIDS

A collection of DM_CENTROID. A member of DM_CLUSTER.

DM_CHILD

A child node of a cluster.

DM_CHILDREN

A collection of DM_CHILD. A member of DM_CLUSTER.

DM_CLUSTER

A cluster. A cluster includes DM_PREDICATES, DM_CHILDREN, DM_CENTROIDS, and DM_HISTOGRAMS. It also includes a DM_RULE.

DM_CLUSTERS

A collection of DM_CLUSTER. Returned by GET_MODEL_DETAILS_KM Function, GET_MODEL_DETAILS_OC Function, and GET_MODEL_DETAILS_EM Function.

DM_CONDITIONAL

The conditional probability of an attribute in a Naive Bayes model.

DM_CONDITIONALS

A collection of DM_CONDITIONAL. Returned by GET_MODEL_DETAILS_NB Function.

DM_COST_ELEMENT

The actual and predicted values in a cost matrix.

DM_COST_MATRIX

A collection of DM_COST_ELEMENT. Returned by GET_MODEL_COST_MATRIX Function.

DM_EM_COMPONENT

A component of an Expectation Maximization model.

DM_EM_COMPONENT_SET

A collection of DM_EM_COMPONENT. Returned by GET_MODEL_DETAILS_EM_COMP Function.

DM_EM_PROJECTION

A projection of an Expectation Maximization model.

DM_EM_PROJECTION_SET

A collection of DM_EM_PROJECTION. Returned by GET_MODEL_DETAILS_EM_PROJ Function.

DM_GLM_COEFF

The coefficient and associated statistics of an attribute in a Generalized Linear Model.

DM_GLM_COEFF_SET

A collection of DM_GLM_COEFF. Returned by GET_MODEL_DETAILS_GLM Function.

DM_HISTOGRAM_BIN

A histogram associated with a cluster.

DM_HISTOGRAMS

A collection of DM_HISTOGRAM_BIN. A member of DM_CLUSTER.

DM_ITEM

An item in an association rule.

DM_ITEMS

A collection of DM_ITEM.

DM_ITEMSET

A collection of DM_ITEMS.

DM_ITEMSETS

A collection of DM_ITEMSET. Returned by GET_FREQUENT_ITEMSETS Function.

DM_MODEL_GLOBAL_DETAIL

High-level statistics about a model.

DM_MODEL_GLOBAL_DETAILS

A collection of DM_MODEL_GLOBAL_DETAIL. Returned by GET_MODEL_DETAILS_GLOBAL Function.

DM_NB_DETAIL

Information about an attribute in a Naive Bayes model.

DM_NB_DETAILS

A collection of DM_DB_DETAIL. Returned by GET_MODEL_DETAILS_NB Function.

DM_NESTED_BINARY_DOUBLE

The name and value of a numerical attribute of type BINARY_DOUBLE.

DM_NESTED_BINARY_DOUBLES

A collection of DM_NESTED_BINARY_DOUBLE.

DM_NESTED_BINARY_FLOAT

The name and value of a numerical attribute of type BINARY_FLOAT.

DM_NESTED_BINARY_FLOATS

A collection of DM_NESTED_BINARY_FLOAT.

DM_NESTED_CATEGORICAL

The name and value of a categorical attribute of type CHAR, VARCHAR, or VARCHAR2.

DM_NESTED_CATEGORICALS

A collection of DM_NESTED_CATEGORICAL.

DM_NESTED_NUMERICAL

The name and value of a numerical attribute of type NUMBER or FLOAT.

DM_NESTED_NUMERICALS

A collection of DM_NESTED_NUMERICAL.

DM_NMF_ATTRIBUTE

An attribute in a feature of a Non-Negative Matrix Factorization model.

DM_NMF_ATTRIBUTE_SET

A collection of DM_NMF_ATTRIBUTE. A member of DM_NMF_FEATURE.

DM_NMF_FEATURE

A feature in a Non-Negative Matrix Factorization model.

DM_NMF_FEATURE_SET

A collection of DM_NMF_FEATURE. Returned by GET_MODEL_DETAILS_NMF Function.

DM_PREDICATE

Antecedent and consequent in a rule.

DM_PREDICATES

A collection of DM_PREDICATE. A member of DM_RULE and DM_CLUSTER. Predicates are returned by GET_ASSOCIATION_RULES Function, GET_MODEL_DETAILS_EM Function, GET_MODEL_DETAILS_KM Function, and GET_MODEL_DETAILS_OC Function.

DM_RANKED_ATTRIBUTE

An attribute ranked by its importance in an Attribute Importance model.

DM_RANKED_ATTRIBUTES

A collection of DM_RANKED_ATTRIBUTE. Returned by GET_MODEL_DETAILS_AI Function.

DM_RULE

A rule that defines a conditional relationship.

The rule can be one of the association rules returned by GET_ASSOCIATION_RULES Function, or it can be a rule associated with a cluster in the collection of clusters returned by GET_MODEL_DETAILS_KM Function and GET_MODEL_DETAILS_OC Function.

DM_RULES

A collection of DM_RULE. Returned by GET_ASSOCIATION_RULES Function.

DM_SVD_MATRIX

A factorized matrix S, V, or U returned by a Singular Value Decomposition model.

DM_SVD_MATRIX_SET

A collection of DM_SVD_MATRIX. Returned by GET_MODEL_DETAILS_SVD Function.

DM_SVM_ATTRIBUTE

The name, value, and coefficient of an attribute in a Support Vector Machine model.

DM_SVM_ATTRIBUTE_SET

A collection of DM_SVM_ATTRIBUTE. Returned by GET_MODEL_DETAILS_SVM Function. Also a member of DM_SVM_LINEAR_COEFF.

DM_SVM_LINEAR_COEFF

The linear coefficient of each attribute in a Support Vector Machine model.

DM_SVM_LINEAR_COEFF_SET

A collection of DM_SVM_LINEAR_COEFF. Returned by GET_MODEL_DETAILS_SVM Function for an SVM model built using the linear kernel.

DM_TRANSFORM

The transformation and reverse transformation expressions for an attribute.

DM_TRANSFORMS

A collection of DM_TRANSFORM. Returned by GET_MODEL_TRANSFORMATIONS Function.

TRANSFORM_LIST

A list of user-specified transformations for a model. Accepted as a parameter by the CREATE_MODEL Procedure.

This collection type is defined in the DBMS_DATA_MINING_TRANSFORM package.


Summary of DBMS_DATA_MINING Subprograms

Table 45-23 summarizes the subprograms included in the DBMS_DATA_MINING package.

Table 45-23 DBMS_DATA_MINING Package Subprograms

Subprogram Purpose

ADD_COST_MATRIX Procedure

Adds a cost matrix to a classification model

ALTER_REVERSE_EXPRESSION Procedure

Changes the reverse transformation expression to an expression that you specify

APPLY Procedure

Applies a model to a data set (scores the data)

COMPUTE_CONFUSION_MATRIX Procedure

Computes the confusion matrix for a classification model

COMPUTE_LIFT Procedure

Computes lift for a a classification model

COMPUTE_ROC Procedure

Computes Receiver Operating Characteristic (ROC) for a classification model

CREATE_MODEL Procedure

Creates a model

DROP_MODEL Procedure

Drops a model

EXPORT_MODEL Procedure

Exports a model to a dump file

GET_ASSOCIATION_RULES Function

Returns the rules from an association model

GET_FREQUENT_ITEMSETS Function

Returns the frequent itemsets for an association model

GET_MODEL_COST_MATRIX Function

Returns the cost matrix for a model

GET_MODEL_DETAILS_AI Function

Returns details about an Attribute Importance model

GET_MODEL_DETAILS_EM Function

Returns details about an Expectation Maximization model

GET_MODEL_DETAILS_EM_COMP Function

Returns details about the parameters of an Expectation Maximization model

GET_MODEL_DETAILS_EM_PROJ Function

Returns details about the projects of an Expectation Maximization model

GET_MODEL_DETAILS_GLM Function

Returns details about a Generalized Linear Model

GET_MODEL_DETAILS_GLOBAL Function

Returns high-level statistics about a model

GET_MODEL_DETAILS_KM Function

Returns details about a k-Means model

GET_MODEL_DETAILS_NB Function

Returns details about a Naive Bayes model

GET_MODEL_DETAILS_NMF Function

Returns details about a Non-Negative Matrix Factorization model

GET_MODEL_DETAILS_OC Function

Returns details about an O-Cluster model

GET_MODEL_DETAILS_SVD Function

Returns details about a Singular Value Decomposition model

GET_MODEL_DETAILS_SVM Function

Returns details about a Support Vector Machine model with a linear kernel

GET_MODEL_DETAILS_XML Function

Returns details about a Decision Tree model

GET_MODEL_TRANSFORMATIONS Function

Returns the transformations embedded in a model

GET_TRANSFORM_LIST Procedure

Converts between two different transformation specification formats

IMPORT_MODEL Procedure

Imports a model into a user schema

RANK_APPLY Procedure

Ranks the predictions from the APPLY results for a classification model

REMOVE_COST_MATRIX Procedure

Removes a cost matrix from a model

RENAME_MODEL Procedure

Renames a model


ADD_COST_MATRIX Procedure

This procedure associates a cost matrix table with a classification model. The cost matrix biases the model by assigning costs or benefits to specific model outcomes.

The cost matrix is stored with the model and taken into account when the model is scored. The stored cost matrix is the default scoring matrix for the model.

You can also specify a cost matrix inline when you invoke a Data Mining SQL function for scoring. When an inline cost matrix is specified, it is used instead of the default, stored cost matrix (if one exists).

To obtain the default scoring matrix for a model, use the GET_MODEL_COST_MATRIX function. To remove the default scoring matrix from a model, use the REMOVE_COST_MATRIX procedure. See "GET_MODEL_COST_MATRIX Function" and "REMOVE_COST_MATRIX Procedure".

See Also:

Syntax

DBMS_DATA_MINING.ADD_COST_MATRIX (
       model_name                IN VARCHAR2,
       cost_matrix_table_name    IN VARCHAR2,
       cost_matrix_schema_name   IN VARCHAR2 DEFAULT NULL);

Parameters

Table 45-24 ADD_COST_MATRIX Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is assumed.

cost_matrix_table_name

Name of the cost matrix table (described in Table 45-25).

cost_matrix_schema_name

Schema of the cost matrix table. If no schema is specified, the current schema is used.


Usage Notes

  1. If the model is not in your schema, then ADD_COST_MATRIX requires the ALTER ANY MINING MODEL system privilege or the ALTER object privilege for the mining model.

  2. The cost matrix table must have the columns shown in Table 45-25.

    Table 45-25 Required Columns in a Cost Matrix Table

    Column Name Datatype

    ACTUAL_TARGET_VALUE

    Valid target data type

    PREDICTED_TARGET_VALUE

    Valid target data type

    COST

    NUMBER,FLOAT, BINARY_DOUBLE, or BINARY_FLOAT


    See Also:

    Oracle Data Mining User's Guide for valid target datatypes
  3. The types of the actual and predicted target values must be the same as the type of the model target. For example, if the target of the model is BINARY_DOUBLE, then the actual and predicted values must be BINARY_DOUBLE. If the actual and predicted values are CHAR or VARCHAR, ADD_COST_MATRIX treats them as VARCHAR2 internally.

    If the types do not match, or if the actual or predicted value is not a valid target value, the ADD_COST_MATRIX procedure raises an error.

    Note:

    If a reverse transformation is associated with the target, the actual and predicted values must be consistent with the target after the reverse transformation has been applied.

    See "Reverse Transformations and Model Transparency" for more information.

  4. Since a benefit can be viewed as a negative cost, you can specify a benefit for a given outcome by providing a negative number in the costs column of the cost matrix table.

  5. All classification algorithms can use a cost matrix for scoring. The Decision Tree algorithm can also use a cost matrix at build time. If you want to build a Decision Tree model with a cost matrix, specify the cost matrix table name in the CLAS_COST_TABLE_NAME setting in the settings table for the model. See Table 45-7, "Mining Function Settings".

    The cost matrix used to create a Decision Tree model becomes the default scoring matrix for the model. If you want to specify different costs for scoring, use the REMOVE_COST_MATRIX procedure to remove the cost matrix and the ADD_COST_MATRIX procedure to add a new one.

Example

This example creates a cost matrix table called COSTS_NB and adds it to a Naive Bayes model called NB_SH_CLAS_SAMPLE. The model has a binary target: 1 means that the customer responds to a promotion; 0 means that the customer does not respond. The cost matrix assigns a cost of .25 to misclassifications of customers who do not respond and a cost of .75 to misclassifications of customers who do respond. This means that it is three times more costly to misclassify responders than it is to misclassify non-responders.

CREATE TABLE costs_nb (
  actual_target_value           NUMBER,
  predicted_target_value        NUMBER,
  cost                          NUMBER);
INSERT INTO costs_nb values (0, 0, 0);
INSERT INTO costs_nb values (0, 1, .25);
INSERT INTO costs_nb values (1, 0, .75);
INSERT INTO costs_nb values (1, 1, 0);
COMMIT;
 
EXEC dbms_data_mining.add_cost_matrix('nb_sh_clas_sample', 'costs_nb');
 
SELECT cust_gender, COUNT(*) AS cnt, ROUND(AVG(age)) AS avg_age
   FROM mining_data_apply_v
   WHERE PREDICTION(nb_sh_clas_sample COST MODEL
      USING cust_marital_status, education, household_size) = 1
   GROUP BY cust_gender
   ORDER BY cust_gender;
   
C        CNT    AVG_AGE
- ---------- ----------
F         72         39
M        555         44

ALTER_REVERSE_EXPRESSION Procedure

This procedure replaces a reverse transformation expression with an expression that you specify. If the attribute does not have a reverse expression, the procedure creates one from the specified expression.

You can also use this procedure to customize the output of clustering, feature extraction, and anomaly detection models.

Syntax

DBMS_DATA_MINING.ALTER_REVERSE_EXPRESSION (
         model_name             VARCHAR2,
         expression             CLOB,
         attribute_name         VARCHAR2 DEFAULT NULL,
         attribute_subname      VARCHAR2 DEFAULT NULL);

Parameters

Table 45-26 ALTER_REVERSE_EXPRESSION Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

expression

An expression to replace the reverse transformation associated with the attribute.

attribute_name

Name of the attribute. Specify NULL if you wish to apply expression to a cluster, feature, or One-Class SVM prediction.

attribute_subname

Name of the nested attribute if attribute_name is a nested column, otherwise NULL.


Usage Notes

  1. For purposes of model transparency, Oracle Data Mining provides reverse transformations for transformations that are embedded in a model. Reverse transformations are applied to the attributes returned in model details (GET_MODEL_DETAILS_* functions) and to the scored target of predictive models.

  2. If you alter the reverse transformation for the target of a model that has a cost matrix, you must specify a transformation expression that has the same type as the actual and predicted values in the cost matrix. Also, the reverse transformation that you specify must result in values that are present in the cost matrix.

    See Also:

    "ADD_COST_MATRIX Procedure" and Oracle Data Mining Concepts for information about cost matrixes.
  3. To prevent reverse transformation of an attribute, you can specify NULL for expression.

  4. The reverse transformation expression can contain a reference to a PL/SQL function that returns a valid Oracle datatype. For example, you could define a function like the following for a categorical attribute named blood_pressure that has values 'Low', 'Medium' and 'High'.

    CREATE OR REPLACE FUNCTION numx(c char) RETURN NUMBER IS
      BEGIN
        CASE c WHEN ''Low'' THEN RETURN 1;
               WHEN ''Medium'' THEN RETURN 2;
               WHEN ''High'' THEN RETURN 3;
               ELSE RETURN null;
        END CASE;
      END numx;
    

    Then you could invoke ALTER_REVERSE_EXPRESION for blood_pressure as follows.

    EXEC dbms_data_mining.alter_reverse_expression(
                 '<model_name>', 'NUMX(blood_pressure)', 'blood_pressure');
    
  5. You can use ALTER_REVERSE_EXPRESSION to label clusters produced by clustering models and features produced by feature extraction.

    You can use ALTER_REVERSE_EXPRESSION to replace the zeros and ones returned by anomaly-detection models. By default, anomaly-detection models label anomalous records with 0 and all other records with 1.

    See Also:

    Oracle Data Mining Concepts for information about anomaly detection

Examples

  1. In this example, the target (affinity_card) of the model CLASS_MODEL is manipulated internally as yes or no instead of 1 or 0 but returned as 1s and 0s when scored. The ALTER_REVERSE_EXPRESSION procedure causes the target values to be returned as TRUE or FALSE.

    The data sets MINING_DATA_BUILD and MINING_DATA_TEST are included with the Oracle Data Mining sample programs. See Oracle Data Mining User's Guide for information about the sample programs.

    DECLARE
            v_xlst dbms_data_mining_transform.TRANSFORM_LIST;
      BEGIN
        dbms_data_mining_transform.SET_TRANSFORM(v_xlst,
              'affinity_card', NULL,
              'decode(affinity_card, 1, ''yes'', ''no'')',
              'decode(affinity_card, ''yes'', 1, 0)');
        dbms_data_mining.CREATE_MODEL(
          model_name             => 'CLASS_MODEL',
          mining_function        => dbms_data_mining.classification,
          data_table_name        => 'mining_data_build',
          case_id_column_name    => 'cust_id',
          target_column_name     => 'affinity_card',
          settings_table_name    => NULL,
          data_schema_name       => 'dmuser',
          settings_schema_name   => NULL,
          xform_list             => v_xlst );
      END;
    /
    SELECT cust_income_level, occupation,
               PREDICTION(CLASS_MODEL USING *) predict_response
          FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M'
          ORDER BY cust_income_level;
     
    CUST_INCOME_LEVEL              OCCUPATION                PREDICT_RESPONSE
    ------------------------------ --------------------- --------------------
    A: Below 30,000                Transp.                                  1
    E: 90,000 - 109,999            Transp.                                  1
    E: 90,000 - 109,999            Sales                                    1
    G: 130,000 - 149,999           Handler                                  0
    G: 130,000 - 149,999           Crafts                                   0
    H: 150,000 - 169,999           Prof.                                    1
    J: 190,000 - 249,999           Prof.                                    1
    J: 190,000 - 249,999           Sales                                    1
     
    BEGIN
      dbms_data_mining.ALTER_REVERSE_EXPRESSION (
         model_name      => 'CLASS_MODEL',
         expression      => 'decode(affinity_card, ''yes'', ''TRUE'', ''FALSE'')',
         attribute_name  => 'affinity_card');
    END;
    /
    column predict_response on
    column predict_response format a20
    SELECT cust_income_level, occupation,
                 PREDICTION(CLASS_MODEL USING *) predict_response
          FROM mining_data_test WHERE age = 60 AND cust_gender IN 'M'
          ORDER BY cust_income_level;
     
    CUST_INCOME_LEVEL              OCCUPATION            PREDICT_RESPONSE
    ------------------------------ --------------------- --------------------
    A: Below 30,000                Transp.               TRUE
    E: 90,000 - 109,999            Transp.               TRUE
    E: 90,000 - 109,999            Sales                 TRUE
    G: 130,000 - 149,999           Handler               FALSE
    G: 130,000 - 149,999           Crafts                FALSE
    H: 150,000 - 169,999           Prof.                 TRUE
    J: 190,000 - 249,999           Prof.                 TRUE
    J: 190,000 - 249,999           Sales                 TRUE
    
  2. This example specifies labels for the clusters that result from the sh_clus model. The labels consist of the word "Cluster" and the internal numeric identifier for the cluster.

    BEGIN
      dbms_data_mining.ALTER_REVERSE_EXPRESSION( 'sh_clus', '''Cluster ''||value');
    END;
    /
     
    SELECT cust_id, cluster_id(sh_clus using *) cluster_id
       FROM sh_aprep_num
           WHERE cust_id < 100011
           ORDER by cust_id;
     
    CUST_ID CLUSTER_ID
    ------- ------------------------------------------------
     100001 Cluster 18
     100002 Cluster 14
     100003 Cluster 14
     100004 Cluster 18
     100005 Cluster 19
     100006 Cluster 7
     100007 Cluster 18
     100008 Cluster 14
     100009 Cluster 8
     100010 Cluster 8
    

APPLY Procedure

This procedure applies a mining model to the data of interest, and generates the results in a table. The apply process is also referred to as scoring.

For predictive mining functions, the apply process generates predictions in a target column. For descriptive mining functions such as clustering, the apply process assigns each case to a cluster with a probability.

In Oracle Data Mining, the apply operation is not applicable to association models and attribute importance models.

Note:

Scoring can also be performed directly in SQL using the Data Mining functions. See

Syntax

DBMS_DATA_MINING.APPLY (
      model_name           IN VARCHAR2,
      data_table_name      IN VARCHAR2,
      case_id_column_name  IN VARCHAR2,
      result_table_name    IN VARCHAR2,
      data_schema_name     IN VARCHAR2 DEFAULT NULL);

Parameters

Table 45-27 APPLY Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

data_table_name

Name of table or view containing the data to be scored

case_id_column_name

Name of the case identifier column

result_table_name

Name of the table in which to store apply results

data_schema_name

Name of the schema containing the data to be scored


Usage Notes

  1. The data provided for APPLY must undergo the same preprocessing as the data used to create and test the model. When you use Automatic Data Preparation, the preprocessing required by the algorithm is handled for you by the model — both at build time and apply time. (See "Automatic Data Preparation".)

  2. APPLY creates a table in the user's schema to hold the results. The columns are algorithm-specific.

    The columns in the results table are listed in Table 45-28 through Table 45-32. The case ID column name in the results table will match the case ID column name provided by you. The type of the incoming case ID column is also preserved in APPLY output.

    Note:

    Make sure that the case ID column does not have the same name as one of the columns that will be created by APPLY. For example, when applying a classification model, the case ID in the scoring data must not be 'PREDICTION' or 'PROBABILITY' (See Table 45-28).
  3. The datatype for the 'PREDICTION', 'CLUSTER_ID', and 'FEATURE_ID' output columns is influenced by any reverse expression that is embedded in the model by the user. If the user does not provide a reverse expression that alters the scored value type, then the types will conform to the descriptions in the following tables. See "ALTER_REVERSE_EXPRESSION Procedure".

Classification

The results table for classification has the columns described in Table 45-28. If the target of the model is categorical, the PREDICTION column will have a VARCHAR2 datatype. If the target has a binary type, the PREDICTION column will have the binary type of the target.

Table 45-28 APPLY Results Table for Classification

Column Name Datatype

Case ID column name

Type of the case ID

PREDICTION

Type of the target

PROBABILITY

BINARY_DOUBLE


Anomaly Detection

The results table for anomaly detection has the columns described in Table 45-29.

Table 45-29 APPLY Results Table for Anomaly Detection

Column Name Datatype

Case ID column name

Type of the case ID

PREDICTION

NUMBER

PROBABILITY

BINARY_DOUBLE


Values in the PREDICTION column can be either 0 or 1. When the prediction is 1, the case is a typical example. When the prediction is 0, the case is an outlier.

Regression

The results table for regression has the columns described in Table 45-30.

Table 45-30 APPLY Results Table for Regression

Column Name Datatype

Case ID column name

Type of the case ID

PREDICTION

Type of the target


Clustering

Clustering is an unsupervised mining function, and hence there are no targets. The results of an APPLY operation will contain simply the cluster identifier corresponding to a case, and the associated probability. The results table has the columns described in Table 45-31.

Table 45-31 APPLY Results Table for Clustering

Column Name Datatype

Case ID column name

Type of the case ID

CLUSTER_ID

NUMBER

PROBABILITY

BINARY_DOUBLE


Feature Extraction

Feature extraction is also an unsupervised mining function, and hence there are no targets. The results of an APPLY operation will contain simply the feature identifier corresponding to a case, and the associated match quality. The results table has the columns described in Table 45-32.

Table 45-32 APPLY Results Table for Feature Extraction

Column Name Datatype

Case ID column name

Type of the case ID

FEATURE_ID

NUMBER

MATCH_QUALITY

BINARY_DOUBLE


Examples

This example applies the GLM regression model GLMR_SH_REGR_SAMPLE to the data in the MINING_DATA_APPLY_V view. The apply results are output to the table REGRESSION_APPLY_RESULT.

SQL> BEGIN
       DBMS_DATA_MINING.APPLY (
       model_name     => 'glmr_sh_regr_sample',
       data_table_name     => 'mining_data_apply_v',
       case_id_column_name => 'cust_id',
       result_table_name   => 'regression_apply_result');
    END;
    /
 
SQL> SELECT * FROM regression_apply_result WHERE cust_id >  101485;
 
   CUST_ID PREDICTION
---------- ----------
    101486 22.8048824
    101487 25.0261101
    101488 48.6146619
    101489   51.82595
    101490 22.6220714
    101491 61.3856816
    101492 24.1400748
    101493  58.034631
    101494 45.7253149
    101495 26.9763318
    101496 48.1433425
    101497 32.0573434
    101498 49.8965531
    101499  56.270656
    101500 21.1153047

COMPUTE_CONFUSION_MATRIX Procedure

This procedure computes a confusion matrix, stores it in a table in the user's schema, and returns the model accuracy.

A confusion matrix is a test metric for classification models. It compares the predictions generated by the model with the actual target values in a set of test data. The confusion matrix lists the number of times each class was correctly predicted and the number of times it was predicted to be one of the other classes.

COMPUTE_CONFUSION_MATRIX accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.

See Also:

Oracle Data Mining Concepts for more details about confusion matrixes and other test metrics for classification

"COMPUTE_LIFT Procedure"

"COMPUTE_ROC Procedure"

Syntax

DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
      accuracy                     OUT NUMBER,
      apply_result_table_name      IN  VARCHAR2,
      target_table_name            IN  VARCHAR2,
      case_id_column_name          IN  VARCHAR2,
      target_column_name           IN  VARCHAR2,
      confusion_matrix_table_name  IN  VARCHAR2,
      score_column_name            IN  VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN  VARCHAR2 DEFAULT 'PROBABILITY',
      cost_matrix_table_name       IN  VARCHAR2 DEFAULT NULL,
      apply_result_schema_name     IN  VARCHAR2 DEFAULT NULL,
      target_schema_name           IN  VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name      IN  VARCHAR2 DEFAULT NULL,
      score_criterion_type         IN  VARCHAR2 DEFAULT 'PROBABILITY');

Parameters

Table 45-33 COMPUTE_CONFUSION_MATRIX Procedure Parameters

Parameter Description

accuracy

Output parameter containing the overall percentage accuracy of the predictions.

apply_result_table_name

Table containing the predictions.

target_table_name

Table containing the known target values from the test data.

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

confusion_matrix_table_name

Table containing the confusion matrix. The table will be created by the procedure in the user's schema.

The columns in the confusion matrix table are described in the Usage Notes.

score_column_name

Column containing the predictions in the apply results table.

The default column name is PREDICTION, which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions.

By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted.

The score_criterion_type parameter indicates whether probabilities or costs will be used for scoring.

The default column name is 'PROBABILITY', which is the default name created by the APPLY procedure (See "APPLY Procedure").

See the Usage Notes for additional information.

cost_matrix_table_name

(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the score_criterion_type parameter is set to 'COSTS', the costs in this table will be used as the scoring criteria.

The columns in a cost matrix table are described in the Usage Notes.

apply_result_schema_name

Schema of the apply results table.

If null, the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, the user's schema is assumed.

cost_matrix_schema_name

Schema of the cost matrix table, if one is provided.

If null, the user's schema is assumed.

score_criterion_type

Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the score_criterion_column_name parameter.

The default value of score_criterion_type is 'PROBABILITY'. To use costs as the scoring criterion, specify 'COST'.

If score_criterion_type is set to 'COST' but no cost matrix is provided and if there is a scoring cost matrix associated with the model, then the associated costs are used for scoring.

See the Usage Notes and the Examples.


Usage Notes

  • The predictive information you pass to COMPUTE_CONFUSION_MATRIX may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the confusion matrix.

  • Instead of passing a cost matrix to COMPUTE_CONFUSION_MATRIX, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQL PREDICTION_COST function to populate the score criterion column.

  • The predictions that you pass to COMPUTE_CONFUSION_MATRIX are in a table or view specified in apply_result_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • A cost matrix must have the columns described in Table 45-34.

    Table 45-34 Columns in a Cost Matrix

    Column Name Datatype

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    cost

    BINARY_DOUBLE


    See Also:

    Oracle Data Mining User's Guide for valid target datatypes

    Oracle Data Mining Concepts for more information about cost matrixes

  • The confusion matrix created by COMPUTE_CONFUSION_MATRIX has the columns described in Table 45-35.

    Table 45-35 Columns in a Confusion Matrix

    Column Name Datatype

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target is the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    value

    BINARY_DOUBLE


    See Also:

    Oracle Data Mining Concepts for more information about confusion matrixes

Examples

These examples use the Naive Bayes model nb_sh_clas_sample, which is created by one of the Oracle Data Mining sample programs.

Compute a Confusion Matrix Based on Probabilities

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
       SELECT cust_id,
              PREDICTION(nb_sh_clas_sample USING *) prediction,
              PREDICTION_PROBABILITY(nb_sh_clas_sample USING *) probability
       FROM mining_data_test_v;

Using probabilities as the scoring criterion, you can compute the confusion matrix as follows.

DECLARE
   v_accuracy    NUMBER;
      BEGIN
        DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
                   accuracy                     => v_accuracy,
                   apply_result_table_name      => 'nb_apply_results',
                   target_table_name            => 'mining_data_test_v',
                   case_id_column_name          => 'cust_id',
                   target_column_name           => 'affinity_card',
                   confusion_matrix_table_name  => 'nb_confusion_matrix',
                   score_column_name            => 'PREDICTION',
                   score_criterion_column_name  => 'PROBABILITY'
                   cost_matrix_table_name       =>  null,
                   apply_result_schema_name     =>  null,
                   target_schema_name           =>  null,
                   cost_matrix_schema_name      =>  null,
                   score_criterion_type         => 'PROBABILITY');
        DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4));
      END;
      /

The confusion matrix and model accuracy are shown as follows.

 **** MODEL ACCURACY ****: .7847

SQL>SELECT * from nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         60
                  0                      0        891
                  1                      1        286
                  0                      1        263

Compute a Confusion Matrix Based on a Cost Matrix Table

The confusion matrix in the previous example shows a high rate of false positives. For 263 cases, the model predicted 1 when the actual value was 0. You could use a cost matrix to minimize this type of error.

The cost matrix table nb_cost_matrix specifies that a false positive is 3 times more costly than a false negative.

SQL> SELECT * from nb_cost_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE       COST
------------------- ---------------------- ----------
                  0                      0          0
                  0                      1        .75
                  1                      0        .25
                  1                      1          0

This statement shows how to generate the predictions using APPLY.

BEGIN
    DBMS_DATA_MINING.APPLY(
          model_name          => 'nb_sh_clas_sample',
          data_table_name     => 'mining_data_test_v',
          case_id_column_name => 'cust_id',
          result_table_name   => 'nb_apply_results');
 END;
/

This statement computes the confusion matrix using the cost matrix table. The score criterion column is named 'PROBABILITY', which is the name generated by APPLY.

DECLARE
  v_accuracy    NUMBER;
     BEGIN
       DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
                accuracy                     => v_accuracy,
                apply_result_table_name      => 'nb_apply_results',
                target_table_name            => 'mining_data_test_v',
                case_id_column_name          => 'cust_id',
                target_column_name           => 'affinity_card',
                confusion_matrix_table_name  => 'nb_confusion_matrix',
                score_column_name            => 'PREDICTION',
                score_criterion_column_name  => 'PROBABILITY',
                cost_matrix_table_name       => 'nb_cost_matrix',
                apply_result_schema_name     => null,
                target_schema_name           => null,
                cost_matrix_schema_name      => null,
                score_criterion_type         => 'COST');
       DBMS_OUTPUT.PUT_LINE('**** MODEL ACCURACY ****: ' || ROUND(v_accuracy,4));
    END;
    /

The resulting confusion matrix shows a decrease in false positives (212 instead of 263).

**** MODEL ACCURACY ****: .798

SQL> SELECT * FROM nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

Compute a Confusion Matrix Based on Embedded Costs

You can use the ADD_COST_MATRIX procedure to embed a cost matrix in a model. The embedded costs can be used instead of probabilities for scoring. This statement adds the previously-defined cost matrix to the model.

BEGIN    DBMS_DATA_MINING.ADD_COST_MATRIX ('nb_sh_clas_sample', 'nb_cost_matrix');END;/

The following statement applies the model to the test data using the embedded costs and stores the results in a table.

CREATE TABLE nb_apply_results AS
         SELECT cust_id,
              PREDICTION(nb_sh_clas_sample COST MODEL USING *) prediction,
              PREDICTION_COST(nb_sh_clas_sample COST MODEL USING *) cost
          FROM mining_data_test_v;

You can compute the confusion matrix using the embedded costs.

DECLARE
   v_accuracy         NUMBER;
   BEGIN
       DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX (
            accuracy                     => v_accuracy,
            apply_result_table_name      => 'nb_apply_results',
            target_table_name            => 'mining_data_test_v',
            case_id_column_name          => 'cust_id',
            target_column_name           => 'affinity_card',
            confusion_matrix_table_name  => 'nb_confusion_matrix',
            score_column_name            => 'PREDICTION',
            score_criterion_column_name  => 'COST',
            cost_matrix_table_name       => null,
            apply_result_schema_name     => null,
            target_schema_name           => null,
            cost_matrix_schema_name      => null,
            score_criterion_type         => 'COST');
   END;
   /

The results are:

**** MODEL ACCURACY ****: .798

SQL> SELECT * FROM nb_confusion_matrix;
ACTUAL_TARGET_VALUE PREDICTED_TARGET_VALUE      VALUE
------------------- ---------------------- ----------
                  1                      0         91
                  0                      0        942
                  1                      1        255
                  0                      1        212

COMPUTE_LIFT Procedure

This procedure computes lift and stores the results in a table in the user's schema.

Lift is a test metric for binary classification models. To compute lift, one of the target values must be designated as the positive class. COMPUTE_LIFT compares the predictions generated by the model with the actual target values in a set of test data. Lift measures the degree to which the model's predictions of the positive class are an improvement over random chance.

Lift is computed on scoring results that have been ranked by probability (or cost) and divided into quantiles. Each quantile includes the scores for the same number of cases.

COMPUTE_LIFT calculates quantile-based and cumulative statistics. The number of quantiles and the positive class are user-specified. Additionally, COMPUTE_LIFT accepts three input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing either probabilities or costs associated with the predictions

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

  • (Optional) A cost matrix table with predefined columns. See the Usage Notes for the column requirements.

See Also:

Oracle Data Mining Concepts for more details about lift and test metrics for classification

"COMPUTE_CONFUSION_MATRIX Procedure"

"COMPUTE_ROC Procedure"

Syntax

DBMS_DATA_MINING.COMPUTE_LIFT (
      apply_result_table_name      IN VARCHAR2,
      target_table_name            IN VARCHAR2,
      case_id_column_name          IN VARCHAR2,
      target_column_name           IN VARCHAR2,
      lift_table_name              IN VARCHAR2,
      positive_target_value        IN VARCHAR2,
      score_column_name            IN VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN VARCHAR2 DEFAULT 'PROBABILITY',
      num_quantiles                IN NUMBER DEFAULT 10,
      cost_matrix_table_name       IN VARCHAR2 DEFAULT NULL,
      apply_result_schema_name     IN VARCHAR2 DEFAULT NULL,
      target_schema_name           IN VARCHAR2 DEFAULT NULL,
      cost_matrix_schema_name      IN VARCHAR2 DEFAULT NULL
      score_criterion_type         IN VARCHAR2 DEFAULT 'PROBABILITY');

Parameters

Table 45-36 COMPUTE_LIFT Procedure Parameters

Parameter Description

apply_result_table_name

Table containing the predictions.

target_table_name

Table containing the known target values from the test data.

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

lift_table_name

Table containing the lift statistics. The table will be created by the procedure in the user's schema.

The columns in the lift table are described in the Usage Notes.

positive_target_value

The positive class. This should be the class of interest, for which you want to calculate lift.

If the target column is a NUMBER, you can use the TO_CHAR() operator to provide the value as a string.

score_column_name

Column containing the predictions in the apply results table.

The default column name is 'PREDICTION', which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains either the probabilities or the costs that determine the predictions.

By default, scoring is based on probability; the class with the highest probability is predicted for each case. If scoring is based on cost, the class with the lowest cost is predicted.

The score_criterion_type parameter indicates whether probabilities or costs will be used for scoring.

The default column name is 'PROBABILITY', which is the default name created by the APPLY procedure (See "APPLY Procedure").

See the Usage Notes for additional information.

num_quantiles

Number of quantiles to be used in calculating lift. The default is 10.

cost_matrix_table_name

(Optional) Table that defines the costs associated with misclassifications. If a cost matrix table is provided and the score_criterion_type parameter is set to 'COST', the costs will be used as the scoring criteria.

The columns in a cost matrix table are described in the Usage Notes.

apply_result_schema_name

Schema of the apply results table.

If null, the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, the user's schema is assumed.

cost_matrix_schema_name

Schema of the cost matrix table, if one is provided.

If null, the user's schema is assumed.

score_criterion_type

Whether to use probabilities or costs as the scoring criterion. Probabilities or costs are passed in the column identified in the score_criterion_column_name parameter.

The default value of score_criterion_type is 'PROBABILITY'. To use costs as the scoring criterion, specify 'COST'.

If score_criterion_type is set to 'COST' but no cost matrix is provided and if there is a scoring cost matrix associated with the model, then the associated costs are used for scoring.

See the Usage Notes and the Examples.


Usage Notes

  • The predictive information you pass to COMPUTE_LIFT may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the lift.

  • Instead of passing a cost matrix to COMPUTE_LIFT, you can use a scoring cost matrix associated with the model. A scoring cost matrix can be embedded in the model or it can be defined dynamically when the model is applied. To use a scoring cost matrix, invoke the SQL PREDICTION_COST function to populate the score criterion column.

  • The predictions that you pass to COMPUTE_LIFT are in a table or view specified in apply_results_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • A cost matrix must have the columns described in Table 45-37.

    Table 45-37 Columns in a Cost Matrix

    Column Name Datatype

    actual_target_value

    Type of the target column in the build data

    predicted_target_value

    Type of the predicted target in the test data. The type of the predicted target must be the same as the type of the actual target unless the predicted target has an associated reverse transformation.

    cost

    NUMBER


    See Also:

    Oracle Data Mining Concepts for more information about cost matrixes
  • The table created by COMPUTE_LIFT has the columns described in Table 45-38

    Table 45-38 Columns in a Lift Table

    Column Name Datatype

    quantile_number

    NUMBER

    probability_threshold

    NUMBER

    gain_cumulative

    NUMBER

    quantile_total_count

    NUMBER

    quantile_target_count

    NUMBER

    percent_records_cumulative

    NUMBER

    lift_cumulative

    NUMBER

    target_density_cumulative

    NUMBER

    targets_cumulative

    NUMBER

    non_targets_cumulative

    NUMBER

    lift_quantile

    NUMBER

    target_density

    NUMBER


    See Also:

    Oracle Data Mining Concepts for details about the information in the lift table
  • When a cost matrix is passed to COMPUTE_LIFT, the cost threshold is returned in the probability_threshold column of the lift table.

Examples

This example uses the Naive Bayes model nb_sh_clas_sample, which is created by one of the Oracle Data Mining sample programs.

The example illustrates lift based on probabilities. For examples that show computation based on costs, see "COMPUTE_CONFUSION_MATRIX Procedure".

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
    SELECT cust_id, t.prediction, t.probability
    FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;

Using probabilities as the scoring criterion, you can compute lift as follows.

BEGIN
           DBMS_DATA_MINING.COMPUTE_LIFT (
              apply_result_table_name              => 'nb_apply_results',
              target_table_name                => 'mining_data_test_v',
              case_id_column_name              => 'cust_id',
              target_column_name               => 'affinity_card',
              lift_table_name                             => 'nb_lift',
              positive_target_value                 =>  to_char(1),
              score_column_name                => 'PREDICTION',
              score_criterion_column_name    => 'PROBABILITY',
              num_quantiles                                =>  10,
              cost_matrix_table_name                =>  null,
              apply_result_schema_name            =>  null,
              target_schema_name               =>  null,
              cost_matrix_schema_name              =>  null,
              score_criterion_type                   =>  'PROBABILITY');
        END;
        /

This query displays some of the statistics from the resulting lift table.

SQL>SELECT quantile_number, probability_threshold, gain_cumulative,
           quantile_total_count
           FROM nb_lift;

QUANTILE_NUMBER PROBABILITY_THRESHOLD GAIN_CUMULATIVE QUANTILE_TOTAL_COUNT 
--------------- --------------------- --------------- --------------------  
              1            .989335775       .15034965                   55 
              2            .980534911       .26048951                   55  
              3            .968506098      .374125874                   55  
              4            .958975196      .493006993                   55 
              5            .946705997      .587412587                   55  
              6            .927454174       .66958042                   55  
              7            .904403627      .748251748                   55  
              8            .836482525      .839160839                   55  
             10            .500184953               1                   54  

COMPUTE_ROC Procedure

This procedure computes receiver operating characteristic (ROC), stores the results in a table in the user's schema, and returns a measure of the model accuracy.

ROC is a test metric for binary classification models. To compute ROC, one of the target values must be designated as the positive class. COMPUTE_ROC compares the predictions generated by the model with the actual target values in a set of test data.

ROC measures the impact of changes in the probability threshold. The probability threshold is the decision point used by the model for predictions. In binary classification, the default probability threshold is 0.5. The value predicted for each case is the one with a probability greater than 50%.

ROC can be plotted as a curve on an X-Y axis. The false positive rate is placed on the X axis. The true positive rate is placed on the Y axis. A false positive is a positive prediction for a case that is negative in the test data. A true positive is a positive prediction for a case that is positive in the test data.

COMPUTE_ROC accepts two input streams:

  • The predictions generated on the test data. The information is passed in three columns:

    • Case ID column

    • Prediction column

    • Scoring criterion column containing probabilities

  • The known target values in the test data. The information is passed in two columns:

    • Case ID column

    • Target column containing the known target values

See Also:

Oracle Data Mining Concepts for more details about ROC and test metrics for classification

"COMPUTE_CONFUSION_MATRIX Procedure"

"COMPUTE_LIFT Procedure"

Syntax

DBMS_DATA_MINING.COMPUTE_ROC (
      roc_area_under_curve         OUT NUMBER,
      apply_result_table_name      IN  VARCHAR2,
      target_table_name            IN  VARCHAR2,
      case_id_column_name          IN  VARCHAR2,
      target_column_name           IN  VARCHAR2,
      roc_table_name               IN  VARCHAR2,
      positive_target_value        IN  VARCHAR2,
      score_column_name            IN  VARCHAR2 DEFAULT 'PREDICTION',
      score_criterion_column_name  IN  VARCHAR2 DEFAULT 'PROBABILITY',
      apply_result_schema_name     IN  VARCHAR2 DEFAULT NULL,
      target_schema_name           IN  VARCHAR2 DEFAULT NULL);

Parameters

Table 45-39 COMPUTE_ROC Procedure Parameters

Parameter Description

roc_area_under_the_curve

Output parameter containing the area under the ROC curve (AUC). The AUC measures the likelihood that an actual positive will be predicted as positive.

The greater the AUC, the greater the flexibility of the model in accommodating trade-offs between positive and negative class predictions. AUC can be especially important when one target class is rarer or more important to identify than another.

apply_result_table_name

Table containing the predictions.

target_table_name

Table containing the known target values from the test data.

case_id_column_name

Case ID column in the apply results table. Must match the case identifier in the targets table.

target_column_name

Target column in the targets table. Contains the known target values from the test data.

roc_table_name

Table containing the ROC output. The table will be created by the procedure in the user's schema.

The columns in the ROC table are described in the Usage Notes.

positive_target_value

The positive class. This should be the class of interest, for which you want to calculate ROC.

If the target column is a NUMBER, you can use the TO_CHAR() operator to provide the value as a string.

score_column_name

Column containing the predictions in the apply results table.

The default column name is 'PREDICTION', which is the default name created by the APPLY procedure (See "APPLY Procedure").

score_criterion_column_name

Column containing the scoring criterion in the apply results table. Contains the probabilities that determine the predictions.

The default column name is 'PROBABILITY', which is the default name created by the APPLY procedure (See "APPLY Procedure").

apply_result_schema_name

Schema of the apply results table.

If null, the user's schema is assumed.

target_schema_name

Schema of the table containing the known targets.

If null, the user's schema is assumed.


Usage Notes

  • The predictive information you pass to COMPUTE_ROC may be generated using SQL PREDICTION functions, the DBMS_DATA_MINING.APPLY procedure, or some other mechanism. As long as you pass the appropriate data, the procedure can compute the receiver operating characteristic.

  • The predictions that you pass to COMPUTE_ROC are in a table or view specified in apply_results_table_name.

    CREATE TABLE apply_result_table_name AS (
                case_id_column_name            VARCHAR2, 
                score_column_name              VARCHAR2,
                score_criterion_column_name    VARCHAR2);
    
  • The table created by COMPUTE_ROC has the columns shown in Table 45-40.

    Table 45-40 COMPUTE_ROC Output

    Column Datatype

    probability

    BINARY_DOUBLE

    true_positives

    NUMBER

    false_negatives

    NUMBER

    false_positives

    NUMBER

    true_negatives

    NUMBER

    true_positive_fraction

    NUMBER

    false_positive_fraction

    NUMBER


    See Also:

    Oracle Data Mining Concepts for details about the output of COMPUTE_ROC
  • ROC is typically used to determine the most desirable probability threshold. This can be done by examining the true positive fraction and the false positive fraction. The true positive fraction is the percentage of all positive cases in the test data that were correctly predicted as positive. The false positive fraction is the percentage of all negative cases in the test data that were incorrectly predicted as positive.

    Given a probability threshold, the following statement returns the positive predictions in an apply result table ordered by probability.

    SELECT case_id_column_name 
           FROM apply_result_table_name 
           WHERE probability > probability_threshold 
           ORDER BY probability DESC;
    
  • There are two approaches to identifying the most desirable probability threshold. Which approach you use depends on whether or not you know the relative cost of positive versus negative class prediction errors.

    If the costs are known, you can apply the relative costs to the ROC table to compute the minimum cost probability threshold. Suppose the relative cost ratio is: Positive Class Error Cost / Negative Class Error Cost = 20. Then execute a query like this.

    WITH cost AS (
      SELECT probability_threshold, 20 * false_negatives + false_positives cost 
        FROM ROC_table 
      GROUP BY probability_threshold), 
        minCost AS (
          SELECT min(cost) minCost 
            FROM cost)
          SELECT max(probability_threshold)probability_threshold 
            FROM cost, minCost 
        WHERE cost = minCost;
    

    If relative costs are not well known, you can simply scan the values in the ROC table (in sorted order) and make a determination about which of the displayed trade-offs (misclassified positives versus misclassified negatives) is most desirable.

    SELECT * FROM ROC_table 
             ORDER BY probability_threshold;
    

Examples

This example uses the Naive Bayes model nb_sh_clas_sample, which is created by one of the Oracle Data Mining sample programs.

The following statement applies the model to the test data and stores the predictions and probabilities in a table.

CREATE TABLE nb_apply_results AS
    SELECT cust_id, t.prediction, t.probability
    FROM mining_data_test_v, TABLE(PREDICTION_SET(nb_sh_clas_sample USING *)) t;

Using the predictions and the target values from the test data, you can compute ROC as follows.

DECLARE
     v_area_under_curve NUMBER;
  BEGIN
         DBMS_DATA_MINING.COMPUTE_ROC (
               roc_area_under_curve                  => v_area_under_curve,
               apply_result_table_name       => 'nb_apply_results',
               target_table_name               => 'mining_data_test_v',
               case_id_column_name            => 'cust_id',
               target_column_name             => 'affinity_card',
               roc_table_name                     => 'nb_roc',
               positive_target_value         => '1',
               score_column_name               => 'PREDICTION',
               score_criterion_column_name   => 'PROBABILITY');
           DBMS_OUTPUT.PUT_LINE('**** AREA UNDER ROC CURVE ****: ' ||
           ROUND(v_area_under_curve,4));
  END;
 /

The resulting AUC and a selection of columns from the ROC table are shown as follows.

**** AREA UNDER ROC CURVE ****: .8212

SQL> SELECT probability, true_positive_fraction, false_positive_fraction 
            FROM nb_roc;
 
PROBABILITY  TRUE_POSITIVE_FRACTION  FALSE_POSITIVE_FRACTION
-----------  ----------------------  -----------------------
     .00000                       1                        1
     .50018              .826589595               .227902946
     .53851              .823699422               .221837088
     .54991              .820809249               .217504333
     .55628              .815028902               .215771231
     .55628              .817919075               .215771231
     .57563              .800578035               .214904679
     .57563              .812138728               .214904679
      .                   .                        .
      .                   .                        .
      .                   .                        .

CREATE_MODEL Procedure

This procedure creates a mining model with a given mining function.

By passing an xform_list to CREATE_MODEL, you can specify a list of transformations to be performed on the input data. If the PREP_AUTO setting is on, the transformations are used in addition to the automatic transformations. If the PREP_AUTO setting is off, the specified transformations are the only ones implemented by the model. In both cases, the transformation definitions are embedded in the model and executed automatically whenever the model is applied. See "Automatic Data Preparation".

Syntax

DBMS_DATA_MINING.CREATE_MODEL (
      model_name            IN VARCHAR2,
      mining_function       IN VARCHAR2,
      data_table_name       IN VARCHAR2,
      case_id_column_name   IN VARCHAR2,
      target_column_name    IN VARCHAR2 DEFAULT NULL,
      settings_table_name   IN VARCHAR2 DEFAULT NULL,
      data_schema_name      IN VARCHAR2 DEFAULT NULL,
      settings_schema_name  IN VARCHAR2 DEFAULT NULL,
      xform_list            IN TRANSFORM_LIST DEFAULT NULL);

Parameters

Table 45-41 CREATE_MODEL Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

See the Usage Notes for model naming restrictions.

mining_function

The mining function. Values are listed in Table 45-3, "Mining Functions".

data_table_name

Table or view containing the build data.

case_id_column_name

Case identifier column in the build data.

target_column_name

For supervised models, the target column in the build data. NULL for unsupervised models.

settings_table_name

Table containing build settings for the model. NULL if there is no settings table (only default settings are used).

data_schema_name

Schema hosting the build data. If NULL, the user's schema is assumed.

settings_schema_name

Schema hosting the settings table. If NULL, the user's schema is assumed.

xform_list

A list of transformations to be used in addition to or instead of automatic transformations, depending on the value of the PREP_AUTO setting. (See "Automatic Data Preparation".)

The datatype of xform_list is TRANSFORM_LIST, which consists of records of type TRANSFORM_REC. Each TRANSFORM_REC specifies the transformation information for a single attribute.

TYPE
  TRANFORM_REC     IS RECORD (
     attribute_name       VARCHAR2(4000),
     attribute_subname    VARCHAR2(4000),
     expression           EXPRESSION_REC,
     reverse_expression   EXPRESSION_REC,
     attribute_spec       VARCHAR2(4000));

The expression field stores a SQL expression for transforming the attribute. The reverse_expression field stores a SQL expression for reversing the transformation in model details and, if the attribute is a target, in the results of scoring. The SQL expressions are manipulated by routines in the DBMS_DATA_MINING_TRANSFORM package:

The attribute_spec field identifies individualized treatment for the attribute. See the Usage Notes for details.

See Table 46-1, "Datatypes in DBMS_DATA_MINING_TRANSFORM"for details about the TRANSFORM_REC type.


Usage Notes

  1. You can use the attribute_spec field of the xform_list argument to identify an attribute as unstructured text or to disable Automatic Data Preparation for the attribute. The attribute_spec can have the following values:

    • TEXT — Indicates that the attribute contains unstructured text. The TEXT value may optionally be followed by an Oracle Text POLICY name. (For information about creating a Text POLICY, see CTX_DDL.CREATE_POLICY in Oracle Text Reference.)

      Oracle Data Mining can process columns of VARCHAR2/CHAR, CLOB, BLOB, and BFILE as text. If the column is VARCHAR2 or CHAR and you do not specify TEXT, Oracle Data Mining will process the column as categorical data. If the column is CLOB, Oracle Data Mining will process it as text by default (You do not need to specify it as TEXT). If the column is BLOB or BFILE, you must specify it as TEXT, otherwise CREATE_MODEL will return an error.

      If you specify TEXT for a nested column or for an attribute in a nested column, CREATE_MODEL will return an error.

    • NOPREP — Disables ADP for the attribute. When ADP is off, the NOPREP value is ignored.

      You can specify NOPREP for a nested column, but not for an attribute in a nested column. If you specify NOPREP for an attribute in a nested column when ADP is on, CREATE_MODEL will return an error.

  2. You can obtain information about a model by querying the data dictionary views.

    ALL/USER/DBA_MINING_MODELS
    ALL/USER/DBA_MINING_MODEL_ATTRIBUTES
    ALL/USER/DBA_MINING_MODEL_SETTINGS
    
  3. You can obtain information about model attributes by querying the model details.

    dbms_data_mining.get_model_details_algorithm (model_name)
    
  4. The naming rules for models are more restrictive than the naming rules for most database schema objects. A model name must satisfy the following additional requirements:

    • It must be 25 or fewer characters long.

    • It must be a nonquoted identifier. Oracle requires that nonquoted identifiers contain only alphanumeric characters, the underscore (_), dollar sign ($), and pound sign (#); the initial character must be alphabetic. Oracle strongly discourages the use of the dollar sign and pound sign in nonquoted literals.

    Naming requirements for schema objects are fully documented in Oracle Database SQL Language Reference.

Examples

The first example builds a classification model using the Support Vector Machine algorithm.

-- Create the settings table 
CREATE TABLE svm_model_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(30));

-- Populate the settings table
-- Specify SVM. By default, Naive Bayes is used for classification.
-- Specify ADP. By default, ADP is not used.
BEGIN 
  INSERT INTO svm_model_settings (setting_name, setting_value) VALUES
     (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
  INSERT INTO svm_model_settings (setting_name, setting_value) VALUES
     (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);
  COMMIT;
END;
/
-- Create the model using the specified settings 
BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'svm_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => 'affinity_card',
    settings_table_name => 'svm_model_settings');
END;
/

You can display the model settings with the following query.

SELECT * FROM user_mining_model_settings 
       WHERE model_name IN 'SVM_MODEL';

MODEL_NAME     SETTING_NAME            SETTING_VALUE                  SETTING
-------------  ----------------------  -----------------------------  -------
SVM_MODEL      ALGO_NAME               ALGO_SUPPORT_VECTOR_MACHINES  INPUT
SVM_MODEL      SVMS_KERNEL_CACHE_SIZE  50000000                      DEFAULT
SVM_MODEL      SVMS_ACTIVE_LEARNING    SVMS_AL_ENABLE                DEFAULT
SVM_MODEL      SVMS_STD_DEV            3.004524                      DEFAULT
SVM_MODEL      PREP_AUTO               ON                            INPUT
SVM_MODEL      SVMS_COMPLEXITY_FACTOR  1.887389                      DEFAULT
SVM_MODEL      SVMS_KERNEL_FUNCTION    SVMS_GAUSSIAN                 DEFAULT
SVM_MODEL      SVMS_CONV_TOLERANCE     .001                          DEFAULT

The second example creates an anomaly detection model. Anomaly detection uses SVM classification without a target. This example uses the same settings table created for the SVM classification model in the first example.

BEGIN
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'anomaly_detect_model',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'mining_data_build_v',
    case_id_column_name => 'cust_id',
    target_column_name  => null,
    settings_table_name => 'svm_model_settings');
END;
/

This query shows that the models created in these examples are the only ones in your schema.

SELECT model_name, mining_function, algorithm FROM user_mining_models;
 
MODEL_NAME              MINING_FUNCTION      ALGORITHM
----------------------  -------------------- ------------------------------
SVM_MODEL               CLASSIFICATION       SUPPORT_VECTOR_MACHINES
ANOMALY_DETECT_MODEL    CLASSIFICATION       SUPPORT_VECTOR_MACHINES

This query shows that only the SVM classification model has a target.

SELECT model_name, attribute_name, attribute_type, target 
       FROM user_mining_model_attributes 
       WHERE target = 'YES';
 
MODEL_NAME          ATTRIBUTE_NAME   ATTRIBUTE_TYPE     TARGET
------------------  ---------------  -----------------  ------
SVM_MODEL           AFFINITY_CARD    CATEGORICAL         YES

DROP_MODEL Procedure

This procedure deletes the specified mining model.

Syntax

DBMS_DATA_MINING.DROP_MODEL (model_name IN VARCHAR2,
                             force      IN BOOLEAN DEFAULT FALSE);

Parameters

Table 45-42 DROP_MODEL Procedure Parameters

Parameter Description

model_name

Name of the mining model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

force

Forces the mining model to be dropped even if it is invalid. A mining model may be invalid if a serious system error interrupted the model build process.


Usage Note

To drop a mining model, you must be the owner or you must have the DROP ANY MINING MODEL privilege. See Oracle Data Mining User's Guide for information about privileges for data mining.

Example

You can use the following command to delete a valid mining model named nb_sh_clas_sample that exists in your schema.

BEGIN
  DBMS_DATA_MINING.DROP_MODEL(model_name => 'nb_sh_clas_sample');
END;
/

EXPORT_MODEL Procedure

This procedure exports the specified data mining models to a dump file set. To import the models from the dump file set, use the IMPORT_MODEL Procedure. EXPORT_MODEL and IMPORT_MODEL use Oracle Data Pump technology.

When Oracle Data Pump is used to export/import an entire schema or database, the mining models in the schema or database are included. However, EXPORT_MODEL and IMPORT_MODEL are the only utilities that support the export/import of individual models.

See Also:

Oracle Database Utilities for information about Oracle Data Pump

Oracle Data Mining User's Guide for more information about exporting and importing mining models

Syntax

DBMS_DATA_MINING.EXPORT_MODEL (
      filename          IN VARCHAR2,
      directory         IN VARCHAR2,
      model_filter      IN VARCHAR2 DEFAULT NULL,
      filesize          IN VARCHAR2 DEFAULT NULL,
      operation         IN VARCHAR2 DEFAULT NULL,
      remote_link       IN VARCHAR2 DEFAULT NULL,
      jobname           IN VARCHAR2 DEFAULT NULL);

Parameters

Table 45-43 EXPORT_MODEL Procedure Parameters

Parameter Description

filename

Name of the dump file set to which the models should be exported. The name must be unique within the schema.

The dump file set can contain one or more files. The number of files in a dump file set is determined by the size of the models being exported (both metadata and data) and a specified or estimated maximum file size. You can specify the file size in the filesize parameter, or you can use the operation parameter to cause Oracle Data Pump to estimate the file size. If the size of the models to export is greater than the maximum file size, one or more additional files are created.

When the export operation completes successfully, the name of the dump file set is automatically expanded to filename01.dmp, even if there is only one file in the dump set. If there are additional files, they are named sequentially as filename02.dmp, filename03.dmp, and so forth.

directory

Name of a pre-defined directory object that specifies where the dump file set should be created.

The exporting user must have read/write privileges on the directory object and on the file system directory that it identifies.

See Oracle Database SQL Language Reference for information about directory objects.

model_filter

Optional parameter that specifies which model or models to export. If you do not specify a value for model_filter, all models in the schema are exported. You can also specify NULL (the default) or 'ALL' to export all models.

You can export individual models by name and groups of models based on mining function or algorithm. For instance, you could export all regression models or all Naive Bayes models. Examples are provided in Table 45-44.

filesize

Optional parameter that specifies the maximum size of a file in the dump file set. The size may be specified in bytes, kilobytes (K), megabytes (M), or gigabytes (G). The default size is 50 MB.

If the size of the models to export is larger than filesize, one or more additional files are created within the dump set. See the description of the filename parameter for more information.

operation

Optional parameter that specifies whether or not to estimate the size of the files in the dump set. By default the size is not estimated and the value of the filesize parameter determines the size of the files.

You can specify either of the following values for operation:

  • 'EXPORT' — Export all or the specified models. (Default)

  • 'ESTIMATE' — Estimate the size of the exporting models.

remote_link

Optional parameter that specifies the name of a database link to a remote system. The default value is NULL. A database link is a schema object in a local database that enables access to objects in a remote database. When you specify a value for remote_link, you can export the models in the remote database. The EXP_FULL_DATABASE role is required for exporting the remote models. The EXP_FULL_DATABASE privilege, the CREATE DATABASE LINK privilege, and other privileges may also be required.

jobname

Optional parameter that specifies the name of the export job. By default, the name has the form username_exp_nnnn, where nnnn is a number. For example, a job name in the SCOTT schema might be SCOTT_exp_134.

If you specify a job name, it must be unique within the schema. The maximum length of the job name is 30 characters.

A log file for the export job, named jobname.log, is created in the same directory as the dump file set.


Usage Notes

The model_filter parameter specifies which models to export. You can list the models by name, or you can specify all models that have the same mining function or algorithm. You can query the USER_MINING_MODELS view to list the models in your schema.

SQL> describe user_mining_models
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 MODEL_NAME                                NOT NULL VARCHAR2(30)
 MINING_FUNCTION                                    VARCHAR2(30)
 ALGORITHM                                          VARCHAR2(30)
 CREATION_DATE                             NOT NULL DATE
 BUILD_DURATION                                     NUMBER
 MODEL_SIZE                                         NUMBER
 COMMENTS                                           VARCHAR2(4000)

Examples of model filters are provided in Table 45-44.

Table 45-44 Sample Values for the Model Filter Parameter

Sample Value Meaning

'mymodel'

Export the model named mymodel

'name= ''mymodel'''

Export the model named mymodel

'name IN (''mymodel2'',''mymodel3'')'

Export the models named mymodel2 and mymodel3

'ALGORITHM_NAME = ''NAIVE_BAYES'''

Export all Naive Bayes models. See Table 45-5 for a list of algorithm names.

'FUNCTION_NAME =''CLASSIFICATION'''

Export all classification models. See Table 45-3 for a list of mining functions.


Examples

  1. The following statement exports all the models in the DMUSER3 schema to a dump file set called models_out in the directory $ORACLE_HOME/rdbms/log. This directory is mapped to a directory object called DATA_PUMP_DIR. The DMUSER3 user has read/write access to the directory and to the directory object.

    SQL>execute dbms_data_mining.export_model ('models_out', 'DATA_PUMP_DIR');
    

    You can exit SQL*Plus and list the resulting dump file and log file.

    SQL>EXIT
    >cd $ORACLE_HOME/rdbms/log
    >ls
    >DMUSER3_exp_1027.log  models_out01.dmp  
    
  2. The following example uses the same directory object and is executed by the same user.This example exports the models called NMF_SH_SAMPLE and SVMR_SH_REGR_SAMPLE to a different dump file set in the same directory.

    SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL ( 'models2_out', 'DATA_PUMP_DIR',
                'name in (''NMF_SH_SAMPLE'', ''SVMR_SH_REGR_SAMPLE'')');
    SQL>EXIT
    >cd $ORACLE_HOME/rdbms/log
    >ls
    >DMUSER3_exp_1027.log  models_out01.dmp
     DMUSER3_exp_924.log  models2_out01.dmp
    
  3. The following examples show how to export models with specific algorithm and mining function names.

    SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL('algo.dmp','DM_DUMP',
            'ALGORITHM_NAME IN (''O_CLUSTER'',''GENERALIZED_LINEAR_MODEL'',
            ''SUPPORT_VECTOR_MACHINES'',''NAIVE_BAYES'')');
     
    SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL('func.dmp', 'DM_DUMP', 
            'FUNCTION_NAME IN (CLASSIFICATION,CLUSTERING,FEATURE_EXTRACTION)');
    

GET_ASSOCIATION_RULES Function

This table function returns the rules produced by an association model.

You can specify filtering criteria to cause GET_ASSOCIATION_RULES to return a subset of the rules. Filtering criteria can improve the performance of the table function. If the number of rules is large, the greatest performance improvement will result from specifying the topn parameter.

Syntax

DBMS_DATA_MINING.GET_ASSOCIATION_RULES (
   model_name            IN VARCHAR2,
   topn                  IN NUMBER                 DEFAULT NULL,
   rule_id               IN NUMBER (38)            DEFAULT NULL,
   min_confidence        IN NUMBER                 DEFAULT NULL,
   min_support           IN NUMBER                 DEFAULT NULL,
   max_rule_length       IN NUMBER (38)            DEFAULT NULL,
   min_rule_length       IN NUMBER (38)            DEFAULT NULL,
   sort_order            IN ORA_MINING_VARCHAR2_NT DEFAULT NULL,
   antecedent_items      IN DM_ITEMS               DEFAULT NULL,
   consequent_items      IN DM_ITEMS               DEFAULT NULL,
   min_lift              IN NUMBER                 DEFAULT NULL)
 RETURN DM_RULES PIPELINED;

Parameters

Table 45-45 GET_ASSOCIATION_RULES Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

This is the only required parameter of GET_ASSOCIATION_RULES. All other parameters specify optional filters on the rules to return.

topn

Return the n top rules ordered by confidence and then support, both descending. If you specify a sort order, the top n rules are derived after the sort is performed.

If topn is specified and no maximum or minimum rule length is specified, then the only columns allowed in the sort order are RULE_CONFIDENCE and RULE_SUPPORT. If topn is specified and a maximum or minimum rule length is specified, then RULE_CONFIDENCE, RULE_SUPPORT, and NUMBER_OF_ITEMS are allowed in the sort order.

rule_id

Identifier of the rule to return. If you specify a value for rule_id, do not specify values for the other filtering parameters.

min_confidence

Return the rules with confidence greater than or equal to this number.

min_support

Return the rules with support greater than or equal to this number.

max_rule_length

Return the rules with a length less than or equal to this number.

Rule length refers to the number of items in the rule (See NUMBER_OF_ITEMS in Table 45-46). For example, in the rule A=>B (if A, then B), the number of items is 2.

If max_rule_length is specified, then the NUMBER_OF_ITEMS column is permitted in the sort order.

min_rule_length

Return the rules with a length greater than or equal to this number. See max_rule_length for a description of rule length.

If min_rule_length is specified, then the NUMBER_OF_ITEMS column is permitted in the sort order.

sort_order

Sort the rules by the values in one or more of the returned columns. Specify one or more column names, each followed by ASC for ascending order or DESC for descending order. (See Table 45-46, "GET_ASSOCIATION RULES Function Return Values" for the column names.)

For example, to sort the result set in descending order first by the NUMBER_OF_ITEMS column, then by the RULE_CONFIDENCE column, you would specify:

ORA_MINING_VARCHAR2_NT('NUMBER_OF_ITEMS DESC', 'RULE_CONFIDENCE DESC')

If you specify topn, the results will vary depending on the sort order.

By default, the results are sorted by confidence in descending order, then by support in descending order.

antecedent_items

Return the rules with these items in the antecedent.

consequent_items

Return the rules with this item in the consequent.

min_lift

Return the rules with lift greater than or equal to this number.


Return Values

The object type returned by GET_ASSOCIATION_RULES is described in Table 45-46. For descriptions of each field, see the Usage Notes.

Table 45-46 GET_ASSOCIATION RULES Function Return Values

Return Value Description

DM_RULES

A set of rows of type DM_RULE. The rows have the following columns:

(rule_id              INTEGER,
 antecedent           DM_PREDICATES,
 consequent           DM_PREDICATES,
 rule_support         NUMBER,
 rule_confidence      NUMBER,
 rule_lift            NUMBER,
 antecedent_support   NUMBER,
 consequent_support   NUMBER,
 number_of_items      INTEGER )
 

The antecedent and consequent columns each return nested tables of type DM_PREDICATES. The rows, of type DM_PREDICATE, have the following columns:

     (attribute_name            VARCHAR2(4000),
      attribute_subname         VARCHAR2(4000),
      conditional_operator      CHAR(2)/*=,<>,<,>,<=,>=*/,
      attribute_num_value       NUMBER,
      attribute_str_value       VARCHAR2(4000),
      attribute_support         NUMBER,
      attribute_confidence      NUMBER)

Usage Notes

  1. This table function pipes out rows of type DM_RULES. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

  2. The ORA_MINING_VARCHAR2_NT type is defined as a table of VARCHAR2(4000).

  3. The columns returned by GET_ASSOCIATION_RULES are described as follows. (See Table 45-46 for the DM_RULE column data types.)

    Column in DM_RULES Description
    rule_id Unique identifier for the rule
    antecedent The independent condition in the rule. When this condition exists, the dependent condition in the consequent also exists.

    The condition is a combination of attribute values called a predicate (DM_PREDICATE). The predicate specifies a condition for each attribute. The condition may specify equality (=), inequality (<>), greater than (>), less than (<), greater than or equal to (>=), or less than or equal to (<=) a given value.

    Support and confidence for each attribute condition in the antecedent is returned in the predicate. Support is the number of transactions that satisfy the antecedent. Confidence is the likelihood that a transaction will satisfy the antecedent.

    Note: The occurence of the attribute as a DM_PREDICATE indicates the presence of the item in the transaction. The actual value for attribute_num_value or attribute_str_value is meaningless. For example, the following predicate indicates that 'Mouse Pad' is present in the transaction even though the attribute value is NULL.

    DM_PREDICATE('PROD_NAME', 
                    'Mouse Pad', '= ', NULL, NULL, NULL, NULL))
    
    consequent The dependent condition in the rule. This condition exists when the antecedent exists.

    The consequent, like the antecedent, is a predicate (DM_PREDICATE).

    Support and confidence for each attribute condition in the consequent is returned in the predicate. Support is the number of transactions that satisfy the consequent. Confidence is the likelihood that a transaction will satisfy the consequent.

    rule_support The number of transactions that satisfy the rule.
    rule_confidence The likelihood of a transaction satisfying the rule.
    rule_lift The degree of improvement in the prediction over random chance when the rule is satisfied.
    antecedent_support The ratio of the number of transactions that satisfy the antecedent to the total number of transactions.
    consequent_support The ratio of the number of transactions that satisfy the consequent to the total number of transactions.
    number_of_items The total number of attributes referenced in the antecedent and consequent of the rule.

Examples

The following example demonstrates an Association model build followed by several invocations of the GET_ASSOCIATION_RULES table function.

-- prepare a settings table to override default settings
CREATE TABLE market_settings AS
SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'ASSO_%';
BEGIN
-- update the value of the minimum confidence
UPDATE market_settings
   SET setting_value = TO_CHAR(0.081)
 WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence;

-- build an AR model 
DBMS_DATA_MINING.CREATE_MODEL(
  model_name => 'market_model',
  function => DBMS_DATA_MINING.ASSOCIATION,
  data_table_name => 'market_build',
  case_id_column_name => 'item_id',
  target_column_name => NULL,
  settings_table_name => 'market_settings');
END;
/
-- View the (unformatted) rules 
SELECT rule_id, antecedent, consequent, rule_support,
       rule_confidence
  FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model'));

In the previous example, you view all rules. To view just the top 20 rules, use the following statement.

-- View the top 20 (unformatted) rules
SELECT rule_id, antecedent, consequent, rule_support,
       rule_confidence
  FROM TABLE(DBMS_DATA_MINING.GET_ASSOCIATION_RULES('market_model', 20));

The following query uses the association model AR_SH_SAMPLE, which is created from one of the Oracle Data Mining sample programs. (See Oracle Data Mining User's Guide for information about the sample programs.)

SELECT * FROM TABLE (
   DBMS_DATA_MINING.GET_ASSOCIATION_RULES (
      'AR_SH_SAMPLE', 10, NULL, 0.5, 0.01, 2, 1,
         ORA_MINING_VARCHAR2_NT (
         'NUMBER_OF_ITEMS DESC', 'RULE_CONFIDENCE DESC', 'RULE_SUPPORT DESC'),
         DM_ITEMS(DM_ITEM('CUSTPRODS', 'Mouse Pad', 1, NULL), 
                  DM_ITEM('CUSTPRODS', 'Standard Mouse', 1, NULL)),
         DM_ITEMS(DM_ITEM('CUSTPRODS', 'Extension Cable', 1, NULL))));

The query returns three rules, shown as follows.

13  DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Mouse Pad', '= ', 1, NULL, NULL, NULL), 
       DM_PREDICATE('CUSTPRODS', 'Standard Mouse', '= ', 1, NULL, NULL, NULL))
    DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL))
    .15532      .84393   2.7075     .18404     .3117   2
 
11  DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Standard Mouse', '= ', 1, NULL, NULL, NULL))
    DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL))
    .18085      .56291   1.8059     .32128     .3117   1
 
9   DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Mouse Pad', '= ', 1, NULL, NULL, NULL))
    DM_PREDICATES(
       DM_PREDICATE('CUSTPRODS', 'Extension Cable', '= ', 1, NULL, NULL, NULL))
      .17766    .55116   1.7682     .32234     .3117   1



GET_FREQUENT_ITEMSETS Function

This table function returns a set of rows that represent the frequent itemsets from an Association model. For a detailed description of frequent itemsets, consult Oracle Data Mining Concepts.

Syntax

DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS (
    model_name          IN VARCHAR2,
    topn                IN NUMBER DEFAULT NULL,
    max_itemset_length  IN NUMBER DEFAULT NULL)
  RETURN DM_ITEMSETS PIPELINED;

Parameters

Table 45-47 GET_FREQUENT_ITEMSETS Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

topn

When not NULL, return the top n rows ordered by support in descending order

max_itemset_length

Maximum length of an item set.


Return Values

Table 45-48 GET_FREQUENT_ITEMSETS Function Return Values

Return Value Description

DM_ITEMSETS

A set of rows of type DM_ITEMSET. The rows have the following columns:

(itemsets_id      NUMBER,
items             DM_ITEMS,
support           NUMBER,
number_of_items   NUMBER)

The items column returns a nested table of type DM_ITEMS. The rows have type DM_ITEM:

(attribute_name      VARCHAR2(4000),
attribute_subname    VARCHAR2(4000),
attribute_num_value  NUMBER,
attribute_str_value  VARCHAR2(4000))

Usage Notes

This table function pipes out rows of type DM_ITEMSETS. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

Examples

The following example demonstrates an Association model build followed by an invocation of GET_FREQUENT_ITEMSETS table function from Oracle SQL.

-- prepare a settings table to override default settings
CREATE TABLE market_settings AS
    SELECT *
  FROM TABLE(DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'ASSO_%';
BEGIN
-- update the value of the minimum confidence
UPDATE market_settings
   SET setting_value = TO_CHAR(0.081)
 WHERE setting_name = DBMS_DATA_MINING.asso_min_confidence;

/* build a AR model */
DBMS_DATA_MINING.CREATE_MODEL(
  model_name           => 'market_model',
  function             => DBMS_DATA_MINING.ASSOCIATION,
  data_table_name      => 'market_build',
  case_id_column_name  => 'item_id',
  target_column_name   => NULL,
  settings_table_name  => 'market_settings');
END;
/

-- View the (unformatted) Itemsets from SQL*Plus
SELECT itemset_id, items, support, number_of_items
  FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model'));

In the example above, you view all itemsets. To view just the top 20 itemsets, use the following statement:

-- View the top 20 (unformatted) Itemsets from SQL*Plus
SELECT itemset_id, items, support, number_of_items
  FROM TABLE(DBMS_DATA_MINING.GET_FREQUENT_ITEMSETS('market_model', 20));

GET_MODEL_COST_MATRIX Function

This function returns the rows of the default scoring matrix associated with the specified model.

By default, this function returns the scoring matrix that was added to the model with the ADD_COST_MATRIX procedure. If you wish to obtain the cost matrix used to create a model, specify cost_matrix_type_create as the matrix_type. See Table 45-49.

See also ADD_COST_MATRIX Procedure.

Syntax

DBMS_DATA_MINING.GET_MODEL_COST_MATRIX (
      model_name        IN VARCHAR2,
      matrix_type       IN VARCHAR2 DEFAULT cost_matrix_type_score)
RETURN DM_COST_MATRIX PIPELINED;

Parameters

Table 45-49 GET_MODEL_COST_MATRIX Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

matrix_type

The type of cost matrix.

COST_MATRIX_TYPE_SCORE — cost matrix used for scoring. (Default.)

COST_MATRIX_TYPE_CREATE — cost matrix used to create the model (Decision Tree only).


Return Values

Table 45-50 GET_MODEL_COST_MATRIX Function Return Values

Return Value Description

DM_COST_MATRIX

A set of rows of type DM_COST_ELEMENT. The rows have the following columns:

actual          VARCHAR2(4000), NUMBER,  predicted       VARCHAR2(4000), cost            NUMBER)

Usage Notes

Only Decision Tree models can be built with a cost matrix. If you want to build a Decision Tree model with a cost matrix, specify the cost matrix table name in the CLAS_COST_TABLE_NAME setting in the settings table for the model. See Table 45-7, "Mining Function Settings".

The cost matrix used to create a Decision Tree model becomes the default scoring matrix for the model. If you want to specify different costs for scoring, you can modify the values in the cost matrix table or you can use the REMOVE_COST_MATRIX procedure to remove the cost matrix and the ADD_COST_MATRIX procedure to add a new one.

Example

This example returns the scoring cost matrix associated with the Naive Bayes model NB_SH_CLAS_SAMPLE.

column actual format a10
column predicted format a10
SELECT *
    FROM TABLE(dbms_data_mining.get_model_cost_matrix('nb_sh_clas_sample'))
    ORDER BY predicted, actual;
 
ACTUAL     PREDICTED   COST
---------- ---------- -----
0          0            .00
1          0            .75
0          1            .25
1          1            .00

GET_MODEL_DETAILS_AI Function

This table function returns a set of rows that provide the details of an Attribute Importance model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_AI (
  model_name         IN VARCHAR2)
 RETURN DM_RANKED_ATTRIBUTES PIPELINED;

Parameters

Table 45-51 GET_MODEL_DETAILS_AI Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-52 GET_MODEL_DETAILS_AI Function Return Values

Return Value Description

DM_RANKED_ATTRIBUTES

A set of rows of type DM_RANKED_ATTRIBUTE. The rows have the following columns:

(attribute_name          VARCHAR2(4000,
 attribute_subname       VARCHAR2(4000),
 importance_value        NUMBER,
 rank                    NUMBER(38))

Examples

The following example returns model details for the attribute importance model AI_SH_sample, which was created by the sample program dmaidemo.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

SELECT attribute_name, importance_value, rank
    FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_AI('AI_SH_sample'))
    ORDER BY RANK;
 
ATTRIBUTE_NAME                           IMPORTANCE_VALUE       RANK
---------------------------------------- ---------------- ----------
HOUSEHOLD_SIZE                                 .151685183          1
CUST_MARITAL_STATUS                            .145294546          2
YRS_RESIDENCE                                   .07838928          3
AGE                                            .075027496          4
Y_BOX_GAMES                                    .063039952          5
EDUCATION                                      .059605314          6
HOME_THEATER_PACKAGE                           .056458722          7
OCCUPATION                                     .054652937          8
CUST_GENDER                                    .035264741          9
BOOKKEEPING_APPLICATION                        .019204751         10
PRINTER_SUPPLIES                                        0         11
OS_DOC_SET_KANJI                               -.00050013         12
FLAT_PANEL_MONITOR                             -.00509564         13
BULK_PACK_DISKETTES                            -.00540822         14
COUNTRY_NAME                                   -.01201116         15
CUST_INCOME_LEVEL                              -.03951311         16

GET_MODEL_DETAILS_EM Function

This table function returns a set of rows that provide statistics about the clusters produced by an Expectation Maximization model.

By default, the EM algorithm groups components into high-level clusters, and GET_MODEL_DETAILS_EM returns only the high-level clusters with their hierarchies. Alternatively, you can configure EM to disable the grouping of components into high-level clusters. In this case, GET_MODEL_DETAILS_EM returns the components themselves as clusters with their hierarchies. See Table 45-10, "Expectation Maximization Settings for Data Preparation and Analysis".

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_EM (
          model_name         VARCHAR2,
          cluster_id         NUMBER    DEFAULT NULL,
          attribute          VARCHAR2  DEFAULT NULL,
          centroid           NUMBER    DEFAULT 1, 
          histogram          NUMBER    DEFAULT 1, 
          rules              NUMBER    DEFAULT 2,
          attribute_subname  VARCHAR2  DEFAULT NULL,
          topn_attributes    NUMBER    DEFAULT NULL)
RETURN DM_CLUSTERS PIPELINED;

Parameters

Table 45-53 GET_MODEL_DETAILS_EM Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

cluster_id

The ID of a cluster in the model. When a valid cluster ID is specified, only the details of this cluster are returned. Otherwise the details for all clusters are returned.

attribute

The name of an attribute. When a valid attribute name is specified, only the details of this attribute are returned. Otherwise the details for all attributes are returned

centroid

This parameter accepts the following values:

  • 1 — Details about centroids are returned (default)

  • 0 — Details about centroids are not returned

histogram

This parameter accepts the following values:

  • 1 — Details about histograms are returned (default)

  • 0 — Details about histograms are not returned

rules

This parameter accepts the following values:

  • 2 — Details about rules are returned (default)

  • 1 — Rule summaries are returned

  • 0 — No information about rules is returned

attribute_subname

The name of a nested attribute. The full name of a nested attribute has the form:

attribute_name.attribute_subname

where attribute_name is the name of the column and attribute_subname is the name of the nested attribute in that column. If the attribute is not nested, attribute_subname is null.

topn_attributes

Restricts the number of attributes returned in the centroid, histogram, and rules objects. Only the n attributes with the highest confidence values in the rules are returned.

If the number of attributes included in the rules is less than topn, then up to n additional attributes in alphabetical order are returned.

If both the attribute and topn_attributes parameters are specified, then topn_attributes is ignored.


Return Values

Table 45-54 GET_MODEL_DETAILS_EM Function Return Values

Return Value Description

DM_CLUSTERS

A set of rows of type DM_CLUSTER. The rows have the following columns:

(id               NUMBER,
 cluster_id       VARCHAR2(4000),
 record_count     NUMBER,
 parent           NUMBER,
 tree_level       NUMBER,
 dispersion       NUMBER,
 split_predicate  DM_PREDICATES,
 child            DM_CHILDREN,
 centroid         DM_CENTROIDS,
 histogram        DM_HISTOGRAMS,
 rule             DM_RULES)

Note: The Expectation Maximization algorithm uses all the fields except dispersion and split_predicate.

 

The split_predicate column of DM_CLUSTER returns a nested table of type DM_PREDICATES. Each row, of type DM_PREDICATE, has the following columns:

     (attribute_name           VARCHAR2(4000),
      attribute_subname        VARCHAR2(4000),
      conditional_operator     CHAR(2)
                               /*=,<>,<,>,<=,>=*/,
      attribute_num_value      NUMBER,
      attribute_str_value      VARCHAR2(4000),
      attribute_support        NUMBER,
      attribute_confidence     NUMBER)
 

The child column of DM_CLUSTER returns a nested table of type DM_CHILDREN. The rows, of type DM_CHILD, have a single column of type NUMBER, which contains the identifiers of each child.

 

The centroid column of DM_CLUSTER returns a nested table of type DM_CENTROIDS. The rows, of type DM_CENTROID, have the following columns:

      (attribute_name    VARCHAR2(4000),
       attribute_subname  VARCHAR2(4000),
       mean               NUMBER,
       mode_value         VARCHAR2(4000),
       variance           NUMBER)
 

The histogram column of DM_CLUSTER returns a nested table of type DM_HISTOGRAMS. The rows, of type DM_HISTOGRAM_BIN, have the following columns:

    (attribute_name    VARCHAR2(4000),
     attribute_subname  VARCHAR2(4000),
     bin_id             NUMBER,
     lower_bound        NUMBER,
     upper_bound        NUMBER,
     label              VARCHAR2(4000),
     count              NUMBER)
 

The rule column of DM_CLUSTER returns a single row of type DM_RULE. The columns are:

     (rule_id            INTEGER,
      antecedent         DM_PREDICATES,
      consequent         DM_PREDICATES,
      rule_support       NUMBER,
      rule_confidence    NUMBER,
      rule_lift          NUMBER,
      antecedent_support NUMBER,
      consequent_support NUMBER,
      number_of_items    INTEGER)
 

The antecedent and consequent columns each return nested tables of type DM_PREDICATES. The rows, of type DM_PREDICATE, have the following columns:

          (attribute_name           VARCHAR2(4000),
           attribute_subname        VARCHAR2(4000),
           conditional_operator     CHAR(2)
                                 /*=,<>,<,>,<=,>=*/,
           attribute_num_value      NUMBER,
           attribute_str_value      VARCHAR2(4000),
           attribute_support        NUMBER,
           attribute_confidence     NUMBER)

Usage Notes

  1. This table function pipes out rows of type DM_CLUSTERS. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

  2. GET_MODEL_DETAILS functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.

  3. When cluster statistics are disabled (EMCS_CLUSTER_STATISTICS is set to EMCS_CLUS_STATS_DISABLE), GET_MODEL_DETAILS_EM does not return centroids, histograms, or rules. Only taxonomy (hierarchy) and cluster counts are returned.

  4. For descriptions of predicates (DM_PREDICATE) and rules (DM_RULE), see GET_ASSOCIATION_RULES Function.

GET_MODEL_DETAILS_EM_COMP Function

This table function returns a set of rows that provide details about the parameters of an Expectation Maximization model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_EM_COMP (
          model_name         VARCHAR2)
RETURN DM_EM_COMPONENT_SET PIPELINED;

Parameters

Table 45-55 GET_MODEL_DETAILS_EM_COMP Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-56 GET_MODEL_DETAILS_EM_COMP Function Return Values

Return Value Description

DM_EM_COMPONENT_SET

A set of rows of type DM_EM_COMPONENT. The rows have the following columns:

(info_type             VARCHAR2(30),
 component_id          NUMBER,
 cluster_id            NUMBER,
 attribute_name        VARCHAR2(4000),
 covariate_name        VARCHAR2(4000),
 attribute_value       VARCHAR2(4000),
 value                 NUMBER )

Usage Notes

  1. This table function pipes out rows of type DM_EM_COMPONENT. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

    The columns in each row returned by GET_MODEL_DETAILS_EM_COMP are described as follows:

    Column in DM_EM_COMPONENT Description
    info_type The type of information in the row. The following information types are supported:
    • cluster

    • prior

    • mean

    • covariance

    • frequency

    component_id Unique identifier of a component
    cluster_id Unique identifier of the high-level leaf cluster for each component
    attribute_name Name of an original attribute or a derived feature ID. The derived feature ID is used in models built on data with nested columns. The derived feature definitions can be obtained from the GET_MODEL_DETAILS_EM_PROJ Function.
    covariate_name Name of an original attribute or a derived feature ID used in variance/covariance definition
    attribute_value Categorical value or bin interval for binned numerical attributes
    value Encodes different information depending on the value of info_type, as follows:
    • cluster — The value field is NULL

    • prior — The value field returns the component prior

    • mean — The value field returns the mean of the attribute specified in attribute_name

    • covariance — The value field returns the covariance of the attributes specified in attribute_name and covariate_name. Using the same attribute in attribute_name and covariate_name, returns the variance.

    • frequency— The value field returns the multivalued Bernoulli frequency parameter for the attribute/value combination specified by attribute_name and attribute_value

    See Usage Note 2 for details.


  2. The following table shows which fields are used for each info_type. The blank cells represent NULLs.

    info_type component_id cluster_id attribute_name covariate_name attribute_value value
    cluster X X        
    prior X X       X
    mean X X X     X
    covariance X X X X   X
    frequency X X X   X X

  3. GET_MODEL_DETAILS functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.

GET_MODEL_DETAILS_EM_PROJ Function

This table function returns a set of rows that provide statistics about the projections produced by an Expectation Maximization model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_EM_PROJ (
          model_name         VARCHAR2 )
RETURN DM_EM_PROJECTION_SET PIPELINED;

Parameters

Table 45-57 GET_MODEL_DETAILS_EM_PROJ Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-58 GET_MODEL_DETAILS_EM_PROJ Function Return Values

Return Value Description

DM_EM_PROJECTION_SET

A set of rows of type DM_EM_PROJECTION. The rows have the following columns:

(feature_name          VARCHAR2(4000),
 attribute_name        VARCHAR2(4000),
 attribute_subname     VARCHAR2(4000),
 attribute_value       VARCHAR2(4000),
 coefficient           NUMBER )

See Usage Notes for details.


Usage Notes

  1. This table function pipes out rows of type DM_EM_PROJECTION. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

    The columns in each row returned by GET_MODEL_DETAILS_EM_PROJ are described as follows:

    Column in DM_EM_PROJECTION Description
    feature_name Name of a derived feature. The feature maps to the attribute_name returned by the GET_MODEL_DETAILS_EM Function.
    attribute_name Name of a column in the build data
    attribute_subname Subname in a nested column
    attribute_value Categorical value.
    coefficient Projection coefficient. The representation is sparse; only the non-zero coefficients are returned.

  2. GET_MODEL_DETAILS functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.

    The coefficients are related to the transformed, not the original, attributes. When returned directly with the model details, the coefficients may not provide meaningful information. If you want GET_MODEL_DETAILS_SVM to transform the coefficients such that they relate to the original attributes, set the reverse_coef parameter to 1.

GET_MODEL_DETAILS_GLM Function

This table function returns the coefficient statistics for a Generalized Linear Model.

The same set of statistics is returned for both linear and logistic regression, but statistics that do not apply to the mining function are returned as NULL. For more details, see the Usage Notes.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_GLM (
             model_name             VARCHAR2)
RETURN DM_GLM_COEFF_SET PIPELINED;

Parameters

Table 45-59 GET_MODEL_DETAILS_GLM Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-60 GET_MODEL_DETAILS_GLM Return Values

Return Value Description

DM_GLM_COEFF_SET

A set of rows of type DM_GLM_COEFF. The rows have the following columns:

(class                   VARCHAR2(4000),
 attribute_name          VARCHAR2(4000),
 attribute_subname       VARCHAR2(4000),
 attribute_value         VARCHAR2(4000),
 feature_expression      VARCHAR2(4000), 
 coefficient             NUMBER,
 std_error               NUMBER,
 test_statistic          NUMBER,
 p_value                 NUMBER,
 VIF                     NUMBER,
 std_coefficient         NUMBER,
 lower_coeff_limit       NUMBER,
 upper_coeff_limit       NUMBER,
 exp_coefficient         BINARY_DOUBLE,
 exp_lower_coeff_limit   BINARY_DOUBLE,
 exp_upper_coeff_limit   BINARY_DOUBLE)

GET_MODEL_DETAILS_GLM returns a row of statistics for each attribute and one extra row for the intercept, which is identified by a null value in the attribute name. Each row has the DM_GLM_COEFF datatype. The statistics are described in Table 45-61.

Table 45-61 DM_GLM_COEFF Datatype Description

Column Description

class

The non-reference target class for logistic regression. The model is built to predict the probability of this class.

The other class (the reference class) is specified in the model setting GLMS_REFERENCE_CLASS_NAME. See Table 45-14, "GLM Settings".

For linear regression, class is null.

attribute_name

The attribute name when there is no subname, or first part of the attribute name when there is a subname. The value of attribute_name is also the name of the column in the case table that is the source for this attribute.

For the intercept, attribute_name is null. Intercepts are equivalent to the bias term in SVM models.

attribute_subname

The name of an attribute in a nested table. The full name of a nested attribute has the form:

attribute_name.attribute_subname

where attribute_name is the name of the nested column in the case table that is the source for this attribute.

If the attribute is not nested, attribute_subname is null. If the attribute is an intercept, both the attribute_name and the attribute_subname are null.

attribute_value

The value of the attribute (categorical attribute only).

For numerical attributes, attribute_value is null.

feature_expression

The feature name constructed by the algorithm when feature selection is enabled. If feature selection is not enabled, the feature name is simply the fully-qualified attribute name (attribute_name.attribute_subname if the attribute is in a nested column).

For categorical attributes, the algorithm constructs a feature name that has the following form:

fully-qualified_attribute_name.attribute_value

For numerical attributes, the algorithm constructs a name for the higher-order feature by subtracting the mean from each of the component attributes and taking the product of the resulting values:

(attrib1-mean(attrib1))*(attrib2-mean(attrib2))*......

where attrib1 and attrib2 are fully-qualified attribute names.

coefficient

The linear coefficient estimate.

std_error

Standard error of the coefficient estimate.

test_statistic

For linear regression, the t-value of the coefficient estimate.

For logistic regression, the Wald chi-square value of the coefficient estimate.

p-value

Probability of the test_statistic. Used to analyze the significance of specific attributes in the model.

VIF

Variance Inflation Factor. The value is zero for the intercept. For logistic regression, VIF is null.

std_coefficient

Standardized estimate of the coefficient.

lower_coeff_limit

Lower confidence bound of the coefficient.

upper_coeff_limit

Upper confidence bound of the coefficient.

exp_coefficient

Exponentiated coefficient for logistic regression. For linear regression, exp_coefficient is null.

exp_lower_coeff_limit

Exponentiated coefficient for lower confidence bound of the coefficient for logistic regression. For linear regression, exp_lower_coeff_limit is null.

exp_upper_coeff_limit

Exponentiated coefficient for upper confidence bound of the coefficient for logistic regression. For linear regression, exp_lower_coeff_limit is null.


Usage Notes

Not all statistics are necessarily returned for each coefficient. Statistics will be null if:

  • They do not apply to the mining function. For example, exp_coefficient does not apply to linear regression.

  • They cannot be computed from a theoretical standpoint. For example, when ridge regression is enabled, the coefficient values are returned with no statistics except VIF if it is enabled. (For information on ridge regression, see Table 45-14, "GLM Settings".)

  • They cannot be computed because of limitations in system resources.

  • Their values would be infinity.

Examples

The following example returns some of the model details for the GLM regression model GLMR_SH_Regr_sample, which was created by the sample program dmglrdem.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

SQL> SELECT *
   FROM (SELECT class, attribute_name, attribute_value, coefficient, std_error
              FROM TABLE(dbms_data_mining.get_model_details_glm(
                           'GLMR_SH_Regr_sample'))
            ORDER BY class, attribute_name, attribute_value)
     WHERE ROWNUM < 11;

CLASS     ATTRIBUTE_NAME      ATTRIBUTE_VALUE  COEFFICIENT  STD_ERROR
--------- ------------------  ---------------- -----------  ----------
          AFFINITY_CARD                         -.60686139   .531250033
          BULK_PACK_DISKETTES                   -1.9729645   .924531227
          COUNTRY_NAME        Argentina         -1.3340963  1.1942193
          COUNTRY_NAME        Australia           -.340504  5.13688361
          COUNTRY_NAME        Brazil             5.3855269  1.93197882
          COUNTRY_NAME        Canada            4.13393291  2.41283125
          COUNTRY_NAME        China              .74409259  3.59923638
          COUNTRY_NAME        Denmark           -2.5287879  3.18569293
          COUNTRY_NAME        France            -1.0908689  7.18471003
          COUNTRY_NAME        Germany           -1.7472166  2.53689456

GET_MODEL_DETAILS_GLOBAL Function

This table function returns statistics about the model as a whole. Global details are available for Generalized Linear Models, Association Rules, Singular Value Decomposition, and Expectation Maximization.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_GLOBAL (
      model_name     IN  VARCHAR2)
RETURN DM_MODEL_GLOBAL_DETAILS PIPELINED;

Parameters

Table 45-62 GET_MODEL_DETAILS_GLOBAL Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-63 GET_MODEL_DETAILS_GLOBAL Function Return Values

Return Value Description

DM_MODEL_GLOBAL_DETAILS

A collection of rows of type DM_MODEL_GLOBAL_DETAIL. The rows have the following columns:

(global_detail_name   VARCHAR2(30),
 global_detail_value   NUMBER)

Global Details for GLM: Linear Regression

Table 45-64 Global Details for Linear Regression

GLOBAL_DETAIL_NAME Description

MODEL_DF

Model degrees of freedom

MODEL_SUM_SQUARES

Model sum of squares

MODEL_MEAN_SQUARE

Model mean square

F_VALUE

Model F value statistic

MODEL_F_P_VALUE

Model F value probability

ERROR_DF

Error degrees of freedom

ERROR_SUM_SQUARES

Error sum of squares

ERROR_MEAN_SQUARE

Error mean square

CORRECTED_TOTAL_DF

Corrected total degrees of freedom

CORRECTED_TOT_SS

Corrected total sum of squares

ROOT_MEAN_SQ

Root mean square error

DEPENDENT_MEAN

Dependent mean

COEFF_VAR

Coefficient of variation

R_SQ

R-Square

ADJUSTED_R_SQUARE

Adjusted R-Square

AIC

Akaike's information criterion

SBIC

Schwarz's Bayesian information criterion

GMSEP

Estimated mean square error of the prediction, assuming multivariate normality

HOCKING_SP

Hocking Sp statistic

J_P

JP statistic (the final prediction error)

NUM_PARAMS

Number of parameters (the number of coefficients, including the intercept)

NUM_ROWS

Number of rows

MODEL_CONVERGED

Whether or not the model converged. Value is 1 if it converged, or 0 if it did not converge

VALID_COVARIANCE_MATRIX

Valid covariance matrix. Value is 1 if the covariance matrix was computed, or 0 if it was not computed


Global Details for GLM: Logistic Regression

Table 45-65 Global Details for Logistic Regression

GLOBAL_DETAIL_NAME Description

AIC_INTERCEPT

Akaike's criterion for the fit of the intercept only model

AIC_MODEL

Akaike's criterion for the fit of the intercept and the covariates (predictors) mode

SC_INTERCEPT

Schwarz's Criterion for the fit of the intercept only model

SC_MODEL

Schwarz's Criterion for the fit of the intercept and the covariates (predictors) model

NEG2_LL_INTERCEPT

-2 log likelihood of the intercept only model

NEG2_LL_MODEL

-2 log likelihood of the model

LR_DF

Likelihood ratio degrees of freedom

LR_CHI_SQ

Likelihood ratio chi-square value

LR_CHI_SQ_P_VALUE

Likelihood ratio chi-square probability value

PSEUDO_R_SQ_CS

Pseudo R-square Cox and Snell

PSEUDO_R_SQ_N

Pseudo R-square Nagelkerke

DEPENDENT_MEAN

Dependent mean

PCT_CORRECT

Percent of correct predictions

PCT_INCORRECT

Percent of incorrectly predicted rows

PCT_TIED

Percent of cases where probability for both cases is the same

NUM_PARAMS

Number of parameters (the number of coefficients, including the intercept)

NUM_ROWS

Number of rows

MODEL_CONVERGED

Whether or not the model converged. Value is 1if it converged, or 0 if it did not converge.

VALID_COVARIANCE_MATRIX

Valid covariance matrix. Value is 1 if the covariance matrix was computed, or 0 if the covariance matrix not computed


Note:

When ridge regression is enabled, fewer global details are returned. For information about ridge, see Oracle Data Mining Concepts.

Global Detail for Association Rules

A single global detail is produced by an Association model.

Table 45-66 Global Detail for Association Rules

GLOBAL_DETAIL_NAME Description

RULE_COUNT

The number of association rules in the model.


Global Details for Singular Value Decomposition

Table 45-67 Global Details for Singular Value Decomposition

GLOBAL_DETAIL_NAME Description

NUM_COMPONENTS

Number of features (components) produced by the model

SUGGESTED_CUTOFF

Suggested cutoff that indicates how many of the top computed features capture most of the variance in the model. Using only the features below this cutoff would be a reasonable strategy for dimensionality reduction.


Global Details for Expectation Maximization

Table 45-68 Global Details for Expectation Maximization

GLOBAL_DETAIL_NAME Description

NUM_COMPONENTS

Number of components produced by the model

NUM_CLUSTERS

Number of clusters produced by the model

LOGLIKELIHOOD

Percent improvement in the value of the log likelihood function required to add a new component to the model.


Examples

The following example returns the global model details for the GLM regression model GLMR_SH_Regr_sample, which was created by the sample program dmglrdem.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

SELECT *
  FROM TABLE(dbms_data_mining.get_model_details_global(
              'GLMR_SH_Regr_sample'))
ORDER BY global_detail_name;
GLOBAL_DETAIL_NAME             GLOBAL_DETAIL_VALUE
------------------------------ -------------------
ADJUSTED_R_SQUARE                       .731412557
AIC                                       5931.814
COEFF_VAR                               18.1711243
CORRECTED_TOTAL_DF                            1499
CORRECTED_TOT_SS                        278740.504
DEPENDENT_MEAN                              38.892
ERROR_DF                                      1433
ERROR_MEAN_SQUARE                       49.9440956
ERROR_SUM_SQUARES                       71569.8891
F_VALUE                                 62.8492452
GMSEP                                    52.280819
HOCKING_SP                              .034877162
J_P                                     52.1749319
MODEL_CONVERGED                                  1
MODEL_DF                                        66
MODEL_F_P_VALUE                                  0
MODEL_MEAN_SQUARE                       3138.94871
MODEL_SUM_SQUARES                       207170.615
NUM_PARAMS                                      67
NUM_ROWS                                      1500
ROOT_MEAN_SQ                            7.06711367
R_SQ                                    .743238288
SBIC                                    6287.79977
VALID_COVARIANCE_MATRIX                          1

GET_MODEL_DETAILS_KM Function

This table function returns a set of rows that provide the details of a k-Means clustering model.

You can provide input to GET_MODEL_DETAILS_KM to request specific information about the model, thus improving the performance of the query. If you do not specify filtering parameters, GET_MODEL_DETAILS_KM returns all the information about the model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_KM (
          model_name          VARCHAR2,
          cluster_id          NUMBER    DEFAULT NULL,
          attribute           VARCHAR2  DEFAULT NULL,
          centroid            NUMBER    DEFAULT 1, 
          histogram           NUMBER    DEFAULT 1, 
          rules               NUMBER    DEFAULT 2,
          attribute_subname   VARCHAR2  DEFAULT NULL,
          topn_attributes     NUMBER    DEFAULT NULL)
RETURN DM_CLUSTERS PIPELINED;

Parameters

Table 45-69 GET_MODEL_DETAILS_KM Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

cluster_id

The ID of a cluster in the model. When a valid cluster ID is specified, only the details of this cluster are returned. Otherwise the details for all clusters are returned.

attribute

The name of an attribute. When a valid attribute name is specified, only the details of this attribute are returned. Otherwise the details for all attributes are returned

centroid

This parameter accepts the following values:

  • 1 — Details about centroids are returned (default)

  • 0 — Details about centroids are not returned

histogram

This parameter accepts the following values:

  • 1 — Details about histograms are returned (default)

  • 0 — Details about histograms are not returned

rules

This parameter accepts the following values:

  • 2 — Details about rules are returned (default)

  • 1 — Rule summaries are returned

  • 0 — No information about rules is returned

attribute_subname

The name of a nested attribute. The full name of a nested attribute has the form:

attribute_name.attribute_subname

where attribute_name is the name of the column and attribute_subname is the name of the nested attribute in that column.

If the attribute is not nested, attribute_subname is null.

topn_attributes

Restricts the number of attributes returned in the centroid, histogram, and rules objects. Only the n attributes with the highest confidence values in the rules are returned.

If the number of attributes included in the rules is less than topn, then up to n additional attributes in alphabetical order are returned.

If both the attribute and topn_attributes parameters are specified, then topn_attributes is ignored.


Return Values

Table 45-70 GET_MODEL_DETAILS_KM Function Return Values

Return Value Description

DM_CLUSTERS

A set of rows of type DM_CLUSTER. The rows have the following columns:

(id                   NUMBER,
 cluster_id           VARCHAR2(4000),
 record_count         NUMBER,
 parent               NUMBER,
 tree_level           NUMBER,
 dispersion           NUMBER,
 split_predicate      DM_PREDICATES,
 child                DM_CHILDREN,
 centroid             DM_CENTROIDS,
 histogram            DM_HISTOGRAMS,
 rule                 DM_RULE)
 

The split_predicate column of DM_CLUSTER returns a nested table of type DM_PREDICATES. Each row, of type DM_PREDICATE, has the following columns:

     (attribute_name           VARCHAR2(4000),
      attribute_subname        VARCHAR2(4000),
      conditional_operator     CHAR(2) /*=,<>,<,>,<=,>=*/,
      attribute_num_value      NUMBER,
      attribute_str_value      VARCHAR2(4000),
      attribute_support        NUMBER,
      attribute_confidence     NUMBER)
 

The child column of DM_CLUSTER returns a nested table of type DM_CHILDREN. The rows, of type DM_CHILD, have a single column of type NUMBER, which contains the identifiers of each child.

 

The centroid column of DM_CLUSTER returns a nested table of type DM_CENTROIDS. The rows, of type DM_CENTROID, have the following columns:

     (attribute_name    VARCHAR2(4000),
      attribute_subname  VARCHAR2(4000),
      mean               NUMBER,
      mode_value         VARCHAR2(4000),
      variance           NUMBER)
 

The histogram column of DM_CLUSTER returns a nested table of type DM_HISTOGRAMS. The rows, of type DM_HISTOGRAM_BIN, have the following columns:

     (attribute_name    VARCHAR2(4000),
      attribute_subname  VARCHAR2(4000),
      bin_id             NUMBER,
      lower_bound        NUMBER,
      upper_bound        NUMBER,
      label              VARCHAR2(4000),
      count              NUMBER)
 

The rule column of DM_CLUSTER returns a single row of type DM_RULE. The columns are:

     (rule_id            INTEGER,
      antecedent         DM_PREDICATES,
      consequent         DM_PREDICATES,
      rule_support       NUMBER,
      rule_confidence    NUMBER,
      rule_lift          NUMBER,
      antecedent_support NUMBER,
      consequent_support NUMBER,
      number_of_items    INTEGER)
 

The antecedent and consequent columns of DM_RULE each return nested tables of type DM_PREDICATES. The rows, of type DM_PREDICATE, have the following columns:

          (attribute_name           VARCHAR2(4000),
           attribute_subname        VARCHAR2(4000),
           conditional_operator     CHAR(2)/*=,<>,<,>,<=,>=*/,
           attribute_num_value      NUMBER,
           attribute_str_value      VARCHAR2(4000),
           attribute_support        NUMBER,
           attribute_confidence     NUMBER)

Usage Notes

  1. The table function pipes out rows of type DM_CLUSTERS. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

  2. For descriptions of predicates (DM_PREDICATE) and rules (DM_RULE), see GET_ASSOCIATION_RULES Function.

Examples

The following example returns model details for the k-Means clustering model KM_SH_Clus_sample, which was created by the sample program dmkmdemo.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

SELECT T.id           clu_id,
       T.record_count rec_cnt,
       T.parent       parent,
       T.tree_level   tree_level,
       T.dispersion   dispersion
  FROM (SELECT *
          FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_KM(
                     'KM_SH_Clus_sample'))
        ORDER BY id) T
 WHERE ROWNUM < 6;  
 
    CLU_ID    REC_CNT     PARENT TREE_LEVEL DISPERSION
---------- ---------- ---------- ---------- ----------
         1       1500                     1  5.9152211
         2        638          1          2 3.98458982
         3        862          1          2 5.83732097
         4        376          3          3 5.05192137
         5        486          3          3 5.42901522

GET_MODEL_DETAILS_NB Function

This table function returns a set of rows that provide the details of a Naive Bayes model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_NB (
   model_name      IN       VARCHAR2)
 RETURN DM_NB_DETAILS PIPELINED;

Parameters

Table 45-71 GET_MODEL_DETAILS_NB Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-72 GET_MODEL_DETAILS_NB Function Return Values

Return Value Description

DM_NB_DETAILS

A set of rows of type DM_NB_DETAIL. The rows have the following columns:

(target_attribute_name          VARCHAR2(30),
 target_attribute_str_value     VARCHAR2(4000),
 target_attribute_num_value     NUMBER,
 prior_probability              NUMBER,
 conditionals                   DM_CONDITIONALS)
 

The conditionals column of DM_NB_DETAIL returns a nested table of type DM_CONDITIONALS. The rows, of type DM_CONDITIONAL, have the following columns:

     (attribute_name             VARCHAR2(4000),
      attribute_subname          VARCHAR2(4000),
      attribute_str_value        VARCHAR2(4000),
      attribute_num_value        NUMBER,
      conditional_probability    NUMBER)

Usage Notes

The table function pipes out rows of type DM_NB_DETAILS. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

Examples

The following query is from the sample program dmnbdemo.sql. It returns model details about the model NB_SH_Clas_sample. For information about the sample programs, see Oracle Data Mining User's Guide.

The query creates labels from the bin boundary tables that were used to bin the training data. It replaces the attribute values with the labels. For numeric bins, the labels are (lower_boundary,upper_boundary]; for categorical bins, the label matches the value it represents. (This method of categorical label representation will only work for cases where one value corresponds to one bin.) The target was not binned.

WITH
    bin_label_view AS (
    SELECT col, bin, (DECODE(bin,'1','[','(') || lv || ',' || val || ']') label
      FROM (SELECT col,
                   bin,
                   LAST_VALUE(val) OVER (
                   PARTITION BY col ORDER BY val
                   ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) lv,
                   val
             FROM nb_sh_sample_num)
   UNION ALL
   SELECT col, bin, val label
     FROM nb_sh_sample_cat
   ),
   model_details AS (
   SELECT T.target_attribute_name                                        tname,
          NVL(TO_CHAR(T.target_attribute_num_value,T.target_attribute_str_value))   
tval,
          C.attribute_name                                               pname,
          NVL(L.label, NVL(C.attribute_str_value, C.attribute_num_value)) pval,
          T.prior_probability                                           priorp,
          C.conditional_probability                                      condp
     FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NB('NB_SH_Clas_sample')) T,
          TABLE(T.conditionals) C,
          bin_label_view L
    WHERE C.attribute_name = L.col (+) AND
          (NVL(C.attribute_str_value,C.attribute_num_value) = L.bin(+))
   ORDER BY 1,2,3,4,5,6
   )
   SELECT tname, tval, pname, pval, priorp, condp
     FROM model_details
    WHERE ROWNUM < 11;

TNAME          TVAL PNAME                     PVAL           PRIORP   CONDP
-------------- ---- ------------------------- ------------- ------- -------
AFFINITY_CARD  0    AGE                       (24,30]         .6500   .1714
AFFINITY_CARD  0    AGE                       (30,35]         .6500   .1509
AFFINITY_CARD  0    AGE                       (35,40]         .6500   .1125
AFFINITY_CARD  0    AGE                       (40,46]         .6500   .1134
AFFINITY_CARD  0    AGE                       (46,53]         .6500   .1071
AFFINITY_CARD  0    AGE                       (53,90]         .6500   .1312
AFFINITY_CARD  0    AGE                       [17,24]         .6500   .2134
AFFINITY_CARD  0    BOOKKEEPING_APPLICATION   0               .6500   .1500
AFFINITY_CARD  0    BOOKKEEPING_APPLICATION   1               .6500   .8500
AFFINITY_CARD  0    BULK_PACK_DISKETTES       0               .6500   .3670

GET_MODEL_DETAILS_NMF Function

This table function returns a set of rows that provide the details of a Non-Negative Matrix Factorization model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF (
   model_name        IN        VARCHAR2)
 RETURN DM_NMF_FEATURE_SET PIPELINED;

Parameters

Table 45-73 GET_MODEL_DETAILS_NMF Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-74 GET_MODEL_DETAILS_NMF Function Return Values

Return Value Description

DM_NMF_FEATURE_SET

A set of rows of DM_NMF_FEATURE. The rows have the following columns:

(feature_id          NUMBER,
 mapped_feature_id   VARCHAR2(4000),
 attribute_set       DM_NMF_ATTRIBUTE_SET)
 

The attribute_set column of DM_NMF_FEATURE returns a nested table of type DM_NMF_ATTRIBUTE_SET. The rows, of type DM_NMF_ATTRIBUTE, have the following columns:

     (attribute_name    VARCHAR2(4000),
      attribute_subname  VARCHAR2(4000),
      attribute_value    VARCHAR2(4000),
      coefficient        NUMBER)

Usage Notes

The table function pipes out rows of type DM_NMF_FEATURE_SET. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

Examples

The following example returns model details for the feature extraction model NMF_SH_Sample, which was created by the sample program dmnmdemo.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

SELECT * FROM (
SELECT F.feature_id,
       A.attribute_name,
       A.attribute_value,
       A.coefficient
  FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_NMF('NMF_SH_Sample')) F,
       TABLE(F.attribute_set) A
ORDER BY feature_id,attribute_name,attribute_value
) WHERE ROWNUM < 11; 
 
FEATURE_ID ATTRIBUTE_NAME          ATTRIBUTE_VALUE          COEFFICIENT
--------- -----------------------  ---------------- -------------------
        1 AFFINITY_CARD                                 .051208078859308
        1 AGE                                          .0390513260041573
        1 BOOKKEEPING_APPLICATION                      .0512734004239326
        1 BULK_PACK_DISKETTES                           .232471260895683
        1 COUNTRY_NAME             Argentina          .00766817464479959
        1 COUNTRY_NAME             Australia         .000157637881096675
        1 COUNTRY_NAME             Brazil              .0031409632415604
        1 COUNTRY_NAME             Canada             .00144213099311427
        1 COUNTRY_NAME             China             .000102279310968754
        1 COUNTRY_NAME             Denmark           .000242424084307513

GET_MODEL_DETAILS_OC Function

This table function returns a set of rows that provide the details of an O-Cluster clustering model. The rows are an enumeration of the clustering patterns generated during the creation of the model.

You can provide input to GET_MODEL_DETAILS_OC to request specific information about the model, thus improving the performance of the query. If you do not specify filtering parameters, GET_MODEL_DETAILS_OC returns all the information about the model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_OC (
          model_name         VARCHAR2,
          cluster_id         NUMBER    DEFAULT NULL,
          attribute          VARCHAR2  DEFAULT NULL,
          centroid           NUMBER    DEFAULT 1, 
          histogram          NUMBER    DEFAULT 1, 
          rules              NUMBER    DEFAULT 2,
          topn_attributes    NUMBER    DEFAULT NULL)
RETURN DM_CLUSTERS PIPELINED;

Parameters

Table 45-75 GET_MODEL_DETAILS_OC Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

cluster_id

The ID of a cluster in the model. When a valid cluster ID is specified, only the details of this cluster are returned. Otherwise the details for all clusters are returned.

attribute

The name of an attribute. When a valid attribute name is specified, only the details of this attribute are returned. Otherwise the details for all attributes are returned

centroid

This parameter accepts the following values:

  • 1 — Details about centroids are returned (default)

  • 0 — Details about centroids are not returned

histogram

This parameter accepts the following values:

  • 1 — Details about histograms are returned (default)

  • 0 — Details about histograms are not returned

rules

This parameter accepts the following values:

  • 2 — Details about rules are returned (default)

  • 1 — Rule summaries are returned

  • 0 — No information about rules is returned

topn_attributes

Restricts the number of attributes returned in the centroid, histogram, and rules objects. Only the n attributes with the highest confidence values in the rules are returned.

If the number of attributes included in the rules is less than topn, then up to n additional attributes in alphabetical order are returned.

If both the attribute and topn_attributes parameters are specified, then topn_attributes is ignored.


Return Values

Table 45-76 GET_MODEL_DETAILS_OC Function Return Values

Return Value Description

DM_CLUSTERS

A set of rows of type DM_CLUSTER. The rows have the following columns:

(id               NUMBER,
 cluster_id       VARCHAR2(4000),
 record_count     NUMBER,
 parent           NUMBER,
 tree_level       NUMBER,
 dispersion       NUMBER,
 split_predicate  DM_PREDICATES,
 child            DM_CHILDREN,
 centroid         DM_CENTROIDS,
 histogram        DM_HISTOGRAMS,
 rule             DM_RULE)
 

The split_predicate column of DM_CLUSTER returns a nested table of type DM_PREDICATES. Each row, of type DM_PREDICATE, has the following columns:

     (attribute_name           VARCHAR2(4000),
      attribute_subname        VARCHAR2(4000),
      conditional_operator     CHAR(2) /*=,<>,<,>,<=,>=*/,
      attribute_num_value      NUMBER,
      attribute_str_value      VARCHAR2(4000),
      attribute_support        NUMBER,
      attribute_confidence     NUMBER)
 

The child column of DM_CLUSTER returns a nested table of type DM_CHILDREN. The rows, of type DM_CHILD, have a single column of type NUMBER, which contains the identifiers of each child.

 

The centroid column of DM_CLUSTER returns a nested table of type DM_CENTROIDS. The rows, of type DM_CENTROID, have the following columns:

      (attribute_name    VARCHAR2(4000),
       attribute_subname  VARCHAR2(4000),
       mean               NUMBER,
       mode_value         VARCHAR2(4000),
       variance           NUMBER)
 

The histogram column of DM_CLUSTER returns a nested table of type DM_HISTOGRAMS. The rows, of type DM_HISTOGRAM_BIN, have the following columns:

    (attribute_name    VARCHAR2(4000),
     attribute_subname  VARCHAR2(4000),
     bin_id             NUMBER,
     lower_bound        NUMBER,
     upper_bound        NUMBER,
     label              VARCHAR2(4000),
     count              NUMBER)
 

The rule column of DM_CLUSTER returns a single row of type DM_RULE. The columns are:

     (rule_id            INTEGER,
      antecedent         DM_PREDICATES,
      consequent         DM_PREDICATES,
      rule_support       NUMBER,
      rule_confidence    NUMBER,
      rule_lift          NUMBER,
      antecedent_support NUMBER,
      consequent_support NUMBER,
      number_of_items    INTEGER)
 

The antecedent and consequent columns each return nested tables of type DM_PREDICATES. The rows, of type DM_PREDICATE, have the following columns:

          (attribute_name           VARCHAR2(4000),
           attribute_subname        VARCHAR2(4000),
           conditional_operator     CHAR(2)/*=,<>,<,>,<=,>=*/,
           attribute_num_value      NUMBER,
           attribute_str_value      VARCHAR2(4000),
           attribute_support        NUMBER,
           attribute_confidence     NUMBER)

Usage Notes

  1. The table function pipes out rows of type DM_CLUSTER. For information about Data Mining datatypes and piped output from table functions, see "Datatypes".

  2. For descriptions of predicates (DM_PREDICATE) and rules (DM_RULE), see GET_ASSOCIATION_RULES Function.

Examples

The following example returns model details for the clustering model OC_SH_Clus_sample, which was created by the sample program dmocdemo.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

For each cluster in this example, the split predicate indicates the attribute and the condition used to assign records to the cluster's children during model build. It provides an important piece of information on how the population within a cluster can be divided up into two smaller clusters.

SELECT clu_id, attribute_name, op, s_value
     FROM (SELECT a.id clu_id, sp.attribute_name, sp.conditional_operator op,
                  sp.attribute_str_value s_value
             FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_OC(
                    'OC_SH_Clus_sample')) a,
                  TABLE(a.split_predicate) sp
           ORDER BY a.id, op, s_value)
     WHERE ROWNUM < 11;
 
     CLU_ID ATTRIBUTE_NAME       OP S_VALUE
----------- -------------------- ---------------------------------
          1 OCCUPATION           IN ?
          1 OCCUPATION           IN Armed-F
          1 OCCUPATION           IN Cleric.
          1 OCCUPATION           IN Crafts
          2 OCCUPATION           IN ?
          2 OCCUPATION           IN Armed-F
          2 OCCUPATION           IN Cleric.
          3 OCCUPATION           IN Exec.
          3 OCCUPATION           IN Farming
          3 OCCUPATION           IN Handler

GET_MODEL_DETAILS_SVD Function

This table function returns a set of rows that provide the details of a Singular Value Decomposition model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_SVD (
        model_name        VARCHAR2,
        matrix_type       VARCHAR2 DEFAULT NULL)
 RETURN DM_SVD_MATRIX_SET PIPELINED;

Parameters

Table 45-77 GET_MODEL_DETAILS_SVD Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

matrix_type

Specifies which of the three SVD matrix types to return. Values are: U, S, V, and NULL. When matrix_type is null (default), all matrices are returned.

The U matrix is only computed when the SVDS_U_MATRIX_OUTPUT setting is enabled. It is not computed by default. If the model does not contain U matrices and you set matrix_type to U, an empty set of rows is returned. See Table 45-20, "Singular Value Decomposition Settings".


Return Values

Table 45-78 GET_MODEL_DETAILS_SVD Function Return Values

Return Value Description

DM_SVD_MATRIX_SET

A set of rows of type DM_SVD_MATRIX. The rows have the following columns:

(matrix_type         CHAR(1),
 feature_id          NUMBER,
 mapped_feature_id   VARCHAR2(4000),
 attribute_name      VARCHAR2(4000),
 attribute_subname   VARCHAR2(4000),
 case_id             VARCHAR2(4000),
 value               NUMBER,
 variance            NUMBER,
 pct_cum_variance    NUMBER)

See Usage Notes for details.


Usage Notes

  1. This table function pipes out rows of type DM_SVD_MATRIX. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

    The columns in each row returned by GET_MODEL_DETAILS_SVD are described as follows:

    Column in DM_SVD_MATRIX_SET Description
    matrix_type The type of matrix. Possible values are S, V, and U. This field is never null.
    feature_id The feature that the matrix entry refers to.
    mapped_feature_id A descriptive name for the feature.
    attribute_name Column name in the V matrix component bases. This field is null for the S and U matrices.
    attribute_subname Subname in the V matrix component bases. This is relevant only in the case of a nested column. This field is null for the S and U matrices.
    case_id Unique identifier of the row in the build data described by the U matrix projection. This field is null for the S and V matrices.
    value The matrix entry value.
    variance The variance explained by a component. It is non-null only for S matrix entries.
    pct_cum_variance The percent cumulative variance explained by the components thus far. The components are ranked by the explained variance in descending order.

  2. The output of GET_MODEL_DETAILS is in sparse format. Zero values are not returned. Only the diagonal elements of the S matrix, the non-zero coefficients in the V matrix bases, and the non-zero U matrix projections are returned.

    There is one exception: If the data row does not produce non-zero U matrix projections, the case ID for that row is returned with nulls for the feature ID and value. This is done to avoid losing any records from the original data.

  3. GET_MODEL_DETAILS functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.

GET_MODEL_DETAILS_SVM Function

This table function returns a set of rows that provide the details of a linear Support Vector Machine (SVM) model. If invoked for nonlinear SVM, it returns ORA-40215.

In linear SVM models, only nonzero coefficients are stored. This reduces storage and speeds up model loading. As a result, if an attribute is missing in the coefficient list returned by GET_MODEL_DETAILS_SVM, then the coefficient of this attribute should be interpreted as zero.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_SVM (
        model_name        VARCHAR2,
        reverse_coef      NUMBER DEFAULT 0)
 RETURN DM_SVM_LINEAR_COEFF_SET PIPELINED;

Parameters

Table 45-79 GET_MODEL_DETAILS_SVM Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.

reverse_coef

Whether or not GET_MODEL_DETAILS_SVM should transform the attribute coefficients using the original attribute transformations.

When reverse_coef is set to 0 (default), GET_MODEL_DETAILS_SVM returns the coefficients directly from the model without applying transformations.

When reverse_coef is set to 1, GET_MODEL_DETAILS_SVM transforms the coefficients and bias by applying the normalization shifts and scales that were generated using automatic data preparation.

See Usage Note 4.


Return Values

Table 45-80 GET_MODEL_DETAILS_SVM Function Return Values

Return Value Description

DM_SVM_LINEAR_COEFF_SET

A set of rows of type DM_SVM_LINEAR_COEFF. The rows have the following columns:

(class            VARCHAR2(4000),
 attribute_set    DM_SVM_ATTRIBUTE_SET)
 

The attribute_set column returns a nested table of type DM_SVM_ATTRIBUTE_SET. The rows, of type DM_SVM_ATTRIBUTE, have the following columns:

     (attribute_name      VARCHAR2(4000),
      attribute_subname   VARCHAR2(4000),
      attribute_value     VARCHAR2(4000),
      coefficient         NUMBER)

See Usage Notes.


Usage Notes

  1. This table function pipes out rows of type DM_SVM_LINEAR_COEFF. For information on Data Mining datatypes and piped output from table functions, see "Datatypes".

  2. The class column of DM_SVM_LINEAR_COEFF contains classification target values. For SVM regression models, class is null. For each classification target value, a set of coefficients is returned. For binary classification, one-class classification, and regression models, only a single set of coefficients is returned.

  3. The attribute_value column in DM_SVM_ATTRIBUTE_SET is used for categorical attributes.

  4. GET_MODEL_DETAILS functions preserve model transparency by automatically reversing the transformations applied during the build process. Thus the attributes returned in the model details are the original attributes (or a close approximation of the original attributes) used to build the model.

    The coefficients are related to the transformed, not the original, attributes. When returned directly with the model details, the coefficients may not provide meaningful information. If you want GET_MODEL_DETAILS_SVM to transform the coefficients such that they relate to the original attributes, set the reverse_coef parameter to 1.

Examples

The following example returns model details for the SVM classification model SVMC_SH_Clas_sample, which was created by the sample program dmsvcdem.sql. For information about the sample programs, see Oracle Data Mining User's Guide.

WITH
  mod_dtls AS (
  SELECT *
    FROM TABLE(DBMS_DATA_MINING.GET_MODEL_DETAILS_SVM('SVMC_SH_Clas_sample'))
  ),
  model_details AS (
  SELECT D.class, A.attribute_name, A.attribute_value, A.coefficient
    FROM mod_dtls D,
            TABLE(D.attribute_set) A
    ORDER BY D.class, ABS(A.coefficient) DESC
  )
  SELECT class, attribute_name aname, attribute_value aval, coefficient coeff
    FROM model_details
    WHERE ROWNUM < 11;
 
CLASS      ANAME                     AVAL                      COEFF
---------- ------------------------- ------------------------- -----
1                                                              -2.85
1          BOOKKEEPING_APPLICATION                              1.11
1          OCCUPATION                Other                      -.94
1          HOUSEHOLD_SIZE            4-5                         .88
1          CUST_MARITAL_STATUS       Married                     .82
1          YRS_RESIDENCE                                         .76
1          HOUSEHOLD_SIZE            6-8                        -.74
1          OCCUPATION                Exec.                       .71
1          EDUCATION                 11th                       -.71
1          EDUCATION                 Masters                     .63

GET_MODEL_DETAILS_XML Function

This function returns an XML object that provides the details of a Decision Tree model.

Syntax

DBMS_DATA_MINING.GET_MODEL_DETAILS_XML (
         model_name      IN       VARCHAR2)
   RETURN XMLTYPE;

Parameters

Table 45-81 GET_MODEL_DETAILS_XML Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-82 GET_MODEL_DETAILS_XML Function Return Value

Return Value Description

XMLTYPE

The XML definition for the decision tree model. See Chapter 285, "XMLTYPE" for details.

The XML conforms to the Data Mining Group Predictive Model Markup Language (PMML) version 2.1 specification. The specification is available at http://www.dmg.org.


Usage Notes

Special characters that cannot be displayed by Oracle XML are converted to '#'.

Examples

The following statements in SQL*Plus return the details of the decision tree model dt_sh_clas_sample. This model is created by the program dmdtdemo.sql, one of the sample data mining programs provided with Oracle Database Examples.

Note: The "&quot" characters you will see in the XML output are a result of SQL*Plus behavior. To display the XML in proper format, cut and past it into a file and open the file in a browser.

column dt_details format a320
SELECT 
 dbms_data_mining.get_model_details_xml('dt_sh_clas_sample') 
 AS DT_DETAILS
FROM dual;


DT_DETAILS
--------------------------------------------------------------------------------
<PMML version="2.1">
  <Header copyright="Copyright (c) 2004, Oracle Corporation. All rights
      reserved."/>
  <DataDictionary numberOfFields="9">
    <DataField name="AFFINITY_CARD" optype="categorical"/> 
    <DataField name="AGE" optype="continuous"/> 
    <DataField name="BOOKKEEPING_APPLICATION" optype="continuous"/>
    <DataField name="CUST_MARITAL_STATUS" optype="categorical"/>
    <DataField name="EDUCATION" optype="categorical"/> 
    <DataField name="HOUSEHOLD_SIZE" optype="categorical"/>
    <DataField name="OCCUPATION" optype="categorical"/>
    <DataField name="YRS_RESIDENCE" optype="continuous"/>
    <DataField name="Y_BOX_GAMES" optype="continuous"/>
  </DataDictionary>
  <TreeModel modelName="DT_SH_CLAS_SAMPLE" functionName="classification"
      splitCharacteristic="binarySplit">
    <Extension name="buildSettings">
      <Setting name="TREE_IMPURITY_METRIC" value="TREE_IMPURITY_GINI"/>
      <Setting name="TREE_TERM_MAX_DEPTH" value="7"/>
      <Setting name="TREE_TERM_MINPCT_NODE" value=".05"/>
      <Setting name="TREE_TERM_MINPCT_SPLIT" value=".1"/> 
      <Setting name="TREE_TERM_MINREC_NODE" value="10"/>
      <Setting name="TREE_TERM_MINREC_SPLIT" value="20"/>
      <costMatrix>
        <costElement>
          <actualValue>0</actualValue>
          <predictedValue>0</predictedValue>
          <cost>0</cost>
        </costElement>
        <costElement>
          <actualValue>0</actualValue>
          <predictedValue>1</predictedValue>
          <cost>1</cost>
        </costElement>
        <costElement>
          <actualValue>1</actualValue>
          <predictedValue>0</predictedValue>
          <cost>8</cost> 
        </costElement>
        <costElement> 
          <actualValue>1</actualValue>
          <predictedValue>1</predictedValue> 
          <cost>0</cost> 
        </costElement>
      </costMatrix>
    </Extension>
    <MiningSchema>
      .
      .
      .
      .
      .
      . 
      </Node>
    </Node>
  </TreeModel>
</PMML> 

GET_MODEL_TRANSFORMATIONS Function

This function returns the transformation expressions embedded in the specified model.

Syntax

DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS (
      model_name     IN VARCHAR2)
RETURN DM_TRANSFORMS PIPELINED;

Parameters

Table 45-83 GET_MODEL_TRANSFORMATIONS Function Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Return Values

Table 45-84 GET_MODEL_TRANSFORMATIONS Function Return Value

Return Value Description

DM_TRANSFORMS

The transformation expressions embedded in model_name.

The DM_TRANSFORMS type is a table of DM_TRANSFORM objects. Each DM_TRANSFORM has these fields:

attribute_name       VARCHAR2(4000)
attribute_subname    VARCHAR2(4000)
expression           CLOB
reverse_expression   CLOB

Usage Notes

When Automatic Data Preparation (ADP) is enabled, both automatic and user-defined transformations may be associated with an attribute. In this case, the user-defined transformations are evaluated before the automatic transformations.

Examples

In this example, several columns in the SH.CUSTOMERS table are used to create a Naive Bayes model. A transformation expression is specified for one of the columns. The model does not use ADP.

CREATE OR REPLACE VIEW mining_data AS
   SELECT cust_id, cust_year_of_birth, cust_income_level,cust_credit_limit
   FROM sh.customers;

describe mining_data
 Name                                   Null?    Type
 -------------------------------------- -------- --------------------------
 CUST_ID                                NOT NULL NUMBER
 CUST_YEAR_OF_BIRTH                     NOT NULL NUMBER(4)
 CUST_INCOME_LEVEL                               VARCHAR2(30)
 CUST_CREDIT_LIMIT                               NUMBER
 
CREATE TABLE settings_nb(
      setting_name  VARCHAR2(30),
      setting_value VARCHAR2(30));
BEGIN
     INSERT INTO settings_nb (setting_name, setting_value) VALUES
           (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);
     INSERT INTO settings_nb (setting_name, setting_value) VALUES
           (dbms_data_mining.prep_auto, dbms_data_mining.prep_auto_off);
     COMMIT;
END;
/
DECLARE
    mining_data_xforms   dbms_data_mining_transform.TRANSFORM_LIST;
  BEGIN
    dbms_data_mining_transform.SET_TRANSFORM (
         xform_list           =>  mining_data_xforms,
         attribute_name       => 'cust_year_of_birth',
         attribute_subname    =>  null,
         expression           => 'cust_year_of_birth + 10',
         reverse_expression   => 'cust_year_of_birth - 10');
    dbms_data_mining.CREATE_MODEL (
        model_name           =>  'new_model',
        mining_function      =>   dbms_data_mining.classification,
        data_table_name      =>  'mining_data',
        case_id_column_name  =>  'cust_id',
        target_column_name   =>  'cust_income_level',
        settings_table_name  =>  'settings_nb',
        data_schema_name     =>   nulL,
        settings_schema_name =>   null,
        xform_list           =>   mining_data_xforms );
  END;
 /
SELECT attribute_name, TO_CHAR(expression), TO_CHAR(reverse_expression)
      FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('new_model'));

ATTRIBUTE_NAME      TO_CHAR(EXPRESSION)       TO_CHAR(REVERSE_EXPRESSION)
------------------  ------------------------  ----------------------------- 
CUST_YEAR_OF_BIRTH  cust_year_of_birth + 10   cust_year_of_birth - 10

GET_TRANSFORM_LIST Procedure

This procedure converts transformation expressions specified as DM_TRANSFORMS to a transformation list (TRANSFORM_LIST) that can be used in creating a model. DM_TRANSFORMS is returned by the GET_MODEL_TRANSFORMATIONS function.

You can also use routines in the DBMS_DATA_MINING_TRANSFORM package to construct a transformation list.

Syntax

DBMS_DATA_MINING.GET_TRANSFORM_LIST (
      xform_list           OUT NOCOPY TRANSFORM_LIST,
      model_xforms         IN  DM_TRANSFORMS);

Parameters

Table 45-85 GET_TRANSFORM_LIST Procedure Parameters

Parameter Description

xform_list

A list of transformation specifications that can be embedded in a model. Accepted as a parameter to the CREATE_MODEL Procedure.

The TRANSFORM_LIST type is a table of TRANSFORM_REC objects. Each TRANSFORM_REC has these fields:

attribute_name      VARCHAR2(30)
attribute_subname   VARCHAR2(4000)
expression          EXPRESSION_REC
reverse_expression  EXPRESSION_REC
attribute_spec      VARCHAR2(4000)

For details about the TRANSFORM_LIST collection type, see Table 46-1, "Datatypes in DBMS_DATA_MINING_TRANSFORM".

model_xforms

A list of embedded transformation expressions returned by the GET_MODEL_TRANSFORMATIONS Function for a specific model.

The DM_TRANSFORMS type is a table of DM_TRANSFORM objects. Each DM_TRANSFORM has these fields:

attribute_name       VARCHAR2(4000)
attribute_subname    VARCHAR2(4000)
expression           CLOB
reverse_expression   CLOB

Examples

In this example, a model mod1 is trained using several columns in the SH.CUSTOMERS table. The model uses ADP, which automatically bins one of the columns.

A second model mod2 is trained on the same data without ADP, but it uses a transformation list that was obtained from mod1. As a result, both mod1 and mod2 have the same embedded transformation expression.

CREATE OR REPLACE VIEW mining_data AS
     SELECT cust_id, cust_year_of_birth, cust_income_level, cust_credit_limit
     FROM sh.customers;
 
describe mining_data
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                   NOT NULL NUMBER
 CUST_YEAR_OF_BIRTH                        NOT NULL NUMBER(4)
 CUST_INCOME_LEVEL                                  VARCHAR2(30)
 CUST_CREDIT_LIMIT                                  NUMBER

CREATE TABLE setmod1(setting_name  VARCHAR2(30),setting_value VARCHAR2(30));
BEGIN
   INSERT INTO setmod1 VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);
   INSERT INTO setmod1 VALUES (dbms_data_mining.prep_auto,dbms_data_mining.prep_auto_on);
   dbms_data_mining.CREATE_MODEL (
               model_name            => 'mod1',
               mining_function       => dbms_data_mining.classification,
               data_table_name       => 'mining_data',
               case_id_column_name   => 'cust_id',
               target_column_name    => 'cust_income_level',
               settings_table_name   => 'setmod1');
    COMMIT;
END;
/
CREATE TABLE setmod2(setting_name  VARCHAR2(30),setting_value VARCHAR2(30));
BEGIN
  INSERT INTO setmod2
      VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);
  COMMIT;
END;
/
DECLARE
  v_xform_list       dbms_data_mining_transform.TRANSFORM_LIST;
  dmxf               DM_TRANSFORMS;
BEGIN
   EXECUTE IMMEDIATE
    'SELECT dm_transform(attribute_name, attribute_subname,expression, reverse_expression)
     FROM TABLE(dbms_data_mining.GET_MODEL_TRANSFORMATIONS (''mod1''))'
     BULK COLLECT INTO dmxf;
   dbms_data_mining.GET_TRANSFORM_LIST (
        xform_list             =>  v_xform_list,
        model_xforms           =>  dmxf);
   dbms_data_mining.CREATE_MODEL(
         model_name            => 'mod2',
         mining_function       =>  dbms_data_mining.classification,
         data_table_name       => 'mining_data',
         case_id_column_name   => 'cust_id',
         target_column_name    => 'cust_income_level',
         settings_table_name   => 'setmod2',
         xform_list            =>  v_xform_list);
END;
/

-- Transformation expression embedded in mod1
SELECT TO_CHAR(expression) FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('mod1'));

TO_CHAR(EXPRESSION)
--------------------------------------------------------------------------------
CASE WHEN "CUST_YEAR_OF_BIRTH"<1915 THEN 0 WHEN "CUST_YEAR_OF_BIRTH"<=1915 THEN 0
WHEN "CUST_YEAR_OF_BIRTH"<=1920.5 THEN 1 WHEN "CUST_YEAR_OF_BIRTH"<=1924.5 THEN 2
.
.
.
.5 THEN 29 WHEN "CUST_YEAR_OF_BIRTH" IS NOT NULL THEN 30 END

-- Transformation expression embedded in mod2
SELECT TO_CHAR(expression) FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('mod2'));

TO_CHAR(EXPRESSION)
--------------------------------------------------------------------------------
CASE WHEN "CUST_YEAR_OF_BIRTH"<1915 THEN 0 WHEN "CUST_YEAR_OF_BIRTH"<=1915 THEN 0
WHEN "CUST_YEAR_OF_BIRTH"<=1920.5 THEN 1 WHEN "CUST_YEAR_OF_BIRTH"<=1924.5 THEN 2
.
.
.
.5 THEN 29 WHEN "CUST_YEAR_OF_BIRTH" IS NOT NULL THEN 30 END

-- Reverse transformation expression embedded in mod1
SELECT TO_CHAR(reverse_expression)FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('mod1'));

TO_CHAR(REVERSE_EXPRESSION)
--------------------------------------------------------------------------------
DECODE("CUST_YEAR_OF_BIRTH",0,'( ; 1915), [1915; 1915]',1,'(1915; 1920.5]',2,'(1
920.5; 1924.5]',3,'(1924.5; 1928.5]',4,'(1928.5; 1932.5]',5,'(1932.5; 1936.5]',6
.
.
.
8,'(1987.5; 1988.5]',29,'(1988.5; 1989.5]',30,'(1989.5;  )',NULL,'NULL')
 
-- Reverse transformation expression embedded in mod2
SELECT TO_CHAR(reverse_expression) FROM TABLE (dbms_data_mining.GET_MODEL_TRANSFORMATIONS('mod2'));
      
TO_CHAR(REVERSE_EXPRESSION)
--------------------------------------------------------------------------------
DECODE("CUST_YEAR_OF_BIRTH",0,'( ; 1915), [1915; 1915]',1,'(1915; 1920.5]',2,'(1
920.5; 1924.5]',3,'(1924.5; 1928.5]',4,'(1928.5; 1932.5]',5,'(1932.5; 1936.5]',6
.
.
.
8,'(1987.5; 1988.5]',29,'(1988.5; 1989.5]',30,'(1989.5;  )',NULL,'NULL')

IMPORT_MODEL Procedure

This procedure imports one or more data mining models. The procedure is overloaded. You can call it to import mining models from a dump file set, or you can call it to import a single mining model from a PMML document.

Import from a dump file set

You can import mining models from a dump file set that was created by the EXPORT_MODEL Procedure. IMPORT_MODEL and EXPORT_MODEL use Oracle Data Pump technology to export to and import from a dump file set.

When Oracle Data Pump is used directly to export/import an entire schema or database, the mining models in the schema or database are included. EXPORT_MODEL and IMPORT_MODEL export/import mining models only.

Import from PMML

You can import a mining model represented in Predictive Model Markup Language (PMML). The model must be of type RegressionModel, either linear regression or binary logistic regression.

PMML is an XML-based standard specified by the Data Mining Group (http://www.dmg.org). Applications that are PMML-compliant can deploy PMML-compliant models that were created by any vendor. Oracle Data Mining supports the core features of PMML 3.1 for regression models.

See Also:

Oracle Data Mining User's Guide for more information about exporting and importing mining models

Oracle Database Utilities for information about Oracle Data Pump

http://www.dmg.org/faq.html for more information about PMML

Syntax

Imports a mining model from a dump file set:

DBMS_DATA_MINING.IMPORT_MODEL (
      filename          IN  VARCHAR2,
      directory         IN  VARCHAR2,
      model_filter      IN  VARCHAR2 DEFAULT NULL,
      operation         IN  VARCHAR2 DEFAULT NULL,
      remote_link       IN  VARCHAR2 DEFAULT NULL,
      jobname           IN  VARCHAR2 DEFAULT NULL,
      schema_remap      IN  VARCHAR2 DEFAULT NULL,
      tablespace_remap  IN  VARCHAR2 DEFAULT NULL);

Imports a mining model from a PMML document:

DBMS_DATA_MINING.IMPORT_MODEL (
      model_name        IN  VARCHAR2,
      pmmldoc           IN  XMLTYPE
      strict_check      IN  BOOLEAN DEFAULT FALSE);

Parameters

Table 45-86 IMPORT_MODEL Procedure Parameters

Parameter Description

filename

Name of the dump file set from which the models should be imported. The dump file set must have been created by the EXPORT_MODEL procedure or the expdp export utility of Oracle Data Pump.

The dump file set can contain one or more files. (Refer to "EXPORT_MODEL Procedure" for details.) If the dump file set contains multiple files, you can specify 'filename%U' instead of listing them. For example, if your dump file set contains 3 files, archive01.dmp, archive02.dmp, and archive03.dmp, you can import them by specifying 'archive%U'.

directory

Name of a pre-defined directory object that specifies where the dump file set is located. Both the exporting and the importing user must have read/write access to the directory object and to the file system directory that it identifies.

Note: The target database must have also have read/write access to the file system directory.

model_filter

Optional parameter that specifies one or more models to import. If you do not specify a value for model_filter, all models in the dump file set are imported. You can also specify NULL (the default) or 'ALL' to import all models.

The value of model_filter can be one or more model names. The following are valid filters.

'mymodel1'
'name IN (''mymodel2'',''mymodel3'')'

The first causes IMPORT_MODEL to import a single model named mymodel1. The second causes IMPORT_MODEL to import two models, mymodel2 and mymodel3.

operation

Optional parameter that specifies whether to import the models or the SQL statements that create the models. By default, the models are imported.

You can specify either of the following values for operation:

  • 'IMPORT' — Import the models (Default)

  • 'SQL_FILE'— Write the SQL DDL for creating the models to a text file. The text file is named job_name.sql and is located in the dump set directory.

remote_link

Optional parameter that specifies the name of a database link to a remote system. The default value is NULL. A database link is a schema object in a local database that enables access to objects in a remote database. When you specify a value for remote_link, you can import models into the local database from the remote database. The import is fileless; no dump file is involved. The IMP_FULL_DATABASE role is required for importing the remote models. The EXP_FULL_DATABASE privilege, the CREATE DATABASE LINK privilege, and other privileges may also be required. (See Example 2.)

jobname

Optional parameter that specifies the name of the import job. By default, the name has the form username_imp_nnnn, where nnnn is a number. For example, a job name in the SCOTT schema might be SCOTT_imp_134.

If you specify a job name, it must be unique within the schema. The maximum length of the job name is 30 characters.

A log file for the import job, named jobname.log, is created in the same directory as the dump file set.

schema_remap

Optional parameter for importing into a different schema. By default, models are exported and imported within the same schema.

If the dump file set belongs to a different schema, you must specify a schema mapping in the form export_user:import_user. For example, you would specify 'SCOTT:MARY' to import a model exported by SCOTT into the MARY schema.

Note: In some cases, you may need to have the IMP_FULL_DATABASE privilege or the SYS role to import a model from a different schema.

tablespace_remap

Optional parameter for importing into a different tablespace. By default, models are exported and imported within the same tablespace.

If the dump file set belongs to a different tablespace, you must specify a tablespace mapping in the form export_tablespace:import_tablespace. For example, you would specify 'TBLSPC01:TBLSPC02' to import a model that was exported from tablespace TBLSPC01 into tablespace TBLSPC02.

Note: In some cases, you may need to have the IMP_FULL_DATABASE privilege or the SYS role to import a model from a different tablespace.

model_name

Name for the new model that will be created in the database as a result of an import from PMML The name must be unique within the user's schema.

pmmldoc

The PMML document representing the model to be imported. The PMML document has an XMLTYPE object type. See Chapter 285, "XMLTYPE" for details.

strict_check

Whether or not an error occurs when the PMML document contains sections that are not part of core PMML (for example, Output or Targets). Oracle Data Mining supports only core PMML; any non-core features may affect the scoring representation.

If the PMML does not strictly conform to core PMML and strict_check is set to TRUE, then IMPORT_MODEL returns an error. If strict_check is FALSE (the default), then the error is suppressed. The model may be imported and scored.


Examples

  1. This example shows a model being exported and imported within the schema dmuser2. Then the same model is imported into the dmuser3 schema. The dmuser3 user has the IMP_FULL_DATABASE privilege. The dmuser2 user has been assigned the USER2 tablespace; dmuser3 has been assigned the USER3 tablespace.

    SQL> connect dmuser2
    Enter password: dmuser2_password
    Connected.
    SQL> select model_name from user_mining_models;
     
    MODEL_NAME
    ------------------------------
    NMF_SH_SAMPLE
    SVMO_SH_CLAS_SAMPLE
    SVMR_SH_REGR_SAMPLE
    
    -- export the model called NMF_SH_SAMPLE to a dump file in same schema
    SQL>EXECUTE DBMS_DATA_MINING.EXPORT_MODEL (
                filename =>'NMF_SH_SAMPLE_out', 
                directory =>'DATA_PUMP_DIR', 
                model_filter => 'name = ''NMF_SH_SAMPLE''');
    
    -- import the model back into the same schema
    SQL>EXECUTE DBMS_DATA_MINING.IMPORT_MODEL (
                filename => 'NMF_SH_SAMPLE_out01.dmp',
                directory => 'DATA_PUMP_DIR', 
                model_filter => 'name = ''NMF_SH_SAMPLE''');
    
    -- connect as different user
    -- import same model into that schema
    SQL> connect dmuser3
    Enter password: dmuser3_password
    Connected.
    SQL>EXECUTE DBMS_DATA_MINING.IMPORT_MODEL (
                filename => 'NMF_SH_SAMPLE_out01.dmp', 
                directory => 'DATA_PUMP_DIR', 
                model_filter => 'name = ''NMF_SH_SAMPLE''',
                operation =>'IMPORT',  
                remote_link => NULL,
                jobname => 'nmf_imp_job',
                schema_remap => 'dmuser2:dmuser3',
                tablespace_remap => 'USER2:USER3');
    

    The following example shows user MARY importing all models from a dump file, model_exp_001.dmp, which was created by user SCOTT. User MARY has been assigned a tablespace named USER2; user SCOTT was assigned the tablespace USERS when the models were exported into the dump file model_exp_001.dmp.The dump file is located in the file system directory mapped to a directory object called DM_DUMP. If user MARY does not have IMP_FULL_DATABASE privileges, IMPORT_MODEL will raise an error.

    -- import all models
    DECLARE
      file_name  VARCHAR2(40);
    BEGIN
      file_name := 'model_exp_001.dmp';
      DBMS_DATA_MINING.IMPORT_MODEL(
                filename=> 'file_name',
                directory=>'DM_DUMP',
                schema_remap=>'SCOTT:MARY', 
                tablespace_remap=>'USERS:USER2');
      DBMS_OUTPUT.PUT_LINE(
               'DBMS_DATA_MINING.IMPORT_MODEL of all models from SCOTT done!');
    END;
    /
    
  2. This example shows how the user xuser could import the model dmuser.r1mod from a remote database. The SQL*Net connection alias for the remote database is R1DB. The user xuser is assigned the SYSAUX tablespace; the user dmuser is assigned the TBS_1 tablespace.

    CONNECT / AS SYSDBA;
    GRANT CREATE DATABASE LINK TO xuser; 
    GRANT imp_full_database TO xuser;
    CONNECT xuser/xuserpassword 
    CREATE DATABASE LINK dmuser_link 
             CONNECT TO dmuser IDENTIFIED BY dmuserpassword USING 'R1DB';
    EXEC dbms_data_mining.import_model (
        NULL, 
       'DMUSER_DIR', 
       'R1MOD',
        remote_link => 'DMUSER_LINK', schema_remap => 'DMUSER:XUSER', 
                        tablespace_remap => 'TBS_1:SYSAUX' );
    SELECT name FROM dm_user_models;
     
    NAME
    -----------------------------------------------------------------------------
    R1MOD
    
  3. This example shows how a PMML document called SamplePMML1.xml could be imported from a location referenced by directory object PMMLDIR into the schema of the current user. The imported model will be called PMMLMODEL1.

    BEGIN    
        dbms_data_mining.import_model ('PMMLMODEL1',
            XMLType (bfilename ('PMMLDIR', 'SamplePMML1.xml'),
              nls_charset_id ('AL32UTF8')
            ));
    END;
    

RANK_APPLY Procedure

This procedure ranks the results of an APPLY operation based on a top-N specification for predictive and descriptive model results. For classification models, you can provide a cost matrix as input, and obtain the ranked results with costs applied to the predictions.

Syntax

DBMS_DATA_MINING.RANK_APPLY (
      apply_result_table_name        IN VARCHAR2,
      case_id_column_name            IN VARCHAR2,
      score_column_name              IN VARCHAR2,
      score_criterion_column_name    IN VARCHAR2,
      ranked_apply_table_name        IN VARCHAR2,
      top_N                          IN NUMBER (38) DEFAULT 1,
      cost_matrix_table_name         IN VARCHAR2    DEFAULT NULL,
      apply_result_schema_name       IN VARCHAR2    DEFAULT NULL,
      cost_matrix_schema_name        IN VARCHAR2    DEFAULT NULL);

Parameters

Table 45-87 RANK_APPLY Procedure Parameters

Parameter Description

apply_result_table_name

Name of the table or view containing the results of an APPLY operation on the test data set (see Usage Notes)

case_id_column_name

Name of the case identifier column. This must be the same as the one used for generating APPLY results.

score_column_name

Name of the prediction column in the apply results table

score_criterion_column_name

Name of the probability column in the apply results table

ranked_apply_result_tab_name

Name of the table containing the ranked apply results

top_N

Top N predictions to be considered from the APPLY results for precision recall computation

cost_matrix_table_name

Name of the cost matrix table

apply_result_schema_name

Name of the schema hosting the APPLY results table

cost_matrix_schema_name

Name of the schema hosting the cost matrix table


Usage Notes

You can use RANK_APPLY to generate ranked apply results, based on a top-N filter and also with application of cost for predictions, if the model was built with costs.

The behavior of RANK_APPLY is similar to that of APPLY with respect to other DDL-like operations such as CREATE_MODEL, DROP_MODEL, and RENAME_MODEL. The procedure does not depend on the model; the only input of relevance is the apply results generated in a fixed schema table from APPLY.

The main intended use of RANK_APPLY is for the generation of the final APPLY results against the scoring data in a production setting. You can apply the model against test data using APPLY, compute various test metrics against various cost matrix tables, and use the candidate cost matrix for RANK_APPLY.

The schema for the apply results from each of the supported algorithms is listed in subsequent sections. The case_id column will be the same case identifier column as that of the apply results.

Classification Models — NB and SVM

For numerical targets, the ranked results table will have the definition as shown:

(case_id       VARCHAR2/NUMBER,
prediction     NUMBER,
probability    NUMBER,
cost           NUMBER,
rank           INTEGER)

For categorical targets, the ranked results table will have the following definition:

(case_id       VARCHAR2/NUMBER,
prediction     VARCHAR2,
probability    NUMBER,
cost           NUMBER,
rank           INTEGER)

Clustering using k-Means or O-Cluster

Clustering is an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contains simply the cluster identifier corresponding to a case, and the associated probability. Cost matrix is not considered here. The ranked results table will have the definition as shown, and contains the cluster ids ranked by top-N.

(case_id       VARCHAR2/NUMBER,
cluster_id     NUMBER,
probability    NUMBER,
rank           INTEGER)

Feature Extraction using NMF

Feature extraction is also an unsupervised mining function, and hence there are no targets. The results of an APPLY operation contains simply the feature identifier corresponding to a case, and the associated match quality. Cost matrix is not considered here. The ranked results table will have the definition as shown, and contains the feature ids ranked by top-N.

(case_id        VARCHAR2/NUMBER,
feature_id      NUMBER,
match_quality   NUMBER,
rank            INTEGER)

Examples

BEGIN
/* build a model with name census_model.
 * (See example under CREATE_MODEL)
 */ 

/* if training data was pre-processed in any manner,
 * perform the same pre-processing steps on apply
 * data also.
 * (See examples in the section on DBMS_DATA_MINING_TRANSFORM)
 */

/* apply the model to data to be scored */
DBMS_DATA_MINING.RANK_APPLY(
  apply_result_table_name       => 'census_apply_result',
  case_id_column_name           => 'person_id',
  score_column_name             => 'prediction',
  score_criterion_column_name   => 'probability
  ranked_apply_result_tab_name  => 'census_ranked_apply_result',
  top_N                         => 3,
  cost_matrix_table_name        => 'census_cost_matrix');
END;
/

-- View Ranked Apply Results
SELECT *
  FROM census_ranked_apply_result;

REMOVE_COST_MATRIX Procedure

Removes the default scoring matrix from a classification model.

Syntax

DBMS_DATA_MINING.REMOVE_COST_MATRIX (
      model_name   IN  VARCHAR2);

Parameters

Table 45-88 Remove_Cost_Matrix Procedure Parameters

Parameter Description

model_name

Name of the model in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used.


Usage Notes

If the model is not in your schema, then REMOVE_COST_MATRIX requires the ALTER ANY MINING MODEL system privilege or the ALTER object privilege for the mining model.

Example

The Naive Bayes model NB_SH_CLAS_SAMPLE has an associated cost matrix that can be used for scoring the model.

SQL>SELECT *
      FROM TABLE(dbms_data_mining.get_model_cost_matrix('nb_sh_clas_sample'))
      ORDER BY predicted, actual;
 
ACTUAL     PREDICTED        COST
---------- ---------- ----------
0          0                   0
1          0                 .75
0          1                 .25
1          1                   0

You can remove the cost matrix with REMOVE_COST_MATRIX.

SQL>EXECUTE dbms_data_mining.remove_cost_matrix('nb_sh_clas_sample');

SQL>SELECT *
      FROM TABLE(dbms_data_mining.get_model_cost_matrix('nb_sh_clas_sample'))
      ORDER BY predicted, actual;

no rows selected

RENAME_MODEL Procedure

This procedure changes the name of the mining model indicated by model_name to the name that you specify as new_model_name. If a model with new_model_name already exists, then the procedure optionally renames new_model_name to versioned_model_name before renaming model_name to new_model_name.

The model name is in the form [schema_name.]model_name. If you do not specify a schema, your own schema is used. For mining model naming restrictions, see the Usage Notes for "CREATE_MODEL Procedure".

Syntax

DBMS_DATA_MINING.RENAME_MODEL (
     model_name            IN VARCHAR2,
     new_model_name        IN VARCHAR2,
     versioned_model_name  IN VARCHAR2 DEFAULT NULL);

Parameters

Table 45-89 RENAME_MODEL Procedure Parameters

Parameter Description

model_name

Model to be renamed.

new_model_name

New name for the model model_name.

versioned_model_name

New name for the model new_model_name if it already exists.


Usage Notes

If you attempt to rename a model while it is being applied, then the model will be renamed but the apply operation will return indeterminate results.

Examples

  1. This example changes the name of model census_model to census_model_2012.

    BEGIN
      DBMS_DATA_MINING.RENAME_MODEL(
        model_name      => 'census_model',
        new_model_name  => 'census_model_2012');
    END;
    /
    
  2. In this example, there are two classification models in the user's schema: clas_mod, the working model, and clas_mod_tst, a test model. The RENAME_MODEL procedure preserves clas_mod as clas_mod_old and makes the test model the new working model.

    SELECT model_name FROM user_mining_models;
    MODEL_NAME
    -------------------------------------------------------------------
    CLAS_MOD
    CLAS_MOD_TST
    
    BEGIN
      DBMS_DATA_MINING.RENAME_MODEL(
        model_name            => 'clas_mod_tst',
        new_model_name        => 'clas_mod',
        versioned_model_name  => 'clas_mod_old');
    END;
    /
    
    SELECT model_name FROM user_mining_models;
    MODEL_NAME
    -------------------------------------------------------------------
    CLAS_MOD
    CLAS_MOD_OLD