Skip Headers
Oracle® Data Mining User's Guide
12c Release 1 (12.1)

E17693-15
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

5 Creating a Model

This chapter explains how to create data mining models and query model details. This chapter contains the following topics:

Before Creating a Model

As described in "About Mining Models", models are database schema objects that perform data mining. The DBMS_DATA_MINING PL/SQL package is the API for creating, configuring, evaluating, and querying mining models (model details). See "DBMS_DATA_MINING" for an overview.

Before you create a model, you must decide what you want the model to do. You must identify the training data and determine if transformations are required. You may specify model settings to influence the behavior of the model behavior. The preparation steps are summarized in Table 5-1.

Table 5-1 Preparation for Creating a Mining Model

Preparation Step Description

Choose the mining function

See "Choosing the Mining Function"

Choose the algorithm

See "Choosing the Algorithm"

Identify the build (training) data

See "Preparing the Data"

For classification models, identify the test data

See "Data Sets for Classification and Regression"

Determine your data transformation strategy

See "Transforming the Data"

Create and populate a settings tables (if needed)

See "Specifying Model Settings"


The CREATE_MODEL Procedure

The CREATE_MODEL procedure in the DBMS_DATA_MINING package uses the specified data to create a mining model with the specified name and mining function. The model may be created with configuration settings and user-specified transformations.

PROCEDURE CREATE_MODEL(
                  model_name            IN VARCHAR2,
                  mining_function       IN VARCHAR2,
                  data_table_name       IN VARCHAR2,
                  case_id_column_name   IN VARCHAR2,
                  target_column_name    IN VARCHAR2 DEFAULT NULL,
                  settings_table_name   IN VARCHAR2 DEFAULT NULL,
                  data_schema_name      IN VARCHAR2 DEFAULT NULL,
                  settings_schema_name  IN VARCHAR2 DEFAULT NULL,
                  xform_list            IN TRANSFORM_LIST DEFAULT NULL);

Choosing the Mining Function

The mining function is a required argument to the CREATE_MODEL procedure. A data mining function specifies a class of problems that can be modeled and solved.

Data mining functions implement either supervised or unsupervised learning. Supervised learning uses a set of independent attributes to predict the value of a dependent attribute or target. Unsupervised learning does not distinguish between dependent and independent attributes. Supervised functions are predictive. Unsupervised functions are descriptive.

Note on Terminology:

In data mining terminology, a function is a general type of problem to be solved by a given approach to data mining. In SQL language terminology, a function is an operator that returns a value.

In Oracle Data Mining documentation, the term function, or mining function refers to a data mining function; the term SQL function or SQL Data Mining function refers to a SQL function for scoring (applying data mining models).

You can specify any of the values in Table 5-2 for the mining_function parameter to CREATE_MODEL.

Table 5-2 Mining Model Functions

Mining_Function Value Description

ASSOCIATION

Association is a descriptive mining function. An association model identifies relationships and the probability of their occurrence within a data set. (association rules)

Association models use the Apriori algorithm.

ATTRIBUTE_IMPORTANCE

Attribute Importance is a predictive mining function. An attribute importance model identifies the relative importance of attributes in predicting a given outcome.

Attribute Importance models use the Minimum Description Length algorithm.

CLASSIFICATION

Classification is a predictive mining function. A classification model uses historical data to predict a categorical target.

Classification models can use Naive Bayes, Decision Tree, Logistic Regression, or Support Vector Machines. The default is Naive Bayes.

The classification function can also be used for anomaly detection. In this case, the SVM algorithm with a null target is used (One-Class SVM).

CLUSTERING

Clustering is a descriptive mining function. A clustering model identifies natural groupings within a data set.

Clustering models can use k-Means, O-Cluster, or Expectation Maximization. The default is k-Means.

FEATURE_EXTRACTION

Feature Extraction is a descriptive mining function. A feature extraction model creates a set of optimized attributes.

Feature extraction models can use Non-Negative Matrix Factorization or Singular Value Decomposition (which can also be used for Principal Component Analysis). The default is Non-Negative Matrix Factorization.

REGRESSION

Regression is a predictive mining function. A regression model uses historical data to predict a numerical target.

Regression models can use Support Vector Machines or Linear Regression. The default is Support Vector Machine.


See Also:

Oracle Data Mining Concepts for an introduction to mining functions

Choosing the Algorithm

The ALGO_NAME setting specifies the algorithm for a model. If you use the default algorithm for the mining function, or if there is only one algorithm available for the mining function, you do not need to specify the ALGO_NAME setting. Instructions for specifying model settings are in "Specifying Model Settings".

Table 5-3 Data Mining Algorithms

ALGO_NAME Value Algorithm Default? Mining Model Function

ALGO_AI_MDL

Minimum Description Length

attribute importance

ALGO_APRIORI_ASSOCIATION_RULES

Apriori

association

ALGO_DECISION_TREE

Decision Tree

classification

ALGO_EXPECTATION_MAXIMIZATION

Expectation Maximization

   

ALGO_GENERALIZED_LINEAR_MODEL

Generalized Linear Model

classification and regression

ALGO_KMEANS

k-Means

yes

clustering

ALGO_NAIVE_BAYES

Naive Bayes

yes

classification

ALGO_NONNEGATIVE_MATRIX_FACTOR

Non-Negative Matrix Factorization

yes

feature extraction

ALGO_O_CLUSTER

O-Cluster

clustering

ALGO_SINGULAR_VALUE_DECOMP

Singular Value Decomposition (can also be used for Principal Component Analysis)

feature extraction

ALGO_SUPPORT_VECTOR_MACHINES

Support Vector Machine

yes

default regression algorithm

regression, classification, and anomaly detection (classification with no target)


See Also:

Oracle Data Mining Concepts for an introduction to the algorithms supported by Oracle Data Mining

Supplying Transformations

You can optionally specify transformations for the build data in the xform_list parameter to CREATE_MODEL. The transformation instructions are embedded in the model and reapplied whenever the model is applied to new data.

Creating a Transformation List

There are two ways to create a transformation list:

  • The STACK interface in DBMS_DATA_MINING_TRANSFORM.

    The STACK interface offers a set of pre-defined transformations that you can apply to an attribute or to a group of attributes. For example, you could specify supervised binning for all categorical attributes.

  • The SET_TRANSFORM procedure in DBMS_DATA_MINING_TRANSFORM.

    The SET_TRANSFORM procedure applies a specified SQL expression to a specified attribute. For example, the following statement appends a transformation instruction for country_id to a list of transformations called my_xforms. The transformation instruction divides country_id by 10 before algorithmic processing begins. The reverse transformation multiplies country_id by 10.

      dbms_data_mining_transform.SET_TRANSFORM (my_xforms,
         'country_id', NULL, 'country_id/10', 'country_id*10');
    

    The reverse transformation is applied in the model details. If country_id is the target of a supervised model, the reverse transformation is also applied to the scored target.

Transformation List and Automatic Data Preparation

The transformation list argument to CREATE_MODEL interacts with the PREP_AUTO setting, which controls Automatic Data Preparation (ADP):

  • When ADP is on and you specify a transformation list, your transformations are applied with the automatic transformations and embedded in the model. The transformations that you specify are executed before the automatic transformations.

  • When ADP is off and you specify a transformation list, your transformations are applied and embedded in the model, but no system-generated transformations are performed.

  • When ADP is on and you do not specify a transformation list, the system-generated transformations are applied and embedded in the model.

  • When ADP is off and you do not specify a transformation list, no transformations are embedded in the model; you must separately prepare the data sets you use for building, testing, and scoring the model.

See Also:

Specifying Model Settings

Numerous configuration settings are available for configuring data mining models at build time. To specify settings, create a settings table with the columns shown in Table 5-4 and pass the table to CREATE_MODEL.

Table 5-4 Settings Table Required Columns

Column Name Data Type

setting_name

VARCHAR2(30)

setting_value

VARCHAR2(4000)


Example 5-1 creates a settings table for an SVM classification model. Since SVM is not the default classifier, the ALGO_NAME setting is used to specify the algorithm. Setting the SVMS_KERNEL_FUNCTION to SVMS_LINEAR causes the model to be built with a linear kernel. If you do not specify the kernel function, the algorithm chooses the kernel based on the number of attributes in the data.

Example 5-1 Creating a Settings Table for an SVM Classification Model

CREATE TABLE svmc_sh_sample_settings (
  setting_name VARCHAR2(30),
  setting_value VARCHAR2(4000));

BEGIN 
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.algo_name, dbms_data_mining.algo_support_vector_machines);
  INSERT INTO svmc_sh_sample_settings (setting_name, setting_value) VALUES
    (dbms_data_mining.svms_kernel_function, dbms_data_mining.svms_linear);
  COMMIT;
END;
/

Some settings apply generally to the model, others are specific to an algorithm. Model settings are documented in Oracle Database PL/SQL Packages and Types Reference and referenced in Table 5-5 and Table 5-6.

Table 5-5 General Model Settings

Settings Description

Mining function settings

See "Mining Function Settings" in Oracle Database PL/SQL Packages and Types Reference

Algorithm names

See "Algorithm Names" in Oracle Database PL/SQL Packages and Types Reference

Global model characteristics

See "Global Settings" in Oracle Database PL/SQL Packages and Types Reference

Automatic Data Preparation

See "Automatic Data Preparation" in Oracle Database PL/SQL Packages and Types Reference


Table 5-6 Algorithm-Specific Model Settings

Algorithm Description

Decision Tree

See "Algorithm Settings: Decision Tree" in Oracle Database PL/SQL Packages and Types Reference

Expectation Maximization

See "Algorithm Settings: Expectation Maximization" in Oracle Database PL/SQL Packages and Types Reference

Generalized Linear Models

See "Algorithm Settings: Generalized Linear Models" in Oracle Database PL/SQL Packages and Types Reference

k-Means

See "Algorithm Settings: k-Means" in Oracle Database PL/SQL Packages and Types Reference

Naive Bayes

See "Algorithm Settings: Naive Bayes" in Oracle Database PL/SQL Packages and Types Reference

Non-Negative Matrix Factorization

See "Algorithm Settings: Non-Negative Matrix Factorization" in Oracle Database PL/SQL Packages and Types Reference

O-Cluster

See "Algorithm Settings: O-Cluster" in Oracle Database PL/SQL Packages and Types Reference

Singular Value Decomposition

See "Algorithm Settings: Singular Value Decomposition" in Oracle Database PL/SQL Packages and Types Reference

Support Vector Machine

See "Algorithm Settings: Support Vector Machine" in Oracle Database PL/SQL Packages and Types Reference


Specifying Costs

The CLAS_COST_TABLE_NAME setting specifies the name of a cost matrix table to be used in building a Decision Tree model. A cost matrix biases a classification model to minimize costly misclassifications. The cost matrix table must have the columns shown in Table 5-7.

Table 5-7 Cost Matrix Table Required Columns

Column Name Data Type

actual_target_value

valid target data type

predicted_target_value

valid target data type

cost

NUMBER


Note: Valid target data types are listed in Table 3-1.

Decision Tree is the only algorithm that supports a cost matrix at build time. However, you can create a cost matrix and associate it with any classification model for scoring.

If you want to use costs for scoring, create a table with the columns shown in Table 5-7, and use the DBMS_DATA_MINING.ADD_COST_MATRIX procedure to add the cost matrix table to the model. You can also specify a cost matrix inline when invoking a PREDICTION function.

See Also:

Oracle Data Mining Concepts for information about costs

Specifying Prior Probabilities

The CLAS_PRIORS_TABLE_NAME setting specifies the name of a table of prior probabilities to be used in building a Naive Bayes model. Prior probabilities can be used to offset differences in distribution between the build data and the actual population. The priors table must have the columns shown in Table 5-8.

Table 5-8 Priors Table Required Columns

Column Name Data Type

target_value

valid target data type

prior_probability

NUMBER


See Also:

Specifying Class Weights

The CLAS_WEIGHTS_TABLE_NAME setting specifies the name of a table of class weights to be used to bias a logistic regression (GLM classification) or SVM classification model to favor higher weighted classes. The weights table must have the columns shown in Table 5-9.

Table 5-9 Class Weights Table Required Columns

Column Name Data Type

target_value

valid target data type

class_weight

NUMBER


See Also:

Model Settings in the Data Dictionary

Information about mining model settings can be obtained from the data dictionary view ALL/USER/DBA_MINING_MODEL_SETTINGS. When used with the ALL prefix, this view returns information about the settings for the models accessible to the current user. When used with the USER prefix, it returns information about the settings for the models in the user's schema. The DBA prefix is only available for DBAs.

The columns of ALL_MINING_MODEL_SETTINGS are described as follows and explained in Table 5-10.

SQL> describe all_mining_model_settings
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 OWNER                                     NOT NULL VARCHAR2(30)
 MODEL_NAME                                NOT NULL VARCHAR2(30)
 SETTING_NAME                              NOT NULL VARCHAR2(30)
 SETTING_VALUE                                      VARCHAR2(4000)
 SETTING_TYPE                                       VARCHAR2(7)

Table 5-10 ALL_MINING_MODEL_SETTINGS

Column Description

owner

Owner of the mining model.

model_name

Name of the mining model.

setting_name

Name of the setting.

setting_value

Value of the setting.

setting_type

INPUT if the value is specified by a user. DEFAULT if the value is system-generated.


The following query lists the settings for the SVM classification model SVMC_SH_CLAS_SAMPLE. The ALGO_NAME, CLAS_WEIGHTS_TABLE_NAME, and SVMS_KERNEL_FUNCTION settings are user-specified. These settings have been specified in a settings table for the model.

Example 5-2 ALL_MINING_MODEL_SETTINGS

SQL> COLUMN setting_value FORMAT A35
SQL> SELECT setting_name, setting_value, setting_type
            FROM all_mining_model_settings
            WHERE model_name in 'SVMC_SH_CLAS_SAMPLE';
 
SETTING_NAME                   SETTING_VALUE                       SETTING
------------------------------ ----------------------------------- -------
SVMS_ACTIVE_LEARNING           SVMS_AL_ENABLE                      DEFAULT
PREP_AUTO                      OFF                                 DEFAULT
SVMS_COMPLEXITY_FACTOR         0.244212                            DEFAULT
SVMS_KERNEL_FUNCTION           SVMS_LINEAR                         INPUT
CLAS_WEIGHTS_TABLE_NAME        svmc_sh_sample_class_wt             INPUT
SVMS_CONV_TOLERANCE            .001                                DEFAULT
ALGO_NAME                      ALGO_SUPPORT_VECTOR_MACHINES        INPUT

See Also:

Oracle Database PL/SQL Packages and Types Reference for details about model settings

Viewing Model Details

Model details describe model attributes, rules, statistics, and other information about the model. The DBMS_DATA_MINING package supports a separate GET_MODEL_DETAILS function for each algorithm. Global details are also available for Generalized Linear Models, Expectation Maximization, Singular Value Decompostion, and Association Rules.

Model details reverse the transformations applied to the attributes, thus enabling the information to be easily understood by a user. You can obtain the transformations embedded in the model by invoking the DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS function.

The query in Example 5-3 returns the coefficients for several attribute values in a GLM regression model called GLMR_SH_Regr_sample. Additional details available for this algorithm include: standard error, test statistic, p value, standard coefficient, lower coefficient limit, and upper coefficient limit.

Example 5-3 Model Details for GLM Regression

SELECT attribute_name, attribute_value, coefficient
    FROM TABLE(dbms_data_mining.get_model_details_glm('GLMR_SH_Regr_sample'))
    WHERE attribute_name IN ('AFFINITY_CARD','BULK_PACK_DISKETTES','COUNTRY_NAME')
   ORDER BY class, attribute_name, attribute_value;
 
ATTRIBUTE_NAME       ATTRIBUTE_VALUE      COEFFICIENT
-------------------- -------------------- -----------
AFFINITY_CARD                              -.58234968
BULK_PACK_DISKETTES                        -.99684665
COUNTRY_NAME         Argentina             -1.2032688
COUNTRY_NAME         Australia             .000541598
COUNTRY_NAME         Brazil                5.29534224
COUNTRY_NAME         Canada                4.02414761
COUNTRY_NAME         China                 .878394982
COUNTRY_NAME         Denmark               -2.9852215
COUNTRY_NAME         France                -1.0946872
COUNTRY_NAME         Germany               -1.6345684
COUNTRY_NAME         Italy                 -1.2749328
COUNTRY_NAME         Japan                  -6.259627
COUNTRY_NAME         New Zealand           5.07675762
COUNTRY_NAME         Poland                2.20458524
COUNTRY_NAME         Saudi Arabia          .443146197
COUNTRY_NAME         Singapore             -4.9472244
COUNTRY_NAME         South Africa          .493327068
COUNTRY_NAME         Spain                 -3.0895076
COUNTRY_NAME         Turkey                -5.9014625
COUNTRY_NAME         United Kingdom        2.25154714

The query in Example 5-4 returns global details for the same model.

Example 5-4 Global Details for GLM Regression

SELECT *
  FROM TABLE(dbms_data_mining.get_model_details_global('GLMR_SH_Regr_sample'))
ORDER BY global_detail_name;

GLOBAL_DETAIL_NAME             GLOBAL_DETAIL_VALUE
------------------------------ -------------------
ADJUSTED_R_SQUARE                             .732
AIC                                       5943.057
COEFF_VAR                                   18.165
CORRECTED_TOTAL_DF                        1499.000
CORRECTED_TOT_SS                        278740.504
DEPENDENT_MEAN                              38.892
ERROR_DF                                  1420.000
ERROR_MEAN_SQUARE                           49.908
ERROR_SUM_SQUARES                        70869.218
F_VALUE                                     52.291
GMSEP                                       52.722
HOCKING_SP                                    .035
J_P                                         52.570
MODEL_CONVERGED                              1.000
MODEL_DF                                    79.000
MODEL_F_P_VALUE                               .000
MODEL_MEAN_SQUARE                         2609.739
MODEL_SUM_SQUARES                       206169.407
NUM_PARAMS                                  80.000
NUM_ROWS                                  1500.000
ROOT_MEAN_SQ                                 7.065
R_SQ                                          .746
SBIC                                      6368.114
VALID_COVARIANCE_MATRIX                       .000