Oracle Data Mining Application Developer's Guide
10g Release 1 (10.1)

Part Number B10699-01

4
DBMS_DATA_MINING

The DBMS_DATA_MINING package provides PL/SQL support for in-database data mining. You can use the package to build a mining model, test the model, and apply the model to your data to obtain predictive and descriptive information.

This chapter provides an overview of the steps required to perform basic Oracle Data Mining tasks with the PL/SQL interface. For detailed examples of how to perform these tasks, see the sample programs described in Chapter 5; the sample code itself is in the dm/demo/sample/plsql directory.

This chapter does not include detailed descriptions of the PL/SQL subprograms. For that information, see the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM chapters in the PL/SQL Packages and Types Reference.

4.1 Development Methodology

The development methodology for data mining using the DBMS_DATA_MINING API is divided into two phases.

The first phase covers application and data analysis and design, where you perform the following two steps:

  1. Analyze your problem, and choose the mining function and algorithm.
  2. Analyze the data to be used for building mining models (build data), testing predictive models (test data), and the new data on which the model will be applied (scoring data).

The second phase involves developing a mining application using the DBMS_DATA_MINING and DBMS_DATA_MINING_TRANSFORM packages, in the following steps:

  1. Prepare the build, test, and scoring data using the DBMS_DATA_MINING_TRANSFORM package, a third-party tool, or your own SQL or PL/SQL utility scripts, in a manner suitable for the chosen mining function and algorithm. An important caveat: the three datasets referred to above must be prepared in an identical manner for mining results to be meaningful. This is an optional step.
  2. Prepare a settings table that overrides the default mining algorithm for a given mining function, and the default algorithm settings. This is also an optional step.
  3. Build a mining model for the given training dataset.
  4. For predictive models (classification and regression), test the model for its accuracy and other attributes. This amounts to applying the model to the test data (that is, scoring the test data) and computing various metrics on the apply results.
  5. Retrieve the model signature to determine the mining attributes required by a given model for scoring. This information will help ascertain that the scoring data is suitable for a given model. This is an optional step.
  6. Apply a classification, regression, clustering, or feature extraction model to new data to generate predictions and/or descriptive summaries and patterns about the data.
  7. Retrieve the model details to understand why a model scored the data in a particular manner. This is an optional step.
  8. Repeat steps 3 through 7 until you obtain satisfactory results. A minimal skeleton of the core calls is sketched below.
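
The following is a minimal skeleton of the two central calls in this methodology (a sketch only; all model, table, and column names are illustrative, and the test, signature, and details steps are omitted):

BEGIN
  -- Step 3: build a classification model from prepared build data
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'my_model',
    mining_function     => DBMS_DATA_MINING.classification,
    data_table_name     => 'build_data',
    case_id_column_name => 'case_id',
    target_column_name  => 'target',
    settings_table_name => 'my_settings');

  -- Step 6: apply the model to new (scoring) data
  DBMS_DATA_MINING.APPLY(
    model_name          => 'my_model',
    data_table_name     => 'scoring_data',
    case_id_column_name => 'case_id',
    result_table_name   => 'my_apply_results');
END;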

4.2 Mining Models, Functions, and Algorithm Settings

The DBMS_DATA_MINING package creates a mining model for a mining function using a specified mining algorithm that supports the function. The algorithm can be influenced by specific algorithm settings.

4.2.1 Mining Model

A model is identified by its name. Like tables in the database, a model has storage associated with it. But unlike a table, the form, shape, and content of this storage are opaque to the user. However, the user can view the contents of a model -- that is, the patterns and rules that constitute a mining model -- using algorithm-specific GET_MODEL_DETAILS functions. In addition, the dm_user_models view provides the model size in megabytes.

4.2.2 Mining Function

The DBMS_DATA_MINING package supports Classification, Regression, Association Rules, Clustering, Feature Extraction, and Attribute Importance. You can specify your choice of mining function through a parameter to the CREATE_MODEL procedure.

4.2.3 Mining Algorithm

Each mining function can be implemented using one or more algorithms. Table 4-1 provides a list of supported algorithms. Oracle assumes a default algorithm for each mining function, but you can override this default through an explicit setting in the settings table.

Table 4-1   DBMS_DATA_MINING Summary of Functions and Algorithms

Mining Function        Mining Algorithm
Classification         Naive Bayes (NB) -- default algorithm
Classification         Adaptive Bayes Network (ABN)
Classification         Support Vector Machine (SVM)
Regression             Support Vector Machine (SVM) -- default algorithm
Association Rules      Apriori Association Rules (AR)
Clustering             k-Means (KM)
Feature Extraction     Non-Negative Matrix Factorization (NMF)
Attribute Importance   Minimum Description Length (MDL)



Each algorithm has one or more settings or parameters that influence the way it builds the model. Oracle assumes a default set of algorithm settings for each mining algorithm. These defaults are available for your review through the table function GET_DEFAULT_SETTINGS. To override the defaults, you must provide the choice of the algorithm and the settings for the algorithm through a settings table input to the CREATE_MODEL procedure.

4.2.4 Settings Table

The settings table is a simple relational table with a fixed schema. You can choose the name of the settings table, but the column names and their types must be defined as specified below.

(setting_name   VARCHAR2(30),
setting_value   VARCHAR2(128))

The values provided in the settings table override the default values assumed by the system. The values inserted into the setting_name column must be one of the constants defined in the DBMS_DATA_MINING package. Depending on what the setting name denotes, the value for the setting_value column can be a predefined constant or the actual numerical value corresponding to the setting itself. The setting_value column is defined to be VARCHAR2, so you must cast numerical inputs to string using the TO_CHAR() function before inserting them into the settings table.
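
For example, a minimal sketch of populating a settings table for a clustering model; the table name my_settings is illustrative:

-- choose k-Means explicitly and request 8 clusters;
-- note the TO_CHAR() cast for the numerical setting
INSERT INTO my_settings (setting_name, setting_value)
VALUES (dbms_data_mining.algo_name, dbms_data_mining.algo_kmeans);
INSERT INTO my_settings (setting_name, setting_value)
VALUES (dbms_data_mining.clus_num_clusters, TO_CHAR(8));
COMMIT;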

Table 4-2 through Table 4-7 list the various setting names and the valid setting values, with a brief explanation of each setting.

Table 4-2 DBMS_DATA_MINING Function Settings
Setting Name Setting Value (with Permissible Value Ranges)

algo_name

Classification: One of:

  • algo_naive_bayes
  • algo_support_vector_machines
  • algo_adaptive_bayes_network

Regression:

  • algo_support_vector_machines

Association Rules:

  • algo_apriori_association_rules

Clustering:

  • algo_kmeans

Feature Extraction:

  • algo_nonnegative_matrix_factor

Attribute Importance:

  • algo_ai_mdl

clas_priors_table_name

VARCHAR2 string denoting the name of a relational table of fixed schema containing prior probabilities. The schema of this table is provided in Section 4.2.4.1.

This input is applicable only to classification algorithms. The prior probabilities table must be present in the current user's schema.

clus_num_clusters

TO_CHAR(numeric_expr >= 1)

Number of clusters generated by a clustering algorithm

Default value is 10

feat_num_features

TO_CHAR(numeric_expr >= 1)

Number of features to be extracted

Default value estimated from the data by the algorithm

asso_max_rule_length

TO_CHAR(2 <= numeric_expr <= 20)

Maximum rule length for AR algorithm

Default value is 4

asso_min_confidence

TO_CHAR(0 <= numeric_expr <= 1)

Minimum confidence value for AR algorithm

Default value is 0.1

asso_min_support

TO_CHAR(0 <= numeric_expr <= 1)

Minimum support value for AR algorithm

Default value is 0.1

Table 4-3  Algorithm Settings for Adaptive Bayes Network
Setting Name Setting Value (with Permissible Value Ranges)

abns_model_type

Model type for Adaptive Bayes Network:

  • abns_single_feature (single feature)
  • abns_multi_feature (multi feature)
  • abns_naive_bayes (naive bayes)

Default value is abns_multi_feature

abns_max_build_minutes

TO_CHAR(numeric_expr >= 0)

The maximum time threshold for completion of model build. Default value is 0, which implies no time limit.

abns_max_nb_predictors

TO_CHAR(numeric_expr > 0)

Maximum number of Naive Bayes predictors to be considered for model build, when the model type is chosen to be abns_naive_bayes. Default value is 10.

abns_max_predictors

TO_CHAR(numeric_expr > 0)

Maximum number of predictors to be considered for model build

Default is 25

Table 4-4  Algorithm Settings for Naive Bayes
Setting Name Setting Value (with Permissible Value Ranges)

nabs_singleton_threshold

TO_CHAR(0 <= numeric_expr <= 1)

Value of singleton threshold for NB algorithm

Default value is 0.01

nabs_pairwise_threshold

TO_CHAR(0 <= numeric_expr <= 1)

Value of pairwise threshold for NB algorithm

Default value is 0.01

Table 4-5  Algorithm Settings for Support Vector Machines
Setting Name Setting Value (with Permissible Value Ranges)

svms_kernel_function

Kernel for Support Vector Machine:

  • svms_linear (for Linear Kernel)
  • svms_gaussian (for Gaussian Kernel)

Default value is svms_linear

svms_kernel_cache_size

TO_CHAR(numeric_expr > 0)

Value of kernel cache size for SVM algorithm. Applies to Gaussian kernel only.

Default value is 50000000 bytes

svms_conv_tolerance

TO_CHAR(numeric_expr > 0)

Convergence tolerance for SVM algorithm

Default value is 0.001.

svms_std_dev

TO_CHAR(numeric_expr > 0)

Value of standard deviation for SVM algorithm

This is applicable only for Gaussian kernel

Default value estimated from the data by the algorithm

svms_complexity_factor

TO_CHAR(numeric_expr > 0)

Value of complexity factor for SVM algorithm

Default value estimated from the data by the algorithm

svms_epsilon

TO_CHAR(numeric_expr > 0)

Value of epsilon factor for SVM Regression

Default value estimated from the data by the algorithm

Table 4-6  Algorithm Settings for Non-Negative Matrix Factorization
Setting Name Setting Value (with Permissible Value Ranges)

nmfs_random_seed

TO_CHAR(numeric_expr)

Random seed for NMF algorithm

Default value is -1

nmfs_num_iterations

TO_CHAR(1 <= numeric_expr <= 500)

Number of iterations for NMF algorithm

Default value is 50

nmfs_conv_tolerance

TO_CHAR(0 < numeric_expr <= 0.5)

Convergence tolerance for NMF algorithm

Default value is 0.05

Table 4-7  Algorithm Settings for k-Means
Setting Name Setting Value (with Permissible Value Ranges)

kmns_distance

Distance Function for k-Means Clustering:

  • kmns_euclidean
  • kmns_cosine
  • kmns_fast_cosine

Default value is kmns_euclidean

kmns_iterations

TO_CHAR(0 < numeric_expr <= 20)

Number of iterations for k-Means algorithm

Default value is 3

kmns_conv_tolerance

TO_CHAR(0 < numeric_expr <= 0.5)

Convergence tolerance for k-Means algorithm

Default value is 0.01

kmns_split_criterion

Split criterion for k-Means Clustering:

  • kmns_variance
  • kmns_size

Default value is kmns_variance

kmns_num_bins

Number of histogram bins. Specifies the number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

Range > 0

Default value is 10.

kmns_block_growth

TO_CHAR(1 < numeric_expr <= 5)

Growth factor for memory allocated to hold cluster data

Default value is 2

kmns_min_pct_attr_support

Minimum percentage support required for attributes in rules. Specifies the minimum percentage of values for an attribute in a given cluster required to include this attribute in the rule description of the cluster. If the required support level is not met, the attribute is omitted from the rule, so that only well-represented attributes are retained. Setting the parameter value too high in data with missing values can result in very short or even empty rules.

Range >= 0 and <= 1

Default is 0.1.

You can create a settings table as shown in the example below for an SVM classification model, and edit the individual values using SQL DML.

CREATE TABLE drugstore_settings (
  setting_name  VARCHAR2(30),
  setting_value VARCHAR2(128));

BEGIN
-- override the default convergence tolerance for SVM Classification
INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (dbms_data_mining.svms_conv_tolerance, TO_CHAR(0.081));
COMMIT;
END;

The table function GET_DEFAULT_SETTINGS provides all the default settings for mining functions and algorithms. If you intend to override many of the default settings, you can create a seed settings table from its output and edit the values using DML.

CREATE TABLE drug_store_settings AS
SELECT setting_name, setting_value
  FROM TABLE (DBMS_DATA_MINING.GET_DEFAULT_SETTINGS)
 WHERE setting_name LIKE 'SVMS_%';
-- update the values using appropriate DML

You can also create a settings table based on another model's settings using GET_MODEL_SETTINGS, as shown in the example below.

CREATE TABLE my_new_model_settings AS
SELECT setting_name, setting_value 
  FROM TABLE (DBMS_DATA_MINING.GET_MODEL_SETTINGS('my_other_model'));

4.2.4.1 Prior Probabilities Table

Consult Oracle Data Mining Concepts for an explanation of the prior probabilities table. You can specify a prior probabilities table as an optional function setting when building classification models.

You must create the prior probabilities table using the fixed schema shown below. For numerical targets, use the following schema:

target_value      NUMBER
prior_probability NUMBER

For categorical targets, use the following schema:

target_value      VARCHAR2
prior_probability NUMBER
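
For example, a minimal sketch of a priors table for a categorical target follows; the target values, probabilities, and VARCHAR2 length are illustrative, and census_priors is the name used in the settings example below:

CREATE TABLE census_priors (
  target_value      VARCHAR2(30),
  prior_probability NUMBER);

-- illustrative priors for a binary target
INSERT INTO census_priors VALUES ('employed',   0.7);
INSERT INTO census_priors VALUES ('unemployed', 0.3);
COMMIT;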

Next, provide the name of the prior probabilities table as the setting_value, with DBMS_DATA_MINING.clas_priors_table_name as the corresponding setting_name, as shown below:

BEGIN
INSERT INTO drugstore_settings (setting_name, setting_value)
VALUES (DBMS_DATA_MINING.clas_priors_table_name, 'census_priors');
COMMIT;
END;

4.2.4.2 Cost Matrix Table

Consult Oracle Data Mining Concepts for an explanation of the cost matrix. You must create a cost matrix table with the fixed schema shown below. For numerical targets, use the following schema:

actual_target_value    NUMBER
predicted_target_value NUMBER
cost                   NUMBER

For categorical targets, use the following schema:

actual_target_value    VARCHAR2
predicted_target_value VARCHAR2
cost                   NUMBER
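
For example, a minimal sketch of a cost matrix for a binary categorical target follows; the table name, target values, and costs are illustrative:

CREATE TABLE drugstore_costs (
  actual_target_value    VARCHAR2(30),
  predicted_target_value VARCHAR2(30),
  cost                   NUMBER);

-- penalize a missed buyer more heavily than a false alarm
INSERT INTO drugstore_costs VALUES ('buyer',     'buyer',     0);
INSERT INTO drugstore_costs VALUES ('buyer',     'non_buyer', 4);
INSERT INTO drugstore_costs VALUES ('non_buyer', 'buyer',     1);
INSERT INTO drugstore_costs VALUES ('non_buyer', 'non_buyer', 0);
COMMIT;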

The DBMS_DATA_MINING package enables you to evaluate the cost of predictions from classification models in an iterative manner during the experimental phase of mining, and to eventually apply the optimal cost matrix to predictions on the actual scoring data in a production environment.

The data input to each COMPUTE procedure in the package is the result generated from applying the model on test data. In addition, if you also provide a cost matrix as an input, the COMPUTE procedure generates test results taking the cost matrix into account. This enables you to experiment with various costs for a given prediction against the same APPLY results, without rebuilding the model and applying it against the same test data for every iteration.

Once you arrive at an optimal cost matrix, you can then input this cost matrix to the RANK_APPLY procedure along with the results of APPLY on your scoring data. RANK_APPLY will provide your new data ranked by cost.
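
A hedged sketch of this step follows; the table names are illustrative, and the full parameter list is in the RANK_APPLY description in the package reference:

BEGIN
  -- rank the apply results, keeping the top 2 predictions per case,
  -- with ranking driven by the cost matrix
  DBMS_DATA_MINING.RANK_APPLY(
    apply_result_table_name     => 'drugstore_apply_results',
    case_id_column_name         => 'case_id',
    score_column_name           => 'prediction',
    score_criterion_column_name => 'probability',
    ranked_apply_table_name     => 'drugstore_ranked_results',
    top_N                       => 2,
    cost_matrix_table_name      => 'drugstore_costs');
END;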

4.3 Mining Operations and Results

There are essentially four classes of mining operations supported by the DBMS_DATA_MINING package:

  • Operations that create, rename, and drop models: CREATE_MODEL, RENAME_MODEL, DROP_MODEL
  • Operations that apply and test models: APPLY, RANK_APPLY, COMPUTE_CONFUSION_MATRIX, COMPUTE_LIFT, COMPUTE_ROC
  • Operations that provide information about a model or about default settings: the GET_MODEL_DETAILS functions, GET_MODEL_SETTINGS, GET_MODEL_SIGNATURE, GET_ASSOCIATION_RULES, GET_FREQUENT_ITEMSETS, GET_DEFAULT_SETTINGS
  • Utility operations for moving models between schemas and databases: EXPORT_MODEL, IMPORT_MODEL

Of these, the first set represents DDL-like operations. The last set represents utilities. The rest are query-like operations in that they do not modify the model.

Mining results are either returned as result sets or persisted as fixed schema tables.

4.3.1 Build Results

The CREATE_MODEL operation creates a mining model. The viewable contents of a mining model are provided to you through GET_MODEL_DETAILS functions for each supported algorithm. In addition, GET_MODEL_SIGNATURE and GET_MODEL_SETTINGS provide descriptive information about the model.

4.3.2 Apply Results

The APPLY operation creates and populates a fixed schema table with a given name. The schema of this table varies based on the particular mining function, algorithm, and target attribute type -- numerical or categorical.

The RANK_APPLY operation takes this results table as input and generates another table with results ranked based on a top-N input, and for classification models, also based on cost. The schema of this table varies based on the particular mining function, algorithm, and the target attribute type -- numerical or categorical.

4.3.3 Test Results for Classification Models

The COMPUTE routines provided in the package calculate the most commonly used test metrics for classification. They are not tied to a particular model; they can compute the metrics from any meaningful data input as long as the schema of the input tables fits the specification of the apply results table and the targets table. Consult any of the COMPUTE descriptions in this document for details.
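
For example, a hedged sketch of computing a confusion matrix from apply results on test data; the table names are illustrative and several optional parameters are omitted:

DECLARE
  v_accuracy NUMBER;
BEGIN
  -- compares predictions in the apply results table against actual targets
  DBMS_DATA_MINING.COMPUTE_CONFUSION_MATRIX(
    accuracy                    => v_accuracy,
    apply_result_table_name     => 'test_apply_results',
    target_table_name           => 'test_targets',
    case_id_column_name         => 'case_id',
    target_column_name          => 'target',
    confusion_matrix_table_name => 'confusion_matrix');
  DBMS_OUTPUT.PUT_LINE('Model accuracy: ' || v_accuracy);
END;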

4.3.4 Test Results for Regression Models

The most commonly used metrics for regression models are root mean square error and mean absolute error. You can use the simple SQL queries provided below to compute these metrics, replacing the placeholder tokens with table and column names appropriate for your application.

4.3.4.1 Root Mean Square Error

SELECT sqrt(avg((A.prediction - B.target_column_name) * 
                (A.prediction - B.target_column_name))) rmse 
  FROM apply_results_table A, targets_table B 
 WHERE A.case_id_column_name = B.case_id_column_name;

4.3.4.2 Mean Absolute Error

Given the targets_table generated from the test data with the schema:

(case_id_column_name    VARCHAR2,
target_column_name      NUMBER)

and apply results table for regression with the schema:

(case_id_column_name    VARCHAR2, 
prediction              NUMBER)

and a normalization table (optional) with the schema:

(attribute_name        VARCHAR2(30),
scale                  NUMBER,
shift                  NUMBER)

the query for mean absolute error is:

SELECT /*+PARALLEL(T) PARALLEL(A)*/
       AVG(ABS(T.actual_value - A.prediction)) mean_absolute_error
  FROM (SELECT B.case_id_column_name,
               (B.target_column_name * N.scale + N.shift) actual_value
          FROM targets_table B,
               normalization_table N
         WHERE N.attribute_name = 'target_column_name') T,
       apply_results_table A
 WHERE A.case_id_column_name = T.case_id_column_name;

Fill in the placeholder values with the actual column and table names chosen for your application. If the data has not undergone normalization, you can eliminate the normalization table references from the subquery. See dm/demo/sample/plsql/svmrdemo.sql for an example.

4.4 Mining Data

The data input for all the mining operations should be through standard relational tables, where each row in the table represents a case, and each column in the table uniquely represents a mining attribute. We call this format single-record case.

A table in the Oracle RDBMS supports up to 1,000 columns. The DBMS_DATA_MINING package requires a mandatory case identifier column, so you can provide up to 999 mining attributes using this representation, which is adequate for most business applications.

4.4.1 Wide Data Support

The notable exceptions to the common scenario are applications in domains such as bioinformatics and text mining, where data is characteristically wide and shallow -- relatively few cases, numbering in the thousands, but several thousand mining attributes.

You can provide such data in a multi-record case format, where attribute/value pairs are grouped into collections (nested tables) associated with a given case identifier. You must use the fixed collection types DM_Nested_Numericals and DM_Nested_Categoricals to define columns that represent collections of numerical attributes and categorical attributes, respectively.

A few caveats apply to the use of the multi-record case format.

From a data modeling perspective, the multi-record case format is analogous to the dimension-fact relationship commonly found in OLAP applications, where the dimension and fact tables are aggregated based on a common key for generating reports using materialized views or SQL queries.

The main distinction between the OLAP fact table and the multi-record case is this: Each row in the fact table represents a column value, whereas each row in the multi-record collection represents an attribute name (paired with its corresponding value in another column in the nested table). The two fixed collection types are provided to highlight and enforce this distinction. These two types are defined with the reasonable assumption that mining attributes of the same type (numericals versus categoricals) are generally grouped together, just as a fact table contains values that logically correspond to the same entity.

Oracle strongly recommends that you present your multi-record case data using object views, and present the view as input to CREATE_MODEL and APPLY operations. Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.

We illustrate this discussion on wide data with a real-world example of an analytical pipeline for brain tumor research. The datasets of relevance are gene expression data from the instruments (fact table), and the clinical data about the patient (dimension table). The schemas of these tables are provided below.

4.4.1.1 Clinical Data -- Dimension Table

(case_id    NUMBER,
name        VARCHAR2(30),
type        VARCHAR2(30),
subtype     VARCHAR2(30),
gender      CHAR(1),
age         NUMBER,
status      VARCHAR2(30))

4.4.1.2 Gene Expression Data -- Fact Table

(case_id    NUMBER,
gene        VARCHAR2(30),
expr        NUMBER)

Let us consider building a model with status as the prediction target, and with gender, age, and the gene expression values (expr) as predictors. You can provide the build data input using an object view that combines the clinical_data table and the gene_expression_data table with the schema:

(case_id     NUMBER,
gender       CHAR(1),
age          NUMBER,
gene_expr    DM_Nested_Numericals,
status       VARCHAR2(30))

The query for constructing such an object view is relatively simple in Oracle SQL:

CREATE OR REPLACE VIEW gene_expr_build AS
SELECT C.case_id,
       C.gender,
       C.age,
       CAST(MULTISET(
       SELECT gene, expr
         FROM gene_expression_data
        WHERE case_id = C.case_id) AS DM_Nested_Numericals
       ) gene_expr,
       C.status
 FROM  clinical_data C;

Now gene_expr_build can be provided as the input data_table_name for CREATE_MODEL.
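
For example, a hedged sketch of the build call follows; the model name is illustrative, and a settings table may be supplied through the optional settings_table_name parameter:

BEGIN
  -- build a classification model over the object view defined above
  DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'brain_tumor_model',
    mining_function     => DBMS_DATA_MINING.classification,
    data_table_name     => 'gene_expr_build',
    case_id_column_name => 'case_id',
    target_column_name  => 'status');
END;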

4.4.2 Attribute Types

Oracle Data Mining handles categorical and numerical attributes. See Oracle Data Mining Concepts for explanation and examples of attribute types.

The DBMS_DATA_MINING package infers the attribute type based on the type of the column representing the mining attribute. You must define all columns representing numerical attributes to be of type NUMBER. You must define all columns representing categorical attributes to be of type VARCHAR2 or CHAR.

In situations where you have numbers that are deemed categorical data by your application logic, you must typecast such attribute values using the TO_CHAR() operator and populate them into a VARCHAR2 or CHAR column representing the mining attribute.

In situations where you have numeric attribute values stored in a CHAR or VARCHAR2 column, you must typecast those attribute values using the TO_NUMBER() operator and store them in a NUMBER column.

If persisting these transformed values in another table is not a viable option, you could also create a view with these conversions in place, and provide the view name to represent the training data input for the CREATE_MODEL operation.
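
For example, a minimal sketch of such a view; the table and column names are illustrative:

CREATE OR REPLACE VIEW build_data_typed AS
SELECT case_id,
       TO_CHAR(zip_code) zip_code,     -- numeric code treated as a categorical attribute
       TO_NUMBER(salary_text) salary,  -- number stored as text treated as a numerical attribute
       target
  FROM raw_build_data;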

4.4.3 Target Attribute

Classification and Regression algorithms require a target attribute. The package supports single targets. The target attribute for all classification algorithms can be numerical or categorical. SVM Regression supports only numerical target attributes.

4.4.4 Data Transformations

All mining operations assume the incoming data to be already prepared and transformed. You can preprocess your data using the DBMS_DATA_MINING_TRANSFORM package, or by using any third-party tool, or using your own homegrown utilities based on SQL and/or PL/SQL scripts.

If you preprocess or transform the input data, you must also reverse-transform the results from APPLY to map the data back to its original form.

See PL/SQL Packages and Types Reference for a description of the DBMS_DATA_MINING_TRANSFORM package.

For wide data, the suggestion is to perform transformations before creating the wide-data views, whether the underlying data is in single-record (2D) or multi-record (transactional) format. It is possible to use the DBMS_DATA_MINING_TRANSFORM package (dbmsdmxf) with the multi-record format; in that case, all attributes are transformed in the same way. This usually works because attributes in transactional (multi-record) form are typically on the same scale. Otherwise, the suggestion is to split the data into sets of similar items and transform them separately.

4.5 Performance Considerations

If you have mining attributes numbering in the few hundreds, and your application requires them to be represented as columns in the same row of the table, here are some data storage issues to consider. For a table with several columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intra-block chaining. Oracle stores multiple row pieces in the same block owing to pragmatics of design, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows don't fit in a single data block, you may consider using a larger database block size (or use multiple block sizes in the same database). For more details, consult the Oracle Data Mining Concepts and the Oracle Database Performance Tuning Guide.

4.6 Rules and Limitations for DBMS_DATA_MINING

The use of DBMS_DATA_MINING is subject to the following rules and limitations:

4.7 Summary of Data Types, Constants, Exceptions, and User Views

The DBMS_DATA_MINING and the DBMS_DATA_MINING_TRANSFORM packages use the data types shown in Table 4-8.

Table 4-8 DBMS_DATA_MINING Summary of Data Types
Data Type Purpose

DM_ABN_Detail

This type represents each row of the model detail output generated by GET_MODEL_DETAILS_ABN

DM_ABN_Details

This type represents the ABN model details generated by GET_MODEL_DETAILS_ABN

DM_Centroid

This type represents the centroid of a cluster. It is used when retrieving cluster details using GET_MODEL_DETAILS_KM.

DM_Child

This type represents each child node of a cluster

DM_Children

This type represents a set of children nodes for a given cluster identifier

DM_Cluster

This type represents a cluster retrieved using GET_MODEL_DETAILS_KM

DM_Clusters

This type represents a set of clusters

DM_Conditional

This type represents each conditional probability from a set of conditional probabilities associated with each mining attribute used in a Naive Bayes or Adaptive Bayes Network model

DM_Conditionals

This type represents conditional probabilities associated with a given mining attribute used in a Naive Bayes or Adaptive Bayes Network model. It is used when retrieving model details using GET_MODEL_DETAILS_NB or GET_MODEL_DETAILS_ABN respectively.

DM_Histogram_Bin

This type represents a histogram associated with a cluster identifier. It is used when retrieving cluster details using GET_MODEL_DETAILS_KM

DM_Histograms

This type represents a set of histograms

DM_Item

This type represents an item in a set of items

DM_Items

This type represents the set of items in an ItemSet

DM_ItemSet

This type represents an ItemSet

DM_ItemSets

This type represents frequent Itemsets in Association models.

DM_Model_Settings

This type represents the algorithm settings retrieved using the GET_MODEL_SETTINGS function.

DM_Model_Signature

This type represents a list of model signature attributes generated by GET_MODEL_SIGNATURE

DM_Modelname_List

This type represents a list of model names provided as input for the parameter model_names in EXPORT_MODEL and IMPORT_MODEL procedures.

DM_NB_Detail

This type represents each row of the model detail output generated by GET_MODEL_DETAILS_NB

DM_NB_Details

This type represents the NB model details generated by GET_MODEL_DETAILS_NB.

DM_Nested_Categoricals

This type represents a nested table of categorical attributes, used for representing wide data.

DM_Nested_Numericals

This type represents a nested table of numerical attributes, used for representing wide data.

DM_NMF_Attribute

This type represents each attribute in an attribute set for NMF model details

DM_NMF_Attribute_Set

This type represents a set of attributes that correspond to a feature identifier, returned by GET_MODEL_DETAILS_NMF.

DM_NMF_Feature

This type represents a feature in a NMF model

DM_NMF_Feature_Set

This type represents a set of features returned by GET_MODEL_DETAILS_NMF.

DM_Predicate

This type represents each predicate in the set of predicates in a rule.

DM_Predicates

This type represents a set of predicates that constitute either the antecedent or the consequent of a rule.

DM_Ranked_Attribute

This type represents an entry in the set of ranked attributes returned by GET_MODEL_DETAILS_AI, ranked by the attribute's importance.

DM_Ranked_Attributes

This type represents a list of ranked attributes returned by GET_MODEL_DETAILS_AI.

DM_Rule

This type represents each rule in a list of rules generated by either GET_ASSOCIATION_RULES or GET_MODEL_DETAILS_KM.

DM_Rules

This type represents rules retrieved for Association Rules or k-means models using GET_ASSOCIATION_RULES and GET_MODEL_DETAILS_KM respectively.

DM_SVM_Attribute

This type represents each attribute in an attribute set for SVM model details

DM_SVM_Attribute_Set

This type represents a set of attributes returned by GET_MODEL_DETAILS_SVM for a linear model.

DM_SVM_Linear_Coeff

This type represents an entry in the set of linear coefficients returned by GET_MODEL_DETAILS_SVM

DM_SVM_Linear_Coeff_Set

This type represents the set of linear coefficients returned by GET_MODEL_DETAILS_SVM for an SVM model built using the linear kernel.


Table 4-9 through Table 4-16 list the constants to be used for various settings in the settings table.

Table 4-9  DBMS_DATA_MINING Constants Summary: Mining Function
Constant Purpose

association

Parameter value for mining_function in CREATE_MODEL, representing association mining function

attribute_importance

Parameter value for mining_function in CREATE_MODEL, representing attribute importance mining function

classification

Parameter value for mining_function in CREATE_MODEL, representing classification mining function

regression

Parameter value for mining_function in CREATE_MODEL, representing regression mining function

clustering

Parameter value for mining_function in CREATE_MODEL, representing clustering mining function

feature_extraction

Parameter value for mining_function in CREATE_MODEL, representing Feature Extraction mining function

Table 4-10  DBMS_DATA_MINING Constants Summary: Function Settings
Constant Purpose

clas_priors_table_name

Setting name representing prior probability table name for classification function

clus_num_clusters

Setting name representing number of clusters for clustering function

feat_num_features

Setting name representing number of features for feature extraction function

asso_max_rule_length

Setting name representing maximum rule length

asso_min_confidence

Setting name representing minimum confidence

asso_min_support

Setting name representing minimum support

Table 4-11  DBMS_DATA_MINING Constants Summary: Algorithm Settings
Constant Purpose

algo_name

Setting name representing the mining algorithm

algo_apriori_association_rules

Setting value for Apriori algorithm for association rules

algo_naive_bayes

Setting value for Naive Bayes (NB) algorithm for classification

algo_support_vector_machines

Setting value for Support Vector Machine (SVM) algorithm for classification or regression

algo_nonnegative_matrix_factor

Setting value for Non-Negative Matrix Factorization (NMF) for feature extraction

algo_kmeans

Setting value for k-Means (KM) for clustering

algo_ai_mdl

Setting value for Minimum Description Length based algorithm for Attribute Importance.

Table 4-12  DBMS_DATA_MINING Constants Summary: Adaptive Bayes Network
Constant Purpose

abns_model_type

Setting name representing ABN model type

abns_single_feature

Setting value representing single feature ABN model

abns_multi_feature

Setting value representing multi feature ABN model

abns_naive_bayes

Setting value representing Naive Bayes ABN model

abns_max_build_minutes

Setting name representing maximum time threshold to complete an ABN model build

abns_max_nb_predictors

Setting name representing the maximum number of Naive Bayes predictors to be considered for building an ABN model of type abns_naive_bayes

Table 4-13  DBMS_DATA_MINING Constants Summary: Naive Bayes
Constant Purpose

nabs_singleton_threshold

Setting name representing singleton threshold for Naive Bayes

nabs_pairwise_threshold

Setting name representing pair-wise threshold for Naive Bayes

Table 4-14  DBMS_DATA_MINING Constants Summary: Support Vector Machines
Constant Purpose

svms_kernel_function

Setting name representing the kernel function for SVM

svms_linear

Setting value for Linear Kernel for SVM

svms_gaussian

Setting value for Gaussian Kernel for SVM

svms_kernel_cache_size

Setting name representing kernel cache size for Support Vector Machine

svms_conv_tolerance

Setting name representing convergence tolerance for SVM

svms_std_dev

Setting name representing standard deviation for SVM

svms_complexity_factor

Setting name representing complexity factor for SVM

svms_epsilon

Setting name representing epsilon for SVM Regression

Table 4-15  DBMS_DATA_MINING Constants Summary: Non-Negative Matrix Factorization
Constant Purpose

nmfs_num_iterations

Setting name representing number of iterations

nmfs_conv_tolerance

Setting name representing convergence tolerance

nmfs_random_seed

Setting name representing random seed for NMF

Table 4-16  DBMS_DATA_MINING Constants Summary: k-Means
Constant Purpose

kmns_distance

Setting name representing distance function

kmns_euclidean

Setting value representing Euclidean distance function

kmns_cosine

Setting value representing cosine distance function

kmns_fast_cosine

Setting value representing fast cosine distance function

kmns_iterations

Setting name representing number of iterations

kmns_conv_tolerance

Setting name representing convergence tolerance

kmns_split_criterion

Setting name representing split criterion

kmns_variance

Setting value representing variance as the split criterion

kmns_size

Setting value representing size as the split criterion

kmns_block_growth

Setting name representing growth factor for memory allocated to hold cluster data

kmns_num_bins

Setting name representing number of histogram bins

kmns_min_pct_attr_support

Setting name representing minimum percentage support required for attributes in rules

Table 4-17 lists the errors generated by DBMS_DATA_MINING.

Table 4-17 DBMS_DATA_MINING Errors Summary
Oracle Error Description

ORA-40201

Invalid input parameter %s

ORA-40202

Column %s does not exist in the input table %s

ORA-40203

Model %s does not exist

ORA-40204

Model %s already exists

ORA-40205

Invalid setting name %s

ORA-40206

Invalid setting value for setting name %s

ORA-40207

Duplicate or multiple function settings

ORA-40208

Duplicate or multiple algorithm settings for function %s

ORA-40209

Invalid setting: %s for function %s

ORA-40211

Algorithm name: %s is invalid

ORA-40212

Invalid target data type in input data for function: %s

ORA-40213

Contradictory values for settings: %s, %s

ORA-40214

Duplicate setting: %s

ORA-40215

Model %s is incompatible with current operation

ORA-40216

Feature not supported

ORA-40219

Apply result table %s is incompatible with current operation

ORA-40220

Maximum number of attributes exceeded

ORA-40221

Maximum target cardinality exceeded

ORA-40222

Data mining model export failed, job name=%s, error=%s

ORA-40223

Data mining model import failed, job name=%s, error=%s

ORA-40225

Model is currently in use by another process

ORA-40251

No support vectors were found

ORA-40252

No target values were found

ORA-40253

No target counter examples were found

ORA-40261

Input data for model build contains negative values

ORA-40262

NMF: number of features not between [1, %s]

ORA-40271

No statistically significant features were found

ORA-40272

Apply rules prohibited for this model mode

ORA-40273

Invalid model type %s for Adaptive Bayes Network algorithm



Table 4-18 lists the user views provided by Oracle to obtain information about the models generated using DBMS_DATA_MINING.

Table 4-18 DBMS_DATA_MINING Summary of User Views
User View Purpose

DM_USER_MODELS

Lists all models in a given user's schema.
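
For a quick check of the models in your schema, you can query the view directly; consult the view documentation for its exact column list:

SELECT * FROM dm_user_models;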



4.8 Summary of DBMS_DATA_MINING Subprograms

Table 4-19  DBMS_DATA_MINING Summary of Subprograms
Subprogram Purpose

APPLY Procedure

Applies a model to scoring data

CREATE_MODEL Procedure

Creates (builds) a mining model

COMPUTE_CONFUSION_MATRIX Procedure

Computes the confusion matrix from the APPLY results on test data for a classification model; also provides the accuracy of the model

COMPUTE_LIFT Procedure

Computes lift for a given positive target value from the APPLY results on test data for a classification model

COMPUTE_ROC Procedure

Computes Receiver Operating Characteristic for target attributes with binary class from the APPLY results on test data for a classification model.

DROP_MODEL Procedure

Drops a model

EXPORT_MODEL Procedure

Exports one or more models from a schema

GET_ASSOCIATION_RULES Function

This table function returns the rules from an Association model

GET_DEFAULT_SETTINGS Function

This table function returns all the default settings for all mining functions and algorithms.

GET_FREQUENT_ITEMSETS Function

Returns a set of rows that represent the frequent itemsets from an Association model.

GET_MODEL_DETAILS_ABN Function

Provides the details of an Adaptive Bayes Network model

GET_MODEL_DETAILS_KM Function

Provides the details of a k-Means model

GET_MODEL_DETAILS_NB Function

Provides the details of a Naive Bayes model

GET_MODEL_DETAILS_NMF Function

Provides the details of an NMF model

GET_MODEL_DETAILS_SVM Function

Provides the details of an SVM model

GET_MODEL_SETTINGS Function

Provides the settings used to build a model

GET_MODEL_SIGNATURE Function

Provides the signature of a model

IMPORT_MODEL Procedure

Imports one or more models into the current schema

RANK_APPLY Procedure

Ranks the predictions from the APPLY results for a classification model

RENAME_MODEL Procedure

Renames a model

4.9 Model Export and Import

Data mining models can be moved between Oracle databases or schemas. For example, in an organization, data mining specialists may build and test data mining models in a data mining lab. After models are built and tested in the lab, the chosen model may be moved to a scoring engine used by applications. Because the data mining lab and the scoring engine usually do not share the same database, the model must be exported from the lab and then imported into the scoring engine. Model export and import can be a routine procedure: as new data accumulate, data mining specialists will build and test new models, and newer and better models will be loaded onto the scoring engine on a regular basis. DBAs will also want to back up and restore models as part of routine database maintenance.

Native export and import of data mining models are supported in the following scenarios:

4.9.1 Limitations

The use of model export and import is subject to the following limitations:

4.9.2 Prerequisites

Prerequisites for model export are as follows:

Prerequisites for model import are as follows:


4.9.3 Choose the Right Utility

There are two ways to export models:

  • Export all database objects in a user schema, including mining models, with the Oracle Data Pump Export utility (expdp)
  • Export one or more named models, or all models in a schema, with the DBMS_DATA_MINING.EXPORT_MODEL procedure

To export all data mining models in a user schema, you can either run expdp or use EXPORT_MODEL with the parameter model_filter set to NULL. Note the difference between the two operations: When you run expdp to export the schema, all objects in the schema including data mining models are exported. When you run EXPORT_MODEL with a NULL model_filter, only the models will be exported.

There are also two ways to import models from the dump file:

  • Import the full dump file set with the Oracle Data Pump Import utility (impdp)
  • Import one or more named models, or all models in the dump, with the DBMS_DATA_MINING.IMPORT_MODEL procedure

In general, if you want to import the full dump file set, run impdp. This imports all database objects and data, including all data mining models, from the dump file set. If you want to import models only, use IMPORT_MODEL. When the parameter model_filter is set to NULL, all models are imported from the dump. If valid model names are assigned in model_filter, this operation imports only named models from the dump file set.
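
As a hedged sketch (the directory object, file, and model names are illustrative; further parameters for both procedures are described in the package reference):

-- export only the named model to a dump file in the DM_DUMP directory object
BEGIN
  DBMS_DATA_MINING.EXPORT_MODEL(
    filename     => 'mymodels.dmp',
    directory    => 'DM_DUMP',
    model_filter => 'name IN (''MY_MODEL'')');
END;

-- import every model found in the dump file
BEGIN
  DBMS_DATA_MINING.IMPORT_MODEL(
    filename     => 'mymodels.dmp',
    directory    => 'DM_DUMP',
    model_filter => NULL);
END;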

4.9.4 Temp Tables

Data mining model export and import jobs will create and manage two temporary tables in the user schema: DM$P_MODEL_EXPIMP_TEMP and DM$P_MODEL_TABKEY_TEMP. Users should not manipulate these tables.