Skip Headers
Oracle® Data Mining Administrator's Guide
11g Release 1 (11.1)

B28130-04
Go to Documentation Home
Home
Go to Book List
Book List
Go to Table of Contents
Contents
Go to Index
Index
Go to Master Index
Master Index
Go to Feedback page
Contact Us

Go to previous page
Previous
Go to next page
Next
PDF · Mobi · ePub

8 The Data Mining Sample Programs

A number of sample programs are available with Oracle Data Mining. These programs illustrate the many features of the PL/SQL and Java APIs.

The sample programs create a set of models in the database. You can examine the sample source code, which includes numerous comments, to familiarize yourself with the Oracle Data Mining APIs, and you can create your own models by modifying the samples.

Note:

This chapter provides general information about the sample programs and the sample data.

Instructions for installing and running the sample programs are provided in Chapter 1.

This chapter includes the following sections:

PL/SQL Programs

The PL/SQL sample programs illustrate the use of the DBMS_DATA_MINING package for creating models and the DBMS_DATA_MINING_TRANSFORM package for performing transformations on the mining data.

See Also:

PL/SQL Program Summaries

Summary descriptions of the PL/SQL sample programs are provided in Table 8-1. For detailed descriptions of the sample programs, see the comments in the source code.

Table 8-1 Overview of the PL/SQL Sample Programs

Mining Function Description

Classification

The classification programs demonstrate various preprocessing techniques and perform the following steps:

  • Build a classification model using training data

  • Display model details and settings

  • Test the model by applying the model on the test data

  • Compute test metrics, such as confusion matrix, lift, and ROC

  • Apply the model on the scoring data

  • Present apply results

  • Present ranked apply results, influenced by a cost matrix

dmnbdemo.sql illustrates Naive Bayes.

dmdtdemo.sql illustrates Decision Tree.

dmsvcdem.sql illustrates SVM classification.

dmglcdem.sql illustrates GLM classification (binary logistic regression)

The dmdtxvlddemo.sql program demonstrates cross-validation techniques for decision tree based-classification. With minor modifications, this program can be used to perform cross validation using other models/algorithms.

Regression

dmsvrdem.sql uses different test metrics, but otherwise performs most of the same steps used in the classification programs. Selected attributes of the input data are preprocessed (normalized).

NOTE: dmsvrdem.sql illustrates the new Automatic Data Preparation feature.

dmglrdem.sql illustrates GLM regression (multivariate linear regression)

Anomaly Detection

dmsvodem.sql illustrates one-class SVM

Association

dmardemo.sql builds an association model and presents frequent itemsets and association rules as output.

Clustering

dmkmdemo.sql (k-Means) and dmocdemo.sql (0-Cluster) build clustering models and present cluster details, such as rules, centroid, and histogram for each cluster as output. The models are scored, and the probabilities associated with each cluster are returned as output. Selected attributes of the input data are preprocessed.

NOTE: dmkmdemo.sql illustrates the new Automatic Data Preparation feature.

Feature extraction

dmnmdemo.sql builds a feature extraction model and presents model details as the output. The model is scored, and each feature ID is associated with a probability. Selected attributes of the input data are preprocessed (normalized).

Attribute importance

dmaidemo.sql builds an attribute importance model and presents a list of important attributes as the output of model details. Selected attributes of the input data are preprocessed (binned).


Data Mining SQL Scoring Functions

Most of the PL/SQL sample programs use the Data Mining SQL functions for scoring. The Data Mining scoring functions can be used to apply models created with the PL/SQL API or with the Java API.

Note:

The SQL functions for Data Mining are documented in Oracle Database SQL Language Reference. Information about these functions is also provided in Oracle Data Mining Application Developer's Guide.

The programs that demonstrate the Data Mining functions are listed in Table 8-2.

Table 8-2 Data Mining SQL Functions in the Sample Programs

Program Name Algorithm SQL Functions Used

dmkmdemo.sql

k-Means

CLUSTER_ID, CLUSTER_PROBABILITY, CLUSTER_SET

dmocdemo.sql

O-Cluster

CLUSTER_ID

dmnmdemo.sql

NMF

FEATURE_ID, FEATURE_SET, FEATURE_VALUE

dmdtdemo.sql

Decision Tree

PREDICTION, PREDICTION_COST, PREDICTION_DETAILS, PREDICTION_SET

dmdtxvlddemo.sql

Decision Tree

PREDICTION

dmglcdem.sql

GLM classification

PREDICTION, PREDICTION_BOUNDS, PREDICTION_PROBABILITY,

dmglrdem.sql

GLM regression

PREDICTION, PREDICTION_BOUNDS

dmsvcdem.sql

SVM classification

PREDICTION, PREDICTION_PROBABILITY, PREDICTION_SET

dmsvodem.sql

One-Class SVM

PREDICTION, PREDICTION_PROBABILITY, PREDICTION_SET

dmsvrdem.sql

SVM regression

PREDICTION

dmtxtsvm.sql

Text mining

PREDICTION, PREDICTION_PROBABILITY


Java Programs

The Java demos illustrate the features of the Oracle Data Mining Java API, which implements Oracle-specific extensions to the Java Data Mining (JDM) 1.0.1.1 standard.The Java programs demonstrate data preprocessing and the basic mining functions. Additional Java samples demonstrate predictive analytics, import/export, and text mining.

See Also:

Oracle Data Mining Java API Reference (javadoc) and the Oracle Data Mining Application Developer's Guide for information on the Java API.

Java Program Summaries

Summary descriptions of the Java sample programs are provided in Table 8-3. For detailed descriptions, see the comments in the source code.

Table 8-3 Overview of the Java Sample Programs

Mining Function or Task Description

Classification

The classification programs demonstrate various preprocessing techniques and perform the following steps:

  • Build a classification model using training data

  • Display model details and settings

  • Test the model by applying the model on the test data

  • Compute test metrics, such as confusion matrix, lift, and ROC

  • Apply the model on the scoring data

  • Present apply results

  • Present ranked apply results, influenced by a cost matrix

The dmapplydemo.java program demonstrates several ways of applying a Naive Bayes model.

dmglcdemo.java illustrates GLM classification (binary logistic regression)

Regression

dmsvrdemo.java uses different test metrics, but otherwise performs most of the same steps used in the classification programs. Selected attributes of the input data are preprocessed (normalized).

dmglrdemo.java illustrates GLM regression (multivariate linear regression)

Association

dmardemo.java builds an association model and presents frequent itemsets and association rules as output. Selected attributes of the input data are preprocessed (binned).

Clustering

dmkmdemo.java (k-Means) and dmocdemo.java (0-Cluster) build clustering models and present cluster details, such as rules, centroid, and histogram for each cluster as output. The models are scored, and the probabilities associated with each cluster are returned as output. Selected attributes of the input data are preprocessed (normalized).

Feature extraction

dmnmdemo.java builds a feature extraction model and presents model details as the output. The model is scored, and each feature ID is associated with a probability. Selected attributes of the input data are preprocessed (normalized).

Attribute importance

dmaidemo.java builds an attribute importance model and presents a list of important attributes as the output of model details. Selected attributes of the input data are preprocessed (binned).

Data transformations

dmxfdemo.java demonstrates binning, clipping, and normalization transformations.

Predictive Analytics

dmpademo.java demonstrates PREDICT, EXPLAIN, and PROFILE functions.

Model import/export

dmexpimpdemo.java builds a Naive Bayes model, exports it to a dump file, then imports it from the dump file.


Text Mining Programs

Oracle Data Mining can mine text columns that have undergone pre-processing by Oracle Text routines.

Oracle Text is a technology for building text query and document classification applications. It provides indexing, word and theme searching, and viewing capabilities for text. Oracle Text is included in a general installation of Oracle Database Enterprise Edition, and therefore is already present in a database installed according to the instructions in Chapter 1.

The pre-processing steps for text mining create nested table columns of type DM_NESTED_NUMERICALS from columns of type VARCHAR2 or CLOB. Each row of the nested table specifies an attribute name and a value. The type definition is as follows.

CREATE OR REPLACE TYPE dm_nested_numerical AS OBJECT
  (attribute_name VARCHAR2(4000),
   value          NUMBER)
/
CREATE OR REPLACE TYPE dm_nested_numericals AS TABLE OF dm_nested_numerical

Terms extracted from text documents into nested tables can become generic attributes in training or scoring data.

Sample text mining programs in both PL/SQL and Java illustrate classification and feature extraction of a pre-processed text column.

Text Mining in PL/SQL

Three PL/SQL sample programs illustrate the process of text mining. One program illustrates the pre-processing that is required to prepare the data for mining. The other two programs build models that use the transformed text.

Text Transformation Demo

To prepare a column for text mining using the PL/SQL API, you must use Oracle Text routines to perform the following general steps:

  1. Create a domain index on the column.

  2. Use the index to extract terms from the column to a temporary table.

  3. Populate a column of type DM_NESTED_NUMERICALS with the terms in the temporary table.

The process of term extraction using Oracle Text is illustrated in the sample program dmtxtfe.sql. The source code contains extensive comments that explain the steps involved in transforming text into a set of features that can be mined using Oracle Data Mining.

More details about text transformation are provided in the Oracle Data Mining Application Developer's Guide.

Text Transformation for the PL/SQL Text Mining Sample Programs

The dmsh.sql script performs the text transformation required by the PL/SQL text mining demos. There are two such sample programs: dmtxtnmf.sql, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvm.sql, which builds a classification model using Support Vector Machine. Both of these programs use the following tables, which have a nested table column of comment data:

MINING_BUILD_NESTED_TEXT
MINING_TEST_NESTED_TEXT
MINING_APPLY_NESTED_TEXT

The Sample Text Mining Models (PL/SQL)

You can run the PL/SQL text mining demo programs, dmtxtnmf.sql and dmtxtsvm.sql, like the other PL/SQL programs. The models created by these programs are listed in the following example.

SQL> @ %ORACLE_HOME%\RDBMS\demo\dmtxtnmf.sql
SQL> @ %ORACLE_HOME%\RDBMS\demo\dmtxtsvm.sql
SQL> SELECT model_name, mining_function, algorithm
           FROM user_mining_models;MODEL_NAME         MINING_FUNCTION    ALGORITHM       ----------------   ------------------ ------------------------  
T_NMF_SAMPLE       FEATURE_EXTRACTION NONNEGATIVE_MATRIX_FACTOR T_SVM_CLAS_SAMPLE  CLASSIFICATION     SUPPORT_VECTOR_MACHINES   

Text Mining in Java

Two Java sample programs illustrate the process of text mining. One builds a feature extraction model, the other builds a classification model.

Text Transformation for the Sample Java Text Mining Programs

The Oracle Data Mining Java API provides an interface that handles the term extraction process. If you are developing data mining applications in Java, you do not need to use Oracle Text directly. However, you must ensure that Oracle Text is present in the database.

The OraTextTransform interface is used to perform text transformation within the Java text mining demos. There are two such sample programs: dmtxtnmfdemo.java, which builds a feature extraction model using Non-Negative Matrix Factorization, and dmtxtsvmdemo.java, which builds a classification model using Support Vector Machine. Both of these programs create build, test, and apply data sets from the following tables, which have a text column of comment data:

MINING_BUILD_TEXT
MINING_TEST_TEXT
MINING_APPLY_TEXT

The Sample Text Mining Models (Java)

You can run the Java text mining sample programs, dmtxtnmfdemo.java and dmtxtsvmdemo.java, like the other Java programs. The models created by these programs are shown in the following example.

java dmtxtnmfdemo host:port:SID dmuser password
java dmtxtsvmdemo host:port:SID dmuser password
SQLPLUS dmuser
Enter password: password
SQL> SELECT model_name, mining_function, algorithm
           FROM user_mining_models;MODEL_NAME         MINING_FUNCTION     ALGORITHM       ----------------   ------------------  ------------------------  
txtnmfModel_jdm    FEATURE_EXTRACTION  NONNEGATIVE_MATRIX_FACTOR txtsvmModel_jdm    CLASSIFICATION      SUPPORT_VECTOR_MACHINES   

The Sample Data

The dmsh.sql script creates views, tables, and indexes in the user's schema. The views define columns of customer data from tables in the SH schema. This data is used by the Data Mining sample programs. The tables reference the same columns in SH, but they include an extra COMMENTS column for text mining. The indexes are used to extract terms from the text in the COMMENTS column and build a nested table column.

Customer Data for Data Mining

Views in the data mining user's schema define columns of data from the CUSTOMERS, SALES, PRODUCTS, COUNTRIES, and SUPPLEMENTARY_DEMOGRAPHICS tables in the SH schema. You can list these views with the following SQL statements.

SQL>CONNECT dmuser
Enter password: password
SQL>SELECT view_name FROM user_views;

The views are listed in Table 8-4.

Table 8-4 Views Used by the Data Mining Sample Programs

View Name Description

MINING_DATA_APPLY_STR_V

Scoring data for o-cluster

MINING_DATA_BUILD_STR_V

Training data for o-cluster

MINING_DATA_APPLY_V

Scoring data for data mining (not text mining)

MINING_DATA_BUILD_V

Training data for data mining (not text mining)

MINING_DATA_TEST_V

Test data for data mining (not text mining)

MARKET_BASKET_V

Data for association rules

MINING_DATA_ONE_CLASS_V

Data for one-class SVM


You can see the references to tables in SH by listing the view definitions. The definition of the view MINING_DATA_BUILD_V is shown as follows.

SQL> set long 1000000
SQL> set longc 100000
SQL> set pagesize 100
SQL> SELECT text FROM all_views WHERE 
    owner='dmuser3'AND view_name='mining_data_build_v';

      SELECT a.CUST_ID, a.CUST_GENDER, 2003-a.CUST_YEAR_OF_BIRTH AGE, 
             a.CUST_MARITAL_STATUS, c.COUNTRY_NAME, a.CUST_INCOME_LEVEL,
             b.EDUCATION, b.OCCUPATION, b.HOUSEHOLD_SIZE, b.YRS_RESIDENCE,
             b.AFFINITY_CARD, b.BULK_PACK_DISKETTES, b.FLAT_PANEL_MONITOR,
             b.HOME_THEATER_PACKAGE, b.BOOKKEEPING_APPLICATION, 
             b.PRINTER_SUPPLIES, b.Y_BOX_GAMES, b.OS_DOC_SET_KANJI 
       FROM  sh.customers a, 
             sh.supplementary_demographics b, 
             sh.countries c 
       WHERE a.CUST_ID = b.CUST_ID AND a.country_id = c.country_id 
             AND a.cust_id between 101501 and 103000 

The views are used to build, test, and score the sample models. Each view has a CUSTOMER_ID column, which is the case ID, and an AFFINITY_CARD column, which is the target used by the predictive models. Most of the views provide data for 1500 customers (1500 rows). The view used by the One-Class SVM model has data for 940 customers.

The columns of training data in the MINING_DATA_BUILD_V view are listed in the following example.

SQL> DESCRIBE mining_data_build_v

CUST_ID                    NOT NULL            NUMBER
CUST_GENDER                NOT NULL            CHAR(1)
AGE                                            NUMBER
CUST_MARITAL_STATUS                            VARCHAR2(20)
COUNTRY_NAME               NOT NULL            VARCHAR2(40)
CUST_INCOME_LEVEL                              VARCHAR2(30)
EDUCATION                                      VARCHAR2(21)
OCCUPATION                                     VARCHAR2(21)
HOUSEHOLD_SIZE                                 VARCHAR2(21)
YRS_RESIDENCE                                  NUMBER
AFFINITY_CARD                                  NUMBER(10)
BULK_PACK_DISKETTES                            NUMBER(10)
FLAT_PANEL_MONITOR                             NUMBER(10)
HOME_THEATER_PACKAGE                           NUMBER(10)
BOOKKEEPING_APPLICATION                        NUMBER(10)
PRINTER_SUPPLIES                               NUMBER(10)
Y_BOX_GAMES                                    NUMBER(10)
OS_DOC_SET_KANJI                               NUMBER(10)

Market Basket Data for Association Rules

The association demos use the MARKET_BASKET_V data set, which includes columns of products from the PRODUCTS table and the CUSTOMER_ID column from the CUSTOMERS table in SH. The columns of the MARKET_BASKET_V view are listed in the following example.

SQL> DESCRIBE market_basket_v

CUST_ID                     NOT NULL         NUMBER
EXTENSION_CABLE                              NUMBER
FLAT_PANEL_MONITOR                           NUMBER
CD_RW_HIGH_SPEED_5_PACK                      NUMBER
ENVOY_256MB_40GB                             NUMBER
ENVOY_AMBASSADOR                             NUMBER
EXTERNAL_8X_CD_ROM                           NUMBER
KEYBOARD_WRIST_REST                          NUMBER
SM26273_BLACK_INK_CARTRIDGE                  NUMBER
MOUSE_PAD                                    NUMBER
MULTIMEDIA_SPEAKERS_3INCH                    NUMBER
OS_DOC_SET_ENGLISH                           NUMBER
SIMM_16MB_PCMCIAII_CARD                      NUMBER
STANDARD_MOUSE                               NUMBER 

Customer Data for Text Mining

The text mining demos use the same customer data from tables in SH, but they include either an extra text column or a collection type column. The collection type is a nested table of type DM_NESTED_NUMERICALS.

You can list these tables with the following SQL statements.

SQL>CONNECT dmuser
Enter password: password
SQL>SELECT table_name FROM user_tables WHERE table_name LIKE '%MINING%';

The text mining tables are listed in Table 8-5.

Table 8-5 Tables Used by the Text Mining Sample Programs

Table Name Description

MINING_APPLY_NESTED_TEXT

Apply table with COMMENTS column as DM_NESTED_NUMERICALS

MINING_BUILD_NESTED_TEXT

Build table with COMMENTS column as DM_NESTED_NUMERICALS

MINING_TEST_NESTED_TEXT

Test table with COMMENTS column as DM_NESTED_NUMERICALS

MINING_APPLY_TEXT

Apply table with COMMENTS column as VARCHAR2(4000)

MINING_BUILD_TEXT

Build table with COMMENTS column as VARCHAR2(4000)

MINING_TEST_TEXT

Test table with COMMENTS column as VARCHAR2(4000)


In the MINING_BUILD_TEXT, MINING_TEST_TEXT, and MINING_APPLY_TEXT tables, the COMMENTS column is of type VARCHAR2(4000).

SQL> DESCRIBE mining_build_text
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                   NOT NULL NUMBER
 CUST_GENDER                               NOT NULL CHAR(1)
 AGE                                                NUMBER
 CUST_MARITAL_STATUS                                VARCHAR2(20)
 COUNTRY_NAME                              NOT NULL VARCHAR2(40)
 CUST_INCOME_LEVEL                                  VARCHAR2(30)
 EDUCATION                                          VARCHAR2(21)
 OCCUPATION                                         VARCHAR2(21)
 HOUSEHOLD_SIZE                                     VARCHAR2(21)
 YRS_RESIDENCE                                      NUMBER
 AFFINITY_CARD                                      NUMBER(10)
 BULK_PACK_DISKETTES                                NUMBER(10)
 FLAT_PANEL_MONITOR                                 NUMBER(10)
 HOME_THEATER_PACKAGE                               NUMBER(10)
 BOOKKEEPING_APPLICATION                            NUMBER(10)
 PRINTER_SUPPLIES                                   NUMBER(10)
 Y_BOX_GAMES                                        NUMBER(10)
 OS_DOC_SET_KANJI                                   NUMBER(10)
 COMMENTS                                           VARCHAR2(4000)

In the MINING_*_NESTED_TEXT tables, the COMMENTS column is of type DM_NESTED_NUMERICALS.

SQL> DESCRIBE mining_build_nested_text
 Name                                      Null?    Type
 ----------------------------------------- -------- ----------------------------
 CUST_ID                                   NOT NULL NUMBER
 CUST_GENDER                               NOT NULL CHAR(1)
 AGE                                                NUMBER
 CUST_MARITAL_STATUS                                VARCHAR2(20)
 COUNTRY_NAME                              NOT NULL VARCHAR2(40)
 CUST_INCOME_LEVEL                                  VARCHAR2(30)
 EDUCATION                                          VARCHAR2(21)
 OCCUPATION                                         VARCHAR2(21)
 HOUSEHOLD_SIZE                                     VARCHAR2(21)
 YRS_RESIDENCE                                      NUMBER
 AFFINITY_CARD                                      NUMBER(10)
 BULK_PACK_DISKETTES                                NUMBER(10)
 FLAT_PANEL_MONITOR                                 NUMBER(10)
 HOME_THEATER_PACKAGE                               NUMBER(10)
 BOOKKEEPING_APPLICATION                            NUMBER(10)
 PRINTER_SUPPLIES                                   NUMBER(10)
 Y_BOX_GAMES                                        NUMBER(10)
 OS_DOC_SET_KANJI                                   NUMBER(10)
 COMMENTS                                           DM_NESTED_NUMERICALS

The process of extracting terms from a text column into a nested table column is described in "Text Mining Programs" and in Oracle Data Mining Application Developer's Guide.