Understand how data is stored and viewed for data mining.
Data mining activities require data that is defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can optionally be identified by a unique case ID. The table or view itself can be referred to as a case table.
CUSTOMERS table in the
SH schema is an example of a table that could be used for mining. All the information for each customer is contained in a single row. The case ID is the
CUST_ID column. The rows listed in the following example are selected from
Example 3-1 Sample Case Table
SQL> select cust_id, cust_gender, cust_year_of_birth, cust_main_phone_number from sh.customers where cust_id < 11; CUST_ID CUST_GENDER CUST_YEAR_OF_BIRTH CUST_MAIN_PHONE_NUMBER ------- ----------- ---- ------------- ------------------------- 1 M 1946 127-379-8954 2 F 1957 680-327-1419 3 M 1939 115-509-3391 4 M 1934 577-104-2792 5 M 1969 563-667-7731 6 F 1925 682-732-7260 7 F 1986 648-272-6181 8 F 1964 234-693-8728 9 F 1936 697-702-2618 10 F 1947 601-207-4099
Column Data Types
Understand the different types of column data in a case table.
The columns of the case table hold the attributes that describe each case. In Example 3-1, the attributes are:
CUST_MAIN_PHONE_NUMBER. The attributes are the predictors in a supervised model or the descriptors in an unsupervised model. The case ID,
CUST_ID, can be viewed as a special attribute; it is not a predictor or a descriptor.
Data Sets for Classification and Regression
Understand how data sets are used for training and testing the model.
You need two case tables to build and validate classification and regression models. One set of rows is used for training the model, another set of rows is used for testing the model. It is often convenient to derive the build data and test data from the same data set. For example, you could randomly select 60% of the rows for training the model; the remaining 40% could be used for testing the model.
Models that implement other mining functions, such as attribute importance, clustering, association, or feature extraction, do not use separate test data.
Most data mining models can be applied to separate data in a process known as scoring. Oracle Data Mining supports the scoring operation for classification, regression, anomaly detection, clustering, and feature extraction.
The scoring process matches column names in the scoring data with the names of the columns that were used to build the model. The scoring process does not require all the columns to be present in the scoring data. If the data types do not match, Oracle Data Mining attempts to perform type coercion. For example, if a column called
VARCHAR2 in the training data but
NUMBER in the scoring data, Oracle Data Mining effectively applies a
TO_CHAR() function to convert it.
The column in the test or scoring data must undergo the same transformations as the corresponding column in the build data. For example, if the
AGE column in the build data was transformed from numbers to the values
SENIOR, then the
AGE column in the scoring data must undergo the same transformation so that the model can properly evaluate it.
Oracle Data Mining can embed user-specified transformation instructions in the model and reapply them whenever the model is applied. When the transformation instructions are embedded in the model, you do not need to specify them for the test or scoring data sets.
Oracle Data Mining also supports Automatic Data Preparation (ADP). When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model along with any user-specified transformations.
Transforming the Data for more information on automatic and embedded data transformations