35.1 Data Requirements
Understand how data is stored and viewed for Oracle Machine Learning.
Machine learning activities require data that is defined within a single table or view. The information for each record must be stored in a separate row. The data records are commonly called cases. Each case can optionally be identified by a unique case ID. The table or view itself can be referred to as a case table.
The CUSTOMERS
table in the
SH
schema is an example of a table that
could be used for machine learning. All the
information for each customer is contained in a single row. The
case ID is the CUST_ID
column. The rows listed in
the following example are
selected from SH.CUSTOMERS
.
Note:
Oracle Machine Learning requires single-record case data for all types of models except association models, which can be built on native transactional data.
Example 35-1 Sample Case Table
select cust_id, cust_gender, cust_year_of_birth, cust_main_phone_number from sh.customers where cust_id < 11;
The output is as follows:
CUST_ID CUST_GENDER CUST_YEAR_OF_BIRTH CUST_MAIN_PHONE_NUMBER
------- ----------- ---- ------------- -------------------------
1 M 1946 127-379-8954
2 F 1957 680-327-1419
3 M 1939 115-509-3391
4 M 1934 577-104-2792
5 M 1969 563-667-7731
6 F 1925 682-732-7260
7 F 1986 648-272-6181
8 F 1964 234-693-8728
9 F 1936 697-702-2618
10 F 1947 601-207-4099
Related Topics
35.1.1 Column Data Types
Understand the different types of column data in a case table.
The columns of the case table hold the attributes that describe each case. In Example 35-1, the attributes are: CUST_GENDER
, CUST_YEAR_OF_BIRTH
, and CUST_MAIN_PHONE_NUMBER
. The attributes are the predictors in a supervised model or the descriptors in an unsupervised model. The case ID, CUST_ID
, can be viewed as a special attribute; it is not a predictor or a descriptor.
OML4SQL supports standard Oracle data types except DATE
, TIMESTAMP
, RAW
, and LONG
. Oracle Machine Learning supports date type (datetime, date, timestamp) for case_id, CLOB
/BLOB
/FILE
that are interpreted as text columns, and the following collection types as well:
DM_NESTED_CATEGORICALS
DM_NESTED_NUMERICALS
DM_NESTED_BINARY_DOUBLES
DM_NESTED_BINARY_FLOATS
Note:
The attributes with the data type BOOLEAN
are treated as numeric with the following values: TRUE
means 1
, FALSE
means 0
, and NULL
is interpreted as an unknown value. The CASE_ID_COLUMN_NAME
attribute does not support BOOLEAN
data type.
35.1.2 Data Sets for Classification and Regression
Understand how data sets are used for training and testing the model.
You need two case tables to build and validate classification and regression models. One set of rows is used for training the model, another set of rows is used for testing the model. It is often convenient to derive the build data and test data from the same data set. For example, you could randomly select 60% of the rows for training the model; the remaining 40% could be used for testing the model.
Models that implement other machine learning functions, such as attribute importance, clustering, association, or feature extraction, do not use separate test data.
35.1.3 Scoring Requirements
Learn how scoring is done in Oracle Machine Learning for SQL.
Most machine learning models can be applied to separate data in a process known as scoring. Oracle Machine Learning for SQL supports the scoring operation for classification, regression, anomaly detection, clustering, and feature extraction.
The scoring process matches column names in the scoring data with the names of the columns that were used to build the model. The scoring process does not require all the columns to be present in the scoring data. If the data types do not match, OML4SQL attempts to perform type coercion. For example, if a column called PRODUCT_RATING
is VARCHAR2
in the training data but NUMBER
in the scoring data, OML4SQL effectively applies a TO_CHAR()
function to convert it.
The column in the test or scoring data must undergo the same transformations as the corresponding column in the build data. For example, if the AGE
column in the build data was transformed from numbers to the values CHILD
, ADULT
, and SENIOR
, then the AGE
column in the scoring data must undergo the same transformation so that the model can properly evaluate it.
Note:
OML4SQL can embed user-specified transformation instructions in the model and reapply them whenever the model is applied. When the transformation instructions are embedded in the model, you do not need to specify them for the test or scoring data sets.
OML4SQL also supports Automatic Data Preparation (ADP). When ADP is enabled, the transformations required by the algorithm are performed automatically and embedded in the model along with any user-specified transformations.
See Also:
Automatic Data Preparation and Embed Transformations in a Model for more information on automatic and embedded data transformations