3.8 Prepare the Case Table
The first step in preparing data for machine learning is the creation of a case table.
If all the data resides in a single table and all the information for each case (record) is included in a single row (single-record case), this process is already taken care of. If the data resides in several tables, creating the data source involves the creation of a view. For the sake of simplicity, the term "case table" is used here to refer to either a table or a view.
- Convert Column Data Types
In OML, string columns are treated as categorical, number columns as numerical, andBOOLEAN
columns are treated as numerical. If you have a numeric column that you want to be treated as a categorical, you must convert it to a string. For example, the day number of the week. - Extract Datetime Column Values
You can extract values from a datatime or interval value using theEXTRACT
function. - Text Transformation
Learn text processing using Oracle Machine Learning for SQL. - About Business and Domain-Sensitive Transformations
Understand why you need to transform data according to business problems. - Create Nested Columns
In transactional data, the information for each case is contained in multiple rows. When the data source includes transactional data (multi-record case), the transactions must be aggregated to the case level in nested columns.
Parent topic: Prepare the Data
3.8.1 Convert Column Data Types
In OML, string columns are treated as categorical, number columns as numerical, and BOOLEAN
columns are treated as numerical. If you have a numeric column that you want to be treated as a categorical, you must convert it to a string. For example, the day number of the week.
For example, zip codes identify different postal zones; they do not imply order. If the zip codes are stored in a numeric column, they are interpreted as a numeric attribute. You must convert the data type so that the column data can be used as a categorical attribute by the model. You can do this using the TO_CHAR
function to convert the digits 1-9 and the LPAD
function to retain the leading 0, if there is one.
LPAD(TO_CHAR(ZIPCODE),5,'0')
The attributes with the data type BOOLEAN
are treated as numeric with the following values: TRUE
means 1
, FALSE
means 0
, and NULL
is interpreted as an unknown value. The CASE_ID_COLUMN_NAME
attribute does not support BOOLEAN
data type.
Parent topic: Prepare the Case Table
3.8.2 Extract Datetime Column Values
You can extract values from a datatime or interval value using the EXTRACT
function.
The EXTRACT
function extracts and returns the value of a specified datetime field from a datetime or interval value expression. The values that can be extracted are YEAR
, MONTH
, DAY
, HOUR
, MINUTE
, SECOND
, TIMEZONE_HOUR
, TIMEZONE_MINUTE
, TIMEZONE_REGION
, and TIMEZONE_ABBR
.
sales_ts
view. The sales_ts
view contains CUST_ID
and TIME_STAMP
columns. select cust_id, time_stamp,
extract(year from time_stamp) year,
extract(month from time_stamp) month,
extract(day from time_stamp) day_of_month,
to_char(time_stamp,'ww') week_of_year,
to_char(time_stamp,'D') day_of_week,
extract(hour from time_stamp) hour,
extract(minute from time_stamp) minute,
extract(second from time_stamp) second
from sales_ts
Parent topic: Prepare the Case Table
3.8.3 Text Transformation
Learn text processing using Oracle Machine Learning for SQL.
You can use OML4SQL to process text. Columns of text in the case table can be processed once they have undergone the proper transformation.
The text column must be in a table, not a view. The transformation process uses several features of Oracle Text; it treats the text in each row of the table as a separate document. Each document is transformed to a set of text tokens known as terms, which have a numeric value and a text label. The text column is transformed to a nested column of DM_NESTED_NUMERICALS
.
Parent topic: Prepare the Case Table
3.8.4 About Business and Domain-Sensitive Transformations
Understand why you need to transform data according to business problems.
Some transformations are dictated by the definition of the business problem. For example, you want to build a model to predict high-revenue customers. Since your revenue data for current customers is in dollars you need to define what "high-revenue" means. Using some formula that you have developed from past experience, you can recode the revenue attribute into ranges Low, Medium, and High before building the model.
Another common business transformation is the conversion of date information into elapsed time. For example, date of birth can be converted to age.
Domain knowledge can be very important in deciding how to prepare the data. For example, some algorithms produce unreliable results if the data contains values that fall far outside of the normal range. In some cases, these values represent errors or unusualities. In others, they provide meaningful information.
Related Topics
Parent topic: Prepare the Case Table
3.8.5 Create Nested Columns
In transactional data, the information for each case is contained in multiple rows. When the data source includes transactional data (multi-record case), the transactions must be aggregated to the case level in nested columns.
An example is sales data in a star schema when machine learning at the product level. Sales is stored in many rows for a single product (the case) because the product is sold in many stores to many customers over a period of time.
See Also:
Using Nested Data for information about converting transactional data to nested columns
Parent topic: Prepare the Case Table