|Oracle® Data Mining Application Developer's Guide,
10g Release 2 (10.2)
|PDF · Mobi · ePub|
You should ensure that a data set is properly defined before applying transformations to optimize it for a particular model. Data transformation techniques are not addressed in this chapter.
Oracle Data Mining Concepts for information about data transformations.
DBMS_DATA_MINING_TRANSFORM in Oracle Database PL/SQL Packages and Types Reference for information about data transformations in PL/SQL.
"Preparing the Data" for information about data transformations in Java.
Oracle Database SQL Reference for information about Oracle schema objects and data types.
This chapter contains the following topics:
The ODM collection types define tables of attribute name/value pairs. ODM data sets can include any number of these nested table columns in addition to scalar columns with built-in numeric or character data types. See "Nested Tables" for more information.
ODM uses features of Oracle Text to transform unstructured text columns to structured columns of type
DM_NESTED_NUMERICALS for mining. The ODM Java API provides the
OraTextTransform interface to manage the text transformation process for you. However, if you are using the PL/SQL API, you must use Oracle Text routines directly (See Chapter 5).
A model may additionally identify a case ID column, a target column, or both.
A case ID column holds a unique identifier for each record (row) of data. The case ID must be specified at model build time for all algorithms except Decision Tree. If a case ID is present in a Decision Tree model, it is not considered a possible predictor.
The case ID column can be of type
NUMBER, and its maximum length is 128 bytes.
Predictive algorithms (Classification, Regression, and Attribute Importance) require that one column be designated as a target. The name of the target column is supplied as an argument when the model is created. The target column holds the predictions generated by the model. The target column must be of type
FLOAT. SVM Regression supports only numeric targets. One-Class SVM does not use a target.
ODM interprets attributes as either categorical or numerical.
Categorical attributes are values, such as gender or job title, that belong to a category or domain. Values of a categorical attribute do not have a meaningful order. Categorical attributes have character data types.
Numerical attributes are values, such as age or income, that fall within a continuum. Numerical attributes represent interval data that has a measurable order. Numerical attributes have numeric data types.
If the column data type is incompatible with the attribute type, you must convert the data type. For example, an application might use postal codes as a categorical attribute, but the data is actually stored in a numeric column. In this case, you would use the
TO_CHAR function to convert the column to a character data type.
If your mining data includes
TIMESTAMP columns, and you are not using Predictive Analytics, you must convert those columns to numeric or character data types. In most cases, these data types should be converted to
NUMBER, but you should evaluate each case individually. If, for example, the date serves as a timestamp indicating when a transaction occurred, converting the date to
VARCHAR2 makes it categorical with unique values, one in each record. This kind of column is known as an identifier and is not useful in model building. However, if the date values are coarse and significantly fewer than the number of records (for example, they might indicate the week or month when an item was purchased), it may be useful to use character values.
You can convert dates to numbers by selecting a starting date and subtracting it from each date value. This process results in a
NUMBER column. Another approach would be to parse the date and distribute its components over several columns. This is the conversion method used by Predictive Analytics.
Predictive Analytics interprets
DATE data and all forms of
TIMESTAMP data, including
TIMESTAMP WITH TIMEZONE and
TIMESTAMP WITH LOCAL TIMEZONE, as a set of numerical attributes. For example, a column named
TIMECOL would be transformed into attributes for year, month, week, day of year, day of month, day of week, hour, and minute. Each attribute would be named
x is the suffix used to convert the date into a number. For example, the name of the year attribute would be
The attributes resulting from
TIMESTAMP data are visible in the results of an
EXPLAIN operation. They are not visible in the results of a
Schema object names can be quoted or nonquoted identifiers from one to thirty bytes long. Nonquoted identifiers are not case sensitive; Oracle converts them to uppercase. Nonquoted identifiers can consist of alphanumeric characters and the underscore (_), dollar sign ($), and pound sign (#). The initial character must be alphabetic. Quoted identifiers are case sensitive and can contain most characters.
See Also:Oracle Database SQL Reference for information on schema object naming requirements.
ODM accepts data in single-record case format, where all the information (attributes) concerning an individual is contained in one row. Single-record case, also known as non-transactional format, is illustrated in the table in Example 2-1. This table contains descriptive information about customers.
CUSTOMER_ID is the case ID column.
CUSTOMER_ID GENDER AGE MARITAL_STATUS ------------ ------ --- -------------- 1 Male 30 Married 2 Female 35 Single 3 Male 21 Single 4 Female 20 Single 5 Female 35 Married
Sometimes data is organized in multi-record case, also known as transactional, format. A typical example is market basket data. In transactional format, the data pertaining to an individual is distributed across multiple records. The table in Example 2-2 illustrates transactional format. This table contains information about products purchased by a group of customers on a given day.
CUSTOMER_ID is the case ID column.
CUST_ID PROD_ID PROD_NAME -------- ------- --------- 1 524 brand x icecream 1 530 brand y frozen dinners 1 109 brand z dog food 2 578 brand a orange juice 2 191 brand x frozen dinners
ODM does not support multi-record case format. However, there could be circumstances in which you want to construct a model using transactional data. For example, you might want to use transactional data like that in Example 2-2 to predict the products that each customer is likely to buy on his next visit to the store. Discount coupons for these or similar products could then be generated with the checkout receipt.
If you want to construct a model using transactional data, you must first convert the data to single-record case. You must do this by defining columns of nested tables using the ODM fixed collection types,
DM_NESTED_CATEGORICALS. These types define collections of numerical attributes and categorical attributes respectively. The data type descriptions are shown as follows.
SQL> describe dm_nested_numerical Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE NUMBER SQL> describe dm_nested_numericals DM_NESTED_NUMERICALS TABLE OF DMSYS.DM_NESTED_NUMERICAL Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE NUMBER SQL> describe dm_nested_categorical Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE VARCHAR2(4000) SQL> describe dm_nested_categoricals DM_NESTED_CATEGORICALS TABLE OF DMSYS.DM_NESTED_CATEGORICAL Name Null? Type ----------------------------------------- -------- ---------------------------- ATTRIBUTE_NAME VARCHAR2(30) VALUE VARCHAR2(4000)
For a given case identifier, attribute names must be unique across all the collections and individual columns. The fixed collection types enforce this requirement. However, the attribute naming requirements, described in "Attribute Names", do not apply to the attribute_name column of a nested table.
The attributes in Example 2-2 could be stored in nested table columns, as illustrated in Example 2-3. The column
PRODUCT_IDENTIFIERS is of type
DM_NESTED_NUMERICALS, and the column
PRODUCT_NAMES is of type
|attribute_name value||attribute_name value|
You can create an object view that presents several sources of transactional data (implemented with nested table columns) as a single data set for data mining. See "Example: Multi-Record Collections With an Object View".
Apart from the benefit of providing all your mining attributes through a single row-source without impacting their physical data storage, the view acts as a join specification on the underlying tables that can be used by the server for efficiently accessing your data.
Note:Oracle recommends that you perform any necessary data transformations on the base tables before building object views. In this way, all attributes are transformed in a similar way. In most cases, attributes in transactional format are of the same scale, and thus this approach works. Otherwise, you can split the data into sets of similar items and then transform them separately.
DBMS_DATA_MINING_TRANSFORM in Oracle Database PL/SQL Packages and Types Reference for information about data transformations using PL/SQL. See "Preparing the Data" for information about data transformations using Java.
A real-world example of an analytical pipeline for brain tumor research illustrates multi-case collections with an object view. The underlying tables store gene expression data and clinical data about the patient.
The fact table,
GENE_EXPRESSION_DATA, stores gene expression data. It has the following columns.
case_ID NUMBER gene VARCHAR2(30) expr NUMBER
The dimension table,
CLINICAL_DATA_TABLE, stores clinical patient data. It has the following columns.
case_ID NUMBER name VARCHAR2(30) type VARCHAR2(30) subtype VARCHAR2(30) gender CHAR(1) age NUMBER status VARCHAR2(30)
In this example, we want to create a model that predicts status based on gender, age, and gene expression. The build data for the model is an object view that uses columns of clinical patient data and a nested column of gene expression data. The view will have the following columns.
case_id NUMBER gender CHAR(1) age NUMBER gene_expr DM_NESTED_NUMERICALS status VARCHAR2(30)
The following statement constructs the object view
gene_expr_build, which can be used as build data for the model.
CREATE OR REPLACE VIEW gene_expr_build AS SELECT C.case_id, C.gender, C.age, CAST(MULTISET( SELECT gene, expr FROM gene_expression_data WHERE case_id = C.case_id) AS DM_NESTED_NUMERICALS ) gene_expr, C.status FROM clinical_data_table C
For a table with several columns, the key question to consider is the (average) row length, not the number of columns. Having more than 255 columns in a table built with a smaller block size typically results in intrablock chaining.
Oracle stores multiple row pieces in the same block, but the overhead to maintain the column information is minimal as long as all row pieces fit in a single data block. If the rows don't fit in a single data block, you may consider using a larger database block size (or use multiple block sizes in the same database).
See Also:Oracle Database Performance Tuning Guide for more details.