About Transformations
Understand how you can transform data by using Automatic Data Preparation (ADP) and embedded data transformation.
A transformation is a SQL expression that modifies the data in one or more columns. Data must typically undergo certain transformations before it can be used to build a model. Many Oracle Machine Learning algorithms have specific transformation requirements. Before data can be scored, it must be transformed in the same way that the training data was transformed.
Oracle Machine Learning for SQL supports ADP, which automatically implements the transformations required by the algorithm. The transformations are embedded in the model and automatically run whenever the model is applied.
If additional transformations are required, you can specify them as SQL expressions and supply them as input when you create the model. These transformations are embedded in the model as they are with ADP.
With automatic and embedded data transformation, most of the work of data preparation is handled for you. You can create a model and score multiple data sets in a few steps:
- 
                     Identify the columns to include in the case table. 
- 
                     Create nested columns if you want to include transactional data. 
- 
                     Write SQL expressions for any transformations not handled by ADP. 
- 
                     Create the model, supplying the SQL expressions (if specified) and identifying any columns that contain text data. 
- 
                     Ensure that some or all of the columns in the scoring data have the same name and type as the columns used to train the model. 
See Also:
OML provides algorithm-specific automatic data preparation and other model building-related featuresRelated Topics
Embed Transformations in a Model
You can specify your own transformations and embed them in a model by creating a transformation list and passing it to DBMS_DATA_MINING.CREATE_MODEL2 or DBMS_DATA_MINING.CREATE_MODEL.
                  
The transformation instructions are embedded in the model and reapplied whenever the model is applied to new data.
The schema of how you can use xform_list to embed your
                                                  transformations is shown here with
                                                  CREATE_MODEL procedure.
                     
DBMS_DATA_MINING.CREATE_MODEL2 (
model_name 		IN VARCHAR2,
mining_function 	  IN VARCHAR2,
data_query 		IN CLOB,
set_list 		  IN SETTING_LIST,
case_id_column_name      IN VARCHAR2 DEFAULT NULL,
target_column_name       IN VARCHAR2 DEFAULT NULL,
xform_list 		IN TRANSFORM_LIST DEFAULT NULL);
DBMS_DATA_MINING.CREATE_MODEL(
                  model_name           IN VARCHAR2,
                  mining_function      IN VARCHAR2,
                  data_table_name      IN VARCHAR2,
                  case_id_column_name  IN VARCHAR2,
                  target_column_name   IN VARCHAR2 DEFAULT NULL,
                  settings_table_name  IN VARCHAR2 DEFAULT NULL,
                  data_schema_name     IN VARCHAR2 DEFAULT NULL,
                  settings_schema_name IN VARCHAR2 DEFAULT NULL,
                  xform_list           IN TRANSFORM_LIST DEFAULT NULL);The following examples show how to create an
                                                  embedded transform list with
                                                  CREATE_MODEL and
                                                  CREATE_MODEL2 procedures.
                     
Here is an example with DBMS_DATA_MINING.CREATE_MODEL procedure:
                     
BEGIN 
DBMS_DATA_MINING.DROP_MODEL('model_sample2'); 
EXCEPTION WHEN OTHERS THEN NULL; 
END;
/
CREATE TABLE sett_table (SETTING_NAME  VARCHAR2(30),
                                    SETTING_VALUE VARCHAR2(4000));
 
BEGIN       
   INSERT INTO sett_table (SETTING_NAME, SETTING_VALUE) VALUES ('KMNS_DISTANCE','KMNS_EUCLIDEAN');
   INSERT INTO sett_table (SETTING_NAME, SETTING_VALUE) VALUES ('PREP_AUTO','ON');
   INSERT INTO sett_table (SETTING_NAME, SETTING_VALUE) VALUES ('KMNS_DETAILS', 'KMNS_DETAILS_ALL');
END;
DECLARE
  xformlist dbms_data_mining_transform.TRANSFORM_LIST;
BEGIN
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'N_TRANS_ATM', null, 'TO_CHAR(N_TRANS_ATM)', null);
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'BANK_FUNDS', null, 'BANK_FUNDS+BANK_FUNDS+BANK_FUNDS', null);
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'AGE', null, 'log(10,AGE+1)', 'power(10, AGE)-1');
 DBMS_DATA_MINING.CREATE_MODEL(
    model_name          => 'model_sample2',
    mining_function     => dbms_data_mining.clustering,
    data_table_name     => 'INSUR_CUST_LTV',
    case_id_column_name => 'customer_id',
    settings_table_name => 'sett_table',
    xform_list          => xformlist);
END;
DBMS_DATA_MINING.CREATE_MODEL2
        procedure:DECLARE 
  xformlist dbms_data_mining_transform.TRANSFORM_LIST;
  v_setlst DBMS_DATA_MINING.SETTING_LIST;
BEGIN
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'N_TRANS_ATM', null, 'TO_CHAR(N_TRANS_ATM)', null);
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'BANK_FUNDS', null, 'BANK_FUNDS+BANK_FUNDS+BANK_FUNDS', null);
  dbms_data_mining_transform.SET_TRANSFORM(xformlist, 'AGE', null, 'log(10,AGE+1)', 'power(10, AGE)-1');
  
  v_setlst('ALGO_NAME') := 'ALGO_KMEANS';
 DBMS_DATA_MINING.CREATE_MODEL2(
    model_name          => 'model_sample3',
    mining_function     => 'CLUSTERING',
    data_query          => 'select * from INSUR_CUST_LTV',
    set_list            => v_setlst,
    case_id_column_name => 'customer_id',
    xform_list          => xformlist);
END;
Build a Transformation List
You can build transformation list by SET_TRANSFORM, STACK, and  GET_* methods. These methods are listed here.
                     
A transformation list is a collection of transformation records. When a new transformation record is added, it is appended to the top of the transformation list. You can use any of the following methods to build a transformation list:
- 
                              The SET_TRANFORMprocedure inDBMS_DATA_MINING_TRANSFORM
- 
                              The STACKinterface inDBMS_DATA_MINING_TRANSFORM
- 
                              The GET_MODEL_TRANSFORMATIONSandGET_TRANSFORM_LISTfunctions inDBMS_DATA_MINING
SET_TRANSFORM
The SET_TRANSFORM procedure applies a specified SQL expression to a specified attribute. 
                        
The SET_TRANSFORM procedure adds a single transformation record to a transformation list.
                           
DBMS_DATA_MINING_TRANSFORM.SET_TRANSFORM (
          xform_list               IN OUT NOCOPY TRANSFORM_LIST,
          attribute_name           VARCHAR2,
          attribute_subname        VARCHAR2,
          expression               VARCHAR2,
          reverse_expression       VARCHAR2,
          attribute_spec           VARCHAR2 DEFAULT NULL);
SQL expressions that you specify with SET_TRANSFORM must fit within a VARCHAR2. To specify a longer expression, you can use the SET_EXPRESSION procedure, which builds an expression by appending rows to a VARCHAR2 array. For example, the following statement appends a transformation instruction for country_id to a list of transformations called my_xforms. The transformation instruction divides country_id by 10 before algorithmic processing begins. The reverse transformation multiplies country_id by 10.
                           
  dbms_data_mining_transform.SET_TRANSFORM (my_xforms,
     'country_id', NULL, 'country_id/10', 'country_id*10');
The reverse transformation is applied in the model details. If country_id is the target of a supervised model, the reverse transformation is also applied to the scored target.
                           
The STACK Interface
The STACK interface creates transformation records from a table of transformation instructions and adds them to a transformation list. 
                        
The STACK interface offers a set of pre-defined transformations that you can apply to an attribute or to a group of attributes. For example, you can specify supervised binning for all categorical attributes.
                        
The STACK interface specifies that all or some of the attributes of a given type must be transformed in the same way. For example, STACK_BIN_CAT appends binning instructions for categorical attributes to a transformation list. The STACK interface consists of three steps:
                        
- 
                              A CREATEprocedure creates a transformation definition table. For example,CREATE_BIN_CATcreates a table to hold categorical binning instructions. The table has columns for storing the name of the attribute, the value of the attribute, and the bin assignment for the value.
- 
                              An INSERTprocedure computes the bin boundaries for one or more attributes and populates the definition table. For example,INSERT_BIN_CAT_FREQperforms frequency-based binning on some or all of the categorical attributes in the data source and populates a table created byCREATE_BIN_CAT.
- 
                              A STACKprocedure creates transformation records from the information in the definition table and appends the transformation records to a transformation list. For example,STACK_BIN_CATcreates transformation records for the information stored in a categorical binning definition table and appends the transformation records to a transformation list.
GET_MODEL_TRANSFORMATIONS and GET_TRANSFORM_LIST
Use the functions to create a new transformation list.
These two functions can be used to create a new transformation list from the transformations embedded in an existing model.
The GET_MODEL_TRANSFORMATIONS function returns a list of embedded transformations.
                           
DBMS_DATA_MINING.GET_MODEL_TRANSFORMATIONS (
      model_name     IN VARCHAR2)
RETURN DM_TRANSFORMS PIPELINED;
GET_MODEL_TRANSFORMATIONS returns a table of dm_transform objects. Each dm_transform has these fields
                           
attribute_name VARCHAR2(4000) attribute_subname VARCHAR2(4000) expression CLOB reverse_expression CLOB
The components of a transformation list are transform_rec, not dm_transform. The fields of a transform_rec are described in Table 3-7. You can call GET_MODEL_TRANSFORMATIONS to convert a list of dm_transform objects to transform_rec objects and append each transform_rec to a transformation list. 
                           
DBMS_DATA_MINING.GET_TRANSFORM_LIST (
      xform_list           OUT NOCOPY TRANSFORM_LIST,
      model_xforms         IN  DM_TRANSFORMS);See Also:
"DBMS_DATA_MINING_TRANSFORM Operational Notes", "SET_TRANSFORM Procedure", "CREATE_MODEL Procedure", and "GET_MODEL_TRANSFORMATIONS Function" in Oracle AI Database PL/SQL Packages and Types Reference
Transformation List and Automatic Data Preparation
You can provide transformation list and Automatic Data Preparation (ADP) to customize the data transformation.
The transformation list argument to CREATE_MODEL2 and CREATE_MODEL interacts with the PREP_AUTO setting, which controls ADP: 
                     
- 
                           When ADP is on and you specify a transformation list, your transformations are applied with the automatic transformations and embedded in the model. The transformations that you specify are processed before the automatic transformations. 
- 
                           When ADP is off and you specify a transformation list, your transformations are applied and embedded in the model, but no system-generated transformations are performed. 
- 
                           When ADP is on and you do not specify a transformation list, the system-generated transformations are applied and embedded in the model. 
- 
                           When ADP is off and you do not specify a transformation list, no transformations are embedded in the model; you must separately prepare the data sets you use for building, testing, and scoring the model. 
Specify Transformation Instructions for an Attribute
You can pass transformation instructions for an attribute by defining a transformation list.
A transformation list is defined as a table of transformation records. Each record (transform_rec) specifies the transformation instructions for an attribute.
                        
TYPE transform_rec IS RECORD (
    attribute_name      VARCHAR2(30),
    attribute_subname   VARCHAR2(4000),
    expression          EXPRESSION_REC,
    reverse_expression  EXPRESSION_REC,
    attribute_spec      VARCHAR2(4000));
The fields in a transformation record are described in this table.
Table 3-7 Fields in a Transformation Record for an Attribute
| Field | Description | 
|---|---|
| 
 | These fields identify the attribute, as described in "Scoping of Model Attribute Name" | 
| 
 | A SQL expression for transforming the attribute. For example, this expression transforms the age attribute into two categories: child and adult:[0,19) for 'child' and [19,) for adult CASE WHEN age < 19 THEN 'child' ELSE 'adult' Expression and reverse expressions are stored in  | 
| 
 | A SQL expression for reversing the transformation. For example, this expression reverses the transformation of the age attribute: DECODE(age,'child','(-Inf,19)','[19,Inf)') | 
| 
 | Specifies special treatment for the attribute. The  
 See Example 3-5 and Example 3-6. | 
Related Topics
Expression Records
Example of a transformation record.
The transformation expressions in a transformation record are expression_rec objects. 
                           
TYPE expression_rec IS RECORD (
     lstmt       DBMS_SQL.VARCHAR2A,
     lb          BINARY_INTEGER DEFAULT 1,
     ub          BINARY_INTEGER DEFAULT 0);
TYPE varchar2a IS TABLE OF VARCHAR2(32767)
INDEX BY BINARY_INTEGER;
The lstmt field stores a VARCHAR2A, which allows transformation expressions to be very long, as they can be broken up across multiple rows of VARCHAR2. Use the DBMS_DATA_MINING_TRANSFORM.SET_EXPRESSION procedure to create an expression_rec.
                           
Attribute Specifications
Learn how to define the characteristics specific to an attribute through attribute specification.
The attribute specification in a transformation record defines characteristics that are specific to this attribute. If not null, the attribute specification can include values FORCE_IN, NOPREP, or TEXT, as described in Table 3-7.
                           
Example 3-5 An Attribute Specification with Multiple Keywords
If more than one attribute specification keyword is applicable, you can provide them in a comma-delimited list. The following expression is the specification for an attribute in a GLM model. Assuming that the ftr_selection_enable setting is enabled, this expression forces the attribute to be included in the model. If ADP is on, automatic transformation of the attribute is not performed. 
                           
"FORCE_IN,NOPREP"
Example 3-6 A Text Attribute Specification
For text attributes, you can optionally specify subsettings POLICY_NAME, TOKEN_TYPE, and MAX_FEATURES. The subsettings provide configuration information that is specific to text transformation. In this example, the transformation instructions for the text content are defined in a text policy named my_policy with token type is THEME. The maximum number of extracted features is 3000.
                           
"TEXT(POLICY_NAME:my_policy)(TOKEN_TYPE:THEME)(MAX_FEATURES:3000)"
Related Topics
Oracle Machine Learning for SQL Transformation Routines
Learn about transformation routines.
OML4SQL provides routines that implement various
                transformation techniques in the DBMS_DATA_MINING_TRANSFORM
                package.
                        
Related Topics
Binning Routines
Explains binning techniques in Oracle Machine Learning for SQL.
A number of factors go into deciding a binning strategy. Having fewer values typically leads to a more compact model and one that builds faster, but it can also lead to some loss in accuracy.
Model quality can improve significantly with well-chosen bin boundaries. For example, an appropriate way to bin ages is to separate them into groups of interest, such as children 0-13, teenagers 13-19, youth 19-24, working adults 24-35, and so on.
The following table lists the binning techniques provided by Oracle Machine Learning for SQL:
Table 3-8 Binning Methods in DBMS_DATA_MINING_TRANSFORM
Related Topics
Normalization Routines
Learn about normalization routines in Oracle Machine Learning for SQL.
Most normalization methods map the range of a single attribute to another range, typically 0 to 1 or -1 to +1.
Normalization is very sensitive to outliers. Without outlier treatment, most values are mapped to a tiny range, resulting in a significant loss of information.
Table 3-9 Normalization Methods in DBMS_DATA_MINING_TRANSFORM
| Transformation | Description | 
|---|---|
| This technique computes the normalization of an attribute using the minimum and maximum values. The shift is the minimum value, and the scale is the difference between the maximum and minimum values. | |
| This normalization technique also uses the minimum and maximum values. For scale normalization, shift = 0, and scale = max{abs(max), abs(min)}. | |
| This technique computes the normalization of an attribute using the mean and the standard deviation. Shift is the mean, and scale is the standard deviation. | 
Related Topics
Outlier Treatment
Understand what you must do to treat outliers.
A value is considered an outlier if it deviates significantly from most other values in the column. The presence of outliers can have a skewing effect on the data and can interfere with the effectiveness of transformations such as normalization or binning.
Outlier treatment methods such as trimming or clipping can be implemented to minimize the effect of outliers.
Outliers represent problematic data, for example, a bad reading due to the unusual condition of an instrument. However, in some cases, especially in the business arena, outliers are perfectly valid. For example, in census data, the earnings for some of the richest individuals can vary significantly from the general population. Do not treat this information as an outlier, since it is an important part of the data. You need domain knowledge to determine outlier handling.
Routines for Outlier Treatment
Understand the transformations used for outlier treatment.
Outliers are extreme values, typically several standard deviations from the mean. To minimize the effect of outliers, you can Winsorize or trim the data.
Winsorizing involves setting the tail values of an attribute to some specified value. For example, for a 90% Winsorization, the bottom 5% of values are set equal to the minimum value in the 5th percentile, while the upper 5% of values are set equal to the maximum value in the 95th percentile.
Trimming sets the tail values to NULL. The algorithm treats them as missing values.
Outliers affect the different algorithms in different ways. In general, outliers cause distortion with equi-width binning and min-max normalization.
Table 3-10 Outlier Treatment Methods in DBMS_DATA_MINING_TRANSFORM
| Transformation | Description | 
|---|---|
| This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with nulls. | |
| This technique trims the outliers in numeric columns by sorting the non-null values, computing the tail values based on some fraction, and replacing the tail values with some specified value. | 
Understand Reverse Transformations
Reverse transformations ensure that information returned by the model is expressed in a format that is similar to or the same as the format of the data that was used to train the model. Internal transformation are reversed in the model details and in the results of scoring.
Some of the attributes used by the model correspond to columns in the build data. However, because of logic specific to the algorithm, nested data, and transformations, some attributes do not correspond to columns.
For example, a nested column in the training data is not interpreted as an attribute by the model. During the model build,Oracle Machine Learning for SQL explodes nested columns, and each row (an attribute name/value pair) becomes an attribute.
Some algorithms, for example Support Vector Machine (SVM) and Generalized Linear Model (GLM), only operate on numeric attributes. Any non-numeric column in the build data is exploded into binary attributes, one for each distinct value in the column (SVM). GLM does not generate a new attribute for the most frequent value in the original column. These binary attributes are set to one only if the column value for the case is equal to the value associated with the binary attribute.
Algorithms that generate coefficients present challenges in interpreting the results. Examples are SVM and Non-Negative Matrix Factorization (NMF). These algorithms produce coefficients that are used in combination with the transformed attributes. The coefficients are relevant to the data on the transformed scale, not the original data scale.
For all these reasons, the attributes listed in the model details do not resemble the columns of data used to train the model. However, attributes that undergo embedded transformations, whether initiated by Automatic Data Preparation (ADP) or by a user-specified transformation list, appear in the model details in their pre-transformed state, as close as possible to the original column values. Although the attributes are transformed when they are used by the model, they are visible in the model details in a form that can be interpreted by a user.