7 OML4R Classes That Provide Access to In-Database Machine Learning Algorithms

OML4R has classes that provide access to in-database Oracle Machine Learning algorithms. Using in-database Oracle Machine Learning algorithms eliminates data movement and leverages the database for data preparation.

These functions are described in the following topics:

7.1 About Building In-Database Models using OML4R

The OML4R machine learning interface is built on top of OML4SQL, leveraging the same in-database algorithms, with the ability to use the same algorithm hyperparameters.

These OREdm package functions provide R interfaces whose arguments conform to typical R usage for the corresponding predictive analytics and OML4SQL functions.

This section has the following topics:

7.1.1 In-Database Models Supported by OML4R

The functions in the OREdm package provide access to the in-database machine learning functionality of Oracle Database. You use these functions to build machine learning models in the database.

The following table lists the OML4R functions that build in-database models and the corresponding in-database algorithms and functions.

Table 7-1 Oracle Machine Learning for R Model Functions

OML4R Function Name Algorithm Machine Learning Technique (Mining Function)

ore.odmAI

Minimum Description Length

Attribute importance for classification or regression

ore.odmAssocRules

Apriori

Association rules

ore.odmDT

Decision Tree

Classification

ore.odmEM

Expectation Maximization

Clustering

ore.odmESA

Explicit Semantic Analysis

Feature extraction

ore.odmGLM

Generalized Linear Models

Classification and regression

ore.odmKMeans

k-Means

Clustering

ore.odmNB

Naive Bayes

Classification

ore.odmNMF

Non-Negative Matrix Factorization

Feature extraction

ore.odmOC

Orthogonal Partitioning Cluster (O-Cluster)

Clustering

ore.odmRAlg

Extensible R Algorithm

Association rules, attribute importance, classification, clustering, feature extraction, and regression

ore.odmSVD

Singular Value Decomposition

Feature extraction

ore.odmSVM

Support Vector Machines

Classification and regression

ore.odmNN

Neural Network

Classification and regression

ore.odmRF

Random Forest

Classification

ore.odmXGB

XGBoost

Classification and regression

Note:

ore.odmXGB is available only in Oracle Database 21c and later.

ore.odmESM

Exponential Smoothing Method

Regression

7.1.2 About In-Database Model Names and Renaming with OML4R Functions

In each OREdm R model object, the name slot contains the name of the underlying OML4SQL model generated by the OREdm function.

By default, models built using OREdm functions are transient objects; they do not persist past the R session in which they were built unless they are explicitly saved in an OML4R datastore or given an explicit name when created. OML4SQL models built using Oracle Data Miner or SQL, on the other hand, exist until they are explicitly dropped.

R proxy objects can be saved or persisted. Saving a model object generated by an OREdm function allows it to exist across R sessions and keeps the corresponding in-database machine learning model object in place. While the OREdm model exists, you can export and import the in-database model object and use it independently of the OML4R object.
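
For example, the following sketch saves a model proxy to a datastore and restores it in a later session; the object name mod and the datastore name my_models are hypothetical:

# Save the proxy object; the underlying in-database model stays in place.
ore.save(mod, name = "my_models", overwrite = TRUE)
ore.datastore()                  # list available datastores
# In a later R session, after connecting with ore.connect():
ore.load(name = "my_models")     # restores mod into the workspace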

You can use the MODEL_NAME parameter in odm.settings to explicitly name an in-database model object created in the database. The named in-database model object persists in the database just like those created using Oracle Data Miner or SQL.

Example 7-1 Using the MODEL_NAME Parameter to Explicitly Name an In-Database Model

This example builds a Random Forest model and explicitly names it. The example uses the MODEL_NAME parameter in odm.settings to name the in-database model object created in the database.

ore.exec("BEGIN DBMS_DATA_MINING.DROP_MODEL(model_name=> 'RF_CLASSIFICATION_MODEL'); EXCEPTION WHEN others THEN null; END;")
settings = list(RFOR_MTRY = 3,
           RFOR_NUM_TREES = 100,
           RFOR_SAMPLING_RATIO =0.5,
           model_name="RF_CLASSIFICATION_MODEL")
MOD2 <- ore.odmRF(AFFINITY_CARD~., DEMO_DF.train, odm.settings= settings)
RES2 <- predict(MOD2, DEMO_DF.test, type= c("class","raw"), norm.votes=TRUE, cache.model=TRUE, 
         supplemental.cols=c("CUST_ID", "AFFINITY_CARD", "BOOKKEEPING_APPLICATION", "BULK_PACK_DISKETTES", 
                      "EDUCATION", "FLAT_PANEL_MONITOR", "HOME_THEATER_PACKAGE", "HOUSEHOLD_SIZE",                            
                      "OCCUPATION", "OS_DOC_SET_KANJI", "PRINTER_SUPPLIES", "YRS_RESIDENCE", "Y_BOX_GAMES"))

In the preceding code, the model named RF_CLASSIFICATION_MODEL is dropped if it already exists, because attempting to build a model with an existing name raises an exception. The in-database Random Forest classification model named RF_CLASSIFICATION_MODEL is built using the specified settings and is referenced by the variable MOD2. Predictions are made with MOD2, using AFFINITY_CARD as the target, on the test data set DEMO_DF.test; the result is stored in the local session variable RES2, and the model persists in the database.

While the R model exists, or if you have assigned the model a user-specified name, you can use the OML4SQL model name to access the OML4SQL model through other interfaces, including:

  • Any SQL interface, such as SQL*Plus or SQL Developer

  • Oracle Data Miner

In particular, the model can be used with the OML4SQL prediction functions.
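
For example, the following sketch scores data from SQL with the in-database model name through an OML4R query proxy; it assumes the RF_CLASSIFICATION_MODEL built in Example 7-1 exists, and the table name DEMO_TEST is hypothetical:

# Create an ore.frame proxy over a SQL query that scores with the model.
ore.sync(query = c(RF_SQL_SCORES =
  "SELECT CUST_ID,
          PREDICTION(RF_CLASSIFICATION_MODEL USING *) pred,
          PREDICTION_PROBABILITY(RF_CLASSIFICATION_MODEL USING *) prob
   FROM   DEMO_TEST"))
head(ore.get("RF_SQL_SCORES"))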

With Oracle Data Miner you can do the following:

  • Get a list of available models

  • Use model views to inspect model details

  • Score appropriately transformed data

Note:

Any explicit user-performed transformations in the R space are not carried over into SQL scoring or Oracle Data Miner. You can also use SQL to get a list of models, inspect model details, or score appropriately transformed data.

7.1.3 Specify Model Settings

Functions in the OREdm package have an argument that specifies settings for an in-database model; some also have an argument for specifying text processing settings.

General Parameter Settings

With the odm.settings argument to an OREdm function, you can specify a list of OML4SQL parameter settings. Each list element's name and value refer to the parameter setting name and value, respectively. The setting value must be numeric or string. Refer to Specify Model Settings in Oracle Machine Learning for SQL User’s Guide for each algorithm's valid settings.

The settings function returns a data.frame that lists each OML4SQL parameter setting name and value pair used to build the model.

Text Processing Attribute Settings

Some OREdm functions have a ctx.settings argument for specifying Oracle Text attribute-specific settings. With the odm.settings argument, you can specify the Oracle Text policy, the minimum number of documents in which each token must occur, and the maximum number of distinct features for text processing. With the ctx.settings argument, you specify the columns to treat as text and the type of text transformation to apply.

The ctx.settings argument applies to the following functions:

  • ore.odmESA, Explicit Semantic Analysis

  • ore.odmGLM, Generalized Linear Models

  • ore.odmKMeans, k-Means

  • ore.odmNMF, Non-Negative Matrix Factorization

  • ore.odmSVD, Singular Value Decomposition

  • ore.odmSVM, Support Vector Machine

Note:

To create an Oracle Text policy, the user must have the CTXSYS.CTX_DDL privilege.

See Also:

Create a Model that Includes Machine Learning Operations on Text in Oracle Machine Learning for SQL User’s Guide for valid text attribute values.

7.2 About Model Settings

You can specify settings that affect the characteristics of a model.

Some settings are general, some are specific to an Oracle Machine Learning function, and some are specific to an algorithm.

All settings have default values. If you want to override one or more of the settings for a model, then specify those settings in the odm.settings argument when you build the model.

The argument is an R list. Each list element's name and value refer to a machine learning algorithm parameter setting name and value, respectively. The setting value must be numeric or a string.

Example 7-2 Specifying Model Settings

This example creates an Expectation Maximization (EM) model, overriding two of the algorithm's default settings.


settings = list(
  EMCS_NUM_ITERATIONS = 20,
  EMCS_RANDOM_SEED = 7)

EM.MOD <- ore.odmEM(~.-CUST_ID, CUST_DF, num.centers = 3, odm.settings = settings)
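
To verify the settings actually used to build the model, including defaulted values, call the settings function on the model proxy:

# List the setting name/value pairs recorded with the EM model.
settings(EM.MOD)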

7.3 Shared Settings

These settings are common to multiple Oracle Machine Learning for R machine learning classes.

The following table lists the settings that are shared by all Oracle Machine Learning for R models. A sketch showing how to pass these settings follows the table.

Table 7-2 Shared Model Settings

Setting Name Setting Value Description

ODMS_DETAILS

ODMS_ENABLE

ODMS_DISABLE

Helps to control model size in the database. Model details can consume significant disk space, especially for partitioned models. The default value is ODMS_ENABLE.

If the setting value is ODMS_ENABLE, then model detail tables and views are created along with the model. You can query the model details using SQL.

If the value is ODMS_DISABLE, then model detail tables and views are not created.

The space reduction depends on the algorithm; model size reduction can be on the order of 10x.

ODMS_MAX_PARTITIONS

1 < value <= 1000000

Controls the maximum number of partitions allowed for a partitioned model. The default is 1000.

ODMS_MISSING_VALUE_TREATMENT

ODMS_MISSING_VALUE_AUTO

ODMS_MISSING_VALUE_MEAN_MODE

ODMS_MISSING_VALUE_DELETE_ROW

Indicates how to treat missing values in the training data. This setting does not affect the scoring data. The default value is ODMS_MISSING_VALUE_AUTO.

ODMS_MISSING_VALUE_MEAN_MODE replaces missing values with the mean (numeric attributes) or the mode (categorical attributes) both at build time and apply time where appropriate. ODMS_MISSING_VALUE_AUTO performs different strategies for different algorithms.

When ODMS_MISSING_VALUE_TREATMENT is set to ODMS_MISSING_VALUE_DELETE_ROW, the rows in the training data that contain missing values are deleted. However, if you want to replicate this missing value treatment in the scoring data, then you must perform the transformation explicitly.

The value ODMS_MISSING_VALUE_DELETE_ROW is applicable to all algorithms.

ODMS_PARTITION_BUILD_TYPE

ODMS_PARTITION_BUILD_INTRA

ODMS_PARTITION_BUILD_INTER

ODMS_PARTITION_BUILD_HYBRID

Controls the parallel building of partitioned models.

ODMS_PARTITION_BUILD_INTRA builds each partition in parallel using all processes/threads.

ODMS_PARTITION_BUILD_INTER builds each partition entirely in a single process/thread, but multiple partitions may be built at the same time because multiple processes/threads are active.

ODMS_PARTITION_BUILD_HYBRID combines the other two types and is recommended for most situations to adapt to dynamic environments. This is the default value.

ODMS_PARTITION_COLUMNS

Comma separated list of machine learning attributes

Requests the building of a partitioned model. The setting value is a comma-separated list of the machine learning attributes to be used to determine the in-list partition key values. These attributes are taken from the input columns unless an XFORM_LIST parameter is passed to the model, in which case the attributes are taken from those produced by the transformations.

ODMS_TABLESPACE_NAME

tablespace_name

Specifies the tablespace in which to store the model.

If you explicitly set this to the name of a tablespace (for which you have sufficient quota), then the model content is stored in that tablespace. If you do not provide this setting, then the model content is stored in your default tablespace.

ODMS_SAMPLE_SIZE

0 < value

Determines how many rows to sample (approximately). You can use this setting only if ODMS_SAMPLING is enabled. The default value is system determined.

ODMS_SAMPLING

ODMS_SAMPLING_ENABLE

ODMS_SAMPLING_DISABLE

Allows the user to request sampling of the build data. The default is ODMS_SAMPLING_DISABLE.

ODMS_TEXT_MAX_FEATURES

1 <= value

The maximum number of distinct features, across all text attributes, to use from a document set passed to the model. The default is 3000. An ore.odmESA model has a default value of 300000.

ODMS_TEXT_MIN_DOCUMENTS

Non-negative value

This text processing setting controls how many documents a token needs to appear in to be used as a feature.

The default is 1. An ore.odmESA model has a default value of 3.

ODMS_TEXT_POLICY_NAME

The name of an Oracle Text POLICY created using CTX_DDL.CREATE_POLICY.

Affects how individual tokens are extracted from unstructured text.

For details about CTX_DDL.CREATE_POLICY, see Oracle Text Reference.

PREP_AUTO

PREP_AUTO_ON

PREP_AUTO_OFF

This data preparation setting enables fully automated data preparation.

The default is PREP_AUTO_ON.

PREP_SCALE_2DNUM

PREP_SCALE_STDDEV

PREP_SCALE_RANGE

This data preparation setting enables scaling data preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect. The following are the possible values:

PREP_SCALE_STDDEV: A request to divide the column values by the standard deviation of the column and is often provided together with PREP_SHIFT_MEAN to yield z-score normalization.

PREP_SCALE_RANGE: A request to divide the column values by the range of values and is often provided together with PREP_SHIFT_MIN to yield a range of [0,1].

PREP_SCALE_NNUM

PREP_SCALE_MAXABS

This data preparation setting enables scaling data preparation for nested numeric columns. PREP_AUTO must be OFF for this setting to take effect. If specified, then the valid value for this setting is PREP_SCALE_MAXABS, which yields data in the range of [-1,1].

PREP_SHIFT_2DNUM

PREP_SHIFT_MEAN

PREP_SHIFT_MIN

This data preparation setting enables data centering preparation for two-dimensional numeric columns. PREP_AUTO must be OFF for this setting to take effect. The following are the possible values:

PREP_SHIFT_MEAN: Results in subtracting the average of the column from each value.

PREP_SHIFT_MIN: Results in subtracting the minimum of the column from each value.
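
The following sketch shows how shared settings are passed through the odm.settings argument; the ore.frame of is hypothetical, ore.odmKMeans stands in for any OREdm function, and the setting names come from Table 7-2:

# Disable model detail persistence and sample the build data.
shared.settings <- list(ODMS_DETAILS     = "ODMS_DISABLE",
                        ODMS_SAMPLING    = "ODMS_SAMPLING_ENABLE",
                        ODMS_SAMPLE_SIZE = 1000)
km.mod <- ore.odmKMeans(~., of, num.centers = 3,
                        odm.settings = shared.settings)
settings(km.mod)   # confirm the settings recorded with the model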

7.4 Association Rules

The ore.odmAssocRules function implements the Apriori algorithm to find frequent itemsets and generate an association model.

The function finds the co-occurrence of items in large volumes of transactional data such as in market basket analysis. An association rule identifies a pattern in the data in which the appearance of a set of items in a transactional record implies another set of items. The groups of items used to form rules must pass a minimum threshold according to how frequently they occur (the support of the rule) and how often the consequent follows the antecedent (the confidence of the rule). Association models generate all rules that have support and confidence greater than user-specified thresholds. The Apriori algorithm is efficient, and scales well with respect to the number of transactions, number of items, and number of itemsets and rules produced.

The formula specification has the form ~ terms, where terms is a series of column names to include in the analysis. Specify multiple column names by separating them with +. Use ~ . if all columns in the data should be used for model building. To exclude columns, precede each column name with -. Functions can be applied to the items in terms to effect transformations.
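
The following comment sketch illustrates these formula forms, using hypothetical columns A, B, and C:

# ~ A + B       include only columns A and B
# ~ .           include all columns
# ~ . - C       include all columns except C
# ~ log(A) + B  apply a function to an item as a transformation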

The ore.odmAssocRules function accepts data in the following forms:

  • Transactional data

  • Multi-record case data using item id and item value

  • Relational data

For examples of specifying the forms of data and for information on the arguments of the function, call help(ore.odmAssocRules).

The function rules returns an object of class ore.rules, which specifies a set of association rules. You can pull an ore.rules object into memory in a local R session by using ore.pull. The local in-memory object is of class rules defined in the arules package. See help(ore.rules).

The function itemsets returns an object of class ore.itemsets, which specifies a set of itemsets. You can pull an ore.itemsets object into memory in a local R session by using ore.pull. The local in-memory object is of class itemsets defined in the arules package. See help(ore.itemsets).

Settings for an Association Rules Model

The following table lists the settings that apply to Association Rules models.

Table 7-3 Association Rules Model Settings

Setting Name Setting Value Description
ASSO_ABS_ERROR

0 < ASSO_ABS_ERROR <= MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE)

Specifies the absolute error for the association rules sampling.

A smaller value of ASSO_ABS_ERROR obtains a larger sample size that gives accurate results but takes longer to compute. Set a reasonable value for ASSO_ABS_ERROR, such as the default value, to avoid too large a sample size.

The default value is 0.5 * MAX(ASSO_MIN_SUPPORT, ASSO_MIN_CONFIDENCE).

ASSO_AGGREGATES

NULL

Specifies the columns to aggregate. It is a comma separated list of strings containing the names of the columns for aggregation. The number of columns in the list must be <= 10.

You can set ASSO_AGGREGATES if you have specified a column name with ODMS_ITEM_ID_COLUMN_NAME. The data table must have valid column names such as ITEM_ID and CASE_ID which are derived from ODMS_ITEM_ID_COLUMN_NAME.

An item value is not mandatory.

The default value is NULL.

For each item, you may supply several columns to aggregate. However, doing so requires more memory to buffer the extra data and also affects performance because of the larger input data set and increased operations.

ASSO_ANT_IN_RULES

NULL

Sets Including Rules for the antecedent: it is a comma separated list of strings, at least one of which must appear in the antecedent part of each reported association rule.

The default value is NULL.

ASSO_ANT_EX_RULES

NULL

Sets Excluding Rules for the antecedent: it is a comma separated list of strings, none of which can appear in the antecedent part of each reported association rule.

The default value is NULL.

ASSO_CONF_LEVEL

0 <= ASSO_CONF_LEVEL <= 1

Specifies the confidence level for an association rules sample.

A larger value of ASSO_CONF_LEVEL obtains a larger sample size. Any value between 0.9 and 1 is suitable. The default value is 0.95.

ASSO_CONS_IN_RULES

NULL

Sets Including Rules for the consequent: it is a comma separated list of strings, at least one of which must appear in the consequent part of each reported association rule.

The default value is NULL.

ASSO_CONS_EX_RULES

NULL

Sets Excluding Rules for the consequent: it is a comma separated list of strings, none of which can appear in the consequent part of a reported association rule.

You can use the excluding rule to reduce the data that must be stored, but you may be required to build extra models for executing different Including or Excluding Rules.

The default value is NULL.

ASSO_EX_RULES

NULL

Sets Excluding Rules applied for each association rule: it is a comma separated list of strings that cannot appear in an association rule. No rule can contain any item in the list.

The default value is NULL.

ASSO_IN_RULES

NULL

Sets Including Rules applied for each association rule: it is a comma separated list of strings, at least one of which must appear in each reported association rule, either as antecedent or as consequent.

The default value is NULL, which specifies that filtering is not applied.

ASSO_MAX_RULE_LENGTH

TO_CHAR(2 <= numeric_expr <= 20)

Maximum rule length for association rules.

The default value is 4.

ASSO_MIN_CONFIDENCE

TO_CHAR(0 <= numeric_expr <= 1)

Minimum confidence for association rules.

The default value is 0.1.

ASSO_MIN_REV_CONFIDENCE

TO_CHAR(0 <= numeric_expr <= 1)

Sets the Minimum Reverse Confidence that each rule should satisfy.

The Reverse Confidence of a rule is defined as the number of transactions in which the rule occurs divided by the number of transactions in which the consequent occurs.

The value is a real number between 0 and 1.

The default value is 0.

ASSO_MIN_SUPPORT

TO_CHAR(0 <= numeric_expr <= 1)

Minimum support for association rules.

The default value is 0.1.

ASSO_MIN_SUPPORT_INT

A positive integer

Minimum absolute support that each rule must satisfy. The value must be an integer.

The default value is 1.

ODMS_ITEM_ID_COLUMN_NAME

column_name

The name of a column that contains the items in a transaction. When you specify this setting, the algorithm expects the data to be presented in native transactional format, consisting of two columns:

  • Case ID, either categorical or numeric
  • Item ID, either categorical or numeric
ODMS_ITEM_VALUE_COLUMN_NAME

column_name

The name of a column that contains a value associated with each item in a transaction. Use this setting only when you have specified a value for ODMS_ITEM_ID_COLUMN_NAME, indicating that the data is presented in native transactional format.

If you also use ASSO_AGGREGATES, then the build data must include the following three columns and the columns specified in the AGGREGATES setting.

  • Case ID, either categorical or numeric
  • Item ID, either categorical or numeric, specified by ODMS_ITEM_ID_COLUMN_NAME
  • Item value, either categorical or numeric, specified by ODMS_ITEM_VALUE_COLUMN_NAME

If ASSO_AGGREGATES, Case ID, and Item ID columns are present, then the Item Value column may or may not appear.

The Item Value column may specify information such as the number of items (for example, three apples) or the type of the item (for example, macintosh apples).

Example 7-3 Using the ore.odmAssocRules Function

This example builds an association model on a transactional data set. The arules and arulesViz packages are required to pull the resulting rules and itemsets into the local R session and to visualize them. The graph of the rules appears in the figure following the example.

# Load the arules and arulesViz packages.
library(arules)
library(arulesViz)
# Create some transactional data.
id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e")
# Push the data to the database as an ore.frame object.
transdata_of <- ore.push(data.frame(ID = id, ITEM = item))
# Build a model with specifications.
ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID",
             item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6,
             max.rule.length = 3)
# Generate itemsets and rules of the model.
itemsets <- itemsets(ar.mod1)
rules <- rules(ar.mod1)
# Convert the rules to the rules object in arules package.
rules.arules <- ore.pull(rules)
inspect(rules.arules)          
# Convert itemsets to the itemsets object in arules package.
itemsets.arules <- ore.pull(itemsets)
inspect(itemsets.arules)
# Plot the rules graph.
plot(rules.arules, method = "graph", interactive = TRUE)

Listing for This Example

R> # Load the arules and arulesViz packages.
R> library(arules)
R> library(arulesViz)
R> # Create some transactional data.
R> id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)
R> item <- c("b", "d", "e", "a", "b", "c", "e", "b", "c", "d", "e")
R> # Push the data to the database as an ore.frame object.
R> transdata_of <- ore.push(data.frame(ID = id, ITEM = item))
R> # Build a model with specifications.
R> ar.mod1 <- ore.odmAssocRules(~., transdata_of, case.id.column = "ID",
+             item.id.column = "ITEM", min.support = 0.6, min.confidence = 0.6,
+             max.rule.length = 3)
R> # Generate itemsets and rules of the model.
R> itemsets <- itemsets(ar.mod1)
R> rules <- rules(ar.mod1)
R> # Convert the rules to the rules object in arules package.
R> rules.arules <- ore.pull(rules)
R> inspect(rules.arules)          
   lhs    rhs   support confidence lift
1  {b} => {e} 1.0000000  1.0000000    1
2  {e} => {b} 1.0000000  1.0000000    1
3  {c} => {e} 0.6666667  1.0000000    1
4  {d,                                 
    e} => {b} 0.6666667  1.0000000    1
5  {c,                                 
    e} => {b} 0.6666667  1.0000000    1
6  {b,                                 
    d} => {e} 0.6666667  1.0000000    1
7  {b,                                 
    c} => {e} 0.6666667  1.0000000    1
8  {d} => {b} 0.6666667  1.0000000    1
9  {d} => {e} 0.6666667  1.0000000    1
10 {c} => {b} 0.6666667  1.0000000    1
11 {b} => {d} 0.6666667  0.6666667    1
12 {b} => {c} 0.6666667  0.6666667    1
13 {e} => {d} 0.6666667  0.6666667    1
14 {e} => {c} 0.6666667  0.6666667    1
15 {b,                                 
    e} => {d} 0.6666667  0.6666667    1
16 {b,                                 
    e} => {c} 0.6666667  0.6666667    1
R> # Convert itemsets to the itemsets object in arules package.
R> itemsets.arules <- ore.pull(itemsets)
R> inspect(itemsets.arules)
   items   support
1  {b}   1.0000000
2  {e}   1.0000000
3  {b,            
    e}   1.0000000
4  {c}   0.6666667
5  {d}   0.6666667
6  {b,            
    c}   0.6666667
7  {b,            
    d}   0.6666667
8  {c,            
    e}   0.6666667
9  {d,            
    e}   0.6666667
10 {b,            
    c,            
    e}   0.6666667
11 {b,            
    d,            
    e}   0.6666667

R> # Plot the rules graph.
R> plot(rules.arules, method = "graph", interactive = TRUE)

Figure 7-1 A Visual Demonstration of the Association Rules


7.5 Attribute Importance Model

The ore.odmAI attribute importance function ranks attributes according to their significance in predicting a target.

The ore.odmAI function uses the OML4SQL Minimum Description Length algorithm to calculate attribute importance. Minimum Description Length (MDL) is an information theoretic model selection principle. It is an important concept in information theory (the study of the quantification of information) and in learning theory (the study of the capacity for generalization based on empirical data).

MDL assumes that the simplest, most compact representation of the data is the best and most probable explanation of the data. The MDL principle is used to build OML4SQL attribute importance models.

Attribute importance models built using OML4SQL cannot be applied to new data.

The ore.odmAI function produces a ranking of attributes and their importance values.

Note:

OREdm attribute importance models differ from OML4SQL attribute importance models in these ways: a model object is not retained, and an R model object is not returned. Only the importance ranking created by the model is returned.

For information on the ore.odmAI function arguments, invoke help(ore.odmAI).

Example 7-4 Using the ore.odmAI Function

This example pushes the data.frame iris to the database as the ore.frame iris_of. The example then builds an attribute importance model.

iris_of <- ore.push(iris)
ore.odmAI(Species ~ ., iris_of)

Listing for This Example

R> iris_of <- ore.push(iris)
R> ore.odmAI(Species ~ ., iris_of)
 
Call:
ore.odmAI(formula = Species ~ ., data = iris_of)
 
Importance: 
             importance rank
Petal.Width   1.1701851    1
Petal.Length  1.1494402    2
Sepal.Length  0.5248815    3
Sepal.Width   0.2504077    4

7.6 Decision Tree

The ore.odmDT function uses the in-database Decision Tree algorithm, which is based on conditional probabilities.

Decision Tree models are classification models. Decision trees generate rules. A rule is a conditional statement that can easily be understood by humans and be used within a database to identify a set of records.

A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.

During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The ore.odmDT function offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.

For information on the ore.odmDT function arguments, call help(ore.odmDT).

Settings for a Decision Tree Model

The following table lists settings that apply to Decision Tree models. A sketch showing how to override the impurity metric follows the table.

Table 7-4 Decision Tree Model Settings

Setting Name Setting Value Description

TREE_IMPURITY_METRIC

TREE_IMPURITY_ENTROPY

TREE_IMPURITY_GINI

Tree impurity metric for Decision Tree.

Tree algorithms seek the best test question for splitting data at each node. The best splitter and split values are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured by a metric. Decision trees can use either gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. By default, the algorithm uses TREE_IMPURITY_GINI.

TREE_TERM_MAX_DEPTH

For Decision Tree:

2 <= a number <= 20

For Random Forest:

2 <= a number <= 100

Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node).

For Decision Tree, the default is 7.

For Random Forest, the default is 16.

TREE_TERM_MINPCT_NODE

0 <= a number <= 10

The minimum number of training rows in a node expressed as a percentage of the rows in the training data.

Default is 0.05, indicating 0.05%.

TREE_TERM_MINPCT_SPLIT

0 < a number <= 20

The minimum number of rows required to consider splitting a node expressed as a percentage of the training rows.

Default is 0.1, indicating 0.1%.

TREE_TERM_MINREC_NODE

a number >= 0

The minimum number of rows in a node.

Default is 10.

TREE_TERM_MINREC_SPLIT

a number > 1

Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value.

Default is 20.
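
For example, one way to request the entropy metric instead of the default gini is to pass TREE_IMPURITY_METRIC through odm.settings; this is a sketch, assuming the mtcars_of ore.frame built in Example 7-5 below:

# Build a Decision Tree that splits on entropy, with a deeper tree.
dt.settings <- list(TREE_IMPURITY_METRIC = "TREE_IMPURITY_ENTROPY",
                    TREE_TERM_MAX_DEPTH  = 10)
dt.mod.ent <- ore.odmDT(gear ~ ., mtcars_of, odm.settings = dt.settings)
settings(dt.mod.ent)   # verify the impurity metric and depth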

Example 7-5 Using the ore.odmDT Function

This example creates an input ore.frame, builds a model, makes predictions, and generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
row.names(mtcars_of) <- mtcars_of$ID
# Build the model.
dt.mod  <- ore.odmDT(gear ~ ., mtcars_of)
summary(dt.mod)
# Make predictions and generate a confusion matrix.
dt.res  <- predict (dt.mod, mtcars_of, "gear")
with(dt.res, table(gear, PREDICTION)) 

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R> row.names(mtcars_of) <- mtcars_of$ID
R> # Build the model.
R> dt.mod  <- ore.odmDT(gear ~ ., mtcars_of)
R> summary(dt.mod)
 
Call:
ore.odmDT(formula = gear ~ ., data = mtcars_of)
 
  n =  32 
 
Nodes:
  parent node.id row.count prediction                         split
1     NA       0        32          3                          <NA>
2      0       1        16          4 (disp <= 196.299999999999995)
3      0       2        16          3  (disp > 196.299999999999995)
            surrogate                   full.splits
1                <NA>                          <NA>
2 (cyl in ("4" "6" )) (disp <= 196.299999999999995)
3     (cyl in ("8" ))  (disp > 196.299999999999995)
 
Settings: 
                          value
prep.auto                    on
impurity.metric   impurity.gini
term.max.depth                7
term.minpct.node           0.05
term.minpct.split           0.1
term.minrec.node             10
term.minrec.split            20
R> # Make predictions and generate a confusion matrix.
R> dt.res  <- predict (dt.mod, mtcars_of, "gear")
R> with(dt.res, table(gear, PREDICTION)) 
    PREDICTION
gear  3  4
   3 14  1
   4  0 12
   5  2  3

7.7 Expectation Maximization

The ore.odmEM function creates a model that uses the OML4SQL Expectation Maximization (EM) algorithm.

EM is a density estimation algorithm that performs probabilistic clustering. In density estimation, the goal is to construct a density function that captures how a given population is distributed. The density estimate is based on observed data that represents a sample of the population.

For information on the ore.odmEM function arguments, call help(ore.odmEM).

Settings for an Expectation Maximization Model

The following table lists settings that apply to Expectation Maximization Models.

Table 7-5 Expectation Maximization Model Settings

Setting Name Setting Value Description

EMCS_ATTRIBUTE_FILTER

EMCS_ATTR_FILTER_ENABLE

EMCS_ATTR_FILTER_DISABLE

Whether or not to include uncorrelated attributes in the model. When EMCS_ATTRIBUTE_FILTER is enabled, uncorrelated attributes are not included.

Note:

This setting applies only to attributes that are not nested.

Default is system-determined.

EMCS_MAX_NUM_ATTR_2D

TO_CHAR(numeric_expr >= 1)

Maximum number of correlated attributes to include in the model.

Note:

This setting applies only to attributes that are not nested (2D).

The default value is 50.

EMCS_NUM_DISTRIBUTION

EMCS_NUM_DISTR_BERNOULLI

EMCS_NUM_DISTR_GAUSSIAN

EMCS_NUM_DISTR_SYSTEM

The distribution for modeling numeric attributes. Applies to the input table or view as a whole and does not allow per-attribute specifications.

The options include Bernoulli, Gaussian, or system-determined distribution. When Bernoulli or Gaussian distribution is chosen, all numeric attributes are modeled using the same type of distribution. When the distribution is system-determined, individual attributes may use different distributions (either Bernoulli or Gaussian), depending on the data.

The default value is EMCS_NUM_DISTR_SYSTEM.

EMCS_NUM_EQUIWIDTH_BINS

TO_CHAR(1 < numeric_expr <= 255)

Number of equi-width bins that will be used for gathering cluster statistics for numeric columns.

Default is 11.

EMCS_NUM_PROJECTIONS

TO_CHAR(numeric_expr >= 1)

Specifies the number of projections that will be used for each nested column. If a column has fewer distinct attributes than the specified number of projections, the data will not be projected. The setting applies to all nested columns.

Default is 50.

EMCS_NUM_QUANTILE_BINS

TO_CHAR(1 < numeric_expr <= 255)

Specifies the number of quantile bins that will be used for modeling numeric columns with multivalued Bernoulli distributions.

Default is system-determined.

EMCS_NUM_TOPN_BINS

TO_CHAR(1 < numeric_expr <= 255)

Specifies the number of top-N bins that will be used for modeling categorical columns with multivalued Bernoulli distributions.

Default is system-determined.

Example 7-6 Using the ore.odmEM Function

## Synthetic 2-dimensional data set
set.seed(7654)

x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

X <- ore.push (data.frame(ID=1:100,x))
rownames(X) <- X$ID

em.mod <- NULL
# Build an EM model with two clusters.
em.mod <- ore.odmEM(~., X, num.centers = 2L)

# Inspect the model: settings, rules, and per-cluster histograms.
summary(em.mod)
rules(em.mod)
clusterhists(em.mod)
histogram(em.mod)

# Score the data, pull the results locally, and plot the clusters.
em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
head(em.res)
em.res.local <- ore.pull(em.res)
plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)

# Compare prediction output types.
head(predict(em.mod,X))
head(predict(em.mod,X,type=c("class","raw")))
head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))

Listing for This Example

R> ## Synthetic 2-dimensional data set
R> 
R> set.seed(7654)
R>
R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R>
R> X <- ore.push (data.frame(ID=1:100,x))
R> rownames(X) <- X$ID
R> 
R> em.mod <- NULL
R> em.mod <- ore.odmEM(~., X, num.centers = 2L)
R> 
R> summary(em.mod)

Call:
ore.odmEM(formula = ~., data = X, num.centers = 2L)

Settings: 
                                               value
clus.num.clusters                                  2
cluster.components               cluster.comp.enable
cluster.statistics                 clus.stats.enable
cluster.thresh                                     2
linkage.function                      linkage.single
loglike.improvement                             .001
max.num.attr.2d                                   50
min.pct.attr.support                              .1
model.search                    model.search.disable
num.components                                    20
num.distribution                    num.distr.system
num.equiwidth.bins                                11
num.iterations                                   100
num.projections                                   50
random.seed                                        0
remove.components                remove.comps.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

Centers: 
  MEAN.ID MEAN.x MEAN.y
2    25.5   4.03   3.96
3    75.5   1.93   1.99

R> rules(em.mod)
   cluster.id rhs.support rhs.conf lhs.support lhs.conf lhs.var lhs.var.support lhs.var.conf   predicate
1           1         100      1.0         100     1.00      ID             100       0.0000   ID <= 100
2           1         100      1.0         100     1.00      ID             100       0.0000     ID >= 1
3           1         100      1.0         100     1.00       x             100       0.2500 x <= 4.6298
4           1         100      1.0         100     1.00       x             100       0.2500 x >= 1.3987
5           1         100      1.0         100     1.00       y             100       0.3000 y <= 4.5846
6           1         100      1.0         100     1.00       y             100       0.3000 y >= 1.3546
7           2          50      0.5          50     1.00      ID              50       0.0937  ID <= 50.5
8           2          50      0.5          50     1.00      ID              50       0.0937     ID >= 1
9           2          50      0.5          50     1.00       x              50       0.0937 x <= 4.6298
10          2          50      0.5          50     1.00       x              50       0.0937  x > 3.3374
11          2          50      0.5          50     1.00       y              50       0.0937 y <= 4.5846
12          2          50      0.5          50     1.00       y              50       0.0937  y > 2.9696
13          3          50      0.5          50     0.98      ID              49       0.0937   ID <= 100
14          3          50      0.5          50     0.98      ID              49       0.0937   ID > 50.5
15          3          50      0.5          49     0.98       x              49       0.0937  x <= 2.368
16          3          50      0.5          49     0.98       x              49       0.0937 x >= 1.3987
17          3          50      0.5          49     0.98       y              49       0.0937 y <= 2.6466
18          3          50      0.5          49     0.98       y              49       0.0937 y >= 1.3546
R> clusterhists(em.mod)
   cluster.id variable bin.id lower.bound upper.bound       label count
1           1       ID      1        1.00       10.90      1:10.9    10
2           1       ID      2       10.90       20.80   10.9:20.8    10
3           1       ID      3       20.80       30.70   20.8:30.7    10
4           1       ID      4       30.70       40.60   30.7:40.6    10
5           1       ID      5       40.60       50.50   40.6:50.5    10
6           1       ID      6       50.50       60.40   50.5:60.4    10
7           1       ID      7       60.40       70.30   60.4:70.3    10
8           1       ID      8       70.30       80.20   70.3:80.2    10
9           1       ID      9       80.20       90.10   80.2:90.1    10
10          1       ID     10       90.10      100.00    90.1:100    10
11          1       ID     11          NA          NA           :     0
12          1        x      1        1.40        1.72 1.399:1.722    11
13          1        x      2        1.72        2.04 1.722:2.045    22
14          1        x      3        2.04        2.37 2.045:2.368    16
15          1        x      4        2.37        2.69 2.368:2.691     1
16          1        x      5        2.69        3.01 2.691:3.014     0
17          1        x      6        3.01        3.34 3.014:3.337     0
18          1        x      7        3.34        3.66  3.337:3.66     4
19          1        x      8        3.66        3.98  3.66:3.984    18
20          1        x      9        3.98        4.31 3.984:4.307    22
21          1        x     10        4.31        4.63  4.307:4.63     6
22          1        x     11          NA          NA           :     0
23          1        y      1        1.35        1.68 1.355:1.678     7
24          1        y      2        1.68        2.00 1.678:2.001    18
25          1        y      3        2.00        2.32 2.001:2.324    18
26          1        y      4        2.32        2.65 2.324:2.647     6
27          1        y      5        2.65        2.97  2.647:2.97     1
28          1        y      6        2.97        3.29  2.97:3.293     4
29          1        y      7        3.29        3.62 3.293:3.616     3
30          1        y      8        3.62        3.94 3.616:3.939    16
31          1        y      9        3.94        4.26 3.939:4.262    16
32          1        y     10        4.26        4.58 4.262:4.585    11
33          1        y     11          NA          NA           :     0
34          2       ID      1        1.00       10.90      1:10.9    10
35          2       ID      2       10.90       20.80   10.9:20.8    10
36          2       ID      3       20.80       30.70   20.8:30.7    10
37          2       ID      4       30.70       40.60   30.7:40.6    10
38          2       ID      5       40.60       50.50   40.6:50.5    10
39          2       ID      6       50.50       60.40   50.5:60.4     0
40          2       ID      7       60.40       70.30   60.4:70.3     0
41          2       ID      8       70.30       80.20   70.3:80.2     0
42          2       ID      9       80.20       90.10   80.2:90.1     0
43          2       ID     10       90.10      100.00    90.1:100     0
44          2       ID     11          NA          NA           :     0
45          2        x      1        1.40        1.72 1.399:1.722     0
46          2        x      2        1.72        2.04 1.722:2.045     0
47          2        x      3        2.04        2.37 2.045:2.368     0
48          2        x      4        2.37        2.69 2.368:2.691     0
49          2        x      5        2.69        3.01 2.691:3.014     0
50          2        x      6        3.01        3.34 3.014:3.337     0
51          2        x      7        3.34        3.66  3.337:3.66     4
52          2        x      8        3.66        3.98  3.66:3.984    18
53          2        x      9        3.98        4.31 3.984:4.307    22
54          2        x     10        4.31        4.63  4.307:4.63     6
55          2        x     11          NA          NA           :     0
56          2        y      1        1.35        1.68 1.355:1.678     0
57          2        y      2        1.68        2.00 1.678:2.001     0
58          2        y      3        2.00        2.32 2.001:2.324     0
59          2        y      4        2.32        2.65 2.324:2.647     0
60          2        y      5        2.65        2.97  2.647:2.97     0
61          2        y      6        2.97        3.29  2.97:3.293     4
62          2        y      7        3.29        3.62 3.293:3.616     3
63          2        y      8        3.62        3.94 3.616:3.939    16
64          2        y      9        3.94        4.26 3.939:4.262    16
65          2        y     10        4.26        4.58 4.262:4.585    11
66          2        y     11          NA          NA           :     0
67          3       ID      1        1.00       10.90      1:10.9     0
68          3       ID      2       10.90       20.80   10.9:20.8     0
69          3       ID      3       20.80       30.70   20.8:30.7     0
70          3       ID      4       30.70       40.60   30.7:40.6     0
71          3       ID      5       40.60       50.50   40.6:50.5     0
72          3       ID      6       50.50       60.40   50.5:60.4    10
73          3       ID      7       60.40       70.30   60.4:70.3    10
74          3       ID      8       70.30       80.20   70.3:80.2    10
75          3       ID      9       80.20       90.10   80.2:90.1    10
76          3       ID     10       90.10      100.00    90.1:100    10
77          3       ID     11          NA          NA           :     0
78          3        x      1        1.40        1.72 1.399:1.722    11
79          3        x      2        1.72        2.04 1.722:2.045    22
80          3        x      3        2.04        2.37 2.045:2.368    16
81          3        x      4        2.37        2.69 2.368:2.691     1
82          3        x      5        2.69        3.01 2.691:3.014     0
83          3        x      6        3.01        3.34 3.014:3.337     0
84          3        x      7        3.34        3.66  3.337:3.66     0
85          3        x      8        3.66        3.98  3.66:3.984     0
86          3        x      9        3.98        4.31 3.984:4.307     0
87          3        x     10        4.31        4.63  4.307:4.63     0
88          3        x     11          NA          NA           :     0
89          3        y      1        1.35        1.68 1.355:1.678     7
90          3        y      2        1.68        2.00 1.678:2.001    18
91          3        y      3        2.00        2.32 2.001:2.324    18
92          3        y      4        2.32        2.65 2.324:2.647     6
93          3        y      5        2.65        2.97  2.647:2.97     1
94          3        y      6        2.97        3.29  2.97:3.293     0
95          3        y      7        3.29        3.62 3.293:3.616     0
96          3        y      8        3.62        3.94 3.616:3.939     0
97          3        y      9        3.94        4.26 3.939:4.262     0
98          3        y     10        4.26        4.58 4.262:4.585     0
99          3        y     11          NA          NA           :     0
R> histogram(em.mod)
R>
R> em.res <- predict(em.mod, X, type="class", supplemental.cols=c("x", "y"))
R> head(em.res)
     x    y CLUSTER_ID
1 4.15 3.63          2
2 3.88 4.13          2
3 3.72 4.10          2
4 3.78 4.14          2
5 4.22 4.35          2
6 4.07 3.62          2
R> em.res.local <- ore.pull(em.res)
R> plot(data.frame(x=em.res.local$x, y=em.res.local$y), col=em.res.local$CLUSTER_ID)
R> points(em.mod$centers2, col = rownames(em.mod$centers2), pch=8, cex=2)
R>
R> head(predict(em.mod,X))
  '2'      '3' CLUSTER_ID
1   1 1.14e-54          2
2   1 1.63e-55          2
3   1 1.10e-51          2
4   1 1.53e-52          2
5   1 9.02e-62          2
6   1 3.20e-49          2
R> head(predict(em.mod,X,type=c("class","raw")))
  '2'      '3' CLUSTER_ID
1   1 1.14e-54          2
2   1 1.63e-55          2
3   1 1.10e-51          2
4   1 1.53e-52          2
5   1 9.02e-62          2
6   1 3.20e-49          2
R> head(predict(em.mod,X,type=c("class","raw"),supplemental.cols=c("x","y")))
  '2'      '3'    x    y CLUSTER_ID
1   1 1.14e-54 4.15 3.63          2
2   1 1.63e-55 3.88 4.13          2
3   1 1.10e-51 3.72 4.10          2
4   1 1.53e-52 3.78 4.14          2
5   1 9.02e-62 4.22 4.35          2
6   1 3.20e-49 4.07 3.62          2
R> head(predict(em.mod,X,type="raw",supplemental.cols=c("x","y")))
     x    y '2'      '3'
1 4.15 3.63   1 1.14e-54
2 3.88 4.13   1 1.63e-55
3 3.72 4.10   1 1.10e-51
4 3.78 4.14   1 1.53e-52
5 4.22 4.35   1 9.02e-62
6 4.07 3.62   1 3.20e-49

7.8 Explicit Semantic Analysis

The ore.odmESA function creates a model that uses the OML4SQL Explicit Semantic Analysis (ESA) algorithm.

ESA is an unsupervised algorithm used by OML4SQL for feature extraction. ESA does not discover latent features but instead uses explicit features based on an existing knowledge base.

Explicit knowledge often exists in text form. Multiple knowledge bases are available as collections of text documents. These knowledge bases can be generic, for example, Wikipedia, or domain-specific. Data preparation transforms the text into vectors that capture attribute-concept associations.

For information on the ore.odmESA function arguments, call help(ore.odmESA).

Settings for an Explicit Semantic Analysis Model

The following table lists settings that apply to Explicit Semantic Analysis models.

Table 7-6 Explicit Semantic Analysis Model Settings

Setting Name Setting Value Description

ESAS_VALUE_THRESHOLD

Non-negative number

This setting thresholds a small value for attribute weights in the transformed build data. The default is 1e-8.

ESAS_MIN_ITEMS

Text input: 100

Non-text input: 0

This setting determines the minimum number of non-zero entries that need to be present in an input row. The default is 100 for text input and 0 for non-text input.

ESAS_TOPN_FEATURES

A positive integer

This setting controls the maximum number of features per attribute. The default is 1000.

Example 7-7 Using the ore.odmESA Function

title <- c('Aids in Africa: Planning for a long war',
           'Mars rover maneuvers for rim shot',
           'Mars express confirms presence of water at Mars south pole',
           'NASA announces major Mars rover finding',
           'Drug access, Asia threat in focus at AIDS summit',
           'NASA Mars Odyssey THEMIS image: typical crater',
           'Road blocks for Aids')

# TEXT contents in character column
df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
ESA_TEXT <- ore.push(df)

# TEXT contents in clob column
attr(df$TITLE, "ora.type") <- "clob"
ESA_TEXT_CLOB <- ore.push(df)

# Create text policy (CTXSYS.CTX_DDL privilege is required)
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")

# Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# ESA algorithm settings in odm.settings
esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
 odm.settings = list(case_id_column_name = "CUST_ID",
                     ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
                     ODMS_TEXT_MIN_DOCUMENTS = 1,
                     ODMS_TEXT_MAX_FEATURES = 3,
                     ESAS_MIN_ITEMS = 1,
                     ESAS_VALUE_THRESHOLD = 0.0001,
                     ESAS_TOPN_FEATURES = 3))
class(esa.mod)
summary(esa.mod)
settings(esa.mod)
features(esa.mod)
predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")

# Use ctx.settings to specify a character column as TEXT and 
# the same settings as above as well as TOKEN_TYPE 
esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
  odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
  ctx.settings = list(TITLE = 
    "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
summary(esa.mod2)
settings(esa.mod2)
features(esa.mod2)
predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")

ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Listing for This Example

R> title <- c('Aids in Africa: Planning for a long war',
+             'Mars rover maneuvers for rim shot',
+             'Mars express confirms presence of water at Mars south pole',
+             'NASA announces major Mars rover finding',
+             'Drug access, Asia threat in focus at AIDS summit',
+             'NASA Mars Odyssey THEMIS image: typical crater',
+             'Road blocks for Aids')
R>
R> # TEXT contents in character column
R> df <- data.frame(CUST_ID = seq(length(title)), TITLE = title)
R> ESA_TEXT <- ore.push(df)
R> 
R> # TEXT contents in clob column
R> attr(df$TITLE, "ora.type") <- "clob"
R> ESA_TEXT_CLOB <- ore.push(df)
R> 
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required)
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R> 
R> # Specify TEXT POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # ESA algorithm settings in odm.settings
R> esa.mod <- ore.odmESA(~ TITLE, data = ESA_TEXT_CLOB,
+  odm.settings = list(case_id_column_name = "CUST_ID",
+                      ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+                      ODMS_TEXT_MIN_DOCUMENTS = 1,
+                      ODMS_TEXT_MAX_FEATURES = 3,
+                      ESAS_MIN_ITEMS = 1,
+                      ESAS_VALUE_THRESHOLD = 0.0001,
+                      ESAS_TOPN_FEATURES = 3))
R> class(esa.mod)
[1] "ore.odmESA" "ore.model" 
R> summary(esa.mod)

Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT_CLOB, odm.settings = list(case_id_column_name = "CUST_ID", 
    ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", ODMS_TEXT_MIN_DOCUMENTS = 1, 
    ODMS_TEXT_MAX_FEATURES = 3, ESAS_MIN_ITEMS = 1, ESAS_VALUE_THRESHOLD = 1e-04, 
    ESAS_TOPN_FEATURES = 3))

Settings: 
                                               value
min.items                                          1
topn.features                                      3
value.threshold                                1e-04
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                             3
odms.text.min.documents                            1
odms.text.policy.name                     ESA_TXTPOL
prep.auto                                         ON

Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.NASA            <NA>   0.6742695
6           4    TITLE.ROVER            <NA>   0.6742695
7           5     TITLE.AIDS            <NA>   1.0000000
8           6     TITLE.MARS            <NA>   0.4078615
9           6     TITLE.NASA            <NA>   0.9130438
10          7     TITLE.AIDS            <NA>   1.0000000
R> settings(esa.mod)
                   SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2                ESAS_MIN_ITEMS                             1        INPUT
3            ESAS_TOPN_FEATURES                             3        INPUT
4          ESAS_VALUE_THRESHOLD                         1e-04        INPUT
5  ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                 ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7        ODMS_TEXT_MAX_FEATURES                             3        INPUT
8       ODMS_TEXT_MIN_DOCUMENTS                             1        INPUT
9         ODMS_TEXT_POLICY_NAME                    ESA_TXTPOL        INPUT
10                    PREP_AUTO                            ON        INPUT
R> features(esa.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.NASA            <NA>   0.6742695
6           4    TITLE.ROVER            <NA>   0.6742695
7           5     TITLE.AIDS            <NA>   1.0000000
8           6     TITLE.MARS            <NA>   0.4078615
9           6     TITLE.NASA            <NA>   0.9130438
10          7     TITLE.AIDS            <NA>   1.0000000
R> predict(esa.mod, ESA_TEXT, type = "class", supplemental.cols = "TITLE")
                                                       TITLE FEATURE_ID
1                    Aids in Africa: Planning for a long war          1
2                          Mars rover maneuvers for rim shot          2
3 Mars express confirms presence of water at Mars south pole          3
4                    NASA announces major Mars rover finding          4
5           Drug access, Asia threat in focus at AIDS summit          1
6             NASA Mars Odyssey THEMIS image: typical crater          6
7                                       Road blocks for Aids          1
R>
R> # Use ctx.settings to specify a character column as TEXT and 
R> # the same settings as above as well as TOKEN_TYPE 
R> esa.mod2 <- ore.odmESA(~ TITLE, data = ESA_TEXT,
+    odm.settings = list(case_id_column_name = "CUST_ID", ESAS_MIN_ITEMS = 1),
+    ctx.settings = list(TITLE = 
+      "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))
R> summary(esa.mod2)

Call:
ore.odmESA(formula = ~TITLE, data = ESA_TEXT, odm.settings = list(case_id_column_name = "CUST_ID", 
    ESAS_MIN_ITEMS = 1), ctx.settings = list(TITLE = "TEXT(POLICY_NAME:ESA_TXTPOL)(TOKEN_TYPE:STEM)(MIN_DOCUMENTS:1)(MAX_FEATURES:3)"))

Settings: 
                                               value
min.items                                          1
topn.features                                   1000
value.threshold                            .00000001
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                        300000
odms.text.min.documents                            3
prep.auto                                         ON

Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.MARS            <NA>   0.3011997
6           4     TITLE.NASA            <NA>   0.6742695
7           4    TITLE.ROVER            <NA>   0.6742695
8           5     TITLE.AIDS            <NA>   1.0000000
9           6     TITLE.MARS            <NA>   0.4078615
10          6     TITLE.NASA            <NA>   0.9130438
11          7     TITLE.AIDS            <NA>   1.0000000
R> settings(esa.mod2)
                  SETTING_NAME                 SETTING_VALUE SETTING_TYPE
1                    ALGO_NAME ALGO_EXPLICIT_SEMANTIC_ANALYS        INPUT
2               ESAS_MIN_ITEMS                             1        INPUT
3           ESAS_TOPN_FEATURES                          1000      DEFAULT
4         ESAS_VALUE_THRESHOLD                     .00000001      DEFAULT
5 ODMS_MISSING_VALUE_TREATMENT       ODMS_MISSING_VALUE_AUTO      DEFAULT
6                ODMS_SAMPLING         ODMS_SAMPLING_DISABLE      DEFAULT
7       ODMS_TEXT_MAX_FEATURES                        300000      DEFAULT
8      ODMS_TEXT_MIN_DOCUMENTS                             3      DEFAULT
9                    PREP_AUTO                            ON        INPUT
R> features(esa.mod2)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE COEFFICIENT
1           1     TITLE.AIDS            <NA>   1.0000000
2           2     TITLE.MARS            <NA>   0.4078615
3           2    TITLE.ROVER            <NA>   0.9130438
4           3     TITLE.MARS            <NA>   1.0000000
5           4     TITLE.MARS            <NA>   0.3011997
6           4     TITLE.NASA            <NA>   0.6742695
7           4    TITLE.ROVER            <NA>   0.6742695
8           5     TITLE.AIDS            <NA>   1.0000000
9           6     TITLE.MARS            <NA>   0.4078615
10          6     TITLE.NASA            <NA>   0.9130438
11          7     TITLE.AIDS            <NA>   1.0000000
R> predict(esa.mod2, ESA_TEXT_CLOB, type = "class", supplemental.cols = "TITLE")
                                                       TITLE FEATURE_ID
1                    Aids in Africa: Planning for a long war          1
2                          Mars rover maneuvers for rim shot          2
3 Mars express confirms presence of water at Mars south pole          3
4                    NASA announces major Mars rover finding          4
5           Drug access, Asia threat in focus at AIDS summit          1
6             NASA Mars Odyssey THEMIS image: typical crater          6
7                                       Road blocks for Aids          1
R> 
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

7.9 Extensible R Algorithm Model

The ore.odmRAlg function creates an Extensible R algorithm model using OML4SQL.

The Extensible R algorithm builds, scores, and views an R model using registered R scripts. It supports classification, regression, clustering, feature extraction, attribute importance, and association machine learning functions.

For information on the ore.odmRAlg function arguments and for an example of using the function, call help(ore.odmRAlg).

Settings for an Extensible R Algorithm Model

The following table lists settings that apply to Extensible R Algorithm models.

Table 7-7 Extensible R Algorithm Model Settings

Setting Name Setting Value Description

RALG_BUILD_FUNCTION

R_BUILD_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm mining model build function. The R script defines an R function whose first input argument is the training data and that returns an R model object. For a Clustering or Feature Extraction mining function model build, the R attributes dm$nclus and dm$nfeat must be set on the R model to indicate the number of clusters and features, respectively.

The RALG_BUILD_FUNCTION must be set along with ALGO_EXTENSIBLE_LANG in the model_setting_table.

RALG_BUILD_PARAMETER

SELECT value param_name, ... FROM DUAL

Specifies a list of numeric and string scalar values for optional input parameters of the model build function.

RALG_SCORE_FUNCTION

R_SCORE_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script to score data. The script returns a data.frame containing the corresponding prediction results. The setting is used to score data for mining functions such as Regression, Classification, Clustering, and Feature Extraction. This setting does not apply to Association and Attribute Importance functions.

RALG_WEIGHT_FUNCTION

R_WEIGHT_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that computes the weight (contribution) of each attribute in scoring. The script returns a data.frame containing the contributing weight for each attribute in a row. This setting is needed for the PREDICTION_DETAILS SQL function.

RALG_DETAILS_FUNCTION

R_DETAILS_FUNCTION_SCRIPT_NAME

Specifies the name of an existing registered R script for the R algorithm that produces the model information. This setting is required to generate a model view.

RALG_DETAILS_FORMAT

SELECT type_value column_name, ... FROM DUAL

Specifies the SELECT query for the list of numeric and string scalars for the output column type and the column name of the generated model view. This setting is required to generate a model view.

Example 7-8 Using the ore.odmRAlg Function

library(OREembed)

digits <- getOption("digits")
options(digits = 5L)

IRIS <- ore.push(iris)

# Regression with glm
ore.scriptCreate("glm_build", 
                 function(data, form, family) 
                 glm(formula = form, data = data, family = family)) 

ore.scriptCreate("glm_score", 
                  function(mod, data) 
                    { res <- predict(mod, newdata = data); 
                      data.frame(res) })

ore.scriptCreate("glm_detail", function(mod) 
                 data.frame(name=names(mod$coefficients), 
                              coef=mod$coefficients))

ore.scriptList(name = "glm_build")
ore.scriptList(name = "glm_score")
ore.scriptList(name = "glm_detail")

ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
                        formula = c(form="Sepal.Length ~ ."),
                        build.function = "glm_build", 
                        build.parameter = list(family="gaussian"),
                        score.function = "glm_score",
                        detail.function = "glm_detail", 
                        detail.value = data.frame(name="a", coef=1))
summary(ralg.glm)
predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")

ore.scriptDrop(name = "glm_build")
ore.scriptDrop(name = "glm_score")
ore.scriptDrop(name = "glm_detail")

# Classification with nnet
ore.scriptCreate("nnet_build", 
                 function(dat, form, sz){
                   require(nnet); 
                   set.seed(1234);
                   nnet(formula = formula(form), data = dat, 
                        size = sz, linout = TRUE, trace = FALSE);
                  }, 
                  overwrite = TRUE)

ore.scriptCreate("nnet_detail", function(mod)
                 data.frame(conn = mod$conn, wts = mod$wts), 
                 overwrite = TRUE)

ore.scriptCreate("nnet_score", 
                 function(mod, data) {
                   require(nnet); 
                   res <- data.frame(predict(mod, newdata = data)); 
                   names(res) <- sort(mod$lev); res
                 })

ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
                         formula = c(form="Species ~ ."),
                         build.function = "nnet_build", 
                         build.parameter = list(sz=2),
                         score.function = "nnet_score",
                         detail.function = "nnet_detail",
                         detail.value = data.frame(conn=1, wts =1))

summary(ralg.nnet)
predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")

ore.scriptDrop(name = "nnet_build")
ore.scriptDrop(name = "nnet_score")
ore.scriptDrop(name = "nnet_detail")

# Feature extraction with pca
ore.scriptCreate("pca_build", 
                 function(dat){
                   mod <- prcomp(dat, retx = FALSE)
                   attr(mod, "dm$nfeat") <- ncol(mod$rotation)
                   mod}, 
                 overwrite = TRUE)

ore.scriptCreate("pca_score", 
                 function(mod, data) {
                   res <- predict(mod, data)
                   as.data.frame(res)}, 
                 overwrite=TRUE)

ore.scriptCreate("pca_detail", 
                 function(mod) {
                   rotation_t <- t(mod$rotation)
                   data.frame(id = seq_along(rownames(rotation_t)), 
                                             rotation_t)},
                 overwrite = TRUE)

X <- IRIS[, -5L]
ralg.pca <- ore.odmRAlg(X, 
                        mining.function = "feature_extraction",
                        formula = NULL,
                        build.function = "pca_build",
                        score.function = "pca_score",
                        detail.function = "pca_detail",
                        detail.value = data.frame(Feature.ID=1, 
                                                  ore.pull(head(X,1L))))

summary(ralg.pca)
head(cbind(X, Pred = predict(ralg.pca, newdata = X)))

ore.scriptDrop(name = "pca_build")
ore.scriptDrop(name = "pca_score")
ore.scriptDrop(name = "pca_detail")

options(digits = digits)

Listing for This Example

R> library(OREembed)
R> 
R> digits <- getOption("digits")
R> options(digits = 5L)
R> 
R> IRIS <- ore.push(iris)
R> 
R> # Regression with glm
R> ore.scriptCreate("glm_build", 
+                   function(data, form, family) 
+                   glm(formula = form, data = data, family = family))
R> 
R> ore.scriptCreate("glm_score", 
+                    function(mod, data)
+                      { res <- predict(mod, newdata = data); 
+                        data.frame(res) })
R> 
R> ore.scriptCreate("glm_detail", function(mod) 
+                   data.frame(name=names(mod$coefficients), 
+                                     coef=mod$coefficients))
R>
R> ore.scriptList(name = "glm_build")
       NAME                                                                            SCRIPT
1 glm_build function (data, form, family) \nglm(formula = form, data = data, family = family)

R> ore.scriptList(name = "glm_score")
       NAME                                                                                    SCRIPT
1 glm_score function (mod, data) \n{\n    res <- predict(mod, newdata = data)\n    data.frame(res)\n}
R> ore.scriptList(name = "glm_detail")
        NAME                                                                               SCRIPT
1 glm_detail function (mod) \ndata.frame(name = names(mod$coefficients), coef = mod$coefficients)
R> 
R> ralg.glm <- ore.odmRAlg(IRIS, mining.function = "regression",
+                         formula = c(form="Sepal.Length ~ ."),
+                         build.function = "glm_build", 
+                         build.parameter = list(family="gaussian"),
+                         score.function = "glm_score",
+                         detail.function = "glm_detail", 
+                         detail.value = data.frame(name="a", coef=1))
R> 
R> summary(ralg.glm)

Call:
ore.odmRAlg(data = IRIS, mining.function = "regression", formula = c(form = "Sepal.Length ~ ."), 
    build.function = "glm_build", build.parameter = list(family = "gaussian"), 
    score.function = "glm_score", detail.function = "glm_detail", 
    detail.value = data.frame(name = "a", coef = 1))

Settings: 
                                                                                       value
odms.missing.value.treatment                                         odms.missing.value.auto
odms.sampling                                                          odms.sampling.disable
prep.auto                                                                                OFF
build.function                                                            OML_USER.glm_build
build.parameter              select 'Sepal.Length ~ .' "form", 'gaussian' "family" from dual
details.format                 select cast('a' as varchar2(4000)) "name", 1 "coef" from dual
details.function                                                         OML_USER.glm_detail
score.function                                                            OML_USER.glm_score

               name     coef
1       (Intercept)  2.17127
2      Petal.Length  0.82924
3       Petal.Width -0.31516
4       Sepal.Width  0.49589
5 Speciesversicolor -0.72356
6  Speciesvirginica -1.02350
R> predict(ralg.glm, newdata = head(IRIS), supplemental.cols = "Sepal.Length")
  Sepal.Length PREDICTION
1          5.1     5.0048
2          4.9     4.7568
3          4.7     4.7731
4          4.6     4.8894
5          5.0     5.0544
6          5.4     5.3889
R> 
R> ore.scriptDrop(name = "glm_build")
R> ore.scriptDrop(name = "glm_score")
R> ore.scriptDrop(name = "glm_detail")
R> 
R> # Classification with nnet
R> ore.scriptCreate("nnet_build", 
+                   function(dat, form, sz){
+                     require(nnet); 
+                     set.seed(1234);
+                     nnet(formula = formula(form), data = dat, 
+                            size = sz, linout = TRUE, trace = FALSE); 
+                   }, 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("nnet_detail", function(mod)
+                   data.frame(conn = mod$conn, wts = mod$wts), 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("nnet_score", 
+                   function(mod, data) {
+                     require(nnet); 
+                     res <- data.frame(predict(mod, newdata = data)); 
+                     names(res) <- sort(mod$lev); res
+                   })
R> 
R> ralg.nnet <- ore.odmRAlg(IRIS, mining.function = "classification",
+                           formula = c(form="Species ~ ."),
+                           build.function = "nnet_build", 
+                           build.parameter = list(sz=2),
+                           score.function = "nnet_score",
+                           detail.function = "nnet_detail",
+                           detail.value = data.frame(conn=1, wts =1))
R> 
R> summary(ralg.nnet)

Call:
ore.odmRAlg(data = IRIS, mining.function = "classification", 
    formula = c(form = "Species ~ ."), build.function = "nnet_build", 
    build.parameter = list(sz = 2), score.function = "nnet_score", 
    detail.function = "nnet_detail", detail.value = data.frame(conn = 1, 
        wts = 1))

Settings: 
                                                                     value
clas.weights.balanced                                                  OFF
odms.missing.value.treatment                       odms.missing.value.auto
odms.sampling                                        odms.sampling.disable
prep.auto                                                              OFF
build.function                                         OML_USER.nnet_build
build.parameter              select 'Species ~ .' "form", 2 "sz" from dual
details.format                          select 1 "conn", 1 "wts" from dual
details.function                                      OML_USER.nnet_detail
score.function                                         OML_USER.nnet_score

   conn       wts
1     0   1.46775
2     1 -12.88542
3     2  -4.38886
4     3   9.98648
5     4  16.57056
6     0   0.97809
7     1  -0.51626
8     2  -0.94815
9     3   0.13692
10    4   0.35104
11    0  37.22475
12    5 -66.49123
13    6  70.81160
14    0  -4.50893
15    5   7.01611
16    6  20.88774
17    0 -32.15127
18    5  58.92088
19    6 -91.96989
R> predict(ralg.nnet, newdata = head(IRIS), supplemental.cols = "Species")
  Species PREDICTION PROBABILITY
1  setosa     setosa     0.99999
2  setosa     setosa     0.99998
3  setosa     setosa     0.99999
4  setosa     setosa     0.99998
5  setosa     setosa     1.00000
6  setosa     setosa     0.99999
R> 
R> ore.scriptDrop(name = "nnet_build")
R> ore.scriptDrop(name = "nnet_score")
R> ore.scriptDrop(name = "nnet_detail")
R> 
R> ore.scriptCreate("pca_build", 
+                   function(dat){
+                     mod <- prcomp(dat, retx = FALSE)
+                     attr(mod, "dm$nfeat") <- ncol(mod$rotation)
+                     mod}, 
+                   overwrite = TRUE)
R> 
R> ore.scriptCreate("pca_score", 
+                   function(mod, data) {
+                     res <- predict(mod, data)
+                     as.data.frame(res)}, 
+                   overwrite=TRUE)
R> 
R> ore.scriptCreate("pca_detail", 
+                   function(mod) {
+                     rotation_t <- t(mod$rotation)
+                     data.frame(id = seq_along(rownames(rotation_t)), 
+                                               rotation_t)},
+                   overwrite = TRUE)
R> 
R> X <- IRIS[, -5L]
R> ralg.pca <- ore.odmRAlg(X, 
+                         mining.function = "feature_extraction",
+                         formula = NULL,
+                         build.function = "pca_build",
+                         score.function = "pca_score",
+                         detail.function = "pca_detail",
+                         detail.value = data.frame(Feature.ID=1, 
+                                                   ore.pull(head(X,1L))))
R> 
R> summary(ralg.pca)

Call:
ore.odmRAlg(data = X, mining.function = "feature_extraction", 
    formula = NULL, build.function = "pca_build", score.function = "pca_score", 
    detail.function = "pca_detail", detail.value = data.frame(Feature.ID = 1, 
        ore.pull(head(X, 1L))))

Settings: 
                                                                  value
odms.missing.value.treatment                    odms.missing.value.auto
odms.sampling                                     odms.sampling.disable
prep.auto                                                           OFF
build.function                                       OML_USER.pca_build
details.format    select 1 "Feature.ID", 5.1 "Sepal.Length", 3.5 "Sepal.Width", 1.4 "Petal.Length", 0.2 "Petal.Width" from dual
details.function                                    OML_USER.pca_detail
score.function                                       OML_USER.pca_score

  Feature.ID Sepal.Length Sepal.Width Petal.Length Petal.Width
1          1     0.856671    0.358289      0.36139   -0.084523
2          2    -0.173373   -0.075481      0.65659    0.730161
3          3     0.076236    0.545831     -0.58203    0.597911
4          4     0.479839   -0.753657     -0.31549    0.319723
R> head(cbind(X, Pred = predict(ralg.pca, newdata = X)))
  Sepal.Length Sepal.Width Petal.Length Petal.Width FEATURE_ID
1          5.1         3.5          1.4         0.2          2
2          4.9         3.0          1.4         0.2          4
3          4.7         3.2          1.3         0.2          3
4          4.6         3.1          1.5         0.2          4
5          5.0         3.6          1.4         0.2          2
6          5.4         3.9          1.7         0.4          2
R> 
R> ore.scriptDrop(name = "pca_build")
R> ore.scriptDrop(name = "pca_score")
R> ore.scriptDrop(name = "pca_detail")
R> 
R> options(digits = digits)

7.10 Generalized Linear Models

The ore.odmGLM function builds a Generalized Linear Model (GLM), which includes and extends the class of linear models (linear regression).

Generalized linear models relax the restrictions on linear models, which are often violated in practice. For example, binary (yes/no or 0/1) responses do not have the same variance across classes.

The OML4SQL GLM is a parametric modeling technique. Parametric models make assumptions about the distribution of the data. When the assumptions are met, parametric models can be more efficient than non-parametric models.

The challenge in developing models of this type involves assessing the extent to which the assumptions are met. For this reason, quality diagnostics are key to developing quality parametric models.

In addition to the classical weighted least squares estimation for linear regression and iteratively re-weighted least squares estimation for logistic regression, both solved through Cholesky decomposition and matrix inversion, OML4SQL GLM provides a conjugate gradient-based optimization algorithm that does not require matrix inversion and is very well suited to high-dimensional data. The choice of algorithm is handled internally and is transparent to the user.

GLM can be used to build classification or regression models as follows:

  • Classification: Binary logistic regression is the GLM classification algorithm. The algorithm uses the logit link function and the binomial variance function.

  • Regression: Linear regression is the GLM regression algorithm. The algorithm assumes no target transformation and constant variance over the range of target values.

The ore.odmGLM function allows you to build two different types of models. Some arguments apply to classification models only and some to regression models only.

For information on the ore.odmGLM function arguments, invoke help(ore.odmGLM).

The following examples build several models using GLM. The input ore.frame objects are R data sets pushed to the database.

Settings for Generalized Linear Models

The following table lists settings that apply to Generalized Linear models.

Table 7-8 Generalized Linear Model Settings

Setting Name Setting Value Description

GLMS_CONF_LEVEL

TO_CHAR(0< numeric_expr <1)

The confidence level for coefficient confidence intervals.

The default confidence level is 0.95.

GLMS_FTR_GEN_METHOD

GLMS_FTR_GEN_QUADRATIC

GLMS_FTR_GEN_CUBIC

Whether feature generation is quadratic or cubic.

When feature generation is enabled, the algorithm automatically chooses the most appropriate feature generation method based on the data.

GLMS_FTR_GENERATION

GLMS_FTR_GENERATION_ENABLE

GLMS_FTR_GENERATION_DISABLE

Whether or not feature generation is enabled for GLM. By default, feature generation is not enabled.

Note:

Feature generation can only be enabled when feature selection is also enabled.

GLMS_FTR_SEL_CRIT

GLMS_FTR_SEL_AIC

GLMS_FTR_SEL_SBIC

GLMS_FTR_SEL_RIC

GLMS_FTR_SEL_ALPHA_INV

Feature selection penalty criterion for adding a feature to the model.

When feature selection is enabled, the algorithm automatically chooses the penalty criterion based on the data.

GLMS_FTR_SELECTION

GLMS_FTR_SELECTION_ENABLE

GLMS_FTR_SELECTION_DISABLE

Whether or not feature selection is enabled for GLM.

By default, feature selection is not enabled.

GLMS_MAX_FEATURES

TO_CHAR(0 < numeric_expr <= 2000)

When feature selection is enabled, this setting specifies the maximum number of features that can be selected for the final model.

By default, the algorithm limits the number of features to ensure sufficient memory.

GLMS_PRUNE_MODEL

GLMS_PRUNE_MODEL_ENABLE

GLMS_PRUNE_MODEL_DISABLE

Enables or disables pruning of features in the final model. Pruning is based on t-test statistics for linear regression, or Wald test statistics for logistic regression. Features are pruned in a loop until all features are statistically significant with respect to the full data.

When feature selection is enabled, the algorithm automatically performs pruning based on the data.

GLMS_REFERENCE_CLASS_NAME

target_value

The target value used as the reference class in a binary logistic regression model. Probabilities are produced for the non-reference class.

By default, the algorithm chooses the value with the highest prevalence (the most cases) for the reference class.

GLMS_RIDGE_REGRESSION

GLMS_RIDGE_REG_ENABLE

GLMS_RIDGE_REG_DISABLE

Enables or disables Ridge Regression. Ridge applies to both the regression and classification mining functions.

When ridge is enabled, prediction bounds are not produced by the PREDICTION_BOUNDS SQL function.

Note:

Ridge may only be enabled when feature selection is not specified, or has been explicitly disabled. If Ridge Regression and feature selection are both explicitly enabled, then an exception is raised.

GLMS_RIDGE_VALUE

TO_CHAR (numeric_expr > 0)

The value of the ridge parameter. This setting is only used when the algorithm is configured to use Ridge Regression.

If Ridge Regression is enabled internally by the algorithm, then the ridge parameter is determined by the algorithm.

GLMS_ROW_DIAGNOSTICS

GLMS_ROW_DIAG_ENABLE

GLMS_ROW_DIAG_DISABLE (default).

Enable or disable row diagnostics.

GLMS_CONV_TOLERANCE

The range is (0, 1) non-inclusive.

Convergence Tolerance setting of the GLM algorithm

The default value is system-determined.

GLMS_NUM_ITERATIONS

Positive integer

Maximum number of iterations for the GLM algorithm. The default value is system-determined.

GLMS_BATCH_ROWS

0 or Positive integer

Number of rows in a batch used by the SGD solver. The value of this parameter sets the size of the batch for the SGD solver. An input of 0 triggers a data-driven batch size estimate.

The default is 2000.

GLMS_SOLVER

GLMS_SOLVER_SGD (Stochastic Gradient Descent)

GLMS_SOLVER_CHOL (Cholesky)

GLMS_SOLVER_QR

GLMS_SOLVER_LBFGS_ADMM

This setting allows the user to choose the GLM solver. The solver cannot be selected if the GLMS_FTR_SELECTION setting is enabled. The default value is system-determined.

GLMS_SPARSE_SOLVER

GLMS_SPARSE_SOLVER_ENABLE

GLMS_SPARSE_SOLVER_DISABLE (default).

This setting allows the user to use a sparse solver if it is available. The default value is GLMS_SPARSE_SOLVER_DISABLE.
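
These settings can be supplied through the odm.settings argument, following the same pattern used elsewhere in this chapter (for example, with ore.odmESA and ore.odmSVD). The following is a minimal sketch, assuming that ore.odmGLM accepts the odm.settings argument in your OML4R version; the setting values shown are illustrative, not recommendations.

# A hedged sketch: enabling feature selection and choosing a penalty
# criterion through odm.settings (illustrative values).
longley_of <- ore.push(longley)
glm.fs <- ore.odmGLM(Employed ~ ., data = longley_of,
                     odm.settings = list(GLMS_FTR_SELECTION = "GLMS_FTR_SELECTION_ENABLE",
                                         GLMS_FTR_SEL_CRIT  = "GLMS_FTR_SEL_AIC"))
summary(glm.fs)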

Example 7-9 Building a Linear Regression Model

This example builds a linear regression model using the longley data set.

longley_of <- ore.push(longley)
longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
summary(longfit1)

Listing for This Example

R> longley_of <- ore.push(longley)
R> longfit1 <- ore.odmGLM(Employed ~ ., data = longley_of)
R> summary(longfit1)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of)
 
Residuals:
     Min       1Q   Median       3Q      Max 
-0.41011 -0.15767 -0.02816  0.10155  0.45539 
 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.482e+03  8.904e+02  -3.911 0.003560 ** 
GNP.deflator  1.506e-02  8.492e-02   0.177 0.863141    
GNP          -3.582e-02  3.349e-02  -1.070 0.312681    
Unemployed   -2.020e-02  4.884e-03  -4.136 0.002535 ** 
Armed.Forces -1.033e-02  2.143e-03  -4.822 0.000944 ***
Population   -5.110e-02  2.261e-01  -0.226 0.826212    
Year          1.829e+00  4.555e-01   4.016 0.003037 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.3 on 6 and 9 DF,  p-value: 4.984e-10

Example 7-10 Using Ridge Estimation for the Coefficients of the ore.odmGLM Model

This example uses the longley_of ore.frame from the previous example. It invokes the ore.odmGLM function and specifies ridge estimation for the coefficients.

longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
                       ridge.vif = TRUE)
summary(longfit2)

Listing for This Example

R> longfit2 <- ore.odmGLM(Employed ~ ., data = longley_of, ridge = TRUE,
+                         ridge.vif = TRUE)
R> summary(longfit2)
 
Call:
ore.odmGLM(formula = Employed ~ ., data = longley_of, ridge = TRUE, 
    ridge.vif = TRUE)
 
Residuals:
    Min      1Q  Median      3Q     Max 
-0.4100 -0.1579 -0.0271  0.1017  0.4575 
 
Coefficients:
               Estimate   VIF
(Intercept)  -3.466e+03 0.000
GNP.deflator  1.479e-02 0.077
GNP          -3.535e-02 0.012
Unemployed   -2.013e-02 0.000
Armed.Forces -1.031e-02 0.000
Population   -5.262e-02 0.548
Year          1.821e+00 2.212
 
Residual standard error: 0.3049 on 9 degrees of freedom
Multiple R-squared:  0.9955,    Adjusted R-squared:  0.9925 
F-statistic: 330.2 on 6 and 9 DF,  p-value: 4.986e-10

Example 7-11 Building a Logistic Regression GLM

This example builds a logistic regression (classification) model. It uses the infert data set. The example invokes the ore.odmGLM function and specifies logistic as the type argument, which builds a binomial GLM.

infert_of <- ore.push(infert)
infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic")
infit1

Listing for This Example

R> infert_of <- ore.push(infert)
R> infit1 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic")
R> infit1
 
Response:
case == "1"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic")
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
        -2.19348           0.03958          -0.82828           1.04424          -0.35896           2.04590           1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

Example 7-12 Specifying a Reference Value in Building a Logistic Regression GLM

This example builds a logistic regression (classification) model and specifies a reference value. The example uses the infert_of ore.frame from Example 7-11.

infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
                     data = infert_of, type = "logistic", reference = 1)
infit2

Listing for This Example

R> infit2 <- ore.odmGLM(case ~ age+parity+education+spontaneous+induced,
+                       data = infert_of, type = "logistic", reference = 1)
R> infit2

Response:
case == "0"
 
Call:  ore.odmGLM(formula = case ~ age + parity + education + spontaneous + 
    induced, data = infert_of, type = "logistic", reference = 1)
 
Coefficients:
     (Intercept)               age            parity   education0-5yrs  education12+ yrs       spontaneous           induced  
         2.19348          -0.03958           0.82828          -1.04424           0.35896          -2.04590          -1.28876  
 
Degrees of Freedom: 247 Total (i.e. Null);  241 Residual
Null Deviance:      316.2 
Residual Deviance: 257.8        AIC: 271.8

7.11 k-Means

The ore.odmKMeans function uses the OML4SQL k-Means (KM) algorithm, a distance-based clustering algorithm that partitions data into a specified number of clusters.

The algorithm has the following features:

  • Several distance functions: Euclidean, Cosine, and Fast Cosine. The default is Euclidean.

  • For each cluster, the algorithm returns the centroid, a histogram for each attribute, and a rule describing the hyperbox that encloses the majority of the data assigned to the cluster. The centroid reports the mode for categorical attributes and the mean and variance for numeric attributes.

For information on the ore.odmKMeans function arguments, call help(ore.odmKMeans).

Settings for k-Means Models

The following table lists settings that apply to k-Means models.

Table 7-9 k-Means Model Settings

Setting Name Setting Value Description

KMNS_CONV_TOLERANCE

TO_CHAR(0<numeric_expr<1)

Minimum Convergence Tolerance for k-Means. The algorithm iterates until the minimum Convergence Tolerance is satisfied or until the maximum number of iterations, specified in KMNS_ITERATIONS, is reached.

Decreasing the Convergence Tolerance produces a more accurate solution but may result in longer run times.

The default Convergence Tolerance is 0.001.

KMNS_DISTANCE

KMNS_COSINE

KMNS_EUCLIDEAN

Distance function for k-Means.

The default distance function is KMNS_EUCLIDEAN.

KMNS_ITERATIONS

TO_CHAR(positive_numeric_expr)

Maximum number of iterations for k-Means. The algorithm iterates until either the maximum number of iterations is reached or the minimum Convergence Tolerance, specified in KMNS_CONV_TOLERANCE, is satisfied.

The default number of iterations is 20.

KMNS_MIN_PCT_ATTR_SUPPORT

TO_CHAR(0<=numeric_expr<=1)

Minimum percentage of attribute values that must be non-null in order for the attribute to be included in the rule description for the cluster.

If the data is sparse or includes many missing values, a minimum support that is too high can cause very short rules or even empty rules.

The default minimum support is 0.1.

KMNS_NUM_BINS

TO_CHAR(numeric_expr>0)

Number of bins in the attribute histogram produced by k-Means. The bin boundaries for each attribute are computed globally on the entire training data set. The binning method is equi-width. All attributes have the same number of bins with the exception of attributes with a single value that have only one bin.

The default number of histogram bins is 11.

KMNS_SPLIT_CRITERION

KMNS_SIZE

KMNS_VARIANCE

Split criterion for k-Means. The split criterion controls the initialization of new k-Means clusters. The algorithm builds a binary tree and adds one new cluster at a time.

When the split criterion is based on size, the new cluster is placed in the area where the largest current cluster is located. When the split criterion is based on the variance, the new cluster is placed in the area of the most spread-out cluster.

The default split criterion is KMNS_VARIANCE.

KMNS_RANDOM_SEED

Non-negative integer

This setting controls the seed of the random generator used during the k-Means initialization. It must be a non-negative integer value.

The default is 0.

KMNS_DETAILS

KMNS_DETAILS_NONE

KMNS_DETAILS_HIERARCHY

KMNS_DETAILS_ALL

This setting determines the level of cluster details that are computed during the build.

KMNS_DETAILS_NONE: No cluster details are computed. Only the scoring information is persisted.

KMNS_DETAILS_HIERARCHY: Cluster hierarchy and cluster record counts are computed. This is the default value.

KMNS_DETAILS_ALL: Cluster hierarchy, record counts, and descriptive statistics (means, variances, modes, histograms, and rules) are computed.
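
As with the other OREdm functions, these settings can be passed through the odm.settings argument. The following is a minimal sketch, assuming that ore.odmKMeans accepts the odm.settings argument in your OML4R version; the synthetic data and setting values are illustrative.

# A hedged sketch: overriding the distance function, iteration limit,
# and random seed (illustrative values, synthetic data).
xy_of <- ore.push(data.frame(x = rnorm(100), y = rnorm(100)))
km.cos <- ore.odmKMeans(~., xy_of, num.centers = 2,
                        odm.settings = list(KMNS_DISTANCE    = "KMNS_COSINE",
                                            KMNS_ITERATIONS  = 20,
                                            KMNS_RANDOM_SEED = 7))
summary(km.cos)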

Example 7-13 Using the ore.odmKMeans Function

This example demonstrates the use of the ore.odmKMeans function. The example creates two matrices that have 100 rows and two columns. The values in the rows are random variates. It binds the matrices into the matrix x, then coerces x to a data.frame and pushes it to the database as x_of, an ore.frame object. The example next calls the ore.odmKMeans function to build the KM model, km.mod1. It then calls the summary and histogram functions on the model. Figure 7-2 shows the graphic displayed by the histogram function.

Finally, the example makes a prediction using the model, pulls the result to local memory, and plots the results. Figure 7-3 shows the graphic displayed by the points function.

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_of <- ore.push (data.frame(x))
km.mod1 <- NULL
km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2)
summary(km.mod1)
histogram(km.mod1)
# Make a prediction.
km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y"))
head(km.res1, 3)
# Pull the results to the local memory and plot them.
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x=km.res1.local$x, y=km.res1.local$y),
                col=km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)
head(predict(km.mod1, x_of, type=c("class","raw"),
             supplemental.cols=c("x","y")), 3)

Listing for This Example

R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> x_of <- ore.push (data.frame(x))
R> km.mod1 <- NULL
R> km.mod1 <- ore.odmKMeans(~., x_of, num.centers=2)
R> summary(km.mod1)
 
Call:
ore.odmKMeans(formula = ~., data = x_of, num.centers = 2)
 
Settings: 
                         value
clus.num.clusters            2
block.growth                 2
conv.tolerance            0.01
distance             euclidean
iterations                   3
min.pct.attr.support       0.1
num.bins                    10
split.criterion       variance
prep.auto                   on
 
Centers: 
            x           y
2  0.99772307  0.93368684
3 -0.02721078 -0.05099784
R> histogram(km.mod1)
R> # Make a prediction.
R> km.res1 <- predict(km.mod1, x_of, type="class", supplemental.cols=c("x","y"))
R> head(km.res1, 3)
            x          y CLUSTER_ID
1 -0.03038444  0.4395409          3
2  0.17724606 -0.5342975          3
3 -0.17565761  0.2832132          3
R> # Pull the results to the local memory and plot them.
R> km.res1.local <- ore.pull(km.res1)
R> plot(data.frame(x=km.res1.local$x, y=km.res1.local$y),
+                  col=km.res1.local$CLUSTER_ID)
R> points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)
R> head(predict(km.mod1, x_of, type=c("class","raw"),
+               supplemental.cols=c("x","y")), 3)
           '2'       '3'           x          y CLUSTER_ID
1 8.610341e-03 0.9913897 -0.03038444  0.4395409          3
2 8.017890e-06 0.9999920  0.17724606 -0.5342975          3
3 5.494263e-04 0.9994506 -0.17565761  0.2832132          3

Figure 7-2 shows the graphic displayed by the invocation of the histogram function in Example 7-13.

Figure 7-2 Cluster Histograms for the km.mod1 Model

Description of Figure 7-2 follows
Description of "Figure 7-2 Cluster Histograms for the km.mod1 Model"

Figure 7-3 shows the graphic displayed by the invocation of the points function in Example 7-13.

Figure 7-3 Results of the points Function for the km.mod1 Model

Description of Figure 7-3 follows
Description of "Figure 7-3 Results of the points Function for the km.mod1 Model"

7.12 Naive Bayes

The ore.odmNB function builds an OML4SQL Naive Bayes model.

The Naive Bayes algorithm is based on conditional probabilities. Naive Bayes looks at the historical data and calculates conditional probabilities for the target values by observing the frequency of attribute values and of combinations of attribute values.

Naive Bayes assumes that each predictor is conditionally independent of the others given the target value; this independence assumption is what makes the computation of the conditional probabilities tractable.

For information on the ore.odmNB function arguments, call help(ore.odmNB).

Settings for Naive Bayes Models

The following table lists settings that apply to Naive Bayes models.

Table 7-10 Naive Bayes Model Settings

Setting Name Setting Value Description

NABS_PAIRWISE_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of the pairwise threshold for the NB algorithm.

Default is 0.

NABS_SINGLETON_THRESHOLD

TO_CHAR(0<= numeric_expr <=1)

Value of the singleton threshold for the NB algorithm.

Default is 0.
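
Both thresholds can be supplied through the odm.settings argument of ore.odmNB. The following is a minimal sketch, assuming that odm.settings is supported in your OML4R version; the threshold values are illustrative, not recommendations.

# A hedged sketch: raising the pairwise and singleton thresholds
# (illustrative values).
m <- mtcars
m$gear <- as.factor(m$gear)
mtcars_nb <- ore.push(m)
nb.smooth <- ore.odmNB(gear ~ ., mtcars_nb,
                       odm.settings = list(NABS_PAIRWISE_THRESHOLD  = 0.01,
                                           NABS_SINGLETON_THRESHOLD = 0.01))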

Example 7-14 Using the ore.odmNB Function

This example creates an input ore.frame, builds a Naive Bayes model, makes predictions, and generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
row.names(mtcars_of) <- mtcars_of$ID
# Build the model.
nb.mod  <- ore.odmNB(gear ~ ., mtcars_of)
summary(nb.mod)
# Make predictions and generate a confusion matrix.
nb.res  <- predict (nb.mod, mtcars_of, "gear")
with(nb.res, table(gear, PREDICTION))

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R> row.names(mtcars_of) <- mtcars_of$ID
R> # Build the model.
R> nb.mod  <- ore.odmNB(gear ~ ., mtcars_of)
R> summary(nb.mod)
 
Call:
ore.odmNB(formula = gear ~ ., data = mtcars_of)
 
Settings: 
          value
prep.auto    on
 
Apriori: 
      3       4       5 
0.46875 0.37500 0.15625 
Tables: 
$ID
  ( ; 26.5), [26.5; 26.5]  (26.5;  )
3              1.00000000           
4              0.91666667 0.08333333
5                         1.00000000
 
$am
          0         1
3 1.0000000          
4 0.3333333 0.6666667
5           1.0000000
 
$cyl
  '4', '6' '8'
3      0.2 0.8
4      1.0    
5      0.6 0.4
 
$disp
  ( ; 196.299999999999995), [196.299999999999995; 196.299999999999995]
3                                                           0.06666667
4                                                           1.00000000
5                                                           0.60000000
  (196.299999999999995;  )
3               0.93333333
4                         
5               0.40000000
 
$drat
  ( ; 3.385), [3.385; 3.385] (3.385;  )
3                  0.8666667  0.1333333
4                             1.0000000
5                             1.0000000
$hp
  ( ; 136.5), [136.5; 136.5] (136.5;  )
3                        0.2        0.8
4                        1.0           
5                        0.4        0.6
 
$vs
          0         1
3 0.8000000 0.2000000
4 0.1666667 0.8333333
5 0.8000000 0.2000000
 
$wt
  ( ; 3.2024999999999999), [3.2024999999999999; 3.2024999999999999]
3                                                        0.06666667
4                                                        0.83333333
5                                                        0.80000000
  (3.2024999999999999;  )
3              0.93333333
4              0.16666667
5              0.20000000
 
 
Levels: 
[1] "3" "4" "5"

R> # Make predictions and generate a confusion matrix.
R> nb.res  <- predict (nb.mod, mtcars_of, "gear")
R> with(nb.res, table(gear, PREDICTION))
    PREDICTION
gear  3  4  5
   3 14  1  0
   4  0 12  0
   5  0  1  4

7.13 Non-Negative Matrix Factorization

The ore.odmNMF function builds an OML4SQL Non-Negative Matrix Factorization (NMF) model for feature extraction.

Each feature extracted by NMF is a linear combination of the original attribute set. Each feature has a set of non-negative coefficients, which are a measure of the weight of each attribute on the feature. If the argument allow.negative.scores is TRUE, then negative coefficients are allowed.

For information on the ore.odmNMF function arguments, call help(ore.odmNMF).

Settings for Non-Negative Matrix Factorization Models

The following table lists settings that apply to Non-Negative Matrix Factorization models.

Table 7-11 Non-Negative Matrix Factorization Model Settings

Setting Name Setting Value Description

NMFS_CONV_TOLERANCE

TO_CHAR(0< numeric_expr <=0.5)

Convergence tolerance for NMF algorithm

Default is 0.05

NMFS_NONNEGATIVE_SCORING

NMFS_NONNEG_SCORING_ENABLE

NMFS_NONNEG_SCORING_DISABLE

Whether negative numbers should be allowed in scoring results. When set to NMFS_NONNEG_SCORING_ENABLE, negative feature values will be replaced with zeros. When set to NMFS_NONNEG_SCORING_DISABLE, negative feature values will be allowed.

Default is NMFS_NONNEG_SCORING_ENABLE

NMFS_NUM_ITERATIONS

TO_CHAR(1 <= numeric_expr <=500)

Number of iterations for NMF algorithm

Default is 50

NMFS_RANDOM_SEED

TO_CHAR(numeric_expr)

Random seed for NMF algorithm.

Default is –1.
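
These settings can be supplied through the odm.settings argument of ore.odmNMF, and the allow.negative.scores argument described earlier corresponds to the NMFS_NONNEGATIVE_SCORING setting. The following is a minimal sketch, assuming odm.settings is supported in your OML4R version; the values are illustrative.

# A hedged sketch: fixing the random seed and iteration count, and
# permitting negative scores (illustrative values).
training.set <- ore.push(npk[1:18, c("N","P","K")])
nmf.seeded <- ore.odmNMF(~., training.set, num.features = 3,
                         allow.negative.scores = TRUE,
                         odm.settings = list(NMFS_RANDOM_SEED    = 1,
                                             NMFS_NUM_ITERATIONS = 100))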

Example 7-15 Using the ore.odmNMF Function

This example creates an NMF model on a training data set and scores on a test data set.

training.set <- ore.push(npk[1:18, c("N","P","K")])
scoring.set <- ore.push(npk[19:24, c("N","P","K")])
nmf.mod <- ore.odmNMF(~., training.set, num.features = 3)
features(nmf.mod)
summary(nmf.mod)
predict(nmf.mod, scoring.set)

Listing for This Example

R> training.set <- ore.push(npk[1:18, c("N","P","K")])
R> scoring.set <- ore.push(npk[19:24, c("N","P","K")])
R> nmf.mod <- ore.odmNMF(~., training.set, num.features = 3)
R> features(nmf.mod)
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE  COEFFICIENT
1           1              K               0 3.723468e-01
2           1              K               1 1.761670e-01
3           1              N               0 7.469067e-01
4           1              N               1 1.085058e-02
5           1              P               0 5.730082e-01
6           1              P               1 2.797865e-02
7           2              K               0 4.107375e-01
8           2              K               1 2.193757e-01
9           2              N               0 8.065393e-03
10          2              N               1 8.569538e-01
11          2              P               0 4.005661e-01
12          2              P               1 4.124996e-02
13          3              K               0 1.918852e-01
14          3              K               1 3.311137e-01
15          3              N               0 1.547561e-01
16          3              N               1 1.283887e-01
17          3              P               0 9.791965e-06
18          3              P               1 9.113922e-01
R> summary(nmf.mod)
 
Call:
ore.odmNMF(formula = ~., data = training.set, num.features = 3)
 
Settings: 
                                              value
feat.num.features                                 3
nmfs.conv.tolerance                             .05
nmfs.nonnegative.scoring nmfs.nonneg.scoring.enable
nmfs.num.iterations                              50
nmfs.random.seed                                 -1
prep.auto                                        on
 
Features: 
   FEATURE_ID ATTRIBUTE_NAME ATTRIBUTE_VALUE  COEFFICIENT
1           1              K               0 3.723468e-01
2           1              K               1 1.761670e-01
3           1              N               0 7.469067e-01
4           1              N               1 1.085058e-02
5           1              P               0 5.730082e-01
6           1              P               1 2.797865e-02
7           2              K               0 4.107375e-01
8           2              K               1 2.193757e-01
9           2              N               0 8.065393e-03
10          2              N               1 8.569538e-01
11          2              P               0 4.005661e-01
12          2              P               1 4.124996e-02
13          3              K               0 1.918852e-01
14          3              K               1 3.311137e-01
15          3              N               0 1.547561e-01
16          3              N               1 1.283887e-01
17          3              P               0 9.791965e-06
18          3              P               1 9.113922e-01
R> predict(nmf.mod, scoring.set)
         '1'       '2'        '3' FEATURE_ID
19 0.1972489 1.2400782 0.03280919          2
20 0.7298919 0.0000000 1.29438165          3
21 0.1972489 1.2400782 0.03280919          2
22 0.0000000 1.0231268 0.98567623          2
23 0.7298919 0.0000000 1.29438165          3
24 1.5703239 0.1523159 0.00000000          1

7.14 Orthogonal Partitioning Cluster

The ore.odmOC function builds an OML4SQL model using the Orthogonal Partitioning Cluster (O-Cluster) algorithm.

The O-Cluster algorithm builds a hierarchical grid-based clustering model, that is, it creates axis-parallel (orthogonal) partitions in the input attribute space. The algorithm operates recursively. The resulting hierarchical structure represents an irregular grid that tessellates the attribute space into clusters. The resulting clusters define dense areas in the attribute space.

The clusters are described by intervals along the attribute axes and the corresponding centroids and histograms. The sensitivity argument defines a baseline density level. Only areas that have a peak density above this baseline level can be identified as clusters.

The k-Means algorithm tessellates the space even when natural clusters may not exist. For example, if there is a region of uniform density, k-Means tessellates it into n clusters (where n is specified by the user). O-Cluster separates areas of high density by placing cutting planes through areas of low density. O-Cluster needs multi-modal histograms (peaks and valleys). If an area has projections with uniform or monotonically changing density, O-Cluster does not partition it.

The clusters discovered by O-Cluster are used to generate a Bayesian probability model that is then used during scoring by the predict function for assigning data points to clusters. The generated probability model is a mixture model where the mixture components are represented by a product of independent normal distributions for numeric attributes and multinomial distributions for categorical attributes.

If you choose to prepare the data for an O-Cluster model, keep the following points in mind:

  • The O-Cluster algorithm does not necessarily use all the input data when it builds a model. It reads the data in batches (the default batch size is 50000). It only reads another batch if it believes, based on statistical tests, that there may still exist clusters that it has not yet uncovered.

  • Because O-Cluster may stop the model build before it reads all of the data, it is highly recommended that the data be randomized.

  • Binary attributes should be declared as categorical. O-Cluster maps categorical data to numeric values.

  • The use of OML4SQL equi-width binning transformation with automated estimation of the required number of bins is highly recommended.

  • The presence of outliers can significantly impact clustering algorithms. Use a clipping transformation before binning or normalizing. Outliers with equi-width binning can prevent O-Cluster from detecting clusters. As a result, the whole population appears to fall within a single cluster.

The specification of the formula argument has the form ~ terms, where terms are the column names to include in the model. Multiple terms are specified using + between column names. Use ~ . if all columns in data should be used for model building. To exclude columns, precede each excluded column name with - (see the sketch that follows).
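
For example, the following sketch shows the three common forms; the data set and column names are illustrative assumptions.

# Illustrative formula specifications (synthetic data; column names assumed).
df_of <- ore.push(data.frame(ID = 1:100, x = rnorm(100), y = rnorm(100)))
oc.all  <- ore.odmOC(~ .,      df_of, num.centers = 2)  # all columns
oc.xy   <- ore.odmOC(~ x + y,  df_of, num.centers = 2)  # only x and y
oc.noid <- ore.odmOC(~ . - ID, df_of, num.centers = 2)  # all columns except ID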

For information on the ore.odmOC function arguments, call help(ore.odmOC).

Settings for Orthogonal Partitioning Cluster Models

The following table lists settings that apply to Orthogonal Partitioning Cluster models.

Table 7-12 Orthogonal Partitioning Cluster Model Settings

Setting Name Setting Value Description

OCLT_SENSITIVITY

TO_CHAR(0 <=numeric_expr <=1)

A fraction that specifies the peak density required for separating a new cluster. The fraction is related to the global uniform density.

Default is 0.5

Example 7-16 Using the ore.odmOC Function

This example creates an O-Cluster model on a synthetic data set. The figure following the example shows the histogram of the resulting clusters.

x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
x_of <- ore.push (data.frame(ID=1:100,x))
rownames(x_of) <- x_of$ID
oc.mod <- ore.odmOC(~., x_of, num.centers=2)
summary(oc.mod)
histogram(oc.mod)
predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y"))

Listing for This Example

R> x <- rbind(matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> x_of <- ore.push (data.frame(ID=1:100,x))
R> rownames(x_of) <- x_of$ID
R> oc.mod <- ore.odmOC(~., x_of, num.centers=2)
R> summary(oc.mod)
 
Call:
ore.odmOC(formula = ~., data = x_of, num.centers = 2)
 
Settings: 
                  value
clus.num.clusters     2
max.buffer        50000
sensitivity         0.5
prep.auto            on
 
Clusters: 
  CLUSTER_ID ROW_CNT PARENT_CLUSTER_ID TREE_LEVEL DISPERSION IS_LEAF
1          1     100                NA          1         NA   FALSE
2          2      56                 1          2         NA    TRUE
3          3      43                 1          2         NA    TRUE
 
Centers: 
   MEAN.x   MEAN.y
2 1.85444 1.941195
3 4.04511 4.111740
R> histogram(oc.mod)
R> predict(oc.mod, x_of, type=c("class","raw"), supplemental.cols=c("x","y"))
             '2'          '3'        x        y CLUSTER_ID
1   3.616386e-08 9.999999e-01 3.825303 3.935346          3
2   3.253662e-01 6.746338e-01 3.454143 4.193395          3
3   3.616386e-08 9.999999e-01 4.049120 4.172898          3
# ... Intervening rows not shown.
98  1.000000e+00 1.275712e-12 2.011463 1.991468          2
99  1.000000e+00 1.275712e-12 1.727580 1.898839          2
100 1.000000e+00 1.275712e-12 2.092737 2.212688          2

Figure 7-4 Output of the histogram Function for the ore.odmOC Model

Description of Figure 7-4 follows
Description of "Figure 7-4 Output of the histogram Function for the ore.odmOC Model"

7.15 Singular Value Decomposition

The ore.odmSVD function creates a model that uses the OML4SQL Singular Value Decomposition (SVD) algorithm.

Singular Value Decomposition (SVD) is a feature extraction algorithm. SVD is an orthogonal linear transformation that captures the underlying variance of the data by decomposing a rectangular matrix into three matrices: 'U', 'D', and 'V'. Matrix 'D' is a diagonal matrix, and its singular values reflect the amount of data variance captured by the bases.

Settings for Singular Value Decomposition Models

The following table lists settings that apply to Singular Value Decomposition models.

Table 7-13 Singular Value Decomposition Model Settings

Setting Name Setting Value Description

SVDS_MAX_NUM_FEATURES

2500

The maximum number of features supported by SVD.
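
By default, the U matrix is not computed; the listing in Example 7-17 shows u.matrix.output as u.matrix.disable, along with a warning that the U matrix is not calculated. The following is a hedged sketch that requests the U matrix through odm.settings; the setting name SVDS_U_MATRIX_OUTPUT and the value SVDS_U_MATRIX_ENABLE are assumed from the corresponding OML4SQL setting.

# A hedged sketch: requesting computation of the U matrix
# (SVDS_U_MATRIX_OUTPUT and SVDS_U_MATRIX_ENABLE are assumed names).
IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
svd.umod <- ore.odmSVD(~. -Id, IRIS,
                       odm.settings = list(SVDS_U_MATRIX_OUTPUT = "SVDS_U_MATRIX_ENABLE"))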

Example 7-17 Using the ore.odmSVD Function

IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))

svd.mod <- ore.odmSVD(~. -Id, IRIS)
summary(svd.mod)
d(svd.mod)
v(svd.mod)
head(predict(svd.mod, IRIS, supplemental.cols = "Id"))

svd.pmod <- ore.odmSVD(~. -Id, IRIS, 
                       odm.settings = list(odms_partition_columns = "Species"))
summary(svd.pmod)
d(svd.pmod)
v(svd.pmod)
head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))

Listing for This Example

R> IRIS <- ore.push(cbind(Id = seq_along(iris[[1L]]), iris))
R> 
R> svd.mod <- ore.odmSVD(~. -Id, IRIS)
R> summary(svd.mod)
Call:
ore.odmSVD(formula = ~. - Id, data = IRIS)

Settings: 
                                               value
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
scoring.mode                             scoring.svd
u.matrix.output                     u.matrix.disable

d: 
  FEATURE_ID      VALUE
1          1 96.2182677
2          2 19.0780817
3          3  7.2270380
4          4  3.1502152
5          5  1.8849634
6          6  1.1474731
7          7  0.5814097
v: 
  ATTRIBUTE_NAME ATTRIBUTE_VALUE        '1'         '2'          '3'         '4'         '5'         '6'          '7'
1   Petal.Length            <NA> 0.51162932  0.65943465 -0.004420703  0.05479795 -0.51969015  0.17392232 -0.005674672
2    Petal.Width            <NA> 0.16745698  0.32071102  0.146484369  0.46553390  0.72685033  0.31962337 -0.021274748
3   Sepal.Length            <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847  0.31969417 -0.09379235 -0.067308615
4    Sepal.Width            <NA> 0.37906736 -0.50824062  0.142810811  0.69139828 -0.25849391 -0.17606099 -0.041908520
5        Species          setosa 0.03170407 -0.32247642  0.184499940 -0.12245506 -0.14348647  0.76017824  0.497502783
6        Species      versicolor 0.04288799  0.04054823 -0.780684855  0.19827972  0.07363250 -0.12354271  0.571881302
7        Species       virginica 0.05018593  0.16796988  0.551546107 -0.07177990  0.08109974 -0.48442099  0.647048040
Warning message:
In u.ore.odmSVD(object) : U matrix is not calculated.
R> d(svd.mod)
  FEATURE_ID      VALUE
1          1 96.2182677
2          2 19.0780817
3          3  7.2270380
4          4  3.1502152
5          5  1.8849634
6          6  1.1474731
7          7  0.5814097
Warning message:
ORE object has no unique key - using random order 
R> v(svd.mod)
  ATTRIBUTE_NAME ATTRIBUTE_VALUE        '1'         '2'          '3'         '4'         '5'         '6'          '7'
1   Petal.Length            <NA> 0.51162932  0.65943465 -0.004420703  0.05479795 -0.51969015  0.17392232 -0.005674672
2    Petal.Width            <NA> 0.16745698  0.32071102  0.146484369  0.46553390  0.72685033  0.31962337 -0.021274748
3   Sepal.Length            <NA> 0.74909171 -0.26482593 -0.102057243 -0.49272847  0.31969417 -0.09379235 -0.067308615
4    Sepal.Width            <NA> 0.37906736 -0.50824062  0.142810811  0.69139828 -0.25849391 -0.17606099 -0.041908520
5        Species          setosa 0.03170407 -0.32247642  0.184499940 -0.12245506 -0.14348647  0.76017824  0.497502783
6        Species      versicolor 0.04288799  0.04054823 -0.780684855  0.19827972  0.07363250 -0.12354271  0.571881302
7        Species       virginica 0.05018593  0.16796988  0.551546107 -0.07177990  0.08109974 -0.48442099  0.647048040
Warning message:
ORE object has no unique key - using random order 
R> head(predict(svd.mod, IRIS, supplemental.cols = "Id"))
  Id        '1'        '2'        '3'         '4'           '5'          '6'          '7' FEATURE_ID
1  1 0.06161595 -0.1291839 0.02586865 -0.01449182  1.536727e-05 -0.023495349 -0.007998605          2
2  2 0.05808905 -0.1130876 0.01881265 -0.09294788  3.466226e-02  0.069569113  0.051195429          2
3  3 0.05678818 -0.1190959 0.02565027 -0.01950986  8.851560e-04  0.040073030  0.060908867          2
4  4 0.05667915 -0.1081308 0.02496402 -0.02233741 -5.750222e-02  0.093904181  0.077741713          2
5  5 0.06123138 -0.1304597 0.02925687  0.02309694 -3.065834e-02 -0.030664898 -0.003629897          2
6  6 0.06747071 -0.1302726 0.03340671  0.06114966 -9.547838e-03 -0.008210224 -0.081807741          2
R> 
R> svd.pmod <- ore.odmSVD(~. -Id, IRIS, 
+                         odm.settings = list(odms_partition_columns = "Species"))
R> summary(svd.pmod)
$setosa

Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))

Settings: 
                                               value
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns                     "Species"
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
scoring.mode                             scoring.svd
u.matrix.output                     u.matrix.disable

d: 
  FEATURE_ID      VALUE
1          1 44.2872290
2          2  1.5719162
3          3  1.1458732
4          4  0.6836692
v: 
  ATTRIBUTE_NAME ATTRIBUTE_VALUE       '1'         '2'        '3'         '4'
1   Petal.Length            <NA> 0.2334487  0.46456598  0.8317440 -0.19463332
2    Petal.Width            <NA> 0.0395488  0.04182015  0.1946750  0.97917752
3   Sepal.Length            <NA> 0.8010073  0.40303704 -0.4410167  0.03811461
4    Sepal.Width            <NA> 0.5498408 -0.78739486  0.2753323 -0.04331888

$versicolor

Call:
ore.odmSVD(formula = ~. - Id, data = IRIS, odm.settings = list(odms_partition_columns = "Species"))

Settings: 
                                               value
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
R> # ... Output for the remaining partitions not shown.
R> d(svd.pmod)
   PARTITION_NAME FEATURE_ID      VALUE
1          setosa          1 44.2872290
2          setosa          2  1.5719162
3          setosa          3  1.1458732
4          setosa          4  0.6836692
5      versicolor          1 56.2523412
6      versicolor          2  1.9106625
7      versicolor          3  1.7015929
8      versicolor          4  0.6986103
9       virginica          1 66.2734064
10      virginica          2  2.4318639
11      virginica          3  1.6007740
12      virginica          4  1.2958261
Warning message:
ORE object has no unique key - using random order 
R> v(svd.pmod)
   PARTITION_NAME ATTRIBUTE_NAME ATTRIBUTE_VALUE       '1'         '2'         '3'         '4'
1          setosa   Petal.Length            <NA> 0.2334487  0.46456598  0.83174398 -0.19463332
2          setosa    Petal.Width            <NA> 0.0395488  0.04182015  0.19467497  0.97917752
3          setosa   Sepal.Length            <NA> 0.8010073  0.40303704 -0.44101672  0.03811461
4          setosa    Sepal.Width            <NA> 0.5498408 -0.78739486  0.27533228 -0.04331888
5      versicolor   Petal.Length            <NA> 0.5380908  0.49576111 -0.60174021 -0.32029352
6      versicolor    Petal.Width            <NA> 0.1676394  0.36693207 -0.03448373  0.91436795
7      versicolor   Sepal.Length            <NA> 0.7486029 -0.64738491  0.06943054  0.12516311
8      versicolor    Sepal.Width            <NA> 0.3492119  0.44774385  0.79492074 -0.21372297
9       virginica   Petal.Length            <NA> 0.5948985 -0.26368708  0.65157671 -0.38988802
10      virginica    Petal.Width            <NA> 0.2164036  0.59106806  0.42921836  0.64774968
11      virginica   Sepal.Length            <NA> 0.7058813 -0.27846153 -0.53436210  0.37235450
12      virginica    Sepal.Width            <NA> 0.3177999  0.70962445 -0.32507927 -0.53829342
Warning message:
ORE object has no unique key - using random order 
R> head(predict(svd.pmod, IRIS, supplemental.cols = "Id"))
  Id       '1'          '2'          '3'         '4' FEATURE_ID
1  1 0.1432539 -0.026487881 -0.071688339 -0.04956008          1
2  2 0.1334289  0.172689424 -0.114854368 -0.02902893          2
3  3 0.1317675 -0.008327214 -0.062409295 -0.02438248          1
4  4 0.1297716  0.075232572  0.097222019 -0.08055912          1
5  5 0.1426868 -0.102219140 -0.009172782 -0.06147133          1
6  6 0.1554060 -0.055950655  0.160698708  0.14286095          3

7.16 Support Vector Machine

The ore.odmSVM function builds an OML4R Support Vector Machine (SVM) model.

SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik-Chervonenkis theory. SVM has strong regularization properties. Regularization refers to the model's ability to generalize to new data.

SVM models have a functional form similar to neural networks and radial basis functions, both popular machine learning techniques.

SVM can be used to solve the following problems:

  • Classification: SVM classification is based on decision planes that define decision boundaries. A decision plane is one that separates objects that have different class memberships. SVM finds the vectors ("support vectors") that define the separators that give the widest separation of classes.

    SVM classification supports both binary and multiclass targets.

  • Regression: SVM uses an epsilon-insensitive loss function to solve regression problems.

    SVM regression tries to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity tube. Predictions falling within epsilon distance of the true target value are not interpreted as errors.

  • Anomaly Detection: Anomaly detection identifies cases that are unusual within data that is seemingly homogeneous. Anomaly detection is an important tool for detecting fraud, network intrusion, and other rare events that may have great significance but are hard to find.

    Anomaly detection is implemented as one-class SVM classification. An anomaly detection model predicts whether a data point is typical for a given distribution or not.

The ore.odmSVM function builds each of these three different types of models. Some arguments apply to classification models only, some to regression models only, and some to anomaly detection models only.

For information on the ore.odmSVM function arguments, call help(ore.odmSVM).

Settings for Support Vector Machine Models

The following table lists settings that apply to Support Vector Machine models.

Table 7-14 Support Vector Machine Model Settings

Setting Name Setting Value Description

SVMS_COMPLEXITY_FACTOR

TO_CHAR(numeric_expr>0)

Regularization setting that balances the complexity of the model against model robustness to achieve good generalization on new data. SVM uses a data-driven approach to finding the complexity factor.

Value of complexity factor for SVM algorithm (both Classification and Regression).

Default value estimated from the data by the algorithm.

SVMS_CONV_TOLERANCE

TO_CHAR(numeric_expr>0)

Convergence tolerance for SVM algorithm.

Default is 0.0001.

SVMS_EPSILON

TO_CHAR(numeric_expr >0)

Regularization setting for regression, similar to complexity factor. Epsilon specifies the allowable residuals, or noise, in the data.

Value of epsilon factor for SVM regression.

Default is 0.1.

SVMS_KERNEL_FUNCTION

SVMS_GAUSSIAN

SVMS_LINEAR

Kernel for Support Vector Machine. Linear or Gaussian.

The default value is SVMS_LINEAR.

SVMS_OUTLIER_RATE

TO_CHAR(0 < numeric_expr < 1)

The desired rate of outliers in the training data. Valid for One-Class SVM models only (Anomaly Detection).

Default is 0.01.

SVMS_STD_DEV

TO_CHAR(numeric_expr>0)

Controls the spread of the Gaussian kernel function. SVM uses a data-driven approach to find a standard deviation value that is on the same scale as distances between typical cases.

Value of standard deviation for SVM algorithm.

This is applicable only for Gaussian kernel.

Default value estimated from the data by the algorithm.

SVMS_NUM_ITERATIONS

Positive integer

This setting sets an upper limit on the number of SVM iterations. The default is system determined because it depends on the SVM solver.

SVMS_NUM_PIVOTS

Range [1; 10000]

This setting sets an upper limit on the number of pivots used in the Incomplete Cholesky decomposition. It can be set only for non-linear kernels. The default value is 200.

SVMS_BATCH_ROWS

Positive integer

This setting applies to SVM models with a linear kernel and sets the batch size for the SGD solver. An input of 0 triggers a data-driven batch size estimate. The default is 20000.

SVMS_REGULARIZER

SVMS_REGULARIZER_L1

SVMS_REGULARIZER_L2

This setting controls the type of regularization that the SGD SVM solver uses. The setting can be used only for linear SVM models. The default is system determined because it depends on the potential model size.

SVMS_SOLVER

SVMS_SOLVER_SGD (Sub-Gradient Descent)

SVMS_SOLVER_IPM (Interior Point Method)

This setting allows the user to choose the SVM solver. The SGD solver cannot be selected if the kernel is non-linear. The default value is system determined.
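
The settings in Table 7-14 can be passed through the odm.settings argument, in the same way that odms_partition_columns and kmns_distance are passed in other examples in this chapter. The following is a minimal sketch, assuming the mtcars_of proxy frame built in Example 7-18 below; the lowercase setting names mirror the SVMS_* names in the table.

# Hedged sketch: select the Gaussian kernel and loosen the convergence
# tolerance through odm.settings (mtcars_of as in Example 7-18).
svm.gau <- ore.odmSVM(gear ~ . - ID, mtcars_of, "classification",
                      odm.settings = list(svms_kernel_function = "SVMS_GAUSSIAN",
                                          svms_conv_tolerance  = 0.001))
summary(svm.gau)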

Example 7-18 Using the ore.odmSVM Function and Generating a Confusion Matrix

This example demonstrates the use of SVM classification. The example creates mtcars in the database from the R mtcars data set, builds a classification model, makes predictions, and finally generates a confusion matrix.

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)
mtcars_of <- ore.push(m)
svm.mod  <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification")
summary(svm.mod)
svm.res  <- predict (svm.mod, mtcars_of,"gear")
with(svm.res, table(gear, PREDICTION))  # generate confusion matrix

Listing for This Example

R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl  <- as.factor(m$cyl)
R> m$vs   <- as.factor(m$vs)
R> m$ID   <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R>  
R> svm.mod  <- ore.odmSVM(gear ~ .-ID, mtcars_of, "classification")
R> summary(svm.mod)
Call:
ore.odmSVM(formula = gear ~ . - ID, data = mtcars_of, type = "classification")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
complexity.factor  0.385498
conv.tolerance        1e-04
kernel.cache.size  50000000
kernel.function    gaussian
std.dev            1.072341
 
Coefficients: 
[1] No coefficients with gaussian kernel
R> svm.res  <- predict (svm.mod, mtcars_of,"gear")
R> with(svm.res, table(gear, PREDICTION))  # generate confusion matrix
    PREDICTION
gear  3  4
   3 12  3
   4  0 12
   5  2  3

Example 7-19 Using the ore.odmSVM Function and Building a Regression Model

This example demonstrates SVM regression. The example creates a data frame, pushes it to a table, and then builds a regression model; note that ore.odmSVM specifies a linear kernel.

x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)
dat <-ore.push(data.frame(x=x, y=y))
 
# Build model with linear kernel
svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear")
summary(svm.mod)
coef(svm.mod)
svm.res <- predict(svm.mod,dat, supplemental.cols="x")
head(svm.res,6)

Listing for This Example

R> x <- seq(0.1, 5, by = 0.02)
R> y <- log(x) + rnorm(x, sd = 0.2)
R> dat <-ore.push(data.frame(x=x, y=y))
R>  
R> # Build model with linear kernel
R> svm.mod <- ore.odmSVM(y~x,dat,"regression", kernel.function="linear")
R> summary(svm.mod)
 
Call:
ore.odmSVM(formula = y ~ x, data = dat, type = "regression", 
    kernel.function = "linear")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
complexity.factor  0.620553
conv.tolerance        1e-04
epsilon            0.098558
kernel.function      linear
 
Residuals: 
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.79130 -0.28210 -0.05592 -0.01420  0.21460  1.58400 
 
Coefficients: 
     variable value  estimate
1           x       0.6637951
2 (Intercept)       0.3802170
 
R> coef(svm.mod)
     variable value  estimate
1           x       0.6637951
2 (Intercept)       0.3802170
R> svm.res <- predict(svm.mod,dat, supplemental.cols="x")
R> head(svm.res,6)
     x PREDICTION
1 0.10 -0.7384312
2 0.12 -0.7271410
3 0.14 -0.7158507
4 0.16 -0.7045604
5 0.18 -0.6932702
6 0.20 -0.6819799

Example 7-20 Using the ore.odmSVM Function and Building an Anomaly Detection Model

This example demonstrates SVM anomaly detection. It uses mtcars_of created in the classification example and builds an anomaly detection model.

svm.mod  <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection")
summary(svm.mod)
svm.res  <- predict (svm.mod, mtcars_of, "ID")
head(svm.res)
table(svm.res$PREDICTION)

Listing for This Example

R> svm.mod  <- ore.odmSVM(~ .-ID, mtcars_of, "anomaly.detection")
R> summary(svm.mod)
 
Call:
ore.odmSVM(formula = ~. - ID, data = mtcars_of, type = "anomaly.detection")
 
Settings: 
                      value
prep.auto                on
active.learning   al.enable
conv.tolerance        1e-04
kernel.cache.size  50000000
kernel.function    gaussian
outlier.rate             .1
std.dev            0.719126
 
Coefficients: 
[1] No coefficients with gaussian kernel
 
R> svm.res  <- predict (svm.mod, mtcars_of, "ID")
R> head(svm.res)
                        '0'       '1' ID PREDICTION
Mazda RX4         0.4999405 0.5000595  1          1
Mazda RX4 Wag     0.4999794 0.5000206  2          1
Datsun 710        0.4999618 0.5000382  3          1
Hornet 4 Drive    0.4999819 0.5000181  4          1
Hornet Sportabout 0.4949872 0.5050128  5          1
Valiant           0.4999415 0.5000585  6          1
R> table(svm.res$PREDICTION)
 
 0  1 
 5 27 

7.17 Partitioned Model

A partitioned model is an ensemble model that consists of multiple sub-models, one for each partition of the data.

A partitioned model may achieve better accuracy through multiple targeted models that are managed and used as one. A partitioned model can simplify scoring by allowing you to reference the top-level model only. The proper sub-model is chosen by the system based on the values of the partitioned column or columns for each row of data to be scored.

To create a partitioned OML4SQL model, use the odm.settings argument with ODMS_PARTITION_COLUMNS as the name and with the names of the columns by which to partition the input data as the value. The OREdm function returns a model with a sub-model for each partition. The partitions are based on the unique values found in the columns.

The partitions function returns an ore.frame that lists each partition of the specified model object and the associated partition column values of the model. Partition names are system-determined. The function returns NULL for a non-partitioned model.
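
For example, using the SVD models built in Example 7-17 (svd.pmod is partitioned by Species; svd.mod is not):

partitions(svd.pmod)   # one row per partition with the partition column values
partitions(svd.mod)    # NULL, because svd.mod is not partitioned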

Example 7-21 Create a Partitioned Model

This example creates a partitioned Support Vector Machine classification model. It uses the Wine Quality data set from the University of California, Irvine Machine Learning Repository.

# Download the wine data set and create the data table.
white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
white.wine <- read.csv(white.url, header = TRUE, sep = ";")
white.wine$color <- "white"

red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
red.wine <- read.csv(red.url, header = TRUE, sep = ";")
red.wine$color <- "red"

dat <- rbind(white.wine, red.wine)

# Drop the WINE table if it exists.
ore.drop(table="WINE")
ore.create(dat, table="WINE")

# Assign row names to enable row indexing for train and test samples.
row.names(WINE) <- WINE$color

# Enable reproducible results.
set.seed(seed=6218945)

n.rows        <- nrow(WINE)

# Train and test sampling.
random.sample <- sample(1:n.rows, ceiling(n.rows/2))

# Sample in-database using row indexing.
WINE.train    <- WINE[random.sample,]
WINE.test     <- WINE[setdiff(1:n.rows,random.sample),]

# Build a Support Vector Machine classification model 
# on the training data set, using both red and white wine.
mod.svm   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
                        "classification", kernel.function="linear")

# Predict wine quality on the test data set.
pred.svm  <- predict (mod.svm, WINE.test,"quality")

# View the probability of each class and prediction.
head(pred.svm,3)

# Generate a confusion matrix. Note that 3 and 8 are not predicted.
with(pred.svm, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))

# Build a partitioned SVM model based on wine color.
# Specify the partitioning column with the odm.settings argument.
mod.svm2   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train, 
                         "classification", kernel.function="linear",
                         odm.settings=list(odms_partition_columns = "color"))

# Predict wine quality on the test data set.
pred.svm2  <- predict (mod.svm2, WINE.test, "quality")

# View the probability of each class and prediction.
head(pred.svm2,3)

# Generate a confusion matrix. Note that 3 and 4 are not predicted.
with(pred.svm2, table(quality, PREDICTION, dnn = c("Actual", "Predicted")))

partitions(mod.svm2)
summary(mod.svm2["red"])

Listing for This Example

> # Download the wine data set and create the data table.
> white.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
> white.wine <- read.csv(white.url, header = TRUE, sep = ";")
> white.wine$color <- "white"
> 
> red.url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
> red.wine <- read.csv(red.url, header = TRUE, sep = ";")
> red.wine$color <- "red"
> 
> dat <- rbind(white.wine, red.wine)
> 
> # Drop the WINE table if it exists.
> ore.drop(table="WINE")
Warning message:
Table WINE does not exist. 
> ore.create(dat, table="WINE")
> 
> # Assign row names to enable row indexing for train and test samples.
> row.names(WINE) <- WINE$color
> 
> # Enable reproducible results.
> set.seed(seed=6218945)                  
>
> n.rows        <- nrow(WINE)
>
> # Train and test sampling.
> random.sample <- sample(1:n.rows, ceiling(n.rows/2))
>
> # Sample in-database using row indexing.
> WINE.train    <- WINE[random.sample,]
> WINE.test     <- WINE[setdiff(1:n.rows,random.sample),]    
> 
> # Build a Support Vector Machine classification model
> # on the training data set, using both red and white wine.
> mod.svm   <- ore.odmSVM(quality~.-pH-fixed.acidity, WINE.train,
+                         "classification",kernel.function="linear")
>
> # Predict wine quality on the test data set.
> pred.svm  <- predict (mod.svm, WINE.test,"quality")
>
> # View the probability of each class and prediction.
> head(pred.svm,3)
            '3'       '4'        '5'       '6'       '7'       '8'       '9'
red   0.04957242 0.1345280 0.27779399 0.1345281 0.1345280 0.1345275 0.1345220
red.1 0.04301663 0.1228311 0.34283345 0.1228313 0.1228311 0.1228307 0.1228257
red.2 0.04473419 0.1713883 0.09832961 0.1713891 0.1713890 0.1713886 0.1713812
      quality PREDICTION
red         4          5
red.1       5          5
red.2       7          6
>
> # Generate a confusion matrix. Note that 3 and 4 are not predicted.
> with(pred.svm, table(quality,PREDICTION, dnn = c("Actual","Predicted")))
      Predicted
Actual   3   4   5   6   7   8   9
     3   0   0  11   5   0   0   0
     4   0   1  85  16   2   0   0
     5   2   1 927 152   4   0   1
     6   2   1 779 555  63   1   9
     7   2   0 121 316  81   0   3
     8   0   0  18  66  21   0   0
     9   0   0   0   2   1   0   0
>
> partitions(mod.svm2)
  PARTITION_NAME color
1            red   red
2          white white
> summary(mod.svm2["red"])
$red

Call:
ore.odmSVM(formula = quality ~ . - pH - fixed.acidity, data = WINE.train, 
    type = "classification", kernel.function = "linear", odm.settings = list(odms_partition_columns = "color"))

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.max.partitions                             1000
odms.missing.value.treatment odms.missing.value.auto
odms.partition.columns                       "color"
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
active.learning                            al.enable
conv.tolerance                                 1e-04
kernel.function                               linear

Coefficients: 
   PARTITION_NAME class             variable value      estimate
1             red     3          (Intercept)       -1.347392e+01
2             red     3              alcohol        7.245737e-01
3             red     3            chlorides        1.761946e+00
4             red     3          citric.acid       -3.276716e+00
5             red     3              density        2.449906e+00
6             red     3  free.sulfur.dioxide       -6.035430e-01
7             red     3       residual.sugar        9.097631e-01
8             red     3            sulphates        1.240524e-04
9             red     3 total.sulfur.dioxide       -2.467554e+00
10            red     3     volatile.acidity        1.300470e+00
11            red     4          (Intercept)       -1.000002e+00
12            red     4              alcohol       -7.920188e-07
13            red     4            chlorides       -2.589198e-08
14            red     4          citric.acid        9.340296e-08
15            red     4              density       -5.418190e-07
16            red     4  free.sulfur.dioxide       -6.981268e-08
17            red     4       residual.sugar        3.389558e-07
18            red     4            sulphates        1.417324e-07
19            red     4 total.sulfur.dioxide       -3.113900e-07
20            red     4     volatile.acidity        4.928625e-07
21            red     5          (Intercept)       -3.151406e-01
22            red     5              alcohol       -9.692192e-01
23            red     5            chlorides        3.690034e-02
24            red     5          citric.acid        2.258823e-01
25            red     5              density       -1.770474e-01
26            red     5  free.sulfur.dioxide       -1.289540e-01
27            red     5       residual.sugar        7.521771e-04
28            red     5            sulphates       -3.596548e-01
29            red     5 total.sulfur.dioxide        5.688280e-01
30            red     5     volatile.acidity        3.005168e-01
31            red     6          (Intercept)       -9.999994e-01
32            red     6              alcohol        8.807703e-07
33            red     6            chlorides        6.871310e-08
34            red     6          citric.acid       -4.525750e-07
35            red     6              density        5.786923e-07
36            red     6  free.sulfur.dioxide        3.856018e-07
37            red     6       residual.sugar       -4.281695e-07
38            red     6            sulphates        1.036468e-07
39            red     6 total.sulfur.dioxide       -4.287512e-07
40            red     6     volatile.acidity       -4.426151e-07
41            red     7          (Intercept)       -1.000000e+00
42            red     7              alcohol        1.761665e-07
43            red     7            chlorides       -3.583316e-08
44            red     7          citric.acid       -4.837739e-08
45            red     7              density        2.169500e-08
46            red     7  free.sulfur.dioxide        4.800717e-08
47            red     7       residual.sugar        1.909498e-08
48            red     7            sulphates        1.062205e-07
49            red     7 total.sulfur.dioxide       -2.339108e-07
50            red     7     volatile.acidity       -1.539326e-07
51            red     8          (Intercept)       -1.000000e+00
52            red     8              alcohol        7.089889e-08
53            red     8            chlorides       -8.566726e-09
54            red     8          citric.acid        2.769301e-08
55            red     8              density       -3.852321e-08
56            red     8  free.sulfur.dioxide       -1.302056e-08
57            red     8       residual.sugar        4.847947e-09
58            red     8            sulphates        1.276461e-08
59            red     8 total.sulfur.dioxide       -5.484427e-08
60            red     8     volatile.acidity        2.959182e-08
 

7.18 Text Processing Model

A text processing model uses ctx.settings arguments to specify Oracle Text attribute settings.

Example 7-22 Building a Text Processing Model

This example builds an ore.odmKMeans model that processes text. It uses the odm.settings and ctx.settings arguments. The figure following the example shows the output of the histogram(km.mod1) function.

x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")

X <- ore.push (data.frame(x))
km.mod1 <- NULL
km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
km.mod1
summary(km.mod1)
rules(km.mod1)
clusterhists(km.mod1)
histogram(km.mod1)

km.res1 <- predict(km.mod1,X,type="class",supplemental.cols=c("x","y"))
head(km.res1,3)
km.res1.local <- ore.pull(km.res1)
plot(data.frame(x = km.res1.local$x, 
	              y = km.res1.local$y), 
	              col = km.res1.local$CLUSTER_ID)
points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex=2)

head(predict(km.mod1,X))
head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)

# Text processing with ore.odmKMeans.
title <- c('Aids in Africa: Planning for a long war',
	         'Mars rover maneuvers for rim shot',
	         'Mars express confirms presence of water at Mars south pole',
	         'NASA announces major Mars rover finding',
	         'Drug access, Asia threat in focus at AIDS summit',
	         'NASA Mars Odyssey THEMIS image: typical crater',
	         'Road blocks for Aids')
response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')

# Text contents in a character column.
KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
			          RESPONSE = response, TITLE = title))

# Create a text policy (CTXSYS.CTX_DDL privilege is required).
ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")

# Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
# text column attributes.
km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
   odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
                       ODMS_TEXT_MIN_DOCUMENTS = 1,
                       ODMS_TEXT_MAX_FEATURES = 3,
                       kmns_distance = "dbms_data_mining.kmns_cosine",
                       kmns_details = "kmns_details_all"),
   ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))
summary(km.mod)
settings(km.mod)
print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)

ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Listing for This Example

R> x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
+             matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
R> colnames(x) <- c("x", "y")
R> 
R> X <- ore.push (data.frame(x))
R> km.mod1 <- NULL
R> km.mod1 <- ore.odmKMeans(~., X, num.centers = 2)
R> km.mod1

Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                   euclidean
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

R> summary(km.mod1)

Call:
ore.odmKMeans(formula = ~., data = X, num.centers = 2)

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                   euclidean
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON

Centers: 
            x          y
2 -0.07638266 0.04449368
3  0.98493306 1.00864399

R> rules(km.mod1)
   cluster.id rhs.support rhs.conf lhr.support lhs.conf lhs.var lhs.var.support lhs.var.conf   predicate
1           1         100      1.0          92     0.86       x              86    0.2222222 x <= 1.2209
2           1         100      1.0          92     0.86       x              86    0.2222222 x >= -.6188
3           1         100      1.0          86     0.86       y              86    0.4444444 y <= 1.1653
4           1         100      1.0          86     0.86       y              86    0.4444444  y > -.3053
5           2          50      0.5          48     0.96       x              48    0.0870793  x <= .4324
6           2          50      0.5          48     0.96       x              48    0.0870793 x >= -.6188
7           2          50      0.5          48     0.96       y              48    0.0893300  y <= .5771
8           2          50      0.5          48     0.96       y              48    0.0893300  y > -.5995
9           3          50      0.5          49     0.98       x              49    0.0852841 x <= 1.7465
10          3          50      0.5          49     0.98       x              49    0.0852841   x > .4324
11          3          50      0.5          50     0.98       y              49    0.0838225 y <= 1.7536
12          3          50      0.5          50     0.98       y              49    0.0838225   y > .2829

R> clusterhists(km.mod1)
   cluster.id variable bin.id lower.bound upper.bound               label count
1           1        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     6
2           1        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077    17
3           1        x      3 -0.09320769  0.16961178  -.0932077:.1696118    15
4           1        x      4  0.16961178  0.43243125   .1696118:.4324312    11
5           1        x      5  0.43243125  0.69525071   .4324312:.6952507     8
6           1        x      6  0.69525071  0.95807018   .6952507:.9580702    17
7           1        x      7  0.95807018  1.22088965  .9580702:1.2208896    18
8           1        x      8  1.22088965  1.48370911 1.2208896:1.4837091     4
9           1        x      9  1.48370911  1.74652858 1.4837091:1.7465286     4
10          1        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     2
11          1        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     4
12          1        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923    11
13          1        y      4 -0.01119230  0.28294226  -.0111923:.2829423    24
14          1        y      5  0.28294226  0.57707682   .2829423:.5770768    13
15          1        y      6  0.57707682  0.87121138   .5770768:.8712114    12
16          1        y      7  0.87121138  1.16534593  .8712114:1.1653459    26
17          1        y      8  1.16534593  1.45948049 1.1653459:1.4594805     5
18          1        y      9  1.45948049  1.75361505  1.4594805:1.753615     3
19          2        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     6
20          2        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077    17
21          2        x      3 -0.09320769  0.16961178  -.0932077:.1696118    15
22          2        x      4  0.16961178  0.43243125   .1696118:.4324312    10
23          2        x      5  0.43243125  0.69525071   .4324312:.6952507     2
24          2        x      6  0.69525071  0.95807018   .6952507:.9580702     0
25          2        x      7  0.95807018  1.22088965  .9580702:1.2208896     0
26          2        x      8  1.22088965  1.48370911 1.2208896:1.4837091     0
27          2        x      9  1.48370911  1.74652858 1.4837091:1.7465286     0
28          2        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     2
29          2        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     4
30          2        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923    11
31          2        y      4 -0.01119230  0.28294226  -.0111923:.2829423    24
32          2        y      5  0.28294226  0.57707682   .2829423:.5770768     9
33          2        y      6  0.57707682  0.87121138   .5770768:.8712114     0
34          2        y      7  0.87121138  1.16534593  .8712114:1.1653459     0
35          2        y      8  1.16534593  1.45948049 1.1653459:1.4594805     0
36          2        y      9  1.45948049  1.75361505  1.4594805:1.753615     0
37          3        x      1 -0.61884662 -0.35602715 -.6188466:-.3560272     0
38          3        x      2 -0.35602715 -0.09320769 -.3560272:-.0932077     0
39          3        x      3 -0.09320769  0.16961178  -.0932077:.1696118     0
40          3        x      4  0.16961178  0.43243125   .1696118:.4324312     1
41          3        x      5  0.43243125  0.69525071   .4324312:.6952507     6
42          3        x      6  0.69525071  0.95807018   .6952507:.9580702    17
43          3        x      7  0.95807018  1.22088965  .9580702:1.2208896    18
44          3        x      8  1.22088965  1.48370911 1.2208896:1.4837091     4
45          3        x      9  1.48370911  1.74652858 1.4837091:1.7465286     4
46          3        y      1 -0.89359597 -0.59946141  -.893596:-.5994614     0
47          3        y      2 -0.59946141 -0.30532685 -.5994614:-.3053269     0
48          3        y      3 -0.30532685 -0.01119230 -.3053269:-.0111923     0
49          3        y      4 -0.01119230  0.28294226  -.0111923:.2829423     0
50          3        y      5  0.28294226  0.57707682   .2829423:.5770768     4
51          3        y      6  0.57707682  0.87121138   .5770768:.8712114    12
52          3        y      7  0.87121138  1.16534593  .8712114:1.1653459    26
53          3        y      8  1.16534593  1.45948049 1.1653459:1.4594805     5
54          3        y      9  1.45948049  1.75361505  1.4594805:1.753615     3
R> histogram(km.mod1)
R> 
R> km.res1 <- predict(km.mod1, X, type="class", supplemental.cols = c("x","y"))
R> head(km.res1, 3)
            x           y CLUSTER_ID
1 -0.43646407  0.26201831          2
2 -0.02797831  0.07319952          2
3  0.11998373 -0.08638716          2
R> km.res1.local <- ore.pull(km.res1)
R> plot(data.frame(x = km.res1.local$x,
+                  y = km.res1.local$y), 
+                  col = km.res1.local$CLUSTER_ID)
R>  points(km.mod1$centers2, col = rownames(km.mod1$centers2), pch = 8, cex = 2)
R> 
R>  head(predict(km.mod1, X))
        '2'          '3' CLUSTER_ID
1 0.9992236 0.0007763706          2
2 0.9971310 0.0028690375          2
3 0.9974216 0.0025783939          2
4 0.9997335 0.0002665114          2
5 0.9917773 0.0082226599          2
6 0.9771667 0.0228333398          2
R> head(predict(km.mod1,X,type=c("class","raw"),supplemental.cols=c("x","y")),3)
        '2'          '3'           x           y CLUSTER_ID
1 0.9992236 0.0007763706 -0.43646407  0.26201831          2
2 0.9971310 0.0028690375 -0.02797831  0.07319952          2
3 0.9974216 0.0025783939  0.11998373 -0.08638716          2
R> head(predict(km.mod1,X,type="raw",supplemental.cols=c("x","y")),3)
            x           y       '2'          '3'
1 -0.43646407  0.26201831 0.9992236 0.0007763706
2 -0.02797831  0.07319952 0.9971310 0.0028690375
3  0.11998373 -0.08638716 0.9974216 0.0025783939
R> 
R>
R> # Text processing with ore.odmKMeans.
R> title <- c('Aids in Africa: Planning for a long war',
+             'Mars rover maneuvers for rim shot',
+             'Mars express confirms presence of water at Mars south pole',
+             'NASA announces major Mars rover finding',                     
+             'Drug access, Asia threat in focus at AIDS summit',
+             'NASA Mars Odyssey THEMIS image: typical crater',
+             'Road blocks for Aids')
R>  response <- c('Aids', 'Mars', 'Mars', 'Mars', 'Aids', 'Mars', 'Aids')
R> 
R> # Text contents in a character column.
R> KM_TEXT <- ore.push(data.frame(CUST_ID = seq(length(title)),
+                                 RESPONSE = response, TITLE = title))
R> 
R> # Create a text policy (CTXSYS.CTX_DDL privilege is required).
R> ore.exec("Begin ctx_ddl.create_policy('ESA_TXTPOL'); End;")
R> 
R> # Specify POLICY_NAME, MIN_DOCUMENTS, MAX_FEATURES and
R> # text column attributes.
R> km.mod <- ore.odmKMeans( ~ TITLE, data = KM_TEXT, num.centers = 2L,
+    odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL",
+                        ODMS_TEXT_MIN_DOCUMENTS = 1,
+                        ODMS_TEXT_MAX_FEATURES = 3,
+                        kmns_distance = "dbms_data_mining.kmns_cosine",
+                        kmns_details = "kmns_details_all"),
+    ctx.settings = list(TITLE="TEXT(TOKEN_TYPE:STEM)"))
R> summary(km.mod)

Call:
ore.odmKMeans(formula = ~TITLE, data = KM_TEXT, num.centers = 2L, 
    odm.settings = list(ODMS_TEXT_POLICY_NAME = "ESA_TXTPOL", 
        ODMS_TEXT_MIN_DOCUMENTS = 1, ODMS_TEXT_MAX_FEATURES = 3, 
        kmns_distance = "dbms_data_mining.kmns_cosine", 
        kmns_details = "kmns_details_all"), 
    ctx.settings = list(TITLE = "TEXT(TOKEN_TYPE:STEM)"))

Settings: 
                                               value
clus.num.clusters                                  2
block.growth                                       2
conv.tolerance                                  0.01
details                                  details.all
distance                                      cosine
iterations                                         3
min.pct.attr.support                             0.1
num.bins                                          10
random.seed                                        0
split.criterion                             variance
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
odms.text.max.features                             3
odms.text.min.documents                            1
odms.text.policy.name                     ESA_TXTPOL
prep.auto                                         ON

Centers: 
  TITLE.MARS TITLE.NASA TITLE.ROVER TITLE.AIDS
2  0.5292307  0.7936566   0.7936566         NA
3         NA         NA          NA          1
R> settings(km.mod)
                   SETTING_NAME           SETTING_VALUE SETTING_TYPE
1                     ALGO_NAME             ALGO_KMEANS        INPUT
2             CLUS_NUM_CLUSTERS                       2        INPUT
3             KMNS_BLOCK_GROWTH                       2        INPUT
4           KMNS_CONV_TOLERANCE                    0.01        INPUT
5                  KMNS_DETAILS        KMNS_DETAILS_ALL        INPUT
6                 KMNS_DISTANCE             KMNS_COSINE        INPUT
7               KMNS_ITERATIONS                       3        INPUT
8     KMNS_MIN_PCT_ATTR_SUPPORT                     0.1        INPUT
9                 KMNS_NUM_BINS                      10        INPUT
10             KMNS_RANDOM_SEED                       0      DEFAULT
11         KMNS_SPLIT_CRITERION           KMNS_VARIANCE        INPUT
12 ODMS_MISSING_VALUE_TREATMENT ODMS_MISSING_VALUE_AUTO      DEFAULT
13                ODMS_SAMPLING   ODMS_SAMPLING_DISABLE      DEFAULT
14       ODMS_TEXT_MAX_FEATURES                       3        INPUT
15      ODMS_TEXT_MIN_DOCUMENTS                       1        INPUT
16        ODMS_TEXT_POLICY_NAME              ESA_TXTPOL        INPUT
17                    PREP_AUTO                      ON        INPUT
R> print(predict(km.mod, KM_TEXT, supplemental.cols = "RESPONSE"), digits = 3L)
     '2'    '3' RESPONSE CLUSTER_ID
1 0.0213 0.9787     Aids          3
2 0.9463 0.0537     Mars          2
3 0.9325 0.0675     Mars          2
4 0.9691 0.0309     Mars          2
5 0.0213 0.9787     Aids          3
6 0.9463 0.0537     Mars          2
7 0.0213 0.9787     Aids          3
R> 
R> ore.exec("Begin ctx_ddl.drop_policy('ESA_TXTPOL'); End;")

Figure 7-5 Cluster Histogram for km.mod1


7.19 Neural Network Model

The ore.odmNN class creates a Neural Network (NN) model for classification and regression. Neural Network models can capture intricate nonlinear relationships between inputs and outputs and can find patterns in data.

The ore.odmNN class methods build a feed-forward neural network on OML4R proxy data frames. It supports multiple hidden layers, each with a specified number of nodes.

Each layer can have one of the following activation functions.

  • NNET_ACTIVATIONS_ARCTAN
  • NNET_ACTIVATIONS_BIPOLAR_SIG
  • NNET_ACTIVATIONS_LINEAR
  • NNET_ACTIVATIONS_LOG_SIG
  • NNET_ACTIVATIONS_RELU
  • NNET_ACTIVATIONS_TANH

The output layer is a single numeric or binary categorical target. The output layer can have any of the activation functions. It has the linear activation function by default.

Modeling with the ore.odmNN class is well suited for noisy and complex data such as sensor data. Such problems typically have the following characteristics:

  • Potentially many (numeric) predictors, for example, pixel values

  • The target may be discrete-valued, real-valued, or a vector of such values

  • Training data may contain errors; the model must be robust to noise

  • Fast scoring

  • Model transparency is not required; the models are difficult to interpret

Typical steps in Neural Network modeling are the following:

  1. Specifying the architecture

  2. Preparing the data

  3. Building the model

  4. Specifying the stopping criteria: iterations, error on a validation set within tolerance

  5. Viewing statistical results from the model

  6. Improving the model

Settings for a Neural Network Model

The following table lists settings for NN models.

Table 7-15 Neural Network Model Settings

Setting Name Setting Value Description
NNET_HIDDEN_LAYERS

Non-negative integer

Defines the topology by number of hidden layers.

The default value is 1.

NNET_NODES_PER_LAYER

A list of positive integers

Defines the topology by number of nodes per layer. Different layers can have different numbers of nodes.

The values must be non-negative integers, comma separated; for example, '10, 20, 5'. The setting values must be consistent with NNET_HIDDEN_LAYERS. The default number of nodes per layer is the number of attributes or 50 (if the number of attributes > 50).

NNET_ACTIVATIONS

A list of the following strings:

  • "NNET_ACTIVATIONS_LOG_SIG"
  • "NNET_ACTIVATIONS_LINEAR"
  • "NNET_ACTIVATIONS_TANH"
  • "NNET_ACTIVATIONS_ARCTAN"
  • "NNET_ACTIVATIONS_BIPOLAR_SIG"
Defines the activation function for the hidden layers. For example, '''NNET_ACTIVATIONS_BIPOLAR_SIG'', ''NNET_ACTIVATIONS_TANH'''.

Different layers can have different activation functions.

The default value is "NNET_ACTIVATIONS_LOG_SIG".

The number of activation functions must be consistent with NNET_HIDDEN_LAYERS and NNET_NODES_PER_LAYER.

Note:

All quotes are single quotes; two consecutive single quotes escape a single quote in SQL statements.

NNET_WEIGHT_LOWER_BOUND

A real number

The setting specifies the lower bound of the region where weights are randomly initialized. NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND must be set together. Setting one and not setting the other raises an error. NNET_WEIGHT_LOWER_BOUND must not be greater than NNET_WEIGHT_UPPER_BOUND. The default value is -sqrt(6/(l_nodes+r_nodes)). The value of l_nodes for:

  • input layer dense attributes is (1+number of dense attributes)
  • input layer sparse attributes is number of sparse attributes
  • each hidden layer is (1+number of nodes in that hidden layer)

The value of r_nodes is the number of nodes in the layer that the weight is connecting to.

NNET_WEIGHT_UPPER_BOUND

A real number

This setting specifies the upper bound of the region where weights are initialized. It should be set in pairs with NNET_WEIGHT_LOWER_BOUND and its value must not be smaller than the value of NNET_WEIGHT_LOWER_BOUND. If not specified, the values of NNET_WEIGHT_LOWER_BOUND and NNET_WEIGHT_UPPER_BOUND are system determined.

The default value is sqrt(6/(l_nodes+r_nodes)). See NNET_WEIGHT_LOWER_BOUND.

NNET_ITERATIONS

Positive integer

This setting specifies the maximum number of iterations in the Neural Network algorithm.

The default value is 200.

NNET_TOLERANCE

TO_CHAR(0<numeric_expr<1)

Defines the convergence tolerance setting of the Neural Network algorithm.

The default value is 0.000001.

NNET_REGULARIZER

NNET_REGULARIZER_NONE

NNET_REGULARIZER_L2

NNET_REGULARIZER_HELDASIDE

Regularization setting for Neural Network algorithm. If the total number of training rows is greater than 50000, the default is NNET_REGULARIZER_HELDASIDE.

If the total number of training rows is less than or equal to 50000, the default is NNET_REGULARIZER_NONE.

NNET_HELDASIDE_RATIO

0 <= numeric_expr <= 1

Defines the held-aside ratio for the held-aside method.

The default value is 0.25.

NNET_HELDASIDE_MAX_FAIL

The value must be a positive integer.

With NNET_REGULARIZER_HELDASIDE, the training process is stopped early if the network performance on the validation data fails to improve or remains the same for NNET_HELDASIDE_MAX_FAIL epochs in a row.

The default value is 6.

NNET_REG_LAMBDA

TO_CHAR(numeric_expr >= 0)

Defines the L2 regularization parameter lambda. It cannot be set together with NNET_REGULARIZER_HELDASIDE.

The default value is 1.
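
As with the other OREdm functions, these settings are passed through the odm.settings argument. The following is a minimal sketch, assuming the DAT proxy frame built in Example 7-23 below.

# Hedged sketch: force held-aside regularization with a custom validation
# ratio and early-stopping patience (DAT as in Example 7-23).
mod.reg <- ore.odmNN(y ~ x, DAT, "regression",
                     odm.settings = list(nnet_regularizer = "NNET_REGULARIZER_HELDASIDE",
                                         nnet_heldaside_ratio = 0.3,
                                         nnet_heldaside_max_fail = 4))
summary(mod.reg)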

Example 7-23 Building a Neural Network Model

This example creates an NN model and uses some of the methods of the ore.odmNN class.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

set.seed(7654)
x <- seq(0.1, 5, by = 0.02)
weights <- round(rnorm(length(x),10,3))
y <- log(x) + rnorm(x, sd = 0.2)

# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(data.frame(x=x, y=y, weights=weights))

# Create an NN regression model object. Fit the NN model according to the data and setting parameters.

mod.nn <- ore.odmNN(y~x, DAT,"regression",
                          odm.settings = list(nnet_hidden_layers = 1))
weight(mod.nn)
summary(mod.nn)

# Use the model to make predictions on the input data.

pred.nn <- predict(mod.nn, DAT, "y")
head(pred.nn, 10)

Listing for This Example

Table 7-16 A data.frame: 4 x 6

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE WEIGHT
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
0 0 0 x NA -1.0663866
0 NA 0 NA NA -7.4897304
1 0 0 NA NA -1068.0117188
1 NA 0 NA NA 0.9961451
Call:
ore.odmNN(formula = y ~ x, data = DAT, type = "regression", 
    odm.settings = list(nnet_hidden_layers = 1))

Settings: 
                                                  value
lbfgs.gradient.tolerance                     .000000001
lbfgs.history.depth                                  20
lbfgs.scale.hessian          LBFGS_SCALE_HESSIAN_ENABLE
activations                  'NNET_ACTIVATIONS_LOG_SIG'
hidden.layers                                         1
iterations                                          200
tolerance                                       .000001
odms.details                                odms.enable
odms.missing.value.treatment    odms.missing.value.auto
odms.random.seed                                      0
odms.sampling                     odms.sampling.disable
prep.auto                                            ON

Number of Layers: 
[1] 2

Nodes per Layer: 
[1] 1

Weight: 
  LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE        WEIGHT
1     0        0      0              x                    -1.0663866
2     0       NA      0                                   -7.4897304
3     1        0      0                                -1068.0117188
4     1       NA      0                                     0.9961451

Table 7-17 A data.frame: 10 x 2

y PREDICTION
<dbl> <dbl>
-2.376195 -1.648826
-1.906485 -1.601597
-2.027240 -1.555065
-1.541951 -1.509221
-1.654645 -1.464055
-1.742211 -1.419556
-1.320646 -1.375714
-1.357442 -1.332520
-1.442755 -1.289965
-1.192586 -1.248039

Example 7-24 ore.odmNN Classification

This example creates an NN classification model and uses some of the methods of the ore.odmNN class.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)

# Create a temporary OML4R proxy object for the MTCARS table.

MTCARS <- ore.push(m)
row.names(MTCARS) <- MTCARS$ID

# Create an NN classification model object. Fit the NN model according to the data and setting parameters.

mod.nn  <- ore.odmNN(gear ~ ., MTCARS,"classification",
                         odm.settings = list(nnet_hidden_layers = 2,
                                             nnet_activations = c("'NNET_ACTIVATIONS_LOG_SIG'", "'NNET_ACTIVATIONS_TANH'"),
                                             nnet_nodes_per_layer = c(5, 2)))
head(weight(mod.nn), 10)

# Use the model to make predictions on the input data.

pred.nn <- predict(mod.nn, MTCARS, "gear")

# Generate a confusion matrix.
with(pred.nn, table(gear, PREDICTION))

Listing for This Example

Table 7-18 A data.frame: 10 x 6

LAYER IDX_FROM IDX_TO ATTRIBUTE_NAME ATTRIBUTE_VALUE WEIGHT
<dbl> <dbl> <dbl> <chr> <chr> <dbl>
0 0 0 ID NA 12.424586
0 0 1 ID NA -9.953163
0 0 2 ID NA -7.516252
0 0 3 ID NA -1.100170
0 0 4 ID NA -15.955383
0 1 0 am NA 21.585514
0 1 1 am NA -3.228476
0 1 2 am NA -22.794853
0 1 3 am NA 15.349457
0 1 4 am NA -19.099138
    PREDICTION
gear  3  4
   3 15  0
   4  0 12
   5  0  5

7.20 Random Forest Model

The ore.odmRF class creates a Random Forest (RF) model that provides an ensemble learning technique for classification.

By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.

Settings for a Random Forest Model

The following table lists settings that apply to Random Forest models.

Table 7-19 Random Forest Model Settings

Setting Name Setting Value Description

RFOR_MTRY

a number >= 0

Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value 0 indicates that the candidate pool includes all columns.

RFOR_NUM_TREES

1 <= a number <= 65535

Number of trees in the forest.

Default is 20.

RFOR_SAMPLING_RATIO

0 < a fraction <= 1

Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is 0.5, that is, half of the rows in the training data.
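
The RFOR_* settings can be combined with the decision tree settings through the odm.settings argument. The following is a minimal sketch, assuming the IRIS proxy frame built in Example 7-25 below.

# Hedged sketch: grow a larger forest on a smaller per-tree sample
# (IRIS as in Example 7-25).
mod.rf2 <- ore.odmRF(Species ~ ., IRIS,
                     odm.settings = list(rfor_num_trees = 50,
                                         rfor_sampling_ratio = 0.632,
                                         rfor_mtry = 2))
summary(mod.rf2)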

Example 7-25 Using the ore.odmRF Function

This example pushes the data frame iris to a temporary database table IRIS and creates a Random Forest model.


# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Create a temporary OML4R proxy object IRIS.

IRIS <- ore.push(iris)

# Create an RF model object. Fit the RF model according to the data and setting parameters.

mod.rf <- ore.odmRF(Species ~ ., IRIS, 
                        odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                        tree_term_max_depth = 5,
                        tree_term_minrec_split = 5,
                        tree_term_minpct_split = 2,
                        tree_term_minrec_node = 5,
                        tree_term_minpct_node = 0.05))
                        
# Show the model summary and attribute importance.

summary(mod.rf)
importance(mod.rf)

# Use the model to make predictions on the input data.

pred.rf <- predict(mod.rf, IRIS, supplemental.cols="Species")

# Generate a confusion matrix.

with(pred.rf, table(Species, PREDICTION))

Listing for This Example

Call:
ore.odmRF(formula = Species ~ ., data = IRIS, 
    odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY", 
        tree_term_max_depth = 5, tree_term_minrec_split = 5, 
        tree_term_minpct_split = 2, tree_term_minrec_node = 5, 
        tree_term_minpct_node = 0.05))
Settings:
                                                 value 
      clas.max.sup.bins                          32
      clas.weights.balanced                      OFF
      odms.details                               odms.enable
      odms.missing.value.treatment   odms.missing.value.auto 
      odms.random.seed                                     0 
      odms.sampling                    odms.sampling.disable 
      prep.auto                                           ON
      rfor.num.trees                                      20
      rfor.sampling.ratio                                 .5
      impurity.metric                       impurity.entropy 
      term.max.depth                                       5
      term.minpct.node                                  0.05 
      term.minpct.split                                    2 
      term.minrec.node                                     5
      term.minrec.split                                    5
Importance:
    ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE 
1   Petal.Length             <NA>              0.60890776 
2   Petal.Width              <NA>              0.53412466
3   Sepal.Length             <NA>              0.23343292
4   Sepal.Width              <NA>              0.06182114

Table 7-20 A data.frame: 4 x 3

ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
<chr> <chr> <dbl>
Petal.Length NA 0.60890776
Petal.Width NA 0.53412466
Sepal.Length NA 0.23343292
Sepal.Width NA 0.06182114

            PREDICTION
Species      setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          2        48

7.21 Exponential Smoothing Model

The ore.odmESM function uses the Exponential Smoothing Method (ESM) algorithm to create a time series model for forecasting.

Exponential Smoothing Methods have been widely used in forecasting for over half a century. They have applications at the strategic, tactical, and operational levels. For example, at a strategic level, forecasting is used for projecting return on investment, growth, and the effect of innovations. At a tactical level, forecasting is used for projecting costs, inventory requirements, and customer satisfaction. At an operational level, forecasting is used for setting targets and predicting quality and conformance with standards.

In its simplest form, Exponential Smoothing is a moving average method with a single parameter that models an exponentially decreasing effect of past levels on future values. With a variety of extensions, Exponential Smoothing covers a broader class of models than other well-known approaches, such as the Box-Jenkins auto-regressive integrated moving average (ARIMA) approach. Oracle Data Mining implements Exponential Smoothing using a state-of-the-art state space method that incorporates a single source of error (SSOE) assumption, which provides theoretical and performance advantages.
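
The following sketch shows what a call might look like, assuming that ore.odmESM follows the formula, data, and odm.settings pattern of the other OREdm functions; the SALES proxy frame and its MONTH and AMOUNT columns are hypothetical.

# Speculative sketch: Holt-Winters forecasting of a hypothetical monthly
# series. SALES is assumed to have a datetime column MONTH (the case id)
# and a numeric column AMOUNT.
esm.mod <- ore.odmESM(AMOUNT ~ MONTH, SALES,
                      odm.settings = list(exsm_model = "EXSM_HW",
                                          exsm_seasonality = 12,
                                          exsm_interval = "EXSM_INTERVAL_MONTH",
                                          exsm_accumulate = "EXSM_ACCU_TOTAL"))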

Settings for an ESM model

The following table lists settings that apply to ESM models. A combined sketch of the datetime-related settings follows the table.

Table 7-21 ESM Model Settings

Setting Name Setting Value Description
EXSM_MODEL

It can take a value in the set {EXSM_SIMPLE, EXSM_SIMPLE_MULT, EXSM_HOLT, EXSM_HOLT_DMP, EXSM_MUL_TRND, EXSM_MULTRD_DMP, EXSM_SEAS_ADD, EXSM_SEAS_MUL, EXSM_HW, EXSM_HW_DMP, EXSM_HW_ADDSEA, EXSM_DHW_ADDSEA, EXSM_HWMT, EXSM_HWMT_DMP}.

This setting specifies the model.

EXSM_SIMPLE: Simple exponential smoothing model is applied.

EXSM_SIMPLE_MULT: Simple exponential smoothing model with multiplicative error is applied.

EXSM_HOLT: Holt linear exponential smoothing model is applied.

EXSM_HOLT_DMP: Holt linear exponential smoothing model with damped trend is applied.

EXSM_MUL_TRND: Exponential smoothing model with multiplicative trend is applied.

EXSM_MULTRD_DMP: Exponential smoothing model with multiplicative damped trend is applied.

EXSM_SEAS_ADD: Exponential smoothing with additive seasonality, but no trend, is applied.

EXSM_SEAS_MUL: Exponential smoothing with multiplicative seasonality, but no trend, is applied.

EXSM_HW: Holt-Winters triple exponential smoothing model with additive trend and multiplicative seasonality is applied.

EXSM_HW_DMP: Holt-Winters exponential smoothing model with damped additive trend and multiplicative seasonality is applied.

EXSM_HW_ADDSEA: Holt-Winters additive exponential smoothing model with additive trend and additive seasonality is applied.

EXSM_DHW_ADDSEA: Holt-Winters additive exponential smoothing model with damped additive trend and additive seasonality is applied.

EXSM_HWMT: Holt-Winters multiplicative exponential smoothing model with multiplicative trend and multiplicative seasonality is applied.

EXSM_HWMT_DMP: Holt-Winters multiplicative exponential smoothing model with damped multiplicative trend and multiplicative seasonality is applied.

The default value is EXSM_SIMPLE.

EXSM_SEASONALITY

positive integer > 1

This setting specifies a positive integer value as the length of the seasonal cycle. The value specified must be larger than 1. For example, a setting value of 4 means that every group of four observations forms a seasonal cycle.

This setting applies, and must be provided, only for models with seasonality; otherwise the model throws an error.

When EXSM_INTERVAL is not set, this setting applies to the original input time series. When EXSM_INTERVAL is set, this setting applies to the accumulated time series.

EXSM_INTERVAL

It can take a value in the set {EXSM_INTERVAL_YEAR, EXSM_INTERVAL_QTR, EXSM_INTERVAL_MONTH, EXSM_INTERVAL_WEEK, EXSM_INTERVAL_DAY, EXSM_INTERVAL_HOUR, EXSM_INTERVAL_MIN, EXSM_INTERVAL_SEC}.

This setting applies, and must be provided, only when the time column (the case_id column) has a datetime type. It specifies the spacing interval of the accumulated, equally spaced time series.

The model throws an error if the time column of the input table has a datetime type and the EXSM_INTERVAL setting is not provided.

The model throws an error if the time column of the input table has an Oracle NUMBER type and the EXSM_INTERVAL setting is provided.

EXSM_ACCUMULATE

It can take a value in the set {EXSM_ACCU_TOTAL, EXSM_ACCU_STD, EXSM_ACCU_MAX, EXSM_ACCU_MIN, EXSM_ACCU_AVG, EXSM_ACCU_MEDIAN, EXSM_ACCU_COUNT}.

This setting applies, and must be provided, only when the time column has a datetime type. It specifies how the values of the accumulated time series are generated from the input time series.

EXSM_SETMISSING

A number, or an option taking a value in the set {EXSM_MISS_MIN, EXSM_MISS_MAX, EXSM_MISS_AVG, EXSM_MISS_MEDIAN, EXSM_MISS_LAST, EXSM_MISS_FIRST, EXSM_MISS_PREV, EXSM_MISS_NEXT, EXSM_MISS_AUTO}.

This setting specifies how to handle missing values, which may come from the input data and/or the accumulation process of the time series. You can specify either a number or an option. If a number is specified, all missing values are set to that number.

EXSM_MISS_MIN: Replaces missing value with minimum of the accumulated time series.

EXSM_MISS_MAX: Replaces missing value with maximum of the accumulated time series.

EXSM_MISS_AVG: Replaces missing value with average of the accumulated time series.

EXSM_MISS_MEDIAN: Replaces missing value with median of the accumulated time series.

EXSM_MISS_LAST: Replaces missing value with last non-missing value of the accumulated time series.

EXSM_MISS_FIRST: Replaces missing value with first non-missing value of the accumulated time series.

EXSM_MISS_PREV: Replaces a missing value with the previous non-missing value of the accumulated time series.

EXSM_MISS_NEXT: Replaces a missing value with the next non-missing value of the accumulated time series.

EXSM_MISS_AUTO: The ESM model treats the input data as an irregular (non-uniformly spaced) time series.

If this setting is not provided, EXSM_MISS_AUTO is the default value. In such a case, the model treats the input time series as irregular time series, viewing missing values as gaps.

EXSM_PREDICTION_STEP

It must be set to a number between 1 and 30, inclusive.

This setting specifies how many steps ahead the predictions are to be made.

If it is not set, the default value is 1: the model gives one-step-ahead prediction. A value greater than 30 results in an error.

EXSM_CONFIDENCE_LEVEL

It must be a number between 0 and 1, exclusive.

This setting specifies the desired confidence level for prediction.

The lower and upper bounds of the specified confidence interval are reported. If this setting is not specified, the default confidence level is 95%.

EXSM_OPT_CRITERION

It takes a value in the set {EXSM_OPT_CRIT_LIK, EXSM_OPT_CRIT_MSE, EXSM_OPT_CRIT_AMSE, EXSM_OPT_CRIT_SIG, EXSM_OPT_CRIT_MAE}.

This setting specifies the desired optimization criterion. The optimization criterion is useful as a diagnostic for comparing models' fit to the same data.

EXSM_OPT_CRIT_LIK: Minus twice the log-likelihood of a model.

EXSM_OPT_CRIT_MSE: Mean square error of a model.

EXSM_OPT_CRIT_AMSE: Average mean square error over user-specified time window.

EXSM_OPT_CRIT_SIG: Model's standard deviation of residuals.

EXSM_OPT_CRIT_MAE: Mean absolute error of a model.

The default value is EXSM_OPT_CRIT_LIK.

EXSM_NMSE

positive integer

This setting specifies the length of the window used in computing the average mean square error (AMSE) metric.
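The datetime-related settings (EXSM_INTERVAL, EXSM_ACCUMULATE, and EXSM_SETMISSING) work together with EXSM_MODEL and EXSM_SEASONALITY, and are easiest to see in combination. The following is a sketch only, not part of the examples in this chapter: the proxy table SALES and its datetime column DAY and numeric column AMOUNT are hypothetical names, and the setting values follow Table 7-21.

# Sketch: a seasonal Holt-Winters ESM model on a hypothetical datetime-keyed
# series. SALES, DAY, and AMOUNT are placeholder names. Because DAY is a
# datetime column, EXSM_INTERVAL and EXSM_ACCUMULATE must be provided; here
# daily rows are accumulated into monthly totals with a 12-month seasonal cycle.
esm.seasonal <- ore.odmESM(AMOUNT ~ ., SALES,
    odm.settings = list(case_id_column_name  = "DAY",
                        exsm_model           = "EXSM_HW",
                        exsm_interval        = "EXSM_INTERVAL_MONTH",
                        exsm_accumulate      = "EXSM_ACCU_TOTAL",
                        exsm_seasonality     = 12,
                        exsm_setmissing      = "EXSM_MISS_AUTO",
                        exsm_prediction_step = 4))
summary(esm.seasonal)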

Example 7-26 Using the ore.odmESM Function

This example creates a synthetic data frame, pushes it to a temporary database table through the proxy object DAT, and builds an Exponential Smoothing model.

# Turn off row ordering warnings.

options(ore.warn.order=FALSE)

# Data setup

set.seed(7654)
N <- 100
dat <- data.frame(ID=1:N, VAL=runif(N))

# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(dat)

# Create an ESM regression model object. Fit the ESM model according to the data and setting parameters.

esm.mod  <- ore.odmESM(VAL ~ ., DAT,
    odm.settings = list(case_id_column_name = "ID",
                        exsm_prediction_step = 4))
    
esm.mod
summary(esm.mod)

Listing for This Example

Call: ore.odmESM(formula = VAL ~ ., data = DAT, odm.settings = list(case_id_column_name = "ID", exsm_prediction_step = 4))
Settings:                                                 
                                                               value 
         confidence.level                                        .95 
         model                                                simple
         nmse                                                      3
         optimization.crit                              opt.crit.lik
         prediction.step                                           4
         setmissing                                        miss.auto
         odms.details                                    odms.enable
         odms.missing.value.treatment        odms.missing.value.auto
         odms.sampling                         odms.sampling.disable
         prep.auto                                                ON
Call: ore.odmESM(formula = VAL ~ ., data = DAT, odm.settings = list(case_id_column_name = "ID", exsm_prediction_step = 4))
Settings:
                                                      value
       confidence.level                                 .95 
       model                                         simple
       nmse                                               3
       optimization.crit                       opt.crit.lik
       prediction.step                                    4
       setmissing                                 miss.auto
       odms.details                             odms.enable
       odms.missing.value.treatment odms.missing.value.auto
       odms.sampling                  odms.sampling.disable
       prep.auto                                         ON 
Predictions:
      CASE_ID      VALUE PREDICTION        LOWER   UPPER
  1         1 0.68847989  0.5414108         NA      NA
  2         2 0.63346191  0.5414255         NA      NA 
  3         3 0.34073466  0.5414347         NA      NA
  4         4 0.41106593  0.5414146         NA      NA 
  5         5 0.17601063  0.5414016         NA      NA
  6         6 0.82879446  0.5413650         NA      NA
  7         7 0.23504359  0.5413938         NA      NA
  8         8 0.14222260  0.5413631         NA      NA 
  9         9 0.76561760  0.5413232         NA      NA 
  10       10 0.90813842  0.5413457         NA      NA 
  11       11 0.59706210  0.5413823         NA      NA 
  12       12 0.44463468  0.5413879         NA      NA 
  13       13 0.95294541  0.5413782         NA      NA 
  14       14 0.58209937  0.5414194         NA      NA
  15       15 0.62295773  0.5414235         NA      NA
  16       16 0.59711650  0.5414316         NA      NA
  17       17 0.41131782  0.5414372         NA      NA
  18       18 0.79952871  0.5414242         NA      NA
  19       19 0.12635680  0.5414500         NA      NA
  20       20 0.04773946  0.5414085         NA      NA

7.22 XGBoost Model

The ore.odmXGB function builds a model using XGBoost, a scalable gradient tree boosting system that supports both classification and regression. It makes the open source gradient boosting framework available in the database: it prepares the training data, calls the in-database XGBoost implementation, builds and persists a model, and applies the model for prediction.

Note:

The ore.odmXGB algorithm is available in Oracle Database 21c and later.

You can use ore.odmXGB as a stand-alone predictor or incorporate it into real-world production pipelines for a wide range of problems such as ad click-through rate prediction, hazard risk prediction, web text classification, and so on.

The ore.odmXGB algorithm takes three types of parameters: general parameters, booster parameters, and task parameters. You set the parameters through the model settings table. The algorithm supports most of the settings of the open source project.

Through ore.odmXGB, OML4R supports a number of different classification and regression specifications, ranking models, and survival models. Binary and multi-class models are supported under the classification machine learning technique while regression, ranking, count, and survival are supported under the regression machine learning technique.
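In the listing for Example 7-28, the objective setting (multi:softprob) shows how these specifications surface. The following is a hedged sketch of explicitly selecting one: the proxy frame DAT2 and its count-valued response column CNT are hypothetical names, the objective value count:poisson comes from the open source project, and its acceptance by the in-database algorithm is an assumption here.

# Sketch: requesting a count-regression build. DAT2 and CNT are hypothetical
# names; "count:poisson" is the open source XGBoost objective, and treating
# it as accepted in-database is an assumption.
xgb.cnt <- ore.odmXGB(CNT ~ ., DAT2, "regression",
    odm.settings = list(objective = "count:poisson"))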

ore.odmXGB also supports partitioned models and internalizes the data preparation.

Settings for an XGBoost model

The following table lists settings that apply to XGBoost models. A sketch showing how to pass these settings follows the table.

Table 7-22 XGBoost Model Settings

Setting Name Setting Value Description
booster

A string that is one of the following:

  • dart
  • gblinear
  • gbtree

The booster to use:

  • dart
  • gblinear
  • gbtree

The dart and gbtree boosters use tree-based models whereas gblinear uses linear functions.

The default value is gbtree.

num_round

A non-negative integer.

The number of rounds for boosting.

The default value is 10.
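Both settings in Table 7-22 are passed through the odm.settings argument, like the settings of the other OREdm functions. A minimal sketch follows, reusing the DAT proxy frame built in Example 7-27 below; treating additional open source parameters such as max_depth and eta as accepted here is an assumption, since coverage may vary by release.

# Sketch: overriding XGBoost settings through odm.settings. DAT is the proxy
# frame from Example 7-27; max_depth and eta are open source XGBoost booster
# parameters, and their acceptance in-database is an assumption.
xgb.tuned <- ore.odmXGB(y ~ x, DAT, "regression",
    odm.settings = list(booster   = "gbtree",
                        num_round = 50,
                        max_depth = 4,
                        eta       = 0.1))
summary(xgb.tuned)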

Example 7-27 Using the ore.odmXGB Regression Function

This example creates a synthetic data set, pushes it to a temporary database table through the proxy object DAT, and builds an XGBoost regression model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

x <- seq(0.1, 5, by = 0.02)
y <- log(x) + rnorm(x, sd = 0.2)


# Create a temporary OML4R proxy object DAT.

DAT <- ore.push(data.frame(x=x, y=y))

# Create an XGBoost regression model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
head(xgb.res, 6)

Listing for This Example

>   x <- seq(0.1, 5, by = 0.02)
>   y <- log(x) + rnorm(x, sd = 0.2)
>   DAT <- ore.push(data.frame(x=x, y=y))

>   xgb.mod <- ore.odmXGB(y ~ x, DAT, "regression")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = y ~ x, data = DAT, type = "regression")

Settings: 
                                               value
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE GAIN COVER FREQUENCY
1  <NA>              x              <NA>            <NA>    1     1         1
>   xgb.res <- predict(xgb.mod, DAT, supplemental.cols = "x")
>   head(xgb.res,6)
     x PREDICTION
1 0.10  -1.957506
2 0.12  -1.957506
3 0.14  -1.957506
4 0.16  -1.484602
5 0.18  -1.559072
6 0.20  -1.559072

Example 7-28 Using the ore.odmXGB Classification Function

This example pushes the data frame mtcars to a temporary database table through the proxy object MTCARS and builds an XGBoost classification model.

# Turn off row ordering warnings

options(ore.warn.order=FALSE)

# Data setup

m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl  <- as.factor(m$cyl)
m$vs   <- as.factor(m$vs)
m$ID   <- 1:nrow(m)

# Create a temporary OML4R proxy object MTCARS.

MTCARS <- ore.push(m)

# Create an XGBoost classification model object. Fit the XGBoost model according to the data and setting parameters.

xgb.mod <- ore.odmXGB(gear ~ .-ID, MTCARS, "classification")

# Display the model summary and attribute importance

summary(xgb.mod)
importance(xgb.mod)

# Use the model to make predictions on the input data.

xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")

# Generate a confusion matrix.
with(xgb.res, table(gear, PREDICTION))

Listing for This Example

>   m <- mtcars
>   m$gear <- as.factor(m$gear)
>   m$cyl  <- as.factor(m$cyl)
>   m$vs   <- as.factor(m$vs)
>   m$ID   <- 1:nrow(m)
>   MTCARS <- ore.push(m)

>   xgb.mod  <- ore.odmXGB(gear ~ .-ID, MTCARS,"classification")
>   summary(xgb.mod)

Call:
ore.odmXGB(formula = gear ~ . - ID, data = MTCARS, type = "classification")

Settings: 
                                               value
clas.weights.balanced                            OFF
odms.details                             odms.enable
odms.missing.value.treatment odms.missing.value.auto
odms.sampling                  odms.sampling.disable
prep.auto                                         ON
booster                                       gbtree
ntree.limit                                        0
num.round                                         10
objective                             multi:softprob

Importance: 
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696

>   importance(xgb.mod)
  PNAME ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_VALUE         GAIN
1  <NA>             am              <NA>            <NA> 0.1062399524
2  <NA>           carb              <NA>            <NA> 0.0001902411
3  <NA>           disp              <NA>            <NA> 0.1903797590
4  <NA>           drat              <NA>            <NA> 0.5099772379
5  <NA>             hp              <NA>            <NA> 0.0120000788
6  <NA>            mpg              <NA>            <NA> 0.0040766784
7  <NA>           qsec              <NA>            <NA> 0.1771360524
        COVER  FREQUENCY
1 0.121840842 0.13924051
2 0.009026413 0.02531646
3 0.292335393 0.36708861
4 0.320671772 0.24050633
5 0.028994248 0.02531646
6 0.022994361 0.03797468
7 0.204136970 0.16455696
>   xgb.res <- predict(xgb.mod, MTCARS, supplemental.cols = "gear")
>   with(xgb.res, table(gear,PREDICTION))  
    PREDICTION
gear  3  4  5
   3 15  0  0
   4  0 12  0
   5  0  0  5
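
If a single accuracy figure is more convenient than the confusion matrix, the following is a short follow-on sketch using the same xgb.res proxy frame; ore.pull materializes the small result set in local R memory first.

# Sketch: overall classification accuracy from the prediction result. The
# gear and PREDICTION columns come from the predict() call above; ore.pull
# brings the proxy frame into a local data.frame.
res.local <- ore.pull(xgb.res)
mean(as.character(res.local$gear) == as.character(res.local$PREDICTION))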