4.2.4 Decision Tree
The ore.odmDT function uses the in-database Decision Tree algorithm, which is based on conditional probabilities.
Decision Tree models are classification models that generate rules. A rule is a conditional statement that can easily be understood by humans and used within a database to identify a set of records.
A decision tree predicts a target value by asking a sequence of questions. At a given stage in the sequence, the question that is asked depends upon the answers to the previous questions. The goal is to ask questions that, taken together, uniquely identify specific target values. Graphically, this process forms a tree structure.
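For example, the single split that the algorithm finds in Example 4-11 later in this section can be read as a rule of the following form. This is a hand-written illustration of how a tree's questions map to a rule, not code produced by ore.odmDT; the cutoff value and predictions come from the example output.

# Illustrative only: the rule implied by the split found in Example 4-11.
# If displacement is at most 196.3, predict gear "4"; otherwise predict gear "3".
predict_gear <- function(disp) {
  if (disp <= 196.3) "4" else "3"
}
predict_gear(160.0)   # "4"
predict_gear(360.0)   # "3"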
During the training process, the Decision Tree algorithm must repeatedly find the most efficient way to split a set of cases (records) into two child nodes. The ore.odmDT function offers two homogeneity metrics, gini and entropy, for calculating the splits. The default metric is gini.
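As a point of reference, the two metrics can be computed directly from the class proportions in a node. The following is a minimal sketch in plain R (the impurity function name is ours, not part of OML4R): a pure node scores 0 under either metric, and an evenly mixed node scores the maximum.

# Minimal sketch of the two node-impurity metrics, in plain R (not OML4R).
# For the vector p of class proportions in a node:
#   gini    = 1 - sum(p^2)
#   entropy = -sum(p * log2(p))
impurity <- function(classes, metric = c("gini", "entropy")) {
  metric <- match.arg(metric)
  p <- as.numeric(table(classes)) / length(classes)   # class proportions in the node
  if (metric == "gini") 1 - sum(p^2) else -sum(p * log2(p))
}
impurity(c("a", "a", "a", "a"))                  # 0: the node is pure
impurity(c("a", "a", "b", "b"))                  # 0.5: maximum gini for two classes
impurity(c("a", "a", "b", "b"), "entropy")       # 1: maximum entropy (in bits) for two classes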
For information on the ore.odmDT function arguments, call help(ore.odmDT).
Settings for a Decision Tree Model
The following table lists settings that apply to Decision Tree models.
Table 4-4 Decision Tree Model Settings

Setting Name | Setting Value | Description |
---|---|---|
TREE_IMPURITY_METRIC | TREE_IMPURITY_ENTROPY, TREE_IMPURITY_GINI | Tree impurity metric for Decision Tree. Tree algorithms seek the best test question for splitting data at each node. The best splitter and split values are those that result in the largest increase in target value homogeneity (purity) for the entities in the node. Purity is measured by a metric. Decision trees can use either Gini (TREE_IMPURITY_GINI) or entropy (TREE_IMPURITY_ENTROPY) as the purity metric. By default, the algorithm uses TREE_IMPURITY_GINI. |
TREE_TERM_MAX_DEPTH | For Decision Tree: 2 <= numeric value <= 20. For Random Forest: 2 <= numeric value <= 100. | Criteria for splits: maximum tree depth (the maximum number of nodes between the root and any leaf node, including the leaf node). For Decision Tree, the default is 7. For Random Forest, the default is 16. |
TREE_TERM_MINPCT_NODE | 0 <= numeric value <= 10 | The minimum number of training rows in a node expressed as a percentage of the rows in the training data. Default is 0.05, indicating 0.05%. |
TREE_TERM_MINPCT_SPLIT | 0 < numeric value <= 20 | The minimum number of rows required to consider splitting a node expressed as a percentage of the training rows. Default is 0.1, indicating 0.1%. |
TREE_TERM_MINREC_NODE | numeric value >= 0 | The minimum number of rows in a node. Default is 10. |
TREE_TERM_MINREC_SPLIT | numeric value > 1 | Criteria for splits: minimum number of records in a parent node expressed as a value. No split is attempted if the number of records is below this value. Default is 20. |
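In practice the defaults are usually sufficient, but the settings in Table 4-4 can be overridden when a model is built. The following is a hedged sketch that assumes your release of ore.odmDT accepts an odm.settings argument containing a list of setting name/value pairs; that argument name, and the mtcars_of ore.frame borrowed from Example 4-11 below, are assumptions here, so confirm the supported arguments with help(ore.odmDT).

# Sketch only: override selected Decision Tree settings at build time.
# Assumes this ore.odmDT release accepts an odm.settings list; verify with help(ore.odmDT).
dt.settings <- list(TREE_IMPURITY_METRIC   = "TREE_IMPURITY_ENTROPY",
                    TREE_TERM_MAX_DEPTH    = 5,
                    TREE_TERM_MINREC_SPLIT = 10)
dt.mod2 <- ore.odmDT(gear ~ ., mtcars_of, odm.settings = dt.settings)
summary(dt.mod2)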
Example 4-11 Using the ore.odmDT Function
This example creates an input ore.frame, builds a model, makes predictions, and generates a confusion matrix.
m <- mtcars
m$gear <- as.factor(m$gear)
m$cyl <- as.factor(m$cyl)
m$vs <- as.factor(m$vs)
m$ID <- 1:nrow(m)
mtcars_of <- ore.push(m)
row.names(mtcars_of) <- mtcars_of$ID

# Build the model.
dt.mod <- ore.odmDT(gear ~ ., mtcars_of)
summary(dt.mod)

# Make predictions and generate a confusion matrix.
dt.res <- predict(dt.mod, mtcars_of, "gear")
with(dt.res, table(gear, PREDICTION))
Listing for This Example
R> m <- mtcars
R> m$gear <- as.factor(m$gear)
R> m$cyl <- as.factor(m$cyl)
R> m$vs <- as.factor(m$vs)
R> m$ID <- 1:nrow(m)
R> mtcars_of <- ore.push(m)
R> row.names(mtcars_of) <- mtcars_of$ID
R> # Build the model.
R> dt.mod <- ore.odmDT(gear ~ ., mtcars_of)
R> summary(dt.mod)
Call:
ore.odmDT(formula = gear ~ ., data = mtcars_of)
n = 32
Nodes:
parent node.id row.count prediction split
1 NA 0 32 3 <NA>
2 0 1 16 4 (disp <= 196.299999999999995)
3 0 2 16 3 (disp > 196.299999999999995)
surrogate full.splits
1 <NA> <NA>
2 (cyl in ("4" "6" )) (disp <= 196.299999999999995)
3 (cyl in ("8" )) (disp > 196.299999999999995)
Settings:
value
prep.auto on
impurity.metric impurity.gini
term.max.depth 7
term.minpct.node 0.05
term.minpct.split 0.1
term.minrec.node 10
term.minrec.split 20
R> # Make predictions and generate a confusion matrix.
R> dt.res <- predict(dt.mod, mtcars_of, "gear")
R> with(dt.res, table(gear, PREDICTION))
PREDICTION
gear 3 4
3 14 1
4 0 12
5 2 3
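The diagonal of the confusion matrix counts correct predictions (14 + 12 = 26 of 32 rows), so overall accuracy is about 0.81. One way to compute this on the client, sketched below, is to pull the prediction results with ore.pull; the dt.res object and its gear and PREDICTION columns come from the example above.

# Sketch: overall accuracy from the predictions, computed on the client.
res <- ore.pull(dt.res)                    # pull the ore.frame of predictions locally
mean(as.character(res$gear) == as.character(res$PREDICTION))
# From the confusion matrix above: (14 + 12) / 32 = 0.8125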