7.18 Random Forest Model
The ore.odmRF class creates an in-database Random Forest (RF) model that provides an ensemble learning technique for classification.
By combining the ideas of bagging and random selection of variables, the Random Forest algorithm produces a collection of decision trees with controlled variance while avoiding overfitting, which is a common problem for decision trees.
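The way an ensemble turns many individual trees into one classification can be sketched in plain R. The example below is a minimal illustration that assumes the usual majority-vote formulation of Random Forest for classification; the `votes` matrix and the `majority_vote` helper are hypothetical and are not part of OML4R or of the in-database implementation.

```r
# Hypothetical per-tree class predictions:
# one row per observation, one column per tree.
votes <- matrix(
  c("setosa",     "setosa",    "setosa",
    "versicolor", "virginica", "versicolor",
    "virginica",  "virginica", "versicolor"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("tree1", "tree2", "tree3"))
)

# Majority vote: for each observation, return the class predicted by the
# largest number of trees. Ties go to the alphabetically first class,
# because table() sorts its names and which.max() takes the first maximum.
majority_vote <- function(row) names(which.max(table(row)))

ensemble <- apply(votes, 1, majority_vote)
print(ensemble)
```

Because each tree is built from a different sample of rows and considers different columns at each split, the errors of individual trees tend to differ, and the vote smooths them out.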
Settings for a Random Forest Model
The following table lists settings that apply to Random Forest models.
Table 7-19 Random Forest Model Settings
| Setting Name | Setting Value | Description | 
|---|---|---|
|  |  | Size of the random subset of columns to be considered when choosing a split at a node. For each node, the size of the pool remains the same, but the specific candidate columns change. The default is half of the columns in the model signature. The special value |
|  |  | Number of trees in the forest. The default is 20. |
|  |  | Fraction of the training data to be randomly sampled for use in the construction of an individual tree. The default is half of the number of rows in the training data. |
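To make the three settings concrete, the following base R sketch shows what they control for the built-in `iris` data frame. It is an illustration only: the variable names `num_trees`, `sampling_ratio`, and `pool_size` are hypothetical R variables rather than setting names, the number of trees is kept small for readable output, and sampling rows without replacement is an assumption, not a statement about how the database draws its samples.

```r
set.seed(1)  # reproducible illustration only

n_rows     <- nrow(iris)                      # 150 training rows
predictors <- setdiff(names(iris), "Species") # 4 candidate columns

num_trees      <- 3                           # illustrative number of trees
sampling_ratio <- 0.5                         # fraction of rows used per tree
pool_size      <- length(predictors) %/% 2    # half of the columns: 2

rows_per_tree <- floor(sampling_ratio * n_rows)  # 75 rows per tree

for (t in seq_len(num_trees)) {
  # Rows used to build tree t (assumption: sampled without replacement).
  tree_rows <- sample(n_rows, rows_per_tree)

  # Each node draws a fresh candidate pool of the same size;
  # two nodes are shown here.
  node1_pool <- sample(predictors, pool_size)
  node2_pool <- sample(predictors, pool_size)

  cat(sprintf("tree %d: %d rows; node pools: {%s} and {%s}\n",
              t, length(tree_rows),
              paste(node1_pool, collapse = ", "),
              paste(node2_pool, collapse = ", ")))
}
```

The pool size is constant across nodes while the columns in the pool vary, which matches the description of the first setting in the table.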
Example 7-21 Using the ore.odmRF Function
This example pushes the data frame iris to a temporary database table IRIS and creates a Random Forest model.
# Turn off row ordering warnings
options(ore.warn.order=FALSE)
# Create a temporary OML4R proxy object IRIS.
IRIS <- ore.push(iris)
# Create an RF model object. Fit the RF model according to the data and setting parameters.
mod.rf <- ore.odmRF(Species ~ ., IRIS,
                    odm.settings = list(tree_impurity_metric = 'TREE_IMPURITY_ENTROPY',
                                        tree_term_max_depth = 5,
                                        tree_term_minrec_split = 5,
                                        tree_term_minpct_split = 2,
                                        tree_term_minrec_node = 5,
                                        tree_term_minpct_node = 0.05))
# Show the model summary and attribute importance.
summary(mod.rf)
importance(mod.rf)
# Use the model to make predictions on the input data.
pred.rf <- predict(mod.rf, IRIS, supplemental.cols="Species")
# Generate a confusion matrix.
with(pred.rf, table(Species, PREDICTION))
Listing for This Example
Call:
ore.odmRF(formula = Species ~ ., data = IRIS, odm.settings = list(tree_impurity_metric = "TREE_IMPURITY_ENTROPY", tree_term_max_depth = 5, tree_term_minrec_split = 5, tree_term_minpct_split = 2, tree_term_minrec_node = 5, tree_term_minpct_node = 0.05))
Settings:
                                                       value
      clas.max.sup.bins                                   32
      clas.weights.balanced                              OFF
      odms.details                               odms.enable
      odms.missing.value.treatment   odms.missing.value.auto
      odms.random.seed                                     0
      odms.sampling                    odms.sampling.disable
      prep.auto                                           ON
      rfor.num.trees                                      20
      rfor.sampling.ratio                                 .5
      impurity.metric                       impurity.entropy
      term.max.depth                                       5
      term.minpct.node                                  0.05
      term.minpct.split                                    2
      term.minrec.node                                     5
      term.minrec.split                                    5
Importance:
  ATTRIBUTE_NAME ATTRIBUTE_SUBNAME ATTRIBUTE_IMPORTANCE
1   Petal.Length              <NA>           0.60890776
2    Petal.Width              <NA>           0.53412466
3   Sepal.Length              <NA>           0.23343292
4    Sepal.Width              <NA>           0.06182114
Table 7-20 A data.frame: 4 x 3
| ATTRIBUTE_NAME | ATTRIBUTE_SUBNAME | ATTRIBUTE_IMPORTANCE | 
|---|---|---|
| <chr> | <chr> | <dbl> | 
| Petal.Length | NA | 0.60890776 | 
| Petal.Width | NA | 0.53412466 | 
| Sepal.Length | NA | 0.23343292 | 
| Sepal.Width | NA | 0.06182114 |
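The confusion matrix generated at the end of the example can be reduced to a single accuracy figure. The sketch below shows one way to do that in base R. It uses a small hypothetical local data frame, `pred.local`, as a stand-in for the prediction result after it has been brought into the R session; its values are made up for illustration and are not output from the model above.

```r
# Hypothetical local stand-in for the prediction result; values are illustrative.
classes <- c("setosa", "versicolor", "virginica")
pred.local <- data.frame(
  Species    = factor(c("setosa", "setosa", "versicolor",
                        "versicolor", "virginica", "virginica"),
                      levels = classes),
  PREDICTION = factor(c("setosa", "setosa", "versicolor",
                        "virginica", "virginica", "virginica"),
                      levels = classes)
)

# Same call shape as in the example: actual classes in rows, predictions in columns.
cm <- with(pred.local, table(Species, PREDICTION))
print(cm)

# Accuracy is the share of observations on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

Giving both columns the same factor levels keeps the table square, so the diagonal always lines up actual and predicted classes even when a class is never predicted.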