4.1.6 Building a Random Forest Model

The ore.randomForest function provides an ensemble learning technique for classification of data in an ore.frame object.

Function ore.randomForest builds a random forest model by growing trees in parallel on the database server. It constructs many decision trees and outputs the class that is the mode of the classes predicted by the individual trees. Aggregating the votes of many trees reduces overfitting, which is a common problem for individual decision trees.
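The voting step described above can be illustrated in plain R, without a database connection. This is only a sketch of the idea, not the internal implementation of ore.randomForest; the votes matrix and the majority_vote helper are hypothetical and the vote values are made up for illustration.

```r
# Hypothetical class votes from five trees for three observations
# (rows = observations, columns = trees). The values are made up.
votes <- matrix(
  c("setosa",     "setosa",    "setosa",     "versicolor", "setosa",
    "versicolor", "virginica", "versicolor", "versicolor", "virginica",
    "virginica",  "virginica", "versicolor", "virginica",  "virginica"),
  nrow = 3, byrow = TRUE
)

# Return the most frequent class (the mode) among one observation's votes.
# On a tie, which.max keeps the first class in alphabetical order.
majority_vote <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# One forest-level prediction per observation.
forest_prediction <- apply(votes, 1, majority_vote)
print(forest_prediction)
```

A single tree that misclassifies an observation, such as the fourth tree for the first row, is outvoted by the others.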

The random forest algorithm, developed by Leo Breiman and Adele Cutler, combines the ideas of bagging and the random selection of variables, which results in a collection of decision trees with controlled variance. The random forest algorithm provides high accuracy, but performance and scalability can be issues for large data sets.
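The two ingredients named above, bagging and random selection of variables, can be sketched for a single tree in plain R. The variable names and the seed are illustrative assumptions, and the sketch shows only the sampling steps, not the tree growing itself.

```r
set.seed(42)  # arbitrary seed, used only to make the sketch reproducible

predictors <- setdiff(names(iris), "Species")
n <- nrow(iris)

# Bagging: draw a bootstrap sample (same size, with replacement) for one tree.
boot_rows <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out-of-bag" for this tree.
oob_rows <- setdiff(seq_len(n), boot_rows)

# Random variable selection: at a split, only a random subset of the
# predictors is considered. floor(sqrt(p)) is the customary default for
# classification in the open source randomForest package.
mtry <- floor(sqrt(length(predictors)))
split_candidates <- sample(predictors, size = mtry)

print(length(oob_rows))
print(split_candidates)
```

Because each tree sees a different bootstrap sample and different candidate variables, the trees are less correlated with one another, which is what keeps the variance of the combined prediction under control.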

Function ore.randomForest executes in parallel for model building and scoring. Parallel execution can occur whether you are using the randomForest package in Oracle R Distribution (ORD) or the open source randomForest package 4.6-10. Using ore.randomForest and ORD can require less memory than using ore.randomForest with the open source alternative. If you use the open source randomForest package, Oracle R Enterprise issues a warning.

Function ore.randomForest uses the global option ore.parallel to determine the degree of parallelism to employ. The function returns an ore.randomForest object.
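As a minimal sketch, the global option can be set with the standard R options function before building the model. The value 4 is an arbitrary illustration, and the model-building part is guarded because it assumes that the ORE package is installed and that a database connection is already established.

```r
# Request a degree of parallelism of 4 (an arbitrary illustrative value).
options(ore.parallel = 4)
print(getOption("ore.parallel"))

# The remaining calls need Oracle R Enterprise and a live connection.
ore_available <- suppressWarnings(require("ORE", quietly = TRUE)) &&
  ore.is.connected()

if (ore_available) {
  IRIS <- ore.push(iris)
  mod <- ore.randomForest(Species ~ ., IRIS)
  print(class(mod))  # expected to include "ore.randomForest"
}
```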

An invocation of the scoring method predict on an ore.randomForest object also runs in parallel on the database server. The cache.model argument specifies whether to cache the entire random forest model in memory during prediction. If sufficient memory is available, use the default cache.model value of TRUE for better performance.
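When memory is constrained, the cache.model argument can be set to FALSE at scoring time. The following is a sketch under the assumption that the ORE package is installed and a database connection exists; the guard variable ore_available is hypothetical and only keeps the snippet from failing elsewhere.

```r
# Scoring needs Oracle R Enterprise and a live database connection.
ore_available <- suppressWarnings(require("ORE", quietly = TRUE)) &&
  ore.is.connected()

if (ore_available) {
  IRIS <- ore.push(iris)
  mod <- ore.randomForest(Species ~ ., IRIS)

  # Do not hold the whole forest in memory during prediction.
  # This trades scoring speed for a smaller memory footprint.
  ans <- predict(mod, IRIS, cache.model = FALSE,
                 supplemental.cols = "Species")
  print(table(ans$Species, ans$prediction))
}
```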

The grabTree method returns an ore.frame object that contains information on the specified tree. Each row of the ore.frame represents one node of the tree.
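A tree returned by grabTree can be examined like any other ore.frame. The sketch below assumes an installed ORE package and an active connection; it counts the nodes of one tree and pulls the first rows to the client. The exact columns of the result are not listed here, so the sketch only displays them rather than relying on particular names.

```r
# Inspecting a tree needs Oracle R Enterprise and a live connection.
ore_available <- suppressWarnings(require("ORE", quietly = TRUE)) &&
  ore.is.connected()

if (ore_available) {
  IRIS <- ore.push(iris)
  mod <- ore.randomForest(Species ~ ., IRIS)

  # Retrieve the tenth tree of the forest.
  tree10 <- grabTree(mod, k = 10, labelVar = TRUE)

  print(class(tree10))           # an ore.frame
  print(nrow(tree10))            # one row per node of the tree
  print(head(ore.pull(tree10)))  # first few nodes as a local data.frame
}
```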

Note:

Function ore.randomForest loads a copy of the training data for each embedded R session executing in parallel. For large datasets, this can exceed the amount of available memory. Oracle recommends that you adjust the number of parallel processes and the amount of available memory accordingly. The global option ore.parallel specifies the number of parallel processes. For information on controlling the amount of memory used by embedded R execution processes, see Controlling Memory Used by Embedded R in Oracle R Enterprise Installation and Administration Guide.
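A rough lower bound on the memory needed for the training data alone can be estimated locally before choosing the degree of parallelism. This is only a back-of-the-envelope sketch: the variable names are hypothetical, the value 4 is arbitrary, and the estimate ignores the memory used by the trees and by each R session itself.

```r
# Size of one in-memory copy of the training data, in bytes.
bytes_per_copy <- as.numeric(object.size(iris))

# Intended degree of parallelism (arbitrary illustrative value).
n_processes <- 4

# Each parallel R session holds its own copy of the training data.
total_mb <- bytes_per_copy * n_processes / 1024^2
cat(sprintf("About %.3f MB for %d copies of the training data\n",
            total_mb, n_processes))
```

If the estimate approaches the memory available to embedded R execution, lowering ore.parallel reduces the number of copies proportionally.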

Example 4-7 Using ore.randomForest

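# Using the iris dataset
# Push the local data.frame to the database as an ore.frame.
IRIS <- ore.push(iris)
# Build the model, using all other columns as predictors of Species.
mod <- ore.randomForest(Species~., IRIS)
# Retrieve the tenth tree of the forest.
tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
# Score the training data, keeping the actual Species for comparison.
ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
# Cross-tabulate the actual classes against the predicted classes.
table(ans$Species, ans$prediction)

# Using the infert dataset
INFERT <- ore.push(infert)
formula <- case ~ age + parity + education + spontaneous + induced

# Grow 1000 trees with a minimum terminal node size of 2.
rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
tree <- grabTree(rfMod, k = 500)

# Score the training data, keeping the actual case value.
rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")

# Rows are the actual values, columns are the predicted values.
confusion.matrix <- with(rfPred, table(case, prediction))

confusion.matrix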

Listing for Example 4-7

R> # Using the iris dataset
R> IRIS <- ore.push(iris)
R> mod <- ore.randomForest(Species~., IRIS)
R> tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
R> ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
R> table(ans$Species, ans$prediction)
            
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0        50

R> # Using the infert dataset
R> INFERT <- ore.push(infert)
R> formula <- case ~ age + parity + education + spontaneous + induced
R> 
R> rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
R> tree <- grabTree(rfMod, k = 500)
R> 
R> rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
R> 
R> confusion.matrix <- with(rfPred, table(case, prediction))

R> confusion.matrix
    prediction
case   0   1
   0 154  11
   1  27  56