6.6 Build a Random Forest Model
The ore.randomForest function provides an ensemble learning technique for classification of data in an ore.frame object.
The ore.randomForest function builds a Random Forest model by growing trees in parallel on the database server. It constructs many decision trees and, for each observation, outputs the class that is the mode of the classes predicted by the individual trees. Aggregating many trees in this way reduces overfitting, which is a common problem for single decision trees.
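The voting step can be illustrated without a database connection. The following base R sketch does not show how ore.randomForest is implemented; it assumes a hypothetical matrix of per-tree votes (tree_votes) and a hypothetical helper (majority_vote) only to show what taking the mode of the classes means for each observation.

```r
# Hypothetical votes: one row per observation, one column per tree.
tree_votes <- matrix(
  c("setosa",     "setosa",     "versicolor",
    "virginica",  "versicolor", "virginica",
    "versicolor", "versicolor", "versicolor"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("tree1", "tree2", "tree3"))
)

# Hypothetical helper: return the most frequent class in a vector of votes.
# Ties go to the first class in table() order (alphabetical).
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# Apply the vote row by row to obtain one predicted class per observation.
forest_prediction <- apply(tree_votes, 1, majority_vote)
print(forest_prediction)
```

With an odd number of trees and two classes a tie cannot occur; with more classes, a real implementation needs an explicit tie-breaking rule, which this sketch handles only by ordering.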
The Random Forest algorithm, developed by Leo Breiman and Adele Cutler, combines the ideas of bagging and the random selection of variables, which results in a collection of decision trees with controlled variance. The Random Forest algorithm provides high accuracy, but performance and scalability can be issues for large data sets.
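The two ideas named above, bagging and random selection of variables, can be sketched in base R. This is a simplified illustration under stated assumptions, not the algorithm as ore.randomForest runs it: the names n_trees, mtry, and draws are chosen here for illustration, and the variable subset is drawn once per tree, whereas the Random Forest algorithm redraws it at every node split.

```r
set.seed(42)  # fixed seed so that the sketch is reproducible

predictors <- setdiff(names(iris), "Species")
n_trees <- 5                              # illustrative; real forests use many more
mtry <- floor(sqrt(length(predictors)))   # a common choice for classification

# For each tree: draw a bootstrap sample of rows (bagging) and a random
# subset of candidate predictors (random variable selection).
draws <- lapply(seq_len(n_trees), function(i) {
  list(
    rows = sample(nrow(iris), replace = TRUE),
    vars = sample(predictors, mtry)
  )
})

# Fraction of distinct rows that each bootstrap sample contains.
distinct_fraction <- sapply(draws, function(d) {
  length(unique(d$rows)) / nrow(iris)
})
print(round(distinct_fraction, 2))
```

Because sampling is done with replacement, each bootstrap sample omits some rows, so each tree sees a different view of the data. That difference between trees is what the aggregation step exploits to control variance.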
The ore.randomForest function executes in parallel for model building and scoring. Parallel execution can occur whether you are using the randomForest package in Oracle R Distribution (ORD) or the open source randomForest package 4.6-10. Using ore.randomForest with ORD can require less memory than using ore.randomForest with the open source alternative. If you use the open source randomForest package, Oracle Machine Learning for R issues a warning.
The ore.randomForest function uses the global option ore.parallel to determine the degree of parallelism to employ. The function returns an ore.randomForest object.
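Global options in R are set with the base function options. A minimal sketch, assuming that a degree of parallelism of 4 suits your database server; the value is illustrative only, and the commented-out model build requires an active Oracle Machine Learning for R connection.

```r
# Set the option and keep the previous value so that it can be restored.
old_opts <- options(ore.parallel = 4)   # 4 is an illustrative value

# Confirm the value that later ore.randomForest calls would pick up.
current <- getOption("ore.parallel")
print(current)

# With an active OML4R connection, a model build would use this setting:
# mod <- ore.randomForest(Species ~ ., ore.push(iris))

# Restore the previous setting when finished.
options(old_opts)
```

Saving and restoring the option keeps the change local to one piece of work, which matters when other embedded R operations in the same session rely on a different setting.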
An invocation of the scoring method predict on an ore.randomForest object also runs in parallel on the database server. The cache.model argument specifies whether to cache the entire Random Forest model in memory during prediction. If sufficient memory is available, use the default cache.model value of TRUE for better performance.
The grabTree method returns an ore.frame object that contains information on the specified tree. Each row of the ore.frame represents one node of the tree.
Note:
The ore.randomForest function loads a copy of the training data for each embedded R session executing in parallel. For large data sets, this can exceed the amount of available memory. Oracle recommends that you adjust the number of parallel processes and the amount of available memory accordingly. The global option ore.parallel specifies the number of parallel processes. For information on controlling the amount of memory used by embedded R execution processes, see Controlling Memory Used by Embedded R in Oracle Machine Learning for R Installation and Administration Guide.
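A rough way to reason about this note is to multiply the size of the training data by the number of parallel processes. The base R sketch below is only an approximation under stated assumptions: it measures a local data.frame with object.size as a proxy for the size of one copy, uses an illustrative value of 4 processes, and ignores the memory needed for the trees themselves.

```r
# Approximate in-memory size of one copy of the training data, in megabytes.
train_mb <- as.numeric(object.size(iris)) / 1024^2

parallel_procs <- 4L   # illustrative value for the ore.parallel option

# Lower bound on the memory taken by the copies of the training data alone.
total_mb <- train_mb * parallel_procs

cat(sprintf("About %.3f MB per copy, %.3f MB across %d processes\n",
            train_mb, total_mb, parallel_procs))
```

If the estimate approaches the memory available to embedded R processes, lowering the degree of parallelism is the adjustment that the note describes.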
Example 6-7 Using ore.randomForest
# Using the iris dataset
IRIS <- ore.push(iris)
mod <- ore.randomForest(Species~., IRIS)
tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
table(ans$Species, ans$prediction)
# Using the infert dataset
INFERT <- ore.push(infert)
formula <- case ~ age + parity + education + spontaneous + induced
rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
tree <- grabTree(rfMod, k = 500)
rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
confusion.matrix <- with(rfPred, table(case, prediction))
confusion.matrix
Listing for This Example
R> # Using the iris dataset
R> IRIS <- ore.push(iris)
R> mod <- ore.randomForest(Species~., IRIS)
R> tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
R> ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
R> table(ans$Species, ans$prediction)
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0        50
R> # Using the infert dataset
R> INFERT <- ore.push(infert)
R> formula <- case ~ age + parity + education + spontaneous + induced
R>
R> rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
R> tree <- grabTree(rfMod, k = 500)
R>
R> rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
R>
R> confusion.matrix <- with(rfPred, table(case, prediction))
R> confusion.matrix
    prediction
case   0   1
   0 154  11
   1  27  56
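The confusion matrix for the infert model can be summarized with a few standard measures. The following base R sketch copies the four counts from the listing into a local matrix; the names confusion, accuracy, sensitivity, and specificity are introduced here for illustration. Because the example scores the same data that trained the model, these figures are likely to be optimistic compared with performance on unseen data.

```r
# Counts copied from the listing: rows are actual values of case,
# columns are predicted values.
confusion <- matrix(c(154, 11,
                       27, 56),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(case = c("0", "1"),
                                    prediction = c("0", "1")))

# Share of all observations classified correctly: (154 + 56) / 248.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of actual cases (case = 1) predicted as 1: 56 / 83.
sensitivity <- confusion["1", "1"] / sum(confusion["1", ])

# Share of actual controls (case = 0) predicted as 0: 154 / 165.
specificity <- confusion["0", "0"] / sum(confusion["0", ])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```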