The ore.randomForest function provides an ensemble learning technique for classification of data in an ore.frame object.
ore.randomForest builds a random forest model by growing trees in parallel on the database server. It constructs many decision trees and outputs the class that is the mode of the classes predicted by the individual trees. Combining many trees in this way reduces overfitting, which is a common problem for individual decision trees.
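The voting step described above can be sketched in base R. This is only an illustration of taking the mode of per-tree predictions, not the internal implementation of ore.randomForest; the vote matrix and the helper name majority_vote are hypothetical, and ties here simply go to the class that sorts first.

```r
# Hypothetical per-tree class votes for three observations (illustrative data only).
# Each row is one tree; each column is one observation.
tree_votes <- rbind(
  tree1 = c("setosa",     "versicolor", "virginica"),
  tree2 = c("setosa",     "virginica",  "virginica"),
  tree3 = c("setosa",     "versicolor", "versicolor"),
  tree4 = c("setosa",     "versicolor", "virginica"),
  tree5 = c("versicolor", "versicolor", "virginica")
)

# Return the most frequent class among the votes (the mode).
# Simplification: on a tie, which.max() picks the first class in sorted order.
majority_vote <- function(votes) {
  counts <- table(votes)
  names(counts)[which.max(counts)]
}

# One forest-level prediction per observation (column).
forest_prediction <- apply(tree_votes, 2, majority_vote)
print(forest_prediction)
```

Each observation receives the class chosen by most trees, so a single tree that disagrees (tree5 on the first observation, for example) does not change the result.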
The random forest algorithm, developed by Leo Breiman and Adele Cutler, combines the ideas of bagging and the random selection of variables, which results in a collection of decision trees with controlled variance. The random forest algorithm provides high accuracy, but performance and scalability can be issues for large data sets.
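The two ingredients named above, bagging and random selection of variables, can be illustrated with a few lines of base R on the local iris data. This is a minimal sketch of the sampling steps for a single tree, not the code that ore.randomForest runs; the variable names are hypothetical, the seed is arbitrary, and the use of floor(sqrt(p)) candidates per split is assumed here as a common default for classification.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

predictors <- setdiff(names(iris), "Species")
n <- nrow(iris)

# Bagging: draw n rows with replacement to form the training sample for one tree.
boot_rows <- sample(n, size = n, replace = TRUE)

# Rows that were never drawn are "out of bag" for this tree.
oob_rows <- setdiff(seq_len(n), boot_rows)

# Random variable selection: only a random subset of predictors is
# considered as split candidates at a given node.
mtry <- floor(sqrt(length(predictors)))
split_candidates <- sample(predictors, size = mtry)

cat("Bootstrap sample size:", length(boot_rows), "\n")
cat("Out-of-bag rows:", length(oob_rows), "\n")
cat("Split candidates at one node:", split_candidates, "\n")
```

Because every tree sees a different bootstrap sample and a different subset of variables at each split, the trees are less correlated with one another, which is what keeps the variance of the combined model under control.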
ore.randomForest executes in parallel for model building and scoring. Parallel execution can occur whether you are using the randomForest package in Oracle R Distribution (ORD) or the open source randomForest package 4.6-10. Using ore.randomForest and ORD can require less memory than using ore.randomForest with the open source alternative. If you use the open source randomForest package, Oracle R Enterprise issues a warning.
ore.randomForest uses the global option ore.parallel to determine the degree of parallelism to employ. The function returns an ore.randomForest object.
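As a small illustration, the global option can be set with the standard R options function before building the model. The value 4 is an arbitrary choice for this sketch, and which values are appropriate for a given database is an assumption to check against your own installation.

```r
# Request a degree of parallelism of 4 (arbitrary illustrative value).
options(ore.parallel = 4)

# Confirm the value that later calls would read.
getOption("ore.parallel")

# With an active Oracle R Enterprise connection, a subsequent call such as
#   mod <- ore.randomForest(Species ~ ., IRIS)
# would then consult this option when deciding how many processes to use.
```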
An invocation of the scoring method predict on an ore.randomForest object also runs in parallel on the database server. The cache.model argument specifies whether to cache the entire random forest model in memory during prediction. If sufficient memory is available, use the default cache.model value of TRUE for better performance.
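When memory is tight, the argument can be set to FALSE at scoring time. The wrapper below is a hypothetical convenience function, not part of Oracle R Enterprise; it assumes that model is an ore.randomForest object and newdata is an ore.frame, and it only forwards its arguments to predict.

```r
# Hypothetical helper (not part of Oracle R Enterprise): score without
# caching the whole forest in memory during prediction.
# `model` is assumed to be an ore.randomForest object, `newdata` an ore.frame.
predict_low_memory <- function(model, newdata, ...) {
  predict(model, newdata, cache.model = FALSE, ...)
}

# Usage, assuming an active connection and the objects from the iris example:
#   ans <- predict_low_memory(mod, IRIS, type = "all",
#                             supplemental.cols = "Species")
```

This trades scoring speed for a smaller memory footprint, so it is worth trying only when the default setting runs out of memory.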
The grabTree method returns an ore.frame object that contains information on the specified tree. Each row of the ore.frame represents one node of the tree.
ore.randomForest loads a copy of the training data for each embedded R session executing in parallel. For large data sets, this can exceed the amount of available memory. Oracle recommends that you adjust the number of parallel processes and the amount of available memory accordingly. The global option ore.parallel specifies the number of parallel processes. For information on controlling the amount of memory used by embedded R execution processes, see Controlling Memory Used by Embedded R in the Oracle R Enterprise Installation and Administration Guide.
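A rough way to reason about this is to multiply the in-memory size of one copy of the training data by the number of parallel processes. The sketch below does that arithmetic for the local iris data frame; the degree of parallelism is a hypothetical value, and the result is only a lower bound because it ignores the memory needed to grow the trees and the overhead of each R session.

```r
# In-memory size of one copy of the training data, in bytes.
bytes_per_copy <- as.numeric(object.size(iris))

# Hypothetical degree of parallelism (the value ore.parallel would hold).
parallel_degree <- 4

# Lower bound on the memory consumed by the training data alone,
# assuming one full copy per embedded R session.
total_bytes <- bytes_per_copy * parallel_degree

cat(sprintf("About %.1f KB for %d copies of the training data\n",
            total_bytes / 1024, parallel_degree))
```

If the estimate approaches the memory available to embedded R execution, lowering the degree of parallelism is the simplest adjustment.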
Example 4-7 Using ore.randomForest
# Using the iris dataset
IRIS <- ore.push(iris)
mod <- ore.randomForest(Species~., IRIS)
tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
table(ans$Species, ans$prediction)

# Using the infert dataset
INFERT <- ore.push(infert)
formula <- case ~ age + parity + education + spontaneous + induced

rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
tree <- grabTree(rfMod, k = 500)

rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")

confusion.matrix <- with(rfPred, table(case, prediction))
confusion.matrix
Listing for Example 4-7
R> # Using the iris dataset
R> IRIS <- ore.push(iris)
R> mod <- ore.randomForest(Species~., IRIS)
R> tree10 <- grabTree(mod, k = 10, labelVar = TRUE)
R> ans <- predict(mod, IRIS, type="all", supplemental.cols="Species")
R> table(ans$Species, ans$prediction)
            
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0        50

R> # Using the infert dataset
R> INFERT <- ore.push(infert)
R> formula <- case ~ age + parity + education + spontaneous + induced
R> 
R> rfMod <- ore.randomForest(formula, INFERT, ntree=1000, nodesize = 2)
R> tree <- grabTree(rfMod, k = 500)
R> 
R> rfPred <- predict(rfMod, INFERT, supplemental.cols = "case")
R> 
R> confusion.matrix <- with(rfPred, table(case, prediction))
R> confusion.matrix
    prediction
case   0   1
   0 154  11
   1  27  56