6.2.5.1 Partition on a Single Column
This example uses the ore.groupApply function and partitions the data on a single column.
The example uses the C50 package, which has functions that build decision tree and rule-based models. The package also provides training and testing data sets. The example builds C5.0 models on the churnTrain training data set from the churn data set of the C50 package, with the goal of building one churn model on the data for each state. The example does the following:
- Loads the C50 package and then the churn data set.
- Uses the ore.create function to create the CHURN_TRAIN database table and its proxy ore.frame object from churnTrain, a data.frame object.
- Specifies CHURN_TRAIN, the proxy ore.frame object, as the first argument to the ore.groupApply function and specifies the state column as the INDEX argument. The ore.groupApply function partitions the data on the state column and invokes the user-defined function on each partition.
- Creates the variable modList, which gets the ore.list object returned by the ore.groupApply function. The ore.list object contains the results from the execution of the user-defined function on each partition of the data. In this case, it is one C5.0 model per state, with each model stored as an ore.object object.
- Specifies the user-defined function. The first argument of the user-defined function receives one partition of the data, which in this case is all of the data associated with a single state.
  The user-defined function does the following:
  - Loads the C50 package so that it is available to the function when it executes in an R engine in the database.
  - Deletes the state column from the data.frame so that the column is not included in the model.
  - Converts the columns to factors because, although the ore.frame defines them as factors, they arrive in the user-defined function as character vectors.
  - Builds a model for a state and returns it.
- Uses the ore.pull function to retrieve the model for the state MA from the database as the mod.MA variable and then invokes the summary function on it. The class of mod.MA is C5.0.
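The partition-and-apply pattern described in these steps can be tried locally, without a database connection, using base R. The following is a minimal sketch under stated assumptions, not the way ore.groupApply executes: split and lapply stand in for the in-database partitioning, the toy data frame holds made-up values rather than the churn data, and the perState function is a hypothetical stand-in that returns a small summary instead of a C5.0 model.

```r
# Toy data standing in for CHURN_TRAIN (hypothetical values, not the churn data).
toy <- data.frame(
  state = c("MA", "MA", "MA", "NH", "NH", "NH"),
  international_plan = c("no", "yes", "no", "no", "no", "yes"),
  total_day_charge = c(30.1, 45.2, 28.7, 33.0, 50.4, 41.9),
  churn = c("no", "yes", "no", "no", "yes", "yes"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the user-defined function. It performs the same
# preprocessing steps but returns a per-partition summary, not a model.
perState <- function(dat) {
  dat$state <- NULL                                    # drop the partitioning column
  dat$churn <- as.factor(dat$churn)                    # character -> factor
  dat$international_plan <- as.factor(dat$international_plan)
  list(columns = names(dat),
       churnClass = class(dat$churn),
       churnRate = mean(dat$churn == "yes"))
}

# split() yields one data.frame per distinct state; lapply() runs the
# function on each, giving a named list with one result per state.
partitions <- split(toy, toy$state)
resList <- lapply(partitions, perState)

names(resList)            # one entry per state
resList$MA$churnRate      # result for a single partition, accessed by name
```

As with modList$MA in the example, each partition's result is reachable by the value of the partitioning column.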
Example 6-12 Using the ore.groupApply Function
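library(C50)
data("churn")

# Create the CHURN_TRAIN table and its proxy ore.frame from churnTrain
ore.create(churnTrain, "CHURN_TRAIN")

# Build one C5.0 rule-based model per state
modList <- ore.groupApply(
  CHURN_TRAIN,
  INDEX=CHURN_TRAIN$state,
  function(dat) {
    library(C50)
    # Exclude the partitioning column from the model
    dat$state <- NULL
    # Factor columns arrive as character vectors; convert them back
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    dat$international_plan <- as.factor(dat$international_plan)
    dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
    C5.0(churn ~ ., data = dat, rules = TRUE)
  });

# Retrieve the model for the state MA and summarize it
mod.MA <- ore.pull(modList$MA)
summary(mod.MA)
Listing for This Example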
R> library(C50)
R> data(churn)
R>
R> ore.create(churnTrain, "CHURN_TRAIN")
R>
R> modList <- ore.groupApply(
+    CHURN_TRAIN,
+    INDEX=CHURN_TRAIN$state,
+    function(dat) {
+      library(C50)
+      dat$state <- NULL
+      dat$churn <- as.factor(dat$churn)
+      dat$area_code <- as.factor(dat$area_code)
+      dat$international_plan <- as.factor(dat$international_plan)
+      dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
+      C5.0(churn ~ ., data = dat, rules = TRUE)
+    });
R> mod.MA <- ore.pull(modList$MA)
R> summary(mod.MA)

Call:
C5.0.formula(formula = churn ~ ., data = dat, rules = TRUE)

C5.0 [Release 2.07 GPL Edition] Thu Feb 13 15:09:10 2014
-------------------------------

Class specified by attribute `outcome'

Read 65 cases (19 attributes) from undefined.data

Rules:

Rule 1: (52/1, lift 1.2)
    international_plan = no
    total_day_charge <= 43.04
    -> class no [0.963]

Rule 2: (5, lift 5.1)
    total_day_charge > 43.04
    -> class yes [0.857]

Rule 3: (6/1, lift 4.4)
    area_code in {area_code_408, area_code_415}
    international_plan = yes
    -> class yes [0.750]

Default class: no

Evaluation on training data (65 cases):

            Rules
      ----------------
        No      Errors

         3    2( 3.1%)   <<

       (a)   (b)    <-classified as
      ----  ----
        53     1    (a): class no
         1    10    (b): class yes

    Attribute usage:

     89.23% international_plan
     87.69% total_day_charge
      9.23% area_code

Time: 0.0 secs
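The figures in the listing can be cross-checked with a few lines of arithmetic. The sketch below assumes the conventions that C5.0 documents for rulesets: a rule written as (n/m) covers n training cases with m errors, the bracketed confidence is the Laplace ratio (n - m + 1) / (n + 2), and lift is that confidence divided by the relative frequency of the predicted class. The helper name ruleConfidence and the variable names are hypothetical; only base R is used.

```r
# Laplace-corrected confidence for a rule covering n cases with m errors.
ruleConfidence <- function(n, m = 0) (n - m + 1) / (n + 2)

conf <- c(rule1 = ruleConfidence(52, 1),   # Rule 1: (52/1)
          rule2 = ruleConfidence(5),       # Rule 2: (5)
          rule3 = ruleConfidence(6, 1))    # Rule 3: (6/1)
round(conf, 3)   # 0.963 0.857 0.750, matching the bracketed values

# Class frequencies from the confusion matrix rows: 53 + 1 "no", 1 + 10 "yes".
prior <- c(no = 54, yes = 11) / 65

# Lift: confidence divided by the prior of the class each rule predicts.
lift <- conf / prior[c("no", "yes", "yes")]
round(lift, 1)   # 1.2 5.1 4.4, matching the rule headers

# Training error: the two off-diagonal cells over all 65 cases.
errorPct <- 100 * (1 + 1) / 65
round(errorPct, 1)   # 3.1

# Attribute usage: share of cases covered by rules that test the attribute.
# The rules sharing an attribute here cover disjoint cases, so counts add.
usage <- 100 * c(international_plan = 52 + 6,
                 total_day_charge   = 52 + 5,
                 area_code          = 6) / 65
round(usage, 2)   # 89.23 87.69 9.23
```

With only 65 cases for this state, each rule rests on a handful of observations, which is why a rule with no training errors, such as Rule 2, still reports a confidence well below 1.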