6.2.5.1 Partitioning on a Single Column

Example 6-13 uses the C50 package, which has functions that build decision tree and rule-based models. The package also provides training and testing data sets. Example 6-13 builds C5.0 models on the churnTrain training data set from the churn data set of the C50 package, with the goal of building one churn model on the data for each state. The example does the following:

  • Loads the C50 package and then the churn data set.

  • Uses the ore.create function to create the CHURN_TRAIN database table and its proxy ore.frame object from churnTrain, a data.frame object.

  • Specifies CHURN_TRAIN, the proxy ore.frame object, as the first argument to the ore.groupApply function and specifies the state column as the INDEX argument. The ore.groupApply function partitions the data on the state column and invokes the user-defined function on each partition.

  • Creates the variable modList, which gets the ore.list object returned by the ore.groupApply function. The ore.list object contains the results from the execution of the user-defined function on each partition of the data. In this case, it is one C5.0 model per state, with each model stored as an ore.object object.

  • Specifies the user-defined function. The first argument of the user-defined function receives one partition of the data, which in this case is all of the data associated with a single state.

    The user-defined function does the following:

    • Loads the C50 package so that it is available to the function when it executes in an R engine in the database.

    • Deletes the state column from the data.frame so that the column is not included in the model.

    • Converts the columns to factors because, although the ore.frame defined factors, when they are loaded to the user-defined function, factors appear as character vectors.

    • Builds a model for a state and returns it.

  • Uses the ore.pull function to retrieve the model from the database as the mod.MA variable and then invokes the summary function on it. The class of mod.MA is C5.0.

Example 6-13 Using the ore.groupApply Function

library(C50)
data("churn")
 
ore.create(churnTrain, "CHURN_TRAIN")

modList <- ore.groupApply(
  CHURN_TRAIN,
  INDEX=CHURN_TRAIN$state,
    function(dat) {
      library(C50)
      dat$state <- NULL
      dat$churn <- as.factor(dat$churn)
      dat$area_code <- as.factor(dat$area_code)
      dat$international_plan <- as.factor(dat$international_plan)
      dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
      C5.0(churn ~ ., data = dat, rules = TRUE)
    });
mod.MA <- ore.pull(modList$MA)
summary(mod.MA)
Listing for Example 6-13
R> library(C50)
R> data(churn)
R> 
R> ore.create(churnTrain, "CHURN_TRAIN")
R>
R> modList <- ore.groupApply(
+   CHURN_TRAIN,
+   INDEX=CHURN_TRAIN$state,
+     function(dat) {
+       library(C50)
+       dat$state <- NULL
+       dat$churn <- as.factor(dat$churn)
+       dat$area_code <- as.factor(dat$area_code)
+       dat$international_plan <- as.factor(dat$international_plan)
+       dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
+       C5.0(churn ~ ., data = dat, rules = TRUE)
+     });
R> mod.MA <- ore.pull(modList$MA)
R> summary(mod.MA)
 
Call:
C5.0.formula(formula = churn ~ ., data = dat, rules = TRUE)
 
 
C5.0 [Release 2.07 GPL Edition]         Thu Feb 13 15:09:10 2014
-------------------------------
 
Class specified by attribute `outcome'
 
Read 65 cases (19 attributes) from undefined.data
 
Rules:
 
Rule 1: (52/1, lift 1.2)
        international_plan = no
        total_day_charge <= 43.04
        ->  class no  [0.963]
 
Rule 2: (5, lift 5.1)
        total_day_charge > 43.04
        ->  class yes  [0.857]
 
Rule 3: (6/1, lift 4.4)
        area_code in {area_code_408, area_code_415}
        international_plan = yes
        ->  class yes  [0.750]
 
Default class: no
 
 
Evaluation on training data (65 cases):
 
                Rules     
          ----------------
            No      Errors
 
             3    2( 3.1%)   <<
 
 
           (a)   (b)    <-classified as
          ----  ----
            53     1    (a): class no
             1    10    (b): class yes
 
 
        Attribute usage:
 
         89.23% international_plan
         87.69% total_day_charge
          9.23% area_code
 
 
Time: 0.0 secs