6.2.6 Using the ore.rowApply Function

The ore.rowApply function invokes an R script with an ore.frame as the input data. The ore.rowApply function passes the ore.frame to the user-defined input function as the first argument to that function. The rows argument to the ore.rowApply function specifies the number of rows to pass to each invocation of the user-defined R function. The last chunk or rows may have fewer rows than the number specified. The ore.rowApply function can use data-parallel execution, in which one or more R engines perform the same R function, or task, on different partitions of data.

The syntax of the ore.rowApply function is the following:

ore.rowApply(X, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, rows = 1, FUN.OWNER = NULL,
             parallel = getOption("ore.parallel", NULL))

The ore.rowApply function returns an ore.list object or an ore.frame object.

See Also:

Example 6-15 uses the e1071 package, previously downloaded from CRAN. The example also uses the nbmod object, which is the Naive Bayes model created in Example 6-12.

Example 6-15 does the following:

  • Loads the package e1071.

  • Pushes the iris data set to the database as the IRIS temporary table and ore.frame object.

  • Creates a copy of IRIS as IRIS_PRED and adds the PRED column to IRIS_PRED to contain the predictions.

  • Invokes the ore.rowApply function, passing the IRIS ore.frame as the data source for user-defined R function and the user-defined R function itself.

  • The user-defined function does the following:

    • Loads the package e1071 so that it is available to the R engine or engines that run in the database.

    • Converts the Species column to a factor because, although the ore.frame defined factors, when they are loaded to the user-defined function, factors appear as character vectors.

    • Invokes the predict method and returns the res object, which contains the predictions in the column added to the data set.

  • The example pulls the model to the client R session.

  • Passes IRIS_PRED as the argument FUN.VALUE, which specifies the structure of the object that the ore.rowApply function returns.

  • Specifies the number of rows to pass to each invocation of the user-defined function.

  • Displays the class of res, and invokes the table function to display the Species column and the PRED column of the res object.

Example 6-15 Using the ore.rowApply Function

library(e1071)
IRIS <- ore.push(iris)
IRIS_PRED <- IRIS
IRIS_PRED$PRED <- "A"
res <- ore.rowApply(
  IRIS,
  function(dat, nbmod) {
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    dat$PRED <- predict(nbmod, newdata = dat)
    dat
  },
  nbmod = ore.pull(nbmod),
  FUN.VALUE = IRIS_PRED,
  rows = 10)
class(res)
table(res$Species, res$PRED)
Listing for Example 6-15
R> library(e1071)
R> IRIS <- ore.push(iris)      
R> IRIS_PRED <- IRIS
R> IRIS_PRED$PRED <- "A"
R> res <- ore.rowApply(
+   IRIS ,
+   function(dat, nbmod) {
+     library(e1071)
+     dat$Species <- as.factor(dat$Species)
+     dat$PRED <- predict(nbmod, newdata = dat)
+     dat
+   },
+   nbmod = ore.pull(nbmod),
+   FUN.VALUE = IRIS_PRED,
+   rows = 10)
R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> table(res$Species, res$PRED)
            
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

As Example 6-13 does, Example 6-16 uses the C50 package to score churn data (that is, to predict which customers are likely to churn) using C5.0 models. However, instead of partitioning the data by a column, Example 6-16 partitions the data by a number of rows. The example scores the customers from the specified state in parallel. The example uses datastores and saves functions to the Oracle R Enterprise R script repository, which allows the functions to be used by the Oracle R Enterprise SQL API functions.

Example 6-16 first loads C50 package and the data sets. The example deletes the datastores with names containing myC5.0modelFL, if they exist. It invokes ore.drop to delete the CHURN_TEST table, if it exists, and then invokes ore.create to create the CHURN_TEST table from the churnTest data set.

The example next invokes ore.getLevels, which returns a list of the levels for each factor column. The invocation excludes the first column, which is state, because the levels for that column are not needed. Getting the levels first can ensure that all possible levels are provided during model building, even if some rows do not have values for some of the levels. The ore.delete invocation ensures that no datastore with the specified name exists and the ore.save invocation saves the xlevels object in the datastore named myXLevels.

Example 6-16 creates a user-defined function, myC5.0FunctionForLevels, that generates a C5.0 model. The function uses the list of levels returned by function ore.getXlevels instead of computing the levels using the as.factor function as the user-defined function does in Example 6-13. It uses the levels to convert the column type from character vector to factor. The function myC5.0FunctionForLevels returns the value TRUE. The example saves the function in the Oracle R Enterprise R script repository.

The example next gets a list of datastores that have names that include the specified string and deletes those datastores if they exist.

The example then invokes ore.groupApply, which invokes function myC5.0FunctionForLevels on each state in the CHURN_TEST data. To each myC5.0FunctionForLevels invocation, ore.groupApply passes the datastore that contains the xlevels object and a prefix to use in naming the datastore generated by myC5.0FunctionForLevels. It also passes the ore.connect control argument to connect to the database in the embedded R function, which enables the use of objects stored in a datastore. The ore.groupApply invocation returns a list that contains the results of all of the invocations of myC5.0FunctionForLevels.

The example pulls the result over to the local R session and verifies that myC5.0FunctionForLevels returned TRUE for each state in the data source.

Example 6-16 next creates another user-defined another function, myScoringFunction, and stores it in the Oracle R Enterprise R script repository. The function scores a C5.0 model for the levels of a state and returns the results in a data.frame.

The example then invokes function ore.rowApply. It filters the input data to use only data for the state of Massachusetts. It specifies myScoringFunction as the function to invoke and passes that user-defined function the name of the datastore that contains the xlevels object and a prefix to use in loading the datastore that contains the C5.0 model for the state. The ore.rowApply invocation specifies invoking myScoringFunction on 200 rows of the data set in each parallel R engine. It uses the FUN.VALUE argument so that ore.rowApply returns an ore.frame that contains the results of all of the myScoringFunction invocations. The variable scores gets the results of the ore.rowApply invocation.

Finally, Example 6-16 prints the scores object and then uses the table function to display the confusion matrix for the scoring.

See Also:

Example A-8 for an invocation of the rqRowEval function that produces the same result as the ore.rowApply function in Example 6-16

Example 6-16 Using the ore.rowApply Function with Datastores and Scripts

library(C50)
data(churn)

ore.drop("CHURN_TEST"
ore.create(churnTest, "CHURN_TEST")

xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
ore.delete("myXLevels")
ore.save(xlevels, name = "myXLevels")

ore.scriptDrop("myC5.0FunctionForLevels")
ore.scriptCreate("myC5.0FunctionForLevels",
  function(dat, xlevelsDatastore, datastorePrefix) {
    library(C50)
    state <- dat[1,"state"]
    datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
    dat$state <- NULL
    ore.load(name = xlevelsDatastore)
    for (j in names(xlevels))
       dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
    c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
    ore.save(c5mod, name = datastoreName)
    TRUE
  })

ds.v <- ore.datastore(pattern= "myC5.0modelFL")$datastore.name
for (ds in ds.v) ore.delete(name = ds)

res <- ore.groupApply(CHURN_TEST, 
          INDEX=CHURN_TEST$state,
          FUN.NAME = "myC5.0FunctionForLevels",
          xlevelsDatastore = "myXLevels",
          datastorePrefix = "myC5.0modelFL",
          ore.connect = TRUE)
res <- ore.pull(res)
all(as.logical(res) == TRUE)

ore.scriptDrop("myScoringFunction")
ore.scriptCreate("myScoringFunction",
                  function(dat, xlevelsDatastore, datastorePrefix) {
                     library(C50)
                     state <- dat[1,"state"]
                     datastoreName <- paste(datastorePrefix,state,sep="_")
                     dat$state <- NULL
                     ore.load(name = xlevelsDatastore)
                     for (j in names(xlevels))
                        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
                     ore.load(name = datastoreName)
                     res <- data.frame(pred = predict(c5mod, dat, type = "class"),
                                              actual = dat$churn,
                                              state = state)
                     res
                  }
                )

scores <- ore.rowApply(
  CHURN_TEST[CHURN_TEST$state == "MA",],
  FUN.NAME = "myScoringFunction",
  xlevelsDatastore = "myXLevels", 
  datastorePrefix = "myC5.0modelFL",
  ore.connect = TRUE, parallel = TRUE,
  FUN.VALUE = data.frame(pred = character(0), 
                         actual = character(0), 
                         state = character(0)),
  rows=200)
scores
table(scores$actual, scores$pred)
Listing for Example 6-16
R> library(C50)
R> data(churn)
R>
R> ore.drop("CHURN_TEST"
R> ore.create(churnTest, "CHURN_TEST")
R>
R> xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
R> ore.delete("myXLevels")
[1] "myXLevels
R> ore.save(xlevels, name = "myXLevels")
R>
R> ore.scriptDrop("myC5.0FunctionForLevels")
R> ore.scriptCreate("myC5.0FunctionForLevels",
+    function(dat, xlevelsDatastore, datastorePrefix) {
+      library(C50)
+      state <- dat[1,"state"]
+      datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
+      dat$state <- NULL
+      ore.load(name = xlevelsDatastore)
+      for (j in names(xlevels))
+        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+      c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
+      ore.save(c5mod, name = datastoreName)
+      TRUE
+    })
R>
R> ds.v <- ore.datastore(pattern="myC5.0modelFL")$datastore.name
R> for (ds in ds.v) ore.delete(name=ds)
R>
R> res <- ore.groupApply(CHURN_TEST, 
+                       INDEX=CHURN_TEST$state,
+                       FUN.NAME="myC5.0FunctionForLevels",
+                       xlevelsDatastore = "myXLevels",
+                       datastorePrefix = "myC5.0modelFL",
+                       ore.connect = TRUE)
R> res <- ore.pull(res)
R> all(as.logical(res) == TRUE)
[1] TRUE
R>
R> ore.scriptDrop("myScoringFunction")
R> ore.scriptCreate("myScoringFunction",
+                    function(dat, xlevelsDatastore, datastorePrefix) {
+                      library(C50)
+                      state <- dat[1,"state"]
+                      datastoreName <- paste(datastorePrefix,state,sep="_")
+                      dat$state <- NULL
+                      ore.load(name = xlevelsDatastore)
+                      for (j in names(xlevels))
+                          dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+                      ore.load(name = datastoreName)
+                      res <- data.frame(pred = predict(c5mod, dat, type="class"),
+                                        actual = dat$churn,
+                                        state = state)
+                      res
+                  }
+ )
R>
R> scores <- ore.rowApply(
+     CHURN_TEST[CHURN_TEST$state =="MA",],
+     FUN.NAME = "myScoringFunction",
+     xlevelsDatastore = "myXLevels",
+     datastorePrefix = "myC5.0modelFL",
+     ore.connect = TRUE, parallel = TRUE,
+     FUN.VALUE = data.frame(pred=character(0),
+                            actual=character(0),
+                            state=character(0)),
+     rows=200
R>
R> scores
   pred actual state
1    no     no    MA
2    no     no    MA
3    no     no    MA
4    no     no    MA
5    no     no    MA
6    no    yes    MA
7   yes    yes    MA
8   yes    yes    MA
9    no     no    MA
10   no     no    MA
11   no     no    MA
12   no     no    MA
13   no     no    MA
14   no     no    MA
15  yes    yes    MA
16   no     no    MA
17   no     no    MA
18   no     no    MA
19   no     no    MA
20   no     no    MA
21   no     no    MA
22   no     no    MA
23   no     no    MA
24   no     no    MA
25   no     no    MA
26   no     no    MA
27   no     no    MA
28   no     no    MA
29   no    yes    MA
30   no     no    MA
31   no     no    MA
32   no     no    MA
33  yes    yes    MA
34   no     no    MA
35   no     no    MA
36   no     no    MA
37   no     no    MA
38   no     no    MA
Warning message:
ORE object has no unique key - using random order
R> table(scores$actual, scores$pred)
 
      no yes
  no  32   0
  yes  2   4