6 Use Oracle Machine Learning for R Embedded R Execution

Embedded R execution in OML4R enables you to invoke R scripts in R sessions that run on the Oracle Database server.

These topics discuss embedded R execution:

6.1 About Oracle Machine Learning for R Embedded R Execution

In OML4R, embedded R execution is the ability to run R scripts in R engines that are dynamically started and managed by the database.

You can store R scripts in the OML4R script repository and to invoke such scripts with embedded R functions. When invoked, a script executes in one or more R engines that run on the database server. OML4R provides both an R interface and a SQL interface for embedded R execution. From the same R script you can get structured data, an XML representation of R objects and images, and even PNG images through a BLOB column in a database table.

The following topics describe embedded R execution:

6.1.1 Benefits of Embedded R Execution

Embedded R execution has the following benefits:

  • Eliminates moving data from the Oracle Database server to your local R session.

    As well as being more secure, the transfer of database data between Oracle Database and an internal R engine is much faster than to a separate client R engine.

  • Uses the database server to start, manage, and control the execution of R scripts in R engines running on the server.

  • Leverages the memory and processing power of the database server machine for R engine execution, which provides better scalability and performance.

  • Enables data-parallel and task-parallel execution of user-defined R functions that correspond to special cases of Hadoop Map-Reduce jobs.

  • Provides parallel simulations capability.

  • Allows the use of open source CRAN packages in R scripts running on the database server.

  • Provides the ability to develop and operationalize comprehensive scripts for analytical applications in a single step, without leaving the R environment.

    You can directly integrate R scripts used in exploratory analysis into application tasks. You can also immediately invoke R scripts in production to drastically reduce time to market by eliminating porting and enabling instantaneous updates of changes to application code.

  • Executing R scripts from SQL enables integration of R script results with Oracle Business Intelligence Enterprise Edition (OBIEE), Oracle BI Publisher, and other SQL-enabled tools for structured data, R objects, and images.

6.1.2 APIs for Embedded R Execution

Oracle Machine Learning for R provides R and SQL application programming interfaces for embedded R execution.

The following table lists the R functions and the equivalent SQL functions and procedures for embedded R execution and OML4R script repository management. The function f refers to a named R function or an R function defined in a script in the OML4R script repository.

Table 6-1 R and SQL APIs for Embedded R Execution

R API SQL API Description

ore.doEval

rqEval

Executes f with no automatic transfer of data.

ore.tableApply

rqTableEval

Executes f by passing all rows of the provided input ore.frame as the first argument of f. Provides the first argument of f as a data.frame.

ore.groupApply

rqGroupEval

This function must be explicitly defined by the user.

Executes f by partitioning data according to the values of a grouping column. Provides each data partition as a data.frame in the first argument of f. Supports parallel execution of each f invocation in the pool of database server-side R engines.

ore.rowApply

rqRowEval

Executes f by passing a specified number of rows (a chunk) of the provided input ore.frame. Provides each chunk as a data.frame in the first argument of f. Supports parallel execution of each f invocation in the pool of database server-side R engines.

ore.indexApply

No equivalent.

Executes f with no automatic transfer of data but provides the index of the invocation, 1 through n, where n is the number of times to invoke the function. Supports parallel execution of each f invocation in the pool of R engines running on the database server.

ore.grant rqGrant Grants read privilege access to a datastore or script.
ore.revoke rqRevoke Revokes read privilege access to a datastore or script.

ore.scriptCreate

sys.rqScriptCreate

Adds the provided R function into the OML4R script repository with the provided name.

ore.scriptDrop

sys.rqScriptDrop

Removes the named R function from the OML4R script repository.

ore.scriptList ALL_RQ_SCRIPTS

USER_RQ_SCRIPTS

Lists information about scripts.
ore.scriptLoad No equivalent. Loads the R function of a script into the R environment.

6.1.3 Security for Scripts

Because R scripts allow access to the database server, the creation of scripts must be controlled.

The RQADMIN role is a collection of Oracle Database privileges that a user must have to create scripts and store them in the Oracle Machine Learning for R script repository or drop scripts from the repository.

The installation of OML4R on the database server creates the RQADMIN role. The role must be explicitly granted to a user. To grant RQADMIN to a user, start SQL*Plus as sysdba and enter a GRANT statement such as the following, which grants the role to the user OML_USER:

GRANT RQADMIN to OML_USER

Note:

You should grant RQADMIN only to those users who need it.

When creating a script, the owner can use the global argument to specify whether the script is public or private. If global = TRUE, then all users have read privilege access to the script. If global = FALSE, which is the default, then the owner can share the script by granting access to other users. The owner can revoke the access at any time.

6.1.4 Support for Parallel Execution

Some of the Oracle Machine Learning for R embedded R execution functions support the use of parallel execution in the database.

The ore.groupApply, ore.rowApply, rq.groupEval, and rq.rowEval functions support data-parallel execution and the ore.indexApply function supports task-parallel execution. This parallel execution capability enables a script to take advantage of high-performance computing hardware such as an Oracle Exadata Database Machine.

The parallel argument of the ore.groupApply, ore.rowApply, and ore.indexApply functions specifies the degree of parallelism to use in the embedded R execution. The value of the argument can be one of the following:

  • A positive integer greater than or equal to 2 for a specific degree of parallelism

  • FALSE or 1 for no parallelism

  • TRUE for the default parallelism of the data argument

  • NULL for the database default for the operation

The default value of the argument is the value of the global option ore.parallel or FALSE if ore.parallel is not set.

A user-defined R function invoked using ore.doEval or ore.tableApply is not executed in parallel. The function executes in a single R engine.

For the rq.groupEval, and rq.rowEval functions, the degree of parallelism is specified by a PARALLEL hint in the input cursor argument.

In data-parallel execution for the ore.groupApply and rq.groupEval functions, one or more R engines perform the same R function, or task, on different partitions of data. This functionality enables the building of large numbers of models, for example building tens or hundreds of thousands of predictive models, one model per customer.

In data-parallel execution for the ore.rowApply and rq.rowEval functions, one or more R engines perform the same R function on disjoint chunks of data. This functionality enables scalable model scoring and predictions on large data sets.

In task-parallel execution for the ore.indexApply function, one or more R engines perform the same or different calculations, or task. A number, associated with the index of the execution, is provided to the function. This functionality is valuable in a variety of operations, such as in performing simulations.

Oracle Database handles the management and control of potentially multiple R engines at the database server, automatically partitioning and passing data to R engines executing in parallel. It ensures that all of the R function executions for all of the partitions complete; if not, the OML4R function returns an error. The result from the execution of each user-defined embedded R function is gathered in an ore.list. This list remains in the database until the user requires the result.

Embedded R execution also allows for data-parallel execution of user-defined R functions that may use functions from an open source R package from The Comprehensive R Archive Network (CRAN) or other third-party R package. However, third-party packages do not leverage in-database parallelism and are subject to the parallelism constraints of R. Third-party packages can benefit from the data-parallel and task-parallel execution supported in embedded R execution.

6.1.5 Install a Third-Party Package for Use in Embedded R Execution

Embedded R execution allows the use of CRAN or other third-party packages in user-defined R functions executed on the Oracle Database server.

To use a third-party package in embedded R execution, the package must be installed on the database server. If you are going to use the package from the R interface for embedded R execution, then the package must also be installed on the client, as well. To avoid incompatibilities, you must install the same version of the package on both the client and server machines.

An Oracle Database Administrator (DBA) can install a package on a database server so that it can be used by embedded R execution functions or by any R user. The DBA can install a package on a single database server or on multiple database servers.

A DBA would typically do the following:

  1. Download and install the package from CRAN. Downloading a package from CRAN requires an Internet connection.
  2. In an Oracle Machine Learning for R session running on the server, load the package. Verify that the package is installed correctly by using a function in the package.

To install a package on a single database server, do one of the following:

  • In an OML4R session running on the server, invoke the install.packages function, as shown in Example 6-1. The function downloads the package and installs dependencies automatically.

  • Download the package source from CRAN using wget. If the package depends on any packages that are not in the R distribution in use, then download those packages, also.

    From the operating system command line, use the ORE CMD INSTALL command to install the package or packages in the same location as the OML4R packages, which is $ORACLE_HOME/R/library. See Example 6-2.

To install a package, and any dependent packages, on multiple database servers, such as those in an Oracle Real Application Clusters (Oracle RAC) or a multinode Oracle Exadata Database Machine environment, use the Exadata Distributed Command Line Interface (DCLI) utility, as shown in Example 6-3.

To verify that the package is installed correctly, load the package and use a function in the package, as shown in Example 6-4.

Example 6-1 Installing a Package for a Single Database in an OML4R Session

This example invokes the install.packages function to download the C50 package from CRAN and to install it. The C50 package contains functions for creating C5.0 decision trees and rule-based models for pattern recognition.

The output this example, which is not shown, is almost identical to the output of the ORE CMD INSTALL command in Example 6-2.

install.packages("c50")

Example 6-2 Installing a Package for a Single Database from the Command Line

This example demonstrates downloading the C50 package from CRAN and installing it with ORE CMD INSTALL from a Linux command line.

wget http://cran.r-project.org/src/contrib/C50_0.1.0-19.tar.gz
ORE CMD INSTALL C50_0.1.0-19.tar.gz

Listing for This Example

$ wget http://cran.r-project.org/src/contrib/C50_0.1.0-19.tar.gz
# The output of wget is not shown.
$ ORE CMD INSTALL C50_0.1.0-19.tar.gz
* installing to library '/example/dbhome_1/R/library'
* installing *source* package 'C50' ...
** package 'C50' successfully unpacked and MD5 sums checked
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c attwinnow.c -o attwinnow.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c classify.c -o classify.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c confmat.c -o confmat.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c construct.c -o construct.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c contin.c -o contin.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c discr.c -o discr.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c formrules.c -o formrules.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c formtree.c -o formtree.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c getdata.c -o getdata.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c getnames.c -o getnames.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c global.c -o global.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c hash.c -o hash.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c hooks.c -o hooks.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c implicitatt.c -o implicitatt.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c info.c -o info.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c mcost.c -o mcost.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c modelfiles.c -o modelfiles.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c p-thresh.c -o p-thresh.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c prune.c -o prune.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c rc50.c -o rc50.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c redefine.c -o redefine.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c rsample.c -o rsample.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c rulebasedmodels.c -o rulebasedmodels.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c rules.c -o rules.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c ruletree.c -o ruletree.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c siftrules.c -o siftrules.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c sort.c -o sort.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c strbuf.c -o strbuf.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c subset.c -o subset.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c top.c -o top.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c trees.c -o trees.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c 
update.c -o update.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c utility.c -o utility.o
gcc -m64 -std=gnu99 -I/usr/include/R -DNDEBUG -DNDEBUG -I/usr/local/include   -ffloat-store -g -fpic  -g -O2  -c xval.c -o xval.o
gcc -m64 -std=gnu99 -shared -L/usr/local/lib64 -o C50.so attwinnow.o classify.o confmat.o construct.o contin.o discr.o formrules.o formtree.o getdata.o getnames.o global.o hash.o hooks.o implicitatt.o info.o mcost.o modelfiles.o p-thresh.o prune.o rc50.o redefine.o rsample.o rulebasedmodels.o rules.o ruletree.o siftrules.o sort.o strbuf.o subset.o top.o trees.o update.o utility.o xval.o -L/usr/lib64/R/lib -lR
installing to /example/dbhome_1/R/library/C50/libs
** R
** data
** preparing package for lazy loading
** help
*** installing help indices
  converting help for package 'C50'
    finding HTML links ... done
    C5.0                                    html  
    C5.0Control                             html  
    churn                                   html  
    predict.C5.0                            html  
    summary.C5.0                            html  
    varImp.C5.0                             html  
** building package indices
** testing if installed package can be loaded
* DONE (C50)

Example 6-3 Installing a Package Using DCLI

This example shows the DLCI command for installing the C50 package. The dcli -g flag designates a file containing a list of nodes to install on, and the -l flag specifies the user ID to use when executing the commands.

dcli -g nodes -l oracle R CMD INSTALL C50_0.1.0-19.tar.gz

Example 6-4 Using a C50 Package Function

This example shows starting R, connecting to OML4R on the server, loading the C50 package, and using a function in the package. The example starts R by executing the ORE command from the Linux command line. The example connects to OML4R and then loads the C50 package. It invokes the demo function to look for example programs in the package. Because the package does not have examples, this example then gets help for the C5.0 function. The example invokes example code from the help.

ORE

library(ORE)
ore.connect(user = "OML_USER", sid = "orcl", host = "myhost", 
            password = "oml_userStrongPassword", port = 1521, all=TRUE)

library(C50)
demo(package = "C50")
?C5.0
data(churn)
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn)
treeModel

Listing for This Example

$ ORE

R> library(ORE)
Loading required package: OREbase
 
Attaching package: 'OREbase'
 
The following objects are masked from 'package:base':
 
    cbind, data.frame, eval, interaction, order, paste, pmax, pmin,
    rbind, table
 
Loading required package: OREembed
Loading required package: OREstats
Loading required package: MASS
Loading required package: OREgraphics
Loading required package: OREeda
Loading required package: OREmodels
Loading required package: OREdm
Loading required package: lattice
Loading required package: OREpredict
Loading required package: ORExml
 
> ore.connect(user = "OML_USER", sid = "orcl", host = "myhost", 
+             password = "oml_userStrongPassword", port = 1521, all=TRUE)
Loading required package: ROracle
Loading required package: DBI
 
R> library(C50)
R> demo(package = "C50")
no demos found
R> ?C5.0     # Output not shown.
R> data(churn)
R> treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn)
R> treeModel
Call:
C5.0.default(x = churnTrain[, -20], y = churnTrain$churn)
 
Classification Tree
Number of samples: 3333 
Number of predictors: 19 
 
Tree size: 27 
 
Non-standard options: attempt to group attributes

6.2 R Interface for Embedded R Execution

Oracle Machine Learning for R provides functions that invoke R scripts that run in one or more R engines that are embedded in the Oracle database.

Other functions create and store an R function as a script in the OML4R script repository, grant or revoke read access to a script, list the available scripts, load a script function into the R environment, or drop a script from the repository. This section describes these functions in the following topics:

6.2.1 Arguments for Functions that Run Scripts

The Oracle Machine Learning for R embedded R execution functions ore.doEval, ore.tableApply, ore.groupApply, ore.rowApply, and ore.indexApply have arguments that are common to some or all of the functions.

Some of the functions also have an argument that is unique to the function. The following topics describe these arguments:

See Also:

  • For function signatures and more details about function arguments, see the online help displayed by invoking help(ore.doEval)

  • For examples of the use of the arguments, see Using the ore.doEval Function and the other topics on using the embedded R execution functions

6.2.1.1 Input Function to Execute

The embedded R execution functions all require an R function to apply during the execution of the script.

You specify the input function with one of the following mutually exclusive arguments:

  • FUN

  • FUN.NAME (and optional FUN.OWNER)

The FUN argument takes a function object as a directly specified function or as one assigned to an R variable. Only a user with the RQADMIN role can use the FUN argument when invoking an embedded R function.

The FUN.NAME argument specifies a script that is stored in the OML4R R script repository. A stored script contains the function to apply when the script runs. Any OML4R user can use the FUN.NAME argument when invoking an embedded R function.

The optional argument FUN.OWNER specifies the owner of a script in the R script repository. The owner is the user who created the script. Use this argument only with the FUN.NAME argument. When FUN.NAME is a private script to which you have been granted read privilege access, use FUN.OWNNER to specify the owner of the private script.

The RQSYS schema is the owner of public scripts and the predefined OML4R scripts. For a list of the predefined scripts, invoke help(“ore.doEval”) and see the description of the FUN.NAME argument. If FUN.OWNNER is not specified or is NULL, then OML4R looks for the owner in the following order: user of the current session, RQSYS. If the owner of the script is not current user or RQSYS, then an error occurs.

Note:

The OML4R functions in the OREmodels package, ore.glm, ore.lm, ore.neural, and ore.randomForest, use the embedded R execution framework internally and cannot be used in embedded R execution functions.

6.2.1.2 Optional and Control Arguments

All of the embedded R execution functions take optional arguments, which can be named or not.

Oracle Machine Learning for R passes user-defined optional arguments to the input function. You can pass any number of optional arguments to the input function, including complex R objects such as models.

Arguments that start with ore. are special control arguments. OML4R does not pass them to the input function, but instead uses them to control what happens before or after the execution of that function. The following control arguments are supported:

  • ore.connect controls whether to automatically connect to OML4R inside the embedded R execution function. This is equivalent to doing an ore.connect call with the same credentials as the client session. The default value is FALSE.

    If an automatic connection is enabled, the following functionality occurs:

    • The embedded R script is connected to the database.

    • The connection has the same credentials as the session that invokes the embedded R SQL function.

    • The script runs in an autonomous transaction.

    • ROracle queries can work with the automatic connection.

    • OML4R transparency layer functionality is enabled in the embedded script.

  • ore.drop controls the input data. If the option value is TRUE, a one column data.frame is converted to a vector. The default value is TRUE.

  • ore.envAsEmptyenv controls whether an environment referenced in an object is replaced with an empty environment during serialization. Some types of input parameters and returned objects, such as list and formula, are serialized before being saved to the database. If the control argument value is TRUE, then the referenced environment in the object is replaced with an empty environment whose parent is .GlobalEnv and the objects in the original referenced environment are not serialized. In some cases, this can significantly reduce the size of serialized objects. If the control argument value is FALSE, then all of the objects in the referenced environment are serialized and can be unserialized and recovered later. The default value is regulated by the global option ore.envAsEmptyenv.

  • ore.na.omit controls the handling of missing values in the input data. If you specify ore.na.omit = TRUE, then rows or vector elements, depending on the ore.drop setting, that contain missing values are removed from the input data. If all of the rows in a chunk contain missing values, then the input data for that chunk will be an empty data.frame or vector. The default value is FALSE.

  • ore.graphics controls whether to start a graphical driver and look for images. The default value is TRUE.

  • ore.png.* specifies additional arguments for the png graphics driver if ore.graphics is TRUE. The naming convention for these arguments is to add an ore.png. prefix to the arguments of the png function. For example, if ore.png.height is supplied, argument height is passed to the png function. If not set, the standard default values for the png function are used.

See Also:

For more details about control arguments, see the online help displayed by invoking help(ore.doEval)

6.2.1.3 Structure of Return Value

Another argument that applies to all of the embedded R execution functions is FUN.VALUE.

If the FUN.VALUE argument is NULL, then the ore.doEval and ore.tableApply function can return a serialized R object as an ore.object class object, and the ore.groupApply, ore.indexApply, and ore.rowApply functions return an ore.list object. However, if you specify a data.frame or an ore.frame with the FUN.VALUE argument, then the function returns an ore.frame that has the structure of the specified data.frame or ore.frame object.

To specify that the corresponding output column of an ore.frame have a CLOB or BLOB database data type, you can apply the attribute ora.type to a column of a FUN.VALUE data.frame. For an example of using ora.type, see Example 6-11.

6.2.1.4 Input Data

The ore.doEval and ore.indexApply functions do not automatically receive any data from the database.

They simply execute the function specified by the FUN or FUN.NAME argument. Any data needed by the input function is either generated within that function or explicitly retrieved from a data source such as Oracle Database, other databases, or flat files. The input function can load data from a file or a table using the ore.pull function or other transparency layer function.

The ore.tableApply, ore.groupApply, and ore.rowApply functions require a database table as input data. The table is represented by an ore.frame. You supply that data with an ore.frame object that you specify with the X argument, which is the first argument to the embedded R execution function. The embedded R execution function passes the ore.frame object to the user-defined input function as the first argument to that function.

Note:

The data represented by the ore.frame object passed to the user-defined R function is copied from Oracle Database to the database server R engine. The R memory limitations apply. If your database server machine has 32 GB RAM and your data table is 64 GB, then Oracle R Enterprise cannot load the data into the R engine memory.

6.2.1.5 Parallel Execution

The parallel argument specifies the degree of parallelism to use in the embedded R execution of the input function.

The parallel argument is accepted by the ore.groupApply, ore.indexApply, and ore.rowApply functions.

See Support for Parallel Execution.

6.2.1.6 Unique Arguments

The ore.groupApply, ore.indexApply, and ore.rowApply functions each take an argument that is unique to the function.

The ore.groupApply function takes the INDEX argument, which specifies the name of a column by which the rows of the input data are partitioned for processing by the input function.

The ore.indexApply function takes the times argument, which specifies the number of times to execute the input function.

The ore.rowApply function tales the rows argument, which specifies the number of rows to pass to each invocation of the input function.

6.2.2 Manage Scripts in R

Embedded R execution functions can invoke R functions that are stored as scripts in the OML4R script repository. You can use the R functions described in this topic to create and manage scripts.

The embedded R execution functions can take a FUN.NAME argument, which specifies the name of a script in the OML4R script repository. Scripts in the R script repository are also available through the SQL API for embedded R execution.

The R functions for managing scripts are the following:

  • ore.grant

  • ore.revoke

  • ore.scriptCreate

  • ore.scriptList

  • ore.scriptLoad

  • ore.scriptDrop

These functions are described in the following sections:

For an example that uses these functions, see Example 6-5.

Adding a Script

To add an R function as a script in the OML4R script repository, use the ore.createScript function. To evoke this function, you must have the RQADMIN role. The ore.createScript function has the following syntax:

ore.scriptCreate(name, FUN, global, overwrite)

The arguments are the following:

Argument Description
name A name for the script in the OML4R script repository.
fun An R function.
global A logical value that indicates whether the script is public (global) or private. FALSE (the default) specifies that the script is not public and is visible only to the owner or to users to whom the owner has granted read privilege access; TRUE specifies that the script is public and therefore visible to all users.
overwrite A logical value that indicates whether to replace the R function of the script with the function specified in by the fun argument. TRUE specifies replacing the function, if it exists; FALSE (the default) specifies that the existing contents cannot be replaced.

If overwrite = FALSE, an error condition occurs if a script by the same name already exists in the OML4R script repository; otherwise, ore.scriptCreate returns NULL.

Granting or Revoking Read Access to a Script

The creator of a script can use the ore.grant function to grant read access privilege to the script and the ore.revoke function to revoke that access. Those functions have the following syntax:

ore.grant(name, type = "rqscript", user)
ore.revoke(name, type = "rqscript", user)

The arguments are the following:

Argument Description
name The name of a script in the OML4R script repository.
type For a script, the type is rqscript.
user The user to whom to grant or revoke read privilege access.

The name and type arguments are required. If argument user is not specified, then read privilege access is granted to or revoked from all users.

An error occurs when one of the following is true:

  • The named script is not in the OML4R script repository.

  • The type argument is not specified.

  • The user is not found.

  • The read privilege has already been granted to or revoked from the user.

  • The named script is public.

Listing the Available Scripts

To list the scripts available to you, use ore.scriptList. You can list scripts by name, by a pattern, or by type. If you have the RQADMIN role, you can list system scripts, as well. The function has the following syntax:

ore.scriptList(name, pattern, type)

The arguments are the following:

Argument Description
name The name of a script in the OML4R script repository. Cannot be used when argument pattern is specified.
pattern A regular expression pattern. Scripts that match the pattern are listed. Cannot be used when argument name is specified.
type The type of the script, which can be one of the following:
  • user, which lists scripts owned by the current user

  • global, which lists public scripts, which are visible to all users

  • grant, which lists the scripts to which the current user has granted read access to others

  • granted, which lists the scripts to which the current user has been granted read access by another user

  • all, which lists all of the user, public, and granted scripts

The ore.scriptList function returns a data.frame that contains the names of the scripts in the OML4R script repository and the function in the script.

Loading a Script into an R Environment

To load the R function of a script into an R environment, use ore.scriptLoad, which has the following syntax:

ore.scriptLoad(name, owner, newname, envir)

The arguments are the following:

Argument Description
name The name of a script in the OML4R script repository.
owner The owner of the script.
newname A new function name in which to load the script.
envir The R environment in which to load the script.

Specifying the owner of a script is useful when access to the script has been granted to the user who is invoking ore.scriptLoad.

Specifying a new function name is useful when the name of the script in the OML4R script repository is not a valid R function name.

An error occurs when one of the following is true:

  • The script is not in the OML4R script repository.

  • The current user does not have read access to the script.

  • The function specified by the name argument is not a valid R function name.

  • The newname argument is not a valid R function name.

Dropping a Script

To remove a script from the OML4R script repository, use the ore.scriptDrop function. To invoke this function, you must have the RQADMIN role. The ore.scriptDrop function has the following syntax:

ore.scriptDrop(name, global, silent)

The arguments are the following:

Argument Description
name A name for the script in the OML4R script repository.
global A logical value that indicates whether the script is global (public) or private. TRUE specifies dropping a global script; FALSE (the default) specifies dropping a script owned by the current user.
silent A logical value that indicates whether to display an error message if ore.scriptDrop encounters an error condition. TRUE specifies the display of error messages; FALSE (the default) specifies no display.

An error condition occurs when one of the following is true:

  • The script is not in the OML4R script repository.

  • If global = TRUE, the script is a private script.

  • If global = FALSE, the script is a public script.

If successful, ore.scriptDrop returns NULL.

Example 6-5 Using the R Script Management Functions

# Create an ore.frame object from the data.frame for the iris data set.
IRIS <- ore.push(iris)

# Create a private R script for the current user.
ore.scriptCreate("myRandomRedDots", function(divisor = 100){
                 id <- 1:10
                 plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
                 data.frame(id = id, val = id / divisor)
                 })

# Create another private R script.
ore.scriptCreate("MYLM",
                 function(data, formula, ...) lm(formula, data, ...))

# Create a public script, available to any user.
ore.scriptCreate("GLBGLM",
                 function(data, formula, ...) 
                 glm(formula = formula, data = data, ...),
                 global = TRUE)

# List only my private scripts.
ore.scriptList()                

# List my private scripts and the public scripts. 
ore.scriptList(type = "all")    

# List my private scripts that have the specified pattern.
ore.scriptList(pattern = "MY")  

# Grant read access to a private script to all users.
ore.grant("MYLM", type = "rqscript")

# Grant read access to a private script to a specific user.
ore.grant("myRandomRedDots", user = "SCOTT",  type = "rqscript")

# List the granted scripts.
ore.scriptList(type = "grant")

# Use the MYLM script in an embedded R execution function.
ore.tableApply(IRIS[1:4], FUN.NAME = "MYLM",
               formula = Sepal.Length ~ .)
# Use the GLBGLM script in an embedded R execution function.
ore.tableApply(IRIS[1:4], FUN.NAME = "GLBGLM",
               formula = Sepal.Length ~ .)

# Load an R script to an R function object
ore.scriptLoad(name = "MYLM")

# Invoke the function.
MYLM(iris, formula = Sepal.Length ~ .)

# Load another R script to an R function object
ore.scriptLoad(name = "GLBGLM", newname = "MYGLM")

# Invoke the function.
MYGLM(iris, formula = Sepal.Length ~ .)

# Drop some scripts.
ore.scriptDrop("MYLM")
ore.scriptDrop("GLBGLM", global = TRUE)

# List all scripts.
ore.scriptList(type = "all")

Listing for This Example

R> # Create an ore.frame object from the data.frame for the iris data set.
R> IRIS <- ore.push(iris)
R> 
R> # Create a private R script for the current user.
R> ore.scriptCreate("myRandomRedDots", function(divisor = 100){
+                   id <- 1:10
+                   plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
+                   data.frame(id = id, val = id / divisor)
+                   })
R> 
R> # Create another private R script.
R> ore.scriptCreate("MYLM",
+                   function(data, formula, ...) lm(formula, data, ...))
R> 
R> # Create a public script, available to any user.
R> ore.scriptCreate("GLBGLM",
+                   function(data, formula, ...) 
+                   glm(formula = formula, data = data, ...),
+                   global = TRUE)
R> 
R> # List only my private scripts.
R> ore.scriptList()
             NAME      SCRIPT
1            MYLM      function (data, formula, ...) \nlm(formula, data, ...)
2 myRandomRedDots      function (divisor = 100) \n{\n    id & lt\n    -1:10\n
                       plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)\n
                       data.frame(id = id, val = id/divisor)\n}
R> 
R> # List my private scripts and the public scripts. 
R> ore.scriptList(type = "all")    
   OWNER              NAME    SCRIPT
1  RQSYS            GLBGLM    function (data, formula, ...) \nglm(formula = formula, data = data, ...)
2 OML_USER            MYLM    function (data, formula, ...) \nlm(formula, data, ...)
3 OML_USER myRandomRedDots    function (divisor = 100) \n{\n    id & lt\n    -1:10\n
                              plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2)\n
                              data.frame(id = id, val = id/divisor)\n}
R> 
R> # List my private scripts that have the specified pattern.
R> ore.scriptList(pattern = "MY")
  NAME  SCRIPT
1 MYLM  function (data, formula, ...) \nlm(formula, data, ...)
R> 
R> # Grant read access to a private script to all users.
R> ore.grant("MYLM", type = "rqscript")
R> 
R> # Grant read access to a private script to a specific user.
R> ore.grant("myRandomRedDots", user = "SCOTT",  type = "rqscript")
R> 
R> # List the granted scripts.
R> ore.scriptList(type = "grant")
             NAME GRANTEE
1            MYLM  PUBLIC
2 myRandomRedDots   SCOTT
R> 
R> # Use the MYLM script in an embedded R execution function.
R> ore.tableApply(IRIS[1:4], FUN.NAME = "MYLM",
+                 formula = Sepal.Length ~ .)

Call:
lm(formula = formula, data = data)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      1.8560        0.6508        0.7091       -0.5565 
R> 
R> # Use the GLBGLM script in an embedded R execution function.
R> ore.tableApply(IRIS[1:4], FUN.NAME = "GLBGLM",
+                formula = Sepal.Length ~ .) 

Call:  glm(formula = formula, data = data)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      1.8560        0.6508        0.7091       -0.5565  

Degrees of Freedom: 149 Total (i.e. Null);  146 Residual
Null Deviance:      102.2 
Residual Deviance: 14.45        AIC: 84.64
R> 
R> # Load an R script to an R function object
R> ore.scriptLoad(name="MYLM")
R> 
R> # Invoke the function.
R> MYLM(iris, formula = Sepal.Length ~ .)
R> 
R> # Load another R script to an R function object
R> ore.scriptLoad(name = "GLBGLM", newname = "MYGLM")
R> 
R> # Invoke the function.
R> MYGLM(iris, formula = Sepal.Length ~ .)
R> 
R> # Drop some scripts.
R> ore.scriptDrop("MYLM")
R> ore.scriptDrop("GLBGLM", global = TRUE)
R> 
R> # List all scripts.
R> ore.scriptList(type = "all")
   OWNER               NAME    SCRIPT
  OML_USER  myRandomRedDots    function (divisor = 100) \n{\n    id & lt\n    -1:10\n
                               plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 
                               2)\n data.frame(id = id, val = id/divisor)\n}

See Also:

6.2.3 Use the ore.doEval Function

The ore.doEval function executes the specified input function using data that is generated by the input function.

It returns an ore.frame object or a serialized R object as an ore.object object.

The syntax of the ore.doEval function is the following:

ore.doEval(FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, FUN.OWNER = NULL)

See Also:

Arguments for Functions that Run Scripts for descriptions of the arguments to function ore.doEval

Example 6-6 Using the ore.doEval Function

In this example, RandomRedDots gets a function that has an argument and that returns a data.frame object that has two columns and that plots 100 random normal values. The example then invokes ore.doEval function and passes it the RandomRedDots function object. The image is displayed at the client, but it is generated by the database server R engine that executed the RandomRedDots function.

RandomRedDots <- function(divisor = 100){
  id<- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
  data.frame(id=id, val=id / divisor)
}
ore.doEval(RandomRedDots)

Listing for This Example

R> RandomRedDots <- function(divisor = 100){
+   id<- 1:10
+   plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
+   data.frame(id=id, val=id / divisor)
+ }
R> ore.doEval(RandomRedDots)
   id  val
1   1 0.01
2   2 0.02
3   3 0.03
4   4 0.04
5   5 0.05
6   6 0.06
7   7 0.07
8   8 0.08
9   9 0.09
10 10 0.10

Figure 6-1 Display of Random Red Dots

Description of Figure 6-1 follows
Description of "Figure 6-1 Display of Random Red Dots"

Example 6-7 Using the ore.doEval Function with an Optional Argument

You can provide arguments to the input function as optional arguments to the doEval function. This example invokes the doEval function with an optional argument that overrides the divisor argument of the RandomRedDots function.

ore.doEval(RandomRedDots, divisor = 50)

Listing for This Example

R> ore.doEval(RandomRedDots, divisor = 50)
   id  val
1   1 0.02
2   2 0.04
3   3 0.06
4   4 0.08
5   5 0.10
6   6 0.12
7   7 0.14
8   8 0.16
9   9 0.18
10 10 0.20
# The graph displayed by the plot function is not shown.

Example 6-8 Using the ore.doEval Function with the FUN.NAME Argument

If the input function is stored in the OML4R script repository, then you can invoke the ore.doEval function with the FUN.NAME argument. This example first invokes ore.scriptDrop to ensure that the script repository does not contain a script with the name myRandomRedDots. The example adds the RandomRedDots function from Example 6-6 to the repository under the name myRandomRedDots. This example invokes the ore.doEval function and specifies myRandomRedDots. The result is assigned to the variable res.

The return value of the RandomRedDots function is a data.frame but in this example the ore.doEval function returns an ore.object object. To get back the data.frame object, the example invokes ore.pull to pull the result to the client R session.

ore.scriptDrop("myRandomRedDots")
ore.scriptCreate("myRandomRedDots", RandomRedDots)
res <- ore.doEval(FUN.NAME = "myRandomRedDots", divisor = 50)
class(res)
res.local <- ore.pull(res)
class(res.local)

Listing for This Example

R> ore.scriptDrop("myRandomRedDots")
R> ore.scriptCreate("myRandomRedDots", RandomRedDots)
R> res <- ore.doEval(FUN.NAME = "myRandomRedDots", divisor = 50)
R> class(res)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> res.local <- ore.pull(res)
R> class(res.local)
[1] "data.frame"

Example 6-9 Using the ore.doEval Function with the FUN.VALUE Argument

To have the doEval function return an ore.frame object instead of an ore.object, use the argument FUN.VALUE to specify the structure of the result, as shown in this example.

res.of <- ore.doEval(FUN.NAME="myRandomRedDots", divisor = 50,
                     FUN.VALUE= data.frame(id = 1, val = 1))
class(res.of)
Listing for Example 6-9
R> res.of <- ore.doEval(FUN.NAME="myRandomRedDots", divisor = 50,
+                        FUN.VALUE= data.frame(id = 1, val = 1))
R> class(res.of)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"

Example 6-10 Using the doEval Function with the ore.connect Argument

This example demonstrates using the special optional argument ore.connect to connect to the database in the embedded R function, which enables the use of objects stored in a datastore. The example creates the RandomRedDots2 function object, which is the same as the RandomRedDots function from Example 6-6 except the RandomRedDots2 function has an argument that takes the name of a datastore. The example creates the myVar variable and saves it in the datastore named datastore_1. The example then invokes the doEval function and passes it the name of the datastore and passes the ore.connect control argument set to TRUE.

RandomRedDots2 <- function(divisor = 100, datastore.name = "myDatastore"){
  id <- 1:10
  plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
  ore.load(datastore.name) # Contains the numeric variable myVar.
  data.frame(id = id, val = id / divisor, num = myVar)
}
myVar <- 5
ore.save(myVar, name = "datastore_1")
ore.doEval(RandomRedDots2, datastore.name = "datastore_1", ore.connect = TRUE)

Listing for This Example

R> RandomRedDots2 <- function(divisor = 100, datastore.name = "myDatastore"){
+   id <- 1:10
+   plot(1:100, rnorm(100), pch = 21, bg = "red", cex = 2 )
+   ore.load(datastore.name) # Contains the numeric variable myVar.
+   data.frame(id = id, val = id / divisor, num = myVar)
+ }
R> ore.doEval(RandomRedDots2, datastore.name = "datastore_1", ore.connect = TRUE)
   id  val num
1   1 0.01   5
2   2 0.02   5
3   3 0.03   5
4   4 0.04   5
5   5 0.05   5
6   6 0.06   5
7   7 0.07   5
8   8 0.08   5
9   9 0.09   5
10 10 0.10   5
# The graph displayed by the plot function is not shown.

Example 6-11 Using the ora.type Attribute

This example demonstrates using the ora.type attribute to specify database data types of CLOB and BLOB for columns in the data.frame object specified by the FUN.VALUE argument.

eval1 <- ore.doEval(function() "Hello, world")
eval2 <-
  ore.doEval(function()
	           data.frame(x = "Hello, world", stringsAsFactors = FALSE))
eval3 <-
  ore.doEval(function()
	           data.frame(x = "Hello, world", stringsAsFactors = FALSE),
	           FUN.VALUE =
	           data.frame(x = character(), stringsAsFactors = FALSE))
out.df <- data.frame(x = character(), y = raw(), stringsAsFactors = FALSE)
attr(out.df$x, "ora.type") <- "clob"
attr(out.df$y, "ora.type") <- "blob"
eval4 <- 
  ore.doEval(function() {
	           res <- data.frame(x = "Hello, world",stringsAsFactors = FALSE)
	           res$y[[1L]] <- charToRaw("Hello, world")
	           res},
	           FUN.VALUE = out.df)
eval1
class(eval1)  # ore.object
eval2
class(eval2)  # ore.object
eval3
class(eval3)  # ore.frame
eval4$x
rawToChar(ore.pull(eval4$y))

Listing for This Example

R> eval1 <- ore.doEval(function() "Hello, world")
R> eval2 <- 
+    ore.doEval(function()
+               data.frame(x = "Hello, world", stringsAsFactors = FALSE))
R> eval3 <-
+    ore.doEval(function()
+               data.frame(x = "Hello, world", stringsAsFactors = FALSE),
+               FUN.VALUE =
+               data.frame(x = character(), stringsAsFactors = FALSE))
R> out.df <- data.frame(x = character(), y = raw(), stringsAsFactors = FALSE)
R> attr(out.df$x, "ora.type") <- "clob"
R> attr(out.df$y, "ora.type") <- "blob"
R> eval4 <- 
+    ore.doEval(function() {
+               res <- data.frame(x = "Hello, world",stringsAsFactors = FALSE)
+               res$y[[1L]] <- charToRaw("Hello, world")
+               res},
+               FUN.VALUE = out.df)
R> eval1
[1] "Hello, world"
R> class(eval1)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> eval2
             x
1 Hello, world
R> class(eval2)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> eval3
             x
1 Hello, world
Warning message:
ORE object has no unique key - using random order 
R> class(eval3)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> eval4$x                              
[1] "Hello, world"
Warning message:
ORE object has no unique key - using random order 
R> rawToChar(ore.pull(eval4$y))
[1] "Hello, world"
Warning message:
ORE object has no unique key - using random order 

6.2.4 Use the ore.tableApply Function

The ore.tableApply function invokes an R script with an ore.frame as the input data.

The ore.tableApply function passes the ore.frame to the user-defined input function as the first argument to that function. The ore.tableApply function returns an ore.frame object or a serialized R object as an ore.object object.

The syntax of the ore.tableApply function is the following:

ore.tableApply(X, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, FUN.OWNER = NULL)

See Also:

Arguments for Functions that Run Scripts for descriptions of the arguments to function ore.tableApply

Example 6-12 Using the ore.tableApply Function

This example uses the ore.tableApply function to build a Naive Bayes model on the iris data set. The naiveBayes function is in the e1071 package, which must be installed on both the client and database server machine R engines. As the first argument to the ore.tableApply function, the ore.push(iris) invocation creates a temporary database table and an ore.frame that is a proxy for the table. The second argument is the input function, which has as an argument dat. The ore.tableApply function passes the ore.frame table proxy to the input function as the dat argument. The input function creates a model, which the ore.tableApply function returns as an ore.object object.

library(e1071)
nbmod <- ore.tableApply(
  ore.push(iris),
  function(dat) {
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    naiveBayes(Species ~ ., dat)
})
class(nbmod)
nbmod

Listing for This Example

R> nbmod <- ore.tableApply(
+   ore.push(iris),
+   function(dat) {
+     library(e1071)
+     dat$Species <- as.factor(dat$Species)
+     naiveBayes(Species ~ ., dat)
+ })
R> class(nbmod)
[1] "ore.object"
attr(,"package")
[1] "OREembed"
R> nbmod
 
Naive Bayes Classifier for Discrete Predictors
 
Call:
naiveBayes.default(x = X, y = Y, laplace = laplace)
 
A-priori probabilities:
Y
    setosa versicolor  virginica 
 0.3333333  0.3333333  0.3333333 
 
Conditional probabilities:
            Sepal.Length
Y             [,1]      [,2]
  setosa     5.006 0.3524897
  versicolor 5.936 0.5161711
  virginica  6.588 0.6358796
 
            Sepal.Width
Y             [,1]      [,2]
  setosa     3.428 0.3790644
  versicolor 2.770 0.3137983
  virginica  2.974 0.3224966
 
            Petal.Length
Y             [,1]      [,2]
  setosa     1.462 0.1736640
  versicolor 4.260 0.4699110
  virginica  5.552 0.5518947
 
            Petal.Width
Y             [,1]      [,2]
  setosa     0.246 0.1053856
  versicolor 1.326 0.1977527
  virginica  2.026 0.2746501

6.2.5 Use the ore.groupApply Function

The ore.groupApply function invokes an R script with an ore.frame as the input data.

The ore.groupApply function passes the ore.frame to the user-defined input function as the first argument to that function. The INDEX argument to the ore.groupApply function specifies the name of a column of the ore.frame by which Oracle Database partitions the rows for processing by the user-defined R function. The ore.groupApply function can use data-parallel execution, in which one or more R engines perform the same R function, or task, on different partitions of data.

The syntax of the ore.groupApply function is the following:

ore.groupApply(X, INDEX, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, FUN.OWNER = NULL,
               parallel = getOption("ore.parallel", NULL))

The ore.groupApply function returns an ore.list object or an ore.frame object.

Examples of the use of the ore.groupApply function are in the following topics:

6.2.5.1 Partition on a Single Column

This example uses the ore.groupApply function and partitions the data on a single column.

The example uses the C50 package, which has functions that build decision tree and rule-based models. The package also provides training and testing data sets. The example builds C5.0 models on the churnTrain training data set from the churn data set of the C50 package, with the goal of building one churn model on the data for each state. The example does the following:

  • Loads the C50 package and then the churn data set.

  • Uses the ore.create function to create the CHURN_TRAIN database table and its proxy ore.frame object from churnTrain, a data.frame object.

  • Specifies CHURN_TRAIN, the proxy ore.frame object, as the first argument to the ore.groupApply function and specifies the state column as the INDEX argument. The ore.groupApply function partitions the data on the state column and invokes the user-defined function on each partition.

  • Creates the variable modList, which gets the ore.list object returned by the ore.groupApply function. The ore.list object contains the results from the execution of the user-defined function on each partition of the data. In this case, it is one C5.0 model per state, with each model stored as an ore.object object.

  • Specifies the user-defined function. The first argument of the user-defined function receives one partition of the data, which in this case is all of the data associated with a single state.

    The user-defined function does the following:

    • Loads the C50 package so that it is available to the function when it executes in an R engine in the database.

    • Deletes the state column from the data.frame so that the column is not included in the model.

    • Converts the columns to factors because, although the ore.frame defined factors, when they are loaded to the user-defined function, factors appear as character vectors.

    • Builds a model for a state and returns it.

  • Uses the ore.pull function to retrieve the model from the database as the mod.MA variable and then invokes the summary function on it. The class of mod.MA is C5.0.

Example 6-13 Using the ore.groupApply Function

library(C50)
data("churn")
 
ore.create(churnTrain, "CHURN_TRAIN")

modList <- ore.groupApply(
  CHURN_TRAIN,
  INDEX=CHURN_TRAIN$state,
    function(dat) {
      library(C50)
      dat$state <- NULL
      dat$churn <- as.factor(dat$churn)
      dat$area_code <- as.factor(dat$area_code)
      dat$international_plan <- as.factor(dat$international_plan)
      dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
      C5.0(churn ~ ., data = dat, rules = TRUE)
    });
mod.MA <- ore.pull(modList$MA)
summary(mod.MA)

Listing for This Example

R> library(C50)
R> data(churn)
R> 
R> ore.create(churnTrain, "CHURN_TRAIN")
R>
R> modList <- ore.groupApply(
+   CHURN_TRAIN,
+   INDEX=CHURN_TRAIN$state,
+     function(dat) {
+       library(C50)
+       dat$state <- NULL
+       dat$churn <- as.factor(dat$churn)
+       dat$area_code <- as.factor(dat$area_code)
+       dat$international_plan <- as.factor(dat$international_plan)
+       dat$voice_mail_plan <- as.factor(dat$voice_mail_plan)
+       C5.0(churn ~ ., data = dat, rules = TRUE)
+     });
R> mod.MA <- ore.pull(modList$MA)
R> summary(mod.MA)
 
Call:
C5.0.formula(formula = churn ~ ., data = dat, rules = TRUE)
 
 
C5.0 [Release 2.07 GPL Edition]         Thu Feb 13 15:09:10 2014
-------------------------------
 
Class specified by attribute `outcome'
 
Read 65 cases (19 attributes) from undefined.data
 
Rules:
 
Rule 1: (52/1, lift 1.2)
        international_plan = no
        total_day_charge <= 43.04
        ->  class no  [0.963]
 
Rule 2: (5, lift 5.1)
        total_day_charge > 43.04
        ->  class yes  [0.857]
 
Rule 3: (6/1, lift 4.4)
        area_code in {area_code_408, area_code_415}
        international_plan = yes
        ->  class yes  [0.750]
 
Default class: no
 
 
Evaluation on training data (65 cases):
 
                Rules     
          ----------------
            No      Errors
 
             3    2( 3.1%)   <<
 
 
           (a)   (b)    <-classified as
          ----  ----
            53     1    (a): class no
             1    10    (b): class yes
 
 
        Attribute usage:
 
         89.23% international_plan
         87.69% total_day_charge
          9.23% area_code
 
 
Time: 0.0 secs
6.2.5.2 Partition on Multiple Columns

This example uses the ore.groupApply function and partitions the data on multiple columns.

The ore.groupApply function takes a single column or multiple columns as the INDEX argument. The following example uses data from the CHURN_TRAIN data set to build an rpart model that produces rules on the partitions of data specified, which are the voice_mail_plan and international_plan columns. The example uses the R table function to show the number of rows to expect in each partition.

The example invokes the ore.scriptDrop function to ensure that no script by the specified name exists in the OML4R script repository. It then uses the ore.scriptCreate function to define a script named my_rpartFunction and to store it in the repository. The stored script defines a function that takes a data source and a prefix to use for naming OML4R datastore objects. Each invocation of the function my_rpartFunction receives data from one of the partitions identified by the values in the voice_mail_plan and international_plan columns. Because the source partition columns are constants, the function sets them to NULL. It converts the character vectors to factors, builds a model to predict churn, and saves it in an appropriately named datastore. The function creates a list to return the specific partition column values, the distribution of churn values, and the model itself.

The example then loads the rpart library, sets the datastore prefix, and invokes ore.groupApply using the values from the voice_mail_plan and international_plan columns as the INDEX argument and my_rpartFunction as the value of the FUN.NAME argument to invoke the user-defined function stored in the script repository. The ore.groupApply function uses an optional argument to pass the datastorePrefix variable to the user-defined function. It uses the optional argument ore.connect to connect to the database when executing the user-defined function. The ore.groupApply function returns an ore.list object as the variable res.

The example displays the first entry in the list returned. It then invokes the ore.load function to load the model for the case where the customer has both the voice mail plan and the international plan.

Example 6-14 Using ore.groupApply for Partitioning Data on Multiple Columns

library(C50)
data(churn)
ore.drop("CHURN_TRAIN") 
ore.create(churnTrain, "CHURN_TRAIN")
 
table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)

options(width = 80)
head(CHURN_TRAIN, 3)

ore.scriptDrop("my_rpartFunction")
ore.scriptCreate("my_rpartFunction",
  function(dat, datastorePrefix) {
    library(rpart)
    vmp <- dat[1, "voice_mail_plan"]
    ip <- dat[1, "international_plan"]
    datastoreName <- paste(datastorePrefix, vmp, ip, sep = "_")
    dat$voice_mail_plan <- NULL
    dat$international_plan <- NULL
    dat$state <- as.factor(dat$state)
    dat$churn <- as.factor(dat$churn)
    dat$area_code <- as.factor(dat$area_code)
    mod <- rpart(churn ~ ., data = dat)
    ore.save(mod, name = datastoreName, overwrite = TRUE)
    list(voice_mail_plan = vmp,
        international_plan = ip,
        churn.table = table(dat$churn),
        rpart.model = mod)
  })

library(rpart)
datastorePrefix = "my.rpartModel"
 
res <- ore.groupApply(CHURN_TRAIN,
      INDEX = CHURN_TRAIN[, c("voice_mail_plan", "international_plan")],
      FUN.NAME = "my_rpartFunction",
      datastorePrefix = datastorePrefix,
      ore.connect = TRUE)
res[[1]]
ore.load(name=paste(datastorePrefix, "yes", "yes", sep = "_"))
mod

Listing for This Example

R> library(C50)
R> data(churn)
R> ore.drop("CHURN_TRAIN") 
R> ore.create(churnTrain, "CHURN_TRAIN")
R>  
R> table(CHURN_TRAIN$international_plan, CHURN_TRAIN$voice_mail_plan)
     
        no  yes
  no  2180  830
  yes  231   92
R>
R> options(width = 80)
R> head(CHURN_TRAIN, 3)
  state account_length     area_code international_plan voice_mail_plan
1    KS            128 area_code_415                 no             yes
2    OH            107 area_code_415                 no             yes
3    NJ            137 area_code_415                 no              no
  number_vmail_messages total_day_minutes total_day_calls total_day_charge
1                    25             265.1             110            45.07
2                    26             161.6             123            27.47
3                     0             243.4             114            41.38
  total_eve_minutes total_eve_calls total_eve_charge total_night_minutes
1             197.4              99            16.78               244.7
2             195.5             103            16.62               254.4
3             121.2             110            10.30               162.6
  total_night_calls total_night_charge total_intl_minutes total_intl_calls
1                91              11.01               10.0                3
2               103              11.45               13.7                3
3               104               7.32               12.2                5
  total_intl_charge number_customer_service_calls churn
1              2.70                             1    no
2              3.70                             1    no
3              3.29                             0    no
Warning messages:
1: ORE object has no unique key - using random order
2: ORE object has no unique key - using random order
R> 
R> ore.scriptDrop("my_rpartFunction")
R> ore.scriptCreate("my_rpartFunction",
+   function(dat, datastorePrefix) {
+     library(rpart)
+     vmp <- dat[1, "voice_mail_plan"]
+     ip <- dat[1, "international_plan"]
+     datastoreName <- paste(datastorePrefix, vmp, ip, sep = "_")
+     dat$voice_mail_plan <- NULL
+     dat$international_plan <- NULL
+     dat$state <- as.factor(dat$state)
+     dat$churn <- as.factor(dat$churn)
+     dat$area_code <- as.factor(dat$area_code)
+     mod <- rpart(churn ~ ., data = dat)
+     ore.save(mod, name = datastoreName, overwrite = TRUE)
+     list(voice_mail_plan = vmp,
+         international_plan = ip,
+         churn.table = table(dat$churn),
+         rpart.model = mod)
+   })
R> 
R> library(rpart)
R> datastorePrefix = "my.rpartModel"
R> 
R> res <- ore.groupApply(CHURN_TRAIN,
+       INDEX = CHURN_TRAIN[, c("voice_mail_plan", "international_plan")],
+       FUN.NAME = "my_rpartFunction",
+       datastorePrefix = datastorePrefix,
+       ore.connect = TRUE)
R> res[[1]]
$voice_mail_plan
[1] "no"
 
$international_plan
[1] "no"
 
$churn.table
 
  no  yes 
1878  302 
 
$rpart.model
n= 2180 
 
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 
 1) root 2180 302 no (0.86146789 0.13853211)  
   2) total_day_minutes< 263.55 2040 192 no (0.90588235 0.09411765)  
     4) number_customer_service_calls< 3.5 1876 108 no (0.94243070 0.05756930)  
       8) total_day_minutes< 223.25 1599  44 no (0.97248280 0.02751720) *
       9) total_day_minutes>=223.25 277  64 no (0.76895307 0.23104693)  
        18) total_eve_minutes< 242.35 210  18 no (0.91428571 0.08571429) *
        19) total_eve_minutes>=242.35 67  21 yes (0.31343284 0.68656716)  
          38) total_night_minutes< 174.2 17   4 no (0.76470588 0.23529412) *
          39) total_night_minutes>=174.2 50   8 yes (0.16000000 0.84000000) *
     5) number_customer_service_calls>=3.5 164  80 yes (0.48780488 0.51219512)  
      10) total_day_minutes>=160.2 95  22 no (0.76842105 0.23157895)  
        20) state=AL,AZ,CA,CO,DC,DE,FL,HI,KS,KY,MA,MD,ME,MI,NC,ND,NE,NH,NM,OK,OR,SC,TN,VA,VT,WY 56   2 no (0.96428571 0.03571429) *
        21) state=AK,AR,CT,GA,IA,ID,MN,MO,NJ,NV,NY,OH,RI,TX,UT,WA,WV 39  19 yes (0.48717949 0.51282051)  
          42) total_day_minutes>=182.3 21   5 no (0.76190476 0.23809524) *
          43) total_day_minutes< 182.3 18   3 yes (0.16666667 0.83333333) *
      11) total_day_minutes< 160.2 69   7 yes (0.10144928 0.89855072) *
   3) total_day_minutes>=263.55 140  30 yes (0.21428571 0.78571429)  
     6) total_eve_minutes< 167.3 29   7 no (0.75862069 0.24137931)  
      12) state=AK,AR,AZ,CO,CT,FL,HI,IN,KS,LA,MD,ND,NM,NY,OH,UT,WA,WV 21   0 no (1.00000000 0.00000000) *
      13) state=IA,MA,MN,PA,SD,TX,WI 8   1 yes (0.12500000 0.87500000) *
     7) total_eve_minutes>=167.3 111   8 yes (0.07207207 0.92792793) *
 
R> ore.load(name = paste(datastorePrefix, "yes", "yes", sep = "_"))
[1] "mod"
R> mod
n= 92 
 
node), split, n, loss, yval, (yprob)
      * denotes terminal node
 
1) root 92 36 no (0.60869565 0.39130435)  
  2) total_intl_minutes< 13.1 71 15 no (0.78873239 0.21126761)  
    4) total_intl_calls>=2.5 60  4 no (0.93333333 0.06666667)  
      8) state=AK,AR,AZ,CO,CT,DC,DE,FL,GA,HI,ID,IL,IN,KS,MD,MI,MO,MS,MT,NC,ND,NE,NH,NJ,OH,SC,SD,UT,VA,WA,WV,WY 53  0 no (1.00000000 0.00000000) *
      9) state=ME,NM,VT,WI 7  3 yes (0.42857143 0.57142857) *
    5) total_intl_calls< 2.5 11  0 yes (0.00000000 1.00000000) *
  3) total_intl_minutes>=13.1 21  0 yes (0.00000000 1.00000000) *

6.2.6 Use the ore.rowApply Function

The ore.rowApply function invokes an R script with an ore.frame as the input data.

The ore.rowApply function passes the ore.frame to the user-defined input function as the first argument to that function. The rows argument to the ore.rowApply function specifies the number of rows to pass to each invocation of the user-defined R function. The last chunk or rows may have fewer rows than the number specified. The ore.rowApply function can use data-parallel execution, in which one or more R engines perform the same R function, or task, on different partitions of data.

The syntax of the ore.rowApply function is the following:

ore.rowApply(X, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, rows = 1, 
             FUN.OWNER = NULL, parallel = getOption("ore.parallel", NULL))

The ore.rowApply function returns an ore.list object or an ore.frame object.

See Also:

Example 6-15 Using the ore.rowApply Function

This example uses the e1071 package, previously downloaded from CRAN. The example does the following:

  • Loads the package e1071.

  • Pushes the iris data set to the database as the IRIS temporary table and ore.frame object.

  • Creates the Naive Bayes model nbmod.

  • Creates a copy of IRIS as IRIS_PRED and adds the PRED column to IRIS_PRED to contain the predictions.

  • Invokes the ore.rowApply function, passing the IRIS ore.frame as the data source for user-defined R function and the user-defined R function itself. The user-defined function does the following:

    • Loads the package e1071 so that it is available to the R engine or engines that run in the database.

    • Converts the Species column to a factor because, although the ore.frame defined factors, when they are loaded to the user-defined function, factors appear as character vectors.

    • Invokes the predict method and returns the res object, which contains the predictions in the column added to the data set.

  • Pulls the model to the client R session.

  • Passes IRIS_PRED as the argument FUN.VALUE, which specifies the structure of the object that the ore.rowApply function returns.

  • Specifies the number of rows to pass to each invocation of the user-defined function.

  • Displays the class of res, and invokes the table function to display the Species column and the PRED column of the res object.

library(e1071)
IRIS <- ore.push(iris)
nbmod <- ore.tableApply(
  ore.push(iris),
  function(dat) {
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    naiveBayes(Species ~ ., dat)
  })
IRIS_PRED <- IRIS
IRIS_PRED$PRED <- "A"
res <- ore.rowApply(
  IRIS,
  function(dat, nbmod) {
    library(e1071)
    dat$Species <- as.factor(dat$Species)
    dat$PRED <- predict(nbmod, newdata = dat)
    dat
  },
  nbmod = ore.pull(nbmod),
  FUN.VALUE = IRIS_PRED,
  rows = 10)
class(res)
table(res$Species, res$PRED)

Listing for This Example

R> library(e1071)
R> IRIS <- ore.push(iris)      
R> nbmod <- ore.tableApply(
+    ore.push(iris),
+    function(dat) {
+      library(e1071)
+      dat$Species <- as.factor(dat$Species)
+      naiveBayes(Species ~ ., dat)
+    })
R> IRIS_PRED <- IRIS
R> IRIS_PRED$PRED <- "A"
R> res <- ore.rowApply(
+   IRIS ,
+   function(dat, nbmod) {
+     library(e1071)
+     dat$Species <- as.factor(dat$Species)
+     dat$PRED <- predict(nbmod, newdata = dat)
+     dat
+   },
+   nbmod = ore.pull(nbmod),
+   FUN.VALUE = IRIS_PRED,
+   rows = 10)
R> class(res)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> table(res$Species, res$PRED)
            
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47

This example uses the C50 package to score churn data (that is, to predict which customers are likely to churn) using C5.0 models. The example partitions the data by a number of rows. It scores the customers from the specified state in parallel. It uses datastores and saves functions to the OML4R script repository, which allows the functions to be used by the OML4R SQL API functions.

The example first loads C50 package and the data sets. It deletes the datastores with names containing myC5.0modelFL, if they exist. It invokes ore.drop to delete the CHURN_TEST table, if it exists, and then invokes ore.create to create the CHURN_TEST table from the churnTest data set.

The example next invokes ore.getLevels, which returns a list of the levels for each factor column. The invocation excludes the first column, which is state, because the levels for that column are not needed. Getting the levels first can ensure that all possible levels are provided during model building, even if some rows do not have values for some of the levels. The ore.delete invocation ensures that no datastore with the specified name exists and the ore.save invocation saves the xlevels object in the datastore named myXLevels.

The example creates a user-defined function, myC5.0FunctionForLevels, that generates a C5.0 model. The function uses the list of levels returned by function ore.getXlevels instead of computing the levels using the as.factor function. It uses the levels to convert the column type from character vector to factor. The function myC5.0FunctionForLevels returns the value TRUE. The example saves the function in the script repository.

The example next gets a list of datastores that have names that include the specified string and deletes those datastores if they exist.

The example then invokes ore.groupApply, which invokes function myC5.0FunctionForLevels on each state in the CHURN_TEST data. To each myC5.0FunctionForLevels invocation, ore.groupApply passes the datastore that contains the xlevels object and a prefix to use in naming the datastore generated by myC5.0FunctionForLevels. It also passes the ore.connect control argument to connect to the database in the embedded R function, which enables the use of objects stored in a datastore. The ore.groupApply invocation returns a list that contains the results of all of the invocations of myC5.0FunctionForLevels.

The example pulls the result over to the local R session and verifies that myC5.0FunctionForLevels returned TRUE for each state in the data source.

The example next creates another user-defined another function, myScoringFunction, and stores it in the script repository. The function scores a C5.0 model for the levels of a state and returns the results in a data.frame.

The example then invokes function ore.rowApply. It filters the input data to use only data for the state of Massachusetts. It specifies myScoringFunction as the function to invoke and passes that user-defined function the name of the datastore that contains the xlevels object and a prefix to use in loading the datastore that contains the C5.0 model for the state. The ore.rowApply invocation specifies invoking myScoringFunction on 200 rows of the data set in each parallel R engine. It uses the FUN.VALUE argument so that ore.rowApply returns an ore.frame that contains the results of all of the myScoringFunction invocations. The variable scores gets the results of the ore.rowApply invocation.

Finally, the example prints the scores object and then uses the table function to display the confusion matrix for the scoring.

See Also:

Example A-8 for an invocation of the SQL rqRowEval function that produces the same result as the ore.rowApply function in this example

Example 6-16 Using the ore.rowApply Function with Datastores and Scripts

library(C50)
data(churn)

ore.drop("CHURN_TEST"
ore.create(churnTest, "CHURN_TEST")

xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
ore.delete("myXLevels")
ore.save(xlevels, name = "myXLevels")

ore.scriptDrop("myC5.0FunctionForLevels")
ore.scriptCreate("myC5.0FunctionForLevels",
  function(dat, xlevelsDatastore, datastorePrefix) {
    library(C50)
    state <- dat[1,"state"]
    datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
    dat$state <- NULL
    ore.load(name = xlevelsDatastore)
    for (j in names(xlevels))
       dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
    c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
    ore.save(c5mod, name = datastoreName)
    TRUE
  })

ds.v <- ore.datastore(pattern= "myC5.0modelFL")$datastore.name
for (ds in ds.v) ore.delete(name = ds)

res <- ore.groupApply(CHURN_TEST, 
          INDEX=CHURN_TEST$state,
          FUN.NAME = "myC5.0FunctionForLevels",
          xlevelsDatastore = "myXLevels",
          datastorePrefix = "myC5.0modelFL",
          ore.connect = TRUE)
res <- ore.pull(res)
all(as.logical(res) == TRUE)

ore.scriptDrop("myScoringFunction")
ore.scriptCreate("myScoringFunction",
                  function(dat, xlevelsDatastore, datastorePrefix) {
                     library(C50)
                     state <- dat[1,"state"]
                     datastoreName <- paste(datastorePrefix,state,sep="_")
                     dat$state <- NULL
                     ore.load(name = xlevelsDatastore)
                     for (j in names(xlevels))
                        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
                     ore.load(name = datastoreName)
                     res <- data.frame(pred = predict(c5mod, dat, type = "class"),
                                              actual = dat$churn,
                                              state = state)
                     res
                  }
                )

scores <- ore.rowApply(
  CHURN_TEST[CHURN_TEST$state == "MA",],
  FUN.NAME = "myScoringFunction",
  xlevelsDatastore = "myXLevels", 
  datastorePrefix = "myC5.0modelFL",
  ore.connect = TRUE, parallel = TRUE,
  FUN.VALUE = data.frame(pred = character(0), 
                         actual = character(0), 
                         state = character(0)),
  rows=200)
scores
table(scores$actual, scores$pred)

Listing for This Example

R> library(C50)
R> data(churn)
R>
R> ore.drop("CHURN_TEST"
R> ore.create(churnTest, "CHURN_TEST")
R>
R> xlevels <- ore.getXlevels(~ ., CHURN_TEST[,-1])
R> ore.delete("myXLevels")
[1] "myXLevels
R> ore.save(xlevels, name = "myXLevels")
R>
R> ore.scriptDrop("myC5.0FunctionForLevels")
R> ore.scriptCreate("myC5.0FunctionForLevels",
+    function(dat, xlevelsDatastore, datastorePrefix) {
+      library(C50)
+      state <- dat[1,"state"]
+      datastoreName <- paste(datastorePrefix, dat[1, "state"], sep = "_")
+      dat$state <- NULL
+      ore.load(name = xlevelsDatastore)
+      for (j in names(xlevels))
+        dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+      c5mod <- C5.0(churn ~ ., data = dat, rules = TRUE)
+      ore.save(c5mod, name = datastoreName)
+      TRUE
+    })
R>
R> ds.v <- ore.datastore(pattern="myC5.0modelFL")$datastore.name
R> for (ds in ds.v) ore.delete(name=ds)
R>
R> res <- ore.groupApply(CHURN_TEST, 
+                       INDEX=CHURN_TEST$state,
+                       FUN.NAME="myC5.0FunctionForLevels",
+                       xlevelsDatastore = "myXLevels",
+                       datastorePrefix = "myC5.0modelFL",
+                       ore.connect = TRUE)
R> res <- ore.pull(res)
R> all(as.logical(res) == TRUE)
[1] TRUE
R>
R> ore.scriptDrop("myScoringFunction")
R> ore.scriptCreate("myScoringFunction",
+                    function(dat, xlevelsDatastore, datastorePrefix) {
+                      library(C50)
+                      state <- dat[1,"state"]
+                      datastoreName <- paste(datastorePrefix,state,sep="_")
+                      dat$state <- NULL
+                      ore.load(name = xlevelsDatastore)
+                      for (j in names(xlevels))
+                          dat[[j]] <- factor(dat[[j]], levels = xlevels[[j]])
+                      ore.load(name = datastoreName)
+                      res <- data.frame(pred = predict(c5mod, dat, type="class"),
+                                        actual = dat$churn,
+                                        state = state)
+                      res
+                  }
+ )
R>
R> scores <- ore.rowApply(
+     CHURN_TEST[CHURN_TEST$state =="MA",],
+     FUN.NAME = "myScoringFunction",
+     xlevelsDatastore = "myXLevels",
+     datastorePrefix = "myC5.0modelFL",
+     ore.connect = TRUE, parallel = TRUE,
+     FUN.VALUE = data.frame(pred=character(0),
+                            actual=character(0),
+                            state=character(0)),
+     rows=200
R>
R> scores
   pred actual state
1    no     no    MA
2    no     no    MA
3    no     no    MA
4    no     no    MA
5    no     no    MA
6    no    yes    MA
7   yes    yes    MA
8   yes    yes    MA
9    no     no    MA
10   no     no    MA
11   no     no    MA
12   no     no    MA
13   no     no    MA
14   no     no    MA
15  yes    yes    MA
16   no     no    MA
17   no     no    MA
18   no     no    MA
19   no     no    MA
20   no     no    MA
21   no     no    MA
22   no     no    MA
23   no     no    MA
24   no     no    MA
25   no     no    MA
26   no     no    MA
27   no     no    MA
28   no     no    MA
29   no    yes    MA
30   no     no    MA
31   no     no    MA
32   no     no    MA
33  yes    yes    MA
34   no     no    MA
35   no     no    MA
36   no     no    MA
37   no     no    MA
38   no     no    MA
Warning message:
ORE object has no unique key - using random order
R> table(scores$actual, scores$pred)
 
      no yes
  no  32   0
  yes  2   4

6.2.7 Use the ore.indexApply Function

The ore.indexApply function executes the specified user-defined input function using data that is generated by the input function.

The function supports task-parallel execution, in which one or more R engines perform the same or different calculations, or task. The times argument to the ore.indexApply function specifies the number of times that the input function executes in the database. Any required data must be explicitly generated or loaded within the input function.

The syntax of the ore.indexApply function is the following:

ore.indexApply(times, FUN, ..., FUN.VALUE = NULL, FUN.NAME = NULL, FUN.OWNER = NULL,
               parallel = getOption("ore.parallel", NULL))

The ore.indexApply function returns an ore.list object or an ore.frame object.

See Also:

Examples of the use of the ore.indexApply function are in the following topics:

6.2.7.1 Simple Example of Using the ore.indexApply Function

The example invokes ore.indexApply and specifies that it execute the input function five times in parallel.

Example 6-17 Using the ore.indexApply Function

This example displays the class of the result, which is ore.list, and then displays the result.

res <- ore.indexApply(5,
      function(index) {
        paste("IndexApply:", index)
      },
      parallel = TRUE)
class(res)
res

Listing for This Example

R> res <- ore.indexApply(5,
+       function(index) {
+         paste("IndexApply:", index)
+       },
+       parallel = TRUE)
R> class(res)
[1] "ore.list"
attr(,"package")
[1] "OREembed"
R> res
$`1`
[1] "IndexApply: 1"
 
$`2`
[1] "IndexApply: 2"
 
$`3`
[1] "IndexApply: 3"
 
$`4`
[1] "IndexApply: 4"
 
$`5`
[1] "IndexApply: 5"
6.2.7.2 Column-Parallel Use Case

The example uses the R summary function to compute in parallel summary statistics on the first four numeric columns of the iris data set.

Example 6-18 Using the ore.indexApply Function and Combining Results

The example combines the computations into a final result. The first argument to the ore.indexApply function is 4, which specifies the number of columns to summarize in parallel. The user-defined input function takes one argument, index, which will be a value between 1 and 4 and which specifies the column to summarize.

The example invokes the summary function on the specified column. The summary invocation returns a single row, which contains the summary statistics for the column. The example converts the result of the summary invocation into a data.frame and adds the column name to it.

The example next uses the FUN.VALUE argument to the ore.indexApply function to define the structure of the result of the function. The result is then returned as an ore.frame object with that structure.

res <- NULL
res <- ore.indexApply(4,
      function(index) {
        ss <- summary(iris[, index])
        attr.names <- attr(ss, "names")
        stats <- data.frame(matrix(ss, 1, length(ss)))
        names(stats) <- attr.names
        stats$col <- names(iris)[index]
        stats
      },
      FUN.VALUE=data.frame(Min. = numeric(0),
        "1st Qu." = numeric(0),
        Median = numeric(0),
        Mean = numeric(0),
        "3rd Qu." = numeric(0),
        Max. = numeric(0),
        Col = character(0)), 
      parallel = TRUE)
res

Listing for This Example

R> res <- NULL
R> res <- ore.indexApply(4,
+       function(index) {
+         ss <- summary(iris[, index])
+         attr.names <- attr(ss, "names")
+         stats <- data.frame(matrix(ss, 1, length(ss)))
+         names(stats) <- attr.names
+         stats$col <- names(iris)[index]
+         stats
+       },
+       FUN.VALUE=data.frame(Min. = numeric(0),
+         "1st Qu." = numeric(0),
+         Median = numeric(0),
+         Mean = numeric(0),
+         "3rd Qu." = numeric(0),
+         Max. = numeric(0),
+         Col = character(0)),
+       parallel = TRUE)
R> res
  Min. X1st.Qu. Median  Mean X3rd.Qu. Max.          Col
1  2.0      2.8   3.00 3.057      3.3  4.4  Sepal.Width
2  4.3      5.1   5.80 5.843      6.4  7.9 Sepal.Length
3  0.1      0.3   1.30 1.199      1.8  2.5  Petal.Width
4  1.0      1.6   4.35 3.758      5.1  6.9 Petal.Length
Warning message:
ORE object has no unique key - using random order
6.2.7.3 Simulations Use Case

You can use the ore.indexApply function in simulations, which can take advantage of high-performance computing hardware like an Oracle Exadata Database Machine.

Example 6-19 Using the ore.indexApply Function in a Simulation

This example takes multiple samples from a random normal distribution to compare the distribution of the summary statistics. Each simulation occurs in a separate R engine in the database, in parallel, up to the degree of parallelism allowed by the database. The example defines variables for the sample size, the mean and standard deviations of the random numbers, and the number of simulations to perform. The example specifies num.simulations as the first argument to the ore.indexApply function. The ore.indexApply function passes num.simulations to the user-defined function as the index argument. This input function then sets the random seed based on the index so that each invocation of the input function generates a different set of random numbers.

The input function next uses the rnorm function to produce sample.size random normal values. It invokes the summary function on the vector of random numbers, and then prepares a data.frame as the result it returns. The ore.indexApply function specifies the FUN.VALUE argument so that it returns an ore.frame that structures the combined results of the simulations. The res variable gets the ore.frame returned by the ore.indexApply function.

To get the distribution of samples, the example invokes the boxplot function on the data.frame that is the result of using the ore.pull function to bring selected columns from res to the client.

res <- NULL
sample.size = 1000
mean.val = 100
std.dev.val = 10
num.simulations = 1000
 
res <- ore.indexApply(num.simulations,
      function(index, sample.size = 1000, mean = 0, std.dev = 1) {
        set.seed(index)
        x <- rnorm(sample.size, mean, std.dev)
        ss <- summary(x)
        attr.names <- attr(ss, "names")
        stats <- data.frame(matrix(ss, 1, length(ss)))
        names(stats) <- attr.names
        stats$index <- index
        stats
      },
      FUN.VALUE=data.frame(Min. = numeric(0),
        "1st Qu." = numeric(0),
        Median = numeric(0),
        Mean = numeric(0),
        "3rd Qu." = numeric(0),
        Max. = numeric(0),
        Index = numeric(0)),
      parallel = TRUE,
      sample.size = sample.size,
      mean = mean.val, std.dev = std.dev.val)
options("ore.warn.order" = FALSE)
head(res, 3)
tail(res, 3)
boxplot(ore.pull(res[, 1:6]),
  main=sprintf("Boxplot of %d rnorm samples size %d, mean=%d, sd=%d",
               num.simulations, sample.size, mean.val, std.dev.val))

Listing for This Example

R> res <- ore.indexApply(num.simulations,
+       function(index, sample.size = 1000, mean = 0, std.dev = 1) {
+         set.seed(index)
+         x <- rnorm(sample.size, mean, std.dev)
+         ss <- summary(x)
+         attr.names <- attr(ss, "names")
+         stats <- data.frame(matrix(ss, 1, length(ss)))
+         names(stats) <- attr.names
+         stats$index <- index
+         stats
+       },
+       FUN.VALUE=data.frame(Min. = numeric(0),
+         "1st Qu." = numeric(0),
+         Median = numeric(0),
+         Mean = numeric(0),
+         "3rd Qu." = numeric(0),
+         Max. = numeric(0),
+         Index = numeric(0)),
+       parallel = TRUE,
+       sample.size = sample.size,
+       mean = mean.val, std.dev = std.dev.val)
R> options("ore.warn.order" = FALSE)
R> head(res, 3)
   Min. X1st.Qu. Median   Mean X3rd.Qu.  Max. Index
1 67.56    93.11  99.42  99.30    105.8 128.0   847
2 67.73    94.19  99.86 100.10    106.3 130.7   258
3 65.58    93.15  99.78  99.82    106.2 134.3   264
R> tail(res, 3)
   Min. X1st.Qu. Median   Mean X3rd.Qu.  Max. Index
1 65.02    93.44  100.2 100.20    106.9 134.0     5
2 71.60    93.34   99.6  99.66    106.4 131.7     4
3 69.44    93.15  100.3 100.10    106.8 135.2     3
R> boxplot(ore.pull(res[, 1:6]),
+   main=sprintf("Boxplot of %d rnorm samples size %d, mean=%d, sd=%d",
+                num.simulations, sample.size, mean.val, std.dev.val))

6.3 SQL Interface for Embedded R Execution

The SQL interface for Oracle Machine Learning for R embedded R execution allows you to execute R functions in production database applications.

The SQL interface has procedures for the following actions:

  • Adding and removing a script from the OML4R script repository

  • Granting or revoking read privilege access to a script by the owner to other users

  • Executing an R script in an embedded R session

  • Deleting an OML4R datastore

Data dictionary views provide information about scripts and datastores.

This SQL interface is described in the following topics:

6.3.1 About Oracle Machine Learning for R SQL Table Functions

OML4R provides SQL table functions that are equivalents of most of the R interface functions for embedded R execution.

Executing a SELECT FROM TABLE statement and specifying one of the table functions results in the invocation of the specified R script. The script runs in one or more R engines on the Oracle Database server.

The SQL table functions for embedded R execution are:

  • rqEval

  • rqGroupEval

  • rqRowEval

  • rqTableEval

The R interface functions and the SQL equivalents are listed in Table 6-1.

For the rqGroupEval function, OML4R provides a generic implementation of the group apply functionality in SQL. You must write a table function that captures the structure of the input cursor.

See the reference pages for the functions for information about them, including examples of their use.

Some general aspects of the SQL table functions are described in the following topics:

6.3.1.1 Parameters of the SQL Table Functions

The SQL table functions have some parameters in common and some functions have parameters that are unique to that function.

The parameters of the SQL table functions are the following.

Table 6-2 SQL Table Function Parameters

Parameter Description

INP_CUR

A cursor that specifies the data that is input to the R function specified by EXP_NAM. For all of the SQL table functions except rqEval, the first argument is a cursor that specifies input data for the R function.

PAR_CUR

A cursor that specifies arguments to pass to the R function. The parameters cursor consists of a single row of scalar values. An argument can be a string or a numeric value. You can specify multiple arguments in the cursor. Arguments to an R function are case sensitive, so you should put names, such as a column name, in double quotes.

In the cursor, you can also specify as scalar values an OML4R control argument or the names of serialized R objects, such as predictive models, that are in an OML4R datastore.

The value of this parameters cursor can be NULL if you are not passing any arguments to the R function or any control arguments.

OUT_QRY

An output table definition. The value of this argument can be NULL or a string that defines the structure of the R data.frame returned by the R function specified by EXP_NAM. The string can be a SELECT statement, 'XML', or 'PNG'.

GRP_COL

For the rqGroupEval function, the name of the grouping column.

ROW_NUM

For the rqRowEval function, the number of rows to pass to each invocation of the R function.

EXP_NAM

The name of a script in the OML4R script repository.

6.3.1.2 Return Value of SQL Table Functions

The Oracle Machine Learning for R SQL table functions return a table.

The structure and contents of the table are determined by the results of the R function passed to the SQL table function and by the OUT_QRY parameter. The R function can return a data.frame object, other R objects, and graphics. The structure of the table that represents the results of the R function is specified by one of the following OUT_QRY values:

  • NULL, which results in a table that has a serialized object that can contain both data and image objects.

  • A table signature specified in a SELECT statement, which results in a table that has the defined structure. The result of the R function must be a data.frame. No images are returned.

  • The string 'XML', which results in a table that has a CLOB that can contain both structured data and graph images in an XML string. The non-image R objects, such as data.frame or model objects, are provided first, followed by the base 64 encoding of a PNG representation of the image.

  • The string 'PNG', which results in a table that has a BLOB that contains graph images in PNG format. The table has the column names name, id, and image.

6.3.1.3 Connect to Oracle Machine Learning for R in Embedded R Execution

To establish a connection to OML4R on the Oracle Database server during the embedded R execution, you can specify the control argument ore.connect in the parameters cursor.

Doing so establishes a connection using the credentials of the user who invoked the embedded R function. It also automatically loads the ORE package. Establishing an OML4R connection is required to save objects in an OML4R R object datastore or to load objects from a datastore. It also allows you to explicitly use the OML4R transparency layer.

See Also:

Optional and Control Arguments for information on other control arguments

6.3.2 Manage Scripts in SQL

This topic lists the PL/SQL procedures and Oracle Database data dictionary views for creating and managing R scripts.

The functions in the SQL API for embedded R execution require as an argument a named script that is stored in the OML4R script repository. The PL/SQL procedures sys.rqScriptCreate and sys.rqScriptDrop create and drop scripts. To create a script or drop one from the script repository requires the RQADMIN role.

When using the sys.rqScriptCreate function, you must specify a name for the script and an R function script that contains a single R function definition. Calls to the functions sys.rqScriptCreate and sys.rqScriptDrop must be wrapped in a BEGIN-END PL/SQL block. The script repository stores the R function as a character large object (a CLOB), so you must enclose the function definition in single quotes to specify it as a string.

The owner of a script can use the rqGrant procedure to grant to another user read privilege access to a script or use the rqRevoke procedure to revoke the privilege. To use a script granted to you by another user, you must specify the owner by prepending the owner’s name and a period to the name of the script, as in the following:

select * from table(rqEval(NULL, ‘select 1 x from dual’, ‘owner_name.script_name’));

The owner prefix is not required for a public script or for a script owned by the user.

The following tables list the PL/SQL procedures for managing script repository scripts and the data dictionary views that contain information about scripts.

Table 6-3 PL/SQL Procedures for Managing Scripts

PL/SQL Procedure Description
rqGrant Grants read privilege access to a datastore or script.
rqRevoke Revokes read privilege access to a datastore or script.
sys.rqScriptCreate Adds the provided R function into the script repository with the provided name.
sys.rqScriptDrop Removes the named R function from the script repository.

Table 6-4 Data Dictionary Views for Scripts

Data Dictionary View Description
ALL_RQ_SCRIPTS Describes the scripts in the OML4R script repository that are available to the current user
USER_RQ_SCRIPTS Describes the scripts in the script repository that are owned by the current user.
USER_RQ_SCRIPT_PRIVS Describes the scripts in the script repository to which the current user has granted read access and the users to whom access has been granted.
SYS.RQ_SCRIPTS Describes the system scripts in the script repository.

Example 6-20 Create a Script with the SQL APIs

This example uses the sys.rqScriptCreate procedure to create a script in the Oracle Machine Learning for R script repository.

The example creates the user-defined function named myRandomRedDots2. The user-defined function accepts two arguments, and it returns a data.frame object that has two columns and that plots the specified number of random normal values. The sys.rqScriptCreate function stores the user-defined function in the OML4R script repository.

-- Create a script named myRandomRedDots2 and add it to the script repository.
-- Specify that the script is private and to overwrite a script with the same name.
BEGIN
  sys.rqScriptCreate('myRandomRedDots2',
    'function(divisor = 100, numDots = 100) {
       id <- 1:10
       plot(1:numDots, rnorm(numDots), pch = 21, bg = "red", cex = 2 )
       data.frame(id = id, val = id / divisor)}',
       v_global => FALSE,
       v_overwrite => TRUE);
END;
/

-- Grant read privilege access to Scott.
BEGIN
  rqGrant('myRandomRedDots2', 'rqscript', 'SCOTT');
END;
 /

-- View the users granted read access to myRandomRedDots2.
select * from USER_RQ_SCRIPT_PRIVS;

NAME                GRANTEE
----------------    -------
myRandomRedDots       SCOTT

-- Revoke the read privilege access from Scott.
BEGIN
  rqRevoke('myRandomRedDots2', 'rqscript', 'SCOTT');
END;
 /

-- Remove the script from the script repository.
BEGIN
  sys.rqScriptDrop('myRandomRedDots2');
 END; 
 /

6.3.3 Manage Datastores in SQL

Oracle Machine Learning for R provides PL/SQL procedures and Oracle Database data dictionary views for the basic management of datastores in SQL.

The following tables list the procedures and views.

Table 6-5 PL/SQL Procedures for Managing Datastores

PL/SQL Procedures Description
rqGrant Grants read privilege access to a datastore or script.
rqRevoke Revokes read privilege access to a datastore or script.
rqDropDataStore Deletes a datastore.

Table 6-6 Data Dictionary Views for Datastores

Views Description
ALL_RQ_DATASTORES Describes the datastores available to the current user, including whether the datastore is grantable.
RQUSER_DATASTORELIST Describes the datastores in the Oracle Database schema..
RQUSER_DATASTORECONTENTS Describes the objects in the datastores in the Oracle Database schema.
USER_RQ_DATASTORE_PRIVS Describes the datastores and the users to whom the current user has granted read privilege access.
USER_RQ_DATASTORES Describes the datastores owned by the current user, including whether the datastore is grantable.