8 Oracle R Enterprise Embedded Execution

This chapter describes these topics:

Security Considerations for Scripts

Both R scripts and SQL scripts allow access to the database server. For this reason, creation of scripts must be controlled. The RQADMIN Role is required for those users who create and drop scripts.

RQADMIN Role

Oracle R Enterprise creates the RQADMIN role.

The RQADMIN role must be explicitly granted to a user.

The RQADMIN role is required in these instances:

  • Calling ore.doEval() with FUN argument

  • Creating and dropping scripts with ore.scriptCreate and ore.scriptDrop

The RQADMIN role is not required when calling ore.rowApply, ore.groupApply, ore.tableApply, ore.indexApply, and ore.doEval with the FUN.NAME argument.

To grant RQADMIN to RQUSER, start SQL*Plus as sysdba and type

grant rqadmin to RQUSER

Note:

You should grant RQADMIN only to those users who need it.

Support for Database Parallelism

Parallel processing is not restricted to Oracle R Enterprise functions only; it can be enabled for Open Source R packages that are not part of Oracle R Enterprise. For such packages data-parallelism can be leveraged through Oracle R Enterprise embedded R execution.

On the R side, Oracle R Enterprise provides ore.groupApply(), ore.rowApply(), and ore.indexApply() for data-parallel processing. Data-parallel processing consists of dividing a data set into multiple subsets that can be processed in parallel (independently). Oracle R Enterprise also provides SQL-equivalent functionality for group apply and row apply as described in Oracle R Enterprise Embedded SQL Scripts.

Open Source packages (CRAN packages) can generally not leverage the Oracle R Enterprise transparency layer (because they are not written using base R exclusively or include callouts to functionality such as C functions) and execute on data in the R address space. This means that their use is subject to memory and parallelism constraints of R and the way the CRAN package was written. Oracle R Enterprise does not automatically parallelize the internal code of CRAN packages.

Embedded R execution enables leveraging what is likely a larger server (a Database server, such as Oracle Exadata) in terms of memory and number of processors to expand what a typical R client may be able to achieve. In addition, embedded R execution provides for more efficient transfer of data between the database and the R engine (since they are on the same machine). Embedded R execution also allows for data parallel execution of user R functions that may leverage CRAN packages, both from Oracle R Enterprise R and SQL APIs.

R Interface for Embedded Oracle R Enterprise Scripts

These Oracle R Enterprise functions permit R-based applications to embed Oracle R Enterprise functionality in the scripts. For example, they allow R scripts to perform operations on database objects.

An R script contains a single function definition. R scripts reside in the Oracle R Enterprise in-database R script archive.

Embedded R scripts provide several advantages:

  • You can execute R scripts in the database where the data resides; you do not have to move data out of the database. The scripts may contain custom techniques or include functions from CRAN packages.

  • You can run existing R scripts within R-based applications and operational SQL-based applications.

  • You can leverage distributed data flow parallelism in Oracle Database; the parallelism is user controlled but database managed.

  • You can use the security provided by Oracle Database. See Security Issues for Embedded R Scripts for information about how to register scripts so that they are secure.

Oracle R Enterprise provides these functions that support running R scripts in the database:

There are example scripts in ore.doEval() and ore.indexApply().

Security Issues for Embedded R Scripts

All of these scripts require an argument FUN or FUN.NAME. For security reasons, use of the argument FUN requires the RQADMIN role, a collection of Oracle Database privileges. Since creation of the script represented by the argument FUN.NAME must be published by a user with RQADMIN credentials, it can be used by anyone authorized to use Oracle R Enterprise.

Input for ore.*Apply() and ore.doEval()

The functions ore.tableApply(), ore.groupApply(), ore.rowApply(),and ore.indexApply(), and ore.doEval() all take either a FUN.NAME parameter (for a function that has been loaded into the R script repository) or FUN, which is an R function (closure).

All functions can return anything. However, when you specify the FUN.VALUE argument the output should be a matching data.frame.

All functions take parameters that are passed as optional arguments (... arguments). They can named or not.

All functions take the FUN.NAME parameter, which is the name of a function in the R script repository in the database, or an actual R function in the FUN parameter.

ore.doEval()

ore.doEval() invokes a stand-alone R script in the database without input data; parameters are allowed. It returns an ore.frame object or serialized R objects.

Input for ore.doEval()is internally generated data. You can load data from a file or a table using ore.pull().

Input data is one of the following:

  • Internally generated

  • Loaded from a file or pulled from the database by using ore.pull()

  • Made available through the Transparency Layer

ore.doEval() takes the FUN.NAME parameter, which is the name of a function in the R script repository in the database, or an actual R function in the FUN parameter.

The following additional arguments to the FUN parameter starting with ore. are special control arguments. They are not passed to the function specified by the FUN or FUN.NAME arguments, but instead control what happens before or after the execution of the closure. The following control arguments are supported:

  • ore.drop controls the input data. If TRUE, one column data.frame will be converted to a vector. The default value is TRUE.

  • ore.connect controls whether to automatically connect to Oracle R Enterprise inside the closure. This is equivalent to doing an ore.connect call with the same credentials as the client session. The default value is FALSE.

  • ore.graphics controls whether to start a graphical driver and look for images. The default value is TRUE.

  • ore.png.* specifies additional parameters for the png graphics driver if ore.graphics is TRUE. The naming convention for these arguments is to add an ore.png.prefix to the arguments of the png function. For example, if ore.png.height is supplied, argument height is passed to the png function. If not set, the standard default values for the png function are used.

This example scales the first n integers by the value provided. The result is a serialized R object (data.frame):

Oracle R Enterprise comes with a number of predefined graphical scripts. All predefined scripts have a reserved name that start with RQG$ followed by a function name from the graphics package that the script wraps. Depending on the function it either takes the first, the first and second, or all of the columns of the input data.frame. Thus, predefined scripts can only be used with ore.tableApply, ore.groupApply, or ore.rowApply. Each function also has ... so that it can pass any parameter to the function that it wraps.

res <- 
   ore.doEval(function (num = 10, scale = 100) {
          ID <- seq(num)
          data.frame(ID = ID, RES = ID / scale)
          })
class(res)
res
local_res <- ore.pull(res)
class(local_res)
local_res

For more examples, see the R help for ore.doEval().

ore.tableApply()

ore.tableApply() invokes an R script with an entire table (ore.frame) as input. The input is provided all at once to the function. As with ore.doEval(), it can return an ore.frame object or serialized R objects.

Input data is an ore.frame object.

Returns a data frame signature as an ore.frame object.

Takes NULL or <variable>=<value> as an argument.

ore.groupApply()

ore.groupApply() partitions the data according to a specified column's values and invokes the R script on each partition in parallel, when possible. The return value is a list of each group's execution results.

Input data is an ore.frame object.

Returns either a NULL value as an ore.object or a data frame signature as an ore.frame object.

Takes NULL or <variable>=<value> as an argument.

You must specify the partition column for ore.groupApply().

Takes the FUN.NAME parameter, which is the name of a function in the R script repository in the database, or an actual R function in the FUN parameter.

ore.rowApply()

ore.rowApply() enables you to specify a chunk size, which is the number of rows that the function should act upon. The function is invoked multiple times in parallel, if multiple R engines can be invoked at the database server, until all data is processed. The return value is a list of each chunk's execution results.

Input data is an ore.frame object.

Returns either a NULL value as an ore.object or a data frame signature as an ore.frame object.

Takes NULL or <variable>=<value> as an argument.

You can specify the chunk size for ore.rowApply().

Takes the FUN.NAME parameter, which is the name of a function in the R script repository in the database, or an actual R function in the FUN parameter.

ore.indexApply()

ore.indexApply() invokes an R script n times, where n is a positive integer. The return value is a list of each execution's results.

Input data is one of the following:

  • Internally generated

  • Loaded from a file or pulled from the database by using ore.pull()

  • Made available through the Transparency Layer

ore.indexApply() can take NULL or <variable>=<value> as arguments.

You must specify n, the number of times to invoke the R function.

Takes the FUN.NAME parameter, which is the name of a function in the R script repository in the database, or an actual R function in the FUN parameter

For example, this code applies the function 10 times:

res<-ore.indexApply(10,function (x, scale = 100) x / scale)

ore.scriptCreate()

ore.scriptCreate() creates an R script in the database. The script can be used by name in other embedded R script functions.

Note:

ore.scriptCreate() requires the RQADMIN role.

ore.scriptCreate() has this syntax:

ore.scriptCreate(name, FUN)

where

  • name is a character string specifying the name of the R script in Oracle Database.

  • FUN is a function definition to be used with functions ore.doEval(, ore.groupApply(), ore.indexApply(), ore.rowApply(), or ore.tableApply().

The function returns an invisible NULL value if it succeeds; if it does not succeed in creating the script, it returns an error.

ore.scriptCreate() Example

This example creates a script and then drops it:

ore.scriptCreate("MYLM",function(data, formula, ...) lm(formula, data, ...))
       IRIS <- ore.push(iris)
       ore.tableApply(IRIS[1:4], FUN.NAME = "MYLM" formula = Sepal.Length ~ .)
       ore.scriptDrop("MYLM")

ore.scriptDrop()

ore.scriptDrop() drops a named R script from the database repository. Requires the RQADMIN role.

Note:

ore.scriptDrop() requires the RQADMIN role.

ore.scriptDrop() has this syntax:

ore.scriptDrop(name)

where

  • name is a character string specifying the name of the R script in Oracle Database.

The function returns an invisible NULL value if it succeeds; if it does not succeed in dropping the script, it returns an error.

For an example, see ore.scriptCreate() Example.

Automatic Database Connection in Embedded R Scripts

An embedded R script can automatically connect to an Oracle database.

If automatic connections are enabled, the following functionality occurs:

  • Embedded R scripts are automatically connected to the database.

  • The automatic connection has the same credentials as the session that invokes the embedded R SQL functions.

  • The script runs in an autonomous transaction.

  • ROracle queries work with the automatic connection.

  • Oracle R Enterprise transparency is enabled in the embedded script.

  • User and site-wide R profile loading is disabled in embedded R.

    Profile loading was supported in earlier Oracle R Enterprise releases. An automatic connection provides a more secure connection.

Automatic connections are disabled by default. You can specify whether automatic connections are enabled or disabled by using the ore.connect control argument. Control arguments are documented in R help for ore.doEval().

To enable automatic connections, ROracle was extended by adding a new driver ExtDriver with the constructor Extproc that is initialized by passing an external pointer wrapping the extproc context. Similarly to OraDriver, ExtDriver is a singleton. Both drivers can exist simultaneously in a session since these are represented by two distinct singletons. This setup allows working with extproc and explicit OraDriver connections in the same R script as shown by the following example.

ore.doEval(function() {
  ore.disconnect()
  con1 <- dbConnect(Extproc())
  res1 <- dbGetQuery(con1, "select * from grade order by name")
  con2 <- dbConnect(Oracle(), "scott", "tiger")
  res2 <- dbGetQuery(con2, "select * from emp order by empno")
  dbDisconnect(con1)
  dbDisconnect(con2)
  cbind(head(res1)[,1:3], head(res2)[,1:3])
}, ore.connect = TRUE)

Examples of Embedded R Scripts

For a detailed example of an embedded R script, see the Oracle R Enterprise Blog "Introduction to ORE Embedded R Script Execution" at https://blogs.oracle.com/R/entry/analyzing_big_data_using_the1.

Part 6: ORE Embedded R Scripts: R Interface in the free Oracle R Enterprise Tutorial Series describes embedded R scripts and contains several examples. See Oracle R Enterprise Training for information about the Tutorial Series.

Several of the Oracle R Enterprise Demos illustrate embedded execution.

Oracle R Enterprise Embedded SQL Scripts

The SQL interface allows you to embed R script execution in production database applications.

The functions associated with the SQL interface must be stored in the database R repository, and referenced by name in SQL API functions. See Registering and Managing SQL Scripts for a description of how to add scripts to the repository, remove scripts from the repository, and list and use scripts in the repository.

For descriptions of the SQL functions, see Oracle R Enterprise SQL Functions.

Registering and Managing SQL Scripts

For security purposes, you must first register the R script under some system unique name and use the new name instead of the actual script in calls to rq*Eval table functions.

There are two administrative functions that create and drop scripts and a view that lists scripts:

  • sys.rqScriptCreate()

  • sys.rqScriptDrop()

  • view allows you to list and use scripts that were created

The scripts require the RQADMIN role described in RQADMIN Role.

When using sys.rqScriptCreate(), you must specify a corresponding R Closure of the function string.

Here is an example of registering the scripts and of using the registered scripts:

begin
  sys.rqScriptCreate('tmrqfun2',
'function() {
ID <- 1:10
res <- data.frame(ID = ID, RES = ID / 100)
res
}');
end;
/
 
select *
  from table(rqEval(
        NULL,
       'select 1 id, 1 res from dual',
       'tmrqfun2'));
 
begin
  sys.rqScriptDrop('tmrqfun2');
end;
/

Oracle R Enterprise SQL Functions

The rq*Eval functions result in one or more new R engines being started at the database depending on database parallelism settings.

To enable execution of an R script in the database (lights-out processing), Oracle R Enterprise provides variants of ore.doEval(), ore.tableApply(), ore.groupApply(), and ore.rowApply() in SQL. (ore.doEval(), ore.tableApply(), ore.groupApply(), and ore.rowApply() are described in R Interface for Embedded Oracle R Enterprise Scripts.)

The SQL functions are

  • rqTableEval()

  • rqEval()

  • rqRowEval()

  • rqGroupEval()

rqGroupEval() requires additional SQL specification and is provided here as a virtual function, which partitions the data according to a specified column's values and invokes the R script on each partition. For more information, see rqGroupEval() Function.

You can also use these functions with objects in a datastore, as described in rq*Eval() and Objects in a Datastore.

The rq*Eval() functions (rqEval(), rqTableEval(), rqGroupEval(), and rqRowEval()) have similar syntax:

rq*Eval(
     cursor(select * from table-1),
     cursor(select * from table-2),
     'select <column list> from table-3 t',
     <grouping col-name from table-1 or num_rows>,
     <R closure name of registered-R-code>
     )

where

  • The first cursor is the input cursor: Input is passed as a whole table, group, or N rows at a time to the R closure described in the fourth parameter.

    rqEval() does not have this cursor argument.

  • The second cursor is the parameters cursor: One row of scalar values (string, numeric, or both) can be passed; for example, the name of the model and several numeric scalar values for model setting.

  • The query specifies the output table definition; output can be 'SELECT statement', 'XML', or 'PNG'.

  • grouping col-name applies to rqGroupEval(); it provides the name of the grouping column.

  • num_rows applies to rqRowEval(); it provides the number of rows to provide to the functions at one time.

  • <R closure name of registered-R-code> is a registered version of the R function to execute. See Registering and Managing SQL Scripts for details.

The return values for all of the SQL functions specify one of these values:

  • A table signature that is specified in a SELECT statement, which returns results as a table from the rq function.

  • XML, returned as a CLOB which returns both structured and graph images in an XML string. The structured components are provided first, followed by the base 64 encoding of the png representation of the image.

  • PNG, returned as a BLOB which returns graph images in PNG format.

rqEval(), rqTableEval(), rqGroupEval(), and rqRowEval() must specify an R script by the name that is stored in the R script repository. See Registering and Managing SQL Scripts for information about the sys.rq_scripts view provides a list of registered scripts.

The following examples illustrate using these functions:

  • This example uses all rows from the table fish as input to the R function that takes no other parameters and produces output that contains all input data plus the ROWSUM of values.

    Note that parameters (param) to the R function is optional.

    begin
    sys.rqScriptCreate('tmrqfun2',
    'function(x, param) {
            dat <- data.frame(x, stringsAsFactors=F)
            cbind(dat, ROWSUM = apply(dat,1,sum)+10)
    }');
    end;
    /
     
    select * from table(rqTableEval(  
       cursor(select * from fish),
       NULL,
       'select t.*, 1 rowsum from fish t',
       'tmrqfun2' ));
     
    begin
    sys.rqScriptDrop('tmrqfun2');
    end;
    /
    
  • This example illustrates passing n=1 (4th parameter) row at a time from the table fish to the R function. No parameters are required by the function. The function generates ROWSUM which is added as an extra column to fish in the output.

    begin
    sys.rqScriptCreate('tmrqfun2',
    'function(x, param) {
            dat <- data.frame(x, stringsAsFactors=F)
            cbind(dat, ROWSUM = apply(dat,1,sum)+10)
    }');
    end;
    /
     
    select * from table(rqRowEval(  
       cursor(select * from fish),
       NULL,
       'select t.*, 1 rowsum from fish t',
        1,
       'tmrqfun2' ));
     
    begin
    sys.rqScriptDrop('tmrqfun2');
    end;
    /
    

rqGroupEval() Function

rqGroupEval() invokes an R script on data that is partitioned by a grouping column.

rqGroupEval() requires the creation of two PL/SQL objects, a package and a pipelined table function:

  1. Create a PL/SQL package that specifies the types of result to be returned.

  2. Create a function that takes the return value of the package and uses the return value with PIPELINED_PARALLEL_ENABLE set to indicate the column on which to partition data.

Suppose that ONTIME_S is a table that stores information about arrival of airplanes. The data cursor uses all data, but you could also define cursors that use some columns using PL/SQL records. Then you must define as many PL?SQL table functions as the number of grouping columns that you are interested in using for a particular data cursor.

CREATE PACKAGE ontimePkg AS
  TYPE cur IS REF CURSOR RETURN ontime_s%ROWTYPE;
END ontimePkg;
/
 
CREATE FUNCTION ontimeGroupEval(
  inp_cur  ontimePkg.cur,
  par_cur  SYS_REFCURSOR,
  out_qry  VARCHAR2,
  grp_col  VARCHAR2,
  exp_txt  CLOB)
RETURN SYS.AnyDataSet
PIPELINED PARALLEL_ENABLE (PARTITION inp_cur BY HASH (month))
CLUSTER inp_cur BY (month)
USING rqGroupEvalImpl;
/

At this time, only one grouping column is supported. If you have multiple columns combine the columns into one column and use the new column as a grouping column. PARALLEL_ENABLE clause is optional but CLUSTER BY is not.

rq*Eval() and Objects in a Datastore

rq*Eval()and related functions allow you to use serialized R objects saved in a datastore using a parameter cursor. You can specify the association of object and datastore names of the serialized R objects with the R function parameter names in that parameter cursor.

Here is an example of how to use rq*Eval()this way. Suppose that user scott has saved a model in the datastore ontime_model as the object lm.mod,. Suppose scott wants to use this model in SQL for embedded Oracle R Enterprise scoring. This code shows how to use the model for embedded scoring. See Automatic Database Connection in Embedded R Scripts for the configuration parameters for ore.connect().

begin
  sys.rqScriptCreate('tmrqmodelscore',
    'function(dat, in.dsname, in.objname) {
       ore.load(name=in.dsname, list=in.objname)
       mod <- get(in.objname)
       prd <- predict(mod, newdata=dat)
       prd[as.integer(rownames(prd))] <- prd
       res <- cbind(dat, PRED = prd)
       res
    }');
end;
/ -- score model
 
select * from table(rqTableEval(
             cursor(select ARRDELAY, DISTANCE, DEPDELAY from ontime_s
                where year = 2003 and month = 5 and dayofmonth = 2),
             cursor(select 'ontime_model' as "in.dsname", 
            'lm.mod' as "in.objname", 1 as "ore.connect" from dual),
             'select ARRDELAY, DISTANCE, DEPDELAY, 1 PRED from ontime_s',
            'tmrqmodelscore'))
order by 1, 2, 3;

Datastore Management in SQL

Oracle R Enterprise provides basic management for datastores in SQL. Basic datastore management includes show, search, and drop. The following functions and views are provided:

  • rqDropDataStore() deletes a datastore and all of the objects in the datastore.

    Syntax: rqDropDataStore('<ds_name>'), where <ds_name> is the name of the datastore to delete.

    The following example deletes the datastore ds_model from current user schema:

    rqDropDataStore('ds_model')
    
  • rquser_DataStoreList is a view containing datastore-level information for all datastores in the current user schema. The information consists of datastore name, number of objects, size, creation date, and description.

    These examples illustrate using the view:

    select * from rquser_DataStoreList
    select dsname, nobj, size from rquser_datastorelist where dsname = 'ds_1'
    
  • rquser_DataStoreContents is a view containing object-level information about all datastores in the current user schema. The information consists of object name, size, class, length, number of rows and columns.

    This example lists the datastore contents for datastore ds_1:

    select * from rquser_DataStoreContents where dsname = 'ds_1';