1 Introduction to Oracle Machine Learning for R

Lists topics that introduce Oracle Machine Learning for R (OML4R).

OML4R in previous releases was named Oracle R Enterprise. The following topics introduce OML4R:

1.1 About Oracle Machine Learning for R

Oracle Machine Learning for R (OML4R) is a comprehensive, database-centric environment for end-to-end analytical processes in R, with immediate deployment to production environments.

OML4R is a set of R packages and Oracle Database features that enable an R user to operate on database-resident data without using SQL and to execute R scripts in one or more embedded R engines that run on the database server.

Using OML4R from your local R session, you have easy access to data in an Oracle Database instance. You can create and use R objects that specify data in database tables. OML4R has overloaded functions that translate R operations into SQL that executes in the database. The database consolidates the SQL and can use the query optimization, parallel processing, and scalability features of the database when it executes the SQL statements. The database returns the results as R objects.

Embedded R execution provides some of the most significant advantages of using OML4R. Using embedded R execution, you can store and run R scripts in the database through either an R interface or a SQL interface or both. You can use the results of R scripts in SQL-enabled tools for structured data, R objects, and images.

1.2 Advantages of Oracle Machine Learning for R

Using OML4R to prepare and analyze data in an Oracle Database instance has many advantages for an R user.

With OML4R, you can do the following:

  • Operate on Database-Resident Data Without Using SQL. OML4R has overloaded open source R methods and functions that transparently convert standard R syntax into SQL. These methods and functions are in packages that implement the OML4R transparency layer. With these functions and methods, you can create R objects that access, analyze, and manipulate data that resides in the database. The database can automatically optimize the SQL to improve the efficiency of the query.

  • Eliminate Data Movement. By keeping the data in the database, you eliminate the time involved in transferring the data to your desktop computer and the need to store the data locally. You also eliminate the need to manage the locally stored data, which includes tasks such as distributing the data files to the appropriate locations, synchronizing the data with changes that are made in the production database, and so on.

  • Keep Data Secure. By keeping the data in the database, you have the security, scalability, reliability, and backup features of Oracle Database for managing the data.

  • Use the Power of the Database. By operating directly on database-resident data, you can use the memory and processing power of the database and avoid the memory constraints of your client R session.

  • Use Current Data. As data is refreshed in the database, you have immediate access to current data.

  • Prepare Data in the Database. Using the transparency layer functions, prepare large database-resident data sets for predictive analysis through operations such as ordering, aggregating, filtering, recoding, and the use of comprehensive sampling techniques without having to write SQL code.

  • Save R Objects in the Database. You can save R objects in an Oracle Database instance as persistent database objects that are available to others. You can store R and OML4R objects in an OML4R datastore, which is managed by the Oracle Database instance.

  • Build Models in the Database. You can build models in the database and store and manage them in an OML4R datastore. You can use functions in packages that you download from CRAN (The Comprehensive R Archive Network) to build models that require large amounts of memory and that use techniques such as ensemble modeling.

  • Score Data in the Database. You can include your R models in scripts to score database-resident data. You can perform tasks such as the following:

    • Go from model building to scoring in one step because you can use the same R code for scoring. You do not need to translate the scoring logic as required by some standalone analytic servers.

    • Schedule scripts to be run automatically to perform tasks such as bulk scoring.

    • Score data in the context of a transaction.

    • Perform online what-if scoring.

    • Optionally convert a model to SQL, which Oracle Database does automatically for you. You can then deploy the resulting SQL for low-latency scoring tasks.

  • Execute R Scripts in the Database. Using OML4R embedded R execution functionality, you can create, store, and execute R scripts in the database. When the script executes, Oracle Database starts, controls, and manages one or more R engines that can run in parallel on the database server. By executing scripts on the database server, you can take advantage of scalability and performance of the server.

    With the embedded R execution functionality, you can do the following:

    • Develop and test R scripts interactively and make the scripts available for use by SQL applications

    • Use CRAN and other packages in R scripts on the database server

    • Operationalize entire R scripts in production applications and eliminate porting R code; avoid reinventing code to integrate R results into existing applications

    • Seamlessly leverage Oracle Database as a high performance computing (HPC) environment for R scripts, providing data parallelism and resource management

    • Use the processing and memory resources of Oracle Database and the increased efficiency of read/write operations between the database and the embedded R execution R engines

    • Use the parallel processing capabilities of the database for data-parallel or task-parallel operations

    • Perform parallel simulations

    • Generate XML and PNG images that can be used by R or SQL applications

  • Integrate with the Oracle Technology Stack. You can take advantage of all aspects of the Oracle technology stack to integrate your data analysis within a larger framework for business intelligence or scientific inquiry. For example, you can integrate the results of your OML4R analysis into Oracle Business Intelligence Enterprise Edition (OBIEE).

1.3 Get Online Help for Oracle Machine Learning for R Classes, Functions, and Methods

The OML4R client packages contain the R components that you use to interact with data in an Oracle database.

For a list and brief descriptions of the client packages, and for information on installing them, see Oracle Machine Learning for R Installation and Administration Guide.

To get help on OML4R classes, functions, and methods, use R functions such as help and showMethods. If the name of a class or function has an ore prefix, you can supply the name to the help function. To get help on an overloaded method of an open-source R function, supply the name of the method and the name of the ore class.

Example 1-1 Getting Help on OML4R Classes, Functions, and Methods

This example shows several ways of getting information on OML4R classes, functions, and methods. In the listing following the example some code has been modified to display only a portion of the results and the output of some of the functions is not shown.

# List the contents of the OREbase package.
ls("package:OREbase")

# Get help for the OREbase package.
help("OREbase")

# Get help for the ore virtual class.
help("ore-class")

# Show the subclasses of the ore virtual class.
showClass("ore")

# Get help on the ore.frame class.
help("ore.frame")

# Get help on the ore.vector class.
help("ore.vector")

# Show the arguments for the aggregate method.
showMethods("aggregate")

# Get help on the aggregate method for an ore.vector object.
help("aggregate,ore.vector-method")

# Show the signatures for the merge method.
showMethods("merge")

# Get help on the merge method for an ore.frame object.
help("merge,ore.frame,ore.frame-method")

showMethods("scale")

# Get help on the scale method for an ore.number object.
help("scale,ore.number-method")

# Get help on the ore.connect function.
help("ore.connect")
Listing for Example 1-1
R> options(width = 80)
# List the contents of the OREbase package.
R> head(ls("package:OREbase"), 12)
 [1] "%in%"          "Arith"         "Compare"       "I"            
 [5] "Logic"         "Math"          "NCOL"          "NROW"         
 [9] "Summary"       "as.data.frame" "as.env"        "as.factor"
R>
R># Get help for the OREbase package.
R> help("OREbase")      # Output not shown.
R>
R> # Get help for the ore virtual class.
R> help("ore-class")    # Output not shown.
R>
R># Show the subclasses of the ore virtual class.
R> showClass("ore")
Virtual Class "ore" [package "OREbase"]
 
No Slots, prototype of class "ore.vector"
 
Known Subclasses: 
Class "ore.vector", directly
Class "ore.frame", directly
Class "ore.matrix", directly
Class "ore.number", by class "ore.vector", distance 2
Class "ore.character", by class "ore.vector", distance 2
Class "ore.factor", by class "ore.vector", distance 2
Class "ore.date", by class "ore.vector", distance 2
Class "ore.datetime", by class "ore.vector", distance 2
Class "ore.difftime", by class "ore.vector", distance 2
Class "ore.logical", by class "ore.vector", distance 3
Class "ore.integer", by class "ore.vector", distance 3
Class "ore.numeric", by class "ore.vector", distance 3
Class "ore.tblmatrix", by class "ore.matrix", distance 2
Class "ore.vecmatrix", by class "ore.matrix", distance 2
R>
# Get help on the ore.frame class.
R> help("ore.frame")       # Output not shown.

R># Get help on the ore.vector class.
R> help("ore.vector")    # Output not shown.
R>
R># Show the arguments for the aggregate method.
R> showMethods("aggregate")
Function: aggregate (package stats)
x="ANY"
x="ore.vector"

# Get help on the aggregate method for an ore.vector object.
R> help("aggregate,ore.vector-method")      # Output not shown.

# Show the signatures for the merge method.
R> showMethods("merge")
Function: merge (package base)
x="ANY", y="ANY"
x="data.frame", y="ore.frame"
x="ore.frame", y="data.frame"
x="ore.frame", y="ore.frame

# Get help on the merge method for an ore.frame object.
R> help("merge,ore.frame,ore.frame-method")  # Output not shown.

R> showMethods("scale")
Function: scale (package base)
x="ANY"
x="ore.frame"
x="ore.number"
x="ore.tblmatrix"
x="ore.vecmatrix"

# Get help on the scale method for an ore.number object.
R> help("scale,ore.number-method")    # Output not shown.

# Get help on the ore.connect function.
R> help("ore.connect")                # Output not shown.

1.4 About Transparently Using R on Oracle Database Data

OML4R has overloaded open source R methods and functions that you can use to operate directly on data in an Oracle Database instance.

The methods and functions are in packages that implement a transparency layer that translates R functions into SQL.

The OML4R transparency layer packages and the limitations of converting R into SQL are described in the following topics:

1.4.1 About the Transparency Layer

The Oracle Machine Learning for R transparency layer is implemented by the OREbase, OREgraphics, and OREstats packages.

These OML4R packages contain overloaded methods of functions in the open source R base, graphics, and stats packages, respectively. The OML4R packages also contain OML4R versions of some of the open source R functions.

With the methods and functions in these packages, you can create R objects that specify data in an Oracle Database instance. When you execute an R expression that uses such an object, the method or function transparently generates a SQL query and sends it to the database. The database then executes the query and returns the results of the operation as an R object.

A database table or view is represented by an ore.frame object, which is a subclass of data.frame. Other OML4R classes inherit from corresponding R classes, such as ore.vector and vector. OML4R maps Oracle Database data types to OML4R classes, such as NUMBER to ore.integer.

You can use the transparency layer methods and functions to prepare database-resident data for analysis. You can then use functions in other OML4R packages to build and fit models and use them to score data. For large data sets, you can do the modeling and scoring using R engines embedded in Oracle Database.

See Also:

Example 1-2 Finding the Mean of the Petal Lengths by Species in R

This example illustrates the translation of an R function invocation into SQL. It uses the overloaded OML4R aggregate function to get the mean of the petal lengths from the IRIS_TABLE object.

ore.create(iris, table = 'IRIS_TABLE')
aggplen = aggregate(IRIS_TABLE$Petal.Length,
                    by = list(species = IRIS_TABLE$Species),
                    FUN = mean)
aggplen
Listing for This Example
R> ore.create(iris, table = 'IRIS_TABLE')
R> aggplen = aggregate(IRIS_TABLE$Petal.Length,
                       by = list(species = IRIS_TABLE$Species),
                       FUN = mean)
R> aggplen
              species     x
setosa         setosa 1.462
versicolor versicolor 4.260
virginica   virginica 5.552

Example 1-3 SQL Equivalent of the Previous Example

This example shows the SQL equivalent of the aggregate function in the previous example.

SELECT "Species", AVG("Petal.Length")
FROM IRIS_TABLE
GROUP BY "Species"
ORDER BY "Species";
 
Species     AVG("PETAL.LENGTH") 
----------- ------------------- 
setosa       1.4620000000000002 
versicolor   4.26
virginica    5.552

1.4.2 Transparency Layer Support for R Data Types and Classes

Oracle Machine Learning for R transparency layer has classes and data types that map R data types to Oracle Database data types.

Those classes and data types are described in the following topics:

1.4.2.1 About Oracle Machine Learning for R Data Types and Classes

OML4R has data types that map R data types to SQL data types.

In an R session, when you create database objects from R objects or you create R objects from database data, OML4R translates R data types to SQL data types and the reverse where possible.

OML4R creates objects that are instances of OML4R classes. OML4R overloads many standard R functions so that they use OML4R classes and data types. R language constructs and syntax are supported for objects that are mapped to Oracle Database objects.

Table 1-1 Mappings Between R, OML4R, and SQL Data Types

R Data Type OML4R Data Type SQL Data Type

character mode vector

ore.character

VARCHAR2

INTERVAL YEAR TO MONTH

integer mode vector

ore.integer

NUMBER

logical mode vector

ore.logical

The NUMBER 0 for FALSE and 1 for TRUE

numeric mode vector

ore.number

BINARY_DOUBLE

BINARY_FLOAT

FLOAT

NUMBER

Date

ore.date

DATE

POSIXct

POSIXlt

ore.datetime

TIMESTAMP

TIMESTAMP WITH TIME ZONE

TIMESTAMP WITH LOCAL TIME ZONE

difftime

ore.difftime

INTERVAL DAY TO SECOND

None

Not supported

LONG

LONG RAW

RAW

User defined data types

Reference data types

Note:

  • Objects of type ore.datetime do not support a time zone setting, instead they use the system time zone Sys.timezone if it is available or GMT if Sys.timezone is not available.

  • The SQL VARCHAR2 data type is mapped to the R character data type through the embedded R input data argument. Users can convert the character variable to a factor in R if needed by using as.factor().

1.4.2.2 About the ore.frame Class

An ore.frame object represents a relational query for an Oracle Database instance.

It is the OML4R equivalent of a data.frame. Typically, you get ore.frame objects that are proxies for database tables. You can then add new columns, or make other changes, to the ore.frame proxy object. Any such change does not affect the underlying table. If you then request data from the source table of the ore.frame object, the transparency layer function generates a SQL query that has the additional columns in the select list, but the table is not changed.

In R, the elements of a data.frame have an explicit order. You can specify elements by using integer indexing. In contrast, relational database tables do not define any order of rows and therefore cannot be directly mapped to R data structures.

OML4R has both ordered and unordered ore.frame objects. If a table has a primary key, which is a set of one or more columns that form a distinct tuple within a row, you can produce ordered results by performing a sort using an ORDER BY clause in a SELECT statement. However, ordering relational data can be expensive and is often unnecessary for transparency layer operations. For example, ordering is not required to compute summary statistics when invoking the summary function on an ore.frame.

Example 1-4 Classes of a data.frame and a Corresponding ore.frame

This example creates a data.frame with columns that contain different data types and displays the structure of the data.frame. The example then invokes the ore.push function to create a temporary table in the database that contains a copy of the data of the data.frame. The ore.push invocation also generates an ore.frame object that is a proxy for the table. The example displays the classes of the ore.frame object and of the columns in the data.frame and the ore.frame objects.

df <- data.frame(a="abc",
                 b=1.456,
                 c=TRUE,
                 d=as.integer(1),
                 e=Sys.Date(),
                 f=as.difftime(c("0:3:20", "11:23:15")))
ore.push(df)
class(of)
class(df$a
class(of$a)
class(df$b)
class(of$b)
class(df$c)
class(of$c)
class(df$d)
class(of$d)
class(df$e)
class(of$e)
class(df$f)
class(of$f)
Listing for Example 1-4
R> df <- data.frame(a="abc",
+                   b=1.456,
+                   c=TRUE,
+                   d=as.integer(1),
+                   e=Sys.Date(),
+                   f=as.difftime(c("0:3:20", "11:23:15")))
R> ore.push(df)
R> class(of)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> class(df$a)
[1] "factor"
R> class(of$a)
[1] "ore.factor"
attr(,"package")
[1] "OREbase"
R> class(df$b)
[1] "numeric"
R> class(of$b)
[1] "ore.numeric"
attr(,"package")
[1] "OREbase"
R> class(df$c)
[1] "logical"
R> class(of$c)
[1] "ore.logical"
attr(,"package")
[1] "OREbase"
R> class(df$d)
[1] "integer"
R> class(of$d)
[1] "ore.integer"
attr(,"package")
[1] "OREbase"
R> class(df$e)
[1] "Date"
R> class(of$e)
[1] "ore.date"
attr(,"package")
[1] "OREbase"
R> class(df$f)
[1] "difftime"
R> class(of$f)
[1] "ore.difftime"
attr(,"package")
[1] "OREbase"
1.4.2.3 Support for R Naming Conventions

OML4R uses R naming conventions for ore.frame columns instead of the more restrictive Oracle Database naming conventions.

The column names of an ore.frame can be longer than 30 bytes, can contain double quotes, and can be non-unique.

1.4.2.4 About Coercing R and Oracle Machine Learning for R Class Types

Some OML4R functions coerce R objects and class types to OML4R ore objects and types.

The generic as.ore function coerces in-memory R objects to ore objects. The more specific functions, such as as.ore.character, coerce objects to specific types. The ore.push function implicitly coerces R class types to ore class types and the ore.pull function coerces ore class types to R class types. For information on those functions, see "Moving Data to and from the Database".

Example 1-5 Coercing R and OML4R Class Types

This example illustrates coercing R objects to ore objects. creates an R integer object and then uses the generic method as.ore to coerce it to an ore object, which is an ore.integer. The example coerces the R object to various other ore class types. For an example of using as.factor in embedded R execution function, see Example 6-13.

x <- 1:10
class(x)
X <- as.ore(x)
class(X)
Xn <- as.ore.numeric(x)
class(Xn)
Xc <- as.ore.character(x)
class(Xc)
Xc
Xf <- as.ore.factor(x)
Xf
Listing for Example 1-5
R> x <- 1:10
R> class(x)
[1] "integer"
R> X <- as.ore(x)
R> class(X)
[1] "ore.integer"
attr(,"package")
[1] "OREbase"
R> Xn <- as.ore.numeric(x)
R> class(Xn)
[1] "ore.numeric"
attr(,"package")
[1] "OREbase"
R> Xc <- as.ore.character(x)
R> class(Xc)
[1] "ore.character"
attr(,"package")
[1] "OREbase"
R> Xc
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
R> Xf <- as.ore.factor(x)
R> Xf
 [1] 1  2  3  4  5  6  7  8  9  10
Levels: 1 10 2 3 4 5 6 7 8 9

1.5 Typical Operations in Using Oracle Machine Learning for R

In using OML4R, the following is a typical progression of operations:

  1. In an R session, connect to a schema in an Oracle Database instance.

  2. Attach the schema and synchronize with the schema objects, which generates OML4R proxy objects for database tables.

  3. Prepare the data for analysis and possibly perform exploratory data analysis and data visualization.

  4. Build models using functions in the OREmodels or OREdm packages.

  5. Score data using the models either in your local R session or by using embedded R execution.

  6. Deploy the results of the analysis to end users.

Figure 1-1 Typical OML4R Workflow

This figure illustrates these steps and typical reiterations of them.

Description of Figure 1-1 follows
Description of "Figure 1-1 Typical OML4R Workflow"

1.6 Oracle Machine Learning for R Global Options

OML4R has global options that affect various functions.

Table 1-2 lists the OML4R global options and descriptions of them.

Table 1-2 OML4R Global Options

Global Description

ore.envAsEmptyenv

A logical value that specifies whether an environment referenced in an object should be replaced with an empty environment during serialization to an Oracle Database. When TRUE, the referenced environment in the object is replaced with an empty environment whose parent is .GlobalEnv, and the objects in the original referenced environment are not serialized. In some cases, this can significantly reduce the size of serialized objects. When FALSE, all of the objects in the referenced environment are serialized, and can be unserialized and loaded into memory. The default value for this option is FALSE.

The following OML4R functions use this global option:

  • ore.push, in saving a serialized list object to the database

  • ore.save, in saving objects to an OML4R datastore

  • ore.doEval and the other embedded R execution functions for serializing parameters of list type and for serializing some objects returned by an R function during embedded R execution

ore.na.extract

A logical value used during logical subscripting of an ore.frame or ore.vector object. When TRUE, rows or elements with an NA logical subscript produce rows or elements with NA values, which mimics how R treats missing value logical subscripting of data.frame and vector objects.

When FALSE, an NA logical subscript is interpreted as a FALSE value, resulting in the removal of the corresponding row or element. The default value is FALSE.

ore.parallel

A preferred degree of parallelism to use in embedded R execution. One of the following:

  • A positive integer greater than or equal to 2 for a specific degree of parallelism

  • FALSE or 1 for no parallelism

  • TRUE for the default parallelism of the data argument

  • NULL for the database default for the operation

The default value is NULL.

ore.sep

A character string that specifies the separator to use between multiple column row names of an ore.frame. The default value is |.

ore.trace

A logical value that specifies whether iterative OML4R functions should print output at each iteration. The default value is FALSE.

ore.warn.order

A logical value that specifies whether OML4R displays a warning message when an ore.frame that lacks row names or an ore.vector that lacks element names is used in a function that requires ordering. The default value is TRUE.

See Also: