3 Prepare and Explore Data in the Database

Use Oracle Machine Learning for R functions to prepare data for analysis and to perform exploratory analysis of the data.

These functions make it easier for you to prepare very large enterprise database-resident data for modeling. They are described the following topics:

3.1 Prepare Data in the Database Using Oracle Machine Learning for R

Using OML4R, you can prepare data for analysis in the database.

Data preparation is described in the following topics:

3.1.1 About Preparing Data in the Database

Oracle Machine Learning for R provides functions that enable you to use R to prepare database data for analysis.

Using these functions, you can perform typical data preparation tasks on ore.frame and other OML4R objects. You can perform data preparation operations on large quantities of data in the database and then pull the results to your local R session for analysis using functions in packages available from The Comprehensive R Archive Network (CRAN).

You can do operations on data such as the following.

  • Selecting

  • Binning

  • Sampling

  • Sorting and Ordering

  • Summarizing

  • Transforming

  • Performing data preparation operations on date and time data

Performing these operations is described in the other topics in this chapter.

3.1.2 Select Data

A typical step in preparing data for analysis is selecting or filtering values of interest from a larger data set.

The examples in this topic demonstrate selecting data from an ore.frame object by column, by row, and by value. The examples are in the following topics:

3.1.2.1 Select Data by Column

This example selects columns from an ore.frame object.

Example 3-1 Selecting Data by Column

This example first creates a temporary database table, with the corresponding proxy ore.frame object iris_of, from the iris data.frame object. It displays the first three rows of iris_of. The example selects two columns from iris_of and creates the ore.frame object iris_projected with them. It then displays the first three rows of iris_projected.

iris_of <- ore.push(iris)
head(iris_of, 3)

iris_projected = iris_of[, c("Petal.Length", "Species")]

head (iris_projected, 3)
Listing for This Example
iris_of <- ore.push(iris)
head(iris_of, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

R> iris_projected = iris_of[, c("Petal.Length", "Species")]

R> head (iris_projected, 3)
  Petal.Length Species
1          1.4  setosa
2          1.4  setosa
3          1.3  setosa
3.1.2.2 Select Data by Row

This example selects rows from an ordered ore.frame object.

Example 3-2 Selecting Data by Row

This example first adds a column to the iris data.frame object for use in creating an ordered ore.frame object. It invokes the ore.drop function to delete the database table IRIS_TABLE, if it exists. It then creates a database table, with the corresponding proxy ore.frame object IRIS_TABLE, from the iris data.frame. The example invokes the ore.exec function to execute a SQL statement that makes the RID column the primary key of the database table. It then invokes the ore.sync function to synchronize the IRIS_TABLE ore.frame object with the table and displays the first three rows of the proxy ore.frame object.

The example next selects 51 rows from IRIS_TABLE by row number and creates the ordered ore.frame object iris_selrows with them. It displays the first six rows of iris_selrows. It then selects 3 rows by row name and displays the result.

# Add a column to the iris data set to use as row identifiers.
iris$RID <- as.integer(1:nrow(iris) + 100)
ore.drop(table = 'IRIS_TABLE')
ore.create(iris, table = 'IRIS_TABLE')
ore.exec("alter table IRIS_TABLE add constraint IRIS_TABLE 
          primary key (\"RID\")")
ore.sync(table = "IRIS_TABLE")
head(IRIS_TABLE, 3)

# Select rows by row number. 
iris_selrows <- IRIS_TABLE[50:100,]
head(iris_selrows)

# Select rows by row name.
IRIS_TABLE[c("101", "151", "201"),]
Listing for This Example
R> # Add a column to the iris data set to use as row identifiers.
R> iris$RID <- as.integer(1:nrow(iris) + 100)
R> ore.drop(table = 'IRIS_TABLE')
R> ore.create(iris, table = 'IRIS_TABLE')
R> ore.exec("alter table IRIS_TABLE add constraint IRIS_TABLE 
+            primary key (\"RID\")")
R> ore.sync(table = "IRIS_TABLE")
R> head(IRIS_TABLE, 3)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species RID
101          5.1         3.5          1.4         0.2  setosa 101
102          4.9         3.0          1.4         0.2  setosa 102
103          4.7         3.2          1.3         0.2  setosa 103
R> # Select rows by row number. 
R> iris_selrows <- IRIS_TABLE[50:100,]
R> head(iris_selrows)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species RID
150          5.0         3.3          1.4         0.2     setosa 150
151          7.0         3.2          4.7         1.4 versicolor 151
152          6.4         3.2          4.5         1.5 versicolor 152
153          6.9         3.1          4.9         1.5 versicolor 153
154          5.5         2.3          4.0         1.3 versicolor 154
155          6.5         2.8          4.6         1.5 versicolor 155
R> # Select rows by row name.
R> IRIS_TABLE[c("101", "151", "201"),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species RID
101          5.1         3.5          1.4         0.2     setosa 101
151          7.0         3.2          4.7         1.4 versicolor 151
201          6.3         3.3          6.0         2.5  virginica 201
3.1.2.3 Select Data by Value

This example selects portions of a data set.

Example 3-3 Selecting Data by Value

The example pushes the iris data set to the database and gets the ore.frame object iris_of. It filters the data to produce iris_of_filtered, which contains the values from the rows of iris_of that have a petal length of less than 1.5 and that are in the Sepal.Length and Species columns. The example also filters the data using conditions, so that iris_of_filtered contains the values from iris_of that are of the setosa or versicolor species and that have a petal width of less than 2.0.

iris_of <- ore.push(iris)
# Select sepal length and species where petal length is less than 1.5.
iris_of_filtered <- iris_of[iris_of$Petal.Length < 1.5,
                            c("Sepal.Length", "Species")]
names(iris_of_filtered)
nrow(iris_of_filtered)
head(iris_of_filtered, 3)
# Alternate syntax filtering.
iris_of_filtered <- subset(iris_of, Petal.Length < 1.5)
nrow(iris_of_filtered)
head(iris_of_filtered, 3)
# Using the AND and OR conditions in filtering.
# Select all rows with in which the species is setosa or versicolor.
# and the petal width is less than 2.0.
iris_of_filtered <- iris_of[(iris_of$Species == "setosa" |
                             iris_of$Species == "versicolor") &
                             iris_of$Petal.Width < 2.0,]
nrow(iris_of_filtered)
head(iris_of, 3)
Listing for This Example
R> iris_of <- ore.push(iris)
R> # Select sepal length and species where petal length is less than 1.5.
R> iris_of_filtered <- iris_of[iris_of$Petal.Length < 1.5,
+                              c("Sepal.Length", "Species")]
R> names(iris_of_filtered)
[1] "Sepal.Length" "Species"
R> nrow(iris_of_filtered)
[1] 24
R> head(iris_of_filtered, 3)
  Sepal.Length Species
1          5.1  setosa
2          4.9  setosa
3          4.7  setosa
R> # Alternate syntax filtering.
R> iris_of_filtered <- subset(iris_of, Petal.Length < 1.5)
R> nrow(iris_of_filtered)[1] 24
R> head(iris_of_filtered, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
R> # Using the AND and OR conditions in filtering.
R> # Select all rows with in which the species is setosa or versicolor.
R> # and the petal width is less than 2.0.
R> iris_of_filtered <- iris_of[(iris_of$Species == "setosa" |
+                              iris_of$Species == "versicolor") &
+                              iris_of$Petal.Width < 2.0,]
R> nrow(iris_of_filtered)[1] 100
R> head(iris_of, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

3.1.3 Index Data

You can use integer or character vectors to index an ordered ore.frame object.

You can use the indexing to perform sampling and partitioning, as described in "Sampling Data" and "Partitioning Data".

Oracle Machine Learning for R supports functionality similar to R indexing with these differences:

  • Integer indexing is not supported for ore.vector objects.

  • Negative integer indexes are not supported.

  • Row order is not preserved.

Example 3-4 Indexing an ore.frame Object

This example demonstrates character and integer indexing. The example uses the ordered SPAM_PK ore.frame object from Example 2-13. The example shows that you can access rows by name and that you can also access a set of rows by supplying a vector of character row names. The example then shows that you can supply the actual integer value. In the example this results in a set of different rows because the USERID values start at 1001, as opposed to 1.

# Index to a specifically named row.
SPAM_PK["2060", 1:4]
# Index to a range of rows by row names.
SPAM_PK[as.character(2060:2064), 1:4]
# Index to a range of rows by integer index.
SPAM_PK[2060:2063, 1:4]
Listing for This Example
R> # Index to a specifically named row.
R> SPAM_PK["2060", 1:4]
       TS USERID make address
2060 2060    380    0       0
R> # Index to a range of rows by row names.
R> SPAM_PK[as.character(2060:2064), 1:4]
       TS USERID make address
2060 2060    380    0       0
2061 2061    381    0       0
2062 2062    381    0       0
2063 2063    382    0       0
2064 2064    382    0       0
R> # Index to a range of rows by integer index.
R> SPAM_PK[2060:2063, 1:4]
       TS USERID make address
3060 3060    380 0.00    0.00
3061 3061    381 0.00    1.32
3062 3062    381 0.00    2.07
3063 3063    382 0.34    0.00

3.1.4 Combine Data

You can join data from ore.frame objects that represent database tables by using the merge function.

Example 3-5 Joining Data from Two Tables

This example creates two data.frame objects and merges them. It then invokes the ore.create function to create a database table for each data.frame object. The ore.create function automatically generates an ore.frame object as a proxy object for the table. The ore.frame object has the same name as the table. The example merges the ore.frame objects. Note that the order of the results of the two merge operations is not the same because the ore.frame objects are unordered.

# Create data.frame objects.
df1 <- data.frame(x1=1:5, y1=letters[1:5])
df2 <- data.frame(x2=5:1, y2=letters[11:15])

# Combine the data.frame objects.
merge (df1, df2, by.x="x1", by.y="x2")

# Create database tables and ore.frame proxy objects to correspond to
# the local R objects df1 and df2.
ore.create(df1, table="DF1_TABLE")
ore.create(df2, table="DF2_TABLE")

# Combine the ore.frame objects.
merge (DF1_TABLE, DF2_TABLE, by.x="x1", by.y="x2")
Listing for This Example
R> # Create data.frame objects.
R> df1 <- data.frame(x1=1:5, y1=letters[1:5])
R> df2 <- data.frame(x2=5:1, y2=letters[11:15])

R> # Combine the data.frame objects.
R> merge (df1, df2, by.x="x1", by.y="x2")
  x1 y1 y2
1  1  a  o
2  2  b  n
3  3  c  m
4  4  d  l
5  5  e  k

R> # Create database tables and ore.frame proxy objects to correspond to
R> # the local R objects df1 and df2.
R> ore.create(df1, table="DF1_TABLE")
R> ore.create(df2, table="DF2_TABLE")

R> # Combine the ore.frame objects.
R> merge (DF1_TABLE, DF2_TABLE, by.x="x1", by.y="x2")
  x1 y1 y2
1  5  e  k
2  4  d  l
3  3  c  m
4  2  b  n
5  1  a  o
Warning message:
ORE object has no unique key - using random order

3.1.5 Summarize Data

Summarize data with the aggregate function.

Example 3-6 Aggregating Data

This example pushes the iris data set to database memory as the ore.frame object iris_of. It aggregates the values of iris_of by the Species column using the length function. It then displays the first three rows of the result.

# Create a temporary database table from the iris data set and get an ore.frame.
iris_of <- ore.push(iris)
aggdata <- aggregate(iris_of$Sepal.Length,
                     by = list(species = iris_of$Species),
                     FUN = length)
head(aggdata, 3)
Listing for This Example
# Create a temporary database table from the iris data set and get an ore.frame.
R> iris_of <- ore.push(iris)
R> aggdata <- aggregate(iris_of$Sepal.Length,
+                      by = list(species = iris_of$Species),
+                      FUN = length)
R> head(aggdata, 3)
             species  x
setosa         setosa 50
versicolor versicolor 50
virginica   virginica 50

3.1.6 Transform Data

In preparing data for analysis, a typical step is to transform data by reformatting it or deriving new columns and adding them to the data set.

The examples in this topic demonstrate two ways of formatting data and deriving columns.

Example 3-7 Formatting Data

This example creates a function to format the data in a column.

# Create a function for formatting data.
petalCategory_fmt <- function(x) {
    ifelse(x > 5, 'LONG',
    ifelse(x > 2, 'MEDIUM', 'SMALL'))
  }
# Create an ore.frame in database memory with the iris data set.
iris_of <- ore.push(iris)
# Select some rows from iris_of.
iris_of[c(10, 20, 60, 80, 110, 140),]
# Format the data in Petal.Length column.
iris_of$Petal.Length <- petalCategory_fmt(iris_of$Petal.Length)
# Select the same rows from iris_of.

Listing for This Example

R> # Create a function for formatting data.
R> petalCategory_fmt <- function(x) {
+    ifelse(x > 5, 'LONG',
+    ifelse(x > 2, 'MEDIUM', 'SMALL'))
+ }
R> # Create an ore.frame in database memory with the iris data set.
R> iris_of <- ore.push(iris)
R> # Select some rows from iris_of.
R> iris_of[c(10, 20, 60, 80, 110, 140),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
10           4.9         3.1          1.5         0.1     setosa
20           5.1         3.8          1.5         0.3     setosa
60           5.2         2.7          3.9         1.4 versicolor
80           5.7         2.6          3.5         1.0 versicolor
110          7.2         3.6          6.1         2.5  virginica
140          6.9         3.1          5.4         2.1  virginica
R> # Format the data in Petal.Length column.
R> iris_of$Petal.Length <- petalCategory_fmt(iris_of$Petal.Length)
R> # Select the same rows from iris_of.
R> iris_of[c(10, 20, 60, 80, 110, 140),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
10           4.9         3.1        SMALL         0.1     setosa
20           5.1         3.8        SMALL         0.3     setosa
60           5.2         2.7       MEDIUM         1.4 versicolor
80           5.7         2.6       MEDIUM         1.0 versicolor
110          7.2         3.6         LONG         2.5  virginica
140          6.9         3.1         LONG         2.1  virginica

Example 3-8 Using the transform Function

This example does the same thing as the previous example except that it uses the transform function to reformat the data in a column of the data set.

# Create an ore.frame in database memory with the iris data set.
iris_of2 <- ore.push(iris)
# Select some rows from iris_of.
iris_of2[c(10, 20, 60, 80, 110, 140),]
iris_of2 <- transform(iris_of2,
                  Petal.Length = ifelse(Petal.Length > 5, 'LONG',
                                 ifelse(Petal.Length > 2, 'MEDIUM', 'SMALL')))
iris_of2[c(10, 20, 60, 80, 110, 140),]

Listing for This Example

R> # Create an ore.frame in database memory with the iris data set.
R> iris_of2 <- ore.push(iris)
R> # Select some rows from iris_of.
R> iris_of2[c(10, 20, 60, 80, 110, 140),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
10           4.9         3.1          1.5         0.1     setosa
20           5.1         3.8          1.5         0.3     setosa
60           5.2         2.7          3.9         1.4 versicolor
80           5.7         2.6          3.5         1.0 versicolor
110          7.2         3.6          6.1         2.5  virginica
140          6.9         3.1          5.4         2.1  virginica
R> iris_of2 <- transform(iris_of2,
+                  Petal.Length = ifelse(Petal.Length > 5, 'LONG',
+                                 ifelse(Petal.Length > 2, 'MEDIUM', 'SMALL')))
R> iris_of2[c(10, 20, 60, 80, 110, 140),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
10           4.9         3.1        SMALL         0.1     setosa
20           5.1         3.8        SMALL         0.3     setosa
60           5.2         2.7       MEDIUM         1.4 versicolor
80           5.7         2.6       MEDIUM         1.0 versicolor
110          7.2         3.6         LONG         2.5  virginica
140          6.9         3.1         LONG         2.1  virginica

Example 3-9 Adding Derived Columns

This example uses the transform function to add a derived column to the data set and then to add additional columns to it.

# Set the page width.
options(width = 80)
# Create an ore.frame in database memory with the iris data set.
iris_of <- ore.push(iris)
names(iris_of)
# Add one column derived from another
iris_of <- transform(iris_of, LOG_PL = log(Petal.Length))
names(iris_of)
head(iris_of, 3)
# Add more columns.
iris_of <- transform(iris_of,
                    SEPALBINS = ifelse(Sepal.Length < 6.0, "A", "B"),
                    PRODUCTCOLUMN = Petal.Length * Petal.Width,
                    CONSTANTCOLUMN = 10)
names(iris_of)
# Select some rows of iris_of.
iris_of[c(10, 20, 60, 80, 110, 140),]

Listing for This Example

R> # Set the page width.
R> options(width = 80)
R> # Create an ore.frame in database memory with the iris data set.
R> iris_of <- ore.push(iris)
R> names(iris_of)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species" 
R> # Add one column derived from another
R> iris_of <- transform(iris_of, LOG_PL = log(Petal.Length))
R> names(iris_of)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
[6] "LOG_PL"
R> head(iris_of, 3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species    LOG_PL
1          5.1         3.5          1.4         0.2  setosa 0.3364722
2          4.9         3.0          1.4         0.2  setosa 0.3364722
3          4.7         3.2          1.3         0.2  setosa 0.2623643
R> # Add more columns.
R> iris_of <- transform(iris_of,
                    SEPALBINS = ifelse(Sepal.Length < 6.0, "A", "B"),
                    PRODUCTCOLUMN = Petal.Length * Petal.Width,
                    CONSTANTCOLUMN = 10)
R> names(iris_of)
[1] "Sepal.Length"   "Sepal.Width"    "Petal.Length"   "Petal.Width"   
[5] "Species"        "LOG_PL"         "CONSTANTCOLUMN" "SEPALBINS"     
[9] "PRODUCTCOLUMN"
R> # Select some rows of iris_of.
R> iris_of[c(10, 20, 60, 80, 110, 140),]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species    LOG_PL
10           4.9         3.1          1.5         0.1     setosa 0.4054651
20           5.1         3.8          1.5         0.3     setosa 0.4054651
60           5.2         2.7          3.9         1.4 versicolor 1.3609766
80           5.7         2.6          3.5         1.0 versicolor 1.2527630
110          7.2         3.6          6.1         2.5  virginica 1.8082888
140          6.9         3.1          5.4         2.1  virginica 1.6863990
    CONSTANTCOLUMN SEPALBINS PRODUCTCOLUMN
10              10         A          0.15
20              10         A          0.45
60              10         A          5.46
80              10         A          3.50
110             10         B         15.25
140             10         B         11.34

3.1.7 Sample Data

Sampling is an important capability for statistical analytics.

Typically, you sample data to reduce its size and to perform meaningful work on it. In R you usually must load data into memory to sample it. However, if the data is too large, this isn't possible.

In OML4R, instead of pulling the data from the database and then sampling, you can sample directly in the database and then pull only those records that are part of the sample. By sampling in the database, you minimize data movement and you can work with larger data sets. Note that it is the ordering framework integer row indexing in the transparency layer that enables this capability.

Note:

Sampling requires using ordered ore.frame objects as described in Creating Ordered and Unordered ore.frame Objects.

The examples in this section illustrate several sampling techniques.

Example 3-10 Simple Random Sampling

This example demonstrates a simple selection of rows at random. The example creates a small data.frame object and pushes it to the database to create an ore.frame object, MYDATA. Out of 20 rows, the example samples 5. It uses the R sample function to produce a random set of indices that it uses to get the sample from MYDATA. The sample, simpleRandomSample, is an ore.frame object.

set.seed(1)
N <- 20
myData <- data.frame(a=1:N,b=letters[1:N])
MYDATA <- ore.push(myData)
head(MYDATA)
sampleSize <- 5
simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE]
class(simpleRandomSample)
simpleRandomSample

Listing for This Example

R> set.seed(1)
R> N <- 20
R> myData <- data.frame(a=1:N,b=letters[1:N])
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
R> sampleSize <- 5
R> simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE]
R> class(simpleRandomSample)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> simpleRandomSample
    a b
2   2 b
7   7 g
10 10 j
12 12 l
19 19 s

Example 3-11 Split Data Sampling

This example demonstrates randomly partitioning data into training and testing sets. This splitting of the data is normally done in classification and regression to assess how well a model performs on new data. The example uses the MYDATA object created in the previous example.

This example produces a sample set of indices to use as the test data set. It then creates the logical vector group that is TRUE if the index is in the sample and is FALSE otherwise. Next, it uses row indexing to produce the training set where the group is FALSE and the test set where the group is TRUE. Notice that the number of rows in the training set is 15 and the number of rows in the test set is 5, as specified in the invocation of the sample function.

set.seed(1)
sampleSize <- 5
ind <- sample(1:nrow(MYDATA), sampleSize)
group <- as.integer(1:nrow(MYDATA) %in% ind)
MYDATA.train <- MYDATA[group==FALSE,]
dim(MYDATA.train)
MYDATA.test <- MYDATA[group==TRUE,]
dim(MYDATA.test)

Listing for This Example

R> set.seed(1)
R> sampleSize <- 5
R> ind <- sample(1:nrow(MYDATA), sampleSize)
R> group <- as.integer(1:nrow(MYDATA) %in% ind)
R> MYDATA.train <- MYDATA[group==FALSE,]
dim(MYDATA.train)
[1] 15  2 
R> MYDATA.test <- MYDATA[group==TRUE,]
R> dim(MYDATA.test)
[1] 5 2

Example 3-12 Systematic Sampling

This example demonstrates systematic sampling, in which rows are selected at regular intervals. The example uses the seq function to create a sequence of values that start at 2 and increase by increments of 3. The number of values in the sequence is equal to the number of rows in MYDATA. The MYDATA object is created in the first example.

set.seed(1)
N <- 20
myData <- data.frame(a=1:20,b=letters[1:N])
MYDATA <- ore.push(myData)
head(MYDATA)
start <- 2
by <- 3
systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE]
systematicSample

Listing for This Example

R> set.seed(1)
R> N <- 20
R> myData <- data.frame(a=1:20,b=letters[1:N])
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
R> start <- 2
R> by <- 3
R> systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE]
systematicSample
    a b
2   2 b
5   5 e
8   8 h
11 11 k
14 14 n
17 17 q
20 20 t

Example 3-13 Stratified Sampling

This example demonstrates stratified sampling, in which rows are selected within each group where the group is determined by the values of a particular column. The example creates a data set that has each row assigned to a group. The function rnorm produces random normal numbers. The argument 4 is the desired mean for the distribution. The example splits the data according to group and then samples proportionately from each partition. Finally, it row binds the list of subset ore.frame objects into a single ore.frame object and then displays the values of the result, stratifiedSample.

set.seed(1)
N <- 200
myData <- data.frame(a=1:N,b=round(rnorm(N),2),
                     group=round(rnorm(N,4),0))
MYDATA <- ore.push(myData)
head(MYDATA)
sampleSize <- 10
stratifiedSample <- do.call(rbind,
                            lapply(split(MYDATA, MYDATA$group),
                                   function(y) {
                                   ny <- nrow(y)
                                   y[sample(ny, sampleSize*ny/N), , drop = FALSE]
                              }))
stratifiedSample

Listing for This Example

R> set.seed(1)
R> N <- 200
R> myData <- data.frame(a=1:N,b=round(rnorm(N),2),
+                       group=round(rnorm(N,4),0))
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
  a     b group
1 1 -0.63     4
2 2  0.18     6
3 3 -0.84     6
4 4  1.60     4
5 5  0.33     2
6 6 -0.82     6
R> sampleSize <- 10
R> stratifiedSample <- do.call(rbind,
+                              lapply(split(MYDATA, MYDATA$group),
+                                function(y) {
+                                  ny <- nrow(y)
+                                  y[sample(ny, sampleSize*ny/N), , drop = FALSE]
+                             }))
R> stratifiedSample
          a     b group
173|173 173  0.46     3
9|9       9  0.58     4
53|53    53  0.34     4
139|139 139 -0.65     4
188|188 188 -0.77     4
78|78    78  0.00     5
137|137 137 -0.30     5

Example 3-14 Cluster Sampling

This example demonstrates cluster sampling, in which entire groups are selected at random. The example splits the data according to group and then samples among the groups and row binds into a single ore.frame object. The resulting sample has data from two clusters, 6 and 7.

set.seed(1)
N <- 200
myData <- data.frame(a=1:N,b=round(runif(N),2),
                     group=round(rnorm(N,4),0))
MYDATA <- ore.push(myData)
head(MYDATA)
sampleSize <- 5
clusterSample <- do.call(rbind,
                         sample(split(MYDATA, MYDATA$group), 2))
unique(clusterSample$group)

Listing for This Example

R> set.seed(1)
R> N <- 200
R> myData <- data.frame(a=1:N,b=round(runif(N),2),
+                       group=round(rnorm(N,4),0))
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
  a    b group
1 1 0.27     3
2 2 0.37     4
3 3 0.57     3
4 4 0.91     4
5 5 0.20     3
6 6 0.90     6
R> sampleSize <- 5
R> clusterSample <- do.call(rbind,
+                           sample(split(MYDATA, MYDATA$group), 2))
R> unique(clusterSample$group)
[1] 6 7

Example 3-15 Quota Sampling

This example demonstrates quota sampling, in which a consecutive number of records are selected as the sample. The example uses the head function to select the sample. The tail function could also have been used.

set.seed(1)
N <- 200
myData <- data.frame(a=1:N,b=round(runif(N),2))
MYDATA <- ore.push(myData)                     
sampleSize <- 10
quotaSample1 <- head(MYDATA, sampleSize)
quotaSample1

Listing for This Example

R> set.seed(1)
R> N <- 200
R> myData <- data.frame(a=1:N,b=round(runif(N),2))
R> MYDATA <- ore.push(myData)                     
R> sampleSize <- 10
R> quotaSample1 <- head(MYDATA, sampleSize)
R> quotaSample1
    a    b
1   1 0.15
2   2 0.75
3   3 0.98
4   4 0.97
5   5 0.35
6   6 0.39
7   7 0.95
8   8 0.11
9   9 0.93
10 10 0.35

3.1.8 Partition Data

In analyzing large data sets, a typical operation is to randomly partition the data set into subsets.

You can analyze the partitions by using OML4R embedded R execution, as shown in the following example.

Example 3-16 Randomly Partitioning Data

This example creates a data.frame object with the symbol myData in the local R session and adds a column to it that contains a randomly generated set of values. It pushes the data set to database memory as the object MYDATA. The example invokes the embedded R execution function ore.groupApply, which partitions the data based on the partition column and then applies the lm function to each partition.

N <- 200
k <- 5
myData <- data.frame(a=1:N,b=round(runif(N),2))
myData$partition <- sample(rep(1:k, each = N/k,
                               length.out = N), replace = TRUE)
MYDATA <- ore.push(myData)
head(MYDATA)
results <- ore.groupApply(MYDATA, MYDATA$partition,
                          function(y) {lm(b~a,y)}, parallel = TRUE)
length(results)
results[[1]]
Listing for This Example
R> N <- 200
R> k <- 5
R> myData <- data.frame(a=1:N,b=round(runif(N),2))
R> myData$partition <- sample(rep(1:k, each = N/k,
+                             length.out = N), replace = TRUE)
R> MYDATA <- ore.push(myData)
R> head(MYDATA)
  a    b partition
1 1 0.89         2
2 2 0.31         4
3 3 0.39         5
4 4 0.66         3
5 5 0.01         1
6 6 0.12         4
R> results <- ore.groupApply(MYDATA, MYDATA$partition,
+                            function(y) {lm(b~a,y)}, parallel = TRUE)
R> length(results)
[1] 5
R> results[[1]]
 
Call:
lm(formula = b ~ a, data = y)
 
Coefficients:
(Intercept)            a  
   0.388795     0.001015

3.1.9 Prepare Time Series Data

OML4R provides you with the ability to perform many data preparation operations on time series data, such as filtering, ordering, and transforming the data.

OML4R maps R data types to SQL data types, which allows you to create OML4R objects and perform data preparation operations in database memory. The following examples demonstrate some operations on time series data.

Example 3-17 Aggregating Date and Time Data

This example illustrates some of the statistical aggregation functions. For a data set, the example first generates on the local client a sequence of five hundred dates spread evenly throughout 2001. It then introduces a random difftime and a vector of random normal values. The example then uses the ore.push function to create MYDATA, an in-database version of the data. The example invokes the class function to show that MYDATA is an ore.frame object and that the datetime column is of class ore.datetime. The example displays the first three rows of the generated data. It then uses the statistical aggregation operations of min, max, range, median, and quantile on the datetime column of MYDATA.

N <- 500
mydata <- data.frame(datetime = 
                seq(as.POSIXct("2001/01/01"),
                    as.POSIXct("2001/12/31"),
                    length.out = N),
                difftime = as.difftime(runif(N),
                                 units = "mins"),
                x = rnorm(N))
MYDATA <- ore.push(mydata)
class(MYDATA)
class(MYDATA$datetime)
head(MYDATA,3)
# statistical aggregations
min(MYDATA$datetime)
max(MYDATA$datetime)
range(MYDATA$datetime)
quantile(MYDATA$datetime,
         probs = c(0, 0.05, 0.10))

Listing for This Example

R> N <- 500
R> mydata <- data.frame(datetime = 
+             seq(as.POSIXct("2001/01/01"),
+                 as.POSIXct("2001/12/31"),
+                 length.out = N),
+             difftime = as.difftime(runif(N),
+                              units = "mins"),
+             x = rnorm(N))
R> MYDATA <- ore.push(mydata)
R> class(MYDATA)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> class(MYDATA$datetime)
[1] "ore.datetime"
attr(,"package")
[1] "OREbase"
R> head(MYDATA,3)
             datetime       difftime           x
1 2001-01-01 00:00:00 16.436782 secs  0.68439244
2 2001-01-01 17:30:25  8.711562 secs  1.38481435
3 2001-01-02 11:00:50  1.366927 secs -0.00927078

R> # statistical aggregations
R> min(MYDATA$datetime)
[1] "2001-01-01 CST"
R> max(MYDATA$datetime)
[1] "2001-12-31 CST"
R> range(MYDATA$datetime)
[1] "2001-01-01 CST" "2001-12-31 CST"
R> quantile(MYDATA$datetime,
+           probs = c(0, 0.05, 0.10))
                       0%                        5%                       10% 
"2001-01-01 00:00:00 CST" "2001-01-19 04:48:00 CST" "2001-02-06 09:36:00 CST"

Example 3-18 Using Date and Time Arithmetic

This example creates a one day shift by taking the datetime column of the MYDATA ore.frame object created in the previous example and adding a difftime of one day. The result is day1Shift, which the example shows is of class ore.datetime. The example displays the first three elements of the datetime column of MYDATA and those of day1Shift. The first element of day1Shift is January 2, 2001.

This example also computes lag differences using the overloaded diff function. The difference between the dates is all the same because the 500 dates in MYDATA are evenly distributed throughout 2001.

day1Shift <- MYDATA$datetime + as.difftime(1, units = "days")
class(day1Shift)
head(MYDATA$datetime,3)
head(day1Shift,3)
lag1Diff <- diff(MYDATA$datetime)
class(lag1Diff)
head(lag1Diff,3)

Listing for This Example

R> day1Shift <- MYDATA$datetime + as.difftime(1, units = "days")
R> class(day1Shift)
[1] "ore.datetime"
attr(,"package")
[1] "OREbase"
R> head(MYDATA$datetime,3)
[1] "2001-01-01 00:00:00 CST" "2001-01-01 17:30:25 CST" "2001-01-02 11:00:50 CST"
R> head(day1Shift,3)
[1] "2001-01-02 00:00:00 CST" "2001-01-02 17:30:25 CST" "2001-01-03 11:00:50 CST"
R> lag1Diff <- diff(MYDATA$datetime)
R> class(lag1Diff)
[1] "ore.difftime"
attr(,"package")
[1] "OREbase"
R> head(lag1Diff,3)
Time differences in secs
[1] 63025.25 63025.25 63025.25

Example 3-19 Comparing Dates and Times

This example demonstrates date and time comparisons. The example uses the datetime column of the MYDATA ore.frame object created in the first example. This example selects the elements of MYDATA that have a date earlier than April 1, 2001. The resulting isQ1 is of class ore.logical and for the first three entries the result is TRUE. The example finds out how many dates matching isQ1 are in March. It then sums the logical vector and displays the result, which is that 43 rows are in March. The example next filters rows based on dates that are the end of the year, after December 27. The result is eoySubset, which is an ore.frame object. The example displays the first three rows returned in eoySubset.

isQ1 <- MYDATA$datetime < as.Date("2001/04/01")
class(isQ1)
head(isQ1,3)
isMarch <- isQ1 & MYDATA$datetime > as.Date("2001/03/01")
class(isMarch)
head(isMarch,3)
sum(isMarch)
eoySubset <- MYDATA[MYDATA$datetime > as.Date("2001/12/27"), ]
class(eoySubset)
head(eoySubset,3)

Listing for This Example

R> isQ1 <- MYDATA$datetime < as.Date("2001/04/01")
R> class(isQ1)
[1] "ore.logical"
attr(,"package")
[1] "OREbase"
R> head(isQ1,3)
[1] TRUE TRUE TRUE
R> isMarch <- isQ1 & MYDATA$datetime > as.Date("2001/03/01")
R> class(isMarch)
[1] "ore.logical"
attr(,"package")
[1] "OREbase"
R> head(isMarch,3)
[1] FALSE FALSE FALSE
R> sum(isMarch)
[1] 43
R> eoySubset <- MYDATA[MYDATA$datetime > as.Date("2001/12/27"), ]
R> class(eoySubset)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> head(eoySubset,3)
               datetime      difftime          x
495 2001-12-27 08:27:53 55.76474 secs -0.2740492
496 2001-12-28 01:58:18 15.42946 secs -1.4547270
497 2001-12-28 19:28:44 28.62195 secs  0.2929171

Example 3-20 Using Date and Time Accessors

OML4R has accessor functions that you can use to extract various components from datetime objects, such as year, month, day of the month, hour, minute, and second. This example demonstrates the use of these functions. The example uses the datetime column of the MYDATA ore.frame object created in the first example.

This example gets the year elements of the datetime column. The invocation of the unique function for year displays 2001 because it is the only year value in the column. However, for objects that have a range of values, as for example, ore.mday, the range function returns the day of the month. The result contains a vector with values that range from 1 through 31. Invoking the range function succinctly reports the range of values, as demonstrated for the other accessor functions.

year <- ore.year(MYDATA$datetime)
unique(year)
month <- ore.month(MYDATA$datetime)
range(month)
dayOfMonth <- ore.mday(MYDATA$datetime)
range(dayOfMonth)
hour <- ore.hour(MYDATA$datetime)
range(hour)
minute <- ore.minute(MYDATA$datetime)
range(minute)
second <- ore.second(MYDATA$datetime)
range(second)

Listing for This Example

R> year <- ore.year(MYDATA$datetime)
R> unique(year)
[1] 2001
R> month <- ore.month(MYDATA$datetime)
R> range(month)
[1]  1 12
R> dayOfMonth <- ore.mday(MYDATA$datetime)
R> range(dayOfMonth)
[1]  1 31
R> hour <- ore.hour(MYDATA$datetime)
R> range(hour)
[1]  0 23
R> minute <- ore.minute(MYDATA$datetime)
R> range(minute)
[1]  0 59
R> second <- ore.second(MYDATA$datetime)
R> range(second)
[1]  0.00000 59.87976

Example 3-21 Coercing Date and Time Data Types

This example uses the as.ore subclass objects to coerce an ore.datetime data type into other data types. The example uses the datetime column of the MYDATA ore.frame object created in the first example. That column contains ore.datetime values. This example first extracts the date from the MYDATA$datetime column. The resulting dateOnly object has ore.date values that contain only the year, month, and day, but not the time. The example then coerces the ore.datetime values into objects with ore.character and ore.integer values that represent the names of days, the number of the day of the year, and the quarter of the year.

dateOnly <- as.ore.date(MYDATA$datetime)
class(dateOnly)
head(sort(unique(dateOnly)),3)
nameOfDay <- as.ore.character(MYDATA$datetime, format = "DAY")
class(nameOfDay)
sort(unique(nameOfDay))
dayOfYear <- as.integer(as.character(MYDATA$datetime, format = "DDD"))
class(dayOfYear)
range(dayOfYear)
quarter <- as.integer(as.character(MYDATA$datetime, format = "Q"))
class(quarter)
sort(unique(quarter))

Listing for This Example

R> dateOnly <- as.ore.date(MYDATA$datetime)
R> class(dateOnly)[1] "ore.date"
attr(,"package")[1] "OREbase"
R> head(sort(unique(dateOnly)),3)
[1] "2001-01-01" "2001-01-02" "2001-01-03"
R> nameOfDay <- as.ore.character(MYDATA$datetime, format = "DAY")
R> class(nameOfDay)
[1] "ore.character"
attr(,"package")
[1] "OREbase"
R> sort(unique(nameOfDay))
[1] "FRIDAY " "MONDAY " "SATURDAY " "SUNDAY " "THURSDAY " "TUESDAY " "WEDNESDAY"
R> dayOfYear <- as.integer(as.character(MYDATA$datetime, format = "DDD"))
R> class(dayOfYear)
[1] "ore.integer"
attr(,"package")
[1] "OREbase"
R> range(dayOfYear)
[1]   1 365
R> quarter <- as.integer(as.character(MYDATA$datetime, format = "Q"))
R> class(quarter)
[1] "ore.integer"
attr(,"package")
[1] "OREbase"
R> sort(unique(quarter))
[1] 1 2 3 4

Example 3-22 Using a Window Function

This example uses the window functions ore.rollmean and ore.rollsd to compute the rolling mean and the rolling standard deviation. The example uses the MYDATA ore.frame object created in the first example. This example ensures that MYDATA is an ordered ore.frame by assigning the values of the datetime column as the row names of MYDATA. The example computes the rolling mean and the rolling standard deviation over five periods. Next, to use the R time series functionality in the stats package, the example pulls data to the client. To limit the data pulled to the client, it uses the vector is.March from the third example to select only the data points in March. The example creates a time series object using the ts function, builds the Arima model, and predicts three points out.

row.names(MYDATA) <- MYDATA$datetime
MYDATA$rollmean5 <- ore.rollmean(MYDATA$x, k = 5)
MYDATA$rollsd5 <- ore.rollsd (MYDATA$x, k = 5)
head(MYDATA)
marchData <- ore.pull(MYDATA[isMarch,])
tseries.x <- ts(marchData$x)
arima110.x <- arima(tseries.x, c(1,1,0))
predict(arima110.x, 3)
tseries.rm5 <- ts(marchData$rollmean5)
arima110.rm5 <- arima(tseries.rm5, c(1,1,0))
predict(arima110.rm5, 3)

Listing for This Example

R> row.names(MYDATA) <- MYDATA$datetime
R> MYDATA$rollmean5 <- ore.rollmean(MYDATA$x, k = 5)
R> MYDATA$rollsd5 <- ore.rollsd (MYDATA$x, k = 5)
R> head(MYDATA)
                               datetime       difftime
2001-01-01 00:00:00 2001-01-01 00:00:00 39.998460 secs
                                                     x   rollmean5   rollsd5
                                            -0.3450421 -0.46650761 0.8057575
                               datetime       difftime
2001-01-01 17:30:25 2001-01-01 17:30:25  37.75568 secs
                                                     x   rollmean5   rollsd5
                                            -1.3261019  0.02877517 1.1891384
                               datetime       difftime
2001-01-02 11:00:50 2001-01-02 11:00:50  18.44243 secs
                                                     x   rollmean5   rollsd5
                                             0.2716211 -0.13224503 1.0909515
                               datetime       difftime
2001-01-03 04:31:15 2001-01-03 04:31:15 38.594384 secs
                                                     x   rollmean5   rollsd5
                                             1.5146235  0.36307913 1.4674456
                               datetime       difftime
2001-01-03 22:01:41 2001-01-03 22:01:41  2.520976 secs
                                                     x   rollmean5   rollsd5
                                            -0.7763258  0.80073340 1.1237925
                               datetime       difftime
2001-01-04 15:32:06 2001-01-04 15:32:06 56.333281 secs 
                                                     x   rollmean5   rollsd5
                                             2.1315787  0.90287282 1.0862614
R> marchData <- ore.pull(MYDATA[isMarch,])
R> tseries.x <- ts(marchData$x)
R> arima110.x <- arima(tseries.x, c(1,1,0))
R> predict(arima110.x, 3)
$pred
Time Series:
Start = 44 
End = 46 
Frequency = 1 
[1] 1.4556614 0.6156379 1.1387587
 
$se
Time Series:
Start = 44 
End = 46 
Frequency = 1 
[1] 1.408117 1.504988 1.850830
 
R> tseries.rm5 <- ts(marchData$rollmean5)
R> arima110.rm5 <- arima(tseries.rm5, c(1,1,0))
R> predict(arima110.rm5, 3)
$pred
Time Series:
Start = 44 
End = 46 
Frequency = 1 
[1] 0.3240135 0.3240966 0.3240922
 
$se
Time Series:
Start = 44 
End = 46 
Frequency = 1 
[1] 0.3254551 0.4482886 0.5445763

3.2 Explore Data

Oracle Machine Learning for R provides functions that enable you to perform exploratory data analysis.

With these functions, you can perform common statistical operations.

The functions and their uses are described in the following topics:

3.2.1 About the Exploratory Data Analysis Functions

The OML4R functions for exploratory data analysis are in the OREeda package.

Table 3-1 Functions in the OREeda Package

Function Description

ore.corr

Performs correlation analysis across numeric columns in an ore.frame object.

ore.crosstab

Expands on the xtabs function by supporting multiple columns with optional aggregations, weighting, and ordering options. Building a cross-tabulation is a pre-requisite to using the ore.freq function.

ore.esm

Builds exponential smoothing models on data in an ordered ore.vector object.

ore.freq

Operates on output from the ore.crosstab function and automatically determines techniques that are relevant for the table.

ore.rank

Enables the investigation of the distribution of values along numeric columns in an ore.frame object.

ore.sort

Provides flexible sorting for ore.frame objects.

ore.summary

Provides descriptive statistics for ore.frame objects within flexible row aggregations.

ore.univariate

Provides distribution analysis of numeric columns in an ore.frame object of. Reports all statistics from the ore.summary function plus signed-rank test and extreme values.

3.2.2 About the NARROW Data Set for Examples

Many of the examples of the exploratory data analysis functions use the NARROW data set.

NARROW is an ore.frame that has 9 columns and 1500 rows, as shown in the following example. Some of the columns are numeric, others are not.

Example 3-23 The NARROW Data Set

This example shows the class, dimensions, and names of the NARROW object.

R> class(NARROW)
R> dim(NARROW)
R> names(NARROW)

Listing for This Example

R> class(NARROW)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> dim(NARROW)[1] 1500    9
R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"

3.2.3 Correlate Data

You can use the ore.corr function to perform correlation analysis.

With the ore.corr function, you can do the following:

  • Perform Pearson, Spearman or Kendall correlation analysis across numeric columns in an ore.frame object.

  • Perform partial correlations by specifying a control column.

  • Aggregate some data prior to the correlations.

  • Post-process results and integrate them into an R code flow.

    You can make the output of the ore.corr function conform to the output of the R cor function; doing so allows you to use any R function to post-process the output or to use the output as the input to a graphics function.

For details about the function arguments, invoke help(ore.corr).

The following examples demonstrate these operations.

Example 3-24 Performing Basic Correlation Calculations

This example demonstrates how to specify the different types of correlation statistics.

# Before performing correlations, project out all non-numeric values 
# by specifying only the columns that have numeric values.
names(NARROW)
NARROW_NUMS <- NARROW[,c(3,8,9)]
names(NARROW_NUMS)
# Calculate the correlation using the default correlation statistic, Pearson.
x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS')
head(x, 3)
# Calculate using Spearman.
x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
head(x, 3)
# Calculate using Kendall
x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
head(x, 3)

Listing for This Example

R> # Before performing correlations, project out all non-numeric values 
R> # by specifying only the columns that have numeric values.
R> names(NARROW)
 [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" "COUNTRY" "EDUCATION" "OCCUPATION"
 [8] "YRS_RESIDENCE" "CLASS" "AGEBINS"
R> NARROW_NUMS <- NARROW[,c(3,8,9)]
R> names(NARROW_NUMS)
[1] "AGE" "YRS_RESIDENCE" "CLASS"

R> # Calculate the correlation using the default correlation statistic, Pearson.
R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS')
R> head(x, 3)
            ROW           COL PEARSON_T PEARSON_P PEARSON_DF
1           AGE         CLASS 0.2200960     1e-15       1298
2           AGE YRS_RESIDENCE 0.6568534     0e+00       1098
3 YRS_RESIDENCE         CLASS 0.3561869     0e+00       1298
R> # Calculate using Spearman.
R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman')
R> head(x, 3)
            ROW           COL SPEARMAN_T SPEARMAN_P SPEARMAN_DF
1           AGE         CLASS  0.2601221      1e-15        1298
2           AGE YRS_RESIDENCE  0.7462684      0e+00        1098
3 YRS_RESIDENCE         CLASS  0.3835252      0e+00        1298
R> # Calculate using Kendall
R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall')
R> head(x, 3)
            ROW           COL KENDALL_T    KENDALL_P KENDALL_DF
1           AGE         CLASS 0.2147107 4.285594e-31       <NA>
2           AGE YRS_RESIDENCE 0.6332196 0.000000e+00       <NA>
3 YRS_RESIDENCE         CLASS 0.3362078 1.094478e-73       <NA>

Example 3-25 Creating Correlation Matrices

This example pushes the iris data set to a temporary table in the database, which has the proxy ore.frame object iris_of. It creates correlation matrices grouped by species.

iris_of <- ore.push(iris)
x <- ore.corr(iris_of, var = "Sepal.Length, Sepal.Width, Petal.Length",
              partial = "Petal.Width", group.by = "Species")
 class(x)
 head(x)

Listing for This Example

R> iris_of <- ore.push(iris)
R> x <- ore.corr(iris_of, var = "Sepal.Length, Sepal.Width, Petal.Length",
+                partial = "Petal.Width", group.by = "Species")
R> class(x)
[1] "list"
R> head(x)
$setosa
           ROW          COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF
1 Sepal.Length Petal.Length      0.1930601   9.191136e-02              47
2 Sepal.Length  Sepal.Width      0.7255823   1.840300e-09              47
3  Sepal.Width Petal.Length      0.1095503   2.268336e-01              47
 
$versicolor
           ROW          COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF
1 Sepal.Length Petal.Length     0.62696041   7.180100e-07              47
2 Sepal.Length  Sepal.Width     0.26039166   3.538109e-02              47
3  Sepal.Width Petal.Length     0.08269662   2.860704e-01              47
 
$virginica
           ROW          COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF
1 Sepal.Length Petal.Length      0.8515725   4.000000e-15              47
2 Sepal.Length  Sepal.Width      0.3782728   3.681795e-03              47
3  Sepal.Width Petal.Length      0.2854459   2.339940e-02              47

3.2.4 Cross-Tabulate Data

Cross-tabulation is a statistical technique that finds an interdependent relationship between two tables of values.

The ore.crosstab function enables cross-column analysis of an ore.frame. This function is a sophisticated variant of the R table function.

You must use ore.crosstab function before performing frequency analysis using ore.freq.

If the result of the ore.crosstab function invocation is a single cross-tabulation, then the function returns an ore.frame object. If the result is multiple cross-tabulations, then the function returns a list of ore.frame objects.

For details about function arguments, invoke help(ore.crosstab).

Example 3-26 Creating a Single Column Frequency Table

The most basic use case is to create a single-column frequency table, as shown in this example.

This example filters the NARROW ore.frame, grouping by GENDER.

ct <- ore.crosstab(~AGE, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(~AGE, data=NARROW)
R> head(ct)
   AGE ORE$FREQ ORE$STRATA ORE$GROUP
17  17       14          1         1
18  18       16          1         1
19  19       30          1         1
20  20       23          1         1
21  21       22          1         1
22  22       39          1         1

Example 3-27 Analyzing Two Columns

This example analyses AGE by GENDER and AGE by CLASS.

ct <- ore.crosstab(AGE~GENDER+CLASS, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(AGE~GENDER+CLASS, data=NARROW)
R> head(ct)
$`AGE~GENDER`
     AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
17|F  17      F        5          1         1
17|M  17      M        9          1         1
18|F  18      F        6          1         1
18|M  18      M        7          1         1
19|F  19      F       15          1         1
19|M  19      M       13          1         1
# The remaining output is not shown.

Example 3-28 Weighting Rows

To weight rows, include a count based on another column as shown in this example. This example weights values in AGE and GENDER using values in YRS_RESIDENCE.

ct <- ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW)
R> head(ct)
     AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
17|F  17      F        1          1         1
17|M  17      M        8          1         1
18|F  18      F        4          1         1
18|M  18      M       10          1         1
19|F  19      F       15          1         1
19|M  19      M       17          1         1

Example 3-29 Ordering Cross-Tabulated Data

There are several possibilities for ordering rows in a cross-tabulated table, such as the following:

  • Default or NAME orders by the columns being analyzed

  • FREQ orders by frequency counts

  • -NAME or -FREQ does reverse ordering

  • INTERNAL bypasses ordering

This example orders by frequency count and then by reverse order by frequency count.

ct <- ore.crosstab(AGE~GENDER|FREQ, data=NARROW)
head(ct)
ct <- ore.crosstab(AGE~GENDER|-FREQ, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(AGE~GENDER|FREQ, data=NARROW)
R> head(ct)
     AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
66|F  66      F        1          1         1
70|F  70      F        1          1         1
73|M  73      M        1          1         1
74|M  74      M        1          1         1
76|F  76      F        1          1         1
77|F  77      F        1          1         1

R> ct <- ore.crosstab(AGE~GENDER|-FREQ, data=NARROW)
R> head(ct)
     AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
27|M  27      M       33          1         1
35|M  35      M       28          1         1
41|M  41      M       27          1         1
34|M  34      M       26          1         1
37|M  37      M       26          1         1
28|M  28      M       25          1         1

Example 3-30 Analyzing Three or More Columns

This example demonstrates analyzing three or more columns. The result is similar to what the SQL GROUPING SETS clause accomplishes.

ct <- ore.crosstab(AGE+COUNTRY~GENDER, NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(AGE+COUNTRY~GENDER, NARROW)
R> head(ct)
$`AGE~GENDER`
     AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
17|F  17      F        5          1         1
17|M  17      M        9          1         1
18|F  18      F        6          1         1
18|M  18      M        7          1         1
19|F  19      F       15          1         1
19|M  19      M       13          1         1
# The rest of the output is not shown.
$`COUNTRY~GENDER`
                             COUNTRY GENDER ORE$FREQ ORE$STRATA ORE$GROUP
Argentina|F                Argentina      F       14          1         1
Argentina|M                Argentina      M       28          1         1
Australia|M                Australia      M        1          1         1
# The rest of the output is not shown.

Example 3-31 Specifying a Range of Columns

You can specify a range of columns instead of having to type all the column names, as demonstrated in this example.

names(NARROW)
# Because AGE, MARITAL_STATUS and COUNTRY are successive columns, 
# you can simply do the following:
ct <- ore.crosstab(AGE-COUNTRY~GENDER, NARROW)
# An equivalent invocation is the following:
ct <- ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)

Listing for This Example

R> names(NARROW)
[1] "ID"             "GENDER"         "AGE"            "MARITAL_STATUS"
[5] "COUNTRY"        "EDUCATION"      "OCCUPATION"     "YRS_RESIDENCE" 
[9] "CLASS"    
R> # Because AGE, MARITAL_STATUS and COUNTRY are successive columns, 
R> # you can simply do the following:
R> ct <- ore.crosstab(AGE-COUNTRY~GENDER, NARROW)
R> # An equivalent invocation is the following:
R> ct <- ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)

Example 3-32 Producing One Cross-Tabulation Table for Each Value of Another Column

This example produces one cross-tabulation table (AGE, GENDER) for each unique value of another column COUNTRY.

ct <- ore.crosstab(~AGE/COUNTRY, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(~AGE/COUNTRY, data=NARROW)
R> head(ct)
                            AGE ORE$FREQ ORE$STRATA ORE$GROUP
Argentina|17                 17        1          1         1
Brazil|17                    17        1          1         3
United States of America|17  17       12          1        19
United States of America|18  18       16          1        19
United States of America|19  19       30          1        19
United States of America|20  20       23          1        19

Example 3-33 Producing One Cross-Tabulation Table for Each Set of Value of Two Columns

You can extend the cross-tabulation to more than one column, as shown in this example, which produces one (AGE, EDUCATION) table for each unique combination of (COUNTRY, GENDER).

ct <- ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW)
head(ct)

Listing for This Example

R> ct <- ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW)
R> head(ct)
                                   AGE EDUCATION ORE$FREQ ORE$STRATA ORE$GROUP
United States of America|F|17|10th  17      10th        3          1        33
United States of America|M|17|10th  17      10th        5          1        34
United States of America|M|17|11th  17      11th        1          1        34
Argentina|M|17|HS-grad              17   HS-grad        1          1         2
United States of America|M|18|10th  18      10th        1          1        34
United States of America|F|18|11th  18      11th        2          1        33

Example 3-34 Augmenting Cross-Tabulation with Stratification

All of the cross-tabulation tables in the previous examples can be augmented with stratification, as shown in this example.

ct <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW) 
head(ct)
R> head(ct)
# The previous function invocation is the same as the following:
ct <- ore.crosstab(AGE~GENDER, NARROW, strata="CLASS")

Listing for This Example

R> ct <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW) 
R> head(ct)
R> head(ct)
       AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP
0|17|F  17      F        5          1         1
0|17|M  17      M        9          1         1
0|18|F  18      F        6          1         1
0|18|M  18      M        7          1         1
0|19|F  19      F       15          1         1
0|19|M  19      M       13          1         1
# The previous function invocation is the same as the following:
ct <- ore.crosstab(AGE~GENDER, NARROW, strata="CLASS")

Example 3-35 Binning Followed by Cross-Tabulation

This example does a custom binning by AGE and then calculates the cross-tabulation for GENDER and the bins.

NARROW$AGEBINS <- ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2,
                  ifelse(NARROW$AGE<40,3,4)))
ore.crosstab(GENDER~AGEBINS, NARROW)

Listing for This Example

R> NARROW$AGEBINS <- ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2,
+                    ifelse(NARROW$AGE<40,3,4)))
R> ore.crosstab(GENDER~AGEBINS, NARROW)
    GENDER AGEBINS ORE$FREQ ORE$STRATA ORE$GROUP
F|1      F       1       26          1         1
F|2      F       2      108          1         1
F|3      F       3       86          1         1
F|4      F       4      164          1         1
M|1      M       1       29          1         1
M|2      M       2      177          1         1
M|3      M       3      230          1         1
M|4      M       4      381          1         1

3.2.5 Analyze the Frequency of Cross-Tabulations

The ore.freq function analyses the output of the ore.crosstab function and automatically determines the techniques that are relevant to an ore.crosstab result.

The techniques depend on the kind of cross-tabulation tables, which are the following:

  • 2-way cross-tabulation tables

    • Various statistics that describe relationships between columns in the cross-tabulation

    • Chi-square tests, Cochran-Mantel-Haenzsel statistics, measures of association, strength of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend

  • N-way cross-tabulation tables

    • N 2-way cross-tabulation tables

    • Statistics across and within strata

The ore.freq function uses Oracle Database SQL functions when available.

The ore.freq function returns an ore.frame in all cases.

Before you use ore.freq, you must calculate crosstabs, as shown in the following example.

For details about the function arguments, invoke help(ore.freq).

Example 3-36 Using the ore.freq Function

This example pushes the iris data set to the database and gets the ore.frame object iris_of. The example gets a crosstab and invokes the ore.freq function on it.

IRIS <- ore.push(iris)
ct <- ore.crosstab(Species ~ Petal.Length + Sepal.Length, data = IRIS)
ore.freq(ct)

Listing for This Example

R> IRIS <- ore.push(iris)
R> ct <- ore.crosstab(Species ~ Petal.Length + Sepal.Length, data = IRIS)
R> ore.freq(ct)
$`Species~Petal.Length`
  METHOD     FREQ DF       PVALUE              DESCR GROUP
1 PCHISQ 181.4667 84 3.921603e-09 Pearson Chi-Square     1
 
$`Species~Sepal.Length`
  METHOD  FREQ DF      PVALUE              DESCR GROUP
1 PCHISQ 102.6 68 0.004270601 Pearson Chi-Square     1

3.2.6 Build Exponential Smoothing Models on Time Series Data

The ore.esm function builds a simple or a double exponential smoothing model for in-database time series observations in an ordered ore.vector object.

The function operates on time series data, whose observations are evenly spaced by a fixed interval, or transactional data, whose observations are not equally spaced. The function can aggregate the transactional data by a specified time interval, as well as handle missing values using a specified method, before entering the modeling phase.

The ore.esm function processes the data in one or more R engines running on the database server. The function returns an object of class ore.esm.

You can use the predict method to predict the time series of the exponential smoothing model built by ore.esm. If you have loaded the forecast package, then you can use the forecast method on the ore.esm object. You can use the fitted method to generate the fitted values of the training time series data set.

For information about the arguments of the ore.esm function, invoke help(ore.esm).

Example 3-37 Building a Double Exponential Smoothing Model

This example builds a double exponential smoothing model on a synthetic time series data set. The predict and fitted functions are invoked to generate the predictions and the fitted values, respectively. The figure shows the observations, fitted values, and the predictions.

N <- 5000
ts0 <- ore.push(data.frame(ID=1:N,
                           VAL=seq(1,5,length.out=N)^2+rnorm(N,sd=0.5)))
rownames(ts0) <- ts0$ID
x <- ts0$VAL
esm.mod <- ore.esm(x, model = "double")
esm.predict <- predict(esm.mod, 30)
esm.fitted <- fitted(esm.mod, start=4000, end=5000)
plot(ts0[4000:5000,], pch='.')
lines(ts0[4000:5000, 1], esm.fitted, col="blue")
lines(esm.predict, col="red", lwd=2)

Figure 3-1 Fitted and Predicted Values Based on the esm.mod Model

Description of Figure 3-1 follows
Description of "Figure 3-1 Fitted and Predicted Values Based on the esm.mod Model"

Example 3-38 Building a Time Series Model with Transactional Data

This example builds a simple smoothing model based on a transactional data set. As preprocessing, it aggregates the values to the day level by taking averages, and fills missing values by setting them to the previous aggregated value. The model is then built on the aggregated daily time series. The function predict is invoked to generate predicted values on the daily basis.

ts01 <- data.frame(ID=seq(as.POSIXct("2008/6/13"), as.POSIXct("2011/6/16"),
                   length.out=4000), VAL=rnorm(4000, 10))
ts02 <- data.frame(ID=seq(as.POSIXct("2011/7/19"), as.POSIXct("2012/11/20"),
                   length.out=1500), VAL=rnorm(1500, 10))
ts03 <- data.frame(ID=seq(as.POSIXct("2012/12/09"), as.POSIXct("2013/9/25"),
                   length.out=1000), VAL=rnorm(1000, 10))
ts1 = ore.push(rbind(ts01, ts02, ts03))
rownames(ts1) <- ts1$ID
x <- ts1$VAL
esm.mod <- ore.esm(x, "DAY", accumulate = "AVG", model="simple",
                   setmissing="PREV")
esm.predict <- predict(esm.mod)
esm.predict

Listing for This Example

R> ts01 <- data.frame(ID=seq(as.POSIXct("2008/6/13"), as.POSIXct("2011/6/16"),
+                      length.out=4000), VAL=rnorm(4000, 10))
R> ts02 <- data.frame(ID=seq(as.POSIXct("2011/7/19"), as.POSIXct("2012/11/20"),
+                     length.out=1500), VAL=rnorm(1500, 10))
R> ts03 <- data.frame(ID=seq(as.POSIXct("2012/12/09"), as.POSIXct("2013/9/25"),
+                     length.out=1000), VAL=rnorm(1000, 10))
R> ts1 = ore.push(rbind(ts01, ts02, ts03))
R> rownames(ts1) <- ts1$ID
R> x <- ts1$VAL
R> esm.mod <- ore.esm(x, "DAY", accumulate = "AVG", model="simple",
+                     setmissing="PREV")
R> esm.predict <- predict(esm.mod)
R> esm.predict
           ID      VAL
1  2013-09-26 9.962478
2  2013-09-27 9.962478
3  2013-09-28 9.962478
4  2013-09-29 9.962478
5  2013-09-30 9.962478
6  2013-10-01 9.962478
7  2013-10-02 9.962478
8  2013-10-03 9.962478
9  2013-10-04 9.962478
10 2013-10-05 9.962478
11 2013-10-06 9.962478
12 2013-10-07 9.962478

Example 3-39 Building a Double Exponential Smoothing Model Specifying an Interval

This example uses stock data from the TTR package. It builds a double exponential smoothing model based on the daily stock closing prices. The 30-day predicted stock prices, along with the original observations, are shown in the following figure.

library(TTR)
stock <- "orcl"
xts.data <- getYahooData(stock, 20010101, 20131024)
df.data <- data.frame(xts.data)
df.data$date <- index(xts.data)
of.data <- ore.push(df.data[, c("date", "Close")])
rownames(of.data) <- of.data$date
esm.mod <- ore.esm(of.data$Close, "DAY", model = "double")
esm.predict <- predict(esm.mod, 30)
plot(of.data,type="l")
lines(esm.predict,col="red",lwd=4)

Figure 3-2 Stock Price Prediction

Description of Figure 3-2 follows
Description of "Figure 3-2 Stock Price Prediction"

3.2.7 Rank Data

The ore.rank function analyzes distribution of values in numeric columns of an ore.frame.

The ore.rank function supports useful functionality, including:

  • Ranking within groups

  • Partitioning rows into groups based on rank tiles

  • Calculation of cumulative percentages and percentiles

  • Treatment of ties

  • Calculation of normal scores from ranks

The ore.rank function syntax is simpler than the corresponding SQL queries.

The ore.rank function returns an ore.frame in all instances.

You can use these R scoring methods with ore.rank:

  • To compute exponential scores from ranks, use savage.

  • To compute normal scores, use one of blom, tukey, or vw (van der Waerden).

For details about the function arguments, invoke help(ore.rank).

The following examples illustrate using ore.rank. The examples use the NARROW data set.

Example 3-40 Ranking Two Columns

This example ranks the two columns AGE and CLASS and reports the results as derived columns; values are ranked in the default order, which is ascending.

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')

Example 3-41 Handling Ties in Ranking

This example ranks the two columns AGE and CLASS. If there is a tie, the smallest value is assigned to all tied values.

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')

Example 3-42 Ranking by Groups

This example ranks the two columns AGE and CLASS and then ranks the resulting values according to COUNTRY.

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')

Example 3-43 Partitioning into Deciles

To partition the columns into a different number of partitions, change the value of groups. For example, groups=4 partitions into quartiles. This example ranks the two columns AGE and CLASS and partitions the columns into deciles (10 partitions).

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',groups=10)

Example 3-44 Estimating Cumulative Distribution Function

This example ranks the two columns AGE and CLASS and estimates the cumulative distribution function for both column.

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',nplus1=TRUE)

Example 3-45 Scoring Ranks

This example ranks the two columns AGE and CLASS and scores the ranks in two different ways. The first command partitions the columns into percentiles (100 groups). The savage scoring method calculates exponential scores and blom scoring calculates normal scores.

x <- ore.rank(data=NARROW, var='AGE=RankOfAge, 
          CLASS=RankOfClass', score='savage', groups=100, group.by='COUNTRY')
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', score='blom')

3.2.8 Sort Data

The ore.sort function enables flexible sorting of a data frame along one or more columns specified by the by argument.

The ore.sort function can be used with other data pre-processing functions. The results of sorting can provide input to R visualization.

The sorting done by the ore.sort function takes place in the Oracle database. The ore.sort function supports the database nls.sort option.

The ore.sort function returns an ore.frame.

For details about the function arguments, invoke help(ore.sort).

Most of the following examples use the NARROW data set. Some examples use the ONTIME_S data set.

Example 3-46 Sorting Columns in Descending Order

This example sorts the columns AGE and GENDER in descending order.

x <- ore.sort(data=NARROW, by='AGE,GENDER', reverse=TRUE)

Example 3-47 Sorting Different Columns in Different Orders

This example sorts AGE in descending order and GENDER in ascending order.

x <- ore.sort(data=NARROW, by='-AGE,GENDER')

Example 3-48 Sorting and Returning One Row per Unique Value

This example sorts by AGE and keep one row per unique value of AGE:

x <- ore.sort(data=NARROW, by='AGE', unique.key=TRUE)

Example 3-49 Removing Duplicate Columns

This example sorts by AGE and removes duplicate rows:

x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE)

Example 3-50 Removing Duplicate Columns and Returning One Row per Unique Value

This example sorts by AGE, removes duplicate rows, and returns one row per unique value of AGE.

x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE, unique.key = TRUE)

Example 3-51 Preserving Relative Order in the Output

This example maintains the relative order in the sorted output.

x <- ore.sort(data=NARROW, by='AGE', stable=TRUE)

Example 3-52 Sorting Two Columns in Different Orders

This example sorts ONTIME_S by airline name in descending order and departure delay in ascending order.

sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')

Example 3-53 Sorting Two Columns in Different Orders and Producing Unique Combinations

This example sorts ONTIME_S by airline name and departure delay and selects one of each combination (that is, returns a unique key).

sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY', 
                          unique.key=TRUE)

3.2.9 Summarize Data

Summarize data with the aggregate function.

Example 3-54 Aggregating Data

This example pushes the iris data set to database memory as the ore.frame object iris_of. It aggregates the values of iris_of by the Species column using the length function. It then displays the first three rows of the result.

# Create a temporary database table from the iris data set and get an ore.frame.
iris_of <- ore.push(iris)
aggdata <- aggregate(iris_of$Sepal.Length,
                     by = list(species = iris_of$Species),
                     FUN = length)
head(aggdata, 3)
Listing for This Example
# Create a temporary database table from the iris data set and get an ore.frame.
R> iris_of <- ore.push(iris)
R> aggdata <- aggregate(iris_of$Sepal.Length,
+                      by = list(species = iris_of$Species),
+                      FUN = length)
R> head(aggdata, 3)
             species  x
setosa         setosa 50
versicolor versicolor 50
virginica   virginica 50

3.2.10 Analyze the Distribution of Numeric Variables

The ore.univariate function provides distribution analysis of numeric variables in an ore.frame.

The ore.univariate function provides these statistics:

  • All statistics reported by the summary function

  • Signed rank test, Student's t-test

  • Extreme values reporting

The ore.univariate function returns an ore.frame as output in all cases.

For details about the function arguments, invoke help(ore.univariate).

Example 3-55 Calculating the Default Univariate Statistics

This example calculates the default univariate statistics for AGE, YRS_RESIDENCE, and CLASS.

ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")

Example 3-56 Calculating the Default Univariate Statistics

This example calculates location statistics for YRS_RESIDENCE.

ore.univariate(NARROW, var="YRS_RESIDENCE", stats="location")

Example 3-57 Calculating the Complete Quantile Statistics

This example calculates complete quantile statistics for AGE and YRS_RESIDENCE.

ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")

3.2.11 Principal Component Analysis

The overloaded prcomp and princomp functions perform principal component analysis in parallel in the database.

The prcomp function uses a singular value decomposition of the covariance and correlations between variables. The princomp function uses eigen decomposition of the covariance and correlations between samples.

The transparency layer methods ore.frame-prcomp and ore.frame-princomp enable you to use the generic functions prcomp and princomp on data in an ore.frame object. This allows the functions to execute in parallel processes in the database.

For both functions, the methods support the function signature that accepts an ore.frame as the x argument and the signature that accepts a formula. The ore.frame must contain only numeric data. The formula must refer only to numeric variables and have no response variable.

Function prcomp returns a prcomp object and function princomp returns a princomp object.

For details about the function arguments, invoke help('ore.frame-prcomp') and help('ore.frame-princomp').

Note:

The biplot function is not supported for the objects returned by these transparency layer methods.

Example 3-58 Using the prcomp and princomp Functions

USARRESTS <- ore.push(USArrests)

# Using prcomp

prcomp(USARRESTS)
prcomp(USARRESTS, scale. = TRUE)

# Formula interface
prcomp(~ Murder + Assault + UrbanPop, data = USARRESTS, scale. = TRUE)

# Using prcomp

princomp(USARRESTS)
princomp(USARRESTS, cor = TRUE)

# Formula interface
princomp(~ Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)

Listing for This Example

R> USARRESTS <- ore.push(USArrests)
R> 
R> # Using prcomp
R>
R> prcomp(USARRESTS)
Standard deviations:
[1] 83.732400 14.212402  6.489426  2.482790

Rotation:
                PC1         PC2         PC3         PC4
Murder   0.04170432 -0.04482166  0.07989066 -0.99492173
Assault  0.99522128 -0.05876003 -0.06756974  0.03893830
UrbanPop 0.04633575  0.97685748 -0.20054629 -0.05816914
Rape     0.07515550  0.20071807  0.97408059  0.07232502

R> prcomp(USARRESTS, scale. = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation:
               PC1        PC2        PC3         PC4
Murder   0.5358995 -0.4181809  0.3412327  0.64922780
Assault  0.5831836 -0.1879856  0.2681484 -0.74340748
UrbanPop 0.2781909  0.8728062  0.3780158  0.13387773
Rape     0.5434321  0.1673186 -0.8177779  0.08902432
R> 
R> # Formula interface
R> prcomp(~ Murder + Assault + UrbanPop, data = USARRESTS, scale. = TRUE)
Standard deviations:
[1] 1.3656547 0.9795415 0.4189100

Rotation:
               PC1         PC2        PC3
Murder   0.6672955 -0.30345520  0.6801703
Assault  0.6970818 -0.06713997 -0.7138411
UrbanPop 0.2622854  0.95047734  0.1667309
R>
R> # Using princomp
R>
R> princomp(USARRESTS)
Call:
princomp(USARRESTS)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
82.890847 14.069560  6.424204  2.457837 

 4  variables and  50 observations.
R> princomp(USARRESTS, cor = TRUE)
Call:
princomp(USARRESTS, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3    Comp.4 
1.5748783 0.9948694 0.5971291 0.4164494 

 4  variables and  50 observations.
R> 
R> # Formula interface
R> princomp(~ Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)
Call:
princomp(~Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)

Standard deviations:
   Comp.1    Comp.2    Comp.3 
1.3656547 0.9795415 0.4189100 

 3  variables and  50 observations.

3.2.12 Singular Value Decomposition

The overloaded svd function performs singular value decomposition in parallel in the database.

The svd function accepts an ore.frame or an ore.tblmatrix object as the x argument. The ore.frame-svd method distributes block SVD computation to parallel processes executing in the database. The method uses the global option ore.parallel to determine the degree of parallelism to employ.

The function returns a list object that contains the d vector and v matrix components of a singular value decomposition of argument x. It does not return the left singular vector matrix u, therefore the argument nu is not used.

For details about the function arguments, invoke help('ore.frame-svd').

Example 3-59 Using the svd Function

USARRESTS <- ore.push(USArrests)
svd(USARRESTS)

Listing for This Example

R> USARRESTS <- ore.push(USArrests)
R> svd(USARRESTS)
$d
[1] 1419.06140  194.82585   45.66134   18.06956

$v
           [,1]        [,2]        [,3]        [,4]
[1,] 0.04239181 -0.01616262  0.06588426  0.99679535
[2,] 0.94395706 -0.32068580 -0.06655170 -0.04094568
[3,] 0.30842767  0.93845891 -0.15496743  0.01234261
[4,] 0.10963744  0.12725666  0.98347101 -0.06760284

3.3 Data Manipulation Using OREdplyr

OREdplyr package functions transparently implement dplyr functions for use with ore.frame and ore.numeric objects.

Many of these functions have non-standard evaluation (NSE) and standard evaluation (SE) interfaces. The SE functions have an underscore ( _) appended to the function name. NSE functions are useful in interactive R sessions; SE functions are convenient for use in programs.

The functions in the OREdplyr package are described in the following topics.

3.3.1 Select and Order Data

OREdplyr functions for selecting and ordering data in columns and rows of an ore.frame object.

Table 3-2 Selecting and Ordering Columns and Rows

Function Description

arrange

arrange_

Orders rows by the specified columns.

desc

Sorts an ore.number, ore.factor, or ore.character object in descending order

distinct

distinct_

Selects unique rows from an input ore.frame object over the specified columns.

filter

filter_

Filters rows by matching the specified condition.

mutate

mutate_

Adds new columns.

rename

rename_

Renames the specified columns and keeps all columns.

select

select_

Selects only the specified columns.

slice

slice_

Selects rows by position; ignores the grouping of the input ordered ore.frame object.

tranmute

tranmute_

Adds new columns and drops the existing columns.

Examples of using these functions are the following:

3.3.1.1 Examples of Selecting Columns

Examples of the select and rename functions of the OREdplyr package.

Example 3-60 Selecting Columns

The following examples select columns from the IRIS ore.frame object that is created by using the ore.push function on the iris data.frame objects.

IRIS <- ore.push(iris)
# Select the specified column
names(select(IRIS, Petal.Length))
names(select(IRIS, petal_length = Petal.Length))

# Drop the specified column
names(select(IRIS, -Petal.Length))

# rename() keeps all variables
names(rename(IRIS, petal_length = Petal.Length))

Listing for This Example


R> IRIS <- ore.push(iris)
R> # Select the specified column
R> names(select(IRIS, Petal.Length))
[1] "Petal.Length"
R> names(select(IRIS, petal_length = Petal.Length))
[1] "petal_length"
R>
R> # Drop the specified column
R> names(select(IRIS, -Petal.Length))
[1] "Sepal.Length" "Sepal.Width"  "Petal.Width"  "Species" 
R>
R> # rename() keeps all variables
R> names(rename(IRIS, petal_length = Petal.Length))
[1] "Sepal.Length" "Sepal.Width"  "petal_length" "Petal.Width"  "Species"
3.3.1.2 Examples of Programming with select_

Examples of the select_ function of the OREdplyr package.

Example 3-61 Programming with select

This example uses the select_ function to select columns from the IRIS ore.frame object that is created by using the ore.push function on the iris data.frame object.

IRIS <- ore.push(iris)
# Use ~, double quote, or quote function to specify the column to select
head(select_(IRIS, ~Petal.Length))
head(select_(IRIS, "Petal.Length"))
head(select_(IRIS, quote(-Petal.Length), quote(-Petal.Width)))
head(select_(IRIS, .dots = list(quote(-Petal.Length), quote(-Petal.Width))))

Listing for This Example

R> IRIS <- ore.push(iris)
R> # Use ~, double quote, or quote function to specify the column to select
R> head(select_(IRIS, ~Petal.Length))
  Petal.Length
1          1.4
2          1.4
3          1.3
4          1.5
5          1.4
6          1.7
R> head(select_(IRIS, "Petal.Length"))
  Petal.Length
1          1.4
2          1.4
3          1.3
4          1.5
5          1.4
6          1.7
R> head(select_(IRIS, quote(-Petal.Length), quote(-Petal.Width)))
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
6          5.4         3.9  setosa
R> head(select_(IRIS, .dots = list(quote(-Petal.Length), quote(-Petal.Width))))
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
6          5.4         3.9  setosa
3.3.1.3 Examples of Selecting Distinct Columns

Examples of the distinct and arrange functions of the OREdplyr package.

Example 3-62 Selecting Distinct Columns

df <- data.frame(
  x = sample(10, 100, rep = TRUE),
  y = sample(10, 100, rep = TRUE)
)
DF <- ore.push(df)
nrow(DF)
nrow(distinct(DF))
arrange(distinct(DF, x), x)
arrange(distinct(DF, y), y)

# Use distinct on computed variables
arrange(distinct(DF, diff = abs(x - y)), diff)

Listing for This Example

R> df <- data.frame(
+   x = sample(10, 100, rep = TRUE),
+   y = sample(10, 100, rep = TRUE)
+ )
R> DF <- ore.push(df)
R> nrow(DF)
[1] 100
R> nrow(distinct(DF))
[1] 66
R> arrange(distinct(DF, x), x)
    x
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
R> arrange(distinct(DF, y), y)
    y
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
R> 
R> # Use distinct on computed variables
R> arrange(distinct(DF, diff = abs(x - y)), diff)
   diff
1     0
2     1
3     2
4     3
5     4
6     5
7     6
8     7
9     8
10    9
3.3.1.4 Examples of Selecting Rows by Position

Examples of the slice and filter functions of the OREdplyr package.

Example 3-63 Selecting Rows by Position

MTCARS <- ore.push(mtcars)
# Display the names of the rows in MTCARS
rownames(MTCARS)
# Select the first row
slice(MTCARS, 1L)

# Arrange the rows by horsepower, then select the first row by position
MTCARS <- arrange(MTCARS, hp)
slice(MTCARS, 1L)

by_cyl <- group_by(MTCARS, cyl)
# Grouping is ignored by slice.
slice(by_cyl, 1:2)
# Use filter and row_number to obtain slices per group.
filter(by_cyl, row_number(hp) < 3L) 

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> # Display the names of the rows in MTCARS
R> rownames(MTCARS)
 [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"          "Hornet 4 Drive"      "Hornet Sportabout"  
 [6] "Valiant"             "Duster 360"          "Merc 240D"           "Merc 230"            "Merc 280"           
[11] "Merc 280C"           "Merc 450SE"          "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"            "Honda Civic"         "Toyota Corolla"     
[21] "Toyota Corona"       "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"          "Pontiac Firebird"   
[26] "Fiat X1-9"           "Porsche 914-2"       "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
[31] "Maserati Bora"       "Volvo 142E"  
R> # Select the first row
R> slice(MTCARS, 1L)
          mpg cyl disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4    4
R>
R> # Arrange the rows by horsepower, then select the first row by position
R> MTCARS <- arrange(MTCARS, hp)
R> slice(MTCARS, 1L)
   mpg cyl disp hp drat    wt  qsec vs am gear carb
1 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
R>
R> by_cyl <- group_by(MTCARS, cyl)
R> # Grouping is ignored by slice
R> slice(by_cyl, 1:2)
   mpg cyl  disp hp drat    wt  qsec vs am gear carb
1 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
Warning message:
In slice_.ore.frame(.data, .dots = .ore.dplyr.exprall(..., env = parent.frame())) :
  grouping is ignored
R> # Use filter and row_number to obtain slices per group
R> filter(by_cyl, row_number(hp) < 3L)
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
2 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
3 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
5 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
6 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
3.3.1.5 Examples of Arranging Columns

Examples of the arrange and desc functions of the OREdplyr package.

Example 3-64 Arranging Columns

This example arranges columns from the ore.frame object MTCARS that is created by using the ore.push function on the mtcars data.frame object. The second arrange() invocation calls the desc() function to arrange the values in descending order.

MTCARS <- ore.push(mtcars)
head(arrange(mtcars, cyl, disp))
head(arrange(MTCARS, desc(disp)))

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> head(arrange(MTCARS, cyl, disp))
   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
2 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
4 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
5 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
6 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
R> head(arrange(MTCARS, desc(disp)))
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4
2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4
3 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4
4 19.2   8  400 175 3.08 3.845 17.05  0  0    3    2
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4
3.3.1.6 Examples of Filtering Columns

Examples of the filter function of the OREdplyr package.

Example 3-65 Filtering Columns

This example filters columns from the MTCARS ore.frame object that is created by using the ore.push function on the mtcars data.frame object.

MTCARS <- ore.push(mtcars)
head(filter(MTCARS, cyl == 8))
# Using multiple criteria
head(filter(MTCARS, cyl < 6 & vs == 1))

# Using multiple arguments is the equivalent to using &
head(filter(MTCARS, cyl < 6, vs == 1))

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> head(filter(MTCARS, cyl == 8))
   mpg cyl  disp  hp drat   wt  qsec vs am gear carb
1 18.7   8 360.0 175 3.15 3.44 17.02  0  0    3    2
2 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
3 16.4   8 275.8 180 3.07 4.07 17.40  0  0    3    3
4 17.3   8 275.8 180 3.07 3.73 17.60  0  0    3    3
5 15.2   8 275.8 180 3.07 3.78 18.00  0  0    3    3
6 10.4   8 472.0 205 2.93 5.25 17.98  0  0    3    4
R> head(filter(MTCARS, cyl < 6 & vs == 1))
   mpg cyl  disp hp drat    wt  qsec vs am gear carb
1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
R>
R> # Using multiple arguments is the equivalent to using &
R> head(filter(MTCARS, cyl < 6, vs == 1))
   mpg cyl  disp hp drat    wt  qsec vs am gear carb
1 22.8   4 108.0 93 3.85 2.320 18.61  1  1    4    1
2 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
3 22.8   4 140.8 95 3.92 3.150 22.90  1  0    4    2
4 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
5 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
6 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
3.3.1.7 Examples of Mutating Columns

Examples of the mutate and transmute functions of the OREdplyr package.

Example 3-66 Mutating Columns

This example uses the MTCARS ore.frame object that is created by using the ore.push function on the mtcars data.frame object.

The mutate function adds the extra column displ_1 with the value derived from that of column disp. Setting the column to NULL removes the column.

MTCARS <- ore.push(mtcars)
head(mutate(MTCARS, displ_l = disp / 61.0237))
head(transmute(MTCARS, displ_l = disp / 61.0237))
head(mutate(MTCARS, cyl = NULL))
head(mutate(MTCARS, cyl = NULL, hp = NULL, displ_l = disp / 61.0237))

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> head(mutate(MTCARS, displ_l = disp / 61.0237))
   mpg cyl disp  hp drat    wt  qsec vs am gear carb  displ_l
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 2.621932
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 2.621932
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.769804
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 4.227866
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 5.899347
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 3.687092
R> head(transmute(MTCARS, displ_l = disp / 61.0237))
   displ_l
1 2.621932
2 2.621932
3 1.769804
4 4.227866
5 5.899347
6 3.687092
R> head(mutate(mtcars, cyl = NULL))
   mpg disp  hp drat    wt  qsec vs am gear carb
1 21.0  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1  225 105 2.76 3.460 20.22  1  0    3    1
R> head(mutate(mtcars, cyl = NULL, hp = NULL, displ_l = disp / 61.0237))
   mpg disp drat    wt  qsec vs am gear carb  displ_l
1 21.0  160 3.90 2.620 16.46  0  1    4    4 2.621932
2 21.0  160 3.90 2.875 17.02  0  1    4    4 2.621932
3 22.8  108 3.85 2.320 18.61  1  1    4    1 1.769804
4 21.4  258 3.08 3.215 19.44  1  0    3    1 4.227866
5 18.7  360 3.15 3.440 17.02  0  0    3    2 5.899347
6 18.1  225 2.76 3.460 20.22  1  0    3    1 3.687092

3.3.2 Join Rows

OREdplyr functions for joining rows.

Table 3-3 Joining Rows

Function Description

full_join

Returns the union of rows from left_join and right_join.

inner_join

Returns all combination of rows from x and y over matched columns.

left_join

Returns rows from inner_join plus rows from y that do not match with x. For unmatched rows of y, NA is returned.

right_join

Returns rows from inner_join plus rows from x that do not match with y. For unmatched rows of x, NA is returned.

Example 3-67 Joining Rows

To join two tables, the join function selects the columns in each table that have the same name or uses the argument by to specify the columns.

MTCARS <- ore.push(mtcars)
M1 <- filter(select(MTCARS, mpg, cyl, carb), carb < 6L)
M2 <- filter(select(MTCARS, cyl, hp, carb), carb > 2L)

names(inner_join(M1, M2))
nrow(left_join(M1, M2))
nrow(right_join(M1, M2))
nrow(full_join(M1, M2))

names(M2) <- c("cyl", "hp", "carb2")
names(inner_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(inner_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(left_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(right_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(full_join(M1, M2, by = c("cyl", carb="carb2")))

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> M1 <- filter(select(MTCARS, mpg, cyl, carb), carb < 6L)
R> M2 <- filter(select(MTCARS, cyl, hp, carb), carb > 2L)
R>
R> names(inner_join(M1, M2))
[1] "cyl"  "carb" "mpg"  "hp"  
R> nrow(left_join(M1, M2))
[1] 78
R> nrow(right_join(M1, M2))
[1] 63
R> nrow(full_join(M1, M2))
[1] 80
R> 
R> names(M2) <- c("cyl", "hp", "carb2")
R> names(inner_join(M1, M2, by = c("cyl", carb="carb2")))
[1] "cyl"  "carb" "mpg"  "hp"
R> nrow(inner_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 61
R> nrow(left_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 78
R> nrow(right_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 63
R> nrow(full_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 80

3.3.3 Group Columns and Rows

OREdplyr functions for grouping columns and rows.

Table 3-4 Grouping Columns and Rows

Function Description

group_by

group_by_

Groups an ore.frame object over the specified columns.

group_size

Lists the number of rows in each group.

groups

Shows the names of the grouping columns.

n_groups

Returns the number of groups.

ungroup

Drops the grouping from the input ore.frame object.

Example 3-68 Using Grouping Functions

The following examples use the ore.frame object MTCARS that is created by using the ore.push function on the mtcars data.frame object. They exemplify the use of the grouping functions group_by, group_size, groups, n_group, and ungroup. They also use the OREdplyr functions arrange, rename, and summarize.

MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)

# Apply the summarise function to each group
arrange(summarise(by_cyl, mean(disp), mean(hp)), cyl)

# Summarise drops one layer of grouping
by_vs_am <- group_by(MTCARS, vs, am)
by_vs <- summarise(by_vs_am, n = n())
arrange(by_vs, vs, am)
arrange(summarise(by_vs, n = sum(n)), vs)

# Remove grouping
summarise(ungroup(by_vs), n = sum(n))

# Group by expressions with mutate
arrange(group_size(group_by(mutate(MTCARS, vsam = vs + am), vsam)), vsam)

# Rename the grouping column
groups(rename(group_by(MTCARS, vs), vs2 = vs))

# Add more grouping columns
groups(group_by(by_cyl, vs, am))
groups(group_by(by_cyl, vs, am, add = TRUE))

# Drop duplicate groups
groups(group_by(by_cyl, cyl, cyl))

# Load the magrittr library to use the forward-pipe operator %>%
library(magrittr)
by_cyl_gear_carb <- MTCARS %>% group_by(cyl, gear, carb)
n_groups(by_cyl_gear_carb)
arrange(group_size(by_cyl_gear_carb), cyl, gear, carb)

by_cyl <- MTCARS %>% group_by(cyl)
# Number of groups
n_groups(by_cyl)

# Size of each group
arrange(group_size(by_cyl), cyl)

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R>
R> # Apply the summarise function to each group
R> arrange(summarise(by_cyl, mean(disp), mean(hp)), cyl)
  cyl mean.disp.  mean.hp.
1   4   105.1364  82.63636
2   6   183.3143 122.28571
3   8   353.1000 209.21429
R>
R> # Summarise drops one layer of grouping
R> by_vs_am <- group_by(MTCARS, vs, am)
R> by_vs <- summarise(by_vs_am, n = n())
R> arrange(by_vs, vs, am)
  vs am  n
1  0  0 12
2  0  1  6
3  1  0  7
4  1  1  7
R> arrange(summarise(by_vs, n = sum(n)), vs)
  vs  n
1  0 18
2  1 14
R>
R> # Remove grouping
R> summarise(ungroup(by_vs), n = sum(n))
 n 
32 
R> 
R> # Group by expressions with mutate
R> arrange(group_size(group_by(mutate(MTCARS, vsam = vs + am), vsam)), vsam)
  vsam  n
1    0 12
2    1 13
3    2  7
R> 
R> # Rename the grouping column
R> groups(rename(group_by(MTCARS, vs), vs2 = vs))
[1] "vs2"
R> 
R> # Add more grouping columns
R> groups(group_by(by_cyl, vs, am))
[[1]]
[1] "vs"

[[2]]
[1] "am"

R> groups(group_by(by_cyl, vs, am, add = TRUE))
[[1]]
[1] "cyl"

[[2]]
[1] "vs"

[[3]]
[1] "am"
R> 
R> # Drop duplicate groups
R> groups(group_by(by_cyl, cyl, cyl))
[1] "cyl
R> 
R> # Load the magrittr library to use the forward-pipe operator %>%
R> library(magrittr)
R> by_cyl_gear_carb <- MTCARS %>% group_by(cyl, gear, carb)
R> n_groups(by_cyl_gear_carb)
[1] 12
R> arrange(group_size(by_cyl_gear_carb), cyl, gear, carb)
   cyl gear carb n
1    4    3    1 1
2    4    4    1 4
3    4    4    2 4
4    4    5    2 2
5    6    3    1 2
6    6    4    4 4
7    6    5    6 1
8    8    3    2 4
9    8    3    3 3
10   8    3    4 5
11   8    5    4 1
12   8    5    8 1
R> 
R> by_cyl <- MTCARS %>% group_by(cyl)
R> # Number of groups
R> n_groups(by_cyl)
[1] 3
R> # Number of groups
R> n_groups(by_cyl)
[1] 3
R>
R> # Size of each group
R> arrange(group_size(by_cyl), cyl)
  cyl  n
1   4 11
2   6  7
3   8 14

3.3.4 Aggregate Columns and Rows

OREdplyr functions for aggregating columns and rows.

Table 3-5 Aggregating Columns and Rows

Function Description

count

count_

Counts rows by group; similar to tally, but it does the group_by for you.

summarise

summarise_

Summarizes columns by using aggregate functions. When an ore.frame object is grouped, the aggregate function is applied group-wise. The resulting ore.frame drops one grouping of the input ore.frame.

tally

Tallies rows by group; a convenient wrapper for summarise that either calls n or sum(n) depending on whether you’re tallying for the first time or re-tallying.

Example 3-69 Aggregating Columns

The following examples use the ore.frame object MTCARS that is created by using the ore.push function on the mtcars data.frame object. They exemplify the use of the aggregation functions count, summarize, and tally. They also use the OREdplyr functions arrange and group_by.

MTCARS <- ore.push(mtcars)
arrange(tally(group_by(MTCARS, cyl)), cyl)
tally(group_by(MTCARS, cyl), sort = TRUE)

# Multiple tallys progressively roll up the groups
cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))

cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), wt = hp, sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))

cyl_by_gear <- count(MTCARS, cyl, gear, wt = hp + mpg, sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))

# Load the magrittr library to use the forward-pipe operator %>%
library(magrittr)
MTCARS %>% group_by(cyl) %>% tally(sort = TRUE)

# count is more succinct and also does the grouping
MTCARS %>% count(cyl) %>% arrange(cyl)
MTCARS %>% count(cyl, wt = hp) %>% arrange(cyl)
MTCARS %>% count_("cyl", wt = hp, sort = TRUE)

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> arrange(tally(group_by(MTCARS, cyl)), cyl)
  cyl  n
1   4 11
2   6  7
3   8 14
R> tally(group_by(MTCARS, cyl), sort = TRUE)
  cyl  n
1   8 14
2   4 11
3   6  7
R> 
R> # Multiple tallys progressively roll up the groups
R> cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
  cyl  n
1   8 14
2   4 11
3   6  7
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
 n 
32 
R>
R> cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), wt = hp, sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
  cyl    n
1   8 2929
2   4  909
3   6  856
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
   n 
4694 
R> 
R> cyl_by_gear <- count(MTCARS, cyl, gear, wt = hp + mpg, sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
  cyl      n
1   8 3140.4
2   4 1202.3
3   6  994.2
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
     n 
5336.9 
R> 
R> # Load the magrittr library to use the forward-pipe operator %>%
R> library(magrittr)
R> MTCARS %>% group_by(cyl) %>% tally(sort = TRUE)
  cyl  n
1   8 14
2   4 11
3   6  7
R> 
R> # count is more succinct and also does the grouping
R> MTCARS %>% count(cyl) %>% arrange(cyl)
  cyl  n
1   4 11
2   6  7
3   8 14
R> MTCARS %>% count(cyl, wt = hp) %>% arrange(cyl)
  cyl    n
1   4  909
2   6  856
3   8 2929
R> MTCARS %>% count_("cyl", wt = hp, sort = TRUE)
  cyl    n
1   8 2929
2   4  909
3   6  856

3.3.5 Sample Rows

OREdplyr functions for sampling rows.

Table 3-6 Sampling Row Functions

Function Description

sample_frac

Samples an ore.frame object by a fraction.

sample_n

Samples an ore.frame object by a fixed number of rows.

Example 3-70 Sampling Rows

These examples use the ore.frame object MTCARS that is created by using the ore.push function on the mtcars data.frame object. They exemplify the use of the sampling functions sample_n and sample_frac. They also use the OREdplyr functions arrange and summarize.

MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)

# Sample fixed number per group of rows from the entire dataset
sample_n(MTCARS, 10)
nrow(sample_n(MTCARS, 50, replace = TRUE))
sample_n(MTCARS, 10, weight = mpg)
sample_n(MTCARS, 10, weight = MTCARS[["mpg"]])

# Sample fixed number of rows per group with replacement and weight 
arrange(sample_n(by_cyl, 3), cyl, mpg)
arrange(summarise(sample_n(by_cyl, 10, replace = TRUE), n = n()), cyl)
arrange(summarise(sample_n(by_cyl, 3, weight = mpg/mean(mpg)), n = n()), cyl)
arrange(summarise(sample_n(by_cyl, 3, 
                           weight = by_cyl[["mpg"]]/mean(by_cyl[["mpg"]])), n = n()), cyl) 

# Sample fixed fraction per group
nrow(sample_frac(MTCARS, 0.1))
nrow(sample_frac(MTCARS, 1.5, replace = TRUE))
nrow(sample_frac(MTCARS, 0.1, weight = 1/mpg))

Listing for This Example

R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R> 
R> # Sample fixed number per group of rows from the entire dataset
R> sample_n(MTCARS, 10)
                  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Datsun 710|4     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Ford Pantera L|2 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Honda Civic|10   30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa|6   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Maserati Bora|3  15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Mazda RX4|5      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag|9  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Merc 280|8       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Toyota Corolla|7 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona|1  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
R> nrow(sample_n(MTCARS, 50, replace = TRUE))
[1] 50
R>
R> # Sample fixed number of rows per group with replacement and weight 
R> arrange(sample_n(by_cyl, 3), cyl, mpg)
  cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
1   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
2   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
3   4 30.4  95.1 113 3.77 1.513 16.90  1  1    5    2
4   6 19.2 167.6 123 3.92 3.440 18.30  1  0    4    4
5   6 19.7 145.0 175 3.62 2.770 15.50  0  1    5    6
6   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
7   8 10.4 460.0 215 3.00 5.424 17.82  0  0    3    4
8   8 15.2 304.0 150 3.15 3.435 17.30  0  0    3    2
9   8 15.2 275.8 180 3.07 3.780 18.00  0  0    3    3
R> arrange(summarise(sample_n(by_cyl, 10, replace = TRUE), n = n()), cyl)
  cyl  n
1   4 10
2   6 10
3   8 10
R> arrange(summarise(sample_n(by_cyl, 3, weight = mpg/mean(mpg)), n = n()), cyl)
  cyl n
1   4 3
2   6 3
3   8 3
R> arrange(summarise(sample_n(by_cyl, 3, weight = by_cyl[["mpg"]]/mean(by_cyl[["mpg"]])), n = n()), cyl)
  cyl n
1   4 3
2   6 3
3   8 3
R> 
R> nrow(sample_frac(MTCARS, 0.1))
[1] 3
R> nrow(sample_frac(MTCARS, 1.5, replace = TRUE))
[1] 48
R> nrow(sample_frac(MTCARS, 0.1, weight = 1/mpg))
[1] 3

3.3.6 Rank Rows

OREdplyr functions for ranking rows.

The ranking functions rank the elements in an ordered ore.vector by its values. An ore.character is coerced to an ore.factor. The values of an ore.factor are based upon factor levels. To reverse the direction of the ranking, use the desc function.

Table 3-7 Ranking Rows

Function Description
cume_dist

A cumulative distribution function: returns the proportion of all values that are less than or equal to the current rank.

dense_rank

Like min_rank but with no gaps between ranks.

first

Gets the first value from an ordered ore.vector object.

last

Gets the last value from an ordered ore.vector object.

min_rank

Equivalent to rank(ties.method = "min").

nth

Obtains the value at the specified position in the order.

ntile

A rough ranking that breaks the input vector into n buckets.

n_distinct

Gets the nth value from an ordered ore.vector object.

percent_rank

Returns a number between 0 and 1 that is computed by rescaling min_rank to [0, 1].

row_number

Equivalent to rank(ties.method = "first").

top_n

Selects the top or bottom number of rows.

Example 3-71 Ranking Rows

These examples use the ranking functions row_number, min_rank, dense_rank, percent_rank, cume_dist, and ntile.

X <- ore.push(c(5, 1, 3, 2, 2, NA))

row_number(X)
row_number(desc(X))

min_rank(X)

dense_rank(X)

percent_rank(X)

cume_dist(X)

ntile(X, 2)
ntile(ore.push(runif(100)), 10)

MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)

# Using ranking functions with an ore.frame
head(mutate(MTCARS, rank = row_number(hp)))

head(mutate(MTCARS, rank = min_rank(hp)))

head(mutate(MTCARS, rank = dense_rank(hp)))

# Using ranking functions with a grouped ore.frame
head(mutate(by_cyl, rank = row_number(hp)))

head(mutate(by_cyl, rank = min_rank(hp)))

head(mutate(by_cyl, rank = dense_rank(hp)))

Listing for This Example

R> X <- ore.push(c(5, 1, 3, 2, 2, NA))
R> 
R> row_number(X)
[1] 5 1 4 2 3 6
R> row_number(desc(X))
[1] 1 5 2 3 4 6
R> 
R> min_rank(X)
[1] 5 1 4 2 2 6
R> 
R> dense_rank(X)
[1] 4 1 3 2 2 6
R> 
R> percent_rank(X)
[1] 0.8 0.0 0.6 0.2 0.2 1.0
R> 
R> cume_dist(X)
[1] 0.8333333 0.1666667 0.6666667 0.5000000 0.5000000 1.0000000
R> 
R> ntile(X, 2)
[1] 2 1 2 1 1 2
R> ntile(ore.push(runif(100)), 10)
  [1]  6 10  5  2  1  1  8  3  8  8  7  3 10  3  7  9  9  4  4 10 10  7  2  3  7  4  5  5  3  9  4  6  8  4 10  6  1  5  5  4  6  9
 [43]  5  8  2  7  7  1  2  9  1  2  8  5  6  5  3  4  7  1  3  1 10  1  5  5 10  9  2  3  9  6  6  8  8  6  3  7  2  2  8  4  1  9
 [85]  6 10  4 10  7  2  9 10  7  2  4  9  6  3  8  1
R>
R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R>
R> # Using ranking functions with an ore.frame
R> head(mutate(MTCARS, rank = row_number(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   12
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   13
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    7
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   14
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   20
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   10
R> 
R> head(mutate(MTCARS, rank = min_rank(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   12
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   12
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    7
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   12
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   20
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1   10
R> 
R> head(mutate(MTCARS, rank = dense_rank(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4   11
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4   11
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    6
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1   11
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2   15
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    9
R> 
R> # Using ranking functions with a grouped ore.frame
R> head(mutate(by_cyl, rank = row_number(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    3
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    7
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    4
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    1
R>
R> head(mutate(by_cyl, rank = min_rank(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    7
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    1
R>
R> head(mutate(by_cyl, rank = dense_rank(hp)))
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb rank
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    6
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    1

3.4 About Using Third-Party Packages on the Client

In Oracle Machine Learning for R, if you want to use functions from an open source R package from The Comprehensive R Archive Network (CRAN) or other third-party R package, then you would generally do so in the context of embedded R execution.

Using embedded R execution, you can take advantage of the likely greater amount of memory on the database server.

However, if you want to use a third-party package function in your local R session on data from an Oracle database table, you must use the ore.pull function to get the data from an ore.frame object to your local session as a data.frame object. This is the same as using open source R except that you can extract the data from the database without needing the help of a DBA.

When pulling data from a database table to a local data.frame, you are limited to using the amount of data that can fit into the memory of your local machine. On your local machine, you do not have the benefits provided by embedded R execution.

To use a third-party package, you must install it on your system and load it in your R session.

Example 3-72 Downloading, Installing, and Loading a Third-Party Package on the Client

This example demonstrates downloading, installing, and loading the CRAN package kernlab. The kernlab package contains kernel-based machine learning methods. The example invokes the install.packages function to download and install the package. It then invokes the library function to load the package.

install.packages("kernlab")
library("kernlab")

Listing for This Example

R> install.packages("kernlab")
trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.0/kernlab_0.9-19.zip'
Content type 'application/zip' length 2029405 bytes (1.9 Mb)
opened URL
downloaded 1.9 Mb
 
package 'kernlab' successfully unpacked and MD5 sums checked
 
The downloaded binary packages are in
	C:\Users\oml_user\AppData\Local\Temp\RtmpSKVZql\downloaded_packages
R> library("kernlab")

Example 3-73 Using a kernlab Package Function

This example invokes the demo function to look for example programs in the kernlab package. Because the package does not have examples, this example then gets help for the ksvm function. The example invokes example code from the help.

demo(package = "kernlab")
help(package = "kernlab", ksvm)
data(spam)
index <- sample(1:dim(spam)[1])
spamtrain <- spam[index[1:floor(dim(spam)[1]/2)], ]
spamtest <- spam[index[((ceiling(dim(spam)[1]/2)) + 1):dim(spam)[1]], ]
filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot",
+              kpar=list(sigma=0.05),C=5,cross=3)
filter
table(mailtype,spamtest[,58])

Listing for This Example

> demo(package = "kernlab")
no demos found
> help(package = "kernlab", ksvm)    # Output not shown.
> data(spam)
> index <- sample(1:dim(spam)[1])
> spamtrain <- spam[index[1:floor(dim(spam)[1]/2)], ]
> spamtest <- spam[index[((ceiling(dim(spam)[1]/2)) + 1):dim(spam)[1]], ]
> filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot",
+                kpar=list(sigma=0.05),C=5,cross=3)
> filter
Support Vector Machine object of class "ksvm" 
 
SV type: C-svc  (classification) 
 parameter : cost C = 5 
 
Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.05 
 
Number of Support Vectors : 970 
 
Objective Function Value : -1058.218 
Training error : 0.018261 
Cross validation error : 0.08696 
> mailtype <- predict(filter,spamtest[,-58])
> table(mailtype,spamtest[,58])
         
mailtype  nonspam spam
  nonspam    1347  136
  spam         45  772