3 Prepare and Explore Data in the Database
Use Oracle Machine Learning for R functions to prepare data for analysis and to perform exploratory analysis of the data.
These functions make it easier for you to prepare very large enterprise database-resident data for modeling. They are described the following topics:
- Prepare Data in the Database Using Oracle Machine Learning for R
Using OML4R, you can prepare data for analysis in the database. - Explore Data
Oracle Machine Learning for R provides functions that enable you to perform exploratory data analysis. - Data Manipulation Using OREdplyr
OREdplyr
package functions transparently implementdplyr
functions for use withore.frame
andore.numeric
objects. - About Using Third-Party Packages on the Client
In Oracle Machine Learning for R, if you want to use functions from an open source R package from The Comprehensive R Archive Network (CRAN) or other third-party R package, then you would generally do so in the context of embedded R execution.
3.1 Prepare Data in the Database Using Oracle Machine Learning for R
Using OML4R, you can prepare data for analysis in the database.
Data preparation is described in the following topics:
- About Preparing Data in the Database
Oracle Machine Learning for R provides functions that enable you to use R to prepare database data for analysis. - Select Data
A typical step in preparing data for analysis is selecting or filtering values of interest from a larger data set. - Index Data
You can use integer or character vectors to index an orderedore.frame
object. - Combine Data
You can join data fromore.frame
objects that represent database tables by using themerge
function. - Summarize Data
Summarize data with theaggregate
function. - Transform Data
In preparing data for analysis, a typical step is to transform data by reformatting it or deriving new columns and adding them to the data set. - Sample Data
Sampling is an important capability for statistical analytics. - Partition Data
In analyzing large data sets, a typical operation is to randomly partition the data set into subsets. - Prepare Time Series Data
OML4R provides you with the ability to perform many data preparation operations on time series data, such as filtering, ordering, and transforming the data.
Parent topic: Prepare and Explore Data in the Database
3.1.1 About Preparing Data in the Database
Oracle Machine Learning for R provides functions that enable you to use R to prepare database data for analysis.
Using these functions, you can perform typical data preparation tasks on ore.frame
and other OML4R objects. You can perform data preparation operations on large quantities of data in the database and then pull the results to your local R session for analysis using functions in packages available from The Comprehensive R Archive Network (CRAN).
You can do operations on data such as the following.
-
Selecting
-
Binning
-
Sampling
-
Sorting and Ordering
-
Summarizing
-
Transforming
-
Performing data preparation operations on date and time data
Performing these operations is described in the other topics in this chapter.
3.1.2 Select Data
A typical step in preparing data for analysis is selecting or filtering values of interest from a larger data set.
- Select Data by Column
This example selects columns from anore.frame
object. - Select Data by Row
This example selects rows from an orderedore.frame
object. - Select Data by Value
This example selects portions of a data set.
3.1.2.1 Select Data by Column
This example selects columns from an ore.frame
object.
Example 3-1 Selecting Data by Column
This example first creates a temporary database table, with the corresponding proxy ore.frame
object iris_of
, from the iris
data.frame
object. It displays the first three rows of iris_of
. The example selects two columns from iris_of
and creates the ore.frame
object iris_projected
with them. It then displays the first three rows of iris_projected
.
iris_of <- ore.push(iris) head(iris_of, 3) iris_projected = iris_of[, c("Petal.Length", "Species")] head (iris_projected, 3)Listing for This Example
iris_of <- ore.push(iris) head(iris_of, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa R> iris_projected = iris_of[, c("Petal.Length", "Species")] R> head (iris_projected, 3) Petal.Length Species 1 1.4 setosa 2 1.4 setosa 3 1.3 setosa
Parent topic: Select Data
3.1.2.2 Select Data by Row
This example selects rows from an ordered ore.frame
object.
Example 3-2 Selecting Data by Row
This example first adds a column to the iris
data.frame
object for use in creating an ordered ore.frame
object. It invokes the ore.drop
function to delete the database table IRIS_TABLE, if it exists. It then creates a database table, with the corresponding proxy ore.frame
object IRIS_TABLE
, from the iris
data.frame
. The example invokes the ore.exec
function to execute a SQL statement that makes the RID column the primary key of the database table. It then invokes the ore.sync
function to synchronize the IRIS_TABLE
ore.frame
object with the table and displays the first three rows of the proxy ore.frame
object.
The example next selects 51 rows from IRIS_TABLE
by row number and creates the ordered ore.frame
object iris_selrows
with them. It displays the first six rows of iris_selrows
. It then selects 3 rows by row name and displays the result.
# Add a column to the iris data set to use as row identifiers. iris$RID <- as.integer(1:nrow(iris) + 100) ore.drop(table = 'IRIS_TABLE') ore.create(iris, table = 'IRIS_TABLE') ore.exec("alter table IRIS_TABLE add constraint IRIS_TABLE primary key (\"RID\")") ore.sync(table = "IRIS_TABLE") head(IRIS_TABLE, 3) # Select rows by row number. iris_selrows <- IRIS_TABLE[50:100,] head(iris_selrows) # Select rows by row name. IRIS_TABLE[c("101", "151", "201"),]Listing for This Example
R> # Add a column to the iris data set to use as row identifiers. R> iris$RID <- as.integer(1:nrow(iris) + 100) R> ore.drop(table = 'IRIS_TABLE') R> ore.create(iris, table = 'IRIS_TABLE') R> ore.exec("alter table IRIS_TABLE add constraint IRIS_TABLE + primary key (\"RID\")") R> ore.sync(table = "IRIS_TABLE") R> head(IRIS_TABLE, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species RID 101 5.1 3.5 1.4 0.2 setosa 101 102 4.9 3.0 1.4 0.2 setosa 102 103 4.7 3.2 1.3 0.2 setosa 103 R> # Select rows by row number. R> iris_selrows <- IRIS_TABLE[50:100,] R> head(iris_selrows) Sepal.Length Sepal.Width Petal.Length Petal.Width Species RID 150 5.0 3.3 1.4 0.2 setosa 150 151 7.0 3.2 4.7 1.4 versicolor 151 152 6.4 3.2 4.5 1.5 versicolor 152 153 6.9 3.1 4.9 1.5 versicolor 153 154 5.5 2.3 4.0 1.3 versicolor 154 155 6.5 2.8 4.6 1.5 versicolor 155 R> # Select rows by row name. R> IRIS_TABLE[c("101", "151", "201"),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species RID 101 5.1 3.5 1.4 0.2 setosa 101 151 7.0 3.2 4.7 1.4 versicolor 151 201 6.3 3.3 6.0 2.5 virginica 201
Parent topic: Select Data
3.1.2.3 Select Data by Value
This example selects portions of a data set.
Example 3-3 Selecting Data by Value
The example pushes the iris
data set to the database and gets the ore.frame
object iris_of
. It filters the data to produce iris_of_filtered
, which contains the values from the rows of iris_of
that have a petal length of less than 1.5 and that are in the Sepal.Length and Species columns. The example also filters the data using conditions, so that iris_of_filtered
contains the values from iris_of
that are of the setosa or versicolor species and that have a petal width of less than 2.0.
iris_of <- ore.push(iris) # Select sepal length and species where petal length is less than 1.5. iris_of_filtered <- iris_of[iris_of$Petal.Length < 1.5, c("Sepal.Length", "Species")] names(iris_of_filtered) nrow(iris_of_filtered) head(iris_of_filtered, 3) # Alternate syntax filtering. iris_of_filtered <- subset(iris_of, Petal.Length < 1.5) nrow(iris_of_filtered) head(iris_of_filtered, 3) # Using the AND and OR conditions in filtering. # Select all rows with in which the species is setosa or versicolor. # and the petal width is less than 2.0. iris_of_filtered <- iris_of[(iris_of$Species == "setosa" | iris_of$Species == "versicolor") & iris_of$Petal.Width < 2.0,] nrow(iris_of_filtered) head(iris_of, 3)Listing for This Example
R> iris_of <- ore.push(iris) R> # Select sepal length and species where petal length is less than 1.5. R> iris_of_filtered <- iris_of[iris_of$Petal.Length < 1.5, + c("Sepal.Length", "Species")] R> names(iris_of_filtered) [1] "Sepal.Length" "Species" R> nrow(iris_of_filtered) [1] 24 R> head(iris_of_filtered, 3) Sepal.Length Species 1 5.1 setosa 2 4.9 setosa 3 4.7 setosa R> # Alternate syntax filtering. R> iris_of_filtered <- subset(iris_of, Petal.Length < 1.5) R> nrow(iris_of_filtered)[1] 24 R> head(iris_of_filtered, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa R> # Using the AND and OR conditions in filtering. R> # Select all rows with in which the species is setosa or versicolor. R> # and the petal width is less than 2.0. R> iris_of_filtered <- iris_of[(iris_of$Species == "setosa" | + iris_of$Species == "versicolor") & + iris_of$Petal.Width < 2.0,] R> nrow(iris_of_filtered)[1] 100 R> head(iris_of, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa
Parent topic: Select Data
3.1.3 Index Data
You can use integer or character vectors to index an ordered ore.frame
object.
You can use the indexing to perform sampling and partitioning, as described in "Sampling Data" and "Partitioning Data".
Oracle Machine Learning for R supports functionality similar to R indexing with these differences:
-
Integer indexing is not supported for
ore.vector
objects. -
Negative integer indexes are not supported.
-
Row order is not preserved.
Example 3-4 Indexing an ore.frame Object
This example demonstrates character and integer indexing. The example uses the ordered SPAM_PK
ore.frame
object from Example 2-13. The example shows that you can access rows by name and that you can also access a set of rows by supplying a vector of character row names. The example then shows that you can supply the actual integer value. In the example this results in a set of different rows because the USERID values start at 1001, as opposed to 1.
# Index to a specifically named row. SPAM_PK["2060", 1:4] # Index to a range of rows by row names. SPAM_PK[as.character(2060:2064), 1:4] # Index to a range of rows by integer index. SPAM_PK[2060:2063, 1:4]Listing for This Example
R> # Index to a specifically named row. R> SPAM_PK["2060", 1:4] TS USERID make address 2060 2060 380 0 0 R> # Index to a range of rows by row names. R> SPAM_PK[as.character(2060:2064), 1:4] TS USERID make address 2060 2060 380 0 0 2061 2061 381 0 0 2062 2062 381 0 0 2063 2063 382 0 0 2064 2064 382 0 0 R> # Index to a range of rows by integer index. R> SPAM_PK[2060:2063, 1:4] TS USERID make address 3060 3060 380 0.00 0.00 3061 3061 381 0.00 1.32 3062 3062 381 0.00 2.07 3063 3063 382 0.34 0.00
3.1.4 Combine Data
You can join data from ore.frame
objects that represent database tables by using the merge
function.
Example 3-5 Joining Data from Two Tables
This example creates two data.frame
objects and merges them. It then invokes the ore.create
function to create a database table for each data.frame
object. The ore.create
function automatically generates an ore.frame
object as a proxy object for the table. The ore.frame
object has the same name as the table. The example merges the ore.frame
objects. Note that the order of the results of the two merge
operations is not the same because the ore.frame
objects are unordered.
# Create data.frame objects. df1 <- data.frame(x1=1:5, y1=letters[1:5]) df2 <- data.frame(x2=5:1, y2=letters[11:15]) # Combine the data.frame objects. merge (df1, df2, by.x="x1", by.y="x2") # Create database tables and ore.frame proxy objects to correspond to # the local R objects df1 and df2. ore.create(df1, table="DF1_TABLE") ore.create(df2, table="DF2_TABLE") # Combine the ore.frame objects. merge (DF1_TABLE, DF2_TABLE, by.x="x1", by.y="x2")Listing for This Example
R> # Create data.frame objects. R> df1 <- data.frame(x1=1:5, y1=letters[1:5]) R> df2 <- data.frame(x2=5:1, y2=letters[11:15]) R> # Combine the data.frame objects. R> merge (df1, df2, by.x="x1", by.y="x2") x1 y1 y2 1 1 a o 2 2 b n 3 3 c m 4 4 d l 5 5 e k R> # Create database tables and ore.frame proxy objects to correspond to R> # the local R objects df1 and df2. R> ore.create(df1, table="DF1_TABLE") R> ore.create(df2, table="DF2_TABLE") R> # Combine the ore.frame objects. R> merge (DF1_TABLE, DF2_TABLE, by.x="x1", by.y="x2") x1 y1 y2 1 5 e k 2 4 d l 3 3 c m 4 2 b n 5 1 a o Warning message: ORE object has no unique key - using random order
3.1.5 Summarize Data
Summarize data with the aggregate
function.
Example 3-6 Aggregating Data
This example pushes the iris
data set to database memory as the ore.frame
object iris_of
. It aggregates the values of iris_of
by the Species column using the length
function. It then displays the first three rows of the result.
# Create a temporary database table from the iris data set and get an ore.frame. iris_of <- ore.push(iris) aggdata <- aggregate(iris_of$Sepal.Length, by = list(species = iris_of$Species), FUN = length) head(aggdata, 3)Listing for This Example
# Create a temporary database table from the iris data set and get an ore.frame. R> iris_of <- ore.push(iris) R> aggdata <- aggregate(iris_of$Sepal.Length, + by = list(species = iris_of$Species), + FUN = length) R> head(aggdata, 3) species x setosa setosa 50 versicolor versicolor 50 virginica virginica 50
3.1.6 Transform Data
In preparing data for analysis, a typical step is to transform data by reformatting it or deriving new columns and adding them to the data set.
The examples in this topic demonstrate two ways of formatting data and deriving columns.
Example 3-7 Formatting Data
This example creates a function to format the data in a column.
# Create a function for formatting data. petalCategory_fmt <- function(x) { ifelse(x > 5, 'LONG', ifelse(x > 2, 'MEDIUM', 'SMALL')) } # Create an ore.frame in database memory with the iris data set. iris_of <- ore.push(iris) # Select some rows from iris_of. iris_of[c(10, 20, 60, 80, 110, 140),] # Format the data in Petal.Length column. iris_of$Petal.Length <- petalCategory_fmt(iris_of$Petal.Length) # Select the same rows from iris_of.
Listing for This Example
R> # Create a function for formatting data. R> petalCategory_fmt <- function(x) { + ifelse(x > 5, 'LONG', + ifelse(x > 2, 'MEDIUM', 'SMALL')) + } R> # Create an ore.frame in database memory with the iris data set. R> iris_of <- ore.push(iris) R> # Select some rows from iris_of. R> iris_of[c(10, 20, 60, 80, 110, 140),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 10 4.9 3.1 1.5 0.1 setosa 20 5.1 3.8 1.5 0.3 setosa 60 5.2 2.7 3.9 1.4 versicolor 80 5.7 2.6 3.5 1.0 versicolor 110 7.2 3.6 6.1 2.5 virginica 140 6.9 3.1 5.4 2.1 virginica R> # Format the data in Petal.Length column. R> iris_of$Petal.Length <- petalCategory_fmt(iris_of$Petal.Length) R> # Select the same rows from iris_of. R> iris_of[c(10, 20, 60, 80, 110, 140),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 10 4.9 3.1 SMALL 0.1 setosa 20 5.1 3.8 SMALL 0.3 setosa 60 5.2 2.7 MEDIUM 1.4 versicolor 80 5.7 2.6 MEDIUM 1.0 versicolor 110 7.2 3.6 LONG 2.5 virginica 140 6.9 3.1 LONG 2.1 virginica
Example 3-8 Using the transform Function
This example does the same thing as the previous example except that it uses the transform
function to reformat the data in a column of the data set.
# Create an ore.frame in database memory with the iris data set. iris_of2 <- ore.push(iris) # Select some rows from iris_of. iris_of2[c(10, 20, 60, 80, 110, 140),] iris_of2 <- transform(iris_of2, Petal.Length = ifelse(Petal.Length > 5, 'LONG', ifelse(Petal.Length > 2, 'MEDIUM', 'SMALL'))) iris_of2[c(10, 20, 60, 80, 110, 140),]
Listing for This Example
R> # Create an ore.frame in database memory with the iris data set. R> iris_of2 <- ore.push(iris) R> # Select some rows from iris_of. R> iris_of2[c(10, 20, 60, 80, 110, 140),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 10 4.9 3.1 1.5 0.1 setosa 20 5.1 3.8 1.5 0.3 setosa 60 5.2 2.7 3.9 1.4 versicolor 80 5.7 2.6 3.5 1.0 versicolor 110 7.2 3.6 6.1 2.5 virginica 140 6.9 3.1 5.4 2.1 virginica R> iris_of2 <- transform(iris_of2, + Petal.Length = ifelse(Petal.Length > 5, 'LONG', + ifelse(Petal.Length > 2, 'MEDIUM', 'SMALL'))) R> iris_of2[c(10, 20, 60, 80, 110, 140),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 10 4.9 3.1 SMALL 0.1 setosa 20 5.1 3.8 SMALL 0.3 setosa 60 5.2 2.7 MEDIUM 1.4 versicolor 80 5.7 2.6 MEDIUM 1.0 versicolor 110 7.2 3.6 LONG 2.5 virginica 140 6.9 3.1 LONG 2.1 virginica
Example 3-9 Adding Derived Columns
This example uses the transform
function to add a derived column to the data set and then to add additional columns to it.
# Set the page width. options(width = 80) # Create an ore.frame in database memory with the iris data set. iris_of <- ore.push(iris) names(iris_of) # Add one column derived from another iris_of <- transform(iris_of, LOG_PL = log(Petal.Length)) names(iris_of) head(iris_of, 3) # Add more columns. iris_of <- transform(iris_of, SEPALBINS = ifelse(Sepal.Length < 6.0, "A", "B"), PRODUCTCOLUMN = Petal.Length * Petal.Width, CONSTANTCOLUMN = 10) names(iris_of) # Select some rows of iris_of. iris_of[c(10, 20, 60, 80, 110, 140),]
Listing for This Example
R> # Set the page width. R> options(width = 80) R> # Create an ore.frame in database memory with the iris data set. R> iris_of <- ore.push(iris) R> names(iris_of) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" R> # Add one column derived from another R> iris_of <- transform(iris_of, LOG_PL = log(Petal.Length)) R> names(iris_of) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" [6] "LOG_PL" R> head(iris_of, 3) Sepal.Length Sepal.Width Petal.Length Petal.Width Species LOG_PL 1 5.1 3.5 1.4 0.2 setosa 0.3364722 2 4.9 3.0 1.4 0.2 setosa 0.3364722 3 4.7 3.2 1.3 0.2 setosa 0.2623643 R> # Add more columns. R> iris_of <- transform(iris_of, SEPALBINS = ifelse(Sepal.Length < 6.0, "A", "B"), PRODUCTCOLUMN = Petal.Length * Petal.Width, CONSTANTCOLUMN = 10) R> names(iris_of) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" [5] "Species" "LOG_PL" "CONSTANTCOLUMN" "SEPALBINS" [9] "PRODUCTCOLUMN" R> # Select some rows of iris_of. R> iris_of[c(10, 20, 60, 80, 110, 140),] Sepal.Length Sepal.Width Petal.Length Petal.Width Species LOG_PL 10 4.9 3.1 1.5 0.1 setosa 0.4054651 20 5.1 3.8 1.5 0.3 setosa 0.4054651 60 5.2 2.7 3.9 1.4 versicolor 1.3609766 80 5.7 2.6 3.5 1.0 versicolor 1.2527630 110 7.2 3.6 6.1 2.5 virginica 1.8082888 140 6.9 3.1 5.4 2.1 virginica 1.6863990 CONSTANTCOLUMN SEPALBINS PRODUCTCOLUMN 10 10 A 0.15 20 10 A 0.45 60 10 A 5.46 80 10 A 3.50 110 10 B 15.25 140 10 B 11.34
3.1.7 Sample Data
Sampling is an important capability for statistical analytics.
Typically, you sample data to reduce its size and to perform meaningful work on it. In R you usually must load data into memory to sample it. However, if the data is too large, this isn't possible.
In OML4R, instead of pulling the data from the database and then sampling, you can sample directly in the database and then pull only those records that are part of the sample. By sampling in the database, you minimize data movement and you can work with larger data sets. Note that it is the ordering framework integer row indexing in the transparency layer that enables this capability.
Note:
Sampling requires using ordered ore.frame
objects as described in Creating Ordered and Unordered ore.frame Objects.
The examples in this section illustrate several sampling techniques.
Example 3-10 Simple Random Sampling
This example demonstrates a simple selection of rows at random. The example creates a small data.frame
object and pushes it to the database to create an ore.frame
object, MYDATA
. Out of 20 rows, the example samples 5. It uses the R sample
function to produce a random set of indices that it uses to get the sample from MYDATA
. The sample, simpleRandomSample
, is an ore.frame
object.
set.seed(1) N <- 20 myData <- data.frame(a=1:N,b=letters[1:N]) MYDATA <- ore.push(myData) head(MYDATA) sampleSize <- 5 simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE] class(simpleRandomSample) simpleRandomSample
Listing for This Example
R> set.seed(1) R> N <- 20 R> myData <- data.frame(a=1:N,b=letters[1:N]) R> MYDATA <- ore.push(myData) R> head(MYDATA) a b 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 f R> sampleSize <- 5 R> simpleRandomSample <- MYDATA[sample(nrow(MYDATA), sampleSize), , drop=FALSE] R> class(simpleRandomSample) [1] "ore.frame" attr(,"package") [1] "OREbase" R> simpleRandomSample a b 2 2 b 7 7 g 10 10 j 12 12 l 19 19 s
Example 3-11 Split Data Sampling
This example demonstrates randomly partitioning data into training and testing sets. This splitting of the data is normally done in classification and regression to assess how well a model performs on new data. The example uses the MYDATA
object created in the previous example.
This example produces a sample set of indices to use as the test data set. It then creates the logical vector group
that is TRUE
if the index is in the sample and is FALSE
otherwise. Next, it uses row indexing to produce the training set where the group is FALSE
and the test set where the group is TRUE
. Notice that the number of rows in the training set is 15 and the number of rows in the test set is 5, as specified in the invocation of the sample
function.
set.seed(1) sampleSize <- 5 ind <- sample(1:nrow(MYDATA), sampleSize) group <- as.integer(1:nrow(MYDATA) %in% ind) MYDATA.train <- MYDATA[group==FALSE,] dim(MYDATA.train) MYDATA.test <- MYDATA[group==TRUE,] dim(MYDATA.test)
Listing for This Example
R> set.seed(1) R> sampleSize <- 5 R> ind <- sample(1:nrow(MYDATA), sampleSize) R> group <- as.integer(1:nrow(MYDATA) %in% ind) R> MYDATA.train <- MYDATA[group==FALSE,] dim(MYDATA.train) [1] 15 2 R> MYDATA.test <- MYDATA[group==TRUE,] R> dim(MYDATA.test) [1] 5 2
Example 3-12 Systematic Sampling
This example demonstrates systematic sampling, in which rows are selected at regular intervals. The example uses the seq
function to create a sequence of values that start at 2 and increase by increments of 3. The number of values in the sequence is equal to the number of rows in MYDATA
. The MYDATA
object is created in the first example.
set.seed(1) N <- 20 myData <- data.frame(a=1:20,b=letters[1:N]) MYDATA <- ore.push(myData) head(MYDATA) start <- 2 by <- 3 systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE] systematicSample
Listing for This Example
R> set.seed(1) R> N <- 20 R> myData <- data.frame(a=1:20,b=letters[1:N]) R> MYDATA <- ore.push(myData) R> head(MYDATA) a b 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 f R> start <- 2 R> by <- 3 R> systematicSample <- MYDATA[seq(start, nrow(MYDATA), by = by), , drop = FALSE] systematicSample a b 2 2 b 5 5 e 8 8 h 11 11 k 14 14 n 17 17 q 20 20 t
Example 3-13 Stratified Sampling
This example demonstrates stratified sampling, in which rows are selected within each group where the group is determined by the values of a particular column. The example creates a data set that has each row assigned to a group. The function rnorm
produces random normal numbers. The argument 4 is the desired mean for the distribution. The example splits the data according to group and then samples proportionately from each partition. Finally, it row binds the list of subset ore.frame
objects into a single ore.frame
object and then displays the values of the result, stratifiedSample
.
set.seed(1) N <- 200 myData <- data.frame(a=1:N,b=round(rnorm(N),2), group=round(rnorm(N,4),0)) MYDATA <- ore.push(myData) head(MYDATA) sampleSize <- 10 stratifiedSample <- do.call(rbind, lapply(split(MYDATA, MYDATA$group), function(y) { ny <- nrow(y) y[sample(ny, sampleSize*ny/N), , drop = FALSE] })) stratifiedSample
Listing for This Example
R> set.seed(1) R> N <- 200 R> myData <- data.frame(a=1:N,b=round(rnorm(N),2), + group=round(rnorm(N,4),0)) R> MYDATA <- ore.push(myData) R> head(MYDATA) a b group 1 1 -0.63 4 2 2 0.18 6 3 3 -0.84 6 4 4 1.60 4 5 5 0.33 2 6 6 -0.82 6 R> sampleSize <- 10 R> stratifiedSample <- do.call(rbind, + lapply(split(MYDATA, MYDATA$group), + function(y) { + ny <- nrow(y) + y[sample(ny, sampleSize*ny/N), , drop = FALSE] + })) R> stratifiedSample a b group 173|173 173 0.46 3 9|9 9 0.58 4 53|53 53 0.34 4 139|139 139 -0.65 4 188|188 188 -0.77 4 78|78 78 0.00 5 137|137 137 -0.30 5
Example 3-14 Cluster Sampling
This example demonstrates cluster sampling, in which entire groups are selected at random. The example splits the data according to group and then samples among the groups and row binds into a single ore.frame
object. The resulting sample has data from two clusters, 6 and 7.
set.seed(1) N <- 200 myData <- data.frame(a=1:N,b=round(runif(N),2), group=round(rnorm(N,4),0)) MYDATA <- ore.push(myData) head(MYDATA) sampleSize <- 5 clusterSample <- do.call(rbind, sample(split(MYDATA, MYDATA$group), 2)) unique(clusterSample$group)
Listing for This Example
R> set.seed(1) R> N <- 200 R> myData <- data.frame(a=1:N,b=round(runif(N),2), + group=round(rnorm(N,4),0)) R> MYDATA <- ore.push(myData) R> head(MYDATA) a b group 1 1 0.27 3 2 2 0.37 4 3 3 0.57 3 4 4 0.91 4 5 5 0.20 3 6 6 0.90 6 R> sampleSize <- 5 R> clusterSample <- do.call(rbind, + sample(split(MYDATA, MYDATA$group), 2)) R> unique(clusterSample$group) [1] 6 7
Example 3-15 Quota Sampling
This example demonstrates quota sampling, in which a consecutive number of records are selected as the sample. The example uses the head
function to select the sample. The tail
function could also have been used.
set.seed(1) N <- 200 myData <- data.frame(a=1:N,b=round(runif(N),2)) MYDATA <- ore.push(myData) sampleSize <- 10 quotaSample1 <- head(MYDATA, sampleSize) quotaSample1
Listing for This Example
R> set.seed(1) R> N <- 200 R> myData <- data.frame(a=1:N,b=round(runif(N),2)) R> MYDATA <- ore.push(myData) R> sampleSize <- 10 R> quotaSample1 <- head(MYDATA, sampleSize) R> quotaSample1 a b 1 1 0.15 2 2 0.75 3 3 0.98 4 4 0.97 5 5 0.35 6 6 0.39 7 7 0.95 8 8 0.11 9 9 0.93 10 10 0.35
3.1.8 Partition Data
In analyzing large data sets, a typical operation is to randomly partition the data set into subsets.
You can analyze the partitions by using OML4R embedded R execution, as shown in the following example.
Example 3-16 Randomly Partitioning Data
This example creates a data.frame
object with the symbol myData
in the local R session and adds a column to it that contains a randomly generated set of values. It pushes the data set to database memory as the object MYDATA
. The example invokes the embedded R execution function ore.groupApply
, which partitions the data based on the partition column and then applies the lm
function to each partition.
N <- 200 k <- 5 myData <- data.frame(a=1:N,b=round(runif(N),2)) myData$partition <- sample(rep(1:k, each = N/k, length.out = N), replace = TRUE) MYDATA <- ore.push(myData) head(MYDATA) results <- ore.groupApply(MYDATA, MYDATA$partition, function(y) {lm(b~a,y)}, parallel = TRUE) length(results) results[[1]]Listing for This Example
R> N <- 200 R> k <- 5 R> myData <- data.frame(a=1:N,b=round(runif(N),2)) R> myData$partition <- sample(rep(1:k, each = N/k, + length.out = N), replace = TRUE) R> MYDATA <- ore.push(myData) R> head(MYDATA) a b partition 1 1 0.89 2 2 2 0.31 4 3 3 0.39 5 4 4 0.66 3 5 5 0.01 1 6 6 0.12 4 R> results <- ore.groupApply(MYDATA, MYDATA$partition, + function(y) {lm(b~a,y)}, parallel = TRUE) R> length(results) [1] 5 R> results[[1]] Call: lm(formula = b ~ a, data = y) Coefficients: (Intercept) a 0.388795 0.001015
3.1.9 Prepare Time Series Data
OML4R provides you with the ability to perform many data preparation operations on time series data, such as filtering, ordering, and transforming the data.
OML4R maps R data types to SQL data types, which allows you to create OML4R objects and perform data preparation operations in database memory. The following examples demonstrate some operations on time series data.
Example 3-17 Aggregating Date and Time Data
This example illustrates some of the statistical aggregation functions. For a data set, the example first generates on the local client a sequence of five hundred dates spread evenly throughout 2001. It then introduces a random difftime
and a vector of random normal values. The example then uses the ore.push
function to create MYDATA
, an in-database version of the data. The example invokes the class
function to show that MYDATA
is an ore.frame
object and that the datetime
column is of class ore.datetime
. The example displays the first three rows of the generated data. It then uses the statistical aggregation operations of min
, max
, range
, median
, and quantile
on the datetime
column of MYDATA
.
N <- 500 mydata <- data.frame(datetime = seq(as.POSIXct("2001/01/01"), as.POSIXct("2001/12/31"), length.out = N), difftime = as.difftime(runif(N), units = "mins"), x = rnorm(N)) MYDATA <- ore.push(mydata) class(MYDATA) class(MYDATA$datetime) head(MYDATA,3) # statistical aggregations min(MYDATA$datetime) max(MYDATA$datetime) range(MYDATA$datetime) quantile(MYDATA$datetime, probs = c(0, 0.05, 0.10))
Listing for This Example
R> N <- 500 R> mydata <- data.frame(datetime = + seq(as.POSIXct("2001/01/01"), + as.POSIXct("2001/12/31"), + length.out = N), + difftime = as.difftime(runif(N), + units = "mins"), + x = rnorm(N)) R> MYDATA <- ore.push(mydata) R> class(MYDATA) [1] "ore.frame" attr(,"package") [1] "OREbase" R> class(MYDATA$datetime) [1] "ore.datetime" attr(,"package") [1] "OREbase" R> head(MYDATA,3) datetime difftime x 1 2001-01-01 00:00:00 16.436782 secs 0.68439244 2 2001-01-01 17:30:25 8.711562 secs 1.38481435 3 2001-01-02 11:00:50 1.366927 secs -0.00927078 R> # statistical aggregations R> min(MYDATA$datetime) [1] "2001-01-01 CST" R> max(MYDATA$datetime) [1] "2001-12-31 CST" R> range(MYDATA$datetime) [1] "2001-01-01 CST" "2001-12-31 CST" R> quantile(MYDATA$datetime, + probs = c(0, 0.05, 0.10)) 0% 5% 10% "2001-01-01 00:00:00 CST" "2001-01-19 04:48:00 CST" "2001-02-06 09:36:00 CST"
Example 3-18 Using Date and Time Arithmetic
This example creates a one day shift by taking the datetime
column of the MYDATA
ore.frame
object created in the previous example and adding a difftime
of one day. The result is day1Shift
, which the example shows is of class ore.datetime
. The example displays the first three elements of the datetime
column of MYDATA
and those of day1Shift
. The first element of day1Shift
is January 2, 2001.
This example also computes lag differences using the overloaded diff
function. The difference between the dates is all the same because the 500 dates in MYDATA
are evenly distributed throughout 2001.
day1Shift <- MYDATA$datetime + as.difftime(1, units = "days") class(day1Shift) head(MYDATA$datetime,3) head(day1Shift,3) lag1Diff <- diff(MYDATA$datetime) class(lag1Diff) head(lag1Diff,3)
Listing for This Example
R> day1Shift <- MYDATA$datetime + as.difftime(1, units = "days") R> class(day1Shift) [1] "ore.datetime" attr(,"package") [1] "OREbase" R> head(MYDATA$datetime,3) [1] "2001-01-01 00:00:00 CST" "2001-01-01 17:30:25 CST" "2001-01-02 11:00:50 CST" R> head(day1Shift,3) [1] "2001-01-02 00:00:00 CST" "2001-01-02 17:30:25 CST" "2001-01-03 11:00:50 CST" R> lag1Diff <- diff(MYDATA$datetime) R> class(lag1Diff) [1] "ore.difftime" attr(,"package") [1] "OREbase" R> head(lag1Diff,3) Time differences in secs [1] 63025.25 63025.25 63025.25
Example 3-19 Comparing Dates and Times
This example demonstrates date and time comparisons. The example uses the datetime
column of the MYDATA
ore.frame
object created in the first example. This example selects the elements of MYDATA
that have a date earlier than April 1, 2001. The resulting isQ1
is of class ore.logical
and for the first three entries the result is TRUE
. The example finds out how many dates matching isQ1
are in March. It then sums the logical vector and displays the result, which is that 43 rows are in March. The example next filters rows based on dates that are the end of the year, after December 27. The result is eoySubset
, which is an ore.frame
object. The example displays the first three rows returned in eoySubset
.
isQ1 <- MYDATA$datetime < as.Date("2001/04/01") class(isQ1) head(isQ1,3) isMarch <- isQ1 & MYDATA$datetime > as.Date("2001/03/01") class(isMarch) head(isMarch,3) sum(isMarch) eoySubset <- MYDATA[MYDATA$datetime > as.Date("2001/12/27"), ] class(eoySubset) head(eoySubset,3)
Listing for This Example
R> isQ1 <- MYDATA$datetime < as.Date("2001/04/01") R> class(isQ1) [1] "ore.logical" attr(,"package") [1] "OREbase" R> head(isQ1,3) [1] TRUE TRUE TRUE R> isMarch <- isQ1 & MYDATA$datetime > as.Date("2001/03/01") R> class(isMarch) [1] "ore.logical" attr(,"package") [1] "OREbase" R> head(isMarch,3) [1] FALSE FALSE FALSE R> sum(isMarch) [1] 43 R> eoySubset <- MYDATA[MYDATA$datetime > as.Date("2001/12/27"), ] R> class(eoySubset) [1] "ore.frame" attr(,"package") [1] "OREbase" R> head(eoySubset,3) datetime difftime x 495 2001-12-27 08:27:53 55.76474 secs -0.2740492 496 2001-12-28 01:58:18 15.42946 secs -1.4547270 497 2001-12-28 19:28:44 28.62195 secs 0.2929171
Example 3-20 Using Date and Time Accessors
OML4R has accessor functions that you can use to extract various components from datetime
objects, such as year, month, day of the month, hour, minute, and second. This example demonstrates the use of these functions. The example uses the datetime
column of the MYDATA
ore.frame
object created in the first example.
This example gets the year elements of the datetime
column. The invocation of the unique
function for year
displays 2001
because it is the only year value in the column. However, for objects that have a range of values, as for example, ore.mday
, the range
function returns the day of the month. The result contains a vector with values that range from 1 through 31. Invoking the range function succinctly reports the range of values, as demonstrated for the other accessor functions.
year <- ore.year(MYDATA$datetime) unique(year) month <- ore.month(MYDATA$datetime) range(month) dayOfMonth <- ore.mday(MYDATA$datetime) range(dayOfMonth) hour <- ore.hour(MYDATA$datetime) range(hour) minute <- ore.minute(MYDATA$datetime) range(minute) second <- ore.second(MYDATA$datetime) range(second)
Listing for This Example
R> year <- ore.year(MYDATA$datetime) R> unique(year) [1] 2001 R> month <- ore.month(MYDATA$datetime) R> range(month) [1] 1 12 R> dayOfMonth <- ore.mday(MYDATA$datetime) R> range(dayOfMonth) [1] 1 31 R> hour <- ore.hour(MYDATA$datetime) R> range(hour) [1] 0 23 R> minute <- ore.minute(MYDATA$datetime) R> range(minute) [1] 0 59 R> second <- ore.second(MYDATA$datetime) R> range(second) [1] 0.00000 59.87976
Example 3-21 Coercing Date and Time Data Types
This example uses the as.ore
subclass objects to coerce an ore.datetime
data type into other data types. The example uses the datetime
column of the MYDATA
ore.frame
object created in the first example. That column contains ore.datetime
values. This example first extracts the date from the MYDATA$datetime
column. The resulting dateOnly
object has ore.date
values that contain only the year, month, and day, but not the time. The example then coerces the ore.datetime
values into objects with ore.character
and ore.integer
values that represent the names of days, the number of the day of the year, and the quarter of the year.
dateOnly <- as.ore.date(MYDATA$datetime) class(dateOnly) head(sort(unique(dateOnly)),3) nameOfDay <- as.ore.character(MYDATA$datetime, format = "DAY") class(nameOfDay) sort(unique(nameOfDay)) dayOfYear <- as.integer(as.character(MYDATA$datetime, format = "DDD")) class(dayOfYear) range(dayOfYear) quarter <- as.integer(as.character(MYDATA$datetime, format = "Q")) class(quarter) sort(unique(quarter))
Listing for This Example
R> dateOnly <- as.ore.date(MYDATA$datetime) R> class(dateOnly)[1] "ore.date" attr(,"package")[1] "OREbase" R> head(sort(unique(dateOnly)),3) [1] "2001-01-01" "2001-01-02" "2001-01-03" R> nameOfDay <- as.ore.character(MYDATA$datetime, format = "DAY") R> class(nameOfDay) [1] "ore.character" attr(,"package") [1] "OREbase" R> sort(unique(nameOfDay)) [1] "FRIDAY " "MONDAY " "SATURDAY " "SUNDAY " "THURSDAY " "TUESDAY " "WEDNESDAY" R> dayOfYear <- as.integer(as.character(MYDATA$datetime, format = "DDD")) R> class(dayOfYear) [1] "ore.integer" attr(,"package") [1] "OREbase" R> range(dayOfYear) [1] 1 365 R> quarter <- as.integer(as.character(MYDATA$datetime, format = "Q")) R> class(quarter) [1] "ore.integer" attr(,"package") [1] "OREbase" R> sort(unique(quarter)) [1] 1 2 3 4
Example 3-22 Using a Window Function
This example uses the window functions ore.rollmean
and ore.rollsd
to compute the rolling mean and the rolling standard deviation. The example uses the MYDATA
ore.frame
object created in the first example. This example ensures that MYDATA
is an ordered ore.frame
by assigning the values of the datetime
column as the row names of MYDATA
. The example computes the rolling mean and the rolling standard deviation over five periods. Next, to use the R time series functionality in the stats
package, the example pulls data to the client. To limit the data pulled to the client, it uses the vector is.March
from the third example to select only the data points in March. The example creates a time series object using the ts
function, builds the Arima model, and predicts three points out.
row.names(MYDATA) <- MYDATA$datetime MYDATA$rollmean5 <- ore.rollmean(MYDATA$x, k = 5) MYDATA$rollsd5 <- ore.rollsd (MYDATA$x, k = 5) head(MYDATA) marchData <- ore.pull(MYDATA[isMarch,]) tseries.x <- ts(marchData$x) arima110.x <- arima(tseries.x, c(1,1,0)) predict(arima110.x, 3) tseries.rm5 <- ts(marchData$rollmean5) arima110.rm5 <- arima(tseries.rm5, c(1,1,0)) predict(arima110.rm5, 3)
Listing for This Example
R> row.names(MYDATA) <- MYDATA$datetime R> MYDATA$rollmean5 <- ore.rollmean(MYDATA$x, k = 5) R> MYDATA$rollsd5 <- ore.rollsd (MYDATA$x, k = 5) R> head(MYDATA) datetime difftime 2001-01-01 00:00:00 2001-01-01 00:00:00 39.998460 secs x rollmean5 rollsd5 -0.3450421 -0.46650761 0.8057575 datetime difftime 2001-01-01 17:30:25 2001-01-01 17:30:25 37.75568 secs x rollmean5 rollsd5 -1.3261019 0.02877517 1.1891384 datetime difftime 2001-01-02 11:00:50 2001-01-02 11:00:50 18.44243 secs x rollmean5 rollsd5 0.2716211 -0.13224503 1.0909515 datetime difftime 2001-01-03 04:31:15 2001-01-03 04:31:15 38.594384 secs x rollmean5 rollsd5 1.5146235 0.36307913 1.4674456 datetime difftime 2001-01-03 22:01:41 2001-01-03 22:01:41 2.520976 secs x rollmean5 rollsd5 -0.7763258 0.80073340 1.1237925 datetime difftime 2001-01-04 15:32:06 2001-01-04 15:32:06 56.333281 secs x rollmean5 rollsd5 2.1315787 0.90287282 1.0862614 R> marchData <- ore.pull(MYDATA[isMarch,]) R> tseries.x <- ts(marchData$x) R> arima110.x <- arima(tseries.x, c(1,1,0)) R> predict(arima110.x, 3) $pred Time Series: Start = 44 End = 46 Frequency = 1 [1] 1.4556614 0.6156379 1.1387587 $se Time Series: Start = 44 End = 46 Frequency = 1 [1] 1.408117 1.504988 1.850830 R> tseries.rm5 <- ts(marchData$rollmean5) R> arima110.rm5 <- arima(tseries.rm5, c(1,1,0)) R> predict(arima110.rm5, 3) $pred Time Series: Start = 44 End = 46 Frequency = 1 [1] 0.3240135 0.3240966 0.3240922 $se Time Series: Start = 44 End = 46 Frequency = 1 [1] 0.3254551 0.4482886 0.5445763
3.2 Explore Data
Oracle Machine Learning for R provides functions that enable you to perform exploratory data analysis.
With these functions, you can perform common statistical operations.
The functions and their uses are described in the following topics:
- About the Exploratory Data Analysis Functions
The OML4R functions for exploratory data analysis are in theOREeda
package. - About the NARROW Data Set for Examples
Many of the examples of the exploratory data analysis functions use theNARROW
data set. - Correlate Data
You can use theore.corr
function to perform correlation analysis. - Cross-Tabulate Data
Cross-tabulation is a statistical technique that finds an interdependent relationship between two tables of values. - Analyze the Frequency of Cross-Tabulations
Theore.freq
function analyses the output of theore.crosstab
function and automatically determines the techniques that are relevant to anore.crosstab
result. - Build Exponential Smoothing Models on Time Series Data
Theore.esm
function builds a simple or a double exponential smoothing model for in-database time series observations in an orderedore.vector
object. - Rank Data
Theore.rank
function analyzes distribution of values in numeric columns of anore.frame
. - Sort Data
Theore.sort
function enables flexible sorting of a data frame along one or more columns specified by theby
argument. - Summarize Data
Summarize data with theaggregate
function. - Analyze the Distribution of Numeric Variables
Theore.univariate
function provides distribution analysis of numeric variables in anore.frame
. - Principal Component Analysis
The overloadedprcomp
andprincomp
functions perform principal component analysis in parallel in the database. - Singular Value Decomposition
The overloadedsvd
function performs singular value decomposition in parallel in the database.
Parent topic: Prepare and Explore Data in the Database
3.2.1 About the Exploratory Data Analysis Functions
The OML4R functions for exploratory data analysis are in the OREeda
package.
Table 3-1 Functions in the OREeda Package
Function | Description |
---|---|
|
Performs correlation analysis across numeric columns in an |
|
Expands on the |
|
Builds exponential smoothing models on data in an ordered |
|
Operates on output from the |
|
Enables the investigation of the distribution of values along numeric columns in an |
|
Provides flexible sorting for |
|
Provides descriptive statistics for |
|
Provides distribution analysis of numeric columns in an |
Parent topic: Explore Data
3.2.2 About the NARROW Data Set for Examples
Many of the examples of the exploratory data analysis functions use the NARROW
data set.
NARROW
is an ore.frame
that has 9 columns and 1500 rows, as shown in the following example. Some of the columns are numeric, others are not.
Example 3-23 The NARROW Data Set
This example shows the class, dimensions, and names of the NARROW
object.
R> class(NARROW) R> dim(NARROW) R> names(NARROW)
Listing for This Example
R> class(NARROW) [1] "ore.frame" attr(,"package") [1] "OREbase" R> dim(NARROW)[1] 1500 9 R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" [5] "COUNTRY" "EDUCATION" "OCCUPATION" "YRS_RESIDENCE" [9] "CLASS"
Parent topic: Explore Data
3.2.3 Correlate Data
You can use the ore.corr
function to perform correlation analysis.
With the ore.corr
function, you can do the following:
-
Perform Pearson, Spearman or Kendall correlation analysis across numeric columns in an
ore.frame
object. -
Perform partial correlations by specifying a control column.
-
Aggregate some data prior to the correlations.
-
Post-process results and integrate them into an R code flow.
You can make the output of the
ore.corr
function conform to the output of the Rcor
function; doing so allows you to use any R function to post-process the output or to use the output as the input to a graphics function.
For details about the function arguments, invoke help(ore.corr)
.
The following examples demonstrate these operations.
Example 3-24 Performing Basic Correlation Calculations
This example demonstrates how to specify the different types of correlation statistics.
# Before performing correlations, project out all non-numeric values # by specifying only the columns that have numeric values. names(NARROW) NARROW_NUMS <- NARROW[,c(3,8,9)] names(NARROW_NUMS) # Calculate the correlation using the default correlation statistic, Pearson. x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS') head(x, 3) # Calculate using Spearman. x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman') head(x, 3) # Calculate using Kendall x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall') head(x, 3)
Listing for This Example
R> # Before performing correlations, project out all non-numeric values R> # by specifying only the columns that have numeric values. R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" "COUNTRY" "EDUCATION" "OCCUPATION" [8] "YRS_RESIDENCE" "CLASS" "AGEBINS" R> NARROW_NUMS <- NARROW[,c(3,8,9)] R> names(NARROW_NUMS) [1] "AGE" "YRS_RESIDENCE" "CLASS" R> # Calculate the correlation using the default correlation statistic, Pearson. R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS') R> head(x, 3) ROW COL PEARSON_T PEARSON_P PEARSON_DF 1 AGE CLASS 0.2200960 1e-15 1298 2 AGE YRS_RESIDENCE 0.6568534 0e+00 1098 3 YRS_RESIDENCE CLASS 0.3561869 0e+00 1298 R> # Calculate using Spearman. R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='spearman') R> head(x, 3) ROW COL SPEARMAN_T SPEARMAN_P SPEARMAN_DF 1 AGE CLASS 0.2601221 1e-15 1298 2 AGE YRS_RESIDENCE 0.7462684 0e+00 1098 3 YRS_RESIDENCE CLASS 0.3835252 0e+00 1298 R> # Calculate using Kendall R> x <- ore.corr(NARROW_NUMS,var='AGE,YRS_RESIDENCE,CLASS', stats='kendall') R> head(x, 3) ROW COL KENDALL_T KENDALL_P KENDALL_DF 1 AGE CLASS 0.2147107 4.285594e-31 <NA> 2 AGE YRS_RESIDENCE 0.6332196 0.000000e+00 <NA> 3 YRS_RESIDENCE CLASS 0.3362078 1.094478e-73 <NA>
Example 3-25 Creating Correlation Matrices
This example pushes the iris
data set to a temporary table in the database, which has the proxy ore.frame
object iris_of
. It creates correlation matrices grouped by species.
iris_of <- ore.push(iris) x <- ore.corr(iris_of, var = "Sepal.Length, Sepal.Width, Petal.Length", partial = "Petal.Width", group.by = "Species") class(x) head(x)
Listing for This Example
R> iris_of <- ore.push(iris) R> x <- ore.corr(iris_of, var = "Sepal.Length, Sepal.Width, Petal.Length", + partial = "Petal.Width", group.by = "Species") R> class(x) [1] "list" R> head(x) $setosa ROW COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF 1 Sepal.Length Petal.Length 0.1930601 9.191136e-02 47 2 Sepal.Length Sepal.Width 0.7255823 1.840300e-09 47 3 Sepal.Width Petal.Length 0.1095503 2.268336e-01 47 $versicolor ROW COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF 1 Sepal.Length Petal.Length 0.62696041 7.180100e-07 47 2 Sepal.Length Sepal.Width 0.26039166 3.538109e-02 47 3 Sepal.Width Petal.Length 0.08269662 2.860704e-01 47 $virginica ROW COL PART_PEARSON_T PART_PEARSON_P PART_PEARSON_DF 1 Sepal.Length Petal.Length 0.8515725 4.000000e-15 47 2 Sepal.Length Sepal.Width 0.3782728 3.681795e-03 47 3 Sepal.Width Petal.Length 0.2854459 2.339940e-02 47
Parent topic: Explore Data
3.2.4 Cross-Tabulate Data
Cross-tabulation is a statistical technique that finds an interdependent relationship between two tables of values.
The ore.crosstab
function enables cross-column analysis of an ore.frame
. This function is a sophisticated variant of the R table
function.
You must use ore.crosstab
function before performing frequency analysis using ore.freq
.
If the result of the ore.crosstab
function invocation is a single cross-tabulation, then the function returns an ore.frame
object. If the result is multiple cross-tabulations, then the function returns a list of ore.frame
objects.
For details about function arguments, invoke help(ore.crosstab)
.
Example 3-26 Creating a Single Column Frequency Table
The most basic use case is to create a single-column frequency table, as shown in this example.
This example filters the NARROW ore.frame
, grouping by GENDER.
ct <- ore.crosstab(~AGE, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(~AGE, data=NARROW) R> head(ct) AGE ORE$FREQ ORE$STRATA ORE$GROUP 17 17 14 1 1 18 18 16 1 1 19 19 30 1 1 20 20 23 1 1 21 21 22 1 1 22 22 39 1 1
Example 3-27 Analyzing Two Columns
This example analyses AGE by GENDER and AGE by CLASS.
ct <- ore.crosstab(AGE~GENDER+CLASS, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(AGE~GENDER+CLASS, data=NARROW) R> head(ct) $`AGE~GENDER` AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 17|F 17 F 5 1 1 17|M 17 M 9 1 1 18|F 18 F 6 1 1 18|M 18 M 7 1 1 19|F 19 F 15 1 1 19|M 19 M 13 1 1 # The remaining output is not shown.
Example 3-28 Weighting Rows
To weight rows, include a count based on another column as shown in this example. This example weights values in AGE and GENDER using values in YRS_RESIDENCE.
ct <- ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(AGE~GENDER*YRS_RESIDENCE, data=NARROW) R> head(ct) AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 17|F 17 F 1 1 1 17|M 17 M 8 1 1 18|F 18 F 4 1 1 18|M 18 M 10 1 1 19|F 19 F 15 1 1 19|M 19 M 17 1 1
Example 3-29 Ordering Cross-Tabulated Data
There are several possibilities for ordering rows in a cross-tabulated table, such as the following:
-
Default or NAME orders by the columns being analyzed
-
FREQ orders by frequency counts
-
-NAME or -FREQ does reverse ordering
-
INTERNAL bypasses ordering
This example orders by frequency count and then by reverse order by frequency count.
ct <- ore.crosstab(AGE~GENDER|FREQ, data=NARROW) head(ct) ct <- ore.crosstab(AGE~GENDER|-FREQ, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(AGE~GENDER|FREQ, data=NARROW) R> head(ct) AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 66|F 66 F 1 1 1 70|F 70 F 1 1 1 73|M 73 M 1 1 1 74|M 74 M 1 1 1 76|F 76 F 1 1 1 77|F 77 F 1 1 1 R> ct <- ore.crosstab(AGE~GENDER|-FREQ, data=NARROW) R> head(ct) AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 27|M 27 M 33 1 1 35|M 35 M 28 1 1 41|M 41 M 27 1 1 34|M 34 M 26 1 1 37|M 37 M 26 1 1 28|M 28 M 25 1 1
Example 3-30 Analyzing Three or More Columns
This example demonstrates analyzing three or more columns. The result is similar to what the SQL GROUPING SETS
clause accomplishes.
ct <- ore.crosstab(AGE+COUNTRY~GENDER, NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(AGE+COUNTRY~GENDER, NARROW) R> head(ct) $`AGE~GENDER` AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 17|F 17 F 5 1 1 17|M 17 M 9 1 1 18|F 18 F 6 1 1 18|M 18 M 7 1 1 19|F 19 F 15 1 1 19|M 19 M 13 1 1 # The rest of the output is not shown. $`COUNTRY~GENDER` COUNTRY GENDER ORE$FREQ ORE$STRATA ORE$GROUP Argentina|F Argentina F 14 1 1 Argentina|M Argentina M 28 1 1 Australia|M Australia M 1 1 1 # The rest of the output is not shown.
Example 3-31 Specifying a Range of Columns
You can specify a range of columns instead of having to type all the column names, as demonstrated in this example.
names(NARROW) # Because AGE, MARITAL_STATUS and COUNTRY are successive columns, # you can simply do the following: ct <- ore.crosstab(AGE-COUNTRY~GENDER, NARROW) # An equivalent invocation is the following: ct <- ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)
Listing for This Example
R> names(NARROW) [1] "ID" "GENDER" "AGE" "MARITAL_STATUS" [5] "COUNTRY" "EDUCATION" "OCCUPATION" "YRS_RESIDENCE" [9] "CLASS" R> # Because AGE, MARITAL_STATUS and COUNTRY are successive columns, R> # you can simply do the following: R> ct <- ore.crosstab(AGE-COUNTRY~GENDER, NARROW) R> # An equivalent invocation is the following: R> ct <- ore.crosstab(AGE+MARITAL_STATUS+COUNTRY~GENDER, NARROW)
Example 3-32 Producing One Cross-Tabulation Table for Each Value of Another Column
This example produces one cross-tabulation table (AGE, GENDER) for each unique value of another column COUNTRY.
ct <- ore.crosstab(~AGE/COUNTRY, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(~AGE/COUNTRY, data=NARROW) R> head(ct) AGE ORE$FREQ ORE$STRATA ORE$GROUP Argentina|17 17 1 1 1 Brazil|17 17 1 1 3 United States of America|17 17 12 1 19 United States of America|18 18 16 1 19 United States of America|19 19 30 1 19 United States of America|20 20 23 1 19
Example 3-33 Producing One Cross-Tabulation Table for Each Set of Value of Two Columns
You can extend the cross-tabulation to more than one column, as shown in this example, which produces one (AGE, EDUCATION) table for each unique combination of (COUNTRY, GENDER).
ct <- ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW) head(ct)
Listing for This Example
R> ct <- ore.crosstab(AGE~EDUCATION/COUNTRY+GENDER, data=NARROW) R> head(ct) AGE EDUCATION ORE$FREQ ORE$STRATA ORE$GROUP United States of America|F|17|10th 17 10th 3 1 33 United States of America|M|17|10th 17 10th 5 1 34 United States of America|M|17|11th 17 11th 1 1 34 Argentina|M|17|HS-grad 17 HS-grad 1 1 2 United States of America|M|18|10th 18 10th 1 1 34 United States of America|F|18|11th 18 11th 2 1 33
Example 3-34 Augmenting Cross-Tabulation with Stratification
All of the cross-tabulation tables in the previous examples can be augmented with stratification, as shown in this example.
ct <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW) head(ct) R> head(ct) # The previous function invocation is the same as the following: ct <- ore.crosstab(AGE~GENDER, NARROW, strata="CLASS")
Listing for This Example
R> ct <- ore.crosstab(AGE~GENDER^CLASS, data=NARROW) R> head(ct) R> head(ct) AGE GENDER ORE$FREQ ORE$STRATA ORE$GROUP 0|17|F 17 F 5 1 1 0|17|M 17 M 9 1 1 0|18|F 18 F 6 1 1 0|18|M 18 M 7 1 1 0|19|F 19 F 15 1 1 0|19|M 19 M 13 1 1 # The previous function invocation is the same as the following: ct <- ore.crosstab(AGE~GENDER, NARROW, strata="CLASS")
Example 3-35 Binning Followed by Cross-Tabulation
This example does a custom binning by AGE and then calculates the cross-tabulation for GENDER and the bins.
NARROW$AGEBINS <- ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2, ifelse(NARROW$AGE<40,3,4))) ore.crosstab(GENDER~AGEBINS, NARROW)
Listing for This Example
R> NARROW$AGEBINS <- ifelse(NARROW$AGE<20, 1, ifelse(NARROW$AGE<30,2, + ifelse(NARROW$AGE<40,3,4))) R> ore.crosstab(GENDER~AGEBINS, NARROW) GENDER AGEBINS ORE$FREQ ORE$STRATA ORE$GROUP F|1 F 1 26 1 1 F|2 F 2 108 1 1 F|3 F 3 86 1 1 F|4 F 4 164 1 1 M|1 M 1 29 1 1 M|2 M 2 177 1 1 M|3 M 3 230 1 1 M|4 M 4 381 1 1
Parent topic: Explore Data
3.2.5 Analyze the Frequency of Cross-Tabulations
The ore.freq
function analyses the output of the ore.crosstab
function and automatically determines the techniques that are relevant to an ore.crosstab
result.
The techniques depend on the kind of cross-tabulation tables, which are the following:
-
2-way cross-tabulation tables
-
Various statistics that describe relationships between columns in the cross-tabulation
-
Chi-square tests, Cochran-Mantel-Haenzsel statistics, measures of association, strength of association, risk differences, odds ratio and relative risk for 2x2 tables, tests for trend
-
-
N-way cross-tabulation tables
-
N 2-way cross-tabulation tables
-
Statistics across and within strata
-
The ore.freq
function uses Oracle Database SQL functions when available.
The ore.freq
function returns an ore.frame
in all cases.
Before you use ore.freq
, you must calculate crosstabs, as shown in the following example.
For details about the function arguments, invoke help(ore.freq)
.
Example 3-36 Using the ore.freq Function
This example pushes the iris
data set to the database and gets the ore.frame
object iris_of
. The example gets a crosstab and invokes the ore.freq
function on it.
IRIS <- ore.push(iris) ct <- ore.crosstab(Species ~ Petal.Length + Sepal.Length, data = IRIS) ore.freq(ct)
Listing for This Example
R> IRIS <- ore.push(iris) R> ct <- ore.crosstab(Species ~ Petal.Length + Sepal.Length, data = IRIS) R> ore.freq(ct) $`Species~Petal.Length` METHOD FREQ DF PVALUE DESCR GROUP 1 PCHISQ 181.4667 84 3.921603e-09 Pearson Chi-Square 1 $`Species~Sepal.Length` METHOD FREQ DF PVALUE DESCR GROUP 1 PCHISQ 102.6 68 0.004270601 Pearson Chi-Square 1
Parent topic: Explore Data
3.2.6 Build Exponential Smoothing Models on Time Series Data
The ore.esm
function builds a simple or a double exponential smoothing model for in-database time series observations in an ordered ore.vector
object.
The function operates on time series data, whose observations are evenly spaced by a fixed interval, or transactional data, whose observations are not equally spaced. The function can aggregate the transactional data by a specified time interval, as well as handle missing values using a specified method, before entering the modeling phase.
The ore.esm
function processes the data in one or more R engines running on the database server. The function returns an object of class ore.esm
.
You can use the predict
method to predict the time series of the exponential smoothing model built by ore.esm
. If you have loaded the forecast
package, then you can use the forecast
method on the ore.esm
object. You can use the fitted method to generate the fitted values of the training time series data set.
For information about the arguments of the ore.esm
function, invoke help(ore.esm)
.
Example 3-37 Building a Double Exponential Smoothing Model
This example builds a double exponential smoothing model on a synthetic time series data set. The predict
and fitted
functions are invoked to generate the predictions and the fitted values, respectively. The figure shows the observations, fitted values, and the predictions.
N <- 5000 ts0 <- ore.push(data.frame(ID=1:N, VAL=seq(1,5,length.out=N)^2+rnorm(N,sd=0.5))) rownames(ts0) <- ts0$ID x <- ts0$VAL esm.mod <- ore.esm(x, model = "double") esm.predict <- predict(esm.mod, 30) esm.fitted <- fitted(esm.mod, start=4000, end=5000) plot(ts0[4000:5000,], pch='.') lines(ts0[4000:5000, 1], esm.fitted, col="blue") lines(esm.predict, col="red", lwd=2)
Figure 3-1 Fitted and Predicted Values Based on the esm.mod Model
Description of "Figure 3-1 Fitted and Predicted Values Based on the esm.mod Model"
Example 3-38 Building a Time Series Model with Transactional Data
This example builds a simple smoothing model based on a transactional data set. As preprocessing, it aggregates the values to the day level by taking averages, and fills missing values by setting them to the previous aggregated value. The model is then built on the aggregated daily time series. The function predict
is invoked to generate predicted values on the daily basis.
ts01 <- data.frame(ID=seq(as.POSIXct("2008/6/13"), as.POSIXct("2011/6/16"), length.out=4000), VAL=rnorm(4000, 10)) ts02 <- data.frame(ID=seq(as.POSIXct("2011/7/19"), as.POSIXct("2012/11/20"), length.out=1500), VAL=rnorm(1500, 10)) ts03 <- data.frame(ID=seq(as.POSIXct("2012/12/09"), as.POSIXct("2013/9/25"), length.out=1000), VAL=rnorm(1000, 10)) ts1 = ore.push(rbind(ts01, ts02, ts03)) rownames(ts1) <- ts1$ID x <- ts1$VAL esm.mod <- ore.esm(x, "DAY", accumulate = "AVG", model="simple", setmissing="PREV") esm.predict <- predict(esm.mod) esm.predict
Listing for This Example
R> ts01 <- data.frame(ID=seq(as.POSIXct("2008/6/13"), as.POSIXct("2011/6/16"), + length.out=4000), VAL=rnorm(4000, 10)) R> ts02 <- data.frame(ID=seq(as.POSIXct("2011/7/19"), as.POSIXct("2012/11/20"), + length.out=1500), VAL=rnorm(1500, 10)) R> ts03 <- data.frame(ID=seq(as.POSIXct("2012/12/09"), as.POSIXct("2013/9/25"), + length.out=1000), VAL=rnorm(1000, 10)) R> ts1 = ore.push(rbind(ts01, ts02, ts03)) R> rownames(ts1) <- ts1$ID R> x <- ts1$VAL R> esm.mod <- ore.esm(x, "DAY", accumulate = "AVG", model="simple", + setmissing="PREV") R> esm.predict <- predict(esm.mod) R> esm.predict ID VAL 1 2013-09-26 9.962478 2 2013-09-27 9.962478 3 2013-09-28 9.962478 4 2013-09-29 9.962478 5 2013-09-30 9.962478 6 2013-10-01 9.962478 7 2013-10-02 9.962478 8 2013-10-03 9.962478 9 2013-10-04 9.962478 10 2013-10-05 9.962478 11 2013-10-06 9.962478 12 2013-10-07 9.962478
Example 3-39 Building a Double Exponential Smoothing Model Specifying an Interval
This example uses stock data from the TTR
package. It builds a double exponential smoothing model based on the daily stock closing prices. The 30-day predicted stock prices, along with the original observations, are shown in the following figure.
library(TTR) stock <- "orcl" xts.data <- getYahooData(stock, 20010101, 20131024) df.data <- data.frame(xts.data) df.data$date <- index(xts.data) of.data <- ore.push(df.data[, c("date", "Close")]) rownames(of.data) <- of.data$date esm.mod <- ore.esm(of.data$Close, "DAY", model = "double") esm.predict <- predict(esm.mod, 30) plot(of.data,type="l") lines(esm.predict,col="red",lwd=4)
Parent topic: Explore Data
3.2.7 Rank Data
The ore.rank
function analyzes distribution of values in numeric columns of an ore.frame
.
The ore.rank
function supports useful functionality, including:
-
Ranking within groups
-
Partitioning rows into groups based on rank tiles
-
Calculation of cumulative percentages and percentiles
-
Treatment of ties
-
Calculation of normal scores from ranks
The ore.rank
function syntax is simpler than the corresponding SQL queries.
The ore.rank
function returns an ore.frame
in all instances.
You can use these R scoring methods with ore.rank
:
-
To compute exponential scores from ranks, use
savage
. -
To compute normal scores, use one of
blom,
tukey
, orvw
(van der Waerden).
For details about the function arguments, invoke help(ore.rank)
.
The following examples illustrate using ore.rank
. The examples use the NARROW
data set.
Example 3-40 Ranking Two Columns
This example ranks the two columns AGE and CLASS and reports the results as derived columns; values are ranked in the default order, which is ascending.
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass')
Example 3-41 Handling Ties in Ranking
This example ranks the two columns AGE and CLASS. If there is a tie, the smallest value is assigned to all tied values.
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', ties='low')
Example 3-42 Ranking by Groups
This example ranks the two columns AGE and CLASS and then ranks the resulting values according to COUNTRY.
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', group.by='COUNTRY')
Example 3-43 Partitioning into Deciles
To partition the columns into a different number of partitions, change the value of groups
. For example, groups=4
partitions into quartiles. This example ranks the two columns AGE and CLASS and partitions the columns into deciles (10 partitions).
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',groups=10)
Example 3-44 Estimating Cumulative Distribution Function
This example ranks the two columns AGE and CLASS and estimates the cumulative distribution function for both column.
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass',nplus1=TRUE)
Example 3-45 Scoring Ranks
This example ranks the two columns AGE and CLASS and scores the ranks in two different ways. The first command partitions the columns into percentiles (100 groups). The savage
scoring method calculates exponential scores and blom
scoring calculates normal scores.
x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', score='savage', groups=100, group.by='COUNTRY') x <- ore.rank(data=NARROW, var='AGE=RankOfAge, CLASS=RankOfClass', score='blom')
Parent topic: Explore Data
3.2.8 Sort Data
The ore.sort
function enables flexible sorting of a data frame along one or more columns specified by the by
argument.
The ore.sort
function can be used with other data pre-processing functions. The results of sorting can provide input to R visualization.
The sorting done by the ore.sort
function takes place in the Oracle database. The ore.sort
function supports the database nls.sort
option.
The ore.sort
function returns an ore.frame
.
For details about the function arguments, invoke help(ore.sort)
.
Most of the following examples use the NARROW
data set. Some examples use the ONTIME_S
data set.
Example 3-46 Sorting Columns in Descending Order
This example sorts the columns AGE and GENDER in descending order.
x <- ore.sort(data=NARROW, by='AGE,GENDER', reverse=TRUE)
Example 3-47 Sorting Different Columns in Different Orders
This example sorts AGE in descending order and GENDER in ascending order.
x <- ore.sort(data=NARROW, by='-AGE,GENDER')
Example 3-48 Sorting and Returning One Row per Unique Value
This example sorts by AGE and keep one row per unique value of AGE:
x <- ore.sort(data=NARROW, by='AGE', unique.key=TRUE)
Example 3-49 Removing Duplicate Columns
This example sorts by AGE and removes duplicate rows:
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE)
Example 3-50 Removing Duplicate Columns and Returning One Row per Unique Value
This example sorts by AGE, removes duplicate rows, and returns one row per unique value of AGE.
x <- ore.sort(data=NARROW, by='AGE', unique.data=TRUE, unique.key = TRUE)
Example 3-51 Preserving Relative Order in the Output
This example maintains the relative order in the sorted output.
x <- ore.sort(data=NARROW, by='AGE', stable=TRUE)
Example 3-52 Sorting Two Columns in Different Orders
This example sorts ONTIME_S by airline name in descending order and departure delay in ascending order.
sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY')
Example 3-53 Sorting Two Columns in Different Orders and Producing Unique Combinations
This example sorts ONTIME_S by airline name and departure delay and selects one of each combination (that is, returns a unique key).
sortedOnTime1 <- ore.sort(data=ONTIME_S, by='-UNIQUECARRIER,DEPDELAY', unique.key=TRUE)
Parent topic: Explore Data
3.2.9 Summarize Data
Summarize data with the aggregate
function.
Example 3-54 Aggregating Data
This example pushes the iris
data set to database memory as the ore.frame
object iris_of
. It aggregates the values of iris_of
by the Species column using the length
function. It then displays the first three rows of the result.
# Create a temporary database table from the iris data set and get an ore.frame. iris_of <- ore.push(iris) aggdata <- aggregate(iris_of$Sepal.Length, by = list(species = iris_of$Species), FUN = length) head(aggdata, 3)Listing for This Example
# Create a temporary database table from the iris data set and get an ore.frame. R> iris_of <- ore.push(iris) R> aggdata <- aggregate(iris_of$Sepal.Length, + by = list(species = iris_of$Species), + FUN = length) R> head(aggdata, 3) species x setosa setosa 50 versicolor versicolor 50 virginica virginica 50
Parent topic: Explore Data
3.2.10 Analyze the Distribution of Numeric Variables
The ore.univariate
function provides distribution analysis of numeric variables in an ore.frame
.
The ore.univariate
function provides these statistics:
-
All statistics reported by the
summary
function -
Signed rank test, Student's t-test
-
Extreme values reporting
The ore.univariate
function returns an ore.frame as output in all cases.
For details about the function arguments, invoke help(ore.univariate)
.
Example 3-55 Calculating the Default Univariate Statistics
This example calculates the default univariate statistics for AGE, YRS_RESIDENCE, and CLASS.
ore.univariate(NARROW, var="AGE,YRS_RESIDENCE,CLASS")
Example 3-56 Calculating the Default Univariate Statistics
This example calculates location statistics for YRS_RESIDENCE.
ore.univariate(NARROW, var="YRS_RESIDENCE", stats="location")
Example 3-57 Calculating the Complete Quantile Statistics
This example calculates complete quantile statistics for AGE and YRS_RESIDENCE.
ore.univariate(NARROW, var="AGE,YRS_RESIDENCE",stats="quantiles")
Parent topic: Explore Data
3.2.11 Principal Component Analysis
The overloaded prcomp
and princomp
functions perform principal component analysis in parallel in the database.
The prcomp
function uses a singular value decomposition of the covariance and correlations between variables. The princomp
function uses eigen decomposition of the covariance and correlations between samples.
The transparency layer methods ore.frame-prcomp
and ore.frame-princomp
enable you to use the generic functions prcomp
and princomp
on data in an ore.frame
object. This allows the functions to execute in parallel processes in the database.
For both functions, the methods support the function signature that accepts an ore.frame
as the x
argument and the signature that accepts a formula. The ore.frame
must contain only numeric data. The formula must refer only to numeric variables and have no response variable.
Function prcomp
returns a prcomp
object and function princomp
returns a princomp
object.
For details about the function arguments, invoke help('ore.frame-prcomp')
and help('ore.frame-princomp')
.
Note:
The biplot
function is not supported for the objects returned by these transparency layer methods.
Example 3-58 Using the prcomp and princomp Functions
USARRESTS <- ore.push(USArrests)
# Using prcomp
prcomp(USARRESTS)
prcomp(USARRESTS, scale. = TRUE)
# Formula interface
prcomp(~ Murder + Assault + UrbanPop, data = USARRESTS, scale. = TRUE)
# Using prcomp
princomp(USARRESTS)
princomp(USARRESTS, cor = TRUE)
# Formula interface
princomp(~ Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)
Listing for This Example
R> USARRESTS <- ore.push(USArrests)
R>
R> # Using prcomp
R>
R> prcomp(USARRESTS)
Standard deviations:
[1] 83.732400 14.212402 6.489426 2.482790
Rotation:
PC1 PC2 PC3 PC4
Murder 0.04170432 -0.04482166 0.07989066 -0.99492173
Assault 0.99522128 -0.05876003 -0.06756974 0.03893830
UrbanPop 0.04633575 0.97685748 -0.20054629 -0.05816914
Rape 0.07515550 0.20071807 0.97408059 0.07232502
R> prcomp(USARRESTS, scale. = TRUE)
Standard deviations:
[1] 1.5748783 0.9948694 0.5971291 0.4164494
Rotation:
PC1 PC2 PC3 PC4
Murder 0.5358995 -0.4181809 0.3412327 0.64922780
Assault 0.5831836 -0.1879856 0.2681484 -0.74340748
UrbanPop 0.2781909 0.8728062 0.3780158 0.13387773
Rape 0.5434321 0.1673186 -0.8177779 0.08902432
R>
R> # Formula interface
R> prcomp(~ Murder + Assault + UrbanPop, data = USARRESTS, scale. = TRUE)
Standard deviations:
[1] 1.3656547 0.9795415 0.4189100
Rotation:
PC1 PC2 PC3
Murder 0.6672955 -0.30345520 0.6801703
Assault 0.6970818 -0.06713997 -0.7138411
UrbanPop 0.2622854 0.95047734 0.1667309
R>
R> # Using princomp
R>
R> princomp(USARRESTS)
Call:
princomp(USARRESTS)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
82.890847 14.069560 6.424204 2.457837
4 variables and 50 observations.
R> princomp(USARRESTS, cor = TRUE)
Call:
princomp(USARRESTS, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3 Comp.4
1.5748783 0.9948694 0.5971291 0.4164494
4 variables and 50 observations.
R>
R> # Formula interface
R> princomp(~ Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)
Call:
princomp(~Murder + Assault + UrbanPop, data = USARRESTS, cor = TRUE)
Standard deviations:
Comp.1 Comp.2 Comp.3
1.3656547 0.9795415 0.4189100
3 variables and 50 observations.
Parent topic: Explore Data
3.2.12 Singular Value Decomposition
The overloaded svd
function performs singular value decomposition in parallel in the database.
The svd
function accepts an ore.frame
or an ore.tblmatrix
object as the x
argument. The ore.frame-svd
method distributes block SVD computation to parallel processes executing in the database. The method uses the global option ore.parallel
to determine the degree of parallelism to employ.
The function returns a list
object that contains the d
vector and v
matrix components of a singular value decomposition of argument x
. It does not return the left singular vector matrix u
, therefore the argument nu
is not used.
For details about the function arguments, invoke help('ore.frame-svd')
.
Example 3-59 Using the svd Function
USARRESTS <- ore.push(USArrests)
svd(USARRESTS)
Listing for This Example
R> USARRESTS <- ore.push(USArrests)
R> svd(USARRESTS)
$d
[1] 1419.06140 194.82585 45.66134 18.06956
$v
[,1] [,2] [,3] [,4]
[1,] 0.04239181 -0.01616262 0.06588426 0.99679535
[2,] 0.94395706 -0.32068580 -0.06655170 -0.04094568
[3,] 0.30842767 0.93845891 -0.15496743 0.01234261
[4,] 0.10963744 0.12725666 0.98347101 -0.06760284
Parent topic: Explore Data
3.3 Data Manipulation Using OREdplyr
OREdplyr
package functions transparently implement dplyr
functions for use with ore.frame
and ore.numeric
objects.
Many of these functions have non-standard evaluation (NSE) and standard evaluation (SE) interfaces. The SE functions have an underscore ( _) appended to the function name. NSE functions are useful in interactive R sessions; SE functions are convenient for use in programs.
The functions in the OREdplyr
package are described in the following topics.
- Select and Order Data
OREdplyr
functions for selecting and ordering data in columns and rows of anore.frame
object. - Join Rows
OREdplyr
functions for joining rows. - Group Columns and Rows
OREdplyr
functions for grouping columns and rows. - Aggregate Columns and Rows
OREdplyr
functions for aggregating columns and rows. - Sample Rows
OREdplyr
functions for sampling rows. - Rank Rows
OREdplyr
functions for ranking rows.
Parent topic: Prepare and Explore Data in the Database
3.3.1 Select and Order Data
OREdplyr
functions for selecting and ordering data in columns and rows of an ore.frame
object.
Table 3-2 Selecting and Ordering Columns and Rows
Function | Description |
---|---|
|
Orders rows by the specified columns. |
|
Sorts an |
|
Selects unique rows from an input |
|
Filters rows by matching the specified condition. |
|
Adds new columns. |
|
Renames the specified columns and keeps all columns. |
|
Selects only the specified columns. |
|
Selects rows by position; ignores the grouping of the input ordered |
|
Adds new columns and drops the existing columns. |
Examples of using these functions are the following:
- Examples of Selecting Columns
Examples of theselect
andrename
functions of theOREdplyr
package. - Examples of Programming with select_
Examples of theselect_
function of theOREdplyr
package. - Examples of Selecting Distinct Columns
Examples of thedistinct
andarrange
functions of theOREdplyr
package. - Examples of Selecting Rows by Position
Examples of theslice
andfilter
functions of theOREdplyr
package. - Examples of Arranging Columns
Examples of thearrange
anddesc
functions of theOREdplyr
package. - Examples of Filtering Columns
Examples of thefilter
function of theOREdplyr
package. - Examples of Mutating Columns
Examples of themutate
andtransmute
functions of theOREdplyr
package.
Parent topic: Data Manipulation Using OREdplyr
3.3.1.1 Examples of Selecting Columns
Examples of the select
and rename
functions of the OREdplyr
package.
Example 3-60 Selecting Columns
The following examples select columns from the IRIS ore.frame
object that is created by using the ore.push
function on the iris data.frame
objects.
IRIS <- ore.push(iris)
# Select the specified column
names(select(IRIS, Petal.Length))
names(select(IRIS, petal_length = Petal.Length))
# Drop the specified column
names(select(IRIS, -Petal.Length))
# rename() keeps all variables
names(rename(IRIS, petal_length = Petal.Length))
Listing for This Example
R> IRIS <- ore.push(iris)
R> # Select the specified column
R> names(select(IRIS, Petal.Length))
[1] "Petal.Length"
R> names(select(IRIS, petal_length = Petal.Length))
[1] "petal_length"
R>
R> # Drop the specified column
R> names(select(IRIS, -Petal.Length))
[1] "Sepal.Length" "Sepal.Width" "Petal.Width" "Species"
R>
R> # rename() keeps all variables
R> names(rename(IRIS, petal_length = Petal.Length))
[1] "Sepal.Length" "Sepal.Width" "petal_length" "Petal.Width" "Species"
Parent topic: Select and Order Data
3.3.1.2 Examples of Programming with select_
Examples of the select_
function of the OREdplyr
package.
Example 3-61 Programming with select
This example uses the select_
function to select columns from the IRIS ore.frame
object that is created by using the ore.push
function on the iris data.frame
object.
IRIS <- ore.push(iris)
# Use ~, double quote, or quote function to specify the column to select
head(select_(IRIS, ~Petal.Length))
head(select_(IRIS, "Petal.Length"))
head(select_(IRIS, quote(-Petal.Length), quote(-Petal.Width)))
head(select_(IRIS, .dots = list(quote(-Petal.Length), quote(-Petal.Width))))
Listing for This Example
R> IRIS <- ore.push(iris)
R> # Use ~, double quote, or quote function to specify the column to select
R> head(select_(IRIS, ~Petal.Length))
Petal.Length
1 1.4
2 1.4
3 1.3
4 1.5
5 1.4
6 1.7
R> head(select_(IRIS, "Petal.Length"))
Petal.Length
1 1.4
2 1.4
3 1.3
4 1.5
5 1.4
6 1.7
R> head(select_(IRIS, quote(-Petal.Length), quote(-Petal.Width)))
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
R> head(select_(IRIS, .dots = list(quote(-Petal.Length), quote(-Petal.Width))))
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
Parent topic: Select and Order Data
3.3.1.3 Examples of Selecting Distinct Columns
Examples of the distinct
and arrange
functions of the OREdplyr
package.
Example 3-62 Selecting Distinct Columns
df <- data.frame(
x = sample(10, 100, rep = TRUE),
y = sample(10, 100, rep = TRUE)
)
DF <- ore.push(df)
nrow(DF)
nrow(distinct(DF))
arrange(distinct(DF, x), x)
arrange(distinct(DF, y), y)
# Use distinct on computed variables
arrange(distinct(DF, diff = abs(x - y)), diff)
Listing for This Example
R> df <- data.frame(
+ x = sample(10, 100, rep = TRUE),
+ y = sample(10, 100, rep = TRUE)
+ )
R> DF <- ore.push(df)
R> nrow(DF)
[1] 100
R> nrow(distinct(DF))
[1] 66
R> arrange(distinct(DF, x), x)
x
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
R> arrange(distinct(DF, y), y)
y
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
R>
R> # Use distinct on computed variables
R> arrange(distinct(DF, diff = abs(x - y)), diff)
diff
1 0
2 1
3 2
4 3
5 4
6 5
7 6
8 7
9 8
10 9
Parent topic: Select and Order Data
3.3.1.4 Examples of Selecting Rows by Position
Examples of the slice
and filter
functions of the OREdplyr
package.
Example 3-63 Selecting Rows by Position
MTCARS <- ore.push(mtcars)
# Display the names of the rows in MTCARS
rownames(MTCARS)
# Select the first row
slice(MTCARS, 1L)
# Arrange the rows by horsepower, then select the first row by position
MTCARS <- arrange(MTCARS, hp)
slice(MTCARS, 1L)
by_cyl <- group_by(MTCARS, cyl)
# Grouping is ignored by slice.
slice(by_cyl, 1:2)
# Use filter and row_number to obtain slices per group.
filter(by_cyl, row_number(hp) < 3L)
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> # Display the names of the rows in MTCARS
R> rownames(MTCARS)
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout"
[6] "Valiant" "Duster 360" "Merc 240D" "Merc 230" "Merc 280"
[11] "Merc 280C" "Merc 450SE" "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128" "Honda Civic" "Toyota Corolla"
[21] "Toyota Corona" "Dodge Challenger" "AMC Javelin" "Camaro Z28" "Pontiac Firebird"
[26] "Fiat X1-9" "Porsche 914-2" "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
R> # Select the first row
R> slice(MTCARS, 1L)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
R>
R> # Arrange the rows by horsepower, then select the first row by position
R> MTCARS <- arrange(MTCARS, hp)
R> slice(MTCARS, 1L)
mpg cyl disp hp drat wt qsec vs am gear carb
1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
R>
R> by_cyl <- group_by(MTCARS, cyl)
R> # Grouping is ignored by slice
R> slice(by_cyl, 1:2)
mpg cyl disp hp drat wt qsec vs am gear carb
1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Warning message:
In slice_.ore.frame(.data, .dots = .ore.dplyr.exprall(..., env = parent.frame())) :
grouping is ignored
R> # Use filter and row_number to obtain slices per group
R> filter(by_cyl, row_number(hp) < 3L)
mpg cyl disp hp drat wt qsec vs am gear carb
1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
5 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
6 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
Parent topic: Select and Order Data
3.3.1.5 Examples of Arranging Columns
Examples of the arrange
and desc
functions of the OREdplyr
package.
Example 3-64 Arranging Columns
This example arranges columns from the ore.frame
object MTCARS that is created by using the ore.push
function on the mtcars data.frame
object. The second arrange()
invocation calls the desc()
function to arrange the values in descending order.
MTCARS <- ore.push(mtcars)
head(arrange(mtcars, cyl, disp))
head(arrange(MTCARS, desc(disp)))
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> head(arrange(MTCARS, cyl, disp))
mpg cyl disp hp drat wt qsec vs am gear carb
1 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
3 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
4 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
5 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
6 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
R> head(arrange(MTCARS, desc(disp)))
mpg cyl disp hp drat wt qsec vs am gear carb
1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
3 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
4 19.2 8 400 175 3.08 3.845 17.05 0 0 3 2
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
6 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
Parent topic: Select and Order Data
3.3.1.6 Examples of Filtering Columns
Examples of the filter
function of the OREdplyr
package.
Example 3-65 Filtering Columns
This example filters columns from the MTCARS ore.frame
object that is created by using the ore.push
function on the mtcars data.frame
object.
MTCARS <- ore.push(mtcars)
head(filter(MTCARS, cyl == 8))
# Using multiple criteria
head(filter(MTCARS, cyl < 6 & vs == 1))
# Using multiple arguments is the equivalent to using &
head(filter(MTCARS, cyl < 6, vs == 1))
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> head(filter(MTCARS, cyl == 8))
mpg cyl disp hp drat wt qsec vs am gear carb
1 18.7 8 360.0 175 3.15 3.44 17.02 0 0 3 2
2 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4
3 16.4 8 275.8 180 3.07 4.07 17.40 0 0 3 3
4 17.3 8 275.8 180 3.07 3.73 17.60 0 0 3 3
5 15.2 8 275.8 180 3.07 3.78 18.00 0 0 3 3
6 10.4 8 472.0 205 2.93 5.25 17.98 0 0 3 4
R> head(filter(MTCARS, cyl < 6 & vs == 1))
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
R>
R> # Using multiple arguments is the equivalent to using &
R> head(filter(MTCARS, cyl < 6, vs == 1))
mpg cyl disp hp drat wt qsec vs am gear carb
1 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
2 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
4 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
5 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
6 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Parent topic: Select and Order Data
3.3.1.7 Examples of Mutating Columns
Examples of the mutate
and transmute
functions of the OREdplyr
package.
Example 3-66 Mutating Columns
This example uses the MTCARS ore.frame
object that is created by using the ore.push
function on the mtcars data.frame
object.
The mutate
function adds the extra column displ_1 with the value derived from that of column disp. Setting the column to NULL removes the column.
MTCARS <- ore.push(mtcars)
head(mutate(MTCARS, displ_l = disp / 61.0237))
head(transmute(MTCARS, displ_l = disp / 61.0237))
head(mutate(MTCARS, cyl = NULL))
head(mutate(MTCARS, cyl = NULL, hp = NULL, displ_l = disp / 61.0237))
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> head(mutate(MTCARS, displ_l = disp / 61.0237))
mpg cyl disp hp drat wt qsec vs am gear carb displ_l
1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2.621932
2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2.621932
3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1.769804
4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4.227866
5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 5.899347
6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.687092
R> head(transmute(MTCARS, displ_l = disp / 61.0237))
displ_l
1 2.621932
2 2.621932
3 1.769804
4 4.227866
5 5.899347
6 3.687092
R> head(mutate(mtcars, cyl = NULL))
mpg disp hp drat wt qsec vs am gear carb
1 21.0 160 110 3.90 2.620 16.46 0 1 4 4
2 21.0 160 110 3.90 2.875 17.02 0 1 4 4
3 22.8 108 93 3.85 2.320 18.61 1 1 4 1
4 21.4 258 110 3.08 3.215 19.44 1 0 3 1
5 18.7 360 175 3.15 3.440 17.02 0 0 3 2
6 18.1 225 105 2.76 3.460 20.22 1 0 3 1
R> head(mutate(mtcars, cyl = NULL, hp = NULL, displ_l = disp / 61.0237))
mpg disp drat wt qsec vs am gear carb displ_l
1 21.0 160 3.90 2.620 16.46 0 1 4 4 2.621932
2 21.0 160 3.90 2.875 17.02 0 1 4 4 2.621932
3 22.8 108 3.85 2.320 18.61 1 1 4 1 1.769804
4 21.4 258 3.08 3.215 19.44 1 0 3 1 4.227866
5 18.7 360 3.15 3.440 17.02 0 0 3 2 5.899347
6 18.1 225 2.76 3.460 20.22 1 0 3 1 3.687092
Parent topic: Select and Order Data
3.3.2 Join Rows
OREdplyr
functions for joining rows.
Table 3-3 Joining Rows
Function | Description |
---|---|
|
Returns the union of rows from |
|
Returns all combination of rows from x and y over matched columns. |
|
Returns rows from |
|
Returns rows from |
Example 3-67 Joining Rows
To join two tables, the join
function selects the columns in each table that have the same name or uses the argument by
to specify the columns.
MTCARS <- ore.push(mtcars)
M1 <- filter(select(MTCARS, mpg, cyl, carb), carb < 6L)
M2 <- filter(select(MTCARS, cyl, hp, carb), carb > 2L)
names(inner_join(M1, M2))
nrow(left_join(M1, M2))
nrow(right_join(M1, M2))
nrow(full_join(M1, M2))
names(M2) <- c("cyl", "hp", "carb2")
names(inner_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(inner_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(left_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(right_join(M1, M2, by = c("cyl", carb="carb2")))
nrow(full_join(M1, M2, by = c("cyl", carb="carb2")))
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> M1 <- filter(select(MTCARS, mpg, cyl, carb), carb < 6L)
R> M2 <- filter(select(MTCARS, cyl, hp, carb), carb > 2L)
R>
R> names(inner_join(M1, M2))
[1] "cyl" "carb" "mpg" "hp"
R> nrow(left_join(M1, M2))
[1] 78
R> nrow(right_join(M1, M2))
[1] 63
R> nrow(full_join(M1, M2))
[1] 80
R>
R> names(M2) <- c("cyl", "hp", "carb2")
R> names(inner_join(M1, M2, by = c("cyl", carb="carb2")))
[1] "cyl" "carb" "mpg" "hp"
R> nrow(inner_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 61
R> nrow(left_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 78
R> nrow(right_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 63
R> nrow(full_join(M1, M2, by = c("cyl", carb="carb2")))
[1] 80
Parent topic: Data Manipulation Using OREdplyr
3.3.3 Group Columns and Rows
OREdplyr
functions for grouping columns and rows.
Table 3-4 Grouping Columns and Rows
Function | Description |
---|---|
|
Groups an |
|
Lists the number of rows in each group. |
|
Shows the names of the grouping columns. |
|
Returns the number of groups. |
|
Drops the grouping from the input |
Example 3-68 Using Grouping Functions
The following examples use the ore.frame
object MTCARS that is created by using the ore.push
function on the mtcars data.frame
object. They exemplify the use of the grouping functions group_by
, group_size
, groups
, n_group
, and ungroup
. They also use the OREdplyr
functions arrange
, rename
, and summarize
.
MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)
# Apply the summarise function to each group
arrange(summarise(by_cyl, mean(disp), mean(hp)), cyl)
# Summarise drops one layer of grouping
by_vs_am <- group_by(MTCARS, vs, am)
by_vs <- summarise(by_vs_am, n = n())
arrange(by_vs, vs, am)
arrange(summarise(by_vs, n = sum(n)), vs)
# Remove grouping
summarise(ungroup(by_vs), n = sum(n))
# Group by expressions with mutate
arrange(group_size(group_by(mutate(MTCARS, vsam = vs + am), vsam)), vsam)
# Rename the grouping column
groups(rename(group_by(MTCARS, vs), vs2 = vs))
# Add more grouping columns
groups(group_by(by_cyl, vs, am))
groups(group_by(by_cyl, vs, am, add = TRUE))
# Drop duplicate groups
groups(group_by(by_cyl, cyl, cyl))
# Load the magrittr library to use the forward-pipe operator %>%
library(magrittr)
by_cyl_gear_carb <- MTCARS %>% group_by(cyl, gear, carb)
n_groups(by_cyl_gear_carb)
arrange(group_size(by_cyl_gear_carb), cyl, gear, carb)
by_cyl <- MTCARS %>% group_by(cyl)
# Number of groups
n_groups(by_cyl)
# Size of each group
arrange(group_size(by_cyl), cyl)
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R>
R> # Apply the summarise function to each group
R> arrange(summarise(by_cyl, mean(disp), mean(hp)), cyl)
cyl mean.disp. mean.hp.
1 4 105.1364 82.63636
2 6 183.3143 122.28571
3 8 353.1000 209.21429
R>
R> # Summarise drops one layer of grouping
R> by_vs_am <- group_by(MTCARS, vs, am)
R> by_vs <- summarise(by_vs_am, n = n())
R> arrange(by_vs, vs, am)
vs am n
1 0 0 12
2 0 1 6
3 1 0 7
4 1 1 7
R> arrange(summarise(by_vs, n = sum(n)), vs)
vs n
1 0 18
2 1 14
R>
R> # Remove grouping
R> summarise(ungroup(by_vs), n = sum(n))
n
32
R>
R> # Group by expressions with mutate
R> arrange(group_size(group_by(mutate(MTCARS, vsam = vs + am), vsam)), vsam)
vsam n
1 0 12
2 1 13
3 2 7
R>
R> # Rename the grouping column
R> groups(rename(group_by(MTCARS, vs), vs2 = vs))
[1] "vs2"
R>
R> # Add more grouping columns
R> groups(group_by(by_cyl, vs, am))
[[1]]
[1] "vs"
[[2]]
[1] "am"
R> groups(group_by(by_cyl, vs, am, add = TRUE))
[[1]]
[1] "cyl"
[[2]]
[1] "vs"
[[3]]
[1] "am"
R>
R> # Drop duplicate groups
R> groups(group_by(by_cyl, cyl, cyl))
[1] "cyl
R>
R> # Load the magrittr library to use the forward-pipe operator %>%
R> library(magrittr)
R> by_cyl_gear_carb <- MTCARS %>% group_by(cyl, gear, carb)
R> n_groups(by_cyl_gear_carb)
[1] 12
R> arrange(group_size(by_cyl_gear_carb), cyl, gear, carb)
cyl gear carb n
1 4 3 1 1
2 4 4 1 4
3 4 4 2 4
4 4 5 2 2
5 6 3 1 2
6 6 4 4 4
7 6 5 6 1
8 8 3 2 4
9 8 3 3 3
10 8 3 4 5
11 8 5 4 1
12 8 5 8 1
R>
R> by_cyl <- MTCARS %>% group_by(cyl)
R> # Number of groups
R> n_groups(by_cyl)
[1] 3
R> # Number of groups
R> n_groups(by_cyl)
[1] 3
R>
R> # Size of each group
R> arrange(group_size(by_cyl), cyl)
cyl n
1 4 11
2 6 7
3 8 14
Parent topic: Data Manipulation Using OREdplyr
3.3.4 Aggregate Columns and Rows
OREdplyr
functions for aggregating columns and rows.
Table 3-5 Aggregating Columns and Rows
Function | Description |
---|---|
|
Counts rows by group; similar to |
|
Summarizes columns by using aggregate functions. When an |
tally |
Tallies rows by group; a convenient wrapper for |
Example 3-69 Aggregating Columns
The following examples use the ore.frame
object MTCARS that is created by using the ore.push
function on the mtcars data.frame
object. They exemplify the use of the aggregation functions count
, summarize
, and tally
. They also use the OREdplyr
functions arrange
and group_by
.
MTCARS <- ore.push(mtcars)
arrange(tally(group_by(MTCARS, cyl)), cyl)
tally(group_by(MTCARS, cyl), sort = TRUE)
# Multiple tallys progressively roll up the groups
cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))
cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), wt = hp, sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))
cyl_by_gear <- count(MTCARS, cyl, gear, wt = hp + mpg, sort = TRUE)
tally(cyl_by_gear, sort = TRUE)
tally(tally(cyl_by_gear))
# Load the magrittr library to use the forward-pipe operator %>%
library(magrittr)
MTCARS %>% group_by(cyl) %>% tally(sort = TRUE)
# count is more succinct and also does the grouping
MTCARS %>% count(cyl) %>% arrange(cyl)
MTCARS %>% count(cyl, wt = hp) %>% arrange(cyl)
MTCARS %>% count_("cyl", wt = hp, sort = TRUE)
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> arrange(tally(group_by(MTCARS, cyl)), cyl)
cyl n
1 4 11
2 6 7
3 8 14
R> tally(group_by(MTCARS, cyl), sort = TRUE)
cyl n
1 8 14
2 4 11
3 6 7
R>
R> # Multiple tallys progressively roll up the groups
R> cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
cyl n
1 8 14
2 4 11
3 6 7
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
n
32
R>
R> cyl_by_gear <- tally(group_by(MTCARS, cyl, gear), wt = hp, sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
cyl n
1 8 2929
2 4 909
3 6 856
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
n
4694
R>
R> cyl_by_gear <- count(MTCARS, cyl, gear, wt = hp + mpg, sort = TRUE)
R> tally(cyl_by_gear, sort = TRUE)
Using n as weighting variable
cyl n
1 8 3140.4
2 4 1202.3
3 6 994.2
R> tally(tally(cyl_by_gear))
Using n as weighting variable
Using n as weighting variable
n
5336.9
R>
R> # Load the magrittr library to use the forward-pipe operator %>%
R> library(magrittr)
R> MTCARS %>% group_by(cyl) %>% tally(sort = TRUE)
cyl n
1 8 14
2 4 11
3 6 7
R>
R> # count is more succinct and also does the grouping
R> MTCARS %>% count(cyl) %>% arrange(cyl)
cyl n
1 4 11
2 6 7
3 8 14
R> MTCARS %>% count(cyl, wt = hp) %>% arrange(cyl)
cyl n
1 4 909
2 6 856
3 8 2929
R> MTCARS %>% count_("cyl", wt = hp, sort = TRUE)
cyl n
1 8 2929
2 4 909
3 6 856
Parent topic: Data Manipulation Using OREdplyr
3.3.5 Sample Rows
OREdplyr
functions for sampling rows.
Table 3-6 Sampling Row Functions
Function | Description |
---|---|
|
Samples an |
|
Samples an |
Example 3-70 Sampling Rows
These examples use the ore.frame
object MTCARS that is created by using the ore.push
function on the mtcars data.frame
object. They exemplify the use of the sampling functions sample_n
and sample_frac
. They also use the OREdplyr
functions arrange
and summarize
.
MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)
# Sample fixed number per group of rows from the entire dataset
sample_n(MTCARS, 10)
nrow(sample_n(MTCARS, 50, replace = TRUE))
sample_n(MTCARS, 10, weight = mpg)
sample_n(MTCARS, 10, weight = MTCARS[["mpg"]])
# Sample fixed number of rows per group with replacement and weight
arrange(sample_n(by_cyl, 3), cyl, mpg)
arrange(summarise(sample_n(by_cyl, 10, replace = TRUE), n = n()), cyl)
arrange(summarise(sample_n(by_cyl, 3, weight = mpg/mean(mpg)), n = n()), cyl)
arrange(summarise(sample_n(by_cyl, 3,
weight = by_cyl[["mpg"]]/mean(by_cyl[["mpg"]])), n = n()), cyl)
# Sample fixed fraction per group
nrow(sample_frac(MTCARS, 0.1))
nrow(sample_frac(MTCARS, 1.5, replace = TRUE))
nrow(sample_frac(MTCARS, 0.1, weight = 1/mpg))
Listing for This Example
R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R>
R> # Sample fixed number per group of rows from the entire dataset
R> sample_n(MTCARS, 10)
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710|4 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Ford Pantera L|2 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Honda Civic|10 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Lotus Europa|6 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Maserati Bora|3 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Mazda RX4|5 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag|9 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Merc 280|8 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Toyota Corolla|7 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona|1 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
R> nrow(sample_n(MTCARS, 50, replace = TRUE))
[1] 50
R>
R> # Sample fixed number of rows per group with replacement and weight
R> arrange(sample_n(by_cyl, 3), cyl, mpg)
cyl mpg disp hp drat wt qsec vs am gear carb
1 4 22.8 108.0 93 3.85 2.320 18.61 1 1 4 1
2 4 24.4 146.7 62 3.69 3.190 20.00 1 0 4 2
3 4 30.4 95.1 113 3.77 1.513 16.90 1 1 5 2
4 6 19.2 167.6 123 3.92 3.440 18.30 1 0 4 4
5 6 19.7 145.0 175 3.62 2.770 15.50 0 1 5 6
6 6 21.4 258.0 110 3.08 3.215 19.44 1 0 3 1
7 8 10.4 460.0 215 3.00 5.424 17.82 0 0 3 4
8 8 15.2 304.0 150 3.15 3.435 17.30 0 0 3 2
9 8 15.2 275.8 180 3.07 3.780 18.00 0 0 3 3
R> arrange(summarise(sample_n(by_cyl, 10, replace = TRUE), n = n()), cyl)
cyl n
1 4 10
2 6 10
3 8 10
R> arrange(summarise(sample_n(by_cyl, 3, weight = mpg/mean(mpg)), n = n()), cyl)
cyl n
1 4 3
2 6 3
3 8 3
R> arrange(summarise(sample_n(by_cyl, 3, weight = by_cyl[["mpg"]]/mean(by_cyl[["mpg"]])), n = n()), cyl)
cyl n
1 4 3
2 6 3
3 8 3
R>
R> nrow(sample_frac(MTCARS, 0.1))
[1] 3
R> nrow(sample_frac(MTCARS, 1.5, replace = TRUE))
[1] 48
R> nrow(sample_frac(MTCARS, 0.1, weight = 1/mpg))
[1] 3
Parent topic: Data Manipulation Using OREdplyr
3.3.6 Rank Rows
OREdplyr
functions for ranking rows.
The ranking functions rank the elements in an ordered ore.vector
by its values. An ore.character
is coerced to an ore.factor
. The values of an ore.factor
are based upon factor levels. To reverse the direction of the ranking, use the desc
function.
Table 3-7 Ranking Rows
Function | Description |
---|---|
cume_dist |
A cumulative distribution function: returns the proportion of all values that are less than or equal to the current rank. |
dense_rank |
Like |
|
Gets the first value from an ordered |
|
Gets the last value from an ordered |
min_rank |
Equivalent to |
|
Obtains the value at the specified position in the order. |
ntile |
A rough ranking that breaks the input vector into n buckets. |
|
Gets the nth value from an ordered |
percent_rank |
Returns a number between 0 and 1 that is computed by rescaling |
|
Equivalent to |
|
Selects the top or bottom number of rows. |
Example 3-71 Ranking Rows
These examples use the ranking functions row_number
, min_rank
, dense_rank
, percent_rank
, cume_dist
, and ntile
.
X <- ore.push(c(5, 1, 3, 2, 2, NA))
row_number(X)
row_number(desc(X))
min_rank(X)
dense_rank(X)
percent_rank(X)
cume_dist(X)
ntile(X, 2)
ntile(ore.push(runif(100)), 10)
MTCARS <- ore.push(mtcars)
by_cyl <- group_by(MTCARS, cyl)
# Using ranking functions with an ore.frame
head(mutate(MTCARS, rank = row_number(hp)))
head(mutate(MTCARS, rank = min_rank(hp)))
head(mutate(MTCARS, rank = dense_rank(hp)))
# Using ranking functions with a grouped ore.frame
head(mutate(by_cyl, rank = row_number(hp)))
head(mutate(by_cyl, rank = min_rank(hp)))
head(mutate(by_cyl, rank = dense_rank(hp)))
Listing for This Example
R> X <- ore.push(c(5, 1, 3, 2, 2, NA))
R>
R> row_number(X)
[1] 5 1 4 2 3 6
R> row_number(desc(X))
[1] 1 5 2 3 4 6
R>
R> min_rank(X)
[1] 5 1 4 2 2 6
R>
R> dense_rank(X)
[1] 4 1 3 2 2 6
R>
R> percent_rank(X)
[1] 0.8 0.0 0.6 0.2 0.2 1.0
R>
R> cume_dist(X)
[1] 0.8333333 0.1666667 0.6666667 0.5000000 0.5000000 1.0000000
R>
R> ntile(X, 2)
[1] 2 1 2 1 1 2
R> ntile(ore.push(runif(100)), 10)
[1] 6 10 5 2 1 1 8 3 8 8 7 3 10 3 7 9 9 4 4 10 10 7 2 3 7 4 5 5 3 9 4 6 8 4 10 6 1 5 5 4 6 9
[43] 5 8 2 7 7 1 2 9 1 2 8 5 6 5 3 4 7 1 3 1 10 1 5 5 10 9 2 3 9 6 6 8 8 6 3 7 2 2 8 4 1 9
[85] 6 10 4 10 7 2 9 10 7 2 4 9 6 3 8 1
R>
R> MTCARS <- ore.push(mtcars)
R> by_cyl <- group_by(MTCARS, cyl)
R>
R> # Using ranking functions with an ore.frame
R> head(mutate(MTCARS, rank = row_number(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 13
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 7
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 14
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 20
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10
R>
R> head(mutate(MTCARS, rank = min_rank(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 12
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 12
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 7
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 12
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 20
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 10
R>
R> head(mutate(MTCARS, rank = dense_rank(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 11
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 11
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 6
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 11
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 9
R>
R> # Using ranking functions with a grouped ore.frame
R> head(mutate(by_cyl, rank = row_number(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 7
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 4
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
R>
R> head(mutate(by_cyl, rank = min_rank(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 7
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
R>
R> head(mutate(by_cyl, rank = dense_rank(hp)))
mpg cyl disp hp drat wt qsec vs am gear carb rank
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 6
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 1
Parent topic: Data Manipulation Using OREdplyr
3.4 About Using Third-Party Packages on the Client
In Oracle Machine Learning for R, if you want to use functions from an open source R package from The Comprehensive R Archive Network (CRAN) or other third-party R package, then you would generally do so in the context of embedded R execution.
Using embedded R execution, you can take advantage of the likely greater amount of memory on the database server.
However, if you want to use a third-party package function in your local R session on data from an Oracle database table, you must use the ore.pull
function to get the data from an ore.frame
object to your local session as a data.frame
object. This is the same as using open source R except that you can extract the data from the database without needing the help of a DBA.
When pulling data from a database table to a local data.frame
, you are limited to using the amount of data that can fit into the memory of your local machine. On your local machine, you do not have the benefits provided by embedded R execution.
To use a third-party package, you must install it on your system and load it in your R session.
For an example that uses the kernlab
package, see Example 2-13.
Example 3-72 Downloading, Installing, and Loading a Third-Party Package on the Client
This example demonstrates downloading, installing, and loading the CRAN package kernlab
. The kernlab
package contains kernel-based machine learning methods. The example invokes the install.packages
function to download and install the package. It then invokes the library
function to load the package.
install.packages("kernlab") library("kernlab")
Listing for This Example
R> install.packages("kernlab") trying URL 'http://cran.rstudio.com/bin/windows/contrib/3.0/kernlab_0.9-19.zip' Content type 'application/zip' length 2029405 bytes (1.9 Mb) opened URL downloaded 1.9 Mb package 'kernlab' successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\oml_user\AppData\Local\Temp\RtmpSKVZql\downloaded_packages R> library("kernlab")
Example 3-73 Using a kernlab Package Function
This example invokes the demo
function to look for example programs in the kernlab
package. Because the package does not have examples, this example then gets help for the ksvm
function. The example invokes example code from the help.
demo(package = "kernlab") help(package = "kernlab", ksvm) data(spam) index <- sample(1:dim(spam)[1]) spamtrain <- spam[index[1:floor(dim(spam)[1]/2)], ] spamtest <- spam[index[((ceiling(dim(spam)[1]/2)) + 1):dim(spam)[1]], ] filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot", + kpar=list(sigma=0.05),C=5,cross=3) filter table(mailtype,spamtest[,58])
Listing for This Example
> demo(package = "kernlab") no demos found > help(package = "kernlab", ksvm) # Output not shown. > data(spam) > index <- sample(1:dim(spam)[1]) > spamtrain <- spam[index[1:floor(dim(spam)[1]/2)], ] > spamtest <- spam[index[((ceiling(dim(spam)[1]/2)) + 1):dim(spam)[1]], ] > filter <- ksvm(type~.,data=spamtrain,kernel="rbfdot", + kpar=list(sigma=0.05),C=5,cross=3) > filter Support Vector Machine object of class "ksvm" SV type: C-svc (classification) parameter : cost C = 5 Gaussian Radial Basis kernel function. Hyperparameter : sigma = 0.05 Number of Support Vectors : 970 Objective Function Value : -1058.218 Training error : 0.018261 Cross validation error : 0.08696 > mailtype <- predict(filter,spamtest[,-58]) > table(mailtype,spamtest[,58]) mailtype nonspam spam nonspam 1347 136 spam 45 772
Parent topic: Prepare and Explore Data in the Database