This chapter describes R support for big data.
Note:
Oracle R Advanced Analytics for Hadoop was previously called Oracle R Connector for Hadoop.

Oracle R Advanced Analytics for Hadoop provides a general computation framework in which you can use the R language to write your custom logic as mappers or reducers. The code executes in a distributed, parallel manner using the available compute and storage resources on the Hadoop cluster.
Oracle R Advanced Analytics for Hadoop includes a collection of R packages that provides:
Interfaces to work with Apache Hive tables, the Apache Hadoop compute infrastructure, the local R environment, and Oracle database tables
Predictive analytic techniques for linear regression, generalized linear models, neural networks, matrix completion using low rank matrix factorization, nonnegative matrix factorization, k-means clustering, principal components analysis, and multivariate analysis. While these techniques have R interfaces, Oracle R Advanced Analytics for Hadoop implements them in either Java or R as distributed, parallel MapReduce jobs, thereby leveraging all nodes of your Hadoop cluster.
You install and load this package as you would any other R package. Using simple R functions, you can perform tasks like these:
Access and transform HDFS data using a Hive-enabled transparency layer
Use the R language for writing mappers and reducers
Copy data between R memory, the local file system, HDFS, Hive, and Oracle databases
Schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations
To use Oracle R Advanced Analytics for Hadoop, you should be familiar with MapReduce programming, R programming, and statistical methods.
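For instance, the following hedged sketch shows a few of the tasks listed above in a single R session. The data and object names are hypothetical, and the functions used here are described in detail later in this chapter.

# Work with Hive tables through the transparency layer
ore.connect(type="HIVE")

# Copy an R data frame from local memory into HDFS
dfs.id <- hdfs.put(iris)

# Copy the HDFS data back into a local R data frame
local.df <- hdfs.get(dfs.id)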
Oracle R Advanced Analytics for Hadoop APIs
Oracle R Advanced Analytics for Hadoop provides access from a local R client to Apache Hadoop using functions with these prefixes:
hadoop: Identifies functions that provide an interface to Hadoop MapReduce

hdfs: Identifies functions that provide an interface to HDFS

orch: Identifies a variety of functions; orch is a general prefix for ORCH functions

ore: Identifies functions that provide an interface to a Hive data store
Oracle R Advanced Analytics for Hadoop uses data frames as the primary object type, but it can also operate on vectors and matrices to exchange data with HDFS. The APIs support the numeric, integer, and character data types in R.
All of the APIs are included in the ORCH library. The functions are listed in "Oracle R Advanced Analytics for Hadoop Functions".
See Also:
The R Project website at http://www.r-project.org/
For Oracle R Advanced Analytics for Hadoop to access the data stored in HDFS, the input files must comply with the following requirements:
All input files for a MapReduce job must be stored in one directory as the parts of one logical file. Any valid HDFS directory name and file name extensions are acceptable.
Any file in that directory with a name beginning with an underscore (_) is ignored.
All delimiters are supported, and key and value delimiters can be different.
You can also convert a delimited file into binary format, using the Rdata representation from R, for the best performance.
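As an illustration of these requirements, the following hedged sketch uploads a local delimited file into HDFS and attaches an existing HDFS directory for use with ORCH. The paths, file names, and argument names are assumptions; consult the R Help topics for the exact signatures of hdfs.upload and hdfs.attach.

# Copy a comma-delimited file from the local file system into HDFS (hypothetical paths)
dfs.id <- hdfs.upload("/home/oracle/data/ontime.csv", dfs.name = "ontime")

# Or attach a directory of part files that already exists in HDFS
dfs.id <- hdfs.attach("/user/oracle/ontime")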
Apache Hive provides an alternative storage and retrieval mechanism to HDFS files through a querying language called HiveQL, which closely resembles SQL. Hive uses MapReduce for distributed processing. However, the data is structured and has additional metadata to support data discovery. Oracle R Advanced Analytics for Hadoop uses the data preparation and analysis features of HiveQL, while enabling you to use R language constructs.
See Also:
The Apache Hive website at http://hive.apache.org
ORCH provides these conversion functions to help you move data between HDFS and Hive:
orch.hdfs2hive
orch.hive2hdfs
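A hedged sketch of the round trip follows. The table name and HDFS identifier are placeholders, and the exact argument forms are assumptions; see the R Help topics for these functions.

# Expose a Hive table as an HDFS identifier usable by the hdfs.*, orch.*, and hadoop.* APIs
dfs.id <- orch.hive2hdfs("datatab1")

# Convert the HDFS data back into a Hive table
orch.hdfs2hive(dfs.id)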
You can connect to Hive and manage objects using R functions that have an ore prefix, such as ore.connect. If you are also using Oracle R Enterprise, then you will recognize these functions. The ore functions in Oracle R Enterprise create and manage objects in an Oracle database, and the ore functions in Oracle R Advanced Analytics for Hadoop create and manage objects in a Hive database. You can connect to one database at a time, either Hive or Oracle Database, but not both simultaneously.
See Table 8-7 for a list of ORE as.ore.* and is.ore.* functions.
Oracle R Advanced Analytics for Hadoop also overloads the following standard generic R functions with methods to work with Hive objects.
Character methods
casefold, chartr, gsub, nchar, substr, substring, tolower, toupper
This release does not support grepl or sub.
Frame methods
attach, show
[, $, $<-, [[, [[<-
Subset functions: head, tail
Metadata functions: dim, length, NROW, nrow, NCOL, ncol, names, names<-, colnames, colnames<-
Conversion functions: as.data.frame, as.env, as.list
Arithmetic operators: +, -, *, ^, %%, %/%, /
Compare, Logic, xor, !
Test functions: is.finite, is.infinite, is.na, is.nan
Mathematical transformations: abs, acos, asin, atan, ceiling, cos, exp, expm1, floor, log, log10, log1p, log2, logb, round, sign, sin, sqrt, tan, trunc
Basic statistics: colMeans, colSums, rowMeans, rowSums, Summary, summary, unique
by, merge
unlist, rbind, cbind, data.frame, eval
This release does not support dimnames, interaction, max.col, row.names, row.names<-, scale, split, subset, transform, with, or within.
Logical methods
ifelse, Logic, xor, !
Matrix methods
Not supported
Numeric methods
Arithmetic operators: +, -, *, ^, %%, %/%, /
Test functions: is.finite, is.infinite, is.nan
abs, acos, asin, atan, ceiling, cos, exp, expm1, floor, log, log1p, log2, log10, logb, mean, round, sign, sin, sqrt, Summary, summary, tan, trunc, zapsmall
This release does not support atan2, besselI, besselK, besselJ, besselY, diff, factorial, lfactorial, pmax, pmin, or tabulate.
Vector methods
show, length, c
Test functions: is.vector, is.na
Conversion functions: as.vector, as.character, as.numeric, as.integer, as.logical
[, [<-, |
by, Compare, head, %in%, paste, sort, table, tail, tapply, unique
This release does not support interaction, lengthb, rank, or split.
Example 8-1 shows simple data preparation and processing. For additional details, see "Support for Hive Data Types."
Example 8-1 Using R to Process Data in Hive Tables
# Connect to Hive
ore.connect(type="HIVE")

# Attach the current envt. into search path of R
ore.attach()

# create a Hive table by pushing the numeric columns of the iris data set
IRIS_TABLE <- ore.push(iris[1:4])

# Create bins based on Petal Length
IRIS_TABLE$PetalBins = ifelse(IRIS_TABLE$Petal.Length < 2.0, "SMALL PETALS",
+                      ifelse(IRIS_TABLE$Petal.Length < 4.0, "MEDIUM PETALS",
+                      ifelse(IRIS_TABLE$Petal.Length < 6.0,
+                             "MEDIUM LARGE PETALS", "LARGE PETALS")))

#PetalBins is now a derived column of the HIVE object
> names(IRIS_TABLE)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "PetalBins"

# Based on the bins, generate summary statistics for each group
aggregate(IRIS_TABLE$Petal.Length, by = list(PetalBins = IRIS_TABLE$PetalBins),
+         FUN = summary)
1        LARGE PETALS 6 6.025000 6.200000 6.354545 6.612500 6.9 0
2 MEDIUM LARGE PETALS 4 4.418750 4.820000 4.888462 5.275000 5.9 0
3       MEDIUM PETALS 3 3.262500 3.550000 3.581818 3.808333 3.9 0
4        SMALL PETALS 1 1.311538 1.407692 1.462000 1.507143 1.9 0
Warning message:
ORE object has no unique key - using random order
Oracle R Advanced Analytics for Hadoop can access any Hive table containing columns with string and numeric data types such as tinyint, smallint, bigint, int, float, and double.
There is no support for these complex data types:
array, binary, map, struct, timestamp, union
If you attempt to access a Hive table containing an unsupported data type, then you get an error message. To access the table, you must convert the column to a supported data type.
To convert a column to a supported data type:
Open the Hive command interface:
$ hive
hive>
Identify the column with an unsupported data type:
hive> describe table_name;
View the data in the column:
hive> select column_name from table_name;
Create a table for the converted data, using only supported data types.
Copy the data into the new table, using an appropriate conversion tool.
Example 8-2 shows the conversion of an array. Example 8-3 and Example 8-4 show the conversion of timestamp data.
Example 8-2 Converting an Array to String Columns
R> ore.sync(table="t1") Warning message: table t1 contains unsupported data types . . . hive> describe t1; OK col1 int col2 array<string> hive> select * from t1; OK 1 ["a","b","c"] 2 ["d","e","f"] 3 ["g","h","i"] hive> create table t2 (c1 string, c2 string, c2 string); hive> insert into table t2 select col2[0], col2[1], col2[2] from t1; . . . R> ore.sync(table="t2") R> ore.ls() [1] "t2" R> t2$c1 [1] "a" "d" "g"
Example 8-3 uses automatic conversion of the timestamp data type into string. The data is stored in a table named t5 with a column named tstmp.
Example 8-3 Converting a Timestamp Column
hive> select * from t5;

hive> create table t6 (timestmp string);

hive> insert into table t6 SELECT tstmp from t5;
Example 8-4 uses the Hive get_json_object function to extract the two columns of interest from the JSON table into a separate table for use by Oracle R Advanced Analytics for Hadoop.
Example 8-4 Converting a Timestamp Column in a JSON File
hive> select * from t3;
OK
{"custId":1305981,"movieId":null,"genreId":null,"time":"2010-12-30:23:59:32","recommended":null,"activity":9}

hive> create table t4 (custid int, time string);

hive> insert into table t4 SELECT cast(get_json_object(c1, '$.custId') as int), cast(get_json_object(c1, '$.time') as string) from t3;
The Hive command language interface (CLI) is used for executing queries and provides support for Linux clients. There is no JDBC or ODBC support.
The ore.create function creates Hive tables only as text files. However, Oracle R Advanced Analytics for Hadoop can access Hive tables stored as either text files or sequence files.
You can use the ore.exec function to execute Hive commands from the R console. For a demo, run the hive_sequencefile demo.
Oracle R Advanced Analytics for Hadoop can access tables and views in the default Hive database only. To allow read access to objects in other databases, you must expose them in the default database. For example, you can create views.
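The following hedged sketch illustrates both points, assuming that ore.exec accepts a HiveQL statement as a character string. The table, view, and database names are hypothetical.

# Store a copy of a text-file table as a sequence file
ore.exec("CREATE TABLE datatab_seq STORED AS SEQUENCEFILE AS SELECT * FROM datatab")

# Expose a table from another Hive database as a view in the default database
ore.exec("CREATE VIEW sales_v AS SELECT * FROM salesdb.sales")
ore.sync(table = "sales_v")   # make the view visible to the R session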
Oracle R Advanced Analytics for Hadoop does not have a concept of ordering in Hive. An R frame persisted in Hive might not have the same ordering after it is pulled out of Hive and into memory. Oracle R Advanced Analytics for Hadoop is designed primarily to support data cleanup and filtering of huge HDFS data sets, where ordering is not critical. You might see warning messages when working with unordered Hive frames:
Warning messages:
1: ORE object has no unique key - using random order
2: ORE object has no unique key - using random order
To suppress these warnings, set the ore.warn.order option in your R session:
R> options(ore.warn.order = FALSE)
Example 8-5 shows the loading of a Hive table into an R data frame for analysis. It uses these Oracle R Advanced Analytics for Hadoop functions:

hdfs.attach
ore.attach
ore.connect
ore.create
ore.hiveOptions
ore.sync
Example 8-5 Loading a Hive Table
# Connect to HIVE metastore and sync the HIVE input table into the R session.
ore.connect(type="HIVE")
ore.sync(table="datatab")
ore.attach()

# The "datatab" object is a Hive table with columns named custid, movieid, activity, and rating.
# Perform filtering to remove missing (NA) values from custid and movieid columns
# Project out three columns: custid, movieid and rating
t1 <- datatab[!is.na(datatab$custid) &
              !is.na(datatab$movieid) &
              datatab$activity==1,
              c("custid","movieid", "rating")]

# Set HIVE field delimiters to ','. By default, it is Ctrl+a for text files but
# ORCH 2.0 supports only ',' as a file separator.
ore.hiveOptions(delim=',')

# Create another Hive table called "datatab1" after the transformations above.
ore.create (t1, table="datatab1")

# Use the HDFS directory, where the table data for datatab1 is stored, to attach
# it to ORCH framework. By default, this location is "/user/hive/warehouse"
dfs.id <- hdfs.attach("/user/hive/warehouse/datatab1")

# dfs.id can now be used with all hdfs.*, orch.* and hadoop.* APIs of ORCH
# for further processing and analytics.
Oracle R Advanced Analytics for Hadoop provides a basic level of database access. You can move the contents of a database table to HDFS, and move the results of HDFS analytics back to the database.
You can then perform additional analysis on this smaller set of data using a separate product named Oracle R Enterprise. It enables you to perform statistical analysis on database tables, views, and other data objects using the R language. You have transparent access to database objects, including support for Business Intelligence and in-database analytics.
Access to the data stored in an Oracle database is always restricted to the access rights granted by your DBA.
Oracle R Enterprise is included in the Oracle Advanced Analytics option to Oracle Database Enterprise Edition. It is not one of the Oracle Big Data Connectors.
See Also:
Oracle R Enterprise User's Guide

Oracle R Advanced Analytics for Hadoop uses Sqoop to move data between HDFS and Oracle Database. Sqoop imposes several limitations on Oracle R Advanced Analytics for Hadoop:
You cannot import Oracle tables with BINARY_FLOAT or BINARY_DOUBLE columns. As a work-around, you can create a view that casts these columns to NUMBER.
All column names must be in upper case.
The following scenario may help you identify opportunities for using Oracle R Advanced Analytics for Hadoop with Oracle R Enterprise.
Using Oracle R Advanced Analytics for Hadoop, you can look for files that you have access to on HDFS and execute R calculations on data in one such file. You can also upload data stored in text files on your local file system into HDFS for calculations, schedule an R script for execution on the Hadoop cluster using DBMS_SCHEDULER, and download the results into a local file.
Using Oracle R Enterprise, you can open the R interface and connect to Oracle Database to work on the tables and views that are visible based on your database privileges. You can filter out rows, add derived columns, project new columns, and perform visual and statistical analysis.
Again using Oracle R Advanced Analytics for Hadoop, you might deploy a MapReduce job on Hadoop for CPU-intensive calculations written in R. The calculation can use data stored in HDFS or, with Oracle R Enterprise, in an Oracle database. You can return the output of the calculation to an Oracle database and to the R console for visualization or additional processing.
The Oracle R Advanced Analytics for Hadoop functions are described in R Help topics. This section groups them into functional categories and provides brief descriptions.
Table 8-1 describes the analytic functions.
Table 8-1 Functions for Statistical Analysis
Function | Description
---|---
orch.evaluate | Evaluates a fit generated by orch.lmf.
orch.export.fit | Exports a model (W and H factor matrices) to the specified destination for orch.lmf.jellyfish or orch.nmf.
orch.getXlevels | Creates a list of factor levels that can be used in the xlev argument of a model.matrix call.
orch.glm | Fits and uses generalized linear models on data stored in HDFS.
orch.kmeans | Performs k-means clustering on a data matrix that is stored as a file in HDFS.
orch.lm | Fits a linear model using tall-and-skinny QR (TSQR) factorization and parallel distribution. The function computes the same statistical parameters as the Oracle R Enterprise ore.lm function.
orch.lmf | Fits a low rank matrix factorization model using either the jellyfish algorithm or the Mahout alternating least squares with weighted regularization (ALS-WR) algorithm.
orch.neural | Provides a neural network to model complex, nonlinear relationships between inputs and outputs, or to find patterns in the data.
orch.nmf | Provides the main entry point to create a nonnegative matrix factorization model using the jellyfish algorithm. This function can work on much larger data sets than the R NMF package, because the input does not need to fit into memory.
orch.nmf.NMFalgo | Plugs in to the R NMF package framework as a custom algorithm.
orch.recommend | Computes the top n items to be recommended for each user that has predicted ratings based on the input model.
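As a brief illustration of how these functions are typically invoked, the following hedged sketch fits a distributed linear model on data copied into HDFS. The formula and data are arbitrary, and the argument layout for orch.lm is an assumption; consult the R Help topic for the exact signature.

# Copy an in-memory data frame to HDFS, then fit a linear model with TSQR factorization
dfs.dat <- hdfs.put(mtcars)
fit <- orch.lm(mpg ~ cyl + wt + hp, dfs.dat)
summary(fit)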
Table 8-2 describes the functions that execute HDFS commands from within the R environment.
Table 8-2 Functions for Using HDFS
Table 8-3 describes the functions available in Oracle R Advanced Analytics for Hadoop for use with Hive. See "ORE Functions for Hive".
Table 8-3 Functions for Using Hive
Function | Description
---|---
hdfs.fromHive | Converts a Hive table to a DFS identifier in ORCH.
hdfs.toHive | Converts an HDFS object identifier to a Hive table represented by an ore.frame object.
ore.create | Creates a database table from a data.frame or ore.frame object.
ore.drop | Drops a database table or view.
ore.get | Retrieves the specified ore.frame object.
ore.pull | Copies data from a Hive table to an R object.
ore.push | Copies data from an R object to a Hive table.
ore.recode | Replaces the values in an ore.vector object.
Table 8-4 describes the aggregate functions from the OREstats package that Oracle R Advanced Analytics for Hadoop supports for use with Hive data.
Table 8-4 Oracle R Enterprise Aggregate Functions
Function | Description
---|---
aggregate | Splits the data into subsets and computes summary statistics for each subset.
fivenum | Returns Tukey's five-number summary (minimum, lower hinge, median, upper hinge, and maximum) for the input data.
IQR | Calculates an interquartile range.
median | Calculates a sample median.
quantile | Generates sample quantiles that correspond to the specified probabilities.
sd | Calculates the standard deviation.
var | Calculates the variance.
Footnote 1 For vectors only
Table 8-5 describes the functions for establishing a connection to Oracle Database.
Table 8-5 Functions for Using Oracle Database
Function | Description
---|---
orch.connect | Establishes a connection to Oracle Database.
orch.connected | Checks whether Oracle R Advanced Analytics for Hadoop is connected to Oracle Database.
orch.dbcon | Returns a connection object for the current connection to Oracle Database, excluding the authentication credentials.
orch.dbinfo | Displays information about the current connection.
orch.disconnect | Disconnects the local R session from Oracle Database.
orch.reconnect | Reconnects to Oracle Database with the credentials previously returned by orch.disconnect.
Table 8-6 describes the functions for copying data between platforms, including R data frames, HDFS files, local files, and tables in an Oracle database.
Table 8-6 Functions for Copying Data
Function | Description
---|---
hdfs.attach | Copies data from an unstructured data file in HDFS into the R framework. By default, data files in HDFS are not visible to the connector. However, if you know the name of the data file, you can use this function to attach it to the Oracle R Advanced Analytics for Hadoop name space.
hdfs.download | Copies a file from HDFS to the local file system.
hdfs.get | Copies data from HDFS into a data frame in the local R environment. All metadata is extracted and all attributes, such as column names and data types, are restored if the data originated in an R environment. Otherwise, generic attributes like val1 and val2 are assigned.
hdfs.pull | Copies data from HDFS into an Oracle database. This operation requires authentication by Oracle Database. See orch.connect.
hdfs.push | Copies data from an Oracle database to HDFS. This operation requires authentication by Oracle Database. See orch.connect.
hdfs.put | Copies data from an R in-memory object (data.frame) to HDFS. All data attributes, like column names and data types, are stored as metadata with the data.
hdfs.sample | Copies a random sample of data from a Hadoop file into an R in-memory object. Use this function to copy a small sample of the original HDFS data for developing the R calculation that you ultimately want to execute on the entire HDFS data set on the Hadoop cluster.
hdfs.upload | Copies a file from the local file system into HDFS.
is.hdfs.id | Indicates whether an R object contains a valid HDFS file identifier.
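The copy functions that touch Oracle Database require a connection established with orch.connect. A hedged sketch of the round trip follows; the connection arguments, table name, and object names are placeholders rather than a definitive signature, so check the orch.connect Help topic before using them.

# Connect to Oracle Database (placeholder credentials and argument names)
orch.connect(host = "dbhost", sid = "orcl", user = "rquser", passwd = "password")

# Copy an Oracle table to HDFS, then copy HDFS data back into the database
dfs.id <- hdfs.push(SALES)   # SALES is a hypothetical database table proxy
hdfs.pull(dfs.id)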
Table 8-7 describes functions for converting and testing data types. The Oracle R Enterprise OREbase package provides these functions.
Table 8-7 Functions for Converting and Testing Data Types
Function | Description
---|---
as.ore | Coerces an in-memory R object to an ORE object.
as.ore.character | Coerces an in-memory R object to an ORE character object.
as.ore.date | Coerces an in-memory R object to an ORE date object.
as.ore.datetime | Coerces an in-memory R object to an ORE datetime object.
as.ore.difftime | Coerces an in-memory R object to an ORE difftime object.
as.ore.factor | Coerces an in-memory R object to an ORE factor object.
as.ore.frame | Coerces an in-memory R object to an ORE frame object.
as.ore.integer | Coerces an in-memory R object to an ORE integer object.
as.ore.list | Coerces an in-memory R object to an ORE list object.
as.ore.logical | Coerces an in-memory R object to an ORE logical object.
as.ore.matrix | Coerces an in-memory R object to an ORE matrix object.
as.ore.numeric | Coerces an in-memory R object to an ORE numeric object.
as.ore.object | Coerces an in-memory R object to an ORE object.
as.ore.vector | Coerces an in-memory R object to an ORE vector object.
is.ore | Tests whether the specified value is an object of a particular Oracle R Enterprise class.
is.ore.character | Tests whether the specified value is a character.
is.ore.date | Tests whether the specified value is a date.
is.ore.datetime | Tests whether the specified value is a datetime type.
is.ore.difftime | Tests whether the specified value is a difftime type.
is.ore.factor | Tests whether the specified value is a factor.
is.ore.frame | Tests whether the specified value is a frame.
is.ore.integer | Tests whether the specified value is an integer.
is.ore.list | Tests whether the specified value is a list.
is.ore.logical | Tests whether the specified value is a logical type.
is.ore.matrix | Tests whether the specified value is a matrix.
is.ore.numeric | Tests whether the specified value is numeric.
is.ore.object | Tests whether the specified value is an object.
is.ore.vector | Tests whether the specified value is a vector.
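A minimal sketch of the coercion and test helpers follows, assuming an active Hive connection created with ore.connect(type="HIVE"); the values and expected results are illustrative.

v <- as.ore.character(1:3)   # coerce an in-memory integer vector to an ORE character object
is.ore.vector(v)             # expected TRUE
is.ore.numeric(v)            # expected FALSE, because the values were coerced to character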
Table 8-8 describes functions that you use when creating and running MapReduce programs.
Table 8-8 Functions for Using MapReduce
Function | Description
---|---
hadoop.exec | Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. You must load the data into HDFS first.
hadoop.jobs | Lists the running jobs, so that you can evaluate the current load on the Hadoop cluster.
hadoop.run | Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. If the data is not already stored in HDFS, then hadoop.run first copies the data to HDFS.
orch.dryrun | Switches the execution platform between the local host and the Hadoop cluster. No changes in the R code are required for a dry run.
orch.export | Makes R objects from a user's local R session available in the Hadoop execution environment, so that they can be referenced in MapReduce jobs.
orch.keyval | Outputs key-value pairs in a MapReduce job.
orch.keyvals | Outputs a set of key-value pairs in a MapReduce job.
orch.pack | Compresses one or more in-memory R objects that the mappers or reducers must write as the values in key-value pairs.
orch.tempPath | Sets the path where temporary data is stored.
orch.unpack | Restores the R objects that were compressed with a previous call to orch.pack.
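To show how these pieces fit together, here is a hedged sketch of a word-count style job. The data, column names, and exact mapper and reducer argument conventions are assumptions rather than a definitive example.

# Hypothetical input data copied into HDFS
words <- data.frame(word = c("big", "data", "big"))
dfs.words <- hdfs.put(words)

# Emit (word, 1) pairs in the mapper; sum the counts per word in the reducer
res <- hadoop.run(dfs.words,
         mapper  = function(key, val) { orch.keyval(val$word, 1) },
         reducer = function(key, vals) { orch.keyval(key, sum(unlist(vals))) })

hdfs.get(res)   # return the word counts to local R memory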
Table 8-9 lists the functions available to help you debug your R program scripts.
Oracle R Advanced Analytics for Hadoop provides an extensive set of demos, which you can access in the same way as any other R demos.
The demo function lists the demos available in ORCH:
R> demo(package="ORCH")
Demos in package 'ORCH':
hdfs_cpmv ORCH's copy and move APIs
hdfs_datatrans ORCH's HDFS data transfer APIs
hdfs_dir ORCH's HDFS directory manipulation APIs
hdfs_putget ORCH's get and put API usage
hive_aggregate Aggregation in HIVE
hive_analysis Basic analysis & data processing operations
hive_basic Basic connectivity to HIVE storage
hive_binning Binning logic
hive_columnfns Column function
hive_nulls Handling of NULL in SQL vs. NA in R
.
.
.
To run a demo from this list, use this syntax:
demo("demo_name", package="ORCH")
For example, this command runs the Hive binning demo:
R> demo("hive_binning", package = "ORCH")
demo('hive_binning', package = 'ORCH')
demo(hive_binning)
---- ~~~~~~~~~~~~
> #
> # ORACLE R CONNECTOR FOR HADOOP DEMOS
> #
> # Name: hive_binning.R
> # Description: Demonstrates binning logic in R
> #
> #
.
.
.
If an error occurs, then exit from R without saving the workspace image and start a new session. You should also delete the temporary files created in both the local file system and the HDFS file system:
# rm -r /tmp/orch*
# hdfs dfs -rm -r /tmp/orch*
Oracle R Advanced Analytics for Hadoop can invoke the Sqoop utility to connect to Oracle Database either to extract data or to store results.
Sqoop is a command-line utility for Hadoop that imports and exports data between HDFS or Hive and structured databases. The name Sqoop comes from "SQL to Hadoop." The following explains how Oracle R Advanced Analytics for Hadoop stores a database user password and sends it to Sqoop.
Oracle R Advanced Analytics for Hadoop stores a user password only when the user establishes the database connection in a mode that does not require reentering the password each time. The password is stored encrypted in memory. See the Help topic for orch.connect.
Oracle R Advanced Analytics for Hadoop generates a configuration file for Sqoop and uses it to invoke Sqoop locally. The file contains the user's database password, obtained either by prompting the user or from the encrypted in-memory representation. The file has local user access permissions only. The file is created, the permissions are set explicitly, and then the file is opened for writing and filled with data.
Sqoop uses the configuration file to generate custom JAR files dynamically for the specific database job and passes the JAR files to the Hadoop client software. The password is stored inside the compiled JAR file; it is not stored in plain text.
The JAR file is transferred to the Hadoop cluster over a network connection. The network connection and the transfer protocol are specific to Hadoop, such as port 5900.
The configuration file is deleted after Sqoop finishes compiling its JAR files and starts its own Hadoop jobs.