This chapter describes R support for big data.
Oracle R Connector for Hadoop is an R package that provides an interface between the local R environment and Hadoop. You install and load this package the same as you would for any other R package. Using simple R functions, you can copy data between R memory, the local file system, and HDFS. You can schedule R programs to execute as Hadoop MapReduce jobs and return the results to any of those locations.
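For example, the following minimal sketch moves a small data set among the three locations. The file name sales.dat and the target path are illustrative, and a running Hadoop cluster is assumed.

# Load the package, as for any other R package.
library(ORHC)

# Copy a local text file (illustrative name) from the local file system into HDFS.
dfs.sales <- hdfs.upload('sales.dat', dfs.name='sales_File')

# Copy the HDFS data into an in-memory R data frame for local analysis.
sales.df <- hdfs.get(dfs.sales)

# Copy the HDFS data back to the local file system as a single file.
hdfs.download(dfs.sales, '/tmp/sales_copy.dat', overwrite=TRUE)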
Oracle R Connector for Hadoop provides API access from a local R client to Hadoop, using these APIs:
hadoop
: Provides an interface to Hadoop MapReduce.
hdfs
: Provides an interface to HDFS.
orhc
: Provides an interface between the local R instance and Oracle Database.
All of these functions are included in the ORHC library. The functions are listed in this chapter in alphabetical order.
A separate R package, Oracle R Enterprise, provides access to Oracle Database. Access to the data stored in Oracle Database is always restricted to the access rights granted by your Oracle DBA.
Oracle R Enterprise provides direct access to Oracle Database objects and enables you to perform statistical analysis on database tables, views, and other data objects. Users can develop R scripts for deployment while retaining the results in the secure environment of Oracle Database.
Oracle R Enterprise is included in the Oracle Database Advanced Analytics option; it is not included in Oracle Big Data Connectors.
See Also:
Oracle R Enterprise User's Guide

The following scenario may help you identify opportunities for using Oracle R Connector for Hadoop with Oracle R Enterprise.
Using the Oracle R Connector for Hadoop, you might look for files that you have access to on HDFS and schedule R calculations to execute on data in one such file. Furthermore, you can upload data stored in text files on your local file system into HDFS for calculations, schedule an R script for execution on the Hadoop cluster, and download the results into a local file.
Using the Oracle Database Advanced Analytics option, you can open the R interface and connect to Oracle Database to work on the tables and views that are visible based on your database privileges. You can filter out rows, add derived columns, project new columns, and perform visual and statistical analysis using Oracle R Enterprise.
Again using the Oracle R Connector for Hadoop, you might deploy a MapReduce job on Hadoop for CPU-intensive calculations written in R. The calculation can use data stored in HDFS or, with the Oracle Database Advanced Analytics option, in Oracle Database. You can return the output of the calculation to Oracle Database and to the R console for visualization or additional processing.
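The following sketch outlines that workflow. It reuses the ONTIME_S table and the mapper and reducer pattern shown later in this chapter; the connection details and the ONTIME_DELAYS table name are illustrative.

library(ORHC)
library(ORE)    # Oracle R Enterprise, required for ore.frame objects

# Connect to Oracle Database (host, user, and SID are illustrative).
orhc.connect("localhost", "RQUSER", "orcl")

# Filter rows in the database, then push the result into HDFS.
ontime2000 <- ONTIME_S[ONTIME_S$YEAR == 2000,]
dfs <- hdfs.push(ontime2000, key='DEST', dfs.name='ontime2000_DB')

# Run a CPU-intensive R calculation on the Hadoop cluster:
# average arrival delay for each destination.
res <- hadoop.run(dfs,
    mapper = function(key, ontime) {
        keyval(key, ontime)
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
            if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        keyval(key, sumAD / count)
    })

# Return the output to Oracle Database for visualization or additional processing.
delays <- hdfs.pull(res, db.name='ONTIME_DELAYS', overwrite=TRUE)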
Oracle R Connector for Hadoop invokes the Sqoop utility to connect to Oracle Database either to extract data or to store results. Sqoop is a command-line utility for Hadoop that imports and exports data between HDFS or Hive and structured databases, such as Oracle Database. The name Sqoop comes from "SQL to Hadoop."
The following explains how Oracle R Connector for Hadoop stores a database user password and sends it to Sqoop.
Oracle R Connector for Hadoop stores a user password only when the user establishes the database connection in a mode that does not require reentering the password each time. The password is stored encrypted in memory. See orhc.connect.
Oracle R Connector for Hadoop generates a configuration file for Sqoop and uses it to invoke Sqoop locally. The file contains the user's database password, obtained either by prompting the user or from the encrypted in-memory representation. The file has local user access permissions only. The file is created, the permissions are set explicitly, and then the file is opened for writing and filled with data.
Sqoop uses the configuration file to generate custom JAR files dynamically for the specific database job and passes the JAR files to the Hadoop client software. The password is stored inside the compiled JAR file; it is not stored in plain text.
The JAR file is transferred to the Hadoop cluster over a network connection. The network connection and the transfer protocol are specific to Hadoop, such as port 5900.
The configuration file is deleted after Sqoop finishes compiling its JAR files and starts its own Hadoop jobs.
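A minimal sketch of establishing such a connection follows; the host, user, and SID are illustrative. With secure=FALSE, you are prompted for the password once, and subsequent Sqoop-based transfers reuse the encrypted in-memory copy instead of prompting again.

library(ORHC)

# Prompted for the password once; it is then kept encrypted in memory
# and reused whenever a Sqoop job needs database access.
orhc.connect(host="localhost", user="RQUSER", sid="orcl", secure=FALSE)

# Later transfers such as hdfs.pull and hdfs.push run without another prompt.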
Oracle R Connector for Hadoop provides these functions: hadoop.exec, hadoop.run, hdfs.attach, hdfs.cd, hdfs.download, hdfs.exists, hdfs.get, hdfs.ls, hdfs.mkdir, hdfs.parts, hdfs.pull, hdfs.push, hdfs.put, hdfs.pwd, hdfs.rm, hdfs.rmdir, hdfs.sample, hdfs.size, hdfs.upload, orhc.connect, orhc.disconnect, orhc.reconnect, and orhc.which.
The ORHC functions are grouped into these categories:

Database connectivity: orhc.connect, orhc.disconnect, orhc.reconnect, orhc.which

Copying data between R, HDFS, and Oracle Database: hdfs.upload, hdfs.download, hdfs.get, hdfs.push, hdfs.put, hdfs.pull

HDFS file management: hdfs.attach, hdfs.cd, hdfs.exists, hdfs.ls, hdfs.mkdir, hdfs.parts, hdfs.pwd, hdfs.rm, hdfs.rmdir, hdfs.sample, hdfs.size

MapReduce execution: hadoop.exec, hadoop.run
hadoop.exec

Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. You must load the data into HDFS first.

hadoop.exec(dfs.id, mapper, reducer, combiner, export)

dfs.id
: Object identifier in HDFS.

mapper
: Name of a mapper function written in the R language.

reducer
: Name of a reducer function written in the R language (optional).

combiner
: Name of a combiner function written in the R language (optional).

export
: Names of exported R objects from your current R environment that are referenced by any of your mapper, reducer, or combiner functions (optional).

This function provides more control of the data flow than hadoop.run. You must use hadoop.exec when chaining several mappers and reducers in a pipeline, because the data does not leave HDFS. The results are stored in HDFS.

Return value: Data object identifier in HDFS.
This sample script uses hdfs.attach to obtain the object identifier of a small sample data file in HDFS named ontime_R.
dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.exec(
    dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)
        }
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
            if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        res <- sumAD / count
        keyval(key, res)
    }
)
After the script runs, the location of the results is identified by the res variable, in an HDFS file named /user/oracle/xq/orhc3d0b8218:
R> res
[1] "/user/oracle/xq/orhc3d0b8218"
attr(,"dfs.id")
[1] TRUE
R> print(hdfs.get(res))
  key     val1
1 SFO 27.05804
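Because hadoop.exec keeps its results in HDFS, its output identifier can be passed directly to another job. The following sketch chains two jobs over the same ontime_R data; the structure of the intermediate records is assumed to match what the first job emits.

dfs <- hdfs.attach('ontime_R')

# First job: keep only San Francisco flights. The output stays in HDFS.
stage1 <- hadoop.exec(dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)
        }
    })

# Second job: average the arrival delay of the rows kept by the first job.
# The stage1 identifier is handed straight back to Hadoop, so the data
# never leaves HDFS between the two jobs.
stage2 <- hadoop.exec(stage1,
    mapper = function(key, ontime) {
        keyval(key, ontime)
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
            if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        keyval(key, sumAD / count)
    })

# Pull the small final result into local R memory.
print(hdfs.get(stage2))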
hadoop.run

Starts the Hadoop engine and sends the mapper, reducer, and combiner R functions for execution. If the data is not already stored in HDFS, then hadoop.run first copies the data there.

hadoop.run(data, mapper, reducer, combiner, export)

data
: Data frame, Oracle R Enterprise frame (ore.frame), or an HDFS file descriptor.

mapper
: Name of a mapper function written in the R language.

reducer
: Name of a reducer function written in the R language (optional).

combiner
: Name of a combiner function written in the R language (optional).

export
: Names of exported R objects.

The hadoop.run function returns the results from HDFS to the source of the input data. For example, the results for HDFS input data are kept in HDFS, and the results for ore.frame input data are pulled into Oracle Database.

Return value: An object in the same format as the input data.
This sample script uses hdfs.attach to obtain the object identifier of a small sample data file in HDFS named ontime_R.
dfs <- hdfs.attach('ontime_R')
res <- NULL
res <- hadoop.run(
    dfs,
    mapper = function(key, ontime) {
        if (key == 'SFO') {
            keyval(key, ontime)
        }
    },
    reducer = function(key, vals) {
        sumAD <- 0
        count <- 0
        for (x in vals) {
            if (!is.na(x$ARRDELAY)) {sumAD <- sumAD + x$ARRDELAY; count <- count + 1}
        }
        res <- sumAD / count
        keyval(key, res)
    }
)
After the script runs, the location of the results is identified by the res variable, in an HDFS file named /user/oracle/xq/orhc3d0b8218:
R> res
[1] "/user/oracle/xq/orhc3d0b8218"
attr(,"dfs.id")
[1] TRUE
R> print(hdfs.get(res))
  key     val1
1 SFO 27.05804
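The export parameter makes objects from the local R environment available to the mapper and reducer running on the cluster. The following sketch assumes the ontime_R file used above and passes export a character vector of object names; the exact form expected by export is an assumption here.

dfs <- hdfs.attach('ontime_R')

# A value defined in the local R environment...
delay.threshold <- 30

# ...is named in export so the reducer on the cluster can reference it.
res <- hadoop.run(dfs,
    mapper = function(key, ontime) {
        keyval(key, ontime)
    },
    reducer = function(key, vals) {
        late <- 0
        for (x in vals) {
            if (!is.na(x$ARRDELAY) && x$ARRDELAY > delay.threshold) {late <- late + 1}
        }
        keyval(key, late)    # count of arrivals later than the threshold
    },
    export = c('delay.threshold'))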
hdfs.attach

Pulls data from an unstructured data file in HDFS into the Oracle R Connector for Hadoop framework. By default, data files in HDFS are not visible to the R Connector. However, if you know the name of the data file, you can use this function to attach it to the R Connector name space.

If the data does not have metadata identifying the names and data types of the columns, then the function samples the data to deduce the data type (number or string). It then re-creates the file with the appropriate metadata.

hdfs.attach(dfs.name)

dfs.name
: The name of a file in HDFS.
Use this function to attach an HDFS file to your R environment, the same as you might attach a data frame.
Return value: The object ID of the file in HDFS, or NULL if the operation failed.
This example stores the object ID of ontime_R in a variable named dfs, then displays its value.
R> dfs <- hdfs.attach('ontime_R')
R> dfs
[1] "/user/oracle/xq/ontime_R"
attr(,"dfs.id")
[1] TRUE
hdfs.cd

Sets the default HDFS path.

hdfs.cd(dfs.path)

dfs.path
: A path that is either absolute or relative to the current path.
Return value: TRUE if the path is changed successfully, or FALSE if the operation failed.
This example changes the current directory from /user/oracle to /user/oracle/sample:
R> hdfs.cd("sample")
[1] "/user/oracle/sample"
hdfs.download

Copies a file from HDFS to the local file system.

hdfs.download(dfs.id, filename, overwrite)

dfs.id
: The object ID of the file in HDFS.

filename
: The name of a file in the local file system where the data is copied.

overwrite
: Controls whether the operation can overwrite an existing local file. Set to TRUE to overwrite filename, or FALSE to signal an error (default).
This function provides the fastest and easiest way to copy a file from HDFS. No data transformations occur except merging multiple parts into a single file. The local file has the exact same data as the HDFS file.
Return value: Local file name, or NULL if the copy failed.
This example displays a list of files in the current HDFS directory and copies ontime2000_DB to the local file system as /home/oracle/ontime2000.dat.
R> hdfs.ls()
[1] "ontime2000_DB" "ontime_DB" "ontime_File" "ontime_R" "testdata.dat"
R> tmpfile <- hdfs.download("ontime2000_DB", "/home/oracle/ontime2000.dat", overwrite=F)
R> tmpfile
[1] "/home/oracle/ontime2000.dat"
hdfs.exists

Verifies that an object exists in HDFS.

hdfs.exists(dfs.id)

dfs.id
: An object ID or file name in HDFS.
If this function returns TRUE, then you can attach the data and use it in a hadoop.run function. You can also use this function to validate an HDFS identifier and ensure that the data exists.
Return value: TRUE if the identifier is valid and the data exists, or FALSE if the object is not found.
This example shows that the ontime_R file exists.
R> hdfs.exists("ontime_R")
[1] TRUE
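A short sketch of the validation pattern described in the usage notes: confirm that the identifier is valid before attaching the data and passing it to hadoop.run.

dfs.name <- 'ontime_R'

if (hdfs.exists(dfs.name)) {
    # The data is present, so it is safe to attach and process it.
    dfs <- hdfs.attach(dfs.name)
    res <- hadoop.run(dfs,
        mapper = function(key, ontime) keyval(key, ontime),
        reducer = function(key, vals) keyval(key, length(vals)))
} else {
    warning(paste(dfs.name, "was not found in HDFS"))
}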
hdfs.get

Copies data from HDFS into a data frame in the local R environment. All metadata is extracted and all attributes, such as column names and data types, are restored if the data originated in an R environment. Otherwise, generic attributes like val1 and val2 are assigned.

hdfs.get(dfs.id, sep)

dfs.id
: The object ID of the file in HDFS.

sep
: The symbol used to separate fields in the file. A comma (,) is the default separator.
If the HDFS file is small enough to fit into an in-memory R data frame, then you can copy the file using this function instead of hdfs.pull. The hdfs.get function can be faster, because it does not use Sqoop and thus does not have the overhead incurred by hdfs.pull.
Return value: A data.frame object in memory in the local R environment pointing to the exported data set, or NULL if the operation failed.
This example returns the contents of a data frame named res.
R> print(hdfs.get(res))
key val1
1 AA 1361.4643
2 AS 515.8000
3 CO 2507.2857
4 DL 1601.6154
5 HP 549.4286
6 NW 2009.7273
7 TW 1906.0000
8 UA 1134.0821
9 US 2387.5000
10 WN 541.1538
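The usage notes above describe choosing between hdfs.get and hdfs.pull based on the size of the HDFS data. A sketch of that decision follows; the 100 MB cutoff is arbitrary, the ONTIME_FROM_HDFS table name is illustrative, and hdfs.pull assumes an existing database connection.

dfs.id <- 'ontime_R'
cutoff <- 100 * 1024 * 1024    # arbitrary 100 MB threshold for this sketch

if (hdfs.size(dfs.id) <= cutoff) {
    # Small file: copy it straight into an in-memory data frame.
    ontime.df <- hdfs.get(dfs.id)
} else {
    # Large file: let Sqoop load it into an Oracle Database table instead.
    ontime.tab <- hdfs.pull(dfs.id, db.name='ONTIME_FROM_HDFS', overwrite=TRUE)
}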
hdfs.ls

Lists the names of all HDFS directories containing data in the specified path.

hdfs.ls(dfs.path)

dfs.path
: A path relative to the current default path. The default path is the current working directory. Use hdfs.cd to set the default path.
Return value: A list of data object names in HDFS, or NULL if the specified path is invalid.
This example lists the subdirectories in the current directory:
R> hdfs.ls()
[1] "ontime_DB" "ontime_FILE" "ontime_R"
The next example lists directories in the parent directory:
R> hdfs.ls("..")
[1] "demo" "input" "olhcache" "output" "sample" "xq"
This example returns NULL because the specified path is not in HDFS.
R> hdfs.ls("/bin")
NULL
hdfs.mkdir

Creates a subdirectory in HDFS relative to the current working directory.

hdfs.mkdir(dfs.name, cd)

dfs.name
: Name of the new directory.

cd
: TRUE to change the current working directory to the new subdirectory, or FALSE to keep the current working directory (default).

Return value: Full path of the new directory as a string, or NULL if the directory was not created.
This example creates the /user/oracle/sample directory.
R> hdfs.mkdir('sample', cd=T)
[1] "/user/oracle/sample"
attr(,"dfs.path")
[1] TRUE
hdfs.parts

Returns the number of parts composing an object in HDFS.

hdfs.parts(dfs.id)

dfs.id
: Object identifier in HDFS.
HDFS splits large files into parts, which provide a basis for the parallelization of MapReduce jobs. The more parts an HDFS file has, the more mappers can run in parallel.
Return value: Number of parts composing the object, or 0 if the object does not exist in HDFS.
This example shows that the ontime_R file has one part:
R> hdfs.parts("ontime_R")
[1] 1
hdfs.pull

Copies data from HDFS into Oracle Database. This operation requires authentication by Oracle Database. See orhc.connect.

hdfs.pull(dfs.id, sep, db.name, overwrite, driver)

dfs.id
: The file name in HDFS.

sep
: The symbol used to separate fields in the file. A comma (,) is the default separator.

db.name
: The name of a table in Oracle Database (optional).

overwrite
: Controls whether db.name can overwrite a table with the same name. Set to TRUE to overwrite the table, or FALSE to signal an error (default).

driver
: Identifies the driver used to copy the data. The default driver is sqoop.
Because this operation is synchronous, copying a large data set may appear to hang the R environment. You regain use of R when copying is complete.
To copy large volumes of data into Oracle Database, consider using Oracle Loader for Hadoop. With the Oracle Database Advanced Analytics option, you can use Oracle R Enterprise to analyze the data in an Oracle database.
Return value: An ore.frame object that points to the database table with data loaded from HDFS, or NULL if the operation failed.
See Also: Oracle R Enterprise User's Guide for a description of ore.frame objects.
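This section has no sample output; the following sketch shows a typical call, assuming an established database connection (the host, user, SID, and the ONTIME2000_HDFS table name are illustrative).

# Connect first; hdfs.pull requires database authentication.
orhc.connect("localhost", "RQUSER", "orcl")

# Copy the HDFS data into an Oracle Database table by way of Sqoop.
ontime.db <- hdfs.pull('ontime2000_DB', db.name='ONTIME2000_HDFS', overwrite=TRUE)

# With Oracle R Enterprise attached, ontime.db is an ore.frame that points
# to the new table and can be analyzed in the database.
class(ontime.db)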
hdfs.push

Copies data from Oracle Database to HDFS. This operation requires authentication by Oracle Database. See orhc.connect.

Note: The Oracle R Enterprise (ORE) library must be attached to use this function.

hdfs.push(x, key, dfs.name, overwrite, driver, split.by)

x
: An ore.frame object with the data in Oracle Database to be pushed.

key
: The index or name of the key column.

dfs.name
: Unique name for the object in HDFS.

overwrite
: TRUE to allow dfs.name to overwrite an object with the same name, or FALSE to signal an error (default).

driver
: Driver for copying the data (optional). The default driver is sqoop.

split.by
: The column to use for data partitioning (optional).
Because this operation is synchronous, copying a large data set may appear to hang the R environment. You regain use of R when copying is complete.
An ore.frame object is an Oracle R Enterprise metadata object that points to a database table. It corresponds to an R data.frame object.
Return value: HDFS object ID pointing to the exported data set, or NULL if the operation failed.
See Also: Oracle R Enterprise User's Guide
This example creates an ore.frame object named ontime_s2000 that contains the rows from the ONTIME_S table in Oracle Database where the year equals 2000. Then hdfs.push uses ontime_s2000 to create /user/oracle/xq/ontime2000_DB in HDFS.
R> ontime_s2000 <- ONTIME_S[ONTIME_S$YEAR == 2000,]
R> class(ontime_s2000)
[1] "ore.frame"
attr(,"package")
[1] "OREbase"
R> ontime2000.dfs <- hdfs.push(ontime_s2000, key='DEST', dfs.name='ontime2000_DB')
R> ontime2000.dfs
[1] "/user/oracle/xq/ontime2000_DB"
attr(,"dfs.id")
[1] TRUE
hdfs.put

Copies data from an ORE data frame to HDFS. Column names, data types, and other attributes are stored as metadata in HDFS.

Note: The Oracle R Enterprise (ORE) library must be attached to use this function.

hdfs.put(data, key, dfs.name, overwrite)

data
: An ore.frame object in the local R environment to be copied to HDFS.

key
: The index or name of the key column.

dfs.name
: A unique name for the new file.

overwrite
: Controls whether dfs.name can overwrite a file with the same name. Set to TRUE to overwrite the file, or FALSE to signal an error.
You can use this function to transfer control parameters or lookup data relevant to a Hadoop R calculation from the R environment into an HDFS file.
You can also use hdfs.put instead of hdfs.push to copy data from ore.frame objects, such as database tables, to HDFS. The table must be small enough to fit in R memory; otherwise, the function fails. The hdfs.put function first reads all table data into R memory and then transfers it to HDFS. For a small table, this function can be faster because it does not use Sqoop and thus does not have the overhead incurred by hdfs.push.
Return value: The object ID of the new file, or NULL if the operation failed.
This example creates a file named /user/oracle/xq/testdata.dat with the contents of the dat data frame.
R> myfile <- hdfs.put(dat, key='DEST', dfs.name='testdata.dat')
R> print(myfile)
[1] "/user/oracle/xq/testdata.dat"
attr(,"dfs.id")
[1] TRUE
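Following the usage note, the next sketch copies a small ore.frame to HDFS with hdfs.put instead of hdfs.push. It reuses the ONTIME_S subset from the hdfs.push example; the entire subset must fit in R memory, and the dfs.name shown is illustrative.

# A small slice of the ONTIME_S database table as an ore.frame.
ontime_s2000 <- ONTIME_S[ONTIME_S$YEAR == 2000,]

# hdfs.put reads the table into R memory and writes it to HDFS without Sqoop.
ontime2000.dfs <- hdfs.put(ontime_s2000, key='DEST', dfs.name='ontime2000_put', overwrite=TRUE)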
hdfs.pwd

Identifies the current working directory in HDFS.

hdfs.pwd()
Return value: The current working directory, or NULL if you are not connected to HDFS.
This example shows that /user/oracle is the current working directory.
R> hdfs.pwd()
[1] "/user/oracle/"
hdfs.rm

Removes a file or directory from HDFS.

hdfs.rm(dfs.id)

dfs.id
: The object ID of a file in HDFS to be removed.
All object identifiers in Hadoop pointing to this data are invalid after this operation.
Return value: TRUE if the data is deleted, or FALSE if the operation failed.
R> hdfs.rm("data1.log")
[1] TRUE
hdfs.rmdir

Deletes a subdirectory in HDFS relative to the current working directory.

hdfs.rmdir(dfs.name)

dfs.name
: Name of the directory in HDFS to delete.
This function deletes all data objects stored in the directory, which invalidates all associated object identifiers in HDFS.
Return value: TRUE if the directory is deleted successfully, or FALSE if the operation fails.
R> hdfs.rmdir("mydata")
[1] TRUE
hdfs.sample

Copies a random sample of data from a Hadoop file into an R in-memory object. Use this function to copy a small sample of the original HDFS data for developing the R calculation that you ultimately want to execute on the entire HDFS data set on the Hadoop cluster.

hdfs.sample(dfs.id, lines, sep)

dfs.id
: HDFS object ID where the data is located.

lines
: Number of lines to return as a sample. The default value is 1000 lines.

sep
: The symbol used to separate fields in the Hadoop file. A comma (,) is the default separator.
If the data originated in an R environment, then all metadata is extracted and all attributes are restored, including column names and data types. Otherwise, generic attribute names, like val1 and val2, are assigned.
Return value: A data.frame object with the sample data set, or NULL if the operation failed.
This example displays the first three lines of the ontime_R file.
R> hdfs.sample("ontime_R", lines=3)
YEAR MONTH MONTH2 DAYOFMONTH DAYOFMONTH2 DAYOFWEEK DEPTIME...
1 2000 12 NA 31 NA 7 1730...
2 2000 12 NA 31 NA 7 1752...
3 2000 12 NA 31 NA 7 1803...
hdfs.size

Returns the size in bytes of an object in HDFS.

hdfs.size(dfs.id)

dfs.id
: Object identifier in HDFS.
Use this interface to determine, for instance, whether you can pull the contents of the entire HDFS file into local R memory or a local file, or if you can only sample the data while creating a prototype of your R calculation.
Return value: Size in bytes of the object, or 0 if the object does not exist in HDFS.
This example returns a file size for ontime_R of 999,839 bytes.
R> hdfs.size("ontime_R")
[1] 999839
hdfs.upload

Copies a file from the local file system into HDFS.

hdfs.upload(filename, dfs.name, overwrite, split.size, header)

filename
: Name of a file in the local file system.

dfs.name
: Name of the new directory in HDFS.

overwrite
: Controls whether dfs.name can overwrite a directory with the same name. Set to TRUE to overwrite the directory, or FALSE to signal an error (default).

split.size
: Maximum number of bytes in each part of the Hadoop file (optional).

header
: Indicates whether the first line of the local file is a header containing column names. Set to TRUE if it has a header, or FALSE if it does not (default).
A header enables you to extract the column names and reference the data fields by name instead of by index in your MapReduce R scripts.
This function provides the fastest and easiest way to copy a file into HDFS. If the file is larger than split.size, then Hadoop splits it into two or more parts. The new Hadoop file gets a unique object ID, and each part is named part-0000x. Hadoop automatically creates metadata for the file.
Return value: HDFS object ID for the loaded data, or NULL if the copy failed.
See Also: hdfs.download, hdfs.get, hdfs.put
This example uploads a file named ontime_s2000.dat into HDFS and shows the location of the file, which is stored in a variable named ontime.dfs_File.
R> ontime.dfs_File <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_File')
R> print(ontime.dfs_File)
[1] "/user/oracle/xq/ontime_File"
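A variation on this example follows, assuming the local file begins with a header line of column names. The split.size value is illustrative and forces the upload into multiple parts, which hdfs.parts then reports.

# Upload with a header row and an illustrative 32 MB part size.
dfs.ontime <- hdfs.upload('ontime_s2000.dat', dfs.name='ontime_Parts',
                          header=TRUE, split.size=32*1024*1024)

# More parts allow more mappers to run in parallel on this file.
hdfs.parts(dfs.ontime)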
orhc.connect

Establishes a connection to Oracle Database.

orhc.connect(host, user, sid, passwd, port, secure, driver, silent)

host
: Host name or IP address of the server where Oracle Database is running.

user
: Database user name.

sid
: System ID (SID) for the Oracle Database instance.

passwd
: Password for the database user.

port
: Port number for the Oracle Database listener. The default value is 1521.

secure
: Authentication setting for Oracle Database. TRUE: You must enter a database password each time you attempt to connect (default). FALSE: You enter a database password once; it is encrypted in memory and used every time a database connection is required.

driver
: Driver used to connect to Oracle Database (optional). Sqoop is the default driver.

silent
: TRUE to suppress the prompts for missing host, user, password, port, and SID values, or FALSE to see them (default).
Use this function when your analysis requires access to data stored in an Oracle database or to return the results to the database.
With an Oracle Database Advanced Analytics license for Oracle R Enterprise and a connection to Oracle Database, you can work directly with the data stored in database tables and pass processed data frames to R calculations on Hadoop.
Return value: TRUE for a successful and validated connection, or FALSE for a failed connection attempt.
See Also: orhc.disconnect
This example loads the ORHC library and connects to the local Oracle database:
R> library(ORHC)
Oracle R Connector for Hadoop
Hadoop is up and running.
R> orhc.connect("localhost", "RQUSER", "orcl")
Connecting ORCH to RDBMS via [sqoop]
    Host: localhost
    Port: 1521
    SID:  orcl
    User: RQUSER
Enter password for [RQUSER]: password
Connected.
[1] TRUE
orhc.disconnect

Disconnects the local R session from Oracle Database.

orhc.disconnect()
No orhc functions work without a connection to Oracle Database.
You can use the return value of this function to reestablish a connection using orhc.reconnect.
Return value: An Oracle Database connection object, or NULL if Oracle Database refuses to disconnect.
See Also: orhc.connect, orhc.reconnect
R> orhc.disconnect()
Disconnected.
orhc.reconnect

Reconnects to Oracle Database with the credentials previously returned by orhc.disconnect.

orhc.reconnect(dbcon)

dbcon
: Credentials previously returned by orhc.disconnect.
Oracle R Connector for Hadoop preserves all user credentials and connection attributes, enabling you to reconnect to a previously disconnected session. Depending on the orhc.connect secure setting for the original connection, you may be prompted for a password. After reconnecting, you can continue data transfer operations between Oracle Database and HDFS.
Reconnecting to a session is faster than opening a new one, because reconnecting does not require extensive connectivity checks.
Return value: TRUE for a successfully reestablished and validated connection, or FALSE for a failed attempt.
See Also: orhc.connect, orhc.disconnect
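A short sketch of the disconnect and reconnect cycle follows: the connection object returned by orhc.disconnect is saved and later handed back to orhc.reconnect.

# Save the credentials and connection attributes when disconnecting.
dbcon <- orhc.disconnect()

# ... work that does not require Oracle Database ...

# Reconnect with the saved object. Depending on the secure setting used
# by orhc.connect, a password prompt may appear here.
orhc.reconnect(dbcon)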
orhc.which

Displays information about the current connection to Oracle Database, excluding the authentication credentials.

orhc.which()

This function has no parameters.
This function is useful when connecting to multiple Oracle databases during your analysis task.
This example describes a connection by RQUSER to the local Oracle database:
R> orhc.which()
Connected to RDBMS via [sqoop]
Host: localhost
Port: 1521
SID: orcl
User: RQUSER